I am trying to identify the trending tags (based on maximum hits) over a time series using the MySQL JSON feature. Below is my table:
CREATE TABLE TAG_COUNTER (
    account  VARCHAR(36) NOT NULL,
    time_id  INT NOT NULL,
    counters JSON,
    PRIMARY KEY (account, time_id)
);
On every web API request I will get multiple different tags per account, and based on the number of tags I will prepare the INSERT ... ON DUPLICATE KEY UPDATE query. The example below shows an insertion with two tags.
INSERT INTO `TAG_COUNTER` (`account`, `time_id`, `counters`)
VALUES ('google', '2018061023', '{"tag1": 1, "tag2": 1}')
ON DUPLICATE KEY UPDATE `counters` = JSON_SET(`counters`,
    '$."tag1"', IFNULL(JSON_EXTRACT(`counters`, '$."tag1"'), 0) + 1,
    '$."tag2"', IFNULL(JSON_EXTRACT(`counters`, '$."tag2"'), 0) + 1
);
time_id is in yyyyMMddhh format (for example, 2018-06-10 23:00 becomes 2018061023), so each row is an hourly aggregate.
Now my problem is the retrieval of trending tags. The query below gives me the aggregation for tag1, but we will not know the tag names before making the query.
SELECT SUBSTRING(time_id, 1, 6) AS month,
       SUM(counters->>'$.tag1')
FROM TAG_COUNTER
WHERE counters->>'$.tag1' > 0
GROUP BY month;
So I need a generic GROUP BY query, along with an ORDER BY, to get the trending tags per hour/day/month.
The expected sample output is:

Time(hour/day/month)    Tag_name    Tag_count_value(total hits)
When I searched the web, everywhere the structure mentioned is like {"tag_name": "tag1", "tag_count": 1} instead of the direct {"tag1": 1}, and they were using tag_name in the GROUP BY.
Q1) So is it always mandatory to have a common, known JSON key in order to perform a GROUP BY?
Q2) If I have to go this way, then what is the change to my INSERT ... ON DUPLICATE KEY UPDATE query for this new JSON label/value structure? The counter has to be created when it does not exist and incremented by one when it does.
Q3) Do I have to maintain an array of objects

[ {"tag_name": "tag1", "tag_count": 2}, {"tag_name": "tag2", "tag_count": 3} ]

OR an object of objects like below?

{ {"tag_name": "tag1", "tag_count": 2}, {"tag_name": "tag2", "tag_count": 3} }

So which of the above JSON structures is better in terms of INSERT and retrieval of the trending count?
Q4) Can I keep the existing {"key": "value"} format instead of {"key_label": key, "value_label": "value"} and still extract the trending tags? I am thinking that {"key": "value"} is very straightforward and good performance-wise.
Q5) While retrieving I am using SUBSTRING(time_id, 1, 6) AS month. Will that be able to use an index? Or do I need to create multiple columns such as time_hour (2018061023), time_day (20180610), and time_month (201806), and query the specific column? Or can I use the MySQL date/time functions? Will those use an index for faster retrieval?
Please help.
2 Answers
Answer 1
I don't see a good reason why you are using JSON here. It's also not clear why you believe that a "NoSQL schema" within MySQL would do anything better.
What you probably need is something like this:
CREATE TABLE TAG_COUNTER (
    account  VARCHAR(36) NOT NULL,
    time_id  INT NOT NULL,
    tag_name VARCHAR(50) NOT NULL,
    counter  INT UNSIGNED NOT NULL,
    PRIMARY KEY (account, time_id, tag_name)
);
This will simplify your queries. The INSERT statement would look like:
INSERT INTO TAG_COUNTER (account, time_id, tag_name, counter)
VALUES
    ('google', 2018061023, 'tag1', 1),
    ('google', 2018061023, 'tag2', 1)
ON DUPLICATE KEY UPDATE counter = counter + VALUES(counter);
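One caveat worth noting: from MySQL 8.0.20 the VALUES() function in ON DUPLICATE KEY UPDATE is deprecated. If you are on such a version, a sketch of the equivalent row-alias form:

-- Equivalent upsert using a row alias instead of the deprecated
-- VALUES() function (MySQL 8.0.20+).
INSERT INTO TAG_COUNTER (account, time_id, tag_name, counter)
VALUES
    ('google', 2018061023, 'tag1', 1),
    ('google', 2018061023, 'tag2', 1) AS new_row
ON DUPLICATE KEY UPDATE counter = counter + new_row.counter;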
The SELECT statement might be something like this:
SELECT SUBSTRING(time_id, 1, 6) AS month,
       tag_name,
       SUM(counter) AS counter_agg
FROM TAG_COUNTER
GROUP BY month, tag_name
ORDER BY month, counter_agg DESC;
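To get the actual "trending" ordering, you can also rank tags by total count over a period and limit the result. A sketch with assumed example values for the account and date range; note that SUBSTRING(time_id, 1, 6) is computed per row and so cannot seek an index, but a range on the raw time_id column with account fixed can use the (account, time_id, tag_name) primary key:

-- Top ten tags for one account across June 2018 (hourly time_id range).
SELECT tag_name, SUM(counter) AS total_hits
FROM TAG_COUNTER
WHERE account = 'google'
  AND time_id BETWEEN 2018060100 AND 2018063023
GROUP BY tag_name
ORDER BY total_hits DESC
LIMIT 10;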
Note that I didn't try to optimize the table/schema for data size and performance; that would be a different question. But you can see that the queries are much simpler now.
Answer 2
As I have said in comments, I think moving away from JSON is the way to go. However, if you want to keep using JSON, this function (a direct copy of the one in my answer to this question, see an explanation of what it does there) and procedure will do what you want.
DELIMITER //
DROP FUNCTION IF EXISTS json_sum_merge //
CREATE FUNCTION json_sum_merge(j1 JSON, total JSON) RETURNS JSON DETERMINISTIC
BEGIN
  DECLARE knum INT DEFAULT 0;
  DECLARE jkeys JSON DEFAULT JSON_KEYS(j1);
  DECLARE kpath VARCHAR(30);
  DECLARE v INT;
  DECLARE l INT DEFAULT JSON_LENGTH(jkeys);
  kloop: LOOP
    IF knum >= l THEN LEAVE kloop; END IF;
    -- build the path '$."key"' for the knum'th key of j1
    SET kpath = CONCAT('$.', JSON_EXTRACT(jkeys, CONCAT('$[', knum, ']')));
    SET v = JSON_EXTRACT(j1, kpath);
    IF JSON_CONTAINS_PATH(total, 'one', kpath) THEN
      -- key already present in the running total: add this value to it
      SET total = JSON_REPLACE(total, kpath, JSON_EXTRACT(total, kpath) + v);
    ELSE
      -- first occurrence of this key: copy the value across
      SET total = JSON_SET(total, kpath, v);
    END IF;
    SET knum = knum + 1;
  END LOOP kloop;
  RETURN total;
END //
DELIMITER ;
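As a quick sanity check of the function (example values of my own): keys present in both objects are summed, and keys present only in the first object are copied across.

SELECT json_sum_merge('{"tag1": 2, "tag2": 5}', '{"tag1": 3}') AS merged;
-- returns: {"tag1": 5, "tag2": 5}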
The procedure is similar to the one in my other answer, in that it finds all the distinct tags associated with a given substring of time_id (specified as a parameter) and sums the values associated with each tag. The individual tags and counts are then written to a temporary table, from which a selection is made grouping by time period and tag name.
DELIMITER //
DROP PROCEDURE IF EXISTS count_tags //
CREATE PROCEDURE count_tags(IN period VARCHAR(50))
BEGIN
  DECLARE finished INT DEFAULT 0;
  DECLARE timeval VARCHAR(20);
  DECLARE knum, l INT;
  DECLARE jkeys JSON;
  DECLARE time_cursor CURSOR FOR SELECT DISTINCT time_id FROM tag_counter;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET finished = 1;

  CREATE TEMPORARY TABLE tag_counts (
    Time VARCHAR(20),
    Tag_Name VARCHAR(30),
    Tag_count_value INT,
    INDEX (Time, Tag_Name)
  );

  OPEN time_cursor;
  time_loop: LOOP
    FETCH time_cursor INTO timeval;
    IF finished = 1 THEN LEAVE time_loop; END IF;

    -- merge the counters of every row sharing this time_id into @total;
    -- MIN() forces json_sum_merge to run per row, @json is a throwaway
    SET @total = '{}';
    SET @query = CONCAT("SELECT MIN(@total := json_sum_merge(counters, @total)) ",
                        "INTO @json FROM TAG_COUNTER WHERE time_id = '", timeval, "'");
    PREPARE stmt FROM @query;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;

    -- write one row per tag, applying the caller-supplied period
    -- expression (e.g. SUBSTRING(@timeval, 1, 6)) to the time value
    SET @query = CONCAT('INSERT INTO tag_counts VALUES(', period, ', ?, ?)');
    PREPARE stmt FROM @query;
    SET @timeval = timeval;
    SET l = JSON_LENGTH(@total);
    SET jkeys = JSON_KEYS(@total);
    SET knum = 0;
    key_loop: LOOP
      IF knum >= l THEN LEAVE key_loop; END IF;
      SET @k = JSON_EXTRACT(jkeys, CONCAT('$[', knum, ']'));
      SET @t = JSON_EXTRACT(@total, CONCAT('$.', @k));
      EXECUTE stmt USING @k, @t;
      SET knum = knum + 1;
    END LOOP key_loop;
    DEALLOCATE PREPARE stmt;
  END LOOP time_loop;
  CLOSE time_cursor;

  SELECT Time, Tag_Name, SUM(Tag_count_value) AS Tag_count_value
  FROM tag_counts
  GROUP BY Time, Tag_Name;

  DROP TABLE tag_counts;
END //
DELIMITER ;
A couple of examples, based on some limited sample data from your prior question. In these examples @timeval is equivalent to the time_id column. Input data:
account   time_id    counters
google    20180510   {"gmail_page_viewed": 2, "search_page_viewed": 51}
google    20180511   {"gmail_page_viewed": 3, "search_page_viewed": 102}
apple     20180511   {"apple_page_viewed": 5, "search_page_viewed": 16}
CALL count_tags('@timeval'):
Time       Tag_Name               Tag_count_value
20180510   "gmail_page_viewed"    2
20180510   "search_page_viewed"   51
20180511   "apple_page_viewed"    5
20180511   "gmail_page_viewed"    3
20180511   "search_page_viewed"   118
CALL count_tags('SUBSTRING(@timeval, 1, 6)'):
Time     Tag_Name               Tag_count_value
201805   "apple_page_viewed"    5
201805   "gmail_page_viewed"    5
201805   "search_page_viewed"   169
Note that you can also use json_sum_merge to simplify your INSERT query, e.g.
INSERT INTO `TAG_COUNTER` (`account`, `time_id`, `counters`)
VALUES ('apple', '20180511', '{"apple_page_viewed": 9, "itunes_page_viewed": 4}')
ON DUPLICATE KEY UPDATE `counters` = json_sum_merge(VALUES(counters), counters);
Result:
account   time_id    counters
apple     20180511   {"apple_page_viewed": 14, "itunes_page_viewed": 4, "search_page_viewed": 16}
In terms of the specific questions in your post:
1. No. This answer shows it can be done with your existing data format.
2. Not applicable.
3. Not applicable.
4. Yes, you can stick with the existing {"key": "value"} format.
5. Since we have to go through every entry of tag_counter to get the list of tags, an index is not beneficial for that section. For the temporary table I have included an index on the Time and Tag_Name columns, which should benefit speed as they are used directly in the GROUP BY clause.
If you were to maintain a list of keys (e.g. in a separate table, maintained by triggers on insert/update/delete to tag_counter), this code could be made a lot simpler and more efficient. But that is for another question.
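For illustration only, a minimal sketch of that idea, assuming a hypothetical TAG_NAMES table and covering inserts only (an analogous AFTER UPDATE trigger would be needed too):

CREATE TABLE TAG_NAMES (tag_name VARCHAR(30) PRIMARY KEY);

DELIMITER //
CREATE TRIGGER tag_counter_ai AFTER INSERT ON TAG_COUNTER
FOR EACH ROW
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE jkeys JSON DEFAULT JSON_KEYS(NEW.counters);
  -- record every key of the inserted counters object
  WHILE i < JSON_LENGTH(jkeys) DO
    INSERT IGNORE INTO TAG_NAMES
      VALUES (JSON_UNQUOTE(JSON_EXTRACT(jkeys, CONCAT('$[', i, ']'))));
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;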