We have a table that saves information about the interval of the employees. Let's call it INTERVAL_TABLE.
We save when the user starts a interval and when he finishes. The user can start a interval as many times as he wants and finish as many times as he wants as well.
This is a simplified structure of the INTERVAL_TABLE
:
INTERVAL_ID | USER_ID | INTERVAL_TYPE_ID | INTERVAL_TIMESTAMP | ENTRY_TYPE
A user may have these entries in the table:
Now, we must create a report combining different entries of that table that refer to the same user and interval type. We should be able to identify intervals that have a start and an end and group these two in one row. Assuming the data in the image above, the output of the report should be the following:
The output should be ordered by date, like the above image.
I have no idea how to create a query to do that.
Thanks!
Edit - Extra info:
To find the END interval for any INIT interval, we should find the closest END interval based on the timestamp of that interval. That's how we know we should match ID 1 with ID 2 and not with ID 3.
It's important to note that if a INIT interval is followed by another INIT interval (based on the timestamps), we should not proceed to find the END for that INIT. That is because this is a INIT without END.
4 Answers
Answers 1
DBFiddle
This might not be the most efficient way to do this (I imagine a recursive query might be), but I find these subqueries easier to maintain:
WITH ordered_table AS ( SELECT row_number() OVER(ORDER BY INTERVAL_TIMESTAMP ASC) row_num, * FROM INTERVAL_TABLE ORDER BY row_num ), _inits AS ( SELECT t1.USER_ID, t1.INTERVAL_TYPE_ID AS INTERVAL_TYPE, t1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, CASE WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' AND t2.ENTRY_TYPE = 'END_INTERVAL' THEN t2.INTERVAL_TIMESTAMP END AS INTERVAL_TIMESTAMP_END, t1.INTERVAL_ID AS INTERVAL_ID_INIT, CASE WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' AND t2.ENTRY_TYPE = 'END_INTERVAL' THEN t2.INTERVAL_ID END AS INTERVAL_ID_END FROM ordered_table AS t1 LEFT JOIN ordered_table AS t2 ON (t1.row_num = t2.row_num - 1) WHERE t1.ENTRY_TYPE = 'INIT_INTERVAL' ), _ends AS ( SELECT t1.USER_ID, t1.INTERVAL_TYPE_ID AS INTERVAL_TYPE, NULL::timestamp AS INTERVAL_TIMESTAMP_INIT, CASE WHEN t1.ENTRY_TYPE = 'END_INTERVAL' AND t2.ENTRY_TYPE = 'END_INTERVAL' THEN t2.INTERVAL_TIMESTAMP END AS INTERVAL_TIMESTAMP_END, NULL::int AS INTERVAL_ID_INIT, t2.INTERVAL_ID AS INTERVAL_ID_END FROM ordered_table AS t1 RIGHT JOIN ordered_table AS t2 ON (t1.row_num = t2.row_num - 1) WHERE t2.ENTRY_TYPE = 'END_INTERVAL' ) SELECT * FROM ( SELECT * FROM _inits UNION ALL SELECT * FROM _ends ) qry WHERE COALESCE(interval_timestamp_init, interval_timestamp_end) IS NOT NULL ORDER BY COALESCE(interval_timestamp_init, interval_timestamp_end)
Basically, INITs will always be listed. They will either have an associated END or a null. So almost all the content from _inits
will be there.
Because the ENDs were already captured by the INITs, we only need to capture the ones that don't have an INIT (they were preceded by an END).
Because they are outer joins, you simply can remove the cases where INIT and END both are NULL and apply proper ordering.
Answers 2
It could be done easy and efficiently using LEAD
and LAG
functions. At least it is much more efficient than self-join of the table: O(n)
vs O(n*n)
.
At first add columns for the next and previous row using LEAD
and LAG
with appropriate PARTITION BY
.
Then build two sets of pairs - the first that starts with INIT_INTERVAL
, the second that ends with END_INTERVAL
. If there is a pair that has both Init and End - it will be included twice and later eliminated in UNION
.
Sample data (this is something that you should have included in your question in addition to the screenshot)
CREATE TABLE INTERVAL_TABLE ( INTERVAL_ID int, USER_ID int, INTERVAL_TYPE_ID int, INTERVAL_TIMESTAMP timestamp, ENTRY_TYPE varchar(255)); INSERT INTO INTERVAL_TABLE (INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE) VALUES (1, 1, 1, '2018-03-08 14:00:00', 'INIT_INTERVAL'), (2, 1, 1, '2018-03-08 15:00:00', 'END_INTERVAL' ), (3, 1, 1, '2018-03-08 15:30:00', 'END_INTERVAL' ), (4, 1, 1, '2018-03-08 15:45:00', 'INIT_INTERVAL'), (5, 1, 1, '2018-03-08 15:50:00', 'INIT_INTERVAL');
Query
WITH CTE AS ( SELECT USER_ID ,INTERVAL_TYPE_ID ,ENTRY_TYPE AS Curr_Entry_Type ,INTERVAL_TIMESTAMP AS Curr_Interval_Timestamp ,INTERVAL_ID AS Curr_Interval_ID ,LAG(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Entry_Type ,LAG(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_Timestamp ,LAG(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_ID ,LEAD(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Entry_Type ,LEAD(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_Timestamp ,LEAD(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_ID FROM INTERVAL_TABLE ) ,CTE_Result AS ( SELECT USER_ID ,INTERVAL_TYPE_ID ,Curr_Entry_Type AS Entry_Type_Init ,Curr_Interval_Timestamp AS Interval_Timestamp_Init ,Curr_Interval_ID AS Interval_ID_Init ,Next_Entry_Type AS Entry_Type_End ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_Timestamp END AS Interval_Timestamp_End ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_ID END AS Interval_ID_End FROM CTE WHERE Curr_Entry_Type = 'INIT_INTERVAL' UNION -- sic! not UNION ALL SELECT USER_ID ,INTERVAL_TYPE_ID ,Prev_Entry_Type AS Entry_Type_Init ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_Timestamp END AS Interval_Timestamp_Init ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_ID END AS Interval_ID_Init ,Curr_Entry_Type AS Entry_Type_End ,Curr_Interval_Timestamp AS Interval_Timestamp_End ,Curr_Interval_ID AS Interval_ID_End FROM CTE WHERE Curr_Entry_Type = 'END_INTERVAL' ) SELECT USER_ID ,INTERVAL_TYPE_ID ,Interval_Timestamp_Init ,Interval_Timestamp_End ,Interval_ID_Init ,Interval_ID_End FROM CTE_Result ORDER BY USER_ID ,INTERVAL_TYPE_ID ,COALESCE(Interval_Timestamp_Init, Interval_Timestamp_End)
| user_id | interval_type_id | interval_timestamp_init | interval_timestamp_end | interval_id_init | interval_id_end | |---------|------------------|-------------------------|------------------------|------------------|-----------------| | 1 | 1 | 2018-03-08T14:00:00Z | 2018-03-08T15:00:00Z | 1 | 2 | | 1 | 1 | (null) | 2018-03-08T15:30:00Z | (null) | 3 | | 1 | 1 | 2018-03-08T15:45:00Z | (null) | 4 | (null) | | 1 | 1 | 2018-03-08T15:50:00Z | (null) | 5 | (null) |
Answers 3
You can use the INTERVAL_ID (or a new column with a generated row_number) to join two instances of the same table, using as predicate something like this:
on a.INTERVAL_ID=b.INTERVAL_ID + 1
This way, you can compare and get in 1 line each record with the next one.
Answers 4
This query gives the output you need:
WITH Intervals AS ( WITH Events AS ( WITH OrderedEvents AS ( SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE UNION ALL SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID UNION ALL SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID ) SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2 FROM OrderedEvents Events1 INNER JOIN OrderedEvents Events2 ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER ) SELECT USER_ID, INTERVAL_TYPE_ID, CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL END AS INTERVAL_TIMESTAMP_INIT, CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END END AS INTERVAL_TIMESTAMP_END, CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_INIT WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_ID_INIT WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL END AS INTERVAL_ID_INIT, CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END END AS INTERVAL_ID_END FROM Events ) SELECT * FROM Intervals WHERE INTERVAL_ID_INIT IS NOT NULL OR INTERVAL_ID_END IS NOT NULL;
At first, we build OrderedEvents
CTE that groups entries by USER_ID
and INTERVAL_TYPE_ID
, sorts them by INTERVAL_TIMESTAMP
within each group and assign numeric order to each event. Also for each group we add INIT_INTERVAL
as first event and END_INTERVAL
as last event to cover cases when group starts with END_INTERVAL
or finishes with INIT_INTERVAL
:
WITH OrderedEvents AS ( SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE UNION ALL SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID UNION ALL SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID ) SELECT * FROM OrderedEvents ORDER BY user_id, interval_type_id, event_order;
This query gives following results for the provided data:
Then we intersect OrderedEvents
with itself on USER_ID
and INTERVAL_TYPE_ID
and select pairs of neighbor events (Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER
):
WITH OrderedEvents AS ( ... ) SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2 FROM OrderedEvents Events1 INNER JOIN OrderedEvents Events2 ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER
This query gives following results:
Now we should transform these pairs of neighbor events to intervals based on the logic you described. Previous output has columns entry_type1
and entry_type2
which could take values of INIT_INTERVAL
or END_INTERVAL
. The possible combinations are:
<INIT_INTERVAL, END_INTERVAL>
- this is the most natural case whenINIT_INTERVAL
is followed byEND_INTERVAL
. We take event values as is.<INIT_INTERVAL(1), INIT_INTERVAL(2)>
- the case of two consecutiveINIT_INTERVAL
. We force ending of the interval by taking<INIT_INTERVAL(1), NULL>
.INIT_INTERVAL(2)
will be taken with the next pair when it will be in the first entry.<END_INTERVAL(1), END_INTERVAL(2)>
- the case of two consecutiveEND_INTERVAL
. We force start of the interval by taking<NULL, END_INTERVAL(2)>
.END_INTERVAL(1)
is processed either by case #1 or by the current case when it is the second entry in the pair.<END_INTERVAL, INIT_INTERVAL>
- such pairs are just skipped.END_INTERVAL
is taken either by case #1 or case #3.INIT_INTERVAL
is taken either by case #1 or case #2.
All this logic is put into set of CASE
expressions. There are 4 such expressions with duplicated conditions, because we conditionally select 4 different columns (INTERVAL_TIMESTAMP_INIT
, INTERVAL_TIMESTAMP_END
, INTERVAL_ID_INIT
and INTERVAL_ID_END
) which could not be done with one CASE
expression.
The final output is the same as you described:
0 comments:
Post a Comment