Sunday, March 18, 2018

Combine different rows of same table - Postgres

Leave a Comment

We have a table that saves information about the interval of the employees. Let's call it INTERVAL_TABLE.

We save when the user starts a interval and when he finishes. The user can start a interval as many times as he wants and finish as many times as he wants as well.

This is a simplified structure of the INTERVAL_TABLE:

   INTERVAL_ID | USER_ID | INTERVAL_TYPE_ID | INTERVAL_TIMESTAMP | ENTRY_TYPE 

A user may have these entries in the table:

table possible entries

Now, we must create a report combining different entries of that table that refer to the same user and interval type. We should be able to identify intervals that have a start and an end and group these two in one row. Assuming the data in the image above, the output of the report should be the following:

report expected output

The output should be ordered by date, like the above image.

I have no idea how to create a query to do that.

Thanks!

Edit - Extra info:

To find the END interval for any INIT interval, we should find the closest END interval based on the timestamp of that interval. That's how we know we should match ID 1 with ID 2 and not with ID 3.

It's important to note that if a INIT interval is followed by another INIT interval (based on the timestamps), we should not proceed to find the END for that INIT. That is because this is a INIT without END.

4 Answers

Answers 1

DBFiddle

This might not be the most efficient way to do this (I imagine a recursive query might be), but I find these subqueries easier to maintain:

WITH ordered_table AS (   SELECT row_number() OVER(ORDER BY INTERVAL_TIMESTAMP ASC) row_num, *   FROM INTERVAL_TABLE   ORDER BY row_num ),  _inits AS (   SELECT     t1.USER_ID,     t1.INTERVAL_TYPE_ID      AS INTERVAL_TYPE,     t1.INTERVAL_TIMESTAMP    AS INTERVAL_TIMESTAMP_INIT,     CASE       WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' AND t2.ENTRY_TYPE = 'END_INTERVAL'        THEN t2.INTERVAL_TIMESTAMP      END                      AS INTERVAL_TIMESTAMP_END,     t1.INTERVAL_ID           AS INTERVAL_ID_INIT,     CASE       WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' AND t2.ENTRY_TYPE = 'END_INTERVAL'        THEN t2.INTERVAL_ID      END                      AS INTERVAL_ID_END   FROM      ordered_table AS t1   LEFT JOIN ordered_table AS t2 ON (t1.row_num = t2.row_num - 1)   WHERE t1.ENTRY_TYPE = 'INIT_INTERVAL' ),  _ends AS (   SELECT     t1.USER_ID,     t1.INTERVAL_TYPE_ID      AS INTERVAL_TYPE,     NULL::timestamp          AS INTERVAL_TIMESTAMP_INIT,     CASE       WHEN t1.ENTRY_TYPE = 'END_INTERVAL' AND t2.ENTRY_TYPE = 'END_INTERVAL'        THEN t2.INTERVAL_TIMESTAMP      END                      AS INTERVAL_TIMESTAMP_END,     NULL::int                AS INTERVAL_ID_INIT,     t2.INTERVAL_ID           AS INTERVAL_ID_END   FROM       ordered_table AS t1   RIGHT JOIN ordered_table AS t2 ON (t1.row_num = t2.row_num - 1)   WHERE t2.ENTRY_TYPE = 'END_INTERVAL' )  SELECT * FROM (     SELECT * FROM _inits     UNION ALL     SELECT * FROM _ends ) qry WHERE COALESCE(interval_timestamp_init, interval_timestamp_end) IS NOT NULL ORDER BY COALESCE(interval_timestamp_init, interval_timestamp_end) 

Basically, INITs will always be listed. They will either have an associated END or a null. So almost all the content from _inits will be there.

Because the ENDs were already captured by the INITs, we only need to capture the ones that don't have an INIT (they were preceded by an END).

Because they are outer joins, you simply can remove the cases where INIT and END both are NULL and apply proper ordering.

Answers 2

It could be done easy and efficiently using LEAD and LAG functions. At least it is much more efficient than self-join of the table: O(n) vs O(n*n).

At first add columns for the next and previous row using LEAD and LAG with appropriate PARTITION BY.

Then build two sets of pairs - the first that starts with INIT_INTERVAL, the second that ends with END_INTERVAL. If there is a pair that has both Init and End - it will be included twice and later eliminated in UNION.

SQL Fiddle

Sample data (this is something that you should have included in your question in addition to the screenshot)

CREATE TABLE INTERVAL_TABLE (   INTERVAL_ID int,   USER_ID int,   INTERVAL_TYPE_ID int,   INTERVAL_TIMESTAMP timestamp,   ENTRY_TYPE varchar(255));  INSERT INTO INTERVAL_TABLE (INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE) VALUES (1, 1, 1, '2018-03-08 14:00:00', 'INIT_INTERVAL'), (2, 1, 1, '2018-03-08 15:00:00', 'END_INTERVAL' ), (3, 1, 1, '2018-03-08 15:30:00', 'END_INTERVAL' ), (4, 1, 1, '2018-03-08 15:45:00', 'INIT_INTERVAL'), (5, 1, 1, '2018-03-08 15:50:00', 'INIT_INTERVAL'); 

Query

WITH CTE AS (   SELECT     USER_ID     ,INTERVAL_TYPE_ID     ,ENTRY_TYPE AS Curr_Entry_Type     ,INTERVAL_TIMESTAMP AS Curr_Interval_Timestamp     ,INTERVAL_ID AS Curr_Interval_ID      ,LAG(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Entry_Type     ,LAG(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_Timestamp     ,LAG(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_ID      ,LEAD(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Entry_Type     ,LEAD(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_Timestamp     ,LEAD(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_ID   FROM     INTERVAL_TABLE ) ,CTE_Result AS (   SELECT     USER_ID     ,INTERVAL_TYPE_ID     ,Curr_Entry_Type AS Entry_Type_Init     ,Curr_Interval_Timestamp AS Interval_Timestamp_Init     ,Curr_Interval_ID AS Interval_ID_Init     ,Next_Entry_Type AS Entry_Type_End     ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_Timestamp END AS Interval_Timestamp_End     ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_ID END AS Interval_ID_End   FROM CTE   WHERE Curr_Entry_Type = 'INIT_INTERVAL'    UNION -- sic! not UNION ALL    SELECT     USER_ID     ,INTERVAL_TYPE_ID     ,Prev_Entry_Type AS Entry_Type_Init     ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_Timestamp END AS Interval_Timestamp_Init     ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_ID END AS Interval_ID_Init     ,Curr_Entry_Type AS Entry_Type_End     ,Curr_Interval_Timestamp AS Interval_Timestamp_End     ,Curr_Interval_ID AS Interval_ID_End   FROM CTE   WHERE Curr_Entry_Type = 'END_INTERVAL' ) SELECT     USER_ID     ,INTERVAL_TYPE_ID     ,Interval_Timestamp_Init     ,Interval_Timestamp_End     ,Interval_ID_Init     ,Interval_ID_End FROM CTE_Result ORDER BY   USER_ID   ,INTERVAL_TYPE_ID   ,COALESCE(Interval_Timestamp_Init, Interval_Timestamp_End) 

Results

| user_id | interval_type_id | interval_timestamp_init | interval_timestamp_end | interval_id_init | interval_id_end | |---------|------------------|-------------------------|------------------------|------------------|-----------------| |       1 |                1 |    2018-03-08T14:00:00Z |   2018-03-08T15:00:00Z |                1 |               2 | |       1 |                1 |                  (null) |   2018-03-08T15:30:00Z |           (null) |               3 | |       1 |                1 |    2018-03-08T15:45:00Z |                 (null) |                4 |          (null) | |       1 |                1 |    2018-03-08T15:50:00Z |                 (null) |                5 |          (null) | 

Answers 3

You can use the INTERVAL_ID (or a new column with a generated row_number) to join two instances of the same table, using as predicate something like this:

on a.INTERVAL_ID=b.INTERVAL_ID + 1 

This way, you can compare and get in 1 line each record with the next one.

Answers 4

This query gives the output you need:

WITH Intervals AS (     WITH Events AS     (         WITH OrderedEvents AS         (             SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE             UNION ALL             SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID             UNION ALL             SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID         )         SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID  AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2         FROM OrderedEvents Events1 INNER JOIN         OrderedEvents Events2         ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER     )     SELECT USER_ID, INTERVAL_TYPE_ID,        CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT            WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT            WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL       END AS INTERVAL_TIMESTAMP_INIT,        CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END            WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL            WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END       END AS INTERVAL_TIMESTAMP_END,        CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_INIT            WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_ID_INIT            WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL       END AS INTERVAL_ID_INIT,        CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END            WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL            WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END       END AS INTERVAL_ID_END      FROM Events ) SELECT * FROM Intervals WHERE INTERVAL_ID_INIT IS NOT NULL OR INTERVAL_ID_END IS NOT NULL; 

At first, we build OrderedEvents CTE that groups entries by USER_ID and INTERVAL_TYPE_ID, sorts them by INTERVAL_TIMESTAMP within each group and assign numeric order to each event. Also for each group we add INIT_INTERVAL as first event and END_INTERVAL as last event to cover cases when group starts with END_INTERVAL or finishes with INIT_INTERVAL:

WITH OrderedEvents AS (     SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE     UNION ALL     SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID     UNION ALL     SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID ) SELECT * FROM OrderedEvents ORDER BY user_id, interval_type_id, event_order; 

This query gives following results for the provided data:

enter image description here

Then we intersect OrderedEvents with itself on USER_ID and INTERVAL_TYPE_ID and select pairs of neighbor events (Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER):

WITH OrderedEvents AS (     ... ) SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID  AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2 FROM OrderedEvents Events1 INNER JOIN OrderedEvents Events2 ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER 

This query gives following results:

enter image description here

Now we should transform these pairs of neighbor events to intervals based on the logic you described. Previous output has columns entry_type1 and entry_type2 which could take values of INIT_INTERVAL or END_INTERVAL. The possible combinations are:

  • <INIT_INTERVAL, END_INTERVAL> - this is the most natural case when INIT_INTERVAL is followed by END_INTERVAL. We take event values as is.
  • <INIT_INTERVAL(1), INIT_INTERVAL(2)> - the case of two consecutive INIT_INTERVAL. We force ending of the interval by taking <INIT_INTERVAL(1), NULL>. INIT_INTERVAL(2) will be taken with the next pair when it will be in the first entry.
  • <END_INTERVAL(1), END_INTERVAL(2)> - the case of two consecutive END_INTERVAL. We force start of the interval by taking <NULL, END_INTERVAL(2)>. END_INTERVAL(1) is processed either by case #1 or by the current case when it is the second entry in the pair.
  • <END_INTERVAL, INIT_INTERVAL> - such pairs are just skipped. END_INTERVAL is taken either by case #1 or case #3. INIT_INTERVAL is taken either by case #1 or case #2.

All this logic is put into set of CASE expressions. There are 4 such expressions with duplicated conditions, because we conditionally select 4 different columns (INTERVAL_TIMESTAMP_INIT, INTERVAL_TIMESTAMP_END, INTERVAL_ID_INIT and INTERVAL_ID_END) which could not be done with one CASE expression.

The final output is the same as you described:

enter image description here

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment