Friday, March 18, 2016

DISTINCT ON in an aggregate function in postgres

Leave a Comment

For my problem, we have a schema whereby one photo has many tags and also many comments. So if I have a query where I want all the comments and tags, it will multiply the rows together. So if one photo has 2 tags and 13 comments, I get 26 rows for that one photo:

SELECT         tag.name,          comment.comment_id FROM         photo         LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id         LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id         LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id 

enter image description here

That's fine for most things, but it means that if I GROUP BY and then json_agg(tag.*), I get 13 copies of the first tag, and 13 copies of the second tag.

SELECT json_agg(tag.name) as tags FROM         photo         LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id         LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id         LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id 

enter image description here

Instead I want an array that is only 'suburban' and 'city', like this:

 [       {"tag_id":1,"name":"suburban"},        {"tag_id":2,"name":"city"}  ] 

I could json_agg(DISTINCT tag.name), but this will only make an array of tag names, when I want the entire row as json. I would like to json_agg(DISTINCT ON(tag.name) tag.*), but that's not valid SQL apparently.

How then can I simulate DISTINCT ON inside an aggregate function in Postgres?

3 Answers

Answers 1

Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT and SUM if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:

WITH tags AS (   SELECT  photo.photo_id, json_agg(row_to_json(tag.*)) AS tags   FROM    photo   LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id   LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id   GROUP BY photo.photo_id ), comments AS (   SELECT  photo.photo_id, json_agg(row_to_json(comment.*)) AS comments   FROM    photo   LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id   GROUP BY photo.photo_id ) SELECT  COALESCE(tags.photo_id, comments.photo_id) AS photo_id,         tags.tags,         comments.comments FROM    tags FULL OUTER JOIN comments ON      tags.photo_id = comments.photo_id 

EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:

SELECT  photo.photo_id,         to_json(array_agg(DISTINCT tag.*)) AS tags,         to_json(array_agg(DISTINCT comment.*)) AS comments FROM    photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id 

Answers 2

As stated in comments, json_agg does not serialize a row as an object, but builds a JSON array of the values that you pass it. You'll need row_to_json to turn your row into a JSON object, and then json_agg to perform the aggregation to an array:

SELECT json_agg(DISTINCT row_to_json(comment)) as tags FROM     photo     LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id     LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id     LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id 

Answers 3

The cheapest and simplest DISTINCT operation is .. not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join. See:

Best for returning selected rows

Assuming (see RFC in my comment) you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details, the most elegant (and probably also fastest) way is with LATERAL subqueries:

SELECT * FROM   photo p LEFT   JOIN LATERAL (    SELECT json_agg(c) AS comments    FROM   comment c    WHERE  c.photo_id = p.photo_id    ) c ON true LEFT  JOIN LATERAL (    SELECT json_agg(t) AS tags    FROM   photo_tag pt    JOIN   tag        t USING (tag_id)  -- no need for LEFT JOIN here    WHERE  pt.photo_id = p.photo_id    ) t ON true WHERE  p.photo_id = 2;  -- arbitrary selection 

This should give you exactly what you've been asking for. DISTINCT whole rows from comment and tag are aggregated into JSON objects. LATERAL and json_agg() require Postgres 9.3+.

And this is what I guess you are actually looking for:

SELECT * FROM   photo p LEFT   JOIN LATERAL (    SELECT json_agg(json_build_object('comment_id', comment_id                                  , 'comment', comment)) AS comments    FROM   comment c    WHERE  photo_id = p.photo_id    ) c ON true LEFT   JOIN LATERAL (    SELECT json_agg(t) AS tags    FROM   photo_tag pt    JOIN   tag       t USING (tag_id)  -- no need for LEFT JOIN here!    WHERE  pt.photo_id = p.photo_id    ) t ON true WHERE  p.photo_id = 2;

Typically, you'd only want a subset of columns (at least excluding the redundant photo_id, maybe more). That used to be tricky in older versions since the Postgres ROW constructor does not preserve column names. The above uses json_build_object() introduced in Postgres 9.4+, but there are workarounds for older versions. Detailed explanation:

This also allows to choose tag names freely, you don't have to stick to column names.

Best for returning the whole table

To return all rows, it would be more efficient to use:

SELECT p.*      , COALESCE(c.comments, '[]') AS comments      , COALESCE(t.tags, '[]') AS tags FROM   photo p LEFT   JOIN (    SELECT photo_id         , json_agg(json_build_object('comment_id', comment_id                                    , 'comment', comment)) AS comments    FROM   comment c    GROUP  BY 1    ) c USING (photo_id) LEFT  JOIN LATERAL (    SELECT photo_id , json_agg(t) AS tags    FROM   photo_tag pt    JOIN   tag       t USING (tag_id)  -- no need for LEFT JOIN here!    GROUP  BY 1    ) t USING (photo_id); 

Once we retrieve enough rows, this gets cheaper than LATERAL subqueries. Works for Postgres 9.3+.

Note the strategic use of the USING clause for the join condition. This way we can conveniently use SELECT * in the outer query without getting duplicate columns for photo_id. I didn't use it here because your deleted answer seems to indicate you want empty JSON arrays instead of NULL for no tags / no comments.

SQL Fiddle demonstrating all (backpatched to Postgres 9.3).

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment