For my problem, we have a schema whereby one photo has many tags and also many comments. So if I have a query where I want all the comments and tags, it will multiply the rows together. So if one photo has 2 tags and 13 comments, I get 26 rows for that one photo:
SELECT tag.name, comment.comment_id FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id That's fine for most things, but it means that if I GROUP BY and then json_agg(tag.*), I get 13 copies of the first tag, and 13 copies of the second tag.
SELECT json_agg(tag.name) as tags FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id Instead I want an array that is only 'suburban' and 'city', like this:
[ {"tag_id":1,"name":"suburban"}, {"tag_id":2,"name":"city"} ] I could json_agg(DISTINCT tag.name), but this will only make an array of tag names, when I want the entire row as json. I would like to json_agg(DISTINCT ON(tag.name) tag.*), but that's not valid SQL apparently.
How then can I simulate DISTINCT ON inside an aggregate function in Postgres?
3 Answers
Answers 1
Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT and SUM if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:
WITH tags AS ( SELECT photo.photo_id, json_agg(row_to_json(tag.*)) AS tags FROM photo LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id ), comments AS ( SELECT photo.photo_id, json_agg(row_to_json(comment.*)) AS comments FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id GROUP BY photo.photo_id ) SELECT COALESCE(tags.photo_id, comments.photo_id) AS photo_id, tags.tags, comments.comments FROM tags FULL OUTER JOIN comments ON tags.photo_id = comments.photo_id EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:
SELECT photo.photo_id, to_json(array_agg(DISTINCT tag.*)) AS tags, to_json(array_agg(DISTINCT comment.*)) AS comments FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id Answers 2
As stated in comments, json_agg does not serialize a row as an object, but builds a JSON array of the values that you pass it. You'll need row_to_json to turn your row into a JSON object, and then json_agg to perform the aggregation to an array:
SELECT json_agg(DISTINCT row_to_json(comment)) as tags FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id Answers 3
The cheapest and simplest DISTINCT operation is .. not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join. See:
Best for returning selected rows
Assuming (see RFC in my comment) you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details, the most elegant (and probably also fastest) way is with LATERAL subqueries:
SELECT * FROM photo p LEFT JOIN LATERAL ( SELECT json_agg(c) AS comments FROM comment c WHERE c.photo_id = p.photo_id ) c ON true LEFT JOIN LATERAL ( SELECT json_agg(t) AS tags FROM photo_tag pt JOIN tag t USING (tag_id) -- no need for LEFT JOIN here WHERE pt.photo_id = p.photo_id ) t ON true WHERE p.photo_id = 2; -- arbitrary selection This should give you exactly what you've been asking for. DISTINCT whole rows from comment and tag are aggregated into JSON objects. LATERAL and json_agg() require Postgres 9.3+.
json_agg(c)is short forjson_agg(c.*).Minor thing: No need for
LEFT JOINbetweenphoto_tagandtag. There should be a FK constraint enforcing referential integrity anyway.It's wise to have
LEFT JOIN LATERAL () ON truefor the the outer query though:
And this is what I guess you are actually looking for:
SELECT * FROM photo p LEFT JOIN LATERAL ( SELECT json_agg(json_build_object('comment_id', comment_id , 'comment', comment)) AS comments FROM comment c WHERE photo_id = p.photo_id ) c ON true LEFT JOIN LATERAL ( SELECT json_agg(t) AS tags FROM photo_tag pt JOIN tag t USING (tag_id) -- no need for LEFT JOIN here! WHERE pt.photo_id = p.photo_id ) t ON true WHERE p.photo_id = 2; Typically, you'd only want a subset of columns (at least excluding the redundant photo_id, maybe more). That used to be tricky in older versions since the Postgres ROW constructor does not preserve column names. The above uses json_build_object() introduced in Postgres 9.4+, but there are workarounds for older versions. Detailed explanation:
This also allows to choose tag names freely, you don't have to stick to column names.
Best for returning the whole table
To return all rows, it would be more efficient to use:
SELECT p.* , COALESCE(c.comments, '[]') AS comments , COALESCE(t.tags, '[]') AS tags FROM photo p LEFT JOIN ( SELECT photo_id , json_agg(json_build_object('comment_id', comment_id , 'comment', comment)) AS comments FROM comment c GROUP BY 1 ) c USING (photo_id) LEFT JOIN LATERAL ( SELECT photo_id , json_agg(t) AS tags FROM photo_tag pt JOIN tag t USING (tag_id) -- no need for LEFT JOIN here! GROUP BY 1 ) t USING (photo_id); Once we retrieve enough rows, this gets cheaper than LATERAL subqueries. Works for Postgres 9.3+.
Note the strategic use of the USING clause for the join condition. This way we can conveniently use SELECT * in the outer query without getting duplicate columns for photo_id. I didn't use it here because your deleted answer seems to indicate you want empty JSON arrays instead of NULL for no tags / no comments.
SQL Fiddle demonstrating all (backpatched to Postgres 9.3).
0 comments:
Post a Comment