Showing posts with label aggregate-functions. Show all posts

Thursday, March 31, 2016

Best performance in sampling repeated value from a grouped column


This question is about the functionality of first_value(), or about using another function or workaround to the same effect.

It is also about the small performance gains possible in big tables. Using e.g. max() in the context explained below demands spurious comparisons: even if fast, it imposes some additional cost.


This typical query

SELECT x, y, count(*) as n  FROM t  GROUP BY x, y; 

needs to repeat all returned columns in the GROUP BY clause. Syntactic sugar to shorten this is to use positional references:

SELECT x, y, count(*) as n  FROM t  GROUP BY x, 2  -- imagine 2, 3, etc. for further columns after x 

Sometimes we need not only sugar, but also some semantics to handle a complex context:

SELECT x, COALESCE(y,z), count(*) as n  FROM t  GROUP BY x, y, z  -- are y and z really needed as grouping clauses? 

I can imagine many other complex contexts. Let's see usual solutions:

SELECT x, max(y) as y, count(*) as n  FROM t  GROUP BY x  -- best semantic! no need for other columns here 

where the max() function stands in for any "sample()" function (e.g. first or last value). Something that compares nothing should perform better than max(). The function first_value() qualifies, but it is a window function and needs a WINDOW clause, so performance is lost again. There are some old suggestions to implement first/last aggregate functions in C.
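For comparison, the window-function route reads like this (a sketch over the same hypothetical table t; the window pass plus the DISTINCT needed to collapse the rows back is where the performance goes):

```sql
-- first_value() is a window function: every row of each group is
-- emitted and must be collapsed again with DISTINCT.
SELECT DISTINCT
       x,
       first_value(y) OVER (PARTITION BY x) AS y
FROM   t;
```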

Is there some "get any one value fast" aggregate function with better performance than max() or GROUP BY x, 2, ...?
Perhaps some new feature in a recent release?

2 Answers

Answer 1

If you really don't care which member of the set is picked, and if you don't need to compute additional aggregates (like count), there is a fast and simple alternative with DISTINCT ON (x) without ORDER BY:

SELECT DISTINCT ON (x) x, y, z FROM t; 

x, y and z are from the same row, but the row is an arbitrary pick from each set of rows with the same x.

If you need a count anyway, your options are limited since the whole table has to be read in either case.

Then the first_last_agg extension is the only realistic option I see to gain some performance. But don't expect much.
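For the record, a sketch of how first_last_agg could be used, assuming the extension is installed (check its documentation for the exact function names and signatures):

```sql
CREATE EXTENSION IF NOT EXISTS first_last_agg;

-- first(y) returns y from the first row the aggregate happens to see,
-- without the per-row comparisons that max() performs.
SELECT x, first(y) AS y, count(*) AS n
FROM   t
GROUP  BY x;
```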

For other use cases without count (including the simple case at the top), there are faster solutions, depending on your exact use case: emulating a loose index scan, as @Mihai commented.
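A minimal sketch of that emulation, assuming a btree index on t(x): a recursive CTE hops from each distinct x to the next, so only one index probe per distinct value is needed instead of a full scan:

```sql
WITH RECURSIVE cte AS (
   (SELECT x FROM t ORDER BY x LIMIT 1)   -- smallest x
   UNION ALL
   SELECT (SELECT t.x FROM t
           WHERE  t.x > cte.x
           ORDER  BY t.x
           LIMIT  1)                      -- next larger x
   FROM   cte
   WHERE  cte.x IS NOT NULL
   )
SELECT x
FROM   cte
WHERE  x IS NOT NULL;
```

Each distinct value can then be joined back (e.g. with a LATERAL subquery) to pick an arbitrary full row per group.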

Answer 2

Not an official source, but some thoughts on a question perceived as rather generic:

In general, aggregators need to process all matching rows. From your question text you seem to target aggregators that identify specific values (max, min, first, last, n-th, etc.). Those could benefit from data structures that maintain the proper value for such an aggregator; "selecting" that value can then be sped up drastically.
E.g. some databases keep track of the max and min values of columns.
You can view this support as highly specialised internal indexes that are maintained by the system itself and are not under (direct) control of the user.
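Postgres has a concrete instance of this: with a plain btree index, min() and max() calls are rewritten into single index probes. A sketch, assuming a hypothetical table t with an indexable column x:

```sql
CREATE INDEX t_x_idx ON t (x);

-- Planned as a Limit over an (index-only) scan: one index entry is
-- read instead of the whole table.
SELECT max(x) FROM t;
```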

Now, PostgreSQL focuses more on support that helps improve queries in general, not just special cases. So it avoids adding effort for speeding up special cases that do not obviously benefit a broad range of use cases.

Back to speeding up sample value aggregators.

With aggregators having to process all rows in the general case, and with no general strategy for short-circuiting that requirement for aggregators that identify specific values ("sample-kind" aggregators, for short), it is obvious that any reformulation of a query that does not reduce the set of rows to be processed will take a similar time to complete.

For speeding up such queries beyond processing all rows you will need a supporting datastructure. With databases this usually is provided in the form of an index.

You also could benefit from special execution operations that allow reducing the number of rows to be read.

With Postgres you have the capability of providing your own index implementation, so you could add one that best supports the special kind of aggregator you are interested in (at least for cases where you need to run such queries often).

Also, execution operations like index-only scans or lazy evaluation with recursive queries may allow writing a specific query in a way that is faster than the "straight" formulation.

If you are targeting your question at general approaches, you might do better to consult researchers on such topics, as this is beyond anything SO is intended to provide.

If you have a specific (set of) queries that need to be improved, asking explicit questions about those might allow the community to help identify potential optimizations. Trying to optimize without a good base of measurements leads nowhere, as what yields a perfect result in one case might kill performance in another.


Friday, March 18, 2016

DISTINCT ON in an aggregate function in postgres


For my problem, we have a schema whereby one photo has many tags and also many comments. So if I have a query where I want all the comments and tags, it will multiply the rows together. So if one photo has 2 tags and 13 comments, I get 26 rows for that one photo:

SELECT tag.name,
       comment.comment_id
FROM   photo
LEFT   OUTER JOIN comment   ON comment.photo_id = photo.photo_id
LEFT   OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
LEFT   OUTER JOIN tag       ON photo_tag.tag_id = tag.tag_id


That's fine for most things, but it means that if I GROUP BY and then json_agg(tag.*), I get 13 copies of the first tag, and 13 copies of the second tag.

SELECT json_agg(tag.name) AS tags
FROM   photo
LEFT   OUTER JOIN comment   ON comment.photo_id = photo.photo_id
LEFT   OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
LEFT   OUTER JOIN tag       ON photo_tag.tag_id = tag.tag_id
GROUP  BY photo.photo_id


Instead I want an array that is only 'suburban' and 'city', like this:

[
  {"tag_id": 1, "name": "suburban"},
  {"tag_id": 2, "name": "city"}
]

I could json_agg(DISTINCT tag.name), but this will only make an array of tag names, when I want the entire row as json. I would like to json_agg(DISTINCT ON(tag.name) tag.*), but that's not valid SQL apparently.

How then can I simulate DISTINCT ON inside an aggregate function in Postgres?

3 Answers

Answer 1

Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregate functions like COUNT and SUM if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:

WITH tags AS (
  SELECT photo.photo_id, json_agg(row_to_json(tag.*)) AS tags
  FROM   photo
  LEFT   OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
  LEFT   OUTER JOIN tag       ON photo_tag.tag_id = tag.tag_id
  GROUP  BY photo.photo_id
), comments AS (
  SELECT photo.photo_id, json_agg(row_to_json(comment.*)) AS comments
  FROM   photo
  LEFT   OUTER JOIN comment ON comment.photo_id = photo.photo_id
  GROUP  BY photo.photo_id
)
SELECT COALESCE(tags.photo_id, comments.photo_id) AS photo_id,
       tags.tags,
       comments.comments
FROM   tags
FULL   OUTER JOIN comments ON tags.photo_id = comments.photo_id

EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:

SELECT photo.photo_id,
       to_json(array_agg(DISTINCT tag.*))     AS tags,
       to_json(array_agg(DISTINCT comment.*)) AS comments
FROM   photo
LEFT   OUTER JOIN comment   ON comment.photo_id = photo.photo_id
LEFT   OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
LEFT   OUTER JOIN tag       ON photo_tag.tag_id = tag.tag_id
GROUP  BY photo.photo_id

Answer 2

As stated in comments, json_agg does not serialize a row as an object, but builds a JSON array of the values that you pass it. You'll need row_to_json to turn your row into a JSON object, and then json_agg to perform the aggregation to an array:

SELECT json_agg(DISTINCT row_to_json(comment)) AS tags
FROM   photo
LEFT   OUTER JOIN comment   ON comment.photo_id = photo.photo_id
LEFT   OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
LEFT   OUTER JOIN tag       ON photo_tag.tag_id = tag.tag_id
GROUP  BY photo.photo_id

Answer 3

The cheapest and simplest DISTINCT operation is not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join.

Best for returning selected rows

Assuming (see RFC in my comment) you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details, the most elegant (and probably also fastest) way is with LATERAL subqueries:

SELECT *
FROM   photo p
LEFT   JOIN LATERAL (
   SELECT json_agg(c) AS comments
   FROM   comment c
   WHERE  c.photo_id = p.photo_id
   ) c ON true
LEFT   JOIN LATERAL (
   SELECT json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)  -- no need for LEFT JOIN here
   WHERE  pt.photo_id = p.photo_id
   ) t ON true
WHERE  p.photo_id = 2;  -- arbitrary selection

This should give you exactly what you've been asking for. DISTINCT whole rows from comment and tag are aggregated into JSON objects. LATERAL and json_agg() require Postgres 9.3+.

And this is what I guess you are actually looking for:

SELECT *
FROM   photo p
LEFT   JOIN LATERAL (
   SELECT json_agg(json_build_object('comment_id', comment_id
                                   , 'comment', comment)) AS comments
   FROM   comment c
   WHERE  photo_id = p.photo_id
   ) c ON true
LEFT   JOIN LATERAL (
   SELECT json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)  -- no need for LEFT JOIN here!
   WHERE  pt.photo_id = p.photo_id
   ) t ON true
WHERE  p.photo_id = 2;

Typically, you'd only want a subset of columns (at least excluding the redundant photo_id, maybe more). That used to be tricky in older versions, since the Postgres ROW constructor does not preserve column names. The above uses json_build_object(), introduced in Postgres 9.4, but there are workarounds for older versions.

This also allows you to choose the JSON key names freely; you don't have to stick to column names.

Best for returning the whole table

To return all rows, it would be more efficient to use:

SELECT p.*
     , COALESCE(c.comments, '[]') AS comments
     , COALESCE(t.tags, '[]') AS tags
FROM   photo p
LEFT   JOIN (
   SELECT photo_id
        , json_agg(json_build_object('comment_id', comment_id
                                   , 'comment', comment)) AS comments
   FROM   comment c
   GROUP  BY 1
   ) c USING (photo_id)
LEFT   JOIN (
   SELECT photo_id, json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)  -- no need for LEFT JOIN here!
   GROUP  BY 1
   ) t USING (photo_id);

Once we retrieve enough rows, this gets cheaper than LATERAL subqueries. Works for Postgres 9.3+.

Note the strategic use of the USING clause in the join conditions: this way we can conveniently use SELECT * in the outer query without getting duplicate columns for photo_id. COALESCE is used because your deleted answer seems to indicate you want empty JSON arrays instead of NULL where there are no tags / no comments.

SQL Fiddle demonstrating all (backpatched to Postgres 9.3).
