Best performance in sampling a repeated value from a grouped column

This question is about getting the functionality of first_value() with another function or a workaround.

It is also about a small gain in performance on big tables. Using e.g. max() in the context explained below requires spurious comparisons. Even if they are fast, they impose some additional cost.

This typical query

SELECT x, y, count(*) as n  FROM t  GROUP BY x, y;

needs to repeat all columns in GROUP BY to return more than one column. Positional references are syntactic sugar for this:

SELECT x, y, count(*) as n  FROM t  GROUP BY x, 2  -- imagine that 2, 3, etc. are repeated with x

Sometimes one needs not only sugar but also some semantics to make sense of a more complex context:

SELECT x, COALESCE(y,z), count(*) as n  FROM t  GROUP BY x, y, z  -- are y and z really needed as grouping columns?

I can imagine many other complex contexts. Let's look at the usual solution:

SELECT x, max(y) as y, count(*) as n  FROM t  GROUP BY x  -- best semantics! no need for other columns here

where max() could be any "sample()" function (e.g. first or last value). Something that does no comparison at all should perform better than max(), e.g. the window function first_value(), but it needs a WINDOW, so the gain is lost. There are some old suggestions to implement first/last aggregate functions in C.
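
To illustrate why first_value() is not a drop-in replacement: it is a window function, so it returns one row per input row and needs an extra DISTINCT (or an outer query) to collapse the groups. A minimal sketch, assuming the table t(x, y) from the queries above:

```sql
-- first_value() runs over a window and keeps every input row,
-- so DISTINCT is needed afterwards to get one row per x.
SELECT DISTINCT
       x,
       first_value(y) OVER w AS y,
       count(*)       OVER w AS n
FROM   t
WINDOW w AS (PARTITION BY x);
```

Without an ORDER BY in the window definition, which y counts as "first" within each partition is not defined.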

Is there some "get any one value fast" aggregate function with better performance than max() or GROUP BY X,2,...?
Perhaps some new feature in a recent release?

2 Answers

Answer 1

If you really don't care which member of the set is picked, and if you don't need to compute additional aggregates (like count), there is a fast and simple alternative with DISTINCT ON (x) without ORDER BY:

SELECT DISTINCT ON (x) x, y, z FROM t;

x, y and z are from the same row, but the row is an arbitrary pick from each set of rows with the same x.
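
If the pick should be deterministic rather than arbitrary, DISTINCT ON can be combined with ORDER BY, at the cost of a sort (or a matching index). A sketch, assuming the same table t; using y as the tiebreaker is only an illustration:

```sql
-- Arbitrary row per x: no ORDER BY, so the pick is not defined.
SELECT DISTINCT ON (x) x, y, z
FROM   t;

-- Deterministic row per x: the row with the smallest y wins.
-- The leading ORDER BY expressions must match the DISTINCT ON list.
SELECT DISTINCT ON (x) x, y, z
FROM   t
ORDER  BY x, y;
```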

If you need a count anyway, your options are limited since the whole table has to be read in either case.

See: Fast way to discover the row count of a table

Then the first_last_agg extension is the only realistic option I see to gain some performance. But don't expect much.
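
The semantics that such a "first" aggregate provides can be sketched in plain SQL. This is an illustration only: the names first_sfunc and first_any are made up here, and a SQL-language transition function should be expected to be slower than a C implementation, so the sketch shows the behavior rather than the performance gain.

```sql
-- Transition function: keep the current state, ignore every later value.
-- STRICT makes the first non-null input become the initial state.
CREATE FUNCTION first_sfunc(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT
AS 'SELECT $1';

-- Hypothetical aggregate name; no initial condition is given on purpose.
CREATE AGGREGATE first_any(anyelement) (
  SFUNC = first_sfunc,
  STYPE = anyelement
);

-- Usage: one non-null y per x (whichever is seen first), plus the count.
SELECT x, first_any(y) AS y, count(*) AS n
FROM   t
GROUP  BY x;
```

On PostgreSQL 16 or later, the built-in any_value() aggregate covers the same need without custom code.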

For other use cases without count (including the simple case at the top), there are faster solutions, depending on your exact use case, such as emulating a loose index scan, as @Mihai commented:

See: Optimize GROUP BY query to retrieve latest record per user
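
A loose index scan can be emulated with a recursive CTE that jumps from one distinct x to the next instead of reading every row. A minimal sketch, assuming the table t(x, y, z) from above and an index on x (the index name and the CTE name are made up here); rows with NULL in x are not visited by this form:

```sql
-- The index lets each step find the next x without scanning the table.
CREATE INDEX t_x_idx ON t (x);

WITH RECURSIVE pick AS (
   -- Start with one row for the smallest x.
   (SELECT x, y, z FROM t ORDER BY x LIMIT 1)
   UNION ALL
   -- Then repeatedly fetch one row for the next larger x.
   SELECT n.x, n.y, n.z
   FROM   pick p
   CROSS  JOIN LATERAL (
      SELECT t.x, t.y, t.z
      FROM   t
      WHERE  t.x > p.x
      ORDER  BY t.x
      LIMIT  1
   ) n
)
SELECT x, y, z FROM pick;
```

This tends to pay off only when there are few distinct values of x relative to the number of rows; with many small groups, a plain DISTINCT ON is likely to be as fast or faster.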

Answer 2

Not an official source, but some thoughts on a question I perceive as rather generic:

In general, aggregators need to process all matching rows. From your question, you seem to be targeting aggregators that try to identify specific values (max, min, first, last, n-th, etc.). Those could benefit from data structures that maintain the proper value for one such aggregator. Then "selecting" that value can be sped up drastically.
E.g. some databases keep track of the max and min values of columns.
You can view this support as highly specialised internal indexes that are maintained by the system itself and are not under (direct) control of the user.

PostgreSQL, however, focuses more on support that improves queries in general, not just special cases. So its developers avoid spending effort on speeding up special cases that do not obviously benefit a broad range of use cases.

Back to speeding up sample value aggregators.

Aggregators have to process all rows in the general case, and there is no general strategy for short-circuiting that requirement for aggregators that try to identify specific values (sample-kind aggregators for now). So any reformulation of a query that does not reduce the set of rows to be processed will take a similar time to complete.

To speed up such queries beyond processing all rows, you need a supporting data structure. In databases this is usually provided in the form of an index.

You also could benefit from special execution operations that allow reducing the number of rows to be read.

With PostgreSQL you can provide your own index implementation. So you could add one that best supports the special kind of aggregator you are interested in. (At least for cases where you need to run such queries often.)

Also, execution operations like index-only scans or lazy evaluation with recursive queries may allow writing a specific query in a way that speeds it up compared to "straight" coding.
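
As one concrete instance, a multicolumn index covering both the grouping column and the sampled column may let the planner answer the query from the index alone. A sketch, assuming the table t(x, y) from the question; the index name is illustrative, and whether an Index Only Scan is actually chosen depends on table statistics and the visibility map:

```sql
-- Index that contains every column the query needs.
CREATE INDEX t_x_y_idx ON t (x, y);

-- Refresh statistics and the visibility map so that
-- index-only scans become possible.
VACUUM ANALYZE t;

-- Inspect the plan; look for "Index Only Scan" in the output.
EXPLAIN
SELECT x, max(y) AS y, count(*) AS n
FROM   t
GROUP  BY x;
```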

If your question is aimed more at general approaches, you might do better to consult researchers on such topics, as this is beyond anything SO is intended to provide.

If you have a specific query (or set of queries) that needs to be improved, asking explicit questions about it might allow the community to help identify potential optimizations. Trying to optimize without a good base of measurement leads nowhere, as what yields a perfect result in one case might kill performance in another.
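
One simple way to get such a base of measurement is to run each candidate formulation under EXPLAIN (ANALYZE, BUFFERS) on realistic data and compare the reported plans and times. A sketch with two variants from the question, assuming the table t(x, y):

```sql
-- Variant 1: group by every output column.
EXPLAIN (ANALYZE, BUFFERS)
SELECT x, y, count(*) AS n
FROM   t
GROUP  BY x, y;

-- Variant 2: group by x only and sample y with max().
EXPLAIN (ANALYZE, BUFFERS)
SELECT x, max(y) AS y, count(*) AS n
FROM   t
GROUP  BY x;
```

Running each statement several times helps, since the first run may be dominated by cold caches.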

Coding Question

Thursday, March 31, 2016
