Monday, April 18, 2016

SQL Server - conditional aggregation with correlation

Leave a Comment

Background:

The original case was very simple. Calculate running total per user from highest revenue to lowest:

CREATE TABLE t(Customer INTEGER  NOT NULL PRIMARY KEY                ,"User"   VARCHAR(5) NOT NULL               ,Revenue  INTEGER  NOT NULL);  INSERT INTO t(Customer,"User",Revenue) VALUES (001,'James',500),(002,'James',750),(003,'James',450), (004,'Sarah',100),(005,'Sarah',500),(006,'Sarah',150), (007,'Sarah',600),(008,'James',150),(009,'James',100); 

Query:

SELECT *,     1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY "User") AS percentage,     1.0 * SUM(Revenue) OVER(PARTITION BY "User" ORDER BY Revenue DESC)          /SUM(Revenue) OVER(PARTITION BY "User") AS running_percentage FROM t; 

LiveDemo

Output:

╔════╦═══════╦═════════╦════════════╦════════════════════╗ ║ ID ║ User  ║ Revenue ║ percentage ║ running_percentage ║ ╠════╬═══════╬═════════╬════════════╬════════════════════╣ ║  2 ║ James ║     750 ║ 0.38       ║ 0.38               ║ ║  1 ║ James ║     500 ║ 0.26       ║ 0.64               ║ ║  3 ║ James ║     450 ║ 0.23       ║ 0.87               ║ ║  8 ║ James ║     150 ║ 0.08       ║ 0.95               ║ ║  9 ║ James ║     100 ║ 0.05       ║ 1                  ║ ║  7 ║ Sarah ║     600 ║ 0.44       ║ 0.44               ║ ║  5 ║ Sarah ║     500 ║ 0.37       ║ 0.81               ║ ║  6 ║ Sarah ║     150 ║ 0.11       ║ 0.93               ║ ║  4 ║ Sarah ║     100 ║ 0.07       ║ 1                  ║ ╚════╩═══════╩═════════╩════════════╩════════════════════╝ 

It could be calculated differently using specific windowed functions.


Now let's assume that we cannot use windowed SUM and rewrite it:

SELECT c.Customer, c."User", c."Revenue"     ,1.0 * Revenue / NULLIF(c3.s,0) AS percentage     ,1.0 * c2.s    / NULLIF(c3.s,0) AS running_percentage FROM t c CROSS APPLY         (SELECT SUM(Revenue) AS s         FROM t c2         WHERE c."User" = c2."User"             AND c2.Revenue >= c.Revenue) AS c2 CROSS APPLY         (SELECT SUM(Revenue) AS s         FROM t c2         WHERE c."User" = c2."User") AS c3 ORDER BY "User", Revenue DESC; 

LiveDemo

I have used CROSS APPLY because I do not like correlated subqueries in SELECT colums list and c3 is used twice.

Everything work as it should. But when we look closer c2 and c3 are very similiar. So why not combine them and use simple conditional aggregation:

SELECT c.Customer, c."User", c."Revenue"     ,1.0 * Revenue        / NULLIF(c2.sum_total,0) AS percentage     ,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage FROM t c CROSS APPLY         (SELECT SUM(Revenue) AS sum_total,                 SUM(CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END)                  AS sum_running         FROM t c2         WHERE c."User" = c2."User") AS c2 ORDER BY "User", Revenue DESC; 

Unfortunately it is not possible.

Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.

Of course I could circumvent it wrapping with another subquery, but it becomes a bit "ugly":

SELECT c.Customer, c."User", c."Revenue"     ,1.0 * Revenue        / NULLIF(c2.sum_total,0) AS percentage     ,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage FROM t c CROSS APPLY (   SELECT SUM(Revenue) AS sum_total,            SUM(running_revenue) AS sum_running      FROM (SELECT Revenue,                   CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END                    AS running_revenue            FROM t c2            WHERE c."User" = c2."User") AS sub ) AS c2 ORDER BY "User", Revenue DESC 

LiveDemo


Postgresql version. The only difference is LATERAL instead of CROSS APPLY.

SELECT c.Customer, c."User", c.Revenue     ,1.0 * Revenue        / NULLIF(c2.sum_total,0) AS percentage      ,1.0 * c2.running_sum / NULLIF(c2.sum_total,0) AS running_percentage  FROM t c ,LATERAL (SELECT SUM(Revenue) AS sum_total,                  SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)                   AS running_sum         FROM t c2         WHERE c."User" = c2."User") c2 ORDER BY "User", Revenue DESC; 

SqlFiddleDemo

It works very nice.


SQLite/MySQL version (that is why I preferLATERAL/CROSS APPLY):

SELECT c.Customer, c."User", c.Revenue,     1.0 * Revenue / (SELECT SUM(Revenue)                      FROM t c2                      WHERE c."User" = c2."User") AS percentage,     1.0 * (SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)            FROM t c2           WHERE c."User" = c2."User")  /            (SELECT SUM(c2.Revenue)            FROM t c2            WHERE c."User" = c2."User") AS running_percentage FROM t c ORDER BY "User", Revenue DESC; 

SQLFiddleDemo-SQLite SQLFiddleDemo-MySQL


I've read Aggregates with an Outer Reference:

The source for the restriction is in the SQL-92 standard, and SQL Server inherited it from the Sybase codebase. The problem is that SQL Server needs to figure out which query will compute the aggregate.

I do not search for answers that only show how to circumvent it.

The questions are:

  1. Which part of standard disallow or interfere with it?
  2. Why other RDBMSes do not have problem with this kind of outer dependency?
  3. Do they extend SQL Standard and SQL Server behaves as it should or SQL Server does not implement it fully(correctly?)?.

I would be very grateful for references to:

  • ISO standard (92 or newer)
  • SQL Server Standards Support
  • official documenation from any RDBMS that explains it (SQL Server/Postgresql/Oracle/...).

EDIT:

I know that SQL-92 does not have concept of LATERAL. But version with subqueries (like in SQLite/MySQL) does not work too.

LiveDemo

EDIT 2:

To simplify it a bit, let's check only correlated subquery only:

SELECT c.Customer, c."User", c.Revenue,        1.0*(SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)               FROM t c2               WHERE c."User" = c2."User")         / (SELECT SUM(c2.Revenue)           FROM t c2           WHERE c."User" = c2."User") AS running_percentage FROM t c ORDER BY "User", Revenue DESC; 

The version above works fine in MySQL/SQLite/Postgresql.

In SQL Server we get error. After wraping it with subquery to "flatten" it to one level it works:

SELECT c.Customer, c."User", c.Revenue,       1.0 * (               SELECT SUM(CASE WHEN r1 >= r2 THEN r1 ELSE 0 END)               FROM (SELECT c2.Revenue AS r1, c.Revenue r2                     FROM t c2                     WHERE c."User" = c2."User") AS S)  /               (SELECT SUM(c2.Revenue)               FROM t c2               WHERE c."User" = c2."User") AS running_percentage FROM t c ORDER BY "User", Revenue DESC; 

The point of this question is how does SQL standard regulate it.

LiveDemo

2 Answers

Answers 1

There is an easier solution:

SELECT c.Customer, c."User", c."Revenue",        1.0 * Revenue/ NULLIF(c2.sum_total, 0) AS percentage,        1.0 * c2.sum_running / NULLIF(c2.sum_total, 0) AS running_percentage FROM t c CROSS APPLY      (SELECT SUM(c2.Revenue) AS sum_total,              SUM(CASE WHEN c2.Revenue >= x.Revenue THEN c2.Revenue ELSE 0 END)                   as sum_running       FROM t c2 CROSS JOIN            (SELECT c.REVENUE) x       WHERE c."User" = c2."User"      ) c2 ORDER BY "User", Revenue DESC; 

I am not sure why or if this limitation is in the SQL '92 standard. I did have it pretty well memorized 20 or so years ago, but I don't recall that particular limitation.

I should note:

  • At the time of the SQL 92 standard, lateral joins were not really on the radar. Sybase definitely had no such concept.
  • Other databases do have problems with outer references. In particular, they often limit the scoping to one level deep.
  • The SQL Standard itself tends to highly political (that is, vendor-driven) rather than driven by actual database user requirements. Well, over time, it does move in the right direction.

Answers 2

There is no such limitation in the SQL standard for LATERAL. CROSS APPLY is a vendor-specific extension from Microsoft (Oracle adopted it later for compatibility) and its limitations are obviously not owed to the ANSI SQL standard, since the MS feature pre-dates the standard.

LATERAL according to ANSI SQL is basically just a modifier for joins to allow lateral references in the join tree. There is no limit to the number of columns that can be referenced.

I wouldn't see a reason for the odd restriction to begin with. Maybe it's because CROSS APPLY was originally intended to allow table-valued functions, which was later extended to allow sub-SELECTs.

The Postgres manual explains LATERAL like this:

The LATERAL key word can precede a sub-SELECT FROM item. This allows the sub-SELECT to refer to columns of FROM items that appear before it in the FROM list. (Without LATERAL, each sub-SELECT is evaluated independently and so cannot cross-reference any other FROM item.)

The Postgres version of your query (without the more elegant window functions) can be simpler:

SELECT c.*      , round(revenue        / c2.sum_total, 2) END AS percentage       , round(c2.running_sum / c2.sum_total, 2) END AS running_percentage  FROM   t c, LATERAL (    SELECT NULLIF(SUM(revenue), 0)::numeric AS sum_total  -- NULLIF, cast once         , SUM(revenue) FILTER (WHERE revenue >= c.revenue) AS running_sum    FROM   t    WHERE  "User" = c."User"    ) c2 ORDER  BY c."User", c.revenue DESC; 
  • Postgres 9.4+ has the more elegant aggregate FILTER for conditional aggregates.

  • NULLIF was redundant. revenue is defined NOT NULL, the aggregates are guaranteed to find 1 or more rows, and the LATERAL sub-SELECT is joined in a CROSS JOIN, so sum_total cannot be NULL. That was backwards, I thought about COALESCE. NULLIF makes sense, I only suggest a minor simplification.

  • Cast sum_total to numeric once.

  • Round result to match your desired result.

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment