Background:
The original case was very simple. Calculate running total per user from highest revenue to lowest:
CREATE TABLE t(Customer INTEGER NOT NULL PRIMARY KEY ,"User" VARCHAR(5) NOT NULL ,Revenue INTEGER NOT NULL); INSERT INTO t(Customer,"User",Revenue) VALUES (001,'James',500),(002,'James',750),(003,'James',450), (004,'Sarah',100),(005,'Sarah',500),(006,'Sarah',150), (007,'Sarah',600),(008,'James',150),(009,'James',100);
Query:
SELECT *, 1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY "User") AS percentage, 1.0 * SUM(Revenue) OVER(PARTITION BY "User" ORDER BY Revenue DESC) /SUM(Revenue) OVER(PARTITION BY "User") AS running_percentage FROM t;
Output:
╔════╦═══════╦═════════╦════════════╦════════════════════╗ ║ ID ║ User ║ Revenue ║ percentage ║ running_percentage ║ ╠════╬═══════╬═════════╬════════════╬════════════════════╣ ║ 2 ║ James ║ 750 ║ 0.38 ║ 0.38 ║ ║ 1 ║ James ║ 500 ║ 0.26 ║ 0.64 ║ ║ 3 ║ James ║ 450 ║ 0.23 ║ 0.87 ║ ║ 8 ║ James ║ 150 ║ 0.08 ║ 0.95 ║ ║ 9 ║ James ║ 100 ║ 0.05 ║ 1 ║ ║ 7 ║ Sarah ║ 600 ║ 0.44 ║ 0.44 ║ ║ 5 ║ Sarah ║ 500 ║ 0.37 ║ 0.81 ║ ║ 6 ║ Sarah ║ 150 ║ 0.11 ║ 0.93 ║ ║ 4 ║ Sarah ║ 100 ║ 0.07 ║ 1 ║ ╚════╩═══════╩═════════╩════════════╩════════════════════╝
It could be calculated differently using specific windowed functions.
Now let's assume that we cannot use windowed SUM
and rewrite it:
SELECT c.Customer, c."User", c."Revenue" ,1.0 * Revenue / NULLIF(c3.s,0) AS percentage ,1.0 * c2.s / NULLIF(c3.s,0) AS running_percentage FROM t c CROSS APPLY (SELECT SUM(Revenue) AS s FROM t c2 WHERE c."User" = c2."User" AND c2.Revenue >= c.Revenue) AS c2 CROSS APPLY (SELECT SUM(Revenue) AS s FROM t c2 WHERE c."User" = c2."User") AS c3 ORDER BY "User", Revenue DESC;
I have used CROSS APPLY
because I do not like correlated subqueries in SELECT
colums list and c3
is used twice.
Everything work as it should. But when we look closer c2
and c3
are very similiar. So why not combine them and use simple conditional aggregation:
SELECT c.Customer, c."User", c."Revenue" ,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage ,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage FROM t c CROSS APPLY (SELECT SUM(Revenue) AS sum_total, SUM(CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END) AS sum_running FROM t c2 WHERE c."User" = c2."User") AS c2 ORDER BY "User", Revenue DESC;
Unfortunately it is not possible.
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
Of course I could circumvent it wrapping with another subquery, but it becomes a bit "ugly":
SELECT c.Customer, c."User", c."Revenue" ,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage ,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage FROM t c CROSS APPLY ( SELECT SUM(Revenue) AS sum_total, SUM(running_revenue) AS sum_running FROM (SELECT Revenue, CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END AS running_revenue FROM t c2 WHERE c."User" = c2."User") AS sub ) AS c2 ORDER BY "User", Revenue DESC
Postgresql
version. The only difference is LATERAL
instead of CROSS APPLY
.
SELECT c.Customer, c."User", c.Revenue ,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage ,1.0 * c2.running_sum / NULLIF(c2.sum_total,0) AS running_percentage FROM t c ,LATERAL (SELECT SUM(Revenue) AS sum_total, SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END) AS running_sum FROM t c2 WHERE c."User" = c2."User") c2 ORDER BY "User", Revenue DESC;
It works very nice.
SQLite
/MySQL
version (that is why I preferLATERAL/CROSS APPLY
):
SELECT c.Customer, c."User", c.Revenue, 1.0 * Revenue / (SELECT SUM(Revenue) FROM t c2 WHERE c."User" = c2."User") AS percentage, 1.0 * (SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END) FROM t c2 WHERE c."User" = c2."User") / (SELECT SUM(c2.Revenue) FROM t c2 WHERE c."User" = c2."User") AS running_percentage FROM t c ORDER BY "User", Revenue DESC;
SQLFiddleDemo-SQLite
SQLFiddleDemo-MySQL
I've read Aggregates with an Outer Reference:
The source for the restriction is in the
SQL-92
standard, andSQL Server
inherited it from theSybase
codebase. The problem is that SQL Server needs to figure out which query will compute the aggregate.
I do not search for answers that only show how to circumvent it.
The questions are:
- Which part of standard disallow or interfere with it?
- Why other RDBMSes do not have problem with this kind of outer dependency?
- Do they extend
SQL Standard
andSQL Server
behaves as it should orSQL Server
does not implement it fully(correctly?)?.
I would be very grateful for references to:
ISO standard
(92 or newer)- SQL Server Standards Support
- official documenation from any RDBMS that explains it (
SQL Server/Postgresql/Oracle/...
).
EDIT:
I know that SQL-92
does not have concept of LATERAL
. But version with subqueries (like in SQLite/MySQL
) does not work too.
EDIT 2:
To simplify it a bit, let's check only correlated subquery only:
SELECT c.Customer, c."User", c.Revenue, 1.0*(SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END) FROM t c2 WHERE c."User" = c2."User") / (SELECT SUM(c2.Revenue) FROM t c2 WHERE c."User" = c2."User") AS running_percentage FROM t c ORDER BY "User", Revenue DESC;
The version above works fine in MySQL/SQLite/Postgresql
.
In SQL Server
we get error. After wraping it with subquery to "flatten" it to one level it works:
SELECT c.Customer, c."User", c.Revenue, 1.0 * ( SELECT SUM(CASE WHEN r1 >= r2 THEN r1 ELSE 0 END) FROM (SELECT c2.Revenue AS r1, c.Revenue r2 FROM t c2 WHERE c."User" = c2."User") AS S) / (SELECT SUM(c2.Revenue) FROM t c2 WHERE c."User" = c2."User") AS running_percentage FROM t c ORDER BY "User", Revenue DESC;
The point of this question is how does SQL standard
regulate it.
2 Answers
Answers 1
There is an easier solution:
SELECT c.Customer, c."User", c."Revenue", 1.0 * Revenue/ NULLIF(c2.sum_total, 0) AS percentage, 1.0 * c2.sum_running / NULLIF(c2.sum_total, 0) AS running_percentage FROM t c CROSS APPLY (SELECT SUM(c2.Revenue) AS sum_total, SUM(CASE WHEN c2.Revenue >= x.Revenue THEN c2.Revenue ELSE 0 END) as sum_running FROM t c2 CROSS JOIN (SELECT c.REVENUE) x WHERE c."User" = c2."User" ) c2 ORDER BY "User", Revenue DESC;
I am not sure why or if this limitation is in the SQL '92 standard. I did have it pretty well memorized 20 or so years ago, but I don't recall that particular limitation.
I should note:
- At the time of the SQL 92 standard, lateral joins were not really on the radar. Sybase definitely had no such concept.
- Other databases do have problems with outer references. In particular, they often limit the scoping to one level deep.
- The SQL Standard itself tends to highly political (that is, vendor-driven) rather than driven by actual database user requirements. Well, over time, it does move in the right direction.
Answers 2
There is no such limitation in the SQL standard for LATERAL
. CROSS APPLY
is a vendor-specific extension from Microsoft (Oracle adopted it later for compatibility) and its limitations are obviously not owed to the ANSI SQL standard, since the MS feature pre-dates the standard.
LATERAL
according to ANSI SQL is basically just a modifier for joins to allow lateral references in the join tree. There is no limit to the number of columns that can be referenced.
I wouldn't see a reason for the odd restriction to begin with. Maybe it's because CROSS APPLY
was originally intended to allow table-valued functions, which was later extended to allow sub-SELECT
s.
The Postgres manual explains LATERAL
like this:
The
LATERAL
key word can precede a sub-SELECT FROM
item. This allows the sub-SELECT
to refer to columns ofFROM
items that appear before it in theFROM
list. (WithoutLATERAL
, each sub-SELECT
is evaluated independently and so cannot cross-reference any otherFROM
item.)
The Postgres version of your query (without the more elegant window functions) can be simpler:
SELECT c.* , round(revenue / c2.sum_total, 2) END AS percentage , round(c2.running_sum / c2.sum_total, 2) END AS running_percentage FROM t c, LATERAL ( SELECT NULLIF(SUM(revenue), 0)::numeric AS sum_total -- NULLIF, cast once , SUM(revenue) FILTER (WHERE revenue >= c.revenue) AS running_sum FROM t WHERE "User" = c."User" ) c2 ORDER BY c."User", c.revenue DESC;
Postgres 9.4+ has the more elegant aggregate
FILTER
for conditional aggregates.That was backwards, I thought aboutNULLIF
was redundant.revenue
is definedNOT NULL
, the aggregates are guaranteed to find 1 or more rows, and theLATERAL
sub-SELECT
is joined in aCROSS JOIN
, sosum_total
cannot beNULL
.COALESCE
.NULLIF
makes sense, I only suggest a minor simplification.Cast
sum_total
tonumeric
once.Round result to match your desired result.
0 comments:
Post a Comment