Thursday, January 26, 2017

SparkSQL - The correlated scalar subquery can only contain equality predicates

Leave a Comment

I would like to execute the following query with Spark SQL 2.0

SELECT a.id as id, (SELECT SUM(b.points)    FROM tableB b    WHERE b.id = a.id AND b.date <= a.date) AS points FROM tableA a 

but I get the following error

The correlated scalar subquery can only contain equality predicates.

Any idea how can I rewrite the query or use operations between the two datframes tableA and tableB to make it working?

1 Answers

Answers 1

select a.id as id,  sum(b.points) as points  from a, b  where a.id = b.id  and b.date <= a.date  group by a.id  ; 

Skip the sub-select and group by id to ensure a one to one relationship between ids and the sum of b's points column.

Here's a 'down and dirty' example which I used:

select * from a ;  id|date 1|2017-01-22 17:59:49 2|2017-01-22 18:00:00 3|2017-01-22 18:00:05 4|2017-01-22 18:00:11 5|2017-01-22 18:00:15  select * from b ; id|points|date 1|12|2017-01-21 18:03:20 3|25|2017-01-21 18:03:37 5|17|2017-01-21 18:03:55 2|-1|2017-01-22 18:04:27 4|-4|2017-01-22 18:04:35 5|400|2017-01-20 18:17:31 5|-1000|2017-01-23 18:18:36 

Notice that b has three entries of id = 5, two before a.date and one after.

select a.id, sum(b.points) as points from a, b where a.id = b.id and b.date <= a.date group by a.id ; 1|12 3|25 5|417 

I also confirmed "group by" is supported: http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment