Tuesday, November 7, 2017

Are table statistics of any use prior to Spark 2.2?

Leave a Comment

Spark 2.2 introduced cost-based optimization (CBO, https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html) which makes use of table statistics (as computed by ANALYZE TABLE COMPUTE STATISTICS....)

My question is: Are precomputed statistics also useful prior to Spark 2.2 (in my case 2.1) operating on (external hive) tables? Do statistics influence the optimizer? If yes, can I also compute the statistics in Impala instead of Hive?

UPDATE:

The only hint I have found so far is https://issues.apache.org/jira/browse/SPARK-15365

Apparently statistics are used to decide whether a broadcast-join is done are not

0 Answers

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment