Friday, March 9, 2018

Hive Query Efficiency

Leave a Comment

Could you help me with a Hive Query Efficiency problem? I have two queries working for the same problem. I just cannot figure out why one is much faster than the other. If you know please feel free to provide insight. Any info is welcomed!

Problem: I am trying to check the minimum value of a bunch of variables in a Hive parquet table.

Queries: I tried two two queries as follows:

query 1

drop table if exists tb_1 purge; create table if not exists tb_1 as select 'v1' as name, min(v1) as min_value from src_tb union all select 'v2' as name, min(v2) as min_value from src_tb union all select 'v3' as name, min(v3) as min_value from src_tb union all ... select 'v200' as name, min(v200) as min_value from src_tb ; 

query 2

drop table if exists tb_2 purge; create table if not exists tb_2 as select min(v1) as min_v1 , min(v2) as min_v2 , min(v3) as min_v3 ... , min(v200) as min_v200 from src_tb ; 

Result: Query 2 is much faster than query 1. It took probably 5 mins to finish the second query. I don't know how long will query 1 take. But after I submit the first query, it took a long time to even react to the query, by which I mean that usually after I submit a query, the system will start to analyze and provides some compiling information in the terminal. However, for my first query, after my submission, the system won't even react to this. So I just killed it.

What do you think? Thank you in advance.

3 Answers

Answers 1

Query execution time depends on environment that you execute it.

In MSSQL.

Some people like you think query execution is similar to algorithm that they see in some theoretical resources, but in practical situation, it depends on other things.

For example both of your queries have SELECT statement that perform on a table and at first glance, they need to read all rows, but database server must analyze the statement to determine the most efficient way to extract the requested data. This is referred to as optimizing the SELECT statement. The component that does this is called the Query Optimizer. The input to the Query Optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the Query Optimizer is a query execution plan, sometimes referred to as a query plan or just a plan. (Please see this for more information about query-processing architecture)

You can see execution plan in MSSQL by reading this article and I think you will understand better by seeing execution plan for both of your queries.

Edit (Hive)

Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query 

A Hive query gets converted into a sequence of stages. The description of the stages itself shows a sequence of operators with the metadata associated with the operators.

Please see LanguageManual Explain for more information.

Answers 2

What is surprising? The first query has to read src_tb a total of 200 times. The second reads the data once and performs 200 aggregations. It is a no brainer that it is faster.

Answers 3

As mentioned, source table is in parquet. Hive leverages columnar approach, but the latency caused while executing your query1 is due to scheduling delay.

Here's the breakdown.

Query 1 :

Since It is a Union All operation of 200 queries, when you run EXPLAIN on your query1, you should probably see roughly 202-205 stages in the execution plan.However resource manager will keep you waiting until it finds resources to schedule roughly about 200+ stages, in some cases application master may schedule/run multiple stages at once.Even though application master tries to run multiple stages in parallel, the scheduling delay is relatively higher compared to query2. This was the reason why you didn't see any response on your hive client when you submitted query 1. Coming columnar approach, In query1 it is definitely an optimization since you are selecting a single column and your source table is in parquet, but the scheduling delay out weighs in you case.

Query 2:

As opposed to query 1, when you run EXPLAIN on the query2 you may see 3 stages, if not fewer than 5 stages.However there is no scheduling delay in this case.This is the reason why you are seeing the results quick.Moving on to columnar format, in case you are selecting most of the columns in your source table, column/row orientation may not make any difference. Otherwise it is optimal as opposed to row oriented.

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment