Wednesday, June 28, 2017

Hive Sql dynamically get null column counts from a table

Leave a Comment

I am using datastax + spark integration and spark SQL thrift server, which gives me a Hive SQL interface to query the tables in Cassandra.

The tables in my database get dynamically created, what I want to do is get a count of null values in each column for the table based on just the table name.

I can get the column names using describe database.table but in hive SQL, how do I use its output in another select query which counts null for all the columns.

Update 1: Traceback with Dudu's solution

Error running query: TExecuteStatementResp(status=TStatus(errorCode=0, errorMessage="org.apache.spark.sql.AnalysisException: Invalid usage of '*' in explode/json_tuple/UDTF;", sqlState=None, infoMessages=["org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException: Invalid usage of '' in explode/json_tuple/UDTF;:16:15", 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258', 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)

3 Answers

Answers 1

In the following solution there is no need to deal with each column separately. The result is a column index and the number of null values in that column.
You can later on join it by the column index to an information retrieved from the metastore.
One limitations is that strings containning the exact text null will be counted as nulls.

Demo

The CTE (mytable as defined by with mytable as) can obviously be replaced by as actual table

with        mytable as              (                 select  stack                         (                             5                             ,1   ,1.2     ,date '2017-06-21'     ,null                            ,2   ,2.3     ,null                  ,null                            ,3   ,null    ,null                  ,'hello'                            ,4   ,4.5     ,null                  ,'world'                            ,5   ,null    ,date '2017-07-22'     ,null                         ) as (id,amt,dt,txt)             )  select      pe.pos                                          as col_index            ,count(case when pe.val='null' then 1 end)       as nulls_count  from        mytable t lateral view posexplode (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\\x01')) pe  group by    pe.pos        ; 

+-----------+-------------+ | col_index | nulls_count | +-----------+-------------+ |         0 |           0 | |         1 |           2 | |         2 |           3 | |         3 |           3 | +-----------+-------------+ 

Answers 2

Instead of describe database.table, you can use

Select column_name from system_schema.columns where keyspace_name='YOUR KEYSPACE' and table_name='YOUR TABLE'

There is also a column called kind in the above table with values like partition_key,clustering,regular.

The columns with values as partition_key and clustering will not have null values.

For other columns you can use

select sum(CASE WHEN col1 is NULL THEN 1 ELSE 0 END) as col1_cnt,sum(CASE WHEN col2 is NULL THEN 1 ELSE 0 END) as col2_cnt from table1 where col1 is null;

You can also try below query (Not tried myself)

SELECT COUNT(*)-COUNT(col1) As A, COUNT(*)-COUNT(col2) As B, COUNT(*)-COUNT(col3) As C FROM YourTable;  

May be for above query you can create variable for count instead of count(*) everytime.

Note: system_schema.columns is cassandra table and cassandra user should have read permission to this table

Answers 3

You will have to count null values from each column separately. For example -

select count(*) from mytable where col1 is null; select count(*) from mytable where col2 is null; 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment