I am using datastax + spark integration and spark SQL thrift server, which gives me a Hive SQL interface to query the tables in Cassandra.
The tables in my database get dynamically created, what I want to do is get a count of null values in each column for the table based on just the table name.
I can get the column names using describe database.table
but in hive SQL, how do I use its output in another select query which counts null for all the columns.
Update 1: Traceback with Dudu's solution
Error running query: TExecuteStatementResp(status=TStatus(errorCode=0, errorMessage="org.apache.spark.sql.AnalysisException: Invalid usage of '*' in explode/json_tuple/UDTF;", sqlState=None, infoMessages=["org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException: Invalid usage of '' in explode/json_tuple/UDTF;:16:15", 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258', 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)
3 Answers
Answers 1
In the following solution there is no need to deal with each column separately. The result is a column index and the number of null values in that column.
You can later on join it by the column index to an information retrieved from the metastore.
One limitations is that strings containning the exact text null
will be counted as nulls.
Demo
The CTE (mytable
as defined by with mytable as
) can obviously be replaced by as actual table
with mytable as ( select stack ( 5 ,1 ,1.2 ,date '2017-06-21' ,null ,2 ,2.3 ,null ,null ,3 ,null ,null ,'hello' ,4 ,4.5 ,null ,'world' ,5 ,null ,date '2017-07-22' ,null ) as (id,amt,dt,txt) ) select pe.pos as col_index ,count(case when pe.val='null' then 1 end) as nulls_count from mytable t lateral view posexplode (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\\x01')) pe group by pe.pos ;
+-----------+-------------+ | col_index | nulls_count | +-----------+-------------+ | 0 | 0 | | 1 | 2 | | 2 | 3 | | 3 | 3 | +-----------+-------------+
Answers 2
Instead of describe database.table
, you can use
Select column_name from system_schema.columns where keyspace_name='YOUR KEYSPACE' and table_name='YOUR TABLE'
There is also a column called kind
in the above table with values like partition_key,clustering,regular
.
The columns with values as partition_key
and clustering
will not have null values.
For other columns you can use
select sum(CASE WHEN col1 is NULL THEN 1 ELSE 0 END) as col1_cnt,sum(CASE WHEN col2 is NULL THEN 1 ELSE 0 END) as col2_cnt from table1 where col1 is null;
You can also try below query (Not tried myself)
SELECT COUNT(*)-COUNT(col1) As A, COUNT(*)-COUNT(col2) As B, COUNT(*)-COUNT(col3) As C FROM YourTable;
May be for above query you can create variable for count instead of count(*) everytime.
Note: system_schema.columns
is cassandra table and cassandra user should have read permission to this table
Answers 3
You will have to count null values from each column separately. For example -
select count(*) from mytable where col1 is null; select count(*) from mytable where col2 is null;
0 comments:
Post a Comment