
Wednesday, June 28, 2017

Hive SQL: dynamically get null column counts from a table


I am using the DataStax + Spark integration and the Spark SQL Thrift server, which gives me a Hive SQL interface to query the tables in Cassandra.

The tables in my database are created dynamically. What I want to do is get a count of null values in each column of a table, based on just the table name.

I can get the column names using describe database.table, but in Hive SQL, how do I use its output in another select query that counts nulls for all the columns?

Update 1: Traceback with Dudu's solution

Error running query: TExecuteStatementResp(status=TStatus(errorCode=0, errorMessage="org.apache.spark.sql.AnalysisException: Invalid usage of '*' in explode/json_tuple/UDTF;", sqlState=None, infoMessages=["org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException: Invalid usage of '' in explode/json_tuple/UDTF;:16:15", 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258', 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)

3 Answers

Answer 1

In the following solution there is no need to deal with each column separately. The result is a column index and the number of null values in that column.
You can later on join it, by the column index, to information retrieved from the metastore.
One limitation is that strings containing the exact text null will be counted as nulls.

Demo

The CTE (mytable, as defined by with mytable as) can obviously be replaced by an actual table.

with mytable as
(
    select  stack
            (
                5
                ,1   ,1.2     ,date '2017-06-21'     ,null
                ,2   ,2.3     ,null                  ,null
                ,3   ,null    ,null                  ,'hello'
                ,4   ,4.5     ,null                  ,'world'
                ,5   ,null    ,date '2017-07-22'     ,null
            ) as (id,amt,dt,txt)
)
select      pe.pos                                          as col_index
           ,count(case when pe.val='null' then 1 end)       as nulls_count
from        mytable t
            lateral view posexplode
            (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\\x01')) pe
group by    pe.pos
;

+-----------+-------------+
| col_index | nulls_count |
+-----------+-------------+
|         0 |           0 |
|         1 |           2 |
|         2 |           3 |
|         3 |           3 |
+-----------+-------------+
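To put column names on those positional counts, the result can be joined to a numbered list of the table's columns, as mentioned above. A minimal sketch, assuming two hypothetical helper tables: null_counts(col_index, nulls_count) holding the result of the query above, and table_columns(col_index, col_name) holding the describe output with col_index following the table's column order.

-- Sketch only: null_counts and table_columns are hypothetical helper tables,
-- not produced automatically by the solution above.
select      c.col_name
           ,n.nulls_count
from        null_counts n
join        table_columns c
on          n.col_index = c.col_index;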

Answer 2

Instead of describe database.table, you can use

Select column_name from system_schema.columns where keyspace_name='YOUR KEYSPACE' and table_name='YOUR TABLE'

There is also a column called kind in the above table, with values like partition_key, clustering, and regular.

The columns whose kind is partition_key or clustering will not have null values.
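So, as a sketch, the columns that can actually contain nulls could be listed by filtering the same query on kind (keyspace and table names are placeholders, as above; in plain CQL this extra non-key filter may require ALLOW FILTERING, while through the Spark SQL interface it is just a normal predicate):

Select column_name
from   system_schema.columns
where  keyspace_name = 'YOUR KEYSPACE'
  and  table_name    = 'YOUR TABLE'
  and  kind          = 'regular'   -- partition_key and clustering columns are never null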

For other columns you can use

select sum(CASE WHEN col1 is NULL THEN 1 ELSE 0 END) as col1_cnt
      ,sum(CASE WHEN col2 is NULL THEN 1 ELSE 0 END) as col2_cnt
from table1;

You can also try the query below (not tried myself):

SELECT COUNT(*)-COUNT(col1) As A,
       COUNT(*)-COUNT(col2) As B,
       COUNT(*)-COUNT(col3) As C
FROM YourTable;

Maybe for the above query you can compute the total count once, in a variable, instead of running COUNT(*) every time.
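A sketch of that idea in plain HiveQL, assuming the same YourTable: compute the row total once in a subquery and reuse it for every column (COUNT(col) counts only non-null values, so total minus COUNT(col) is the null count).

-- Sketch: the cross join attaches the single-row total to every row;
-- some Spark versions may need spark.sql.crossJoin.enabled=true for this.
select  t.total - count(col1) as col1_nulls
       ,t.total - count(col2) as col2_nulls
       ,t.total - count(col3) as col3_nulls
from    YourTable
cross join (select count(*) as total from YourTable) t
group by t.total;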

Note: system_schema.columns is a Cassandra table, and the Cassandra user should have read permission on it.

Answer 3

You will have to count null values from each column separately. For example -

select count(*) from mytable where col1 is null;
select count(*) from mytable where col2 is null;

Friday, March 24, 2017

SQL query: frequency distribution matrix for products


I want to create a frequency distribution matrix.

1. Create a matrix. Is it possible to get this in separate columns?

   customer1    p1    p2    p3
   customer 2   p2    p3
   customer 3   p2    p3    p1
   customer 4   p2    p1

2. Then I have to count the number of products that come together the most.

   For example:
   p2 and p3 come together 3 times
   p1 and p3 come together 2 times
   p1 and p2 come together 2 times

I want to recommend products to customers based on the frequency of products that are bought together.

select customerId, product, count(*) from sales group by customerId, product

Can anyone please help me with a solution to this?

3 Answers

Answer 1

If you want pairs of products that customers purchase, then you can use a self join:

select s1.product, s2.product, count(*) as cnt
from sales s1 join
     sales s2
     on s1.customerId = s2.customerId
where s1.product < s2.product
group by s1.product, s2.product
order by cnt desc;

You can extend this to more than two products by using more joins.
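For instance, a sketch of the three-product version of the same query (still assuming the sales(customerId, product) table):

select s1.product, s2.product, s3.product, count(*) as cnt
from sales s1 join
     sales s2
     on s1.customerId = s2.customerId join
     sales s3
     on s1.customerId = s3.customerId
where s1.product < s2.product
  and s2.product < s3.product
group by s1.product, s2.product, s3.product
order by cnt desc;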

Answer 2

1.

Array

select      customerId
           ,sort_array(collect_set (product))   as products
from        sales
group by    customerId
;

Multiple columns

select      customerId
           ,products[0] as p0
           ,products[1] as p1
           ,products[2] as p2
           ,products[3] as p3
           ,products[4] as p4
           ,products[5] as p5
           ,products[6] as p6
           ,products[7] as p7
           ,products[8] as p8
           ,products[9] as p9
from       (select      customerId
                       ,sort_array(collect_set (product))   as products
            from        sales
            group by    customerId
           ) s
;

2.

This returns the frequency of each whole combination of products.
In your example (p1,p2,p3) is the most frequent (it appears twice).
(p1,p2) appears once and so does (p2,p3).
For the frequency of tuples, see @GordonLinoff's answer.

select      s.products
           ,count(*)    as frequency
from       (select      customerId
                       ,sort_array(collect_set (product))   as products
            from        sales
            group by    customerId
           ) s
group by    s.products
order by    frequency desc
;

Answer 3

I know cursors are not the flavor of the month anymore, but I still find them very useful from time to time.

The code below creates a table variable and then loops through the sales data to find out which combinations of products are sold together most frequently.

--Results will be loaded into this table
DECLARE @OutputCombination TABLE (CombinationName VARCHAR(MAX), CombinationsFound INT)

--Just for demo - create table and load with data
declare @demodata table (ClientID int, ProductBought varchar(50))
insert into @demodata (Clientid, ProductBought) values
(1,'Coke'),
(1,'Chips'),
(1,'Chips'),
(2,'Coke'),
(2,'Coke'),
(2,'Sweets'),
(3,'Coke'),
(4,'Chips'),
(5,'Coke'),
(5,'Chips'),
(6,'Coke'),
(7,'Coke')

DECLARE clients CURSOR READ_ONLY FOR SELECT DISTINCT clientID from @demodata

DECLARE @clientID INT
OPEN clients

FETCH NEXT FROM clients INTO @clientID
WHILE (@@fetch_status <> -1)
BEGIN
       DECLARE @ThisClientCombination VARCHAR(MAX) --This is going to be used to find/update combinations of products sold from the data - pipe-delimiting them
       SET @ThisClientCombination = '' --better to actually wipe variables in cursor loops, as cursors sometimes funny about things like that

       ;WITH thisClientsDistinctCoverages AS --CTE used because of the SQL-funnies caused when using DISTINCT and ORDER BY
       (
              SELECT DISTINCT TOP 100 PERCENT CoverageItems = ProductBought + '|' FROM @demodata WHERE ClientID = @clientID ORDER BY ProductBought + '|' --order alphabetically and do DISTINCT so that 2x cokes are changed into just 1 coke - for the sake of combos
       )
       SELECT @ThisClientCombination = @ThisClientCombination + CoverageItems FROM thisClientsDistinctCoverages

       SET @ThisClientCombination = LEFT(@ThisClientCombination,LEN(@ThisClientCombination)-1) --Drop trailing Pipe '|'

       IF EXISTS(SELECT * FROM @OutputCombination WHERE CombinationName = @ThisClientCombination)
              UPDATE @OutputCombination SET CombinationsFound = CombinationsFound + 1 WHERE CombinationName = @ThisClientCombination --found before, increase count by 1
       ELSE
              INSERT INTO @OutputCombination (CombinationName, CombinationsFound) VALUES (@ThisClientCombination, 1) --first time found, make entry 1

       FETCH NEXT FROM clients INTO @clientID
END

CLOSE clients
DEALLOCATE clients

--Show the results
SELECT * FROM @OutputCombination ORDER BY CombinationsFound DESC, CombinationName

This will yield the results:

CombinationName    CombinationsFound
Coke               3
Chips|Coke         2
Chips              1
Coke|Sweets        1

Interpretation of the above results:
Most often (3 times), a customer bought only Coke(s).
Next, twice a customer bought Chips and a Coke.
Once a customer bought only Chips.
And once a customer bought a Coke and Sweets.


Sunday, March 6, 2016

Looping using HiveQL


I'm trying to merge 2 datasets, say A and B. Dataset A has a variable "Flag" which takes 2 values. Rather than just merging both datasets together, I was trying to merge the 2 datasets based on the "Flag" variable.

The merging code is the following:

create table new_data as
select a.*, b.y
from A as a
left join B as b
on a.x = b.x

Since I'm running the Hive code through the CLI, I'm calling it with the following command:

hive -f new_data.hql 

The looping part of the code, which merges the data based on the "Flag" variable, is the following:

for flag in 1 2; do
  hive -hivevar flag=$flag -f new_data.hql
done

I put the above loop in another ".hql" file and called it with:

hive -f loop_data.hql 

But it's throwing an error:

cannot recognize input near 'for' 'flag' 'in'

Can anybody please tell me where I'm making a mistake?

Thanks!

1 Answer

Answer 1

1. You should add the loop logic to a shell script.

File Name: loop_data.sh

for flag in 1 2; do
  hive -hivevar flag=$flag -f new_data.hql
done

And execute the script like:

sh loop_data.sh 
2. In your new_data.hql script, you are creating the table. You should split the DDL and DML into 2 separate scripts, like the following (the where 1 = 0 predicate makes the CTAS create an empty table with the joined schema, which the loop then fills):

DDL: create_new_data.hql

create table new_data as
select
   a.*,
   b.y
from
   A as a
left join
   B as b
on
   a.x = b.x
where
   1 = 0;

DML: insert_new_data.hql

insert into new_data
select
   a.*,
   b.y
from
   A as a
left join
   B as b
on
   a.x = b.x
where
   flag = ${hiveconf:flag}

And update your shell script like:

File Name: loop_new_data.sh

# Create table
hive -f create_new_data.hql

# Insert data
for flag in 1 2; do
  hive -hiveconf flag=$flag -f insert_new_data.hql
done

And execute it like:

sh loop_new_data.sh 
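As a side note, the loop in the question used -hivevar rather than -hiveconf. Either works, as long as the variable reference in the script matches the namespace of the flag. A sketch of insert_new_data.hql if hive -hivevar flag=$flag is kept:

-- Sketch: variables passed with -hivevar are referenced through the hivevar namespace
insert into new_data
select a.*, b.y
from A as a
left join B as b
on a.x = b.x
where flag = ${hivevar:flag}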

Let me know if you want more info.
