Thursday, November 9, 2017

How would I temporarily store a column as a json object to derive other columns?

Leave a Comment

I have a dataset with key value pairs like this

likes=dogs;hates=birds;likes=sports;eats=cheese 

I then turn it into json

{"likes": ["dogs","sports"], "hates": ["birds"], "eats": ["cheese"]} 

Is there a way I can keep this json datastructure without casting it to a string, so I can derive more columns from it on a row by row basis? I would like it to look something like this, without having to turn decode the json from string every column added.

        Dataset<Row> df1 = df.withColumn("interests", callUDF("to_json", col("interests")))                          .withColumn("likes", callUDF("extract_from_json", "likes", col("interests")))                          .withColumn("hates", callUDF("extract_from_json", "hates", col("interests")))                          .withColumn("hates", callUDF("extract_from_json", "eats", col("interests"))); 

1 Answers

Answers 1

If you're working off the original file

likes=dogs;hates=birds;likes=sports;eats=cheese 

then you could read it in with sc.textFile then do some simple RDD manipulations.

val df = sc.textFile(file)   .flatMap(x => x.split(";"))   .map(x => (x.split("=")(0), x.split("=")(1)))   .toDF("interest","value")  df.withColumn("tmp",lit(1)).groupBy("tmp").pivot("interest").agg(collect_list("value"))  +---+--------+-------+--------------+ |tmp|    eats|  hates|         likes| +---+--------+-------+--------------+ |  1|[cheese]|[birds]|[dogs, sports]| +---+--------+-------+--------------+ 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment