I have a dataset with key value pairs like this
likes=dogs;hates=birds;likes=sports;eats=cheese I then turn it into json
{"likes": ["dogs","sports"], "hates": ["birds"], "eats": ["cheese"]} Is there a way I can keep this json datastructure without casting it to a string, so I can derive more columns from it on a row by row basis? I would like it to look something like this, without having to turn decode the json from string every column added.
Dataset<Row> df1 = df.withColumn("interests", callUDF("to_json", col("interests"))) .withColumn("likes", callUDF("extract_from_json", "likes", col("interests"))) .withColumn("hates", callUDF("extract_from_json", "hates", col("interests"))) .withColumn("hates", callUDF("extract_from_json", "eats", col("interests"))); 1 Answers
Answers 1
If you're working off the original file
likes=dogs;hates=birds;likes=sports;eats=cheese then you could read it in with sc.textFile then do some simple RDD manipulations.
val df = sc.textFile(file) .flatMap(x => x.split(";")) .map(x => (x.split("=")(0), x.split("=")(1))) .toDF("interest","value") df.withColumn("tmp",lit(1)).groupBy("tmp").pivot("interest").agg(collect_list("value")) +---+--------+-------+--------------+ |tmp| eats| hates| likes| +---+--------+-------+--------------+ | 1|[cheese]|[birds]|[dogs, sports]| +---+--------+-------+--------------+
0 comments:
Post a Comment