Tuesday, October 4, 2016

Scala & Spark: Recycling SQL statements


I spent quite some time coding multiple SQL queries that were formerly used to fetch the data for various R scripts. This is how it worked:

sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
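
For reference, a minimal sketch of what such helpers could look like in Scala; the function names follow the pseudocode above, while the "/sql/" classpath location and the ${...} placeholder syntax are assumptions:

import scala.io.Source

// read a SQL file from the classpath (hypothetical location)
def readSQLFile(fileName: String): String =
  Source.fromURL(getClass.getResource("/sql/" + fileName)).mkString

// substitute ${name} placeholders with their values (hypothetical syntax)
def setSQLVariables(sql: String, variables: Map[String, String]): String =
  variables.foldLeft(sql) { case (query, (key, value)) =>
    query.replace("${" + key + "}", value)
  }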

The catch is that for some queries the result of a prior query is required, which is why creating VIEWs in the database itself does not solve the problem. With Spark 2.0 I have already figured out a way to do just that:

import scala.io.Source

// create a DataFrame using a JDBC connection to the database
val tableDf = spark.read.jdbc(...)

// register it under a unique temporary name and point the query at that name
var tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME", tempTableName)
tableDf.createOrReplaceTempView(tempTableName)

var data = spark.sql(sqlQuery)

But this is, in my humble opinion, very fiddly. Also, more complex queries, e.g. queries that incorporate subquery factoring (WITH clauses), currently don't work. Is there a more robust way, such as re-implementing the SQL code in the Spark SQL DataFrame API using .filter($"..."), .select($"..."), etc.?
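
As an illustration of that alternative, here is a rough sketch of a simple SQL query re-expressed with the DataFrame API; spark, jdbcUrl and connectionProperties are assumed to be in scope, and the table and column names are made up for this example:

import org.apache.spark.sql.functions._
import spark.implicits._   // needed for the $"..." column syntax

// hypothetical source tables loaded over JDBC
val ordersDf    = spark.read.jdbc(jdbcUrl, "ORDERS", connectionProperties)
val customersDf = spark.read.jdbc(jdbcUrl, "CUSTOMERS", connectionProperties)

// roughly equivalent to:
//   SELECT c.NAME, SUM(o.AMOUNT) AS TOTAL
//   FROM ORDERS o JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.ID
//   WHERE o.STATUS = 'OPEN'
//   GROUP BY c.NAME
val resultDf = ordersDf
  .filter($"STATUS" === "OPEN")
  .join(customersDf, ordersDf("CUSTOMER_ID") === customersDf("ID"))
  .groupBy(customersDf("NAME"))
  .agg(sum($"AMOUNT").as("TOTAL"))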

The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the results of one former SQL query (each of which always involves a few JOINs, WITH clauses, etc.). So n queries lead to n DataFrames.
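
One way this could be generalised (a sketch, assuming spark is the active SparkSession, that the source tables have already been registered as temp views, and reusing the hypothetical setSQLVariables helper from above): run the SQL files in order, register each result as a temp view under its file's base name, and collect the n DataFrames so that later queries can reference earlier results.

import org.apache.spark.sql.DataFrame
import scala.io.Source

def runQueries(sqlFileNames: Seq[String],
               sqlVariables: Map[String, String]): Map[String, DataFrame] =
  sqlFileNames.map { fileName =>
    val rawQuery = Source.fromURL(getClass.getResource("/sql/" + fileName)).mkString
    val sqlQuery = setSQLVariables(rawQuery, sqlVariables)

    val df = spark.sql(sqlQuery)
    val viewName = fileName.stripSuffix(".sql").toUpperCase
    df.createOrReplaceTempView(viewName)   // visible to all subsequent queries
    viewName -> df
  }.toMap

Each entry of the returned map would correspond to one of the n former SQL queries, but whether this is really less fiddly than the temp-view renaming above is exactly the question.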

Is there a better option than the two approaches shown above?

Setup: Hadoop 2.7.3, Spark 2.0.0, IntelliJ IDEA 2016.2, Scala 2.11.8, test cluster on a Windows 7 workstation

