I spent quite some time writing multiple SQL queries that were formerly used to fetch data for various R scripts. This is how it worked:
    sqlContent = readSQLFile("file1.sql")
    sqlContent = setSQLVariables(sqlContent, variables)
    results = executeSQL(sqlContent)
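setSQLVariables is our own helper; as a rough illustration, assuming the SQL files use simple named placeholders such as :START_DATE, it could look like this (the real implementation may differ):

    // Sketch of a placeholder-substitution helper; assumes placeholders
    // of the form ":NAME" and plain string replacement.
    def setSQLVariables(sql: String, variables: Map[String, String]): String =
      variables.foldLeft(sql) { case (query, (name, value)) =>
        query.replace(":" + name, value)
      }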
The catch is that for some queries a result from a prior query is required, which is why creating VIEWs in the database itself does not solve the problem. With Spark 2.0 I have already figured out a way to do just that:
    import scala.io.Source

    // create a DataFrame using a JDBC connection to the database
    val tableDf = spark.read.jdbc(...)

    var tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
    var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
    sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
    sqlQuery = sqlQuery.replace("OLD_TABLE_NAME", tempTableName)

    tableDf.createOrReplaceTempView(tempTableName)
    var data = spark.sql(sqlQuery)
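To make the dependency concrete, this is roughly how two queries can be chained, where the second SQL file is assumed to select from the temp view produced by the first (the file names are placeholders, and setSQLVariables / the JDBC temp view from above are assumed to be in place):

    // Run the first query and register its result so the next query can use it.
    val firstSql = setSQLVariables(
      Source.fromURL(getClass.getResource("/sql/query1.sql")).mkString, sqlVariables)
    val firstResult = spark.sql(firstSql)
    firstResult.createOrReplaceTempView("QUERY1_RESULT")

    // query2.sql is assumed to reference QUERY1_RESULT as a table name.
    val secondSql = setSQLVariables(
      Source.fromURL(getClass.getResource("/sql/query2.sql")).mkString, sqlVariables)
    val secondResult = spark.sql(secondSql)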
But this is, in my humble opinion, very fiddly. Also, more complex queries, e.g. queries that incorporate subquery factoring (WITH clauses), currently don't work. Is there a more robust way, such as re-implementing the SQL code as Spark SQL DataFrame operations using .filter($"..."), .select($"..."), etc.?
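For illustration, a re-implementation of a small query (JOIN + WHERE + SELECT) in the DataFrame API might look like the following; the table names, column names, and connection objects (jdbcUrl, connectionProps) are made up:

    import spark.implicits._

    // Hypothetical tables and columns, read over the same JDBC connection.
    val ordersDf = spark.read.jdbc(jdbcUrl, "ORDERS", connectionProps)
    val customersDf = spark.read.jdbc(jdbcUrl, "CUSTOMERS", connectionProps)

    // Equivalent of: SELECT o.ORDER_ID, c.NAME, o.AMOUNT
    //                FROM ORDERS o JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.ID
    //                WHERE o.ORDER_DATE >= '2016-01-01'
    val result = ordersDf
      .join(customersDf, ordersDf("CUSTOMER_ID") === customersDf("ID"))
      .filter($"ORDER_DATE" >= "2016-01-01")
      .select($"ORDER_ID", $"NAME", $"AMOUNT")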
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the results of one former SQL query (which always involves a few JOINs, WITH clauses, etc.). So n queries lead to n DataFrames.
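One possible way to express that, assuming the SQL files can be run in a fixed order and each result is registered as a temp view so that later queries can reference earlier results (file and view names are illustrative):

    import scala.io.Source
    import org.apache.spark.sql.DataFrame

    // Run the SQL files in order; keep one DataFrame per query and expose
    // each result as a temp view for the queries that follow.
    val sqlFiles = Seq("query1.sql", "query2.sql", "query3.sql")

    val dataFrames: Map[String, DataFrame] = sqlFiles.map { fileName =>
      val sql = setSQLVariables(
        Source.fromURL(getClass.getResource("/sql/" + fileName)).mkString, sqlVariables)
      val df = spark.sql(sql)
      df.createOrReplaceTempView(fileName.stripSuffix(".sql").toUpperCase)
      fileName -> df
    }.toMap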
Is there a better option than the two approaches described above?
Setup: Hadoop 2.7.3, Spark 2.0.0, IntelliJ IDEA 2016.2, Scala 2.11.8, test cluster on a Win7 workstation