Thursday, November 16, 2017

org.apache.spark.sql.SQLContext not able to load file


I have a simple Spark job that reads values from a pipe-separated file, applies some business logic to them, and writes the processed values to our DB.

To load the file, I am using org.apache.spark.sql.SQLContext. This is the code I have to load the file as a DataFrame:

DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .option("comment", null)
        .option("delimiter", "|")
        .option("quote", null)
        .load(pathToTheFile);

Now the issue is:

1. The load function is not able to load the file.
2. It does not give much detail (no exception) about the problem, except that in my console I get

WARN  2017-11-07 17:26:40,108 akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@172.17.0.2:35359] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
ERROR 2017-11-07 17:26:40,134 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 0

and it keeps on polling.

I am sure the file is available in the expected folder and in the right format, but I have no idea what this log means or why SQLContext is not able to load the file.

Here is my build.gradle's dependencies section:

dependencies {
    provided(
            [group: 'org.apache.spark', name: 'spark-core_2.10', version: '1.4.0'],
            [group: 'org.apache.spark', name: 'spark-sql_2.10', version: '1.4.0'],
            [group: 'com.datastax.spark', name: 'spark-cassandra-connector-java_2.10', version: '1.4.0']
    )

    compile([
            dependencies.create(project(path: ':account-core')) {
                exclude group: 'org.springframework.boot'
                exclude group: 'com.fasterxml.jackson.jaxrs'
                exclude group: 'com.google.guava'
            },

            [group: 'com.databricks', name: 'spark-csv_2.10', version: '1.4.0'],
    ])
}

And I am running this job inside a Docker container.

Any help would be appreciated.

1 Answer

Answer 1

You can check whether your issue is the same as the one described in this thread:

Long story short, akka opens up dynamic, random ports for each job. So, simple NAT fails.
You might try some trickery with a DNS server and docker's --net=host.

Based on Jacob's suggestion, I started using --net=host, which is a new option in the latest version of Docker.
I also set SPARK_LOCAL_IP to the host's IP address; Akka then no longer uses the hostname, and the Spark driver's hostname does not need to be resolvable.
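If you would rather configure the address inside the job itself instead of exporting an environment variable, a minimal sketch along these lines should behave similarly. The PipeFileJob class name, the app name, and the HOST_IP variable are just placeholders, and spark.driver.host (the address the driver advertises to executors) is only a rough programmatic counterpart of SPARK_LOCAL_IP, which also controls the bind address:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

public class PipeFileJob {
    public static void main(String[] args) {
        // Assumption: HOST_IP is passed into the container,
        // e.g. docker run --net=host -e HOST_IP=<host address> ...
        String hostIp = System.getenv("HOST_IP");

        SparkConf conf = new SparkConf()
                .setAppName("pipe-file-loader")
                // Advertise the host's IP so executors can reach the driver;
                // rough counterpart of exporting SPARK_LOCAL_IP before submitting.
                .set("spark.driver.host", hostIp);

        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        // ... read the pipe-separated file as in the question ...
        sc.stop();
    }
}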

You can also compare your Dockerfile with the one used in P7h/docker-spark 2.2.0 to see if there are any differences that might explain the issue.
