I have a simple Spark job that reads values from a pipe-separated file, applies some business logic to them, and writes the processed values to our DB.
To load the file, I am using org.apache.spark.sql.SQLContext. This is the code I have to load the file as a DataFrame:
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .option("comment", null)
        .option("delimiter", "|")
        .option("quote", null)
        .load(pathToTheFile);
Now the issue is:

1. The load function is not able to load the file.
2. It does not give much detail (no exception) about the problem; all I get in my console is
WARN 2017-11-07 17:26:40,108 akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@172.17.0.2:35359] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
ERROR 2017-11-07 17:26:40,134 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
and it keeps on polling.
I am sure the file is available in the expected folder and in the right format. But I have no idea what this log means or why SQLContext was not able to load the file.
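In case it helps to reproduce, here is a minimal, self-contained sketch of the same load, trimmed to the essential options (the quote/comment options are dropped just to keep it small) and runnable with a local master outside the cluster; LocalCsvCheck is just a scratch class name:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LocalCsvCheck {
    public static void main(String[] args) {
        // local[*] runs the driver and executors in one JVM, so no Akka
        // remoting between containers is involved.
        SparkConf conf = new SparkConf()
                .setAppName("local-csv-check")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "false")
                .option("delimiter", "|")
                .load(args[0]); // path to the pipe-separated file

        df.show(5); // forces an actual read of a few rows
        sc.stop();
    }
}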
Here is my build.gradle's dependencies section:
dependencies {
    provided(
        [group: 'org.apache.spark', name: 'spark-core_2.10', version: '1.4.0'],
        [group: 'org.apache.spark', name: 'spark-sql_2.10', version: '1.4.0'],
        [group: 'com.datastax.spark', name: 'spark-cassandra-connector-java_2.10', version: '1.4.0']
    )
    compile([
        dependencies.create(project(path: ':account-core')) {
            exclude group: 'org.springframework.boot'
            exclude group: 'com.fasterxml.jackson.jaxrs'
            exclude group: 'com.google.guava'
        },
        [group: 'com.databricks', name: 'spark-csv_2.10', version: '1.4.0'],
    ])
}
And I am running this job inside a Docker container.

Any help would be appreciated.
1 Answer

Answer 1
You can check whether your issue is the same as the one described in this thread:
Long story short, Akka opens up dynamic, random ports for each job, so simple NAT fails. You might try some trickery with a DNS server and docker's --net=host.

Based on Jacob's suggestion, I started using --net=host, which is a new option in the latest version of Docker. I also set SPARK_LOCAL_IP to the host's IP address; then Akka does not use the hostname, and I don't need the Spark driver's hostname to be resolvable.
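To make that concrete, here is a rough sketch of the corresponding driver-side configuration (my own example, not from the linked thread; hostIp is a placeholder for the Docker host's routable IP, and the port values are arbitrary, but the conf keys are the documented Spark 1.x ones):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DockerNetConf {
    public static void main(String[] args) {
        // Placeholder: the Docker host's routable IP, i.e. the same value
        // you would export as SPARK_LOCAL_IP before starting the JVM.
        String hostIp = args[0];

        SparkConf conf = new SparkConf()
                .setAppName("csv-loader")
                // Advertise an address the executors can actually reach,
                // instead of the container's internal hostname.
                .set("spark.driver.host", hostIp)
                // Optional: pin the normally random driver/BlockManager
                // ports so they can be mapped explicitly if --net=host is
                // not an option (the values here are arbitrary).
                .set("spark.driver.port", "7078")
                .set("spark.blockManager.port", "7079");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... submit the CSV-loading job as before ...
        sc.stop();
    }
}

Note that SPARK_LOCAL_IP itself is an environment variable, so it still has to be set in the container's environment (for example in the Dockerfile, or via docker run -e) rather than in SparkConf.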
You can also compare your Dockerfile with the one used in P7h/docker-spark 2.2.0 to see if there are any differences that might explain the issue.