Monday, August 28, 2017

Connecting to remote master on standalone Spark


I launch Spark in standalone mode on my remote server with the following steps:

  • cp spark-env.sh.template spark-env.sh
  • append SPARK_MASTER_HOST=IP_OF_MY_REMOTE_SERVER to spark-env.sh
  • start standalone mode by running sbin/start-master.sh and then sbin/start-slave.sh spark://IP_OF_MY_REMOTE_SERVER:7077

Then I try to connect to the remote master:

val spark = SparkSession.builder()
  .appName("SparkSample")
  .master("spark://IP_OF_MY_REMOTE_SERVER:7077")
  .getOrCreate()

I receive the following error:

ERROR SparkContext: Error initializing SparkContext.
java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries!

and warnings:

    WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
    .....
    WARN Utils: Service 'sparkMaster' could not bind on port 7092. Attempting port 7092.

3 Answers

Answer 1

I advise against submitting Spark jobs remotely by opening ports, because it can create security problems and, in my experience, is more trouble than it's worth, especially since you end up troubleshooting the communication layer.

Alternatives:

1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/

2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver

Similar Q&A: Submitting jobs to Spark EC2 cluster remotely

If you insist on connecting without libraries like Livy, then you must open the ports that Spark needs for communication. The Spark docs on network ports: http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security

Since you're not using YARN (per your Standalone design), the prior link to YARN remote submission may not be relevant.

Answer 2

The Spark documentation says:

spark.driver.port (default: random)
    Port for the driver to listen on. This is used for communicating with the executors and the standalone Master.

spark.port.maxRetries (default: 16)
    Maximum number of retries when binding to a port before giving up. When a port is given a specific value (non 0), each subsequent retry will increment the port used in the previous attempt by 1 before retrying. This essentially allows it to try a range of ports from the start port specified to port + maxRetries.
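As a rough illustration of these two settings, the session from the question could pin the driver port instead of leaving it random. This is only a sketch: the port 51000 is a hypothetical value chosen for the example, IP_OF_MY_REMOTE_SERVER is the placeholder from the question and must be replaced, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object RemoteMasterSample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSample")
      .master("spark://IP_OF_MY_REMOTE_SERVER:7077") // placeholder from the question
      // Hypothetical fixed port; choose one that your firewall allows.
      .config("spark.driver.port", "51000")
      // 16 is the documented default; with a fixed start port the driver
      // may try ports from 51000 up to 51000 + 16 before giving up.
      .config("spark.port.maxRetries", "16")
      .getOrCreate()

    // Print the port the driver ended up with, then shut down.
    println(spark.sparkContext.getConf.get("spark.driver.port"))
    spark.stop()
  }
}
```

With a fixed start port, the range of ports that has to be open between the driver machine and the cluster is known in advance, which makes firewall rules easier to write.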

Make sure that the Spark Master is actually running on the remote host on port 7077, and that the firewall allows connections to it.
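One quick way to check this from the client machine is a plain TCP connection attempt. The sketch below uses only the JDK socket API; the object name MasterPortCheck, the helper isReachable, and the 3-second timeout are made up for this example. A successful connection only shows that something is listening on the port, not that it is a healthy Spark Master.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object MasterPortCheck {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def isReachable(host: String, port: Int, timeoutMs: Int = 3000): Boolean = {
    val socket = new Socket()
    try {
      // connect throws on refusal, timeout, or an unresolvable host;
      // Try turns any of those into a plain false.
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Pass the master's address as the first argument.
    val host = args.headOption.getOrElse("IP_OF_MY_REMOTE_SERVER")
    println(s"Master at $host:7077 reachable: ${isReachable(host, 7077)}")
  }
}
```

If this prints false, the problem is most likely the firewall or the master process itself rather than the Spark configuration on the client.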

You also need to copy the core-site.xml file from your cluster to HADOOP_CONF_DIR, so that Spark can read the Hadoop settings, such as the IP address of your master. Read here for more.

Hope it helps!

Answer 3

The spark-jobserver seems very tempting but has some issues. I'd recommend the "hidden" Spark REST API instead. It's not documented, but it's very easy to use and much more comfortable than jobserver, which requires maintenance (one more thing to worry about and troubleshoot, and it has its own problems). There is also a good client library for it: https://github.com/ywilkof/spark-jobs-rest-client
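For a sense of what talking to that REST API directly might look like, here is a minimal sketch using only the JDK's HTTP client classes. Because the API is undocumented, everything specific here is an assumption recalled from community write-ups rather than an official reference: the port 6066, the /v1/submissions/create path, and the JSON field names should all be verified against your Spark version. The object and method names (RestSubmitSketch, createRequestJson, post) are hypothetical, and the inputs are assumed not to need JSON escaping.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

object RestSubmitSketch {
  // Builds the JSON body of a submission request (field names are assumptions).
  def createRequestJson(masterHost: String, jarPath: String,
                        mainClass: String, sparkVersion: String): String =
    s"""{
       |  "action": "CreateSubmissionRequest",
       |  "appResource": "$jarPath",
       |  "mainClass": "$mainClass",
       |  "appArgs": [],
       |  "clientSparkVersion": "$sparkVersion",
       |  "environmentVariables": {"SPARK_ENV_LOADED": "1"},
       |  "sparkProperties": {
       |    "spark.app.name": "SparkSample",
       |    "spark.master": "spark://$masterHost:6066",
       |    "spark.submit.deployMode": "cluster",
       |    "spark.jars": "$jarPath"
       |  }
       |}""".stripMargin

  // Sends a JSON body with HTTP POST and returns (status code, response text).
  def post(url: String, body: String): (Int, String) = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8")
      conn.setDoOutput(true)
      val out = conn.getOutputStream
      try out.write(body.getBytes(StandardCharsets.UTF_8)) finally out.close()
      val code = conn.getResponseCode
      val stream = if (code < 400) conn.getInputStream else conn.getErrorStream
      val text = if (stream == null) "" else {
        val src = Source.fromInputStream(stream, "UTF-8")
        try src.mkString finally src.close()
      }
      (code, text)
    } finally conn.disconnect()
  }
}
```

A call would then look something like RestSubmitSketch.post("http://IP_OF_MY_REMOTE_SERVER:6066/v1/submissions/create", json), where json comes from createRequestJson and the jar path must be readable from the cluster. The client library linked above wraps the same kind of request behind a friendlier interface.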
