
Sunday, June 24, 2018

Unable to increase Max Application Master Resources


I am using the uhopper/hadoop Docker image to create a YARN cluster. I have 3 nodes with 64 GB RAM per node, and I have given 32 GB per node to YARN, so the total cluster memory is 96 GB. I have added the following configuration:

- name: YARN_CONF_yarn_scheduler_minimum___allocation___mb
  value: "2048"
- name: YARN_CONF_yarn_scheduler_maximum___allocation___mb
  value: "16384"
- name: MAPRED_CONF_mapreduce_framework_name
  value: "yarn"
- name: MAPRED_CONF_mapreduce_map_memory_mb
  value: "8192"
- name: MAPRED_CONF_mapreduce_reduce_memory_mb
  value: "8192"
- name: MAPRED_CONF_mapreduce_map_java_opts
  value: "-Xmx8192m"
- name: MAPRED_CONF_mapreduce_reduce_java_opts
  value: "-Xmx8192m"
- name: YARN_CONF_yarn_nodemanager_resource_memory___mb
  value: "32768"

Max Application Master Resources is 10240 MB. I ran 5 Spark jobs, each with 3 GB of driver memory; 2 of the jobs never reached the RUNNING state because of the 10240 MB limit. I am unable to fully utilize my hardware.


How can I increase the Max Application Master Resources memory?


2 Answers

Answer 1

I hope I have found an answer: if you change yarn.scheduler.capacity.maximum-am-resource-percent, then Max Application Master Resources will change. Here's the documentation: Setting Application Limits from docs.hortonworks.com.

Let me know if it worked.

Answer 2

To change the Max Application Master Resources, you have to change the percentage yarn.scheduler.capacity.maximum-am-resource-percent, which is 0.2 by default, meaning 20% of the memory allocated to YARN.

If I am not wrong, the total memory given to YARN here is 10240 MB (10 GB), and if the maximum percentage the application master can use is 20%, that makes the memory allocated to AMs 2 GB.

Now, if you want to allocate more memory to your application masters, simply increase the percentage. It is recommended, though, that the AM percentage not exceed 0.5. Hope this makes it clear.
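As a sketch of where this setting lives (the 0.5 value is only an illustration, and the surrounding capacity-scheduler.xml entries are omitted):

```xml
<!-- capacity-scheduler.xml: share of cluster resources that may be used
     by Application Masters; 0.2 (20%) is the default, 0.5 here is illustrative -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
```

Refreshing the queues (for example with yarn rmadmin -refreshQueues) should apply the change without restarting the ResourceManager.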


Saturday, May 26, 2018

Spark Driver memory and Application Master memory


Am I understanding the documentation for client mode correctly?

  1. Is client mode opposed to cluster mode, where the driver runs within the application master?
  2. In client mode, the driver and application master are separate processes; must spark.driver.memory + spark.yarn.am.memory therefore be less than the machine's memory?
  3. In client mode, is the driver memory not included in the application master memory setting?

2 Answers

Answer 1

Is client mode opposed to cluster mode, where the driver runs within the application master?

Yes. When a Spark application is deployed over YARN in:

  • Client mode, the driver runs on the machine where the application was submitted, and that machine has to remain available on the network until the application completes.
  • Cluster mode, the driver runs on the application master node (one AM per Spark application), and the machine submitting the application does not need to stay on the network after submission.

(Image: Client mode)

(Image: Cluster mode)

If a Spark application is submitted in cluster mode to Spark's own resource manager (standalone), then the driver will run on one of the worker nodes.

References for images and content:

In client mode, the driver and application master are separate processes; must spark.driver.memory + spark.yarn.am.memory therefore be less than the machine's memory?

No. In client mode, the driver and AM are separate processes and can live on different machines, so the two memory settings need not be combined; however, spark.yarn.am.memory plus some overhead must be less than the YARN container memory (yarn.nodemanager.resource.memory-mb). If it exceeds this, YARN's ResourceManager will kill the container.

In client mode, is the driver memory not included in the application master memory setting?

Here spark.driver.memory must be less than the available memory on the machine from which the Spark application is launched.

In cluster mode, use spark.driver.memory instead of spark.yarn.am.memory.

spark.yarn.am.memory : 512m (default)

Amount of memory to use for the YARN Application Master in client mode, in the same format as JVM memory strings (e.g. 512m, 2g). In cluster mode, use spark.driver.memory instead. Use lower-case suffixes, e.g. k, m, g, t, and p, for kibi-, mebi-, gibi-, tebi-, and pebibytes, respectively.

Check more about these properties here
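To make the split concrete, a minimal client-mode submission might look like this (the class name, jar, and memory values are placeholders, not recommendations):

```shell
# Client mode: --driver-memory is taken from the local submitting machine,
# while spark.yarn.am.memory sizes the separate AM container on YARN.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 4g \
  --conf spark.yarn.am.memory=1g \
  --class com.example.Main \
  app.jar
```

In cluster mode, spark.yarn.am.memory is ignored and --driver-memory sizes the AM container instead, as the quoted documentation notes.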

Answer 2

In client mode, the driver is launched directly within spark-submit, i.e. the client program. The application master is created on one of the nodes in the cluster. spark.driver.memory (+ memory overhead) must be less than the machine's memory.

In cluster mode, the driver runs inside the application master on one of the nodes in the cluster.

https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/


Tuesday, November 7, 2017

Re Run Spark Jobs On Failure or Abort


Gents, I'm looking for a configuration or parameter that automatically restarts Spark jobs submitted via YARN in case of any failure. I know tasks auto-restart on failure; I am specifically looking for a YARN or Spark configuration that would trigger a re-run of the whole job.

Right now, if any of our jobs aborts due to any issue, we have to restart it manually, which causes a long data queue to process, as these jobs are designed to work in near real-time.

Current configurations:

#!/bin/bash

export SPARK_MAJOR_VERSION=2

# Minimum TODOs on a per job basis:
# 1. define name, application jar path, main class, queue and log4j-yarn.properties path
# 2. remove properties not applicable to your Spark version (Spark 1.x vs. Spark 2.x)
# 3. tweak num_executors, executor_memory (+ overhead), and backpressure settings

# the two most important settings:
num_executors=6
executor_memory=32g

# 3-5 cores per executor is a good default balancing HDFS client throughput vs. JVM overhead
# see http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
executor_cores=2

# backpressure
reciever_minRate=1
receiver_max_rate=10
receiver_initial_rate=10

/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --master yarn --deploy-mode cluster \
  --name br1_warid_ccn_sms_production \
  --class com.spark.main \
  --driver-memory 16g \
  --num-executors ${num_executors} --executor-cores ${executor_cores} --executor-memory ${executor_memory} \
  --queue default \
  --files log4j-yarn-warid-br1-ccn-sms.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-yarn-warid-br1-ccn-sms.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-yarn-warid-br1-ccn-sms.properties" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer `# Kryo Serializer is much faster than the default Java Serializer` \
  --conf spark.kryoserializer.buffer.max=1g \
  --conf spark.locality.wait=30 \
  --conf spark.task.maxFailures=8 `# Increase max task failures before failing job (Default: 4)` \
  --conf spark.ui.killEnabled=true `# Prevent killing of stages and corresponding jobs from the Spark UI` \
  --conf spark.logConf=true `# Log Spark Configuration in driver log for troubleshooting` \
`# SPARK STREAMING CONFIGURATION` \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.default.parallelism=32 \
  --conf spark.streaming.blockInterval=200 `# [Optional] Tweak to balance data processing parallelism vs. task scheduling overhead (Default: 200ms)` \
  --conf spark.streaming.receiver.writeAheadLog.enable=true `# Prevent data loss on driver recovery` \
  --conf spark.streaming.backpressure.enabled=false \
  --conf spark.streaming.kafka.maxRatePerPartition=${receiver_max_rate} `# [Spark 1.x]: Corresponding max rate setting for Direct Kafka Streaming (Default: not set)` \
`# YARN CONFIGURATION` \
  --conf spark.yarn.driver.memoryOverhead=4096 `# [Optional] Set if --driver-memory < 5GB` \
  --conf spark.yarn.executor.memoryOverhead=4096 `# [Optional] Set if --executor-memory < 10GB` \
  --conf spark.yarn.maxAppAttempts=4 `# Increase max application master attempts (needs to be <= yarn.resourcemanager.am.max-attempts in YARN, which defaults to 2) (Default: yarn.resourcemanager.am.max-attempts)` \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h `# Attempt counter considers only the last hour (Default: (none))` \
  --conf spark.yarn.max.executor.failures=$((8 * ${num_executors})) `# Increase max executor failures (Default: max(numExecutors * 2, 3))` \
  --conf spark.yarn.executor.failuresValidityInterval=1h `# Executor failure counter considers only the last hour` \
  --conf spark.task.maxFailures=8 \
  --conf spark.speculation=false \
  /home//runscripts/production.jar

Note: There are a couple of questions on this subject, but they do not have accepted answers, or the answers deviate from the expected solution: "Running a Spark application on YARN, without spark-submit" and "How to configure automatic restart of the application driver on Yarn".

This question explores the possible solutions from the scope of YARN and Spark, so it may not be treated as a duplicate.

2 Answers

Answer 1

Just a thought!

Let us call the script file (containing the above script) as run_spark_job.sh.

Try adding these statements at the end of the script:

return_code=$?

if [[ ${return_code} -ne 0 ]]; then
    echo "Job failed"
    exit ${return_code}
fi

echo "Job succeeded"
exit 0

Let us have another script file, spark_job_runner.sh, from which we call the above script. For example:

./run_spark_job.sh
while [ $? -ne 0 ]; do
    ./run_spark_job.sh
done

YARN-based approaches: Update 1: This link is a good read. It discusses using the YARN REST API to submit and track applications: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html

Update 2: This link shows how to submit spark application to YARN environment using Java: https://github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/how-to-submit-spark-job-to-yarn-from-java-code.md

Spark-based programmatic approach:

How to use the programmatic spark submit capability

Spark based configuration approach for YARN:

The only Spark parameter in YARN mode for restarting is spark.yarn.maxAppAttempts, and it should not exceed the YARN ResourceManager parameter yarn.resourcemanager.am.max-attempts.

Excerpt from the official documentation https://spark.apache.org/docs/latest/running-on-yarn.html

The maximum number of attempts that will be made to submit the application.

Answer 2

In YARN mode you can set yarn.resourcemanager.am.max-attempts, which defaults to 2, to re-run a failed job; you can increase it as much as you want. Alternatively, you can use Spark's spark.yarn.maxAppAttempts configuration for the same purpose.
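Combined, a submission that opts into retries might look like this (the values and jar name are illustrative; remember the YARN cluster-wide cap must be at least as large):

```shell
# spark.yarn.maxAppAttempts must not exceed yarn.resourcemanager.am.max-attempts,
# which is set cluster-wide in yarn-site.xml (default 2).
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  --class com.spark.main \
  production.jar
```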


Wednesday, August 16, 2017

How to run a Kafka connect worker in YARN?


I'm playing with Kafka-Connect. I've got the HDFS connector working both in stand-alone mode and distributed mode.

They advertise that the workers (which are responsible for running the connectors) can be managed via YARN. However, I haven't seen any documentation that describes how to achieve this goal.

I don't know much about YARN either. How do I go about getting YARN to execute workers? If there is no specific approach, are there generic how-tos about how to get an application to run within YARN?

I already know what YARN is; I've used it with Spark via spark-submit. However, I cannot figure out how to get the connector to run in YARN.

0 Answers


Thursday, August 10, 2017

Do exit codes and exit statuses mean anything in spark?


I see exit codes and exit statuses all the time when running spark on yarn:

Here are a few:

  • CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

  • ...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with exitCode: 10...

  • ...Exit status: 143. Diagnostics: Container killed on request

  • ...Container exited with a non-zero exit code 52:...

  • ...Container killed on request. Exit code is 137...

I have never found any of these messages useful... Is there any chance of interpreting what actually goes wrong from them? I have searched high and low for a table explaining the errors, but found nothing.

The ONLY one I am able to decipher from those above is exit code 52, but that's because I looked at the source code here. It says that it is an OOM.

Should I stop trying to interpret the rest of these exit codes and exit statuses? Or am I missing some obvious way that these numbers actually mean something?

Even if someone could tell me the difference between exit code, exit status, and SIGNAL, that would be useful. But I am just randomly guessing right now, and it seems everyone else around me who uses Spark is, too.

1 Answer

Answer 1

Neither exit codes and statuses nor signals are Spark-specific; they are part of the way processes work on Unix-like systems.

Exit status and exit code

Exit status and exit code are different names for the same thing. An exit status is a number between 0 and 255 that indicates the outcome of a process after it terminates. Exit status 0 usually indicates success. The meaning of the other codes is program-dependent and should be described in the program's documentation. There are some established standard codes, though; see this answer for a comprehensive list.
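A quick way to see the convention in action from any POSIX shell (the subshell and the status value 3 are made up for the demo):

```shell
# $? holds the exit status of the most recent command.
( exit 3 )
echo $?    # prints 3

true
echo $?    # prints 0 (success)
```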

Exit codes used by Spark

In the Spark sources I found the following exit codes. Their descriptions are taken from log statements and comments in the code and from my understanding of the code where the exit status appeared.

Spark SQL CLI Driver in Hive Thrift Server

  • 3: if an UnsupportedEncodingException occurred when setting up stdout and stderr streams.

Spark/Yarn

  • 10: if an uncaught exception occurred
  • 11: if more than spark.yarn.scheduler.reporterThread.maxFailures executor failures occurred
  • 12: if the reporter thread failed with an exception
  • 13: if the program terminated before the user had initialized the spark context or if the spark context did not initialize before a timeout.
  • 14: This is declared as EXIT_SECURITY but never used
  • 15: if a user class threw an exception
  • 16: if the shutdown hook was called before the final status was reported. A comment in the source code explains the expected behaviour of user applications:

    The default state of ApplicationMaster is failed if it is invoked by shut down hook. This behavior is different compared to 1.x version. If user application is exited ahead of time by calling System.exit(N), here mark this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call System.exit(0) to terminate the application.

Executors

  • 50: The default uncaught exception handler was reached
  • 51: The default uncaught exception handler was called and an exception was encountered while logging the exception
  • 52: The default uncaught exception handler was reached, and the uncaught exception was an OutOfMemoryError
  • 53: DiskStore failed to create local temporary directory after many attempts (bad spark.local.dir?)
  • 54: ExternalBlockStore failed to initialize after many attempts
  • 55: ExternalBlockStore failed to create a local temporary directory after many attempts
  • 56: Executor is unable to send heartbeats to the driver more than "spark.executor.heartbeat.maxFailures" times.

  • 101: Returned by spark-submit if the main class of the launch environment was not found (not the main user class!)

Exit codes greater than 128

These exit codes most likely result from a program shutdown triggered by a Unix signal. The signal number can be calculated by subtracting 128 from the exit code. This is explained in more detail in this blog post (which was originally linked in this question). There is also a good answer explaining JVM-generated exit codes. Spark works with this assumption, as explained in a comment in ExecutorExitCodes.scala.
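The subtraction is easy to check in a shell, using the exit codes 137 and 143 from the question above:

```shell
# For exit codes above 128: signal number = exit code - 128.
for code in 137 143; do
  sig=$((code - 128))
  printf '%d -> signal %d (%s)\n' "$code" "$sig" "$(kill -l "$sig")"
done
# 137 -> signal 9 (KILL)
# 143 -> signal 15 (TERM)
```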

Other exit codes

Apart from the exit codes listed above, there are a number of System.exit() calls in the Spark sources setting 1 or -1 as the exit code. As far as I can tell, -1 seems to be used to indicate missing or incorrect command line parameters, while 1 indicates all other errors.

Signals

Signals are a kind of event that allows system messages to be sent to a process. These messages are used, for instance, to ask a process to reload its configuration (SIGHUP) or to terminate (SIGKILL). A list of standard signals can be found in the signal(7) man page, in the section Standard Signals.
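As a small demonstration of the mechanism (a sleep stands in for a YARN container process), terminating a process with SIGTERM yields exactly the 143 seen in the diagnostics above:

```shell
# Start a throwaway process, send it SIGTERM, and collect its exit status.
sleep 30 &
pid=$!
kill -TERM "$pid"
wait "$pid"
echo $?    # prints 143 = 128 + 15 (SIGTERM)
```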

As explained by Rick Moritz in the comments below (thank you!), the most likely sources of signals in a Spark setup are

  • the cluster resource manager: when the container size is exceeded, the job has finished, a dynamic scale-down was made, or a job was aborted by the user
  • the operating system: as part of a controlled system shut down or if some resource limit was hit (out of memory, over hard quota, no space left on disk etc.)
  • a local user who killed a job

I hope this makes it a bit clearer what these messages from Spark might mean.


Saturday, June 17, 2017

How to get execution DAG from spark web UI after job has finished running, when I am running spark on YARN?


I frequently do analysis of the DAG of my spark job while it is running. But, it is annoying to have to sit and watch the application while it is running in order to see the DAG.

So, I tried to view the DAG using the Spark history server, which I know should help me see past jobs. I'm easily able to access port 18080, and I can see the history server UI.

But, it doesn't show me any information related to the spark program's execution. I know I have the history server running, because when I do sudo service --status-all I see

spark history-server is running [ OK ]

So I already tried what this question suggested: here.

I think this is because I'm running Spark on YARN, and it can only use one resource manager at a time? Maybe?

So, how do I see the Spark execution DAG *after* a job has finished, specifically when running YARN as my resource manager?

2 Answers

Answer 1

Running only the history server is not sufficient to get the execution DAG of previous jobs. You need to configure your jobs to store event logs for all previous jobs.

Run the Spark history server with ./sbin/start-history-server.sh

Enable event logging for the Spark job:

spark.eventLog.enabled  true
spark.eventLog.dir      file:/path to event log (local or hdfs)

Add these to the spark-defaults.conf file.

Answer 2

As mentioned in Monitoring and Instrumentation, we need the following three parameters to be set in spark-defaults.conf:

spark.eventLog.enabled
spark.eventLog.dir
spark.history.fs.logDirectory

The first property should be true

spark.eventLog.enabled           true 

The second and the third properties should point to the event-log locations, which can be either on the local file system or on HDFS. The second property defines where Spark jobs store their event logs, and the third property tells the history server where to find the logs it displays in the web UI at port 18080.

If you choose the Linux local file system (/opt/spark/spark-events)
Either

spark.eventLog.dir               file:/opt/spark/spark-events
spark.history.fs.logDirectory    file:/opt/spark/spark-events

Or

spark.eventLog.dir               file:///opt/spark/spark-events
spark.history.fs.logDirectory    file:///opt/spark/spark-events

should work

If you choose the HDFS file system (/spark-events)
Either

spark.eventLog.dir               hdfs:/spark-events
spark.history.fs.logDirectory    hdfs:/spark-events

Or

spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events

Or

spark.eventLog.dir               hdfs://masterIp:9090/spark-events
spark.history.fs.logDirectory    hdfs://masterIp:9090/spark-events

should work, where masterIp:9090 comes from the fs.default.name property in core-site.xml of the Hadoop configuration.

The Apache Spark history server can be started by

$SPARK_HOME/sbin/start-history-server.sh 

A third-party Spark history server, for example Cloudera's, can be started by

sudo service spark-history-server start 

And to stop the history server (for Apache)

$SPARK_HOME/sbin/stop-history-server.sh 

Or (for Cloudera)

sudo service spark-history-server stop 

Sunday, May 14, 2017

Not able to invoke a spark application using a java class on cluster


Below is my project's structure:

spark-application:

scala1.scala // I am calling the java class from this class.

java.java // this will submit another spark application to the yarn cluster.

The spark-application that is being triggered by java class:

scala2.scala

My reference tutorial is here

When I run my java class from scala1.scala via spark-submit in local mode, the second Spark application scala2.scala is triggered and works as expected.

But when I run the same application via spark-submit on the YARN cluster, it shows the error below!

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
Application application_1493671618562_0072 failed 5 times due to AM Container for appattempt_1493671618562_0072_000005 exited with exitCode: 1
For more detailed output, check the application tracking page: http://headnode.internal.cloudapp.net:8088/cluster/app/application_1493671618562_0072 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e02_1493671618562_0072_05_000001
Exit code: 1
Exception message: /mnt/resource/hadoop/yarn/local/usercache/helixuser/appcache/application_1493671618562_0072/container_e02_1493671618562_0072_05_000001/launch_container.sh: line 26: $PWD:$PWD/spark_conf:$PWD/spark.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/:/usr/hdp/current/hadoop-client/lib/:/usr/hdp/current/hadoop-hdfs-client/:/usr/hdp/current/hadoop-hdfs-client/lib/:/usr/hdp/current/hadoop-yarn-client/:/usr/hdp/current/hadoop-yarn-client/lib/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/resource/hadoop/yarn/local/usercache/helixuser/appcache/application_1493671618562_0072/container_e02_1493671618562_0072_05_000001/launch_container.sh: line 26: $PWD:$PWD/spark_conf:$PWD/spark.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/:/usr/hdp/current/hadoop-client/lib/:/usr/hdp/current/hadoop-hdfs-client/:/usr/hdp/current/hadoop-hdfs-client/lib/:/usr/hdp/current/hadoop-yarn-client/:/usr/hdp/current/hadoop-yarn-client/lib/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
	at org.apache.hadoop.util.Shell.run(Shell.java:844)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.

My project's directory structure is given below:

lrwxrwxrwx 1 yarn hadoop   95 May  5 06:03 __app__.jar -> /mnt/resource/hadoop/yarn/local/filecache/10/sparkfiller-1.0-SNAPSHOT-jar-with-dependencies.jar -rw-r--r-- 1 yarn hadoop   74 May  5 06:03 container_tokens -rwx------ 1 yarn hadoop  710 May  5 06:03 default_container_executor_session.sh -rwx------ 1 yarn hadoop  764 May  5 06:03 default_container_executor.sh -rwx------ 1 yarn hadoop 6433 May  5 06:03 launch_container.sh lrwxrwxrwx 1 yarn hadoop  102 May  5 06:03 __spark_conf__ -> /mnt/resource/hadoop/yarn/local/usercache/helixuser/filecache/80/__spark_conf__6125877397366945561.zip lrwxrwxrwx 1 yarn hadoop  125 May  5 06:03 __spark__.jar -> /mnt/resource/hadoop/yarn/local/usercache/helixuser/filecache/81/spark-assembly-1.6.3.2.5.4.0-121-hadoop2.7.3.2.5.4.0-121.jar drwx--x--- 2 yarn hadoop 4096 May  5 06:03 tmp find -L . -maxdepth 5 -ls: 3933556      4 drwx--x---   3 yarn     hadoop       4096 May  5 06:03 . 3933558      4 drwx--x---   2 yarn     hadoop       4096 May  5 06:03 ./tmp 3933562      4 -rw-r--r--   1 yarn     hadoop         60 May  5 06:03 ./.launch_container.sh.crc 3933517 185944 -r-x------   1 yarn     hadoop   190402950 May  5 06:03 ./__spark__.jar 3933564      4 -rw-r--r--   1 yarn     hadoop          16 May  5 06:03 ./.default_container_executor_session.sh.crc 3933518      4 drwx------   2 yarn     hadoop        4096 May  5 06:03 ./__spark_conf__ 3933548      4 -r-x------   1 yarn     hadoop         945 May  5 06:03 ./__spark_conf__/taskcontroller.cfg 3933543      4 -r-x------   1 yarn     hadoop         249 May  5 06:03 ./__spark_conf__/slaves 3933541      4 -r-x------   1 yarn     hadoop        2316 May  5 06:03 ./__spark_conf__/ssl-client.xml.example 3933520      4 -r-x------   1 yarn     hadoop        1734 May  5 06:03 ./__spark_conf__/log4j.properties 3933526      4 -r-x------   1 yarn     hadoop         265 May  5 06:03 ./__spark_conf__/hadoop-metrics2-azure-file-system.properties 3933536      4 -r-x------   1 yarn     hadoop        1045 
May  5 06:03 ./__spark_conf__/container-executor.cfg 3933519      8 -r-x------   1 yarn     hadoop        5685 May  5 06:03 ./__spark_conf__/hadoop-env.sh 3933531      4 -r-x------   1 yarn     hadoop        2358 May  5 06:03 ./__spark_conf__/topology_script.py 3933547      8 -r-x------   1 yarn     hadoop        4113 May  5 06:03 ./__spark_conf__/mapred-queues.xml.template 3933528      4 -r-x------   1 yarn     hadoop         744 May  5 06:03 ./__spark_conf__/ssl-client.xml 3933544      4 -r-x------   1 yarn     hadoop         417 May  5 06:03 ./__spark_conf__/topology_mappings.data 3933549      4 -r-x------   1 yarn     hadoop         342 May  5 06:03 ./__spark_conf__/__spark_conf__.properties 3933523      4 -r-x------   1 yarn     hadoop         247 May  5 06:03 ./__spark_conf__/hadoop-metrics2-adl-file-system.properties 3933535      4 -r-x------   1 yarn     hadoop        1020 May  5 06:03 ./__spark_conf__/commons-logging.properties 3933525     24 -r-x------   1 yarn     hadoop       22138 May  5 06:03 ./__spark_conf__/yarn-site.xml 3933529      4 -r-x------   1 yarn     hadoop        2450 May  5 06:03 ./__spark_conf__/capacity-scheduler.xml 3933538      4 -r-x------   1 yarn     hadoop        2490 May  5 06:03 ./__spark_conf__/hadoop-metrics.properties  3933534     12 -r-x------   1 yarn     hadoop        8754 May  5 06:03 ./__spark_conf__/hdfs-site.xml  3933533      8 -r-x------   1 yarn     hadoop        4261 May  5 06:03 ./__spark_conf__/yarn-env.sh  3933532      4 -r-x------   1 yarn     hadoop        1335 May  5 06:03 ./__spark_conf__/configuration.xsl  3933530      4 -r-x------   1 yarn     hadoop         758 May  5 06:03 ./__spark_conf__/mapred-site.xml.template  3933545      4 -r-x------   1 yarn     hadoop        1000 May  5 06:03 ./__spark_conf__/ssl-server.xml  3933527      8 -r-x------   1 yarn     hadoop        4680 May  5 06:03 ./__spark_conf__/core-site.xml  3933522      8 -r-x------   1 yarn     hadoop        5783 May  5 06:03 
./__spark_conf__/hadoop-metrics2.properties  3933542      4 -r-x------   1 yarn     hadoop        1308 May  5 06:03 ./__spark_conf__/hadoop-policy.xml  3933540      4 -r-x------   1 yarn     hadoop        1602 May  5 06:03 ./__spark_conf__/health_check  3933537      8 -r-x------   1 yarn     hadoop        4221 May  5 06:03 ./__spark_conf__/task-log4j.properties  3933521      8 -r-x------   1 yarn     hadoop        7596 May  5 06:03 ./__spark_conf__/mapred-site.xml  3933546      4 -r-x------   1 yarn     hadoop        2697 May  5 06:03 ./__spark_conf__/ssl-server.xml.example  3933539      4 -r-x------   1 yarn     hadoop         752 May  5 06:03 ./__spark_conf__/mapred-env.sh  3932820 135852 -r-xr-xr-x   1 yarn     hadoop   139105807 May  4 22:53 ./__app__.jar  3933566      4 -rw-r--r--   1 yarn     hadoop          16 May  5 06:03 ./.default_container_executor.sh.crc  3933563      4 -rwx------   1 yarn     hadoop         710 May  5 06:03 ./default_container_executor_session.sh  3933559      4 -rw-r--r--   1 yarn     hadoop          74 May  5 06:03 ./container_tokens 3933565      4 -rwx------   1 yarn     hadoop         764 May  5 06:03 ./default_container_executor.sh 3933560      4 -rw-r--r--   1 yarn     hadoop          12 May  5 06:03 ./.container_tokens.crc 3933561      8 -rwx------   1 yarn     hadoop        6433 May  5 06:03 ./launch_container.sh broken symlinks(find -L . -maxdepth 5 -type l -ls): 

Below is the java code that invokes the second Spark application:

import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;

public class CallingSparkJob {

    public void submitJob(String latestreceivedpitrL, String newPtr) throws Exception {
        System.out.println("In submit job method");
        try {
            System.out.println("Building a spark command");
            // prepare arguments to be passed to org.apache.spark.deploy.yarn.Client
            String[] args = new String[] {
                // the name of your application
                "--name", "name",
                "--conf", "spark.yarn.executor.memoryOverhead=600",
                "--conf", "spark.yarn.submit.waitAppCompletion=false",
                // memory for driver (optional)
                "--driver-memory", "1000M",
                "--num-executors", "2",
                "--executor-cores", "2",
                // path to your application's JAR file (required in yarn-cluster mode)
                "--jar", "wasb://storage_account_container@storageaccount.blob.core.windows.net/user/ankushuser/sparkfiller/sparkfiller-1.0-SNAPSHOT-jar-with-dependencies.jar",
                // name of your application's main class (required)
                "--class", "com.test.SparkFiller",
                // comma separated list of local jars that you want SparkContext.addJar to work with
                // "--addJars", "...",
                // argument 1: latestreceivedpitrL
                "--arg", latestreceivedpitrL,
                // argument 2: newPtr
                "--arg", newPtr,
                "--arg", "yarn-cluster"
            };

            System.out.println("create a Hadoop Configuration object");
            Configuration config = new Configuration();

            // identify that you will be using Spark as YARN mode
            System.setProperty("SPARK_YARN_MODE", "true");

            // create an instance of SparkConf object
            SparkConf sparkConf = new SparkConf();
            sparkConf.setSparkHome("/usr/hdp/current/spark-client");
            sparkConf.setMaster("yarn-cluster");

            // create ClientArguments, which will be passed to Client
            ClientArguments cArgs = new ClientArguments(args, sparkConf);

            // create an instance of yarn Client
            Client client = new Client(cArgs, config, sparkConf);

            // submit Spark job to YARN
            client.run();
        } catch (Exception e) {
            System.out.println("Error submitting spark Job");
            System.out.println(e.getMessage());
        }
    }
}

The spark-submit command used to run the first spark application locally:

spark-submit --class scala1 --master yarn --deploy-mode cluster --num-executors 2 --executor-cores 2 --conf spark.yarn.executor.memoryOverhead=600 --conf spark.yarn.submit.waitAppCompletion=false /home/ankushuser/kafka_retry/kafka_retry_test/sparkflightaware/target/sparkflightaware-0.0.1-SNAPSHOT-jar-with-dependencies.jar

If I run this spark-submit command locally, it invokes the java class, and the spark-submit command for the second scala2 application works fine.

If I run it in yarn mode, then I am facing the issue.

Thank you for your help.

1 Answer

Answers 1

Since there's a bounty, I'll repost this as a reply as well -- but in reality I would like to flag this as a duplicate, since the actual Exception is the one covered in another question, and answered:

It is caused by hdp.version not getting substituted correctly. You have to set hdp.version in the file java-opts under $SPARK_HOME/conf.

Alternatively, use

--driver-java-options="-Dhdp.version=INSERT_VERSION_STRING_HERE" --conf "spark.executor.extraJavaOptions=-Dhdp.version=INSERT_VERSION_STRING_HERE" in your spark-submit and make sure to use the correct version string, as in the subdirectory of /usr/hdp.

If you want to stick with calling client.submit from your code, then you need to put those lines into the --arg you build in your code.

Read More

Wednesday, May 10, 2017

How to solve yarn container sizing issue on spark?

Leave a Comment

I want to launch some pyspark jobs on YARN. I have 2 nodes, with 10 GB each. I am able to open up the pyspark shell like so: pyspark

Now when I have a very simple example that I try to launch:

import random

NUM_SAMPLES = 1000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(xrange(0, NUM_SAMPLES)) \
          .filter(inside).count()
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)

I get as a result a very long spark log with the error output. The most important information is:

ERROR cluster.YarnScheduler: Lost executor 1 on <ip>: Container marked as failed: <containerID> on host: <ip>. Exit status 1.
Diagnostics: Exception from container-launch.
......

Later on in the logs I see:

ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 1 times: aborting job
INFO cluster.YarnClientSchedulerBackend: Asked to remove non-existent executor 1
INFO spark.ExecutorAllocationManager: Existing executor 1 has been removed (new total is 0)

From what I'm gathering from the logs above, this seems to be a container sizing issue in yarn.

My yarn-site.xml file has the following settings:

yarn.scheduler.maximum-allocation-mb = 10240
yarn.nodemanager.resource.memory-mb = 10240

and spark-defaults.conf contains:

spark.yarn.executor.memoryOverhead=2048
spark.driver.memory=3g

If there are any other settings you'd like to know about, please let me know.

How do I set the container size in yarn appropriately?
(bounty on the way for someone who can help me with this)

1 Answer

Answers 1

Let me first explain the basic set of properties required to tune your spark application on a YARN cluster.

Note: A container in YARN is equivalent to an executor in Spark. For understandability, you can consider them the same.

On yarn-site.xml:

yarn.nodemanager.resource.memory-mb is the total memory available to the cluster from a given node.

yarn.nodemanager.resource.cpu-vcores is the total number of CPU vcores available to the cluster from a given node.

yarn.scheduler.maximum-allocation-mb is the maximum memory in MB that can be allocated per yarn container.

yarn.scheduler.maximum-allocation-vcores is the maximum number of vcores that can be allocated per yarn container.

Example: If a node has 16GB and 8 vcores and you would like to contribute 14GB and 6 vcores to the cluster (for containers), then set the properties as shown below:

yarn.nodemanager.resource.memory-mb : 14336 (14GB)

yarn.nodemanager.resource.cpu-vcores : 6

And, to create containers with 2GB and 1vcore each, set these properties:

yarn.scheduler.maximum-allocation-mb : 2049

yarn.scheduler.maximum-allocation-vcores : 1

Note: Even though there is enough memory (14GB) to create 7 containers of 2GB each, the above config will only create 6 containers of 2GB, and only 12GB of the 14GB will be utilized by the cluster. This is because there are only 6 vcores available to the cluster.
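
The arithmetic behind that note can be sketched in a few lines (a Python illustration of the example's numbers, not any YARN API):

```python
# Containers per node are limited by the scarcer of memory and vcores.
node_mem_mb, node_vcores = 14336, 6
container_mem_mb, container_vcores = 2048, 1

by_memory = node_mem_mb // container_mem_mb   # 7 containers fit by memory
by_vcores = node_vcores // container_vcores   # 6 containers fit by vcores
containers = min(by_memory, by_vcores)        # vcores are the bottleneck

print(containers)                     # 6
print(containers * container_mem_mb)  # 12288 MB of 14336 MB used
```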

Now on Spark side,

Below properties specify memory to be requested per executor/container

spark.driver.memory

spark.executor.memory

Below properties specify vcores to be requested per executor/container

spark.driver.cores

spark.executor.cores

IMP: All of Spark's memory and vcore properties should be less than or equal to what the YARN configuration allows.

Below property specifies the total number of executors/containers that can be used for your spark application from the YARN cluster.

spark.executor.instances

This property should be less than the total number of containers available in the YARN cluster.

Once the YARN configuration is complete, Spark should request containers that can be allocated based on the YARN configuration. That means if YARN is configured to allocate a maximum of 2GB per container and Spark requests a container with 3GB of memory, then the job will either halt or fail because YARN cannot satisfy Spark's request.
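
That admission check can be sketched as a tiny helper (a hypothetical function for illustration, not part of Spark or YARN):

```python
def yarn_can_allocate(request_mb, request_vcores, max_mb, max_vcores):
    """True if one container request fits within YARN's per-container caps."""
    return request_mb <= max_mb and request_vcores <= max_vcores

# YARN capped at 2 GB per container; Spark asking for 3 GB is refused.
print(yarn_can_allocate(3072, 1, max_mb=2048, max_vcores=1))  # False
print(yarn_can_allocate(1536, 1, max_mb=2048, max_vcores=1))  # True
```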

Now for your use case: Usually, cluster tuning is based on the workloads. But below config should be more suitable.

Memory available: 10GB * 2 nodes
Vcores available: 5 * 2 vcores [Assumption]

On yarn-site.xml [In both the nodes]

yarn.nodemanager.resource.memory-mb : 10240

yarn.nodemanager.resource.cpu-vcores : 5

yarn.scheduler.maximum-allocation-mb : 2049

yarn.scheduler.maximum-allocation-vcores : 1

Using the above config, you can create a maximum of 5 containers on each node (10 across the cluster), each having 2GB and 1 vcore.

Spark config

spark.driver.memory 1536mb

spark.yarn.driver.memoryOverhead 512mb

spark.executor.memory 1536mb

spark.yarn.executor.memoryOverhead 512mb

spark.driver.cores 1

spark.executor.cores 1

spark.executor.instances 9

Please feel free to play around these configurations to suit your needs.

Read More

Friday, May 5, 2017

How to display yarn globally installed packages?

Leave a Comment

I am using macOS Sierra 10.12.4. I installed yarn with brew install yarn, and its version is v0.23.2.

I installed angular-cli, bower and ionic using yarn global add <package-name>

Then I use yarn global ls to display globally installed packages, expecting to see the packages installed above, but yarn gives me this:

$ yarn global ls
yarn global v0.23.2
warning No license field
✨  Done in 0.99s.

Then I check yarn global bin and get the path /Users/myusername/.config/yarn/bin, and when I go to that directory I see these symlinks:

lrwxr-xr-x  1 myusername  staff    38B 19 Apr 10:17 bower -> ../global/node_modules/bower/bin/bower
lrwxr-xr-x  1 myusername  staff    42B 19 Apr 10:21 cordova -> ../global/node_modules/cordova/bin/cordova
lrwxr-xr-x  1 myusername  staff    38B 19 Apr 10:20 ionic -> ../global/node_modules/ionic/bin/ionic
lrwxr-xr-x  1 myusername  staff    41B 19 Apr 10:15 ng -> ../global/node_modules/angular-cli/bin/ng

Apparently all packages were installed and saved under /Users/myusername/.config/yarn/global/node_modules

I searched the following thread: https://github.com/yarnpkg/yarn/issues/2446

I tried appending the paths below, but it still does not work:

YARN_BIN=$HOME/.config/yarn/bin  # `yarn global bin` result
export PATH=$YARN_BIN:$PATH
export PATH=$PATH:$HOME/.config/yarn/global/node_modules/.bin

Can anyone help? What should I do and how to display the globally installed packages?

1 Answer

Answers 1

yarn global list is currently broken, too. See the related issue.

Currently, I list the content of Yarn's global packages folder directly:

  • Windows: %LOCALAPPDATA%/Yarn/config/global
  • OSX and Linux non-root: ~/.config/yarn/global
  • Linux if logged in as root: /usr/local/share/.config/yarn/global
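
Listing the node_modules folder inside that directory shows one entry per globally installed package. A minimal sketch (the demo directory below is a stand-in for the real ~/.config/yarn/global/node_modules from the question):

```shell
# Demo directory standing in for ~/.config/yarn/global/node_modules
GLOBAL_DIR="${TMPDIR:-/tmp}/yarn-global-demo/node_modules"
mkdir -p "$GLOBAL_DIR/angular-cli" "$GLOBAL_DIR/bower" "$GLOBAL_DIR/ionic"

# One directory per globally installed package
ls -1 "$GLOBAL_DIR"
```

On a real machine, point GLOBAL_DIR at the actual global folder for your platform from the list above.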
Read More

Tuesday, April 11, 2017

Why does my pyspark just hang as ACCEPTED in yarn when I launch it?

Leave a Comment

I just spun up a new AWS instance in Linux. And, I installed pyspark on it. It has spark 1.6.

I'm running pyspark with yarn. When I do the command pyspark in the terminal, it launches initially, but then I get the message:

dd/mm/YY HH:MM:SS INFO yarn.Client: Application report for application_XXXXXXXXXXX_XXXX (state: ACCEPTED) 

...and then this just continues forever.

So, I checked yarn to see if anything else was running:

yarn application -list 

And it ONLY shows my application. How do I open up the pyspark shell and get my application to start RUNNING rather than just being ACCEPTED?

3 Answers

Answers 1

Can you try to run spark-shell and see if that goes into running state or not?

This happens when YARN doesn't have the resources you requested from it.

Example: Let's say YARN has 5GB of free memory available and you are requesting 10GB. Your job would be stuck in the ACCEPTED phase till it gets the requested memory.
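
The effect can be sketched in two lines of Python (numbers from the example above; purely illustrative):

```python
# A request larger than YARN's free memory stays ACCEPTED and never runs.
free_mb = 5 * 1024        # memory YARN currently has available
requested_mb = 10 * 1024  # memory the application asked for

state = "RUNNING" if requested_mb <= free_mb else "ACCEPTED"
print(state)  # ACCEPTED
```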

Answers 2

Adding to Grover's answer, you can set spark.dynamicAllocation.enabled and yarn.scheduler.fair.preemption to true to get your job started as soon as possible.

Answers 3

This problem is with resources, or with the queues.

Please set all of these options in yarn-site.xml so that you have enough resources on your cluster: yarn.scheduler.maximum-allocation-mb, yarn.scheduler.maximum-allocation-vcores, yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores

Also, you might hit a bug/problem with queues, which can be resolved by setting queueMaxAMShareDefault to -1.0 in fair-scheduler.xml (on the node with the Resource Manager) if you use the fair scheduler, and restarting the Resource Manager.
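
For reference, the fair-scheduler.xml change mentioned above might look like the fragment below (a sketch; any other allocations in your file stay as they are):

```xml
<allocations>
  <!-- -1.0 disables the cap on the share of a queue that Application Masters may use -->
  <queueMaxAMShareDefault>-1.0</queueMaxAMShareDefault>
</allocations>
```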

Read More

Wednesday, March 29, 2017

Yarn slave nodes are not communicating with master node?

Leave a Comment

I am not able to see my nodes when I do yarn node -list, even though I have configured /etc/hadoop/conf/yarn-site.xml with the correct properties (it seems to me, at least according to this question Slave nodes not in Yarn ResourceManager).

Here's what I've done so far:

  • installed resourcemanager on the master
  • installed nodemanager on the slaves
  • checked yarn-site.xml for this on ALL the nodes:

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>master-node</value>
    </property>

  • after modifying the config file, restarted resourcemanager and nodemanager on the master and slaves, respectively.

But yet when I do yarn node -list I only see

Total Nodes: 0
Node-Id       Node-state    Node-Http-Address      Number-of-Running-Containers

At my nodes, I looked at the .out files in /var/log/hadoop-yarn/ and I see this in them:

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 244592
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

EDIT: when I look at the .log files I see the following, but I'm not sure how to fix it:

INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state STARTED; cause:
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.IllegalArgumentException: Does not contain a valid host:port authority: <master node ip>:8020:8031 (configuration property 'yarn.resourcemanager.resource-tracker.address')
Caused by: java.lang.IllegalArgumentException: Does not contain a valid host:port authority: <master node ip>:8020:8031 (configuration property 'yarn.resourcemanager.resource-tracker.address')

How do I connect my slave nodes to my master node?

3 Answers

Answers 1

The value set for yarn.resourcemanager.hostname acts as the base value for all the ResourceManager properties. The property yarn.resourcemanager.resource-tracker.address defaults to the value of ${yarn.resourcemanager.hostname}:8031. Refer to yarn-default.xml for the complete list of default YARN configurations.

And from the nodemanager ERROR log,

Caused by: java.lang.IllegalArgumentException: Does not contain a valid host:port authority: <master node ip>:8020:8031 (configuration property 'yarn.resourcemanager.resource-tracker.address') 

It looks like the yarn.resourcemanager.hostname property is configured incorrectly as <master node ip>:8020 instead of <master node ip> on the slave nodes.

Edit the yarn-site.xml on all the nodes to have

<property>
   <name>yarn.resourcemanager.hostname</name>
   <value>master_node</value> <!-- IP address or hostname of the node where the Resource Manager is started; omit the port number -->
</property>

Finally, restart the YARN services.

Answers 2

Please set all of these properties and try again:

<property>
  <name>yarn.resourcemanager.address</name>
  <value>master_node:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master_node:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master_node:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master_node:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master_node:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.https.address</name>
  <value>master_node:8090</value>
</property>

Answers 3

You need to set an IP for the yarn.resourcemanager.hostname property. If you want to use the hostname, your machine needs to know which IP that hostname points to, so you need to add a host entry in the /etc/hosts file.

To do that,

  1. Open terminal

  2. Type vim /etc/hosts and hit enter

  3. Add this line at end of the file (use key i to enable insertion)

    <your resourcemanager ip><space><your hostname>

    example: 192.168.1.23 master-node
  4. Save the file by typing <Esc> + :wq

  5. Restart the nodemanager

I recommend using a management tool such as Ambari for these kinds of tasks. It allows easy modification of the configuration of a Hadoop environment at any time, whereas manual work always has a greater chance of error.

Read More

Friday, January 27, 2017

Hadoop truncated/inconsistent counter name

Leave a Comment

For now I have a Hadoop job which creates counters with pretty long names, for example: stats.counters.server-name.job.job-name.mapper.site.site-name.qualifier.qualifier-name.super-long-string-which-is-not-within-standard-limits. This counter is truncated on the web interface and on the getName() method call. I've found out that Hadoop has a limitation on the maximum counter name length, and the setting mapreduce.job.counters.counter.name.max configures this limit. So I increased this to 500, and the web interface now shows the full counter name. But getName() on the counter still returns the truncated name.
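
For reference, the setting mentioned above would typically be raised in mapred-site.xml like this (a sketch mirroring the value from the question):

```xml
<property>
  <name>mapreduce.job.counters.counter.name.max</name>
  <value>500</value>
</property>
```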

Could somebody please explain this or point out my mistake? Thank you.

EDIT 1

My Hadoop setup consists of a single server running HDFS, YARN, and MapReduce. During the map-reduce job some counters are incremented, and after the job completes I fetch the counters in ToolRunner using org.apache.hadoop.mapreduce.Job#getCounters.

EDIT 2

Hadoop version is the following:

Hadoop 2.6.0-cdh5.8.0
Subversion http://github.com/cloudera/hadoop -r 042da8b868a212c843bcbf3594519dd26e816e79
Compiled by jenkins on 2016-07-12T22:55Z
Compiled with protoc 2.5.0
From source with checksum 2b6c319ecc19f118d6e1c823175717b5
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.8.0.jar

I did some additional investigation, and it seems that this issue describes a situation similar to mine. But it's pretty confusing, because I'm able to increase the number of counters but not the length of a counter's name...

EDIT 3

Today I spent pretty much time debugging internals of the hadoop. Some interesting stuff:

  1. The org.apache.hadoop.mapred.ClientServiceDelegate#getJobCounters method returns a bunch of counters from YARN with TRUNCATED names and FULL display names.
  2. I was unable to debug the mappers and reducers themselves, but with the help of logging it seems that the org.apache.hadoop.mapreduce.Counter#getName method works correctly during reducer execution.

0 Answers

Read More

Monday, April 4, 2016

Spark: YARN token can't be found in cache

Leave a Comment

Using Spark under EMR I'm running into a curious error when trying to get pyspark (or spark-shell) to boot up. I can load my jars onto HDFS and get the application master started on Spark, but it crashes soon after booting up trying to contact the YARN resource manager. The resulting error is strange:

WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (owner= �����*, renewer=, realUser=, issueDate=0, maxDate=0, sequenceNumber=0, masterKeyId=0) can't be found in cache
ERROR yarn.ApplicationMaster: Uncaught exception: org.apache.hadoop.security.token.SecretManager$InvalidToken: token (owner= �����*, renewer=, realUser=, issueDate=0, maxDate=0, sequenceNumber=0, masterKeyId=0) can't be found in cache
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    ...

This is despite trying to disable YARN authentication in every way I can think of (e.g.):

security.authorization=false
security.authentication=simple

It appears that the Hadoop RPC proxy is pulling some garbage data from the environment, but I'm at a loss for where it's coming from. Any ideas for how to definitively disable authentication or to determine where the garbage data is coming from?

0 Answers

Read More

Friday, April 1, 2016

Making spark use /etc/hosts file for binding in YARN cluster mode

1 comment

I have a Spark cluster set up on machines with two network interfaces each, one public and another private. The /etc/hosts file in the cluster has the internal IP of all the other machines in the cluster, like so.

internal_ip FQDN

However, when I request a SparkContext via pyspark in YARN client mode (pyspark --master yarn --deploy-mode client), Akka binds to the public IP and thus a timeout occurs.

15/11/07 23:29:23 INFO Remoting: Starting remoting
15/11/07 23:29:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkYarnAM@public_ip:44015]
15/11/07 23:29:23 INFO util.Utils: Successfully started service 'sparkYarnAM' on port 44015.
15/11/07 23:29:23 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
15/11/07 23:31:30 ERROR yarn.ApplicationMaster: Failed to connect to driver at yarn_driver_public_ip:48875, retrying ...
15/11/07 23:31:30 ERROR yarn.ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:427)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:293)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:149)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
15/11/07 23:31:30 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
15/11/07 23:31:30 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
15/11/07 23:31:30 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1446960366742_0002

As seen from the log, the private IP is completely ignored. How can I make YARN and Spark use the private IP address as specified in the hosts file?

The cluster was provisioned using Ambari (HDP 2.4).

0 Answers

Read More