Saturday, June 17, 2017

How to get execution DAG from spark web UI after job has finished running, when I am running spark on YARN?


I frequently analyze the DAG of my Spark job while it is running, but it is annoying to have to sit and watch the application live just to see the DAG.

So I tried to view the DAG using the Spark history server, which should let me see past jobs. I can easily access port 18080 and see the history server UI.

But it doesn't show me any information about my Spark program's execution. I know the history server is running, because when I do sudo service --status-all I see

spark history-server is running [ OK ]

So I already tried what this question suggested: here.

I think this might be because I'm running Spark on YARN, and it can only use one resource manager at a time? Maybe?

So, how do I see the Spark execution DAG *after* a job has finished, specifically when YARN is my resource manager?

2 Answers

Answer 1

Running only the history server is not sufficient to get the execution DAG of previous jobs. You also need to configure your jobs to store event logs.

Run the Spark history server with ./sbin/start-history-server.sh

Then enable event logging for your Spark jobs by adding these lines to spark-defaults.conf:

spark.eventLog.enabled true
spark.eventLog.dir     file:/path/to/event/log

The event-log directory can be local or on HDFS.
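If you would rather not edit spark-defaults.conf, the same properties can also be passed per job on the spark-submit command line. A minimal sketch; the application class, JAR name, and log directory here are placeholders:

spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-events \
  --class com.example.MyApp \
  my-app.jar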

Answer 2

As mentioned in Monitoring and Instrumentation, the following three parameters need to be set in spark-defaults.conf:

spark.eventLog.enabled
spark.eventLog.dir
spark.history.fs.logDirectory

The first property should be set to true:

spark.eventLog.enabled           true 

The second and third properties should point to the event-log locations, which can be on either the local file system or HDFS. The second property defines where Spark jobs write their event logs, and the third tells the history server where to read them so it can display the logs in its web UI on port 18080.

If you choose the Linux local file system (/opt/spark/spark-events), either

spark.eventLog.dir               file:/opt/spark/spark-events
spark.history.fs.logDirectory    file:/opt/spark/spark-events

or

spark.eventLog.dir               file:///opt/spark/spark-events
spark.history.fs.logDirectory    file:///opt/spark/spark-events

should work.
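Note that Spark does not create the event-log directory for you; if it is missing, jobs typically fail at startup with a "log directory does not exist" error. Assuming the /opt/spark/spark-events path above, create it first on the machine where the driver runs:

mkdir -p /opt/spark/spark-events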

If you choose the HDFS file system (/spark-events), either

spark.eventLog.dir               hdfs:/spark-events
spark.history.fs.logDirectory    hdfs:/spark-events

or

spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events

or

spark.eventLog.dir               hdfs://masterIp:9090/spark-events
spark.history.fs.logDirectory    hdfs://masterIp:9090/spark-events

should work, where masterIp:9090 is the fs.default.name property in core-site.xml of your Hadoop configuration.
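As with the local case, the HDFS directory must exist before any job writes to it. Assuming the /spark-events path from the examples above:

hdfs dfs -mkdir -p /spark-events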

The Apache Spark history server can be started with

$SPARK_HOME/sbin/start-history-server.sh 

A third-party Spark history server, for example Cloudera's, can be started with

sudo service spark-history-server start 
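Once the server is up, you can confirm it has picked up your finished applications through its REST API (the hostname below is a placeholder); every application listed there should have its DAG viewable in the web UI:

curl http://historyserver:18080/api/v1/applications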

And to stop the history server (for Apache)

$SPARK_HOME/sbin/stop-history-server.sh 

Or (for Cloudera)

sudo service spark-history-server stop 
