I frequently analyze the DAG of my Spark jobs, but it is annoying to have to sit and watch an application while it is running just to see its DAG.
So I tried to view the DAG using the Spark history server, which I understand should let me see past jobs. I can easily access port 18080, and I can see the history server UI.
But it doesn't show me any information related to my Spark program's execution. I know the history server is running, because when I do sudo service --status-all
I see
spark history-server is running [ OK ]
So I already tried what this question suggested: here.
I think this may be because I'm running Spark on YARN, which can only use one resource manager at a time? Maybe?
So, how do I see the Spark execution DAG *after* a job has finished, specifically when YARN is my resource manager?
2 Answers
Answer 1
Running only the history server is not sufficient to see the execution DAGs of previous jobs. You also need to configure your jobs to store their event logs.
Run the Spark history server with ./sbin/start-history-server.sh
Enable event logging for the Spark job by adding these lines to the spark-defaults.conf file:

spark.eventLog.enabled true
spark.eventLog.dir file:/path/to/event/log

The event-log directory can be local or on HDFS.
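If you'd rather not edit spark-defaults.conf, the same two settings can also be passed per job on the command line. This is only a sketch; the application class, jar name, and event-log path are placeholders to adapt to your setup:

spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-events \
  --class com.example.MyApp \
  myapp.jar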
Answer 2
As mentioned in Monitoring and Instrumentation, the following three parameters need to be set in spark-defaults.conf:

spark.eventLog.enabled
spark.eventLog.dir
spark.history.fs.logDirectory
The first property should be set to true:
spark.eventLog.enabled true
The second and third properties should point to the event-log locations, which can be either on the local file system or on HDFS. The second property defines where Spark jobs write their event logs, and the third tells the history server where to read the logs it displays in the web UI at port 18080.
If you choose the Linux local file system (/opt/spark/spark-events):
Either
spark.eventLog.dir file:/opt/spark/spark-events
spark.history.fs.logDirectory file:/opt/spark/spark-events
Or
spark.eventLog.dir file:///opt/spark/spark-events
spark.history.fs.logDirectory file:///opt/spark/spark-events
should work
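One caveat, at least in my setup: Spark does not create the event-log directory for you, so it has to exist before the first job runs:

mkdir -p /opt/spark/spark-events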
If you choose HDFS (/spark-events):
Either
spark.eventLog.dir hdfs:/spark-events
spark.history.fs.logDirectory hdfs:/spark-events
Or
spark.eventLog.dir hdfs:///spark-events
spark.history.fs.logDirectory hdfs:///spark-events
Or
spark.eventLog.dir hdfs://masterIp:9090/spark-events
spark.history.fs.logDirectory hdfs://masterIp:9090/spark-events
should work, where masterIp:9090 matches the fs.default.name property in the core-site.xml of your Hadoop configuration.
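As with the local case, the HDFS directory should exist (and be writable by the user submitting jobs) before any job runs; something like:

hdfs dfs -mkdir -p /spark-events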
The Apache Spark history server can be started by
$SPARK_HOME/sbin/start-history-server.sh
A third-party Spark history server, for example Cloudera's, can be started by
sudo service spark-history-server start
And to stop the history server (for Apache)
$SPARK_HOME/sbin/stop-history-server.sh
Or (for Cloudera)
sudo service spark-history-server stop
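Once the history server is up, you can check whether it actually sees your completed applications without opening the browser, via its REST API (available on reasonably recent Spark versions; the hostname here is a placeholder):

curl http://historyserver-host:18080/api/v1/applications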