I'm trying to evaluate the performance of a Spark Streaming job against a particular infrastructure (storage stack (HDFS), network, VMs, etc.).
I have a Kafka producer which generates random sentences and sends them to a Kafka cluster. On the consumer side, I use Spark Streaming with the direct streaming approach to consume the micro-batches. I run a simple word count over a 2-second batch window and write the counts to the storage stack. This is very similar to the example here that uses the direct streaming approach; sketches of both sides follow.
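For concreteness, the producer side is roughly like this (a simplified sketch using kafka-python; the broker address, topic name, vocabulary, and message count are placeholders, not my actual values):

```python
import random
from kafka import KafkaProducer

# Placeholder broker address and topic name.
producer = KafkaProducer(bootstrap_servers="broker1:9092")

words = ["spark", "kafka", "streaming", "hdfs", "network"]
for _ in range(100000):
    # Build a random 8-word sentence and send it as the message value.
    sentence = " ".join(random.choice(words) for _ in range(8))
    producer.send("sentences", sentence.encode("utf-8"))

# Make sure everything buffered has actually reached the brokers.
producer.flush()
```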
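The consumer side is essentially Spark's direct Kafka word count example; a simplified sketch of what I run (broker, topic, and HDFS output prefix are again placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="DirectKafkaWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batch window

# Direct approach: Spark queries the Kafka brokers itself, no receiver.
stream = KafkaUtils.createDirectStream(
    ssc, ["sentences"], {"metadata.broker.list": "broker1:9092"})

counts = (stream.map(lambda kv: kv[1])                 # message value
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

# One output directory per batch: hdfs:///tmp/wordcounts-<batch time>
counts.saveAsTextFiles("hdfs:///tmp/wordcounts")

ssc.start()
ssc.awaitTermination()
```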
Some questions:
- How do I evaluate the duration of the streaming job? Since this is a continuous job, what would be the best approach? I'd like to be able to make a statement like: it took T seconds on average to process M Kafka messages (of size S bytes). (I sketch a candidate timing approach after this list.)
- Currently I run the Kafka producer and the Spark Streaming job simultaneously. Suppose I instead run the Kafka producer for, say, 1 hour and only then start the Spark Streaming job: can I start at offset zero to capture all Kafka messages from all topics/partitions? If so, can I detect at the consumer (Spark) level when I have ingested all the messages? That would help me get an accurate value for the duration. (I sketch candidate answers for both parts after this list.)
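For the first question, the best idea I have so far is to timestamp each batch myself in foreachRDD and accumulate totals on the driver, replacing the saveAsTextFiles call in the sketch above. The stats dict and timed_batch helper are my own illustrative names, not Spark API, and counts is the DStream from that sketch:

```python
import time

# Driver-side counters: the function passed to foreachRDD runs on the
# driver, so a plain dict is enough for this experiment.
stats = {"batches": 0, "messages": 0, "seconds": 0.0}

def timed_batch(batch_time, rdd):
    start = time.time()
    rdd.cache()                    # avoid recomputing for count + save
    n = rdd.count()                # forces the batch to be evaluated
    if n > 0:
        rdd.saveAsTextFile(
            "hdfs:///tmp/wordcounts-" + batch_time.strftime("%Y%m%d%H%M%S"))
    stats["batches"] += 1
    # n here counts word-count records; hooking the raw stream the same
    # way would count Kafka messages instead.
    stats["messages"] += n
    stats["seconds"] += time.time() - start

counts.foreachRDD(timed_batch)
```

The Streaming tab of the Spark UI also reports per-batch processing time and scheduling delay, which may already be enough for a statement like the one above.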
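For the second question, my understanding is that the direct stream accepts Kafka consumer configuration, so passing auto.offset.reset should make it start from the earliest available offsets (spelled "smallest" for the old simple-consumer API that the Spark 1.x direct stream uses; the newer Kafka consumer spells it "earliest"). For detecting the end, since a stream never really ends, the best trick I can think of is to stop after a few consecutive empty batches once data has been seen; the state dict and the threshold of 3 are arbitrary illustrative choices:

```python
# Start from the earliest available offsets (old direct stream spelling).
stream = KafkaUtils.createDirectStream(
    ssc, ["sentences"],
    {"metadata.broker.list": "broker1:9092",
     "auto.offset.reset": "smallest"})

state = {"seen_data": False, "empty_batches": 0, "done": False}

def check_done(rdd):
    if rdd.isEmpty():
        if state["seen_data"]:
            state["empty_batches"] += 1
            state["done"] = state["empty_batches"] >= 3
    else:
        state["seen_data"] = True
        state["empty_batches"] = 0

stream.foreachRDD(check_done)

ssc.start()
# Poll from the main thread rather than stopping inside foreachRDD.
while not state["done"]:
    ssc.awaitTerminationOrTimeout(2)
ssc.stop(stopSparkContext=True, stopGraceFully=False)
```

An alternative would be to have the producer write a sentinel message at the end and stop when the consumer sees it. Is there a cleaner way?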
NOTE: I am currently using the Python API for both the Kafka producer and the Spark Streaming consumer. I mention this because the Python API may be more restricted than the Scala and Java APIs for both Kafka and Spark.
Thank you!