Tuesday, April 3, 2018

How to evaluate the end-to-end (E2E) performance of a Spark Streaming job, in particular its duration, with Kafka as the event producer?


I'm trying to evaluate the performance of a Spark Streaming job against a particular infrastructure (storage stack (HDFS), network, VMs, etc.).

I have a Kafka producer that generates random sentences and sends them to a Kafka cluster. On the consumer side, I use Spark Streaming with the direct-stream approach to consume the micro-batches. I run a simple word count over a 2-second batch window and write the counts to the storage stack, very similar to the direct-stream word count example here.
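For reference, a minimal sketch of this setup, assuming Spark 2.x with the Kafka 0.8 direct-stream Python API (pyspark.streaming.kafka); the broker address, topic name, and HDFS output path are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaWordCount")
    ssc = StreamingContext(sc, 2)  # 2-second batch window

    # Direct stream: Spark reads from Kafka without a receiver,
    # one RDD partition per Kafka partition.
    stream = KafkaUtils.createDirectStream(
        ssc, ["sentences"], {"metadata.broker.list": "broker1:9092"})

    counts = (stream.map(lambda kv: kv[1])  # keep the sentence, drop the key
                    .flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

    # Write each micro-batch of counts to the storage stack (HDFS here).
    counts.saveAsTextFiles("hdfs:///tmp/wordcounts/batch")

    ssc.start()
    ssc.awaitTermination()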

Some questions:

  • How should I evaluate the duration of the streaming job? Since this is a continuous job, what would be the best approach? A statement like: it took T seconds on average to process M Kafka messages (of size S bytes)? (See the first sketch after this list.)
  • Currently I run the Kafka producer and the Spark Streaming job simultaneously. Suppose I instead run the Kafka producer alone for, say, 1 hour and only then start the Spark Streaming job: can I start at offset zero so that I capture all Kafka messages across all topics/partitions? If so, can I detect at the consumer (Spark) level when I am done ingesting all of those messages? That would help me get an accurate value for the duration. (See the second sketch after this list.)
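On the first question: the Spark UI's Streaming tab already reports per-batch processing time and scheduling delay, which is usually the first place to look. If you want numbers you can aggregate yourself, one approach (a sketch, reusing the stream DStream from the snippet above) is to time each micro-batch inside foreachRDD on the driver and keep running totals, which yields exactly a statement of the form "T seconds on average to process M messages":

    import time

    totals = {"batches": 0, "messages": 0, "seconds": 0.0}

    def timed_count(rdd):
        # Runs on the driver once per 2-second batch.
        start = time.time()
        n = rdd.count()  # forces the batch to be evaluated
        elapsed = time.time() - start
        if n > 0:
            totals["batches"] += 1
            totals["messages"] += n
            totals["seconds"] += elapsed
            print("batch: %d messages in %.3fs (running avg %.3fs/batch)"
                  % (n, elapsed, totals["seconds"] / totals["batches"]))

    stream.foreachRDD(timed_count)

Note that rdd.count() here triggers a job separate from the word-count output, so treat the figures as indicative of ingestion cost rather than total pipeline cost; the message size S could be measured the same way by summing len(kv[1]) over the batch.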
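On the second question: with the direct stream and the 0.8 consumer configuration, setting auto.offset.reset to "smallest" makes a fresh consumer begin at the earliest available offset of every partition (the newer 0.10+ consumer calls this "earliest"). There is no built-in "all messages ingested" signal, but since the producer has already stopped, one heuristic is to shut down after several consecutive empty batches. A sketch, with the same placeholder broker and topic as above:

    stream = KafkaUtils.createDirectStream(
        ssc, ["sentences"],
        {"metadata.broker.list": "broker1:9092",
         "auto.offset.reset": "smallest"})  # start from the earliest offsets

    empty_batches = {"n": 0}

    def watch_for_end(rdd):
        # Count consecutive empty batches; any data resets the counter.
        empty_batches["n"] = 0 if not rdd.isEmpty() else empty_batches["n"] + 1

    stream.foreachRDD(watch_for_end)

    ssc.start()
    # Poll on the driver; once 5 consecutive batches were empty, the backlog
    # written by the producer has been drained, so stop and record the time.
    while not ssc.awaitTerminationOrTimeout(2):
        if empty_batches["n"] >= 5:
            ssc.stop(stopSparkContext=True, stopGraceFully=True)
            break

If you need more precise bounds, createDirectStream also accepts a fromOffsets argument for pinning exact per-partition starting offsets, and you could likewise compare consumed counts against the known number of messages the producer wrote.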

NOTE: I am currently using the Python API for both the Kafka producer and the Spark Streaming consumer. I mention this because the Python API may be more restricted than the Scala and Java APIs for both Kafka and Spark.

Thank you!


