I'm trying to evaluate the performance of a Spark Streaming job against a particular infrastructure (storage stack (HDFS), network, VMs, etc.).
I have a Kafka producer which generates random sentences and sends them to a Kafka cluster. On the consumer side, I use Spark Streaming with the direct streaming approach to consume the micro-batches. I run a simple word count over a 2-second batch window and write the counts to the storage stack. This is very similar to the example here that uses the direct streaming approach; sketches of both sides follow.
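For concreteness, the producer side is roughly like this (a simplified sketch using kafka-python; the broker address, topic name, vocabulary, and message count are placeholders, not my actual values):

```python
import random
from kafka import KafkaProducer

# Placeholder broker address and topic name.
producer = KafkaProducer(bootstrap_servers="broker1:9092")

words = ["spark", "kafka", "streaming", "hdfs", "network"]
for _ in range(100000):
    # Build a random 8-word sentence and send it as the message value.
    sentence = " ".join(random.choice(words) for _ in range(8))
    producer.send("sentences", sentence.encode("utf-8"))

# Make sure everything buffered has actually reached the brokers.
producer.flush()
```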
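The consumer side is essentially Spark's direct Kafka word count example; a simplified sketch of what I run (broker, topic, and HDFS output prefix are again placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="DirectKafkaWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batch window

# Direct approach: Spark queries the Kafka brokers itself, no receiver.
stream = KafkaUtils.createDirectStream(
    ssc, ["sentences"], {"metadata.broker.list": "broker1:9092"})

counts = (stream.map(lambda kv: kv[1])                 # message value
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

# One output directory per batch: hdfs:///tmp/wordcounts-<batch time>
counts.saveAsTextFiles("hdfs:///tmp/wordcounts")

ssc.start()
ssc.awaitTermination()
```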
Some questions:
- How do I evaluate the duration of the streaming job? Since this is a continuous job, what would be the best approach? I'd like to be able to make a statement like: it took T seconds on average to process M Kafka messages (of size S bytes). (I sketch a candidate timing approach after this list.)
- Currently I run the Kafka producer and the Spark Streaming job simultaneously. Suppose I instead run the Kafka producer for, say, 1 hour and only then start the Spark Streaming job: can I start at offset zero to capture all Kafka messages from all topics/partitions? If so, can I detect at the consumer (Spark) level when I have ingested all the messages? That would help me get an accurate value for the duration. (I sketch candidate answers for both parts after this list.)
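For the first question, the best idea I have so far is to timestamp each batch myself in foreachRDD and accumulate totals on the driver, replacing the saveAsTextFiles call in the sketch above. The stats dict and timed_batch helper are my own illustrative names, not Spark API, and counts is the DStream from that sketch:

```python
import time

# Driver-side counters: the function passed to foreachRDD runs on the
# driver, so a plain dict is enough for this experiment.
stats = {"batches": 0, "messages": 0, "seconds": 0.0}

def timed_batch(batch_time, rdd):
    start = time.time()
    rdd.cache()                    # avoid recomputing for count + save
    n = rdd.count()                # forces the batch to be evaluated
    if n > 0:
        rdd.saveAsTextFile(
            "hdfs:///tmp/wordcounts-" + batch_time.strftime("%Y%m%d%H%M%S"))
    stats["batches"] += 1
    # n here counts word-count records; hooking the raw stream the same
    # way would count Kafka messages instead.
    stats["messages"] += n
    stats["seconds"] += time.time() - start

counts.foreachRDD(timed_batch)
```

The Streaming tab of the Spark UI also reports per-batch processing time and scheduling delay, which may already be enough for a statement like the one above.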
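For the second question, my understanding is that the direct stream accepts Kafka consumer configuration, so passing auto.offset.reset should make it start from the earliest available offsets (spelled "smallest" for the old simple-consumer API that the Spark 1.x direct stream uses; the newer Kafka consumer spells it "earliest"). For detecting the end, since a stream never really ends, the best trick I can think of is to stop after a few consecutive empty batches once data has been seen; the state dict and the threshold of 3 are arbitrary illustrative choices:

```python
# Start from the earliest available offsets (old direct stream spelling).
stream = KafkaUtils.createDirectStream(
    ssc, ["sentences"],
    {"metadata.broker.list": "broker1:9092",
     "auto.offset.reset": "smallest"})

state = {"seen_data": False, "empty_batches": 0, "done": False}

def check_done(rdd):
    if rdd.isEmpty():
        if state["seen_data"]:
            state["empty_batches"] += 1
            state["done"] = state["empty_batches"] >= 3
    else:
        state["seen_data"] = True
        state["empty_batches"] = 0

stream.foreachRDD(check_done)

ssc.start()
# Poll from the main thread rather than stopping inside foreachRDD.
while not state["done"]:
    ssc.awaitTerminationOrTimeout(2)
ssc.stop(stopSparkContext=True, stopGraceFully=False)
```

An alternative would be to have the producer write a sentinel message at the end and stop when the consumer sees it. Is there a cleaner way?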
NOTE: I am currently using the Python API for both the Kafka producer and the Spark Streaming consumer. I mention this because the Python API may be more restricted than the Scala and Java APIs for both Kafka and Spark.
Thank you!