Saturday, December 16, 2017

Spark Streams: Parallelization when consuming from S3


My S3 objects contain data of the following form, spread across different files:

metric-name   start-time           stop-time            request-id
service-A     12/06/2017 19:00:00  12/06/2017 19:01:00  12345
service-B     12/06/2017 19:01:00  12/06/2017 19:02:00  12345
service-C     12/06/2017 19:02:00  12/06/2017 19:03:00  12345

I want to run a Spark Streaming job that aggregates this data into something like the following:

(Basically, it takes the start time of one metric and the stop time of another to create an aggregated metric.)

metric-name             start-time           stop-time            request-id
service-A to service-B  12/06/2017 19:00:00  12/06/2017 19:02:00  12345
service-A to service-C  12/06/2017 19:00:00  12/06/2017 19:03:00  12345
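
For illustration, here is a minimal batch-style sketch of that aggregation in Scala/Spark. The bucket path, the Metric case class, and the parsing logic are assumptions based on the sample rows above; the same groupBy/flatMap transformation could equally run per micro-batch inside a streaming job rather than as a one-off read.

import org.apache.spark.sql.SparkSession

object MetricAggregation {

  // Hypothetical record shape matching the sample rows above.
  case class Metric(name: String, start: String, stop: String, requestId: String)

  def parse(line: String): Metric = {
    // Assumes whitespace-separated columns: name, start date, start time,
    // stop date, stop time, request id.
    val t = line.trim.split("\\s+")
    Metric(t(0), s"${t(1)} ${t(2)}", s"${t(3)} ${t(4)}", t(5))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("metric-aggregation").getOrCreate()
    val sc = spark.sparkContext

    // "s3a://my-bucket/metrics/" is a placeholder prefix; each object under it
    // contributes one or more partitions to the resulting RDD.
    val metrics = sc.textFile("s3a://my-bucket/metrics/*")
      .filter(_.trim.nonEmpty)
      .filter(!_.startsWith("metric-name"))   // drop header rows, if present
      .map(parse)

    // Group all rows that share a request id, regardless of which S3 object
    // they came from (groupBy shuffles by key across partitions), then pair
    // service-A's start time with every other metric's stop time.
    val aggregated = metrics
      .groupBy(_.requestId)
      .flatMap { case (reqId, rows) =>
        rows.find(_.name == "service-A").toSeq.flatMap { a =>
          rows.filter(_.name != "service-A").map { other =>
            (s"${a.name} to ${other.name}", a.start, other.stop, reqId)
          }
        }
      }

    aggregated.collect().foreach(println)
    spark.stop()
  }
}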

I have a few questions though:

  1. How do Spark jobs parallelize when consuming from S3? Do they read from different files simultaneously?
  2. Is there a way to determine how such partitioning occurs? (See the sketch after this list.)
  3. In a more traditional programming model I'd probably create a map structure and use it to create the aggregated metrics. How can I achieve this, bearing in mind the data could be spread across multiple S3 objects?
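
For questions 1 and 2, a rough sketch like the following can be used to see how a given S3 prefix is actually split into partitions. The bucket path is a placeholder, and the resulting partition counts depend on object sizes and the S3A/Hadoop input-format settings, so treat this as an experiment rather than a definitive answer.

import org.apache.spark.sql.SparkSession

object PartitionInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partition-inspection").getOrCreate()
    val sc = spark.sparkContext

    // Each S3 object becomes at least one partition; large objects may be
    // split further. The minPartitions hint asks for a finer split.
    val rdd = sc.textFile("s3a://my-bucket/metrics/*", minPartitions = 16)
    println(s"textFile partitions: ${rdd.getNumPartitions}")

    // The DataFrame reader sizes its file-scan partitions from this config (bytes).
    spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)
    val df = spark.read.text("s3a://my-bucket/metrics/")
    println(s"DataFrame partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}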
