Thursday, August 17, 2017

Kafka Streams application design principles


I want to dive into stream processing with Kafka, and I need some help to get my head around some design principles which are currently not very clear to me.

1.) Assuming I have some real-time stock price data: would you make one topic "price", keyed (and therefore partitioned) by the stock symbol, or would you make one topic per symbol? For example, what happens if I decide to add some more stock symbols later on, including their full history? The history (ordering in the log) of the topic "price" would then be a mess, right? On the other hand, I want to calculate the returns for each price series later, and if the series are in different topics I have to keep track of them and start a new stream application for every single symbol.

2.) Now that I have different real-time prices, I need to join an arbitrary number of them into one big record, for example joining all S&P 500 symbols into one record. Since I do not have a price for every S&P 500 symbol at exactly the same time (only at times pretty close to it), how can I join them so that I always use the latest price when one is missing at that exact moment?

3.) Say I have solved the join use case and I pump the joined records of all S&P 500 stocks back into Kafka. What do I need to do if I made a mistake and forgot one symbol? Obviously I want to add it to the stream. Do I now need to wipe the "sp500" log and rebuild it, or is there some mechanism to reset the starting offset to a particular offset (the one where I fixed the join)? Most likely there are also other stream applications consuming from this topic; they would also need to do some kind of reset/replay. Is it perhaps a better idea not to store the sp500 topic at all but to make the join part of one long stream process? But then I would end up doing the same join potentially several times.

4.) Maybe this should be question 1, since it is part of my overall goal ^^ - however, how could I model a data flow like this:

produce prices -> calculate returns -> join several returns into a row vector -> calculate covariance (window of row vectors) -> join covariance with returns into a tuple (ret, cov)

I am not even sure whether such a complicated use case is possible with today's stream processing.

1 Answer

Answer 1

When using Kafka I think of the messages as key/value pairs, stored in a distributed, persistent and replicated topic and sent as an endless data stream. A topic can be configured with different retention times and retention/cleanup policies.
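For instance, a minimal sketch of creating a topic with explicit retention and cleanup settings via the AdminClient could look like this (the topic name, partition count and config values are only placeholders, not from the question):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreatePriceTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic price = new NewTopic("price", 6, (short) 1)
                        // keep data for 7 days and delete (not compact) old segments
                        .configs(Map.of("retention.ms", "604800000",
                                        "cleanup.policy", "delete"));
                admin.createTopics(Collections.singleton(price)).all().get();
            }
        }
    }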

1) How you organize your topics is up to you. Basically you can do both, and depending on how you want to use the data later, both might make sense. In your use case I would write the prices to one topic. The key should be picked like a primary key in a relational DB: it guarantees the order of values sent per key and can also be used for retention (log compaction keeps the latest value per key). BTW: you can consume from multiple streams/topics in one application.
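As a rough sketch (the topic name "price", the symbols and the plain-string price values are just made-up examples), producing messages keyed by stock symbol could look like this:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PriceProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key = stock symbol, so every price for one symbol lands in the
                // same partition of the "price" topic and stays ordered there.
                producer.send(new ProducerRecord<>("price", "AAPL", "159.86"));
                producer.send(new ProducerRecord<>("price", "GOOG", "926.18"));
            }
        }
    }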

2) What you want to use here is the so-called "table/stream duality". (Side note: I think of streamed data as stateless and of a table as stateful.) Technically you build a mapping (e.g. in memory) from a key to a value (the latest value for that key in the stream). Kafka Streams will do this for you with a KTable. Kafka itself can also do this for you using an additional topic with retention configured to keep only the latest value per key (a compacted topic).
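A minimal Kafka Streams sketch of that duality (all topic names and the String serdes below are placeholders): the "price" topic is read as a KTable, so any join against it always sees the most recent price per symbol, even when the timestamps of the other stream do not line up exactly.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class LatestPriceJoin {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-price-join");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();

            // Table view of the "price" topic: one row per symbol, always holding
            // the latest price seen for that key.
            KTable<String, String> latestPrices =
                    builder.table("price", Consumed.with(Serdes.String(), Serdes.String()));

            // Any other stream keyed by symbol can join against the latest price,
            // even if its timestamps do not match the price ticks exactly.
            KStream<String, String> returns =
                    builder.stream("return", Consumed.with(Serdes.String(), Serdes.String()));

            returns.join(latestPrices, (ret, price) -> ret + " @ " + price)
                   .to("return-with-latest-price", Produced.with(Serdes.String(), Serdes.String()));

            new KafkaStreams(builder.build(), props).start();
        }
    }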

3) The messages in a Kafka topic are stored according to your retention configuration, so you can configure it e.g. to keep all data for 7 days. If you want to add data later but have it use some time other than its produce time, you need to send a timestamp as part of your message data and use that one when processing it later. For every consumer you can set/reset the offset at which it should start reading, which means you can go back and reprocess all the data that is still in your topic.
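For example, a plain consumer can rewind to the earliest offsets still retained and reprocess everything from there; the topic name "sp500" and the group id below are placeholders:

    import java.time.Duration;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplaySp500 {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "sp500-replay");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("sp500"), new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // Rewind to the earliest offset still retained in the topic.
                        consumer.seekToBeginning(partitions);
                    }
                });

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
                }
            }
        }
    }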

4) I am not sure what you are asking for, because your flow seems fine for your goal. Kafka and stream processing are a good match for your use case.
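As a hedged sketch of just the first step of that flow (prices -> returns): a small state store can remember the previous price per symbol so each new tick becomes a simple return; the later row-vector and covariance steps could be built in the same style with windowed aggregations. The topic names, the Double serdes and the store name are placeholders, not anything from the question:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.Stores;

    public class PriceToReturn {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-to-return");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();

            // State store remembering the last price seen per symbol.
            builder.addStateStore(Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("prev-price"),
                    Serdes.String(), Serdes.Double()));

            builder.stream("price", Consumed.with(Serdes.String(), Serdes.Double()))
                   .transformValues(() -> new ValueTransformerWithKey<String, Double, Double>() {
                       private KeyValueStore<String, Double> prev;

                       @SuppressWarnings("unchecked")
                       @Override
                       public void init(ProcessorContext context) {
                           prev = (KeyValueStore<String, Double>) context.getStateStore("prev-price");
                       }

                       @Override
                       public Double transform(String symbol, Double price) {
                           Double last = prev.get(symbol);
                           prev.put(symbol, price);
                           // First tick for a symbol has no return yet; filtered out below.
                           return last == null ? null : (price - last) / last;
                       }

                       @Override
                       public void close() { }
                   }, "prev-price")
                   .filter((symbol, ret) -> ret != null)
                   .to("return", Produced.with(Serdes.String(), Serdes.Double()));

            new KafkaStreams(builder.build(), props).start();
        }
    }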

In general I can recommend reading the Confluent blog, the Confluent documentation and everything on the Kafka website. A lot of your questions depend on your requirements, your hardware and what you want to do in your software, so even with the given information I have to say "it depends". I hope this helps you and others starting with Kafka, even if it is just a quick attempt at explaining the concepts and giving some starting points.
