What is the ideal, prototypical mechanism/topology for implementing UPSERT-type logic in streaming frameworks like Storm/Spark/Flink?
What do I mean exactly? Simple example:
I'm concurrently processing a stream of email addresses.
For each address, the system needs to look up the local, data-store-specific id for the domain part of the address, and the local id of the company that domain currently belongs to.
The first time the system sees a new domain, it must insert a domain record into the data store, generating a local (to the store) id for the domain. It must also ensure the domain is mapped to a company, which may require scraping a web page, calling some other REST API, or even human intervention, potentially creating a new company record and id.
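To make the "insert if new, otherwise reuse" part concrete, here is a minimal sketch of an atomic get-or-create against a hypothetical DomainStore interface. The interface, the in-memory stand-in, and all of the names are illustrative assumptions, not tied to any particular framework or database; the point is only that the get-or-create step must be atomic so two concurrent workers seeing the same new domain cannot both insert it.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical store interface: getOrCreateDomainId is the UPSERT-like operation.
interface DomainStore {
    Optional<Long> findDomainId(String domain);
    long getOrCreateDomainId(String domain);
    Optional<Long> findCompanyIdForDomain(long domainId);
}

// In-memory stand-in that shows the intended semantics. A real store would back
// this with something like INSERT ... ON CONFLICT or a conditional put.
class InMemoryDomainStore implements DomainStore {
    private final ConcurrentHashMap<String, Long> domainIds = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Long, Long> companyByDomain = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    @Override
    public Optional<Long> findDomainId(String domain) {
        return Optional.ofNullable(domainIds.get(domain));
    }

    @Override
    public long getOrCreateDomainId(String domain) {
        // computeIfAbsent guarantees the id is generated exactly once per domain,
        // even under concurrent calls.
        return domainIds.computeIfAbsent(domain, d -> nextId.getAndIncrement());
    }

    @Override
    public Optional<Long> findCompanyIdForDomain(long domainId) {
        return Optional.ofNullable(companyByDomain.get(domainId));
    }
}
```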
While a newly seen domain is being 'resolved', the email address stream is still being processed, and some of those addresses contain the same domain.
The individual tasks processing additional email addresses whose domain is currently being 'resolved' should obviously see that, and their continued processing should be parked in some 'domain resolution completion' queue, so that when the domain is resolved, all the waiting tasks can continue as if they had originally found the domain.
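One way to express that buffering in Flink is sketched below. This is only an illustration under assumptions: the Email, Resolution, and EnrichedEmail types, the DomainResolver name, and the side-output wiring are all made up for the sketch. Both the email stream and a stream of resolution results are keyed by domain and connected, so every element for a given domain is handled by the same parallel instance, and unresolved emails are held in keyed state until the resolution arrives.

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Simplified event types (illustrative only).
class Email {
    public String address;
    public String domain;
    public Email() {}
    public Email(String address, String domain) { this.address = address; this.domain = domain; }
}

class Resolution {
    public String domain;
    public long domainId;
    public long companyId;
    public Resolution() {}
}

class EnrichedEmail {
    public Email email;
    public Resolution resolution;
    public EnrichedEmail() {}
    public EnrichedEmail(Email email, Resolution resolution) { this.email = email; this.resolution = resolution; }
}

public class DomainResolver
        extends KeyedCoProcessFunction<String, Email, Resolution, EnrichedEmail> {

    // Side output carrying "please resolve this domain" requests; it can feed the
    // scraping / REST / human-review step whose results come back as input 2.
    public static final OutputTag<String> RESOLVE_REQUESTS =
            new OutputTag<String>("resolve-requests") {};

    private transient ValueState<Resolution> resolved; // null until the domain is resolved
    private transient ListState<Email> pending;        // emails parked while resolving

    @Override
    public void open(Configuration parameters) {
        resolved = getRuntimeContext().getState(
                new ValueStateDescriptor<>("resolved", Resolution.class));
        pending = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending", Email.class));
    }

    @Override
    public void processElement1(Email email, Context ctx, Collector<EnrichedEmail> out)
            throws Exception {
        Resolution r = resolved.value();
        if (r != null) {
            // Domain already resolved: enrich and emit immediately.
            out.collect(new EnrichedEmail(email, r));
            return;
        }
        // Not resolved yet: park the email; only the first sighting triggers a request.
        Iterable<Email> waiting = pending.get();
        boolean firstSighting = waiting == null || !waiting.iterator().hasNext();
        pending.add(email);
        if (firstSighting) {
            ctx.output(RESOLVE_REQUESTS, email.domain);
        }
    }

    @Override
    public void processElement2(Resolution r, Context ctx, Collector<EnrichedEmail> out)
            throws Exception {
        // Resolution arrived: remember it and release everything that was waiting.
        resolved.update(r);
        Iterable<Email> waiting = pending.get();
        if (waiting != null) {
            for (Email email : waiting) {
                out.collect(new EnrichedEmail(email, r));
            }
        }
        pending.clear();
    }
}
```

Wiring is roughly `emails.keyBy(e -> e.domain).connect(resolutions.keyBy(r -> r.domain)).process(new DomainResolver())`; the resolve-request side output drives the external resolution work, and its results are looped back in as the second stream. Storm (fields grouping on the domain plus per-task state) and Spark Structured Streaming (`flatMapGroupsWithState` keyed by domain) admit the same general shape.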
I hope this example is clear enough.