Saturday, March 26, 2016

Redis PFADD to check a exists-in-set query

Leave a Comment

I have a requirement to process multiple records from a queue. But due to some external issues the items may sporadically occur multiple times. I need to process items only once

What I planned to use is PFADD into redis every record ( as a md5sum) and then see if that returns success. If that shows no increment then the record is a duplicate else process the record.

This seems pretty straightforward , but I am getting too many false positives while using PFADD

Is there a better way to do this ?

2 Answers

Answers 1

Being the probabilistic data structure that it is, Redis' HyperLogLog exhibits 0.81% standard error. You can reduce (but never get rid of) the probability for false positives by using multiple HLLs, each counting a the value of a different hash function on your record.

Also note that if you're using a single HLL there's no real need to hash the record - just PFADD as is.

Alternatively, use a Redis Set to keep all the identifiers/hashes/records and have 100%-accurate membership tests with SISMEMBER. This approach requires more (RAM) resources as you're storing each processed element, but unless your queue is really huge that shouldn't be a problem for a modest Redis instance. To keep memory consumption under control, switch between Sets according to the date and set an expiry on the Set keys (another approach is to use a single Sorted Set and manually remove old items from it by keeping their timestamp in the score).

Answers 2

In general in distributed systems you have to choose between processing items either :

  • at most once
  • at least once

Processing something exactly-once would be convenient however this is generally impossible.

That being said there could be acceptable workarounds for your specific use case, and as you suggest storing the items already processed could be an acceptable solution.

Be aware though that PFADD uses HyperLogLog, which is fast and scales but is approximate about the count of the items, so in this case I do not think this is what you want. However if you are fine with having a small probability of errors, the most appropriate data structure here would be a Bloom filter (as described here for Redis), which can be implemented in a very memory-efficient way.

A simple, efficient, and recommended solution would be to use a simple redis key (for instance a hash) storing a boolean-like value ("0", "1" or "true", "false") for instance with the HSET or SET with the NX option instruction. You could also put it under a namespace if you wish to. It has the added benefit of being able to expire keys also.

It would avoid you to use a set (not the SET command, but rather the SINTER, SUNION commands), which doesn't necessarily work well with Redis cluster if you want to scale to more than one node. SISMEMBER is still fine though (but lacks some features from hashes such as time to live).

If you use a hash, I would also advise you to pick a hash function that has fewer chances of collisions than md5 (a collision means that two different objects end up with the same hash).

An alternative approach to the hash would be to assign an uuid to every item when putting it in the queue (or a squuid if you want to have some time information).

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment