Loris Cro
Loris studied Bioinformatics in Milan and participated in 2014’s Google Summer of Code, where he worked on tools that use NoSQL databases to support the analysis of NextGen sequencing data. After that, he worked for a year as a data engineer in a criminology research centre. More recently, he moved to a back-end developer role, where he draws on that experience to build high-performance services using Go and Redis.

Proper architectural design and a strong development process are crucial to writing scalable, effective applications. But to build state-of-the-art solutions, another practice is just as fundamental: the proper use of data structures. Redis, the popular open source database, offers a varied and extensible toolbox of data types, including some of the sharpest tools in an engineer’s tool belt: probabilistic data structures.

Each probabilistic data structure solves a very specific problem by leveraging its unique properties, but they all share two characteristics that no amount of architectural or process sophistication can substitute for: they are very fast and have extremely low memory requirements. As a result, you can use them to keep track of billions of items with cost-effectiveness, simplicity and elegance.

Any area of a large-scale business can benefit from probabilistic data structures. For instance, shipping might need to know immediately whether certain items are out of stock before accepting an order from a customer; marketing might want a live dashboard showing total counts for various user segments; and engineering might need a fast way to drop requests from bad clients. These are just a few situations where probabilistic data structures can greatly improve efficiency.

In this article, we will present three probabilistic data structures that solve two common use cases for huge data sets: 1) keeping track of set cardinality and 2) testing items for set membership.

A Quick Probabilistic Data Structures Primer

Before talking about these specific use cases, let’s explore some of the counter-intuitive properties of probabilistic data structures.

  • The behavior of these data structures depends strongly on hash functions and, in some cases, on a source of randomness. Hashing allows for greater flexibility in storing different distributions of data, often resulting in greater efficiency. Unfortunately, it also makes behavior harder to predict.
  • Probabilistic data structures do not store whole items, so you can’t use them to retrieve an item you added earlier. This also means that, contrary to what you would expect from non-probabilistic data structures, they can represent much larger data volumes in a given amount of space, because they store only the bare minimum of data required to solve their specific problem (e.g. counting or set membership).
  • Finally, it’s worth noting that probabilistic data structures give you a fuzzy view of the data: counts will not be precise to a single unit, and set membership tests might give you the wrong answer. While not ideal, this is not a problem, because these error rates have a specific structure, are quantifiable and can be tuned via a few parameters. Probabilistic data structures are great when you care more about speed or a low memory footprint than about pinpoint accuracy. When you do need complete correctness, you can always pair them with traditional data structures and use the probabilistic ones as an optimization layer.

With that primer, we’ll dig into real-world examples for three powerful data structures: HyperLogLog, Bloom Filters and Cuckoo Filters.

HyperLogLog for Set Cardinality

HyperLogLog is a great data structure when you have a huge number of users interacting with an even bigger set of items and you want to keep count of these interactions. In addition, this data type can keep count of various sets and quickly compute the cardinality of their union or intersection. It’s built into Redis, so you can use it right away. Each counter takes 12 kilobytes of memory, has a 0.81% standard error and can count a virtually unlimited number of unique items (2^64).

Examples:

  • YouTube, Reddit, Amazon views
  • Efficient unions/intersections of user groups

Basic Usage

Creating a new counter is as simple as adding the first item. If the key does not exist, Redis will create it for you. The command structure is as follows:
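    PFADD <key> <element> [<element> ...]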

Counting Unique Views

Let’s say that you want to keep a unique view count of your videos. To add a view by user 01 to video 42, use:
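    PFADD video42 user01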

This creates a key named after the video that contains the event user01. To obtain the unique view count for an item, use:
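    PFCOUNT video42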

This returns the total number of unique elements added so far. Since we identify each event by user id, multiple views by the same user are only counted once.

Different Views per Day

If your definition of uniqueness is more nuanced, you just need to pack all relevant information into the strings identifying each view event. Let’s say that you want to consider views happening on subsequent days as different, even if they are from the same user. In that case, your command structure might be:
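    PFADD video42 user01:2018-08-22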

This way, user01:2018-08-22 and user01:2018-08-23 will count as two separate events. Another approach could be to create a different counter for each day:
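    PFADD video42:2018-08-22 user01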

Note how the date moved from being part of each item to being part of the key. With this approach, you can easily plot a histogram, but beware that you will be creating a new counter for each day. To get the total view count across a span of multiple days, you can sum the results of calling PFCOUNT on each counter. And if you still want to know the total view count by unique users only, you can use:
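    PFCOUNT video42:2018-08-22 video42:2018-08-23

(PFCOUNT accepts any number of keys, so you can list as many daily counters as you need.)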

This invocation returns the cardinality of the union of all given counters. Since we stored each view event by user id alone, this count will correspond to our first definition of uniqueness.


Merging Counters

You can also merge multiple counters into one, which is useful both for tracking multiple levels of granularity and for rotating out older counts. For example:
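    PFMERGE video42:2018-08 video42:2018-08-22 video42:2018-08-23

(In practice you would list every daily counter for the month as a source.)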

This invocation will store in video42:2018-08 the union of all the given sets. Compacting older metrics allows you to keep historical data at hand while still allocating most of your memory to more recent events. All our examples are about slicing data by date, but the same concept can be applied to any other dimension, such as category, video length, age range, etc.

Further Reading

  • A complete list of HyperLogLog commands can be found on the official Redis website.
  • Krishnan Chandra, a senior software engineer at Reddit, wrote an interesting blog post on how they count views using Redis’ HyperLogLog counters, focusing on the architectural aspects.

Bloom Filters for Set Membership

Bloom Filters are the most iconic type of probabilistic data structure and are used in a wide range of applications. Databases, networking equipment and even cryptocurrencies use Bloom Filters extensively to speed up internal operations. Clients can query a service for yes/no answers and efficiently cache those answers. Available as a Redis module called ReBloom, this data structure allows you to test whether an item is part of a large set without having to hold that whole set in memory.

Of note, you can specify a false positive error probability anywhere between 0 and 100 percent (extremes excluded), and there are no false negatives. The most important thing to keep in mind about Bloom Filters is that negative answers have zero error probability. In other words, a Bloom Filter always replies either “Probably yes” or “Definitely no.”

When using ReBloom, space usage is inversely related to the target error rate.

In addition, ReBloom supports growing filters, which allows you to dynamically allocate memory for each filter. This is great for situations without a “one size fits all” solution (e.g. keeping track of relationships in a social network, which follows a power law distribution). While the filters do grow automatically, in order to ensure the best space and CPU efficiency, it is important to allocate as accurately as possible when you have a known approximate set size.

To learn how to load modules like ReBloom into Redis, please refer to the official Redis documentation.

Examples:

  • Checking for username availability
  • Fraud detection and mitigation of some types of network attacks
  • Web crawlers that track known URLs

Basic Usage

Once you’ve loaded the ReBloom module, adding an item to a non-existing key will create it for you seamlessly:
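    BF.ADD <key> <item>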

If you want to specify more options, you can use:
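    BF.RESERVE <key> <error_rate> <size>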

This creates a filter named <key> that holds up to <size> items with a target error rate of <error_rate>. The filter will grow automatically once you overflow your original <size> estimation.

Crawling URLs No More Than Once

Let’s say you are running a web crawler and want to make sure you don’t keep re-crawling known URLs ad infinitum every time you encounter them.

As long as you are crawling a single domain, it might not be a problem to keep a list of all the known URLs. However, somewhere between that scope and Google-scale, you might begin wasting too many resources on updating and reading said list (which might not even fit in memory anymore).


The same problem can be solved with a Bloom Filter, such as:
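    BF.ADD crawled-urls "http://example.com/page-1"

(The key name crawled-urls and the URL are just illustrative placeholders; any key and item strings will do.)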

To test whether a URL has already been crawled, you can use:
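    BF.EXISTS crawled-urls "http://example.com/page-1"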

The reply will be either:

  • 0 (Definitely no): this is a new URL, you can crawl it; or
  • 1 (Probably yes): this is most probably a known URL.

In the positive case, it’s up to you whether to accept that there is a small chance of skipping some URLs and move on, or to keep track of these URLs on disk in a system that you can query to get a precise, albeit slower, answer.

How Much Space Does a Bloom Filter Need?

A filter sized to use 100MB can hold up to 100 million unique items with a 2 percent error rate.
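For instance, pre-allocating a filter with roughly those parameters is a one-liner (urls here is just a placeholder key name):

    BF.RESERVE urls 0.02 100000000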

A complete list of commands for ReBloom can be found in the official documentation. On the web, you can find many sites (like this interactive visual calculator) that help calculate storage requirements for Bloom Filters. Bitcoin also implemented a proposal to optimize and obscure client communication using Bloom Filters.

When Bloom Is Not Enough: Cuckoo Filters

Bloom Filters are an amazing, time-tested data structure that fits most needs, but they are not perfect. Their biggest shortcoming is the inability to delete items. Because of the way data is stored inside the filter, once an item is added, it’s impossible to fully separate it from other items, which makes deletion impossible without introducing new errors.

Cuckoo Filters are a more recent alternative. They store information differently, resulting in slightly different performance characteristics and the ability to delete items when needed.

Cuckoo Filters Are Better than Bloom Filters for:

  • Deleting items
  • Faster lookups (because of better memory locality)
  • Space efficiency (when the target error rate is lower than 3%)
  • Faster insertions (while the filter is under an 80% fill rate)

Cuckoo Filters Are Worse than Bloom Filters when:

  • Your fill rate is over 80 percent, in which case Cuckoo Filters’ insertion speed quickly drops below Bloom’s
  • You have more lenient target error rates (greater than 3 percent), making Cuckoo Filters less space efficient
  • You need highly predictable behavior (because Cuckoo Filters use a source of randomness during insertion to deliver performance improvements)

Basic Usage

Cuckoo Filters are also present in ReBloom and can be used in exactly the same way as Bloom Filters. The only differences are that the command prefix is CF instead of BF, and that there is a new command for deletion:
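    CF.DEL <key> <item>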

That’s it. Everything else can be modeled in the same way as with Bloom Filters.

In our previous example (web crawling), we did not need that functionality to properly model the ever-growing list of crawled URLs. But now, we can solve problems in scenarios where items might be removed from a set. For example, pre-checking for stock availability when receiving a shipping order from a client.
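A minimal sketch of that idea, with hypothetical key and item names, could look like this: add an item identifier when it becomes available, check it when an order comes in, and delete it when it goes out of stock.

    CF.ADD in-stock sku:12345
    CF.EXISTS in-stock sku:12345
    CF.DEL in-stock sku:12345

As with Bloom Filters, a positive answer is only probabilistic, so this works as a fast pre-check before querying the authoritative inventory store.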

A full list of commands can be found in ReBloom’s documentation. This data structure was first presented in “Cuckoo Filter: Practically better than Bloom.”

Conclusion

Probabilistic data structures elegantly solve many types of problems that would otherwise require more computational power, cost and development effort. In this article, we presented three useful probabilistic data structures:

  • HyperLogLog (included in Redis) to count elements in a set.
  • Bloom Filter (available in ReBloom) to track elements present or missing in a set.
  • Cuckoo Filter (available in ReBloom) to track elements just like Bloom, but with the additional ability of removing elements from the set.

While these are the most renowned, this is not a complete list, and we fully expect new probabilistic data structures to be invented in the future, just as Cuckoo Filters were in 2014. For this reason, it is worth keeping an eye out for new Redis modules, as they might solve a problem in unexpectedly effective ways.

Feature image via Pixabay.