
    21 Sep 2023 · Software Engineering

    Using Probabilistic Data Structures in Redis


    Redis is an open-source, in-memory data structure store. You can use Redis as a messaging system, thanks to its support for Redis Streams, Pub/Sub, and lists (which can act as a queue). Although it’s primarily an in-memory store, a spectrum of persistence options are available. Transactions and pipeline capabilities enhance performance for scalable applications. Moreover, Lua scripting, server-side functions, and Redis Modules allow for the extension of Redis functionality.

    With Redis, you can also leverage powerful probabilistic data structures like HyperLogLog and Bloom Filter. Probabilistic data structures excel when operations such as counting or membership checks need to be performed over vast amounts of data. These structures are space-efficient and offer a high degree of accuracy. Redis natively supports these data structures, compatible with any Redis client.

    This article provides an overview of probabilistic data structures in Redis and demonstrates their use in Go (though the concepts apply universally). But before delving into these data structures, let’s define probabilistic data structures.

    What Are Probabilistic Data Structures?

    Probabilistic data structures are specialized structures that use probabilistic algorithms to estimate certain properties of the stored data. Unlike traditional data structures, which aim for precise and deterministic results, probabilistic counterparts trade accuracy for efficiency and scalability. They are adept at handling large datasets, where traditional methods might be resource-intensive.

    The appeal of probabilistic data structures is their ability to offer approximate answers with controlled error rates, consuming significantly less memory and computational resources than deterministic alternatives. They thrive in situations where near-perfect results suffice.

    Advantages Over Traditional Data Structures

    Probabilistic data structures have notable advantages, including:

    • Space Efficiency: They require less memory, using probabilistic algorithms and techniques like hash functions and Bloom filters for compact data representation.
    • Speed and Performance: By sacrificing some accuracy, they deliver outstanding speed. They employ probabilistic algorithms that allow for quick processing of extensive datasets.
    • Memory Optimization: In-memory stores like Redis emphasize memory efficiency. Probabilistic data structures, by compressing data, help optimize memory usage, ensuring that Redis can process large datasets without sacrificing performance.
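    To make the space/accuracy trade-off concrete, here is a toy sketch of the idea behind a Bloom filter: a fixed array of bits plus a couple of hash functions stands in for the elements themselves, so membership checks can yield false positives but never false negatives. All names here are illustrative, and this is a conceptual sketch, not the implementation Redis uses:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// tinyBloom is a toy Bloom filter: a boolean slice standing in for a
// bit array, plus two derived hash functions. It can answer
// "definitely not present" or "probably present", but it never stores
// the elements themselves -- that is where the space savings come from.
type tinyBloom struct {
	bits []bool
}

func newTinyBloom(size int) *tinyBloom {
	return &tinyBloom{bits: make([]bool, size)}
}

// positions derives two bit positions for an item using FNV-1a with
// different salts (a common shortcut for simulating multiple hashes).
func (b *tinyBloom) positions(item string) (int, int) {
	h1 := fnv.New64a()
	h1.Write([]byte(item))
	h2 := fnv.New64a()
	h2.Write([]byte("salt:" + item))
	n := uint64(len(b.bits))
	return int(h1.Sum64() % n), int(h2.Sum64() % n)
}

func (b *tinyBloom) Add(item string) {
	p1, p2 := b.positions(item)
	b.bits[p1], b.bits[p2] = true, true
}

func (b *tinyBloom) MightContain(item string) bool {
	p1, p2 := b.positions(item)
	return b.bits[p1] && b.bits[p2]
}

func main() {
	bf := newTinyBloom(1024)
	bf.Add("item1")
	bf.Add("item2")

	// Added items are always reported present (no false negatives).
	fmt.Println(bf.MightContain("item1")) // true
	fmt.Println(bf.MightContain("item2")) // true
	// Unseen items are usually reported absent, but an unlucky hash
	// collision can produce a false positive.
	fmt.Println(bf.MightContain("item42"))
}
```

    Because an item only sets bits and nothing ever clears them, an added item can never be reported missing; the only possible error is a false positive when an unseen item happens to hash onto bits set by others.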

    Overview of Probabilistic Data Structures in Redis

    Here’s a brief rundown of the probabilistic data structures Redis offers:

    HyperLogLog

    Used to estimate set cardinality, HyperLogLog is memory-efficient and can approximate the count of unique elements in large datasets. Practical applications range from web analytics and unique visitor counts to detecting duplicate log entries. Redis commands for HyperLogLog include:

    • PFADD – Add elements to a HyperLogLog.
    • PFCOUNT – Count the number of elements in a HyperLogLog.
    • PFMERGE – Merge multiple HyperLogLogs into a single one.

    Bloom and Cuckoo Filter

    A Bloom filter checks set membership efficiently; a Cuckoo filter is similar but also supports deleting and counting items. Bloom filters typically perform and scale better for insert-heavy workloads, while Cuckoo filters are the better fit when you need deletions.

    Redis commands for the Bloom Filter include:

    • BF.RESERVE – Create a new Bloom filter.
    • BF.ADD – Add element to a Bloom filter.
    • BF.EXISTS – Verify existence of an element.
    • BF.MADD – Add multiple elements to a Bloom filter.
    • BF.INSERT – Add element to a Bloom filter, creating it if it does not exist (combines BF.RESERVE and BF.MADD).

    And for the Cuckoo Filter:

    • CF.RESERVE – Create a new Cuckoo filter.
    • CF.ADD – Add element to a Cuckoo filter.
    • CF.EXISTS – Verify existence of an element (CF.MEXISTS checks several at once).
    • CF.COUNT – Count the number of occurrences of an element.

    Top-K

    Tracks and retrieves the most frequent elements. Suitable for real-time analytics and trending analysis.

    Redis commands for Top-K include:

    • TOPK.RESERVE – Create a new Top-K data structure.
    • TOPK.ADD – Add one or more items to a Top-K data structure, or increment the count if it already exists.
    • TOPK.LIST – Retrieve the top items from a Top-K data structure.
    • TOPK.QUERY – Check for existence of one or more items in a Top-K data structure.
    • TOPK.INCRBY – Increment the count of one or more items in a Top-K data structure.

    Count-min Sketch

    Estimates element frequency in a dataset. Ideal for traffic analysis and frequency counting in various applications.

    Redis commands for Count-Min Sketch include:

    • CMS.INITBYDIM (or CMS.INITBYPROB) – Create a new Count-Min Sketch data structure.
    • CMS.INCRBY – Increment the count of one or more items in a Count-Min Sketch data structure.
    • CMS.QUERY – Return the estimated count of one or more items in a Count-Min Sketch data structure.
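    The mechanics behind those commands are compact enough to sketch directly. The toy version below (illustrative names, not Redis' implementation) keeps a depth × width grid of counters; an increment touches one counter per row, and a query takes the minimum across rows, which is why estimates can overcount on collisions but never undercount:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cms is a toy Count-Min Sketch: depth rows of width counters, each
// row indexed by an independently salted hash of the item.
type cms struct {
	depth, width int
	rows         [][]uint64
}

func newCMS(depth, width int) *cms {
	rows := make([][]uint64, depth)
	for i := range rows {
		rows[i] = make([]uint64, width)
	}
	return &cms{depth: depth, width: width, rows: rows}
}

// index salts the hash with the row number so each row distributes
// items independently.
func (c *cms) index(row int, item string) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", row, item)
	return int(h.Sum64() % uint64(c.width))
}

func (c *cms) IncrBy(item string, n uint64) {
	for r := 0; r < c.depth; r++ {
		c.rows[r][c.index(r, item)] += n
	}
}

// Query returns the minimum counter across rows: collisions only ever
// inflate counters, so the minimum is the tightest available estimate
// and is always >= the true count.
func (c *cms) Query(item string) uint64 {
	min := ^uint64(0)
	for r := 0; r < c.depth; r++ {
		if v := c.rows[r][c.index(r, item)]; v < min {
			min = v
		}
	}
	return min
}

func main() {
	sketch := newCMS(4, 1024)
	sketch.IncrBy("item1", 1)
	sketch.IncrBy("item2", 2)
	sketch.IncrBy("item3", 3)
	fmt.Println(sketch.Query("item1"), sketch.Query("item2"), sketch.Query("item3"))
}
```

    The grid size is fixed up front regardless of how many distinct items flow through, which is exactly the trade CMS.INITBYDIM and CMS.INITBYPROB let you make between memory and error.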

    Using Redis Probabilistic Data Structures with the go-redis Client

    Now that we've covered the probabilistic data structures Redis offers, let's see how to use them from Go.

    First, make sure you have a recent version of Go installed. The data structures above are all part of Redis Stack, which bundles Redis with related tools and modules. You can start a local Redis Stack instance with Docker:

    docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

    Create a new Go module and add the required dependencies:

    go mod init go-redis-probabilistic-data-structures
    go get github.com/redis/go-redis/v9

    HyperLogLog

    Since HyperLogLog is a core Redis data structure, its commands are supported directly via functions in the go-redis library.

    The example below demonstrates how to:

    • Use PFAdd to add elements to a HyperLogLog, and
    • Use PFCount to count the approximate number of unique elements in a HyperLogLog.
    package main
    
    import (
    	"context"
    	"fmt"
    	"log"
    
    	"github.com/redis/go-redis/v9"
    )
    
    func main() {
    	client := redis.NewClient(&redis.Options{
    		Addr: "localhost:6379",
    	})
    
    	ctx := context.Background()
    
    	pong, err := client.Ping(ctx).Result()
    	if err != nil {
    		log.Fatalf("Failed to connect to Redis: %s", err)
    	}
    	fmt.Println("Connected to Redis:", pong)
    
    	hllName := "test-hll"
    
    	// Add elements to HyperLogLog
    	err = client.PFAdd(ctx, hllName, "item1", "item2", "item3", "item3", "item3").Err()
    	if err != nil {
    		log.Fatalf("Failed to add elements to HyperLogLog: %s", err)
    	}
    	fmt.Println("Elements added to HyperLogLog")
    
    	// Count the approximate number of unique elements in HyperLogLog
    	count, err := client.PFCount(ctx, hllName).Result()
    	if err != nil {
    		log.Fatalf("Failed to count unique elements in HyperLogLog: %s", err)
    	}
    	fmt.Println("Approximate number of unique elements in HyperLogLog:", count)
    
    	// Close the Redis client connection
    	err = client.Close()
    	if err != nil {
    		log.Fatalf("Failed to close Redis client: %s", err)
    	}
    }
    

    To run the above code, copy it to a file named main.go and execute the following command:

    go run main.go

    You should see the following output:

    Connected to Redis: PONG
    Elements added to HyperLogLog
    Approximate number of unique elements in HyperLogLog: 3

    Notice that we added five elements to the HyperLogLog, but the approximate number of unique elements is three.

    Bloom Filter

    Although dedicated libraries such as redisbloom-go and rueidis support these data structures natively, you can also execute Bloom filter commands through the go-redis library's generic Do function.

    The example below demonstrates how to:

    • Use BF.RESERVE to create a Bloom filter,
    • BF.ADD to add an element to a Bloom filter,
    • BF.EXISTS to check if an element exists in a Bloom filter, and
    • BF.MADD to add multiple elements to a Bloom filter.
    package main
    
    import (
    	"context"
    	"fmt"
    	"log"
    
    	"github.com/redis/go-redis/v9"
    )
    
    func main() {
    	client := redis.NewClient(&redis.Options{
    		Addr: "localhost:6379",
    	})
    
    	ctx := context.Background()
    
    	pong, err := client.Ping(ctx).Result()
    	if err != nil {
    		log.Fatalf("Failed to connect to Redis: %s", err)
    	}
    	fmt.Println("Connected to Redis:", pong)
    
    	bloomFilterName := "test-bloom"
    
    	result, err := client.Do(ctx, "BF.RESERVE", bloomFilterName, "0.01", "1000").Result()
    	if err != nil {
    		log.Fatalf("BF.RESERVE error: %s", err)
    	}
    	log.Println("Bloom filter created", result)
    
    	result, err = client.Do(ctx, "BF.ADD", bloomFilterName, "item1").Result()
    	if err != nil {
    		log.Fatalf("BF.ADD error: %s", err)
    	}
    	log.Println("Added element to Bloom filter")
    
    	result, err = client.Do(ctx, "BF.EXISTS", bloomFilterName, "item1").Result()
    	if err != nil {
    		log.Fatalf("BF.EXISTS error: %s", err)
    	}
    	log.Println("Does element item1 exist in Bloom filter?", result)
    
    	_, err = client.Do(ctx, "BF.MADD", bloomFilterName, "item2", "item3").Result()
    	if err != nil {
    		log.Fatalf("BF.MADD error: %s", err)
    	}
    	log.Println("Added elements to Bloom filter")
    
    	result, err = client.Do(ctx, "BF.EXISTS", bloomFilterName, "item2").Result()
    	if err != nil {
    		log.Fatalf("BF.EXISTS error: %s", err)
    	}
    	log.Println("Does item2 exist in Bloom filter?", result)
    
    	result, err = client.Do(ctx, "BF.EXISTS", bloomFilterName, "item3").Result()
    	if err != nil {
    		log.Fatalf("BF.EXISTS error: %s", err)
    	}
    	log.Println("Does item3 exist in Bloom filter?", result)
    
    	result, err = client.Do(ctx, "BF.EXISTS", bloomFilterName, "item42").Result()
    	if err != nil {
    		log.Fatalf("BF.EXISTS error: %s", err)
    	}
    	log.Println("Does item42 exist in Bloom filter?", result)
    
    	// Close the Redis client connection
    	err = client.Close()
    	if err != nil {
    		log.Fatalf("Failed to close Redis client: %s", err)
    	}
    }

    To run the above code, copy it to a file named main.go and execute the following command:

    go run main.go

    You should see the following output:

    Connected to Redis: PONG
    Bloom filter created OK
    Added element to Bloom filter
    Does element item1 exist in Bloom filter? 1
    Added elements to Bloom filter
    Does item2 exist in Bloom filter? 1
    Does item3 exist in Bloom filter? 1
    Does item42 exist in Bloom filter? 0

    Notice that we added three elements to the Bloom filter, and that item42, which was never added, was correctly reported as missing.

    Top-K

    Just like the Bloom filter example, we use the go-redis library to execute Top-K commands with the generic Do function.

    The example below demonstrates how to:

    • Use TOPK.RESERVE to create a Top-K data structure,
    • TOPK.ADD to add elements to a Top-K data structure,
    • TOPK.INCRBY to increment the count of elements in a Top-K data structure, and
    • TOPK.LIST to list the elements in a Top-K data structure.
    package main
    
    import (
    	"context"
    	"fmt"
    	"log"
    
    	"github.com/redis/go-redis/v9"
    )
    
    func main() {
    	client := redis.NewClient(&redis.Options{
    		Addr: "localhost:6379",
    	})
    
    	ctx := context.Background()
    
    	pong, err := client.Ping(ctx).Result()
    	if err != nil {
    		log.Fatalf("Failed to connect to Redis: %s", err)
    	}
    	fmt.Println("Connected to Redis:", pong)
    
    	topkName := "test-topk"
    
    	_, err = client.Do(ctx, "TOPK.RESERVE", topkName, 3).Result()
    	if err != nil {
    		log.Fatalf("TOPK.RESERVE error: %s", err)
    	}
    	log.Println("Created topk")
    
    	_, err = client.Do(ctx, "TOPK.ADD", topkName, "item1", "item2", "item3", "item4", "item5").Result()
    	if err != nil {
    		log.Fatalf("TOPK.ADD error: %s", err)
    	}
    	log.Println("Added elements to topk")
    
    	_, err = client.Do(ctx, "TOPK.INCRBY", topkName, "item1", 2, "item2", 42, "item3", 10, "item4", 15).Result()
    	if err != nil {
    		log.Fatalf("TOPK.INCRBY error: %s", err)
    	}
    	log.Println("Increment topk element counts")
    
    	r, err := client.Do(ctx, "TOPK.LIST", topkName, "WITHCOUNT").Result()
    	if err != nil {
    		log.Fatalf("TOPK.LIST error: %s", err)
    	}
    
    	var items = r.([]interface{})
    
    	log.Println("listed elements to topk - ", items)
    
    	// Close the Redis client connection
    	err = client.Close()
    	if err != nil {
    		log.Fatalf("Failed to close Redis client: %s", err)
    	}
    }

    To run the above code, copy it to a file named main.go and execute the following command:

    go run main.go

    You should see the following output:

    Connected to Redis: PONG
    Created topk
    Added elements to topk
    Increment topk element counts
    listed elements to topk -  [item2 43 item4 16 item3 11]

    Notice how we added five elements to the Top-K data structure and then incremented the counts of four of them. The TOPK.LIST command returns only the three elements with the highest counts, since we specified a size of three when creating the Top-K structure.

    Count-min Sketch

    In this example, we use the generic Do function to execute commands to create a Count-min Sketch data structure, increment the count of four elements, and then query the count of those elements.

    package main
    
    import (
    	"context"
    	"fmt"
    	"log"
    
    	"github.com/redis/go-redis/v9"
    )
    
    func main() {
    	client := redis.NewClient(&redis.Options{
    		Addr: "localhost:6379",
    	})
    
    	ctx := context.Background()
    
    	pong, err := client.Ping(ctx).Result()
    	if err != nil {
    		log.Fatalf("Failed to connect to Redis: %s", err)
    	}
    	fmt.Println("Connected to Redis:", pong)
    
    	cmsName := "test-cms"
    
    	_, err = client.Do(ctx, "CMS.INITBYPROB", cmsName, 0.001, 0.01).Result()
    	if err != nil {
    		log.Fatalf("CMS.INITBYPROB error: %s", err)
    	}
    	log.Println("Created CMS")
    
    	_, err = client.Do(ctx, "CMS.INCRBY", cmsName, "item1", 1, "item2", 2, "item3", 3, "item4", 4).Result()
    	if err != nil {
    		log.Fatalf("CMS.INCRBY error: %s", err)
    	}
    	log.Println("Increment CMS element counts")
    
    	r, err := client.Do(ctx, "CMS.QUERY", cmsName, "item1", "item2", "item3", "item4").Result()
    	if err != nil {
    		log.Fatalf("CMS.QUERY error: %s", err)
    	}
    
    	var items = r.([]interface{})
    
    	log.Println("listed elements count from CMS -", items)
    
    	// Close the Redis client connection
    	err = client.Close()
    	if err != nil {
    		log.Fatalf("Failed to close Redis client: %s", err)
    	}
    }
    To run the above code, copy it to a file named main.go and execute the following command:

    go run main.go

    You should see the following output:

    Connected to Redis: PONG
    Created CMS
    Increment CMS element counts
    listed elements count from CMS - [1 2 3 4]

    Notice how we added four elements to the Count-min Sketch data structure and then queried their counts. The CMS.QUERY command returns the estimated count of each element; with such a small dataset there are no hash collisions, so the estimates match the exact counts we incremented.

    Caveats and Limitations

    Although these probabilistic data structures are potent, it’s essential to keep the following in mind:

    1. Understand the Trade-offs: Probabilistic data structures prioritize efficiency and scalability over perfect accuracy. Recognize the inherent limitations, including the potential for false positives or errors. Select the right data structure by considering both your use case and the margin of error you’re willing to accept.
    2. Determine Parameters with Care: Many probabilistic data structures require setting parameters, such as precision, error rates, or memory allocation. Assess your particular needs, weighing the trade-offs between accuracy and memory consumption. Fine-tune these parameters to find a balance fitting for your application.
    3. Anticipate Scalability Needs: While probabilistic data structures excel with large datasets and scale efficiently, it’s crucial to keep an eye on performance and resource consumption as your data expands. If necessary, plan for horizontal scaling by distributing the data across multiple Redis instances or utilizing Redis Cluster.
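    As an illustration of point 2, the standard Bloom filter sizing formulas let you translate a target capacity and error rate into concrete memory numbers before calling BF.RESERVE. The formulas are the textbook ones; the capacity and error rate below are just example values:

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams returns the bit count m and hash-function count k for a
// Bloom filter sized to hold n items at false-positive rate p, using
// the standard formulas m = -n*ln(p)/ln(2)^2 and k = (m/n)*ln(2).
func bloomParams(n, p float64) (m, k int) {
	mf := math.Ceil(-n * math.Log(p) / (math.Ln2 * math.Ln2))
	kf := math.Round(mf / n * math.Ln2)
	return int(mf), int(kf)
}

func main() {
	// Example: 1 million items at a 1% false-positive rate.
	m, k := bloomParams(1_000_000, 0.01)
	fmt.Printf("bits: %d (~%.2f MiB), hash functions: %d\n",
		m, float64(m)/8/1024/1024, k)
	// prints: bits: 9585059 (~1.14 MiB), hash functions: 7
}
```

    Running the numbers like this before reserving a filter makes the accuracy-versus-memory trade-off explicit: tightening the error rate from 1% to 0.1% costs roughly 50% more bits.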

    Conclusion

    This article introduced you to probabilistic data structures, elucidated their support in Redis, and explained their implementation in Go applications. When chosen and configured aptly—and with their trade-offs in mind—these data structures can be invaluable assets.
