What are Bloom Filters - Basic

A Bloom filter is a probabilistic data structure that is used to test whether an element is a member of a set.
Bloom filter can tell us with high accuracy if an element exists in group of things.
query Bloom filters first before checking a database, thus potentially preventing a disk read

Algorithm -

An empty Bloom filter is a bit array of m bits, all set to 0.

# Empty Bloom filter with m=10
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

There must be k different hash functions defined,
each of which hashes some set element to one of the m array indexes.

"Hello" -> fn(HashFunction) -> 4

To add an element, feed element to all of the k hash functions
to get k array indexes. Set the bits at these indexes to 1.

# For example, Bloom filter with three hash functions, k=3

"Hello" -> fn(HashFunction1) -> 2
"Hello" -> fn(HashFunction2) -> 4
"Hello" -> fn(HashFunction3) -> 9

# Resulting Bloom filter array becomes
[0, 0, 1, 0, 1, 0, 0, 0, 1, 0]

To query for an element or to test if element is present in the set,
- feed the element to all the k hash functions to get k indexes.
- check whether the k indexes are set to 1.
- If any of the bits at these indexes is 0, the element is 100% not in the set
- And if all the elements are set to 1 it either means the element exists in the set or those bits were set by some other inputs.[False Positives]
Thus, a Bloom filter is really effective at determining if an element is not a member of a set.
Even though Bloom filters returns false positives,
- It offers very compact storage, less than 10 bits per element are required for a 1% false positive probability
- independent of the size or number of elements in the set.

Bloom filter test case

install bloomfilter library - pypi.org/project/pybloomfiltermmap3

pip install pybloomfiltermmap3

Python test code,

import uuid
import pybloomfilter

unique_ids = [uuid.uuid4() for i in range(0, 1_000_000)]
bloom_filter = pybloomfilter.BloomFilter(capacity=1_000_000, error_rate=0.01, filename='filter.bloom')

def get_bloom_filter_stats():
    start = time.time()
    false_positives = 0
    for i, element in enumerate(unique_ids):
        if element in bloom_filter:
            print(f'false positive for id={element} at index={i}')
            false_positives += 1
        bloom_filter.add(element)
    elapsed_time = time.time() - start

    return {'execution_time': elapsed_time, 'false_positives': false_positives}

get_bloom_filter_stats()

OUTPUT

{'execution_time': 1.0306200981140137, 'false_positives': 1810} bloom_filter.error_rate 0.01

Redis test case

install redis

pip install redis

$ redis-server

Redis test case

from redis import StrictRedis
redis_client = StrictRedis().from_url('redis://localhost:6379')

def get_redis_execution_time():
    start = time.time()
    for element in unique_ids:
        redis_client.get(str(element))
    elapsed_time = time.time() - start
    return elapsed_time

get_redis_execution_time()

OUTPUT

execution_time: 68.12501311302185

Final Points

So redis took 68 secs to iterate through 1M rows, where as bloom filters took only 1 sec.
SQL database is usually slower than redis, so did not test that.
Which means for key existence checks, a Bloom filter is at least 68x faster than a SQL database/redis.
Of course, there were some trade-offs: the false positives and the Bloom filter file size (~1MB/1M records)
However, with an error rate lower than 1%, it is definitely worth it.

Reference - levelup.gitconnected.com/eli5-what-the-f-is..

Day 25/100

What are Bloom Filters - Basic

Algorithm -

Bloom filter test case

Redis test case

Final Points