Day 25/100
What are Bloom Filters - Basic
- A Bloom filter is a probabilistic data structure that is used to test whether an element is a member of a set.
- Bloom filter can tell us with high accuracy if an element exists in group of things.
- query Bloom filters first before checking a database, thus potentially preventing a disk read
Algorithm -
An empty Bloom filter is a bit array of m bits, all set to 0.
# Empty Bloom filter with m=10
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- There must be k different hash functions defined,
- each of which hashes some set element to one of the m array indexes.
"Hello" -> fn(HashFunction) -> 4
- To add an element, feed element to all of the k hash functions
- to get k array indexes. Set the bits at these indexes to 1.
# For example, Bloom filter with three hash functions, k=3
"Hello" -> fn(HashFunction1) -> 2
"Hello" -> fn(HashFunction2) -> 4
"Hello" -> fn(HashFunction3) -> 9
# Resulting Bloom filter array becomes
[0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
To query for an element or to test if element is present in the set,
- feed the element to all the k hash functions to get k indexes.
- check whether the k indexes are set to 1.
- If any of the bits at these indexes is 0, the element is 100% not in the set
- And if all the elements are set to 1 it either means the element exists in the set or those bits were set by some other inputs.[False Positives]
Thus, a Bloom filter is really effective at determining if an element is not a member of a set.
- Even though Bloom filters returns false positives,
- It offers very compact storage, less than 10 bits per element are required for a 1% false positive probability
- independent of the size or number of elements in the set.
Bloom filter test case
install bloomfilter library - pypi.org/project/pybloomfiltermmap3
pip install pybloomfiltermmap3
Python test code,
import uuid
import pybloomfilter
unique_ids = [uuid.uuid4() for i in range(0, 1_000_000)]
bloom_filter = pybloomfilter.BloomFilter(capacity=1_000_000, error_rate=0.01, filename='filter.bloom')
def get_bloom_filter_stats():
start = time.time()
false_positives = 0
for i, element in enumerate(unique_ids):
if element in bloom_filter:
print(f'false positive for id={element} at index={i}')
false_positives += 1
bloom_filter.add(element)
elapsed_time = time.time() - start
return {'execution_time': elapsed_time, 'false_positives': false_positives}
get_bloom_filter_stats()
OUTPUT
{'execution_time': 1.0306200981140137, 'false_positives': 1810} bloom_filter.error_rate 0.01
Redis test case
install redis
pip install redis
$ redis-server
Redis test case
from redis import StrictRedis
redis_client = StrictRedis().from_url('redis://localhost:6379')
def get_redis_execution_time():
start = time.time()
for element in unique_ids:
redis_client.get(str(element))
elapsed_time = time.time() - start
return elapsed_time
get_redis_execution_time()
OUTPUT
execution_time: 68.12501311302185
Final Points
- So redis took 68 secs to iterate through 1M rows, where as bloom filters took only 1 sec.
- SQL database is usually slower than redis, so did not test that.
- Which means for key existence checks, a Bloom filter is at least 68x faster than a SQL database/redis.
- Of course, there were some trade-offs: the false positives and the Bloom filter file size (~1MB/1M records)
- However, with an error rate lower than 1%, it is definitely worth it.
Reference - levelup.gitconnected.com/eli5-what-the-f-is..