Structuring the Unstructured: Bloom Filters

What Are Bloom Filters

These data structures are like sets. You can insert items into the bloom filter. Once they are in the set you can check if a given item is in the set. However, this is an imperfect check if it is in the bloom filter. If the bloom filter says that an item is not in the bloom filter, then the item was certainly not put into the set. If the bloom filter says the item is in it, then that item may or may not have been inserted. So in short membership can false positive, but cannot false negative. You can somewhat control the rate of false positives by changing some of the input parameters.

The basic implementation of a bloom filter however does not have a removal.

Why Would I Use A Bloom Filter

Both inserting and set membership checks are fast, really fast. In fact both of these take a constant amount of time. So you can use the bloom filter as a gate keeper. You can use it to check if you should try a more computational intensive check.

As a result it is predominantly used to cache website requests and for recommendation systems. Both of these cases need to quick responses, and failures are not important or can be corrected when they occasionally occur.

Basic Concept

First we need `k` hashes. We can use this to create a fingerprint for items. Say we wish to insert the string "Hello, World!", we would use each hash to produce an integer. Then we would take these integers and find the remainder of a division by `m`. Below are some examples where m = 7 and k = 2, if a cell is grey than it was marked by a hash.

While some items may have the same fingerprint, conflicts should not be too common if we increase the size of the fingerprint `m`.

Lets say we now we just put items 1 - 3 fingerprints on top of one another. We get the following.

Bloom Filter

This is the bloom filter. Now if we check if item 1 is in the bloom filter we can see that everywhere it has a mark, there is a mark in the bloom filter. So it might have been in filter. However item 4 (shown below), has a mark in a location that does not exist in the filter. So it cannot have been used to construct the filter. However, item 5 would appear to be in the bloom filter.

By changing the size of the fingerprint and the number of marks, you can change the rate of false positives.

Building the fingerprint

One problem we have is that we need to build a fingerprint up. For this problem we use many hashes. Let `k` be the number of independent hashes we shall use, let `m` be the size of the fingerprint. Then for item `x` the fingerprint marks are at [hash_1(x) mod m, hash_2(x) mod m, ... hash_k(x) mod m].

We need `k` independent hashes though. This is extremely difficult for most applications. Luckily, there is a way to generate some pretty independent hashes once we have two or three. A fast non-standard hash is Murmur3 which can be seen in [3].

Modified Double Hashing

This method uses 2 initial independent hashes and generates more. This method was shown to be almost as effective as triple hashing in [1].

	def double_hash(delta, k, m):
	x = hash1(delta) % m
	y = hash2(delta) % m
	f = np.zeros(k, 'int')
	f[0] = 0

	for i in range(1, k):
	x = (x + y) % m
	y = (y + i) % m
	f[i] = x

	return f

view raw double_hash.py hosted with ❤ by GitHub

Triple Hashing

This method produces very good results but requires 3 independent hashes.

def triple_hash(delta, k, m):

x = hash1(delta)

y = hash2(delta)

z = hash3(delta)

f = np.zeros(k, 'int')

f[0] = 0

for i in range(1, k):

x = (x + y) % m

y = (y + z) % m

f[i] = x

return f

view raw triple_hash.py hosted with ❤ by GitHub

Coding Bloom Filter

The implementation of the bloom filter ends up being fairly simple once you have your hash function generator.

class Bloom:

def init(self, m, k):

self.info = np.zeros(m, dtype='bool')

self.k = k

def insert(self, x):

f = double_hash(x, self.k, self.info.shape[0])

self.info[f] = True

def is_in(self, x):

f = double_hash(x, self.k, self.info.shape[0])

return np.all(self.info[f])

def str(self):

return f'Bloom(info={self.info}, k={self.k})'

view raw bloom.py hosted with ❤ by GitHub

Choosing Efficient Parameters

One thing that becomes a question is what are good parameters for k and m. Let n be the number of elements inserted into the bloom filter. An estimate of the best values can be found. First the rate of false positives are given approximately by

It can also be shown that to minimize m for a given false positive rate the parameters can be approximated by

These are usually good suggestions for hyper-parameters. We see that k only depends on the false positive rate. The plot for the suggest k is shown below

As we can see the number of hashes required drops very quickly with our desired false positive rate. However, we can produce large number of hashes from only two or three starting hashes. So we can use a low false positive rate if we want.

The size of the bloom filter is however suggested to be linearly related to the number of items we wish to put in the filter. It also grows as we lower our false positive rate. In terms of memory consumption the bloom filter won't be an improvement for space. However, for larger items like strings or complex data this can save a large amount of space. This is due to the fact the space requirements doesn't depend on the size of the data structure.

References

[1] Peter C. Dillinger and Panagiotis Manolios, Bloom Filters in Probabilistic Verification (http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf)

[2] Mitzenmacher, Michael; Upfal, Eli (2005), Probability and computing: Randomized algorithms and probabilistic analysis, Cambridge University Press, pp. 107–112, ISBN 9780521835404 (https://books.google.com/books?id=0bAYl6d7hvkC&pg=PA110#v=onepage&q&f=false)

[3] https://github.com/aappleby/smhasher/wiki/MurmurHash3

Structuring the Unstructured

Thursday, June 28, 2018

Bloom Filters