Thursday, June 28, 2018

Bloom Filters

What Are Bloom Filters

These data structures are like sets: you can insert items into a bloom filter, and you can then check whether a given item is in it. However, this membership check is imperfect. If the bloom filter says an item is not in it, then the item was certainly never inserted. If the bloom filter says the item is in it, then the item may or may not have been inserted. In short, membership checks can produce false positives but never false negatives. You can control the rate of false positives by tuning the input parameters.

The basic implementation of a bloom filter, however, does not support removal.

Why Would I Use A Bloom Filter

Both insertion and membership checks are fast, really fast. In fact, both take a constant amount of time. So you can use the bloom filter as a gatekeeper: a cheap check for whether you should bother with a more computationally intensive one.
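A sketch of this gatekeeper pattern in Python; the `bloom` and `slow_store` objects here are hypothetical stand-ins for a bloom filter and an expensive backing store:

```python
def lookup(key, bloom, slow_store):
    if key not in bloom:        # constant-time check, no false negatives
        return None             # definitely absent, so skip the expensive lookup
    return slow_store.get(key)  # a false positive can still miss here
```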

As a result it is commonly used for caching website requests and for recommendation systems. Both of these use cases need quick responses, and occasional failures are either unimportant or can be corrected after the fact.

Basic Concept

First we need `k` hashes. We can use these to create a fingerprint for items. Say we wish to insert the string "Hello, World!". We would use each hash to produce an integer, then take each integer modulo `m`. Below are some examples where `m = 7` and `k = 2`; if a cell is grey then it was marked by a hash.



While some items may have the same fingerprint, conflicts should not be too common if we increase the size of the fingerprint `m`.
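As a rough sketch of the fingerprinting step in Python, with `m = 7` and `k = 2` as above. Note that salting a single cryptographic hash per index is only a stand-in for `k` truly independent hashes:

```python
import hashlib

def mark_positions(item, m=7, k=2):
    """Compute the k fingerprint marks for an item, each in [0, m)."""
    marks = []
    for i in range(k):
        # Salt the hash differently per index to imitate independent hashes.
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        marks.append(int.from_bytes(digest[:8], "big") % m)
    return marks

print(mark_positions("Hello, World!"))  # two indices in 0..6
```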

Let's say we now overlay the fingerprints of items 1 - 3 on top of one another. We get the following.

Bloom Filter


This is the bloom filter. Now if we check whether item 1 is in the bloom filter, we can see that everywhere it has a mark, there is a mark in the bloom filter. So it might be in the filter. However, item 4 (shown below) has a mark in a location that is not marked in the filter, so it cannot have been used to construct the filter. Item 5, on the other hand, would appear to be in the bloom filter even if it was never inserted.


By changing the size of the fingerprint and the number of marks, you can change the rate of false positives.

Building the Fingerprint

One problem is that we need a way to build the fingerprint up. For this we use many hashes. Let `k` be the number of independent hashes we shall use, and let `m` be the size of the fingerprint. Then for an item `x` the fingerprint marks are at [hash_1(x) mod m, hash_2(x) mod m, ..., hash_k(x) mod m].

We need `k` independent hashes, though, and producing them is extremely difficult for most applications. Luckily, there is a way to generate reasonably independent hashes once we have two or three. A fast non-cryptographic hash is Murmur3, which can be found in [3].

Modified Double Hashing

This method uses 2 initial independent hashes and generates more from them. It was shown to be almost as effective as triple hashing in [1].
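As a sketch, one iterative formulation of this idea (in the spirit of [1]; the exact correction term varies between presentations). Here `h1` and `h2` are the two independent hash values already computed for the item:

```python
def double_hash_marks(h1, h2, k, m):
    """Derive k mark positions from two independent hash values."""
    a, b = h1 % m, h2 % m
    marks = []
    for i in range(k):
        marks.append(a)
        a = (a + b) % m  # step by h2 plus the accumulated correction
        b = (b + i) % m  # grow the correction a little each iteration
    return marks
```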

Triple Hashing

This method produces very good results but requires 3 independent hashes.
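One common formulation, though presentations differ in the exact polynomial, treats the three hash values as coefficients of a quadratic in the index:

```python
def triple_hash_marks(h1, h2, h3, k, m):
    # The i-th "hash" is a quadratic polynomial in i whose
    # coefficients are the three independent hash values.
    return [(h1 + i * h2 + i * i * h3) % m for i in range(k)]
```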

Coding a Bloom Filter

The implementation of the bloom filter ends up being fairly simple once you have your hash function generator.
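Below is a minimal sketch of such an implementation in Python. It substitutes salted SHA-256 plus plain double hashing for the Murmur3-based generator discussed above:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                # size of the fingerprint
        self.k = k                # number of hashes
        self.bits = [False] * m

    def _marks(self, item):
        # Two base hashes, extended to k marks via double hashing.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._marks(item):
            self.bits[p] = True

    def __contains__(self, item):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits[p] for p in self._marks(item))

bf = BloomFilter(m=1000, k=7)
bf.add("Hello, World!")
print("Hello, World!" in bf)    # True
print("Goodbye, World!" in bf)  # almost certainly False
```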

Choosing Efficient Parameters

A natural question is what good values for `k` and `m` are. Let `n` be the number of elements inserted into the bloom filter. An estimate of the best values can be found. First, the rate of false positives `p` is given approximately by

p ≈ (1 - e^(-kn/m))^k

It can also be shown that, to minimize `m` for a given false positive rate `p`, the parameters can be approximated by

m = -n ln(p) / (ln 2)^2
k = (m/n) ln 2 = -log2(p)

These are usually good suggestions for the hyper-parameters. We see that `k` depends only on the false positive rate. The plot of the suggested `k` is shown below.

As we can see, the number of hashes required grows only slowly as the desired false positive rate shrinks. Since we can produce a large number of hashes from only two or three starting hashes, we can afford a low false positive rate if we want.

The size of the bloom filter, however, is suggested to grow linearly with the number of items we wish to put in the filter, and it also grows as we lower the false positive rate. For small items, the bloom filter won't be an improvement in space. However, for larger items like strings or complex data it can save a large amount of space, because the space requirement doesn't depend on the size of the items stored.
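These approximations are easy to turn into a small sizing helper; a sketch:

```python
from math import ceil, log

def bloom_parameters(n, p):
    """Suggested (m, k) for n items and a target false positive rate p."""
    m = ceil(-n * log(p) / log(2) ** 2)
    k = ceil(-log(p) / log(2))  # equivalently, round((m / n) * log(2))
    return m, k

print(bloom_parameters(1_000_000, 0.01))  # about 9.6 million bits and 7 hashes
```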

References

[1] Peter C. Dillinger and Panagiotis Manolios, "Bloom Filters in Probabilistic Verification" (http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf)

[2] Michael Mitzenmacher and Eli Upfal (2005), Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, pp. 107–112, ISBN 9780521835404 (https://books.google.com/books?id=0bAYl6d7hvkC&pg=PA110#v=onepage&q&f=false)

[3] MurmurHash3 (https://github.com/aappleby/smhasher/wiki/MurmurHash3)

Tuesday, November 21, 2017

Coloring With Random Forests


Single Trees

Random forests are typically described in terms of trees, but they can also be described as partitioning space. To explore this representation, let's generate some data: the location and color of many points, plotted below. This data has a bright spot near each corner, and we will see that random forests are able to identify this structure.



One method to predict the color of a new point at a specific location would be to use a random forest. We start by looking at the behavior of a single tree of the forest. A random forest tree partitions the data into small buckets. For the data we have been given, the tree draws the partitions shown below.



The random forest then finds the average color of each partition. Below I have drawn the plot with the average color of each partition; I will call this the trained partition plot. It shows what the random forest tree will predict for every point.


If we wanted to predict a new point, we would simply place it on the trained partition plot and see what color is located there. For instance, the point (0.5, 0.7) is plotted on the trained partition plot below; we can see that the random forest tree will predict a medium red.
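A minimal sketch of this single-tree fit using scikit-learn. The data generation here is a made-up stand-in for the post's data (bright spots near the corners), not the actual code used for the plots:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: color is brightest near the four corners of the unit square.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
nearest = np.linalg.norm(X[:, None, :] - corners[None, :, :], axis=2).min(axis=1)
color = np.exp(-5 * nearest)

# A single tree partitions the plane and predicts a mean per partition.
tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X, color)
print(tree.predict([[0.5, 0.7]]))  # the color predicted for the new point
```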


Entire Forests

Now we want to build an entire forest of trees. We give a sample of the data to each tree and train each tree individually; for our data, each tree was given 50% of the data. We can see below how each individual tree trains on its sample. An individual tree does not necessarily look like a good fit to the original data, but combined together the trees describe the data much better.

Now that we have trained each tree, we need to combine all of the trees together. To do this we average the predictions of the trees, which results in a single trained partition plot. For our random forest this produces the following trained partition plot. As we can see, the random forest predicts a fairly reasonable description of the original data.
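A hand-rolled sketch of this procedure, reusing the toy `X` and `color` from the earlier snippet (the post itself uses a modified version of Jeremy Howard's implementation):

```python
def train_forest(X, y, n_trees=10, sample_frac=0.5, seed=0):
    """Train each tree on its own random 50% sample of the data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=False)
        trees.append(DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X_new):
    """Combine the trees by averaging their predictions."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)

trees = train_forest(X, color)
print(predict_forest(trees, [[0.5, 0.7]]))
```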


Finding the Partitions

Random forests need a way to determine the partitions. We will use the original data to show how a tree decides on its splits.

Each tree is trained independently, so we only need to understand how a single tree is trained. The tree searches for the "best" horizontal or vertical splitting line. To determine the "best", we need a way to score a particular split (the score is usually the mean squared error, MSE): we take the mean of each resulting partition as its prediction and score that model. We then pick the horizontal or vertical line that reduces the score the most. For our points, the "best" split is shown below.
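A brute-force sketch of this search, scoring each candidate split by the summed squared error around the two partition means:

```python
import numpy as np

def best_split(X, y):
    """Return the (feature, threshold) whose split minimizes the total
    squared error around the two partition means."""
    best_feature, best_threshold, best_score = None, None, np.inf
    for feature in range(X.shape[1]):
        # Try every observed value except the largest as a threshold.
        for threshold in np.unique(X[:, feature])[:-1]:
            mask = X[:, feature] <= threshold
            left, right = y[mask], y[~mask]
            score = ((left - left.mean()) ** 2).sum() \
                  + ((right - right.mean()) ** 2).sum()
            if score < best_score:
                best_feature, best_threshold, best_score = feature, threshold, score
    return best_feature, best_threshold, best_score
```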


The random forest then takes each smaller partition and repeats the same process on both of them, recursing until there are very few points in each partition. The next level of splits for our data is shown below.



And at the end the partitions look like this.




We now just find the mean of each partition, and that gives us our final trained model.
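Putting the pieces together, a sketch of the full recursion using the `best_split` helper above: split until the partitions are small, then store each partition's mean as its prediction:

```python
def build_tree(X, y, min_leaf=5):
    """Recursively split until few points remain; leaves hold the mean."""
    if len(y) <= min_leaf:
        return {"mean": y.mean()}
    feature, threshold, _ = best_split(X, y)
    if feature is None:  # no valid split left
        return {"mean": y.mean()}
    mask = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], min_leaf),
            "right": build_tree(X[~mask], y[~mask], min_leaf)}

def predict_tree(node, point):
    """Walk down to a leaf and return its stored mean."""
    while "mean" not in node:
        side = "left" if point[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["mean"]
```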



Code

This uses a slightly modified version of Jeremy Howard's random forest implementation, found here. All code used to generate the plots can be found here.