This second edition of the technical section of my blog is going to cover another data structure that I found fascinating and if used in the right way can be an absolute boon for developers. The best thing about data structures is how finding the right one to solve a problem is like finding the exact piece while solving a puzzle – they fit just right and sit perfectly to complete the big picture. 😀
The data structure in question today is the bloom filter.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. What is meant by probabilistic is that you can use this to query if an element is either “definitely not in the set” or “possibly in the set.” In a world dominated by Big Data and Data Science, it would be very useful if you could make this query to a very large data set very fast and without having to store all the information (i.e. time and space efficient). I’ll get back to it’s application in a bit, first let’s see how it works.
Consider a very large set S consisting of N elements. Now the bloom filter will enable you to efficiently query if an element ‘a’ is part of S.
At the start your Bloom Filter is nothing but an array of size M (M<<N) with all elements set to zero. Now you select K different hash functions (K<<M) so that each of them maps any element given as input to one of the M array positions. Essentially if you take any element and feed it to any of your hash functions you will be returned an array position. Along with this constraint, the hash functions are chosen which can be computed quickly.
Now for all the elements in the set, the K different hash functions are computed for each of these and the results i.e. K different array positions per element are set to 1. This generates a uniform random distribution of 1’s and 0’s in the array associated the set of elements.
Now when a user wants to know if a particular element is there in a set, he computes the K different hashes of that element and checks the positions yielded by their results. If there is a zero at the position for any one of the hash results then we can conclude that the element is definitely not present in the set. However the converse cannot be stated with certainty as there exists a possibility where the combination of hashes exists without the element itself as shown in the above figure with element b. The parameters of M and K can be chosen to keep it such that the probability of a false positive is minimal.
The main utility lies in the fact that adding an element as well as querying membership occur in constant time which is a massive saving when compared to retrieving information and searching.
Issues against using it are mainly the probability of a false positive and removing an element from a set is not possible. Once an element has become part of the filter, it is not possible to remove it using this implementation.
The latter is overcome by replacing the binary 1/0 at each array position with a counter. The bloom filter will start at a zero in each position and whenever an element hash returns that position, the counter there is incremented by 1. There’s no improvement in searching – you just see that the positions denoted by the hashes of the search value are non zero but removal is possible by the fact that you just decrement the counter by 1 at all the concerned positions. If the counter value was earlier 1 then this element was the only one for which a hash value returned that position, so now it will be zero. If it was more than one then there exists another function which returned the same position and we need a non zero value to query elements. How you store the counter depends on the amount of storage available to you i.e. how many bits can you allocate for the counter given the size of M that you choose and how much space you can use. A Bloom filter with a 3 bit counter is shown in the figure below.
The applications for this are numerous. The site Medium uses a bloom filter to ensure that it’s readers are only recommended new articles to be read. It’s an elegant solution to a problem that has vexed many a developer. Basically for every user one such filter is maintained and every time he reads an article the hashes of that get added to the filter. So therefore when recommending new articles, no previously read ones are in contention.
Ever wondered how Google invalidates your username when you’re signing up? 😛 Bloom Filters is the answer. If you consider the number of accounts that are on Google the fact that they can tell you within a matter of seconds to change to a new one is so cool.
Another application is within Bitcoin where lightweight clients can request only those transactions pertaining to it. They may not want to or even be physically able to store the entire block chain. So what they can do is to calculate a bloom filter for it’s set of required transactions and transmit this to a full node peer (one who stores the entire chain). Receiving this the full node can remove all unrelated transactions and return the rest. Doing it this way is beneficial and even though there is a chance for a false positive, it’s effect is minimal. The full node passes along the transaction information that passes through the filter and when the lightweight node compares it to its own list of the exact transactions it realizes it’s a false positive and can discard it.
The filtering nature of the data structure i.e. no false negatives provides a great way to solve that particular class of problems.
Some more links for reference:
- Python Implementation of a bloom filter here.
- In Depth Look at Probabilistic Data Structures and their use in Web Analytics and Data Mining here.