Posts

Showing posts from December, 2014

Distributed Bloom filter

Requirements

Filter, quickly and efficiently, a stream of location data arriving from an external source at a rate of 1M events per second. The assumption is that 80% of events should be filtered out and only 20% passed on to the analytics system for further processing. The filter must match each incoming event against a predefined, semi-static data structure (predefined location areas) containing 60,000M entries. In this article, I assume the reader is familiar with the Storm framework and with the definition of a Bloom filter.

Solution

Because the requirements describe a stream of data, the Storm framework was chosen; to provide efficient filtering, the Guava Bloom filter was chosen. A Bloom filter calculator shows that, with a 0.02% false positive probability, the Bloom filter bit array takes about 60 GB of memory, which can't be loaded into the Java process heap. I created a simple Storm topology containing a Kafka spout and filter bolts. There are several possible solutions ...
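
The memory estimate can be checked against the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below is only an illustration: the class and method names are hypothetical and are not taken from the original topology or from the calculator used. For 60,000M entries it gives roughly 61 GB at p = 0.02 (2%) and roughly 133 GB at p = 0.0002 (0.02%), so the quoted figure of about 60 GB appears to correspond to a false positive probability of 0.02 rather than 0.02%.

```java
// Hypothetical helper for checking Bloom filter sizing; not part of the original post.
class BloomSizing {

    // Optimal number of bits for n expected entries and false positive
    // probability p: m = -n * ln(p) / (ln 2)^2, rounded up.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2, at least 1.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Converts a bit count to decimal gigabytes (10^9 bytes).
    static double gigabytes(long bits) {
        return bits / 8.0 / 1e9;
    }

    public static void main(String[] args) {
        long n = 60_000_000_000L; // 60,000M entries, as stated in the requirements
        // Compare the two readings of the false positive probability.
        for (double p : new double[] {0.02, 0.0002}) {
            long m = optimalBits(n, p);
            System.out.printf("p=%.4f: %.1f GB, %d hash functions%n",
                    p, gigabytes(m), optimalHashes(n, m));
        }
    }
}
```

Either way, the bit array is far larger than a typical JVM heap, which is why the filter has to be partitioned across several processes rather than held in one.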