- Регистрация
- 1 Мар 2015
- Сообщения
- 1,481
- Баллы
- 155
By someone who’s tired of over-allocating hash sets for no good reason.
? What is a Bloom Filter?
A Bloom Filter is a probabilistic data structure used to test set membership. It’s:
You can ask: "Is element X in the set?" and it will either say:
It trades 100% accuracy for massive space and time savings.
? Real-World Use Cases
? How a Bloom Filter Works
A Bloom filter uses:
Insertion:
For each element:
To check if element exists:
Why “maybe in set”?
Because multiple elements might hash to overlapping bit positions.
? Choosing Parameters: m, k, n
Optimal k = (m/n) * ln(2)
False positive rate ≈ (1 - e^(-k * n/m))^k
Use these formulas to tune based on space and error tolerance.
? Java Example Implementation
import java.util.BitSet;
import java.util.function.Function;
public class BloomFilter<T> {
private final BitSet bitset;
private final int bitSize;
private final int hashFunctions;
private final Function<T, Integer>[] hashers;
@SafeVarargs
public BloomFilter(int bitSize, Function<T, Integer>... hashers) {
this.bitSize = bitSize;
this.bitset = new BitSet(bitSize);
this.hashFunctions = hashers.length;
this.hashers = hashers;
}
public void add(T item) {
for (Function<T, Integer> hasher : hashers) {
int hash = Math.abs(hasher.apply(item)) % bitSize;
bitset.set(hash);
}
}
public boolean mightContain(T item) {
for (Function<T, Integer> hasher : hashers) {
int hash = Math.abs(hasher.apply(item)) % bitSize;
if (!bitset.get(hash)) return false;
}
return true;
}
}
Usage:
BloomFilter<String> filter = new BloomFilter<>(1024,
s -> s.hashCode(),
s -> s.length(),
s -> s.indexOf('a'));
filter.add("apple");
filter.add("banana");
System.out.println(filter.mightContain("apple")); // true
System.out.println(filter.mightContain("grape")); // false or true (false positive)
? Clear Example Walkthrough
Let’s walk through what happens when we add and check items:
Step 1: Initialize
This example demonstrates how Bloom filters gain speed and space efficiency by trading certainty for probability.
Pros and Cons
? Variants
Bloom filters power massive-scale systems like Bigtable, Apache HBase, and even Bitcoin. If you're building for speed, scale, and low memory — it's a tool worth mastering.
? What is a Bloom Filter?
A Bloom Filter is a probabilistic data structure used to test set membership. It’s:
- Space-efficient

- Extremely fast

- Allows false positives

- Never has false negatives

You can ask: "Is element X in the set?" and it will either say:
- Definitely not
- Possibly yes
It trades 100% accuracy for massive space and time savings.
? Real-World Use Cases
| Use Case | Why Bloom Filter? |
|---|---|
| Caches (e.g., CDN, Memcached) | Avoid unnecessary DB hits for missing keys |
| Web crawlers | Don't reprocess already seen URLs |
| Spell-checkers | Fast word lookup with compact storage |
| Distributed systems (e.g., Bigtable, Cassandra) | Avoid cross-node calls for missing data |
| Blockchain (Bitcoin/SPV) | Verify transactions without full node |
A Bloom filter uses:
- A bit array of length m
- k independent hash functions
For each element:
- Apply the k hash functions → get k indices
- Set all those k positions in the bit array to 1
To check if element exists:
- Hash the element k times
- If any bit at those k positions is 0 → definitely not in set
- If all bits are 1 → might be in set
Why “maybe in set”?
Because multiple elements might hash to overlapping bit positions.
? Choosing Parameters: m, k, n
| Symbol | Meaning |
|---|---|
| n | Number of expected elements |
| m | Size of bit array |
| k | Number of hash functions |
Optimal k = (m/n) * ln(2)
False positive rate ≈ (1 - e^(-k * n/m))^k
Use these formulas to tune based on space and error tolerance.
? Java Example Implementation
import java.util.BitSet;
import java.util.function.Function;
public class BloomFilter<T> {
private final BitSet bitset;
private final int bitSize;
private final int hashFunctions;
private final Function<T, Integer>[] hashers;
@SafeVarargs
public BloomFilter(int bitSize, Function<T, Integer>... hashers) {
this.bitSize = bitSize;
this.bitset = new BitSet(bitSize);
this.hashFunctions = hashers.length;
this.hashers = hashers;
}
public void add(T item) {
for (Function<T, Integer> hasher : hashers) {
int hash = Math.abs(hasher.apply(item)) % bitSize;
bitset.set(hash);
}
}
public boolean mightContain(T item) {
for (Function<T, Integer> hasher : hashers) {
int hash = Math.abs(hasher.apply(item)) % bitSize;
if (!bitset.get(hash)) return false;
}
return true;
}
}
Usage:
BloomFilter<String> filter = new BloomFilter<>(1024,
s -> s.hashCode(),
s -> s.length(),
s -> s.indexOf('a'));
filter.add("apple");
filter.add("banana");
System.out.println(filter.mightContain("apple")); // true
System.out.println(filter.mightContain("grape")); // false or true (false positive)
? Clear Example Walkthrough
Let’s walk through what happens when we add and check items:
Step 1: Initialize
- Bit array size = 16 bits
- Hash functions: hashCode(), length(), 'a' position
- "apple".hashCode() % 16 = 6
- "apple".length() = 5 → 5 % 16 = 5
- "apple".indexOf('a') = 0 → 0 % 16 = 0
- Bits 0, 5, and 6 are set to 1
- "banana".hashCode() % 16 = 15
- "banana".length() = 6 → 6 % 16 = 6
- "banana".indexOf('a') = 1 → 1 % 16 = 1
- Bits 1, 6, and 15 are set to 1 (6 is reused)
- If any of the 3 calculated positions (say: 2, 4, 9) are 0 → "grape" is definitely not in set
- If all are 1 → maybe present (false positive possible)
This example demonstrates how Bloom filters gain speed and space efficiency by trading certainty for probability.
| Pros | Cons |
|---|---|
| Very space-efficient | False positives possible |
| Constant-time inserts/lookups | Cannot remove elements (in basic Bloom) |
| Simple and fast | Can't enumerate contents |
- Counting Bloom Filter: allows deletions using counters instead of bits
- Scalable Bloom Filter: grows over time as elements increase
- Compressed Bloom Filter: reduces transmission cost across networks
Use it when:A Bloom filter is like a memory-efficient bouncer:
“You might be in the club, but if I say you’re not — you’re definitely not.”
- You need ultra-fast membership checks
- You can afford false positives
- You can’t afford gigabytes of memory for massive sets
Bloom filters power massive-scale systems like Bigtable, Apache HBase, and even Bitcoin. If you're building for speed, scale, and low memory — it's a tool worth mastering.