Go library

bloom

A fast, space-efficient Bloom filter — a probabilistic set that answers membership queries with no false negatives and a tunable false-positive rate.

A Bloom filter is a concise, compressed representation of a set, built for one job: answering membership queries — is this item in the set? It can use far less storage than the original set, at the cost of occasional false positives. It will never produce a false negative: if it says an item is absent, it truly is.


Installation

go get -u github.com/bits-and-blooms/bloom/v3

A Bloom filter is not a dynamic data structure: you must know your desired capacity ahead of time. To build a filter for one million elements with a 1% false-positive rate:

filter := bloom.NewWithEstimates(1000000, 0.01)

Call NewWithEstimates conservatively: if you specify too few elements, the false-positive bound may be exceeded. A lower false-positive rate or a higher capacity both cost more memory.

What it gives you

  • Membership queries with zero false negatives and a tunable false-positive rate.
  • Automatic sizing from a target capacity and error rate via NewWithEstimates.
  • Add, Test, and the combined TestAndAdd / TestOrAdd primitives, plus *String variants.
  • Set algebra across compatible filters with Merge and Copy.
  • Cardinality estimation with ApproximatedSize and rate validation with EstimateFalsePositiveRate.
  • Portable binary serialization plus json, gob, and BinaryMarshaler support.
  • Backed by a compact bitset and the fast, non-cryptographic murmur3 hash.

Example

Keys are []byte. Adding and testing a string item:

package main

import (
    "fmt"

    "github.com/bits-and-blooms/bloom/v3"
)

func main() {
    filter := bloom.NewWithEstimates(1000000, 0.01)

    filter.Add([]byte("Love"))

    if filter.TestString("Love") {
        fmt.Println("Love is (probably) in the set")
    }
    if !filter.TestString("Hate") {
        fmt.Println("Hate is definitely not in the set")
    }
}

For numeric data, reach for encoding/binary. For example, to add a uint32:

i := uint32(100)
n1 := make([]byte, 4)
binary.BigEndian.PutUint32(n1, i)
filter.Add(n1)

Verifying the false-positive rate

Sometimes the actual false-positive rate differs slightly from the theoretical one. You can estimate it for a filter with m bits and k hash functions over a set of size n:

if bloom.EstimateFalsePositiveRate(20*n, 5, n) > 0.001 {
    // too high — increase m or k
}

Use it to validate the computed m, k parameters:

m, k := bloom.EstimateParameters(n, fp)
actualFPRate := bloom.EstimateFalsePositiveRate(m, k, n)

Note: EstimateFalsePositiveRate builds a temporary filter and is relatively expensive — it is meant for validation, not for hot paths. Estimating the rate clears the filter it operates on.

Serialization

You can write a Bloom filter to a stable byte stream and read it back:

f := bloom.New(1000, 4)

var buf bytes.Buffer
bytesWritten, err := f.WriteTo(&buf)
if err != nil {
    // handle error
}

var g bloom.BloomFilter
bytesRead, err := g.ReadFrom(&buf)
if err != nil {
    // handle error
}
// bytesRead == bytesWritten

Performance tip: when reading from or writing to a file or network connection, wrap your streams in bufio instances for better throughput.

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hash functions. The filter is backed by a bitset; a key is represented by setting the bits at each of the k hashed locations (modulo m). Membership is tested by checking whether all of those bits are set.

If an item is truly in the set, the filter never fails (the true-positive rate is 1.0); it is only susceptible to false positives. The art is to choose k and m correctly. This implementation hashes with murmur3, a fast non-cryptographic hash.

Used by demanding systems

The package is relied on across databases, observability systems, networking tools, blockchains, and infrastructure software.

Databases and storage engines

Observability and log processing

  • grafana/loki — Grafana's log aggregation system, across its index and storage layers.

Distributed systems and consensus

Networking and security tools

Blockchain and crypto

Privacy and identity

Web crawling and scraping

Application infrastructure

Goroutine safety

Filters are intentionally unsynchronized for performance. If multiple goroutines need the same filter, provide your own synchronization — typically a sync.Mutex, or channel hand-off so there is only ever one owner. Exceptionally, multiple goroutines may share a filter if none of them ever modify its content.

Further reading

Mastering Programming: From Testing to Performance in Go covers the kind of testing and optimization discipline that libraries like this benefit from.