bloom README

A Bloom filter is a concise, compressed representation of a set, built for one job: answering membership queries — is this item in the set? It can use far less storage than the original set, at the cost of occasional false positives. It will never produce a false negative: if it says an item is absent, it truly is.

Installation

go get -u github.com/bits-and-blooms/bloom/v3

A Bloom filter is not a dynamic data structure: you must know your desired capacity ahead of time. To build a filter for one million elements with a 1% false-positive rate:

filter := bloom.NewWithEstimates(1000000, 0.01)

Call NewWithEstimates conservatively: if you specify too few elements, the false-positive bound may be exceeded. A lower false-positive rate or a higher capacity both cost more memory.

What it gives you

Membership queries with zero false negatives and a tunable false-positive rate.
Automatic sizing from a target capacity and error rate via NewWithEstimates.
Add, Test, and the combined TestAndAdd / TestOrAdd primitives, plus *String variants.
Set algebra across compatible filters with Merge and Copy.
Cardinality estimation with ApproximatedSize and rate validation with EstimateFalsePositiveRate.
Portable binary serialization plus json, gob, and BinaryMarshaler support.
Backed by a compact bitset and the fast, non-cryptographic murmur3 hash.

Example

Keys are []byte. Adding and testing a string item:

package main

import (
    "fmt"

    "github.com/bits-and-blooms/bloom/v3"
)

func main() {
    filter := bloom.NewWithEstimates(1000000, 0.01)

    filter.Add([]byte("Love"))

    if filter.TestString("Love") {
        fmt.Println("Love is (probably) in the set")
    }
    if !filter.TestString("Hate") {
        fmt.Println("Hate is definitely not in the set")
    }
}

For numeric data, reach for encoding/binary. For example, to add a uint32:

i := uint32(100)
n1 := make([]byte, 4)
binary.BigEndian.PutUint32(n1, i)
filter.Add(n1)

Verifying the false-positive rate

Sometimes the actual false-positive rate differs slightly from the theoretical one. You can estimate it for a filter with m bits and k hash functions over a set of size n:

if bloom.EstimateFalsePositiveRate(20*n, 5, n) > 0.001 {
    // too high — increase m or k
}

Use it to validate the computed m, k parameters:

m, k := bloom.EstimateParameters(n, fp)
actualFPRate := bloom.EstimateFalsePositiveRate(m, k, n)

Note: EstimateFalsePositiveRate builds a temporary filter and is relatively expensive — it is meant for validation, not for hot paths. Estimating the rate clears the filter it operates on.

Serialization

You can write a Bloom filter to a stable byte stream and read it back:

f := bloom.New(1000, 4)

var buf bytes.Buffer
bytesWritten, err := f.WriteTo(&buf)
if err != nil {
    // handle error
}

var g bloom.BloomFilter
bytesRead, err := g.ReadFrom(&buf)
if err != nil {
    // handle error
}
// bytesRead == bytesWritten

Performance tip: when reading from or writing to a file or network connection, wrap your streams in bufio instances for better throughput.

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hash functions. The filter is backed by a bitset; a key is represented by setting the bits at each of the k hashed locations (modulo m). Membership is tested by checking whether all of those bits are set.

If an item is truly in the set, the filter never fails (the true-positive rate is 1.0); it is only susceptible to false positives. The art is to choose k and m correctly. This implementation hashes with murmur3, a fast non-cryptographic hash.

Used by demanding systems

The package is relied on across databases, observability systems, networking tools, blockchains, and infrastructure software.

Databases and storage engines

milvus-io/milvus — leading open-source vector database, explicitly named in the repo README.
weaviate/weaviate — vector DB, used in its lsmkv LSM storage layer.
openGemini/openGemini — time-series database with multiple index packages.
siglens/siglens — observability and log search engine.
ankur-anand/unisondb — embedded database memtable.
jakub-galecki/godb — SSTable layer in an LSM-tree database.

Observability and log processing

grafana/loki — Grafana's log aggregation system, across its index and storage layers.

Distributed systems and consensus

atomix/atomix — distributed primitives framework.
authzed/spicedb — Zanzibar-style permissions system, dispatch layer.
splitio/go-split-commons — Split.io Go SDK, impression deduplication across major versions.

Networking and security tools

projectdiscovery/hmap — used across the nuclei/subfinder/httpx scanner ecosystem.
daeuniverse/dae — Linux eBPF transparent proxy and routing table.
bloXroute-Labs/gateway — blockchain network gateway.

Blockchain and crypto

NethermindEth/juno — Starknet full node.
iost-official/go-iost — IOST blockchain peer-to-peer layer.
status-im/status-go — Status decentralized messenger.
code-payments/code-server — crypto payments backend.

Privacy and identity

optable/match — privacy-preserving record linkage.
Arceliar/ironwood — anonymous routing network library.

Web crawling and scraping

beego/beego — web framework, named in the upstream README.
editorpost/spider — deduplication store for crawling.

Application infrastructure

bentoml/yatai — model serving infrastructure.
rotationalio/ensign — event streaming platform.
puppetlabs/leg — message deduplication middleware.
letsencrypt/x509search — certificate search tooling.

Goroutine safety

Filters are intentionally unsynchronized for performance. If multiple goroutines need the same filter, provide your own synchronization — typically a sync.Mutex, or channel hand-off so there is only ever one owner. Exceptionally, multiple goroutines may share a filter if none of them ever modify its content.