Building a search index

Build a local BM25 inverted index from crawled pages and search it in milliseconds.

ccrawl index builds and queries a local BM25 inverted index over any JSONL page corpus, whether downloaded from Common Crawl or produced by crawl fetch.

Building the index

index build reads JSONL from stdin or from one or more files, tokenizes each document, records the term frequencies and document lengths that BM25 scoring needs, and writes a compact binary index to a directory.

# from a file
ccrawl index build --dir idx/ pages.jsonl

# from stdin
ccrawl crawl fetch seeds.jsonl -o jsonl | ccrawl index build --dir idx/ -

# multiple input files
ccrawl index build --dir idx/ pages-a.jsonl pages-b.jsonl

# parallel build with 16 workers
ccrawl index build --dir idx/ --workers 16 pages.jsonl

Flags:

Flag	Default	Purpose
`--dir`	required	Directory to write the index into
`--workers`	8	Number of parallel indexing workers
`--urls`	(stdin)	Explicit list of URLs to index instead of reading JSONL
`--input`	(positional args)	Input JSONL files

Each worker reads documents from a shared channel and writes to its own intermediate segment. A final merge step combines all segments into the output directory. With the default 8 workers on a 4-core machine, the build is typically I/O-bound; pass --workers 16 on machines with faster storage.
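The segment-then-merge flow described above can be sketched in a few lines of Python. This is an illustration only, not ccrawl's implementation: the `text` field name, the regex tokenizer, and the helper names `build_segment` and `merge_segments` are assumptions, and the "workers" run sequentially here to keep the example short.

```python
import json
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase and split on non-alphanumerics (an assumed tokenizer)."""
    return TOKEN_RE.findall(text.lower())

def build_segment(docs):
    """Build one segment: term -> [(doc_id, tf)], plus per-document lengths."""
    postings = defaultdict(list)
    doc_len = {}
    for doc_id, text in docs:
        tokens = tokenize(text)
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    return postings, doc_len

def merge_segments(segments):
    """Combine per-worker segments; each posting list ends up sorted by doc ID."""
    merged = defaultdict(list)
    doc_len = {}
    for postings, lengths in segments:
        doc_len.update(lengths)
        for term, entries in postings.items():
            merged[term].extend(entries)
    for entries in merged.values():
        entries.sort()
    return dict(merged), doc_len

# Hypothetical JSONL input; the "text" field is an assumption.
lines = [
    '{"url": "https://example.com/a", "text": "Go concurrency with goroutines"}',
    '{"url": "https://example.com/b", "text": "Rust memory safety and concurrency"}',
    '{"url": "https://example.com/c", "text": "Go Go Go"}',
]
docs = [(i, json.loads(line)["text"]) for i, line in enumerate(lines)]

# Two "workers", each handling a slice of the corpus.
seg_a = build_segment(docs[0::2])  # docs 0 and 2
seg_b = build_segment(docs[1::2])  # doc 1
index, doc_len = merge_segments([seg_a, seg_b])

print(index["go"])           # [(0, 1), (2, 3)]
print(index["concurrency"])  # [(0, 1), (1, 1)]
```

Because each segment is built independently, the only shared step is the final merge, which is why the build parallelizes well until storage becomes the bottleneck.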

Index layout

The index directory contains four files:

File	Purpose
`terms.dat`	Term → (byte offset, doc count) map
`postings.bin`	VByte delta-encoded posting lists: `(docID, TF, DL)` per entry
`stats.dat`	Document count N and average document length avgDL
`forward.jsonl`	Forward index: docID → `{url, title, snippet}`

These file formats are stable across minor versions, so an index built once can be queried many times without rebuilding after a minor upgrade.
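To make the postings.bin description concrete, here is a minimal Python sketch of VByte delta encoding for `(docID, TF, DL)` entries. The byte layout (7 data bits per byte, low-order group first, high bit set on continuation bytes) is one common VByte variant and is an assumption; the actual on-disk layout used by ccrawl may differ.

```python
def vbyte_encode(n):
    """Encode a non-negative int in 7-bit groups; high bit marks 'more bytes follow'."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(data, pos=0):
    """Decode one integer starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7

def encode_postings(entries):
    """entries: (doc_id, tf, dl) tuples sorted by doc_id; doc IDs are stored as gaps."""
    out = bytearray()
    prev = 0
    for doc_id, tf, dl in entries:
        out += vbyte_encode(doc_id - prev)  # delta from the previous doc ID
        out += vbyte_encode(tf)
        out += vbyte_encode(dl)
        prev = doc_id
    return bytes(out)

def decode_postings(data):
    """Inverse of encode_postings: rebuild absolute doc IDs from the gaps."""
    entries = []
    pos = 0
    prev = 0
    while pos < len(data):
        gap, pos = vbyte_decode(data, pos)
        tf, pos = vbyte_decode(data, pos)
        dl, pos = vbyte_decode(data, pos)
        prev += gap
        entries.append((prev, tf, dl))
    return entries

entries = [(3, 2, 120), (10, 1, 45), (1000, 7, 300)]
blob = encode_postings(entries)
assert decode_postings(blob) == entries
print(len(blob))  # 11 bytes for three entries
```

Delta encoding pays off because sorted doc IDs produce small gaps, and small integers fit in a single byte.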

Searching

index search scores documents with BM25 and returns ranked results.

ccrawl index search --dir idx/ "golang concurrency goroutines"
ccrawl index search --dir idx/ "rust memory safety" -n 20 -o json
ccrawl index search --dir idx/ "python asyncio" -o jsonl

Flags:

Flag	Default	Purpose
`--dir`	required	Index directory to query
`-n`	10	Number of results to return
`-o`	table	Output format: `table`, `json`, `jsonl`

Each result includes the URL, title, a snippet of the matching text, and the BM25 score. Queries are tokenized the same way as the corpus, and multi-term queries are combined with AND by default, so a result must contain every query term.

BM25 parameters

BM25 has two parameters you can tune:

Parameter	Default	Effect
`k1`	1.2	Term frequency saturation: lower values make repeated occurrences of a term saturate sooner
`b`	0.75	Document length normalization: 0 = no normalization, 1 = full normalization

Set them with --k1 and --b:

ccrawl index search --dir idx/ "distributed systems" --k1 1.5 --b 0.5

The default values work well for web pages. Lower b if your corpus has highly variable document lengths (e.g. full WARC records mixed with short excerpts), so that long documents are penalized less for their length.
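The effect of k1 and b is easier to see in code. The sketch below scores AND queries over a tiny in-memory index using the standard BM25 term formula; the IDF variant (`ln(1 + (N - df + 0.5) / (df + 0.5))`) and the function names `bm25_term` and `search` are assumptions chosen for illustration, not a description of ccrawl's internals.

```python
import math

def bm25_term(tf, dl, avgdl, n_docs, df, k1=1.2, b=0.75):
    """BM25 contribution of one query term in one document."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    norm = 1.0 - b + b * dl / avgdl  # equals 1.0 when b = 0 (no length normalization)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)

def search(index, doc_len, query_terms, k1=1.2, b=0.75, n=10):
    """AND query: keep docs containing every term, rank by summed BM25 score."""
    if not query_terms:
        return []
    n_docs = len(doc_len)
    avgdl = sum(doc_len.values()) / n_docs
    lists = [dict(index.get(term, [])) for term in query_terms]
    matches = set(lists[0])
    for tfs in lists[1:]:
        matches &= set(tfs)
    scored = []
    for doc_id in matches:
        score = sum(
            bm25_term(tfs[doc_id], doc_len[doc_id], avgdl, n_docs, len(tfs), k1, b)
            for tfs in lists
        )
        scored.append((doc_id, score))
    # Highest score first; ties broken by doc ID for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:n]

# Toy index: term -> [(doc_id, tf)], and doc_id -> length in tokens.
index = {"go": [(0, 1), (2, 3)], "concurrency": [(0, 1), (1, 1)]}
doc_len = {0: 4, 1: 5, 2: 3}

print([doc for doc, _ in search(index, doc_len, ["go", "concurrency"])])  # [0]
print([doc for doc, _ in search(index, doc_len, ["go"])])                 # [2, 0]
```

With b set to 0 the `norm` term is always 1.0, so two documents with the same term frequency score identically regardless of length; raising b toward 1 increasingly favors shorter documents.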

End-to-end example

# 1. Seed, crawl, and index in one pipeline
ccrawl crawl seed --max-tier 2 --max-seeds 100000 -o jsonl \
  | ccrawl crawl fetch - -o jsonl \
  | ccrawl index build --dir ~/cc-index/ -

# 2. Search the result
ccrawl index search --dir ~/cc-index/ "machine learning Python" -n 5

On a mid-range laptop with a fast internet connection, 100,000 URLs take roughly four to six hours to crawl and a few minutes to index. Query latency is under 10 ms for most queries once the index is built.

Incremental updates

To add new pages to an existing index, build a second index from the new documents and merge it into the existing one:

ccrawl index build --dir idx-new/ new-pages.jsonl
ccrawl index merge --src idx-new/ --dst idx/

index merge is additive: it appends new postings without rebuilding from scratch.
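One way an additive merge can work is sketched below in Python. It is a guess at the general technique, not ccrawl's actual merge logic: the in-memory dict layout and the `merge_index` helper are hypothetical. The key points are that doc IDs from the source index are shifted past the destination's existing IDs, and that N and avgDL are recomputed so BM25 scores stay consistent.

```python
def merge_index(dst, src):
    """Append src into dst in place.

    Each index is a dict with 'n_docs', 'avgdl', and
    'postings' (term -> [(doc_id, tf, dl)]), a hypothetical in-memory layout.
    """
    offset = dst["n_docs"]  # new doc IDs start after the existing ones
    total_len = dst["avgdl"] * dst["n_docs"] + src["avgdl"] * src["n_docs"]
    for term, entries in src["postings"].items():
        shifted = [(doc_id + offset, tf, dl) for doc_id, tf, dl in entries]
        dst["postings"].setdefault(term, []).extend(shifted)
    dst["n_docs"] += src["n_docs"]
    dst["avgdl"] = total_len / dst["n_docs"] if dst["n_docs"] else 0.0
    return dst

dst = {
    "n_docs": 2,
    "avgdl": 4.0,
    "postings": {"go": [(0, 1, 4), (1, 2, 4)]},
}
src = {
    "n_docs": 1,
    "avgdl": 10.0,
    "postings": {"go": [(0, 1, 10)], "rust": [(0, 3, 10)]},
}
merge_index(dst, src)

print(dst["n_docs"], dst["avgdl"])  # 3 6.0
print(dst["postings"]["go"])        # [(0, 1, 4), (1, 2, 4), (2, 1, 10)]
```

Since shifted IDs are always larger than existing ones, appended posting lists remain sorted and no existing entries need to be rewritten.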