Building a search index
Build a local BM25 inverted index from crawled pages and search it in milliseconds.
ccrawl index builds and queries a local BM25 inverted index over any JSONL page corpus, whether downloaded from Common Crawl or produced by crawl fetch.
Building the index
index build reads JSONL from stdin or from one or more files, tokenizes each document, records the term frequencies and document lengths that BM25 needs for per-document length normalization, and writes a compact binary index to a directory.
# from a file
ccrawl index build --dir idx/ pages.jsonl
# from stdin
ccrawl crawl fetch seeds.jsonl -o jsonl | ccrawl index build --dir idx/ -
# multiple input files
ccrawl index build --dir idx/ pages-a.jsonl pages-b.jsonl
# parallel build with 16 workers
ccrawl index build --dir idx/ --workers 16 pages.jsonl
Flags:
| Flag | Default | Purpose |
|---|---|---|
| --dir | required | Directory to write the index into |
| --workers | 8 | Parallel tokenize-and-score workers |
| --urls | (stdin) | Explicit list of URLs to index instead of reading JSONL |
| --input | (positional args) | Input JSONL files |
Each worker reads documents from a shared channel and writes to a per-worker intermediate segment.
A final merge step combines all segments into the output directory.
With the default 8 workers on a 4-core machine the build is typically I/O-bound, so more workers help only when storage can keep up; try --workers 16 on machines with faster storage.
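The segment-then-merge pattern described above can be sketched in a few lines of Python. This is an illustrative model, not ccrawl's implementation: the tokenizer, the function names, and the round-robin split (standing in for the shared channel) are all assumptions, and threads are used only to keep the example self-contained.

```python
import re
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_segment(docs):
    """Build one worker's segment: term -> [(doc_id, tf, doc_len)]."""
    segment = defaultdict(list)
    for doc_id, text in docs:
        tokens = tokenize(text)
        for term, tf in Counter(tokens).items():
            segment[term].append((doc_id, tf, len(tokens)))
    return segment


def merge_segments(segments):
    """Combine per-worker segments and sort each posting list by doc_id."""
    merged = defaultdict(list)
    for segment in segments:
        for term, postings in segment.items():
            merged[term].extend(postings)
    return {term: sorted(postings) for term, postings in merged.items()}


def build_index(docs, workers=4):
    docs = list(docs)
    # Assumption: a round-robin split stands in for the shared channel.
    chunks = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        segments = list(pool.map(build_segment, chunks))
    return merge_segments(segments)


docs = [
    (0, "Go concurrency with goroutines"),
    (1, "Rust memory safety"),
    (2, "Concurrency in Rust and Go"),
]
index = build_index(docs, workers=2)
print(index["concurrency"])  # [(0, 1, 4), (2, 1, 5)]
```

Sorting each posting list by docID during the merge is what makes delta encoding of docIDs possible later.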
Index layout
The index directory contains four files:
| File | Purpose |
|---|---|
| terms.dat | Term → (byte offset, doc count) map |
| postings.bin | VByte delta-encoded posting lists: (docID, TF, DL) per entry |
| stats.dat | Document count N and average document length avgDL |
| forward.jsonl | Forward index: docID → {url, title, snippet} |
These file formats are stable across minor versions, so you can build once and query many times.
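To make "VByte delta-encoded" concrete, here is a minimal Python sketch of the general technique: docIDs are stored as gaps from the previous docID, and every integer is written in 7-bit groups. The byte order and continuation-bit convention shown are one common choice and an assumption here; the actual on-disk layout of postings.bin may differ.

```python
def vbyte_encode(n):
    """Encode a non-negative int in 7-bit groups, low bits first.

    Assumed convention: a set high bit means another byte follows.
    """
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def vbyte_decode(data, pos=0):
    """Decode one integer starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def encode_postings(postings):
    """Encode [(doc_id, tf, doc_len)] sorted by doc_id, using docID gaps."""
    out = bytearray()
    prev = 0
    for doc_id, tf, doc_len in postings:
        out += vbyte_encode(doc_id - prev)  # gap to the previous docID
        out += vbyte_encode(tf)
        out += vbyte_encode(doc_len)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings by re-accumulating the docID gaps."""
    postings, pos, prev = [], 0, 0
    while pos < len(data):
        gap, pos = vbyte_decode(data, pos)
        tf, pos = vbyte_decode(data, pos)
        doc_len, pos = vbyte_decode(data, pos)
        prev += gap
        postings.append((prev, tf, doc_len))
    return postings


postings = [(3, 2, 120), (1000, 1, 45), (1003, 7, 900)]
blob = encode_postings(postings)
print(len(blob))  # 11 bytes for nine integers
```

Gaps between sorted docIDs are usually small, so most of them fit in a single byte, which is where the compactness comes from.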
Searching
index search scores documents with BM25 and returns ranked results.
ccrawl index search --dir idx/ "golang concurrency goroutines"
ccrawl index search --dir idx/ "rust memory safety" -n 20 -o json
ccrawl index search --dir idx/ "python asyncio" -o jsonl
Flags:
| Flag | Default | Purpose |
|---|---|---|
| --dir | required | Index directory to query |
| -n | 10 | Number of results to return |
| -o | table | Output format: table, json, jsonl |
Each result includes the URL, title, a snippet of the matching text, and the BM25 score. Queries are tokenized the same way as the corpus, and multi-term queries are ANDed by default.
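AND semantics mean that only documents containing every query term are candidates for scoring. A minimal Python sketch of that filtering step follows; the function name and the in-memory index shape (term mapped to a list of (docID, TF, DL) tuples) are assumptions made for illustration.

```python
def and_candidates(index, query_terms):
    """Return the sorted doc IDs that contain every query term."""
    doc_sets = []
    for term in query_terms:
        postings = index.get(term)
        if not postings:
            return []  # a term with no postings rules out every document
        doc_sets.append({doc_id for doc_id, _tf, _dl in postings})
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))


# Toy index: term -> [(doc_id, tf, doc_len)]
toy_index = {
    "python": [(1, 3, 120), (4, 1, 80), (7, 2, 300)],
    "asyncio": [(4, 2, 80), (7, 1, 300), (9, 5, 60)],
}
matches = and_candidates(toy_index, ["python", "asyncio"])
print(matches)  # [4, 7]
```

A practical consequence is that adding a rare term to a query can only shrink the result set, never grow it.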
BM25 parameters
BM25 has two parameters you can tune:
| Parameter | Default | Effect |
|---|---|---|
| k1 | 1.2 | Term frequency saturation: lower = diminishing returns set in earlier |
| b | 0.75 | Document length normalization: 0 = no normalization, 1 = full normalization |
Set them with --k1 and --b:
ccrawl index search --dir idx/ "distributed systems" --k1 1.5 --b 0.5
The default values work well for web pages.
Lower b if your corpus has highly variable document lengths (e.g. full WARC records mixed with short excerpts).
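To see what k1 and b do, it helps to look at the per-term BM25 contribution. The Python sketch below uses the standard BM25 term formula with a smoothed, always-positive IDF; the exact IDF variant ccrawl uses is not stated here, so treat that part, and the function name, as assumptions.

```python
import math


def bm25_term_score(tf, dl, df, n_docs, avg_dl, k1=1.2, b=0.75):
    """Score one query term for one document.

    tf: term frequency in the document, dl: document length,
    df: number of documents containing the term,
    n_docs / avg_dl: corpus size N and average document length avgDL.
    """
    # Assumed IDF variant: smoothed so it never goes negative.
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # b interpolates between no length normalization (0) and full (1).
    norm = 1.0 - b + b * (dl / avg_dl)
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


corpus = dict(df=50, n_docs=10_000, avg_dl=400)

# With b=0.75, a short document outscores a long one at the same tf.
short_doc = bm25_term_score(tf=2, dl=100, **corpus)
long_doc = bm25_term_score(tf=2, dl=2000, **corpus)

# With b=0, document length has no effect on the score.
short_b0 = bm25_term_score(tf=2, dl=100, b=0.0, **corpus)
long_b0 = bm25_term_score(tf=2, dl=2000, b=0.0, **corpus)

# Each extra occurrence adds less than the one before it.
gains = [
    bm25_term_score(tf=t + 1, dl=400, **corpus)
    - bm25_term_score(tf=t, dl=400, **corpus)
    for t in (1, 2, 3)
]
```

A document's full score is the sum of this value over the query terms it matches.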
End-to-end example
# 1. Seed, crawl, and index in one pipeline
ccrawl crawl seed --max-tier 2 --max-seeds 100000 -o jsonl \
| ccrawl crawl fetch - -o jsonl \
| ccrawl index build --dir ~/cc-index/ -
# 2. Search the result
ccrawl index search --dir ~/cc-index/ "machine learning Python" -n 5
On a mid-range laptop with a fast internet connection, 100,000 URLs take roughly four to six hours to crawl and a few minutes to index. Query latency is under 10 ms for most queries once the index is built.
Incremental updates
To add new pages to an existing index, build a second index from the new documents and merge:
ccrawl index build --dir idx-new/ new-pages.jsonl
ccrawl index merge --src idx-new/ --dst idx/
index merge is additive: it appends new postings without rebuilding from scratch.
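Because BM25 depends on the corpus-wide N and avgDL, any additive merge has to keep those statistics consistent with the combined postings. How index merge does this internally is not described here; the Python sketch below only shows the arithmetic one would expect, with hypothetical function names and the assumption that new docIDs are shifted past the existing ones.

```python
def merge_stats(n_a, avg_dl_a, n_b, avg_dl_b):
    """Combine document counts and average lengths of two indexes."""
    n = n_a + n_b
    if n == 0:
        return 0, 0.0
    # Weighted average: total length of all documents divided by their count.
    avg_dl = (n_a * avg_dl_a + n_b * avg_dl_b) / n
    return n, avg_dl


def shift_postings(postings, offset):
    """Renumber the second index's docIDs so they follow the first's.

    Assumption: docIDs in each index are dense and start at 0.
    """
    return [(doc_id + offset, tf, dl) for doc_id, tf, dl in postings]


n, avg_dl = merge_stats(100, 200.0, 50, 350.0)
print(n, avg_dl)  # 150 250.0

appended = shift_postings([(0, 2, 90), (3, 1, 410)], offset=100)
print(appended)  # [(100, 2, 90), (103, 1, 410)]
```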