v0.2.0
v2 platform: host enumeration, recrawl engine, BM25 search index, REST API, and content pipeline.
v0.2.0 is the v2 platform release. It adds six new command groups (host, crawl, sched, index, content, api) on top of the v0.1.0 data-access layer, turning ccrawl into a full crawl-and-index platform: enumerate every host Common Crawl (CC) has seen, drive a fresh recrawl, build a BM25 search index, and serve everything over a REST API.
New commands
host
ccrawl host enumerates and enriches hosts from the CC web graph.
It streams the 262 M host rank table, joins in graph topology, and optionally aggregates CDX statistics per host via DuckDB.
ccrawl host top -n 20 -o table
ccrawl host get golang.org -o json
ccrawl host enrich --degrees --cdx -o jsonl > enriched.jsonl
ccrawl host cdx --filter example.com
ccrawl host vertices --graph cc-main-2026-mar-apr-may -n 5
ccrawl host degrees -n 100 -o jsonl
Subcommands: top, get, vertices, degrees, cdx, enrich.
crawl
ccrawl crawl drives the recrawl engine.
seed generates priority-ordered seed URLs from the rank table.
fetch fetches a live URL with the v2 crawler config.
status shows the daily page budget across the five recrawl tiers.
ccrawl crawl seed --max-tier 2 -n 1000000 -o jsonl > seeds.jsonl
ccrawl crawl fetch https://golang.org/ --robots -o json
ccrawl crawl status -o table
sched
ccrawl sched handles recrawl scheduling.
assign maps hosts to one of five tiers (daily through on-demand) based on harmonic rank and change rate.
diff joins two CDX snapshots on URL, compares content_digest, and emits per-host change rates that drive tier re-assignment.
ccrawl sched assign --change-rate 0.5 -n 20
ccrawl sched diff --crawl-a CC-MAIN-2026-17 --crawl-b CC-MAIN-2026-21 -o jsonl
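The change-rate computation behind sched diff can be sketched in Go as follows. This is a minimal illustration, not ccrawl's implementation: it assumes each snapshot has already been loaded into a map from URL to content_digest, whereas the tool itself joins CDX data. The names changeRates and hostOf are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf returns the host part of rawURL, or "" if it cannot be parsed.
func hostOf(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

// changeRates joins two snapshots (URL -> content_digest) on URL and returns,
// per host, the fraction of shared URLs whose digest differs.
func changeRates(a, b map[string]string) map[string]float64 {
	shared := map[string]int{}
	changed := map[string]int{}
	for u, da := range a {
		db, ok := b[u]
		if !ok {
			continue // URL present in only one snapshot: not part of the join
		}
		h := hostOf(u)
		shared[h]++
		if da != db {
			changed[h]++
		}
	}
	rates := make(map[string]float64, len(shared))
	for h, n := range shared {
		rates[h] = float64(changed[h]) / float64(n)
	}
	return rates
}

func main() {
	a := map[string]string{"https://example.com/": "AAA", "https://example.com/x": "BBB"}
	b := map[string]string{"https://example.com/": "AAA", "https://example.com/x": "CCC"}
	fmt.Println(changeRates(a, b))
}
```

A host whose rate stays high across snapshots is a candidate for a more frequent tier; the thresholds that map a rate to a tier are not stated here, so the sketch stops at the rate itself.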
index
ccrawl index builds and queries a local BM25 full-text search index.
build fetches URLs in parallel (8 workers by default), extracts clean text, tokenizes it, and writes a delta-encoded VByte posting list with per-document length normalization.
search queries the index and ranks results by BM25 score blended with the host link-graph signal.
ccrawl index build --urls https://golang.org/,https://pkg.go.dev/ --workers 16
ccrawl index search "golang web server" -n 10 -o json
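To make "delta-encoded VByte posting list" concrete, here is a small Go sketch. It assumes a posting is a document ID plus that document's token length (DL), and it uses the varint helpers in encoding/binary (little-endian base-128) as the variable-byte code. The on-disk layout ccrawl actually writes may differ; the type and function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// posting pairs a document ID with that document's token length (DL).
type posting struct {
	DocID uint64
	DL    uint64
}

// encodePostings writes each posting as two varints: the gap from the previous
// DocID, then DL. Postings must be sorted by ascending DocID.
func encodePostings(ps []posting) []byte {
	var out []byte
	var tmp [binary.MaxVarintLen64]byte
	prev := uint64(0)
	for _, p := range ps {
		n := binary.PutUvarint(tmp[:], p.DocID-prev)
		out = append(out, tmp[:n]...)
		n = binary.PutUvarint(tmp[:], p.DL)
		out = append(out, tmp[:n]...)
		prev = p.DocID
	}
	return out
}

// decodePostings reverses encodePostings, summing gaps back into DocIDs.
func decodePostings(buf []byte) ([]posting, error) {
	var ps []posting
	prev := uint64(0)
	for len(buf) > 0 {
		gap, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, fmt.Errorf("bad doc-ID gap varint")
		}
		buf = buf[n:]
		dl, m := binary.Uvarint(buf)
		if m <= 0 {
			return nil, fmt.Errorf("bad DL varint")
		}
		buf = buf[m:]
		prev += gap
		ps = append(ps, posting{DocID: prev, DL: dl})
	}
	return ps, nil
}

func main() {
	enc := encodePostings([]posting{{3, 120}, {7, 45}, {300, 9000}})
	dec, err := decodePostings(enc)
	fmt.Println(len(enc), dec, err)
}
```

Storing gaps instead of absolute IDs keeps most values small, so most entries fit in one or two bytes.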
content
ccrawl content computes live content signals for any URL using the v2 crawler config rather than the CDX archive.
ccrawl content extract https://golang.org/
ccrawl content quality https://example.com/ -o json
ccrawl content outlinks https://news.ycombinator.com/ -n 20
api
ccrawl api starts a v2 HTTP REST API server.
On startup it loads the top 1 M hosts from the rank table into memory.
Full-text search is available when --index-dir points to a built index.
ccrawl api --addr :8080 --index-dir /data/idx
| Endpoint | Returns |
|---|---|
| GET /v2/host/{host} | Enriched host profile |
| GET /v2/hosts?tld=&n= | Top N hosts, optional TLD filter |
| GET /v2/search?q=&k= | BM25 full-text search results |
| GET /v2/health | Health check |
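A minimal Go client for the search endpoint might look like the sketch below. It assumes the server from the example above is listening on localhost:8080; the response schema is not documented in these notes, so the body is printed verbatim rather than decoded. searchURL is a hypothetical helper.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"time"
)

// searchURL builds a /v2/search request URL with the query properly escaped.
func searchURL(base, q string, k int) string {
	v := url.Values{}
	v.Set("q", q)
	v.Set("k", strconv.Itoa(k))
	return base + "/v2/search?" + v.Encode()
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(searchURL("http://localhost:8080", "golang web server", 10))
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		return
	}
	defer resp.Body.Close()
	// The response schema is not specified here, so print the body as-is.
	io.Copy(os.Stdout, resp.Body)
}
```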
Improvements
BM25 length normalization.
The inverted index now stores a DL (document token length) varint per posting entry.
Scoring uses the actual document length rather than the corpus average, so length normalization works correctly.
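The effect of the per-document length is easiest to see in the standard BM25 term formula. The Go sketch below uses the conventional parameters k1 = 1.2 and b = 0.75 and a common smoothed IDF; these values and the function name are assumptions, not taken from ccrawl's source.

```go
package main

import (
	"fmt"
	"math"
)

// bm25Term scores one query term for one document.
// tf: term frequency in the document; dl: document token length;
// avgdl: average document length in the corpus;
// n: documents containing the term; N: total documents.
func bm25Term(tf, dl, avgdl float64, n, N int) float64 {
	const k1, b = 1.2, 0.75 // conventional defaults (assumed)
	idf := math.Log(1 + (float64(N-n)+0.5)/(float64(n)+0.5))
	norm := 1 - b + b*dl/avgdl // equals 1 when dl == avgdl
	return idf * tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	// Same term frequency, different document lengths.
	fmt.Printf("short doc: %.3f\n", bm25Term(3, 50, 200, 10, 1000))
	fmt.Printf("avg doc:   %.3f\n", bm25Term(3, 200, 200, 10, 1000))
	fmt.Printf("long doc:  %.3f\n", bm25Term(3, 800, 200, 10, 1000))
}
```

If dl is always set to avgdl, norm collapses to 1 and every document is treated as average length; storing DL per posting lets shorter documents score higher for the same term frequency.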
Parallel index build.
index build --urls fetches all URLs in parallel using an errgroup worker pool (default 8, flag --workers).
A single drain goroutine serializes writes to the index.
Typical speedup is 5-8x over the sequential v0.1.0 path.
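The worker-pool-plus-single-writer shape described above can be sketched with the standard library alone. Note the substitution: this example uses sync.WaitGroup instead of errgroup to stay dependency-free, and unlike errgroup it does not cancel remaining work on the first error. The fetch and write callbacks are hypothetical stand-ins for the crawler and the index writer.

```go
package main

import (
	"fmt"
	"sync"
)

type doc struct {
	URL  string
	Text string
}

// buildParallel fetches urls with `workers` goroutines and hands every result
// to a single drain goroutine, so write is never called concurrently.
// It returns the first fetch error seen, if any.
func buildParallel(urls []string, workers int, fetch func(string) (string, error), write func(doc)) error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(chan doc)

	var (
		wg       sync.WaitGroup
		errOnce  sync.Once
		firstErr error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				text, err := fetch(u)
				if err != nil {
					errOnce.Do(func() { firstErr = err })
					continue
				}
				results <- doc{URL: u, Text: text}
			}
		}()
	}

	drained := make(chan struct{})
	go func() { // the only goroutine that touches the index
		defer close(drained)
		for d := range results {
			write(d)
		}
	}()

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()      // all workers done, no more sends on results
	close(results) // lets the drain goroutine finish
	<-drained
	return firstErr
}

func main() {
	fetch := func(u string) (string, error) { return "text of " + u, nil } // stand-in for an HTTP fetch
	n := 0
	err := buildParallel([]string{"https://golang.org/", "https://pkg.go.dev/"}, 8, fetch, func(d doc) { n++ })
	fmt.Println(n, err)
}
```

Funnelling writes through one goroutine means the index writer needs no locking of its own.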
Connection pooling.
The crawler shares a package-level http.Transport (200 idle connections, 10 per host) so parallel workers reuse keep-alive connections.
Frontier anti-starvation.
Frontier.Pop scans up to 16 candidates when the top-priority host is inside its politeness window, preventing the crawler from stalling when one host dominates the heap.
Content extraction.
HTML is parsed with bytes.NewReader instead of a string(b) copy, saving one allocation per page.
Snippet truncation is rune-safe so multi-byte UTF-8 is never cut mid-character.
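One way to truncate on a rune boundary in Go is to back up from the byte limit until the cut lands on the first byte of a rune. The helper below is a hypothetical sketch of that idea, not the function ccrawl ships.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncate returns at most maxBytes bytes of s without splitting a UTF-8 sequence.
func truncate(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// Back up until cut sits on the first byte of a rune, so s[:cut]
	// ends exactly at a rune boundary.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	s := "h\u00e9llo w\u00f6rld" // contains two 2-byte runes
	for _, n := range []int{1, 2, 3, 8} {
		fmt.Printf("%d: %q\n", n, truncate(s, n))
	}
}
```

A plain s[:maxBytes] slice would leave a dangling lead byte whenever the limit falls inside a multi-byte character.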
Differential CDX.
sched diff uses content_digest (the real CC column) instead of the non-existent url_digest.
Forward index.
ForwardIndexWriter.Write uses encoding/json.Marshal.
loadForwardIndex uses json.Unmarshal with a 1 MiB scanner buffer, replacing the fragile hand-rolled field parser from v0.1.0.
Install
go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@v0.2.0
Or grab a pre-built binary for your platform from the release page.