ccrawl

v0.2.0

v2 platform: host enumeration, recrawl engine, BM25 search index, REST API, and content pipeline.

v0.2.0 is the v2 platform release. It adds six new command groups on top of the v0.1.0 data-access layer, turning ccrawl into a full crawl-and-index platform: enumerate every host Common Crawl (CC) has seen, drive a fresh recrawl, build a BM25 search index, and serve everything over a REST API.

New commands

host

ccrawl host enumerates and enriches hosts from the CC web graph. It streams the rank table of 262 M hosts, joins in graph topology, and optionally aggregates CDX statistics per host via DuckDB.

ccrawl host top -n 20 -o table
ccrawl host get golang.org -o json
ccrawl host enrich --degrees --cdx -o jsonl > enriched.jsonl
ccrawl host cdx --filter example.com
ccrawl host vertices --graph cc-main-2026-mar-apr-may -n 5
ccrawl host degrees -n 100 -o jsonl

Subcommands: top, get, vertices, degrees, cdx, enrich.

crawl

ccrawl crawl drives the recrawl engine. The seed subcommand generates priority-ordered seed URLs from the rank table, fetch retrieves a live URL with the v2 crawler config, and status shows the daily page budget across the five recrawl tiers.

ccrawl crawl seed --max-tier 2 -n 1000000 -o jsonl > seeds.jsonl
ccrawl crawl fetch https://golang.org/ --robots -o json
ccrawl crawl status -o table

sched

ccrawl sched handles recrawl scheduling. The assign subcommand maps each host to one of five tiers (daily through on-demand) based on its harmonic rank and change rate. The diff subcommand joins two CDX snapshots on URL, compares content_digest, and emits per-host change rates that drive tier re-assignment.

ccrawl sched assign --change-rate 0.5 -n 20
ccrawl sched diff --crawl-a CC-MAIN-2026-17 --crawl-b CC-MAIN-2026-21 -o jsonl
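
The join that sched diff performs can be illustrated with a small in-memory sketch. This is not the command's implementation (which reads CDX data); the Snapshot type and changeRates function are hypothetical names, and the change rate is assumed here to be the fraction of URLs present in both snapshots whose content_digest differs.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// Snapshot maps a URL to its content_digest within one crawl.
// Hypothetical in-memory shape, used only for illustration.
type Snapshot map[string]string

// changeRates joins a and b on URL and returns, per host, the fraction of
// shared URLs whose digest differs between the two snapshots.
func changeRates(a, b Snapshot) map[string]float64 {
	shared := map[string]int{}
	changed := map[string]int{}
	for u, da := range a {
		db, ok := b[u]
		if !ok {
			continue // URL missing from one snapshot: not part of the join
		}
		p, err := url.Parse(u)
		if err != nil || p.Hostname() == "" {
			continue // skip URLs without a usable host
		}
		h := p.Hostname()
		shared[h]++
		if da != db {
			changed[h]++
		}
	}
	rates := make(map[string]float64, len(shared))
	for h, n := range shared {
		rates[h] = float64(changed[h]) / float64(n)
	}
	return rates
}

func main() {
	a := Snapshot{
		"https://example.com/":  "AAA",
		"https://example.com/a": "BBB",
		"https://example.org/":  "CCC",
	}
	b := Snapshot{
		"https://example.com/":  "AAA",
		"https://example.com/a": "ZZZ",
		"https://example.org/":  "CCC",
	}
	rates := changeRates(a, b)
	hosts := make([]string, 0, len(rates))
	for h := range rates {
		hosts = append(hosts, h)
	}
	sort.Strings(hosts)
	for _, h := range hosts {
		fmt.Printf("%s %.2f\n", h, rates[h]) // example.com 0.50, example.org 0.00
	}
}
```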

index

ccrawl index builds and queries a local BM25 full-text search index. The build subcommand fetches URLs in parallel (8 workers by default), extracts clean text, tokenizes it, and writes delta-encoded VByte posting lists that carry each document's length for normalization. The search subcommand queries the index and ranks results by BM25 score blended with the host link-graph signal.

ccrawl index build --urls https://golang.org/,https://pkg.go.dev/ --workers 16
ccrawl index search "golang web server" -n 10 -o json
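
To make "delta-encoded VByte" concrete, here is a minimal sketch of the general technique: sorted document IDs are stored as gaps, and each gap is written as a variable-length integer. It uses Go's encoding/binary uvarint format as the byte encoding, which is an assumption; the on-disk layout ccrawl actually uses is not described here, and encodePostings and decodePostings are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodePostings writes ascending doc IDs as gaps, each gap as a uvarint.
func encodePostings(docIDs []uint64) []byte {
	buf := make([]byte, 0, len(docIDs))
	tmp := make([]byte, binary.MaxVarintLen64)
	var prev uint64
	for _, id := range docIDs {
		n := binary.PutUvarint(tmp, id-prev) // small gaps fit in one byte
		buf = append(buf, tmp[:n]...)
		prev = id
	}
	return buf
}

// decodePostings reverses encodePostings by summing the gaps.
func decodePostings(buf []byte) ([]uint64, error) {
	var ids []uint64
	var prev uint64
	for len(buf) > 0 {
		gap, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, errors.New("corrupt varint in posting list")
		}
		prev += gap
		ids = append(ids, prev)
		buf = buf[n:]
	}
	return ids, nil
}

func main() {
	ids := []uint64{3, 7, 300, 301}
	enc := encodePostings(ids)
	dec, err := decodePostings(enc)
	fmt.Println(len(enc), dec, err) // 5 [3 7 300 301] <nil>
}
```

The four IDs occupy five bytes because only the gap of 293 needs a second byte; storing gaps rather than absolute IDs is what keeps most entries short.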

content

ccrawl content computes live content signals for any URL using the v2 crawler config rather than the CDX archive.

ccrawl content extract https://golang.org/
ccrawl content quality https://example.com/ -o json
ccrawl content outlinks https://news.ycombinator.com/ -n 20

api

ccrawl api starts a v2 HTTP REST API server. On startup it loads the top 1 M hosts from the rank table into memory. Full-text search is available when --index-dir points to a built index.

ccrawl api --addr :8080 --index-dir /data/idx

| Endpoint | Returns |
| --- | --- |
| GET /v2/host/{host} | Enriched host profile |
| GET /v2/hosts?tld=&n= | Top N hosts, optional TLD filter |
| GET /v2/search?q=&k= | BM25 full-text search results |
| GET /v2/health | Health check |
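
A client needs to escape the query text before calling the search endpoint. The sketch below only builds the request URL from the q and k parameters listed above; the base address http://localhost:8080 is an assumption matching the --addr example, searchURL is a hypothetical helper, and the response format is not covered because it is not documented here.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// searchURL builds a GET /v2/search URL with a properly escaped query.
func searchURL(base, q string, k int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v2/search"
	v := url.Values{}
	v.Set("q", q)
	v.Set("k", strconv.Itoa(k))
	u.RawQuery = v.Encode() // keys are emitted in sorted order
	return u.String(), nil
}

func main() {
	s, err := searchURL("http://localhost:8080", "golang web server", 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // http://localhost:8080/v2/search?k=10&q=golang+web+server
}
```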

Improvements

BM25 length normalization. The inverted index now stores a DL (document token length) varint per posting entry. Scoring normalizes against each document's actual length instead of substituting the corpus average for every document, so length normalization now takes effect.
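
The role of the stored document length is easiest to see in the standard Okapi BM25 term score. The sketch below is the textbook formula, not ccrawl's scoring code: bm25Term is a hypothetical name, the k1 and b values are common defaults rather than values taken from the project, and the link-graph blending used by index search is left out.

```go
package main

import (
	"fmt"
	"math"
)

// bm25Term scores one query term against one document.
// tf: term frequency in the document, dl: document length in tokens,
// avgdl: average document length, n: documents in the corpus,
// df: documents containing the term.
func bm25Term(tf, dl, avgdl float64, n, df int, k1, b float64) float64 {
	idf := math.Log(1 + (float64(n)-float64(df)+0.5)/(float64(df)+0.5))
	norm := 1 - b + b*dl/avgdl // > 1 for long documents, < 1 for short ones
	return idf * tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	const k1, b = 1.2, 0.75 // assumed defaults
	short := bm25Term(3, 100, 500, 1000, 10, k1, b)
	long := bm25Term(3, 2000, 500, 1000, 10, k1, b)
	// Same term frequency, but the shorter document scores higher.
	fmt.Printf("short=%.3f long=%.3f\n", short, long)
}
```

If dl were replaced by avgdl for every document, norm would always equal 1 and the two scores above would be identical, which is the failure mode a per-posting DL value avoids.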

Parallel index build. index build --urls fetches all URLs in parallel using an errgroup worker pool (default 8, flag --workers). A single drain goroutine serializes writes to the index. Typical speedup is 5-8x over the sequential v0.1.0 path.

Connection pooling. The crawler shares a package-level http.Transport (200 idle connections, 10 per host) so parallel workers reuse keep-alive connections.

Frontier anti-starvation. Frontier.Pop scans up to 16 candidates when the top-priority host is inside its politeness window, preventing the crawler from stalling when one host dominates the heap.

Content extraction. HTML is parsed with bytes.NewReader instead of a string(b) copy, saving one allocation per page. Snippet truncation is rune-safe so multi-byte UTF-8 is never cut mid-character.
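
Rune-safe truncation can be done by backing up from the byte limit to the nearest rune boundary. The sketch below assumes the snippet limit is expressed in bytes and that the input is valid UTF-8; truncate is a hypothetical name, not the project's function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncate returns at most max bytes of s without splitting a UTF-8 sequence.
func truncate(s string, max int) string {
	if max <= 0 {
		return ""
	}
	if len(s) <= max {
		return s
	}
	// s[max] is the first byte that would be dropped. If it is a continuation
	// byte, the cut falls inside a character, so move the cut point left.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	fmt.Println(truncate("héllo", 2)) // "h": cutting at 2 would split é
	fmt.Println(truncate("héllo", 3)) // "hé"
	fmt.Println(truncate("日本語", 4))  // "日": each character is 3 bytes
}
```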

Differential CDX. sched diff uses content_digest (the real CC column) instead of the non-existent url_digest.

Forward index. ForwardIndexWriter.Write uses encoding/json.Marshal. loadForwardIndex uses json.Unmarshal with a 1 MiB scanner buffer, replacing the fragile hand-rolled field parser from v0.1.0.

Install

go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@v0.2.0

Or grab a pre-built binary for your platform from the release page.