v0.4.2

Smarter retry backoff, faster columnar domain queries, and a rewritten README with a demo.

v0.4.2 makes ccrawl steadier under load and quicker on targeted columnar queries.

Smarter retry backoff

The HTTP client now retries 403, 429, and 5xx responses with exponential backoff and equal jitter instead of a flat wait, so a fleet of workers does not retry in lockstep when the CDN throttles them. When a response carries a Retry-After header, ccrawl honors it, whether the value is a number of seconds or an HTTP date, and clamps it so a long server hint cannot stall a run.

The wait grows from a 1s base up to a 30s ceiling, both reported by ccrawl config show as backoff and backoff_max. --retries still sets the attempt count.

Faster columnar domain queries

A --domain or --host filter on ccrawl columnar now also bounds url_surtkey, the reversed-host key the Parquet index is sorted by. The engine can prune whole row groups by their min and max key rather than reading them to test the registered-domain equality, so a *.example.com style query is noticeably faster on a cold scan. Results are unchanged; only the work the engine skips is different.

Docs

The README is rewritten around what you want to do, with a recorded demo and worked examples for every command group, and it links the Common Crawl dataset, index, terms of use, and status page directly.

Install

brew install tamnd/tap/ccrawl
scoop install ccrawl

The release attaches the prebuilt archives, the deb, rpm, and apk packages, and the container image at ghcr.io/tamnd/ccrawl, and refreshes the apt and dnf repositories.