v0.5.0

Pull a whole crawl's URLs into a sharded seed, then offload it to a HuggingFace dataset.

v0.5.0 turns ccrawl into a full-crawl URL pipeline: read every URL a Common Crawl snapshot captured, shard it by host, and park it on HuggingFace so the corpus never has to be pulled twice.

seed cc

ccrawl seed cc <crawl-id> reads only the url column of the crawl's columnar index and writes a host-sharded .seed directly. It skips the WARC bodies and every other column, so the pull is bounded by the size of the URL text alone rather than the full 150 GB+ index. Row groups are pruned with a url_surtkey prefix predicate, and a flaky index file is retried automatically instead of aborting the whole run.
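To illustrate how a prefix predicate can prune row groups, here is a minimal sketch. It assumes pruning is decided from each row group's min/max statistics on url_surtkey, which the notes above do not spell out. The function names and the sample keys are hypothetical and are not ccrawl's API.

```python
def row_group_may_match(min_key: str, max_key: str, prefix: str) -> bool:
    """Return False only when no key in [min_key, max_key] can start with prefix.

    Keys sharing a prefix form one contiguous range in sorted order, so a
    row group can be skipped when it lies entirely below or above that range.
    Assumes ASCII keys, where str order matches byte order.
    """
    if max_key < prefix:
        # Every key in the group sorts before the first key with this prefix.
        return False
    if min_key > prefix and not min_key.startswith(prefix):
        # The smallest key already sorts past the last key with this prefix.
        return False
    return True


def prune(stats: list[tuple[str, str]], prefix: str) -> list[int]:
    """Indices of row groups that still have to be read (hypothetical helper)."""
    return [
        i for i, (lo, hi) in enumerate(stats)
        if row_group_may_match(lo, hi, prefix)
    ]


if __name__ == "__main__":
    # Made-up statistics for three row groups.
    groups = [("com,a", "com,b"), ("com,d", "com,example)/x"), ("com,f", "com,z")]
    print(prune(groups, "com,example"))
```

The check is conservative: a surviving row group may still hold no matching rows, but a skipped one never does.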

The shards tile the 64-bit hostkey space with no gap and no overlap, so every URL of a host lands in exactly one shard. That is the same partitioning meguri assigns when it ingests a seed, so each shard maps one-to-one onto a crawl-frontier partition.
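The tiling property can be sketched as follows. The hash used for host_key here (the first 8 bytes of SHA-256) and the equal-width split are assumptions for illustration only. meguri's actual hostkey function and shard boundaries are not described in these notes.

```python
import hashlib

KEY_BITS = 64
KEY_SPACE = 1 << KEY_BITS


def host_key(host: str) -> int:
    """Hypothetical 64-bit host key: first 8 bytes of SHA-256 of the lowercased host."""
    digest = hashlib.sha256(host.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for_key(key: int, num_shards: int) -> int:
    """Map a 64-bit key to a shard index in [0, num_shards)."""
    return (key * num_shards) >> KEY_BITS


def shard_bounds(shard: int, num_shards: int) -> tuple[int, int]:
    """Half-open key range [lo, hi) that shard_for_key assigns to this shard."""
    # Ceiling division keeps the bounds consistent with shard_for_key
    # even when num_shards is not a power of two.
    lo = (shard * KEY_SPACE + num_shards - 1) // num_shards
    hi = ((shard + 1) * KEY_SPACE + num_shards - 1) // num_shards
    return lo, hi


if __name__ == "__main__":
    n = 16
    for host in ("example.com", "Example.com", "example.org"):
        print(host, shard_for_key(host_key(host), n))
```

Because each range starts exactly where the previous one ends, and the shard depends only on the host, every URL of a host falls into the same single shard.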

seed publish

ccrawl seed publish <seed-dir> transcodes each shard to Parquet, with a url column plus a derived host column, and commits the files to a HuggingFace dataset under data/crawl=<id>/. It deletes each local Parquet file as it goes, so a full crawl fits on a small box. A resume ledger skips shards already published, commits are batched, and low disk pauses the run rather than failing it. The generated dataset card carries the schema, load snippets for the datasets library and DuckDB, and instructions for reseeding a meguri frontier.
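The resume behavior can be pictured with a small sketch. It assumes the ledger is an append-only text file with one published shard name per line. The real ledger format is not documented here, and Ledger, publish_all, and publish_one are hypothetical names.

```python
import os


class Ledger:
    """Append-only record of shards that were published successfully."""

    def __init__(self, path: str):
        self.path = path
        self.done: set[str] = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.done = {line.strip() for line in f if line.strip()}

    def mark(self, shard: str) -> None:
        # Flush and fsync so a crash right after publishing does not lose the entry.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(shard + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.done.add(shard)


def publish_all(shards, ledger: Ledger, publish_one) -> list[str]:
    """Publish every shard not yet in the ledger; return the ones done in this run."""
    published = []
    for shard in shards:
        if shard in ledger.done:
            continue  # already uploaded in an earlier run
        publish_one(shard)  # raises on failure, so the shard stays unmarked
        ledger.mark(shard)
        published.append(shard)
    return published
```

A shard is recorded only after publish_one returns, so an interrupted run repeats at most the shard that was in flight.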

The default target is open-index/commoncrawl-urls. Point --repo elsewhere to publish under your own org.

Plumbing

The HTTP client, shared by every fetch, now backs off exponentially with jitter and honors Retry-After. A ranged ReaderAt reads the columnar index without buffering whole files, and the Parquet writer gained a batched WriteRows path that amortizes per-row overhead at billions of rows. ccrawl now depends on meguri's published seed API directly, so its build no longer needs a local checkout.
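The retry policy can be sketched as a delay calculation. The base delay, the cap, and the choice of "full jitter" (a uniform draw between zero and the exponential ceiling) are assumptions, since the notes do not state ccrawl's actual constants or jitter scheme. The function names are hypothetical.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Seconds to wait according to a Retry-After header, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, retry_after=None, base=0.5, cap=60.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based).

    A usable Retry-After value wins over the computed backoff; both are
    clamped to `cap` (an assumed policy, not a documented one).
    """
    hinted = parse_retry_after(retry_after)
    if hinted is not None:
        return min(hinted, cap)
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * rng()  # full jitter: uniform in [0, ceiling)
```

Jitter spreads out retries from many concurrent fetches, so they do not all hit the server again at the same instant.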

Install

brew install tamnd/tap/ccrawl
scoop install ccrawl

The release attaches the prebuilt archives, the deb, rpm, and apk packages, and the container image at ghcr.io/tamnd/ccrawl, and refreshes the apt and dnf repositories.