ccrawl

v0.2.2

New host dataset command: builds all ~262M Common Crawl hosts as partitioned Parquet shards.

Released 2026-06-18

host dataset

ccrawl host dataset builds a complete host dataset from Common Crawl: all ~262 million hosts, enriched with CDX statistics and harmonic rank, written as 28 per-prefix Parquet shards ready for upload to HuggingFace or local querying.

ccrawl host dataset --work-dir /data/cc-work --out-dir /data/cc-shards

The pipeline runs entirely in Go with no external dependencies. It does not require DuckDB.

How it works

Four phases run in sequence, each resumable via .done markers:

  1. CDX raw extract — 8 parallel workers download the ~184 GB CDX Parquet index (302 files) and fan each row to a per-prefix cdx-raw-{x}.jsonl.gz. No aggregation happens here; every URL capture is written as a raw row so the download can proceed at full speed.
  2. CDX aggregate — a local Go pass groups raw rows by host and writes cdx-agg-{x}.jsonl.gz with one row per unique host. Peak RAM per prefix is under 800 MB.
  3. Rank split — streams the rank table (~2.8 GB) once and fans it to 28 per-prefix rank-{x}.tsv.gz files. The table is downloaded to disk first so the split can resume after a TCP reset.
  4. Shard build — joins CDX and rank for each prefix and writes hosts-{x}.parquet. Shards can be processed in parallel across machines since phases 1–3 wrote independent prefix files.
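The resume behaviour described above can be sketched as follows. This is a minimal illustration, not ccrawl's actual code: the `phase` type, the `runPhases` and `demo` functions, and the marker file names are all hypothetical. The only idea taken from the notes is that a finished phase leaves a `.done` marker and is skipped on the next run.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// phase is a hypothetical unit of work; name doubles as the marker file stem.
type phase struct {
	name string
	run  func() error
}

// runPhases executes phases in order and skips any phase whose
// <name>.done marker already exists in workDir. It returns the names
// of the phases that actually ran during this call.
func runPhases(workDir string, phases []phase) ([]string, error) {
	var ran []string
	for _, p := range phases {
		marker := filepath.Join(workDir, p.name+".done")
		if _, err := os.Stat(marker); err == nil {
			continue // completed by an earlier run
		} else if !os.IsNotExist(err) {
			return ran, err
		}
		if err := p.run(); err != nil {
			return ran, fmt.Errorf("phase %s: %w", p.name, err)
		}
		// Write the marker only after the phase has succeeded, so a
		// crash mid-phase causes that phase to run again.
		if err := os.WriteFile(marker, nil, 0o644); err != nil {
			return ran, err
		}
		ran = append(ran, p.name)
	}
	return ran, nil
}

// demo runs four no-op phases twice in a temporary directory and
// reports how many phases ran each time (-1, -1 on setup failure).
func demo() (first, second int) {
	dir, err := os.MkdirTemp("", "ccrawl-demo-")
	if err != nil {
		return -1, -1
	}
	defer os.RemoveAll(dir)

	noop := func() error { return nil }
	phases := []phase{
		{"cdx-raw", noop}, {"cdx-agg", noop}, {"rank-split", noop}, {"shards", noop},
	}
	a, err := runPhases(dir, phases)
	if err != nil {
		return -1, -1
	}
	b, err := runPhases(dir, phases)
	if err != nil {
		return -1, -1
	}
	return len(a), len(b)
}

func main() {
	first, second := demo()
	fmt.Println(first, second) // prints "4 0"
}
```

Writing the marker last is the important design choice: a phase that dies halfway leaves no marker, so the next invocation repeats it instead of trusting partial output.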

Performance

Measured on a 6-core server with ~2.5 MB/s CC bandwidth: CDX extract ~2.4 h, full pipeline ~3 h. On a home machine with ~1.2 MB/s bandwidth: CDX extract ~5 h, full pipeline ~7 h. In both setups the bottleneck was network bandwidth to Common Crawl; CPU load stayed below 20%.
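As a rough sanity check on these timings, the sketch below estimates the CDX extract time from the index size and download rate. It rests on an assumption the notes do not state: that the quoted bandwidth is per worker, so the default 8 workers together move 8 times that rate. Under that reading the estimates land close to the measured figures. `estimateHours` is a hypothetical helper, and sizes are decimal (1 GB = 1000 MB).

```go
package main

import "fmt"

// estimateHours returns the time in hours to download sizeGB of data
// using `workers` parallel connections at mbPerSec each.
// Assumption: throughput scales linearly with the number of workers.
func estimateHours(sizeGB float64, workers int, mbPerSec float64) float64 {
	totalMB := sizeGB * 1000
	seconds := totalMB / (float64(workers) * mbPerSec)
	return seconds / 3600
}

func main() {
	fmt.Printf("server: %.1f h\n", estimateHours(184, 8, 2.5)) // server: 2.6 h
	fmt.Printf("home:   %.1f h\n", estimateHours(184, 8, 1.2)) // home:   5.3 h
}
```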

Flags

| Flag | Meaning |
|---|---|
| --work-dir | Directory for intermediate files (default: ~/.ccrawl/dataset) |
| --out-dir | Directory for output Parquet shards (default: .) |
| --prefix | Run a single prefix (a–z, 0, misc) for a test run |
| --cdx-workers | Parallel CDX download workers (default: 8) |
| --upload | Upload each shard to HuggingFace after building |
| --hf-repo | HuggingFace repository (default: open-index/cc-host-dataset) |
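The --prefix values (a–z, 0, misc) add up to the 28 shards. One plausible way to assign a host to a shard is sketched below. The rule itself is an assumption, not taken from ccrawl: it buckets on the first character of the host name, sends any leading digit to "0", and sends everything else to "misc". `prefixFor` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixFor maps a host name to one of 28 bucket labels:
// "a".."z", "0", or "misc".
// Assumption: "0" collects hosts starting with any digit and "misc"
// collects everything else (empty names, underscores, non-ASCII).
func prefixFor(host string) string {
	host = strings.ToLower(strings.TrimSpace(host))
	if host == "" {
		return "misc"
	}
	c := host[0]
	switch {
	case c >= 'a' && c <= 'z':
		return string(rune(c))
	case c >= '0' && c <= '9':
		return "0"
	default:
		return "misc"
	}
}

func main() {
	for _, h := range []string{"example.com", "9gag.com", "_dmarc.example.org"} {
		fmt.Println(h, "->", prefixFor(h))
	}
}
```

Because the bucket depends only on the host name, the CDX and rank files for one prefix can be joined without reading any other prefix.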

Output schema

Each hosts-{x}.parquet shard has one row per host:

| Column | Type | Source |
|---|---|---|
| host | string | CDX url_host_name |
| registered_domain | string | CDX url_host_registered_domain |
| harmonic_rank | int64 | web-graph rank table |
| harmonic_value | float64 | web-graph rank table |
| url_count | int64 | CDX row count |
| status_2xx | int64 | CDX fetch status |
| status_3xx | int64 | CDX fetch status |
| status_4xx | int64 | CDX fetch status |
| status_5xx | int64 | CDX fetch status |
| top_mime | string | most common detected MIME |
| language | string | most common detected language |
| first_seen | string | earliest CDX timestamp |
| last_seen | string | latest CDX timestamp |
| total_bytes | int64 | sum of WARC record lengths |
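To show how the CDX-derived columns could be computed, here is a minimal sketch of a per-host fold over raw capture rows. The `rawRow` and `hostAgg` types and the `aggregate` function are hypothetical, not ccrawl's own. The sketch also assumes timestamps are fixed-width strings, so that string comparison matches time order. It omits top_mime and language, which would need per-host frequency counts.

```go
package main

import "fmt"

// rawRow is a hypothetical subset of one CDX capture.
type rawRow struct {
	Host      string
	Status    int
	Timestamp string // assumed fixed-width, e.g. "20260115093000"
	Length    int64  // WARC record length in bytes
}

// hostAgg mirrors the CDX-derived columns of the output schema.
type hostAgg struct {
	Host       string
	URLCount   int64
	Status2xx  int64
	Status3xx  int64
	Status4xx  int64
	Status5xx  int64
	FirstSeen  string
	LastSeen   string
	TotalBytes int64
}

// aggregate folds raw rows into one record per unique host.
func aggregate(rows []rawRow) map[string]*hostAgg {
	out := make(map[string]*hostAgg)
	for _, r := range rows {
		a, ok := out[r.Host]
		if !ok {
			a = &hostAgg{Host: r.Host, FirstSeen: r.Timestamp, LastSeen: r.Timestamp}
			out[r.Host] = a
		}
		a.URLCount++
		a.TotalBytes += r.Length
		switch r.Status / 100 {
		case 2:
			a.Status2xx++
		case 3:
			a.Status3xx++
		case 4:
			a.Status4xx++
		case 5:
			a.Status5xx++
		}
		if r.Timestamp < a.FirstSeen {
			a.FirstSeen = r.Timestamp
		}
		if r.Timestamp > a.LastSeen {
			a.LastSeen = r.Timestamp
		}
	}
	return out
}

func main() {
	rows := []rawRow{
		{"a.example", 200, "20260102000000", 1200},
		{"a.example", 404, "20260101000000", 300},
		{"b.example", 301, "20260103000000", 150},
	}
	for host, a := range aggregate(rows) {
		fmt.Printf("%s: %+v\n", host, *a)
	}
}
```

Holding one small struct per host, rather than every capture, is what keeps memory bounded by the number of unique hosts in a prefix.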

Documentation

  • Host graph and enrichment guide updated with a full host dataset walkthrough.
  • CLI reference updated with the full flag surface.