v0.2.2
host dataset command: build all 262M CC hosts as partitioned Parquet shards.
Released 2026-06-18
host dataset
ccrawl host dataset builds a complete host dataset from Common Crawl: all ~262 million hosts, enriched with CDX statistics and harmonic rank, written as 28 per-prefix Parquet shards ready for upload to HuggingFace or local querying.
ccrawl host dataset --work-dir /data/cc-work --out-dir /data/cc-shards
The pipeline runs entirely in Go with no external dependencies. It does not require DuckDB.
How it works
Four phases run in sequence, each resumable via .done markers:
- CDX raw extract: 8 parallel workers download the ~184 GB CDX Parquet index (302 files) and fan each row to a per-prefix `cdx-raw-{x}.jsonl.gz`. No aggregation happens here; every URL capture is written as a raw row so the download can proceed at full speed.
- CDX aggregate: a local Go pass groups raw rows by host and writes `cdx-agg-{x}.jsonl.gz` with one row per unique host. Peak RAM per prefix is under 800 MB.
- Rank split: streams the rank table (~2.8 GB) once and fans it to 28 per-prefix `rank-{x}.tsv.gz` files. The table is downloaded to disk first so the split can resume after a TCP reset.
- Shard build: joins CDX and rank for each prefix and writes `hosts-{x}.parquet`. Shards can be processed in parallel across machines since phases 1–3 wrote independent prefix files.
Performance
Measured on a 6-core server with ~2.5 MB/s CC bandwidth: CDX extract ~2.4 h, full pipeline ~3 h. On a home machine with ~1.2 MB/s bandwidth: CDX extract ~5 h, full pipeline ~7 h. The bottleneck is always network bandwidth to Common Crawl; CPU load stays below 20%.
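As a rough sanity check on these timings, the sketch below estimates the CDX download time from index size, worker count, and bandwidth. It assumes the quoted bandwidth applies per worker and that 1 GB = 1000 MB; neither is stated in these notes, and `estimateHours` is a hypothetical helper. Under those assumptions the estimates come out at about 2.6 h and 5.3 h, in the same range as the measured CDX extract times.

```go
package main

import "fmt"

// estimateHours returns the time in hours to download sizeGB of data over
// `workers` parallel connections, each sustaining perWorkerMBps.
// Decimal units are assumed (1 GB = 1000 MB).
func estimateHours(sizeGB float64, workers int, perWorkerMBps float64) float64 {
	seconds := sizeGB * 1000 / (float64(workers) * perWorkerMBps)
	return seconds / 3600
}

func main() {
	// ~184 GB CDX index, default 8 workers.
	fmt.Printf("server: %.1f h\n", estimateHours(184, 8, 2.5))
	fmt.Printf("home:   %.1f h\n", estimateHours(184, 8, 1.2))
}
```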
Flags
| Flag | Meaning |
|---|---|
| `--work-dir` | Directory for intermediate files (default: `~/.ccrawl/dataset`) |
| `--out-dir` | Directory for output Parquet shards (default: `.`) |
| `--prefix` | Run a single prefix (a–z, 0, misc) for a test run |
| `--cdx-workers` | Parallel CDX download workers (default: 8) |
| `--upload` | Upload each shard to HuggingFace after building |
| `--hf-repo` | HuggingFace repository (default: `open-index/cc-host-dataset`) |
Output schema
Each hosts-{x}.parquet shard has one row per host:
| Column | Type | Source |
|---|---|---|
| `host` | string | CDX `url_host_name` |
| `registered_domain` | string | CDX `url_host_registered_domain` |
| `harmonic_rank` | int64 | web-graph rank table |
| `harmonic_value` | float64 | web-graph rank table |
| `url_count` | int64 | CDX row count |
| `status_2xx` | int64 | CDX fetch status |
| `status_3xx` | int64 | CDX fetch status |
| `status_4xx` | int64 | CDX fetch status |
| `status_5xx` | int64 | CDX fetch status |
| `top_mime` | string | most common detected MIME |
| `language` | string | most common detected language |
| `first_seen` | string | earliest CDX timestamp |
| `last_seen` | string | latest CDX timestamp |
| `total_bytes` | int64 | sum of WARC record lengths |
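To make the CDX-derived columns concrete, here is a sketch of how raw captures could fold into one row per host. It is illustrative only: the `Capture` and `HostRow` types, their field names, the sortable timestamp format, and the tie-break rule for `top_mime` are all assumptions, and `language`, `registered_domain`, and the rank columns are left out for brevity.

```go
package main

import "fmt"

// Capture is a hypothetical raw CDX row; field names are illustrative.
type Capture struct {
	Host      string
	Status    int
	MIME      string
	Timestamp string // assumed to sort lexicographically, e.g. "20260101120000"
	Length    int64
}

// HostRow mirrors the CDX-derived columns of the output schema.
type HostRow struct {
	Host       string
	URLCount   int64
	Status2xx  int64
	Status3xx  int64
	Status4xx  int64
	Status5xx  int64
	TopMIME    string
	FirstSeen  string
	LastSeen   string
	TotalBytes int64
}

// aggregate folds raw captures into one HostRow per host.
func aggregate(rows []Capture) map[string]*HostRow {
	out := make(map[string]*HostRow)
	mimes := make(map[string]map[string]int64) // host -> MIME -> count
	for _, c := range rows {
		r, ok := out[c.Host]
		if !ok {
			r = &HostRow{Host: c.Host, FirstSeen: c.Timestamp, LastSeen: c.Timestamp}
			out[c.Host] = r
			mimes[c.Host] = make(map[string]int64)
		}
		r.URLCount++
		r.TotalBytes += c.Length
		switch c.Status / 100 { // bucket by status class
		case 2:
			r.Status2xx++
		case 3:
			r.Status3xx++
		case 4:
			r.Status4xx++
		case 5:
			r.Status5xx++
		}
		if c.Timestamp < r.FirstSeen {
			r.FirstSeen = c.Timestamp
		}
		if c.Timestamp > r.LastSeen {
			r.LastSeen = c.Timestamp
		}
		mimes[c.Host][c.MIME]++
	}
	for host, counts := range mimes {
		best, bestN := "", int64(0)
		for m, n := range counts {
			// Ties go to the lexicographically smaller MIME so results are deterministic.
			if n > bestN || (n == bestN && m < best) {
				best, bestN = m, n
			}
		}
		out[host].TopMIME = best
	}
	return out
}

func main() {
	rows := []Capture{
		{"example.com", 200, "text/html", "20260102000000", 1200},
		{"example.com", 404, "text/html", "20260101000000", 300},
		{"example.com", 200, "image/png", "20260103000000", 5000},
	}
	fmt.Printf("%+v\n", *aggregate(rows)["example.com"])
}
```

Because each prefix is aggregated independently, only the hosts for a single prefix need to be held in memory at once, which is in line with the bounded per-prefix RAM described under How it works.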
Documentation
- Host graph and enrichment guide updated with a full `host dataset` walkthrough.
- CLI reference updated with the full flag surface.