ccrawl

Host graph and enrichment

Enumerate every host Common Crawl has seen, join in graph topology, and aggregate per-host CDX statistics.

Common Crawl publishes a web graph alongside its crawl archives: a snapshot of the domain-level link graph distilled into rank tables, vertex maps, and edge files. ccrawl host reads those files and builds enriched per-host records.

Looking up a single host

ccrawl host get golang.org -o json

This streams the rank table, finds the entry for golang.org, and returns its harmonic rank position and value.

Browsing the top of the graph

ccrawl host top -n 20 -o table
ccrawl host top -n 1000 -o jsonl > top1k.jsonl

Results are streamed from the rank table in rank order, so -n 20 is fast even though the full table covers 262 million hosts. Pin a specific web-graph release with --graph:

ccrawl host top --graph cc-main-2026-mar-apr-may -n 100
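Why a small -n returns quickly can be illustrated with a lazy reader. The sketch below is not ccrawl's implementation. It assumes a hypothetical tab-separated stream already sorted by rank, with an invented column order (position, value, host), and it stops pulling lines once n rows have been produced.

```python
import io
import itertools

def top_n(lines, n):
    """Yield the first n parsed rows from a rank-ordered line stream.

    Assumed (hypothetical) layout: position <TAB> value <TAB> host.
    The input is consumed lazily, so rows after the n-th are never parsed.
    """
    for line in itertools.islice(lines, n):
        position, value, host = line.rstrip("\n").split("\t")
        yield {"position": int(position), "value": float(value), "host": host}

# Made-up sample data standing in for the real rank table.
sample = io.StringIO(
    "1\t0.9\texample.com\n"
    "2\t0.8\texample.org\n"
    "3\t0.7\texample.net\n"
)
rows = list(top_n(sample, 2))
```

The same idea applies to any source that yields lines in rank order: the cost depends on n, not on the size of the full table.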

Vertex map

The vertex file maps each numeric vertex ID to a hostname. host vertices streams it:

ccrawl host vertices -n 10
ccrawl host vertices --graph cc-main-2026-mar-apr-may -n 5 -o jsonl

This is useful when joining edge files (which use vertex IDs) back to human-readable names.
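A minimal sketch of that join, assuming a vertex file with one "id<TAB>hostname" pair per line and an edge file with one "from_id<TAB>to_id" pair per line. Both layouts and the function names are assumptions for illustration; check the actual files in the release you use.

```python
def load_vertices(lines):
    """Build {vertex_id: hostname} from 'id<TAB>hostname' lines (assumed layout)."""
    vertices = {}
    for line in lines:
        vid, host = line.rstrip("\n").split("\t")
        vertices[int(vid)] = host
    return vertices

def name_edges(edge_lines, vertices):
    """Translate 'from_id<TAB>to_id' lines into (from_host, to_host) pairs."""
    for line in edge_lines:
        src, dst = (int(x) for x in line.split())
        yield vertices[src], vertices[dst]

# Made-up sample data; real files are far larger.
vertex_lines = ["0\texample.com\n", "1\texample.org\n", "2\texample.net\n"]
edge_lines = ["0\t1\n", "2\t0\n"]
named = list(name_edges(edge_lines, load_vertices(vertex_lines)))
```

Holding the whole map in a dict is only practical when memory allows; for the full vertex file a disk-backed store may be a better fit.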

In-degree and out-degree

The edge files record domain-level links. host degrees streams all edge files (~7.7 GB) and computes in-degree and out-degree for every host:

ccrawl host degrees -n 100 -o table
ccrawl host degrees -o jsonl > degrees.jsonl

This is a large scan. Run it on a machine with a fast connection, or redirect the output to a file once and query that file locally.
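The degree computation itself is a single pass over the edges. The following is a sketch of the general technique, not ccrawl's code; it assumes edges arrive as (from_id, to_id) pairs.

```python
from collections import Counter

def degrees(edges):
    """Count in-degree and out-degree per vertex in one pass over the edges."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1  # src links out to dst
        in_deg[dst] += 1   # dst is linked from src
    return in_deg, out_deg

# Made-up edges: 0 -> 1, 0 -> 2, 2 -> 1
edges = [(0, 1), (0, 2), (2, 1)]
in_deg, out_deg = degrees(edges)
```

A Counter returns 0 for vertices that never appear on one side, so vertices with no inbound or no outbound links need no special handling.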

CDX statistics per host

host cdx queries the columnar Parquet index and returns per-host URL counts, HTTP status breakdown, top MIME type, language, first/last seen crawl, and total bytes:

ccrawl host cdx --filter golang.org -o json     # one host
ccrawl host cdx -n 100 -o jsonl                 # top 100 hosts by URL count

Without --filter, this scans ~184 GB of Parquet. host cdx requires duckdb on your PATH. The query runs directly against the public S3 Parquet index, so no local download is needed.
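To make the shape of the per-host result concrete, here is a small in-memory sketch of the same kind of aggregation. It does not touch DuckDB or S3, and the row keys (host, status, mime, bytes) are hypothetical names chosen for illustration, not the index's column names.

```python
from collections import Counter, defaultdict

def aggregate(rows):
    """Group index rows by host and summarise them.

    Each row is a dict with hypothetical keys: host, status, mime, bytes.
    """
    stats = defaultdict(
        lambda: {"urls": 0, "status": Counter(), "mime": Counter(), "bytes": 0}
    )
    for row in rows:
        s = stats[row["host"]]
        s["urls"] += 1
        s["status"][row["status"]] += 1
        s["mime"][row["mime"]] += 1
        s["bytes"] += row["bytes"]
    return {
        host: {
            "urls": s["urls"],
            "status": dict(s["status"]),
            "top_mime": s["mime"].most_common(1)[0][0],
            "bytes": s["bytes"],
        }
        for host, s in stats.items()
    }

# Made-up sample rows.
rows = [
    {"host": "example.com", "status": 200, "mime": "text/html", "bytes": 100},
    {"host": "example.com", "status": 404, "mime": "text/html", "bytes": 50},
    {"host": "example.com", "status": 200, "mime": "application/pdf", "bytes": 300},
    {"host": "example.org", "status": 200, "mime": "text/html", "bytes": 10},
]
summary = aggregate(rows)
```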

Building the full host dataset

host dataset builds a complete, pre-joined host dataset covering all 262 million hosts Common Crawl has ever seen. The output is 28 partitioned Parquet shards (one per letter a to z, plus 0 and misc), ready to upload to HuggingFace or query locally.

ccrawl host dataset \
  --work-dir /data/cc-work \
  --out-dir  /data/cc-shards
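One way to picture the prefix partitioning is a function that maps a hostname to its shard key. The rule below is an assumption made for illustration (first character a to z maps to that letter, an ASCII digit maps to 0, anything else maps to misc); the text above only names the shard keys, not the exact rule ccrawl applies.

```python
import string

# 26 letters plus "0" and "misc" gives the 28 shard keys.
PREFIXES = list(string.ascii_lowercase) + ["0", "misc"]

def shard_prefix(host):
    """Return the shard key for a hostname under the assumed bucketing rule."""
    first = host[:1].lower()
    if "a" <= first <= "z":
        return first
    if "0" <= first <= "9":  # assumption: all leading digits share the "0" shard
        return "0"
    return "misc"            # assumption: everything else, including empty input

keys = [shard_prefix(h) for h in ("golang.org", "123.example", "_dmarc.example.com")]
```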

The pipeline runs four phases in sequence:

Phase | What happens | Typical time (server)
CDX raw extract | Downloads ~184 GB of Parquet across 302 files with 8 parallel workers, fans each row to a per-prefix .jsonl.gz file | ~2–3 h
CDX aggregate | Groups raw rows by host using a local Go pass; no network I/O | ~15–30 min
Rank split | Streams the rank table (~2.8 GB) and fans it to 28 per-prefix .tsv.gz files | ~10 min
Shard build | Joins CDX + rank for each prefix and writes hosts-{x}.parquet | ~8 min

Every phase writes a .done marker so the run resumes cleanly if interrupted:

# resume after a crash or power cut — already-done phases are skipped
ccrawl host dataset --work-dir /data/cc-work --out-dir /data/cc-shards
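The marker-file pattern behind this kind of resume can be sketched as follows. This is a generic illustration, not ccrawl's code; the marker naming scheme (<phase>.done inside the work directory) and the phase names are assumptions.

```python
import tempfile
from pathlib import Path

def run_phases(work_dir, phases):
    """Run (name, func) phases in order, skipping any that have a marker file.

    The marker is written only after the phase returns, so a phase that
    crashes part-way is retried on the next run.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, func in phases:
        marker = work_dir / f"{name}.done"  # hypothetical marker name
        if marker.exists():
            continue
        func()
        marker.touch()
        executed.append(name)
    return executed

# Placeholder phases that do nothing; real phases would do the work.
phases = [("cdx-raw", lambda: None), ("cdx-agg", lambda: None)]
with tempfile.TemporaryDirectory() as tmp:
    first = run_phases(tmp, phases)   # runs both phases
    second = run_phases(tmp, phases)  # finds both markers and skips them
```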

Tune the CDX worker count to your connection's bandwidth (default 8 is safe; raise to 16 on a fast server):

ccrawl host dataset --cdx-workers 16 --work-dir /data/cc-work --out-dir /data/cc-shards
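The effect of a worker count can be sketched with a bounded thread pool: at most that many downloads are in flight at once. This is a generic pattern under stated assumptions, not ccrawl's implementation (which the table above describes as Go); process_all and its parameters are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(items, worker, workers=8):
    """Apply worker to every item with at most `workers` running concurrently.

    Results come back in input order, regardless of completion order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, items))

# Stand-in for a download: square the file index.
results = process_all(range(5), lambda i: i * i, workers=2)
```

Raising the limit only helps while the network, not the CPU or disk, is the bottleneck.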

Upload each shard to HuggingFace as it finishes and delete the local copy:

ccrawl host dataset \
  --work-dir /data/cc-work \
  --out-dir  /data/cc-shards \
  --upload \
  --hf-repo  your-org/cc-host-dataset

Process a single prefix first to measure throughput on your machine before running all 28:

ccrawl host dataset --prefix a --work-dir /data/cc-work --out-dir /data/cc-shards

Full enrichment pipeline

host enrich runs the enrichment phases in one command and streams enriched HostRecord rows. The rank and vertex phases always run; the degree and CDX scans are enabled with flags:

ccrawl host enrich -n 20                                 # rank only (fast)
ccrawl host enrich --degrees -n 100                      # rank + degrees
ccrawl host enrich --degrees --cdx -o jsonl > out.jsonl  # full enrichment

The phases are:

Phase | Flag | Data scanned | What it adds
1+5 | always | rank table (~2.8 GB) | harmonic rank and value
2 | always | vertex file (~1.1 GB) | vertex ID map (used for degree join)
3 | --degrees | edge files (~7.7 GB) | in-degree, out-degree
4 | --cdx | CDX Parquet (~184 GB) | URL count, status mix, language, bytes

Phases 3 and 4 are opt-in because they require large data scans. Phase 4 requires duckdb on your PATH.
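Conceptually, the opt-in phases act as optional lookups merged into each rank row. The sketch below illustrates that idea only; the field names are hypothetical and do not reflect the actual HostRecord schema, and the zero default for hosts without edges is an assumption.

```python
def enrich(rank_rows, degrees=None, cdx=None):
    """Yield rank rows, merged with degree and CDX data when supplied.

    degrees and cdx are optional {host: {field: value}} lookups.
    """
    for row in rank_rows:
        record = dict(row)
        if degrees is not None:
            # Assumption: a host missing from the edge data gets zero degrees.
            record.update(degrees.get(row["host"], {"in_degree": 0, "out_degree": 0}))
        if cdx is not None:
            record.update(cdx.get(row["host"], {}))
        yield record

# Made-up sample data.
rank_rows = [
    {"host": "example.com", "position": 1},
    {"host": "example.org", "position": 2},
]
degree_map = {"example.com": {"in_degree": 5, "out_degree": 2}}
rank_only = list(enrich(rank_rows))
with_degrees = list(enrich(rank_rows, degrees=degree_map))
```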

Redirect the output to a file, or write it to a database store with --db:

ccrawl host enrich --degrees --cdx -o jsonl > enriched.jsonl
ccrawl host enrich --degrees --cdx --db hosts.db

Picking a web-graph release

The web graph is published a few times a year, separate from the monthly crawls. Pass --graph <release-id> to pin a specific release. Without it, ccrawl resolves the latest available release automatically.

ccrawl host top --graph cc-main-2026-mar-apr-may -n 20

Find current and past releases at commoncrawl.org/web-graphs.