ccrawl

v0.2.3

Per-URL dataset shards: 20 raw CDX fields per row, no aggregation, WARC byte-range access.

Released 2026-06-18

Per-URL dataset shards

The ccrawl host dataset shard schema changes from one row per host to one row per URL capture. Each row now carries 20 raw CDX fields joined with host rank signals, with no aggregation applied.

Users who want per-host statistics can run a single GROUP BY host in DuckDB, or a groupby("host") in pandas, after loading a shard. The raw rows carry far more information than any pre-aggregated view could.
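
As a rough sketch of that per-host roll-up in pandas, assuming a shard has already been loaded into a DataFrame with the host, url, bytes, and digest columns (the sample rows and the output column names here are made up for illustration):

```python
import pandas as pd

# Tiny stand-in for a loaded shard; real shards have many more columns.
df = pd.DataFrame(
    {
        "host": ["example.com", "example.com", "example.org"],
        "url": [
            "https://example.com/",
            "https://example.com/about",
            "https://example.org/",
        ],
        "bytes": [1200, 800, 500],
        "digest": ["AAA", "BBB", "AAA"],
    }
)

# One row per host, rebuilt from the per-URL rows.
per_host = (
    df.groupby("host")
    .agg(
        urls=("url", "count"),
        total_bytes=("bytes", "sum"),
        distinct_digests=("digest", "nunique"),
    )
    .reset_index()
)
print(per_host)
```

Which statistics to compute is up to the caller; the point is only that the host-level view can be derived on demand from the per-URL rows.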

New fields in every shard row

Field      Type    Description
url        string  The full crawled URL
surt       string  SURT canonical form (sort key for CDX lookups)
tld        string  Effective TLD (e.g., com, co.uk)
proto      string  http or https
redir      string  Final redirect target URL, empty if none
digest     string  SHA1 content hash; use for dedup and change detection
mime_d     string  Declared MIME type from Content-Type header
charset    string  Character set from Content-Type
trunc      string  Truncation reason (bytes, disconnect, etc.) or empty
warc_f     string  Relative WARC file path
warc_o     int64   Byte offset into the WARC file
robots_ok  bool    robotstxt_forceget: robots.txt permitted the crawl
crawl      string  CC crawl ID (e.g., CC-MAIN-2026-21)

Fields carried over from v0.2.2: host, rd, st, mime, lang, ts, bytes, harmonic_pos, harmonic_val, pagerank_pos, pagerank_val, graph_id.

WARC byte-range access

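With warc_f and warc_o in every row, callers can fetch the stored WARC record for any URL, and the raw HTML inside it, with an HTTP Range request instead of downloading full WARC files. The example uses bytes as the length of the compressed record: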

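import gzip
import requests

# df: a shard already loaded into pandas
row = df[df["url"] == "https://example.com/"].iloc[0]
url = f"https://data.commoncrawl.org/{row['warc_f']}"
start, end = int(row["warc_o"]), int(row["warc_o"]) + int(row["bytes"]) - 1
resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
resp.raise_for_status()
# The range holds one gzipped WARC record: WARC headers, HTTP headers, then the body
record = gzip.decompress(resp.content)
html = record.split(b"\r\n\r\n", 2)[2]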
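
The bytes returned by a range request like the one above decompress to a whole WARC record, not bare HTML: a WARC header block, the HTTP response headers, and then the body. The following is a minimal sketch of separating those parts, assuming a single response record with an HTTP payload. The split_warc_record helper and the synthetic sample record are hypothetical and exist only for illustration.

```python
import gzip


def split_warc_record(raw: bytes) -> tuple[bytes, bytes, bytes]:
    """Split one gzipped WARC response record into its three parts."""
    record = gzip.decompress(raw)
    # Each header block ends with a blank line (CRLF CRLF).
    warc_headers, http_headers, body = record.split(b"\r\n\r\n", 2)
    # A record is terminated by two CRLFs; strip them from the body.
    return warc_headers, http_headers, body.removesuffix(b"\r\n\r\n")


# Synthetic record for illustration only (not real crawl data).
sample = gzip.compress(
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html>hi</html>\r\n\r\n"
)
warc_headers, http_headers, body = split_warc_record(sample)
print(body.decode())
```

For anything beyond a quick look, a dedicated WARC parser such as warcio is likely a safer choice than splitting on blank lines by hand.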

Digest-based change detection

With digest and ts in every row, change detection across two crawls is a join:

import pandas as pd

# Load only the columns needed for the comparison
a = pd.read_parquet("hosts-a-2026-21.parquet", columns=["url", "digest"])
b = pd.read_parquet("hosts-a-2026-17.parquet", columns=["url", "digest"])
# Inner join keeps only URLs captured in both crawls
merged = a.merge(b, on="url", suffixes=("_new", "_old"))
changed = merged[merged.digest_new != merged.digest_old]
print(f"{len(changed) / len(merged):.1%} of URLs changed")
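
The inner join above only sees URLs present in both crawls. A possible extension, sketched here on made-up rows, uses an outer join with pandas' indicator column to also label URLs that were added or removed. It assumes each URL appears at most once per crawl; the classify helper and the status labels are illustrative names, not part of ccrawl.

```python
import pandas as pd

# Made-up rows standing in for two crawls of the same prefix.
new = pd.DataFrame({"url": ["u1", "u2", "u3"], "digest": ["A", "B2", "C"]})
old = pd.DataFrame({"url": ["u1", "u2", "u4"], "digest": ["A", "B", "D"]})

# indicator=True adds a _merge column: left_only, right_only, or both.
merged = new.merge(
    old, on="url", how="outer", suffixes=("_new", "_old"), indicator=True
)


def classify(row) -> str:
    """Label one merged row by comparing presence and digest."""
    if row["_merge"] == "left_only":
        return "added"
    if row["_merge"] == "right_only":
        return "removed"
    return "unchanged" if row["digest_new"] == row["digest_old"] else "changed"


merged["status"] = merged.apply(classify, axis=1)
counts = merged["status"].value_counts().to_dict()
print(counts)
```

If a shard can contain several captures of the same URL, deduplicating on url (for example, keeping the latest ts) before the merge avoids a many-to-many join.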

Aggregation is opt-in

The CDX aggregate phase (Phase 5 in v0.2.2) is removed from the default pipeline. The shard build now reads cdx-raw-{prefix}.jsonl.gz directly. Pass --cdx-agg to also produce per-host summary files (cdx-agg-{prefix}.jsonl.gz) as a side product; these are useful for quick host-level queries without loading the full shards.

Size

Shards are larger because they contain one row per URL (~1.9 B total rows) rather than one row per host (~262 M total rows). Estimated compressed size: ~80–120 GB for the full 28-prefix dataset. The w, s, p prefixes are largest (~5–10 GB each) due to subdomain patterns (www., static., play., etc.).
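
For a sense of scale, the figures above work out to roughly 7 URL rows per host and somewhere around 40 to 65 compressed bytes per row. The back-of-the-envelope calculation below uses only the numbers quoted in this section and assumes GB means 10**9 bytes:

```python
# Figures quoted in the text above; GB is taken as 10**9 bytes (an assumption).
TOTAL_ROWS = 1.9e9
TOTAL_HOSTS = 262e6
SIZE_GB_LOW, SIZE_GB_HIGH = 80, 120

urls_per_host = TOTAL_ROWS / TOTAL_HOSTS
bytes_per_row_low = SIZE_GB_LOW * 1e9 / TOTAL_ROWS
bytes_per_row_high = SIZE_GB_HIGH * 1e9 / TOTAL_ROWS

print(f"~{urls_per_host:.1f} URL rows per host on average")
print(f"~{bytes_per_row_low:.0f}-{bytes_per_row_high:.0f} compressed bytes per row")
```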