v0.2.3
Per-URL dataset shards: 20 raw CDX fields per row, no aggregation, WARC byte-range access.
Released 2026-06-18
Per-URL dataset shards
The ccrawl host dataset shard schema changes from one row per host to one row per URL capture.
Each row now carries 20 raw CDX fields joined with host rank signals, with no aggregation.
Users who want per-host statistics can run a single GROUP BY host in DuckDB or pandas after loading a shard.
Because the raw rows are kept, any per-host view can be recomputed from them, whereas a pre-aggregated view cannot be expanded back into per-URL rows.
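As a rough sketch of that GROUP BY in pandas: the snippet below builds a tiny in-memory frame standing in for a loaded shard (the rows are made up for illustration) and derives per-host statistics from the per-URL rows. The output column names `captures`, `total_bytes`, and `unique_digests` are illustrative choices, not part of the shard schema.

```python
import pandas as pd

# Toy stand-in for a loaded shard; a real one would come from pd.read_parquet(...).
df = pd.DataFrame(
    {
        "host": ["example.com", "example.com", "example.org"],
        "url": [
            "https://example.com/",
            "https://example.com/a",
            "https://example.org/",
        ],
        "bytes": [1200, 800, 500],
        "digest": ["AAA", "BBB", "AAA"],
    }
)

# Collapse the per-URL rows back into one row per host.
per_host = (
    df.groupby("host")
    .agg(
        captures=("url", "count"),            # number of URL captures
        total_bytes=("bytes", "sum"),         # sum of the bytes column
        unique_digests=("digest", "nunique"), # distinct content hashes
    )
    .reset_index()
)
print(per_host)
```

Any other per-host statistic can be added as another named aggregation in the same call.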
New fields in every shard row
| Field | Type | Description |
|---|---|---|
| url | string | The full crawled URL |
| surt | string | SURT canonical form (sort key for CDX lookups) |
| tld | string | Effective TLD (e.g., com, co.uk) |
| proto | string | http or https |
| redir | string | Final redirect target URL, empty if none |
| digest | string | SHA1 content hash; use for dedup and change detection |
| mime_d | string | Declared MIME type from Content-Type header |
| charset | string | Character set from Content-Type |
| trunc | string | Truncation reason (bytes, disconnect, etc.) or empty |
| warc_f | string | Relative WARC file path |
| warc_o | int64 | Byte offset into the WARC file |
| robots_ok | bool | robotstxt_forceget: robots.txt permitted the crawl |
| crawl | string | CC crawl ID (e.g., CC-MAIN-2026-21) |
Fields carried over from v0.2.2: host, rd, st, mime, lang, ts, bytes,
harmonic_pos, harmonic_val, pagerank_pos, pagerank_val, graph_id.
WARC byte-range access
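With warc_f and warc_o in every row, callers can fetch the WARC record for any URL capture, and the HTML inside it,
without downloading full WARC files:
```python
import gzip
import requests

row = df[df["url"] == "https://example.com/"].iloc[0]  # df: a shard already loaded into pandas
url = f"https://data.commoncrawl.org/{row['warc_f']}"
end = row["warc_o"] + row["bytes"] - 1  # Range ends are inclusive; bytes is used as the compressed record length
resp = requests.get(url, headers={"Range": f"bytes={row['warc_o']}-{end}"}, timeout=60)
resp.raise_for_status()
# The result is the full WARC record: WARC headers, HTTP headers, then the HTML payload.
record = gzip.decompress(resp.content)
```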
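The bytes decompressed from such a range are a complete WARC response record, not bare HTML: a WARC header block, then the HTTP response headers, then the payload. A minimal sketch for separating the three parts follows, assuming the standard layout in which each header block ends with a blank CRLF line. The helper name `split_warc_record` and the sample record are hypothetical and exist only to illustrate the idea.

```python
def split_warc_record(record: bytes) -> tuple[bytes, bytes, bytes]:
    """Split a decompressed WARC response record into its three parts."""
    sep = b"\r\n\r\n"
    # First blank line ends the WARC headers, the second ends the HTTP headers.
    warc_headers, _, rest = record.partition(sep)
    http_headers, _, body = rest.partition(sep)
    # A record is followed by two CRLFs; drop exactly that trailer if present.
    if body.endswith(sep):
        body = body[: -len(sep)]
    return warc_headers, http_headers, body


# Hand-written sample record, only to exercise the function.
sample = (
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html>hi</html>\r\n\r\n"
)
warc_headers, http_headers, body = split_warc_record(sample)
print(body.decode("utf-8"))
```

For real captures the payload should be decoded with the row's charset value where one is present, and a dedicated WARC parsing library is likely a safer choice than manual splitting.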
Digest-based change detection
With digest and ts in every row, change detection across two crawls is a join:
```python
import pandas as pd

# Load only the columns needed for the comparison.
a = pd.read_parquet("hosts-a-2026-21.parquet", columns=["url", "digest"])
b = pd.read_parquet("hosts-a-2026-17.parquet", columns=["url", "digest"])
# Inner join: only URLs present in both crawls are compared.
merged = a.merge(b, on="url", suffixes=("_new", "_old"))
changed = merged[merged.digest_new != merged.digest_old]
print(f"{len(changed) / len(merged):.1%} of URLs changed")
```
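The join above compares only URLs that appear in both crawls. A possible extension, sketched here on made-up toy frames rather than real shards, uses an outer merge with pandas' `indicator=True` option to separate added, removed, and changed URLs. Because shards hold one row per capture, the sketch also deduplicates on url first; which capture to keep is an assumption left to the caller.

```python
import pandas as pd

# Toy frames standing in for two crawls; real ones would come from pd.read_parquet(...).
new = pd.DataFrame({"url": ["u1", "u2", "u3"], "digest": ["A", "B2", "C"]})
old = pd.DataFrame({"url": ["u1", "u2", "u4"], "digest": ["A", "B", "D"]})

# A URL may be captured more than once per crawl; keep one row per URL.
new = new.drop_duplicates("url")
old = old.drop_duplicates("url")

# The outer join keeps URLs seen in only one crawl; _merge records which side each came from.
merged = new.merge(
    old, on="url", how="outer", suffixes=("_new", "_old"), indicator=True
)

added = merged[merged["_merge"] == "left_only"]     # only in the newer crawl
removed = merged[merged["_merge"] == "right_only"]  # only in the older crawl
both = merged[merged["_merge"] == "both"]
changed = both[both["digest_new"] != both["digest_old"]]

print(len(added), len(removed), len(changed))
```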
Aggregation is opt-in
The CDX aggregate phase (Phase 5 in v0.2.2) is removed from the default pipeline.
The shard build reads cdx-raw-{prefix}.jsonl.gz directly.
Pass --cdx-agg to produce cdx-agg-{prefix}.jsonl.gz per-host summary files as a
side product, which is useful for quick host-level queries without loading the full shards.
Size
Shards are larger because they contain one row per URL (~1.9 B total rows) rather than
one row per host (~262 M total rows).
Estimated compressed size: ~80–120 GB for the full 28-prefix dataset.
The w, s, p prefixes are largest (~5–10 GB each) due to subdomain patterns
(www., static., play., etc.).
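For a sense of scale, the approximate figures above can be turned into averages with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal gigabytes and takes the quoted estimates at face value.

```python
# Figures quoted in this release note (all approximate).
total_rows = 1.9e9   # one row per URL capture
host_rows = 262e6    # one row per host in the previous schema
size_low_gb, size_high_gb = 80, 120

GB = 1e9  # assumption: decimal gigabytes

rows_per_host = total_rows / host_rows
bytes_per_row_low = size_low_gb * GB / total_rows
bytes_per_row_high = size_high_gb * GB / total_rows

print(f"~{rows_per_host:.1f} URL rows per host on average")
print(f"~{bytes_per_row_low:.0f} to {bytes_per_row_high:.0f} compressed bytes per row")
```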