v0.2.3

Per-URL dataset shards: 20 raw CDX fields per row, no aggregation, WARC byte-range access.

Released 2026-06-18

Per-URL dataset shards

The ccrawl host dataset shard schema changes from one row per host to one row per URL capture. Each row now carries 20 raw CDX fields joined with host rank signals — no aggregation.

Users who want per-host statistics can run a single GROUP BY host in DuckDB or pandas after loading a shard. The raw rows keep per-URL detail (digest, MIME type, WARC location) that a pre-aggregated view discards.
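
As a rough illustration of that GROUP BY, the sketch below aggregates a tiny hand-made frame by host in pandas. Only the input column names (host, url, bytes, digest, ts) come from the shard schema. The sample rows, the string format used for ts, and the output column names (urls, total_bytes, unique_digests, last_ts) are assumptions made for the example. A real shard would be loaded with pd.read_parquet instead.

```python
import pandas as pd

# Hypothetical sample rows; a real shard would come from pd.read_parquet(...).
df = pd.DataFrame(
    {
        "host": ["example.com", "example.com", "example.org"],
        "url": [
            "https://example.com/",
            "https://example.com/about",
            "https://example.org/",
        ],
        "bytes": [1200, 800, 500],
        "digest": ["AAA", "BBB", "AAA"],
        "ts": ["20260518010101", "20260518020202", "20260519030303"],
    }
)

# Equivalent in spirit to: SELECT host, COUNT(*), SUM(bytes), ... GROUP BY host
per_host = (
    df.groupby("host")
    .agg(
        urls=("url", "count"),
        total_bytes=("bytes", "sum"),
        unique_digests=("digest", "nunique"),
        last_ts=("ts", "max"),
    )
    .reset_index()
)
print(per_host)
```

Each named aggregation maps one output column to an (input column, function) pair, so further per-host statistics can be added without changing the grouping.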

New fields in every shard row

Field	Type	Description
`url`	string	The full crawled URL
`surt`	string	SURT canonical form (sort key for CDX lookups)
`tld`	string	Effective TLD (e.g., `com`, `co.uk`)
`proto`	string	`http` or `https`
`redir`	string	Final redirect target URL, empty if none
`digest`	string	SHA-1 content hash; use for dedup and change detection
`mime_d`	string	Declared MIME type from Content-Type header
`charset`	string	Character set from Content-Type
`trunc`	string	Truncation reason (`bytes`, `disconnect`, etc.) or empty
`warc_f`	string	Relative WARC file path
`warc_o`	int64	Byte offset into the WARC file
`robots_ok`	bool	`robotstxt_forceget` — robots.txt permitted the crawl
`crawl`	string	CC crawl ID (e.g., `CC-MAIN-2026-21`)

Fields carried over from v0.2.2: host, rd, st, mime, lang, ts, bytes, harmonic_pos, harmonic_val, pagerank_pos, pagerank_val, graph_id.

WARC byte-range access

With warc_f and warc_o in every row, plus bytes as the record length, callers can fetch the raw HTML for any URL with an HTTP Range request instead of downloading full WARC files:

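import gzip
import requests

# df is a shard loaded with pandas; `bytes` is assumed to be the compressed WARC record length.
row = df[df["url"] == "https://example.com/"].iloc[0]
url = f"https://data.commoncrawl.org/{row['warc_f']}"
start, end = int(row["warc_o"]), int(row["warc_o"] + row["bytes"] - 1)  # Range end is inclusive
resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
resp.raise_for_status()
# The slice is one gzip member holding WARC headers, HTTP headers, then the body.
html = gzip.decompress(resp.content).split(b"\r\n\r\n", 2)[2]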
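
The byte-range arithmetic and the record layout can be checked offline, without touching the network. The sketch below is a minimal illustration rather than the project's own code: range_header and split_warc_record are hypothetical helper names, and the sample record is synthetic. It assumes the fetched slice is a single gzip member containing a WARC response record (WARC headers, then HTTP headers, then the body). Other record types, which have no HTTP header block, would need different handling.

```python
import gzip


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header; the end position is inclusive."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def split_warc_record(compressed: bytes):
    """Split one gzipped WARC response record into (warc_headers, http_headers, body)."""
    record = gzip.decompress(compressed)
    warc_headers, http_headers, body = record.split(b"\r\n\r\n", 2)
    # A WARC record ends with two CRLFs after the payload; drop that terminator.
    if body.endswith(b"\r\n\r\n"):
        body = body[:-4]
    return warc_headers, http_headers, body


# Synthetic record for demonstration only (not real Common Crawl data).
sample = gzip.compress(
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html>hi</html>\r\n\r\n"
)
warc_h, http_h, body = split_warc_record(sample)
print(range_header(1000, 250))
print(body)
```

For anything beyond a quick look, a dedicated WARC parser is likely a safer choice than splitting on blank lines by hand.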

Digest-based change detection

With digest and ts in every row, change detection across two crawls is a join:

import pandas as pd

a = pd.read_parquet("hosts-a-2026-21.parquet", columns=["url", "digest"])
b = pd.read_parquet("hosts-a-2026-17.parquet", columns=["url", "digest"])
# Inner join: only URLs captured in both crawls are compared.
merged = a.merge(b, on="url", suffixes=("_new", "_old"))
changed = merged[merged.digest_new != merged.digest_old]
print(f"{len(changed) / len(merged):.1%} of URLs changed")
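
An inner join only sees URLs present in both crawls. To also count URLs that appeared or disappeared, one option is an outer join with pandas' indicator column. This is a sketch on made-up rows: the two frames stand in for the shards, and the status labels (added, removed, changed, unchanged) and the classify helper are names chosen for the example. It also keeps one capture per URL, on the assumption that a shard may hold several captures of the same URL.

```python
import pandas as pd

# Hypothetical rows standing in for two crawl shards.
new = pd.DataFrame(
    {"url": ["https://a.example/", "https://b.example/", "https://c.example/"],
     "digest": ["D1", "D2x", "D3"]}
)
old = pd.DataFrame(
    {"url": ["https://a.example/", "https://b.example/", "https://d.example/"],
     "digest": ["D1", "D2", "D4"]}
)

# Keep one capture per URL so the join cannot multiply rows.
new = new.drop_duplicates("url")
old = old.drop_duplicates("url")

merged = new.merge(old, on="url", how="outer", suffixes=("_new", "_old"), indicator=True)


def classify(row) -> str:
    """Label a merged row by which side it came from and whether the digest moved."""
    if row["_merge"] == "left_only":
        return "added"
    if row["_merge"] == "right_only":
        return "removed"
    return "unchanged" if row["digest_new"] == row["digest_old"] else "changed"


merged["status"] = merged.apply(classify, axis=1)
counts = merged["status"].value_counts().to_dict()
print(counts)
```

On full shards, row-wise apply is slow. The same labels can be produced with vectorized comparisons once the logic is settled.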

Aggregation is opt-in

The CDX aggregate phase (Phase 5 in v0.2.2) is removed from the default pipeline. The shard build reads cdx-raw-{prefix}.jsonl.gz directly. Pass --cdx-agg to produce cdx-agg-{prefix}.jsonl.gz per-host summary files as a side product — useful for quick host-level queries without loading the full shards.

Size

Shards are larger because they contain one row per URL (~1.9 B total rows) rather than one row per host (~262 M total rows). Estimated compressed size: ~80–120 GB for the full 28-prefix dataset. The w, s, and p prefixes are the largest (~5–10 GB each) because of common subdomain patterns (www., static., play., etc.).
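
For capacity planning, the quoted totals can be turned into per-host and per-row figures. The calculation below uses only the numbers stated above and assumes decimal gigabytes (1 GB = 1e9 bytes). It works out to roughly 7 URL rows per host and about 42 to 63 compressed bytes per row.

```python
# Figures quoted above; decimal gigabytes (1 GB = 1e9 bytes) are an assumption.
url_rows = 1.9e9
host_rows = 262e6
size_low_gb, size_high_gb = 80, 120

rows_per_host = url_rows / host_rows
bytes_per_row_low = size_low_gb * 1e9 / url_rows
bytes_per_row_high = size_high_gb * 1e9 / url_rows

print(f"~{rows_per_host:.1f} URL rows per host")
print(f"~{bytes_per_row_low:.0f}-{bytes_per_row_high:.0f} compressed bytes per row")
```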