ccrawl

v0.2.4

Native HuggingFace publish with hive-partition layout and per-shard commits.

Released 2026-06-18

Native HuggingFace publish

The --upload flag now publishes through a native Go HuggingFace client, which calls an embedded Python helper (hf_commit.py) run via uv. huggingface-cli no longer needs to be on PATH.

Hive-partition layout

Shards are published under a Hive-partition path that DuckDB understands natively:

data/crawl=CC-MAIN-2026-21/subset=urls/hosts-a.parquet
data/crawl=CC-MAIN-2026-21/subset=urls/hosts-b.parquet
...
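The layout above can be sketched as a small path builder. This is only an illustration of the naming pattern shown in the listing; shard_path is a hypothetical helper, not part of ccrawl.

```shell
# Hypothetical helper (not a ccrawl command): build the repo-relative
# path of one shard from the crawl ID, subset name, and prefix.
shard_path() {
  # $1 = crawl ID, $2 = subset, $3 = prefix
  printf 'data/crawl=%s/subset=%s/hosts-%s.parquet\n' "$1" "$2" "$3"
}

shard_path CC-MAIN-2026-21 urls a
shard_path CC-MAIN-2026-21 urls b
```

The two calls reproduce the two paths listed above.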

DuckDB automatically extracts crawl and subset as virtual columns when using hive_partitioning=true:

-- Load all shards from one crawl
SELECT host, url, st, digest
FROM read_parquet(
  'hf://datasets/open-index/cc-host-dataset/data/crawl=CC-MAIN-2026-21/subset=urls/*.parquet'
)
WHERE st = 200;

-- Multi-crawl: `crawl` and `subset` become columns automatically
SELECT crawl, count(*) AS urls
FROM read_parquet(
  'hf://datasets/open-index/cc-host-dataset/data/**/*.parquet',
  hive_partitioning = true
)
GROUP BY crawl
ORDER BY crawl DESC;

The subset key reserves space for a future hosts subset containing per-host aggregate rows, without breaking existing queries.
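To see which virtual columns a given path would yield, the key=value directory segments can be listed directly. The sketch below only mimics the naming convention for illustration; it is not how DuckDB parses paths, and partition_keys is a hypothetical helper.

```shell
# Hypothetical helper: print the key=value directory segments of a
# shard path, one per line. The file name has no '=' and is dropped.
partition_keys() {
  printf '%s\n' "$1" | tr '/' '\n' | grep '='
}

partition_keys 'data/crawl=CC-MAIN-2026-21/subset=urls/hosts-a.parquet'
```

Each printed segment corresponds to one column (crawl, subset) that a hive-partitioned read exposes.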

Per-shard commits

Each of the 28 prefix shards is committed to HuggingFace as a separate commit as soon as it finishes building. The dataset therefore grows incrementally: the repo is usable before all shards complete, and a run restarted after a failure resumes after the last committed shard, because re-running skips shards that have a .done marker.
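The skip-on-restart behavior can be sketched as a filter over prefixes. The marker location and file name used here (one "<prefix>.done" file in a marker directory) are assumptions made for illustration; the notes only say that a .done marker exists, and pending_prefixes is a hypothetical helper.

```shell
# Hypothetical helper: print the prefixes that still need building.
# Assumes one "<prefix>.done" file per committed shard in $1; the real
# marker name and location used by ccrawl may differ.
pending_prefixes() {
  marker_dir="$1"
  shift
  for p in "$@"; do
    # A shard with a marker was already committed, so skip it.
    [ -e "$marker_dir/$p.done" ] || printf '%s\n' "$p"
  done
}

pending_prefixes /tmp/shards a b c
```

A restarted run would then only process the prefixes this prints.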

Commit messages follow the format:

Add crawl=CC-MAIN-2026-21/subset=urls/prefix=a (68M rows)
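As an illustration of that format, the message can be assembled from its four parts. commit_message is a hypothetical helper, and the row count is passed in already abbreviated because the notes do not say how ccrawl rounds it.

```shell
# Hypothetical helper: format a per-shard commit message.
# $1 = crawl ID, $2 = subset, $3 = prefix, $4 = abbreviated row count
commit_message() {
  printf 'Add crawl=%s/subset=%s/prefix=%s (%s rows)\n' "$1" "$2" "$3" "$4"
}

commit_message CC-MAIN-2026-21 urls a 68M
```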

New flags

Flag          Default              Description
--hf-token    $HUGGINGFACE_TOKEN   HuggingFace write token
--hf-private  false                Create the repo as private
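Because --hf-token defaults to $HUGGINGFACE_TOKEN, a wrapper script can check that a token is available before starting a long run. The sketch below assumes the usual precedence (an explicit value wins over the environment); resolve_token is a hypothetical helper and does not reflect ccrawl's internal code.

```shell
# Hypothetical pre-flight helper: print the token that would be used.
# $1 = explicit token (may be empty); falls back to $HUGGINGFACE_TOKEN.
resolve_token() {
  token="${1:-${HUGGINGFACE_TOKEN:-}}"
  if [ -z "$token" ]; then
    echo "no HuggingFace token: pass --hf-token or set HUGGINGFACE_TOKEN" >&2
    return 1
  fi
  printf '%s\n' "$token"
}
```

Calling resolve_token "" with the variable unset returns a non-zero status, which lets a script stop before any shard is built.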

Prerequisites

uv must be installed for the upload path. Install it with curl -LsSf https://astral.sh/uv/install.sh | sh. uv resolves the Python dependencies (huggingface_hub, hf-xet) automatically on the first run, so no pip install is needed.
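A script that wraps the upload can verify this prerequisite up front. This is a minimal sketch; require_uv is a hypothetical helper, not something ccrawl provides.

```shell
# Hypothetical pre-flight check: succeed only if uv is on PATH.
require_uv() {
  if command -v uv >/dev/null 2>&1; then
    return 0
  fi
  echo "uv not found on PATH; install it before using --upload" >&2
  return 1
}
```

Guarding the upload command with require_uv makes a missing uv fail immediately rather than after a shard has been built.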

Example: full end-to-end run

export HUGGINGFACE_TOKEN=hf_...

# Test with one prefix first
ccrawl host dataset \
  --work-dir /data/cc-ds \
  --out-dir /tmp/shards \
  --prefix a \
  --upload \
  --hf-repo open-index/cc-host-dataset

# Full 28-prefix run
ccrawl host dataset \
  --work-dir /data/cc-ds \
  --out-dir /tmp/shards \
  --upload \
  --hf-repo open-index/cc-host-dataset