v0.2.4
Native HuggingFace publish with hive-partition layout and per-shard commits.
Released 2026-06-18
Native HuggingFace publish
The --upload flag now uses a native Go HF client backed by an embedded Python helper
(hf_commit.py) run via uv.
The old dependency on huggingface-cli being on PATH is removed.
Hive-partition layout
Shards are published under a Hive-partition path that DuckDB understands natively:
data/crawl=CC-MAIN-2026-21/subset=urls/hosts-a.parquet
data/crawl=CC-MAIN-2026-21/subset=urls/hosts-b.parquet
...
DuckDB automatically extracts crawl and subset as virtual columns when using
hive_partitioning=true:
-- Load all shards from one crawl
SELECT host, url, st, digest
FROM read_parquet(
'hf://datasets/open-index/cc-host-dataset/data/crawl=CC-MAIN-2026-21/subset=urls/*.parquet'
)
WHERE st = 200;
-- Multi-crawl: `crawl` and `subset` become columns automatically
SELECT crawl, count(*) AS urls
FROM read_parquet(
'hf://datasets/open-index/cc-host-dataset/data/**/*.parquet',
hive_partitioning = true
)
GROUP BY crawl
ORDER BY crawl DESC;
The subset key reserves space for a future hosts subset containing per-host aggregate
rows, without breaking existing queries.
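As a rough illustration of the layout, the sketch below builds a shard path and then recovers the partition keys from it, which is approximately what a Hive-aware reader does with `key=value` directory segments. The function names `shard_path` and `hive_keys` are hypothetical and are not part of ccrawl or DuckDB.

```python
from pathlib import PurePosixPath


def shard_path(crawl: str, subset: str, prefix: str) -> str:
    """Build the repo-relative path for one shard, following the layout above."""
    return f"data/crawl={crawl}/subset={subset}/hosts-{prefix}.parquet"


def hive_keys(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a shard path."""
    keys = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep:  # only segments containing '=' are partition keys
            keys[key] = value
    return keys


path = shard_path("CC-MAIN-2026-21", "urls", "a")
print(path)
print(hive_keys(path))
```

Because the keys live in directory names rather than in the Parquet files, adding a new `subset=hosts` directory later would not change what existing `subset=urls` paths resolve to.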
Per-shard commits
Each of the 28 prefix shards is committed to HuggingFace as a separate commit immediately
after it finishes building.
This lets the dataset grow incrementally: the repo is usable before all shards complete,
and a restart after a failure picks up after the last committed shard, because a re-run
skips every shard that already has a .done marker.
Commit messages follow the format:
Add crawl=CC-MAIN-2026-21/subset=urls/prefix=a (68M rows)
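The resume behaviour described above can be sketched as a loop that writes a marker only after a shard's commit succeeds. This is a minimal illustration, not the actual ccrawl code: the marker file name, the `build` and `commit` callables, and the function names are all assumptions, and the row count is passed in as a preformatted label.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def commit_message(crawl: str, subset: str, prefix: str, rows: str) -> str:
    """Format a message in the shape shown above; `rows` is a label such as '68M'."""
    return f"Add crawl={crawl}/subset={subset}/prefix={prefix} ({rows} rows)"


def done_marker(out_dir: Path, prefix: str) -> Path:
    # Assumption: one marker file per shard, stored next to the parquet file.
    return out_dir / f"hosts-{prefix}.parquet.done"


def publish_pending(
    out_dir: Path,
    prefixes: Iterable[str],
    build: Callable[[str], Path],
    commit: Callable[[Path, str], None],
) -> list[str]:
    """Build and commit every shard lacking a marker; return the prefixes published."""
    published = []
    for prefix in prefixes:
        marker = done_marker(out_dir, prefix)
        if marker.exists():
            continue  # already committed in an earlier run
        shard = build(prefix)
        commit(shard, prefix)
        marker.touch()  # written only after the commit call returns
        published.append(prefix)
    return published


# Demo with stand-in build/commit steps: the second run skips finished shards.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    commits = []

    def build(prefix: str) -> Path:
        shard = out / f"hosts-{prefix}.parquet"
        shard.write_bytes(b"")  # placeholder for a real parquet file
        return shard

    def commit(shard: Path, prefix: str) -> None:
        commits.append(commit_message("CC-MAIN-2026-21", "urls", prefix, "0"))

    first = publish_pending(out, ["a", "b"], build, commit)
    second = publish_pending(out, ["a", "b", "c"], build, commit)
    print(first, second)
```

Writing the marker after the commit means a crash between the two steps leads to one shard being re-committed on restart rather than being silently skipped.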
New flags
| Flag | Default | Description |
|---|---|---|
| --hf-token | $HUGGINGFACE_TOKEN | HuggingFace write token |
| --hf-private | false | Create the repo as private |
Prerequisites
uv must be installed for the upload path.
Install with curl -LsSf https://astral.sh/uv/install.sh | sh.
The Python dependencies (huggingface_hub, hf-xet) are resolved automatically by uv
on the first run, so no pip install is needed.
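One way uv can resolve dependencies without a separate install step is inline script metadata (PEP 723), which `uv run` reads from a comment header. The sketch below is a hypothetical stand-in for a helper of this kind. Whether hf_commit.py is structured this way is an assumption, as are the function names and argument order. The `huggingface_hub` calls used (`HfApi.create_repo`, `HfApi.create_commit`, `CommitOperationAdd`) are part of that library's public API.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["huggingface_hub", "hf-xet"]
# ///
"""Hypothetical upload helper; not the actual hf_commit.py."""
import os
import sys
from typing import Mapping, Optional


def resolve_token(flag_value: Optional[str], env: Mapping[str, str] = os.environ) -> str:
    """Prefer an explicit --hf-token value, else fall back to $HUGGINGFACE_TOKEN."""
    token = flag_value or env.get("HUGGINGFACE_TOKEN", "")
    if not token:
        raise SystemExit("no token: pass --hf-token or set HUGGINGFACE_TOKEN")
    return token


def commit_shard(repo_id: str, local_path: str, path_in_repo: str,
                 message: str, token: str, private: bool = False) -> None:
    """Upload one file to a dataset repo as a single commit."""
    # Imported here so the module can be loaded without the dependency present.
    from huggingface_hub import CommitOperationAdd, HfApi

    api = HfApi(token=token)
    # exist_ok=True makes repeated runs against an existing repo a no-op here.
    api.create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)
    api.create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=[CommitOperationAdd(path_in_repo=path_in_repo,
                                       path_or_fileobj=local_path)],
        commit_message=message,
    )


# Runs only when invoked with exactly four arguments:
#   uv run helper.py <repo_id> <local_path> <path_in_repo> <message>
if __name__ == "__main__" and len(sys.argv) == 5:
    repo, local, remote, msg = sys.argv[1:5]
    commit_shard(repo, local, remote, msg, resolve_token(None))
```

With a header like this, `uv run` creates an isolated environment containing the listed packages on first use and reuses its cache afterwards.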
Example: full end-to-end run
export HUGGINGFACE_TOKEN=hf_...
# Test with one prefix first
ccrawl host dataset \
--work-dir /data/cc-ds \
--out-dir /tmp/shards \
--prefix a \
--upload \
--hf-repo open-index/cc-host-dataset
# Full 28-prefix run
ccrawl host dataset \
--work-dir /data/cc-ds \
--out-dir /tmp/shards \
--upload \
--hf-repo open-index/cc-host-dataset