v0.3.0
Markdown pipeline: CC WARCs to Markdown Parquet on HuggingFace, with parallel export and live refetch.
v0.3.0 adds the Markdown dataset pipeline.
Two new subcommands under ccrawl markdown turn CC WARC archives into cleaned Markdown Parquet files and upload them to a HuggingFace dataset repo.
The parallel orchestrator keeps several shards in flight at once so the pipeline saturates the NIC rather than waiting on one download at a time.
New commands
markdown export
ccrawl markdown export downloads CC WARC shards, converts HTML to Markdown with h2m (go-trafilatura + GFM renderer), and writes each shard to a zstd-compressed Parquet file.
After each batch it commits to a HuggingFace dataset repo.
A ledger records committed shards so a killed run resumes where it stopped.
# single shard
ccrawl markdown export --shards 0 --repo open-index/open-markdown-v2
# range with parallel downloads
ccrawl markdown export --shards 0-99 --parallel 4 --commit-batch 10
# resume a killed run (ledger auto-detected)
ccrawl markdown export --shards 0-9999 --parallel 3
Flags: --shards, --parallel, --workers, --commit-batch, --repo, --out, --push, --keep-parquet, --min-free-gb, --ledger, --skip-errors, --limit.
Output schema (open-markdown-v2):
| field | type | notes |
|---|---|---|
| doc_id | string | SHA-256 of URL, truncated (16 bytes hex) |
| url | string | original page URL |
| host | string | hostname |
| crawl_date | string | WARC-Date YYYY-MM-DD |
| warc_record_id | string | WARC record ID |
| html_length | int64 | raw HTML bytes |
| markdown_length | int64 | converted Markdown bytes |
| markdown | string | converted Markdown text |
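For readers consuming the dataset from Go, the columns above can be mirrored by a plain struct. This is only a sketch: the names `Doc`, `docID`, and `newDoc` are hypothetical, and it assumes "16 bytes hex" means the first 16 bytes of the SHA-256 digest (32 hex characters), which the release does not spell out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
)

// Doc mirrors the open-markdown-v2 columns. The Go names are illustrative;
// only the column names in the trailing comments come from the schema.
type Doc struct {
	DocID          string // doc_id
	URL            string // url
	Host           string // host
	CrawlDate      string // crawl_date, YYYY-MM-DD
	WARCRecordID   string // warc_record_id
	HTMLLength     int64  // html_length
	MarkdownLength int64  // markdown_length
	Markdown       string // markdown
}

// docID hashes the URL with SHA-256 and hex-encodes the first 16 bytes.
// Assumption: "16 bytes hex" refers to a 16-byte prefix of the digest.
func docID(rawURL string) string {
	sum := sha256.Sum256([]byte(rawURL))
	return hex.EncodeToString(sum[:16])
}

// newDoc fills the derived columns from a URL and its HTML and Markdown bodies.
func newDoc(rawURL, crawlDate, recordID, html, markdown string) (Doc, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return Doc{}, err
	}
	return Doc{
		DocID:          docID(rawURL),
		URL:            rawURL,
		Host:           u.Hostname(),
		CrawlDate:      crawlDate,
		WARCRecordID:   recordID,
		HTMLLength:     int64(len(html)),
		MarkdownLength: int64(len(markdown)),
		Markdown:       markdown,
	}, nil
}

func main() {
	d, err := newDoc("https://example.com/a", "2024-01-01", "<urn:uuid:example>", "<p>hi</p>", "hi")
	if err != nil {
		panic(err)
	}
	fmt.Println(d.DocID, d.Host, d.HTMLLength, d.MarkdownLength)
}
```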
markdown refetch
ccrawl markdown refetch takes the URLs from a CC WARC shard, re-fetches each page live via the ami fetch library, converts HTML to Markdown, and writes a Parquet file with the same schema plus fresh fetch metadata.
# refetch one shard live and push
ccrawl markdown refetch --shards 0 --repo open-index/open-markdown-refetch
# run multiple shards in parallel
ccrawl markdown refetch --shards 0-49 --parallel 3 --fetch-workers 64
Additional flags: --fetch-workers (concurrent HTTP connections per shard), --fetch-timeout, --user-agent.
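The interplay of --fetch-workers, --fetch-timeout, and --user-agent can be pictured as a bounded worker pool. The sketch below uses the standard net/http package in place of the ami fetch library, whose API is not described here; `fetchAll`, `fetchOne`, `result`, and `demo` are hypothetical names, and the demo talks to a local test server rather than the live web.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// result holds the outcome of one fetch.
type result struct {
	URL    string
	Status int
	Body   []byte
	Err    error
}

// fetchOne performs a single GET with the given User-Agent header.
func fetchOne(client *http.Client, target, userAgent string) result {
	r := result{URL: target}
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		r.Err = err
		return r
	}
	req.Header.Set("User-Agent", userAgent)
	resp, err := client.Do(req)
	if err != nil {
		r.Err = err
		return r
	}
	defer resp.Body.Close()
	r.Status = resp.StatusCode
	r.Body, r.Err = io.ReadAll(resp.Body)
	return r
}

// fetchAll fetches urls with at most `workers` requests in flight.
// Results keep the order of the input slice.
func fetchAll(urls []string, workers int, timeout time.Duration, userAgent string) []result {
	if workers < 1 {
		workers = 1
	}
	client := &http.Client{Timeout: timeout}
	results := make([]result, len(urls))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each worker writes to a distinct index, so no lock is needed.
				results[i] = fetchOne(client, urls[i], userAgent)
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

// demo serves pages from a local test server and fetches three of them.
func demo() []result {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "<p>%s</p>", r.UserAgent())
	}))
	defer srv.Close()
	urls := []string{srv.URL + "/a", srv.URL + "/b", srv.URL + "/c"}
	return fetchAll(urls, 2, 5*time.Second, "example-bot/0.1")
}

func main() {
	for _, r := range demo() {
		fmt.Println(r.Status, string(r.Body), r.Err)
	}
}
```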
Improvements
Parallel shard orchestrator (--parallel N): shards download and convert concurrently.
Each shard spawns its own goroutine pool for HTML→Markdown conversion while a single background committer batches finished Parquet files into one HuggingFace commit.
The commit round-trip is off the per-shard critical path.
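The shape of this orchestrator can be sketched with a semaphore and one committer goroutine. This is an illustration under assumptions, not the project's code: `processShard` and `run` are hypothetical, the shard work is a stub, and appending a batch to a slice stands in for a HuggingFace commit.

```go
package main

import (
	"fmt"
	"sync"
)

// processShard stands in for download, conversion, and the Parquet write.
// It is a placeholder that only returns the output file name.
func processShard(idx int) string {
	return fmt.Sprintf("shard-%05d.parquet", idx)
}

// run keeps at most `parallel` shards in flight and groups finished files
// into batches of `commitBatch`. It returns the batches in commit order.
func run(shards []int, parallel, commitBatch int) [][]string {
	if parallel < 1 {
		parallel = 1
	}
	if commitBatch < 1 {
		commitBatch = 1
	}
	done := make(chan string)
	var commits [][]string

	// Single background committer: shard workers hand off a file name
	// and move on without waiting for the commit itself.
	var committer sync.WaitGroup
	committer.Add(1)
	go func() {
		defer committer.Done()
		var batch []string
		for f := range done {
			batch = append(batch, f)
			if len(batch) == commitBatch {
				commits = append(commits, batch) // stand-in for one commit
				batch = nil
			}
		}
		if len(batch) > 0 {
			commits = append(commits, batch) // final partial batch
		}
	}()

	sem := make(chan struct{}, parallel) // bounds shards in flight
	var workers sync.WaitGroup
	for _, idx := range shards {
		sem <- struct{}{}
		workers.Add(1)
		go func(idx int) {
			defer workers.Done()
			defer func() { <-sem }()
			done <- processShard(idx)
		}(idx)
	}
	workers.Wait()
	close(done)
	committer.Wait()
	return commits
}

func main() {
	commits := run([]int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, 3, 4)
	for i, c := range commits {
		fmt.Printf("commit %d: %d files\n", i, len(c))
	}
}
```

With ten shards and a batch size of four, the committer produces three commits of 4, 4, and 2 files; the order of files within a batch depends on which shards finish first.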
Disk-pressure guard (--min-free-gb): the downloader pauses when free disk drops below the threshold and resumes once space is available.
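A guard of this kind amounts to a polling loop in front of each download. The following is a minimal sketch, assuming the free-space reading comes from an injected probe function; `waitForSpace` and `demo` are hypothetical names, and the demo replays canned readings instead of querying a real filesystem.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForSpace blocks until freeGB reports at least minFreeGB, polling at
// the given interval. It returns early if ctx is cancelled or the probe fails.
func waitForSpace(ctx context.Context, freeGB func() (float64, error), minFreeGB float64, poll time.Duration) error {
	for {
		free, err := freeGB()
		if err != nil {
			return err
		}
		if free >= minFreeGB {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(poll):
		}
	}
}

// demo replays a non-empty list of free-space readings (in GB) and returns
// how many probes ran before the guard released.
func demo(readings []float64, minFreeGB float64) (int, error) {
	calls := 0
	probe := func() (float64, error) {
		i := calls
		if i >= len(readings) {
			i = len(readings) - 1 // repeat the last reading
		}
		calls++
		return readings[i], nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForSpace(ctx, probe, minFreeGB, time.Millisecond)
	return calls, err
}

func main() {
	n, err := demo([]float64{2, 3, 12}, 10)
	fmt.Println(n, err)
}
```

On Linux or macOS, a real probe could be built on `syscall.Statfs`; injecting it as a function keeps the loop portable and easy to test.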
Ledger-based resume: ~/<out>/.committed records which shard indices have been committed.
Restarting with the same flags skips already-committed shards automatically.
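The resume logic can be sketched as an append-only file of shard indices. The on-disk format is not documented in these notes, so the one-index-per-line layout below is an assumption, as are the helper names `loadLedger`, `markCommitted`, `pending`, and `demo`.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// loadLedger reads one shard index per line (assumed format).
// A missing file means nothing has been committed yet.
func loadLedger(path string) (map[int]bool, error) {
	done := make(map[int]bool)
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return done, nil
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		idx, err := strconv.Atoi(line)
		if err != nil {
			return nil, fmt.Errorf("bad ledger line %q: %w", line, err)
		}
		done[idx] = true
	}
	return done, sc.Err()
}

// markCommitted appends shard indices once their batch has been committed.
func markCommitted(path string, shards ...int) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	for _, s := range shards {
		if _, err := fmt.Fprintln(f, s); err != nil {
			f.Close()
			return err
		}
	}
	return f.Close()
}

// pending returns the shards in [first, last] that are not in the ledger.
func pending(first, last int, done map[int]bool) []int {
	var todo []int
	for i := first; i <= last; i++ {
		if !done[i] {
			todo = append(todo, i)
		}
	}
	return todo
}

// demo commits shards 0 and 1 in a temp directory, then simulates a restart.
func demo() ([]int, error) {
	dir, err := os.MkdirTemp("", "ledger-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	ledger := filepath.Join(dir, ".committed")
	if err := markCommitted(ledger, 0, 1); err != nil {
		return nil, err
	}
	done, err := loadLedger(ledger)
	if err != nil {
		return nil, err
	}
	return pending(0, 4, done), nil
}

func main() {
	todo, err := demo()
	fmt.Println(todo, err)
}
```

Writing to the ledger only after a commit succeeds means a crash can at worst repeat a batch, never skip one.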
Bufio flush bug fixed: both Parquet writers previously wrapped the output file in a bufio.Writer but never flushed before closing.
The Parquet footer (written during Close()) was silently lost, producing valid-looking but empty Parquet files.
The bufio layer has been removed; parquet-go writes directly to the file.
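The failure mode is easy to reproduce in isolation. The sketch below is not the project's code: `writeFooter` is a hypothetical helper that writes the four-byte Parquet magic `PAR1` through a `bufio.Writer` into an in-memory sink, standing in for the footer written during Close().

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// writeFooter writes a small trailer through a bufio.Writer and reports how
// many bytes actually reached the underlying sink.
func writeFooter(flush bool) int {
	var sink bytes.Buffer
	w := bufio.NewWriter(&sink)
	w.WriteString("PAR1") // stand-in for the footer written at Close()
	if flush {
		w.Flush()
	}
	// Without Flush, the bytes are still sitting in bufio's internal buffer.
	return sink.Len()
}

func main() {
	fmt.Println("without flush:", writeFooter(false)) // prints "without flush: 0"
	fmt.Println("with flush:", writeFooter(true))     // prints "with flush: 4"
}
```

Closing the underlying file does not flush a `bufio.Writer` wrapped around it, which is why removing the buffering layer (or flushing before close) avoids the problem.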
Install
# Homebrew
brew upgrade tamnd/tap/ccrawl
# Go
go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@v0.3.0