ccrawl

v0.3.0

Markdown pipeline: CC WARCs to Markdown Parquet on HuggingFace, with parallel export and live refetch.

v0.3.0 adds the Markdown dataset pipeline. Two new subcommands under ccrawl markdown turn Common Crawl (CC) WARC archives into cleaned Markdown Parquet files and upload them to a HuggingFace dataset repo. The parallel orchestrator keeps several shards in flight at once, so the pipeline saturates the NIC instead of waiting on one download at a time.

New commands

markdown export

ccrawl markdown export downloads CC WARC shards, converts HTML to Markdown with h2m (go-trafilatura + GFM renderer), and writes each shard to a zstd-compressed Parquet file. After each batch it commits to a HuggingFace dataset repo. A ledger records committed shards so a killed run resumes where it stopped.

# single shard
ccrawl markdown export --shards 0 --repo open-index/open-markdown-v2

# range with parallel downloads
ccrawl markdown export --shards 0-99 --parallel 4 --commit-batch 10

# resume a killed run (ledger auto-detected)
ccrawl markdown export --shards 0-9999 --parallel 3

Flags: --shards, --parallel, --workers, --commit-batch, --repo, --out, --push, --keep-parquet, --min-free-gb, --ledger, --skip-errors, --limit.

Output schema (open-markdown-v2):

| field | type | notes |
|---|---|---|
| doc_id | string | SHA-256 of URL (16 bytes hex) |
| url | string | original page URL |
| host | string | hostname |
| crawl_date | string | WARC-Date YYYY-MM-DD |
| warc_record_id | string | WARC record ID |
| html_length | int64 | raw HTML bytes |
| markdown_length | int64 | converted Markdown bytes |
| markdown | string | converted Markdown text |
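
As a rough illustration of the schema, the row could be modelled in Go like this. The struct name, the `parquet` tag format, and the helper functions are assumptions made for the sketch, not ccrawl's actual code. In particular, "16 bytes hex" is read here as the first 16 bytes of the SHA-256 digest, hex-encoded to 32 characters, and WARC-Date is assumed to parse as RFC 3339.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"time"
)

// Record mirrors the open-markdown-v2 columns.
// The tag format is an assumption borrowed from common parquet-go usage.
type Record struct {
	DocID          string `parquet:"doc_id"`
	URL            string `parquet:"url"`
	Host           string `parquet:"host"`
	CrawlDate      string `parquet:"crawl_date"`
	WARCRecordID   string `parquet:"warc_record_id"`
	HTMLLength     int64  `parquet:"html_length"`
	MarkdownLength int64  `parquet:"markdown_length"`
	Markdown       string `parquet:"markdown"`
}

// docID hashes the URL with SHA-256 and hex-encodes the first 16 bytes
// (an assumed reading of "16 bytes hex"), giving a 32-character string.
func docID(rawURL string) string {
	sum := sha256.Sum256([]byte(rawURL))
	return hex.EncodeToString(sum[:16])
}

// newRecord (hypothetical helper) fills the derived columns from raw inputs.
func newRecord(rawURL, warcDate, recordID, html, markdown string) (Record, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return Record{}, err
	}
	t, err := time.Parse(time.RFC3339, warcDate)
	if err != nil {
		return Record{}, err
	}
	return Record{
		DocID:          docID(rawURL),
		URL:            rawURL,
		Host:           u.Hostname(),
		CrawlDate:      t.Format("2006-01-02"),
		WARCRecordID:   recordID,
		HTMLLength:     int64(len(html)),
		MarkdownLength: int64(len(markdown)),
		Markdown:       markdown,
	}, nil
}

func main() {
	r, err := newRecord("https://example.com/page", "2024-05-01T12:34:56Z",
		"<urn:uuid:0000>", "<p>hi</p>", "hi")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", r)
}
```

Lengths are byte counts of the strings, which matches the "bytes" wording in the notes column.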

markdown refetch

ccrawl markdown refetch takes the URLs from a CC WARC shard, re-fetches each page live via the ami fetch library, converts HTML to Markdown, and writes a Parquet file with the same schema plus fresh fetch metadata.

# refetch one shard live and push
ccrawl markdown refetch --shards 0 --repo open-index/open-markdown-refetch

# run multiple shards in parallel
ccrawl markdown refetch --shards 0-49 --parallel 3 --fetch-workers 64

Additional flags: --fetch-workers (concurrent HTTP connections per shard), --fetch-timeout, --user-agent.

Improvements

Parallel shard orchestrator (--parallel N): shards download and convert concurrently. Each shard spawns its own goroutine pool for HTML→Markdown conversion while a single background committer batches finished Parquet files into one HuggingFace commit. The commit round-trip is off the per-shard critical path.

Disk-pressure guard (--min-free-gb): the downloader pauses when free disk drops below the threshold and resumes once space is available.

Ledger-based resume: ~/<out>/.committed records which shard indices have been committed. Restarting with the same flags skips already-committed shards automatically.

Bufio flush bug fixed: both Parquet writers previously wrapped the output file in a bufio.Writer but never flushed it before closing the file. The Parquet footer, written during Close(), stayed in the buffer and was silently lost, producing valid-looking but empty Parquet files. The bufio layer has been removed; parquet-go now writes directly to the file.

Install

# Homebrew
brew upgrade tamnd/tap/ccrawl

# Go
go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@v0.3.0