v0.4.0

Reproducible WARC export, Parquet output, multi-crawl selection, search parity flags, and a columnar rename.

v0.4.0 is a usability release. It closes the gaps that turned up after living with the tool: a way to export a query as reproducible WARC files, Parquet as a first-class output format, richer crawl selection, and the flags that bring search level with cdx_toolkit. It also retries transient 403s and makes columnar the primary name of the Parquet index command, matching how the docs already describe it.

New commands

export

ccrawl export runs a query, fetches each matching capture by byte range, and writes the captures into one or more .warc.gz files. Each file opens with a warcinfo record carrying provenance: the tool and version, the prefix, and the exact command line that produced it, so the output is self-describing and reproducible. Files rotate once they pass --size bytes (1 GB by default) and are named <prefix>-NNNNNN.extracted.warc.gz.
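The rotation rule can be pictured with a small Python sketch. It is only an illustration of the naming pattern and the size threshold described here, not ccrawl's code: the starting index of 0, the reading of "pass" as "reach or exceed", and the helper names part_name and plan_rotation are all assumptions, and the warcinfo record is left out of the byte count.

```python
from typing import Iterable, List, Tuple


def part_name(prefix: str, index: int) -> str:
    # Naming pattern from the release notes: <prefix>-NNNNNN.extracted.warc.gz
    # (a six-digit, zero-padded counter; starting at 0 is an assumption).
    return f"{prefix}-{index:06d}.extracted.warc.gz"


def plan_rotation(
    record_sizes: Iterable[int], prefix: str, size: int = 1_000_000_000
) -> List[Tuple[str, int]]:
    """Return (file name, bytes written) pairs for a stream of record sizes.

    A file is closed only after the record that takes it to or past `size`,
    so a file can end up slightly larger than the threshold.
    """
    files: List[Tuple[str, int]] = []
    index, written = 0, 0
    for n in record_sizes:
        written += n
        if written >= size:
            files.append((part_name(prefix, index), written))
            index, written = index + 1, 0
    if written:
        # Flush the last, partially filled file.
        files.append((part_name(prefix, index), written))
    return files


if __name__ == "__main__":
    for name, nbytes in plan_rotation([40, 40, 40, 40, 10], "example", size=100):
        print(name, nbytes)
```

Because whole records are never split, the threshold works as a soft limit in this sketch.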

# export every successful capture under a host
ccrawl export 'example.com/*' --status 200 --prefix example

# rotate every 100 MB, skip robots.txt, stamp a creator
ccrawl export '*.example.com' --size 100000000 --url-fgrepv /robots.txt --creator "me <[email protected]>"

# feed locations from search or the columnar index
ccrawl search example.com --locations | ccrawl export - --prefix example

export takes the same filters as search (--match, --from, --to, --status, --mime, --lang, --filter) plus --url-fgrep and --url-fgrepv. The exported records are the original gzip members fetched by byte range, so they round-trip through ccrawl parse unchanged.
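Why byte-range export round-trips can be shown with Python's standard gzip module. A .warc.gz file is a concatenation of independent gzip members, one per record, so a member copied out by offset and length is still valid on its own and inside a new file. The sketch below is self-contained and does not call ccrawl or the network: slicing a bytes object stands in for a ranged HTTP GET, and the function names are illustrative.

```python
import gzip
from typing import List, Tuple


def build_archive(records: List[bytes]) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Concatenate one gzip member per record.

    Returns the archive bytes and an (offset, length) pair for each member,
    which is the kind of location an index stores for a capture.
    """
    blob = b""
    locations: List[Tuple[int, int]] = []
    for payload in records:
        member = gzip.compress(payload)
        locations.append((len(blob), len(member)))
        blob += member
    return blob, locations


def range_header(offset: int, length: int) -> str:
    # HTTP byte ranges are inclusive at both ends.
    return f"bytes={offset}-{offset + length - 1}"


def read_member(blob: bytes, offset: int, length: int) -> bytes:
    # Slicing stands in for a ranged GET; the slice is a complete gzip member.
    return gzip.decompress(blob[offset:offset + length])


if __name__ == "__main__":
    records = [b"WARC/1.0\r\nfirst record", b"WARC/1.0\r\nsecond record"]
    blob, locations = build_archive(records)
    offset, length = locations[1]
    print(range_header(offset, length))
    print(read_member(blob, offset, length))
```

Writing the fetched members back to back, without recompressing them, yields another valid multi-member gzip file.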

New features

parquet output

Every list command can now write Parquet with -o parquet. Previously only the convert and seed commands could, each through its own hand-rolled writer. The schema is built from the row's columns, every field is an optional UTF-8 string, and the stream is zstd-compressed.

ccrawl search '*.gov/*' --status 200 -o parquet > gov.parquet
ccrawl host top -n 100000 -o parquet > top_hosts.parquet
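Since every field is written as an optional UTF-8 string, numeric columns arrive as text and a consumer has to cast them after loading. A minimal Python sketch of that step, working on rows already read into dictionaries; the column names url, status, and length are assumptions made for the example, and cast_row is a hypothetical helper rather than part of ccrawl.

```python
from typing import Callable, Dict, Optional

Row = Dict[str, Optional[str]]


def cast_row(row: Row, casts: Dict[str, Callable[[str], object]]) -> Dict[str, object]:
    """Apply a per-column cast, leaving nulls and unlisted columns untouched."""
    out: Dict[str, object] = {}
    for name, value in row.items():
        cast = casts.get(name)
        if cast is not None and value is not None:
            out[name] = cast(value)
        else:
            out[name] = value
    return out


if __name__ == "__main__":
    # Hypothetical row: every value is a string or None, as in the Parquet output.
    row = {"url": "https://example.com/", "status": "200", "length": None}
    print(cast_row(row, {"status": int, "length": int}))
```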

richer crawl selection

-c / --crawl now understands more than one crawl at a time:

ccrawl search example.com -c 2024            # every crawl of 2024
ccrawl search example.com -c 3               # the three newest crawls
ccrawl search example.com -c 2024-51,2023-50 # an explicit list
ccrawl search example.com -c all             # across every crawl

A bare year resolves to every crawl of that year, a bare integer to the newest N crawls, a comma-separated list to exactly those crawls (deduplicated), and all to every crawl.
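The resolution rules can be sketched in Python as follows. This is a guess at the logic, not ccrawl's implementation: it assumes a four-digit number is read as a year and any other integer as a count, that short IDs such as 2024-51 expand to CC-MAIN-2024-51, and that the list of available crawls is ordered newest first. The function name resolve_crawls is hypothetical.

```python
import re
from typing import List


def resolve_crawls(selector: str, available: List[str]) -> List[str]:
    """Resolve a -c / --crawl style selector against crawl IDs, newest first."""
    sel = selector.strip()
    if sel == "all":
        return list(available)
    if re.fullmatch(r"[0-9]{4}", sel):
        # Assumption: exactly four digits means a year, not a count.
        return [c for c in available if c.startswith(f"CC-MAIN-{sel}-")]
    if re.fullmatch(r"[0-9]+", sel):
        # Any other bare integer: the newest N crawls.
        return available[: int(sel)]
    resolved: List[str] = []
    for part in sel.split(","):
        part = part.strip()
        if not part:
            continue
        # Assumption: short IDs expand to the CC-MAIN- form.
        crawl_id = part if part.startswith("CC-MAIN-") else f"CC-MAIN-{part}"
        if crawl_id not in resolved:  # deduplicate, keeping first-seen order
            resolved.append(crawl_id)
    return resolved


if __name__ == "__main__":
    crawls = [
        "CC-MAIN-2024-51",
        "CC-MAIN-2024-46",
        "CC-MAIN-2024-42",
        "CC-MAIN-2023-50",
        "CC-MAIN-2023-40",
    ]
    for selector in ("2024", "2", "2024-51,2023-50,2024-51", "all"):
        print(selector, "->", resolve_crawls(selector, crawls))
```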

search parity flags

search gained the flags that bring it level with cdx_toolkit:

--at <date> keeps the capture closest to a moment in time, per URL.
--sort newest|oldest orders the results.
--estimate reports rough page and record counts instead of listing rows, so you can size a query before pulling it.
--url-contains / --url-not-contains filter on a URL substring.
--from / --to bound the capture date range.
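What "closest capture, per URL" means for --at can be shown with a short Python sketch. It assumes captures carry 14-digit YYYYMMDDhhmmss timestamps, as CDX-style indexes do, and that a tie keeps the first capture seen; the tool's accepted date formats and tie-breaking are not stated in these notes, and closest_per_url is a hypothetical name.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

Capture = Tuple[str, str]  # (url, 14-digit timestamp)


def parse_ts(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_per_url(captures: Iterable[Capture], at: str) -> List[Capture]:
    """Keep, for each URL, the capture whose timestamp is nearest to `at`."""
    target = parse_ts(at)
    best: Dict[str, Capture] = {}
    for url, ts in captures:
        current = best.get(url)
        # Strict '<' means an equally distant later capture does not replace
        # the one already kept.
        if current is None or abs(parse_ts(ts) - target) < abs(parse_ts(current[1]) - target):
            best[url] = (url, ts)
    return list(best.values())


if __name__ == "__main__":
    captures = [
        ("https://example.com/a", "20240101000000"),
        ("https://example.com/a", "20240601000000"),
        ("https://example.com/b", "20230101000000"),
    ]
    print(closest_per_url(captures, "20240501000000"))
```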

Improvements

Transient 403 retries: Common Crawl's S3 and CloudFront front ends can return 403 as a throttle or availability signal under load, not only as a hard Forbidden. The client now retries 403 alongside 429 and 5xx with linear backoff, matching what cdx_toolkit does, so a polite bulk run rides out a transient block instead of failing on it.
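The retry policy can be sketched in Python like this. The set of retryable statuses comes from the paragraph above; the attempt limit, the one-second base delay, and the function names are assumptions for illustration, and the real client's values may differ. The sleep function is injectable so the sketch can be exercised without waiting.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, bytes]  # (HTTP status, body)


def is_retryable(status: int) -> bool:
    # 403 is treated as a possible throttle signal, alongside 429 and 5xx.
    return status in (403, 429) or 500 <= status <= 599


def fetch_with_retry(
    fetch: Callable[[], Response],
    max_attempts: int = 5,      # assumed limit
    base_delay: float = 1.0,    # assumed base delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `fetch` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if not is_retryable(status) or attempt == max_attempts:
            return status, body
        # Linear backoff: the wait grows by base_delay with each attempt.
        sleep(base_delay * attempt)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    responses = iter([(403, b""), (503, b""), (200, b"ok")])
    delays = []
    print(fetch_with_retry(lambda: next(responses), sleep=delays.append), delays)
```

If every attempt fails, the last response is returned as-is, so a persistent 403 still surfaces to the caller.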

columnar is the primary command name: the Parquet index command is now ccrawl columnar, which matches how the docs and help text describe it. table and athena remain aliases, so existing scripts keep working.

Scope

Common Crawl is the only data source. The Internet Archive and other web archives are out of scope: ccrawl speaks Common Crawl's layout and conventions on purpose, and a second backend would dilute that focus.

Install

# Homebrew
brew upgrade tamnd/tap/ccrawl

# Go
go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@v0.4.0