v0.1.0
The first public release of ccrawl: the full command surface, the dataset library, and the WARC/WAT/WET parser packages.
ccrawl is a single pure-Go binary that puts Common
Crawl behind a tool that feels like curl: find a URL in the index, fetch the
exact capture, stream whole archives, run SQL over the columnar index, and look
up domain ranks. It talks to the public data on data.commoncrawl.org over
plain HTTPS, so there are no credentials to set up and nothing to pay for.
What you get
- Find captures. ccrawl search queries the CDX URL index for any URL or path pattern and filters by status, MIME, or language.
- Fetch the exact bytes. ccrawl get and ccrawl fetch pull a single capture with an HTTP byte-range request, so a page comes back without downloading the WARC file it lives in.
- Pull out the content. --text, --markdown, --links, and --headers turn a captured page into the form you actually want.
- Work with whole archives. ccrawl paths, download, parse, and convert list, fetch, decode, and reshape WARC, WAT, and WET files. convert writes columnar Parquet (zstd, dictionary-encoded) or JSONL.
- Query the columnar index. ccrawl table builds the SQL for bulk questions across a crawl and runs it through a local duckdb binary, or prints ready-to-run SQL when DuckDB is not installed.
- Look up ranks. ccrawl rank reads host and domain positions from the web graph tables.
- Scan CC-NEWS. ccrawl news streams the continuous news dataset, which has no index of its own.
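To make the byte-range idea concrete, here is a minimal Go sketch of the kind of request involved. It is not ccrawl's own code: the function name, its parameters, and the example file name are hypothetical, and it assumes an index lookup has already supplied the WARC file name plus the record's offset and length.

```go
package main

import (
	"fmt"
	"net/http"
)

// baseURL is the public Common Crawl host named above.
const baseURL = "https://data.commoncrawl.org/"

// captureRequest builds a GET for a single record inside a WARC file.
// filename, offset, and length are assumed to come from an index lookup;
// the names are illustrative, not ccrawl's own.
func captureRequest(filename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+filename, nil)
	if err != nil {
		return nil, err
	}
	// HTTP byte ranges are inclusive at both ends, hence the -1.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}

func main() {
	// Hypothetical file name and offsets, for illustration only.
	req, err := captureRequest("crawl-data/EXAMPLE/example.warc.gz", 1000, 500)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL.String())
	fmt.Println(req.Header.Get("Range"))
}
```

A server that honors the range answers 206 Partial Content with only those bytes. Common Crawl WARC files are gzip-compressed record by record, so the returned slice can be decompressed on its own.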
Dataset library
For bulk work, --library gives the archive files a home and extends the
commands you already know to list, download, and process them in place. Raw
files land under ~/notes/ccrawl/<crawl>/<kind>/ and processed output beside
them under <crawl>/<format>/<kind>/, so a directory listing tells you exactly
what you have. Re-running download only fetches what is missing, so a corpus
grows a piece at a time. Point it elsewhere with CCRAWL_LIBRARY or
--library-dir. See Bulk and archives for the walkthrough.
Parsers you can import
The archive parsers live in their own packages so you can read Common Crawl files from your own program without pulling in the rest of the tool:
import "github.com/tamnd/ccrawl-cli/pkg/warc"
err := warc.Iterate(r, func(rec warc.Record) error {
	fmt.Println(rec.Header.TargetURI)
	return nil
})
pkg/warc reads WARC records and splits an HTTP block into its headers and
body. pkg/wat decodes WAT metadata (status, title, meta tags, links) and
pkg/wet decodes WET plain text, both on top of pkg/warc. None of them depend
on the ccrawl library or the CLI.
Install
go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@latest
Prebuilt archives for Linux, macOS, Windows, and FreeBSD, plus Linux packages (deb, rpm, apk) and checksums, are on the release page. The container image is on GHCR:
docker run --rm ghcr.io/tamnd/ccrawl:0.1.0 get example.com --text
The binary is pure Go with no runtime dependencies. DuckDB is optional and only used to run the columnar index queries locally; without it, ccrawl prints the SQL for you to run elsewhere.