v0.1.0

The first public release of ccrawl: the full command surface, the dataset library, and the WARC/WAT/WET parser packages.

ccrawl is a single pure-Go binary that puts Common Crawl behind a tool that feels like curl: find a URL in the index, fetch the exact capture, stream whole archives, run SQL over the columnar index, and look up domain ranks. It talks to the public data on data.commoncrawl.org over plain HTTPS, so there are no credentials to set up and nothing to pay for.

What you get

Find captures. ccrawl search queries the CDX URL index for any URL or path pattern and filters by status, MIME, or language.
Fetch the exact bytes. ccrawl get and ccrawl fetch pull a single capture with an HTTP byte-range request, so a page comes back without downloading the WARC file it lives in.
Pull out the content. --text, --markdown, --links, and --headers turn a captured page into the form you actually want.
Work with whole archives. ccrawl paths lists WARC, WAT, and WET files, download fetches them, parse decodes them, and convert reshapes them into columnar Parquet (zstd, dictionary-encoded) or JSONL.
Query the columnar index. ccrawl table builds the SQL for bulk questions across a crawl and runs it through a local duckdb binary, or prints ready-to-run SQL when DuckDB is not installed.
Look up ranks. ccrawl rank reads host and domain positions from the web graph tables.
Scan CC-NEWS. ccrawl news streams the continuous news dataset, which has no index of its own.
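
To make the byte-range idea behind get and fetch concrete, here is a minimal sketch of a ranged HTTP read in Go. It is not ccrawl's own code: rangeHeader, fetchRange, and demo are hypothetical names, and the "archive" is a short fake byte string served by a local test server rather than a real WARC file. It assumes the index supplies an offset and length for the capture.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// rangeHeader builds a Range header value for a capture stored at the
// given byte offset with the given length. HTTP byte ranges are
// inclusive, so the last byte is offset+length-1.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// fetchRange (hypothetical helper) requests only the bytes
// [offset, offset+length) of url instead of the whole file.
func fetchRange(client *http.Client, url string, offset, length int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// A server that honors the range answers 206 Partial Content.
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// demo serves a fake archive from a local test server and pulls out
// one slice of it by offset and length.
func demo() (string, error) {
	archive := []byte("record-one|record-two|record-three")
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// http.ServeContent handles Range requests for us.
		http.ServeContent(w, r, "fake.warc.gz", time.Time{}, bytes.NewReader(archive))
	}))
	defer srv.Close()
	b, err := fetchRange(srv.Client(), srv.URL, 11, 10)
	return string(b), err
}

func main() {
	s, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s) // prints "record-two"
}
```

Against the real data the returned slice would be a compressed record that still needs decoding; the sketch only shows why a single capture costs one small request instead of a full archive download.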

Dataset library

For bulk work, --library gives the archive files a home and extends the commands you already know to list, download, and process them in place. Raw files land under ~/notes/ccrawl/<crawl>/<kind>/ and processed output beside them under <crawl>/<format>/<kind>/, so a directory listing tells you exactly what you have. Re-running download only fetches what is missing, so a corpus grows a piece at a time. Point it elsewhere with CCRAWL_LIBRARY or --library-dir. See Bulk and archives for the walkthrough.

Parsers you can import

The archive parsers live in their own packages so you can read Common Crawl files from your own program without pulling in the rest of the tool:

import "github.com/tamnd/ccrawl-cli/pkg/warc"

err := warc.Iterate(r, func(rec warc.Record) error {
    fmt.Println(rec.Header.TargetURI)
    return nil
})

pkg/warc reads WARC records and splits an HTTP block into its headers and body. pkg/wat decodes WAT metadata (status, title, meta tags, links) and pkg/wet decodes WET plain text, both on top of pkg/warc. None of them depend on the ccrawl library or the CLI.

Install

go install github.com/tamnd/ccrawl-cli/cmd/ccrawl@latest

Prebuilt archives for Linux, macOS, Windows, and FreeBSD, plus Linux packages (deb, rpm, apk) and checksums, are on the release page. The container image is on GHCR:

docker run --rm ghcr.io/tamnd/ccrawl:0.1.0 get example.com --text

The binary is pure Go with no runtime dependencies. DuckDB is optional and only used to run the columnar index queries locally; without it, ccrawl prints the SQL for you to run elsewhere.
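
The optional-DuckDB behavior can be pictured as a PATH lookup with a print fallback. The following is a rough sketch, not ccrawl's implementation: runOrPrint is a hypothetical helper, and invoking the DuckDB CLI as duckdb -c "<sql>" is an assumption about how the query would be passed.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runOrPrint (hypothetical helper) runs sql through the named binary
// when it is found on PATH. When it is not found, it prints the SQL so
// it can be run elsewhere. The bool reports whether the query was run.
func runOrPrint(binary, sql string) (bool, error) {
	path, err := exec.LookPath(binary)
	if err != nil {
		// Binary not installed: fall back to printing the query.
		fmt.Println(sql)
		return false, nil
	}
	// Assumption: the CLI accepts the query via a -c flag.
	cmd := exec.Command(path, "-c", sql)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return true, fmt.Errorf("running %s: %w", binary, err)
	}
	return true, nil
}

func main() {
	ran, err := runOrPrint("duckdb", "SELECT 42 AS answer;")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !ran {
		fmt.Fprintln(os.Stderr, "duckdb not found on PATH; SQL printed above")
	}
}
```

Treating a missing binary as a normal outcome rather than an error is what keeps the query engine optional.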