Finding pages
Query the URL index for captures of a URL or a path pattern, and filter the results.
ccrawl search queries the URL index (the CDX server) for captures of a URL.
This is how you find what Common Crawl saw, and where each capture lives, before
you fetch anything.
A single URL
ccrawl search example.com
Each row is one capture. The default output adapts to where it is going: an
aligned table when you are looking at a terminal, JSONL when the output is
piped. Force a specific format with -o:
ccrawl search example.com -o table # columns for reading
ccrawl search example.com -o jsonl # one JSON object per line
ccrawl search example.com -o json # a single JSON array
ccrawl search example.com -o csv # spreadsheet friendly
ccrawl search example.com -o url # just the URL column
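Because JSONL puts one JSON object on each line, piped output can be consumed one row at a time. The following is a minimal Python sketch of that, not part of ccrawl itself. It assumes each row exposes `url` and `status` keys; the exact key names and value types in real output are an assumption here, and the sample rows are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def read_captures(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample rows; real output may carry more fields.
sample = [
    '{"url": "https://example.com/", "status": "200"}',
    '',
    '{"url": "https://example.com/old", "status": "301"}',
]

captures = list(read_captures(sample))
# Compare as strings so the check works whether status is "200" or 200.
ok_urls = [c["url"] for c in captures if str(c["status"]) == "200"]
print(ok_urls)
```

In a pipeline, `sys.stdin` would take the place of `sample`, since a file object also iterates line by line.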
Path and host patterns
A trailing /* matches everything under a path. This is the fastest way to
enumerate a site as Common Crawl indexed it:
ccrawl search 'example.com/*' # every capture under the host
ccrawl search 'example.com/blog/*' -o url # every URL under /blog
Filtering
Narrow the matches by capture field, such as MIME type or HTTP status:
ccrawl search 'example.com/*' --mime application/pdf # only PDFs
ccrawl search 'example.com/*' --status 200 # only successful fetches
Choosing a crawl
search runs against the latest crawl unless you say otherwise. -c takes a
full ID, a year (resolved to that year's newest crawl), latest, or all:
ccrawl search example.com -c 2024-51 # one specific crawl
ccrawl search example.com -c 2024 # the newest 2024 crawl
ccrawl search example.com -c all # across every crawl
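The selector rules above can be sketched in Python to make the resolution order concrete. This is an illustration of the described behavior, not ccrawl's actual implementation. It assumes crawl IDs follow the `CC-MAIN-YYYY-WW` form and that a short `YYYY-WW` value stands for the matching full ID; `resolve_crawl` and the `known` list are hypothetical names and sample data.

```python
import re
from typing import List

CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def resolve_crawl(selector: str, known: List[str]) -> List[str]:
    """Map a -c style selector to crawl IDs (illustrative logic only)."""
    # Fixed-width IDs sort chronologically when sorted as strings.
    ids = sorted(i for i in known if CRAWL_ID.match(i))
    if selector == "all":
        return ids
    if selector == "latest":
        return ids[-1:]
    if re.fullmatch(r"\d{4}", selector):
        # A bare year resolves to that year's newest crawl.
        in_year = [i for i in ids if CRAWL_ID.match(i).group(1) == selector]
        return in_year[-1:]
    # Assumption: accept either the full ID or the short YYYY-WW form.
    full = selector if selector.startswith("CC-MAIN-") else f"CC-MAIN-{selector}"
    return [full] if full in ids else []


# Sample list for illustration; a real tool would load the current crawl list.
known = ["CC-MAIN-2024-46", "CC-MAIN-2025-05", "CC-MAIN-2024-51", "CC-MAIN-2023-50"]

print(resolve_crawl("2024", known))     # ['CC-MAIN-2024-51']
print(resolve_crawl("latest", known))   # ['CC-MAIN-2025-05']
print(resolve_crawl("2024-51", known))  # ['CC-MAIN-2024-51']
```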
Shaping the rows
Keep only the columns you care about, or template each row into whatever shape you need downstream:
ccrawl search example.com --fields url,status,length
ccrawl search example.com --template '{{.URL}} {{.Status}}'
--limit (or -n) caps the number of results; 0 means unlimited.
From a match to the bytes
The point of finding a capture is usually to read it. The url, filename,
offset, and length on each row are exactly what the fetcher needs, so
search composes straight into fetch:
ccrawl search 'example.com/*' --mime application/pdf -o jsonl \
| ccrawl fetch - --dir --out-dir pdfs/
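To show why filename, offset, and length are enough to locate a capture, here is a rough Python sketch of turning one row into an HTTP range request against Common Crawl's public data host. It describes the general mechanism, not necessarily what ccrawl fetch does internally. The JSON key names mirror the fields named above but are an assumption, `range_request` is a hypothetical helper, and the sample filename is a placeholder rather than a real WARC path.

```python
from typing import Dict, Tuple

# Public HTTPS endpoint for Common Crawl data files.
BASE = "https://data.commoncrawl.org/"


def range_request(row: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for the byte range holding one capture."""
    offset = int(row["offset"])
    length = int(row["length"])
    # HTTP byte ranges are inclusive on both ends.
    last = offset + length - 1
    return BASE + row["filename"], {"Range": f"bytes={offset}-{last}"}


# Hypothetical row; the filename is a placeholder, not a real WARC path.
row = {
    "url": "https://example.com/report.pdf",
    "filename": "crawl-data/EXAMPLE/segments/0/warc/example.warc.gz",
    "offset": "1000",
    "length": "250",
}
url, headers = range_request(row)
print(url)
print(headers["Range"])  # bytes=1000-1249
```

The bytes returned for such a range form a standalone gzip member holding the WARC record for that capture, so they can be decompressed without downloading the rest of the file.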
For the same question asked across a whole crawl at once, the columnar index is faster and cheaper than the CDX server. See the columnar index.