Scanning the news
Work with the continuous CC-NEWS dataset, which has no URL index.
CC-NEWS is Common Crawl's continuous news crawl: news articles collected around the clock and published as WARC files, organised by year and month. Unlike the monthly crawls, it has no URL index, so you cannot look a URL up; you stream the files and match as they go by.
Listing files
ccrawl news list --year 2026 --month 5 # the month's WARC files
ccrawl news list --year 2026 --month 5 -n 10 # just the first ten
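How ccrawl assembles this list is not described here. One plausible source is the per-month path listing that Common Crawl publishes alongside the CC-NEWS files. The sketch below assumes that listing lives at crawl-data/CC-NEWS/<year>/<month>/warc.paths.gz on data.commoncrawl.org; month_listing_url and parse_listing are hypothetical helper names for illustration, not part of ccrawl.

```python
import gzip

# Assumption: public HTTP endpoint for Common Crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def month_listing_url(year: int, month: int) -> str:
    """Build the assumed URL of the gzipped path listing for one month."""
    return f"{BASE_URL}crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"


def parse_listing(gz_bytes: bytes, limit=None):
    """Decompress a path listing (one WARC path per line) into a list.

    `limit` keeps only the first few paths, in the spirit of `-n`.
    """
    paths = gzip.decompress(gz_bytes).decode("utf-8").split()
    return paths if limit is None else paths[:limit]
```

Fetching the listing itself is left out so the sketch stays free of network calls; any HTTP client can download the bytes that parse_listing expects.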
Downloading
ccrawl news download --year 2026 --month 5 -n 1 # one news WARC file
The files land under <data-dir>/raw like any other download, and you can
parse them with ccrawl parse exactly as you would a crawl WARC.
Searching by host
Because there is no index, searching means streaming the month's files and
keeping the records whose target host matches. That is slower than an indexed
search, though --workers can parallelise the scan across files. For example:
ccrawl news search bbc.co.uk --year 2026 --month 5 -n 50
This downloads and scans real WARC data, so be patient and use -n to stop
once you have enough. For anything that exists in a monthly crawl, prefer
the indexed search; reach for ccrawl news only when you specifically want
the continuous news feed.
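To make the stream-and-match idea concrete, here is a minimal sketch of scanning one WARC file for a host using only the Python standard library. It is not ccrawl's implementation. All function names are hypothetical. Treating subdomains as matches is an assumption, and limit is only a stand-in for stopping early in the way -n does.

```python
import gzip
from urllib.parse import urlsplit


def host_matches(url: str, wanted: str) -> bool:
    """True if the URL's host is `wanted` or a subdomain of it (assumed rule)."""
    host = (urlsplit(url).hostname or "").lower()
    wanted = wanted.lower()
    return host == wanted or host.endswith("." + wanted)


def iter_warc_headers(stream):
    """Yield the WARC header dict of each record in a decompressed byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        # `line` is the version line (e.g. b"WARC/1.0"); headers follow it.
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # a blank line ends the header block
            name, _, value = raw.partition(b":")
            key = name.decode("utf-8", "replace").lower()
            headers[key] = value.strip().decode("utf-8", "replace")
        # Skip the record body without inspecting it.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def matching_urls(stream, wanted: str, limit: int):
    """Collect target URLs of response records on `wanted`, stopping at `limit`."""
    found = []
    for headers in iter_warc_headers(stream):
        url = headers.get("warc-target-uri", "")
        if headers.get("warc-type") == "response" and host_matches(url, wanted):
            found.append(url)
            if len(found) >= limit:
                break  # stop early once enough matches are in hand
    return found


def scan_file(path: str, wanted: str, limit: int):
    """Scan one .warc.gz file on disk; gzip handles the per-record members."""
    with gzip.open(path, "rb") as stream:
        return matching_urls(stream, wanted, limit)
```

Every record has to be read to be tested, which is why an unindexed search costs so much more than an index lookup, and why stopping at a limit saves real time.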