Content signals

Extract text, measure quality, and map outlinks from live pages or stored WARC records.

ccrawl content runs content-analysis operations against live pages or stored WARC/JSONL corpora. These commands are useful for quality filtering before indexing and for building link graphs from your own crawl data.

Extracting content from a live URL

content extract fetches a URL and returns one or more content views:

ccrawl content extract https://example.com --text        # clean readable text
ccrawl content extract https://example.com --markdown    # Markdown with headings/links
ccrawl content extract https://example.com --outlinks    # outbound links only
ccrawl content extract https://example.com --all -o json # all signals in one JSON

Flags:

Flag	Default	Purpose
`--text`	false	Return readable plain text (boilerplate removed)
`--markdown`	false	Return page as Markdown
`--outlinks`	false	Return outbound links as a JSONL list
`--all`	false	Return all signals together
`--crawl`	latest	Use a specific CC crawl instead of fetching live

Without a content flag, content extract returns the raw HTML.

Measuring content quality

content quality computes a set of quality signals for a URL or a JSONL stream:

# single URL
ccrawl content quality https://example.com -o json

# batch from a crawl result
ccrawl crawl fetch seeds.jsonl -o jsonl | ccrawl content quality - -o jsonl

Output fields:

Field	Description
`lang`	Detected language (BCP-47)
`lang_confidence`	Language detection confidence (0–1)
`text_length`	Character count of the extracted text
`word_count`	Word count of the extracted text
`link_density`	Ratio of link text to total text
`boilerplate_ratio`	Estimated fraction of the page that is boilerplate
`readability_score`	Flesch-Kincaid readability estimate

Use these signals to filter low-quality pages before indexing:

ccrawl crawl fetch seeds.jsonl -o jsonl \
  | ccrawl content quality - -o jsonl \
  | jq 'select(.word_count > 200 and .lang == "en")' \
  | ccrawl index build --dir idx/ -

Extracting outlinks

content outlinks focuses on the link graph. It reads a JSONL stream and emits (source_url, target_url, anchor_text) triples:

ccrawl crawl fetch seeds.jsonl -o jsonl | ccrawl content outlinks - -o jsonl > links.jsonl

This is cheaper than running the full quality pipeline when you only need the link graph. Each output row is a JSON object with src, dst, and text fields.

Processing stored WARC files

All three subcommands (extract, quality, outlinks) accept a --warc flag that reads from a local WARC file instead of fetching live:

ccrawl content extract --warc out/worker-0.warc --text
ccrawl content quality --warc out/worker-0.warc -o jsonl
ccrawl content outlinks --warc out/worker-0.warc -o jsonl > links.jsonl

This is the standard way to post-process the output of crawl fetch -f warc.

Language filtering

All three commands pass language metadata through. Filter to a single language at the shell:

ccrawl content quality - -o jsonl | jq 'select(.lang == "vi")'

Or use the --lang flag directly (where supported):

ccrawl content quality --lang en - -o jsonl