Web crawler

mq-crawler (installed as the `mq-crawl` command) is a web crawler that fetches HTML content from websites, converts it to Markdown, and processes it with mq-lang queries.

Key Features

  • HTML to Markdown conversion: Automatically converts crawled HTML pages to clean Markdown
  • robots.txt compliance: Respects robots.txt rules for ethical web crawling
  • mq-lang integration: Processes content with mq-lang queries for filtering and transformation
  • Configurable crawling: Customizable delays, domain restrictions, and link discovery
  • Flexible output: Save to files or output to stdout
  • Headless Chrome: Built-in headless Chrome for JavaScript-heavy sites (no external server needed)
  • WebDriver support: Browser-based crawling via Selenium WebDriver
  • Domain filtering: Restrict crawling to specific domains
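The robots.txt compliance above follows the standard exclusion protocol. As an illustration of the kind of check involved (this is Python's standard `urllib.robotparser`, not mq-crawl's actual implementation):

```python
from urllib import robotparser

# Parse a robots.txt body the way a polite crawler would.
# Illustrative only: mq-crawl performs this check internally.
rp = robotparser.RobotFileParser()
rp.parse("""User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

print(rp.can_fetch("mq-crawl", "https://example.com/docs/page.html"))  # True
print(rp.can_fetch("mq-crawl", "https://example.com/private/x.html"))  # False
print(rp.crawl_delay("mq-crawl"))  # 2
```

A Disallow rule blocks matching paths, and Crawl-delay suggests the minimum pause between requests (compare the -d/--crawl-delay option below).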

Installation

Quick Install

curl -sSL https://mqlang.org/install_crawler.sh | bash

The installer will:

  • Download the latest mq-crawl binary for your platform
  • Install it to ~/.local/bin/
  • Verify the checksum of the downloaded binary
  • Update your shell profile to add mq-crawl to your PATH

After installation, restart your terminal or source your shell profile, then verify:

mq-crawl --version

Homebrew

brew install harehare/tap/mq-crawl

Cargo

cargo install mq-crawler

Binaries

You can download pre-built binaries from the GitHub releases page.

Usage

mq-crawl [OPTIONS] <URL>

Options

| Option | Description | Default |
|---|---|---|
| `-o, --output <OUTPUT>` | Directory to save markdown files (stdout if not specified) | stdout |
| `-d, --crawl-delay <SECONDS>` | Delay between requests in seconds | 1 |
| `-c, --concurrency <N>` | Number of concurrent workers | 1 |
| `--depth <DEPTH>` | Maximum crawl depth (0 = start URL only) | unlimited |
| `-q, --mq-query <QUERY>` | mq-lang query for processing content | |
| `--robots-path <PATH>` | Custom robots.txt file path | |
| `--allowed-domains <DOMAINS>` | Comma-separated list of extra domains to crawl; the start URL's domain is always included | start domain only |
| `--headless` | Use built-in headless Chrome (Chrome/Chromium must be installed) | |
| `--chrome-path <PATH>` | Path to Chrome/Chromium executable (requires `--headless`) | auto-detect |
| `-U, --webdriver-url <URL>` | External WebDriver URL for browser-based crawling | |
| `--page-load-timeout <SECONDS>` | Timeout for loading a single page | 30 |
| `--script-timeout <SECONDS>` | Timeout for executing scripts on the page | 10 |
| `--implicit-timeout <SECONDS>` | Timeout for element finding | 5 |
| `--extract-scripts-as-code-blocks` | Extract `<script>` tags as code blocks | |
| `--generate-front-matter` | Generate YAML front matter from page metadata | |
| `--use-title-as-h1` | Use the HTML `<title>` as the first H1 heading | |
| `-f, --format <FORMAT>` | Output format: `text` or `json` | `text` |

Examples

# Basic crawling to stdout
mq-crawl https://example.com

# Save to directory with custom delay
mq-crawl -o ./output -d 2 https://example.com

# Limit crawl depth and use concurrent workers
mq-crawl --depth 2 -c 3 https://example.com

# Process with mq-lang query
mq-crawl -q '.h | select(contains("News"))' https://example.com

# Extract code blocks from a docs site
mq-crawl -q '.code' https://docs.example.com
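The --depth option in the examples above bounds how far link discovery goes from the start URL. One way to picture it is a breadth-first traversal that stops expanding links once the depth cap is reached (a schematic sketch, not mq-crawl's code; `links` is a hypothetical page-to-links map):

```python
from collections import deque

def crawl_order(start, links, max_depth):
    """Schematic depth-limited BFS: returns URLs in visit order.
    `links` maps a URL to the URLs discovered on that page."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth cap reached: visit but do not expand
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

site = {"/": ["/a", "/b"], "/a": ["/a1"], "/b": []}
print(crawl_order("/", site, 1))  # ['/', '/a', '/b'] -- /a1 is beyond depth 1
print(crawl_order("/", site, 0))  # ['/'] -- depth 0 means start URL only
```

This matches the documented semantics that a depth of 0 fetches only the start URL.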

Domain Filtering

By default, only the start URL’s domain is crawled. Use --allowed-domains to include additional domains:

# Also crawl docs.example.com and blog.example.com
# The start URL's domain is always included automatically
mq-crawl --allowed-domains docs.example.com,blog.example.com https://example.com
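Conceptually, the filter is a host check against the start URL's domain plus any extra domains. A rough Python sketch (the `is_allowed` helper is hypothetical, for illustration only):

```python
from urllib.parse import urlparse

def is_allowed(url: str, start_url: str, extra_domains: set) -> bool:
    """Illustrative domain filter: allow the start URL's host
    plus any explicitly listed extra domains."""
    allowed = {urlparse(start_url).hostname} | extra_domains
    return urlparse(url).hostname in allowed

start = "https://example.com"
extra = {"docs.example.com", "blog.example.com"}
print(is_allowed("https://docs.example.com/guide", start, extra))  # True
print(is_allowed("https://other.com/page", start, extra))          # False
```

Links pointing outside the allowed set are discovered but not followed.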

Headless Chrome

For JavaScript-heavy sites, use the built-in headless Chrome without an external server:

# Use built-in headless Chrome (Chrome or Chromium must be installed)
mq-crawl --headless https://spa-example.com

# Specify a custom Chrome/Chromium executable path
mq-crawl --headless --chrome-path /usr/bin/chromium https://spa-example.com

WebDriver

Alternatively, use an external Selenium WebDriver server:

# Start Selenium server first
# docker run -d -p 4444:4444 selenium/standalone-chrome

# Crawl with WebDriver
mq-crawl -U http://localhost:4444 https://spa-example.com

# Custom timeouts
mq-crawl -U http://localhost:4444 \
  --page-load-timeout 60 \
  --script-timeout 30 \
  --implicit-timeout 10 \
  https://spa-example.com

HTML to Markdown Options

# Generate YAML front matter with metadata
mq-crawl --generate-front-matter https://example.com

# Use page title as H1 heading
mq-crawl --use-title-as-h1 https://example.com

# Extract <script> tags as code blocks
mq-crawl --extract-scripts-as-code-blocks https://example.com

# Combine options
mq-crawl --generate-front-matter --use-title-as-h1 -o ./docs https://example.com
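With --generate-front-matter, page metadata is emitted as a YAML block ahead of the converted Markdown body. A sketch of the general shape (the field names here are hypothetical; check your actual output for the exact keys mq-crawl emits):

```python
def with_front_matter(title: str, url: str, body_md: str) -> str:
    """Illustrative only: prepend a YAML front matter block to
    converted Markdown. Actual mq-crawl field names may differ."""
    front = "\n".join([
        "---",
        f"title: {title}",   # hypothetical key
        f"source: {url}",    # hypothetical key
        "---",
        "",
    ])
    return front + body_md

doc = with_front_matter("Example Domain", "https://example.com", "# Example Domain\n")
print(doc)
```

Front matter in this form is understood by static-site generators such as Jekyll and Hugo, which is the usual reason to enable it.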

Output Formats

# Output as JSON
mq-crawl --format json https://example.com

# Output as plain text (default)
mq-crawl --format text https://example.com