Web crawler
mq-crawler is a web crawler that fetches HTML content from websites, converts it to Markdown format, and processes it with mq queries.
Key Features
- HTML to Markdown conversion: Automatically converts crawled HTML pages to clean Markdown
- robots.txt compliance: Respects robots.txt rules for ethical web crawling
- mq-lang integration: Processes content with mq-lang queries for filtering and transformation
- Configurable crawling: Customizable delays, domain restrictions, and link discovery
- Flexible output: Save to files or output to stdout
- Headless Chrome: Built-in headless Chrome for JavaScript-heavy sites (no external server needed)
- WebDriver support: Browser-based crawling via Selenium WebDriver
- Domain filtering: Restrict crawling to specific domains
Installation
Quick Install
curl -sSL https://mqlang.org/install_crawler.sh | bash
The installer will:
- Download the latest
mq-crawlbinary for your platform - Install it to
~/.local/bin/ - Verify the checksum of the downloaded binary
- Update your shell profile to add
mq-crawlto your PATH
After installation, restart your terminal or source your shell profile, then verify:
mq-crawl --version
Homebrew
brew install harehare/tap/mq-crawl
Cargo
cargo install mq-crawler
Binaries
You can download pre-built binaries from the GitHub releases page.
Usage
mq-crawl [OPTIONS] <URL>
Options
| Option | Description | Default |
|---|---|---|
-o, --output <OUTPUT> | Directory to save markdown files (stdout if not specified) | stdout |
-d, --crawl-delay <SECONDS> | Delay between requests in seconds | 1 |
-c, --concurrency <N> | Number of concurrent workers | 1 |
--depth <DEPTH> | Maximum crawl depth (0 = start URL only) | unlimited |
-q, --mq-query <QUERY> | mq-lang query for processing content | — |
--robots-path <PATH> | Custom robots.txt file path | — |
--allowed-domains <DOMAINS> | Comma-separated list of extra domains to crawl; the start URL’s domain is always included | start domain only |
--headless | Use built-in headless Chrome (Chrome/Chromium must be installed) | — |
--chrome-path <PATH> | Path to Chrome/Chromium executable (requires --headless) | auto-detect |
-U, --webdriver-url <URL> | External WebDriver URL for browser-based crawling | — |
--page-load-timeout <SECONDS> | Timeout for loading a single page | 30 |
--script-timeout <SECONDS> | Timeout for executing scripts on the page | 10 |
--implicit-timeout <SECONDS> | Timeout for element finding | 5 |
--extract-scripts-as-code-blocks | Extract <script> tags as code blocks | — |
--generate-front-matter | Generate YAML front matter from page metadata | — |
--use-title-as-h1 | Use the HTML <title> as the first H1 heading | — |
-f, --format <FORMAT> | Output format: text or json | text |
Examples
# Basic crawling to stdout
mq-crawl https://example.com
# Save to directory with custom delay
mq-crawl -o ./output -d 2 https://example.com
# Limit crawl depth and use concurrent workers
mq-crawl --depth 2 -c 3 https://example.com
# Process with mq-lang query
mq-crawl -q '.h | select(contains("News"))' https://example.com
# Extract code blocks from a docs site
mq-crawl -q '.code' https://docs.example.com
Domain Filtering
By default, only the start URL’s domain is crawled. Use --allowed-domains to include additional domains:
# Also crawl docs.example.com and blog.example.com
# The start URL's domain is always included automatically
mq-crawl --allowed-domains docs.example.com,blog.example.com https://example.com
Headless Chrome
For JavaScript-heavy sites, use the built-in headless Chrome without an external server:
# Use built-in headless Chrome (Chrome or Chromium must be installed)
mq-crawl --headless https://spa-example.com
# Specify a custom Chrome/Chromium executable path
mq-crawl --headless --chrome-path /usr/bin/chromium https://spa-example.com
WebDriver
Alternatively, use an external Selenium WebDriver server:
# Start Selenium server first
# docker run -d -p 4444:4444 selenium/standalone-chrome
# Crawl with WebDriver
mq-crawl -U http://localhost:4444 https://spa-example.com
# Custom timeouts
mq-crawl -U http://localhost:4444 \
--page-load-timeout 60 \
--script-timeout 30 \
--implicit-timeout 10 \
https://spa-example.com
HTML to Markdown Options
# Generate YAML front matter with metadata
mq-crawl --generate-front-matter https://example.com
# Use page title as H1 heading
mq-crawl --use-title-as-h1 https://example.com
# Extract <script> tags as code blocks
mq-crawl --extract-scripts-as-code-blocks https://example.com
# Combine options
mq-crawl --generate-front-matter --use-title-as-h1 -o ./docs https://example.com
Output Formats
# Output as JSON
mq-crawl --format json https://example.com
# Output as plain text (default)
mq-crawl --format text https://example.com