Website Scraper
A command-line tool and MCP server for scraping websites and converting HTML to Markdown.
Features
- Extracts meaningful content from web pages using Mozilla's Readability library (the same engine used in Firefox's Reader View)
- Converts clean HTML to high-quality Markdown with TurndownService
- Sanitizes HTML by removing potentially harmful script tags
- Works as both a command-line tool and an MCP server
- Supports direct conversion of local HTML files to Markdown
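To make the script-removal feature concrete, here is a minimal, dependency-free sketch of stripping script tags from an HTML string. It is an illustration only: the function name `stripScriptTags` is hypothetical, and the regex approach is an assumption, since this README does not show how the tool actually sanitizes HTML. A DOM-based sanitizer is generally more robust than regular expressions.

```typescript
// Illustrative sketch only; not taken from this project's source.
// stripScriptTags is a hypothetical helper name.
function stripScriptTags(html: string): string {
  // Matches <script ...>...</script> blocks, case-insensitively, across newlines.
  const pairedScripts = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;
  // Matches any leftover opening <script> tag that was never closed.
  const strayOpeners = /<script\b[^>]*>/gi;
  return html.replace(pairedScripts, "").replace(strayOpeners, "");
}

const dirty = '<h1>Title</h1><script>alert("x")</script><p>Body</p>';
console.log(stripScriptTags(dirty)); // prints <h1>Title</h1><p>Body</p>
```

Regex-based stripping can miss unusual markup, so treat this as a way to see the idea rather than as a security boundary.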
Installation
## Install dependencies
npm install
## Build the project
npm run build
## Optionally, install globally
npm install -g .
Usage
CLI Mode
## Print output to console
scrape https://example.com
## Save output to a file
scrape https://example.com output.md
## Convert a local HTML file to Markdown
scrape --html-file input.html
## Convert a local HTML file and save output to a file
scrape --html-file input.html output.md
## Show help
scrape --help
## Or run via npm script
npm run start:cli -- https://example.com
MCP Server Mode
This tool can be used as a Model Context Protocol (MCP) server:
## Start in MCP server mode
npm start
Code Structure
- src/index.ts: Core functionality and MCP server implementation
- src/cli.ts: Command-line interface implementation
- src/data_processing.ts: HTML to Markdown conversion functionality
API
The tool exports the following functions:
// Scrape a website and convert to Markdown
import { scrapeToMarkdown } from './build/index.js';
// Convert HTML string to Markdown directly
import { htmlToMarkdown } from './build/data_processing.js';

async function example() {
  // Web scraping
  const markdown = await scrapeToMarkdown('https://example.com');
  console.log(markdown);

  // Direct HTML conversion
  const html = '<h1>Hello World</h1><p>This is <strong>bold</strong> text.</p>';
  const md = htmlToMarkdown(html);
  console.log(md);
}
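
When scraping several pages, one failing URL should not abort the rest. The sketch below shows one way to do that. It assumes `scrapeToMarkdown` takes a URL and returns a `Promise<string>`, as the example above suggests. The `Scraper` type, `ScrapeResult` interface, `scrapeAll` helper, and `fakeScrape` stand-in are hypothetical names introduced here so the snippet runs without network access.

```typescript
// Assumed shape of scrapeToMarkdown, inferred from the example above.
type Scraper = (url: string) => Promise<string>;

// Hypothetical result record: exactly one of markdown/error is non-null.
interface ScrapeResult {
  url: string;
  markdown: string | null;
  error: string | null;
}

// Scrape all URLs concurrently, capturing each failure instead of rejecting.
async function scrapeAll(urls: string[], scrape: Scraper): Promise<ScrapeResult[]> {
  return Promise.all(
    urls.map(async (url): Promise<ScrapeResult> => {
      try {
        return { url, markdown: await scrape(url), error: null };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { url, markdown: null, error: message };
      }
    })
  );
}

// Stand-in scraper for demonstration; replace with scrapeToMarkdown in real use.
const fakeScrape: Scraper = async (url) => {
  if (url.includes("bad")) throw new Error("fetch failed");
  return `# Page at ${url}`;
};

scrapeAll(["https://example.com", "https://bad.example"], fakeScrape).then((results) => {
  for (const r of results) {
    console.log(r.url, r.error === null ? "ok" : `failed: ${r.error}`);
  }
});
```

Passing the scraper in as a parameter keeps the helper easy to test. In real use you would pass the imported `scrapeToMarkdown` in place of `fakeScrape`.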
License
ISC