# HTML Scraper

A simple Python API that exposes a single route to return the HTML content of any page, using Flask and SeleniumBase.

## Stack

- **Python 3.12** with **uv** for dependency management
- **Flask** as the web framework
- **SeleniumBase** (undetected Chrome) for page rendering
- **Gunicorn** as the production WSGI server
- **Docker** for containerization

## Setup

### Local development

```bash
# Install dependencies
uv sync

# Copy and edit environment variables
cp .env.example .env

# Run the server
uv run python run.py
```

### Docker

```bash
# Build
docker build -t html-scraper .

# Run
docker run -p 4001:4001 --env-file .env html-scraper
```

## API

### Health check

```
GET /api/health
```

Response:

```json
{"status": "ok"}
```

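A minimal sketch of a health probe using only the Python standard library. It assumes the server is reachable at `http://localhost:4001` (the port published in the Docker example); `is_healthy` and `check_health` are hypothetical helper names, not part of this project.

```python
import json
import urllib.request

# Assumption: port 4001, as published in the Docker example.
BASE_URL = "http://localhost:4001"


def is_healthy(body: bytes) -> bool:
    """Return True if the body is the documented {"status": "ok"} payload."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not valid JSON (or not decodable text).
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """GET /api/health and report whether the service answered with status ok."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling `check_health()` returns `True` only when the route answers with HTTP 200 and the expected JSON body, which makes it usable as a readiness check before sending scrape requests.
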
### Scrape HTML

```
POST /api/scrape
Content-Type: application/json

{
  "url": "https://example.com"
}
```

Response:

```json
{
  "success": true,
  "html": "<!DOCTYPE html>..."
}
```
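
The request and response above can be exercised from Python with the standard library alone. This is a sketch under assumptions: the base URL uses port 4001 from the Docker example, the function names are hypothetical, and the shape of an error response is not documented here, so a failed scrape is surfaced as a generic exception.

```python
import json
import urllib.request

# Assumption: port 4001, as published in the Docker example.
BASE_URL = "http://localhost:4001"


def build_scrape_request(target_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/scrape request with the documented JSON body."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_html(body: bytes) -> str:
    """Return the "html" field, or raise if "success" is not true."""
    payload = json.loads(body)
    if not payload.get("success"):
        # The error payload format is not documented; include it verbatim.
        raise RuntimeError(f"scrape failed: {payload!r}")
    return payload["html"]


def scrape(target_url: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send the request and return the page HTML as a string."""
    request = build_scrape_request(target_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_html(resp.read())
```

With the server running, `scrape("https://example.com")` returns the page HTML. The generous default timeout is a guess to allow for browser rendering; note that `urlopen` raises `urllib.error.HTTPError` on non-2xx responses, so callers may want to catch it.
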