A powerful, customizable, and containerized web spider designed for testing and data extraction.
- Recursive Crawling: Spider engine capable of BFS/DFS traversal with configurable depth.
- Advanced Configuration (see the configuration sketch after this feature list):
  - Max Depth: Limit how many links deep the crawler goes.
  - Max Pages: Hard limit on total pages to scrape.
  - Custom Selectors: Define CSS selectors for specific link following and content extraction.
  - Regex Filters: Include/exclude URLs based on regex patterns.
  - Domain Control: Option to restrict crawling to the original domain.
- Real-Time Visualization: Live progress updates, depth tracking, and status logs via WebSockets.
- High Performance: Built on Playwright (Async) and FastAPI for concurrent processing.
- Modern UI: Polished React interface for easy configuration and data inspection.
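These configuration options map naturally onto a single settings object. The sketch below is illustrative only, assuming a Pydantic-style model with hypothetical field names rather than the project's actual schema:

```python
# Hypothetical crawl configuration -- field names are illustrative,
# not the project's actual schema.
import re
from pydantic import BaseModel, Field

class CrawlConfig(BaseModel):
    start_url: str
    max_depth: int = Field(default=3, ge=0)        # how many links deep to follow
    max_pages: int = Field(default=100, ge=1)      # hard limit on pages scraped
    link_selector: str = "a[href]"                 # CSS selector for links to follow
    content_selector: str = "body"                 # CSS selector for extracted content
    same_domain_only: bool = True                  # restrict crawl to the start domain
    include_pattern: str | None = None             # regex a URL must match
    exclude_pattern: str | None = None             # regex that rejects a URL

    def allows(self, url: str) -> bool:
        """Apply the include/exclude regex filters to a candidate URL."""
        if self.include_pattern and not re.search(self.include_pattern, url):
            return False
        if self.exclude_pattern and re.search(self.exclude_pattern, url):
            return False
        return True
```

A BFS crawl then amounts to popping URLs off a queue, skipping any that fail `allows()` or exceed `max_depth`, and stopping once `max_pages` results have been collected.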
- Frontend:
  - Framework: React 19
  - Language: TypeScript (strongly typed)
  - Build Tool: Vite
  - Styling: Vanilla CSS (custom properties, grid layouts)
- Backend (see the fetching sketch after this list):
  - Framework: FastAPI (Python 3.12)
  - Engine: Playwright (async Chromium automation)
  - Concurrency: asyncio & websockets
  - Validation: Pydantic v2
- Containerization: Docker & Docker Compose
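To show how the backend pieces relate, here is a minimal sketch of concurrent page fetching with async Playwright under asyncio. The function names, timeout, and semaphore limit are assumptions for illustration, not the project's actual code:

```python
# Minimal sketch: fetch several pages concurrently with async Playwright.
# Function names, the timeout, and the concurrency limit are illustrative assumptions.
import asyncio
from playwright.async_api import async_playwright

async def fetch(browser, url: str, limit: asyncio.Semaphore) -> dict:
    async with limit:                              # cap concurrent pages
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=15000)    # timeout in milliseconds
            title = await page.title()
            html = await page.content()
            return {"url": url, "title": title, "length": len(html)}
        finally:
            await page.close()

async def crawl(urls: list[str]) -> list[dict]:
    limit = asyncio.Semaphore(5)
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        try:
            return await asyncio.gather(*(fetch(browser, u, limit) for u in urls))
        finally:
            await browser.close()

if __name__ == "__main__":
    print(asyncio.run(crawl(["https://example.com"])))
```

Capping concurrency with a semaphore keeps memory use bounded while still letting several Chromium pages load in parallel.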
This is the easiest way to get started.
- Clone the repository.
- Open a terminal in the project root.
- Run the following command:

  ```bash
  docker-compose up --build
  ```

- Open your browser at http://localhost:5173.
Use this setup if you prefer running the project locally or don't have Docker installed.
- Navigate to the `backend` directory.
- Create and activate a virtual environment:

  ```bash
  python -m venv venv

  # Windows
  venv\Scripts\activate

  # Mac/Linux
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  playwright install
  ```

- Start the server (it will run on http://localhost:8000; a minimal `main.py` sketch follows these steps):

  ```bash
  uvicorn main:app --reload
  ```
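`uvicorn main:app` loads the repository's `main.py`, which is not reproduced here. As a rough, hypothetical illustration of how the real-time progress updates over WebSockets mentioned above could be exposed from FastAPI, a minimal sketch might look like this:

```python
# Hypothetical sketch of a FastAPI app with a WebSocket progress feed.
# The endpoint path and message fields are illustrative, not the real API.
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/progress")
async def progress(ws: WebSocket) -> None:
    await ws.accept()
    try:
        # In the real app the crawler would push updates; here we fake a few.
        for page_count in range(1, 6):
            await ws.send_json({"pages_crawled": page_count, "status": "running"})
            await asyncio.sleep(1)
        await ws.send_json({"pages_crawled": 5, "status": "done"})
    except WebSocketDisconnect:
        pass
```

A client (such as the React UI) would connect to this endpoint and render each JSON message as it arrives.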
- Open a new terminal and navigate to the `frontend` directory.
- Install dependencies:

  ```bash
  npm install
  ```

- Start the dev server:

  ```bash
  npm run dev
  ```

- Open your browser at http://localhost:5173.
This project includes a GitHub Actions workflow that automatically builds and creates Docker images for both the frontend and backend whenever you push to master or main.
- Registry: GitHub Container Registry (`ghcr.io`)
- Images:
  - `ghcr.io/sebastianreinig/webcrawler_playground/backend:latest`
  - `ghcr.io/sebastianreinig/webcrawler_playground/frontend:latest`
This project is open source and available under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, subject to the conditions of the license.
This project uses several open-source libraries. Please review their licenses if you plan to distribute heavily modified versions:
- FastAPI: MIT
- Playwright: Apache 2.0
- React: MIT
- Vite: MIT
Use Responsibly. This tool is intended for testing, educational purposes, and scraping sites you own or have permission to access.
- Respect `robots.txt` files (a politeness sketch follows this list).
- Do not overwhelm servers with excessive requests (use the `Timeout` and `Max Pages` features).
- The authors are not responsible for any misuse of this tool.
- The code is vibecoded.
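One way to honor the first two points in practice is to consult `robots.txt` before fetching and to pause between requests. This is a generic sketch using only the Python standard library, not part of the project's code; the user agent and delay are placeholders:

```python
# Generic politeness helpers: robots.txt check plus a fixed delay between requests.
# The user agent string and delay are placeholder values.
import time
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "WebcrawlerPlayground"   # placeholder
DELAY_SECONDS = 1.0                   # placeholder

def is_allowed(url: str) -> bool:
    """Return True if robots.txt for the URL's host permits fetching it."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_pause() -> None:
    """Sleep between requests so target servers are not overwhelmed."""
    time.sleep(DELAY_SECONDS)
```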
Once crawling is complete, export your dataset including all metadata and content for external analysis.
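The exact export format depends on the app, but assuming a JSON export whose per-page records carry fields such as `url`, `depth`, and `title` (hypothetical names), loading it for a quick external analysis could look like this:

```python
# Assumes a hypothetical JSON export of per-page records; adjust the filename
# and field names to whatever the app actually produces.
import json

with open("export.json", encoding="utf-8") as f:   # placeholder filename
    pages = json.load(f)

print(f"{len(pages)} pages exported")
for page in pages[:5]:
    print(page.get("depth"), page.get("url"), page.get("title"))
```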