A powerful, customizable, and containerized web spider designed for testing and data extraction.
- Recursive Crawling: Spider engine capable of BFS/DFS traversal with configurable depth.
- Advanced Configuration (see the configuration sketch after this feature list):
  - Max Depth: Limit how many links deep the crawler goes.
  - Max Pages: Hard limit on total pages to scrape.
  - Custom Selectors: Define CSS selectors for specific link following and content extraction.
  - Regex Filters: Include/exclude URLs based on regex patterns.
  - Domain Control: Option to restrict crawling to the original domain.
- Real-Time Visualization: Live progress updates, depth tracking, and status logs via WebSockets.
- High Performance: Built on Playwright (Async) and FastAPI for concurrent processing.
- Modern UI: Polished React interface for easy configuration and data inspection.
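These configuration options map naturally onto a single settings object. The sketch below is illustrative only, assuming a Pydantic-style model with hypothetical field names rather than the project's actual schema:

```python
# Hypothetical crawl configuration -- field names are illustrative,
# not the project's actual schema.
import re
from pydantic import BaseModel, Field

class CrawlConfig(BaseModel):
    start_url: str
    max_depth: int = Field(default=3, ge=0)        # how many links deep to follow
    max_pages: int = Field(default=100, ge=1)      # hard limit on pages scraped
    link_selector: str = "a[href]"                 # CSS selector for links to follow
    content_selector: str = "body"                 # CSS selector for extracted content
    same_domain_only: bool = True                  # restrict crawl to the start domain
    include_pattern: str | None = None             # regex a URL must match
    exclude_pattern: str | None = None             # regex that rejects a URL

    def allows(self, url: str) -> bool:
        """Apply the include/exclude regex filters to a candidate URL."""
        if self.include_pattern and not re.search(self.include_pattern, url):
            return False
        if self.exclude_pattern and re.search(self.exclude_pattern, url):
            return False
        return True
```

A BFS crawl then amounts to popping URLs off a queue, skipping any that fail `allows()` or exceed `max_depth`, and stopping once `max_pages` results have been collected.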
- Frontend:
  - Framework: React 19
  - Language: TypeScript (strongly typed)
  - Build Tool: Vite
  - Styling: Vanilla CSS (custom properties, grid layouts)
- Backend (see the fetching sketch after this list):
  - Framework: FastAPI (Python 3.12)
  - Engine: Playwright (async Chromium automation)
  - Concurrency: asyncio & websockets
  - Validation: Pydantic v2
- Containerization: Docker & Docker Compose
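To show how the backend pieces relate, here is a minimal sketch of concurrent page fetching with async Playwright under asyncio. The function names, timeout, and semaphore limit are assumptions for illustration, not the project's actual code:

```python
# Minimal sketch: fetch several pages concurrently with async Playwright.
# Function names, the timeout, and the concurrency limit are illustrative assumptions.
import asyncio
from playwright.async_api import async_playwright

async def fetch(browser, url: str, limit: asyncio.Semaphore) -> dict:
    async with limit:                              # cap concurrent pages
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=15000)    # timeout in milliseconds
            title = await page.title()
            html = await page.content()
            return {"url": url, "title": title, "length": len(html)}
        finally:
            await page.close()

async def crawl(urls: list[str]) -> list[dict]:
    limit = asyncio.Semaphore(5)
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        try:
            return await asyncio.gather(*(fetch(browser, u, limit) for u in urls))
        finally:
            await browser.close()

if __name__ == "__main__":
    print(asyncio.run(crawl(["https://example.com"])))
```

Capping concurrency with a semaphore keeps memory use bounded while still letting several Chromium pages load in parallel.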
This is the easiest way to get started.
- Clone the repository.
- Open a terminal in the project root.
- Run the following command:

  ```bash
  docker-compose up --build
  ```

- Open your browser at http://localhost:5173.
Use this setup if you prefer running the project locally or don't have Docker installed.
- Navigate to the `backend` directory.
- Create and activate a virtual environment:

  ```bash
  python -m venv venv

  # Windows
  venv\Scripts\activate

  # Mac/Linux
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  playwright install
  ```

- Start the server (it will run on http://localhost:8000; a minimal `main.py` sketch follows these steps):

  ```bash
  uvicorn main:app --reload
  ```
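`uvicorn main:app` loads the repository's `main.py`, which is not reproduced here. As a rough, hypothetical illustration of how the real-time progress updates over WebSockets mentioned above could be exposed from FastAPI, a minimal sketch might look like this:

```python
# Hypothetical sketch of a FastAPI app with a WebSocket progress feed.
# The endpoint path and message fields are illustrative, not the real API.
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/progress")
async def progress(ws: WebSocket) -> None:
    await ws.accept()
    try:
        # In the real app the crawler would push updates; here we fake a few.
        for page_count in range(1, 6):
            await ws.send_json({"pages_crawled": page_count, "status": "running"})
            await asyncio.sleep(1)
        await ws.send_json({"pages_crawled": 5, "status": "done"})
    except WebSocketDisconnect:
        pass
```

A client (such as the React UI) would connect to this endpoint and render each JSON message as it arrives.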
- Open a new terminal and navigate to the `frontend` directory.
- Install dependencies:

  ```bash
  npm install
  ```

- Start the dev server:

  ```bash
  npm run dev
  ```

- Open your browser at http://localhost:5173.
This project includes a GitHub Actions workflow that automatically builds and creates Docker images for both the frontend and backend whenever you push to master or main.
- Registry: GitHub Container Registry (`ghcr.io`)
- Images:
  - `ghcr.io/sebastianreinig/webcrawler_playground/backend:latest`
  - `ghcr.io/sebastianreinig/webcrawler_playground/frontend:latest`
This project is open source and available under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, subject to the conditions of the license.
This project uses several open-source libraries. Please review their licenses if you plan to distribute heavily modified versions:
- FastAPI: MIT
- Playwright: Apache 2.0
- React: MIT
- Vite: MIT
Use Responsibly. This tool is intended for testing, educational purposes, and scraping sites you own or have permission to access.
- Respect `robots.txt` files (a politeness sketch follows this list).
- Do not overwhelm servers with excessive requests (use the `Timeout` and `Max Pages` features).
- The authors are not responsible for any misuse of this tool.
- The code is vibecoded.
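One way to honor the first two points in practice is to consult `robots.txt` before fetching and to pause between requests. This is a generic sketch using only the Python standard library, not part of the project's code; the user agent and delay are placeholders:

```python
# Generic politeness helpers: robots.txt check plus a fixed delay between requests.
# The user agent string and delay are placeholder values.
import time
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "WebcrawlerPlayground"   # placeholder
DELAY_SECONDS = 1.0                   # placeholder

def is_allowed(url: str) -> bool:
    """Return True if robots.txt for the URL's host permits fetching it."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_pause() -> None:
    """Sleep between requests so target servers are not overwhelmed."""
    time.sleep(DELAY_SECONDS)
```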
Once crawling is complete, export your dataset including all metadata and content for external analysis.
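The exact export format depends on the app, but assuming a JSON export whose per-page records carry fields such as `url`, `depth`, and `title` (hypothetical names), loading it for a quick external analysis could look like this:

```python
# Assumes a hypothetical JSON export of per-page records; adjust the filename
# and field names to whatever the app actually produces.
import json

with open("export.json", encoding="utf-8") as f:   # placeholder filename
    pages = json.load(f)

print(f"{len(pages)} pages exported")
for page in pages[:5]:
    print(page.get("depth"), page.get("url"), page.get("title"))
```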