Summary
Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.
From the table, users can download an Excel sheet of a job's results or rerun the job.
Features
Submitting URLs for Scraping
- Submit/Queue URLs for web scraping
- Add and manage elements to scrape using XPath
- Scrape all pages within the same domain
- Add custom JSON headers to send in requests to URLs
- Display results of scraped data
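Custom headers are supplied as JSON. As a minimal sketch (not Scraperr's own validation code), the snippet below shows one way to sanity-check a headers string before submitting it. It assumes the headers should be a flat JSON object mapping header names to string values; `parse_headers` is a hypothetical helper name.

```python
import json


def parse_headers(raw: str) -> dict[str, str]:
    """Parse a JSON string into a flat header mapping (hypothetical helper)."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("headers must be a JSON object")
    for name, value in data.items():
        if not isinstance(value, str):
            raise ValueError(f"header {name!r} must have a string value")
    return data


# Example header values are placeholders, not defaults used by Scraperr.
headers = parse_headers('{"User-Agent": "my-scraper/1.0", "Accept": "text/html"}')
print(headers["User-Agent"])
```

Checking the JSON locally first makes it easier to tell a malformed header string apart from a failed scrape.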
Managing Previous Jobs
- Download a CSV containing results
- Rerun jobs
- View status of queued jobs
- Favorite and view favorited jobs
User Management
- User login/signup to organize jobs
Log Viewing
- View app logs inside the web UI
Statistics View
- View summary statistics for jobs that have run
Installation
- Clone the repository:
git clone https://github.com/jaypyles/scraperr.git
- Create a .env file:
MONGODB_URI=mongodb://root:example@webscrape-mongo:27017
SECRET_KEY=your_secret_key
ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=600
HOSTNAME="localhost"
HOSTNAME_DEV="localhost"
Change HOSTNAME from localhost to your domain if you are deploying publicly through Traefik.
- Deploy:
make up
The app ships with its own Traefik configuration, so it can run standalone, but it can also be placed behind any other reverse proxy.
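The `your_secret_key` value in the .env file is a placeholder and should be replaced before deploying. A minimal sketch of generating a random value with Python's standard library follows; it assumes, based on ALGORITHM=HS256, that the key is used for signing tokens, and the 32-byte length is an illustrative choice rather than a documented requirement. `make_secret_key` is a hypothetical helper name.

```python
import secrets


def make_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string; 32 bytes yields 64 hex characters."""
    return secrets.token_hex(num_bytes)


# Print a line that can be pasted into the .env file.
print(f"SECRET_KEY={make_secret_key()}")
```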
Usage
- Open the application in your browser at http://localhost.
- Enter the URL you want to scrape in the URL field.
- Add elements to scrape by specifying a name and the corresponding XPath.
- Click the "Submit" button to queue the URL for scraping.
- View the queue in the "Previous Jobs" section.
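To illustrate how a name and XPath pair maps to extracted values, here is a small sketch that uses Python's standard-library `xml.etree.ElementTree`. This is not how Scraperr itself is known to parse pages: ElementTree supports only a subset of XPath and needs well-formed markup, so real pages usually call for an HTML-aware parser. The sample page, the element names, and the `extract` helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed sample page (made up for this example).
PAGE = """<html>
  <body>
    <h1>Example Product</h1>
    <span class="price">19.99</span>
  </body>
</html>"""

# Hypothetical element definitions: a name paired with an XPath expression.
ELEMENTS = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}


def extract(page: str, elements: dict[str, str]) -> dict[str, list[str]]:
    """Return the text of every node matched by each named XPath."""
    root = ET.fromstring(page)
    return {
        name: [(node.text or "").strip() for node in root.findall(xpath)]
        for name, xpath in elements.items()
    }


print(extract(PAGE, ELEMENTS))
```

Trying an XPath against a saved copy of the page like this can help confirm it matches the intended nodes before queuing a job.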
API Endpoints
Use this service as an API for your own projects. Because it is built on FastAPI, a docs page for the API is available at /docs.
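FastAPI apps also serve a machine-readable OpenAPI schema, by default at /openapi.json. The sketch below lists the endpoints found in such a schema. It assumes Scraperr keeps FastAPI's default schema path and that the API is reachable at the base URL you pass in; neither is confirmed here. The `/example` path in the sample schema is invented, and `list_endpoints` and `fetch_schema` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    pairs = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            if method in HTTP_METHODS:  # skip non-operation keys like "parameters"
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_schema(base_url: str) -> dict:
    """Download the schema; /openapi.json is FastAPI's default location."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


# Invented sample schema, used so the example runs without a live server.
sample = {"paths": {"/example": {"get": {}, "post": {}, "parameters": []}}}
for method, path in list_endpoints(sample):
    print(method, path)
```

Against a running instance, `list_endpoints(fetch_schema("http://localhost"))` would print the real routes, provided the assumptions above hold.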
Troubleshooting
Q: When running Scraperr, I'm met with "404 Page not found".
A: This is probably an issue with MongoDB related to running Scraperr in a VM. You should see something like this in the output of make logs:
WARNING: MongoDB 5.0+ requires a CPU with AVX support, and your current system does not appear to have that!
To resolve this issue, set the VM's CPU type to host. In Proxmox, this is done under the VM settings > Processor. Related issue.
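To confirm whether the VM actually exposes AVX, the following sketch can be run inside it. It assumes a Linux guest on x86, where /proc/cpuinfo lists CPU features on `flags` lines; `has_avx` is a hypothetical helper name.

```python
from pathlib import Path


def has_avx(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists 'avx'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match the exact token so that 'avx2' alone does not count as 'avx'.
        if key.strip() == "flags" and "avx" in value.split():
            return True
    return False


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # present on Linux only
    print("AVX supported:", has_avx(cpuinfo.read_text()))
else:
    print("No /proc/cpuinfo found; this check only works on Linux.")
```

If this reports no AVX support before the CPU type change and support afterwards, the MongoDB warning should no longer appear.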
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions
Development is made easier by building on the webapp template. View the documentation for more information.
Start the development server:
make deps build up-dev






