diff --git a/README.md b/README.md
index a1e3feb..d6eece2 100644
--- a/README.md
+++ b/README.md
@@ -1,104 +1,61 @@
-![logo_picture](https://github.com/jaypyles/www-scrape/blob/master/docs/logo_picture.png)
-
- MongoDB
- FastAPI
- Next JS
- TailwindCSS
+ Scraperr Logo
+
+# Scraperr
+
+**A powerful self-hosted web scraping solution**
+
+
+ MongoDB
+ FastAPI
+ Next JS
+ TailwindCSS
+
-# Summary
+## 📋 Overview
-Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.
+Scraperr enables you to extract data from websites with precision using XPath selectors. This self-hosted application provides a clean interface to manage scraping jobs, view results, and export data.
-From the table, users can download an excel sheet of the job's results, along with an option to rerun the job.
+> 📚 **[Check out the docs](https://scraperr-docs.pages.dev)** for a comprehensive quickstart guide and detailed information.
-View the [docs](https://scraperr-docs.pages.dev) for a quickstart guide and more information.
+
+ Scraperr Main Interface
+
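+To give a concrete feel for XPath-based extraction and the custom JSON headers described under Key Features below, here is a minimal standalone sketch (not Scraperr's own code) using Python's `requests` and `lxml`; the URL, the selector, and the headers are illustrative placeholders:
+
+```python
+import requests
+from lxml import html
+
+# Illustrative placeholders: swap in your real target page, selector, and headers.
+URL = "https://example.com/products"
+SELECTOR = '//div[@class="product"]/h2/text()'  # an XPath selector, as used in a Scraperr job
+HEADERS = {"User-Agent": "my-scraper/1.0"}      # custom headers, supplied as JSON in Scraperr
+
+response = requests.get(URL, headers=HEADERS, timeout=10)
+tree = html.fromstring(response.content)
+
+# Evaluating the selector returns every matching text node on the page.
+for title in tree.xpath(SELECTOR):
+    print(title.strip())
+```
+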
-## Features
+## ✨ Key Features
-### Submitting URLs for Scraping
+- **XPath-Based Extraction**: Precisely target page elements
+- **Queue Management**: Submit and manage multiple scraping jobs
+- **Domain Spidering**: Option to scrape all pages within the same domain
+- **Custom Headers**: Add JSON headers to your scraping requests
+- **Media Downloads**: Automatically download images, videos, and other media
+- **Results Visualization**: View scraped data in a structured table format
+- **Data Export**: Export your results in various formats
+- **Notification Channels**: Send completion notifications through various channels
-- Submit/Queue URLs for web scraping
-- Add and manage elements to scrape using XPath
-- Scrape all pages within same domain
-- Add custom json headers to send in requests to URLs
-- Display results of scraped data
-- Download media found on the page (images, videos, etc.)
+## 🚀 Getting Started
-![main_page](https://github.com/jaypyles/www-scrape/blob/master/docs/main_page.png)
-
-### Managing Previous Jobs
-
-- Download csv containing results
-- Rerun jobs
-- View status of queued jobs
-- Favorite and view favorited jobs
-
-![job_page](https://github.com/jaypyles/www-scrape/blob/master/docs/job_page.png)
-
-### User Management
-
-- User login/signup to organize jobs (optional)
-
-![login](https://github.com/jaypyles/www-scrape/blob/master/docs/login.png)
-
-### Log Viewing
-
-- View app logs inside of web ui
-
-![logs](https://github.com/jaypyles/www-scrape/blob/master/docs/log_page.png)
-
-### Statistics View
-
-- View a small statistics view of jobs ran
-
-![statistics](https://github.com/jaypyles/www-scrape/blob/master/docs/stats_page.png)
-
-### AI Integration
-
-- Include the results of a selected job into the context of a conversation
-
-Currently supports:
-
-1. Ollama
-2. OpenAI
-
-![chat](https://github.com/jaypyles/www-scrape/blob/master/docs/chat_page.png)
-
-## API Endpoints
-
-Use this service as an API for your own projects. Due to this using FastAPI, a docs page is available at `/docs` for the API.
-
-![docs](https://github.com/jaypyles/www-scrape/blob/master/docs/docs_page.png)
-
-## Troubleshooting
-
-Q: When running Scraperr, I'm met with "404 Page not found".
-A: This is probably an issue with MongoDB related to running Scraperr in a VM. You should see something liks this in `make logs`:
-
-```
-WARNING: MongoDB 5.0+ requires a CPU with AVX support, and your current system does not appear to have that!
+```bash
+make up
 ```
-To resolve this issue, simply set CPU host type to `host`. This can be done in Proxmox in the VM settings > Processor. [Related issue](https://github.com/jaypyles/Scraperr/issues/9).
+## ⚖️ Legal and Ethical Guidelines
-## Legal and Ethical Considerations
+When using Scraperr, please remember to:
-When using Scraperr, please ensure that you:
+1. **Respect `robots.txt`**: Always check a website's `robots.txt` file to verify which pages permit scraping (see the sketch below)
+2. **Terms of Service**: Adhere to each website's Terms of Service regarding data extraction
+3. **Rate Limiting**: Implement reasonable delays between requests to avoid overloading servers
-1. **Check Robots.txt**: Verify allowed pages by reviewing the `robots.txt` file of the target website.
-2. **Compliance**: Always comply with the website's Terms of Service (ToS) regarding web scraping.
+> **Disclaimer**: Scraperr is intended for use only on websites that explicitly permit scraping. The creator accepts no responsibility for misuse of this tool.
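+
+A quick way to honor the first guideline is to query `robots.txt` before scraping. Here is a minimal sketch using only Python's standard library; the URLs are placeholders:
+
+```python
+from urllib.robotparser import RobotFileParser
+
+# Placeholder target site: point this at the domain you intend to scrape.
+robots = RobotFileParser()
+robots.set_url("https://example.com/robots.txt")
+robots.read()
+
+# True only if the site's robots.txt allows fetching this path.
+print(robots.can_fetch("*", "https://example.com/some/page"))
+```
+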
-**Disclaimer**: This tool is intended for use only on websites that permit scraping. The author is not responsible for any misuse of this tool.
-
-## License
+## 📄 License
 This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
-### Contributions
+## 👏 Contributions
-Development made easy by developing from [webapp template](https://github.com/jaypyles/webapp-template). View documentation for extra information.
+Development is made easier with the [webapp template](https://github.com/jaypyles/webapp-template).
-Start development server:
-
-`make deps build up-dev`
+To start a development server, run `make build up-dev`.
\ No newline at end of file
diff --git a/docs/main_page.png b/docs/main_page.png
index fea8675..ea94c17 100644
Binary files a/docs/main_page.png and b/docs/main_page.png differ