Scraperr

Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.

From the table, users can download a CSV of the job's results or rerun the job.
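
An element here is just a label paired with an XPath expression. Purely as an illustration (this snippet uses Python's lxml and is not part of Scraperr itself), evaluating such an expression against a page's HTML returns the matched values:

    # illustrative only: what an XPath expression selects from a page
    from lxml import html

    page = html.fromstring("<div class='example'>Hello</div>")
    # in a Scraperr job, "ElementName" would be paired with this XPath
    print(page.xpath("//div[@class='example']/text()"))  # ['Hello']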

Features

  • Submit URLs for web scraping
  • Add and manage elements to scrape using XPath
  • Display results of scraped data
  • Download a CSV containing the results
  • Rerun jobs
  • User login/signup to organize jobs

Installation

  1. Clone the repository:

    git clone https://github.com/jaypyles/scraperr.git
    
    
  2. Create a .env file (a tip on generating SECRET_KEY follows this list):

# MongoDB connection string (webscrape-mongo is the database container's hostname)
MONGODB_URI=mongodb://root:example@webscrape-mongo:27017
# JWT settings used for user login
SECRET_KEY=your_secret_key
ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=600
# hostnames the app is served from in production and development
HOSTNAME="yourdomain"
HOSTNAME_DEV="localhost"
  3. Deploy:

    make up
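
SECRET_KEY can be any sufficiently long random string; one way to generate one (a sketch using Python's standard library, not a project requirement) is:

    # generate a random hex string to use as SECRET_KEY
    import secrets
    print(secrets.token_hex(32))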

The app provides its own Traefik configuration that it can use on its own, but it can just as easily be placed behind any other reverse proxy you already run.

Usage

  1. Open the application in your browser at http://localhost.
  2. Enter the URL you want to scrape in the URL field.
  3. Add elements to scrape by specifying a name and the corresponding XPath.
  4. Click the "Submit" button to start the scraping process.
  5. The results will be displayed in the "Results" section.

API Endpoints

Use this service as an API for your own projects; a Python sketch of calling these endpoints follows the list below.

  • /api/submit-scrape-job: Endpoint to submit the scraping job. Accepts a POST request with the following payload:

    {
      "url": "http://example.com",
      "elements": [
        {
          "name": "ElementName",
          "xpath": "/div[@class='example']"
        }
      ],
      "user": "user@example.com",
      "time_created": "2024-07-07T12:34:56.789Z"
    }
    
  • /api/retrieve-scrape-jobs: Endpoint to retrieve the jobs submitted by a given account.

    {
      "user": "user@example.com"
    }
    
  • /api/download: Endpoint to download a job's results in CSV format.

    {
      "id": "85312b8b8b204aacab9631f2d76f1af0"
    }
    
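A rough sketch of calling these endpoints from your own code (Python with the requests library; the HTTP method, base URL, and response handling shown here are assumptions, so adjust them to match your deployment):

    # sketch only: field names mirror the payloads documented above
    import requests
    from datetime import datetime, timezone

    BASE = "http://localhost"  # assumed host; change to your HOSTNAME

    # submit a scraping job
    job = {
        "url": "http://example.com",
        "elements": [{"name": "ElementName", "xpath": "/div[@class='example']"}],
        "user": "user@example.com",
        "time_created": datetime.now(timezone.utc).isoformat(),
    }
    requests.post(f"{BASE}/api/submit-scrape-job", json=job).raise_for_status()

    # retrieve jobs for an account (response shape is an assumption)
    jobs = requests.post(f"{BASE}/api/retrieve-scrape-jobs",
                         json={"user": "user@example.com"}).json()

    # download one job's results as CSV, using the id of a completed job
    resp = requests.post(f"{BASE}/api/download",
                         json={"id": "85312b8b8b204aacab9631f2d76f1af0"})
    with open("results.csv", "wb") as f:
        f.write(resp.content)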

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributions

Development is made easier by building from the webapp template (https://github.com/jaypyles/webapp-template). See its documentation for extra information.

Start the development server:

    make deps build up-dev
