76 Commits

Author SHA1 Message Date
github-actions[bot]
ee1e27ac1b chore: bump version to 1.1.7
2025-10-12 16:55:31 +00:00
Jayden Pyles
e90e7e9564 fix: only log if it got a job 2025-10-12 11:55:20 -05:00
github-actions[bot]
44ccad1935 chore: bump version to 1.1.6
2025-07-13 02:13:39 +00:00
Jayden Pyles
308759d70c chore: bump version 2025-07-12 21:13:37 -05:00
github-actions[bot]
6bf130dd4b chore: bump version to 1.1.6 2025-07-13 02:12:41 +00:00
Jayden Pyles
875a3684c9 Feat/swap to sqlalchemy (#99)
* chore: wip swap to sqlalchemy

* feat: swap to sqlalchemy

* feat: swap to sqlalchemy

* feat: swap to sqlalchemy

* feat: swap to sqlalchemy
2025-07-12 21:12:33 -05:00
github-actions[bot]
b096fb1b3c chore: bump version to 1.1.5
2025-07-05 16:30:10 +00:00
Jayden Pyles
5f65125882 chore: push for arm64 2025-07-05 11:30:01 -05:00
github-actions[bot]
327db34683 chore: bump version to 1.1.5 2025-07-05 16:03:12 +00:00
Jayden Pyles
8d0f362a70 chore: push for arm64 2025-07-05 10:47:02 -05:00
github-actions[bot]
24f4b57fea chore: bump version to 1.1.4
2025-06-18 23:06:20 +00:00
Gaurav Agnihotri
1c0dec6db6 fix: pin browserforge version to 1.2.1 (#93)
Co-authored-by: gauravagnihotri <gaagniho@mtu.edu>
2025-06-18 18:06:10 -05:00
github-actions[bot]
e9c60f6338 chore: bump version to 1.1.3
2025-06-12 23:06:34 +00:00
Jayden Pyles
5719a85491 chore: update chart version 2025-06-12 18:07:55 -05:00
github-actions[bot]
052d80de07 chore: bump version to 1.1.3 2025-06-12 23:03:26 +00:00
Jayden Pyles
7047a3c0e3 chore: update chart version 2025-06-12 18:04:47 -05:00
github-actions[bot]
71f603fc62 chore: bump version to 1.1.3 2025-06-12 23:01:57 +00:00
Jayden Pyles
86a77a27df chore: update chart version 2025-06-12 18:03:20 -05:00
github-actions[bot]
b11e263b93 chore: bump version to 1.1.3 2025-06-12 23:00:47 +00:00
Jayden Pyles
91dc13348d feat: add import/export for job configurations (#91)
* chore: wip add upload/import

* chore: wip add upload/import

* feat: update job rerunning

* fix: update workflow

* fix: update workflow

* chore: temp disable workflow
2025-06-12 18:00:39 -05:00
github-actions[bot]
93b0c83381 chore: bump version to 1.1.2
2025-06-08 23:24:17 +00:00
Jayden Pyles
9381ba9232 chore: update workflow 2025-06-08 18:17:03 -05:00
Jayden Pyles
20dccc5527 feat: edit ui + add return html option (#90)
* fix: restyle the element table

* chore: wip ui

* wip: edit styles

* feat: add html return

* fix: build

* fix: workflow

* fix: workflow

* fix: workflow

* fix: workflow

* fix: workflow

* fix: workflow

* fix: workflow

* fix: cypress test

* chore: update photo [skip ci]
2025-06-08 18:14:02 -05:00
Jayden Pyles
02619eb184 feat: update workflows [no bump]
2025-06-05 22:19:41 -05:00
github-actions[bot]
58c6c09fc9 chore: bump version to 1.1.2 2025-06-06 03:18:03 +00:00
Jayden Pyles
bf896b4c6b feat: update workflows [no bump] 2025-06-05 22:09:49 -05:00
Jayden Pyles
e3b9c11ab7 feat: update workflows [no bump] 2025-06-05 21:59:54 -05:00
github-actions[bot]
32da3375b3 chore: bump version to 1.1.1 2025-06-06 02:56:42 +00:00
Jayden Pyles
b5131cbe4c feat: update workflows [no bump] 2025-06-05 21:47:55 -05:00
github-actions[bot]
47c4c9a7d1 chore: bump version to 1.1.1
2025-06-05 01:48:00 +00:00
Jayden Pyles
4352988666 feat: update workflows 2025-06-04 20:40:16 -05:00
Jayden Pyles
00759151e6 feat: update workflos 2025-06-04 20:35:06 -05:00
github-actions[bot]
bfae00ca72 chore: bump version to 1.1.1 2025-06-04 23:06:51 +00:00
Jayden Pyles
e810700569 chore: remove deprecated output version 2025-06-04 17:58:49 -05:00
Jayden Pyles
9857fa96e0 fix: message [skip ci] 2025-06-03 17:12:51 -05:00
github-actions[bot]
b52fbc538d chore: bump version to 1.1.1
2025-06-03 16:27:54 +00:00
Jayden Pyles
42c0f3ae79 feat: add auto deploy workflow 2025-06-03 11:20:13 -05:00
github-actions[bot]
9aab2f9b4f chore: bump version to 1.1.1 2025-06-03 15:55:42 +00:00
Jayden Pyles
e182d3e4b8 feat: add auto deploy workflow 2025-06-03 10:48:02 -05:00
github-actions[bot]
53f35989f5 chore: bump version to 1.1.1
2025-06-03 03:06:41 +00:00
Jayden Pyles
a67ab34cfa feat: add auto deploy workflow 2025-06-02 21:59:06 -05:00
Jayden Pyles
3bf6657191 Merge branch 'master' of github.com:jaypyles/Scraperr 2025-06-02 21:58:04 -05:00
Jayden Pyles
c38d19a0ca feat: add auto deploy workflow 2025-06-02 21:57:44 -05:00
github-actions[bot]
a53e7e1aa1 chore: bump version to 1.1.1 2025-06-03 02:46:40 +00:00
Jayden Pyles
84368b1f6d feat: add auto deploy workflow 2025-06-02 21:38:25 -05:00
github-actions[bot]
ce4c1ceaa7 chore: bump version to 1.1.1 2025-06-03 02:27:38 +00:00
Jayden Pyles
7e1ce58bb8 feat: add auto deploy workflow 2025-06-02 21:19:43 -05:00
github-actions[bot]
175e7d63bf chore: bump version to 1.1.1 2025-06-03 01:52:57 +00:00
Jayden Pyles
d2c06de247 feat: add auto deploy workflow 2025-06-02 20:45:06 -05:00
github-actions[bot]
e0159bf9d4 chore: bump version to 1.1.1 2025-06-03 01:39:05 +00:00
Jayden Pyles
6d574ddfd2 feat: add auto deploy workflow 2025-06-02 20:31:34 -05:00
github-actions[bot]
b089d72786 chore: bump version to 2025-06-03 01:02:11 +00:00
Jayden Pyles
9ee4d577fd feat: add auto deploy workflow 2025-06-02 19:54:33 -05:00
Jayden Pyles
cddce5164d feat: add auto deploy workflow 2025-06-02 19:45:49 -05:00
Jayden Pyles
bf3163bfba feat: add auto deploy workflow 2025-06-02 19:09:57 -05:00
Jayden Pyles
54b513e92c feat: auto deploy 2025-06-02 19:08:18 -05:00
Jayden Pyles
6c56f2f161 Chore: app refactor (#88)
* chore: refactor wip

* chore: refactor wip

* chore: refactor wip

* chore: refactor wip

* chore: refactor wip

* chore: refactor wip

* chore: work in progress

* chore: work in progress

* chore: work in progress

* chore: work in progress

* chore: work in progress

* chore: work in progress

* chore: work in progress

* chore: work in progress

* chore: work in progress

* chore: refactor wip

* chore: work in progress

* chore: refactor wip

* chore: refactor wip

* chore: refactor wip

* fix: build

* fix: cypress test

* fix: cypress test

* fix: cypress test

* fix: cypress test

* fix: cypress test

* fix: cypress test

* fix: cypress test

* fix: cypress test

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests

* fix: cypress tests
2025-06-01 15:56:15 -05:00
Jayden Pyles
d4edb9d93e chore: update chart version [skip ci] 2025-05-19 20:46:19 -05:00
Jayden Pyles
5ebd96b62b feat: add agent mode (#81)
* chore: wip agent mode

* wip: add agent mode frontend

* wip: add agent mode frontend

* chore: cleanup code

* chore: cleanup code

* chore: cleanup code
2025-05-19 20:44:41 -05:00
Jayden Pyles
d602d3330a fix: site map
2025-05-17 17:05:37 -05:00
Jayden Pyles
6639e8b48f chore: update chart version [skip ci] 2025-05-17 16:33:18 -05:00
Jayden Pyles
263e46ba4d feat: add media viewer + other fixes (#79)
* feat: add media viewer + other fixes

* chore: remove logging [skip ci]

* chore: remove logging [skip ci]

* feat: add unit test for media

* feat: add unit test for media

* feat: add unit test for media [skip ci]

* feat: add unit test for media [skip ci]

* feat: add unit test for media [skip ci]

* feat: add unit test for media [skip ci]

* chore: update docs [skip ci]
2025-05-17 16:31:34 -05:00
Jayden Pyles
f815a58efc chore: update docker version [skip ci] 2025-05-16 22:04:46 -05:00
Jayden Pyles
50ec5df657 chore: update chart version [skip ci] 2025-05-16 21:39:04 -05:00
Jayden Pyles
28de0f362c feat: add recording viewer and vnc (#78)
* feat: add recording viewer and vnc

* feat: add recording viewer and vnc

* feat: add recording viewer and vnc

* feat: add recording viewer and vnc

* chore: update gitignore [skip ci]

* chore: update dev compose [skip ci]

* fix: only run manually
2025-05-16 21:37:09 -05:00
Jayden Pyles
6b33723cac feat: update version
2025-05-16 14:15:53 -05:00
Jayden Pyles
5c89e4d7d2 feat: allow custom cookies (#77)
* feat: working new advanced job options

* feat: working new advanced job options

* feat: add tests for adding custom cookies/headers
2025-05-16 14:13:58 -05:00
Jayden Pyles
ed0828a585 fix: deployment
2025-05-13 21:03:21 -05:00
Jayden Pyles
1b8c8c779a Feature: Allow Multiple Download Options (#75)
* feat: allow downloading in MD format

* fix: unit tests

* fix: deployments [skip ci]

* fix: deployment
2025-05-13 18:23:59 -05:00
Jayden Pyles
267cc73657 docs: update docs [skip ci] 2025-05-13 13:11:52 -05:00
Jayden Pyles
92ff16d9c3 docs: update docs [skip ci] 2025-05-12 21:37:37 -05:00
Jayden Pyles
8b2e5dc9c3 Feat/add helm chart (#69)
* chore: start on helm chart

* chore: start on helm chart

* chore: start on helm chart

* chore: start on helm chart

* chore: start on helm chart

* chore: start on helm chart

* chore: start on helm chart

* chore: start on helm chart
2025-05-12 21:19:17 -05:00
Jayden Pyles
7f1bc295ac Feat/add data reader (#68)
* feat: working new data view

* feat: working new data view

* fix: remove unused deps

* fix: typing

* chore: cleanup code
2025-05-12 17:58:45 -05:00
Jayden Pyles
031572325f Fix/UI and backend fixes (#67)
* chore: wip

* chore: wip

* chore: wip

* fix: cypress test

* chore: cleanup code
2025-05-11 17:33:29 -05:00
Jayden Pyles
48d3bf9214 chore: docs [skip ci] 2025-05-11 13:46:21 -05:00
Jayden Pyles
e07abcd089 chore: docs [skip ci] 2025-05-11 13:42:37 -05:00
226 changed files with 14864 additions and 14778 deletions

4
.dockerignore Normal file

@@ -0,0 +1,4 @@
node_modules
npm-debug.log
Dockerfile
.dockerignore


@@ -0,0 +1,58 @@
name: Publish Helm Chart
description: Publish a Helm chart to a target repository

inputs:
  app-repo-token:
    required: true
    description: "The token for the target repository"
  version:
    required: true
    description: "The version of the Helm chart"

runs:
  using: 'composite'
  steps:
    - name: Checkout app repo
      uses: actions/checkout@v4

    - name: Set up Helm
      uses: azure/setup-helm@v3

    - name: Update Helm chart version
      run: |
        sed -i "s/^version: .*/version: ${{ inputs.version }}/" helm/Chart.yaml
      shell: bash

    - name: Package Helm chart
      run: |
        mkdir -p packaged
        helm package helm -d packaged
      shell: bash

    - name: Clone target Helm repo
      run: |
        git clone https://github.com/jaypyles/helm.git target-repo
        cd target-repo
        git config user.name "github-actions"
        git config user.email "github-actions@github.com"
        git fetch origin gh-pages  # Fetch gh-pages explicitly
        git checkout gh-pages      # Checkout gh-pages branch
        git pull origin gh-pages   # Pull latest changes from gh-pages
      shell: bash

    - name: Copy package and update index
      run: |
        APP_NAME="scraperr"
        mkdir -p target-repo/charts/$APP_NAME
        cp packaged/*.tgz target-repo/charts/$APP_NAME/
        cd target-repo/charts/$APP_NAME
        helm repo index . --url https://jaypyles.github.io/helm/charts/$APP_NAME
      shell: bash

    - name: Commit and push to target repo
      run: |
        cd target-repo
        git add charts/
        git commit -m "Update $APP_NAME chart $(date +'%Y-%m-%d %H:%M:%S')" || echo "No changes"
        git push https://x-access-token:${{ inputs.app-repo-token }}@github.com/jaypyles/helm.git gh-pages
      shell: bash
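
Note on the "Update Helm chart version" step: the `sed` pattern is anchored with `^`, so it rewrites only lines that begin with `version:` and leaves a field such as `appVersion:` untouched. The following is a minimal Python sketch of the same substitution, not code from the repository. The helper name `set_chart_version` and the sample chart body are assumptions made for illustration.

```python
import re


def set_chart_version(chart_text: str, version: str) -> str:
    """Mirror `sed "s/^version: .*/version: <version>/"` on a Chart.yaml body.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    # re.MULTILINE makes ^ match at the start of every line, which is what
    # sed does implicitly by processing its input line by line. A function
    # replacement avoids treating backslashes in `version` as escapes.
    return re.sub(
        r"^version: .*",
        lambda _match: f"version: {version}",
        chart_text,
        flags=re.MULTILINE,
    )


# Illustrative chart contents (assumed, not copied from helm/Chart.yaml).
example_chart = (
    "apiVersion: v2\n"
    "name: scraperr\n"
    "version: 1.1.6\n"
    'appVersion: "1.0"\n'
)
updated_chart = set_chart_version(example_chart, "1.1.7")
print(updated_chart)
```

Only the `version:` line changes in the sample; `apiVersion:` and `appVersion:` do not start with `version:` and are therefore skipped.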


@@ -2,6 +2,13 @@ name: Run Cypress Tests
description: Run Cypress tests
inputs:
openai_key:
description: "OpenAI API key"
required: true
default: ""
runs:
using: "composite"
steps:
@@ -13,13 +20,25 @@ runs:
with:
node-version: 22
- name: Setup yarn
shell: bash
run: npm install -g yarn
- name: Install xvfb for headless testing
shell: bash
run: |
sudo apt-get update
sudo apt-get install -y xvfb libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2t64 libpango-1.0-0 libcairo2 libgtk-3-0 libgdk-pixbuf2.0-0 libx11-6 libx11-xcb1 libxcb1 libxss1 libxtst6 libnspr4
- name: Setup Docker project
shell: bash
run: make build up-dev
run: |
export OPENAI_KEY="${{ inputs.openai_key }}"
make build-ci up-ci
- name: Install dependencies
shell: bash
run: npm install
run: yarn install
- name: Wait for frontend to be ready
shell: bash
@@ -54,5 +73,8 @@ runs:
- name: Run Cypress tests
shell: bash
run: npm run cy:run
run: |
set -e
npm run cy:run

31
.github/workflows/cypress-tests.yml vendored Normal file

@@ -0,0 +1,31 @@
name: Cypress Tests

on:
  workflow_call:
    secrets:
      openai_key:
        required: true

jobs:
  cypress-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Cypress Tests
        id: run-tests
        uses: ./.github/actions/run-cypress-tests
        with:
          openai_key: ${{ secrets.openai_key }}

      - name: Check container logs on failure
        if: steps.run-tests.conclusion == 'failure'
        run: |
          echo "Cypress tests failed. Dumping container logs..."
          docker logs scraperr_api || true

      - name: Fail job if Cypress failed
        if: steps.run-tests.conclusion == 'failure'
        run: exit 1


@@ -1,54 +1,89 @@
name: Docker Image
on:
workflow_run:
workflows: ["Unit Tests"]
types:
- completed
workflow_dispatch:
workflow_call:
inputs:
version:
required: true
type: string
secrets:
dockerhub_username:
required: true
dockerhub_token:
required: true
repo_token:
required: true
discord_webhook_url:
required: true
jobs:
build:
if: ${{ github.event.workflow_run.conclusion == 'success' && github.event.workflow_run.head_branch == 'master' }}
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Echo version
run: |
echo "Version is ${{ inputs.version }}"
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build and push frontend
- name: Build and push frontend (multi-arch)
uses: docker/build-push-action@v5
with:
context: .
file: ./docker/frontend/Dockerfile
push: true
tags: ${{ secrets.DOCKERHUB_USERNAME }}/${{ secrets.DOCKERHUB_REPO }}:latest
platforms: linux/amd64,linux/arm64
tags: |
${{ secrets.DOCKERHUB_USERNAME }}/scraperr:latest
${{ secrets.DOCKERHUB_USERNAME }}/scraperr:${{ inputs.version }}
- name: Build and push api
- name: Build and push api (multi-arch)
uses: docker/build-push-action@v5
with:
context: .
file: ./docker/api/Dockerfile
push: true
tags: ${{ secrets.DOCKERHUB_USERNAME }}/scraperr_api:latest
platforms: linux/amd64,linux/arm64
tags: |
${{ secrets.DOCKERHUB_USERNAME }}/scraperr_api:latest
${{ secrets.DOCKERHUB_USERNAME }}/scraperr_api:${{ inputs.version }}
push-helm-chart:
runs-on: ubuntu-latest
needs:
- build
steps:
- uses: actions/checkout@v4
- name: Push Helm Chart
uses: ./.github/actions/push-to-helm
with:
app-repo-token: ${{ secrets.repo_token }}
version: ${{ inputs.version }}
success-message:
runs-on: ubuntu-latest
needs:
- build
- push-helm-chart
steps:
- name: Send Discord Message
uses: jaypyles/discord-webhook-action@v1.0.0
with:
webhook-url: ${{ secrets.DISCORD_WEBHOOK_URL }}
content: "Scraperr Successfully Built Docker Images"
content: "Scraperr Successfully Built Docker Images (v${{ inputs.version }})"
username: "Scraperr CI"
embed-title: "✅ Deployment Status"
embed-description: "Scraperr successfully built docker images."

35
.github/workflows/merge.yml vendored Normal file

@@ -0,0 +1,35 @@
name: Merge

on:
  push:
    branches:
      - master
  pull_request:
    types: [closed]
    branches:
      - master

jobs:
  # TODO: Re-enable once browser forge is fixed for camoufox, or else tests will never pass
  # tests:
  #   uses: ./.github/workflows/tests.yml
  #   secrets:
  #     openai_key: ${{ secrets.OPENAI_KEY }}
  #     discord_webhook_url: ${{ secrets.DISCORD_WEBHOOK_URL }}

  version:
    uses: ./.github/workflows/version.yml
    secrets:
      git_token: ${{ secrets.GPAT_TOKEN }}

  build-and-deploy:
    if: needs.version.outputs.version_bump == 'true'
    needs: version
    uses: ./.github/workflows/docker-image.yml
    secrets:
      dockerhub_username: ${{ secrets.DOCKERHUB_USERNAME }}
      dockerhub_token: ${{ secrets.DOCKERHUB_TOKEN }}
      repo_token: ${{ secrets.GPAT_TOKEN }}
      discord_webhook_url: ${{ secrets.DISCORD_WEBHOOK_URL }}
    with:
      version: ${{ needs.version.outputs.version }}

15
.github/workflows/pr.yml vendored Normal file

@@ -0,0 +1,15 @@
name: PR

on:
  pull_request:
    branches:
      - master
    types: [opened, synchronize, reopened]
  workflow_dispatch:

jobs:
  tests:
    uses: ./.github/workflows/tests.yml
    secrets:
      openai_key: ${{ secrets.OPENAI_KEY }}
      discord_webhook_url: ${{ secrets.DISCORD_WEBHOOK_URL }}

29
.github/workflows/pytest.yml vendored Normal file

@@ -0,0 +1,29 @@
name: Pytest

on:
  workflow_call:

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - uses: actions/setup-node@v3

      - name: Set env
        run: echo "ENV=test" >> $GITHUB_ENV

      - name: Install pdm
        run: pip install pdm

      - name: Install project dependencies
        run: pdm install

      - name: Install playwright
        run: pdm run playwright install --with-deps

      - name: Run tests
        run: PYTHONPATH=. pdm run pytest -v -ra api/backend/tests

42
.github/workflows/tests.yml vendored Normal file

@@ -0,0 +1,42 @@
name: Reusable PR Tests

on:
  workflow_call:
    secrets:
      openai_key:
        required: true
      discord_webhook_url:
        required: true

jobs:
  pytest:
    uses: ./.github/workflows/pytest.yml

  cypress-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Cypress Tests
        uses: ./.github/actions/run-cypress-tests
        with:
          openai_key: ${{ secrets.openai_key }}

  success-message:
    runs-on: ubuntu-latest
    needs:
      - pytest
      - cypress-tests
    steps:
      - name: Send Discord Message
        uses: jaypyles/discord-webhook-action@v1.0.0
        with:
          webhook-url: ${{ secrets.discord_webhook_url }}
          content: "Scraperr Successfully Passed Tests"
          username: "Scraperr CI"
          embed-title: "✅ Deployment Status"
          embed-description: "Scraperr successfully passed all tests."
          embed-color: 3066993
          embed-footer-text: "Scraperr CI"
          embed-timestamp: ${{ github.event.head_commit.timestamp }}


@@ -1,54 +0,0 @@
name: Unit Tests

on:
  push:
    branches:
      - master
  pull_request:
    types: [opened, synchronize, reopened]
  workflow_dispatch:

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set env
        run: echo "ENV=test" >> $GITHUB_ENV

      - name: Install pdm
        run: pip install pdm

      - name: Install project dependencies
        run: pdm install

      - name: Run tests
        run: PYTHONPATH=. pdm run pytest api/backend/tests

  cypress-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/run-cypress-tests

  success-message:
    runs-on: ubuntu-latest
    needs:
      - unit-tests
      - cypress-tests
    steps:
      - name: Send Discord Message
        uses: jaypyles/discord-webhook-action@v1.0.0
        with:
          webhook-url: ${{ secrets.DISCORD_WEBHOOK_URL }}
          content: "Scraperr Successfully Passed Tests"
          username: "Scraperr CI"
          embed-title: "✅ Deployment Status"
          embed-description: "Scraperr successfully passed all tests."
          embed-color: 3066993 # Green
          embed-footer-text: "Scraperr CI"
          embed-timestamp: ${{ github.event.head_commit.timestamp }}

89
.github/workflows/version.yml vendored Normal file

@@ -0,0 +1,89 @@
name: Version

on:
  workflow_call:
    secrets:
      git_token:
        required: true
    outputs:
      version:
        description: "The new version number"
        value: ${{ jobs.version.outputs.version }}
      version_bump:
        description: "Whether the version was bumped"
        value: ${{ jobs.version.outputs.version_bump }}

jobs:
  version:
    runs-on: ubuntu-latest
    outputs:
      version: ${{ steps.set_version.outputs.version }}
      version_bump: ${{ steps.check_version_bump.outputs.version_bump }}

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get version bump
        id: get_version_type
        run: |
          COMMIT_MSG=$(git log -1 --pretty=%B)
          if [[ $COMMIT_MSG =~ ^feat\(breaking\) ]]; then
            VERSION_TYPE="major"
          elif [[ $COMMIT_MSG =~ ^feat! ]]; then
            VERSION_TYPE="minor"
          elif [[ $COMMIT_MSG =~ ^(feat|fix|chore): ]]; then
            VERSION_TYPE="patch"
          else
            VERSION_TYPE="patch"
          fi
          echo "VERSION_TYPE=$VERSION_TYPE" >> $GITHUB_ENV

      - name: Check for version bump
        id: check_version_bump
        run: |
          COMMIT_MSG=$(git log -1 --pretty=%B)
          if [[ $COMMIT_MSG =~ .*\[no\ bump\].* ]]; then
            echo "version_bump=false" >> $GITHUB_OUTPUT
          else
            echo "version_bump=true" >> $GITHUB_OUTPUT
          fi

      - name: Skip version bump
        if: steps.check_version_bump.outputs.version_bump == 'false'
        run: |
          echo "Skipping version bump as requested"
          gh run cancel ${{ github.run_id }}
          exit 0
        env:
          GITHUB_TOKEN: ${{ secrets.git_token }}

      - name: Set version
        if: steps.check_version_bump.outputs.version_bump != 'false'
        id: set_version
        run: |
          VERSION=$(./scripts/version.sh "$VERSION_TYPE")
          echo "VERSION=$VERSION" >> $GITHUB_ENV
          echo "Version is $VERSION"
          echo "version=$VERSION" >> $GITHUB_OUTPUT
        env:
          VERSION_TYPE: ${{ env.VERSION_TYPE }}

      - name: Update chart file
        if: steps.check_version_bump.outputs.version_bump != 'false'
        run: |
          sed -i "s/^version: .*/version: $VERSION/" helm/Chart.yaml
          git config --local user.email "github-actions[bot]@users.noreply.github.com"
          git config --local user.name "github-actions[bot]"
          git add helm/Chart.yaml
          git commit -m "chore: bump version to $VERSION"
          git push
        env:
          VERSION: ${{ env.VERSION }}
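
Note on the bump rules in this workflow: the "Get version bump" and "Check for version bump" steps classify the latest commit message with three prefix checks and one substring check. The rules can be sketched in Python roughly as follows. This is an illustrative re-statement under the assumption that the bash regexes behave as written above; the function name `classify_commit` is hypothetical and does not exist in the repository.

```python
import re
from typing import Tuple


def classify_commit(message: str) -> Tuple[str, bool]:
    """Return (bump_type, should_bump) for a commit message.

    Hypothetical helper that mirrors the workflow's bash conditions.
    """
    # re.match anchors at the start of the string, like ^ in bash's [[ =~ ]].
    if re.match(r"feat\(breaking\)", message):
        bump_type = "major"
    elif re.match(r"feat!", message):
        bump_type = "minor"
    else:
        # The workflow's `(feat|fix|chore):` branch and its fallback branch
        # both select "patch", so they collapse into one case here.
        bump_type = "patch"

    # "[no bump]" anywhere in the message disables the version bump.
    should_bump = "[no bump]" not in message
    return bump_type, should_bump


print(classify_commit("feat: update workflows [no bump]"))
print(classify_commit("feat(breaking): drop legacy API"))
```

Under these rules any message without a `feat(breaking)` or `feat!` prefix yields a patch bump, which is consistent with the `chore:` and `fix:` commits in the log above each being followed by a patch-level version commit.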

16
.gitignore vendored

@@ -188,4 +188,18 @@ postgres_data
.vscode
ollama
data
media
media/images
media/videos
media/audio
media/pdfs
media/spreadsheets
media/presentations
media/documents
media/recordings
media/download_summary.txt
cypress/screenshots
cypress/videos
docker-compose.dev.local.yml

2
.prettierignore Normal file

@@ -0,0 +1,2 @@
*.yaml
*.yml


@@ -1,6 +1,6 @@
.DEFAULT_GOAL := help
COMPOSE_DEV = docker compose -f docker-compose.yml -f docker-compose.dev.yml
COMPOSE_DEV = docker compose -f docker-compose.yml -f docker-compose.dev.local.yml
COMPOSE_PROD = docker compose -f docker-compose.yml
.PHONY: help deps build pull up up-dev down setup deploy
@@ -17,6 +17,7 @@ help:
@echo " make down - Stop and remove containers, networks, images, and volumes"
@echo " make setup - Setup server with dependencies and clone repo"
@echo " make deploy - Deploy site onto server"
@echo " make cypress-start - Start Cypress"
@echo ""
logs:
@@ -51,3 +52,15 @@ setup:
deploy:
ansible-playbook -i ./ansible/inventory.yaml ./ansible/deploy_site.yaml -v
build-ci:
docker compose -f docker-compose.yml -f docker-compose.dev.yml build
up-ci:
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d --force-recreate
cypress-start:
DISPLAY=:0 npx cypress open
cypress-run:
npx cypress run

129
README.md

@@ -1,104 +1,71 @@
![logo_picture](https://github.com/jaypyles/www-scrape/blob/master/docs/logo_picture.png)
<div align="center">
<img src="https://img.shields.io/badge/MongoDB-%234ea94b.svg?style=for-the-badge&logo=mongodb&logoColor=white" alt="MongoDB" />
<img src="https://img.shields.io/badge/FastAPI-005571?style=for-the-badge&logo=fastapi" alt="FastAPI" />
<img src="https://img.shields.io/badge/Next-black?style=for-the-badge&logo=next.js&logoColor=white" alt="Next JS" />
<img src="https://img.shields.io/badge/tailwindcss-%2338B2AC.svg?style=for-the-badge&logo=tailwind-css&logoColor=white" alt="TailwindCSS" />
<img src="https://github.com/jaypyles/www-scrape/blob/master/docs/logo_picture.png" alt="Scraperr Logo" width="250px">
**A powerful self-hosted web scraping solution**
<div>
<img src="https://img.shields.io/badge/MongoDB-%234ea94b.svg?style=for-the-badge&logo=mongodb&logoColor=white" alt="MongoDB" />
<img src="https://img.shields.io/badge/FastAPI-005571?style=for-the-badge&logo=fastapi" alt="FastAPI" />
<img src="https://img.shields.io/badge/Next-black?style=for-the-badge&logo=next.js&logoColor=white" alt="Next JS" />
<img src="https://img.shields.io/badge/tailwindcss-%2338B2AC.svg?style=for-the-badge&logo=tailwind-css&logoColor=white" alt="TailwindCSS" />
</div>
</div>
# Summary
## 📋 Overview
Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.
Scrape websites without writing a single line of code.
From the table, users can download an excel sheet of the job's results, along with an option to rerun the job.
> 📚 **[Check out the docs](https://scraperr-docs.pages.dev)** for a comprehensive quickstart guide and detailed information.
View the [docs](https://scraperr-docs.pages.dev) for a quickstart guide and more information.
<div align="center">
<img src="https://github.com/jaypyles/www-scrape/blob/master/docs/main_page.png" alt="Scraperr Main Interface" width="800px">
</div>
## Features
## ✨ Key Features
### Submitting URLs for Scraping
- **XPath-Based Extraction**: Precisely target page elements
- **Queue Management**: Submit and manage multiple scraping jobs
- **Domain Spidering**: Option to scrape all pages within the same domain
- **Custom Headers**: Add JSON headers to your scraping requests
- **Media Downloads**: Automatically download images, videos, and other media
- **Results Visualization**: View scraped data in a structured table format
- **Data Export**: Export your results in markdown and csv formats
- **Notification Channels**: Send completion notifications through various channels
- Submit/Queue URLs for web scraping
- Add and manage elements to scrape using XPath
- Scrape all pages within same domain
- Add custom json headers to send in requests to URLs
- Display results of scraped data
- Download media found on the page (images, videos, etc.)
## 🚀 Getting Started
![main_page](https://github.com/jaypyles/www-scrape/blob/master/docs/main_page.png)
### Docker
### Managing Previous Jobs
- Download csv containing results
- Rerun jobs
- View status of queued jobs
- Favorite and view favorited jobs
![job_page](https://github.com/jaypyles/www-scrape/blob/master/docs/job_page.png)
### User Management
- User login/signup to organize jobs (optional)
![login](https://github.com/jaypyles/www-scrape/blob/master/docs/login.png)
### Log Viewing
- View app logs inside of web ui
![logs](https://github.com/jaypyles/www-scrape/blob/master/docs/log_page.png)
### Statistics View
- View a small statistics view of jobs ran
![statistics](https://github.com/jaypyles/www-scrape/blob/master/docs/stats_page.png)
### AI Integration
- Include the results of a selected job into the context of a conversation
- Currently supports:
1. Ollama
2. OpenAI
![chat](https://github.com/jaypyles/www-scrape/blob/master/docs/chat_page.png)
## API Endpoints
Use this service as an API for your own projects. Due to this using FastAPI, a docs page is available at `/docs` for the API.
![docs](https://github.com/jaypyles/www-scrape/blob/master/docs/docs_page.png)
## Troubleshooting
Q: When running Scraperr, I'm met with "404 Page not found".
A: This is probably an issue with MongoDB related to running Scraperr in a VM. You should see something like this in `make logs`:
```
WARNING: MongoDB 5.0+ requires a CPU with AVX support, and your current system does not appear to have that!
```bash
make up
```
To resolve this issue, simply set CPU host type to `host`. This can be done in Proxmox in the VM settings > Processor. [Related issue](https://github.com/jaypyles/Scraperr/issues/9).
### Helm
## Legal and Ethical Considerations
> Refer to the docs for helm deployment: https://scraperr-docs.pages.dev/guides/helm-deployment
When using Scraperr, please ensure that you:
## ⚖️ Legal and Ethical Guidelines
1. **Check Robots.txt**: Verify allowed pages by reviewing the `robots.txt` file of the target website.
2. **Compliance**: Always comply with the website's Terms of Service (ToS) regarding web scraping.
When using Scraperr, please remember to:
**Disclaimer**: This tool is intended for use only on websites that permit scraping. The author is not responsible for any misuse of this tool.
1. **Respect `robots.txt`**: Always check a website's `robots.txt` file to verify which pages permit scraping
2. **Terms of Service**: Adhere to each website's Terms of Service regarding data extraction
3. **Rate Limiting**: Implement reasonable delays between requests to avoid overloading servers
## License
> **Disclaimer**: Scraperr is intended for use only on websites that explicitly permit scraping. The creator accepts no responsibility for misuse of this tool.
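
The `robots.txt` guideline above can be checked programmatically before a job is submitted. The snippet below is a minimal sketch using Python's standard `urllib.robotparser`; it is not part of Scraperr. The rules in `robots_lines`, the `example-bot` user agent, and the `may_scrape` helper are all assumptions chosen for illustration. A real check would load the target site's own `robots.txt` instead of the inline sample.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body (assumed; a real one is served by the site).
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse in-memory rules, no network access needed


def may_scrape(url: str, user_agent: str = "example-bot") -> bool:
    """Return True if the parsed rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(may_scrape("https://example.com/public/page"))
print(may_scrape("https://example.com/private/data"))
```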
## 💬 Join the Community
Get support, report bugs, and chat with other users and contributors.
👉 [Join the Scraperr Discord](https://discord.gg/89q7scsGEK)
## 📄 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
### Contributions
## 👏 Contributions
Development made easy by developing from [webapp template](https://github.com/jaypyles/webapp-template). View documentation for extra information.
Development made easier with the [webapp template](https://github.com/jaypyles/webapp-template).
Start development server:
`make deps build up-dev`
To get started, simply run `make build up-dev`.

147
alembic.ini Normal file

@@ -0,0 +1,147 @@
# A generic, single database configuration.
[alembic]
# path to migration scripts.
# this is typically a path given in POSIX (e.g. forward slashes)
# format, relative to the token %(here)s which refers to the location of this
# ini file
script_location = %(here)s/alembic
# template used to generate migration file names; The default value is %%(rev)s_%%(slug)s
# Uncomment the line below if you want the files to be prepended with date and time
# see https://alembic.sqlalchemy.org/en/latest/tutorial.html#editing-the-ini-file
# for all available tokens
# file_template = %%(year)d_%%(month).2d_%%(day).2d_%%(hour).2d%%(minute).2d-%%(rev)s_%%(slug)s
# sys.path path, will be prepended to sys.path if present.
# defaults to the current working directory. for multiple paths, the path separator
# is defined by "path_separator" below.
prepend_sys_path = .
# timezone to use when rendering the date within the migration file
# as well as the filename.
# If specified, requires the python>=3.9 or backports.zoneinfo library and tzdata library.
# Any required deps can be installed by adding `alembic[tz]` to the pip requirements
# string value is passed to ZoneInfo()
# leave blank for localtime
# timezone =
# max length of characters to apply to the "slug" field
# truncate_slug_length = 40
# set to 'true' to run the environment during
# the 'revision' command, regardless of autogenerate
# revision_environment = false
# set to 'true' to allow .pyc and .pyo files without
# a source .py file to be detected as revisions in the
# versions/ directory
# sourceless = false
# version location specification; This defaults
# to <script_location>/versions. When using multiple version
# directories, initial revisions must be specified with --version-path.
# The path separator used here should be the separator specified by "path_separator"
# below.
# version_locations = %(here)s/bar:%(here)s/bat:%(here)s/alembic/versions
# path_separator; This indicates what character is used to split lists of file
# paths, including version_locations and prepend_sys_path within configparser
# files such as alembic.ini.
# The default rendered in new alembic.ini files is "os", which uses os.pathsep
# to provide os-dependent path splitting.
#
# Note that in order to support legacy alembic.ini files, this default does NOT
# take place if path_separator is not present in alembic.ini. If this
# option is omitted entirely, fallback logic is as follows:
#
# 1. Parsing of the version_locations option falls back to using the legacy
# "version_path_separator" key, which if absent then falls back to the legacy
# behavior of splitting on spaces and/or commas.
# 2. Parsing of the prepend_sys_path option falls back to the legacy
# behavior of splitting on spaces, commas, or colons.
#
# Valid values for path_separator are:
#
# path_separator = :
# path_separator = ;
# path_separator = space
# path_separator = newline
#
# Use os.pathsep. Default configuration used for new projects.
path_separator = os
# set to 'true' to search source files recursively
# in each "version_locations" directory
# new in Alembic version 1.10
# recursive_version_locations = false
# the output encoding used when revision files
# are written from script.py.mako
# output_encoding = utf-8
# database URL. This is consumed by the user-maintained env.py script only.
# other means of configuring database URLs may be customized within the env.py
# file.
sqlalchemy.url = driver://user:pass@localhost/dbname
[post_write_hooks]
# post_write_hooks defines scripts or Python functions that are run
# on newly generated revision scripts. See the documentation for further
# detail and examples
# format using "black" - use the console_scripts runner, against the "black" entrypoint
# hooks = black
# black.type = console_scripts
# black.entrypoint = black
# black.options = -l 79 REVISION_SCRIPT_FILENAME
# lint with attempts to fix using "ruff" - use the module runner, against the "ruff" module
# hooks = ruff
# ruff.type = module
# ruff.module = ruff
# ruff.options = check --fix REVISION_SCRIPT_FILENAME
# Alternatively, use the exec runner to execute a binary found on your PATH
# hooks = ruff
# ruff.type = exec
# ruff.executable = ruff
# ruff.options = check --fix REVISION_SCRIPT_FILENAME
# Logging configuration. This is also consumed by the user-maintained
# env.py script only.
[loggers]
keys = root,sqlalchemy,alembic
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARNING
handlers = console
qualname =
[logger_sqlalchemy]
level = WARNING
handlers =
qualname = sqlalchemy.engine
[logger_alembic]
level = INFO
handlers =
qualname = alembic
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
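
The `path_separator` comments in `alembic.ini` list the accepted values (`:`, `;`, `space`, `newline`, `os`). The sketch below only illustrates how such a value could map onto a split of a path list; `split_paths` and `_SEPARATORS` are hypothetical names, and this is not Alembic's own implementation.

```python
import os

# Hypothetical mapping of the documented path_separator values to the
# character each one stands for. Not taken from Alembic's source.
_SEPARATORS = {
    "os": os.pathsep,
    ":": ":",
    ";": ";",
    "space": " ",
    "newline": "\n",
}


def split_paths(value: str, path_separator: str = "os") -> list[str]:
    """Split a configured path list using one of the documented separator names."""
    try:
        sep = _SEPARATORS[path_separator]
    except KeyError:
        raise ValueError(f"unsupported path_separator: {path_separator!r}") from None
    # Drop empty fragments and surrounding whitespace left over from the ini value.
    return [part.strip() for part in value.split(sep) if part.strip()]
```

With `path_separator = os`, as configured above, the same ini value would split on `:` on POSIX systems and on `;` on Windows.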

1
alembic/README Normal file

@@ -0,0 +1 @@
Generic single-database configuration.

103
alembic/env.py Normal file

@@ -0,0 +1,103 @@
# STL
import os
import sys
from logging.config import fileConfig
# PDM
from dotenv import load_dotenv
from sqlalchemy import pool, engine_from_config
# LOCAL
from alembic import context
from api.backend.database.base import Base
from api.backend.database.models import Job, User, CronJob # type: ignore
load_dotenv()
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "api")))
# Load the raw async database URL
raw_database_url = os.getenv("DATABASE_URL", "sqlite+aiosqlite:///data/database.db")
# Map async dialects to sync ones
driver_downgrade_map = {
"sqlite+aiosqlite": "sqlite",
"postgresql+asyncpg": "postgresql",
"mysql+aiomysql": "mysql",
}
# Extract scheme and convert if async
for async_driver, sync_driver in driver_downgrade_map.items():
if raw_database_url.startswith(async_driver + "://"):
sync_database_url = raw_database_url.replace(async_driver, sync_driver, 1)
break
else:
# No async driver detected — assume it's already sync
sync_database_url = raw_database_url
# Apply it to Alembic config
config = context.config
config.set_main_option("sqlalchemy.url", sync_database_url)
# Interpret the config file for Python logging.
# This line sets up loggers basically.
if config.config_file_name is not None:
fileConfig(config.config_file_name)
# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
target_metadata = Base.metadata
def run_migrations_offline() -> None:
"""Run migrations in 'offline' mode.
This configures the context with just a URL
and not an Engine, though an Engine is acceptable
here as well. By skipping the Engine creation
we don't even need a DBAPI to be available.
Calls to context.execute() here emit the given string to the
script output.
"""
url = config.get_main_option("sqlalchemy.url")
context.configure(
url=url,
target_metadata=target_metadata,
literal_binds=True,
dialect_opts={"paramstyle": "named"},
)
with context.begin_transaction():
context.run_migrations()
def run_migrations_online() -> None:
"""Run migrations in 'online' mode.
In this scenario we need to create an Engine
and associate a connection with the context.
"""
connectable = engine_from_config(
config.get_section(config.config_ini_section, {}),
prefix="sqlalchemy.",
poolclass=pool.NullPool,
)
with connectable.connect() as connection:
context.configure(connection=connection, target_metadata=target_metadata)
with context.begin_transaction():
context.run_migrations()
if context.is_offline_mode():
run_migrations_offline()
else:
run_migrations_online()
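
The async-to-sync driver mapping in `env.py` can be pulled into a small function, which makes it easy to check in isolation. This is a minimal sketch under that assumption; `to_sync_url` and `DRIVER_DOWNGRADE_MAP` are illustrative names, not part of the commit.

```python
# Same mapping as env.py: async SQLAlchemy drivers and their sync equivalents.
DRIVER_DOWNGRADE_MAP = {
    "sqlite+aiosqlite": "sqlite",
    "postgresql+asyncpg": "postgresql",
    "mysql+aiomysql": "mysql",
}


def to_sync_url(url: str) -> str:
    """Return the URL with an async driver prefix swapped for its sync form."""
    for async_driver, sync_driver in DRIVER_DOWNGRADE_MAP.items():
        prefix = async_driver + "://"
        if url.startswith(prefix):
            # Only the scheme is rewritten; the rest of the URL is kept verbatim.
            return sync_driver + "://" + url[len(prefix):]
    # No async driver detected: assume the URL is already sync.
    return url
```

Alembic runs migrations through a synchronous engine here, which is why the async scheme is rewritten before the URL is handed to `config.set_main_option`.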

28
alembic/script.py.mako Normal file

@@ -0,0 +1,28 @@
"""${message}
Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}
"""
from typing import Sequence, Union
from alembic import op
import sqlalchemy as sa
${imports if imports else ""}
# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, Sequence[str], None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}
def upgrade() -> None:
"""Upgrade schema."""
${upgrades if upgrades else "pass"}
def downgrade() -> None:
"""Downgrade schema."""
${downgrades if downgrades else "pass"}


@@ -0,0 +1,67 @@
"""initial revision
Revision ID: 6aa921d2e637
Revises:
Create Date: 2025-07-12 20:17:44.448034
"""
from typing import Sequence, Union
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision: str = '6aa921d2e637'
down_revision: Union[str, Sequence[str], None] = None
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
"""Upgrade schema."""
# ### commands auto generated by Alembic - please adjust! ###
op.create_table('users',
sa.Column('email', sa.String(length=255), nullable=False),
sa.Column('hashed_password', sa.String(length=255), nullable=False),
sa.Column('full_name', sa.String(length=255), nullable=True),
sa.Column('disabled', sa.Boolean(), nullable=True),
sa.PrimaryKeyConstraint('email')
)
op.create_table('jobs',
sa.Column('id', sa.String(length=64), nullable=False),
sa.Column('url', sa.String(length=2048), nullable=False),
sa.Column('elements', sa.JSON(), nullable=False),
sa.Column('user', sa.String(length=255), nullable=True),
sa.Column('time_created', sa.DateTime(timezone=True), server_default=sa.text('(CURRENT_TIMESTAMP)'), nullable=False),
sa.Column('result', sa.JSON(), nullable=False),
sa.Column('status', sa.String(length=50), nullable=False),
sa.Column('chat', sa.JSON(), nullable=True),
sa.Column('job_options', sa.JSON(), nullable=True),
sa.Column('agent_mode', sa.Boolean(), nullable=False),
sa.Column('prompt', sa.String(length=1024), nullable=True),
sa.Column('favorite', sa.Boolean(), nullable=False),
sa.ForeignKeyConstraint(['user'], ['users.email'], ),
sa.PrimaryKeyConstraint('id')
)
op.create_table('cron_jobs',
sa.Column('id', sa.String(length=64), nullable=False),
sa.Column('user_email', sa.String(length=255), nullable=False),
sa.Column('job_id', sa.String(length=64), nullable=False),
sa.Column('cron_expression', sa.String(length=255), nullable=False),
sa.Column('time_created', sa.DateTime(timezone=True), server_default=sa.text('(CURRENT_TIMESTAMP)'), nullable=False),
sa.Column('time_updated', sa.DateTime(timezone=True), server_default=sa.text('(CURRENT_TIMESTAMP)'), nullable=False),
sa.ForeignKeyConstraint(['job_id'], ['jobs.id'], ),
sa.ForeignKeyConstraint(['user_email'], ['users.email'], ),
sa.PrimaryKeyConstraint('id')
)
# ### end Alembic commands ###
def downgrade() -> None:
"""Downgrade schema."""
# ### commands auto generated by Alembic - please adjust! ###
op.drop_table('cron_jobs')
op.drop_table('jobs')
op.drop_table('users')
# ### end Alembic commands ###


@@ -0,0 +1,6 @@
from typing_extensions import TypedDict
class Action(TypedDict):
type: str
url: str


@@ -0,0 +1,96 @@
# STL
import random
from typing import Any
# PDM
from camoufox import AsyncCamoufox
from playwright.async_api import Page
# LOCAL
from api.backend.constants import RECORDINGS_ENABLED
from api.backend.ai.clients import ask_ollama, ask_open_ai, open_ai_key
from api.backend.job.models import CapturedElement
from api.backend.worker.logger import LOG
from api.backend.ai.agent.utils import (
parse_response,
capture_elements,
convert_to_markdown,
)
from api.backend.ai.agent.prompts import (
EXTRACT_ELEMENTS_PROMPT,
ELEMENT_EXTRACTION_PROMPT,
)
from api.backend.job.scraping.add_custom import add_custom_items
from api.backend.job.scraping.collect_media import collect_media
ask_ai = ask_open_ai if open_ai_key else ask_ollama
async def scrape_with_agent(agent_job: dict[str, Any]):
LOG.info(f"Starting work for agent job: {agent_job}")
pages = set()
proxy = None
if agent_job["job_options"]["proxies"]:
proxy = random.choice(agent_job["job_options"]["proxies"])
LOG.info(f"Using proxy: {proxy}")
async with AsyncCamoufox(headless=not RECORDINGS_ENABLED, proxy=proxy) as browser:
page: Page = await browser.new_page()
await add_custom_items(
agent_job["url"],
page,
agent_job["job_options"]["custom_cookies"],
agent_job["job_options"]["custom_headers"],
)
try:
await page.set_viewport_size({"width": 1920, "height": 1080})
await page.goto(agent_job["url"], timeout=60000)
if agent_job["job_options"]["collect_media"]:
await collect_media(agent_job["id"], page)
html_content = await page.content()
markdown_content = convert_to_markdown(html_content)
response = await ask_ai(
ELEMENT_EXTRACTION_PROMPT.format(
extraction_prompt=EXTRACT_ELEMENTS_PROMPT,
webpage=markdown_content,
prompt=agent_job["prompt"],
)
)
xpaths = parse_response(response)
captured_elements = await capture_elements(
page, xpaths, agent_job["job_options"].get("return_html", False)
)
final_url = page.url
pages.add((html_content, final_url))
finally:
await page.close()
await browser.close()
name_to_elements = {}
for page in pages:
for element in captured_elements:
if element.name not in name_to_elements:
name_to_elements[element.name] = []
name_to_elements[element.name].append(element)
scraped_elements: list[dict[str, dict[str, list[CapturedElement]]]] = [
{
page[1]: name_to_elements,
}
for page in pages
]
return scraped_elements
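
The grouping step at the end of `scrape_with_agent` collects captured elements under their name. A compact way to express the same idea is sketched below; the `CapturedElement` dataclass is a stand-in for `api.backend.job.models.CapturedElement`, and `group_by_name` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CapturedElement:
    """Stand-in for the project's CapturedElement model (assumed fields)."""

    name: str
    text: str
    xpath: str


def group_by_name(elements: list[CapturedElement]) -> dict[str, list[CapturedElement]]:
    """Group captured elements by name, preserving their original order."""
    grouped: defaultdict[str, list[CapturedElement]] = defaultdict(list)
    for element in elements:
        grouped[element.name].append(element)
    return dict(grouped)
```

Grouping once, outside the loop over pages, also avoids appending the same elements again for every page in the set.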


@@ -0,0 +1,58 @@
EXTRACT_ELEMENTS_PROMPT = """
You are an assistant that extracts XPath expressions from webpages.
You will receive HTML content in markdown format.
Each element in the markdown has its XPath shown above it in a comment like:
<!-- //div -->
Respond only with a list of general XPath expressions inside `<xpaths>...</xpaths>` tags.
You will also decide what to do next. If no decision is available, return nothing for that section.
"""
ELEMENT_EXTRACTION_PROMPT = """
{extraction_prompt}
**Guidelines:**
- Prefer shorter, more general XPaths like `//div[...]` or `//span[...]`.
- Avoid overly specific or deep paths like `//div[3]/ul/li[2]/a`.
- Do **not** chain multiple elements deeply (e.g., `//div/span/a`).
- Use XPaths further down the tree when possible.
- Do not include any extra explanation or text.
- One XPath is acceptable if that's all that's needed.
- Try and limit it down to 1 - 3 xpaths.
- Include a name for each xpath.
<important>
- USE THE MOST SIMPLE XPATHS POSSIBLE.
- USE THE MOST GENERAL XPATHS POSSIBLE.
- USE THE MOST SPECIFIC XPATHS POSSIBLE.
- USE THE MOST GENERAL XPATHS POSSIBLE.
</important>
**Example Format:**
```xml
<xpaths>
- <name: insert_name_here>: <xpath: //div>
- <name: insert_name_here>: <xpath: //span>
- <name: insert_name_here>: <xpath: //span[contains(text(), 'example')]>
- <name: insert_name_here>: <xpath: //div[contains(text(), 'example')]>
- <name: insert_name_here>: <xpath: //a[@href]>
- etc
</xpaths>
<decision>
<next_page>
- //a[@href='next_page_url']
</next_page>
</decision>
```
**Input webpage:**
{webpage}
**Target content:**
{prompt}
"""


@@ -0,0 +1,272 @@
# STL
import re
# PDM
from lxml import html, etree
from playwright.async_api import Page
# LOCAL
from api.backend.job.models import CapturedElement
from api.backend.job.utils.text_utils import clean_text
def convert_to_markdown(html_str: str):
parser = html.HTMLParser()
tree = html.fromstring(html_str, parser=parser)
root = tree.getroottree()
def format_attributes(el: etree._Element) -> str:
"""Convert element attributes into a string."""
return " ".join(f'{k}="{v}"' for k, v in el.attrib.items())
def is_visible(el: etree._Element) -> bool:
style = el.attrib.get("style", "").lower()
class_ = el.attrib.get("class", "").lower()
# Check for visibility styles
if "display: none" in style or "visibility: hidden" in style:
return False
if "opacity: 0" in style or "opacity:0" in style:
return False
if "height: 0" in style or "width: 0" in style:
return False
# Check for common hidden classes
if any(
hidden in class_
for hidden in ["hidden", "invisible", "truncate", "collapse"]
):
return False
# Check for hidden attributes
if el.attrib.get("hidden") is not None:
return False
if el.attrib.get("aria-hidden") == "true":
return False
# Check for empty or whitespace-only content
if not el.text and len(el) == 0:
return False
return True
def is_layout_or_decorative(el: etree._Element) -> bool:
tag = el.tag.lower()
# Layout elements
if tag in {"nav", "footer", "header", "aside", "main", "section"}:
return True
# Decorative elements
if tag in {"svg", "path", "circle", "rect", "line", "polygon", "polyline"}:
return True
# Check id and class for layout/decorative keywords
id_class = " ".join(
[el.attrib.get("id", ""), el.attrib.get("class", "")]
).lower()
layout_keywords = {
"sidebar",
"nav",
"header",
"footer",
"menu",
"advert",
"ads",
"breadcrumb",
"container",
"wrapper",
"layout",
"grid",
"flex",
"row",
"column",
"section",
"banner",
"hero",
"card",
"modal",
"popup",
"tooltip",
"dropdown",
"overlay",
}
return any(keyword in id_class for keyword in layout_keywords)
# Tags to include in the final markdown output
included_tags = {
"div",
"span",
"a",
"p",
"h1",
"h2",
"h3",
"h4",
"h5",
"h6",
"img",
"button",
"input",
"textarea",
"ul",
"ol",
"li",
"table",
"tr",
"td",
"th",
"input",
"textarea",
"select",
"option",
"optgroup",
"fieldset",
"legend",
}
special_elements = []
normal_elements = []
for el in tree.iter():
if el.tag is etree.Comment:
continue
tag = el.tag.lower()
if tag not in included_tags:
continue
if not is_visible(el):
continue
if is_layout_or_decorative(el):
continue
path = root.getpath(el)
attrs = format_attributes(el)
attrs_str = f" {attrs}" if attrs else ""
text = el.text.strip() if el.text else ""
if not text and not attrs:
continue
# interactive elements (buttons, links, inputs) are collected separately
if tag == "button":
prefix = "🔘 **<button>**"
special_elements.append(f"<!-- {path} -->\n{prefix} {text}")
elif tag == "a":
href = el.attrib.get("href", "")
prefix = f"🔗 **<a href='{href}'>**"
special_elements.append(f"<!-- {path} -->\n{prefix} {text}")
elif tag == "input":
input_type = el.attrib.get("type", "text")
prefix = f"📝 **<input type='{input_type}'>**"
special_elements.append(f"<!-- {path} -->\n{prefix}")
else:
prefix = f"**<{tag}{attrs_str}>**"
if text:
normal_elements.append(f"<!-- {path} -->\n{prefix} {text}")
return "\n\n".join(normal_elements + special_elements) # type: ignore
def parse_response(text: str) -> list[dict[str, str]]:
xpaths = re.findall(r"<xpaths>(.*?)</xpaths>", text, re.DOTALL)
results = []
if xpaths:
lines = xpaths[0].strip().splitlines()
for line in lines:
if line.strip().startswith("-"):
name = re.findall(r"<name: (.*?)>", line)[0]
xpath = re.findall(r"<xpath: (.*?)>", line)[0]
results.append({"name": name, "xpath": xpath})
else:
results.append({"name": "", "xpath": line.strip()})
return results
def parse_next_page(text: str) -> str | None:
next_page = re.findall(r"<next_page>(.*?)</next_page>", text, re.DOTALL)
if next_page:
lines = next_page[0].strip().splitlines()
next_page = [
line.strip().lstrip("-").strip()
for line in lines
if line.strip().startswith("-")
]
return next_page[0] if next_page else None
async def capture_elements(
page: Page, xpaths: list[dict[str, str]], return_html: bool
) -> list[CapturedElement]:
captured_elements = []
seen_texts = set()
for xpath in xpaths:
try:
locator = page.locator(f"xpath={xpath['xpath']}")
count = await locator.count()
for i in range(count):
if return_html:
element_text = (
await page.locator(f"xpath={xpath['xpath']}")
.nth(i)
.inner_html()
)
seen_texts.add(element_text)
captured_elements.append(
CapturedElement(
name=xpath["name"],
text=element_text,
xpath=xpath["xpath"],
)
)
continue
element_text = ""
element_handle = await locator.nth(i).element_handle()
if not element_handle:
continue
link = await element_handle.get_attribute("href") or ""
text = await element_handle.text_content()
if text:
element_text += text
if link:
element_text += f" ({link})"
cleaned = clean_text(element_text)
if cleaned in seen_texts:
continue
seen_texts.add(cleaned)
captured_elements.append(
CapturedElement(
name=xpath["name"],
text=cleaned,
xpath=xpath["xpath"],
)
)
except Exception as e:
print(f"Error processing xpath {xpath}: {e}")
return captured_elements
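
`parse_response` indexes `re.findall(...)[0]` directly, so a bulleted line without both tags (for example the `- etc` line from the prompt's example format) would raise `IndexError`. The sketch below shows one way to parse the same `<xpaths>` format more defensively; `parse_xpaths` is a hypothetical name and skipping malformed bullets is an assumption about the desired behavior.

```python
import re


def parse_xpaths(text: str) -> list[dict[str, str]]:
    """Extract name/xpath pairs from an <xpaths>...</xpaths> block."""
    block = re.search(r"<xpaths>(.*?)</xpaths>", text, re.DOTALL)
    if block is None:
        return []

    results: list[dict[str, str]] = []
    for raw_line in block.group(1).strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("-"):
            name = re.search(r"<name: (.*?)>", line)
            xpath = re.search(r"<xpath: (.*?)>", line)
            # Skip bullets that lack either tag instead of raising IndexError.
            if name and xpath:
                results.append({"name": name.group(1), "xpath": xpath.group(1)})
        else:
            # Mirror the original fallback: treat a bare line as an unnamed XPath.
            results.append({"name": "", "xpath": line})
    return results
```

A response with no `<xpaths>` block yields an empty list, which `capture_elements` then handles as "nothing to capture".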


@@ -1,32 +1,28 @@
# STL
import os
import logging
from collections.abc import Iterable, AsyncGenerator
# PDM
from openai import OpenAI
from ollama import Message
from fastapi import APIRouter
from fastapi.responses import JSONResponse, StreamingResponse
from openai.types.chat import ChatCompletionMessageParam
# LOCAL
from ollama import Message, AsyncClient
from api.backend.models import AI
from api.backend.ai.clients import (
llama_model,
open_ai_key,
llama_client,
open_ai_model,
openai_client,
)
from api.backend.ai.schemas import AI
from api.backend.routers.handle_exceptions import handle_exceptions
LOG = logging.getLogger(__name__)
LOG = logging.getLogger("AI")
ai_router = APIRouter()
# Load environment variables
open_ai_key = os.getenv("OPENAI_KEY")
open_ai_model = os.getenv("OPENAI_MODEL")
llama_url = os.getenv("OLLAMA_URL")
llama_model = os.getenv("OLLAMA_MODEL")
# Initialize clients
openai_client = OpenAI(api_key=open_ai_key) if open_ai_key else None
llama_client = AsyncClient(host=llama_url) if llama_url else None
async def llama_chat(chat_messages: list[Message]) -> AsyncGenerator[str, None]:
if llama_client and llama_model:
@@ -43,6 +39,14 @@ async def llama_chat(chat_messages: list[Message]) -> AsyncGenerator[str, None]:
async def openai_chat(
chat_messages: Iterable[ChatCompletionMessageParam],
) -> AsyncGenerator[str, None]:
if openai_client and not open_ai_model:
LOG.error("OpenAI model is not set")
yield "An error occurred while processing your request."
if not openai_client:
LOG.error("OpenAI client is not set")
yield "An error occurred while processing your request."
if openai_client and open_ai_model:
try:
response = openai_client.chat.completions.create(
@@ -59,6 +63,7 @@ chat_function = llama_chat if llama_client else openai_chat
@ai_router.post("/ai")
@handle_exceptions(logger=LOG)
async def ai(c: AI):
return StreamingResponse(
chat_function(chat_messages=c.messages), media_type="text/plain"
@@ -66,5 +71,6 @@ async def ai(c: AI):
@ai_router.get("/ai/check")
@handle_exceptions(logger=LOG)
async def check():
return JSONResponse(content={"ai_enabled": bool(open_ai_key or llama_model)})

39
api/backend/ai/clients.py Normal file

@@ -0,0 +1,39 @@
# STL
import os
# PDM
from ollama import AsyncClient
from openai import OpenAI
# Load environment variables
open_ai_key = os.getenv("OPENAI_KEY")
open_ai_model = os.getenv("OPENAI_MODEL")
llama_url = os.getenv("OLLAMA_URL")
llama_model = os.getenv("OLLAMA_MODEL")
# Initialize clients
openai_client = OpenAI(api_key=open_ai_key) if open_ai_key else None
llama_client = AsyncClient(host=llama_url) if llama_url else None
async def ask_open_ai(prompt: str) -> str:
if not openai_client:
raise ValueError("OpenAI client not initialized")
response = openai_client.chat.completions.create(
model=open_ai_model or "gpt-4.1-mini",
messages=[{"role": "user", "content": prompt}],
)
return response.choices[0].message.content or ""
async def ask_ollama(prompt: str) -> str:
if not llama_client:
raise ValueError("Ollama client not initialized")
response = await llama_client.chat(
model=llama_model or "", messages=[{"role": "user", "content": prompt}]
)
return response.message.content or ""


@@ -0,0 +1,4 @@
# LOCAL
from .ai import AI
__all__ = ["AI"]


@@ -0,0 +1,9 @@
# STL
from typing import Any
# PDM
import pydantic
class AI(pydantic.BaseModel):
messages: list[Any]


@@ -1,40 +1,57 @@
# STL
import os
import logging
import apscheduler # type: ignore
from contextlib import asynccontextmanager
# PDM
import apscheduler.schedulers
import apscheduler.schedulers.background
from fastapi import FastAPI, Request, status
from fastapi.responses import JSONResponse
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
# LOCAL
from api.backend.ai.ai_router import ai_router
from api.backend.auth.auth_router import auth_router
from api.backend.utils import get_log_level
from api.backend.routers.job_router import job_router
from api.backend.routers.log_router import log_router
from api.backend.routers.stats_router import stats_router
from api.backend.database.startup import init_database
from fastapi.responses import JSONResponse
from api.backend.job.cron_scheduling.cron_scheduling import start_cron_scheduler
from api.backend.scheduler import scheduler
from api.backend.ai.ai_router import ai_router
from api.backend.job.job_router import job_router
from api.backend.auth.auth_router import auth_router
from api.backend.stats.stats_router import stats_router
from api.backend.job.cron_scheduling.cron_scheduling import start_cron_scheduler
log_level = os.getenv("LOG_LEVEL")
LOG_LEVEL = get_log_level(log_level)
logging.basicConfig(
level=LOG_LEVEL,
format="%(levelname)s: %(asctime)s - %(name)s - %(message)s",
format="%(levelname)s: %(asctime)s - [%(name)s] - %(message)s",
handlers=[logging.StreamHandler()],
)
LOG = logging.getLogger(__name__)
app = FastAPI(title="api", root_path="/api")
@asynccontextmanager
async def lifespan(_: FastAPI):
# Startup
LOG.info("Starting application...")
LOG.info("Starting cron scheduler...")
await start_cron_scheduler(scheduler)
scheduler.start()
LOG.info("Cron scheduler started successfully")
yield
# Shutdown
LOG.info("Shutting down application...")
LOG.info("Stopping cron scheduler...")
scheduler.shutdown(wait=False) # Set wait=False to not block shutdown
LOG.info("Cron scheduler stopped")
LOG.info("Application shutdown complete")
app = FastAPI(title="api", root_path="/api", lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
@@ -44,29 +61,12 @@ app.add_middleware(
allow_headers=["*"],
)
app.include_router(auth_router)
app.include_router(ai_router)
app.include_router(job_router)
app.include_router(log_router)
app.include_router(stats_router)
@app.on_event("startup")
async def startup_event():
start_cron_scheduler(scheduler)
scheduler.start()
if os.getenv("ENV") != "test":
init_database()
LOG.info("Starting up...")
@app.on_event("shutdown")
def shutdown_scheduler():
scheduler.shutdown(wait=False) # Set wait=False to not block shutdown
@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
exc_str = f"{exc}".replace("\n", " ").replace(" ", " ")
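
This diff replaces the `@app.on_event("startup")` and `@app.on_event("shutdown")` hooks with a single `lifespan` context manager. The standard-library sketch below shows the ordering that pattern gives; `FakeScheduler` and `main` are illustrative stand-ins, and no FastAPI or APScheduler code is involved.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeScheduler:
    """Stand-in for the real scheduler; it only records the calls it receives."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def start(self) -> None:
        self.events.append("start")

    def shutdown(self, wait: bool = True) -> None:
        self.events.append(f"shutdown(wait={wait})")


scheduler = FakeScheduler()


@asynccontextmanager
async def lifespan(_app):
    # Startup: everything before the yield runs once, before requests are served.
    scheduler.start()
    try:
        yield
    finally:
        # Shutdown: runs once when the application stops, even after an error.
        scheduler.shutdown(wait=False)


async def main() -> list[str]:
    async with lifespan(None):
        scheduler.events.append("serving")
    return scheduler.events
```

Wrapping the `yield` in `try`/`finally` is a design choice of this sketch, not something the diff does; it ensures the scheduler is stopped even if the body raises.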


@@ -1,13 +1,16 @@
# STL
from datetime import timedelta
import os
import logging
from datetime import timedelta
# PDM
from fastapi import Depends, APIRouter, HTTPException, status
from fastapi.security import OAuth2PasswordRequestForm
from sqlalchemy.ext.asyncio import AsyncSession
# LOCAL
from api.backend.schemas import User, Token, UserCreate
from api.backend.auth.schemas import User, Token, UserCreate
from api.backend.database.base import AsyncSessionLocal, get_db
from api.backend.auth.auth_utils import (
ACCESS_TOKEN_EXPIRE_MINUTES,
get_current_user,
@@ -15,18 +18,19 @@ from api.backend.auth.auth_utils import (
get_password_hash,
create_access_token,
)
import logging
from api.backend.database.common import update
from api.backend.database.models import User as DatabaseUser
from api.backend.routers.handle_exceptions import handle_exceptions
auth_router = APIRouter()
LOG = logging.getLogger("auth_router")
LOG = logging.getLogger("Auth")
@auth_router.post("/auth/token", response_model=Token)
async def login_for_access_token(form_data: OAuth2PasswordRequestForm = Depends()):
user = await authenticate_user(form_data.username, form_data.password)
@handle_exceptions(logger=LOG)
async def login_for_access_token(form_data: OAuth2PasswordRequestForm = Depends(), db: AsyncSession = Depends(get_db)):
user = await authenticate_user(form_data.username, form_data.password, db)
if not user:
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
@@ -47,23 +51,37 @@ async def login_for_access_token(form_data: OAuth2PasswordRequestForm = Depends(
@auth_router.post("/auth/signup", response_model=User)
@handle_exceptions(logger=LOG)
async def create_user(user: UserCreate):
hashed_password = get_password_hash(user.password)
user_dict = user.model_dump()
user_dict["hashed_password"] = hashed_password
del user_dict["password"]
query = "INSERT INTO users (email, hashed_password, full_name) VALUES (?, ?, ?)"
_ = update(query, (user_dict["email"], hashed_password, user_dict["full_name"]))
async with AsyncSessionLocal() as session:
new_user = DatabaseUser(
email=user.email,
hashed_password=user_dict["hashed_password"],
full_name=user.full_name,
)
session.add(new_user)
await session.commit()
return user_dict
@auth_router.get("/auth/users/me", response_model=User)
@handle_exceptions(logger=LOG)
async def read_users_me(current_user: User = Depends(get_current_user)):
return current_user
@auth_router.get("/auth/check")
@handle_exceptions(logger=LOG)
async def check_auth():
return {"registration": os.environ.get("REGISTRATION_ENABLED", "True") == "True"}
return {
"registration": os.environ.get("REGISTRATION_ENABLED", "True") == "True",
"recordings_enabled": os.environ.get("RECORDINGS_ENABLED", "true").lower()
== "true",
}


@@ -1,28 +1,30 @@
# STL
import os
import logging
from typing import Any, Optional
from datetime import datetime, timedelta
import logging
# PDM
from jose import JWTError, jwt
from dotenv import load_dotenv
from fastapi import Depends, HTTPException, status
from sqlalchemy import select
from passlib.context import CryptContext
from fastapi.security import OAuth2PasswordBearer
from sqlalchemy.ext.asyncio import AsyncSession
# LOCAL
from api.backend.schemas import User, UserInDB, TokenData
from api.backend.auth.schemas import User, UserInDB, TokenData
from api.backend.database.base import get_db
from api.backend.database.models import User as UserModel
from api.backend.database.common import query
LOG = logging.getLogger(__name__)
LOG = logging.getLogger("Auth")
_ = load_dotenv()
SECRET_KEY = os.getenv("SECRET_KEY") or ""
ALGORITHM = os.getenv("ALGORITHM") or ""
ACCESS_TOKEN_EXPIRE_MINUTES = os.getenv("ACCESS_TOKEN_EXPIRE_MINUTES")
SECRET_KEY = os.getenv("SECRET_KEY") or "secret"
ALGORITHM = os.getenv("ALGORITHM") or "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = os.getenv("ACCESS_TOKEN_EXPIRE_MINUTES") or 600
pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="auth/token")
@@ -38,18 +40,24 @@ def get_password_hash(password: str):
return pwd_context.hash(password)
async def get_user(email: str):
user_query = "SELECT * FROM users WHERE email = ?"
user = query(user_query, (email,))[0]
async def get_user(session: AsyncSession, email: str) -> UserInDB | None:
stmt = select(UserModel).where(UserModel.email == email)
result = await session.execute(stmt)
user = result.scalars().first()
if not user:
return
return None
return UserInDB(**user)
return UserInDB(
email=str(user.email),
hashed_password=str(user.hashed_password),
full_name=str(user.full_name),
disabled=bool(user.disabled),
)
async def authenticate_user(email: str, password: str):
user = await get_user(email)
async def authenticate_user(email: str, password: str, db: AsyncSession):
user = await get_user(db, email)
if not user:
return False
@@ -75,7 +83,9 @@ def create_access_token(
return encoded_jwt
async def get_current_user(token: str = Depends(oauth2_scheme)):
async def get_current_user(
db: AsyncSession = Depends(get_db), token: str = Depends(oauth2_scheme)
):
LOG.debug(f"Getting current user with token: {token}")
if not token:
@@ -83,7 +93,7 @@ async def get_current_user(token: str = Depends(oauth2_scheme)):
return EMPTY_USER
if len(token.split(".")) != 3:
LOG.error(f"Malformed token: {token}")
LOG.debug(f"Malformed token: {token}")
return EMPTY_USER
try:
@@ -118,14 +128,15 @@ async def get_current_user(token: str = Depends(oauth2_scheme)):
LOG.error(f"Exception occurred: {e}")
return EMPTY_USER
user = await get_user(email=token_data.email)
user = await get_user(db, email=token_data.email or "")
if user is None:
return EMPTY_USER
return user
async def require_user(token: str = Depends(oauth2_scheme)):
async def require_user(db: AsyncSession, token: str = Depends(oauth2_scheme)):
credentials_exception = HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Could not validate credentials",
@@ -136,6 +147,7 @@ async def require_user(token: str = Depends(oauth2_scheme)):
payload: Optional[dict[str, Any]] = jwt.decode(
token, SECRET_KEY, algorithms=[ALGORITHM]
)
if not payload:
raise credentials_exception
@@ -149,7 +161,7 @@ async def require_user(token: str = Depends(oauth2_scheme)):
except JWTError:
raise credentials_exception
user = await get_user(email=token_data.email)
user = await get_user(db, email=token_data.email or "")
if user is None:
raise credentials_exception

@@ -0,0 +1,4 @@
# LOCAL
from .auth import User, Token, UserInDB, TokenData, UserCreate
__all__ = ["User", "Token", "UserInDB", "TokenData", "UserCreate"]

@@ -1 +1,24 @@
DATABASE_PATH = "data/database.db"
# STL
import os
from pathlib import Path
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite+aiosqlite:///data/database.db")
RECORDINGS_DIR = Path("media/recordings")
RECORDINGS_ENABLED = os.getenv("RECORDINGS_ENABLED", "true").lower() == "true"
MEDIA_DIR = Path("media")
MEDIA_TYPES = [
"audio",
"documents",
"images",
"pdfs",
"presentations",
"spreadsheets",
"videos",
]
REGISTRATION_ENABLED = os.getenv("REGISTRATION_ENABLED", "true").lower() == "true"
DEFAULT_USER_EMAIL = os.getenv("DEFAULT_USER_EMAIL")
DEFAULT_USER_PASSWORD = os.getenv("DEFAULT_USER_PASSWORD")
DEFAULT_USER_FULL_NAME = os.getenv("DEFAULT_USER_FULL_NAME")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
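
The boolean settings above are read by lowercasing the environment value and comparing it with "true". A minimal sketch of that pattern, factored into a helper, might look like the following; `env_flag` is a hypothetical name that does not appear in the commit.

```python
import os


def env_flag(name: str, default: str = "true") -> bool:
    """Return True only when the variable (or its default) equals "true", ignoring case."""
    # Hypothetical helper: mirrors os.getenv(NAME, "true").lower() == "true".
    return os.getenv(name, default).lower() == "true"


os.environ["REGISTRATION_ENABLED"] = "False"
os.environ.pop("RECORDINGS_ENABLED", None)

registration_enabled = env_flag("REGISTRATION_ENABLED")  # explicit "False"
recordings_enabled = env_flag("RECORDINGS_ENABLED")  # unset, falls back to "true"
```

Under this comparison any value other than "true" (for example "1" or "yes") is treated as false, which may be worth stating in deployment notes.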

@@ -1,3 +0,0 @@
from .common import insert, QUERIES, update
__all__ = ["insert", "QUERIES", "update"]

@@ -0,0 +1,26 @@
# STL
from typing import AsyncGenerator
# PDM
from sqlalchemy.orm import declarative_base
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
# LOCAL
from api.backend.constants import DATABASE_URL
engine = create_async_engine(DATABASE_URL, echo=False, future=True)
AsyncSessionLocal = async_sessionmaker(
bind=engine,
autoflush=False,
autocommit=False,
expire_on_commit=False,
class_=AsyncSession,
)
Base = declarative_base()
async def get_db() -> AsyncGenerator[AsyncSession, None]:
async with AsyncSessionLocal() as session:
yield session
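
`get_db` is an async generator: it yields a session from inside an `async with` block, so the session is released when the consumer closes the generator. The sketch below illustrates that lifecycle with the standard library only. `FakeSession` and `handle_request` are stand-ins invented for this example, and the way `handle_request` drives the generator only approximates what a framework does with a yield-style dependency.

```python
import asyncio
from typing import AsyncGenerator


class FakeSession:
    """Stand-in for AsyncSession; it only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    async def __aenter__(self) -> "FakeSession":
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        self.closed = True


async def get_db() -> AsyncGenerator[FakeSession, None]:
    async with FakeSession() as session:
        yield session


async def handle_request() -> FakeSession:
    gen = get_db()
    session = await gen.__anext__()  # runs get_db up to its yield
    try:
        assert not session.closed  # the session is open while the handler runs
    finally:
        await gen.aclose()  # resumes get_db so the async with block exits
    return session


session = asyncio.run(handle_request())
```

After `handle_request` returns, `session.closed` is true, which is the behavior the `async with` block in `get_db` is there to guarantee.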

@@ -1,92 +0,0 @@
import sqlite3
from typing import Any, Optional
from api.backend.constants import DATABASE_PATH
from api.backend.utils import format_json, format_sql_row_to_python
from api.backend.database.schema import INIT_QUERY
from api.backend.database.queries import JOB_INSERT_QUERY, DELETE_JOB_QUERY
import logging
LOG = logging.getLogger(__name__)
def connect():
connection = sqlite3.connect(DATABASE_PATH)
connection.set_trace_callback(print)
cursor = connection.cursor()
return cursor
def insert(query: str, values: tuple[Any, ...]):
connection = sqlite3.connect(DATABASE_PATH)
cursor = connection.cursor()
copy = list(values)
format_json(copy)
try:
_ = cursor.execute(query, copy)
connection.commit()
except sqlite3.Error as e:
LOG.error(f"An error occurred: {e}")
finally:
cursor.close()
connection.close()
def query(query: str, values: Optional[tuple[Any, ...]] = None):
connection = sqlite3.connect(DATABASE_PATH)
connection.row_factory = sqlite3.Row
cursor = connection.cursor()
rows = []
try:
if values:
_ = cursor.execute(query, values)
else:
_ = cursor.execute(query)
rows = cursor.fetchall()
finally:
cursor.close()
connection.close()
formatted_rows: list[dict[str, Any]] = []
for row in rows:
row = dict(row)
formatted_row = format_sql_row_to_python(row)
formatted_rows.append(formatted_row)
return formatted_rows
def update(query: str, values: Optional[tuple[Any, ...]] = None):
connection = sqlite3.connect(DATABASE_PATH)
cursor = connection.cursor()
copy = None
if values:
copy = list(values)
format_json(copy)
try:
if copy:
res = cursor.execute(query, copy)
else:
res = cursor.execute(query)
connection.commit()
return res.rowcount
except sqlite3.Error as e:
LOG.error(f"An error occurred: {e}")
finally:
cursor.close()
connection.close()
return 0
QUERIES = {
"init": INIT_QUERY,
"insert_job": JOB_INSERT_QUERY,
"delete_job": DELETE_JOB_QUERY,
}

@@ -0,0 +1,65 @@
# PDM
from sqlalchemy import JSON, Column, String, Boolean, DateTime, ForeignKey, func
from sqlalchemy.orm import relationship
# LOCAL
from api.backend.database.base import Base
class User(Base):
__tablename__ = "users"
email = Column(String(255), primary_key=True, nullable=False)
hashed_password = Column(String(255), nullable=False)
full_name = Column(String(255), nullable=True)
disabled = Column(Boolean, default=False)
jobs = relationship("Job", back_populates="user_obj", cascade="all, delete-orphan")
cron_jobs = relationship(
"CronJob", back_populates="user_obj", cascade="all, delete-orphan"
)
class Job(Base):
__tablename__ = "jobs"
id = Column(String(64), primary_key=True, nullable=False)
url = Column(String(2048), nullable=False)
elements = Column(JSON, nullable=False)
user = Column(String(255), ForeignKey("users.email"), nullable=True)
time_created = Column(
DateTime(timezone=True), server_default=func.now(), nullable=False
)
result = Column(JSON, nullable=False)
status = Column(String(50), nullable=False)
chat = Column(JSON, nullable=True)
job_options = Column(JSON, nullable=True)
agent_mode = Column(Boolean, default=False, nullable=False)
prompt = Column(String(1024), nullable=True)
favorite = Column(Boolean, default=False, nullable=False)
user_obj = relationship("User", back_populates="jobs")
cron_jobs = relationship(
"CronJob", back_populates="job_obj", cascade="all, delete-orphan"
)
class CronJob(Base):
__tablename__ = "cron_jobs"
id = Column(String(64), primary_key=True, nullable=False)
user_email = Column(String(255), ForeignKey("users.email"), nullable=False)
job_id = Column(String(64), ForeignKey("jobs.id"), nullable=False)
cron_expression = Column(String(255), nullable=False)
time_created = Column(
DateTime(timezone=True), server_default=func.now(), nullable=False
)
time_updated = Column(
DateTime(timezone=True),
server_default=func.now(),
onupdate=func.now(),
nullable=False,
)
user_obj = relationship("User", back_populates="cron_jobs")
job_obj = relationship("Job", back_populates="cron_jobs")

@@ -1,3 +0,0 @@
from .queries import JOB_INSERT_QUERY, DELETE_JOB_QUERY
__all__ = ["JOB_INSERT_QUERY", "DELETE_JOB_QUERY"]

@@ -0,0 +1,75 @@
# STL
import logging
from typing import Any
# PDM
from sqlalchemy import delete as sql_delete
from sqlalchemy import select
from sqlalchemy import update as sql_update
# LOCAL
from api.backend.database.base import AsyncSessionLocal
from api.backend.database.models import Job
LOG = logging.getLogger("Database")
async def insert_job(item: dict[str, Any]) -> None:
async with AsyncSessionLocal() as session:
job = Job(
id=item["id"],
url=item["url"],
elements=item["elements"],
user=item["user"],
time_created=item["time_created"],
result=item["result"],
status=item["status"],
chat=item["chat"],
job_options=item["job_options"],
agent_mode=item["agent_mode"],
prompt=item["prompt"],
)
session.add(job)
await session.commit()
LOG.info(f"Inserted item: {item}")
async def get_queued_job():
async with AsyncSessionLocal() as session:
stmt = (
select(Job)
.where(Job.status == "Queued")
.order_by(Job.time_created.desc())
.limit(1)
)
result = await session.execute(stmt)
job = result.scalars().first()
if job:
LOG.info(f"Got queued job: {job}")
return job
async def update_job(ids: list[str], updates: dict[str, Any]):
if not updates:
return
async with AsyncSessionLocal() as session:
stmt = sql_update(Job).where(Job.id.in_(ids)).values(**updates)
result = await session.execute(stmt)
await session.commit()
LOG.debug(f"Updated job count: {result.rowcount}")
async def delete_jobs(jobs: list[str]):
if not jobs:
LOG.info("No jobs to delete.")
return False
async with AsyncSessionLocal() as session:
stmt = sql_delete(Job).where(Job.id.in_(jobs))
result = await session.execute(stmt)
await session.commit()
LOG.info(f"Deleted jobs count: {result.rowcount}")
return result.rowcount

@@ -1,9 +0,0 @@
JOB_INSERT_QUERY = """
INSERT INTO jobs
(id, url, elements, user, time_created, result, status, chat, job_options)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
"""
DELETE_JOB_QUERY = """
DELETE FROM jobs WHERE id IN ()
"""

@@ -0,0 +1,43 @@
# PDM
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession
# LOCAL
from api.backend.database.models import Job
async def average_elements_per_link(session: AsyncSession, user_email: str):
date_func = func.date(Job.time_created)
stmt = (
select(
date_func.label("date"),
func.avg(func.json_array_length(Job.elements)).label("average_elements"),
func.count().label("count"),
)
.where(Job.status == "Completed", Job.user == user_email)
.group_by(date_func)
.order_by("date")
)
result = await session.execute(stmt)
rows = result.all()
return [dict(row._mapping) for row in rows]
async def get_jobs_per_day(session: AsyncSession, user_email: str):
date_func = func.date(Job.time_created)
stmt = (
select(
date_func.label("date"),
func.count().label("job_count"),
)
.where(Job.status == "Completed", Job.user == user_email)
.group_by(date_func)
.order_by("date")
)
result = await session.execute(stmt)
rows = result.all()
return [dict(row._mapping) for row in rows]

@@ -1,3 +0,0 @@
from .schema import INIT_QUERY
__all__ = ["INIT_QUERY"]

@@ -1,30 +0,0 @@
INIT_QUERY = """
CREATE TABLE IF NOT EXISTS jobs (
id STRING PRIMARY KEY NOT NULL,
url STRING NOT NULL,
elements JSON NOT NULL,
user STRING,
time_created DATETIME NOT NULL,
result JSON NOT NULL,
status STRING NOT NULL,
chat JSON,
job_options JSON
);
CREATE TABLE IF NOT EXISTS users (
email STRING PRIMARY KEY NOT NULL,
hashed_password STRING NOT NULL,
full_name STRING,
disabled BOOLEAN
);
CREATE TABLE IF NOT EXISTS cron_jobs (
id STRING PRIMARY KEY NOT NULL,
user_email STRING NOT NULL,
job_id STRING NOT NULL,
cron_expression STRING NOT NULL,
time_created DATETIME NOT NULL,
time_updated DATETIME NOT NULL,
FOREIGN KEY (job_id) REFERENCES jobs(id)
);
"""

@@ -1,43 +1,56 @@
import os
from api.backend.database.common import connect, QUERIES, insert
# STL
import logging
# PDM
from sqlalchemy.exc import IntegrityError
# LOCAL
from api.backend.constants import (
DEFAULT_USER_EMAIL,
REGISTRATION_ENABLED,
DEFAULT_USER_PASSWORD,
DEFAULT_USER_FULL_NAME,
)
from api.backend.database.base import Base, AsyncSessionLocal, engine
from api.backend.auth.auth_utils import get_password_hash
from api.backend.database.models import User
LOG = logging.getLogger(__name__)
LOG = logging.getLogger("Database")
async def init_database():
LOG.info("Creating database schema...")
def init_database():
cursor = connect()
async with engine.begin() as conn:
await conn.run_sync(Base.metadata.create_all)
for query in QUERIES["init"].strip().split(";"):
if query.strip():
LOG.info(f"Executing query: {query}")
_ = cursor.execute(query)
if not REGISTRATION_ENABLED:
default_user_email = DEFAULT_USER_EMAIL
default_user_password = DEFAULT_USER_PASSWORD
default_user_full_name = DEFAULT_USER_FULL_NAME
if os.environ.get("REGISTRATION_ENABLED", "True") == "False":
default_user_email = os.environ.get("DEFAULT_USER_EMAIL")
default_user_password = os.environ.get("DEFAULT_USER_PASSWORD")
default_user_full_name = os.environ.get("DEFAULT_USER_FULL_NAME")
if (
not default_user_email
or not default_user_password
or not default_user_full_name
):
LOG.error(
"DEFAULT_USER_EMAIL, DEFAULT_USER_PASSWORD, or DEFAULT_USER_FULL_NAME is not set!"
)
if not (default_user_email and default_user_password and default_user_full_name):
LOG.error("DEFAULT_USER_* env vars are not set!")
exit(1)
query = "INSERT INTO users (email, hashed_password, full_name) VALUES (?, ?, ?)"
_ = insert(
query,
(
default_user_email,
get_password_hash(default_user_password),
default_user_full_name,
),
)
async with AsyncSessionLocal() as session:
user = await session.get(User, default_user_email)
if user:
LOG.info("Default user already exists. Skipping creation.")
return
LOG.info("Creating default user...")
new_user = User(
email=default_user_email,
hashed_password=get_password_hash(default_user_password),
full_name=default_user_full_name,
disabled=False,
)
try:
session.add(new_user)
await session.commit()
LOG.info(f"Created default user: {default_user_email}")
except IntegrityError as e:
await session.rollback()
LOG.warning(f"Could not create default user (already exists?): {e}")
cursor.close()

@@ -0,0 +1,37 @@
# STL
import json
from typing import Any
from datetime import datetime
def format_list_for_query(ids: list[str]):
return (
f"({','.join(['?' for _ in ids])})" # Returns placeholders, e.g., "(?, ?, ?)"
)
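
`format_list_for_query` only builds the placeholder list; the ids themselves are bound as parameters when the statement runs. A sketch of how it could be paired with the standard `sqlite3` module follows. The helper is repeated so the example is self-contained, and the in-memory `jobs` table is made up for illustration.

```python
import sqlite3


def format_list_for_query(ids: list[str]) -> str:
    # One "?" per id, e.g. "(?,?,?)"; values are bound separately, never interpolated.
    return f"({','.join('?' for _ in ids)})"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO jobs (id) VALUES (?)", [("a",), ("b",), ("c",)])

ids = ["a", "c"]
placeholders = format_list_for_query(ids)
cursor = conn.execute(f"DELETE FROM jobs WHERE id IN {placeholders}", ids)
deleted = cursor.rowcount
remaining = [row[0] for row in conn.execute("SELECT id FROM jobs ORDER BY id")]
conn.close()
```

Only the placeholder string is formatted into the SQL text, so the id values never become part of the statement itself.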
def format_sql_row_to_python(row: dict[str, Any]):
new_row: dict[str, Any] = {}
for key, value in row.items():
if isinstance(value, str):
try:
new_row[key] = json.loads(value)
except json.JSONDecodeError:
new_row[key] = value
else:
new_row[key] = value
return new_row
def format_json(items: list[Any]):
for idx, item in enumerate(items):
if isinstance(item, (dict, list)):
formatted_item = json.dumps(item)
items[idx] = formatted_item
def parse_datetime(dt_str: str) -> datetime:
if dt_str.endswith("Z"):
dt_str = dt_str.replace("Z", "+00:00") # valid ISO format for UTC
return datetime.fromisoformat(dt_str)
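
`parse_datetime` rewrites a trailing "Z" because `datetime.fromisoformat` rejects that suffix on Python versions before 3.11. A short usage sketch, with the function repeated so it runs on its own and with sample timestamps chosen only for illustration:

```python
from datetime import datetime, timedelta, timezone


def parse_datetime(dt_str: str) -> datetime:
    if dt_str.endswith("Z"):
        dt_str = dt_str.replace("Z", "+00:00")  # valid ISO format for UTC
    return datetime.fromisoformat(dt_str)


utc_value = parse_datetime("2025-07-12T21:12:33Z")
offset_value = parse_datetime("2025-07-12T21:12:33-05:00")
naive_value = parse_datetime("2025-07-12T21:12:33")
```

The first two results are timezone-aware. The third has no `tzinfo`, so callers that compare it with aware values would need to attach a zone first.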

@@ -1,17 +1,9 @@
from .job import (
insert,
update_job,
delete_jobs,
get_jobs_per_day,
get_queued_job,
average_elements_per_link,
)
# LOCAL
from .job import insert, update_job, delete_jobs, get_queued_job
__all__ = [
"insert",
"update_job",
"delete_jobs",
"get_jobs_per_day",
"get_queued_job",
"average_elements_per_link",
]

@@ -1,78 +1,75 @@
import datetime
from typing import Any
# STL
import uuid
from api.backend.database.common import insert, query
from api.backend.models import CronJob
from apscheduler.schedulers.background import BackgroundScheduler # type: ignore
from apscheduler.triggers.cron import CronTrigger # type: ignore
from api.backend.job import insert as insert_job
import logging
import datetime
from typing import Any, List
LOG = logging.getLogger("Cron Scheduler")
# PDM
from sqlalchemy import select
from apscheduler.triggers.cron import CronTrigger
from apscheduler.schedulers.asyncio import AsyncIOScheduler
# LOCAL
from api.backend.job import insert as insert_job
from api.backend.database.base import AsyncSessionLocal
from api.backend.database.models import Job, CronJob
LOG = logging.getLogger("Cron")
def insert_cron_job(cron_job: CronJob):
query = """
INSERT INTO cron_jobs (id, user_email, job_id, cron_expression, time_created, time_updated)
VALUES (?, ?, ?, ?, ?, ?)
"""
values = (
cron_job.id,
cron_job.user_email,
cron_job.job_id,
cron_job.cron_expression,
cron_job.time_created,
cron_job.time_updated,
)
insert(query, values)
async def insert_cron_job(cron_job: CronJob) -> bool:
async with AsyncSessionLocal() as session:
session.add(cron_job)
await session.commit()
return True
def delete_cron_job(id: str, user_email: str):
query = """
DELETE FROM cron_jobs
WHERE id = ? AND user_email = ?
"""
values = (id, user_email)
insert(query, values)
async def delete_cron_job(id: str, user_email: str) -> bool:
async with AsyncSessionLocal() as session:
stmt = select(CronJob).where(CronJob.id == id, CronJob.user_email == user_email)
result = await session.execute(stmt)
cron_job = result.scalars().first()
if cron_job:
await session.delete(cron_job)
await session.commit()
return True
def get_cron_jobs(user_email: str):
cron_jobs = query("SELECT * FROM cron_jobs WHERE user_email = ?", (user_email,))
return cron_jobs
async def get_cron_jobs(user_email: str) -> List[CronJob]:
async with AsyncSessionLocal() as session:
stmt = select(CronJob).where(CronJob.user_email == user_email)
result = await session.execute(stmt)
return list(result.scalars().all())
def get_all_cron_jobs():
cron_jobs = query("SELECT * FROM cron_jobs")
return cron_jobs
async def get_all_cron_jobs() -> List[CronJob]:
async with AsyncSessionLocal() as session:
stmt = select(CronJob)
result = await session.execute(stmt)
return list(result.scalars().all())
def insert_job_from_cron_job(job: dict[str, Any]):
insert_job(
{
**job,
"id": uuid.uuid4().hex,
"status": "Queued",
"result": "",
"chat": None,
"time_created": datetime.datetime.now(),
"time_updated": datetime.datetime.now(),
}
)
async def insert_job_from_cron_job(job: dict[str, Any]):
async with AsyncSessionLocal() as session:
await insert_job(
{
**job,
"id": uuid.uuid4().hex,
"status": "Queued",
"result": "",
"chat": None,
"time_created": datetime.datetime.now(datetime.timezone.utc),
"time_updated": datetime.datetime.now(datetime.timezone.utc),
},
session,
)
def get_cron_job_trigger(cron_expression: str):
expression_parts = cron_expression.split()
if len(expression_parts) != 5:
print(f"Invalid cron expression: {cron_expression}")
LOG.warning(f"Invalid cron expression: {cron_expression}")
return None
minute, hour, day, month, day_of_week = expression_parts
@@ -82,19 +79,37 @@ def get_cron_job_trigger(cron_expression: str):
)
def start_cron_scheduler(scheduler: BackgroundScheduler):
cron_jobs = get_all_cron_jobs()
async def start_cron_scheduler(scheduler: AsyncIOScheduler):
async with AsyncSessionLocal() as session:
stmt = select(CronJob)
result = await session.execute(stmt)
cron_jobs = result.scalars().all()
LOG.info(f"Cron jobs: {cron_jobs}")
LOG.info(f"Cron jobs: {cron_jobs}")
for job in cron_jobs:
queried_job = query("SELECT * FROM jobs WHERE id = ?", (job["job_id"],))
for cron_job in cron_jobs:
stmt = select(Job).where(Job.id == cron_job.job_id)
result = await session.execute(stmt)
queried_job = result.scalars().first()
LOG.info(f"Adding job: {queried_job}")
LOG.info(f"Adding job: {queried_job}")
scheduler.add_job(
insert_job_from_cron_job,
get_cron_job_trigger(job["cron_expression"]),
id=job["id"],
args=[queried_job[0]],
)
trigger = get_cron_job_trigger(cron_job.cron_expression) # type: ignore
if not trigger:
continue
job_dict = (
{
c.key: getattr(queried_job, c.key)
for c in queried_job.__table__.columns
}
if queried_job
else {}
)
scheduler.add_job(
insert_job_from_cron_job,
trigger,
id=cron_job.id,
args=[job_dict],
)
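
`get_cron_job_trigger` rejects any expression that does not have exactly five whitespace-separated fields, and `start_cron_scheduler` skips cron jobs for which it returns `None`. The hunk does not show how the trigger itself is built, so the sketch below stops at naming the fields; `split_cron_expression` is a hypothetical helper, not part of the commit.

```python
from typing import Optional


def split_cron_expression(cron_expression: str) -> Optional[dict[str, str]]:
    """Split a five-field cron expression into named parts, or return None."""
    parts = cron_expression.split()
    if len(parts) != 5:
        return None
    minute, hour, day, month, day_of_week = parts
    return {
        "minute": minute,
        "hour": hour,
        "day": day,
        "month": month,
        "day_of_week": day_of_week,
    }


valid = split_cron_expression("*/15 2 * * 1-5")
invalid = split_cron_expression("0 0 * *")  # only four fields
```

Here `invalid` is `None`, which corresponds to the case the scheduler loop skips with `continue`.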

@@ -1,97 +1,113 @@
# STL
import logging
import datetime
from typing import Any
# PDM
from sqlalchemy import delete as sql_delete
from sqlalchemy import select
from sqlalchemy import update as sql_update
from sqlalchemy.ext.asyncio import AsyncSession
# LOCAL
from api.backend.utils import format_list_for_query
from api.backend.database.common import (
insert as common_insert,
query as common_query,
QUERIES,
update as common_update,
)
from api.backend.database.base import AsyncSessionLocal
from api.backend.database.models import Job
LOG = logging.getLogger(__name__)
LOG = logging.getLogger("Job")
def insert(item: dict[str, Any]) -> None:
common_insert(
QUERIES["insert_job"],
(
async def insert(item: dict[str, Any], db: AsyncSession) -> None:
existing = await db.get(Job, item["id"])
if existing:
await multi_field_update_job(
item["id"],
item["url"],
item["elements"],
item["user"],
item["time_created"],
item["result"],
item["status"],
item["chat"],
item["job_options"],
),
{
"agent_mode": item["agent_mode"],
"prompt": item["prompt"],
"job_options": item["job_options"],
"elements": item["elements"],
"status": "Queued",
"result": [],
"time_created": datetime.datetime.now(datetime.timezone.utc),
"chat": None,
},
db,
)
return
job = Job(
id=item["id"],
url=item["url"],
elements=item["elements"],
user=item["user"],
time_created=datetime.datetime.now(datetime.timezone.utc),
result=item["result"],
status=item["status"],
chat=item["chat"],
job_options=item["job_options"],
agent_mode=item["agent_mode"],
prompt=item["prompt"],
)
LOG.info(f"Inserted item: {item}")
db.add(job)
await db.commit()
LOG.debug(f"Inserted item: {item}")
async def check_for_job_completion(id: str) -> dict[str, Any]:
async with AsyncSessionLocal() as session:
job = await session.get(Job, id)
return job.__dict__ if job else {}
async def get_queued_job():
query = (
"SELECT * FROM jobs WHERE status = 'Queued' ORDER BY time_created DESC LIMIT 1"
)
res = common_query(query)
LOG.info(f"Got queued job: {res}")
return res[0] if res else None
async with AsyncSessionLocal() as session:
stmt = (
select(Job)
.where(Job.status == "Queued")
.order_by(Job.time_created.desc())
.limit(1)
)
result = await session.execute(stmt)
job = result.scalars().first()
LOG.debug(f"Got queued job: {job}")
return job.__dict__ if job else None
async def update_job(ids: list[str], field: str, value: Any):
query = f"UPDATE jobs SET {field} = ? WHERE id IN {format_list_for_query(ids)}"
res = common_update(query, tuple([value] + ids))
LOG.info(f"Updated job: {res}")
async with AsyncSessionLocal() as session:
stmt = sql_update(Job).where(Job.id.in_(ids)).values({field: value})
res = await session.execute(stmt)
await session.commit()
LOG.debug(f"Updated job count: {res.rowcount}")
async def multi_field_update_job(
id: str, fields: dict[str, Any], session: AsyncSession | None = None
):
close_session = False
if not session:
session = AsyncSessionLocal()
close_session = True
try:
stmt = sql_update(Job).where(Job.id == id).values(**fields)
await session.execute(stmt)
await session.commit()
LOG.debug(f"Updated job {id} fields: {fields}")
finally:
if close_session:
await session.close()
async def delete_jobs(jobs: list[str]):
if not jobs:
LOG.info("No jobs to delete.")
LOG.debug("No jobs to delete.")
return False
query = f"DELETE FROM jobs WHERE id IN {format_list_for_query(jobs)}"
res = common_update(query, tuple(jobs))
return res > 0
async def average_elements_per_link(user: str):
job_query = """
SELECT
DATE(time_created) AS date,
AVG(json_array_length(elements)) AS average_elements,
COUNT(*) AS count
FROM
jobs
WHERE
status = 'Completed' AND user = ?
GROUP BY
DATE(time_created)
ORDER BY
date ASC;
"""
results = common_query(job_query, (user,))
return results
async def get_jobs_per_day(user: str):
job_query = """
SELECT
DATE(time_created) AS date,
COUNT(*) AS job_count
FROM
jobs
WHERE
status = 'Completed' AND user = ?
GROUP BY
DATE(time_created)
ORDER BY
date ASC;
"""
results = common_query(job_query, (user,))
return results
async with AsyncSessionLocal() as session:
stmt = sql_delete(Job).where(Job.id.in_(jobs))
res = await session.execute(stmt)
await session.commit()
LOG.debug(f"Deleted jobs: {res.rowcount}")
return res.rowcount > 0

@@ -0,0 +1,280 @@
# STL
import csv
import uuid
import random
import logging
import datetime
from io import StringIO
# PDM
from fastapi import Depends, APIRouter
from sqlalchemy import select
from fastapi.encoders import jsonable_encoder
from fastapi.responses import FileResponse, JSONResponse, StreamingResponse
from sqlalchemy.ext.asyncio import AsyncSession
from apscheduler.triggers.cron import CronTrigger # type: ignore
# LOCAL
from api.backend.job import insert, update_job, delete_jobs
from api.backend.constants import MEDIA_DIR, MEDIA_TYPES, RECORDINGS_DIR
from api.backend.scheduler import scheduler
from api.backend.schemas.job import Job, UpdateJobs, DownloadJob, DeleteScrapeJobs
from api.backend.auth.schemas import User
from api.backend.schemas.cron import CronJob as PydanticCronJob
from api.backend.schemas.cron import DeleteCronJob
from api.backend.database.base import get_db
from api.backend.auth.auth_utils import get_current_user
from api.backend.database.models import Job as DatabaseJob
from api.backend.database.models import CronJob
from api.backend.job.utils.text_utils import clean_text
from api.backend.job.models.job_options import FetchOptions
from api.backend.routers.handle_exceptions import handle_exceptions
from api.backend.job.utils.clean_job_format import clean_job_format
from api.backend.job.cron_scheduling.cron_scheduling import (
get_cron_jobs,
delete_cron_job,
insert_cron_job,
get_cron_job_trigger,
insert_job_from_cron_job,
)
from api.backend.job.utils.stream_md_from_job_results import stream_md_from_job_results
LOG = logging.getLogger("Job")
job_router = APIRouter()
@job_router.post("/update")
@handle_exceptions(logger=LOG)
async def update(update_jobs: UpdateJobs, _: User = Depends(get_current_user)):
await update_job(update_jobs.ids, update_jobs.field, update_jobs.value)
return {"message": "Jobs updated successfully"}
@job_router.post("/submit-scrape-job")
@handle_exceptions(logger=LOG)
async def submit_scrape_job(job: Job, db: AsyncSession = Depends(get_db)):
LOG.info(f"Received job: {job}")
if not job.id:
job.id = uuid.uuid4().hex
job_dict = job.model_dump()
await insert(job_dict, db)
return JSONResponse(
content={"id": job.id, "message": "Job submitted successfully."}
)
@job_router.post("/retrieve-scrape-jobs")
@handle_exceptions(logger=LOG)
async def retrieve_scrape_jobs(
fetch_options: FetchOptions,
user: User = Depends(get_current_user),
db: AsyncSession = Depends(get_db),
):
LOG.info(
f"Retrieving jobs for account: {user.email if user.email else 'Guest User'}"
)
if fetch_options.chat:
stmt = select(DatabaseJob.chat).filter(DatabaseJob.user == user.email)
else:
stmt = select(DatabaseJob).filter(DatabaseJob.user == user.email)
results = await db.execute(stmt)
rows = results.all() if fetch_options.chat else results.scalars().all()
return JSONResponse(content=jsonable_encoder(rows[::-1]))
@job_router.get("/job/{id}")
@handle_exceptions(logger=LOG)
async def job(
id: str, user: User = Depends(get_current_user), db: AsyncSession = Depends(get_db)
):
LOG.info(f"Retrieving jobs for account: {user.email}")
stmt = select(DatabaseJob).filter(
DatabaseJob.user == user.email, DatabaseJob.id == id
)
results = await db.execute(stmt)
return JSONResponse(
content=jsonable_encoder([job.__dict__ for job in results.scalars().all()])
)
@job_router.post("/download")
@handle_exceptions(logger=LOG)
async def download(download_job: DownloadJob, db: AsyncSession = Depends(get_db)):
LOG.info(f"Downloading job with ids: {download_job.ids}")
stmt = select(DatabaseJob).where(DatabaseJob.id.in_(download_job.ids))
result = await db.execute(stmt)
results = [job.__dict__ for job in result.scalars().all()]
if download_job.job_format == "csv":
csv_buffer = StringIO()
csv_writer = csv.writer(csv_buffer, quotechar='"', quoting=csv.QUOTE_ALL)
headers = [
"id",
"url",
"element_name",
"xpath",
"text",
"user",
"time_created",
]
csv_writer.writerow(headers)
for result in results:
for res in result["result"]:
for url, elements in res.items():
for element_name, values in elements.items():
for value in values:
text = clean_text(value.get("text", "")).strip()
if text:
csv_writer.writerow(
[
result.get("id", "")
+ "-"
+ str(random.randint(0, 1000000)),
url,
element_name,
value.get("xpath", ""),
text,
result.get("user", ""),
result.get("time_created", ""),
]
)
_ = csv_buffer.seek(0)
response = StreamingResponse(
csv_buffer,
media_type="text/csv",
)
response.headers["Content-Disposition"] = "attachment; filename=export.csv"
return response
elif download_job.job_format == "md":
response = StreamingResponse(
stream_md_from_job_results(results),
media_type="text/markdown",
)
response.headers["Content-Disposition"] = "attachment; filename=export.md"
return response
@job_router.get("/job/{id}/convert-to-csv")
@handle_exceptions(logger=LOG)
async def convert_to_csv(id: str, db: AsyncSession = Depends(get_db)):
stmt = select(DatabaseJob).filter(DatabaseJob.id == id)
results = await db.execute(stmt)
jobs = results.scalars().all()
return JSONResponse(content=clean_job_format([job.__dict__ for job in jobs]))
@job_router.post("/delete-scrape-jobs")
@handle_exceptions(logger=LOG)
async def delete(delete_scrape_jobs: DeleteScrapeJobs):
result = await delete_jobs(delete_scrape_jobs.ids)
return (
JSONResponse(content={"message": "Jobs successfully deleted."})
if result
else JSONResponse(content={"error": "Jobs not deleted."})
)
@job_router.post("/schedule-cron-job")
@handle_exceptions(logger=LOG)
async def schedule_cron_job(
cron_job: PydanticCronJob,
db: AsyncSession = Depends(get_db),
):
if not cron_job.id:
cron_job.id = uuid.uuid4().hex
now = datetime.datetime.now(datetime.timezone.utc)
if not cron_job.time_created:
cron_job.time_created = now
if not cron_job.time_updated:
cron_job.time_updated = now
await insert_cron_job(CronJob(**cron_job.model_dump()))
stmt = select(DatabaseJob).where(DatabaseJob.id == cron_job.job_id)
result = await db.execute(stmt)
queried_job = result.scalars().first()
if not queried_job:
return JSONResponse(status_code=404, content={"error": "Related job not found"})
scheduler.add_job(
insert_job_from_cron_job,
get_cron_job_trigger(cron_job.cron_expression),
id=cron_job.id,
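# Pass a plain dict: insert_job_from_cron_job unpacks its argument with **job.
args=[{c.key: getattr(queried_job, c.key) for c in queried_job.__table__.columns}],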
)
return JSONResponse(content={"message": "Cron job scheduled successfully."})
@job_router.post("/delete-cron-job")
@handle_exceptions(logger=LOG)
async def delete_cron_job_request(request: DeleteCronJob):
if not request.id:
return JSONResponse(
content={"error": "Cron job id is required."}, status_code=400
)
await delete_cron_job(request.id, request.user_email)
scheduler.remove_job(request.id)
return JSONResponse(content={"message": "Cron job deleted successfully."})
@job_router.get("/cron-jobs")
@handle_exceptions(logger=LOG)
async def get_cron_jobs_request(user: User = Depends(get_current_user)):
cron_jobs = await get_cron_jobs(user.email)
return JSONResponse(content=jsonable_encoder(cron_jobs))
@job_router.get("/recordings/{id}")
@handle_exceptions(logger=LOG)
async def get_recording(id: str):
path = RECORDINGS_DIR / f"{id}.mp4"
if not path.exists():
return JSONResponse(content={"error": "Recording not found."}, status_code=404)
return FileResponse(
path, headers={"Content-Type": "video/mp4", "Accept-Ranges": "bytes"}
)
@job_router.get("/get-media")
@handle_exceptions(logger=LOG)
async def get_media(id: str):
files: dict[str, list[str]] = {}
for media_type in MEDIA_TYPES:
path = MEDIA_DIR / media_type / f"{id}"
files[media_type] = [file.name for file in path.glob("*")]
return JSONResponse(content={"files": files})
@job_router.get("/media")
@handle_exceptions(logger=LOG)
async def get_media_file(id: str, type: str, file: str):
path = MEDIA_DIR / type / f"{id}" / file
if not path.exists():
return JSONResponse(content={"error": "Media file not found."}, status_code=404)
return FileResponse(path)
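
The CSV branch of `download` walks a nested structure: each job's `result` is a list of pages, each page maps a URL to element names, and each element name maps to a list of captured values. A reduced sketch of that flattening is shown below, assuming this shape. `job_results_to_csv` and the sample data are hypothetical, and the sketch leaves out `clean_text`, the random id suffix, and the `user` and `time_created` columns.

```python
import csv
from io import StringIO
from typing import Any


def job_results_to_csv(jobs: list[dict[str, Any]]) -> str:
    """Flatten nested job results into one fully quoted CSV row per captured value."""
    buffer = StringIO()
    writer = csv.writer(buffer, quotechar='"', quoting=csv.QUOTE_ALL)
    writer.writerow(["id", "url", "element_name", "xpath", "text"])
    for job in jobs:
        for page in job["result"]:
            for url, elements in page.items():
                for element_name, values in elements.items():
                    for value in values:
                        text = value.get("text", "").strip()
                        if text:  # skip captures with no visible text
                            writer.writerow(
                                [
                                    job.get("id", ""),
                                    url,
                                    element_name,
                                    value.get("xpath", ""),
                                    text,
                                ]
                            )
    return buffer.getvalue()


sample_jobs = [
    {
        "id": "job1",
        "result": [
            {
                "https://example.com": {
                    "title": [
                        {"xpath": "//h1", "text": " Hello "},
                        {"xpath": "//h2", "text": "   "},
                    ]
                }
            }
        ],
    }
]
csv_text = job_results_to_csv(sample_jobs)
rows = list(csv.reader(StringIO(csv_text)))
```

The whitespace-only capture is dropped, so `rows` holds the header and a single data row.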

@@ -1,3 +1,5 @@
from .job_options import JobOptions
# LOCAL
from .job import Element, CapturedElement
from .job_options import Proxy, JobOptions
__all__ = ["JobOptions"]
__all__ = ["JobOptions", "CapturedElement", "Element", "Proxy"]

@@ -0,0 +1,15 @@
from typing import Optional
import pydantic
class Element(pydantic.BaseModel):
name: str
xpath: str
url: Optional[str] = None
class CapturedElement(pydantic.BaseModel):
xpath: str
text: str
name: str

@@ -1,8 +1,19 @@
from pydantic import BaseModel
# STL
from typing import Any, Optional
# PDM
from pydantic import BaseModel
# LOCAL
from api.backend.job.models.site_map import SiteMap
class Proxy(BaseModel):
server: str
username: Optional[str] = None
password: Optional[str] = None
class FetchOptions(BaseModel):
chat: Optional[bool] = None
@@ -10,6 +21,8 @@ class FetchOptions(BaseModel):
class JobOptions(BaseModel):
multi_page_scrape: bool = False
custom_headers: dict[str, Any] = {}
proxies: list[str] = []
proxies: list[Proxy] = []
site_map: Optional[SiteMap] = None
collect_media: bool = False
custom_cookies: list[dict[str, Any]] = []
return_html: bool = False

@@ -0,0 +1,49 @@
# STL
import logging
from typing import Any, Optional
from urllib.parse import urlparse
# PDM
from playwright.async_api import Page, BrowserContext
LOG = logging.getLogger("Job")
async def add_custom_cookies(
custom_cookies: list[dict[str, Any]],
url: str,
context: BrowserContext,
) -> None:
parsed_url = urlparse(url)
domain = parsed_url.netloc
for cookie in custom_cookies:
cookie_dict = {
"name": cookie.get("name", ""),
"value": cookie.get("value", ""),
"domain": domain,
"path": "/",
}
LOG.info(f"Adding cookie: {cookie_dict}")
await context.add_cookies([cookie_dict]) # type: ignore
async def add_custom_headers(
custom_headers: dict[str, Any],
page: Page,
) -> None:
await page.set_extra_http_headers(custom_headers)
async def add_custom_items(
url: str,
page: Page,
cookies: Optional[list[dict[str, Any]]] = None,
headers: Optional[dict[str, Any]] = None,
) -> None:
if cookies:
await add_custom_cookies(cookies, url, page.context)
if headers:
await add_custom_headers(headers, page)


@@ -1,19 +1,24 @@
# STL
import os
import requests
import re
import logging
from typing import Dict, List
from pathlib import Path
from selenium.webdriver.common.by import By
from seleniumwire import webdriver
from urllib.parse import urlparse
from urllib.parse import urljoin, urlparse
from api.backend.utils import LOG
# PDM
import aiohttp
from playwright.async_api import Page
LOG = logging.getLogger("Job")
def collect_media(driver: webdriver.Chrome):
async def collect_media(id: str, page: Page) -> dict[str, list[dict[str, str]]]:
media_types = {
"images": "img",
"videos": "video",
"audio": "audio",
"pdfs": 'a[href$=".pdf"]',
"pdfs": 'a[href$=".pdf"], a[href*=".pdf#page="]',
"documents": 'a[href$=".doc"], a[href$=".docx"], a[href$=".txt"], a[href$=".rtf"]',
"presentations": 'a[href$=".ppt"], a[href$=".pptx"]',
"spreadsheets": 'a[href$=".xls"], a[href$=".xlsx"], a[href$=".csv"]',
@@ -24,62 +29,79 @@ def collect_media(driver: webdriver.Chrome):
media_urls = {}
for media_type, selector in media_types.items():
elements = driver.find_elements(By.CSS_SELECTOR, selector)
urls: list[dict[str, str]] = []
async with aiohttp.ClientSession() as session:
for media_type, selector in media_types.items():
elements = await page.query_selector_all(selector)
urls: List[Dict[str, str]] = []
media_dir = base_dir / media_type
media_dir.mkdir(exist_ok=True)
media_dir = base_dir / media_type
media_dir.mkdir(exist_ok=True)
for element in elements:
if media_type == "images":
url = element.get_attribute("src")
elif media_type == "videos":
url = element.get_attribute("src") or element.get_attribute("data-src")
else:
url = element.get_attribute("href")
for element in elements:
if media_type == "images":
url = await element.get_attribute("src")
elif media_type == "videos":
url = await element.get_attribute(
"src"
) or await element.get_attribute("data-src")
else:
url = await element.get_attribute("href")
if url and url.startswith(("http://", "https://")):
try:
filename = os.path.basename(urlparse(url).path)
if url and url.startswith("/"):
root_url = urlparse(page.url)
root_domain = f"{root_url.scheme}://{root_url.netloc}"
url = f"{root_domain}{url}"
if not filename:
filename = f"{media_type}_{len(urls)}"
if url and re.match(r"^[\w\-]+/", url):
root_url = urlparse(page.url)
root_domain = f"{root_url.scheme}://{root_url.netloc}"
url = urljoin(root_domain + "/", url)
if media_type == "images":
filename += ".jpg"
elif media_type == "videos":
filename += ".mp4"
elif media_type == "audio":
filename += ".mp3"
elif media_type == "pdfs":
filename += ".pdf"
elif media_type == "documents":
filename += ".doc"
elif media_type == "presentations":
filename += ".ppt"
elif media_type == "spreadsheets":
filename += ".xls"
if url and url.startswith(("http://", "https://")):
try:
parsed = urlparse(url)
filename = (
os.path.basename(parsed.path) or f"{media_type}_{len(urls)}"
)
response = requests.get(url, stream=True)
response.raise_for_status()
if "." not in filename:
ext = {
"images": ".jpg",
"videos": ".mp4",
"audio": ".mp3",
"pdfs": ".pdf",
"documents": ".doc",
"presentations": ".ppt",
"spreadsheets": ".xls",
}.get(media_type, "")
filename += ext
# Save the file
file_path = media_dir / filename
with open(file_path, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
if not os.path.exists(media_dir / id):
os.makedirs(media_dir / id, exist_ok=True)
urls.append({"url": url, "local_path": str(file_path)})
LOG.info(f"Downloaded {filename} to {file_path}")
file_path = media_dir / id / f"{filename}"
except Exception as e:
LOG.error(f"Error downloading {url}: {str(e)}")
continue
async with session.get(url) as response:
response.raise_for_status()
media_urls[media_type] = urls
with open(file_path, "wb") as f:
while True:
chunk = await response.content.read(8192)
if not chunk:
break
f.write(chunk)
urls.append({"url": url, "local_path": str(file_path)})
LOG.info(f"Downloaded {filename} to {file_path}")
except Exception as e:
LOG.error(f"Error downloading {url}: {str(e)}")
continue
media_urls[media_type] = urls
# Write summary
with open(base_dir / "download_summary.txt", "w") as f:
for media_type, downloads in media_urls.items():
if downloads:
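Review note: the hunk above rewrites root-relative URLs (`/img/a.png`) and bare relative URLs (`files/report.pdf`) against the site root before downloading. A minimal sketch of the same idea as one helper follows. `resolve_media_url` is a hypothetical name, not part of this commit, and unlike the regex in the diff it also resolves relative URLs that contain no slash.

```python
from urllib.parse import urljoin, urlparse


def resolve_media_url(page_url: str, raw_url: str) -> str:
    """Resolve a scraped src/href against the site root of page_url.

    Hypothetical helper for illustration; it mirrors the diff's choice of
    resolving against the site root rather than the page's own directory.
    """
    root = urlparse(page_url)
    site_root = f"{root.scheme}://{root.netloc}/"
    # urljoin returns absolute URLs unchanged and resolves relative ones.
    return urljoin(site_root, raw_url)


page = "https://example.com/docs/page.html"
print(resolve_media_url(page, "/img/a.png"))
print(resolve_media_url(page, "files/report.pdf"))
print(resolve_media_url(page, "https://cdn.example.org/v.mp4"))
```

Resolving against the site root matches what the commit does; resolving against `page_url` itself would be the browser-like alternative for page-relative links.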


@@ -0,0 +1,182 @@
# STL
import random
import logging
from typing import Any, cast
from urllib.parse import urljoin, urlparse
# PDM
from bs4 import Tag, BeautifulSoup
from lxml import etree
from camoufox import AsyncCamoufox
from playwright.async_api import Page
# LOCAL
from api.backend.constants import RECORDINGS_ENABLED
from api.backend.job.models import Element, CapturedElement
from api.backend.job.utils.text_utils import clean_text
from api.backend.job.scraping.add_custom import add_custom_items
from api.backend.job.scraping.scraping_utils import (
sxpath,
is_same_domain,
scrape_content,
)
from api.backend.job.site_mapping.site_mapping import handle_site_mapping
LOG = logging.getLogger("Job")
async def make_site_request(
id: str,
url: str,
job_options: dict[str, Any],
visited_urls: set[str] = set(),
pages: set[tuple[str, str]] = set(),
original_url: str = "",
):
headers = job_options["custom_headers"]
multi_page_scrape = job_options["multi_page_scrape"]
proxies = job_options["proxies"]
site_map = job_options["site_map"]
collect_media = job_options["collect_media"]
custom_cookies = job_options["custom_cookies"]
if url in visited_urls:
return
proxy = None
if proxies:
proxy = random.choice(proxies)
LOG.info(f"Using proxy: {proxy}")
async with AsyncCamoufox(headless=not RECORDINGS_ENABLED, proxy=proxy) as browser:
page: Page = await browser.new_page()
await page.set_viewport_size({"width": 1920, "height": 1080})
# Add cookies and headers
await add_custom_items(url, page, custom_cookies, headers)
LOG.info(f"Visiting URL: {url}")
try:
await page.goto(url, timeout=60000)
await page.wait_for_load_state("networkidle")
final_url = page.url
visited_urls.add(url)
visited_urls.add(final_url)
html_content = await scrape_content(id, page, pages, collect_media)
html_content = await page.content()
pages.add((html_content, final_url))
if site_map:
await handle_site_mapping(
id, site_map, page, pages, collect_media=collect_media
)
finally:
await page.close()
await browser.close()
if not multi_page_scrape:
return
soup = BeautifulSoup(html_content, "html.parser")
for a_tag in soup.find_all("a"):
if not isinstance(a_tag, Tag):
continue
link = cast(str, a_tag.get("href", ""))
if not link:
continue
if not urlparse(link).netloc:
base_url = "{0.scheme}://{0.netloc}".format(urlparse(final_url))
link = urljoin(base_url, link)
if link not in visited_urls and is_same_domain(link, original_url):
await make_site_request(
id,
link,
job_options=job_options,
visited_urls=visited_urls,
pages=pages,
original_url=original_url,
)
async def collect_scraped_elements(
page: tuple[str, str], xpaths: list[Element], return_html: bool
):
soup = BeautifulSoup(page[0], "lxml")
root = etree.HTML(str(soup))
elements: dict[str, list[CapturedElement]] = {}
for elem in xpaths:
el = sxpath(root, elem.xpath)
for e in el: # type: ignore
if return_html:
elements[elem.name] = [
CapturedElement(
xpath=elem.xpath,
text=page[0],
name=elem.name,
)
]
continue
text = (
" ".join(str(t) for t in e.itertext())
if isinstance(e, etree._Element)
else str(e) # type: ignore
)
text = clean_text(text)
captured_element = CapturedElement(
xpath=elem.xpath, text=text, name=elem.name
)
if elem.name in elements:
elements[elem.name].append(captured_element)
else:
elements[elem.name] = [captured_element]
return {page[1]: elements}
async def scrape(
id: str,
url: str,
xpaths: list[Element],
job_options: dict[str, Any],
):
visited_urls: set[str] = set()
pages: set[tuple[str, str]] = set()
await make_site_request(
id,
url,
job_options=job_options,
visited_urls=visited_urls,
pages=pages,
original_url=url,
)
elements: list[dict[str, dict[str, list[CapturedElement]]]] = []
for page in pages:
elements.append(
await collect_scraped_elements(
page, xpaths, job_options.get("return_html", False)
)
)
return elements
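Review note: `make_site_request` declares `visited_urls: set[str] = set()` and `pages: set[tuple[str, str]] = set()` as defaults. `scrape` always passes fresh sets, so this does not appear to cause trouble in the code shown, but a caller that relies on the defaults would share state across calls. A minimal sketch of the usual `None`-default pattern, using a hypothetical `visit` function rather than the real one:

```python
from typing import Optional


def visit(url: str, visited_urls: Optional[set[str]] = None) -> set[str]:
    """Record a URL as visited (hypothetical stand-in for make_site_request)."""
    # A fresh set is created per call unless the caller supplies one.
    if visited_urls is None:
        visited_urls = set()
    visited_urls.add(url)
    return visited_urls


first = visit("https://example.com/a")
second = visit("https://example.com/b")
print(first, second)
```

With `= set()` as the default, both calls would return the same set object containing both URLs.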


@@ -1,41 +1,58 @@
import time
from typing import cast
# STL
import asyncio
import logging
from typing import Set, Tuple
from urllib.parse import urlparse
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from api.backend.utils import LOG
# PDM
from lxml import etree
from playwright.async_api import Page
# LOCAL
from api.backend.job.scraping.collect_media import collect_media as collect_media_utils
LOG = logging.getLogger("Job")
def scrape_content(
driver: webdriver.Chrome, pages: set[tuple[str, str]], collect_media: bool
):
_ = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.TAG_NAME, "body"))
)
last_height = cast(str, driver.execute_script("return document.body.scrollHeight"))
async def scrape_content(
id: str, page: Page, pages: Set[Tuple[str, str]], collect_media: bool
) -> str:
last_height = await page.evaluate("document.body.scrollHeight")
while True:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(3) # Wait for the page to load
new_height = cast(
str, driver.execute_script("return document.body.scrollHeight")
)
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await asyncio.sleep(3)
new_height = await page.evaluate("document.body.scrollHeight")
if new_height == last_height:
break
last_height = new_height
pages.add((driver.page_source, driver.current_url))
html = await page.content()
pages.add((html, page.url))
if collect_media:
LOG.info("Collecting media")
collect_media_utils(driver)
await collect_media_utils(id, page)
return driver.page_source
return html
def is_same_domain(url: str, original_url: str) -> bool:
parsed_url = urlparse(url)
parsed_original_url = urlparse(original_url)
return parsed_url.netloc == parsed_original_url.netloc or parsed_url.netloc == ""
def clean_xpath(xpath: str) -> str:
parts = xpath.split("/")
clean_parts = ["/" if part == "" else part for part in parts]
clean_xpath = "//".join(clean_parts).replace("////", "//").replace("'", "\\'")
LOG.info(f"Cleaned xpath: {clean_xpath}")
return clean_xpath
def sxpath(context: etree._Element, xpath: str):
return context.xpath(xpath)
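Review note: `is_same_domain` decides which links the multi-page crawl follows. The sketch below restates the function from this diff with indentation restored and runs it on a few example URLs (the URLs are illustrative assumptions). The comparison is an exact `netloc` match, so a subdomain counts as a different domain, while links without a host are always accepted.

```python
from urllib.parse import urlparse


def is_same_domain(url: str, original_url: str) -> bool:
    parsed_url = urlparse(url)
    parsed_original_url = urlparse(original_url)
    # Links with no netloc (relative links) are treated as same-domain.
    return parsed_url.netloc == parsed_original_url.netloc or parsed_url.netloc == ""


start = "https://example.com"
print(is_same_domain("https://example.com/about", start))
print(is_same_domain("/contact", start))
print(is_same_domain("https://blog.example.com/post", start))
```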


@@ -1,25 +1,22 @@
from api.backend.job.models.site_map import Action, SiteMap
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from typing import Any
# STL
import asyncio
import logging
import time
from copy import deepcopy
from typing import Any
# PDM
from playwright.async_api import Page
# LOCAL
from api.backend.job.models.site_map import Action, SiteMap
from api.backend.job.scraping.scraping_utils import scrape_content
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire.inspect import TimeoutException
from seleniumwire.webdriver import Chrome
from selenium.webdriver.support import expected_conditions as EC
LOG = logging.getLogger(__name__)
LOG = logging.getLogger("Job")
def clear_done_actions(site_map: dict[str, Any]):
def clear_done_actions(site_map: dict[str, Any]) -> dict[str, Any]:
"""Clear all actions that have been clicked."""
cleared_site_map = deepcopy(site_map)
cleared_site_map["actions"] = [
action for action in cleared_site_map["actions"] if not action["do_once"]
]
@@ -27,43 +24,27 @@ def clear_done_actions(site_map: dict[str, Any]):
return cleared_site_map
def handle_input(action: Action, driver: webdriver.Chrome):
async def handle_input(action: Action, page: Page) -> bool:
try:
element = WebDriverWait(driver, 10).until(
EC.element_to_be_clickable((By.XPATH, action.xpath))
)
LOG.info(f"Sending keys: {action.input} to element: {element}")
element.send_keys(action.input)
except NoSuchElementException:
LOG.info(f"Element not found: {action.xpath}")
return False
except TimeoutException:
LOG.info(f"Timeout waiting for element: {action.xpath}")
return False
element = page.locator(f"xpath={action.xpath}")
LOG.info(f"Sending keys: {action.input} to element: {action.xpath}")
await element.fill(action.input)
return True
except Exception as e:
LOG.info(f"Error handling input: {e}")
LOG.warning(f"Error handling input for xpath '{action.xpath}': {e}")
return False
return True
def handle_click(action: Action, driver: webdriver.Chrome):
async def handle_click(action: Action, page: Page) -> bool:
try:
element = driver.find_element(By.XPATH, action.xpath)
LOG.info(f"Clicking element: {element}")
element.click()
except NoSuchElementException:
LOG.info(f"Element not found: {action.xpath}")
element = page.locator(f"xpath={action.xpath}")
LOG.info(f"Clicking element: {action.xpath}")
await element.click()
return True
except Exception as e:
LOG.warning(f"Error clicking element at xpath '{action.xpath}': {e}")
return False
return True
ACTION_MAP = {
"click": handle_click,
@@ -72,22 +53,28 @@ ACTION_MAP = {
async def handle_site_mapping(
id: str,
site_map_dict: dict[str, Any],
driver: Chrome,
page: Page,
pages: set[tuple[str, str]],
collect_media: bool = False,
):
site_map = SiteMap(**site_map_dict)
for action in site_map.actions:
action_handler = ACTION_MAP[action.type]
if not action_handler(action, driver):
success = await action_handler(action, page)
if not success:
return
time.sleep(2)
await asyncio.sleep(2)
_ = scrape_content(driver, pages)
await scrape_content(id, page, pages, collect_media=collect_media)
cleared_site_map_dict = clear_done_actions(site_map_dict)
if cleared_site_map_dict["actions"]:
await handle_site_mapping(cleared_site_map_dict, driver, pages)
await handle_site_mapping(
id, cleared_site_map_dict, page, pages, collect_media=collect_media
)
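Review note: `handle_site_mapping` recurses with the result of `clear_done_actions`, so actions marked `do_once` run only on the first pass and the rest repeat until a handler returns `False`. A minimal sketch of the filtering step, based only on the lines visible in this hunk; the sample site map and its XPaths are assumptions for illustration.

```python
from copy import deepcopy
from typing import Any


def clear_done_actions(site_map: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the site map without its do_once actions."""
    cleared_site_map = deepcopy(site_map)
    cleared_site_map["actions"] = [
        action for action in cleared_site_map["actions"] if not action["do_once"]
    ]
    return cleared_site_map


site_map = {
    "actions": [
        {"type": "click", "xpath": "//button[@id='accept']", "do_once": True},
        {"type": "click", "xpath": "//a[@rel='next']", "do_once": False},
    ]
}
remaining = clear_done_actions(site_map)
print(remaining["actions"])
```

Because of the `deepcopy`, the caller's `site_map` keeps both actions; only the copy passed to the next recursion is trimmed.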


@@ -0,0 +1,40 @@
# STL
from typing import Any
# LOCAL
from api.backend.job.utils.text_utils import clean_text
def clean_job_format(jobs: list[dict[str, Any]]) -> dict[str, Any]:
"""
Convert a list of jobs into a flat table: headers plus one row per captured element.
"""
headers = ["id", "url", "element_name", "xpath", "text", "user", "time_created"]
cleaned_rows = []
for job in jobs:
for res in job["result"]:
for url, elements in res.items():
for element_name, values in elements.items():
for value in values:
text = clean_text(value.get("text", "")).strip()
if text:
cleaned_rows.append(
{
"id": job.get("id", ""),
"url": url,
"element_name": element_name,
"xpath": value.get("xpath", ""),
"text": text,
"user": job.get("user", ""),
"time_created": job.get(
"time_created", ""
).isoformat(),
}
)
return {
"headers": headers,
"rows": cleaned_rows,
}


@@ -0,0 +1,26 @@
# STL
from typing import Any
# LOCAL
from api.backend.job.utils.text_utils import clean_text
def stream_md_from_job_results(jobs: list[dict[str, Any]]):
yield "# Job Results Summary\n\n"
for i, job in enumerate(jobs, start=1):
yield f"## Job #{i}\n"
yield f"- **Job URL:** {job.get('url', 'N/A')}\n"
yield f"- **Timestamp:** {job.get('time_created', 'N/A')}\n"
yield f"- **ID:** {job.get('id', 'N/A')}\n"
yield "### Extracted Results:\n"
for res in job.get("result", []):
for url, elements in res.items():
yield f"\n#### URL: {url}\n"
for element_name, values in elements.items():
for value in values:
text = clean_text(value.get("text", "")).strip()
if text:
yield f"- **Element:** `{element_name}`\n"
yield f" - **Text:** {text}\n"
yield "\n---\n"


@@ -0,0 +1,10 @@
def clean_text(text: str):
text = text.strip()
text = text.replace("\n", " ")
text = text.replace("\t", " ")
text = text.replace("\r", " ")
text = text.replace("\f", " ")
text = text.replace("\v", " ")
text = text.replace("\b", " ")
text = text.replace("\a", " ")
return text
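Review note: `clean_text` strips the ends first and then replaces each control character with a single space, so runs of whitespace inside the text are not collapsed. A compact restatement (a loop instead of the seven `replace` calls, otherwise the same behavior) with an illustrative input:

```python
def clean_text(text: str) -> str:
    text = text.strip()
    # Same characters as the diff: newline, tab, CR, form feed,
    # vertical tab, backspace, bell. Each becomes one space.
    for ch in ("\n", "\t", "\r", "\f", "\v", "\b", "\a"):
        text = text.replace(ch, " ")
    return text


cleaned = clean_text("  Price:\n\t$10\r\n")
print(repr(cleaned))  # 'Price:  $10' (two spaces: one per replaced character)
```

If single spacing matters for the CSV or Markdown export, something like `" ".join(text.split())` would collapse the runs, at the cost of changing the output of this commit.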


@@ -0,0 +1,31 @@
# STL
import logging
import traceback
from typing import Any, Union, Callable, Awaitable
from functools import wraps
# PDM
from fastapi.responses import JSONResponse
def handle_exceptions(
logger: logging.Logger,
) -> Callable[
[Callable[..., Awaitable[Any]]], Callable[..., Awaitable[Union[Any, JSONResponse]]]
]:
def decorator(
func: Callable[..., Awaitable[Any]],
) -> Callable[..., Awaitable[Union[Any, JSONResponse]]]:
@wraps(func)
async def wrapper(*args: Any, **kwargs: Any) -> Union[Any, JSONResponse]:
try:
return await func(*args, **kwargs)
except Exception as e:
logger.error(f"Exception occurred: {e}")
traceback.print_exc()
return JSONResponse(content={"error": str(e)}, status_code=500)
return wrapper
return decorator
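Review note: `handle_exceptions` is a decorator factory that wraps an async route handler and turns any exception into a 500 response. The sketch below shows the same pattern without FastAPI so it runs on the standard library alone; the plain dict is a stand-in (an assumption) for `JSONResponse(content=..., status_code=500)`, and the `ok` and `failing` handlers are hypothetical.

```python
import asyncio
import logging
from functools import wraps
from typing import Any, Awaitable, Callable


def handle_exceptions(logger: logging.Logger):
    def decorator(func: Callable[..., Awaitable[Any]]):
        @wraps(func)  # keeps the handler's name and signature visible
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                logger.error(f"Exception occurred: {e}")
                # Stand-in for JSONResponse(content={"error": ...}, status_code=500)
                return {"error": str(e), "status_code": 500}

        return wrapper

    return decorator


LOG = logging.getLogger("Example")


@handle_exceptions(logger=LOG)
async def ok() -> dict[str, int]:
    return {"value": 1}


@handle_exceptions(logger=LOG)
async def failing() -> None:
    raise ValueError("boom")


ok_result = asyncio.run(ok())
error_result = asyncio.run(failing())
print(ok_result, error_result)
```

In the routers in this diff the decorator sits below `@stats_router.get(...)`, so the route registers the wrapped function; `functools.wraps` is what should let the framework still see the original parameters.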


@@ -1,199 +0,0 @@
# STL
import datetime
import uuid
import traceback
from io import StringIO
import csv
import logging
import random
# PDM
from fastapi import Depends, APIRouter
from fastapi.encoders import jsonable_encoder
from fastapi.responses import JSONResponse, StreamingResponse
from api.backend.scheduler import scheduler
from apscheduler.triggers.cron import CronTrigger # type: ignore
# LOCAL
from api.backend.job import insert, update_job, delete_jobs
from api.backend.models import (
DeleteCronJob,
UpdateJobs,
DownloadJob,
DeleteScrapeJobs,
Job,
CronJob,
)
from api.backend.schemas import User
from api.backend.auth.auth_utils import get_current_user
from api.backend.utils import clean_text, format_list_for_query
from api.backend.job.models.job_options import FetchOptions
from api.backend.database.common import query
from api.backend.job.cron_scheduling.cron_scheduling import (
delete_cron_job,
get_cron_job_trigger,
insert_cron_job,
get_cron_jobs,
insert_job_from_cron_job,
)
LOG = logging.getLogger(__name__)
job_router = APIRouter()
@job_router.post("/update")
async def update(update_jobs: UpdateJobs, _: User = Depends(get_current_user)):
"""Used to update jobs"""
await update_job(update_jobs.ids, update_jobs.field, update_jobs.value)
@job_router.post("/submit-scrape-job")
async def submit_scrape_job(job: Job):
LOG.info(f"Received job: {job}")
try:
job.id = uuid.uuid4().hex
job_dict = job.model_dump()
insert(job_dict)
return JSONResponse(content={"id": job.id})
except Exception as e:
LOG.error(f"Exception occurred: {traceback.format_exc()}")
return JSONResponse(content={"error": str(e)}, status_code=500)
@job_router.post("/retrieve-scrape-jobs")
async def retrieve_scrape_jobs(
fetch_options: FetchOptions, user: User = Depends(get_current_user)
):
LOG.info(f"Retrieving jobs for account: {user.email}")
ATTRIBUTES = "chat" if fetch_options.chat else "*"
try:
job_query = f"SELECT {ATTRIBUTES} FROM jobs WHERE user = ?"
results = query(job_query, (user.email,))
return JSONResponse(content=jsonable_encoder(results[::-1]))
except Exception as e:
LOG.error(f"Exception occurred: {e}")
return JSONResponse(content=[], status_code=500)
@job_router.get("/job/{id}")
async def job(id: str, user: User = Depends(get_current_user)):
LOG.info(f"Retrieving jobs for account: {user.email}")
try:
job_query = "SELECT * FROM jobs WHERE user = ? AND id = ?"
results = query(job_query, (user.email, id))
return JSONResponse(content=jsonable_encoder(results))
except Exception as e:
LOG.error(f"Exception occurred: {e}")
return JSONResponse(content={"error": str(e)}, status_code=500)
@job_router.post("/download")
async def download(download_job: DownloadJob):
LOG.info(f"Downloading job with ids: {download_job.ids}")
try:
job_query = (
f"SELECT * FROM jobs WHERE id IN {format_list_for_query(download_job.ids)}"
)
results = query(job_query, tuple(download_job.ids))
csv_buffer = StringIO()
csv_writer = csv.writer(csv_buffer, quotechar='"', quoting=csv.QUOTE_ALL)
headers = ["id", "url", "element_name", "xpath", "text", "user", "time_created"]
csv_writer.writerow(headers)
for result in results:
for res in result["result"]:
for url, elements in res.items():
for element_name, values in elements.items():
for value in values:
text = clean_text(value.get("text", "")).strip()
if text:
csv_writer.writerow(
[
result.get("id", "")
+ "-"
+ str(random.randint(0, 1000000)),
url,
element_name,
value.get("xpath", ""),
text,
result.get("user", ""),
result.get("time_created", ""),
]
)
_ = csv_buffer.seek(0)
response = StreamingResponse(
csv_buffer,
media_type="text/csv",
)
response.headers["Content-Disposition"] = "attachment; filename=export.csv"
return response
except Exception as e:
LOG.error(f"Exception occurred: {e}")
traceback.print_exc()
return {"error": str(e)}
@job_router.post("/delete-scrape-jobs")
async def delete(delete_scrape_jobs: DeleteScrapeJobs):
result = await delete_jobs(delete_scrape_jobs.ids)
return (
JSONResponse(content={"message": "Jobs successfully deleted."})
if result
else JSONResponse({"error": "Jobs not deleted."})
)
@job_router.post("/schedule-cron-job")
async def schedule_cron_job(cron_job: CronJob):
if not cron_job.id:
cron_job.id = uuid.uuid4().hex
if not cron_job.time_created:
cron_job.time_created = datetime.datetime.now()
if not cron_job.time_updated:
cron_job.time_updated = datetime.datetime.now()
insert_cron_job(cron_job)
queried_job = query("SELECT * FROM jobs WHERE id = ?", (cron_job.job_id,))
scheduler.add_job(
insert_job_from_cron_job,
get_cron_job_trigger(cron_job.cron_expression),
id=cron_job.id,
args=[queried_job[0]],
)
return JSONResponse(content={"message": "Cron job scheduled successfully."})
@job_router.post("/delete-cron-job")
async def delete_cron_job_request(request: DeleteCronJob):
if not request.id:
return JSONResponse(
content={"error": "Cron job id is required."}, status_code=400
)
delete_cron_job(request.id, request.user_email)
scheduler.remove_job(request.id)
return JSONResponse(content={"message": "Cron job deleted successfully."})
@job_router.get("/cron-jobs")
async def get_cron_jobs_request(user: User = Depends(get_current_user)):
cron_jobs = get_cron_jobs(user.email)
return JSONResponse(content=jsonable_encoder(cron_jobs))


@@ -1,46 +0,0 @@
# STL
import logging
import docker
# PDM
from fastapi import APIRouter, HTTPException
from fastapi.responses import JSONResponse, StreamingResponse
LOG = logging.getLogger(__name__)
log_router = APIRouter()
client = docker.from_env()
@log_router.get("/initial_logs")
async def get_initial_logs():
container_id = "scraperr_api"
try:
container = client.containers.get(container_id)
log_stream = container.logs(stream=False).decode("utf-8")
return JSONResponse(content={"logs": log_stream})
except Exception as e:
raise HTTPException(status_code=500, detail=f"Unexpected error: {e}")
@log_router.get("/logs")
async def get_own_logs():
container_id = "scraperr_api"
try:
container = client.containers.get(container_id)
log_stream = container.logs(stream=True, follow=True)
def log_generator():
try:
for log in log_stream:
yield f"data: {log.decode('utf-8')}\n\n"
except Exception as e:
yield f"data: {str(e)}\n\n"
return StreamingResponse(log_generator(), media_type="text/event-stream")
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))


@@ -1,29 +0,0 @@
# STL
import logging
# PDM
from fastapi import APIRouter, Depends
# LOCAL
from api.backend.job import (
get_jobs_per_day,
average_elements_per_link,
)
from api.backend.auth.auth_utils import get_current_user
from api.backend.schemas import User
LOG = logging.getLogger(__name__)
stats_router = APIRouter()
@stats_router.get("/statistics/get-average-element-per-link")
async def get_average_element_per_link(user: User = Depends(get_current_user)):
return await average_elements_per_link(user.email)
@stats_router.get("/statistics/get-average-jobs-per-day")
async def average_jobs_per_day(user: User = Depends(get_current_user)):
data = await get_jobs_per_day(user.email)
return data


@@ -1,3 +1,4 @@
from apscheduler.schedulers.background import BackgroundScheduler # type: ignore
# PDM
from apscheduler.schedulers.asyncio import AsyncIOScheduler
scheduler = BackgroundScheduler()
scheduler = AsyncIOScheduler()


@@ -0,0 +1,17 @@
from typing import Optional, Union
from datetime import datetime
import pydantic
class CronJob(pydantic.BaseModel):
id: Optional[str] = None
user_email: str
job_id: str
cron_expression: str
time_created: Optional[Union[datetime, str]] = None
time_updated: Optional[Union[datetime, str]] = None
class DeleteCronJob(pydantic.BaseModel):
id: str
user_email: str


@@ -1,50 +1,9 @@
# STL
from typing import Any, Optional, Union
from typing import Any, Literal, Optional, Union
from datetime import datetime
# LOCAL
from api.backend.job.models.job_options import JobOptions
# PDM
import pydantic
class Element(pydantic.BaseModel):
name: str
xpath: str
url: Optional[str] = None
class CapturedElement(pydantic.BaseModel):
xpath: str
text: str
name: str
class RetrieveScrapeJobs(pydantic.BaseModel):
user: str
class DownloadJob(pydantic.BaseModel):
ids: list[str]
class DeleteScrapeJobs(pydantic.BaseModel):
ids: list[str]
class GetStatistics(pydantic.BaseModel):
user: str
class UpdateJobs(pydantic.BaseModel):
ids: list[str]
field: str
value: Any
class AI(pydantic.BaseModel):
messages: list[Any]
from api.backend.job.models import Element, CapturedElement
class Job(pydantic.BaseModel):
@@ -57,17 +16,25 @@ class Job(pydantic.BaseModel):
job_options: JobOptions
status: str = "Queued"
chat: Optional[str] = None
agent_mode: bool = False
prompt: Optional[str] = None
favorite: bool = False
class CronJob(pydantic.BaseModel):
id: Optional[str] = None
user_email: str
job_id: str
cron_expression: str
time_created: Optional[Union[datetime, str]] = None
time_updated: Optional[Union[datetime, str]] = None
class RetrieveScrapeJobs(pydantic.BaseModel):
user: str
class DeleteCronJob(pydantic.BaseModel):
id: str
user_email: str
class DownloadJob(pydantic.BaseModel):
ids: list[str]
job_format: Literal["csv", "md"]
class DeleteScrapeJobs(pydantic.BaseModel):
ids: list[str]
class UpdateJobs(pydantic.BaseModel):
ids: list[str]
field: str
value: Any


@@ -1,223 +0,0 @@
import logging
from typing import Any, Optional
import random
from bs4 import BeautifulSoup, Tag
from lxml import etree
from seleniumwire import webdriver
from lxml.etree import _Element
from fake_useragent import UserAgent
from selenium.webdriver.chrome.options import Options as ChromeOptions
from urllib.parse import urlparse, urljoin
from api.backend.models import Element, CapturedElement
from api.backend.job.site_mapping.site_mapping import (
handle_site_mapping,
)
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from api.backend.job.scraping.scraping_utils import scrape_content
LOG = logging.getLogger(__name__)
class HtmlElement(_Element): ...
def is_same_domain(url: str, original_url: str) -> bool:
parsed_url = urlparse(url)
parsed_original_url = urlparse(original_url)
return parsed_url.netloc == parsed_original_url.netloc or parsed_url.netloc == ""
def clean_xpath(xpath: str) -> str:
parts = xpath.split("/")
clean_parts: list[str] = []
for part in parts:
if part == "":
clean_parts.append("/")
else:
clean_parts.append(part)
clean_xpath = "//".join(clean_parts).replace("////", "//")
clean_xpath = clean_xpath.replace("'", "\\'")
LOG.info(f"Cleaned xpath: {clean_xpath}")
return clean_xpath
def sxpath(context: _Element, xpath: str) -> list[HtmlElement]:
return context.xpath(xpath) # pyright: ignore [reportReturnType]
def interceptor(headers: dict[str, Any]):
def _interceptor(request: Any):
for key, val in headers.items():
if request.headers.get(key):
del request.headers[key]
request.headers[key] = val
if "sec-ch-ua" in request.headers:
original_value = request.headers["sec-ch-ua"]
del request.headers["sec-ch-ua"]
modified_value = original_value.replace("HeadlessChrome", "Chrome")
request.headers["sec-ch-ua"] = modified_value
return _interceptor
def create_driver(proxies: Optional[list[str]] = []):
ua = UserAgent()
chrome_options = ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument(f"user-agent={ua.random}")
sw_options = {}
if proxies:
selected_proxy = random.choice(proxies)
LOG.info(f"Using proxy: {selected_proxy}")
sw_options = {
"proxy": {
"https": f"https://{selected_proxy}",
"http": f"http://{selected_proxy}",
"no_proxy": "localhost,127.0.0.1",
}
}
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(
service=service,
options=chrome_options,
seleniumwire_options=sw_options,
)
return driver
async def make_site_request(
url: str,
headers: Optional[dict[str, Any]],
multi_page_scrape: bool = False,
visited_urls: set[str] = set(),
pages: set[tuple[str, str]] = set(),
original_url: str = "",
proxies: Optional[list[str]] = [],
site_map: Optional[dict[str, Any]] = None,
collect_media: bool = False,
) -> None:
"""Make basic `GET` request to site using Selenium."""
# Check if URL has already been visited
if url in visited_urls:
return
driver = create_driver(proxies)
driver.implicitly_wait(10)
if headers:
driver.request_interceptor = interceptor(headers)
try:
LOG.info(f"Visiting URL: {url}")
driver.get(url)
final_url = driver.current_url
visited_urls.add(url)
visited_urls.add(final_url)
page_source = scrape_content(driver, pages, collect_media)
if site_map:
LOG.info("Site map: %s", site_map)
_ = await handle_site_mapping(
site_map,
driver,
pages,
)
finally:
driver.quit()
if not multi_page_scrape:
return
soup = BeautifulSoup(page_source, "html.parser")
for a_tag in soup.find_all("a"):
if not isinstance(a_tag, Tag):
continue
link = str(a_tag.get("href", ""))
if link:
if not urlparse(link).netloc:
base_url = "{0.scheme}://{0.netloc}".format(urlparse(final_url))
link = urljoin(base_url, link)
if link not in visited_urls and is_same_domain(link, original_url):
await make_site_request(
link,
headers=headers,
multi_page_scrape=multi_page_scrape,
visited_urls=visited_urls,
pages=pages,
original_url=original_url,
)
async def collect_scraped_elements(page: tuple[str, str], xpaths: list[Element]):
soup = BeautifulSoup(page[0], "lxml")
root = etree.HTML(str(soup))
elements: dict[str, list[CapturedElement]] = dict()
for elem in xpaths:
el = sxpath(root, elem.xpath)
for e in el:
if isinstance(e, etree._Element): # type: ignore
text = "\t".join(str(t) for t in e.itertext())
else:
text = str(e)
captured_element = CapturedElement(
xpath=elem.xpath, text=text, name=elem.name
)
if elem.name in elements:
elements[elem.name].append(captured_element)
continue
elements[elem.name] = [captured_element]
return {page[1]: elements}
async def scrape(
url: str,
xpaths: list[Element],
headers: Optional[dict[str, Any]],
multi_page_scrape: bool = False,
proxies: Optional[list[str]] = [],
site_map: Optional[dict[str, Any]] = None,
collect_media: bool = False,
):
visited_urls: set[str] = set()
pages: set[tuple[str, str]] = set()
_ = await make_site_request(
url,
headers,
multi_page_scrape=multi_page_scrape,
visited_urls=visited_urls,
pages=pages,
original_url=url,
proxies=proxies,
site_map=site_map,
collect_media=collect_media,
)
elements: list[dict[str, dict[str, list[CapturedElement]]]] = list()
for page in pages:
elements.append(await collect_scraped_elements(page, xpaths))
return elements


@@ -0,0 +1,37 @@
# STL
import logging
# PDM
from fastapi import Depends, APIRouter
from sqlalchemy.ext.asyncio import AsyncSession
# LOCAL
from api.backend.auth.schemas import User
from api.backend.database.base import get_db
from api.backend.auth.auth_utils import get_current_user
from api.backend.routers.handle_exceptions import handle_exceptions
from api.backend.database.queries.statistics.statistic_queries import (
get_jobs_per_day,
average_elements_per_link,
)
LOG = logging.getLogger("Statistics")
stats_router = APIRouter()
@stats_router.get("/statistics/get-average-element-per-link")
@handle_exceptions(logger=LOG)
async def get_average_element_per_link(
user: User = Depends(get_current_user), db: AsyncSession = Depends(get_db)
):
return await average_elements_per_link(db, user.email)
@stats_router.get("/statistics/get-average-jobs-per-day")
@handle_exceptions(logger=LOG)
async def average_jobs_per_day(
user: User = Depends(get_current_user), db: AsyncSession = Depends(get_db)
):
data = await get_jobs_per_day(db, user.email)
return data

@@ -0,0 +1,108 @@
# STL
import os
import asyncio
from typing import Any, Generator, AsyncGenerator
# PDM
import pytest
import pytest_asyncio
from httpx import AsyncClient, ASGITransport
from proxy import Proxy
from sqlalchemy import text
from sqlalchemy.pool import NullPool
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
# LOCAL
from api.backend.app import app
from api.backend.database.base import get_db
from api.backend.database.models import Base
from api.backend.tests.constants import TEST_DB_PATH
@pytest.fixture(scope="session", autouse=True)
def running_proxy():
proxy = Proxy(["--hostname", "127.0.0.1", "--port", "8080"])
proxy.setup()
yield proxy
proxy.shutdown()
@pytest.fixture(scope="session")
def test_db_path() -> str:
return TEST_DB_PATH
@pytest.fixture(scope="session", autouse=True)
def test_db(test_db_path: str) -> Generator[str, None, None]:
"""Create a fresh test database once per test session."""
os.makedirs(os.path.dirname(test_db_path), exist_ok=True)
if os.path.exists(test_db_path):
os.remove(test_db_path)
# Create async engine for test database
test_db_url = f"sqlite+aiosqlite:///{test_db_path}"
engine = create_async_engine(test_db_url, echo=False)
async def setup_db():
async with engine.begin() as conn:
# Create tables
# LOCAL
from api.backend.database.models import Base
await conn.run_sync(Base.metadata.create_all)
# Run setup
asyncio.run(setup_db())
yield test_db_path
if os.path.exists(test_db_path):
os.remove(test_db_path)
@pytest_asyncio.fixture(scope="session")
async def test_engine():
test_db_url = f"sqlite+aiosqlite:///{TEST_DB_PATH}"
engine = create_async_engine(test_db_url, poolclass=NullPool)
async with engine.begin() as conn:
await conn.run_sync(Base.metadata.create_all)
yield engine
await engine.dispose()
@pytest_asyncio.fixture(scope="function")
async def db_session(test_engine: Any) -> AsyncGenerator[AsyncSession, None]:
async_session = async_sessionmaker(
bind=test_engine,
class_=AsyncSession,
expire_on_commit=False,
)
async with async_session() as session:
try:
yield session
finally:
# Truncate all tables after each test
for table in reversed(Base.metadata.sorted_tables):
await session.execute(text(f"DELETE FROM {table.name}"))
await session.commit()
@pytest.fixture()
def override_get_db(db_session: AsyncSession):
async def _override() -> AsyncGenerator[AsyncSession, None]:
yield db_session
return _override
@pytest_asyncio.fixture()
async def client(override_get_db: Any) -> AsyncGenerator[AsyncClient, None]:
app.dependency_overrides[get_db] = override_get_db
transport = ASGITransport(app=app)
async with AsyncClient(transport=transport, base_url="http://test") as c:
yield c
app.dependency_overrides.clear()

@@ -0,0 +1 @@
TEST_DB_PATH = "tests/test_db.sqlite"

@@ -1,7 +1,13 @@
from api.backend.models import Element, Job, JobOptions, CapturedElement
# STL
import uuid
# PDM
from faker import Faker
# LOCAL
from api.backend.job.models import Element, JobOptions, CapturedElement
from api.backend.schemas.job import Job
fake = Faker()

@@ -1,40 +1,65 @@
# STL
import random
from datetime import datetime, timezone
# PDM
import pytest
from fastapi.testclient import TestClient
from unittest.mock import AsyncMock, patch
from api.backend.app import app
from api.backend.models import DownloadJob
from api.backend.tests.factories.job_factory import create_completed_job
from httpx import AsyncClient
from sqlalchemy.ext.asyncio import AsyncSession
client = TestClient(app)
# LOCAL
from api.backend.schemas.job import DownloadJob
from api.backend.database.models import Job
mocked_job = create_completed_job().model_dump()
mock_results = [mocked_job]
mocked_random_int = 123456
@pytest.mark.asyncio
@patch("api.backend.routers.job_router.query")
@patch("api.backend.routers.job_router.random.randint")
async def test_download(mock_randint: AsyncMock, mock_query: AsyncMock):
# Ensure the mock returns immediately
mock_query.return_value = mock_results
mock_randint.return_value = mocked_random_int
async def test_download(client: AsyncClient, db_session: AsyncSession):
# Insert a test job into the DB
job_id = "test-job-id"
test_job = Job(
id=job_id,
url="https://example.com",
elements=[],
user="test@example.com",
time_created=datetime.now(timezone.utc),
result=[
{
"https://example.com": {
"element_name": [{"xpath": "//div", "text": "example"}]
}
}
],
status="Completed",
chat=None,
job_options={},
agent_mode=False,
prompt="",
favorite=False,
)
db_session.add(test_job)
await db_session.commit()
# Create a DownloadJob instance
download_job = DownloadJob(ids=[mocked_job["id"]])
# Force predictable randint
random.seed(0)
# Make a POST request to the /download endpoint
response = client.post("/download", json=download_job.model_dump())
# Build request
download_job = DownloadJob(ids=[job_id], job_format="csv")
response = await client.post("/download", json=download_job.model_dump())
# Assertions
assert response.status_code == 200
assert response.headers["Content-Disposition"] == "attachment; filename=export.csv"
# Check the content of the CSV
# Validate CSV contents
csv_content = response.content.decode("utf-8")
expected_csv = (
f'"id","url","element_name","xpath","text","user","time_created"\r\n'
f'"{mocked_job["id"]}-{mocked_random_int}","https://example.com","element_name","//div","example",'
f'"{mocked_job["user"]}","{mocked_job["time_created"]}"\r\n'
lines = csv_content.strip().split("\n")
assert (
lines[0].strip()
== '"id","url","element_name","xpath","text","user","time_created"'
)
assert csv_content == expected_csv
assert '"https://example.com"' in lines[1]
assert '"element_name"' in lines[1]
assert '"//div"' in lines[1]
assert '"example"' in lines[1]
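
The new assertions look for quoted substrings in the raw CSV lines. An alternative, sketched below, is to parse the response body with the standard `csv` module and compare fields by column name. This is a sketch under assumptions: `parse_csv_export` and `build_sample_export` are hypothetical helpers, and the sample row is made-up illustrative data that only mirrors the header asserted in the test.

```python
import csv
import io

HEADER = ["id", "url", "element_name", "xpath", "text", "user", "time_created"]


def build_sample_export() -> str:
    """Build a fully quoted CSV string shaped like the export (illustrative data only)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerow(HEADER)
    writer.writerow(
        ["test-job-id", "https://example.com", "element_name", "//div",
         "example", "test@example.com", "2025-01-01T00:00:00"]
    )
    return buffer.getvalue()


def parse_csv_export(csv_content: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per data row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_content, newline=""))
    return list(reader)
```

In the test this would allow checks such as `rows[0]["xpath"] == "//div"`, which do not depend on quoting or line-ending details.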

@@ -1,27 +1,117 @@
import pytest
# STL
import logging
from unittest.mock import AsyncMock, patch, MagicMock
from api.backend.scraping import create_driver
from typing import Dict
from datetime import datetime
# PDM
import pytest
from httpx import AsyncClient
from sqlalchemy import select
from fastapi.testclient import TestClient
from playwright.async_api import Route, Cookie, async_playwright
from sqlalchemy.ext.asyncio import AsyncSession
# LOCAL
from api.backend.app import app
from api.backend.job.models import Proxy, Element, JobOptions
from api.backend.schemas.job import Job
from api.backend.database.models import Job as JobModel
from api.backend.job.scraping.add_custom import add_custom_items
logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger(__name__)
client = TestClient(app)
@pytest.mark.asyncio
@patch("seleniumwire.webdriver.Chrome.get")
async def test_proxy(mock_get: AsyncMock):
# Mock the response of the requests.get call
mock_response = MagicMock()
mock_get.return_value = mock_response
async def test_add_custom_items():
test_cookies = [{"name": "big", "value": "cookie"}]
test_headers = {"User-Agent": "test-agent", "Accept": "application/json"}
driver = create_driver(proxies=["127.0.0.1:8080"])
assert driver is not None
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
# Simulate a request
driver.get("http://example.com")
response = driver.last_request
# Set up request interception
captured_headers: Dict[str, str] = {}
if response:
assert response.headers["Proxy-Connection"] == "keep-alive"
async def handle_route(route: Route) -> None:
nonlocal captured_headers
captured_headers = route.request.headers
await route.continue_()
driver.quit()
await page.route("**/*", handle_route)
await add_custom_items(
url="http://example.com",
page=page,
cookies=test_cookies,
headers=test_headers,
)
# Navigate to example.com
await page.goto("http://example.com")
# Verify cookies were added
cookies: list[Cookie] = await page.context.cookies()
test_cookie = next((c for c in cookies if c.get("name") == "big"), None)
assert test_cookie is not None
assert test_cookie.get("value") == "cookie"
assert test_cookie.get("path") == "/" # Default path should be set
assert test_cookie.get("sameSite") == "Lax" # Default sameSite should be set
# Verify headers were added
assert captured_headers.get("user-agent") == "test-agent"
await browser.close()
@pytest.mark.asyncio
async def test_proxies(client: AsyncClient, db_session: AsyncSession):
job = Job(
url="https://example.com",
elements=[Element(xpath="//div", name="test")],
job_options=JobOptions(
proxies=[
Proxy(
server="127.0.0.1:8080",
username="user",
password="pass",
)
],
),
time_created=datetime.now().isoformat(),
)
response = await client.post("/submit-scrape-job", json=job.model_dump())
assert response.status_code == 200
stmt = select(JobModel)
result = await db_session.execute(stmt)
jobs = result.scalars().all()
assert len(jobs) > 0
job_from_db = jobs[0]
job_dict = job_from_db.__dict__
job_dict.pop("_sa_instance_state", None)
assert job_dict is not None
print(job_dict)
assert job_dict["job_options"]["proxies"] == [
{
"server": "127.0.0.1:8080",
"username": "user",
"password": "pass",
}
]
# Verify the job was stored correctly in the database
assert job_dict["url"] == "https://example.com"
assert job_dict["status"] == "Queued"
assert len(job_dict["elements"]) == 1
assert job_dict["elements"][0]["xpath"] == "//div"
assert job_dict["elements"][0]["name"] == "test"

@@ -0,0 +1,17 @@
# STL
import sqlite3
# LOCAL
from api.backend.database.schema import INIT_QUERY
from api.backend.tests.constants import TEST_DB_PATH
def connect_to_db():
conn = sqlite3.connect(TEST_DB_PATH)
cur = conn.cursor()
for query in INIT_QUERY.split(";"):
cur.execute(query)
conn.commit()
return conn, cur
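
`connect_to_db` splits `INIT_QUERY` on `;` and executes each piece. The standard library's `sqlite3` also offers `executescript`, which runs a multi-statement script directly and so avoids manual splitting (which would break if a statement ever contained a semicolon inside a string literal). A minimal sketch, assuming a hypothetical two-table schema in place of the project's real `INIT_QUERY` and an in-memory database instead of `TEST_DB_PATH`:

```python
import sqlite3

# Hypothetical schema standing in for the project's INIT_QUERY (assumption).
INIT_QUERY = """
CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, url TEXT);
CREATE TABLE IF NOT EXISTS users (email TEXT PRIMARY KEY);
"""


def connect_to_db(path: str = ":memory:"):
    """Open a SQLite connection and apply the schema script in one call."""
    conn = sqlite3.connect(path)
    # executescript handles the statement boundaries itself.
    conn.executescript(INIT_QUERY)
    conn.commit()
    return conn, conn.cursor()
```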

@@ -1,17 +1,10 @@
from typing import Any, Optional
# STL
import logging
import json
from typing import Optional
LOG = logging.getLogger(__name__)
def clean_text(text: str):
text = text.replace("\r\n", "\n") # Normalize newlines
text = text.replace("\n", "\\n") # Escape newlines
text = text.replace('"', '\\"') # Escape double quotes
return text
def get_log_level(level_name: Optional[str]) -> int:
level = logging.INFO
@@ -20,30 +13,3 @@ def get_log_level(level_name: Optional[str]) -> int:
level = getattr(logging, level_name, logging.INFO)
return level
def format_list_for_query(ids: list[str]):
return (
f"({','.join(['?' for _ in ids])})" # Returns placeholders, e.g., "(?, ?, ?)"
)
def format_sql_row_to_python(row: dict[str, Any]):
new_row: dict[str, Any] = {}
for key, value in row.items():
if isinstance(value, str):
try:
new_row[key] = json.loads(value)
except json.JSONDecodeError:
new_row[key] = value
else:
new_row[key] = value
return new_row
def format_json(items: list[Any]):
for idx, item in enumerate(items):
if isinstance(item, (dict, list)):
formatted_item = json.dumps(item)
items[idx] = formatted_item

@@ -0,0 +1,17 @@
# STL
import os
from pathlib import Path
NOTIFICATION_CHANNEL = os.getenv("NOTIFICATION_CHANNEL", "")
NOTIFICATION_WEBHOOK_URL = os.getenv("NOTIFICATION_WEBHOOK_URL", "")
SCRAPERR_FRONTEND_URL = os.getenv("SCRAPERR_FRONTEND_URL", "")
EMAIL = os.getenv("EMAIL", "")
TO = os.getenv("TO", "")
SMTP_HOST = os.getenv("SMTP_HOST", "")
SMTP_PORT = int(os.getenv("SMTP_PORT", 587))
SMTP_USER = os.getenv("SMTP_USER", "")
SMTP_PASSWORD = os.getenv("SMTP_PASSWORD", "")
USE_TLS = os.getenv("USE_TLS", "false").lower() == "true"
RECORDINGS_ENABLED = os.getenv("RECORDINGS_ENABLED", "true").lower() == "true"
RECORDINGS_DIR = Path("/project/app/media/recordings")
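
The constants above repeat two parsing patterns: a boolean flag read as `value.lower() == "true"` and an integer with a fallback. They could be factored into small helpers, as in the sketch below. `env_flag` and `env_int` are hypothetical names that are not part of the repository, and the sketch differs slightly from the original in that it strips surrounding whitespace and treats an empty string as unset for integers.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Return True only when the variable is set to 'true' (case-insensitive)."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def env_int(name: str, default: int) -> int:
    """Return the variable parsed as an int, or the default when unset or empty."""
    raw = os.getenv(name)
    return int(raw) if raw else default
```

With these, a line such as `USE_TLS = env_flag("USE_TLS")` or `SMTP_PORT = env_int("SMTP_PORT", 587)` would express the same intent.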

@@ -1,48 +1,88 @@
import os
from api.backend.job import get_queued_job, update_job
from api.backend.scraping import scrape
from api.backend.models import Element
from fastapi.encoders import jsonable_encoder
# STL
import json
import asyncio
import traceback
import subprocess
from api.backend.database.startup import init_database
# PDM
from fastapi.encoders import jsonable_encoder
from api.backend.worker.post_job_complete.post_job_complete import post_job_complete
# LOCAL
from api.backend.job import update_job, get_queued_job
from api.backend.job.models import Element
from api.backend.worker.logger import LOG
NOTIFICATION_CHANNEL = os.getenv("NOTIFICATION_CHANNEL", "")
NOTIFICATION_WEBHOOK_URL = os.getenv("NOTIFICATION_WEBHOOK_URL", "")
SCRAPERR_FRONTEND_URL = os.getenv("SCRAPERR_FRONTEND_URL", "")
EMAIL = os.getenv("EMAIL", "")
TO = os.getenv("TO", "")
SMTP_HOST = os.getenv("SMTP_HOST", "")
SMTP_PORT = int(os.getenv("SMTP_PORT", 587))
SMTP_USER = os.getenv("SMTP_USER", "")
SMTP_PASSWORD = os.getenv("SMTP_PASSWORD", "")
USE_TLS = os.getenv("USE_TLS", "false").lower() == "true"
from api.backend.ai.agent.agent import scrape_with_agent
from api.backend.worker.constants import (
TO,
EMAIL,
USE_TLS,
SMTP_HOST,
SMTP_PORT,
SMTP_USER,
SMTP_PASSWORD,
RECORDINGS_DIR,
RECORDINGS_ENABLED,
NOTIFICATION_CHANNEL,
SCRAPERR_FRONTEND_URL,
NOTIFICATION_WEBHOOK_URL,
)
from api.backend.job.scraping.scraping import scrape
from api.backend.worker.post_job_complete.post_job_complete import post_job_complete
async def process_job():
job = await get_queued_job()
ffmpeg_proc = None
status = "Queued"
if job:
LOG.info(f"Beginning processing job: {job}.")
try:
output_path = RECORDINGS_DIR / f"{job['id']}.mp4"
if RECORDINGS_ENABLED:
ffmpeg_proc = subprocess.Popen(
[
"ffmpeg",
"-y",
"-video_size",
"1280x1024",
"-framerate",
"15",
"-f",
"x11grab",
"-i",
":99",
"-codec:v",
"libx264",
"-preset",
"ultrafast",
output_path,
]
)
_ = await update_job([job["id"]], field="status", value="Scraping")
scraped = await scrape(
job["url"],
[Element(**j) for j in job["elements"]],
job["job_options"]["custom_headers"],
job["job_options"]["multi_page_scrape"],
job["job_options"]["proxies"],
job["job_options"]["site_map"],
job["job_options"]["collect_media"],
)
proxies = job["job_options"]["proxies"]
if proxies and isinstance(proxies[0], str) and proxies[0].startswith("{"):
try:
proxies = [json.loads(p) for p in proxies]
except json.JSONDecodeError:
LOG.error(f"Failed to parse proxy JSON: {proxies}")
proxies = []
if job["agent_mode"]:
scraped = await scrape_with_agent(job)
else:
scraped = await scrape(
job["id"],
job["url"],
[Element(**j) for j in job["elements"]],
{**job["job_options"], "proxies": proxies},
)
LOG.info(
f"Scraped result for url: {job['url']}, with elements: {job['elements']}\n{scraped}"
)
@@ -75,11 +115,15 @@ async def process_job():
},
)
if ffmpeg_proc:
ffmpeg_proc.terminate()
ffmpeg_proc.wait()
async def main():
LOG.info("Starting job worker...")
init_database()
RECORDINGS_DIR.mkdir(parents=True, exist_ok=True)
while True:
await process_job()
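
The proxy handling inside `process_job` (decode the entries when they arrive as JSON strings, fall back to an empty list on a parse error) can be isolated into a pure function that is easy to unit test. A minimal sketch follows; `normalize_proxies` is a hypothetical helper name, and unlike the inline code it returns an empty list when `proxies` is `None`.

```python
import json
import logging
from typing import Any, Optional

LOG = logging.getLogger(__name__)


def normalize_proxies(proxies: Optional[list[Any]]) -> list[Any]:
    """Decode JSON-encoded proxy entries; pass already-decoded entries through."""
    if not proxies:
        return []
    # Only attempt decoding when the first entry looks like a JSON object string.
    if isinstance(proxies[0], str) and proxies[0].startswith("{"):
        try:
            return [json.loads(p) for p in proxies]
        except json.JSONDecodeError:
            LOG.error("Failed to parse proxy JSON: %s", proxies)
            return []
    return proxies
```

For example, `normalize_proxies(['{"server": "127.0.0.1:8080"}'])` yields a list with one dict, while a plain `"127.0.0.1:8080"` string is returned unchanged.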

@@ -1,5 +1,13 @@
# STL
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
LOG = logging.getLogger(__name__)
# LOCAL
from api.backend.app import LOG_LEVEL
logging.basicConfig(
level=LOG_LEVEL,
format="%(levelname)s: %(asctime)s - [%(name)s] - %(message)s",
handlers=[logging.StreamHandler()],
)
LOG = logging.getLogger("Job Worker")

@@ -1,5 +1,7 @@
# STL
from typing import Any
# LOCAL
from api.backend.worker.post_job_complete.models import PostJobCompleteOptions
from api.backend.worker.post_job_complete.email_notifcation import (
send_job_complete_email,
@@ -10,12 +12,16 @@ from api.backend.worker.post_job_complete.discord_notification import (
async def post_job_complete(job: dict[str, Any], options: PostJobCompleteOptions):
if options["channel"] == "":
return
if not options.values():
return
if options["channel"] == "discord":
discord_notification(job, options)
elif options["channel"] == "email":
send_job_complete_email(job, options)
else:
raise ValueError(f"Invalid channel: {options['channel']}")
match options["channel"]:
case "discord":
discord_notification(job, options)
case "email":
send_job_complete_email(job, options)
case _:
raise ValueError(f"Invalid channel: {options['channel']}")
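
The channel dispatch in `post_job_complete` moves from an `if`/`elif` chain to a `match` statement. A table-driven variant is another option and is sketched below for comparison only. The handler functions here are hypothetical stand-ins that return strings; they are not the project's `discord_notification` or `send_job_complete_email`.

```python
from typing import Any, Callable, Optional


def notify_discord(job: dict[str, Any]) -> str:
    # Stand-in for the real Discord notifier (assumption).
    return f"discord:{job['id']}"


def notify_email(job: dict[str, Any]) -> str:
    # Stand-in for the real email notifier (assumption).
    return f"email:{job['id']}"


HANDLERS: dict[str, Callable[[dict[str, Any]], str]] = {
    "discord": notify_discord,
    "email": notify_email,
}


def dispatch(channel: str, job: dict[str, Any]) -> Optional[str]:
    """Route a completed job to the handler registered for the channel."""
    if channel == "":
        return None  # Notifications are disabled.
    handler = HANDLERS.get(channel)
    if handler is None:
        raise ValueError(f"Invalid channel: {channel}")
    return handler(job)
```

Adding a channel then means registering one more entry in `HANDLERS` rather than extending the branch logic.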

@@ -0,0 +1,23 @@
describe("Global setup", () => {
it("signs up user once", () => {
cy.request({
method: "POST",
url: "/api/signup",
body: JSON.stringify({
data: {
email: "test@test.com",
password: "password",
full_name: "John Doe",
},
}),
headers: {
"Content-Type": "application/json",
},
failOnStatusCode: false,
}).then((response) => {
if (response.status !== 200 && response.status !== 201) {
console.warn("Signup failed:", response.status, response.body);
}
});
});
});

@@ -0,0 +1,101 @@
import { login } from "../utilities/authentication.utils";
import {
addCustomHeaders,
addElement,
addMedia,
addSiteMapAction,
checkForMedia,
cleanUpJobs,
enterJobUrl,
openAdvancedJobOptions,
submitBasicJob,
submitJob,
waitForJobCompletion,
} from "../utilities/job.utilities";
import { mockSubmitJob } from "../utilities/mocks";
describe.only("Advanced Job Options", () => {
beforeEach(() => {
mockSubmitJob();
login();
cy.visit("/");
});
afterEach(() => {
cleanUpJobs();
});
it.only("should handle custom headers", () => {
const customHeaders = {
"User-Agent": "Test Agent",
"Accept-Language": "en-US",
};
addCustomHeaders(customHeaders);
submitBasicJob("https://httpbin.org/headers", "headers", "//pre");
cy.wait("@submitScrapeJob").then((interception) => {
expect(interception.response?.statusCode).to.eq(200);
expect(
interception.request?.body.data.job_options.custom_headers
).to.deep.equal(customHeaders);
});
waitForJobCompletion("https://httpbin.org/headers");
});
it("should handle site map actions", () => {
addSiteMapAction("click", "//button[contains(text(), 'Load More')]");
addSiteMapAction("input", "//input[@type='search']", "test search");
submitBasicJob("https://example.com", "content", "//div[@class='content']");
cy.wait("@submitScrapeJob").then((interception) => {
expect(interception.response?.statusCode).to.eq(200);
const siteMap = interception.request?.body.data.job_options.site_map;
expect(siteMap.actions).to.have.length(2);
expect(siteMap.actions[0].type).to.equal("click");
expect(siteMap.actions[1].type).to.equal("input");
});
waitForJobCompletion("https://example.com");
});
it("should handle multiple elements", () => {
enterJobUrl("https://books.toscrape.com");
addElement("titles", "//h3");
addElement("prices", "//p[@class='price_color']");
submitJob();
cy.wait("@submitScrapeJob").then((interception) => {
expect(interception.response?.statusCode).to.eq(200);
expect(interception.request?.body.data.elements).to.have.length(2);
});
waitForJobCompletion("https://books.toscrape.com");
});
it.only("should handle collecting media", () => {
enterJobUrl("https://books.toscrape.com");
openAdvancedJobOptions();
addMedia();
cy.get("body").type("{esc}");
addElement("images", "//img");
submitJob();
cy.wait("@submitScrapeJob").then((interception) => {
expect(interception.response?.statusCode).to.eq(200);
expect(interception.request?.body.data.job_options.collect_media).to.be
.true;
});
waitForJobCompletion("https://books.toscrape.com");
checkForMedia();
});
});

cypress/e2e/agent.cy.ts

@@ -0,0 +1,38 @@
import { login } from "../utilities/authentication.utils";
import {
buildAgentJob,
cleanUpJobs,
submitJob,
waitForJobCompletion,
} from "../utilities/job.utilities";
import { mockSubmitJob } from "../utilities/mocks";
describe("Agent", () => {
beforeEach(() => {
mockSubmitJob();
login();
});
afterEach(() => {
cleanUpJobs();
});
it("should be able to scrape some data", () => {
cy.visit("/agent");
cy.wait(1000);
const url = "https://books.toscrape.com";
const prompt = "Collect all the links on the page";
buildAgentJob(url, prompt);
submitJob();
cy.wait("@submitScrapeJob").then((interception) => {
expect(interception.response?.statusCode).to.eq(200);
expect(interception.request?.body.data.url).to.eq(url);
expect(interception.request?.body.data.prompt).to.eq(prompt);
});
waitForJobCompletion("https://books.toscrape.com");
});
});

@@ -1,60 +1,61 @@
describe("Authentication", () => {
it("should register", () => {
cy.intercept("POST", "/api/signup").as("signup");
import { faker } from "@faker-js/faker";
import { mockLogin, mockSignup } from "../utilities/mocks";
cy.visit("/").then(() => {
cy.get("button").contains("Login").click();
cy.url().should("include", "/login");
const mockEmail = faker.internet.email();
const mockPassword = faker.internet.password();
cy.get("form").should("be.visible");
cy.get("button")
.contains("No Account? Sign up")
.should("be.visible")
.click();
cy.get("input[name='email']").type("test@test.com");
cy.get("input[name='password']").type("password");
cy.get("input[name='fullName']").type("John Doe");
cy.get("button[type='submit']").contains("Signup").click();
cy.wait("@signup").then((interception) => {
if (!interception.response) {
cy.log("No response received!");
throw new Error("signup request did not return a response");
}
cy.log("Response status: " + interception.response.statusCode);
cy.log("Response body: " + JSON.stringify(interception.response.body));
expect(interception.response.statusCode).to.eq(200);
});
});
describe.only("Authentication", () => {
beforeEach(() => {
cy.visit("/");
mockSignup();
mockLogin();
});
it("should login", () => {
cy.intercept("POST", "/api/token").as("token");
it("should register", () => {
cy.get("button").contains("Login").click();
cy.url().should("include", "/login");
cy.visit("/").then(() => {
cy.get("button")
.contains("Login")
.click()
.then(() => {
cy.get("input[name='email']").type("test@test.com");
cy.get("input[name='password']").type("password");
cy.get("button[type='submit']").contains("Login").click();
cy.get("form").should("be.visible");
cy.wait("@token").then((interception) => {
if (!interception.response) {
cy.log("No response received!");
throw new Error("token request did not return a response");
}
cy.get("button")
.contains("No Account? Sign up")
.should("be.visible")
.click();
cy.log("Response status: " + interception.response.statusCode);
cy.log("Response body: " + JSON.stringify(interception.response.body));
cy.get("input[name='email']").type(mockEmail);
cy.get("input[name='password']").type(mockPassword);
cy.get("input[name='fullName']").type(faker.person.fullName());
cy.get("button[type='submit']").contains("Signup").click();
expect(interception.response.statusCode).to.eq(200);
});
});
cy.wait("@signup").then((interception) => {
if (!interception.response) {
throw new Error("signup request did not return a response");
}
expect(interception.response.statusCode).to.eq(200);
});
});
});
it("should login", () => {
cy.intercept("POST", "/api/token").as("token");
cy.visit("/").then(() => {
cy.get("button")
.contains("Login")
.click()
.then(() => {
cy.get("input[name='email']").type(mockEmail);
cy.get("input[name='password']").type(mockPassword);
cy.get("button[type='submit']").contains("Login").click();
cy.wait("@token").then((interception) => {
if (!interception.response) {
throw new Error("token request did not return a response");
}
expect(interception.response.statusCode).to.eq(200);
});
});
});
});

cypress/e2e/chat.cy.ts

@@ -0,0 +1,34 @@
import { login } from "../utilities/authentication.utils";
import {
cleanUpJobs,
selectJobFromSelector,
submitBasicJob,
waitForJobCompletion,
} from "../utilities/job.utilities";
import { mockLogin } from "../utilities/mocks";
describe.only("Chat", () => {
beforeEach(() => {
mockLogin();
login();
cy.visit("/");
});
afterEach(() => {
cleanUpJobs();
});
it.only("should be able to chat", () => {
const url = "https://books.toscrape.com";
submitBasicJob(url, "test", "//body");
waitForJobCompletion(url);
cy.visit("/chat");
selectJobFromSelector();
cy.get("[data-cy='message-input']").type("Hello");
cy.get("[data-cy='send-message']").click();
cy.get("[data-cy='ai-message']").should("exist");
});
});

@@ -1,34 +1,37 @@
import { login } from "../utilities/authentication.utils";
import {
addElement,
cleanUpJobs,
enterJobUrl,
submitJob,
waitForJobCompletion,
} from "../utilities/job.utilities";
import { mockSubmitJob } from "../utilities/mocks";
describe.only("Job", () => {
it("should create a job", () => {
cy.intercept("POST", "/api/submit-scrape-job").as("submitScrapeJob");
beforeEach(() => {
mockSubmitJob();
login();
cy.visit("/");
});
cy.get('[data-cy="url-input"]').type("https://example.com");
cy.get('[data-cy="name-field"]').type("example");
cy.get('[data-cy="xpath-field"]').type("//body");
cy.get('[data-cy="add-button"]').click();
afterEach(() => {
cleanUpJobs();
});
cy.contains("Submit").click();
it("should create a job", () => {
enterJobUrl("https://books.toscrape.com");
addElement("body", "//body");
submitJob();
cy.wait("@submitScrapeJob").then((interception) => {
if (!interception.response) {
cy.log("No response received!");
cy.log("Request body: " + JSON.stringify(interception.request?.body));
throw new Error("submitScrapeJob request did not return a response");
}
cy.log("Response status: " + interception.response.statusCode);
cy.log("Response body: " + JSON.stringify(interception.response.body));
expect(interception.response.statusCode).to.eq(200);
});
cy.get("li").contains("Previous Jobs").click();
cy.contains("div", "https://example.com", { timeout: 10000 }).should(
"exist"
);
cy.contains("div", "Completed", { timeout: 20000 }).should("exist");
waitForJobCompletion("https://books.toscrape.com");
});
});

@@ -14,7 +14,7 @@
// ***********************************************************
// Import commands.js using ES2015 syntax:
import './commands'
import "./commands";
// Alternatively you can use CommonJS syntax:
// require('./commands')
// require('./commands')

@@ -0,0 +1,68 @@
export const signup = () => {
cy.intercept("POST", "/api/token").as("token");
cy.visit("/").then(() => {
cy.get("button").contains("Login").click();
cy.url().should("include", "/login");
cy.get("form").should("be.visible");
cy.get("button")
.contains("No Account? Sign up")
.should("be.visible")
.click();
cy.get("input[name='email']").type("test@test.com");
cy.get("input[name='password']").type("password");
cy.get("input[name='fullName']").type("John Doe");
cy.get("button[type='submit']").contains("Signup").click();
cy.wait("@token").then((interception) => {
if (!interception.response) {
cy.log("No response received!");
throw new Error("token request did not return a response");
}
});
});
};
export const login = () => {
cy.intercept("POST", "/api/token").as("token");
cy.intercept("GET", "/api/me").as("me");
cy.intercept("GET", "/api/check").as("check");
cy.visit("/").then(() => {
cy.get("body").then(() => {
cy.get("button")
.contains("Login")
.click()
.then(() => {
cy.get("input[name='email']").type("test@test.com");
cy.get("input[name='password']").type("password");
cy.get("button[type='submit']").contains("Login").click();
cy.wait("@token").then((interception) => {
if (!interception.response) {
cy.log("No response received!");
throw new Error("token request did not return a response");
}
});
cy.wait("@me").then((interception) => {
if (!interception.response) {
cy.log("No response received!");
throw new Error("me request did not return a response");
}
});
cy.wait("@check").then((interception) => {
if (!interception.response) {
cy.log("No response received!");
throw new Error("check request did not return a response");
}
});
cy.url().should("not.include", "/login");
});
});
});
};

@@ -0,0 +1,187 @@
export const cleanUpJobs = () => {
cy.intercept("POST", "/api/retrieve").as("retrieve");
cy.visit("/jobs");
cy.wait("@retrieve", { timeout: 15000 });
cy.get("tbody tr", { timeout: 20000 }).should("have.length.at.least", 1);
const tryClickSelectAll = (attempt = 1, maxAttempts = 5) => {
cy.log(`Attempt ${attempt} to click Select All`);
cy.get('[data-testid="select-all"]')
.closest("button")
.then(($btn) => {
// Retry if button is disabled
if ($btn.is(":disabled") || $btn.css("pointer-events") === "none") {
if (attempt < maxAttempts) {
cy.wait(1000).then(() =>
tryClickSelectAll(attempt + 1, maxAttempts)
);
} else {
throw new Error(
"Select All button is still disabled after max retries"
);
}
} else {
// Click and then verify if checkbox is checked
cy.wrap($btn)
.click({ force: true })
.then(() => {
cy.get("tbody tr")
.first()
.find("td")
.first()
.find("input[type='checkbox']")
.should("be.checked")
.then(() => {
cy.log("Select All successful");
});
});
// Handle failure case
cy.on("fail", () => {
cy.log("Error clicking Select All");
if (attempt < maxAttempts) {
cy.wait(1000).then(() =>
tryClickSelectAll(attempt + 1, maxAttempts)
);
} else {
throw new Error(
"Checkbox was never checked after clicking Select All"
);
}
return false; // Prevent Cypress from failing the test
});
}
});
};
tryClickSelectAll();
cy.get('[data-testid="DeleteIcon"]', { timeout: 10000 })
.closest("button")
.should("not.be.disabled")
.click();
};
export const submitBasicJob = (url: string, name: string, xpath: string) => {
cy.get('[data-cy="url-input"]').type(url);
cy.get('[data-cy="name-field"]').type(name);
cy.get('[data-cy="xpath-field"]').type(xpath);
cy.get('[data-cy="add-button"]').click();
cy.contains("Submit").click();
};
export const waitForJobCompletion = (url: string) => {
cy.intercept("POST", "/api/retrieve").as("retrieve");
cy.visit("/jobs");
cy.wait("@retrieve", { timeout: 30000 });
cy.contains("div", url, { timeout: 30000 }).should("exist");
const checkJobStatus = () => {
cy.get("[data-testid='job-status']", { timeout: 120000 }).then(($el) => {
const status = $el.text().toLowerCase().trim();
if (status.includes("completed")) {
return true;
} else if (status.includes("scraping") || status.includes("queued")) {
cy.wait(5000);
checkJobStatus();
} else {
throw new Error(`Unexpected job status: ${status}`);
}
});
};
checkJobStatus();
};
export const enableMultiPageScraping = () => {
cy.get("button").contains("Advanced Options").click();
cy.get('[data-cy="multi-page-toggle"]').click();
cy.get("body").type("{esc}");
};
export const addCustomHeaders = (headers: Record<string, string>) => {
cy.get("button").contains("Advanced Options").click();
cy.get('[name="custom_headers"]').type(JSON.stringify(headers), {
parseSpecialCharSequences: false,
});
cy.get("body").type("{esc}");
};
export const addCustomCookies = (cookies: Record<string, string>) => {
  cy.get("button").contains("Advanced Options").click();
  cy.get('[name="custom_cookies"]').type(JSON.stringify(cookies));
  cy.get("body").type("{esc}");
};
export const openAdvancedJobOptions = () => {
  cy.get("button").contains("Advanced Options").click();
};
export const selectJobFromSelector = () => {
  checkAiDisabled();
  cy.get("div[id='select-job']", { timeout: 10000 }).first().click();
  cy.get("li[role='option']", { timeout: 10000 }).first().click();
};
export const addMedia = () => {
  cy.get('[data-cy="collect-media-checkbox"]').click();
};
export const checkForMedia = () => {
  cy.intercept("GET", "/api/media/get-media?id=**").as("getMedia");
  cy.visit("/media");
  selectJobFromSelector();
  cy.wait("@getMedia", { timeout: 30000 });
};
export const addSiteMapAction = (
  type: "click" | "input",
  xpath: string,
  input?: string
) => {
  cy.get("button").contains("Create Site Map").click();
  cy.get('[data-cy="site-map-select"]').select(type);
  cy.get('[data-cy="site-map-xpath"]').type(xpath);
  if (type === "input" && input) {
    cy.get('[data-cy="site-map-input"]').type(input);
  }
  cy.get('[data-cy="add-site-map-action"]').click();
};
export const addElement = (name: string, xpath: string) => {
  cy.get('[data-cy="name-field"]').type(name);
  cy.get('[data-cy="xpath-field"]').type(xpath);
  cy.get('[data-cy="add-button"]').click();
};
export const checkAiDisabled = () => {
  cy.getAllLocalStorage().then((result) => {
    const storage = JSON.parse(
      result["http://localhost"]["persist:root"] as string
    );
    const settings = JSON.parse(storage.settings);
    expect(settings.aiEnabled).to.equal(true);
  });
};
export const buildAgentJob = (url: string, prompt: string) => {
  checkAiDisabled();
  enterJobUrl(url);
  cy.get("[data-cy='prompt-input']").type(prompt);
};
export const submitJob = () => {
  cy.get("button").contains("Submit").click();
};
export const enterJobUrl = (url: string) => {
  cy.get('[data-cy="url-input"]').type(url);
};
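
A minimal sketch of the double parse that `checkAiDisabled` performs above, pulled out of Cypress so it can be run on its own. It assumes the `persist:root` entry is a JSON object whose values are themselves JSON strings, which is what the helper's two `JSON.parse` calls suggest. `readPersistedSlice` and the sample value are hypothetical and not part of this commit.

```typescript
// Hypothetical helper (not in the commit): reads one slice out of a
// persisted-state string shaped like the "persist:root" localStorage entry.
type PersistedRoot = Record<string, string>;

function readPersistedSlice<T>(rawRoot: string, slice: string): T {
  // First parse: the root entry maps slice names to JSON strings.
  const root = JSON.parse(rawRoot) as PersistedRoot;
  const rawSlice = root[slice];
  if (typeof rawSlice !== "string") {
    throw new Error(`slice "${slice}" not found in persisted state`);
  }
  // Second parse: each slice is itself serialized as JSON.
  return JSON.parse(rawSlice) as T;
}

// Assumed sample value, mirroring what the test reads from localStorage.
const sampleRoot = JSON.stringify({
  settings: JSON.stringify({ aiEnabled: true }),
});

const parsedSettings = readPersistedSlice<{ aiEnabled: boolean }>(
  sampleRoot,
  "settings"
);
console.log(parsedSettings.aiEnabled);
```

Throwing on a missing slice is a design choice of the sketch: it surfaces a wrong origin or key as a clear error instead of a `JSON.parse(undefined)` failure.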

@@ -0,0 +1,15 @@
export const mockSubmitJob = () => {
  cy.intercept("POST", "/api/submit-scrape-job").as("submitScrapeJob");
};
export const mockToken = () => {
  cy.intercept("POST", "/api/token").as("token");
};
export const mockSignup = () => {
  cy.intercept("POST", "/api/signup").as("signup");
};
export const mockLogin = () => {
  cy.intercept("POST", "/api/token").as("token");
};

@@ -0,0 +1 @@
export * from "./authentication.utils";

@@ -1,6 +1,9 @@
version: "3"
services:
  scraperr:
    build:
      context: .
      dockerfile: docker/frontend/Dockerfile
    command: ["npm", "run", "dev"]
    volumes:
      - "$PWD/src:/app/src"
@@ -10,7 +13,12 @@ services:
      - "$PWD/package-lock.json:/app/package-lock.json"
      - "$PWD/tsconfig.json:/app/tsconfig.json"
  scraperr_api:
    build:
      context: .
      dockerfile: docker/api/Dockerfile
    environment:
      - LOG_LEVEL=INFO
    volumes:
      - "$PWD/api:/project/api"
      - "$PWD/api:/project/app/api"
    ports:
      - "5900:5900"

@@ -1,11 +1,6 @@
services:
  scraperr:
    depends_on:
      - scraperr_api
    image: jpyles0524/scraperr:latest
    build:
      context: .
      dockerfile: docker/frontend/Dockerfile
    container_name: scraperr
    command: ["npm", "run", "start"]
    environment:
@@ -18,22 +13,17 @@ services:
  scraperr_api:
    init: True
    image: jpyles0524/scraperr_api:latest
    build:
      context: .
      dockerfile: docker/api/Dockerfile
    environment:
      - LOG_LEVEL=INFO
      - SECRET_KEY=MRo9PfasPibnqFeK4Oswb6Z+PhFmjzdvxZzwdAkbf/Y= # used to encode authentication tokens (can be a random string)
      - ALGORITHM=HS256 # authentication encoding algorithm
      - ACCESS_TOKEN_EXPIRE_MINUTES=600 # access token expire minutes
      - OPENAI_KEY=${OPENAI_KEY}
    container_name: scraperr_api
    ports:
      - 8000:8000
    volumes:
      - "$PWD/data:/project/data"
      - "$PWD/media:/project/media"
      - /var/run/docker.sock:/var/run/docker.sock
      - "$PWD/data:/project/app/data"
      - "$PWD/media:/project/app/media"
    networks:
      - web
networks:
  web:

@@ -1,36 +1,45 @@
# Build python dependencies
FROM python:3.10.12-slim as pybuilder
RUN apt update && apt install -y uvicorn
RUN apt-get update && \
apt-get install -y curl && \
apt-get install -y x11vnc xvfb uvicorn wget gnupg supervisor libgl1 libglx-mesa0 libglx0 vainfo libva-dev libva-glx2 libva-drm2 ffmpeg pkg-config default-libmysqlclient-dev gcc && \
curl -LsSf https://astral.sh/uv/install.sh | sh && \
apt-get remove -y curl && \
apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/*
RUN python -m pip --no-cache-dir install pdm
RUN pdm config python.use_venv false
WORKDIR /project/app
COPY pyproject.toml pdm.lock /project/app/
RUN pdm install
RUN pdm install -v --frozen-lockfile
RUN pdm run playwright install --with-deps
RUN pdm run camoufox fetch
COPY ./api/ /project/app/api
# Create final image
FROM python:3.10.12-slim
RUN apt-get update
RUN apt-get install -y wget gnupg supervisor
RUN wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
RUN apt-get update
RUN apt-get install -y google-chrome-stable
ENV PYTHONPATH=/project/pkgs
COPY --from=pybuilder /usr/local/lib/python3.10/site-packages /usr/local/lib/python3.10/site-packages
COPY --from=pybuilder /usr/local/bin /usr/local/bin
COPY --from=pybuilder /project/app /project/
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
EXPOSE 8000
WORKDIR /project/
WORKDIR /project/app
RUN mkdir -p /project/app/media
RUN mkdir -p /project/app/data
RUN touch /project/app/data/database.db
EXPOSE 5900
COPY alembic /project/app/alembic
COPY alembic.ini /project/app/alembic.ini
COPY start.sh /project/app/start.sh
CMD [ "supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf" ]

@@ -1,10 +1,14 @@
# Build next dependencies
FROM node:23.1
FROM node:23.1-slim
WORKDIR /app
COPY package*.json ./
RUN npm install
# Copy package files first to leverage Docker cache
COPY package.json yarn.lock ./
# Install dependencies in a separate layer
RUN yarn install --frozen-lockfile --network-timeout 600000
# Copy the rest of the application
COPY tsconfig.json /app/tsconfig.json
COPY tailwind.config.js /app/tailwind.config.js
COPY next.config.mjs /app/next.config.mjs
@@ -13,6 +17,7 @@ COPY postcss.config.js /app/postcss.config.js
COPY public /app/public
COPY src /app/src
RUN npm run build
# Build the application
RUN yarn build
EXPOSE 3000

Binary file not shown.

Size before: 46 KiB. Size after: 67 KiB.

helm/.helmignore Normal file

@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

helm/Chart.yaml Normal file

@@ -0,0 +1,24 @@
apiVersion: v2
name: scraperr
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 1.1.7
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.16.0"

@@ -0,0 +1,56 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraperr
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: scraperr
  template:
    metadata:
      labels:
        app: scraperr
    spec:
      containers:
        - name: scraperr
          {{ if .Values.scraperr.image.repository }}
          image: "{{ .Values.scraperr.image.repository }}:{{ .Values.scraperr.image.tag }}"
          {{ else }}
          image: "{{ .Chart.Name }}:{{ .Chart.Version }}"
          {{ end }}
          imagePullPolicy: {{ .Values.scraperr.image.pullPolicy }}
          command: {{ .Values.scraperr.containerCommand | toJson }}
          ports:
            - containerPort: {{ .Values.scraperr.containerPort }}
          env: {{ toYaml .Values.scraperr.env | nindent 12 }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraperr-api
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: scraperr-api
  template:
    metadata:
      labels:
        app: scraperr-api
    spec:
      containers:
        - name: scraperr-api
          {{ if .Values.scraperrApi.image.repository }}
          image: "{{ .Values.scraperrApi.image.repository }}:{{ .Values.scraperrApi.image.tag }}"
          {{ else }}
          image: "{{ .Chart.Name }}:{{ .Chart.Version }}"
          {{ end }}
          imagePullPolicy: {{ .Values.scraperrApi.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.scraperrApi.containerPort }}
          env: {{ toYaml .Values.scraperrApi.env | nindent 12 }}
          volumeMounts: {{ toYaml .Values.scraperrApi.volumeMounts | nindent 12 }}
      volumes: {{ toYaml .Values.scraperrApi.volumes | nindent 12 }}
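
The image selection in the two Deployments above can be read as a simple fallback: use the configured repository and tag when a repository is set, otherwise fall back to the chart name and version. A rough sketch of that logic in TypeScript follows; `resolveImage`, `ImageValues` and `ChartInfo` are hypothetical names used only for illustration, and the sketch assumes `tag` is always provided when `repository` is.

```typescript
// Hypothetical types standing in for `.Values.<service>.image` and `.Chart`.
interface ImageValues {
  repository?: string;
  tag: string;
}

interface ChartInfo {
  name: string;
  version: string;
}

function resolveImage(image: ImageValues, chart: ChartInfo): string {
  // Mirrors `{{ if .Values.<service>.image.repository }}`:
  // a missing or empty repository takes the else branch.
  if (image.repository) {
    return `${image.repository}:${image.tag}`;
  }
  // Mirrors `{{ .Chart.Name }}:{{ .Chart.Version }}`.
  return `${chart.name}:${chart.version}`;
}

// Chart name and version as they appear in helm/Chart.yaml in this diff.
const chartInfo: ChartInfo = { name: "scraperr", version: "1.1.7" };

const explicitImage = resolveImage(
  { repository: "jpyles0524/scraperr_api", tag: "latest" },
  chartInfo
);
const fallbackImage = resolveImage({ repository: "", tag: "latest" }, chartInfo);
console.log(explicitImage, fallbackImage);
```

Under this reading, both Deployments resolve to the same `scraperr:1.1.7` reference when no repository is set, because the fallback uses the chart name rather than a per-service name.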
