Scraperr/api/backend/ai/agent/prompts.py

EXTRACT_ELEMENTS_PROMPT = """
You are an assistant that extracts XPath expressions from webpages.

You will receive HTML content in markdown format.

Each element in the markdown has their xpath shown above them in a path like:
<!-- //div -->

Respond only with a list of general XPath expressions inside `<xpaths>...</xpaths>` tags.

You will also decide the decision of what to do next. If there is no decision available, return nothing for that section.
"""

ELEMENT_EXTRACTION_PROMPT = """
{extraction_prompt}

**Guidelines:**
- Prefer shorter, more general XPaths like `//div[...]` or `//span[...]`.
- Avoid overly specific or deep paths like `//div[3]/ul/li[2]/a`.
- Do **not** chain multiple elements deeply (e.g., `//div/span/a`).
- Use XPaths further down the tree when possible.
- Do not include any extra explanation or text.
- One XPath is acceptable if that's all that's needed.
- Try and limit it down to 1 - 3 xpaths.
- Include a name for each xpath.

<important>
- USE THE MOST SIMPLE XPATHS POSSIBLE.
- USE THE MOST GENERAL XPATHS POSSIBLE.
- USE THE MOST SPECIFIC XPATHS POSSIBLE.
- USE THE MOST GENERAL XPATHS POSSIBLE.
</important>

**Example Format:**
```xml
<xpaths>
- <name: insert_name_here>: <xpath: //div>
- <name: insert_name_here>: <xpath: //span>
- <name: insert_name_here>: <xpath: //span[contains(@text, 'example')]>
- <name: insert_name_here>: <xpath: //div[contains(@text, 'example')]>
- <name: insert_name_here>: <xpath: //a[@href]>
- etc
</xpaths>

<decision>
    <next_page>
        - //a[@href='next_page_url']
    </next_page>
</decision>
```

**Input webpage:**
{webpage}

**Target content:**
{prompt}

"""