Compare commits

..

7 Commits

Author SHA1 Message Date
dgtlmoon
18fa5aea18 Remove trailing slash 2026-01-04 17:08:07 +01:00
dgtlmoon
686f1c2952 Misc small HTML Validation fixes 2026-01-04 17:06:13 +01:00
dgtlmoon
39274f121c 0.51.4
Some checks failed
Build and push containers / metadata (push) Has been cancelled
Build and push containers / build-push-containers (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Build distribution 📦 (push) Has been cancelled
ChangeDetection.io App Test / lint-code (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Test the built package works basically. (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Publish Python 🐍 distribution 📦 to PyPI (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-10 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-11 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-12 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-13 (push) Has been cancelled
CodeQL / Analyze (javascript) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
2025-11-28 13:26:15 +01:00
dgtlmoon
4b1d871078 Improving UTF-8 handling for xPath selectors (Stop the xpath filter from chewing up non-regulat-latin-text style content) (#3659)
Some checks failed
Build and push containers / metadata (push) Has been cancelled
Build and push containers / build-push-containers (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Build distribution 📦 (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Test the built package works basically. (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Publish Python 🐍 distribution 📦 to PyPI (push) Has been cancelled
ChangeDetection.io App Test / lint-code (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-10 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-11 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-12 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-13 (push) Has been cancelled
2025-11-28 13:13:41 +01:00
dependabot[bot]
f78c2dcffd Bump actions/checkout from 5 to 6 in the all group (#3651)
Some checks failed
Build and push containers / metadata (push) Has been cancelled
Build and push containers / build-push-containers (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Build distribution 📦 (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/amd64 (alpine) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm64 (alpine) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/amd64 (main) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm/v7 (main) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm/v8 (main) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm64 (main) (push) Has been cancelled
ChangeDetection.io App Test / lint-code (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Test the built package works basically. (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Publish Python 🐍 distribution 📦 to PyPI (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-10 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-11 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-12 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-13 (push) Has been cancelled
CodeQL / Analyze (javascript) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
2025-11-24 02:03:13 +01:00
Voczi
1c2c22b8df Specify UTF-8 encoding for xpath_element_js (#3650) 2025-11-23 19:55:26 +01:00
dgtlmoon
3276a9347a Update playwright library to 1.56
Some checks failed
Build and push containers / metadata (push) Has been cancelled
Build and push containers / build-push-containers (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Build distribution 📦 (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/amd64 (alpine) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm64 (alpine) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/amd64 (main) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm/v7 (main) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm/v8 (main) (push) Has been cancelled
ChangeDetection.io Container Build Test / Build linux/arm64 (main) (push) Has been cancelled
ChangeDetection.io App Test / lint-code (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Test the built package works basically. (push) Has been cancelled
Publish Python 🐍distribution 📦 to PyPI and TestPyPI / Publish Python 🐍 distribution 📦 to PyPI (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-10 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-11 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-12 (push) Has been cancelled
ChangeDetection.io App Test / test-application-3-13 (push) Has been cancelled
2025-11-21 11:12:18 +01:00
18 changed files with 277 additions and 125 deletions

View File

@@ -30,7 +30,7 @@ jobs:
steps:
- name: Checkout repository
uses: actions/checkout@v5
uses: actions/checkout@v6
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL

View File

@@ -39,7 +39,7 @@ jobs:
# Or if we are in a tagged release scenario.
if: ${{ github.event.workflow_run.conclusion == 'success' }} || ${{ github.event.release.tag_name }} != ''
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Set up Python 3.11
uses: actions/setup-python@v6
with:

View File

@@ -7,7 +7,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Set up Python
uses: actions/setup-python@v6
with:

View File

@@ -44,7 +44,7 @@ jobs:
- platform: linux/arm64
dockerfile: ./.github/test/Dockerfile-alpine
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Set up Python 3.11
uses: actions/setup-python@v6
with:

View File

@@ -7,7 +7,7 @@ jobs:
lint-code:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Lint with Ruff
run: |
pip install ruff

View File

@@ -21,7 +21,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v6
@@ -66,7 +66,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -93,7 +93,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -132,7 +132,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -174,7 +174,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -214,7 +214,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -250,7 +250,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -279,7 +279,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -319,7 +319,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -350,7 +350,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6
@@ -395,7 +395,7 @@ jobs:
env:
PYTHON_VERSION: ${{ inputs.python-version }}
steps:
- uses: actions/checkout@v5
- uses: actions/checkout@v6
- name: Download Docker image artifact
uses: actions/download-artifact@v6

View File

@@ -52,7 +52,7 @@ RUN --mount=type=cache,id=pip,sharing=locked,target=/tmp/pip-cache \
--prefer-binary \
--cache-dir=/tmp/pip-cache \
--target=/dependencies \
playwright~=1.48.0 \
playwright~=1.56.0 \
|| echo "WARN: Failed to install Playwright. The application can still run, but the Playwright option will be disabled."

View File

@@ -2,7 +2,7 @@
# Read more https://github.com/dgtlmoon/changedetection.io/wiki
# Semver means never use .01, or 00. Should be .1.
__version__ = '0.51.3'
__version__ = '0.51.4'
from changedetectionio.strtobool import strtobool
from json.decoder import JSONDecodeError

View File

@@ -439,7 +439,7 @@ class browsersteps_live_ui(steppable_browser_interface):
logger.warning("Attempted to get current state after cleanup")
return (None, None)
xpath_element_js = importlib.resources.files("changedetectionio.content_fetchers.res").joinpath('xpath_element_scraper.js').read_text()
xpath_element_js = importlib.resources.files("changedetectionio.content_fetchers.res").joinpath('xpath_element_scraper.js').read_text(encoding="utf-8")
now = time.time()
await self.page.wait_for_timeout(1 * 1000)

View File

@@ -25,7 +25,7 @@
<li class="tab"><a href="#ui-options">UI Options</a></li>
<li class="tab"><a href="#api">API</a></li>
<li class="tab"><a href="#rss">RSS</a></li>
<li class="tab"><a href="#timedate">Time &amp Date</a></li>
<li class="tab"><a href="#timedate">Time &amp; Date</a></li>
<li class="tab"><a href="#proxies">CAPTCHA &amp; Proxies</a></li>
</ul>
</div>
@@ -78,9 +78,7 @@
<div class="tab-pane-inner" id="notifications">
<fieldset>
<div class="field-group">
{{ render_common_settings_form(form.application.form, emailprefix, settings_application, extra_notification_token_placeholder_info) }}
</div>
{{ render_common_settings_form(form.application.form, emailprefix, settings_application, extra_notification_token_placeholder_info) }}
</fieldset>
<div class="pure-control-group" id="notification-base-url">
{{ render_field(form.application.form.base_url, class="m-d") }}
@@ -121,7 +119,7 @@
</div>
<div class="pure-control-group">
{{ render_field(form.requests.form.timeout) }}
<span class="pure-form-message-inline">For regular plain requests (not chrome based), maximum number of seconds until timeout, 1-999.<br>
<span class="pure-form-message-inline">For regular plain requests (not chrome based), maximum number of seconds until timeout, 1-999.</span><br>
</div>
<div class="pure-control-group inline-radio">
{{ render_field(form.requests.form.default_ua) }}
@@ -212,7 +210,7 @@ nav
<a id="chrome-extension-link"
title="Try our new Chrome Extension!"
href="https://chromewebstore.google.com/detail/changedetectionio-website/kefcfmgmlhmankjmnbijimhofdjekbop">
<img alt="Chrome store icon" src="{{ url_for('static_content', group='images', filename='google-chrome-icon.png') }}" alt="Chrome">
<img alt="Chrome store icon" src="{{ url_for('static_content', group='images', filename='google-chrome-icon.png') }}" >
Chrome Webstore
</a>
</p>
@@ -253,14 +251,14 @@ nav
Ensure the settings below are correct, they are used to manage the time schedule for checking your web page watches.
</div>
<div class="pure-control-group">
<p><strong>UTC Time &amp Date from Server:</strong> <span id="utc-time" >{{ utc_time }}</span></p>
<p><strong>Local Time &amp Date in Browser:</strong> <span class="local-time" data-utc="{{ utc_time }}"></span></p>
<p>
<p><strong>UTC Time &amp; Date from Server:</strong> <span id="utc-time" >{{ utc_time }}</span></p>
<p><strong>Local Time &amp; Date in Browser:</strong> <span class="local-time" data-utc="{{ utc_time }}"></span></p>
<div>
{{ render_field(form.application.form.scheduler_timezone_default) }}
<datalist id="timezones" style="display: none;">
{%- for timezone in available_timezones -%}<option value="{{ timezone }}">{{ timezone }}</option>{%- endfor -%}
</datalist>
</p>
</div>
</div>
</div>
<div class="tab-pane-inner" id="ui-options">
@@ -329,7 +327,7 @@ nav
</div>
</div>
<p><strong>Tip</strong>: "Residential" and "Mobile" proxy type can be more successfull than "Data Center" for blocked websites.
<p><strong>Tip</strong>: "Residential" and "Mobile" proxy type can be more successfull than "Data Center" for blocked websites.</p>
<div class="pure-control-group" id="extra-proxies-setting">
{{ render_fieldlist_with_inline_errors(form.requests.form.extra_proxies) }}

View File

@@ -155,7 +155,7 @@ document.addEventListener('DOMContentLoaded', function() {
<div class="flex-wrapper">
{% if 'favicons_enabled' not in ui_settings or ui_settings['favicons_enabled'] %}
<div>{# A page might have hundreds of these images, set IMG options for lazy loading, don't set SRC if we dont have it so it doesnt fetch the placeholder' #}
<img alt="Favicon thumbnail" class="favicon" loading="lazy" decoding="async" fetchpriority="low" {% if favicon %} src="{{url_for('static_content', group='favicon', filename=watch.uuid)}}" {% else %} src='data:image/svg+xml;utf8,%3Csvg xmlns="http://www.w3.org/2000/svg" width="7.087" height="7.087" viewBox="0 0 7.087 7.087"%3E%3Ccircle cx="3.543" cy="3.543" r="3.279" stroke="%23e1e1e1" stroke-width="0.45" fill="none" opacity="0.74"/%3E%3C/svg%3E' {% endif %} />
<img alt="Favicon thumbnail" class="favicon" loading="lazy" decoding="async" fetchpriority="low" {% if favicon %} src="{{url_for('static_content', group='favicon', filename=watch.uuid)}}" {% else %} src='data:image/svg+xml;utf8,%3Csvg xmlns="http://www.w3.org/2000/svg" width="7.087" height="7.087" viewBox="0 0 7.087 7.087"%3E%3Ccircle cx="3.543" cy="3.543" r="3.279" stroke="%23e1e1e1" stroke-width="0.45" fill="none" opacity="0.74"/%3E%3C/svg%3E' {% endif %} >
</div>
{% endif %}
<div>

View File

@@ -172,99 +172,131 @@ def elementpath_tostring(obj):
return str(obj)
# Return str Utf-8 of matched rules
def xpath_filter(xpath_filter, html_content, append_pretty_line_formatting=False, is_rss=False):
def xpath_filter(xpath_filter, html_content, append_pretty_line_formatting=False, is_xml=False):
"""
:param xpath_filter:
:param html_content:
:param append_pretty_line_formatting:
:param is_xml: set to true if is XML or is RSS (RSS is XML)
:return:
"""
from lxml import etree, html
import elementpath
# xpath 2.0-3.1
from elementpath.xpath3 import XPath3Parser
parser = etree.HTMLParser()
if is_rss:
# So that we can keep CDATA for cdata_in_document_to_text() to process
parser = etree.XMLParser(strip_cdata=False)
tree = html.fromstring(bytes(html_content, encoding='utf-8'), parser=parser)
html_block = ""
# Build namespace map for XPath queries
namespaces = {'re': 'http://exslt.org/regular-expressions'}
# Handle default namespace in documents (common in RSS/Atom feeds, but can occur in any XML)
# XPath spec: unprefixed element names have no namespace, not the default namespace
# Solution: Register the default namespace with empty string prefix in elementpath
# This is primarily for RSS/Atom feeds but works for any XML with default namespace
if hasattr(tree, 'nsmap') and tree.nsmap and None in tree.nsmap:
# Register the default namespace with empty string prefix for elementpath
# This allows //title to match elements in the default namespace
namespaces[''] = tree.nsmap[None]
r = elementpath.select(tree, xpath_filter.strip(), namespaces=namespaces, parser=XPath3Parser)
#@note: //title/text() now works with default namespaces (fixed by registering '' prefix)
#@note: //title/text() wont work where <title>CDATA.. (use cdata_in_document_to_text first)
if type(r) != list:
r = [r]
for element in r:
# When there's more than 1 match, then add the suffix to separate each line
# And where the matched result doesn't include something that will cause Inscriptis to add a newline
# (This way each 'match' reliably has a new-line in the diff)
# Divs are converted to 4 whitespaces by inscriptis
if append_pretty_line_formatting and len(html_block) and (not hasattr( element, 'tag' ) or not element.tag in (['br', 'hr', 'div', 'p'])):
html_block += TEXT_FILTER_LIST_LINE_SUFFIX
if type(element) == str:
html_block += element
elif issubclass(type(element), etree._Element) or issubclass(type(element), etree._ElementTree):
html_block += etree.tostring(element, pretty_print=True).decode('utf-8')
tree = None
try:
if is_xml:
# So that we can keep CDATA for cdata_in_document_to_text() to process
parser = etree.XMLParser(strip_cdata=False)
# For XML/RSS content, use etree.fromstring to properly handle XML declarations
tree = etree.fromstring(html_content.encode('utf-8') if isinstance(html_content, str) else html_content, parser=parser)
else:
html_block += elementpath_tostring(element)
tree = html.fromstring(html_content, parser=parser)
html_block = ""
return html_block
# Build namespace map for XPath queries
namespaces = {'re': 'http://exslt.org/regular-expressions'}
# Handle default namespace in documents (common in RSS/Atom feeds, but can occur in any XML)
# XPath spec: unprefixed element names have no namespace, not the default namespace
# Solution: Register the default namespace with empty string prefix in elementpath
# This is primarily for RSS/Atom feeds but works for any XML with default namespace
if hasattr(tree, 'nsmap') and tree.nsmap and None in tree.nsmap:
# Register the default namespace with empty string prefix for elementpath
# This allows //title to match elements in the default namespace
namespaces[''] = tree.nsmap[None]
r = elementpath.select(tree, xpath_filter.strip(), namespaces=namespaces, parser=XPath3Parser)
#@note: //title/text() now works with default namespaces (fixed by registering '' prefix)
#@note: //title/text() wont work where <title>CDATA.. (use cdata_in_document_to_text first)
if type(r) != list:
r = [r]
for element in r:
# When there's more than 1 match, then add the suffix to separate each line
# And where the matched result doesn't include something that will cause Inscriptis to add a newline
# (This way each 'match' reliably has a new-line in the diff)
# Divs are converted to 4 whitespaces by inscriptis
if append_pretty_line_formatting and len(html_block) and (not hasattr( element, 'tag' ) or not element.tag in (['br', 'hr', 'div', 'p'])):
html_block += TEXT_FILTER_LIST_LINE_SUFFIX
if type(element) == str:
html_block += element
elif issubclass(type(element), etree._Element) or issubclass(type(element), etree._ElementTree):
# Use 'xml' method for RSS/XML content, 'html' for HTML content
# parser will be XMLParser if we detected XML content
method = 'xml' if (is_xml or isinstance(parser, etree.XMLParser)) else 'html'
html_block += etree.tostring(element, pretty_print=True, method=method, encoding='unicode')
else:
html_block += elementpath_tostring(element)
return html_block
finally:
# Explicitly clear the tree to free memory
# lxml trees can hold significant memory, especially with large documents
if tree is not None:
tree.clear()
# Return str Utf-8 of matched rules
# 'xpath1:'
def xpath1_filter(xpath_filter, html_content, append_pretty_line_formatting=False, is_rss=False):
def xpath1_filter(xpath_filter, html_content, append_pretty_line_formatting=False, is_xml=False):
from lxml import etree, html
parser = None
if is_rss:
# So that we can keep CDATA for cdata_in_document_to_text() to process
parser = etree.XMLParser(strip_cdata=False)
tree = html.fromstring(bytes(html_content, encoding='utf-8'), parser=parser)
html_block = ""
# Build namespace map for XPath queries
namespaces = {'re': 'http://exslt.org/regular-expressions'}
# NOTE: lxml's native xpath() does NOT support empty string prefix for default namespace
# For documents with default namespace (RSS/Atom feeds), users must use:
# - local-name(): //*[local-name()='title']/text()
# - Or use xpath_filter (not xpath1_filter) which supports default namespaces
# XPath spec: unprefixed element names have no namespace, not the default namespace
r = tree.xpath(xpath_filter.strip(), namespaces=namespaces)
#@note: xpath1 (lxml) does NOT automatically handle default namespaces
#@note: Use //*[local-name()='element'] or switch to xpath_filter for default namespace support
#@note: //title/text() wont work where <title>CDATA.. (use cdata_in_document_to_text first)
for element in r:
# When there's more than 1 match, then add the suffix to separate each line
# And where the matched result doesn't include something that will cause Inscriptis to add a newline
# (This way each 'match' reliably has a new-line in the diff)
# Divs are converted to 4 whitespaces by inscriptis
if append_pretty_line_formatting and len(html_block) and (not hasattr(element, 'tag') or not element.tag in (['br', 'hr', 'div', 'p'])):
html_block += TEXT_FILTER_LIST_LINE_SUFFIX
# Some kind of text, UTF-8 or other
if isinstance(element, (str, bytes)):
html_block += element
tree = None
try:
if is_xml:
# So that we can keep CDATA for cdata_in_document_to_text() to process
parser = etree.XMLParser(strip_cdata=False)
# For XML/RSS content, use etree.fromstring to properly handle XML declarations
tree = etree.fromstring(html_content.encode('utf-8') if isinstance(html_content, str) else html_content, parser=parser)
else:
# Return the HTML which will get parsed as text
html_block += etree.tostring(element, pretty_print=True).decode('utf-8')
tree = html.fromstring(html_content, parser=parser)
html_block = ""
return html_block
# Build namespace map for XPath queries
namespaces = {'re': 'http://exslt.org/regular-expressions'}
# NOTE: lxml's native xpath() does NOT support empty string prefix for default namespace
# For documents with default namespace (RSS/Atom feeds), users must use:
# - local-name(): //*[local-name()='title']/text()
# - Or use xpath_filter (not xpath1_filter) which supports default namespaces
# XPath spec: unprefixed element names have no namespace, not the default namespace
r = tree.xpath(xpath_filter.strip(), namespaces=namespaces)
#@note: xpath1 (lxml) does NOT automatically handle default namespaces
#@note: Use //*[local-name()='element'] or switch to xpath_filter for default namespace support
#@note: //title/text() wont work where <title>CDATA.. (use cdata_in_document_to_text first)
for element in r:
# When there's more than 1 match, then add the suffix to separate each line
# And where the matched result doesn't include something that will cause Inscriptis to add a newline
# (This way each 'match' reliably has a new-line in the diff)
# Divs are converted to 4 whitespaces by inscriptis
if append_pretty_line_formatting and len(html_block) and (not hasattr(element, 'tag') or not element.tag in (['br', 'hr', 'div', 'p'])):
html_block += TEXT_FILTER_LIST_LINE_SUFFIX
# Some kind of text, UTF-8 or other
if isinstance(element, (str, bytes)):
html_block += element
else:
# Return the HTML/XML which will get parsed as text
# Use 'xml' method for RSS/XML content, 'html' for HTML content
# parser will be XMLParser if we detected XML content
method = 'xml' if (is_xml or isinstance(parser, etree.XMLParser)) else 'html'
html_block += etree.tostring(element, pretty_print=True, method=method, encoding='unicode')
return html_block
finally:
# Explicitly clear the tree to free memory
# lxml trees can hold significant memory, especially with large documents
if tree is not None:
tree.clear()
# Extract/find element
def extract_element(find='title', html_content=''):

View File

@@ -103,15 +103,15 @@ class guess_stream_type():
self.is_json = True
elif 'pdf' in magic_content_header:
self.is_pdf = True
elif has_html_patterns or http_content_header == 'text/html':
self.is_html = True
elif any(s in magic_content_header for s in JSON_CONTENT_TYPES):
self.is_json = True
# magic will call a rss document 'xml'
# Rarely do endpoints give the right header, usually just text/xml, so we check also for <rss
# This also triggers the automatic CDATA text parser so the RSS goes back a nice content list
elif '<rss' in test_content_normalized or '<feed' in test_content_normalized or any(s in magic_content_header for s in RSS_XML_CONTENT_TYPES) or '<rdf:' in test_content_normalized:
self.is_rss = True
elif has_html_patterns or http_content_header == 'text/html':
self.is_html = True
elif any(s in magic_content_header for s in JSON_CONTENT_TYPES):
self.is_json = True
elif any(s in http_content_header for s in XML_CONTENT_TYPES):
# Only mark as generic XML if not already detected as RSS
if not self.is_rss:

View File

@@ -298,7 +298,7 @@ class ContentProcessor:
xpath_filter=filter_rule.replace('xpath:', ''),
html_content=content,
append_pretty_line_formatting=not self.watch.is_source_type_url,
is_rss=stream_content_type.is_rss
is_xml=stream_content_type.is_rss or stream_content_type.is_xml
)
# XPath1 filters (first match only)
@@ -307,7 +307,7 @@ class ContentProcessor:
xpath_filter=filter_rule.replace('xpath1:', ''),
html_content=content,
append_pretty_line_formatting=not self.watch.is_source_type_url,
is_rss=stream_content_type.is_rss
is_xml=stream_content_type.is_rss or stream_content_type.is_xml
)
# JSON filters

View File

@@ -145,6 +145,7 @@
<div id="notification-test-log" style="display: none;"><span class="pure-form-message-inline">Processing..</span></div>
</div>
</div>
<div class="pure-control-group grey-form-border">
<div class="pure-control-group">
{{ render_field(form.notification_title, class="m-d notification-title", placeholder=settings_application['notification_title']) }}
@@ -169,6 +170,7 @@
</span></li>
</ul>
<br>
</div>
</div>
<div class="">
{{ render_field(form.notification_format , class="notification-format") }}

View File

@@ -405,7 +405,10 @@ def test_plaintext_even_if_xml_content_and_can_apply_filters(client, live_server
follow_redirects=True
)
assert b'&lt;string name=&#34;feed_update_receiver_name&#34;' in res.data
# Check that the string element with the correct name attribute is present
# Note: namespace declarations may be included when extracting elements, which is correct XML behavior
assert b'feed_update_receiver_name' in res.data
assert b'Abonnementen bijwerken' in res.data
assert b'&lt;foobar' not in res.data
res = client.get(url_for("ui.form_delete", uuid="all"), follow_redirects=True)

View File

@@ -84,14 +84,14 @@ class TestXPathDefaultNamespace:
def test_atom_feed_simple_xpath_with_xpath_filter(self):
"""Test that //title/text() works on Atom feed with default namespace using xpath_filter."""
result = html_tools.xpath_filter('//title/text()', atom_feed_with_default_ns, is_rss=True)
result = html_tools.xpath_filter('//title/text()', atom_feed_with_default_ns, is_xml=True)
assert 'Release notes from PowerToys' in result
assert 'Release 0.95.1' in result
assert 'Release v0.95.0' in result
def test_atom_feed_nested_xpath_with_xpath_filter(self):
"""Test nested XPath like //entry/title/text() on Atom feed."""
result = html_tools.xpath_filter('//entry/title/text()', atom_feed_with_default_ns, is_rss=True)
result = html_tools.xpath_filter('//entry/title/text()', atom_feed_with_default_ns, is_xml=True)
assert 'Release 0.95.1' in result
assert 'Release v0.95.0' in result
# Should NOT include the feed title
@@ -99,20 +99,20 @@ class TestXPathDefaultNamespace:
def test_atom_feed_other_elements_with_xpath_filter(self):
"""Test that other elements like //updated/text() work on Atom feed."""
result = html_tools.xpath_filter('//updated/text()', atom_feed_with_default_ns, is_rss=True)
result = html_tools.xpath_filter('//updated/text()', atom_feed_with_default_ns, is_xml=True)
assert '2025-10-23T08:53:12Z' in result
assert '2025-10-24T14:20:14Z' in result
def test_rss_feed_without_namespace(self):
"""Test that //title/text() works on RSS feed without default namespace."""
result = html_tools.xpath_filter('//title/text()', rss_feed_no_default_ns, is_rss=True)
result = html_tools.xpath_filter('//title/text()', rss_feed_no_default_ns, is_xml=True)
assert 'Channel Title' in result
assert 'Item 1 Title' in result
assert 'Item 2 Title' in result
def test_rss_feed_nested_xpath(self):
"""Test nested XPath on RSS feed without default namespace."""
result = html_tools.xpath_filter('//item/title/text()', rss_feed_no_default_ns, is_rss=True)
result = html_tools.xpath_filter('//item/title/text()', rss_feed_no_default_ns, is_xml=True)
assert 'Item 1 Title' in result
assert 'Item 2 Title' in result
# Should NOT include channel title
@@ -120,31 +120,31 @@ class TestXPathDefaultNamespace:
def test_rss_feed_with_prefixed_namespaces(self):
"""Test that feeds with namespace prefixes (not default) still work."""
result = html_tools.xpath_filter('//title/text()', rss_feed_with_ns_prefix, is_rss=True)
result = html_tools.xpath_filter('//title/text()', rss_feed_with_ns_prefix, is_xml=True)
assert 'Channel Title' in result
assert 'Item Title' in result
def test_local_name_workaround_still_works(self):
"""Test that local-name() workaround still works for Atom feeds."""
result = html_tools.xpath_filter('//*[local-name()="title"]/text()', atom_feed_with_default_ns, is_rss=True)
result = html_tools.xpath_filter('//*[local-name()="title"]/text()', atom_feed_with_default_ns, is_xml=True)
assert 'Release notes from PowerToys' in result
assert 'Release 0.95.1' in result
def test_xpath1_filter_without_default_namespace(self):
"""Test xpath1_filter works on RSS without default namespace."""
result = html_tools.xpath1_filter('//title/text()', rss_feed_no_default_ns, is_rss=True)
result = html_tools.xpath1_filter('//title/text()', rss_feed_no_default_ns, is_xml=True)
assert 'Channel Title' in result
assert 'Item 1 Title' in result
def test_xpath1_filter_with_default_namespace_returns_empty(self):
"""Test that xpath1_filter returns empty on Atom with default namespace (known limitation)."""
result = html_tools.xpath1_filter('//title/text()', atom_feed_with_default_ns, is_rss=True)
result = html_tools.xpath1_filter('//title/text()', atom_feed_with_default_ns, is_xml=True)
# xpath1_filter (lxml) doesn't support default namespaces, so this returns empty
assert result == ''
def test_xpath1_filter_local_name_workaround(self):
"""Test that xpath1_filter works with local-name() workaround on Atom feeds."""
result = html_tools.xpath1_filter('//*[local-name()="title"]/text()', atom_feed_with_default_ns, is_rss=True)
result = html_tools.xpath1_filter('//*[local-name()="title"]/text()', atom_feed_with_default_ns, is_xml=True)
assert 'Release notes from PowerToys' in result
assert 'Release 0.95.1' in result

View File

@@ -201,3 +201,120 @@ def test_trips(html_content, xpath, answer):
html_content = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=True)
assert type(html_content) == str
assert answer in html_content
# Test for UTF-8 encoding bug fix (issue #3658)
# Polish and other UTF-8 characters should be preserved correctly
polish_html = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"></head>
<body>
<div class="index--s-headline-link">
<a class="index--s-headline-link" href="#">
Naukowcy potwierdzają: oglądanie krótkich filmików prowadzi do "zgnilizny mózgu"
</a>
</div>
<div>
<a class="other-class" href="#">
Test with Polish chars: żółć ąę śń
</a>
</div>
<div>
<p class="unicode-test">Cyrillic: Привет мир</p>
<p class="unicode-test">Greek: Γειά σου κόσμε</p>
<p class="unicode-test">Arabic: مرحبا بالعالم</p>
<p class="unicode-test">Chinese: 你好世界</p>
<p class="unicode-test">Japanese: こんにちは世界</p>
<p class="unicode-test">Emoji: 🌍🎉✨</p>
</div>
</body>
</html>
"""
@pytest.mark.parametrize("html_content", [polish_html])
@pytest.mark.parametrize("xpath, expected_text", [
# Test Polish characters in xpath_filter
('//a[(contains(@class,"index--s-headline-link"))]', 'Naukowcy potwierdzają'),
('//a[(contains(@class,"index--s-headline-link"))]', 'oglądanie krótkich filmików'),
('//a[(contains(@class,"index--s-headline-link"))]', 'zgnilizny mózgu'),
('//a[@class="other-class"]', 'żółć ąę śń'),
# Test various Unicode scripts
('//p[@class="unicode-test"]', 'Привет мир'),
('//p[@class="unicode-test"]', 'Γειά σου κόσμε'),
('//p[@class="unicode-test"]', 'مرحبا بالعالم'),
('//p[@class="unicode-test"]', '你好世界'),
('//p[@class="unicode-test"]', 'こんにちは世界'),
('//p[@class="unicode-test"]', '🌍🎉✨'),
# Test with text() extraction
('//a[@class="other-class"]/text()', 'żółć'),
])
def test_xpath_utf8_encoding(html_content, xpath, expected_text):
"""Test that XPath filters preserve UTF-8 characters correctly (issue #3658)"""
result = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=False)
assert type(result) == str
assert expected_text in result
# Ensure characters are NOT HTML-entity encoded
# For example, 'ą' should NOT become '&#261;'
assert '&#' not in result or expected_text in result
@pytest.mark.parametrize("html_content", [polish_html])
@pytest.mark.parametrize("xpath, expected_text", [
# Test Polish characters in xpath1_filter
('//a[(contains(@class,"index--s-headline-link"))]', 'Naukowcy potwierdzają'),
('//a[(contains(@class,"index--s-headline-link"))]', 'mózgu'),
('//a[@class="other-class"]', 'żółć ąę śń'),
# Test various Unicode scripts with xpath1
('//p[@class="unicode-test" and contains(text(), "Cyrillic")]', 'Привет мир'),
('//p[@class="unicode-test" and contains(text(), "Greek")]', 'Γειά σου'),
('//p[@class="unicode-test" and contains(text(), "Chinese")]', '你好世界'),
])
def test_xpath1_utf8_encoding(html_content, xpath, expected_text):
"""Test that XPath1 filters preserve UTF-8 characters correctly"""
result = html_tools.xpath1_filter(xpath, html_content, append_pretty_line_formatting=False)
assert type(result) == str
assert expected_text in result
# Ensure characters are NOT HTML-entity encoded
assert '&#' not in result or expected_text in result
# Test with real-world example from wyborcza.pl (issue #3658)
wyborcza_style_html = """<!DOCTYPE html>
<html lang="pl">
<head><meta charset="utf-8"></head>
<body>
<div class="article-list">
<a class="index--s-headline-link" href="/article1">
Naukowcy potwierdzają: oglądanie krótkich filmików prowadzi do "zgnilizny mózgu"
</a>
<a class="index--s-headline-link" href="/article2">
Zmiany klimatyczne wpływają na życie w miastach
</a>
<a class="index--s-headline-link" href="/article3">
Łódź: Nowe inwestycje w infrastrukturę miejską
</a>
</div>
</body>
</html>
"""
def test_wyborcza_real_world_example():
"""Test real-world case from wyborcza.pl that was failing (issue #3658)"""
xpath = '//a[(contains(@class,"index--s-headline-link"))]'
result = html_tools.xpath_filter(xpath, wyborcza_style_html, append_pretty_line_formatting=False)
# These exact strings should appear in the result
assert 'Naukowcy potwierdzają' in result
assert 'oglądanie krótkich filmików' in result
assert 'zgnilizny mózgu' in result
assert 'Łódź' in result
# Make sure they're NOT corrupted to mojibake like "potwierdzajÄ"
assert 'potwierdzajÄ' not in result
assert 'ogl&#261;danie' not in result
assert 'm&#243;zgu' not in result