Initial commit: hardened DeerFlow factory

Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection hardening: - New deerflow.security package: content_delimiter, html_cleaner, sanitizer (8 layers — invisible chars, control chars, symbols, NFC, PUA, tag chars, horizontal whitespace collapse with newline/tab preservation, length cap) - New deerflow.community.searx package: web_search, web_fetch, image_search backed by a private SearX instance, every external string sanitized and wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>> delimiters - All native community web providers (ddg_search, tavily, exa, firecrawl, jina_ai, infoquest, image_search) replaced with hard-fail stubs that raise NativeWebToolDisabledError at import time, so a misconfigured tool.use path fails loud rather than silently falling back to unsanitized output - Native client back-doors (jina_client.py, infoquest_client.py) stubbed too - Native-tool tests quarantined under tests/_disabled_native/ (collect_ignore_glob via local conftest.py) - Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve newlines and tabs so list/table structure survives - Hardened runtime config.yaml references only the searx-backed tools - Factory overlay (backend/) kept in sync with deer-flow tree as a reference / source See HARDENING.md for the full audit trail and verification steps.
2026-04-12 14:23:57 +02:00
commit 6de0bf9f5b
889 changed files with 173052 additions and 0 deletions
--- a/HARDENING.md
+++ b/HARDENING.md
@@ -0,0 +1,228 @@
+# DeerFlow Hardening Notes
+
+This repository is a hardened deployment of [bytedance/deer-flow](https://github.com/bytedance/deer-flow)
+with the only goal of preventing prompt-injection attacks via the agent's
+web access surface.
+
+The upstream tree lives in `deer-flow/` and is checked in directly (no
+submodule, no nested git). All hardening changes are kept inside that tree
+so that `python -m deerflow.community.searx.tools` resolves out of the box
+once `deer-flow/backend/packages/harness` is on `PYTHONPATH`.
+
+This document is a defense-in-depth audit trail. If you change any of the
+files listed here, please update this document in the same commit.
+
+## 1. Threat model
+
+Prompt-injection via untrusted web content. An attacker controls the body
+of an HTML page (or a search-result snippet) and tries to make the model:
+
+1. Treat externally fetched text as **system instructions** (delimiter confusion).
+2. Smuggle hidden tokens via **invisible Unicode** (zero-width spaces, BOM,
+   PUA, tag characters).
+3. Inject **executable HTML** (`<script>`, `<iframe>`, `<form>`, ...) that
+   the model would summarise verbatim.
+
+The hardening below is a port of the OpenClaw approach (`searx-scripts/`,
+`fetch-scripts/`) to DeerFlow's adapter contract.
+
+## 2. What was changed
+
+### 2.1 New: `deerflow.security`
+
+`deer-flow/backend/packages/harness/deerflow/security/`
+
+| File | Purpose |
+|---|---|
+| `__init__.py` | Public re-exports |
+| `content_delimiter.py` | Wraps untrusted content in `<<<EXTERNAL_UNTRUSTED_CONTENT>>> ... <<<END_EXTERNAL_UNTRUSTED_CONTENT>>>` so the LLM has a semantic boundary between system instructions and external data |
+| `html_cleaner.py` | `SecureTextExtractor` strips `script`, `style`, `noscript`, `header`, `footer`, `nav`, `aside`, `iframe`, `object`, `embed`, `form` |
+| `sanitizer.py` | `PromptInjectionSanitizer`: 8 layers — invisible chars, control chars, symbols (So/Sk), NFC normalize, PUA, tag chars, horizontal-whitespace collapse (newlines/tabs preserved), length cap |
+
+### 2.2 New: `deerflow.community.searx`
+
+`deer-flow/backend/packages/harness/deerflow/community/searx/tools.py`
+
+LangChain `@tool` exports:
+
+- `web_search_tool(query, max_results=10)` — calls a private SearX instance, sanitizes title + content, wraps results in security delimiters
+- `web_fetch_tool(url, max_chars=10000)` — fetches URL, runs `extract_secure_text` then `sanitizer.sanitize`, wraps result
+- `image_search_tool(query, max_results=5)` — SearX `categories=images`, sanitized title/url/thumbnail, wrapped
+
+Reads its config from `get_app_config().get_tool_config(<name>).model_extra`:
+`searx_url`, `max_results`, `max_chars`.
+
+### 2.3 Disabled: native community web tools
+
+Every legacy provider's `tools.py` was replaced with a hard-fail stub that
+raises `NativeWebToolDisabledError` **at module import time**. Importing
+the module aborts with a clear message pointing at the searx replacement,
+so a misconfigured `tool.use:` path in `config.yaml` fails loud, not silent.
+
+| Provider | Status | Reason |
+|---|---|---|
+| `community/ddg_search/tools.py` | stub | unhardened DuckDuckGo HTML scrape |
+| `community/tavily/tools.py` | stub | external API, no sanitization |
+| `community/exa/tools.py` | stub | external API, no sanitization |
+| `community/firecrawl/tools.py` | stub | external API, no sanitization |
+| `community/jina_ai/tools.py` | stub | unhardened Jina Reader |
+| `community/jina_ai/jina_client.py` | stub | back-door client, also disabled |
+| `community/infoquest/tools.py` | stub | external API, no sanitization |
+| `community/infoquest/infoquest_client.py` | stub | back-door client, also disabled |
+| `community/image_search/tools.py` | stub | unhardened DDG image fallback |
+
+Central reject helper: `community/_disabled_native.py` —
+`reject_native_provider(name)` raises `NativeWebToolDisabledError`.
+
+### 2.4 Quarantined tests
+
+Tests that expected the native modules to be importable are moved to
+`deer-flow/backend/tests/_disabled_native/`. A `conftest.py` in that
+directory sets `collect_ignore_glob = ["*.py"]` so pytest skips them
+without erroring.
+
+| Test | Reason |
+|---|---|
+| `test_exa_tools.py` | imports `deerflow.community.exa.tools` |
+| `test_firecrawl_tools.py` | imports `deerflow.community.firecrawl.tools` |
+| `test_jina_client.py` | imports `deerflow.community.jina_ai.jina_client` |
+| `test_infoquest_client.py` | imports `deerflow.community.infoquest.infoquest_client` |
+
+`test_doctor.py` and `test_setup_wizard.py` reference the native paths
+**only as strings in test configs** (not as imports), so they continue to
+run unchanged.
+
+### 2.5 Sanitizer bug fix
+
+`PromptInjectionSanitizer.sanitize()` Layer 7 used to do
+`re.sub(r'\s+', ' ', text)` which collapsed `\n` and `\t` into single
+spaces — destroying list/table structure from web pages. Replaced with
+horizontal-whitespace-only collapse plus `\n{3,} -> \n\n`. Verified by
+`test_security_sanitizer.py::test_preserves_newlines_and_tabs`.
+
+### 2.6 Hardened runtime config
+
+`config.yaml` (top-level, **not** `deer-flow/config.example.yaml`) is the
+runtime config and references **only** the searx-backed tools:
+
+```yaml
+tools:
+  - name: web_search
+    group: web
+    use: deerflow.community.searx.tools:web_search_tool
+    searx_url: http://10.67.67.1:8888
+    max_results: 10
+  - name: web_fetch
+    group: web
+    use: deerflow.community.searx.tools:web_fetch_tool
+    max_chars: 10000
+  - name: image_search
+    group: web
+    use: deerflow.community.searx.tools:image_search_tool
+    max_results: 5
+```
+
+The guardrail layer is intentionally not used as the primary block:
+DeerFlow guardrails see only `tool.name` (e.g. `web_search`), and both the
+hardened and the native version export the same name. The real block is
+the import-time stub above.
+
+## 3. Verification
+
+All checks below assume `PYTHONPATH=deer-flow/backend/packages/harness`.
+
+### 3.1 Hardened modules import
+
+```bash
+python3 -c "
+from deerflow.security.content_delimiter import wrap_untrusted_content
+from deerflow.security.sanitizer import sanitizer
+from deerflow.security.html_cleaner import extract_secure_text
+import importlib.util
+assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
+print('OK')
+"
+```
+
+### 3.2 Native modules fail closed
+
+```bash
+python3 -c "
+for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
+    try:
+        __import__(f'deerflow.community.{prov}.tools')
+        raise SystemExit(f'FAIL: {prov} imported')
+    except RuntimeError as e:
+        assert 'disabled in this hardened DeerFlow build' in str(e)
+print('OK — all native providers blocked')
+"
+```
+
+### 3.3 Security tests
+
+```bash
+PYTHONPATH=deer-flow/backend/packages/harness pytest \
+    backend/tests/test_security_sanitizer.py \
+    backend/tests/test_security_html_cleaner.py -q
+```
+
+Expected: `8 passed`.
+
+## 4. Adding a new web tool
+
+1. Implement it in `deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py`.
+2. **Always** sanitize external strings via `deerflow.security.sanitizer`.
+3. **Always** wrap the response with `wrap_untrusted_content()`.
+4. For HTML input, use `extract_secure_text()` first.
+5. Add a test to `backend/tests/` that asserts the security delimiters are
+   present in the tool output.
+6. Update this document.
+
+## 5. Re-enabling a native provider (don't)
+
+If you really must:
+
+1. Replace the stub in `community/<provider>/tools.py` with a hardened
+   wrapper (sanitize → delimiter, just like searx).
+2. Move the matching test out of `tests/_disabled_native/`.
+3. Update this document and explain the threat-model change in your commit
+   message.
+
+## 6. Files touched (audit trail)
+
+```
+deer-flow/backend/packages/harness/deerflow/security/__init__.py          (new)
+deer-flow/backend/packages/harness/deerflow/security/content_delimiter.py (new)
+deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py      (new)
+deer-flow/backend/packages/harness/deerflow/security/sanitizer.py         (new, with newline-preserving fix)
+
+deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py   (new)
+deer-flow/backend/packages/harness/deerflow/community/searx/tools.py      (new)
+
+deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py        (new)
+deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py        (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/tavily/tools.py            (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/exa/tools.py               (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/firecrawl/tools.py         (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/jina_ai/tools.py           (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/jina_ai/jina_client.py     (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/infoquest/tools.py         (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/infoquest/infoquest_client.py (replaced with stub)
+deer-flow/backend/packages/harness/deerflow/community/image_search/tools.py      (replaced with stub)
+
+deer-flow/backend/tests/_disabled_native/conftest.py                      (new — collect_ignore_glob)
+deer-flow/backend/tests/_disabled_native/test_exa_tools.py                (moved)
+deer-flow/backend/tests/_disabled_native/test_firecrawl_tools.py          (moved)
+deer-flow/backend/tests/_disabled_native/test_jina_client.py              (moved)
+deer-flow/backend/tests/_disabled_native/test_infoquest_client.py         (moved)
+
+backend/packages/harness/deerflow/security/                               (factory overlay, kept in sync)
+backend/packages/harness/deerflow/community/searx/                        (factory overlay, kept in sync)
+backend/tests/test_security_sanitizer.py                                  (factory tests)
+backend/tests/test_security_html_cleaner.py                               (factory tests)
+backend/tests/test_searx_tools.py                                         (factory tests)
+
+config.yaml                                                                (hardened runtime config, references only searx tools)
+.env.example                                                               (template, no secrets)
+HARDENING.md                                                               (this file)
+```