Initial commit: hardened DeerFlow factory

Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection
hardening:

- New deerflow.security package: content_delimiter, html_cleaner,
  sanitizer (8 layers — invisible chars, control chars, symbols, NFC,
  PUA, tag chars, horizontal whitespace collapse with newline/tab
  preservation, length cap)
- New deerflow.community.searx package: web_search, web_fetch,
  image_search backed by a private SearX instance, every external
  string sanitized and wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>>
  delimiters
- All native community web providers (ddg_search, tavily, exa,
  firecrawl, jina_ai, infoquest, image_search) replaced with hard-fail
  stubs that raise NativeWebToolDisabledError at import time, so a
  misconfigured tool.use path fails loud rather than silently falling
  back to unsanitized output
- Native client back-doors (jina_client.py, infoquest_client.py)
  stubbed too
- Native-tool tests quarantined under tests/_disabled_native/
  (collect_ignore_glob via local conftest.py)
- Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve
  newlines and tabs so list/table structure survives
- Hardened runtime config.yaml references only the searx-backed tools
- Factory overlay (backend/) kept in sync with deer-flow tree as a
  reference / source

See HARDENING.md for the full audit trail and verification steps.
This commit is contained in:
2026-04-12 14:23:57 +02:00
commit 6de0bf9f5b
889 changed files with 173052 additions and 0 deletions

228
HARDENING.md Normal file
View File

@@ -0,0 +1,228 @@
# DeerFlow Hardening Notes
This repository is a hardened deployment of [bytedance/deer-flow](https://github.com/bytedance/deer-flow)
with the only goal of preventing prompt-injection attacks via the agent's
web access surface.
The upstream tree lives in `deer-flow/` and is checked in directly (no
submodule, no nested git). All hardening changes are kept inside that tree
so that `python -m deerflow.community.searx.tools` resolves out of the box
once `deer-flow/backend/packages/harness` is on `PYTHONPATH`.
This document is a defense-in-depth audit trail. If you change any of the
files listed here, please update this document in the same commit.
## 1. Threat model
Prompt-injection via untrusted web content. An attacker controls the body
of an HTML page (or a search-result snippet) and tries to make the model:
1. Treat externally fetched text as **system instructions** (delimiter confusion).
2. Smuggle hidden tokens via **invisible Unicode** (zero-width spaces, BOM,
PUA, tag characters).
3. Inject **executable HTML** (`<script>`, `<iframe>`, `<form>`, ...) that
the model would summarise verbatim.
The hardening below is a port of the OpenClaw approach (`searx-scripts/`,
`fetch-scripts/`) to DeerFlow's adapter contract.
## 2. What was changed
### 2.1 New: `deerflow.security`
`deer-flow/backend/packages/harness/deerflow/security/`
| File | Purpose |
|---|---|
| `__init__.py` | Public re-exports |
| `content_delimiter.py` | Wraps untrusted content in `<<<EXTERNAL_UNTRUSTED_CONTENT>>> ... <<<END_EXTERNAL_UNTRUSTED_CONTENT>>>` so the LLM has a semantic boundary between system instructions and external data |
| `html_cleaner.py` | `SecureTextExtractor` strips `script`, `style`, `noscript`, `header`, `footer`, `nav`, `aside`, `iframe`, `object`, `embed`, `form` |
| `sanitizer.py` | `PromptInjectionSanitizer`: 8 layers — invisible chars, control chars, symbols (So/Sk), NFC normalize, PUA, tag chars, horizontal-whitespace collapse (newlines/tabs preserved), length cap |
### 2.2 New: `deerflow.community.searx`
`deer-flow/backend/packages/harness/deerflow/community/searx/tools.py`
LangChain `@tool` exports:
- `web_search_tool(query, max_results=10)` — calls a private SearX instance, sanitizes title + content, wraps results in security delimiters
- `web_fetch_tool(url, max_chars=10000)` — fetches URL, runs `extract_secure_text` then `sanitizer.sanitize`, wraps result
- `image_search_tool(query, max_results=5)` — SearX `categories=images`, sanitized title/url/thumbnail, wrapped
Reads its config from `get_app_config().get_tool_config(<name>).model_extra`:
`searx_url`, `max_results`, `max_chars`.
### 2.3 Disabled: native community web tools
Every legacy provider's `tools.py` was replaced with a hard-fail stub that
raises `NativeWebToolDisabledError` **at module import time**. Importing
the module aborts with a clear message pointing at the searx replacement,
so a misconfigured `tool.use:` path in `config.yaml` fails loud, not silent.
| Provider | Status | Reason |
|---|---|---|
| `community/ddg_search/tools.py` | stub | unhardened DuckDuckGo HTML scrape |
| `community/tavily/tools.py` | stub | external API, no sanitization |
| `community/exa/tools.py` | stub | external API, no sanitization |
| `community/firecrawl/tools.py` | stub | external API, no sanitization |
| `community/jina_ai/tools.py` | stub | unhardened Jina Reader |
| `community/jina_ai/jina_client.py` | stub | back-door client, also disabled |
| `community/infoquest/tools.py` | stub | external API, no sanitization |
| `community/infoquest/infoquest_client.py` | stub | back-door client, also disabled |
| `community/image_search/tools.py` | stub | unhardened DDG image fallback |
Central reject helper: `community/_disabled_native.py`
`reject_native_provider(name)` raises `NativeWebToolDisabledError`.
### 2.4 Quarantined tests
Tests that expected the native modules to be importable are moved to
`deer-flow/backend/tests/_disabled_native/`. A `conftest.py` in that
directory sets `collect_ignore_glob = ["*.py"]` so pytest skips them
without erroring.
| Test | Reason |
|---|---|
| `test_exa_tools.py` | imports `deerflow.community.exa.tools` |
| `test_firecrawl_tools.py` | imports `deerflow.community.firecrawl.tools` |
| `test_jina_client.py` | imports `deerflow.community.jina_ai.jina_client` |
| `test_infoquest_client.py` | imports `deerflow.community.infoquest.infoquest_client` |
`test_doctor.py` and `test_setup_wizard.py` reference the native paths
**only as strings in test configs** (not as imports), so they continue to
run unchanged.
### 2.5 Sanitizer bug fix
`PromptInjectionSanitizer.sanitize()` Layer 7 used to do
`re.sub(r'\s+', ' ', text)` which collapsed `\n` and `\t` into single
spaces — destroying list/table structure from web pages. Replaced with
horizontal-whitespace-only collapse plus `\n{3,} -> \n\n`. Verified by
`test_security_sanitizer.py::test_preserves_newlines_and_tabs`.
### 2.6 Hardened runtime config
`config.yaml` (top-level, **not** `deer-flow/config.example.yaml`) is the
runtime config and references **only** the searx-backed tools:
```yaml
tools:
- name: web_search
group: web
use: deerflow.community.searx.tools:web_search_tool
searx_url: http://10.67.67.1:8888
max_results: 10
- name: web_fetch
group: web
use: deerflow.community.searx.tools:web_fetch_tool
max_chars: 10000
- name: image_search
group: web
use: deerflow.community.searx.tools:image_search_tool
max_results: 5
```
The guardrail layer is intentionally not used as the primary block:
DeerFlow guardrails see only `tool.name` (e.g. `web_search`), and both the
hardened and the native version export the same name. The real block is
the import-time stub above.
## 3. Verification
All checks below assume `PYTHONPATH=deer-flow/backend/packages/harness`.
### 3.1 Hardened modules import
```bash
python3 -c "
from deerflow.security.content_delimiter import wrap_untrusted_content
from deerflow.security.sanitizer import sanitizer
from deerflow.security.html_cleaner import extract_secure_text
import importlib.util
assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
print('OK')
"
```
### 3.2 Native modules fail closed
```bash
python3 -c "
for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
try:
__import__(f'deerflow.community.{prov}.tools')
raise SystemExit(f'FAIL: {prov} imported')
except RuntimeError as e:
assert 'disabled in this hardened DeerFlow build' in str(e)
print('OK — all native providers blocked')
"
```
### 3.3 Security tests
```bash
PYTHONPATH=deer-flow/backend/packages/harness pytest \
backend/tests/test_security_sanitizer.py \
backend/tests/test_security_html_cleaner.py -q
```
Expected: `8 passed`.
## 4. Adding a new web tool
1. Implement it in `deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py`.
2. **Always** sanitize external strings via `deerflow.security.sanitizer`.
3. **Always** wrap the response with `wrap_untrusted_content()`.
4. For HTML input, use `extract_secure_text()` first.
5. Add a test to `backend/tests/` that asserts the security delimiters are
present in the tool output.
6. Update this document.
## 5. Re-enabling a native provider (don't)
If you really must:
1. Replace the stub in `community/<provider>/tools.py` with a hardened
wrapper (sanitize → delimiter, just like searx).
2. Move the matching test out of `tests/_disabled_native/`.
3. Update this document and explain the threat-model change in your commit
message.
## 6. Files touched (audit trail)
```
deer-flow/backend/packages/harness/deerflow/security/__init__.py (new)
deer-flow/backend/packages/harness/deerflow/security/content_delimiter.py (new)
deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py (new)
deer-flow/backend/packages/harness/deerflow/security/sanitizer.py (new, with newline-preserving fix)
deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py (new)
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py (new)
deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py (new)
deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/tavily/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/exa/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/firecrawl/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/jina_client.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/infoquest_client.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/image_search/tools.py (replaced with stub)
deer-flow/backend/tests/_disabled_native/conftest.py (new — collect_ignore_glob)
deer-flow/backend/tests/_disabled_native/test_exa_tools.py (moved)
deer-flow/backend/tests/_disabled_native/test_firecrawl_tools.py (moved)
deer-flow/backend/tests/_disabled_native/test_jina_client.py (moved)
deer-flow/backend/tests/_disabled_native/test_infoquest_client.py (moved)
backend/packages/harness/deerflow/security/ (factory overlay, kept in sync)
backend/packages/harness/deerflow/community/searx/ (factory overlay, kept in sync)
backend/tests/test_security_sanitizer.py (factory tests)
backend/tests/test_security_html_cleaner.py (factory tests)
backend/tests/test_searx_tools.py (factory tests)
config.yaml (hardened runtime config, references only searx tools)
.env.example (template, no secrets)
HARDENING.md (this file)
```