Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection hardening: - New deerflow.security package: content_delimiter, html_cleaner, sanitizer (8 layers — invisible chars, control chars, symbols, NFC, PUA, tag chars, horizontal whitespace collapse with newline/tab preservation, length cap) - New deerflow.community.searx package: web_search, web_fetch, image_search backed by a private SearX instance, every external string sanitized and wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>> delimiters - All native community web providers (ddg_search, tavily, exa, firecrawl, jina_ai, infoquest, image_search) replaced with hard-fail stubs that raise NativeWebToolDisabledError at import time, so a misconfigured tool.use path fails loud rather than silently falling back to unsanitized output - Native client back-doors (jina_client.py, infoquest_client.py) stubbed too - Native-tool tests quarantined under tests/_disabled_native/ (collect_ignore_glob via local conftest.py) - Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve newlines and tabs so list/table structure survives - Hardened runtime config.yaml references only the searx-backed tools - Factory overlay (backend/) kept in sync with deer-flow tree as a reference / source See HARDENING.md for the full audit trail and verification steps.
10 KiB
DeerFlow Hardening Notes
This repository is a hardened deployment of bytedance/deer-flow with the only goal of preventing prompt-injection attacks via the agent's web access surface.
The upstream tree lives in deer-flow/ and is checked in directly (no
submodule, no nested git). All hardening changes are kept inside that tree
so that python -m deerflow.community.searx.tools resolves out of the box
once deer-flow/backend/packages/harness is on PYTHONPATH.
This document is a defense-in-depth audit trail. If you change any of the files listed here, please update this document in the same commit.
1. Threat model
Prompt-injection via untrusted web content. An attacker controls the body of an HTML page (or a search-result snippet) and tries to make the model:
- Treat externally fetched text as system instructions (delimiter confusion).
- Smuggle hidden tokens via invisible Unicode (zero-width spaces, BOM, PUA, tag characters).
- Inject executable HTML (
<script>,<iframe>,<form>, ...) that the model would summarise verbatim.
The hardening below is a port of the OpenClaw approach (searx-scripts/,
fetch-scripts/) to DeerFlow's adapter contract.
2. What was changed
2.1 New: deerflow.security
deer-flow/backend/packages/harness/deerflow/security/
| File | Purpose |
|---|---|
__init__.py |
Public re-exports |
content_delimiter.py |
Wraps untrusted content in <<<EXTERNAL_UNTRUSTED_CONTENT>>> ... <<<END_EXTERNAL_UNTRUSTED_CONTENT>>> so the LLM has a semantic boundary between system instructions and external data |
html_cleaner.py |
SecureTextExtractor strips script, style, noscript, header, footer, nav, aside, iframe, object, embed, form |
sanitizer.py |
PromptInjectionSanitizer: 8 layers — invisible chars, control chars, symbols (So/Sk), NFC normalize, PUA, tag chars, horizontal-whitespace collapse (newlines/tabs preserved), length cap |
2.2 New: deerflow.community.searx
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py
LangChain @tool exports:
web_search_tool(query, max_results=10)— calls a private SearX instance, sanitizes title + content, wraps results in security delimitersweb_fetch_tool(url, max_chars=10000)— fetches URL, runsextract_secure_textthensanitizer.sanitize, wraps resultimage_search_tool(query, max_results=5)— SearXcategories=images, sanitized title/url/thumbnail, wrapped
Reads its config from get_app_config().get_tool_config(<name>).model_extra:
searx_url, max_results, max_chars.
2.3 Disabled: native community web tools
Every legacy provider's tools.py was replaced with a hard-fail stub that
raises NativeWebToolDisabledError at module import time. Importing
the module aborts with a clear message pointing at the searx replacement,
so a misconfigured tool.use: path in config.yaml fails loud, not silent.
| Provider | Status | Reason |
|---|---|---|
community/ddg_search/tools.py |
stub | unhardened DuckDuckGo HTML scrape |
community/tavily/tools.py |
stub | external API, no sanitization |
community/exa/tools.py |
stub | external API, no sanitization |
community/firecrawl/tools.py |
stub | external API, no sanitization |
community/jina_ai/tools.py |
stub | unhardened Jina Reader |
community/jina_ai/jina_client.py |
stub | back-door client, also disabled |
community/infoquest/tools.py |
stub | external API, no sanitization |
community/infoquest/infoquest_client.py |
stub | back-door client, also disabled |
community/image_search/tools.py |
stub | unhardened DDG image fallback |
Central reject helper: community/_disabled_native.py —
reject_native_provider(name) raises NativeWebToolDisabledError.
2.4 Quarantined tests
Tests that expected the native modules to be importable are moved to
deer-flow/backend/tests/_disabled_native/. A conftest.py in that
directory sets collect_ignore_glob = ["*.py"] so pytest skips them
without erroring.
| Test | Reason |
|---|---|
test_exa_tools.py |
imports deerflow.community.exa.tools |
test_firecrawl_tools.py |
imports deerflow.community.firecrawl.tools |
test_jina_client.py |
imports deerflow.community.jina_ai.jina_client |
test_infoquest_client.py |
imports deerflow.community.infoquest.infoquest_client |
test_doctor.py and test_setup_wizard.py reference the native paths
only as strings in test configs (not as imports), so they continue to
run unchanged.
2.5 Sanitizer bug fix
PromptInjectionSanitizer.sanitize() Layer 7 used to do
re.sub(r'\s+', ' ', text) which collapsed \n and \t into single
spaces — destroying list/table structure from web pages. Replaced with
horizontal-whitespace-only collapse plus \n{3,} -> \n\n. Verified by
test_security_sanitizer.py::test_preserves_newlines_and_tabs.
2.6 Hardened runtime config
config.yaml (top-level, not deer-flow/config.example.yaml) is the
runtime config and references only the searx-backed tools:
tools:
- name: web_search
group: web
use: deerflow.community.searx.tools:web_search_tool
searx_url: http://10.67.67.1:8888
max_results: 10
- name: web_fetch
group: web
use: deerflow.community.searx.tools:web_fetch_tool
max_chars: 10000
- name: image_search
group: web
use: deerflow.community.searx.tools:image_search_tool
max_results: 5
The guardrail layer is intentionally not used as the primary block:
DeerFlow guardrails see only tool.name (e.g. web_search), and both the
hardened and the native version export the same name. The real block is
the import-time stub above.
3. Verification
All checks below assume PYTHONPATH=deer-flow/backend/packages/harness.
3.1 Hardened modules import
python3 -c "
from deerflow.security.content_delimiter import wrap_untrusted_content
from deerflow.security.sanitizer import sanitizer
from deerflow.security.html_cleaner import extract_secure_text
import importlib.util
assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
print('OK')
"
3.2 Native modules fail closed
python3 -c "
for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
try:
__import__(f'deerflow.community.{prov}.tools')
raise SystemExit(f'FAIL: {prov} imported')
except RuntimeError as e:
assert 'disabled in this hardened DeerFlow build' in str(e)
print('OK — all native providers blocked')
"
3.3 Security tests
PYTHONPATH=deer-flow/backend/packages/harness pytest \
backend/tests/test_security_sanitizer.py \
backend/tests/test_security_html_cleaner.py -q
Expected: 8 passed.
4. Adding a new web tool
- Implement it in
deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py. - Always sanitize external strings via
deerflow.security.sanitizer. - Always wrap the response with
wrap_untrusted_content(). - For HTML input, use
extract_secure_text()first. - Add a test to
backend/tests/that asserts the security delimiters are present in the tool output. - Update this document.
5. Re-enabling a native provider (don't)
If you really must:
- Replace the stub in
community/<provider>/tools.pywith a hardened wrapper (sanitize → delimiter, just like searx). - Move the matching test out of
tests/_disabled_native/. - Update this document and explain the threat-model change in your commit message.
6. Files touched (audit trail)
deer-flow/backend/packages/harness/deerflow/security/__init__.py (new)
deer-flow/backend/packages/harness/deerflow/security/content_delimiter.py (new)
deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py (new)
deer-flow/backend/packages/harness/deerflow/security/sanitizer.py (new, with newline-preserving fix)
deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py (new)
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py (new)
deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py (new)
deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/tavily/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/exa/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/firecrawl/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/jina_client.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/infoquest_client.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/image_search/tools.py (replaced with stub)
deer-flow/backend/tests/_disabled_native/conftest.py (new — collect_ignore_glob)
deer-flow/backend/tests/_disabled_native/test_exa_tools.py (moved)
deer-flow/backend/tests/_disabled_native/test_firecrawl_tools.py (moved)
deer-flow/backend/tests/_disabled_native/test_jina_client.py (moved)
deer-flow/backend/tests/_disabled_native/test_infoquest_client.py (moved)
backend/packages/harness/deerflow/security/ (factory overlay, kept in sync)
backend/packages/harness/deerflow/community/searx/ (factory overlay, kept in sync)
backend/tests/test_security_sanitizer.py (factory tests)
backend/tests/test_security_html_cleaner.py (factory tests)
backend/tests/test_searx_tools.py (factory tests)
config.yaml (hardened runtime config, references only searx tools)
.env.example (template, no secrets)
HARDENING.md (this file)