
DATA 6de0bf9f5b Initial commit: hardened DeerFlow factory

Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection
hardening:

- New deerflow.security package: content_delimiter, html_cleaner,
  sanitizer (8 layers — invisible chars, control chars, symbols, NFC,
  PUA, tag chars, horizontal whitespace collapse with newline/tab
  preservation, length cap)
- New deerflow.community.searx package: web_search, web_fetch,
  image_search backed by a private SearX instance, every external
  string sanitized and wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>>
  delimiters
- All native community web providers (ddg_search, tavily, exa,
  firecrawl, jina_ai, infoquest, image_search) replaced with hard-fail
  stubs that raise NativeWebToolDisabledError at import time, so a
  misconfigured tool.use path fails loud rather than silently falling
  back to unsanitized output
- Native client back-doors (jina_client.py, infoquest_client.py)
  stubbed too
- Native-tool tests quarantined under tests/_disabled_native/
  (collect_ignore_glob via local conftest.py)
- Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve
  newlines and tabs so list/table structure survives
- Hardened runtime config.yaml references only the searx-backed tools
- Factory overlay (backend/) kept in sync with deer-flow tree as a
  reference / source

See HARDENING.md for the full audit trail and verification steps.

2026-04-12 14:23:57 +02:00


DeerFlow Hardening Notes

This repository is a hardened deployment of bytedance/deer-flow whose sole goal is to prevent prompt-injection attacks via the agent's web access surface.

The upstream tree lives in deer-flow/ and is checked in directly (no submodule, no nested git). All hardening changes are kept inside that tree so that python -m deerflow.community.searx.tools resolves out of the box once deer-flow/backend/packages/harness is on PYTHONPATH.

This document is a defense-in-depth audit trail. If you change any of the files listed here, please update this document in the same commit.

1. Threat model

Prompt injection via untrusted web content. An attacker controls the body of an HTML page (or a search-result snippet) and tries to:

- Make the model treat externally fetched text as system instructions (delimiter confusion).
- Smuggle hidden tokens via invisible Unicode (zero-width spaces, BOM, PUA, tag characters).
- Inject executable HTML (<script>, <iframe>, <form>, ...) that the model would summarise verbatim.

The hardening below is a port of the OpenClaw approach (searx-scripts/, fetch-scripts/) to DeerFlow's adapter contract.

2. What was changed

2.1 New: `deerflow.security`

deer-flow/backend/packages/harness/deerflow/security/

File	Purpose
`__init__.py`	Public re-exports
`content_delimiter.py`	Wraps untrusted content in `<<<EXTERNAL_UNTRUSTED_CONTENT>>> ... <<<END_EXTERNAL_UNTRUSTED_CONTENT>>>` so the LLM has a semantic boundary between system instructions and external data
`html_cleaner.py`	`SecureTextExtractor` strips `script`, `style`, `noscript`, `header`, `footer`, `nav`, `aside`, `iframe`, `object`, `embed`, `form`
`sanitizer.py`	`PromptInjectionSanitizer`: 8 layers — invisible chars, control chars, symbols (So/Sk), NFC normalize, PUA, tag chars, horizontal-whitespace collapse (newlines/tabs preserved), length cap
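
To make the layer list concrete, here is a minimal sketch of the eight sanitizer layers and the delimiter wrapper. It is not the code in `sanitizer.py` or `content_delimiter.py`: the names `sanitize_sketch` and `wrap_sketch`, the exact set of invisible code points, and the stripping of delimiter look-alikes embedded in the content are all assumptions made for illustration.

```python
import re
import unicodedata

START = "<<<EXTERNAL_UNTRUSTED_CONTENT>>>"
END = "<<<END_EXTERNAL_UNTRUSTED_CONTENT>>>"

# Assumed subset of invisible code points: zero-width space/joiners,
# word joiner, BOM, soft hyphen. The real list may differ.
INVISIBLE = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF, 0x00AD}


def _drop(text, should_drop):
    """Return text without the characters for which should_drop(ch) is true."""
    return "".join(ch for ch in text if not should_drop(ch))


def sanitize_sketch(text: str, max_chars: int = 10000) -> str:
    # Layer 1: invisible characters
    text = _drop(text, lambda ch: ord(ch) in INVISIBLE)
    # Layer 2: control characters, keeping newline and tab
    text = _drop(text, lambda ch: unicodedata.category(ch) == "Cc" and ch not in "\n\t")
    # Layer 3: symbols in categories So and Sk
    text = _drop(text, lambda ch: unicodedata.category(ch) in ("So", "Sk"))
    # Layer 4: NFC normalization
    text = unicodedata.normalize("NFC", text)
    # Layer 5: private-use characters (category Co)
    text = _drop(text, lambda ch: unicodedata.category(ch) == "Co")
    # Layer 6: Unicode tag characters
    text = _drop(text, lambda ch: 0xE0000 <= ord(ch) <= 0xE007F)
    # Layer 7: collapse horizontal whitespace only, then cap blank-line runs
    text = re.sub(r"[^\S\n\t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Layer 8: length cap
    return text[:max_chars]


def wrap_sketch(text: str) -> str:
    # Assumption: embedded copies of the markers are removed so external
    # content cannot close the block early. The source does not confirm this.
    text = text.replace(START, "").replace(END, "")
    return f"{START}\n{text}\n{END}"
```

Running each layer as a separate pass keeps the order auditable against the table above, at the cost of several passes over the text.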

2.2 New: `deerflow.community.searx`

deer-flow/backend/packages/harness/deerflow/community/searx/tools.py

LangChain @tool exports:

- web_search_tool(query, max_results=10): calls a private SearX instance, sanitizes title and content, and wraps the results in security delimiters
- web_fetch_tool(url, max_chars=10000): fetches the URL, runs extract_secure_text followed by sanitizer.sanitize, and wraps the result
- image_search_tool(query, max_results=5): queries SearX with categories=images, sanitizes title/url/thumbnail, and wraps the result

Each tool reads its configuration from get_app_config().get_tool_config(<name>).model_extra; the recognised keys are searx_url, max_results, and max_chars.
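
The text-extraction step that web_fetch_tool runs before sanitizing can be pictured with the standard-library `html.parser`. This is a sketch only: `TextExtractorSketch` and `extract_text_sketch` are hypothetical stand-ins for `SecureTextExtractor` and `extract_secure_text`, whose real implementation is not shown in this document. The skipped tag list is the one given for `html_cleaner.py`.

```python
from html.parser import HTMLParser

# Tags whose content is dropped, as listed for html_cleaner.py.
SKIP_TAGS = {"script", "style", "noscript", "header", "footer", "nav",
             "aside", "iframe", "object", "embed", "form"}
# <embed> is a void element with no closing tag, so it must not be
# depth-counted or everything after it would be swallowed.
VOID_TAGS = {"embed"}


class TextExtractorSketch(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0      # > 0 while inside a skipped element
        self._chunks = []         # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS and tag not in VOID_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and tag not in VOID_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside any skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text_sketch(html: str) -> str:
    parser = TextExtractorSketch()
    parser.feed(html)
    parser.close()
    return parser.text()
```

One caveat of depth counting: a skipped tag that is never closed in malformed HTML hides all text after it. For this threat model that failure mode drops content rather than leaking it.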

2.3 Disabled: native community web tools

Every legacy provider's tools.py was replaced with a hard-fail stub that raises NativeWebToolDisabledError at module import time. Importing the module aborts with a clear message pointing at the searx replacement, so a misconfigured tool.use: path in config.yaml fails loudly instead of silently falling back to unsanitized output.

Provider	Status	Reason
`community/ddg_search/tools.py`	stub	unhardened DuckDuckGo HTML scrape
`community/tavily/tools.py`	stub	external API, no sanitization
`community/exa/tools.py`	stub	external API, no sanitization
`community/firecrawl/tools.py`	stub	external API, no sanitization
`community/jina_ai/tools.py`	stub	unhardened Jina Reader
`community/jina_ai/jina_client.py`	stub	back-door client, also disabled
`community/infoquest/tools.py`	stub	external API, no sanitization
`community/infoquest/infoquest_client.py`	stub	back-door client, also disabled
`community/image_search/tools.py`	stub	unhardened DDG image fallback

Central reject helper: community/_disabled_native.py — reject_native_provider(name) raises NativeWebToolDisabledError.
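
A minimal sketch of what the reject helper and a stub might look like. The `RuntimeError` base class and the phrase "disabled in this hardened DeerFlow build" are inferred from the check in section 3.2; the rest of the message wording is an assumption, not the text of `_disabled_native.py`.

```python
from typing import NoReturn


class NativeWebToolDisabledError(RuntimeError):
    """Raised when a disabled native web provider module is imported."""


def reject_native_provider(name: str) -> NoReturn:
    # Always raises: the point is that importing a stub can never succeed.
    raise NativeWebToolDisabledError(
        f"Native web provider '{name}' is disabled in this hardened DeerFlow build. "
        "Use deerflow.community.searx.tools instead."
    )


# A stub tools.py would then contain little more than a module-level call,
# so the import itself fails (shown as a comment so this sketch stays importable):
#
#     from deerflow.community._disabled_native import reject_native_provider
#     reject_native_provider("tavily")
```

Raising at import time, rather than when the tool is first called, means a bad `use:` path surfaces while the configuration is loaded and not in the middle of an agent run.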

2.4 Quarantined tests

Tests that expected the native modules to be importable were moved to deer-flow/backend/tests/_disabled_native/. A conftest.py in that directory sets collect_ignore_glob = ["*.py"] so pytest skips them without erroring.

Test	Reason
`test_exa_tools.py`	imports `deerflow.community.exa.tools`
`test_firecrawl_tools.py`	imports `deerflow.community.firecrawl.tools`
`test_jina_client.py`	imports `deerflow.community.jina_ai.jina_client`
`test_infoquest_client.py`	imports `deerflow.community.infoquest.infoquest_client`

test_doctor.py and test_setup_wizard.py reference the native paths only as strings in test configs (not as imports), so they continue to run unchanged.
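
For reference, the quarantine described above needs only one line in the local conftest.py. The sketch below shows that setting on its own; the real file may contain more than this.

```python
# deer-flow/backend/tests/_disabled_native/conftest.py
# pytest reads this module-level list and does not collect files in this
# directory that match any of the glob patterns.
collect_ignore_glob = ["*.py"]
```

Because the files are never collected, they are never imported, so the import-time stubs are not triggered during a normal test run.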

2.5 Sanitizer bug fix

Layer 7 of PromptInjectionSanitizer.sanitize() previously ran re.sub(r'\s+', ' ', text), which collapsed \n and \t into single spaces and destroyed the list/table structure of web pages. It now collapses horizontal whitespace only and reduces runs of three or more newlines to two (\n{3,} -> \n\n). Verified by test_security_sanitizer.py::test_preserves_newlines_and_tabs.
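
The difference between the old and new behaviour can be reproduced with two small functions. The first uses the regex quoted above. The second is an assumed equivalent of the fix: the pattern `[^\S\n\t]+` (whitespace other than newline and tab) is one way to express "horizontal whitespace only" and may not be the exact pattern used in `sanitizer.py`.

```python
import re


def collapse_all_whitespace(text: str) -> str:
    """Old Layer 7: every whitespace run, including newlines and tabs, becomes one space."""
    return re.sub(r"\s+", " ", text)


def collapse_horizontal_whitespace(text: str) -> str:
    """New Layer 7 (sketch): collapse spaces only, then cap runs of blank lines."""
    text = re.sub(r"[^\S\n\t]+", " ", text)   # whitespace except \n and \t
    return re.sub(r"\n{3,}", "\n\n", text)    # 3+ newlines -> exactly 2


sample = "- item  one\n- item   two\n\n\n\ncol1\tcol2"
old = collapse_all_whitespace(sample)         # list and table flattened to one line
new = collapse_horizontal_whitespace(sample)  # line breaks and tab column kept
```

With the sample above, `old` is a single line, while `new` keeps both list items on separate lines and the tab between `col1` and `col2`.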

2.6 Hardened runtime config

config.yaml (top-level, not deer-flow/config.example.yaml) is the runtime config and references only the searx-backed tools:

tools:
  - name: web_search
    group: web
    use: deerflow.community.searx.tools:web_search_tool
    searx_url: http://10.67.67.1:8888
    max_results: 10
  - name: web_fetch
    group: web
    use: deerflow.community.searx.tools:web_fetch_tool
    max_chars: 10000
  - name: image_search
    group: web
    use: deerflow.community.searx.tools:image_search_tool
    max_results: 5

The guardrail layer is intentionally not used as the primary block: DeerFlow guardrails see only tool.name (e.g. web_search), and both the hardened and the native version export the same name. The real block is the import-time stub above.
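
Since the guardrail layer cannot tell the hardened and native tools apart by name, a check on the `use:` paths is a cheap complement. The helper below is hypothetical and not part of the repository. It takes the already-parsed `tools` list (for example from a YAML loader) and reports any entry that does not point at the searx package.

```python
ALLOWED_PREFIX = "deerflow.community.searx.tools:"


def non_searx_tools(tools):
    """Return the names of tool entries whose `use` path is not searx-backed."""
    return [
        entry.get("name", "<unnamed>")
        for entry in tools
        if not str(entry.get("use", "")).startswith(ALLOWED_PREFIX)
    ]


# The three entries from the runtime config shown above.
hardened = [
    {"name": "web_search", "use": "deerflow.community.searx.tools:web_search_tool"},
    {"name": "web_fetch", "use": "deerflow.community.searx.tools:web_fetch_tool"},
    {"name": "image_search", "use": "deerflow.community.searx.tools:image_search_tool"},
]

# Hypothetical misconfiguration: same tool name, native provider path.
misconfigured = hardened + [
    {"name": "web_search", "use": "deerflow.community.tavily.tools:web_search_tool"},
]

offenders_ok = non_searx_tools(hardened)
offenders_bad = non_searx_tools(misconfigured)
```

This only inspects the configuration. The import-time stub remains the actual block.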

3. Verification

All checks below assume PYTHONPATH=deer-flow/backend/packages/harness.

3.1 Hardened modules import

python3 -c "
from deerflow.security.content_delimiter import wrap_untrusted_content
from deerflow.security.sanitizer import sanitizer
from deerflow.security.html_cleaner import extract_secure_text
import importlib.util
assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
print('OK')
"

3.2 Native modules fail closed

python3 -c "
for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
    try:
        __import__(f'deerflow.community.{prov}.tools')
        raise SystemExit(f'FAIL: {prov} imported')
    except RuntimeError as e:
        assert 'disabled in this hardened DeerFlow build' in str(e)
print('OK — all native providers blocked')
"

3.3 Security tests

PYTHONPATH=deer-flow/backend/packages/harness pytest \
    backend/tests/test_security_sanitizer.py \
    backend/tests/test_security_html_cleaner.py -q

Expected: 8 passed.

4. Adding a new web tool

1. Implement it in deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py.
2. For HTML input, run extract_secure_text() first.
3. Always sanitize external strings via deerflow.security.sanitizer.
4. Always wrap the response with wrap_untrusted_content().
5. Add a test to backend/tests/ that asserts the security delimiters are present in the tool output.
6. Update this document.
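
The checklist above boils down to one composition (extract, then sanitize, then wrap) plus a delimiter assertion in the test. The sketch below passes the three steps in as callables because the exact signatures of `extract_secure_text`, `sanitizer.sanitize`, and `wrap_untrusted_content` are not shown here. `build_tool_output`, `has_security_delimiters`, and the lambdas are hypothetical stand-ins.

```python
START = "<<<EXTERNAL_UNTRUSTED_CONTENT>>>"
END = "<<<END_EXTERNAL_UNTRUSTED_CONTENT>>>"


def build_tool_output(raw, extract, sanitize, wrap):
    """Apply the three hardening steps in the required order."""
    return wrap(sanitize(extract(raw)))


def has_security_delimiters(output: str) -> bool:
    """True if the opening marker appears before the closing marker."""
    start = output.find(START)
    end = output.find(END)
    return start != -1 and end != -1 and start < end


# Usage with trivial stand-ins for the real deerflow.security functions.
output = build_tool_output(
    "<p> hello </p>",
    extract=lambda html: html.replace("<p>", "").replace("</p>", ""),
    sanitize=str.strip,
    wrap=lambda text: f"{START}\n{text}\n{END}",
)
assert has_security_delimiters(output)
```

In a real test, the assertion would run against the output of the new tool itself, with the network call mocked.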

5. Re-enabling a native provider (don't)

If you really must:

1. Replace the stub in community/<provider>/tools.py with a hardened wrapper (sanitize → delimiter, just like searx).
2. Move the matching test out of tests/_disabled_native/.
3. Update this document and explain the threat-model change in your commit message.

6. Files touched (audit trail)

deer-flow/backend/packages/harness/deerflow/security/__init__.py          (new)
deer-flow/backend/packages/harness/deerflow/security/content_delimiter.py (new)
deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py      (new)
deer-flow/backend/packages/harness/deerflow/security/sanitizer.py         (new, with newline-preserving fix)

deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py   (new)
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py      (new)

deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py        (new)
deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py        (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/tavily/tools.py            (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/exa/tools.py               (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/firecrawl/tools.py         (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/tools.py           (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/jina_client.py     (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/tools.py         (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/infoquest_client.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/image_search/tools.py      (replaced with stub)

deer-flow/backend/tests/_disabled_native/conftest.py                      (new — collect_ignore_glob)
deer-flow/backend/tests/_disabled_native/test_exa_tools.py                (moved)
deer-flow/backend/tests/_disabled_native/test_firecrawl_tools.py          (moved)
deer-flow/backend/tests/_disabled_native/test_jina_client.py              (moved)
deer-flow/backend/tests/_disabled_native/test_infoquest_client.py         (moved)

backend/packages/harness/deerflow/security/                               (factory overlay, kept in sync)
backend/packages/harness/deerflow/community/searx/                        (factory overlay, kept in sync)
backend/tests/test_security_sanitizer.py                                  (factory tests)
backend/tests/test_security_html_cleaner.py                               (factory tests)
backend/tests/test_searx_tools.py                                         (factory tests)

config.yaml                                                                (hardened runtime config, references only searx tools)
.env.example                                                               (template, no secrets)
HARDENING.md                                                               (this file)
