Agents in this build are text-only researchers. Image, audio, video, and binary content has no role in the pipeline and only widens the attack surface (server-side image fetches, exfiltration via rendered img tags, etc). The cleanest answer is to never load it in the first place rather than maintain a domain allowlist that nobody can keep up to date. - web_fetch_tool now uses httpx.AsyncClient.stream and inspects the Content-Type header BEFORE the body is read into memory. Only text/*, application/json, application/xml, application/xhtml+xml, application/ld+json, application/atom+xml, application/rss+xml are accepted; everything else (image/*, audio/*, video/*, octet-stream, pdf, font, missing header, ...) is refused with a wrap_untrusted error reply. The body bytes never enter the process for refused responses. Read budget is bounded to ~4x max_chars regardless. - image_search_tool removed from deerflow.community.searx.tools (both the deer-flow runtime tree and the factory overlay). The function is gone, not stubbed — any tool.use referencing it will raise AttributeError at tool-loading time. - config.yaml: image_search tool entry removed; the example allowed_tools list updated to drop image_search. - HARDENING.md: new section 2.8 explains the policy and the frontend caveat (the LLM can still emit  markdown which the user's browser would render — that requires a separate frontend patch that is not yet implemented). Section 3.4 adds a verification snippet for the policy. The web_fetch entry in section 2.2 is updated to mention the streaming Content-Type gate. Both source trees stay in sync.
17 KiB
DeerFlow Hardening Notes
This repository is a hardened deployment of bytedance/deer-flow with the only goal of preventing prompt-injection attacks via the agent's web access surface.
The upstream tree lives in deer-flow/ and is checked in directly (no
submodule, no nested git). All hardening changes are kept inside that tree
so that python -m deerflow.community.searx.tools resolves out of the box
once deer-flow/backend/packages/harness is on PYTHONPATH.
This document is a defense-in-depth audit trail. If you change any of the files listed here, please update this document in the same commit.
1. Threat model
Prompt-injection via untrusted web content. An attacker controls the body of an HTML page (or a search-result snippet) and tries to make the model:
- Treat externally fetched text as system instructions (delimiter confusion).
- Smuggle hidden tokens via invisible Unicode (zero-width spaces, BOM, PUA, tag characters).
- Inject executable HTML (
<script>,<iframe>,<form>, ...) that the model would summarise verbatim.
The hardening below is a port of the OpenClaw approach (searx-scripts/,
fetch-scripts/) to DeerFlow's adapter contract.
2. What was changed
2.1 New: deerflow.security
deer-flow/backend/packages/harness/deerflow/security/
| File | Purpose |
|---|---|
__init__.py |
Public re-exports |
content_delimiter.py |
Wraps untrusted content in <<<EXTERNAL_UNTRUSTED_CONTENT>>> ... <<<END_EXTERNAL_UNTRUSTED_CONTENT>>> so the LLM has a semantic boundary between system instructions and external data |
html_cleaner.py |
SecureTextExtractor strips script, style, noscript, header, footer, nav, aside, iframe, object, embed, form |
sanitizer.py |
PromptInjectionSanitizer: 8 layers — invisible chars, control chars, symbols (So/Sk), NFC normalize, PUA, tag chars, horizontal-whitespace collapse (newlines/tabs preserved), length cap |
2.2 New: deerflow.community.searx
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py
LangChain @tool exports:
web_search_tool(query, max_results=10)— calls a private SearX instance, sanitizes title + content, wraps results in security delimitersweb_fetch_tool(url, max_chars=10000)— streams response, refuses non-text Content-Type before reading the body, then runsextract_secure_textandsanitizer.sanitizeover the head of the body, wraps result
image_search_tool was removed on purpose — see section 2.8.
Reads its config from get_app_config().get_tool_config(<name>).model_extra:
searx_url, max_results, max_chars.
2.3 Disabled: native community web tools
Every legacy provider's tools.py was replaced with a hard-fail stub that
raises NativeWebToolDisabledError at module import time. Importing
the module aborts with a clear message pointing at the searx replacement,
so a misconfigured tool.use: path in config.yaml fails loud, not silent.
| Provider | Status | Reason |
|---|---|---|
community/ddg_search/tools.py |
stub | unhardened DuckDuckGo HTML scrape |
community/tavily/tools.py |
stub | external API, no sanitization |
community/exa/tools.py |
stub | external API, no sanitization |
community/firecrawl/tools.py |
stub | external API, no sanitization |
community/jina_ai/tools.py |
stub | unhardened Jina Reader |
community/jina_ai/jina_client.py |
stub | back-door client, also disabled |
community/infoquest/tools.py |
stub | external API, no sanitization |
community/infoquest/infoquest_client.py |
stub | back-door client, also disabled |
community/image_search/tools.py |
stub | unhardened DDG image fallback |
Central reject helper: community/_disabled_native.py —
reject_native_provider(name) raises NativeWebToolDisabledError.
2.4 Quarantined tests
Tests that expected the native modules to be importable are moved to
deer-flow/backend/tests/_disabled_native/. A conftest.py in that
directory sets collect_ignore_glob = ["*.py"] so pytest skips them
without erroring.
| Test | Reason |
|---|---|
test_exa_tools.py |
imports deerflow.community.exa.tools |
test_firecrawl_tools.py |
imports deerflow.community.firecrawl.tools |
test_jina_client.py |
imports deerflow.community.jina_ai.jina_client |
test_infoquest_client.py |
imports deerflow.community.infoquest.infoquest_client |
test_doctor.py and test_setup_wizard.py reference the native paths
only as strings in test configs (not as imports), so they continue to
run unchanged.
2.5 Sanitizer bug fix
PromptInjectionSanitizer.sanitize() Layer 7 used to do
re.sub(r'\s+', ' ', text) which collapsed \n and \t into single
spaces — destroying list/table structure from web pages. Replaced with
horizontal-whitespace-only collapse plus \n{3,} -> \n\n. Verified by
test_security_sanitizer.py::test_preserves_newlines_and_tabs.
2.6 Hardened runtime config
config.yaml (top-level, not deer-flow/config.example.yaml) is the
runtime config and references only the searx-backed tools:
tools:
- name: web_search
group: web
use: deerflow.community.searx.tools:web_search_tool
searx_url: http://10.67.67.1:8888
max_results: 10
- name: web_fetch
group: web
use: deerflow.community.searx.tools:web_fetch_tool
max_chars: 10000
- name: image_search
group: web
use: deerflow.community.searx.tools:image_search_tool
max_results: 5
The guardrail layer is intentionally not used as the primary block:
DeerFlow guardrails see only tool.name (e.g. web_search), and both the
hardened and the native version export the same name. The real block is
the import-time stub above.
2.7 Network isolation (egress firewall)
The DeerFlow team recommends running the agent in a dedicated VLAN. Our Fritzbox cannot do LAN VLANs, so instead we put the container behind an egress firewall on the Docker host. The container can reach the Internet plus a small whitelist of Wireguard hosts (Searx, local model servers), but cannot scan or attack any device on the home LAN. Inbound traffic from the LAN to the container's published ports is unaffected because the rules are stateful.
Allow (egress from container):
| Destination | Purpose |
|---|---|
1.0.0.0/8 ... 223.0.0.0/8 (public Internet) |
Ollama Cloud, search backends |
10.67.67.1 |
Searx (Wireguard) |
10.67.67.2 |
XTTS / Whisper / Ollama-local (Wireguard) |
Block (egress from container):
| Destination | Reason |
|---|---|
192.168.3.0/24 |
home LAN — no lateral movement |
10.0.0.0/8 (except whitelisted /32) |
other Wireguard subnets, RFC1918 |
172.16.0.0/12 |
other Docker bridges |
Implementation:
| File | Role |
|---|---|
docker/docker-compose.override.yaml |
Pins the upstream deer-flow Docker network to a stable Linux bridge name br-deerflow, so the firewall can address it without guessing Docker's auto-generated br-<hash>. Used as a -f overlay on top of deer-flow/docker/docker-compose.yaml. |
scripts/deerflow-firewall.sh |
Idempotent up/down/status wrapper that installs the iptables rules in the DOCKER-USER chain. Inserted in reverse order so the final chain order is: stateful return, allow Searx, allow Ollama-local, block LAN, block /8, block /12. |
scripts/deerflow-firewall.nix |
NixOS module snippet defining systemd.services.deerflow-firewall. Ordered After=docker.service, Requires=docker.service, PartOf=docker.service so the rules survive dockerd restarts and follow its lifecycle. Copy into configuration.nix and nixos-rebuild switch. |
Important guarantees:
- The rules match on
-i br-deerflow. If the bridge does not exist (e.g. DeerFlow has never been started), the rules are no-ops and do not affect any other container (paperclip, telebrowser, openclaw-gateway, ...). They activate automatically the momentdocker compose ... up -dcreates the bridge. - Stopping or removing the DeerFlow container leaves the rules in place but inert. Stopping the systemd unit removes them.
- The script is idempotent:
upwill never duplicate a rule,downremoves all copies.
Bring up:
cd /home/data/deerflow-factory
docker compose \
-f deer-flow/docker/docker-compose.yaml \
-f docker/docker-compose.override.yaml \
up -d
# Then either run the script directly:
sudo scripts/deerflow-firewall.sh up
# ...or, on NixOS, copy scripts/deerflow-firewall.nix into configuration.nix
# and:
sudo nixos-rebuild switch
systemctl status deerflow-firewall
Smoke tests (run from inside the container, e.g. docker exec -it <id> sh):
# allowed
curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://10.67.67.1:8888/ # Searx -> 200
curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 https://api.cloudflare.com/ # Internet -> 200/4xx
# blocked (should fail with "no route" / "host prohibited" / timeout)
curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://192.168.3.1/ # FAIL
curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://10.67.67.16/ # FAIL (blocked by 10/8 reject; .16 is not whitelisted)
2.8 No-images policy
Agents in this build are text-only researchers. They never need to fetch image, audio, video, or binary content, so the entire pipeline is hardened to refuse it:
| Layer | What it does |
|---|---|
web_fetch_tool |
Streams the response and inspects the Content-Type header before reading the body. Anything that is not text/*, application/json, application/xml, application/xhtml+xml, application/ld+json, application/atom+xml, or application/rss+xml is refused with wrap_untrusted_content({"error": "Refused: non-text response..."}). The body bytes are never loaded into memory. |
image_search_tool |
Removed. The function no longer exists in deerflow/community/searx/tools.py. Any tool.use: deerflow.community.searx.tools:image_search_tool in config.yaml would fail with an attribute error during tool loading. |
config.yaml |
The image_search tool entry was deleted. Only web_search and web_fetch are registered in the web group. |
Why no allowlist? A domain allowlist for image fetching would either be impossible to maintain (research touches new domains every day) or silently rot into a permanent allow-everything. Removing image fetching entirely is the only honest answer for a text-only research use case.
Frontend caveat: the LLM can still emit 
markdown in its answer. If the deer-flow frontend renders that
markdown, the user's browser (not the container!) will load the
image and potentially leak referrer/timing data. The egress firewall
on data-nuc does not see this traffic. Mitigations:
- Best: configure the frontend's markdown renderer to disable images,
or replace
<img>tags with a placeholder. Not yet implemented in this repo — needs a patch in the deer-flow frontend. - Workaround: render answers in a CSP-restricted iframe with
img-src 'none'.
If you bring image fetching back, build a separate tool with an
explicit per-call allowlist and a server-side image proxy that runs
under the same egress firewall as the rest of the container. Do not
relax web_fetch_tool's Content-Type check.
3. Verification
All checks below assume PYTHONPATH=deer-flow/backend/packages/harness.
3.1 Hardened modules import
python3 -c "
from deerflow.security.content_delimiter import wrap_untrusted_content
from deerflow.security.sanitizer import sanitizer
from deerflow.security.html_cleaner import extract_secure_text
import importlib.util
assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
print('OK')
"
3.2 Native modules fail closed
python3 -c "
for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
try:
__import__(f'deerflow.community.{prov}.tools')
raise SystemExit(f'FAIL: {prov} imported')
except RuntimeError as e:
assert 'disabled in this hardened DeerFlow build' in str(e)
print('OK — all native providers blocked')
"
3.3 Security tests
PYTHONPATH=deer-flow/backend/packages/harness pytest \
backend/tests/test_security_sanitizer.py \
backend/tests/test_security_html_cleaner.py -q
Expected: 8 passed.
3.4 No-images verification
PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
import deerflow.community.searx.tools as t
assert hasattr(t, 'web_search_tool'), 'web_search_tool missing'
assert hasattr(t, 'web_fetch_tool'), 'web_fetch_tool missing'
assert not hasattr(t, 'image_search_tool'), 'image_search_tool must be removed'
from deerflow.community.searx.tools import _is_text_content_type
assert _is_text_content_type('text/html; charset=utf-8')
assert _is_text_content_type('application/json')
assert not _is_text_content_type('image/png')
assert not _is_text_content_type('application/octet-stream')
assert not _is_text_content_type('')
print('OK — no-images policy intact')
"
4. Adding a new web tool
- Implement it in
deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py. - Always sanitize external strings via
deerflow.security.sanitizer. - Always wrap the response with
wrap_untrusted_content(). - For HTML input, use
extract_secure_text()first. - Add a test to
backend/tests/that asserts the security delimiters are present in the tool output. - Update this document.
5. Re-enabling a native provider (don't)
If you really must:
- Replace the stub in
community/<provider>/tools.pywith a hardened wrapper (sanitize → delimiter, just like searx). - Move the matching test out of
tests/_disabled_native/. - Update this document and explain the threat-model change in your commit message.
6. Files touched (audit trail)
deer-flow/backend/packages/harness/deerflow/security/__init__.py (new)
deer-flow/backend/packages/harness/deerflow/security/content_delimiter.py (new)
deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py (new)
deer-flow/backend/packages/harness/deerflow/security/sanitizer.py (new, with newline-preserving fix)
deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py (new)
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py (new — web_search + web_fetch with Content-Type gate; image_search_tool intentionally absent)
deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py (new)
deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/tavily/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/exa/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/firecrawl/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/jina_ai/jina_client.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/tools.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/infoquest/infoquest_client.py (replaced with stub)
deer-flow/backend/packages/harness/deerflow/community/image_search/tools.py (replaced with stub)
deer-flow/backend/tests/_disabled_native/conftest.py (new — collect_ignore_glob)
deer-flow/backend/tests/_disabled_native/test_exa_tools.py (moved)
deer-flow/backend/tests/_disabled_native/test_firecrawl_tools.py (moved)
deer-flow/backend/tests/_disabled_native/test_jina_client.py (moved)
deer-flow/backend/tests/_disabled_native/test_infoquest_client.py (moved)
backend/packages/harness/deerflow/security/ (factory overlay, kept in sync)
backend/packages/harness/deerflow/community/searx/ (factory overlay, kept in sync)
backend/tests/test_security_sanitizer.py (factory tests)
backend/tests/test_security_html_cleaner.py (factory tests)
backend/tests/test_searx_tools.py (factory tests)
config.yaml (hardened runtime config, references only searx tools)
.env.example (template, no secrets)
HARDENING.md (this file)
docker/docker-compose.override.yaml (named bridge br-deerflow)
scripts/deerflow-firewall.sh (egress firewall up/down/status)
scripts/deerflow-firewall.nix (NixOS systemd unit snippet)