No-images policy: refuse non-text fetches, drop image_search_tool

Agents in this build are text-only researchers. Image, audio, video,
and binary content has no role in the pipeline and only widens the
attack surface (server-side image fetches, exfiltration via rendered
img tags, etc). The cleanest answer is to never load it in the first
place rather than maintain a domain allowlist that nobody can keep up
to date.

- web_fetch_tool now uses httpx.AsyncClient.stream and inspects the
  Content-Type header BEFORE the body is read into memory. Only text/*,
  application/json, application/xml, application/xhtml+xml,
  application/ld+json, application/atom+xml, application/rss+xml are
  accepted; everything else (image/*, audio/*, video/*, octet-stream,
  pdf, font, missing header, ...) is refused with a wrap_untrusted
  error reply. The body bytes never enter the process for refused
  responses. Read budget is bounded to ~4x max_chars regardless.

- image_search_tool removed from deerflow.community.searx.tools (both
  the deer-flow runtime tree and the factory overlay). The function is
  gone, not stubbed; any tool.use referencing it will raise
  AttributeError at tool-loading time.

- config.yaml: image_search tool entry removed; the example
  allowed_tools list updated to drop image_search.

- HARDENING.md: new section 2.8 explains the policy and the frontend
  caveat (the LLM can still emit ![](url) markdown which the user's
  browser would render; that requires a separate frontend patch that
  is not yet implemented). Section 3.4 adds a verification snippet for
  the policy. The web_fetch entry in section 2.2 is updated to mention
  the streaming Content-Type gate.

Both source trees stay in sync.
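
For readers who want to see the shape of the Content-Type gate described above, here is a minimal sketch. It is not the code from tools.py: the name `is_text_content_type_sketch` and the constant `ALLOWED_EXACT_TYPES` are hypothetical, and the real `_is_text_content_type` may normalize headers differently. The sketch only assumes the allowlist stated in this message and that a missing or empty header is refused.

```python
from typing import Optional

# Hypothetical constant: the exact media types the commit message lists
# as accepted in addition to text/*.
ALLOWED_EXACT_TYPES = frozenset({
    "application/json",
    "application/xml",
    "application/xhtml+xml",
    "application/ld+json",
    "application/atom+xml",
    "application/rss+xml",
})


def is_text_content_type_sketch(header_value: Optional[str]) -> bool:
    """Return True only if the Content-Type header names an allowed text type.

    Illustrative stand-in for _is_text_content_type; the real helper
    may differ in detail.
    """
    if not header_value:
        # A missing or empty header is refused, per the policy above.
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/") or media_type in ALLOWED_EXACT_TYPES
```

In a streaming client the predicate would run on the response headers before any body chunk is requested, so a refused response is closed without its body being read.
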
2026-04-12 15:59:55 +02:00
parent 4237f03a83
commit e510f975f6
4 changed files with 269 additions and 123 deletions
--- a/HARDENING.md
+++ b/HARDENING.md
@@ -46,8 +46,9 @@ The hardening below is a port of the OpenClaw approach (`searx-scripts/`,
 LangChain `@tool` exports:

 - `web_search_tool(query, max_results=10)` — calls a private SearX instance, sanitizes title + content, wraps results in security delimiters
- `web_fetch_tool(url, max_chars=10000)` — fetches URL, runs `extract_secure_text` then `sanitizer.sanitize`, wraps result
- `image_search_tool(query, max_results=5)` — SearX `categories=images`, sanitized title/url/thumbnail, wrapped
+- `web_fetch_tool(url, max_chars=10000)` — streams response, refuses non-text Content-Type **before reading the body**, then runs `extract_secure_text` and `sanitizer.sanitize` over the head of the body, wraps result
+
+`image_search_tool` was removed on purpose — see section 2.8.

 Reads its config from `get_app_config().get_tool_config(<name>).model_extra`:
 `searx_url`, `max_results`, `max_chars`.
@@ -203,6 +204,40 @@ curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://192.168.3.1/
 curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://10.67.67.16/         # FAIL (blocked by 10/8 reject; .16 is not whitelisted)
 ```

+### 2.8 No-images policy
+
+Agents in this build are **text-only researchers**. They never need to
+fetch image, audio, video, or binary content, so the entire pipeline is
+hardened to refuse it:
+
+| Layer | What it does |
+|---|---|
+| `web_fetch_tool` | Streams the response and inspects the `Content-Type` header **before** reading the body. Anything that is not `text/*`, `application/json`, `application/xml`, `application/xhtml+xml`, `application/ld+json`, `application/atom+xml`, or `application/rss+xml` is refused with `wrap_untrusted_content({"error": "Refused: non-text response..."})`. The body bytes are never loaded into memory. |
+| `image_search_tool` | **Removed**. The function no longer exists in `deerflow/community/searx/tools.py`. Any `tool.use: deerflow.community.searx.tools:image_search_tool` in `config.yaml` would fail with an attribute error during tool loading. |
+| `config.yaml` | The `image_search` tool entry was deleted. Only `web_search` and `web_fetch` are registered in the `web` group. |
+
+**Why no allowlist?** A domain allowlist for image fetching would either
+be impossible to maintain (research touches new domains every day) or
+silently rot into a permanent allow-everything. Removing image fetching
+entirely is the only honest answer for a text-only research use case.
+
+**Frontend caveat:** the LLM can still emit `![alt](https://...)`
+markdown in its **answer**. If the deer-flow frontend renders that
+markdown, the **user's browser** (not the container!) will load the
+image and potentially leak referrer/timing data. The egress firewall
+on data-nuc does not see this traffic. Mitigations:
+
+1. Best: configure the frontend's markdown renderer to disable images,
+   or replace `<img>` tags with a placeholder. **Not yet implemented in
+   this repo** — needs a patch in the deer-flow frontend.
+2. Workaround: render answers in a CSP-restricted iframe with
+   `img-src 'none'`.
+
+If you bring image fetching back, build a **separate** tool with an
+explicit per-call allowlist and a server-side image proxy that runs
+under the same egress firewall as the rest of the container. Do not
+relax `web_fetch_tool`'s Content-Type check.
+
 ## 3. Verification

 All checks below assume `PYTHONPATH=deer-flow/backend/packages/harness`.
@@ -244,6 +279,24 @@ PYTHONPATH=deer-flow/backend/packages/harness pytest \

 Expected: `8 passed`.

+### 3.4 No-images verification
+
+```bash
+PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
+import deerflow.community.searx.tools as t
+assert hasattr(t, 'web_search_tool'), 'web_search_tool missing'
+assert hasattr(t, 'web_fetch_tool'),  'web_fetch_tool missing'
+assert not hasattr(t, 'image_search_tool'), 'image_search_tool must be removed'
+from deerflow.community.searx.tools import _is_text_content_type
+assert     _is_text_content_type('text/html; charset=utf-8')
+assert     _is_text_content_type('application/json')
+assert not _is_text_content_type('image/png')
+assert not _is_text_content_type('application/octet-stream')
+assert not _is_text_content_type('')
+print('OK — no-images policy intact')
+"
+```
+
 ## 4. Adding a new web tool

 1. Implement it in `deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py`.
@@ -273,7 +326,7 @@ deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py      (new)
 deer-flow/backend/packages/harness/deerflow/security/sanitizer.py         (new, with newline-preserving fix)

 deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py   (new)
-deer-flow/backend/packages/harness/deerflow/community/searx/tools.py      (new)
+deer-flow/backend/packages/harness/deerflow/community/searx/tools.py      (new — web_search + web_fetch with Content-Type gate; image_search_tool intentionally absent)

 deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py        (new)
 deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py        (replaced with stub)