No-images policy: refuse non-text fetches, drop image_search_tool

Agents in this build are text-only researchers. Image, audio, video,
and binary content has no role in the pipeline and only widens the
attack surface (server-side image fetches, exfiltration via rendered
img tags, etc.). The cleanest answer is to never load it in the first
place rather than maintain a domain allowlist that nobody can keep
up to date.

- web_fetch_tool now uses httpx.AsyncClient.stream and inspects the
  Content-Type header BEFORE the body is read into memory. Only
  text/*, application/json, application/xml, application/xhtml+xml,
  application/ld+json, application/atom+xml, application/rss+xml are
  accepted; everything else (image/*, audio/*, video/*, octet-stream,
  pdf, font, missing header, ...) is refused with a wrap_untrusted
  error reply. The body of a refused response is never read by the
  tool. For accepted responses, the read is capped at ~4x max_chars.

- image_search_tool removed from deerflow.community.searx.tools
  (both the deer-flow runtime tree and the factory overlay). The
  function is gone, not stubbed — any tool.use referencing it will
  raise AttributeError at tool-loading time.

- config.yaml: image_search tool entry removed; the example
  allowed_tools list updated to drop image_search.

- HARDENING.md: new section 2.8 explains the policy and the frontend
  caveat (the LLM can still emit ![](url) markdown which the user's
  browser would render — that requires a separate frontend patch
  that is not yet implemented). Section 3.4 adds a verification
  snippet for the policy. The web_fetch entry in section 2.2 is
  updated to mention the streaming Content-Type gate.

Both source trees stay in sync.
2026-04-12 15:59:55 +02:00
parent 4237f03a83
commit e510f975f6
4 changed files with 269 additions and 123 deletions


@@ -46,8 +46,9 @@ The hardening below is a port of the OpenClaw approach (`searx-scripts/`,
LangChain `@tool` exports:
- `web_search_tool(query, max_results=10)` — calls a private SearX instance, sanitizes title + content, wraps results in security delimiters
- `web_fetch_tool(url, max_chars=10000)` — fetches URL, runs `extract_secure_text` then `sanitizer.sanitize`, wraps result
- `image_search_tool(query, max_results=5)` — SearX `categories=images`, sanitized title/url/thumbnail, wrapped
- `web_fetch_tool(url, max_chars=10000)` — streams response, refuses non-text Content-Type **before reading the body**, then runs `extract_secure_text` and `sanitizer.sanitize` over the head of the body, wraps result
`image_search_tool` was removed on purpose — see section 2.8.
Reads its config from `get_app_config().get_tool_config(<name>).model_extra`:
`searx_url`, `max_results`, `max_chars`.
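
The "head of the body" behaviour of `web_fetch_tool` can be pictured as a bounded read over response chunks. The sketch below is not the tool's code: `read_head`, `budget_bytes`, and `fake_body` are hypothetical names, and the chunk source stands in for something like `response.aiter_bytes()` on an `httpx` streaming response. It only shows how reading can stop once a budget (for example roughly 4x `max_chars`) is reached.

```python
import asyncio
from typing import AsyncIterator


async def read_head(chunks: AsyncIterator[bytes], budget_bytes: int) -> bytes:
    """Collect at most budget_bytes from an async stream of byte chunks."""
    head = bytearray()
    async for chunk in chunks:
        remaining = budget_bytes - len(head)
        if remaining <= 0:
            break
        # Keep only as much of this chunk as the budget still allows.
        head.extend(chunk[:remaining])
        if len(head) >= budget_bytes:
            break  # stop pulling further chunks from the stream
    return bytes(head)


async def fake_body() -> AsyncIterator[bytes]:
    # Stand-in for a network stream: yields three 8-byte chunks.
    for _ in range(3):
        yield b"abcdefgh"


max_chars = 5  # hypothetical tiny limit, for the demo only
head = asyncio.run(read_head(fake_body(), budget_bytes=4 * max_chars))
print(len(head))
```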
@@ -203,6 +204,40 @@ curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://192.168.3.1/
curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://10.67.67.16/ # FAIL (blocked by 10/8 reject; .16 is not whitelisted)
```
### 2.8 No-images policy

Agents in this build are **text-only researchers**. They never need to
fetch image, audio, video, or binary content, so the entire pipeline is
hardened to refuse it:

| Layer | What it does |
|---|---|
| `web_fetch_tool` | Streams the response and inspects the `Content-Type` header **before** reading the body. Anything that is not `text/*`, `application/json`, `application/xml`, `application/xhtml+xml`, `application/ld+json`, `application/atom+xml`, or `application/rss+xml` is refused with `wrap_untrusted_content({"error": "Refused: non-text response..."})`. The body bytes are never loaded into memory. |
| `image_search_tool` | **Removed**. The function no longer exists in `deerflow/community/searx/tools.py`. Any `tool.use: deerflow.community.searx.tools:image_search_tool` in `config.yaml` would fail with an attribute error during tool loading. |
| `config.yaml` | The `image_search` tool entry was deleted. Only `web_search` and `web_fetch` are registered in the `web` group. |

**Why no allowlist?** A domain allowlist for image fetching would either
be impossible to maintain (research touches new domains every day) or
silently rot into a permanent allow-everything. Removing image fetching
entirely is the only honest answer for a text-only research use case.

**Frontend caveat:** the LLM can still emit `![alt](https://...)`
markdown in its **answer**. If the deer-flow frontend renders that
markdown, the **user's browser** (not the container!) will load the
image and potentially leak referrer/timing data. The egress firewall
on data-nuc does not see this traffic. Mitigations:

1. Best: configure the frontend's markdown renderer to disable images,
or replace `<img>` tags with a placeholder. **Not yet implemented in
this repo** — needs a patch in the deer-flow frontend.
2. Workaround: render answers in a CSP-restricted iframe with
`img-src 'none'`.
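
The placeholder idea in mitigation 1 can be illustrated in a few lines. The sketch below is written in Python only to stay consistent with the rest of this document; the real fix belongs in the frontend's markdown renderer. `strip_images` and the placeholder text are hypothetical, and the patterns cover inline `![alt](url)` images and raw `<img>` tags only, not reference-style images.

```python
import re

# Inline markdown image syntax: ![alt](target)
MD_IMAGE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
# Raw HTML image tags, matched case-insensitively.
HTML_IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(markdown: str, placeholder: str = "[image removed]") -> str:
    """Replace inline markdown images and <img> tags with a placeholder."""
    without_md = MD_IMAGE.sub(placeholder, markdown)
    return HTML_IMG.sub(placeholder, without_md)


answer = "See ![chart](https://example.com/c.png) and <img src='x.gif'>."
print(strip_images(answer))
```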
If you bring image fetching back, build a **separate** tool with an
explicit per-call allowlist and a server-side image proxy that runs
under the same egress firewall as the rest of the container. Do not
relax `web_fetch_tool`'s Content-Type check.
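
As a reference point when reviewing changes to that check, the accept rule from the table in this section can be sketched as a small predicate. This is an illustration, not the code in `tools.py`: the names `is_text_content_type_sketch` and `ALLOWED_EXACT` are hypothetical, and lower-casing the media type before comparing is an assumption of this sketch.

```python
# Non-text/* media types that the table lists as accepted.
ALLOWED_EXACT = frozenset({
    "application/json",
    "application/xml",
    "application/xhtml+xml",
    "application/ld+json",
    "application/atom+xml",
    "application/rss+xml",
})


def is_text_content_type_sketch(header_value: str) -> bool:
    """Return True only for media types on the text allowlist."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if not media_type:
        return False  # a missing or empty header is refused
    return media_type.startswith("text/") or media_type in ALLOWED_EXACT


for value in ("text/html; charset=utf-8", "image/png", ""):
    print(repr(value), is_text_content_type_sketch(value))
```

The default is refusal: anything the predicate does not explicitly recognise, including an empty header, is treated as non-text.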
## 3. Verification
All checks below assume `PYTHONPATH=deer-flow/backend/packages/harness`.
@@ -244,6 +279,24 @@ PYTHONPATH=deer-flow/backend/packages/harness pytest \
Expected: `8 passed`.
### 3.4 No-images verification
```bash
PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
import deerflow.community.searx.tools as t
assert hasattr(t, 'web_search_tool'), 'web_search_tool missing'
assert hasattr(t, 'web_fetch_tool'), 'web_fetch_tool missing'
assert not hasattr(t, 'image_search_tool'), 'image_search_tool must be removed'
from deerflow.community.searx.tools import _is_text_content_type
assert _is_text_content_type('text/html; charset=utf-8')
assert _is_text_content_type('application/json')
assert not _is_text_content_type('image/png')
assert not _is_text_content_type('application/octet-stream')
assert not _is_text_content_type('')
print('OK — no-images policy intact')
"
```
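
The `hasattr` check above mirrors what a `module:attribute` reference such as `deerflow.community.searx.tools:image_search_tool` runs into at load time. How deer-flow resolves `tool.use` internally is not shown in this document, so `resolve_tool` below is a hypothetical stand-in built on `importlib`, demonstrated against the standard-library `json` module so that it runs anywhere.

```python
import importlib


def resolve_tool(reference: str):
    """Resolve a 'package.module:attribute' reference to a Python object."""
    module_path, _, attr_name = reference.partition(":")
    module = importlib.import_module(module_path)
    # getattr raises AttributeError when the attribute does not exist,
    # which is what happens to a reference to a removed function.
    return getattr(module, attr_name)


print(resolve_tool("json:dumps")([1, 2]))
try:
    resolve_tool("json:image_search_tool")
except AttributeError as exc:
    print("refused at load time:", exc)
```

Because the lookup fails before any tool runs, a stale `image_search` entry in `config.yaml` would surface as a startup error rather than as a silently missing tool.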
## 4. Adding a new web tool
1. Implement it in `deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py`.
@@ -273,7 +326,7 @@ deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py (new)
deer-flow/backend/packages/harness/deerflow/security/sanitizer.py (new, with newline-preserving fix)
deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py (new)
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py (new)
deer-flow/backend/packages/harness/deerflow/community/searx/tools.py (new — web_search + web_fetch with Content-Type gate; image_search_tool intentionally absent)
deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py (new)
deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py (replaced with stub)