No-images policy: refuse non-text fetches, drop image_search_tool
Agents in this build are text-only researchers. Image, audio, video, and binary content has no role in the pipeline and only widens the attack surface (server-side image fetches, exfiltration via rendered img tags, etc.). The cleanest answer is to never load such content in the first place rather than maintain a domain allowlist that nobody can keep up to date.

- web_fetch_tool now uses httpx.AsyncClient.stream and inspects the Content-Type header BEFORE the body is read into memory. Only text/*, application/json, application/xml, application/xhtml+xml, application/ld+json, application/atom+xml, application/rss+xml are accepted; everything else (image/*, audio/*, video/*, octet-stream, pdf, font, missing header, ...) is refused with a wrap_untrusted error reply. The body bytes never enter the process for refused responses. The read budget is bounded to ~4x max_chars regardless.
- image_search_tool removed from deerflow.community.searx.tools (both the deer-flow runtime tree and the factory overlay). The function is gone, not stubbed: any tool.use referencing it will raise AttributeError at tool-loading time.
- config.yaml: image_search tool entry removed; the example allowed_tools list updated to drop image_search.
- HARDENING.md: new section 2.8 explains the policy and the frontend caveat (the LLM can still emit image markdown, which the user's browser would render; fixing that requires a separate frontend patch that is not yet implemented). Section 3.4 adds a verification snippet for the policy. The web_fetch entry in section 2.2 is updated to mention the streaming Content-Type gate.

Both source trees stay in sync.
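The gate described in the first bullet can be illustrated with a small, network-free sketch. It is not the code from `tools.py`, which this page does not show: the names `is_text_content_type` and `read_text_head` are hypothetical, the headers are modelled as a plain mapping with a lowercase key, and the body as an iterator of byte chunks (roughly what a streaming HTTP client hands back). The `wrap_untrusted` wrapping of the error reply is left out.

```python
from typing import Iterable, Mapping, Union

# Media types the commit message lists as acceptable besides text/*.
ALLOWED_EXACT = frozenset({
    "application/json",
    "application/xml",
    "application/xhtml+xml",
    "application/ld+json",
    "application/atom+xml",
    "application/rss+xml",
})


def is_text_content_type(header_value: str) -> bool:
    """Return True only for text/* or an explicitly allowed application type."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if not media_type:
        return False  # a missing or empty header is refused
    return media_type.startswith("text/") or media_type in ALLOWED_EXACT


def read_text_head(
    headers: Mapping[str, str],
    chunks: Iterable[bytes],
    max_chars: int,
) -> Union[dict, str]:
    """Check the header first, then read at most 4 * max_chars bytes of body."""
    # Assumption: the mapping uses a lowercase "content-type" key.
    content_type = headers.get("content-type", "")
    if not is_text_content_type(content_type):
        # The chunk iterator is never touched on this path.
        shown = content_type or "no Content-Type"
        return {"error": f"Refused: non-text response ({shown})"}
    budget = 4 * max_chars
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk[: budget - len(buf)])
        if len(buf) >= budget:
            break  # stop pulling chunks once the budget is spent
    return bytes(buf).decode("utf-8", errors="replace")[:max_chars]
```

Because the header check runs before the first chunk is requested, a refused response costs only the headers, and an accepted one is capped by the byte budget however large the body is.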
HARDENING.md (59 lines changed)
@@ -46,8 +46,9 @@ The hardening below is a port of the OpenClaw approach (`searx-scripts/`,
 LangChain `@tool` exports:
 
 - `web_search_tool(query, max_results=10)` — calls a private SearX instance, sanitizes title + content, wraps results in security delimiters
-- `web_fetch_tool(url, max_chars=10000)` — fetches URL, runs `extract_secure_text` then `sanitizer.sanitize`, wraps result
-- `image_search_tool(query, max_results=5)` — SearX `categories=images`, sanitized title/url/thumbnail, wrapped
+- `web_fetch_tool(url, max_chars=10000)` — streams response, refuses non-text Content-Type **before reading the body**, then runs `extract_secure_text` and `sanitizer.sanitize` over the head of the body, wraps result
+
+`image_search_tool` was removed on purpose — see section 2.8.
 
 Reads its config from `get_app_config().get_tool_config(<name>).model_extra`:
 `searx_url`, `max_results`, `max_chars`.
@@ -203,6 +204,40 @@ curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://192.168.3.1/
 curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://10.67.67.16/ # FAIL (blocked by 10/8 reject; .16 is not whitelisted)
 ```
 
+### 2.8 No-images policy
+
+Agents in this build are **text-only researchers**. They never need to
+fetch image, audio, video, or binary content, so the entire pipeline is
+hardened to refuse it:
+
+| Layer | What it does |
+|---|---|
+| `web_fetch_tool` | Streams the response and inspects the `Content-Type` header **before** reading the body. Anything that is not `text/*`, `application/json`, `application/xml`, `application/xhtml+xml`, `application/ld+json`, `application/atom+xml`, or `application/rss+xml` is refused with `wrap_untrusted_content({"error": "Refused: non-text response..."})`. The body bytes are never loaded into memory. |
+| `image_search_tool` | **Removed**. The function no longer exists in `deerflow/community/searx/tools.py`. Any `tool.use: deerflow.community.searx.tools:image_search_tool` in `config.yaml` would fail with an attribute error during tool loading. |
+| `config.yaml` | The `image_search` tool entry was deleted. Only `web_search` and `web_fetch` are registered in the `web` group. |
+
+**Why no allowlist?** A domain allowlist for image fetching would either
+be impossible to maintain (research touches new domains every day) or
+silently rot into a permanent allow-everything. Removing image fetching
+entirely is the only honest answer for a text-only research use case.
+
+**Frontend caveat:** the LLM can still emit `![alt](url)`
+markdown in its **answer**. If the deer-flow frontend renders that
+markdown, the **user's browser** (not the container!) will load the
+image and potentially leak referrer/timing data. The egress firewall
+on data-nuc does not see this traffic. Mitigations:
+
+1. Best: configure the frontend's markdown renderer to disable images,
+or replace `<img>` tags with a placeholder. **Not yet implemented in
+this repo** — needs a patch in the deer-flow frontend.
+2. Workaround: render answers in a CSP-restricted iframe with
+`img-src 'none'`.
+
+If you bring image fetching back, build a **separate** tool with an
+explicit per-call allowlist and a server-side image proxy that runs
+under the same egress firewall as the rest of the container. Do not
+relax `web_fetch_tool`'s Content-Type check.
+
 ## 3. Verification
 
 All checks below assume `PYTHONPATH=deer-flow/backend/packages/harness`.
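The frontend caveat in section 2.8 names "replace `<img>` tags with a placeholder" as a mitigation that is not yet implemented. As a rough illustration of that idea only, the sketch below rewrites answer text with two regular expressions. It is an assumption-laden stand-in, not the frontend patch the section asks for: `strip_images` and `PLACEHOLDER` are hypothetical names, the function is written in Python rather than in the frontend's renderer, and a regex pass does not cover reference-style images or targets containing parentheses.

```python
import re

# Inline Markdown images of the form ![alt](target).
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Raw HTML <img ...> tags, matched case-insensitively.
HTML_IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

PLACEHOLDER = "[image removed]"


def strip_images(answer: str) -> str:
    """Replace image markup in an answer string with a text placeholder."""
    answer = MD_IMAGE.sub(PLACEHOLDER, answer)
    return HTML_IMG.sub(PLACEHOLDER, answer)
```

Ordinary links such as `[text](url)` are left alone because the pattern requires the leading `!`. A renderer-level switch or the `img-src 'none'` CSP from the list above remains the more robust option, since it does not depend on pattern matching.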
@@ -244,6 +279,24 @@ PYTHONPATH=deer-flow/backend/packages/harness pytest \
 
 Expected: `8 passed`.
 
+### 3.4 No-images verification
+
+```bash
+PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
+import deerflow.community.searx.tools as t
+assert hasattr(t, 'web_search_tool'), 'web_search_tool missing'
+assert hasattr(t, 'web_fetch_tool'), 'web_fetch_tool missing'
+assert not hasattr(t, 'image_search_tool'), 'image_search_tool must be removed'
+from deerflow.community.searx.tools import _is_text_content_type
+assert _is_text_content_type('text/html; charset=utf-8')
+assert _is_text_content_type('application/json')
+assert not _is_text_content_type('image/png')
+assert not _is_text_content_type('application/octet-stream')
+assert not _is_text_content_type('')
+print('OK — no-images policy intact')
+"
+```
+
 ## 4. Adding a new web tool
 
 1. Implement it in `deer-flow/backend/packages/harness/deerflow/community/<name>/tools.py`.
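The `hasattr` check in section 3.4 matters because a `module:attribute` reference to a deleted function fails at lookup time, not at call time. The sketch below shows that mechanism with the standard library only. The deer-flow tool loader itself is not shown on this page, so `resolve_tool` is a hypothetical stand-in for however `tool.use` strings are resolved there.

```python
import importlib


def resolve_tool(spec: str):
    """Resolve a 'package.module:attribute' string to the named object."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    # getattr raises AttributeError when the attribute no longer exists,
    # which is the failure mode described for a removed function.
    return getattr(module, attr)
```

Under that assumption, a stale `config.yaml` entry pointing at `deerflow.community.searx.tools:image_search_tool` would stop tool loading with an `AttributeError` instead of silently registering a stub.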
@@ -273,7 +326,7 @@ deer-flow/backend/packages/harness/deerflow/security/html_cleaner.py (new)
 deer-flow/backend/packages/harness/deerflow/security/sanitizer.py (new, with newline-preserving fix)
 
 deer-flow/backend/packages/harness/deerflow/community/searx/__init__.py (new)
-deer-flow/backend/packages/harness/deerflow/community/searx/tools.py (new)
+deer-flow/backend/packages/harness/deerflow/community/searx/tools.py (new — web_search + web_fetch with Content-Type gate; image_search_tool intentionally absent)
 
 deer-flow/backend/packages/harness/deerflow/community/_disabled_native.py (new)
 deer-flow/backend/packages/harness/deerflow/community/ddg_search/tools.py (replaced with stub)