# CLAUDE.md — orientation for future sessions

You are in **deerflow-factory**, a hardened deployment of
[bytedance/deer-flow](https://github.com/bytedance/deer-flow) on data-nuc.
Read this file before touching anything in this repo.

## Read these in order

1. **HARDENING.md**: threat model, what was changed, why, and how to
   verify. The full audit trail. Update it whenever you change anything
   listed in section 6 ("Files touched").
2. **RUN.md**: how to start/stop/inspect the stack, smoke-test commands.
3. **DEERFLOW_PROMPT_INJECTION_PROTECTION_PLAN.md**: the original plan
   that the hardening implements. Historical context.

## Layout

```
deerflow-factory/
├── deer-flow/                        ← vendored upstream (no nested .git!)
│   └── backend/packages/harness/deerflow/
│       ├── security/                 ← hardened content sanitizer (lives here!)
│       ├── community/searx/          ← hardened web tools (lives here!)
│       └── community/<native>/       ← stubbed, raise on import
├── backend/                          ← factory overlay (mirror, kept in sync)
│   └── packages/harness/deerflow/
│       ├── security/                 ← duplicated source-of-truth
│       └── community/searx/          ← duplicated source-of-truth
├── docker/
│   └── docker-compose.override.yaml  ← named bridge br-deerflow
├── scripts/
│   ├── deerflow-firewall.sh          ← egress firewall up/down/status
│   └── deerflow-firewall.nix         ← NixOS module (imported by /etc/nixos/configuration.nix)
├── config.yaml                       ← runtime config, only references searx tools
├── .env                              ← real secrets, .gitignored
├── .env.example                      ← template
├── HARDENING.md
├── RUN.md
└── CLAUDE.md                         ← you are here
```

## Hard rules (do not violate without explicit user approval)

1. **Native web tools stay disabled.** The legacy providers
   (`ddg_search`, `tavily`, `exa`, `firecrawl`, `jina_ai`, `infoquest`,
   `image_search`) and the matching helper clients (`jina_client.py`,
   `infoquest_client.py`) are intentionally replaced with import-time
   `RuntimeError` stubs. Re-enabling **any** of them requires:
   - hardening the call site (sanitize → wrap_untrusted_content)
   - moving the matching test out of `tests/_disabled_native/`
   - updating HARDENING.md sections 2.3, 2.4, and 6
   - explicit user sign-off in the same conversation

2. **All web output must be sanitized and delimited.** Any new code path
   that returns external data to the model **must** route through
   `deerflow.security.sanitizer.sanitize()` and
   `deerflow.security.content_delimiter.wrap_untrusted_content()`.
   The whole point of this build is that the LLM never sees raw web
   bytes.

3. **No secrets in git.** `.env` is listed in `.gitignore`. Before committing,
   check the staged diff with `git diff --cached | grep -iE 'api_key|secret|token|password'`.
   Use `.env.example` for templates only: placeholders, never live keys.

4. **Two source trees, one truth.** The hardened code lives in **both**
   `deer-flow/backend/packages/harness/deerflow/{security,community/searx}/`
   (the runtime path) **and** `backend/packages/harness/deerflow/...`
   (the factory overlay used by the standalone tests). They must stay
   identical. If you fix a bug in one, mirror it to the other in the
   same commit, or delete one of the two trees and pick a single source
   of truth.

5. **The egress firewall is part of the threat model.** Do not change
   `scripts/deerflow-firewall.sh` allow/block lists without updating
   HARDENING.md section 2.7. Specifically:
   - allow: `10.67.67.1` (Searx), `10.67.67.2` (XTTS/Whisper/Ollama-local)
   - block: `192.168.3.0/24` (home LAN), `10.0.0.0/8`, `172.16.0.0/12`

6. **deer-flow is vendored, not a submodule.** The upstream `.git` was
   removed and is parked at `/tmp/deer-flow-upstream.git.bak` on
   data-nuc. If you need to pull upstream changes, do it in a separate
   working copy and rebase manually. Do not re-introduce a nested `.git`
   into this repo.

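
Hard rule 2 can be pictured as a single choke point that every tool result passes through. The sketch below is self-contained, so `sanitize` and `wrap_untrusted_content` are simplified, hypothetical stand-ins and not the real functions in `deerflow.security`. Their behaviour (stripping control characters, the `<untrusted>` tag) and the helper name `harden_tool_output` are assumptions made for illustration only.

```python
import re


def sanitize(text: str) -> str:
    """Stand-in sanitizer (assumption): drop control and zero-width characters."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\u200b-\u200f\u2060]", "", text)


def wrap_untrusted_content(text: str) -> str:
    """Stand-in delimiter (assumption): mark the payload as untrusted data."""
    return f"<untrusted>\n{text}\n</untrusted>"


def harden_tool_output(raw: str) -> str:
    """Hypothetical choke point: sanitize first, then wrap (the order from rule 1)."""
    return wrap_untrusted_content(sanitize(raw))


if __name__ == "__main__":
    print(harden_tool_output("search result\u200b with hidden bytes\x00"))
```

A new web tool would return `harden_tool_output(raw)` and never `raw` itself. In the real tree, the two stand-ins would be replaced by the imports named in rule 2.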
## Where things run

- **Host:** data-nuc (NixOS 25.11, kernel 6.12.x). The `data` user is in
  the `docker` group and can use `docker compose` directly.
- **Repo path:** `/home/data/deerflow-factory`
- **Gitea remote:** `https://git.beerbandit.de/DATA/deerflow-factory`
  (credentials in `~/.git-credentials` for user `data`)
- **Egress firewall:** `systemctl status deerflow-firewall`
  - active = rules in DOCKER-USER, applied to `br-deerflow`
  - inactive = rules removed (no firewall)
- **DeerFlow stack:** not yet running as of the initial commit of this
  CLAUDE.md. First start: see RUN.md.

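
The allow and block lists in hard rule 5 overlap: both allowed hosts sit inside the blocked `10.0.0.0/8` range, so the allow entries presumably have to match before the block entries. The sketch below models that first-match logic with Python's `ipaddress` module. It is an illustration only: the function name `egress_verdict` and the evaluation order are assumptions, and the authoritative rules are whatever `scripts/deerflow-firewall.sh` installs.

```python
import ipaddress

# Lists copied from hard rule 5; the evaluation order is an assumption.
ALLOW = [ipaddress.ip_network(n) for n in ("10.67.67.1/32", "10.67.67.2/32")]
BLOCK = [
    ipaddress.ip_network(n)
    for n in ("192.168.3.0/24", "10.0.0.0/8", "172.16.0.0/12")
]


def egress_verdict(dest: str) -> str:
    """Hypothetical first-match model: allow entries win over block entries."""
    addr = ipaddress.ip_address(dest)
    if any(addr in net for net in ALLOW):
        return "allow"
    if any(addr in net for net in BLOCK):
        return "block"
    return "unlisted"  # not covered by either list in this document


if __name__ == "__main__":
    for dest in ("10.67.67.1", "10.67.67.3", "192.168.3.10", "1.1.1.1"):
        print(dest, egress_verdict(dest))
```

A model like this is handy when proposing a change to either list: it shows at a glance which neighbouring addresses would flip from blocked to allowed.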
## Commit / push style

- Imperative subject, present-tense body. Reference HARDENING.md
  sections by number when you change something they describe.
- Do not amend or force-push without asking. Add a follow-up commit.
- Pre-commit secret check:
  ```bash
  git diff --cached --name-only | xargs -I{} grep -lE \
    'api_key|secret_key|sk-[a-zA-Z0-9]{20,}|ghp_|tvly-' {} 2>/dev/null
  ```
  Only `.env.example` should appear. If anything else does, abort.

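
To see what the pre-commit pattern does and does not catch, the same regular expression can be exercised in Python. This is a sketch for reasoning about the pattern, not a replacement for the shell check. The function name `find_secret_hits` and the sample strings are made up for illustration.

```python
import re

# Same alternation as the grep -E pattern above (case-sensitive, like grep without -i).
SECRET_RE = re.compile(r"api_key|secret_key|sk-[a-zA-Z0-9]{20,}|ghp_|tvly-")


def find_secret_hits(text: str) -> list[str]:
    """Return every substring of `text` that the pattern matches."""
    return SECRET_RE.findall(text)


if __name__ == "__main__":
    samples = [
        "TAVILY_API_KEY=tvly-placeholder",  # made-up sample line
        "api_key: changeme",                # made-up sample line
        "sk-short",                         # fewer than 20 characters after sk-
    ]
    for line in samples:
        print(repr(line), find_secret_hits(line))
```

The pattern is case-sensitive, so an upper-case variable name such as `API_KEY=` is not flagged on its own. Only the lower-case names and the key prefixes are.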
## Quick verification (run before declaring "it works")

From `/home/data/deerflow-factory`:

```bash
PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
# 1. hardened modules import
from deerflow.security.content_delimiter import wrap_untrusted_content
from deerflow.security.sanitizer import sanitizer
from deerflow.security.html_cleaner import extract_secure_text
import importlib.util
assert importlib.util.find_spec('deerflow.community.searx.tools') is not None

# 2. native modules fail closed
for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
    try:
        __import__(f'deerflow.community.{prov}.tools')
        raise SystemExit(f'FAIL: {prov} imported')
    except RuntimeError as e:
        assert 'disabled in this hardened DeerFlow build' in str(e)
print('OK')
"

# 3. security tests
PYTHONPATH=deer-flow/backend/packages/harness pytest \
  backend/tests/test_security_sanitizer.py \
  backend/tests/test_security_html_cleaner.py -q
# expected: 8 passed

# 4. firewall service
systemctl is-active deerflow-firewall
sudo scripts/deerflow-firewall.sh status
```

If any of these fail and you cannot fix them in the same session, stop
and report. Do not paper over the failure.

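
The fail-closed check in step 2 relies on each disabled provider raising at import time. A minimal sketch of that pattern follows, assuming a stub is simply a module whose only statement is a `raise`. The module name `fake_provider` and the temp-directory setup are hypothetical scaffolding so the example runs on its own. Only the error-message fragment is taken from the check above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stub body; the real stubs may carry a longer message.
STUB_SOURCE = (
    "raise RuntimeError(\n"
    "    'fake_provider is disabled in this hardened DeerFlow build'\n"
    ")\n"
)


def import_fails_closed(module_name: str) -> bool:
    """Return True if importing `module_name` raises the expected RuntimeError."""
    try:
        importlib.import_module(module_name)
    except RuntimeError as exc:
        return "disabled in this hardened DeerFlow build" in str(exc)
    return False  # the import succeeded, so the provider is NOT disabled


def demo() -> bool:
    """Write a throwaway stub module (hypothetical name) and try to import it."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "fake_provider.py").write_text(STUB_SOURCE)
        sys.path.insert(0, tmp)
        try:
            return import_fails_closed("fake_provider")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("fake_provider", None)


if __name__ == "__main__":
    print(demo())
```

Raising at import time means a stray `import` anywhere in the codebase fails loudly. The provider cannot load quietly and wait to be called.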
## Common footguns

- **`pip` is not installed system-wide on NixOS.** If you need a Python
  dep for a one-off script, use `nix-shell -p python3Packages.<name>`
  or run inside a deer-flow `.venv` once it exists. Do not try
  `pip install --user`; it will fail.
- **`sudo` is passwordless for `data`.** Be careful: any `sudo` you run
  succeeds without a prompt. Double-check destructive commands.
- **NixOS rewrites `/etc/systemd/system/`.** Do not drop unit files in
  there directly; they will be wiped on `nixos-rebuild switch`. Add a
  `systemd.services.<name>` block to a Nix module instead (see
  `scripts/deerflow-firewall.nix` for the pattern).
- **The factory overlay (`backend/`) is currently a mirror, not the
  runtime.** When you import from Python at runtime, the path that
  matters is `deer-flow/backend/packages/harness`. The overlay only
  matters for the standalone factory tests. Keep them in sync until we
  pick one as canonical.
- **`docker compose down` does not remove the firewall rules.** That is
  by design. Only `systemctl stop deerflow-firewall` removes them.

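
Because the runtime tree and the overlay must stay identical, a small comparison script can catch drift before a commit. The sketch below uses the standard-library `filecmp` module. The helper name `trees_differ` is hypothetical, the two sub-paths are the ones named in hard rule 4, and the byte-level comparison (`shallow=False`) is a choice made here, not something the repo prescribes.

```python
import filecmp
from pathlib import Path

SUBDIRS = ("security", "community/searx")  # mirrored packages from hard rule 4
RUNTIME = Path("deer-flow/backend/packages/harness/deerflow")
OVERLAY = Path("backend/packages/harness/deerflow")


def trees_differ(left: Path, right: Path) -> list[str]:
    """Recursively list files that are missing on one side or differ in content."""
    cmp = filecmp.dircmp(left, right, ignore=["__pycache__"])
    problems = [f"only in {left}: {name}" for name in cmp.left_only]
    problems += [f"only in {right}: {name}" for name in cmp.right_only]
    # dircmp compares shallowly by default, so re-check file contents explicitly.
    _, mismatch, errors = filecmp.cmpfiles(
        left, right, cmp.common_files, shallow=False
    )
    problems += [f"differs: {left / name}" for name in mismatch + errors]
    for sub in cmp.common_dirs:
        problems += trees_differ(left / sub, right / sub)
    return problems


if __name__ == "__main__":
    for sub in SUBDIRS:
        if (RUNTIME / sub).is_dir() and (OVERLAY / sub).is_dir():
            print(sub, trees_differ(RUNTIME / sub, OVERLAY / sub) or "identical")
        else:
            print(sub, "skipped: run this from the repo root")
```

Run from `/home/data/deerflow-factory`, an empty result for both sub-paths means the mirror is intact. Any listed path needs to be reconciled in the same commit.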
## What this repo is NOT

- Not a fork on GitHub. The vendored upstream `.git` was removed on
  purpose. If you need to compare against upstream, clone it fresh into
  `/tmp/`.
- Not a Python package (yet). There is no `pyproject.toml` at the
  factory root; the Python entry point is the deer-flow tree's own
  `backend/pyproject.toml`. We only put files into its harness package.
- Not multi-tenant. There is exactly one DeerFlow instance, one Searx,
  one set of credentials. Keep it that way unless the user explicitly
  asks for tenancy.