deerflow-factory/CLAUDE.md
DATA 4237f03a83 Add CLAUDE.md project guide for future Claude Code sessions
Quick orientation: layout, hard rules (native tools stay disabled,
sanitize+wrap, no secrets, two trees in sync, firewall is part of the
threat model, deer-flow is vendored), where things run on data-nuc,
commit style, a one-page verification block, and the common NixOS /
docker / pip footguns to avoid.
2026-04-12 15:23:37 +02:00


# CLAUDE.md — orientation for future sessions
You are in **deerflow-factory**, a hardened deployment of
[bytedance/deer-flow](https://github.com/bytedance/deer-flow) on data-nuc.
Read this file before touching anything in this repo.
## Read these in order
1. **HARDENING.md** — threat model, what was changed, why, and how to
verify. The full audit trail. Update it whenever you change anything
listed in section 6 ("Files touched").
2. **RUN.md** — how to start/stop/inspect the stack, smoke-test commands.
3. **DEERFLOW_PROMPT_INJECTION_PROTECTION_PLAN.md** — the original plan
that the hardening implements. Historical context.
## Layout
```
deerflow-factory/
├── deer-flow/                         ← vendored upstream (no nested .git!)
│   └── backend/packages/harness/deerflow/
│       ├── security/                  ← hardened content sanitizer (lives here!)
│       ├── community/searx/           ← hardened web tools (lives here!)
│       └── community/<native>/        ← stubbed, raise on import
├── backend/                           ← factory overlay (mirror, kept in sync)
│   └── packages/harness/deerflow/
│       ├── security/                  ← duplicated source-of-truth
│       └── community/searx/           ← duplicated source-of-truth
├── docker/
│   └── docker-compose.override.yaml   ← named bridge br-deerflow
├── scripts/
│   ├── deerflow-firewall.sh           ← egress firewall up/down/status
│   └── deerflow-firewall.nix          ← NixOS module (imported by /etc/nixos/configuration.nix)
├── config.yaml                        ← runtime config — only references searx tools
├── .env                               ← real secrets, .gitignored
├── .env.example                       ← template
├── HARDENING.md
├── RUN.md
└── CLAUDE.md                          ← you are here
```
## Hard rules (do not violate without explicit user approval)
1. **Native web tools stay disabled.** The legacy providers
(`ddg_search`, `tavily`, `exa`, `firecrawl`, `jina_ai`, `infoquest`,
`image_search`) and the matching helper clients (`jina_client.py`,
`infoquest_client.py`) are intentionally replaced with import-time
`RuntimeError` stubs. Re-enabling **any** of them requires:
   - hardening the call site (sanitize → wrap_untrusted_content)
   - moving the matching test out of `tests/_disabled_native/`
   - updating HARDENING.md sections 2.3, 2.4, and 6
   - explicit user sign-off in the same conversation
2. **All web output must be sanitized and delimited.** Any new code path
that returns external data to the model **must** route through
`deerflow.security.sanitizer.sanitize()` and
`deerflow.security.content_delimiter.wrap_untrusted_content()`.
The whole point of this build is that the LLM never sees raw web
bytes.
3. **No secrets in git.** `.env` is listed in `.gitignore`. Before staging,
verify with `git diff --cached | grep -iE 'api_key|secret|token|password'`.
Use `.env.example` for templates only: placeholders, never live keys.
4. **Two source trees, one truth.** The hardened code lives in **both**
`deer-flow/backend/packages/harness/deerflow/{security,community/searx}/`
(the runtime path) **and** `backend/packages/harness/deerflow/...`
(the factory overlay used by the standalone tests). They must stay
identical. If you fix a bug in one, mirror it to the other in the
same commit, or delete one of the two trees and pick a single source
of truth.
5. **The egress firewall is part of the threat model.** Do not change
`scripts/deerflow-firewall.sh` allow/block lists without updating
HARDENING.md section 2.7. Specifically:
   - allow: `10.67.67.1` (Searx), `10.67.67.2` (XTTS/Whisper/Ollama-local)
   - block: `192.168.3.0/24` (home LAN), `10.0.0.0/8`, `172.16.0.0/12`
6. **deer-flow is vendored, not a submodule.** The upstream `.git` was
removed and is parked at `/tmp/deer-flow-upstream.git.bak` on
data-nuc. If you need to pull upstream changes, do it in a separate
working copy and rebase manually — do not re-introduce a nested git
into this repo.
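
Rule 2 fixes the order of operations: sanitize first, then delimit. The sketch below shows one way to funnel every tool result through a single chokepoint. The real signatures of `sanitize()` and `wrap_untrusted_content()` are not shown in this file, so the helper takes both as parameters; `harden_tool_output`, `demo_sanitize`, and `demo_wrap` are hypothetical stand-ins for illustration only.

```python
from typing import Callable


def harden_tool_output(
    raw: str,
    sanitize: Callable[[str], str],
    wrap: Callable[[str], str],
) -> str:
    """Sanitize external text, then wrap it in untrusted-content delimiters."""
    return wrap(sanitize(raw))


# Hypothetical stand-ins. In the repo you would pass the hardened functions
# from deerflow.security instead; their exact signatures are an assumption.
def demo_sanitize(text: str) -> str:
    # Drop non-printable characters, keeping newlines and tabs.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def demo_wrap(text: str) -> str:
    # Placeholder delimiter format, not the one used by content_delimiter.
    return f"<untrusted>\n{text}\n</untrusted>"


demo_result = harden_tool_output("hello\x00world", demo_sanitize, demo_wrap)
```

Keeping the composition in one helper makes it harder for a new tool to return text that skipped either step.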
## Where things run
- **Host:** data-nuc (NixOS 25.11, kernel 6.12.x). The `data` user is in
the `docker` group and can run `docker compose` directly.
- **Repo path:** `/home/data/deerflow-factory`
- **Gitea remote:** `https://git.beerbandit.de/DATA/deerflow-factory`
(credentials in `~/.git-credentials` for user `data`)
- **Egress firewall:** `systemctl status deerflow-firewall`
  - active = rules in DOCKER-USER, applied to `br-deerflow`
  - inactive = rules removed (no firewall)
- **DeerFlow stack:** not yet running as of this file's initial commit.
For the first start, see RUN.md.
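
The allowed Searx and XTTS addresses from hard rule 5 sit inside the blocked `10.0.0.0/8` range, so the order in which rules are evaluated matters. The sketch below models that with the standard `ipaddress` module. It assumes allow entries are checked before block entries; the actual rule order lives in `scripts/deerflow-firewall.sh` and is not shown here. `egress_verdict` is a hypothetical helper, and the `"unlisted"` result only means the address is in neither list.

```python
import ipaddress

# Lists copied from hard rule 5.
ALLOW = [ipaddress.ip_network(n) for n in ("10.67.67.1/32", "10.67.67.2/32")]
BLOCK = [
    ipaddress.ip_network(n)
    for n in ("192.168.3.0/24", "10.0.0.0/8", "172.16.0.0/12")
]


def egress_verdict(dest: str) -> str:
    """Classify a destination IP, checking allow entries first (assumed order)."""
    addr = ipaddress.ip_address(dest)
    if any(addr in net for net in ALLOW):
        return "allow"
    if any(addr in net for net in BLOCK):
        return "block"
    return "unlisted"
```

For example, `egress_verdict("10.67.67.1")` yields `"allow"` while its neighbour `10.67.67.3` falls through to the `10.0.0.0/8` block.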
## Commit / push style
- Imperative subject, present-tense body. Reference HARDENING.md
sections by number when you change something they describe.
- Do not amend or force-push without asking. Add a follow-up commit.
- Pre-commit secret check:
```bash
git diff --cached --name-only | xargs -I{} grep -lE \
'api_key|secret_key|sk-[a-zA-Z0-9]{20,}|ghp_|tvly-' {} 2>/dev/null
```
Only `.env.example` should appear. If anything else does, abort.
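
The same check can be sketched in Python, which is easier to reuse from a hook or a test. This is an illustration, not part of the repo: `unexpected_secret_hits` and the `ALLOWED` set are hypothetical, and the function takes a mapping of file name to content so it stays independent of how the staged file list is obtained.

```python
import re

# Same alternation as the grep above; case-sensitive, like grep -E without -i.
SECRET_PATTERN = re.compile(r"api_key|secret_key|sk-[a-zA-Z0-9]{20,}|ghp_|tvly-")

# Hypothetical allowlist: the only file that is expected to match.
ALLOWED = {".env.example"}


def unexpected_secret_hits(staged: dict[str, str]) -> list[str]:
    """Return staged file names that match the pattern and are not allowlisted."""
    return sorted(
        name
        for name, text in staged.items()
        if name not in ALLOWED and SECRET_PATTERN.search(text)
    )


# Made-up staged contents for demonstration.
demo_staged = {
    ".env.example": "api_key=changeme\n",
    "config.yaml": "tools:\n  - searx\n",
    "notes.txt": "token ghp_exampleplaceholder\n",
}
demo_hits = unexpected_secret_hits(demo_staged)
```

A non-empty result corresponds to the "abort" case described above.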
## Quick verification (run before declaring "it works")
From `/home/data/deerflow-factory`:
```bash
PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
# 1. hardened modules import
from deerflow.security.content_delimiter import wrap_untrusted_content
from deerflow.security.sanitizer import sanitizer
from deerflow.security.html_cleaner import extract_secure_text
import importlib.util
assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
# 2. native modules fail closed
for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
    try:
        __import__(f'deerflow.community.{prov}.tools')
        raise SystemExit(f'FAIL: {prov} imported')
    except RuntimeError as e:
        assert 'disabled in this hardened DeerFlow build' in str(e)
print('OK')
"
# 3. security tests
PYTHONPATH=deer-flow/backend/packages/harness pytest \
backend/tests/test_security_sanitizer.py \
backend/tests/test_security_html_cleaner.py -q
# expected: 8 passed
# 4. firewall service
systemctl is-active deerflow-firewall
sudo scripts/deerflow-firewall.sh status
```
If any of these fail and you cannot fix them in the same session, stop
and report — do not paper over the failure.
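
Step 2 of the block above relies on the stubs raising at import time. The self-contained sketch below reproduces that fail-closed pattern with a throwaway module, so the check can be understood without the deer-flow tree on the path. The module name `demo_provider`, its one-line source, and the helper `import_fails_closed` are hypothetical; only the error-message fragment is taken from the verification block.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MARKER = "disabled in this hardened DeerFlow build"

# Hypothetical stub: the whole module body is a single raise.
STUB_SOURCE = f'raise RuntimeError("demo_provider is {MARKER}")\n'


def import_fails_closed(module_name: str) -> bool:
    """True only if importing the module raises RuntimeError with the marker."""
    try:
        importlib.import_module(module_name)
    except RuntimeError as exc:
        return MARKER in str(exc)
    return False  # the import succeeded, so the module is not stubbed


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_provider.py").write_text(STUB_SOURCE)
    sys.path.insert(0, tmp)
    try:
        stub_ok = import_fails_closed("demo_provider")
        json_ok = import_fails_closed("json")  # a normal module imports fine
    finally:
        sys.path.remove(tmp)
```

Because the stub raises before defining anything, no partially usable module is left behind in `sys.modules`.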
## Common footguns
- **`pip` is not installed system-wide on NixOS.** If you need a Python
dep for a one-off script, use `nix-shell -p python3Packages.<name>`
or run inside a deer-flow `.venv` once it exists. Do not try
`pip install --user` — it will fail.
- **`sudo` is passwordless for `data`.** Be careful: any `sudo` you run
succeeds without a prompt. Double-check destructive commands.
- **NixOS rewrites `/etc/systemd/system/`.** Do not drop unit files in
there directly; they will be wiped on `nixos-rebuild switch`. Add a
`systemd.services.<name>` block to a Nix module instead (see
`scripts/deerflow-firewall.nix` for the pattern).
- **The factory overlay (`backend/`) is currently a mirror, not the
runtime.** When you import from Python at runtime, the path that
matters is `deer-flow/backend/packages/harness`. The overlay only
matters for the standalone factory tests. Keep them in sync until we
pick one as canonical.
- **`docker compose down` does not remove the firewall rules.** That is
by design. Only `systemctl stop deerflow-firewall` removes them.
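
Since the overlay and the runtime tree must stay identical, a content comparison is a cheap way to catch drift before committing. The sketch below hashes every file under two directories and reports paths that are missing on one side or differ. `tree_digest` and `tree_differences` are hypothetical helpers, not part of the repo, and the demo runs on temporary directories standing in for the two trees.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file() and "__pycache__" not in p.parts
    }


def tree_differences(left: Path, right: Path) -> list[str]:
    """List relative paths that are missing on one side or differ in content."""
    a, b = tree_digest(left), tree_digest(right)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Demo on throwaway directories standing in for the two trees.
with tempfile.TemporaryDirectory() as tmp:
    runtime, overlay = Path(tmp, "runtime"), Path(tmp, "overlay")
    for tree in (runtime, overlay):
        (tree / "security").mkdir(parents=True)
        (tree / "security" / "sanitizer.py").write_text("VERSION = 1\n")
    in_sync = tree_differences(runtime, overlay)
    (runtime / "security" / "sanitizer.py").write_text("VERSION = 2\n")
    drifted = tree_differences(runtime, overlay)
```

In the repo, the two arguments would be the `security` (and likewise `community/searx`) directories under `deer-flow/backend/packages/harness/deerflow/` and `backend/packages/harness/deerflow/`; an empty list means the trees match.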
## What this repo is NOT
- Not a fork on GitHub. The vendored upstream `.git` was removed on
purpose (see hard rule 6). If you need to compare against upstream,
clone it fresh into `/tmp/`.
- Not a Python package (yet). There is no `pyproject.toml` at the
factory root; the Python entry point is the deer-flow tree's own
`backend/pyproject.toml`. We only put files into its harness package.
- Not multi-tenant. There is exactly one DeerFlow instance, one Searx,
one set of credentials. Keep it that way unless the user explicitly
asks for tenancy.