Add CLAUDE.md project guide for future Claude Code sessions

Quick orientation: layout, hard rules (native tools stay disabled, sanitize+wrap, no secrets, two trees in sync, firewall is part of the threat model, deer-flow is vendored), where things run on data-nuc, commit style, a one-page verification block, and the common NixOS / docker / pip footguns to avoid.
2026-04-12 15:23:37 +02:00
parent 7f3f9bff6e
commit 4237f03a83
1 changed files with 177 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,177 @@
+# CLAUDE.md — orientation for future sessions
+
+You are in **deerflow-factory**, a hardened deployment of
+[bytedance/deer-flow](https://github.com/bytedance/deer-flow) on data-nuc.
+Read this file before touching anything in this repo.
+
+## Read these in order
+
+1. **HARDENING.md** — threat model, what was changed, why, and how to
+   verify. The full audit trail. Update it whenever you change anything
+   listed in section 6 ("Files touched").
+2. **RUN.md** — how to start/stop/inspect the stack, smoke-test commands.
+3. **DEERFLOW_PROMPT_INJECTION_PROTECTION_PLAN.md** — the original plan
+   that the hardening implements. Historical context.
+
+## Layout
+
+```
+deerflow-factory/
+├── deer-flow/                    ← vendored upstream (no nested .git!)
+│   └── backend/packages/harness/deerflow/
+│       ├── security/             ← hardened content sanitizer  (lives here!)
+│       ├── community/searx/      ← hardened web tools           (lives here!)
+│       └── community/<native>/   ← stubbed, raise on import
+├── backend/                      ← factory overlay (mirror, kept in sync)
+│   └── packages/harness/deerflow/
+│       ├── security/             ← duplicated source-of-truth
+│       └── community/searx/      ← duplicated source-of-truth
+├── docker/
+│   └── docker-compose.override.yaml  ← named bridge br-deerflow
+├── scripts/
+│   ├── deerflow-firewall.sh      ← egress firewall up/down/status
+│   └── deerflow-firewall.nix     ← NixOS module (imported by /etc/nixos/configuration.nix)
+├── config.yaml                    ← runtime config — only references searx tools
+├── .env                           ← real secrets, .gitignored
+├── .env.example                   ← template
+├── HARDENING.md
+├── RUN.md
+└── CLAUDE.md                      ← you are here
+```
+
+## Hard rules (do not violate without explicit user approval)
+
+1. **Native web tools stay disabled.** The legacy providers
+   (`ddg_search`, `tavily`, `exa`, `firecrawl`, `jina_ai`, `infoquest`,
+   `image_search`) and the matching helper clients (`jina_client.py`,
+   `infoquest_client.py`) are intentionally replaced with import-time
+   `RuntimeError` stubs. Re-enabling **any** of them requires:
+   - hardening the call site (sanitize → wrap_untrusted_content)
+   - moving the matching test out of `tests/_disabled_native/`
+   - updating HARDENING.md sections 2.3, 2.4, and 6
+   - explicit user sign-off in the same conversation
+
+2. **All web output must be sanitized and delimited.** Any new code path
+   that returns external data to the model **must** route through
+   `deerflow.security.sanitizer.sanitize()` and
+   `deerflow.security.content_delimiter.wrap_untrusted_content()`.
+   The whole point of this build is that the LLM never sees raw web
+   bytes.
+
+3. **No secrets in git.** `.env` is listed in `.gitignore`. After staging
+   and before committing, verify with
+   `git diff --cached | grep -iE 'api_key|secret|token|password'`.
+   Use `.env.example` for templates only — placeholders, never live keys.
+
+4. **Two source trees, one truth.** The hardened code lives in **both**
+   `deer-flow/backend/packages/harness/deerflow/{security,community/searx}/`
+   (the runtime path) **and** `backend/packages/harness/deerflow/...`
+   (the factory overlay used by the standalone tests). They must stay
+   identical. If you fix a bug in one, mirror it to the other in the
+   same commit, or delete one of the two trees and pick a single source
+   of truth.
+
+5. **The egress firewall is part of the threat model.** Do not change
+   `scripts/deerflow-firewall.sh` allow/block lists without updating
+   HARDENING.md section 2.7. Specifically:
+   - allow: `10.67.67.1` (Searx), `10.67.67.2` (XTTS/Whisper/Ollama-local)
+   - block: `192.168.3.0/24` (home LAN), `10.0.0.0/8`, `172.16.0.0/12`
+
+6. **deer-flow is vendored, not a submodule.** The upstream `.git` was
+   removed and is parked at `/tmp/deer-flow-upstream.git.bak` on
+   data-nuc. If you need to pull upstream changes, do it in a separate
+   working copy and rebase manually — do not re-introduce a nested git
+   into this repo.
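
Rule 5 has an ordering consequence: both allowed hosts sit inside the blocked `10.0.0.0/8`, so in a first-match chain such as DOCKER-USER the allow rules have to be evaluated before the block rules. The sketch below models that with Python's `ipaddress` module. It is an illustration only: the real rules live in `scripts/deerflow-firewall.sh`, which is not shown here, and both the allow-before-block order and the `unmatched` fallback are assumptions rather than a description of that script.

```python
import ipaddress

# Addresses and ranges copied from hard rule 5.
ALLOW = [ipaddress.ip_network(n) for n in ("10.67.67.1/32", "10.67.67.2/32")]
BLOCK = [
    ipaddress.ip_network(n)
    for n in ("192.168.3.0/24", "10.0.0.0/8", "172.16.0.0/12")
]


def verdict(dest: str) -> str:
    """Model first-match evaluation, ASSUMING allow rules precede block rules."""
    addr = ipaddress.ip_address(dest)
    if any(addr in net for net in ALLOW):
        return "allow"
    if any(addr in net for net in BLOCK):
        return "block"
    # The default policy for other destinations is not stated in this file.
    return "unmatched"


def overlapping_allows() -> list[str]:
    """Allow entries that also fall inside a block range (order-sensitive)."""
    return [str(a) for a in ALLOW if any(a.subnet_of(b) for b in BLOCK)]
```

Every entry returned by `overlapping_allows()` would be dropped if the block rules were matched first, which is why a reordering of the script deserves the same scrutiny as a change to the lists themselves.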
+
+## Where things run
+
+- **Host:** data-nuc (NixOS 25.11, kernel 6.12.x). The `data` user is in
+  the `docker` group and can run `docker compose` directly.
+- **Repo path:** `/home/data/deerflow-factory`
+- **Gitea remote:** `https://git.beerbandit.de/DATA/deerflow-factory`
+  (credentials in `~/.git-credentials` for user `data`)
+- **Egress firewall:** `systemctl status deerflow-firewall`
+  - active = rules in DOCKER-USER, applied to `br-deerflow`
+  - inactive = rules removed (no firewall)
+- **DeerFlow stack:** not yet running as of the initial commit of this
+  CLAUDE.md. For the first start, see RUN.md.
+
+## Commit / push style
+
+- Imperative subject, present-tense body. Reference HARDENING.md
+  sections by number when you change something they describe.
+- Do not amend or force-push without asking. Add a follow-up commit.
+- Pre-commit secret check:
+  ```bash
+  git diff --cached --name-only | xargs -I{} grep -lE \
+    'api_key|secret_key|sk-[a-zA-Z0-9]{20,}|ghp_|tvly-' {} 2>/dev/null
+  ```
+  Only `.env.example` should appear, plus `CLAUDE.md` itself whenever it
+  is staged, because it quotes the patterns. If anything else does, abort.
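
The same check can be sketched in Python, for example to reuse it from a hook. This mirrors the grep above (same pattern, case-sensitive, reading working-tree files rather than staged blobs). `has_secret_marker`, `offending_files`, and `ALLOWED` are hypothetical names, not part of the repo.

```python
import re
import subprocess
from pathlib import Path

# Same alternation as the grep -E pattern in the pre-commit check.
SECRET_RE = re.compile(r"api_key|secret_key|sk-[a-zA-Z0-9]{20,}|ghp_|tvly-")

# Files that are expected to match (assumption: matched by exact path).
ALLOWED = {".env.example"}


def has_secret_marker(text: str) -> bool:
    """True if the text contains any of the secret-like patterns."""
    return SECRET_RE.search(text) is not None


def staged_files() -> list[str]:
    """Paths reported by `git diff --cached --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def offending_files(paths: list[str]) -> list[str]:
    """Subset of paths whose contents match and that are not allow-listed."""
    hits = []
    for name in paths:
        if name in ALLOWED:
            continue
        path = Path(name)
        if not path.is_file():  # e.g. a file deleted in this commit
            continue
        if has_secret_marker(path.read_text(encoding="utf-8", errors="ignore")):
            hits.append(name)
    return hits
```

Run from the repo root, `offending_files(staged_files())` returning anything other than an empty list is the signal to stop and inspect the listed files.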
+
+## Quick verification (run before declaring "it works")
+
+From `/home/data/deerflow-factory`:
+
+```bash
+PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
+# 1. hardened modules import
+from deerflow.security.content_delimiter import wrap_untrusted_content
+from deerflow.security.sanitizer import sanitizer
+from deerflow.security.html_cleaner import extract_secure_text
+import importlib.util
+assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
+
+# 2. native modules fail closed
+for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
+    try:
+        __import__(f'deerflow.community.{prov}.tools')
+        raise SystemExit(f'FAIL: {prov} imported')
+    except RuntimeError as e:
+        assert 'disabled in this hardened DeerFlow build' in str(e)
+print('OK')
+"
+
+# 3. security tests
+PYTHONPATH=deer-flow/backend/packages/harness pytest \
+    backend/tests/test_security_sanitizer.py \
+    backend/tests/test_security_html_cleaner.py -q
+# expected: 8 passed
+
+# 4. firewall service
+systemctl is-active deerflow-firewall
+sudo scripts/deerflow-firewall.sh status
+```
+
+If any of these fail and you cannot fix them in the same session, stop
+and report — do not paper over the failure.
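
Step 2 above relies on the stubs failing closed at import time. A minimal sketch of that behavior, using a throwaway module written to a temp directory: `demo_provider_stub` and the stub body are hypothetical stand-ins (the real stubs are not shown in this file), and only the message fragment is taken from the check above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Message fragment asserted by the verification block.
MARKER = "disabled in this hardened DeerFlow build"

# Hypothetical stub body; the real stub modules may look different.
STUB_SOURCE = f'raise RuntimeError("demo_provider is {MARKER}")\n'


def fails_closed(module_name: str) -> bool:
    """True if importing module_name raises RuntimeError carrying MARKER."""
    try:
        importlib.import_module(module_name)
    except RuntimeError as exc:
        return MARKER in str(exc)
    return False  # the import succeeded, so the module is not disabled


def demo() -> bool:
    """Write a stub to a temp directory and confirm it refuses to import."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_provider_stub.py").write_text(STUB_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return fails_closed("demo_provider_stub")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_provider_stub", None)
```

Because the exception is raised at module level, the failure happens before any provider code can run, and a module that imports cleanly (for instance `json`) makes `fails_closed` return `False`.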
+
+## Common footguns
+
+- **`pip` is not installed system-wide on NixOS.** If you need a Python
+  dep for a one-off script, use `nix-shell -p python3Packages.<name>`
+  or run inside a deer-flow `.venv` once it exists. Do not try
+  `pip install --user` — it will fail.
+- **`sudo` is passwordless for `data`.** Be careful: any `sudo` you run
+  succeeds without a prompt. Double-check destructive commands.
+- **NixOS rewrites `/etc/systemd/system/`.** Do not drop unit files in
+  there directly; they will be wiped on `nixos-rebuild switch`. Add a
+  `systemd.services.<name>` block to a Nix module instead (see
+  `scripts/deerflow-firewall.nix` for the pattern).
+- **The factory overlay (`backend/`) is currently a mirror, not the
+  runtime.** When you import from Python at runtime, the path that
+  matters is `deer-flow/backend/packages/harness`. The overlay only
+  matters for the standalone factory tests. Keep them in sync until we
+  pick one as canonical.
+- **`docker compose down` does not remove the firewall rules.** That is
+  by design. Only `systemctl stop deerflow-firewall` removes them.
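
The mirror requirement (hard rule 4 and the overlay footgun above) can be checked mechanically. A minimal sketch, assuming both trees share the `security/` and `community/searx/` layout shown under Layout; `tree_differences` is a hypothetical helper, not something that exists in the repo.

```python
from pathlib import Path

# Subdirectories that hard rule 4 says must be identical in both trees.
MIRRORED = ("security", "community/searx")


def _files(root: Path) -> dict[str, Path]:
    """Map relative POSIX path -> file path for every file under root."""
    if not root.is_dir():
        return {}
    return {
        p.relative_to(root).as_posix(): p
        for p in root.rglob("*")
        if p.is_file() and "__pycache__" not in p.parts
    }


def tree_differences(runtime_pkg: Path, overlay_pkg: Path) -> list[str]:
    """List files that are missing on one side or differ byte-for-byte."""
    problems = []
    for rel in MIRRORED:
        left = _files(runtime_pkg / rel)
        right = _files(overlay_pkg / rel)
        for name in sorted(left.keys() - right.keys()):
            problems.append(f"only in runtime tree: {rel}/{name}")
        for name in sorted(right.keys() - left.keys()):
            problems.append(f"only in overlay: {rel}/{name}")
        for name in sorted(left.keys() & right.keys()):
            if left[name].read_bytes() != right[name].read_bytes():
                problems.append(f"content differs: {rel}/{name}")
    return problems
```

Called from the repo root as `tree_differences(Path("deer-flow/backend/packages/harness/deerflow"), Path("backend/packages/harness/deerflow"))`, an empty list means the two trees match.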
+
+## What this repo is NOT
+
+- Not a fork on GitHub. The vendored upstream `.git` was removed on
+  purpose (hard rule 6). If you need to compare against upstream, clone
+  it fresh into `/tmp/`.
+- Not a Python package (yet). There is no `pyproject.toml` at the
+  factory root; the Python entry point is the deer-flow tree's own
+  `backend/pyproject.toml`. We only put files into its harness package.
+- Not multi-tenant. There is exactly one DeerFlow instance, one Searx,
+  one set of credentials. Keep it that way unless the user explicitly
+  asks for tenancy.