diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..f1473b4
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,177 @@
+# CLAUDE.md — orientation for future sessions
+
+You are in **deerflow-factory**, a hardened deployment of
+[bytedance/deer-flow](https://github.com/bytedance/deer-flow) on data-nuc.
+Read this file first before touching anything in this repo.
+
+## Read these in order
+
+1. **HARDENING.md** — threat model, what was changed, why, and how to
+   verify. The full audit trail. Update it whenever you change anything
+   listed in section 6 ("Files touched").
+2. **RUN.md** — how to start/stop/inspect the stack, smoke-test commands.
+3. **DEERFLOW_PROMPT_INJECTION_PROTECTION_PLAN.md** — the original plan
+   that the hardening implements. Historical context.
+
+## Layout
+
+```
+deerflow-factory/
+├── deer-flow/                         ← vendored upstream (no nested .git!)
+│   └── backend/packages/harness/deerflow/
+│       ├── security/                  ← hardened content sanitizer (lives here!)
+│       ├── community/searx/           ← hardened web tools (lives here!)
+│       └── community//                ← stubbed, raise on import
+├── backend/                           ← factory overlay (mirror, kept in sync)
+│   └── packages/harness/deerflow/
+│       ├── security/                  ← duplicated source-of-truth
+│       └── community/searx/           ← duplicated source-of-truth
+├── docker/
+│   └── docker-compose.override.yaml   ← named bridge br-deerflow
+├── scripts/
+│   ├── deerflow-firewall.sh           ← egress firewall up/down/status
+│   └── deerflow-firewall.nix          ← NixOS module (imported by /etc/nixos/configuration.nix)
+├── config.yaml                        ← runtime config — only references searx tools
+├── .env                               ← real secrets, .gitignored
+├── .env.example                       ← template
+├── HARDENING.md
+├── RUN.md
+└── CLAUDE.md                          ← you are here
+```
+
+## Hard rules (do not violate without explicit user approval)
+
+1. **Native web tools stay disabled.** The legacy providers
+   (`ddg_search`, `tavily`, `exa`, `firecrawl`, `jina_ai`, `infoquest`,
+   `image_search`) and the matching helper clients (`jina_client.py`,
+   `infoquest_client.py`) are intentionally replaced with import-time
+   `RuntimeError` stubs. Re-enabling **any** of them requires:
+   - hardening the call site (sanitize → wrap_untrusted_content)
+   - moving the matching test out of `tests/_disabled_native/`
+   - updating HARDENING.md sections 2.3, 2.4, and 6
+   - explicit user sign-off in the same conversation
+
+2. **All web output must be sanitized and delimited.** Any new code path
+   that returns external data to the model **must** route through
+   `deerflow.security.sanitizer.sanitize()` and
+   `deerflow.security.content_delimiter.wrap_untrusted_content()`.
+   The whole point of this build is that the LLM never sees raw web
+   bytes.
+
+3. **No secrets in git.** `.env` is `.gitignored`. Before staging,
+   verify with `git diff --cached | grep -iE 'api_key|secret|token|password'`.
+   Use `.env.example` for templates only — placeholders, never live keys.
+
+4. **Two source trees, one truth.** The hardened code lives in **both**
+   `deer-flow/backend/packages/harness/deerflow/{security,community/searx}/`
+   (the runtime path) **and** `backend/packages/harness/deerflow/...`
+   (the factory overlay used by the standalone tests). They must stay
+   identical. If you fix a bug in one, mirror it to the other in the
+   same commit, or delete one of the two trees and pick a single source
+   of truth.
+
+5. **The egress firewall is part of the threat model.** Do not change
+   `scripts/deerflow-firewall.sh` allow/block lists without updating
+   HARDENING.md section 2.7. Specifically:
+   - allow: `10.67.67.1` (Searx), `10.67.67.2` (XTTS/Whisper/Ollama-local)
+   - block: `192.168.3.0/24` (home LAN), `10.0.0.0/8`, `172.16.0.0/12`
+
+6. **deer-flow is vendored, not a submodule.** The upstream `.git` was
+   removed and is parked at `/tmp/deer-flow-upstream.git.bak` on
+   data-nuc. If you need to pull upstream changes, do it in a separate
+   working copy and rebase manually — do not re-introduce a nested git
+   into this repo.
+
+## Where things run
+
+- **Host:** data-nuc (NixOS 25.11, kernel 6.12.x). `data` user is in
+  the `docker` group, can use `docker compose` directly.
+- **Repo path:** `/home/data/deerflow-factory`
+- **Gitea remote:** `https://git.beerbandit.de/DATA/deerflow-factory`
+  (credentials in `~/.git-credentials` for user `data`)
+- **Egress firewall:** `systemctl status deerflow-firewall`
+  - active = rules in DOCKER-USER, applied to `br-deerflow`
+  - inactive = rules removed (no firewall)
+- **DeerFlow stack:** not running yet at the time of this CLAUDE.md
+  initial commit. First start: see RUN.md.
+
+## Commit / push style
+
+- Imperative subject, present-tense body. Reference HARDENING.md
+  sections by number when you change something they describe.
+- Do not amend or force-push without asking. Add a follow-up commit.
+- Pre-commit secret check:
+  ```bash
+  git diff --cached --name-only | xargs -I{} grep -lE \
+    'api_key|secret_key|sk-[a-zA-Z0-9]{20,}|ghp_|tvly-' {} 2>/dev/null
+  ```
+  Only `.env.example` should appear. If anything else does, abort.
+
+## Quick verification (run before declaring "it works")
+
+From `/home/data/deerflow-factory`:
+
+```bash
+PYTHONPATH=deer-flow/backend/packages/harness python3 -c "
+# 1. hardened modules import
+from deerflow.security.content_delimiter import wrap_untrusted_content
+from deerflow.security.sanitizer import sanitizer
+from deerflow.security.html_cleaner import extract_secure_text
+import importlib.util
+assert importlib.util.find_spec('deerflow.community.searx.tools') is not None
+
+# 2. native modules fail closed
+for prov in ['ddg_search','tavily','exa','firecrawl','jina_ai','infoquest','image_search']:
+    try:
+        __import__(f'deerflow.community.{prov}.tools')
+        raise SystemExit(f'FAIL: {prov} imported')
+    except RuntimeError as e:
+        assert 'disabled in this hardened DeerFlow build' in str(e)
+print('OK')
+"
+
+# 3. security tests
+PYTHONPATH=deer-flow/backend/packages/harness pytest \
+  backend/tests/test_security_sanitizer.py \
+  backend/tests/test_security_html_cleaner.py -q
+# expected: 8 passed
+
+# 4. firewall service
+systemctl is-active deerflow-firewall
+sudo scripts/deerflow-firewall.sh status
+```
+
+If any of these fail and you cannot fix them in the same session, stop
+and report — do not paper over the failure.
+
+## Common footguns
+
+- **`pip` is not installed system-wide on NixOS.** If you need a Python
+  dep for a one-off script, use `nix-shell -p python3Packages.`
+  or run inside a deer-flow `.venv` once it exists. Do not try
+  `pip install --user` — it will fail.
+- **`sudo` is passwordless for `data`.** Be careful: any `sudo` you run
+  succeeds without a prompt. Double-check destructive commands.
+- **NixOS rewrites `/etc/systemd/system/`.** Do not drop unit files in
+  there directly; they will be wiped on `nixos-rebuild switch`. Add a
+  `systemd.services.` block to a Nix module instead (see
+  `scripts/deerflow-firewall.nix` for the pattern).
+- **The factory overlay (`backend/`) is currently a mirror, not the
+  runtime.** When you import from Python at runtime, the path that
+  matters is `deer-flow/backend/packages/harness`. The overlay only
+  matters for the standalone factory tests. Keep them in sync until we
+  pick one as canonical.
+- **`docker compose down` does not remove the firewall rules.** That is
+  by design. Only `systemctl stop deerflow-firewall` removes them.
+
+## What this repo is NOT
+
+- Not a fork on GitHub. The vendored upstream `.git` was deleted on
+  purpose. If you need to compare against upstream, clone it fresh into
+  `/tmp/`.
+- Not a Python package (yet). There is no `pyproject.toml` at the
+  factory root; the Python entry point is the deer-flow tree's own
+  `backend/pyproject.toml`. We only put files into its harness package.
+- Not multi-tenant. There is exactly one DeerFlow instance, one Searx,
+  one set of credentials. Keep it that way unless the user explicitly
+  asks for tenancy.
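
Hard rule 4 in the file above requires the two hardened trees to stay identical, but the file gives no command for checking that. The following is a minimal sketch of such a check. It assumes only the directory layout described in CLAUDE.md; the function names (`list_files`, `diff_trees`, `check_mirror`) are hypothetical and are not part of the repo.

```python
from pathlib import Path

# Location of the harness package inside each tree, as given in the Layout section.
HARNESS = Path("packages/harness/deerflow")
# Sub-packages that hard rule 4 says are duplicated in both trees.
MIRRORED_SUBDIRS = ("security", "community/searx")


def list_files(root: Path) -> set[Path]:
    """Return all file paths under root, relative to root, skipping bytecode caches."""
    return {
        p.relative_to(root)
        for p in root.rglob("*")
        if p.is_file() and "__pycache__" not in p.parts
    }


def diff_trees(runtime: Path, overlay: Path) -> list[str]:
    """Describe every difference between the runtime tree and the overlay tree."""
    problems = []
    for name, root in (("runtime", runtime), ("overlay", overlay)):
        if not root.is_dir():
            problems.append(f"missing {name} directory: {root}")
    if problems:
        return problems

    left, right = list_files(runtime), list_files(overlay)
    problems += [f"only in runtime tree: {rel}" for rel in sorted(left - right)]
    problems += [f"only in overlay: {rel}" for rel in sorted(right - left)]
    for rel in sorted(left & right):
        # Byte-for-byte comparison, so same-size edits are still detected.
        if (runtime / rel).read_bytes() != (overlay / rel).read_bytes():
            problems.append(f"content differs: {rel}")
    return problems


def check_mirror(repo_root: Path) -> list[str]:
    """Compare each mirrored sub-package; an empty list means the trees match."""
    problems = []
    for sub in MIRRORED_SUBDIRS:
        runtime = repo_root / "deer-flow" / "backend" / HARNESS / sub
        overlay = repo_root / "backend" / HARNESS / sub
        problems += [f"{sub}: {msg}" for msg in diff_trees(runtime, overlay)]
    return problems
```

Calling `check_mirror(Path("/home/data/deerflow-factory"))` returns a list of human-readable differences. An empty list means the runtime tree and the overlay agree for both mirrored sub-packages. The comparison reads file contents rather than trusting size or modification time, which matters when a fix is copied by hand from one tree to the other.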