Initial commit: hardened DeerFlow factory

Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection hardening:

- New deerflow.security package: content_delimiter, html_cleaner, sanitizer (8 layers: invisible chars, control chars, symbols, NFC, PUA, tag chars, horizontal whitespace collapse with newline/tab preservation, length cap)
- New deerflow.community.searx package: web_search, web_fetch, image_search backed by a private SearX instance, every external string sanitized and wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>> delimiters
- All native community web providers (ddg_search, tavily, exa, firecrawl, jina_ai, infoquest, image_search) replaced with hard-fail stubs that raise NativeWebToolDisabledError at import time, so a misconfigured tool.use path fails loud rather than silently falling back to unsanitized output
- Native client back-doors (jina_client.py, infoquest_client.py) stubbed too
- Native-tool tests quarantined under tests/_disabled_native/ (collect_ignore_glob via local conftest.py)
- Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve newlines and tabs so list/table structure survives
- Hardened runtime config.yaml references only the searx-backed tools
- Factory overlay (backend/) kept in sync with deer-flow tree as a reference / source

See HARDENING.md for the full audit trail and verification steps.
2026-04-12 14:23:57 +02:00
commit 6de0bf9f5b
889 changed files with 173052 additions and 0 deletions
--- a/deer-flow/backend/docs/ARCHITECTURE.md
+++ b/deer-flow/backend/docs/ARCHITECTURE.md
@@ -0,0 +1,484 @@
+# Architecture Overview
+
+This document provides a comprehensive overview of the DeerFlow backend architecture.
+
+## System Architecture
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                              Client (Browser)                             │
+└─────────────────────────────────┬────────────────────────────────────────┘
+                                  │
+                                  ▼
+┌──────────────────────────────────────────────────────────────────────────┐
+│                          Nginx (Port 2026)                               │
+│                    Unified Reverse Proxy Entry Point                      │
+│  ┌────────────────────────────────────────────────────────────────────┐  │
+│  │  /api/langgraph/*  →  LangGraph Server (2024)                      │  │
+│  │  /api/*            →  Gateway API (8001)                           │  │
+│  │  /*                →  Frontend (3000)                               │  │
+│  └────────────────────────────────────────────────────────────────────┘  │
+└─────────────────────────────────┬────────────────────────────────────────┘
+                                  │
+          ┌───────────────────────┼───────────────────────┐
+          │                       │                       │
+          ▼                       ▼                       ▼
+┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
+│   LangGraph Server  │ │    Gateway API      │ │     Frontend        │
+│     (Port 2024)     │ │    (Port 8001)      │ │    (Port 3000)      │
+│                     │ │                     │ │                     │
+│  - Agent Runtime    │ │  - Models API       │ │  - Next.js App      │
+│  - Thread Mgmt      │ │  - MCP Config       │ │  - React UI         │
+│  - SSE Streaming    │ │  - Skills Mgmt      │ │  - Chat Interface   │
+│  - Checkpointing    │ │  - File Uploads     │ │                     │
+│                     │ │  - Thread Cleanup   │ │                     │
+│                     │ │  - Artifacts        │ │                     │
+└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
+          │                       │
+          │     ┌─────────────────┘
+          │     │
+          ▼     ▼
+┌──────────────────────────────────────────────────────────────────────────┐
+│                         Shared Configuration                              │
+│  ┌─────────────────────────┐  ┌────────────────────────────────────────┐ │
+│  │      config.yaml        │  │      extensions_config.json            │ │
+│  │  - Models               │  │  - MCP Servers                         │ │
+│  │  - Tools                │  │  - Skills State                        │ │
+│  │  - Sandbox              │  │                                        │ │
+│  │  - Summarization        │  │                                        │ │
+│  └─────────────────────────┘  └────────────────────────────────────────┘ │
+└──────────────────────────────────────────────────────────────────────────┘
+```
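+
+The routing rules in the Nginx box can be read as a longest-prefix match, which is how Nginx selects among plain prefix locations. The sketch below models that in Python; `ROUTES` and `route` are illustrative names, not part of DeerFlow or of the actual Nginx configuration.
+
+```python
+# Hypothetical routing table mirroring the diagram; not the real nginx.conf.
+ROUTES = [
+    ("/api/langgraph/", "langgraph", 2024),
+    ("/api/", "gateway", 8001),
+    ("/", "frontend", 3000),
+]
+
+
+def route(path: str) -> tuple[str, int]:
+    """Return (service, port) for a request path using longest-prefix match."""
+    for prefix, service, port in sorted(ROUTES, key=lambda r: len(r[0]), reverse=True):
+        if path.startswith(prefix):
+            return service, port
+    raise ValueError(f"no route for {path!r}")
+```
+
+Because the longest prefix wins, `/api/langgraph/...` never reaches the Gateway even though it also starts with `/api/`.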
+
+## Component Details
+
+### LangGraph Server
+
+The LangGraph server is the core agent runtime, built on LangGraph for robust multi-agent workflow orchestration.
+
+**Entry Point**: `packages/harness/deerflow/agents/lead_agent/agent.py:make_lead_agent`
+
+**Key Responsibilities**:
+- Agent creation and configuration
+- Thread state management
+- Middleware chain execution
+- Tool execution orchestration
+- SSE streaming for real-time responses
+
+**Configuration**: `langgraph.json`
+
+```json
+{
+  "agent": {
+    "type": "agent",
+    "path": "deerflow.agents:make_lead_agent"
+  }
+}
+```
+
+### Gateway API
+
+FastAPI application providing REST endpoints for non-agent operations.
+
+**Entry Point**: `app/gateway/app.py`
+
+**Routers**:
+- `models.py` - `/api/models` - Model listing and details
+- `mcp.py` - `/api/mcp` - MCP server configuration
+- `skills.py` - `/api/skills` - Skills management
+- `uploads.py` - `/api/threads/{id}/uploads` - File upload
+- `threads.py` - `/api/threads/{id}` - Local DeerFlow thread data cleanup after LangGraph deletion
+- `artifacts.py` - `/api/threads/{id}/artifacts` - Artifact serving
+- `suggestions.py` - `/api/threads/{id}/suggestions` - Follow-up suggestion generation
+
+Deleting a conversation from the web UI involves both backend surfaces: LangGraph handles `DELETE /api/langgraph/threads/{thread_id}` to remove thread state, and the Gateway `threads.py` router then removes DeerFlow-managed filesystem data via `Paths.delete_thread_dir()`.
+
+### Agent Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                           make_lead_agent(config)                        │
+└────────────────────────────────────┬────────────────────────────────────┘
+                                     │
+                                     ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│                            Middleware Chain                              │
+│  ┌──────────────────────────────────────────────────────────────────┐   │
+│  │ 1. ThreadDataMiddleware  - Initialize workspace/uploads/outputs  │   │
+│  │ 2. UploadsMiddleware     - Process uploaded files               │   │
+│  │ 3. SandboxMiddleware     - Acquire sandbox environment          │   │
+│  │ 4. SummarizationMiddleware - Context reduction (if enabled)     │   │
+│  │ 5. TitleMiddleware       - Auto-generate titles                 │   │
+│  │ 6. TodoListMiddleware    - Task tracking (if plan_mode)         │   │
+│  │ 7. ViewImageMiddleware   - Vision model support                 │   │
+│  │ 8. ClarificationMiddleware - Handle clarifications              │   │
+│  └──────────────────────────────────────────────────────────────────┘   │
+└────────────────────────────────────┬────────────────────────────────────┘
+                                     │
+                                     ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│                              Agent Core                                  │
+│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────────┐   │
+│  │      Model       │  │      Tools       │  │    System Prompt     │   │
+│  │  (from factory)  │  │  (configured +   │  │  (with skills)       │   │
+│  │                  │  │   MCP + builtin) │  │                      │   │
+│  └──────────────────┘  └──────────────────┘  └──────────────────────┘   │
+└─────────────────────────────────────────────────────────────────────────┘
+```
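+
+As a rough mental model, the chain above behaves like an ordered list of steps, some of which only run when a config flag is set. The sketch below is a simplification under that assumption: `Middleware`, `run_chain`, and `mark` are hypothetical names, and the real middlewares hook into the agent lifecycle rather than being plain functions.
+
+```python
+from dataclasses import dataclass
+from typing import Callable
+
+
+@dataclass
+class Middleware:
+    name: str
+    run: Callable[[dict], dict]
+    # Predicate over the run config; defaults to "always enabled".
+    enabled: Callable[[dict], bool] = lambda config: True
+
+
+def run_chain(chain: list[Middleware], state: dict, config: dict) -> dict:
+    """Apply each enabled middleware in order, threading the state through."""
+    for mw in chain:
+        if mw.enabled(config):
+            state = mw.run(state)
+    return state
+
+
+def mark(name: str) -> Callable[[dict], dict]:
+    """Build a toy step that only records that it ran."""
+    def run(state: dict) -> dict:
+        return {**state, "trace": state.get("trace", []) + [name]}
+    return run
+
+
+chain = [
+    Middleware("ThreadData", mark("ThreadData")),
+    Middleware("Sandbox", mark("Sandbox")),
+    Middleware("TodoList", mark("TodoList"), enabled=lambda c: c.get("plan_mode", False)),
+]
+```
+
+Running `run_chain(chain, {}, {"plan_mode": True})` includes the `TodoList` step; without the flag it is skipped, which mirrors the "(if plan_mode)" note in the diagram.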
+
+### Thread State
+
+`ThreadState` extends LangGraph's `AgentState` with additional fields:
+
+```python
+class ThreadState(AgentState):
+    # Core state from AgentState
+    messages: list[BaseMessage]
+
+    # DeerFlow extensions
+    sandbox: dict             # Sandbox environment info
+    artifacts: list[str]      # Generated file paths
+    thread_data: dict         # {workspace, uploads, outputs} paths
+    title: str | None         # Auto-generated conversation title
+    todos: list[dict]         # Task tracking (plan mode)
+    viewed_images: dict       # Vision model image data
+```
+
+### Sandbox System
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                           Sandbox Architecture                           │
+└─────────────────────────────────────────────────────────────────────────┘
+
+                      ┌─────────────────────────┐
+                      │    SandboxProvider      │ (Abstract)
+                      │  - acquire()            │
+                      │  - get()                │
+                      │  - release()            │
+                      └────────────┬────────────┘
+                                   │
+              ┌────────────────────┼────────────────────┐
+              │                                         │
+              ▼                                         ▼
+┌─────────────────────────┐              ┌─────────────────────────┐
+│  LocalSandboxProvider   │              │  AioSandboxProvider     │
+│  (packages/harness/deerflow/sandbox/local.py) │              │  (packages/harness/deerflow/community/)       │
+│                         │              │                         │
+│  - Singleton instance   │              │  - Docker-based         │
+│  - Direct execution     │              │  - Isolated containers  │
+│  - Development use      │              │  - Production use       │
+└─────────────────────────┘              └─────────────────────────┘
+
+                      ┌─────────────────────────┐
+                      │        Sandbox          │ (Abstract)
+                      │  - execute_command()    │
+                      │  - read_file()          │
+                      │  - write_file()         │
+                      │  - list_dir()           │
+                      └─────────────────────────┘
+```
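+
+The provider lifecycle in the diagram (`acquire`, `get`, `release`) can be sketched as an abstract base class with a toy singleton implementation. The argument lists and return types below are assumptions, since the diagram shows method names only, and `SingletonProvider` is a hypothetical stand-in rather than the real `LocalSandboxProvider`.
+
+```python
+from abc import ABC, abstractmethod
+
+
+class SandboxProvider(ABC):
+    """Lifecycle interface from the diagram; signatures are assumed."""
+
+    @abstractmethod
+    def acquire(self, thread_id: str) -> str: ...
+
+    @abstractmethod
+    def get(self, sandbox_id: str) -> dict | None: ...
+
+    @abstractmethod
+    def release(self, sandbox_id: str) -> None: ...
+
+
+class SingletonProvider(SandboxProvider):
+    """Toy provider that hands every thread the same sandbox."""
+
+    def __init__(self) -> None:
+        self._sandbox: dict | None = None
+
+    def acquire(self, thread_id: str) -> str:
+        if self._sandbox is None:
+            self._sandbox = {"id": "local", "threads": set()}
+        self._sandbox["threads"].add(thread_id)
+        return self._sandbox["id"]
+
+    def get(self, sandbox_id: str) -> dict | None:
+        if self._sandbox and self._sandbox["id"] == sandbox_id:
+            return self._sandbox
+        return None
+
+    def release(self, sandbox_id: str) -> None:
+        # A shared singleton outlives any one thread, so release is a no-op.
+        pass
+```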
+
+**Virtual Path Mapping**:
+
+| Virtual Path | Physical Path |
+|-------------|---------------|
+| `/mnt/user-data/workspace` | `backend/.deer-flow/threads/{thread_id}/user-data/workspace` |
+| `/mnt/user-data/uploads` | `backend/.deer-flow/threads/{thread_id}/user-data/uploads` |
+| `/mnt/user-data/outputs` | `backend/.deer-flow/threads/{thread_id}/user-data/outputs` |
+| `/mnt/skills` | `deer-flow/skills/` |
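+
+The user-data rows of the table can be expressed as a small translation function. This is a minimal sketch assuming the layout shown above; `to_physical` is a hypothetical helper, and the `..` check illustrates one simple way to refuse path traversal, not necessarily the check DeerFlow performs.
+
+```python
+from pathlib import PurePosixPath
+
+# Virtual root and areas copied from the table above.
+USER_DATA = PurePosixPath("/mnt/user-data")
+ALLOWED = {"workspace", "uploads", "outputs"}
+
+
+def to_physical(virtual: str, thread_id: str) -> PurePosixPath:
+    """Map a /mnt/user-data/... path to its per-thread physical location."""
+    path = PurePosixPath(virtual)
+    if ".." in path.parts:
+        raise ValueError("path traversal is not allowed")
+    try:
+        rel = path.relative_to(USER_DATA)
+    except ValueError:
+        raise ValueError(f"not under {USER_DATA}: {virtual}") from None
+    if not rel.parts or rel.parts[0] not in ALLOWED:
+        raise ValueError(f"unknown user-data area: {virtual}")
+    return PurePosixPath("backend/.deer-flow/threads", thread_id, "user-data") / rel
+```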
+
+### Tool System
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                            Tool Sources                                  │
+└─────────────────────────────────────────────────────────────────────────┘
+
+┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐
+│   Built-in Tools    │  │  Configured Tools   │  │     MCP Tools       │
+│  (packages/harness/deerflow/tools/)       │  │  (config.yaml)      │  │  (extensions.json)  │
+├─────────────────────┤  ├─────────────────────┤  ├─────────────────────┤
+│ - present_file      │  │ - web_search        │  │ - github            │
+│ - ask_clarification │  │ - web_fetch         │  │ - filesystem        │
+│ - view_image        │  │ - bash              │  │ - postgres          │
+│                     │  │ - read_file         │  │ - brave-search      │
+│                     │  │ - write_file        │  │ - puppeteer         │
+│                     │  │ - str_replace       │  │ - ...               │
+│                     │  │ - ls                │  │                     │
+└─────────────────────┘  └─────────────────────┘  └─────────────────────┘
+           │                       │                       │
+           └───────────────────────┴───────────────────────┘
+                                   │
+                                   ▼
+                      ┌─────────────────────────┐
+                      │   get_available_tools() │
+                      │   (packages/harness/deerflow/tools/__init__)  │
+                      └─────────────────────────┘
+```
+
+### Model Factory
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                          Model Factory                                   │
+│              (packages/harness/deerflow/models/factory.py)              │
+└─────────────────────────────────────────────────────────────────────────┘
+
+config.yaml:
+┌─────────────────────────────────────────────────────────────────────────┐
+│ models:                                                                  │
+│   - name: gpt-4                                                         │
+│     display_name: GPT-4                                                 │
+│     use: langchain_openai:ChatOpenAI                                    │
+│     model: gpt-4                                                        │
+│     api_key: $OPENAI_API_KEY                                            │
+│     max_tokens: 4096                                                    │
+│     supports_thinking: false                                            │
+│     supports_vision: true                                               │
+└─────────────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+                      ┌─────────────────────────┐
+                      │   create_chat_model()   │
+                      │  - name: str            │
+                      │  - thinking_enabled     │
+                      └────────────┬────────────┘
+                                   │
+                                   ▼
+                      ┌─────────────────────────┐
+                      │   resolve_class()       │
+                      │  (reflection system)    │
+                      └────────────┬────────────┘
+                                   │
+                                   ▼
+                      ┌─────────────────────────┐
+                      │   BaseChatModel         │
+                      │  (LangChain instance)   │
+                      └─────────────────────────┘
+```
+
+**Supported Providers**:
+- OpenAI (`langchain_openai:ChatOpenAI`)
+- Anthropic (`langchain_anthropic:ChatAnthropic`)
+- DeepSeek (`langchain_deepseek:ChatDeepSeek`)
+- Custom via LangChain integrations
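+
+The `use: module:Class` convention and the `$VAR` placeholders can be illustrated with the standard library alone. The sketch below is an assumption about how such a factory might work, not the contents of `factory.py`: `resolve_env` and `create_from_entry` are hypothetical helpers, and the split between metadata keys and constructor arguments is a guess.
+
+```python
+import importlib
+import os
+
+
+def resolve_class(spec: str) -> type:
+    """Import 'package.module:ClassName' and return the class object."""
+    module_name, _, attr = spec.partition(":")
+    if not module_name or not attr:
+        raise ValueError(f"expected 'module:Class', got {spec!r}")
+    return getattr(importlib.import_module(module_name), attr)
+
+
+def resolve_env(value):
+    """Replace a '$NAME' string with os.environ['NAME']; pass other values through."""
+    if isinstance(value, str) and value.startswith("$"):
+        return os.environ[value[1:]]
+    return value
+
+
+def create_from_entry(entry: dict):
+    """Instantiate the class named by entry['use'] with the remaining keys."""
+    # Keys assumed to describe the entry rather than the constructor.
+    meta = {"name", "display_name", "use", "supports_thinking", "supports_vision"}
+    cls = resolve_class(entry["use"])
+    kwargs = {k: resolve_env(v) for k, v in entry.items() if k not in meta}
+    return cls(**kwargs)
+```
+
+Any importable class works as a stand-in: `create_from_entry({"name": "demo", "use": "fractions:Fraction", "numerator": 3, "denominator": 4})` builds a `Fraction` without needing a LangChain provider installed.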
+
+### MCP Integration
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                          MCP Integration                                 │
+│               (packages/harness/deerflow/mcp/manager.py)                │
+└─────────────────────────────────────────────────────────────────────────┘
+
+extensions_config.json:
+┌─────────────────────────────────────────────────────────────────────────┐
+│ {                                                                        │
+│   "mcpServers": {                                                       │
+│     "github": {                                                         │
+│       "enabled": true,                                                  │
+│       "type": "stdio",                                                  │
+│       "command": "npx",                                                 │
+│       "args": ["-y", "@modelcontextprotocol/server-github"],           │
+│       "env": {"GITHUB_TOKEN": "$GITHUB_TOKEN"}                          │
+│     }                                                                   │
+│   }                                                                     │
+│ }                                                                       │
+└─────────────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+                      ┌─────────────────────────┐
+                      │  MultiServerMCPClient   │
+                      │  (langchain-mcp-adapters)│
+                      └────────────┬────────────┘
+                                   │
+              ┌────────────────────┼────────────────────┐
+              │                    │                    │
+              ▼                    ▼                    ▼
+       ┌───────────┐        ┌───────────┐        ┌───────────┐
+       │  stdio    │        │   SSE     │        │   HTTP    │
+       │ transport │        │ transport │        │ transport │
+       └───────────┘        └───────────┘        └───────────┘
+```
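+
+Before any client is created, the config shown above has to be filtered to enabled servers and have its `$VAR` values resolved. The following sketch shows one way to do that; `enabled_servers` is a hypothetical helper, and substituting an empty string for an unset variable is an assumption made for the example.
+
+```python
+import json
+import os
+
+
+def enabled_servers(config_text: str, env=os.environ) -> dict:
+    """Return {name: spec} for enabled MCP servers with env placeholders resolved."""
+    servers = json.loads(config_text).get("mcpServers", {})
+    result = {}
+    for name, spec in servers.items():
+        if not spec.get("enabled", False):
+            continue
+        # Copy the spec without the 'enabled' flag, which is DeerFlow-side state.
+        spec = {k: v for k, v in spec.items() if k != "enabled"}
+        spec["env"] = {
+            key: (env.get(val[1:], "") if val.startswith("$") else val)
+            for key, val in spec.get("env", {}).items()
+        }
+        result[name] = spec
+    return result
+```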
+
+### Skills System
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                          Skills System                                   │
+│              (packages/harness/deerflow/skills/loader.py)               │
+└─────────────────────────────────────────────────────────────────────────┘
+
+Directory Structure:
+┌─────────────────────────────────────────────────────────────────────────┐
+│ skills/                                                                  │
+│ ├── public/                        # Public skills (committed)           │
+│ │   ├── pdf-processing/                                                 │
+│ │   │   └── SKILL.md                                                    │
+│ │   ├── frontend-design/                                                │
+│ │   │   └── SKILL.md                                                    │
+│ │   └── ...                                                             │
+│ └── custom/                        # Custom skills (gitignored)          │
+│     └── user-installed/                                                 │
+│         └── SKILL.md                                                    │
+└─────────────────────────────────────────────────────────────────────────┘
+
+SKILL.md Format:
+┌─────────────────────────────────────────────────────────────────────────┐
+│ ---                                                                      │
+│ name: PDF Processing                                                     │
+│ description: Handle PDF documents efficiently                            │
+│ license: MIT                                                            │
+│ allowed-tools:                                                          │
+│   - read_file                                                           │
+│   - write_file                                                          │
+│   - bash                                                                │
+│ ---                                                                      │
+│                                                                          │
+│ # Skill Instructions                                                     │
+│ Content injected into system prompt...                                   │
+└─────────────────────────────────────────────────────────────────────────┘
+```
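+
+The `SKILL.md` format above is front matter followed by a Markdown body. The sketch below splits the two with a deliberately tiny parser that understands only `key: value` pairs and `- item` lists, which is enough for the example shown; `parse_skill` is a hypothetical name, and a real loader would more likely rely on a YAML library.
+
+```python
+def parse_skill(text: str) -> tuple[dict, str]:
+    """Split a SKILL.md into (metadata, body) without a YAML dependency."""
+    lines = text.splitlines()
+    if not lines or lines[0].strip() != "---":
+        raise ValueError("SKILL.md must start with a '---' front-matter fence")
+    try:
+        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
+    except StopIteration:
+        raise ValueError("front matter is not closed") from None
+
+    meta: dict = {}
+    current_list = None  # list being filled by '- item' lines, if any
+    for line in lines[1:end]:
+        stripped = line.strip()
+        if not stripped:
+            continue
+        if stripped.startswith("- ") and current_list is not None:
+            current_list.append(stripped[2:].strip())
+            continue
+        key, _, value = stripped.partition(":")
+        value = value.strip()
+        if value:
+            meta[key.strip()] = value
+            current_list = None
+        else:
+            # A key with no inline value starts a list.
+            current_list = meta.setdefault(key.strip(), [])
+
+    body = "\n".join(lines[end + 1:]).strip()
+    return meta, body
+```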
+
+### Request Flow
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         Request Flow Example                             │
+│                    User sends message to agent                           │
+└─────────────────────────────────────────────────────────────────────────┘
+
+1. Client → Nginx
+   POST /api/langgraph/threads/{thread_id}/runs
+   {"input": {"messages": [{"role": "user", "content": "Hello"}]}}
+
+2. Nginx → LangGraph Server (2024)
+   Proxied to LangGraph server
+
+3. LangGraph Server
+   a. Load/create thread state
+   b. Execute middleware chain:
+      - ThreadDataMiddleware: Set up paths
+      - UploadsMiddleware: Inject file list
+      - SandboxMiddleware: Acquire sandbox
+      - SummarizationMiddleware: Check token limits
+      - TitleMiddleware: Generate title if needed
+      - TodoListMiddleware: Load todos (if plan mode)
+      - ViewImageMiddleware: Process images
+      - ClarificationMiddleware: Check for clarifications
+
+   c. Execute agent:
+      - Model processes messages
+      - May call tools (bash, web_search, etc.)
+      - Tools execute via sandbox
+      - Results added to messages
+
+   d. Stream response via SSE
+
+4. Client receives streaming response
+```
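+
+On the client side, step 4 means reading a `text/event-stream` body. The sketch below parses such a stream into `(event, data)` pairs following the standard SSE framing (fields separated by a colon, events terminated by a blank line); `iter_sse` is a hypothetical helper, and the event names in any sample input are placeholders rather than the names LangGraph actually emits.
+
+```python
+from typing import Iterable, Iterator
+
+
+def iter_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
+    """Yield (event, data) pairs from the lines of a text/event-stream body."""
+    event, data = "message", []
+    for raw in lines:
+        line = raw.rstrip("\r\n")
+        if line == "":
+            # A blank line dispatches the buffered event, if it has data.
+            if data:
+                yield event, "\n".join(data)
+            event, data = "message", []
+        elif line.startswith(":"):
+            continue  # comment or keep-alive line
+        else:
+            field, _, value = line.partition(":")
+            if value.startswith(" "):
+                value = value[1:]
+            if field == "event":
+                event = value
+            elif field == "data":
+                data.append(value)
+```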
+
+## Data Flow
+
+### File Upload Flow
+
+```
+1. Client uploads file
+   POST /api/threads/{thread_id}/uploads
+   Content-Type: multipart/form-data
+
+2. Gateway receives file
+   - Validates file
+   - Stores in .deer-flow/threads/{thread_id}/user-data/uploads/
+   - If document: converts to Markdown via markitdown
+
+3. Returns response
+   {
+     "files": [{
+       "filename": "doc.pdf",
+       "path": ".deer-flow/.../uploads/doc.pdf",
+       "virtual_path": "/mnt/user-data/uploads/doc.pdf",
+       "artifact_url": "/api/threads/.../artifacts/mnt/.../doc.pdf"
+     }]
+   }
+
+4. Next agent run
+   - UploadsMiddleware lists files
+   - Injects file list into messages
+   - Agent can access via virtual_path
+```
+
+### Thread Cleanup Flow
+
+```
+1. Client deletes conversation via LangGraph
+   DELETE /api/langgraph/threads/{thread_id}
+
+2. Web UI follows up with Gateway cleanup
+   DELETE /api/threads/{thread_id}
+
+3. Gateway removes local DeerFlow-managed files
+   - Deletes .deer-flow/threads/{thread_id}/ recursively
+   - Missing directories are treated as a no-op
+   - Invalid thread IDs are rejected before filesystem access
+```
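+
+Step 3 can be sketched as a function that validates the ID first, treats a missing directory as a no-op, and otherwise deletes recursively. This standalone `delete_thread_dir` is an illustrative stand-in for `Paths.delete_thread_dir()`, and the ID pattern is an assumption because the actual validation rule is not stated here.
+
+```python
+import re
+import shutil
+from pathlib import Path
+
+# Assumed ID shape; the real validation rule is not given in this document.
+THREAD_ID_RE = re.compile(r"[A-Za-z0-9_-]+")
+
+
+def delete_thread_dir(base_dir: Path, thread_id: str) -> bool:
+    """Remove base_dir/threads/<thread_id>; return True if anything was deleted."""
+    # Validate before touching the filesystem so '../x' can never be joined.
+    if not THREAD_ID_RE.fullmatch(thread_id):
+        raise ValueError(f"invalid thread id: {thread_id!r}")
+    target = base_dir / "threads" / thread_id
+    if not target.exists():
+        return False  # already gone: treat as a no-op
+    shutil.rmtree(target)
+    return True
+```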
+
+### Configuration Reload
+
+```
+1. Client updates MCP config
+   PUT /api/mcp/config
+
+2. Gateway writes extensions_config.json
+   - Updates mcpServers section
+   - File mtime changes
+
+3. MCP Manager detects change
+   - get_cached_mcp_tools() checks mtime
+   - If changed: reinitializes MCP client
+   - Loads updated server configurations
+
+4. Next agent run uses new tools
+```
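+
+The mtime check in step 3 amounts to a cache keyed on the file's modification time. The sketch below shows the idea in isolation; `MtimeCache` is a hypothetical class, not the implementation behind `get_cached_mcp_tools()`.
+
+```python
+import os
+from typing import Callable
+
+
+class MtimeCache:
+    """Cache a loader's result until the backing file's mtime changes."""
+
+    def __init__(self, path: str, loader: Callable[[str], object]) -> None:
+        self.path = path
+        self.loader = loader
+        self._mtime: float | None = None
+        self._value: object = None
+
+    def get(self) -> object:
+        mtime = os.path.getmtime(self.path)
+        if mtime != self._mtime:
+            # First call, or the file was rewritten since the last load.
+            self._value = self.loader(self.path)
+            self._mtime = mtime
+        return self._value
+```
+
+One trade-off of this approach is that a rewrite landing within the filesystem's timestamp resolution of the previous one may go unnoticed until the next change.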
+
+## Security Considerations
+
+### Sandbox Isolation
+
+- Agent code executes within sandbox boundaries
+- Local sandbox: direct execution with no container isolation (development only)
+- Docker sandbox: Container isolation (production recommended)
+- Path traversal prevention in file operations
+
+### API Security
+
+- Thread isolation: Each thread has separate data directories
+- File validation: Uploads checked for path safety
+- Environment variable resolution: config files reference secrets as `$VAR` placeholders (e.g. `$OPENAI_API_KEY`), so secret values are not stored in config
+
+### MCP Security
+
+- Each MCP server runs in its own process
+- Environment variables resolved at runtime
+- Servers can be enabled/disabled independently
+
+## Performance Considerations
+
+### Caching
+
+- MCP tools cached with file mtime invalidation
+- Configuration loaded once, reloaded on file change
+- Skills parsed once at startup, cached in memory
+
+### Streaming
+
+- SSE used for real-time response streaming
+- Reduces time to first token
+- Enables progress visibility for long operations
+
+### Context Management
+
+- Summarization middleware reduces context when configured limits are approached
+- Configurable triggers: tokens, messages, or fraction
+- Preserves recent messages while summarizing older ones
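+
+The trigger-and-compact behaviour can be sketched as two small functions. Everything here is an assumption made for illustration: the parameter names, the reading of "fraction" as a share of the model's context window, and the `summarize` callback, which stands in for a model call.
+
+```python
+from typing import Callable
+
+
+def should_summarize(tokens: int, messages: int, *, max_tokens: int | None = None,
+                     max_messages: int | None = None, fraction: float | None = None,
+                     context_window: int | None = None) -> bool:
+    """Return True if any configured trigger (tokens, messages, fraction) fires."""
+    if max_tokens is not None and tokens >= max_tokens:
+        return True
+    if max_messages is not None and messages >= max_messages:
+        return True
+    if fraction is not None and context_window:
+        return tokens >= fraction * context_window
+    return False
+
+
+def compact(messages: list, keep_recent: int, summarize: Callable[[list], object]) -> list:
+    """Replace all but the last keep_recent messages with a single summary."""
+    split = len(messages) - keep_recent
+    if split <= 0:
+        return list(messages)  # nothing old enough to summarize
+    return [summarize(messages[:split])] + list(messages[split:])
+```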