Initial commit: hardened DeerFlow factory

Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection hardening:

- New deerflow.security package: content_delimiter, html_cleaner, sanitizer (8 layers: invisible chars, control chars, symbols, NFC, PUA, tag chars, horizontal whitespace collapse with newline/tab preservation, length cap)
- New deerflow.community.searx package: web_search, web_fetch, image_search backed by a private SearX instance, every external string sanitized and wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>> delimiters
- All native community web providers (ddg_search, tavily, exa, firecrawl, jina_ai, infoquest, image_search) replaced with hard-fail stubs that raise NativeWebToolDisabledError at import time, so a misconfigured tool.use path fails loud rather than silently falling back to unsanitized output
- Native client back-doors (jina_client.py, infoquest_client.py) stubbed too
- Native-tool tests quarantined under tests/_disabled_native/ (collect_ignore_glob via local conftest.py)
- Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve newlines and tabs so list/table structure survives
- Hardened runtime config.yaml references only the searx-backed tools
- Factory overlay (backend/) kept in sync with deer-flow tree as a reference / source

See HARDENING.md for the full audit trail and verification steps.

369
deer-flow/backend/docs/CONFIGURATION.md
Normal file
@@ -0,0 +1,369 @@
# Configuration Guide

This guide explains how to configure DeerFlow for your environment.

## Config Versioning

`config.example.yaml` contains a `config_version` field that tracks schema changes. When the example version is higher than your local `config.yaml`, the application emits a startup warning:

```
WARNING - Your config.yaml (version 0) is outdated — the latest version is 1.
Run `make config-upgrade` to merge new fields into your config.
```

- **Missing `config_version`** in your config is treated as version 0.
- Run `make config-upgrade` to auto-merge missing fields (your existing values are preserved and a `.bak` backup is created).
- When changing the config schema, bump `config_version` in `config.example.yaml`.

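
The version rule above can be sketched in a few lines. This is only an illustration of the comparison, not DeerFlow's actual startup code: the function names `config_version` and `outdated_warning` are hypothetical, the configs are assumed to be already parsed into dictionaries, and the warning wording is approximate.

```python
from typing import Any, Mapping, Optional


def config_version(config: Mapping[str, Any]) -> int:
    """Return the schema version; a missing or empty field counts as 0."""
    value = config.get("config_version")
    return int(value) if value is not None else 0


def outdated_warning(
    user_config: Mapping[str, Any],
    example_config: Mapping[str, Any],
) -> Optional[str]:
    """Return a warning string when the example schema is newer, else None."""
    current = config_version(user_config)
    latest = config_version(example_config)
    if latest > current:
        return (
            f"Your config.yaml (version {current}) is outdated; "
            f"the latest version is {latest}. Run `make config-upgrade`."
        )
    return None
```

Under this sketch, a `config.yaml` without a `config_version` field compared against an example at version 1 produces a warning, while two configs at the same version produce none.
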
## Configuration Sections

### Models

Configure the LLM models available to the agent:

```yaml
models:
  - name: gpt-4                       # Internal identifier
    display_name: GPT-4               # Human-readable name
    use: langchain_openai:ChatOpenAI  # LangChain class path
    model: gpt-4                      # Model identifier for API
    api_key: $OPENAI_API_KEY          # API key (use env var)
    max_tokens: 4096                  # Max tokens per request
    temperature: 0.7                  # Sampling temperature
```

**Supported Providers**:
- OpenAI (`langchain_openai:ChatOpenAI`)
- Anthropic (`langchain_anthropic:ChatAnthropic`)
- DeepSeek (`langchain_deepseek:ChatDeepSeek`)
- Claude Code OAuth (`deerflow.models.claude_provider:ClaudeChatModel`)
- Codex CLI (`deerflow.models.openai_codex_provider:CodexChatModel`)
- Any LangChain-compatible provider

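
Every `use` value above follows a `module:attribute` shape. The sketch below shows one plausible way such a path could be turned into a class with the standard library. It assumes the part before the colon is an importable module and the part after it is a top-level attribute; `resolve_class_path` is a hypothetical helper, not a documented DeerFlow function.

```python
import importlib
from typing import Any


def resolve_class_path(path: str) -> Any:
    """Resolve a 'package.module:Attribute' string to the named object."""
    module_name, sep, attr_name = path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {path!r}")
    # import_module raises ImportError if the provider package is missing.
    module = importlib.import_module(module_name)
    # getattr raises AttributeError if the module has no such attribute.
    return getattr(module, attr_name)
```

With a provider package installed, `resolve_class_path("langchain_openai:ChatOpenAI")` would return the class itself. A missing package surfaces as an `ImportError`, which is usually the first thing to check when a `use` entry fails to load.
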
CLI-backed provider examples:

```yaml
models:
  - name: gpt-5.4
    display_name: GPT-5.4 (Codex CLI)
    use: deerflow.models.openai_codex_provider:CodexChatModel
    model: gpt-5.4
    supports_thinking: true
    supports_reasoning_effort: true

  - name: claude-sonnet-4.6
    display_name: Claude Sonnet 4.6 (Claude Code OAuth)
    use: deerflow.models.claude_provider:ClaudeChatModel
    model: claude-sonnet-4-6
    max_tokens: 4096
    supports_thinking: true
```

**Auth behavior for CLI-backed providers**:
- `CodexChatModel` loads Codex CLI auth from `~/.codex/auth.json`
- The Codex Responses endpoint currently rejects `max_tokens` and `max_output_tokens`, so `CodexChatModel` does not expose a request-level token cap
- `ClaudeChatModel` accepts `CLAUDE_CODE_OAUTH_TOKEN`, `ANTHROPIC_AUTH_TOKEN`, `CLAUDE_CODE_OAUTH_TOKEN_FILE_DESCRIPTOR`, `CLAUDE_CODE_CREDENTIALS_PATH`, or plaintext `~/.claude/.credentials.json`
- On macOS, DeerFlow does not probe Keychain automatically. Use `scripts/export_claude_code_oauth.py` to export Claude Code auth explicitly when needed

To use OpenAI's `/v1/responses` endpoint with LangChain, keep using `langchain_openai:ChatOpenAI` and set:

```yaml
models:
  - name: gpt-5-responses
    display_name: GPT-5 (Responses API)
    use: langchain_openai:ChatOpenAI
    model: gpt-5
    api_key: $OPENAI_API_KEY
    use_responses_api: true
    output_version: responses/v1
```

For OpenAI-compatible gateways (for example Novita or OpenRouter), keep using `langchain_openai:ChatOpenAI` and set `base_url`:

```yaml
models:
  - name: novita-deepseek-v3.2
    display_name: Novita DeepSeek V3.2
    use: langchain_openai:ChatOpenAI
    model: deepseek/deepseek-v3.2
    api_key: $NOVITA_API_KEY
    base_url: https://api.novita.ai/openai
    supports_thinking: true
    when_thinking_enabled:
      extra_body:
        thinking:
          type: enabled

  - name: minimax-m2.5
    display_name: MiniMax M2.5
    use: langchain_openai:ChatOpenAI
    model: MiniMax-M2.5
    api_key: $MINIMAX_API_KEY
    base_url: https://api.minimax.io/v1
    max_tokens: 4096
    temperature: 1.0  # MiniMax requires temperature in (0.0, 1.0]
    supports_vision: true

  - name: minimax-m2.5-highspeed
    display_name: MiniMax M2.5 Highspeed
    use: langchain_openai:ChatOpenAI
    model: MiniMax-M2.5-highspeed
    api_key: $MINIMAX_API_KEY
    base_url: https://api.minimax.io/v1
    max_tokens: 4096
    temperature: 1.0  # MiniMax requires temperature in (0.0, 1.0]
    supports_vision: true

  - name: openrouter-gemini-2.5-flash
    display_name: Gemini 2.5 Flash (OpenRouter)
    use: langchain_openai:ChatOpenAI
    model: google/gemini-2.5-flash-preview
    api_key: $OPENAI_API_KEY
    base_url: https://openrouter.ai/api/v1
```

If your OpenRouter key lives in a differently named environment variable, point `api_key` at that variable explicitly (for example `api_key: $OPENROUTER_API_KEY`).

**Thinking Models**:
Some models support "thinking" mode for complex reasoning:

```yaml
models:
  - name: deepseek-v3
    supports_thinking: true
    when_thinking_enabled:
      extra_body:
        thinking:
          type: enabled
```

**Gemini with thinking via OpenAI-compatible gateway**:

When routing Gemini through an OpenAI-compatible proxy (Vertex AI OpenAI compat endpoint, AI Studio, or third-party gateways) with thinking enabled, the API attaches a `thought_signature` to each tool-call object returned in the response. Every subsequent request that replays those assistant messages **must** echo those signatures back on the tool-call entries; otherwise the API returns:

```
HTTP 400 INVALID_ARGUMENT: function call `<tool>` in the N. content block is
missing a `thought_signature`.
```

Standard `langchain_openai:ChatOpenAI` silently drops `thought_signature` when serialising messages. Use `deerflow.models.patched_openai:PatchedChatOpenAI` instead: it re-injects the tool-call signatures (sourced from `AIMessage.additional_kwargs["tool_calls"]`) into every outgoing payload.

```yaml
models:
  - name: gemini-2.5-pro-thinking
    display_name: Gemini 2.5 Pro (Thinking)
    use: deerflow.models.patched_openai:PatchedChatOpenAI
    model: google/gemini-2.5-pro-preview  # model name as expected by your gateway
    api_key: $GEMINI_API_KEY
    base_url: https://<your-openai-compat-gateway>/v1
    max_tokens: 16384
    supports_thinking: true
    supports_vision: true
    when_thinking_enabled:
      extra_body:
        thinking:
          type: enabled
```

For Gemini accessed **without** thinking (e.g. via OpenRouter where thinking is not activated), the plain `langchain_openai:ChatOpenAI` with `supports_thinking: false` is sufficient and no patch is needed.

### Tool Groups

Organize tools into logical groups:

```yaml
tool_groups:
  - name: web          # Web browsing and search
  - name: file:read    # Read-only file operations
  - name: file:write   # Write file operations
  - name: bash         # Shell command execution
```

### Tools

Configure specific tools available to the agent:

```yaml
tools:
  - name: web_search
    group: web
    use: deerflow.community.tavily.tools:web_search_tool
    max_results: 5
    # api_key: $TAVILY_API_KEY  # Optional
```

**Built-in Tools**:
- `web_search` - Search the web (DuckDuckGo, Tavily, Exa, InfoQuest, Firecrawl)
- `web_fetch` - Fetch web pages (Jina AI, Exa, InfoQuest, Firecrawl)
- `ls` - List directory contents
- `read_file` - Read file contents
- `write_file` - Write file contents
- `str_replace` - String replacement in files
- `bash` - Execute bash commands

### Sandbox

DeerFlow supports multiple sandbox execution modes. Configure your preferred mode in `config.yaml`:

**Local Execution** (runs sandbox code directly on the host machine):
```yaml
sandbox:
  use: deerflow.sandbox.local:LocalSandboxProvider  # Local execution
  allow_host_bash: false  # default; host bash is disabled unless explicitly re-enabled
```

**Docker Execution** (runs sandbox code in isolated Docker containers):
```yaml
sandbox:
  use: deerflow.community.aio_sandbox:AioSandboxProvider  # Docker-based sandbox
```

**Docker Execution with Kubernetes** (runs sandbox code in Kubernetes pods via provisioner service):

This mode runs each sandbox in an isolated Kubernetes Pod on your **host machine's cluster**. It requires Docker Desktop Kubernetes, OrbStack, or a similar local Kubernetes setup.

```yaml
sandbox:
  use: deerflow.community.aio_sandbox:AioSandboxProvider
  provisioner_url: http://provisioner:8002
```

In Docker development mode (`make docker-start`), DeerFlow starts the `provisioner` service only if this provisioner mode is configured. In local or plain Docker sandbox modes, `provisioner` is skipped.

See [Provisioner Setup Guide](../../docker/provisioner/README.md) for detailed configuration, prerequisites, and troubleshooting.

Choose between local execution and Docker-based isolation:

**Option 1: Local Sandbox** (default, simpler setup):
```yaml
sandbox:
  use: deerflow.sandbox.local:LocalSandboxProvider
  allow_host_bash: false
```

`allow_host_bash` is intentionally `false` by default. DeerFlow's local sandbox is a host-side convenience mode, not a secure shell isolation boundary. If you need `bash`, prefer `AioSandboxProvider`. Only set `allow_host_bash: true` for fully trusted single-user local workflows.

**Option 2: Docker Sandbox** (isolated, more secure):
```yaml
sandbox:
  use: deerflow.community.aio_sandbox:AioSandboxProvider
  port: 8080
  auto_start: true
  container_prefix: deer-flow-sandbox

  # Optional: Additional mounts
  mounts:
    - host_path: /path/on/host
      container_path: /path/in/container
      read_only: false
```

When you configure `sandbox.mounts`, DeerFlow exposes those `container_path` values in the agent prompt so the agent can discover and operate on mounted directories directly instead of assuming everything must live under `/mnt/user-data`.

### Skills

Configure the skills directory for specialized workflows:

```yaml
skills:
  # Host path (optional, default: ../skills)
  path: /custom/path/to/skills

  # Container mount path (default: /mnt/skills)
  container_path: /mnt/skills
```

**How Skills Work**:
- Skills are stored in `deer-flow/skills/{public,custom}/`
- Each skill has a `SKILL.md` file with metadata
- Skills are automatically discovered and loaded
- Available in both local and Docker sandbox via path mapping

**Per-Agent Skill Filtering**:
Custom agents can restrict which skills they load by defining a `skills` field in their `config.yaml` (located at `workspace/agents/<agent_name>/config.yaml`):
- **Omitted or `null`**: Loads all globally enabled skills (default fallback).
- **`[]` (empty list)**: Disables all skills for this specific agent.
- **`["skill-name"]`**: Loads only the explicitly specified skills.

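
The three filtering cases above can be sketched as a single function. This is an illustration of the documented semantics rather than DeerFlow's loader: `select_skills` is a hypothetical name, and the sketch assumes that a skill named by the agent but not globally enabled is simply skipped.

```python
from typing import Iterable, List, Optional


def select_skills(
    enabled: Iterable[str],
    agent_skills: Optional[List[str]],
) -> List[str]:
    """Apply a per-agent `skills` filter to the globally enabled skills."""
    enabled_list = list(enabled)
    if agent_skills is None:
        # Field omitted or null: fall back to every globally enabled skill.
        return enabled_list
    # An empty list yields no skills; otherwise keep only the named ones.
    allowed = set(agent_skills)
    return [name for name in enabled_list if name in allowed]
```

The distinction between `None` and `[]` is the part that is easy to get wrong: a truthiness check such as `if not agent_skills` would treat an empty list like an omitted field and load everything.
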
### Title Generation

Automatic conversation title generation:

```yaml
title:
  enabled: true
  max_words: 6
  max_chars: 60
  model_name: null  # Use first model in list
```

### GitHub API Token (Optional for GitHub Deep Research Skill)

The default GitHub API rate limits are quite restrictive. For frequent project research, we recommend configuring a personal access token (PAT) with read-only permissions.

**Configuration Steps**:
1. Uncomment the `GITHUB_TOKEN` line in the `.env` file and add your personal access token
2. Restart the DeerFlow service to apply changes

## Environment Variables

DeerFlow supports environment variable substitution using the `$` prefix:

```yaml
models:
  - api_key: $OPENAI_API_KEY  # Reads from environment
```

**Common Environment Variables**:
- `OPENAI_API_KEY` - OpenAI API key
- `ANTHROPIC_API_KEY` - Anthropic API key
- `DEEPSEEK_API_KEY` - DeepSeek API key
- `NOVITA_API_KEY` - Novita API key (OpenAI-compatible endpoint)
- `TAVILY_API_KEY` - Tavily search API key
- `DEER_FLOW_CONFIG_PATH` - Custom config file path

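
As a rough illustration of `$` substitution, the sketch below walks an already-parsed config and replaces string values that start with `$`. It makes assumptions this guide does not confirm: that only whole values of the form `$NAME` are substituted, and that an unset variable is an error. `resolve_env` is a hypothetical helper name.

```python
import os
from typing import Any, Mapping, Optional


def resolve_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace '$NAME' strings with the value of NAME."""
    env = os.environ if env is None else env
    if isinstance(value, str) and value.startswith("$"):
        name = value[1:]
        if name not in env:
            # Assumption: fail early instead of passing '$NAME' to a provider.
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]
    if isinstance(value, Mapping):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    return value
```

Passing `env` explicitly keeps the function easy to test. Under this sketch, a literal key pasted without the `$` prefix is returned unchanged, which matches the troubleshooting advice to check that the prefix is present.
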
## Configuration Location

The configuration file should be placed in the **project root directory** (`deer-flow/config.yaml`), not in the backend directory.

## Configuration Priority

DeerFlow searches for configuration in this order:

1. Path specified in code via `config_path` argument
2. Path from `DEER_FLOW_CONFIG_PATH` environment variable
3. `config.yaml` in current working directory (typically `backend/` when running)
4. `config.yaml` in parent directory (project root: `deer-flow/`)

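
The search order above can be sketched with `pathlib`. This is a minimal illustration, not DeerFlow's loader: `find_config` is a hypothetical name, and the sketch assumes that an explicit path (from the argument or the environment variable) is returned without checking that the file exists.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_config(
    config_path: Optional[str] = None,
    cwd: Optional[Path] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Return the config file location following the documented priority."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    if config_path:  # 1. explicit argument
        return Path(config_path)
    env_path = env.get("DEER_FLOW_CONFIG_PATH")
    if env_path:  # 2. environment variable
        return Path(env_path)
    # 3. current working directory, then 4. its parent (the project root)
    for candidate in (cwd / "config.yaml", cwd.parent / "config.yaml"):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError("config.yaml not found in cwd or its parent")
```

Because the working directory is checked before its parent, a stray `backend/config.yaml` would shadow the one in the project root under this ordering.
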
## Best Practices

1. **Place `config.yaml` in project root** - Not in `backend/` directory
2. **Never commit `config.yaml`** - It's already in `.gitignore`
3. **Use environment variables for secrets** - Don't hardcode API keys
4. **Keep `config.example.yaml` updated** - Document all new options
5. **Test configuration changes locally** - Before deploying
6. **Use Docker sandbox for production** - Better isolation and security

## Troubleshooting

### "Config file not found"
- Ensure `config.yaml` exists in the **project root** directory (`deer-flow/config.yaml`)
- The backend also searches the parent directory by default, so the root location is preferred
- Alternatively, set the `DEER_FLOW_CONFIG_PATH` environment variable to a custom location

### "Invalid API key"
- Verify environment variables are set correctly
- Check that the `$` prefix is used for env var references

### "Skills not loading"
- Check that the `deer-flow/skills/` directory exists
- Verify skills have valid `SKILL.md` files
- Check the `skills.path` configuration if using a custom path

### "Docker sandbox fails to start"
- Ensure Docker is running
- Check that port 8080 (or the configured port) is available
- Verify the Docker image is accessible

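
For the port check in the last item, a quick probe with the standard library is often enough. The sketch below is not part of DeerFlow, and `port_is_free` is a hypothetical helper. It only tests whether the port can be bound on the loopback interface, which approximates but does not exactly match how Docker publishes ports.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions error.
            return False
        return True


if __name__ == "__main__":
    print("port 8080 free:", port_is_free(8080))
```

If the probe reports the port as taken, either stop the process that holds it or change `sandbox.port` in `config.yaml`.
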
## Examples

See `config.example.yaml` for complete examples of all configuration options.