Initial commit: hardened DeerFlow factory

Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection hardening:

- New deerflow.security package: content_delimiter, html_cleaner, sanitizer (8 layers: invisible chars, control chars, symbols, NFC, PUA, tag chars, horizontal whitespace collapse with newline/tab preservation, length cap)
- New deerflow.community.searx package: web_search, web_fetch, image_search backed by a private SearX instance, every external string sanitized and wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>> delimiters
- All native community web providers (ddg_search, tavily, exa, firecrawl, jina_ai, infoquest, image_search) replaced with hard-fail stubs that raise NativeWebToolDisabledError at import time, so a misconfigured tool.use path fails loud rather than silently falling back to unsanitized output
- Native client back-doors (jina_client.py, infoquest_client.py) stubbed too
- Native-tool tests quarantined under tests/_disabled_native/ (collect_ignore_glob via local conftest.py)
- Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve newlines and tabs so list/table structure survives
- Hardened runtime config.yaml references only the searx-backed tools
- Factory overlay (backend/) kept in sync with deer-flow tree as a reference / source

See HARDENING.md for the full audit trail and verification steps.
2026-04-12 14:23:57 +02:00
commit 6de0bf9f5b
889 changed files with 173052 additions and 0 deletions
--- a/deer-flow/skills/public/deep-research/SKILL.md
+++ b/deer-flow/skills/public/deep-research/SKILL.md
@@ -0,0 +1,198 @@
+---
+name: deep-research
+description: Use this skill instead of WebSearch for ANY question requiring web research. Trigger on queries like "what is X", "explain X", "compare X and Y", "research X", or before content generation tasks. Provides systematic multi-angle research methodology instead of single superficial searches. Use this proactively when the user's question needs online information.
+---
+
+# Deep Research Skill
+
+## Overview
+
+This skill provides a systematic methodology for conducting thorough web research. **Load this skill BEFORE starting any content generation task** to ensure you gather sufficient information from multiple angles, depths, and sources.
+
+## When to Use This Skill
+
+**Always load this skill when:**
+
+### Research Questions
+- User asks "what is X", "explain X", "research X", "investigate X"
+- User wants to understand a concept, technology, or topic in depth
+- The question requires current, comprehensive information from multiple sources
+- A single web search would be insufficient to answer properly
+
+### Content Generation (Pre-research)
+- Creating presentations (PPT/slides)
+- Creating frontend designs or UI mockups
+- Writing articles, reports, or documentation
+- Producing videos or multimedia content
+- Any content that requires real-world information, examples, or current data
+
+## Core Principle
+
+**Never generate content based solely on general knowledge.** The quality of your output directly depends on the quality and quantity of research conducted beforehand. A single search query is NEVER enough.
+
+## Research Methodology
+
+### Phase 1: Broad Exploration
+
+Start with broad searches to understand the landscape:
+
+1. **Initial Survey**: Search for the main topic to understand the overall context
+2. **Identify Dimensions**: From initial results, identify key subtopics, themes, angles, or aspects that need deeper exploration
+3. **Map the Territory**: Note different perspectives, stakeholders, or viewpoints that exist
+
+Example:
+```
+Topic: "AI in healthcare"
+Initial searches:
+- "AI healthcare applications [current year]"
+- "artificial intelligence medical diagnosis"
+- "healthcare AI market trends"
+
+Identified dimensions:
+- Diagnostic AI (radiology, pathology)
+- Treatment recommendation systems
+- Administrative automation
+- Patient monitoring
+- Regulatory landscape
+- Ethical considerations
+```
+
+### Phase 2: Deep Dive
+
+For each important dimension identified, conduct targeted research:
+
+1. **Specific Queries**: Search with precise keywords for each subtopic
+2. **Multiple Phrasings**: Try different keyword combinations and phrasings
+3. **Fetch Full Content**: Use `web_fetch` to read important sources in full, not just snippets
+4. **Follow References**: When sources mention other important resources, search for those too
+
+Example:
+```
+Dimension: "Diagnostic AI in radiology"
+Targeted searches:
+- "AI radiology FDA approved systems"
+- "chest X-ray AI detection accuracy"
+- "radiology AI clinical trials results"
+
+Then fetch and read:
+- Key research papers or summaries
+- Industry reports
+- Real-world case studies
+```
+
+### Phase 3: Diversity & Validation
+
+Ensure comprehensive coverage by seeking diverse information types:
+
+| Information Type | Purpose | Example Searches |
+|-----------------|---------|------------------|
+| **Facts & Data** | Concrete evidence | "statistics", "data", "numbers", "market size" |
+| **Examples & Cases** | Real-world applications | "case study", "example", "implementation" |
+| **Expert Opinions** | Authority perspectives | "expert analysis", "interview", "commentary" |
+| **Trends & Predictions** | Future direction | "trends [current year]", "forecast", "future of" |
+| **Comparisons** | Context and alternatives | "vs", "comparison", "alternatives" |
+| **Challenges & Criticisms** | Balanced view | "challenges", "limitations", "criticism" |
+
+### Phase 4: Synthesis Check
+
+Before proceeding to content generation, verify:
+
+- [ ] Have I searched from at least 3 different angles, and ideally 5?
+- [ ] Have I fetched and read the most important sources in full?
+- [ ] Do I have concrete data, examples, and expert perspectives?
+- [ ] Have I explored both positive aspects and challenges/limitations?
+- [ ] Is my information current and from authoritative sources?
+
+**If any answer is NO, continue researching before generating content.**
+
+## Search Strategy Tips
+
+### Effective Query Patterns
+
+```
+# Be specific with context
+❌ "AI trends"
+✅ "enterprise AI adoption trends [current year]"
+
+# Include authoritative source hints
+"[topic] research paper"
+"[topic] McKinsey report"
+"[topic] industry analysis"
+
+# Search for specific content types
+"[topic] case study"
+"[topic] statistics"
+"[topic] expert interview"
+
+# Use temporal qualifiers — always use the ACTUAL current year from <current_date>
+"[topic] 2026"   # ← replace with real current year, never hardcode a past year
+"[topic] latest"
+"[topic] recent developments"
+```
+
+### Temporal Awareness
+
+**Always check `<current_date>` in your context before forming ANY search query.**
+
+`<current_date>` gives you the full date: year, month, day, and weekday (e.g. `2026-02-28, Saturday`). Use the right level of precision depending on what the user is asking:
+
+| User intent | Temporal precision needed | Example query |
+|---|---|---|
+| "today / this morning / just released" | **Month + Day** | `"tech news February 28 2026"` |
+| "this week" | **Week range** | `"technology releases week of Feb 23 2026"` |
+| "recently / latest / new" | **Month** | `"AI breakthroughs February 2026"` |
+| "this year / trends" | **Year** | `"software trends 2026"` |
+
+**Rules:**
+- When the user asks about "today" or "just released", use **month + day + year** in your search queries to get same-day results
+- Never drop to year-only when day-level precision is needed — `"tech news 2026"` will NOT surface today's news
+- Try multiple phrasings: numeric form (`2026-02-28`), written form (`February 28 2026`), and relative terms (`today`, `this week`) across different queries
+
+❌ User asks "what's new in tech today" → searching `"new technology 2026"` → misses today's news
+✅ User asks "what's new in tech today" → searching `"new technology February 28 2026"` + `"tech news today Feb 28"` → gets today's results
+
+### When to Use web_fetch
+
+Use `web_fetch` to read full content when:
+- A search result looks highly relevant and authoritative
+- You need detailed information beyond the snippet
+- The source contains data, case studies, or expert analysis
+- You want to understand the full context of a finding
+
+### Iterative Refinement
+
+Research is iterative. After initial searches:
+1. Review what you've learned
+2. Identify gaps in your understanding
+3. Formulate new, more targeted queries
+4. Repeat until you have comprehensive coverage
+
+## Quality Bar
+
+Your research is sufficient when you can confidently answer:
+- What are the key facts and data points?
+- What are 2-3 concrete real-world examples?
+- What do experts say about this topic?
+- What are the current trends and future directions?
+- What are the challenges or limitations?
+- What makes this topic relevant or important now?
+
+## Common Mistakes to Avoid
+
+- ❌ Stopping after 1-2 searches
+- ❌ Relying on search snippets without reading full sources
+- ❌ Searching only one aspect of a multi-faceted topic
+- ❌ Ignoring contradicting viewpoints or challenges
+- ❌ Using outdated information when current data exists
+- ❌ Starting content generation before research is complete
+
+## Output
+
+After completing research, you should have:
+1. A comprehensive understanding of the topic from multiple angles
+2. Specific facts, data points, and statistics
+3. Real-world examples and case studies
+4. Expert perspectives and authoritative sources
+5. Current trends and relevant context
+
+**Only then proceed to content generation**, using the gathered information to create high-quality, well-informed content.
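
The Temporal Awareness table in SKILL.md above maps user intent to a level of date precision. A minimal Python sketch of that mapping follows. The function name `temporal_queries`, the intent labels, and the exact query phrasings are hypothetical illustrations; none of them are part of the skill or of deer-flow. The sketch assumes the caller has already parsed `<current_date>` into a `datetime.date` and treats Monday as the start of the week.

```python
from datetime import date, timedelta

# Hypothetical intent labels mirroring the four rows of the Temporal Awareness table.
DAY, WEEK, MONTH, YEAR = "day", "week", "month", "year"

# English month names, spelled out so the output does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def temporal_queries(topic: str, intent: str, today: date) -> list[str]:
    """Build search queries whose date precision matches the user's intent."""
    month = MONTHS[today.month - 1]
    if intent == DAY:
        # Day-level intent: written form, numeric form, and a relative term.
        return [
            f"{topic} {month} {today.day} {today.year}",
            f"{topic} {today.isoformat()}",
            f"{topic} today",
        ]
    if intent == WEEK:
        # Anchor the week on its Monday (date.weekday() returns 0 for Monday).
        monday = today - timedelta(days=today.weekday())
        return [
            f"{topic} week of {MONTHS[monday.month - 1]} {monday.day} {monday.year}",
            f"{topic} this week",
        ]
    if intent == MONTH:
        return [f"{topic} {month} {today.year}", f"{topic} latest"]
    if intent == YEAR:
        return [f"{topic} {today.year}"]
    raise ValueError(f"unknown intent: {intent!r}")
```

For example, `temporal_queries("tech news", DAY, date(2026, 2, 28))` yields `"tech news February 28 2026"`, `"tech news 2026-02-28"`, and `"tech news today"`, which covers the written, numeric, and relative phrasings the rules ask for. Classifying a user message into one of the four intents is left out on purpose, since the skill leaves that judgment to the model.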