Initial commit: hardened DeerFlow factory

Vendored deer-flow upstream (bytedance/deer-flow) plus prompt-injection hardening:

- New deerflow.security package: content_delimiter, html_cleaner, sanitizer
  (8 layers: invisible chars, control chars, symbols, NFC, PUA, tag chars,
  horizontal whitespace collapse with newline/tab preservation, length cap)
- New deerflow.community.searx package: web_search, web_fetch, image_search
  backed by a private SearX instance, every external string sanitized and
  wrapped in <<<EXTERNAL_UNTRUSTED_CONTENT>>> delimiters
- All native community web providers (ddg_search, tavily, exa, firecrawl,
  jina_ai, infoquest, image_search) replaced with hard-fail stubs that raise
  NativeWebToolDisabledError at import time, so a misconfigured tool.use path
  fails loud rather than silently falling back to unsanitized output
- Native client back-doors (jina_client.py, infoquest_client.py) stubbed too
- Native-tool tests quarantined under tests/_disabled_native/
  (collect_ignore_glob via local conftest.py)
- Sanitizer Layer 7 fix: only collapse horizontal whitespace, preserve
  newlines and tabs so list/table structure survives
- Hardened runtime config.yaml references only the searx-backed tools
- Factory overlay (backend/) kept in sync with deer-flow tree as a
  reference / source

See HARDENING.md for the full audit trail and verification steps.
2026-04-12 14:23:57 +02:00
commit 6de0bf9f5b
889 changed files with 173052 additions and 0 deletions
--- a/deer-flow/skills/public/github-deep-research/SKILL.md
+++ b/deer-flow/skills/public/github-deep-research/SKILL.md
@@ -0,0 +1,166 @@
+---
+name: github-deep-research
+description: Conduct multi-round deep research on any GitHub repository. Use when users request comprehensive analysis, timeline reconstruction, competitive analysis, or in-depth investigation of a GitHub project. Produces structured Markdown reports with executive summaries, chronological timelines, metrics analysis, and Mermaid diagrams. Triggers on GitHub repository URLs or open-source projects.
+---
+
+# GitHub Deep Research Skill
+
+Multi-round research combining the GitHub API, web_search, and web_fetch to produce comprehensive Markdown reports.
+
+## Research Workflow
+
+- Round 1: GitHub API
+- Round 2: Discovery
+- Round 3: Deep Investigation
+- Round 4: Deep Dive
+
+## Core Methodology
+
+### Query Strategy
+
+**Broad to Narrow**: Start with the GitHub API, then run general queries, and refine them based on findings.
+
+```
+Round 1: GitHub API
+Round 2: "{topic} overview"
+Round 3: "{topic} architecture", "{topic} vs alternatives"
+Round 4: "{topic} issues", "{topic} roadmap", "site:github.com {topic}"
+```
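The query templates for Rounds 2-4 can be expanded mechanically. The following is a minimal sketch, not part of the skill itself: `ROUND_TEMPLATES` and `build_queries` are hypothetical names, and Round 1 is omitted because it uses the GitHub API rather than search queries.

```python
# Query templates copied from the Query Strategy block above.
# The dict layout and function name are illustrative assumptions.
ROUND_TEMPLATES = {
    2: ["{topic} overview"],
    3: ["{topic} architecture", "{topic} vs alternatives"],
    4: ["{topic} issues", "{topic} roadmap", "site:github.com {topic}"],
}


def build_queries(topic: str) -> dict[int, list[str]]:
    """Fill each round's templates with the given topic."""
    return {
        round_no: [template.format(topic=topic) for template in templates]
        for round_no, templates in ROUND_TEMPLATES.items()
    }


queries = build_queries("deer-flow")
for round_no, round_queries in queries.items():
    print(f"Round {round_no}: {round_queries}")
```

Keeping the templates in one table makes it easy to refine later rounds based on what the earlier rounds find.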
+
+**Source Prioritization**:
+1. Official docs/repos (highest weight)
+2. Technical blogs (Medium, Dev.to)
+3. News articles (verified outlets)
+4. Community discussions (Reddit, HN)
+5. Social media (lowest weight, for sentiment)
+
+### Research Rounds
+
+**Round 1 - GitHub API**
+Execute `scripts/github_api.py` directly, without first loading it via `read_file()`:
+```bash
+python /path/to/skill/scripts/github_api.py <owner> <repo> summary
+python /path/to/skill/scripts/github_api.py <owner> <repo> readme
+python /path/to/skill/scripts/github_api.py <owner> <repo> tree
+```
+
+**Available commands (the last argument of `github_api.py`):**
+- summary
+- info
+- readme
+- tree
+- languages
+- contributors
+- commits
+- issues
+- prs
+- releases
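If the Round 1 calls are assembled programmatically, validating the command name before launching the script catches typos early. This is a sketch under assumptions: `build_argv` is a hypothetical helper, and `SCRIPT` is the placeholder path from the skill text, not a real location.

```python
import sys

# Commands listed above; each is the last argument of github_api.py.
COMMANDS = (
    "summary", "info", "readme", "tree", "languages",
    "contributors", "commits", "issues", "prs", "releases",
)

# Placeholder path copied from the skill text; substitute the real location.
SCRIPT = "/path/to/skill/scripts/github_api.py"


def build_argv(owner: str, repo: str, command: str) -> list[str]:
    """Return the argument vector for one github_api.py call."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return [sys.executable, SCRIPT, owner, repo, command]


# The three Round 1 calls shown in the bash example.
round1 = [
    build_argv("bytedance", "deer-flow", command)
    for command in ("summary", "readme", "tree")
]
```

Each list can then be passed to `subprocess.run(argv, check=True)` once `SCRIPT` points at the actual file.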
+
+**Round 2 - Discovery (3-5 web_search calls)**
+- Get overview and identify key terms
+- Find official website/repo
+- Identify main players/competitors
+
+**Round 3 - Deep Investigation (5-10 web_search + web_fetch calls)**
+- Technical architecture details
+- Timeline of key events
+- Community sentiment
+- Use web_fetch on valuable URLs for full content
+
+**Round 4 - Deep Dive**
+- Analyze commit history for timeline
+- Review issues/PRs for feature evolution
+- Check contributor activity
+
+## Report Structure
+
+Follow the template in `assets/report_template.md`:
+
+1. **Metadata Block** - Date, confidence level, subject
+2. **Executive Summary** - 2-3 sentence overview with key metrics
+3. **Chronological Timeline** - Phased breakdown with dates
+4. **Key Analysis Sections** - Topic-specific deep dives
+5. **Metrics & Comparisons** - Tables, growth charts
+6. **Strengths & Weaknesses** - Balanced assessment
+7. **Sources** - Categorized references
+8. **Confidence Assessment** - Claims by confidence level
+9. **Methodology** - Research approach used
+
+### Mermaid Diagrams
+
+Include diagrams where helpful:
+
+**Timeline (Gantt)**:
+```mermaid
+gantt
+    title Project Timeline
+    dateFormat YYYY-MM-DD
+    section Phase 1
+    Development    :2025-01-01, 2025-03-01
+    section Phase 2
+    Launch         :2025-03-01, 2025-04-01
+```
+
+**Architecture (Flowchart)**:
+```mermaid
+flowchart TD
+    A[User] --> B[Coordinator]
+    B --> C[Planner]
+    C --> D[Research Team]
+    D --> E[Reporter]
+```
+
+**Comparison (Pie/Bar)**:
+```mermaid
+pie title Market Share
+    "Project A" : 45
+    "Project B" : 30
+    "Others" : 25
+```
+
+## Confidence Scoring
+
+Assign confidence based on source quality:
+
+| Confidence | Criteria |
+|------------|----------|
+| High (90%+) | Official docs, GitHub data, multiple corroborating sources |
+| Medium (70-89%) | Single reliable source, recent articles |
+| Low (50-69%) | Social media, unverified claims, outdated info |
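If confidence is tracked as a percentage, the bands in the table can be applied with a small lookup. This is a sketch only: `confidence_label` is a hypothetical name, and because the table defines no band below 50%, the function returns `None` there rather than inventing a label.

```python
from typing import Optional


def confidence_label(percent: float) -> Optional[str]:
    """Map a percentage to the High/Medium/Low bands from the table."""
    if not 0 <= percent <= 100:
        raise ValueError(f"percent out of range: {percent}")
    if percent >= 90:
        return "High"
    if percent >= 70:
        return "Medium"
    if percent >= 50:
        return "Low"
    # The table does not define a band below 50%.
    return None


for score in (95, 75, 55, 30):
    print(score, confidence_label(score))
```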
+
+## Output
+
+Save report as: `research_{topic}_{YYYYMMDD}.md`
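One way to build that filename is sketched below. The skill does not say how `{topic}` should be normalized, so the slug rule here (lowercase, runs of non-word characters collapsed to `_`) is an assumption, as is the helper name `report_filename`.

```python
import re
from datetime import date
from typing import Optional


def report_filename(topic: str, on: Optional[date] = None) -> str:
    """Build research_{topic}_{YYYYMMDD}.md for the given topic and date."""
    # Assumed slug rule: lowercase, non-word runs become a single underscore.
    slug = re.sub(r"\W+", "_", topic.lower()).strip("_")
    day = (on or date.today()).strftime("%Y%m%d")
    return f"research_{slug}_{day}.md"


print(report_filename("Deer Flow", date(2026, 4, 12)))
```

Passing the date explicitly keeps the result reproducible; omitting it falls back to today's date.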
+
+### Formatting Rules
+
+- Chinese content: Use full-width punctuation（，。：；！？）
+- Technical terms: Provide Wiki/doc URL on first mention
+- Tables: Use for metrics, comparisons
+- Code blocks: For technical examples
+- Mermaid: For architecture, timelines, flows
+
+## Best Practices
+
+1. **Start with official sources** - Repo, docs, company blog
+2. **Verify dates from commits/PRs** - More reliable than articles
+3. **Triangulate claims** - 2+ independent sources
+4. **Note conflicting info** - Don't hide contradictions
+5. **Distinguish fact vs opinion** - Label speculation clearly
+6. **CRITICAL: Always include inline citations** - Use `[citation:Title](URL)` format immediately after each claim from external sources
+7. **Extract URLs from search results** - web_search returns {title, url, snippet} - always use the URL field
+8. **Update as you go** - Don't wait until the end to synthesize
+
+### Citation Examples
+
+**Good - With inline citations:**
+```markdown
+The project gained 10,000 stars within 3 months of launch [citation:GitHub Stats](https://github.com/owner/repo).
+The architecture uses LangGraph for workflow orchestration [citation:LangGraph Docs](https://langchain.com/langgraph).
+```
+
+**Bad - Without citations:**
+```markdown
+The project gained 10,000 stars within 3 months of launch.
+The architecture uses LangGraph for workflow orchestration.
+```
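The inline citation format can be produced directly from a search result. This is a minimal sketch, assuming each web_search result is a dict with the `title`, `url`, and `snippet` keys mentioned under Best Practices; `cite` and `cited_claim` are hypothetical helper names.

```python
def cite(result: dict) -> str:
    """Format one search result as [citation:Title](URL)."""
    return f"[citation:{result['title']}]({result['url']})"


def cited_claim(claim: str, result: dict) -> str:
    """Append the citation to a claim, before the closing period."""
    return f"{claim.rstrip('.')} {cite(result)}."


# Sample result shaped like the {title, url, snippet} structure described above.
result = {
    "title": "GitHub Stats",
    "url": "https://github.com/owner/repo",
    "snippet": "",
}
print(cited_claim("The project gained 10,000 stars within 3 months of launch.", result))
```

The output matches the first line of the "Good" example, with the citation placed immediately after the claim and before the final period.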