-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Overview
Implement security defenses to protect the Email Worker against prompt injection attacks. As the Email Worker processes incoming emails with LLMs, we need comprehensive detection and filtering to prevent malicious emails from manipulating AI behavior.
PRD
📄 Full PRD: https://github.com/offloadmywork/wiki/blob/main/projects/email-prompt-injection-defense.md
Key Components
1. Threat Model
- Direct instruction override attacks
- Data exfiltration attempts
- Behavior manipulation
- Obfuscated/encoded injection patterns
2. Multi-Layer Detection
- Layer 1: Fast regex patterns (< 50ms)
- Layer 2: LLM-based classification (1-5s)
- Layer 3: HTML/encoding heuristics
- Layer 4: Output validation
3. Filtering Actions
- Flag: Mark suspicious but deliver
- Quarantine: Move to review queue
- Reject: Block at ingestion
- Sanitize: Remove dangerous content
4. Pipeline Integration
- Before-insert filtering (primary)
- After-insert analysis (supplementary)
- Async processing for expensive checks
5. Metrics & Monitoring
- Detection rate, false positives
- Attack pattern trends
- Review queue health
- Performance metrics
6. Appeal System
- User-friendly appeal process
- Manual review dashboard
- Auto-approval heuristics
- ML feedback loop
Implementation Phases
Phase 1 (Week 1-2): Foundation - Regex detection, logging, quarantine
Phase 2 (Week 3-4): Advanced detection - LLM classifier, heuristics
Phase 3 (Week 5-6): UX - Appeal system, notifications
Phase 4 (Week 7-8): Optimization - Fine-tuning, performance
Phase 5 (Ongoing): Monitoring & iteration
Success Criteria
- ✅ 95%+ detection rate for known injection patterns
- ✅ <5% false positive rate
- ✅ <100ms Layer 1 detection latency
- ✅ <24h average appeal response time
- ✅ Comprehensive audit trail
Labels
security, enhancement, email-processing, llm-safety
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request