Skip to content

Email Worker: Prompt Injection Defense #1

@nev-offload

Description

@nev-offload

Overview

Implement security defenses to protect the Email Worker against prompt injection attacks. As the Email Worker processes incoming emails with LLMs, we need comprehensive detection and filtering to prevent malicious emails from manipulating AI behavior.

PRD

📄 Full PRD: https://github.com/offloadmywork/wiki/blob/main/projects/email-prompt-injection-defense.md

Key Components

1. Threat Model

  • Direct instruction override attacks
  • Data exfiltration attempts
  • Behavior manipulation
  • Obfuscated/encoded injection patterns

2. Multi-Layer Detection

  • Layer 1: Fast regex patterns (< 50ms)
  • Layer 2: LLM-based classification (1-5s)
  • Layer 3: HTML/encoding heuristics
  • Layer 4: Output validation

3. Filtering Actions

  • Flag: Mark suspicious but deliver
  • Quarantine: Move to review queue
  • Reject: Block at ingestion
  • Sanitize: Remove dangerous content

4. Pipeline Integration

  • Before-insert filtering (primary)
  • After-insert analysis (supplementary)
  • Async processing for expensive checks

5. Metrics & Monitoring

  • Detection rate, false positives
  • Attack pattern trends
  • Review queue health
  • Performance metrics

6. Appeal System

  • User-friendly appeal process
  • Manual review dashboard
  • Auto-approval heuristics
  • ML feedback loop

Implementation Phases

Phase 1 (Week 1-2): Foundation - Regex detection, logging, quarantine
Phase 2 (Week 3-4): Advanced detection - LLM classifier, heuristics
Phase 3 (Week 5-6): UX - Appeal system, notifications
Phase 4 (Week 7-8): Optimization - Fine-tuning, performance
Phase 5 (Ongoing): Monitoring & iteration

Success Criteria

  • ✅ 95%+ detection rate for known injection patterns
  • ✅ <5% false positive rate
  • ✅ <100ms Layer 1 detection latency
  • ✅ <24h average appeal response time
  • ✅ Comprehensive audit trail

Labels

security, enhancement, email-processing, llm-safety

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions