VoxGuard 🛡️

Real-Time Multimodal AI Scam Detection with Live Intervention

Protecting you during the call, not after. When a scammer asks for your OTP, VoxGuard steps in before you hand it over.

VoxGuard Architecture

License: MIT Python 3.11+ Rust Gemini Live API Google Cloud Run Built for

To our knowledge, the first real-time multimodal scam detection agent with active intervention. Gemini Live API + Rust WASM + Psychological AI + Natural Voice TTS = Protection in under 80ms.

Live Demo Demo Video Medium Article For Judges


🎯 For Judges: 2-Minute Guide

TL;DR: Open the demo, click START, pick a demo script, watch real-time scam detection happen, then watch the system intervene and block you from giving away your money - with a natural human voice warning you in your language.

In 30 seconds (Demo Mode, no microphone needed):

1. Open https://voxguard-kappa.vercel.app
2. Click the "MONITOR" tab (default view)
3. Click "START"
4. Click any Demo Script (e.g., "Bank Impersonation")
5. Watch: alerts fire in real-time, threat score rises, psych vectors light up
6. When the caller asks for your OTP -> INTERVENTION OVERLAY fires automatically
7. HEAR: VoxGuard speaks a natural voice warning via Gemini TTS
8. SEE: Multimodal Explanation Card shows WHY this is a scam (audio + visual signals)
9. Take the Verification Challenge or hit SAFE EXIT
10. SEE: Guided Action Agent gives you step-by-step recovery actions for your country
11. Click "REPORT" tab -> see forensic report with intervention history -> export PDF

What to look for:

| Tab | What It Demonstrates |
| --- | --- |
| MONITOR | 2-way dialog (ME + CALLER), waveform, <80ms alerts, caller HUD, live intervention overlay, natural voice warnings, explanation cards |
| PSYCH | 6 Cialdini vectors + 5 lie detection indicators + user vulnerability (world first) |
| PATTERNS | 50+ grounded patterns with fullscreen detail view + interpretation |
| REPORT | Full transcript, intervention history, guided action plan, forensic export (PDF + HTML), session gallery |
| ABOUT | Architecture + data sources + why this is unprecedented |

Innovation in one sentence:

Every other tool either blocks calls before they connect or sends a passive alert after the damage is done. VoxGuard is the first system that protects you while the scammer is talking - with a natural voice that warns you, an AI that explains why this is dangerous, and an agent that walks you step-by-step to safety.


โš ๏ธ The Problem

Every 30 seconds, someone somewhere in the world loses money to a phone or video call scam.

Last year, my neighbor's father wired $12,000 to someone impersonating a bank representative. He knew scams were everywhere. He had seen the warnings, read the advice, and understood the basics. But when the caller said his account would be frozen within ten minutes and asked for his one-time password, he handed it over immediately.

I kept thinking about that moment. Not because of the money, but because of what it revealed: most anti-scam advice fails at the exact moment it matters. He had a phone, life experience, and enough caution to know better. None of that helped in the thirty seconds when pressure, fear, and urgency took over.

And that gap is far bigger than one family. According to the FBI's IC3 2024 Annual Report, reported internet crime losses in the United States hit $16.6 billion in 2024. In its 2024 report, the Global Anti-Scam Alliance estimated that consumers worldwide lost more than $1.026 trillion to scams.

So I built VoxGuard: a real-time multimodal AI agent that listens to live conversations, detects scam patterns as they emerge, and intervenes before the damage is done. Not after the call. Not the next day. Right at the moment the scammer asks for your one-time password.

Every existing tool shares one fatal flaw: they act after the damage is done.

"The difference between a scam succeeding and failing is often a single moment of doubt. VoxGuard creates that moment, then forces you to think before you act."

Wiqi Lee


🚀 What Makes VoxGuard Unprecedented

Features
| Feature | Truecaller | Hiya | ScamShield (SG) | VoxGuard |
| --- | --- | --- | --- | --- |
| Pre-call blocking | Yes | Yes | Yes | No (by design) |
| During-call analysis | No | No | No | Yes (First) |
| Live intervention (blocks fatal actions) | No | No | No | Yes (First) |
| Natural voice intervention (Gemini TTS) | No | No | No | Yes (First) |
| Multimodal explanation cards (audio + vision) | No | No | No | Yes (First) |
| Guided anti-scam action agent | No | No | No | Yes (First) |
| Scenario-based verification challenge | No | No | No | Yes (First) |
| Auto-disconnect countdown | No | No | No | Yes (First) |
| Multimodal (audio + vision) | No | No | No | Yes (First) |
| 2-way transcript (ME + CALLER) | No | No | No | Yes (First) |
| Screen share scam detection | No | No | No | Yes (First) |
| Sub-100ms alert latency | No | No | No | Yes (Rust WASM) |
| Psychological manipulation scoring | No | No | No | Yes (First) |
| Lie detection analysis | No | No | No | Yes (First) |
| User vulnerability scoring | No | No | No | Yes (First) |
| Multi-language support | Partial | Partial | SG only | Yes, 40 languages |
| Per-country recommended actions | No | No | No | Yes (9 countries) |
| Country-specific emergency contacts + reporting | No | No | No | Yes (9 countries) |
| Intervention history in forensic reports | No | No | No | Yes (First) |
| Session gallery with playback | No | No | No | Yes |
| Audio session recording (REC) | No | No | No | Yes (First) |
| Forensic export (PDF + HTML) | No | No | No | Yes |
| Grounded to global scam databases | No | Partial | Partial | Yes |
| Works on any call platform (desktop + mobile) | No | No | No | Yes (browser-based, responsive) |

🎮 Two Operating Modes

VoxGuard operates in two modes that can run simultaneously:

Demo Mode (No setup needed)

Open voxguard-kappa.vercel.app, click START, and select a demo script.

VoxGuard plays a simulated scam conversation with realistic 2-way dialog (ME + CALLER), fires real-time alerts, and triggers live interventions when danger is detected. Demo Mode runs entirely in the browser: no microphone, no backend, no API key required. This is the fastest way for judges to see the full detection → intervention → action agent flow.

Available Demo Scripts: Bank Impersonation, Tech Support Scam, Government/Tax Scam, AI Voice Clone, Digital Arrest, Job Offer Scam, Family Emergency, plus regional variants in Indonesian, Chinese, Japanese, Korean, Spanish, French, Hindi, Arabic, Malay, and Portuguese.

Live Mode (Real call protection)

Click START, then click the 🎙 LIVE MIC button. Grant microphone permission.

VoxGuard captures your actual microphone audio at 16kHz Mono PCM via the Web Audio API. Audio is preprocessed through the Rust WASM engine (Wiener noise reduction, spectral subtraction, VAD) and streamed to the backend through WebSocket. The backend sends audio to Gemini for real-time transcription and threat analysis.

How to use with a real call:

  1. Click START to begin a session
  2. Click 🎙 LIVE MIC and grant microphone permission
  3. A green status banner appears: LIVE MICROPHONE - Real audio capture active
  4. Put your phone on speaker and place it near your device, or use the same device for both the call and VoxGuard
  5. VoxGuard listens to the live call and detects scam patterns as they happen
  6. Optionally click REC to record session audio into the forensic report
  7. Click 🎙 MIC ON again to disconnect the microphone

Requirements: Live Mode requires a running backend on Google Cloud Run with a valid Gemini API key (VITE_WS_URL environment variable pointing to the backend WebSocket endpoint).

Live Mic + Demo Scripts: You can run a demo script while Live Mic is active. The demo provides simulated dialog and alerts, while the microphone captures real ambient audio in parallel. This is useful for demonstrations where you want to show both the scripted flow and the live microphone capability side by side.

Live Mic + REC: Enable both Live Mic and REC to record your actual call audio into the forensic report. REC is only available when Live Mic is active, because it records directly from the microphone. This is the recommended setup for real-world use.


🚨 Live Scam Intervention

Live Scam Intervention Demo

No existing tool we surveyed does this. This is what sets VoxGuard apart.

When VoxGuard's threat engine determines you are about to take an irreversible action, it does not wait for you to check a dashboard. It takes over your screen, speaks to you in a natural human voice, explains exactly why this is dangerous, and forces a decision point.

Three Escalation Levels

| Level | Trigger | What Happens |
| --- | --- | --- |
| ⚠️ WARN | Threat score crosses 55, or a high-risk manipulation pattern is detected | Amber warning banner. Natural voice: calm, advisory tone. Verify Caller + Safe Exit + Continue With Caution. Speech pauses. |
| 🛑 BLOCK | Threat score crosses 75, or the caller requests OTP / account credentials / gift cards / crypto transfer | Full-screen red overlay. Natural voice: firm, urgent tone. Fatal patterns (OTP/transfer/crypto): Safe Exit only. Verifiable patterns: Verification Challenge + Safe Exit. |
| 🚨 LOCKDOWN | Threat score crosses 90. Confirmed scam with maximum confidence. | Full-screen red lockdown with 30-second auto-disconnect countdown. Natural voice: commanding, sharp tone. Safe Exit only. No challenge - too dangerous. |

Instant Intervention

Some patterns are so dangerous that VoxGuard does not wait for the threat score to accumulate. These high-lethality patterns trigger an immediate BLOCK-level intervention the moment they are detected, regardless of cumulative score:

  • OTP / Credential Extraction ("Read me the code", "Confirm your PIN") → Safe Exit only
  • Safe Account Transfer ("Transfer your funds to this protection account") → Safe Exit only
  • Gift Card Demand ("Purchase prepaid cards and read me the numbers") → Safe Exit only
  • Crypto Transfer Scam ("Send Bitcoin to this wallet address") → Safe Exit only

These fatal patterns skip the Verification Challenge entirely - when someone is actively extracting your credentials, the only safe action is to disconnect.
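
The score thresholds and the instant-pattern rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the project's `threat_engine.py`: the function name, the pattern identifiers, and the reading of "crosses" as "greater than or equal to" are all hypothetical. The WARN trigger for high-risk manipulation patterns is left out for brevity.

```python
from typing import Optional

# Thresholds taken from the escalation table above.
WARN_SCORE, BLOCK_SCORE, LOCKDOWN_SCORE = 55, 75, 90

# Hypothetical identifiers for the four fatal pattern families.
FATAL_PATTERNS = {
    "otp_extraction",
    "safe_account_transfer",
    "gift_card_demand",
    "crypto_transfer",
}


def decide_intervention(score: float, pattern: Optional[str] = None) -> dict:
    """Return the intervention level and whether a challenge may be offered."""
    fatal = pattern in FATAL_PATTERNS
    if score >= LOCKDOWN_SCORE:
        level = "LOCKDOWN"
    elif score >= BLOCK_SCORE or fatal:
        # A fatal pattern forces BLOCK regardless of the cumulative score.
        level = "BLOCK"
    elif score >= WARN_SCORE:
        level = "WARN"
    else:
        level = None
    # Fatal patterns and LOCKDOWN offer Safe Exit only, so no challenge.
    challenge_allowed = level in ("WARN", "BLOCK") and not fatal
    return {"level": level, "challenge_allowed": challenge_allowed}


# A low score with an OTP request still blocks, with Safe Exit only.
print(decide_intervention(20, "otp_extraction"))
```

Keeping the fatal-pattern check separate from the score comparison mirrors the text: the score decides escalation in the normal case, while a small set of patterns short-circuits it.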

This works across all supported languages. If the caller asks for your OTP in Indonesian, Chinese, Japanese, Korean, Spanish, French, Hindi, or Arabic, VoxGuard blocks it instantly.

🔊 Natural Voice Intervention (Gemini TTS)

VoxGuard doesn't just show you a warning - it speaks to you.

When an intervention fires, gemini-2.5-flash-preview-tts generates a natural human voice warning that matches the detected scam type, the urgency level, and the user's language. This is not robotic text-to-speech - it sounds like a real person warning you.

| Level | Voice Profile | Example |
| --- | --- | --- |
| WARN | Calm, advisory (Kore) | "Caution. VoxGuard has detected suspicious patterns in this call. The caller may not be who they claim to be." |
| BLOCK | Firm, authoritative (Puck) | "Stop immediately. The caller is asking for your one-time password. A real bank will never ask for this. Hang up now." |
| LOCKDOWN | Sharp, commanding (Charon) | "Emergency. VoxGuard has confirmed this is a scam. This call will disconnect in 30 seconds." |

Voice warnings are:

  • Contextual - different scripts for bank impersonation, OTP extraction, gift card demand, government scam, tech support, and more
  • Localized - fully scripted in 9 languages (EN, ID, ZH, JA, KO, ES, FR, HI, AR)
  • Graceful fallback - if Gemini TTS is unavailable, falls back to browser speech synthesis automatically
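
The voice selection and fallback behaviour listed above could look roughly like the sketch below. It makes no call to any TTS service; it only shows the decision. The function name, the returned field names, the engine labels, and the English fallback for unscripted languages are assumptions made for illustration, while the three voice names and nine language codes come from the text.

```python
from typing import Dict

# Voice names per level, as listed in the table above.
VOICE_BY_LEVEL: Dict[str, str] = {
    "WARN": "Kore",
    "BLOCK": "Puck",
    "LOCKDOWN": "Charon",
}

# The nine fully scripted languages named above.
SCRIPTED_LANGUAGES = {"en", "id", "zh", "ja", "ko", "es", "fr", "hi", "ar"}


def plan_voice_warning(level: str, language: str, tts_available: bool = True) -> Dict[str, str]:
    """Choose engine, voice, and script language for one intervention."""
    # Assumption: languages without a script fall back to the English script.
    script_language = language if language in SCRIPTED_LANGUAGES else "en"
    if tts_available:
        return {
            "engine": "gemini-tts",
            "voice": VOICE_BY_LEVEL[level],
            "language": script_language,
        }
    # Fallback path: browser speech synthesis has no Gemini voice name.
    return {
        "engine": "browser-speech-synthesis",
        "voice": "default",
        "language": script_language,
    }


print(plan_voice_warning("BLOCK", "id"))
print(plan_voice_warning("WARN", "de", tts_available=False))
```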

📋 Multimodal Explanation Cards

After a high-severity alert, VoxGuard generates an Explanation Card that tells you in plain language why this is dangerous - combining evidence from both audio and visual analysis.

Example Explanation Card:

🚨 CRITICAL: Bank Impersonation + Fake Login Page

🎙️ Audio | 🖥️ Screen | 95% confidence

The caller claims to be from your bank's fraud department (Authority tactic) while your screen shows a login page at 'bank-secure-verify.com' - this is NOT your real bank's domain. The combination of voice impersonation and a phishing page is a confirmed scam technique.

✅ End this call and call your bank using the number on the BACK of your card.

Each card shows:

  • Signal badges - which sources detected the threat (Audio, Screen, or both)
  • Confidence score - how certain VoxGuard is
  • Expandable signals - detailed breakdown of each detected pattern with source and severity
  • Risk factors - specific quotes or elements that triggered the alert
  • Recommended action - one clear thing to do right now

Scenario-Based Verification Challenge

When a non-fatal WARN or BLOCK-level intervention fires (e.g., impersonation, fake support, government scare), the user can take a Verification Challenge. Unlike generic questionnaires, VoxGuard's challenges are contextual to the detected scam type:

Bank Impersonation (2-3 questions):

  • "Did this caller contact you first, or did you call them?"
  • "Are they asking you to share your OTP, PIN, or password?"
  • "Did they tell you NOT to call your bank directly?"

Government Impersonation (2 questions):

  • "Is this caller threatening arrest or legal action if you don't pay now?"
  • "Are they demanding payment via gift cards, crypto, or wire transfer?"

Tech Support Impersonation (3 questions):

  • "Did this caller contact you first about a 'virus' or 'security issue'?"
  • "Are they asking you to install remote access software?"
  • "Are they rushing you to act immediately?"

The challenge library covers 7 scam types (bank, government, tech support, investment, family impersonation, prize/lottery, urgency), with a generic fallback for unknown patterns.

After the user answers, VoxGuard provides a clear result:

  • ⚠ LIKELY SCAM → recommends immediate disconnection
  • ⚡ EXERCISE CAUTION → offers "I Will Verify Through Official Channel" action with specific guidance (e.g., "Call the number on the BACK of your bank card")

Challenges are fully localized in 9 languages (EN, ID, ZH, JA, KO, ES, FR, HI, AR).

Safe Exit - End Call

The End Call - Safe Exit button is the primary protective action. When pressed:

  1. Voice warning stops immediately (Gemini TTS + browser synthesis)
  2. Speech synthesis stops immediately
  3. Demo/call playback halts completely
  4. Detection stream terminates
  5. Session status changes to terminated
  6. Guided Action Agent launches with personalized recovery steps
  7. App switches to the REPORT tab automatically
  8. Forensic report displays with full session data, intervention history, and export options

This ensures the user experiences a clear, decisive break from the scam call - not just a UI dismiss.

๐Ÿ›ก๏ธ Guided Anti-Scam Action Agent

After Safe Exit, VoxGuard becomes your recovery coach.

Instead of just saying "call your bank", VoxGuard generates a personalized, step-by-step action plan based on the scam type, your country, and how severe the threat was. Each step has a checkbox so you can track your progress.

Example Action Plan (Bank Impersonation, Indonesia, Critical):

๐Ÿ›ก๏ธ Anti-Scam Action Plan
๐Ÿšจ CRITICAL | ๐Ÿ‡ฎ๐Ÿ‡ฉ Indonesia | โฑ 15-30 minutes

โš ๏ธ Act within the next 15 minutes. Time is critical.

๐Ÿค– Based on the call pattern, the caller was impersonating a bank officer
   and attempted to extract your OTP. If you shared any codes, contact
   your bank immediately to freeze your account.

โ˜ ๐Ÿ“ต Blokir nomor penelepon di pengaturan HP Anda          [immediate]
โ˜ ๐Ÿฆ Hubungi bank menggunakan nomor di BELAKANG kartu ATM   [critical]
โ˜ ๐Ÿ”’ Minta bank untuk blokir sementara rekening             [critical]
โ˜ ๐Ÿ“‹ Laporkan ke OJK: 157 atau konsumen@ojk.go.id           [high]
โ˜ ๐Ÿš” Laporkan ke Bareskrim: patrolisiber.id atau 110        [recommended]
โ˜ ๐Ÿ”‘ Ganti password semua akun yang dibicarakan             [recommended]

๐Ÿšจ Emergency: Call 110 if you feel in danger

Supported countries (9):

| Country | Flag | Emergency | Reporting Channels |
| --- | --- | --- | --- |
| United States | 🇺🇸 | 911 | FTC, FBI IC3, Credit bureaus |
| Indonesia | 🇮🇩 | 110 | OJK, Bareskrim, Bank Indonesia |
| China | 🇨🇳 | 110 | National Anti-Fraud Center app |
| Japan | 🇯🇵 | 110 | NPA #9110, Consumer Hotline 188 |
| South Korea | 🇰🇷 | 112 | FSS 1332, Cyber investigation |
| Spain | 🇪🇸 | 112 | Guardia Civil, INCIBE 017 |
| France | 🇫🇷 | 17 | PHAROS, Info Escroqueries |
| India | 🇮🇳 | 112 | Cybercrime 1930, cybercrime.gov.in |
| Singapore | 🇸🇬 | 999 | ScamShield app, SPF, NCPC ScamAlert |

Action plans are:

  • Personalized - steps prioritized based on what was detected (OTP extraction gets bank-first, gift card gets report-first)
  • AI-enhanced - Gemini adds personalized advice based on the specific call transcript
  • Interactive - checkbox progress tracking with completion percentage
  • Localized - action text in the user's language with local phone numbers and websites

Safe Exit Actions

Every intervention displays localized safe exit actions specific to the user's country:

  • 📵 Hang up now
  • 🏦 Call your bank's real number (the one on the back of your card)
  • 👥 Call a trusted family member before doing anything
  • 🚫 Never share OTP / PIN / password on a call

These are available in English, Indonesian, Chinese, Japanese, Korean, Spanish, French, Hindi, and Arabic.

Why This Matters

Scammers succeed because they keep victims in a state of panic that shuts down rational thinking. The caller manufactures urgency ("your account will be frozen in 10 minutes"), establishes false authority ("I am calling from the fraud department"), and enforces isolation ("do not contact your bank directly").

VoxGuard's three-layer defense breaks that panic loop:

  1. Voice Intervention physically interrupts the scammer's narrative with a calm, authoritative human voice speaking directly to the victim
  2. Explanation Card replaces confusion with clarity, showing exactly which signals triggered the alert and why
  3. Action Agent replaces helplessness with a concrete plan, providing step-by-step instructions with local emergency numbers

The intervention overlay physically breaks that panic loop. It pauses the conversation. It forces a moment of reflection. The scenario-based verification challenge makes the victim confront the reality of what is happening, with questions tailored to the specific scam type they are experiencing. And if the victim is too far gone to respond, the 30-second lockdown countdown ends the call automatically.

No existing scam detection tool we surveyed does this. Every other system either blocks calls before they connect (which misses new numbers) or sends a passive notification after the call ends (which is too late). VoxGuard is, to our knowledge, the first system that intervenes during the critical moment when the victim is about to hand over their money.

Intervention Tracking

Every intervention event is tracked and preserved:

  • Intervention counter displays in the monitor tab during active sessions
  • Intervention history is included in the forensic report with level, trigger pattern, threat score at time of firing, and user response (dismissed, safe exit, challenge passed, challenge failed)
  • Alert cards that triggered an intervention display a 🛑 INTERVENED badge
  • HTML and PDF exports include a dedicated intervention section
  • Session gallery shows intervention count per saved session
  • Action plan history is preserved with completion status
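
One way to picture the tracked data is a small record type holding the four fields the forensic report lists for each intervention. The sketch below is hypothetical: the class names, field names, and report layout are assumptions, and only the field meanings and the four user responses come from the list above.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Dict, List


class UserResponse(str, Enum):
    """The four user responses named in the intervention history."""
    DISMISSED = "dismissed"
    SAFE_EXIT = "safe_exit"
    CHALLENGE_PASSED = "challenge_passed"
    CHALLENGE_FAILED = "challenge_failed"


@dataclass
class InterventionRecord:
    level: str              # WARN, BLOCK, or LOCKDOWN
    trigger_pattern: str    # hypothetical pattern identifier
    threat_score: float     # score at the time the intervention fired
    user_response: UserResponse


def report_section(records: List[InterventionRecord]) -> Dict[str, object]:
    """Build a plain dict suitable for an HTML or PDF export template."""
    rows = [{**asdict(r), "user_response": r.user_response.value} for r in records]
    return {"intervention_count": len(rows), "interventions": rows}


history = [InterventionRecord("BLOCK", "otp_extraction", 62.0, UserResponse.SAFE_EXIT)]
print(report_section(history))
```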

๐Ÿ—๏ธ Architecture

System Architecture

System Overview

VoxGuard is a four-layer real-time pipeline that processes audio, video, and screen input from any call platform and delivers protection in under 80 milliseconds.

Layer 1: Input Sources

Three concurrent input streams feed the system:

  • Caller Audio - the scammer's voice or video call (phone, WhatsApp, Zoom, Teams, any platform)
  • User Microphone - the protected party's microphone via Web Audio API for 2-way transcript
  • Screen Share - optional screen capture (JPEG 1280px every 2 seconds) for visual scam detection

Layer 2: Browser Layer (Client)

Component Technology Role
| Component | Technology | Role |
| --- | --- | --- |
| React UI | Vite 5 + JSX | 5-tab interface, live alerts, intervention overlay, action plan display |
| Rust WASM | wasm-pack, Wiener NR | Spectral subtraction, Float32 PCM, zero-copy audio at <100ms latency |
| Web Audio | MediaStream API | 16kHz Mono PCM capture in 250ms frames via ScriptProcessor |
| WebSocket | useWebSocket hook | Bidirectional: sends audio/screen, receives alerts/TTS/explanations/actions |
| Intervention | InterventionOverlay.jsx | WARN / BLOCK / LOCKDOWN overlay with verification challenge and auto-disconnect |
| Action Agent | ActionAgent.jsx | Post-session guided recovery: 9 languages, 9 countries, step-by-step checklist |
| Screen Capture | getDisplayMedia API | Opt-in only, Base64 JPEG, 2-second interval |

Layer 3: Backend (Google Cloud Run, Python FastAPI)

Service File Role
| Service | File | Role |
| --- | --- | --- |
| FastAPI | main.py | REST + WebSocket /ws/session, auto-scaling on Cloud Run, health check, CORS |
| Threat Engine | threat_engine.py | Weighted scoring (0.45 Language + 0.35 Behavioral + 0.20 Visual), 500ms cycle, intervention triggers at score 55/75/90 + instant pattern matching |
| Audio Analyzer | audio_analyzer.py | VAD + buffer management, streams to Gemini Live API for real-time transcription |
| Vision Analyzer | vision_analyzer.py | Screenshot analysis via Gemini Vision: fake login pages, phishing domains, QR codes |
| Psych Analyzer | psych_analyzer.py | Single Gemini call returns 6 Cialdini vectors + 5 lie detection indicators + intervention recommendation |
| TTS Service | tts_service.py | Natural voice intervention via Gemini TTS with 3 voice profiles (Kore/Puck/Charon) |
| Explanation Service | explanation_service.py | Combines audio + visual analysis into plain-language explanation cards |
| Action Agent | action_agent.py | Generates personalized recovery plans with country-specific emergency contacts and reporting channels for 9 countries |
| Storage Service | storage_service.py | Persists session data and forensic reports to Cloud Firestore. Stores audio recordings to Cloud Storage for forensic export and replay. |

Layer 4: Google Gemini AI

Model Version Purpose
| Model | Version | Purpose |
| --- | --- | --- |
| Gemini Audio | gemini-2.5-flash | Real-time audio analysis via generate_content_async with inline 16kHz PCM audio data. Transcription + scam pattern detection. |
| Gemini Vision | gemini-2.5-flash | Screenshot analysis: fake UI detection, phishing domain identification, QR code scanning |
| Gemini Text | gemini-2.5-flash | Transcript analysis, psychological scoring, 50+ pattern matching, explanation generation |
| Gemini TTS | gemini-2.5-flash-preview-tts | Natural voice intervention with contextual scripts in 9 languages |
| Grounding DB | scam_patterns.json | 50+ verified patterns from FTC, FBI IC3, GASA, MAS, ACCC - zero hallucination |

Data Flow

Caller Voice ─┐
              ├─→ Rust WASM (spectral analysis, noise reduction)
User Mic ─────┘         │
                        ▼
              WebSocket (audio chunks + screen frames)
                        │
                        ▼
              FastAPI Backend (Cloud Run)
              ┌─────────┼─────────┐
              ▼         ▼         ▼
         Audio SVC  Vision SVC  Psych SVC
         (Gemini    (Gemini     (Gemini
          Live)      Vision)     Text)
              └─────────┼─────────┘
                        ▼
              Threat Engine (scoring + intervention triggers)
                        │
              ┌─────────┼──────────────┐
              ▼         ▼              ▼
          Alert     Intervention   Explanation
          Event     Event + TTS    Card Event
              └─────────┼──────────────┘
                        ▼
              WebSocket (back to browser)
                        │
              ┌─────────┼──────────────┐
              ▼         ▼              ▼
          AlertCard  Intervention   Explanation
          (live)     Overlay +      Card +
                     Voice Warning  Action Agent

๐Ÿ” Features

1. ๐ŸŽ™๏ธ Live Audio Stream Analysis

Audio Stream

The Rust WebAssembly audio engine captures microphone input at the browser level with zero-copy processing. Audio is downsampled to 16kHz Mono PCM, processed through Wiener noise reduction, and streamed to Gemini Live API in 250ms frames, achieving <80ms alert latency from speech to alert.
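
The framing arithmetic behind those numbers is simple: at 16 kHz, a 250 ms frame holds 16,000 × 0.25 = 4,000 samples. The sketch below shows that chunking in Python purely for illustration; the real capture path runs in the browser and Rust WASM. The 16-bit little-endian encoding and both function names are assumptions, since the text only says "16kHz Mono PCM".

```python
import struct
from typing import Iterator, List

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 250
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 4000 samples


def float_to_pcm16(samples: List[float]) -> bytes:
    """Clip floats to [-1, 1] and pack them as 16-bit little-endian PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def frames(samples: List[float]) -> Iterator[bytes]:
    """Yield complete 250 ms frames; a trailing partial frame is dropped."""
    last_start = len(samples) - SAMPLES_PER_FRAME
    for start in range(0, last_start + 1, SAMPLES_PER_FRAME):
        yield float_to_pcm16(samples[start:start + SAMPLES_PER_FRAME])


one_second_of_silence = [0.0] * SAMPLE_RATE_HZ
chunks = list(frames(one_second_of_silence))
print(len(chunks), len(chunks[0]))  # 4 frames of 8000 bytes each
```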

2. ๐Ÿ–ฅ๏ธ Screen Share Scam Vision

With explicit user consent, VoxGuard captures screen frames (JPEG 1280px) every 2 seconds and sends them to Gemini Vision for analysis: fake bank login pages, fraudulent investment dashboards, malicious QR codes, and spoofed government portals.

3. 📊 Real-Time Threat Intelligence Engine

A weighted composite scoring system running every 500ms. Language score (45%) handles transcript pattern matching against 50+ verified patterns. Behavioral score (35%) tracks urgency signals, isolation tactics, and impersonation markers. Visual score (20%) covers screen analysis results when active. Output: 0-100 threat score with severity classification. The engine also evaluates every alert for intervention triggers, checking both score thresholds (55/75/90) and instant-pattern matches.
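
As a worked example of the weighting: with a language score of 90, a behavioral score of 80, and no screen share active, the composite is 0.45 × 90 + 0.35 × 80 + 0.20 × 0 = 40.5 + 28 + 0 = 68.5, which sits between the WARN (55) and BLOCK (75) thresholds. The sketch below reproduces that calculation. The function name, the 0-100 scale for each component, and the choice to treat an inactive visual channel as 0 rather than renormalizing the weights are assumptions, not confirmed details of the engine.

```python
# Weights as stated in the text: 45% language, 35% behavioral, 20% visual.
WEIGHTS = {"language": 0.45, "behavioral": 0.35, "visual": 0.20}


def composite_threat_score(language: float, behavioral: float, visual: float = 0.0) -> float:
    """Combine three 0-100 component scores into one 0-100 threat score."""
    raw = (
        WEIGHTS["language"] * language
        + WEIGHTS["behavioral"] * behavioral
        + WEIGHTS["visual"] * visual
    )
    # Clamp to the documented 0-100 output range.
    return round(max(0.0, min(100.0, raw)), 1)


print(composite_threat_score(90, 80))       # audio-only session
print(composite_threat_score(90, 80, 100))  # same call with a phishing page on screen
```

Under this assumption an audio-only session can reach at most 80, so LOCKDOWN by score alone would require visual evidence; whether the real engine renormalizes when screen share is off is not stated in the text.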

4. 📚 Scam Pattern Library (50+ Grounded Patterns)

All patterns grounded to published sources: FTC Consumer Sentinel, FBI IC3 2024, GASA Global Scam Report, MAS ScamShield (SG), and ACCC ScamWatch. No hallucination. Verified structured knowledge only. Each pattern includes an intervention_level field that determines whether detection triggers WARN, BLOCK, or no automatic intervention.
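
To make the role of the intervention_level field concrete, here is a sketch of what pattern records and a lookup could look like. Everything except the intervention_level field name and its three outcomes (WARN, BLOCK, or none) is invented for illustration: the record layout, the ids, the keywords, and the plain substring matching are placeholders, and the actual scam_patterns.json schema is not shown in this document.

```python
import json
from typing import Dict, List

# Made-up sample records; not taken from the real scam_patterns.json.
SAMPLE_PATTERNS: List[Dict] = json.loads("""
[
  {"id": "otp_extraction",
   "source": "placeholder",
   "keywords": ["read me the code", "confirm your pin"],
   "intervention_level": "BLOCK"},
  {"id": "account_freeze_urgency",
   "source": "placeholder",
   "keywords": ["account will be frozen"],
   "intervention_level": "WARN"},
  {"id": "generic_greeting_script",
   "source": "placeholder",
   "keywords": ["calling from the fraud department"],
   "intervention_level": null}
]
""")


def match_patterns(utterance: str, patterns: List[Dict]) -> List[Dict]:
    """Return every pattern with at least one keyword in the utterance."""
    text = utterance.lower()
    return [p for p in patterns if any(k in text for k in p["keywords"])]


hits = match_patterns("Please read me the code we just sent you", SAMPLE_PATTERNS)
for hit in hits:
    print(hit["id"], hit["intervention_level"])
```

A null intervention_level lets a pattern contribute to alerts and scoring without forcing an overlay on its own.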

5. 🧠 Psychological Manipulation Scoring

Psychological Vectors

The only scam detection system we are aware of that maps psychological manipulation vectors in real-time using three analytical frameworks:

Framework 1: Cialdini's 6 Influence Principles maps which persuasion vectors the caller is deploying:

| Vector | Trigger Example |
| --- | --- |
| SCARCITY | "This offer expires in 10 minutes" |
| AUTHORITY | "I'm calling from the tax office" |
| FEAR | "Your account will be frozen" |
| RECIPROCITY | "We already helped you, now you must..." |
| ISOLATION | "Don't tell your family about this" |
| COMMITMENT | "You already agreed to verify your identity" |

Framework 2: User Vulnerability State derives metrics showing how the manipulation is affecting the user's decision-making (Panic Level, Compliance Risk, Misplaced Trust). These scores directly feed the intervention system: high panic combined with high authority triggers earlier intervention.

Each vector includes real-time interpretation (Inactive, Low, Moderate, Elevated, High, Critical) with explanations, plus a pie chart distribution view.
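
The six interpretation labels amount to bucketing a numeric vector score. A minimal sketch of such a mapping follows. The label names come from the text, but the 0-100 scale and every cut point are hypothetical, chosen only to show the technique.

```python
from bisect import bisect_right

# Labels from the text, ordered from weakest to strongest.
LABELS = ["Inactive", "Low", "Moderate", "Elevated", "High", "Critical"]

# Hypothetical lower bounds for Low, Moderate, Elevated, High, Critical.
CUT_POINTS = [1, 20, 40, 60, 80]


def interpret(score: float) -> str:
    """Map a 0-100 vector score to its interpretation label."""
    return LABELS[bisect_right(CUT_POINTS, score)]


for value in (0, 15, 40, 85):
    print(value, interpret(value))
```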

6. ๐Ÿ” Lie Detection Analysis

Lie Detection

5 behavioral deception indicators based on Criteria-Based Content Analysis (CBCA) methodology:

| Indicator | What It Detects |
| --- | --- |
| Inconsistency | Contradictions between claims made at different points |
| Strategic Vagueness | Deliberately avoids specifics when challenged |
| Excessive Detail | Unprompted flood of irrelevant details (overcompensation) |
| Question Deflection | Changes subject or responds with new claims |
| Pressure to Comply | Uses urgency to prevent verification |

Lie detection scores are displayed in the PSYCH tab alongside manipulation vectors, included in forensic reports (PDF/HTML), and saved to the session gallery. The Psych Analyzer service returns both psych scores and lie scores in a single Gemini call, along with an intervention recommendation that feeds back into the threat engine.

7. 💬 Two-Way Communication Transcript

Both sides of the conversation are transcribed in real-time:

  • ME (user), displayed in green
  • CALLER (scammer), displayed in orange with flag markers

Flagged statements trigger real-time alerts. Full 2-way transcript is preserved in session reports and gallery.

8. 📋 Session Report, Gallery & Forensic Export

Every session generates a complete forensic report with:

  • Full 2-way transcript with timestamps
  • Intervention history with level, trigger type, pattern, score at time of firing, and user response
  • Action plan history with completion status and country-specific steps taken
  • Alert timeline with confidence scores (alerts that triggered intervention are marked with 🛑)
  • Psychological vector breakdown + lie detection scores
  • Country-specific recommended actions with local emergency numbers
  • Country flag and language indicator
  • Session audio recording (when REC is enabled)

Export: Dark-theme HTML or print-ready PDF with colored bars, dedicated intervention section, all analytical sections, and "Built by Wiqi Lee" footer.

Session Gallery: Saved sessions with threat score preview, country label, duration, and intervention count. Click any session for fullscreen detail view with tabs (Transcript, Alerts, Interventions, Psych + Lie Detection, Action Plan, Recommended Actions). Audio playback when recording is available.

8.1 🎙 Audio Recording (REC)

The REC button in the Monitor tab captures live microphone audio for forensic export. REC requires Live Mic mode because it records directly from your device microphone via the Web Audio API. In Demo Mode (no microphone), the REC button is disabled because browser security restrictions prevent capturing audio played through new Audio() or speechSynthesis.

The recording is saved as a WebM or MP4 blob (depending on browser support) and embedded in the forensic report. When you export to HTML, the audio player is included so you can replay the session later.

How to record:

  1. Click START to begin a session
  2. Click 🎙 LIVE MIC and grant microphone permission
  3. Click the red REC button (it will pulse to indicate recording is active)
  4. Put your phone on speaker; VoxGuard records the ambient audio from the live call
  5. When you click STOP or SAFE EXIT, the recording is automatically saved
  6. Open the REPORT tab to see the audio player and export to HTML

Note: REC is greyed out until Live Mic is enabled. This is by design: there is no audio to record in Demo Mode.

8.2 🎙 Live Microphone Mode

VoxGuard supports real-time audio capture from your device microphone, in addition to demo mode. Both can run simultaneously.

How to use Live Mic:

  1. Click START to begin a session
  2. Click the 🎙 LIVE MIC button in the Monitor controls
  3. Grant microphone permission when your browser asks
  4. A green status banner appears: "LIVE MICROPHONE - Real audio capture active"
  5. Put your phone on speaker and place it near your device, or use the same device for both the call and VoxGuard
  6. VoxGuard captures your microphone audio at 16kHz Mono PCM via the Web Audio API
  7. In production mode (with a running backend), the audio is streamed to the Gemini Live API for real-time analysis
  8. Click 🎙 MIC ON again to disconnect the microphone

Live Mic + REC together: Enable both Live Mic and REC at the same time to record your actual call audio into the forensic report. This is the recommended setup for real-world use.

Live Mic + Demo Scripts: You can also run a demo script while Live Mic is active. The demo script provides simulated 2-way dialog and alerts, while the microphone captures real ambient audio in parallel. This is useful for demonstrations where you want to show both the scripted flow and the live microphone capability.

9. ๐ŸŒ Multi-Language Support (40 Languages)

Gemini Live API supports 40 languages natively. VoxGuard includes region-specific scam patterns, localized alerts, localized intervention UI, natural voice warnings (Gemini TTS), scenario-based verification challenges, safe exit actions, and guided action plans.

Fully Native Support (demo scripts + localized alerts + voice intervention + action agent):

| Language | Flag | Demo Scripts | Regional Scams | Intervention | Voice TTS | Action Agent |
|---|---|---|---|---|---|---|
| English | 🇺🇸 | Bank Fraud, Tech Support, Gov/Tax, Investment | FTC/FBI patterns | Full | Full (3 voices) | Full (US) |
| Indonesian | 🇮🇩 | Bank XYZ, Pinjol, Mama Minta Pulsa, Giveaway Palsu | OJK/Bareskrim | Full | Full | Full (ID) |
| Chinese | 🇨🇳 | 公安局诈骗 (Police Impersonation) | MPS Advisory | Full | Full | Full (CN) |
| Japanese | 🇯🇵 | オレオレ詐欺 (Ore Ore) | NPA patterns | Full | Full | Full (JP) |
| Korean | 🇰🇷 | 보이스피싱 (Voice Phishing) | FSS patterns | Full | Full | Full (KR) |
| Spanish | 🇪🇸 | Fraude Bancario | Guardia Civil | Full | Full | Full (ES) |
| French | 🇫🇷 | Arnaque CPF | DGCCRF | Full | Full | Full (FR) |
| Hindi | 🇮🇳 | Digital Arrest Fraud | MHA/RBI | Full | Full | Full (IN) |
| Arabic | 🇸🇦 | احتيال مصرفي (Bank Fraud) | GASA | Full | Full | Full (SA) |

Partial Native Support (demo scripts in local language, TTS via Gemini):

| Language | Flag | Demo Scripts | Regional Scams | Intervention | Voice TTS | Action Agent |
|---|---|---|---|---|---|---|
| Malay | 🇲🇾 | Bank, Polis Diraja, Hadiah Palsu | BNM patterns | Full | Full | English fallback |
| Portuguese | 🇧🇷 | Fraude Bancária, Polícia Federal, Prêmio Falso | GASA patterns | Full | Full | English fallback |

English Fallback (voice + alerts in English, UI translated):

Filipino 🇵🇭, Thai 🇹🇭, Vietnamese 🇻🇳, German 🇩🇪, Italian 🇮🇹, Dutch 🇳🇱, Turkish 🇹🇷, Polish 🇵🇱, Russian 🇷🇺, Ukrainian 🇺🇦, Romanian 🇷🇴, Czech 🇨🇿, Hungarian 🇭🇺, Swedish 🇸🇪, Danish 🇩🇰, Finnish 🇫🇮, Greek 🇬🇷, Hebrew 🇮🇱, Persian 🇮🇷, Bengali 🇧🇩, Urdu 🇵🇰, Tamil 🇱🇰, Swahili 🇰🇪, Amharic 🇪🇹, Yoruba 🇳🇬, Hausa 🇳🇬, Afrikaans 🇿🇦, Norwegian 🇳🇴

⚠️ English fallback languages show a yellow notice in the app. Full native support is planned for a future release.
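
The three support tiers above can be read as a simple lookup with an English default. The sketch below is one way to express that; the function name, the returned field names, and the language codes for Malay and Portuguese (`ms`, `pt`) are assumptions for illustration, not VoxGuard's actual API.

```python
# Codes for the nine fully supported languages, as listed in the Limitations section.
FULL_NATIVE = {"en", "id", "zh", "ja", "ko", "es", "fr", "hi", "ar"}
# Assumption: ISO 639-1 codes for Malay and Portuguese.
PARTIAL_NATIVE = {"ms", "pt"}


def resolve_support(lang_code):
    """Return which language each subsystem would use for a selected language."""
    code = lang_code.lower()
    if code in FULL_NATIVE:
        # Demo scripts, alerts, voice, and action agent are all localized.
        return {"tier": "full", "voice": code, "action_agent": code, "notice": False}
    if code in PARTIAL_NATIVE:
        # Local-language scripts and TTS, but the action agent falls back to English.
        return {"tier": "partial", "voice": code, "action_agent": "en", "notice": False}
    # Everything else: English voice and alerts, plus the yellow notice in the app.
    return {"tier": "fallback", "voice": "en", "action_agent": "en", "notice": True}
```

For example, `resolve_support("de")` would report the fallback tier with the notice flag set, matching the behaviour described for German above.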

10. 📱 Responsive Design

Fully optimized for desktop and mobile browsers. Works on any smartphone via the browser, with no app installation required. On phones, the header wraps, tabs scroll horizontally, content padding is reduced, and the footer stacks vertically. Buttons and controls are touch-friendly throughout.


📂 Project Structure

voxguard/
├── .github/workflows/
│   ├── ci.yml                             # CI: WASM build, frontend build, backend tests
│   └── deploy.yml                         # CD: deploy backend to GCP Cloud Run
│
├── frontend/                              # React SPA (Vite 5 + JSX)
│   ├── src/
│   │   ├── components/
│   │   │   ├── PixelLogo.jsx              # Animated pixel shield logo with color cycling
│   │   │   ├── Primitives.jsx             # Reusable UI: PBox (bordered panel), PBtn, StatCard
│   │   │   ├── AlertCard.jsx              # Expandable threat alert card with intervention badge
│   │   │   ├── InterventionOverlay.jsx    # Live Scam Intervention: WARN/BLOCK/LOCKDOWN overlay
│   │   │   ├── ExplanationCard.jsx        # Multimodal explanation card (audio + vision signals)
│   │   │   ├── ActionAgent.jsx            # Guided anti-scam action agent with step checklist
│   │   │   ├── ThreatMeter.jsx            # SVG arc gauge: composite threat score 0-100
│   │   │   ├── WaveformVisualizer.jsx     # Real-time audio waveform bar visualization
│   │   │   ├── LanguageSelector.jsx       # Language dropdown (40 languages)
│   │   │   └── PixelParticles.jsx         # Animated pixel particles (header + monitor)
│   │   ├── pages/
│   │   │   ├── MonitorTab.jsx             # Main dashboard: waveform, alerts, explanation cards
│   │   │   └── Tabs.jsx                   # Psych, Patterns, Report (with action plan), About
│   │   ├── hooks/
│   │   │   ├── useWebSocket.js            # WebSocket client: alerts, TTS audio, explanations, actions
│   │   │   ├── useAudioEngine.js          # Mic capture + Rust WASM bridge (Web Audio fallback)
│   │   │   └── useScreenCapture.js        # Screen share via getDisplayMedia, 2s JPEG frames
│   │   ├── wasm/                          # Generated by wasm-pack (gitignored, built in CI)
│   │   │   ├── scam_shield_audio.js       # JS bindings for the Rust WASM module
│   │   │   └── scam_shield_audio_bg.wasm  # Compiled WASM binary
│   │   ├── utils/
│   │   │   └── constants.js               # Patterns, psych tactics, intervention config, challenges
│   │   ├── App.jsx                        # Root component: tab routing, state management, effects
│   │   └── main.jsx                       # React DOM mount point
│   ├── package.json                       # Dependencies: React 18, Vite 5
│   └── vite.config.js                     # Dev server proxy, WASM support, build config
│
├── rust-engine/                           # Rust WASM audio preprocessor
│   ├── src/
│   │   └── lib.rs                         # DSP pipeline: Wiener NR, spectral sub, VAD, RMS norm
│   ├── Cargo.toml                         # Deps: wasm-bindgen, web-sys, js-sys, serde
│   └── Cargo.lock                         # Locked dependency versions
│
├── backend/                               # Python FastAPI backend
│   ├── app/
│   │   ├── api/
│   │   │   └── websocket.py               # WebSocket: /ws/session + TTS + explanations + actions
│   │   ├── services/
│   │   │   ├── threat_engine.py           # Scoring + intervention triggers + session state tracking
│   │   │   ├── audio_analyzer.py          # VAD + buffer management, Gemini audio streaming
│   │   │   ├── vision_analyzer.py         # Screenshot analysis via Gemini Vision API
│   │   │   ├── psych_analyzer.py          # Cialdini + lie detection + intervention recommendation
│   │   │   ├── tts_service.py             # Natural voice intervention via Gemini TTS
│   │   │   ├── explanation_service.py     # Multimodal explanation card generation
│   │   │   ├── action_agent.py            # Guided anti-scam action plans (9 countries)
│   │   │   └── storage_service.py         # Cloud Firestore (sessions) + Cloud Storage (audio recordings)
│   │   └── core/
│   │       ├── config.py                  # Pydantic settings from env vars (incl. TTS, Firestore, Storage)
│   │       └── gemini_client.py           # Google GenAI SDK wrapper (audio + vision)
│   ├── data/
│   │   └── scam_patterns.json             # 50+ patterns with intervention_level field per pattern
│   ├── tests/
│   │   └── test_threat_engine.py          # Unit tests for scoring logic and intervention triggers
│   ├── main.py                            # FastAPI entry (legacy, redirects to app.main)
│   ├── requirements.txt                   # Python deps: FastAPI, google-generativeai, google-cloud-firestore, google-cloud-storage, numpy, scipy
│   └── Dockerfile                         # Cloud Run container: Python 3.11-slim, PORT=8080
│
├── docs/svgs/
│   ├── architecture-badge.svg             # Animated pipeline badge for README header
│   ├── features-badge.svg                 # Animated capabilities overview
│   ├── intervention.svg                   # Animated intervention tiers (WARN/BLOCK/LOCKDOWN)
│   ├── threat-demo.svg                    # Threat score gauge demo graphic
│   ├── psych-vectors.svg                  # Psychological vector bar chart
│   ├── lie-detection.svg                  # Lie detection indicator chart
│   └── audio-stream.svg                   # Animated audio waveform graphic
│
├── scripts/
│   └── deploy.sh                          # One-command GCP Cloud Run deployment
│
├── .env.example                           # Root template (shared reference for all env vars)
├── frontend/.env.example                  # Frontend: VITE_GEMINI_API_KEY, VITE_WS_URL
├── backend/.env.example                   # Backend: GOOGLE_API_KEY, GEMINI_MODEL, GEMINI_TTS_MODEL
├── docker-compose.yml                     # Local dev: backend + frontend orchestration
├── vercel.json                            # Vercel config for frontend deployment
├── .gitignore                             # Ignores: node_modules, .env, target/, wasm/
├── LICENSE                                # MIT License
└── README.md                              # You are here

Note: frontend/src/wasm/ is gitignored. It is generated by wasm-pack build during CI. The Frontend Build job depends on the Rust WASM Build job in the CI/CD pipeline.

Environment Variables

VoxGuard uses three .env files to separate frontend, backend, and shared configuration. Only .env.example templates are committed to git; actual .env files are gitignored.

| File | Variable | Description |
|---|---|---|
| frontend/.env | VITE_GEMINI_API_KEY | Gemini API key for demo mode TTS (client-side) |
| frontend/.env | VITE_WS_URL | Backend WebSocket URL (e.g., wss://your-backend.run.app/ws/session) |
| backend/.env | GOOGLE_API_KEY | Gemini API key for backend services (audio, vision, TTS, psych) |
| backend/.env | GEMINI_MODEL | Audio model (default: gemini-2.5-flash) |
| backend/.env | GEMINI_VISION_MODEL | Vision model (default: gemini-2.5-flash) |
| backend/.env | GEMINI_TTS_MODEL | TTS model (default: gemini-2.5-flash-preview-tts) |
| backend/.env | GOOGLE_CLOUD_PROJECT | GCP project ID for Cloud Run, Firestore, and Storage |
| backend/.env | FIRESTORE_ENABLED | Enable Cloud Firestore session persistence (default: true) |
| backend/.env | STORAGE_ENABLED | Enable Cloud Storage audio uploads (default: true) |
| backend/.env | STORAGE_BUCKET | Cloud Storage bucket name for audio recordings |
| .env | (shared reference) | Root template combining all variables for quick setup |
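
To show how the backend variables and their documented defaults fit together, here is a minimal standard-library sketch. The project structure notes that the real `config.py` uses Pydantic settings, so treat this as a stand-in: the `BackendSettings` class, the `load_settings` function, and the rule for parsing boolean strings are all assumptions for illustration.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default):
    """Parse a boolean env var; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class BackendSettings:
    google_api_key: str
    gemini_model: str
    gemini_vision_model: str
    gemini_tts_model: str
    google_cloud_project: str
    firestore_enabled: bool
    storage_enabled: bool
    storage_bucket: str


def load_settings(env=None):
    """Build settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY", "")
    if not api_key:
        raise ValueError("GOOGLE_API_KEY is required")
    return BackendSettings(
        google_api_key=api_key,
        # Defaults below are the ones listed in the table above.
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
        gemini_vision_model=env.get("GEMINI_VISION_MODEL", "gemini-2.5-flash"),
        gemini_tts_model=env.get("GEMINI_TTS_MODEL", "gemini-2.5-flash-preview-tts"),
        google_cloud_project=env.get("GOOGLE_CLOUD_PROJECT", ""),
        firestore_enabled=_env_bool(env, "FIRESTORE_ENABLED", True),
        storage_enabled=_env_bool(env, "STORAGE_ENABLED", True),
        storage_bucket=env.get("STORAGE_BUCKET", ""),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.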

⚡ Quick Start

Try It Now (No installation needed):

Judges: Just open the live demo. No setup is required for Demo Mode.

1. Open https://voxguard-kappa.vercel.app
2. Click START
3. Click any Demo Script (e.g., "Bank Impersonation")
4. Watch the intervention fire when the scammer asks for your OTP

Note: The Vercel deployment runs in Demo Mode with simulated 2-way dialog. Full real-time audio analysis (Live Mic mode) requires the backend running on Cloud Run with a Gemini API key. See below for local setup.

Local Development Setup

Prerequisites

  • Node.js 20+, Python 3.11+, Rust 1.75+ with wasm32-unknown-unknown target
  • wasm-pack installed
  • Google Gemini API key

Step 1: Clone and Configure

git clone https://github.com/wiqilee/VoxGuard.git
cd VoxGuard

# Configure environment variables
cp .env.example .env                       # Root (shared reference)
cp frontend/.env.example frontend/.env     # Frontend: add VITE_GEMINI_API_KEY
cp backend/.env.example backend/.env       # Backend: add GOOGLE_API_KEY
# Edit each .env file and add your Gemini API key

Step 2: Build Rust WASM Engine

cd rust-engine
wasm-pack build --target web --out-dir ../frontend/src/wasm
cd ..

Step 3: Run Frontend

cd frontend && npm install && npm run dev

Step 4: Run Backend (separate terminal)

cd backend && pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Step 5: Open http://localhost:5173


🎬 Demo Scripts

Pre-loaded scripts for Demo Mode (no microphone needed):

English (EN) - 7 Scripts:

Script A: Bank Impersonation (Critical)

"Hello, I'm calling from your bank's fraud prevention department. We've detected suspicious activity on your account. Your account will be frozen in 10 minutes unless you verify your identity. Please provide your account number and the OTP."

Intervention trigger: When the caller asks for OTP, a BLOCK-level intervention fires instantly - Safe Exit only (fatal pattern). Voice warning: "Stop immediately. The caller is asking for your one-time password. A real bank will never ask for this. Hang up now." Explanation Card: Shows combined audio signals (Authority + Fear tactics) with confidence score. Action Agent: Country-specific steps to secure your bank account.

Script B: Tech Support Scam (High)

"Your computer has been compromised. I'm calling from the Security Center. You must install our remote access tool immediately or we cannot protect your credit cards."

Intervention trigger: WARN fires on impersonation detection. User can Verify Caller or Safe Exit. Voice warning: "Caution. No legitimate company will cold-call you about a virus on your computer."

Script C: Government / Tax Scam (Critical)

"This is an officer from the tax enforcement division. A warrant has been issued for your arrest. Settle this balance right now or face arrest. Purchase prepaid debit cards and read me the card numbers."

Intervention trigger: Gift Card Demand fires instant BLOCK - Safe Exit only (fatal pattern). Voice warning: "Stop. No legitimate organization accepts payment through gift cards. This is a confirmed scam."

Script D: AI Voice Clone (Critical)

The caller uses an AI-generated voice to impersonate the victim's child, fabricating a car accident and demanding $8,000 in bail via wire transfer.

Intervention trigger: Wire Transfer Instruction + Isolation Tactic. BLOCK fires on the bail demand. Voice warning: "Stop. This person is impersonating a family member. Verify their identity by calling them on their known number."

Script E: Digital Arrest (Critical)

The caller poses as a federal cyber crime officer, claims the victim's identity was used in money laundering, and demands the victim stay on video call while transferring funds to a "government-secured holding account."

Intervention trigger: Government Impersonation + Safe Account Transfer fires instant BLOCK. Voice warning: "Stop immediately. Law enforcement never demands money transfers by phone. This is a scam."

Script F: Job Offer Scam (High)

The caller offers a remote data entry position paying $5,000/month, then demands a $299 "training kit" payment via gift cards or cryptocurrency.

Intervention trigger: Gift Card Demand fires instant BLOCK when gift card payment is requested. Voice warning: "Stop. No legitimate employer asks you to pay upfront for a job. This is employment fraud."

Script G: Family Emergency (High)

The caller impersonates a family member in distress, claims a car accident and hospital emergency, and demands wire transfer.
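
The intervention triggers described for the scripts above amount to a mapping from detected patterns to an intervention level, with the most severe match winning. The sketch below illustrates that idea with plain phrase matching. It is an assumption-laden simplification: the rule table, the trigger phrases, and the `evaluate` function are hypothetical, and the real backend works from `scam_patterns.json` and Gemini analysis rather than a keyword list.

```python
import re

# Severity order used to pick the strongest matching level.
LEVELS = {"NONE": 0, "WARN": 1, "BLOCK": 2}

# (pattern name, trigger regexes, level). Phrases are illustrative assumptions.
RULES = [
    ("OTP Request", (r"\botp\b", r"\bone-time password\b"), "BLOCK"),
    ("Gift Card Demand", (r"\bgift cards?\b", r"\bprepaid debit cards?\b"), "BLOCK"),
    ("Wire Transfer Instruction", (r"\bwire transfer\b",), "BLOCK"),
    ("Impersonation", (r"\bsecurity center\b", r"\bfraud prevention department\b"), "WARN"),
]


def evaluate(transcript):
    """Return (intervention level, matched pattern names) for a transcript."""
    text = transcript.lower()
    matched = [
        (name, level)
        for name, patterns, level in RULES
        if any(re.search(p, text) for p in patterns)
    ]
    level = max(
        (lvl for _, lvl in matched),
        key=LEVELS.__getitem__,
        default="NONE",
    )
    return level, [name for name, _ in matched]


# Excerpts from Scripts A and B above.
script_a = ("Hello, I'm calling from your bank's fraud prevention department. "
            "Please provide your account number and the OTP.")
script_b = ("Your computer has been compromised. "
            "I'm calling from the Security Center.")
```

With these rules, the Script A excerpt escalates to BLOCK because the OTP request outranks the impersonation signal, while the Script B excerpt stays at WARN.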


🏆 For Judges: Full Evaluation Guide

Innovation and Multimodal UX (40%)

VoxGuard has no text box. The user never types. The interface is entirely driven by audio (microphone stream via Rust WASM to Gemini Live API), vision (screen capture to Gemini Vision API), voice (Gemini TTS for spoken intervention), and inference (psychological vector scoring via Gemini Text). The interaction is ambient: the AI listens, watches, speaks, explains, and guides while the user is on their call.

To our knowledge, the Live Scam Intervention system is new: no existing scam detection product actively blocks the user from completing a dangerous action during a live call. VoxGuard's three-tier escalation (WARN, BLOCK, LOCKDOWN) with natural voice warnings, multimodal explanation cards, scenario-based verification challenges, guided action agent, and auto-disconnect represents a shift from passive detection to active protection.

Technical Implementation (30%)

  • Google GenAI SDK: All Gemini functionality is implemented with the official Google Generative AI SDK for Python (google-generativeai==0.5.4) on Google Cloud Run. Audio analysis uses generate_content_async with inline 16kHz PCM audio data buffered from the Rust WASM engine. Session data and forensic reports are persisted via Cloud Firestore (storage_service.py → google-cloud-firestore). Session audio recordings are stored via Cloud Storage (storage_service.py → google-cloud-storage). Authentication is handled by google-auth.
  • Gemini Audio Analysis: gemini-2.5-flash for audio analysis via generate_content_async with inline 16kHz PCM audio data. Audio chunks are buffered (2-second flush with VAD), sent as base64 to the standard Gemini API, and return structured JSON with transcript, scam indicators, tactics, and lie indicators.
  • Gemini Text/Vision: gemini-2.5-flash for screenshot analysis, transcript analysis, psychological scoring, and multimodal explanation generation.
  • Gemini TTS: gemini-2.5-flash-preview-tts for natural voice intervention with 3 voice profiles (Charon for scammer simulation, Kore for user, Puck for warm advisory) and contextual scripts in 9 languages.
  • Rust WASM: Zero-copy audio processing, Wiener NR, Float32 PCM, <100ms latency
  • Cloud Run + Cloud Firestore + Cloud Storage: Fully containerized backend on Cloud Run with auto-scaling, health check endpoints, and session affinity for WebSocket. storage_service.py implements both persistence layers: session data (alerts, interventions, psych scores, transcripts, action plans) persisted to Cloud Firestore via google-cloud-firestore, and audio recordings uploaded to Cloud Storage via google-cloud-storage. Three Google Cloud services in production.
  • Grounding: Detections are reasoned against 50+ verified patterns to keep alerts grounded and limit hallucination.
  • Intervention Engine: Backend evaluates every alert for intervention eligibility, emitting intervention + intervention_audio + explanation_card events via WebSocket. Frontend renders the overlay with scenario-appropriate UI, plays TTS audio, shows explanation cards, and sends intervention_response back. The full loop is tracked in session state.
  • Explanation Service: Combines audio transcript analysis + screenshot analysis into a single Gemini call, producing plain-language explanation cards with signal badges, confidence scores, and recommended actions.
  • Action Agent: Generates personalized step-by-step recovery plans with country-specific emergency contacts, reporting channels, and AI-enhanced advice based on the call transcript.
  • Psych Analyzer: Single Gemini call returns both Cialdini scores and lie detection indicators, plus an intervention recommendation.
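
The audio path described above (buffer chunks, flush every 2 seconds with VAD, send as base64) can be sketched as follows. This is an illustration under stated assumptions, not the code in `audio_analyzer.py`: the `AudioBuffer` class, the 16-bit sample width, and the RMS energy threshold standing in for real voice-activity detection are all hypothetical.

```python
import base64
import math
import struct

SAMPLE_RATE_HZ = 16_000   # from the README: 16 kHz PCM
FLUSH_SECONDS = 2         # from the README: 2-second flush
BYTES_PER_SAMPLE = 2      # assumption: 16-bit signed PCM
RMS_THRESHOLD = 500       # hypothetical energy threshold used as a crude VAD
FLUSH_BYTES = SAMPLE_RATE_HZ * FLUSH_SECONDS * BYTES_PER_SAMPLE


class AudioBuffer:
    """Accumulate PCM bytes and emit base64 payloads for voiced 2-second windows."""

    def __init__(self):
        self._data = bytearray()

    def add(self, pcm_bytes):
        """Append PCM; return a base64 string when a voiced window is ready, else None."""
        self._data.extend(pcm_bytes)
        if len(self._data) < FLUSH_BYTES:
            return None
        window = bytes(self._data[:FLUSH_BYTES])
        del self._data[:FLUSH_BYTES]
        if not self._has_voice(window):
            return None  # silent window: drop it instead of sending it for analysis
        return base64.b64encode(window).decode("ascii")

    @staticmethod
    def _has_voice(window):
        """Energy check: RMS of the 16-bit samples against the threshold."""
        count = len(window) // BYTES_PER_SAMPLE
        samples = struct.unpack(f"<{count}h", window)
        rms = math.sqrt(sum(s * s for s in samples) / count)
        return rms >= RMS_THRESHOLD
```

Dropping silent windows before encoding keeps quiet stretches of a call from generating analysis requests, which is one plausible reason to pair the flush with VAD.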

Demo and Presentation (30%)

Live Demo Demo Video Medium Article


⚠️ Limitations

  • Demo Mode on Vercel: The live demo runs with simulated 2-way dialog and TTS alerts. Full real-time analysis requires a running backend with a valid Gemini API key.
  • Browser Speech Synthesis: When Gemini TTS is unavailable, the demo falls back to browser speech synthesis, whose voice quality varies by browser and OS.
  • English fallback: 31 languages use English voice and alerts in demo. 9 languages have full native support (EN, ID, ZH, JA, KO, ES, FR, HI, AR).
  • Browser-only: No native mobile or desktop clients yet.
  • Latency depends on network: <80ms measured locally; 100-300ms over public internet with Cloud Run.
  • No persistent storage in demo: Session reports use localStorage only.
  • Screen capture requires user consent: Vision analysis is opt-in and desktop-only.
  • No brand names in demos: All demo scripts use generic institution names to avoid trademark issues.
  • TTS voice availability: Gemini TTS voice profiles may vary by region. The system gracefully falls back to browser speech synthesis if TTS is unavailable.
  • Audio recording requires Live Mic: The REC button is disabled in Demo Mode. Browser security restrictions prevent capturing audio played through new Audio() (Gemini TTS) or speechSynthesis (browser TTS). To record session audio, enable Live Mic + REC together so the microphone input is recorded directly from your device. This is a browser platform limitation, not a bug in VoxGuard.
  • Live Mic requires HTTPS: The getUserMedia API requires a secure context (HTTPS or localhost). The Vercel deployment and local dev server both satisfy this requirement.

🔮 Future Work

  • Native mobile app: iOS and Android with platform-level call interception for always-on protection.
  • Carrier-level integration: Deploying VoxGuard as an inline telecom network service.
  • Expanded pattern library: Growing from 50+ to 500+ patterns with global regional coverage.
  • On-device WASM inference: Running scam classification directly in Rust WASM for offline-capable protection.
  • Community pattern submissions: Crowd-sourced, continuously updated threat intelligence.
  • Enterprise API: Hosted API for banks, telcos, and contact centers.
  • Real-time video deepfake detection: Detect AI-generated video in video call scams.
  • Auto-detect call platform: Automatically identify if user is on phone, Zoom, WhatsApp, or Teams.
  • Emotional contagion scoring: Measure how the caller's emotional state transfers to the victim.
  • Intervention learning: Track which intervention levels and challenge questions are most effective at stopping victims from complying with scammers, and adapt the system over time.
  • Full native support for all 40 languages: Extend localized demo scripts, alerts, voice TTS, and intervention UI beyond the current 9 languages.
  • Expanded action agent: Add more countries, integrate with local banking APIs for one-click account freeze, and provide follow-up reminder notifications.

👤 About the Creator

X Discord GitHub GDG

Wiqi Lee - Data Scientist, AI/ML Researcher, Software Engineer, Cellist

Programming Languages: Python, Java, Rust, Julia

Submitted to: Gemini Live Agent Challenge 2026 #GeminiLiveAgentChallenge

"This is not a hackathon project. This is infrastructure for human safety."


📖 Data Sources

| Source | URL | Usage |
|---|---|---|
| FBI IC3 2024 Annual Report | ic3.gov/AnnualReport/Reports/2024_IC3Report.pdf | Statistics ($16.6B), scam categories |
| FBI IC3 Annual Reports Index | ic3.gov/annualreport/reports | All yearly reports archive |
| FTC Consumer Sentinel | ftc.gov/enforcement/consumer-sentinel-network | Pattern taxonomy, linguistic markers |
| GASA Global Scam Report | gasa.org | Global $1T+ loss estimates |
| MAS ScamShield (SG) | scamshield.org.sg | Southeast Asian variants |
| ACCC ScamWatch (AU) | scamwatch.gov.au | Australian variant patterns |
| OJK Indonesia | ojk.go.id | Indonesian financial authority patterns |
| Bareskrim Cyber (ID) | patrolisiber.id | Indonesian cybercrime reporting |
| NPA Japan (警察庁) | npa.go.jp | Japanese ore-ore sagi patterns |
| FSS South Korea (금감원) | fss.or.kr | Korean voice phishing patterns |
| MHA India Cyber Crime | cybercrime.gov.in | Indian digital arrest scam data |
| INCIBE Spain | incibe.es | Spanish cybersecurity incident data |
| PHAROS France | internet-signalement.gouv.fr | French online fraud reporting |
| SAMA Saudi Arabia | sama.gov.sa | Saudi monetary authority fraud alerts |
| China National Anti-Fraud Center | mps.gov.cn | Chinese anti-fraud app & public security data |
| Interpol Financial Crime | interpol.int | International scam pattern intelligence |

No proprietary or licensed data. No personal victim data. All examples reconstructed from published public reports.


🔒 Privacy and Ethics

  • No audio persisted by default: Audio is processed in real-time streams and discarded immediately after analysis. Raw audio is only retained when the user explicitly enables the REC recording feature.
  • Minimal data transmission: Rust WASM preprocesses audio locally. Only the necessary audio frames are streamed to the backend for Gemini Live API analysis. No raw audio is stored server-side.
  • No TTS audio stored: Voice intervention audio is generated on demand and not persisted.
  • Explicit screen consent: Screen capture requires explicit user activation.
  • No PII collection: No personally identifiable information is collected by VoxGuard.
  • No brand names: Demo scripts use generic institution names.
  • Intervention is protective, not punitive: The system helps users make informed decisions. It never prevents them from continuing a call if they choose to after the verification challenge.

📄 License

MIT License. See LICENSE for details.


VOXGUARD 2026 · WIQI LEE · MIT LICENSE · #GeminiLiveAgentChallenge

Gemini X

Built to protect the people who need it most.

About

VoxGuard is a real-time multimodal scam detection system for live calls, built with Gemini Live API, Rust WASM audio streaming, and psychological manipulation scoring.
