ScamShield

A Chrome extension that detects internship scams in real time — before you click Apply, upload a document, or hand over personal details.


The Problem

Internship scams in India follow a predictable pattern: fake company, vague role, upfront registration fee, collection of Aadhaar/PAN details, and disappearance. Thousands of students fall for these every year because they look legitimate on the surface. No existing tool intercepts them at the point of interaction.

ScamShield sits in your browser, analyzes every internship form or job posting you visit, searches the web for complaints about the company, and warns you — with sources — before you do anything irreversible.


How It Works

Three things happen simultaneously when you land on a job or internship page:

1. Real-time interception. A content script injected into every page watches for dangerous actions: clicking "Apply Now", uploading a document, following a WhatsApp link. If the page has been flagged, your action is intercepted before it completes and a warning is shown with evidence.

2. Backend search. The company name and form text are sent to a FastAPI backend that concurrently searches Google Custom Search and Reddit and checks WHOIS domain age and MCA21 company registration. Results are cached in Redis for 24 hours, so repeat lookups are instant.
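
The concurrent fan-out can be sketched with `asyncio.gather`. This is a hypothetical illustration: the stub coroutines stand in for the real scrapers in `scrapers/`, and the helper name `gather_evidence` is an assumption, not the shipped API.

```python
import asyncio

# Stubs standing in for the real scrapers (google_cse, reddit, whois, mca21).
async def search_google(company: str) -> dict:
    return {"source": "google_cse", "hits": []}

async def search_reddit(company: str) -> dict:
    return {"source": "reddit", "hits": []}

async def check_whois(company: str) -> dict:
    return {"source": "whois", "domain_age_days": None}

async def check_mca21(company: str) -> dict:
    return {"source": "mca21", "registered": None}

async def gather_evidence(company: str) -> list:
    """Run all four lookups concurrently; a failed source yields an
    error marker instead of sinking the whole analysis."""
    results = await asyncio.gather(
        search_google(company),
        search_reddit(company),
        check_whois(company),
        check_mca21(company),
        return_exceptions=True,
    )
    return [r if not isinstance(r, Exception) else {"error": str(r)}
            for r in results]

evidence = asyncio.run(gather_evidence("ABC Solutions"))
```

`return_exceptions=True` is what keeps one slow or broken source from blocking the verdict; the scorer simply sees fewer signals.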

3. NLP analysis. spaCy parses the form text with dependency parsing rather than keyword lists, detecting structural patterns such as payment requests, sensitive-data harvesting, referral schemes, and urgency language regardless of exact wording.
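
The structural idea can be shown in miniature. This sketch does not load a spaCy model; each token is a hand-written `(lemma, dep, head_lemma)` triple so the example runs standalone, and the vocabulary sets are illustrative assumptions, not the engine's actual lists.

```python
# Simplified illustration of dependency-based detection: flag any
# fee-like noun whose syntactic head is a payment verb, regardless
# of the exact surface wording.

PAYMENT_VERBS = {"pay", "deposit", "transfer", "send"}
FEE_NOUNS = {"fee", "charge", "amount", "contribution", "payment"}

def has_payment_request(tokens):
    """tokens: iterable of (lemma, dep, head_lemma) triples."""
    return any(
        dep in {"dobj", "obj"} and head in PAYMENT_VERBS and lemma in FEE_NOUNS
        for lemma, dep, head in tokens
    )

# "Pay registration fee of Rs 500" -> "fee" is the direct object of "pay"
parsed = [
    ("pay", "ROOT", "pay"),
    ("registration", "compound", "fee"),
    ("fee", "dobj", "pay"),
]
```

Because the check is on the verb-object relation, "pay a processing contribution" trips the same rule as "pay a registration fee".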


Architecture

Browser (Chrome Extension)
├── content/        DOM parsing, MutationObserver, capture-phase interception
├── overlay/        Warning UI injected into page (Shadow DOM)
├── utils/          Shared helpers: hash, sanitize, storage, normalize
├── popup/          Toolbar icon panel
└── background.js   Service worker: API calls, Redis-backed cache

Backend (FastAPI + Python)
├── api/            Route handlers: /analyze, /feedback, /health
├── services/       Business logic: NLP engine, search orchestrator, scorer, cache
├── scrapers/       One file per source: Google CSE, Reddit, WHOIS, MCA21
├── db/             SQLAlchemy models, async queries, migrations
└── ml/             spaCy model, weekly retrain script from user feedback

Tech Stack

| Layer           | Technology                                      |
|-----------------|-------------------------------------------------|
| Extension       | Vanilla JS, no framework, no build step         |
| Backend         | Python 3.11, FastAPI, uvicorn                   |
| NLP             | spaCy en_core_web_md                            |
| Database        | PostgreSQL with pg_trgm for fuzzy name matching |
| Cache           | Redis (verdict cache, 24hr TTL)                 |
| Search          | Google Custom Search API, Reddit JSON API       |
| Domain check    | python-whois                                    |
| Company check   | MCA21 registry                                  |
| HTTP client     | httpx (async)                                   |
| ORM             | SQLAlchemy 2.0 (async)                          |
| Fuzzy matching  | RapidFuzz                                       |
| Dev environment | Docker Compose                                  |
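
Fuzzy name matching matters because scam operators vary spellings and suffixes. The backend uses RapidFuzz; this dependency-free sketch uses the stdlib `difflib.SequenceMatcher` as a stand-in (its `ratio()` behaves similarly once scaled to 0-100). The cutoff value is an illustrative assumption.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def best_match(name: str, known: list, cutoff: float = 80.0):
    """Return the closest known company name, or None below the cutoff."""
    scored = [(similarity(name, k), k) for k in known]
    score, match = max(scored)
    return match if score >= cutoff else None

# Misspelled scrape still resolves to the known entry.
match = best_match("ABC Solutons Pvt Ltd",
                   ["ABC Solutions Pvt Ltd", "XYZ Corp"])
```

On the database side, pg_trgm does the analogous trigram-similarity lookup inside PostgreSQL itself.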

Repository Structure

scamshield/
├── scamshield-extension/       Chrome extension (load unpacked)
│   ├── manifest.json
│   ├── config.js
│   ├── background.js
│   ├── content/
│   │   ├── index.js
│   │   ├── extractor.js
│   │   ├── interceptor.js
│   │   ├── observer.js
│   │   └── page-injector.js
│   ├── overlay/
│   │   ├── panel.js
│   │   ├── toast.js
│   │   ├── blocker.js
│   │   └── overlay.css
│   ├── utils/
│   │   ├── hash.js
│   │   ├── sanitize.js
│   │   ├── normalize.js
│   │   └── storage.js
│   ├── popup/
│   │   ├── popup.html
│   │   └── popup.js
│   └── assets/icons/
│
└── scamshield-backend/         Python backend (Docker or local)
    ├── main.py
    ├── config.py
    ├── requirements.txt
    ├── Dockerfile
    ├── docker-compose.yml
    ├── .env.example
    ├── api/
    │   ├── analyze.py
    │   ├── feedback.py
    │   └── health.py
    ├── services/
    │   ├── nlp_engine.py
    │   ├── search_engine.py
    │   ├── scorer.py
    │   └── cache_service.py
    ├── scrapers/
    │   ├── google_cse.py
    │   ├── reddit.py
    │   ├── whois_check.py
    │   └── mca21.py
    ├── db/
    │   ├── database.py
    │   ├── models.py
    │   ├── queries.py
    │   └── migrations/
    │       └── 001_init.sql
    ├── ml/
    │   ├── train.py
    │   └── config.cfg
    ├── utils/
    │   └── extract_meta.py
    └── tests/
        ├── test_nlp.py
        ├── test_scrapers.py
        └── test_scorer.py

Getting Started

Prerequisites

  • Python 3.11+
  • Docker and Docker Compose
  • Google Custom Search API key and search engine ID (from the Programmable Search Engine console)
  • Chrome browser

Backend Setup

1. Clone and configure

cd scamshield-backend
cp .env.example .env

Edit .env and fill in:

DATABASE_URL=postgresql://scamshield:yourpassword@localhost:5432/scamshield
REDIS_URL=redis://localhost:6379/0
GOOGLE_CSE_API_KEY=your_key_here
GOOGLE_CSE_ID=your_cse_id_here
API_SECRET_KEY=any_random_string
ALLOWED_ORIGINS=chrome-extension://your-extension-id

2. Start all services

docker-compose up

This starts PostgreSQL, Redis, and the FastAPI server together.

3. Initialize the database (first time only)

docker-compose exec api psql $DATABASE_URL -f db/migrations/001_init.sql

4. Verify it's running

curl http://localhost:8000/health
# {"status":"ok","postgres":true,"redis":true}

5. Test an analysis

curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Internship at ABC Solutions. Pay registration fee of Rs 500 to confirm selection.",
    "hash": "test123"
  }'

Extension Setup

1. Get your extension ID

Load the extension first (step 2), copy the ID shown in chrome://extensions, then update ALLOWED_ORIGINS in your .env and restart the backend so the new origin is allowed.

2. Load in Chrome

  1. Open chrome://extensions
  2. Enable Developer mode (top right toggle)
  3. Click Load unpacked
  4. Select the scamshield-extension/ folder

3. Point the extension at your backend

Edit scamshield-extension/config.js:

const SS_CONFIG = {
  API_BASE_URL: 'http://localhost:8000',  // change to your server URL when deployed
  ...
};

4. Test it

Open any Google Form for a job application. The extension icon should show activity. Open DevTools → Console to see analysis logs.


Verdict Levels

| Level      | Score | What happens                                                                          |
|------------|-------|---------------------------------------------------------------------------------------|
| SCAM       | 75+   | Action blocked. Full warning modal with sources shown. User must explicitly override. |
| SUSPICIOUS | 35–74 | Soft toast warning shown. Action is not blocked.                                      |
| LEGIT      | 0–34  | Silent. Green badge in toolbar.                                                       |
| UNKNOWN    | n/a   | Page could not be analyzed. Action passes through.                                    |
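
The thresholds above map directly to a small function. This is a sketch of the mapping, not the shipped scorer code; `None` stands for a page that could not be analyzed.

```python
def verdict_level(score):
    """Map a 0-100 risk score to a verdict level per the table above."""
    if score is None:
        return "UNKNOWN"
    if score >= 75:
        return "SCAM"
    if score >= 35:
        return "SUSPICIOUS"
    return "LEGIT"
```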

Signal Scoring

The scorer combines signals from multiple sources:

| Signal                                                | Points |
|-------------------------------------------------------|--------|
| Payment request detected (via dependency parse)       | 45     |
| Sensitive data requested (Aadhaar, PAN, bank account) | 35     |
| Referral / MLM language                               | 30     |
| Not registered on MCA21                               | 20     |
| Domain age < 30 days                                  | 30     |
| Domain age 30–90 days                                 | 20     |
| Google search complaint results (up to 8 hits × 10)   | max 50 |
| Reddit complaint posts (× 12 per post)                | max 40 |
| Vague job role                                        | 10     |
| Urgency language                                      | 8      |

Scores are capped per category to prevent one bad signal from dominating. The NLP uses spaCy's dependency parser — not keyword matching — so rephrasing "registration fee" as "processing contribution" does not evade detection.
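
The capped combination can be sketched as follows. Per-signal weights follow the table above, but the category names and cap values here are illustrative assumptions, not the shipped configuration.

```python
# Hypothetical per-category caps; the real scorer lives in services/scorer.py.
CATEGORY_CAPS = {"nlp": 80, "search": 50, "reddit": 40, "registry": 50}

def combine(signals):
    """signals: list of (category, points). Points are summed per
    category, each category total is capped, and the grand total is
    clamped to 100 so no single source can dominate the verdict."""
    totals = {}
    for category, points in signals:
        totals[category] = totals.get(category, 0) + points
    capped = sum(min(v, CATEGORY_CAPS.get(c, v)) for c, v in totals.items())
    return min(capped, 100)

score = combine([
    ("nlp", 45),                              # payment request
    ("search", 10), ("search", 10), ("search", 10),
    ("registry", 20),                          # not on MCA21
])
```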


User Feedback Loop

Every verdict shown to a user includes two buttons: "Yes, it's a scam" and "No, looks legit".

Feedback is stored in the user_feedback table with:

  • Raw form text
  • NLP signals at time of analysis
  • System verdict
  • User verdict
  • Action taken (did they override the warning?)

When 3 or more users confirm a company as a scam, the Redis cache is invalidated and the company is re-scored with community data factored in.
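
The invalidation rule is simple threshold logic. In this sketch a plain dict stands in for Redis, and the key format and function name are hypothetical.

```python
CONFIRM_THRESHOLD = 3  # community confirmations before re-scoring

def record_confirmation(cache, counts, company):
    """Count a 'CONFIRM_SCAM' vote; once the threshold is reached,
    drop the cached verdict so the next lookup re-scores the company.
    Returns True when the cache entry was invalidated."""
    counts[company] = counts.get(company, 0) + 1
    if counts[company] >= CONFIRM_THRESHOLD:
        cache.pop(f"verdict:{company}", None)
        return True
    return False

cache = {"verdict:ABC Solutions": {"level": "SUSPICIOUS"}}
counts = {}
flags = [record_confirmation(cache, counts, "ABC Solutions") for _ in range(3)]
```

With Redis the `pop` would be a `DELETE` on the verdict key; the counting itself could live in Postgres alongside the feedback rows.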

Once you have ~200 labeled examples, run the retrain script to fine-tune the spaCy text classifier:

cd scamshield-backend
python ml/train.py
python -m spacy train ml/config.cfg --output ml/spacy_model

Replace the base model reference in services/nlp_engine.py with your fine-tuned model path. The system gets progressively more accurate with every new batch of feedback.


API Reference

POST /analyze

Analyzes form text and returns a verdict.

Request

{
  "text": "sanitized form text (max 4000 chars)",
  "hash": "fnv_hash_of_text"
}
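
The hash lets the backend cache verdicts without ever storing a second copy of the raw text. A minimal 32-bit FNV-1a in Python, assuming that is the variant `utils/hash.js` implements (the field name `fnv_hash_of_text` suggests FNV, but the exact variant is an assumption):

```python
def fnv1a_32(text: str) -> str:
    """32-bit FNV-1a over the UTF-8 bytes, returned as lowercase hex."""
    h = 0x811C9DC5                         # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, wrapped to 32 bits
    return format(h, "08x")

digest = fnv1a_32("Internship at ABC Solutions.")
```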

Response

{
  "company_name": "ABC Solutions",
  "level": "SCAM",
  "risk_score": 95,
  "mca21_registered": false,
  "domain_age_days": 18,
  "sources": [
    {
      "title": "ABC Solutions scam - Reddit",
      "url": "https://reddit.com/...",
      "snippet": "I paid Rs 500 and never heard back...",
      "source": "reddit.com"
    }
  ],
  "nlp_signals": [
    {
      "type": "PAYMENT_REQUEST",
      "context": "Pay registration fee of Rs 500 to confirm selection.",
      "weight": 45
    }
  ],
  "form_hash": "abc123"
}

POST /feedback

Stores user feedback on a verdict.

Request

{
  "company_name": "ABC Solutions",
  "system_verdict": "SCAM",
  "user_verdict": "CONFIRM_SCAM",
  "action_taken": "HEEDED_WARNING",
  "form_hash": "abc123"
}

GET /health

Returns service health status.

{
  "status": "ok",
  "postgres": true,
  "redis": true
}

Roadmap

Phase 1 — MVP (current)

  • Chrome extension with DOM extraction and real-time interception
  • FastAPI backend with spaCy NLP
  • Google CSE + Reddit + WHOIS + MCA21 search
  • Redis verdict cache
  • PostgreSQL with fuzzy company name matching
  • User feedback storage

Phase 2 — With real users

  • DistilBERT fine-tuned on feedback data (ONNX for local inference)
  • Celery + Redis queue for async scraping
  • Playwright for JS-rendered complaint pages
  • Elasticsearch for fuzzy name search at scale
  • Automated weekly model retraining

Phase 3 — Scale

  • Neo4j graph engine for entity fingerprinting across company name changes
  • Anomaly detection via Isolation Forest
  • Contrastive learning for evolving scam vocabulary
  • Community leaderboard of reported companies

Why Not Use an AI API?

ScamShield deliberately does not call any external AI API (OpenAI, Claude, etc.) for detection. The reasons:

  1. API keys in an extension are a security liability — any user can extract them from the extension source
  2. Latency — a round-trip to an external AI adds 2–5 seconds to every page load
  3. Cost — at scale, per-token pricing becomes unsustainable before you have revenue
  4. Privacy — sending raw form text including personal details to a third-party AI raises data concerns

spaCy with dependency parsing catches structural scam patterns just as reliably as a language model for this specific, narrow task — and runs in ~50ms on a CPU with no external dependencies after setup.


Contributing

This is an early-stage project. If you find a scam that wasn't detected or a legitimate company that was falsely flagged, the most valuable thing you can do is use the feedback buttons in the extension. Every labeled example directly improves the model.

For code contributions, open an issue first describing what you want to change.


License

MIT


Acknowledgements

Built to protect students and freshers in India from a very real and growing problem. If this extension stops even one person from losing money to a fake internship, it's worth it.
