ScamShield

A Chrome extension that detects internship scams in real time — before you click Apply, upload a document, or hand over personal details.


The Problem

Internship scams in India follow a predictable pattern: fake company, vague role, upfront registration fee, collection of Aadhaar/PAN details, and disappearance. Thousands of students fall for these every year because they look legitimate on the surface. No existing tool intercepts them at the point of interaction.

ScamShield sits in your browser, analyzes every internship form or job posting you visit, searches the web for complaints about the company, and warns you — with sources — before you do anything irreversible.


How It Works

Three things happen simultaneously when you land on a job or internship page:

1. Real-time interception. A content script injected into every page watches for dangerous actions: clicking "Apply Now", uploading a document, following a WhatsApp link. If the page has been flagged, your action is intercepted before it completes and a warning is shown with evidence.

2. Backend search. The company name and form text are sent to a FastAPI backend that concurrently searches Google Custom Search and Reddit and checks WHOIS domain age and MCA21 company registration. Results are cached in Redis for 24 hours, so repeat lookups are instant.
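
The concurrent fan-out can be sketched with `asyncio.gather`. This is a hypothetical illustration: the stub coroutines stand in for the real scrapers in `scrapers/`, and the helper name `gather_evidence` is an assumption, not the shipped API.

```python
import asyncio

# Stubs standing in for the real scrapers (google_cse, reddit, whois, mca21).
async def search_google(company: str) -> dict:
    return {"source": "google_cse", "hits": []}

async def search_reddit(company: str) -> dict:
    return {"source": "reddit", "hits": []}

async def check_whois(company: str) -> dict:
    return {"source": "whois", "domain_age_days": None}

async def check_mca21(company: str) -> dict:
    return {"source": "mca21", "registered": None}

async def gather_evidence(company: str) -> list:
    """Run all four lookups concurrently; a failed source yields an
    error marker instead of sinking the whole analysis."""
    results = await asyncio.gather(
        search_google(company),
        search_reddit(company),
        check_whois(company),
        check_mca21(company),
        return_exceptions=True,
    )
    return [r if not isinstance(r, Exception) else {"error": str(r)}
            for r in results]

evidence = asyncio.run(gather_evidence("ABC Solutions"))
```

`return_exceptions=True` is what keeps one slow or broken source from blocking the verdict; the scorer simply sees fewer signals.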

3. NLP analysis. spaCy parses the form text with dependency parsing rather than keyword lists, detecting structural patterns such as payment requests, sensitive-data harvesting, referral schemes, and urgency language regardless of exact wording.
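
The structural idea can be shown in miniature. This sketch does not load a spaCy model; each token is a hand-written `(lemma, dep, head_lemma)` triple so the example runs standalone, and the vocabulary sets are illustrative assumptions, not the engine's actual lists.

```python
# Simplified illustration of dependency-based detection: flag any
# fee-like noun whose syntactic head is a payment verb, regardless
# of the exact surface wording.

PAYMENT_VERBS = {"pay", "deposit", "transfer", "send"}
FEE_NOUNS = {"fee", "charge", "amount", "contribution", "payment"}

def has_payment_request(tokens):
    """tokens: iterable of (lemma, dep, head_lemma) triples."""
    return any(
        dep in {"dobj", "obj"} and head in PAYMENT_VERBS and lemma in FEE_NOUNS
        for lemma, dep, head in tokens
    )

# "Pay registration fee of Rs 500" -> "fee" is the direct object of "pay"
parsed = [
    ("pay", "ROOT", "pay"),
    ("registration", "compound", "fee"),
    ("fee", "dobj", "pay"),
]
```

Because the check is on the verb-object relation, "pay a processing contribution" trips the same rule as "pay a registration fee".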


Architecture

Browser (Chrome Extension)
├── content/        DOM parsing, MutationObserver, capture-phase interception
├── overlay/        Warning UI injected into page (Shadow DOM)
├── utils/          Shared helpers: hash, sanitize, storage, normalize
├── popup/          Toolbar icon panel
└── background.js   Service worker: API calls, Redis-backed cache

Backend (FastAPI + Python)
├── api/            Route handlers: /analyze, /feedback, /health
├── services/       Business logic: NLP engine, search orchestrator, scorer, cache
├── scrapers/       One file per source: Google CSE, Reddit, WHOIS, MCA21
├── db/             SQLAlchemy models, async queries, migrations
└── ml/             spaCy model, weekly retrain script from user feedback

Tech Stack

| Layer           | Technology                                      |
|-----------------|-------------------------------------------------|
| Extension       | Vanilla JS, no framework, no build step         |
| Backend         | Python 3.11, FastAPI, uvicorn                   |
| NLP             | spaCy en_core_web_md                            |
| Database        | PostgreSQL with pg_trgm for fuzzy name matching |
| Cache           | Redis (verdict cache, 24hr TTL)                 |
| Search          | Google Custom Search API, Reddit JSON API       |
| Domain check    | python-whois                                    |
| Company check   | MCA21 registry                                  |
| HTTP client     | httpx (async)                                   |
| ORM             | SQLAlchemy 2.0 (async)                          |
| Fuzzy matching  | RapidFuzz                                       |
| Dev environment | Docker Compose                                  |
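
Fuzzy name matching matters because scam operators vary spellings and suffixes. The backend uses RapidFuzz; this dependency-free sketch uses the stdlib `difflib.SequenceMatcher` as a stand-in (its `ratio()` behaves similarly once scaled to 0-100). The cutoff value is an illustrative assumption.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def best_match(name: str, known: list, cutoff: float = 80.0):
    """Return the closest known company name, or None below the cutoff."""
    scored = [(similarity(name, k), k) for k in known]
    score, match = max(scored)
    return match if score >= cutoff else None

# Misspelled scrape still resolves to the known entry.
match = best_match("ABC Solutons Pvt Ltd",
                   ["ABC Solutions Pvt Ltd", "XYZ Corp"])
```

On the database side, pg_trgm does the analogous trigram-similarity lookup inside PostgreSQL itself.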

Repository Structure

scamshield/
├── scamshield-extension/       Chrome extension (load unpacked)
│   ├── manifest.json
│   ├── config.js
│   ├── background.js
│   ├── content/
│   │   ├── index.js
│   │   ├── extractor.js
│   │   ├── interceptor.js
│   │   ├── observer.js
│   │   └── page-injector.js
│   ├── overlay/
│   │   ├── panel.js
│   │   ├── toast.js
│   │   ├── blocker.js
│   │   └── overlay.css
│   ├── utils/
│   │   ├── hash.js
│   │   ├── sanitize.js
│   │   ├── normalize.js
│   │   └── storage.js
│   ├── popup/
│   │   ├── popup.html
│   │   └── popup.js
│   └── assets/icons/
│
└── scamshield-backend/         Python backend (Docker or local)
    ├── main.py
    ├── config.py
    ├── requirements.txt
    ├── Dockerfile
    ├── docker-compose.yml
    ├── .env.example
    ├── api/
    │   ├── analyze.py
    │   ├── feedback.py
    │   └── health.py
    ├── services/
    │   ├── nlp_engine.py
    │   ├── search_engine.py
    │   ├── scorer.py
    │   └── cache_service.py
    ├── scrapers/
    │   ├── google_cse.py
    │   ├── reddit.py
    │   ├── whois_check.py
    │   └── mca21.py
    ├── db/
    │   ├── database.py
    │   ├── models.py
    │   ├── queries.py
    │   └── migrations/
    │       └── 001_init.sql
    ├── ml/
    │   ├── train.py
    │   └── config.cfg
    ├── utils/
    │   └── extract_meta.py
    └── tests/
        ├── test_nlp.py
        ├── test_scrapers.py
        └── test_scorer.py

Getting Started

Prerequisites

  • Python 3.11+
  • Docker and Docker Compose
  • Google Custom Search API key and search engine ID (from the Programmable Search Engine console)
  • Chrome browser

Backend Setup

1. Clone and configure

cd scamshield-backend
cp .env.example .env

Edit .env and fill in:

DATABASE_URL=postgresql://scamshield:yourpassword@localhost:5432/scamshield
REDIS_URL=redis://localhost:6379/0
GOOGLE_CSE_API_KEY=your_key_here
GOOGLE_CSE_ID=your_cse_id_here
API_SECRET_KEY=any_random_string
ALLOWED_ORIGINS=chrome-extension://your-extension-id

2. Start all services

docker-compose up

This starts PostgreSQL, Redis, and the FastAPI server together.

3. Initialize the database (first time only)

docker-compose exec api psql $DATABASE_URL -f db/migrations/001_init.sql

4. Verify it's running

curl http://localhost:8000/health
# {"status":"ok","postgres":true,"redis":true}

5. Test an analysis

curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Internship at ABC Solutions. Pay registration fee of Rs 500 to confirm selection.",
    "hash": "test123"
  }'

Extension Setup

1. Get your extension ID

Load the extension first (step 2), copy the ID shown in chrome://extensions, then update ALLOWED_ORIGINS in your .env and restart the backend so the new origin is allowed.

2. Load in Chrome

  1. Open chrome://extensions
  2. Enable Developer mode (top right toggle)
  3. Click Load unpacked
  4. Select the scamshield-extension/ folder

3. Point the extension at your backend

Edit scamshield-extension/config.js:

const SS_CONFIG = {
  API_BASE_URL: 'http://localhost:8000',  // change to your server URL when deployed
  ...
};

4. Test it

Open any Google Form for a job application. The extension icon should show activity. Open DevTools → Console to see analysis logs.


Verdict Levels

| Level      | Score | What happens                                                                          |
|------------|-------|---------------------------------------------------------------------------------------|
| SCAM       | 75+   | Action blocked. Full warning modal with sources shown. User must explicitly override. |
| SUSPICIOUS | 35–74 | Soft toast warning shown. Action is not blocked.                                      |
| LEGIT      | 0–34  | Silent. Green badge in toolbar.                                                       |
| UNKNOWN    | n/a   | Page could not be analyzed. Action passes through.                                    |
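
The thresholds above map directly to a small function. This is a sketch of the mapping, not the shipped scorer code; `None` stands for a page that could not be analyzed.

```python
def verdict_level(score):
    """Map a 0-100 risk score to a verdict level per the table above."""
    if score is None:
        return "UNKNOWN"
    if score >= 75:
        return "SCAM"
    if score >= 35:
        return "SUSPICIOUS"
    return "LEGIT"
```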

Signal Scoring

The scorer combines signals from multiple sources:

| Signal                                                | Points |
|-------------------------------------------------------|--------|
| Payment request detected (via dependency parse)       | 45     |
| Sensitive data requested (Aadhaar, PAN, bank account) | 35     |
| Referral / MLM language                               | 30     |
| Not registered on MCA21                               | 20     |
| Domain age < 30 days                                  | 30     |
| Domain age 30–90 days                                 | 20     |
| Google search complaint results (up to 8 hits × 10)   | max 50 |
| Reddit complaint posts (× 12 per post)                | max 40 |
| Vague job role                                        | 10     |
| Urgency language                                      | 8      |

Scores are capped per category to prevent one bad signal from dominating. The NLP uses spaCy's dependency parser — not keyword matching — so rephrasing "registration fee" as "processing contribution" does not evade detection.
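
The capped combination can be sketched as follows. Per-signal weights follow the table above, but the category names and cap values here are illustrative assumptions, not the shipped configuration.

```python
# Hypothetical per-category caps; the real scorer lives in services/scorer.py.
CATEGORY_CAPS = {"nlp": 80, "search": 50, "reddit": 40, "registry": 50}

def combine(signals):
    """signals: list of (category, points). Points are summed per
    category, each category total is capped, and the grand total is
    clamped to 100 so no single source can dominate the verdict."""
    totals = {}
    for category, points in signals:
        totals[category] = totals.get(category, 0) + points
    capped = sum(min(v, CATEGORY_CAPS.get(c, v)) for c, v in totals.items())
    return min(capped, 100)

score = combine([
    ("nlp", 45),                              # payment request
    ("search", 10), ("search", 10), ("search", 10),
    ("registry", 20),                          # not on MCA21
])
```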


User Feedback Loop

Every verdict shown to a user includes two buttons: "Yes, it's a scam" and "No, looks legit".

Feedback is stored in the user_feedback table with:

  • Raw form text
  • NLP signals at time of analysis
  • System verdict
  • User verdict
  • Action taken (did they override the warning?)

When 3 or more users confirm a company as a scam, the Redis cache is invalidated and the company is re-scored with community data factored in.
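
The invalidation rule is simple threshold logic. In this sketch a plain dict stands in for Redis, and the key format and function name are hypothetical.

```python
CONFIRM_THRESHOLD = 3  # community confirmations before re-scoring

def record_confirmation(cache, counts, company):
    """Count a 'CONFIRM_SCAM' vote; once the threshold is reached,
    drop the cached verdict so the next lookup re-scores the company.
    Returns True when the cache entry was invalidated."""
    counts[company] = counts.get(company, 0) + 1
    if counts[company] >= CONFIRM_THRESHOLD:
        cache.pop(f"verdict:{company}", None)
        return True
    return False

cache = {"verdict:ABC Solutions": {"level": "SUSPICIOUS"}}
counts = {}
flags = [record_confirmation(cache, counts, "ABC Solutions") for _ in range(3)]
```

With Redis the `pop` would be a `DELETE` on the verdict key; the counting itself could live in Postgres alongside the feedback rows.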

Once you have ~200 labeled examples, run the retrain script to fine-tune the spaCy text classifier:

cd scamshield-backend
python ml/train.py
python -m spacy train ml/config.cfg --output ml/spacy_model

Replace the base model reference in services/nlp_engine.py with your fine-tuned model path. The system gets progressively more accurate with every new batch of feedback.


API Reference

POST /analyze

Analyzes form text and returns a verdict.

Request

{
  "text": "sanitized form text (max 4000 chars)",
  "hash": "fnv_hash_of_text"
}
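
The hash lets the backend cache verdicts without ever storing a second copy of the raw text. A minimal 32-bit FNV-1a in Python, assuming that is the variant `utils/hash.js` implements (the field name `fnv_hash_of_text` suggests FNV, but the exact variant is an assumption):

```python
def fnv1a_32(text: str) -> str:
    """32-bit FNV-1a over the UTF-8 bytes, returned as lowercase hex."""
    h = 0x811C9DC5                         # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, wrapped to 32 bits
    return format(h, "08x")

digest = fnv1a_32("Internship at ABC Solutions.")
```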

Response

{
  "company_name": "ABC Solutions",
  "level": "SCAM",
  "risk_score": 95,
  "mca21_registered": false,
  "domain_age_days": 18,
  "sources": [
    {
      "title": "ABC Solutions scam - Reddit",
      "url": "https://reddit.com/...",
      "snippet": "I paid Rs 500 and never heard back...",
      "source": "reddit.com"
    }
  ],
  "nlp_signals": [
    {
      "type": "PAYMENT_REQUEST",
      "context": "Pay registration fee of Rs 500 to confirm selection.",
      "weight": 45
    }
  ],
  "form_hash": "abc123"
}

POST /feedback

Stores user feedback on a verdict.

Request

{
  "company_name": "ABC Solutions",
  "system_verdict": "SCAM",
  "user_verdict": "CONFIRM_SCAM",
  "action_taken": "HEEDED_WARNING",
  "form_hash": "abc123"
}

GET /health

Returns service health status.

{
  "status": "ok",
  "postgres": true,
  "redis": true
}

Roadmap

Phase 1 — MVP (current)

  • Chrome extension with DOM extraction and real-time interception
  • FastAPI backend with spaCy NLP
  • Google CSE + Reddit + WHOIS + MCA21 search
  • Redis verdict cache
  • PostgreSQL with fuzzy company name matching
  • User feedback storage

Phase 2 — With real users

  • DistilBERT fine-tuned on feedback data (ONNX for local inference)
  • Celery + Redis queue for async scraping
  • Playwright for JS-rendered complaint pages
  • Elasticsearch for fuzzy name search at scale
  • Automated weekly model retraining

Phase 3 — Scale

  • Neo4j graph engine for entity fingerprinting across company name changes
  • Anomaly detection via Isolation Forest
  • Contrastive learning for evolving scam vocabulary
  • Community leaderboard of reported companies

Why Not Use an AI API?

ScamShield deliberately does not call any external AI API (OpenAI, Claude, etc.) for detection. The reasons:

  1. API keys in an extension are a security liability — any user can extract them from the extension source
  2. Latency — a round-trip to an external AI adds 2–5 seconds to every page load
  3. Cost — at scale, per-token pricing becomes unsustainable before you have revenue
  4. Privacy — sending raw form text including personal details to a third-party AI raises data concerns

spaCy with dependency parsing catches structural scam patterns just as reliably as a language model for this specific, narrow task — and runs in ~50ms on a CPU with no external dependencies after setup.


Contributing

This is an early-stage project. If you find a scam that wasn't detected or a legitimate company that was falsely flagged, the most valuable thing you can do is use the feedback buttons in the extension. Every labeled example directly improves the model.

For code contributions, open an issue first describing what you want to change.


License

MIT


Acknowledgements

Built to protect students and freshers in India from a very real and growing problem. If this extension stops even one person from losing money to a fake internship, it's worth it.
