A Chrome extension that detects internship scams in real time — before you click Apply, upload a document, or hand over personal details.
Internship scams in India follow a predictable pattern: fake company, vague role, upfront registration fee, collection of Aadhaar/PAN details, and disappearance. Thousands of students fall for these every year because they look legitimate on the surface. No existing tool intercepts them at the point of interaction.
ScamShield sits in your browser, analyzes every internship form or job posting you visit, searches the web for complaints about the company, and warns you — with sources — before you do anything irreversible.
Three things happen simultaneously when you land on a job or internship page:
1. Real-time interception: a content script injected into every page watches for dangerous actions — clicking "Apply Now", uploading a document, following a WhatsApp link. If the page has been flagged, your action is intercepted before it completes and a warning is shown with evidence.
2. Backend search: the company name and form text are sent to a FastAPI backend that searches Google Custom Search and Reddit and checks WHOIS domain age and MCA21 company registration — all concurrently. Results are cached in Redis for 24 hours so repeat lookups are instant.
3. NLP analysis: spaCy parses the form text using dependency parsing — not keyword lists — to detect structural patterns like payment requests, sensitive data harvesting, referral schemes, and urgency language, regardless of exact wording.
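The concurrent backend lookups in step 2 can be sketched with `asyncio.gather`. The stub coroutines below stand in for the real scrapers (`google_cse.py`, `reddit.py`, `whois_check.py`, `mca21.py`) — their names, return shapes, and values here are assumptions for illustration, not the actual service code:

```python
import asyncio

# Hypothetical stand-ins for the real scrapers; each real one would make
# an async HTTP call. asyncio.sleep simulates network latency.
async def search_google(company: str) -> list[str]:
    await asyncio.sleep(0.1)
    return [f"{company} scam complaint"]

async def search_reddit(company: str) -> list[str]:
    await asyncio.sleep(0.1)
    return [f"r/india thread about {company}"]

async def check_whois(company: str) -> int:
    await asyncio.sleep(0.1)
    return 18  # domain age in days

async def check_mca21(company: str) -> bool:
    await asyncio.sleep(0.1)
    return False  # not found in the registry

async def gather_evidence(company: str) -> dict:
    """Run all four lookups concurrently; wall time ~ the slowest single call."""
    google, reddit, domain_age, registered = await asyncio.gather(
        search_google(company),
        search_reddit(company),
        check_whois(company),
        check_mca21(company),
    )
    return {
        "google": google,
        "reddit": reddit,
        "domain_age_days": domain_age,
        "mca21_registered": registered,
    }

evidence = asyncio.run(gather_evidence("ABC Solutions"))
print(evidence["mca21_registered"])  # False
```

Running the four sources concurrently rather than sequentially is what keeps a cache-miss analysis close to the latency of the single slowest lookup.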
```
Browser (Chrome Extension)
├── content/          DOM parsing, MutationObserver, capture-phase interception
├── overlay/          Warning UI injected into page (Shadow DOM)
├── utils/            Shared helpers: hash, sanitize, storage, normalize
├── popup/            Toolbar icon panel
└── background.js     Service worker: API calls, Redis-backed cache

Backend (FastAPI + Python)
├── api/              Route handlers: /analyze, /feedback, /health
├── services/         Business logic: NLP engine, search orchestrator, scorer, cache
├── scrapers/         One file per source: Google CSE, Reddit, WHOIS, MCA21
├── db/               SQLAlchemy models, async queries, migrations
└── ml/               spaCy model, weekly retrain script from user feedback
```
| Layer | Technology |
|---|---|
| Extension | Vanilla JS, no framework, no build step |
| Backend | Python 3.11, FastAPI, uvicorn |
| NLP | spaCy en_core_web_md |
| Database | PostgreSQL with pg_trgm for fuzzy name matching |
| Cache | Redis (verdict cache, 24hr TTL) |
| Search | Google Custom Search API, Reddit JSON API |
| Domain check | python-whois |
| Company check | MCA21 registry |
| HTTP client | httpx (async) |
| ORM | SQLAlchemy 2.0 (async) |
| Fuzzy matching | RapidFuzz |
| Dev environment | Docker Compose |
```
scamshield/
├── scamshield-extension/        Chrome extension (load unpacked)
│   ├── manifest.json
│   ├── config.js
│   ├── background.js
│   ├── content/
│   │   ├── index.js
│   │   ├── extractor.js
│   │   ├── interceptor.js
│   │   ├── observer.js
│   │   └── page-injector.js
│   ├── overlay/
│   │   ├── panel.js
│   │   ├── toast.js
│   │   ├── blocker.js
│   │   └── overlay.css
│   ├── utils/
│   │   ├── hash.js
│   │   ├── sanitize.js
│   │   ├── normalize.js
│   │   └── storage.js
│   ├── popup/
│   │   ├── popup.html
│   │   └── popup.js
│   └── assets/icons/
│
└── scamshield-backend/          Python backend (Docker or local)
    ├── main.py
    ├── config.py
    ├── requirements.txt
    ├── Dockerfile
    ├── docker-compose.yml
    ├── .env.example
    ├── api/
    │   ├── analyze.py
    │   ├── feedback.py
    │   └── health.py
    ├── services/
    │   ├── nlp_engine.py
    │   ├── search_engine.py
    │   ├── scorer.py
    │   └── cache_service.py
    ├── scrapers/
    │   ├── google_cse.py
    │   ├── reddit.py
    │   ├── whois_check.py
    │   └── mca21.py
    ├── db/
    │   ├── database.py
    │   ├── models.py
    │   ├── queries.py
    │   └── migrations/
    │       └── 001_init.sql
    ├── ml/
    │   ├── train.py
    │   └── config.cfg
    ├── utils/
    │   └── extract_meta.py
    └── tests/
        ├── test_nlp.py
        ├── test_scrapers.py
        └── test_scorer.py
```
- Python 3.11+
- Docker and Docker Compose
- Google Custom Search API key
- Chrome browser
1. Clone and configure

```bash
cd scamshield-backend
cp .env.example .env
```

Edit `.env` and fill in:

```env
DATABASE_URL=postgresql://scamshield:yourpassword@localhost:5432/scamshield
REDIS_URL=redis://localhost:6379/0
GOOGLE_CSE_API_KEY=your_key_here
GOOGLE_CSE_ID=your_cse_id_here
API_SECRET_KEY=any_random_string
ALLOWED_ORIGINS=chrome-extension://your-extension-id
```
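For reference, a minimal sketch of how the backend's `config.py` might consume these variables. The real module is not shown in this README, so the class and attribute names here are assumptions:

```python
import os

# A sketch only; the actual scamshield-backend/config.py may differ.
class Settings:
    def __init__(self) -> None:
        self.database_url = os.environ.get("DATABASE_URL", "")
        self.redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
        self.google_cse_api_key = os.environ.get("GOOGLE_CSE_API_KEY", "")
        self.google_cse_id = os.environ.get("GOOGLE_CSE_ID", "")
        self.api_secret_key = os.environ.get("API_SECRET_KEY", "")
        # ALLOWED_ORIGINS is a comma-separated list used for CORS
        self.allowed_origins = [
            o.strip()
            for o in os.environ.get("ALLOWED_ORIGINS", "").split(",")
            if o.strip()
        ]

# Demo: simulate the extension origin being configured
os.environ["ALLOWED_ORIGINS"] = "chrome-extension://abcdefgh"
settings = Settings()
print(settings.allowed_origins)  # ['chrome-extension://abcdefgh']
```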
2. Start all services

```bash
docker-compose up
```

This starts PostgreSQL, Redis, and the FastAPI server together.
3. Initialize the database (first time only)

```bash
docker-compose exec api psql $DATABASE_URL -f db/migrations/001_init.sql
```

4. Verify it's running

```bash
curl http://localhost:8000/health
# {"status":"ok","postgres":true,"redis":true}
```

5. Test an analysis
```bash
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Internship at ABC Solutions. Pay registration fee of Rs 500 to confirm selection.",
    "hash": "test123"
  }'
```

1. Get your extension ID
Load the extension first (step 2), copy the ID shown in `chrome://extensions`, then update `ALLOWED_ORIGINS` in your `.env`.
2. Load in Chrome

- Open `chrome://extensions`
- Enable Developer mode (top-right toggle)
- Click Load unpacked
- Select the `scamshield-extension/` folder
3. Point the extension at your backend

Edit `scamshield-extension/config.js`:

```js
const SS_CONFIG = {
  API_BASE_URL: 'http://localhost:8000', // change to your server URL when deployed
  ...
};
```

4. Test it
Open any Google Form for a job application. The extension icon should show activity. Open DevTools → Console to see analysis logs.
| Level | Score | What happens |
|---|---|---|
| SCAM | 75+ | Action blocked. Full warning modal with sources shown. User must explicitly override. |
| SUSPICIOUS | 35–74 | Soft toast warning shown. Action is not blocked. |
| LEGIT | 0–34 | Silent. Green badge in toolbar. |
| UNKNOWN | — | Page could not be analyzed. Action passes through. |
The scorer combines signals from multiple sources:
| Signal | Points |
|---|---|
| Payment request detected (via dependency parse) | 45 |
| Sensitive data requested (Aadhaar, PAN, bank account) | 35 |
| Referral / MLM language | 30 |
| Not registered on MCA21 | 20 |
| Domain age < 30 days | 30 |
| Domain age 30–90 days | 20 |
| Google search complaint results (up to 8 hits × 10) | max 50 |
| Reddit complaint posts (per post × 12) | max 40 |
| Vague job role | 10 |
| Urgency language | 8 |
Scores are capped per category to prevent one bad signal from dominating. The NLP uses spaCy's dependency parser — not keyword matching — so rephrasing "registration fee" as "processing contribution" does not evade detection.
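The capping described above can be sketched as a per-category clamp before summing. The cap values here are inferred from the "max" entries in the signal table; the real values live in `services/scorer.py` and may differ:

```python
# Per-category caps, inferred from the signal table above (assumptions).
CATEGORY_CAPS = {
    "nlp": 100,       # payment, sensitive data, referral, vague role, urgency
    "registry": 20,   # MCA21 lookup
    "domain": 30,     # WHOIS domain age
    "google": 50,     # Google CSE complaint hits (10 points each)
    "reddit": 40,     # Reddit complaint posts (12 points each)
}

def combine(signals: dict[str, int]) -> int:
    """Sum raw points per category, cap each category, clamp the total to 0-100."""
    total = sum(
        min(points, CATEGORY_CAPS.get(cat, 0)) for cat, points in signals.items()
    )
    return max(0, min(total, 100))

# Eight Google hits would be 80 raw points; the category cap holds them at 50,
# so search results alone can never push a company into SCAM territory.
print(combine({"google": 80}))                              # 50
print(combine({"google": 80, "reddit": 60, "domain": 30}))  # 100 (clamped)
```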
Every verdict shown to a user includes two buttons: "Yes, it's a scam" and "No, looks legit".
Feedback is stored in the user_feedback table with:
- Raw form text
- NLP signals at time of analysis
- System verdict
- User verdict
- Action taken (did they override the warning?)
When 3 or more users confirm a company as a scam, the Redis cache is invalidated and the company is re-scored with community data factored in.
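The threshold rule can be sketched as follows. This is illustrative only: a dict stands in for Redis, and the function, key format, and counter storage are all hypothetical names, not the actual service code:

```python
SCAM_CONFIRMATION_THRESHOLD = 3  # per the rule above

class FakeCache:
    """Stand-in for the Redis verdict cache."""
    def __init__(self) -> None:
        self.store = {"verdict:abc solutions": "LEGIT"}
    def delete(self, key: str) -> None:
        self.store.pop(key, None)

confirmations: dict[str, int] = {}

def record_confirmation(cache: FakeCache, company: str) -> bool:
    """Count a 'Yes, it's a scam' vote; once the threshold is reached,
    drop the cached verdict so the next lookup re-scores the company
    with community data factored in."""
    key = company.lower()
    confirmations[key] = confirmations.get(key, 0) + 1
    if confirmations[key] >= SCAM_CONFIRMATION_THRESHOLD:
        cache.delete(f"verdict:{key}")
        return True
    return False

cache = FakeCache()
for _ in range(3):
    invalidated = record_confirmation(cache, "ABC Solutions")
print(invalidated)   # True on the third vote
print(cache.store)   # {} -- cached verdict dropped
```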
Once you have ~200 labeled examples, run the retrain script to fine-tune the spaCy text classifier:

```bash
cd scamshield-backend
python ml/train.py
python -m spacy train ml/config.cfg --output ml/spacy_model
```

Replace the base model reference in `services/nlp_engine.py` with your fine-tuned model path. The system gets progressively more accurate with every new batch of feedback.
POST /analyze

Analyzes form text and returns a verdict.

Request

```json
{
  "text": "sanitized form text (max 4000 chars)",
  "hash": "fnv_hash_of_text"
}
```

Response
```json
{
  "company_name": "ABC Solutions",
  "level": "SCAM",
  "risk_score": 95,
  "mca21_registered": false,
  "domain_age_days": 18,
  "sources": [
    {
      "title": "ABC Solutions scam - Reddit",
      "url": "https://reddit.com/...",
      "snippet": "I paid Rs 500 and never heard back...",
      "source": "reddit.com"
    }
  ],
  "nlp_signals": [
    {
      "type": "PAYMENT_REQUEST",
      "context": "Pay registration fee of Rs 500 to confirm selection.",
      "weight": 45
    }
  ],
  "form_hash": "abc123"
}
```

POST /feedback

Stores user feedback on a verdict.
Request

```json
{
  "company_name": "ABC Solutions",
  "system_verdict": "SCAM",
  "user_verdict": "CONFIRM_SCAM",
  "action_taken": "HEEDED_WARNING",
  "form_hash": "abc123"
}
```

GET /health

Returns service health status.
```json
{
  "status": "ok",
  "postgres": true,
  "redis": true
}
```

- Chrome extension with DOM extraction and real-time interception
- FastAPI backend with spaCy NLP
- Google CSE + Reddit + WHOIS + MCA21 search
- Redis verdict cache
- PostgreSQL with fuzzy company name matching
- User feedback storage
- DistilBERT fine-tuned on feedback data (ONNX for local inference)
- Celery + Redis queue for async scraping
- Playwright for JS-rendered complaint pages
- Elasticsearch for fuzzy name search at scale
- Automated weekly model retraining
- Neo4j graph engine for entity fingerprinting across company name changes
- Anomaly detection via Isolation Forest
- Contrastive learning for evolving scam vocabulary
- Community leaderboard of reported companies
ScamShield deliberately does not call any external AI API (OpenAI, Claude, etc.) for detection. The reasons:
- API keys in an extension are a security liability — any user can extract them from the extension source
- Latency — a round-trip to an external AI adds 2–5 seconds to every page load
- Cost — at scale, per-token pricing becomes unsustainable before you have revenue
- Privacy — sending raw form text including personal details to a third-party AI raises data concerns
spaCy with dependency parsing catches structural scam patterns just as reliably as a language model for this specific, narrow task — and runs in ~50ms on a CPU with no external dependencies after setup.
This is an early-stage project. If you find a scam that wasn't detected or a legitimate company that was falsely flagged, the most valuable thing you can do is use the feedback buttons in the extension. Every labeled example directly improves the model.
For code contributions, open an issue first describing what you want to change.
MIT
Built to protect students and freshers in India from a very real and growing problem. If this extension stops even one person from losing money to a fake internship, it's worth it.