SRME is a research intelligence platform that ingests university faculty directories, builds a semantic index of researchers and their publications, and enables collaborator discovery using NLP embeddings.
It is engineered to handle real-world university websites that use:
- A-Z indexes
- Pagination
- Drupal AJAX + CSRF protection
- Cookie banners
- JavaScript-hydrated content
| Problem | Traditional Approach | SRME Approach |
|---|---|---|
| Multi-page faculty directories | Manual scraper per university | Autonomous traversal heuristics |
| Drupal / CSRF blocking | Site-specific hacks | Generic CSRF + session injection engine |
| JS-rendered directories | Impossible with requests | Headless Chromium fallback (Playwright) |
| Emails hidden in profiles | Missed data | Deep profile scraping + de-obfuscation |
| Finding collaborators | Manual Google search | Semantic embedding match engine |
| Data export for outreach | Manual copy/paste | One-click Excel export |
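As an illustration of the "CSRF + session injection" row, the sketch below collects the hidden form fields from a directory page and replays them in a follow-up POST on the same session. It is a minimal sketch under assumptions, not SRME's actual code: the helper names (`extract_form_state`, `post_with_form_state`) and the `ajax_path` argument are hypothetical, and `session` is assumed to be a `requests.Session`-like object with `get` and `post` methods.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_form_state(html):
    """Return the hidden form fields (tokens, build IDs, ...) found in html."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


def post_with_form_state(session, page_url, ajax_path, extra):
    """Hypothetical helper: GET the page, then POST its hidden fields back.

    Reusing one session keeps the cookies issued by the GET, so the
    token and the session cookie are sent together.
    """
    resp = session.get(page_url, timeout=30)
    resp.raise_for_status()
    payload = {**extract_form_state(resp.text), **extra}
    return session.post(urljoin(page_url, ajax_path), data=payload, timeout=30)
```

Passing the session in as an argument keeps the parsing step free of network dependencies, so it can be tested against saved HTML.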
SRME discovers, crawls, parses, and embeds faculty data in the background.
Input
- University name
- Faculty directory URL
Output
- Indexed professors
- Emails
- Papers
- Semantic embeddings
Enter your research interests (e.g. Explainable AI) and SRME returns the most relevant researchers using cosine similarity over embeddings.
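The ranking step can be sketched as follows. This is an illustrative example rather than the project's implementation: the three-dimensional vectors and researcher names are made up, and in practice the vectors would come from the embedding model (all-MiniLM-L6-v2 produces 384-dimensional embeddings).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def rank_researchers(query_vec, researcher_vecs, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in researcher_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data (hypothetical): real embeddings would be model outputs.
toy_index = {
    "Researcher A": [0.9, 0.1, 0.0],
    "Researcher B": [0.1, 0.9, 0.2],
    "Researcher C": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
ranking = rank_researchers(query, toy_index, top_k=2)
```

Cosine similarity ignores vector length and compares direction only, which is why it suits comparing a short interest phrase against longer publication text.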
When standard HTTP scraping fails, SRME automatically:
- Launches headless Chromium
- Accepts cookies
- Clicks “Load More”
- Extracts fully rendered HTML
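The fallback steps above might look roughly like this with Playwright's synchronous API. Treat it as a sketch under assumptions: the function names, the link-count heuristic for detecting a failed scrape, and the "accept" / "load more" text patterns are all hypothetical and would need tuning per site.

```python
import re


def needs_browser_fallback(html, min_profile_links=5):
    """Assumed heuristic: very few links suggests client-side rendering."""
    links = re.findall(r"<a\s[^>]*href=", html, flags=re.I)
    return len(links) < min_profile_links


def render_with_chromium(url, max_clicks=20):
    """Load url in headless Chromium and return the rendered HTML."""
    # Imported lazily so the heuristic above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        # Accept the cookie banner if a matching button is present.
        accept = page.get_by_role("button", name=re.compile(r"accept", re.I))
        if accept.count() > 0:
            accept.first.click()

        # Click "Load More" until it disappears or the click budget runs out.
        for _ in range(max_clicks):
            more = page.get_by_text(re.compile(r"load more", re.I))
            if more.count() == 0 or not more.first.is_visible():
                break
            more.first.click()
            page.wait_for_load_state("networkidle")

        html = page.content()
        browser.close()
        return html
```

The `max_clicks` cap is a safety bound so a misbehaving page cannot keep the worker clicking indefinitely.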
| Feature | Description | Verified On |
|---|---|---|
| Universal A-Z traversal | Detects alphabetical index automatically | ETH Zurich |
| Pagination traversal | Detects numeric/next pagination | Toronto |
| Drupal AJAX + CSRF bypass | Injects form state into AJAX POST | Oxford, Imperial |
| Session persistence | requests.Session maintains cookies | Cambridge |
| JS rendering fallback | Playwright browser automation | Oxford Physics |
| Email de-obfuscation | Handles [at], (dot) patterns | Multiple sites |
| Concurrent ingestion | 5 workers, atomic DB updates | Stress tested |
| Idempotent pipeline | No duplicate professors or papers | Verified tests |
| Excel export | Live DB export for outreach | Production ready |
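To make the email de-obfuscation row concrete, here is a minimal sketch of normalizing `[at]` and `(dot)` style patterns before extracting addresses. The exact set of patterns SRME handles is not documented here, so the bracket variants below are an assumption, and `deobfuscate_emails` is a hypothetical name.

```python
import re

# Assumed obfuscation variants: [at], (at), {at} and the same for "dot".
_AT = re.compile(r"\s*(?:\[at\]|\(at\)|\{at\})\s*", re.I)
_DOT = re.compile(r"\s*(?:\[dot\]|\(dot\)|\{dot\})\s*", re.I)
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def deobfuscate_emails(text):
    """Return unique, lower-cased email addresses found in text."""
    normalized = _AT.sub("@", text)
    normalized = _DOT.sub(".", normalized)
    emails = []
    for match in _EMAIL.findall(normalized):
        email = match.lower()
        if email not in emails:  # keep first-seen order, drop duplicates
            emails.append(email)
    return emails
```

Only bracketed tokens are rewritten, so ordinary words such as "at" or "dot" in surrounding prose are left alone.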
Faculty Directory URL
│
▼
Universal Scraper
(Traversal + AJAX + JS)
│
▼
Faculty Parser ──► Email Deep Scraper
│
▼
Paper Fetcher (Semantic Scholar)
│
▼
Embedding Engine (MiniLM)
│
▼
Semantic Match API
│
▼
UI
| Path | Responsibility |
|---|---|
| backend/core/scraper.py | Universal directory crawler (A-Z, AJAX, CSRF, JS) |
| backend/workers/tasks.py | Ingestion workers, embedding pipeline |
| backend/models.py | DB models (Professor, Paper, Author) |
| backend/database.py | SQLite / SQLAlchemy setup |
| backend/main.py | FastAPI routes + Excel export |
| frontend/index.html | Ingest + Search UI |
| assets/ | Screenshots & demo GIF |
| Layer | Technology |
|---|---|
| API | FastAPI |
| DB | SQLite + SQLAlchemy |
| NLP | sentence-transformers/all-MiniLM-L6-v2 |
| Scraping | Requests + BeautifulSoup |
| JS Rendering | Playwright (Chromium) |
| Export | OpenPyXL |
| Concurrency | ThreadPool Workers |
pip install -r requirements.txt
playwright install chromium
uvicorn main:app --reload --port 8001
Open: http://127.0.0.1:8001
GET /export/professors.xlsx
| Column | Description |
|---|---|
| Name | Professor name |
| Email | Extracted or deep-scraped |
| Profile URL | Faculty page |
| University | Source university |
| Department | Department name |
| Papers Indexed | Unique paper count |
| University | Challenge | SRME Result |
|---|---|---|
| ETH Zurich | A-Z segmented directory | 16 → 225 faculty |
| Oxford Physics | Drupal AJAX + cookies + JS | 9 → 229 faculty |
| Cambridge DAMTP | Strict SSL chain | Successful |
| Toronto CS | Standard directory | Stable extraction |
- Atomic DB progress tracking
- Race-condition safe inserts
- Idempotent professor & paper ingestion
- Backoff & retry for external APIs
- Search while indexing
- Worker stability under concurrency
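One common way to get idempotent, race-safe inserts in SQLite is a uniqueness constraint combined with `INSERT OR IGNORE`, sketched below with the standard-library `sqlite3` module. This is an assumption about the technique, not a description of SRME's code: the table layout and the choice of `profile_url` as the unique key are hypothetical, and the project itself uses SQLAlchemy models.

```python
import sqlite3


def init_db(conn):
    """Create a minimal, hypothetical professors table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS professors (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            profile_url TEXT NOT NULL UNIQUE
        )
        """
    )


def insert_professor(conn, name, profile_url):
    """Insert once; return True if a row was added, False if it existed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO professors (name, profile_url) VALUES (?, ?)",
            (name, profile_url),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
init_db(conn)
first = insert_professor(conn, "Jane Doe", "https://example.edu/people/jane-doe")
second = insert_professor(conn, "Jane Doe", "https://example.edu/people/jane-doe")
total = conn.execute("SELECT COUNT(*) FROM professors").fetchone()[0]
```

Because the database enforces uniqueness, two workers that reach the same profile at the same time cannot both insert it, and re-running ingestion leaves the row count unchanged.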
MIT License


