```
 ██████╗ ██╗   ██╗██████╗ ██╗
██╔════╝ ██║   ██║██╔══██╗██║
██║  ███╗██║   ██║██████╔╝██║
██║   ██║██║   ██║██╔══██╗██║
╚██████╔╝╚██████╔╝██║  ██║███████╗
 ╚═════╝  ╚═════╝ ╚═╝  ╚═╝╚══════╝
```
An intelligent web crawler built in Go that extracts email addresses and social profiles from websites with precision and speed.
🚀 Fast, intelligent, and scalable email discovery for modern web applications
- 🧠 Intelligent Crawling: Prioritizes contact and information pages
- 🌍 Multi-language Support: Recognizes keywords in 6 languages (Spanish, English, French, German, Italian, Portuguese)
- 🔄 Meta Redirects: Automatically follows HTML meta redirects
- ⚡ Redis Cache: Smart caching with 12-month persistence and 5,400x speed improvement
- 🚀 Async Processing: Background jobs with webhook notifications
- 🔍 Auto Deduplication: Automatically removes duplicate emails
- 🌐 Social Discovery: Extracts social profile links from known platforms
- 🐳 Dockerized: Easy deployment with Docker Compose
- 📡 REST API: Both synchronous and asynchronous endpoints
- ⚙️ Configurable Depth: Explore up to 3 levels deep (configurable)
- 🛡️ Input Guardrails: Accepts only absolute HTTP/HTTPS URLs
- 📏 Crawl Limits: Restricts per-page response size and total pages visited
- ♻️ Async Recovery: Requeues interrupted processing jobs on restart
If you like this project, consider buying me a coffee ☕💛

- Docker
- Docker Compose
```bash
# Pull and run the latest image
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  -p 6379:6379 \
  luisra51/gurl:latest

# Or use with external Redis
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  -e REDIS_HOST=your-redis-host \
  -e REDIS_PORT=6379 \
  luisra51/gurl:latest
```

Or build from source:

```bash
git clone https://github.com/luisra51/gurl.git
cd gurl
docker-compose up --build
```

The service will be available at http://localhost:8080.
The service assumes authentication, authorization, and rate limiting are handled by an upstream gateway or internal platform layer.
```bash
# Basic scan
curl "http://localhost:8080/scan?url=example.com"

# With specific protocol
curl "http://localhost:8080/scan?url=https://company.com"
```

Response:
```json
{
  "emails": ["info@example.com", "contact@example.com"],
  "social_profiles": [
    {
      "platform": "linkedin",
      "url": "https://linkedin.com/company/example",
      "handle": "example",
      "source_page": "https://company.com",
      "confidence": "high"
    }
  ],
  "from_cache": false,
  "crawl_time": "2.3s"
}
```
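If you are calling the service from Go, decoding this response takes only a few lines. A minimal client sketch, assuming the field layout shown above (the `ScanResult` and `SocialProfile` types are illustrative, not part of the project):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// ScanResult mirrors the JSON response of GET /scan shown above.
type ScanResult struct {
	Emails         []string        `json:"emails"`
	SocialProfiles []SocialProfile `json:"social_profiles"`
	FromCache      bool            `json:"from_cache"`
	CrawlTime      string          `json:"crawl_time"`
}

// SocialProfile mirrors one entry of "social_profiles".
type SocialProfile struct {
	Platform   string `json:"platform"`
	URL        string `json:"url"`
	Handle     string `json:"handle"`
	SourcePage string `json:"source_page"`
	Confidence string `json:"confidence"`
}

func main() {
	// Build the query string so the target URL is properly escaped.
	endpoint := "http://localhost:8080/scan?url=" + url.QueryEscape("https://company.com")

	resp, err := http.Get(endpoint)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var result ScanResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}
	fmt.Println("emails:", result.Emails, "cached:", result.FromCache)
}
```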
-H "Content-Type: application/json" \
-d '{
"url": "slow-website.com",
"webhook_url": "https://your-api.com/webhook",
"callback_id": "optional-tracking-id"
}'Immediate Response:
```json
{
  "job_id": "uuid-123-456-789",
  "status": "queued",
  "estimated_time": "30-60s",
  "webhook_url": "https://your-api.com/webhook",
  "check_status_url": "/scan/status/uuid-123-456-789"
}
```

Webhook Callback (When Complete):
```json
{
  "job_id": "uuid-123-456-789",
  "callback_id": "optional-tracking-id",
  "status": "completed",
  "url": "https://slow-website.com",
  "emails": ["contact@slow-website.com"],
  "social_profiles": [
    {
      "platform": "instagram",
      "url": "https://instagram.com/slow-website",
      "handle": "slow-website",
      "source_page": "https://slow-website.com",
      "confidence": "high"
    }
  ],
  "crawl_time": "45.2s",
  "pages_visited": 15,
  "completed_at": "2025-08-07T10:30:00Z"
}
```
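On the receiving end, this callback can be handled with a tiny HTTP server. A sketch of a consumer, assuming the payload fields shown above (the `social_profiles` field is omitted here for brevity):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// WebhookPayload mirrors the callback JSON shown above.
type WebhookPayload struct {
	JobID        string   `json:"job_id"`
	CallbackID   string   `json:"callback_id"`
	Status       string   `json:"status"`
	URL          string   `json:"url"`
	Emails       []string `json:"emails"`
	CrawlTime    string   `json:"crawl_time"`
	PagesVisited int      `json:"pages_visited"`
	CompletedAt  string   `json:"completed_at"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p WebhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		log.Printf("job %s finished with status %s: %d emails from %s",
			p.JobID, p.Status, len(p.Emails), p.URL)
		w.WriteHeader(http.StatusOK) // acknowledge so the sender does not retry
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```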
"emails": ["info@example.com", "contact@example.com"],
"social_profiles": [],
"from_cache": true,
"crawl_time": "396µs"
}{
"emails": [],
"from_cache": false,
"crawl_time": "2.1s"
}{
"error": "Invalid URL provided"
}The crawler intelligently recognizes contact-related keywords in 6 languages:
- 🇪🇸 Spanish: contacto, información, equipo, nosotros, empresa
- 🇺🇸 English: contact, about, team, support, help, office
- 🇫🇷 French: nous-contacter, équipe, aide, assistance, bureau
- 🇩🇪 German: kontakt, über-uns, impressum, unser-team, hilfe
- 🇮🇹 Italian: contatti, chi-siamo, squadra, informazioni, supporto
- 🇵🇹 Portuguese: contato, sobre-nos, equipe, ajuda, suporte
43+ keywords total across all languages for maximum coverage
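Conceptually, prioritization amounts to matching candidate links against this keyword list. A simplified sketch of the idea, with an abbreviated keyword sample (the actual scoring in `internal/crawler` may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// contactKeywords is a small sample of the multilingual keyword list;
// the real crawler uses 43+ keywords across 6 languages.
var contactKeywords = []string{
	"contact", "about", "team", "contacto", "nosotros",
	"kontakt", "impressum", "contatti", "contato", "équipe",
}

// isPriorityLink reports whether a link looks like a contact/info page
// and should be crawled before ordinary links.
func isPriorityLink(href, anchorText string) bool {
	haystack := strings.ToLower(href + " " + anchorText)
	for _, kw := range contactKeywords {
		if strings.Contains(haystack, kw) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isPriorityLink("/contacto", "Contáctanos"))  // true
	fmt.Println(isPriorityLink("/blog/post-1", "Read more")) // false
}
```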
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/scan?url=<website>` | Scan website (immediate response) |
| `GET` | `/cache/stats` | View Redis cache statistics |
| `DELETE` | `/cache/invalidate` | Clear all cache |
| `DELETE` | `/cache/invalidate?url=<website>` | Clear specific URL cache |
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/scan/async` | Create async scan job |
| `GET` | `/scan/status/<job_id>` | Check job status |
| `DELETE` | `/scan/cancel/<job_id>` | Cancel queued or processing job |
| `GET` | `/scan/jobs` | View active job statistics |
```bash
# View cache statistics
curl "http://localhost:8080/cache/stats"

# Check async job status
curl "http://localhost:8080/scan/status/uuid-123-456"

# Cancel queued job
curl -X DELETE "http://localhost:8080/scan/cancel/uuid-123-456"

# View active jobs and statistics
curl "http://localhost:8080/scan/jobs"

# Clear complete cache
curl -X DELETE "http://localhost:8080/cache/invalidate"
```
```bash
# Crawler Settings
CRAWLER_MAX_DEPTH=3                    # Maximum crawling depth
CRAWLER_DEDUPLICATE_EMAILS=true        # Remove duplicate emails
CRAWLER_MAX_RESPONSE_BYTES=1048576     # Max bytes read per page (1 MiB)
CRAWLER_MAX_PAGES_VISITED=50           # Max pages visited per crawl

# Cache Settings
CACHE_ENABLED=true                     # Enable Redis cache
CACHE_EXPIRATION_MONTHS=12             # Cache TTL in months

# Async Processing Settings
ASYNC_ENABLED=true                     # Enable async processing
ASYNC_WORKERS=3                        # Number of parallel workers
ASYNC_JOB_TIMEOUT_SECONDS=300          # Job timeout (5 minutes)
ASYNC_WEBHOOK_RETRIES=3                # Webhook retry attempts

# Redis Configuration
REDIS_HOST=localhost                   # Redis host
REDIS_PORT=6379                        # Redis port
REDIS_PERSIST_DISK=false               # Disk persistence (prod: true)

# Server Configuration
SERVER_PORT=8080                       # Server port
SERVER_HOST=0.0.0.0                    # Server host
SERVER_READ_HEADER_TIMEOUT_SECONDS=5   # Read header timeout
SERVER_READ_TIMEOUT_SECONDS=15         # Full request read timeout
SERVER_WRITE_TIMEOUT_SECONDS=30        # Response write timeout
SERVER_IDLE_TIMEOUT_SECONDS=60         # Keep-alive idle timeout
```
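Settings like these are typically read straight from the environment with defaults. A minimal sketch of that pattern for a few of the variables above (the real `internal/config` may be structured differently):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config holds a subset of the settings listed above.
type Config struct {
	MaxDepth     int
	CacheEnabled bool
	RedisHost    string
}

// getenv returns the value of key, or def when the variable is unset.
func getenv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// load reads settings from the environment, falling back to the
// defaults documented above.
func load() Config {
	depth, err := strconv.Atoi(getenv("CRAWLER_MAX_DEPTH", "3"))
	if err != nil {
		depth = 3
	}
	return Config{
		MaxDepth:     depth,
		CacheEnabled: getenv("CACHE_ENABLED", "true") == "true",
		RedisHost:    getenv("REDIS_HOST", "localhost"),
	}
}

func main() {
	fmt.Printf("%+v\n", load())
}
```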
- 🎯 Smart Crawling: Prioritizes contact pages with multilingual keywords
- 📊 Depth Control: Configurable depth (default: 3 levels)
- 📏 Resource Limits: Parses only HTML, caps page size, and caps total pages visited
- 🌐 Social Platforms: LinkedIn, X/Twitter, Instagram, Facebook, YouTube, TikTok, GitHub, Telegram, WhatsApp, Linktree
- ⚡ Cache System: Redis-based caching with 12-month TTL
- 🔄 Auto Deduplication: Automatic email normalization and deduplication (see the sketch after this list)
- 🚀 Performance: 5,400x faster responses on cache hits
- ♻️ Restart Recovery: Async jobs left in `processing` are requeued on startup
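As a rough illustration of the deduplication step flagged above, lowercasing and collecting into a set is the core of it (assumed behavior; the project's exact normalization rules may differ):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dedupeEmails lowercases and trims addresses, then removes duplicates.
func dedupeEmails(emails []string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, e := range emails {
		norm := strings.ToLower(strings.TrimSpace(e))
		if _, ok := seen[norm]; ok || norm == "" {
			continue
		}
		seen[norm] = struct{}{}
		out = append(out, norm)
	}
	sort.Strings(out) // stable output order
	return out
}

func main() {
	fmt.Println(dedupeEmails([]string{"Info@Example.com", "info@example.com ", "a@b.com"}))
	// [a@b.com info@example.com]
}
```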
- Scan targets must resolve to absolute `http://` or `https://` URLs.
- Async webhook targets must also be absolute `http://` or `https://` URLs.
- Bare domains such as `example.com` are normalized to `https://example.com`.
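These rules can be expressed in a few lines of Go. The sketch below shows the assumed normalization behavior, not the project's exact code:

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// normalizeTarget applies the rules above: bare domains get https://,
// and only absolute http/https URLs are accepted.
func normalizeTarget(raw string) (string, error) {
	if !strings.Contains(raw, "://") {
		raw = "https://" + raw // normalize bare domains like example.com
	}
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" {
		return "", errors.New("invalid URL")
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", errors.New("only http/https URLs are accepted")
	}
	return u.String(), nil
}

func main() {
	fmt.Println(normalizeTarget("example.com"))       // https://example.com <nil>
	fmt.Println(normalizeTarget("ftp://example.com")) // error
}
```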
Each discovered social profile includes:
- `platform`: normalized platform name
- `url`: canonical normalized profile URL
- `handle`: optional value derived from the URL path
- `source_page`: crawled page where the link was found
- `confidence`: simple heuristic signal (`high`, `medium`, `low`)
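For instance, the handle is presumably derived from the last path segment of the normalized URL; a hypothetical sketch of that derivation:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// handleFromURL derives an optional handle from a profile URL's path,
// e.g. https://instagram.com/slow-website -> "slow-website".
func handleFromURL(profile string) string {
	u, err := url.Parse(profile)
	if err != nil {
		return ""
	}
	segments := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(segments) == 0 || segments[len(segments)-1] == "" {
		return "" // no usable path segment
	}
	return segments[len(segments)-1]
}

func main() {
	fmt.Println(handleFromURL("https://linkedin.com/company/example")) // example
	fmt.Println(handleFromURL("https://instagram.com/slow-website"))  // slow-website
}
```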
- Improve social profile quality heuristics with more platform-specific filters, canonicalization rules, and confidence signals.
```
/
├── .env                    # Environment variables (development)
├── .env.example            # Configuration example
├── go.mod                  # Go dependencies
├── Dockerfile              # Container definition
├── docker-compose.yml      # Redis + App services
├── scan_urls.sh            # Batch processing script
├── cmd/
│   └── crawler/
│       └── main.go         # Application entry point
└── internal/
    ├── cache/
    │   └── cache.go        # Redis cache management
    ├── config/
    │   └── config.go       # Environment configuration
    ├── crawler/
    │   └── crawler.go      # Core crawling logic
    ├── handler/
    │   └── handler.go      # HTTP endpoints (sync + async)
    └── jobs/
        ├── types.go        # Job data types
        ├── queue.go        # Redis job queue
        └── worker.go       # Worker system + webhooks
```
- 🗄️ Cache Layer: Redis with configurable TTL and optional persistence
- ⚙️ Job Queue: Redis-based async system with parallel workers
- 📡 Webhook System: Result delivery with retries and exponential backoff
- 🌐 Multi-language: 43+ keywords across 6 languages
- 🔧 Config Management: Environment-based configuration
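The webhook bullet above implies a retry loop with exponentially growing delays. A generic sketch of that pattern (the retry count and base delay here are illustrative; in practice the count comes from `ASYNC_WEBHOOK_RETRIES`):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithBackoff POSTs payload to webhookURL, retrying failures with
// exponentially growing delays (1s, 2s, 4s, ...).
func postWithBackoff(webhookURL string, payload []byte, retries int) error {
	delay := time.Second
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < 300 {
				return nil // delivered and acknowledged
			}
			err = fmt.Errorf("webhook returned status %d", resp.StatusCode)
		}
		lastErr = err
		if attempt < retries {
			time.Sleep(delay)
			delay *= 2 // back off exponentially before the next attempt
		}
	}
	return fmt.Errorf("webhook delivery failed after %d retries: %w", retries, lastErr)
}

func main() {
	// Example only: the endpoint is the placeholder used in the docs above.
	err := postWithBackoff("https://your-api.com/webhook", []byte(`{"status":"completed"}`), 3)
	fmt.Println(err)
}
```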
```bash
# Copy environment variables
cp .env.example .env

# Start complete stack
docker-compose up --build
```

Or run without Docker:

```bash
# Install Redis locally
# Ubuntu/Debian: sudo apt install redis-server
# macOS: brew install redis

# Start Redis
redis-server

# Install Go dependencies
go mod tidy

# Run application
go run cmd/crawler/main.go
```

We welcome contributions! Here's how you can help:
- 🐛 Bug Reports: Found a bug? Open an issue
- ✨ Feature Requests: Have an idea? Start a discussion
- 📝 Documentation: Improve docs, add examples, fix typos
- 🌍 Translations: Add support for more languages
- 🧪 Testing: Write tests, test edge cases
- 💻 Code: Implement new features or fix bugs
- Fork the repository
- Clone your fork: `git clone https://github.com/your-username/gurl.git`, then `cd gurl`
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes
- Test your changes: `docker-compose up --build`
- Commit and push: `git commit -m "Add amazing feature"`, then `git push origin feature/amazing-feature`
- Open a Pull Request

Coding guidelines:

- Follow standard Go conventions (`go fmt`, `go vet`)
- Add tests for new features
- Update documentation for API changes
- Use meaningful commit messages
- JavaScript: Does not execute JavaScript, only analyzes static HTML
- Single Page Applications: Limited on SPAs that load content dynamically
- Rate limiting: Does not implement throttling between requests
- Same domain: Only crawls pages from the same base domain
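The same-domain restriction amounts to comparing hostnames before following a link. A sketch of the assumed check (the real crawler may treat subdomains differently):

```go
package main

import (
	"fmt"
	"net/url"
)

// sameDomain reports whether candidate shares the seed URL's hostname,
// which is the condition for a link to be crawled at all.
func sameDomain(seed, candidate string) bool {
	s, err1 := url.Parse(seed)
	c, err2 := url.Parse(candidate)
	if err1 != nil || err2 != nil {
		return false
	}
	return s.Hostname() == c.Hostname()
}

func main() {
	fmt.Println(sameDomain("https://example.com", "https://example.com/contact")) // true
	fmt.Println(sameDomain("https://example.com", "https://other.com/contact"))   // false
}
```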
- 💼 Lead Generation: Find contact emails from company websites
- 🔍 Research Automation: Collect contact information at scale
- 📊 Competitive Analysis: Study competitor contact pages
- 🔗 API Integration: Integrate with CRMs via webhooks
- 📦 Batch Processing: Process thousands of URLs with `scan_urls.sh` (or via the API, as sketched after this list)
- 🏗️ Microservices: Email discovery service for distributed architectures
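As an alternative to the shell script, batch submission against the async endpoint takes only a few lines of Go. A sketch, assuming the request and response shapes shown earlier:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// submitAsync queues one URL for scanning via POST /scan/async and
// prints the returned job ID.
func submitAsync(target, webhook string) error {
	body, err := json.Marshal(map[string]string{
		"url":         target,
		"webhook_url": webhook,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post("http://localhost:8080/scan/async",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var out struct {
		JobID string `json:"job_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return err
	}
	fmt.Println("queued", target, "as job", out.JobID)
	return nil
}

func main() {
	urls := []string{"example.com", "company.com"}
	for _, u := range urls {
		if err := submitAsync(u, "https://your-api.com/webhook"); err != nil {
			fmt.Println("failed:", u, err)
		}
	}
}
```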
```bash
# Single container (no Redis persistence)
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  luisra51/gurl:latest

# With Docker Compose (includes Redis)
docker-compose -f docker-compose.hub.yml up -d

# Production with external Redis
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  -e REDIS_HOST=your-redis-host \
  -e REDIS_PORT=6379 \
  -e REDIS_PERSIST_DISK=true \
  -e ASYNC_WORKERS=5 \
  -e CACHE_EXPIRATION_MONTHS=12 \
  luisra51/gurl:latest
```

```bash
# Quick development (no persistence)
docker-compose up --build

# Fast rebuilds
docker-compose up --build crawler-app

# Clean and start fresh
docker-compose down -v && docker-compose up --build
```

```bash
docker build -t email-crawler .
docker run -p 8080:8080 email-crawler
```

```bash
# View cache statistics
curl "http://localhost:8080/cache/stats"

# View worker and job status
curl "http://localhost:8080/scan/jobs"

# Application logs
docker-compose logs -f crawler-app

# Redis logs
docker-compose logs -f redis

# Enter container for debugging
docker-compose exec crawler-app sh
```