MAGIK-935: Website URL Provider — Meta/OG/JSON-LD Extraction
Epic: EPIC-025 — #113
Priority: P0
Estimate: 5 SP
Depends on: MAGIK-934
Description
Create WebsiteProvider that crawls a given website URL and extracts profile-relevant data from HTML meta tags, Open Graph tags, JSON-LD structured data, schema.org markup, and visible page content.
Implementation
Class: Libraries/Enrichment/WebsiteProvider.php
URL matching: Any valid HTTP/HTTPS URL not matched by social-specific providers.
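A minimal sketch of the URL check, assuming `canHandleUrl()` takes the raw URL string and returns a boolean (the signature is not specified in this issue). It only validates the scheme and host; excluding URLs already claimed by social-specific providers is assumed to happen through registration order in `EnrichmentService`, which is a guess rather than a documented behaviour.

```php
<?php
// Hypothetical standalone version of WebsiteProvider::canHandleUrl().
function canHandleUrl(string $url): bool
{
    // FILTER_VALIDATE_URL accepts any scheme (ftp:, mailto:, ...),
    // so the scheme is checked separately below.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host   = parse_url($url, PHP_URL_HOST);

    return in_array($scheme, ['http', 'https'], true) && !empty($host);
}
```

Because this accepts nearly every web URL, the provider would need to be consulted after the social providers for the fallback behaviour to hold.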
Extraction layers (in priority order):
- JSON-LD / schema.org: `<script type="application/ld+json">` for Organization, LocalBusiness, Person
- Open Graph tags: `og:title`, `og:description`, `og:image`, `og:url`, `og:site_name`
- HTML meta tags: `<meta name="description">`, `<meta name="author">`, `<link rel="icon">`
- Visible content heuristics: regex for emails (`mailto:`), phone patterns, address blocks
- Social link discovery: `<a href>` matching known social platform URL patterns
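The first two layers could be sketched with PHP's built-in DOM extension as below. The function names (`parseHtml`, `extractJsonLdNodes`, `extractOpenGraph`) are hypothetical and only illustrate what `HtmlExtractor` might expose; the sketch assumes non-empty HTML input and does not handle character-set detection.

```php
<?php
// Hypothetical helpers; none of these names come from the issue.

function parseHtml(string $html): DOMXPath
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML rarely validates
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($doc);
}

/** Collect JSON-LD objects typed Organization, LocalBusiness, or Person. */
function extractJsonLdNodes(DOMXPath $xpath): array
{
    $wanted = ['Organization', 'LocalBusiness', 'Person'];
    $nodes  = [];

    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (!is_array($data)) {
            continue; // malformed JSON-LD: skip rather than fail the page
        }

        // A block may hold one object, a list of objects, or an @graph wrapper.
        if (isset($data['@graph']) && is_array($data['@graph'])) {
            $items = $data['@graph'];
        } elseif (isset($data[0])) {
            $items = $data;
        } else {
            $items = [$data];
        }

        foreach ($items as $item) {
            if (!is_array($item)) {
                continue;
            }
            $types = (array) ($item['@type'] ?? []); // @type may be a string or a list
            if (array_intersect($types, $wanted)) {
                $nodes[] = $item;
            }
        }
    }

    return $nodes;
}

/** Map og:* property names to their content values. */
function extractOpenGraph(DOMXPath $xpath): array
{
    $og = [];
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $og[$meta->getAttribute('property')] = trim($meta->getAttribute('content'));
    }

    return $og;
}
```

Skipping malformed JSON-LD instead of throwing lets the lower-priority layers still contribute values for the same page.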
Extracted fields:
| Field | Source | Confidence |
|---|---|---|
| Company/Site name | JSON-LD > OG > `<title>` | 0.9 / 0.8 / 0.6 |
| Description | JSON-LD > OG > meta description | 0.9 / 0.8 / 0.7 |
| Logo/Favicon | JSON-LD logo > OG image > `<link rel="icon">` | 0.9 / 0.7 / 0.5 |
| Emails | JSON-LD > `mailto:` links > regex | 0.9 / 0.8 / 0.5 |
| Phones | JSON-LD > `tel:` links > regex | 0.9 / 0.8 / 0.4 |
| Address | JSON-LD PostalAddress > address block regex | 0.9 / 0.3 |
| Social links | `<a>` href matching fb/ig/yt/tw/li patterns | 0.8 |
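One way to read the table is as an ordered fallback: for each field, take the first source that yields a non-empty value and attach that source's confidence. The sketch below assumes a hypothetical `pickBest()` helper and a candidate shape of `value`, `source`, and `confidence`; neither is defined in this issue.

```php
<?php
/**
 * Hypothetical helper: return the first candidate with a non-empty value.
 * $candidates must already be in priority order (e.g. JSON-LD, OG, <title>).
 */
function pickBest(array $candidates): ?array
{
    foreach ($candidates as $candidate) {
        if (isset($candidate['value']) && trim($candidate['value']) !== '') {
            return $candidate;
        }
    }

    return null; // no source produced a value for this field
}

// Usage: the JSON-LD layer found nothing, so the OG value and its 0.8 score win.
$name = pickBest([
    ['value' => null,          'source' => 'json-ld',  'confidence' => 0.9],
    ['value' => 'Acme',        'source' => 'og:title', 'confidence' => 0.8],
    ['value' => 'Acme | Home', 'source' => 'title',    'confidence' => 0.6],
]);
```

Keeping the `source` next to the value also gives each field the source evidence that the acceptance criteria ask for.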
HTTP client: CodeIgniter's CURLRequest with a 10s timeout, `User-Agent: MagikTap-Enrichment/1.0`, and a robots.txt check.
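The robots.txt check could be sketched as a pure function that takes the already-fetched robots.txt body and the path to be crawled. This is a simplified reading of the format: it matches plain path prefixes only (no `*` or `$` wildcards), and `isPathAllowed()` is a hypothetical name. Fetching the file itself would go through the same CURLRequest client and is not shown.

```php
<?php
// Hypothetical, simplified robots.txt evaluator (prefix matching only).
function isPathAllowed(string $robotsTxt, string $path, string $agent = 'MagikTap-Enrichment'): bool
{
    $groups  = [];    // user-agent token => list of [directive, path prefix]
    $current = [];    // agents named by the group currently being read
    $inRules = false; // true once the current group has seen a rule line

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) { // a User-agent line after rules starts a new group
                $current = [];
                $inRules = false;
            }
            $current[] = strtolower($value);
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            foreach ($current as $ua) {
                $groups[$ua][] = [$field, $value];
            }
        }
    }

    // Prefer a group that names this agent; otherwise fall back to "*".
    $rules = [];
    $agent = strtolower($agent);
    foreach ($groups as $ua => $groupRules) {
        $ua = (string) $ua;
        if ($ua !== '*' && $ua !== '' && strpos($agent, $ua) !== false) {
            $rules = $groupRules;
            break;
        }
    }
    if (!$rules) {
        $rules = $groups['*'] ?? [];
    }

    // Longest matching prefix wins; Allow wins ties; an empty value matches nothing.
    $allowed = true;
    $bestLen = -1;
    foreach ($rules as [$directive, $prefix]) {
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue;
        }
        $len = strlen($prefix);
        if ($len > $bestLen || ($len === $bestLen && $directive === 'allow')) {
            $bestLen = $len;
            $allowed = ($directive === 'allow');
        }
    }

    return $allowed;
}
```

A missing or empty robots.txt yields no rules, so every path is treated as allowed in this sketch.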
Files
| File | Action |
|---|---|
| `Libraries/Enrichment/WebsiteProvider.php` | Create |
| `Libraries/Enrichment/HtmlExtractor.php` | Create (shared HTML parsing utility) |
| `Libraries/EnrichmentService.php` | Modify (register provider) |
| `Config/Enrichment.php` | Modify (add website config) |
Acceptance Criteria
- `canHandleUrl()` matches any valid HTTP/HTTPS URL (fallback provider)
- Extracts name, description, logo from OG tags
- Extracts structured data from JSON-LD when available
- Discovers email/phone from visible content
- Discovers social media links from page anchors
- Each field includes confidence score and source evidence
- Respects robots.txt and 10s timeout
- Returns graceful error on unreachable/blocked sites
- Unit tests with fixture HTML files
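The anchor-based parts of the email/phone and social link criteria could be exercised against fixture HTML with something like the sketch below. `discoverLinks()` and the platform patterns are illustrative assumptions, not the patterns the provider is required to use; regex discovery over visible text is not covered here.

```php
<?php
// Hypothetical helper: scan <a href> values for mailto:, tel:, and social profile links.
function discoverLinks(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Illustrative host patterns only; a real list would need review.
    $social = [
        'facebook'  => '~^https?://(www\.)?facebook\.com/~i',
        'instagram' => '~^https?://(www\.)?instagram\.com/~i',
        'youtube'   => '~^https?://(www\.)?youtube\.com/~i',
        'twitter'   => '~^https?://(www\.)?(twitter|x)\.com/~i',
        'linkedin'  => '~^https?://([a-z]+\.)?linkedin\.com/~i',
    ];
    $found = ['emails' => [], 'phones' => [], 'social' => []];

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));

        if (stripos($href, 'mailto:') === 0) {
            // Drop any ?subject=... query that follows the address.
            $found['emails'][] = strtolower(explode('?', substr($href, 7))[0]);
        } elseif (stripos($href, 'tel:') === 0) {
            $found['phones'][] = substr($href, 4);
        } else {
            foreach ($social as $platform => $pattern) {
                if (preg_match($pattern, $href)) {
                    $found['social'][$platform] = $href;
                    break;
                }
            }
        }
    }

    $found['emails'] = array_values(array_unique($found['emails']));
    $found['phones'] = array_values(array_unique($found['phones']));

    return $found;
}
```

A fixture-based unit test would load a saved HTML file, pass its contents to a helper like this, and assert on the returned arrays.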