MAGIK-935: Website URL Provider — Meta/OG/JSON-LD Extraction #116

@MAGIKBIT

Epic: EPIC-025 — #113
Priority: P0
Estimate: 5 SP
Depends on: MAGIK-934

Description

Create a WebsiteProvider class that crawls a given website URL and extracts profile-relevant data from HTML meta tags, Open Graph tags, JSON-LD structured data, schema.org markup, and visible page content.

Implementation

Class: Libraries/Enrichment/WebsiteProvider.php

URL matching: Any valid HTTP/HTTPS URL not matched by social-specific providers.
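
A minimal sketch of the URL check, assuming the provider registry tries the social-specific providers first so that this one only acts as the fallback. The issue names canHandleUrl() but not its body, so the implementation below is an illustration, not the agreed design.

```php
<?php
// Hypothetical body for canHandleUrl(); only the method name comes from the issue.
function canHandleUrl(string $url): bool
{
    $url = trim($url);

    // Reject anything PHP does not consider a syntactically valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // FILTER_VALIDATE_URL also accepts ftp:, mailto:, etc., so check the scheme explicitly.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host   = (string) parse_url($url, PHP_URL_HOST);

    return in_array($scheme, ['http', 'https'], true) && $host !== '';
}
```

Excluding URLs that belong to social providers is left to the registration order in EnrichmentService rather than duplicated here.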

Extraction layers (in priority order):

  1. JSON-LD / schema.org: <script type="application/ld+json"> for Organization, LocalBusiness, Person
  2. Open Graph tags: og:title, og:description, og:image, og:url, og:site_name
  3. HTML meta tags: <meta name="description">, <meta name="author">, <link rel="icon">
  4. Visible content heuristics: regex for emails (mailto:), phone patterns, address blocks
  5. Social link discovery: <a href> matching known social platform URL patterns
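
The first two layers could be sketched with PHP's built-in DOM extension as follows. The function names (loadDom, extractJsonLd, extractOpenGraph) are hypothetical and only illustrate what HtmlExtractor might expose; the issue does not specify its interface.

```php
<?php
// Parse HTML leniently; real-world pages are rarely valid, so libxml warnings are suppressed.
// Note: loadHTML() assumes ISO-8859-1 unless the document declares a charset.
function loadDom(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Layer 1: collect JSON-LD nodes typed Organization, LocalBusiness, or Person.
function extractJsonLd(DOMDocument $dom): array
{
    $wanted = ['Organization', 'LocalBusiness', 'Person'];
    $found  = [];
    $xpath  = new DOMXPath($dom);

    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (!is_array($data)) {
            continue; // invalid JSON: skip this block, keep going
        }
        // A block may hold one object, a list of objects, or an @graph wrapper.
        $nodes = $data['@graph'] ?? (isset($data[0]) ? $data : [$data]);
        foreach ($nodes as $node) {
            if (!is_array($node)) {
                continue;
            }
            $types = (array) ($node['@type'] ?? []);
            if (array_intersect($types, $wanted) !== []) {
                $found[] = $node;
            }
        }
    }
    return $found;
}

// Layer 2: map og:* properties to their content; the first occurrence wins.
function extractOpenGraph(DOMDocument $dom): array
{
    $tags  = [];
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $property = $meta->getAttribute('property');
        $content  = trim($meta->getAttribute('content'));
        if ($content !== '' && !isset($tags[$property])) {
            $tags[$property] = $content;
        }
    }
    return $tags;
}
```

Keeping each layer as a separate function makes it straightforward to run them in priority order and to test each one against its own fixture.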

Extracted fields:

| Field | Source (priority order) | Confidence |
|---|---|---|
| Company/Site name | JSON-LD > OG > <title> | 0.9 / 0.8 / 0.6 |
| Description | JSON-LD > OG > meta description | 0.9 / 0.8 / 0.7 |
| Logo/Favicon | JSON-LD logo > OG image > <link rel="icon"> | 0.9 / 0.7 / 0.5 |
| Emails | JSON-LD > mailto: links > regex | 0.9 / 0.8 / 0.5 |
| Phones | JSON-LD > tel: links > regex | 0.9 / 0.8 / 0.4 |
| Address | JSON-LD PostalAddress > address block regex | 0.9 / 0.3 |
| Social links | <a> href matching fb/ig/yt/tw/li patterns | 0.8 |
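
One way to read the table: for each field, walk the sources in priority order and keep the first non-empty value together with that source's confidence. The sketch below assumes this "first hit wins" rule and a hypothetical resolveField() helper; the issue does not say whether lower-priority values should also be retained as evidence.

```php
<?php
/**
 * Pick the first non-empty candidate from a list given in priority order.
 *
 * @param array<int, array{source: string, confidence: float, value: ?string}> $candidates
 * @return array{value: string, source: string, confidence: float}|null
 */
function resolveField(array $candidates): ?array
{
    foreach ($candidates as $candidate) {
        $value = trim((string) ($candidate['value'] ?? ''));
        if ($value !== '') {
            return [
                'value'      => $value,
                'source'     => $candidate['source'],
                'confidence' => $candidate['confidence'],
            ];
        }
    }
    return null; // no layer produced a value for this field
}

// Company/Site name: JSON-LD (0.9) > OG (0.8) > <title> (0.6), values are made up.
$name = resolveField([
    ['source' => 'json-ld',  'confidence' => 0.9, 'value' => null],
    ['source' => 'og:title', 'confidence' => 0.8, 'value' => 'Acme'],
    ['source' => 'title',    'confidence' => 0.6, 'value' => 'Acme | Home'],
]);
```

Here the JSON-LD layer found nothing, so the Open Graph title is used with confidence 0.8.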

HTTP client: CodeIgniter's CURLRequest with 10s timeout, User-Agent: MagikTap-Enrichment/1.0, robots.txt check.
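
The robots.txt check could look roughly like the sketch below. It is a deliberately simplified, hypothetical helper (isAllowedByRobots) that handles User-agent groups and plain path prefixes only; wildcards and other extensions are not covered, and fetching robots.txt itself through CURLRequest is omitted.

```php
<?php
// Simplified robots.txt evaluation: prefix rules only, longest matching prefix wins.
function isAllowedByRobots(string $robotsTxt, string $userAgent, string $path): bool
{
    // "MagikTap-Enrichment/1.0" -> "magiktap-enrichment"
    $agentToken = strtolower(explode('/', $userAgent)[0]);

    $rules   = ['*' => [], 'own' => []];
    $seen    = ['*' => false, 'own' => false];
    $current = [];     // groups the upcoming rules belong to
    $inRules = false;  // true once a rule line followed the User-agent lines

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$key, $value] = array_map('trim', explode(':', $line, 2));
        $key = strtolower($key);

        if ($key === 'user-agent') {
            if ($inRules) {            // a new group starts
                $current = [];
                $inRules = false;
            }
            $ua = strtolower($value);
            if ($ua === '*') {
                $current[] = '*';
                $seen['*'] = true;
            } elseif ($ua !== '' && strpos($agentToken, $ua) !== false) {
                $current[] = 'own';
                $seen['own'] = true;
            }
        } elseif ($key === 'allow' || $key === 'disallow') {
            $inRules = true;
            if ($value === '') {
                continue;              // an empty Disallow permits everything
            }
            foreach ($current as $group) {
                $rules[$group][] = [$key, $value];
            }
        }
    }

    // A group naming this agent takes precedence over the "*" group.
    $applicable = $seen['own'] ? $rules['own'] : $rules['*'];

    $allowed = true;
    $bestLen = -1;
    foreach ($applicable as [$type, $prefix]) {
        if (strpos($path, $prefix) === 0 && strlen($prefix) > $bestLen) {
            $bestLen = strlen($prefix);
            $allowed = ($type === 'allow');
        }
    }
    return $allowed;
}
```

A missing or unreadable robots.txt would need its own policy decision (for example, treat it as "allowed"), which the issue leaves open.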

Files

| File | Action |
|---|---|
| Libraries/Enrichment/WebsiteProvider.php | Create |
| Libraries/Enrichment/HtmlExtractor.php | Create (shared HTML parsing utility) |
| Libraries/EnrichmentService.php | Modify (register provider) |
| Config/Enrichment.php | Modify (add website config) |

Acceptance Criteria

  • canHandleUrl() matches any valid HTTP/HTTPS URL (fallback provider)
  • Extracts name, description, logo from OG tags
  • Extracts structured data from JSON-LD when available
  • Discovers email/phone from visible content
  • Discovers social media links from page anchors
  • Each field includes confidence score and source evidence
  • Respects robots.txt and 10s timeout
  • Returns graceful error on unreachable/blocked sites
  • Unit tests with fixture HTML files
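
The contact and social link criteria could be exercised against a small inline fixture like the one below. The discoverLinks() function, the host patterns, and the fixture content are all assumptions made for illustration; only the mailto:, tel:, and social anchor sources come from the issue.

```php
<?php
// Scan <a href> values for mailto: links, tel: links, and known social hosts.
function discoverLinks(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Hypothetical host patterns; the real list would live in Config/Enrichment.php.
    $socialHosts = [
        'facebook'  => '/(^|\.)facebook\.com$/i',
        'instagram' => '/(^|\.)instagram\.com$/i',
        'youtube'   => '/(^|\.)youtube\.com$/i',
        'twitter'   => '/(^|\.)(twitter|x)\.com$/i',
        'linkedin'  => '/(^|\.)linkedin\.com$/i',
    ];

    $result = ['emails' => [], 'phones' => [], 'social' => []];

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));

        if (stripos($href, 'mailto:') === 0) {
            // Strip the scheme and any ?subject=... query part.
            $email = explode('?', substr($href, 7), 2)[0];
            if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
                $result['emails'][] = strtolower($email);
            }
        } elseif (stripos($href, 'tel:') === 0) {
            // Keep digits and a leading plus sign only.
            $phone = preg_replace('/[^\d+]/', '', substr($href, 4));
            if ($phone !== '') {
                $result['phones'][] = $phone;
            }
        } else {
            $host = parse_url($href, PHP_URL_HOST);
            if (!is_string($host)) {
                continue; // relative link or unparsable URL
            }
            foreach ($socialHosts as $platform => $pattern) {
                if (preg_match($pattern, $host)) {
                    $result['social'][$platform] = $href;
                    break;
                }
            }
        }
    }

    $result['emails'] = array_values(array_unique($result['emails']));
    $result['phones'] = array_values(array_unique($result['phones']));
    return $result;
}

// Inline stand-in for a fixture HTML file.
$fixture = '<html><body>'
    . '<a href="mailto:Info@acme.test?subject=Hi">Mail</a>'
    . '<a href="tel:+1 (555) 010-2000">Call</a>'
    . '<a href="https://www.instagram.com/acme">Instagram</a>'
    . '<a href="/contact">Contact</a>'
    . '</body></html>';

$links = discoverLinks($fixture);
```

In a real test suite the fixture would be loaded from a file and the result compared field by field, including the confidence and source evidence each field is expected to carry.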
