feat: normalize manifest + page API contracts (Phase 0 + 4) #59

Open
tnunamak wants to merge 15 commits into main from feat/connector-contract-unification

@tnunamak tnunamak commented Apr 10, 2026

Summary

Part of the multi-repo connector contract unification plan. See docs/260410-contract-freeze.md for the locked-in decisions this PR implements.

Paired PRs:

Phase 0 — freeze the contract shape

  • Canonical manifest schema in schemas/manifest.schema.json. Every *-playwright.json now has manifest_version, connector_id, source_id, page_api_version, canonical connect_url/connect_selector/icon, while keeping the legacy aliases (id, connectURL, etc.) for backward compatibility.
  • types/connector.d.ts now exposes the minimum canonical page API — added click, fill, press, waitForSelector, and url (previously de facto only in Context Gateway). requestInput remains canonical; getInput is never part of the contract.
  • types/page-api-version.ts exports PAGE_API_VERSION = 1. Independent of manifest_version, connector version, and per-scope schema version.
  • Instagram: new public scope instagram.following with its own schema. Ad targeting categories remain additive on the existing instagram.ads schema (the categories field was already there).
  • iCloud Notes: upstreamed as a canonical connector (connectors/apple/icloud-notes-playwright.{js,json} + schemas/icloud_notes.{notes,folders}.json + registry.json entry).
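
The canonicalization described in the first bullet can be sketched roughly as follows. This is an illustration only, not the contents of scripts/normalize-manifests.mjs: the `ALIASES` table covers just the two legacy aliases the description names (`id`, `connectURL`), `MANIFEST_VERSION = 1` is an assumed value, and `normalizeManifest` / `wouldChange` are hypothetical helper names.

```javascript
// Hypothetical sketch; not the real scripts/normalize-manifests.mjs.
const PAGE_API_VERSION = 1; // value stated in this PR
const MANIFEST_VERSION = 1; // assumed for illustration

// Legacy alias -> canonical field. Only these two aliases are named in the PR.
const ALIASES = { id: 'connector_id', connectURL: 'connect_url' };

function normalizeManifest(manifest) {
  const out = { ...manifest };
  for (const [legacy, canonical] of Object.entries(ALIASES)) {
    // Fill the canonical field from its alias, but never overwrite it and
    // never delete the alias (backward compatibility).
    if (out[canonical] === undefined && out[legacy] !== undefined) {
      out[canonical] = out[legacy];
    }
  }
  if (out.manifest_version === undefined) out.manifest_version = MANIFEST_VERSION;
  if (out.page_api_version === undefined) out.page_api_version = PAGE_API_VERSION;
  return out;
}

// "--check" style query: would normalizing change this manifest?
function wouldChange(manifest) {
  return JSON.stringify(normalizeManifest(manifest)) !== JSON.stringify(manifest);
}
```

Because canonical fields are only filled when absent, running the function on its own output is a no-op, which is the property a `--check` run relies on.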

Late-cycle fixes (post initial review)

  • Instagram login — replaced the broken input.value = x + dispatchEvent pattern and stale input[name="username"] selectors with the CG-proven page.fill + page.press('Enter') pattern and the real 2026 selectors (input[name="email"], input[name="pass"]). The canonical script also now emits the instagram.following scope via the ported scraper helper.
  • GitHub login — replaced the canonical script wholesale with the CG origin/main version (which has been running in production) and translated its CG-only calls to the canonical API: getInput({title, description, schema, uiSchema}) → requestInput({message, schema}), keyboard_press('Enter') → page.press(selector, 'Enter'). Preserves the retry loop, URL-based 2FA detection, passkey / device-verify paths, and #login_field/#password selectors.
  • iCloud Notes scoped shape — wrapped the result in canonical scoped keys (icloud_notes.notes, icloud_notes.folders) that the registered schemas already advertise, and rewrote the schemas to match the exact fields the CG-in-prod script emits (recordName, title, snippet, folder, isPinned, createdDate, modifiedDate, hasAttachments, textContent).
  • Manifest descriptions — stripped "using Playwright browser automation" from 14 user-facing top-level descriptions (the field is rendered in end-user flows; implementation strategy is maintainer detail). Refreshed registry.json descriptions + checksums.

Guardrail scripts

| Script | Acceptance target |
| --- | --- |
| scripts/normalize-manifests.mjs | idempotent canonicalization |
| scripts/validate-manifests.mjs | HC-MANIFEST-CONTRACT-001/002/003, HC-WHAT-NOT-TO-DO-002 |
| scripts/validate-scope-schemas.mjs | HC-RESULT-CONTRACT-001 |
| scripts/check-page-api-additive.mjs | HC-PHASE4-PAGE-API-ADDITIVE-001 |
| scripts/check-source-id-stability.mjs | HC-COMPAT-SOURCE-ID-001 |
| scripts/check-additive-schemas.mjs | HC-COMPAT-ADDITIVE-SCHEMA-001 |

All wired into .github/workflows/contract-guardrails.yml.
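
As an illustration of what an additive-only guardrail checks, here is a minimal sketch, not the actual scripts/check-page-api-additive.mjs. It assumes the method names have already been extracted from types/connector.d.ts on the base ref and on the branch; `diffMethods` and `isAdditive` are hypothetical names.

```javascript
// Hypothetical sketch of an additive-only check over two method-name lists.
function diffMethods(baseMethods, headMethods) {
  const base = new Set(baseMethods);
  const head = new Set(headMethods);
  return {
    added: [...head].filter((m) => !base.has(m)).sort(),
    removed: [...base].filter((m) => !head.has(m)).sort(),
  };
}

// The contract stays additive as long as nothing from the base disappears.
function isAdditive(baseMethods, headMethods) {
  return diffMethods(baseMethods, headMethods).removed.length === 0;
}
```

Adding click, fill, press, waitForSelector, and url on top of an unchanged base yields five additions and no removals, so the check passes; dropping any base method would fail it.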

Conformance fixture

connectors/_conformance/conformance-playwright.{js,json} is a synthetic connector that exercises every minimum-surface method and emits a conformance.result scope payload. Designed to run identically in both the DataConnect playwright-runner and Context Gateway's client-side runtime. Wiring it into a cross-runtime harness is Phase 4 follow-up work.

Smoke test evidence

End-to-end runs via CG /lab against the production remote-browser.vana.workers.dev worker with real credentials, asserting on actual scoped payload contents (not UI "Connected" messages):

  • GitHub (@tnunamak) — 99 repos, 10 starred, scoped keys github.profile, github.repositories, github.starred. Login including 2FA OTP worked on first try.
  • Instagram (@dondochaka) — profile, 0 posts, real following accounts, ad topics. Scoped keys instagram.profile, instagram.posts, instagram.following, instagram.ads.
  • iCloud Notes — 2 real notes ("Hmm", "This is a test") + 1 folder. Scoped keys icloud_notes.notes, icloud_notes.folders. Note: iCloud Notes still depends on the CG-runtime-only page API (getInput, frame_*, keyboard_*), advertised as the cg-legacy-page-api capability in the manifest. Converging this to the canonical minimum surface is Phase 4+ follow-up work — in the meantime, the canonical data-connect registry does not list iCloud Notes, so it cannot be selected there.
  • Oura — deferred (overlay flag activate_canonical_script: false). The CG-in-prod Oura script still ships; canonical convergence is a follow-up.

Test plan

  • node scripts/normalize-manifests.mjs --check — 20 manifests, 0 would change
  • node scripts/validate-manifests.mjs — 20/20 pass
  • node scripts/validate-scope-schemas.mjs — 16 registered manifests, 50 schemas
  • BASE_REF=origin/main node scripts/check-page-api-additive.mjs — 5 methods added, 0 removed
  • BASE_REF=origin/main node scripts/check-additive-schemas.mjs — 41 schemas, 0 breaking changes
  • Manual smoke: GitHub, Instagram, iCloud Notes all return scoped payloads end-to-end (see above)
  • Manual smoke: schema-health-check workflow still passes after new iCloud Notes entries

Phase 0 of the connector contract unification plan:

- Add canonical manifest schema (schemas/manifest.schema.json) and a
  normalizer that adds manifest_version, connector_id, source_id,
  page_api_version, connect_url, connect_selector, and icon to every
  *-playwright.json while preserving legacy aliases.
- Extend the typed Page API (types/connector.d.ts) with the canonical
  minimum surface (click, fill, press, waitForSelector, url) so that
  all shells can converge on a single runtime contract.
- Add types/page-api-version.ts exporting PAGE_API_VERSION = 1.
- Upstream iCloud Notes as a canonical connector (manifest, two scope
  schemas, registry entry).
- Add a new public scope instagram.following with its own schema, and
  record the contract decision (including Instagram ads targeting
  categories remaining additive on instagram.ads) in
  docs/260410-contract-freeze.md.
- Add a conformance fixture connector under connectors/_conformance/
  that probes every minimum-surface method.
- Add guardrail scripts:
    validate-manifests.mjs            (HC-MANIFEST-CONTRACT-001/002/003)
    normalize-manifests.mjs           (idempotent canonicalization)
    validate-scope-schemas.mjs        (HC-RESULT-CONTRACT-001)
    check-page-api-additive.mjs       (HC-PHASE4-PAGE-API-ADDITIVE-001)
    check-source-id-stability.mjs     (HC-COMPAT-SOURCE-ID-001)
    check-additive-schemas.mjs        (HC-COMPAT-ADDITIVE-SCHEMA-001)
- Wire all guardrails into a new .github/workflows/contract-guardrails.yml
  workflow.

See docs/260410-contract-freeze.md for the full decision record.

github-actions bot commented Apr 10, 2026

Schema Health Check — 10 issue(s) found

40/46 scopes consistent | 3 missing schema files | 6 not in Gateway | 0 metadata drift | 1 orphaned


Missing local schema files:

| Scope | Connector | Added in PR? |
| --- | --- | --- |
| steam.profile | steam-playwright | |
| steam.games | steam-playwright | |
| steam.friends | steam-playwright | |

Not registered in Gateway:

| Scope | Connector | Added in PR? |
| --- | --- | --- |
| instagram.following | instagram-playwright | added in this PR |
| steam.profile | steam-playwright | |
| steam.games | steam-playwright | |
| steam.friends | steam-playwright | |
| icloud_notes.notes | icloud-notes-playwright | added in this PR |
| icloud_notes.folders | icloud-notes-playwright | added in this PR |

Orphaned schema files (no connector declares this scope):

  • schemas/doordash.orders.json

These issues should be resolved before connectors can be used by Personal Servers.

…cleanups

Follow-up to the Phase 0/4 contract freeze commit, addressing review feedback:

## Instagram: upstream the CG scrapeFollowingAccounts function

The previous contract-freeze commit added instagram.following to the manifest
but the canonical script did not actually emit it. Port CG's
`scrapeFollowingAccounts` helper into `connectors/meta/instagram-playwright.js`
so the canonical connector is a superset of CG's. The function only uses
page.evaluate (no CG-specific runtime methods) so it's portable as-is.

The script now emits `instagram.following` as `{accounts: [...], total: n}`.
Schema updated to tolerate nullable `profile_pic_url`.

## iCloud Notes: honestly advertise the CG-runtime dependency

The iCloud Notes script upstreamed in the previous commit calls CG-specific
methods (getInput, frame_*, keyboard_*) that are NOT part of the canonical
Page API and do NOT run under the DataConnect playwright-runner. Add a new
`cg-legacy-page-api` capability flag to manifest.schema.json and advertise
it on the iCloud Notes manifest. Any runner that checks capabilities before
activating a connector can reject iCloud Notes up front.
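
A runner-side capability gate could look roughly like this. It is a sketch under assumptions: the manifest is taken to carry a `capabilities` array of strings, and `canActivate` plus the list of capabilities a runner supports are hypothetical.

```javascript
// Hypothetical runner-side gate: reject connectors that need capabilities
// this runtime does not provide.
function canActivate(manifest, supportedCapabilities) {
  const required = manifest.capabilities ?? [];
  const missing = required.filter((c) => !supportedCapabilities.includes(c));
  return { ok: missing.length === 0, missing };
}
```

A runner that does not implement the CG-only methods would simply leave `cg-legacy-page-api` out of its supported list, and the iCloud Notes manifest would then be rejected before any script runs.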

Update the description to spell out the runtime limitation.

## docs/260410-contract-freeze.md: correct Instagram claims

The previous freeze doc incorrectly asserted that `instagram.ads.categories`
was "already present" in the canonical schema and that the CG Instagram
targeting_categories field would map to it. In reality neither the canonical
nor the CG Instagram script emits `categories` at all. The doc now:

- describes the actual state of the ads contract (schema has an optional
  `categories` field with no emitter)
- explains that the canonical Instagram script is now activated in CG
- documents the CG proxy shims (setProgress/showBrowser/goHeadless) that
  the CG side needs in order to run canonical scripts
- explains why Oura is NOT activated in this cycle (OTP-only accounts)
- explains that iCloud Notes is metadata-converged but not runtime-canonical
@tnunamak
Member Author

Follow-up: Instagram following + honest iCloud runtime flag

Review flagged three issues with the initial commit. Two were material:

  1. Instagram script did not emit instagram.following. The manifest declared the scope, but the canonical script only emitted instagram.profile, instagram.posts, and instagram.ads. Fixed by porting CG's scrapeFollowingAccounts helper (pure page.evaluate, no CG-runtime dependencies) into connectors/meta/instagram-playwright.js. The canonical script now emits instagram.following as {accounts: [...], total: n}. Schema updated to allow nullable profile_pic_url.

  2. iCloud Notes script was not actually canonical. It was upstreamed by copying the CG version verbatim, which uses CG-runtime-only methods (getInput, frame_click, frame_evaluate, frame_fill, frame_waitForSelector, keyboard_press, keyboard_type) that are NOT declared in types/connector.d.ts and are NOT implemented in the data-connect playwright-runner. Marking this honestly: added cg-legacy-page-api as a new value in the capabilities enum, advertised it on the iCloud Notes manifest, and updated the description to spell out the runtime limitation. Metadata convergence is real; runtime convergence is a Phase 4+ follow-up task.

  3. Contract freeze doc had incorrect Instagram claims. The earlier version said instagram.ads.categories was "already present" and that CG's targeting_categories maps to it. In reality, neither the canonical nor the CG Instagram script ever emitted categories. Updated the doc to reflect actual state and added sections explaining (a) why the canonical Instagram script is now activated in CG, (b) why Oura is NOT activated (OTP-only accounts), (c) why iCloud Notes is metadata-converged but not runtime-canonical.

See the paired context-gateway PR for the CG-side changes (proxy shims, Instagram activation, Oura deferral, proxy coverage CI gate).

tnunamak added 13 commits April 10, 2026 15:12
…tors

The previous login flow used `input.value = x; dispatchEvent('input')` and
`button[type="submit"].click()`. Both patterns are broken against current
Instagram (2026-04):

- Instagram's controlled inputs are driven by React. Setting `input.value`
  directly does NOT trigger React's synthetic `onChange`, so the form state
  stays empty and submit does nothing.
- Instagram's DOM switched from `input[name="username"]` to
  `input[name="email"]` and from `input[name="password"]` to
  `input[name="pass"]` in some regional variants. The single-selector check
  missed these entirely.
- Instagram's submit control varies by DOM version: sometimes
  `<button type="submit">`, sometimes `<input type="submit">`, sometimes
  `<div role="button">`. The hardcoded button selector only caught the
  first variant.

Use the canonical Page API's `page.fill(selector, value)` and
`page.press(passSelector, 'Enter')` instead. Playwright's native typing
correctly triggers React synthetic events for controlled inputs, and
pressing Enter bypasses all submit-button DOM variations.

Broaden the user/password selectors to include the observed alternatives.
Add `page.waitForSelector(userSel, {state: 'visible'})` before requesting
credentials so we don't race Instagram's async form hydration.

Also: retry `fetchWebInfo()` up to 3 times after the SPA login redirect —
the page settling takes 10+ seconds and a single post-submit check races
the transition.

OTP code entry follows the same pattern (page.fill + page.press).
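
The credential flow above can be sketched as follows. This is illustrative rather than the shipped script: `page` is any object exposing the canonical methods named in this PR, and the `requestInput` message, schema, and returned field names are assumptions.

```javascript
// Selector lists covering the old and new field names described above.
const USER_SEL = 'input[name="username"], input[name="email"]';
const PASS_SEL = 'input[name="password"], input[name="pass"]';

async function submitLogin(page) {
  // Wait for the hydrated form before asking the user for credentials.
  await page.waitForSelector(USER_SEL, { state: 'visible' });

  // Assumed schema and result shape, for illustration only.
  const creds = await page.requestInput({
    message: 'Sign in to Instagram',
    schema: {
      type: 'object',
      required: ['username', 'password'],
      properties: {
        username: { type: 'string' },
        password: { type: 'string' },
      },
    },
  });

  // fill() types through the runtime, so React's onChange sees the value.
  await page.fill(USER_SEL, creds.username);
  await page.fill(PASS_SEL, creds.password);
  // Enter submits the form regardless of which submit control is rendered.
  await page.press(PASS_SEL, 'Enter');
}
```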

Validated end-to-end from Context Gateway against a real Instagram
account: profile, posts, following (13 accounts via the recently ported
scrapeFollowingAccounts helper), and ads all returned in the canonical
scoped payload shape.
…shape

GitHub canonical script was using the broken input.value+dispatchEvent
pattern, which doesn't fire React onChange, plus stale selectors that
don't match the 2026 GitHub DOM. Replace with the CG origin/main
version (which has been running in production) and translate its
CG-only page calls to the canonical page API:

  getInput({title, description, schema, uiSchema, submitLabel})
    -> requestInput({message, schema})
  keyboard_press('Enter')
    -> page.press(selector, 'Enter')

Preserves the retry loop with lastError, URL-based 2FA detection
(/sessions/two-factor/webauthn, /sessions/verified-device), passkey and
device-verify handling, and the #login_field / #password selectors that
actually exist in the current GitHub login page.
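
The getInput -> requestInput translation amounts to a small argument mapping. A sketch, with the caveat that how `title` and `description` are folded into `message` is an assumption here, and `toRequestInput` is a hypothetical name:

```javascript
// Map CG-style getInput arguments onto the canonical requestInput shape.
// uiSchema and submitLabel have no canonical counterpart and are dropped.
function toRequestInput({ title, description, schema }) {
  const message = [title, description].filter(Boolean).join(': ');
  return { message, schema };
}
```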

iCloud Notes was still emitting a flat {notes, folders, userName} blob.
Wrap the result in the canonical scoped shape the registered schemas
already advertise:

  {
    "icloud_notes.notes":   { notes, total, userName },
    "icloud_notes.folders": { folders, total },
    ...
  }

The schemas themselves were placeholders; rewrite schemas/icloud_notes.*
to match the exact fields the CG-in-prod script emits per note
(recordName, title, snippet, folder, isPinned, createdDate, modifiedDate,
hasAttachments, textContent) and per folder (recordName, title), so any
downstream consumer that type-checks against the schema sees reality.
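
A minimal sketch of the wrapping step, assuming `total` is simply the array length (the commit does not say how it is computed) and using a hypothetical `toScopedResult` helper:

```javascript
// Wrap the flat {notes, folders, userName} blob in canonical scoped keys.
function toScopedResult({ notes, folders, userName }) {
  return {
    'icloud_notes.notes': { notes, total: notes.length, userName },
    'icloud_notes.folders': { folders, total: folders.length },
  };
}
```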

Also drop the maintainer rant from the iCloud Notes manifest description
field — that field is user-facing in CG's /lab picker and elsewhere, and
end users don't want to read about cg-legacy-page-api. Move the runtime
caveat to a file-top comment in the script.

Smoke-tested end to end against production remote worker via CG /lab:

  github:    99 repos, 10 starred, @tnunamak profile — scoped payload
  icloud:    2 real notes ("Hmm", "This is a test"), 1 folder — scoped payload
…er-facing descriptions

The top-level manifest description field is rendered in end-user flows
(e.g., CG's /lab picker, data-connect's connect list). End users don't
need to see the runtime strategy — whether the connector uses Playwright
or a native API is a maintainer detail, not a user detail.

Scrub the trailing " using Playwright browser automation" phrase from
14 manifests and refresh the registry.json descriptions + checksums to
match.

No logic changes. Scope-level descriptions are untouched.
…orphan scan

The schema-health-check was flagging two schemas as orphaned that are
legitimately in use:

- schemas/conformance.result.json — declared by the CI-only synthetic
  fixture at connectors/_conformance/conformance-playwright.json, which
  is intentionally not listed in registry.json (it's not a shippable
  connector).
- schemas/manifest.schema.json — the connector-manifest meta-schema, not
  a scope schema at all.

Teach the check to read fixture manifests from connectors/_conformance/
and to exclude known non-scope meta-schemas from the orphan list.

Use window.HTMLInputElement.prototype directly rather than
Object.getPrototypeOf(el) when reaching for the native value setter.
This is the standard React 16+ bypass and is bulletproof against any
polyfill or framework that may have mutated the element's instance
prototype chain. The bypass path is only used when the primary
page.fill + page.press('Enter') route has already failed, so reliability
matters more than style here.
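
For reference, the native-setter bypass looks roughly like this. It is a sketch rather than the shipped helper: `setNativeInputValue` is a hypothetical name, and `win` (the page's window object) is passed in only so the function can be exercised outside a browser.

```javascript
// Fallback value-setting path for React-controlled inputs.
function setNativeInputValue(win, el, value) {
  // Take the setter from the pristine prototype, not from
  // Object.getPrototypeOf(el), which a polyfill or framework may have altered.
  const descriptor = Object.getOwnPropertyDescriptor(
    win.HTMLInputElement.prototype,
    'value'
  );
  descriptor.set.call(el, value);
  // React picks up the new value from a bubbling "input" event.
  el.dispatchEvent(new win.Event('input', { bubbles: true }));
}
```

Inside a page this would be called as `setNativeInputValue(window, inputEl, 'text')`, typically from within `page.evaluate`.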
…llection

GitHub: wrap the final promptUser() manual-login fallback in a
showBrowser() check. Runtimes like Context Gateway that never surface a
headed remote browser to the end user return { headed: false }; in that
case, calling promptUser would hang indefinitely because the user has
no browser to complete the manual step in. Return a clean
"automatic sign-in failed" error instead.

Instagram: honor page.requestedScopes() at the script level.

  - At the top of the main IIFE, resolve the canonical scope list via
    page.requestedScopes() (falls back to null for older runtimes =
    "collect everything", the legacy behavior).
  - Gate the Following scrape on instagram.following.
  - Gate the Posts network capture, the scroll-to-load-more loop, and
    the timeline processing on instagram.posts.
  - Gate the entire Phase 3 (accountscenter navigation + advertisers +
    topics dialog scrapes) on instagram.ads.
  - transformDataForSchema() now only emits the scoped keys that were
    actually requested, and stamps the requestedScopes list into the
    result for downstream audit.

Profile is always collected because it's load-bearing for login
detection and as the source of the user id for the Following endpoint.
If the caller didn't request instagram.profile, it's simply omitted
from the final result.
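
The gating described above could be sketched like this. Assumptions: `page.requestedScopes()` returns an array of scope names (possibly via a promise), and the result key used for the audit stamp is `requestedScopes`; `resolveScopeGate` and `emitRequested` are hypothetical names.

```javascript
// Resolve the requested scope list once, at the top of the script.
async function resolveScopeGate(page) {
  const scopes = typeof page.requestedScopes === 'function'
    ? await page.requestedScopes()
    : null; // older runtimes: collect everything
  const wants = (scope) => scopes == null || scopes.includes(scope);
  return { scopes: scopes ?? null, wants };
}

// Emit only the scoped keys that were requested and stamp the list.
function emitRequested(collected, gate) {
  const result = {};
  for (const [scope, payload] of Object.entries(collected)) {
    if (gate.wants(scope)) result[scope] = payload;
  }
  result.requestedScopes = gate.scopes;
  return result;
}
```

Expensive phases are then wrapped in checks such as `if (gate.wants('instagram.ads')) { ... }`, while profile collection runs unconditionally and is filtered out at emit time if it was not requested.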

This is not an enforcement layer against a malicious caller — the
script runs in a client-controlled JS VM and the gate can be stripped.
It exists so well-behaved flows don't waste worker cycles or
overcollect, and so the selected scope set is observable at the page
API level. The trust boundary against malicious 3rd-party devs is app
identity, not runtime enforcement.
…er check

The canonical iCloud Notes script falls through to page.promptUser()
when its automated Apple ID login fails. In runtimes like Context
Gateway where the remote browser is not surfaced to the end user
(showBrowser returns { headed: false }), calling promptUser would
either hang forever (old polling implementation) or throw (the new
fail-fast shim) — both are unacceptable. Mirror the GitHub fix:
showBrowser -> check headed -> either promptUser if we can, or return
a clean error if we can't.
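
The shared guard can be sketched as below. The call signatures of `showBrowser` and `promptUser` and the shape of the returned error object are assumptions; only the `{ headed: false }` result comes from the commit text, and `manualLoginFallback` is a hypothetical name.

```javascript
// Only fall back to a manual login when the user can actually see a browser.
async function manualLoginFallback(page, message) {
  const shown = await page.showBrowser();
  if (!shown || !shown.headed) {
    // No headed browser: promptUser would hang or throw, so fail cleanly.
    return { ok: false, error: 'automatic sign-in failed' };
  }
  await page.promptUser(message);
  return { ok: true };
}
```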

Found in Gemini review ahead of the staging cut.

Transient HTTP errors (net::ERR_HTTP_RESPONSE_CODE_FAILURE,
page.goto timeouts, etc.) on login, profile, and account-center
navigations were bubbling up to the caller as raw errors that killed
the entire run, even though a simple retry would have recovered.

The prod Context Gateway Instagram connector shipped safeGoto +
withTimeout helpers that wrapped every page.goto in a retry loop with
an explicit 15s timeout. Porting that forward into the canonical
Instagram, GitHub, iCloud Notes, and Oura scripts closes the
regression for Instagram and preemptively hardens the other three
(which had the same fragile bare page.goto calls in prod but had
simply not hit a transient yet).

Implementation:

  - Inlined withTimeout() and safeGoto() helpers at the top of each
    script. Data-connectors scripts can't import shared modules
    because both runtimes (CG client VM, data-connect playwright-
    runner) execute them as raw source, so the helpers are duplicated
    per connector until the canonical page API surface grows native
    retry/timeout primitives. That's tracked as the "generalized
    error handling at the proxy level" followup in the contract plan.
  - Replaced every bare page.goto call (except those inside the
    safeGoto helper itself) with safeGoto.
  - For load-bearing navigations (login pages, profile pages) a
    failed safeGoto short-circuits the run with a clean structured
    error instead of leaking a raw DOMException.
  - For recoverable sub-navigations (Instagram ads landing, ad_topics
    sub-page, GitHub pagination pages) a failed safeGoto lets the
    caller skip the scope/page cleanly without killing the rest of
    the collection.
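
A minimal sketch of the two helpers, under stated assumptions: the 15s timeout comes from the description above, while the attempt count, the delay between attempts, and the `{ ok, attempts, error }` return shape are illustrative choices rather than what the scripts actually inline.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Retry page.goto a few times; never throw, return a structured result.
async function safeGoto(page, url, { attempts = 3, timeoutMs = 15000, delayMs = 1000 } = {}) {
  let lastError;
  for (let i = 1; i <= attempts; i++) {
    try {
      await withTimeout(page.goto(url), timeoutMs, `goto ${url}`);
      return { ok: true, attempts: i };
    } catch (err) {
      lastError = err;
      // Back off briefly before the next attempt, but not after the last one.
      if (i < attempts) await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  const error = lastError instanceof Error ? lastError.message : String(lastError);
  return { ok: false, attempts, error };
}
```

Returning a result object instead of throwing lets the caller decide per navigation: abort the run for a login page, or skip a single scope for a sub-page.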

Verified on staging via /lab with real Instagram credentials. Bare
page.goto calls remaining in any canonical script: only those inside
the safeGoto helper bodies themselves.

The canonical GitHub script was hitting "Login form not found" on
staging (and probably locally, under the right conditions) because the
querySelector check raced against login page hydration. page.goto
resolves on "load" but React takes a beat longer to mount the form.

Fix:

  1. Use page.waitForSelector with a 10s budget before the querySelector
     check, so we give the login form time to paint before declaring
     "not found". Matches how Instagram waits for its credential form.

  2. On failure, capture the actual page state (title, URL, h1, h2,
     first 400 chars of body text) and include it in the lastError.
     Next time this fails, the error message will tell us whether
     GitHub served a CAPTCHA, rate-limit page, device-verify
     interstitial, or some other non-login page — instead of the
     generic "Login form not found" we're getting now.

  3. Broadened the login field selectors to also accept
     input[name="login"] / input[name="password"] in case GitHub
     serves a form with different IDs.

  4. Fixed a latent bug in the retry loop: on the final attempt, we
     were re-navigating to /login after declaring the attempt failed.
     That was wasted work since the while condition would immediately
     terminate. Now we only re-navigate when we'll actually retry.
…nals

Root cause of the "Login form not found" failure: on a persistent
browser profile where the user is already logged in from a previous
session, the script's checkLoggedIn() was returning false at the top
of the main flow. That made the script try to navigate to /login,
but GitHub redirected back to / (the authenticated dashboard), and
the subsequent querySelector('#login_field') check naturally failed
because there was no login form to find.

The observability patch from the previous commit caught the smoking
gun: lastError captured title="GitHub" url="https://github.com/"
h1="Search code, repositories, users, issues, pull requests..."
body="Skip to content Dashboard Top repositories..." — the script
was staring at the logged-in dashboard the whole time.

Fix: rewrite checkLoggedIn to be both more patient and more thorough.

  1. Use waitForSelector (5s budget) so GitHub's header has time to
     hydrate before we declare "not logged in".
  2. Short-circuit to false on explicit /login or /session paths.
  3. Check four positive signals, any one of which proves the user
     is authenticated:
       a) meta[name="user-login"] with a non-empty content
       b) summary[aria-label*="View profile and more"] (header menu)
       c) header img.avatar-user
       d) dashboard-specific content ("Top repositories" + "Dashboard")
  4. Stricter negative signal to avoid false negatives on non-login
     pages that happen to have a /login link somewhere.

This matches the spirit of the prod CG behavior which works on the
same remote-worker persistent profile — the prod script evidently
never hit this because the header structure it checked was present
when the script ran, but newer versions of GitHub's dashboard render
slightly later.

Previous fix was still failing on staging because checkLoggedIn()
returned synchronously from a single querySelector snapshot. When
called immediately after device-verify form submission, GitHub was
still mid-navigation to the authenticated dashboard, so none of the
positive signals matched and the function returned false — sending
the while loop back around into another "Login form not found"
retry, this time on the dashboard page where there's no login form
at all.

Fix: rewrite checkLoggedIn as a polling loop.

  - Accepts { timeoutMs, pollIntervalMs } options.
  - Default 10s timeout / 1s poll interval.
  - Top-of-flow initial check uses a shorter 5s timeout (GitHub
    either has the dashboard up or it doesn't by then).
  - Inside the login retry loop (after credential + device-verify
    submit) uses the 10s default to cover multi-step redirects.
  - Short-circuits to false on /login or /session paths immediately.
  - Added two more positive signals: avatar-user anywhere on a
    non-auth path, and a "your work" header heuristic.
  - Logs the winning signal or the last snapshot on failure so the
    next failure explains itself.
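
The polling shape can be sketched as follows. To keep it runnable outside a browser, the DOM probing is abstracted into a caller-supplied `readSnapshot` function returning `{ path, signals }`; that abstraction, the snapshot shape, and the name `pollLoggedIn` are assumptions, not the script's actual structure.

```javascript
// Poll until a positive login signal appears, an auth path is seen,
// or the time budget runs out.
async function pollLoggedIn(readSnapshot, { timeoutMs = 10000, pollIntervalMs = 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const snapshot = await readSnapshot();
    // Explicit auth pages short-circuit to "not logged in".
    if (/^\/(login|session)/.test(snapshot.path)) {
      return { loggedIn: false, snapshot };
    }
    // Any single truthy signal proves the user is authenticated.
    const signal = Object.keys(snapshot.signals).find((name) => snapshot.signals[name]);
    if (signal) return { loggedIn: true, signal };
    // Give up if another poll would overrun the budget; keep the last
    // snapshot so the failure can be logged with context.
    if (Date.now() + pollIntervalMs > deadline) {
      return { loggedIn: false, snapshot };
    }
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}
```

The top-of-flow check would pass `{ timeoutMs: 5000 }`, while the check inside the login retry loop would use the 10s default.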
resolveUsername was failing with 'Could not resolve a valid GitHub
username after login' because readLoggedInUsername only checked
meta[name='user-login'], which wasn't always present on the
just-hydrated dashboard. Broaden the signals and add polling:

  1. meta[name='user-login'] (canonical)
  2. header img.avatar-user [alt='@username']
  3. Profile link with data-hovercard-type='user'
  4. 'Signed in as X' aria-label

Also increase the fallback /settings/profile read budget to 8s.
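
One way to combine the four signals is to read each into a raw string and take the first that looks like a username. This sketch assumes the raw values have already been read from the page in priority order; `pickUsername` is a hypothetical helper, and the pattern is an approximation of GitHub's username rules (alphanumerics and single hyphens, up to 39 characters), not an official definition.

```javascript
// Approximate GitHub username shape; an assumption for illustration.
const USERNAME_RE = /^[a-z\d](?:[a-z\d]|-(?=[a-z\d])){0,38}$/i;

// Return the first candidate that looks like a username, or null.
function pickUsername(candidates) {
  for (const raw of candidates) {
    // Avatar alt text carries a leading "@"; strip it before validating.
    const name = String(raw ?? '').trim().replace(/^@/, '');
    if (USERNAME_RE.test(name)) return name;
  }
  return null;
}
```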