feat: normalize manifest + page API contracts (Phase 0 + 4) #59
Phase 0 of the connector contract unification plan:
- Add canonical manifest schema (schemas/manifest.schema.json) and a
normalizer that adds manifest_version, connector_id, source_id,
page_api_version, connect_url, connect_selector, and icon to every
*-playwright.json while preserving legacy aliases.
- Extend the typed Page API (types/connector.d.ts) with the canonical
minimum surface (click, fill, press, waitForSelector, url) so that
all shells can converge on a single runtime contract.
- Add types/page-api-version.ts exporting PAGE_API_VERSION = 1.
- Upstream iCloud Notes as a canonical connector (manifest, two scope
schemas, registry entry).
- Add a new public scope instagram.following with its own schema, and
record the contract decision (including Instagram ads targeting
categories remaining additive on instagram.ads) in
docs/260410-contract-freeze.md.
- Add a conformance fixture connector under connectors/_conformance/
that probes every minimum-surface method.
- Add guardrail scripts:
  - validate-manifests.mjs (HC-MANIFEST-CONTRACT-001/002/003)
  - normalize-manifests.mjs (idempotent canonicalization)
  - validate-scope-schemas.mjs (HC-RESULT-CONTRACT-001)
  - check-page-api-additive.mjs (HC-PHASE4-PAGE-API-ADDITIVE-001)
  - check-source-id-stability.mjs (HC-COMPAT-SOURCE-ID-001)
  - check-additive-schemas.mjs (HC-COMPAT-ADDITIVE-SCHEMA-001)
- Wire all guardrails into a new .github/workflows/contract-guardrails.yml
workflow.
See docs/260410-contract-freeze.md for the full decision record.
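
The normalizer behavior described in this commit (canonical fields added, legacy aliases preserved, a second pass changing nothing) can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/normalize-manifests.mjs: the function name `normalizeManifest`, the legacy spelling `connectSelector`, the default `manifest_version` of 1, and the `source_id` fallback are all assumptions.

```javascript
// Sketch of an idempotent manifest normalizer.
// Canonical field names come from the commit message; everything marked
// "assumed" below is hypothetical and may differ from the real script.
const PAGE_API_VERSION = 1;

// [canonical key, legacy alias]
const ALIASES = [
  ["connector_id", "id"],
  ["connect_url", "connectURL"],
  ["connect_selector", "connectSelector"], // assumed legacy spelling
];

function normalizeManifest(manifest) {
  const out = { ...manifest }; // legacy keys are kept, never deleted
  for (const [canonical, legacy] of ALIASES) {
    if (out[canonical] === undefined && out[legacy] !== undefined) {
      out[canonical] = out[legacy];
    }
  }
  if (out.manifest_version === undefined) out.manifest_version = 1; // assumed default
  if (out.page_api_version === undefined) out.page_api_version = PAGE_API_VERSION;
  // Assumed: source_id falls back to the connector id when absent.
  if (out.source_id === undefined && out.connector_id !== undefined) {
    out.source_id = out.connector_id;
  }
  return out;
}

const legacy = { id: "github", connectURL: "https://github.com/login" };
const once = normalizeManifest(legacy);
const twice = normalizeManifest(once);
// Idempotence check: normalizing an already-normalized manifest is a no-op.
console.log(JSON.stringify(once) === JSON.stringify(twice)); // true
```

Only filling in fields that are `undefined` is what makes the pass safe to re-run, which is the property a `--check` mode would rely on.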
Schema Health Check: 10 issue(s) found
40/46 scopes consistent | 3 missing schema files | 6 not in Gateway | 0 metadata drift | 1 orphaned
…cleanups
Follow-up to the Phase 0/4 contract freeze commit, addressing review feedback:
## Instagram: upstream the CG scrapeFollowingAccounts function
The previous contract-freeze commit added instagram.following to the manifest
but the canonical script did not actually emit it. Port CG's
`scrapeFollowingAccounts` helper into `connectors/meta/instagram-playwright.js`
so the canonical connector is a superset of CG's. The function only uses
page.evaluate (no CG-specific runtime methods) so it's portable as-is.
The script now emits `instagram.following` as `{accounts: [...], total: n}`.
Schema updated to tolerate nullable `profile_pic_url`.
## iCloud Notes: honestly advertise the CG-runtime dependency
The iCloud Notes script upstreamed in the previous commit calls CG-specific
methods (getInput, frame_*, keyboard_*) that are NOT part of the canonical
Page API and do NOT run under the DataConnect playwright-runner. Add a new
`cg-legacy-page-api` capability flag to manifest.schema.json and advertise
it on the iCloud Notes manifest. Any runner that checks capabilities before
activating a connector can reject iCloud Notes up front.
Update the description to spell out the runtime limitation.
## docs/260410-contract-freeze.md: correct Instagram claims
The previous freeze doc incorrectly asserted that `instagram.ads.categories`
was "already present" in the canonical schema and that the CG Instagram
targeting_categories field would map to it. In reality neither the canonical
nor the CG Instagram script emits `categories` at all. The doc now:
- describes the actual state of the ads contract (schema has an optional
`categories` field with no emitter)
- explains that the canonical Instagram script is now activated in CG
- documents the CG proxy shims (setProgress/showBrowser/goHeadless) that
the CG side needs in order to run canonical scripts
- explains why Oura is NOT activated in this cycle (OTP-only accounts)
- explains that iCloud Notes is metadata-converged but not runtime-canonical
Follow-up: Instagram following + honest iCloud runtime flag
Review flagged three issues with the initial commit. Two were material:
See the paired context-gateway PR for the CG-side changes (proxy shims, Instagram activation, Oura deferral, proxy coverage CI gate).
…tors
The previous login flow used `input.value = x; dispatchEvent('input')` and
`button[type="submit"].click()`. Both patterns are broken against current
Instagram (2025-04):
- Instagram's controlled inputs are driven by React. Setting `input.value`
directly does NOT trigger React's synthetic `onChange`, so the form state
stays empty and submit does nothing.
- Instagram's DOM switched from `input[name="username"]` to
`input[name="email"]` and from `input[name="password"]` to
`input[name="pass"]` in some regional variants. The single-selector check
missed these entirely.
- Instagram's submit control varies by DOM version: sometimes
`<button type="submit">`, sometimes `<input type="submit">`, sometimes
`<div role="button">`. The hardcoded button selector only caught the
first variant.
Use the canonical Page API's `page.fill(selector, value)` and
`page.press(passSelector, 'Enter')` instead. Playwright's native typing
correctly triggers React synthetic events for controlled inputs, and
pressing Enter bypasses all submit-button DOM variations.
Broaden the user/password selectors to include the observed alternatives.
Add `page.waitForSelector(userSel, {state: 'visible'})` before requesting
credentials so we don't race Instagram's async form hydration.
Also: retry `fetchWebInfo()` up to 3 times after the SPA login redirect —
the page settling takes 10+ seconds and a single post-submit check races
the transition.
OTP code entry follows the same pattern (page.fill + page.press).
Validated end-to-end from Context Gateway against a real Instagram
account: profile, posts, following (13 accounts via the recently ported
scrapeFollowingAccounts helper), and ads all returned in the canonical
scoped payload shape.
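
As a rough sketch of the fill + press login pattern described in this commit: `page` stands for any object exposing the canonical minimum surface (`waitForSelector`, `fill`, `press`). The function names `submitCredentials` and `waitForLogin`, the `delayMs` default, and the shape returned by `fetchWebInfo` are assumptions for illustration, not the connector's actual code.

```javascript
// Selector lists cover the alternatives named in the commit message.
const USER_SEL = 'input[name="username"], input[name="email"]';
const PASS_SEL = 'input[name="password"], input[name="pass"]';

async function submitCredentials(page, { username, password }) {
  // Wait for the hydrated form so we don't race async rendering.
  // (In the flow described above this wait happens before credentials
  // are requested from the user.)
  await page.waitForSelector(USER_SEL, { state: "visible" });
  await page.fill(USER_SEL, username);
  await page.fill(PASS_SEL, password);
  // Enter submits the form no matter how the submit control is rendered.
  await page.press(PASS_SEL, "Enter");
}

// Hypothetical helper: re-check login state several times after submit,
// since the SPA redirect can take 10+ seconds to settle.
async function waitForLogin(fetchWebInfo, { attempts = 3, delayMs = 5000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const info = await fetchWebInfo();
    if (info) return info;
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return null; // caller decides how to report the failed login
}
```

The same two calls (`page.fill` then `page.press`) would apply to the OTP step, with a different selector.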
…shape
The canonical GitHub script was using the broken input.value + dispatchEvent
pattern, which doesn't fire React's onChange, along with stale #login_field
selectors that don't match the 2026 GitHub DOM. Replace it with the CG
origin/main version (which has been running in production) and translate
its CG-only page calls to the canonical page API:
  getInput({title, description, schema, uiSchema, submitLabel})
    -> requestInput({message, schema})
  keyboard_press('Enter')
    -> page.press(selector, 'Enter')
Preserves the retry loop with lastError, URL-based 2FA detection
(/sessions/two-factor/webauthn, /sessions/verified-device), passkey and
device-verify handling, and the #login_field / #password selectors that
actually exist in the current GitHub login page.
iCloud Notes was still emitting a flat {notes, folders, userName} blob.
Wrap the result in the canonical scoped shape the registered schemas
already advertise:
  {
    "icloud_notes.notes": { notes, total, userName },
    "icloud_notes.folders": { folders, total },
    ...
  }
The schemas themselves were placeholders; rewrite schemas/icloud_notes.*
to match the exact fields the CG-in-prod script emits per note
(recordName, title, snippet, folder, isPinned, createdDate, modifiedDate,
hasAttachments, textContent) and per folder (recordName, title), so any
downstream consumer that type-checks against the schema sees reality.
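
A minimal sketch of the wrapping step, assuming the flat blob has exactly the three fields named above and that `total` is the array length (the commit message does not say how `total` is computed). The helper name `toScopedResult` is hypothetical, and any additional keys behind the `...` in the shape above are left out.

```javascript
// Wrap the legacy flat blob into scope-keyed payloads.
// Assumption: `total` mirrors the length of the corresponding array.
function toScopedResult({ notes = [], folders = [], userName = null }) {
  return {
    "icloud_notes.notes": { notes, total: notes.length, userName },
    "icloud_notes.folders": { folders, total: folders.length },
  };
}

// Example input using the per-note and per-folder fields listed above.
const flat = {
  userName: "example-user",
  notes: [
    {
      recordName: "n1", title: "Hmm", snippet: "", folder: "Notes",
      isPinned: false, createdDate: null, modifiedDate: null,
      hasAttachments: false, textContent: "",
    },
  ],
  folders: [{ recordName: "f1", title: "Notes" }],
};

const scoped = toScopedResult(flat);
console.log(Object.keys(scoped)); // the two scope keys
```

Keying the result by scope name lets a consumer validate each payload against its own registered schema independently.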
Also drop the maintainer rant from the iCloud Notes manifest description
field — that field is user-facing in CG's /lab picker and elsewhere, and
end users don't want to read about cg-legacy-page-api. Move the runtime
caveat to a file-top comment in the script.
Smoke-tested end to end against production remote worker via CG /lab:
github: 99 repos, 10 starred, @tnunamak profile — scoped payload
icloud: 2 real notes ("Hmm", "This is a test"), 1 folder — scoped payload
…er-facing descriptions
The top-level manifest description field is rendered in end-user flows
(e.g., CG's /lab picker, data-connect's connect list). End users don't
need to see the runtime strategy: whether the connector uses Playwright
or a native API is a maintainer detail, not a user detail. Scrub the
trailing " using Playwright browser automation" phrase from 14 manifests
and refresh the registry.json descriptions + checksums to match.
No logic changes. Scope-level descriptions are untouched.
…orphan scan
The schema-health-check was flagging two schemas as orphaned that are
legitimately in use:
- schemas/conformance.result.json: declared by the CI-only synthetic
fixture at connectors/_conformance/conformance-playwright.json, which is
intentionally not listed in registry.json (it's not a shippable connector).
- schemas/manifest.schema.json: the connector-manifest meta-schema, not a
scope schema at all.
Teach the check to read fixture manifests from connectors/_conformance/
and to exclude known non-scope meta-schemas from the orphan list.
Use window.HTMLInputElement.prototype directly rather than
Object.getPrototypeOf(el) when reaching for the native value setter.
This is the standard React 16+ bypass and is bulletproof against any
polyfill or framework that may have mutated the element's instance
prototype chain. The bypass path is only used when the primary
page.fill + page.press('Enter') route has already failed, so reliability
matters more than style here.
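
The native-setter bypass can be sketched as below. React installs its own `value` property on the input instance to track changes, so calling the setter from `HTMLInputElement.prototype` writes the real DOM value without going through that instance-level property. The function name `setNativeValue` and the choice to pass the window in as `win` (so the sketch can run outside a browser) are assumptions.

```javascript
// Set an input's value through the prototype-level setter, then fire a
// bubbling `input` event so the framework sees a change.
// `win` is the page's window object (hypothetical parameter).
function setNativeValue(win, el, value) {
  const descriptor = Object.getOwnPropertyDescriptor(
    win.HTMLInputElement.prototype,
    "value"
  );
  // Go straight to the prototype setter; do not trust the instance chain,
  // which a framework or polyfill may have modified.
  descriptor.set.call(el, value);
  el.dispatchEvent(new win.Event("input", { bubbles: true }));
}
```

Inside `page.evaluate` this would be called with the real `window` and the element returned by `document.querySelector`.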
…llection
GitHub: wrap the final promptUser() manual-login fallback in a
showBrowser() check. Runtimes like Context Gateway that never surface a
headed remote browser to the end user return { headed: false }; in that
case, calling promptUser would hang indefinitely because the user has
no browser to complete the manual step in. Return a clean
"automatic sign-in failed" error instead.
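
A sketch of the headed check, assuming `showBrowser()` resolves to an object with a boolean `headed` field as described above. The function name `manualLoginFallback`, the argument passed to `promptUser`, and the shape of the returned error object are hypothetical.

```javascript
// Only fall back to a manual login prompt when the runtime can actually
// show the user a browser to complete it in.
async function manualLoginFallback(page, message) {
  const shown = await page.showBrowser();
  if (!shown || shown.headed !== true) {
    // No visible browser: prompting would leave the user with nothing to
    // act on, so fail cleanly instead of hanging.
    return { ok: false, error: "automatic sign-in failed" };
  }
  await page.promptUser(message);
  return { ok: true };
}
```

Treating a missing or malformed `showBrowser` result the same as `{ headed: false }` is a conservative choice: the fallback is only taken when the runtime positively reports a headed browser.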
Instagram: honor page.requestedScopes() at the script level.
- At the top of the main IIFE, resolve the canonical scope list via
page.requestedScopes() (falls back to null for older runtimes =
"collect everything", the legacy behavior).
- Gate the Following scrape on instagram.following.
- Gate the Posts network capture, the scroll-to-load-more loop, and
the timeline processing on instagram.posts.
- Gate the entire Phase 3 (accountscenter navigation + advertisers +
topics dialog scrapes) on instagram.ads.
- transformDataForSchema() now only emits the scoped keys that were
actually requested, and stamps the requestedScopes list into the
result for downstream audit.
Profile is always collected because it's load-bearing for login
detection and as the source of the user id for the Following endpoint.
If the caller didn't request instagram.profile, it's simply omitted
from the final result.
This is not an enforcement layer against a malicious caller — the
script runs in a client-controlled JS VM and the gate can be stripped.
It exists so well-behaved flows don't waste worker cycles or
overcollect, and so the selected scope set is observable at the page
API level. The trust boundary against malicious 3rd-party devs is app
identity, not runtime enforcement.
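
The scope gating can be sketched as follows. It assumes `page.requestedScopes()` returns (or resolves to) an array of scope names and is simply absent on older runtimes. The helper names `resolveRequestedScopes`, `wants`, and `emitScoped`, and the result key `requestedScopes`, are illustrative assumptions rather than the script's actual identifiers.

```javascript
// Resolve the requested scope set; null means "collect everything"
// (legacy behavior for runtimes without requestedScopes).
async function resolveRequestedScopes(page) {
  if (typeof page.requestedScopes !== "function") return null;
  const scopes = await page.requestedScopes();
  return Array.isArray(scopes) ? new Set(scopes) : null;
}

function wants(requested, scope) {
  return requested === null || requested.has(scope);
}

// Emit only requested scoped keys and stamp the request for audit.
function emitScoped(collected, requested) {
  const out = {};
  for (const [scope, payload] of Object.entries(collected)) {
    if (wants(requested, scope)) out[scope] = payload;
  }
  out.requestedScopes = requested === null ? null : [...requested];
  return out;
}

async function run(page) {
  const requested = await resolveRequestedScopes(page);
  // Profile is always collected (login detection, user id), but only
  // emitted if it was requested.
  const collected = { "instagram.profile": { username: "example" } };
  if (wants(requested, "instagram.following")) {
    collected["instagram.following"] = { accounts: [], total: 0 };
  }
  return emitScoped(collected, requested);
}
```

As the commit message notes, a gate like this is a courtesy to well-behaved callers, not a security boundary.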
…er check
The canonical iCloud Notes script falls through to page.promptUser()
when its automated Apple ID login fails. In runtimes like Context
Gateway where the remote browser is not surfaced to the end user
(showBrowser returns { headed: false }), calling promptUser would
either hang forever (old polling implementation) or throw (the new
fail-fast shim) — both are unacceptable. Mirror the GitHub fix:
showBrowser -> check headed -> either promptUser if we can, or return
a clean error if we can't.
Found in Gemini review ahead of the staging cut.
Transient HTTP errors (net::ERR_HTTP_RESPONSE_CODE_FAILURE,
page.goto timeouts, etc.) on login, profile, and account-center
navigations were bubbling up to the caller as raw errors that killed
the entire run, even though a simple retry would have recovered.
The prod Context Gateway Instagram connector shipped safeGoto +
withTimeout helpers that wrapped every page.goto in a retry loop with
an explicit 15s timeout. Porting that forward into the canonical
Instagram, GitHub, iCloud Notes, and Oura scripts closes the
regression for Instagram and preemptively hardens the other three
(which had the same fragile bare page.goto calls in prod but had
simply not hit a transient yet).
Implementation:
- Inlined withTimeout() and safeGoto() helpers at the top of each
script. Data-connectors scripts can't import shared modules
because both runtimes (CG client VM, data-connect playwright-
runner) execute them as raw source, so the helpers are duplicated
per connector until the canonical page API surface grows native
retry/timeout primitives. That's tracked as the "generalized
error handling at the proxy level" followup in the contract plan.
- Replaced every bare page.goto call (except those inside the
safeGoto helper itself) with safeGoto.
- For load-bearing navigations (login pages, profile pages) a
failed safeGoto short-circuits the run with a clean structured
error instead of leaking a raw DOMException.
- For recoverable sub-navigations (Instagram ads landing, ad_topics
sub-page, GitHub pagination pages) a failed safeGoto lets the
caller skip the scope/page cleanly without killing the rest of
the collection.
Verified on staging via /lab with real Instagram credentials. Bare
page.goto calls remaining in any canonical script: only those inside
the safeGoto helper bodies themselves.
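
For readers who want the general shape of the two helpers, here is a minimal sketch. The 15s timeout matches the commit message; the retry count, the linear backoff, the option names, and the `{ ok, attempts, error }` return shape are assumptions and may differ from what the connectors actually inline.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Always clear the timer so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Navigate with retries; returns a structured result instead of throwing.
async function safeGoto(
  page,
  url,
  { retries = 3, timeoutMs = 15000, backoffMs = 1000 } = {}
) {
  let lastError = null;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await withTimeout(page.goto(url), timeoutMs, `goto ${url}`);
      return { ok: true, attempts: attempt };
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Back off a little longer after each failed attempt.
        await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
      }
    }
  }
  return {
    ok: false,
    attempts: retries,
    error: String(lastError && lastError.message),
  };
}
```

A caller can then branch on the result: abort the run with a structured error for load-bearing navigations, or skip just the affected scope for recoverable sub-navigations.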
The canonical GitHub script was hitting "Login form not found" on
staging (and probably local, under the right conditions) because the
querySelector check raced against login page hydration. page.goto
resolves on "load" but React takes a beat longer to mount the form.
Fix:
1. Use page.waitForSelector with a 10s budget before the querySelector
check, so we give the login form time to paint before declaring
"not found". Matches how Instagram waits for its credential form.
2. On failure, capture the actual page state (title, URL, h1, h2,
first 400 chars of body text) and include it in the lastError.
Next time this fails, the error message will tell us whether
GitHub served a CAPTCHA, rate-limit page, device-verify
interstitial, or some other non-login page — instead of the
generic "Login form not found" we're getting now.
3. Broadened the login field selectors to also accept
input[name="login"] / input[name="password"] in case GitHub
serves a form with different IDs.
4. Fixed a latent bug in the retry loop: on the final attempt, we
were re-navigating to /login after declaring the attempt failed.
That was wasted work since the while condition would immediately
terminate. Now we only re-navigate when we'll actually retry.
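
The retry-loop shape after these four changes might look roughly like the sketch below. `attemptLogin` and `describePage` are hypothetical callbacks standing in for the credential submission and the page-state capture (title, URL, headings, body excerpt); the function name, attempt count, and return shape are also assumptions.

```javascript
const LOGIN_FIELD_SEL = '#login_field, input[name="login"]';

async function loginWithRetries(
  page,
  attemptLogin,
  describePage,
  { maxAttempts = 3, loginUrl = "https://github.com/login" } = {}
) {
  let lastError = null;
  let attempt = 0;
  while (attempt < maxAttempts) {
    attempt++;
    try {
      // Give the form up to 10s to render before declaring "not found".
      await page.waitForSelector(LOGIN_FIELD_SEL, { timeout: 10000 });
      await attemptLogin(page);
      return { ok: true, attempts: attempt };
    } catch (err) {
      // Record what the page actually looked like, so the error explains
      // whether we hit a CAPTCHA, rate limit, interstitial, etc.
      lastError = `${err.message} (page state: ${await describePage(page)})`;
      // Re-navigate only if another attempt will actually run.
      if (attempt < maxAttempts) await page.goto(loginUrl);
    }
  }
  return { ok: false, attempts: attempt, error: lastError };
}
```

With `maxAttempts = 3` and a form that never appears, this performs two re-navigations rather than three, which is the wasted work the fix removes.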
…nals
Root cause of the "Login form not found" failure: on a persistent
browser profile where the user is already logged in from a previous
session, the script's checkLoggedIn() was returning false at the top
of the main flow. That made the script try to navigate to /login,
but GitHub redirected back to / (the authenticated dashboard), and
the subsequent querySelector('#login_field') check naturally failed
because there was no login form to find.
The observability patch from the previous commit caught the smoking
gun: lastError captured title="GitHub" url="https://github.com/"
h1="Search code, repositories, users, issues, pull requests..."
body="Skip to content Dashboard Top repositories..." — the script
was staring at the logged-in dashboard the whole time.
Fix: rewrite checkLoggedIn to be both more patient and more thorough.
1. Use waitForSelector (5s budget) so GitHub's header has time to
hydrate before we declare "not logged in".
2. Short-circuit to false on explicit /login or /session paths.
3. Check four positive signals, any one of which proves the user
is authenticated:
a) meta[name="user-login"] with a non-empty content
b) summary[aria-label*="View profile and more"] (header menu)
c) header img.avatar-user
d) dashboard-specific content ("Top repositories" + "Dashboard")
4. Stricter negative signal to avoid false negatives on non-login
pages that happen to have a /login link somewhere.
This matches the spirit of the prod CG behavior which works on the
same remote-worker persistent profile — the prod script evidently
never hit this because the header structure it checked was present
when the script ran, but newer versions of GitHub's dashboard render
slightly later.
Previous fix was still failing on staging because checkLoggedIn()
returned synchronously from a single querySelector snapshot. When
called immediately after device-verify form submission, GitHub was
still mid-navigation to the authenticated dashboard, so none of the
positive signals matched and the function returned false — sending
the while loop back around into another "Login form not found"
retry, this time on the dashboard page where there's no login form
at all.
Fix: rewrite checkLoggedIn as a polling loop.
- Accepts { timeoutMs, pollIntervalMs } options.
- Default 10s timeout / 1s poll interval.
- Top-of-flow initial check uses a shorter 5s timeout (GitHub
either has the dashboard up or it doesn't by then).
- Inside the login retry loop (after credential + device-verify
submit) uses the 10s default to cover multi-step redirects.
- Short-circuits to false on /login or /session paths immediately.
- Added two more positive signals: avatar-user anywhere on a
non-auth path, and a "your work" header heuristic.
- Logs the winning signal or the last snapshot on failure so the
next failure explains itself.
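
A polling version of `checkLoggedIn` along these lines could be sketched as follows. The `probe` callback is a hypothetical stand-in for the in-page snapshot (it would wrap `page.evaluate` in the real script) and is assumed to return the current path plus a map of named boolean signals; the return shape is also an assumption.

```javascript
// Poll until a positive signal appears, an auth path is seen, or the
// timeout budget is spent. Defaults follow the commit message.
async function checkLoggedIn(
  probe,
  { timeoutMs = 10000, pollIntervalMs = 1000 } = {}
) {
  const deadline = Date.now() + timeoutMs;
  let last = null;
  while (true) {
    last = await probe(); // e.g. { path: "/", signals: { metaUserLogin: true } }
    // Explicit auth paths mean "not logged in" immediately.
    if (/^\/(login|session)/.test(last.path)) {
      return { loggedIn: false, reason: "auth path", last };
    }
    const winner = Object.keys(last.signals).find((name) => last.signals[name]);
    if (winner) return { loggedIn: true, signal: winner };
    // Stop if another poll would overrun the budget; keep the last
    // snapshot so the failure can be logged with context.
    if (Date.now() + pollIntervalMs > deadline) {
      return { loggedIn: false, reason: "timeout", last };
    }
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}
```

The top-of-flow check would pass `{ timeoutMs: 5000 }`, while the check inside the login retry loop would rely on the 10s default.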
resolveUsername was failing with 'Could not resolve a valid GitHub
username after login' because readLoggedInUsername only checked
meta[name='user-login'], which wasn't always present on the
just-hydrated dashboard. Broaden the signals and add polling:
1. meta[name='user-login'] (canonical)
2. header img.avatar-user [alt='@username']
3. Profile link with data-hovercard-type='user'
4. 'Signed in as X' aria-label
Also increase the fallback /settings/profile read budget to 8s.
## Summary
Part of the multi-repo connector contract unification plan. See
`docs/260410-contract-freeze.md` for the locked-in decisions this PR implements.

Paired PRs:
- … `click`/`fill`/`press`/`waitForSelector`/`url` in the playwright-runner to match the new canonical page API.

## Phase 0: freeze the contract shape
- … `schemas/manifest.schema.json`. Every `*-playwright.json` now has `manifest_version`, `connector_id`, `source_id`, `page_api_version`, canonical `connect_url` / `connect_selector` / `icon`, while keeping the legacy aliases (`id`, `connectURL`, etc.) for backward compatibility.
- … `types/connector.d.ts` now exposes the minimum canonical page API: added `click`, `fill`, `press`, `waitForSelector`, and `url` (previously de facto only in Context Gateway). `requestInput` remains canonical; `getInput` is never part of the contract.
- … `types/page-api-version.ts` exports `PAGE_API_VERSION = 1`. Independent of `manifest_version`, connector `version`, and per-scope schema `version`.
- … `instagram.following` with its own schema. Ad targeting categories remain additive on the existing `instagram.ads` schema (the `categories` field was already there).
- … (`connectors/apple/icloud-notes-playwright.{js,json}` + `schemas/icloud_notes.{notes,folders}.json` + `registry.json` entry).

## Late-cycle fixes (post initial review)
- … `input.value = x` + `dispatchEvent` pattern and stale `input[name="username"]` selectors with the CG-proven `page.fill` + `page.press('Enter')` pattern and the real 2026 selectors (`input[name="email"]`, `input[name="pass"]`). The canonical script also now emits the `instagram.following` scope via the ported scraper helper.
- … `getInput({title, description, schema, uiSchema})` → `requestInput({message, schema})`, `keyboard_press('Enter')` → `page.press(selector, 'Enter')`. Preserves the retry loop, URL-based 2FA detection, passkey / device-verify paths, and `#login_field` / `#password` selectors.
- … (`icloud_notes.notes`, `icloud_notes.folders`) that the registered schemas already advertise, and rewrote the schemas to match the exact fields the CG-in-prod script emits (`recordName`, `title`, `snippet`, `folder`, `isPinned`, `createdDate`, `modifiedDate`, `hasAttachments`, `textContent`).
- … `registry.json` descriptions + checksums.

## Guardrail scripts
- `scripts/normalize-manifests.mjs`
- `scripts/validate-manifests.mjs`
- `scripts/validate-scope-schemas.mjs`
- `scripts/check-page-api-additive.mjs`
- `scripts/check-source-id-stability.mjs`
- `scripts/check-additive-schemas.mjs`

All wired into `.github/workflows/contract-guardrails.yml`.

## Conformance fixture
`connectors/_conformance/conformance-playwright.{js,json}` is a synthetic connector that exercises every minimum-surface method and emits a `conformance.result` scope payload. Designed to run identically in both the DataConnect `playwright-runner` and Context Gateway's client-side runtime. Wiring it into a cross-runtime harness is Phase 4 follow-up work.

## Smoke test evidence
End-to-end runs via CG /lab against the production `remote-browser.vana.workers.dev` worker with real credentials, asserting on actual scoped payload contents (not UI "Connected" messages):
- … (@tnunamak): 99 repos, 10 starred, scoped keys `github.profile`, `github.repositories`, `github.starred`. Login including 2FA OTP worked on first try.
- … (@dondochaka): profile, 0 posts, real following accounts, ad topics. Scoped keys `instagram.profile`, `instagram.posts`, `instagram.following`, `instagram.ads`.
- … `icloud_notes.notes`, `icloud_notes.folders`. Note: iCloud Notes still depends on the CG-runtime-only page API (`getInput`, `frame_*`, `keyboard_*`), advertised as the `cg-legacy-page-api` capability in the manifest. Converging this to the canonical minimum surface is Phase 4+ follow-up work. In the meantime, the canonical data-connect registry does not list iCloud Notes, so it cannot be selected there.
- … (`activate_canonical_script: false`). The CG-in-prod Oura script still ships; canonical convergence is a follow-up.

## Test plan
- `node scripts/normalize-manifests.mjs --check`: 20 manifests, 0 would change
- `node scripts/validate-manifests.mjs`: 20/20 pass
- `node scripts/validate-scope-schemas.mjs`: 16 registered manifests, 50 schemas
- `BASE_REF=origin/main node scripts/check-page-api-additive.mjs`: 5 methods added, 0 removed
- `BASE_REF=origin/main node scripts/check-additive-schemas.mjs`: 41 schemas, 0 breaking changes