[DO NOT MERGE]: HSM Sync Engine #69

Open
dtkav wants to merge 352 commits into main from merge-hsm

Conversation

@dtkav (Member) commented Feb 19, 2026

No description provided.

dtkav and others added 30 commits February 16, 2026 16:03
- Add state vector merge routing tests verifying that stateVectorIsAhead
  drives correct merge path selection (disk-only, remote-only, three-way)
- Remove 3 broken MergeManager tests (createHSM doesn't wire effect
  subscriptions; MergeHSM has no destroy method)
- Delete obsolete state-vector-debug.test.ts (used removed manager.register API)
- Update all 25 hibernation tests: documents start warm after
  notifyHSMCreated, not hibernated. Buffer tests use maxConcurrentWarm:0
  and explicit hibernate() to prevent processWakeQueue from draining.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shadow mode was designed for parallel HSM comparison but was only used in
tests, not production. Removing to reduce maintenance burden and dead code.

Deleted:
- src/merge-hsm/shadow/ (ShadowMergeHSM, ShadowManager, types)
- src/merge-hsm/__tests__/shadow.test.ts
- Shadow exports from index.ts
- Divergence tracking from HSMDebuggerView

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Using path in WRITE_DISK effects was unsafe for renames - if a file was
renamed while the effect was in-flight, the write would target the wrong
location. Now effects carry the stable guid, and handlers look up the
current path at write time.

- WriteDiskEffect.path → WriteDiskEffect.guid
- DiskIntegration filters by guid, gets path via getter
- handleIdleWriteDisk looks up path from guid
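
A minimal sketch of the guid-keyed write described above. The effect shape, `pathByGuid`, `fakeDisk`, and `handleWriteDisk` are hypothetical stand-ins, not the real DiskIntegration code; the only point is that the path is resolved when the write runs rather than when the effect is emitted.

```typescript
// Hypothetical shape for illustration; the real WriteDiskEffect may carry more.
interface WriteDiskEffect {
  type: "WRITE_DISK";
  guid: string; // stable across renames
  contents: string;
}

// guid -> current vault path. A rename updates the value, never the key.
const pathByGuid = new Map<string, string>();
// Stand-in for the filesystem: path -> file contents.
const fakeDisk = new Map<string, string>();

function handleWriteDisk(effect: WriteDiskEffect): boolean {
  // Resolve the path when the write happens, not when the effect was emitted.
  const path = pathByGuid.get(effect.guid);
  if (path === undefined) return false; // document is no longer tracked
  fakeDisk.set(path, effect.contents);
  return true;
}

pathByGuid.set("doc-1", "notes/old.md");
const inFlight: WriteDiskEffect = { type: "WRITE_DISK", guid: "doc-1", contents: "hello" };
pathByGuid.set("doc-1", "notes/new.md"); // rename lands before the handler runs
const wrote = handleWriteDisk(inFlight);
```

With a path-carrying effect, the same sequence would have written to `notes/old.md`.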

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…idge

The recording bridge was caching path at recording start, so renamed files
showed stale paths in recordings. Now getFullPath() is called at each entry
to capture the current path.

- getDocument(guid) → getHSM(guid) returns just the HSM
- New getFullPath(guid) returns current full vault path
- Path is resolved fresh at each timeline entry

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the legacy LiveEditPlugin (592 lines) which was fully replaced
by HSMEditorPlugin for markdown documents. Move the connectionManagerFacet
import in ShareLinkPlugin to LiveNodePlugin.

Add periodic editor↔CRDT drift detection in CM6Integration that runs
every 5 seconds during active.tracking. When drift is found (editor
content diverges from localDoc), diagnostics are written to both the
relay log and HSM recording file, then the editor is corrected to
match the CRDT.

Enhanced checkAndCorrectDrift() to accept actual editor text instead
of relying on the cached lastKnownEditorText, which can go stale if
changes bypass the CM6_CHANGE event path (e.g. Obsidian's metadata
renderer calling setViewData directly).
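
To illustrate why the check takes the actual editor text, here is a rough sketch. `detectDrift`, `checkAndCorrect`, and `DriftReport` are hypothetical names and the report fields are assumptions; the real checkAndCorrectDrift() in CM6Integration will differ.

```typescript
interface DriftReport {
  editorLength: number;
  crdtLength: number;
  firstDiffAt: number; // first index at which the two texts differ
}

// Returns null when the editor matches the CRDT, otherwise a small diagnostic.
function detectDrift(actualEditorText: string, crdtText: string): DriftReport | null {
  if (actualEditorText === crdtText) return null;
  const limit = Math.min(actualEditorText.length, crdtText.length);
  let i = 0;
  while (i < limit && actualEditorText.charCodeAt(i) === crdtText.charCodeAt(i)) i++;
  return { editorLength: actualEditorText.length, crdtLength: crdtText.length, firstDiffAt: i };
}

// Reads the editor at check time (no cached copy), then corrects toward the CRDT.
function checkAndCorrect(
  readEditor: () => string,
  writeEditor: (text: string) => void,
  crdtText: string,
): DriftReport | null {
  const report = detectDrift(readEditor(), crdtText);
  if (report !== null) writeEditor(crdtText);
  return report;
}
```

A cached `lastKnownEditorText` would report no drift in exactly the case that matters: a change that never passed through the change-event path.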

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add getConnectionManager() that finds LiveViewManager via app.plugins
- Remove StateField, Compartment, reconfigure(), and wipe() machinery
- Simplify load() to register extensions once (idempotent)
- Remove isLiveMd check and view fallback in HSMEditorPlugin

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Idle merge operations (remote, disk, three-way) previously called
transitionTo("idle.synced") directly inside async callbacks, which
could complete before the async hash computation — causing the state
machine to show idle.synced while work was still in progress.

Now, real merge cases send an IDLE_MERGE_COMPLETE event after the
async hash completes, with all effects and the state transition
happening synchronously after the await. This ensures intermediate
states (idle.remoteAhead, idle.diskAhead, idle.diverged) are
observable while async work runs.

- Add IdleMergeCompleteEvent type and handler
- Refactor performIdleRemoteAutoMerge, performIdleDiskAutoMerge,
  performIdleThreeWayMerge to use event-driven completion
- Compute hash before emitting effects to avoid interleaving window
- Add serialization/deserialization for the new event type
- Fix loadToConflict test helper to pre-compute hash before
  SET_MODE_IDLE to avoid microtask boundary
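
The ordering can be sketched as follows. `MiniHsm`, `stubHash`, and the event payload fields are hypothetical simplifications, not the MergeHSM implementation: the state stays in the intermediate value while the hash is pending, and the effects plus the transition are applied together once the completion event is sent.

```typescript
type IdleState = "idle.remoteAhead" | "idle.synced";
type IdleMergeCompleteEvent = { type: "IDLE_MERGE_COMPLETE"; mergedText: string; hash: string };

class MiniHsm {
  state: IdleState = "idle.remoteAhead";
  text = "";
  lcaHash: string | null = null;

  send(ev: IdleMergeCompleteEvent): void {
    // Effects and the state transition are applied together, synchronously.
    this.text = ev.mergedText;
    this.lcaHash = ev.hash;
    this.state = "idle.synced";
  }

  async performIdleRemoteAutoMerge(
    mergedText: string,
    hashFn: (s: string) => Promise<string>,
  ): Promise<void> {
    // The hash is computed first; state remains idle.remoteAhead meanwhile.
    const hash = await hashFn(mergedText);
    this.send({ type: "IDLE_MERGE_COMPLETE", mergedText, hash });
  }
}

const hsm = new MiniHsm();
const stubHash = async (s: string): Promise<string> => "hash:" + s.length;
const merging = hsm.performIdleRemoteAutoMerge("remote text", stubHash);
const stateWhileHashing = hsm.state; // read before the awaited hash resolves
```

Here `stateWhileHashing` is still `idle.remoteAhead`; `idle.synced` only becomes visible after `merging` settles.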

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add seeded PRNG-driven random delays to test infrastructure to catch
timing-related bugs. Controlled via environment variables:
- TEST_ASYNC_DELAYS=1 enables random delays (0-10ms)
- TEST_SEED=<number> sets reproducible random sequence

Changes:
- Add random.ts with Mulberry32 PRNG fountain
- Add async delays to persistence sync, destroy, and diskLoader
- Add sendAcquireLock/sendAcquireLockToTracking helpers that properly
  await async state transitions
- Update loadToIdle to wait for persistence sync before returning
- Update loadToConflict to await state after acquireLock
- Convert ~30 tests to use async-aware helpers

All 89 tests pass with and without TEST_ASYNC_DELAYS=1.
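
For reference, a sketch of a Mulberry32 generator driving bounded delays. This is the widely circulated form of the algorithm, not a copy of random.ts; `nextDelayMs` and `maybeDelay` are hypothetical helpers, and the `enabled` flag and seed are passed in as arguments rather than read from TEST_ASYNC_DELAYS and TEST_SEED.

```typescript
// Mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Integer delay in [0, maxMs], drawn from the seeded generator.
function nextDelayMs(rng: () => number, maxMs: number): number {
  return Math.floor(rng() * (maxMs + 1));
}

// Resolves immediately when delays are disabled, otherwise after a random delay.
function maybeDelay(enabled: boolean, rng: () => number, maxMs: number = 10): Promise<void> {
  if (!enabled) return Promise.resolve();
  return new Promise<void>((resolve) => setTimeout(resolve, nextDelayMs(rng, maxMs)));
}
```

Because the sequence depends only on the seed, a failing run can be replayed with the same seed to get the same delay pattern.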

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
DRIFT_DETECTED events were using a non-standard format (type/timestamp/state)
which broke the visualizer. Now uses standard fields (ns/ts/path/event/from/to).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove enableMergeHSMRecording flag (always install bridge, lightweight without onEntry)
- Delete RecordingMergeHSM.ts and generateTest.ts (unused)
- Consolidate StreamingEntry into HSMLogEntry
- Simplify recording bridge, replay, and serialization modules
- Update tests for new recording API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The relay-live-editor CSS class is added asynchronously after acquireLock().
Previously, edits arriving before initialization were lost.

- Don't destroy plugin early if CSS class not yet present
- Buffer CM6 changes that arrive before HSM is ready
- Replay buffered edits once initialization completes
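
The buffering can be sketched as a small queue that replays in arrival order. `EarlyChangeBuffer` and the `Change` shape are hypothetical; the real plugin deals with CM6 change sets and HSM readiness rather than these simplified types.

```typescript
type Change = { from: number; to: number; insert: string };

class EarlyChangeBuffer {
  private pending: Change[] = [];
  private ready = false;
  private apply: (c: Change) => void;

  constructor(apply: (c: Change) => void) {
    this.apply = apply;
  }

  // Before initialization completes, changes are queued instead of dropped.
  onChange(c: Change): void {
    if (this.ready) this.apply(c);
    else this.pending.push(c);
  }

  // Called once initialization completes: replay in arrival order, then pass through.
  markReady(): void {
    this.ready = true;
    const queued = this.pending;
    this.pending = [];
    for (const c of queued) this.apply(c);
  }
}
```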

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Validates remote Yjs updates before applying them to prevent
corrupted updates from breaking document state.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Disk edits in idle mode now create a fork snapshot of localDoc before
ingesting changes. This enables three-way reconciliation when the
provider reconnects: fork.base serves as the common ancestor for
merging local (disk edit) and remote changes.

SyncGate controls CRDT op flow between localDoc and remoteDoc, blocking
remote-to-local merges while a fork is unreconciled.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Block syncLocalToRemote() while a fork is unreconciled. This prevents
local changes from propagating to remote before fork reconciliation
confirms they're safe.

When the fork clears (clearForkAndUpdateLCA), pending outbound changes
are flushed by recomputing the diff from localDoc to remoteDoc.
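
A toy model of the outbound gate, using plain strings in place of Y.Docs. `OutboundGate` and its method names are hypothetical, and copying `local` to `remote` stands in for computing and applying the CRDT diff.

```typescript
class OutboundGate {
  private forkActive = false;
  local = "";
  remote = "";

  createFork(): void {
    this.forkActive = true;
  }

  editLocal(text: string): void {
    this.local = text;
    this.syncLocalToRemote();
  }

  // Returns false when the write was held back by an unreconciled fork.
  syncLocalToRemote(): boolean {
    if (this.forkActive) return false;
    this.remote = this.local; // stand-in for diffing localDoc against remoteDoc
    return true;
  }

  // Clearing the fork flushes whatever accumulated while the gate was closed.
  clearFork(): void {
    this.forkActive = false;
    this.syncLocalToRemote();
  }
}
```

Recomputing the diff at flush time means no queue of blocked updates has to be kept while the fork is open.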

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a fork is created in idle mode (disk edit while provider not synced),
the HSM emits REQUEST_PROVIDER_SYNC effect. SharedFolder handles this by:
1. Downloading latest state via backgroundSync
2. Applying updates to remoteDoc
3. Sending CONNECTED + PROVIDER_SYNCED to HSM
4. HSM then runs fork reconciliation

This ensures disk edits are reconciled promptly without waiting for the
user to open the file.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When PROVIDER_SYNCED arrives in active.tracking with a fork present,
reconcileForkInActive runs three-way merge and dispatches granular
changes to the editor via computeDiffChanges (not full replace).

This ensures fork reconciliation happens promptly when the user has
the file open, rather than waiting for close/reopen.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fork reconciliation now queries provider.synced via callback instead of
storing duplicate state in _syncGate.providerSynced for idle mode.

Changes:
- Add isProviderSynced callback to MergeHSMConfig
- invokeForkReconcile uses callback to check provider state
- Add awaitingProvider guard to stay in idle.localAhead until synced
- Add hasFork guard to accumulate REMOTE_UPDATE during fork reconciliation
- Test harness tracks provider state and passes callback

This avoids state ownership issues where the HSM would store providerSynced
but not receive DISCONNECTED events in idle mode to clear it.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
destroyLocalDoc() now nulls out references synchronously and does async
IDB cleanup on captured refs, preventing races where wake() recreates
localDoc while cleanup is still running. wake() and processWakeQueue()
call ensureLocalDocForIdle() to recreate localDoc after hibernation.
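
The capture-then-null pattern, sketched with a hypothetical `DocOwner` and a plain object in place of the Y.Doc and its IDB persistence. The asynchronous step only touches the captured reference, so a doc recreated in the meantime is left alone.

```typescript
interface LocalDoc {
  id: number;
  destroyed: boolean;
}

class DocOwner {
  localDoc: LocalDoc | null = null;
  private nextId = 1;

  ensureLocalDocForIdle(): LocalDoc {
    if (this.localDoc === null) {
      this.localDoc = { id: this.nextId++, destroyed: false };
    }
    return this.localDoc;
  }

  destroyLocalDoc(): Promise<void> {
    const doc = this.localDoc; // capture the reference first
    this.localDoc = null; // synchronous: callers see "no doc" immediately
    if (doc === null) return Promise.resolve();
    // Asynchronous cleanup works on the captured ref only.
    return Promise.resolve().then(() => {
      doc.destroyed = true;
    });
  }
}

const owner = new DocOwner();
const firstDoc = owner.ensureLocalDocForIdle();
const cleanup = owner.destroyLocalDoc(); // not awaited
const secondDoc = owner.ensureLocalDocForIdle(); // a wake() during cleanup
```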

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove moduleNameMapper mocks for node-diff3 and y-indexeddb so tests
run against real implementations. Add a transform rule for src/*.js
files so ts-jest handles the ESM syntax in y-indexeddb.js.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Track OBSIDIAN_FILE_OPENED/OBSIDIAN_FILE_UNLOADED events to maintain
an isObsidianFileOpen flag on each HSM. DiskIntegration checks this
flag before executing WRITE_DISK effects, blocking writes when the
editor has the file open to prevent content duplication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Simplify invokeForkReconcile to always use diff3 via remoteDoc,
removing the "remote unchanged" shortcut that bypassed three-way merge.
Reset providerSynced on fork creation and RELEASE_LOCK so reconciliation
waits for a fresh sync. Emit REQUEST_PROVIDER_SYNC when releasing lock
with an active fork.

Add patchLCAHash() to async-compute LCA hashes after reconciliation
instead of blocking on the hash computation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, all active documents skipped remote update injection.
Now the skip is gated on enableDirectRemoteUpdates, allowing the
enqueueDownload path (which fetches full server state) to work for
active documents when direct injection is disabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ingestDiskToLocalDoc action that applies pending disk contents to
localDoc. Change idle.synced DISK_CHANGED transition to re-enter
idle.localAhead with ingestion instead of going straight to diverged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The guard checked for sharedFolder.remote during state change
callbacks, which is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ication

Adds the OpCapture module (reversible CRDT op capture) and integrates it
end-to-end: persistence in IDB via y-indexeddb, test harness support, and
consumption during fork reconciliation. When both disk and remote edit the
same content, redundant disk ops are reversed; unique disk ops are dropped
(kept in CRDT). Also fixes diff3 tokenization to use split(/(\n)/) so
adjacent-line changes are handled independently.
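
For readers unfamiliar with the capturing-group form of `split`, a short illustration. The variable names are hypothetical, and the comment about why it helps diff3 is an interpretation of the commit text, not something verified against the merge code.

```typescript
const sample = "a\nb\nc";

// Plain split drops the separators.
const lineTokens = sample.split("\n"); // ["a", "b", "c"]

// A capturing group keeps each "\n" as its own token, so two edited lines
// that sit next to each other still have an unchanged token between them.
const splitTokens = sample.split(/(\n)/); // ["a", "\n", "b", "\n", "c"]

// Joining the tokens reproduces the input exactly.
const roundTrips = splitTokens.join("") === sample;
```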

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a file was deleted or a shared folder was removed from settings,
the per-document y-indexeddb databases (and folder-level database)
were left behind as orphans in IndexedDB. This adds deleteDatabase
calls in deleteFile() and SharedFolders.delete() after the in-memory
objects are destroyed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes to the fork→diverged→idle-merge path:

1. clearForkKeepDiverged: clear pendingIdleUpdates (already evaluated
   by diff3) and restore pendingDiskContents from localDoc so the
   three-way merge sees the real disk content, not the LCA fallback.

2. invokeIdleThreeWayAutoMerge: read remoteDoc text directly instead
   of applying pendingIdleUpdates to localDoc via raw Y.applyUpdate.
   The raw CRDT merge causes interleaving corruption on conflicting
   edits. If remoteDoc isn't available yet, bail and wait for
   REMOTE_UPDATE to reenter idle.diverged.

3. clearForkAndUpdateLCA: clear pendingIdleUpdates for hygiene.

Also adds:
- MERGE_CONFLICT transition in active.tracking for fork conflicts
  detected during reconcileForkInActive
- OBSIDIAN_SAVE_FRONTMATTER / OBSIDIAN_METADATA_SYNC diagnostic
  events for correlating metadata editor hooks with drift events
- Regression test for the corruption scenario

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Path was stored at construction time and never updated on rename,
causing stale paths in logging and debug tools. Now MergeHSM takes
a getPath callback that computes the current path from the source
of truth (Document.path via SharedFolder's files map).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BackgroundSync.syncDocumentWebsocket only checked userLock before
disconnecting, missing cases where the MergeHSM was actively managing
the document. Also defers awareness cursor updates in RemoteSelections
to avoid re-entrant EditorView.update errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dtkav added 3 commits April 9, 2026 16:30
Thin bridge over MergeHSM.awaitState, which is event-driven (subscribes
to stateChanges and resolves as soon as its predicate matches) rather
than polling. The CDP caller gets event-driven settlement with a single
network round-trip instead of N ticks of "read state, check, sleep 50ms".

Function-valued predicates don't cross the Python↔JS boundary, so the
API takes a string prefix — the common case is "active." or "idle." —
and races against a timeout. Rejects with a diagnostic error that
includes the current state path.

Consumers (RelayClient.await_hsm_state, test scripts) compose
action + settlement without baking the wait into editor_open/close,
keeping those primitives pure:

    await client.editor_open(path)
    await client.await_hsm_state(path, "active.")

    await client.editor_close()
    await client.await_hsm_state(path, "idle.")
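
On the JS side, the prefix wait might look roughly like this. `StateSource` and `awaitStatePrefix` are hypothetical names for illustration; the real bridge sits on top of MergeHSM.awaitState and its stateChanges subscription.

```typescript
type Listener = (state: string) => void;

class StateSource {
  state: string;
  private listeners = new Set<Listener>();

  constructor(initial: string) {
    this.state = initial;
  }

  subscribe(fn: Listener): () => void {
    this.listeners.add(fn);
    return () => {
      this.listeners.delete(fn);
    };
  }

  transitionTo(next: string): void {
    this.state = next;
    Array.from(this.listeners).forEach((fn) => fn(next));
  }
}

// Resolves with the first state whose path starts with `prefix`;
// rejects after `timeoutMs` with an error naming the current state.
function awaitStatePrefix(src: StateSource, prefix: string, timeoutMs: number): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (src.state.startsWith(prefix)) {
      resolve(src.state);
      return;
    }
    let timer: ReturnType<typeof setTimeout> | undefined;
    const unsubscribe = src.subscribe((state) => {
      if (!state.startsWith(prefix)) return;
      clearTimeout(timer);
      unsubscribe();
      resolve(state);
    });
    timer = setTimeout(() => {
      unsubscribe();
      reject(new Error('Timed out waiting for "' + prefix + '"; current state: ' + src.state));
    }, timeoutMs);
  });
}
```

Nothing polls here: the promise settles on the matching transition or on the timer, whichever fires first.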
Exposes conflict inspection and resolution primitives for the test
harness and live debugging:

- getConflictInfo(path): focused ConflictInfoSnapshot with base, ours,
  theirs, labels, and a per-hunk ConflictHunkInfo[] carrying each
  hunk's index, resolved flag, oursContent/theirsContent, and a
  content-hash `id` (jj-style minimum-unique-prefix over all hunks in
  the current set). Ids are derived from oursContent + \0 +
  theirsContent, so they survive re-parses, waits, and persisted fork
  restoration — but shift if the underlying conflict content changes.
- resolveConflict(path, contents): dispatches RESOLVE with final text.
- resolveHunk(path, indexOrId, 'local'|'remote'|'both'): dispatches
  RESOLVE_HUNK. Accepts a numeric index or any unambiguous prefix of
  the hunk id; throws on ambiguous or unknown ids with a candidate
  list.
- openDiffView(path) / cancelDiffView(path): thin wrappers over the
  OPEN_DIFF_VIEW and CANCEL events for driving the conflict state
  machine from tests.
- clearLca(path): under-the-hood LCA mutation — reproduces the no-LCA
  state that arises after upgrading from a plugin version without LCA
  tracking.

All methods go through __relayDebug.lookupDocument and hsm.send, so
the state machine drives every transition.
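
A sketch of the prefix handling for hunk ids, assuming the ids are already-computed hex strings (the hash function itself is not shown and is not specified here). `shortestUniquePrefix` and `findHunkIndex` are hypothetical helpers, not the debug API.

```typescript
// Shortest prefix of `id` that no other id in `all` shares.
function shortestUniquePrefix(id: string, all: string[]): string {
  for (let len = 1; len < id.length; len++) {
    const prefix = id.slice(0, len);
    const hits = all.filter((other) => other.startsWith(prefix)).length;
    if (hits === 1) return prefix;
  }
  return id; // no shorter unique prefix exists
}

// Accepts a numeric index or an unambiguous id prefix; returns the hunk index.
function findHunkIndex(ids: string[], indexOrId: number | string): number {
  if (typeof indexOrId === "number") {
    if (!Number.isInteger(indexOrId) || indexOrId < 0 || indexOrId >= ids.length) {
      throw new Error("No hunk at index " + indexOrId);
    }
    return indexOrId;
  }
  const prefix: string = indexOrId;
  const matches: number[] = [];
  ids.forEach((id, i) => {
    if (id.startsWith(prefix)) matches.push(i);
  });
  if (matches.length === 1) return matches[0];
  // Unknown prefix: list every id. Ambiguous prefix: list only the matching ones.
  const candidates = (matches.length > 0 ? matches.map((i) => ids[i]) : ids).join(", ");
  const problem = matches.length === 0 ? "Unknown" : "Ambiguous";
  throw new Error(problem + ' hunk id "' + prefix + '"; candidates: ' + candidates);
}
```

Listing candidates in the error lets a test script retry with a longer prefix without a second round-trip to fetch the hunk list.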
Track live Relay plugin instances on a window-level Set so a
crashed or incomplete onunload() surfaces deterministically on the
next load instead of silently stacking ghost listeners.

onload generates a random instance ID, checks the set for stale
entries from a previous lifecycle (logging a loud console.error
naming the leaked IDs), and adds its own ID. onunload deletes
the instance's ID as the very last step — after auditTeardown and
flushLogs — so any earlier throw leaves the ID in place as a
tombstone. The next load sees it and warns.

Consumers (deploy, RelayClient.verify_build) read the Set directly
and assert size === 1. Anything else means the vault is running
duplicated subscriptions with racing IDB writes, and tests produce
phantom results.
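
The tombstone idea in miniature. The registry is passed in as a plain Set rather than hung off `window`, and `registerInstance`, `unregisterInstance`, and `newInstanceId` are hypothetical names; logging is left to the caller.

```typescript
function newInstanceId(): string {
  return Math.random().toString(36).slice(2, 10);
}

// Called from onload: returns ids left behind by earlier lifecycles, then adds ours.
function registerInstance(live: Set<string>, id: string): string[] {
  const stale = Array.from(live);
  live.add(id);
  return stale;
}

// Called as the very last step of onunload; skipped if anything earlier throws.
function unregisterInstance(live: Set<string>, id: string): void {
  live.delete(id);
}
```

A consumer that expects exactly one live instance only has to check that the Set's size is 1.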
dtkav added 6 commits April 9, 2026 16:41
The class accumulated subscriptions in a single flat Unsubscribe[]
array without per-folder keying. Every time sharedFolders notified
(folder added OR removed), the handler iterated all current folders
and pushed 3 fresh subscribes per folder onto the shared array. After
N notifications a single folder accumulated 3N subscriptions, and
unsubs from removed folders never left the array.

On plugin unload, destroy() walked the stale array and called each
unsubscriber. Closures from folders whose own Observable had been
destroyed (nulling _listeners) crashed with "Cannot read properties
of null" in the middle of onunload, leaving the rest of the teardown
chain dangling.

Replace the flat array with a Map<SharedFolder, Unsubscribe[]> and
a single rootSub on the SharedFolders ObservableSet. On every root
notification, diff the current folder set against the tracked map:
release all subscriptions for removed folders, attach new
subscriptions only for newly-added ones. Plugin-global subscriptions
(backgroundSync stores, workspace layout) move to a separate
globalSubs list that's untouched by folder churn.

destroy() releases in the correct order: root first (so no new
notifications arrive mid-teardown), per-folder second, global last.
Each release path wraps individual unsub calls in try/catch so a
pre-destroyed observable doesn't break the rest of the chain.
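
The per-folder keying and diffing can be sketched like this. `FolderSubscriptions`, `releaseAll`, and `onFoldersChanged` are hypothetical names, the folder type is generic, and the root and global subscriptions are omitted.

```typescript
type Unsubscribe = () => void;

// Runs every unsubscriber; one that throws does not stop the rest.
function releaseAll(unsubs: Unsubscribe[]): void {
  for (const unsub of unsubs) {
    try {
      unsub();
    } catch {
      // Observable was already destroyed; keep going.
    }
  }
}

class FolderSubscriptions<F> {
  private perFolder = new Map<F, Unsubscribe[]>();
  private attach: (folder: F) => Unsubscribe[];

  constructor(attach: (folder: F) => Unsubscribe[]) {
    this.attach = attach;
  }

  // Called on every root notification with the current folder set.
  onFoldersChanged(current: F[]): void {
    const now = new Set(current);
    Array.from(this.perFolder.keys()).forEach((folder) => {
      if (!now.has(folder)) {
        releaseAll(this.perFolder.get(folder) ?? []);
        this.perFolder.delete(folder);
      }
    });
    now.forEach((folder) => {
      if (!this.perFolder.has(folder)) this.perFolder.set(folder, this.attach(folder));
    });
  }

  destroy(): void {
    this.perFolder.forEach((unsubs) => releaseAll(unsubs));
    this.perFolder.clear();
  }
}
```

Repeated notifications for an unchanged folder set are now no-ops, which is what keeps the subscription count from growing with notification count.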
Observable.destroy() cleared the listener set but never removed the
instance from the module-level audit map, so every Observable ever
constructed stayed strongly referenced until auditTeardown() ran at
plugin unload.

It also never cancelled in-flight PostOffice deliveries for its
listeners, leaving a ~20ms window in which PostOffice.deliver() could
invoke a recipient with a torn-down sender.

Delete from the audit map and cancel each listener on Postie before
clearing the set.
SharedFolder.connect() called setupEventSubscriptions() again on every
reconnect, pushing another handleDocumentUpdateEvent closure onto the
provider's eventCallbacks["document.updated"] array. After N reconnects
the handler ran N+1 times per server event.

The provider preserves eventCallbacks and eventSubscriptions across
WebSocket reconnects and re-sends the server subscribe frame itself
(client/provider.ts:264-267), and HasProvider.connect() reuses the
same _provider instance via refreshProvider. The constructor's
setupEventSubscriptions() call is the only registration we need.
super.destroy() already tears down the ydoc via HasProvider.destroyRemoteDoc().
The subsequent this.ydoc.destroy() hit the lazy getter, allocating a fresh
Y.Doc, YSweetProvider, websocket, and two connection/state listeners that
were then never cleaned up — leaking on every folder removal and plugin
unload.
IndexeddbPersistence registers itself as a 'destroy' event handler on
the ydoc, so SharedFolder.destroy() does trigger it indirectly via
super.destroy() → HasProvider.destroyRemoteDoc → ydoc.destroy(). But
the persistence destroy() is async — it awaits _pendingWrites and
_pendingCompaction before db.close() — and the returned promise is
discarded inside the synchronous event emit. On normal Obsidian
unload the OS reaps everything; on plugin reload the new instance can
race the old one's pending IDB close on the same database name.

Call _persistence.destroy() explicitly before super.destroy() and
register the promise with awaitOnReload so plugin re-enable waits for
the flush to complete. Calling destroy() synchronously removes its
own 'destroy' listener, so the cascade through super.destroy() does
not double-fire.
mgmobrien and others added 17 commits April 10, 2026 14:24
The folder CRDT persistence used the bare folder GUID as the IDB
database name. When two vaults shared the same folder in one Obsidian
process, they also shared a single IDB store, leaking Y.Doc updates between
vaults. Use `${appId}-relay-folder-${guid}` to match the convention
used by every other IDB store in the codebase.

Adds a `migrateFrom` option to IndexeddbPersistence that copies raw
blobs from the old DB into the new one (without roundtripping through
Y.js), then deletes the old DB. Migration runs before `fetchUpdates`
so the `synced` event is only emitted after data is safely in the new
store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
initializeWithContent returns false when content was enrolled by the
editor (race between uploadDoc and acquireLock). The file still needs
syncing and publishing to the syncStore for cross-vault propagation.

Also clears pendingUpload in deleteFile.
classifyUpdate deduplicates via tracked state vectors, making the
user-based filter redundant. The server can also mislabel the user
on the event, making user-based filtering unreliable.
initializeFromRemote now only enrolls CRDT bytes and sets the state
vector. LCA is the caller's responsibility via the new setLCA method.
downloadDoc calls setLCA after flushing to disk. GUID remap does not
set LCA (disk may differ from remote).
baseStart/baseEnd from diff3 are token indices, not line numbers.
The line-map lookup produced wrong positions. String search for
oursContent in localContent is reliable for unique hunk text.
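
A small illustration of the mismatch and of the string-search approach. `locateHunk` is a hypothetical helper; returning null for repeated text is an assumption added here to make the "unique hunk text" caveat explicit, not necessarily what the fix does.

```typescript
// Finds the character range of a hunk's text inside the local content.
function locateHunk(
  localContent: string,
  oursContent: string,
): { from: number; to: number } | null {
  if (oursContent.length === 0) return null;
  const from = localContent.indexOf(oursContent);
  if (from === -1) return null;
  // Assumption: a second occurrence makes the position ambiguous.
  if (localContent.indexOf(oursContent, from + 1) !== -1) return null;
  return { from, to: from + oursContent.length };
}

const localText = "a\nb\nc";
const diffTokens = localText.split(/(\n)/); // 5 tokens for 3 lines
const tokenIndexOfC = diffTokens.indexOf("c"); // token index 4
const lineIndexOfC = localText.split("\n").indexOf("c"); // line index 2
```

With separator tokens in the stream, the token index of a line is about twice its line index, which is why a line-map lookup keyed by those indices lands in the wrong place.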

2 participants