Conversation
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a Unix-socket "runtime keeper" process and a hidden RuntimeKeeper CLI; moves per-buffer PTY ownership into keeper processes accessed via per-buffer Unix sockets; updates server runtime APIs, state, persistence, the framing protocol, and serde derives, and serializes affected multi-threaded tests via a cross-process test lock.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as "embers-cli (RuntimeKeeper)"
    participant Keeper as "Runtime Keeper (PTY)"
    participant Socket as "Unix Socket"
    participant Server as "embers-server"
    CLI->>Keeper: run_runtime_keeper(RuntimeKeeperCli)
    Keeper->>Keeper: spawn PTY, terminal backend, snapshotter
    Keeper->>Socket: bind & listen (keeper socket)
    Server->>Socket: connect / attach
    Server->>Socket: send RPC (write/resize/status/capture/kill)
    Socket->>Keeper: deliver RPC request
    Keeper->>Keeper: perform PTY action, update sequence/activity/title
    Keeper->>Socket: send length-prefixed JSON response/update
    Socket->>Server: receive response/update
    Server->>Server: apply_runtime_status / record_buffer_update
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Force-pushed b86c41d to 8dbc2b0
Force-pushed 991111a to 7e4827f
Force-pushed 34f1237 to 6826cc1
Force-pushed 85a63c3 to 8762d2d
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
crates/embers-server/src/server.rs (2)
1715-1738: ⚠️ Potential issue | 🔴 Critical — Don't drop the keeper handle on normal exit.

After line 1716 removes `buffer_id` from `buffer_runtimes`, every `capture_*` and visible-snapshot path falls back to `buffer_runtime()`'s `Exited` error. The keeper itself stays alive and can still serve the final screen in `crates/embers-server/src/buffer_runtime.rs`, so this strips scrollback/final-render access from exited panes until the next restart.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/server.rs` around lines 1715 - 1738, The code removes the runtime handle unconditionally in record_buffer_exit by calling self.buffer_runtimes.lock().await.remove(&buffer_id), which drops the keeper and prevents capture_*/visible-snapshot paths from serving final render; change the logic so the keeper handle is not removed on a normal exit: fetch (e.g., get/clone or peek) the runtime handle instead of removing it when should_interrupt is false so the keeper remains available for buffer_runtime() to serve Exited state, and only remove the entry when you truly intend to discard the runtime (for example when should_interrupt is true or after final cleanup). Ensure you update record_buffer_exit to conditionally remove from buffer_runtimes (referencing buffer_runtimes and record_buffer_exit) and preserve the keeper used by buffer_runtime()/Exited in buffer_runtime.rs.
253-257: 🧹 Nitpick | 🔵 Trivial — Remove the unused server-side surface cache.

CI is already warning that `buffer_surfaces` and `BufferSurface` are dead after this refactor. Keeping this path around obscures the new keeper-backed rendering flow and leaves stale code to maintain.

Also applies to: 269-398
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/server.rs` around lines 253 - 257, Remove the unused server-side surface cache by deleting the Server struct field buffer_surfaces and the BufferSurface type and any associated imports/usages; search for symbols buffer_surfaces and BufferSurface (and matches in impls/constructors, e.g., where Server is created or mutated) and remove or refactor those code paths to rely on the keeper-backed rendering flow instead (cleanup any Mutex/BTreeMap/BufferId references tied only to this cache and remove tests or dead code referencing these symbols).

crates/embers-server/src/buffer_runtime.rs (1)
311-325: ⚠️ Potential issue | 🟠 Major — Join the poller instead of detaching it.

Dropping the `JoinHandle` on line 324 does not wait for the thread to finish; it just detaches it. This means shutdown can continue while the poller thread still owns the keeper connection and may still fire callbacks. Additionally, since `Drop` calls `join_threads_blocking()`, there's a risk of deadlock if the drop happens on the poller thread itself. Join it explicitly with a self-join guard to prevent deadlock:

Fix
```diff
 impl BufferRuntimeInner {
     fn join_threads_blocking(&self) {
         self.stop.store(true, Ordering::Relaxed);
+        let current_thread = thread::current().id();
         let mut threads = match self.threads.lock() {
             Ok(threads) => threads,
             Err(poisoned) => {
                 error!(
                     %self.buffer_id,
                     "buffer runtime thread registry lock poisoned during shutdown"
                 );
                 poisoned.into_inner()
             }
         };
-        let _ = threads.poller.take();
+        if let Some(poller) = threads.poller.take() {
+            if poller.thread().id() != current_thread {
+                let _ = poller.join();
+            }
+        }
     }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/buffer_runtime.rs` around lines 311 - 325, In join_threads_blocking (impl BufferRuntimeInner) don’t drop threads.poller (which detaches); instead take the JoinHandle from threads.poller, detect if it is the current thread (use a self-join guard / thread::current().id()) and if not call join() on the handle to wait for the poller to finish, otherwise detach/skip to avoid deadlock; ensure stop.store(true, ...) happens before joining so the poller can exit and handle poisoning the threads mutex as currently done when retrieving threads.poller.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 482-535: The keeper socket must be bound before spawning the PTY
child and reader/wait threads to avoid orphaned children on bind failure: move
the UnixListener::bind(&cli.socket_path) and creation of SocketCleanup before
calling pair.slave.spawn_command(...) and before spawning keeper_read_loop and
keeper_wait_loop threads; create the KeeperRuntime after successful bind (or at
least ensure bind errors are checked before spawn_command/threads run), then
spawn the child (child.clone_killer()/child.process_id()), start the reader/wait
threads (keeper_read_loop, keeper_wait_loop) using the runtime, and only then
proceed — update any uses of cli.socket_path in thread names accordingly so the
listener is reserved first and failures cleanly return an error without leaving
the child running.
- Around line 224-262: The four query methods (status, capture_snapshot,
capture_visible_snapshot, capture_scrollback_slice) perform blocking UnixStream
I/O and must be moved onto a blocking thread; change each to async (e.g., pub
async fn status(&self) -> Result<BufferRuntimeStatus>) and call
tokio::task::spawn_blocking to run the existing synchronous logic inside the
closure, returning the join handle's result (and map/join errors to
MuxError/Result as appropriate). Inside the spawn_blocking closure, acquire the
inner.connection lock and call
connection.status()/snapshot()/visible_snapshot()/scrollback_slice() just like
before; ensure you clone any Arcs needed so the closure owns them. Update any
callsites in server.rs to await these methods. This mirrors the existing pattern
used by write/resize/kill to avoid blocking Tokio workers.
In `@crates/embers-server/src/config.rs`:
- Around line 23-27: The derived runtime_dir created from socket_path
(runtime_dir = socket_path.with_extension("runtimes")) can produce keeper socket
paths that exceed OS domain-socket length limits when you later build keeper
sockets (see server code that does runtime_dir.join(format!("buffer-{}.sock",
buffer_id.0))). Update the code to validate the full path length before
creating/using keeper sockets: compute the final candidate path (runtime_dir
joined with the buffer file name), check it against the platform socket path
limit (e.g., 104 bytes on macOS, 108 on Linux), and if it would exceed the limit
either shorten the on-disk name (e.g., hash/truncate buffer_id or use a stable
shortened prefix) or return a clear configuration error when creating
runtime_dir/socket_path so callers reserve enough space; implement this
validation where runtime_dir is used to construct keeper sockets (reference
runtime_dir, socket_path, and the keeper socket construction in server.rs).
In `@crates/embers-server/src/server.rs`:
- Around line 455-487: The loop restores each buffer by calling
attach_buffer_runtime(buffer.id, socket_path) which can hang because
connect_to_keeper() retries for the full budget; to fix, short-circuit
non-existent sockets or use a shorter restore timeout: before calling
attach_buffer_runtime check the socket file exists (e.g., tokio::fs::metadata or
Path::exists on buffer.runtime_socket_path) and if missing call
state.mark_buffer_interrupted(buffer.id, buffer_pid_hint(&buffer.state)) and
continue, or add an optional timeout/fast-path parameter to
attach_buffer_runtime (and then to connect_to_keeper in buffer_runtime.rs) so
the restore path uses a small timeout instead of the full retry budget.
In `@crates/embers-server/src/state.rs`:
- Around line 400-405: mark_buffer_interrupted currently unconditionally
overwrites a buffer's state to BufferState::Interrupted, which loses existing
BufferState::Exited metadata used by restore_buffer_runtimes(); change
mark_buffer_interrupted to first inspect the current buffer.state and if it is
BufferState::Exited (preserving its exit metadata) do nothing and return Ok(()),
otherwise set buffer.state = BufferState::Interrupted(InterruptedBuffer {
last_known_pid: pid }) as before; reference the mark_buffer_interrupted function
and the BufferState::Exited / BufferState::Interrupted variants to locate where
to add the guard.
In `@crates/embers-server/tests/persistence.rs`:
- Around line 219-253: Replace the single tokio::time::sleep + one-shot calls to
request_snapshot with a retry/poll loop that repeatedly sends
BufferRequest::Capture (using RequestId(9)/RequestId(10) and the
attached_id/detached_id) and checks the returned snapshot.lines for the expected
"restarted-attached" / "restarted-detached" strings until either the line
appears or a deadline expires; remove the fixed 200ms sleep, backoff briefly
between attempts, and fail only after the timeout so transient non-snapshot
responses from request_snapshot don't cause intermittent panics.
---
Outside diff comments:
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 311-325: In join_threads_blocking (impl BufferRuntimeInner) don’t
drop threads.poller (which detaches); instead take the JoinHandle from
threads.poller, detect if it is the current thread (use a self-join guard /
thread::current().id()) and if not call join() on the handle to wait for the
poller to finish, otherwise detach/skip to avoid deadlock; ensure
stop.store(true, ...) happens before joining so the poller can exit and handle
poisoning the threads mutex as currently done when retrieving threads.poller.
In `@crates/embers-server/src/server.rs`:
- Around line 1715-1738: The code removes the runtime handle unconditionally in
record_buffer_exit by calling
self.buffer_runtimes.lock().await.remove(&buffer_id), which drops the keeper and
prevents capture_*/visible-snapshot paths from serving final render; change the
logic so the keeper handle is not removed on a normal exit: fetch (e.g.,
get/clone or peek) the runtime handle instead of removing it when
should_interrupt is false so the keeper remains available for buffer_runtime()
to serve Exited state, and only remove the entry when you truly intend to
discard the runtime (for example when should_interrupt is true or after final
cleanup). Ensure you update record_buffer_exit to conditionally remove from
buffer_runtimes (referencing buffer_runtimes and record_buffer_exit) and
preserve the keeper used by buffer_runtime()/Exited in buffer_runtime.rs.
- Around line 253-257: Remove the unused server-side surface cache by deleting
the Server struct field buffer_surfaces and the BufferSurface type and any
associated imports/usages; search for symbols buffer_surfaces and BufferSurface
(and matches in impls/constructors, e.g., where Server is created or mutated)
and remove or refactor those code paths to rely on the keeper-backed rendering
flow instead (cleanup any Mutex/BTreeMap/BufferId references tied only to this
cache and remove tests or dead code referencing these symbols).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 577c7183-c1a6-44fc-b0c4-5a2f9b38d910
📒 Files selected for processing (14)
- crates/embers-cli/src/lib.rs
- crates/embers-cli/tests/interactive.rs
- crates/embers-cli/tests/panes.rs
- crates/embers-core/src/metadata.rs
- crates/embers-core/src/snapshot.rs
- crates/embers-server/src/buffer_runtime.rs
- crates/embers-server/src/config.rs
- crates/embers-server/src/lib.rs
- crates/embers-server/src/model.rs
- crates/embers-server/src/persist.rs
- crates/embers-server/src/server.rs
- crates/embers-server/src/state.rs
- crates/embers-server/tests/persistence.rs
- crates/embers-test-support/tests/buffer_runtime.rs
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/embers-server/src/server.rs (1)
1460-1499: ⚠️ Potential issue | 🔴 Critical — Don't force `Running` before applying the sampled keeper status.

This races with `on_exit` for short-lived commands. If the callback marks the buffer exited first, `mark_buffer_running()` fails here and `BufferRequest::Create` deletes the buffer as if startup failed.

Suggested fix
```diff
         {
             let mut state = self.state.lock().await;
-            if let Err(error) = state.mark_buffer_running(buffer_id, status.pid) {
-                let _ = runtime.kill().await;
-                let _ = runtime.join_threads().await;
-                return Err(error);
-            }
             state.set_buffer_runtime_socket_path(
                 buffer_id,
                 Some(runtime.socket_path().to_path_buf()),
             )?;
             apply_runtime_status(&mut state, buffer_id, &status);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/server.rs` around lines 1460 - 1499, The code currently calls state.mark_buffer_running(...) before applying the sampled runtime status, which races with the on_exit callback; fix by applying the sampled status first via apply_runtime_status(&mut state, buffer_id, &status), then only call mark_buffer_running(buffer_id, status.pid) if the buffer state is still non-terminal (i.e. not already marked exited by apply_runtime_status); if apply_runtime_status indicates the buffer is already exited, perform the same cleanup (runtime.kill().await and runtime.join_threads().await) and return the error/early-exit instead of forcing Running. Reference BufferRuntimeHandle::spawn, status, apply_runtime_status, and mark_buffer_running to locate and reorder/add the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/embers-cli/src/lib.rs`:
- Around line 73-87: RuntimeKeeper currently redefines the global --socket
option causing a clap clash; remove the duplicate socket field and its
#[arg(long)] from the RuntimeKeeper variant and update any locations that
referenced RuntimeKeeper.socket to use the root Cli.socket (or accept the socket
via function parameters from the parsed Cli) instead; also remove/rename any
other duplicate socket definitions (the other RuntimeKeeper-like variant noted
in the review) so only the Cli struct defines #[arg(long, global = true)]
socket.
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 765-770: When connection.status() returns Err in the poll loop,
don't just log and sleep/continue; treat it as terminal and invoke the on_exit
callback with None (call on_exit(None)) for the associated buffer
(inner.buffer_id) and then break/return out of the poll loop so the thread stops
instead of spinning with STATUS_POLL_INTERVAL; update the block handling
Err(error) in the poller where debug!(%error, %inner.buffer_id, "status poll
failed") is logged to call on_exit(None) and stop the loop so the buffer can
exit Running state.
- Around line 899-910: The read_message function currently allocates payload
based on the peer-supplied len (u32), which can cause OOM; add a bounds check on
len before creating payload (e.g., compare len against a defined MAX_FRAME_SIZE
constant like MAX_IPC_FRAME_SIZE or a sensible hard cap such as a few MBs), and
if len exceeds the limit return an appropriate Err (wrap as MuxError::internal
or a new MuxError variant) instead of allocating; update the code around the len
variable and the payload allocation in read_message to validate and reject
oversized frames early.
- Around line 822-858: resolve_runtime_keeper_executable() currently only finds
a Cargo-built binary (or uses EMBERS_RUNTIME_KEEPER_BIN) which causes
spawn_runtime_keeper() to fall back to the in-process thread for installed
deployments; update resolve_runtime_keeper_executable() to also search for an
installed runtime keeper binary by (1) checking
std::env::var("EMBERS_RUNTIME_KEEPER_BIN") first (keep existing behavior), (2)
looking in PATH for a well-known binary name such as "embers-runtime-keeper"
(use which-style lookup via env::var_os("PATH") and iterate dirs), and (3) if
that fails, checking next to the current executable
(std::env::current_exe().parent().join("embers-runtime-keeper")) before falling
back to the Cargo-specific logic; ensure the function returns an executable
PathBuf only if it is exists() and is_file() (and on Unix is executable), so
spawn_runtime_keeper() can launch the external process for installed embers too.
- Around line 557-565: The accept loop in run_runtime_keeper never observes the
buffer/PTY exit so terminate never becomes true (handle_keeper_request always
returns false), causing accept() to block forever; modify the loop in
run_runtime_keeper that currently calls listener.accept() and
handle_keeper_client(runtime.clone(), &mut stream) to also check the
KeeperRuntime::exit_code (or a provided is_exited() method) before/after accept
and break when exit_code.is_some(), or use a timed accept/select with a timeout
to re-check exit_code periodically; ensure keeper_wait_loop updates the same
exit flag (exit_code) and that breaking the loop allows reader_join and
wait_join to join and triggers SocketCleanup.
In `@crates/embers-server/src/server.rs`:
- Around line 1506-1534: attach_buffer_runtime's BufferRuntimeCallbacks
(on_output/on_exit) spawn tasks directly and thus bypass the quiesce_state_tasks
TaskCounter; update attach_buffer_runtime to mirror spawn_buffer_runtime by
entering the shared TaskCounter before spawning: call the TaskCounter::enter
(via self.state_tasks.enter() or equivalent) at the start of each callback,
capture the returned guard and move it into the spawned async closure so the
guard is held for the duration of the spawned task, then drop the guard after
record_buffer_update/record_buffer_exit complete; ensure you reference
BufferRuntimeCallbacks, on_output, on_exit, attach_buffer_runtime,
record_buffer_update, record_buffer_exit, and quiesce_state_tasks when making
the change so restored-runtime callbacks are counted.
---
Outside diff comments:
In `@crates/embers-server/src/server.rs`:
- Around line 1460-1499: The code currently calls state.mark_buffer_running(...)
before applying the sampled runtime status, which races with the on_exit
callback; fix by applying the sampled status first via apply_runtime_status(&mut
state, buffer_id, &status), then only call mark_buffer_running(buffer_id,
status.pid) if the buffer state is still non-terminal (i.e. not already marked
exited by apply_runtime_status); if apply_runtime_status indicates the buffer is
already exited, perform the same cleanup (runtime.kill().await and
runtime.join_threads().await) and return the error/early-exit instead of forcing
Running. Reference BufferRuntimeHandle::spawn, status, apply_runtime_status, and
mark_buffer_running to locate and reorder/add the check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5062b2c9-e792-4728-ac37-09955943099c
📒 Files selected for processing (14)
- crates/embers-cli/src/lib.rs
- crates/embers-cli/tests/interactive.rs
- crates/embers-cli/tests/panes.rs
- crates/embers-core/src/metadata.rs
- crates/embers-core/src/snapshot.rs
- crates/embers-server/src/buffer_runtime.rs
- crates/embers-server/src/config.rs
- crates/embers-server/src/lib.rs
- crates/embers-server/src/model.rs
- crates/embers-server/src/persist.rs
- crates/embers-server/src/server.rs
- crates/embers-server/src/state.rs
- crates/embers-server/tests/persistence.rs
- crates/embers-test-support/tests/buffer_runtime.rs
Force-pushed 35a08ac to 5ab4920
Force-pushed 237f71a to f311c96
Force-pushed 5aace75 to 9c359b4
Actionable comments posted: 2
♻️ Duplicate comments (6)
crates/embers-cli/src/lib.rs (1)
74-88: 🧹 Nitpick | 🔵 Trivial — Consider renaming `RuntimeKeeper.socket` to `keeper_socket` for clarity.

The `RuntimeKeeper` subcommand redefines a `--socket` argument that is already defined globally on the root `Cli` struct. While clap 4.5 allows this and the code works correctly, the shared name obscures that they serve different purposes: the global socket targets the server connection, while RuntimeKeeper's socket is a subprocess-specific argument. Renaming to `--keeper-socket` improves clarity and prevents accidental confusion:

Suggested changes
```diff
 #[command(name = "__runtime-keeper", hide = true)]
 RuntimeKeeper {
-    #[arg(long)]
-    socket: PathBuf,
+    #[arg(long = "keeper-socket")]
+    keeper_socket: PathBuf,
```

Then update the `crates/embers-cli/src/lib.rs` handler:

```diff
 Some(Command::RuntimeKeeper {
-    socket,
+    keeper_socket,
     cols,
     rows,
     cwd,
     env,
     command,
 }) => embers_server::run_runtime_keeper(embers_server::RuntimeKeeperCli {
-    socket_path: socket,
+    socket_path: keeper_socket,
```

And update `crates/embers-server/src/buffer_runtime.rs`:

```diff
-    .arg("--socket")
+    .arg("--keeper-socket")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-cli/src/lib.rs` around lines 74 - 88, Rename the RuntimeKeeper subcommand's socket field to keeper_socket (and its CLI long flag to --keeper-socket) to avoid collision with the global socket option; update the RuntimeKeeper variant in the Cli enum in crates/embers-cli/src/lib.rs (the RuntimeKeeper struct fields) and adjust any places that construct or destructure RuntimeKeeper (handlers that read RuntimeKeeper.socket) to use RuntimeKeeper.keeper_socket, and update crates/embers-server/src/buffer_runtime.rs where the RuntimeKeeper socket value is consumed (references to RuntimeKeeper.socket) to reference the new keeper_socket name so all callsites and parsing remain consistent.

crates/embers-server/src/buffer_runtime.rs (4)
765-770: ⚠️ Potential issue | 🟠 Major — Treat a broken keeper connection as terminal.

Once `connection.status()` starts failing, this loop just retries the same dead stream forever. `on_exit` never fires, so the buffer can remain stuck in `Running` and the poller burns a thread until shutdown.

Suggested fix
```diff
 match connection.status() {
     Ok(status) => status,
     Err(error) => {
         debug!(%error, %inner.buffer_id, "status poll failed");
-        thread::sleep(STATUS_POLL_INTERVAL);
-        continue;
+        if !saw_exit {
+            saw_exit = true;
+            (callbacks.on_exit)(inner.buffer_id, None);
+        }
+        break;
     }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/buffer_runtime.rs` around lines 765 - 770, The status polling loop currently retries forever when connection.status() returns Err, causing a stuck Running state; change the Err branch (the match on connection.status()) to treat the keeper connection as terminal: log the error with %error and %inner.buffer_id, call the buffer cleanup/exit path (invoke on_exit or otherwise transition the buffer to a terminal state), stop sleeping/retrying and break/return from the loop so the poller thread exits; ensure STATUS_POLL_INTERVAL is not used to retry in this error path and that any needed cleanup for the connection is performed before returning.
899-910: ⚠️ Potential issue | 🟠 Major — Bound keeper frame sizes before allocating.

`len` is peer-controlled. A bogus local client can force arbitrarily large allocations here and OOM the server or keeper.

Suggested fix
```diff
 fn read_message<T: for<'de> Deserialize<'de>>(stream: &mut UnixStream) -> Result<Option<T>> {
     let mut len_bytes = [0_u8; 4];
     match stream.read_exact(&mut len_bytes) {
         Ok(()) => {}
         Err(error) if error.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
         Err(error) => return Err(error.into()),
     }
     let len = u32::from_le_bytes(len_bytes) as usize;
+    if len > 16 * 1024 * 1024 {
+        return Err(MuxError::protocol(format!(
+            "runtime keeper payload too large: {len} bytes"
+        )));
+    }
     let mut payload = vec![0_u8; len];
     stream.read_exact(&mut payload)?;
     let value =
         serde_json::from_slice(&payload).map_err(|error| MuxError::internal(error.to_string()))?;
     Ok(Some(value))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/buffer_runtime.rs` around lines 899 - 910, In read_message, validate the peer-controlled len before allocating payload to avoid OOM: introduce a MAX_FRAME_SIZE constant and check the parsed len (u32::from_le_bytes(len_bytes) as usize) against it and also ensure it’s non-zero/within usize bounds; if len is out of range, return an appropriate MuxError instead of allocating; only after this guarded check create the payload Vec and proceed with stream.read_exact and serde_json::from_slice. Ensure the check references the len variable and the payload allocation site so the allocation is never attempted for an oversized frame.
803-819: ⚠️ Potential issue | 🟠 Major — Restore can still spend 30s per stale socket inode.

`server.rs` now skips missing paths, but `connect_to_keeper()` still retries `ConnectionRefused` for the full 30s budget. If a crashed keeper leaves its socket file behind, restart stalls buffer-by-buffer anyway.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/buffer_runtime.rs` around lines 803 - 819, connect_to_keeper currently sleeps and retries on std::io::ErrorKind::ConnectionRefused which causes long stalls when a stale socket inode exists; change the match arm for ErrorKind::ConnectionRefused in connect_to_keeper to return the error immediately (Err(error.into())) instead of sleeping and retrying (leave the NotFound branch to keep retry behavior). Keep the existing retry loop and timeout behavior (CONNECT_RETRY_ATTEMPTS, CONNECT_RETRY_DELAY, and the final MuxError::timeout) but fail fast on ConnectionRefused to avoid per-buffer 30s stalls.
822-858: ⚠️ Potential issue | 🔴 Critical — Installed builds still miss the external keeper path.

If `current_exe()` is a normal installed `embers` binary, this returns `None` and `spawn_runtime_keeper()` falls back to the in-process thread. PTYs then die with the server process, so restart survival only works under Cargo-ish launches or with `EMBERS_RUNTIME_KEEPER_BIN`.

Suggested fix
```diff
 fn resolve_runtime_keeper_executable() -> Option<PathBuf> {
     if let Some(path) = std::env::var_os("EMBERS_RUNTIME_KEEPER_BIN").map(PathBuf::from)
         && path.is_file()
     {
         return Some(path);
@@
     }
     let current_exe = std::env::current_exe().ok()?;
+    let expected_name = if cfg!(windows) { "embers.exe" } else { "embers" };
+    if current_exe.file_name().is_some_and(|name| name == expected_name) {
+        return Some(current_exe);
+    }
     if let Some(parent) = current_exe.parent()
         && parent.file_name().is_some_and(|name| name == "deps")
     {
         let candidate = parent.parent()?.join(if cfg!(windows) {
             "embers.exe"
```

Also applies to: 861-886
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/buffer_runtime.rs` around lines 822 - 858, The current spawn_runtime_keeper falls back to an in-process thread when resolve_runtime_keeper_executable() returns None, causing installed builds to never launch the external keeper; change the logic to attempt alternative resolution before falling back: call env::var("EMBERS_RUNTIME_KEEPER_BIN") and, if unset, attempt to locate a sibling/packaged binary name (e.g., "embers-runtime-keeper") on PATH (use which::which or equivalent) inside spawn_runtime_keeper (or enhance resolve_runtime_keeper_executable to perform these checks), and only when all external-lookups fail then spawn the in-process thread via run_runtime_keeper with RuntimeKeeperCli; keep the existing arg/env handling and error mapping unchanged.crates/embers-server/src/server.rs (1)
1604-1632: ⚠️ Potential issue | 🟠 Major — Track restored-runtime callbacks in `state_tasks`.

`attach_buffer_runtime()` spawns `record_buffer_update()`/`record_buffer_exit()` work without entering `self.state_tasks`. `quiesce_state_tasks()` can therefore return before restored-runtime state mutations finish, and shutdown may persist stale buffer state.

Suggested fix
```diff
 let output_handle = tokio::runtime::Handle::current();
 let exit_handle = output_handle.clone();
 let output_runtime = self.clone();
 let exit_runtime = self.clone();
+let output_tasks = self.state_tasks.clone();
+let exit_tasks = self.state_tasks.clone();
 let runtime = BufferRuntimeHandle::attach(
     buffer_id,
     socket_path,
     BufferRuntimeCallbacks {
         on_output: Arc::new(move |buffer_id, update| {
             let runtime = output_runtime.clone();
+            let _task = output_tasks.enter();
             std::mem::drop(output_handle.spawn(async move {
+                let _task = _task;
                 runtime.record_buffer_update(buffer_id, update).await;
             }));
         }),
         on_exit: Arc::new(move |buffer_id, exit_code| {
             let runtime = exit_runtime.clone();
+            let _task = exit_tasks.enter();
             std::mem::drop(exit_handle.spawn(async move {
+                let _task = _task;
                 runtime.record_buffer_exit(buffer_id, exit_code).await;
             }));
         }),
     },
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 568-575: The loop in handle_keeper_client uses the ? operator on
read_message and handle_keeper_request which causes run_runtime_keeper to exit
on ordinary request errors; instead, catch errors from
read_message::<KeeperRequest> and handle_keeper_request, convert them into a
KeeperResponse::Error (including error details), and send that response via
write_message rather than returning Err; keep using the terminate boolean only
for fatal control-flow from successful handle_keeper_request results, and if
write_message itself fails treat it as a socket/connection termination (return
Ok(false)). Ensure you update the error handling around read_message,
handle_keeper_request, and write_message in handle_keeper_client to build and
write a KeeperResponse::Error (referencing KeeperResponse::Error, KeeperRequest,
read_message, handle_keeper_request, write_message, and handle_keeper_client).
In `@crates/embers-server/src/server.rs`:
- Around line 1739-1760: The record_buffer_update method currently applies
runtime updates unconditionally which can allow out-of-order async work to roll
back buffer.last_snapshot_seq and re-broadcast stale title/activity; change
record_buffer_update to compare update.sequence against the existing
buffer.last_snapshot_seq (and only apply and set
title/activity/last_snapshot_seq when update.sequence is greater than the stored
sequence), and skip the broadcast when the incoming update is stale (sequence <=
last_snapshot_seq); reference the BufferRuntimeUpdate.sequence,
buffer.last_snapshot_seq, and the record_buffer_update function to locate the
change.
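The guard described in the comment above is a plain monotonic-sequence check. A minimal sketch of the idea, using simplified stand-ins for the real server types (`BufferState` and `Update` here are hypothetical; only the sequence comparison mirrors the suggested fix):

```rust
// Simplified stand-ins for the server's buffer state and runtime update.
struct BufferState {
    last_snapshot_seq: u64,
    title: Option<String>,
}

struct Update {
    sequence: u64,
    title: Option<String>,
}

/// Apply `update` only if it is newer than what the buffer already holds.
/// Returns true when the update was applied (and should be broadcast).
fn record_buffer_update(state: &mut BufferState, update: Update) -> bool {
    if update.sequence <= state.last_snapshot_seq {
        // Out-of-order async work delivered a stale update: drop it.
        return false;
    }
    state.last_snapshot_seq = update.sequence;
    state.title = update.title;
    true
}

fn main() {
    let mut state = BufferState { last_snapshot_seq: 5, title: None };
    assert!(record_buffer_update(&mut state, Update { sequence: 6, title: Some("a".into()) }));
    // A late-arriving older update must not roll the state back.
    assert!(!record_buffer_update(&mut state, Update { sequence: 4, title: None }));
    assert_eq!(state.last_snapshot_seq, 6);
}
```

The same comparison also decides the broadcast: a stale sequence is dropped silently rather than re-announced to attached clients.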
---
Duplicate comments:
In `@crates/embers-cli/src/lib.rs`:
- Around line 74-88: Rename the RuntimeKeeper subcommand's socket field to
keeper_socket (and its CLI long flag to --keeper-socket) to avoid collision with
the global socket option; update the RuntimeKeeper variant in the Cli enum in
crates/embers-cli/src/lib.rs (the RuntimeKeeper struct fields) and adjust any
places that construct or destructure RuntimeKeeper (handlers that read
RuntimeKeeper.socket) to use RuntimeKeeper.keeper_socket, and update
crates/embers-server/src/buffer_runtime.rs where the RuntimeKeeper socket value
is consumed (references to RuntimeKeeper.socket) to reference the new
keeper_socket name so all callsites and parsing remain consistent.
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 765-770: The status polling loop currently retries forever when
connection.status() returns Err, causing a stuck Running state; change the Err
branch (the match on connection.status()) to treat the keeper connection as
terminal: log the error with %error and %inner.buffer_id, call the buffer
cleanup/exit path (invoke on_exit or otherwise transition the buffer to a
terminal state), stop sleeping/retrying and break/return from the loop so the
poller thread exits; ensure STATUS_POLL_INTERVAL is not used to retry in this
error path and that any needed cleanup for the connection is performed before
returning.
- Around line 899-910: In read_message, validate the peer-controlled len before
allocating payload to avoid OOM: introduce a MAX_FRAME_SIZE constant and check
the parsed len (u32::from_le_bytes(len_bytes) as usize) against it and also
ensure it’s non-zero/within usize bounds; if len is out of range, return an
appropriate MuxError instead of allocating; only after this guarded check create
the payload Vec and proceed with stream.read_exact and serde_json::from_slice.
Ensure the check references the len variable and the payload allocation site so
the allocation is never attempted for an oversized frame.
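The length guard the prompt asks for can be isolated to a tiny check on the 4-byte little-endian prefix before anything is allocated. A sketch under assumed values (`MAX_FRAME_SIZE` here is illustrative, not the real constant):

```rust
// Assumed upper bound on a single frame; the real value may differ.
const MAX_FRAME_SIZE: usize = 16 * 1024 * 1024; // 16 MiB

/// Validate a peer-supplied little-endian length prefix before allocating.
fn validate_frame_len(len_bytes: [u8; 4]) -> Result<usize, String> {
    let len = u32::from_le_bytes(len_bytes) as usize;
    if len == 0 {
        return Err("zero-length frame".into());
    }
    if len > MAX_FRAME_SIZE {
        return Err(format!("frame of {len} bytes exceeds limit"));
    }
    Ok(len)
}

fn main() {
    // Only after the guard passes would read_message allocate the payload Vec.
    assert_eq!(validate_frame_len(8u32.to_le_bytes()), Ok(8));
    assert!(validate_frame_len(u32::MAX.to_le_bytes()).is_err());
    assert!(validate_frame_len(0u32.to_le_bytes()).is_err());
}
```

The point is ordering: the check references `len` at the allocation site, so an oversized or zero frame returns an error before `Vec::with_capacity`/`read_exact` ever runs.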
- Around line 803-819: connect_to_keeper currently sleeps and retries on
std::io::ErrorKind::ConnectionRefused which causes long stalls when a stale
socket inode exists; change the match arm for ErrorKind::ConnectionRefused in
connect_to_keeper to return the error immediately (Err(error.into())) instead of
sleeping and retrying (leave the NotFound branch to keep retry behavior). Keep
the existing retry loop and timeout behavior (CONNECT_RETRY_ATTEMPTS,
CONNECT_RETRY_DELAY, and the final MuxError::timeout) but fail fast on
ConnectionRefused to avoid per-buffer 30s stalls.
- Around line 822-858: The current spawn_runtime_keeper falls back to an
in-process thread when resolve_runtime_keeper_executable() returns None, causing
installed builds to never launch the external keeper; change the logic to
attempt alternative resolution before falling back: call
env::var("EMBERS_RUNTIME_KEEPER_BIN") and, if unset, attempt to locate a
sibling/packaged binary name (e.g., "embers-runtime-keeper") on PATH (use
which::which or equivalent) inside spawn_runtime_keeper (or enhance
resolve_runtime_keeper_executable to perform these checks), and only when all
external-lookups fail then spawn the in-process thread via run_runtime_keeper
with RuntimeKeeperCli; keep the existing arg/env handling and error mapping
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a411efbc-93e0-40f5-8cfb-f0848d0c1b4c
📒 Files selected for processing (14)
- crates/embers-cli/src/lib.rs
- crates/embers-cli/tests/interactive.rs
- crates/embers-cli/tests/panes.rs
- crates/embers-core/src/metadata.rs
- crates/embers-core/src/snapshot.rs
- crates/embers-server/src/buffer_runtime.rs
- crates/embers-server/src/config.rs
- crates/embers-server/src/lib.rs
- crates/embers-server/src/model.rs
- crates/embers-server/src/persist.rs
- crates/embers-server/src/server.rs
- crates/embers-server/src/state.rs
- crates/embers-server/tests/persistence.rs
- crates/embers-test-support/tests/buffer_runtime.rs
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/embers-server/src/model.rs (1)
21-33: 🧹 Nitpick | 🔵 Trivial: Avoid leaking the keeper socket path into the public `Buffer` surface.

`Buffer` is re-exported from the crate root, so `runtime_socket_path` turns a backend filesystem detail into public API and into the struct's derived `Debug`/`Eq` behavior. If outside callers do not need this, keep it in server-internal state/persistence types instead of the public model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/model.rs` around lines 21-33: the public Buffer struct currently exposes runtime_socket_path, which leaks a backend filesystem detail; remove the pub runtime_socket_path: Option<PathBuf> field from the public Buffer definition and move that data into an internal/server-only type used for persistence (e.g., a separate struct like InternalBufferState or PersistedBuffer), or make the field private within the module if it must remain on the same type; update all builders/constructors and uses that previously read/write Buffer::runtime_socket_path to use the new internal type or accessor functions, and adjust any Debug/Eq/Serialize derivations accordingly so external callers no longer see the keeper socket path.

crates/embers-server/src/server.rs (1)

54-65: ⚠️ Potential issue | 🟠 Major: Bind the public socket after runtime restore completes.

`UnixListener::bind()` happens before `Runtime::restore_buffer_runtimes()`, while crates/embers-cli/src/lib.rs treats a successful connect as readiness. With restore now doing real work, clients can connect and send the first request while no accept loop is running yet, so startup commands can stall behind restore. Move the bind/permission setup after restore, or make the readiness probe perform a real request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/server.rs` around lines 54 - 65, The UnixListener is currently bound before runtime.restore_buffer_runtimes() which allows clients to connect before the accept loop is ready; move the bind and permission steps after the runtime is constructed and after runtime.restore_buffer_runtimes(). Specifically: keep load_workspace(...) and Runtime::new(...) as-is, call runtime.restore_buffer_runtimes().await? first, then call UnixListener::bind(&self.config.socket_path)? and set_socket_permissions(&self.config.socket_path)?; this ensures the socket is created only after restore completes (alternatively, if you prefer readiness semantics, implement a real readiness probe that performs a real request instead of relying on successful bind).
♻️ Duplicate comments (3)
crates/embers-server/src/server.rs (1)
1604-1629: ⚠️ Potential issue | 🟠 Major: Count restored-runtime callbacks in `state_tasks`.

`attach_buffer_runtime()` spawns `record_buffer_update()` and `record_buffer_exit()` without taking a `TaskCounter` ticket. `quiesce_state_tasks()` can therefore persist the workspace before callbacks from restored keepers finish during shutdown. Mirror the guard logic from `spawn_buffer_runtime()`.
```diff
  async fn attach_buffer_runtime(
      self: &Arc<Self>,
      buffer_id: BufferId,
      socket_path: PathBuf,
  ) -> Result<(BufferRuntimeHandle, BufferRuntimeStatus)> {
      let output_handle = tokio::runtime::Handle::current();
      let exit_handle = output_handle.clone();
      let output_runtime = self.clone();
      let exit_runtime = self.clone();
+     let output_tasks = self.state_tasks.clone();
+     let exit_tasks = self.state_tasks.clone();
      let runtime = BufferRuntimeHandle::attach(
          buffer_id,
          socket_path,
          BufferRuntimeCallbacks {
              on_output: Arc::new(move |buffer_id, update| {
                  let runtime = output_runtime.clone();
+                 let _task = output_tasks.enter();
                  std::mem::drop(output_handle.spawn(async move {
+                     let _task = _task;
                      runtime.record_buffer_update(buffer_id, update).await;
                  }));
              }),
              on_exit: Arc::new(move |buffer_id, exit_code| {
                  let runtime = exit_runtime.clone();
+                 let _task = exit_tasks.enter();
                  std::mem::drop(exit_handle.spawn(async move {
+                     let _task = _task;
                      runtime.record_buffer_exit(buffer_id, exit_code).await;
                  }));
              }),
          },
      )?;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/server.rs` around lines 1604 - 1629, attach_buffer_runtime currently spawns the on_output and on_exit async tasks without acquiring a TaskCounter ticket, so restored-runtime callbacks can run after quiesce_state_tasks persists state; modify attach_buffer_runtime to mirror spawn_buffer_runtime by obtaining a TaskCounter guard (or calling the same helper) before spawning each task inside the on_output and on_exit closures, hold the guard for the lifetime of the spawned task, and pass/clone the guard into the async move block so record_buffer_update and record_buffer_exit run under the TaskCounter ticket.crates/embers-server/src/buffer_runtime.rs (2)
190-200: ⚠️ Potential issue | 🟠 Major: Move the keeper attach handshake off the async worker.

`attach()` still does blocking `connect_to_keeper()` retries plus a synchronous `connection.status()` round-trip. crates/embers-server/src/server.rs:1540-1590 and crates/embers-server/src/server.rs:1601-1632 call this path directly from async code, so a slow or missing keeper can pin a Tokio worker for ~30s during spawn/restore even though the later RPCs already use `spawn_blocking`.

To confirm the async/blocking call chain:

```bash
#!/bin/bash
set -euo pipefail
echo "Async call sites:"
rg -n -C3 'BufferRuntimeHandle::(spawn|attach)\(' crates/embers-server/src/server.rs
echo
echo "Synchronous attach handshake:"
rg -n -C3 'connect_to_keeper\(|connection\.status\(\)' crates/embers-server/src/buffer_runtime.rs
echo
echo "Blocking retry loop:"
rg -n -C3 'CONNECT_RETRY_(DELAY|ATTEMPTS)|thread::sleep\(CONNECT_RETRY_DELAY\)' crates/embers-server/src/buffer_runtime.rs
```

Also applies to: 829-845
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/buffer_runtime.rs` around lines 190 - 200, The attach() function currently performs blocking connect_to_keeper() and synchronous KeeperConnection::status() calls and is invoked from async contexts; move that blocking handshake off the async worker by performing the entire connect+status retry loop inside a blocking task (e.g., tokio::task::spawn_blocking) or expose an async attach_async wrapper that calls spawn_blocking for the existing connect_to_keeper/KeeperConnection::status logic; update call sites that call BufferRuntime::attach (or BufferRuntimeHandle::attach) from async code to await the new async wrapper or to call the nonblocking API so no Tokio worker is blocked by the retry loop. Ensure unique symbols to change: attach, connect_to_keeper, KeeperConnection::status, and callers in server.rs that call BufferRuntimeHandle::attach/spawn/restore.
561-565: ⚠️ Potential issue | 🟠 Major: Exited PTYs still leave a keeper behind.

`keeper_wait_loop()` records the exit code, but the accept loop never observes it and `accept()` blocks forever. After the shell exits, the helper process/thread and socket file stay around indefinitely until something else kills the keeper.

Also applies to: 765-769
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/embers-server/src/buffer_runtime.rs` around lines 561 - 565, The accept loop blocks on listener.accept() so it never sees the exit recorded by keeper_wait_loop(); make the loop nonblocking and poll the termination signal instead: set the listener to nonblocking (or use a timeouted accept) and change the termination flag to a shared AtomicBool or receive on a channel that keeper_wait_loop sets/sends to; then in the while !terminate loop call listener.accept(), handle WouldBlock by sleeping/yielding briefly, and still call handle_keeper_client(...) when accept returns a stream so the loop can exit promptly when keeper_wait_loop signals termination. Ensure you update references to terminate, listener.accept, handle_keeper_client and keeper_wait_loop accordingly so the accept loop observes the exit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/embers-cli/src/lib.rs`:
- Around line 657-673: The Command::RuntimeKeeper branch is being forced through
run()'s global socket resolution/validation
(resolve_socket_path()/validate_runtime_socket_parent()) even though it uses its
own keeper_socket; move the match arm that handles Command::RuntimeKeeper (the
branch that calls embers_server::run_runtime_keeper with RuntimeKeeperCli) to
execute before calling resolve_socket_path()/validate_runtime_socket_parent(),
or alternatively add a conditional to skip those validations when the matched
variant is Command::RuntimeKeeper so that keeper startup uses its provided
keeper_socket without being blocked by the global socket checks.
In `@crates/embers-cli/tests/interactive.rs`:
- Around line 21-24: The test_lock() function currently returns a process-local
static LOCK (OnceLock<Mutex<()>>) which does not serialize across the separate
test binaries; replace this per-process static with an interprocess coordination
mechanism (e.g., a file-based lock using a crate like fs2/file-lock or explicit
flock on a well-known temp file) and update test_lock (and any callers in
interactive.rs, panes.rs, and buffer_runtime.rs) to acquire the interprocess
lock before proceeding and release it when done so tests across binaries
serialize; ensure the new implementation uses a unique, shared lock name/path
and returns an appropriate guard or RAII object so tests still block safely
until the lock is available.
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 31-35: The BufferRuntimeUpdate struct's title field currently uses
Option<String>, which conflates "no update" with "explicitly cleared" and causes
stale titles; change BufferRuntimeUpdate::title to Option<Option<String>> so
outer None = no change, Some(None) = cleared, Some(Some(s)) = set. Update any
producers (e.g., spawn_status_poller()) to emit Some(None) when clearing the
title and update the consumer logic in record_buffer_update() to handle the
three cases (outer None => do nothing, Some(None) => clear stored title,
Some(Some(s)) => set title to s); reference BufferRuntimeUpdate,
spawn_status_poller, and record_buffer_update when making the changes.
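The tri-state encoding the prompt describes can be reduced to a single match. A sketch of the consumer side (`apply_title` is a hypothetical helper, not the real `record_buffer_update` code):

```rust
// Option<Option<String>>: outer None = no change,
// Some(None) = explicitly cleared, Some(Some(s)) = set to s.
fn apply_title(current: &mut Option<String>, change: Option<Option<String>>) {
    match change {
        None => {}                        // no title update in this snapshot
        Some(None) => *current = None,    // title explicitly cleared
        Some(Some(title)) => *current = Some(title),
    }
}

fn main() {
    let mut title = Some("vim".to_string());
    apply_title(&mut title, None);
    assert_eq!(title.as_deref(), Some("vim")); // untouched, not clobbered
    apply_title(&mut title, Some(None));
    assert_eq!(title, None);                   // cleared, not kept stale
    apply_title(&mut title, Some(Some("htop".to_string())));
    assert_eq!(title.as_deref(), Some("htop"));
}
```

With plain `Option<String>` the first two cases collapse into one, which is exactly how a cleared title ends up sticking around.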
- Around line 862-865: The code in buffer_runtime.rs currently passes
environment values to the external keeper using value.to_string_lossy(), losing
non-UTF-8 bytes; update the CLI encoding to use a lossless encoding (e.g.,
base64) for --env so the external keeper and the in-process fallback use the
same representation. Change the loop that builds keeper.args (the cli.env
iteration and keeper.arg(...) that uses to_string_lossy) to encode
value.as_bytes() with base64 (or JSON-escape) and format the argument as
"KEY=ENCODED" plus a marker like "base64:" so the keeper can detect and decode
it; also update the keeper's argument-parsing code and the in-process fallback
deserialization to decode the same marker back into the original bytes and
reconstruct the same env variable (preserving non-UTF-8) instead of relying on
to_string_lossy().
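A round-trippable "KEY=marker:ENCODED" scheme like the one the prompt sketches is straightforward. This version uses hex rather than base64 so the sketch stays std-only; the `hex:` marker and the function names are illustrative, not the crate's real API:

```rust
/// Encode raw env-value bytes losslessly, with a detectable marker.
fn encode_env_value(value: &[u8]) -> String {
    let hex: String = value.iter().map(|b| format!("{b:02x}")).collect();
    format!("hex:{hex}")
}

/// Decode a marked value back to the original bytes; None if malformed.
fn decode_env_value(encoded: &str) -> Option<Vec<u8>> {
    let hex = encoded.strip_prefix("hex:")?;
    if hex.len() % 2 != 0 {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    // Non-UTF-8 bytes survive the round trip, unlike to_string_lossy().
    let raw = vec![0x66, 0x6f, 0xff, 0x6f];
    let encoded = encode_env_value(&raw);
    assert_eq!(decode_env_value(&encoded), Some(raw));
}
```

Whatever encoding is chosen, the important property is that the keeper's argument parser and the in-process fallback decode the same representation, so both paths reconstruct identical environment bytes.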
In `@crates/embers-server/src/server.rs`:
- Around line 424-456: When reattachment fails (both the "socket missing" branch
and the Err branch from attach_buffer_runtime), the code currently marks the
buffer interrupted but leaves the stale runtime path in state; update both
failure paths to clear the stored runtime path by calling the state mutation
that removes the socket (use the same function used to set it, e.g.,
set_buffer_runtime_socket_path(buffer.id, None) or equivalent) before continuing
so the dead path is not re-persisted; locate these changes around
attach_buffer_runtime, buffer_pid_hint, and state.mark_buffer_interrupted to
insert the clear operation.
---
Outside diff comments:
In `@crates/embers-server/src/model.rs`:
- Around line 21-33: The public Buffer struct currently exposes
runtime_socket_path which leaks a backend filesystem detail; remove the pub
runtime_socket_path: Option<PathBuf> field from the public Buffer definition and
move that data into an internal/server-only type used for persistence (e.g., a
separate struct like InternalBufferState or PersistedBuffer) or make the field
private within the module if it must remain on the same type; update all
builders/constructors and uses that previously read/write
Buffer::runtime_socket_path to use the new internal type or accessor functions
and adjust any Debug/Eq/Serialize derivations accordingly so external callers no
longer see the keeper socket path.
In `@crates/embers-server/src/server.rs`:
- Around line 54-65: The UnixListener is currently bound before
runtime.restore_buffer_runtimes() which allows clients to connect before the
accept loop is ready; move the bind and permission steps after the runtime is
constructed and after runtime.restore_buffer_runtimes(). Specifically: keep
load_workspace(...) and Runtime::new(...) as-is, call
runtime.restore_buffer_runtimes().await? first, then call
UnixListener::bind(&self.config.socket_path)? and
set_socket_permissions(&self.config.socket_path)?; this ensures the socket is
created only after restore completes (alternatively, if you prefer readiness
semantics, implement a real readiness probe that performs a real request instead
of relying on successful bind).
---
Duplicate comments:
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 190-200: The attach() function currently performs blocking
connect_to_keeper() and synchronous KeeperConnection::status() calls and is
invoked from async contexts; move that blocking handshake off the async worker
by performing the entire connect+status retry loop inside a blocking task (e.g.,
tokio::task::spawn_blocking) or expose an async attach_async wrapper that calls
spawn_blocking for the existing connect_to_keeper/KeeperConnection::status
logic; update call sites that call BufferRuntime::attach (or
BufferRuntimeHandle::attach) from async code to await the new async wrapper or
to call the nonblocking API so no Tokio worker is blocked by the retry loop.
Ensure unique symbols to change: attach, connect_to_keeper,
KeeperConnection::status, and callers in server.rs that call
BufferRuntimeHandle::attach/spawn/restore.
- Around line 561-565: The accept loop blocks on listener.accept() so it never
sees the exit recorded by keeper_wait_loop(); make the loop nonblocking and poll
the termination signal instead: set the listener to nonblocking (or use a
timeouted accept) and change the termination flag to a shared AtomicBool or
receive on a channel that keeper_wait_loop sets/sends to; then in the while
!terminate loop call listener.accept(), handle WouldBlock by sleeping/yielding
briefly, and still call handle_keeper_client(...) when accept returns a stream
so the loop can exit promptly when keeper_wait_loop signals termination. Ensure
you update references to terminate, listener.accept, handle_keeper_client and
keeper_wait_loop accordingly so the accept loop observes the exit.
In `@crates/embers-server/src/server.rs`:
- Around line 1604-1629: attach_buffer_runtime currently spawns the on_output
and on_exit async tasks without acquiring a TaskCounter ticket, so
restored-runtime callbacks can run after quiesce_state_tasks persists state;
modify attach_buffer_runtime to mirror spawn_buffer_runtime by obtaining a
TaskCounter guard (or calling the same helper) before spawning each task inside
the on_output and on_exit closures, hold the guard for the lifetime of the
spawned task, and pass/clone the guard into the async move block so
record_buffer_update and record_buffer_exit run under the TaskCounter ticket.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6c99f284-24a0-40c8-95f3-818539c21bd1
📒 Files selected for processing (14)
- crates/embers-cli/src/lib.rs
- crates/embers-cli/tests/interactive.rs
- crates/embers-cli/tests/panes.rs
- crates/embers-core/src/metadata.rs
- crates/embers-core/src/snapshot.rs
- crates/embers-server/src/buffer_runtime.rs
- crates/embers-server/src/config.rs
- crates/embers-server/src/lib.rs
- crates/embers-server/src/model.rs
- crates/embers-server/src/persist.rs
- crates/embers-server/src/server.rs
- crates/embers-server/src/state.rs
- crates/embers-server/tests/persistence.rs
- crates/embers-test-support/tests/buffer_runtime.rs
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/embers-server/src/buffer_runtime.rs`:
- Around line 595-603: The accept loop in the keeper thread never exits because
handle_keeper_client/handle_keeper_request always returns false (so while
!terminate never breaks); update the logic so that a Kill request causes
termination (make handle_keeper_request return true for the Kill branch and
propagate that through handle_keeper_client), or add an exit check inside the
accept loop (inspect the shared exit_code/Runtime state each iteration and break
if Some) and/or use a timeout on listener.accept to re-check exit conditions;
ensure reader_join and wait_join still join after the loop ends.
- Around line 1006-1015: The Unix branch already encodes raw bytes losslessly,
but the Windows branch uses to_string_lossy() which can drop invalid UTF-16;
update encode_runtime_keeper_env_value to produce a lossless encoding on Windows
by converting the OsStr to a raw, reversible byte sequence (for example use the
wtf-8 crate to encode the OsStr to WTF-8 bytes or use
std::os::windows::ffi::OsStrExt to get the &[u16] and serialize as little-endian
bytes) and then base64-encode those bytes; keep the function name
encode_runtime_keeper_env_value and ensure whatever lossless scheme you choose
has a matching decoding change in the keeper CLI.
In `@crates/embers-server/src/server.rs`:
- Around line 2174-2192: Add a brief doc comment above the platform-specific
constants to explain their origin: note that UNIX_SOCKET_PATH_LIMIT corresponds
to the size of sockaddr_un.sun_path on each platform (macOS 104 per unix(4),
other Unix/Linux 108 per unix(7)), so future maintainers understand why these
values are chosen; update the comment near the constants
(UNIX_SOCKET_PATH_LIMIT) and mention validate_keeper_socket_path references them
to validate keeper socket lengths derived from server socket paths.
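The constants in question come straight from the kernel headers: `sockaddr_un.sun_path` holds 104 bytes on macOS (unix(4)) and 108 on Linux and most other Unixes (unix(7)). A sketch of such a check (the function name mirrors `validate_keeper_socket_path` but is a simplified stand-in):

```rust
// sizeof(sockaddr_un.sun_path): 104 on macOS (unix(4)), 108 elsewhere (unix(7)).
#[cfg(target_os = "macos")]
const UNIX_SOCKET_PATH_LIMIT: usize = 104;
#[cfg(not(target_os = "macos"))]
const UNIX_SOCKET_PATH_LIMIT: usize = 108;

fn validate_socket_path(path: &std::path::Path) -> Result<(), String> {
    let len = path.as_os_str().len();
    // Keep one byte free for the trailing NUL in sun_path.
    if len >= UNIX_SOCKET_PATH_LIMIT {
        return Err(format!("socket path is {len} bytes, limit {UNIX_SOCKET_PATH_LIMIT}"));
    }
    Ok(())
}

fn main() {
    assert!(validate_socket_path(std::path::Path::new("/tmp/embers.sock")).is_ok());
    let long = "/tmp/".to_string() + &"x".repeat(200);
    assert!(validate_socket_path(std::path::Path::new(&long)).is_err());
}
```

Since the per-buffer keeper sockets are derived from the server socket path, validating the longer derived path up front fails early instead of producing a confusing `bind()` error later.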
In `@crates/embers-server/tests/persistence.rs`:
- Around line 238-263: Remove the redundant post-check assertions after calling
wait_for_snapshot_line: the helper wait_for_snapshot_line already ensures the
expected line exists so you should delete the assert! blocks that inspect
attached_capture.lines and detached_capture.lines (references:
wait_for_snapshot_line, attached_capture, detached_capture). Keep the calls that
assign attached_capture and detached_capture and simply rely on the helper's
verification, removing the two assert!(...) blocks and their failure messages.
In `@crates/embers-test-support/src/test_lock.rs`:
- Around line 69-88: The acquire_file_lock function currently busy-waits
indefinitely on non-Unix platforms; modify it to use a timeout: record
Instant::now at function start, loop as before but after each sleep check
elapsed and if it exceeds a configurable timeout (e.g. Duration::from_secs(10))
return an Err or panic with a clear diagnostic instructing the user to remove
the orphaned lock file, including the lock path in the message; reference the
acquire_file_lock function and use std::time::Instant/Duration and the existing
thread::sleep/OpenOptions logic so the only behavioral change is the timeout and
error message.
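The bounded version of that spin loop is small. A sketch under the prompt's assumptions (the path handling, 25 ms poll interval, and error wording are illustrative; the real `acquire_file_lock` may differ):

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;
use std::time::{Duration, Instant};

/// Spin on create_new until the lock file can be created, but give up
/// with a diagnostic after `timeout` instead of busy-waiting forever.
fn acquire_file_lock(path: &Path, timeout: Duration) -> Result<std::fs::File, String> {
    let start = Instant::now();
    loop {
        match OpenOptions::new().write(true).create_new(true).open(path) {
            Ok(file) => return Ok(file),
            Err(e) if e.kind() == ErrorKind::AlreadyExists => {
                if start.elapsed() > timeout {
                    return Err(format!(
                        "timed out acquiring lock {path:?}; remove it if orphaned"
                    ));
                }
                std::thread::sleep(Duration::from_millis(25));
            }
            Err(e) => return Err(e.to_string()),
        }
    }
}

fn main() {
    let path = std::env::temp_dir().join(format!("lock-sketch-{}", std::process::id()));
    let _guard = acquire_file_lock(&path, Duration::from_secs(1)).unwrap();
    // A second acquisition times out while the lock file still exists.
    assert!(acquire_file_lock(&path, Duration::from_millis(100)).is_err());
    std::fs::remove_file(&path).unwrap();
}
```

Including the lock path in the error message matters: an orphaned lock from a crashed test run is otherwise invisible, and the message tells the user exactly which file to delete.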
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 682d9b14-cf54-41d7-972b-614b0a1d8503
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (19)
- crates/embers-cli/Cargo.toml
- crates/embers-cli/src/lib.rs
- crates/embers-cli/tests/interactive.rs
- crates/embers-cli/tests/panes.rs
- crates/embers-core/src/metadata.rs
- crates/embers-core/src/snapshot.rs
- crates/embers-server/Cargo.toml
- crates/embers-server/src/buffer_runtime.rs
- crates/embers-server/src/config.rs
- crates/embers-server/src/lib.rs
- crates/embers-server/src/model.rs
- crates/embers-server/src/persist.rs
- crates/embers-server/src/server.rs
- crates/embers-server/src/state.rs
- crates/embers-server/tests/persistence.rs
- crates/embers-test-support/Cargo.toml
- crates/embers-test-support/src/lib.rs
- crates/embers-test-support/src/test_lock.rs
- crates/embers-test-support/tests/buffer_runtime.rs
```rust
let mut terminate = false;
while !terminate {
    let (mut stream, _) = listener.accept()?;
    terminate = handle_keeper_client(runtime.clone(), &mut stream)?;
}

let _ = reader_join.join();
let _ = wait_join.join();
Ok(())
```
Keeper accept loop never terminates after buffer exits.
The `while !terminate` loop at line 596 only exits when `handle_keeper_request` returns true, but all request handlers return false (including `Kill` at line 672). After the buffer process exits:
- The reader/wait threads complete
- The accept loop blocks indefinitely waiting for new connections
- The keeper process/thread never terminates
For external keeper processes, this creates orphaned processes. For in-process threads, resources remain held until the server process exits.
Consider returning true from Kill or adding a timeout/exit-check to the accept loop when exit_code becomes Some.
🔧 Proposed fix: terminate on Kill request
```diff
  KeeperRequest::Kill => {
      runtime.kill()?;
-     Ok((KeeperResponse::Ok, false))
+     Ok((KeeperResponse::Ok, true))
  }
```

Also applies to: 670-673
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/embers-server/src/buffer_runtime.rs` around lines 595 - 603, The
accept loop in the keeper thread never exits because
handle_keeper_client/handle_keeper_request always returns false (so while
!terminate never breaks); update the logic so that a Kill request causes
termination (make handle_keeper_request return true for the Kill branch and
propagate that through handle_keeper_client), or add an exit check inside the
accept loop (inspect the shared exit_code/Runtime state each iteration and break
if Some) and/or use a timeout on listener.accept to re-check exit conditions;
ensure reader_join and wait_join still join after the loop ends.
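The timeout-accept alternative described in the prompt above looks roughly like this. A Unix-only sketch with illustrative names (`accept_until_exit` and the 10 ms poll interval are assumptions, not the crate's real code):

```rust
use std::io::ErrorKind;
use std::os::unix::net::UnixListener;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

/// Accept clients on a nonblocking listener until `terminate` is set,
/// re-checking the flag between accepts instead of blocking forever.
fn accept_until_exit(listener: &UnixListener, terminate: &AtomicBool) -> std::io::Result<u32> {
    listener.set_nonblocking(true)?;
    let mut accepted = 0;
    while !terminate.load(Ordering::SeqCst) {
        match listener.accept() {
            // In the real keeper this is where handle_keeper_client runs.
            Ok((_stream, _addr)) => accepted += 1,
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                // No pending client: yield briefly, then re-check the flag.
                std::thread::sleep(Duration::from_millis(10));
            }
            Err(e) => return Err(e),
        }
    }
    Ok(accepted)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join(format!("keeper-sketch-{}.sock", std::process::id()));
    let listener = UnixListener::bind(&path)?;
    let terminate = AtomicBool::new(true); // already signalled: loop exits at once
    assert_eq!(accept_until_exit(&listener, &terminate)?, 0);
    std::fs::remove_file(&path)?;
    Ok(())
}
```

With this shape, `keeper_wait_loop` only has to store the exit into the shared flag; the accept loop observes it within one poll interval, then the reader and wait threads can be joined as before.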
Summary by CodeRabbit
New Features
Improvements
Tests
Stack
- api-docs: Add api docs for rhai config #2 (merged)
- persistance: Persist workspace state across clean restarts #3 (merged)
- buffer-cli: Add native detached buffer CLI workflows #9 (merged)
- session-rename: Add native session rename support #5 (merged)