
Fix silent stream crashes caused by ETS table ownership#66

Merged
dcrockwell merged 2 commits into develop from fix/ets-table-ownership-crash
Mar 17, 2026

Conversation

@dcrockwell
Contributor

Why

In production, streaming HTTP requests were silently hanging for the full monitor timeout (900s) with zero crash reports in CloudWatch. Three independent streams failed in succession with no error callbacks firing, no log entries — they just vanished.

The root cause was ETS table ownership semantics: dream_http_client_ref_mapping was owned by whichever short-lived process first created it. When that process exited, Erlang destroyed the table. Every other concurrent stream process then crashed with badarg when trying to look up ref mappings — but because this happened inside the selector's receive block, the crash killed the stream process silently. No on_stream_error callback ever fired.

A secondary race condition also existed where two concurrent calls to create the table could cause a badarg crash.

What

Core fix — persistent table ownership:

  • ensure_ref_mapping_table() now spawns a dedicated holder process (table_holder_loop/0) and transfers table ownership via ets:give_away/3
  • The holder runs an infinite receive loop, so the table outlives any individual stream process
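The pattern described above looks roughly like this (a minimal sketch: the function names mirror the PR, but the bodies and table options are assumptions, not the actual diff):

```erlang
%% Sketch of the holder-process ownership pattern. The table is created
%% by whatever process calls this first, then handed off to a dedicated
%% long-lived process so it survives the creator's exit.
ensure_ref_mapping_table() ->
    case ets:info(dream_http_client_ref_mapping) of
        undefined ->
            Table = ets:new(dream_http_client_ref_mapping,
                            [named_table, public, set]),
            Holder = spawn(fun table_holder_loop/0),
            %% Transfer ownership; the holder receives an
            %% {'ETS-TRANSFER', Table, FromPid, GiftData} message.
            true = ets:give_away(Table, Holder, undefined),
            ok;
        _Info ->
            ok
    end.

table_holder_loop() ->
    receive
        {'ETS-TRANSFER', _Table, _FromPid, _GiftData} ->
            table_holder_loop();
        _Other ->
            %% Ignore everything else; the loop's only job is to stay alive.
            table_holder_loop()
    end.
```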

Defensive ETS access:

  • ets:new is wrapped in try/catch error:badarg to handle the concurrent creation race
  • All 7 ETS access functions now have try/catch error:badarg guards with safe fallbacks
  • get_or_create_string_id recovers from table destruction by re-creating the table
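The defensive guards take this general shape (illustrative only: `lookup_ref_by_string` matches a name from the PR, while `safe_ets_new` and the fallback values are assumed for the sketch):

```erlang
%% Defensive read: if the table was destroyed mid-lookup, ets:lookup/2
%% raises error:badarg. Fail soft instead of crashing the stream process.
lookup_ref_by_string(StringId) ->
    try ets:lookup(dream_http_client_ref_mapping, StringId) of
        [{StringId, Ref}] -> {ok, Ref};
        [] -> {error, not_found}
    catch
        error:badarg -> {error, not_found}
    end.

%% Concurrent-creation race: two processes can both see `undefined` from
%% ets:info/1, and the loser's ets:new/2 raises badarg because the named
%% table now exists. Treat that as "someone else created it".
safe_ets_new(Name, Options) ->
    try ets:new(Name, Options)
    catch
        error:badarg -> Name
    end.
```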

Regression tests — 9 new tests:

  • 4 deterministic ETS-level tests verifying holder ownership, table survival after creator exit, mapping persistence, and concurrent creation safety
  • 5 integration tests through start_stream() covering streams from expired callers, concurrent fast+slow streams (the exact bug scenario), sequential streams, and 5-way concurrent stress

Version bump: dream_http_client 5.1.2 → 5.1.3

How

The fix uses Erlang's ets:give_away/3 to transfer ownership of the named ETS table from the creating process to a long-lived holder process. This is a well-established Erlang pattern for persistent shared ETS tables. The holder process is a minimal receive loop that handles the ETS-TRANSFER message and then waits indefinitely — its only purpose is to keep the table alive.

The defensive try/catch wrappers are belt-and-suspenders: the holder process should make table destruction impossible during normal operation, but protecting every access point ensures no badarg can ever crash a stream process again.

Test plan

  • All 9 new regression tests pass (make test in modules/http_client)
  • All existing tests pass — no regressions
  • make build passes across all modules and examples with zero warnings
  • Pre-commit hooks pass (format check + full build)

## Why This Change Was Made
- In production, three independent streaming requests hung for exactly
  900,001ms (the monitor timeout) with zero crash reports in CloudWatch.
  The root cause: the `dream_http_client_ref_mapping` ETS table was owned
  by whichever short-lived process first called `ensure_ref_mapping_table()`.
  When that process exited, the ETS table was destroyed. Any concurrent
  stream process that tried to look up ref mappings crashed with `badarg`
  inside `decode_stream_message_for_selector/1` — which runs inside the
  selector's `receive` block, so the crash killed the stream process
  silently. No `on_stream_error` callback fired. The stream just vanished.
- A secondary race condition existed: two concurrent calls to
  `ensure_ref_mapping_table()` could both see `undefined` from `ets:info/1`,
  and the second `ets:new/2` would crash with `badarg`.

## What Was Changed
- `ensure_ref_mapping_table()` now spawns a dedicated `table_holder_loop/0`
  process and transfers ETS table ownership via `ets:give_away/3`. The
  holder process runs an infinite receive loop, so the table outlives any
  individual stream process.
- `ets:new` is wrapped in `try/catch error:badarg` to handle the race where
  another process creates the table between our `ets:info` check and our
  `ets:new` call.
- All 7 bare ETS access functions (`lookup_ref_by_string`,
  `lookup_string_by_ref`, `store_ref_mapping`, `remove_ref_mapping`,
  `maybe_store_stream_zlib`, `maybe_decompress_stream_chunk`,
  `cleanup_stream_zlib`) now have `try/catch error:badarg` guards with
  safe fallbacks.
- `get_or_create_string_id` recovers from table destruction by re-creating
  the table on `badarg`.
- 9 regression tests added: 4 deterministic ETS-level tests (holder
  ownership, table survives creator exit, mappings persist, concurrent
  creation) and 5 integration tests through `start_stream()` (streams from
  expired callers, concurrent fast+slow streams, sequential streams).
- Version bumped to 5.1.3, CHANGELOG and release notes updated.

## Note to Future Engineer
- The `table_holder_loop/0` process intentionally leaks when tests delete
  and recreate the table — it just sits in `receive` doing nothing forever.
  This is harmless in both production (table is created once) and tests
  (processes are cleaned up when the VM exits). Don't add a `stop` message
  handler unless you enjoy debugging "who killed my ETS table" for the
  second time.
- The try/catch on every ETS access is belt-and-suspenders: the holder
  process should make table destruction impossible, but the Erlang VM has
  a long history of teaching developers that "should" and "will" are
  different words.
@dcrockwell dcrockwell self-assigned this Mar 17, 2026
@dcrockwell dcrockwell added bug Something isn't working module Change to a dream module labels Mar 17, 2026
@dcrockwell dcrockwell marked this pull request as draft March 17, 2026 21:49
## Why This Change Was Made
- The previous fix (holder process + try/catch) was architecturally wrong.
  The unsupervised bare `spawn` for table ownership could die silently,
  and the try/catch on every ETS access masked real errors — returning
  garbled compressed data or silently losing mappings. Two layers of
  band-aids compensating for each other is not how the BEAM works.
- The project's own BEAM rules explicitly state: "Never spawn unsupervised
  processes for anything that matters" and "Never try/catch your way out
  of a process crash."

## What Was Changed
- `dream_http_client` is now a proper OTP application. Added
  `dream_http_client_app.erl` (application behaviour) and
  `dream_http_client_sup.erl` (supervisor). Both ETS tables are created
  in `start/2`, owned by the application master process for the entire
  application lifetime.
- `gleam.toml` has `[erlang] application_start_module` set.
- Removed `ensure_ref_mapping_table/0`, `table_holder_loop/0`, and all
  `try/catch error:badarg` guards from the 7 ETS access functions in
  `dream_httpc_shim.erl`. ETS operations are now direct calls.
- Removed lazy table creation from `client.gleam`: `ensure_ets_tables`,
  `ensure_recorder_table`, `ensure_ref_mapping_table_wrapper`,
  `ets_table_exists`, and `ets_new` are all gone.
- Rewrote regression tests for the new architecture. Removed tests that
  delete/recreate the table (architecture makes that unnecessary). Kept
  all integration tests for concurrent streams.
- Updated CHANGELOG and release notes to reflect the architectural change.

## Note to Future Engineer
- This follows the Ranch pattern (ranch_app.erl) — create ETS tables in
  the OTP application's start/2 callback. If you're wondering "why not
  just spawn a holder process?" — we tried that. It was the previous
  commit. Read it and weep.
- The tables are `public` and `named_table`, accessible from any process.
  The application master owns them. If they don't exist, the application
  isn't started. That's a crash-loudly situation, not a try-catch-and-hope
  situation.
@dcrockwell dcrockwell marked this pull request as ready for review March 17, 2026 22:21
@dcrockwell dcrockwell merged commit f9ef632 into develop Mar 17, 2026
2 checks passed
