Fix silent stream crashes caused by ETS table ownership#66
Merged
dcrockwell merged 2 commits intodevelopfrom Mar 17, 2026
Merged
Fix silent stream crashes caused by ETS table ownership#66dcrockwell merged 2 commits intodevelopfrom
dcrockwell merged 2 commits intodevelopfrom
Conversation
## Why This Change Was Made - In production, three independent streaming requests hung for exactly 900,001ms (the monitor timeout) with zero crash reports in CloudWatch. The root cause: the `dream_http_client_ref_mapping` ETS table was owned by whichever short-lived process first called `ensure_ref_mapping_table()`. When that process exited, the ETS table was destroyed. Any concurrent stream process that tried to look up ref mappings crashed with `badarg` inside `decode_stream_message_for_selector/1` — which runs inside the selector's `receive` block, so the crash killed the stream process silently. No `on_stream_error` callback fired. The stream just vanished. - A secondary race condition existed: two concurrent calls to `ensure_ref_mapping_table()` could both see `undefined` from `ets:info/1`, and the second `ets:new/2` would crash with `badarg`. ## What Was Changed - `ensure_ref_mapping_table()` now spawns a dedicated `table_holder_loop/0` process and transfers ETS table ownership via `ets:give_away/3`. The holder process runs an infinite receive loop, so the table outlives any individual stream process. - `ets:new` is wrapped in `try/catch error:badarg` to handle the race where another process creates the table between our `ets:info` check and our `ets:new` call. - All 7 bare ETS access functions (`lookup_ref_by_string`, `lookup_string_by_ref`, `store_ref_mapping`, `remove_ref_mapping`, `maybe_store_stream_zlib`, `maybe_decompress_stream_chunk`, `cleanup_stream_zlib`) now have `try/catch error:badarg` guards with safe fallbacks. - `get_or_create_string_id` recovers from table destruction by re-creating the table on `badarg`. - 9 regression tests added: 4 deterministic ETS-level tests (holder ownership, table survives creator exit, mappings persist, concurrent creation) and 5 integration tests through `start_stream()` (streams from expired callers, concurrent fast+slow streams, sequential streams). - Version bumped to 5.1.3, CHANGELOG and release notes updated. ## Note to Future Engineer - The `table_holder_loop/0` process intentionally leaks when tests delete and recreate the table — it just sits in `receive` doing nothing forever. This is harmless in both production (table is created once) and tests (processes are cleaned up when the VM exits). Don't add a `stop` message handler unless you enjoy debugging "who killed my ETS table" for the second time. - The try/catch on every ETS access is belt-and-suspenders: the holder process should make table destruction impossible, but the Erlang VM has a long history of teaching developers that "should" and "will" are different words.
## Why This Change Was Made - The previous fix (holder process + try/catch) was architecturally wrong. The unsupervised bare `spawn` for table ownership could die silently, and the try/catch on every ETS access masked real errors — returning garbled compressed data or silently losing mappings. Two layers of band-aids compensating for each other is not how the BEAM works. - The project's own BEAM rules explicitly state: "Never spawn unsupervised processes for anything that matters" and "Never try/catch your way out of a process crash." ## What Was Changed - `dream_http_client` is now a proper OTP application. Added `dream_http_client_app.erl` (application behaviour) and `dream_http_client_sup.erl` (supervisor). Both ETS tables are created in `start/2`, owned by the application master process for the entire application lifetime. - `gleam.toml` has `[erlang] application_start_module` set. - Removed `ensure_ref_mapping_table/0`, `table_holder_loop/0`, and all `try/catch error:badarg` guards from the 7 ETS access functions in `dream_httpc_shim.erl`. ETS operations are now direct calls. - Removed lazy table creation from `client.gleam`: `ensure_ets_tables`, `ensure_recorder_table`, `ensure_ref_mapping_table_wrapper`, `ets_table_exists`, and `ets_new` are all gone. - Rewrote regression tests for the new architecture. Removed tests that delete/recreate the table (architecture makes that unnecessary). Kept all integration tests for concurrent streams. - Updated CHANGELOG and release notes to reflect the architectural change. ## Note to Future Engineer - This follows the Ranch pattern (ranch_app.erl) — create ETS tables in the OTP application's start/2 callback. If you're wondering "why not just spawn a holder process?" — we tried that. It was the previous commit. Read it and weep. - The tables are `public` and `named_table`, accessible from any process. The application master owns them. If they don't exist, the application isn't started. That's a crash-loudly situation, not a try-catch-and-hope situation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
In production, streaming HTTP requests were silently hanging for the full monitor timeout (900s) with zero crash reports in CloudWatch. Three independent streams failed in succession with no error callbacks firing, no log entries — they just vanished.
The root cause was ETS table ownership semantics:
dream_http_client_ref_mappingwas owned by whichever short-lived process first created it. When that process exited, Erlang destroyed the table. Every other concurrent stream process then crashed withbadargwhen trying to look up ref mappings — but because this happened inside the selector'sreceiveblock, the crash killed the stream process silently. Noon_stream_errorcallback ever fired.A secondary race condition also existed where two concurrent calls to create the table could cause a
badargcrash.What
Core fix — persistent table ownership:
ensure_ref_mapping_table()now spawns a dedicated holder process (table_holder_loop/0) and transfers table ownership viaets:give_away/3Defensive ETS access:
ets:newis wrapped intry/catch error:badargto handle the concurrent creation racetry/catch error:badargguards with safe fallbacksget_or_create_string_idrecovers from table destruction by re-creating the tableRegression tests — 9 new tests:
start_stream()covering streams from expired callers, concurrent fast+slow streams (the exact bug scenario), sequential streams, and 5-way concurrent stressVersion bump:
dream_http_client5.1.2 → 5.1.3How
The fix uses Erlang's
ets:give_away/3to transfer ownership of the named ETS table from the creating process to a long-lived holder process. This is a well-established Erlang pattern for persistent shared ETS tables. The holder process is a minimalreceiveloop that handles theETS-TRANSFERmessage and then waits indefinitely — its only purpose is to keep the table alive.The defensive
try/catchwrappers are belt-and-suspenders: the holder process should make table destruction impossible during normal operation, but protecting every access point ensures nobadargcan ever crash a stream process again.Test plan
make testinmodules/http_client)make buildpasses across all modules and examples with zero warnings