
Fix silent stream crashes caused by ETS table ownership#66

Merged
dcrockwell merged 2 commits into develop from fix/ets-table-ownership-crash
Mar 17, 2026

Conversation

@dcrockwell
Contributor

Why

In production, streaming HTTP requests were silently hanging for the full monitor timeout (900s) with zero crash reports in CloudWatch. Three independent streams failed in succession with no error callbacks firing, no log entries — they just vanished.

The root cause was ETS table ownership semantics: dream_http_client_ref_mapping was owned by whichever short-lived process first created it. When that process exited, Erlang destroyed the table. Every other concurrent stream process then crashed with badarg when trying to look up ref mappings — but because this happened inside the selector's receive block, the crash killed the stream process silently. No on_stream_error callback ever fired.

A secondary race condition also existed where two concurrent calls to create the table could cause a badarg crash.

What

Core fix — persistent table ownership:

  • ensure_ref_mapping_table() now spawns a dedicated holder process (table_holder_loop/0) and transfers table ownership via ets:give_away/3
  • The holder runs an infinite receive loop, so the table outlives any individual stream process
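The pattern described above looks roughly like this (a minimal sketch: the function names mirror the PR, but the bodies and table options are assumptions, not the actual diff):

```erlang
%% Sketch of the holder-process ownership pattern. The table is created
%% by whatever process calls this first, then handed off to a dedicated
%% long-lived process so it survives the creator's exit.
ensure_ref_mapping_table() ->
    case ets:info(dream_http_client_ref_mapping) of
        undefined ->
            Table = ets:new(dream_http_client_ref_mapping,
                            [named_table, public, set]),
            Holder = spawn(fun table_holder_loop/0),
            %% Transfer ownership; the holder receives an
            %% {'ETS-TRANSFER', Table, FromPid, GiftData} message.
            true = ets:give_away(Table, Holder, undefined),
            ok;
        _Info ->
            ok
    end.

table_holder_loop() ->
    receive
        {'ETS-TRANSFER', _Table, _FromPid, _GiftData} ->
            table_holder_loop();
        _Other ->
            %% Ignore everything else; the loop's only job is to stay alive.
            table_holder_loop()
    end.
```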

Defensive ETS access:

  • ets:new is wrapped in try/catch error:badarg to handle the concurrent creation race
  • All 7 ETS access functions now have try/catch error:badarg guards with safe fallbacks
  • get_or_create_string_id recovers from table destruction by re-creating the table
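The defensive guards take this general shape (illustrative only: `lookup_ref_by_string` matches a name from the PR, while `safe_ets_new` and the fallback values are assumed for the sketch):

```erlang
%% Defensive read: if the table was destroyed mid-lookup, ets:lookup/2
%% raises error:badarg. Fail soft instead of crashing the stream process.
lookup_ref_by_string(StringId) ->
    try ets:lookup(dream_http_client_ref_mapping, StringId) of
        [{StringId, Ref}] -> {ok, Ref};
        [] -> {error, not_found}
    catch
        error:badarg -> {error, not_found}
    end.

%% Concurrent-creation race: two processes can both see `undefined` from
%% ets:info/1, and the loser's ets:new/2 raises badarg because the named
%% table now exists. Treat that as "someone else created it".
safe_ets_new(Name, Options) ->
    try ets:new(Name, Options)
    catch
        error:badarg -> Name
    end.
```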

Regression tests — 9 new tests:

  • 4 deterministic ETS-level tests verifying holder ownership, table survival after creator exit, mapping persistence, and concurrent creation safety
  • 5 integration tests through start_stream() covering streams from expired callers, concurrent fast+slow streams (the exact bug scenario), sequential streams, and 5-way concurrent stress

Version bump: dream_http_client 5.1.2 → 5.1.3

How

The fix uses Erlang's ets:give_away/3 to transfer ownership of the named ETS table from the creating process to a long-lived holder process. This is a well-established Erlang pattern for persistent shared ETS tables. The holder process is a minimal receive loop that handles the ETS-TRANSFER message and then waits indefinitely — its only purpose is to keep the table alive.

The defensive try/catch wrappers are belt-and-suspenders: the holder process should make table destruction impossible during normal operation, but protecting every access point ensures no badarg can ever crash a stream process again.

Test plan

  • All 9 new regression tests pass (make test in modules/http_client)
  • All existing tests pass — no regressions
  • make build passes across all modules and examples with zero warnings
  • Pre-commit hooks pass (format check + full build)

## Why This Change Was Made
- In production, three independent streaming requests hung for exactly
  900,001ms (the monitor timeout) with zero crash reports in CloudWatch.
  The root cause: the `dream_http_client_ref_mapping` ETS table was owned
  by whichever short-lived process first called `ensure_ref_mapping_table()`.
  When that process exited, the ETS table was destroyed. Any concurrent
  stream process that tried to look up ref mappings crashed with `badarg`
  inside `decode_stream_message_for_selector/1` — which runs inside the
  selector's `receive` block, so the crash killed the stream process
  silently. No `on_stream_error` callback fired. The stream just vanished.
- A secondary race condition existed: two concurrent calls to
  `ensure_ref_mapping_table()` could both see `undefined` from `ets:info/1`,
  and the second `ets:new/2` would crash with `badarg`.

## What Was Changed
- `ensure_ref_mapping_table()` now spawns a dedicated `table_holder_loop/0`
  process and transfers ETS table ownership via `ets:give_away/3`. The
  holder process runs an infinite receive loop, so the table outlives any
  individual stream process.
- `ets:new` is wrapped in `try/catch error:badarg` to handle the race where
  another process creates the table between our `ets:info` check and our
  `ets:new` call.
- All 7 bare ETS access functions (`lookup_ref_by_string`,
  `lookup_string_by_ref`, `store_ref_mapping`, `remove_ref_mapping`,
  `maybe_store_stream_zlib`, `maybe_decompress_stream_chunk`,
  `cleanup_stream_zlib`) now have `try/catch error:badarg` guards with
  safe fallbacks.
- `get_or_create_string_id` recovers from table destruction by re-creating
  the table on `badarg`.
- 9 regression tests added: 4 deterministic ETS-level tests (holder
  ownership, table survives creator exit, mappings persist, concurrent
  creation) and 5 integration tests through `start_stream()` (streams from
  expired callers, concurrent fast+slow streams, sequential streams).
- Version bumped to 5.1.3, CHANGELOG and release notes updated.

## Note to Future Engineer
- The `table_holder_loop/0` process intentionally leaks when tests delete
  and recreate the table — it just sits in `receive` doing nothing forever.
  This is harmless in both production (table is created once) and tests
  (processes are cleaned up when the VM exits). Don't add a `stop` message
  handler unless you enjoy debugging "who killed my ETS table" for the
  second time.
- The try/catch on every ETS access is belt-and-suspenders: the holder
  process should make table destruction impossible, but the Erlang VM has
  a long history of teaching developers that "should" and "will" are
  different words.
@dcrockwell dcrockwell self-assigned this Mar 17, 2026
@dcrockwell dcrockwell added bug Something isn't working module Change to a dream module labels Mar 17, 2026
@dcrockwell dcrockwell marked this pull request as draft March 17, 2026 21:49
## Why This Change Was Made
- The previous fix (holder process + try/catch) was architecturally wrong.
  The unsupervised bare `spawn` for table ownership could die silently,
  and the try/catch on every ETS access masked real errors — returning
  garbled compressed data or silently losing mappings. Two layers of
  band-aids compensating for each other is not how the BEAM works.
- The project's own BEAM rules explicitly state: "Never spawn unsupervised
  processes for anything that matters" and "Never try/catch your way out
  of a process crash."

## What Was Changed
- `dream_http_client` is now a proper OTP application. Added
  `dream_http_client_app.erl` (application behaviour) and
  `dream_http_client_sup.erl` (supervisor). Both ETS tables are created
  in `start/2`, owned by the application master process for the entire
  application lifetime.
- `gleam.toml` has `[erlang] application_start_module` set.
- Removed `ensure_ref_mapping_table/0`, `table_holder_loop/0`, and all
  `try/catch error:badarg` guards from the 7 ETS access functions in
  `dream_httpc_shim.erl`. ETS operations are now direct calls.
- Removed lazy table creation from `client.gleam`: `ensure_ets_tables`,
  `ensure_recorder_table`, `ensure_ref_mapping_table_wrapper`,
  `ets_table_exists`, and `ets_new` are all gone.
- Rewrote regression tests for the new architecture. Removed tests that
  delete/recreate the table (architecture makes that unnecessary). Kept
  all integration tests for concurrent streams.
- Updated CHANGELOG and release notes to reflect the architectural change.

## Note to Future Engineer
- This follows the Ranch pattern (ranch_app.erl) — create ETS tables in
  the OTP application's start/2 callback. If you're wondering "why not
  just spawn a holder process?" — we tried that. It was the previous
  commit. Read it and weep.
- The tables are `public` and `named_table`, accessible from any process.
  The application master owns them. If they don't exist, the application
  isn't started. That's a crash-loudly situation, not a try-catch-and-hope
  situation.
@dcrockwell dcrockwell marked this pull request as ready for review March 17, 2026 22:21
@dcrockwell dcrockwell merged commit f9ef632 into develop Mar 17, 2026
2 checks passed
