Skip to content

HTTP Client Release 5.1.3: Fix silent stream crashes with proper OTP application architecture#67

Merged
dcrockwell merged 3 commits intomainfrom
develop
Mar 17, 2026
Merged

HTTP Client Release 5.1.3: Fix silent stream crashes with proper OTP application architecture#67
dcrockwell merged 3 commits intomainfrom
develop

Conversation

@dcrockwell
Copy link
Copy Markdown
Contributor

Summary

This release fixes a critical production bug where concurrent HTTP streams silently died when the process that created the shared ETS table exited. dream_http_client is now a proper OTP application — its ETS tables are owned by the application master process for the entire application lifetime.

Why

In production, three independent streaming requests hung for exactly 900,001ms (the monitor timeout) with zero crash reports in CloudWatch. The streams just vanished — no on_stream_error callback fired, no log entries.

The root cause was ETS table ownership semantics: the dream_http_client_ref_mapping table was owned by whichever short-lived process first created it. When that process exited, Erlang destroyed the table. Every concurrent stream process then crashed with badarg inside the selector's receive block, killing the stream silently.

What

dream_http_client is now a proper OTP application following the Ranch pattern (the same approach used by Cowboy's transport layer for its ranch_server ETS table):

  • dream_http_client_app.erl — OTP application behaviour that creates both ETS tables (dream_http_client_ref_mapping and dream_http_client_stream_recorders) in start/2
  • dream_http_client_sup.erl — minimal supervisor required by the application callback
  • gleam.toml[erlang] application_start_module = "dream_http_client_app"

The tables are owned by the application master process, which lives for the entire application lifetime. No process can destroy them by exiting.

How

The initial fix attempt used an unsupervised spawn + ets:give_away holder process with try/catch error:badarg on every ETS access. This was architecturally wrong — an unsupervised process that could die silently, with try/catch masking real errors (returning garbled compressed data, silently losing mappings).

The correct fix follows the BEAM's built-in application lifecycle: create the tables in the OTP application's start/2 callback, where they're owned by the application master. This eliminates the holder process, all try/catch blocks, race conditions on creation, and lazy table initialization. ETS operations are now direct calls — if the table doesn't exist, the process crashes loudly (correct behavior, means the application isn't started).

All 192 tests pass, including 7 new regression tests that verify the OTP ownership model and reproduce the exact concurrent-streams-from-expired-callers bug scenario.

No API changes.

dcrockwell and others added 3 commits March 17, 2026 15:41
## Why This Change Was Made
- In production, three independent streaming requests hung for exactly
  900,001ms (the monitor timeout) with zero crash reports in CloudWatch.
  The root cause: the `dream_http_client_ref_mapping` ETS table was owned
  by whichever short-lived process first called `ensure_ref_mapping_table()`.
  When that process exited, the ETS table was destroyed. Any concurrent
  stream process that tried to look up ref mappings crashed with `badarg`
  inside `decode_stream_message_for_selector/1` — which runs inside the
  selector's `receive` block, so the crash killed the stream process
  silently. No `on_stream_error` callback fired. The stream just vanished.
- A secondary race condition existed: two concurrent calls to
  `ensure_ref_mapping_table()` could both see `undefined` from `ets:info/1`,
  and the second `ets:new/2` would crash with `badarg`.

## What Was Changed
- `ensure_ref_mapping_table()` now spawns a dedicated `table_holder_loop/0`
  process and transfers ETS table ownership via `ets:give_away/3`. The
  holder process runs an infinite receive loop, so the table outlives any
  individual stream process.
- `ets:new` is wrapped in `try/catch error:badarg` to handle the race where
  another process creates the table between our `ets:info` check and our
  `ets:new` call.
- All 7 bare ETS access functions (`lookup_ref_by_string`,
  `lookup_string_by_ref`, `store_ref_mapping`, `remove_ref_mapping`,
  `maybe_store_stream_zlib`, `maybe_decompress_stream_chunk`,
  `cleanup_stream_zlib`) now have `try/catch error:badarg` guards with
  safe fallbacks.
- `get_or_create_string_id` recovers from table destruction by re-creating
  the table on `badarg`.
- 9 regression tests added: 4 deterministic ETS-level tests (holder
  ownership, table survives creator exit, mappings persist, concurrent
  creation) and 5 integration tests through `start_stream()` (streams from
  expired callers, concurrent fast+slow streams, sequential streams).
- Version bumped to 5.1.3, CHANGELOG and release notes updated.

## Note to Future Engineer
- The `table_holder_loop/0` process intentionally leaks when tests delete
  and recreate the table — it just sits in `receive` doing nothing forever.
  This is harmless in both production (table is created once) and tests
  (processes are cleaned up when the VM exits). Don't add a `stop` message
  handler unless you enjoy debugging "who killed my ETS table" for the
  second time.
- The try/catch on every ETS access is belt-and-suspenders: the holder
  process should make table destruction impossible, but the Erlang VM has
  a long history of teaching developers that "should" and "will" are
  different words.
## Why This Change Was Made
- The previous fix (holder process + try/catch) was architecturally wrong.
  The unsupervised bare `spawn` for table ownership could die silently,
  and the try/catch on every ETS access masked real errors — returning
  garbled compressed data or silently losing mappings. Two layers of
  band-aids compensating for each other is not how the BEAM works.
- The project's own BEAM rules explicitly state: "Never spawn unsupervised
  processes for anything that matters" and "Never try/catch your way out
  of a process crash."

## What Was Changed
- `dream_http_client` is now a proper OTP application. Added
  `dream_http_client_app.erl` (application behaviour) and
  `dream_http_client_sup.erl` (supervisor). Both ETS tables are created
  in `start/2`, owned by the application master process for the entire
  application lifetime.
- `gleam.toml` has `[erlang] application_start_module` set.
- Removed `ensure_ref_mapping_table/0`, `table_holder_loop/0`, and all
  `try/catch error:badarg` guards from the 7 ETS access functions in
  `dream_httpc_shim.erl`. ETS operations are now direct calls.
- Removed lazy table creation from `client.gleam`: `ensure_ets_tables`,
  `ensure_recorder_table`, `ensure_ref_mapping_table_wrapper`,
  `ets_table_exists`, and `ets_new` are all gone.
- Rewrote regression tests for the new architecture. Removed tests that
  delete/recreate the table (architecture makes that unnecessary). Kept
  all integration tests for concurrent streams.
- Updated CHANGELOG and release notes to reflect the architectural change.

## Note to Future Engineer
- This follows the Ranch pattern (ranch_app.erl) — create ETS tables in
  the OTP application's start/2 callback. If you're wondering "why not
  just spawn a holder process?" — we tried that. It was the previous
  commit. Read it and weep.
- The tables are `public` and `named_table`, accessible from any process.
  The application master owns them. If they don't exist, the application
  isn't started. That's a crash-loudly situation, not a try-catch-and-hope
  situation.
Fix silent stream crashes caused by ETS table ownership
@dcrockwell dcrockwell self-assigned this Mar 17, 2026
@dcrockwell dcrockwell added bug Something isn't working release Official public releases module Change to a dream module labels Mar 17, 2026
@dcrockwell dcrockwell merged commit 98a7103 into main Mar 17, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working module Change to a dream module release Official public releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant