fix: subscriber thread never self heals after half open tcp connection by Mubramaj · Pull Request #388 · discourse/message_bus

Mubramaj · 2026-04-10T18:10:06Z

Closes #387

Problem

When a Unicorn/Puma worker logs:

Global messages on <pid> timed out, message bus is no longer functioning correctly

the process never recovers. Real-time features silently break for all users on that worker until the app server is restarted manually.

This happens in practice when Redis and MessageBus run on separate hosts and a half-open TCP connection forms (common after VRRP failover or transient network events). The subscriber's global_redis.subscribe call blocks on a dead socket indefinitely without raising an exception. The thread stays alive (no exception, so thread.alive? == true) but receives no messages, so @last_message goes stale. ensure_subscriber_thread sees a live thread and does nothing.
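The "alive but deaf" state described above can be reproduced without Redis. This is a minimal sketch, assuming a pipe whose writer stays open but silent is a fair stand-in for a half-open socket. The names `subscriber`, `last_message`, `thread_alive` and `seconds_silent` are illustrative and not taken from message_bus.

```ruby
# Stand-in for a half-open socket: the read end of a pipe whose writer
# stays open but never sends anything (an assumption for this demo).
reader, writer = IO.pipe

last_message = Time.now
subscriber = Thread.new do
  # Blocks indefinitely: no data arrives, and no EOF or error is raised.
  while reader.gets
    last_message = Time.now
  end
end

sleep 0.2
thread_alive = subscriber.alive?          # the liveness check passes
seconds_silent = Time.now - last_message  # but nothing has been received

puts "alive=#{thread_alive} silent_for=#{seconds_silent.round(2)}s"

# Cleanup for the demo only: closing the writer delivers EOF, which
# unblocks gets and lets the thread finish on its own.
writer.close
subscriber.join
reader.close
```

A check based only on `alive?` cannot tell this thread apart from a healthy one, which is why the staleness of the last message timestamp is the only usable signal.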

MessageBus detects the stale @last_message via the keepalive mechanism and logs a warning but stops there.
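For illustration, the staleness check can be sketched as a pure function. This is not the library's code: `keepalive_stale?` is a hypothetical helper, and the threshold of three keepalive intervals is an assumption made for the example.

```ruby
# Hypothetical helper; the name and the factor of 3 are assumptions.
# Returns true when no message has been seen for more than three
# keepalive intervals.
def keepalive_stale?(last_message, keepalive_interval, now: Time.now)
  return false if last_message.nil? # nothing received yet, nothing to compare
  (now - last_message) > keepalive_interval * 3
end

now = Time.now
fresh = keepalive_stale?(now - 10, 60, now: now)   # 10s of silence, limit is 180s
stale = keepalive_stale?(now - 200, 60, now: now)  # 200s of silence, limit is 180s

puts "fresh=#{fresh} stale=#{stale}"
```

Passing `now` explicitly keeps the check testable without sleeping.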

The Fix

When the keepalive detects a stale @last_message, kill the stuck subscriber thread and start a fresh one with a new Redis connection:

# new behaviour
thread.kill
@subscriber_thread = nil
ensure_subscriber_thread  # opens a new connection

Related

A long time ago, @SamSaffron removed the original self-kill mechanism in commit 60eda3d because it could take down a healthy worker during a network issue. Here we only kill the subscriber thread and revive it with ensure_subscriber_thread. Sam, let me know if I am missing something and reintroducing an issue.

I ran this fix in production via a fork-pinned Gemfile for several days. The warning occurred twice; both times the subscriber thread restarted cleanly with no side effects.

It happened last Thursday at 7:04 am.

Then again twice this morning.

In both cases it revived cleanly.

The trouble is that Thread.kill is fundamentally unsafe and can lead to many other issues. My concern is that the cure being proposed is more dangerous than the illness.
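One way to see the concern, as a small sketch rather than a reproduction of any real message_bus failure: a killed thread stops at whatever point it had reached, so a multi-step update can be left half done. The `state` hash and the two steps are hypothetical.

```ruby
# Hypothetical two-step update that should be all-or-nothing.
state = { step_one: false, step_two: false }
started = Queue.new

worker = Thread.new do
  state[:step_one] = true
  started << true   # tell the main thread that step one is done
  sleep             # stands in for any blocking call (socket read, lock wait)
  state[:step_two] = true
end

started.pop   # wait until the worker is between the two steps
worker.kill   # the thread is terminated wherever it happens to be
worker.join

puts state.inspect
```

Here the damage is a harmless hash, but the same interruption can land in the middle of connection or lock handling, where the leftover state is much harder to reason about.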

The only safe options are to tear down the process and start again, or to rely on other timeouts that can clean things up safely.
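A rough sketch of the timeout-based alternative, assuming a plain IO object in place of the Redis connection: the blocking read is bounded, so the thread regains control and can decide to reconnect or exit by itself. `read_with_timeout` and `timeouts_seen` are hypothetical names, and this is not the change in the linked pull request.

```ruby
# Hypothetical helper. Returns :timeout when nothing arrives within
# `timeout` seconds, :eof when the peer closes, otherwise the line read.
def read_with_timeout(io, timeout)
  ready = IO.select([io], nil, nil, timeout)
  return :timeout if ready.nil?
  io.gets || :eof
end

reader, writer = IO.pipe
writer.puts "hello"   # one message is already waiting in the pipe

received = []
timeouts_seen = 0

subscriber = Thread.new do
  loop do
    case (result = read_with_timeout(reader, 0.1))
    when :timeout
      # Assumption: a real subscriber would drop its connection and
      # open a fresh one here instead of just counting.
      timeouts_seen += 1
      break if timeouts_seen >= 2
    when :eof
      break
    else
      received << result.chomp
    end
  end
end

subscriber.join   # the thread ends cooperatively; nothing kills it
writer.close
reader.close

puts "received=#{received.inspect} timeouts=#{timeouts_seen}"
```

Because the thread leaves the blocking call through a normal return value, any cleanup it needs runs at a known point in its own code.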

@jeremyevans thoughts?

I agree that we should not use Thread.kill. https://github.com/Mubramaj/message_bus/pull/2/changes looks like a better way to fix this.

Commit 89056e2: fix: subscriber thread never self heals after half open tcp connection (#1)

Mubramaj wants to merge 1 commit into discourse:main from Mubramaj:fix/subscriber-thread-self-heal

Mubramaj commented Apr 10, 2026 (edited)
SamSaffron commented Apr 12, 2026
Mubramaj commented Apr 13, 2026
jeremyevans commented Apr 13, 2026
