fix: subscriber thread never self heals after half open tcp connection #388

Open
Mubramaj wants to merge 1 commit into discourse:main from Mubramaj:fix/subscriber-thread-self-heal

Conversation


@Mubramaj commented Apr 10, 2026

Closes #387

Problem

When a Unicorn/Puma worker logs:

Global messages on XX timed out, message bus is no longer functioning correctly

the process never recovers. Real-time features silently break for all users on that worker until the app server is restarted manually.

This happens in practice when Redis and MessageBus run on separate hosts and a half-open TCP connection forms (common after VRRP failover or transient network events). The subscriber's global_redis.subscribe call blocks on a dead socket indefinitely without raising an exception. The thread stays alive (no exception, so thread.alive? == true) but receives no messages, so @last_message goes stale. ensure_subscriber_thread sees a live thread and does nothing.

MessageBus detects the stale @last_message via its keepalive mechanism and logs a warning, but takes no further action.
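The failure mode described above can be reproduced without Redis. The following is only a minimal sketch with hypothetical names (`dead_socket`, `stale?`, `monotonic_now`); none of it is message_bus code. A `Queue` that never receives anything stands in for the half-open socket, and the thread blocked on it stays alive while its last-message timestamp goes stale.

```ruby
# Hypothetical sketch; none of these names come from message_bus.

# Monotonic clock, so wall-clock adjustments cannot fake or hide staleness.
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# True when no message has been seen for longer than `threshold` seconds.
def stale?(last_message_at, threshold, now = monotonic_now)
  (now - last_message_at) > threshold
end

dead_socket = Queue.new          # stands in for a half-open socket: never delivers, never errors
last_message_at = monotonic_now  # a real subscriber would refresh this on every message

subscriber = Thread.new do
  loop do
    dead_socket.pop              # blocks indefinitely without raising
    last_message_at = monotonic_now
  end
end

sleep 0.2                        # no traffic arrives in the meantime

puts "alive=#{subscriber.alive?} stale=#{stale?(last_message_at, 0.1)}"
# prints "alive=true stale=true"
```

A check based only on `alive?` therefore cannot tell this thread apart from a healthy one; only the timestamp reveals the problem.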

The Fix

When the keepalive detects a stale @last_message, kill the stuck subscriber thread and start a fresh one with a new Redis connection:

# new behaviour
thread.kill
@subscriber_thread = nil
ensure_subscriber_thread  # opens a new connection

I found that @SamSaffron removed the original self-kill mechanism a long time ago in commit 60eda3d.

60eda3d removed it because it could take down a healthy worker during a network issue. Here we only kill the subscriber thread and revive it using ensure_subscriber_thread. Sam, let me know if I am missing something and re-introducing an issue.

I ran this fix in production via a fork-pinned Gemfile for several days. The warning occurred twice; both times the subscriber thread restarted cleanly with no side effects.

  1. Happened last Thursday at 7:04 am
  2. Then again 2 times this morning

In both cases it revived cleanly.

Related

@SamSaffron
Member

The trouble is that Thread.kill is fundamentally unsafe and can lead to many other issues. My concern is that the cure being proposed is more dangerous than the illness.

The only safe thing to do is tear down the process and start again or rely on other timeouts that are safe to clean stuff up.
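One way to picture the concern, as a minimal sketch with hypothetical names rather than anything taken from message_bus: a thread killed between two related updates leaves shared state half-written, and nothing in the killed thread gets a chance to finish the job.

```ruby
# Hypothetical sketch of why killing a thread mid-operation is risky.
ledger = { debits: 0, credits: 0 }  # invariant: both values should always match
reached_midpoint = Queue.new

worker = Thread.new do
  ledger[:debits] += 100
  reached_midpoint << true   # signal that the first half of the update is done
  sleep                      # stands in for any blocking call between the two steps
  ledger[:credits] += 100    # never runs if the thread is killed while blocked
end

reached_midpoint.pop         # wait until the worker is between the two steps
worker.kill
worker.join

puts "debits=#{ledger[:debits]} credits=#{ledger[:credits]}"
# prints "debits=100 credits=0"
```

The same applies to any multi-step operation the thread happens to be in when it is killed, which is why the kill point cannot be chosen safely from the outside.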

@jeremyevans thoughts?

@Mubramaj
Author

Thank you for the feedback and the insight. I understand the concern. Based on your feedback I tried a different route.

I opened a PR on my fork and will test it in production for a couple of days, until the problem appears again and I can confirm there are no side effects. Then I will open a new PR here. In case you have feedback on the new approach right away, here is the link:

https://github.com/Mubramaj/message_bus/pull/2/changes

The idea is that, instead of killing the thread, we simulate an error by disconnecting the backend so that the existing rescue => error ... retry mechanism in global_subscribe kicks in.

This way the subscriber thread never dies. It reconnects itself, with no Thread.kill.
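To make the approach concrete, here is a rough sketch of the disconnect-and-retry pattern. It is not the code from the linked PR and does not use the message_bus or redis-rb APIs: `FakeConnection`, `Subscriber`, and `force_reconnect` are hypothetical stand-ins, and a closed `Queue` plays the role of a connection that was disconnected from another thread.

```ruby
# Hypothetical stand-in for a backend connection; not a real Redis client.
class FakeConnection
  class Disconnected < StandardError; end

  def initialize
    @inbox = Queue.new
  end

  def push(message)
    @inbox << message
  end

  # Blocks until a message arrives; raises if the connection is closed meanwhile.
  def next_message
    message = @inbox.pop
    raise Disconnected, "connection closed" if message.nil? && @inbox.closed?
    message
  end

  # Closing the queue wakes any thread blocked in next_message.
  def disconnect
    @inbox.close
  end
end

# Hypothetical subscriber whose thread survives reconnects.
class Subscriber
  attr_reader :received, :reconnects, :thread

  def initialize(&connect)
    @connect = connect
    @received = Queue.new
    @reconnects = 0
    @mutex = Mutex.new
    @connection = nil
  end

  def start
    @thread = Thread.new { subscribe_loop }
  end

  # Called by a watchdog when messages look stale: no Thread.kill involved.
  def force_reconnect
    @mutex.synchronize { @connection&.disconnect }
  end

  private

  def subscribe_loop
    begin
      connection = @connect.call
      @mutex.synchronize { @connection = connection }
      loop { @received << connection.next_message }
    rescue FakeConnection::Disconnected
      @reconnects += 1
      retry                  # same thread, fresh connection
    end
  end
end

connections = Queue.new
subscriber = Subscriber.new { FakeConnection.new.tap { |c| connections << c } }
subscriber.start

first = connections.pop
first.push("a")
subscriber.received.pop      # "a" arrived over the first connection

subscriber.force_reconnect   # what the keepalive check would trigger
second = connections.pop     # the loop has opened a replacement connection
second.push("b")
subscriber.received.pop      # "b" arrived over the second connection

puts "reconnects=#{subscriber.reconnects} alive=#{subscriber.thread.alive?}"
# prints "reconnects=1 alive=true"
```

The error is raised inside the subscriber thread itself, at a point where it is already prepared to handle failures, so cleanup and reconnection follow the normal error path.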

@jeremyevans
Collaborator

> The trouble is that Thread.kill is fundamentally unsafe and can lead to many other issues. My concern is that the cure being proposed is more dangerous than the illness.
>
> The only safe thing to do is tear down the process and start again or rely on other timeouts that are safe to clean stuff up.
>
> @jeremyevans thoughts?

I agree that we should not use Thread.kill. https://github.com/Mubramaj/message_bus/pull/2/changes looks like a better way to fix this.

Development

Successfully merging this pull request may close these issues.

Global messages on XX timed out, subscriber thread never self-heals after half-open TCP connection

3 participants