fix: subscriber thread never self heals after half open tcp connection #388
Mubramaj wants to merge 1 commit into discourse:main
Conversation
The trouble is that The only safe thing to do is tear down the process and start again or rely on other timeouts that are safe to clean stuff up. @jeremyevans thoughts?
Thank you for the feedback and the insight. I totally understand your concern and the issue. Based on your feedback I tried a different route: I opened a PR on my fork and am going to test it in production for a couple of days, until the problem appears again and I can confirm there are no side effects. Then I will open a PR here again. In case you have feedback on the new route right away, here is the link: https://github.com/Mubramaj/message_bus/pull/2/changes Basically, the idea is, instead of killing the thread, to simulate an error by disconnecting the back-end and for the existing global_subscribe This way the subscriber thread never dies. It reconnects itself. No Thread.kill.
I agreed that we should not use
Closes #387
Problem
When a Unicorn/Puma worker logs:
the process never recovers. Real-time features silently break for all users on that worker until the app server is restarted manually.
This happens in practice when Redis and MessageBus run on separate hosts and a half-open TCP connection forms (common after VRRP failover or transient network events). The subscriber's global_redis.subscribe call blocks on a dead socket indefinitely without raising an exception. The thread stays alive (no exception, so thread.alive? == true) but receives no messages, so @last_message goes stale. ensure_subscriber_thread sees a live thread and does nothing.
MessageBus detects the stale @last_message via the keepalive mechanism and logs a warning but stops there.
The Fix
When the keepalive detects a stale @last_message, kill the stuck subscriber thread and start a fresh one with a new Redis connection.
I found out that @SamSaffron made a commit a long time ago (60eda3d) that removed the original self-kill mechanism because it could take down a healthy worker during a network issue. Here we just kill the thread and revive it using ensure_subscriber_thread. Sam, let me know if I am totally missing something and re-introducing an issue.
I ran this fix in production via a fork-pinned Gemfile for several days. The warning occurred twice; both times the subscriber thread restarted cleanly with no side effects.
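To make the failure mode and the kill-and-restart idea concrete, here is a minimal, self-contained sketch. It is not the MessageBus code and not the diff from this PR: ToySubscriber, restart_if_stale and the keepalive_interval parameter are hypothetical names, and a Queue whose pop never returns stands in for the half-open Redis socket. Only ensure_subscriber_thread and @last_message borrow their names from the description above.

```ruby
# Toy model only: a Queue stands in for the Redis connection.
# Queue#pop on an empty queue blocks forever, like subscribe on a dead socket.
class ToySubscriber
  attr_reader :thread, :restarts

  def initialize(source, keepalive_interval:)
    @source = source
    @keepalive_interval = keepalive_interval # seconds of silence tolerated
    @last_message = Time.now
    @restarts = 0
    @mutex = Mutex.new
  end

  # Starts a subscriber thread only when none is alive.
  # A thread stuck on a dead socket is still "alive", so this alone never heals it.
  def ensure_subscriber_thread
    @mutex.synchronize do
      return if @thread&.alive?
      @last_message = Time.now
      source = @source
      @thread = Thread.new { subscribe_loop(source) }
    end
  end

  # True when nothing (not even a keepalive) arrived within the interval.
  def stale?
    Time.now - @last_message > @keepalive_interval
  end

  # Hypothetical recovery step: kill the stuck thread, swap in a fresh
  # connection, and let ensure_subscriber_thread start a new thread.
  def restart_if_stale(new_source)
    return false unless stale?
    @thread&.kill
    @thread&.join
    @source = new_source
    @restarts += 1
    ensure_subscriber_thread
    true
  end

  private

  def subscribe_loop(source)
    loop do
      source.pop                # blocks indefinitely when no data ever arrives
      @last_message = Time.now  # refreshed only when a message is received
    end
  end
end

dead_socket = Queue.new # never receives anything
subscriber = ToySubscriber.new(dead_socket, keepalive_interval: 0.1)
subscriber.ensure_subscriber_thread
sleep 0.3

puts subscriber.thread.alive? # true: blocked, but no exception was raised
puts subscriber.stale?        # true: no message within the interval

subscriber.ensure_subscriber_thread          # no-op, the thread looks healthy
subscriber.restart_if_stale(Queue.new)       # replaces thread and connection
puts subscriber.restarts                     # 1
```

The sketch follows the approach in this description, which uses Thread#kill. The conversation above raises concerns about killing the thread, and the follow-up route on the fork avoids Thread.kill by disconnecting the back-end so the subscriber reconnects itself.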
Related