#186 fixed several race conditions that happen during the "disconnecting and blocking" unit test, but a new one was reported in Sjors/bitcoin#90 (comment), with details in Sjors/bitcoin#90 (comment), and I want to write more notes about it. There is also a branch with previously attempted fixes.
The problem is not too hard to reproduce locally: if the `mptest` binary is run in a loop in bash, the tsan error happens within a few minutes. With the attempted fixes so far, running the `mptest` binary in a loop instead results in a deadlock after about 20 minutes for me.
## Background
If a `Connection` object is destroyed while a `ProxyClient<Thread>` object is still using it (due to a sudden disconnect), the `ProxyClient<Thread>::m_disconnect_cb` callback will be called from the `~Connection` destructor, deleting the `ProxyClient<Thread>` object from whichever `ConnThreads` map it is in.
If a `ProxyClient<Thread>` object is destroyed before its `Connection` object (the more normal case), the `ProxyClient<Thread>` destructor will unregister `m_disconnect_cb`, so when the `Connection` is later destroyed it will not try to delete the `ProxyClient<Thread>` object.
If the two destructors are called in sequence in either order, everything is fine. But if the destructors are called from different threads in a race, numerous problems occur.
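To make the two destruction orders concrete, here is a minimal sketch. The class and method names are hypothetical stand-ins for the real ones, kept only close enough to show the registration/unregistration handshake between the connection and the per-thread client:

```cpp
#include <cassert>
#include <functional>
#include <list>

// Simplified stand-in for Connection: owns a list of disconnect cleanups.
struct Connection {
    std::list<std::function<void()>> m_cleanups;
    using CleanupIt = std::list<std::function<void()>>::iterator;

    CleanupIt addSyncCleanup(std::function<void()> fn) {
        m_cleanups.push_front(std::move(fn));
        return m_cleanups.begin();
    }
    void removeSyncCleanup(CleanupIt it) { m_cleanups.erase(it); }

    ~Connection() {
        // Sudden disconnect: run every still-registered disconnect callback.
        for (auto& fn : m_cleanups) fn();
    }
};

// Simplified stand-in for ProxyClient<Thread>.
struct ThreadClient {
    Connection* m_conn;
    Connection::CleanupIt m_disconnect_cb;

    explicit ThreadClient(Connection* conn) : m_conn(conn) {
        m_disconnect_cb = m_conn->addSyncCleanup([this] {
            // The real callback would erase this client from ConnThreads;
            // here we just record that the connection is gone.
            m_conn = nullptr;
        });
    }
    ~ThreadClient() {
        // Normal order: unregister so ~Connection won't call back into a
        // destroyed object.
        if (m_conn) m_conn->removeSyncCleanup(m_disconnect_cb);
    }
};
```

Run in sequence on one thread, either order works; the races below only appear when the two destructors overlap on different threads.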
## Problem 1: invalid `m_disconnect_cb` iterator erase
If the `Connection` is destroyed first and `~Connection` is about to call the `m_disconnect_cb` callback, but the `~ProxyClient` destructor is then called in another thread, it will currently (as of a11e690) try to delete `m_disconnect_cb` using an iterator which is no longer valid. 79c09e4 fixes this by changing the `~ProxyClient<Thread>` destructor to avoid deleting `m_disconnect_cb` once the iterator is invalidated.
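A sketch of the fix idea (hypothetical names, not the actual 79c09e4 diff): a `registered` flag, cleared under a shared mutex before the connection invalidates the iterator, keeps the client destructor from erasing through a stale iterator:

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <mutex>

// Shared cleanup list plus the mutex guarding it.
struct CleanupList {
    std::mutex m_mutex;
    std::list<std::function<void()>> m_fns;
};

struct Client {
    CleanupList* m_list = nullptr;
    std::list<std::function<void()>>::iterator m_it;
    bool m_registered = false;

    void registerCleanup(CleanupList& list, std::function<void()> fn) {
        std::lock_guard<std::mutex> lock(list.m_mutex);
        list.m_fns.push_front(std::move(fn));
        m_list = &list;
        m_it = list.m_fns.begin();
        m_registered = true;
    }
    // Called from the client destructor: only erase while still registered.
    void unregisterCleanup() {
        std::lock_guard<std::mutex> lock(m_list->m_mutex);
        if (m_registered) {
            m_list->m_fns.erase(m_it);
            m_registered = false;
        }
    }
};

// Called from the connection destructor: detach the callback (which
// invalidates the client's iterator) under the mutex, then run it.
void runDisconnect(CleanupList& list, Client& client) {
    std::function<void()> fn;
    {
        std::lock_guard<std::mutex> lock(list.m_mutex);
        fn = std::move(list.m_fns.front());
        list.m_fns.pop_front();      // client.m_it is now invalid...
        client.m_registered = false; // ...so clear the flag first
    }
    fn();
}
```

With the flag cleared under the same lock that protects the list, a concurrent `unregisterCleanup` either runs before the pop (and erases a valid iterator) or after it (and becomes a no-op).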
## Problem 2: `ProxyClient<Thread>` use-after-free
The second problem occurs if `~Connection` runs first and invokes the `m_disconnect_cb` callback, and the `~ProxyClient<Thread>` destructor runs and finishes before `m_disconnect_cb` has a chance to do anything. When the callback resumes, the `ProxyClient<Thread>` object will be gone and the callback's attempt to access and erase it will be invalid. 79c09e4 fixes this by making `~ProxyClient<Thread>` set a destroyed bit when it runs and having the `m_disconnect_cb` callback just return early if the bit is set.
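The destroyed-bit idea can be sketched like this (again with hypothetical names and a simplified shape): the destructor and the callback share a small state object that outlives the client, so the callback has something safe to check before touching the client:

```cpp
#include <cassert>
#include <memory>

// State shared between the client destructor and the disconnect callback.
struct DisconnectState {
    bool destroyed = false;
};

struct FakeClient { // stands in for ProxyClient<Thread>
    std::shared_ptr<DisconnectState> m_state = std::make_shared<DisconnectState>();
    ~FakeClient() { m_state->destroyed = true; } // set the destroyed bit
};

// What the m_disconnect_cb body would do. Returns true if it actually
// erased the client, false if it bailed out early.
bool disconnectCb(const std::shared_ptr<DisconnectState>& state, int& erase_count)
{
    if (state->destroyed) return false; // client already gone: do nothing
    ++erase_count; // stands in for erasing the client from ConnThreads
    return true;
}
```

In the real code the check and the bit-set would both happen under a mutex; the sketch leaves the locking out to keep the lifetime logic visible.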
## Problem 3: `ThreadContext` use-after-free
Just fixing the race between `~Connection` and `~ProxyClient<Thread>` is actually not enough, because after the `ProxyClient<Thread>` object is destroyed, the thread owning the `ProxyClient<Thread>` can exit, destroying the `thread_local` `ThreadContext` struct that contained the `ProxyClient<Thread>` object, and with it the `ThreadContext::waiter.m_mutex` mutex the `ProxyClient<Thread>` code uses for synchronization. If this happens while an `m_disconnect_cb` callback is still in progress, that callback will fail when it tries to lock a mutex that no longer exists. (Commit 85df929 was a broken attempt to fix this and commit 30431bc has another description of the problem.)
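The lifetime hazard has this shape (simplified, hypothetical names): a worker thread owns a `thread_local` waiter, and the disconnect callback it registers captures a reference into it. Once the thread exits, the waiter and its mutex are destroyed, so actually invoking the captured callback afterwards would lock a destroyed mutex (undefined behavior); the sketch only demonstrates that the callback outlives the mutex, and never calls it:

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>

std::atomic<bool> g_waiter_alive{false};

struct Waiter {
    std::mutex m_mutex; // stands in for ThreadContext::waiter.m_mutex
    Waiter() { g_waiter_alive = true; }
    ~Waiter() { g_waiter_alive = false; }
};

std::function<void()> makeDanglingCallback()
{
    std::function<void()> disconnect_cb;
    std::thread t([&] {
        thread_local Waiter waiter; // destroyed when this thread exits
        disconnect_cb = [&waiter] {
            std::lock_guard<std::mutex> lock(waiter.m_mutex); // UB if run now
        };
    });
    t.join();
    // The callback now refers to a Waiter that no longer exists.
    return disconnect_cb;
}
```

This is why the fixes below all amount to forcing the callback to finish (or confining it to one thread) before the owning thread is allowed to exit.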
## Possible Fixes
### Adding an `EventLoop::sync` call
The most direct fix for problem 3 is probably to have the thread that destroyed the `ProxyClient<Thread>` object, and is about to exit and destroy `ThreadContext::waiter.m_mutex`, call `EventLoop::sync` to force `m_disconnect_cb` to finish running before this happens. This would work because `m_disconnect_cb` always runs on the event loop thread. It would require reverting 85df929 to avoid reintroducing Problem 2 (so the `m_disconnect_cb` callback checks the destroyed bit after it acquires `ThreadContext::waiter.m_mutex` instead of before).
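The sync idea relies on the event loop running tasks in order: posting a marker task and blocking until it runs proves every earlier task, including a pending `m_disconnect_cb`, has finished. A tiny event loop illustrating this (hypothetical, not the real `mp::EventLoop` API):

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

class Loop {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_queue;
    bool m_stop = false;
    std::thread m_thread;
public:
    Loop() : m_thread([this] { run(); }) {}
    ~Loop() {
        { std::lock_guard<std::mutex> l(m_mutex); m_stop = true; }
        m_cv.notify_all();
        m_thread.join();
    }
    void post(std::function<void()> fn) {
        std::lock_guard<std::mutex> l(m_mutex);
        m_queue.push_back(std::move(fn));
        m_cv.notify_all();
    }
    // Block until every previously posted task has run: post a marker task
    // and wait for it. Since tasks run in FIFO order on one thread, the
    // marker running implies all earlier tasks have completed.
    void sync() {
        std::mutex done_m; std::condition_variable done_cv; bool done = false;
        post([&] {
            std::lock_guard<std::mutex> l(done_m);
            done = true;
            done_cv.notify_all();
        });
        std::unique_lock<std::mutex> l(done_m);
        done_cv.wait(l, [&] { return done; });
    }
private:
    void run() {
        std::unique_lock<std::mutex> l(m_mutex);
        while (true) {
            m_cv.wait(l, [&] { return m_stop || !m_queue.empty(); });
            if (!m_queue.empty()) {
                auto fn = std::move(m_queue.front());
                m_queue.pop_front();
                l.unlock(); fn(); l.lock(); // run tasks without the lock held
            } else if (m_stop) {
                break;
            }
        }
    }
};
```

In the real fix, the exiting thread would call the sync before letting `ThreadContext::waiter.m_mutex` be destroyed, guaranteeing no `m_disconnect_cb` is still mid-flight on the loop thread.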
### Calling `removeSyncCleanup` from the `EventLoop` thread
If `removeSyncCleanup(m_disconnect_cb)` could be called from the `EventLoop` thread, I think all the previous fixes could be reverted and there would be no problem here, because `m_disconnect_cb` would only be accessed from one thread and couldn't be removed while it might be running. This would somehow have to be done without `ThreadContext::waiter.m_mutex` being held (to avoid deadlocks), so I'm not sure if this is possible.
### Switching mutexes
I am thinking that cleaner fixes for all the problems above could be possible by switching `ProxyClient<Thread>` and related code to use `EventLoop::m_mutex` for some locking instead of `ThreadContext::waiter.m_mutex`. This could simplify things by avoiding the need to deal with two mutexes and to release and reacquire them in a consistent order. I was even thinking of switching to `EventLoop::m_mutex` exclusively for all locking, but I don't think I can do this, because it's possible for a single thread to make IPC calls over multiple connections, and not all connections necessarily use the same event loop.
### Refactoring `ThreadContext`
The `ThreadContext` `callback_threads` and `request_threads` maps indexed by `Connection*` could be replaced with per-`Connection` maps indexed by thread id, which could allow not using `ThreadContext::waiter.m_mutex` at all and using `EventLoop::m_mutex` everywhere.
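A rough sketch of the inverted ownership (hypothetical types and function names): each connection holds maps keyed by thread id, so every map access can be guarded by the one event-loop mutex rather than per-thread waiter mutexes, and disconnect can drop all per-thread entries under a single lock:

```cpp
#include <map>
#include <mutex>
#include <string>
#include <thread>

struct ThreadClientStub { std::string name; }; // stands in for ProxyClient<Thread>

struct Connection {
    // Per-connection maps keyed by thread id
    // (was: per-thread ThreadContext maps keyed by Connection*).
    std::map<std::thread::id, ThreadClientStub> request_threads;
    std::map<std::thread::id, ThreadClientStub> callback_threads;
};

std::mutex g_loop_mutex; // stands in for EventLoop::m_mutex

void registerRequestThread(Connection& conn, std::thread::id tid, std::string name)
{
    std::lock_guard<std::mutex> lock(g_loop_mutex);
    conn.request_threads.emplace(tid, ThreadClientStub{std::move(name)});
}

// On disconnect, every thread's entry is dropped under the one loop mutex,
// with no per-thread ThreadContext::waiter.m_mutex involved.
void disconnectAll(Connection& conn)
{
    std::lock_guard<std::mutex> lock(g_loop_mutex);
    conn.request_threads.clear();
    conn.callback_threads.clear();
}
```

Whether this works in practice depends on the multiple-connections-per-thread caveat from the previous section, since entries for one thread would now be spread across several connections instead of living in one `ThreadContext`.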
## Summary
There is a bug and I am not sure how to fix it yet.