Skip to content

2.10.1 client deadlocks on Python GIL during logging while nacking and receiving messages #116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
zbentley opened this issue May 5, 2023 · 2 comments · Fixed by #118
Closed
Assignees

Comments

@zbentley
Copy link
Contributor

zbentley commented May 5, 2023

We've observed full python interpreter lockups (not just "blocking": the interpreter calling the client halts; can't be unblocked or time out/raise exceptions, even if the blocking operation is moved to a python Thread and waited on with a timeout) in the presence of:

  • The 2.10.1 python client.
  • Python threading (using pulsar Client from a thread).
  • Python asyncio/event loop Future manipulation.
  • consumers in the act of receiving messages (running client's internal receive loop).
  • Many Nacks of the same message.
  • Multiple consumers.
  • using a Python logger= argument to Client. We must do this, otherwise the logs emitted by the client to STDOUT fill up our disks.

All of those have to be present to trigger the issue. When multiple Shared consumers are repeatedly nacking messages with a 15sec delay on a topic with a few hundred messages (100% of them are nacked over and over), all but one of the consumers eventually (within a few minutes) locks up--that is, no Python in that consumer can run. It's not just that it's blocked in a negative_acknowledge call, it's that all threads, signal handlers, coroutines, etc. in that interpreter are stuck. This says GIL conflict to me.

While this program has many hundreds of threads, the stacktraces from the most relevant ones are included here:
threads.txt

@BewareMyPower BewareMyPower self-assigned this May 6, 2023
@BewareMyPower
Copy link
Contributor

From the threads info, the root cause is still related to the NegativeAcksTracker. Here are the simplified threads info:

Thread 2 stuck at:
  acquiring lock in NegativeAcksTracker::add

Thread 14 stuck at:
  acquiring GIL in LoggerWrapper::log
    ConsumerImpl::redeliverMessages
      NegativeAcksTracker::handleTimer

----

Thread 6 stuck at:
  acquiring GIL in LoggerWrapper::log
    ClientConnection::handleIncomingCommand

Thread 8 stuck at:
  acquiring GIL in LoggerWrapper::log
    ClientConnection::getConnectionAsync
      PartitionedConsumerImpl::getPartitionMetadata

Thread 15 stuck at:
  acquiring GIL in LoggerWrapper::LoggerWrapper
    ConsumerStatsImpl::flushAndReset

Thread 13 stuck at
  acquiring GIL in LoggerWrapper::log
    ClientConnection::handleIncomingCommand

Nearly all threads except thread 2 are stuck at acquiring GIL in LoggerWrapper's methods. It means the GIL was hold by another thread, which is likely thread 2 because it calls negative_acknowledge. However, thread 2 is also stuck because the lock in NegativeAcksTracker is hold by thread 14.

In conclusion:

  • Thread 2 holds the GIL, then tries to acquire NegativeAcksTracker#mutex_
  • Thread 14 holds the NegativeAcksTracker#mutex_, then tries to acquire the GIL

Since the GIL won't be released, all threads that try to acquire the GIL will be stuck. It will block nearly all threads when the Python logger is customized because all methods of LoggerWrapper requires the GIL.

I'm going to check the code in further progress and see how could we fix this deadlock.

(BTW, this threads info is strange that I see CaptivePythonObjectMixin in it. But this class was first introduced in 3.0.0)

@BewareMyPower
Copy link
Contributor

I opened a PR in the C++ client repo to avoid the deadlock caused by user-provided logger: apache/pulsar-client-cpp#266

BewareMyPower added a commit to BewareMyPower/pulsar-client-python that referenced this issue May 17, 2023
Fixes apache#116

apache/pulsar-client-cpp#266 is included in the
C++ client 3.2.0 so that apache#116 will be fixed.
merlimat pushed a commit that referenced this issue May 17, 2023
* Bump the C++ client to 3.2.0

Fixes #116

apache/pulsar-client-cpp#266 is included in the
C++ client 3.2.0 so that #116 will be fixed.

* Change the download URL since archive.apache.org is not available now

* Revert "Change the download URL since archive.apache.org is not available now"

This reverts commit 6b92e98.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants