Classic queue processes can run into an exception (a continuation to #12367) #13758
Comments
Hello! I need the logs before the crash. You can send them by email if you'd like (loic.hoguin @ broadcom.com).
How long are the TTLs?
Thanks, I will send you the logs; however, I had already attached logs above in the "Reproduction steps" section, under "Logs". The TTLs are:
The log snippets are not enough to figure out some aspects of this issue, which is why I'm asking for more logs. Thank you for the TTLs; they will help us try to reproduce.
Please enable debug logging so that I get more information:
https://www.rabbitmq.com/docs/logging I am specifically looking for
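A minimal sketch of how debug logging can be enabled, assuming the default file-based logging is in use (the linked docs cover the full set of options):

```ini
# rabbitmq.conf: keep the node logging at debug level across restarts
log.file.level = debug

# Alternatively, switch at runtime without a restart (reverts on node restart):
# rabbitmqctl set_log_level debug
```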
I would like to know more about the type of data that goes through the queues, especially larger data: how long does it stay in the queues, is fanout being used, and are any features other than TTL in use? Are you using consumers or basic.get? Thank you.
Sure,
the (many) producers send
The specific queue, which crashes, is bound to one fanout exchange and to one topic exchange. For the topic exchange, data is routed via a routing key.
It is consumed via a consumer (no basic.get). Thank you,
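To make that topology concrete, here is a hypothetical sketch in Python with pika; the exchange, queue, and routing-key names are placeholders, not the reporter's actual configuration:

```python
import pika

# Placeholder names; the real exchanges, queue, and routing key are not known here.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="events.fanout", exchange_type="fanout", durable=True)
channel.exchange_declare(exchange="events.topic", exchange_type="topic", durable=True)
channel.queue_declare(queue="work.queue", durable=True)

# One binding to the fanout exchange, one to the topic exchange (routed by key).
channel.queue_bind(queue="work.queue", exchange="events.fanout")
channel.queue_bind(queue="work.queue", exchange="events.topic", routing_key="work.#")

# A long-lived consumer, not basic.get.
def on_message(ch, method, properties, body):
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="work.queue", on_message_callback=on_message)
channel.start_consuming()
```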
Interesting, thank you.
Thanks a lot for the help. For your information: unfortunately I have not managed to gather further debug information so far; I will try to collect it after Easter and then report back.
No worries, thank you. It will likely take weeks to figure this out, as the timing needed to trigger the bug seems to be very rare. Yesterday we tried to reproduce it by creating a similar environment, but we weren't successful. We'll keep trying.
I have a potential fix at #13771. It would help if you could try it. CI will shortly build an OCI image that you can use, based off
@alxndrdude what kind of package do you need? There won't be any more
Here is a
Great, thanks a lot. I think going to v4.1.x would be fine for us. However, I have to clarify whether (and how) to roll out a 4.1.1-alpha version.
We discussed this and decided to first try to reproduce the bug in a safer sandbox env (via an RMQ federation setup). If we succeed in reproducing the bug in the sandbox RMQ, we can easily bump the sandbox RMQ to v4.1.1-alpha. That means it might take a while; I apologize for this.
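For context, a sandbox that mirrors production via queue federation is typically wired up roughly as follows; the upstream URI, policy pattern, and names here are assumptions for illustration, not the actual setup:

```bash
# On the sandbox (downstream) node; names, pattern, and URI are placeholders.
rabbitmq-plugins enable rabbitmq_federation

# Point the downstream at the production (upstream) broker.
rabbitmqctl set_parameter federation-upstream prod '{"uri":"amqp://user:pass@prod-host:5672"}'

# Federate matching queues from the upstream into the sandbox.
rabbitmqctl set_policy --apply-to queues federate-from-prod "^work\." '{"federation-upstream-set":"all"}'
```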
We fully understand that ensuring such a change is effective takes time. Thank you.
We finished some testing with a sandbox env; here are our findings.
Setup:
Findings: In the sandbox env, we cannot reproduce the error. That said, we did trigger the bug again, however NOT in the sandbox env but in the production env. The critical/problematic messages failed to be mirrored/broadcast from prod to sandbox. Some details:
Parts of the prod error log of a crashed federation:
There isn't a problematic message as such. The issue is not with how a particular message is processed or stored, but rather with a very specific order of events, involving a message referenced by multiple queues, interleaved with other messages. I think the way you set up the test doesn't guarantee the exact same order and timing of events on the test cluster (and definitely not if the queue on the upstream crashes; even if the message is transferred later, after the queue restart, the timings will be completely different).
Moreover, the problem occurs when reading messages. Therefore, the pattern of queue consumption would need to be the same on the downstream as well. It's not clear whether your "dummy storage queues" were consumed the same way they would be on the upstream.
I'm not saying this method cannot reproduce the issue, but I don't think it will trigger the issue at the same time as it happens on the upstream, and it might not trigger it at all.
Describe the bug
We run an RMQ node, v4.0.7 (deployed via a Docker container).
We have a classic durable queue with a queue TTL and a message TTL, and a nearly continuous workload of approx. 200 msg/s, which is consumed by one consumer and fed by multiple producers (routed via a topic exchange).
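For illustration, a queue with both TTLs, fed through a topic exchange, could be declared and published to roughly like this (Python with pika; the names, TTL values, and routing keys are placeholders, not the actual production settings):

```python
import pika

# Placeholder names and example TTL values; not the real production configuration.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="app.topic", exchange_type="topic", durable=True)

# Classic durable queue with a per-message TTL and a queue (idle) TTL.
channel.queue_declare(
    queue="work.queue",
    durable=True,
    arguments={"x-message-ttl": 60_000, "x-expires": 1_800_000},  # example values
)
channel.queue_bind(queue="work.queue", exchange="app.topic", routing_key="work.#")

# One of multiple producers; together they publish roughly 200 msg/s.
channel.basic_publish(
    exchange="app.topic",
    routing_key="work.created",
    body=b"payload",
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```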
Currently, we observe error events in the RMQ log: "Restarting crashed queue ...".
This happens approx. 4-8 times a day.
After such a crash, our client hangs and stops consuming, and seems NOT to recognize that the queue has crashed and been recovered (while the client hangs, messages accumulate in the queue).
After restarting the client, everything works as expected again.
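Purely as a defensive client-side pattern (not a fix for the broker-side crash, and assuming the broker actually notifies the consumer, which may be exactly what does not happen here), a pika consumer can re-open its connection and re-subscribe when the channel or connection is closed:

```python
import time
import pika
from pika.exceptions import AMQPChannelError, AMQPConnectionError

QUEUE = "work.queue"  # placeholder name

def on_message(channel, method, properties, body):
    channel.basic_ack(delivery_tag=method.delivery_tag)

while True:
    try:
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
        channel.start_consuming()
    except (AMQPChannelError, AMQPConnectionError):
        # Channel or connection was closed (e.g. by the broker); back off and
        # re-subscribe. If the broker never signals anything, this handler will
        # not fire, which matches the "silent hang" described above.
        time.sleep(5)
```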
Note:
I would appreciate any help, advice, or bug fix.
Thanks.
Reproduction steps
Unfortunately, we are not able to reproduce this bug in a sandbox or test environment.
An extract from the prod error log:
Logs
Expected behavior
To not observe queue crashes.
Additional context
No response