Classic queue processes can run into an exception (a continuation to #12367) #13758


Open
alxndrdude opened this issue Apr 16, 2025 · 18 comments

@alxndrdude

Describe the bug

We run an RMQ node of v4.0.7 (deployed via a Docker container).

We have a classic durable queue with a queue TTL and a message TTL, and a nearly continuous workload of approx. 200 msg/s, consumed by one consumer and fed by multiple producers (routed via a topic exchange).

Currently, we observe "Restarting crashed queue ..." error events in the RMQ log.
This happens approx. 4-8 times a day.
After such a crash our client hangs and stops consuming, and seems NOT to recognize that the queue crashed and was recovered (while the client hangs, messages pile up in the queue).
After restarting the client, everything works as expected again.
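On the client side, one common mitigation (not a fix for the broker bug) is to treat a broker-initiated consumer cancellation or a silent stall as a signal to re-subscribe. A minimal, hypothetical sketch of such retry logic in Python — the actual AMQP client calls are omitted and all names here are illustrative:

```python
import time


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(max_retries)]


def resubscribe(consume_once, max_retries=5):
    """Retry `consume_once` (a callable assumed to raise ConnectionError
    when the broker cancels or drops the consumer) with backoff."""
    for delay in backoff_delays(max_retries):
        try:
            return consume_once()
        except ConnectionError:
            time.sleep(delay)  # wait before re-subscribing
    raise RuntimeError("consumer could not be re-established")
```

In a real client, the re-subscription would typically happen in the library's consumer-cancelled or channel-closed callback, depending on which notification the client actually receives.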

Note:

  • For a long time we ran on RMQ v3.12.x, then upgraded first to v3.13.7 and then to v4.0.7.
  • On v3.12.x no problems seemed to occur; we first saw the problem on v3.13.7. After upgrading to v4.0.7 the errors remain.
  • The same messages are routed to other queues; however, we do not observe the error for those.

I would appreciate any help, advice, or bug fix.
Thanks.

Reproduction steps

Unfortunately, we are not able to reproduce this bug in a sandbox or test environment.

An extract of the prod error log:

Logs

2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Generic server <0.100434552.1> terminating
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Last message in was {'$gen_cast',{notify_sent,<0.100460962.1>,50}}
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** When Server state == {q,
...
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Reason for termination ==
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** {function_clause,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>        [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [[eof]],
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> 
...
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>   crasher:
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     initial call: rabbit_amqqueue_process:init/1
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     pid: <0.100434552.1>
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     registered_name: []
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     exception exit: {function_clause,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                         [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [[eof]],
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                               {line,1495}]}]}
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     ancestors: [<0.99260568.1>,<0.3066.0>,<0.661.0>,<0.660.0>,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   rabbit_vhost_sup_sup,rabbit_sup,<0.216.0>]
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     message_queue_len: 172
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     messages: [{'$gen_cast',{notify_sent,<0.100460962.1>,50}},
...
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   {{credit_to,<0.83484211.1>},177},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   {{credit_to,<0.98925663.1>},96}]
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     trap_exit: true
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     status: running
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     heap_size: 3581853
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     stack_size: 29
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     reductions: 455662007
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>   neighbours:
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1> 
...
...
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     supervisor: {<0.99260568.1>,rabbit_amqqueue_sup}
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     errorContext: child_terminated
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     reason: {function_clause,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                 [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [[eof]],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     offender: [{pid,<0.100434552.1>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {id,rabbit_amqqueue},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {mfargs,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                    {rabbit_amqqueue_process,start_link,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                        [{amqqueue,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             {resource,<<"/">>,queue,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                                 <<"measurement.message.live_consumer">>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             true,false,none,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             [{<<"x-expires">>,long,5000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                              {<<"x-message-ttl">>,long,30000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                              {<<"x-queue-type">>,longstr,<<"classic">>}],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             none,[],[],[],undefined,undefined,[],[],live,0,[],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             <<"/">>,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             #{user => <<"mpgplatform">>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             rabbit_classic_queue,#{}},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                         <0.95885391.1>]}},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {restart_type,transient},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {significant,true},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {shutdown,600000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {child_type,worker}]
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1> 
...
...
2025-04-15 18:43:03.841432+00:00 [error] <0.102250085.1> Restarting crashed queue 'measurement.message.live_consumer' in vhost '/'.
...
2025-04-15 18:43:03.849148+00:00 [warning] <0.102250085.1> Queue measurement.message.live_consumer in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown

Expected behavior

Queue crashes should not occur.

Additional context

No response

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

Hello! I need the logs before the crash. You can send them by email if you'd like (loic.hoguin @ broadcom.com).

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

How long are the TTLs?

@alxndrdude
Author

Thanks, I will send you the logs; note, however, that I already attached logs above in the "Reproduction steps" section, under "Logs".

TTLs are

'x-message-ttl' => 30000,
'x-expires' => 5000
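For reference, these arguments as they might appear in a client-side queue declaration (a plain Python dict as a sketch; the x-queue-type value is taken from the crash log above, and the surrounding queue_declare call is client-library-specific and omitted):

```python
# Queue arguments matching the TTLs discussed in this thread.
QUEUE_ARGS = {
    "x-message-ttl": 30000,   # per-message TTL: 30 s
    "x-expires": 5000,        # queue is deleted after 5 s without use
    "x-queue-type": "classic",
}
```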

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

The log snippets are not enough to figure out some things about this issue; that's why I'm asking for more logs.

Thank you for the TTLs; they will help with attempts to reproduce.

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

Please enable debug logging so that I get more information:

log.file.level = debug

https://www.rabbitmq.com/docs/logging

I am specifically looking for grep -i compact and grep -i truncat type of logs.

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

I would like to know more about the type of data that goes through the queues, especially larger data: how long does it stay in the queues? Is fanout being used? Any features other than TTL? Are you using consumers or basic.get? Thank you.

@alxndrdude
Author

Sure,

type of data that goes through the queues, especially larger data:

The (many) producers send

  1. rarely JSON
  2. mainly protobuf wire format (https://protobuf.dev/programming-guides/encoding/)
  3. both JSON and protobuf blobs might be "large"; however, in most cases the protobuf blobs are the "large" data
  4. the producers commit messages in transactions with a max batch size of 1000; however, the message size might vary
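The per-transaction batching described in point 4 could be sketched as follows (a hypothetical helper; the actual tx_select/tx_commit calls depend on the client library in use):

```python
def batches(messages, max_batch=1000):
    """Split a message list into publish batches of at most `max_batch`,
    mirroring the producers' per-transaction commit size."""
    for i in range(0, len(messages), max_batch):
        yield messages[i:i + max_batch]

# Each batch would then be published inside one AMQP transaction:
#   tx_select(); basic_publish() per message; tx_commit()
```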

how long does it stay in the queues,

  • for the specific queue that crashes, data stays in the queue at most 1-2 seconds, typically milliseconds
  • because data is distributed, in other queues it may stay much longer, e.g. we have a queue to which some dead-letter data (a small subset) is routed, where it typically stays a very long time (7-day TTL)

is fanout being used, any other features other than TTL?

The specific queue that crashes is bound to one fanout and to one topic exchange. For the topic exchange, data is routed via routing key.

Are you using consumers or basic.get?

It is consumed (no basic.get).

Thank you,

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

Interesting, thank you.

@alxndrdude
Author

Thanks a lot for the help. As info: unfortunately I have not managed to gather further debug info so far; I will try to collect it after Easter and then report back.

@michaelklishin michaelklishin changed the title Observing "Restarting crashed queue" with hanging consumer Classic queue processes can run into an exception (a continuation to #12367) Apr 16, 2025
@lhoguin
Contributor

lhoguin commented Apr 17, 2025

No worries, thank you. It will likely take weeks to figure it out as the timing to trigger the bug seems very rare.

Yesterday we tried to reproduce by creating a similar environment but we weren't successful. We'll keep trying.

@lhoguin
Contributor

lhoguin commented Apr 18, 2025

I have a potential fix at #13771

It would help if you could try it. CI will shortly build an OCI image based off main that you can use. If main is too recent, we could make an OCI image available for 4.0.x.

@michaelklishin
Collaborator

@alxndrdude what kind of package do you need? There won't be any more 4.0.x releases for non-paying users, but main right now is almost identical to v4.1.x, so we have an alpha OCI build and, after the PR is merged, will have a 4.1.1-alpha build.

@michaelklishin
Collaborator

michaelklishin commented Apr 19, 2025

Here is a 4.1.1-alpha release that includes #13771.

@alxndrdude
Author

Great, thanks a lot. I think going to v4.1.x would be fine for us. However, I have to clarify whether (and how) we can roll out a 4.1.1-alpha version.

@alxndrdude
Author

We discussed it and decided to first try to reproduce the bug in a safer sandbox env (via an RMQ federation setup). If we succeed in reproducing the bug in the sandbox RMQ, we can easily bump the sandbox RMQ to v4.1.1-alpha.

That means it might take a while; I apologize for this.

@michaelklishin
Collaborator

We discussed it and decided to first try to reproduce the bug in a safer sandbox env (via an RMQ federation setup). If we succeed in reproducing the bug in the sandbox RMQ, we can easily bump the sandbox RMQ to v4.1.1-alpha.

That means it might take a while; I apologize for this.

We fully understand that ensuring that such a change was effective takes time. RabbitMQ 4.1.1 will ship some time next month in any case.

Thank you.

@alxndrdude
Author

We finished some testing with a sandbox env; here are our findings:

Setup

  1. a "sandbox" RMQ v4.0.7 (same as in production) was set up
  2. the sandbox RMQ was federated via the federation plugin (no-ack, 30 s msg TTL) with the two relevant exchanges (one fanout and one topic) of the production system, to mirror/broadcast the messages from prod into the sandbox system
  3. one dummy storage queue and one dummy consumer queue were set up
  4. the dummy storage queue was NOT consumed, is durable, and had a message TTL of 30 s
  5. the dummy consumer queue IS consumed, is durable, had a message TTL of 30 s and a consumer capacity of 100%
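The federation upstream from step 2 might be defined with a parameter along these lines (a hypothetical sketch of the JSON body one would PUT to the management API; the URI is illustrative, while ack-mode and message-ttl are standard federation-upstream keys):

```python
import json

# Hypothetical federation-upstream definition matching step 2:
# no-ack mode and a 30 s message TTL on the federation link.
upstream = {
    "value": {
        "uri": "amqp://prod-rmq-host",  # illustrative upstream URI
        "ack-mode": "no-ack",
        "message-ttl": 30000,
    }
}
body = json.dumps(upstream)
```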

Findings

In the sandbox env, we cannot reproduce the error.
In the production system we additionally found: Restarting crashed queue 'federation: exchange.measurement.message -> rabbit@mpg-rmq-sandbox01' in vhost.

This means we somehow triggered the bug again, however NOT in the sandbox env but in the production env. The critical/problematic messages failed to be mirrored/broadcast from prod to sandbox.

Some details

  1. sometimes crashes of the prod federation and the prod queue seem to be correlated (approx. same timestamp)
  2. sometimes the federation crashed but the prod queue did NOT, and vice versa
Parts of the prod error log of a crashed federation

2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2> ** Generic server <0.41594500.2> terminating
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2> ** Last message in was {'$gen_cast',{notify_sent,<0.40871177.2>,50}}
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2> ** When Server state == {q,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                          {amqqueue,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                           {resource,<<"/">>,queue,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                            <<"federation: exchange.measurement.message -> rabbit@mpg-rmq-sandbox01">>},
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                           true,false,none,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                           [{<<"x-message-ttl">>,long,30000},
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                            {<<"x-internal-purpose">>,longstr,<<"federation">>},
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                            {<<"x-queue-type">>,longstr,<<"classic">>}],
...
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>                   {{credit_to,<0.35455712.2>},89},
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>                   {{credit_to,<0.39322814.2>},35}]
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     trap_exit: true
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     status: running
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     heap_size: 6189440
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     stack_size: 29
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     reductions: 184472121
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>   neighbours:
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2> 
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     offender: [{pid,<0.41594500.2>},
...
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     ancestors: [<0.34264874.2>,<0.3066.0>,<0.661.0>,<0.660.0>,
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     links: [<0.34264874.2>]
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     supervisor: {<0.34264874.2>,rabbit_amqqueue_sup}
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     errorContext: child_terminated
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     reason: {function_clause,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                 [{rabbit_msg_store,reader_pread_parse,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [[eof]],
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_msg_store,reader_pread,2,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_msg_store,read_many_file2,4,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,fetch_from_q3,1,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,queue_out,1,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,collect_by_predicate,3,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,remove_by_predicate,2,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     offender: [{pid,<0.41594500.2>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {id,rabbit_amqqueue},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {mfargs,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                    {rabbit_amqqueue_process,start_link,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                        [{amqqueue,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             {resource,<<"/">>,queue,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                                 <<"federation: exchange.measurement.message -> rabbit@mpg-rmq-sandbox01">>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             true,false,none,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             [{<<"x-message-ttl">>,long,30000},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                              {<<"x-internal-purpose">>,longstr,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                               <<"federation">>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                              {<<"x-queue-type">>,longstr,<<"classic">>}],
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             none,[],[],[],undefined,undefined,[],[],live,0,[],
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             <<"/">>,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             #{user => <<"mpgsandbox">>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             rabbit_classic_queue,#{}},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                         <0.40952274.2>]}},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {restart_type,transient},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {significant,true},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {shutdown,600000},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {child_type,worker}]
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>

@mkuratczyk
Contributor

mkuratczyk commented Apr 28, 2025

This means we somehow triggered the bug again, however NOT in the sandbox env but in the production env. The critical/problematic messages failed to be mirrored/broadcast from prod to sandbox.

There's no problematic message as such. The issue is not with how a particular message is processed or stored, but rather with a very specific order of events, involving a message referenced by multiple queues, interleaved with other messages. I think the way you set up the test doesn't guarantee the exact same order and timing of events on the test cluster (and definitely not if the queue on the upstream crashes - even if the message is transferred later after the queue restart, the timings will be completely different).

Moreover, the problem occurs when reading messages. Therefore, the pattern of queue consumption would need to be the same on the downstream as well. It's not clear whether your "dummy storage queues" were consumed the same way they would be on the upstream.

I'm not saying this method cannot reproduce the issue, but I don't think it will trigger the issue at the same time as it happens on the upstream, and it might not trigger it at all.
