Classic queue processes can run into an exception (a continuation to #12367) #13758


Open
alxndrdude opened this issue Apr 16, 2025 · 18 comments

@alxndrdude

Describe the bug

We run an RMQ node of v4.0.7 (deployed via a Docker container).

We have a classic durable queue with a queue TTL and a message TTL, and a nearly continuous workload of approx. 200 msg/s, consumed by one consumer and fed by multiple producers (routed via a topic exchange).

Currently, we observe "Restarting crashed queue ..." error events in the RMQ log.
This happens approx. 4-8 times a day.
After such a crash our client hangs and stops consuming, and seems NOT to recognize that the queue crashed and was recovered (while the client hangs, messages pile up in the queue).
After restarting the client, everything works as expected again.
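On the client side, one common mitigation (not a fix for the broker bug) is to treat a broker-initiated consumer cancellation or a silent stall as a signal to re-subscribe. A minimal, hypothetical sketch of such retry logic in Python — the actual AMQP client calls are omitted and all names here are illustrative:

```python
import time


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(max_retries)]


def resubscribe(consume_once, max_retries=5):
    """Retry `consume_once` (a callable assumed to raise ConnectionError
    when the broker cancels or drops the consumer) with backoff."""
    for delay in backoff_delays(max_retries):
        try:
            return consume_once()
        except ConnectionError:
            time.sleep(delay)  # wait before re-subscribing
    raise RuntimeError("consumer could not be re-established")
```

In a real client, the re-subscription would typically happen in the library's consumer-cancelled or channel-closed callback, depending on which notification the client actually receives.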

Note:

  • For a long time we ran on RMQ v3.12.x, then upgraded first to v3.13.7 and then to v4.0.7.
  • On v3.12.x no problems seemed to occur; we first saw the problem on v3.13.7. After upgrading to v4.0.7 the errors remain.
  • The same messages are routed to other queues; however, we do not observe the error for those.

I would appreciate any help, advice, or bug fix.
Thanks.

Reproduction steps

Unfortunately, we are not able to reproduce this bug in a sandbox or test environment.

An extract of the prod error log:

Logs

2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Generic server <0.100434552.1> terminating
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Last message in was {'$gen_cast',{notify_sent,<0.100460962.1>,50}}
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** When Server state == {q,
...
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Reason for termination ==
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** {function_clause,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>        [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [[eof]],
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> 
...
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>   crasher:
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     initial call: rabbit_amqqueue_process:init/1
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     pid: <0.100434552.1>
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     registered_name: []
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     exception exit: {function_clause,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                         [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [[eof]],
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                               {line,1495}]}]}
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     ancestors: [<0.99260568.1>,<0.3066.0>,<0.661.0>,<0.660.0>,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   rabbit_vhost_sup_sup,rabbit_sup,<0.216.0>]
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     message_queue_len: 172
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     messages: [{'$gen_cast',{notify_sent,<0.100460962.1>,50}},
...
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   {{credit_to,<0.83484211.1>},177},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   {{credit_to,<0.98925663.1>},96}]
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     trap_exit: true
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     status: running
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     heap_size: 3581853
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     stack_size: 29
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     reductions: 455662007
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>   neighbours:
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1> 
...
...
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     supervisor: {<0.99260568.1>,rabbit_amqqueue_sup}
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     errorContext: child_terminated
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     reason: {function_clause,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                 [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [[eof]],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     offender: [{pid,<0.100434552.1>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {id,rabbit_amqqueue},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {mfargs,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                    {rabbit_amqqueue_process,start_link,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                        [{amqqueue,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             {resource,<<"/">>,queue,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                                 <<"measurement.message.live_consumer">>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             true,false,none,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             [{<<"x-expires">>,long,5000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                              {<<"x-message-ttl">>,long,30000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                              {<<"x-queue-type">>,longstr,<<"classic">>}],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             none,[],[],[],undefined,undefined,[],[],live,0,[],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             <<"/">>,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             #{user => <<"mpgplatform">>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             rabbit_classic_queue,#{}},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                         <0.95885391.1>]}},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {restart_type,transient},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {significant,true},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {shutdown,600000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {child_type,worker}]
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1> 
...
...
2025-04-15 18:43:03.841432+00:00 [error] <0.102250085.1> Restarting crashed queue 'measurement.message.live_consumer' in vhost '/'.
...
2025-04-15 18:43:03.849148+00:00 [warning] <0.102250085.1> Queue measurement.message.live_consumer in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown

Expected behavior

Queue crashes should not occur.

Additional context

No response

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

Hello! I need the logs before the crash. You can send them by email if you'd like (loic.hoguin @ broadcom.com).

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

How long are the TTLs?

@alxndrdude
Author

Thanks, I will send you the logs; note, however, that I already attached logs above in the "Reproduction steps" section, under "Logs".

TTLs are

'x-message-ttl' => 30000,
'x-expires' => 5000
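For reference, these arguments as they might appear in a client-side queue declaration (a plain Python dict as a sketch; the x-queue-type value is taken from the crash log above, and the surrounding queue_declare call is client-library-specific and omitted):

```python
# Queue arguments matching the TTLs discussed in this thread.
QUEUE_ARGS = {
    "x-message-ttl": 30000,   # per-message TTL: 30 s
    "x-expires": 5000,        # queue is deleted after 5 s without use
    "x-queue-type": "classic",
}
```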

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

The log snippets are not enough to figure out some things about this issue; that's why I'm asking for more logs.

Thank you for the TTLs; they will help with attempts to reproduce.

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

Please enable debug logging so that I get more information:

log.file.level = debug

https://www.rabbitmq.com/docs/logging

I am specifically looking for grep -i compact and grep -i truncat type of logs.

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

I would like to know more about the type of data that goes through the queues, especially larger data: how long does it stay in the queues? Is fanout being used? Any features other than TTL? Are you using consumers or basic.get? Thank you.

@alxndrdude
Author

Sure,

type of data that goes through the queues, especially larger data:

The (many) producers send

  1. rarely JSON
  2. mainly protobuf wire format (https://protobuf.dev/programming-guides/encoding/)
  3. both JSON and protobuf blobs might be "large"; however, in most cases the protobuf blobs are the "large" data
  4. the producers commit messages in transactions with a max batch size of 1000; however, the message size might vary
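The per-transaction batching described in point 4 could be sketched as follows (a hypothetical helper; the actual tx_select/tx_commit calls depend on the client library in use):

```python
def batches(messages, max_batch=1000):
    """Split a message list into publish batches of at most `max_batch`,
    mirroring the producers' per-transaction commit size."""
    for i in range(0, len(messages), max_batch):
        yield messages[i:i + max_batch]

# Each batch would then be published inside one AMQP transaction:
#   tx_select(); basic_publish() per message; tx_commit()
```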

how long does it stay in the queues,

  • for the specific queue that crashes, data stays in the queue at most 1-2 seconds, typically milliseconds
  • because data is distributed, in other queues it may stay much longer, e.g. we have a queue to which some dead-letter data (a small subset) is routed, where it typically stays a very long time (7-day TTL)

is fanout being used, any other features other than TTL?

The specific queue that crashes is bound to one fanout and to one topic exchange. For the topic exchange, data is routed via routing key.

Are you using consumers or basic.get?

It is consumed (no basic.get).

Thank you,

@lhoguin
Contributor

lhoguin commented Apr 16, 2025

Interesting, thank you.

@alxndrdude
Author

Thanks a lot for the help. As info: unfortunately I have not managed to gather further debug info so far; I will try to collect it after Easter and then report back.

@michaelklishin michaelklishin changed the title Observing "Restarting crashed queue" with hanging consumer Classic queue processes can run into an exception (a continuation to #12367) Apr 16, 2025
@lhoguin
Contributor

lhoguin commented Apr 17, 2025

No worries, thank you. It will likely take weeks to figure it out as the timing to trigger the bug seems very rare.

Yesterday we tried to reproduce by creating a similar environment but we weren't successful. We'll keep trying.

@lhoguin
Contributor

lhoguin commented Apr 18, 2025

I have a potential fix at #13771

It would help if you could try it. CI will shortly build an OCI image based off main that you can use. If main is too recent, we could make an OCI image available for 4.0.x.

@michaelklishin
Collaborator

@alxndrdude what kind of package do you need? There won't be any more 4.0.x releases for non-paying users, but main right now is almost identical to v4.1.x, so we have an alpha OCI build and, after the PR is merged, will have a 4.1.1-alpha build.

@michaelklishin
Collaborator

michaelklishin commented Apr 19, 2025

Here is a 4.1.1-alpha release that includes #13771.

@alxndrdude
Author

Great, thanks a lot. I think going to v4.1.x would be fine for us. However, I have to clarify whether (and how) we can roll out a 4.1.1-alpha version.

@alxndrdude
Author

We discussed it and decided to first try to reproduce the bug in a safer sandbox env (via an RMQ federation setup). If we succeed in reproducing the bug in the sandbox RMQ, we can easily bump the sandbox RMQ to v4.1.1-alpha.

That means it might take a while; I apologize for this.

@michaelklishin
Collaborator

We discussed it and decided to first try to reproduce the bug in a safer sandbox env (via an RMQ federation setup). If we succeed in reproducing the bug in the sandbox RMQ, we can easily bump the sandbox RMQ to v4.1.1-alpha.

That means it might take a while; I apologize for this.

We fully understand that ensuring that such a change was effective takes time. RabbitMQ 4.1.1 will ship some time next month in any case.

Thank you.

@alxndrdude
Author

We finished some testing with a sandbox env; here are our findings:

Setup

  1. a "sandbox" RMQ v4.0.7 (same as in production) was set up
  2. the sandbox RMQ was federated via the federation plugin (no-ack, 30 s msg TTL) with the two relevant exchanges (one fanout and one topic) of the production system, to mirror/broadcast the messages from prod into the sandbox system
  3. one dummy storage queue and one dummy consumer queue were set up
  4. the dummy storage queue was NOT consumed, is durable, and had a message TTL of 30 s
  5. the dummy consumer queue IS consumed, is durable, had a message TTL of 30 s and a consumer capacity of 100%
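The federation upstream from step 2 might be defined with a parameter along these lines (a hypothetical sketch of the JSON body one would PUT to the management API; the URI is illustrative, while ack-mode and message-ttl are standard federation-upstream keys):

```python
import json

# Hypothetical federation-upstream definition matching step 2:
# no-ack mode and a 30 s message TTL on the federation link.
upstream = {
    "value": {
        "uri": "amqp://prod-rmq-host",  # illustrative upstream URI
        "ack-mode": "no-ack",
        "message-ttl": 30000,
    }
}
body = json.dumps(upstream)
```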

Findings

In the sandbox env, we cannot reproduce the error.
In the production system we additionally found: Restarting crashed queue 'federation: exchange.measurement.message -> rabbit@mpg-rmq-sandbox01' in vhost.

This means we somehow triggered the bug again, however NOT in the sandbox env but in the production env. The critical/problematic messages failed to be mirrored/broadcast from prod to sandbox.

Some details

  1. sometimes crashes of the prod federation and the prod queue seem to be correlated (approx. same timestamp)
  2. sometimes the federation crashed but the prod queue did NOT, and vice versa
Parts of the prod error log of a crashed federation

2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2> ** Generic server <0.41594500.2> terminating
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2> ** Last message in was {'$gen_cast',{notify_sent,<0.40871177.2>,50}}
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2> ** When Server state == {q,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                          {amqqueue,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                           {resource,<<"/">>,queue,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                            <<"federation: exchange.measurement.message -> rabbit@mpg-rmq-sandbox01">>},
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                           true,false,none,
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                           [{<<"x-message-ttl">>,long,30000},
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                            {<<"x-internal-purpose">>,longstr,<<"federation">>},
2025-04-28 02:44:01.678638+00:00 [error] <0.41594500.2>                            {<<"x-queue-type">>,longstr,<<"classic">>}],
...
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>                   {{credit_to,<0.35455712.2>},89},
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>                   {{credit_to,<0.39322814.2>},35}]
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     trap_exit: true
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     status: running
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     heap_size: 6189440
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     stack_size: 29
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     reductions: 184472121
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>   neighbours:
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2> 
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     offender: [{pid,<0.41594500.2>},
...
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     ancestors: [<0.34264874.2>,<0.3066.0>,<0.661.0>,<0.660.0>,
2025-04-28 02:44:02.956810+00:00 [error] <0.41594500.2>     links: [<0.34264874.2>]
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     supervisor: {<0.34264874.2>,rabbit_amqqueue_sup}
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     errorContext: child_terminated
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     reason: {function_clause,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                 [{rabbit_msg_store,reader_pread_parse,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [[eof]],
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_msg_store,reader_pread,2,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_msg_store,read_many_file2,4,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,fetch_from_q3,1,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,queue_out,1,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,collect_by_predicate,3,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                  {rabbit_variable_queue,remove_by_predicate,2,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                      [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>     offender: [{pid,<0.41594500.2>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {id,rabbit_amqqueue},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {mfargs,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                    {rabbit_amqqueue_process,start_link,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                        [{amqqueue,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             {resource,<<"/">>,queue,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                                 <<"federation: exchange.measurement.message -> rabbit@mpg-rmq-sandbox01">>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             true,false,none,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             [{<<"x-message-ttl">>,long,30000},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                              {<<"x-internal-purpose">>,longstr,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                               <<"federation">>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                              {<<"x-queue-type">>,longstr,<<"classic">>}],
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             none,[],[],[],undefined,undefined,[],[],live,0,[],
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             <<"/">>,
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             #{user => <<"mpgsandbox">>},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                             rabbit_classic_queue,#{}},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                         <0.40952274.2>]}},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {restart_type,transient},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {significant,true},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {shutdown,600000},
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>                {child_type,worker}]
2025-04-28 02:44:03.864942+00:00 [error] <0.34264874.2>

@mkuratczyk
Contributor

mkuratczyk commented Apr 28, 2025

This means we somehow triggered the bug again, however NOT in the sandbox env but in the production env. The critical/problematic messages failed to be mirrored/broadcast from prod to sandbox.

There's no problematic message as such. The issue is not with how a particular message is processed or stored, but rather with a very specific order of events, involving a message referenced by multiple queues, interleaved with other messages. I think the way you set up the test doesn't guarantee the exact same order and timing of events on the test cluster (and definitely not if the queue on the upstream crashes - even if the message is transferred later after the queue restart, the timings will be completely different).

Moreover, the problem occurs when reading messages. Therefore, the pattern of queue consumption would need to be the same on the downstream as well. It's not clear whether your "dummy storage queues" were consumed the same way they would be on the upstream.

I'm not saying this method cannot reproduce the issue, but I don't think it will trigger the issue at the same time as it happens on the upstream, and it might not trigger it at all.
