Gvrose fips legacy 8 compliant/4.18.0 425.13.1 #2

Conversation

@gvrose8192 gvrose8192 commented Oct 21, 2024

Latest work on the FIPS 8 compliant kernel. Various linked tasks:
VULN-429
VULN-4095
VULN-597
SECO-169
SECO-94

Kernel selftests have passed https://github.com/user-attachments/files/17512330/kernel-selftest.log with no change in results from the previous run.

I've been running the netfilter tests in a loop overnight: for i in {1..50000}; do sudo valgrind --log-file=valgrind-results$i.log ./run-tests.sh; done
Typical output, unchanged over any number of runs:
nftables-test.log

Valgrind output from any of hundreds of runs is the same:
valgrind-results.log

With lockdep enabled and running sudo stress --cpu 28 --io 28 --vm 28 --vm-bytes 1G --timeout 3h I ran multiple passes of the nftables tests with valgrind: for i in {1..4}; do sudo valgrind --log-file=valgrind-results$i.log ./run-tests.sh; done

The nftables tests all pass with no difference from the original tests. Valgrind logs here:
valgrind-results1.log
valgrind-results2.log
valgrind-results3.log
valgrind-results4.log

No lockdep splats or any other splats, no OOMs, no panics, nor any other error messages during the run.

After pulling in the missing patch "netfilter: nf_tables: set backend .flush always succeeds" I reran the netfilter tests overnight with lockdep and kmemleak enabled in the kernel and running "sudo stress --cpu 28 --io 28 --vm 28 --vm-bytes 1G --timeout 8h" and found no issues. The logfiles are unchanged from the previous night's runs.
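
For reference, the overnight setup boils down to something like the following sketch (the commands are the ones quoted above; treat this as an outline of the procedure rather than an exact script):

    # Background stress load: 28 CPU, 28 I/O, and 28 VM workers, 1 GiB per VM worker
    sudo stress --cpu 28 --io 28 --vm 28 --vm-bytes 1G --timeout 8h &

    # Loop the nftables testsuite under valgrind, one log file per iteration
    for i in {1..50000}; do
        sudo valgrind --log-file=valgrind-results$i.log ./run-tests.sh
    done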

@gvrose8192 gvrose8192 requested a review from PlaidCat October 21, 2024 21:12
@gvrose8192
Author

The 2nd commit is crap - the code is right but I whacked the commit message metadata. I'll fix that up. Leaving the rest for your review.

@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch from 22f98fe to 3f2fa28 on October 21, 2024 21:23
@gvrose8192
Author

"The 2nd commit is crap - the code is right but I whacked the commit message metadata. I'll fix that up. Leaving the rest for your review."

OK, fixed with a force push.

@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch 2 times, most recently from 8f50770 to 9cff1f7 on October 23, 2024 00:31
@gvrose8192
Author

This PR is now ready for full review.
CVEs addressed by this PR:
CVE-2023-4244
CVE-2023-52581
CVE-2024-26925

GitHub Actions build checks:
https://github.com/ctrliq/kernel-src-tree/actions/runs/11505520259 - checked the PR for valid commit messages
https://github.com/ctrliq/kernel-src-tree/actions/runs/11505519609 - checked the compile/build for x86_64
https://github.com/ctrliq/kernel-src-tree/actions/runs/11505519601 - checked the compile/build for aarch64; this is not really valid for fips8, but it demonstrates that the code changes are portable.

The kernel selftest log shows no new errors or consistent discrepancies from the base kernel before this PR, i.e. things that failed before still fail, and things that passed before still pass.

kernel-selftest.log
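
For anyone wanting to verify that claim, a quick sanity check is to diff the pass/fail lines of the two runs; this assumes the usual TAP-style kselftest output, and the file names here are illustrative, not the actual artifacts:

    # Empty diff output means no test changed state between the two kernels
    grep -E '^(ok|not ok) ' base-kernel-selftest.log > base-results.txt
    grep -E '^(ok|not ok) ' pr-kernel-selftest.log > pr-results.txt
    diff base-results.txt pr-results.txt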

What remains to do:

  1. Close inspection of the commits marked with an 'upstream-diff' tag.
  2. Netfilter testsuite - run it against the current fips-compliant8 branch and then against the same branch with this PR, checking for no new errors. If something passes that previously failed, that's great and I will note it in the PR conversation.

@PlaidCat This PR is ready for review. I'll be configuring and running the netfilter tests in parallel and recording the results here.

@gvrose8192
Author

Oh - totally forgot about this included patch: 5647beb

So that adds an additional CVE fixed by this PR - CVE-2024-39502

So we have 4 total CVEs addressed by this PR, not 3.

@gvrose8192
Author

nft-test-results.log

Sample results from an nftables testsuite run on my dev system running Rocky 9.4. This was a sanity check to make sure I could actually build and install the nftables testsuite available here: https://git.netfilter.org/nftables/
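
For anyone reproducing this, the nftables tree builds with the stock autotools flow; a rough sketch (dependency package names vary by distro, so treat this as an outline rather than a recipe):

    git clone https://git.netfilter.org/nftables
    cd nftables
    ./autogen.sh && ./configure && make && sudo make install
    # The testsuite driver referenced in these comments lives in the tree
    cd tests/shell && sudo ./run-tests.sh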

Next step is to collect the results from a run with the currently available fips8-compliant kernel and compare them to the results from the kernel built from this PR.

Collaborator

@PlaidCat PlaidCat left a comment


EDIT: previous version invalid due to wrong reference


@gvrose8192
Author

"in nf_tables_commit under case NFT_MSG_NEWSETELEM: also uses nft_setelem_remove
https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8341

Same as above in __nf_tables_abort NFT_MSG_NEWSETELEM
https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8547"

Acked - requires more investigation.

@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch from 35ee0b2 to c7185b5 on October 26, 2024 18:18
@gvrose8192
Author

Closing this pull request - will post an updated PR

@gvrose8192 gvrose8192 closed this Oct 28, 2024
@gvrose8192 gvrose8192 reopened this Oct 28, 2024
@gvrose8192 gvrose8192 force-pushed the gvrose_fips-legacy-8-compliant/4.18.0-425.13.1 branch 2 times, most recently from e82f39b to 240a26d on October 28, 2024 20:29
@gvrose8192
Author

All kernel selftests continue to show no new errors or consistent discrepancies between the base version and this patch series.
I am running the nftables run-tests.sh in a continuous loop with valgrind. No memory leaks have been detected so far after hundreds of loops. The logs are too big to store, but I ran a single loop manually and got the following results:

test-results.log
valgrind-results.log

I'll resume valgrind checking of the nftables nfct tests with an overnight run, to make sure no longer-term (within a day) damage is found and to increase confidence in the PR.

Collaborator

@PlaidCat PlaidCat left a comment


For netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout 240a26d

it should be jira VULN-835

@gvrose8192
Author

"For netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout 240a26d

it should be jira VULN-835"

Good catch - I wondered about pulling that in with a different jira and meant to ask you but then got distracted by other work. I'll fix that up.

@gvrose8192
Author

"in nf_tables_commit under case NFT_MSG_NEWSETELEM: also uses nft_setelem_remove https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8341

Same as above in __nf_tables_abort NFT_MSG_NEWSETELEM https://github.com/ctrliq/kernel-src-tree/blob/centos_kernel-4.18.0-534.el8/net/netfilter/nf_tables_api.c#L8547"

Acked - requires more investigation.

OK, yes. Found and fixed - just missed it in an otherwise large commit. Fix incoming with the next branch force-push.

Collaborator

@PlaidCat PlaidCat left a comment


subsystem-sync netfilter:nf_tables 4.18.0-553
These should be:
subsystem-sync netfilter:nf_tables 4.18.0-534

Collaborator

@PlaidCat PlaidCat left a comment


netfilter: nf_tables: fix table flag updates 95b04bb

Is the upstream-diff here just contextual information in the fuzz?

PlaidCat added a commit that referenced this pull request May 20, 2025
jira NONE_AUTOMATION
Rebuild_History Non-Buildable kernel-5.14.0-570.16.1.el9_6
commit-author Shradha Gupta <[email protected]>
commit 3e64bb2
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-5.14.0-570.16.1.el9_6/3e64bb2a.failed

When on a MANA VM hibernation is triggered, as part of hibernate_snapshot(),
mana_gd_suspend() and mana_gd_resume() are called. If during this
mana_gd_resume(), a failure occurs with HWC creation, mana_port_debugfs
pointer does not get reinitialized and ends up pointing to older,
cleaned-up dentry.
Further in the hibernation path, as part of power_down(), mana_gd_shutdown()
is triggered. This call, unaware of the failures in resume, tries to cleanup
the already cleaned up  mana_port_debugfs value and hits the following bug:

[  191.359296] mana 7870:00:00.0: Shutdown was called
[  191.359918] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  191.360584] #PF: supervisor write access in kernel mode
[  191.361125] #PF: error_code(0x0002) - not-present page
[  191.361727] PGD 1080ea067 P4D 0
[  191.362172] Oops: Oops: 0002 [#1] SMP NOPTI
[  191.362606] CPU: 11 UID: 0 PID: 1674 Comm: bash Not tainted 6.14.0-rc5+ #2
[  191.363292] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/21/2024
[  191.364124] RIP: 0010:down_write+0x19/0x50
[  191.364537] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 de cd ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 16 65 48 8b 05 88 24 4c 6a 48 89 43 08 48 8b 5d
[  191.365867] RSP: 0000:ff45fbe0c1c037b8 EFLAGS: 00010246
[  191.366350] RAX: 0000000000000000 RBX: 0000000000000098 RCX: ffffff8100000000
[  191.366951] RDX: 0000000000000001 RSI: 0000000000000064 RDI: 0000000000000098
[  191.367600] RBP: ff45fbe0c1c037c0 R08: 0000000000000000 R09: 0000000000000001
[  191.368225] R10: ff45fbe0d2b01000 R11: 0000000000000008 R12: 0000000000000000
[  191.368874] R13: 000000000000000b R14: ff43dc27509d67c0 R15: 0000000000000020
[  191.369549] FS:  00007dbc5001e740(0000) GS:ff43dc663f380000(0000) knlGS:0000000000000000
[  191.370213] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  191.370830] CR2: 0000000000000098 CR3: 0000000168e8e002 CR4: 0000000000b73ef0
[  191.371557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  191.372192] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  191.372906] Call Trace:
[  191.373262]  <TASK>
[  191.373621]  ? show_regs+0x64/0x70
[  191.374040]  ? __die+0x24/0x70
[  191.374468]  ? page_fault_oops+0x290/0x5b0
[  191.374875]  ? do_user_addr_fault+0x448/0x800
[  191.375357]  ? exc_page_fault+0x7a/0x160
[  191.375971]  ? asm_exc_page_fault+0x27/0x30
[  191.376416]  ? down_write+0x19/0x50
[  191.376832]  ? down_write+0x12/0x50
[  191.377232]  simple_recursive_removal+0x4a/0x2a0
[  191.377679]  ? __pfx_remove_one+0x10/0x10
[  191.378088]  debugfs_remove+0x44/0x70
[  191.378530]  mana_detach+0x17c/0x4f0
[  191.378950]  ? __flush_work+0x1e2/0x3b0
[  191.379362]  ? __cond_resched+0x1a/0x50
[  191.379787]  mana_remove+0xf2/0x1a0
[  191.380193]  mana_gd_shutdown+0x3b/0x70
[  191.380642]  pci_device_shutdown+0x3a/0x80
[  191.381063]  device_shutdown+0x13e/0x230
[  191.381480]  kernel_power_off+0x35/0x80
[  191.381890]  hibernate+0x3c6/0x470
[  191.382312]  state_store+0xcb/0xd0
[  191.382734]  kobj_attr_store+0x12/0x30
[  191.383211]  sysfs_kf_write+0x3e/0x50
[  191.383640]  kernfs_fop_write_iter+0x140/0x1d0
[  191.384106]  vfs_write+0x271/0x440
[  191.384521]  ksys_write+0x72/0xf0
[  191.384924]  __x64_sys_write+0x19/0x20
[  191.385313]  x64_sys_call+0x2b0/0x20b0
[  191.385736]  do_syscall_64+0x79/0x150
[  191.386146]  ? __mod_memcg_lruvec_state+0xe7/0x240
[  191.386676]  ? __lruvec_stat_mod_folio+0x79/0xb0
[  191.387124]  ? __pfx_lru_add+0x10/0x10
[  191.387515]  ? queued_spin_unlock+0x9/0x10
[  191.387937]  ? do_anonymous_page+0x33c/0xa00
[  191.388374]  ? __handle_mm_fault+0xcf3/0x1210
[  191.388805]  ? __count_memcg_events+0xbe/0x180
[  191.389235]  ? handle_mm_fault+0xae/0x300
[  191.389588]  ? do_user_addr_fault+0x559/0x800
[  191.390027]  ? irqentry_exit_to_user_mode+0x43/0x230
[  191.390525]  ? irqentry_exit+0x1d/0x30
[  191.390879]  ? exc_page_fault+0x86/0x160
[  191.391235]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  191.391745] RIP: 0033:0x7dbc4ff1c574
[  191.392111] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[  191.393412] RSP: 002b:00007ffd95a23ab8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  191.393990] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007dbc4ff1c574
[  191.394594] RDX: 0000000000000005 RSI: 00005a6eeadb0ce0 RDI: 0000000000000001
[  191.395215] RBP: 00007ffd95a23ae0 R08: 00007dbc50003b20 R09: 0000000000000000
[  191.395805] R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000000005
[  191.396404] R13: 00005a6eeadb0ce0 R14: 00007dbc500045c0 R15: 00007dbc50001ee0
[  191.396987]  </TASK>

To fix this, we explicitly set such mana debugfs variables to NULL after
debugfs_remove() is called.

Fixes: 6607c17 ("net: mana: Enable debugfs files for MANA device")
	Cc: [email protected]
	Signed-off-by: Shradha Gupta <[email protected]>
	Reviewed-by: Haiyang Zhang <[email protected]>
	Reviewed-by: Michal Kubiak <[email protected]>
Link: https://patch.msgid.link/1741688260-28922-1-git-send-email-shradhagupta@linux.microsoft.com
	Signed-off-by: Paolo Abeni <[email protected]>

(cherry picked from commit 3e64bb2)
	Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	drivers/net/ethernet/microsoft/mana/gdma_main.c
#	drivers/net/ethernet/microsoft/mana/mana_en.c
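
The fix described in the commit message above is the standard debugfs cleanup idiom; a minimal sketch of the pattern, with hypothetical names rather than the actual mana fields:

    #include <linux/debugfs.h>

    static struct dentry *example_dbg_dir;  /* hypothetical debugfs handle */

    static void example_cleanup_debugfs(void)
    {
            debugfs_remove(example_dbg_dir);
            /* Clear the stale pointer so a later shutdown path cannot hand
             * an already-freed dentry back to debugfs_remove().
             */
            example_dbg_dir = NULL;
    }

debugfs_remove() tolerates a NULL dentry, so clearing the pointer makes the cleanup path safe to reach twice.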
github-actions bot pushed a commit that referenced this pull request May 21, 2025
JIRA: https://issues.redhat.com/browse/RHEL-84571
Upstream Status: net.git commit 443041d
Conflicts: context mismatch as we don't have MPCapableSYNTXDrop - upstream
  commit 6982826 ("mptcp: fallback to TCP after SYN+MPC drops") and
  MPCapableSYNTXDisabled - upstream commit 27069e7 ("mptcp: disable
  active MPTCP in case of blackhole")

commit 3d04139
Author: Paolo Abeni <[email protected]>
Date:   Mon Oct 14 16:06:00 2024 +0200

    mptcp: prevent MPC handshake on port-based signal endpoints

    Syzkaller reported a lockdep splat:

      ============================================
      WARNING: possible recursive locking detected
      6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
      --------------------------------------------
      syz-executor364/5113 is trying to acquire lock:
      ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
      ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

      but task is already holding lock:
      ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
      ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

      other info that might help us debug this:
       Possible unsafe locking scenario:

             CPU0
             ----
        lock(k-slock-AF_INET);
        lock(k-slock-AF_INET);

       *** DEADLOCK ***

       May be due to missing lock nesting notation

      7 locks held by syz-executor364/5113:
       #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
       #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
       #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
       #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
       #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
       #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
       #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
       #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
       #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
       #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
       #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
       #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
       #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
       #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
       #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
       #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
       #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

      stack backtrace:
      CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:93 [inline]
       dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
       check_deadlock kernel/locking/lockdep.c:3061 [inline]
       validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
       __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
       __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
       _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
       spin_lock include/linux/spinlock.h:351 [inline]
       sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
       mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
       subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
       tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
       tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
       ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
       NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
       NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
       __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
       __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
       process_backlog+0x662/0x15b0 net/core/dev.c:6108
       __napi_poll+0xcb/0x490 net/core/dev.c:6772
       napi_poll net/core/dev.c:6841 [inline]
       net_rx_action+0x89b/0x1240 net/core/dev.c:6963
       handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
       do_softirq+0x11b/0x1e0 kernel/softirq.c:455
       </IRQ>
       <TASK>
       __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
       local_bh_enable include/linux/bottom_half.h:33 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
       __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
       dev_queue_xmit include/linux/netdevice.h:3105 [inline]
       neigh_hh_output include/net/neighbour.h:526 [inline]
       neigh_output include/net/neighbour.h:540 [inline]
       ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
       ip_local_out net/ipv4/ip_output.c:129 [inline]
       __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
       __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
       tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
       tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
       tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
       sk_backlog_rcv include/net/sock.h:1111 [inline]
       __release_sock+0x214/0x350 net/core/sock.c:3004
       release_sock+0x61/0x1f0 net/core/sock.c:3558
       mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
       mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg+0x1a6/0x270 net/socket.c:745
       ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
       ___sys_sendmsg net/socket.c:2651 [inline]
       __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
       __do_sys_sendmmsg net/socket.c:2766 [inline]
       __se_sys_sendmmsg net/socket.c:2763 [inline]
       __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f04fb13a6b9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
      RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
      RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
      R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
       </TASK>

    As noted by Cong Wang, the splat is false positive, but the code
    path leading to the report is an unexpected one: a client is
    attempting an MPC handshake towards the in-kernel listener created
    by the in-kernel PM for a port based signal endpoint.

    Such connections will never be accepted; many of them can fill the
    listener queue, preventing the creation of MPJ subflows via such a
    listener - its intended role.

    Explicitly detect this scenario at initial-syn time and drop the
    incoming MPC request.

    Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
    Cc: [email protected]
    Reported-by: [email protected]
    Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
    Cc: Cong Wang <[email protected]>
    Signed-off-by: Paolo Abeni <[email protected]>
    Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
    Reviewed-by: Mat Martineau <[email protected]>
    Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
    Link: https://patch.msgid.link/[email protected]
    Signed-off-by: Jakub Kicinski <[email protected]>

Signed-off-by: Davide Caratti <[email protected]>
PlaidCat added a commit that referenced this pull request May 21, 2025
github-actions bot pushed a commit that referenced this pull request May 27, 2025
…_USERCOPY=y crash

Borislav Petkov reported the following boot crash on x86-32,
with CONFIG_HARDENED_USERCOPY=y:

  |  usercopy: Kernel memory overwrite attempt detected to SLUB object 'task_struct' (offset 2112, size 160)!
  |  ...
  |  kernel BUG at mm/usercopy.c:102!

So the useroffset and usersize arguments are what control the allowed
window of copying in/out of the "task_struct" kmem cache:

        /* create a slab on which task_structs can be allocated */
        task_struct_whitelist(&useroffset, &usersize);
        task_struct_cachep = kmem_cache_create_usercopy("task_struct",
                        arch_task_struct_size, align,
                        SLAB_PANIC|SLAB_ACCOUNT,
                        useroffset, usersize, NULL);

task_struct_whitelist() positions this window based on the location of
the thread_struct within task_struct, and gets the arch-specific details
via arch_thread_struct_whitelist(offset, size):

	static void __init task_struct_whitelist(unsigned long *offset, unsigned long *size)
	{
		/* Fetch thread_struct whitelist for the architecture. */
		arch_thread_struct_whitelist(offset, size);

		/*
		 * Handle zero-sized whitelist or empty thread_struct, otherwise
		 * adjust offset to position of thread_struct in task_struct.
		 */
		if (unlikely(*size == 0))
			*offset = 0;
		else
			*offset += offsetof(struct task_struct, thread);
	}

Commit cb7ca40 ("x86/fpu: Make task_struct::thread constant size")
removed the logic for the window, leaving:

	static inline void
	arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
	{
		*offset = 0;
		*size = 0;
	}

So now there is no window that usercopy hardening will allow to be copied
in/out of task_struct.

But as reported above, there *is* a copy in copy_uabi_to_xstate(). (It
seems there are several, actually.)

	int copy_sigframe_from_user_to_xstate(struct task_struct *tsk,
					      const void __user *ubuf)
	{
		return copy_uabi_to_xstate(x86_task_fpu(tsk)->fpstate, NULL, ubuf, &tsk->thread.pkru);
	}

This appears to be writing into x86_task_fpu(tsk)->fpstate. With or
without CONFIG_X86_DEBUG_FPU, this resolves to:

	((struct fpu *)((void *)(task) + sizeof(*(task))))

i.e. the memory "after task_struct" is cast to "struct fpu", and then it
uses the "fpstate" pointer. How that pointer gets set looks to be
variable, but I think the one we care about here is:

        fpu->fpstate = &fpu->__fpstate;

And struct fpu::__fpstate says:

        struct fpstate                  __fpstate;
        /*
         * WARNING: '__fpstate' is dynamically-sized.  Do not put
         * anything after it here.
         */

So we're still dealing with a dynamically sized thing, even if it's not
within the literal struct task_struct -- it's still in the kmem cache,
though.

Looking at the kmem cache size, it has allocated "arch_task_struct_size"
bytes, which is calculated in fpu__init_task_struct_size():

        int task_size = sizeof(struct task_struct);

        task_size += sizeof(struct fpu);

        /*
         * Subtract off the static size of the register state.
         * It potentially has a bunch of padding.
         */
        task_size -= sizeof(union fpregs_state);

        /*
         * Add back the dynamically-calculated register state
         * size.
         */
        task_size += fpu_kernel_cfg.default_size;

        /*
         * We dynamically size 'struct fpu', so we require that
         * 'state' be at the end of 'it:
         */
        CHECK_MEMBER_AT_END_OF(struct fpu, __fpstate);

        arch_task_struct_size = task_size;

So, this is still copying out of the kmem cache for task_struct, and the
window seems unchanged (still fpu regs). This is what the window was
before:

	void fpu_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
	{
		*offset = offsetof(struct thread_struct, fpu.__fpstate.regs);
		*size = fpu_kernel_cfg.default_size;
	}

And the same commit I mentioned above removed it.

I think the misunderstanding is here:

  | The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed,
  | now that the FPU structure is not embedded in the task struct anymore, which
  | reduces text footprint a bit.

Yes, FPU is no longer in task_struct, but it IS in the kmem cache named
"task_struct", since the fpstate is still being allocated there.

Partially revert the earlier mentioned commit, along with a
recalculation of the fpstate regs location.

Fixes: cb7ca40 ("x86/fpu: Make task_struct::thread constant size")
Reported-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]/ # Discussion #1
Link: https://lore.kernel.org/r/202505041418.F47130C4C8@keescook             # Discussion #2
github-actions bot pushed a commit that referenced this pull request May 27, 2025
Move hctx debugfs/sysfs register out of freezing queue in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:

	#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
	#1 (fs_reclaim){+.+.}-{0:0}:
	#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs

And registering/un-registering hctx debugfs/sysfs does not require queue to
be frozen:

- hctx sysfs attributes show() are drained when removing kobject, and
  there isn't store() implementation for hctx sysfs attributes

- debugfs entry read() is drained too when removing debugfs directory,
  and there isn't write() implementation for hctx debugfs too

- so it is safe to register/unregister hctx sysfs/debugfs without
  freezing the queue, because the code paths change nothing and we just
  need to keep hctx live

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Nilay Shroff <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 27, 2025
…xit()

scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:

	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
	#4 (&q->elevator_lock){+.+.}-{4:4}:
	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
	#2 (fs_reclaim){+.+.}-{0:0}:
	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
	#0 (&q->debugfs_mutex){+.+.}-{4:4}:

Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().

Meantime add disk->rqos_state_mutex for covering wbt state change, which
matches the purpose more than ->elevator_lock.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Nilay Shroff <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 27, 2025
Amir Goldstein <[email protected]> says:

This adds a test for fanotify mount ns notifications inside userns [1].

While working on the test I ended up making lots of cleanups to reduce
build dependency on make headers_install.

These patches got rid of the dependency for my kvm setup for the
affected filesystems tests.

Building with TOOLS_INCLUDES dir was recommended by John Hubbard [2].

NOTE #1: these patches are based on a merge of vfs-6.16.mount
(changes wrappers.h) into v6.15-rc5 (changes mount-notify_test.c),
so if this cleanup is acceptable, we should probably setup a selftests
branch for 6.16, so that it can be used to test the fanotify patches.

NOTE #2: some of the defines in wrappers.h are left for overlayfs and
mount_setattr tests, which were not converted to use TOOLS_INCLUDES.
I did not want to mess with those tests.

* patches from https://lore.kernel.org/[email protected]:
  selftests/fs/mount-notify: add a test variant running inside userns
  selftests/filesystems: create setup_userns() helper
  selftests/filesystems: create get_unique_mnt_id() helper
  selftests/fs/mount-notify: build with tools include dir
  selftests/mount_settattr: remove duplicate syscall definitions
  selftests/pidfd: move syscall definitions into wrappers.h
  selftests/fs/statmount: build with tools include dir
  selftests/filesystems: move wrapper.h out of overlayfs subdir

Link: https://lore.kernel.org/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 28, 2025
ACPICA commit 1c28da2242783579d59767617121035dafba18c3

This was originally done in NetBSD:
NetBSD/src@b69d1ac
and is the correct alternative to the smattering of `memcpy`s I
previously contributed to this repository.

This also sidesteps the newly strict checks added in UBSAN:
llvm/llvm-project@7926744

Before this change we see the following UBSAN stack trace in Fuchsia:

  #0    0x000021afcfdeca5e in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:329 <platform-bus-x86.so>+0x6aca5e
  #1.2  0x000021982bc4af3c in ubsan_get_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:41 <libclang_rt.asan.so>+0x41f3c
  #1.1  0x000021982bc4af3c in maybe_print_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:51 <libclang_rt.asan.so>+0x41f3c
  #1    0x000021982bc4af3c in ~scoped_report() compiler-rt/lib/ubsan/ubsan_diag.cpp:395 <libclang_rt.asan.so>+0x41f3c
  #2    0x000021982bc4bb6f in handletype_mismatch_impl() compiler-rt/lib/ubsan/ubsan_handlers.cpp:137 <libclang_rt.asan.so>+0x42b6f
  #3    0x000021982bc4b723 in __ubsan_handle_type_mismatch_v1 compiler-rt/lib/ubsan/ubsan_handlers.cpp:142 <libclang_rt.asan.so>+0x42723
  #4    0x000021afcfdeca5e in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:329 <platform-bus-x86.so>+0x6aca5e
  #5    0x000021afcfdf2089 in acpi_rs_convert_aml_to_resource(struct acpi_resource*, union aml_resource*, struct acpi_rsconvert_info*) ../../third_party/acpica/source/components/resources/rsmisc.c:355 <platform-bus-x86.so>+0x6b2089
  #6    0x000021afcfded169 in acpi_rs_convert_aml_to_resources(u8*, u32, u32, u8, void**) ../../third_party/acpica/source/components/resources/rslist.c:137 <platform-bus-x86.so>+0x6ad169
  #7    0x000021afcfe2d24a in acpi_ut_walk_aml_resources(struct acpi_walk_state*, u8*, acpi_size, acpi_walk_aml_callback, void**) ../../third_party/acpica/source/components/utilities/utresrc.c:237 <platform-bus-x86.so>+0x6ed24a
  #8    0x000021afcfde66b7 in acpi_rs_create_resource_list(union acpi_operand_object*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rscreate.c:199 <platform-bus-x86.so>+0x6a66b7
  #9    0x000021afcfdf6979 in acpi_rs_get_method_data(acpi_handle, const char*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rsutils.c:770 <platform-bus-x86.so>+0x6b6979
  #10   0x000021afcfdf708f in acpi_walk_resources(acpi_handle, char*, acpi_walk_resource_callback, void*) ../../third_party/acpica/source/components/resources/rsxface.c:731 <platform-bus-x86.so>+0x6b708f
  #11   0x000021afcfa95dcf in acpi::acpi_impl::walk_resources(acpi::acpi_impl*, acpi_handle, const char*, acpi::Acpi::resources_callable) ../../src/devices/board/lib/acpi/acpi-impl.cc:41 <platform-bus-x86.so>+0x355dcf
  #12   0x000021afcfaa8278 in acpi::device_builder::gather_resources(acpi::device_builder*, acpi::Acpi*, fidl::any_arena&, acpi::Manager*, acpi::device_builder::gather_resources_callback) ../../src/devices/board/lib/acpi/device-builder.cc:84 <platform-bus-x86.so>+0x368278
  #13   0x000021afcfbddb87 in acpi::Manager::configure_discovered_devices(acpi::Manager*) ../../src/devices/board/lib/acpi/manager.cc:75 <platform-bus-x86.so>+0x49db87
  #14   0x000021afcf99091d in publish_acpi_devices(acpi::Manager*, zx_device_t*, zx_device_t*) ../../src/devices/board/drivers/x86/acpi-nswalk.cc:95 <platform-bus-x86.so>+0x25091d
  #15   0x000021afcf9c1d4e in x86::X86::do_init(x86::X86*) ../../src/devices/board/drivers/x86/x86.cc:60 <platform-bus-x86.so>+0x281d4e
  #16   0x000021afcf9e33ad in λ(x86::X86::ddk_init::(anon class)*) ../../src/devices/board/drivers/x86/x86.cc:77 <platform-bus-x86.so>+0x2a33ad
  #17   0x000021afcf9e313e in fit::internal::target<(lambda at../../src/devices/board/drivers/x86/x86.cc:76:19), false, false, std::__2::allocator<std::byte>, void>::invoke(void*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:183 <platform-bus-x86.so>+0x2a313e
  #18   0x000021afcfbab4c7 in fit::internal::function_base<16UL, false, void(), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<16UL, false, void (), std::__2::allocator<std::byte> >*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <platform-bus-x86.so>+0x46b4c7
  #19   0x000021afcfbab342 in fit::function_impl<16UL, false, void(), std::__2::allocator<std::byte>>::operator()(const fit::function_impl<16UL, false, void (), std::__2::allocator<std::byte> >*) ../../sdk/lib/fit/include/lib/fit/function.h:315 <platform-bus-x86.so>+0x46b342
  #20   0x000021afcfcd98c3 in async::internal::retained_task::Handler(async_dispatcher_t*, async_task_t*, zx_status_t) ../../sdk/lib/async/task.cc:24 <platform-bus-x86.so>+0x5998c3
  #21   0x00002290f9924616 in λ(const driver_runtime::Dispatcher::post_task::(anon class)*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/dispatcher.cc:789 <libdriver_runtime.so>+0x10a616
  #22   0x00002290f9924323 in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:788:7), true, false, std::__2::allocator<std::byte>, void, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int>::invoke(void*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0x10a323
  #23   0x00002290f9904b76 in fit::internal::function_base<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <libdriver_runtime.so>+0xeab76
  #24   0x00002290f9904831 in fit::callback_impl<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int), std::__2::allocator<std::byte>>::operator()(fit::callback_impl<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/function.h:471 <libdriver_runtime.so>+0xea831
  #25   0x00002290f98d5adc in driver_runtime::callback_request::Call(driver_runtime::callback_request*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/callback_request.h:74 <libdriver_runtime.so>+0xbbadc
  #26   0x00002290f98e1e58 in driver_runtime::Dispatcher::dispatch_callback(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >) ../../src/devices/bin/driver_runtime/dispatcher.cc:1248 <libdriver_runtime.so>+0xc7e58
  #27   0x00002290f98e4159 in driver_runtime::Dispatcher::dispatch_callbacks(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:1308 <libdriver_runtime.so>+0xca159
  #28   0x00002290f9918414 in λ(const driver_runtime::Dispatcher::create_with_adder::(anon class)*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:353 <libdriver_runtime.so>+0xfe414
  #29   0x00002290f991812d in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:351:7), true, false, std::__2::allocator<std::byte>, void, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>>::invoke(void*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0xfe12d
  #30   0x00002290f9906fc7 in fit::internal::function_base<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <libdriver_runtime.so>+0xecfc7
  #31   0x00002290f9906c66 in fit::function_impl<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte>>::operator()(const fit::function_impl<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/function.h:315 <libdriver_runtime.so>+0xecc66
  #32   0x00002290f98e73d9 in driver_runtime::Dispatcher::event_waiter::invoke_callback(driver_runtime::Dispatcher::event_waiter*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.h:543 <libdriver_runtime.so>+0xcd3d9
  #33   0x00002290f98e700d in driver_runtime::Dispatcher::event_waiter::handle_event(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/dispatcher.cc:1442 <libdriver_runtime.so>+0xcd00d
  #34   0x00002290f9918983 in async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event(async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>*, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/async_loop_owned_event_handler.h:59 <libdriver_runtime.so>+0xfe983
  #35   0x00002290f9918b9e in async::wait_method<async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>, &async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event>::call_handler(async_dispatcher_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../sdk/lib/async/include/lib/async/cpp/wait.h:201 <libdriver_runtime.so>+0xfeb9e
  #36   0x00002290f99bf509 in async_loop_dispatch_wait(async_loop_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../sdk/lib/async-loop/loop.c:394 <libdriver_runtime.so>+0x1a5509
  #37   0x00002290f99b9958 in async_loop_run_once(async_loop_t*, zx_time_t) ../../sdk/lib/async-loop/loop.c:343 <libdriver_runtime.so>+0x19f958
  #38   0x00002290f99b9247 in async_loop_run(async_loop_t*, zx_time_t, _Bool) ../../sdk/lib/async-loop/loop.c:301 <libdriver_runtime.so>+0x19f247
  #39   0x00002290f99ba962 in async_loop_run_thread(void*) ../../sdk/lib/async-loop/loop.c:860 <libdriver_runtime.so>+0x1a0962
  #40   0x000041afd176ef30 in start_c11(void*) ../../zircon/third_party/ulib/musl/pthread/pthread_create.c:63 <libc.so>+0x84f30
  #41   0x000041afd18a448d in thread_trampoline(uintptr_t, uintptr_t) ../../zircon/system/ulib/runtime/thread.cc:100 <libc.so>+0x1ba48d

Link: acpica/acpica@1c28da22
Signed-off-by: Rafael J. Wysocki <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Tamir Duberstein <[email protected]>
[ rjw: Pick up the tag from Tamir ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 28, 2025
Lockdep reports a possible circular locking dependency [1] when
cpu_hotplug_lock is acquired inside store_local_boost(), after
policy->rwsem has already been taken by store().

However, the boost update is strictly per-policy and does not
access shared state or iterate over all policies.

Since policy->rwsem is already held, this is enough to serialize
against concurrent topology changes for the current policy.

Remove the cpus_read_lock() to resolve the lockdep warning and
avoid unnecessary locking.

 [1]
 ======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc6-debug-gb01fc4eca73c #1 Not tainted
 ------------------------------------------------------
 power-profiles-/588 is trying to acquire lock:
 ffffffffb3a7d910 (cpu_hotplug_lock){++++}-{0:0}, at: store_local_boost+0x56/0xd0

 but task is already holding lock:
 ffff8b6e5a12c380 (&policy->rwsem){++++}-{4:4}, at: store+0x37/0x90

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (&policy->rwsem){++++}-{4:4}:
        down_write+0x29/0xb0
        cpufreq_online+0x7e8/0xa40
        cpufreq_add_dev+0x82/0xa0
        subsys_interface_register+0x148/0x160
        cpufreq_register_driver+0x15d/0x260
        amd_pstate_register_driver+0x36/0x90
        amd_pstate_init+0x1e7/0x270
        do_one_initcall+0x68/0x2b0
        kernel_init_freeable+0x231/0x270
        kernel_init+0x15/0x130
        ret_from_fork+0x2c/0x50
        ret_from_fork_asm+0x11/0x20

 -> #1 (subsys mutex#3){+.+.}-{4:4}:
        __mutex_lock+0xc2/0x930
        subsys_interface_register+0x7f/0x160
        cpufreq_register_driver+0x15d/0x260
        amd_pstate_register_driver+0x36/0x90
        amd_pstate_init+0x1e7/0x270
        do_one_initcall+0x68/0x2b0
        kernel_init_freeable+0x231/0x270
        kernel_init+0x15/0x130
        ret_from_fork+0x2c/0x50
        ret_from_fork_asm+0x11/0x20

 -> #0 (cpu_hotplug_lock){++++}-{0:0}:
        __lock_acquire+0x10ed/0x1850
        lock_acquire.part.0+0x69/0x1b0
        cpus_read_lock+0x2a/0xc0
        store_local_boost+0x56/0xd0
        store+0x50/0x90
        kernfs_fop_write_iter+0x132/0x200
        vfs_write+0x2b3/0x590
        ksys_write+0x74/0xf0
        do_syscall_64+0xbb/0x1d0
        entry_SYSCALL_64_after_hwframe+0x56/0x5e

Signed-off-by: Seyediman Seyedarab <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Rafael J. Wysocki <[email protected]>
github-actions bot pushed a commit that referenced this pull request May 29, 2025
JIRA: https://issues.redhat.com/browse/RHEL-79791
CVE: CVE-2024-56607

commit 8fac326
Author: Kalle Valo <[email protected]>
Date:   Mon Oct 7 19:59:27 2024 +0300

    wifi: ath12k: fix atomic calls in ath12k_mac_op_set_bitrate_mask()
    
    When I try to manually set bitrates:
    
    iw wlan0 set bitrates legacy-2.4 1
    
    I get sleeping from invalid context error, see below. Fix that by switching to
    use recently introduced ieee80211_iterate_stations_mtx().
    
    Do note that WCN6855 firmware is still crashing, I'm not sure if that firmware
    even supports bitrate WMI commands and should we consider disabling
    ath12k_mac_op_set_bitrate_mask() for WCN6855? But that's for another patch.
    
    BUG: sleeping function called from invalid context at drivers/net/wireless/ath/ath12k/wmi.c:420
    in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 2236, name: iw
    preempt_count: 0, expected: 0
    RCU nest depth: 1, expected: 0
    3 locks held by iw/2236:
     #0: ffffffffabc6f1d8 (cb_lock){++++}-{3:3}, at: genl_rcv+0x14/0x40
     #1: ffff888138410810 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: nl80211_pre_doit+0x54d/0x800 [cfg80211]
     #2: ffffffffab2cfaa0 (rcu_read_lock){....}-{1:2}, at: ieee80211_iterate_stations_atomic+0x2f/0x200 [mac80211]
    CPU: 3 UID: 0 PID: 2236 Comm: iw Not tainted 6.11.0-rc7-wt-ath+ #1772
    Hardware name: Intel(R) Client Systems NUC8i7HVK/NUC8i7HVB, BIOS HNKBLi70.86A.0067.2021.0528.1339 05/28/2021
    Call Trace:
     <TASK>
     dump_stack_lvl+0xa4/0xe0
     dump_stack+0x10/0x20
     __might_resched+0x363/0x5a0
     ? __alloc_skb+0x165/0x340
     __might_sleep+0xad/0x160
     ath12k_wmi_cmd_send+0xb1/0x3d0 [ath12k]
     ? ath12k_wmi_init_wcn7850+0xa40/0xa40 [ath12k]
     ? __netdev_alloc_skb+0x45/0x7b0
     ? __asan_memset+0x39/0x40
     ? ath12k_wmi_alloc_skb+0xf0/0x150 [ath12k]
     ? reacquire_held_locks+0x4d0/0x4d0
     ath12k_wmi_set_peer_param+0x340/0x5b0 [ath12k]
     ath12k_mac_disable_peer_fixed_rate+0xa3/0x110 [ath12k]
     ? ath12k_mac_vdev_stop+0x4f0/0x4f0 [ath12k]
     ieee80211_iterate_stations_atomic+0xd4/0x200 [mac80211]
     ath12k_mac_op_set_bitrate_mask+0x5d2/0x1080 [ath12k]
     ? ath12k_mac_vif_chan+0x320/0x320 [ath12k]
     drv_set_bitrate_mask+0x267/0x470 [mac80211]
     ieee80211_set_bitrate_mask+0x4cc/0x8a0 [mac80211]
     ? __this_cpu_preempt_check+0x13/0x20
     nl80211_set_tx_bitrate_mask+0x2bc/0x530 [cfg80211]
     ? nl80211_parse_tx_bitrate_mask+0x2320/0x2320 [cfg80211]
     ? trace_contention_end+0xef/0x140
     ? rtnl_unlock+0x9/0x10
     ? nl80211_pre_doit+0x557/0x800 [cfg80211]
     genl_family_rcv_msg_doit+0x1f0/0x2e0
     ? genl_family_rcv_msg_attrs_parse.isra.0+0x250/0x250
     ? ns_capable+0x57/0xd0
     genl_family_rcv_msg+0x34c/0x600
     ? genl_family_rcv_msg_dumpit+0x310/0x310
     ? __lock_acquire+0xc62/0x1de0
     ? he_set_mcs_mask.isra.0+0x8d0/0x8d0 [cfg80211]
     ? nl80211_parse_tx_bitrate_mask+0x2320/0x2320 [cfg80211]
     ? cfg80211_external_auth_request+0x690/0x690 [cfg80211]
     genl_rcv_msg+0xa0/0x130
     netlink_rcv_skb+0x14c/0x400
     ? genl_family_rcv_msg+0x600/0x600
     ? netlink_ack+0xd70/0xd70
     ? rwsem_optimistic_spin+0x4f0/0x4f0
     ? genl_rcv+0x14/0x40
     ? down_read_killable+0x580/0x580
     ? netlink_deliver_tap+0x13e/0x350
     ? __this_cpu_preempt_check+0x13/0x20
     genl_rcv+0x23/0x40
     netlink_unicast+0x45e/0x790
     ? netlink_attachskb+0x7f0/0x7f0
     netlink_sendmsg+0x7eb/0xdb0
     ? netlink_unicast+0x790/0x790
     ? __this_cpu_preempt_check+0x13/0x20
     ? selinux_socket_sendmsg+0x31/0x40
     ? netlink_unicast+0x790/0x790
     __sock_sendmsg+0xc9/0x160
     ____sys_sendmsg+0x620/0x990
     ? kernel_sendmsg+0x30/0x30
     ? __copy_msghdr+0x410/0x410
     ? __kasan_check_read+0x11/0x20
     ? mark_lock+0xe6/0x1470
     ___sys_sendmsg+0xe9/0x170
     ? copy_msghdr_from_user+0x120/0x120
     ? __lock_acquire+0xc62/0x1de0
     ? do_fault_around+0x2c6/0x4e0
     ? do_user_addr_fault+0x8c1/0xde0
     ? reacquire_held_locks+0x220/0x4d0
     ? do_user_addr_fault+0x8c1/0xde0
     ? __kasan_check_read+0x11/0x20
     ? __fdget+0x4e/0x1d0
     ? sockfd_lookup_light+0x1a/0x170
     __sys_sendmsg+0xd2/0x180
     ? __sys_sendmsg_sock+0x20/0x20
     ? reacquire_held_locks+0x4d0/0x4d0
     ? debug_smp_processor_id+0x17/0x20
     __x64_sys_sendmsg+0x72/0xb0
     ? lockdep_hardirqs_on+0x7d/0x100
     x64_sys_call+0x894/0x9f0
     do_syscall_64+0x64/0x130
     entry_SYSCALL_64_after_hwframe+0x4b/0x53
    RIP: 0033:0x7f230fe04807
    Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
    RSP: 002b:00007ffe996a7ea8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000556f9f9c3390 RCX: 00007f230fe04807
    RDX: 0000000000000000 RSI: 00007ffe996a7ee0 RDI: 0000000000000003
    RBP: 0000556f9f9c88c0 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000556f965ca190 R11: 0000000000000246 R12: 0000556f9f9c8780
    R13: 00007ffe996a7ee0 R14: 0000556f9f9c87d0 R15: 0000556f9f9c88c0
     </TASK>
    
    Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
    
    Signed-off-by: Kalle Valo <[email protected]>
    Link: https://patch.msgid.link/[email protected]
    Signed-off-by: Jeff Johnson <[email protected]>

Signed-off-by: Jose Ignacio Tornos Martinez <[email protected]>
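
For illustration, a minimal sketch of the change described above, assuming both
iterators share mac80211's (hw, iterator, data) signature; the data argument
shown here is an assumption:

    /* Before: RCU-protected iteration; the callback must not sleep. */
    ieee80211_iterate_stations_atomic(hw, ath12k_mac_disable_peer_fixed_rate, arvif);

    /* After: iteration under the wiphy mutex; sleeping WMI calls are allowed. */
    ieee80211_iterate_stations_mtx(hw, ath12k_mac_disable_peer_fixed_rate, arvif);
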
github-actions bot pushed a commit that referenced this pull request May 30, 2025
JIRA: https://issues.redhat.com/browse/RHEL-75959
Upstream Status: v6.14-rc1

commit e4b6b66
Author:     Aaro Koskinen <[email protected]>
AuthorDate: Thu Jan  2 20:19:51 2025 +0200
Commit:     Helge Deller <[email protected]>
CommitDate: Thu Jan  9 00:37:54 2025 +0100

    When using touchscreen and framebuffer, Nokia 770 crashes easily with:

        BUG: scheduling while atomic: irq/144-ads7846/82/0x00010000
        Modules linked in: usb_f_ecm g_ether usb_f_rndis u_ether libcomposite configfs omap_udc ohci_omap ohci_hcd
        CPU: 0 UID: 0 PID: 82 Comm: irq/144-ads7846 Not tainted 6.12.7-770 #2
        Hardware name: Nokia 770
        Call trace:
         unwind_backtrace from show_stack+0x10/0x14
         show_stack from dump_stack_lvl+0x54/0x5c
         dump_stack_lvl from __schedule_bug+0x50/0x70
         __schedule_bug from __schedule+0x4d4/0x5bc
         __schedule from schedule+0x34/0xa0
         schedule from schedule_preempt_disabled+0xc/0x10
         schedule_preempt_disabled from __mutex_lock.constprop.0+0x218/0x3b4
         __mutex_lock.constprop.0 from clk_prepare_lock+0x38/0xe4
         clk_prepare_lock from clk_set_rate+0x18/0x154
         clk_set_rate from sossi_read_data+0x4c/0x168
         sossi_read_data from hwa742_read_reg+0x5c/0x8c
         hwa742_read_reg from send_frame_handler+0xfc/0x300
         send_frame_handler from process_pending_requests+0x74/0xd0
         process_pending_requests from lcd_dma_irq_handler+0x50/0x74
         lcd_dma_irq_handler from __handle_irq_event_percpu+0x44/0x130
         __handle_irq_event_percpu from handle_irq_event+0x28/0x68
         handle_irq_event from handle_level_irq+0x9c/0x170
         handle_level_irq from generic_handle_domain_irq+0x2c/0x3c
         generic_handle_domain_irq from omap1_handle_irq+0x40/0x8c
         omap1_handle_irq from generic_handle_arch_irq+0x28/0x3c
         generic_handle_arch_irq from call_with_stack+0x1c/0x24
         call_with_stack from __irq_svc+0x94/0xa8
        Exception stack(0xc5255da0 to 0xc5255de8)
        5da0: 00000001 c22fc620 00000000 00000000 c08384a8 c106fc00 00000000 c240c248
        5dc0: c113a600 c3f6ec30 00000001 00000000 c22fc620 c5255df0 c22fc620 c0279a94
        5de0: 60000013 ffffffff
         __irq_svc from clk_prepare_lock+0x4c/0xe4
         clk_prepare_lock from clk_get_rate+0x10/0x74
         clk_get_rate from uwire_setup_transfer+0x40/0x180
         uwire_setup_transfer from spi_bitbang_transfer_one+0x2c/0x9c
         spi_bitbang_transfer_one from spi_transfer_one_message+0x2d0/0x664
         spi_transfer_one_message from __spi_pump_transfer_message+0x29c/0x498
         __spi_pump_transfer_message from __spi_sync+0x1f8/0x2e8
         __spi_sync from spi_sync+0x24/0x40
         spi_sync from ads7846_halfd_read_state+0x5c/0x1c0
         ads7846_halfd_read_state from ads7846_irq+0x58/0x348
         ads7846_irq from irq_thread_fn+0x1c/0x78
         irq_thread_fn from irq_thread+0x120/0x228
         irq_thread from kthread+0xc8/0xe8
         kthread from ret_from_fork+0x14/0x28

    As a quick fix, switch to a threaded IRQ which provides a stable system.

    Signed-off-by: Aaro Koskinen <[email protected]>
    Reviewed-by: Linus Walleij <[email protected]>
    Signed-off-by: Helge Deller <[email protected]>

Signed-off-by: Jocelyn Falempe <[email protected]>
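
For illustration, a rough sketch of the quick fix using the standard
request_threaded_irq() API; the handler and device names are assumptions, not
the actual driver code:

    /* Run the LCD DMA handler in thread context, where sleeping locks
     * such as the clk prepare mutex may legally be taken.
     */
    ret = request_threaded_irq(irq, NULL, lcd_dma_irq_handler,
                               IRQF_ONESHOT, "omapfb-lcd-dma", fbdev);
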
github-actions bot pushed a commit that referenced this pull request May 30, 2025
Intel TDX protects guest VMs from a malicious host and certain physical
attacks.  TDX introduces a new operation mode, Secure Arbitration Mode
(SEAM), to isolate and protect guest VMs.  A TDX guest VM runs in SEAM and,
unlike VMX, direct control and interaction with the guest by the host VMM
is not possible.  Instead, the Intel TDX Module, which also runs in SEAM,
provides a SEAMCALL API.

The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
VMLAUNCH/VMRESUME instructions.  When a guest VM-exit requires host VMM
interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).

Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER;
tdh_vp_enter() needs to be noinstr because VM entry in KVM is noinstr
as well, which is for two reasons:
* marking the area as CT_STATE_GUEST via guest_state_enter_irqoff() and
  guest_state_exit_irqoff()
* IRET must be avoided between VM-exit and NMI handling, in order to
  avoid prematurely releasing the NMI inhibit.

TDH.VP.ENTER is different from other SEAMCALLs in several ways: it
uses more arguments, and after it returns some host state may need to be
restored.  Therefore tdh_vp_enter() uses __seamcall_saved_ret() instead of
__seamcall_ret(); since it is the only caller of __seamcall_saved_ret(),
it can be made noinstr also.

TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
Additionally, XMM registers are not required for the existing Guest
Hypervisor Communication Interface and are handled by existing KVM code
should they be modified by the guest.

There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
Input #1 : Initial entry or following a previous async. TD Exit
Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
Output #1 : On Error (No TD Entry)
Output #2 : Async. Exits with a VMX Architectural Exit Reason
Output #3 : Async. Exits with a non-VMX TD Exit Status
Output #4 : Async. Exits with Cross-TD Exit Details
Output #5 : On TDCALL(TDG.VP.VMCALL)

Currently, to keep things simple, the wrapper function does not attempt
to support different formats, and just passes all the GPRs that could be
used.  The GPR values are held by KVM in the area set aside for guest
GPRs.  KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
or process results of tdh_vp_enter().

Therefore changing tdh_vp_enter() to use more complex argument formats
would also alter the way KVM code interacts with tdh_vp_enter().

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Adrian Hunter <[email protected]>
Message-ID: <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
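
For illustration, a minimal sketch of the wrapper shape described above; the
argument types and the vCPU-identification detail are assumptions:

    /* All guest GPR state travels in *args; __seamcall_saved_ret()
     * also restores host state after the SEAMCALL returns.
     */
    noinstr u64 tdh_vp_enter(u64 tdvpr_pa, struct tdx_module_args *args)
    {
            args->rcx = tdvpr_pa;   /* hypothetical: identifies the vCPU */
            return __seamcall_saved_ret(TDH_VP_ENTER, args);
    }
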
github-actions bot pushed a commit that referenced this pull request May 30, 2025
[ Upstream commit 1b9366c ]

If KFD release_work waits for GPU reset to complete, there is a WARNING:
possible circular locking dependency detected

  #2  kfd_create_process
        kfd_process_mutex
          flush kfd release work

  #1  kfd release work
        wait for amdgpu reset work

  #0  amdgpu_device_gpu_reset
        kgd2kfd_pre_reset
          kfd_process_mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock((work_completion)(&p->release_work));
                  lock((wq_completion)kfd_process_wq);
                  lock((work_completion)(&p->release_work));
   lock((wq_completion)amdgpu-reset-dev);

To fix this, make KFD process creation flush the release work outside
kfd_process_mutex.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
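
For illustration, a rough sketch of the reordering, with hypothetical names;
the point is that the flush happens before the mutex is taken:

    /* Flush any pending release work with no lock held, breaking the
     * #2 -> #1 -> #0 dependency shown above.
     */
    flush_workqueue(kfd_process_wq);

    mutex_lock(&kfd_processes_mutex);
    process = create_process(thread);       /* hypothetical helper */
    mutex_unlock(&kfd_processes_mutex);
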
github-actions bot pushed a commit that referenced this pull request May 30, 2025
[ Upstream commit 88f7f56 ]

When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().

An example from v5.4; a similar problem also exists upstream:

    crash> bt 2091206
    PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
     #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
     #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
     #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
     #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
     #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
     #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
     #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
     #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
     #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
     #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
    #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
    #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
    #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
    #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
    #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
    #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
    #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
    #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
    #18 [ffff800084a2fe70] kthread at ffff800040118de4

After commit 2def284 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.

Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().

Signed-off-by: Jinliang Zheng <[email protected]>
Reviewed-by: Tianxiang Peng <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
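
For illustration, a minimal sketch of the conditional flag propagation; the
exact condition and the surrounding code in __send_empty_flush() are
assumptions:

    flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
    /* Hypothetical: inherit REQ_IDLE so wbt_should_throttle() skips
     * the flush, and the log I/O behind it is not throttled.
     */
    if (orig_bio->bi_opf & REQ_IDLE)
            flush_bio.bi_opf |= REQ_IDLE;
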
github-actions bot pushed a commit that referenced this pull request May 31, 2025
…mage

WARNING: CPU: 1 PID: 9426 at fs/inode.c:417 drop_nlink+0xac/0xd0
home/cc/linux/fs/inode.c:417
Modules linked in:
CPU: 1 UID: 0 PID: 9426 Comm: syz-executor568 Not tainted
6.14.0-12627-g94d471a4f428 #2 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:drop_nlink+0xac/0xd0 home/cc/linux/fs/inode.c:417
Code: 48 8b 5d 28 be 08 00 00 00 48 8d bb 70 07 00 00 e8 f9 67 e6 ff
f0 48 ff 83 70 07 00 00 5b 5d e9 9a 12 82 ff e8 95 12 82 ff 90
<0f> 0b 90 c7 45 48 ff ff ff ff 5b 5d e9 83 12 82 ff e8 fe 5f e6
ff
RSP: 0018:ffffc900026b7c28 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8239710f
RDX: ffff888041345a00 RSI: ffffffff8239717b RDI: 0000000000000005
RBP: ffff888054509ad0 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff9ab36f08 R12: ffff88804bb40000
R13: ffff8880545091e0 R14: 0000000000008000 R15: ffff8880545091e0
FS:  000055555d0c5880(0000) GS:ffff8880eb3e3000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f915c55b178 CR3: 0000000050d20000 CR4: 0000000000352ef0
Call Trace:
 <TASK>
 f2fs_i_links_write home/cc/linux/fs/f2fs/f2fs.h:3194 [inline]
 f2fs_drop_nlink+0xd1/0x3c0 home/cc/linux/fs/f2fs/dir.c:845
 f2fs_delete_entry+0x542/0x1450 home/cc/linux/fs/f2fs/dir.c:909
 f2fs_unlink+0x45c/0x890 home/cc/linux/fs/f2fs/namei.c:581
 vfs_unlink+0x2fb/0x9b0 home/cc/linux/fs/namei.c:4544
 do_unlinkat+0x4c5/0x6a0 home/cc/linux/fs/namei.c:4608
 __do_sys_unlink home/cc/linux/fs/namei.c:4654 [inline]
 __se_sys_unlink home/cc/linux/fs/namei.c:4652 [inline]
 __x64_sys_unlink+0xc5/0x110 home/cc/linux/fs/namei.c:4652
 do_syscall_x64 home/cc/linux/arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc7/0x250 home/cc/linux/arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb3d092324b
Code: 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66
2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 57 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01
48
RSP: 002b:00007ffdc232d938 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb3d092324b
RDX: 00007ffdc232d960 RSI: 00007ffdc232d960 RDI: 00007ffdc232d9f0
RBP: 00007ffdc232d9f0 R08: 0000000000000001 R09: 00007ffdc232d7c0
R10: 00000000fffffffd R11: 0000000000000206 R12: 00007ffdc232eaf0
R13: 000055555d0cebb0 R14: 00007ffdc232d958 R15: 0000000000000001
 </TASK>

Cc: [email protected]
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 1, 2025
Running a modified trace-cmd record --nosplice where it does a mmap of the
ring buffer when '--nosplice' is set, caused the following lockdep splat:

 ======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
 ------------------------------------------------------
 trace-cmd/1113 is trying to acquire lock:
 ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70

 but task is already holding lock:
 ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0xcf/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #4 (&mm->mmap_lock){++++}-{4:4}:
        __might_fault+0xa5/0x110
        _copy_to_user+0x22/0x80
        _perf_ioctl+0x61b/0x1b70
        perf_ioctl+0x62/0x90
        __x64_sys_ioctl+0x134/0x190
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #3 (&cpuctx_mutex){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0x325/0x7c0
        perf_event_init+0x52a/0x5b0
        start_kernel+0x263/0x3e0
        x86_64_start_reservations+0x24/0x30
        x86_64_start_kernel+0x95/0xa0
        common_startup_64+0x13e/0x141

 -> #2 (pmus_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0xb7/0x7c0
        cpuhp_invoke_callback+0x2c0/0x1030
        __cpuhp_invoke_callback_range+0xbf/0x1f0
        _cpu_up+0x2e7/0x690
        cpu_up+0x117/0x170
        cpuhp_bringup_mask+0xd5/0x120
        bringup_nonboot_cpus+0x13d/0x170
        smp_init+0x2b/0xf0
        kernel_init_freeable+0x441/0x6d0
        kernel_init+0x1e/0x160
        ret_from_fork+0x34/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #1 (cpu_hotplug_lock){++++}-{0:0}:
        cpus_read_lock+0x2a/0xd0
        ring_buffer_resize+0x610/0x14e0
        __tracing_resize_ring_buffer.part.0+0x42/0x120
        tracing_set_tracer+0x7bd/0xa80
        tracing_set_trace_write+0x132/0x1e0
        vfs_write+0x21c/0xe80
        ksys_write+0xf9/0x1c0
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #0 (&buffer->mutex){+.+.}-{4:4}:
        __lock_acquire+0x1405/0x2210
        lock_acquire+0x174/0x310
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0x11c/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 other info that might help us debug this:

 Chain exists of:
   &buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&cpu_buffer->mapping_lock);
                                lock(&mm->mmap_lock);
                                lock(&cpu_buffer->mapping_lock);
   lock(&buffer->mutex);

  *** DEADLOCK ***

 2 locks held by trace-cmd/1113:
  #0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
  #1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 stack backtrace:
 CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6e/0xa0
  print_circular_bug.cold+0x178/0x1be
  check_noncircular+0x146/0x160
  __lock_acquire+0x1405/0x2210
  lock_acquire+0x174/0x310
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x169/0x18c0
  __mutex_lock+0x192/0x18c0
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? function_trace_call+0x296/0x370
  ? __pfx___mutex_lock+0x10/0x10
  ? __pfx_function_trace_call+0x10/0x10
  ? __pfx___mutex_lock+0x10/0x10
  ? _raw_spin_unlock+0x2d/0x50
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x5/0x18c0
  ring_buffer_map+0x11c/0xe70
  ? do_raw_spin_lock+0x12d/0x270
  ? find_held_lock+0x2b/0x80
  ? _raw_spin_unlock+0x2d/0x50
  ? rcu_is_watching+0x15/0xb0
  ? _raw_spin_unlock+0x2d/0x50
  ? trace_preempt_on+0xd0/0x110
  tracing_buffers_mmap+0x1c4/0x3b0
  __mmap_region+0xd8d/0x1f70
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx___mmap_region+0x10/0x10
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? bpf_lsm_mmap_addr+0x4/0x10
  ? security_mmap_addr+0x46/0xd0
  ? lock_is_held_type+0xd9/0x130
  do_mmap+0x9d7/0x1010
  ? 0xffffffffc0370095
  ? __pfx_do_mmap+0x10/0x10
  vm_mmap_pgoff+0x20b/0x390
  ? __pfx_vm_mmap_pgoff+0x10/0x10
  ? 0xffffffffc0370095
  ksys_mmap_pgoff+0x2e9/0x440
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7fb0963a7de2
 Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
 RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
 RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
 RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
  </TASK>

The issue is that cpus_read_lock() is taken within buffer->mutex. The
memory-mapped pages are taken with the mmap_lock held. The buffer->mutex
is taken within the cpu_buffer->mapping_lock. There's quite a chain with
all these locks, and the deadlock can be fixed by moving the
cpus_read_lock() outside the acquisition of the buffer->mutex.
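
For illustration, a minimal sketch of that reordering; the surrounding
function body is elided:

    /* Take cpu_hotplug_lock before buffer->mutex so the -> #1 link in
     * the chain above can no longer form.
     */
    cpus_read_lock();
    mutex_lock(&buffer->mutex);
    /* ... resize / map work ... */
    mutex_unlock(&buffer->mutex);
    cpus_read_unlock();
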

Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Vincent Donnefort <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: 117c392 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 3, 2025
Restore KVM's handling of a NULL kvm_x86_ops.mem_enc_ioctl, as the hook is
NULL on SVM when CONFIG_KVM_AMD_SEV=n, and TDX will soon follow suit.

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/kvm-x86-ops.h:130 kvm_x86_vendor_init+0x178b/0x18e0
  Modules linked in:
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.15.0-rc2-dc1aead1a985-sink-vm #2 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_x86_vendor_init+0x178b/0x18e0
  Call Trace:
   <TASK>
   svm_init+0x2e/0x60
   do_one_initcall+0x56/0x290
   kernel_init_freeable+0x192/0x1e0
   kernel_init+0x16/0x130
   ret_from_fork+0x30/0x50
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

Opportunistically drop the superfluous curly braces.
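
For illustration, a minimal sketch of the restored NULL check; the ioctl
plumbing around it is assumed and the static-call details are omitted:

    case KVM_MEMORY_ENCRYPT_OP:
            r = -ENOTTY;
            if (!kvm_x86_ops.mem_enc_ioctl)
                    goto out;
            r = kvm_x86_ops.mem_enc_ioctl(kvm, argp);
            break;
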

Link: https://lore.kernel.org/all/[email protected]
Fixes: b2aaf38 ("KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 3, 2025
Despite the fact that several lockdep-related checks are skipped when
calling trylock* versions of the locking primitives, for example
mutex_trylock, each time the mutex is acquired, a held_lock is still
placed onto the lockdep stack by __lock_acquire() which is called
regardless of whether the trylock* or regular locking API was used.

This means that if the caller successfully acquires more than
MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
lockdep will still complain that the maximum depth of the held lock stack
has been reached and disable itself.

For example, the following error currently occurs in the ARM version
of KVM, once the code tries to lock all vCPUs of a VM configured with more
than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
systems, where having more than 48 CPUs is common, and it's also common to
run VMs that have vCPU counts approaching that number:

[  328.171264] BUG: MAX_LOCK_DEPTH too low!
[  328.175227] turning off the locking correctness validator.
[  328.180726] Please attach the output of /proc/lock_stat to the bug report
[  328.187531] depth: 48  max: 48!
[  328.190678] 48 locks held by qemu-kvm/11664:
[  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

Luckily, in all instances that require locking all vCPUs, 'kvm->lock' is
taken a priori, and that fact makes it possible to use a little-known
lockdep feature called a 'nest_lock' to avoid this warning and the
subsequent lockdep self-disablement.

Providing a 'nest_lock' to lockdep's lock_acquire() causes lockdep to
detect that the top of the held-lock stack contains a lock of the same
class, and to increment that lock's reference counter instead of pushing
a new held_lock item onto the stack.

See __lock_acquire for more information.
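
For illustration, a minimal sketch of the nest_lock pattern using the
standard mutex_lock_nest_lock() API; the loop body is an assumption:

    /* kvm->lock is already held; telling lockdep that each vcpu->mutex
     * nests under it keeps the held-lock stack at a constant depth.
     */
    mutex_lock(&kvm->lock);
    kvm_for_each_vcpu(i, vcpu, kvm)
            mutex_lock_nest_lock(&vcpu->mutex, &kvm->lock);
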

Signed-off-by: Maxim Levitsky <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 3, 2025
Use kvm_trylock_all_vcpus instead of a custom implementation when locking
all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.
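
For illustration, a sketch of the intended call pattern, assuming
kvm_trylock_all_vcpus() returns 0 on success and nonzero on failure:

    lockdep_assert_held(&kvm->lock);
    if (kvm_trylock_all_vcpus(kvm))
            return -EBUSY;          /* hypothetical error choice */
    /* ... operate on all vCPUs ... */
    kvm_unlock_all_vcpus(kvm);
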

This fixes the following false lockdep warning:

[  328.171264] BUG: MAX_LOCK_DEPTH too low!
[  328.175227] turning off the locking correctness validator.
[  328.180726] Please attach the output of /proc/lock_stat to the bug report
[  328.187531] depth: 48  max: 48!
[  328.190678] 48 locks held by qemu-kvm/11664:
[  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-83595

commit 9730763
Author: Nilay Shroff <[email protected]>
Date:   Wed Mar 19 16:23:46 2025 +0530

    block: correct locking order for protecting blk-wbt parameters

    The commit 245618f8e45f ("block: protect wbt_lat_usec using
    q->elevator_lock") introduced q->elevator_lock to protect updates
    to blk-wbt parameters when writing to the sysfs attribute
    wbt_lat_usec and the cgroup attribute io.cost.qos.  However, both
    these attributes also acquire q->rq_qos_mutex, leading to the
    following lockdep warning:

    ======================================================
    WARNING: possible circular locking dependency detected
    6.14.0-rc5+ #138 Not tainted
    ------------------------------------------------------
    bash/5902 is trying to acquire lock:
    c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238

    but task is already holding lock:
    c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->elevator_lock){+.+.}-{4:4}:
            __mutex_lock+0xf0/0xa58
            ioc_qos_write+0x16c/0x85c
            cgroup_file_write+0xc4/0x32c
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
            __lock_acquire+0x1b6c/0x2ae0
            lock_acquire+0x140/0x430
            __mutex_lock+0xf0/0xa58
            wbt_init+0x164/0x238
            queue_wb_lat_store+0x1dc/0x20c
            queue_attr_store+0x12c/0x164
            sysfs_kf_write+0x6c/0xb0
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    other info that might help us debug this:

        Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
        lock(&q->elevator_lock);
                                    lock(&q->rq_qos_mutex);
                                    lock(&q->elevator_lock);
        lock(&q->rq_qos_mutex);

        *** DEADLOCK ***

    6 locks held by bash/5902:
        #0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
        #1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c
        #2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c
        #3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    stack backtrace:
    CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138
    Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
    Call Trace:
    [c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable)
    [c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604
    [c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c
    [c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0
    [c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430
    [c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58
    [c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238
    [c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c
    [c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164
    [c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0
    [c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c
    [c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584
    [c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140
    [c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360
    [c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

    From the above log it's apparent that the method which writes to the sysfs
    attr wbt_lat_usec acquires q->elevator_lock first and then q->rq_qos_mutex,
    whereas the method which writes to io.cost.qos acquires q->rq_qos_mutex
    first and then q->elevator_lock. This could potentially cause a deadlock.

    A closer look at ioc_qos_write shows that correcting the lock order is
    non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev
    and released in blkg_conf_exit. The function blkg_conf_open_bdev is
    responsible for parsing user input and finding the corresponding block
    device (bdev) from the user provided major:minor number.

    Since we do not know the bdev until blkg_conf_open_bdev completes, we
    cannot simply move the q->elevator_lock acquisition before
    blkg_conf_open_bdev. To address this, we introduce new helpers
    blkg_conf_open_bdev_frozen and blkg_conf_exit_frozen, which are wrappers
    around blkg_conf_open_bdev and blkg_conf_exit respectively. The helper
    blkg_conf_open_bdev_frozen is similar to blkg_conf_open_bdev, but
    additionally freezes the queue, acquires q->elevator_lock and ensures the
    correct locking order is followed between q->elevator_lock and
    q->rq_qos_mutex. Similarly, the helper blkg_conf_exit_frozen, in addition
    to unfreezing the queue, ensures that the locks are released in the
    correct order.

    By using these helpers, now we maintain the same locking order in all
    code paths where we update blk-wbt parameters.
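
    For illustration, a sketch of the intended usage in ioc_qos_write();
    the helper signatures are assumptions based on the description above:

        unsigned long memflags;

        /* Freezes the queue, takes q->elevator_lock, then q->rq_qos_mutex. */
        memflags = blkg_conf_open_bdev_frozen(&ctx);
        /* ... update blk-wbt parameters ... */
        blkg_conf_exit_frozen(&ctx, memflags);  /* releases in reverse order */
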

    Fixes: 245618f ("block: protect wbt_lat_usec using q->elevator_lock")
    Reported-by: kernel test robot <[email protected]>
    Closes: https://lore.kernel.org/oe-lkp/[email protected]
    Signed-off-by: Nilay Shroff <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-92762
Upstream Status: kernel/git/torvalds/linux.git

commit 88f7f56
Author: Jinliang Zheng <[email protected]>
Date:   Thu Feb 20 19:20:14 2025 +0800

    dm: fix unconditional IO throttle caused by REQ_PREFLUSH

    When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
    generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
    which causes the flush_bio to be throttled by wbt_wait().

    An example from v5.4; a similar problem also exists upstream:

        crash> bt 2091206
        PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
         #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
         #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
         #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
         #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
         #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
         #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
         #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
         #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
         #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
         #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
        #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
        #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
        #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
        #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
        #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
        #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
        #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
        #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
        #18 [ffff800084a2fe70] kthread at ffff800040118de4

    After commit 2def284 ("xfs: don't allow log IO to be throttled"),
    the metadata submitted by xlog_write_iclog() should not be throttled.
    But due to the existence of the dm layer, throttling flush_bio indirectly
    causes the metadata bio to be throttled.

    Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
    wbt_should_throttle() return false to avoid wbt_wait().

    Signed-off-by: Jinliang Zheng <[email protected]>
    Reviewed-by: Tianxiang Peng <[email protected]>
    Reviewed-by: Hao Peng <[email protected]>
    Signed-off-by: Mikulas Patocka <[email protected]>

Signed-off-by: Benjamin Marzinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-92996

commit 053f3ff
Author: Dave Marquardt <[email protected]>
Date:   Wed Apr 2 10:44:03 2025 -0500

    net: ibmveth: make veth_pool_store stop hanging

    v2:
    - Created a single error handling unlock and exit in veth_pool_store
    - Greatly expanded commit message with previous explanatory-only text

    Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
    ibmveth_close and ibmveth_open, preventing multiple calls in a row to
    napi_disable.

    Background: Two (or more) threads could call veth_pool_store through
    writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
    with a little shell script. This causes a hang.

    I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
    kernel. I ran this test again and saw:

        Setting pool0/active to 0
        Setting pool1/active to 1
        [   73.911067][ T4365] ibmveth 30000002 eth0: close starting
        Setting pool1/active to 1
        Setting pool1/active to 0
        [   73.911367][ T4366] ibmveth 30000002 eth0: close starting
        [   73.916056][ T4365] ibmveth 30000002 eth0: close complete
        [   73.916064][ T4365] ibmveth 30000002 eth0: open starting
        [  110.808564][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  230.808495][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  243.683786][  T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
        [  243.683827][  T123]       Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
        [  243.683833][  T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [  243.683838][  T123] task:stress.sh       state:D stack:28096 pid:4365  tgid:4365  ppid:4364   task_flags:0x400040 flags:0x00042000
        [  243.683852][  T123] Call Trace:
        [  243.683857][  T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
        [  243.683868][  T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
        [  243.683878][  T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
        [  243.683888][  T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
        [  243.683896][  T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
        [  243.683904][  T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
        [  243.683913][  T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
        [  243.683921][  T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
        [  243.683928][  T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
        [  243.683936][  T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
        [  243.683944][  T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
        [  243.683951][  T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
        [  243.683958][  T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
        [  243.683966][  T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
        [  243.683973][  T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
        ...
        [  243.684087][  T123] Showing all locks held in the system:
        [  243.684095][  T123] 1 lock held by khungtaskd/123:
        [  243.684099][  T123]  #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
        [  243.684114][  T123] 4 locks held by stress.sh/4365:
        [  243.684119][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684132][  T123]  #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684143][  T123]  #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684155][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
        [  243.684166][  T123] 5 locks held by stress.sh/4366:
        [  243.684170][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684183][  T123]  #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684194][  T123]  #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684205][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
        [  243.684216][  T123]  #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0

    From the ibmveth debug, two threads are calling veth_pool_store, which
    calls ibmveth_close and ibmveth_open. Here's the sequence:

      T4365             T4366
      ----------------- ----------------- ---------
      veth_pool_store   veth_pool_store
                        ibmveth_close
      ibmveth_close
      napi_disable
                        napi_disable
      ibmveth_open
      napi_enable                         <- HANG

    ibmveth_close calls napi_disable at the top and ibmveth_open calls
    napi_enable at the top.

    https://docs.kernel.org/networking/napi.html says

      The control APIs are not idempotent. Control API calls are safe
      against concurrent use of datapath APIs but an incorrect sequence of
      control API calls may result in crashes, deadlocks, or race
      conditions. For example, calling napi_disable() multiple times in a
      row will deadlock.

    In the normal open and close paths, rtnl_mutex is acquired to prevent
    other callers. This is missing from veth_pool_store. Using rtnl_mutex in
    veth_pool_store fixes these hangs.

    Signed-off-by: Dave Marquardt <[email protected]>
    Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically")
    Reviewed-by: Nick Child <[email protected]>
    Reviewed-by: Simon Horman <[email protected]>
    Link: https://patch.msgid.link/[email protected]
    Signed-off-by: Jakub Kicinski <[email protected]>

Signed-off-by: Mamatha Inamdar <[email protected]>
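
For illustration, a minimal sketch of the fix; the function signature is
abbreviated and error handling is omitted:

    static ssize_t veth_pool_store(struct kobject *kobj, struct attribute *attr,
                                   const char *buf, size_t count)
    {
            /* Serialize with other writers and with the normal
             * open/close paths, which already hold rtnl_mutex.
             */
            rtnl_lock();
            if (netif_running(netdev)) {
                    ibmveth_close(netdev);
                    /* ... update the buffer pool ... */
                    rc = ibmveth_open(netdev);
            }
            rtnl_unlock();
            return rc ? rc : count;
    }
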
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/merge_requests/956

Description: Updates for ibmveth pool store

JIRA: https://issues.redhat.com/browse/RHEL-92996

Build Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=67710112

Tested: Verified Brew build test kernel RPMs and confirmed issue is resovled

Signed-off-by: Mamatha Inamdar <[email protected]>

commit 053f3ff
Author: Dave Marquardt <[email protected]>
Date:   Wed Apr 2 10:44:03 2025 -0500

    net: ibmveth: make veth_pool_store stop hanging

    v2:
    - Created a single error handling unlock and exit in veth_pool_store
    - Greatly expanded commit message with previous explanatory-only text

    Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
    ibmveth_close and ibmveth_open, preventing multiple calls in a row to
    napi_disable.

    Background: Two (or more) threads could call veth_pool_store through
    writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
    with a little shell script. This causes a hang.

    I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
    kernel. I ran this test again and saw:

        Setting pool0/active to 0
        Setting pool1/active to 1
        [   73.911067][ T4365] ibmveth 30000002 eth0: close starting
        Setting pool1/active to 1
        Setting pool1/active to 0
        [   73.911367][ T4366] ibmveth 30000002 eth0: close starting
        [   73.916056][ T4365] ibmveth 30000002 eth0: close complete
        [   73.916064][ T4365] ibmveth 30000002 eth0: open starting
        [  110.808564][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  230.808495][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
        [  243.683786][  T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
        [  243.683827][  T123]       Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
        [  243.683833][  T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [  243.683838][  T123] task:stress.sh       state:D stack:28096 pid:4365  tgid:4365  ppid:4364   task_flags:0x400040 flags:0x00042000
        [  243.683852][  T123] Call Trace:
        [  243.683857][  T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
        [  243.683868][  T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
        [  243.683878][  T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
        [  243.683888][  T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
        [  243.683896][  T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
        [  243.683904][  T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
        [  243.683913][  T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
        [  243.683921][  T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
        [  243.683928][  T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
        [  243.683936][  T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
        [  243.683944][  T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
        [  243.683951][  T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
        [  243.683958][  T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
        [  243.683966][  T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
        [  243.683973][  T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
        ...
        [  243.684087][  T123] Showing all locks held in the system:
        [  243.684095][  T123] 1 lock held by khungtaskd/123:
        [  243.684099][  T123]  #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
        [  243.684114][  T123] 4 locks held by stress.sh/4365:
        [  243.684119][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684132][  T123]  #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684143][  T123]  #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684155][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
        [  243.684166][  T123] 5 locks held by stress.sh/4366:
        [  243.684170][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
        [  243.684183][  T123]  #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
        [  243.684194][  T123]  #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
        [  243.684205][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
        [  243.684216][  T123]  #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0

    From the ibmveth debug, two threads are calling veth_pool_store, which
    calls ibmveth_close and ibmveth_open. Here's the sequence:

      T4365             T4366
      ----------------- ----------------- ---------
      veth_pool_store   veth_pool_store
                        ibmveth_close
      ibmveth_close
      napi_disable
                        napi_disable
      ibmveth_open
      napi_enable                         <- HANG

    ibmveth_close calls napi_disable at the top and ibmveth_open calls
    napi_enable at the top.

    https://docs.kernel.org/networking/napi.html says

      The control APIs are not idempotent. Control API calls are safe
      against concurrent use of datapath APIs but an incorrect sequence of
      control API calls may result in crashes, deadlocks, or race
      conditions. For example, calling napi_disable() multiple times in a
      row will deadlock.

    In the normal open and close paths, rtnl_mutex is acquired to prevent
    other callers. This is missing from veth_pool_store. Using rtnl_mutex in
    veth_pool_store fixes these hangs.

    Signed-off-by: Dave Marquardt <[email protected]>
    Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically")
    Reviewed-by: Nick Child <[email protected]>
    Reviewed-by: Simon Horman <[email protected]>
    Link: https://patch.msgid.link/[email protected]
    Signed-off-by: Jakub Kicinski <[email protected]>

Signed-off-by: Mamatha Inamdar <[email protected]>

Approved-by: Steve Best <[email protected]>
Approved-by: Michal Schmidt <[email protected]>
Approved-by: CKI KWF Bot <[email protected]>

Merged-by: Julio Faracco <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-73484

commit e40b801
Author: D. Wythe <[email protected]>
Date:   Thu Feb 16 14:37:36 2023 +0800

    net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link()

    There is a certain chance of triggering the following panic:

    PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
     #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
     #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
     #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
     #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
     #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
     #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
     #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
        [exception RIP: ib_alloc_mr+19]
        RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
        RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
        RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
        RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
        R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
        R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
     #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
     #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
     #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]

    The reason here is that when the server tries to create a second link,
    smc_llc_srv_add_link() has no protection and may add a new link to the
    link group. This breaks the security environment protected by
    llc_conf_mutex.
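
    For illustration, a minimal sketch of the fix; the exact call site and
    second argument are assumptions:

        mutex_lock(&lgr->llc_conf_mutex);
        rc = smc_llc_srv_add_link(link, NULL);
        mutex_unlock(&lgr->llc_conf_mutex);
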

    Fixes: 2d2209f ("net/smc: first part of add link processing as SMC server")
    Signed-off-by: D. Wythe <[email protected]>
    Reviewed-by: Larysa Zaremba <[email protected]>
    Reviewed-by: Wenjia Zhang <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

Signed-off-by: Mete Durlu <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-92761
Upstream Status: kernel/git/torvalds/linux.git

commit 88f7f56
Author: Jinliang Zheng <[email protected]>
Date:   Thu Feb 20 19:20:14 2025 +0800

    dm: fix unconditional IO throttle caused by REQ_PREFLUSH

    When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
    generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
    which causes the flush_bio to be throttled by wbt_wait().

    An example from v5.4; a similar problem also exists upstream:

        crash> bt 2091206
        PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
         #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
         #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
         #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
         #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
         #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
         #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
         #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
         #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
         #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
         #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
        #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
        #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
        #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
        #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
        #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
        #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
        #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
        #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
        #18 [ffff800084a2fe70] kthread at ffff800040118de4

    After commit 2def284 ("xfs: don't allow log IO to be throttled"),
    the metadata submitted by xlog_write_iclog() should not be throttled.
    But due to the existence of the dm layer, throttling flush_bio indirectly
    causes the metadata bio to be throttled.

    Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
    wbt_should_throttle() return false to avoid wbt_wait().

    Signed-off-by: Jinliang Zheng <[email protected]>
    Reviewed-by: Tianxiang Peng <[email protected]>
    Reviewed-by: Hao Peng <[email protected]>
    Signed-off-by: Mikulas Patocka <[email protected]>

Signed-off-by: Benjamin Marzinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-77936

upstream
========
commit 2adbf53
Author: Athira Rajeev <[email protected]>
Date: Mon Dec 23 19:28:13 2024 +0530

description
===========
When the kernel is built without debuginfo, running 'perf record' with
--off-cpu results in a segfault, as below:

   ./perf record --off-cpu -e dummy sleep 1
   libbpf: kernel BTF is missing at '/sys/kernel/btf/vmlinux', was CONFIG_DEBUG_INFO_BTF enabled?
   libbpf: failed to find '.BTF' ELF section in /lib/modules/6.13.0-rc3+/build/vmlinux
   libbpf: failed to find valid kernel BTF
   Segmentation fault (core dumped)

The backtrace pointed to:

   #0  0x00000000100fb17c in btf.type_cnt ()
   #1  0x00000000100fc1a8 in btf_find_by_name_kind ()
   #2  0x00000000100fc38c in btf.find_by_name_kind ()
   #3  0x00000000102ee3ac in off_cpu_prepare ()
   #4  0x000000001002f78c in cmd_record ()
   #5  0x00000000100aee78 in run_builtin ()
   #6  0x00000000100af3e4 in handle_internal_command ()
   #7  0x000000001001004c in main ()

Code sequence is:

   static void check_sched_switch_args(void)
   {
        struct btf *btf = btf__load_vmlinux_btf();
        const struct btf_type *t1, *t2, *t3;
        u32 type_id;

        type_id = btf__find_by_name_kind(btf, "btf_trace_sched_switch",
                                         BTF_KIND_TYPEDEF);

btf__load_vmlinux_btf() fails when CONFIG_DEBUG_INFO_BTF is not enabled.

Here btf__find_by_name_kind() calls btf__type_cnt() with a NULL btf value
and results in a segfault.

To fix this, add a check that btf is not NULL before invoking
btf__find_by_name_kind().
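
For illustration, a minimal sketch of the described check, modelled on the
code sequence above:

   static void check_sched_switch_args(void)
   {
        struct btf *btf = btf__load_vmlinux_btf();
        u32 type_id;

        if (!btf)       /* no kernel BTF, e.g. CONFIG_DEBUG_INFO_BTF=n */
                return;

        type_id = btf__find_by_name_kind(btf, "btf_trace_sched_switch",
                                         BTF_KIND_TYPEDEF);
        /* ... */
   }
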

    Reviewed-by: Namhyung Kim <[email protected]>
    Signed-off-by: Athira Rajeev <[email protected]>
    Cc: Adrian Hunter <[email protected]>
    Cc: Disha Goel <[email protected]>
    Cc: Hari Bathini <[email protected]>
    Cc: Ian Rogers <[email protected]>
    Cc: Jiri Olsa <[email protected]>
    Cc: Kajol Jain <[email protected]>
    Cc: Madhavan Srinivasan <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

Signed-off-by: Michael Petlan <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-77936

upstream
========
commit c7b87ce
Author: Howard Chu <[email protected]>
Date: Tue Jan 21 18:55:19 2025 -0800

description
===========
libtraceevent parses and returns an array of argument fields, sometimes
larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr".
idx can therefore reach index 6 (the 7th element) whereas sc->fmt->arg
holds at most 6 elements, creating an out-of-bounds access. This
runtime error was found by UBSan. The error message:

  $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1
  builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]'
    #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966
    #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110
    #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436
    #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897
    #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335
    #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502
    #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351
    #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404
    #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448
    #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556
    #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360
    #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6)

     0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1)                                      = 1
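
A hedged sketch of the kind of bound that prevents this (the loop shape
follows the report above; syscall_arg_fmt__init() is a stand-in for the
per-field setup, not a real helper):

  /* Stop consuming libtraceevent fields once the fixed-size
   * sc->fmt->arg[] array is full, so the extra "__syscall_nr" field
   * can no longer index element 6 of a 6-element array. */
  for (idx = 0, field = args;
       field && idx < RAW_SYSCALL_ARGS_NUM;
       field = field->next, ++idx)
          syscall_arg_fmt__init(&sc->fmt->arg[idx], field);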

Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint")
    Signed-off-by: Howard Chu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Namhyung Kim <[email protected]>

Signed-off-by: Michael Petlan <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-77936

upstream
========
commit 888751e
Author: Thomas Richter <[email protected]>
Date: Fri Jan 31 12:24:00 2025 +0100

description
===========
perf test 11 hwmon fails on s390 with this error

 # ./perf test -Fv 11
 --- start ---
 ---- end ----
 11.1: Basic parsing test             : Ok
 --- start ---
 Testing 'temp_test_hwmon_event1'
 Using CPUID IBM,3931,704,A01,3.7,002f
 temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/
 FAILED tests/hwmon_pmu.c:189 Unexpected config for
    'temp_test_hwmon_event1', 292470092988416 != 655361
 ---- end ----
 11.2: Parsing without PMU name       : FAILED!
 --- start ---
 Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/'
 FAILED tests/hwmon_pmu.c:189 Unexpected config for
    'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/',
    292470092988416 != 655361
 ---- end ----
 11.3: Parsing with PMU name          : FAILED!
 #

The root cause is in member test_event::config, which is initialized
to 0xA0001 or 655361. During event parsing, a long chain of event
parsing functions is called, ending up with this gdb call stack:

 #0  hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8,
	term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623
 #1  hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8,
	terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662
 #2  0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0,
	attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false,
	apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519
 #3  0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8,
	head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8)
	at util/pmu.c:1545
 #4  0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8,
	list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090,
	auto_merge_stats=true, alternate_hw_config=10)
	at util/parse-events.c:1508
 #5  0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8,
	event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10,
	const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0)
	at util/parse-events.c:1592
 #6  0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8,
	scanner=0x16878c0) at util/parse-events.y:293
 #7  0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8
	"temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8)
	at util/parse-events.c:1867
 #8  0x000000000126a1e8 in __parse_events (evlist=0x168b580,
	str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0,
	err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true,
	fake_tp=false) at util/parse-events.c:2136
 #9  0x00000000011e36aa in parse_events (evlist=0x168b580,
	str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8)
	at /root/linux/tools/perf/util/parse-events.h:41
 #10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false)
	at tests/hwmon_pmu.c:164
 #11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false)
	at tests/hwmon_pmu.c:219
 #12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368
	<suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23

where attr::config is set to the value 292470092988416
(0x10a0000000000) at line 625 of ./util/hwmon_pmu.c:

   attr->config = key.type_and_num;

However member key::type_and_num is defined as union and bit field:

   union hwmon_pmu_event_key {
        long type_and_num;
        struct {
                int num :16;
                enum hwmon_type type :8;
        };
   };

s390 is a big-endian architecture and Intel a little-endian one.
The events for the hwmon dummy pmu have num = 1 or num = 2 and
type set to HWMON_TYPE_TEMP (which is 10).
On s390 this assigns member key::type_and_num the value
0x10a0000000000 (which is 292470092988416), as shown in the trace
output above.
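
The layout difference is easy to reproduce in a standalone program (a
self-contained demo mirroring the union above; `int type : 8` stands in
for the kernel's enum bit field):

   #include <stdio.h>

   union hwmon_pmu_event_key {
        long type_and_num;
        struct {
                int num : 16;
                int type : 8;   /* enum hwmon_type in the kernel */
        };
   };

   int main(void)
   {
        union hwmon_pmu_event_key key = { .type_and_num = 0 };

        key.num = 1;            /* event number */
        key.type = 10;          /* HWMON_TYPE_TEMP */
        /* Little-endian builds print 0xa0001 (655361); big-endian
         * builds place the bit fields in the high bytes of the long
         * and print 0x10a0000000000, matching the s390 failure. */
        printf("0x%lx\n", key.type_and_num);
        return 0;
   }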

Fix this by exporting the structure/union hwmon_pmu_event_key so the
test shares the same implementation of the union and bit fields as
the event parsing functions. This should avoid endianness issues on
all platforms.

Output after:
 # ./perf test -F 11
 11.1: Basic parsing test         : Ok
 11.2: Parsing without PMU name   : Ok
 11.3: Parsing with PMU name      : Ok
 #

Fixes: 531ee0f ("perf test: Add hwmon "PMU" test")
    Signed-off-by: Thomas Richter <[email protected]>
    Reviewed-by: Ian Rogers <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Namhyung Kim <[email protected]>

Signed-off-by: Michael Petlan <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-78701

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 93ae6e6
Author: Lu Baolu <[email protected]>
Date:   Wed Mar 19 10:21:01 2025 +0800

    iommu/vt-d: Fix possible circular locking dependency

    We have recently seen report of lockdep circular lock dependency warnings
    on platforms like Skylake and Kabylake:

     ======================================================
     WARNING: possible circular locking dependency detected
     6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
     ------------------------------------------------------
     swapper/0/1 is trying to acquire lock:
     ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3},
       at: iommu_probe_device+0x1d/0x70

     but task is already holding lock:
     ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3},
       at: intel_iommu_init+0xe75/0x11f0

     which lock already depends on the new lock.

     the existing dependency chain (in reverse order) is:

     -> #6 (&device->physical_node_lock){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            intel_iommu_init+0xe75/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #5 (dmar_global_lock){++++}-{3:3}:
            down_read+0x43/0x1d0
            enable_drhd_fault_handling+0x21/0x110
            cpuhp_invoke_callback+0x4c6/0x870
            cpuhp_issue_call+0xbf/0x1f0
            __cpuhp_setup_state_cpuslocked+0x111/0x320
            __cpuhp_setup_state+0xb0/0x220
            irq_remap_enable_fault_handling+0x3f/0xa0
            apic_intr_mode_init+0x5c/0x110
            x86_late_time_init+0x24/0x40
            start_kernel+0x895/0xbd0
            x86_64_start_reservations+0x18/0x30
            x86_64_start_kernel+0xbf/0x110
            common_startup_64+0x13e/0x141

     -> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            __cpuhp_setup_state_cpuslocked+0x67/0x320
            __cpuhp_setup_state+0xb0/0x220
            page_alloc_init_cpuhp+0x2d/0x60
            mm_core_init+0x18/0x2c0
            start_kernel+0x576/0xbd0
            x86_64_start_reservations+0x18/0x30
            x86_64_start_kernel+0xbf/0x110
            common_startup_64+0x13e/0x141

     -> #3 (cpu_hotplug_lock){++++}-{0:0}:
            __cpuhp_state_add_instance+0x4f/0x220
            iova_domain_init_rcaches+0x214/0x280
            iommu_setup_dma_ops+0x1a4/0x710
            iommu_device_register+0x17d/0x260
            intel_iommu_init+0xda4/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            iommu_setup_dma_ops+0x16b/0x710
            iommu_device_register+0x17d/0x260
            intel_iommu_init+0xda4/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #1 (&group->mutex){+.+.}-{3:3}:
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            __iommu_probe_device+0x24c/0x4e0
            probe_iommu_group+0x2b/0x50
            bus_for_each_dev+0x7d/0xe0
            iommu_device_register+0xe1/0x260
            intel_iommu_init+0xda4/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
            __lock_acquire+0x1637/0x2810
            lock_acquire+0xc9/0x300
            __mutex_lock+0xb4/0xe40
            mutex_lock_nested+0x1b/0x30
            iommu_probe_device+0x1d/0x70
            intel_iommu_init+0xe90/0x11f0
            pci_iommu_init+0x13/0x70
            do_one_initcall+0x62/0x3f0
            kernel_init_freeable+0x3da/0x6a0
            kernel_init+0x1b/0x200
            ret_from_fork+0x44/0x70
            ret_from_fork_asm+0x1a/0x30

     other info that might help us debug this:

     Chain exists of:
       iommu_probe_device_lock --> dmar_global_lock -->
         &device->physical_node_lock

      Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
       lock(&device->physical_node_lock);
                                    lock(dmar_global_lock);
                                    lock(&device->physical_node_lock);
       lock(iommu_probe_device_lock);

      *** DEADLOCK ***

    This driver uses a global lock to protect the list of enumerated DMA
    remapping units. It is necessary due to the driver's support for dynamic
    addition and removal of remapping units at runtime.

    Two distinct code paths require iteration over this remapping unit list:

    - Device registration and probing: the driver iterates the list to
      register each remapping unit with the upper layer IOMMU framework
      and subsequently probe the devices managed by that unit.
    - Global configuration: Upper layer components may also iterate the list
      to apply configuration changes.

    The lock acquisition order between these two code paths was reversed. This
    caused lockdep warnings, indicating a risk of deadlock. Fix this warning
    by releasing the global lock before invoking upper layer interfaces for
    device registration.
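
    Schematically, the reordering looks like this (a sketch of the
    idea, not the literal patch; the calls shown are illustrative):

        down_read(&dmar_global_lock);
        /* ... walk the remapping-unit list; no calls into the IOMMU
         * core while the global lock is held ... */
        up_read(&dmar_global_lock);

        /* Registration and probing now run without dmar_global_lock,
         * so iommu_probe_device_lock is never taken inside it and the
         * #6 -> #0 chain above is broken. */
        iommu_device_register(&iommu->iommu, &intel_iommu_ops, NULL);
        iommu_probe_device(dev);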

    Fixes: b150654 ("iommu/vt-d: Fix suspicious RCU usage")
    Closes: https://lore.kernel.org/linux-iommu/SJ1PR11MB612953431F94F18C954C4A9CB9D32@SJ1PR11MB6129.namprd11.prod.outlook.com/
    Tested-by: Chaitanya Kumar Borah <[email protected]>
    Cc: [email protected]
    Signed-off-by: Lu Baolu <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Joerg Roedel <[email protected]>

Signed-off-by: Eder Zulian <[email protected]>
github-actions bot pushed a commit that referenced this pull request Jun 5, 2025
Add a compile-time check that `*$ptr` is of the type of `$type->$($f)*`.
Rename those placeholders for clarity.
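
For context, the kernel's C container_of() guards against the same
mistake with a static_assert (paraphrased from
include/linux/container_of.h; treat the exact form as a sketch):

    #define container_of(ptr, type, member) ({                         \
            void *__mptr = (void *)(ptr);                              \
            static_assert(__same_type(*(ptr), ((type *)0)->member) ||  \
                          __same_type(*(ptr), void),                   \
                          "pointer type mismatch in container_of()");  \
            ((type *)(__mptr - offsetof(type, member))); })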

Given the incorrect usage:

> diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs
> index 8d978c8..6a7089149878 100644
> --- a/rust/kernel/rbtree.rs
> +++ b/rust/kernel/rbtree.rs
> @@ -329,7 +329,7 @@ fn raw_entry(&mut self, key: &K) -> RawEntry<'_, K, V> {
>          while !(*child_field_of_parent).is_null() {
>              let curr = *child_field_of_parent;
>              // SAFETY: All links fields we create are in a `Node<K, V>`.
> -            let node = unsafe { container_of!(curr, Node<K, V>, links) };
> +            let node = unsafe { container_of!(curr, Node<K, V>, key) };
>
>              // SAFETY: `node` is a non-null node so it is valid by the type invariants.
>              match key.cmp(unsafe { &(*node).key }) {

this patch produces the compilation error:

> error[E0308]: mismatched types
>    --> rust/kernel/lib.rs:220:45
>     |
> 220 |         $crate::assert_same_type(field_ptr, (&raw const (*container_ptr).$($fields)*).cast_mut());
>     |         ------------------------ ---------  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*mut rb_node`, found `*mut K`
>     |         |                        |
>     |         |                        expected all arguments to be this `*mut bindings::rb_node` type because they need to match the type of this parameter
>     |         arguments to this function are incorrect
>     |
>    ::: rust/kernel/rbtree.rs:270:6
>     |
> 270 | impl<K, V> RBTree<K, V>
>     |      - found this type parameter
> ...
> 332 |             let node = unsafe { container_of!(curr, Node<K, V>, key) };
>     |                                 ------------------------------------ in this macro invocation
>     |
>     = note: expected raw pointer `*mut bindings::rb_node`
>                found raw pointer `*mut K`
> note: function defined here
>    --> rust/kernel/lib.rs:227:8
>     |
> 227 | pub fn assert_same_type<T>(_: T, _: T) {}
>     |        ^^^^^^^^^^^^^^^^ -  ----  ---- this parameter needs to match the `*mut bindings::rb_node` type of parameter #1
>     |                         |  |
>     |                         |  parameter #2 needs to match the `*mut bindings::rb_node` type of this parameter
>     |                         parameter #1 and parameter #2 both reference this parameter `T`
>     = note: this error originates in the macro `container_of` (in Nightly builds, run with -Z macro-backtrace for more info)

[ We decided to go with a variation of v1 [1] that became v4, since it
  seems like the obvious approach, the error messages seem good enough
  and the debug performance should be fine, given the kernel is always
  built with -O2.

  In the future, we may want to make the helper non-hidden, with
  proper documentation, for others to use.

  [1] https://lore.kernel.org/rust-for-linux/CANiq72kQWNfSV0KK6qs6oJt+aGdgY=hXg=wJcmK3zYcokY1LNw@mail.gmail.com/

    - Miguel ]

Suggested-by: Alice Ryhl <[email protected]>
Link: https://lore.kernel.org/all/CAH5fLgh6gmqGBhPMi2SKn7mCmMWfOSiS0WP5wBuGPYh9ZTAiww@mail.gmail.com/
Signed-off-by: Tamir Duberstein <[email protected]>
Reviewed-by: Benno Lossin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[ Added intra-doc link. - Miguel ]
Signed-off-by: Miguel Ojeda <[email protected]>