Skip to content

Bpf/optimized usdt ci #8955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

olsajiri
Copy link
Contributor

test

tlendacky and others added 30 commits May 9, 2025 08:53
When increasing the array size in memblock_double_array() and the slab
is not yet available, a call to memblock_find_in_range() is used to
reserve/allocate memory. However, the range returned may not have been
accepted, which can result in a crash when booting an SNP guest:

  RIP: 0010:memcpy_orig+0x68/0x130
  Code: ...
  RSP: 0000:ffffffff9cc03ce8 EFLAGS: 00010006
  RAX: ff11001ff83e5000 RBX: 0000000000000000 RCX: fffffffffffff000
  RDX: 0000000000000bc0 RSI: ffffffff9dba8860 RDI: ff11001ff83e5c00
  RBP: 0000000000002000 R08: 0000000000000000 R09: 0000000000002000
  R10: 000000207fffe000 R11: 0000040000000000 R12: ffffffff9d06ef78
  R13: ff11001ff83e5000 R14: ffffffff9dba7c60 R15: 0000000000000c00
  memblock_double_array+0xff/0x310
  memblock_add_range+0x1fb/0x2f0
  memblock_reserve+0x4f/0xa0
  memblock_alloc_range_nid+0xac/0x130
  memblock_alloc_internal+0x53/0xc0
  memblock_alloc_try_nid+0x3d/0xa0
  swiotlb_init_remap+0x149/0x2f0
  mem_init+0xb/0xb0
  mm_core_init+0x8f/0x350
  start_kernel+0x17e/0x5d0
  x86_64_start_reservations+0x14/0x30
  x86_64_start_kernel+0x92/0xa0
  secondary_startup_64_no_verify+0x194/0x19b

Mitigate this by calling accept_memory() on the memory range returned
before the slab is available.

Prior to v6.12, the accept_memory() interface used a 'start' and 'end'
parameter instead of 'start' and 'size', therefore the accept_memory()
call must be adjusted to specify 'start + size' for 'end' when applying
to kernels prior to v6.12.

Cc: [email protected] # see patch description, needs adjustments for <= 6.11
Fixes: dcdfdd4 ("mm: Add support for unaccepted memory")
Signed-off-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/da1ac73bf4ded761e21b4e4bb5178382a580cd73.1746725050.git.thomas.lendacky@amd.com
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
…tion

Determining the SST/MST mode during state computation must be done based
on the output type stored in the CRTC state, which in turn is set once
based on the modeset connector's SST vs. MST type and will not change as
long as the connector is using the CRTC. OTOH the MST mode indicated by
the given connector's intel_dp::is_mst flag can change independently of
the above output type, based on what sink is at any moment plugged to
the connector.

Fix the state computation accordingly.

Cc: Jani Nikula <[email protected]>
Fixes: f6971d7 ("drm/i915/mst: adapt intel_dp_mtp_tu_compute_config() for 128b/132b SST")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4607
Reviewed-by: Jani Nikula <[email protected]>
Signed-off-by: Imre Deak <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 0f45696ddb2b901fbf15cb8d2e89767be481d59f)
Signed-off-by: Jani Nikula <[email protected]>
Our QA team reported a 10%-23%, throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, that I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep.  While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IO queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing less ios per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profilings charts, and increased
submission latency measured by fio.

There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation.  But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput.  The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

In one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth.  The average
submission latency is noticeably lowered too.  As measured by
fio:

Baseline:
   lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
   lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like:
  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug.
  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32

Ideally, the cap size would at least be the deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size.  We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem.  Instead, let's just give it a more sensible value
that will allow for more efficient batching.  I've tested with different
cap values, and initially proposed to increase the cap to 1024.  Jens
argued it is too big of a bump and I observed that, with 32, I'm no
longer able to observe this bottleneck in any of my machines.

Fixes: af5d68f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
tl;dr: There is a window in the mm switching code where the new CR3 is
set and the CPU should be getting TLB flushes for the new mm.  But
should_flush_tlb() has a bug and suppresses the flush.  Fix it by
widening the window where should_flush_tlb() sends an IPI.

Long Version:

=== History ===

There were a few things leading up to this.

First, updating mm_cpumask() was observed to be too expensive, so it was
made lazier.  But being lazy caused too many unnecessary IPIs to CPUs
due to the now-lazy mm_cpumask().  So code was added to cull
mm_cpumask() periodically[2].  But that culling was a bit too aggressive
and skipped sending TLB flushes to CPUs that need them.  So here we are
again.

=== Problem ===

The too-aggressive code in should_flush_tlb() strikes in this window:

	// Turn on IPIs for this CPU/mm combination, but only
	// if should_flush_tlb() agrees:
	cpumask_set_cpu(cpu, mm_cpumask(next));

	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
	choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
	load_new_mm_cr3(need_flush);
	// ^ After 'need_flush' is set to false, IPIs *MUST*
	// be sent to this CPU and not be ignored.

        this_cpu_write(cpu_tlbstate.loaded_mm, next);
	// ^ Not until this point does should_flush_tlb()
	// become true!

should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3()
and writing to 'loaded_mm', which is a window where they should not be
suppressed.  Whoops.

=== Solution ===

Thankfully, the fuzzy "just about to write CR3" window is already marked
with loaded_mm==LOADED_MM_SWITCHING.  Simply checking for that state in
should_flush_tlb() is sufficient to ensure that the CPU is targeted with
an IPI.

This will cause more TLB flush IPIs.  But the window is relatively small
and I do not expect this to cause any kind of measurable performance
impact.

Update the comment where LOADED_MM_SWITCHING is written since it grew
yet another user.

Peter Z also raised a concern that should_flush_tlb() might not observe
'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off()
writes them.  Add a barrier to ensure that they are observed in the
order they are written.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Link: https://lore.kernel.org/oe-lkp/[email protected]/ [1]
Fixes: 6db2526 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2]
Reported-by: Stephen Dolan <[email protected]>
Cc: [email protected]
Acked-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
…rnel/git/modules/linux

Pull modules fix from Petr Pavlu:
 "A single fix to prevent use of an uninitialized completion pointer
  when releasing a module_kobject in specific situations.

  This addresses a latent bug exposed by commit f95bbfe ("drivers:
  base: handle module_kobject creation")"

* tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  module: ensure that kobject_put() is safe for module type kobjects
Pull io_uring fixes from Jens Axboe:

 - Fix for linked timeouts arming and firing wrt prep and issue of the
   request being managed by the linked timeout

 - Fix for a CQE ordering issue between requests with multishot and
   using the same buffer group. This is a dumbed down version for this
   release and for stable, it'll get improved for v6.16

 - Tweak the SQPOLL submit batch size. A previous commit made SQPOLL
   manage its own task_work and chose a tiny batch size, bump it from 8
   to 32 to fix a performance regression due to that

* tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux:
  io_uring/sqpoll: Increase task_work submission batch size
  io_uring: ensure deferred completions are flushed for multishot
  io_uring: always arm linked timeouts prior to issue
Pull block fixes from Jens Axboe:

 - Fix for a regression in this series for loop and read/write iterator
   handling

 - zone append block update tweak

 - remove a broken IO priority test

 - NVMe pull request via Christoph:
      - unblock ctrl state transition for firmware update (Daniel
        Wagner)

* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
  block: remove test of incorrect io priority level
  nvme: unblock ctrl state transition for firmware update
  block: only update request sector if needed
  loop: Add sanity check for read/write_iter
…linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - The compressed half-word misaligned access instructions (c.lhu, c.lh,
   and c.sh) from the Zcb extension are now properly emulated

 - A series of fixes to properly emulate permissions while handling
   userspace misaligned accesses

 - A pair of fixes for PR_GET_TAGGED_ADDR_CTRL to avoid accessing the
   envcfg CSR on systems that don't support that CSR, and to report
   those failures up to userspace

 - The .rela.dyn section is no longer stripped from vmlinux, as it is
   necessary to relocate the kernel under some conditions (including
   kexec)

* tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm
  scripts: Do not strip .rela.dyn section
  riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL
  riscv: misaligned: use get_user() instead of __get_user()
  riscv: misaligned: enable IRQs while handling misaligned accesses
  riscv: misaligned: factorize trap handling
  riscv: misaligned: Add handling for ZCB instructions
…git/arm64/linux

Pull arm64 fix from Catalin Marinas:
 "Move the arm64_use_ng_mappings variable from the .bss to the .data
  section as it is accessed very early during boot with the MMU off and
  before the .bss has been initialised.

  This could lead to incorrect idmap page table"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation
Marek reported that the rework of handle_nested_irq() introduced a inverted
condition, which prevents handling of interrupts. Fix it up.

Fixes: 2ef2e13 ("genirq/chip: Rework handle_nested_irq()")
Reported-by: Marek Szyprowski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Closes: https://lore.kernel/org/all/[email protected]
…/drm/xe/kernel into drm-fixes

Driver Changes:
- Prevent PF queue overflow
- Hold all forcewake during mocs test
- Remove GSC flush on reset path
- Fix forcewake put on error path
- Fix runtime warning when building without svm

Signed-off-by: Dave Airlie <[email protected]>

From: Lucas De Marchi <[email protected]>
Link: https://lore.kernel.org/r/jffqa56f2zp4i5ztz677cdspgxhnw7qfop3dd3l2epykfpfvza@q2nw6wapsphz
…org/drm/i915/kernel into drm-fixes

drm/i915 fixes for v6.15-rc6:
- Fix oops on resume after disconnecting DP MST sinks during suspend
- Fix SPLC num_waiters refcounting

Signed-off-by: Dave Airlie <[email protected]>
From: Jani Nikula <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
…m/kernel

Pull drm fixes from Dave Airlie:
 "Weekly drm fixes, bit bigger than last week, but overall amdgpu/xe
  with some ivpu bits and a random few fixes, and dropping the
  ttm_backup struct which wrapped struct file and was recently
  frowned at.

  drm:
   - Fix overflow when generating wedged event

  ttm:
   - Fix documentation
   - Remove struct ttm_backup

  panel:
   - simple: Fix timings for AUO G101EVN010

  amdgpu:
   - DC FP fixes
   - Freesync fix
   - DMUB AUX fixes
   - VCN fix
   - Hibernation fixes
   - HDP fixes

  xe:
   - Prevent PF queue overflow
   - Hold all forcewake during mocs test
   - Remove GSC flush on reset path
   - Fix forcewake put on error path
   - Fix runtime warning when building without svm

  i915:
   - Fix oops on resume after disconnecting DP MST sinks during suspend
   - Fix SPLC num_waiters refcounting

  ivpu:
   - Increase timeouts
   - Fix deadlock in cmdq ioctl
   - Unlock mutices in correct order

  v3d:
   - Avoid memory leak in job handling"

* tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel: (32 commits)
  drm/i915/dp: Fix determining SST/MST mode during MTP TU state computation
  drm/xe: Add config control for svm flush work
  drm/xe: Release force wake first then runtime power
  drm/xe/gsc: do not flush the GSC worker from the reset path
  drm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs
  drm/xe: Add page queue multiplier
  drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush
  drm/amdgpu: fix pm notifier handling
  Revert "drm/amd: Stop evicting resources on APUs in suspend"
  drm/amdgpu/vcn: using separate VCN1_AON_SOC offset
  drm/amd/display: Fix wrong handling for AUX_DEFER case
  drm/amd/display: Copy AUX read reply data whenever length > 0
  drm/amd/display: Remove incorrect checking in dmub aux handler
  drm/amd/display: Fix the checking condition in dmub aux handling
  drm/amd/display: Shift DMUB AUX reply command if necessary
  drm/amd/display: Call FP Protect Before Mode Programming/Mode Support
  ...
…ernel/git/ojeda/linux

Pull rust fixes from Miguel Ojeda:

 - Make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88.0

 - Clean Rust (and Clippy) lints for the upcoming Rust 1.87.0 and 1.88.0
   releases

 - Clean objtool warning for the upcoming Rust 1.87.0 release by adding
   one more noreturn function

* tag 'rust-fixes-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  x86/Kconfig: make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88
  rust: clean Rust 1.88.0's `clippy::uninlined_format_args` lint
  rust: clean Rust 1.88.0's warning about `clippy::disallowed_macros` configuration
  rust: clean Rust 1.88.0's `unnecessary_transmutes` lint
  rust: allow Rust 1.87.0's `clippy::ptr_eq` lint
  objtool/rust: add one more `noreturn` Rust function for Rust 1.87.0
... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().

Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
do_umount() analogue of the race fixed in 119e1ef "fix
__legitimize_mnt()/mntput() race".  Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment.  Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit).  Still a bug...

Fixes: 48a066e ("RCU'd vfsmounts")
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.

Fixes: 064fe6e ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
propagate_mnt() does not attach anything to mounts created during
propagate_mnt() itself.  What's more, anything on ->mnt_slave_list
of such new mount must also be new, so we don't need to even look
there.

When move_mount() had been introduced, we've got an additional
class of mounts to skip - if we are moving from anon namespace,
we do not want to propagate to mounts we are moving (i.e. all
mounts in that anon namespace).

Unfortunately, the part about "everything on their ->mnt_slave_list
will also be ignorable" is not true - if we have propagation graph
	A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
	A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree.
Making B private will result in
	A -> B' -> C
C still gets propagation from A, as it would after making B private
if we hadn't done that open_tree(), but now the propagation goes
through B'.  Trying to move_mount() our detached tree on subdirectory
in A should have
	* moved B' on that subdirectory in A
	* skipped the corresponding subdirectory in B' itself
	* copied B' on the corresponding subdirectory in C.
As it is, the logics in propagation_next() and friends ends up
skipping propagation into C, since it doesn't consider anything
downstream of B'.

IOW, walking the propagation graph should only skip the ->mnt_slave_list
of new mounts; the only places where the check for "in that one
anon namespace" are applicable are propagate_one() (where we should
treat that as the same kind of thing as "mountpoint we are looking
at is not visible in the mount we are looking at") and
propagation_would_overmount().  The latter is better dealt with
in the caller (can_move_mount_beneath()); on the first call of
propagation_would_overmount() the test is always false, on the
second it is always true in "move from anon namespace" case and
always false in "move within our namespace" one, so it's easier
to just use check_mnt() before bothering with the second call and
be done with that.

Fixes: 064fe6e ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
…/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix dentry leak which can cause umount crash

 - Add warning for parse contexts error on compounded operation

* tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: Avoid race in open_cached_dir with lease breaks
  smb3 client: warn when parse contexts returns error on compounded operation
…inux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host-fixes for v6.15-rc6

- omap: use correct function to read from device tree
- MAINTAINERS: remove Seth from ISMT maintainership
 into HEAD

KVM/riscv fixes for 6.15, take kernel-patches#1

- Add missing reset of smstateen CSRs
…ux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.15, round kernel-patches#3

 - Avoid use of uninitialized memcache pointer in user_mem_abort()

 - Always set HCR_EL2.xMO bits when running in VHE, allowing interrupts
   to be taken while TGE=0 and fixing an ugly bug on AmpereOne that
   occurs when taking an interrupt while clearing the xMO bits
   (AC03_CPU_36)

 - Prevent VMMs from hiding support for AArch64 at any EL virtualized by
   KVM

 - Save/restore the host value for HCRX_EL2 instead of restoring an
   incorrect fixed value

 - Make host_stage2_set_owner_locked() check that the entire requested
   range is memory rather than just the first page
…into HEAD

KVM x86 fixes for 6.15-rcN

 - Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid causing
   problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to sanitize the
   VMCB as its state is undefined after SHUTDOWN, emulating INIT is the
   least awful choice).

 - Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM
   KVM doesn't goof a sanity check in the future.

 - Free obsolete roots when (re)loading the MMU to fix a bug where
   pre-faulting memory can get stuck due to always encountering a stale
   root.

 - When dumping GHCB state, use KVM's snapshot instead of the raw GHCB page
   to print state, so that KVM doesn't print stale/wrong information.

 - When changing memory attributes (e.g. shared <=> private), add potential
   hugepage ranges to the mmu_invalidate_range_{start,end} set so that KVM
   doesn't create a shared/private hugepage when the the corresponding
   attributes will become mixed (the attributes are commited *after* KVM
   finishes the invalidation).

 - Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM has at
   least one active VM.  Effectively BP_SPEC_REDUCE when KVM is loaded led
   to very measurable performance regressions for non-KVM workloads.
…it/viro/vfs

Pull mount fixes from Al Viro:
 "A couple of races around legalize_mnt vs umount (both fairly old and
  hard to hit) plus two bugs in move_mount(2) - both around 'move
  detached subtree in place' logics"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix IS_MNT_PROPAGATING uses
  do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  do_umount(): add missing barrier before refcount checks in sync case
  __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
…inux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - A fix for the xenbus driver allowing to use a PVH Dom0 with
   Xenstore running in another domain

 - A fix for the xenbus driver addressing a rare race condition
   resulting in NULL dereferences and other problems

 - A fix for the xen-swiotlb driver fixing a problem seen on Arm
   platforms

* tag 'for-linus-6.15a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xenbus: Use kref to track req lifetime
  xenbus: Allow PVH dom0 a non-local xenstore
  xen: swiotlb: Use swiotlb bouncing if kmalloc allocation demands it
…rnel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

 - omap: use correct function to read from device tree

 - MAINTAINERS: remove Seth from ISMT maintainership

* tag 'i2c-for-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  MAINTAINERS: Remove entry for Seth Heasley
  i2c: omap: fix deprecated of_property_read_bool() use
…kernel/git/gregkh/char-misc

Pull char/misc/IIO driver fixes from Greg KH:
 "Here are a bunch of small driver fixes (mostly all IIO) for 6.15-rc6.
  Included in here are:

   - loads of tiny IIO driver fixes for reported issues

   - hyperv driver fix for a much-reported and worked on sysfs ring
     buffer creation bug

  All of these have been in linux-next for over a week (the IIO ones for
  many weeks now), with no reported issues"

* tag 'char-misc-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (30 commits)
  Drivers: hv: Make the sysfs node size for the ring buffer dynamic
  uio_hv_generic: Fix sysfs creation path for ring buffer
  iio: adis16201: Correct inclinometer channel resolution
  iio: adc: ad7606: fix serial register access
  iio: pressure: mprls0025pa: use aligned_s64 for timestamp
  iio: imu: adis16550: align buffers for timestamp
  staging: iio: adc: ad7816: Correct conditional logic for store mode
  iio: adc: ad7266: Fix potential timestamp alignment issue.
  iio: adc: ad7768-1: Fix insufficient alignment of timestamp.
  iio: adc: dln2: Use aligned_s64 for timestamp
  iio: accel: adxl355: Make timestamp 64-bit aligned using aligned_s64
  iio: temp: maxim-thermocouple: Fix potential lack of DMA safe buffer.
  iio: chemical: pms7003: use aligned_s64 for timestamp
  iio: chemical: sps30: use aligned_s64 for timestamp
  iio: imu: inv_mpu6050: align buffer for timestamp
  iio: imu: st_lsm6dsx: Fix wakeup source leaks on device unbind
  iio: adc: qcom-spmi-iadc: Fix wakeup source leaks on device unbind
  iio: accel: fxls8962af: Fix wakeup source leaks on device unbind
  iio: adc: ad7380: fix event threshold shift
  iio: hid-sensor-prox: Fix incorrect OFFSET calculation
  ...
…rnel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
 "Here are three small staging driver fixes for 6.15-rc6. These are:

   - bcm2835-camera driver fix

   - two axis-fifo driver fixes

  All of these have been in linux-next for a few weeks with no reported
  issues"

* tag 'staging-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: axis-fifo: Remove hardware resets for user errors
  staging: axis-fifo: Correct handling of tx_fifo_depth for size validation
  staging: bcm2835-camera: Initialise dev in v4l2_dev
…/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes for 6.15-rc6. Included in here
  are:

   - typec driver fixes

   - usbtmc ioctl fixes

   - xhci driver fixes

   - cdnsp driver fixes

   - some gadget driver fixes

  Nothing really major, just all little stuff that people have reported
  being issues. All of these have been in linux-next this week with no
  reported issues"

* tag 'usb-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  xhci: dbc: Avoid event polling busyloop if pending rx transfers are inactive.
  usb: xhci: Don't trust the EP Context cycle bit when moving HW dequeue
  usb: usbtmc: Fix erroneous generic_read ioctl return
  usb: usbtmc: Fix erroneous wait_srq ioctl return
  usb: usbtmc: Fix erroneous get_stb ioctl error returns
  usb: typec: tcpm: delay SNK_TRY_WAIT_DEBOUNCE to SRC_TRYWAIT transition
  USB: usbtmc: use interruptible sleep in usbtmc_read
  usb: cdnsp: fix L1 resume issue for RTL_REVISION_NEW_LPM version
  usb: typec: ucsi: displayport: Fix NULL pointer access
  usb: typec: ucsi: displayport: Fix deadlock
  usb: misc: onboard_usb_dev: fix support for Cypress HX3 hubs
  usb: uhci-platform: Make the clock really optional
  usb: dwc3: gadget: Make gadget_wakeup asynchronous
  usb: gadget: Use get_status callback to set remote wakeup capability
  usb: gadget: f_ecm: Add get_status callback
  usb: host: tegra: Prevent host controller crash when OTG port is used
  usb: cdnsp: Fix issue with resuming from L1
  usb: gadget: tegra-xudc: ACK ST_RC after clearing CTRL_RUN
…x/kernel/git/driver-core/driver-core

Pull driver core fix from Greg KH:
 "Here is a single driver core fix for a regression for platform devices
  that is a regression from a change that went into 6.15-rc1 that
  affected Pixel devices. It has been in linux-next for over a week with
  no reported problems"

* tag 'driver-core-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  platform: Fix race condition during DMA configure at IOMMU probe time
olsajiri and others added 23 commits May 13, 2025 13:02
Currently unapply_uprobe takes mmap_read_lock, but it might call
remove_breakpoint which eventually changes user pages.

Current code writes either breakpoint or original instruction, so
it can probably go away with that, but with the upcoming change that
writes multiple instructions on the probed address we need to ensure
that any update to mm's pages is exclusive.

Signed-off-by: Jiri Olsa <[email protected]>
We are about to add uprobe trampoline, so cleaning up the namespace.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Making copy_from_page global and adding uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Adding uprobe_write function that does what uprobe_write_opcode did
so far, but allows to pass verify callback function that checks the
memory location before writing the opcode.

It will be used in following changes to implement specific checking
logic for instruction update.

The uprobe_write_opcode now calls uprobe_write with verify_opcode as
the verify callback.

Signed-off-by: Jiri Olsa <[email protected]>
Adding nbytes argument to uprobe_write and related functions as
preparation for writing whole instructions in following changes.

Also renaming opcode arguments to insn, which seems to fit better.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
…code

The uprobe_write has special path to restore the original page when we
write original instruction back. This happens when uprobe_write detects
that we want to write anything else but breakpoint instruction.

Moving the detection away and passing it to uprobe_write as argument,
so it's possible to write different instructions (other than just
breakpoint and rest).

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Making update_ref_ctr call in uprobe_write conditional based
on do_ref_ctr argument. This way we can use uprobe_write for
instruction update without doing ref_ctr_offset update.

Signed-off-by: Jiri Olsa <[email protected]>
Adding support to add special mapping for user space trampoline with
following functions:

  uprobe_trampoline_get - find or add uprobe_trampoline
  uprobe_trampoline_put - remove or destroy uprobe_trampoline

The user space trampoline is exported as arch specific user space special
mapping through tramp_mapping, which is initialized in following changes
with new uprobe syscall.

The uprobe trampoline needs to be callable/reachable from the probed address,
so while searching for available address we use is_reachable_by_call function
to decide if the uprobe trampoline is callable from the probe address.

All uprobe_trampoline objects are stored in uprobes_state object and are
cleaned up when the process mm_struct goes down. Adding new arch hooks
for that, because this change is x86_64 specific.

Locking is provided by callers in following changes.

Signed-off-by: Jiri Olsa <[email protected]>
Adding new uprobe syscall that calls uprobe handlers for given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline which executes the uprobe syscall.

The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address. With this address
we find the related uprobe object and call its consumers.

Adding the arch_uprobe_trampoline_mapping function that provides
uprobe trampoline mapping. This mapping is backed with one global
page initialized at __init time and shared by the all the mapping
instances.

We do not allow to execute uprobe syscall if the caller is not
from uprobe trampoline mapping.

The uprobe syscall ensures the consumer (bpf program) sees registers
values in the state before the trampoline was called.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Putting together all the previously added pieces to support optimized
uprobes on top of 5-byte nop instruction.

The current uprobe execution goes through following:

  - installs breakpoint instruction over original instruction
  - exception handler hit and calls related uprobe consumers
  - and either simulates original instruction or does out of line single step
    execution of it
  - returns to user space

The optimized uprobe path does following:

  - checks the original instruction is 5-byte nop (plus other checks)
  - adds (or uses existing) user space trampoline with uprobe syscall
  - overwrites original instruction (5-byte nop) with call to user space
    trampoline
  - the user space trampoline executes uprobe syscall that calls related uprobe
    consumers
  - trampoline returns back to next instruction

This approach won't speed up all uprobes as it's limited to using nop5 as
original instruction, but we plan to use nop5 as USDT probe instruction
(which currently uses single byte nop) and speed up the USDT probes.

The arch_uprobe_optimize triggers the uprobe optimization and is called after
first uprobe hit. I originally had it called on uprobe installation but then
it clashed with elf loader, because the user space trampoline was added in a
place where loader might need to put elf segments, so I decided to do it after
first uprobe hit when loading is done.

The uprobe is un-optimized in arch specific set_orig_insn call.

The instruction overwrite is x86 arch specific and needs to go through 3 updates:
(on top of nop5 instruction)

  - write int3 into 1st byte
  - write last 4 bytes of the call instruction
  - update the call instruction opcode

And cleanup goes though similar reverse stages:

  - overwrite call opcode with breakpoint (int3)
  - write last 4 bytes of the nop5 instruction
  - write the nop5 first instruction byte

We do not unmap and release uprobe trampoline when it's no longer needed,
because there's no easy way to make sure none of the threads is still
inside the trampoline. But we do not waste memory, because there's just
single page for all the uprobe trampoline mappings.

We do waste frame on page mapping for every 4GB by keeping the uprobe
trampoline page mapped, but that seems ok.

We take the benefit from the fact that set_swbp and set_orig_insn are
called under mmap_write_lock(mm), so we can use the current instruction
as the state the uprobe is in - nop5/breakpoint/call trampoline -
and decide the needed action (optimize/un-optimize) based on that.

Attaching the speed up from benchs/run_bench_uprobes.sh script:

current:
        usermode-count :  152.604 ± 0.044M/s
        syscall-count  :   13.359 ± 0.042M/s
-->     uprobe-nop     :    3.229 ± 0.002M/s
        uprobe-push    :    3.086 ± 0.004M/s
        uprobe-ret     :    1.114 ± 0.004M/s
        uprobe-nop5    :    1.121 ± 0.005M/s
        uretprobe-nop  :    2.145 ± 0.002M/s
        uretprobe-push :    2.070 ± 0.001M/s
        uretprobe-ret  :    0.931 ± 0.001M/s
        uretprobe-nop5 :    0.957 ± 0.001M/s

after the change:
        usermode-count :  152.448 ± 0.244M/s
        syscall-count  :   14.321 ± 0.059M/s
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-push    :    2.976 ± 0.004M/s
        uprobe-ret     :    1.068 ± 0.003M/s
-->     uprobe-nop5    :    7.038 ± 0.007M/s
        uretprobe-nop  :    2.109 ± 0.004M/s
        uretprobe-push :    2.035 ± 0.001M/s
        uretprobe-ret  :    0.908 ± 0.001M/s
        uretprobe-nop5 :    3.377 ± 0.009M/s

I see bit more speed up on Intel (above) compared to AMD. The big nop5
speed up is partly due to emulating nop5 and partly due to optimization.

The key speed up we do this for is the USDT switch from nop to nop5:
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-nop5    :    7.038 ± 0.007M/s

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Importing usdt.h from libbpf/usdt project.

Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Adding __test_uprobe_syscall with non x86_64 stub to execute all the tests,
so we don't need to keep adding non x86_64 stub functions for new tests.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
…multi

Renaming uprobe_syscall_executed prog to test_uretprobe_multi
to fit properly in the following changes that add more programs.

Plus adding pid filter and increasing executed variable.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Adding tests for optimized uprobe/usdt probes.

Checking that we get expected trampoline and attached bpf programs
get executed properly.

Signed-off-by: Jiri Olsa <[email protected]>
Adding test that makes sure parallel execution of the uprobe and
attach/detach of optimized uprobe on it works properly.

Signed-off-by: Jiri Olsa <[email protected]>
Make sure that calling uprobe syscall from outside uprobe trampoline
results in sigill signal.

Signed-off-by: Jiri Olsa <[email protected]>
Adding optimized usdt variant for basic usdt test to check that
usdt arguments are properly passed in optimized code path.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Changing uretprobe_regs_trigger to allow the test for both
uprobe and uretprobe and renaming it to uprobe_regs_equal.

We check that both uprobe and uretprobe probes (bpf programs)
see expected registers with few exceptions.

Signed-off-by: Jiri Olsa <[email protected]>
…robe

Changing the test_uretprobe_regs_change test to test both uprobe
and uretprobe by adding entry consumer handler to the testmod
and making it to change one of the registers.

Making sure that changed values both uprobe and uretprobe handlers
propagate to the user space.

Signed-off-by: Jiri Olsa <[email protected]>
Adding uprobe as another exception to the seccomp filter alongside
with the uretprobe syscall.

Same as the uretprobe the uprobe syscall is installed by kernel as
replacement for the breakpoint exception and is limited to x86_64
arch and isn't expected to ever be supported in i386.

Cc: Kees Cook <[email protected]>
Cc: Eyal Birger <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Adding uprobe checks into the current uretprobe tests.

All the related tests are now executed with attached uprobe
or uretprobe or without any probe.

Renaming the test fixture to uprobe, because it seems better.

Cc: Kees Cook <[email protected]>
Cc: Eyal Birger <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 5 times, most recently from 49174bd to f3a4188 Compare May 20, 2025 23:25
@kernel-patches-daemon-bpf
Copy link

Automatically cleaning up stale PR; feel free to reopen if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.