50361 Commits

Linus Torvalds
9ce62ebbb7 Merge tag 'irq-msi-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
 "Updates for [PCI] MSI related code:

   - Remove one variant of PCI/MSI management as all users have been
     converted to use per device domains. That reduces the variants to
     two:

     The modern and the real archaic legacy variant, which keeps the
     usual suspects in the museum category alive.

   - Rework the platform MSI device ID detection mechanism in the ARM
     GIC world to address resource leaks, duplicated code and other
     details. This requires a corresponding preparatory step in the
     PCI/iproc driver.

   - Trivial core code cleanups"

* tag 'irq-msi-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-its: Rework platform MSI deviceID detection
  PCI: iproc: Implement MSI controller node detection with of_msi_xlate()
  genirq/msi: Slightly simplify msi_domain_alloc()
  PCI/MSI: Delete pci_msi_create_irq_domain()
2025-12-02 09:35:59 -08:00
Linus Torvalds
6863c8385c Merge tag 'irq-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
 "Updates for the interrupt core and treewide cleanups:

   - Rework of the Per Processor Interrupt (PPI) management on ARM[64]

     PPI support was built under the assumption that the systems are
     homogeneous so that the same CPU-local device types are connected to
     them. That's unfortunately wishful thinking and created horrible
     workarounds.

     This rework provides affinity management for PPIs so that they can
     be individually configured in the firmware tables and mops up the
     related drivers all over the place.

   - Prevent CPUSET/isolation changes from arbitrarily affining interrupt
     threads to random CPUs, which would ignore user or driver settings.

   - Plug a harmless race in the interrupt affinity proc interface,
     which allowed a half-updated mask to be observed

   - Adjust the priority of secondary interrupt threads on RT, so that
     the combination of primary and secondary thread emulates the
     hardware interrupt plus thread scenario. Having them at the same
     priority can cause starvation issues in some drivers"

* tag 'irq-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  genirq: Remove cpumask availability check on kthread affinity setting
  genirq: Fix interrupt threads affinity vs. cpuset isolated partitions
  genirq: Prevent early spurious wake-ups of interrupt threads
  genirq: Use raw_spinlock_irq() in irq_set_affinity_notifier()
  genirq/manage: Reduce priority of forced secondary interrupt handler
  genirq/proc: Fix race in show_irq_affinity()
  genirq: Fix percpu_devid irq affinity documentation
  perf: arm_pmu: Kill last use of per-CPU cpu_armpmu pointer
  irqdomain: Kill of_node_to_fwnode() helper
  genirq: Kill irq_{g,s}et_percpu_devid_partition()
  irqchip: Kill irq-partition-percpu
  irqchip/apple-aic: Drop support for custom PMU irq partitions
  irqchip/gic-v3: Drop support for custom PPI partitions
  coresight: trbe: Request specific affinities for per CPU interrupts
  perf: arm_spe_pmu: Request specific affinities for per CPU interrupts
  perf: arm_pmu: Request specific affinities for per CPU NMIs/interrupts
  genirq: Add request_percpu_irq_affinity() helper
  genirq: Allow per-cpu interrupt sharing for non-overlapping affinities
  genirq: Update request_percpu_nmi() to take an affinity
  genirq: Add affinity to percpu_devid interrupt requests
  ...
2025-12-02 09:14:26 -08:00
Linus Torvalds
2b09f480f0 Merge tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
 "A large overhaul of the restartable sequences and CID management:

  The recent enablement of RSEQ in glibc resulted in regressions which
  are caused by the related overhead. It turned out that the decision to
  invoke the exit to user work was not really a decision. More or less
  each context switch caused that. There is a long list of small issues
  which sums up nicely and results in a 3-4% regression in I/O
  benchmarks.

  The other detail which caused issues due to extra work in context
  switch and task migration is the CID (memory context ID) management.
  It also requires using a task work to consolidate the CID space,
  which is executed in the context of an arbitrary task and results in
  sporadic uncontrolled exit latencies.

  The rewrite addresses this by:

   - Removing deprecated and long unsupported functionality

   - Moving the related data into dedicated data structures which are
     optimized for fast path processing.

   - Caching values so actual decisions can be made

   - Replacing the current implementation with an optimized inlined
     variant.

   - Separating fast and slow path for architectures which use the
     generic entry code, so that only fault and error handling goes into
     the TIF_NOTIFY_RESUME handler.

   - Rewriting the CID management so that it becomes mostly invisible in
     the context switch path. That moves the work of switching modes
     into the fork/exit path, which is a reasonable tradeoff. That work
      is only required when a process creates more threads than there
      are CPUs in the cpuset it is allowed to run on, or when enough
      threads exit after that. An artificial thread pool benchmark which
      triggers this did not degrade; it actually improved significantly.

     The main effect in migration heavy scenarios is that runqueue lock
     held time and therefore contention goes down significantly"

* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/mmcid: Switch over to the new mechanism
  sched/mmcid: Implement deferred mode change
  irqwork: Move data struct to a types header
  sched/mmcid: Provide CID ownership mode fixup functions
  sched/mmcid: Provide new scheduler CID mechanism
  sched/mmcid: Introduce per task/CPU ownership infrastructure
  sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
  sched/mmcid: Provide precomputed maximal value
  sched/mmcid: Move initialization out of line
  signal: Move MMCID exit out of sighand lock
  sched/mmcid: Convert mm CID mask to a bitmap
  cpumask: Cache num_possible_cpus()
  sched/mmcid: Use cpumask_weighted_or()
  cpumask: Introduce cpumask_weighted_or()
  sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
  sched/mmcid: Move scheduler code out of global header
  sched: Fixup whitespace damage
  sched/mmcid: Cacheline align MM CID storage
  sched/mmcid: Use proper data structures
  sched/mmcid: Revert the complex CID management
  ...
2025-12-02 08:48:53 -08:00
Linus Torvalds
1dce50698a Merge tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
 "Scoped user mode access and related changes:

   - Implement the missing u64 user access function on ARM when
     CONFIG_CPU_SPECTRE=n.

     This makes it possible to access a 64bit value in generic code with
     [unsafe_]get_user(). All other architectures and ARM variants
     provide the relevant accessors already.

   - Ensure that ASM GOTO jump label usage in the user mode access
     helpers always goes through a local C scope label indirection
     inside the helpers.

      This is required because compilers do not support an ASM GOTO
      target that leaves an auto cleanup scope. GCC silently fails to
      emit the cleanup invocation and Clang fails the build.

     [ Editor's note: gcc-16 will have fixed the code generation issue
       in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
       goto jumps [PR122835]"). But we obviously have to deal with clang
       and older versions of gcc, so.. - Linus ]

     This provides generic wrapper macros and the conversion of affected
     architecture code to use them.

   - Scoped user mode access with auto cleanup

     Access to user mode memory can be required in hot code paths, but
     if it has to be done with user controlled pointers, the access is
     shielded with a speculation barrier, so that the CPU cannot
     speculate around the address range check. Those speculation
     barriers impact performance quite significantly.

     This cost can be avoided by "masking" the provided pointer so it is
     guaranteed to be in the valid user memory access range and
     otherwise to point to a guaranteed unpopulated address space. This
     has to be done without branches so it creates an address dependency
     for the access, which the CPU cannot speculate ahead.

      This results in repetitive and error-prone programming patterns:

              if (can_do_masked_user_access())
                      from = masked_user_read_access_begin((from));
              else if (!user_read_access_begin(from, sizeof(*from)))
                      return -EFAULT;
              unsafe_get_user(val, from, Efault);
              user_read_access_end();
              return 0;
        Efault:
              user_read_access_end();
              return -EFAULT;

      which can be replaced with scopes and automatic cleanup:

              scoped_user_read_access(from, Efault)
                      unsafe_get_user(val, from, Efault);
              return 0;
         Efault:
              return -EFAULT;

   - Convert code which implements the above pattern over to
     scoped_user_*_access(). This also corrects a couple of imbalanced
     masked_*_begin() instances which are harmless on most
     architectures, but prevent PowerPC from implementing the masking
     optimization.

   - Add a missing speculation barrier in copy_from_user_iter()"

* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
  scm: Convert put_cmsg() to scoped user access
  iov_iter: Add missing speculation barrier to copy_from_user_iter()
  iov_iter: Convert copy_from_user_iter() to masked user access
  select: Convert to scoped user access
  x86/futex: Convert to scoped user access
  futex: Convert to get/put_user_inline()
  uaccess: Provide put/get_user_inline()
  uaccess: Provide scoped user access regions
  arm64: uaccess: Use unsafe wrappers for ASM GOTO
  s390/uaccess: Use unsafe wrappers for ASM GOTO
  riscv/uaccess: Use unsafe wrappers for ASM GOTO
  powerpc/uaccess: Use unsafe wrappers for ASM GOTO
  x86/uaccess: Use unsafe wrappers for ASM GOTO
  uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
  ARM: uaccess: Implement missing __get_user_asm_dword()
2025-12-02 08:01:39 -08:00
Nam Cao
b30f635bb6 rv: Convert to use __free
Convert to use __free to tidy up the code.
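
For readers unfamiliar with the idiom: __free() from <linux/cleanup.h>
attaches a destructor to a local variable, which runs automatically when
the variable goes out of scope. A minimal sketch (the function, size and
buffer are illustrative, not taken from the rv code):

    static int example_read(size_t len)
    {
            char *buf __free(kfree) = kzalloc(len, GFP_KERNEL);

            if (!buf)
                    return -ENOMEM;
            /* ... fill and consume buf ... */
            return 0;       /* buf is kfree()d on every return path */
    }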

Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/62854e2fcb8f8dd2180a98a9700702dcf89a6980.1763370183.git.namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-12-02 07:28:32 +01:00
Nam Cao
8db3790c4d rv: Convert to use lock guard
Convert to use lock guard to tidy up the code.
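
The lock guard idiom, likewise from <linux/cleanup.h>, replaces explicit
lock/unlock pairs with a scope-bound guard; a minimal sketch (the mutex
and function are illustrative, not taken from the rv code):

    static DEFINE_MUTEX(example_lock);

    static void example_update(void)
    {
            guard(mutex)(&example_lock);
            /* critical section: mutex_unlock() runs automatically when
             * the guard goes out of scope, on every return path */
    }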

Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/dbefeb868093c40d4b29fd6b57294a6aa011b719.1763370183.git.namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-12-02 07:28:20 +01:00
Linus Torvalds
4a26e7032d Merge tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull bug handling infrastructure updates from Ingo Molnar:
 "Core updates:

   - Improve WARN(), which has vararg printf-like arguments, to work
     with the x86 #UD based WARN-optimizing infrastructure by hiding the
     format in the bug_table and replacing this first argument with the
     address of the bug-table entry, while replacing the actual function
     call with a UD1 instruction (Peter Zijlstra; a simplified bug_entry
     layout is sketched below)

   - Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo
     Molnar, s390 support by Heiko Carstens)

  Fixes and cleanups:

   - bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)

   - <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter
     Zijlstra)"

* tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
  x86/bug: Fix BUG_FORMAT vs KASLR
  x86_64/bug: Inline the UD1
  x86/bug: Implement WARN_ONCE()
  x86_64/bug: Implement __WARN_printf()
  x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
  x86/bug: Add BUG_FORMAT basics
  bug: Allow architectures to provide __WARN_printf()
  bug: Implement WARN_ON() using __WARN_FLAGS()
  bug: Add report_bug_entry()
  bug: Add BUG_FORMAT_ARGS infrastructure
  bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
  bug: Add BUG_FORMAT infrastructure
  x86: Rework __bug_table helpers
  bugs/s390: Remove private WARN_ON() implementation
  bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
  bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
  ...
2025-12-01 21:33:01 -08:00
Linus Torvalds
6d2c10e889 Merge tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Scalability and load-balancing improvements:

   - Enable scheduler feature NEXT_BUDDY (Mel Gorman)

   - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman)

   - Skip sched_balance_running cmpxchg when balance is not due (Tim
     Chen)

   - Implement generic code for architecture specific sched domain NUMA
     distances (Tim Chen)

   - Optimize the NUMA distances of the sched-domains builds of Intel
     Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim
     Chen)

   - Implement proportional newidle balance: a randomized algorithm that
     runs newidle balancing proportional to its success rate. (Peter
     Zijlstra)

  Scheduler infrastructure changes:

   - Implement the 'sched_change' scoped_guard() pattern for the entire
     scheduler (Peter Zijlstra)

   - More broadly utilize the sched_change guard (Peter Zijlstra)

   - Add support to pick functions to take runqueue-flags (Joel
     Fernandes)

   - Provide and use set_need_resched_current() (Peter Zijlstra)
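
     A plausible shape for this helper, folding the pair of operations
     that was previously open-coded at each call site (a sketch, not the
     verbatim kernel code):

        static __always_inline void set_need_resched_current(void)
        {
                set_tsk_need_resched(current);  /* set TIF_NEED_RESCHED    */
                set_preempt_need_resched();     /* fold into preempt_count */
        }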

  Fair scheduling enhancements:

   - Forfeit vruntime on yield (Fernand Sieber)

   - Only update stats for allowed CPUs when looking for dst group (Adam
     Li)

  CPU-core scheduling enhancements:

   - Optimize core cookie matching check (Fernand Sieber)

  Deadline scheduler fixes:

   - Only set free_cpus for online runqueues (Doug Berger)

   - Fix dl_server time accounting (Peter Zijlstra)

   - Fix dl_server stop condition (Peter Zijlstra)

  Proxy scheduling fixes:

   - Yield the donor task (Fernand Sieber)

  Fixes and cleanups:

   - Fix do_set_cpus_allowed() locking (Peter Zijlstra)

   - Fix migrate_disable_switch() locking (Peter Zijlstra)

   - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
     (Hao Jia)

   - Increase sched_tick_remote timeout (Phil Auld)

   - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth
     Hegde)

   - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)"

* tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  sched: Provide and use set_need_resched_current()
  sched/fair: Proportional newidle balance
  sched/fair: Small cleanup to update_newidle_cost()
  sched/fair: Small cleanup to sched_balance_newidle()
  sched/fair: Revert max_newidle_lb_cost bump
  sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  sched/fair: Enable scheduler feature NEXT_BUDDY
  sched: Increase sched_tick_remote timeout
  sched/fair: Have SD_SERIALIZE affect newidle balancing
  sched/fair: Skip sched_balance_running cmpxchg when balance is not due
  sched/deadline: Minor cleanup in select_task_rq_dl()
  sched/deadline: Use cpumask_weight_and() in dl_bw_cpus
  sched/deadline: Document dl_server
  sched/deadline: Fix dl_server stop condition
  sched/deadline: Fix dl_server time accounting
  sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
  sched/eevdf: Fix min_vruntime vs avg_vruntime
  sched/core: Add comment explaining force-idle vruntime snapshots
  sched/core: Optimize core cookie matching check
  sched/proxy: Yield the donor task
  ...
2025-12-01 21:04:45 -08:00
Linus Torvalds
6c26fbe8c9 Merge tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
 "Callchain support:

   - Add support for deferred user-space stack unwinding for perf,
     enabled on x86. (Peter Zijlstra, Steven Rostedt)

   - unwind_user/x86: Enable frame pointer unwinding on x86 (Josh
     Poimboeuf)

  x86 PMU support and infrastructure:

   - x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)

   - x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra)

  Intel PMU driver:

   - Large series to prepare for and implement architectural PEBS
     support for Intel platforms such as Clearwater Forest (CWF) and
     Panther Lake (PTL). (Dapeng Mi, Kan Liang)

   - Check dynamic constraints (Kan Liang)

   - Optimize PEBS extended config (Peter Zijlstra)

   - cstates:
      - Remove PC3 support from LunarLake (Zhang Rui)
      - Add Pantherlake support (Zhang Rui)
      - Clearwater Forest support (Zide Chen)

  AMD PMU driver:

   - x86/amd: Check event before enable to avoid GPF (George Kennedy)

  Fixes and cleanups:

   - task_work: Fix NMI race condition (Peter Zijlstra)

   - perf/x86: Fix NULL event access and potential PEBS record loss
     (Dapeng Mi)

   - Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter
     Zijlstra)"

* tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
  perf/x86/intel: Optimize PEBS extended config
  perf/x86/intel: Check PEBS dyn_constraints
  perf/x86/intel: Add a check for dynamic constraints
  perf/x86/intel: Add counter group support for arch-PEBS
  perf/x86/intel: Setup PEBS data configuration and enable legacy groups
  perf/x86/intel: Update dyn_constraint base on PEBS event precise level
  perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
  perf/x86/intel: Process arch-PEBS records or record fragments
  perf/x86/intel/ds: Factor out PEBS group processing code to functions
  perf/x86/intel/ds: Factor out PEBS record processing code to functions
  perf/x86/intel: Initialize architectural PEBS
  perf/x86/intel: Correct large PEBS flag check
  perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
  perf/x86: Fix NULL event access and potential PEBS record loss
  perf/x86: Remove redundant is_x86_event() prototype
  entry,unwind/deferred: Fix unwind_reset_info() placement
  unwind_user/x86: Fix arch=um build
  perf: Support deferred user unwind
  unwind_user/x86: Teach FP unwind about start of function
  ...
2025-12-01 20:42:01 -08:00
Linus Torvalds
63e6995005 Merge tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:

 - klp-build livepatch module generation (Josh Poimboeuf)

   Introduce new objtool features and a klp-build script to generate
   livepatch modules using a source .patch as input.

   This builds on concepts from the longstanding out-of-tree kpatch
   project which began in 2012 and has been used for many years to
   generate livepatch modules for production kernels. However, this is a
   complete rewrite which incorporates hard-earned lessons from 12+
   years of maintaining kpatch.

   Key improvements compared to kpatch-build:

    - Integrated with objtool: Leverages objtool's existing control-flow
      graph analysis to help detect changed functions.

    - Works on vmlinux.o: Supports late-linked objects, making it
      compatible with LTO, IBT, and similar.

    - Simplified code base: ~3k fewer lines of code.

    - Upstream: No more out-of-tree #ifdef hacks, far less cruft.

    - Cleaner internals: Vastly simplified logic for
      symbol/section/reloc inclusion and special section extraction.

    - Robust __LINE__ macro handling: Avoids false positive binary diffs
      caused by the __LINE__ macro by introducing a fix-patch-lines
      script which injects #line directives into the source .patch to
      preserve the original line numbers at compile time.

 - Disassemble code with libopcodes instead of running objdump
   (Alexandre Chartre)

 - Disassemble support (-d option to objtool) by Alexandre Chartre,
   which supports the decoding of various Linux kernel code generation
   specials such as alternatives:

      17ef:  sched_balance_find_dst_group+0x62f                 mov    0x34(%r9),%edx
      17f3:  sched_balance_find_dst_group+0x633               | <alternative.17f3>             | X86_FEATURE_POPCNT
      17f3:  sched_balance_find_dst_group+0x633               | call   0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
      17f8:  sched_balance_find_dst_group+0x638                 cmp    %eax,%edx

   ... jump table alternatives:

      1895:  sched_use_asym_prio+0x5                            test   $0x8,%ch
      1898:  sched_use_asym_prio+0x8                            je     0x18a9 <sched_use_asym_prio+0x19>
      189a:  sched_use_asym_prio+0xa                          | <jump_table.189a>                        | JUMP
      189a:  sched_use_asym_prio+0xa                          | jmp    0x18ae <sched_use_asym_prio+0x1e> | nop2
      189c:  sched_use_asym_prio+0xc                            mov    $0x1,%eax
      18a1:  sched_use_asym_prio+0x11                           and    $0x80,%ecx

   ... exception table alternatives:

    native_read_msr:
      5b80:  native_read_msr+0x0                                                     mov    %edi,%ecx
      5b82:  native_read_msr+0x2                                                   | <ex_table.5b82> | EXCEPTION
      5b82:  native_read_msr+0x2                                                   | rdmsr           | resume at 0x5b84 <native_read_msr+0x4>
      5b84:  native_read_msr+0x4                                                     shl    $0x20,%rdx

   ... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
        example in sched_balance_find_dst_group() above):

      2faaf:  start_thread_common.constprop.0+0x1f                                    jne    0x2fba4 <start_thread_common.constprop.0+0x114>
      2fab5:  start_thread_common.constprop.0+0x25                                  | <alternative.2fab5>                  | X86_FEATURE_ALWAYS                                  | X86_BUG_NULL_SEG
      2fab5:  start_thread_common.constprop.0+0x25                                  | jmp    0x2faba <.altinstr_aux+0x2f4> | jmp    0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
      2faba:  start_thread_common.constprop.0+0x2a                                    mov    $0x2b,%eax

   ... NOP sequence shortening:

      1048e2:  snapshot_write_finalize+0xc2                                            je     0x104917 <snapshot_write_finalize+0xf7>
      1048e4:  snapshot_write_finalize+0xc4                                            nop6
      1048ea:  snapshot_write_finalize+0xca                                            nop11
      1048f5:  snapshot_write_finalize+0xd5                                            nop11
      104900:  snapshot_write_finalize+0xe0                                            mov    %rax,%rcx
      104903:  snapshot_write_finalize+0xe3                                            mov    0x10(%rdx),%rax

   ... and much more.

 - Function validation tracing support (Alexandre Chartre)

 - Various -ffunction-sections fixes (Josh Poimboeuf)

 - Clang AutoFDO (Automated Feedback-Directed Optimizations) support
   (Josh Poimboeuf)

 - Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo
   Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra,
   Thorsten Blum)

* tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
  objtool: Fix segfault on unknown alternatives
  objtool: Build with disassembly can fail when including bfd.h
  objtool: Trim trailing NOPs in alternative
  objtool: Add wide output for disassembly
  objtool: Compact output for alternatives with one instruction
  objtool: Improve naming of group alternatives
  objtool: Add function to get the name of a CPU feature
  objtool: Provide access to feature and flags of group alternatives
  objtool: Fix address references in alternatives
  objtool: Disassemble jump table alternatives
  objtool: Disassemble exception table alternatives
  objtool: Print addresses with alternative instructions
  objtool: Disassemble group alternatives
  objtool: Print headers for alternatives
  objtool: Preserve alternatives order
  objtool: Add the --disas=<function-pattern> action
  objtool: Do not validate IBT for .return_sites and .call_sites
  objtool: Improve tracing of alternative instructions
  objtool: Add functions to better name alternatives
  objtool: Identify the different types of alternatives
  ...
2025-12-01 20:18:59 -08:00
Linus Torvalds
b53440f8e5 Merge tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "Mutexes:

   - Redo __mutex_init() to reduce generated code size (Sebastian
     Andrzej Siewior)

  Seqlocks:

   - Introduce scoped_seqlock_read() (Peter Zijlstra); the open-coded
     retry loop it wraps is sketched after this list

   - Change thread_group_cputime() to use scoped_seqlock_read() (Oleg
     Nesterov)

   - Change do_task_stat() to use scoped_seqlock_read() (Oleg Nesterov)

   - Change do_io_accounting() to use scoped_seqlock_read() (Oleg
     Nesterov)

   - Fix the incorrect documentation of read_seqbegin_or_lock() /
     need_seqretry() (Oleg Nesterov)

   - Allow KASAN to fail optimizing (Peter Zijlstra)
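
   As context for scoped_seqlock_read(): the open-coded read side it
   wraps is the classic retry loop, using the standard read_seqbegin()/
   read_seqretry() API (the lock and data here are illustrative):

      static DEFINE_SEQLOCK(stats_lock);
      static u64 stat_a, stat_b;

      static u64 stats_sum(void)
      {
              unsigned int seq;
              u64 sum;

              do {
                      seq = read_seqbegin(&stats_lock);
                      sum = stat_a + stat_b;  /* may race, hence retry */
              } while (read_seqretry(&stats_lock, seq));

              return sum;
      }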

  Local lock updates:

   - Fix all kernel-doc warnings (Randy Dunlap)

   - Add the <linux/local_lock*.h> headers to MAINTAINERS (Sebastian
     Andrzej Siewior)

   - Reduce the risk of shadowing via s/l/__l/ and s/tl/__tl/ (Vincent
     Mailhol)

  Lock debugging:

   - spinlock/debug: Fix data-race in do_raw_write_lock (Alexander
     Sverdlin)

  Atomic primitives infrastructure:

   - atomic: Skip alignment check for try_cmpxchg() old arg (Arnd
     Bergmann)

  Rust runtime integration:

   - sync: atomic: Enable generated Atomic<T> usage (Boqun Feng)

   - sync: atomic: Implement Debug for Atomic<Debug> (Boqun Feng)

   - debugfs: Remove Rust native atomics and replace them with Linux
     versions (Boqun Feng)

   - debugfs: Implement Reader for Mutex<T> only when T is Unpin (Boqun
     Feng)

   - lock: guard: Add T: Unpin bound to DerefMut (Daniel Almeida)

   - lock: Pin the inner data (Daniel Almeida)

   - lock: Add a Pin<&mut T> accessor (Daniel Almeida)"

* tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/local_lock: Fix all kernel-doc warnings
  locking/local_lock: s/l/__l/ and s/tl/__tl/ to reduce the risk of shadowing
  locking/local_lock: Add the <linux/local_lock*.h> headers to MAINTAINERS
  locking/mutex: Redo __mutex_init() to reduce generated code size
  rust: debugfs: Replace the usage of Rust native atomics
  rust: sync: atomic: Implement Debug for Atomic<Debug>
  rust: sync: atomic: Make Atomic*Ops pub(crate)
  seqlock: Allow KASAN to fail optimizing
  rust: debugfs: Implement Reader for Mutex<T> only when T is Unpin
  seqlock: Change do_io_accounting() to use scoped_seqlock_read()
  seqlock: Change do_task_stat() to use scoped_seqlock_read()
  seqlock: Change thread_group_cputime() to use scoped_seqlock_read()
  seqlock: Introduce scoped_seqlock_read()
  documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
  atomic: Skip alignment check for try_cmpxchg() old arg
  rust: lock: Add a Pin<&mut T> accessor
  rust: lock: Pin the inner data
  rust: lock: guard: Add T: Unpin bound to DerefMut
  locking/spinlock/debug: Fix data-race in do_raw_write_lock
2025-12-01 19:50:58 -08:00
Linus Torvalds
1b5dd29869 Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fd prepare updates from Christian Brauner:
 "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
  common pattern of get_unused_fd_flags() + create file + fd_install()
  that is used extensively throughout the kernel and currently requires
  cumbersome cleanup paths.
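
   For contrast, the open-coded sequence these primitives replace
   typically looks like this (using the long-standing
   get_unused_fd_flags()/fd_install() API; make_file() stands in for
   whatever constructs the struct file):

       struct file *file;
       int fd = get_unused_fd_flags(O_CLOEXEC);

       if (fd < 0)
               return fd;
       file = make_file();
       if (IS_ERR(file)) {
               put_unused_fd(fd);      /* manual unwind of the reserved fd */
               return PTR_ERR(file);
       }
       fd_install(fd, file);           /* publish fd -> file */
       return fd;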

  FD_ADD() - For simple cases where a file is installed immediately:

      fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
      if (fd < 0)
          vfio_device_put_registration(device);
      return fd;

  FD_PREPARE() - For cases requiring access to the fd or file, or
  additional work before publishing:

      FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
      if (fdf.err) {
          fput(sync_file->file);
          return fdf.err;
      }

      data.fence = fd_prepare_fd(fdf);
      if (copy_to_user((void __user *)arg, &data, sizeof(data)))
          return -EFAULT;

      return fd_publish(fdf);

  The primitives are centered around struct fd_prepare. FD_PREPARE()
  encapsulates all allocation and cleanup logic and must be followed by
  a call to fd_publish() which associates the fd with the file and
  installs it into the caller's fdtable. If fd_publish() isn't called,
  both are deallocated automatically. FD_ADD() is a shorthand that does
  fd_publish() immediately and never exposes the struct to the caller.

  I've implemented this in a way that it's compatible with the cleanup
  infrastructure while also being usable separately. IOW, it's centered
  around struct fd_prepare which is aliased to class_fd_prepare_t and so
   we can make use of all the basic guard infrastructure"

* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  io_uring: convert io_create_mock_file() to FD_PREPARE()
  file: convert replace_fd() to FD_PREPARE()
  vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
  tty: convert ptm_open_peer() to FD_ADD()
  ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
  media: convert media_request_alloc() to FD_PREPARE()
  hv: convert mshv_ioctl_create_partition() to FD_ADD()
  gpio: convert linehandle_create() to FD_PREPARE()
  pseries: port papr_rtas_setup_file_interface() to FD_ADD()
  pseries: convert papr_platform_dump_create_handle() to FD_ADD()
  spufs: convert spufs_gang_open() to FD_PREPARE()
  papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
  spufs: convert spufs_context_open() to FD_PREPARE()
  net/socket: convert __sys_accept4_file() to FD_ADD()
  net/socket: convert sock_map_fd() to FD_ADD()
  net/kcm: convert kcm_ioctl() to FD_PREPARE()
  net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
  secretmem: convert memfd_secret() to FD_ADD()
  memfd: convert memfd_create() to FD_ADD()
  bpf: convert bpf_token_create() to FD_PREPARE()
  ...
2025-12-01 17:32:07 -08:00
Linus Torvalds
1d18101a64 Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred guard updates from Christian Brauner:
 "This contains substantial credential infrastructure improvements
  adding guard-based credential management that simplifies code and
  eliminates manual reference counting in many subsystems.

  Features:

   - Kernel Credential Guards

     Add with_kernel_creds() and scoped_with_kernel_creds() guards that
     allow using the kernel credentials without allocating and copying
     them. This was requested by Linus after seeing repeated
     prepare_kernel_creds() calls that duplicate the kernel credentials
     only to drop them again later.

     The new guards completely avoid the allocation and never expose the
     temporary variable to hold the kernel credentials anywhere in
     callers.

   - Generic Credential Guards

     Add scoped_with_creds() guards for the common override_creds() and
     revert_creds() pattern. This builds on earlier work that made
     override_creds()/revert_creds() completely reference count free.

   - Prepare Credential Guards

     Add prepare credential guards for the more complex pattern of
     preparing a new set of credentials and overriding the current
      credentials with them (the open-coded sequence is sketched after
      this list):
      - prepare_creds()
      - modify new creds
      - override_creds()
      - revert_creds()
      - put_cred()
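
     Open-coded, that sequence looks roughly like this (the fsuid tweak
     is an illustrative modification, not taken from a particular
     converted site):

         struct cred *new;
         const struct cred *old;

         new = prepare_creds();
         if (!new)
                 return -ENOMEM;
         new->fsuid = GLOBAL_ROOT_UID;   /* modify the new creds */
         old = override_creds(new);
         /* ... work with the overridden credentials ... */
         revert_creds(old);
         put_cred(new);

     The new guard collapses these five steps into a single scoped
     construct.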

  Cleanups:

   - Make init_cred static since it should not be directly accessed

   - Add kernel_cred() helper to properly access the kernel credentials

   - Fix scoped_class() macro that was introduced two cycles ago

   - coredump: split out do_coredump() from vfs_coredump() for cleaner
     credential handling

   - coredump: move revert_cred() before coredump_cleanup()

   - coredump: mark struct mm_struct as const

   - coredump: pass struct linux_binfmt as const

   - sev-dev: use guard for path"

* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  trace: use override credential guard
  trace: use prepare credential guard
  coredump: use override credential guard
  coredump: use prepare credential guard
  coredump: split out do_coredump() from vfs_coredump()
  coredump: mark struct mm_struct as const
  coredump: pass struct linux_binfmt as const
  coredump: move revert_cred() before coredump_cleanup()
  sev-dev: use override credential guards
  sev-dev: use prepare credential guard
  sev-dev: use guard for path
  cred: add prepare credential guard
  net/dns_resolver: use credential guards in dns_query()
  cgroup: use credential guards in cgroup_attach_permissions()
  act: use credential guards in acct_write_process()
  smb: use credential guards in cifs_get_spnego_key()
  nfs: use credential guards in nfs_idmap_get_key()
  nfs: use credential guards in nfs_local_call_write()
  nfs: use credential guards in nfs_local_call_read()
  erofs: use credential guards
  ...
2025-12-01 13:45:41 -08:00
Zqiang
1dd6c84f1c sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks
When loading an eBPF scheduler, the tasks in the scx_tasks list are
traversed and __setscheduler_class() is invoked to get the new
sched_class. However, this also incorrectly sets the per-CPU migration
tasks' ->sched_class to rt_sched_class; even after unload, the per-CPU
migration tasks' ->sched_class remains rt_sched_class.

The log for this issue is as follows:

./scx_rustland --stats 1
[  199.245639][  T630] sched_ext: "rustland" does not implement cgroup cpu.weight
[  199.269213][  T630] sched_ext: BPF scheduler "rustland" enabled
04:25:09 [INFO] RustLand scheduler attached

bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
{ printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
Attaching 1 probe...
migration/0:24->rt_sched_class+0x0/0xe0
migration/1:27->rt_sched_class+0x0/0xe0
migration/2:33->rt_sched_class+0x0/0xe0
migration/3:39->rt_sched_class+0x0/0xe0
migration/4:45->rt_sched_class+0x0/0xe0
migration/5:52->rt_sched_class+0x0/0xe0
migration/6:58->rt_sched_class+0x0/0xe0
migration/7:64->rt_sched_class+0x0/0xe0

sched_ext: BPF scheduler "rustland" disabled (unregistered from user space)
EXIT: unregistered from user space
04:25:21 [INFO] Unregister RustLand scheduler

bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
{ printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
Attaching 1 probe...
migration/0:24->rt_sched_class+0x0/0xe0
migration/1:27->rt_sched_class+0x0/0xe0
migration/2:33->rt_sched_class+0x0/0xe0
migration/3:39->rt_sched_class+0x0/0xe0
migration/4:45->rt_sched_class+0x0/0xe0
migration/5:52->rt_sched_class+0x0/0xe0
migration/6:58->rt_sched_class+0x0/0xe0
migration/7:64->rt_sched_class+0x0/0xe0

This commit therefore introduces a new scx_setscheduler_class(), which
adds a check for stop_sched_class, to replace __setscheduler_class().

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-01 10:58:49 -10:00
Linus Torvalds
415d34b92c Merge tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
 "This contains substantial namespace infrastructure changes including a new
  system call, active reference counting, and extensive header cleanups.
  The branch depends on the shared kbuild branch for -fms-extensions support.

  Features:

   - listns() system call

     Add a new listns() system call that allows userspace to iterate
     through namespaces in the system. This provides a programmatic
     interface to discover and inspect namespaces, addressing
     longstanding limitations:

     Currently, there is no direct way for userspace to enumerate
     namespaces. Applications must resort to scanning /proc/*/ns/ across
     all processes, which is:
      - Inefficient - requires iterating over all processes
      - Incomplete - misses namespaces not attached to any running
        process but kept alive by file descriptors, bind mounts, or
        parent references
      - Permission-heavy - requires access to /proc for many processes
      - No ordering or ownership information
      - No filtering per namespace type

     The listns() system call solves these problems:

       ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
                      size_t nr_ns_ids, unsigned int flags);

       struct ns_id_req {
             __u32 size;
             __u32 spare;
             __u64 ns_id;
             struct /* listns */ {
                     __u32 ns_type;
                     __u32 spare2;
                     __u64 user_ns_id;
             };
       };
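
     A minimal userspace iteration sketch (hypothetical: glibc has no
     wrapper yet, so this assumes a raw syscall(2) invocation with a
     __NR_listns definition from the installed UAPI headers, and it
     flattens the anonymous struct):

       #include <stdint.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <sys/syscall.h>

       struct ns_id_req {
               uint32_t size;
               uint32_t spare;
               uint64_t ns_id;
               uint32_t ns_type;
               uint32_t spare2;
               uint64_t user_ns_id;
       };

       int main(void)
       {
               struct ns_id_req req = { .size = sizeof(req) };
               uint64_t ids[64];
               ssize_t n;

               /* resume each batch after the last id seen (pagination) */
               while ((n = syscall(__NR_listns, &req, ids, 64, 0)) > 0) {
                       for (ssize_t i = 0; i < n; i++)
                               printf("ns id %llu\n",
                                      (unsigned long long)ids[i]);
                       req.ns_id = ids[n - 1];
               }
               return 0;
       }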

     Features include:
      - Pagination support for large namespace sets
      - Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
      - Filtering by owning user namespace
      - Permission checks respecting namespace isolation

   - Active Reference Counting

     Introduce an active reference count that tracks namespace
     visibility to userspace. A namespace is visible in the following
     cases:
      - The namespace is in use by a task
      - The namespace is persisted through a VFS object (namespace file
        descriptor or bind-mount)
      - The namespace is a hierarchical type and is the parent of child
        namespaces

     The active reference count does not regulate lifetime (that's still
     done by the normal reference count) - it only regulates visibility
     to namespace file handles and listns().

     This prevents resurrection of namespaces that are pinned only for
     internal kernel reasons (e.g., user namespaces held by
      file->f_cred, lazy TLB references on idle CPUs, etc.) which should
      not be accessible via the three visibility cases listed above.

   - Unified Namespace Tree

     Introduce a unified tree structure for all namespaces with:
      - Fixed IDs assigned to initial namespaces
      - Lookup based solely on inode number
      - Maintained list of owned namespaces per user namespace
      - Simplified rbtree comparison helpers

  Cleanups:

    - Header Reorganization:
      - Move namespace types into separate header (ns_common_types.h)
      - Decouple nstree from ns_common header
      - Move nstree types into separate header
      - Switch to new ns_tree_{node,root} structures with helper functions
      - Use guards for ns_tree_lock

   - Initial Namespace Reference Count Optimization
      - Make all reference counts on initial namespaces a nop to avoid
        pointless cacheline ping-pong for namespaces that can never go
        away
      - Drop custom reference count initialization for initial namespaces
      - Add NS_COMMON_INIT() macro and use it for all namespaces
      - pid: rely on common reference count behavior

   - Miscellaneous Cleanups
      - Rename exit_task_namespaces() to exit_nsproxy_namespaces()
      - Rename is_initial_namespace() and make argument const
      - Use boolean to indicate anonymous mount namespace
      - Simplify owner list iteration in nstree
      - nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
      - nsfs: use inode_just_drop()
      - pidfs: raise DCACHE_DONTCACHE explicitly
      - pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
      - libfs: allow to specify s_d_flags
      - cgroup: add cgroup namespace to tree after owner is set
      - nsproxy: fix free_nsproxy() and simplify create_new_namespaces()

  Fixes:

   - setns(pidfd, ...) race condition

     Fix a subtle race when using pidfds with setns(). When the target
     task exits after prepare_nsset() but before commit_nsset(), the
     namespace's active reference count might have been dropped. If
     setns() then installs the namespaces, it would bump the active
     reference count from zero without taking the required reference on
     the owner namespace, leading to underflow when later decremented.

     The fix resurrects the ownership chain if necessary - if the caller
     succeeded in grabbing passive references, the setns() should
     succeed even if the target task exits or gets reaped.

   - Return EFAULT on put_user() error instead of success

   - Make sure references are dropped outside of RCU lock (some
     namespaces like mount namespace sleep when putting the last
     reference)

   - Don't skip active reference count initialization for network
     namespace

   - Add asserts for active refcount underflow

   - Add asserts for initial namespace reference counts (both passive
     and active)

   - ipc: enable is_ns_init_id() assertions

   - Fix kernel-doc comments for internal nstree functions

   - Selftests
      - 15 active reference count tests
      - 9 listns() functionality tests
      - 7 listns() permission tests
      - 12 inactive namespace resurrection tests
      - 3 threaded active reference count tests
      - commit_creds() active reference tests
      - Pagination and stress tests
      - EFAULT handling test
      - nsid tests fixes"

* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
  pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
  nstree: fix kernel-doc comments for internal functions
  nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
  selftests/namespaces: fix nsid tests
  ns: drop custom reference count initialization for initial namespaces
  pid: rely on common reference count behavior
  ns: add asserts for initial namespace active reference counts
  ns: add asserts for initial namespace reference counts
  ns: make all reference counts on initial namespace a nop
  ipc: enable is_ns_init_id() assertions
  fs: use boolean to indicate anonymous mount namespace
  ns: rename is_initial_namespace()
  ns: make is_initial_namespace() argument const
  nstree: use guards for ns_tree_lock
  nstree: simplify owner list iteration
  nstree: switch to new structures
  nstree: add helper to operate on struct ns_tree_{node,root}
  nstree: move nstree types into separate header
  nstree: decouple from ns_common header
  ns: move namespace types into separate header
  ...
2025-12-01 09:47:41 -08:00
Linus Torvalds
b04b2e7a61 Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
     permission checks during path lookup and adds the
     IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
     expensive permission work.

   - Hide dentry_cache behind runtime const machinery.

   - Add German Maglione as virtiofs co-maintainer.

  Cleanups:

   - Tidy up and inline step_into() and walk_component() for improved
     code generation.

   - Re-enable IOCB_NOWAIT writes to files. This refactors file
     timestamp update logic, fixing a layering bypass in btrfs when
     updating timestamps on device files and improving FMODE_NOCMTIME
     handling in VFS now that nfsd started using it.

   - Path lookup optimizations extracting slowpaths into dedicated
     routines and adding branch prediction hints for mntput_no_expire(),
     fd_install(), lookup_slow(), and various other hot paths.

   - Enable clang's -fms-extensions flag, requiring a JFS rename to
     avoid conflicts.

   - Remove spurious exports in fs/file_attr.c.

   - Stop duplicating union pipe_index declaration. This depends on the
     shared kbuild branch that brings in -fms-extensions support which
     is merged into this branch.

   - Use MD5 library instead of crypto_shash in ecryptfs.

   - Use largest_zero_folio() in iomap_dio_zero().

   - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
     initrd code.

   - Various typo fixes.

  Fixes:

   - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
     call with wait == 1 to commit super blocks. The emergency sync path
     never passed this, leaving btrfs data uncommitted during emergency
     sync.

   - Use local kmap in watch_queue's post_one_notification().

   - Add hint prints in sb_set_blocksize() for LBS dependency on THP"

* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  MAINTAINERS: add German Maglione as virtiofs co-maintainer
  fs: inline step_into() and walk_component()
  fs: tidy up step_into() & friends before inlining
  orangefs: use inode_update_timestamps directly
  btrfs: fix the comment on btrfs_update_time
  btrfs: use vfs_utimes to update file timestamps
  fs: export vfs_utimes
  fs: lift the FMODE_NOCMTIME check into file_update_time_flags
  fs: refactor file timestamp update logic
  include/linux/fs.h: trivial fix: regualr -> regular
  fs/splice.c: trivial fix: pipes -> pipe's
  fs: mark lookup_slow() as noinline
  fs: add predicts based on nd->depth
  fs: move mntput_no_expire() slowpath into a dedicated routine
  fs: remove spurious exports in fs/file_attr.c
  watch_queue: Use local kmap in post_one_notification()
  fs: touch up predicts in path lookup
  fs: move fd_install() slowpath into a dedicated routine and provide commentary
  fs: hide dentry_cache behind runtime const machinery
  fs: touch predicts in do_dentry_open()
  ...
2025-12-01 08:44:26 -08:00
Petr Mladek
5cae92e622 Merge branch 'rework/write_atomic-unsafe' into for-linus 2025-12-01 14:17:04 +01:00
Petr Mladek
4f132d81f9 Merge branch 'rework/threaded-printk' into for-linus 2025-12-01 14:16:45 +01:00
Petr Mladek
3a9a3f5fb2 Merge branch 'rework/suspend-fixes' into for-linus 2025-12-01 14:16:28 +01:00
Petr Mladek
b1e6c41ef9 Merge branch 'rework/preempt-legacy-kthread' into for-linus 2025-12-01 14:16:08 +01:00
Petr Mladek
2d786a5b80 Merge branch 'rework/nbcon-in-kdb' into for-linus 2025-12-01 14:15:43 +01:00
Petr Mladek
475bb520c3 Merge branch 'rework/atomic-flush-hardlockup' into for-linus 2025-12-01 14:15:26 +01:00
Petr Mladek
3869e431b5 Merge branch 'for-6.19-vsprintf-timespec64' into for-linus 2025-12-01 14:14:34 +01:00
Sebastian Andrzej Siewior
51d7a05452 locking/mutex: Redo __mutex_init() to reduce generated code size
mutex_init() invokes __mutex_init() providing the name of the lock and
a pointer to the lock class. With LOCKDEP enabled this information is
useful, but without LOCKDEP it is not used at all. Passing the pointer
to the lock class might be considered negligible, but the name of the
lock is passed as well and the string is stored. This information is
wasting storage.

Split __mutex_init() into a _generic() variant doing the initialisation
of the lock and a _lockdep() version which does _generic() plus the
lockdep bits. Restrict the lockdep version to lockdep enabled builds,
allowing the compiler to remove the unused parameter.
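
A condensed sketch of the resulting shape (function names follow the
changelog; the generic body is simplified from what __mutex_init() does
today):

  static void __mutex_init_generic(struct mutex *lock)
  {
          atomic_long_set(&lock->owner, 0);
          raw_spin_lock_init(&lock->wait_lock);
          INIT_LIST_HEAD(&lock->wait_list);
  }

  #ifdef CONFIG_DEBUG_LOCK_ALLOC
  void __mutex_init_lockdep(struct mutex *lock, const char *name,
                            struct lock_class_key *key)
  {
          __mutex_init_generic(lock);
          debug_mutex_init(lock, name, key);      /* lockdep/debug bits */
  }
  #endif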

This results in the following size reduction:

        text     data       bss        dec  filename
  | 30237599  8161430   1176624   39575653  vmlinux.defconfig
  | 30233269  8149142   1176560   39558971  vmlinux.defconfig.patched
     -4.2KiB   -12KiB

  | 32455099  8471098  12934684   53860881  vmlinux.defconfig.lockdep
  | 32455100  8471098  12934684   53860882  vmlinux.defconfig.patched.lockdep

  | 27152407  7191822   2068040   36412269  vmlinux.defconfig.preempt_rt
  | 27145937  7183630   2067976   36397543  vmlinux.defconfig.patched.preempt_rt
     -6.3KiB    -8KiB

  | 29382020  7505742  13784608   50672370  vmlinux.defconfig.preempt_rt.lockdep
  | 29376229  7505742  13784544   50666515  vmlinux.defconfig.patched.preempt_rt.lockdep
     -5.6KiB

[peterz: folded fix from boqun]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20251125145425.68319-1-boqun.feng@gmail.com
Link: https://patch.msgid.link/20251105142350.Tfeevs2N@linutronix.de
2025-12-01 06:51:57 +01:00
Frederic Weisbecker
9a08942f17 Merge branch 'rcu/misc' into next
- In order to prepare the layout for nohz_full work deferral to
  user exit, the context tracking state must shrink the counter
  of transitions to/from RCU not watching. The only possible hazard
  is to trigger wrap-around more easily, delaying a bit grace periods
  when that happens. This should be a rare event though. Yet add
  debugging and torture code to test that assumption.

- Fix memory leak on locktorture module

- Annotate accesses in rculist_nulls.h to prevent KCSAN warnings.
  In recent discussions, we also concluded that all those WRITE_ONCE()
  and READ_ONCE() on list APIs deserve appropriate comments. Something
  to be expected for the next cycle.

- Provide a script to apply several configs to several commits with torture.

- Allow torture to reuse a build directory in order to save needless
  rebuild time.

- Various cleanups.
2025-11-30 22:20:33 +01:00
Linus Torvalds
e69c7c1751 Merge tag 'timers_urgent_for_v6.18_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:

 - Have timekeeping aux clocks sysfs interface setup function return an
   error code on failure instead of success

* tag 'timers_urgent_for_v6.18_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix error code in tk_aux_sysfs_init()
2025-11-30 08:47:10 -08:00
Lorenzo Stoakes
58eac97a8b mm: simplify and rename mm flags function for clarity
The __mm_flags_set_word() function is slightly ambiguous - we use 'set' to
refer to setting individual bits (such as in mm_flags_set()) but here we
use it to refer to overwriting the value altogether.

Rename it to __mm_flags_overwrite_word() to eliminate this ambiguity.

We additionally simplify the functions, eliminating unnecessary
bitmap_xxx() operations (the compiler would have optimised these out but
it's worth being as clear as we can be here).
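
Schematically (a toy sketch: MMF_HAS_UPROBES is just an example flag and
the two-word array stands in for the real mm flags bitmap):

	unsigned long flags_bitmap[2] = { 0, 0 };

	set_bit(MMF_HAS_UPROBES, flags_bitmap);	/* cf. mm_flags_set(): one bit */
	flags_bitmap[0] = 0;			/* cf. __mm_flags_overwrite_word():
						   word 0 replaced wholesale */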

Link: https://lkml.kernel.org/r/8f0bc556e1b90eca8ea5eba41f8d5d3f9cd7c98a.1764064557.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Alice Ryhl <aliceryhl@google.com>	[rust]
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:08 -08:00
Ritesh Oedayrajsingh Varma
ff34657aa7 bpf: optimize bpf_map_update_elem() for map-in-map types
Updating a BPF_MAP_TYPE_HASH_OF_MAPS or BPF_MAP_TYPE_ARRAY_OF_MAPS via
bpf_map_update_elem() is very expensive.

In one of our workloads, we're inserting ~1400 maps of type
BPF_MAP_TYPE_ARRAY into a BPF_MAP_TYPE_ARRAY_OF_MAPS. This takes ~21
seconds on a single thread, with an average of ~15ms per call:

Function Name:    map_update_elem
Number of calls:  1369
Total time:       21s 182ms 966µs
Maximum:          47ms 937µs
Average:          15ms 473µs
Minimum:          7µs

Profiling shows that nearly all of this time is going to synchronize_rcu(),
via maybe_wait_bpf_programs() in map_update_elem().

The call to synchronize_rcu() is done to ensure that after
bpf_map_update_elem() returns, no BPF programs are still looking at the old
value of the map, per commit 1ae80cf319 ("bpf: wait for running BPF
programs when updating map-in-map").

As discussed on the bpf mailing list, replace synchronize_rcu() with
synchronize_rcu_expedited(). This is 175x faster: it now takes an average
of 88 microseconds per call, for a total of 127 milliseconds in the same
benchmark:

Function Name:    map_update_elem
Number of calls:  1439
Total time:       127ms 626µs
Maximum:          445µs
Average:          88µs
Minimum:          10µs
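
The change itself is small; schematically, in the map-update path (a
sketch of the idea, not the exact upstream diff):

	static void maybe_wait_bpf_programs(struct bpf_map *map)
	{
		/* Make sure no BPF program still sees the old inner map
		 * once bpf_map_update_elem() returns; the expedited grace
		 * period trades a few IPIs for far lower latency than
		 * synchronize_rcu().
		 */
		if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
		    map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
			synchronize_rcu_expedited();
	}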

Link: https://lore.kernel.org/bpf/CAH6OuBR=w2kybK6u7aH_35B=Bo1PCukeMZefR=7V4Z2tJNK--Q@mail.gmail.com/

Signed-off-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Link: https://lore.kernel.org/r/20251128000422.20462-1-ritesh@superluminal.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29 09:48:41 -08:00
Menglong Dong
c1af4465b9 bpf: make kprobe_multi_link_prog_run always_inline
Make kprobe_multi_link_prog_run() always inline to obtain better
performance. Before this patch, the bench performance is:

./bench trig-kprobe-multi
Setting up benchmark 'trig-kprobe-multi'...
Benchmark 'trig-kprobe-multi' started.
Iter   0 ( 95.485us): hits   62.462M/s ( 62.462M/prod), [...]
Iter   1 (-80.054us): hits   62.486M/s ( 62.486M/prod), [...]
Iter   2 ( 13.572us): hits   62.287M/s ( 62.287M/prod), [...]
Iter   3 ( 76.961us): hits   62.293M/s ( 62.293M/prod), [...]
Iter   4 (-77.698us): hits   62.394M/s ( 62.394M/prod), [...]
Iter   5 (-13.399us): hits   62.319M/s ( 62.319M/prod), [...]
Iter   6 ( 77.573us): hits   62.250M/s ( 62.250M/prod), [...]
Summary: hits   62.338 ± 0.083M/s ( 62.338M/prod)

And after this patch, the performance is:

Iter   0 (454.148us): hits   66.900M/s ( 66.900M/prod), [...]
Iter   1 (-435.540us): hits   68.925M/s ( 68.925M/prod), [...]
Iter   2 (  8.223us): hits   68.795M/s ( 68.795M/prod), [...]
Iter   3 (-12.347us): hits   68.880M/s ( 68.880M/prod), [...]
Iter   4 (  2.291us): hits   68.767M/s ( 68.767M/prod), [...]
Iter   5 ( -1.446us): hits   68.756M/s ( 68.756M/prod), [...]
Iter   6 ( 13.882us): hits   68.657M/s ( 68.657M/prod), [...]
Summary: hits   68.792 ± 0.087M/s ( 68.792M/prod)

As we can see, the performance of kprobe-multi increases from 62M/s to
68M/s.
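
The change itself is a one-line annotation; a hedged sketch follows,
with the parameter list approximated rather than taken from the
actual kernel signature:

	/* Force the hot per-probe helper inline so each probe hit
	 * avoids an out-of-line call. */
	static __always_inline
	int kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
				       unsigned long entry_ip,
				       struct pt_regs *regs)
	{
		/* ... set up the run context and run the attached prog ... */
		return 0;
	}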

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251126085246.309942-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29 09:47:10 -08:00
Kumar Kartikeya Dwivedi
087849cca3 rqspinlock: Precede non-head waiter queueing with AA check
While previous commits sufficiently address the deadlocks, there are
still scenarios where queueing of waiters in NMIs can exacerbate the
possibility of timeouts.

Consider the case below:

CPU 0
<NMI>
res_spin_lock(A) -> becomes non-head waiter
</NMI>
lock owner in CS or pending waiter spinning

CPU 1
res_spin_lock(A) -> head waiter spinning on owner/pending bits

In such a scenario, the non-head waiter in NMI on CPU 0 will not poll
for deadlocks or timeout since it will simply queue behind previous
waiter (head on CPU 1), and also not enter the trylock fallback since
no rqspinlock queue waiter is active on CPU 0. In such a scenario, the
transaction initiated by the head waiter on CPU 1 will timeout,
signalling the NMI and ending the cyclic dependency, but it will cost
250 ms of time.

Instead, the NMI on CPU 0 could simply check for the presence of an AA
deadlock and only proceed with queueing on success. Add such a check
right before any form of queueing is initiated.

The reason the AA deadlock check is not used in conjunction with
in_nmi() is that a similar case could occur due to a reentrant path
in the owner's critical section, and unconditionally checking for AA
before entering the queueing path avoids expensive timeouts. Non-NMI
reentrancy only happens at controlled points in the slow path (with
specific tracepoints which do not impede the forward progress of a
waiter loop), or in the owner CS, while NMIs can land anywhere.

While this check is only needed for non-head waiter queueing, checking
whether we are head or not is racy without xchg_tail, and after that
point, we are already queued, hence for simplicity we must invoke the
check unconditionally.

Note that a more contrived case could still be constructed by using two
locks, and interrupting the progress of the respective owners by
non-head waiters of the other lock, in an ABBA fashion, which would
still not be covered by the current set of checks and conditions. It
would still lead to a timeout though, and not a deadlock. An ABBA check
cannot happen optimistically before the queueing, since it can be racy,
and needs to happen continuously during the waiting period, which
would then require an unlinking step for queued NMI/reentrant waiters.
This is beyond the scope of this patch.
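
For illustration, a hedged sketch of where the new check sits in the
slow path; the helper and label names are assumptions based on this
series, not verified:

	/* Before any queueing, head or not: if this CPU already holds
	 * the lock, report an AA deadlock instead of queueing behind a
	 * waiter that may never make progress. */
	ret = check_deadlock_AA(lock, mask, &ts);
	if (ret)
		goto err_release_entry;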

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29 09:35:36 -08:00
Kumar Kartikeya Dwivedi
30dc2f7025 rqspinlock: Disable spinning for trylock fallback
The original trylock fallback was inherited from qspinlock, and then
reused for the reentrant NMIs while the slow path is active. However,
under contention, it is very unlikely for the trylock to succeed in
taking the lock. In addition, a trylock also has no fairness guarantees,
and thus is prone to starvation issues under extreme scenarios.

The original qspinlock had no choice in terms of returning an error to the
caller; if the node count was breached, it had to fall back to trylock
to attempt to take the lock. In case of rqspinlock, we do have the
option of returning to the user. Thus, simply attempt the trylock once,
and instead of spinning, return an error in case the lock cannot be
taken.

This ends up significantly reducing the time spent in the trylock
fallback, since we no longer wait for the timeout duration trying to
aimlessly acquire the lock when there is a high probability that, under
contention, it won't be available to us anyway.
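
A hedged sketch of the reworked fallback; the wrapper shape and the
exact errno are assumptions, only the single-attempt idea comes from
this patch:

	int val = 0;

	/* Attempt the trylock exactly once: under contention the lock
	 * word is almost never 0, so spinning here mostly burns the
	 * timeout budget for nothing. */
	if (atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))
		return 0;
	return -EBUSY;	/* report failure instead of spinning */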

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29 09:35:36 -08:00
Kumar Kartikeya Dwivedi
81d5a6a438 rqspinlock: Use trylock fallback when per-CPU rqnode is busy
Instead of always deferring NMIs to the trylock fallback, only do so
when an rqspinlock waiter is already queued on the current CPU. This is detected
by noticing a non-zero node index. This allows NMI waiters to join the
waiter queue if it isn't interrupting an existing rqspinlock waiter, and
increase the chances of fairly obtaining the lock, performing deadlock
detection as the head, and not being starved while attempting the
trylock.

The trylock path in particular is unlikely to succeed under contention,
as it relies on the lock word becoming 0, which indicates no contention.
This means that the most likely result for NMIs attempting a trylock is
a timeout under contention if they don't hit an AA or ABBA case.

The core problem being addressed through the fixed commit was removing
the dependency edge between an NMI queue waiter and the queue waiter it
is interrupting. Whenever a circular dependency forms, and with no way
to break it (as non-head waiters don't poll for deadlocks or timeouts),
we would enter into a deadlock. A trylock breaks such an edge either by
probing for deadlocks or, failing that, by terminating the waiting loop
with a timeout.

By excluding queueing on CPUs where the node index is non-zero for NMIs,
this sort of dependency is broken. The CPU enters the trylock path for
those cases, and falls back to deadlock checks and timeouts. However, in
the other case, where it doesn't interrupt a CPU queued on the lock in
the slow path, it can join the queue as a normal waiter and avoid
trylock-associated starvation and subsequent timeouts.

There are a few remaining cases here that matter: the NMI can still
preempt the owner in its critical section, and if it queues as a
non-head waiter, it can end up impeding the progress of the owner. While
this won't deadlock, since the head waiter will eventually signal the
NMI waiter to stop (due to a timeout), it can still lead to long
timeouts. These gaps will be addressed in subsequent commits.

Note that while the node count detection approach is less conservative
than simply deferring NMIs to trylock, it is going to return errors
where attempts to lock B in NMI happen while waiters for lock A are in a
lower context on the same CPU. However, this only occurs when the lower
context is queued in the slow path, and the NMI attempt can proceed
without failure in all other cases. To continue to prevent AA deadlocks
(or ABBA in a similar NMI interrupting lower context pattern), we'd need
a more fleshed out algorithm to unlink NMI waiters after they queue and
detect such cases. However, all that complexity isn't yet appealing just
to reduce the failure rate in the small window inside the slow path.

It is important to note that reentrancy in the slow path can also happen
through trace_contention_{begin,end}, but in those cases, unlike an NMI,
the forward progress of the head waiter (or the predecessor in general)
is not being blocked.
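
A hedged sketch of the gating condition; the per-CPU node naming is
assumed from qspinlock's conventions, not taken from the patch:

	/* Defer an NMI to the trylock fallback only when it interrupted
	 * an rqspinlock waiter already queued on this CPU; otherwise it
	 * may safely join the wait queue as a normal waiter. */
	if (in_nmi() && READ_ONCE(this_cpu_ptr(&rqnodes[0].mcs)->count))
		return resilient_tas_spin_lock(lock);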

Fixes: 0d80e7f951 ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29 09:35:35 -08:00
Kumar Kartikeya Dwivedi
5860f5ce47 rqspinlock: Perform AA checks immediately
Currently, while we enter the check_timeout() call immediately due to the
way ts.spin is initialized, we only invoke the AA and ABBA checks from
the second invocation onwards, and only initialize the timestamp in the
first one. Since successive invocations are at least 1ms apart, this can
delay detection of AA deadlocks by up to a millisecond.

Rework check_timeout() to avoid this. First, call check_deadlock_AA()
while initializing the timestamps for the wait period. This also means
that we only do it once per waiting period, instead of every invocation.
Finally, drop check_deadlock() and call check_deadlock_ABBA() directly.

To save on unnecessary ktime_get_mono_fast_ns() in case of AA deadlock,
sample the time only if it returns 0.
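
A hedged sketch of the reworked helper; field and helper names are
assumptions based on this series:

	static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
					  struct rqspinlock_timeout *ts)
	{
		/* First invocation: do the one-time AA check, then arm
		 * the timestamps only if no deadlock was found. */
		if (!ts->timeout_end) {
			int ret = check_deadlock_AA(lock, mask, ts);

			if (ret)
				return ret;	/* skip ktime_get_mono_fast_ns() */
			ts->cur = ktime_get_mono_fast_ns();
			ts->timeout_end = ts->cur + ts->duration;
			return 0;
		}
		/* ... later invocations: ABBA check and timeout ... */
		return 0;
	}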

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29 09:35:35 -08:00
Kumar Kartikeya Dwivedi
beb7021a60 rqspinlock: Enclose lock/unlock within lock entry acquisitions
Ritesh reported that timeouts occurred frequently for rqspinlock despite
reentrancy on the same lock on the same CPU in [0]. This patch closes
one of the races leading to this behavior, and reduces the frequency of
timeouts.

We currently have a tiny window between the fast-path cmpxchg and the
grabbing of the lock entry where an NMI could land, attempt the same
lock that was just acquired, and end up timing out. This is not ideal.
Instead, move the lock entry acquisition from the fast path to before
the cmpxchg, and remove the grabbing of the lock entry in the slow path,
assuming it was already taken by the fast path. The TAS fallback is
invoked directly without being preceded by the typical fast path,
therefore we must continue to grab the deadlock detection entry in that
case.

Case on lock leading to missed AA:

cmpxchg lock A
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
grab_held_lock_entry(A)

There is a similar case when unlocking the lock. If the NMI lands
between the WRITE_ONCE and smp_store_release, it is possible that we end
up in a situation where the NMI fails to diagnose the AA condition,
leading to a timeout.

Case on unlock leading to missed AA:

WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
smp_store_release(A->locked, 0)

The patch changes the order on unlock to smp_store_release() followed
by WRITE_ONCE() of NULL. This avoids the missed AA detection described
above, but may lead to a false positive if the NMI lands between these
two statements, which is acceptable (and preferred over a timeout).

The original intention of the reverse order on unlock was to prevent the
following possible misdiagnosis of an ABBA scenario:

grab entry A
lock A
grab entry B
lock B
unlock B
   smp_store_release(B->locked, 0)
							grab entry B
							lock B
							grab entry A
							lock A
							! <detect ABBA>
   WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)

If the store release is done after the WRITE_ONCE, the other CPU would
not observe B in the table of the CPU unlocking lock B.  However,
since the threads are obviously participating in an ABBA deadlock, it
is no longer appealing to use the order above, as it may lead to a
250 ms timeout due to missed AA detection.
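
A hedged sketch of the resulting unlock order; the structure names are
assumed from this series:

	static __always_inline void res_spin_unlock(rqspinlock_t *lock)
	{
		struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);

		/* Release the lock first, then retire the held-lock
		 * entry: an NMI landing in between sees a stale entry
		 * (false-positive AA) rather than no entry (timeout). */
		smp_store_release(&lock->locked, 0);
		WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
		this_cpu_dec(rqspinlock_held_locks.cnt);
	}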

  [0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com

Fixes: 0d80e7f951 ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-29 09:35:35 -08:00
Amery Hung
b4bf1d23dc bpf: Disable file_alloc_security hook
A use-after-free bug may be triggered by calling bpf_inode_storage_get()
in a BPF LSM program hooked to file_alloc_security. Disable the hook to
prevent this from happening.

The cause of the bug is shown in the trace below. In alloc_file(), a
file struct is first allocated through kmem_cache_alloc(). Then,
the file_alloc_security hook is invoked. Since the zero initialization and
assignment of f->f_inode happen after this LSM hook, a BPF program may
get a dangling inode pointer by walking the file struct.

  alloc_file()
  -> alloc_empty_file()
     -> f = kmem_cache_alloc()
     -> init_file()
        -> security_file_alloc() // f->f_inode not init-ed yet!
     -> f->f_inode = NULL;
  -> file_init_path()
     -> f->f_inode = path->dentry->d_inode
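
One plausible way to disable the hook is the BPF LSM's denylist of
unattachable hooks; this sketch assumes that mechanism and is not
taken from the patch itself:

	BTF_SET_START(bpf_lsm_disabled_hooks)
	BTF_ID(func, bpf_lsm_file_alloc_security)
	BTF_SET_END(bpf_lsm_disabled_hooks)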

Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/1d2d1968.47cd3.19ab9528e94.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20251126202927.2584874-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28 15:18:28 -08:00
Anton Protopopov
e3ea26add6 bpf: check for insn arrays in check_ptr_alignment
Do not abuse the strict_alignment_once flag, and check if the map is
an instruction array inside the check_ptr_alignment() function.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28 15:15:43 -08:00
Anton Protopopov
7feff23cdf bpf: force BPF_F_RDONLY_PROG on insn array creation
The original implementation added a hack to check_mem_access()
to prevent programs from writing into insn arrays. To get rid
of this hack, enforce BPF_F_RDONLY_PROG on map creation.

Also fix the corresponding selftest, as the error message changes
with this patch.
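
A hedged sketch of the creation-time enforcement; this assumes the
flag is forced on rather than creation being rejected, and the exact
shape may differ from the actual patch:

	/* Insn arrays are never writable from programs, so bake
	 * BPF_F_RDONLY_PROG in when the map is created. */
	attr->map_flags |= BPF_F_RDONLY_PROG;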

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28 15:15:43 -08:00
Frederic Weisbecker
a50413848f Merge branch 'rcu/refscale' into next
Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code.
2025-11-28 23:30:38 +01:00
Rafael J. Wysocki
7cede21e9f Merge branches 'pm-qos' and 'pm-tools'
Merge PM QoS updates and a cpupower utility update for 6.19-rc1:

 - Introduce and document a QoS limit on CPU exit latency during wakeup
   from suspend-to-idle (Ulf Hansson)

 - Add support for building libcpupower statically (Zuo An)

* pm-qos:
  Documentation: power/cpuidle: Document the CPU system wakeup latency QoS
  cpuidle: Respect the CPU system wakeup QoS limit for cpuidle
  sched: idle: Respect the CPU system wakeup QoS limit for s2idle
  pmdomain: Respect the CPU system wakeup QoS limit for cpuidle
  pmdomain: Respect the CPU system wakeup QoS limit for s2idle
  PM: QoS: Introduce a CPU system wakeup QoS limit

* pm-tools:
  tools/power/cpupower: Support building libcpupower statically
2025-11-28 16:50:45 +01:00
Rafael J. Wysocki
638757c9c9 Merge branches 'pm-em' and 'pm-opp'
Merge energy model management updates and operating performance points
(OPP) library changes for 6.19-rc1:

 - Add support for sending netlink notifications to user space on energy
   model updates (Changwoo Min, Peng Fan)

 - Minor improvements to the Rust OPP interface (Tamir Duberstein)

 - Fixes to scope-based pointers in the OPP library (Viresh Kumar)

* pm-em:
  PM: EM: Add to em_pd_list only when no failure
  PM: EM: Notify an event when the performance domain changes
  PM: EM: Implement em_notify_pd_created/updated()
  PM: EM: Implement em_notify_pd_deleted()
  PM: EM: Implement em_nl_get_pd_table_doit()
  PM: EM: Implement em_nl_get_pds_doit()
  PM: EM: Add an iterator and accessor for the performance domain
  PM: EM: Add a skeleton code for netlink notification
  PM: EM: Add em.yaml and autogen files
  PM: EM: Expose the ID of a performance domain via debugfs
  PM: EM: Assign a unique ID when creating a performance domain

* pm-opp:
  rust: opp: simplify callers of `to_c_str_array`
  OPP: Initialize scope-based pointers inline
  rust: opp: fix broken rustdoc link
2025-11-28 16:44:00 +01:00
Rafael J. Wysocki
f086594adb Merge branch 'pm-sleep'
Merge updates related to system suspend and hibernation for 6.19-rc1:

 - Replace snprintf() with scnprintf() in show_trace_dev_match()
   (Kaushlendra Kumar)

 - Fix memory allocation error handling in pm_vt_switch_required()
   (Malaya Kumar Rout)

 - Introduce CALL_PM_OP() macro and use it to simplify code in
   generic PM operations (Kaushlendra Kumar)

 - Add module param to backtrace all CPUs in the device power management
   watchdog (Sergey Senozhatsky)

 - Rework message printing in swsusp_save() (Rafael Wysocki)

 - Make it possible to change the number of hibernation compression
   threads (Xueqin Luo)

 - Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)

 - Add document on debugging shutdown hangs to PM documentation and
   correct a mistaken configuration option in it (Mario Limonciello)

 - Shut down wakeup source timer before removing the wakeup source from
   the list (Kaushlendra Kumar, Rafael Wysocki)

 - Introduce new PMSG_POWEROFF event for system shutdown handling with
   the help of PM device callbacks (Mario Limonciello)

 - Make pm_test delay interruptible by wakeup events (Riwen Lu)

 - Clean up kernel-doc comment style usage in the core hibernation
   code and remove unhelpful comments from it (Sunday Adelodun, Rafael
   Wysocki)

 - Add support for handling wakeup events and aborting the suspend
   process while it is syncing file systems (Samuel Wu, Rafael Wysocki)

* pm-sleep: (21 commits)
  PM: hibernate: Extra cleanup of comments in swap handling code
  PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper()
  PM: sleep: Add support for wakeup during filesystem sync
  PM: hibernate: Clean up kernel-doc comment style usage
  PM: suspend: Make pm_test delay interruptible by wakeup events
  usb: sl811-hcd: Add PM_EVENT_POWEROFF into suspend callbacks
  scsi: Add PM_EVENT_POWEROFF into suspend callbacks
  PM: Introduce new PMSG_POWEROFF event
  PM: wakeup: Update after recent wakeup source removal ordering change
  PM: wakeup: Delete timer before removing wakeup source from list
  Documentation: power: Correct a mistaken configuration option
  Documentation: power: Add document on debugging shutdown hangs
  freezer: Clarify that only cgroup1 freezer uses PM freezer
  PM: hibernate: add sysfs interface for hibernate_compression_threads
  PM: hibernate: make compression threads configurable
  PM: hibernate: dynamically allocate crc->unc_len/unc for configurable threads
  PM: hibernate: Rework message printing in swsusp_save()
  PM: dpm_watchdog: add module param to backtrace all CPUs
  PM: sleep: Introduce CALL_PM_OP() macro to simplify code
  PM: console: Fix memory allocation error handling in pm_vt_switch_required()
  ...
2025-11-28 16:01:13 +01:00
Rafael J. Wysocki
60d69a7ed1 Merge branches 'pm-core' and 'pm-runtime'
Merge a core power management update and runtime PM framework updates
for 6.19-rc1:

 - Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)

 - Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
   them in the PCI core and the ACPI TAD driver (Rafael Wysocki)

 - Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)

 - Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)

 - Fix typos in runtime.c comments (Malaya Kumar Rout)

* pm-core:
  PM: WQ_UNBOUND added to pm_wq workqueue

* pm-runtime:
  PCI/sysfs: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR()
  ACPI: TAD: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR()
  PM: runtime: Wrapper macros for ACQUIRE()/ACQUIRE_ERR()
  PM: runtime: fix typos in runtime.c comments
  ACPI: TAD: Improve runtime PM using guard macros
  ACPI: TAD: Rearrange runtime PM operations in acpi_tad_remove()
  PM: runtime: docs: Update pm_runtime_allow/forbid() documentation
2025-11-28 15:56:09 +01:00
Paul E. McKenney
bfad33230a refscale: Add SRCU-fast-updown readers
This commit adds refscale readers based on srcu_read_lock_fast_updown()
and srcu_read_unlock_fast_updown()
("refscale.scale_type=srcu-fast-updown"). On my x86 laptop, these are
about 2.2ns per pair.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-28 15:23:17 +01:00
Christian Brauner
981bec8f69 bpf: convert bpf_token_create() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-24-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner
798c2da490 bpf: convert bpf_iter_new_fd() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-23-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Linus Torvalds
aa7243aaf1 Merge tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
 "Two last minute fixes for the recently modified DMA API infrastructure:

   - proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me)

   - regression fix for the code refactoring related to P2PDMA (Pranjal
     Shrivastava)"

* tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappings
  iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink()
2025-11-27 17:29:15 -08:00
Steven Rostedt
f6ed9c5d31 overflow: Introduce struct_offset() to get offset of member
The trace_marker_raw file in tracefs takes a buffer from user space that
contains an id as well as a raw data string which is usually a binary
structure. The structure used has the following layout:

	struct raw_data_entry {
		struct trace_entry	ent;
		unsigned int		id;
		char			buf[];
	};

Since the passed-in "cnt" variable covers both the size of buf and the
size of id, the code to allocate the location on the ring buffer had:

   size = struct_size(entry, buf, cnt - sizeof(entry->id));

This is quite ugly and hard to understand. Instead, add a helper macro
called struct_offset() which then changes the above to a simple and easy
to understand:

   size = struct_offset(entry, id) + cnt;

This will likely come in handy for other use cases too.
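
A hedged sketch of what such a macro can look like; the actual
definition lives in overflow.h and may differ:

	/* Byte offset of @member within the struct pointed to by @p,
	 * mirroring struct_size()'s pointer-based convention. */
	#define struct_offset(p, member)	offsetof(typeof(*(p)), member)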

Link: https://lore.kernel.org/all/CAHk-=whYZVoEdfO1PmtbirPdBMTV9Nxt9f09CK0k6S+HJD3Zmg@mail.gmail.com/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://patch.msgid.link/20251126145249.05b1770a@gandalf.local.home
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-27 20:18:05 -05:00
Ilias Stamatis
6fb3acdebf Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
Commit 97523a4edb ("kernel/resource: remove first_lvl / siblings_only
logic") removed an optimization introduced by commit 756398750e
("resource: avoid unnecessary lookups in find_next_iomem_res()").  That
was not called out in the message of the first commit explicitly so it's
not entirely clear whether removing the optimization happened
inadvertently or not.

As the original commit message of the optimization explains there is no
point considering the children of a subtree in find_next_iomem_res() if
the top level range does not match.

Reinstating the optimization results in performance improvements in
systems where /proc/iomem is ~5k lines long.  Calling mmap() on /dev/mem
on such platforms takes 700-1500μs without the optimization and 10-50μs
with it.

Note that even though commit 97523a4edb removed the 'sibling_only'
parameter from next_resource(), newer kernels have basically reinstated it
under the name 'skip_children'.
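
A hedged sketch of the idea, with variable names assumed: since a
child resource is always contained within its parent, the walk can
skip a node's children whenever the node itself cannot intersect the
requested range.

	bool skip_children = false;

	for (p = iomem_resource.child; p; p = next_resource(p, skip_children)) {
		if (p->start > end)
			break;			/* walked past the range */
		skip_children = p->end < start;	/* children can't match either */
		if (skip_children)
			continue;
		/* ... check flags/desc and return the match ... */
	}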

Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u
Fixes: 97523a4edb ("kernel/resource: remove first_lvl / siblings_only logic")
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <huang.ying.caritas@gmail.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:45 -08:00
Breno Leitao
3fa805c37d vmcoreinfo: track and log recoverable hardware errors
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that are visible to the OS but do not cause a panic)
and record them for vmcore consumption.  This aids post-mortem crash
analysis tools by preserving a count and timestamp for the last occurrence
of such errors.  On the other hand, correctable errors, which the OS
typically remains unaware of because the underlying hardware handles them
transparently, are less relevant for crash dumps and therefore are NOT
tracked in this infrastructure.

Add centralized logging for recoverable hardware errors, keyed by the
subsystem that reported them.

hwerror_data is write-only at kernel runtime, and it is meant to be read
from vmcore using tools like crash/drgn.  For example, this is how it
looks when opening the crash dump with drgn.

	>>> prog['hwerror_data']
	(struct hwerror_info[1]){
		{
			.count = (int)844,
			.timestamp = (time64_t)1752852018,
		},
		...

This helps fleet operators quickly triage whether a crash may be
influenced by recoverable hardware errors (which execute an uncommon code
path in the kernel), especially when recoverable errors occurred shortly
before a panic, such as the bug fixed by commit ee62ce7a1d ("page_pool:
Track DMA-mapped pages and unmap them when destroying the pool").

This is not intended to replace full hardware diagnostics but provides a
fast way to correlate hardware events with kernel panics.

Rare machine check exceptions—like those indicated by mce_flags.p5 or
mce_flags.winchip—are not accounted for in this method, as they fall
outside the intended usage scope for this feature's user base.
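
A hedged sketch of the recording side, using only the fields visible
in the drgn output above; the helper and enum names are assumptions:

	void hwerror_log_error_type(enum hwerror_type type)
	{
		/* Write-only at runtime; read back from the vmcore. */
		hwerror_data[type].count++;
		hwerror_data[type].timestamp = ktime_get_real_seconds();
	}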

[leitao@debian.org: add hw-recoverable-errors to toctree]
  Link: https://lkml.kernel.org/r/20251127-vmcoreinfo_fix-v1-1-26f5b1c43da9@debian.org
Link: https://lkml.kernel.org/r/20251010-vmcore_hw_error-v5-1-636ede3efe44@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>	[APEI]
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:44 -08:00
Mike Rapoport (Microsoft)
7b71205ae1 kho: fix restoring of contiguous ranges of order-0 pages
When contiguous ranges of order-0 pages are restored, kho_restore_page()
calls prep_compound_page() with the first page in the range and order as
parameters and then kho_restore_pages() calls split_page() to make sure
all pages in the range are order-0.

However, split_page() is not intended to split compound pages, and with
CONFIG_DEBUG_VM enabled it will trigger a VM_BUG_ON_PAGE().

Update kho_restore_page() so that it only uses prep_compound_page() when
restoring an actual folio, and make sure it properly sets the page count
both for large folios and for ranges of order-0 pages.
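
A hedged sketch of the distinction being drawn; names and parameters
are approximated from the changelog, not the actual patch:

	static struct page *kho_restore_page(phys_addr_t phys, unsigned int order)
	{
		struct page *page = pfn_to_page(PHYS_PFN(phys));

		if (order)
			prep_compound_page(page, order);	/* a real folio */
		set_page_count(page, 1);	/* head or order-0 page */
		return page;
	}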

Link: https://lkml.kernel.org/r/20251125110917.843744-3-rppt@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reported-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:44 -08:00