Commit Graph

7594 Commits

Author SHA1 Message Date
Damien Le Moal
c8cdc025a6 block: fix NULL pointer dereference in blk_zone_reset_all_bio_endio()
commit c2b8d20628 upstream.

For zoned block devices that do not need zone write plugs (e.g. most
device mapper devices that support zones), the disk hash table of zone
write plugs is NULL. For such devices, blk_zone_reset_all_bio_endio()
should not attempt to scan this hash table as that causes a NULL pointer
dereference.

Fix this by checking that the disk does have zone write plugs using the
atomic counter. This is equivalent to checking for a non-NULL hash table
but has the advantage of also speeding up the execution of
blk_zone_reset_all_bio_endio() for devices that do use zone write plugs
but do not have any plug in the hash table (e.g. a disk with only full
zones).
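
A minimal sketch of the resulting check (illustrative only; the function
signature is assumed, and nr_zone_wplugs is the atomic counter referenced
above):

  static void blk_zone_reset_all_bio_endio(struct bio *bio)
  {
          struct gendisk *disk = bio->bi_bdev->bd_disk;

          /*
           * No zone write plugs at all (the hash table may be NULL for
           * devices that do not need zone write plugs): nothing to update.
           */
          if (!atomic_read(&disk->nr_zone_wplugs))
                  return;

          /* ... walk the hash table and update write pointer offsets ... */
  }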

Fixes: efae226c2e ("block: handle zone management operations completions")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:15:06 +01:00
Damien Le Moal
2cd2003f7b block: handle zone management operations completions
[ Upstream commit efae226c2e ]

The functions blk_zone_wplug_handle_reset_or_finish() and
blk_zone_wplug_handle_reset_all() both modify the zone write pointer
offset of zone write plugs that are the target of a reset, reset all or
finish zone management operation. However, these functions do this
modification before the BIO is executed. So if the zone operation fails,
the modified zone write pointer offsets become invalid.

Avoid this by modifying the zone write pointer offset of a zone write
plug that is the target of a zone management operation when the
operation completes. To do so, modify blk_zone_bio_endio() to call the
new function blk_zone_mgmt_bio_endio() which in turn calls the functions
blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
blk_zone_finish_bio_endio() depending on the operation of the completed
BIO, to modify a zone write plug write pointer offset accordingly.
These functions are called only if the BIO execution was successful.
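
Sketched out, the completion dispatch described above looks roughly like
this, using the function names from this message (not the exact upstream
code):

  static void blk_zone_mgmt_bio_endio(struct bio *bio)
  {
          /* Only successful operations update write pointer offsets. */
          if (bio->bi_status != BLK_STS_OK)
                  return;

          switch (bio_op(bio)) {
          case REQ_OP_ZONE_RESET_ALL:
                  blk_zone_reset_all_bio_endio(bio);
                  break;
          case REQ_OP_ZONE_RESET:
                  blk_zone_reset_bio_endio(bio);
                  break;
          case REQ_OP_ZONE_FINISH:
                  blk_zone_finish_bio_endio(bio);
                  break;
          default:
                  break;
          }
  }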

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[ adapted bdev_zone_is_seq() check to disk_zone_is_conv() ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:15:04 +01:00
Damien Le Moal
aa85f48dfc block: freeze queue when updating zone resources
[ Upstream commit bba4322e3f ]

Modify disk_update_zone_resources() to freeze the device queue before
updating the number of zones, zone capacity and other zone related
resources. The locking order resulting from the call to
queue_limits_commit_update_frozen() is preserved, that is, the queue
limits lock is first taken by calling queue_limits_start_update() before
freezing the queue, and the queue is unfrozen after executing
queue_limits_commit_update(), which replaces the call to
queue_limits_commit_update_frozen().

This change ensures that there are no in-flight I/Os when the zone
resources are updated due to a zone revalidation. In case of error when
the limits are applied, directly call disk_free_zone_resources() from
disk_update_zone_resources() while the disk queue is still frozen to
avoid needing to freeze & unfreeze the queue again in
blk_revalidate_disk_zones(), thus simplifying that function code a
little.
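
A rough sketch of the resulting order (helper names taken from this
message; not the verbatim upstream code):

  lim = queue_limits_start_update(q);     /* limits lock taken first */
  blk_mq_freeze_queue(q);                 /* then freeze: no in-flight I/O */

  /* ... update nr_zones, zone capacity and other zone resources ... */

  ret = queue_limits_commit_update(q, &lim);
  if (ret)
          disk_free_zone_resources(disk); /* queue is still frozen here */
  blk_mq_unfreeze_queue(q);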

Fixes: 0b83c86b44 ("block: Prevent potential deadlock in blk_revalidate_disk_zones()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[ adapted blk_mq_freeze_queue/unfreeze_queue calls to single-argument void API ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:58 +01:00
Damien Le Moal
777a1ddeb9 block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs
commit 552c1149af upstream.

Commit fe0418eb9b ("block: Prevent potential deadlocks in zone write
plug error recovery") added a WARN check in disk_put_zone_wplug() to
verify that when the last reference to a zone write plug is dropped,
this zone write plug does not have the BLK_ZONE_WPLUG_PLUGGED flag set,
that is, that it is not plugged.

However, the function disk_zone_wplug_abort(), which is called for zone
reset and zone finish operations, does not clear this flag after
emptying a zone write plug BIO list. This can result in the
disk_put_zone_wplug() warning triggering if the user (erroneously, as
that is bad practice) issues zone reset or zone finish operations while
the target zone still has plugged BIOs.

Modify disk_zone_wplug_abort() to clear the BLK_ZONE_WPLUG_PLUGGED flag.
And while at it, also add a lockdep annotation to ensure that this
function is called with the zone write plug spinlock held.
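
An illustrative sketch of the abort path with the fix applied (the exact
helpers used to fail the plugged BIOs may differ):

  static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
  {
          struct bio *bio;

          /* New lockdep annotation: the plug spinlock must be held. */
          lockdep_assert_held(&zwplug->lock);

          while ((bio = bio_list_pop(&zwplug->bio_list)))
                  bio_io_error(bio);

          /* The BIO list is now empty: the zone is no longer plugged. */
          zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
  }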

Fixes: fe0418eb9b ("block: Prevent potential deadlocks in zone write plug error recovery")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:42 +01:00
Cong Zhang
9eb1ee1f2a blk-mq: skip CPU offline notify on unmapped hctx
[ Upstream commit 10845a105b ]

If an hctx has no software ctx mapped, blk_mq_map_swqueue() never
allocates tags and leaves hctx->tags NULL. The CPU hotplug offline
notifier can still run for that hctx; return early in that case since the
hctx cannot hold any requests.
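
A sketch of the early return (illustrative):

  static int blk_mq_hctx_notify_offline(unsigned int cpu,
                                        struct hlist_node *node)
  {
          struct blk_mq_hw_ctx *hctx =
                  hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

          /* Unmapped hctx: no tags were allocated, no requests to drain. */
          if (!hctx->tags)
                  return 0;

          /* ... existing logic that drains and waits for requests ... */
          return 0;
  }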

Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Fixes: bf0beec060 ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08 10:14:37 +01:00
Li Chen
0a65cac1d8 block: rate-limit capacity change info log
commit 3179a5f7f8 upstream.

Loop devices under a heavy stress-ng loop stressor can trigger many
capacity change events in a short time. Each event prints an info
message from set_capacity_and_notify(), flooding the console and
contributing to soft lockups on slow consoles.

Switch the printk in set_capacity_and_notify() to
pr_info_ratelimited() so frequent capacity changes do not spam
the log while still reporting occasional changes.
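
For illustration, the change in set_capacity_and_notify() amounts to
something like this (message text and variable names are illustrative):

  pr_info_ratelimited("%s: detected capacity change from %lld to %lld\n",
                      disk->disk_name, capacity, size);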

Cc: stable@vger.kernel.org
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:14 +01:00
Mohamed Khalfella
3baeec23a8 block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
[ Upstream commit 59e25ef2b4 ]

blk_mq_{add,del}_queue_tag_set() functions add and remove queues from a
tagset; the functions make sure that the tagset and queues are marked as
shared when two or more queues are attached to the same tagset.
Initially a tagset starts as unshared, and when the number of added
queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
with all the queues attached to it. When the number of attached queues
drops to 1, blk_mq_del_queue_tag_set() needs to mark both the tagset and
the remaining queue as unshared.

Both functions need to freeze the current queues in the tagset before
setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so,
both functions hold the set->tag_list_lock mutex, which makes sense as we
do not want queues to be added or deleted in the process. This used to
work fine until commit 98d81f0df7 ("nvme: use blk_mq_[un]quiesce_tagset")
made the nvme driver quiesce the tagset instead of quiescing individual
queues. blk_mq_quiesce_tagset() does the job and quiesces the queues in
set->tag_list while also holding set->tag_list_lock.

This results in a deadlock between two threads with these stacktraces:

  __schedule+0x47c/0xbb0
  ? timerqueue_add+0x66/0xb0
  schedule+0x1c/0xa0
  schedule_preempt_disabled+0xa/0x10
  __mutex_lock.constprop.0+0x271/0x600
  blk_mq_quiesce_tagset+0x25/0xc0
  nvme_dev_disable+0x9c/0x250
  nvme_timeout+0x1fc/0x520
  blk_mq_handle_expired+0x5c/0x90
  bt_iter+0x7e/0x90
  blk_mq_queue_tag_busy_iter+0x27e/0x550
  ? __blk_mq_complete_request_remote+0x10/0x10
  ? __blk_mq_complete_request_remote+0x10/0x10
  ? __call_rcu_common.constprop.0+0x1c0/0x210
  blk_mq_timeout_work+0x12d/0x170
  process_one_work+0x12e/0x2d0
  worker_thread+0x288/0x3a0
  ? rescuer_thread+0x480/0x480
  kthread+0xb8/0xe0
  ? kthread_park+0x80/0x80
  ret_from_fork+0x2d/0x50
  ? kthread_park+0x80/0x80
  ret_from_fork_asm+0x11/0x20

  __schedule+0x47c/0xbb0
  ? xas_find+0x161/0x1a0
  schedule+0x1c/0xa0
  blk_mq_freeze_queue_wait+0x3d/0x70
  ? destroy_sched_domains_rcu+0x30/0x30
  blk_mq_update_tag_set_shared+0x44/0x80
  blk_mq_exit_queue+0x141/0x150
  del_gendisk+0x25a/0x2d0
  nvme_ns_remove+0xc9/0x170
  nvme_remove_namespaces+0xc7/0x100
  nvme_remove+0x62/0x150
  pci_device_remove+0x23/0x60
  device_release_driver_internal+0x159/0x200
  unbind_store+0x99/0xa0
  kernfs_fop_write_iter+0x112/0x1e0
  vfs_write+0x2b1/0x3d0
  ksys_write+0x4e/0xb0
  do_syscall_64+0x5b/0x160
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

The top stacktrace shows nvme_timeout() called to handle an NVMe command
timeout. The timeout handler is trying to disable the controller and, as
a first step, it needs to call blk_mq_quiesce_tagset() to tell blk-mq not
to call queue callback handlers. The thread is stuck waiting for
set->tag_list_lock as it tries to walk the queues in set->tag_list.

The lock is held by the second thread in the bottom stack, which is
waiting for one of the queues to be frozen. The queue usage counter will
only drop to zero after nvme_timeout() finishes, which will never happen
because that thread will wait for this mutex forever.

Given that [un]quiescing a queue is an operation that does not need to
sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking
set->tag_list_lock, and update blk_mq_{add,del}_queue_tag_set() to use
RCU-safe list operations. Also, delete INIT_LIST_HEAD(&q->tag_set_list)
in blk_mq_del_queue_tag_set() because we cannot re-initialize it while
the list is being traversed under RCU. The deleted queue will not be
added to or deleted from a tagset again, and it will be freed in
blk_free_queue() after the end of the RCU grace period.
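
A sketch of the resulting lock-free tagset walk (illustrative):

  void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
  {
          struct request_queue *q;

          rcu_read_lock();
          list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
                  if (!blk_queue_skip_tagset_quiesce(q))
                          blk_mq_quiesce_queue_nowait(q);
          }
          rcu_read_unlock();

          blk_mq_wait_quiesce_done(set);
  }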

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Fixes: 98d81f0df7 ("nvme: use blk_mq_[un]quiesce_tagset")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 13:55:21 +01:00
Shaurya Rane
453e4b0c84 block: fix memory leak in __blkdev_issue_zero_pages
[ Upstream commit f7e3f852a4 ]

Move the fatal signal check before bio_alloc() to prevent a memory
leak when BLKDEV_ZERO_KILLABLE is set and a fatal signal is pending.

Previously, the bio was allocated before checking for a fatal signal.
If a signal was pending, the code would break out of the loop without
freeing or chaining the just-allocated bio, causing a memory leak.

This matches the pattern already used in __blkdev_issue_write_zeroes()
where the signal check precedes the allocation.

Fixes: bf86bcdb40 ("blk-lib: check for kill signal in ioctl BLKZEROOUT")
Reported-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=527a7e48a3d3d315d862
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 13:55:19 +01:00
Cong Zhang
be83e65542 blk-mq: Abort suspend when wakeup events are pending
[ Upstream commit c196bf43d7 ]

During system suspend, wakeup-capable IRQs for a block device can be
delayed, which can cause blk_mq_hctx_notify_offline() to hang
indefinitely while waiting for pending requests to complete.
Skip the request waiting loop and abort suspend when wakeup events are
pending to prevent the deadlock.

Fixes: bf0beec060 ("blk-mq: drain I/O when all CPUs in a hctx are offline")
Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 13:55:19 +01:00
Guenter Roeck
de38f64811 block/blk-throttle: Fix throttle slice time for SSDs
[ Upstream commit f76581f9f1 ]

Commit d61fcfa4bb ("blk-throttle: choose a small throtl_slice for SSD")
introduced device type specific throttle slices if BLK_DEV_THROTTLING_LOW
was enabled. Commit bf20ab538c ("blk-throttle: remove
CONFIG_BLK_DEV_THROTTLING_LOW") removed support for BLK_DEV_THROTTLING_LOW,
but left the device type specific throttle slices in place. This
effectively changed throttling behavior on systems with SSDs, which now use
a different and non-configurable slice time compared to non-SSD devices.
Practical impact is that throughput tests with low configured throttle
values (65536 bps) experience less than expected throughput on SSDs,
presumably due to rounding errors associated with the small throttle slice
time used for those devices. The same tests pass when setting the throttle
values to 65536 * 4 = 262144 bps.

The original code sets the throttle slice time to DFL_THROTL_SLICE_HD if
CONFIG_BLK_DEV_THROTTLING_LOW is disabled. Restore that code to fix the
problem. With that, DFL_THROTL_SLICE_SSD is no longer necessary. Revert to
the original code and re-introduce DFL_THROTL_SLICE to replace both
DFL_THROTL_SLICE_HD and DFL_THROTL_SLICE_SSD. This effectively reverts
commit d61fcfa4bb ("blk-throttle: choose a small throtl_slice for SSD").

While at it, also remove MAX_THROTL_SLICE since it is not used anymore.
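
A sketch of the restored default (the value follows the pre-d61fcfa4bb
code; illustrative only):

  /* Throttling is performed over a slice and after that slice is renewed. */
  #define DFL_THROTL_SLICE (HZ / 10)

  td->throtl_slice = DFL_THROTL_SLICE;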

Fixes: bf20ab538c ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 13:55:00 +01:00
Bart Van Assche
da06fa2308 block/mq-deadline: Switch back to a single dispatch list
[ Upstream commit d60055cf52 ]

Commit c807ab520f ("block/mq-deadline: Add I/O priority support")
modified the behavior of request flag BLK_MQ_INSERT_AT_HEAD from
dispatching a request before other requests into dispatching a request
before other requests with the same I/O priority. This is not correct since
BLK_MQ_INSERT_AT_HEAD is used when requeuing requests and also when a flush
request is inserted.  Both types of requests should be dispatched as soon
as possible. Hence, make the mq-deadline I/O scheduler again ignore the I/O
priority for BLK_MQ_INSERT_AT_HEAD requests.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai@kernel.org>
Reported-by: chengkaitao <chengkaitao@kylinos.cn>
Closes: https://lore.kernel.org/linux-block/20251009155253.14611-1-pilgrimtao@gmail.com/
Fixes: c807ab520f ("block/mq-deadline: Add I/O priority support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 13:54:45 +01:00
Bart Van Assche
53d0a52ccc block/mq-deadline: Introduce dd_start_request()
[ Upstream commit 93a358af59 ]

Prepare for adding a second caller of this function. No functionality
has been changed.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai@kernel.org>
Cc: chengkaitao <chengkaitao@kylinos.cn>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: d60055cf52 ("block/mq-deadline: Switch back to a single dispatch list")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 13:54:45 +01:00
Carlos Llamas
1f0f07fd8f blk-crypto: use BLK_STS_INVAL for alignment errors
[ Upstream commit 0b39ca4572 ]

Make __blk_crypto_bio_prep() propagate BLK_STS_INVAL when IO segments
fail the data unit alignment check.

This was flagged by an LTP test that expects EINVAL when performing an
O_DIRECT read with a misaligned buffer [1].
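
For illustration, the change in __blk_crypto_bio_prep() is roughly (helper
name and error path per our reading, not a verbatim diff):

  if (!bio_crypt_check_alignment(bio)) {
          bio->bi_status = BLK_STS_INVAL; /* was BLK_STS_IOERR */
          goto fail;
  }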

Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/all/aP-c5gPjrpsn0vJA@google.com/ [1]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01 11:43:36 +01:00
Yu Kuai
56ac639d6f blk-cgroup: fix possible deadlock while configuring policy
[ Upstream commit 5d726c4dbe ]

Following deadlock can be triggered easily by lockdep:

WARNING: possible circular locking dependency detected
6.17.0-rc3-00124-ga12c2658ced0 #1665 Not tainted
------------------------------------------------------
check/1334 is trying to acquire lock:
ff1100011d9d0678 (&q->sysfs_lock){+.+.}-{4:4}, at: blk_unregister_queue+0x53/0x180

but task is already holding lock:
ff1100011d9d00e0 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: del_gendisk+0xba/0x110

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
       blk_queue_enter+0x40b/0x470
       blkg_conf_prep+0x7b/0x3c0
       tg_set_limit+0x10a/0x3e0
       cgroup_file_write+0xc6/0x420
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&q->rq_qos_mutex){+.+.}-{4:4}:
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       wbt_init+0x17e/0x280
       wbt_enable_default+0xe9/0x140
       blk_register_queue+0x1da/0x2e0
       __add_disk+0x38c/0x5d0
       add_disk_fwnode+0x89/0x250
       device_add_disk+0x18/0x30
       virtblk_probe+0x13a3/0x1800
       virtio_dev_probe+0x389/0x610
       really_probe+0x136/0x620
       __driver_probe_device+0xb3/0x230
       driver_probe_device+0x2f/0xe0
       __driver_attach+0x158/0x250
       bus_for_each_dev+0xa9/0x130
       driver_attach+0x26/0x40
       bus_add_driver+0x178/0x3d0
       driver_register+0x7d/0x1c0
       __register_virtio_driver+0x2c/0x60
       virtio_blk_init+0x6f/0xe0
       do_one_initcall+0x94/0x540
       kernel_init_freeable+0x56a/0x7b0
       kernel_init+0x2b/0x270
       ret_from_fork+0x268/0x4c0
       ret_from_fork_asm+0x1a/0x30

-> #0 (&q->sysfs_lock){+.+.}-{4:4}:
       __lock_acquire+0x1835/0x2940
       lock_acquire+0xf9/0x450
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       blk_unregister_queue+0x53/0x180
       __del_gendisk+0x226/0x690
       del_gendisk+0xba/0x110
       sd_remove+0x49/0xb0 [sd_mod]
       device_remove+0x87/0xb0
       device_release_driver_internal+0x11e/0x230
       device_release_driver+0x1a/0x30
       bus_remove_device+0x14d/0x220
       device_del+0x1e1/0x5a0
       __scsi_remove_device+0x1ff/0x2f0
       scsi_remove_device+0x37/0x60
       sdev_store_delete+0x77/0x100
       dev_attr_store+0x1f/0x40
       sysfs_kf_write+0x65/0x90
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

other info that might help us debug this:

Chain exists of:
  &q->sysfs_lock --> &q->rq_qos_mutex --> &q->q_usage_counter(queue)#3

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->q_usage_counter(queue)#3);
                               lock(&q->rq_qos_mutex);
                               lock(&q->q_usage_counter(queue)#3);
  lock(&q->sysfs_lock);

The root cause is that the queue usage counter is grabbed with
rq_qos_mutex held in blkg_conf_prep(), while the queue should be frozen
before taking rq_qos_mutex in other contexts.

The blk_queue_enter() in blkg_conf_prep() is used to protect against
policy deactivation, which is already protected by blkcg_mutex, hence
convert blk_queue_enter() to blkcg_mutex to fix this problem. Meanwhile,
since blkcg_mutex is held after the queue is frozen during policy
deactivation, also convert blkg_alloc() to use GFP_NOIO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13 15:34:07 -05:00
Yu Kuai
c7c6c09cb4 blk-crypto: fix missing blktrace bio split events
commit 06d712d297 upstream.

trace_block_split() is missing, resulting in blktrace being unable to
catch BIO split events and making it harder to analyze the BIO sequence.
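
A sketch of the missing tracepoint in the blk-crypto-fallback split path
(exact placement is illustrative):

  bio_chain(split_bio, bio);
  trace_block_split(split_bio, bio->bi_iter.bi_sector); /* added */
  submit_bio_noacct(bio);
  *bio_ptr = split_bio;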

Cc: stable@vger.kernel.org
Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19 16:33:48 +02:00
Qianfeng Rong
60002c90f2 block: use int to store blk_stack_limits() return value
[ Upstream commit b0b4518c99 ]

Change the 'ret' variable in blk_stack_limits() from unsigned int to int,
as it needs to store the negative value -1.

Storing negative error codes in an unsigned type, or performing equality
comparisons (e.g., ret == -1), doesn't cause an issue at runtime [1] but
can be confusing.  Additionally, assigning negative error codes to an
unsigned type may trigger a GCC warning when the -Wsign-conversion flag
is enabled.

No effect on runtime.
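
For illustration, the type change is simply:

  int ret;        /* was: unsigned int ret */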

Link: https://lore.kernel.org/all/x3wogjf6vgpkisdhg3abzrx7v7zktmdnfmqeih5kosszmagqfs@oh3qxrgzkikf/ #1
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Fixes: fe0b393f2c ("block: Correct handling of bottom device misaligment")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250902130930.68317-1-rongqianfeng@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15 11:59:58 +02:00
Li Nan
babc634e9f blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx
[ Upstream commit 4c7ef92f6d ]

In __blk_mq_update_nr_hw_queues() the return value of
blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for an
hctx fails, later changing the number of hw_queues or removing the disk
will trigger the following warning:

  kernfs: can not remove 'nr_tags', no directory
  WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
  Call Trace:
   remove_files.isra.1+0x38/0xb0
   sysfs_remove_group+0x4d/0x100
   sysfs_remove_groups+0x31/0x60
   __kobject_del+0x23/0xf0
   kobject_del+0x17/0x40
   blk_mq_unregister_hctx+0x5d/0x80
   blk_mq_sysfs_unregister_hctxs+0x94/0xd0
   blk_mq_update_nr_hw_queues+0x124/0x760
   nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
   nullb_device_submit_queues_store+0x92/0x120 [null_blk]

kobject_del() was called unconditionally even if sysfs creation failed.
Fix it by checking the kobject creation status before deleting it.
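
A sketch of the check (illustrative):

  static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
  {
          /* Skip deletion if the kobject was never added to sysfs. */
          if (!hctx->kobj.state_in_sysfs)
                  return;

          /* ... remove attribute groups and kobject_del(&hctx->kobj) ... */
  }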

Fixes: 477e19dedc ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250826084854.1030545-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15 11:59:57 +02:00
Christoph Hellwig
0073c41d4b block: add a queue_limits_commit_update_frozen helper
[ Upstream commit aa427d7b73 ]

Add a helper that freezes the queue, updates the queue limits and
unfreezes the queue and convert all open coded versions of that to the
new helper.
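
A sketch of the helper's shape, assuming the single-argument freeze API
used in this branch (illustrative):

  int queue_limits_commit_update_frozen(struct request_queue *q,
                                        struct queue_limits *lim)
  {
          int ret;

          blk_mq_freeze_queue(q);
          ret = queue_limits_commit_update(q, lim);
          blk_mq_unfreeze_queue(q);

          return ret;
  }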

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 708e2371f7 ("scsi: sr: Reinstate rotational media flag")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-09 18:58:24 +02:00
Bart Van Assche
c50747a963 blk-zoned: Fix a lockdep complaint about recursive locking
commit 198f36f902 upstream.

If preparing a write bio fails then blk_zone_wplug_bio_work() calls
bio_endio() with zwplug->lock held. If a device mapper driver is stacked
on top of the zoned block device then this results in nested locking of
zwplug->lock. The resulting lockdep complaint is a false positive
because this is nested locking and not recursive locking. Suppress this
false positive by calling blk_zone_wplug_bio_io_error() without holding
zwplug->lock. This is safe because no code in
blk_zone_wplug_bio_io_error() depends on zwplug->lock being held. This
patch suppresses the following lockdep complaint:

WARNING: possible recursive locking detected
--------------------------------------------
kworker/3:0H/46 is trying to acquire lock:
ffffff882968b830 (&zwplug->lock){-...}-{2:2}, at: blk_zone_write_plug_bio_endio+0x64/0x1f0

but task is already holding lock:
ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&zwplug->lock);
  lock(&zwplug->lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/3:0H/46:
 #0: ffffff8809486758 ((wq_completion)sdd_zwplugs){+.+.}-{0:0}, at: process_one_work+0x1bc/0x65c
 #1: ffffffc085de3d70 ((work_completion)(&zwplug->bio_work)){+.+.}-{0:0}, at: process_one_work+0x1e4/0x65c
 #2: ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c

stack backtrace:
CPU: 3 UID: 0 PID: 46 Comm: kworker/3:0H Tainted: G        W  OE      6.12.38-android16-5-maybe-dirty-4k #1 8b362b6f76e3645a58cd27d86982bce10d150025
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Spacecraft board based on MALIBU (DT)
Workqueue: sdd_zwplugs blk_zone_wplug_bio_work
Call trace:
 dump_backtrace+0xfc/0x17c
 show_stack+0x18/0x28
 dump_stack_lvl+0x40/0xa0
 dump_stack+0x18/0x24
 print_deadlock_bug+0x38c/0x398
 __lock_acquire+0x13e8/0x2e1c
 lock_acquire+0x134/0x2b4
 _raw_spin_lock_irqsave+0x5c/0x80
 blk_zone_write_plug_bio_endio+0x64/0x1f0
 bio_endio+0x9c/0x240
 __dm_io_complete+0x214/0x260
 clone_endio+0xe8/0x214
 bio_endio+0x218/0x240
 blk_zone_wplug_bio_work+0x204/0x48c
 process_one_work+0x26c/0x65c
 worker_thread+0x33c/0x498
 kthread+0x110/0x134
 ret_from_fork+0x10/0x20

Cc: stable@vger.kernel.org
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250825182720.1697203-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-04 15:31:54 +02:00
Damien Le Moal
8263f32e10 block: Introduce bio_needs_zone_write_plugging()
commit f70291411b upstream.

In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio) == true) operation directed at a zoned block device
using zone write plugging, that is, a block device with a disk that has
a zone write plug hash table.

This helper allows simplifying the check on entry to blk_zone_plug_bio()
and is used to protect calls to it for blk-mq devices and DM devices.
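
A sketch matching the description above (the hash table field name is an
assumption):

  static inline bool bio_needs_zone_write_plugging(struct bio *bio)
  {
          struct gendisk *disk = bio->bi_bdev->bd_disk;

          /* Any write to a zoned disk that has a zone write plug hash table. */
          return op_is_write(bio->bi_opf) && bdev_is_zoned(bio->bi_bdev) &&
                 disk->zone_wplugs_hash;
  }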

Fixes: f211268ed1 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20 18:30:51 +02:00
John Garry
46aa80ef49 block: avoid possible overflow for chunk_sectors check in blk_stack_limits()
[ Upstream commit 448dfecc7f ]

In blk_stack_limits(), we check that the t->chunk_sectors value is a
multiple of the t->physical_block_size value.

However, by finding the chunk_sectors value in bytes, we may overflow
the unsigned int which holds chunk_sectors, so change the check to be
based on sectors.
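
A sketch of the sector-based form of the check (illustrative):

  /*
   * "t->chunk_sectors << SECTOR_SHIFT" can overflow an unsigned int, so
   * express the alignment check in sectors instead of bytes.
   */
  if (t->chunk_sectors % (t->physical_block_size >> SECTOR_SHIFT))
          t->chunk_sectors = 0;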

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250729091448.1691334-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20 18:30:49 +02:00
Yu Kuai
ed30c38d1e lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
[ Upstream commit 42e6c6ce03 ]

Currently elevators will record an internal 'async_depth' to throttle
asynchronous requests, and they calculate shallow_depth based on
sb->shift, on the assumption that sb->shift is the number of available
tags in one word.

However, sb->shift is not the number of available tags in the last word, see
__map_depth:

if (index == sb->map_nr - 1)
  return sb->depth - (index << sb->shift);

As a consequence, if the last word is used, more tags can be obtained
than expected. For example, assume nr_requests=256 and there are four
words; in the worst case, if the user sets nr_requests=32, then the first
word is the last word, and still using the bits per word, which is 64, to
calculate async_depth is wrong.

On the other hand, due to cgroup QoS, bfq may want to allow only one
request to be allocated, but setting shallow_depth=1 will still allow as
many requests as there are words to be allocated.

Fix these problems by applying shallow_depth to the whole sbitmap instead
of to each word, and change kyber, mq-deadline and bfq to follow this. A
new helper, __map_depth_with_shallow(), is introduced to calculate the
available bits in each word.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20 18:30:49 +02:00
Christoph Hellwig
0257dc08a4 block: ensure discard_granularity is zero when discard is not supported
[ Upstream commit fad6551fcf ]

Documentation/ABI/stable/sysfs-block states:

  What: /sys/block/<disk>/queue/discard_granularity
  [...]
  A discard_granularity of 0 means that the device does not support
  discard functionality.

but this got broken when sorting out the block limits updates.  Fix this
by setting the discard_granularity limit to zero when the combined
max_discard_sectors is zero.
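
A sketch of the fix in the limits stacking code (illustrative):

  /* No discard support: keep the documented sysfs contract. */
  if (!t->max_discard_sectors)
          t->discard_granularity = 0;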

Fixes: 3c407dc723 ("block: default the discard granularity to sector size")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250731152228.873923-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15 12:14:05 +02:00
Ming Lei
e8767b89cd block: fix kobject leak in blk_unregister_queue
[ Upstream commit 3051247e4f ]

The kobject for the queue, `disk->queue_kobj`, is initialized with a
reference count of 1 via `kobject_init()` in `blk_register_queue()`.
While `kobject_del()` is called during the unregister path to remove
the kobject from sysfs, the initial reference is never released.

Add a call to `kobject_put()` in `blk_unregister_queue()` to properly
decrement the reference count and fix the leak.
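
A sketch of the unregister path with the missing put added (illustrative):

  kobject_uevent(&disk->queue_kobj, KOBJ_REMOVE);
  kobject_del(&disk->queue_kobj);
  /* Drop the reference taken by kobject_init() in blk_register_queue(). */
  kobject_put(&disk->queue_kobj);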

Fixes: 2bd85221a6 ("block: untangle request_queue refcounting from sysfs")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250711083009.2574432-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-24 08:56:30 +02:00
Damien Le Moal
943801c380 block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
commit f705d33c2f upstream.

When blk_zone_write_plug_bio_endio() is called for a regular write BIO
used to emulate a zone append operation, that is, a BIO flagged with
BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the
original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not
cleared. Clear it to fully return the BIO to its original definition.
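
A sketch of the endio handling with the flag cleared (illustrative):

  if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
          /* Restore the original operation ... */
          bio->bi_opf &= ~REQ_OP_MASK;
          bio->bi_opf |= REQ_OP_ZONE_APPEND;
          /* ... and also drop the emulation flag. */
          bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);
  }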

Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27 11:11:19 +01:00
Jens Axboe
5538af3843 block: use plug request list tail for one-shot backmerge attempt
commit 961296e89d upstream.

Previously, the block layer stored the requests in the plug list in
LIFO order. For this reason, blk_attempt_plug_merge() would check
just the head entry for a back merge attempt, and abort after that
unless requests for multiple queues existed in the plug list. If more
than one request is present in the plug list, this makes the one-shot
back merging less useful than before, as it'll always fail to find a
quick merge candidate.

Use the tail entry for the one-shot merge attempt, which is the last
added request in the list. If that fails, abort immediately unless
there are multiple queues available. If multiple queues are available,
then scan the list. Ideally the latter scan would be a backwards scan
of the list, but as it currently stands, the plug list is singly linked
and hence this isn't easily feasible.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Fixes: e70c301fae ("block: don't reorder requests in blk_add_rq_to_plug")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27 11:11:19 +01:00
Christoph Hellwig
0fccb6773b block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
[ Upstream commit cf625013d8 ]

Bios queued up in the zone write plug have already gone through all all
preparation in the submit_bio path, including the freeze protection.

Submitting them through submit_bio_noacct_nocheck duplicates the work
and can cause deadlocks when freezing a queue with pending bio
write plugs.

Go straight to ->submit_bio or blk_mq_submit_bio to bypass the
superfluous extra freeze protection and checks.

Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 15:32:33 +02:00
Ming Lei
a5c7b61eed block: use q->elevator with ->elevator_lock held in elv_iosched_show()
[ Upstream commit 94209d27d1 ]

Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
the locally cached elevator reference may become stale after getting
->elevator_lock.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 15:32:32 +02:00
Ming Lei
0c60158ff1 block: fix adding folio to bio
commit 26064d3e2b upstream.

A >4GB folio is possible on some ARCHs, such as aarch64 where a 16GB
hugepage is supported, so the 'offset' of a folio can't be held in an
'unsigned int', causing a warning in bio_add_folio_nofail() and IO
failure.

Fix it by adjusting 'page' and trimming 'offset' so that `->bi_offset`
won't overflow, and the folio can be added to the bio successfully.
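
A sketch of the adjustment, folding whole pages of the folio offset into
the page pointer so the remaining offset fits in an unsigned int
(illustrative):

  void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
                            size_t off)
  {
          unsigned long nr = off / PAGE_SIZE;

          WARN_ON_ONCE(len > UINT_MAX);
          __bio_add_page(bio, folio_page(folio, nr), len, off % PAGE_SIZE);
  }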

Fixes: ed9832bc08 ("block: introduce folio awareness and add a bigger size from folio")
Cc: Kundan Kumar <kundan.kumar@samsung.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[ The follow-up fix fbecd731de ("xfs: fix zoned GC data corruption due to
  wrong bv_offset") addresses issues in the file fs/xfs/xfs_zone_gc.c.  This
  file was first introduced in version v6.15-rc1. So don't backport the follow
  up fix to 6.12.y. ]
Signed-off-by: Alva Lan <alvalan9@foxmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-10 07:12:59 -04:00
Johannes Thumshirn
1645fc1849 block: only update request sector if needed
[ Upstream commit db492e24f9 ]

In case of a ZONE APPEND write, regardless of native ZONE APPEND or the
emulation layer in the zone write plugging code, the sector the data got
written to by the device needs to be updated in the bio.

At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But this
superfluously updates the sector for regular writes to a zoned block
device.

Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criteria is met, update the sector in the
bio upon completion.
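
A sketch of the narrowed completion-time condition (illustrative):

  if (req_op(req) == REQ_OP_ZONE_APPEND ||
      bio_flagged(bio, BIO_EMULATES_ZONE_APPEND))
          bio->bi_iter.bi_sector = req->__sector;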

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:03:12 +02:00
Christoph Hellwig
564f03a797 block: mark bounce buffering as incompatible with integrity
[ Upstream commit 5fd0268a88 ]

None of the few drivers still using the legacy block layer bounce
buffering support integrity metadata.  Explicitly mark the features as
incompatible and stop creating the slab and mempool for integrity
buffers for the bounce bio_set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250225154449.422989-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:32 +02:00
Ming Lei
95080412e9 blk-throttle: don't take carryover for prioritized processing of metadata
[ Upstream commit a9fc8868b3 ]

Commit 29390bb566 ("blk-throttle: support prioritized processing of metadata")
takes bytes/ios carryover for prioritized processing of metadata. Turns out
we can support it by charging it directly without trimming the slice, and
the result is the same as with carryover.

Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:29 +02:00
Coly Li
75ae2a3553 badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0
[ Upstream commit 7e76336e14 ]

In _badblocks_check(), there are lines of code like this,
1246         sectors -= len;
[snipped]
1251         WARN_ON(sectors < 0);

The WARN_ON() at line 1251 doesn't make sense because sectors is of
unsigned long long type and can never be < 0.

Fix it by directly checking whether sectors is less than len.
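
A sketch of the corrected check (illustrative):

  /* sectors is a u64, so test before the subtraction instead of after. */
  WARN_ON(sectors < len);
  sectors -= len;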

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Coly Li <colyli@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250309160556.42854-1-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:24 +02:00
Chen Linxuan
a5a507fa5f blk-cgroup: improve policy registration error handling
[ Upstream commit e1a0202c6b ]

This patch improves the error codes returned by blkcg_policy_register().

1. Move the validation check for cpd/pd_alloc_fn and cpd/pd_free_fn
   function pairs to the start of blkcg_policy_register(). This ensures
   we immediately return -EINVAL if the function pairs are not correctly
   provided, rather than returning -ENOSPC after locking and unlocking
   mutexes unnecessarily.

   Those locks should not cause any contention problems, as a policy
   registration error is a super cold path.

2. Return -ENOMEM when cpd_alloc_fn() failed.

Co-authored-by: Wen Tao <wentao@uniontech.com>
Signed-off-by: Wen Tao <wentao@uniontech.com>
Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/3E333A73B6B6DFC0+20250317022924.150907-1-chenlinxuan@uniontech.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:13 +02:00
Darrick J. Wong
64f505b08e block: fix race between set_blocksize and read paths
[ Upstream commit c0e473a0d2 ]

With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.

Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device.  The read call can create an order-0 folio to
read the first 4096 bytes from the disk.  But then udev is preempted.

Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device.  The filesystem calls set_blocksize, which sets i_blksize to
8192 and the minimum folio order to 1.

Now udev resumes, still holding the order-0 folio it allocated.  It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio.  Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev.  We then submit the bio with a
NULL block device and crash.

Therefore, truncate the page cache after flushing but before updating
i_blksize.  However, that's not enough -- we also need to lock out file
IO and page faults during the update.  Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.

I don't know if this is the correct fix, but xfs/259 found it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:01 +02:00
Steve Siwinski
bffc3038a2 scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES buffer
commit e8007fad54 upstream.

The REPORT ZONES buffer size is currently limited by the HBA's maximum
segment count to ensure the buffer can be mapped. However, the block
layer further limits the number of iovec entries to 1024 when allocating
a bio.

To avoid allocation of buffers too large to be mapped, further restrict
the maximum buffer size to BIO_MAX_INLINE_VECS.

Replace the UIO_MAXIOV symbolic name with the more contextually
appropriate BIO_MAX_INLINE_VECS.

Fixes: b091ac6168 ("sd_zbc: Fix report zones buffer allocation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Siwinski <ssiwinski@atto.com>
Link: https://lore.kernel.org/r/20250508200122.243129-1-ssiwinski@atto.com
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22 14:29:49 +02:00
Daniel Wagner
e10ec6e32b blk-mq: create correct map for fallback case
commit a9ae6fe1c3 upstream.

The fallback code in blk_mq_map_hw_queues originally comes from
blk_mq_pci_map_queues and was added to handle the case where
pci_irq_get_affinity will return NULL for !SMP configuration.

Besides blk_mq_pci_map_queues, blk_mq_map_hw_queues also replaces
blk_mq_virtio_map_queues, which used to use blk_mq_map_queues for the
fallback.

It's possible to use blk_mq_map_queues for both cases though.
blk_mq_map_queues creates the same map as blk_mq_clear_mq_map for !SMP
that is CPU 0 will be mapped to hctx 0.

The WARN_ON_ONCE has to be dropped for virtio as the fallback is also
taken for certain configurations by default. Though there is still a
WARN_ON_ONCE check in lib/group_cpus.c:

       WARN_ON(nr_present + nr_others < numgrps);

which will trigger if the caller tries to create more hardware queues
than CPUs. It tests the same as the WARN_ON_ONCE in
blk_mq_pci_map_queues did.

Fixes: a5665c3d15 ("virtio: blk/scsi: replace blk_mq_virtio_map_queues with blk_mq_map_hw_queues")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/20250122093020.6e8a4e5b@gandalf.local.home/
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20250123-fix-blk_mq_map_hw_queues-v1-1-08dbd01f2c39@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09 09:50:48 +02:00
Christoph Hellwig
6b9ebcbd31 mq-deadline: don't call req_get_ioprio from the I/O completion handler
commit 1b0cab327e upstream.

req_get_ioprio looks at req->bio to find the I/O priority, which is not
set when completing bios that the driver fully iterated through.

Stash away the dd_per_prio in the elevator private data instead of looking
it up again to optimize the code a bit while fixing the regression from
removing the per-request ioprio value.

Fixes: 6975c1a486 ("block: remove the ioprio field from struct request")
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Reported-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Tested-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241126102136.619067-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02 07:59:33 +02:00
Christoph Hellwig
7f24ea6a46 block: never reduce ra_pages in blk_apply_bdi_limits
[ Upstream commit 7b720c7202 ]

When the user increases the read-ahead size through sysfs, this value
currently gets lost if the device is reprobed, including on a resume
from suspend.

As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.

This restores the pre-atomic queue limit behavior in the sd driver as
sd did not use blk_queue_io_opt and thus never updated the read ahead
size to the value based of the optimal I/O, but changes behavior for
all other drivers.  As the new behavior seems useful and sd is the
driver for which the readahead size tweaks are most useful that seems
like a worthwhile trade off.

Fixes: 804e498e04 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02 07:59:03 +02:00
Ming Lei
2afa5ea7c4 block: make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone
[ Upstream commit fc0e982b8a ]

Make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone(),
otherwise requests cloned by device-mapper multipath will not have the
proper nr_integrity_segments values set, then BUG() is hit from
sg_alloc_table_chained().
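
A sketch of the missing copy in blk_rq_prep_clone() (illustrative):

  rq->nr_integrity_segments = rq_src->nr_integrity_segments;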

Fixes: b0fd271d5f ("block: add request clone interface (v2)")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250310115453.2271109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02 07:58:54 +02:00
Christoph Hellwig
ed7535b141 block: remove the ioprio field from struct request
[ Upstream commit 6975c1a486 ]

The request ioprio is only initialized from the first attached bio,
so requests without a bio already never set it.  Directly use the
bio field instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: fc0e982b8a ("block: make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02 07:58:54 +02:00
Christoph Hellwig
3e12e8c273 block: remove the write_hint field from struct request
[ Upstream commit 61952bb734 ]

The write_hint is only used for read/write requests, which must have a
bio attached to them.  Just use the bio field instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: fc0e982b8a ("block: make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02 07:58:54 +02:00
Christoph Hellwig
7e2d224939 block: don't reorder requests in blk_add_rq_to_plug
commit e70c301fae upstream.

Add requests to the tail of the list instead of the front so that they
are queued up in submission order.

Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25 10:48:06 +02:00
Christoph Hellwig
2ad0f19a4e block: add a rq_list type
commit a3396b9999 upstream.

Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers.  Besides
better type safety this actually allows inserting at the tail of the
list, which will be useful soon.
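
A sketch of the new type's shape, per the description above:

  struct rq_list {
          struct request *head;
          struct request *tail;
  };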

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25 10:48:06 +02:00
Martin K. Petersen
c38a005e6e block: integrity: Do not call set_page_dirty_lock()
commit 39e1605051 upstream.

Placing multiple protection information buffers inside the same page
can lead to oopses because set_page_dirty_lock() can't be called from
interrupt context.

Since a protection information buffer is not backed by a file there is
no point in setting its page dirty, there is nothing to synchronize.
Drop the call to set_page_dirty_lock() and remove the last argument to
bio_integrity_unpin_bvec().

Cc: stable@vger.kernel.org
Fixes: 492c5d4559 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/yq1v7r3ev9g.fsf@ca-mkp.ca.oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25 10:47:50 +02:00
Zheng Qixing
41e43134dd block: fix resource leak in blk_register_queue() error path
[ Upstream commit 40f2eb9b53 ]

When registering a queue, if blk_mq_sysfs_register() succeeds but the
function later encounters an error, we need to clean up the blk_mq_sysfs
resources.

Add the missing blk_mq_sysfs_unregister() call in the error path
to properly clean up these resources and prevent a memory leak.

Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250412092554.475218-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25 10:47:43 +02:00
Daniel Wagner
fe2bdefe86 blk-mq: introduce blk_mq_map_hw_queues
[ Upstream commit 1452e9b470 ]

blk_mq_pci_map_queues and blk_mq_virtio_map_queues will create a CPU to
hardware queue mapping based on affinity information. These two function
share common code and only differ on how the affinity information is
retrieved. Also, those functions are located in the block subsystem
where they don't really fit. They are virtio and pci subsystem
specific.

Thus introduce a generic mapping function which uses the
irq_get_affinity callback from bus_type.

Original idea from Ming Lei <ming.lei@redhat.com>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-4-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: a2d5a00722 ("scsi: smartpqi: Use is_kdump_kernel() to check for kdump")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25 10:47:38 +02:00
Ming Lei
7184e99610 block: fix 'kmem_cache of name 'bio-108' already exists'
[ Upstream commit b654f7a51f ]

A device mapper bioset often has a big bio_slab size, which can be more
than 1000, so 8 bytes can't hold the slab name any more, causing the
kmem_cache allocation warning of 'kmem_cache of name 'bio-108' already
exists'.

Fix the warning by extending bio_slab->name to 12 bytes, and also fix the
output of /proc/slabinfo.
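
A sketch of the struct change (illustrative; field layout per our reading
of block/bio.c):

  struct bio_slab {
          struct kmem_cache *slab;
          unsigned int slab_ref;
          unsigned int slab_size;
          char name[12];          /* was: char name[8] */
  };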

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22 12:54:21 -07:00
Olivier Gayot
f57e89c1cb block: fix conversion of GPT partition name to 7-bit
commit e06472bab2 upstream.

The utf16_le_to_7bit function claims to, naively, convert a UTF-16
string to a 7-bit ASCII string. By naively, we mean that it:
 * drops the first byte of every character in the original UTF-16 string
 * checks if all characters are printable, and otherwise replaces them
   by exclamation mark "!".

This means that theoretically, all characters outside the 7-bit ASCII
range should be replaced by another character. Examples:

 * lower-case alpha (ɒ) 0x0252 becomes 0x52 (R)
 * ligature OE (œ) 0x0153 becomes 0x53 (S)
 * hangul letter pieup (ㅂ) 0x3142 becomes 0x42 (B)
 * upper-case gamma (Ɣ) 0x0194 becomes 0x94 (not printable) so gets
   replaced by "!"

The result of this conversion for the GPT partition name is passed to
user-space as PARTNAME via udev, which is confusing and feels questionable.

However, there is a flaw in the conversion function itself. By dropping
one byte of each character and using isprint() to check if the remaining
byte corresponds to a printable character, we do not actually guarantee
that the resulting character is 7-bit ASCII.

This happens because we pass 8-bit characters to isprint(), which
in the kernel returns 1 for many values > 0x7f - as defined in ctype.c.

This results in many values which should be replaced by "!" being kept
as-is, despite not being valid 7-bit ASCII. Examples:

 * e with acute accent (é) 0x00E9 becomes 0xE9 - kept as-is because
   isprint(0xE9) returns 1.
 * euro sign (€) 0x20AC becomes 0xAC - kept as-is because isprint(0xAC)
   returns 1.

This has broken the pyudev utility [1]. Fix it by using a mask of 7 bits
instead of 8 bits before calling isprint().
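
A sketch of the conversion loop with the mask applied (illustrative):

  static void utf16_le_to_7bit(const __le16 *in, unsigned int size, u8 *out)
  {
          unsigned int i = 0;

          out[size] = 0;
          while (i < size) {
                  u8 c = le16_to_cpu(in[i]) & 0x7f;       /* mask to 7 bits */

                  if (c && !isprint(c))
                          c = '!';
                  out[i] = c;
                  i++;
          }
  }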

Link: https://github.com/pyudev/pyudev/issues/490#issuecomment-2685794648 [1]
Link: https://lore.kernel.org/linux-block/4cac90c2-e414-4ebb-ae62-2a4589d9dc6e@canonical.com/
Cc: Mulhern <amulhern@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: stable@vger.kernel.org
Signed-off-by: Olivier Gayot <olivier.gayot@canonical.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250305022154.3903128-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13 13:01:58 +01:00
Damien Le Moal
2f572c42bb block: Remove zone write plugs when handling native zone append writes
commit a6aa36e957 upstream.

For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.

However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
   written using zone append or regular write operations, because the
   write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
   will always be failed by blk_zone_wplug_prepare_bio() as the write
   pointer alignment check will fail, even if the user correctly
   accounted for the zone append operations and issued the regular
   writes with a correct sector.

Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular write using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path will never see again the zone write plug as
disk_get_zone_wplug() will return NULL). Rate-limited warnings are added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.

Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().

To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.

Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07 18:25:40 +01:00