summary refs log tree commit diff
path: root/include
AgeCommit message (Collapse)Author
2023-03-10driver core: fw_devlink: Consolidate device link flag computationSaravana Kannan
[ Upstream commit cd115c0409f283edde94bd5a9a42dc42bee0aba8 ] Consolidate the code that computes the flags to be used when creating a device link from a fwnode link. Fixes: 2de9d8e0d2fe ("driver core: fw_devlink: Improve handling of cyclic dependencies") Signed-off-by: Saravana Kannan <saravanak@google.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Douglas Anderson <dianders@chromium.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Luca Weiss <luca.weiss@fairphone.com> # qcom/sm7225-fairphone-fp4 Link: https://lore.kernel.org/r/20230207014207.1678715-8-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10driver core: fw_devlink: Allow marking a fwnode link as being part of a cycleSaravana Kannan
[ Upstream commit 6a6dfdf8b3ff337be5a447e9f4e71969f18370ad ] To improve detection and handling of dependency cycles, we need to be able to mark fwnode links as being part of cycles. fwnode links marked as being part of a cycle should not block their consumers from probing. Fixes: 2de9d8e0d2fe ("driver core: fw_devlink: Improve handling of cyclic dependencies") Signed-off-by: Saravana Kannan <saravanak@google.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Douglas Anderson <dianders@chromium.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Luca Weiss <luca.weiss@fairphone.com> # qcom/sm7225-fairphone-fp4 Link: https://lore.kernel.org/r/20230207014207.1678715-7-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10driver core: fw_devlink: Add DL_FLAG_CYCLE support to device linksSaravana Kannan
[ Upstream commit 67cad5c67019c38126b749621665b6723d3ae7e6 ] fw_devlink uses DL_FLAG_SYNC_STATE_ONLY device link flag for two purposes: 1. To allow a parent device to proxy its child device's dependency on a supplier so that the supplier doesn't get its sync_state() callback before the child device/consumer can be added and probed. In this usage scenario, we need to ignore cycles for ensure correctness of sync_state() callbacks. 2. When there are dependency cycles in firmware, we don't know which of those dependencies are valid. So, we have to ignore them all wrt probe ordering while still making sure the sync_state() callbacks come correctly. However, when detecting dependency cycles, there can be multiple dependency cycles between two devices that we need to detect. For example: A -> B -> A and A -> C -> B -> A. To detect multiple cycles correct, we need to be able to differentiate DL_FLAG_SYNC_STATE_ONLY device links used for (1) vs (2) above. To allow this differentiation, add a DL_FLAG_CYCLE that can be use to mark use case (2). We can then use the DL_FLAG_CYCLE to decide which DL_FLAG_SYNC_STATE_ONLY device links to follow when looking for dependency cycles. Fixes: 2de9d8e0d2fe ("driver core: fw_devlink: Improve handling of cyclic dependencies") Signed-off-by: Saravana Kannan <saravanak@google.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Douglas Anderson <dianders@chromium.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Luca Weiss <luca.weiss@fairphone.com> # qcom/sm7225-fairphone-fp4 Link: https://lore.kernel.org/r/20230207014207.1678715-6-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10drivers: base: transport_class: fix possible memory leakYang Yingliang
[ Upstream commit a86367803838b369fe5486ac18771d14723c258c ] Current some drivers(like iscsi) call transport_register_device() failed, they don't call transport_destroy_device() to release the memory allocated in transport_setup_device(), because they don't know what was done, it should be internal thing to release the resource in register function. So fix this leak by calling destroy function inside register function. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20221110102307.3492557-1-yangyingliang@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10kobject: modify kobject_get_path() to take a const *Greg Kroah-Hartman
[ Upstream commit 33a0a1e3b3d17445832177981dc7a1c6a5b009f8 ] kobject_get_path() does not modify the kobject passed to it, so make the pointer constant. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Link: https://lore.kernel.org/r/20221001165315.2690141-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 3bb2a01caa81 ("kobject: Fix slab-out-of-bounds in fill_kobj_path()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10NFSD: enhance inter-server copy cleanupDai Ngo
[ Upstream commit df24ac7a2e3a9d0bc68f1756a880e50bfe4b4522 ] Currently nfsd4_setup_inter_ssc returns the vfsmount of the source server's export when the mount completes. After the copy is done nfsd4_cleanup_inter_ssc is called with the vfsmount of the source server and it searches nfsd_ssc_mount_list for a matching entry to do the clean up. The problems with this approach are (1) the need to search the nfsd_ssc_mount_list and (2) the code has to handle the case where the matching entry is not found which looks ugly. The enhancement is instead of nfsd4_setup_inter_ssc returning the vfsmount, it returns the nfsd4_ssc_umount_item which has the vfsmount embedded in it. When nfsd4_cleanup_inter_ssc is called it's passed with the nfsd4_ssc_umount_item directly to do the clean up so no searching is needed and there is no need to handle the 'not found' case. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [ cel: adjusted whitespace and variable/function names ] Reviewed-by: Olga Kornievskaia <kolga@netapp.com> Stable-dep-of: 34e8f9ec4c9a ("NFSD: fix leaked reference count of nfsd4_ssc_umount_item") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10ASoC: soc-dapm.h: fixup warning struct snd_pcm_substream not declaredLucas Tanure
[ Upstream commit fdff966bfde7cf0c85562d2bfb1ff1ba83da5f7b ] Add struct snd_pcm_substream forward declaration Fixes: 078a85f2806f ("ASoC: dapm: Only power up active channels from a DAI") Signed-off-by: Lucas Tanure <lucas.tanure@collabora.com> Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20230215132851.1626881-1-lucas.tanure@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10HID: retain initial quirks set up when creating HID devicesDmitry Torokhov
[ Upstream commit 03a86105556e23650e4470c09f91cf7c360d5e28 ] In certain circumstances, such as when creating I2C-connected HID devices, we want to pass and retain some quirks (axis inversion, etc). The source of such quirks may be device tree, or DMI data, or something else not readily available to the HID core itself and therefore cannot be reconstructed easily. To allow this, introduce "initial_quirks" field in hid_device structure and use it when determining the final set of quirks. This fixes the problem with i2c-hid setting up device-tree sourced quirks too late and losing them on device rebind, and also allows to sever the tie between hid-code and i2c-hid when applying DMI-based quirks. Fixes: b60d3c803d76 ("HID: i2c-hid-of: Expose the touchscreen-inverted properties") Fixes: a2f416bf062a ("HID: multitouch: Add quirks for flipped axes") Reviewed-by: Guenter Roeck <groeck@chromium.org> Tested-by: Allen Ballway <ballway@chromium.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Alistair Francis <alistair@alistair23.me> Link: https://lore.kernel.org/r/Y+LYwu3Zs13hdVDy@google.com Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10ALSA: hda: Fix the control element identification for multiple codecsJaroslav Kysela
[ Upstream commit d045bceff5a904bd79d71dede9f927c00ce4906f ] Some motherboards have multiple HDA codecs connected to the serial bus. The current code may create multiple mixer controls with the almost identical identification. The current code use id.device field from the control element structure to store the codec address to avoid such clashes for multiple codecs. Unfortunately, the user space do not handle this correctly. For mixer controls, only name and index are used for the identifiers. This patch fixes this problem to compose the index using the codec address as an offset in case, when the control already exists. It is really unlikely that one codec will create 10 similar controls. This patch adds new kernel module parameter 'ctl_dev_id' to allow select the old behaviour, too. The CONFIG_SND_HDA_CTL_DEV_ID Kconfig option sets the default value. BugLink: https://github.com/alsa-project/alsa-lib/issues/294 BugLink: https://github.com/alsa-project/alsa-lib/issues/205 Fixes: 54d174031576 ("[ALSA] hda-codec - Fix connection list parsing") Fixes: 1afe206ab699 ("ALSA: hda - Try to find an empty control index when it's occupied") Signed-off-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20230202092013.4066998-1-perex@perex.cz Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10scsi: ufs: exynos: Fix DMA alignment for PAGE_SIZE != 4096Bart Van Assche
[ Upstream commit 86bd0c4a2a5dc4265884648cb92c681646509692 ] The Exynos UFS controller only supports scatter/gather list elements that are aligned on a 4 KiB boundary. Fix DMA alignment in case PAGE_SIZE != 4096. Rename UFSHCD_QUIRK_ALIGN_SG_WITH_PAGE_SIZE into UFSHCD_QUIRK_4KB_DMA_ALIGNMENT. Cc: Kiwoong Kim <kwmad.kim@samsung.com> Fixes: 2b2bfc8aa519 ("scsi: ufs: Introduce a quirk to allow only page-aligned sg entries") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Tested-by: Alim Akhtar <alim.akhtar@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10drm/mipi-dsi: Fix byte order of 16-bit DCS set/get brightnessDaniel Mentz
[ Upstream commit c9d27c6be518b4ef2966d9564654ef99292ea1b3 ] The MIPI DCS specification demands that brightness values are sent in big endian byte order. It also states that one parameter (i.e. one byte) shall be sent/received for 8 bit wide values, and two parameters shall be used for values that are between 9 and 16 bits wide. Add new functions to properly handle 16-bit brightness in big endian, since the two 8- and 16-bit cases are distinct from each other. [richard: use separate functions instead of switch/case] [richard: split into 16-bit component] Fixes: 1a9d759331b8 ("drm/dsi: Implement DCS set/get display brightness") Signed-off-by: Daniel Mentz <danielmentz@google.com> Link: https://android.googlesource.com/kernel/msm/+/754affd62d0ee268c686c53169b1dbb7deac8550 [richard: fix 16-bit brightness_get] Signed-off-by: Richard Acayan <mailingradian@gmail.com> Tested-by: Caleb Connolly <caleb@connolly.tech> Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org> Reviewed-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20230116224909.23884-2-mailingradian@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10net/mlx4_en: Introduce flexible array to silence overflow warningKees Cook
[ Upstream commit f8f185e39b4de91bc5235e5be0d829bea69d9b06 ] The call "skb_copy_from_linear_data(skb, inl + 1, spc)" triggers a FORTIFY memcpy() warning on ppc64 platform: In function ‘fortify_memcpy_chk’, inlined from ‘skb_copy_from_linear_data’ at ./include/linux/skbuff.h:4029:2, inlined from ‘build_inline_wqe’ at drivers/net/ethernet/mellanox/mlx4/en_tx.c:722:4, inlined from ‘mlx4_en_xmit’ at drivers/net/ethernet/mellanox/mlx4/en_tx.c:1066:3: ./include/linux/fortify-string.h:513:25: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] 513 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Same behaviour on x86 you can get if you use "__always_inline" instead of "inline" for skb_copy_from_linear_data() in skbuff.h The call here copies data into inlined tx destricptor, which has 104 bytes (MAX_INLINE) space for data payload. In this case "spc" is known in compile-time but the destination is used with hidden knowledge (real structure of destination is different from that the compiler can see). That cause the fortify warning because compiler can check bounds, but the real bounds are different. "spc" can't be bigger than 64 bytes (MLX4_INLINE_ALIGN), so the data can always fit into inlined tx descriptor. The fact that "inl" points into inlined tx descriptor is determined earlier in mlx4_en_xmit(). Avoid confusing the compiler with "inl + 1" constructions to get to past the inl header by introducing a flexible array "data" to the struct so that the compiler can see that we are not dealing with an array of inl structs, but rather, arbitrary data following the structure. There are no changes to the structure layout reported by pahole, and the resulting machine code is actually smaller. Reported-by: Josef Oskera <joskera@redhat.com> Link: https://lore.kernel.org/lkml/20230217094541.2362873-1-joskera@redhat.com Fixes: f68f2ff91512 ("fortify: Detect struct member overflows in memcpy() at compile-time") Cc: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230218183842.never.954-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10bpf: Zeroing allocated object from slab in bpf memory allocatorHou Tao
[ Upstream commit 997849c4b969034e225153f41026657def66d286 ] Currently the freed element in bpf memory allocator may be immediately reused, for htab map the reuse will reinitialize special fields in map value (e.g., bpf_spin_lock), but lookup procedure may still access these special fields, and it may lead to hard-lockup as shown below: NMI backtrace for cpu 16 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0 ...... Call Trace: <TASK> copy_map_value_locked+0xb7/0x170 bpf_map_copy_value+0x113/0x3c0 __sys_bpf+0x1c67/0x2780 __x64_sys_bpf+0x1c/0x20 do_syscall_64+0x30/0x60 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ...... </TASK> For htab map, just like the preallocated case, these is no need to initialize these special fields in map value again once these fields have been initialized. For preallocated htab map, these fields are initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the similar thing for non-preallocated htab in bpf memory allocator. And there is no need to use __GFP_ZERO for per-cpu bpf memory allocator, because __alloc_percpu_gfp() does it implicitly. Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10net: add sock_init_data_uid()Pietro Borrello
[ Upstream commit 584f3742890e966d2f0a1f3c418c9ead70b2d99e ] Add sock_init_data_uid() to explicitly initialize the socket uid. To initialise the socket uid, sock_init_data() assumes a the struct socket* sock is always embedded in a struct socket_alloc, used to access the corresponding inode uid. This may not be true. Examples are sockets created in tun_chr_open() and tap_open(). Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()Frederic Weisbecker
[ Upstream commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0 ] RCU Tasks and PID-namespace unshare can interact in do_exit() in a complicated circular dependency: 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace that every subsequent child of TASK A will belong to. But TASK A doesn't itself belong to that new PID namespace. 2) TASK A forks() and creates TASK B. TASK A stays attached to its PID namespace (let's say PID_NS1) and TASK B is the first task belonging to the new PID namespace created by unshare() (let's call it PID_NS2). 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2 child reaper. 4) TASK A forks() again and creates TASK C which get attached to PID_NS2. Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has TASK B (belonging to PID_NS2) as a pid_namespace child_reaper. 5) TASK B exits and since it is the child reaper for PID_NS2, it has to kill all other tasks attached to PID_NS2, and wait for all of them to die before getting reaped itself (zap_pid_ns_process()). 6) TASK A calls synchronize_rcu_tasks() which leads to synchronize_srcu(&tasks_rcu_exit_srcu). 7) TASK B is waiting for TASK C to get reaped. But TASK B is under a tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A. 8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C, but it can't because TASK A waits for TASK B that waits for TASK C. Pid_namespace semantics can hardly be changed at this point. But the coverage of tasks_rcu_exit_srcu can be reduced instead. The current task is assumed not to be concurrently reapable at this stage of exit_notify() and therefore tasks_rcu_exit_srcu can be temporarily relaxed without breaking its constraints, providing a way out of the deadlock scenario. [ paulmck: Fix build failure by adding additional declaration. ] Fixes: 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Suggested-by: Boqun Feng <boqun.feng@gmail.com> Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Eric W . Biederman <ebiederm@xmission.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10genirq: Fix the return type of kstat_cpu_irqs_sum()Zhen Lei
[ Upstream commit 47904aed898a08f028572b9b5a5cc101ddfb2d82 ] The type of member ->irqs_sum is unsigned long, but kstat_cpu_irqs_sum() returns int, which can result in truncation. Therefore, change the kstat_cpu_irqs_sum() function's return value to unsigned long to avoid truncation. Fixes: f2c66cd8eedd ("/proc/stat: scalability of irq num per cpu") Reported-by: Elliott, Robert (Servers) <elliott@hpe.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Josh Don <joshdon@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10sbitmap: Use single per-bitmap counting to wake up queued tagsGabriel Krisman Bertazi
[ Upstream commit 4f8126bb2308066b877859e4b5923ffb54143630 ] sbitmap suffers from code complexity, as demonstrated by recent fixes, and eventual lost wake ups on nested I/O completion. The later happens, from what I understand, due to the non-atomic nature of the updates to wait_cnt, which needs to be subtracted and eventually reset when equal to zero. This two step process can eventually miss an update when a nested completion happens to interrupt the CPU in between the wait_cnt updates. This is very hard to fix, as shown by the recent changes to this code. The code complexity arises mostly from the corner cases to avoid missed wakes in this scenario. In addition, the handling of wake_batch recalculation plus the synchronization with sbq_queue_wake_up is non-trivial. This patchset implements the idea originally proposed by Jan [1], which removes the need for the two-step updates of wait_cnt. This is done by tracking the number of completions and wakeups in always increasing, per-bitmap counters. Instead of having to reset the wait_cnt when it reaches zero, we simply keep counting, and attempt to wake up N threads in a single wait queue whenever there is enough space for a batch. Waking up less than batch_wake shouldn't be a problem, because we haven't changed the conditions for wake up, and the existing batch calculation guarantees at least enough remaining completions to wake up a batch for each queue at any time. Performance-wise, one should expect very similar performance to the original algorithm for the case where there is no queueing. In both the old algorithm and this implementation, the first thing is to check ws_active, which bails out if there is no queueing to be managed. In the new code, we took care to avoid accounting completions and wakeups when there is no queueing, to not pay the cost of atomic operations unnecessarily, since it doesn't skew the numbers. For more interesting cases, where there is queueing, we need to take into account the cross-communication of the atomic operations. I've been benchmarking by running parallel fio jobs against a single hctx nullb in different hardware queue depth scenarios, and verifying both IOPS and queueing. Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, varying only the hardware queue length per test. queue size 2 4 8 16 32 64 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K) The following is a similar experiment, ran against a nullb with a single bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 parallel fio jobs operating on the same device queue size 2 4 8 16 32 64 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K) It has also survived blktests and a 12h-stress run against nullb. I also ran the code against nvme and a scsi SSD, and I didn't observe performance regression in those. If there are other tests you think I should run, please let me know and I will follow up with results. [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/ Cc: Hugh Dickins <hughd@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Liu Song <liusong@linux.alibaba.com> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: b5fcf7871acb ("sbitmap: correct wake_batch recalculation to avoid potential IO hung") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-03fs: use consistent setgid checks in is_sxid()Christian Brauner
commit 8d84e39d76bd83474b26cb44f4b338635676e7e8 upstream. Now that we made the VFS setgid checking consistent an inode can't be marked security irrelevant even if the setgid bit is still set. Make this function consistent with all other helpers. Note that enforcing consistent setgid stripping checks for file modification and mode- and ownership changes will cause the setgid bit to be lost in more cases than useed to be the case. If an unprivileged user wrote to a non-executable setgid file that they don't have privilege over the setgid bit will be dropped. This will lead to temporary failures in some xfstests until they have been updated. Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-03attr: use consistent sgid stripping checksChristian Brauner
commit ed5a7047d2011cb6b2bf84ceb6680124cc6a7d95 upstream. Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-25arch: fix broken BuildID for arm64 and riscvMasahiro Yamada
commit 99cb0d917ffa1ab628bb67364ca9b162c07699b1 upstream. Dennis Gilmore reports that the BuildID is missing in the arm64 vmlinux since commit 994b7ac1697b ("arm64: remove special treatment for the link order of head.o"). The issue is that the type of .notes section, which contains the BuildID, changed from NOTES to PROGBITS. Ard Biesheuvel figured out that whichever object gets linked first gets to decide the type of a section. The PROGBITS type is the result of the compiler emitting .note.GNU-stack as PROGBITS rather than NOTE. While Ard provided a fix for arm64, I want to fix this globally because the same issue is happening on riscv since commit 2348e6bf4421 ("riscv: remove special treatment for the link order of head.o"). This problem will happen in general for other architectures if they start to drop unneeded entries from scripts/head-object-list.txt. Discard .note.GNU-stack in include/asm-generic/vmlinux.lds.h. Link: https://lore.kernel.org/lkml/CAABkxwuQoz1CTbyb57n0ZX65eSYiTonFCU8-LCQc=74D=xE=rA@mail.gmail.com/ Fixes: 994b7ac1697b ("arm64: remove special treatment for the link order of head.o") Fixes: 2348e6bf4421 ("riscv: remove special treatment for the link order of head.o") Reported-by: Dennis Gilmore <dennis@ausil.us> Suggested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Tom Saeger <tom.saeger@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-25uaccess: Add speculation barrier to copy_from_user()Dave Hansen
commit 74e19ef0ff8061ef55957c3abd71614ef0f42f47 upstream. The results of "access_ok()" can be mis-speculated. The result is that you can end speculatively: if (access_ok(from, size)) // Right here even for bad from/size combinations. On first glance, it would be ideal to just add a speculation barrier to "access_ok()" so that its results can never be mis-speculated. But there are lots of system calls just doing access_ok() via "copy_to_user()" and friends (example: fstat() and friends). Those are generally not problematic because they do not _consume_ data from userspace other than the pointer. They are also very quick and common system calls that should not be needlessly slowed down. "copy_from_user()" on the other hand uses a user-controller pointer and is frequently followed up with code that might affect caches. Take something like this: if (!copy_from_user(&kernelvar, uptr, size)) do_something_with(kernelvar); If userspace passes in an evil 'uptr' that *actually* points to a kernel addresses, and then do_something_with() has cache (or other) side-effects, it could allow userspace to infer kernel data values. Add a barrier to the common copy_from_user() code to prevent mis-speculated values which happen after the copy. Also add a stub for architectures that do not define barrier_nospec(). This makes the macro usable in generic code. Since the barrier is now usable in generic code, the x86 #ifdef in the BPF code can also go away. Reported-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Daniel Borkmann <daniel@iogearbox.net> # BPF bits Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-25scsi: libsas: Add smp_ata_check_ready_type()Jie Zhan
[ Upstream commit 9181ce3cb5d96f0ee28246a857ca651830fa3746 ] Create function smp_ata_check_ready_type() for LLDDs to wait for SATA devices to come up after a link reset. Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> Link: https://lore.kernel.org/r/20221118083714.4034612-4-zhanjie9@hisilicon.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Stable-dep-of: 3c2673a09cf1 ("scsi: hisi_sas: Fix SATA devices missing issue during I_T nexus reset") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-25random: always mix cycle counter in add_latent_entropy()Jason A. Donenfeld
[ Upstream commit d7bf7f3b813e3755226bcb5114ad2ac477514ebf ] add_latent_entropy() is called every time a process forks, in kernel_clone(). This in turn calls add_device_randomness() using the latent entropy global state. add_device_randomness() does two things: 2) Mixes into the input pool the latent entropy argument passed; and 1) Mixes in a cycle counter, a sort of measurement of when the event took place, the high precision bits of which are presumably difficult to predict. (2) is impossible without CONFIG_GCC_PLUGIN_LATENT_ENTROPY=y. But (1) is always possible. However, currently CONFIG_GCC_PLUGIN_LATENT_ENTROPY=n disables both (1) and (2), instead of just (2). This commit causes the CONFIG_GCC_PLUGIN_LATENT_ENTROPY=n case to still do (1) by passing NULL (len 0) to add_device_randomness() when add_latent_ entropy() is called. Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: PaX Team <pageexec@freemail.hu> Cc: Emese Revfy <re.emese@gmail.com> Fixes: 38addce8b600 ("gcc-plugins: Add latent_entropy plugin") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-25sched/psi: Stop relying on timer_pending() for poll_work reschedulingSuren Baghdasaryan
[ Upstream commit 710ffe671e014d5ccbcff225130a178b088ef090 ] Psi polling mechanism is trying to minimize the number of wakeups to run psi_poll_work and is currently relying on timer_pending() to detect when this work is already scheduled. This provides a window of opportunity for psi_group_change to schedule an immediate psi_poll_work after poll_timer_fn got called but before psi_poll_work could reschedule itself. Below is the depiction of this entire window: poll_timer_fn wake_up_interruptible(&group->poll_wait); psi_poll_worker wait_event_interruptible(group->poll_wait, ...) psi_poll_work psi_schedule_poll_work if (timer_pending(&group->poll_timer)) return; ... mod_timer(&group->poll_timer, jiffies + delay); Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was reset and set back inside psi_poll_work and therefore this race window was much smaller. The larger window causes increased number of wakeups and our partners report visible power regression of ~10mA after applying 461daba06bdc. Bring back the poll_scheduled atomic and make this race window even narrower by resetting poll_scheduled only when we reach polling expiration time. This does not completely eliminate the possibility of extra wakeups caused by a race with psi_group_change however it will limit it to the worst case scenario of one extra wakeup per every tracking window (0.5s in the worst case). This patch also ensures correct ordering between clearing poll_scheduled flag and obtaining changed_states using memory barrier. Correct ordering between updating changed_states and setting poll_scheduled is ensured by atomic_xchg operation. By tracing the number of immediate rescheduling attempts performed by psi_group_change and the number of these attempts being blocked due to psi monitor being already active, we can assess the effects of this change: Before the patch: Run#1 Run#2 Run#3 Immediate reschedules attempted: 684365 1385156 1261240 Immediate reschedules blocked: 682846 1381654 1258682 Immediate reschedules (delta): 1519 3502 2558 Immediate reschedules (% of attempted): 0.22% 0.25% 0.20% After the patch: Run#1 Run#2 Run#3 Immediate reschedules attempted: 882244 770298 426218 Immediate reschedules blocked: 881996 769796 426074 Immediate reschedules (delta): 248 502 144 Immediate reschedules (% of attempted): 0.03% 0.07% 0.03% The number of non-blocked immediate reschedules dropped from 0.22-0.25% to 0.03-0.07%. The drop is attributed to the decrease in the race window size and the fact that we allow this race only when psi monitors reach polling window expiration time. Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism") Reported-by: Kathleen Chang <yt.chang@mediatek.com> Reported-by: Wenju Xu <wenju.xu@mediatek.com> Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: SH Chen <show-hong.chen@mediatek.com> Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-22mm: extend max struct page size for kmsanArnd Bergmann
commit 3770e52fd4ec40ebee16ba19ad6c09dc0b52739b upstream. After x86 enabled support for KMSAN, it has become possible to have larger 'struct page' than was expected when commit 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures") was merged: include/linux/mm.h:156:10: warning: no case matching constant switch condition '96' switch (sizeof(struct page)) { Extend the maximum accordingly. Link: https://lkml.kernel.org/r/20230130130739.563628-1-arnd@kernel.org Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures") Fixes: 4ca8cc8d1bbe ("x86: kmsan: enable KMSAN builds for x86") Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22dccp/tcp: Avoid negative sk_forward_alloc by ipv6_pinfo.pktoptions.Kuniyuki Iwashima
commit ca43ccf41224b023fc290073d5603a755fd12eed upstream. Eric Dumazet pointed out [0] that when we call skb_set_owner_r() for ipv6_pinfo.pktoptions, sk_rmem_schedule() has not been called, resulting in a negative sk_forward_alloc. We add a new helper which clones a skb and sets its owner only when sk_rmem_schedule() succeeds. Note that we move skb_set_owner_r() forward in (dccp|tcp)_v6_do_rcv() because tcp_send_synack() can make sk_forward_alloc negative before ipv6_opt_accepted() in the crossed SYN-ACK or self-connect() cases. [0]: https://lore.kernel.org/netdev/CANn89iK9oc20Jdi_41jb9URdF210r7d1Y-+uypbMSbOfY6jqrg@mail.gmail.com/ Fixes: 323fbd0edf3f ("net: dccp: Add handling of IPV6_PKTOPTIONS to dccp_v6_do_rcv()") Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22hugetlb: check for undefined shift on 32 bit architecturesMike Kravetz
commit ec4288fe63966b26d53907212ecd05dfa81dd2cc upstream. Users can specify the hugetlb page size in the mmap, shmget and memfd_create system calls. This is done by using 6 bits within the flags argument to encode the base-2 logarithm of the desired page size. The routine hstate_sizelog() uses the log2 value to find the corresponding hugetlb hstate structure. Converting the log2 value (page_size_log) to potential hugetlb page size is the simple statement: 1UL << page_size_log Because only 6 bits are used for page_size_log, the left shift can not be greater than 63. This is fine on 64 bit architectures where a long is 64 bits. However, if a value greater than 31 is passed on a 32 bit architecture (where long is 32 bits) the shift will result in undefined behavior. This was generally not an issue as the result of the undefined shift had to exactly match hugetlb page size to proceed. Recent improvements in runtime checking have resulted in this undefined behavior throwing errors such as reported below. Fix by comparing page_size_log to BITS_PER_LONG before doing shift. Link: https://lkml.kernel.org/r/20230216013542.138708-1-mike.kravetz@oracle.com Link: https://lore.kernel.org/lkml/CA+G9fYuei_Tr-vN9GS7SfFyU1y9hNysnf=PB7kT0=yv4MiPgVg@mail.gmail.com/ Fixes: 42d7395feb56 ("mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Jesper Juhl <jesperjuhl76@gmail.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Sasha Levin <sashal@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22fbdev: Fix invalid page access after closing deferred I/O devicesTakashi Iwai
commit 3efc61d95259956db25347e2a9562c3e54546e20 upstream. When a fbdev with deferred I/O is once opened and closed, the dirty pages still remain queued in the pageref list, and eventually later those may be processed in the delayed work. This may lead to a corruption of pages, hitting an Oops. This patch makes sure to cancel the delayed work and clean up the pageref list at closing the device for addressing the bug. A part of the cleanup code is factored out as a new helper function that is called from the common fb_release(). Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Miko Larsson <mikoxyzzz@gmail.com> Fixes: 56c134f7f1b5 ("fbdev: Track deferred-I/O pages in pageref struct") Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230129082856.22113-1-tiwai@suse.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22mm: shrinkers: fix deadlock in shrinker debugfsQi Zheng
commit badc28d4924bfed73efc93f716a0c3aa3afbdf6f upstream. The debugfs_remove_recursive() is invoked by unregister_shrinker(), which is holding the write lock of shrinker_rwsem. It will waits for the handler of debugfs file complete. The handler also needs to hold the read lock of shrinker_rwsem to do something. So it may cause the following deadlock: CPU0 CPU1 debugfs_file_get() shrinker_debugfs_count_show()/shrinker_debugfs_scan_write() unregister_shrinker() --> down_write(&shrinker_rwsem); debugfs_remove_recursive() // wait for (A) --> wait_for_completion(); // wait for (B) --> down_read_killable(&shrinker_rwsem) debugfs_file_put() -- (A) up_write() -- (B) The down_read_killable() can be killed, so that the above deadlock can be recovered. But it still requires an extra kill action, otherwise it will block all subsequent shrinker-related operations, so it's better to fix it. [akpm@linux-foundation.org: fix CONFIG_SHRINKER_DEBUG=n stub] Link: https://lkml.kernel.org/r/20230202105612.64641-1-zhengqi.arch@bytedance.com Fixes: 5035ebc644ae ("mm: shrinkers: introduce debugfs interface for memory shrinkers") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22ceph: move mount state enum to super.hXiubo Li
[ Upstream commit b38b17b6a01ca4e738af097a1529910646ef4270 ] These flags are only used in ceph filesystem in fs/ceph, so just move it to the place it should be. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-22net: stmmac: do not stop RX_CLK in Rx LPI state for qcs404 SoCAndrey Konovalov
[ Upstream commit 54aa39a513dbf2164ca462a19f04519b2407a224 ] Currently in phy_init_eee() the driver unconditionally configures the PHY to stop RX_CLK after entering Rx LPI state. This causes an LPI interrupt storm on my qcs404-base board. Change the PHY initialization so that for "qcom,qcs404-ethqos" compatible device RX_CLK continues to run even in Rx LPI state. Signed-off-by: Andrey Konovalov <andrey.konovalov@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-14tracing: Fix TASK_COMM_LEN in trace event format fileYafang Shao
commit b6c7abd1c28a63ad633433d037ee15a1bc3023ba upstream. After commit 3087c61ed2c4 ("tools/testing/selftests/bpf: replace open-coded 16 with TASK_COMM_LEN"), the content of the format file under /sys/kernel/tracing/events/task/task_newtask was changed from field:char comm[16]; offset:12; size:16; signed:0; to field:char comm[TASK_COMM_LEN]; offset:12; size:16; signed:0; John reported that this change breaks older versions of perfetto. Then Mathieu pointed out that this behavioral change was caused by the use of __stringify(_len), which happens to work on macros, but not on enum labels. And he also gave the suggestion on how to fix it: :One possible solution to make this more robust would be to extend :struct trace_event_fields with one more field that indicates the length :of an array as an actual integer, without storing it in its stringified :form in the type, and do the formatting in f_show where it belongs. The result as follows after this change, $ cat /sys/kernel/tracing/events/task/task_newtask/format field:char comm[16]; offset:12; size:16; signed:0; Link: https://lore.kernel.org/lkml/Y+QaZtz55LIirsUO@google.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230210155921.4610-1-laoar.shao@gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230212151303.12353-1-laoar.shao@gmail.com Cc: stable@vger.kernel.org Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Kajetan Puchalski <kajetan.puchalski@arm.com> CC: Qais Yousef <qyousef@layalina.io> Fixes: 3087c61ed2c4 ("tools/testing/selftests/bpf: replace open-coded 16 with TASK_COMM_LEN") Reported-by: John Stultz <jstultz@google.com> Debugged-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-14net/mlx5: Expose SF firmware pages counterMaher Sanalla
[ Upstream commit 9965bbebae59b3563a4d95e4aed121e8965dfdc2 ] Currently, each core device has VF pages counter which stores number of fw pages used by its VFs and SFs. The current design led to a hang when performing firmware reset on DPU, where the DPU PFs stalled in sriov unload flow due to waiting on release of SFs pages instead of waiting on only VFs pages. Thus, Add a separate counter for SF firmware pages, which will prevent the stall scenario described above. Fixes: 1958fc2f0712 ("net/mlx5: SF, Add auxiliary device driver") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-14net/mlx5: Store page counters in a single arrayMaher Sanalla
[ Upstream commit c3bdbaea654d8df39112de33037106134a520dc7 ] Currently, an independent page counter is used for tracking memory usage for each function type such as VF, PF and host PF (DPU). For better code-readibilty, use a single array that stores the number of allocated memory pages for each function type. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Stable-dep-of: 9965bbebae59 ("net/mlx5: Expose SF firmware pages counter") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-14drm/virtio: exbuf->fence_fd unmodified on interrupted waitRyan Neph
[ Upstream commit 8f20660f053cefd4693e69cfff9cf58f4f7c4929 ] An interrupted dma_fence_wait() becomes an -ERESTARTSYS returned to userspace ioctl(DRM_IOCTL_VIRTGPU_EXECBUFFER) calls, prompting to retry the ioctl(), but the passed exbuf->fence_fd has been reset to -1, making the retry attempt fail at sync_file_get_fence(). The uapi for DRM_IOCTL_VIRTGPU_EXECBUFFER is changed to retain the passed value for exbuf->fence_fd when returning anything besides a successful result from the ioctl. Fixes: 2cd7b6f08bc4 ("drm/virtio: add in/out fence support for explicit synchronization") Signed-off-by: Ryan Neph <ryanneph@chromium.org> Reviewed-by: Rob Clark <robdclark@gmail.com> Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230203233345.2477767-1-ryanneph@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-14uapi: add missing ip/ipv6 header dependencies for linux/stddef.hHerton R. Krzesinski
[ Upstream commit 03702d4d29be4e2510ec80b248dbbde4e57030d9 ] Since commit 58e0be1ef6118 ("net: use struct_group to copy ip/ipv6 header addresses"), ip and ipv6 headers started to use the __struct_group definition, which is defined at include/uapi/linux/stddef.h. However, linux/stddef.h isn't explicitly included in include/uapi/linux/{ip,ipv6}.h, which breaks build of xskxceiver bpf selftest if you install the uapi headers in the system: $ make V=1 xskxceiver -C tools/testing/selftests/bpf ... make: Entering directory '(...)/tools/testing/selftests/bpf' gcc -g -O0 -rdynamic -Wall -Werror (...) In file included from xskxceiver.c:79: /usr/include/linux/ip.h:103:9: error: expected specifier-qualifier-list before ‘__struct_group’ 103 | __struct_group(/* no tag */, addrs, /* no attrs */, | ^~~~~~~~~~~~~~ ... Include the missing <linux/stddef.h> dependency in ip.h and do the same for the ipv6.h header. Fixes: 58e0be1ef611 ("net: use struct_group to copy ip/ipv6 header addresses") Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09nvmem: core: remove nvmem_config wp_gpioRussell King (Oracle)
commit 569653f022a29a1a44ea9de5308b657228303fa5 upstream. No one provides wp_gpio, so let's remove it to avoid issues with the nvmem core putting this gpio. Cc: stable@vger.kernel.org Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20230127104015.23839-5-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09highmem: round down the address passed to kunmap_flush_on_unmap()Matthew Wilcox (Oracle)
commit 88d7b12068b95731c280af8ce88e8ee9561f96de upstream. We already round down the address in kunmap_local_indexed() which is the other implementation of __kunmap_local(). The only implementation of kunmap_flush_on_unmap() is PA-RISC which is expecting a page-aligned address. This may be causing PA-RISC to be flushing the wrong addresses currently. Link: https://lkml.kernel.org/r/20230126200727.1680362-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Fixes: 298fa1ad5571 ("highmem: Provide generic variant of kmap_atomic*") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Helge Deller <deller@gmx.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: David Sterba <dsterba@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()Kefeng Wang
commit ac86f547ca1002aec2ef66b9e64d03f45bbbfbb9 upstream. As commit 18365225f044 ("hwpoison, memcg: forcibly uncharge LRU pages"), hwpoison will forcibly uncharg a LRU hwpoisoned page, the folio_memcg could be NULl, then, mem_cgroup_track_foreign_dirty_slowpath() could occurs a NULL pointer dereference, let's do not record the foreign writebacks for folio memcg is null in mem_cgroup_track_foreign_dirty() to fix it. Link: https://lkml.kernel.org/r/20230129040945.180629-1-wangkefeng.wang@huawei.com Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Ma Wupeng <mawupeng1@huawei.com> Tested-by: Miko Larsson <mikoxyzzz@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smapsMike Kravetz
commit 3489dbb696d25602aea8c3e669a6d43b76bd5358 upstream. Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs". This issue of mapcount in hugetlb pages referenced by shared PMDs was discussed in [1]. The following two patches address user visible behavior caused by this issue. [1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/ This patch (of 2): A hugetlb page will have a mapcount of 1 if mapped by multiple processes via a shared PMD. This is because only the first process increases the map count, and subsequent processes just add the shared PMD page to their page table. page_mapcount is being used to decide if a hugetlb page is shared or private in /proc/PID/smaps. Pages referenced via a shared PMD were incorrectly being counted as private. To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found count the hugetlb page as shared. A new helper to check for a shared PMD is added. [akpm@linux-foundation.org: simplification, per David] [akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()] Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09rtc: efi: Enable SET/GET WAKEUP services as optionalShanker Donthineni
commit 101ca8d05913b7d1e6e8b9dd792193d4082fff86 upstream. The current implementation of rtc-efi is expecting all the 4 time services GET{SET}_TIME{WAKEUP} must be supported by UEFI firmware. As per the EFI_RT_PROPERTIES_TABLE, the platform specific implementations can choose to enable selective time services based on the RTC device capabilities. This patch does the following changes to provide GET/SET RTC services on platforms that do not support the WAKEUP feature. 1) Relax time services cap check when creating a platform device. 2) Clear RTC_FEATURE_ALARM bit in the absence of WAKEUP services. 3) Conditional alarm entries in '/proc/driver/rtc'. Cc: <stable@vger.kernel.org> # v6.0+ Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Link: https://lore.kernel.org/r/20230102230630.192911-1-sdonthineni@nvidia.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09scsi: iscsi_tcp: Fix UAF during logout when accessing the shost ipaddressMike Christie
[ Upstream commit 6f1d64b13097e85abda0f91b5638000afc5f9a06 ] Bug report and analysis from Ding Hui. During iSCSI session logout, if another task accesses the shost ipaddress attr, we can get a KASAN UAF report like this: [ 276.942144] BUG: KASAN: use-after-free in _raw_spin_lock_bh+0x78/0xe0 [ 276.942535] Write of size 4 at addr ffff8881053b45b8 by task cat/4088 [ 276.943511] CPU: 2 PID: 4088 Comm: cat Tainted: G E 6.1.0-rc8+ #3 [ 276.943997] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 [ 276.944470] Call Trace: [ 276.944943] <TASK> [ 276.945397] dump_stack_lvl+0x34/0x48 [ 276.945887] print_address_description.constprop.0+0x86/0x1e7 [ 276.946421] print_report+0x36/0x4f [ 276.947358] kasan_report+0xad/0x130 [ 276.948234] kasan_check_range+0x35/0x1c0 [ 276.948674] _raw_spin_lock_bh+0x78/0xe0 [ 276.949989] iscsi_sw_tcp_host_get_param+0xad/0x2e0 [iscsi_tcp] [ 276.951765] show_host_param_ISCSI_HOST_PARAM_IPADDRESS+0xe9/0x130 [scsi_transport_iscsi] [ 276.952185] dev_attr_show+0x3f/0x80 [ 276.953005] sysfs_kf_seq_show+0x1fb/0x3e0 [ 276.953401] seq_read_iter+0x402/0x1020 [ 276.954260] vfs_read+0x532/0x7b0 [ 276.955113] ksys_read+0xed/0x1c0 [ 276.955952] do_syscall_64+0x38/0x90 [ 276.956347] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 276.956769] RIP: 0033:0x7f5d3a679222 [ 276.957161] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 32 c0 0b 00 e8 a5 fe 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 [ 276.958009] RSP: 002b:00007ffc864d16a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 276.958431] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f5d3a679222 [ 276.958857] RDX: 0000000000020000 RSI: 00007f5d3a4fe000 RDI: 0000000000000003 [ 276.959281] RBP: 00007f5d3a4fe000 R08: 00000000ffffffff R09: 0000000000000000 [ 276.959682] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020000 [ 276.960126] R13: 0000000000000003 R14: 0000000000000000 R15: 0000557a26dada58 [ 276.960536] </TASK> [ 276.961357] Allocated by task 2209: [ 276.961756] kasan_save_stack+0x1e/0x40 [ 276.962170] kasan_set_track+0x21/0x30 [ 276.962557] __kasan_kmalloc+0x7e/0x90 [ 276.962923] __kmalloc+0x5b/0x140 [ 276.963308] iscsi_alloc_session+0x28/0x840 [scsi_transport_iscsi] [ 276.963712] iscsi_session_setup+0xda/0xba0 [libiscsi] [ 276.964078] iscsi_sw_tcp_session_create+0x1fd/0x330 [iscsi_tcp] [ 276.964431] iscsi_if_create_session.isra.0+0x50/0x260 [scsi_transport_iscsi] [ 276.964793] iscsi_if_recv_msg+0xc5a/0x2660 [scsi_transport_iscsi] [ 276.965153] iscsi_if_rx+0x198/0x4b0 [scsi_transport_iscsi] [ 276.965546] netlink_unicast+0x4d5/0x7b0 [ 276.965905] netlink_sendmsg+0x78d/0xc30 [ 276.966236] sock_sendmsg+0xe5/0x120 [ 276.966576] ____sys_sendmsg+0x5fe/0x860 [ 276.966923] ___sys_sendmsg+0xe0/0x170 [ 276.967300] __sys_sendmsg+0xc8/0x170 [ 276.967666] do_syscall_64+0x38/0x90 [ 276.968028] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 276.968773] Freed by task 2209: [ 276.969111] kasan_save_stack+0x1e/0x40 [ 276.969449] kasan_set_track+0x21/0x30 [ 276.969789] kasan_save_free_info+0x2a/0x50 [ 276.970146] __kasan_slab_free+0x106/0x190 [ 276.970470] __kmem_cache_free+0x133/0x270 [ 276.970816] device_release+0x98/0x210 [ 276.971145] kobject_cleanup+0x101/0x360 [ 276.971462] iscsi_session_teardown+0x3fb/0x530 [libiscsi] [ 276.971775] iscsi_sw_tcp_session_destroy+0xd8/0x130 [iscsi_tcp] [ 276.972143] iscsi_if_recv_msg+0x1bf1/0x2660 [scsi_transport_iscsi] [ 276.972485] iscsi_if_rx+0x198/0x4b0 [scsi_transport_iscsi] [ 276.972808] netlink_unicast+0x4d5/0x7b0 [ 276.973201] netlink_sendmsg+0x78d/0xc30 [ 276.973544] sock_sendmsg+0xe5/0x120 [ 276.973864] ____sys_sendmsg+0x5fe/0x860 [ 276.974248] ___sys_sendmsg+0xe0/0x170 [ 276.974583] __sys_sendmsg+0xc8/0x170 [ 276.974891] do_syscall_64+0x38/0x90 [ 276.975216] entry_SYSCALL_64_after_hwframe+0x63/0xcd We can easily reproduce by two tasks: 1. while :; do iscsiadm -m node --login; iscsiadm -m node --logout; done 2. while :; do cat \ /sys/devices/platform/host*/iscsi_host/host*/ipaddress; done iscsid | cat --------------------------------+--------------------------------------- |- iscsi_sw_tcp_session_destroy | |- iscsi_session_teardown | |- device_release | |- iscsi_session_release ||- dev_attr_show |- kfree | |- show_host_param_ | ISCSI_HOST_PARAM_IPADDRESS | |- iscsi_sw_tcp_host_get_param | |- r/w tcp_sw_host->session (UAF) |- iscsi_host_remove | |- iscsi_host_free | Fix the above bug by splitting the session removal into 2 parts: 1. removal from iSCSI class which includes sysfs and removal from host tracking. 2. freeing of session. During iscsi_tcp host and session removal we can remove the session from sysfs then remove the host from sysfs. At this point we know userspace is not accessing the kernel via sysfs so we can free the session and host. Link: https://lore.kernel.org/r/20230117193937.21244-2-michael.christie@oracle.com Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Acked-by: Ding Hui <dinghui@sangfor.com.cn> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09kunit: fix kunit_test_init_section_suites(...)Brendan Higgins
[ Upstream commit 254c71374a70051a043676b67ba4f7ad392b5fe6 ] Looks like kunit_test_init_section_suites(...) was messed up in a merge conflict. This fixes it. kunit_test_init_section_suites(...) was not updated to avoid the extra level of indirection when .kunit_test_suites was flattened. Given no-one was actively using it, this went unnoticed for a long period of time. Fixes: e5857d396f35 ("kunit: flatten kunit_suite*** to kunit_suite** in .kunit_test_suites") Signed-off-by: Brendan Higgins <brendan.higgins@linux.dev> Signed-off-by: David Gow <davidgow@google.com> Tested-by: Martin Fernandez <martin.fernandez@eclypsium.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09use less confusing names for iov_iter direction initializersAl Viro
[ Upstream commit de4eda9de2d957ef2d6a8365a01e26a435e958cb ] READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Stable-dep-of: 6dd88fd59da8 ("vhost-scsi: unbreak any layout for response") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listenerJakub Sitnicki
[ Upstream commit ddce1e091757d0259107c6c0c7262df201de2b66 ] A listening socket linked to a sockmap has its sk_prot overridden. It points to one of the struct proto variants in tcp_bpf_prots. The variant depends on the socket's family and which sockmap programs are attached. A child socket cloned from a TCP listener initially inherits their sk_prot. But before cloning is finished, we restore the child's proto to the listener's original non-tcp_bpf_prots one. This happens in tcp_create_openreq_child -> tcp_bpf_clone. Today, in tcp_bpf_clone we detect if the child's proto should be restored by checking only for the TCP_BPF_BASE proto variant. This is not correct. The sk_prot of listening socket linked to a sockmap can point to to any variant in tcp_bpf_prots. If the listeners sk_prot happens to be not the TCP_BPF_BASE variant, then the child socket unintentionally is left if the inherited sk_prot by tcp_bpf_clone. This leads to issues like infinite recursion on close [1], because the child state is otherwise not set up for use with tcp_bpf_prot operations. Adjust the check in tcp_bpf_clone to detect all of tcp_bpf_prots variants. Note that it wouldn't be sufficient to check the socket state when overriding the sk_prot in tcp_bpf_update_proto in order to always use the TCP_BPF_BASE variant for listening sockets. Since commit b8b8315e39ff ("bpf, sockmap: Remove unhash handler for BPF sockmap usage") it is possible for a socket to transition to TCP_LISTEN state while already linked to a sockmap, e.g. connect() -> insert into map -> connect(AF_UNSPEC) -> listen(). [1]: https://lore.kernel.org/all/00000000000073b14905ef2e7401@google.com/ Fixes: e80251555f0b ("tcp_bpf: Don't let child socket inherit parent protocol ops on copy") Reported-by: syzbot+04c21ed96d861dccc5cd@syzkaller.appspotmail.com Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230113-sockmap-fix-v2-2-1e0ee7ac2f90@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01netfilter: conntrack: unify established states for SCTP pathsSriram Yagnaraman
commit a44b7651489f26271ac784b70895e8a85d0cebf4 upstream. An SCTP endpoint can start an association through a path and tear it down over another one. That means the initial path will not see the shutdown sequence, and the conntrack entry will remain in ESTABLISHED state for 5 days. By merging the HEARTBEAT_ACKED and ESTABLISHED states into one ESTABLISHED state, there remains no difference between a primary or secondary path. The timeout for the merged ESTABLISHED state is set to 210 seconds (hb_interval * max_path_retrans + rto_max). So, even if a path doesn't see the shutdown sequence, it will expire in a reasonable amount of time. With this change in place, there is now more than one state from which we can transition to ESTABLISHED, COOKIE_ECHOED and HEARTBEAT_SENT, so handle the setting of ASSURED bit whenever a state change has happened and the new state is ESTABLISHED. Removed the check for dir==REPLY since the transition to ESTABLISHED can happen only in the reply direction. Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01platform/x86: apple-gmux: Add apple_gmux_detect() helperHans de Goede
[ Upstream commit d143908f80f3e5d164ac3342f73d6b9f536e8b4d ] Add a new (static inline) apple_gmux_detect() helper to apple-gmux.h which can be used for gmux detection instead of apple_gmux_present(). The latter is not really reliable since an ACPI device with a HID of APP000B is present on some devices without a gmux at all, as well as on devices with a newer (unsupported) MMIO based gmux model. This causes apple_gmux_present() to return false-positives on a number of different Apple laptop models. This new helper uses the same probing as the actual apple-gmux driver, so that it does not return false positives. To avoid code duplication the gmux_probe() function of the actual driver is also moved over to using the new apple_gmux_detect() helper. This avoids false positives (vs _HID + IO region detection) on: MacBookPro5,4 https://pastebin.com/8Xjq7RhS MacBookPro8,1 https://linux-hardware.org/?probe=e513cfbadb&log=dmesg MacBookPro9,2 https://bugzilla.kernel.org/attachment.cgi?id=278961 MacBookPro10,2 https://lkml.org/lkml/2014/9/22/657 MacBookPro11,2 https://forums.fedora-fr.org/viewtopic.php?id=70142 MacBookPro11,4 https://raw.githubusercontent.com/im-0/investigate-card-reader-suspend-problem-on-mbp11.4/master/test-16/dmesg Fixes: 21245df307cb ("ACPI: video: Add Apple GMUX brightness control detection") Link: https://lore.kernel.org/platform-driver-x86/20230123113750.462144-1-hdegoede@redhat.com/ Reported-by: Emmanouil Kouroupakis <kartebi@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230124105754.62167-3-hdegoede@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01platform/x86: apple-gmux: Move port defines to apple-gmux.hHans de Goede
[ Upstream commit 39f5a81f7ad80eb3fbcbfd817c6552db9de5504d ] This is a preparation patch for adding a new static inline apple_gmux_detect() helper which actually checks a supported gmux is present, rather then only checking an ACPI device with the HID is there as apple_gmux_present() does. Fixes: 21245df307cb ("ACPI: video: Add Apple GMUX brightness control detection") Link: https://lore.kernel.org/platform-driver-x86/20230123113750.462144-1-hdegoede@redhat.com/ Reported-by: Emmanouil Kouroupakis <kartebi@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230124105754.62167-2-hdegoede@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01drm/drm_vma_manager: Add drm_vma_node_allow_once()Nirmoy Das
[ Upstream commit 899d3a3c19ac0e5da013ce34833dccb97d19b5e4 ] Currently there is no easy way for a drm driver to safely check and allow drm_vma_offset_node for a drm file just once. Allow drm drivers to call non-refcounted version of drm_vma_node_allow() so that a driver doesn't need to keep track of each drm_vma_node_allow() to call subsequent drm_vma_node_revoke() to prevent memory leak. Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Suggested-by: Chris Wilson <chris.p.wilson@intel.com> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20230117175236.22317-1-nirmoy.das@intel.com Signed-off-by: Maxime Ripard <maxime@cerno.tech> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01scsi: ufs: core: Fix devfreq deadlocksJohan Hovold
commit ba81043753fffbc2ad6e0c5ff2659f12ac2f46b4 upstream. There is a lock inversion and rwsem read-lock recursion in the devfreq target callback which can lead to deadlocks. Specifically, ufshcd_devfreq_scale() already holds a clk_scaling_lock read lock when toggling the write booster, which involves taking the dev_cmd mutex before taking another clk_scaling_lock read lock. This can lead to a deadlock if another thread: 1) tries to acquire the dev_cmd and clk_scaling locks in the correct order, or 2) takes a clk_scaling write lock before the attempt to take the clk_scaling read lock a second time. Fix this by dropping the clk_scaling_lock before toggling the write booster as was done before commit 0e9d4ca43ba8 ("scsi: ufs: Protect some contexts from unexpected clock scaling"). While the devfreq callbacks are already serialised, add a second serialising mutex to handle the unlikely case where a callback triggered through the devfreq sysfs interface is racing with a request to disable clock scaling through the UFS controller 'clkscale_enable' sysfs attribute. This could otherwise lead to the write booster being left disabled after having disabled clock scaling. Also take the new mutex in ufshcd_clk_scaling_allow() to make sure that any pending write booster update has completed on return. Note that this currently only affects Qualcomm platforms since commit 87bd05016a64 ("scsi: ufs: core: Allow host driver to disable wb toggling during clock scaling"). The lock inversion (i.e. 1 above) was reported by lockdep as: ====================================================== WARNING: possible circular locking dependency detected 6.1.0-next-20221216 #211 Not tainted ------------------------------------------------------ kworker/u16:2/71 is trying to acquire lock: ffff076280ba98a0 (&hba->dev_cmd.lock){+.+.}-{3:3}, at: ufshcd_query_flag+0x50/0x1c0 but task is already holding lock: ffff076280ba9cf0 (&hba->clk_scaling_lock){++++}-{3:3}, at: ufshcd_devfreq_scale+0x2b8/0x380 which lock already depends on the new lock. [ +0.011606] the existing dependency chain (in reverse order) is: -> #1 (&hba->clk_scaling_lock){++++}-{3:3}: lock_acquire+0x68/0x90 down_read+0x58/0x80 ufshcd_exec_dev_cmd+0x70/0x2c0 ufshcd_verify_dev_init+0x68/0x170 ufshcd_probe_hba+0x398/0x1180 ufshcd_async_scan+0x30/0x320 async_run_entry_fn+0x34/0x150 process_one_work+0x288/0x6c0 worker_thread+0x74/0x450 kthread+0x118/0x120 ret_from_fork+0x10/0x20 -> #0 (&hba->dev_cmd.lock){+.+.}-{3:3}: __lock_acquire+0x12a0/0x2240 lock_acquire.part.0+0xcc/0x220 lock_acquire+0x68/0x90 __mutex_lock+0x98/0x430 mutex_lock_nested+0x2c/0x40 ufshcd_query_flag+0x50/0x1c0 ufshcd_query_flag_retry+0x64/0x100 ufshcd_wb_toggle+0x5c/0x120 ufshcd_devfreq_scale+0x2c4/0x380 ufshcd_devfreq_target+0xf4/0x230 devfreq_set_target+0x84/0x2f0 devfreq_update_target+0xc4/0xf0 devfreq_monitor+0x38/0x1f0 process_one_work+0x288/0x6c0 worker_thread+0x74/0x450 kthread+0x118/0x120 ret_from_fork+0x10/0x20 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&hba->clk_scaling_lock); lock(&hba->dev_cmd.lock); lock(&hba->clk_scaling_lock); lock(&hba->dev_cmd.lock); *** DEADLOCK *** Fixes: 0e9d4ca43ba8 ("scsi: ufs: Protect some contexts from unexpected clock scaling") Cc: stable@vger.kernel.org # 5.12 Cc: Can Guo <quic_cang@quicinc.com> Tested-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230116161201.16923-1-johan+linaro@kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>