summary refs log tree commit diff
path: root/net
AgeCommit message (Collapse)Author
2023-05-11SUNRPC: remove the maximum number of retries in call_bind_statusDai Ngo
[ Upstream commit 691d0b782066a6eeeecbfceb7910a8f6184e6105 ] Currently call_bind_status places a hard limit of 3 to the number of retries on EACCES error. This limit was done to prevent NLM unlock requests from being hang forever when the server keeps returning garbage. However this change causes problem for cases when NLM service takes longer than 9 seconds to register with the port mapper after a restart. This patch removes this hard coded limit and let the RPC handles the retry based on the standard hard/soft task semantics. Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests") Reported-by: Helen Chao <helen.chao@oracle.com> Tested-by: Helen Chao <helen.chao@oracle.com> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.Kuniyuki Iwashima
[ Upstream commit 50749f2dd6854a41830996ad302aef2ffaf011d8 ] syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY skbs. We can reproduce the problem with these sequences: sk = socket(AF_INET, SOCK_DGRAM, 0) sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE) sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1) sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53)) sk.close() sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct ubuf_info_msgzc indirectly holds a refcnt of the socket. When the skb is sent, __skb_tstamp_tx() clones it and puts the clone into the socket's error queue with the TX timestamp. When the original skb is received locally, skb_copy_ubufs() calls skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt. This additional count is decremented while freeing the skb, but struct ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is not called. The last refcnt is not released unless we retrieve the TX timestamped skb by recvmsg(). Since we clear the error queue in inet_sock_destruct() after the socket's refcnt reaches 0, there is a circular dependency. If we close() the socket holding such skbs, we never call sock_put() and leak the count, sk, and skb. TCP has the same problem, and commit e0c8bccd40fc ("net: stream: purge sk_error_queue in sk_stream_kill_queues()") tried to fix it by calling skb_queue_purge() during close(). However, there is a small chance that skb queued in a qdisc or device could be put into the error queue after the skb_queue_purge() call. In __skb_tstamp_tx(), the cloned skb should not have a reference to the ubuf to remove the circular dependency, but skb_clone() does not call skb_copy_ubufs() for zerocopy skb. So, we need to call skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs(). [0]: BUG: memory leak unreferenced object 0xffff88800c6d2d00 (size 1152): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00 ................ 02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............ backtrace: [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024 [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083 [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline] [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245 [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515 [<00000000b9b11231>] sock_create net/socket.c:1566 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline] [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636 [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline] [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline] [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd BUG: memory leak unreferenced object 0xffff888017633a00 (size 240): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff .........-m..... backtrace: [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497 [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline] [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596 [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline] [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370 [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037 [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652 [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253 [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819 [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline] [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734 [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117 [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline] [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline] [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY") Fixes: b5947e5d1e71 ("udp: msg_zerocopy") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11netlink: Use copy_to_user() for optval in netlink_getsockopt().Kuniyuki Iwashima
[ Upstream commit d913d32cc2707e9cd24fe6fa6d7d470e9c728980 ] Brad Spencer provided a detailed report [0] that when calling getsockopt() for AF_NETLINK, some SOL_NETLINK options set only 1 byte even though such options require at least sizeof(int) as length. The options return a flag value that fits into 1 byte, but such behaviour confuses users who do not initialise the variable before calling getsockopt() and do not strictly check the returned value as char. Currently, netlink_getsockopt() uses put_user() to copy data to optlen and optval, but put_user() casts the data based on the pointer, char *optval. As a result, only 1 byte is set to optval. To avoid this behaviour, we need to use copy_to_user() or cast optval for put_user(). Note that this changes the behaviour on big-endian systems, but we document that the size of optval is int in the man page. $ man 7 netlink ... Socket options To set or get a netlink socket option, call getsockopt(2) to read or setsockopt(2) to write the option with the option level argument set to SOL_NETLINK. Unless otherwise noted, optval is a pointer to an int. Fixes: 9a4595bc7e67 ("[NETLINK]: Add set/getsockopt options to support more than 32 groups") Fixes: be0c22a46cfb ("netlink: add NETLINK_BROADCAST_ERROR socket option") Fixes: 38938bfe3489 ("netlink: add NETLINK_NO_ENOBUFS socket flag") Fixes: 0a6a3a23ea6e ("netlink: add NETLINK_CAP_ACK socket option") Fixes: 2d4bc93368f5 ("netlink: extended ACK reporting") Fixes: 89d35528d17d ("netlink: Add new socket option to enable strict checking on dumps") Reported-by: Brad Spencer <bspencer@blackberry.com> Link: https://lore.kernel.org/netdev/ZD7VkNWFfp22kTDt@datsun.rim.net/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/r/20230421185255.94606-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11ipv4: Fix potential uninit variable access bug in __ip_make_skb()Ziyang Xuan
[ Upstream commit 99e5acae193e369b71217efe6f1dad42f3f18815 ] Like commit ea30388baebc ("ipv6: Fix an uninit variable access bug in __ip6_make_skb()"). icmphdr does not in skb linear region under the scenario of SOCK_RAW socket. Access icmp_hdr(skb)->type directly will trigger the uninit variable access bug. Use a local variable icmp_type to carry the correct value in different scenarios. Fixes: 96793b482540 ("[IPV4]: Add ICMPMsgStats MIB (RFC 4293)") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11net/sched: sch_fq: fix integer overflow of "credit"Davide Caratti
[ Upstream commit 7041101ff6c3073fd8f2e99920f535b111c929cb ] if sch_fq is configured with "initial quantum" having values greater than INT_MAX, the first assignment of "credit" does signed integer overflow to a very negative value. In this situation, the syzkaller script provided by Cristoph triggers the CPU soft-lockup warning even with few sockets. It's not an infinite loop, but "credit" wasn't probably meant to be minus 2Gb for each new flow. Capping "initial quantum" to INT_MAX proved to fix the issue. v2: validation of "initial quantum" is done in fq_policy, instead of open coding in fq_change() _ suggested by Jakub Kicinski Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/377 Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/7b3a3c7e36d03068707a021760a194a8eb5ad41a.1682002300.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11netfilter: nf_tables: don't write table validation state without mutexFlorian Westphal
[ Upstream commit 9a32e9850686599ed194ccdceb6cd3dd56b2d9b9 ] The ->cleanup callback needs to be removed, this doesn't work anymore as the transaction mutex is already released in the ->abort function. Just do it after a successful validation pass, this either happens from commit or abort phases where transaction mutex is held. Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11netfilter: conntrack: fix wrong ct->timeout valueTzung-Bi Shih
[ Upstream commit 73db1b8f2bb6725b7391e85aab41fdf592b3c0c1 ] (struct nf_conn)->timeout is an interval before the conntrack confirmed. After confirmed, it becomes a timestamp. It is observed that timeout of an unconfirmed conntrack: - Set by calling ctnetlink_change_timeout(). As a result, `nfct_time_stamp` was wrongly added to `ct->timeout` twice. - Get by calling ctnetlink_dump_timeout(). As a result, `nfct_time_stamp` was wrongly subtracted. Call Trace: <TASK> dump_stack_lvl ctnetlink_dump_timeout __ctnetlink_glue_build ctnetlink_glue_build __nfqnl_enqueue_packet nf_queue nf_hook_slow ip_mc_output ? __pfx_ip_finish_output ip_send_skb ? __pfx_dst_output udp_send_skb udp_sendmsg ? __pfx_ip_generic_getfrag sock_sendmsg Separate the 2 cases in: - Setting `ct->timeout` in __nf_ct_set_timeout(). - Getting `ct->timeout` in ctnetlink_dump_timeout(). Pablo appends: Update ctnetlink to set up the timeout _after_ the IPS_CONFIRMED flag is set on, otherwise conntrack creation via ctnetlink breaks. Note that the problem described in this patch occurs since the introduction of the nfnetlink_queue conntrack support, select a sufficiently old Fixes: tag for -stable kernel to pick up this fix. Fixes: a4b4766c3ceb ("netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info") Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11netfilter: conntrack: restore IPS_CONFIRMED out of ↵Pablo Neira Ayuso
nf_conntrack_hash_check_insert() [ Upstream commit 2cdaa3eefed83082923cf219c8b6a314e622da74 ] e6d57e9ff0ae ("netfilter: conntrack: fix rmmod double-free race") consolidates IPS_CONFIRMED bit set in nf_conntrack_hash_check_insert(). However, this breaks ctnetlink: # conntrack -I -p tcp --timeout 123 --src 1.2.3.4 --dst 5.6.7.8 --state ESTABLISHED --sport 1 --dport 4 -u SEEN_REPLY conntrack v1.4.6 (conntrack-tools): Operation failed: Device or resource busy This is a partial revert of the aforementioned commit to restore IPS_CONFIRMED. Fixes: e6d57e9ff0ae ("netfilter: conntrack: fix rmmod double-free race") Reported-by: Stéphane Graber <stgraber@stgraber.org> Tested-by: Stéphane Graber <stgraber@stgraber.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmapDaniel Borkmann
[ Upstream commit 8c5c2a4898e3d6bad86e29d471e023c8a19ba799 ] syzbot reported a splat and bisected it to recent commit ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap"): [...] WARNING: CPU: 1 PID: 9280 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 Modules linked in: CPU: 1 PID: 9280 Comm: syz-executor.1 Not tainted 6.2.0-syzkaller-13249-gd319f344561d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 [...] Call Trace: <TASK> spin_unlock_bh include/linux/spinlock.h:395 [inline] sock_map_del_link+0x2ea/0x510 net/core/sock_map.c:165 sock_map_unref+0xb0/0x1d0 net/core/sock_map.c:184 sock_hash_delete_elem+0x1ec/0x2a0 net/core/sock_map.c:945 map_delete_elem kernel/bpf/syscall.c:1536 [inline] __sys_bpf+0x2edc/0x53e0 kernel/bpf/syscall.c:5053 __do_sys_bpf kernel/bpf/syscall.c:5166 [inline] __se_sys_bpf kernel/bpf/syscall.c:5164 [inline] __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5164 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe8f7c8c169 </TASK> [...] Revert for now until we have a proper solution. Fixes: ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap") Reported-by: syzbot+49f6cef45247ff249498@syzkaller.appspotmail.com Cc: Hsin-Wei Hung <hsinweih@uci.edu> Cc: Xin Liu <liuxin350@huawei.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/000000000000f1db9605f939720e@google.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf, sockmap: fix deadlocks in the sockhash and sockmapXin Liu
[ Upstream commit ed17aa92dc56b6d8883e4b7a8f1c6fbf5ed6cd29 ] When huang uses sched_switch tracepoint, the tracepoint does only one thing in the mounted ebpf program, which deletes the fixed elements in sockhash ([0]) It seems that elements in sockhash are rarely actively deleted by users or ebpf program. Therefore, we do not pay much attention to their deletion. Compared with hash maps, sockhash only provides spin_lock_bh protection. This causes it to appear to have self-locking behavior in the interrupt context. [0]:https://lore.kernel.org/all/CABcoxUayum5oOqFMMqAeWuS8+EzojquSOSyDA3J_2omY=2EeAg@mail.gmail.com/ Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Xin Liu <liuxin350@huawei.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230406122622.109978-1-liuxin350@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11xsk: Fix unaligned descriptor validationKal Conley
[ Upstream commit d769ccaf957fe7391f357c0a923de71f594b8a2b ] Make sure unaligned descriptors that straddle the end of the UMEM are considered invalid. Currently, descriptor validation is broken for zero-copy mode which only checks descriptors at page granularity. For example, descriptors in zero-copy mode that overrun the end of the UMEM but not a page boundary are (incorrectly) considered valid. The UMEM boundary check needs to happen before the page boundary and contiguity checks in xp_desc_crosses_non_contig_pg(). Do this check in xp_unaligned_validate_desc() instead like xp_check_unaligned() already does. Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API") Signed-off-by: Kal Conley <kal.conley@dectris.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230405235920.7305-2-kal.conley@dectris.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11netfilter: keep conntrack reference until IPsecv6 policy checks are doneMadhu Koriginja
[ Upstream commit b0e214d212030fe497d4d150bb3474e50ad5d093 ] Keep the conntrack reference until policy checks have been performed for IPsec V6 NAT support, just like ipv4. The reference needs to be dropped before a packet is queued to avoid having the conntrack module unloadable. Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support") Signed-off-by: Madhu Koriginja <madhu.koriginja@nxp.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11net/packet: convert po->auxdata to an atomic flagEric Dumazet
[ Upstream commit fd53c297aa7b077ae98a3d3d2d3aa278a1686ba6 ] po->auxdata can be read while another thread is changing its value, potentially raising KCSAN splat. Convert it to PACKET_SOCK_AUXDATA flag. Fixes: 8dc419447415 ("[PACKET]: Add optional checksum computation for recvmsg") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11net/packet: convert po->origdev to an atomic flagEric Dumazet
[ Upstream commit ee5675ecdf7a4e713ed21d98a70c2871d6ebed01 ] syzbot/KCAN reported that po->origdev can be read while another thread is changing its value. We can avoid this splat by converting this field to an actual bit. Following patches will convert remaining 1bit fields. Fixes: 80feaacb8a64 ("[AF_PACKET]: Add option to return orig_dev to userspace.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11net/packet: annotate accesses to po->xmitEric Dumazet
[ Upstream commit b9d83ab8a708f23a4001d60e9d8d0b3be3d9f607 ] po->xmit can be set from setsockopt(PACKET_QDISC_BYPASS), while read locklessly. Use READ_ONCE()/WRITE_ONCE() to avoid potential load/store tearing issues. Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11vlan: partially enable SIOCSHWTSTAMP in containerVadim Fedorenko
[ Upstream commit 731b73dba359e3ff00517c13aa0daa82b34ff466 ] Setting timestamp filter was explicitly disabled on vlan devices in containers because it might affect other processes on the host. But it's absolutely legit in case when real device is in the same namespace. Fixes: 873017af7784 ("vlan: disable SIOCSHWTSTAMP in container") Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-01bluetooth: Perform careful capability checks in hci_sock_ioctl()Ruihan Li
commit 25c150ac103a4ebeed0319994c742a90634ddf18 upstream. Previously, capability was checked using capable(), which verified that the caller of the ioctl system call had the required capability. In addition, the result of the check would be stored in the HCI_SOCK_TRUSTED flag, making it persistent for the socket. However, malicious programs can abuse this approach by deliberately sharing an HCI socket with a privileged task. The HCI socket will be marked as trusted when the privileged task occasionally makes an ioctl call. This problem can be solved by using sk_capable() to check capability, which ensures that not only the current task but also the socket opener has the specified capability, thus reducing the risk of privilege escalation through the previously identified vulnerability. Cc: stable@vger.kernel.org Fixes: f81f5b2db869 ("Bluetooth: Send control open and close messages for HCI raw sockets") Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-01mptcp: fix accept vs worker racePaolo Abeni
commit 63740448a32eb662e05894425b47bcc5814136f4 upstream. The mptcp worker and mptcp_accept() can race, as reported by Christoph: refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 14351 at lib/refcount.c:25 refcount_warn_saturate+0x105/0x1b0 lib/refcount.c:25 Modules linked in: CPU: 1 PID: 14351 Comm: syz-executor.2 Not tainted 6.3.0-rc1-gde5e8fd0123c #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:refcount_warn_saturate+0x105/0x1b0 lib/refcount.c:25 Code: 02 31 ff 89 de e8 1b f0 a7 ff 84 db 0f 85 6e ff ff ff e8 3e f5 a7 ff 48 c7 c7 d8 c7 34 83 c6 05 6d 2d 0f 02 01 e8 cb 3d 90 ff <0f> 0b e9 4f ff ff ff e8 1f f5 a7 ff 0f b6 1d 54 2d 0f 02 31 ff 89 RSP: 0018:ffffc90000a47bf8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88802eae98c0 RSI: ffffffff81097d4f RDI: 0000000000000001 RBP: ffff88802e712180 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88802eaea148 R12: ffff88802e712100 R13: ffff88802e712a88 R14: ffff888005cb93a8 R15: ffff88802e712a88 FS: 0000000000000000(0000) GS:ffff88803ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f277fd89120 CR3: 0000000035486002 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_add include/linux/refcount.h:199 [inline] __refcount_inc include/linux/refcount.h:250 [inline] refcount_inc include/linux/refcount.h:267 [inline] sock_hold include/net/sock.h:775 [inline] __mptcp_close+0x4c6/0x4d0 net/mptcp/protocol.c:3051 mptcp_close+0x24/0xe0 net/mptcp/protocol.c:3072 inet_release+0x56/0xa0 net/ipv4/af_inet.c:429 __sock_release+0x51/0xf0 net/socket.c:653 sock_close+0x18/0x20 net/socket.c:1395 __fput+0x113/0x430 fs/file_table.c:321 task_work_run+0x96/0x100 kernel/task_work.c:179 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x4fc/0x10c0 kernel/exit.c:869 do_group_exit+0x51/0xf0 kernel/exit.c:1019 get_signal+0x12b0/0x1390 kernel/signal.c:2859 arch_do_signal_or_restart+0x25/0x260 arch/x86/kernel/signal.c:306 exit_to_user_mode_loop kernel/entry/common.c:168 [inline] exit_to_user_mode_prepare+0x131/0x1a0 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x19/0x40 kernel/entry/common.c:296 do_syscall_64+0x46/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fec4b4926a9 Code: Unable to access opcode bytes at 0x7fec4b49267f. RSP: 002b:00007fec49f9dd78 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 00000000006bc058 RCX: 00007fec4b4926a9 RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00000000006bc058 RBP: 00000000006bc050 R08: 00000000007df998 R09: 00000000007df998 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006bc05c R13: fffffffffffffea8 R14: 000000000000000b R15: 000000000001fe40 </TASK> The root cause is that the worker can force fallback to TCP the first mptcp subflow, actually deleting the unaccepted msk socket. We can explicitly prevent the race delaying the unaccepted msk deletion at listener shutdown time. In case the closed subflow is later accepted, just drop the mptcp context and let the user-space deal with the paired mptcp socket. Fixes: b6985b9b8295 ("mptcp: use the workqueue to destroy unaccepted sockets") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/375 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-01mptcp: stops worker on unaccepted sockets at listener closePaolo Abeni
commit 2a6a870e44dd88f1a6a2893c65ef756a9edfb4c7 upstream. This is a partial revert of the blamed commit, with a relevant change: mptcp_subflow_queue_clean() now just change the msk socket status and stop the worker, so that the UaF issue addressed by the blamed commit is not re-introduced. The above prevents the mptcp worker from running concurrently with inet_csk_listen_stop(), as such race would trigger a warning, as reported by Christoph: RSP: 002b:00007f784fe09cd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e WARNING: CPU: 0 PID: 25807 at net/ipv4/inet_connection_sock.c:1387 inet_csk_listen_stop+0x664/0x870 net/ipv4/inet_connection_sock.c:1387 RAX: ffffffffffffffda RBX: 00000000006bc050 RCX: 00007f7850afd6a9 RDX: 0000000000000000 RSI: 0000000020000340 RDI: 0000000000000004 Modules linked in: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006bc05c R13: fffffffffffffea8 R14: 00000000006bc050 R15: 000000000001fe40 </TASK> CPU: 0 PID: 25807 Comm: syz-executor.7 Not tainted 6.2.0-g778e54711659 #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:inet_csk_listen_stop+0x664/0x870 net/ipv4/inet_connection_sock.c:1387 RAX: 0000000000000000 RBX: ffff888100dfbd40 RCX: 0000000000000000 RDX: ffff8881363aab80 RSI: ffffffff81c494f4 RDI: 0000000000000005 RBP: ffff888126dad080 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff888100dfe040 R13: 0000000000000001 R14: 0000000000000000 R15: ffff888100dfbdd8 FS: 00007f7850a2c800(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b32d26000 CR3: 000000012fdd8006 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> __tcp_close+0x5b2/0x620 net/ipv4/tcp.c:2875 __mptcp_close_ssk+0x145/0x3d0 net/mptcp/protocol.c:2427 mptcp_destroy_common+0x8a/0x1c0 net/mptcp/protocol.c:3277 mptcp_destroy+0x41/0x60 net/mptcp/protocol.c:3304 __mptcp_destroy_sock+0x56/0x140 net/mptcp/protocol.c:2965 __mptcp_close+0x38f/0x4a0 net/mptcp/protocol.c:3057 mptcp_close+0x24/0xe0 net/mptcp/protocol.c:3072 inet_release+0x53/0xa0 net/ipv4/af_inet.c:429 __sock_release+0x4e/0xf0 net/socket.c:651 sock_close+0x15/0x20 net/socket.c:1393 __fput+0xff/0x420 fs/file_table.c:321 task_work_run+0x8b/0xe0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x113/0x120 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:296 do_syscall_64+0x46/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f7850af70dc RAX: 0000000000000000 RBX: 0000000000000004 RCX: 00007f7850af70dc RDX: 00007f7850a2c800 RSI: 0000000000000002 RDI: 0000000000000003 RBP: 00000000006bd980 R08: 0000000000000000 R09: 00000000000018a0 R10: 00000000316338a4 R11: 0000000000000293 R12: 0000000000211e31 R13: 00000000006bc05c R14: 00007f785062c000 R15: 0000000000211af0 Fixes: 0a3f4f1f9c27 ("mptcp: fix UaF in listener shutdown") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/371 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-26sctp: Call inet6_destroy_sock() via sk->sk_destruct().Kuniyuki Iwashima
commit 6431b0f6ff1633ae598667e4cdd93830074a03e8 upstream. After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in sk->sk_destruct() by setting inet6_sock_destruct() to it to make sure we do not leak inet6-specific resources. SCTP sets its own sk->sk_destruct() in the sctp_init_sock(), and SCTPv6 socket reuses it as the init function. To call inet6_sock_destruct() from SCTPv6 sk->sk_destruct(), we set sctp_v6_destruct_sock() in a new init function. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-26dccp: Call inet6_destroy_sock() via sk->sk_destruct().Kuniyuki Iwashima
commit 1651951ebea54970e0bda60c638fc2eee7a6218f upstream. After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in sk->sk_destruct() by setting inet6_sock_destruct() to it to make sure we do not leak inet6-specific resources. DCCP sets its own sk->sk_destruct() in the dccp_init_sock(), and DCCPv6 socket shares it by calling the same init function via dccp_v6_init_sock(). To call inet6_sock_destruct() from DCCPv6 sk->sk_destruct(), we export it and set dccp_v6_sk_destruct() in the init function. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-26inet6: Remove inet6_destroy_sock() in sk->sk_prot->destroy().Kuniyuki Iwashima
commit b5fc29233d28be7a3322848ebe73ac327559cdb9 upstream. After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in sk->sk_destruct() by setting inet6_sock_destruct() to it to make sure we do not leak inet6-specific resources. Now we can remove unnecessary inet6_destroy_sock() calls in sk->sk_prot->destroy(). DCCP and SCTP have their own sk->sk_destruct() function, so we change them separately in the following patches. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-26net: bridge: switchdev: don't notify FDB entries with "master dynamic"Vladimir Oltean
[ Upstream commit 927cdea5d2095287ddd5246e5aa68eb5d68db2be ] There is a structural problem in switchdev, where the flag bits in struct switchdev_notifier_fdb_info (added_by_user, is_local etc) only represent a simplified / denatured view of what's in struct net_bridge_fdb_entry :: flags (BR_FDB_ADDED_BY_USER, BR_FDB_LOCAL etc). Each time we want to pass more information about struct net_bridge_fdb_entry :: flags to struct switchdev_notifier_fdb_info (here, BR_FDB_STATIC), we find that FDB entries were already notified to switchdev with no regard to this flag, and thus, switchdev drivers had no indication whether the notified entries were static or not. For example, this command: ip link add br0 type bridge && ip link set swp0 master br0 bridge fdb add dev swp0 00:01:02:03:04:05 master dynamic has never worked as intended with switchdev. It causes a struct net_bridge_fdb_entry to be passed to br_switchdev_fdb_notify() which has a single flag set: BR_FDB_ADDED_BY_USER. This is further passed to the switchdev notifier chain, where interested drivers have no choice but to assume this is a static (does not age) and sticky (does not migrate) FDB entry. So currently, all drivers offload it to hardware as such, as can be seen below ("offload" is set). bridge fdb get 00:01:02:03:04:05 dev swp0 master 00:01:02:03:04:05 dev swp0 offload master br0 The software FDB entry expires $ageing_time centiseconds after the kernel last sees a packet with this MAC SA, and the bridge notifies its deletion as well, so it eventually disappears from hardware too. This is a problem, because it is actually desirable to start offloading "master dynamic" FDB entries correctly - they should expire $ageing_time centiseconds after the *hardware* port last sees a packet with this MAC SA - and this is how the current incorrect behavior was discovered. With an offloaded data plane, it can be expected that software only sees exception path packets, so an otherwise active dynamic FDB entry would be aged out by software sooner than it should. With the change in place, these FDB entries are no longer offloaded: bridge fdb get 00:01:02:03:04:05 dev swp0 master 00:01:02:03:04:05 dev swp0 master br0 and this also constitutes a better way (assuming a backport to stable kernels) for user space to determine whether the kernel has the capability of doing something sane with these or not. As opposed to "master dynamic" FDB entries, on the current behavior of which no one currently depends on (which can be deduced from the lack of kselftests), Ido Schimmel explains that entries with the "extern_learn" flag (BR_FDB_ADDED_BY_EXT_LEARN) should still be notified to switchdev, since the spectrum driver listens to them (and this is kind of okay, because although they are treated identically to "static", they are expected to not age, and to roam). Fixes: 6b26b51b1d13 ("net: bridge: Add support for notifying devices about FDB add/del") Link: https://lore.kernel.org/netdev/20230327115206.jk5q5l753aoelwus@skbuf/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20230418155902.898627-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26net: rpl: fix rpl header size calculationAlexander Aring
[ Upstream commit 4e006c7a6dac0ead4c1bf606000aa90a372fc253 ] This patch fixes a missing 8 byte for the header size calculation. The ipv6_rpl_srh_size() is used to check a skb_pull() on skb->data which points to skb_transport_header(). Currently we only check on the calculated addresses fields using CmprI and CmprE fields, see: https://www.rfc-editor.org/rfc/rfc6554#section-3 there is however a missing 8 byte inside the calculation which stands for the fields before the addresses field. Those 8 bytes are represented by sizeof(struct ipv6_rpl_sr_hdr) expression. Fixes: 8610c7c6e3bd ("net: ipv6: add support for rpl sr exthdr") Signed-off-by: Alexander Aring <aahringo@redhat.com> Reported-by: maxpl0it <maxpl0it@protonmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26netfilter: nf_tables: tighten netlink attribute requirements for catch-all ↵Pablo Neira Ayuso
elements [ Upstream commit d4eb7e39929a3b1ff30fb751b4859fc2410702a0 ] If NFT_SET_ELEM_CATCHALL is set on, then userspace provides no set element key. Otherwise, bail out with -EINVAL. Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26netfilter: nf_tables: validate catch-all set elementsPablo Neira Ayuso
[ Upstream commit d46fc894147cf98dd6e8210aa99ed46854191840 ] catch-all set element might jump/goto to chain that uses expressions that require validation. Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_aggGwangun Jung
[ Upstream commit 3037933448f60f9acb705997eae62013ecb81e0d ] If the TCA_QFQ_LMAX value is not offered through nlattr, lmax is determined by the MTU value of the network device. The MTU of the loopback device can be set up to 2^31-1. As a result, it is possible to have an lmax value that exceeds QFQ_MIN_LMAX. Due to the invalid lmax value, an index is generated that exceeds the QFQ_MAX_INDEX(=24) value, causing out-of-bounds read/write errors. The following reports a oob access: [ 84.582666] BUG: KASAN: slab-out-of-bounds in qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313) [ 84.583267] Read of size 4 at addr ffff88810f676948 by task ping/301 [ 84.583686] [ 84.583797] CPU: 3 PID: 301 Comm: ping Not tainted 6.3.0-rc5 #1 [ 84.584164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [ 84.584644] Call Trace: [ 84.584787] <TASK> [ 84.584906] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1)) [ 84.585108] print_report (mm/kasan/report.c:320 mm/kasan/report.c:430) [ 84.585570] kasan_report (mm/kasan/report.c:538) [ 84.585988] qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313) [ 84.586599] qfq_enqueue (net/sched/sch_qfq.c:1255) [ 84.587607] dev_qdisc_enqueue (net/core/dev.c:3776) [ 84.587749] __dev_queue_xmit (./include/net/sch_generic.h:186 net/core/dev.c:3865 net/core/dev.c:4212) [ 84.588763] ip_finish_output2 (./include/net/neighbour.h:546 net/ipv4/ip_output.c:228) [ 84.589460] ip_output (net/ipv4/ip_output.c:430) [ 84.590132] ip_push_pending_frames (./include/net/dst.h:444 net/ipv4/ip_output.c:126 net/ipv4/ip_output.c:1586 net/ipv4/ip_output.c:1606) [ 84.590285] raw_sendmsg (net/ipv4/raw.c:649) [ 84.591960] sock_sendmsg (net/socket.c:724 net/socket.c:747) [ 84.592084] __sys_sendto (net/socket.c:2142) [ 84.593306] __x64_sys_sendto (net/socket.c:2150) [ 84.593779] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 84.593902] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [ 84.594070] RIP: 0033:0x7fe568032066 [ 84.594192] Code: 0e 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c09[ 84.594796] RSP: 002b:00007ffce388b4e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c Code starting with the faulting instruction =========================================== [ 84.595047] RAX: ffffffffffffffda RBX: 00007ffce388cc70 RCX: 00007fe568032066 [ 84.595281] RDX: 0000000000000040 RSI: 00005605fdad6d10 RDI: 0000000000000003 [ 84.595515] RBP: 00005605fdad6d10 R08: 00007ffce388eeec R09: 0000000000000010 [ 84.595749] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040 [ 84.595984] R13: 00007ffce388cc30 R14: 00007ffce388b4f0 R15: 0000001d00000001 [ 84.596218] </TASK> [ 84.596295] [ 84.596351] Allocated by task 291: [ 84.596467] kasan_save_stack (mm/kasan/common.c:46) [ 84.596597] kasan_set_track (mm/kasan/common.c:52) [ 84.596725] __kasan_kmalloc (mm/kasan/common.c:384) [ 84.596852] __kmalloc_node (./include/linux/kasan.h:196 mm/slab_common.c:967 mm/slab_common.c:974) [ 84.596979] qdisc_alloc (./include/linux/slab.h:610 ./include/linux/slab.h:731 net/sched/sch_generic.c:938) [ 84.597100] qdisc_create (net/sched/sch_api.c:1244) [ 84.597222] tc_modify_qdisc (net/sched/sch_api.c:1680) [ 84.597357] rtnetlink_rcv_msg (net/core/rtnetlink.c:6174) [ 84.597495] netlink_rcv_skb (net/netlink/af_netlink.c:2574) [ 84.597627] netlink_unicast (net/netlink/af_netlink.c:1340 net/netlink/af_netlink.c:1365) [ 84.597759] netlink_sendmsg (net/netlink/af_netlink.c:1942) [ 84.597891] sock_sendmsg (net/socket.c:724 net/socket.c:747) [ 84.598016] ____sys_sendmsg (net/socket.c:2501) [ 84.598147] ___sys_sendmsg (net/socket.c:2557) [ 84.598275] __sys_sendmsg (./include/linux/file.h:31 net/socket.c:2586) [ 84.598399] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 84.598520] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [ 84.598688] [ 84.598744] The buggy address belongs to the object at ffff88810f674000 [ 84.598744] which belongs to the cache kmalloc-8k of size 8192 [ 84.599135] The buggy address is located 2664 bytes to the right of [ 84.599135] allocated 7904-byte region [ffff88810f674000, ffff88810f675ee0) [ 84.599544] [ 84.599598] The buggy address belongs to the physical page: [ 84.599777] page:00000000e638567f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10f670 [ 84.600074] head:00000000e638567f order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 84.600330] flags: 0x200000000010200(slab|head|node=0|zone=2) [ 84.600517] raw: 0200000000010200 ffff888100043180 dead000000000122 0000000000000000 [ 84.600764] raw: 0000000000000000 0000000080020002 00000001ffffffff 0000000000000000 [ 84.601009] page dumped because: kasan: bad access detected [ 84.601187] [ 84.601241] Memory state around the buggy address: [ 84.601396] ffff88810f676800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 84.601620] ffff88810f676880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 84.601845] >ffff88810f676900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 84.602069] ^ [ 84.602243] ffff88810f676980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 84.602468] ffff88810f676a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 84.602693] ================================================================== [ 84.602924] Disabling lock debugging due to kernel taint Fixes: 3015f3d2a3cd ("pkt_sched: enable QFQ to support TSO/GSO") Reported-by: Gwangun Jung <exsociety@gmail.com> Signed-off-by: Gwangun Jung <exsociety@gmail.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNTChen Aotian
[ Upstream commit af0acf22aea359e04412237d68787401f96bb583 ] For memory alloc that store user data from nla[NFTA_OBJ_USERDATA], use GFP_KERNEL_ACCOUNT is more suitable. Fixes: 33758c891479 ("memcg: enable accounting for nft objects") Signed-off-by: Chen Aotian <chenaotian2@163.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26netfilter: br_netfilter: fix recent physdev match breakageFlorian Westphal
[ Upstream commit 94623f579ce338b5fa61b5acaa5beb8aa657fb9e ] Recent attempt to ensure PREROUTING hook is executed again when a decrypted ipsec packet received on a bridge passes through the network stack a second time broke the physdev match in INPUT hook. We can't discard the nf_bridge info strct from sabotage_in hook, as this is needed by the physdev match. Keep the struct around and handle this with another conditional instead. Fixes: 2b272bb558f1 ("netfilter: br_netfilter: disable sabotage_in hook after first suppression") Reported-and-tested-by: Farid BENAMROUCHE <fariouche@yahoo.fr> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20mptcp: stricter state check in mptcp_workerPaolo Abeni
commit d6a0443733434408f2cbd4c53fea6910599bab9e upstream. As reported by Christoph, the mptcp protocol can run the worker when the relevant msk socket is in an unexpected state: connect() // incoming reset + fastclose // the mptcp worker is scheduled mptcp_disconnect() // msk is now CLOSED listen() mptcp_worker() Leading to the following splat: divide error: 0000 [#1] PREEMPT SMP CPU: 1 PID: 21 Comm: kworker/1:0 Not tainted 6.3.0-rc1-gde5e8fd0123c #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:__tcp_select_window+0x22c/0x4b0 net/ipv4/tcp_output.c:3018 RSP: 0018:ffffc900000b3c98 EFLAGS: 00010293 RAX: 000000000000ffd7 RBX: 000000000000ffd7 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff8214ce97 RDI: 0000000000000004 RBP: 000000000000ffd7 R08: 0000000000000004 R09: 0000000000010000 R10: 000000000000ffd7 R11: ffff888005afa148 R12: 000000000000ffd7 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88803ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000405270 CR3: 000000003011e006 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_select_window net/ipv4/tcp_output.c:262 [inline] __tcp_transmit_skb+0x356/0x1280 net/ipv4/tcp_output.c:1345 tcp_transmit_skb net/ipv4/tcp_output.c:1417 [inline] tcp_send_active_reset+0x13e/0x320 net/ipv4/tcp_output.c:3459 mptcp_check_fastclose net/mptcp/protocol.c:2530 [inline] mptcp_worker+0x6c7/0x800 net/mptcp/protocol.c:2705 process_one_work+0x3bd/0x950 kernel/workqueue.c:2390 worker_thread+0x5b/0x610 kernel/workqueue.c:2537 kthread+0x138/0x170 kernel/kthread.c:376 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308 </TASK> This change addresses the issue explicitly checking for bad states before running the mptcp worker. Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/374 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20mptcp: use mptcp_schedule_work instead of open-coding itPaolo Abeni
commit a5cb752b125766524c921faab1a45cc96065b0a7 upstream. Beyond reducing code duplication this also avoids scheduling the mptcp_worker on a closed socket on some edge scenarios. The addressed issue is actually older than the blamed commit below, but this fix needs it as a pre-requisite. Fixes: ba8f48f7a4d7 ("mptcp: introduce mptcp_schedule_work") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20skbuff: Fix a race between coalescing and releasing SKBsLiang Chen
[ Upstream commit 0646dc31ca886693274df5749cd0c8c1eaaeb5ca ] Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment recycling") allowed coalescing to proceed with non page pool page and page pool page when @from is cloned, i.e. to->pp_recycle --> false from->pp_recycle --> true skb_cloned(from) --> true However, it actually requires skb_cloned(@from) to hold true until coalescing finishes in this situation. If the other cloned SKB is released while the merging is in process, from_shinfo->nr_frags will be set to 0 toward the end of the function, causing the increment of frag page _refcount to be unexpectedly skipped resulting in inconsistent reference counts. Later when SKB(@to) is released, it frees the page directly even though the page pool page is still in use, leading to use-after-free or double-free errors. So it should be prohibited. The double-free error message below prompted us to investigate: BUG: Bad page state in process swapper/1 pfn:0e0d1 page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0xe0d1 flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000 raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000 page dumped because: nonzero _refcount CPU: 1 PID: 0 Comm: swapper/1 Tainted: G E 6.2.0+ Call Trace: <IRQ> dump_stack_lvl+0x32/0x50 bad_page+0x69/0xf0 free_pcp_prepare+0x260/0x2f0 free_unref_page+0x20/0x1c0 skb_release_data+0x10b/0x1a0 napi_consume_skb+0x56/0x150 net_rx_action+0xf0/0x350 ? __napi_schedule+0x79/0x90 __do_softirq+0xc8/0x2b1 __irq_exit_rcu+0xb9/0xf0 common_interrupt+0x82/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 RIP: 0010:default_idle+0xb/0x20 Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool") Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230413090353.14448-1-liangchen.linux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20udp6: fix potential access to stale informationEric Dumazet
[ Upstream commit 1c5950fc6fe996235f1d18539b9c6b64b597f50f ] lena wang reported an issue caused by udpv6_sendmsg() mangling msg->msg_name and msg->msg_namelen, which are later read from ____sys_sendmsg() : /* * If this is sendmmsg() and sending to current destination address was * successful, remember it. */ if (used_address && err >= 0) { used_address->name_len = msg_sys->msg_namelen; if (msg_sys->msg_name) memcpy(&used_address->name, msg_sys->msg_name, used_address->name_len); } udpv6_sendmsg() wants to pretend the remote address family is AF_INET in order to call udp_sendmsg(). A fix would be to modify the address in-place, instead of using a local variable, but this could have other side effects. Instead, restore initial values before we return from udpv6_sendmsg(). Fixes: c71d8ebe7a44 ("net: Fix security_socket_sendmsg() bypass problem.") Reported-by: lena wang <lena.wang@mediatek.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20230412130308.1202254-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20sctp: fix a potential overflow in sctp_ifwdtsn_skipXin Long
[ Upstream commit 32832a2caf82663870126c5186cf8f86c8b2a649 ] Currently, when traversing ifwdtsn skips with _sctp_walk_ifwdtsn, it only checks the pos against the end of the chunk. However, the data left for the last pos may be < sizeof(struct sctp_ifwdtsn_skip), and dereference it as struct sctp_ifwdtsn_skip may cause coverflow. This patch fixes it by checking the pos against "the end of the chunk - sizeof(struct sctp_ifwdtsn_skip)" in sctp_ifwdtsn_skip, similar to sctp_fwdtsn_skip. Fixes: 0fc2ea922c8a ("sctp: implement validate_ftsn for sctp_stream_interleave") Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/2a71bffcd80b4f2c61fac6d344bb2f11c8fd74f7.1681155810.git.lucien.xin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume()Ziyang Xuan
[ Upstream commit 6417070918de3bcdbe0646e7256dae58fd8083ba ] Syzbot reported a bug as following: ===================================================== BUG: KMSAN: uninit-value in qrtr_tx_resume+0x185/0x1f0 net/qrtr/af_qrtr.c:230 qrtr_tx_resume+0x185/0x1f0 net/qrtr/af_qrtr.c:230 qrtr_endpoint_post+0xf85/0x11b0 net/qrtr/af_qrtr.c:519 qrtr_tun_write_iter+0x270/0x400 net/qrtr/tun.c:108 call_write_iter include/linux/fs.h:2189 [inline] aio_write+0x63a/0x950 fs/aio.c:1600 io_submit_one+0x1d1c/0x3bf0 fs/aio.c:2019 __do_sys_io_submit fs/aio.c:2078 [inline] __se_sys_io_submit+0x293/0x770 fs/aio.c:2048 __x64_sys_io_submit+0x92/0xd0 fs/aio.c:2048 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Uninit was created at: slab_post_alloc_hook mm/slab.h:766 [inline] slab_alloc_node mm/slub.c:3452 [inline] __kmem_cache_alloc_node+0x71f/0xce0 mm/slub.c:3491 __do_kmalloc_node mm/slab_common.c:967 [inline] __kmalloc_node_track_caller+0x114/0x3b0 mm/slab_common.c:988 kmalloc_reserve net/core/skbuff.c:492 [inline] __alloc_skb+0x3af/0x8f0 net/core/skbuff.c:565 __netdev_alloc_skb+0x120/0x7d0 net/core/skbuff.c:630 qrtr_endpoint_post+0xbd/0x11b0 net/qrtr/af_qrtr.c:446 qrtr_tun_write_iter+0x270/0x400 net/qrtr/tun.c:108 call_write_iter include/linux/fs.h:2189 [inline] aio_write+0x63a/0x950 fs/aio.c:1600 io_submit_one+0x1d1c/0x3bf0 fs/aio.c:2019 __do_sys_io_submit fs/aio.c:2078 [inline] __se_sys_io_submit+0x293/0x770 fs/aio.c:2048 __x64_sys_io_submit+0x92/0xd0 fs/aio.c:2048 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd It is because that skb->len requires at least sizeof(struct qrtr_ctrl_pkt) in qrtr_tx_resume(). And skb->len equals to size in qrtr_endpoint_post(). But size is less than sizeof(struct qrtr_ctrl_pkt) when qrtr_cb->type equals to QRTR_TYPE_RESUME_TX in qrtr_endpoint_post() under the syzbot scenario. This triggers the uninit variable access bug. Add size check when qrtr_cb->type equals to QRTR_TYPE_RESUME_TX in qrtr_endpoint_post() to fix the bug. Fixes: 5fdeb0d372ab ("net: qrtr: Implement outgoing flow control") Reported-by: syzbot+4436c9630a45820fda76@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=c14607f0963d27d5a3d5f4c8639b500909e43540 Suggested-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230410012352.3997823-1-william.xuanziyang@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20Bluetooth: Set ISO Data Path on broadcast sinkClaudia Draghicescu
[ Upstream commit d2e4f1b1cba8742db66aaf77374cab7c0c7c8656 ] This patch enables ISO data rx on broadcast sink. Fixes: eca0ae4aea66 ("Bluetooth: Add initial implementation of BIS connections") Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20Bluetooth: SCO: Fix possible circular locking dependency sco_sock_getsockoptLuiz Augusto von Dentz
[ Upstream commit 975abc0c90fc485ff9b4a6afa475c3b1398d5d47 ] This attempts to fix the following trace: ====================================================== WARNING: possible circular locking dependency detected 6.3.0-rc2-g68fcb3a7bf97 #4706 Not tainted ------------------------------------------------------ sco-tester/31 is trying to acquire lock: ffff8880025b8070 (&hdev->lock){+.+.}-{3:3}, at: sco_sock_getsockopt+0x1fc/0xa90 but task is already holding lock: ffff888001eeb130 (sk_lock-AF_BLUETOOTH-BTPROTO_SCO){+.+.}-{0:0}, at: sco_sock_getsockopt+0x104/0xa90 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (sk_lock-AF_BLUETOOTH-BTPROTO_SCO){+.+.}-{0:0}: lock_sock_nested+0x32/0x80 sco_connect_cfm+0x118/0x4a0 hci_sync_conn_complete_evt+0x1e6/0x3d0 hci_event_packet+0x55c/0x7c0 hci_rx_work+0x34c/0xa00 process_one_work+0x575/0x910 worker_thread+0x89/0x6f0 kthread+0x14e/0x180 ret_from_fork+0x2b/0x50 -> #1 (hci_cb_list_lock){+.+.}-{3:3}: __mutex_lock+0x13b/0xcc0 hci_sync_conn_complete_evt+0x1ad/0x3d0 hci_event_packet+0x55c/0x7c0 hci_rx_work+0x34c/0xa00 process_one_work+0x575/0x910 worker_thread+0x89/0x6f0 kthread+0x14e/0x180 ret_from_fork+0x2b/0x50 -> #0 (&hdev->lock){+.+.}-{3:3}: __lock_acquire+0x18cc/0x3740 lock_acquire+0x151/0x3a0 __mutex_lock+0x13b/0xcc0 sco_sock_getsockopt+0x1fc/0xa90 __sys_getsockopt+0xe9/0x190 __x64_sys_getsockopt+0x5b/0x70 do_syscall_64+0x42/0x90 entry_SYSCALL_64_after_hwframe+0x70/0xda other info that might help us debug this: Chain exists of: &hdev->lock --> hci_cb_list_lock --> sk_lock-AF_BLUETOOTH-BTPROTO_SCO Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_BLUETOOTH-BTPROTO_SCO); lock(hci_cb_list_lock); lock(sk_lock-AF_BLUETOOTH-BTPROTO_SCO); lock(&hdev->lock); *** DEADLOCK *** 1 lock held by sco-tester/31: #0: ffff888001eeb130 (sk_lock-AF_BLUETOOTH-BTPROTO_SCO){+.+.}-{0:0}, at: sco_sock_getsockopt+0x104/0xa90 Fixes: 248733e87d50 ("Bluetooth: Allow querying of supported offload codecs over SCO socket") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20Bluetooth: Fix printing errors if LE Connection times outLuiz Augusto von Dentz
[ Upstream commit b62e72200eaad523f08d8319bba50fc652e032a8 ] This fixes errors like bellow when LE Connection times out since that is actually not a controller error: Bluetooth: hci0: Opcode 0x200d failed: -110 Bluetooth: hci0: request failed to create LE connection: err -110 Instead the code shall properly detect if -ETIMEDOUT is returned and send HCI_OP_LE_CREATE_CONN_CANCEL to give up on the connection. Link: https://github.com/bluez/bluez/issues/340 Fixes: 8e8b92ee60de ("Bluetooth: hci_sync: Add hci_le_create_conn_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20Bluetooth: hci_conn: Fix not cleaning up on LE Connection failureLuiz Augusto von Dentz
[ Upstream commit 19cf60bf63cbaf5262eac400c707966e19999b83 ] hci_connect_le_scan_cleanup shall always be invoked to cleanup the states and re-enable passive scanning if necessary, otherwise it may cause the pending action to stay active causing multiple attempts to connect. Fixes: 9b3628d79b46 ("Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20net: openvswitch: fix race on port outputFelix Huettner
[ Upstream commit 066b86787fa3d97b7aefb5ac0a99a22dad2d15f8 ] assume the following setup on a single machine: 1. An openvswitch instance with one bridge and default flows 2. two network namespaces "server" and "client" 3. two ovs interfaces "server" and "client" on the bridge 4. for each ovs interface a veth pair with a matching name and 32 rx and tx queues 5. move the ends of the veth pairs to the respective network namespaces 6. assign ip addresses to each of the veth ends in the namespaces (needs to be the same subnet) 7. start some http server on the server network namespace 8. test if a client in the client namespace can reach the http server when following the actions below the host has a chance of getting a cpu stuck in a infinite loop: 1. send a large amount of parallel requests to the http server (around 3000 curls should work) 2. in parallel delete the network namespace (do not delete interfaces or stop the server, just kill the namespace) there is a low chance that this will cause the below kernel cpu stuck message. If this does not happen just retry. Below there is also the output of bpftrace for the functions mentioned in the output. The series of events happening here is: 1. the network namespace is deleted calling `unregister_netdevice_many_notify` somewhere in the process 2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and then runs `synchronize_net` 3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER` 4. this is then handled by `dp_device_event` which calls `ovs_netdev_detach_dev` (if a vport is found, which is the case for the veth interface attached to ovs) 5. this removes the rx_handlers of the device but does not prevent packages to be sent to the device 6. `dp_device_event` then queues the vport deletion to work in background as a ovs_lock is needed that we do not hold in the unregistration path 7. `unregister_netdevice_many_notify` continues to call `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0 8. port deletion continues (but details are not relevant for this issue) 9. at some future point the background task deletes the vport If after 7. but before 9. a packet is send to the ovs vport (which is not deleted at this point in time) which forwards it to the `dev_queue_xmit` flow even though the device is unregistering. In `skb_tx_hash` (which is called in the `dev_queue_xmit`) path there is a while loop (if the packet has a rx_queue recorded) that is infinite if `dev->real_num_tx_queues` is zero. To prevent this from happening we update `do_output` to handle devices without carrier the same as if the device is not found (which would be the code path after 9. is done). Additionally we now produce a warning in `skb_tx_hash` if we will hit the infinite loop. bpftrace (first word is function name): __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2 ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2 netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2 netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024 ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604 stuck message: watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279] Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:netdev_pick_tx+0xf1/0x320 Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01 RSP: 0018:ffffb78b40298820 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000 RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000 R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000 FS: 00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0 Call Trace: <IRQ> netdev_core_pick_tx+0xa4/0xb0 __dev_queue_xmit+0xf8/0x510 ? __bpf_prog_exit+0x1e/0x30 dev_queue_xmit+0x10/0x20 ovs_vport_send+0xad/0x170 [openvswitch] do_output+0x59/0x180 [openvswitch] do_execute_actions+0xa80/0xaa0 [openvswitch] ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch] ovs_execute_actions+0x4c/0x120 [openvswitch] ovs_dp_process_packet+0xa1/0x200 [openvswitch] ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch] ? ovs_ct_fill_key+0x1d/0x30 [openvswitch] ? ovs_flow_key_extract+0x2db/0x350 [openvswitch] ovs_vport_receive+0x77/0xd0 [openvswitch] ? __htab_map_lookup_elem+0x4e/0x60 ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714 ? trace_call_bpf+0xc8/0x150 ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? kprobe_perf_func+0x4f/0x2b0 ? __mod_memcg_lruvec_state+0x63/0xe0 netdev_port_receive+0xc4/0x180 [openvswitch] ? netdev_port_receive+0x180/0x180 [openvswitch] netdev_frame_hook+0x1f/0x40 [openvswitch] __netif_receive_skb_core.constprop.0+0x23d/0xf00 __netif_receive_skb_one_core+0x3f/0xa0 __netif_receive_skb+0x15/0x60 process_backlog+0x9e/0x170 __napi_poll+0x33/0x180 net_rx_action+0x126/0x280 ? ttwu_do_activate+0x72/0xf0 __do_softirq+0xd9/0x2e7 ? rcu_report_exp_cpu_mult+0x1b0/0x1b0 do_softirq+0x7d/0xb0 </IRQ> <TASK> __local_bh_enable_ip+0x54/0x60 ip_finish_output2+0x191/0x460 __ip_finish_output+0xb7/0x180 ip_finish_output+0x2e/0xc0 ip_output+0x78/0x100 ? __ip_finish_output+0x180/0x180 ip_local_out+0x5e/0x70 __ip_queue_xmit+0x184/0x440 ? tcp_syn_options+0x1f9/0x300 ip_queue_xmit+0x15/0x20 __tcp_transmit_skb+0x910/0x9c0 ? __mod_memcg_state+0x44/0xa0 tcp_connect+0x437/0x4e0 ? ktime_get_with_offset+0x60/0xf0 tcp_v4_connect+0x436/0x530 __inet_stream_connect+0xd4/0x3a0 ? kprobe_perf_func+0x4f/0x2b0 ? aa_sk_perm+0x43/0x1c0 inet_stream_connect+0x3b/0x60 __sys_connect_file+0x63/0x70 __sys_connect+0xa6/0xd0 ? setfl+0x108/0x170 ? do_fcntl+0xe8/0x5a0 __x64_sys_connect+0x18/0x20 do_syscall_64+0x5c/0xc0 ? __x64_sys_fcntl+0xa9/0xd0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? do_syscall_64+0x69/0xc0 ? __sys_setsockopt+0xea/0x1e0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? __x64_sys_setsockopt+0x1f/0x30 ? do_syscall_64+0x69/0xc0 ? irqentry_exit+0x1d/0x30 ? exc_page_fault+0x89/0x170 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7f7b8101c6a7 Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89 RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7 RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005 RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410 R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000 </TASK> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20tcp: restrict net.ipv4.tcp_app_winYueHaibing
[ Upstream commit dc5110c2d959c1707e12df5f792f41d90614adaa ] UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23 shift exponent 255 is too large for 32-bit type 'int' CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00161-g62bad54b26db-dirty #206 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x136/0x150 __ubsan_handle_shift_out_of_bounds+0x21f/0x5a0 tcp_init_transfer.cold+0x3a/0xb9 tcp_finish_connect+0x1d0/0x620 tcp_rcv_state_process+0xd78/0x4d60 tcp_v4_do_rcv+0x33d/0x9d0 __release_sock+0x133/0x3b0 release_sock+0x58/0x1b0 'maxwin' is int, shifting int for 32 or more bits is undefined behaviour. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-209p/xen : Fix use after free bug in xen_9pfs_front_remove due to race conditionZheng Wang
[ Upstream commit ea4f1009408efb4989a0f139b70fb338e7f687d0 ] In xen_9pfs_front_probe, it calls xen_9pfs_front_alloc_dataring to init priv->rings and bound &ring->work with p9_xen_response. When it calls xen_9pfs_front_event_handler to handle IRQ requests, it will finally call schedule_work to start the work. When we call xen_9pfs_front_remove to remove the driver, there may be a sequence as follows: Fix it by finishing the work before cleanup in xen_9pfs_front_free. Note that, this bug is found by static analysis, which might be false positive. CPU0 CPU1 |p9_xen_response xen_9pfs_front_remove| xen_9pfs_front_free| kfree(priv) | //free priv | |p9_tag_lookup |//use priv->client Fixes: 71ebd71921e4 ("xen/9pfs: connect to the backend") Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcpMartin KaFai Lau
[ Upstream commit 580031ff9952b7dbf48dedba6b56a100ae002bef ] While reviewing the udp-iter batching patches, noticed the bpf_iter_tcp calling sock_put() is incorrect. It should call sock_gen_put instead because bpf_iter_tcp is iterating the ehash table which has the req sk and tw sk. This patch replaces all sock_put with sock_gen_put in the bpf_iter_tcp codepath. Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230328004232.2134233-1-martin.lau@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-20Bluetooth: hci_conn: Fix possible UAFLuiz Augusto von Dentz
commit 5dc7d23e167e2882ef118456ceccd57873e876d8 upstream. This fixes the following trace: ================================================================== BUG: KASAN: slab-use-after-free in hci_conn_del+0xba/0x3a0 Write of size 8 at addr ffff88800208e9c8 by task iso-tester/31 CPU: 0 PID: 31 Comm: iso-tester Not tainted 6.3.0-rc2-g991aa4a69a47 #4716 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x1d/0x70 print_report+0xce/0x610 ? __virt_addr_valid+0xd4/0x150 ? hci_conn_del+0xba/0x3a0 kasan_report+0xdd/0x110 ? hci_conn_del+0xba/0x3a0 hci_conn_del+0xba/0x3a0 hci_conn_hash_flush+0xf2/0x120 hci_dev_close_sync+0x388/0x920 hci_unregister_dev+0x122/0x260 vhci_release+0x4f/0x90 __fput+0x102/0x430 task_work_run+0xf1/0x160 ? __pfx_task_work_run+0x10/0x10 ? mark_held_locks+0x24/0x90 exit_to_user_mode_prepare+0x170/0x180 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x4e/0x90 entry_SYSCALL_64_after_hwframe+0x70/0xda Fixes: 0f00cd322d22 ("Bluetooth: Free potentially unfreed SCO connection") Link: https://syzkaller.appspot.com/bug?extid=8bb72f86fc823817bc5d Cc: <stable@vger.kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20Bluetooth: Free potentially unfreed SCO connectionArchie Pusaka
commit 0f00cd322d22d4441de51aa80bcce5bb6a8cbb44 upstream. It is possible to initiate a SCO connection while deleting the corresponding ACL connection, e.g. in below scenario: (1) < hci setup sync connect command (2) > hci disconn complete event (for the acl connection) (3) > hci command complete event (for(1), failure) When it happens, hci_cs_setup_sync_conn won't be able to obtain the reference to the SCO connection, so it will be stuck and potentially hinder subsequent connections to the same device. This patch prevents that by also deleting the SCO connection if it is still not established when the corresponding ACL connection is deleted. Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Ying Hsu <yinghsu@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20Bluetooth: Fix race condition in hidp_session_threadMin Li
commit c95930abd687fcd1aa040dc4fe90dff947916460 upstream. There is a potential race condition in hidp_session_thread that may lead to use-after-free. For instance, the timer is active while hidp_del_timer is called in hidp_session_thread(). After hidp_session_put, then 'session' will be freed, causing kernel panic when hidp_idle_timeout is running. The solution is to use del_timer_sync instead of del_timer. Here is the call trace: ? hidp_session_probe+0x780/0x780 call_timer_fn+0x2d/0x1e0 __run_timers.part.0+0x569/0x940 hidp_session_probe+0x780/0x780 call_timer_fn+0x1e0/0x1e0 ktime_get+0x5c/0xf0 lapic_next_deadline+0x2c/0x40 clockevents_program_event+0x205/0x320 run_timer_softirq+0xa9/0x1b0 __do_softirq+0x1b9/0x641 __irq_exit_rcu+0xdc/0x190 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0xa1/0xc0 Cc: stable@vger.kernel.org Signed-off-by: Min Li <lm0963hack@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20Bluetooth: L2CAP: Fix use-after-free in l2cap_disconnect_{req,rsp}Luiz Augusto von Dentz
commit a2a9339e1c9deb7e1e079e12e27a0265aea8421a upstream. Similar to commit d0be8347c623 ("Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put"), just use l2cap_chan_hold_unless_zero to prevent referencing a channel that is about to be destroyed. Cc: stable@kernel.org Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Min Li <lm0963hack@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-13can: isotp: isotp_recvmsg(): use sock_recv_cmsgs() to get SOCK_RXQ_OVFL infosOliver Hartkopp
commit 0145462fc802cd447ef5d029758043c7f15b4b1e upstream. isotp.c was still using sock_recv_timestamp() which does not provide control messages to detect dropped PDUs in the receive path. Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol") Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20230330170248.62342-1-socketcan@hartkopp.net Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-13can: isotp: isotp_ops: fix poll() to not report false EPOLLOUT eventsMichal Sojka
commit 79e19fa79cb5d5f1b3bf3e3ae24989ccb93c7b7b upstream. When using select()/poll()/epoll() with a non-blocking ISOTP socket to wait for when non-blocking write is possible, a false EPOLLOUT event is sometimes returned. This can happen at least after sending a message which must be split to multiple CAN frames. The reason is that isotp_sendmsg() returns -EAGAIN when tx.state is not equal to ISOTP_IDLE and this behavior is not reflected in datagram_poll(), which is used in isotp_ops. This is fixed by introducing ISOTP-specific poll function, which suppresses the EPOLLOUT events in that case. v2: https://lore.kernel.org/all/20230302092812.320643-1-michal.sojka@cvut.cz v1: https://lore.kernel.org/all/20230224010659.48420-1-michal.sojka@cvut.cz https://lore.kernel.org/all/b53a04a2-ba1f-3858-84c1-d3eb3301ae15@hartkopp.net Signed-off-by: Michal Sojka <michal.sojka@cvut.cz> Reported-by: Jakub Jira <jirajak2@fel.cvut.cz> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol") Link: https://lore.kernel.org/all/20230331125511.372783-1-michal.sojka@cvut.cz Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-13can: isotp: fix race between isotp_sendsmg() and isotp_release()Oliver Hartkopp
commit 051737439eaee5bdd03d3c2ef5510d54a478fd05 upstream. As discussed with Dae R. Jeong and Hillf Danton here [1] the sendmsg() function in isotp.c might get into a race condition when restoring the former tx.state from the old_state. Remove the old_state concept and implement proper locking for the ISOTP_IDLE transitions in isotp_sendmsg(), inspired by a simplification idea from Hillf Danton. Introduce a new tx.state ISOTP_SHUTDOWN and use the same locking mechanism from isotp_release() which resolves a potential race between isotp_sendsmg() and isotp_release(). [1] https://lore.kernel.org/linux-can/ZB%2F93xJxq%2FBUqAgG@dragonet v1: https://lore.kernel.org/all/20230331102114.15164-1-socketcan@hartkopp.net v2: https://lore.kernel.org/all/20230331123600.3550-1-socketcan@hartkopp.net take care of signal interrupts for wait_event_interruptible() in isotp_release() v3: https://lore.kernel.org/all/20230331130654.9886-1-socketcan@hartkopp.net take care of signal interrupts for wait_event_interruptible() in isotp_sendmsg() in the wait_tx_done case v4: https://lore.kernel.org/all/20230331131935.21465-1-socketcan@hartkopp.net take care of signal interrupts for wait_event_interruptible() in isotp_sendmsg() in ALL cases Cc: Dae R. Jeong <threeearcat@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Fixes: 4f027cba8216 ("can: isotp: split tx timer into transmission and timeout") Link: https://lore.kernel.org/all/20230331131935.21465-1-socketcan@hartkopp.net Cc: stable@vger.kernel.org [mkl: rephrase commit message] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>