summary refs log tree commit diff
path: root/net/tls
AgeCommit message (Collapse)Author
2023-08-16net: tls: avoid discarding data on record closeJakub Kicinski
commit 6b47808f223c70ff564f9b363446d2a5fa1e05b2 upstream. TLS records end with a 16B tag. For TLS device offload we only need to make space for this tag in the stream, the device will generate and replace it with the actual calculated tag. Long time ago the code would just re-reference the head frag which mostly worked but was suboptimal because it prevented TCP from combining the record into a single skb frag. I'm not sure if it was correct as the first frag may be shorter than the tag. The commit under fixes tried to replace that with using the page frag and if the allocation failed rolling back the data, if record was long enough. It achieves better fragment coalescing but is also buggy. We don't roll back the iterator, so unless we're at the end of send we'll skip the data we designated as tag and start the next record as if the rollback never happened. There's also the possibility that the record was constructed with MSG_MORE and the data came from a different syscall and we already told the user space that we "got it". Allocate a single dummy page and use it as fallback. Found by code inspection, and proven by forcing allocation failures. Fixes: e7b159a48ba6 ("net/tls: remove the record tail optimization") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09tls: rx: strp: don't use GFP_KERNEL in softirq contextJakub Kicinski
commit 74836ec828fe17b63f2006fdbf53311d691396bf upstream. When receive buffer is small, or the TCP rx queue looks too complicated to bother using it directly - we allocate a new skb and copy data into it. We already use sk->sk_allocation... but nothing actually sets it to GFP_ATOMIC on the ->sk_data_ready() path. Users of HW offload are far more likely to experience problems due to scheduling while atomic. "Copy mode" is very rarely triggered with SW crypto. Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09tls: improve lockless access safety of tls_err_abort()Jakub Kicinski
[ Upstream commit 8a0d57df8938e9fd2e99d47a85b7f37d86f91097 ] Most protos' poll() methods insert a memory barrier between writes to sk_err and sk_error_report(). This dates back to commit a4d258036ed9 ("tcp: Fix race in tcp_poll"). I guess we should do the same thing in TLS, tcp_poll() does not hold the socket lock. Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05tls: rx: strp: preserve decryption status of skbs when neededJakub Kicinski
[ Upstream commit eca9bfafee3a0487e59c59201ae14c7594ba940a ] When receive buffer is small we try to copy out the data from TCP into a skb maintained by TLS to prevent connection from stalling. Unfortunately if a single record is made up of a mix of decrypted and non-decrypted skbs combining them into a single skb leads to loss of decryption status, resulting in decryption errors or data corruption. Similarly when trying to use TCP receive queue directly we need to make sure that all the skbs within the record have the same status. If we don't the mixed status will be detected correctly but we'll CoW the anchor, again collapsing it into a single paged skb without decrypted status preserved. So the "fixup" code will not know which parts of skb to re-encrypt. Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05tls: rx: strp: factor out copying skb dataJakub Kicinski
[ Upstream commit c1c607b1e5d5477d82ca6a86a05a4f10907b33ee ] We'll need to copy input skbs individually in the next patch. Factor that code out (without assuming we're copying a full record). Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: eca9bfafee3a ("tls: rx: strp: preserve decryption status of skbs when needed") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05tls: rx: strp: force mixed decrypted records into copy modeJakub Kicinski
[ Upstream commit 14c4be92ebb3e36e392aa9dd8f314038a9f96f3c ] If a record is partially decrypted we'll have to CoW it, anyway, so go into copy mode and allocate a writable skb right away. This will make subsequent fix simpler because we won't have to teach tls_strp_msg_make_copy() how to copy skbs while preserving decrypt status. Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: eca9bfafee3a ("tls: rx: strp: preserve decryption status of skbs when needed") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05tls: rx: strp: fix determining record length in copy modeJakub Kicinski
[ Upstream commit 8b0c0dc9fbbd01e58a573a41c38885f9e4c17696 ] We call tls_rx_msg_size(skb) before doing skb->len += chunk. So the tls_rx_msg_size() code will see old skb->len, most likely leading to an over-read. Worst case we will over read an entire record, next iteration will try to trim the skb but may end up turning frag len negative or discarding the subsequent record (since we already told TCP we've read it during previous read but now we'll trim it out of the skb). Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05tls: rx: strp: set the skb->len of detached / CoW'ed skbsJakub Kicinski
[ Upstream commit 210620ae44a83f25220450bbfcc22e6fe986b25f ] alloc_skb_with_frags() fills in page frag sizes but does not set skb->len and skb->data_len. Set those correctly otherwise device offload will most likely generate an empty skb and hit the BUG() at the end of __skb_nsg(). Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05tls: rx: device: fix checking decryption statusJakub Kicinski
[ Upstream commit b3a03b540e3cf62a255213d084d76d71c02793d5 ] skb->len covers the entire skb, including the frag_list. In fact we're guaranteed that rxm->full_len <= skb->len, so since the change under Fixes we were not checking decrypt status of any skb but the first. Note that the skb_pagelen() added here may feel a bit costly, but it's removed by subsequent fixes, anyway. Reported-by: Tariq Toukan <tariqt@nvidia.com> Fixes: 86b259f6f888 ("tls: rx: device: bound the frag walk") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24net: deal with most data-races in sk_wait_event()Eric Dumazet
[ Upstream commit d0ac89f6f9879fae316c155de77b5173b3e2c9c9 ] __condition is evaluated twice in sk_wait_event() macro. First invocation is lockless, and reads can race with writes, as spotted by syzbot. BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1: tcp_disconnect+0x2cd/0xdb0 inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911 __sys_shutdown_sock net/socket.c:2343 [inline] __sys_shutdown net/socket.c:2355 [inline] __do_sys_shutdown net/socket.c:2363 [inline] __se_sys_shutdown+0xf8/0x140 net/socket.c:2361 __x64_sys_shutdown+0x31/0x40 net/socket.c:2361 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0: sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75 tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484 inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] __sys_sendto+0x246/0x300 net/socket.c:2142 __do_sys_sendto net/socket.c:2154 [inline] __se_sys_sendto net/socket.c:2150 [inline] __x64_sys_sendto+0x78/0x90 net/socket.c:2150 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00000000 -> 0x00000068 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-17net: tls: fix device-offloaded sendpage straddling recordsJakub Kicinski
[ Upstream commit e539a105f947b9db470fec39fe91d85fe737a432 ] Adrien reports that incorrect data is transmitted when a single page straddles multiple records. We would transmit the same data in all iterations of the loop. Reported-by: Adrien Moulin <amoulin@corp.free.fr> Link: https://lore.kernel.org/all/61481278.42813558.1677845235112.JavaMail.zimbra@corp.free.fr Fixes: c1318b39c7d3 ("tls: Add opt-in zerocopy mode of sendfile()") Tested-by: Adrien Moulin <amoulin@corp.free.fr> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com> Link: https://lore.kernel.org/r/20230304192610.3818098-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-17net: tls: fix possible race condition between do_tls_getsockopt_conf() and ↵Hangyu Hua
do_tls_setsockopt_conf() [ Upstream commit 49c47cc21b5b7a3d8deb18fc57b0aa2ab1286962 ] ctx->crypto_send.info is not protected by lock_sock in do_tls_getsockopt_conf(). A race condition between do_tls_getsockopt_conf() and error paths of do_tls_setsockopt_conf() may lead to a use-after-free or null-deref. More discussion: https://lore.kernel.org/all/Y/ht6gQL+u6fj3dG@hog/ Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Link: https://lore.kernel.org/r/20230228023344.9623-1-hbh25y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-17tls: rx: fix return value for async cryptoJakub Kicinski
[ Upstream commit 4d42cd6bc2ac1b9be50ade13771daec90c9d18b1 ] Gaurav reports that TLS Rx is broken with async crypto accelerators. The commit under fixes missed updating the retval byte counting logic when updating how records are stored. Even tho both before and after the change 'decrypted' was updated inside the main loop, it was completely overwritten when processing the async completions. Now that the rx_list only holds non-zero-copy records we need to add, not overwrite. Reported-and-bisected-by: Gaurav Jain <gaurav.jain@nxp.com> Fixes: cbbdee9918a2 ("tls: rx: async: don't put async zc on the list") Link: https://bugzilla.kernel.org/show_bug.cgi?id=217064 Tested-by: Gaurav Jain <gaurav.jain@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230227181201.1793772-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11net: tls: avoid hanging tasks on the tx_lockJakub Kicinski
commit f3221361dc85d4de22586ce8441ec2c67b454f5d upstream. syzbot sent a hung task report and Eric explains that adversarial receiver may keep RWIN at 0 for a long time, so we are not guaranteed to make forward progress. Thread which took tx_lock and went to sleep may not release tx_lock for hours. Use interruptible sleep where possible and reschedule the work if it can't take the lock. Testing: existing selftest passes Reported-by: syzbot+9c0268252b8ef967c62e@syzkaller.appspotmail.com Fixes: 79ffe6087e91 ("net/tls: add a TX lock") Link: https://lore.kernel.org/all/000000000000e412e905f5b46201@google.com/ Cc: stable@vger.kernel.org # wait 4 weeks Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230301002857.2101894-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09net/tls: tls_is_tx_ready() checked list_entryPietro Borrello
[ Upstream commit ffe2a22562444720b05bdfeb999c03e810d84cbb ] tls_is_tx_ready() checks that list_first_entry() does not return NULL. This condition can never happen. For empty lists, list_first_entry() returns the list_entry() of the head, which is a type confusion. Use list_first_entry_or_null() which returns NULL in case of empty lists. Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Link: https://lore.kernel.org/r/20230128-list-entry-null-check-tls-v1-1-525bbfe6f0d0@diag.uniroma1.it Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09use less confusing names for iov_iter direction initializersAl Viro
[ Upstream commit de4eda9de2d957ef2d6a8365a01e26a435e958cb ] READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Stable-dep-of: 6dd88fd59da8 ("vhost-scsi: unbreak any layout for response") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytesPengcheng Yang
[ Upstream commit a351d6087bf7d3d8440d58d3bf244ec64b89394a ] When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS flag from the msg->flags. If apply_bytes is used and it is larger than the current data being processed, sk_psock_msg_verdict() will not be called when sendmsg() is called again. At this time, the msg->flags is 0, and we lost the BPF_F_INGRESS flag. So we need to save the BPF_F_INGRESS flag in sk_psock and use it when redirection. Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-11net/tls: Fix memory leak in tls_enc_skb() and tls_sw_fallback_init()Yu Liao
'aead_req' and 'aead_send' is allocated but not freed in default switch case. This commit fixes the potential memory leak by freeing them under the situation. Note that the default cases here should never be reached as they'd mean we allowed offloading an unsupported algorithm. Fixes: ea7a9d88ba21 ("net/tls: Use cipher sizes structs") Signed-off-by: Yu Liao <liaoyu15@huawei.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20221110090329.2036382-1-liaoyu15@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-14tls: strp: make sure the TCP skbs do not have overlapping dataJakub Kicinski
TLS tries to get away with using the TCP input queue directly. This does not work if there is duplicated data (multiple skbs holding bytes for the same seq number range due to retransmits). Check for this condition and fall back to copy mode, it should be rare. Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-27net: tls: Add ARIA-GCM algorithmTaehee Yoo
RFC 6209 describes ARIA for TLS 1.2. ARIA-128-GCM and ARIA-256-GCM are defined in RFC 6209. This patch would offer performance increment and an opportunity for hardware offload. Benchmark results: iperf-ssl are used. CPU: intel i3-12100. TLS(openssl-3.0-dev) [ 3] 0.0- 1.0 sec 185 MBytes 1.55 Gbits/sec [ 3] 1.0- 2.0 sec 186 MBytes 1.56 Gbits/sec [ 3] 2.0- 3.0 sec 186 MBytes 1.56 Gbits/sec [ 3] 3.0- 4.0 sec 186 MBytes 1.56 Gbits/sec [ 3] 4.0- 5.0 sec 186 MBytes 1.56 Gbits/sec [ 3] 0.0- 5.0 sec 927 MBytes 1.56 Gbits/sec kTLS(aria-generic) [ 3] 0.0- 1.0 sec 198 MBytes 1.66 Gbits/sec [ 3] 1.0- 2.0 sec 194 MBytes 1.62 Gbits/sec [ 3] 2.0- 3.0 sec 194 MBytes 1.63 Gbits/sec [ 3] 3.0- 4.0 sec 194 MBytes 1.63 Gbits/sec [ 3] 4.0- 5.0 sec 194 MBytes 1.62 Gbits/sec [ 3] 0.0- 5.0 sec 974 MBytes 1.63 Gbits/sec kTLS(aria-avx wirh GFNI) [ 3] 0.0- 1.0 sec 632 MBytes 5.30 Gbits/sec [ 3] 1.0- 2.0 sec 657 MBytes 5.51 Gbits/sec [ 3] 2.0- 3.0 sec 657 MBytes 5.51 Gbits/sec [ 3] 3.0- 4.0 sec 656 MBytes 5.50 Gbits/sec [ 3] 4.0- 5.0 sec 656 MBytes 5.50 Gbits/sec [ 3] 0.0- 5.0 sec 3.18 GBytes 5.47 Gbits/sec Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/20220925150033.24615-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22net/tls: Support 256 bit keys with TX device offloadGal Pressman
Add the missing clause for 256 bit keys in tls_set_device_offload(), and the needed adjustments in tls_device_fallback.c. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22net/tls: Use cipher sizes structsGal Pressman
Use the newly introduced cipher sizes structs instead of the repeated switch cases churn. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22net/tls: Describe ciphers sizes by const structsTariq Toukan
Introduce cipher sizes descriptor. It helps reducing the amount of code duplications and repeated switch/cases that assigns the proper sizes according to the cipher type. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-17tls: rx: react to strparser initialization errorsJakub Kicinski
Even though the normal strparser's init function has a return value we got away with ignoring errors until now, as it only validates the parameters and we were passing correct parameters. tls_strp can fail to init on memory allocation errors, which syzbot duly induced and reported. Reported-by: syzbot+abd45eb849b05194b1b6@syzkaller.appspotmail.com Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-11Merge tag 'net-6.0-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth, bpf, can and netfilter. A little larger than usual but it's all fixes, no late features. It's large partially because of timing, and partially because of follow ups to stuff that got merged a week or so before the merge window and wasn't as widely tested. Maybe the Bluetooth fixes are a little alarming so we'll address that, but the rest seems okay and not scary. Notably we're including a fix for the netfilter Kconfig [1], your WiFi warning [2] and a bluetooth fix which should unblock syzbot [3]. Current release - regressions: - Bluetooth: - don't try to cancel uninitialized works [3] - L2CAP: fix use-after-free caused by l2cap_chan_put - tls: rx: fix device offload after recent rework - devlink: fix UAF on failed reload and leftover locks in mlxsw Current release - new code bugs: - netfilter: - flowtable: fix incorrect Kconfig dependencies [1] - nf_tables: fix crash when nf_trace is enabled - bpf: - use proper target btf when exporting attach_btf_obj_id - arm64: fixes for bpf trampoline support - Bluetooth: - ISO: unlock on error path in iso_sock_setsockopt() - ISO: fix info leak in iso_sock_getsockopt() - ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP - ISO: fix memory corruption on iso_pinfo.base - ISO: fix not using the correct QoS - hci_conn: fix updating ISO QoS PHY - phy: dp83867: fix get nvmem cell fail Previous releases - regressions: - wifi: cfg80211: fix validating BSS pointers in __cfg80211_connect_result [2] - atm: bring back zatm uAPI after ATM had been removed - properly fix old bug making bonding ARP monitor mode not being able to work with software devices with lockless Tx - tap: fix null-deref on skb->dev in dev_parse_header_protocol - revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps some devices and breaks others - netfilter: - nf_tables: many fixes rejecting cross-object linking which may lead to UAFs - nf_tables: fix null deref due to zeroed list head - nf_tables: validate variable length element extension - bgmac: fix a BUG triggered by wrong bytes_compl - bcmgenet: indicate MAC is in charge of PHY PM Previous releases - always broken: - bpf: - fix bad pointer deref in bpf_sys_bpf() injected via test infra - disallow non-builtin bpf programs calling the prog_run command - don't reinit map value in prealloc_lru_pop - fix UAFs during the read of map iterator fd - fix invalidity check for values in sk local storage map - reject sleepable program for non-resched map iterator - mptcp: - move subflow cleanup in mptcp_destroy_common() - do not queue data on closed subflows - virtio_net: fix memory leak inside XDP_TX with mergeable - vsock: fix memory leak when multiple threads try to connect() - rework sk_user_data sharing to prevent psock leaks - geneve: fix TOS inheriting for ipv4 - tunnels & drivers: do not use RT_TOS for IPv6 flowlabel - phy: c45 baset1: do not skip aneg configuration if clock role is not specified - rose: avoid overflow when /proc displays timer information - x25: fix call timeouts in blocking connects - can: mcp251x: fix race condition on receive interrupt - can: j1939: - replace user-reachable WARN_ON_ONCE() with netdev_warn_once() - fix memory leak of skbs in j1939_session_destroy() Misc: - docs: bpf: clarify that many things are not uAPI - seg6: initialize induction variable to first valid array index (to silence clang vs objtool warning) - can: ems_usb: fix clang 14's -Wunaligned-access warning" * tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (117 commits) net: atm: bring back zatm uAPI dpaa2-eth: trace the allocated address instead of page struct net: add missing kdoc for struct genl_multicast_group::flags nfp: fix use-after-free in area_cache_get() MAINTAINERS: use my korg address for mt7601u mlxsw: minimal: Fix deadlock in ports creation bonding: fix reference count leak in balance-alb mode net: usb: qmi_wwan: Add support for Cinterion MV32 bpf: Shut up kern_sys_bpf warning. net/tls: Use RCU API to access tls_ctx->netdev tls: rx: device: don't try to copy too much on detach tls: rx: device: bound the frag walk net_sched: cls_route: remove from list when handle is 0 selftests: forwarding: Fix failing tests with old libnet net: refactor bpf_sk_reuseport_detach() net: fix refcount bug in sk_psock_get (2) selftests/bpf: Ensure sleepable program is rejected by hash map iter selftests/bpf: Add write tests for sk local storage map iterator selftests/bpf: Add tests for reading a dangling map iter fd bpf: Only allow sleepable program for resched-able iterator ...
2022-08-10net/tls: Use RCU API to access tls_ctx->netdevMaxim Mikityanskiy
Currently, tls_device_down synchronizes with tls_device_resync_rx using RCU, however, the pointer to netdev is stored using WRITE_ONCE and loaded using READ_ONCE. Although such approach is technically correct (rcu_dereference is essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store NULL), using special RCU helpers for pointers is more valid, as it includes additional checks and might change the implementation transparently to the callers. Mark the netdev pointer as __rcu and use the correct RCU helpers to access it. For non-concurrent access pass the right conditions that guarantee safe access (locks taken, refcount value). Also use the correct helper in mlx5e, where even READ_ONCE was missing. The transition to RCU exposes existing issues, fixed by this commit: 1. bond_tls_device_xmit could read netdev twice, and it could become NULL the second time, after the NULL check passed. 2. Drivers shouldn't stop processing the last packet if tls_device_down just set netdev to NULL, before tls_dev_del was called. This prevents a possible packet drop when transitioning to the fallback software mode. Fixes: 89df6a810470 ("net/bonding: Implement TLS TX device offload") Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-10tls: rx: device: don't try to copy too much on detachJakub Kicinski
Another device offload bug, we use the length of the output skb as an indication of how much data to copy. But that skb is sized to offset + record length, and we start from offset. So we end up double-counting the offset which leads to skb_copy_bits() returning -EFAULT. Reported-by: Tariq Toukan <tariqt@nvidia.com> Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Ran Rozenstein <ranro@nvidia.com> Link: https://lore.kernel.org/r/20220809175544.354343-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-10tls: rx: device: bound the frag walkJakub Kicinski
We can't do skb_walk_frags() on the input skbs, because the input skbs is really just a pointer to the tcp read queue. We need to bound the "is decrypted" check by the amount of data in the message. Note that the walk in tls_device_reencrypt() is after a CoW so the skb there is safe to walk. Actually in the current implementation it can't have frags at all, but whatever, maybe one day it will. Reported-by: Tariq Toukan <tariqt@nvidia.com> Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Ran Rozenstein <ranro@nvidia.com> Link: https://lore.kernel.org/r/20220809175544.354343-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-08iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()Al Viro
Most of the users immediately follow successful iov_iter_get_pages() with advancing by the amount it had returned. Provide inline wrappers doing that, convert trivial open-coded uses of those. BTW, iov_iter_get_pages() never returns more than it had been asked to; such checks in cifs ought to be removed someday... Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-01net/tls: Remove redundant workqueue flush before destroyTariq Toukan
destroy_workqueue() safely destroys the workqueue after draining it. No need for the explicit call to flush_workqueue(). Remove it. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220801112444.26175-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28net/tls: Multi-threaded calls to TX tls_dev_delTariq Toukan
Multiple TLS device-offloaded contexts can be added in parallel via concurrent calls to .tls_dev_add, while calls to .tls_dev_del are sequential in tls_device_gc_task. This is not a sustainable behavior. This creates a rate gap between add and del operations (addition rate outperforms the deletion rate). When running for enough time, the TLS device resources could get exhausted, failing to offload new connections. Replace the single-threaded garbage collector work with a per-context alternative, so they can be handled on several cores in parallel. Use a new dedicated destruct workqueue for this. Tested with mlx5 device: Before: 22141 add/sec, 103 del/sec After: 11684 add/sec, 11684 del/sec Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28net/tls: Perform immediate device ctx cleanup when possibleTariq Toukan
TLS context destructor can be run in atomic context. Cleanup operations for device-offloaded contexts could require access and interaction with the device callbacks, which might sleep. Hence, the cleanup of such contexts must be deferred and completed inside an async work. For all others, this is not necessary, as cleanup is atomic. Invoke cleanup immediately for them, avoiding queueing redundant gc work. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28tls: rx: Fix unsigned comparison with less than zeroYang Li
The return from the call to tls_rx_msg_size() is int, it can be a negative error code, however this is being assigned to an unsigned long variable 'sz', so making 'sz' an int. Eliminate the following coccicheck warning: ./net/tls/tls_strp.c:211:6-8: WARNING: Unsigned expression compared with zero: sz < 0 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220728031019.32838-1-yang.lee@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28tls: rx: fix the false positive warningJakub Kicinski
I went too far in the accessor conversion, we can't use tls_strp_msg() after decryption because the message may not be ready. What we care about on this path is that the output skb is detached, i.e. we didn't somehow just turn around and used the input skb with its TCP data still attached. So look at the anchor directly. Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28tls: strp: rename and multithread the workqueueJakub Kicinski
Paolo points out that there seems to be no strong reason strparser users a single threaded workqueue. Perhaps there were some performance or pinning considerations? Since we don't know (and it's the slow path) let's default to the most natural, multi-threaded choice. Also rename the workqueue to "tls-". Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28tls: rx: don't consider sock_rcvtimeo() cumulativeJakub Kicinski
Eric indicates that restarting rcvtimeo on every wait may be fine. I thought that we should consider it cumulative, and made tls_rx_reader_lock() return the remaining timeo after acquiring the reader lock. tls_rx_rec_wait() gets its timeout passed in by value so it does not keep track of time previously spent. Make the lock waiting consistent with tls_rx_rec_wait() - don't keep track of time spent. Read the timeo fresh in tls_rx_rec_wait(). It's unclear to me why callers are supposed to cache the value. Link: https://lore.kernel.org/all/CANn89iKcmSfWgvZjzNGbsrndmCch2HC_EPZ7qmGboDNaWoviNQ@mail.gmail.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26tls: rx: do not use the standard strparserJakub Kicinski
TLS is a relatively poor fit for strparser. We pause the input every time a message is received, wait for a read which will decrypt the message, start the parser, repeat. strparser is built to delineate the messages, wrap them in individual skbs and let them float off into the stack or a different socket. TLS wants the data pages and nothing else. There's no need for TLS to keep cloning (and occasionally skb_unclone()'ing) the TCP rx queue. This patch uses a pre-allocated skb and attaches the skbs from the TCP rx queue to it as frags. TLS is careful never to modify the input skb without CoW'ing / detaching it first. Since we call TCP rx queue cleanup directly we also get back the benefit of skb deferred free. Overall this results in a 6% gain in my benchmarks. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26tls: rx: device: add input CoW helperJakub Kicinski
Wrap the remaining skb_cow_data() into a helper, so it's easier to replace down the lane. The new version will change the skb so make sure relevant pointers get reloaded after the call. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26tls: rx: device: keep the zero copy status with offloadJakub Kicinski
The non-zero-copy path assumes a full skb with decrypted contents. This means the device offload would have to CoW the data. Try to keep the zero-copy status instead, copy the data to user space. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26tls: rx: don't free the output in case of zero-copyJakub Kicinski
In the future we'll want to reuse the input skb in case of zero-copy so we shouldn't always free darg.skb. Move the freeing of darg.skb into the non-zc cases. All cases will now free ctx->recv_pkt (inside let tls_rx_rec_done()). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26tls: rx: factor SW handling out of tls_rx_one_record()Jakub Kicinski
After recent changes the SW side of tls_rx_one_record() can be nicely encapsulated in its own function. Move the pad handling as well. This will be useful for ->zc handling in tls_decrypt_device(). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26tls: rx: wrap recv_pkt accesses in helpersJakub Kicinski
To allow for the logic to change later wrap accesses which interrogate the input skb in helper functions. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-24net/tls: Remove the context from the list in tls_device_downMaxim Mikityanskiy
tls_device_down takes a reference on all contexts it's going to move to the degraded state (software fallback). If sk_destruct runs afterwards, it can reduce the reference counter back to 1 and return early without destroying the context. Then tls_device_down will release the reference it took and call tls_device_free_ctx. However, the context will still stay in tls_device_down_list forever. The list will contain an item, memory for which is released, making a memory corruption possible. Fix the above bug by properly removing the context from all lists before any call to tls_device_free_ctx. Fixes: 3740651bf7e2 ("tls: Fix context leak on tls_device_down") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-21tls: rx: release the sock lock on locking timeoutJakub Kicinski
Eric reports we should release the socket lock if the entire "grab reader lock" operation has failed. The callers assume they don't have to release it or otherwise unwind. Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+16e72110feb2b653ef27@syzkaller.appspotmail.com Fixes: 4cbc325ed6b4 ("tls: rx: allow only one reader at a time") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220720203701.2179034-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18net/tls: Fix race in TLS device down flowTariq Toukan
Socket destruction flow and tls_device_down function sync against each other using tls_device_lock and the context refcount, to guarantee the device resources are freed via tls_dev_del() by the end of tls_device_down. In the following unfortunate flow, this won't happen: - refcount is decreased to zero in tls_device_sk_destruct. - tls_device_down starts, skips the context as refcount is zero, going all the way until it flushes the gc work, and returns without freeing the device resources. - only then, tls_device_queue_ctx_destruction is called, queues the gc work and frees the context's device resources. Solve it by decreasing the refcount in the socket's destruction flow under the tls_device_lock, for perfect synchronization. This does not slow down the common likely destructor flow, in which both the refcount is decreased and the spinlock is acquired, anyway. Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18tls: rx: decrypt into a fresh skbJakub Kicinski
We currently CoW Rx skbs whenever we can't decrypt to a user space buffer. The skbs can be enormous (64kB) and CoW does a linear alloc which has a strong chance of failing under memory pressure. Or even without, skb_cow_data() assumes GFP_ATOMIC. Allocate a new frag'd skb and decrypt into it. We finally take advantage of the decrypted skb getting returned via darg. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18tls: rx: async: don't put async zc on the listJakub Kicinski
The "zero-copy" path in SW TLS will engage either for no skbs or for all but last. If the recvmsg parameters are right and the socket can do ZC we'll ZC until the iterator can't fit a full record at which point we'll decrypt one more record and copy over the necessary bits to fill up the request. The only reason we hold onto the ZC skbs which went thru the async path until the end of recvmsg() is to count bytes. We need an accurate count of zc'ed bytes so that we can calculate how much of the non-zc'd data to copy. To allow freeing input skbs on the ZC path count only how much of the list we'll need to consume. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-18tls: rx: async: hold onto the input skbJakub Kicinski
Async crypto currently benefits from the fact that we decrypt in place. When we allow input and output to be different skbs we will have to hang onto the input while we move to the next record. Clone the inputs and keep them on a list. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>