summary refs log tree commit diff
AgeCommit message (Collapse)Author
2019-01-30tun: move the call to tun_set_real_num_queuesGeorge Amanakis
Call tun_set_real_num_queues() after the increment of tun->numqueues since the former depends on it. Otherwise, the number of queues is not correctly accounted for, which results to warnings similar to: "vnet0 selects TX queue 11, but real number of TX queues is 11". Fixes: 0b7959b62573 ("tun: publish tfile after it's fully initialized") Reported-and-tested-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulationYohei Kanemaru
skb->cb may contain data from previous layers (in an observed case IPv4 with L3 Master Device). In the observed scenario, the data in IPCB(skb)->frags was misinterpreted as IP6CB(skb)->frag_max_size, eventually caused an unexpected IPv6 fragmentation in ip6_fragment() through ip6_finish_output(). This patch clears IP6CB(skb), which potentially contains garbage data, on the SRH ip4ip6 encapsulation. Fixes: 32d99d0b6702 ("ipv6: sr: add support for ip4ip6 encapsulation") Signed-off-by: Yohei Kanemaru <yohei.kanemaru@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Merge branch 'virtio_net-Fix-problems-around-XDP-tx-and-napi_tx'David S. Miller
Toshiaki Makita says: ==================== virtio_net: Fix problems around XDP tx and napi_tx While I'm looking into how to account standard tx counters on XDP tx processing, I found several bugs around XDP tx and napi_tx. Patch1: Fix oops on error path. Patch2 depends on this. Patch2: Fix memory corruption on freeing xdp_frames with napi_tx enabled. Patch3: Minor fix patch5 depends on. Patch4: Fix memory corruption on processing xdp_frames when XDP is disabled. Also patch5 depends on this. Patch5: Fix memory corruption on processing xdp_frames while XDP is being disabled. Patch6: Minor fix patch7 depends on. Patch7: Fix memory corruption on freeing sk_buff or xdp_frames when a normal queue is reused for XDP and vise versa. v2: - patch5: Make rcu_assign_pointer/synchronize_net conditional instead of _virtnet_set_queues. - patch7: Use napi_consume_skb() instead of dev_consume_skb_any() ==================== Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Differentiate sk_buff and xdp_frame on freeingToshiaki Makita
We do not reset or free up unused buffers when enabling/disabling XDP, so it can happen that xdp_frames are freed after disabling XDP or sk_buffs are freed after enabling XDP on xdp tx queues. Thus we need to handle both forms (xdp_frames and sk_buffs) regardless of XDP setting. One way to trigger this problem is to disable XDP when napi_tx is enabled. In that case, virtnet_xdp_set() calls virtnet_napi_enable() which kicks NAPI. The NAPI handler will call virtnet_poll_cleantx() which invokes free_old_xmit_skbs() for queues which have been used by XDP. Note that even with this change we need to keep skipping free_old_xmit_skbs() from NAPI handlers when XDP is enabled, because XDP tx queues do not aquire queue locks. - v2: Use napi_consume_skb() instead of dev_consume_skb_any() Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqsToshiaki Makita
put_page() can work as a fallback for freeing xdp_frames, but the appropriate way is to use xdp_return_frame(). Fixes: cac320c850ef ("virtio_net: convert to use generic xdp_frame and xdp_return_frame API") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Don't process redirected XDP frames when XDP is disabledToshiaki Makita
Commit 8dcc5b0ab0ec ("virtio_net: fix ndo_xdp_xmit crash towards dev not ready for XDP") tried to avoid access to unexpected sq while XDP is disabled, but was not complete. There was a small window which causes out of bounds sq access in virtnet_xdp_xmit() while disabling XDP. An example case of - curr_queue_pairs = 6 (2 for SKB and 4 for XDP) - online_cpu_num = xdp_queue_paris = 4 when XDP is enabled: CPU 0 CPU 1 (Disabling XDP) (Processing redirected XDP frames) virtnet_xdp_xmit() virtnet_xdp_set() _virtnet_set_queues() set curr_queue_pairs (2) check if rq->xdp_prog is not NULL virtnet_xdp_sq(vi) qp = curr_queue_pairs - xdp_queue_pairs + smp_processor_id() = 2 - 4 + 1 = -1 sq = &vi->sq[qp] // out of bounds access set xdp_queue_pairs (0) rq->xdp_prog = NULL Basically we should not change curr_queue_pairs and xdp_queue_pairs while someone can read the values. Thus, when disabling XDP, assign NULL to rq->xdp_prog first, and wait for RCU grace period, then change xxx_queue_pairs. Note that we need to keep the current order when enabling XDP though. - v2: Make rcu_assign_pointer/synchronize_net conditional instead of _virtnet_set_queues. Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Fix out of bounds access of sqToshiaki Makita
When XDP is disabled, curr_queue_pairs + smp_processor_id() can be larger than max_queue_pairs. There is no guarantee that we have enough XDP send queues dedicated for each cpu when XDP is disabled, so do not count drops on sq in that case. Fixes: 5b8f3c8d30a6 ("virtio_net: Add XDP related stats") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Fix not restoring real_num_rx_queuesToshiaki Makita
When _virtnet_set_queues() failed we did not restore real_num_rx_queues. Fix this by placing the change of real_num_rx_queues after _virtnet_set_queues(). This order is also in line with virtnet_set_channels(). Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Don't call free_old_xmit_skbs for xdp_framesToshiaki Makita
When napi_tx is enabled, virtnet_poll_cleantx() called free_old_xmit_skbs() even for xdp send queue. This is bogus since the queue has xdp_frames, not sk_buffs, thus mangled device tx bytes counters because skb->len is meaningless value, and even triggered oops due to general protection fault on freeing them. Since xdp send queues do not aquire locks, old xdp_frames should be freed only in virtnet_xdp_xmit(), so just skip free_old_xmit_skbs() for xdp send queues. Similarly virtnet_poll_tx() called free_old_xmit_skbs(). This NAPI handler is called even without calling start_xmit() because cb for tx is by default enabled. Once the handler is called, it enabled the cb again, and then the handler would be called again. We don't need this handler for XDP, so don't enable cb as well as not calling free_old_xmit_skbs(). Also, we need to disable tx NAPI when disabling XDP, so virtnet_poll_tx() can safely access curr_queue_pairs and xdp_queue_pairs, which are not atomically updated while disabling XDP. Fixes: b92f1e6751a6 ("virtio-net: transmit napi") Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Don't enable NAPI when interface is downToshiaki Makita
Commit 4e09ff536284 ("virtio-net: disable NAPI only when enabled during XDP set") tried to fix inappropriate NAPI enabling/disabling when !netif_running(), but was not complete. On error path virtio_net could enable NAPI even when !netif_running(). This can cause enabling NAPI twice on virtnet_open(), which would trigger BUG_ON() in napi_enable(). Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Merge branch 'erspan-always-reports-output-key-to-userspace'David S. Miller
Lorenzo Bianconi says: ==================== erspan: always reports output key to userspace Erspan protocol relies on output key to set session id header field. However TUNNEL_KEY bit is cleared in order to not add key field to the external GRE header and so the configured o_key is not reported to userspace. Fix the issue adding TUNNEL_KEY bit to the o_flags parameter dumping device info ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: ip6_gre: always reports o_key to userspaceLorenzo Bianconi
As Erspan_v4, Erspan_v6 protocol relies on o_key to configure session id header field. However TUNNEL_KEY bit is cleared in ip6erspan_tunnel_xmit since ERSPAN protocol does not set the key field of the external GRE header and so the configured o_key is not reported to userspace. The issue can be triggered with the following reproducer: $ip link add ip6erspan1 type ip6erspan local 2000::1 remote 2000::2 \ key 1 seq erspan_ver 1 $ip link set ip6erspan1 up ip -d link sh ip6erspan1 ip6erspan1@NONE: <BROADCAST,MULTICAST> mtu 1422 qdisc noop state DOWN mode DEFAULT link/ether ba:ff:09:24:c3:0e brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500 ip6erspan remote 2000::2 local 2000::1 encaplimit 4 flowlabel 0x00000 ikey 0.0.0.1 iseq oseq Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in ip6gre_fill_info Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: ip_gre: always reports o_key to userspaceLorenzo Bianconi
Erspan protocol (version 1 and 2) relies on o_key to configure session id header field. However TUNNEL_KEY bit is cleared in erspan_xmit since ERSPAN protocol does not set the key field of the external GRE header and so the configured o_key is not reported to userspace. The issue can be triggered with the following reproducer: $ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \ key 1 seq erspan_ver 1 $ip link set erspan1 up $ip -d link sh erspan1 erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500 erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0 Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in ipgre_fill_info Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30ucc_geth: Reset BQL queue when stopping deviceMathias Thore
After a timeout event caused by for example a broadcast storm, when the MAC and PHY are reset, the BQL TX queue needs to be reset as well. Otherwise, the device will exhibit severe performance issues even after the storm has ended. Co-authored-by: David Gounaris <david.gounaris@infinera.com> Signed-off-by: Mathias Thore <mathias.thore@infinera.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Merge branch 'net-various-compat-ioctl-fixes'David S. Miller
Johannes Berg says: ==================== various compat ioctl fixes Back a long time ago, I already fixed a few of these by passing the size of the struct ifreq to do_sock_ioctl(). However, Robert found more cases, and now it won't be as simple because we'd have to pass that down all the way to e.g. bond_do_ioctl() which isn't really feasible. Therefore, restore the old code. While looking at why SIOCGIFNAME was broken, I realized that Al had removed that case - which had been handled in an explicit separate function - as well, and looking through his work at the time I saw that bond ioctls were also affected by the erroneous removal. I've restored SIOCGIFNAME and bond ioctls by going through the (now renamed) dev_ifsioc() instead of reintroducing their own helper functions, which I hope is correct but have only tested with SIOCGIFNAME. ==================== Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: socket: make bond ioctls go through compat_ifreq_ioctl()Johannes Berg
Same story as before, these use struct ifreq and thus need to be read with the shorter version to not cause faults. Cc: stable@vger.kernel.org Fixes: f92d4fc95341 ("kill bond_ioctl()") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: socket: fix SIOCGIFNAME in compatJohannes Berg
As reported by Robert O'Callahan in https://bugzilla.kernel.org/show_bug.cgi?id=202273 reverting the previous changes in this area broke the SIOCGIFNAME ioctl in compat again (I'd previously fixed it after his previous report of breakage in https://bugzilla.kernel.org/show_bug.cgi?id=199469). This is obviously because I fixed SIOCGIFNAME more or less by accident. Fix it explicitly now by making it pass through the restored compat translation code. Cc: stable@vger.kernel.org Fixes: 4cf808e7ac32 ("kill dev_ifname32()") Reported-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Revert "kill dev_ifsioc()"Johannes Berg
This reverts commit bf4405737f9f ("kill dev_ifsioc()"). This wasn't really unused as implied by the original commit, it still handles the copy to/from user differently, and the commit thus caused issues such as https://bugzilla.kernel.org/show_bug.cgi?id=199469 and https://bugzilla.kernel.org/show_bug.cgi?id=202273 However, deviating from a strict revert, rename dev_ifsioc() to compat_ifreq_ioctl() to be clearer as to its purpose and add a comment. Cc: stable@vger.kernel.org Fixes: bf4405737f9f ("kill dev_ifsioc()") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Revert "socket: fix struct ifreq size in compat ioctl"Johannes Berg
This reverts commit 1cebf8f143c2 ("socket: fix struct ifreq size in compat ioctl"), it's a bugfix for another commit that I'll revert next. This is not a 'perfect' revert, I'm keeping some coding style intact rather than revert to the state with indentation errors. Cc: stable@vger.kernel.org Fixes: 1cebf8f143c2 ("socket: fix struct ifreq size in compat ioctl") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Need to save away the IV across tls async operations, from Dave Watson. 2) Upon successful packet processing, we should liberate the SKB with dev_consume_skb{_irq}(). From Yang Wei. 3) Only apply RX hang workaround on effected macb chips, from Harini Katakam. 4) Dummy netdev need a proper namespace assigned to them, from Josh Elsasser. 5) Some paths of nft_compat run lockless now, and thus we need to use a proper refcnt_t. From Florian Westphal. 6) Avoid deadlock in mlx5 by doing IRQ locking, from Moni Shoua. 7) netrom does not refcount sockets properly wrt. timers, fix that by using the sock timer API. From Cong Wang. 8) Fix locking of inexact inserts of xfrm policies, from Florian Westphal. 9) Missing xfrm hash generation bump, also from Florian. 10) Missing of_node_put() in hns driver, from Yonglong Liu. 11) Fix DN_IFREQ_SIZE, from Johannes Berg. 12) ip6mr notifier is invoked during traversal of wrong table, from Nir Dotan. 13) TX promisc settings not performed correctly in qed, from Manish Chopra. 14) Fix OOB access in vhost, from Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits) MAINTAINERS: Add entry for XDP (eXpress Data Path) net: set default network namespace in init_dummy_netdev() net: b44: replace dev_kfree_skb_xxx by dev_consume_skb_xxx for drop profiles net: caif: call dev_consume_skb_any when skb xmit done net: 8139cp: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles net: macb: Apply RXUBR workaround only to versions with errata net: ti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles net: apple: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles net: amd8111e: replace dev_kfree_skb_irq by dev_consume_skb_irq net: alteon: replace dev_kfree_skb_irq by dev_consume_skb_irq net: tls: Fix deadlock in free_resources tx net: tls: Save iv in tls_rec for async crypto requests vhost: fix OOB in get_rx_bufs() qed: Fix stack out of bounds bug qed: Fix system crash in ll2 xmit qed: Fix VF probe failure while FLR qed: Fix LACP pdu drops for VFs qed: Fix bug in tx promiscuous mode settings net: i825xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles netfilter: ipt_CLUSTERIP: fix warning unused variable cn ...
2019-01-29MAINTAINERS: Add entry for XDP (eXpress Data Path)Jesper Dangaard Brouer
Add multiple people as maintainers for XDP, sorted alphabetically. XDP is also tied to driver level support and code, but we cannot add all drivers to the list. Instead K: and N: match on 'xdp' in hope to catch some of those changes in drivers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: set default network namespace in init_dummy_netdev()Josh Elsasser
Assign a default net namespace to netdevs created by init_dummy_netdev(). Fixes a NULL pointer dereference caused by busy-polling a socket bound to an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat if napi_poll() received packets: BUG: unable to handle kernel NULL pointer dereference at 0000000000000190 IP: napi_busy_loop+0xd6/0x200 Call Trace: sock_poll+0x5e/0x80 do_sys_poll+0x324/0x5a0 SyS_poll+0x6c/0xf0 do_syscall_64+0x6b/0x1f0 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 7db6b048da3b ("net: Commonize busy polling code to focus on napi_id instead of socket") Signed-off-by: Josh Elsasser <jelsasser@appneta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: b44: replace dev_kfree_skb_xxx by dev_consume_skb_xxx for drop profilesYang Wei
The skb should be freed by dev_consume_skb_any() in b44_start_xmit() when bounce_skb is used. The skb is be replaced by bounce_skb, so the original skb should be consumed(not drop). dev_consume_skb_irq() should be called in b44_tx() when skb xmit done. It makes drop profiles(dropwatch, perf) more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: caif: call dev_consume_skb_any when skb xmit doneYang Wei
The skb shouled be consumed when xmit done, it makes drop profiles (dropwatch, perf) more friendly. dev_kfree_skb_irq()/kfree_skb() shouled be replaced by dev_consume_skb_any(), it makes code cleaner. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: 8139cp: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in cp_tx() when skb xmit done. It makes drop profiles(dropwatch, perf) more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: macb: Apply RXUBR workaround only to versions with errataHarini Katakam
The interrupt handler contains a workaround for RX hang applicable to Zynq and AT91RM9200 only. Subsequent versions do not need this workaround. This workaround unnecessarily resets RX whenever RX used bit read is observed, which can be often under heavy traffic. There is no other action performed on RX UBR interrupt. Hence introduce a CAPS mask; enable this interrupt and workaround only on affected versions. Signed-off-by: Harini Katakam <harini.katakam@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: ti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in cpmac_end_xmit() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: apple: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in bmac_txdma_intr() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: amd8111e: replace dev_kfree_skb_irq by dev_consume_skb_irqYang Wei
dev_consume_skb_irq() should be called in amd8111e_tx() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: alteon: replace dev_kfree_skb_irq by dev_consume_skb_irqYang Wei
dev_consume_skb_irq() should be called in ace_tx_int() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: tls: Fix deadlock in free_resources txDave Watson
If there are outstanding async tx requests (when crypto returns EINPROGRESS), there is a potential deadlock: the tx work acquires the lock, while we cancel_delayed_work_sync() while holding the lock. Drop the lock while waiting for the work to complete. Fixes: a42055e8d2c30 ("Add support for async encryption of records...") Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: tls: Save iv in tls_rec for async crypto requestsDave Watson
aead_request_set_crypt takes an iv pointer, and we change the iv soon after setting it. Some async crypto algorithms don't save the iv, so we need to save it in the tls_rec for async requests. Found by hardcoding x64 aesni to use async crypto manager (to test the async codepath), however I don't think this combination can happen in the wild. Presumably other hardware offloads will need this fix, but there have been no user reports. Fixes: a42055e8d2c30 ("Add support for async encryption of records...") Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28vhost: fix OOB in get_rx_bufs()Jason Wang
After batched used ring updating was introduced in commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"). We tend to batch heads in vq->heads for more than one packet. But the quota passed to get_rx_bufs() was not correctly limited, which can result a OOB write in vq->heads. headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx, vhost_len, &in, vq_log, &log, likely(mergeable) ? UIO_MAXIOV : 1); UIO_MAXIOV was still used which is wrong since we could have batched used in vq->heads, this will cause OOB if the next buffer needs more than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH)) heads after we've batched 64 (VHOST_NET_BATCH) heads: Acked-by: Stefan Hajnoczi <stefanha@redhat.com> ============================================================================= BUG kmalloc-8k (Tainted: G B ): Redzone overwritten ----------------------------------------------------------------------------- INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674 kmem_cache_alloc_trace+0xbb/0x140 alloc_pd+0x22/0x60 gen8_ppgtt_create+0x11d/0x5f0 i915_ppgtt_create+0x16/0x80 i915_gem_create_context+0x248/0x390 i915_gem_context_create_ioctl+0x4b/0xe0 drm_ioctl_kernel+0xa5/0xf0 drm_ioctl+0x2ed/0x3a0 do_vfs_ioctl+0x9f/0x620 ksys_ioctl+0x6b/0x80 __x64_sys_ioctl+0x11/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x (null) flags=0x200000000010201 INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b Fixing this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for vhost-net. This is done through set the limitation through vhost_dev_init(), then set_owner can allocate the number of iov in a per device manner. This fixes CVE-2018-16880. Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28Merge branch 'qed-Bug-fixes'David S. Miller
Manish Chopra says: ==================== qed: Bug fixes This series have SR-IOV and some general fixes. Please consider applying it to "net" ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix stack out of bounds bugManish Chopra
KASAN reported following bug in qed_init_qm_get_idx_from_flags due to inappropriate casting of "pq_flags". Fix the type of "pq_flags". [ 196.624707] BUG: KASAN: stack-out-of-bounds in qed_init_qm_get_idx_from_flags+0x1a4/0x1b8 [qed] [ 196.624712] Read of size 8 at addr ffff809b00bc7360 by task kworker/0:9/1712 [ 196.624714] [ 196.624720] CPU: 0 PID: 1712 Comm: kworker/0:9 Not tainted 4.18.0-60.el8.aarch64+debug #1 [ 196.624723] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS 0ACKL024 09/26/2018 [ 196.624733] Workqueue: events work_for_cpu_fn [ 196.624738] Call trace: [ 196.624742] dump_backtrace+0x0/0x2f8 [ 196.624745] show_stack+0x24/0x30 [ 196.624749] dump_stack+0xe0/0x11c [ 196.624755] print_address_description+0x68/0x260 [ 196.624759] kasan_report+0x178/0x340 [ 196.624762] __asan_report_load_n_noabort+0x38/0x48 [ 196.624786] qed_init_qm_get_idx_from_flags+0x1a4/0x1b8 [qed] [ 196.624808] qed_init_qm_info+0xec0/0x2200 [qed] [ 196.624830] qed_resc_alloc+0x284/0x7e8 [qed] [ 196.624853] qed_slowpath_start+0x6cc/0x1ae8 [qed] [ 196.624864] __qede_probe.isra.10+0x1cc/0x12c0 [qede] [ 196.624874] qede_probe+0x78/0xf0 [qede] [ 196.624879] local_pci_probe+0xc4/0x180 [ 196.624882] work_for_cpu_fn+0x54/0x98 [ 196.624885] process_one_work+0x758/0x1900 [ 196.624888] worker_thread+0x4e0/0xd18 [ 196.624892] kthread+0x2c8/0x350 [ 196.624897] ret_from_fork+0x10/0x18 [ 196.624899] [ 196.624902] Allocated by task 2: [ 196.624906] kasan_kmalloc.part.1+0x40/0x108 [ 196.624909] kasan_kmalloc+0xb4/0xc8 [ 196.624913] kasan_slab_alloc+0x14/0x20 [ 196.624916] kmem_cache_alloc_node+0x1dc/0x480 [ 196.624921] copy_process.isra.1.part.2+0x1d8/0x4a98 [ 196.624924] _do_fork+0x150/0xfa0 [ 196.624926] kernel_thread+0x48/0x58 [ 196.624930] kthreadd+0x3a4/0x5a0 [ 196.624932] ret_from_fork+0x10/0x18 [ 196.624934] [ 196.624937] Freed by task 0: [ 196.624938] (stack is not available) [ 196.624940] [ 196.624943] The buggy address belongs to the object at ffff809b00bc0000 [ 196.624943] which belongs to the cache thread_stack of size 32768 [ 196.624946] The buggy address is located 29536 bytes inside of [ 196.624946] 32768-byte region [ffff809b00bc0000, ffff809b00bc8000) [ 196.624948] The buggy address belongs to the page: [ 196.624952] page:ffff7fe026c02e00 count:1 mapcount:0 mapping:ffff809b4001c000 index:0x0 compound_mapcount: 0 [ 196.624960] flags: 0xfffff8000008100(slab|head) [ 196.624967] raw: 0fffff8000008100 dead000000000100 dead000000000200 ffff809b4001c000 [ 196.624970] raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000 [ 196.624973] page dumped because: kasan: bad access detected [ 196.624974] [ 196.624976] Memory state around the buggy address: [ 196.624980] ffff809b00bc7200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624983] ffff809b00bc7280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624985] >ffff809b00bc7300: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 04 f2 f2 f2 [ 196.624988] ^ [ 196.624990] ffff809b00bc7380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624993] ffff809b00bc7400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624995] ================================================================== Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix system crash in ll2 xmitManish Chopra
Cache number of fragments in the skb locally as in case of linear skb (with zero fragments), tx completion (or freeing of skb) may happen before driver tries to get number of frgaments from the skb which could lead to stale access to an already freed skb. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix VF probe failure while FLRManish Chopra
VFs may hit VF-PF channel timeout while probing, as in some cases it was observed that VF FLR and VF "acquire" message transaction (i.e first message from VF to PF in VF's probe flow) could occur simultaneously which could lead VF to fail sending "acquire" message to PF as VF is marked disabled from HW perspective due to FLR, which will result into channel timeout and VF probe failure. In such cases, try retrying VF "acquire" message so that in later attempts it could be successful to pass message to PF after the VF FLR is completed and can be probed successfully. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix LACP pdu drops for VFsManish Chopra
VF is always configured to drop control frames (with reserved mac addresses) but to work LACP on the VFs, it would require LACP control frames to be forwarded or transmitted successfully. This patch fixes this in such a way that trusted VFs (marked through ndo_set_vf_trust) would be allowed to pass the control frames such as LACP pdus. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix bug in tx promiscuous mode settingsManish Chopra
When running tx switched traffic between VNICs created via a bridge(to which VFs are added), adapter drops the unicast packets in tx flow due to VNIC's ucast mac being unknown to it. But VF interfaces being in promiscuous mode should have caused adapter to accept all the unknown ucast packets. Later, it was found that driver doesn't really configure tx promiscuous mode settings to accept all unknown unicast macs. This patch fixes tx promiscuous mode settings to accept all unknown/unmatched unicast macs and works out the scenario. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: i825xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in i596_interrupt() when skb xmit done. It makes drop profiles(dropwatch, perf) more friendly. Signed-off-by: Yang Wei <albin_yang@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree: 1) The nftnl mutex is now per-netns, therefore use reference counter for matches and targets to deal with concurrent updates from netns. Moreover, place extensions in a pernet list. Patches from Florian Westphal. 2) Bail out with EINVAL in case of negative timeouts via setsockopt() through ip_vs_set_timeout(), from ZhangXiaoxu. 3) Spurious EINVAL on ebtables 32bit binary with 64bit kernel, also from Florian. 4) Reset TCP option header parser in case of fingerprint mismatch, otherwise follow up overlapping fingerprint definitions including TCP options do not work, from Fernando Fernandez Mancera. 5) Compilation warning in ipt_CLUSTER with CONFIG_PROC_FS unset. From Anders Roxell. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28Revert "mm, memory_hotplug: initialize struct pages for the full memory section"Michal Hocko
This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684. The underlying assumption that one sparse section belongs into a single numa node doesn't hold really. Robert Shteynfeld has reported a boot failure. The boot log was not captured but his memory layout is as follows: Early memory node ranges node 1: [mem 0x0000000000001000-0x0000000000090fff] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff] node 1: [mem 0x0000000100000000-0x0000001423ffffff] node 0: [mem 0x0000001424000000-0x0000002023ffffff] This means that node0 starts in the middle of a memory section which is also in node1. memmap_init_zone tries to initialize padding of a section even when it is outside of the given pfn range because there are code paths (e.g. memory hotplug) which assume that the full worth of memory section is always initialized. In this particular case, though, such a range is already intialized and most likely already managed by the page allocator. Scribbling over those pages corrupts the internal state and likely blows up when any of those pages gets used. Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com> Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") Cc: stable@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-28netfilter: ipt_CLUSTERIP: fix warning unused variable cnAnders Roxell
When CONFIG_PROC_FS isn't set the variable cn isn't used. net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’: net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable] struct clusterip_net *cn = clusterip_pernet(net); ^~ Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS". Fixes: b12f7bad5ad3 ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28netfilter: nfnetlink_osf: add missing fmatch checkFernando Fernandez Mancera
When we check the tcp options of a packet and it doesn't match the current fingerprint, the tcp packet option pointer must be restored to its initial value in order to do the proper tcp options check for the next fingerprint. Here we can see an example. Assumming the following fingerprint base with two lines: S10:64:1:60:M*,S,T,N,W6: Linux:3.0::Linux 3.0 S20:64:1:60:M*,S,T,N,W7: Linux:4.19:arch:Linux 4.1 Where TCP options are the last field in the OS signature, all of them overlap except by the last one, ie. 'W6' versus 'W7'. In case a packet for Linux 4.19 kicks in, the osf finds no matching because the TCP options pointer is updated after checking for the TCP options in the first line. Therefore, reset pointer back to where it should be. Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are presentFlorian Westphal
Unlike ip(6)tables ebtables only counts user-defined chains. The effect is that a 32bit ebtables binary on a 64bit kernel can do 'ebtables -N FOO' only after adding at least one rule, else the request fails with -EINVAL. This is a similar fix as done in 3f1e53abff84 ("netfilter: ebtables: don't attempt to allocate 0-sized compat array"). Fixes: 7d7d7e02111e9 ("netfilter: compat: reject huge allocation requests") Reported-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-27net: dsa: mv88e6xxx: Fix serdes irq setup going recursiveAndrew Lunn
Duec to a typo, mv88e6390_serdes_irq_setup() calls itself, rather than mv88e6390x_serdes_irq_setup(). It then blows the stack, and shortly after the machine blows up. Fixes: 2defda1f4b91 ("net: dsa: mv88e6xxx: Add support for SERDES on ports 2-8 for 6390X") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27ip6mr: Fix notifiers call on mroute_clean_tables()Nir Dotan
When the MC route socket is closed, mroute_clean_tables() is called to cleanup existing routes. Mistakenly notifiers call was put on the cleanup of the unresolved MC route entries cache. In a case where the MC socket closes before an unresolved route expires, the notifier call leads to a crash, caused by the driver trying to increment a non initialized refcount_t object [1] and then when handling is done, to decrement it [2]. This was detected by a test recently added in commit 6d4efada3b82 ("selftests: forwarding: Add multicast routing test"). Fix that by putting notifiers call on the resolved entries traversal, instead of on the unresolved entries traversal. [1] [ 245.748967] refcount_t: increment on 0; use-after-free. [ 245.754829] WARNING: CPU: 3 PID: 3223 at lib/refcount.c:153 refcount_inc_checked+0x2b/0x30 ... [ 245.802357] Hardware name: Mellanox Technologies Ltd. MSN2740/SA001237, BIOS 5.6.5 06/07/2016 [ 245.811873] RIP: 0010:refcount_inc_checked+0x2b/0x30 ... [ 245.907487] Call Trace: [ 245.910231] mlxsw_sp_router_fib_event.cold.181+0x42/0x47 [mlxsw_spectrum] [ 245.917913] notifier_call_chain+0x45/0x7 [ 245.922484] atomic_notifier_call_chain+0x15/0x20 [ 245.927729] call_fib_notifiers+0x15/0x30 [ 245.932205] mroute_clean_tables+0x372/0x3f [ 245.936971] ip6mr_sk_done+0xb1/0xc0 [ 245.940960] ip6_mroute_setsockopt+0x1da/0x5f0 ... [2] [ 246.128487] refcount_t: underflow; use-after-free. [ 246.133859] WARNING: CPU: 0 PID: 7 at lib/refcount.c:187 refcount_sub_and_test_checked+0x4c/0x60 [ 246.183521] Hardware name: Mellanox Technologies Ltd. MSN2740/SA001237, BIOS 5.6.5 06/07/2016 ... [ 246.193062] Workqueue: mlxsw_core_ordered mlxsw_sp_router_fibmr_event_work [mlxsw_spectrum] [ 246.202394] RIP: 0010:refcount_sub_and_test_checked+0x4c/0x60 ... [ 246.298889] Call Trace: [ 246.301617] refcount_dec_and_test_checked+0x11/0x20 [ 246.307170] mlxsw_sp_router_fibmr_event_work.cold.196+0x47/0x78 [mlxsw_spectrum] [ 246.315531] process_one_work+0x1fa/0x3f0 [ 246.320005] worker_thread+0x2f/0x3e0 [ 246.324083] kthread+0x118/0x130 [ 246.327683] ? wq_update_unbound_numa+0x1b0/0x1b0 [ 246.332926] ? kthread_park+0x80/0x80 [ 246.337013] ret_from_fork+0x1f/0x30 Fixes: 088aa3eec2ce ("ip6mr: Support fib notifications") Signed-off-by: Nir Dotan <nird@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27decnet: fix DN_IFREQ_SIZEJohannes Berg
Digging through the ioctls with Al because of the previous patches, we found that on 64-bit decnet's dn_dev_ioctl() is wrong, because struct ifreq::ifr_ifru is actually 24 bytes (not 16 as expected from struct sockaddr) due to the ifru_map and ifru_settings members. Clearly, decnet expects the ioctl to be called with a struct like struct ifreq_dn { char ifr_name[IFNAMSIZ]; struct sockaddr_dn ifr_addr; }; since it does struct ifreq *ifr = ...; struct sockaddr_dn *sdn = (struct sockaddr_dn *)&ifr->ifr_addr; This means that DN_IFREQ_SIZE is too big for what it wants on 64-bit, as it is sizeof(struct ifreq) - sizeof(struct sockaddr) + sizeof(struct sockaddr_dn) This assumes that sizeof(struct sockaddr) is the size of ifr_ifru but that isn't true. Fix this to use offsetof(struct ifreq, ifr_ifru). This indeed doesn't really matter much - the result is that we copy in/out 8 bytes more than we should on 64-bit platforms. In case the "struct ifreq_dn" lands just on the end of a page though it might lead to faults. As far as I can tell, it has been like this forever, so it seems very likely that nobody cares. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27net: stmmac: dwmac-rk: fix error handling in rk_gmac_powerup()Alexey Khoroshilov
If phy_power_on() fails in rk_gmac_powerup(), clocks are left enabled. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27Merge branch 'hns-fixes'David S. Miller
Peng Li says: ==================== net: hns: code optimizations & bugfixes for HNS driver This patchset includes bugfixes and code optimizations for the HNS ethernet controller driver ==================== Signed-off-by: David S. Miller <davem@davemloft.net>