summary refs log tree commit diff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2019-11-21ipv4: move fib4_has_custom_rules() helper to public headerPaolo Abeni
So that we can use it in the next patch. Additionally constify the helper argument. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21net: Fix Kconfig indentation, continuedKrzysztof Kozlowski
Adjust indentation from spaces to tab (+optional two spaces) as in coding style. This fixes various indentation mixups (seven spaces, tab+one space, etc). Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21lwtunnel: check erspan options before allocating tun_infoXin Long
As Jakub suggested on another patch, it's better to do the check on erspan options before allocating memory. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21lwtunnel: be STRICT to validate the new LWTUNNEL_IP(6)_OPTSXin Long
LWTUNNEL_IP(6)_OPTS are the new items in ip(6)_tun_policy, which are parsed by nla_parse_nested_deprecated(). We should check it strictly by setting .strict_start_type = LWTUNNEL_IP(6)_OPTS. This patch also adds missing LWTUNNEL_IP6_OPTS in ip6_tun_policy. Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21net: remove the unnecessary strict_start_type in some policiesXin Long
ct_policy and mpls_policy are parsed with nla_parse_nested(), which does NL_VALIDATE_STRICT validation, strict_start_type is not needed to set as it is actually trying to make some attributes parsed with NL_VALIDATE_STRICT. This patch is to remove it, and do the same on rtm_nh_policy which is parsed by nlmsg_parse(). Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20tcp: warn if offset reach the maxlen limit when using snprintfHangbin Liu
snprintf returns the number of chars that would be written, not number of chars that were actually written. As such, 'offs' may get larger than 'tbl.maxlen', causing the 'tbl.maxlen - offs' being < 0, and since the parameter is size_t, it would overflow. Since using scnprintf may hide the limit error, while the buffer is still enough now, let's just add a WARN_ON_ONCE in case it reach the limit in future. v2: Use WARN_ON_ONCE as Jiri and Eric suggested. Suggested-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20ip_gre: Make none-tun-dst gre tunnel store tunnel info as metadat_dst in recvwenxu
Currently collect_md gre tunnel will store the tunnel info(metadata_dst) to skb_dst. And now the non-tun-dst gre tunnel already can add tunnel header through lwtunnel. When received a arp_request on the non-tun-dst gre tunnel. The packet of arp response will send through the non-tun-dst tunnel without tunnel info which will lead the arp response packet to be dropped. If the non-tun-dst gre tunnel also store the tunnel info as metadata_dst, The arp response packet will set the releted tunnel info in the iptunnel_metadata_reply. The following is the test script: ip netns add cl ip l add dev vethc type veth peer name eth0 netns cl ifconfig vethc 172.168.0.7/24 up ip l add dev tun1000 type gretap key 1000 ip link add user1000 type vrf table 1 ip l set user1000 up ip l set dev tun1000 master user1000 ifconfig tun1000 10.0.1.1/24 up ip netns exec cl ifconfig eth0 172.168.0.17/24 up ip netns exec cl ip l add dev tun type gretap local 172.168.0.17 remote 172.168.0.7 key 1000 ip netns exec cl ifconfig tun 10.0.1.7/24 up ip r r 10.0.1.7 encap ip id 1000 dst 172.168.0.17 key dev tun1000 table 1 With this patch ip netns exec cl ping 10.0.1.1 can success Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20net: ipconfig: Wait for deferred device probesThomas Bogendoerfer
If network device drives are using deferred probing, it was possible that waiting for devices to show up in ipconfig was already over, when the device eventually showed up. By calling wait_for_device_probe() we now make sure deferred probing is done before checking for available devices. Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20lwtunnel: add support for multiple geneve optsXin Long
geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet to carry multiple geneve opts, so it's necessary for lwtunnel to support adding multiple geneve opts in one lwtunnel route. But vxlan and erspan opts are still only allowed to add one option. With this patch, iproute2 could make it like: # ip r a 1.1.1.0/24 encap ip id 1 geneve_opts 0:0:12121212,1:2:12121212 \ dst 10.1.0.2 dev geneve1 # ip r a 1.1.1.0/24 encap ip id 1 vxlan_opts 456 \ dst 10.1.0.2 dev erspan1 # ip r a 1.1.1.0/24 encap ip id 1 erspan_opts 1:123:0:0 \ dst 10.1.0.2 dev erspan1 Which are pretty much like cls_flower and act_tunnel_key. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-18lwtunnel: change to use nla_put_u8 for LWTUNNEL_IP_OPT_ERSPAN_VERXin Long
LWTUNNEL_IP_OPT_ERSPAN_VER is u8 type, and nla_put_u8 should have been used instead of nla_put_u32(). This is a copy-paste error. Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Wildcard support for the net,iface set from Kristian Evensen. 2) Offload support for matching on the input interface. 3) Simplify matching on vlan header fields. 4) Add nft_payload_rebuild_vlan_hdr() function to rebuild the vlan header from the vlan sk_buff metadata. 5) Pass extack to nft_flow_cls_offload_setup(). 6) Add C-VLAN matching support. 7) Use time64_t in xt_time to fix y2038 overflow, from Arnd Bergmann. 8) Use time_t in nft_meta to fix y2038 overflow, also from Arnd. 9) Add flow_action_entry_next() helper function to flowtable offload infrastructure. 10) Add IPv6 support to the flowtable offload infrastructure. 11) Support for input interface matching from postrouting, from Phil Sutter. 12) Missing check for ndo callback in flowtable offload, from wenxu. 13) Remove conntrack parameter from flow_offload_fill_dir(), from wenxu. 14) Do not pass flow_rule object for rule removal, cookie is sufficient to achieve this. 15) Release flow_rule object in case of error from the offload commit path. 16) Undo offload ruleset updates if transaction fails. 17) Check for error when binding flowtable callbacks, from wenxu. 18) Always unbind flowtable callbacks when unregistering hooks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Lots of overlapping changes and parallel additions, stuff like that. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16ipmr: Fix skb headroom in ipmr_get_route().Guillaume Nault
In route.c, inet_rtm_getroute_build_skb() creates an skb with no headroom. This skb is then used by inet_rtm_getroute() which may pass it to rt_fill_info() and, from there, to ipmr_get_route(). The later might try to reuse this skb by cloning it and prepending an IPv4 header. But since the original skb has no headroom, skb_push() triggers skb_under_panic(): skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:108! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014 RIP: 0010:skb_panic+0xbf/0xd0 Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286 RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0 R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000 R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0 FS: 00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: skb_push+0x7e/0x80 ipmr_get_route+0x459/0x6fa rt_fill_info+0x692/0x9f0 inet_rtm_getroute+0xd26/0xf20 rtnetlink_rcv_msg+0x45d/0x630 netlink_rcv_skb+0x1a5/0x220 rtnetlink_rcv+0x15/0x20 netlink_unicast+0x305/0x3a0 netlink_sendmsg+0x575/0x730 sock_sendmsg+0xb5/0xc0 ___sys_sendmsg+0x497/0x4f0 __sys_sendmsg+0xcb/0x150 __x64_sys_sendmsg+0x48/0x50 do_syscall_64+0xd2/0xac0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Actually the original skb used to have enough headroom, but the reserve_skb() call was lost with the introduction of inet_rtm_getroute_build_skb() by commit 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE"). We could reserve some headroom again in inet_rtm_getroute_build_skb(), but this function shouldn't be responsible for handling the special case of ipmr_get_route(). Let's handle that directly in ipmr_get_route() by calling skb_realloc_headroom() instead of skb_clone(). Fixes: 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15netfilter: Support iif matches in POSTROUTINGPhil Sutter
Instead of generally passing NULL to NF_HOOK_COND() for input device, pass skb->dev which contains input device for routed skbs. Note that iptables (both legacy and nft) reject rules with input interface match from being added to POSTROUTING chains, but nftables allows this. Cc: Eric Garver <eric@garver.life> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15netfilter: nf_flow_table_offload: add IPv6 supportPablo Neira Ayuso
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet flowtable type definitions. Rename the nf_flow_rule_route() function to nf_flow_rule_route_ipv4(). Adjust maximum number of actions, which now becomes 16 to leave sufficient room for the IPv6 address mangling for NAT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-13Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2019-11-13 1) Remove a unnecessary net_exit function from the xfrm interface. From Xin Long. 2) Assign xfrm4_udp_encap_rcv to a UDP socket only if xfrm is configured. From Alexey Dobriyan. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12netfilter: nf_flow_table: hardware offload supportPablo Neira Ayuso
This patch adds the dataplane hardware offload to the flowtable infrastructure. Three new flags represent the hardware state of this flow: * FLOW_OFFLOAD_HW: This flow entry resides in the hardware. * FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove from hardware. This might be triggered by either packet path (via TCP RST/FIN packet) or via aging. * FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from the hardware, the software garbage collector can remove it from the software flowtable. This patch supports for: * IPv4 only. * Aging via FLOW_CLS_STATS, no packet and byte counter synchronization at this stage. This patch also adds the action callback that specifies how to convert the flow entry into the flow_rule object that is passed to the driver. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12netfilter: nf_tables: add flowtable offload control planePablo Neira Ayuso
This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new setup interface for the flowtable type to perform the flowtable offload block callback configuration. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11lwtunnel: ignore any TUNNEL_OPTIONS_PRESENT flags set by usersXin Long
TUNNEL_OPTIONS_PRESENT (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT| TUNNEL_ERSPAN_OPT) flags should be set only according to tb[LWTUNNEL_IP_OPTS], which is done in ip_tun_parse_opts(). When setting info key.tun_flags, the TUNNEL_OPTIONS_PRESENT bits in tb[LWTUNNEL_IP(6)_FLAGS] passed from users should be ignored. While at it, replace all (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT| TUNNEL_ERSPAN_OPT) with 'TUNNEL_OPTIONS_PRESENT'. Fixes: 3093fbe7ff4b ("route: Per route IP tunnel metadata via lightweight tunnel") Fixes: 32a2b002ce61 ("ipv6: route: per route IP tunnel metadata via lightweight tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11lwtunnel: get nlsize for erspan options properlyXin Long
erspan v1 has OPT_ERSPAN_INDEX while erspan v2 has OPT_ERSPAN_DIR and OPT_ERSPAN_HWID attributes, and they require different nlsize when dumping. So this patch is to get nlsize for erspan options properly according to erspan version. Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11lwtunnel: change to use nla_parse_nested on new optionsXin Long
As the new options added in kernel, all should always use strict parsing from the beginning with nla_parse_nested(), instead of nla_parse_nested_deprecated(). Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan") Fixes: edf31cbb1502 ("lwtunnel: add options setting and dumping for vxlan") Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
One conflict in the BPF samples Makefile, some fixes in 'net' whilst we were converting over to Makefile.target rules in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08net: icmp: fix data-race in cmp_global_allow()Eric Dumazet
This code reads two global variables without protection of a lock. We need READ_ONCE()/WRITE_ONCE() pairs to avoid load/store-tearing and better document the intent. KCSAN reported : BUG: KCSAN: data-race in icmp_global_allow / icmp_global_allow read to 0xffffffff861a8014 of 4 bytes by task 11201 on cpu 0: icmp_global_allow+0x36/0x1b0 net/ipv4/icmp.c:254 icmpv6_global_allow net/ipv6/icmp.c:184 [inline] icmpv6_global_allow net/ipv6/icmp.c:179 [inline] icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640 dst_link_failure include/net/dst.h:419 [inline] vti_xmit net/ipv4/ip_vti.c:243 [inline] vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279 __netdev_start_xmit include/linux/netdevice.h:4420 [inline] netdev_start_xmit include/linux/netdevice.h:4434 [inline] xmit_one net/core/dev.c:3280 [inline] dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 write to 0xffffffff861a8014 of 4 bytes by task 11183 on cpu 1: icmp_global_allow+0x174/0x1b0 net/ipv4/icmp.c:272 icmpv6_global_allow net/ipv6/icmp.c:184 [inline] icmpv6_global_allow net/ipv6/icmp.c:179 [inline] icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640 dst_link_failure include/net/dst.h:419 [inline] vti_xmit net/ipv4/ip_vti.c:243 [inline] vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279 __netdev_start_xmit include/linux/netdevice.h:4420 [inline] netdev_start_xmit include/linux/netdevice.h:4434 [inline] xmit_one net/core/dev.c:3280 [inline] dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07tcp: Remove one extra ktime_get_ns() from cookie_init_timestampEric Dumazet
tcp_make_synack() already uses tcp_clock_ns(), and can pass the value to cookie_init_timestamp() to avoid another call to ktime_get_ns() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07inetpeer: fix data-race in inet_putpeer / inet_putpeerEric Dumazet
We need to explicitely forbid read/store tearing in inet_peer_gc() and inet_putpeer(). The following syzbot report reminds us about inet_putpeer() running without a lock held. BUG: KCSAN: data-race in inet_putpeer / inet_putpeer write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 0: inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240 ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102 inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228 __rcu_reclaim kernel/rcu/rcu.h:222 [inline] rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157 rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377 rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386 __do_softirq+0x115/0x33f kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0xbb/0xe0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830 native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71 arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94 cpuidle_idle_call kernel/sched/idle.c:154 [inline] do_idle+0x1af/0x280 kernel/sched/idle.c:263 write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 1: inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240 ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102 inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228 __rcu_reclaim kernel/rcu/rcu.h:222 [inline] rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157 rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377 rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 4b9d9be839fd ("inetpeer: remove unused list") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07ipv4: Fix table id reference in fib_sync_down_addrDavid Ahern
Hendrik reported routes in the main table using source address are not removed when the address is removed. The problem is that fib_sync_down_addr does not account for devices in the default VRF which are associated with the main table. Fix by updating the table id reference. Fixes: 5a56a0b3a45d ("net: Don't delete routes in different VRFs") Reported-by: Hendrik Donner <hd@os-cillation.de> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06tcp: fix data-race in tcp_recvmsg()Eric Dumazet
Reading tp->recvmsg_inq after socket lock is released raises a KCSAN warning [1] Replace has_tss & has_cmsg by cmsg_flags and make sure to not read tp->recvmsg_inq a second time. [1] BUG: KCSAN: data-race in tcp_chrono_stop / tcp_recvmsg write to 0xffff888126adef24 of 2 bytes by interrupt on cpu 0: tcp_chrono_set net/ipv4/tcp_output.c:2309 [inline] tcp_chrono_stop+0x14c/0x280 net/ipv4/tcp_output.c:2338 tcp_clean_rtx_queue net/ipv4/tcp_input.c:3165 [inline] tcp_ack+0x274f/0x3170 net/ipv4/tcp_input.c:3688 tcp_rcv_established+0x37e/0xf50 net/ipv4/tcp_input.c:5696 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561 tcp_v4_rcv+0x19dc/0x1bb0 net/ipv4/tcp_ipv4.c:1942 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5214 napi_skb_finish net/core/dev.c:5677 [inline] napi_gro_receive+0x28f/0x330 net/core/dev.c:5710 read to 0xffff888126adef25 of 1 bytes by task 7275 on cpu 1: tcp_recvmsg+0x77b/0x1a30 net/ipv4/tcp.c:2187 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec net/socket.c:871 [inline] sock_recvmsg net/socket.c:889 [inline] sock_recvmsg+0x92/0xb0 net/socket.c:885 sock_read_iter+0x15f/0x1e0 net/socket.c:967 call_read_iter include/linux/fs.h:1889 [inline] new_sync_read+0x389/0x4f0 fs/read_write.c:414 __vfs_read+0xb1/0xc0 fs/read_write.c:427 vfs_read fs/read_write.c:461 [inline] vfs_read+0x143/0x2c0 fs/read_write.c:446 ksys_read+0xd5/0x1b0 fs/read_write.c:587 __do_sys_read fs/read_write.c:597 [inline] __se_sys_read fs/read_write.c:595 [inline] __x64_sys_read+0x4c/0x60 fs/read_write.c:595 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7275 Comm: sshd Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: b75eba76d3d7 ("tcp: send in-queue bytes in cmsg upon read") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: silence data-races on sk_backlog.tailEric Dumazet
sk->sk_backlog.tail might be read without holding the socket spinlock, we need to add proper READ_ONCE()/WRITE_ONCE() to silence the warnings. KCSAN reported : BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg write to 0xffff8881265109f8 of 8 bytes by interrupt on cpu 1: __sk_add_backlog include/net/sock.h:907 [inline] sk_add_backlog include/net/sock.h:938 [inline] tcp_add_backlog+0x476/0xce0 net/ipv4/tcp_ipv4.c:1759 tcp_v4_rcv+0x1a70/0x1bd0 net/ipv4/tcp_ipv4.c:1947 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:4929 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5043 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5133 napi_skb_finish net/core/dev.c:5596 [inline] napi_gro_receive+0x28f/0x330 net/core/dev.c:5629 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061 virtnet_receive drivers/net/virtio_net.c:1323 [inline] virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428 napi_poll net/core/dev.c:6311 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6379 __do_softirq+0x115/0x33f kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0xbb/0xe0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:536 [inline] do_IRQ+0xa6/0x180 arch/x86/kernel/irq.c:263 ret_from_intr+0x0/0x19 native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71 arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94 cpuidle_idle_call kernel/sched/idle.c:154 [inline] do_idle+0x1af/0x280 kernel/sched/idle.c:263 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355 start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241 read to 0xffff8881265109f8 of 8 bytes by task 8057 on cpu 0: tcp_recvmsg+0x46e/0x1b40 net/ipv4/tcp.c:2050 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec net/socket.c:871 [inline] sock_recvmsg net/socket.c:889 [inline] sock_recvmsg+0x92/0xb0 net/socket.c:885 sock_read_iter+0x15f/0x1e0 net/socket.c:967 call_read_iter include/linux/fs.h:1889 [inline] new_sync_read+0x389/0x4f0 fs/read_write.c:414 __vfs_read+0xb1/0xc0 fs/read_write.c:427 vfs_read fs/read_write.c:461 [inline] vfs_read+0x143/0x2c0 fs/read_write.c:446 ksys_read+0xd5/0x1b0 fs/read_write.c:587 __do_sys_read fs/read_write.c:597 [inline] __se_sys_read fs/read_write.c:595 [inline] __x64_sys_read+0x4c/0x60 fs/read_write.c:595 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 8057 Comm: syz-fuzzer Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options setting and dumping for erspanXin Long
Based on the code framework built on the last patch, to support setting and dumping for vxlan, we only need to add ip_tun_parse_opts_erspan() for .build_state and ip_tun_fill_encap_opts_erspan() for .fill_encap and if (tun_flags & TUNNEL_ERSPAN_OPT) for .get_encap_size. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options setting and dumping for vxlanXin Long
Based on the code framework built on the last patch, to support setting and dumping for vxlan, we only need to add ip_tun_parse_opts_vxlan() for .build_state and ip_tun_fill_encap_opts_vxlan() for .fill_encap and if (tun_flags & TUNNEL_VXLAN_OPT) for .get_encap_size. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options setting and dumping for geneveXin Long
To add options setting and dumping, .build_state(), .fill_encap() and .get_encap_size() in ip_tun_lwt_ops needs to be extended: ip_tun_build_state(): ip_tun_parse_opts(): ip_tun_parse_opts_geneve() ip_tun_fill_encap_info(): ip_tun_fill_encap_opts(): ip_tun_fill_encap_opts_geneve() ip_tun_encap_nlsize() ip_tun_opts_nlsize(): if (tun_flags & TUNNEL_GENEVE_OPT) ip_tun_parse_opts(), ip_tun_fill_encap_opts() and ip_tun_opts_nlsize() processes LWTUNNEL_IP_OPTS. ip_tun_parse_opts_geneve(), ip_tun_fill_encap_opts_geneve() and if (tun_flags & TUNNEL_GENEVE_OPT) processes LWTUNNEL_IP_OPTS_GENEVE. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options process for cmp_encapXin Long
When comparing two tun_info, dst_cache member should have been skipped, as dst_cache is a per cpu pointer and they are always different values even in two tun_info with the same keys. So this patch is to skip dst_cache member and compare the key, mode and options_len only. For the future opts setting support, also to compare options. Fixes: 2d79849903e0 ("lwtunnel: ip tunnel: fix multiple routes with different encap") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options process for arp requestXin Long
Without options copied to the dst tun_info in iptunnel_metadata_reply() called by arp_process for handling arp_request, the generated arp_reply packet may be dropped or sent out with wrong options for some tunnels like erspan and vxlan, and the traffic will break. Fixes: 63d008a4e9ee ("ipv4: send arp replies to the correct tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: annotate lockless accesses to sk->sk_max_ack_backlogEric Dumazet
sk->sk_max_ack_backlog can be read without any lock being held at least in TCP/DCCP cases. We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing and/or potential KCSAN warnings. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: annotate lockless accesses to sk->sk_ack_backlogEric Dumazet
sk->sk_ack_backlog can be read without any lock being held. We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing and/or potential KCSAN warnings. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06inet_diag: use jiffies_delta_to_msecs()Eric Dumazet
Use jiffies_delta_to_msecs() to avoid reporting 'infinite' timeouts and to cleanup code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05icmp: remove duplicate codeMatteo Croce
The same code which recognizes ICMP error packets is duplicated several times. Use the icmp_is_err() and icmpv6_is_err() helpers instead, which do the same thing. ip_multipath_l3_keys() and tcf_nat_act() didn't check for all the error types, assume that they should instead. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: icmp: use input address in tracerouteFrancesco Ruggeri
Even with icmp_errors_use_inbound_ifaddr set, traceroute returns the primary address of the interface the packet was received on, even if the path goes through a secondary address. In the example: 1.0.3.1/24 ---- 1.0.1.3/24 1.0.1.1/24 ---- 1.0.2.1/24 1.0.2.4/24 ---- |H1|--------------------------|R1|--------------------------|H2| ---- N1 ---- N2 ---- where 1.0.3.1/24 is R1's primary address on N1, traceroute from H1 to H2 returns: traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets 1 1.0.3.1 (1.0.3.1) 0.018 ms 0.006 ms 0.006 ms 2 1.0.2.4 (1.0.2.4) 0.021 ms 0.007 ms 0.007 ms After applying this patch, it returns: traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets 1 1.0.1.1 (1.0.1.1) 0.033 ms 0.007 ms 0.006 ms 2 1.0.2.4 (1.0.2.4) 0.011 ms 0.007 ms 0.007 ms Original-patch-by: Bill Fenner <fenner@arista.com> Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01inet: stop leaking jiffies on the wireEric Dumazet
Historically linux tried to stick to RFC 791, 1122, 2003 for IPv4 ID field generation. RFC 6864 made clear that no matter how hard we try, we can not ensure unicity of IP ID within maximum lifetime for all datagrams with a given source address/destination address/protocol tuple. Linux uses a per socket inet generator (inet_id), initialized at connection startup with a XOR of 'jiffies' and other fields that appear clear on the wire. Thiemo Nagel pointed that this strategy is a privacy concern as this provides 16 bits of entropy to fingerprint devices. Let's switch to a random starting point, this is just as good as far as RFC 6864 is concerned and does not leak anything critical. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thiemo Nagel <tnagel@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31tcp: increase tcp_max_syn_backlog max valueEric Dumazet
tcp_max_syn_backlog default value depends on memory size and TCP ehash size. Before this patch, the max value was 2048 [1], which is considered too small nowadays. Increase it to 4096 to match the recent SOMAXCONN change. [1] This is with TCP ehash size being capped to 524288 buckets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Yue Cao <ycao009@ucr.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30net: annotate accesses to sk->sk_incoming_cpuEric Dumazet
This socket field can be read and written by concurrent cpus. Use READ_ONCE() and WRITE_ONCE() annotations to document this, and avoid some compiler 'optimizations'. KCSAN reported : BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0: sk_incoming_cpu_update include/net/sock.h:953 [inline] tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1: sk_incoming_cpu_update include/net/sock.h:952 [inline] tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29inet: do not call sublist_rcv on empty listFlorian Westphal
syzbot triggered struct net NULL deref in NF_HOOK_LIST: RIP: 0010:NF_HOOK_LIST include/linux/netfilter.h:331 [inline] RIP: 0010:ip6_sublist_rcv+0x5c9/0x930 net/ipv6/ip6_input.c:292 ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:328 __netif_receive_skb_list_ptype net/core/dev.c:5274 [inline] Reason: void ipv6_list_rcv(struct list_head *head, struct packet_type *pt, struct net_device *orig_dev) [..] list_for_each_entry_safe(skb, next, head, list) { /* iterates list */ skb = ip6_rcv_core(skb, dev, net); /* ip6_rcv_core drops skb -> NULL is returned */ if (skb == NULL) continue; [..] } /* sublist is empty -> curr_net is NULL */ ip6_sublist_rcv(&sublist, curr_dev, curr_net); Before the recent change NF_HOOK_LIST did a list iteration before struct net deref, i.e. it was a no-op in the empty list case. List iteration now happens after *net deref, causing crash. Follow the same pattern as the ip(v6)_list_rcv loop and add a list_empty test for the final sublist dispatch too. Cc: Edward Cree <ecree@solarflare.com> Reported-by: syzbot+c54f457cad330e57e967@syzkaller.appspotmail.com Fixes: ca58fbe06c54 ("netfilter: add and use nf_hook_slow_list()") Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Leon Romanovsky <leonro@mellanox.com> Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29erspan: fix the tun_info options_len check for erspanXin Long
The check for !md doens't really work for ip_tunnel_info_opts(info) which only does info + 1. Also to avoid out-of-bounds access on info, it should ensure options_len is not less than erspan_metadata in both erspan_xmit() and ip6erspan_tunnel_xmit(). Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28udp: fix data-race in udp_set_dev_scratch()Eric Dumazet
KCSAN reported a data-race in udp_set_dev_scratch() [1] The issue here is that we must not write over skb fields if skb is shared. A similar issue has been fixed in commit 89c22d8c3b27 ("net: Fix skb csum races when peeking") While we are at it, use a helper only dealing with udp_skb_scratch(skb)->csum_unnecessary, as this allows udp_set_dev_scratch() to be called once and thus inlined. [1] BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1: udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308 __first_packet_length+0x147/0x420 net/ipv4/udp.c:1556 first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579 udp_poll+0xea/0x110 net/ipv4/udp.c:2720 sock_poll+0xed/0x250 net/socket.c:1256 vfs_poll include/linux/poll.h:90 [inline] do_select+0x7d0/0x1020 fs/select.c:534 core_sys_select+0x381/0x550 fs/select.c:677 do_pselect.constprop.0+0x11d/0x160 fs/select.c:759 __do_sys_pselect6 fs/select.c:784 [inline] __se_sys_pselect6 fs/select.c:769 [inline] __x64_sys_pselect6+0x12e/0x170 fs/select.c:769 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0: udp_skb_csum_unnecessary include/net/udp.h:358 [inline] udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310 inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28net: use skb_queue_empty_lockless() in busy poll contextsEric Dumazet
Busy polling usually runs without locks. Let's use skb_queue_empty_lockless() instead of skb_queue_empty() Also uses READ_ONCE() in __skb_try_recv_datagram() to address a similar potential problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28net: use skb_queue_empty_lockless() in poll() handlersEric Dumazet
Many poll() handlers are lockless. Using skb_queue_empty_lockless() instead of skb_queue_empty() is more appropriate. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28udp: use skb_queue_empty_lockless()Eric Dumazet
syzbot reported a data-race [1]. We should use skb_queue_empty_lockless() to document that we are not ensuring a mutual exclusion and silence KCSAN. [1] BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0: __skb_insert include/linux/skbuff.h:1852 [inline] __skb_queue_before include/linux/skbuff.h:1958 [inline] __skb_queue_tail include/linux/skbuff.h:1991 [inline] __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470 __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline] udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057 udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074 udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233 __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300 udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1: skb_queue_empty include/linux/skbuff.h:1494 [inline] __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653 udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26ipv4: fix route update on metric change.Paolo Abeni
Since commit af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes"), when updating an IP address with a different metric, the associated connected route is updated, too. Still, the mentioned commit doesn't handle properly some corner cases: $ ip addr add dev eth0 192.168.1.0/24 $ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2 $ ip addr add dev eth0 192.168.3.1/24 $ ip addr change dev eth0 192.168.1.0/24 metric 10 $ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10 $ ip addr change dev eth0 192.168.3.1/24 metric 10 $ ip -4 route 192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0 192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1 192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10 Only the last route is correctly updated. The problem is the current test in fib_modify_prefix_metric(): if (!(dev->flags & IFF_UP) || ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) || ipv4_is_zeronet(prefix) || prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32) Which should be the logical 'not' of the pre-existing test in fib_add_ifaddr(): if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) && (prefix != addr || ifa->ifa_prefixlen < 32)) To properly negate the original expression, we need to change the last logical 'or' to a logical 'and'. Fixes: af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes") Reported-and-suggested-by: Beniamino Galvani <bgalvani@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-25tcp: add TCP_INFO status for failed client TFOJason Baron
The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether or not data-in-SYN was ack'd on both the client and server side. We'd like to gather more information on the client-side in the failure case in order to indicate the reason for the failure. This can be useful for not only debugging TFO, but also for creating TFO socket policies. For example, if a middle box removes the TFO option or drops a data-in-SYN, we can can detect this case, and turn off TFO for these connections saving the extra retransmits. The newly added tcpi_fastopen_client_fail status is 2 bits and has the following 4 states: 1) TFO_STATUS_UNSPEC Catch-all state which includes when TFO is disabled via black hole detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE. 2) TFO_COOKIE_UNAVAILABLE If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie is available in the cache. 3) TFO_DATA_NOT_ACKED Data was sent with SYN, we received a SYN/ACK but it did not cover the data portion. Cookie is not accepted by server because the cookie may be invalid or the server may be overloaded. 4) TFO_SYN_RETRANSMITTED Data was sent with SYN, we received a SYN/ACK which did not cover the data after at least 1 additional SYN was sent (without data). It may be the case that a middle-box is dropping data-in-SYN packets. Thus, it would be more efficient to not use TFO on this connection to avoid extra retransmits during connection establishment. These new fields do not cover all the cases where TFO may fail, but other failures, such as SYN/ACK + data being dropped, will result in the connection not becoming established. And a connection blackhole after session establishment shows up as a stalled connection. Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Christoph Paasch <cpaasch@apple.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>