summary refs log tree commit diff
path: root/include/uapi
AgeCommit message (Collapse)Author
2017-02-09openvswitch: Add force commit.Jarno Rajahalme
Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exists, where the original direction of the existing connection is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Add original direction conntrack tuple to sw_flow_key.Jarno Rajahalme
Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially marked as non-existent, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets that were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP nor ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09openvswitch: Unionize ovs_key_ct_label with a u32 array.Jarno Rajahalme
Make the array of labels in struct ovs_key_ct_label an union, adding a u32 array of the same byte size as the existing u8 array. It is faster to loop through the labels 32 bits at the time, which is also the alignment of netlink attributes. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09sctp: implement sender-side procedures for Add Incoming/Outgoing Streams ↵Xin Long
Request Parameter This patch is to implement Sender-Side Procedures for the Add Outgoing and Incoming Streams Request Parameter described in rfc6525 section 5.1.5-5.1.6. It is also to add sockopt SCTP_ADD_STREAMS in rfc6525 section 6.3.4 for users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09sctp: implement sender-side procedures for SSN/TSN Reset Request ParameterXin Long
This patch is to implement Sender-Side Procedures for the SSN/TSN Reset Request Parameter descibed in rfc6525 section 5.1.4. It is also to add sockopt SCTP_RESET_ASSOC in rfc6525 section 6.3.3 for users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The conflict was an interaction between a bug fix in the netvsc driver in 'net' and an optimization of the RX path in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from sk_buff so we only access one single cacheline in the conntrack hotpath. Patchset from Florian Westphal. 2) Don't leak pointer to internal structures when exporting x_tables ruleset back to userspace, from Willem DeBruijn. This includes new helper functions to copy data to userspace such as xt_data_to_user() as well as conversions of our ip_tables, ip6_tables and arp_tables clients to use it. Not surprinsingly, ebtables requires an ad-hoc update. There is also a new field in x_tables extensions to indicate the amount of bytes that we copy to userspace. 3) Add nf_log_all_netns sysctl: This new knob allows you to enable logging via nf_log infrastructure for all existing netnamespaces. Given the effort to provide pernet syslog has been discontinued, let's provide a way to restore logging using netfilter kernel logging facilities in trusted environments. Patch from Michal Kubecek. 4) Validate SCTP checksum from conntrack helper, from Davide Caratti. 5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly a copy&paste from the original helper, from Florian Westphal. 6) Reset netfilter state when duplicating packets, also from Florian. 7) Remove unnecessary check for broadcast in IPv6 in pkttype match and nft_meta, from Liping Zhang. 8) Add missing code to deal with loopback packets from nft_meta when used by the netdev family, also from Liping. 9) Several cleanups on nf_tables, one to remove unnecessary check from the netlink control plane path to add table, set and stateful objects and code consolidation when unregister chain hooks, from Gao Feng. 10) Fix harmless reference counter underflow in IPVS that, however, results in problems with the introduction of the new refcount_t type, from David Windsor. 11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp, from Davide Caratti. 12) Missing documentation on nf_tables uapi header, from Liping Zhang. 13) Use rb_entry() helper in xt_connlimit, from Geliang Tang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03bridge: uapi: add per vlan tunnel infoRoopa Prabhu
New nested netlink attribute to associate tunnel info per vlan. This is used by bridge driver to send tunnel metadata to bridge ports in vlan tunnel mode. This patch also adds new per port flag IFLA_BRPORT_VLAN_TUNNEL to enable vlan tunnel mode. off by default. One example use for this is a vxlan bridging gateway or vtep which maps vlans to vn-segments (or vnis). User can configure per-vlan tunnel information which the bridge driver can use to bridge vlan into the corresponding vn-segment. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03vxlan: support fdb and learning in COLLECT_METADATA modeRoopa Prabhu
Vxlan COLLECT_METADATA mode today solves the per-vni netdev scalability problem in l3 networks. It expects all forwarding information to be present in dst_metadata. This patch series enhances collect metadata mode to include the case where only vni is present in dst_metadata, and the vxlan driver can then use the rest of the forwarding information datbase to make forwarding decisions. There is no change to default COLLECT_METADATA behaviour. These changes only apply to COLLECT_METADATA when used with the bridging use-case with a special dst_metadata tunnel info flag (eg: where vxlan device is part of a bridge). For all this to work, the vxlan driver will need to now support a single fdb table hashed by mac + vni. This series essentially makes this happen. use-case and workflow: vxlan collect metadata device participates in bridging vlan to vn-segments. Bridge driver above the vxlan device, sends the vni corresponding to the vlan in the dst_metadata. vxlan driver will lookup forwarding database with (mac + vni) for the required remote destination information to forward the packet. Changes introduced by this patch: - allow learning and forwarding database state in vxlan netdev in COLLECT_METADATA mode. Current behaviour is not changed by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used to support the new bridge friendly mode. - A single fdb table hashed by (mac, vni) to allow fdb entries with multiple vnis in the same fdb table - rx path already has the vni - tx path expects a vni in the packet with dst_metadata - prior to this series, fdb remote_dsts carried remote vni and the vxlan device carrying the fdb table represented the source vni. With the vxlan device now representing multiple vnis, this patch adds a src vni attribute to the fdb entry. The remote vni already uses NDA_VNI attribute. This patch introduces NDA_SRC_VNI netlink attribute to represent the src vni in a multi vni fdb table. iproute2 example (patched and pruned iproute2 output to just show relevant fdb entries): example shows same host mac learnt on two vni's. before (netdev per vni): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self after this patch with collect metadata in bridged mode (single netdev): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net/sched: act_ife: Change to use ife moduleYotam Gigi
Use the encode/decode functionality from the ife module instead of using implementation inside the act_ife. Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net: Introduce ife encapsulation moduleYotam Gigi
This module is responsible for the ife encapsulation protocol encode/decode logics. That module can: - ife_encode: encode skb and reserve space for the ife meta header - ife_decode: decode skb and extract the meta header size - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife header space. - ife_tlv_meta_decode - decodes one tlv entry from the packet - ife_tlv_meta_next - advance to the next tlv Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03ipv6: sr: remove cleanup flag and fix HMAC computationDavid Lebrun
In the latest version of the IPv6 Segment Routing IETF draft [1] the cleanup flag is removed and the flags field length is shrunk from 16 bits to 8 bits. As a consequence, the input of the HMAC computation is modified in a non-backward compatible way by covering the whole octet of flags instead of only the cleanup bit. As such, if an implementation compatible with the latest draft computes the HMAC of an SRH who has other flags set to 1, then the HMAC result would differ from the current implementation. This patch carries those modifications to prevent conflict with other implementations of IPv6 SR. [1] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-05 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02net: add LINUX_MIB_PFMEMALLOCDROP counterEric Dumazet
Debugging issues caused by pfmemalloc is often tedious. Add a new SNMP counter to more easily diagnose these problems. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Josef Bacik <jbacik@fb.com> Acked-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02unix: add ioctl to open a unix socket file with O_PATHAndrey Vagin
This ioctl opens a file to which a socket is bound and returns a file descriptor. The caller has to have CAP_NET_ADMIN in the socket network namespace. Currently it is impossible to get a path and a mount point for a socket file. socket_diag reports address, device ID and inode number for unix sockets. An address can contain a relative path or a file may be moved somewhere. And these properties say nothing about a mount namespace and a mount point of a socket file. With the introduced ioctl, we can get a path by reading /proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X. In CRIU we are going to use this ioctl to dump and restore unix socket. Here is an example how it can be used: $ strace -e socket,bind,ioctl ./test /tmp/test_sock socket(AF_UNIX, SOCK_STREAM, 0) = 3 bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0 ioctl(3, SIOCUNIXFILE, 0) = 4 ^Z $ ss -a | grep test_sock u_str LISTEN 0 1 test_sock 17798 * 0 $ ls -l /proc/760/fd/{3,4} lrwx------ 1 root root 64 Feb 1 09:41 3 -> 'socket:[17798]' l--------- 1 root root 64 Feb 1 09:41 4 -> /tmp/test_sock $ cat /proc/760/fdinfo/4 pos: 0 flags: 012000000 mnt_id: 40 $ cat /proc/self/mountinfo | grep "^40\s" 40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
All merge conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30net: ethtool: add support for 2500BaseT and 5000BaseT link modesPavel Belous
This patch introduce support for 2500BaseT and 5000BaseT link modes. These modes are included in the new IEEE 802.3bz standard. Signed-off-by: Pavel Belous <pavel.s.belous@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-29tcp: record pkts sent and retransmisttedYuchung Cheng
Add two stats in SCM_TIMESTAMPING_OPT_STATS: TCP_NLA_DATA_SEGS_OUT: total data packets sent including retransmission TCP_NLA_TOTAL_RETRANS: total data packets retransmitted The names are picked to be consistent with corresponding fields in TCP_INFO. This allows applications that are using the timestamping API to measure latency stats to also retrive retransmission rate of application write. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) GTP fixes from Andreas Schultz (missing genl module alias, clear IP DF on transmit). 2) Netfilter needs to reflect the fwmark when sending resets, from Pau Espin Pedrol. 3) nftable dump OOPS fix from Liping Zhang. 4) Fix erroneous setting of VIRTIO_NET_HDR_F_DATA_VALID on transmit, from Rolf Neugebauer. 5) Fix build error of ipt_CLUSTERIP when procfs is disabled, from Arnd Bergmann. 6) Fix regression in handling of NETIF_F_SG in harmonize_features(), from Eric Dumazet. 7) Fix RTNL deadlock wrt. lwtunnel module loading, from David Ahern. 8) tcp_fastopen_create_child() needs to setup tp->max_window, from Alexey Kodanev. 9) Missing kmemdup() failure check in ipv6 segment routing code, from Eric Dumazet. 10) Don't execute unix_bind() under the bindlock, otherwise we deadlock with splice. From WANG Cong. 11) ip6_tnl_parse_tlv_enc_lim() potentially reallocates the skb buffer, therefore callers must reload cached header pointers into that skb. Fix from Eric Dumazet. 12) Fix various bugs in legacy IRQ fallback handling in alx driver, from Tobias Regnery. 13) Do not allow lwtunnel drivers to be unloaded while they are referenced by active instances, from Robert Shearman. 14) Fix truncated PHY LED trigger names, from Geert Uytterhoeven. 15) Fix a few regressions from virtio_net XDP support, from John Fastabend and Jakub Kicinski. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (102 commits) ISDN: eicon: silence misleading array-bounds warning net: phy: micrel: add support for KSZ8795 gtp: fix cross netns recv on gtp socket gtp: clear DF bit on GTP packet tx gtp: add genl family modules alias tcp: don't annotate mark on control socket from tcp_v6_send_response() ravb: unmap descriptors when freeing rings virtio_net: reject XDP programs using header adjustment virtio_net: use dev_kfree_skb for small buffer XDP receive r8152: check rx after napi is enabled r8152: re-schedule napi for tx r8152: avoid start_xmit to schedule napi when napi is disabled r8152: avoid start_xmit to call napi_schedule during autosuspend net: dsa: Bring back device detaching in dsa_slave_suspend() net: phy: leds: Fix truncated LED trigger names net: phy: leds: Break dependency of phy.h on phy_led_triggers.h net: phy: leds: Clear phy_num_led_triggers on failure to avoid crash net-next: ethernet: mediatek: change the compatible string Documentation: devicetree: change the mediatek ethernet compatible string bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status(). ...
2017-01-27Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma fixes from Doug Ledford: "Second round of -rc fixes for 4.10. This -rc cycle has been slow for the rdma subsystem. I had already sent you the first batch before the Holiday break. After that, we kept only getting a few here or there. Up until this week, when I got a drop of 13 to one driver (qedr). So, here's the -rc patches I have. I currently have none held in reserve, so unless something new comes in, this is it until the next merge window opens. Summary: - series of iw_cxgb4 fixes to make it work with the drain cq API - one or two patches each to: srp, iser, cxgb3, vmw_pvrdma, umem, rxe, and ipoib - one big series (13 patches) for the new qedr driver" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (27 commits) RDMA/cma: Fix unknown symbol when CONFIG_IPV6 is not enabled IB/rxe: Prevent from completer to operate on non valid QP IB/rxe: Fix rxe dev insertion to rxe_dev_list IB/umem: Release pid in error and ODP flow RDMA/qedr: Dispatch port active event from qedr_add RDMA/qedr: Fix and simplify memory leak in PD alloc RDMA/qedr: Fix RDMA CM loopback RDMA/qedr: Fix formatting RDMA/qedr: Mark three functions as static RDMA/qedr: Don't reset QP when queues aren't flushed RDMA/qedr: Don't spam dmesg if QP is in error state RDMA/qedr: Remove CQ spinlock from CM completion handlers RDMA/qedr: Return max inline data in QP query result RDMA/qedr: Return success when not changing QP state RDMA/qedr: Add uapi header qedr-abi.h RDMA/qedr: Fix MTU returned from QP query RDMA/core: Add the function ib_mtu_int_to_enum IB/vmw_pvrdma: Fix incorrect cleanup on pvrdma_pci_probe error path IB/vmw_pvrdma: Don't leak info from alloc_ucontext IB/cxgb3: fix misspelling in header guard ...
2017-01-27Merge tag 'media/v4.10-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: - fix a regression on tvp5150 causing failures at input selection and image glitches - CEC was moved out of staging for v4.10. Fix some bugs on it while not too late - fix a regression on pctv452e caused by VM stack changes - fix suspend issued with smiapp - fix a regression on cobalt driver - fix some warnings and Kconfig issues with some random configs. * tag 'media/v4.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: [media] s5k4ecgx: select CRC32 helper [media] dvb: avoid warning in dvb_net [media] v4l: tvp5150: Don't override output pinmuxing at stream on/off time [media] v4l: tvp5150: Fix comment regarding output pin muxing [media] v4l: tvp5150: Reset device at probe time, not in get/set format handlers [media] pctv452e: move buffer to heap, no mutex [media] media/cobalt: use pci_irq_allocate_vectors [media] cec: fix race between configuring and unconfiguring [media] cec: move cec_report_phys_addr into cec_config_thread_func [media] cec: replace cec_report_features by cec_fill_msg_report_features [media] cec: update log_addr[] before finishing configuration [media] cec: CEC_MSG_GIVE_FEATURES should abort for CEC version < 2 [media] cec: when canceling a message, don't overwrite old status info [media] cec: fix report_current_latency [media] smiapp: Make suspend and resume functions __maybe_unused [media] smiapp: Implement power-on and power-off sequences without runtime PM
2017-01-27Merge tag 'linux-can-next-for-4.11-20170124' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2017-01-24 this is a pull request of 4 patches for net-next/master. The first patch by Oliver Hartkopp adds a netlink API to configure the interface termination of a CAN card. The next two patches are by me and add a netlink API to query and configure CAN interfaces that only support fixed bitrates. The last patch by Colin Ian King simplifies the return path in the softing_cs driver's softingcs_probe() function. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27net/ipv6: allow sysctl to change link-local address generation modeFelix Jia
The address generation mode for IPv6 link-local can only be configured by netlink messages. This patch adds the ability to change the address generation mode via sysctl. v1 -> v2 Removed the rtnl lock and switch to use RCU lock to iterate through the netdev list. v2 -> v3 Removed the addrgenmode variable from the idev structure and use the systcl storage for the flag. Simplifed the logic for sysctl handling by removing the supported for all operation. Added support for more types of tunnel interfaces for link-local address generation. Based the patches from net-next. v3 -> v4 Removed unnecessary whitespace changes. Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26Merge tag 'batadv-next-for-davem-20170126' of ↵David S. Miller
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - ignore self-generated loop detect MAC addresses in translation table, by Simon Wunderlich - install uapi batman_adv.h header, by Sven Eckelmann - bump copyright years, by Sven Eckelmann - Remove an unused variable in translation table code, by Sven Eckelmann - Handle NET_XMIT_CN like NET_XMIT_SUCCESS (revised according to Davids suggestion), and a follow up code clean up, by Gao Feng (2 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a large batch with Netfilter fixes for your net tree, they are: 1) Two patches to solve conntrack garbage collector cpu hogging, one to remove GC_MAX_EVICTS and another to look at the ratio (scanned entries vs. evicted entries) to make a decision on whether to reduce or not the scanning interval. From Florian Westphal. 2) Two patches to fix incorrect set element counting if NLM_F_EXCL is is not set. Moreover, don't decrenent set->nelems from abort patch if -ENFILE which leaks a spare slot in the set. This includes a patch to deconstify the set walk callback to update set->ndeact. 3) Two fixes for the fwmark_reflect sysctl feature: Propagate mark to reply packets both from nf_reject and local stack, from Pau Espin Pedrol. 4) Fix incorrect handling of loopback traffic in rpfilter and nf_tables fib expression, from Liping Zhang. 5) Fix oops on stateful objects netlink dump, when no filter is specified. Also from Liping Zhang. 6) Fix a build error if proc is not available in ipt_CLUSTERIP, related to fix that was applied in the previous batch for net. From Arnd Bergmann. 7) Fix lack of string validation in table, chain, set and stateful object names in nf_tables, from Liping Zhang. Moreover, restrict maximum log prefix length to 127 bytes, otherwise explicitly bail out. 8) Two patches to fix spelling and typos in nf_tables uapi header file and Kconfig, patches from Alexander Alemayhu and William Breathitt Gray. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26batman-adv: update copyright years for 2017Sven Eckelmann
Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26uapi: install batman_adv.h headerSven Eckelmann
09748a22f4ab ("batman-adv: add generic netlink family for batman-adv") introduced the new batman_adv.h which describes the netlink attributes and commands of batman-adv. But the Kbuild entry to install the header was not added. All currently known tools ship their own copy of batman_adv.h but it should be installed anyway to later be able to migrate to the system batman_adv.h. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-25net/tcp-fastopen: Add new API supportWei Wang
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an alternative way to perform Fast Open on the active side (client). Prior to this patch, a client needs to replace the connect() call with sendto(MSG_FASTOPEN). This can be cumbersome for applications who want to use Fast Open: these socket operations are often done in lower layer libraries used by many other applications. Changing these libraries and/or the socket call sequences are not trivial. A more convenient approach is to perform Fast Open by simply enabling a socket option when the socket is created w/o changing other socket calls sequence: s = socket() create a new socket setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …); newly introduced sockopt If set, new functionality described below will be used. Return ENOTSUPP if TFO is not supported or not enabled in the kernel. connect() With cookie present, return 0 immediately. With no cookie, initiate 3WHS with TFO cookie-request option and return -1 with errno = EINPROGRESS. write()/sendmsg() With cookie present, send out SYN with data and return the number of bytes buffered. With no cookie, and 3WHS not yet completed, return -1 with errno = EINPROGRESS. No MSG_FASTOPEN flag is needed. read() Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but write() is not called yet. Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is established but no msg is received yet. Return number of bytes read if socket is established and there is msg received. The new API simplifies life for applications that always perform a write() immediately after a successful connect(). Such applications can now take advantage of Fast Open by merely making one new setsockopt() call at the time of creating the socket. Nothing else about the application's socket call sequence needs to change. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25Merge tag 'mlx5-updates-2017-01-24' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2017-24-01 The first seven patches from Or Gerlitz in this series further enhances the mlx5 SRIOV switchdev mode to support offloading IPv6 tunnels using the TC tunnel key set (encap) and unset (decap) actions. Or Gerlitz says: ======================== As part of doing this change, few cleanups are done in the IPv4 code, later we move to use the full tunnel key info provided to the driver as the key for our internal hashing which is used to identify cases where the same tunnel is used for encapsulating multiple flows. As done in the IPv4 case, the control path for offloading IPv6 tunnels uses route/neigh lookups and construction of the IPv6 tunnel headers on the encap path and matching on the outer hears in the decap path. The last patch of the series enlarges the HW FDB size for the switchdev mode, so it has now room to contain offloaded flows as many as min(max number of HW flow counters supported, max HW table size supported). ======================== Next to Or's series you can find several patches handling several topics. From Mohamad, add support for SRIOV VF min rate guarantee by using the TSAR BW share weights mechanism. From Or, Two patches to enable Eth VFs to query their min-inline value for user-space. for that we move a mlx5 low level min inline helper function from mlx5 ethernet driver into the core driver and then use it in mlx5_ib to expose the inline mode to rdma applications through libmlx5. From Kamal Heib, Reduce memory consumption on kdump kernel. From Shaker Daibes, code reuse in CQE compression control logic ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net sched actions: Add support for user cookiesJamal Hadi Salim
Introduce optional 128-bit action cookie. Like all other cookie schemes in the networking world (eg in protocols like http or existing kernel fib protocol field, etc) the idea is to save user state that when retrieved serves as a correlator. The kernel _should not_ intepret it. The user can store whatever they wish in the 128 bits. Sample exercise(showing variable length use of cookie) .. create an accept action with cookie a1b2c3d4 sudo $TC actions add action ok index 1 cookie a1b2c3d4 .. dump all gact actions.. sudo $TC -s actions ls action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 0 installed 5 sec used 5 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 cookie a1b2c3d4 .. bind the accept action to a filter.. sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1 ... send some traffic.. $ ping 127.0.0.1 -c 3 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms 64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms 64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24netfilter: nft_log: restrict the log prefix length to 127Liping Zhang
First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127, at nf_log_packet(), so the extra part is useless. Second, after adding a log rule with a very very long prefix, we will fail to dump the nft rules after this _special_ one, but acctually, they do exist. For example: # name_65000=$(printf "%0.sQ" {1..65000}) # nft add rule filter output log prefix "$name_65000" # nft add rule filter output counter # nft add rule filter output counter # nft list chain filter output table ip filter { chain output { type filter hook output priority 0; policy accept; } } So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24RDMA/qedr: Add uapi header qedr-abi.hAmrani, Ram
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24bpf: allow option for setting bpf_l4_csum_replace from scratchDaniel Borkmann
When programs need to calculate the csum from scratch for small UDP packets and use bpf_l4_csum_replace() to feed the result from helpers like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0 that would ignore the case of current csum being 0, and which would still allow for the helper to set the csum and transform when needed to CSUM_MANGLED_0. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24IB/mlx5: Enable Eth VFs to query their min-inline value for user-spaceOr Gerlitz
For some mlx5 HW models (CX4, CX4Lx), the VF driver needs to put part of the packet headers on the TX descriptor so the e-switch can do proper matching and steering. This is called "min-inline", it's advertized to the VF by the FW and also enforced on them by the HW, such that if they don't obey, their packets are dropped. SRIOV VF libmlx5 instances should take into account the min-inline value of their vports. For that end, we provide this value through the vendor response part of init_ucontext command. The min inline value is reported in a way which will let newer libmlx5 instances realize that they are running over an older kernel and act accordingly (e.g apply some educated guess). Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/sched: Introduce sample tc actionYotam Gigi
This action allows the user to sample traffic matched by tc classifier. The sampling consists of choosing packets randomly and sampling them using the psample module. The user can configure the psample group number, the sampling rate and the packet's truncation (to save kernel-user traffic). Example: To sample ingress traffic from interface eth1, one may use the commands: tc qdisc add dev eth1 handle ffff: ingress tc filter add dev eth1 parent ffff: \ matchall action sample rate 12 group 4 Where the first command adds an ingress qdisc and the second starts sampling randomly with an average of one sampled packet per 12 packets on dev eth1 to psample group 4. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: Introduce psample, a new genetlink channel for packet samplingYotam Gigi
Add a general way for kernel modules to sample packets, without being tied to any specific subsystem. This netlink channel can be used by tc, iptables, etc. and allow to standardize packet sampling in the kernel. For every sampled packet, the psample module adds the following metadata fields: PSAMPLE_ATTR_IIFINDEX - the packets input ifindex, if applicable PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been truncated during sampling PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the user who initiated the sampling. This field allows the user to differentiate between several samplers working simultaneously and filter packets relevant to him PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The sequence is kept for each group PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets PSAMPLE_ATTR_DATA - the actual packet bits The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast group. In addition, add the GET_GROUPS netlink command which allows the user to see the current sample groups, their refcount and sequence number. This command currently supports only netlink dump mode. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bridge: multicast to unicastFelix Fietkau
Implements an optional, per bridge port flag and feature to deliver multicast packets to any host on the according port via unicast individually. This is done by copying the packet per host and changing the multicast destination MAC to a unicast one accordingly. multicast-to-unicast works on top of the multicast snooping feature of the bridge. Which means unicast copies are only delivered to hosts which are interested in it and signalized this via IGMP/MLD reports previously. This feature is intended for interface types which have a more reliable and/or efficient way to deliver unicast packets than broadcast ones (e.g. wifi). However, it should only be enabled on interfaces where no IGMPv2/MLDv1 report suppression takes place. This feature is disabled by default. The initial patch and idea is from Felix Fietkau. Signed-off-by: Felix Fietkau <nbd@nbd.name> [linus.luessing@c0d3.blue: various bug + style fixes, commit message] Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24IB/cxgb3: fix misspelling in header guardNicolas Iooss
Use CXGB3_... instead of CXBG3_... Fixes: a85fb3383340 ("IB/cxgb3: Move user vendor structures") Cc: stable@vger.kernel.org # 4.9 Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Steve Wise <swise@chelsio.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24can: dev: add CAN interface API for fixed bitratesMarc Kleine-Budde
Some CAN interfaces only support fixed fixed bitrates. This patch adds a netlink interface to get the list of the CAN interface's fixed bitrates and data bitrates. Inside the driver arrays of supported data- bitrate values are defined. const u32 drvname_bitrate[] = { 20000, 50000, 100000 }; const u32 drvname_data_bitrate[] = { 200000, 500000, 1000000 }; struct drvname_priv *priv; priv = netdev_priv(dev); priv->bitrate_const = drvname_bitrate; priv->bitrate_const_cnt = ARRAY_SIZE(drvname_bitrate); priv->data_bitrate_const = drvname_data_bitrate; priv->data_bitrate_const_cnt = ARRAY_SIZE(drvname_data_bitrate); Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-01-24can: dev: add CAN interface termination APIOliver Hartkopp
This patch adds a netlink interface to configure the CAN bus termination of CAN interfaces. Inside the driver an array of supported termination values is defined: const u16 drvname_termination[] = { 60, 120, CAN_TERMINATION_DISABLED }; struct drvname_priv *priv; priv = netdev_priv(dev); priv->termination_const = drvname_termination; priv->termination_const_cnt = ARRAY_SIZE(drvname_termination); priv->termination = CAN_TERMINATION_DISABLED; And the funtion to set the value has to be defined: priv->do_set_termination = drvname_set_termination; Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Reviewed-by: Ramesh Shanmugasundaram <Ramesh.shanmugasundaram@bp.renesas.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-01-23bpf: add a longest prefix match trie map implementationDaniel Mack
This trie implements a longest prefix match algorithm that can be used to match IP addresses to a stored set of ranges. Internally, data is stored in an unbalanced trie of nodes that has a maximum height of n, where n is the prefixlen the trie was created with. Tries may be created with prefix lengths that are multiples of 8, in the range from 8 to 2048. The key used for lookup and update operations is a struct bpf_lpm_trie_key, and the value is a uint64_t. The code carries more information about the internal implementation. Signed-off-by: Daniel Mack <daniel@zonque.org> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: make replicast a user selectable optionJon Paul Maloy
If the bearer carrying multicast messages supports broadcast, those messages will be sent to all cluster nodes, irrespective of whether these nodes host any actual destinations socket or not. This is clearly wasteful if the cluster is large and there are only a few real destinations for the message being sent. In this commit we extend the eligibility of the newly introduced "replicast" transmit option. We now make it possible for a user to select which method he wants to be used, either as a mandatory setting via setsockopt(), or as a relative setting where we let the broadcast layer decide which method to use based on the ratio between cluster size and the message's actual number of destination nodes. In the latter case, a sending socket must stick to a previously selected method until it enters an idle period of at least 5 seconds. This eliminates the risk of message reordering caused by method change, i.e., when changes to cluster size or number of destinations would otherwise mandate a new method to be used. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20bpf: add bpf_probe_read_str helperGianluca Borello
Provide a simple helper with the same semantics of strncpy_from_unsafe(): int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr) This gives more flexibility to a bpf program. A typical use case is intercepting a file name during sys_open(). The current approach is: SEC("kprobe/sys_open") void bpf_sys_open(struct pt_regs *ctx) { char buf[PATHLEN]; // PATHLEN is defined to 256 bpf_probe_read(buf, sizeof(buf), ctx->di); /* consume buf */ } This is suboptimal because the size of the string needs to be estimated at compile time, causing more memory to be copied than often necessary, and can become more problematic if further processing on buf is done, for example by pushing it to userspace via bpf_perf_event_output(), since the real length of the string is unknown and the entire buffer must be copied (and defining an unrolled strnlen() inside the bpf program is a very inefficient and unfeasible approach). With the new helper, the code can easily operate on the actual string length rather than the buffer size: SEC("kprobe/sys_open") void bpf_sys_open(struct pt_regs *ctx) { char buf[PATHLEN]; // PATHLEN is defined to 256 int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di); /* consume buf, for example push it to userspace via * bpf_perf_event_output(), but this time we can use * res (the string length) as event size, after checking * its boundaries. */ } Another useful use case is when parsing individual process arguments or individual environment variables navigating current->mm->arg_start and current->mm->env_start: using this helper and the return value, one can quickly iterate at the right offset of the memory area. The code changes simply leverage the already existent strncpy_from_unsafe() kernel function, which is safe to be called from a bpf program as it is used in bpf_trace_printk(). Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: implement sender-side procedures for SSN Reset Request ParameterXin Long
This patch is to implement sender-side procedures for the Outgoing and Incoming SSN Reset Request Parameter described in rfc6525 section 5.1.2 and 5.1.3. It is also add sockopt SCTP_RESET_STREAMS in rfc6525 section 6.3.2 for users. Note that the new asoc member strreset_outstanding is to make sure only one reconf request chunk on the fly as rfc6525 section 5.1.1 demands. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add sockopt SCTP_ENABLE_STREAM_RESETXin Long
This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set strreset_enable to indicate which reconf request type it supports, which is described in rfc6525 section 6.3.1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17bridge: sparse fixes in br_ip6_multicast_alloc_query()Lance Richardson
Changed type of csum field in struct igmpv3_query from __be16 to __sum16 to eliminate type warning, made same change in struct igmpv3_report for consistency. Fixed up an ntohs() where htons() should have been used instead. Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-01-17mpls: Packet statsRobert Shearman
Having MPLS packet stats is useful for observing network operation and for diagnosing network problems. In the absence of anything better, RFC2863 and RFC3813 are used for guidance for which stats to expose and the semantics of them. In particular rx_noroutes maps to in unknown protos in RFC2863. The stats are exposed to userspace via AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages. All the introduced fields are 64-bit, even error ones, to ensure no overflow with long uptimes. Per-CPU counters are used to avoid cache-line contention on the commonly used fields. The other fields have also been made per-CPU for code to avoid performance problems in error conditions on the assumption that on some platforms the cost of atomic operations could be more expensive than sending the packet (which is what would be done in the success case). If that's not the case, we could instead not use per-CPU counters for these fields. Only unicast and non-fragment are exposed at the moment, but other counters can be exposed in the future either by adding to the end of struct mpls_link_stats or by additional netlink attributes in the AF_MPLS IFLA_STATS_AF_SPEC nested attribute. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17net: AF-specific RTM_GETSTATS attributesRobert Shearman
Add the functionality for including address-family-specific per-link stats in RTM_GETSTATS messages. This is done through adding a new IFLA_STATS_AF_SPEC attribute under which address family attributes are nested and then the AF-specific attributes can be further nested. This follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the advantage of presenting an easily extended hierarchy. The rtnl_af_ops structure is extended to provide AFs with the opportunity to fill and provide the size of their stats attributes. One alternative would have been to provide AFs with the ability to add attributes directly into the RTM_GETSTATS message without a nested hierarchy. I discounted this approach as it increases the rate at which the 32 attribute number space is used up and it makes implementation a little more tricky for stats dump resuming (at the moment the order in which attributes are added to the message has to match the numeric order of the attributes). Another alternative would have been to register per-AF RTM_GETSTATS handlers. I discounted this approach as I perceived a common use-case to be getting all the stats for an interface and this approach would necessitate multiple requests/dumps to retrieve them all. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16ipv6: sr: add missing Kbuild export for header filesDavid Lebrun
Add missing IPv6-SR header files in include/uapi/linux/Kbuild. Also, prevent seg6_lwt_headroom() from being exported and add missing linux/types.h include. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>