summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2017-07-15tcp_bbr: introduce bbr_init_pacing_rate_from_rtt() helperNeal Cardwell
Introduce a helper to initialize the BBR pacing rate unconditionally, based on the current cwnd and RTT estimate. This is a pure refactor, but is needed for two following fixes. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15tcp_bbr: introduce bbr_bw_to_pacing_rate() helperNeal Cardwell
Introduce a helper to convert a BBR bandwidth and gain factor to a pacing rate in bytes per second. This is a pure refactor, but is needed for two following fixes. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15tcp_bbr: cut pacing rate only if filled pipeNeal Cardwell
In bbr_set_pacing_rate(), which decides whether to cut the pacing rate, there was some code that considered exiting STARTUP to be equivalent to the notion of filling the pipe (i.e., bbr_full_bw_reached()). Specifically, as the code was structured, exiting STARTUP and going into PROBE_RTT could cause us to cut the pacing rate down to something silly and low, based on whatever bandwidth samples we've had so far, when it's possible that all of them have been small app-limited bandwidth samples that are not representative of the bandwidth available in the path. (The code was correct at the time it was written, but the state machine changed without this spot being adjusted correspondingly.) Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15openvswitch: Fix for force/commit action failuresGreg Rose
When there is an established connection in direction A->B, it is possible to receive a packet on port B which then executes ct(commit,force) without first performing ct() - ie, a lookup. In this case, we would expect that this packet can delete the existing entry so that we can commit a connection with direction B->A. However, currently we only perform a check in skb_nfct_cached() for whether OVS_CS_F_TRACKED is set and OVS_CS_F_INVALID is not set, ie that a lookup previously occurred. In the above scenario, a lookup has not occurred but we should still be able to statelessly look up the existing entry and potentially delete the entry if it is in the opposite direction. This patch extends the check to also hint that if the action has the force flag set, then we will lookup the existing entry so that the force check at the end of skb_nfct_cached has the ability to delete the connection. Fixes: dd41d330b03 ("openvswitch: Add force commit.") CC: Pravin Shelar <pshelar@nicira.com> CC: dev@openvswitch.org Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15ipv4: ip_do_fragment: fix headroom testsVasily Averin
Some time ago David Woodhouse reported skb_under_panic when we try to push ethernet header to fragmented ipv6 skbs. It was fixed for ipv6 by Florian Westphal in commit 1d325d217c7f ("ipv6: ip6_fragment: fix headroom tests and skb leak") However similar problem still exist in ipv4. It does not trigger skb_under_panic due paranoid check in ip_finish_output2, however according to Alexey Kuznetsov current state is abnormal and ip_fragment should be fixed too. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14sctp: fix an array overflow when all ext chunks are setXin Long
Marcelo noticed an array overflow caused by commit c28445c3cb07 ("sctp: add reconf_enable in asoc ep and netns"), in which sctp would add SCTP_CID_RECONF into extensions when reconf_enable is set in sctp_make_init and sctp_make_init_ack. Then now when all ext chunks are set, 4 ext chunk ids can be put into extensions array while extensions array size is 3. It would cause a kernel panic because of this overflow. This patch is to fix it by defining extensions array size is 4 in both sctp_make_init and sctp_make_init_ack. Fixes: c28445c3cb07 ("sctp: add reconf_enable in asoc ep and netns") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14net sched actions: rename act_get_notify() to tcf_get_notify()Roman Mashak
Make name consistent with other TC event notification routines, such as tcf_add_notify() and tcf_del_notify() Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14net/packet: Fix Tx queue selection for AF_PACKETIván Briano
When PACKET_QDISC_BYPASS is not used, Tx queue selection will be done before the packet is enqueued, taking into account any mappings set by a queuing discipline such as mqprio without hardware offloading. This selection may be affected by a previously saved queue_mapping, either on the Rx path, or done before the packet reaches the device, as it's currently the case for AF_PACKET. In order for queue selection to work as expected when using traffic control, there can't be another selection done before that point is reached, so move the call to packet_pick_tx_queue to packet_direct_xmit, leaving the default xmit path as it was before PACKET_QDISC_BYPASS was introduced. A forward declaration of packet_pick_tx_queue() is introduced to avoid the need to reorder the functions within the file. Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: Iván Briano <ivan.briano@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14net: bridge: fix dest lookup when vlan proto doesn't matchNikolay Aleksandrov
With 802.1ad support the vlan_ingress code started checking for vlan protocol mismatch which causes the current tag to be inserted and the bridge vlan protocol & pvid to be set. The vlan tag insertion changes the skb mac_header and thus the lookup mac dest pointer which was loaded prior to calling br_allowed_ingress in br_handle_frame_finish is VLAN_HLEN bytes off now, pointing to the last two bytes of the destination mac and the first four of the source mac causing lookups to always fail and broadcasting all such packets to all ports. Same thing happens for locally originated packets when passing via br_dev_xmit. So load the dest pointer after the vlan checks and possible skb change. Fixes: 8580e2117c06 ("bridge: Prepare for 802.1ad vlan filtering support") Reported-by: Anitha Narasimha Murthy <anitha@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14netpoll: shut up a kernel warning on refcountWANG Cong
When we convert atomic_t to refcount_t, a new kernel warning on "increment on 0" is introduced in the netpoll code, zap_completion_queue(). In fact for this special case, we know the refcount is 0 and we just have to set it to 1 to satisfy the following dev_kfree_skb_any(), so we can just use refcount_set(..., 1) instead. Fixes: 633547973ffc ("net: convert sk_buff.users from atomic_t to refcount_t") Reported-by: Dave Jones <davej@codemonkey.org.uk> Cc: Reshetova, Elena <elena.reshetova@intel.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-13net: set fib rule refcount after mallocDavid Ahern
The configure callback of fib_rules_ops can change the refcnt of a fib rule. For instance, mlxsw takes a refcnt when adding the processing of the rule to a work queue. Thus the rule refcnt can not be reset to to 1 afterwards. Move the refcnt setting to after the allocation. Fixes: 5361e209dd30 ("net: avoid one splat in fib_nl_delrule()") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-13dccp: make const array error_code staticColin Ian King
Don't populate array error_code on the stack but make it static. Makes the object code smaller by almost 250 bytes: Before: text data bss dec hex filename 10366 983 0 11349 2c55 net/dccp/input.o After: text data bss dec hex filename 10161 1039 0 11200 2bc0 net/dccp/input.o Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-13bpf: fix return in bpf_skb_adjust_netKefeng Wang
The bpf_skb_adjust_net() ignores the return value of bpf_skb_net_shrink/grow, and always return 0, fix it by return 'ret'. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix 64-bit division in mlx5 IPSEC offload support, from Ilan Tayari and Arnd Bergmann. 2) Fix race in statistics gathering in bnxt_en driver, from Michael Chan. 3) Can't use a mutex in RCU reader protected section on tap driver, from Cong WANG. 4) Fix mdb leak in bridging code, from Eduardo Valentin. 5) Fix free of wrong pointer variable in nfp driver, from Dan Carpenter. 6) Buffer overflow in brcmfmac driver, from Arend van SPriel. 7) ioremap_nocache() return value needs to be checked in smsc911x driver, from Alexey Khoroshilov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits) net: stmmac: revert "support future possible different internal phy mode" sfc: don't read beyond unicast address list datagram: fix kernel-doc comments socket: add documentation for missing elements smsc911x: Add check for ioremap_nocache() return code brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx() net: hns: Bugfix for Tx timeout handling in hns driver net: ipmr: ipmr_get_table() returns NULL nfp: freeing the wrong variable mlxsw: spectrum_switchdev: Check status of memory allocation mlxsw: spectrum_switchdev: Remove unused variable mlxsw: spectrum_router: Fix use-after-free in route replace mlxsw: spectrum_router: Add missing rollback samples/bpf: fix a build issue bridge: mdb: fix leak on complete_info ptr on fail path tap: convert a mutex to a spinlock cxgb4: fix BUG() on interrupt deallocating path of ULD qed: Fix printk option passed when printing ipv6 addresses net: Fix minor code bug in timestamping.txt net: stmmac: Make 'alloc_dma_[rt]x_desc_resources()' look even closer ...
2017-07-12datagram: fix kernel-doc commentsstephen hemminger
An underscore in the kernel-doc comment section has special meaning and mis-use generates an errors. ./net/core/datagram.c:207: ERROR: Unknown target name: "msg". ./net/core/datagram.c:379: ERROR: Unknown target name: "msg". ./net/core/datagram.c:816: ERROR: Unknown target name: "t". Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-12net: ipmr: ipmr_get_table() returns NULLDan Carpenter
The ipmr_get_table() function doesn't return error pointers it returns NULL on error. Fixes: 4f75ba6982bc ("net: ipmr: Add ipmr_rtm_getroute") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-11bridge: mdb: fix leak on complete_info ptr on fail pathEduardo Valentin
We currently get the following kmemleak report: unreferenced object 0xffff8800039d9820 (size 32): comm "softirq", pid 0, jiffies 4295212383 (age 792.416s) hex dump (first 32 bytes): 00 0c e0 03 00 88 ff ff ff 02 00 00 00 00 00 00 ................ 00 00 00 01 ff 11 00 02 86 dd 00 00 ff ff ff ff ................ backtrace: [<ffffffff8152b4aa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff811d8ec8>] kmem_cache_alloc_trace+0xb8/0x1c0 [<ffffffffa0389683>] __br_mdb_notify+0x2a3/0x300 [bridge] [<ffffffffa038a0ce>] br_mdb_notify+0x6e/0x70 [bridge] [<ffffffffa0386479>] br_multicast_add_group+0x109/0x150 [bridge] [<ffffffffa0386518>] br_ip6_multicast_add_group+0x58/0x60 [bridge] [<ffffffffa0387fb5>] br_multicast_rcv+0x1d5/0xdb0 [bridge] [<ffffffffa037d7cf>] br_handle_frame_finish+0xcf/0x510 [bridge] [<ffffffffa03a236b>] br_nf_hook_thresh.part.27+0xb/0x10 [br_netfilter] [<ffffffffa03a3738>] br_nf_hook_thresh+0x48/0xb0 [br_netfilter] [<ffffffffa03a3fb9>] br_nf_pre_routing_finish_ipv6+0x109/0x1d0 [br_netfilter] [<ffffffffa03a4400>] br_nf_pre_routing_ipv6+0xd0/0x14c [br_netfilter] [<ffffffffa03a3c27>] br_nf_pre_routing+0x197/0x3d0 [br_netfilter] [<ffffffff814a2952>] nf_iterate+0x52/0x60 [<ffffffff814a29bc>] nf_hook_slow+0x5c/0xb0 [<ffffffffa037ddf4>] br_handle_frame+0x1a4/0x2c0 [bridge] This happens when switchdev_port_obj_add() fails. This patch frees complete_info object in the fail path. Reviewed-by: Vallish Vaidyeshwara <vallish@amazon.com> Signed-off-by: Eduardo Valentin <eduval@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-11Merge tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "The main item here is support for v12.y.z ("Luminous") clusters: RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS feature bits, and various other changes in the RADOS client protocol. On top of that we have a new fsc mount option to allow supplying fscache uniquifier (similar to NFS) and the usual pile of filesystem fixes from Zheng" * tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits) libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS libceph: osd_state is 32 bits wide in luminous crush: remove an obsolete comment crush: crush_init_workspace starts with struct crush_work libceph, crush: per-pool crush_choose_arg_map for crush_do_rule() crush: implement weight and id overrides for straw2 libceph: apply_upmap() libceph: compute actual pgid in ceph_pg_to_up_acting_osds() libceph: pg_upmap[_items] infrastructure libceph: ceph_decode_skip_* helpers libceph: kill __{insert,lookup,remove}_pg_mapping() libceph: introduce and switch to decode_pg_mapping() libceph: don't pass pgid by value libceph: respect RADOS_BACKOFF backoffs libceph: make DEFINE_RB_* helpers more general libceph: avoid unnecessary pi lookups in calc_target() libceph: use target pi for calc_target() calculations libceph: always populate t->target_{oid,oloc} in calc_target() libceph: make sure need_resend targets reflect latest map libceph: delete from need_resend_linger before check_linger_pool_dne() ...
2017-07-08mpls: fix uninitialized in_label var warning in mpls_getrouteRoopa Prabhu
Fix the below warning generated by static checker: net/mpls/af_mpls.c:2111 mpls_getroute() error: uninitialized symbol 'in_label'." Fixes: 397fc9e5cefe ("mpls: route get support") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08bonding: avoid NETDEV_CHANGEMTU event when unregistering slaveWANG Cong
As Hongjun/Nicolas summarized in their original patch: " When a device changes from one netns to another, it's first unregistered, then the netns reference is updated and the dev is registered in the new netns. Thus, when a slave moves to another netns, it is first unregistered. This triggers a NETDEV_UNREGISTER event which is caught by the bonding driver. The driver calls bond_release(), which calls dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in the old netns). " This is a very special case, because the device is being unregistered no one should still care about the NETDEV_CHANGEMTU event triggered at this point, we can avoid broadcasting this event on this path, and avoid touching inetdev_event()/addrconf_notify() path. It requires to export __dev_set_mtu() to bonding driver. Reported-by: Hongjun Li <hongjun.li@6wind.com> Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08rds: tcp: use sock_create_lite() to create the accept socketSowmini Varadhan
There are two problems with calling sock_create_kern() from rds_tcp_accept_one() 1. it sets up a new_sock->sk that is wasteful, because this ->sk is going to get replaced by inet_accept() in the subsequent ->accept() 2. The new_sock->sk is a leaked reference in sock_graft() which expects to find a null parent->sk Avoid these problems by calling sock_create_lite(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-07libceph: osd_state is 32 bits wide in luminousIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07crush: remove an obsolete commentIlya Dryomov
Reflects ceph.git commit dca1ae1e0a6b02029c3a7f9dec4114972be26d50. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07crush: crush_init_workspace starts with struct crush_workIlya Dryomov
It is not just a pointer to crush_work, it is the whole structure. That is not a problem since it only contains a pointer. But it will be a problem if new data members are added to crush_work. Reflects ceph.git commit ee957dd431bfbeb6dadaf77764db8e0757417328. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()Ilya Dryomov
If there is no crush_choose_arg_map for a given pool, a NULL pointer is passed to preserve existing crush_do_rule() behavior. Reflects ceph.git commits 55fb91d64071552ea1bc65ab4ea84d3c8b73ab4b, dbe36e08be00c6519a8c89718dd47b0219c20516. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07crush: implement weight and id overrides for straw2Ilya Dryomov
bucket_straw2_choose needs to use weights that may be different from weight_items. For instance to compensate for an uneven distribution caused by a low number of values. Or to fix the probability biais introduced by conditional probabilities (see http://tracker.ceph.com/issues/15653 for more information). We introduce a weight_set for each straw2 bucket to set the desired weight for a given item at a given position. The weight of a given item when picking the first replica (first position) may be different from the weight the second replica (second position). For instance the weight matrix for a given bucket containing items 3, 7 and 13 could be as follows: position 0 position 1 item 3 0x10000 0x100000 item 7 0x40000 0x10000 item 13 0x40000 0x10000 When crush_do_rule picks the first of two replicas (position 0), item 7, 3 are four times more likely to be choosen by bucket_straw2_choose than item 13. When choosing the second replica (position 1), item 3 is ten times more likely to be choosen than item 7, 13. By default the weight_set of each bucket exactly matches the content of item_weights for each position to ensure backward compatibility. bucket_straw2_choose compares items by using their id. The same ids are also used to index buckets and they must be unique. For each item in a bucket an array of ids can be provided for placement purposes and they are used instead of the ids. If no replacement ids are provided, the legacy behavior is preserved. Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: apply_upmap()Ilya Dryomov
Previously, pg_to_raw_osds() didn't filter for existent OSDs because raw_to_up_osds() would filter for "up" ("up" is predicated on "exists") and raw_to_up_osds() was called directly after pg_to_raw_osds(). Now, with apply_upmap() call in there, nonexistent OSDs in pg_to_raw_osds() output can affect apply_upmap(). Introduce remove_nonexistent_osds() to deal with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: compute actual pgid in ceph_pg_to_up_acting_osds()Ilya Dryomov
Move raw_pg_to_pg() call out of get_temp_osds() and into ceph_pg_to_up_acting_osds(), for upcoming apply_upmap(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: pg_upmap[_items] infrastructureIlya Dryomov
pg_temp and pg_upmap encodings are the same (PG -> array of osds), except for the incremental remove: it's an empty mapping in new_pg_temp for pg_temp and a separate old_pg_upmap set for pg_upmap. (This isn't to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't looked at as an example for pg_upmap encoding.) Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap. __decode_pg_temp() stores into pg_temp union member, but since pg_upmap union member is identical, reading through pg_upmap later is OK. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: ceph_decode_skip_* helpersIlya Dryomov
Some of these won't be as efficient as they could be (e.g. ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32) once instead of advancing by sizeof(u32) len times), but that's fine and not worth a bunch of extra macro code. Replace skip_name_map() with ceph_decode_skip_map as an example. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: kill __{insert,lookup,remove}_pg_mapping()Ilya Dryomov
Switch to DEFINE_RB_FUNCS2-generated {insert,lookup,erase}_pg_mapping(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: introduce and switch to decode_pg_mapping()Ilya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: don't pass pgid by valueIlya Dryomov
Make __{lookup,remove}_pg_mapping() look like their ceph_spg_mapping counterparts: take const struct ceph_pg *. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: respect RADOS_BACKOFF backoffsIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: avoid unnecessary pi lookups in calc_target()Ilya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: use target pi for calc_target() calculationsIlya Dryomov
For luminous and beyond we are encoding the actual spgid, which requires operating with the correct pg_num, i.e. that of the target pool. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: always populate t->target_{oid,oloc} in calc_target()Ilya Dryomov
need_check_tiering logic doesn't make a whole lot of sense. Drop it and apply tiering unconditionally on every calc_target() call instead. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: make sure need_resend targets reflect latest mapIlya Dryomov
Otherwise we may miss events like PG splits, pool deletions, etc when we get multiple incremental maps at once. Because check_pool_dne() can now be fed an unlinked request, finish_request() needed to be taught to handle unlinked requests. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: delete from need_resend_linger before check_linger_pool_dne()Ilya Dryomov
When processing a map update consisting of multiple incrementals, we may end up running check_linger_pool_dne() on a lingering request that was previously added to need_resend_linger list. If it is concluded that the target pool doesn't exist, the request is killed off while still on need_resend_linger list, which leads to a crash on a NULL lreq->osd in kick_requests(): libceph: linger_id 18446462598732840961 pool does not exist BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: ceph_osdc_handle_map+0x4ae/0x870 Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: resend on PG splits if OSD has RESEND_ON_SPLITIlya Dryomov
Note that ceph_osd_request_target fields are updated regardless of RESEND_ON_SPLIT. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: drop need_resend from calc_target()Ilya Dryomov
Replace it with more fine-grained bools to separate updating ceph_osd_request_target fields and the decision to resend. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: MOSDOp v8 encoding (actual spgid + full hash)Ilya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: ceph_connection_operations::reencode_message() methodIlya Dryomov
Give upper layers a chance to reencode the message after the connection is negotiated and ->peer_features is set. OSD client will use this to support both luminous and pre-luminous OSDs (in a single cluster): the former need MOSDOp v8; the latter will continue to be sent MOSDOp v4. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: encode_{pgid,oloc}() helpersIlya Dryomov
Factor out encode_{pgid,oloc}() and use ceph_encode_string() for oid. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: introduce ceph_spg, ceph_pg_to_primary_shard()Ilya Dryomov
Store both raw pgid and actual spgid in ceph_osd_request_target. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: new pi->last_force_request_resendIlya Dryomov
The old (v15) pi->last_force_request_resend has been repurposed to make pre-RESEND_ON_SPLIT clients that don't check for PG splits but do obey pi->last_force_request_resend resend on splits. See ceph.git commit 189ca7ec6420 ("mon/OSDMonitor: make pre-luminous clients resend ops on split"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: fold [l]req->last_force_resend into ceph_osd_request_targetIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: support SERVER_JEWEL feature bitsIlya Dryomov
Only MON_STATEFUL_SUB, really. MON_ROUTE_OSDMAP and OSDSUBOP_NO_SNAPCONTEXT are irrelevant. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: handle non-empty dest in ceph_{oloc,oid}_copy()Ilya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: remove ceph_sanitize_features() workaroundIlya Dryomov
Reflects ceph.git commit ff1959282826ae6acd7134e1b1ede74ffd1cc04a. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>