Age  Commit message  (Author)
2020-12-14 libceph: move msgr1 protocol specific fields to its own struct (Ilya Dryomov)
A couple whitespace fixups, no functional changes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: move msgr1 protocol implementation to its own file (Ilya Dryomov)
A pure move, no other changes. Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}() helpers are also moved. msgr2 will bring its own, more efficient, variants based on iov_iter. Switching msgr1 to them was considered but decided against to avoid subtle regressions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: separate msgr1 protocol implementation (Ilya Dryomov)
In preparation for msgr2, define internal messenger <-> protocol interface (as opposed to external messenger <-> client interface, which is struct ceph_connection_operations) consisting of try_read(), try_write(), revoke(), revoke_incoming(), opened(), reset_session() and reset_protocol() ops. The semantics are exactly the same as they are now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
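As a rough illustration of what an internal messenger <-> protocol boundary of this kind can look like, the sketch below routes a few of the listed ops through a table of function pointers. Everything in it is hypothetical: `example_con`, `example_proto_ops` and the stub bodies are invented for illustration, and the commit message does not say that the kernel uses a function-pointer table (direct calls selected by protocol version would serve equally well).

```c
/* Hypothetical connection type; not the kernel's struct ceph_connection. */
struct example_con;

/* Hypothetical ops table covering a subset of the ops named in the commit. */
struct example_proto_ops {
	int  (*try_read)(struct example_con *con);
	int  (*try_write)(struct example_con *con);
	void (*reset_protocol)(struct example_con *con);
};

struct example_con {
	const struct example_proto_ops *ops; /* protocol chosen for this connection */
	int bytes_in;                        /* toy state touched by the stubs */
	int bytes_out;
};

/* Stub "msgr1" implementation: each call pretends to move one byte. */
static int v1_try_read(struct example_con *con)
{
	con->bytes_in++;
	return 1;
}

static int v1_try_write(struct example_con *con)
{
	con->bytes_out++;
	return 1;
}

static void v1_reset_protocol(struct example_con *con)
{
	con->bytes_in = 0;
	con->bytes_out = 0;
}

static const struct example_proto_ops example_v1_ops = {
	.try_read       = v1_try_read,
	.try_write      = v1_try_write,
	.reset_protocol = v1_reset_protocol,
};

/* Protocol independent code only ever goes through con->ops. */
static int example_con_work(struct example_con *con)
{
	int ret = con->ops->try_read(con);

	if (ret < 0)
		return ret;
	return con->ops->try_write(con);
}
```

With this shape, a second protocol would only need to supply its own ops table; the worker function stays unchanged.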
2020-12-14 libceph: export remaining protocol independent infrastructure (Ilya Dryomov)
In preparation for msgr2, make all protocol independent functions in messenger.c global. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: export zero_page (Ilya Dryomov)
In preparation for msgr2, make zero_page global. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: rename and export con->flags bits (Ilya Dryomov)
In preparation for msgr2, move the defines to the header file. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: rename and export con->state states (Ilya Dryomov)
In preparation for msgr2, rename msgr1 specific states and move the defines to the header file. Also drop state transition comments. They don't cover all possible transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more harm than good. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: make con->state an int (Ilya Dryomov)
unsigned long is a leftover from when con->state used to be a set of bits managed with set_bit(), clear_bit(), etc. Save a bit of memory. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: don't export ceph_messenger_{init_fini}() to modules (Ilya Dryomov)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: make sure our addr->port is zero and addr->nonce is non-zero (Ilya Dryomov)
Our messenger instance addr->port is normally zero -- anything else is nonsensical because as a client we connect to multiple servers and don't listen on any port. However, a user can supply an arbitrary addr:port via ip option and the port is currently preserved. Zero it. Conversely, make sure our addr->nonce is non-zero. A zero nonce is special: in combination with a zero port, it is used to blocklist the entire ip. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
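The two invariants described above can be sketched as a small helper. This is a minimal illustration under stated assumptions, not the kernel code: `example_addr`, `example_sanitize_my_addr()` and the `candidate_nonce` parameter are hypothetical names, and how a fresh nonce is actually obtained is not covered by the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-in for the messenger's own address. */
struct example_addr {
	uint16_t port;
	uint32_t nonce;
};

/*
 * Enforce both invariants from the commit message:
 *  - a client never listens, so its own port is forced to zero;
 *  - a zero nonce is special, so it is replaced with a non-zero value.
 * candidate_nonce is a caller-supplied replacement (assumed to come from
 * some random source); if it is zero as well, fall back to 1.
 */
static void example_sanitize_my_addr(struct example_addr *addr,
				     uint32_t candidate_nonce)
{
	addr->port = 0;
	if (addr->nonce == 0)
		addr->nonce = candidate_nonce != 0 ? candidate_nonce : 1;
}
```

An existing non-zero nonce is left alone; only the port is unconditionally cleared.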
2020-12-14 libceph: factor out ceph_con_get_out_msg() (Ilya Dryomov)
Move the logic of grabbing the next message from the queue into its own function. Like ceph_con_in_msg_alloc(), this is protocol independent. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: change ceph_con_in_msg_alloc() to take hdr (Ilya Dryomov)
ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and struct ceph_msg_header in general) is msgr1 specific. While the struct is deeply ingrained inside and outside the messenger, con->in_hdr field can be separated. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: change ceph_msg_data_cursor_init() to take cursor (Ilya Dryomov)
Make it possible to have local cursors and embed them outside struct ceph_msg. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: handle discarding acked and requeued messages separately (Ilya Dryomov)
Make it easier to follow and remove dependency on msgr1 specific CEPH_MSGR_TAG_SEQ. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: drop msg->ack_stamp field (Ilya Dryomov)
It is set in process_ack() but never used. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: remove redundant session reset log message (Ilya Dryomov)
Stick with pr_info message because session reset isn't an error most of the time. When it is (i.e. if the server denies the reconnect attempt), we get a bunch of other pr_err messages. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: clear con->peer_global_seq on RESETSESSION (Ilya Dryomov)
con->peer_global_seq is part of session state. Clear it when the server tells us to reset, not just in ceph_con_close(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: rename reset_connection() to ceph_con_reset_session() (Ilya Dryomov)
With just session reset bits left, rename appropriately. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: split protocol reset bits out of reset_connection() (Ilya Dryomov)
Move protocol reset bits into ceph_con_reset_protocol(), leaving just session reset bits. Note that con->out_skip is now reset on faults. This fixes a crash in the case of a stateful session getting a fault while in the middle of revoking a message. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: don't call reset_connection() on version/feature mismatches (Ilya Dryomov)
A fault due to a version mismatch or a feature set mismatch used to be treated differently from other faults: the connection would get closed without trying to reconnect and there was a ->bad_proto() connection op for notifying about that. This changed a long time ago, see commits 6384bb8b8e88 ("libceph: kill bad_proto ceph connection op") and 0fa6ebc600bc ("libceph: fix protocol feature mismatch failure path"). Nowadays these aren't any different from other faults (i.e. we try to reconnect even though the mismatch won't resolve until the server is replaced). reset_connection() calls there are rather confusing because reset_connection() resets a session together with an individual instance of the protocol. This is cleaned up in the next patch. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: lower exponential backoff delay (Ilya Dryomov)
The current setting allows the backoff to climb up to 5 minutes. This is too high -- it becomes hard to tell whether the client is stuck on something or just in backoff. In userspace, ms_max_backoff is defaulted to 15 seconds. Let's do the same. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
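To make the effect of the lower cap concrete, here is a minimal sketch of a capped exponential backoff. Only the 15 second ceiling comes from the commit message; the function name, the millisecond units and the 250 ms starting delay are assumptions chosen for illustration and are not taken from the kernel source.

```c
#include <stdint.h>

/* Hypothetical constants for illustration; units are milliseconds. */
#define EXAMPLE_BASE_DELAY_MS 250u    /* assumed starting delay */
#define EXAMPLE_MAX_DELAY_MS  15000u  /* 15 s cap, as in the commit message */

/*
 * Return the delay to use for the next reconnect attempt: start at the
 * base delay, then double on every call until the cap is reached.
 */
static uint32_t example_next_backoff_ms(uint32_t cur_ms)
{
	if (cur_ms == 0)
		return EXAMPLE_BASE_DELAY_MS;
	if (cur_ms >= EXAMPLE_MAX_DELAY_MS / 2)
		return EXAMPLE_MAX_DELAY_MS;
	return cur_ms * 2;
}
```

Under these assumptions the delay saturates at the cap after a handful of failed attempts, so an observer never has to wait more than 15 seconds to see whether the client is still retrying.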
2020-12-14 libceph: include middle_len in process_message() dout (Ilya Dryomov)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: implement updated ceph_mds_request_head structure (Jeff Layton)
When we added the btime feature in mainline ceph, we had to extend struct ceph_mds_request_args so that it could be set. Implement the same in the kernel client. Rename ceph_mds_request_head with a _old extension, and a union ceph_mds_request_args_ext to allow for the extended size of the new header format. Add the appropriate code to handle both formats in struct create_request_message and key the behavior on whether the peer supports CEPH_FEATURE_FS_BTIME. The gid_list field in the payload is now populated from the saved credential. For now, we don't add any support for setting the btime via setattr, but this does enable us to add that in the future. [ idryomov: break unnecessarily long lines ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: clean up argument lists to __prepare_send_request and __send_request (Jeff Layton)
We can always get the mdsc from the session, so there's no need to pass it in as a separate argument. Pass the session to __prepare_send_request as well, to prepare for later patches that will need to access it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: take a cred reference instead of tracking individual uid/gid (Jeff Layton)
Replace req->r_uid/r_gid with an r_cred pointer and take a reference to that at the point where we previously would sample the two. Use that to populate the uid and gid in the header and release the reference when the request is freed. This should enable us to later add support for sending supplementary group lists in MDS requests. [ idryomov: break unnecessarily long lines ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: don't reach into request header for readdir info (Jeff Layton)
We already have a pointer to the argument struct in req->r_args. Use that instead of groveling around in the ceph_mds_request_head. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: set osdmap epoch for setxattr (Xiubo Li)
When setting the file/dir layout, the MDS may need data pool info, so it has to check the osdmap. At present, if the MDS doesn't find the specified data pool, it will try to get the latest osdmap. If the client passes the osdmap epoch with setxattr, the MDS only needs to check that epoch of the osdmap. URL: https://tracker.ceph.com/issues/48504 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: remove redundant assignment to variable i (Colin Ian King)
The variable i is being initialized with a value that is never read and it is being updated later with a new value in a for-loop. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: add ceph.caps vxattr (Luis Henriques)
Add a new vxattr that allows userspace to list the caps for a specific directory or file. [ jlayton: change format delimiter to '/' ] Signed-off-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: when filling trace, call ceph_get_inode outside of mutexes (Jeff Layton)
Geng Jichao reported a rather complex deadlock involving several moving parts:
1) readahead is issued against an inode and some of its pages are locked while the read is in flight
2) the same inode is evicted from the cache, and this task gets stuck waiting for the page lock because of the above readahead
3) another task is processing a reply trace, and looks up the inode being evicted while holding the s_mutex. That ends up waiting for the eviction to complete
4) a write reply for an unrelated inode is then processed in the ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer caps, and that gets stuck waiting on the s_mutex held by 3.
The reply to "1" is stuck behind the write reply in "4", so we deadlock at that point. This patch changes the trace processing to call ceph_get_inode outside of the s_mutex and snap_rwsem, which should break the cycle above. [ idryomov: break unnecessarily long lines ] URL: https://tracker.ceph.com/issues/47998 Reported-by: Geng Jichao <gengjichao@jd.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 Revert "ceph: allow rename operation under different quota realms" (Luis Henriques)
This reverts commit dffdcd71458e699e839f0bf47c3d42d64210b939. When doing a rename across quota realms, there's a corner case that isn't handled correctly. Here's a testcase:
    mkdir files limit
    truncate files/file -s 10G
    setfattr limit -n ceph.quota.max_bytes -v 1000000
    mv files limit/
The above will succeed because ftruncate(2) won't immediately notify the MDSs with the new file size, and thus the quota realms stats won't be updated. Since the possible fixes for this issue would have a huge performance impact, the solution for now is to simply revert to returning -EXDEV when doing a cross quota realms rename. URL: https://tracker.ceph.com/issues/48203 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: downgrade warning from mdsmap decode to debug (Luis Henriques)
While the MDS cluster is unstable and changing state the client may get mdsmap updates that will trigger warnings:
    [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay)
    [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay)
    [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay)
This patch downgrades these warnings to debug, as they may flood the logs if the cluster is unstable for a while. Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: fix race in concurrent __ceph_remove_cap invocations (Luis Henriques)
A NULL pointer dereference may occur in __ceph_remove_cap with some of the callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and remove_session_caps_cb. Those callers hold the session->s_mutex, so they are prevented from concurrent execution, but ceph_evict_inode does not. Since the callers of this function hold the i_ceph_lock, the fix is simply a matter of returning immediately if caps->ci is NULL. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/43272 Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
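The shape of the fix described above, returning early when the cap has already been detached from its inode, can be sketched as follows. The types and the function are hypothetical simplifications (the real `__ceph_remove_cap` does far more), and the caller is assumed to hold the lock that protects the cap, as the commit message states for the real code.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the inode and cap types. */
struct example_inode {
	int nr_caps;
};

struct example_cap {
	struct example_inode *ci; /* NULL once the cap has been removed */
};

/*
 * Remove a cap from its inode. The caller is assumed to hold the lock
 * protecting the cap. Returns 1 if this call removed the cap and 0 if
 * another path had already removed it.
 */
static int example_remove_cap(struct example_cap *cap)
{
	struct example_inode *ci = cap->ci;

	if (ci == NULL)
		return 0; /* already removed: nothing to do, no dereference */

	ci->nr_caps--;
	cap->ci = NULL; /* mark as removed for any later caller */
	return 1;
}
```

Because the second caller sees `ci == NULL` and returns immediately, running the removal twice is harmless rather than a NULL pointer dereference.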
2020-12-14 ceph: pass down the flags to grab_cache_page_write_begin (Jeff Layton)
write_begin operations are passed a flags parameter that we need to mirror here, so that we don't (e.g.) recurse back into filesystem code inappropriately. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: add ceph.{cluster_fsid/client_id} vxattrs (Xiubo Li)
These two vxattrs exist only on the local client side. They make it easy to tell which mountpoint a file belongs to and help locate the debugfs path quickly. URL: https://tracker.ceph.com/issues/48057 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: add status debugfs file (Xiubo Li)
This will help list some useful client side info, like the client entity address/name and blocklisted status, etc. URL: https://tracker.ceph.com/issues/48057 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 libceph: remove unused port macros (Liu, Changcheng)
1. monitor's default port is defined by CEPH_MON_PORT 2. CEPH_PORT_START and CEPH_PORT_LAST are not needed. Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: ensure we have Fs caps when fetching dir link count (Jeff Layton)
The link count for a directory is defined as inode->i_subdirs + 2, (for "." and ".."). i_subdirs is only populated when Fs caps are held. Ensure we grab Fs caps when fetching the link count for a directory. [ idryomov: break unnecessarily long line ] URL: https://tracker.ceph.com/issues/48125 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: send dentry lease metrics to MDS daemon (Xiubo Li)
An older Ceph version that receives a metric message containing the dentry lease metric info will simply ignore it. URL: https://tracker.ceph.com/issues/43423 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: acquire Fs caps when getting dir stats (Jeff Layton)
We only update the inode's dirstats when we have Fs caps from the MDS. Declare a new VXATTR_FLAG_DIRSTAT that we set on all dirstats, and have the vxattr handling code acquire those caps when it's set. URL: https://tracker.ceph.com/issues/48104 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: fix up some warnings on W=1 builds (Jeff Layton)
Convert some decodes into unused variables into skips, and fix up some non-kerneldoc comment headers to not start with "/**". Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: queue MDS requests to REJECTED sessions when CLEANRECOVER is set (Jeff Layton)
Ilya noticed that the first access to a blacklisted mount would often get back -EACCES, but then subsequent calls would be OK. The problem is in __do_request. If the session is marked as REJECTED, a hard error is returned instead of waiting for a new session to come into being. When the session is REJECTED and the mount was done with recover_session=clean, queue the request to the waiting_for_map queue, which will be awoken after tearing down the old session. We can only do this for sync requests though, so check for async ones first and just let the callers redrive a sync request. URL: https://tracker.ceph.com/issues/47385 Reported-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: remove timeout on allowing reconnect after blocklisting (Jeff Layton)
30 minutes is a long time to wait, and this makes it difficult to test the feature by manually blocklisting clients. Remove the timeout infrastructure and just allow the client to reconnect at will. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: add new RECOVER mount_state when recovering session (Jeff Layton)
When recovering a session (a la recover_session=clean), we want to do all of the operations that we do on a forced umount, but changing the mount state to SHUTDOWN can cause queued MDS requests to fail when the session comes back. Most of those can idle until the session is recovered in this situation. Reserve SHUTDOWN state for forced umount, and make a new RECOVER state for the forced reconnect situation. Change several tests for equality with SHUTDOWN to test for that or RECOVER. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: make fsc->mount_state an int (Jeff Layton)
This field is an unsigned long currently, which is a bit of a waste on most arches since this just holds an enum. Make it (signed) int instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 ceph: don't WARN when removing caps due to blocklisting (Jeff Layton)
We expect to remove dirty caps when the client is blocklisted. Don't throw a warning in that case. [ idryomov: break unnecessarily long line ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-13 Linux 5.10 (Linus Torvalds)
2020-12-13 Merge tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull x86 fixes from Thomas Gleixner: "A set of x86 and membarrier fixes:
- Correct a few problems in the x86 and the generic membarrier implementation. Small corrections for assumptions about visibility which have turned out not to be true.
- Make the PAT bits for memory encryption correct vs 4K and 2M/1G page table entries as they are at a different location.
- Fix a concurrency issue in the local bandwidth readout of resource control leading to incorrect values
- Fix the ordering of allocating a vector for an interrupt. The order failed to respect the provided cpumask when the first attempt of allocating node local in the mask fails. It then tries the node instead of trying the full provided mask first. This leads to erroneous error messages and breaking the (user) supplied affinity request. Reorder it.
- Make the INT3 padding detection in optprobe work correctly"
* tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kprobes: Fix optprobe to detect INT3 padding correctly
  x86/apic/vector: Fix ordering in vector assignment
  x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
  x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
  membarrier: Execute SYNC_CORE on the calling thread
  membarrier: Explicitly sync remote cores when SYNC_CORE is requested
  membarrier: Add an actual barrier before rseq_preempt()
  x86/membarrier: Get rid of a dubious optimization
2020-12-13 Merge tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-block (Linus Torvalds)
Pull block fixes from Jens Axboe: "This should be it for 5.10. Mike and Song looked into the warning case, and thankfully it appears the fix was pretty trivial - we can just change the md device chunk type to unsigned int to get rid of it. They cannot currently be < 0, and nobody is checking for that either. We're reverting the discard changes as the corruption reports came in very late, and there's just no time to attempt to deal with it at this point. Reverting the changes in question is the right call for 5.10"
* tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-block:
  md: change mddev 'chunk_sectors' from int to unsigned
  Revert "md: add md_submit_discard_bio() for submitting discard bio"
  Revert "md/raid10: extend r10bio devs to raid disks"
  Revert "md/raid10: pull codes that wait for blocked dev into one function"
  Revert "md/raid10: improve raid10 discard request"
  Revert "md/raid10: improve discard request for far layout"
  Revert "dm raid: remove unnecessary discard limits for raid10"