summaryrefslogtreecommitdiff
path: root/fs/ceph/super.h
AgeCommit message (Collapse)Author
2016-04-11->getxattr(): pass dentry and inode as separate argumentsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-25ceph: kill ceph_get_dentry_parent_inode()Yan, Zheng
use vfs helper dget_parent() instead Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix security xattr deadlockYan, Zheng
When security is enabled, security module can call filesystem's getxattr/setxattr callbacks during d_instantiate(). For cephfs, d_instantiate() is usually called by MDS' dispatch thread, while handling MDS reply. If the MDS reply does not include xattrs and corresponding caps, getxattr/setxattr need to send a new request to MDS and waits for the reply. This makes MDS' dispatch sleep, nobody handles later MDS replies. The fix is make sure lookup/atomic_open reply include xattrs and corresponding caps. So getxattr can be handled by cached xattrs. This requires some modification to both MDS and request message. (Client tells MDS what caps it wants; MDS encodes proper caps in the reply) Smack security module may call setxattr during d_instantiate(). Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL to us. So just make setxattr return error when called by MDS' dispatch thread. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: kill ceph_empty_snapcIlya Dryomov
ceph_empty_snapc->num_snaps == 0 at all times. Passing such a snapc to ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only for sizing the request message. Further, in all four cases the subsequent ceph_osdc_build_request() is passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps and making ceph_empty_snapc entirely useless. The two cases where it actually mattered were removed in commits 860560904962 ("ceph: avoid sending unnessesary FLUSHSNAP message") and 23078637e054 ("ceph: fix queuing inode to mdsdir's snaprealm"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: don't enable rbytes mount option by defaultYan, Zheng
When rbytes mount option is enabled, directory size is recursive size. Recursive size is not updated instantly. This can cause directory size to change between successive stat(1) Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-04ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 supportYan, Zheng
Add support for the format change of MClientReply/MclientCaps. Also add code that denies access to inodes with pool_ns layouts. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2015-11-02ceph: make fsync() wait unsafe requests that created/modified inodeYan, Zheng
If we get a unsafe reply for request that created/modified inode, add the unsafe request to a list in the newly created/modified inode. So we can make fsync() wait these unsafe requests. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-07-31ceph: always re-send cap flushes when MDS recoversYan, Zheng
commit e548e9b93d3e565e42b938a99804114565be1f81 makes the kclient only re-send cap flush once during MDS failover. If the kclient sends a cap flush after MDS enters reconnect stage but before MDS recovers. The kclient will skip re-sending the same cap flush when MDS recovers. This causes problem for newly created inode. The MDS handles cap flushes before replaying unsafe requests, so it's possible that MDS find corresponding inode is missing when handling cap flush. The fix is reverting to old behaviour: always re-send when MDS recovers Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-06-25ceph: rework dcache readdirYan, Zheng
Previously our dcache readdir code relies on that child dentries in directory dentry's d_subdir list are sorted by dentry's offset in descending order. When adding dentries to the dcache, if a dentry already exists, our readdir code moves it to head of directory dentry's d_subdir list. This design relies on dcache internals. Al Viro suggests using ncpfs's approach: keeping array of pointers to dentries in page cache of directory inode. the validity of those pointers are presented by directory inode's complete and ordered flags. When a dentry gets pruned, we clear directory inode's complete flag in the d_prune() callback. Before moving a dentry to other directory, we clear the ordered flag for both old and new directory. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: pre-allocate data structure that tracks caps flushingYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: re-send flushing caps (which are revoked) in reconnect stageYan, Zheng
if flushing caps were revoked, we should re-send the cap flush in client reconnect stage. This guarantees that MDS processes the cap flush message before issuing the flushing caps to other client. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: track pending caps flushing globallyYan, Zheng
So we know TID of the oldest pending caps flushing. Later patch will send this information to MDS, so that MDS can trim its completed caps flush list. Tracking pending caps flushing globally also simplifies syncfs code. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: track pending caps flushing accuratelyYan, Zheng
Previously we do not trace accurate TID for flushing caps. when MDS failovers, we have no choice but to re-send all flushing caps with a new TID. This can cause problem because MDS can has already flushed some caps and has issued the same caps to other client. The re-sent cap flush has a new TID, which makes MDS unable to detect if it has already processed the cap flush. This patch adds code to track pending caps flushing accurately. When re-sending cap flush is needed, we use its original flush TID. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: don't pre-allocate space for cap release messagesYan, Zheng
Previously we pre-allocate cap release messages for each caps. This wastes lots of memory when there are large amount of caps. This patch make the code not pre-allocate the cap release messages. Instead, we add the corresponding ceph_cap struct to a list when releasing a cap. Later when flush cap releases is needed, we allocate the cap release messages dynamically. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: avoid sending unnessesary FLUSHSNAP messageYan, Zheng
when a snap notification contains no new snapshot, we can avoid sending FLUSHSNAP message to MDS. But we still need to create cap_snap in some case because it's required by write path and page writeback path Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: use empty snap context for uninline_data and get_pool_permYan, Zheng
Cached_context in ceph_snap_realm is directly accessed by uninline_data() and get_pool_perm(). This is racy in theory. both uninline_data() and get_pool_perm() do not modify existing object, they only create new object. So we can pass the empty snap context to them. Unlike cached_context in ceph_snap_realm, we do not need to protect the empty snap context. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25ceph: check OSD caps before read/writeYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: remove redundant declarationFabian Frederick
ceph_aops was already defined extern in addr.c section Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: fix dcache/nocache mount optionYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19ceph: provide seperate {inode,file}_operations for snapdirYan, Zheng
remove all unsupported operations from {inode,file}_operations. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19ceph: improve reference tracking for snaprealmYan, Zheng
When snaprealm is created, its initial reference count is zero. But in some rare cases, the newly created snaprealm is not referenced by anyone. This causes snaprealm with zero reference count not freed. The fix is set reference count of newly snaprealm to 1. The reference is return the function who requests to create the snaprealm. When the function finishes its job, it releases the reference. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: flush inline versionYan, Zheng
After converting inline data to normal data, client need to flush the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes cap messages (sent to MDS) contain inline_version and inline_data. Client always converts inline data to normal data before data write, so the inline data length part is always zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: convert inline data to normal data before data writeYan, Zheng
Before any data write, convert inline data to normal data and set i_inline_version to CEPH_INLINE_NONE. The OSD request that saves inline data to object contains 3 operations (CMPXATTR, WRITE and SETXATTR). It compares a xattr named 'inline_version' to prevent old data overwrites newer data. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: use getattr request to fetch inline dataYan, Zheng
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data in getattr reply will be copied to the page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: add inline data to pagecacheYan, Zheng
Request reply and cap message can contain inline data. add inline data to the page cache if there is Fc cap. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: introduce global empty snap contextYan, Zheng
Current snaphost code does not properly handle moving inode from one empty snap realm to another empty snap realm. After changing inode's snap realm, some dirty pages' snap context can be not equal to inode's i_head_snap. This can trigger BUG() in ceph_put_wrbuffer_cap_refs() The fix is introduce a global empty snap context for all empty snap realm. This avoids triggering the BUG() for filesystem with no snapshot. Fixes: http://tracker.ceph.com/issues/9928 Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17ceph: introduce a new inode flag indicating if cached dentries are orderedYan, Zheng
After creating/deleting/renaming file, offsets of sibling dentries may change. So we can not use cached dentries to satisfy readdir. But we can still use the cached dentries to conclude -ENOENT for lookup. This patch introduces a new inode flag indicating if child dentries are ordered. The flag is set at the same time marking a directory complete. After creating/deleting/renaming file, we clear the flag on directory inode. This prevents ceph_readdir() from using cached dentries to satisfy readdir syscall. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14ceph: additional debugfs outputJohn Spray
MDS session state and client global ID is useful instrumentation when testing. Signed-off-by: John Spray <john.spray@redhat.com>
2014-10-14ceph: include the initial ACL in create/mkdir/mknod MDS requestsYan, Zheng
Current code set new file/directory's initial ACL in a non-atomic manner. Client first sends request to MDS to create new file/directory, then set the initial ACL after the new file/directory is successfully created. The fix is include the initial ACL in create/mkdir/mknod MDS requests. So MDS can handle creating file/directory and setting the initial ACL in one request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14ceph: request xattrs if xattr_version is zeroYan, Zheng
Following sequence of events can happen. - Client releases an inode, queues cap release message. - A 'lookup' reply brings the same inode back, but the reply doesn't contain xattrs because MDS didn't receive the cap release message and thought client already has up-to-data xattrs. The fix is force sending a getattr request to MDS if xattrs_version is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client does not have xattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-06-06ceph: pre-allocate ceph_cap struct for ceph_add_cap()Yan, Zheng
So that ceph_add_cap() can be used while i_ceph_lock is locked. This simplifies the code that handle cap import/export. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-28ceph: clear directory's completeness when creating fileYan, Zheng
When creating a file, ceph_set_dentry_offset() puts the new dentry at the end of directory's d_subdirs, then set the dentry's offset based on directory's max offset. The offset does not reflect the real postion of the dentry in directory. Later readdir reply from MDS may change the dentry's position/offset. This inconsistency can cause missing/duplicate entries in readdir result if readdir is partly satisfied by dcache_readdir(). The fix is clear directory's completeness after creating/renaming file. It prevents later readdir from using dcache_readdir(). Fixes: http://tracker.ceph.com/issues/8025 Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04ceph: use fl->fl_file as owner identifier of flock and posix lockYan, Zheng
flock and posix lock should use fl->fl_file instead of process ID as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner is usually equal to fl->fl_file, but it also can be a customized value). The process ID of who holds the lock is just for F_GETLK fcntl(2). The fix is rename the 'pid' fields of struct ceph_mds_request_args and struct ceph_filelock to 'owner', rename 'pid_namespace' fields to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages. We also set the most significant bit of the 'owner' field. MDS can use that bit to distinguish between old and new clients. The MDS counterpart of this patch modifies the flock code to not take the 'pid_namespace' into consideration when checking conflict locks. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03ceph: fix ceph_dir_llseek()Yan, Zheng
Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for directory. For a fragmented directory, offset (frag_t, off) can be larger than inode->i_sb->s_maxbytes. At the very beginning of ceph_dir_llseek(), local variable old_offset is initialized to parameter offset. This doesn't make sense neither. Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset). Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-02-17ceph: make ceph_forget_all_cached_acls() static inlineGuangliang Zhao
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-30ceph: remove duplicate declaration of ceph_setattrPeter Rosin
Signed-off-by: Peter Rosin <peda@lysator.liu.se> Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-29ceph: fix posix ACL hooksSage Weil
The merge of commit 7221fe4c2ed7 ("ceph: add acl for cephfs") raced with upstream changes in the generic POSIX ACL code (eg commit 2aeccbe957d0 "fs: add generic xattr_acl handlers" and others). Some of the fallout was fixed in commit 4db658ea0ca ("ceph: Fix up after semantic merge conflict"), but it was incomplete: the set_acl inode_operation wasn't getting set, and the prototype needed to be adjusted a bit (it doesn't take a dentry anymore). Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-28ceph: Fix up after semantic merge conflictLinus Torvalds
The previous ceph-client merge resulted in ceph not even building, because there was a merge conflict that wasn't visible as an actual data conflict: commit 7221fe4c2ed7 ("ceph: add acl for cephfs") added support for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change a lot of the POSIX ACL helper functions to be much more helpful to filesystems (see for example commits 2aeccbe957d0 "fs: add generic xattr_acl handlers", 5bf3258fd2ac "fs: make posix_acl_chmod more useful" and 37bc15392a23 "fs: make posix_acl_create more useful") The reason this conflict wasn't obvious was many-fold: because it was a semantic conflict rather than a data conflict, it wasn't visible in the git merge as a conflict. And because the VFS tree hadn't been in linux-next, people hadn't become aware of it that way. And because I was at jury duty this morning, I was using my laptop and as a result not doing constant "allmodconfig" builds. Anyway, this fixes the build and generally removes a fair chunk of the Ceph POSIX ACL support code, since the improved helpers seem to match really well for Ceph too. But I don't actually have any way to *test* the end result, and I was really hoping for some ACK's for this. Oh, well. Not compiling certainly doesn't make things easier to test, so I'm committing this without the acks after having waited for four hours... Plus it's what I would have done for the merge had I noticed the semantic conflict.. Reported-by: Dave Jones <davej@redhat.com> Cc: Sage Weil <sage@inktank.com> Cc: Guangliang Zhao <lucienchao@gmail.com> Cc: Li Wang <li.wang@ubuntykylin.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ceph: add imported caps when handling cap export messageYan, Zheng
Version 3 cap export message includes information about the imported caps. It allows us to add the imported caps if the corresponding cap import message still hasn't been received. This allow us to handle situation that the importer MDS crashes and the cap import message is missing. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21ceph: check inode caps in ceph_d_revalidateYan, Zheng
Some inodes in readdir reply may have no caps. Getattr mds request for these inodes can return -ESTALE. The fix is consider dentry that links to inode with no caps as invalid. Invalid dentry causes a lookup request to send to the mds, the MDS will send caps back. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21ceph: fix cache revoke raceYan, Zheng
handle following sequence of events: - non-auth MDS revokes Fc cap. queue invalidate work - auth MDS issues Fc cap through request reply. i_rdcache_gen gets increased. - invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen, so it does nothing. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-31ceph: add acl for cephfsGuangliang Zhao
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Li Wang <li.wang@ubuntykylin.com> Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>
2013-12-13ceph: drop unconnected inodesYan, Zheng
Positve dentry and corresponding inode are always accompanied in MDS reply. So no need to keep inode in the cache after dropping all its aliases. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23ceph: queue cap release in __ceph_remove_cap()Yan, Zheng
call __queue_cap_release() in __ceph_remove_cap(), this avoids acquiring s_cap_lock twice. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-06ceph: remove ceph_lookup_inode()Yan, Zheng
commit 6f60f889 (ceph: fix freeing inode vs removing session caps race) introduced ceph_lookup_inode(). But there is already a ceph_find_inode() which provides similar function. So remove ceph_lookup_inode(), use ceph_find_inode() instead. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <alex.elder@linary.org> Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-06ceph: use fscache as a local presisent cacheMilosz Tanski
Adding support for fscache to the Ceph filesystem. This would bring it to on par with some of the other network filesystems in Linux (like NFS, AFS, etc...) In order to mount the filesystem with fscache the 'fsc' mount option must be passed. Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: Sage Weil <sage@inktank.com>
2013-08-15ceph: introduce i_truncate_mutexYan, Zheng
I encountered below deadlock when running fsstress wmtruncate work truncate MDS --------------- ------------------ -------------------------- lock i_mutex <- truncate file lock i_mutex (blocked) <- revoking Fcb (filelock to MIX) send request -> handle request (xlock filelock) At the initial time, there are some dirty pages in the page cache. When the kclient receives the truncate message, it reduces inode size and creates some 'out of i_size' dirty pages. wmtruncate work can't truncate these dirty pages because it's blocked by the i_mutex. Later when the kclient receives the cap message that revokes Fcb caps, It can't flush all dirty pages because writepages() only flushes dirty pages within the inode size. When the MDS handles the 'truncate' request from kclient, it waits for the filelock to become stable. But the filelock is stuck in unstable state because it can't finish revoking kclient's Fcb caps. The truncate pagecache locking has already caused lots of trouble for use. I think it's time simplify it by introducing a new mutex. We use the new mutex to prevent concurrent truncate_inode_pages(). There is no need to worry about race between buffered write and truncate_inode_pages(), because our "get caps" mechanism prevents them from concurrent execution. Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-09ceph: fix freeing inode vs removing session caps raceYan, Zheng
remove_session_caps() uses iterate_session_caps() to remove caps, but iterate_session_caps() skips inodes that are being deleted. So session->s_nr_caps can be non-zero after iterate_session_caps() return. We can fix the issue by waiting until deletions are complete. __wait_on_freeing_inode() is designed for the job, but it is not exported, so we use lookup inode function to access it. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-07-03ceph: fix pending vmtruncate raceYan, Zheng
The locking order for pending vmtruncate is wrong, it can lead to following race: write wmtruncate work ------------------------ ---------------------- lock i_mutex check i_truncate_pending check i_truncate_pending truncate_inode_pages() lock i_mutex (blocked) copy data to page cache unlock i_mutex truncate_inode_pages() The fix is take i_mutex before calling __ceph_do_pending_vmtruncate() Fixes: http://tracker.ceph.com/issues/5453 Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03ceph: Reconstruct the func ceph_reserve_caps.majianpeng
Drop ignored return value. Fix allocation failure case to not leak. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>