summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2013-07-02Merge tag 'dlm-3.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set includes a number of SCTP related fixes in the dlm, and a few other minor fixes and changes." * tag 'dlm-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: Avoid LVB truncation dlm: log an error for unmanaged lockspaces dlm: config: using strlcpy instead of strncpy dlm: remove duplicated include from lowcomms.c dlm: disable nagle for SCTP dlm: retry failed SCTP sends dlm: try other IPs when sctp init assoc fails dlm: clear correct bit during sctp init failure handling dlm: set sctp assoc id during setup dlm: clear correct init bit during sctp setup
2013-07-02Merge tag 'for-f2fs-3.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "This patch-set includes the following major enhancement patches: - remount_fs callback function - restore parent inode number to enhance the fsync performance - xattr security labels - reduce the number of redundant lock/unlock data pages - avoid frequent write_inode calls The other minor bug fixes are as follows. - endian conversion bugs - various bugs in the roll-forward recovery routine" * tag 'for-f2fs-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (56 commits) f2fs: fix to recover i_size from roll-forward f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes() f2fs: remove reusing any prefree segments f2fs: code cleanup and simplify in func {find/add}_gc_inode f2fs: optimize the init_dirty_segmap function f2fs: fix an endian conversion bug detected by sparse f2fs: fix crc endian conversion f2fs: add remount_fs callback support f2fs: recover wrong pino after checkpoint during fsync f2fs: optimize do_write_data_page() f2fs: make locate_dirty_segment() as static f2fs: remove unnecessary parameter "offset" from __add_sum_entry() f2fs: avoid freqeunt write_inode calls f2fs: optimise the truncate_data_blocks_range() range f2fs: use the F2FS specific flags in f2fs_ioctl() f2fs: sync dir->i_size with its block allocation f2fs: fix i_blocks translation on various types of files f2fs: set sb->s_fs_info before calling parse_options() f2fs: support xattr security labels f2fs: fix iget/iput of dir during recovery ...
2013-07-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmwLinus Torvalds
Pull GFS2 updates from Steven Whitehouse: "There are a few bug fixes for various, mostly very minor corner cases, plus some interesting new features. The new features include atomic_open whose main benefit will be the reduction in locking overhead in case of combined lookup/create and open operations, sorting the log buffer lists by block number to improve the efficiency of AIL writeback, and aggressively issuing revokes in gfs2_log_flush to reduce overhead when dropping glocks." * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: GFS2: Reserve journal space for quota change in do_grow GFS2: Fix fstrim boundary conditions GFS2: fix warning message GFS2: aggressively issue revokes in gfs2_log_flush GFS2: fix regression in dir_double_exhash GFS2: Add atomic_open support GFS2: Only do one directory search on create GFS2: fix error propagation in init_threads() GFS2: Remove no-op wrapper function GFS2: Cocci spatch "ptr_ret.spatch" GFS2: Eliminate gfs2_rg_lops GFS2: Sort buffer lists by inplace block number
2013-07-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...
2013-07-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS patches (part 1) from Al Viro: "The major change in this pile is ->readdir() replacement with ->iterate(), dealing with ->f_pos races in ->readdir() instances for good. There's a lot more, but I'd prefer to split the pull request into several stages and this is the first obvious cutoff point." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (67 commits) [readdir] constify ->actor [readdir] ->readdir() is gone [readdir] convert ecryptfs [readdir] convert coda [readdir] convert ocfs2 [readdir] convert fatfs [readdir] convert xfs [readdir] convert btrfs [readdir] convert hostfs [readdir] convert afs [readdir] convert ncpfs [readdir] convert hfsplus [readdir] convert hfs [readdir] convert befs [readdir] convert cifs [readdir] convert freevxfs [readdir] convert fuse [readdir] convert hpfs reiserfs: switch reiserfs_readdir_dentry to inode reiserfs: is_privroot_deh() needs only directory inode, actually ...
2013-07-02sync: don't block the flusher thread waiting on IODave Chinner
When sync does it's WB_SYNC_ALL writeback, it issues data Io and then immediately waits for IO completion. This is done in the context of the flusher thread, and hence completely ties up the flusher thread for the backing device until all the dirty inodes have been synced. On filesystems that are dirtying inodes constantly and quickly, this means the flusher thread can be tied up for minutes per sync call and hence badly affect system level write IO performance as the page cache cannot be cleaned quickly. We already have a wait loop for IO completion for sync(2), so cut this out of the flusher thread and delegate it to wait_sb_inodes(). Hence we can do rapid IO submission, and then wait for it all to complete. Effect of sync on fsmark before the patch: FSUse% Count Size Files/sec App Overhead ..... 0 640000 4096 35154.6 1026984 0 720000 4096 36740.3 1023844 0 800000 4096 36184.6 916599 0 880000 4096 1282.7 1054367 0 960000 4096 3951.3 918773 0 1040000 4096 40646.2 996448 0 1120000 4096 43610.1 895647 0 1200000 4096 40333.1 921048 And a single sync pass took: real 0m52.407s user 0m0.000s sys 0m0.090s After the patch, there is no impact on fsmark results, and each individual sync(2) operation run concurrently with the same fsmark workload takes roughly 7s: real 0m6.930s user 0m0.000s sys 0m0.039s IOWs, sync is 7-8x faster on a busy filesystem and does not have an adverse impact on ongoing async data write operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-02Btrfs: wait ordered range before doing direct ioJosef Bacik
My recent truncate patch uncovered this bug, but I can reproduce it without the truncate patch. If you mount with -o compress-force, do a direct write to some area, do a buffered write to some other area, and then do a direct read you will get the wrong data for where you did the buffered write. This is because the generic direct io helpers only call filemap_write_and_wait once, and for compression we need it twice. So to be safe add the btrfs_wait_ordered_range to the start of the direct io function to make sure any compressed writes have truly been written. This patch makes xfstests 130 pass when you mount with -o compress-force=lzo. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: only do the tree_mod_log_free_eb if this is our last refJosef Bacik
There is another bug in the tree mod log stuff in that we're calling tree_mod_log_free_eb every single time a block is cow'ed. The problem with this is that if this block is shared by multiple snapshots we will call this multiple times per block, so if we go to rewind the mod log for this block we'll BUG_ON() in __tree_mod_log_rewind because we try to rewind a free twice. We only want to call tree_mod_log_free_eb if we are actually freeing the block. With this patch I no longer hit the panic in __tree_mod_log_rewind. Thanks, Cc: stable@vger.kernel.org Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: hold the tree mod lock in __tree_mod_log_rewindJosef Bacik
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk forward in the tree mod entries, otherwise we'll end up with random entries and trip the BUG_ON() at the front of __tree_mod_log_rewind. This fixes the panics people were seeing when running find /whatever -type f -exec btrfs fi defrag {} \; Thansk, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: make backref walking code handle skinny metadataJosef Bacik
I missed fixing the backref stuff when I introduced the skinny metadata. If you try and do things like snapshot aware defrag with skinny metadata you are going to see tons of warnings related to the backref count being less than 0. This is because the delayed refs will be found for stuff just fine, but it won't find the skinny metadata extent refs. With this patch I'm not seeing warnings anymore. Thanks, Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix crash regarding to ulist_add_mergeLiu Bo
Several users reported this crash of NULL pointer or general protection, the story is that we add a rbtree for speedup ulist iteration, and we use krealloc() to address ulist growth, and krealloc() use memcpy to copy old data to new memory area, so it's OK for an array as it doesn't use pointers while it's not OK for a rbtree as it uses pointers. So krealloc() will mess up our rbtree and it ends up with crash. Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix several potential problems in copy_nocow_pages_for_inodeMiao Xie
- It makes no sense that we deal with a inode in the dead tree. - fix the race between dio and page copy by waiting the dio completion - avoid the page copy vs truncate/punch hole - check if the page is in the page cache or not Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: cleanup the code of copy_nocow_pages_for_inode()Miao Xie
- It make no sense that we continue to do something after the error happened, just go back with this patch. - remove some check of copy_nocow_pages_for_inode(), such as page check after write, inode check in the end of the function, because we are sure they exist. - remove the unnecessary goto in the return value check of the write Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix oops when recovering the file data by scrub functionMiao Xie
We get oops while running btrfs replace start test, ------------[ cut here ]------------ kernel BUG at mm/filemap.c:608! [SNIP] Call Trace: [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs] [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs] [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs] [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0 [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs] [<ffffffff8106f2f0>] kthread+0xc0/0xd0 [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80 [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0 [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80 [SNIP] RIP [<ffffffff8111f4c5>] unlock_page+0x35/0x40 RSP <ffff88010316bb98> ---[ end trace 421e79ad0dd72c7d ]--- it is because we forgot to lock the page again after we read data to the page. Fix it. Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: make the chunk allocator completely tree locklessJosef Bacik
When adjusting the enospc rules for relocation I ran into a deadlock because we were relocating the only system chunk and that forced us to try and allocate a new system chunk while holding locks in the chunk tree, which caused us to deadlock. To fix this I've moved all of the dev extent addition and chunk addition out to the delayed chunk completion stuff. We still keep the in-memory stuff which makes sure everything is consistent. One change I had to make was to search the commit root of the device tree to find a free dev extent, and hold onto any chunk em's that we allocated in that transaction so we do not allocate the same dev extent twice. This has the side effect of fixing a bug with balance that has been there ever since balance existed. Basically you can free a block group and it's dev extent and then immediately allocate that dev extent for a new block group and write stuff to that dev extent, all within the same transaction. So if you happen to crash during a balance you could come back to a completely broken file system. This patch should keep these sort of things from happening in the future since we won't be able to allocate free'd dev extents until after the transaction commits. This has passed all of the xfstests and my super annoying stress test followed by a balance. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: cleanup orphaned root orphan itemJosef Bacik
I hit a weird problem were my root item had been deleted but the orphan item had not. This isn't necessarily a problem, but it keeps the file system from being mounted. To fix this we just need to axe the orphan item if we can't find the fs root when we're putting them altogether. With this patch I was able to successfully mount my file system. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix wrong mirror number tuningMiao Xie
Now reading the data from the target device of the replace operation is allowed, so the mirror number that is greater than the stripes number of a chunk is valid, we will tune it when we find there is no target device later. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: cleanup redundant code in btrfs_submit_direct()Miao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: remove btrfs_sector_sum structureMiao Xie
Using the structure btrfs_sector_sum to keep the checksum value is unnecessary, because the extents that btrfs_sector_sum points to are continuous, we can find out the expected checksums by btrfs_ordered_sum's bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After removing bytenr, there is only one member in the structure, so it makes no sense to keep the structure, just remove it, and use a u32 array to store the checksum value. By this change, we don't use the while loop to get the checksums one by one. Now, we can get several checksum value at one time, it improved the performance by ~74% on my SSD (31MB/s -> 54MB/s). test command: # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: check if we can nocow if we don't have data spaceJosef Bacik
We always just try and reserve data space when we write, but if we are out of space but have prealloc'ed extents we should still successfully write. This patch will try and see if we can write to prealloc'ed space and if we can go ahead and allow the write to continue. With this patch we now pass xfstests generic/274. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delallocJosef Bacik
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which is completely fraking useless for us as we need to make sure pages are actually written before we go and check if there are ordered extents. So replace this with an open coding of try_to_writeback_inodes_sb_nr minus the writeback underway check so that we are sure to actually have flushed some dirty pages out and will have ordered extents to use. With this patch xfstests generic/273 now passes. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: use a percpu to keep track of possibly pinned bytesJosef Bacik
There are all of these checks in the ENOSPC code to see if committing the transaction would free up enough space to make the allocation. This is because early on we just committed the transaction and hoped and prayed, which resulted in cases where it took _forever_ to get an ENOSPC when we really were out of space. So we check space_info->bytes_pinned, except this isn't completely true because it doesn't account for space we may free but are stuck in delayed refs. So tests like xfstests 226 would fail because we wouldn't commit the transaction to free up the data space. So instead add a percpu counter that will be a little fuzzier, it will add bytes as soon as we try to free up the space, and remove any space it doesn't actually free up when we get around to doing the actual free. We then 0 out this counter every transaction period so we have a better idea of how much space we will actually free up by committing this transaction. With this patch we now pass xfstests 226. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: check for actual acls rather than just xattrs when caching no aclJosef Bacik
We have an optimization that will go ahead and cache no acls on an inode if there are no xattrs on the inode. This saves us a lookup later to check the acls for writes or any other access. The problem is I use selinux so I always have an xattr on inodes, so make this test a little smarter and check for the actual acl hash on the key and if it isn't there then we still get to cache no acl which makes everybody who uses selinux a little happier. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02pstore: Add hsize argument in write_buf call of pstore_ftrace_callAruna Balakrishnaiah
Incorporate the addition of hsize argument in write_buf callback of pstore. This was forgotten in 6bbbca735936e15b9431882eceddcf6dff76e03c pstore: Pass header size in the pstore write callback Causing a build failure when ftrace and pstore are enabled. Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-02f2fs: fix to recover i_size from roll-forwardJaegeuk Kim
If user requests many data writes and fsync together, the last updated i_size should be stored to the inode block consistently. But, previous write_end just marks the inode as dirty and doesn't update its metadata into its inode block. After that, fsync just writes the inode block with newly updated data index excluding inode metadata updates. So, this patch introduces write_end in which updates inode block too when the i_size is changed. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes()Gu Zheng
As destroy_fsync_dnodes() is a simple list-cleanup func, so delete the unused and unrelated f2fs_sb_info argument of it. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02f2fs: remove reusing any prefree segmentsJaegeuk Kim
This patch removes check_prefree_segments initially designed to enhance the performance by narrowing the range of LBA usage across the whole block device. When allocating a new segment, previous f2fs tries to find proper prefree segments, and then, if finds a segment, it reuses the segment for further data or node block allocation. However, I found that this was totally wrong approach since the prefree segments have several data or node blocks that will be used by the roll-forward mechanism operated after sudden-power-off. Let's assume the following scenario. /* write 8MB with fsync */ for (i = 0; i < 2048; i++) { offset = i * 4096; write(fd, offset, 4KB); fsync(fd); } In this case, naive segment allocation sequence will be like: data segment: x, x+1, x+2, x+3 node segment: y, y+1, y+2, y+3. But, if we can reuse prefree segments, the sequence can be like: data segment: x, x+1, y, y+1 node segment: y, y+1, y+2, y+3. Because, y, y+1, and y+2 became prefree segments one by one, and those are reused by data allocation. After conducting this workload, we should consider how to recover the latest inode with its data. If we reuse the prefree segments such as y or y+1, we lost the old node blocks so that f2fs even cannot start roll-forward recovery. Therefore, I suggest that we should remove reusing prefree segments. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02f2fs: code cleanup and simplify in func {find/add}_gc_inodeGu Zheng
This patch simplifies list operations in find_gc_inode and add_gc_inode. Just simple code cleanup. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> [Jaegeuk Kim: add description] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02f2fs: optimize the init_dirty_segmap functionNamjae Jeon
Optimize the while loop condition Since this condition will always be true and while loop will be terminated by the following condition in code: if (segno >= TOTAL_SEGS(sbi)) break; Hence we can replace the while loop condition with while(1) instead of always checking for segno to be less than Total segs. Also we do not need to use TOTAL_SEGS() everytime. We can store this value in a local variable since this value is constant. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02f2fs: fix an endian conversion bug detected by sparseJaegeuk Kim
This patch should fix the following bug reported by kbuild test robot. fs/f2fs/recovery.c:233:33: sparse: incorrect type in assignment (different base types) parse warnings: (new ones prefixed by >>) >> recovery.c:233: sparse: incorrect type in assignment (different base types) recovery.c:233: expected unsigned int [unsigned] [assigned] ofs_in_node recovery.c:233: got restricted __le16 [assigned] [usertype] ofs_in_node >> recovery.c:238: sparse: incorrect type in assignment (different base types) recovery.c:238: expected unsigned int [unsigned] ofs_in_node recovery.c:238: got restricted __le16 [assigned] [usertype] ofs_in_node Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-02f2fs: fix crc endian conversionJaegeuk Kim
While calculating CRC for the checkpoint block, we use __u32, but when storing the crc value to the disk, we use __le32. Let's fix the inconsistency. Reported-and-Tested-by: Oded Gabbay <ogabbay@advaoptical.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-01nfsd4: return delegation immediately if lease failsJ. Bruce Fields
This case shouldn't happen--the administrator shouldn't really allow other applications access to the export until clients have had the chance to reclaim their state--but if it does then we should set the "return this lease immediately" bit on the reply. That still leaves some small races, but it's the best the protocol allows us to do in the case a lease is ripped out from under us.... Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: do not throw away 4.1 lock state on last unlockJ. Bruce Fields
This reverts commit eb2099f31b0f090684a64ef8df44a30ff7c45fc2 "nfsd4: release lockowners on last unlock in 4.1 case". Trond identified language in rfc 5661 section 8.2.4 which forbids this behavior: Stateids associated with byte-range locks are an exception. They remain valid even if a LOCKU frees all remaining locks, so long as the open file with which they are associated remains open, unless the client frees the stateids via the FREE_STATEID operation. And bakeathon 2013 testing found a 4.1 freebsd client was getting an incorrect BAD_STATEID return from a FREE_STATEID in the above situation and then failing. The spec language honestly was probably a mistake but at this point with implementations already following it we're probably stuck with that. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: delegation-based open reclaims should bypass permissionsJ. Bruce Fields
We saw a v4.0 client's create fail as follows: - open create succeeds and gets a read delegation - client attempts to set mode on new file, gets DELAY while server recalls delegation. - client attempts a CLAIM_DELEGATE_CUR open using the delegation, gets error because of new file mode. This probably can't happen on a recent kernel since we're no longer giving out delegations on create opens. Nevertheless, it's a bug--reclaim opens should bypass permission checks. Reported-by: Steve Dickson <steved@redhat.com> Reported-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: minor read_buf cleanupJ. Bruce Fields
The code to step to the next page seems reasonably self-contained. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: fix decoding of compounds across page boundariesJ. Bruce Fields
A freebsd NFSv4.0 client was getting rare IO errors expanding a tarball. A network trace showed the server returning BAD_XDR on the final getattr of a getattr+write+getattr compound. The final getattr started on a page boundary. I believe the Linux client ignores errors on the post-write getattr, and that that's why we haven't seen this before. Cc: stable@vger.kernel.org Reported-by: Rick Macklem <rmacklem@uoguelph.ca> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: clean up nfs4_open_delegationJ. Bruce Fields
The nfs4_open_delegation logic is unecessarily baroque. Also stop pretending we support write delegations in several places. Some day we will support write delegations, but when that happens adding back in these flag parameters will be the easy part. For now they're just confusing. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01NFSD: Don't give out read delegations on createsSteve Dickson
When an exclusive create is done with the mode bits set (aka open(testfile, O_CREAT | O_EXCL, 0777)) this causes a OPEN op followed by a SETATTR op. When a read delegation is given in the OPEN, it causes the SETATTR to delay with EAGAIN until the delegation is recalled. This patch caused exclusive creates to give out a write delegation (which turn into no delegation) which allows the SETATTR seamlessly succeed. Signed-off-by: Steve Dickson <steved@redhat.com> [bfields: do this for any CREATE, not just exclusive; comment] Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: allow client to send no cb_sec flavorsJ. Bruce Fields
In testing I notice that some of the pynfs tests forget to send any cb_sec flavors, and that we haven't necessarily errored out in that case before. I'll fix pynfs, but am also inclined to default to trying AUTH_NONE in that case in case this is something clients actually do. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: fail attempts to request gss on the backchannelJ. Bruce Fields
We don't support gss on the backchannel. We should state that fact up front rather than just letting things continue and later making the client try to figure out why the backchannel isn't working. Trond suggested instead returning NFS4ERR_NOENT. I think it would be tricky for the client to distinguish between the case "I don't support gss on the backchannel" and "I can't find that in my cache, please create another context and try that instead", and I'd prefer something that currently doesn't have any other meaning for this operation, hence the (somewhat arbitrary) NFS4ERR_ENCR_ALG_UNSUPP. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01nfsd4: implement minimal SP4_MACH_CREDJ. Bruce Fields
Do a minimal SP4_MACH_CRED implementation suggested by Trond, ignoring the client-provided spo_must_* arrays and just enforcing credential checks for the minimum required operations. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01svcrpc: store gss mech in svc_credJ. Bruce Fields
Store a pointer to the gss mechanism used in the rq_cred and cl_cred. This will make it easier to enforce SP4_MACH_CRED, which needs to compare the mechanism used on the exchange_id with that used on protected operations. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01net: avoid calling sched_clock when LLS is offEliezer Tamir
Change Low Latency Sockets code for select and poll so that when LLS is disabled sched_clock() is never called. Also, avoid sending POLL_LL to sockets if disabled. Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01ceph: tidy ceph_mdsmap_decode() a littleDan Carpenter
I introduced a new temporary variable "info" instead of "m->m_info[mds]". Also I reversed the if condition and pulled everything in one indent level. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01ceph: improve error handling in ceph_mdsmap_decodeEmil Goode
This patch makes the following improvements to the error handling in the ceph_mdsmap_decode function: - Add a NULL check for return value from kcalloc - Make use of the variable err Signed-off-by: Emil Goode <emilgoode@gmail.com> Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-01ceph: fix up comment for ceph_count_locks() as to which lock to holdJim Schutt
Signed-off-by: Jim Schutt <jaschut@sandia.gov> Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncateJosef Bacik
This has plagued us forever and I'm so over working around it. When we truncate down to a non-page aligned offset we will call btrfs_truncate_page to zero out the end of the page and write it back to disk, this will keep us from exposing stale data if we truncate back up from that point. The problem with this is it requires data space to do this, and people don't really expect to get ENOSPC from truncate() for these sort of things. This also tends to bite the orphan cleanup stuff too which keeps people from mounting. To get around this we can just move this into btrfs_cont_expand() to make sure if we are truncating up from a non-page size aligned i_size we will zero out the rest of this page so that we don't expose stale data. This will give ENOSPC if you try to truncate() up or if you try to write past the end of isize, which is much more reasonable. This fixes xfstests generic/083 failing to mount because of the orphan cleanup failing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: optimize reada_for_balanceJosef Bacik
This patch does two things. First we no longer explicitly read in the blocks we're trying to readahead. For things like balance_level we may never actually use the blocks so this just adds uneeded latency, and balance_level and split_node will both read in the blocks they care about explicitly so if the blocks need to be waited on it will be done there. Secondly we no longer drop the path if we do readahead, we just set the path blocking before we call reada_for_balance() and then we're good to go. Hopefully this will cut down on the number of re-searches. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: optimize read_block_for_searchJosef Bacik
This patch does two things, first it only does one call to btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then again with gen specified. The other thing is to call btrfs_read_buffer() on the buffer we've found instead of dropping it and then calling read_tree_block(). This will keep us from doing yet another radix tree lookup for a buffer we've already found. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: unlock extent range on enospc in compressed submitJosef Bacik
A user reported a deadlock where the async submit thread was blocked on the lock_extent() lock, and then everybody behind him was locked on the page lock for the page he was holding. Looking at the code I noticed we do not unlock the extent range when we get ENOSPC and goto retry. This is bad because we immediately try to lock that range again to do the cow, which will cause a deadlock. Fix this by unlocking the range. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>