Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
"The biggest chunk is a series of patches from Ilya that add support
for new Ceph osd and crush map features, including some new tunables,
primary affinity, and the new encoding that is needed for erasure
coding support. This brings things into parity with the server side
and the looming firefly release. There is also support for allocation
hints in RBD that help limit fragmentation on the server side.
There is also a series of patches from Zheng fixing NFS reexport,
directory fragmentation support, flock vs fnctl behavior, and some
issues with clustered MDS.
Finally, there are some miscellaneous fixes from Yunchuan Wen for
fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
behavior"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
ceph: skip invalid dentry during dcache readdir
libceph: dump pool {read,write}_tier to debugfs
libceph: output primary affinity values on osdmap updates
ceph: flush cap release queue when trimming session caps
ceph: don't grabs open file reference for aborted request
ceph: drop extra open file reference in ceph_atomic_open()
ceph: preallocate buffer for readdir reply
libceph: enable PRIMARY_AFFINITY feature bit
libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
libceph: add support for osd primary affinity
libceph: add support for primary_temp mappings
libceph: return primary from ceph_calc_pg_acting()
libceph: switch ceph_calc_pg_acting() to new helpers
libceph: introduce apply_temps() helper
libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
libceph: ceph_can_shift_osds(pool) and pool type defines
libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
libceph: enable OSDMAP_ENC feature bit
libceph: primary_affinity decode bits
libceph: primary_affinity infrastructure
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes the following major enhancement patches.
- introduce large directory support
- introduce f2fs_issue_flush to merge redundant flush commands
- merge write IOs as much as possible aligned to the segment
- add sysfs entries to tune the f2fs configuration
- use radix_tree for the free_nid_list to reduce in-memory operations
- remove costly bit operations in f2fs_find_entry
- enhance the readahead flow for CP/NAT/SIT/SSA blocks
The other bug fixes are as follows:
- recover xattr node blocks correctly after sudden-power-cut
- fix to calculate the maximum number of node ids
- enhance to handle many error cases
And, there are a bunch of cleanups"
* tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
f2fs: fix wrong statistics of inline data
f2fs: check the acl's validity before setting
f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
f2fs: fix to cover io->bio with io_rwsem
f2fs: fix error path when fail to read inline data
f2fs: use list_for_each_entry{_safe} for simplyfying code
f2fs: avoid free slab cache under spinlock
f2fs: avoid unneeded lookup when xattr name length is too long
f2fs: avoid unnecessary bio submit when wait page writeback
f2fs: return -EIO when node id is not matched
f2fs: avoid RECLAIM_FS-ON-W warning
f2fs: skip unnecessary node writes during fsync
f2fs: introduce fi->i_sem to protect fi's info
f2fs: change reclaim rate in percentage
f2fs: add missing documentation for dir_level
f2fs: remove unnecessary threshold
f2fs: throttle the memory footprint with a sysfs entry
f2fs: avoid to drop nat entries due to the negative nr_shrink
f2fs: call f2fs_wait_on_page_writeback instead of native function
f2fs: introduce nr_pages_to_write for segment alignment
...
|
|
Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.
The unpatched userspace tools will show the blockgroup as 'unknown'.
CC: Jeff Mahoney <jeffm@suse.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Reproducer:
mount /dev/ubda /mnt
mount -oremount,thread_pool=42 /mnt
Gives a crash:
? btrfs_workqueue_set_max+0x0/0x70
btrfs_resize_thread_pool+0xe3/0xf0
? sync_filesystem+0x0/0xc0
? btrfs_resize_thread_pool+0x0/0xf0
btrfs_remount+0x1d2/0x570
? kern_path+0x0/0x80
do_remount_sb+0xd9/0x1c0
do_mount+0x26a/0xbf0
? kfree+0x0/0x1b0
SyS_mount+0xc4/0x110
It's a call
btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
with
fs_info->scrub_wr_completion_workers = NULL;
as scrub wqs get created only on user's demand.
Patch skips not-created-yet workqueues.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Pull MTD updates from Brian Norris:
- A few SPI NOR ID definitions
- Kill the NAND "max pagesize" restriction
- Fix some x16 bus-width NAND support
- Add NAND JEDEC parameter page support
- DT bindings for NAND ECC
- GPMI NAND updates (subpage reads)
- More OMAP NAND refactoring
- New STMicro SPI NOR driver (now in 40 patches!)
- A few other random bugfixes
* tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd: (120 commits)
Fix index regression in nand_read_subpage
mtd: diskonchip: mem resource name is not optional
mtd: nand: fix mention to CONFIG_MTD_NAND_ECC_BCH
mtd: nand: fix GET/SET_FEATURES address on 16-bit devices
mtd: omap2: Use devm_ioremap_resource()
mtd: denali_dt: Use devm_ioremap_resource()
mtd: devices: elm: update DRIVER_NAME as "omap-elm"
mtd: devices: elm: configure parallel channels based on ecc_steps
mtd: devices: elm: clean elm_load_syndrome
mtd: devices: elm: check for hardware engine's design constraints
mtd: st_spi_fsm: Succinctly reorganise .remove()
mtd: st_spi_fsm: Allow loop to run at least once before giving up CPU
mtd: st_spi_fsm: Correct vendor name spelling issue - missing "M"
mtd: st_spi_fsm: Avoid duplicating MTD core code
mtd: st_spi_fsm: Remove useless consts from function arguments
mtd: st_spi_fsm: Convert ST SPI FSM (NOR) Flash driver to new DT partitions
mtd: st_spi_fsm: Move runtime configurable msg sequences into device's struct
mtd: st_spi_fsm: Supply the W25Qxxx chip specific configuration call-back
mtd: st_spi_fsm: Supply the S25FLxxx chip specific configuration call-back
mtd: st_spi_fsm: Supply the MX25xxx chip specific configuration call-back
...
|
|
I'm not sure why we weren't aborting here in the first place, it is obviously a
bad time from the fact that we print the leaf and yell loudly about it. Fix
this up, otherwise we panic because our path could be pointing into oblivion.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
btrfs_drop_extents can now return -EINVAL, but only one caller
in btrfs_clone was checking for it. This adds it to the
caller for inline extents, which is where we really need it.
Signed-off-by: Chris Mason <clm@fb.com>
|
|
This patch fix a regression caused by the following patch:
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
break while loop will make us call @spin_unlock() without
calling @spin_lock() before, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Steps to reproduce:
# mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
# mount /dev/sda8 /mnt
# btrfs scrub start -BR /mnt
# echo $? <--unverified errors make return value be 3
This is because we don't setup right mapping between physical
and logical address for raid56, which makes checksum mismatch.
But we will find everthing is fine later when rechecking using
btrfs_map_block().
This patch fixed the problem by settuping right mappings and
we only verify data stripes' checksums.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
To compress a small file range(<=blocksize) that is not
an inline extent can not save disk space at all. skip it can
save us some cpu time.
This patch can also fix wrong setting nocompression flag for
inode, say a case when @total_in is 4096, and then we get
@total_compressed 52,because we do aligment to page cache size
firstly, and then we get into conclusion @total_in=@total_compressed
thus we will clear this inode's compression flag.
An exception comes from inserting inline extent failure but we
still have @total_compressed < @total_in,so we will still reset
inode's flag, this is ok, because we don't have good compression
effect.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
There's no point building the path string in each iteration of the
send_hole loop, as it produces always the same string.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Originally following cmds will work:
# btrfs fi resize -10A <mnt>
# btrfs fi resize -10Gaha <mnt>
Filter the arg by checking the return pointer of memparse.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
During an incremental send, when we finish processing an inode (corresponding to
a regular file) we would assume the gap between the end of the last processed file
extent and the file's size corresponded to a file hole, and therefore incorrectly
send a bunch of zero bytes to overwrite that region in the file.
This affects only kernel 3.14.
Reproducer:
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
xfs_io -f -c "falloc -k 0 268435456" /mnt/foo
btrfs subvolume snapshot -r /mnt /mnt/mysnap0
xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo
xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo
xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo
xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo
xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo
xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo
xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo
xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo
xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo
xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo
xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo
xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo
xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo
xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo
xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo
xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo
xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo
xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo
xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo
xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo
xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo
xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo
xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo
xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo
xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo
xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo
xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo
xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo
xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo
xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo
xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo
xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo
xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo
xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo
xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo
xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo
xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo
xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo
xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo
xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo
xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo
xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo
xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo
xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo
xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo
xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo
xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo
xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo
xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo
xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo
xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo
xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo
xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo
xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo
xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo
xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo
xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo
xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo
xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo
xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo
xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo
xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo
xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo
xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo
xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo
xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo
xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo
xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo
xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo
xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo
xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo
xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo
xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo
xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo
xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo
xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo
xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo
xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo
xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo
xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo
xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo
xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo
xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo
xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo
xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo
xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo
xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo
xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo
xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo
xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo
xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo
xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo
xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo
xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo
xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo
xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo
xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo
xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo
xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo
xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo
xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo
btrfs subvolume snapshot -r /mnt /mnt/mysnap1
xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo
btrfs subvolume snapshot -r /mnt /mnt/mysnap2
md5sum /mnt/foo # returns 79e53f1466bfc09fd82b450689e6119e
md5sum /mnt/mysnap2/foo # returns 79e53f1466bfc09fd82b450689e6119e too
btrfs send /mnt/mysnap1 -f /tmp/1.snap
btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
btrfs receive /mnt -f /tmp/1.snap
btrfs receive /mnt -f /tmp/2.snap
md5sum /mnt/mysnap2/foo # returns 2bb414c5155767cedccd7063e51beabd !!
A testcase for xfstests follows soon too.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The error handling was copy and pasted from memdup_user(). It should be
checking for NULL obviously.
Fixes: abccd00f8af2 ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
While running fsstress and snapshots concurrently, we will hit something
like followings:
Thread 1 Thread 2
|->fallocate
|->write pages
|->join transaction
|->add ordered extent
|->end transaction
|->flushing data
|->creating pending snapshots
|->write data into src root's
fallocated space
After above work flows finished, we will get a state that source and
snapshot root share same space, but source root have written data into
fallocated space, this will make fsck fail to verify checksums for
snapshot root's preallocating file extent data.Nocow writting also
has this same problem.
Fix this problem by syncing snapshots with nocow writting:
1.for nocow writting,if there are pending snapshots, we will
fall into COW way.
2.if there are pending nocow writes, snapshots for this root
will be blocked until nocow writting finish.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When testing fsstress with snapshot making background, some snapshot
following problem.
Snapshot 270:
inode 323: size 0
Snapshot 271:
inode 323: size 349145
|-------Hole---|---------Empty gap-------|-------Hole-----|
0 122880 172032 349145
Snapshot 272:
inode 323: size 349145
|-------Hole---|------------Data---------|-------Hole-----|
0 122880 172032 349145
The fsstress operation on inode 323 is the following:
write: offset 126832 len 43124
truncate: size 349145
Since the write with offset is consist of 2 operations:
1. punch hole
2. write data
Hole punching is faster than data write, so hole punching in write
and truncate is done first and then buffered write, so the snapshot 271 got
empty gap, which will not pass btrfsck.
To fix the bug, this patch will change the write sequence which will
first punch a hole covering the write end if a hole is needed.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Print the message only when the device is seen for the first time.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When encountering memory pressure, testers have run into the following
lockdep warning. It was caused by __link_block_group calling kobject_add
with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
which gets us into reclaim context. The kobject doesn't actually need
to be added under the lock -- it just needs to ensure that it's only
added for the first block group to be linked.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc8-default #1 Not tainted
---------------------------------------------------------
kswapd0/169 just changed the state of lock:
(&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(&found->groups_sem){+++++.}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&found->groups_sem);
local_irq_disable();
lock(&delayed_node->mutex);
lock(&found->groups_sem);
<Interrupt>
lock(&delayed_node->mutex);
*** DEADLOCK ***
2 locks held by kswapd0/169:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
#1: (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root. We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If we remove a file that has inline data after mount, our statistics turns to
inaccurate.
cat /sys/kernel/debug/f2fs/status
- Inline_data Inode: 4294967295
Let's add stat_inc_inline_inode() to stat inline info of the file when lookup.
Change log from v1:
o stat in f2fs_lookup() instead of in do_read_inode() for excluding wrong stat.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Before setting the acl, call posix_acl_valid() to check if it is
valid or not.
Signed-off-by: zhangzhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Some storage devices show relatively high latencies to complete cache_flush
commands, even though their normal IO speed is prettry much high. In such
the case, it needs to merge cache_flush commands as much as possible to avoid
issuing them redundantly.
So, this patch introduces a mount option, "-o flush_merge", to mitigate such
the overhead.
If this option is enabled by user, F2FS merges the cache_flush commands and then
issues just one cache_flush on behalf of them. Once the single command is
finished, F2FS sends a completion signal to all the pending threads.
Note that, this option can be used under a workload consisting of very intensive
concurrent fsync calls, while the storage handles cache_flush commands slowly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.
Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We could have possibly added an extent_op to the locked_ref while we dropped
locked_ref->lock, so check for this case as well and loop around. Otherwise we
could lose flag updates which would lead to extent tree corruption. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
This was done to allow NO_COW to continue to be NO_COW after relocation but it
is not right. When relocating we will convert blocks to FULL_BACKREF that we
relocate. We can leave some of these full backref blocks behind if they are not
cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
then just drop the reloc tree. Then when we go to cow the block again we won't
lookup the extent flags because we won't think there has been a snapshot
recently which means we will do our normal ref drop thing instead of adding back
a tree ref and dropping the shared ref. This will cause btrfs_free_extent to
blow up because it can't find the ref we are trying to free. This was found
with my ref verifying tool. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Pull NFS client updates from Trond Myklebust:
"Highlights include:
- Stable fix for a use after free issue in the NFSv4.1 open code
- Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
- Optimise usage of readdirplus when confronted with 'ls -l' situations
- Soft mount bugfixes
- NFS over RDMA bugfixes
- NFSv4 close locking fixes
- Various NFSv4.x client state management optimisations
- Rename/unlink code cleanups"
* tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: pass string length to pr_notice message about readdir loops
NFSv4: Fix a use-after-free problem in open()
SUNRPC: rpc_restart_call/rpc_restart_call_prepare should clear task->tk_status
SUNRPC: Don't let rpc_delay() clobber non-timeout errors
SUNRPC: Ensure call_connect_status() deals correctly with SOFTCONN tasks
SUNRPC: Ensure call_status() deals correctly with SOFTCONN tasks
NFSv4: Ensure we respect soft mount timeouts during trunking discovery
NFSv4: Schedule recovery if nfs40_walk_client_list() is interrupted
NFS: advertise only supported callback netids
SUNRPC: remove KERN_INFO from dprintk() call sites
SUNRPC: Fix large reads on NFS/RDMA
NFS: Clean up: revert increase in READDIR RPC buffer max size
SUNRPC: Ensure that call_bind times out correctly
SUNRPC: Ensure that call_connect times out correctly
nfs: emit a fsnotify_nameremove call in sillyrename codepath
nfs: remove synchronous rename code
nfs: convert nfs_rename to use async_rename infrastructure
nfs: make nfs_async_rename non-static
nfs: abstract out code needed to complete a sillyrename
NFSv4: Clear the open state flags if the new stateid does not match
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Nothing major: the stricter permissions checking for sysfs broke a
staging driver; fix included. Greg KH said he'd take the patch but
hadn't as the merge window opened, so it's included here to avoid
breaking build"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
staging: fix up speakup kobject mode
Use 'E' instead of 'X' for unsigned module taint flag.
VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
kallsyms: fix percpu vars on x86-64 with relocation.
kallsyms: generalize address range checking
module: LLVMLinux: Remove unused function warning from __param_check macro
Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
module: remove MODULE_GENERIC_TABLE
module: allow multiple calls to MODULE_DEVICE_TABLE() per module
module: use pr_cont
|
|
skip dentries that were added before MDS issued FILE_SHARED to
client.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
There is no guarantee that the strings in the nfs_cache_array will be
NULL-terminated. In the event that we end up hitting a readdir loop, we
need to ensure that we pass the warning message the length of the
string.
Reported-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
ceph_atomic_open() calls ceph_open() after receiving the MDS reply.
ceph_open() grabs an extra open file reference. (The open request
already holds an open file reference)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
Preallocate buffer for readdir reply. Limit number of entries in
readdir reply according to the buffer size.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
This avoids 'cp -a' modifying layout of new files/directories.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
Remove unsupported symlink operations.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
|
|
When adjusting caps client wants, MDS does not record caps that are
not allowed. For non-auth MDS, it does not record WR caps. So when
a MDS reply changes a non-auth cap to auth cap, client needs to set
cap's mds_wanted according to the reply.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).
The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.
The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
VFS does not directly pass flock's operation code to filesystem's
flock callback. It translates the operation code to the form how
posix lock's parameters are presented.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
handle following sequence of events:
- client releases a inode with i_max_size > 0. The release message
is queued. (is not sent to the auth MDS)
- a 'lookup' request reply from non-auth MDS returns the same inode.
- client opens the inode in write mode. The version of inode trace
in 'open' request reply is equal to the cached inode's version.
- client requests new max size. The MDS ignores the request because
it does not affect client's write range
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
Pull xfs update from Dave Chinner:
"There are a couple of new fallocate features in this request - it was
decided that it was easiest to push them through the XFS tree using
topic branches and have the ext4 support be based on those branches.
Hence you may see some overlap with the ext4 tree merge depending on
how they including those topic branches into their tree. Other than
that, there is O_TMPFILE support, some cleanups and bug fixes.
The main changes in the XFS tree for 3.15-rc1 are:
- O_TMPFILE support
- allowing AIO+DIO writes beyond EOF
- FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
implementation
- FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
implementation
- IO verifier cleanup and rework
- stack usage reduction changes
- vm_map_ram NOIO context fixes to remove lockdep warings
- various bug fixes and cleanups"
* tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
xfs: fix directory hash ordering bug
xfs: extra semi-colon breaks a condition
xfs: Add support for FALLOC_FL_ZERO_RANGE
fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
xfs: inode log reservations are still too small
xfs: xfs_check_page_type buffer checks need help
xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
xfs: use NOIO contexts for vm_map_ram
xfs: don't leak EFSBADCRC to userspace
xfs: fix directory inode iolock lockdep false positive
xfs: allocate xfs_da_args to reduce stack footprint
xfs: always do log forces via the workqueue
xfs: modify verifiers to differentiate CRC from other errors
xfs: print useful caller information in xfs_error_report
xfs: add xfs_verifier_error()
xfs: add helper for updating checksums on xfs_bufs
xfs: add helper for verifying checksums on xfs_bufs
xfs: Use defines for CRC offsets in all cases
xfs: skip pointless CRC updates after verifier failures
xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore fixes from Tony Luck:
"Series of small bug fixes for pstore"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore: Fix memory leak when decompress using big_oops_buf
pstore: Fix buffer overflow while write offset equal to buffer size
pstore: Correct the max_dump_cnt clearing of ramoops
pstore: Fix NULL pointer fault if get NULL prz in ramoops_get_next_prz
pstore: skip zero size persistent ram buffer in traverse
pstore: clarify clearing of _read_cnt in ramoops_context
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"This series adds cached writeback support to fuse, improving write
throughput"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix "uninitialized variable" warning
fuse: Turn writeback cache on
fuse: Fix O_DIRECT operations vs cached writeback misorder
fuse: fuse_flush() should wait on writeback
fuse: Implement write_begin/write_end callbacks
fuse: restructure fuse_readpage()
fuse: Flush files on wb close
fuse: Trust kernel i_mtime only
fuse: Trust kernel i_size only
fuse: Connection bit for enabling writeback
fuse: Prepare to handle short reads
fuse: Linking file to inode helper
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a couple trivial cleanups and changes recovery log
messages from DEBUG to INFO"
* tag 'dlm-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: use INFO for recovery messages
fs: Include appropriate header file in dlm/ast.c
dlm: silence a harmless use after free warning
|