Pull AIO leak fixes from Ben LaHaise:
"I've put these two patches plus Linus's change through a round of
tests, and it passes millions of iterations of the aio numa
migratepage test, as well as a number of repetitions of a few simple
read and write tests.
The first patch fixes the memory leak Kent introduced, while the
second patch makes aio_migratepage() much more paranoid and robust"
* git://git.kvack.org/~bcrl/aio-next:
aio/migratepages: make aio migrate pages sane
aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
|
|
Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages
migration") the aio ring setup code has used a special per-ring backing
inode for the page allocations, rather than just using random anonymous
pages.
However, rather than remembering the pages as it allocated them, it
would allocate the pages, insert them into the file mapping (dirty, so
that they couldn't be free'd), and then forget about them. And then to
look them up again, it would mmap the mapping, and then use
"get_user_pages()" to get back an array of the pages we just created.
Now, not only is that incredibly inefficient, it also leaked all the
pages if the mmap failed (which could happen due to excessive number of
mappings, for example).
So clean it all up, making it much more straightforward. Also remove
some left-overs of the previous (broken) mm_populate() usage that was
removed in commit d6c355c7dabc ("aio: fix race in ring buffer page
lookup introduced by page migration support") but left the pointless and
now misleading MAP_POPULATE flag around.
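A minimal sketch of the straightforward approach described above, with structure and helper names recalled from the aio code of that era rather than quoted from the tree: allocate each ring page directly in the backing inode's mapping and remember it in ctx->ring_pages as we go, with no mmap()/get_user_pages() round trip afterwards.
    for (i = 0; i < nr_pages; i++) {
        struct page *page;

        /* Allocate (or find) the page at index i of the ring file's
         * mapping, returned locked and zeroed. */
        page = find_or_create_page(file->f_inode->i_mapping, i,
                                   GFP_HIGHUSER | __GFP_ZERO);
        if (!page)
            break;

        /* Keep it dirty so it cannot be reclaimed, and remember the
         * page pointer directly instead of looking it up again later. */
        SetPageUptodate(page);
        SetPageDirty(page);
        unlock_page(page);

        ctx->ring_pages[i] = page;
    }
    ctx->nr_pages = i;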
Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.
While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.
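A rough sketch of the resulting interface, with parameter names approximate rather than quoted from the tree: the caller declares the extra references it legitimately holds, and the migration core folds that into its expected page count instead of the caller adjusting page refcounts by hand.
    /* aio holds one reference of its own on each ring page, so it
     * tells the core migration code about it. */
    rc = migrate_page_move_mapping(mapping, new, old, NULL, mode,
                                   1 /* aio's extra reference */);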
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
|
Commit e34ecee2ae79 ("aio: Fix a trinity splat") reworked the percpu
reference counting to correct a bug trinity found. Unfortunately, the
change led to kioctxes being leaked because there was no final reference
count to put. Add that reference count back in to fix things.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
|
|
Pull xfs bugfixes from Ben Myers:
"This contains fixes for some asserts
related to project quotas, a memory leak, a hang when disabling group or
project quotas before disabling user quotas, Dave's email address, several
fixes for the alignment of file allocation to stripe unit/width geometry, a
fix for an assertion with xfs_zero_remaining_bytes, and the behavior of
metadata writeback in the face of IO errors.
Details:
- fix memory leak in xfs_dir2_node_removename
- fix quota assertion in xfs_setattr_size
- fix quota assertions in xfs_qm_vop_create_dqattach
- fix for hang when disabling group and project quotas before
disabling user quotas
- fix Dave Chinner's email address in MAINTAINERS
- fix for file allocation alignment
- fix for assertion in xfs_buf_stale by removing xfsbdstrat
- fix for alignment with swalloc mount option
- fix for "retry forever" semantics on IO errors"
* tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs:
xfs: abort metadata writeback on permanent errors
xfs: swalloc doesn't align allocations properly
xfs: remove xfsbdstrat error
xfs: align initial file allocations correctly
MAINTAINERS: fix incorrect mail address of XFS maintainer
xfs: fix infinite loop by detaching the group/project hints from user dquot
xfs: fix assertion failure at xfs_setattr_nonsize
xfs: fix false assertion at xfs_qm_vop_create_dqattach
xfs: fix memory leak in xfs_dir2_node_removename
|
|
Some pstore backing devices use on-board flash as persistent
storage. These have a limited number of write cycles, so it
is a poor idea to use them for high-frequency operations.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull driver core fix from Greg KH:
"Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue
in sysfs that has been reported"
* tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
sysfs: give different locking key to regular and bin files
|
|
Pull two Ceph fixes from Sage Weil:
"One of these is fixing a regression from the d_flags file type patch
that went into -rc1 that broke instantiation of inodes and dentries
(we were doing dentries first). The other is just an off-by-one
corner case"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: Avoid data inconsistency due to d-cache aliasing in readpage()
ceph: initialize inode before instantiating dentry
|
|
If we are doing async writeback of metadata, we can get write errors
but have nobody to report them to. At the moment, we simply attempt
to reissue the write from io completion in the hope that it's a
transient error.
When it's not a transient error, the buffer is stuck forever in
this loop, and we cannot break out of it. Eventually, unmount will
hang because the AIL cannot be emptied and everything goes downhill
from there.
To solve this problem, only retry the write IO once before aborting
it. We don't throw the buffer away because some transient errors can
last minutes (e.g. FC path failover) or even hours (thin
provisioned devices that have run out of backing space) before they
go away. Hence we really want to keep trying until we can't try any
more.
Because the buffer was not cleaned, however, it does not get removed
from the AIL and hence the next pass across the AIL will start IO on
it again. As such, we still get the "retry forever" semantics that
we currently have, but we allow other access to the buffer in the
mean time. Meanwhile the filesystem can continue to modify the
buffer and relog it, so the IO errors won't hang the log or the
filesystem.
Now when we are pushing the AIL, we can see all these "permanent IO
error" buffers and we can issue a warning about failures before we
retry the IO. We can also catch these buffers when unmounting and
issue a corruption warning, too.
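A sketch of the resulting retry policy in IO completion; the flag name and exact placement are approximate:
    /* First failure: mark the buffer and resubmit the write once. */
    if (!(bp->b_flags & XBF_WRITE_FAIL)) {
        bp->b_flags |= XBF_WRITE_FAIL;
        /* resubmit the async write */
    } else {
        /* Repeated failure: stop retrying from IO completion.  The
         * buffer stays dirty in the AIL, so the next AIL push (or
         * unmount) will see it, warn, and try the IO again. */
    }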
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
When swalloc is specified as a mount option, allocations are
supposed to be aligned to the stripe width rather than the stripe
unit of the underlying filesystem. However, it does not do this.
What the implementation does is round up the allocation size to a
stripe width, hence ensuring that all allocations span a full stripe
width. It does not, however, ensure that the allocation is aligned
to a stripe width, and hence the allocations can span multiple
underlying stripes and so still see RMW cycles for things like
direct IO on MD RAID.
So, if the swalloc mount option is set, change the allocation
alignment in xfs_bmap_btalloc() to use the stripe width rather than
the stripe unit.
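A sketch of the alignment selection in xfs_bmap_btalloc() after the change; field and flag names are approximate:
    /* With swalloc, align allocations to the stripe width; otherwise
     * fall back to aligning to the stripe unit as before. */
    if (mp->m_swidth && (mp->m_flags & XFS_MOUNT_SWALLOC))
        stripe_align = mp->m_swidth;
    else if (mp->m_dalign)
        stripe_align = mp->m_dalign;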
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that
handles the case of a shut down filesystem. Most of the users have private,
uncached buffers that can just be freed in this case, but the complex error
handling in xfs_bioerror_relse messes up the case when it's called without
a locked buffer.
Remove xfsbdstrat and opencode the error handling in the callers. All but
one can simply return an error and don't need to deal with buffer state,
and the one caller that cares about the buffer state could do with a major
cleanup as well, but we'll defer that to later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.
Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is causing write performance to drop in
alignment sensitive configurations.
Fix it by considering allocation into empty files as requiring
aligned allocation again.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f9b395a8ef8f34d19cae2cde361e19c96e097fad)
|
|
xfs_quota(8) will hang if we try to turn group/project quota off
before the user quota is off. This can be reproduced 100% of the time by:
# mount -ouquota,gquota /dev/sda7 /xfs
# mkdir /xfs/test
# xfs_quota -xc 'off -g' /xfs <-- hangs up
# echo w > /proc/sysrq-trigger
# dmesg
SysRq : Show Blocked State
task PC stack pid father
xfs_quota D 0000000000000000 0 27574 2551 0x00000000
[snip]
Call Trace:
[<ffffffff81aaa21d>] schedule+0xad/0xc0
[<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0
[<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0
[<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0
[<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs]
[<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30
[<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs]
[<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs]
[<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs]
[<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs]
[<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs]
[<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs]
[<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs]
[<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs]
[<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs]
[<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0
[<ffffffff812fba4b>] ? iput+0x5b/0x90
[<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0
[<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b
It's fine if we turn user quota off first, then turn off the other
kinds of quotas if they are enabled, since the group/project dquot
refcount is decreased to zero once the user quota is off. Otherwise,
those dquots' refcounts stay non-zero because the user dquot may still
refer to them as hints. Hence, the above operation causes an infinite
loop in xfs_qm_dquot_walk() while trying to purge the dquot cache.
This problem has been around since Linux 3.4; it was introduced by:
[ b84a3a9675 xfs: remove the per-filesystem list of dquots ]
Originally we would release the group dquot pointers that the user
dquots may be carrying around as hints via xfs_qm_detach_gdquots().
However, with the above change, there is no such work done before
purging the group/project dquot cache.
In order to solve this problem, this patch introduces a special routine,
xfs_qm_dqpurge_hints(), which releases the group/project dquot pointers
that the user dquots may be carrying around as hints and then proceeds
to purge the user dquot cache if requested.
Cc: stable@vger.kernel.org
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit df8052e7dae00bde6f21b40b6e3e1099770f3afc)
|
|
For a CRC-enabled v5 superblock, changing a file's ownership can
trigger an ASSERT failure at xfs_setattr_nonsize() if both group and
project quota are enabled, i.e.,
[ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621
[ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable]
[ 305.383939] Call Trace:
[ 305.385536] [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs]
[ 305.387142] [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs]
[ 305.388727] [<ffffffff811ca388>] notify_change+0x1a8/0x350
[ 305.390298] [<ffffffff811ac39d>] chown_common+0xfd/0x110
[ 305.391868] [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110
[ 305.393440] [<ffffffff811ad760>] SyS_lchown+0x20/0x30
[ 305.394995] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 305.399870] RIP [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs]
This fix adjusts the assertion to check whether the superblock
supports both quota inodes or not.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 5a01dd54f4a7fb513062070c5acef20d13cad980)
|
|
After the previous fix, there is still another ASSERT failure if we turn
off any type of quota while fsstress is running at the same time.
The backtrace in this case:
[ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118
[ 50.867924] ------------[ cut here ]------------
... <snip>
[ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable]
[ 50.867999] invalid opcode: 0000 [#1] SMP
[ 50.869407] Call Trace:
[ 50.869446] [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs]
[ 50.869512] [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs]
[ 50.869564] [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs]
[ 50.869615] [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs]
[ 50.869655] [<ffffffff811becd5>] vfs_mkdir+0x95/0x130
[ 50.869689] [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0
[ 50.869723] [<ffffffff811bf689>] SyS_mkdir+0x19/0x20
[ 50.869757] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip>
[ 50.870003] RIP [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs]
[ 50.870050] RSP <ffff88002941fd60>
[ 50.879251] ---[ end trace c93a2b342341c65b ]---
We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach();
however, the assertion itself is not right IMHO. While performing quota off, we
first clear the XFS_*QUOTA_ACTIVE bit(s) in struct xfs_mount without taking
any special locks; see xfs_qm_scall_quotaoff(). Hence there is no guarantee
that the desired quota is still active.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 37eb9706ebf5b99d14c6086cdeef2c2f73f9c9fb)
|
|
Fix the leak of kernel memory in xfs_dir2_node_removename()
when xfs_dir2_leafn_remove() returns an error code.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit ef701600fd26cace9d513ee174688a2b83832126)
|
|
If the length of data to be read in readpage() is exactly
PAGE_CACHE_SIZE, the original code does not flush d-cache
for data consistency after finishing the read. This patch fixes this.
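The shape of the fix is roughly the following (a sketch, not the exact diff): a short read goes through zero_user_segment(), which already flushes the d-cache for the page, but a full-page read skipped that flush entirely.
    if (err < PAGE_CACHE_SIZE)
        /* zero fill remainder of page; this also flushes the d-cache */
        zero_user_segment(page, err, PAGE_CACHE_SIZE);
    else
        /* full-page read: flush the d-cache explicitly */
        flush_dcache_page(page);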
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Commit b18825a7c8 ("Put a small type field into struct dentry::d_flags")
put a type field into struct dentry::d_flags. __d_instantiate() sets the
field by checking inode->i_mode, so we should initialize the inode before
instantiating the dentry when handling an MDS reply.
Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
Merge patches from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: memcg: do not allow task about to OOM kill to bypass the limit
mm: memcg: fix race condition between memcg teardown and swapin
thp: move preallocated PTE page table on move_huge_pmd()
mfd/rtc: s5m: fix register updating by adding regmap for RTC
rtc: s5m: enable IRQ wake during suspend
rtc: s5m: limit endless loop waiting for register update
rtc: s5m: fix unsuccesful IRQ request during probe
drivers/rtc/rtc-s5m.c: fix info->rtc assignment
include/linux/kernel.h: make might_fault() a nop for !MMU
drivers/rtc/rtc-at91rm9200.c: correct alarm over day/month wrap
procfs: also fix proc_reg_get_unmapped_area() for !MMU case
mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
include/linux/hugetlb.h: make isolate_huge_page() an inline
|
|
Commit fad1a86e25e0 ("procfs: call default get_unmapped_area on
MMU-present architectures"), as its title says, took care of only the
MMU case, leaving the !MMU side still in the regressed state (returning
-EIO in all cases where pde->proc_fops->get_unmapped_area is NULL).
From the fad1a86e25e0 changelog:
"Commit c4fe24485729 ("sparc: fix PCI device proc file mmap(2)") added
proc_reg_get_unmapped_area in proc_reg_file_ops and
proc_reg_file_ops_no_compat, by which now mmap always returns EIO if
get_unmapped_area method is not defined for the target procfs file, which
causes regression of mmap on /proc/vmcore.
To address this issue, like get_unmapped_area(), call default
current->mm->get_unmapped_area on MMU-present architectures if
pde->proc_fops->get_unmapped_area, i.e. the one in actual file operation
in the procfs file, is not defined"
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> [3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull btrfs fixes from Chris Mason:
"This is a small collection of fixes. It was rebased this morning, but
I was just fixing signed-off-by tags with the wrong email"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix access_ok() check in btrfs_ioctl_send()
Btrfs: make sure we cleanup all reloc roots if error happens
Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
Btrfs: fix an oops when doing balance relocation
Btrfs: don't miss skinny extent items on delayed ref head contention
btrfs: call mnt_drop_write after interrupted subvol deletion
Btrfs: don't clear the default compression type
|
|
Pull nfsd reply cache bugfix from Bruce Fields:
"One bugfix for nfsd crashes"
* 'for-3.13' of git://linux-nfs.org/~bfields/linux:
nfsd: when reusing an existing repcache entry, unhash it first
|
|
When explicitly hashing the end of a string with the word-at-a-time
interface, we have to be careful which end of the word we pick up.
On big-endian CPUs, the upper-bits will contain the data we're after, so
ensure we generate our masks accordingly (and avoid hashing whatever
random junk may have been sitting after the string).
This patch adds a new dcache helper, bytemask_from_count, which creates
a mask appropriate for the CPU endianness.
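A sketch of the idea, not the exact kernel definition: the mask has to select the low cnt bytes of the word on little-endian and the high cnt bytes on big-endian, so that only the bytes actually holding string data survive.
    /* Keep only the 'cnt' bytes of the final word that contain string
     * data; the rest is whatever junk followed the string in memory. */
    #ifdef __BIG_ENDIAN
    #define bytemask_from_count(cnt)    (~(~0ul >> (cnt)*8))
    #else  /* little-endian */
    #define bytemask_from_count(cnt)    (~(~0ul << (cnt)*8))
    #endif

    /* e.g. on a 64-bit CPU, bytemask_from_count(3) is 0x0000000000ffffff
     * on little-endian and 0xffffff0000000000 on big-endian. */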
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull xfs bugfixes from Ben Myers:
- fix for buffer overrun in agfl with growfs on v4 superblock
- return EINVAL if requested discard length is less than a block
- fix possible memory corruption in xfs_attrlist_by_handle()
* tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfs:
xfs: growfs overruns AGFL buffer on V4 filesystems
xfs: don't perform discard if the given range length is less than block size
xfs: underflow bug in xfs_attrlist_by_handle()
|
|
The closing parenthesis is in the wrong place. We want to check
"sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of
"sizeof(*arg->clone_sources * arg->clone_sources_count)".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
|
|
I hit an oops when merging reloc roots fails. The reason is that
new reloc roots may be added, so we should make sure we clean up
all reloc roots.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The quota tree and UUID tree are only COWed; they cannot be snapshotted.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
I hit an oops when inserting a reloc root into @reloc_root_tree (it can
be easily triggered by forcing COW for the relocation root):
[ 866.494539] [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs]
[ 866.495321] [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs]
[ 866.496109] [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs]
[ 866.496908] [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs]
[ 866.497703] [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs]
This is because the reloc root inserted into @reloc_root_tree is not within
one transaction; the reloc root may be COWed and its root block bytenr
reused, and then the oops happens. We should update the reloc root in
@reloc_root_tree when we COW the reloc root node; fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup
of skinny extent items. This can happen when the execution flow is the
following:
* We do an extent tree lookup and fail to find a skinny extent item;
* As a result, we attempt to see if a non-skinny extent item exists,
either by looking at previous item in the leaf or by doing another
full extent tree search;
* We have a transaction and then we check for a matching delayed ref
head in the transaction's delayed refs rbtree;
* We find such delayed ref head and then we try to lock it with a
call to mutex_trylock();
* The lock was contended, so we jump to the label "again", which repeats
the extent tree search but for a non-skinny extent item, because we
previously set the metadata variable to 0 and the search key to look for
a non-skinny extent item;
* After the jump (and after releasing the transaction's delayed refs
lock), a skinny extent item might have been added to the extent tree
but we will miss it because metadata is set to 0 and the search key
is set for a non-skinny extent-item.
The fix here is to not reset metadata to 0 and to jump to the initial search
key setup if the delayed ref head is contended, instead of jumping directly
to the extent tree search label ("again").
This issue was found while investigating the issue reported at Bugzilla 64961.
David Sterba suspected this function was missing extent items, and that
this could be caused by the last change to this function, which was made
in the following patch:
[PATCH] Btrfs: optimize btrfs_lookup_extent_info()
(commit 74be9510876a66ad9826613ac8a526d26f9e7f01)
But in fact this issue already existed before, because after failing to find
a skinny extent item, the code set the search key for a non-skinny extent
item, and on contention of a matching delayed ref head it would not search
the extent tree for a skinny extent item anymore.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, the mnt_write count is left unbalanced, which leads to an
unmountable filesystem.
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We hit an oops caused by the wrong compression type:
[ 556.512356] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98
[SNIP]
[ 556.512490] [<ffffffff811dbb44>] ? list_del+0xd/0x2b
[ 556.512539] [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs]
[ 556.512546] [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10
[ 556.512576] [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs]
[ 556.512601] [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs]
[ 556.512627] [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs]
[ 556.512655] [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs]
[ 556.512661] [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8
[ 556.512689] [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs]
[ 556.512695] [<ffffffff8104fa4e>] kthread+0x8d/0x95
[ 556.512699] [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d
[ 556.512704] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
[ 556.512709] [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0
[ 556.512713] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o nodatacow <dev> <mnt>
# touch <mnt>/<file>
# chattr =c <mnt>/<file>
# dd if=/dev/zero of=<mnt>/<file> bs=1M count=10
It is because we cleared the default compression type when setting
nodatacow. In fact, we needn't do that, because we already use the
COMPRESS flag to indicate whether the file data needs to be compressed;
there is no need to use the compress_type variable in btrfs_info for the
same thing, and we can just use it to hold the default compression type.
Otherwise we would get a wrong compression type for a file whose own
compress flag is set but whose filesystem's compress flag is not set.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The DRC code will attempt to reuse an existing, expired cache entry in
preference to allocating a new one. It'll then search the cache, and if
it gets a hit it'll then free the cache entry that it was going to
reuse.
The cache code doesn't unhash the entry that it's going to reuse
however, so it's possible for it to end up designating an entry for
reuse and then freeing that same entry after it finds it. This leads
to a later use-after-free situation and usually some list
corruption warnings or an oops.
Fix this by simply unhashing the entry that we intend to reuse. That
will mean that it's not findable via a search and should prevent this
situation from occurring.
Cc: stable@vger.kernel.org # v3.10+
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: g. artim <gartim@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
This loop in xfs_growfs_data_private() is incorrect for V4
superblock filesystems:
for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
For V4 filesystems, we don't have an agfl header structure, and so
XFS_AGFL_SIZE() returns an entire sector's worth of entries, which
we then index from an offset into the sector. Hence: buffer overrun.
This problem was introduced in 3.10 by commit 77c95bba ("xfs: add
CRC checks to the AGFL") which changed the AGFL structure but failed
to update the growfs code to handle the different structures.
Fix it by using the correct offset into the buffer for both V4 and
V5 filesystems.
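The shape of the fix, sketched from memory of the allocator helpers rather than quoted from the diff: obtain the agfl_bno array through a helper that understands both layouts, instead of indexing into the V5-only structure.
    __be32  *agfl_bno;

    /* Points at the start of the buffer on V4 and at agfl->agfl_bno
     * on V5, so the same loop is correct for both formats. */
    agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, bp);
    for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
        agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);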
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit b7d961b35b3ab69609aeea93f870269cb6e7ba4d)
|
|
For the discard operation, we should return EINVAL if the given range
length is less than a block size; otherwise it will go through the file
system to discard data blocks, as the end of the range might be evaluated
to -1, e.g.,
# fstrim -v -o 0 -l 100 /xfs7
/xfs7: 9811378176 bytes were trimmed
This issue can be triggered via xfstests/generic/288.
Also, getting the request queue pointer via bdev_get_queue() instead
of the hard-coded pointer dereference is not a bad thing.
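A sketch of the added guard in the trim ioctl path; field names follow the XFS superblock, exact placement is approximate:
    /* Reject ranges shorter than one filesystem block up front, before
     * any rounding can turn them into "trim everything". */
    if (range.len < mp->m_sb.sb_blocksize)
        return -EINVAL;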
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f9fd0135610084abef6867d984e9951c3099950d)
|
|
If we allocate less than sizeof(struct attrlist) then we end up
corrupting memory or doing a ZERO_PTR_SIZE dereference.
This can only be triggered with CAP_SYS_ADMIN.
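A sketch of the tightened validation, mirroring the existing upper-bound check in xfs_attrlist_by_handle(); the exact form is approximate:
    /* The user buffer must be large enough to hold at least the
     * attrlist header, and no larger than the existing upper bound. */
    if (al_hreq.buflen < sizeof(struct attrlist) ||
        al_hreq.buflen > XATTR_LIST_MAX)
        return -EINVAL;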
Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 071c529eb672648ee8ca3f90944bcbcc730b4c06)
|
|
027a485d12e0 ("sysfs: use a separate locking class for open files
depending on mmap") assigned different lockdep key to
sysfs_open_file->mutex depending on whether the file implements mmap
or not in an attempt to avoid spurious lockdep warning caused by
merging of regular and bin file paths.
While this restored some of the original behavior of using different
locks (at least as far as lockdep is concerned) for the different classes
of files, the restoration wasn't full, because now the lockdep key
assignment depends on whether the file has mmap or not instead of
whether it's a regular file or not.
This means that bin files which don't implement mmap will get assigned
the same lockdep class as regular files. This is problematic because
file_operations for bin files still implements the mmap file operation
and checking whether the sysfs file actually implements mmap happens
in the file operation after grabbing @sysfs_open_file->mutex. We
still end up adding a locking dependency from mmap locking to
sysfs_open_file->mutex to the regular file mutex, which triggers a
spurious circular locking warning.
Fix it by restoring the original behavior fully by differentiating
lockdep key by whether the file is regular or bin, instead of the
existence of mmap.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/g/20131203184324.GA11320@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull aio fix from Benjamin LaHaise:
"AIO fix from Gu Zheng that fixes a GPF that Dave Jones uncovered with
trinity"
* git://git.kvack.org/~bcrl/aio-next:
aio: clean up aio ring in the fail path
|
|
Clean up the aio ring file in the failure path of aio_setup_ring
and ioctx_alloc. This may also fix the GPF issue reported by
Dave Jones:
https://lkml.org/lkml/2013/11/25/898
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
|
Pull power management fixes from Rafael Wysocki:
- cpufreq regression fix from Bjørn Mork restoring the pre-3.12
behavior of the framework during system suspend/hibernation to avoid
garbage sysfs files from being left behind in case of a suspend error
- PNP regression fix to restore the correct states of devices after
resume from hibernation broken in 3.12. From Dmitry Torokhov.
- cpuidle fix to prevent cpuidle device unregistration from crashing
due to a NULL pointer dereference if cpuidle has been disabled from
the kernel command line. From Konrad Rzeszutek Wilk.
- intel_idle fix for the C6 state definition on Intel Avoton/Rangeley
processors from Arne Bockholdt.
- Power capping framework fix to make the energy_uj sysfs attribute
work in accordance with the documentation. From Srinivas Pandruvada.
- epoll fix to make it ignore the EPOLLWAKEUP flag if the kernel has
been compiled with CONFIG_PM_SLEEP unset (in which case that flag
should not have any effect). From Amit Pundir.
- cpufreq fix to prevent governor sysfs files from being lost over
system suspend/resume in some (arguably unusual) situations. From
Viresh Kumar.
* tag 'pm-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PowerCap: Fix mode for energy counter
PNP: fix restoring devices after hibernation
cpuidle: Check for dev before deregistering it.
epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
cpufreq: fix garbage kobjects on errors during suspend/resume
cpufreq: suspend governors on system suspend/hibernate
intel_idle: Fixed C6 state on Avoton/Rangeley processors
|
|
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
|
|
Pull NFS client bugfixes from Trond Myklebust:
- Stable fix for a NFSv4.1 delegation and state recovery deadlock
- Stable fix for a loop on irrecoverable errors when returning
delegations
- Fix a 3-way deadlock between layoutreturn, open, and state recovery
- Update the MAINTAINERS file with contact information for Trond
Myklebust
- Close needs to handle NFS4ERR_ADMIN_REVOKED
- Enabling v4.2 should not recompile nfsd and lockd
- Fix a couple of compile warnings
* tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: fix do_div() warning by instead using sector_div()
MAINTAINERS: Update contact information for Trond Myklebust
NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
SUNRPC: do not fail gss proc NULL calls with EACCES
NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED
NFSv4: Update list of irrecoverable errors on DELEGRETURN
NFSv4 wait on recovery for async session errors
NFS: Fix a warning in nfs_setsecurity
NFS: Enabling v4.2 should not recompile nfsd and lockd
|
|
When compiling a 32-bit kernel with CONFIG_LBDAF=n, the compiler complains
as shown below. Fix this warning by instead using sector_div(), which is
provided by the kernel.h header file.
fs/nfs/blocklayout/extents.c: In function ‘normalize’:
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
fs/nfs/blocklayout/extents.c:47:13: note: in expansion of macro ‘do_div’
nfs/blocklayout/extents.c:47:2: warning: right shift count >= width of type [enabled by default]
fs/nfs/blocklayout/extents.c:47:2: warning: passing argument 1 of ‘__div64_32’ from incompatible pointer type [enabled by default]
include/asm-generic/div64.h:35:17: note: expected ‘uint64_t *’ but argument is of type ‘sector_t *’
extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
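A sketch of the substitution with illustrative variable names: sector_div() divides a sector_t in place whether sector_t is 32 or 64 bits wide, while do_div() insists on a 64-bit dividend.
    sector_t len;           /* 32 bits when CONFIG_LBDAF=n */
    u32 blocks_per_page;

    /* do_div(len, blocks_per_page) warns here because do_div() expects
     * a u64 dividend; sector_div() copes with either width: */
    sector_div(len, blocks_per_page);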
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Andy Adamson reports:
The state manager is recovering expired state and recovery OPENs are being
processed. If kswapd is pruning inodes at the same time, a deadlock can occur
when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
resultant layoutreturn gets an error that the state mangager is to handle,
causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
At the same time an open is waiting for the inode deletion to complete in
__wait_on_freeing_inode.
If the open is either the open called by the state manager, or an open from
the same open owner that is holding the NFSv4 sequence id which causes the
OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
then the state is deadlocked with kswapd.
The fix is simply to have layoutreturn ignore all errors except NFS4ERR_DELAY.
We already know that layouts are dropped on all server reboots, and that
it has to be coded to deal with the "forgetful client model" that doesn't
send layoutreturns.
Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1385402270-14284-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com>
|
|
Pull squashfs bugfix from Phillip Lougher:
"Just a single bug fix to the new "directly decompress into the page
cache" code"
* tag 'squashfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
Squashfs: fix failure to unlock pages on decompress error
|
|
Drop EPOLLWAKEUP from epoll events mask if CONFIG_PM_SLEEP is disabled.
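A sketch of the approach, with the helper name approximate: with CONFIG_PM_SLEEP enabled the flag remains gated on CAP_BLOCK_SUSPEND as before; without it, the flag is simply stripped because it cannot have any effect.
    #ifdef CONFIG_PM_SLEEP
    static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
    {
        /* Wakeup sources exist: only privileged callers may use them. */
        if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
            epev->events &= ~EPOLLWAKEUP;
    }
    #else
    static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
    {
        /* No PM sleep support, so the flag can do nothing: drop it. */
        epev->events &= ~EPOLLWAKEUP;
    }
    #endif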
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:
    spin_lock(&inode->i_lock);
    if (!--pipe->files) {
        inode->i_pipe = NULL;
        kill = 1;
    }
    spin_unlock(&inode->i_lock);
    __pipe_unlock(pipe);
    if (kill)
        free_pipe_info(pipe);
where the final freeing is done last.
HOWEVER. The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".
This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.
Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.
Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early. And since there were two users of this
pattern, create a helper function for it.
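Roughly, the helper and its use look like this (a sketch of the fix, not the verbatim diff):
    static void put_pipe_info(struct inode *inode, struct pipe_inode_info *pipe)
    {
        int kill = 0;

        spin_lock(&inode->i_lock);
        if (!--pipe->files) {
            inode->i_pipe = NULL;
            kill = 1;
        }
        spin_unlock(&inode->i_lock);

        if (kill)
            free_pipe_info(pipe);
    }

    /* Callers drop the pipe lock *first*, so the pipe_inode_info is
     * never touched after it may have been freed: */
    __pipe_unlock(pipe);
    put_pipe_info(inode, pipe);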
The bug was introduced by commit ba5bb147330a ("pipe: take allocation and
freeing of pipe_inode_info out of ->i_mutex").
Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Ian Applegate <ia@cloudflare.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v3.10+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull vfs dentry reference count fix from Al Viro.
This fixes a possible inode_permission NULL pointer dereference (and
other problems) that were due to the root dentry count being decremented
too much. In commit 48a066e72d97 ("RCU'd vfsmounts") the placement of
clearing the LOOKUP_RCU bit changed, and we then returned failure of
incrementing the lockref on the parent dentry with LOOKUP_RCU cleared.
But that meant we needed to go through the same cleanup routines that
the later failures did wrt LOOKUP_ROOT and nd->root.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix bogus path_put() of nd->root after some unlazy_walk() failures
|
|
Failure to grab reference to parent dentry should go through the
same cleanup as nd->seq mismatch. As it is, we might end up with
caller thinking it needs to path_put() nd->root, with obvious
nasty results once we'd hit that bug enough times to drive the
refcount of root dentry all the way to zero...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Pull cifs fixes from Steve French:
"SMB3 "validate negotiate" is needed to prevent certain types of
downgrade attacks.
Also changes SMB2/SMB3 copy offload from using the BTRFS copy ioctl
(BTRFS_IOC_CLONE) to a cifs specific ioctl (CIFS_IOC_COPYCHUNK_FILE)
to address Christoph's comment that there are semantic differences
between requesting copy offload in which copy-on-write is mandatory
(as in the BTRFS ioctl) and optional in the SMB2/SMB3 case. Also
fixes SMB2/SMB3 copychunk for large files"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
[CIFS] Do not use btrfs refcopy ioctl for SMB2 copy offload
Check SMB3 dialects against downgrade attacks
Removed duplicated (and unneeded) goto
CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files
|
|
Pull driver core fixes from Greg KH:
"Here are 3 patches for sysfs issues that have been reported. Well, 1
patch really, the first one is reverted as it's not really needed (the
correct fix is coming in through the different driver subsystems
instead)
But that 1 sysfs fix is needed, so this is still a good thing to pull
in now"
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tag 'driver-core-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Revert "sysfs: handle duplicate removal attempts in sysfs_remove_group()"
sysfs: use a separate locking class for open files depending on mmap
sysfs: handle duplicate removal attempts in sysfs_remove_group()
|