Age | Commit message (Collapse) | Author |
|
Define CIFS for FS-Cache and register for caching. Upon registration the
top-level index object cookie will be stuck to the netfs definition by
FS-Cache.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Add a kernel config option to enable local caching for CIFS.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
The ip_address field is not used and seems redundant as there is union addr
already and I don't see any future use as well.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
The recent commit 6ca9f3bae8b1854794dfa63cdd3b88b7dfe24c13 modified the code so
that filp is full instantiated whenever the file is created and passed back.
The below comment is no longer true, remove it.
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Add conditional compile macros to guard the header file against multiple
inclusion.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
This is a clean up of the code which deals with LM_FLAG_NOEXP
which aims to remove any possible race conditions by using
gl_spin to cover the gap between testing for the LM_FLAG_NOEXP
and the GL_FROZEN flag.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
SELinux needs to pass the MAY_ACCESS flag so it can handle auditting
correctly. Presently the masking of MAY_* flags is done in the VFS. In
order to allow LSMs to decide what flags they care about and what flags
they don't just pass them all and the each LSM mask off what they don't
need. This patch should contain no functional changes to either the VFS or
any LSM.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Currently MAY_ACCESS means that filesystems must check the permissions
right then and not rely on cached results or the results of future
operations on the object. This can be because of a call to sys_access() or
because of a call to chdir() which needs to check search without relying on
any future operations inside that dir. I plan to use MAY_ACCESS for other
purposes in the security system, so I split the MAY_ACCESS and the
MAY_CHDIR cases.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
When commit be6d3e56a6b9b3a4ee44a0685e39e595073c6f0d "introduce new LSM hooks
where vfsmount is available." was proposed, regarding security_path_truncate(),
only "struct file *" argument (which AppArmor wanted to use) was removed.
But length and time_attrs arguments are not used by TOMOYO nor AppArmor.
Thus, let's remove these arguments.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
nfs_commit_inode() needs to be defined irrespectively of whether or not
we are supporting NFSv3 and NFSv4.
Allow the compiler to optimise away code in the NFSv2-only case by
converting it into an inlined stub function.
Reported-and-tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Ensure that writepage respects the nonblock flag
NFS: kswapd must not block in nfs_release_page
nfs: include space for the NUL in root path
|
|
Remove the __exit mark from cifs_exit_dns_resolver() as it's called by the
module init routine in case of error, and so may have been discarded during
linkage.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
See https://bugzilla.kernel.org/show_bug.cgi?id=16056
If other processes are blocked waiting for kswapd to free up some memory so
that they can make progress, then we cannot allow kswapd to block on those
processes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
|
|
In root_nfs_name() it does the following:
if (strlen(buf) + strlen(cp) > NFS_MAXPATHLEN) {
printk(KERN_ERR "Root-NFS: Pathname for remote directory too long.\n");
return -1;
}
sprintf(nfs_export_path, buf, cp);
In the original code if (strlen(buf) + strlen(cp) == NFS_MAXPATHLEN)
then the sprintf() would lead to an overflow. Generally the rest of the
code assumes that the path can have NFS_MAXPATHLEN (1024) characters and
a NUL terminator so the fix is to add space to the nfs_export_path[]
buffer.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
trunc_start() in bmap.c incorrectly uses sizeof(struct gfs2_inode) instead of
sizeof(struct gfs2_dinode).
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
credentials by incrementing their usage count after their replacement by the
task being accessed.
What happens is that get_task_cred() can race with commit_creds():
TASK_1 TASK_2 RCU_CLEANER
-->get_task_cred(TASK_2)
rcu_read_lock()
__cred = __task_cred(TASK_2)
-->commit_creds()
old_cred = TASK_2->real_cred
TASK_2->real_cred = ...
put_cred(old_cred)
call_rcu(old_cred)
[__cred->usage == 0]
get_cred(__cred)
[__cred->usage == 1]
rcu_read_unlock()
-->put_cred_rcu()
[__cred->usage == 1]
panic()
However, since a tasks credentials are generally not changed very often, we can
reasonably make use of a loop involving reading the creds pointer and using
atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.
If successful, we can safely return the credentials in the knowledge that, even
if the task we're accessing has released them, they haven't gone to the RCU
cleanup code.
We then change task_state() in procfs to use get_task_cred() rather than
calling get_cred() on the result of __task_cred(), as that suffers from the
same problem.
Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
tripped when it is noticed that the usage count is not zero as it ought to be,
for example:
kernel BUG at kernel/cred.c:168!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
745
RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45
RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
Stack:
ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
<0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
<0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
Call Trace:
[<ffffffff810698cd>] put_cred+0x13/0x15
[<ffffffff81069b45>] commit_creds+0x16b/0x175
[<ffffffff8106aace>] set_current_groups+0x47/0x4e
[<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
RIP [<ffffffff81069881>] __put_cred+0xc/0x45
RSP <ffff88019e7e9eb8>
---[ end trace df391256a100ebdd ]---
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit b7dc2df5725fe7355fd76000ead7e39728e1b8a9.
The initial patch didn't quite work since it doesn't cover all
the possible routes by which the GLF_FROZEN flag might be set.
A revised fix is coming up in the next patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This looks like a big change, but in reality its only a single line of actual
code change, the rest is just moving a function to before its new caller.
The "try" flag for glocks is a rather subtle and delicate setting since it
requires that the state machine tries just hard enough to ensure that it has
a good chance of getting the requested lock, but no so hard that the
request can land up blocked behind another.
The patch adds in an additional check which will fail any queued try
locks if there is another request blocking the try lock request which
is not granted and compatible, nor in progress already. The check is made
only after all pending locks which may be granted have been granted.
I've checked this with the reproducer for the reported flock bug which
this is intended to fix, and it now passes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
The k[mc]allocs in dr_split_leaf() and dir_double_exhash() are failable,
so remove __GFP_NOFAIL from their masks.
Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Function gfs2_write_alloc_required always returned zero as its
return code. Therefore, it doesn't need to return a return code
at all. Given that, we can use the return value to return whether
or not the dinode needs block allocations rather than passing
that value in, which in turn simplifies a bunch of error checking.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch implements a wait for the journal id in the case that it has
not been specified on the command line. This is to allow the future
removal of the mount.gfs2 helper. The journal id would instead be
directly communicated by gfs_controld to the file system. Here is a
comparison of the two systems:
Current:
1. mount calls mount.gfs2
2. mount.gfs2 connects to gfs_controld to retrieve the journal id
3. mount.gfs2 adds the journal id to the mount command line and calls
the mount system call
4. gfs_controld receives the status of the mount request via a uevent
Proposed:
1. mount calls the mount system call (no mount.gfs2 helper)
2. gfs_controld receives a uevent for a gfs2 fs which it doesn't know
about already
3. gfs_controld assigns a journal id to it via sysfs
4. the mount system call then completes as normal (sending a uevent
according to status)
The advantage of the proposed system is that it is completely backward
compatible with the current system both at the kernel and at the
userland levels. The "first" parameter can also be set the same way,
with the restriction that it must be set before the journal id is
assigned.
In addition, if mount becomes stuck waiting for a reply from
gfs_controld which never arrives, then it is killable and will abort the
mount gracefully.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Use nobh_writepage rather than calling mpage_writepage directly.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
|
|
The function ecryptfs_uid_hash wrongly assumes that the
second parameter to hash_long() is the number of hash
buckets instead of the number of hash bits.
This patch fixes that and renames the variable
ecryptfs_hash_buckets to ecryptfs_hash_bits to make it
clearer.
Fixes: CVE-2010-2492
Signed-off-by: Andre Osterhues <aosterhues@escrypt.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: use complete_all and wake_up_all
ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
ceph: fix dentry lease release
ceph: fix leak of dentry in ceph_init_dentry() error path
ceph: fix pg_mapping leak on pg_temp updates
ceph: fix d_release dop for snapdir, snapped dentries
ceph: avoid dcache readdir for snapdir
|
|
If we don't need a huge amount of memory in ->readdir() then
we can use kmalloc rather than vmalloc to allocate it. This
should cut down on the greater overheads associated with
vmalloc for smaller directories.
We may be able to eliminate vmalloc entirely at some stage,
but this is easy to do right away.
Also using GFP_NOFS to avoid any issues wrt to deleting inodes
while under a glock, and suggestion from Linus to factor out
the alloc/dealloc.
I've given this a test with a variety of different sized
directories and it seems to work ok.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This fixes an issue triggered by running concurrent syncs. One of the syncs
would go through while the other would just hang indefinitely. In any case, we
never actually want to wake a single waiter, so the *_all functions should
be used.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Pass the correct end of the buffer to p9stat_read.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
|
|
Our current handling of direct I/O completions is rather suboptimal,
because we defer it to a workqueue more often than needed, and we
perform a much to aggressive flush of the workqueue in case unwritten
extent conversions happen.
This patch changes the direct I/O reads to not even use a completion
handler, as we don't bother to use it at all, and to perform the unwritten
extent conversions in caller context for synchronous direct I/O.
For a small I/O size direct I/O workload on a consumer grade SSD, such as
the untar of a kernel tree inside qemu this patch gives speedups of
about 5%. Getting us much closer to the speed of a native block device,
or a fully allocated XFS file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
If we write into an unwritten extent using AIO we need to complete the AIO
request after the extent conversion has finished. Without that a read could
race to see see the extent still unwritten and return zeros. For synchronous
I/O we already take care of that by flushing the xfsconvertd workqueue (which
might be a bit of overkill).
To do that add iocb and result fields to struct xfs_ioend, so that we can
call aio_complete from xfs_end_io after the extent conversion has happened.
Note that we need a new result field as io_error is used for positive errno
values, while the AIO code can return negative error values and positive
transfer sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited. That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.
This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Commit 0fd7275cc42ab734eaa1a2c747e65479bd1e42af ("xfs: fix gcc 4.6
set but not read and unused statement warnings") failed to convert
some code inside XFS_NATIVE_HOST (big endian host code only) and
hence fails to build on such machines. Fix it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Supporting symlinks from untagged to tagged directories is reasonable,
and needed to support CONFIG_SYSFS_DEPRECATED. So don't fail a prior
allowing that case to work.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
This happens for network devices when SYSFS_DEPRECATED is enabled.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Recently my tagged sysfs support revealed a flaw in the device core
that a few rare drivers are running into such that we don't always put
network devices in a class subdirectory named net/.
Since we are not creating the class directory the network devices wind
up in a non-tagged directory, but the symlinks to the network devices
from /sys/class/net are in a tagged directory. All of which works
until we go to remove or rename the symlink. When we remove or rename
a symlink we look in the namespace of the target of the symlink.
Since the target of the symlink is in a non-tagged sysfs directory we
don't have a namespace to look in, and we fail to remove the symlink.
Detect this problem up front and simply don't create symlinks we won't
be able to remove later. This prevents symlink leakage and fails in
a much clearer and more understandable way.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_truncate_file is only used for truncating quota files. Move it to
xfs_qm_syscalls.c so it can be marked static and take advatange of the
fact by removing the unused page cache validation and taking the iget
into the helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The b_strat callback is used by xfs_buf_iostrategy to perform additional
checks before submitting a buffer. It is used in xfs_bwrite and when
writing out delayed buffers. In xfs_bwrite it we can de-virtualize the
call easily as b_strat is set a few lines above the call to
xfs_buf_iostrategy. For the delayed buffers the rationale is a bit
more complicated:
- there are three callers of xfs_buf_delwri_queue, which places buffers
on the delwri list:
(1) xfs_bdwrite - this sets up b_strat, so it's fine
(2) xfs_buf_iorequest. None of the callers can have XBF_DELWRI set:
- xlog_bdstrat is only used for log buffers, which are never delwri
- _xfs_buf_read explicitly clears the delwri flag
- xfs_buf_iodone_work retries log buffers only
- xfsbdstrat - only used for reads, superblock writes without the
delwri flag, log I/O and file zeroing with explicitly allocated
buffers.
- xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is
not set
(3) xfs_buf_unlock
- only puts the buffer on the delwri list if the DELWRI flag is
already set. The DELWRI flag is only ever set in xfs_bwrite,
xfs_buf_iodone_callbacks, or xfs_trans_log_buf. For
xfs_buf_iodone_callbacks and xfs_trans_log_buf we require
an initialized buf item, which means b_strat was set to
xfs_bdstrat_cb in xfs_buf_item_init.
Conclusion: we can just get rid of the callback and replace it with
explicit calls to xfs_bdstrat_cb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
the osyncisosync option a no-op. Warn the users about this and remove
the mount flag for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Move xfs_filestream_peek_ag, xxfs_filestream_get_ag and xfs_filestream_put_ag
from xfs_filestream.h to xfs_filestream.c where it's only callers are, and
remove the inline marker while we're at it to let the compiler decide on the
inlining. Also don't return a value from xfs_filestream_put_ag because
we don't need it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
[hch: dropped a few hunks that need structural changes instead]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
When CONFIG_XFS_POSIX_ACL is not set "xfs_check_acl" is #defined
to NULL - which breaks the code attempting to add a tracepoint
on this function.
Only define the tracepoint when the function exists.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
map_len is unsigned. Checking map_len <= 0 is buggy when it should be
below zero. So, check exact expression instead of map_len.
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Avoid a lockdep warning by preventing page cache allocation from
recursing back into the filesystem during memory reclaim.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Calling into memory reclaim with a locked inode buffer can deadlock
if memory reclaim tries to lock the inode buffer during inode
teardown. Convert the relevant memory allocations to use KM_NOFS to
avoid this deadlock condition.
Reported-by: Peter Watkins <treestem@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_trans_add_item() is called with ip->i_ilock held, which means it
is unsafe for memory reclaim to recurse back into the filesystem
(ilock is required in writeback). Hence the allocation needs to be
KM_NOFS to avoid recursion.
Lockdep report indicating memory allocation being called with the
ip->i_ilock held is as follows:
[ 1749.866796] =================================
[ 1749.867788] [ INFO: inconsistent lock state ]
[ 1749.868327] 2.6.35-rc3-dgc+ #25
[ 1749.868741] ---------------------------------
[ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1749.868741] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741] {IN-RECLAIM_FS-W} state was registered at:
[ 1749.868741] [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450
[ 1749.868741] [<ffffffff810b4b56>] lock_acquire+0xa6/0x160
[ 1749.868741] [<ffffffff810a20b5>] down_write_nested+0x65/0xb0
[ 1749.868741] [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741] [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310
[ 1749.868741] [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150
[ 1749.868741] [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0
[ 1749.868741] [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90
[ 1749.868741] [<ffffffff81119d07>] shrink_slab+0x137/0x1a0
[ 1749.868741] [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0
[ 1749.868741] [<ffffffff8111bf7d>] kswapd+0x11d/0x320
[ 1749.868741] [<ffffffff8109ce56>] kthread+0x96/0xa0
[ 1749.868741] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 1749.868741] irq event stamp: 4234335
[ 1749.868741] hardirqs last enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220
[ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220
[ 1749.868741] softirqs last enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260
[ 1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50
[ 1749.868741]
[ 1749.868741] other info that might help us debug this:
[ 1749.868741] 2 locks held by dd/2835:
[ 1749.868741] #0: (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200
[ 1749.868741] #1: (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741]
[ 1749.868741] stack backtrace:
[ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25
[ 1749.868741] Call Trace:
[ 1749.868741] [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190
[ 1749.868741] [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50
[ 1749.868741] [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0
[ 1749.868741] [<ffffffff810b2f11>] mark_lock+0x331/0x400
[ 1749.868741] [<ffffffff810b3047>] mark_held_locks+0x67/0x90
[ 1749.868741] [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0
[ 1749.868741] [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0
[ 1749.868741] [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0
[ 1749.868741] [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50
[ 1749.868741] [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0
[ 1749.868741] [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0
[ 1749.868741] [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0
[ 1749.868741] [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280
[ 1749.868741] [<ffffffff8133cd18>] xfs_release+0x138/0x190
[ 1749.868741] [<ffffffff813464c5>] xfs_file_release+0x15/0x20
[ 1749.868741] [<ffffffff81150ebf>] fput+0x13f/0x260
[ 1749.868741] [<ffffffff8114d8c2>] filp_close+0x52/0x80
[ 1749.868741] [<ffffffff8114d9a9>] sys_close+0xb9/0x120
[ 1749.868741] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_ireclaim has to get and put te pag structure because it is only
called with the inode to reclaim. The one caller of this function
already has a reference on the pag and a pointer to is, so move the
radix tree delete to the caller and remove xfs_ireclaim completely.
This avoids a xfs_perag_get/put on every inode being reclaimed.
The overhead was noticed in a bug report at:
https://bugzilla.kernel.org/show_bug.cgi?id=16348
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_buf_read() fails to detect dispatch errors before attempting to
wait on sychronous IO. If there was an error, it will get stuck
forever, waiting for an I/O that was never started. Make sure the
error is detected correctly.
Further, such a failure can leave locked pages in the page cache
which will cause a later operation to hang on the page. Ensure that
we correctly process pages in the buffers when we get a dispatch
error.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
I missed Dave Chinner's second revision of this change, and pushed
his first version out to the repository instead.
commit a476c59ebb279d738718edc0e3fb76aab3687114
Author: Dave Chinner <dchinner@redhat.com>
This commit compensates for that by moving a block of code up a bit
further, with a result that matches the the effect of Dave's second
version.
Dave's first version was:
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Dave's second version was:
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|