summaryrefslogtreecommitdiff
path: root/fs/fscache
AgeCommit message (Collapse)Author
2015-04-02FS-Cache: Retain the netfs context in the retrieval op earlierDavid Howells
Now that the retrieval operation may be disposed of by fscache_put_operation() before we actually set the context, the retrieval-specific cleanup operation can produce a NULL-pointer dereference when it tries to unconditionally clean up the netfs context. Given that it is expected that we'll get at least as far as the place where we currently set the context pointer and it is unlikely we'll go through the error handling paths prior to that point, retain the context right from the point that the retrieval op is allocated. Concomitant to this, we need to retain the cookie pointer in the retrieval op also so that we can call the netfs to release its context in the release method. In addition, we might now get into fscache_release_retrieval_op() with the op only initialised. To this end, set the operation to DEAD only after the release method has been called and skip the n_pages test upon cleanup if the op is still in the INITIALISED state. Without these changes, the following oops might be seen: BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8 ... RIP: 0010:[<ffffffffa0089c98>] fscache_release_retrieval_op+0xae/0x100 ... Call Trace: [<ffffffffa0088560>] fscache_put_operation+0x117/0x2e0 [<ffffffffa008b8f5>] __fscache_read_or_alloc_pages+0x351/0x3ac [<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs] [<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs] [<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e [<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a [<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c [<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af [<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a [<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs] [<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs] [<ffffffff811363be>] new_sync_read+0x78/0x9c [<ffffffff81137164>] __vfs_read+0x13/0x38 [<ffffffff8113721e>] vfs_read+0x95/0x121 [<ffffffff811372f6>] SyS_read+0x4c/0x8a [<ffffffff81557a52>] system_call_fastpath+0x12/0x17 Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: The operation cancellation method needs calling in more placesDavid Howells
Any time an incomplete operation is cancelled, the operation cancellation function needs to be called to clean up. This is currently being passed directly to some of the functions that might want to call it, but not all. Instead, pass the cancellation method pointer to the fscache_operation_init() and have that cache it in the operation struct. Further, plug in a dummy cancellation handler if the caller declines to set one as this allows us to call the function unconditionally (the extra overhead isn't worth bothering about as we don't expect to be calling this typically). The cancellation method must thence be called everywhere the CANCELLED state is set. Note that we call it *before* setting the CANCELLED state such that the method can use the old state value to guide its operation. fscache_do_cancel_retrieval() needs moving higher up in the sources so that the init function can use it now. Without this, the following oops may be seen: FS-Cache: Assertion failed FS-Cache: 3 == 0 is false ------------[ cut here ]------------ kernel BUG at ../fs/fscache/page.c:261! ... RIP: 0010:[<ffffffffa0089c1b>] fscache_release_retrieval_op+0x77/0x100 [<ffffffffa008853d>] fscache_put_operation+0x114/0x2da [<ffffffffa008b8c2>] __fscache_read_or_alloc_pages+0x358/0x3b3 [<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs] [<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs] [<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e [<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a [<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c [<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af [<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a [<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs] [<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs] [<ffffffff811363be>] new_sync_read+0x78/0x9c [<ffffffff81137164>] __vfs_read+0x13/0x38 [<ffffffff8113721e>] vfs_read+0x95/0x121 [<ffffffff811372f6>] SyS_read+0x4c/0x8a [<ffffffff81557a52>] system_call_fastpath+0x12/0x17 The assertion is showing that the remaining number of pages (n_pages) is not 0 when the operation is being released. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Put an aborted initialised op so that it is accounted correctlyDavid Howells
Call fscache_put_operation() or a wrapper on any op that has gone through fscache_operation_init() so that the accounting shown in /proc is done correctly, specifically fscache_n_op_release. fscache_put_operation() therefore now allows an op in the INITIALISED state as well as in the CANCELLED and COMPLETE states. Note that this means that an operation can get put that doesn't have its ->object pointer filled in, so anything that depends on the object needs to be conditional in fscache_put_operation(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Fix cancellation of in-progress operationDavid Howells
Cancellation of an in-progress operation needs to update the relevant counters and start any operations that are pending waiting on this one. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Count the number of initialised operationsDavid Howells
Count and display through /proc/fs/fscache/stats the number of initialised operations. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Out of line fscache_operation_init()David Howells
Out of line fscache_operation_init() so that it can access internal FS-Cache features, such as stats, in a later commit. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Permit fscache_cancel_op() to cancel in-progress operations tooDavid Howells
Currently, fscache_cancel_op() only cancels pending operations - attempts to cancel in-progress operations are ignored. This leads to a problem in fscache_wait_for_operation_activation() whereby the wait is terminated, but the object has been killed. The check at the end of the function now triggers because it's no longer contingent on the cache having produced an I/O error since the commit that fixed the logic error in fscache_object_is_dead(). The result of the check is that it tries to cancel the operation - but since the object may not be pending by this point, the cancellation request may be ignored - with the result that the the object is just put by the caller and fscache_put_operation has an assertion failure because the operation isn't in either the COMPLETE or the CANCELLED states. To fix this, we permit in-progress ops to be cancelled under some circumstances. The bug results in an oops that looks something like this: FS-Cache: fscache_wait_for_operation_activation() = -ENOBUFS [obj dead 3] FS-Cache: FS-Cache: Assertion failed FS-Cache: 3 == 5 is false ------------[ cut here ]------------ kernel BUG at ../fs/fscache/operation.c:432! ... RIP: 0010:[<ffffffffa0088574>] fscache_put_operation+0xf2/0x2cd Call Trace: [<ffffffffa008b92a>] __fscache_read_or_alloc_pages+0x2ec/0x3b3 [<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs] [<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs] [<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e [<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a [<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c [<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af [<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a [<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs] [<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs] [<ffffffff811363be>] new_sync_read+0x78/0x9c [<ffffffff81137164>] __vfs_read+0x13/0x38 [<ffffffff8113721e>] vfs_read+0x95/0x121 [<ffffffff811372f6>] SyS_read+0x4c/0x8a [<ffffffff81557a52>] system_call_fastpath+0x12/0x17 Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: fscache_object_is_dead() has wrong logic, kill itDavid Howells
fscache_object_is_dead() returns true only if the object is marked dead and the cache got an I/O error. This should be a logical OR instead. Since two of the callers got split up into handling for separate subcases, expand the other callers and kill the function. This is probably the right thing to do anyway since one of the subcases isn't about the object at all, but rather about the cache. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Synchronise object death state change vs operation submissionDavid Howells
When an object is being marked as no longer live, do this under the object spinlock to prevent a race with operation submission targeted on that object. The problem occurs due to the following pair of intertwined sequences when the cache tries to create an object that would take it over the hard available space limit: NETFS INTERFACE =============== (A) The netfs calls fscache_acquire_cookie(). object creation is deferred to the object state machine and the netfs is allowed to continue. OBJECT STATE MACHINE KTHREAD ============================ (1) The object is looked up on disk by fscache_look_up_object() calling cachefiles_walk_to_object(). The latter finds that the object is not yet represented on disk and calls fscache_object_lookup_negative(). (2) fscache_object_lookup_negative() sets FSCACHE_COOKIE_NO_DATA_YET and clears FSCACHE_COOKIE_LOOKING_UP, thus allowing the netfs to start queuing read operations. (B) The netfs calls fscache_read_or_alloc_pages(). This calls fscache_wait_for_deferred_lookup() which sees FSCACHE_COOKIE_LOOKING_UP become clear, allowing the read to begin. (C) A read operation is set up and passed to fscache_submit_op() to deal with. (3) cachefiles_walk_to_object() calls cachefiles_has_space(), which fails (or one of the file operations to create stuff fails). cachefiles returns an error to fscache. (4) fscache_look_up_object() transits to the LOOKUP_FAILURE state, (5) fscache_lookup_failure() sets FSCACHE_OBJECT_LOOKED_UP and FSCACHE_COOKIE_UNAVAILABLE and clears FSCACHE_COOKIE_LOOKING_UP then transits to the KILL_OBJECT state. (6) fscache_kill_object() clears FSCACHE_OBJECT_IS_LIVE in an attempt to reject any further requests from the netfs. (7) object->n_ops is examined and found to be 0. fscache_kill_object() transits to the DROP_OBJECT state. (D) fscache_submit_op() locks the object spinlock, sees if it can dispatch the op immediately by calling fscache_object_is_active() - which fails since FSCACHE_OBJECT_IS_AVAILABLE has not yet been set. (E) fscache_submit_op() then tests FSCACHE_OBJECT_LOOKED_UP - which is set. It then queues the object and increments object->n_ops. (8) fscache_drop_object() releases the object and eventually fscache_put_object() calls cachefiles_put_object() which suffers an assertion failure here: ASSERTCMP(object->fscache.n_ops, ==, 0); Locking the object spinlock in step (6) around the clearance of FSCACHE_OBJECT_IS_LIVE ensures that the the decision trees in fscache_submit_op() and fscache_submit_exclusive_op() don't see the IS_LIVE flag being cleared mid-decision: either the op is queued before step (7) - in which case fscache_kill_object() will see n_ops>0 and will deal with the op - or the op will be rejected. This, combined with rejecting op submission if the target object is dying, fix the problem. The problem shows up as the following oops: CacheFiles: Assertion failed CacheFiles: 1 == 0 is false ------------[ cut here ]------------ kernel BUG at ../fs/cachefiles/interface.c:339! ... RIP: 0010:[<ffffffffa014fd9c>] [<ffffffffa014fd9c>] cachefiles_put_object+0x2a4/0x301 [cachefiles] ... Call Trace: [<ffffffffa008674b>] fscache_put_object+0x18/0x21 [fscache] [<ffffffffa00883e6>] fscache_object_work_func+0x3ba/0x3c9 [fscache] [<ffffffff81054dad>] process_one_work+0x226/0x441 [<ffffffff81055d91>] worker_thread+0x273/0x36b [<ffffffff81055b1e>] ? rescuer_thread+0x2e1/0x2e1 [<ffffffff81059b9d>] kthread+0x10e/0x116 [<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb [<ffffffff815579ac>] ret_from_fork+0x7c/0xb0 [<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Handle a new operation submitted against a killed objectDavid Howells
Reject new operations that are being submitted against an object if that object has failed its lookup or creation states or has been killed by the cache backend for some other reason, such as having been culled. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: When submitting an op, cancel it if the target object is dyingDavid Howells
When submitting an operation, prefer to cancel the operation immediately rather than queuing it for later processing if the object is marked as dying (ie. the object state machine has reached the KILL_OBJECT state). Whilst we're at it, change the series of related test_bit() calls into a READ_ONCE() and bitwise-AND operators to reduce the number of load instructions (test_bit() has a volatile address). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Move fscache_report_unexpected_submission() to make it more availableDavid Howells
Move fscache_report_unexpected_submission() up within operation.c so that it can be called from fscache_submit_exclusive_op() too. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-02-24FS-Cache: Count culled objects and objects rejected due to lack of spaceDavid Howells
Count the number of objects that get culled by the cache backend and the number of objects that the cache backend declines to instantiate due to lack of space in the cache. These numbers are made available through /proc/fs/fscache/stats Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2014-10-13fs/fscache/object-list.c: use __seq_open_private()Rob Jones
Reduce boilerplate code by using __seq_open_private() instead of seq_open() in fscache_objlist_open(). Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com>
2014-09-17FS-Cache: refcount becomes corrupt under vma pressure.Milosz Tanski
In rare cases under heavy VMA pressure the ref count for a fscache cookie becomes corrupt. In this case we decrement ref count even if we fail before incrementing the refcount. FS-Cache: Assertion failed bnode-eca5f9c6/syslog 0 > 0 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/cookie.c:519! invalid opcode: 0000 [#1] SMP Call Trace: [<ffffffffa01ba060>] __fscache_relinquish_cookie+0x50/0x220 [fscache] [<ffffffffa02d64ce>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph] [<ffffffffa02ae1d3>] ceph_destroy_inode+0x33/0x200 [ceph] [<ffffffff811cf67e>] ? __fsnotify_inode_delete+0xe/0x10 [<ffffffff811a9e0c>] destroy_inode+0x3c/0x70 [<ffffffff811a9f51>] evict+0x111/0x180 [<ffffffff811aa763>] iput+0x103/0x190 [<ffffffff811a5de8>] __dentry_kill+0x1c8/0x220 [<ffffffff811a5f31>] shrink_dentry_list+0xf1/0x250 [<ffffffff811a762c>] prune_dcache_sb+0x4c/0x60 [<ffffffff811930af>] super_cache_scan+0xff/0x170 [<ffffffff8113d7a0>] shrink_slab_node+0x140/0x2c0 [<ffffffff8113f2da>] shrink_slab+0x8a/0x130 [<ffffffff81142572>] balance_pgdat+0x3e2/0x5d0 [<ffffffff811428ca>] kswapd+0x16a/0x4a0 [<ffffffff810a43f0>] ? __wake_up_sync+0x20/0x20 [<ffffffff81142760>] ? balance_pgdat+0x5d0/0x5d0 [<ffffffff81083e09>] kthread+0xc9/0xe0 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 [<ffffffff8159f63c>] ret_from_fork+0x7c/0xb0 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 RIP [<ffffffffa01b984b>] __fscache_disable_cookie+0x1db/0x210 [fscache] RSP <ffff8803bc85f9b8> ---[ end trace 254d0d7c74a01f25 ]--- Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-27FS-Cache: Reduce cookie ref count if submit fails.Milosz Tanski
I've been seeing issues with disposing cookies under vma pressure. The symptom is that the refcount gets out of sync. In this case we fail to decrement the refcount if submit fails. I found this while auditing the error in and around cookie operations. Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-27FS-Cache: Timeout for releasepage()Milosz Tanski
This is meant to avoid a recusive hang caused by underlying filesystem trying to grab a free page and causing a write-out. INFO: task kworker/u30:7:28375 blocked for more than 120 seconds. Not tainted 3.15.0-virtual #74 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u30:7 D 0000000000000000 0 28375 2 0x00000000 Workqueue: fscache_operation fscache_op_work_func [fscache] ffff88000b147148 0000000000000046 0000000000000000 ffff88000b1471c8 ffff8807aa031820 0000000000014040 ffff88000b147fd8 0000000000014040 ffff880f0c50c860 ffff8807aa031820 ffff88000b147158 ffff88007be59cd0 Call Trace: [<ffffffff815930e9>] schedule+0x29/0x70 [<ffffffffa018bed5>] __fscache_wait_on_page_write+0x55/0x90 [fscache] [<ffffffff810a4350>] ? __wake_up_sync+0x20/0x20 [<ffffffffa018c135>] __fscache_maybe_release_page+0x65/0x1e0 [fscache] [<ffffffffa02ad813>] ceph_releasepage+0x83/0x100 [ceph] [<ffffffff811635b0>] ? anon_vma_fork+0x130/0x130 [<ffffffff8112cdd2>] try_to_release_page+0x32/0x50 [<ffffffff81140096>] shrink_page_list+0x7e6/0x9d0 [<ffffffff8113f278>] ? isolate_lru_pages.isra.73+0x78/0x1e0 [<ffffffff81140932>] shrink_inactive_list+0x252/0x4c0 [<ffffffff811412b1>] shrink_lruvec+0x3e1/0x670 [<ffffffff8114157f>] shrink_zone+0x3f/0x110 [<ffffffff81141b06>] do_try_to_free_pages+0x1d6/0x450 [<ffffffff8114a939>] ? zone_statistics+0x99/0xc0 [<ffffffff81141e44>] try_to_free_pages+0xc4/0x180 [<ffffffff81136982>] __alloc_pages_nodemask+0x6b2/0xa60 [<ffffffff811c1d4e>] ? __find_get_block+0xbe/0x250 [<ffffffff810a405e>] ? wake_up_bit+0x2e/0x40 [<ffffffff811740c3>] alloc_pages_current+0xb3/0x180 [<ffffffff8112cf07>] __page_cache_alloc+0xb7/0xd0 [<ffffffff8112da6c>] grab_cache_page_write_begin+0x7c/0xe0 [<ffffffff81214072>] ? ext4_mark_inode_dirty+0x82/0x220 [<ffffffff81214a89>] ext4_da_write_begin+0x89/0x2d0 [<ffffffff8112c6ee>] generic_perform_write+0xbe/0x1d0 [<ffffffff811a96b1>] ? update_time+0x81/0xc0 [<ffffffff811ad4c2>] ? mnt_clone_write+0x12/0x30 [<ffffffff8112e80e>] __generic_file_aio_write+0x1ce/0x3f0 [<ffffffff8112ea8e>] generic_file_aio_write+0x5e/0xe0 [<ffffffff8120b94f>] ext4_file_write+0x9f/0x410 [<ffffffff8120af56>] ? ext4_file_open+0x66/0x180 [<ffffffff8118f0da>] do_sync_write+0x5a/0x90 [<ffffffffa025c6c9>] cachefiles_write_page+0x149/0x430 [cachefiles] [<ffffffff812cf439>] ? radix_tree_gang_lookup_tag+0x89/0xd0 [<ffffffffa018c512>] fscache_write_op+0x222/0x3b0 [fscache] [<ffffffffa018b35a>] fscache_op_work_func+0x3a/0x100 [fscache] [<ffffffff8107bfe9>] process_one_work+0x179/0x4a0 [<ffffffff8107d47b>] worker_thread+0x11b/0x370 [<ffffffff8107d360>] ? manage_workers.isra.21+0x2e0/0x2e0 [<ffffffff81083d69>] kthread+0xc9/0xe0 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0 [<ffffffff8159eefc>] ret_from_fork+0x7c/0xb0 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0 Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-06fs/fscache: make ctl_table staticFabian Frederick
fscache_sysctls and fscache_sysctls_root are only used in main.c Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: David Howells <dhowells@redhat.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-16sched: Remove proliferation of wait_on_bit() action functionsNeilBrown
The current "wait_on_bit" interface requires an 'action' function to be provided which does the actual waiting. There are over 20 such functions, many of them identical. Most cases can be satisfied by one of just two functions, one which uses io_schedule() and one which just uses schedule(). So: Rename wait_on_bit and wait_on_bit_lock to wait_on_bit_action and wait_on_bit_lock_action to make it explicit that they need an action function. Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io which are *not* given an action function but implicitly use a standard one. The decision to error-out if a signal is pending is now made based on the 'mode' argument rather than being encoded in the action function. All instances of the old wait_on_bit and wait_on_bit_lock which can use the new version have been changed accordingly and their action functions have been discarded. wait_on_bit{_lock} does not return any specific error code in the event of a signal so the caller must check for non-zero and interpolate their own error code as appropriate. The wait_on_bit() call in __fscache_wait_on_invalidate() was ambiguous as it specified TASK_UNINTERRUPTIBLE but used fscache_wait_bit_interruptible as an action function. David Howells confirms this should be uniformly "uninterruptible" The main remaining user of wait_on_bit{,_lock}_action is NFS which needs to use a freezer-aware schedule() call. A comment in fs/gfs2/glock.c notes that having multiple 'action' functions is useful as they display differently in the 'wchan' field of 'ps'. (and /proc/$PID/wchan). As the new bit_wait{,_io} functions are tagged "__sched", they will not show up at all, but something higher in the stack. So the distinction will still be visible, only with different function names (gds2_glock_wait versus gfs2_glock_dq_wait in the gfs2/glock.c case). Since first version of this patch (against 3.15) two new action functions appeared, on in NFS and one in CIFS. CIFS also now uses an action function that makes the same freezer aware schedule call as NFS. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: David Howells <dhowells@redhat.com> (fscache, keys) Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2) Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-06fscache: convert use of typedef ctl_table to struct ctl_tableJoe Perches
This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04fs/fscache: replace seq_printf by seq_putsFabian Frederick
Replace seq_printf where possible + coalesce formats from 2 existing seq_puts Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04fs/fscache: convert printk to pr_foo()Fabian Frederick
All printk converted to pr_foo() except internal.h: printk(KERN_DEBUG Coalesce formats. Add pr_fmt Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-17FS-Cache: Handle removal of unadded object to the fscache_object_list rb treeDavid Howells
When FS-Cache allocates an object, the following sequence of events can occur: -->fscache_alloc_object() -->cachefiles_alloc_object() [via cache->ops->alloc_object] <--[returns new object] -->fscache_attach_object() <--[failed] -->cachefiles_put_object() [via cache->ops->put_object] -->fscache_object_destroy() -->fscache_objlist_remove() -->rb_erase() to remove the object from fscache_object_list. resulting in a crash in the rbtree code. The problem is that the object is only added to fscache_object_list on the success path of fscache_attach_object() where it calls fscache_objlist_add(). So if fscache_attach_object() fails, the object won't have been added to the objlist rbtree. We do, however, unconditionally try to remove the object from the tree. Thanks to NeilBrown for finding this and suggesting this solution. Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: (a customer of) NeilBrown <neilb@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-14Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and has code to handle both, since things like merging only works in the rq model and hence is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even shows an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request is coming to convert virtio-blk to this model will be will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer. - A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ...
2013-11-08block: Replace __get_cpu_var usesChristoph Lameter
__get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. At the end of the patch set all uses of __get_cpu_var have been removed so the macro is removed too. The patch set includes passes over all arches as well. Once these operations are used throughout then specialized macros can be defined in non -x86 arches as well in order to optimize per cpu access by f.e. using a global register that may be set to the per cpu base. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int *x = &__get_cpu_var(y); Converts to int *x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int *x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to this_cpu_inc(y) Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-27FS-Cache: Provide the ability to enable/disable cookiesDavid Howells
Provide the ability to enable and disable fscache cookies. A disabled cookie will reject or ignore further requests to: Acquire a child cookie Invalidate and update backing objects Check the consistency of a backing object Allocate storage for backing page Read backing pages Write to backing pages but still allows: Checks/waits on the completion of already in-progress objects Uncaching of pages Relinquishment of cookies Two new operations are provided: (1) Disable a cookie: void fscache_disable_cookie(struct fscache_cookie *cookie, bool invalidate); If the cookie is not already disabled, this locks the cookie against other dis/enablement ops, marks the cookie as being disabled, discards or invalidates any backing objects and waits for cessation of activity on any associated object. This is a wrapper around a chunk split out of fscache_relinquish_cookie(), but it reinitialises the cookie such that it can be reenabled. All possible failures are handled internally. The caller should consider calling fscache_uncache_all_inode_pages() afterwards to make sure all page markings are cleared up. (2) Enable a cookie: void fscache_enable_cookie(struct fscache_cookie *cookie, bool (*can_enable)(void *data), void *data) If the cookie is not already enabled, this locks the cookie against other dis/enablement ops, invokes can_enable() and, if the cookie is not an index cookie, will begin the procedure of acquiring backing objects. The optional can_enable() function is passed the data argument and returns a ruling as to whether or not enablement should actually be permitted to begin. All possible failures are handled internally. The cookie will only be marked as enabled if provisional backing objects are allocated. A later patch will introduce these to NFS. Cookie enablement during nfs_open() is then contingent on i_writecount <= 0. can_enable() checks for a race between open(O_RDONLY) and open(O_WRONLY/O_RDWR). This simplifies NFS's cookie handling and allows us to get rid of open(O_RDONLY) accidentally introducing caching to an inode that's open for writing already. One operation has its API modified: (3) Acquire a cookie. struct fscache_cookie *fscache_acquire_cookie( struct fscache_cookie *parent, const struct fscache_cookie_def *def, void *netfs_data, bool enable); This now has an additional argument that indicates whether the requested cookie should be enabled by default. It doesn't need the can_enable() function because the caller must prevent multiple calls for the same netfs object and it doesn't need to take the enablement lock because no one else can get at the cookie before this returns. Signed-off-by: David Howells <dhowells@redhat.com
2013-09-27FS-Cache: Add use/unuse/wake cookie wrappersDavid Howells
Add wrapper functions for dealing with cookie->n_active: (*) __fscache_use_cookie() to increment it. (*) __fscache_unuse_cookie() to decrement and test against zero. (*) __fscache_wake_unused_cookie() to wake up anyone waiting for it to reach zero. The second and third are split so that the third can be done after cookie->lock has been released in case the waiter wakes up whilst we're still holding it and tries to get it. We will need to wake-on-zero once the cookie disablement patch is applied because it will then be possible to see n_active become zero without the cookie being relinquished. Also move the cookie usement out of fscache_attr_changed_op() and into fscache_attr_changed() and the operation struct so that cookie disablement will be able to track it. Whilst we're at it, only increment n_active if we're about to do fscache_submit_op() so that we don't have to deal with undoing it if anything earlier fails. Possibly this should be moved into fscache_submit_op() which could look at FSCACHE_OP_UNUSE_COOKIE. Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "These fix several bugs with RBD from 3.11 that didn't get tested in time for the merge window: some error handling, a use-after-free, and a sequencing issue when unmapping and image races with a notify operation. There is also a patch fixing a problem with the new ceph + fscache code that just went in" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: fscache: check consistency does not decrement refcount rbd: fix error handling from rbd_snap_name() rbd: ignore unmapped snapshots that no longer exist rbd: fix use-after free of rbd_dev->disk rbd: make rbd_obj_notify_ack() synchronous rbd: complete notifies before cleaning up osd_client and rbd_dev libceph: add function to ensure notifies are complete
2013-09-11lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interruptJan Kara
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-10fscache: check consistency does not decrement refcountMilosz Tanski
__fscache_check_consistency() does not decrement the count of operations active after it finishes in the success case. This leads to a hung tasks on cookie de-registration (commonly in inode eviction). INFO: task kworker/1:2:4214 blocked for more than 120 seconds. kworker/1:2 D ffff880443513fc0 0 4214 2 0x00000000 Workqueue: ceph-msgr con_work [libceph] ... Call Trace: [<ffffffff81569fc6>] ? _raw_spin_unlock_irqrestore+0x16/0x20 [<ffffffffa0016570>] ? fscache_wait_bit_interruptible+0x30/0x30 [fscache] [<ffffffff81568d09>] schedule+0x29/0x70 [<ffffffffa001657e>] fscache_wait_atomic_t+0xe/0x20 [fscache] [<ffffffff815665cf>] out_of_line_wait_on_atomic_t+0x9f/0xe0 [<ffffffff81083560>] ? autoremove_wake_function+0x40/0x40 [<ffffffffa0015a9c>] __fscache_relinquish_cookie+0x15c/0x310 [fscache] [<ffffffffa00a4fae>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph] [<ffffffffa007e373>] ceph_destroy_inode+0x33/0x200 [ceph] [<ffffffff811c13ae>] ? __fsnotify_inode_delete+0xe/0x10 [<ffffffff8119ba1c>] destroy_inode+0x3c/0x70 [<ffffffff8119bb69>] evict+0x119/0x1b0 Signed-off-by: Milosz Tanski <milosz@adfin.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Sage Weil <sage@inktank.com>
2013-09-06fscache: Netfs function for cleanup post readpagesMilosz Tanski
Currently the fscache code expect the netfs to call fscache_readpages_or_alloc inside the aops readpages callback. It marks all the pages in the list provided by readahead with PG_private_2. In the cases that the netfs fails to read all the pages (which is legal) it ends up returning to the readahead and triggering a BUG. This happens because the page list still contains marked pages. This patch implements a simple fscache_readpages_cancel function that the netfs should call before returning from readpages. It will revoke the pages from the underlying cache backend and unmark them. The problem was originally worked out in the Ceph devel tree, but it also occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages() will clean up the unprocessed pages in the case of an error. This can be used to address the following oops: [12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0 [12410647.597298] page flags: 0x200000000001000(private_2) ... [12410647.597334] Call Trace: [12410647.597345] [<ffffffff815523f2>] dump_stack+0x19/0x1b [12410647.597356] [<ffffffff8111def7>] bad_page+0xc7/0x120 [12410647.597359] [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120 [12410647.597361] [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170 [12410647.597363] [<ffffffff81123507>] __put_single_page+0x27/0x30 [12410647.597365] [<ffffffff81123df5>] put_page+0x25/0x40 [12410647.597376] [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph] [12410647.597379] [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260 [12410647.597382] [<ffffffff81122ea1>] ra_submit+0x21/0x30 [12410647.597384] [<ffffffff81118f64>] filemap_fault+0x254/0x490 [12410647.597387] [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0 [12410647.597391] [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0 [12410647.597395] [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0 [12410647.597398] [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930 [12410647.597401] [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110 [12410647.597403] [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10 [12410647.597405] [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [12410647.597407] [<ffffffff8113f361>] handle_mm_fault+0x251/0x370 [12410647.597411] [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30 [12410647.597414] [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550 [12410647.597418] [<ffffffff8108011d>] ? up_write+0x1d/0x20 [12410647.597422] [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0 [12410647.597425] [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240 [12410647.597427] [<ffffffff8155c3ae>] do_page_fault+0xe/0x10 [12410647.597431] [<ffffffff81558818>] page_fault+0x28/0x30 Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-06FS-Cache: Add interface to check consistency of a cached objectDavid Howells
Extend the fscache netfs API so that the netfs can ask as to whether a cache object is up to date with respect to its corresponding netfs object: int fscache_check_consistency(struct fscache_cookie *cookie) This will call back to the netfs to check whether the auxiliary data associated with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it may also return -ENOMEM and -ERESTARTSYS. The backends now have to implement a mandatory operation pointer: int (*check_consistency)(struct fscache_object *object) that corresponds to the above API call. FS-Cache takes care of pinning the object and the cookie in memory and managing this call with respect to the object state. Original-author: Hongyi Jia <jiayisuse@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Hongyi Jia <jiayisuse@gmail.com> cc: Milosz Tanski <milosz@adfin.com>
2013-06-19FS-Cache: Don't use spin_is_locked() in assertionsDavid Howells
Under certain circumstances, spin_is_locked() is hardwired to 0 - even when the code would normally be in a locked section where it should return 1. This means it cannot be used for an assertion that checks that a spinlock is locked. Remove such usages from FS-Cache. The following oops might otherwise be observed: FS-Cache: Assertion failed BUG: failure at fs/fscache/operation.c:270/fscache_start_operations()! Kernel panic - not syncing: BUG! CPU: 0 PID: 10 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-00133-ge7ebb75 #2 Workqueue: fscache_operation fscache_op_work_func [fscache] 7f091c48 603c8947 7f090000 7f9b1361 7f25f080 00000001 7f26d440 7f091c90 60299eb8 7f091d90 602951c5 7f26d440 3000000008 7f091da0 7f091cc0 7f091cd0 00000007 00000007 00000006 7f091ae0 00000010 0000010e 7f9af330 7f091ae0 Call Trace: 7f091c88: [<60299eb8>] dump_stack+0x17/0x19 7f091c98: [<602951c5>] panic+0xf4/0x1e9 7f091d38: [<6002b10e>] set_signals+0x1e/0x40 7f091d58: [<6005b89e>] __wake_up+0x4e/0x70 7f091d98: [<7f9aa003>] fscache_start_operations+0x43/0x50 [fscache] 7f091da8: [<7f9aa1e3>] fscache_op_complete+0x1d3/0x220 [fscache] 7f091db8: [<60082985>] unlock_page+0x55/0x60 7f091de8: [<7fb25bb0>] cachefiles_read_copier+0x250/0x330 [cachefiles] 7f091e58: [<7f9ab03c>] fscache_op_work_func+0xac/0x120 [fscache] 7f091e88: [<6004d5b0>] process_one_work+0x250/0x3a0 7f091ef8: [<6004edc7>] worker_thread+0x177/0x2a0 7f091f38: [<6004ec50>] worker_thread+0x0/0x2a0 7f091f58: [<60054418>] kthread+0xd8/0xe0 7f091f68: [<6005bb27>] finish_task_switch.isra.64+0x37/0xa0 7f091fd8: [<600185cf>] new_thread_handler+0x8f/0xb0 Reported-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
2013-06-19FS-Cache: The retrieval remaining-pages counter needs to be atomic_tDavid Howells
struct fscache_retrieval contains a count of the number of pages that still need some processing (n_pages). This is decremented as the pages are processed. However, this needs to be atomic as fscache_retrieval_complete() (I think) just occasionally may be called from cachefiles_read_backing_file() and cachefiles_read_copier() simultaneously. This happens when an fscache_read_or_alloc_pages() request containing a lot of pages (say a couple of hundred) is being processed. The read on each backing page is dispatched individually because we need to insert a monitor into the waitqueue to catch when the read completes. However, under low-memory conditions, we might be forced to wait in the allocator - and this gives the I/O on the backing page a chance to complete first. When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto the workqueue without waiting for the operation to finish the initial I/O dispatch (we want to release any pages we can as soon as we can), thus both can end up running simultaneously and potentially attempting to partially complete the retrieval simultaneously (ENOMEM may occur, backing pages may already be in the page cache). This was demonstrated by parallelling the non-atomic counter with an atomic counter and printing both of them when the assertion fails. At this point, the atomic counter has reached zero, but the non-atomic counter has not. To fix this, make the counter an atomic_t. This results in the following bug appearing FS-Cache: Assertion failed 3 == 5 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:421! or FS-Cache: Assertion failed 3 == 5 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:414! With a backtrace like the following: RIP: 0010:[<ffffffffa0211b1d>] fscache_put_operation+0x1ad/0x240 [fscache] Call Trace: [<ffffffffa0213185>] fscache_retrieval_work+0x55/0x270 [fscache] [<ffffffffa0213130>] ? fscache_retrieval_work+0x0/0x270 [fscache] [<ffffffff81090b10>] worker_thread+0x170/0x2a0 [<ffffffff81096d10>] ? autoremove_wake_function+0x0/0x40 [<ffffffff810909a0>] ? worker_thread+0x0/0x2a0 [<ffffffff81096966>] kthread+0x96/0xa0 [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffff810968d0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19FS-Cache: Simplify cookie retention for fscache_objects, fixing oopsDavid Howells
Simplify the way fscache cache objects retain their cookie. The way I implemented the cookie storage handling made synchronisation a pain (ie. the object state machine can't rely on the cookie actually still being there). Instead of the the object being detached from the cookie and the cookie being freed in __fscache_relinquish_cookie(), we defer both operations: (*) The detachment of the object from the list in the cookie now takes place in fscache_drop_object() and is thus governed by the object state machine (fscache_detach_from_cookie() has been removed). (*) The release of the cookie is now in fscache_object_destroy() - which is called by the cache backend just before it frees the object. This means that the fscache_cookie struct is now available to the cache all the way through from ->alloc_object() to ->drop_object() and ->put_object() - meaning that it's no longer necessary to take object->lock to guarantee access. However, __fscache_relinquish_cookie() doesn't wait for the object to go all the way through to destruction before letting the netfs proceed. That would massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the cookie around, in must therefore break all attachments to the netfs - which includes ->def, ->netfs_data and any outstanding page read/writes. To handle this, struct fscache_cookie now has an n_active counter: (1) This starts off initialised to 1. (2) Any time the cache needs to get at the netfs data, it calls fscache_use_cookie() to increment it - if it is not zero. If it was zero, then access is not permitted. (3) When the cache has finished with the data, it calls fscache_unuse_cookie() to decrement it. This does a wake-up on it if it reaches 0. (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to reach 0. The initialisation to 1 in step (1) ensures that we only get wake ups when we're trying to get rid of the cookie. This leaves __fscache_relinquish_cookie() a lot simpler. *** This fixes a problem in the current code whereby if fscache_invalidate() is followed sufficiently quickly by fscache_relinquish_cookie() then it is possible for __fscache_relinquish_cookie() to have detached the cookie from the object and cleared the pointer before a thread is dispatched to process the invalidation state in the object state machine. Since the pending write clearance was deferred to the invalidation state to make it asynchronous, we need to either wait in relinquishment for the stores tree to be cleared in the invalidation state or we need to handle the clearance in relinquishment. Further, if the relinquishment code does clear the tree, then the invalidation state need to make the clearance contingent on still having the cookie to hand (since that's where the tree is rooted) and we have to prevent the cookie from disappearing for the duration. This can lead to an oops like the following: BUG: unable to handle kernel NULL pointer dereference at 000000000000000c ... RIP: 0010:[<ffffffff8151023e>] _spin_lock+0xe/0x30 ... CR2: 000000000000000c ... ... Process kslowd002 (...) .... Call Trace: [<ffffffffa01c3278>] fscache_invalidate_writes+0x38/0xd0 [fscache] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff8105e759>] ? find_busiest_queue+0x69/0x150 [<ffffffff8110ddd4>] ? slow_work_enqueue+0x104/0x180 [<ffffffffa01c1303>] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache] [<ffffffff81096b67>] ? bit_waitqueue+0x17/0xd0 [<ffffffff8110e233>] slow_work_execute+0x233/0x310 [<ffffffff8110e515>] slow_work_thread+0x205/0x360 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8110e310>] ? slow_work_thread+0x0/0x360 [<ffffffff81096936>] kthread+0x96/0xa0 [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffff810968a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 The parameter to fscache_invalidate_writes() was object->cookie which is NULL. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19FS-Cache: Fix object state machine to have separate work and wait statesDavid Howells
Fix object state machine to have separate work and wait states as that makes it easier to envision. There are now three kinds of state: (1) Work state. This is an execution state. No event processing is performed by a work state. The function attached to a work state returns a pointer indicating the next state to which the OSM should transition. Returning NO_TRANSIT repeats the current state, but goes back to the scheduler first. (2) Wait state. This is an event processing state. No execution is performed by a wait state. Wait states are just tables of "if event X occurs, clear it and transition to state Y". The dispatcher returns to the scheduler if none of the events in which the wait state has an interest are currently pending. (3) Out-of-band state. This is a special work state. Transitions to normal states can be overridden when an unexpected event occurs (eg. I/O error). Instead the dispatcher disables and clears the OOB event and transits to the specified work state. This then acts as an ordinary work state, though object->state points to the overridden destination. Returning NO_TRANSIT resumes the overridden transition. In addition, the states have names in their definitions, so there's no need for tables of state names. Further, the EV_REQUEUE event is no longer necessary as that is automatic for work states. Since the states are now separate structs rather than values in an enum, it's not possible to use comparisons other than (non-)equality between them, so use some object->flags to indicate what phase an object is in. The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one (EV_KILL). An object flag now carries the information about retirement. Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged into an KILL_OBJECT state and additional states have been added for handling waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS). A state has also been added for synchronising with parent object initialisation (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY). Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19FS-Cache: Wrap checks on object stateDavid Howells
Wrap checks on object state (mostly outside of fs/fscache/object.c) with inline functions so that the mechanism can be replaced. Some of the state checks within object.c are left as-is as they will be replaced. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19FS-Cache: Uninline fscache_object_init()David Howells
Uninline fscache_object_init() so as not to expose some of the FS-Cache internals to the cache backend. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19FS-Cache: Don't sleep in page release if __GFP_FS is not setDavid Howells
Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set. This goes some way towards mitigating fscache deadlocking against ext4 by way of the allocator, eg: INFO: task flush-8:0:24427 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. flush-8:0 D ffff88003e2b9fd8 0 24427 2 0x00000000 ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8 0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040 0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5 Call Trace: [<ffffffff8106dcf5>] ? __lock_is_held+0x31/0x53 [<ffffffff81219b61>] ? radix_tree_lookup_element+0xf4/0x12a [<ffffffff81454bed>] schedule+0x60/0x62 [<ffffffffa01d349c>] __fscache_wait_on_page_write+0x8b/0xa5 [fscache] [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d [<ffffffffa01d393a>] __fscache_maybe_release_page+0x30c/0x324 [fscache] [<ffffffffa01d369a>] ? __fscache_maybe_release_page+0x6c/0x324 [fscache] [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170 [<ffffffffa01fd7b2>] nfs_fscache_release_page+0x68/0x94 [nfs] [<ffffffffa01ef73e>] nfs_release_page+0x7e/0x86 [nfs] [<ffffffff810aa553>] try_to_release_page+0x32/0x3b [<ffffffff810b6c70>] shrink_page_list+0x535/0x71a [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170 [<ffffffff810b7352>] shrink_inactive_list+0x20a/0x2dd [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea [<ffffffff810b7a65>] shrink_lruvec+0x34c/0x3eb [<ffffffff810b7bd3>] do_try_to_free_pages+0xcf/0x355 [<ffffffff810b7fc8>] try_to_free_pages+0x9a/0xa1 [<ffffffff810b08d2>] __alloc_pages_nodemask+0x494/0x6f7 [<ffffffff810d9a07>] kmem_getpages+0x58/0x155 [<ffffffff810dc002>] fallback_alloc+0x120/0x1f3 [<ffffffff8106db23>] ? trace_hardirqs_off+0xd/0xf [<ffffffff810dbed3>] ____cache_alloc_node+0x177/0x186 [<ffffffff81162a6c>] ? ext4_init_io_end+0x1c/0x37 [<ffffffff810dc403>] kmem_cache_alloc+0xf1/0x176 [<ffffffff810b17ac>] ? test_set_page_writeback+0x101/0x113 [<ffffffff81162a6c>] ext4_init_io_end+0x1c/0x37 [<ffffffff81162ce4>] ext4_bio_write_page+0x20f/0x3af [<ffffffff8115cc02>] mpage_da_submit_io+0x26e/0x2f6 [<ffffffff811088e5>] ? __find_get_block_slow+0x38/0x133 [<ffffffff81161348>] mpage_da_map_and_submit+0x3a7/0x3bd [<ffffffff81161a60>] ext4_da_writepages+0x30d/0x426 [<ffffffff810b3359>] do_writepages+0x1c/0x2a [<ffffffff81102f4d>] __writeback_single_inode+0x3e/0xe5 [<ffffffff81103995>] writeback_sb_inodes+0x1bd/0x2f4 [<ffffffff81103b3b>] __writeback_inodes_wb+0x6f/0xb4 [<ffffffff81103c81>] wb_writeback+0x101/0x195 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170 [<ffffffff811043aa>] ? wb_do_writeback+0xaa/0x173 [<ffffffff8110434a>] wb_do_writeback+0x4a/0x173 [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf [<ffffffff81038554>] ? del_timer+0x4b/0x5b [<ffffffff811044e0>] bdi_writeback_thread+0x6d/0x147 [<ffffffff81104473>] ? wb_do_writeback+0x173/0x173 [<ffffffff81048fbc>] kthread+0xd0/0xd8 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55 [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55 2 locks held by flush-8:0/24427: #0: (&type->s_umount_key#41){.+.+..}, at: [<ffffffff810e3b73>] grab_super_passive+0x4c/0x76 #1: (jbd2_handle){+.+...}, at: [<ffffffff81190d81>] start_this_handle+0x475/0x4ea The problem here is that another thread, which is attempting to write the to-be-stored NFS page to the on-ext4 cache file is waiting for the journal lock, eg: INFO: task kworker/u:2:24437 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u:2 D ffff880039589768 0 24437 2 0x00000000 ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8 0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040 0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13 Call Trace: [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea [<ffffffff81455e73>] ? _raw_spin_unlock_irqrestore+0x3a/0x50 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170 [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf [<ffffffff81454bed>] schedule+0x60/0x62 [<ffffffff81190c23>] start_this_handle+0x317/0x4ea [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d [<ffffffff81190fcc>] jbd2__journal_start+0xb3/0x12e [<ffffffff81176606>] __ext4_journal_start_sb+0xb2/0xc6 [<ffffffff8115f137>] ext4_da_write_begin+0x109/0x233 [<ffffffff810a964d>] generic_file_buffered_write+0x11a/0x264 [<ffffffff811032cf>] ? __mark_inode_dirty+0x2d/0x1ee [<ffffffff810ab1ab>] __generic_file_aio_write+0x2a5/0x2d5 [<ffffffff810ab24a>] generic_file_aio_write+0x6f/0xd0 [<ffffffff81159a2c>] ext4_file_write+0x38c/0x3c4 [<ffffffff810e0915>] do_sync_write+0x91/0xd1 [<ffffffffa00a17f0>] cachefiles_write_page+0x26f/0x310 [cachefiles] [<ffffffffa01d470b>] fscache_write_op+0x21e/0x37a [fscache] [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e [<ffffffffa01d2479>] fscache_op_work_func+0x78/0xd7 [fscache] [<ffffffff8104455a>] process_one_work+0x232/0x3a8 [<ffffffff810444ff>] ? process_one_work+0x1d7/0x3a8 [<ffffffff81044ee0>] worker_thread+0x214/0x303 [<ffffffff81044ccc>] ? manage_workers+0x245/0x245 [<ffffffff81048fbc>] kthread+0xd0/0xd8 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55 [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55 4 locks held by kworker/u:2/24437: #0: (fscache_operation){.+.+.+}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8 #1: ((&op->work)){+.+.+.}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8 #2: (sb_writers#14){.+.+.+}, at: [<ffffffff810ab22c>] generic_file_aio_write+0x51/0xd0 #3: (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [<ffffffff810ab236>] generic_file_aio_write+0x5b/0x fscache already tries to cancel pending stores, but it can't cancel a write for which I/O is already in progress. An alternative would be to accept writing garbage to the cache under extreme circumstances and to kill the afflicted cache object if we have to do this. However, we really need to know how strapped the allocator is before deciding to do that. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19fs/fscache: remove spin_lock() from the condition in while()Sebastian Andrzej Siewior
The spinlock() within the condition in while() will cause a compile error if it is not a function. This is not a problem on mainline but it does not look pretty and there is no reason to do it that way. That patch writes it a little differently and avoids the double condition. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2013-04-29fs/fscache/stats.c: fix memory leakAnurup m
There is a kernel memory leak observed when the proc file /proc/fs/fscache/stats is read. The reason is that in fscache_stats_open, single_open is called and the respective release function is not called during release. Hence fix with correct release function - single_release(). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=57101 Signed-off-by: Anurup m <anurup.m@huawei.com> Cc: shyju pv <shyju.pv@huawei.com> Cc: Sanil kumar <sanil.kumar@huawei.com> Cc: Nataraj m <nataraj.m@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20FS-Cache: Clear remaining page count on retrieval cancellationDavid Howells
Provide fscache_cancel_op() with a pointer to a function it should invoke under lock if it cancels an operation. Use this to clear the remaining page count upon cancellation of a pending retrieval operation so that fscache_release_retrieval_op() doesn't get an assertion failure (see below). This can happen when a signal occurs, say from CTRL-C being pressed during data retrieval. FS-Cache: Assertion failed 3 == 0 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/page.c:237! invalid opcode: 0000 [#641] SMP Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F) CPU 0 Pid: 6075, comm: slurp-q Tainted: GF D 3.7.0-rc8-fsdevel+ #411 /DG965RY RIP: 0010:[<ffffffffa007f328>] [<ffffffffa007f328>] fscache_release_retrieval_op+0x75/0xff [fscache] RSP: 0000:ffff88001c6d7988 EFLAGS: 00010296 RAX: 000000000000000f RBX: ffff880014cdfe00 RCX: ffffffff6c102000 RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6 RBP: ffff88001c6d7998 R08: 0000000000000002 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffe00 R13: ffff88001c6d7ab4 R14: ffff88001a8638a0 R15: ffff88001552b190 FS: 00007f877aaf0700(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fff11378fd2 CR3: 000000001c6c6000 CR4: 00000000000007f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process slurp-q (pid: 6075, threadinfo ffff88001c6d6000, task ffff88001c6c4080) Stack: ffffffffa007ec07 ffff880014cdfe00 ffff88001c6d79c8 ffffffffa007db4d ffffffffa007ec07 ffff880014cdfe00 00000000fffffe00 ffff88001c6d7ab4 ffff88001c6d7a38 ffffffffa008116d 0000000000000000 ffff88001c6c4080 Call Trace: [<ffffffffa007ec07>] ? fscache_cancel_op+0x194/0x1cf [fscache] [<ffffffffa007db4d>] fscache_put_operation+0x135/0x2ed [fscache] [<ffffffffa007ec07>] ? fscache_cancel_op+0x194/0x1cf [fscache] [<ffffffffa008116d>] __fscache_read_or_alloc_pages+0x413/0x4bc [fscache] [<ffffffff810ac8ae>] ? __alloc_pages_nodemask+0x195/0x75c [<ffffffffa00aab0f>] __nfs_readpages_from_fscache+0x86/0x13d [nfs] [<ffffffffa00a5fe0>] nfs_readpages+0x186/0x1bd [nfs] [<ffffffff810d23c8>] ? alloc_pages_current+0xc7/0xe4 [<ffffffff810a68b5>] ? __page_cache_alloc+0x84/0x91 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0 [<ffffffff810afaa3>] __do_page_cache_readahead+0x237/0x2e0 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0 [<ffffffff810afe3e>] ra_submit+0x1c/0x20 [<ffffffff810b019b>] ondemand_readahead+0x359/0x382 [<ffffffff810b0279>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810a77b5>] generic_file_aio_read+0x26b/0x637 [<ffffffffa00f1852>] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4] [<ffffffffa009cc85>] nfs_file_read+0xaa/0xcf [nfs] [<ffffffff810db5b3>] do_sync_read+0x91/0xd1 [<ffffffff810dbb8b>] vfs_read+0x9b/0x144 [<ffffffff810dbc78>] sys_read+0x44/0x75 [<ffffffff81422892>] system_call_fastpath+0x16/0x1b Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20FS-Cache: Mark cancellation of in-progress operationDavid Howells
Mark as cancelled an operation that is in progress rather than pending at the time it is cancelled, and call fscache_complete_op() to cancel an operation so that blocked ops can be started. Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20FS-Cache: One of the write operation paths doesn't set the object stateDavid Howells
In fscache_write_op(), if the object is determined to have become inactive or to have lost its cookie, we don't move the operation state from in-progress, and so an assertion in fscache_put_operation() fails with an assertion (see below). Instrumenting fscache_op_work_func() indicates that it called fscache_write_op() before calling fscache_put_operation() - where the assertion failed. The assertion at line 433 indicates that the operation state is IN_PROGRESS rather than being COMPLETE or CANCELLED. Instrumenting fscache_write_op() showed that it was being called on an object that had had its cookie removed and that this was due to relinquishment of the cookie by the netfs. At this point fscache no longer has access to the pages of netfs data that were requested to be written, and so simply cancelling the operation is the thing to do. FS-Cache: Assertion failed 3 == 5 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:433! invalid opcode: 0000 [#1] SMP Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F) CPU 0 Pid: 1035, comm: kworker/u:3 Tainted: GF 3.7.0-rc8-fsdevel+ #411 /DG965RY RIP: 0010:[<ffffffffa007db22>] [<ffffffffa007db22>] fscache_put_operation+0x11a/0x2ed [fscache] RSP: 0018:ffff88003e32bcf8 EFLAGS: 00010296 RAX: 000000000000000f RBX: ffff88001818eb78 RCX: ffffffff6c102000 RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6 RBP: ffff88003e32bd18 R08: 0000000000000002 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa00811da R13: 0000000000000001 R14: 0000000100625d26 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fff7dd31c68 CR3: 000000003d730000 CR4: 00000000000007f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/u:3 (pid: 1035, threadinfo ffff88003e32a000, task ffff88003bb38080) Stack: ffffffff8102d1ad ffff88001818eb78 ffffffffa00811da 0000000000000001 ffff88003e32bd48 ffffffffa007f0ad ffff88001818eb78 ffffffff819583c0 ffff88003df24e00 ffff88003882c3e0 ffff88003e32bde8 ffffffff81042de0 Call Trace: [<ffffffff8102d1ad>] ? vprintk_emit+0x3c6/0x41a [<ffffffffa00811da>] ? __fscache_read_or_alloc_pages+0x4bc/0x4bc [fscache] [<ffffffffa007f0ad>] fscache_op_work_func+0xec/0x123 [fscache] [<ffffffff81042de0>] process_one_work+0x21c/0x3b0 [<ffffffff81042d82>] ? process_one_work+0x1be/0x3b0 [<ffffffffa007efc1>] ? fscache_operation_gc+0x23e/0x23e [fscache] [<ffffffff8104332e>] worker_thread+0x202/0x2df [<ffffffff8104312c>] ? rescuer_thread+0x18e/0x18e [<ffffffff81047c1c>] kthread+0xd0/0xd8 [<ffffffff81421bfa>] ? _raw_spin_unlock_irq+0x29/0x3e [<ffffffff81047b4c>] ? __init_kthread_worker+0x55/0x55 [<ffffffff814227ec>] ret_from_fork+0x7c/0xb0 [<ffffffff81047b4c>] ? __init_kthread_worker+0x55/0x55 Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20FS-Cache: Fix signal handling during waitsDavid Howells
wait_on_bit() with TASK_INTERRUPTIBLE returns 1 rather than a negative error code, so change what we check for. This means that the signal handling in fscache_wait_for_retrieval_activation() should now work properly. Without this, the following bug can be seen if CTRL-C is pressed during fscache read operation: FS-Cache: Assertion failed 2 == 3 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/page.c:347! invalid opcode: 0000 [#1] SMP Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F) CPU 1 Pid: 15006, comm: slurp-q Tainted: GF 3.7.0-rc8-fsdevel+ #411 /DG965RY RIP: 0010:[<ffffffffa007fcb4>] [<ffffffffa007fcb4>] fscache_wait_for_retrieval_activation+0x167/0x177 [fscache] RSP: 0018:ffff88002a4c39a8 EFLAGS: 00010292 RAX: 000000000000001a RBX: ffff88002d3dc158 RCX: 0000000000008685 RDX: ffffffff8102ccd6 RSI: 0000000000000001 RDI: ffffffff8102d1d6 RBP: ffff88002a4c39c8 R08: 0000000000000002 R09: 0000000000000000 R10: ffffffff8163afa0 R11: ffff88003bd11900 R12: ffffffffa00868c8 R13: ffff880028306458 R14: ffff88002d3dc1b0 R15: ffff88001372e538 FS: 00007f17426a0700(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f1742494a44 CR3: 0000000031bd7000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process slurp-q (pid: 15006, threadinfo ffff88002a4c2000, task ffff880023de3040) Stack: ffff88002d3dc158 ffff88001372e538 ffff88002a4c3ab4 ffff8800283064e0 ffff88002a4c3a38 ffffffffa0080f6d 0000000000000000 ffff880023de3040 ffff88002a4c3ac8 ffffffff810ac8ae ffff880028306458 ffff88002a4c3bc8 Call Trace: [<ffffffffa0080f6d>] __fscache_read_or_alloc_pages+0x24f/0x4bc [fscache] [<ffffffff810ac8ae>] ? __alloc_pages_nodemask+0x195/0x75c [<ffffffffa00aab0f>] __nfs_readpages_from_fscache+0x86/0x13d [nfs] [<ffffffffa00a5fe0>] nfs_readpages+0x186/0x1bd [nfs] [<ffffffff810d23c8>] ? alloc_pages_current+0xc7/0xe4 [<ffffffff810a68b5>] ? __page_cache_alloc+0x84/0x91 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0 [<ffffffff810afaa3>] __do_page_cache_readahead+0x237/0x2e0 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0 [<ffffffff810afe3e>] ra_submit+0x1c/0x20 [<ffffffff810b019b>] ondemand_readahead+0x359/0x382 [<ffffffff810b0279>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810a77b5>] generic_file_aio_read+0x26b/0x637 [<ffffffffa00f1852>] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4] [<ffffffffa009cc85>] nfs_file_read+0xaa/0xcf [nfs] [<ffffffff810db5b3>] do_sync_read+0x91/0xd1 [<ffffffff810dbb8b>] vfs_read+0x9b/0x144 [<ffffffff810dbc78>] sys_read+0x44/0x75 [<ffffffff81422892>] system_call_fastpath+0x16/0x1b Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20FS-Cache: Add transition to handle invalidate immediately after lookup David Howells
Add a missing transition to the FS-Cache object state machine to handle an invalidation event occuring between the back end completing the object lookup by calling fscache_obtained_object() (which moves to state OBJECT_AVAILABLE) and the backend returning to fscache_lookup_object() and thence to fscache_object_state_machine() which then does a goto lookup_transit to handle the transition - but lookup_transit doesn't handle EV_INVALIDATE. Without this, the following BUG can be logged: FS-Cache: Unsupported event 2 [5/f7] in state OBJECT_AVAILABLE ------------[ cut here ]------------ kernel BUG at fs/fscache/object.c:357! Where event 2 is EV_INVALIDATE. Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20NFS: nfs_migrate_page() does not wait for FS-Cache to finish with a pageDavid Howells
nfs_migrate_page() does not wait for FS-Cache to finish with a page, probably leading to the following bad-page-state: BUG: Bad page state in process python-bin pfn:17d39b page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null) index:38686 (Tainted: G B ---------------- ) Pid: 31053, comm: python-bin Tainted: G B ---------------- 2.6.32-71.24.1.el6.x86_64 #1 Call Trace: [<ffffffff8111bfe7>] bad_page+0x107/0x160 [<ffffffff8111ee69>] free_hot_cold_page+0x1c9/0x220 [<ffffffff8111ef19>] __pagevec_free+0x59/0xb0 [<ffffffff8104b988>] ? flush_tlb_others_ipi+0x128/0x130 [<ffffffff8112230c>] release_pages+0x21c/0x250 [<ffffffff8115b92a>] ? remove_migration_pte+0x28a/0x2b0 [<ffffffff8115f3f8>] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70 [<ffffffff81122687>] ____pagevec_lru_add+0x167/0x180 [<ffffffff811226f8>] __lru_cache_add+0x58/0x70 [<ffffffff81122731>] lru_cache_add_lru+0x21/0x40 [<ffffffff81123f49>] putback_lru_page+0x69/0x100 [<ffffffff8115c0bd>] migrate_pages+0x13d/0x5d0 [<ffffffff81122687>] ? ____pagevec_lru_add+0x167/0x180 [<ffffffff81152ab0>] ? compaction_alloc+0x0/0x370 [<ffffffff8115255c>] compact_zone+0x4cc/0x600 [<ffffffff8111cfac>] ? get_page_from_freelist+0x15c/0x820 [<ffffffff810672f4>] ? check_preempt_wakeup+0x1c4/0x3c0 [<ffffffff8115290e>] compact_zone_order+0x7e/0xb0 [<ffffffff81152a49>] try_to_compact_pages+0x109/0x170 [<ffffffff8111e94d>] __alloc_pages_nodemask+0x5ed/0x850 [<ffffffff814c9136>] ? thread_return+0x4e/0x778 [<ffffffff81150d43>] alloc_pages_vma+0x93/0x150 [<ffffffff81167ea5>] do_huge_pmd_anonymous_page+0x135/0x340 [<ffffffff814cb6f6>] ? rwsem_down_read_failed+0x26/0x30 [<ffffffff81136755>] handle_mm_fault+0x245/0x2b0 [<ffffffff814ce383>] do_page_fault+0x123/0x3a0 [<ffffffff814cbdf5>] page_fault+0x25/0x30 nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait - even if __GFP_WAIT is set. The reason that doesn't wait is that fscache_maybe_release_page() might deadlock the allocator as the work threads writing to the cache may all end up sleeping on memory allocation. However, I wonder if that is actually a problem. There are a number of things I can do to deal with this: (1) Make nfs_migrate_page() wait. (2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag. (3) Set a timeout around the wait. (4) Make nfs_migrate_page() return an error if the page is still busy. For the moment, I'll select (2) and (4). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2012-12-20FS-Cache: Exclusive op submission can BUG if there's been an I/O errorDavid Howells
The function to submit an exclusive op (fscache_submit_exclusive_op()) can BUG if there's been an I/O error because it may see the parent cache object in an unexpected state. It should only BUG if there hasn't been an I/O error. In this case the problem was produced by remounting the cache partition to be R/O. The EROFS state was detected and the cache was aborted, but not everything handled the aborting correctly. SysRq : Emergency Remount R/O EXT4-fs (sda6): re-mounted. Opts: (null) Emergency Remount complete CacheFiles: I/O Error: Failed to update xattr with error -30 FS-Cache: Cache cachefiles stopped due to I/O error ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:128! invalid opcode: 0000 [#1] SMP CPU 0 Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc Pid: 6612, comm: kworker/u:2 Not tainted 3.1.0-rc8-fsdevel+ #1093 /DG965RY RIP: 0010:[<ffffffffa00739c0>] [<ffffffffa00739c0>] fscache_submit_exclusive_op+0x2ad/0x2c2 [fscache] RSP: 0018:ffff880000853d40 EFLAGS: 00010206 RAX: ffff880038ac72a8 RBX: ffff8800181f2260 RCX: ffffffff81f2b2b0 RDX: 0000000000000001 RSI: ffffffff8179a478 RDI: ffff8800181f2280 RBP: ffff880000853d60 R08: 0000000000000002 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff880038ac7268 R13: ffff8800181f2280 R14: ffff88003a359190 R15: 000000010122b162 FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000034cc4a77f0 CR3: 0000000010e96000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/u:2 (pid: 6612, threadinfo ffff880000852000, task ffff880014c3c040) Stack: ffff8800181f2260 ffff8800181f2310 ffff880038ac7268 ffff8800181f2260 ffff880000853dc0 ffffffffa0072375 ffff880037ecfe00 ffff88003a359198 ffff880000853dc0 0000000000000246 0000000000000000 ffff88000a91d308 Call Trace: [<ffffffffa0072375>] fscache_object_work_func+0x792/0xe65 [fscache] [<ffffffff81047e44>] process_one_work+0x1eb/0x37f [<ffffffff81047de6>] ? process_one_work+0x18d/0x37f [<ffffffffa0071be3>] ? fscache_enqueue_dependents+0xd8/0xd8 [fscache] [<ffffffff810482e4>] worker_thread+0x15a/0x21a [<ffffffff8104818a>] ? rescuer_thread+0x188/0x188 [<ffffffff8104bf96>] kthread+0x7f/0x87 [<ffffffff813ad6f4>] kernel_thread_helper+0x4/0x10 [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0 [<ffffffff813abd1d>] ? retint_restore_args+0xe/0xe [<ffffffff8104bf17>] ? __init_kthread_worker+0x53/0x53 [<ffffffff813ad6f0>] ? gs_change+0xb/0xb Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20FS-Cache: Limit the number of I/O error reports for a cacheDavid Howells
Limit the number of I/O error reports for a cache to 1 to prevent massive amounts of noise. After the first I/O error the cache is taken off line automatically, so must be restarted to resume caching. Signed-off-by: David Howells <dhowells@redhat.com>