summaryrefslogtreecommitdiff
path: root/fs/nfs/write.c
AgeCommit message (Collapse)Author
2014-06-24nfs: Fix cache_validity check in nfs_write_pageuptodate()Scott Mayhew
NFS_INO_INVALID_DATA cannot be ignored, even if we have a delegation. We're still having some problems with data corruption when multiple clients are appending to a file and those clients are being granted write delegations on open. To reproduce: Client A: vi /mnt/`hostname -s` while :; do echo "XXXXXXXXXXXXXXX" >>/mnt/file; sleep $(( $RANDOM % 5 )); done Client B: vi /mnt/`hostname -s` while :; do echo "YYYYYYYYYYYYYYY" >>/mnt/file; sleep $(( $RANDOM % 5 )); done What's happening is that in nfs_update_inode() we're recognizing that the file size has changed and we're setting NFS_INO_INVALID_DATA accordingly, but then we ignore the cache_validity flags in nfs_write_pageuptodate() because we have a delegation. As a result, in nfs_updatepage() we're extending the write to cover the full page even though we've not read in the data to begin with. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: <stable@vger.kernel.org> # v3.11+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-10Merge tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: - massive cleanup of the NFS read/write code by Anna and Dros - support multiple NFS read/write requests per page in order to deal with non-page aligned pNFS striping. Also cleans up the r/wsize < page size code nicely. - stable fix for ensuring inode is declared uptodate only after all the attributes have been checked. - stable fix for a kernel Oops when remounting - NFS over RDMA client fixes - move the pNFS files layout driver into its own subdirectory" * tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) NFS: populate ->net in mount data when remounting pnfs: fix lockup caused by pnfs_generic_pg_test NFSv4.1: Fix typo in dprintk NFSv4.1: Comment is now wrong and redundant to code NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state xprtrdma: Disconnect on registration failure xprtrdma: Remove BUG_ON() call sites xprtrdma: Avoid deadlock when credit window is reset SUNRPC: Move congestion window constants to header file xprtrdma: Reset connection timeout after successful reconnect xprtrdma: Use macros for reconnection timeout constants xprtrdma: Allocate missing pagelist xprtrdma: Remove Tavor MTU setting xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting xprtrdma: Reduce the number of hardway buffer allocations xprtrdma: Limit work done by completion handler xprtrmda: Reduce calls to ib_poll_cq() in completion handlers xprtrmda: Reduce lock contention in completion handlers xprtrdma: Split the completion queue xprtrdma: Make rpcrdma_ep_destroy() return void ...
2014-05-29nfs: page group support in nfs_mark_uptodateWeston Andros Adamson
Change how nfs_mark_uptodate checks to see if writes cover a whole page. This patch should have no effect yet since all page groups currently have one request, but will come into play when pg_test functions are modified to split pages into sub-page regions. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29nfs: page group syncing in write pathWeston Andros Adamson
Operations that modify state for a whole page must be syncronized across all requests within a page group. In the write path, this is calling end_page_writeback and removing the head request from an inode. Both of these operations should not be called until all requests in a page group have reached the point where they would call them. This patch should have no effect yet since all page groups currently have one request, but will come into play when pg_test functions are modified to split pages into sub-page regions. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29nfs: add support for multiple nfs reqs per pageWeston Andros Adamson
Add "page groups" - a circular list of nfs requests (struct nfs_page) that all reference the same page. This gives nfs read and write paths the ability to account for sub-page regions independently. This somewhat follows the design of struct buffer_head's sub-page accounting. Only "head" requests are ever added/removed from the inode list in the buffered write path. "head" and "sub" requests are treated the same through the read path and the rest of the write/commit path. Requests are given an extra reference across the life of the list. Page groups are never rejoined after being split. If the read/write request fails and the client falls back to another path (ie revert to MDS in PNFS case), the already split requests are pushed through the recoalescing code again, which may split them further and then coalesce them into properly sized requests on the wire. Fragmentation shouldn't be a problem with the current design, because we flush all requests in page group when a non-contiguous request is added, so the only time resplitting should occur is on a resend of a read or write. This patch lays the groundwork for sub-page splitting, but does not actually do any splitting. For now all page groups have one request as pg_test functions don't yet split pages. There are several related patches that are needed support multiple requests per page group. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29nfs: remove unused arg from nfs_create_requestWeston Andros Adamson
@inode is passed but not used. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29NFS: Create a common nfs_pageio_ops structAnna Schumaker
At this point the read and write structures look identical, so combine them into something shared by both. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29NFS: Create a common generic_pg_pgios()Anna Schumaker
What we have here is two functions that look identical. Let's share some more code! Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29NFS: Create a common multiple_pgios() functionAnna Schumaker
Once again, these two functions look identical in the read and write case. Time to combine them together! Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29NFS: Create a common initiate_pgio() functionAnna Schumaker
Most of this code is the same for both the read and write paths, so combine everything and use the rw_ops when necessary. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a generic_pgio functionAnna Schumaker
These functions are almost identical on both the read and write side. FLUSH_COND_STABLE will never be set for the read path, so leaving it in the generic code won't hurt anything. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common pgio_error functionAnna Schumaker
At this point, the read and write versions of this function look identical so both should use the same function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common rpcsetup function for reads and writesAnna Schumaker
Write adds a little bit of code dealing with flush flags, but since "how" will always be 0 when reading we can share the code. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common rpc_call_ops structAnna Schumaker
The read and write paths set up this struct in exactly the same way, so create a single shared struct. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common nfs_pgio_result_common functionAnna Schumaker
Combining these functions will let me make a single nfs_rw_common_ops struct (see the next patch). Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common pgio_rpc_prepare functionAnna Schumaker
The read and write paths do exactly the same thing for the rpc_prepare rpc_op. This patch combines them together into a single function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common rw_header_alloc and rw_header_free functionAnna Schumaker
I create a new struct nfs_rw_ops to decide the differences between reads and writes. This struct will be set when initializing a new nfs_pgio_descriptor, and then passed on to the nfs_rw_header when a new header is allocated. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common pgio_alloc and pgio_release functionAnna Schumaker
These functions are identical for the read and write paths so they can be combined. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Move the write verifier into the nfs_pgio_headerAnna Schumaker
The header had a pointer to the verifier that was set from the old write data struct. We don't need to keep the pointer around now that we have shared structures. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common read and write header structAnna Schumaker
The only difference is the write verifier field, but we can keep that for a little bit longer. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common read and write data structAnna Schumaker
At this point, the only difference between nfs_read_data and nfs_write_data is the write verifier. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common results structure for reads and writesAnna Schumaker
Reads and writes have very similar results. This patch combines the two structs together with comments to show where the differing fields are used. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Create a common argument structure for reads and writesAnna Schumaker
Reads and writes have very similar arguments. This patch combines them together and documents the few fields used only by write. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28nfs: remove ->write_pageio_init from rpc opsChristoph Hellwig
The write_pageio_init method is just a very convoluted way to grab the right nfs_pageio_ops vector. The vector to chose is not a choice of protocol version, but just a pNFS vs MDS I/O choice that can simply be done inside nfs_pageio_init_write based on the presence of a layout driver, and a new force_mds flag to the special case of falling back to MDS I/O on a pNFS-capable volume. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-04-18arch: Mass conversion of smp_mb__*()Peter Zijlstra
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-15NFS: Don't ignore suid/sgid bit changes after a successful writeTrond Myklebust
If we suspect that the server may have cleared the suid/sgid bit, then mark the inode for revalidation. Reported-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-28nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATINGJeff Layton
If the setting of NFS_INO_INVALIDATING gets reordered to before the clearing of NFS_INO_INVALID_DATA, then another task may hit a race window where both appear to be clear, even though the inode's pages are still in need of invalidation. Fix this by adding the appropriate memory barriers. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-27NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mappingJeff Layton
There is a possible race in how the nfs_invalidate_mapping function is handled. Currently, we go and invalidate the pages in the file and then clear NFS_INO_INVALID_DATA. The problem is that it's possible for a stale page to creep into the mapping after the page was invalidated (i.e., via readahead). If another writer comes along and sets the flag after that happens but before invalidate_inode_pages2 returns then we could clear the flag without the cache having been properly invalidated. So, we must clear the flag first and then invalidate the pages. Doing this however, opens another race: It's possible to have two concurrent read() calls that end up in nfs_revalidate_mapping at the same time. The first one clears the NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping. Just before calling that though, the other task races in, checks the flag and finds it cleared. At that point, it trusts that the mapping is good and gets the lock on the page, allowing the read() to be satisfied from the cache even though the data is no longer valid. These effects are easily manifested by running diotest3 from the LTP test suite on NFS. That program does a series of DIO writes and buffered reads. The operations are serialized and page-aligned but the existing code fails the test since it occasionally allows a read to come out of the cache incorrectly. While mixing direct and buffered I/O isn't recommended, I believe it's possible to hit this in other ways that just use buffered I/O, though that situation is much harder to reproduce. The problem is that the checking/clearing of that flag and the invalidation of the mapping really need to be atomic. Fix this by serializing concurrent invalidations with a bitlock. At the same time, we also need to allow other places that check NFS_INO_INVALID_DATA to check whether we might be in the middle of invalidating the file, so fix up a couple of places that do that to look for the new NFS_INO_INVALIDATING flag. Doing this requires us to be careful not to set the bitlock unnecessarily, so this code only does that if it believes it will be doing an invalidation. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-17nfs: always make sure page is up-to-date before extending a write to cover ↵Scott Mayhew
the entire page We should always make sure the cached page is up-to-date when we're determining whether we can extend a write to cover the full page -- even if we've received a write delegation from the server. Commit c7559663 added logic to skip this check if we have a write delegation, which can lead to data corruption such as the following scenario if client B receives a write delegation from the NFS server: Client A: # echo 123456789 > /mnt/file Client B: # echo abcdefghi >> /mnt/file # cat /mnt/file 0�D0�abcdefghi Just because we hold a write delegation doesn't mean that we've read in the entire page contents. Cc: <stable@vger.kernel.org> # v3.11+ Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-05NFS: dprintk() should not print negative fileids and inode numbersNiels de Vos
A fileid in NFS is a uint64. There are some occurrences where dprintk() outputs a signed fileid. This leads to confusion and more difficult to read debugging (negative fileids matching positive inode numbers). Signed-off-by: Niels de Vos <ndevos@redhat.com> CC: Santosh Pradhan <spradhan@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2013-10-24nfs: use %p[dD] instead of open-coded (and often racy) equivalentsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-05NFS: Don't check lock owner compatibility in writes unless file is lockedTrond Myklebust
If we're doing buffered writes, and there is no file locking involved, then we don't have to worry about whether or not the lock owner information is identical. By relaxing this check, we ensure that fork()ed child processes can write to a page without having to first sync dirty data that was written by the parent to disk. Reported-by: Quentin Barnes <qbarnes@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Quentin Barnes <qbarnes@gmail.com>
2013-09-05nfs4.1: Add SP4_MACH_CRED write and commit supportWeston Andros Adamson
WRITE and COMMIT can use the machine credential. If WRITE is supported and COMMIT is not, make all (mach cred) writes FILE_SYNC4. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-04NFSv4: Don't try to recover NFSv4 locks when they are lost.NeilBrown
When an NFSv4 client loses contact with the server it can lose any locks that it holds. Currently when it reconnects to the server it simply tries to reclaim those locks. This might succeed even though some other client has held and released a lock in the mean time. So the first client might think the file is unchanged, but it isn't. This isn't good. If, when recovery happens, the locks cannot be claimed because some other client still holds the lock, then we get a message in the kernel logs, but the client can still write. So two clients can both think they have a lock and can both write at the same time. This is equally not good. There was a patch a while ago http://comments.gmane.org/gmane.linux.nfs/41917 which tried to address some of this, but it didn't seem to go anywhere. That patch would also send a signal to the process. That might be useful but for now this patch just causes writes to fail. For NFSv4 (unlike v2/v3) there is a strong link between the lock and the write request so we can fairly easily fail any IO of the lock is gone. While some applications might not expect this, it is still safer than allowing the write to succeed. Because this is a fairly big change in behaviour a module parameter, "recover_locks", is introduced which defaults to true (the current behaviour) but can be set to "false" to tell the client not to try to recover things that were lost. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-03NFS avoid expired credential keys for buffered writesAndy Adamson
We must avoid buffering a WRITE that is using a credential key (e.g. a GSS context key) that is about to expire or has expired. We currently will paint ourselves into a corner by returning success to the applciation for such a buffered WRITE, only to discover that we do not have permission when we attempt to flush the WRITE (and potentially associated COMMIT) to disk. Use the RPC layer credential key timeout and expire routines which use a a watermark, gss_key_expire_timeo. We test the key in nfs_file_write. If a WRITE is using a credential with a key that will expire within watermark seconds, flush the inode in nfs_write_end and send only NFS_FILE_SYNC WRITEs by adding nfs_ctx_key_to_expire to nfs_need_sync_write. Note that this results in single page NFS_FILE_SYNC WRITEs. Signed-off-by: Andy Adamson <andros@netapp.com> [Trond: removed a pr_warn_ratelimited() for now] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22NFS: Add event tracing for generic NFS eventsTrond Myklebust
Add tracepoints for inode attribute updates, attribute revalidation, writeback start/end fsync start/end, attribute change start/end, permission check start/end. The intention is to enable performance tracing using 'perf'as well as improving debugging. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-07-09NFS: Allow nfs_updatepage to extend a write under additional circumstancesScott Mayhew
Currently nfs_updatepage allows a write to be extended to cover a full page only if we don't have a byte range lock lock on the file... but if we have a write delegation on the file or if we have the whole file locked for writing then we should be allowed to extend the write as well. Signed-off-by: Scott Mayhew <smayhew@redhat.com> [Trond: fix up call to nfs_have_delegation()] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-03-25NFS: Don't accept more reads/writes if the open context recovery failedTrond Myklebust
If the state recovery failed, we want to ensure that the application doesn't try to use the same file descriptor for more reads or writes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-01-04NFS: Ensure that we free the rpc_task after read and write cleanups are doneTrond Myklebust
This patch ensures that we free the rpc_task after the cleanup callbacks are done in order to avoid a deadlock problem that can be triggered if the callback needs to wait for another workqueue item to complete. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Weston Andros Adamson <dros@netapp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Bruce Fields <bfields@fieldses.org> Cc: stable@vger.kernel.org [>= 3.5]
2012-12-20NFS: nfs_migrate_page() does not wait for FS-Cache to finish with a pageDavid Howells
nfs_migrate_page() does not wait for FS-Cache to finish with a page, probably leading to the following bad-page-state: BUG: Bad page state in process python-bin pfn:17d39b page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null) index:38686 (Tainted: G B ---------------- ) Pid: 31053, comm: python-bin Tainted: G B ---------------- 2.6.32-71.24.1.el6.x86_64 #1 Call Trace: [<ffffffff8111bfe7>] bad_page+0x107/0x160 [<ffffffff8111ee69>] free_hot_cold_page+0x1c9/0x220 [<ffffffff8111ef19>] __pagevec_free+0x59/0xb0 [<ffffffff8104b988>] ? flush_tlb_others_ipi+0x128/0x130 [<ffffffff8112230c>] release_pages+0x21c/0x250 [<ffffffff8115b92a>] ? remove_migration_pte+0x28a/0x2b0 [<ffffffff8115f3f8>] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70 [<ffffffff81122687>] ____pagevec_lru_add+0x167/0x180 [<ffffffff811226f8>] __lru_cache_add+0x58/0x70 [<ffffffff81122731>] lru_cache_add_lru+0x21/0x40 [<ffffffff81123f49>] putback_lru_page+0x69/0x100 [<ffffffff8115c0bd>] migrate_pages+0x13d/0x5d0 [<ffffffff81122687>] ? ____pagevec_lru_add+0x167/0x180 [<ffffffff81152ab0>] ? compaction_alloc+0x0/0x370 [<ffffffff8115255c>] compact_zone+0x4cc/0x600 [<ffffffff8111cfac>] ? get_page_from_freelist+0x15c/0x820 [<ffffffff810672f4>] ? check_preempt_wakeup+0x1c4/0x3c0 [<ffffffff8115290e>] compact_zone_order+0x7e/0xb0 [<ffffffff81152a49>] try_to_compact_pages+0x109/0x170 [<ffffffff8111e94d>] __alloc_pages_nodemask+0x5ed/0x850 [<ffffffff814c9136>] ? thread_return+0x4e/0x778 [<ffffffff81150d43>] alloc_pages_vma+0x93/0x150 [<ffffffff81167ea5>] do_huge_pmd_anonymous_page+0x135/0x340 [<ffffffff814cb6f6>] ? rwsem_down_read_failed+0x26/0x30 [<ffffffff81136755>] handle_mm_fault+0x245/0x2b0 [<ffffffff814ce383>] do_page_fault+0x123/0x3a0 [<ffffffff814cbdf5>] page_fault+0x25/0x30 nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait - even if __GFP_WAIT is set. The reason that doesn't wait is that fscache_maybe_release_page() might deadlock the allocator as the work threads writing to the cache may all end up sleeping on memory allocation. However, I wonder if that is actually a problem. There are a number of things I can do to deal with this: (1) Make nfs_migrate_page() wait. (2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag. (3) Set a timeout around the wait. (4) Make nfs_migrate_page() return an error if the page is still busy. For the moment, I'll select (2) and (4). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com>
2012-12-15NFS: Don't use SetPageError in the NFS writeback codeTrond Myklebust
The writeback code is already capable of passing errors back to user space by means of the open_context->error. In the case of ENOSPC, Neil Brown is reporting seeing 2 errors being returned. Neil writes: "e.g. if /mnt2/ if an nfs mounted filesystem that has no space then strace dd if=/dev/zero conv=fsync >> /mnt2/afile count=1 reported Input/output error and the relevant parts of the strace output are: write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 fsync(1) = -1 EIO (Input/output error) close(1) = -1 ENOSPC (No space left on device)" Neil then shows that the duplication of error messages appears to be due to the use of the PageError() mechanism, which causes filemap_fdatawait_range to return the extra EIO. The regression was introduced by commit 7b281ee026552f10862b617a2a51acf49c829554 (NFS: fsync() must exit with an error if page writeback failed). Fix this by removing the call to SetPageError(), and just relying on open_context->error reporting the ENOSPC back to fsync(). Reported-by: Neil Brown <neilb@suse.de> Tested-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [3.6+]
2012-12-11Merge branch 'bugfixes' into nfs-for-nextTrond Myklebust
2012-12-11nfs: don't extend writes to cover entire page if pagecache is invalidJeff Layton
Jian reported that the following sequence would leave "testfile" with corrupt data: # mount localhost:/export /mnt/nfs/ -o vers=3 # echo abc > /mnt/nfs/testfile; echo def >> /export/testfile; echo ghi >> /mnt/nfs/testfile # cat -v /export/testfile abc ^@^@^@^@ghi While there's no locking involved here, the operations are serialized, so CTO should prevent corruption. The first write to the file is fine and writes 4 bytes. The file is then extended on the server. When it's reopened a GETATTR is issued and the size change is noticed. This causes NFS_INO_INVALID_DATA to be set on the file. Because the file is opened for write only, nfs_want_read_modify_write() returns 0 to nfs_write_begin(). nfs_updatepage then calls nfs_write_pageuptodate() to see if it should extend the nfs_page to cover the whole page. NFS_INO_INVALID_DATA is still set on the file at that point, but that flag is ignored and nfs_pageuptodate erroneously extends the write to cover the whole page, with the write done on the server side filled in with zeroes. This patch just has that function check for NFS_INO_INVALID_DATA in addition to NFS_INO_REVAL_PAGECACHE. This fixes the bug, but looking over the code, I wonder if we might have a similar bug in nfs_revalidate_size(). The difference between those two flags is very subtle, so it seems like we ought to be checking for NFS_INO_INVALID_DATA in most of the places that we look for NFS_INO_REVAL_PAGECACHE. I believe this is regression introduced by commit 8d197a568. The code did check for NFS_INO_INVALID_DATA prior to that patch. Original bug report is here: https://bugzilla.redhat.com/show_bug.cgi?id=885743 Cc: <stable@vger.kernel.org> # 3.5+ Reported-by: Jian Li <jiali@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-25nfs: Fix wrong slab cache in nfs_commit_mempoolYanchuan Nian
The slab cache in nfs_commit_mempool is wrong, and I think it is just a slip. I tested it on a x86-32 machine, the size of nfs_write_header is 544, and the size of nfs_commit_data is 408, so it works fine. It is also true that sizeof(struct nfs_write_header) > sizeof(struct nfs_commit_data) on other platforms in my opinoin. Just fix it. Signed-off-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04NFS: Remove BUG_ON() calls from the generic writeback codeTrond Myklebust
...and ensure that we set the return value for nfs_page_async_flush() to zero! (Reported-by: Dros Adamson) Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28NFS: Fix fdatasync/fsync() when confronted with a server rebootTrond Myklebust
If the server reboots before it can commit the unstable writes to disk, then nfs_commit_release_pages() will detect this when it compares the verifier returned by COMMIT to the one returned by WRITE. When this happens, the client needs to resend those writes in order to guarantee that they make it to stable storage. This patch adds a signalling mechanism to notify fsync() that it needs to retry all writes before it can exit. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28NFS: Clean up helper function nfs4_select_rw_stateid()Trond Myklebust
We want to be able to pass on the information that the page was not dirtied under a lock. Instead of adding a flag parameter, do this by passing a pointer to a 'struct nfs_lock_owner' that may be NULL. Also reuse this structure in struct nfs_lock_context to carry the fl_owner_t and pid_t. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-02nfs: tear down caches in nfs_init_writepagecache when allocation failsJeff Layton
...and ensure that we tear down the nfs_commit_data cache too when unloading the module. Cc: Bryan Schumaker <bjschuma@netapp.com> Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-31Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge Andrew's second set of patches: - MM - a few random fixes - a couple of RTC leftovers * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits) rtc/rtc-88pm80x: remove unneed devm_kfree rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables tmpfs: distribute interleave better across nodes mm: remove redundant initialization mm: warn if pg_data_t isn't initialized with zero mips: zero out pg_data_t when it's allocated memcg: gix memory accounting scalability in shrink_page_list mm/sparse: remove index_init_lock mm/sparse: more checks on mem_section number mm/sparse: optimize sparse_index_alloc memcg: add mem_cgroup_from_css() helper memcg: further prevent OOM with too many dirty pages memcg: prevent OOM with too many dirty pages mm: mmu_notifier: fix freed page still mapped in secondary MMU mm: memcg: only check anon swapin page charges for swap cache mm: memcg: only check swap cache pages for repeated charging mm: memcg: split swapin charge function into private and public part mm: memcg: remove needless !mm fixup to init_mm when charging mm: memcg: remove unneeded shmem charge type ...
2012-07-31nfs: prevent page allocator recursions with swap over NFS.Mel Gorman
GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO, just not of any filesystem data. The problem is that previously NOFS was correct because that avoids recursion into the NFS code. With swap-over-NFS, it is no longer correct as swap IO can lead to this recursion. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>