summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2009-05-11block: move completion related functions back to blk-core.cFUJITA Tomonori
Let's put the completion related functions back to block/blk-core.c where they have lived. We can also unexport blk_end_bidi_request() and __blk_end_bidi_request(), which nobody uses. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11block: implement and enforce request peek/start/fetchTejun Heo
Till now block layer allowed two separate modes of request execution. A request is always acquired from the request queue via elv_next_request(). After that, drivers are free to either dequeue it or process it without dequeueing. Dequeue allows elv_next_request() to return the next request so that multiple requests can be in flight. Executing requests without dequeueing has its merits mostly in allowing drivers for simpler devices which can't do sg to deal with segments only without considering request boundary. However, the benefit this brings is dubious and declining while the cost of the API ambiguity is increasing. Segment based drivers are usually for very old or limited devices and as converting to dequeueing model isn't difficult, it doesn't justify the API overhead it puts on block layer and its more modern users. Previous patches converted all block low level drivers to dequeueing model. This patch completes the API transition by... * renaming elv_next_request() to blk_peek_request() * renaming blkdev_dequeue_request() to blk_start_request() * adding blk_fetch_request() which is combination of peek and start * disallowing completion of queued (not started) requests * applying new API to all LLDs Renamings are for consistency and to break out of tree code so that it's apparent that out of tree drivers need updating. [ Impact: block request issue API cleanup, no functional change ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Mike Miller <mike.miller@hp.com> Cc: unsik Kim <donari75@gmail.com> Cc: Paul Clements <paul.clements@steeleye.com> Cc: Tim Waugh <tim@cyberelk.net> Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Cc: David S. Miller <davem@davemloft.net> Cc: Laurent Vivier <Laurent@lvivier.info> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: Adrian McMenamin <adrian@mcmen.demon.co.uk> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Borislav Petkov <petkovbb@googlemail.com> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Alex Dubov <oakad@yahoo.com> Cc: Pierre Ossman <drzeus@drzeus.cx> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Markus Lidel <Markus.Lidel@shadowconnect.com> Cc: Stefan Weinhuber <wein@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11block: hide request sector and data_lenTejun Heo
Block low level drivers for some reason have been pretty good at abusing block layer API. Especially struct request's fields tend to get violated in all possible ways. Make it clear that low level drivers MUST NOT access or manipulate rq->sector and rq->data_len directly by prefixing them with double underscores. This change is also necessary to break build of out-of-tree codes which assume the previous block API where internal fields can be manipulated and rq->data_len carries residual count on completion. [ Impact: hide internal fields, block API change ] Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11block: drop request->hard_* and *nr_sectorsTejun Heo
struct request has had a few different ways to represent some properties of a request. ->hard_* represent block layer's view of the request progress (completion cursor) and the ones without the prefix are supposed to represent the issue cursor and allowed to be updated as necessary by the low level drivers. The thing is that as block layer supports partial completion, the two cursors really aren't necessary and only cause confusion. In addition, manual management of request detail from low level drivers is cumbersome and error-prone at the very least. Another interesting duplicate fields are rq->[hard_]nr_sectors and rq->{hard_cur|current}_nr_sectors against rq->data_len and rq->bio->bi_size. This is more convoluted than the hard_ case. rq->[hard_]nr_sectors are initialized for requests with bio but blk_rq_bytes() uses it only for !pc requests. rq->data_len is initialized for all request but blk_rq_bytes() uses it only for pc requests. This causes good amount of confusion throughout block layer and its drivers and determining the request length has been a bit of black magic which may or may not work depending on circumstances and what the specific LLD is actually doing. rq->{hard_cur|current}_nr_sectors represent the number of sectors in the contiguous data area at the front. This is mainly used by drivers which transfers data by walking request segment-by-segment. This value always equals rq->bio->bi_size >> 9. However, data length for pc requests may not be multiple of 512 bytes and using this field becomes a bit confusing. In general, having multiple fields to represent the same property leads only to confusion and subtle bugs. With recent block low level driver cleanups, no driver is accessing or manipulating these duplicate fields directly. Drop all the duplicates. Now rq->sector means the current sector, rq->data_len the current total length and rq->bio->bi_size the current segment length. Everything else is defined in terms of these three and available only through accessors. * blk_recalc_rq_sectors() is collapsed into blk_update_request() and now handles pc and fs requests equally other than rq->sector update. This means that now pc requests can use partial completion too (no in-kernel user yet tho). * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer now uses byte count as the primary data length. * blk_rq_pos() is now guranteed to be always correct. In-block users converted. * blk_rq_bytes() is now guaranteed to be always valid as is blk_rq_sectors(). In-block users converted. * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9. More convenient one is used. * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const pointer to request. [ Impact: API cleanup, single way to represent one property of a request ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11block: convert to pos and nr_sectors accessorsTejun Heo
With recent cleanups, there is no place where low level driver directly manipulates request fields. This means that the 'hard' request fields always equal the !hard fields. Convert all rq->sectors, nr_sectors and current_nr_sectors references to accessors. While at it, drop superflous blk_rq_pos() < 0 test in swim.c. [ Impact: use pos and nr_sectors accessors ] Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Tested-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Grant Likely <grant.likely@secretlab.ca> Tested-by: Adrian McMenamin <adrian@mcmen.demon.co.uk> Acked-by: Adrian McMenamin <adrian@mcmen.demon.co.uk> Acked-by: Mike Miller <mike.miller@hp.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Borislav Petkov <petkovbb@googlemail.com> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Eric Moore <Eric.Moore@lsi.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Paul Clements <paul.clements@steeleye.com> Cc: Tim Waugh <tim@cyberelk.net> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Alex Dubov <oakad@yahoo.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Dario Ballabio <ballabio_dario@emc.com> Cc: David S. Miller <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: unsik Kim <donari75@gmail.com> Cc: Laurent Vivier <Laurent@lvivier.info> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11block: implement blk_rq_pos/[cur_]sectors() and convert obvious onesTejun Heo
Implement accessors - blk_rq_pos(), blk_rq_sectors() and blk_rq_cur_sectors() which return rq->hard_sector, rq->hard_nr_sectors and rq->hard_cur_sectors respectively and convert direct references of the said fields to the accessors. This is in preparation of request data length handling cleanup. Geert : suggested adding const to struct request * parameter to accessors Sergei : spotted error in patch description [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Acked-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Grant Likely <grant.likely@secretlab.ca> Ackec-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Borislav Petkov <petkovbb@googlemail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11block: add rq->resid_lenTejun Heo
rq->data_len served two purposes - the length of data buffer on issue and the residual count on completion. This duality creates some headaches. First of all, block layer and low level drivers can't really determine what rq->data_len contains while a request is executing. It could be the total request length or it coulde be anything else one of the lower layers is using to keep track of residual count. This complicates things because blk_rq_bytes() and thus [__]blk_end_request_all() relies on rq->data_len for PC commands. Drivers which want to report residual count should first cache the total request length, update rq->data_len and then complete the request with the cached data length. Secondly, it makes requests default to reporting full residual count, ie. reporting that no data transfer occurred. The residual count is an exception not the norm; however, the driver should clear rq->data_len to zero to signify the normal cases while leaving it alone means no data transfer occurred at all. This reverse default behavior complicates code unnecessarily and renders block PC on some drivers (ide-tape/floppy) unuseable. This patch adds rq->resid_len which is used only for residual count. While at it, remove now unnecessasry blk_rq_bytes() caching in ide_pc_intr() as rq->data_len is not changed anymore. Boaz : spotted missing conversion in osd Sergei : spotted too early conversion to blk_rq_bytes() in ide-tape [ Impact: cleanup residual count handling, report 0 resid by default ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Borislav Petkov <petkovbb@googlemail.com> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Mike Miller <mike.miller@hp.com> Cc: Eric Moore <Eric.Moore@lsi.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Mike Miller <mike.miller@hp.com> Cc: Eric Moore <Eric.Moore@lsi.com> Cc: Darrick J. Wong <djwong@us.ibm.com> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-07Merge branch 'linus' into tracing/coreIngo Molnar
Merge reason: tracing/core was on a .30-rc1 base and was missing out on on a handful of tracing fixes present in .30-rc5-almost. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06blktrace: from-sector redundant in trace_block_remapAlan D. Brunelle
Remove redundant from-sector parameter: it's /always/ the bio's sector passed in. [ Impact: cleanup ] Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49FF517C.7000503@hp.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-28block: don't init rq fields unnecessarilyTejun Heo
blk_get_request() always returns properly zeroed requests. Don't set fields to zero/NULL unnecessarily. [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28block: catch trying to use more bits than request->cmd_flags hasNikanth Karthikesan
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28block: include discard requests in IO accountingJens Axboe
We currently don't do merging on discard requests, but we potentially could. If we do, then we need to include discard requests in the IO accounting, or merging would end up decrementing in_flight IO counters for an IO which never incremented them. So enable accounting for discard requests. Problem found by Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28block: make blk_do_io_stat() do the full "is this rq accountable" checksJens Axboe
We currently check for file system requests outside of blk_do_io_stat(rq), but we may as well just include it. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28block: kill rq->dataTejun Heo
Now that all block request data transfer is done via bio, rq->data isn't used. Kill it. While at it, make the roles of rq->special and buffer clear. [ Impact: drop now unncessary field from struct request ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Boaz Harrosh <bharrosh@panasas.com>
2009-04-28block: implement and use [__]blk_end_request_all()Tejun Heo
There are many [__]blk_end_request() call sites which call it with full request length and expect full completion. Many of them ensure that the request actually completes by doing BUG_ON() the return value, which is awkward and error-prone. This patch adds [__]blk_end_request_all() which takes @rq and @error and fully completes the request. BUG_ON() is added to to ensure that this actually happens. Most conversions are simple but there are a few noteworthy ones. * cdrom/viocd: viocd_end_request() replaced with direct calls to __blk_end_request_all(). * s390/block/dasd: dasd_end_request() replaced with direct calls to __blk_end_request_all(). * s390/char/tape_block: tapeblock_end_request() replaced with direct calls to blk_end_request_all(). [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mike Miller <mike.miller@hp.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Alex Dubov <oakad@yahoo.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-04-28block: move rq->start_time initialization to blk_rq_init()Tejun Heo
rq->start_time was initialized in init_request_from_bio() so special requests didn't have start_time set. This has been okay as start_time has been used only for fs requests; however, there is no indication of this actually is the case or not. Set rq->start_time in blk_rq_init() and guarantee that all initialized rq's have its start_time set. This improves consistency at virtually no cost and future changes will make use of the timestamp for !bio requests. [ Impact: rq->start_time is valid for all requests ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: clean up request completion APITejun Heo
Request completion has gone through several changes and became a bit messy over the time. Clean it up. 1. end_that_request_data() is a thin wrapper around end_that_request_data_first() which checks whether bio is NULL before doing anything and handles bidi completion. blk_update_request() is a thin wrapper around end_that_request_data() which clears nr_sectors on the last iteration but doesn't use the bidi completion. Clean it up by moving the initial bio NULL check and nr_sectors clearing on the last iteration into end_that_request_data() and renaming it to blk_update_request(), which makes blk_end_io() the only user of end_that_request_data(). Collapse end_that_request_data() into blk_end_io(). 2. There are four visible completion variants - blk_end_request(), __blk_end_request(), blk_end_bidi_request() and end_request(). blk_end_request() and blk_end_bidi_request() uses blk_end_request() as the backend but __blk_end_request() and end_request() use separate implementation in __blk_end_request() due to different locking rules. blk_end_bidi_request() is identical to blk_end_io(). Collapse blk_end_io() into blk_end_bidi_request(), separate out request update into internal helper blk_update_bidi_request() and add __blk_end_bidi_request(). Redefine [__]blk_end_request() as thin inline wrappers around [__]blk_end_bidi_request(). 3. As the whole request issue/completion usages are about to be modified and audited, it's a good chance to convert completion functions return bool which better indicates the intended meaning of return values. 4. The function name end_that_request_last() is from the days when it was a public interface and slighly confusing. Give it a proper internal name - blk_finish_request(). 5. Add description explaning that blk_end_bidi_request() can be safely used for uni requests as suggested by Boaz Harrosh. The only visible behavior change is from #1. nr_sectors counts are cleared after the final iteration no matter which function is used to complete the request. I couldn't find any place where the code assumes those nr_sectors counters contain the values for the last segment and this change is good as it makes the API much more consistent as the end result is now same whether a request is completed using [__]blk_end_request() alone or in combination with blk_update_request(). API further cleaned up per Christoph's suggestion. [ Impact: cleanup, rq->*nr_sectors always updated after req completion ] Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Boaz Harrosh <bharrosh@panasas.com> Cc: Christoph Hellwig <hch@infradead.org>
2009-04-28block: kill blk_end_request_callback()Tejun Heo
With recent IDE updates, blk_end_request_callback() doesn't have any user now. Kill it. [ Impact: removal of unused convoluted interface ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: reorganize request fetching functionsTejun Heo
Impact: code reorganization elv_next_request() and elv_dequeue_request() are public block layer interface than actual elevator implementation. They mostly deal with how requests interact with block layer and low level drivers at the beginning of rqeuest processing whereas __elv_next_request() is the actual eleveator request fetching interface. Move the two functions to blk-core.c. This prepares for further interface cleanup. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: reorder request completion functionsTejun Heo
Reorder request completion functions such that * All request completion functions are located together. * Functions which are used by only one caller is put right above the caller. * end_request() is put after other completion functions but before blk_update_request(). This change is for completion function cleanup which will follow. [ Impact: cleanup, code reorganization ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: clean up misc stuff after block layer timeout conversionTejun Heo
* In blk_rq_timed_out_timer(), else { if } to else if * In blk_add_timer(), simplify if/else block [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: cleanup REQ_SOFTBARRIER usagesTejun Heo
blk_insert_request() doesn't need to worry about REQ_SOFTBARRIER. Don't set it. Combined with recent ide updates, REQ_SOFTBARRIER is now only used in elevator proper and for discard requests. [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: don't set REQ_NOMERGE unnecessarilyTejun Heo
RQ_NOMERGE_FLAGS already clears defines which REQ flags aren't mergeable. There is no reason to specify it superflously. It only adds to confusion. Don't set REQ_NOMERGE for barriers and requests with specific queueing directive. REQ_NOMERGE is now exclusively used by the merging code. [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: kill blk_start_queueing()Tejun Heo
blk_start_queueing() is identical to __blk_run_queue() except that it doesn't check for recursion. None of the current users depends on blk_start_queueing() running request_fn directly. Replace usages of blk_start_queueing() with [__]blk_run_queue() and kill it. [ Impact: removal of mostly duplicate interface function ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: merge blk_invoke_request_fn() into __blk_run_queue()Tejun Heo
__blk_run_queue wraps blk_invoke_request_fn() such that it additionally removes plug and bails out early if the queue is empty. Both extra operations have their own pending mechanisms and don't cause any harm correctness-wise when they are done superflously. The only user of blk_invoke_request_fn() being blk_start_queue(), there isn't much reason to keep both functions around. Merge blk_invoke_request_fn() into __blk_run_queue() and make blk_start_queue() use __blk_run_queue() instead. [ Impact: merge two subtly different internal functions ] Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28block: enable by default support for large devices and files on 32-bit archsBartlomiej Zolnierkiewicz
Enable by default support for large devices and files (CONFIG_LBD): - With 1TB disks being a commodity hardware it is quite easy to hit 2TB limitation while building RAIDs etc. and many distros have been using CONFIG_LBD=y by default already (at least Fedora 10 and openSUSE 11.1). - This should also prevent a subtle ext4 filesystem compatibility issue: mke2fs.ext4 defaults to creating filesystems with huge_files feature enabled and such filesystems cannot be later mounted read-write on machines with CONFIG_LBD=n (it should be quite easy to hit this issue when trying to use filesystem created using distro kernel on system running the self-build kernel, think about USB disk enclosures & co.). While at it: - Clarify config option help text w.r.t. mounting ext4 filesystems (they can be mounted with CONFIG_LBD=n but in the read-only mode). Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28block: clear req->errors on bio completion only for fs requestsTejun Heo
Impact: subtle behavior change For fs requests, rq is only carrier of bios and rq error status as a whole doesn't mean much. This is the reason why rq->errors is being cleared on each partial completion of a request as on each partial completion the error status is transferred to the respective bios. For pc requests, rq->errors is used to carry error status to the issuer and thus __end_that_request_first() doesn't clear it on such cases. The condition was fine till now as only fs and pc requests have used bio and thus the bio completion path. However, future changes will unify data accesses to bio and all non fs users care about rq error status. Clear rq->errors on bio completion only for fs requests. In general, the implicit clearing is a bit too subtle especially as the meaning of rq->errors is completely dependent on low level drivers. Unifying / cleaning up rq->errors usage and letting llds manage it would be better. TODO comment added. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk>
2009-04-24cfq-iosched: cache prio_tree root in cfqq->p_rootJens Axboe
Currently we look it up from ->ioprio, but ->ioprio can change if either the process gets its IO priority changed explicitly, or if cfq decides to temporarily boost it. So if we are unlucky, we can end up attempting to remove a node from a different rbtree root than where it was added. Fix this by using ->org_ioprio as the prio_tree index, since that will only change for explicit IO priority settings (not for a boost). Additionally cache the rbtree root inside the cfqq, then we don't have to add code to reinsert the cfqq in the prio_tree if IO priority changes. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24cfq-iosched: fix bug with aliased request and cooperation detectionJens Axboe
cfq_prio_tree_lookup() should return the direct match, yet it always returns zero. Fix that. cfq_prio_tree_add() assumes that we don't get a direct match, while it is very possible that we do. Using O_DIRECT, you can have different cfqq with matching requests, since you don't have the page cache to serialize things for you. Fix this bug by only adding the cfqq if there isn't an existing match. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24cfq-iosched: clear ->prio_trees[] on cfqd allocJens Axboe
Not strictly needed, but we should make it clear that we init the rbtree roots here. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24block: fix intermittent dm timeout based oopsHannes Reinecke
Very rarely under stress testing of dm, oopses are occuring as something tampers with an old stack frame. This has been traced back to blk_abort_queue() leaving a timeout_list pointing to the stack. The reason is that sometimes blk_abort_request() won't delete the timer (if the request is marked as complete but before the timer has been removed, a small race window). Fix this by splicing back from the ususally empty list to the q->timeout_list. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24block: simplify I/O stat accountingJerome Marchand
This simplifies I/O stat accounting switching code and separates it completely from I/O scheduler switch code. Requests are accounted according to the state of their request queue at the time of the request allocation. There is no need anymore to flush the request queue when switching I/O accounting state. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22cfq-iosched: use the default seek distance when there aren't enough seek samplesJeff Moyer
If the cfq io context doesn't have enough samples yet to provide a mean seek distance, then use the default threshold we have for seeky IO instead of defaulting to 0. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22cfq-iosched: make seek_mean converge more quicklyJeff Moyer
Right now, depending on the first sector to which a process issues I/O, the seek time may start out way out of whack. So make sure we start with 0 sectors in seek, instead of the offset of the first request issued. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22block: make blk_abort_queue() ignore non-request based devicesJens Axboe
There's nothing to do for those devices, since the timeout handling is based on requests. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22block: include empty disks in /proc/diskstatsTejun Heo
/proc/diskstats used to show stats for all disks whether they're zero-sized or not and their non-zero partitions. Commit 074a7aca7afa6f230104e8e65eba3420263714a5 accidentally changed the behavior such that it doesn't print out zero sized disks. This patch implements DISK_PITER_INCL_EMPTY_PART0 flag to partition iterator and uses it in diskstats_show() such that empty part0 is shown in /proc/diskstats. Reported and bisectd by Dianel Collins. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Collins <solemnwarning@solemnwarning.no-ip.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22block: fix queue bounce limit settingTejun Heo
Impact: don't set GFP_DMA in q->bounce_gfp unnecessarily All DMA address limits are expressed in terms of the last addressable unit (byte or page) instead of one plus that. However, when determining bounce_gfp for 64bit machines in blk_queue_bounce_limit(), it compares the specified limit against 0x100000000UL to determine whether it's below 4G ending up falsely setting GFP_DMA in q->bounce_gfp. As DMA zone is very small on x86_64, this makes larger SG_IO transfers very eager to trigger OOM killer. Fix it. While at it, rename the parameter to @dma_mask for clarity and convert comment to proper winged style. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22block: fix SG_IO vector request data length handlingTejun Heo
Impact: fix SG_IO behavior such that it matches the documentation SG_IO howto says that if ->dxfer_len and sum of iovec disagress, the shorter one wins. However, the current implementation returns -EINVAL for such cases. Trim iovc if it's longer than ->dxfer_len. This patch uses iov_*() helpers which take struct iovec * by casting struct sg_iovec * to it. sg_iovec is always identical to iovec and this will be further cleaned up with later patches. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-19block: clear req->errors on bio completion only for fs requestsTejun Heo
Impact: subtle behavior change For fs requests, rq is only carrier of bios and rq error status as a whole doesn't mean much. This is the reason why rq->errors is being cleared on each partial completion of a request as on each partial completion the error status is transferred to the respective bios. For pc requests, rq->errors is used to carry error status to the issuer and thus __end_that_request_first() doesn't clear it on such cases. The condition was fine till now as only fs and pc requests have used bio and thus the bio completion path. However, future changes will unify data accesses to bio and all non fs users care about rq error status. Clear rq->errors on bio completion only for fs requests. In general, the implicit clearing is a bit too subtle especially as the meaning of rq->errors is completely dependent on low level drivers. Unifying / cleaning up rq->errors usage and letting llds manage it would be better. TODO comment added. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk>
2009-04-16blktrace: add trace/ to /sys/block/sdaLi Zefan
Impact: allow ftrace-plugin blktrace to trace device-mapper devices To trace a single partition: # echo 1 > /sys/block/sda/sda1/enable To trace the whole sda instead: # echo 1 > /sys/block/sda/enable Thus we also fix an issue reported by Ted, that ftrace-plugin blktrace can't be used to trace device-mapper devices. Now: # echo 1 > /sys/block/dm-0/trace/enable echo: write error: No such device or address # mount -t ext4 /dev/dm-0 /mnt # echo 1 > /sys/block/dm-0/trace/enable # echo blk > /debug/tracing/current_tracer Reported-by: Theodore Tso <tytso@mit.edu> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Shawn Du <duyuyang@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49E42665.6020506@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-16blktrace: support per-partition tracingShawn Du
Though one can specify '-d /dev/sda1' when using blktrace, it still traces the whole sda. To support per-partition tracing, when we start tracing, we initialize bt->start_lba and bt->end_lba to the start and end sector of that partition. Note some actions are per device, thus we don't filter 0-sector events. The original patch and discussion can be found here: http://marc.info/?l=linux-btrace&m=122949374214540&w=2 Signed-off-by: Shawn Du <duyuyang@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49E42620.4050701@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-15cfq-iosched: add close cooperator codeJens Axboe
If we have processes that are working in close proximity to each other on disk, we don't want to idle wait. Instead allow the close process to issue a request, getting better aggregate bandwidth. The anticipatory scheduler has similar checks, noop and deadline do not need it since they don't care about process <-> io mappings. The code for CFQ is a little more involved though, since we split request queues into per-process contexts. This fixes a performance problem with eg dump(8), since it uses several processes in some silly attempt to speed IO up. Even if dump(8) isn't really a valid case (it should be fixed by using CLONE_IO), there are other cases where we see close processes and where idling ends up hurting performance. Credit goes to Jeff Moyer <jmoyer@redhat.com> for writing the initial implementation. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15cfq-iosched: log responsible 'cfqq' in idle timer armJens Axboe
Makes it easier to read the traces. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15cfq-iosched: tweak kick logic a bit moreJens Axboe
We only kick the dispatch for an idling queue, if we think it's a (somewhat) fully merged request. Also allow a kick if we have other busy queues in the system, since we don't want to risk waiting for a potential merge in that case. It's better to get some work done and proceed. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15cfq-iosched: no need to save interrupts in cfq_kick_queue()Jens Axboe
It's called from the workqueue handlers from process context, so we always have irqs enabled when entered. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15block: Remove code handling bio_alloc failure with __GFP_WAITNikanth Karthikesan
Remove code handling bio_alloc failure with __GFP_WAIT. GFP_KERNEL implies __GFP_WAIT. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15block: fix SG_IO to return a proper error valueFUJITA Tomonori
blk_rq_unmap_user() returns -EFAULT if a program passes an invalid address to kernel. SG_IO path needs to pass the returned value to user space instead of ignoring it. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15cfq-iosched: don't delay queue kick for a merged requestJens Axboe
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> reports that commit b029195dda0129b427c6e579a3bb3ae752da3a93 introduced a regression of about 50% with sequential threaded read workloads. The test case is: tiotest -k0 -k1 -k3 -f 80 -t 32 which starts 32 threads each reading a 80MB file. Twiddle the kick queue logic so that we do start IO immediately, if it appears to be a fully merged request. We can't really detect that, so just check if the request is bigger than a page or not. The assumption is that since single bio issues will first queue a single request with just one page attached and then later do merges on that, if we already have more than a page worth of data in the request, then the request is most likely good to go. Verified that this doesn't cause a regression with the test case that commit b029195dda0129b427c6e579a3bb3ae752da3a93 was fixing. It does not, we still see maximum sized requests for the queue-then-merge cases. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15as-iosched: get rid of private REQ_SYNC/REQ_ASYNC definesJens Axboe
We can just use the block layer BLK_RW_SYNC/ASYNC defines now. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15cfq-iosched: get rid of private SYNC/ASYNC definesJens Axboe
We can just use the block layer BLK_RW_SYNC/ASYNC defines now. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>