path: root/block
Age | Commit message | Author
2014-12-10 | blk-mq: Fix uninitialized kobject at CPU hotplugging | Takashi Iwai
When a CPU is hotplugged, the current blk-mq spews a warning like:

  kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong.
  CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
   0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8
   ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58
   ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007
  Call Trace:
   [<ffffffff81005306>] dump_trace+0x86/0x330
   [<ffffffff81005644>] show_stack_log_lvl+0x94/0x170
   [<ffffffff81006d21>] show_stack+0x21/0x50
   [<ffffffff81605f07>] dump_stack+0x41/0x51
   [<ffffffff8132c7a0>] kobject_add+0xa0/0xb0
   [<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0
   [<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60
   [<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190
   [<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70
   [<ffffffff8105fd23>] cpu_notify+0x23/0x50
   [<ffffffff81060037>] _cpu_up+0x157/0x170
   [<ffffffff810600d9>] cpu_up+0x89/0xb0
   [<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80
   [<ffffffff814323cd>] device_online+0x5d/0xa0
   [<ffffffff81432485>] online_store+0x75/0x80
   [<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150
   [<ffffffff811c5532>] vfs_write+0xb2/0x1f0
   [<ffffffff811c5f42>] SyS_write+0x42/0xb0
   [<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b
   [<00007f0132fb24e0>] 0x7f0132fb24e0

This is indeed because of an uninitialized kobject for blk_mq_ctx. The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but that function loops with hctx_for_each_ctx(), i.e. it initializes them only for online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly onlined CPU is registered without initialization. This patch fixes the issue by initializing all ctx kobjects belonging to each queue.
Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794 Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-09 | blk-mq: Use all available hardware queues | Bart Van Assche
Suppose that a system has two CPU sockets, three cores per socket, that it does not support hyperthreading and that four hardware queues are provided by a block driver. With the current algorithm this will lead to the following assignment of CPU cores to hardware queues:

  HWQ 0: 0 1
  HWQ 1: 2 3
  HWQ 2: 4 5
  HWQ 3: (none)

This patch changes the queue assignment into:

  HWQ 0: 0 1
  HWQ 1: 2
  HWQ 2: 3 4
  HWQ 3: 5

In other words, this patch has the following three effects:
- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the above example the range of the number of CPU cores associated with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQ's is a multiple of the number of CPU sockets it is now guaranteed that all CPU cores associated with a single HWQ reside on the same CPU socket.

Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
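The two assignments in this message can be reproduced with a small user-space sketch. The formulas below are assumptions chosen because they yield exactly the two tables for six cores and four queues; they are not copied from the patch, and the names map_grouped(), map_scaled() and queues_used() are hypothetical.

```c
#include <assert.h>

/* Assumed "before" scheme: every queue takes a fixed-size group of
 * ceil(nr_cpus / nr_queues) consecutive CPUs, so trailing queues can stay empty. */
static unsigned int map_grouped(unsigned int nr_cpus, unsigned int nr_queues,
                                unsigned int cpu)
{
    unsigned int group;

    assert(nr_cpus > 0 && nr_queues > 0 && cpu < nr_cpus);
    group = (nr_cpus + nr_queues - 1) / nr_queues;
    return cpu / group;
}

/* Assumed "after" scheme: scale the CPU index onto the queue range. */
static unsigned int map_scaled(unsigned int nr_cpus, unsigned int nr_queues,
                               unsigned int cpu)
{
    assert(nr_cpus > 0 && nr_queues > 0 && cpu < nr_cpus);
    return cpu * nr_queues / nr_cpus;
}

/* Count the queues that receive at least one CPU. Both mappings never
 * decrease as the CPU index grows, so counting value changes is enough. */
static unsigned int queues_used(unsigned int nr_cpus, unsigned int nr_queues,
                                int scaled)
{
    unsigned int used = 0, prev = 0;

    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
        unsigned int q = scaled ? map_scaled(nr_cpus, nr_queues, cpu)
                                : map_grouped(nr_cpus, nr_queues, cpu);
        if (cpu == 0 || q != prev)
            used++;
        prev = q;
    }
    return used;
}
```

With six CPUs and four queues, queues_used() reports three queues for the grouped scheme and four for the scaled one, matching the first effect listed in the message.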
2014-12-09 | blk-mq: Micro-optimize bt_get() | Bart Van Assche
Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr() calls into a single call. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-09 | blk-mq: Fix a race between bt_clear_tag() and bt_get() | Bart Van Assche
What we need is the following two guarantees:
* Any thread that observes the effect of the test_and_set_bit() by __bt_get_word() also observes the preceding addition of 'current' to the appropriate wait list. This is guaranteed by the semantics of the spin_unlock() operation performed by prepare_to_wait(). Hence the conversion of test_and_set_bit_lock() into test_and_set_bit().
* The wait lists are examined by bt_clear_tag() after the tag bit has been cleared. clear_bit_unlock() guarantees that any thread that observes that the bit has been cleared also observes the store operations preceding clear_bit_unlock(). However, clear_bit_unlock() does not prevent the wait lists from being examined before the tag bit is cleared. Hence the addition of a memory barrier between clear_bit() and the wait list examination.
Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
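The ordering requirement described above can be illustrated with a user-space analogy written with C11 atomics. This is only a sketch under assumptions: struct tag_slot, tag_clear() and tag_try_get() are hypothetical names, a plain counter stands in for the kernel wait list, and C11 fences stand in for the kernel's barrier primitives.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

struct tag_slot {
    atomic_ulong word;   /* bitmap of allocated tags */
    atomic_int waiters;  /* stand-in for "someone is on the wait list" */
};

/* Free a tag and report whether a waiter needs to be woken. */
static bool tag_clear(struct tag_slot *s, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    /* Release ordering: whoever sees the bit cleared also sees earlier stores. */
    atomic_fetch_and_explicit(&s->word, ~(1UL << bit), memory_order_release);
    /* Release alone does not stop the load below from being performed before
     * the clear; the full fence orders the clear before the waiter check. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&s->waiters, memory_order_relaxed) > 0;
}

/* Register as a waiter first, then try to grab the tag. */
static bool tag_try_get(struct tag_slot *s, unsigned int bit)
{
    unsigned long old;

    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_add_explicit(&s->waiters, 1, memory_order_seq_cst);
    old = atomic_fetch_or_explicit(&s->word, 1UL << bit, memory_order_seq_cst);
    if (!(old & (1UL << bit))) {
        /* Got the tag: no need to wait after all. */
        atomic_fetch_sub_explicit(&s->waiters, 1, memory_order_seq_cst);
        return true;
    }
    return false; /* still registered; a real caller would now sleep */
}
```

The point of the sketch is the pairing: the waiter announces itself before testing the bit, and the freeing side clears the bit before looking for waiters, so the two sides cannot both miss each other.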
2014-12-09 | blk-mq: Avoid that __bt_get_word() wraps multiple times | Bart Van Assche
If __bt_get_word() is called with last_tag != 0, if the first find_next_zero_bit() fails, if after wrap-around the test_and_set_bit() call fails and find_next_zero_bit() succeeds, if the next test_and_set_bit() call fails and subsequently find_next_zero_bit() does not find a zero bit, then another wrap-around will occur. Avoid this by introducing an additional local variable. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
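The idea of remembering whether a wrap-around already happened can be sketched in user space as follows. This is a simplified, single-threaded illustration and not the kernel function: get_tag_wrap_once(), next_zero_bit(), test_and_set() and TAG_NONE are hypothetical names, and the non-atomic test_and_set() helper only imitates the shape of the kernel operation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TAG_NONE (-1)

/* Return the first clear bit in [from, limit), or limit if there is none. */
static int next_zero_bit(unsigned long word, int limit, int from)
{
    for (int i = from; i < limit; i++)
        if (!(word & (1UL << i)))
            return i;
    return limit;
}

/* Non-atomic stand-in: set the bit and return whether it was already set. */
static bool test_and_set(unsigned long *word, int bit)
{
    bool was_set = (*word & (1UL << bit)) != 0;

    *word |= 1UL << bit;
    return was_set;
}

/* Claim a free bit, searching from last_tag and wrapping to bit 0 at most once. */
static int get_tag_wrap_once(unsigned long *word, int depth, int last_tag)
{
    bool wrapped = (last_tag == 0); /* starting at 0 leaves nothing to wrap to */
    int tag = last_tag;

    assert(depth > 0 && depth <= (int)(sizeof(*word) * CHAR_BIT));
    assert(last_tag >= 0 && last_tag < depth);

    for (;;) {
        tag = next_zero_bit(*word, depth, tag);
        if (tag >= depth) {
            if (wrapped)
                return TAG_NONE; /* second pass exhausted: give up */
            wrapped = true;      /* the extra local state the message refers to */
            tag = 0;
            continue;
        }
        if (!test_and_set(word, tag))
            return tag;
        tag++; /* lost the bit to someone else: keep searching */
    }
}
```

Because the wrapped flag is never reset inside the loop, a failed claim after the wrap-around can no longer send the search back to bit 0 a second time.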
2014-12-09 | blk-mq: Fix a use-after-free | Bart Van Assche
blk-mq users are allowed to free the memory request_queue.tag_set points at after blk_cleanup_queue() has finished but before blk_release_queue() has started. This can happen e.g. in the SCSI core. The SCSI core namely embeds the tag_set structure in a SCSI host structure. The SCSI host structure is freed by scsi_host_dev_release(). This function is called after blk_cleanup_queue() finished but can be called before blk_release_queue(). This means that it is not safe to access request_queue.tag_set from inside blk_release_queue(). Hence remove the blk_sync_queue() call from blk_release_queue(). This call is not necessary - outstanding requests must have finished before blk_release_queue() is called. Additionally, move the blk_mq_free_queue() call from blk_release_queue() to blk_cleanup_queue() to avoid that struct request_queue.tag_set gets accessed after it has been freed. This patch avoids that the following kernel oops can be triggered when deleting a SCSI host for which scsi-mq was enabled: Call Trace: [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270 [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380 [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180 [<ffffffff8124d654>] blk_release_queue+0x84/0xd0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff81245895>] blk_put_queue+0x15/0x20 [<ffffffff8125c409>] disk_release+0x99/0xd0 [<ffffffff8133d056>] device_release+0x36/0xb0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff8125a78a>] put_disk+0x1a/0x20 [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0 [<ffffffff811d56a0>] blkdev_put+0x50/0x160 [<ffffffff81199eb4>] kill_block_super+0x44/0x70 [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60 [<ffffffff8119a87e>] deactivate_super+0x4e/0x70 [<ffffffff811b9833>] cleanup_mnt+0x43/0x90 [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20 [<ffffffff8107252c>] task_work_run+0xac/0xe0 [<ffffffff81002c01>] do_notify_resume+0x61/0xa0 
[<ffffffff814d2c58>] int_signal+0x12/0x17 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-08 | Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi | Linus Torvalds
Pull SCSI updates from James Bottomley: "This patch is the usual mix of driver updates (srp, ipr, scsi_debug, NCR5380, fnic, 53c974, ses, wd719x, hpsa, megaraid_sas). Of those, wd719x is new and 53c974 is a rewrite of the old tmscsim driver and the extensive work by Finn Thain rewrites all the NCR5380 based drivers. There's also extensive infrastructure updates: a new logging infrastructure for sense information and a rewrite of the tagged command queue API and an assortment of minor updates"

* tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (183 commits)
  scsi: set fmt to NULL scsi_extd_sense_format() by default
  libsas: remove task_collector mode
  wd719x: remove dma_cache_sync call
  scsi_debug: add Report supported opcodes+tmfs; Compare and write
  scsi_debug: change SCSI command parser to table driven
  scsi_debug: add Capacity Changed Unit Attention
  scsi_debug: append inject error flags onto scsi_cmnd object
  scsi_debug: pinpoint invalid field in sense data
  wd719x: Add firmware documentation
  wd719x: Introduce Western Digital WD7193/7197/7296 PCI SCSI card driver
  eeprom-93cx6: Add (read-only) support for 8-bit mode
  esas2r: fix an oversight in setting return value
  esas2r: fix an error path in esas2r_ioctl_handler
  esas2r: fir error handling in do_fm_api
  scsi: add SPC-3 command definitions
  scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16
  scsi: remove scsi_driver owner field
  scsi: move scsi_dispatch_cmd to scsi_lib.c
  scsi: stop passing a gfp_mask argument down the command setup path
  scsi: remove scsi_next_command
  ...
2014-12-08 | blk-mq: prevent unmapped hw queue from being scheduled | Ming Lei
When a hardware queue has no mapped software queues, it shouldn't be scheduled; otherwise a WARNING or OOPS can be triggered. The blk_mq_hw_queue_mapped() helper is introduced to fix the problem. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-08 | Merge branch 'pm-runtime' | Rafael J. Wysocki
* pm-runtime: (25 commits)
  i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c
  dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME
  MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  hwrandom / exynos / PM: Use CONFIG_PM in #ifdef
  block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  USB / PM: Drop CONFIG_PM_RUNTIME from the USB core
  PM: Merge the SET*_RUNTIME_PM_OPS() macros
  PM / Kconfig: Do not select PM directly from Kconfig files
  PCI / PM: Drop CONFIG_PM_RUNTIME from the PCI core
  ...
2014-12-08 | blk-mq: re-check for available tags after running the hardware queue | Jens Axboe
If we run out of tags and have to sleep, we run the hardware queue to kick pending IO into gear. During that run, we may have completed requests, so re-check if we have free tags before going to sleep. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-08 | blk-mq: fix hang in bt_get() | Bart Van Assche
Prevent bt_get() from hanging when there are fewer hardware queues than CPU threads. The symptoms of the hang were as follows:
* All tags allocated for a particular hardware queue.
* (nr_tags) pending commands for that hardware queue.
* No pending commands for the software queues associated with that hardware queue.
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-08 | Merge remote-tracking branch 'scsi-queue/core-for-3.19' into for-linus | James Bottomley
2014-12-04 | block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM | Rafael J. Wysocki
After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks depending on CONFIG_PM_RUNTIME may now be changed to depend on CONFIG_PM. Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core. Reviewed-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-12-02 | block: fix regression where bio_integrity_process uses wrong bio_vec iterator | Darrick J. Wong
bio integrity handling is broken on a system with LVM layered atop a DIF/DIX SCSI drive because device mapper clones the bio, modifies the clone, and sends the clone to the lower layers for processing. However, the clone bio has bi_vcnt == 0, which means that when the sd driver calls bio_integrity_process to attach DIX data, the for_each_segment_all() call (which uses bi_vcnt) returns immediately and random garbage is sent to the disk on a disk write. The disk of course returns an error. Therefore, teach bio_integrity_process() to use bio_for_each_segment() to iterate the bio_vecs, since the per-bio iterator tracks which bio_vecs are associated with that particular bio. The integrity handling code is effectively part of the "driver" (it's not the bio owner), so it must use the correct iterator function. v2: Fix a compiler warning about abandoned local variables. This patch supersedes "block: bio_integrity_process uses wrong bio_vec iterator". Patch applies against 3.18-rc6. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-30 | blk-mq: move the kdump check to blk_mq_alloc_tag_set | Shaohua Li
We call blk_mq_alloc_tag_set() first, then blk_mq_init_queue(). The requests are allocated in the former function, so the kdump check should be moved there to really save memory. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 | blk-mq: cleanup tag free handling | Jens Axboe
We only call __blk_mq_put_tag() and __blk_mq_put_reserved_tag() from blk_mq_put_tag(), so just inline the two calls instead of having them as separate functions. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 | blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map | Jens Axboe
We currently use num_possible_cpus(), but that breaks on sparc64 where the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID instead, so we don't end up reading from invalid memory. Cc: stable@kernel.org # 3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
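The difference between "how many CPUs exist" and "how large the CPU ID space is" can be shown with a short user-space sketch. The helpers count_cpus(), id_space() and lookup_queue() are hypothetical, and the mask value in the usage note is an invented example of a sparse ID layout, not data from a real sparc64 machine.

```c
#include <assert.h>

/* Number of CPUs present in a possible-CPU bitmask. */
static unsigned int count_cpus(unsigned long mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1) /* clear the lowest set bit each round */
        n++;
    return n;
}

/* One more than the highest CPU ID in the mask: the number of entries a
 * table indexed by CPU ID must have. */
static unsigned int id_space(unsigned long mask)
{
    unsigned int n = 0;

    for (; mask; mask >>= 1)
        n++;
    return n;
}

/* Look up the queue for a CPU; the table must cover the whole ID space. */
static unsigned int lookup_queue(const unsigned int *map, unsigned int map_len,
                                 unsigned int cpu)
{
    assert(cpu < map_len); /* fails if the table was sized by CPU count */
    return map[cpu];
}
```

For a mask with CPUs 0, 1, 8 and 9 (0x303), count_cpus() gives 4 while id_space() gives 10, so a map sized by the count would be indexed out of bounds for CPU 8 or 9.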
2014-11-24 | scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 | Hannes Reinecke
SPC-3 defines SERVICE ACTION IN(12) and SERVICE ACTION IN(16). So rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 to be consistent with SPC and to allow for better distinction. Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Robert Elliott <elliott@hp.com> Reviewed-by: Robert Elliott <elliott@hp.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-11-24 | blk: introduce generic io stat accounting help function | Gu Zheng
Many block drivers (e.g. NVMe) account I/O statistics based on bios, so the request-based blk_account_io_start/end() does not make sense for them. Introduce similar helper functions, generic_start/end_io_acct, based on raw sectors; they can simplify the open-coded I/O accounting in some drivers. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 | blk-mq: handle the single queue case in blk_mq_hctx_next_cpu | Christoph Hellwig
Don't duplicate the code to handle the not cpu bounce case in the caller, do it inside blk_mq_hctx_next_cpu instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-19 | genhd: check for int overflow in disk_expand_part_tbl() | Jens Axboe
We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition() with a user passed in partno value. If we pass in 0x7fffffff, the new target in disk_expand_part_tbl() overflows the 'int' and we access beyond the end of ptbl->part[] and even write to it when we do the rcu_assign_pointer() to assign the new partition. Reported-by: David Ramos <daramos@stanford.edu> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
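The overflow described above comes from computing "partition number plus one" in a signed int. A minimal user-space sketch of a guarded computation is shown below; part_tbl_target() is a hypothetical helper, and checking before the addition is one possible way to avoid the overflow rather than a copy of the kernel fix.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Compute how many table slots are needed to hold index partno.
 * Returns 0 on success or -EINVAL if the size cannot be represented. */
static int part_tbl_target(int partno, int *target)
{
    assert(target != NULL);
    /* partno + 1 overflows int when partno == INT_MAX, which is undefined
     * behaviour in C, so reject that value before doing the addition. */
    if (partno < 0 || partno == INT_MAX)
        return -EINVAL;
    *target = partno + 1;
    return 0;
}
```

With a user-supplied partno of 0x7fffffff the helper refuses the request instead of producing a bogus table size.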
2014-11-17 | blk-mq: add blk_mq_free_hctx_request() | Jens Axboe
It's silly to use blk_mq_free_request() which in turn maps the request to the hardware queue, for places where we already know what the hardware queue is. This saves us an extra mapping of a hardware queue on request completion, if the caller knows this information already. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-17 | blk-mq: export blk_mq_free_request() | Jens Axboe
Drivers that know they are blk-mq should just use this function instead of calling through blk_put_request(). Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-12 | scsi: add new scsi-command flag for tagged commands | Christoph Hellwig
Currently SCSI piggybacks on the block layer to define the concept of a tagged command. But we want to be able to have block-level host-wide tags assigned even for untagged commands like the initial INQUIRY, so add a new SCSI-level flag for commands that are tagged at the scsi level, so that even commands without that set can have tags assigned to them. Note that this already is the case for the blk-mq code path, and this just lets the old path catch up with it. We also set this flag based upon sdev->simple_tags instead of the block queue flag, so that it is entirely independent of the block layer tagging, and thus always correct even if a driver doesn't use block level tagging yet. Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and removing it forces them to look for the proper replacement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
2014-11-12 | blk-mq: add blk_mq_unique_tag() | Bart Van Assche
The queuecommand() callback functions in SCSI low-level drivers need to know which hardware context has been selected by the block layer. Since this information is not available in the request structure, and since passing the hctx pointer directly to the queuecommand callback function would require modification of all SCSI LLDs, add a function to the block layer that allows to query the hardware context index. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-11-11 | block: blk-merge: fix blk_recount_segments() | Ming Lei
For a cloned bio, bio->bi_vcnt can't be used at all, and we have to resort to bio_segments() to figure out how many segments there are in the bio. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-11 | blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable | Paolo Bonzini
blk-mq is using preempt_disable/enable in order to ensure that the queue runners are placed on the right CPU. This does not work with the RT patches, because __blk_mq_run_hw_queue takes a non-raw spinlock within the preemption-disabled region. If there is contention on the lock, this violates the rules for preemption-disabled regions. While this should be easily fixable within the RT patches just by doing migrate_disable/enable, we can do better and document _why_ this particular region runs with disabled preemption. After the previous patch, it is trivial to switch it to get/put_cpu; the RT patches then can change it to get_cpu_light, which lets virtio-blk run under RT kernels. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-11 | blk_mq: call preempt_disable/enable in blk_mq_run_hw_queue, and only if needed | Paolo Bonzini
preempt_disable/enable surrounds every call to blk_mq_run_hw_queue, except the one in blk-flush.c. In fact that one is always asynchronous, and it does not need smp_processor_id(). We can do the same for all other calls, avoiding preempt_disable when async is true. This avoids peppering blk-mq.c with preemption-disabled regions. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 | scsi: Fix more error handling in SCSI_IOCTL_SEND_COMMAND | Tony Battersby
Fix an error path in SCSI_IOCTL_SEND_COMMAND that calls blk_put_request(rq) on an invalid IS_ERR(rq) pointer. Fixes: a492f075450f ("block,scsi: fixup blk_get_request dead queue scenarios") Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 | blk-mq: make mq_queue_reinit_notify() freeze queues in parallel | Tejun Heo
q->mq_usage_counter is a percpu_ref which is killed and drained when the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit() which involves freezing the queue is invoked on all existing queues. Because percpu_ref killing and draining involve a RCU grace period, doing the above on one queue after another may take a long time if there are many queues on the system. This patch splits out initiation of freezing and waiting for its completion, and updates blk_mq_queue_reinit_notify() so that the queues are frozen in parallel instead of one after another. Note that freezing and unfreezing are moved from blk_mq_queue_reinit() to blk_mq_queue_reinit_notify(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-31 | block: Fix computation of merged request priority | Jan Kara
Priority of a merged request is computed by ioprio_best(). If one of the requests has undefined priority (IOPRIO_CLASS_NONE) and another request has priority from IOPRIO_CLASS_BE, the function will return the undefined priority which is wrong. Fix the function to properly return priority of a request with the defined priority. Fixes: d58cdfb89ce0c6bd5f81ae931a984ef298dbda20 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
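The behaviour described above can be sketched in user space. The constants below are local stand-ins modelled on the I/O priority encoding (class in the upper bits, level in the lower bits), and prio_best() and prio_naive_min() are hypothetical names; prio_naive_min() is a plain numeric minimum included only for contrast and is not claimed to be the previous kernel code.

```c
#include <assert.h>

/* Local stand-ins for the priority classes; NONE means "not set". */
enum { PRIO_CLASS_NONE = 0, PRIO_CLASS_RT = 1, PRIO_CLASS_BE = 2, PRIO_CLASS_IDLE = 3 };

#define PRIO_CLASS_SHIFT 13
#define PRIO_VALUE(cls, level) (((cls) << PRIO_CLASS_SHIFT) | (level))
#define PRIO_CLASS(prio) ((prio) >> PRIO_CLASS_SHIFT)

/* Plain numeric minimum: an unset priority (value 0) always wins. */
static unsigned short prio_naive_min(unsigned short a, unsigned short b)
{
    return a < b ? a : b;
}

/* Prefer a defined priority over an undefined one; otherwise the
 * numerically smaller value is the more urgent priority. */
static unsigned short prio_best(unsigned short a, unsigned short b)
{
    assert(PRIO_CLASS(a) <= PRIO_CLASS_IDLE && PRIO_CLASS(b) <= PRIO_CLASS_IDLE);
    if (PRIO_CLASS(a) == PRIO_CLASS_NONE)
        return b;
    if (PRIO_CLASS(b) == PRIO_CLASS_NONE)
        return a;
    return a < b ? a : b;
}
```

Merging an unset priority with a best-effort level 4 priority yields the best-effort value from prio_best(), whereas the numeric minimum returns the unset value.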
2014-10-29 | blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag | Jens Axboe
Drivers can now tell blk-mq if they take advantage of the deferred issue through 'last' or not. If they do, don't do queue-direct for sync IO. This is a preparation patch for the nvme conversion. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-29 | blk-mq: add a 'list' parameter to ->queue_rq() | Jens Axboe
Since we have the notion of a 'last' request in a chain, we can use this to have the hardware optimize the issuing of requests. Add a list_head parameter to queue_rq that the driver can use to temporarily store hw commands for issue when 'last' is true. If we are doing a chain of requests, pass in a NULL list for the first request to force issue of that immediately, then batch the remainder for deferred issue until the last request has been sent. Instead of adding yet another argument to the hot ->queue_rq path, encapsulate the passed arguments in a blk_mq_queue_data structure. This is passed as a constant, and has been tested as faster than passing 4 (or even 3) args through ->queue_rq. Update drivers for the new ->queue_rq() prototype. There are no functional changes in this patch for drivers - if they don't use the passed in list, then they will just queue requests individually like before. Signed-off-by: Jens Axboe <axboe@fb.com>
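The batching scheme described above can be modelled with a small user-space mock. Everything here is hypothetical: the mock_ types and functions are invented for illustration and do not reproduce the kernel structures or the real ->queue_rq() prototype. The mock only shows the shape of the idea, namely bundling the arguments in one structure, issuing the first request immediately when the list is NULL, and flushing the rest when 'last' is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 8

struct mock_request { int id; };

/* Caller-owned staging area; plays the role of the 'list' argument. */
struct mock_batch {
    const struct mock_request *cmds[BATCH_MAX];
    size_t count;
};

/* All per-call arguments bundled into one structure, passed as const. */
struct mock_queue_data {
    const struct mock_request *rq;
    struct mock_batch *list; /* NULL: issue this request immediately */
    bool last;               /* true: final request of the chain */
};

struct mock_hw {
    size_t issued;    /* commands handed to the "hardware" */
    size_t doorbells; /* how many times the hardware was notified */
};

/* Mock driver hook: batch commands and notify the hardware once per batch. */
static void mock_queue_rq(struct mock_hw *hw, const struct mock_queue_data *bd)
{
    if (bd->list == NULL) {
        hw->issued++;
        hw->doorbells++;
        return;
    }
    assert(bd->list->count < BATCH_MAX);
    bd->list->cmds[bd->list->count++] = bd->rq;
    if (bd->last) {
        hw->issued += bd->list->count;
        hw->doorbells++;
        bd->list->count = 0;
    }
}

/* Submit a chain: first request immediately, the remainder as one batch. */
static void mock_submit_chain(struct mock_hw *hw,
                              const struct mock_request *rqs, size_t n)
{
    struct mock_batch batch = { .count = 0 };

    for (size_t i = 0; i < n; i++) {
        struct mock_queue_data bd = {
            .rq = &rqs[i],
            .list = (i == 0) ? NULL : &batch,
            .last = (i == n - 1),
        };
        mock_queue_rq(hw, &bd);
    }
}
```

In this mock, a chain of four requests reaches the hardware with two notifications instead of four: one for the first request and one for the batched remainder.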
2014-10-23 | block: fix wrong error return in elevator_init() | Sudip Mukherjee
While compiling, the integer err was reported as a set-but-unused variable. elevator_init_fn can be cfq_init_queue, deadline_init_queue or noop_init_queue. All three of these functions return -ENOMEM if they fail to allocate the queue, so we should actually return the error code rather than always returning 0. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-22 | scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND | Jan Kara
When sg_scsi_ioctl() fails to prepare request to submit in blk_rq_map_kern() we jump to a label where we just end up copying (luckily zeroed-out) kernel buffer to userspace instead of reporting error. Fix the problem by jumping to the right label. CC: Jens Axboe <axboe@kernel.dk> CC: linux-scsi@vger.kernel.org CC: stable@vger.kernel.org Coverity-id: 1226871 Signed-off-by: Jan Kara <jack@suse.cz> Fixed up the, now unused, out label. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-21 | blk-merge: recaculate segment if it isn't less than max segments | Ming Lei
The problem is introduced by commit 764f612c6c3c231b(blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio), and merge is needed if number of current segment isn't less than max segments. Strictly speaking, bio->bi_vcnt shouldn't be used here since it may not be accurate in cases of both cloned bio or bio cloned from, but bio_segments() is a bit expensive, and bi_vcnt is still the biggest number, so the approach should work. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-21 | block: remove artifical max_hw_sectors cap | Christoph Hellwig
Set max_sectors to the value the driver provides as hardware limit by default. Linux has had proper I/O throttling for a long time and doesn't rely on an artificially small maximum I/O size anymore. By not limiting the I/O size by default we remove an annoying tuning step required for most Linux installations. Note that both the user, and if absolutely required the driver can still impose a limit for FS requests below max_hw_sectors_kb. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-18 | Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block | Linus Torvalds
Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code and should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence.
- Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups).
We now just use a single queue and limited depth for that"

* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
  block: Remove REQ_KERNEL
  blk-mq: allocate cpumask on the home node
  bio-integrity: remove the needless fail handle of bip_slab creating
  block: include func name in __get_request prints
  block: make blk_update_request print prefix match ratelimited prefix
  blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
  block: fix alignment_offset math that assumes io_min is a power-of-2
  blk-mq: Make bt_clear_tag() easier to read
  blk-mq: fix potential hang if rolling wakeup depth is too high
  block: add bioset_create_nobvec()
  block: use bio_clone_fast() in blk_rq_prep_clone()
  block: misplaced rq_complete tracepoint
  sd: Honor block layer integrity handling flags
  block: Replace strnicmp with strncasecmp
  block: Add T10 Protection Information functions
  block: Don't merge requests if integrity flags differ
  block: Integrity checksum flag
  block: Relocate bio integrity flags
  block: Add a disk flag to block integrity profile
  block: Add prefix to block integrity profile flags
  ...
2014-10-13 | blk-mq: allocate cpumask on the home node | Jens Axboe
All other allocs are done on the specific node, somehow the cpumask for hw queue runs was missed. Fix that by using zalloc_cpumask_var_node() in blk_mq_init_queue(). Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-13 | bio-integrity: remove the needless fail handle of bip_slab creating | Gu Zheng
bip_slab is created with SLAB_PANIC, so the fail handler is unneeded. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-13 | block: include func name in __get_request prints | Robert Elliott
In __get_request calls to printk_ratelimited, include the function name so the callbacks suppressed message matches the messages that are printed, and add "dev" before the device name so it matches other block layer messages. Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-13 | block: make blk_update_request print prefix match ratelimited prefix | Robert Elliott
In blk_update_request, change the printk_ratelimited prefix from end_request to blk_update_request so it matches the name printed if rate limiting occurs.

Old:
  [10234.933106] blk_update_request: 174 callbacks suppressed
  [10234.934940] end_request: critical target error, dev sdr, sector 16
  [10234.949788] end_request: critical target error, dev sdr, sector 16

New:
  [16863.445173] blk_update_request: 398 callbacks suppressed
  [16863.447029] blk_update_request: critical target error, dev sdr, sector 1442066176
  [16863.449383] blk_update_request: critical target error, dev sdr, sector 802802888
  [16863.451680] blk_update_request: critical target error, dev sdr, sector 1609535456

Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-10 | Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu | Linus Torvalds
Pull percpu updates from Tejun Heo: "A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain GFP_KERNEL, it tries to allocate from what's already available to the allocator and a work item tries to keep the reserve around certain level so that these atomic allocations usually succeed. This will replace the ad-hoc percpu memory pool used by blk-throttle and also be used by the planned blkcg support for writeback IOs. Please note that I noticed a bug in how @gfp is interpreted while preparing this pull request and applied the fix 6ae833c7fe0c ("percpu: fix how @gfp is interpreted by the percpu allocator") just now.
- percpu_ref now uses longs for percpu and global counters instead of ints. It leads to more sparse packing of the percpu counters on 64bit machines but the overhead should be negligible and this allows using percpu_ref for refcnting pages and in-memory objects directly.
- The switching between percpu and single counter modes of a percpu_ref is made independent of putting the base ref and a percpu_ref can now optionally be initialized in single or killed mode. This allows avoiding percpu shutdown latency for cases where the refcounted objects may be synchronously created and destroyed in rapid succession with only a fraction of them reaching fully operational status (SCSI probing does this when combined with blk-mq support). It's also planned to be used to implement forced single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans up the duplicate percpu accessors. That branch causes a number of conflicts with s390 and other trees.
I'll send a separate pull request w/ resolutions once other branches are merged" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits) percpu: fix how @gfp is interpreted by the percpu allocator blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky percpu_ref: add PERCPU_REF_INIT_* flags percpu_ref: decouple switching to percpu mode and reinit percpu_ref: decouple switching to atomic mode and killing percpu_ref: add PCPU_REF_DEAD percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch percpu_ref: replace pcpu_ prefix with percpu_ percpu_ref: minor code and comment updates percpu_ref: relocate percpu_ref_reinit() Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Revert "percpu: free percpu allocation info for uniprocessor system" percpu-refcount: make percpu_ref based on longs instead of ints percpu-refcount: improve WARN messages percpu: fix locking regression in the failure path of pcpu_alloc() percpu-refcount: add @gfp to percpu_ref_init() proportions: add @gfp to init functions percpu_counter: add @gfp to percpu_counter_init() percpu_counter: make percpu_counters_lock irq-safe ...
2014-10-09  blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio  Ming Lei
It isn't correct to figure out req->bi_phys_segments from bio->bi_vcnt if the bio is cloned.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-09  block: fix alignment_offset math that assumes io_min is a power-of-2  Mike Snitzer
The math in both blk_stack_limits() and queue_limit_alignment_offset() assume that a block device's io_min (aka minimum_io_size) is always a power-of-2. Fix the math such that it works for non-power-of-2 io_min.

This issue (of alignment_offset != 0) became apparent when testing dm-thinp with a thinp blocksize that matches a RAID6 stripesize of 1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data block size") unlocked the potential for alignment_offset != 0 due to the dm-thin-pool's io_min possibly being a non-power-of-2.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
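
The power-of-2 assumption described in the message above can be illustrated with a minimal standalone C sketch. This is not the kernel patch itself: the helper names remainder_mask() and remainder_mod() are hypothetical and exist only for illustration. The sketch contrasts a bit-mask remainder, which is only valid when the granularity is a power of 2, with a plain modulo, which works for any non-zero granularity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: remainder computed with a bit mask.
 * Only correct when granularity is a power of 2.
 */
static uint64_t remainder_mask(uint64_t offset, uint64_t granularity)
{
	assert(granularity != 0);
	return offset & (granularity - 1);
}

/*
 * Hypothetical helper: remainder computed with modulo.
 * Correct for any non-zero granularity, power of 2 or not.
 */
static uint64_t remainder_mod(uint64_t offset, uint64_t granularity)
{
	assert(granularity != 0);
	return offset % granularity;
}
```

With a granularity of 1280K (1310720 bytes) and an offset of exactly three such units (3932160 bytes), remainder_mod() returns 0, while remainder_mask() returns 1048576 and so reports a misalignment that does not exist. For a power-of-2 granularity such as 4096 the two helpers agree.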
2014-10-07  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial  Linus Torvalds
Pull "trivial tree" updates from Jiri Kosina:
"Usual pile from trivial tree everyone is so eagerly waiting for"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  Remove MN10300_PROC_MN2WS0038
  mei: fix comments
  treewide: Fix typos in Kconfig
  kprobes: update jprobe_example.c for do_fork() change
  Documentation: change "&" to "and" in Documentation/applying-patches.txt
  Documentation: remove obsolete pcmcia-cs from Changes
  Documentation: update links in Changes
  Documentation: Docbook: Fix generated DocBook/kernel-api.xml
  score: Remove GENERIC_HAS_IOMAP
  gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
  tty: doc: Fix grammar in serial/tty
  dma-debug: modify check_for_stack output
  treewide: fix errors in printk
  genirq: fix reference in devm_request_threaded_irq comment
  treewide: fix synchronize_rcu() in comments
  checkstack.pl: port to AArch64
  doc: queue-sysfs: minor fixes
  init/do_mounts: better syntax description
  MIPS: fix comment spelling
  powerpc/simpleboot: fix comment
  ...
2014-10-07  blk-mq: Make bt_clear_tag() easier to read  Bart Van Assche
Eliminate a backwards goto statement from bt_clear_tag().

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-07  blk-mq: fix potential hang if rolling wakeup depth is too high  Jens Axboe
We currently divide the queue depth by 4 as our batch wakeup count, but we split the wakeups over BT_WAIT_QUEUES number of wait queues. This defaults to 8. If the product of the resulting batch wake count and BT_WAIT_QUEUES is higher than the device queue depth, we can get into a situation where a task goes to sleep waiting for a request, but never gets woken up.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: 4bb659b156996
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
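
The arithmetic behind the hang condition described above can be sketched in standalone C. This is a simplified model under stated assumptions, not the kernel code or the actual fix: WAIT_QUEUES mirrors the default of 8 that the message gives for BT_WAIT_QUEUES, the function names are hypothetical, and clamping the batch to at least 1 is an assumption added for illustration.

```c
#include <assert.h>

/* Assumed value: the message says BT_WAIT_QUEUES defaults to 8. */
#define WAIT_QUEUES 8u

/* Batch wakeup count as described: queue depth divided by 4 (assumed minimum of 1). */
static unsigned int wake_batch_quarter(unsigned int depth)
{
	unsigned int batch = depth / 4;

	return batch ? batch : 1;
}

/*
 * One possible bound (an assumption, not necessarily the upstream fix):
 * divide by the number of wait queues instead, so that
 * batch * WAIT_QUEUES stays within depth whenever depth >= WAIT_QUEUES.
 */
static unsigned int wake_batch_bounded(unsigned int depth)
{
	unsigned int batch = depth / WAIT_QUEUES;

	return batch ? batch : 1;
}

/* Returns 1 when the product of batch and wait queues exceeds the queue depth. */
static int may_hang(unsigned int depth, unsigned int batch)
{
	assert(depth != 0);
	return batch * WAIT_QUEUES > depth;
}
```

For a queue depth of 16, wake_batch_quarter() gives 4, and 4 * 8 = 32 exceeds 16, which is the situation the message warns about. wake_batch_bounded() gives 2, and 2 * 8 = 16 does not exceed the depth. In this simplified model, depths below WAIT_QUEUES still trip the check because the batch is clamped to 1.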
2014-10-03  block: add bioset_create_nobvec()  Junichi Nomura
Users of bio_clone_fast() do not want bios with their own bvecs. Allocating a bvec mempool as part of the bioset intended for such users is a waste of memory.

bioset_create_nobvec() creates a bioset that doesn't have the bvec mempool.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-03  block: use bio_clone_fast() in blk_rq_prep_clone()  Junichi Nomura
Request cloning clones bios in the request to track the completion of each bio. For that purpose, we can use bio_clone_fast() instead of bio_clone() to avoid unnecessary allocation and copy of bvecs.

This patch reduces memory footprint of request-based device-mapper (about 1-4KB for each request) and is a preparation for further reduction of memory usage by removing unused bvec mempool.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>