summaryrefslogtreecommitdiff
path: root/block/genhd.c
AgeCommit message (Collapse)Author
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-08-20workqueue: deprecate system_nrt[_freezable]_wqTejun Heo
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and convert all users to system[_freezable]_wq. If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant, so there's no reason to use system_nrt[_freezable]_wq. Please use system[_freezable]_wq instead. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com>
2012-08-13workqueue: use mod_delayed_work() instead of cancel + queueTejun Heo
Convert delayed_work users doing cancel_delayed_work() followed by queue_delayed_work() to mod_delayed_work(). Most conversions are straight-forward. Ones worth mentioning are, * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always use mod_delayed_work() and cancel loop in edac_mc_reset_delay_period() is dropped. * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether watchdog is active or not. @fan_watchdog_active and related code dropped. * drivers/power/charger-manager.c: Seemingly a lot of delayed_work_pending() abuse going on here. [delayed_]work_pending() are unsynchronized and racy when used like this. I converted one instance in fullbatt_handler(). Please conver the rest so that it invokes workqueue APIs for the intended target state rather than trying to game work item pending state transitions. e.g. if timer should be modified - call mod_delayed_work(), canceled - call cancel_delayed_work[_sync](). * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling() simplified. Note that round_jiffies() calls in this function are meaningless. round_jiffies() work on absolute jiffies not delta delay used by delayed_work. v2: Tomi pointed out that __cancel_delayed_work() users can't be safely converted to mod_delayed_work(). They could be calling it from irq context and if that happens while delayed_work_timer_fn() is running, it could deadlock. __cancel_delayed_work() users are dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Acked-by: Anton Vorontsov <cbouatmailru@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Doug Thompson <dougthompson@xmission.com> Cc: David Airlie <airlied@linux.ie> Cc: Roland Dreier <roland@kernel.org> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Johannes Berg <johannes@sipsolutions.net>
2012-08-03block: Don't use static to define "void *p" in show_partition_start()Jianpeng Ma
I met a odd prblem:read /proc/partitions may return zero. I wrote a file test.c: int main() { char buff[4096]; int ret; int fd; printf("pid=%d\n",getpid()); while (1) { fd = open("/proc/partitions", O_RDONLY); if (fd < 0) { printf("open error %s\n", strerror(errno)); return 0; } ret = read(fd, buff, 4096); if (ret <= 0) printf("ret=%d, %s, %ld\n", ret, strerror(errno), lseek(fd,0,SEEK_CUR)); close(fd); } exit(0); } You can reproduce by: 1:while true;do cat /proc/partitions > /dev/null ;done 2:./test I reviewed the code and found: >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos) >> { >> static void *p; >> >> p = disk_seqf_start(seqf, pos); >> if (!IS_ERR_OR_NULL(p) && !*pos) >> seq_puts(seqf, "major minor #blocks name\n\n"); >> return p; >> } test cat /proc/partitions p = disk_seqf_start()(Not NULL) p = disk_seqf_start()(NULL because pos) if (!IS_ERR_OR_NULL(p) && !*pos) Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01block: add partition resize function to blkpg ioctlVivek Goyal
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that allows altering the size of an existing partition, even if it is currently in use. This patch converts hd_struct->nr_sects into sequence counter because One might extend a partition while IO is happening to it and update of nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This can lead to issues like reading inconsistent size of a partition. Sequence counter have been used so that readers don't have to take bdev mutex lock as we call sector_in_part() very frequently. Now all the access to hd_struct->nr_sects should happen using sequence counter read/update helper functions part_nr_sects_read/part_nr_sects_write. There is one exception though, set_capacity()/get_capacity(). I think theoritically race should exist there too but this patch does not modify set_capacity()/get_capacity() due to sheer number of call sites and I am afraid that change might break something. I have left that as a TODO item. We can handle it later if need be. This patch does not introduce any new races as such w.r.t set_capacity()/get_capacity(). v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Phillip Susi <psusi@ubuntu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-15block: fix buffer overflow when printing partition UUIDsTejun Heo
6d1d8050b4bc8 "block, partition: add partition_meta_info to hd_struct" added part_unpack_uuid() which assumes that the passed in buffer has enough space for sprintfing "%pU" - 37 characters including '\0'. Unfortunately, b5af921ec0233 "init: add support for root devices specified by partition UUID" supplied 33 bytes buffer to the function leading to the following panic with stackprotector enabled. Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e [<ffffffff815e226b>] panic+0xba/0x1c6 [<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb [<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20 [<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb [<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f [<ffffffff81aee0fa>] mount_root+0x57/0x5b [<ffffffff81aee23b>] prepare_namespace+0x13d/0x176 [<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30 [<ffffffff81aedd60>] kernel_init+0x155/0x15a [<ffffffff81087b97>] ? schedule_tail+0x27/0xb0 [<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10 [<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5 [<ffffffff815f4d20>] ? gs_change+0x13/0x13 Increase the buffer size, remove the dangerous part_unpack_uuid() and use snprintf() directly from printk_all_partitions(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com> Cc: Will Drewry <wad@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02Block: use a freezable workqueue for disk-event pollingAlan Stern
This patch (as1519) fixes a bug in the block layer's disk-events polling. The polling is done by a work routine queued on the system_nrt_wq workqueue. Since that workqueue isn't freezable, the polling continues even in the middle of a system sleep transition. Obviously, polling a suspended drive for media changes and such isn't a good thing to do; in the case of USB mass-storage devices it can lead to real problems requiring device resets and even re-enumeration. The patch fixes things by creating a new system-wide, non-reentrant, freezable workqueue and using it for disk-events polling. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: <stable@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02block: fix __blkdev_get and add_disk race conditionStanislaw Gruszka
The following situation might occur: __blkdev_get: add_disk: register_disk() get_gendisk() disk_block_events() disk->ev == NULL disk_add_events() __disk_unblock_events() disk->ev != NULL --ev->block Then we unblock events, when they are suppose to be blocked. This can trigger events related block/genhd.c warnings, but also can crash in sd_check_events() or other places. I'm able to reproduce crashes with the following scripts (with connected usb dongle as sdb disk). <snip> DEV=/dev/sdb ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue function stop_me() { for i in `jobs -p` ; do kill $i 2> /dev/null ; done exit } trap stop_me SIGHUP SIGINT SIGTERM for ((i = 0; i < 10; i++)) ; do while true; do fdisk -l $DEV 2>&1 > /dev/null ; done & done while true ; do echo 1 > $ENABLE sleep 1 echo 0 > $ENABLE done </snip> I use the script to verify patch fixing oops in sd_revalidate_disk http://marc.info/?l=linux-scsi&m=132935572512352&w=2 Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in sd_revalidate_disk" or this one, script easily crash kernel within a few seconds. With both patches applied I do not observe crash. Unfortunately after some time (dozen of minutes), script will hung in: [ 1563.906432] [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20 [ 1563.906437] [<c04532d5>] msleep+0x15/0x20 [ 1563.906443] [<c05d60b2>] blk_drain_queue+0x32/0xd0 [ 1563.906447] [<c05d6e00>] blk_cleanup_queue+0xd0/0x170 [ 1563.906454] [<c06d278f>] scsi_free_queue+0x3f/0x60 [ 1563.906459] [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0 [ 1563.906463] [<c06d4aff>] scsi_forget_host+0x4f/0x60 [ 1563.906468] [<c06cd84a>] scsi_remove_host+0x5a/0xf0 [ 1563.906482] [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage] [ 1563.906490] [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage] Anyway I think this patch is some step forward. As drawback, I do not teardown on sysfs file create error, because I do not know how to nullify disk->ev (since it can be used). However add_disk error handling practically does not exist too, and things will work without this sysfs file, except events will not be exported to user space. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-15Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits) Revert "block: recursive merge requests" block: Stop using macro stubs for the bio data integrity calls blockdev: convert some macros to static inlines fs: remove unneeded plug in mpage_readpages() block: Add BLKROTATIONAL ioctl block: Introduce blk_set_stacking_limits function block: remove WARN_ON_ONCE() in exit_io_context() block: an exiting task should be allowed to create io_context block: ioc_cgroup_changed() needs to be exported block: recursive merge requests block, cfq: fix empty queue crash caused by request merge block, cfq: move icq creation and rq->elv.icq association to block core block, cfq: restructure io_cq creation path for io_context interface cleanup block, cfq: move io_cq exit/release to blk-ioc.c block, cfq: move icq cache management to block core block, cfq: move io_cq lookup to blk-ioc.c block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq block, cfq: reorganize cfq_io_context into generic and cfq specific parts block: remove elevator_queue->ops block: reorder elevator switch sequence ... Fix up conflicts in: - block/blk-cgroup.c Switch from can_attach_task to can_attach - block/cfq-iosched.c conflict with now removed cic index changes (we now use q->id instead)
2012-01-06Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into ZAl Viro
2012-01-03switch device_get_devnode() and ->devnode() to umode_t *Al Viro
both callers of device_get_devnode() are only interested in lower 16bits and nobody tries to return anything wider than 16bit anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03fs: move code out of buffer.cAl Viro
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03make register_disk() staticAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-14block: misc updates to blk_get_queue()Tejun Heo
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on failure instead of 0 / -errno or boolean. Update it such that it returns %true on success and %false on failure. * Make sure the caller checks for the return value. * Separate out __blk_get_queue() which doesn't check whether @q is dead and put it in blk.h. This will be used later. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-10block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"Tejun Heo
This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10. The commit introduced alias for block devices which is intended to be used during logging although actual usage hasn't been committed yet. This approach adds very limited benefit (raw log might be easier to follow) which can be trivially implemented in userland but has a lot of problems. It is much worse than netif renames because it doesn't rename the actual device but just adds conveninence name which isn't used universally or enforced. Everything internal including device lookup and sysfs still uses the internal name and nothing prevents two devices from using conflicting alias - ie. sda can have sdb as its alias. This has been nacked by people working on device driver core, block layer and kernel-userland interface and shouldn't have been upstreamed. Revert it. http://thread.gmane.org/gmane.linux.kernel/1155104 http://thread.gmane.org/gmane.linux.scsi/68632 http://thread.gmane.org/gmane.linux.scsi/69776 Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nao Nishijima <nao.nishijima.xt@hitachi.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-04Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits) virtio-blk: use ida to allocate disk index hpsa: add small delay when using PCI Power Management to reset for kump cciss: add small delay when using PCI Power Management to reset for kump xen/blkback: Fix two races in the handling of barrier requests. xen/blkback: Check for proper operation. xen/blkback: Fix the inhibition to map pages when discarding sector ranges. xen/blkback: Report VBD_WSECT (wr_sect) properly. xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests. xen-blkfront: plug device number leak in xlblk_init() error path xen-blkfront: If no barrier or flush is supported, use invalid operation. xen-blkback: use kzalloc() in favor of kmalloc()+memset() xen-blkback: fixed indentation and comments xen-blkfront: fix a deadlock while handling discard response xen-blkfront: Handle discard requests. xen-blkback: Implement discard requests ('feature-discard') xen-blkfront: add BLKIF_OP_DISCARD and discard request struct drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd() drivers/block/loop.c: emit uevent on auto release drivers/block/cpqarray.c: use pci_dev->revision loop: always allow userspace partitions and optionally support automatic scanning ... Fic up trivial header file includsion conflict in drivers/block/loop.c
2011-11-04Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c
2011-10-19block: make gendisk hold a reference to its queueTejun Heo
The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-08-29[SCSI] genhd: add a new attribute "alias" in gendiskNao Nishijima
This patch allows the user to set an "alias" of the disk via sysfs interface. This patch only adds a new attribute "alias" in gendisk structure. To show the alias instead of the device name in kernel messages, we need to revise printk messages and use alias_name() in them. Example: (current) printk("disk name is %s\n", disk->disk_name); (new) printk("disk name is %s\n", alias_name(disk)); Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can have an "alias" which length is up to 255 bytes. This attribute is write-once. Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Suggested-by: Jon Masters <jcm@redhat.com> Signed-off-by: Nao Nishijima <nao.nishijima.xt@hitachi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-08-23block: add GENHD_FL_NO_PART_SCANTejun Heo
There are cases where suppressing partition scan is useful - e.g. for lo devices and pseudo SATA devices which advertise to be a disk but get upset on partition scan (some port multiplier control devices show such behavior). This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan regardless of the number of possible partitions. disk_partitionable() is renamed to disk_part_scan_enabled() as suppressing partition scan doesn't imply the device can't be partitioned using BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now directly tests disk_max_parts() to maintain backward-compatibility. -v2: Updated to make it clear that only partition scan is suppressed not partitioning itself as suggested by Kay Sievers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02block/genhd.c: remove useless cast in diskstats_show()Herbert Poetzl
Remove the (unsigned long long) cast in diskstats_show() and adjusts the seq_printf() format string to 'unsigned long' diskstats_show() uses part_stat_read() to get the stats, which either accesses the specified field in the struct disk_stats directly (non SMP) or sums up the per CPU values in a variable of the same type as the field, so in any case the result will have the same type and range as the specified field which for all disk_stats entries is unsigned long Also, for unsigned long ranges the output of %lu should be identical to the one of %llu, so no change in the actual proc entry contents. Signed-off-by: Herbert Poetzl <herbert@13thfloor.at> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-25Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...
2011-07-20block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()Lai Jiangshan
The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-01block: flush MEDIA_CHANGE from drivers on close(2)Tejun Heo
Currently, only open(2) is defined as the 'clearing' point. It has two roles - first, it's an acknowledgement from userland indicating that the event has been received and kernel can clear pending states and proceed to generate more events. Secondly, it's passed on to device drivers as a hint indicating that a synchronization point has been reached and it might want to take a deeper look at the device. The latter currently is only used by sr which uses two different mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY to discover events, where the former is lighter weight and safe to be used repeatedly but may not provide full coverage. Among other things, GET_EVENT can't detect media removal while TUR can. This patch makes close(2) - blkdev_put() - indicate clearing hint for MEDIA_CHANGE to drivers. disk_check_events() is renamed to disk_flush_events() and updated to take @mask for events to flush which is or'd to ev->clearing and will be passed to the driver on the next ->check_events() invocation. This change makes sr generate MEDIA_CHANGE when media is ejected from userland - e.g. with eject(1). Note: Given the current usage, it seems @clearing hint is needlessly complex. disk_clear_events() can simply clear all events and the hint can be boolean @flush. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01Merge branch 'for-linus' into for-3.1/coreJens Axboe
Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13block:remove some spare spaces in genhd.cWanlong Gao
Remove the end-of-line spaces in genhd.c. Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: make disk_block_events() properly wait for work cancellationTejun Heo
disk_block_events() should guarantee that the event work is not in flight on return and once blocked it shouldn't issue further cancellations. Because there was no synchronization between the first blocker doing cancel_delayed_work_sync() and the following blockers, the following blockers could finish before cancellation was complete, which broke both guarantees - event work could be in flight and cancellation could happen after return. This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 Fix it by adding an outer mutex which protects both block count manipulation and work cancellation. -v2: Use outer mutex instead of bit waitqueue per Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: remove non-syncing __disk_block_events() and fold it into ↵Tejun Heo
disk_block_events() After the previous update to disk_check_events(), nobody is using non-syncing __disk_block_events(). Remove @sync and, as this makes __disk_block_events() virtually identical to disk_block_events(), remove the underscore prefixed version. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: don't use non-syncing event blocking in disk_check_events()Tejun Heo
This patch is part of fix for triggering of WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 disk_clear_events() blocks events, schedules and flushes the event work. It expects the work to have started execution on schedule and finished on return from flush. WARN_ON_ONCE() triggers if the event work hasn't executed as expected. This problem happens because __disk_block_events() fails to guarantee that the event work item is not in flight on return from the function in race-free manner. The problem is two-fold and this patch addresses one of them. When __disk_block_events() is called with @sync == %false, it bumps event block count, calls cancel_delayed_work() and return. This makes it impossible to guarantee that event polling is not in flight on return from syncing __disk_block_events() - if the first blocker was non-syncing, polling could still be in progress and later syncing ones would assume that the first blocker already canceled it. Making __disk_block_events() cancel_sync regardless of block count isn't feasible either as it may race with forced event checking in disk_clear_events(). As disk_check_events() is the only user of non-syncing __disk_block_events(), updating it to directly cancel and schedule event work is the easiest way to solve the issue. Note that there's another bug in __disk_block_events() and this patch doesn't fix the issue completely. Later patch will fix the other bug. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26block: always allocate genhd->ev if check_events is implementedTejun Heo
9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block drivers which have inadequate ->check_events(). Combined with earlier change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to userland), this enables using ->check_events() for internal processing while avoiding enabling in-kernel block event polling which can lead to infinite event loop. Unfortunately, this made many drivers including floppy without any bit set in disk->events and ->async_events in which case disk_add_events() simply skipped allocation of disk->ev, which disables whole event handling. As ->check_events() is still used during open processing for revalidation, this can lead to open failure. This patch always allocates disk->ev if ->check_events is implemented. In the long term, it would make sense to simply include the event structure inline into genhd as it's now used by virtually all block devices. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ondrej Zary <linux@rainbow-software.org> Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-21block: don't propagate unlisted DISK_EVENTs to userlandTejun Heo
DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and internal event for revalidation of removeable devices. Some legacy drivers don't implement proper event detection and continuously generate events under certain circumstances. For example, ide-cd generates media changed continuously if there's no media in the drive, which can lead to infinite loop of events jumping back and forth between the driver and userland event handler. This patch updates disk event infrastructure such that it never propagates events not listed in disk->events to userland. Those events are processed the same for internal purposes but uevent generation is suppressed. This also ensures that userland only gets events which are advertised in the @events sysfs node lowering risk of confusion. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-09block: Don't implicitly trigger event check on disk_unblock_events()Tejun Heo
Currently, disk_unblock_events() implicitly kick event check if the block count reaches zero. This behavior is not described in the comment and hinders with future changes. Make the unblocker explicitly check events by calling disk_check_events() as necessary. This patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org>
2011-03-04Merge branch 'for-linus' of ../linux-2.6-block into block-for-2.6.39/coreTejun Heo
This merge creates two set of conflicts. One is simple context conflicts caused by removal of throtl_scheduled_delayed_work() in for-linus and removal of throtl_shutdown_timer_wq() in for-2.6.39/core. The other is caused by commit 255bb490c8 (block: blk-flush shouldn't call directly into q->request_fn() __blk_run_queue()) in for-linus crashing with FLUSH reimplementation in for-2.6.39/core. The conflict isn't trivial but the resolution is straight-forward. * __blk_run_queue() calls in flush_end_io() and flush_data_end_io() should be called with @force_kblockd set to %true. * elv_insert() in blk_kick_flush() should use %ELEVATOR_INSERT_REQUEUE. Both changes are to avoid invoking ->request_fn() directly from request completion path and closely match the changes in the commit 255bb490c8. Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-02block/genhd: Change some numerals into macrosLiu Yuan
Rename the numerals in the diskstats_show() into the macros. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-02-24Fix over-zealous flush_disk when changing device size.NeilBrown
There are two cases when we call flush_disk. In one, the device has disappeared (check_disk_change) so any data will hold becomes irrelevant. In the oter, the device has changed size (check_disk_size_change) so data we hold may be irrelevant. In both cases it makes sense to discard any 'clean' buffers, so they will be read back from the device if needed. In the former case it makes sense to discard 'dirty' buffers as there will never be anywhere safe to write the data. In the second case it *does*not* make sense to discard dirty buffers as that will lead to file system corruption when you simply enlarge the containing devices. flush_disk calls __invalidate_devices. __invalidate_device calls both invalidate_inodes and invalidate_bdev. invalidate_inodes *does* discard I_DIRTY inodes and this does lead to fs corruption. invalidate_bev *does*not* discard dirty pages, but I don't really care about that at present. So this patch adds a flag to __invalidate_device (calling it __invalidate_device2) to indicate whether dirty buffers should be killed, and this is passed to invalidate_inodes which can choose to skip dirty inodes. flusk_disk then passes true from check_disk_change and false from check_disk_size_change. dm avoids tripping over this problem by calling i_size_write directly rathher than using check_disk_size_change. md does use check_disk_size_change and so is affected. This regression was introduced by commit 608aeef17a which causes check_disk_size_change to call flush_disk, so it is suitable for any kernel since 2.6.27. Cc: stable@kernel.org Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Andrew Patterson <andrew.patterson@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-13Merge branch 'for-2.6.38/event-handling' into for-2.6.38/coreJens Axboe
2011-01-07block: add internal hd part table referencesJens Axboe
We can't use krefs since it's apparently restricted to very basic reference counting. This reverts commit e4a683c8. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-05block: fix accounting bug on cross partition mergesJerome Marchand
/proc/diskstats would display a strange output as follows. $ cat /proc/diskstats |grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. | hd_struct->in_flight --------------------------- sda1 | -1 sda2 | 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. Also add a refcount to struct hd_struct to keep the partition in memory as long as users exist. We use kref_test_and_get() to ensure we don't add a reference to a partition which is going away. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17fs/block: type signature of major_to_index(int) to major_to_index(unsigned)Yang Zhang
The major/minor device numbers are always defined and used as `unsigned'. Signed-off-by: Yang Zhang <kthreadd@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)Yang Zhang
Signed-off-by: Yang Zhang <kthreadd@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16implement in-kernel gendisk events handlingTejun Heo
Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block/*/events[_async] export these to userland. * Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block/*/events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. * If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16block: move register_disk() and del_gendisk() to block/genhd.cTejun Heo
There's no reason for register_disk() and del_gendisk() to be in fs/partitions/check.c. Move both to genhd.c. While at it, collapse unlink_gendisk(), which was artificially in a separate function due to genhd.c / check.c split, into del_gendisk(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16block: kill genhd_media_change_notify()Tejun Heo
There's no user of the facility. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-24Revert "block: fix accounting bug on cross partition merges"Jens Axboe
This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327. Conflicts: include/linux/genhd.h It has numerous issues with the cleanup path and non-elevator devices. Revert it for now so we can come up with a clean version without rushing things. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (31 commits) driver core: Display error codes when class suspend fails Driver core: Add section count to memory_block struct Driver core: Add mutex for adding/removing memory blocks Driver core: Move find_memory_block routine hpilo: Despecificate driver from iLO generation driver core: Convert link_mem_sections to use find_memory_block_hinted. driver core: Introduce find_memory_block_hinted which utilizes kset_find_obj_hinted. kobject: Introduce kset_find_obj_hinted. driver core: fix build for CONFIG_BLOCK not enabled driver-core: base: change to new flag variable sysfs: only access bin file vm_ops with the active lock sysfs: Fail bin file mmap if vma close is implemented. FW_LOADER: fix kconfig dependency warning on HOTPLUG uio: Statically allocate uio_class and use class .dev_attrs. uio: Support 2^MINOR_BITS minors uio: Cleanup irq handling. uio: Don't clear driver data uio: Fix lack of locking in init_uio_class SYSFS: Allow boot time switching between deprecated and modern sysfs layout driver core: remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices ...
2010-10-22SYSFS: Allow boot time switching between deprecated and modern sysfs layoutAndi Kleen
I have some systems which need legacy sysfs due to old tools that are making assumptions that a directory can never be a symlink to another directory, and it's a big hazzle to compile separate kernels for them. This patch turns CONFIG_SYSFS_DEPRECATED into a run time option that can be switched on/off the kernel command line. This way the same binary can be used in both cases with just a option on the command line. The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set the default. I kept the weird name to not break existing config files. Also the compat code can be still completely disabled by undefining CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes care of this now instead of lots of ifdefs. This makes the code look nicer. v2: This is an updated version on top of Kay's patch to only handle the block devices. I tested it on my old systems and that seems to work. Cc: axboe@kernel.dk Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-10-19block: fix accounting bug on cross partition mergesYasuaki Ishimatsu
/proc/diskstats would display a strange output as follows. $ cat /proc/diskstats |grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. | hd_struct->in_flight --------------------------- sda1 | -1 sda2 | 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. When reloading partition tables, quiesce IO to ensure that no request references to the partition struct exists. When it is safe to free the partition table, the IO for that device is restarted again. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16block: Fix race during disk initializationSigned-off-by: Jan Kara
When a new disk is being discovered, add_disk() first ties the bdev to gendisk (via register_disk()->blkdev_get()) and only after that calls bdi_register_bdev(). Because register_disk() also creates disk's kobject, it can happen that userspace manages to open and modify the device's data (or inode) before its BDI is properly initialized leading to a warning in __mark_inode_dirty(). Fix the problem by registering BDI early enough. This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=16312 Cc: stable@kernel.org Reported-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-15init: add support for root devices specified by partition UUIDWill Drewry
This is the third patch in a series which adds support for storing partition metadata, optionally, off of the hd_struct. One major use for that data is being able to resolve partition by other identities than just the index on a block device. Device enumeration varies by platform and there's a benefit to being able to use something like EFI GPT's GUIDs to determine the correct block device and partition to mount as the root. This change adds that support to root= by adding support for the following syntax: root=PARTUUID=hex-uuid Signed-off-by: Will Drewry <wad@chromium.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>