summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)Author
2013-06-19rsxx: Adding in sync_start module paramenter.Philip J Kelleher
Before, the partition table would have to be reread because our card was attached before it transistioned out of it's 'starting' state. This change will cause the driver to wait to attach the device until the adapter is ready. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19rsxx: Allow block size to be determined by configuration.Philip J Kelleher
Previously, the block size was determined by whether or not our Hardware could handle 512 byte accesses. Now, all of our Hardware can handle 512 and 4096 block sizes. This fix allows it to be user configurable. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19rsxx: Fixes soft-lockup issues during DMAs.Philip J Kelleher
The workqueue mechanism has been reworked to prevent soft lockup issues from occuring by adding in mutex sychronization. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19rsxx: Restructured DMA cancel scheme.Philip J Kelleher
Before, DMAs would never be cancelled if there was a data stall or an EEH Permenant failure which would cause an unrecoverable I/O hang. The DMA cancellation mechanism has been modified to fix these issues and allows DMAs to be cancelled during the above mentioned events. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19rsxx: Individual workqueues for interruptible events.Philip J Kelleher
Giving all interrupt based events their own workqueue to complete tasks on. This fixes a bug that would cause creg commands to timeout if too many are issued at once. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-17xen/blkback: Check for insane amounts of request on the ring (v6).Konrad Rzeszutek Wilk
Check that the ring does not have an insane amount of requests (more than there could fit on the ring). If we detect this case we will stop processing the requests and wait until the XenBus disconnects the ring. The existing check RING_REQUEST_CONS_OVERFLOW which checks for how many responses we have created in the past (rsp_prod_pvt) vs requests consumed (req_cons) and whether said difference is greater or equal to the size of the ring, does not catch this case. Wha the condition does check if there is a need to process more as we still have a backlog of responses to finish. Note that both of those values (rsp_prod_pvt and req_cons) are not exposed on the shared ring. To understand this problem a mini crash course in ring protocol response/request updates is in place. There are four entries: req_prod and rsp_prod; req_event and rsp_event to track the ring entries. We are only concerned about the first two - which set the tone of this bug. The req_prod is a value incremented by frontend for each request put on the ring. Conversely the rsp_prod is a value incremented by the backend for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when pushing the responses on the ring). Both values can wrap and are modulo the size of the ring (in block case that is 32). Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details. The culprit here is that if the difference between the req_prod and req_cons is greater than the ring size we have a problem. Fortunately for us, the '__do_block_io_op' loop: rc = blk_rings->common.req_cons; rp = blk_rings->common.sring->req_prod; while (rc != rp) { .. blk_rings->common.req_cons = ++rc; /* before make_response() */ } will loop up to the point when rc == rp. The macros inside of the loop (RING_GET_REQUEST) is smart and is indexing based on the modulo of the ring size. If the frontend has provided a bogus req_prod value we will loop until the 'rc == rp' - which means we could be processing already processed requests (or responses) often. The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is b/c it only tracks how many responses we have internally produced and whether we would should process more. The astute reader will notice that the macro RING_REQUEST_CONS_OVERFLOW provides two arguments - more on this later. For example, if we were to enter this function with these values: blk_rings->common.sring->req_prod = X+31415 (X is the value from the last time __do_block_io_op was called). blk_rings->common.req_cons = X blk_rings->common.rsp_prod_pvt = X The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons) is doing: req_cons - rsp_prod_pvt >= 32 Which is, X - X >= 32 or 0 >= 32 And that is false, so we continue on looping (this bug). If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp instead (sring->req_prod) of rc, the this macro can do the check: req_prod - rsp_prov_pvt >= 32 Which is, X + 31415 - X >= 32 , or 31415 >= 32 which is true, so we can error out and break out of the function. Unfortunatly the difference between rsp_prov_pvt and req_prod can be at 32 (which would error out in the macro). This condition exists when the backend is lagging behind with the responses and still has not finished responding to all of them (so make_response has not been called), and the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able to use said macro. Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does a simple check of: req_prod - rsp_prod_pvt > RING_SIZE And with the X values from above: X + 31415 - X > 32 Returns true. Also not that if the ring is full (which is where the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the same condition: X + 32 - X > 32 Which is false. Lets use that macro. Note that in v5 of this patchset the macro was different - we used an earlier version. Cc: stable@vger.kernel.org [v1: Move the check outside the loop] [v2: Add a pr_warn as suggested by David] [v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan] [v4: Move wake_up after kthread_stop as suggested by Jan] [v5: Use RING_REQUEST_PROD_OVERFLOW instead] [v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> gadsa
2013-06-13rbd: use the correct length for format 2 object namesJosh Durgin
Format 2 objects use 16 characters for the object name suffix to be able to express the full 64-bit range of object numbers. Format 1 images only use 12 characters for this. Using 12-character names for format 2 caused userspace and kernel rbd clients to read differently named objects, which made an image written by one client look empty to the other client. CC: stable@vger.kernel.org # 3.9+ Reported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-12Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "Outside of bcache (which really isn't super big), these are all few-liners. There are a few important fixes in here: - Fix blk pm sleeping when holding the queue lock - A small collection of bcache fixes that have been done and tested since bcache was included in this merge window. - A fix for a raid5 regression introduced with the bio changes. - Two important fixes for mtip32xx, fixing an oops and potential data corruption (or hang) due to wrong bio iteration on stacked devices." * 'for-linus' of git://git.kernel.dk/linux-block: scatterlist: sg_set_buf() argument must be in linear mapping raid5: Initialize bi_vcnt pktcdvd: silence static checker warning block: remove refs to XD disks from documentation blkpm: avoid sleep when holding queue lock mtip32xx: Correctly handle bio->bi_idx != 0 conditions mtip32xx: Fix NULL pointer dereference during module unload bcache: Fix error handling in init code bcache: clarify free/available/unused space bcache: drop "select CLOSURES" bcache: Fix incompatible pointer type warning
2013-06-12Merge branch 'akpm' (updates from Andrew Morton)Linus Torvalds
Merge misc fixes from Andrew Morton: "Bunch of fixes and one little addition to math64.h" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits) include/linux/math64.h: add div64_ul() mm: memcontrol: fix lockless reclaim hierarchy iterator frontswap: fix incorrect zeroing and allocation size for frontswap_map kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules() mm: migration: add migrate_entry_wait_huge() ocfs2: add missing lockres put in dlm_mig_lockres_handler mm/page_alloc.c: fix watermark check in __zone_watermark_ok() drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info() aio: fix io_destroy() regression by using call_rcu() rtc-at91rm9200: use shadow IMR on at91sam9x5 rtc-at91rm9200: add shadow interrupt mask rtc-at91rm9200: refactor interrupt-register handling rtc-at91rm9200: add configuration support rtc-at91rm9200: add match-table compile guard fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree cciss: fix broken mutex usage in ioctl audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel ...
2013-06-12cciss: fix broken mutex usage in ioctlStephen M. Cameron
If a new logical drive is added and the CCISS_REGNEWD ioctl is invoked (as is normal with the Array Configuration Utility) the process will hang as below. It attempts to acquire the same mutex twice, once in do_ioctl() and once in cciss_unlocked_open(). The BKL was recursive, the mutex isn't. Linux version 3.10.0-rc2 (scameron@localhost.localdomain) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri May 24 14:32:12 CDT 2013 [...] acu D 0000000000000001 0 3246 3191 0x00000080 Call Trace: schedule+0x29/0x70 schedule_preempt_disabled+0xe/0x10 __mutex_lock_slowpath+0x17b/0x220 mutex_lock+0x2b/0x50 cciss_unlocked_open+0x2f/0x110 [cciss] __blkdev_get+0xd3/0x470 blkdev_get+0x5c/0x1e0 register_disk+0x182/0x1a0 add_disk+0x17c/0x310 cciss_add_disk+0x13a/0x170 [cciss] cciss_update_drive_info+0x39b/0x480 [cciss] rebuild_lun_table+0x258/0x370 [cciss] cciss_ioctl+0x34f/0x470 [cciss] do_ioctl+0x49/0x70 [cciss] __blkdev_driver_ioctl+0x28/0x30 blkdev_ioctl+0x200/0x7b0 block_ioctl+0x3c/0x40 do_vfs_ioctl+0x89/0x350 SyS_ioctl+0xa1/0xb0 system_call_fastpath+0x16/0x1b This mutex usage was added into the ioctl path when the big kernel lock was removed. As it turns out, these paths are all thread safe anyway (or can easily be made so) and we don't want ioctl() to be single threaded in any case. Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mike Miller <mike.miller@hp.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "There is a pair of fixes for double-frees in the recent bundle for 3.10, a couple of fixes for long-standing bugs (sleep while atomic and an endianness fix), and a locking fix that can be triggered when osds are going down" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix cleanup in rbd_add() rbd: don't destroy ceph_opts in rbd_add() ceph: ceph_pagelist_append might sleep while atomic ceph: add cpu_to_le32() calls when encoding a reconnect capability libceph: must hold mutex for reset_changed_osds()
2013-06-11Merge branch 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds
Pull NVMe fixes from Matthew Wilcox. * 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvme: NVMe: Add MSI support NVMe: Use dma_set_mask() correctly Return the result from user admin command IOCTL even in case of failure NVMe: Do not cancel command multiple times NVMe: fix error return code in nvme_submit_bio_queue() NVMe: check for integer overflow in nvme_map_user_pages() MAINTAINERS: update NVM EXPRESS DRIVER file list NVMe: Fix a signedness bug in nvme_trans_modesel_get_mp NVMe: Remove redundant version.h header include
2013-06-07xen/blkback: Check device permissions before allowing OP_DISCARDKonrad Rzeszutek Wilk
We need to make sure that the device is not RO or that the request is not past the number of sectors we want to issue the DISCARD operation for. This fixes CVE-2013-2140. Cc: stable@vger.kernel.org Acked-by: Jan Beulich <JBeulich@suse.com> Acked-by: Ian Campbell <Ian.Campbell@citrix.com> [v1: Made it pr_warn instead of pr_debug] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-07xen/blkback: Use physical sector size for setupStefan Bader
Currently xen-blkback passes the logical sector size over xenbus and xen-blkfront sets up the paravirt disk with that logical block size. But newer drives usually have the logical sector size set to 512 for compatibility reasons and would show the actual sector size only in physical sector size. This results in the device being partitioned and accessed in dom0 with the correct sector size, but the guest thinks 512 bytes is the correct block size. And that results in poor performance. To fix this, blkback gets modified to pass also physical-sector-size over xenbus and blkfront to use both values to set up the paravirt disk. I did not just change the passed in sector-size because I am not sure having a bigger logical sector size than the physical one is valid (and that would happen if a newer dom0 kernel hits an older domU kernel). Also this way a domU set up before should still be accessible (just some tools might detect the unaligned setup). [v2: Make xenbus write failure non-fatal] [v3: Use xenbus_scanf instead of xenbus_gather] [v4: Rebased against segment changes] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-04xen-blkfront: Introduce a 'max' module parameter to alter the amount of ↵Konrad Rzeszutek Wilk
indirect segments. The max module parameter (by default 32) is the maximum number of segments that the frontend will negotiate with the backend for indirect descriptors. Higher value means more potential throughput but more memory usage. The backend picks the minimum of the frontend and its default backend value. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-31NVMe: Add MSI supportRamachandra Rao Gajula
Some devices only have support for MSI, not MSI-X. While MSI is more limited, it still provides better performance than line-based interrupts. Signed-off-by: Ramachandra Gajula <rama@fastorsystems.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-29Merge branch 'for-jens' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-linus Jiri writes: please pull from git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block.git for-jens to receive one pktcdvd fix. It fixes a highly theoretical issue with using. pktcdvd to work with media that'd be larger than 2TB :) But it's a correct. fix and makes static checkers shut up about improperly cleaning upper. 32bits.
2013-05-29pktcdvd: silence static checker warningDan Carpenter
Static checkers complain about widening the binary "not" operations here because sectors are u64 and "(pd)->settings.size" is unsigned int. It unintentionally clears the high 32 bits of the sector. This means that the driver won't work for devices with over 2TB of space. Since this is a DVD drive, we're unlikely to reach that limit, but we may as well silence the warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-28NVMe: Use dma_set_mask() correctlyMatthew Wilcox
In some circumstances setting a 64-bit DMA mask can fail, as explained in Documentation/DMA-API-HOWTO.txt. Use the recommended code sequence to set a 32-bit DMA mask if setting a 64-bit DMA mask fails. Reported-by: Chayan Biswas <Chayan.Biswas@sandisk.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-24drivers/block/brd.c: fix brd_lookup_page() raceBrian Behlendorf
The index on the page must be set before it is inserted in the radix tree. Otherwise there is a small race which can occur during lookup where the page can be found with the incorrect index. This will trigger the BUG_ON() in brd_lookup_page(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Chris Wedgwood <cw@f00f.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24drivers/block/xsysace.c: fix id with missing port-numberGernot Vormayr
If the port number is missing from the device-tree the device gets named xs` instead of xsa. This fixes the check for missing ids. Tested on ml507 board. Signed-off-by: Gernot Vormayr <gvormayr@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-23Return the result from user admin command IOCTL even in case of failureChayan Biswas
We copy the result to user if the command is completed from the controller even if it completes with failure (non-zero) status. A return status of < 0 indicates the command was not completed by the controller. The user application may expect the error code in the result field in case of failure. Signed-off-by: Chayan Biswas <Chayan.Biswas@sandisk.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-20virtio_blk: Add missing 'static' qualifiersJonghwan Choi
Add missing 'static' qualifiers Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-05-17rbd: fix cleanup in rbd_add()Alex Elder
Bjorn Helgaas pointed out that a recent commit introduced a use-after-free condition in an error path for rbd_add(). He correctly stated: I think b536f69a3a5 "rbd: set up devices only for mapped images" introduced a use-after-free error in rbd_add(): ... If rbd_dev_device_setup() returns an error, we call rbd_dev_image_release(), which ultimately kfrees rbd_dev. Then we call rbd_dev_destroy(), which references fields in the already-freed rbd_dev struct before kfreeing it again. The simple fix is to return the error code after the call to rbd_dev_image_release(). Closer examination revealed that there's no need to clean up rbd_opts in that function, so fix that too. Update some other comments that have also become out of date. Reported-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17rbd: don't destroy ceph_opts in rbd_add()Alex Elder
Whether rbd_client_create() successfully creates a new client or not, it takes responsibility for getting the ceph_opts structure it's passed destroyed. If successful, the structure becomes associated with the created client; if not, rbd_client_create() will destroy it. Previously, rbd_get_client() would call ceph_destroy_options() if rbd_get_client() failed, and that meant it got called twice. That led freeing various pointers more than once, which is never a good idea. This resolves: http://tracker.ceph.com/issues/4559 Cc: stable@vger.kernel.org # 3.8+ Reported-by: Dan van der Ster <dan@vanderster.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17NVMe: Do not cancel command multiple timesKeith Busch
Cancelling an already cancelled command does not do anything, so check the command context before cancelling it, continuing if had already been cancelled so we do not log the same problem every second if a device stops responding. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17NVMe: fix error return code in nvme_submit_bio_queue()Wei Yongjun
nvme_submit_flush_data() might overwrite the initialisation of the return value with 0, so move the -ENOMEM setting close to the usage. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17NVMe: check for integer overflow in nvme_map_user_pages()Dan Carpenter
You need to have CAP_SYS_ADMIN to trigger this overflow but it makes the static checkers complain so we should fix it. The worry is that "length" comes from copy_from_user() so we need to check that "length + offset" can't overflow. I also changed the min_t() cast to be unsigned instead of signed. Now that we cap "length" to INT_MAX it doesn't make a difference, but it's a little easier for reviewers to know that large values aren't cast to negative. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17NVMe: Fix a signedness bug in nvme_trans_modesel_get_mpVishal Verma
nvme_trans_modesel_get_mp() was defined with a unsigned return type, but can return signed values. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17NVMe: Remove redundant version.h header includeSachin Kamat
version.h header inclusion is not necessary as detected by checkversion.pl. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes from Sage Weil: "Yes, this is a much larger pull than I would like after -rc1. There are a few things included: - a few fixes for leaks and incorrect assertions - a few patches fixing behavior when mapped images are resized - handling for cloned/layered images that are flattened out from underneath the client The last bit was non-trivial, and there is some code movement and associated cleanup mixed in. This was ready and was meant to go in last week but I missed the boat on Friday. My only excuse is that I was waiting for an all clear from the testing and there were many other shiny things to distract me. Strictly speaking, handling the flatten case isn't a regression and could wait, so if you like we can try to pull the series apart, but Alex and I would much prefer to have it all in as it is a case real users will hit with 3.10." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (33 commits) rbd: re-submit flattened write request (part 2) rbd: re-submit write request for flattened clone rbd: re-submit read request for flattened clone rbd: detect when clone image is flattened rbd: reference count parent requests rbd: define parent image request routines rbd: define rbd_dev_unparent() rbd: don't release write request until necessary rbd: get parent info on refresh rbd: ignore zero-overlap parent rbd: support reading parent page data for writes rbd: fix parent request size assumption libceph: init sent and completed when starting rbd: kill rbd_img_request_get() rbd: only set up watch for mapped images rbd: set mapping read-only flag in rbd_add() rbd: support reading parent page data rbd: fix an incorrect assertion condition rbd: define rbd_dev_v2_header_info() rbd: get rid of trivial v1 header wrappers ...
2013-05-15mtip32xx: Correctly handle bio->bi_idx != 0 conditionsSam Bradshaw
Stacking drivers may append bvecs to existing bio's, resulting in non-zero bi_idx conditions. This patch counts the loops of bio_for_each_segment() rather than inheriting the bi_idx value to pass as a segment count to the hardware submission routine. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-15mtip32xx: Fix NULL pointer dereference during module unloadSam Bradshaw
An open file-handle to one or more of the driver exported debugfs nodes causes raciness in recursive removal during module unload; sometimes a stale parent dentry is dereferenced when more than 1 pci device is present. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-13rbd: re-submit flattened write request (part 2)Alex Elder
Add code to rbd_img_obj_exists_callback() to detect when a clone's parent image has disappeared, and re-submit the original write request in that case. Kill off some redundant assertions. This completes the resolution for: http://tracker.ceph.com/issues/3763 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: re-submit write request for flattened cloneAlex Elder
Add code to rbd_img_parent_read_full_callback() to detect when a clone's parent image has disappeared, and re-submit the original write request in that case. (See the previous commit for more reasoning about why this is appropriate.) Rename some variables in rbd_img_obj_parent_read_full_callback() to match the convention used in the previous patch. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: re-submit read request for flattened cloneAlex Elder
If a clone image gets flattened while a parent read request is underway, the original rbd object request needs to be resubmitted. The reason is that by the time we get the response to the parent read request, the data read from the parent may be out of date. In other words, we could see this sequence of events: rbd client parent image/osd ---------- ---------------- original object ENOENT; issue parent read respond to parent read child image flattened original image header refresh <--- original object written independently here parent read response received Add code to rbd_img_parent_read_callback() to detect when a clone's parent image has disappeared (as evidenced by its parent overlap becoming 0), and re-submit the original read request in that case. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: detect when clone image is flattenedAlex Elder
A format 2 clone image can be the subject of a "flatten" operation, during which all of its data gets "copied up" from its parent image, leaving the image fully populated. Once this is complete, the clone's association with the parent is abolished. Since this can occur when a clone is mapped, we need to detect when it has occurred and handle it accordingly. We know an image has been flattened when we know it at one time had a parent, but we have learned (via a "get_parent" object class method call) it no longer has one. There might be in-flight requests at the point we learn an image has been flattened, so we can't simply clean up parent data structures right away. Instead, we'll drop the initial parent reference when the parent has disappeared (rather than when the image gets destroyed), which will allow the last in-flight reference to clean things up when it's complete. We leverage the fact that a zero parent overlap renders an image effectively unlayered. We set the overlap to 0 at the point we detect the clone image has flattened, which allows the unlayered behavior to take effect immediately, while keeping other parent structures in place until in-flight requests to complete. This and the next few patches resolve: http://tracker.ceph.com/issues/3763 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: reference count parent requestsAlex Elder
Keep a reference count for uses of the parent information for an rbd device. An initial reference is set in rbd_img_request_create() if the target image has a parent (with non-zero overlap). Each image request for an image with a non-zero parent overlap gets another reference when it's created, and that reference is dropped when the request is destroyed. The initial reference is dropped when the image gets torn down. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: define parent image request routinesAlex Elder
Define rbd_parent_request_create() and rbd_parent_request_destroy() to handle the creation of parent image requests submitted for layered image objects. For simplicity, let rbd_img_request_put() handle dropping the reference to any image request (parent or not), and call whichever destructor is appropriate on the last put. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: define rbd_dev_unparent()Alex Elder
Define rbd_dev_unparent() to encapsulate cleaning up parent data structures from a layered rbd image. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: don't release write request until necessaryAlex Elder
Previously when a layered write was going to involve a copyup request, the original osd request was released before submitting the parent full-object read. The osd request for the copyup would then be allocated in rbd_img_obj_parent_read_full_callback(). Shortly we will be handling the event of mapped layered images getting flattened, and when that occurs we need to resubmit the original request. We therefore don't want to release the osd request until we really konw we're going to replace it--in the callback function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: get parent info on refreshAlex Elder
Get parent info for format 2 images on every refresh (rather than just during the initial probe). This will be needed to detect the disappearance of the parent image in the event a mapped image becomes unlayered (i.e., flattened). Avoid leaking the previous parent spec on the second and subsequent times this information is requested by dropping the previous one (if any) before updating it. (Also, extract the pool id into a local variable before assigning it into the parent spec.) Switch to using a non-zero parent overlap value rather than the existence of a parent (a non-null parent_spec pointer) to determine whether to mark a request layered. It will soon be possible for a layered image to become unlayered while a request is in flight. This means that the layered flag for an image request indicates that there was a non-zero parent overlap at the time the image request was created. The parent overlap can change thereafter, which may lead to special handling at request submission or completion time. This and the next several patches are related to: http://tracker.ceph.com/issues/3763 NOTE: If an error occurs while refreshing the parent info (i.e., requesting it after initial probe), the old parent info will persist. This is not really correct, and is a scenario that needs to be addressed. For now we'll assert that the failure mode is unlikely, but the issue has been documented in tracker issue 5040. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: ignore zero-overlap parentAlex Elder
An rbd clone image that has an overlap with its parent of 0 is effectively not a layered image at all. Detect this case and treat such an image as non-layered. Issue a warning to be sure the user knows what's going on. This resolves: http://tracker.ceph.com/issues/5028 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: support reading parent page data for writesAlex Elder
Currently, rbd_img_obj_parent_read_full() assumes the incoming object request contains bio data. But if a layered image is part of a multi-layer stack of images it will result in read requests of page data to parent images. This is handling the same kind of issue as was resolved by this commit: 5b2ab72d rbd: support reading parent page data This resolves: http://tracker.ceph.com/issues/5027 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13rbd: fix parent request size assumptionAlex Elder
The code that reads object data from the parent for a copyup on write request currently assumes that the size of that request is the size of a "full" object from the original target image. That is not necessarily the case. The parent overlap could reduce the request size below that. To fix that assumption we need to record the number of pages in the copyup_pages array, for both an image request and an object request. Rename a local variable in rbd_img_obj_parent_read_full_callback() to reflect we're recording the length of the parent read request, not the size of the target object. This resolves: http://tracker.ceph.com/issues/5038 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-09Merge git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds
Pull NVMe driver update from Matthew Wilcox: "Lots of exciting new features in the NVM Express driver this time, including support for emulating SCSI commands, discard support and the ability to submit per-sector metadata with I/Os. It's still mostly bugfixes though!" * git://git.infradead.org/users/willy/linux-nvme: (27 commits) NVMe: Use user defined admin ioctl timeout NVMe: Simplify Firmware Activate code slightly NVMe: Only clear the enable bit when disabling controller NVMe: Wait for device to acknowledge shutdown NVMe: Schedule timeout for sync commands NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO NVMe: Device specific stripe size handling NVMe: Split non-mergeable bio requests NVMe: Remove dead code in nvme_dev_add NVMe: Check for NULL memory in nvme_dev_add NVMe: Fix error clean-up on nvme_alloc_queue NVMe: Free admin queue on request_irq error NVMe: Add scsi unmap to SG_IO NVMe: queue usage fixes in nvme-scsi NVMe: Set TASK_INTERRUPTIBLE before processing queues NVMe: Add a character device for each nvme device NVMe: Fix endian-related problems in user I/O submission path NVMe: Fix I/O cancellation status on big-endian machines NVMe: Fix sparse warnings in scsi emulation NVMe: Don't fail initialisation unnecessarily ...
2013-05-09NVMe: Use user defined admin ioctl timeoutKeith Busch
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-08rbd: kill rbd_img_request_get()Alex Elder
Get rid of rbd_img_request_get(), because it isn't used, and maybe won't ever be needed. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08rbd: only set up watch for mapped imagesAlex Elder
Any changes to parent images are immaterial to any mapped clone. So there is no need to have a watch event registered on header objects except for the header object of an image that is mapped. In fact, a watch request is a write operation, and we may only have read access to a parent image. We can't set up the watch request until we know the name of the header object though. So pass a flag to rbd_dev_image_probe() to indicate whether this probe is for a mapping or for a parent image. Change the second parameter to rbd_dev_header_watch_sync() be Boolean while we're at it. This resolves: http://tracker.ceph.com/issues/4941 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08rbd: set mapping read-only flag in rbd_add()Alex Elder
The rbd_dev->mapping field for a parent image is not meaningful. Since rbd_image_probe() is used both for images being mapped and their parents, it doesn't make sense to set that flag in that function. So move the setting of the mapping.read_only flag out of rbd_dev_image_probe() and into rbd_add() instead. This resolves: http://tracker.ceph.com/issues/4940 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>