summaryrefslogtreecommitdiff
path: root/drivers/nvdimm
AgeCommit message (Collapse)Author
2017-07-17libnvdimm: fix badblock range handling of ARS rangeToshi Kani
__add_badblock_range() does not account sector alignment when it sets 'num_sectors'. Therefore, an ARS error record range spanning across two sectors is set to a single sector length, which leaves the 2nd sector unprotected. Change __add_badblock_range() to set 'num_sectors' properly. Cc: <stable@vger.kernel.org> Fixes: 0caeef63e6d2 ("libnvdimm: Add a poison list and export badblocks") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-07-11Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "This is a followup for block changes, that didn't make the initial pull request. It's a bit of a mixed bag, this contains: - A followup pull request from Sagi for NVMe. Outside of fixups for NVMe, it also includes a series for ensuring that we properly quiesce hardware queues when browsing live tags. - Set of integrity fixes from Dmitry (mostly), fixing various issues for folks using DIF/DIX. - Fix for a bug introduced in cciss, with the req init changes. From Christoph. - Fix for a bug in BFQ, from Paolo. - Two followup fixes for lightnvm/pblk from Javier. - Depth fix from Ming for blk-mq-sched. - Also from Ming, performance fix for mtip32xx that was introduced with the dynamic initialization of commands" * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits) block: call bio_uninit in bio_endio nvmet: avoid unneeded assignment of submit_bio return value nvme-pci: add module parameter for io queue depth nvme-pci: compile warnings in nvme_alloc_host_mem() nvmet_fc: Accept variable pad lengths on Create Association LS nvme_fc/nvmet_fc: revise Create Association descriptor length lightnvm: pblk: remove unnecessary checks lightnvm: pblk: control I/O flow also on tear down cciss: initialize struct scsi_req null_blk: fix error flow for shared tags during module_init block: Fix __blkdev_issue_zeroout loop nvme-rdma: unconditionally recycle the request mr nvme: split nvme_uninit_ctrl into stop and uninit virtio_blk: quiesce/unquiesce live IO when entering PM states mtip32xx: quiesce request queues to make sure no submissions are inflight nbd: quiesce request queues to make sure no submissions are inflight nvme: kick requeue list when requeueing a request instead of when starting the queues nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues ...
2017-07-07Merge tag 'libnvdimm-for-4.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "libnvdimm updates for the latest ACPI and UEFI specifications. This pull request also includes new 'struct dax_operations' enabling to undo the abuse of copy_user_nocache() for copy operations to pmem. The dax work originally missed 4.12 to address concerns raised by Al. Summary: - Introduce the _flushcache() family of memory copy helpers and use them for persistent memory write operations on x86. The _flushcache() semantic indicates that the cache is either bypassed for the copy operation (movnt) or any lines dirtied by the copy operation are written back (clwb, clflushopt, or clflush). - Extend dax_operations with ->copy_from_iter() and ->flush() operations. These operations and other infrastructure updates allow all persistent memory specific dax functionality to be pushed into libnvdimm and the pmem driver directly. It also allows dax-specific sysfs attributes to be linked to a host device, for example: /sys/block/pmem0/dax/write_cache - Add support for the new NVDIMM platform/firmware mechanisms introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace label format, extensions to the address-range-scrub command set, new error injection commands, and a new BTT (block-translation-table) layout. These updates support inter-OS and pre-OS compatibility. - Fix a longstanding memory corruption bug in nfit_test. - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2) capable. - Miscellaneous fixes and small updates across libnvdimm and the nfit driver. Acknowledgements that came after the branch was pushed: commit 6aa734a2f38e ("libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits) libnvdimm, namespace: record 'lbasize' for pmem namespaces acpi/nfit: Issue Start ARS to retrieve existing records libnvdimm: New ACPI 6.2 DSM functions acpi, nfit: Show bus_dsm_mask in sysfs libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru. acpi, nfit: Enable DSM pass thru for root functions. libnvdimm: passthru functions clear to send libnvdimm, btt: convert some info messages to warn/err libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime libnvdimm: fix the clear-error check in nsio_rw_bytes libnvdimm, btt: fix btt_rw_page not returning errors acpi, nfit: quiet invalid block-aperture-region warnings libnvdimm, btt: BTT updates for UEFI 2.7 format acpi, nfit: constify *_attribute_group libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region libnvdimm, pmem, dax: export a cache control attribute dax: convert to bitmask for flags dax: remove default copy_from_iter fallback libnvdimm, nfit: enable support for volatile ranges libnvdimm, pmem: fix persistence warning ...
2017-07-03Merge branch 'for-4.13/dax' into libnvdimm-for-nextDan Williams
2017-07-03libnvdimm, namespace: record 'lbasize' for pmem namespacesDan Williams
Commit f979b13c3cc5 "libnvdimm, label: honor the lba size specified in v1.2 labels") neglected to update the 'lbasize' in the label when the namespace sector_size attribute was written. We need this value in the label for inter-OS / pre-OS compatibility. Fixes: f979b13c3cc5 ("libnvdimm, label: honor the lba size specified in v1.2 labels") Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-07-03block: guard bvec iteration logicDmitry Monakhov
Currently if some one try to advance bvec beyond it's size we simply dump WARN_ONCE and continue to iterate beyond bvec array boundaries. This simply means that we endup dereferencing/corrupting random memory region. Sane reaction would be to propagate error back to calling context But bvec_iter_advance's calling context is not always good for error handling. For safity reason let truncate iterator size to zero which will break external iteration loop which prevent us from unpredictable memory range corruption. And even it caller ignores an error, it will corrupt it's own bvecs, not others. This patch does: - Return error back to caller with hope that it will react on this - Truncate iterator size Code was added long time ago here 4550dd6c, luckily no one hit it in real life :) Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [hch: switch to true/false returns instead of errno values] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03bio-integrity: fold bio_integrity_enabled to bio_integrity_prepDmitry Monakhov
Currently all integrity prep hooks are open-coded, and if prepare fails we ignore it's code and fail bio with EIO. Let's return real error to upper layer, so later caller may react accordingly. In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled, so it is reasonable to fold it in to one function. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [hch: merged with the latest block tree, return bool from bio_integrity_prep] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-01libnvdimm: passthru functions clear to sendJerry Hoemann
Have dsm functions called via the pass thru mechanism also be checked against clear to send. Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-07-01libnvdimm, btt: convert some info messages to warn/errVishal Verma
Some critical messages such as IO errors, metadata failures were printed with dev_info. Make them louder by upgrading them to dev_warn or dev_error. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-30libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetimeDan Williams
We need to hold a reference on the 'dirent' until we are sure there are no more notifications that will be sent. As noted in the new comments we take advantage of the fact that the references are taken and dropped under device_lock() and that nd_device_notify() holds device_lock() over new badblocks notifications. The notifications that happen when badblocks are cleared only occur while the device is active. Also take the opportunity to fix up the error messages to report the user visible effect of a sysfs_get_dirent() failure. Fixes: 975750a98c26 ("libnvdimm, pmem: Add sysfs notifications to badblocks") Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-30libnvdimm: fix the clear-error check in nsio_rw_bytesVishal Verma
A leftover from the 'bandaid' fix that disabled BTT error clearing in rw_bytes resulted in an incorrect check. After we converted these checks over to use the NVDIMM_IO_ATOMIC flag, the ndns->claim check was both redundant, and incorrect. Remove it. Fixes: 3ae3d67ba705 ("libnvdimm: add an atomic vs process context flag to rw_bytes") Cc: <stable@vger.kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-29libnvdimm, btt: fix btt_rw_page not returning errorsVishal Verma
btt_rw_page was not propagating errors frm btt_do_bvec, resulting in any IO errors via the rw_page path going unnoticed. the pmem driver recently fixed this in e10624f pmem: fail io-requests to known bad blocks but same problem in BTT went neglected. Fixes: 5212e11fde4d ("nd_btt: atomic sector updates") Cc: <stable@vger.kernel.org> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-29acpi, nfit: quiet invalid block-aperture-region warningsDan Williams
This state is already visible by userspace since the BLK region will not be enabled, and it is otherwise benign as it usually indicates that the DIMM is not configured. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-29libnvdimm, btt: BTT updates for UEFI 2.7 formatVishal Verma
The UEFI 2.7 specification defines an updated BTT metadata format, bumping the revision to 2.0. Add support for the new format, while retaining compatibility for the old 1.1 format. Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Linda Knippers <linda.knippers@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-29libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile regionDan Williams
The pmem driver attaches to both persistent and volatile memory ranges advertised by the ACPI NFIT. When the region is volatile it is redundant to spend cycles flushing caches at fsync(). Check if the hosting region is volatile and do not set dax_write_cache() if it is. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-29libnvdimm, pmem, dax: export a cache control attributeDan Williams
The dax_flush() operation can be turned into a nop on platforms where firmware arranges for cpu caches to be flushed on a power-fail event. The ACPI 6.2 specification defines a mechanism for the platform to indicate this capability so the kernel can select the proper default. However, for other platforms, the administrator must toggle this setting manually. Given this flush setting is a dax-specific mechanism we advertise it through a 'dax' attribute group hanging off a host device. For example, a 'pmem0' block-device gets a 'dax' sysfs-subdirectory with a 'write_cache' attribute to control response to dax cache flush requests. This is similar to the 'queue/write_cache' attribute that appears under block devices. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-27libnvdimm, nfit: enable support for volatile rangesDan Williams
Allow volatile nfit ranges to participate in all the same infrastructure provided for persistent memory regions. A resulting resulting namespace device will still be called "pmem", but the parent region type will be "nd_volatile". This is in preparation for disabling the dax ->flush() operation in the pmem driver when it is hosted on a volatile range. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-27libnvdimm, pmem: fix persistence warningDan Williams
The pmem driver assumes if platform firmware describes the memory devices associated with a persistent memory range and CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to flush data to a power-fail safe zone. We warn if the firmware does not describe memory devices, but we also need to warn if the architecture does not claim pmem support. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-27x86, libnvdimm, pmem: remove global pmem apiDan Williams
Now that all callers of the pmem api have been converted to dax helpers that call back to the pmem driver, we can remove include/linux/pmem.h and asm/pmem.h. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Oliver O'Halloran <oohall@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-27x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimmDan Williams
Kill this globally defined wrapper and move to libnvdimm so that we can ultimately remove include/linux/pmem.h and asm/pmem.h. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-27block: don't bother with bounce limits for make_request driversChristoph Hellwig
We only call blk_queue_bounce for request-based drivers, so stop messing with it for make_request based drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-15x86, dax, libnvdimm: remove wb_cache_pmem() indirectionDan Williams
With all handling of the CONFIG_ARCH_HAS_PMEM_API case being moved to libnvdimm and the pmem driver directly we do not need to provide global wrappers and fallbacks in the CONFIG_ARCH_HAS_PMEM_API=n case. The pmem driver will simply not link to arch_wb_cache_pmem() in that case. Same as before, pmem flushing is only defined for x86_64, via clean_cache_range(), but it is straightforward to add other archs in the future. arch_wb_cache_pmem() is an exported function since the pmem module needs to find it, but it is privately declared in drivers/nvdimm/pmem.h because there are no consumers outside of the pmem driver. Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oliver O'Halloran <oohall@gmail.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15dax, pmem: introduce an optional 'flush' dax_operationDan Williams
Filesystem-DAX flushes caches whenever it writes to the address returned through dax_direct_access() and when writing back dirty radix entries. That flushing is only required in the pmem case, so add a dax operation to allow pmem to take this extra action, but skip it for other dax capable devices that do not provide a flush routine. An example for this differentiation might be a volatile ram disk where there is no expectation of persistence. In fact the pmem driver itself might front such an address range specified by the NFIT. So, this "no flush" property might be something passed down by the bus / libnvdimm. Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, pmem: Add sysfs notifications to badblocksToshi Kani
Sysfs "badblocks" information may be updated during run-time that: - MCE, SCI, and sysfs "scrub" may add new bad blocks - Writes and ioctl() may clear bad blocks Add support to send sysfs notifications to sysfs "badblocks" file under region and pmem directories when their badblocks information is re-evaluated (but is not necessarily changed) during run-time. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Linda Knippers <linda.knippers@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: switch to using v1.2 labels by defaultDan Williams
The rules for which version of the label specification are in effect at any given point in time are as follows: 1/ If a DIMM has an existing / valid index block then the version specified is used regardless if it is a previous version. 2/ By default when the kernel is initializing new index blocks the latest specification version (v1.2 at time of writing) is used. 3/ An environment that wants to force create v1.1 label-sets must arrange for userspace to disable all active regions / namespaces / dimms and write a valid set of v1.1 index blocks to the dimms. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: add address abstraction identifiersDan Williams
Starting with v1.2 labels, 'address abstractions' can be hinted via an address abstraction id that implies an info-block format. The standard address abstraction in the specification is the v2 format of the Block-Translation-Table (BTT). Support for that is saved for a later patch, for now we add support for the Linux supported address abstractions BTT (v1), PFN, and DAX. The new 'holder_class' attribute for namespace devices is added for tooling to specify the 'abstraction_guid' to store in the namespace label. For v1.1 labels this field is undefined and any setting of 'holder_class' away from the default 'none' value will only have effect until the driver is unloaded. Setting 'holder_class' requires that whatever device tries to claim the namespace must be of the specified class. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: add v1.2 label checksum supportDan Williams
The v1.2 namespace label specification adds a fletcher checksum to each label instance. Add generation and validation support for the new field. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: update 'nlabel' and 'position' handling for local namespacesDan Williams
The v1.2 namespace label specification requires 'nlabel' and 'position' to be valid for the first ("lowest dpa") label in the set. It also requires all non-first labels to set those fields to 0xff. Linux does not much care if these values are correct, because we can just trust the count of labels with the matching uuid like the v1.1 case. However, we set them correctly in case other environments care. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: populate 'isetcookie' for blk-aperture namespacesDan Williams
Starting with the v1.2 definition of namespace labels, the isetcookie field is populated and validated for blk-aperture namespaces. This adds some safety against inadvertent copying of namespace labels from one DIMM-device to another. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: populate the type_guid property for v1.2 namespacesDan Williams
The type_guid refers to the "Address Range Type GUID" for the region backing a namespace as defined the ACPI NFIT (NVDIMM Firmware Interface Table). This 'type' identifier specifies an access mechanism for the given namespace. This capability replaces the confusing usage of the 'NSLABEL_FLAG_LOCAL' flag to indicate a block-aperture-mode namespace. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: honor the lba size specified in v1.2 labelsDan Williams
Previously we only honored the lba size for blk-aperture mode namespaces. For pmem namespaces the lba size was just assumed to be 512. With the new v1.2 label definition and compatibility with other operating environments, the ->lbasize property is now respected for pmem namespaces. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: add v1.2 interleave-set-cookie algorithmDan Williams
The interleave-set-cookie algorithm is extended to incorporate all the same components that are used to generate an nvdimm unique-id. For backwards compatibility we still maintain the old v1.1 definition. Reported-by: Nicholas Moulin <nicholas.w.moulin@intel.com> Reported-by: Kaushik Kanetkar <kaushik.a.kanetkar@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, label: add v1.2 nvdimm label definitionsDan Williams
In support of improved interoperability between operating systems and pre-boot environments the Intel proposed NVDIMM Namespace Specification [1], has been adopted and modified to the the UEFI 2.7 NVDIMM Label Protocol [2]. Update the definitions of the namespace label data structures so that the new format can be supported alongside the existing label format. The new specification changes the default label size to 256 bytes, so everywhere that relied on sizeof(struct nd_namespace_label) must now use the sizeof_namespace_label() helper. There should be no functional differences from these changes as the default is still the v1.1 128-byte format. Future patches will move the default to the v1.2 definition. [1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf [2]: http://www.uefi.org/sites/default/files/resources/UEFI_Spec_2_7.pdf Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-13Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into ↵Christoph Hellwig
nvme-base
2017-06-09x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass ↵Dan Williams
operations The pmem driver has a need to transfer data with a persistent memory destination and be able to rely on the fact that the destination writes are not cached. It is sufficient for the writes to be flushed to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync() to ensure data-writes have reached a power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn around and fence previous writes with an "sfence". Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and memcpy_flushcache, that guarantee that the destination buffer is not dirty in the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines will be used to replace the "pmem api" (include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache() and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE config symbol, and fallback to copy_from_iter_nocache() and plain memcpy() otherwise. This is meant to satisfy the concern from Linus that if a driver wants to do something beyond the normal nocache semantics it should be something private to that driver [1], and Al's concern that anything uaccess related belongs with the rest of the uaccess code [2]. The first consumer of this interface is a new 'copy_from_iter' dax operation so that pmem can inject cache maintenance operations without imposing this overhead on other dax-capable drivers. [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-09block: switch bios to blk_status_tChristoph Hellwig
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-05uuid: hoist uuid_is_null() helper from libnvdimmChristoph Hellwig
Hoist the libnvdimm helper as an inline helper to linux/uuid.h using an auxiliary const variable uuid_null in lib/uuid.c. [hch: also add the guid variant. Both do the same but I'd like to keep casts to a minimum] The common helper uses the new abstract type uuid_t * instead of u8 *. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Amir Goldstein <amir73il@gmail.com> [hch: added guid_is_null] Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-05-12Merge branch 'libnvdimm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "Incremental fixes and a small feature addition on top of the main libnvdimm 4.12 pull request: - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX. The size regression is fixed by moving all dax helpers into the dax-core and only specifying "select DAX" for FS_DAX and dax-capable drivers. He also asked for clarification of the NR_DEV_DAX config option which, on closer look, does not need to be a config option at all. Mike also throws in a DEV_DAX_PMEM fixup for good measure. - Ben's attention to detail on -stable patch submissions caught a case where the recent fixes to arch_copy_from_iter_pmem() missed a condition where we strand dirty data in the cache. This is tagged for -stable and will also be included in the rework of the pmem api to a proposed {memcpy,copy_user}_flushcache() interface for 4.13. - Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace. - Ross noticed that the dax_device + dax_operations conversion broke __dax_zero_page_range(). The nvdimm unit tests fail to check this path, but xfstests immediately trips over it. No excuse for missing this before submitting the 4.12 pull request. These all pass the nvdimm unit tests and an xfstests spot check. The set has received a build success notification from the kbuild robot" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: filesystem-dax: fix broken __dax_zero_page_range() conversion libnvdimm, btt: ensure that initializing metadata clears poison libnvdimm: add an atomic vs process context flag to rw_bytes x86, pmem: Fix cache flushing for iovec write < 8 bytes device-dax: kill NR_DEV_DAX block, dax: move "select DAX" from BLOCK to FS_DAX device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-10libnvdimm, btt: ensure that initializing metadata clears poisonVishal Verma
If we had badblocks/poison in the metadata area of a BTT, recreating the BTT would not clear the poison in all cases, notably the flog area. This is because rw_bytes will only clear errors if the request being sent down is 512B aligned and sized. Make sure that when writing the map and info blocks, the rw_bytes being sent are of the correct size/alignment. For the flog, instead of doing the smaller log_entry writes only, first do a 'wipe' of the entire area by writing zeroes in large enough chunks so that errors get cleared. Cc: Andy Rudoff <andy.rudoff@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-10libnvdimm: add an atomic vs process context flag to rw_bytesVishal Verma
nsio_rw_bytes can clear media errors, but this cannot be done while we are in an atomic context due to locking within ACPI. From the BTT, ->rw_bytes may be called either from atomic or process context depending on whether the calls happen during initialization or during IO. During init, we want to ensure error clearing happens, and the flag marking process context allows nsio_rw_bytes to do that. When called during IO, we're in atomic context, and error clearing can be skipped. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-08treewide: use kv[mz]alloc* rather than opencoded variantsMichal Hocko
There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-05Merge tag 'libnvdimm-for-4.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has been in multiple -next releases. There were a few late breaking fixes and small features that got added in the last couple days, but the whole set has received a build success notification from the kbuild robot. Change summary: - Region media error reporting: A libnvdimm region device is the parent to one or more namespaces. To date, media errors have been reported via the "badblocks" attribute attached to pmem block devices for namespaces in "raw" or "memory" mode. Given that namespaces can be in "device-dax" or "btt-sector" mode this new interface reports media errors generically, i.e. independent of namespace modes or state. This subsequently allows userspace tooling to craft "ACPI 6.1 Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and submit them via the ioctl path for NVDIMM root bus devices. - Introduce 'struct dax_device' and 'struct dax_operations': Prompted by a request from Linus and feedback from Christoph this allows for dax capable drivers to publish their own custom dax operations. This fixes the broken assumption that all dax operations are related to a persistent memory device, and makes it easier for other architectures and platforms to add customized persistent memory support. - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is available for storage appliance applications to manually trigger memory controllers to drain write-pending buffers that would otherwise be flushed automatically by the platform ADR (asynchronous-DRAM-refresh) mechanism at a power loss event. Support for "locked" DIMMs is included to prevent namespaces from surfacing when the namespace label data area is locked. Finally, fixes for various reported deadlocks and crashes, also tagged for -stable. - ACPI / nfit driver updates: General updates of the nfit driver to add DSM command overrides, ACPI 6.1 health state flags support, DSM payload debug available by default, and various fixes. Acknowledgements that came after the branch was pushed: - commmit 565851c972b5 "device-dax: fix sysfs attribute deadlock": Tested-by: Yi Zhang <yizhan@redhat.com> - commit 23f498448362 "libnvdimm: rework region badblocks clearing" Tested-by: Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits) libnvdimm, pfn: fix 'npfns' vs section alignment libnvdimm: handle locked label storage areas libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED brd: fix uninitialized use of brd->dax_dev block, dax: use correct format string in bdev_dax_supported device-dax: fix sysfs attribute deadlock libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering libnvdimm: rework region badblocks clearing acpi, nfit: kill ACPI_NFIT_DEBUG libnvdimm: fix clear length of nvdimm_forget_poison() libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify libnvdimm, region: sysfs trigger for nvdimm_flush() libnvdimm: fix phys_addr for nvdimm_clear_poison x86, dax, pmem: remove indirection around memcpy_from_pmem() block: remove block_device_operations ->direct_access() block, dax: convert bdev_dax_supported() to dax_direct_access() filesystem-dax: convert to dax_direct_access() Revert "block: use DAX for partition table reads" ext2, ext4, xfs: retrieve dax_device for iomap operations ...
2017-05-04Merge branch 'for-4.12/dax' into libnvdimm-for-nextDan Williams
2017-05-04libnvdimm, pfn: fix 'npfns' vs section alignmentDan Williams
Fix failures to create namespaces due to the vmem_altmap not advertising enough free space to store the memmap. WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0 [..] Call Trace: dump_stack+0x63/0x83 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 arch_add_memory+0xde/0xf0 devm_memremap_pages+0x244/0x440 pmem_attach_disk+0x37e/0x490 [nd_pmem] nd_pmem_probe+0x7e/0xa0 [nd_pmem] nvdimm_bus_probe+0x71/0x120 [libnvdimm] driver_probe_device+0x2bb/0x460 bind_store+0x114/0x160 drv_attr_store+0x25/0x30 In commit 658922e57b84 "libnvdimm, pfn: fix memmap reservation sizing" we arranged for the capacity to be allocated, but failed to also update the 'npfns' parameter. This leads to cases where there is enough capacity reserved to hold all the allocated sections, but vmemmap_populate_hugepages() still encounters -ENOMEM from altmap_alloc_block_buf(). This fix is a stop-gap until we can teach the core memory hotplug implementation to permit sub-section hotplug. Cc: <stable@vger.kernel.org> Fixes: 658922e57b84 ("libnvdimm, pfn: fix memmap reservation sizing") Reported-by: Anisha Allada <anisha.allada@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-04libnvdimm: handle locked label storage areasDan Williams
Per the latest version of the "NVDIMM DSM Interface Example" [1], the label data retrieval routine can report a "locked" status. In this case all regions associated with that DIMM are disabled until the label area is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label data area locking capabilities. [1]: http://pmem.io/documents/ Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-04libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKEDDan Williams
This is a preparation patch for handling locked nvdimm label regions, a new concept as introduced by the latest DSM document on pmem.io [1]. A future patch will leverage nvdimm_set_locked() at DIMM probe time to flag regions that can not be enabled. There should be no functional difference resulting from this change. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-01Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main x86 MM changes in this cycle were: - continued native kernel PCID support preparation patches to the TLB flushing code (Andy Lutomirski) - various fixes related to 32-bit compat syscall returning address over 4Gb in applications, launched from 64-bit binaries - motivated by C/R frameworks such as Virtuozzo. (Dmitry Safonov) - continued Intel 5-level paging enablement: in particular the conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov) - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel) - ... plus misc updates, fixes and cleanups" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash x86/mm: Fix flush_tlb_page() on Xen x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm/64: Fix crash in remove_pagetable() Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" x86/boot/e820: Remove a redundant self assignment x86/mm: Fix dump pagetables for 4 levels of page tables x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()" x86/espfix: Add support for 5-level paging x86/kasan: Extend KASAN to support 5-level paging x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y x86/paravirt: Add 5-level support to the paravirt code x86/mm: Define virtual memory map for 5-level paging x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert x86/boot: Detect 5-level paging support x86/mm/numa: Remove numa_nodemask_from_meminfo() ...
2017-05-01libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"Dan Williams
This continues the 4.11 status quo of disabling of error clearing from the BTT I/O path. Toshi found that even though we have eliminated all the libnvdimm sources of sleeping-while-atomic triggers, we still have sleeping operations that will occur in the path to send the ACPI DSM to the DIMM to clear the error: BUG: sleeping function called from invalid context at mm/slab.h:432 in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd Call Trace: dump_stack+0x86/0xc3 ___might_sleep+0x17d/0x250 __might_sleep+0x4a/0x80 __kmalloc+0x1c0/0x2e0 acpi_os_allocate_zeroed+0x2d/0x2f acpi_evaluate_object+0x59/0x3b1 acpi_evaluate_dsm+0xbd/0x10c acpi_nfit_ctl+0x1ef/0x7c0 [nfit] ? nsio_rw_bytes+0x152/0x280 nvdimm_clear_poison+0x77/0x140 nsio_rw_bytes+0x18f/0x280 btt_write_pg+0x1d4/0x3d0 [nd_btt] btt_make_request+0x119/0x2d0 [nd_btt] A solution for tracking and handling media errors natively in the BTT is needed. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-01libnvdimm: fix nvdimm_bus_lock() vs device_lock() orderingDan Williams
A debug patch to turn the standard device_lock() into something that lockdep can analyze yielded the following: ====================================================== [ INFO: possible circular locking dependency detected ] 4.11.0-rc4+ #106 Tainted: G O ------------------------------------------------------- lt-libndctl/1898 is trying to acquire lock: (&dev->nvdimm_mutex/3){+.+.+.}, at: [<ffffffffc023c948>] nd_attach_ndns+0x178/0x1b0 [libnvdimm] but task is already holding lock: (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffc022e0b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}: lock_acquire+0xf6/0x1f0 __mutex_lock+0x88/0x980 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm] nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm] nd_pmem_probe+0x14/0x180 [nd_pmem] nvdimm_bus_probe+0xa9/0x260 [libnvdimm] -> #0 (&dev->nvdimm_mutex/3){+.+.+.}: __lock_acquire+0x1107/0x1280 lock_acquire+0xf6/0x1f0 __mutex_lock+0x88/0x980 mutex_lock_nested+0x1b/0x20 nd_attach_ndns+0x178/0x1b0 [libnvdimm] nd_namespace_store+0x308/0x3c0 [libnvdimm] namespace_store+0x87/0x220 [libnvdimm] In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'. Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect nd_{attach,detach}_ndns() operations. Cc: <stable@vger.kernel.org> Fixes: 8c2f7e8658df ("libnvdimm: infrastructure for btt devices") Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-01mm, zone_device: Replace {get, put}_zone_device_page() with a single ↵Dan Williams
reference to fix pmem crash The x86 conversion to the generic GUP code included a small change which causes crashes and data corruption in the pmem code - not good. The root cause is that the /dev/pmem driver code implicitly relies on the x86 get_user_pages() implementation doing a get_page() on the page refcount, because get_page() does a get_zone_device_page() which properly refcounts pmem's separate page struct arrays that are not present in the regular page struct structures. (The pmem driver does this because it can cover huge memory areas.) But the x86 conversion to the generic GUP code changed the get_page() to page_cache_get_speculative() which is faster but doesn't do the get_zone_device_page() call the pmem code relies on. One way to solve the regression would be to change the generic GUP code to use get_page(), but that would slow things down a bit and punish other generic-GUP using architectures for an x86-ism they did not care about. (Arguably the pmem driver was probably not working reliably for them: but nvdimm is an Intel feature, so non-x86 exposure is probably still limited.) So restructure the pmem code's interface with the MM instead: get rid of the get/put_zone_device_page() distinction, integrate put_zone_device_page() into __put_page() and and restructure the pmem completion-wait and teardown machinery: Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur *after* devm_memremap_page_release(), since it now maintains its own elevated reference. This speeds up things while also making the pmem refcounting more robust going forward. Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>