summaryrefslogtreecommitdiff
path: root/drivers/nvdimm
AgeCommit message (Collapse)Author
2020-05-27nvdimm: use bio_{start,end}_io_acctChristoph Hellwig
Switch dm to use the nicer bio accounting helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08Merge tag 'libnvdimm-for-5.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and dax updates from Dan Williams: "There were multiple touches outside of drivers/nvdimm/ this round to add cross arch compatibility to the devm_memremap_pages() interface, enhance numa information for persistent memory ranges, and add a zero_page_range() dax operation. This cycle I switched from the patchwork api to Konstantin's b4 script for collecting tags (from x86, PowerPC, filesystem, and device-mapper folks), and everything looks to have gone ok there. This has all appeared in -next with no reported issues. Summary: - Add support for region alignment configuration and enforcement to fix compatibility across architectures and PowerPC page size configurations. - Introduce 'zero_page_range' as a dax operation. This facilitates filesystem-dax operation without a block-device. - Introduce phys_to_target_node() to facilitate drivers that want to know resulting numa node if a given reserved address range was onlined. - Advertise a persistence-domain for of_pmem and papr_scm. The persistence domain indicates where cpu-store cycles need to reach in the platform-memory subsystem before the platform will consider them power-fail protected. - Promote numa_map_to_online_node() to a cross-kernel generic facility. - Save x86 numa information to allow for node-id lookups for reserved memory ranges, deploy that capability for the e820-pmem driver. - Pick up some miscellaneous minor fixes, that missed v5.6-final, including a some smatch reports in the ioctl path and some unit test compilation fixups. - Fixup some flexible-array declarations" * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits) dax: Move mandatory ->zero_page_range() check in alloc_dax() dax,iomap: Add helper dax_iomap_zero() to zero a range dax: Use new dax zero page method for zeroing a page dm,dax: Add dax zero_page_range operation s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver dax, pmem: Add a dax operation zero_page_range pmem: Add functions for reading/writing page to/from pmem libnvdimm: Update persistence domain value for of_pmem and papr_scm device tools/test/nvdimm: Fix out of tree build libnvdimm/region: Fix build error libnvdimm/region: Replace zero-length array with flexible-array member libnvdimm/label: Replace zero-length array with flexible-array member ACPI: NFIT: Replace zero-length array with flexible-array member libnvdimm/region: Introduce an 'align' attribute libnvdimm/region: Introduce NDD_LABELING libnvdimm/namespace: Enforce memremap_compat_align() libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid libnvdimm: Out of bounds read in __nd_ioctl() acpi/nfit: improve bounds checking for 'func' mm/memremap_pages: Introduce memremap_compat_align() ...
2020-04-02Merge branch 'for-5.7/libnvdimm' into libnvdimm-for-nextDan Williams
- Introduce 'zero_page_range' as a dax operation. This facilitates filesystem-dax operation without a block-device. - Advertise a persistence-domain for of_pmem and papr_scm. The persistence domain indicates where cpu-store cycles need to reach in the platform-memory subsystem before the platform will consider them power-fail protected. - Fixup some flexible-array declarations.
2020-04-02Merge branch 'for-5.7/numa' into libnvdimm-for-nextDan Williams
- Promote numa_map_to_online_node() to a cross-kernel generic facility. - Save x86 numa information to allow for node-id lookups for reserved memory ranges, deploy that capability for the e820-pmem driver. - Introduce phys_to_target_node() to facilitate drivers that want to know resulting numa node if a given reserved address range was onlined.
2020-04-02Merge branch 'for-5.6/libnvdimm-fixes' into libnvdimm-for-nextDan Williams
Pick up some miscellaneous minor fixes, that missed v5.6-final, including a some smatch reports in the ioctl path and some unit test compilation fixups.
2020-04-02dax: Move mandatory ->zero_page_range() check in alloc_dax()Vivek Goyal
zero_page_range() dax operation is mandatory for dax devices. Right now that check happens in dax_zero_page_range() function. Dan thinks that's too late and its better to do the check earlier in alloc_dax(). I also modified alloc_dax() to return pointer with error code in it in case of failure. Right now it returns NULL and caller assumes failure happened due to -ENOMEM. But with this ->zero_page_range() check, I need to return -EINVAL instead. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20200401161125.GB9398@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02dax, pmem: Add a dax operation zero_page_rangeVivek Goyal
Add a dax operation zero_page_range, to zero a page. This will also clear any known poison in the page being zeroed. As of now, zeroing of one page is allowed in a single call. There are no callers which are trying to zero more than a page in a single call. Once we grow the callers which zero more than a page in single call, we can add that support. Primary reason for not doing that yet is that this will add little complexity in dm implementation where a range might be spanning multiple underlying targets and one will have to split the range into multiple sub ranges and call zero_page_range() on individual targets. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: https://lore.kernel.org/r/20200228163456.1587-3-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02pmem: Add functions for reading/writing page to/from pmemVivek Goyal
This splits pmem_do_bvec() into pmem_do_read() and pmem_do_write(). pmem_do_write() will be used by pmem zero_page_range() as well. Hence sharing the same code. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: https://lore.kernel.org/r/20200228163456.1587-2-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-31libnvdimm: Update persistence domain value for of_pmem and papr_scm deviceAneesh Kumar K.V
Currently, kernel shows the below values "persistence_domain":"cpu_cache" "persistence_domain":"memory_controller" "persistence_domain":"unknown" "cpu_cache" indicates no extra instructions is needed to ensure the persistence of data in the pmem media on power failure. "memory_controller" indicates cpu cache flush instructions are required to flush the data. Platform provides mechanisms to automatically flush outstanding write data from memory controler to pmem on system power loss. Based on the above use memory_controller for non volatile regions on ppc64. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20200324034821.60869-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-31libnvdimm/region: Fix build errorYueHaibing
On CONFIG_PPC32=y build fails: drivers/nvdimm/region_devs.c:1034:14: note: in expansion of macro ‘do_div’ remainder = do_div(per_mapping, mappings); ^~~~~~ In file included from ./arch/powerpc/include/generated/asm/div64.h:1:0, from ./include/linux/kernel.h:18, from ./include/asm-generic/bug.h:19, from ./arch/powerpc/include/asm/bug.h:109, from ./include/linux/bug.h:5, from ./include/linux/scatterlist.h:7, from drivers/nvdimm/region_devs.c:5: ./include/asm-generic/div64.h:243:22: error: passing argument 1 of ‘__div64_32’ from incompatible pointer type [-Werror=incompatible-pointer-types] __rem = __div64_32(&(n), __base); \ Use div_u64 instead of do_div to fix this. Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20200331115024.31628-1-yuehaibing@huawei.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-30libnvdimm/region: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Link: https://lore.kernel.org/r/20200319230937.GA16648@embeddedor.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-30libnvdimm/label: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Link: https://lore.kernel.org/r/20200319230737.GA16452@embeddedor.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-27block: simplify queue allocationChristoph Hellwig
Current make_request based drivers use either blk_alloc_queue_node or blk_alloc_queue to allocate a queue, and then set up the make_request_fn function pointer and a few parameters using the blk_queue_make_request helper. Simplify this by passing the make_request pointer to blk_alloc_queue, and while at it merge the _node variant into the main helper by always passing a node_id, and remove the superfluous gfp_mask parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-17libnvdimm/region: Introduce an 'align' attributeDan Williams
The align attribute applies an alignment constraint for namespace creation in a region. Whereas the 'align' attribute of a namespace applied alignment padding via an info block, the 'align' attribute applies alignment constraints to the free space allocation. The default for 'align' is the maximum known memremap_compat_align() across all archs (16MiB from PowerPC at time of writing) multiplied by the number of interleave ways if there is blk-aliasing. The minimum is PAGE_SIZE and allows for the creation of cross-arch incompatible namespaces, just as previous kernels allowed, but the expectation is cross-arch and mode-independent compatibility by default. The regression risk with this change is limited to cases that were dependent on the ability to create unaligned namespaces, *and* for some reason are unable to opt-out of aligned namespaces by writing to 'regionX/align'. If such a scenario arises the default can be flipped from opt-out to opt-in of compat-aligned namespace creation, but that is a last resort. The kernel will otherwise continue to support existing defined misaligned namespaces. Unfortunately this change needs to touch several parts of the implementation at once: - region/available_size: expand busy extents to current align - region/max_available_extent: expand busy extents to current align - namespace/size: trim free space to current align ...to keep the free space accounting conforming to the dynamic align setting. Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-17libnvdimm/region: Introduce NDD_LABELINGDan Williams
The NDD_ALIASING flag is used to indicate where pmem capacity might alias with blk capacity and require labeling. It is also used to indicate whether the DIMM supports labeling. Separate this latter capability into its own flag so that the NDD_ALIASING flag is scoped to true aliased configurations. To my knowledge aliased configurations only exist in the ACPI spec, there are no known platforms that ship this support in production. This clarity allows namespace-capacity alignment constraints around interleave-ways to be relaxed. Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-17libnvdimm/namespace: Enforce memremap_compat_align()Dan Williams
The pmem driver on PowerPC crashes with the following signature when instantiating misaligned namespaces that map their capacity via memremap_pages(). BUG: Unable to handle kernel data access at 0xc001000406000000 Faulting instruction address: 0xc000000000090790 NIP [c000000000090790] arch_add_memory+0xc0/0x130 LR [c000000000090744] arch_add_memory+0x74/0x130 Call Trace: arch_add_memory+0x74/0x130 (unreliable) memremap_pages+0x74c/0xa30 devm_memremap_pages+0x3c/0xa0 pmem_attach_disk+0x188/0x770 nvdimm_bus_probe+0xd8/0x470 With the assumption that only memremap_pages() has alignment constraints, enforce memremap_compat_align() for pmem_should_map_pages(), nd_pfn, and nd_dax cases. This includes preventing the creation of namespaces where the base address is misaligned and cases there infoblock padding parameters are invalid. Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Jeff Moyer <jmoyer@redhat.com> Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces to section alignment") Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-03-17libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock validDan Williams
The EOPNOTSUPP return code from the pmem driver indicates that the namespace has a configuration that may be valid, but the current kernel does not support it. Expand this to all of the nd_pfn_validate() error conditions after the infoblock has been verified as self consistent. This prevents exposing the namespace to I/O when the infoblock needs to be corrected, or the system needs to be put into a different configuration (like changing the page size on PowerPC). Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-02-28libnvdimm: Out of bounds read in __nd_ioctl()Dan Carpenter
The "cmd" comes from the user and it can be up to 255. It it's more than the number of bits in long, it results out of bounds read when we check test_bit(cmd, &cmd_mask). The highest valid value for "cmd" is ND_CMD_CALL (10) so I added a compare against that. Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20200225162055.amtosfy7m35aivxg@kili.mountain Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-02-20mm/memremap_pages: Introduce memremap_compat_align()Dan Williams
The "sub-section memory hotplug" facility allows memremap_pages() users like libnvdimm to compensate for hardware platforms like x86 that have a section size larger than their hardware memory mapping granularity. The compensation that sub-section support affords is being tolerant of physical memory resources shifting by units smaller (64MiB on x86) than the memory-hotplug section size (128 MiB). Where the platform physical-memory mapping granularity is limited by the number and capability of address-decode-registers in the memory controller. While the sub-section support allows memremap_pages() to operate on sub-section (2MiB) granularity, the Power architecture may still require 16MiB alignment on "!radix_enabled()" platforms. In order for libnvdimm to be able to detect and manage this per-arch limitation, introduce memremap_compat_align() as a common minimum alignment across all driver-facing memory-mapping interfaces, and let Power override it to 16MiB in the "!radix_enabled()" case. The assumption / requirement for 16MiB to be a viable memremap_compat_align() value is that Power does not have platforms where its equivalent of address-decode-registers never hardware remaps a persistent memory resource on smaller than 16MiB boundaries. Note that I tried my best to not add a new Kconfig symbol, but header include entanglements defeated the #ifndef memremap_compat_align design pattern and the need to export it defeats the __weak design pattern for arch overrides. Based on an initial patch by Aneesh. Link: http://lore.kernel.org/r/CAPcyv4gBGNP95APYaBcsocEa50tQj9b5h__83vgngjq3ouGX_Q@mail.gmail.com Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-02-18libnvdimm/e820: Retrieve and populate correct 'target_node' infoDan Williams
Use the new phys_to_target_node() and numa_map_to_online_node() helpers to retrieve the correct id for the 'numa_node' ("local" / online initiator node) and 'target_node' (offline target memory node) sysfs attributes. Below is an example from a 4 NUMA node system where all the memory on node2 is pmem / reserved. It should be noted that with the arrival of the ACPI HMAT table and EFI Specific Purpose Memory the kernel will start to see more platforms with reserved / performance differentiated memory in its own NUMA node. Hence all the stakeholders on the Cc for what is ostensibly a libnvdimm local patch. === Before === /* Notice no online memory on node2 at start */ # numactl --hardware available: 3 nodes (0-1,3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 node 0 size: 3958 MB node 0 free: 3708 MB node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 node 1 size: 4027 MB node 1 free: 3871 MB node 3 cpus: node 3 size: 3994 MB node 3 free: 3971 MB node distances: node 0 1 3 0: 10 21 21 1: 21 10 21 3: 21 21 10 /* * Put the pmem namespace into devdax mode so it can be assigned to the * kmem driver */ # ndctl create-namespace -e namespace0.0 -m devdax -f { "dev":"namespace0.0", "mode":"devdax", "map":"dev", "size":"3.94 GiB (4.23 GB)", "uuid":"1650af9b-9ba3-4704-acd6-10178399d9a3", [..] } /* Online Persistent Memory as System RAM */ # daxctl reconfigure-device --mode=system-ram dax0.0 libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success [ { "chardev":"dax0.0", "size":4225761280, "target_node":0, "mode":"system-ram" } ] reconfigured 1 device /* Note that the memory is onlined by default to the wrong node, node0 */ # numactl --hardware available: 3 nodes (0-1,3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 node 0 size: 7926 MB node 0 free: 7655 MB node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 node 1 size: 4027 MB node 1 free: 3871 MB node 3 cpus: node 3 size: 3994 MB node 3 free: 3971 MB node distances: node 0 1 3 0: 10 21 21 1: 21 10 21 3: 21 21 10 === After === /* Notice that the "phys_index" error messages are gone */ # daxctl reconfigure-device --mode=system-ram dax0.0 [ { "chardev":"dax0.0", "size":4225761280, "target_node":2, "mode":"system-ram" } ] reconfigured 1 device /* Notice that node2 is now correctly populated */ # numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 node 0 size: 3958 MB node 0 free: 3793 MB node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 node 1 size: 4027 MB node 1 free: 3851 MB node 2 cpus: node 2 size: 3968 MB node 2 free: 3968 MB node 3 cpus: node 3 size: 3994 MB node 3 free: 3908 MB node distances: node 0 1 2 3 0: 10 21 21 21 1: 21 10 21 21 2: 21 21 10 21 3: 21 21 21 10 Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/158188327614.894464.13122730362187722603.stgit@dwillia2-desk3.amr.corp.intel.com
2020-01-31mm: Cleanup __put_devmap_managed_page() vs ->page_free()Dan Williams
After the removal of the device-public infrastructure there are only 2 ->page_free() call backs in the kernel. One of those is a device-private callback in the nouveau driver, the other is a generic wakeup needed in the DAX case. In the hopes that all ->page_free() callbacks can be migrated to common core kernel functionality, move the device-private specific actions in __put_devmap_managed_page() under the is_device_private_page() conditional, including the ->page_free() callback. For the other page types just open-code the generic wakeup. Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA case. Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01Merge tag 'libnvdimm-for-5.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The highlight this cycle is continuing integration fixes for PowerPC and some resulting optimizations. Summary: - Updates to better support vmalloc space restrictions on PowerPC platforms. - Cleanups to move common sysfs attributes to core 'struct device_type' objects. - Export the 'target_node' attribute (the effective numa node if pmem is marked online) for regions and namespaces. - Miscellaneous fixups and optimizations" * tag 'libnvdimm-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits) MAINTAINERS: Remove Keith from NVDIMM maintainers libnvdimm: Export the target_node attribute for regions and namespaces dax: Add numa_node to the default device-dax attributes libnvdimm: Simplify root read-only definition for the 'resource' attribute dax: Simplify root read-only definition for the 'resource' attribute dax: Create a dax device_type libnvdimm: Move nvdimm_bus_attribute_group to device_type libnvdimm: Move nvdimm_attribute_group to device_type libnvdimm: Move nd_mapping_attribute_group to device_type libnvdimm: Move nd_region_attribute_group to device_type libnvdimm: Move nd_numa_attribute_group to device_type libnvdimm: Move nd_device_attribute_group to device_type libnvdimm: Move region attribute group definition libnvdimm: Move attribute groups to device type libnvdimm: Remove prototypes for nonexistent functions libnvdimm/btt: fix variable 'rc' set but not used libnvdimm/pmem: Delete include of nd-core.h libnvdimm/namespace: Differentiate between probe mapping and runtime mapping libnvdimm/pfn_dev: Don't clear device memmap area during generic namespace probe libnvdimm: Trivial comment fix ...
2019-12-01Merge tag 'compat-ioctl-5.5' of ↵Linus Torvalds
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann: "As part of the cleanup of some remaining y2038 issues, I came to fs/compat_ioctl.c, which still has a couple of commands that need support for time64_t. In completely unrelated work, I spent time on cleaning up parts of this file in the past, moving things out into drivers instead. After Al Viro reviewed an earlier version of this series and did a lot more of that cleanup, I decided to try to completely eliminate the rest of it and move it all into drivers. This series incorporates some of Al's work and many patches of my own, but in the end stops short of actually removing the last part, which is the scsi ioctl handlers. I have patches for those as well, but they need more testing or possibly a rewrite" * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits) scsi: sd: enable compat ioctls for sed-opal pktcdvd: add compat_ioctl handler compat_ioctl: move SG_GET_REQUEST_TABLE handling compat_ioctl: ppp: move simple commands into ppp_generic.c compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic compat_ioctl: unify copy-in of ppp filters tty: handle compat PPP ioctls compat_ioctl: move SIOCOUTQ out of compat_ioctl.c compat_ioctl: handle SIOCOUTQNSD af_unix: add compat_ioctl support compat_ioctl: reimplement SG_IO handling compat_ioctl: move WDIOC handling into wdt drivers fs: compat_ioctl: move FITRIM emulation into file systems gfs2: add compat_ioctl support compat_ioctl: remove unused convert_in_user macro compat_ioctl: remove last RAID handling code compat_ioctl: remove /dev/raw ioctl translation compat_ioctl: remove PCI ioctl translation compat_ioctl: remove joystick ioctl translation ...
2019-11-19libnvdimm: Export the target_node attribute for regions and namespacesDan Williams
Aneesh points out that some platforms may have "local" attached persistent memory and "remote" persistent memory that map to the same "online" node, or persistent memory devices with different performance properties. In this case 'numa_node' is identical for the two instances, but 'target_node' is differentiated so platform firmware can communicate distinct performance properties per range. Expose 'target_node' by default to allow for disambiguation of devices that share the same numa_map_to_online_node() result. Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157401274500.43284.2369509941678577768.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-19libnvdimm: Simplify root read-only definition for the 'resource' attributeDan Williams
Rather than update the permission in ->is_visible() set the permission directly at declaration time. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309905534.1582359.13927459228885931097.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-19libnvdimm: Move nvdimm_bus_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nvdimm_bus_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309903815.1582359.6418211876315050283.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-19libnvdimm: Move nvdimm_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nvdimm_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309903201.1582359.10966209746585062329.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-19libnvdimm: Move nd_mapping_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_mapping_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309902686.1582359.6749533709859492704.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-19libnvdimm: Move nd_region_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_region_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309902169.1582359.16828508538444551337.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-19libnvdimm: Move nd_numa_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_numa_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157401269537.43284.14411189404186877352.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-17libnvdimm: Move nd_device_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_device_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. For regions this creates a new nd_region_attribute_groups[] added to the per-region device-type instances. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309901138.1582359.12909354140826530394.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-17libnvdimm: Move region attribute group definitionDan Williams
In preparation for moving region attributes from device attribute groups to the region device-type, reorder the declaration so that it can be referenced by the device-type definition without forward declarations. No functional changes are intended to result from this change. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309900624.1582359.6929998072035982264.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-17libnvdimm: Move attribute groups to device typeDan Williams
Statically initialize the attribute groups for each libnvdimm device_type. This is a preparation step for removing unnecessary exports of attributes that can be included in the device_type by default. Also take the opportunity to mark 'struct device_type' instances const. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309900111.1582359.2445687530383470348.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-17libnvdimm: Remove prototypes for nonexistent functionsAlastair D'Silva
These functions don't exist, so remove the prototypes for them. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Link: https://lore.kernel.org/r/20191025044721.16617-3-alastair@au1.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-17libnvdimm/btt: fix variable 'rc' set but not usedQian Cai
drivers/nvdimm/btt.c: In function 'btt_read_pg': drivers/nvdimm/btt.c:1264:8: warning: variable 'rc' set but not used [-Wunused-but-set-variable] int rc; ^~ Add a ratelimited message in case a storm of errors is encountered. Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/1572530719-32161-1-git-send-email-cai@lca.pw Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-17libnvdimm/pmem: Delete include of nd-core.hDan Williams
The entire point of nd-core.h is to hide functionality that no leaf driver should touch. In fact, the commit that added it had no need to include it. Fixes: 06e8ccdab15f ("acpi: nfit: Add support for detect platform...") Cc: Ira Weiny <ira.weiny@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-14libnvdimm/namespace: Differentiate between probe mapping and runtime mappingAneesh Kumar K.V
The nvdimm core currently maps the full namespace to an ioremap range while probing the namespace mode. This can result in probe failures on architectures that have limited ioremap space. For example, with a large btt namespace that consumes most of I/O remap range, depending on the sequence of namespace initialization, the user can find a pfn namespace initialization failure due to unavailable I/O remap space which nvdimm core uses for temporary mapping. nvdimm core can avoid this failure by only mapping the reserved info block area to check for pfn superblock type and map the full namespace resource only before using the namespace. Given that personalities like BTT can be layered on top of any namespace type create a generic form of devm_nsio_enable (devm_namespace_enable) and use it inside the per-personality attach routines. Now devm_namespace_enable() is always paired with disable unless the mapping is going to be used for long term runtime access. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20191017073308.32645-1-aneesh.kumar@linux.ibm.com [djbw: reworks to move devm_namespace_{en,dis}able into *attach helpers] Reported-by: kbuild test robot <lkp@intel.com> Link: https://lore.kernel.org/r/20191031105741.102793-2-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-14libnvdimm/pfn_dev: Don't clear device memmap area during generic namespace probeAneesh Kumar K.V
nvdimm core use nd_pfn_validate when looking for devdax or fsdax namespace. In this case device resources are allocated against nd_namespace_io dev. In-order to allow remap of range in nd_pfn_clear_memmap_error(), move the device memmap area clearing while initializing pfn namespace. With this device resource are allocated against nd_pfn and we can use nd_pfn->dev for remapping. This also avoids calling nd_pfn_clear_mmap_errors twice. Once while probing the namespace and second while initializing a pfn namespace. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20191101032728.113001-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-14libnvdimm/namsepace: Don't set claim_class on errorIra Weiny
Don't leave claim_class set to an invalid value if an error occurs in btt_claim_class(). While we are here change the return type of __holder_class_store() to be clear about the values it is returning. This was found via code inspection. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20190925211348.14082-1-ira.weiny@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-07lib: Uplevel the pmem "region" ida to a global allocatorDan Williams
In preparation for handling platform differentiated memory types beyond persistent memory, uplevel the "region" identifier to a global number space. This enables a device-dax instance to be registered to any memory type with guaranteed unique names. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-23compat_ioctl: move more drivers to compat_ptr_ioctlArnd Bergmann
The .ioctl and .compat_ioctl file operations have the same prototype so they can both point to the same function, which works great almost all the time when all the commands are compatible. One exception is the s390 architecture, where a compat pointer is only 31 bit wide, and converting it into a 64-bit pointer requires calling compat_ptr(). Most drivers here will never run in s390, but since we now have a generic helper for it, it's easy enough to use it consistently. I double-checked all these drivers to ensure that all ioctl arguments are used as pointers or are ignored, but are not interpreted as integer values. Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Darren Hart (VMware) <dvhart@infradead.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-29Merge tag 'libnvdimm-fixes-5.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm More libnvdimm updates from Dan Williams: - Complete the reworks to interoperate with powerpc dynamic huge page sizes - Fix a crash due to missed accounting for the powerpc 'struct page'-memmap mapping granularity - Fix badblock initialization for volatile (DRAM emulated) pmem ranges - Stop triggering request_key() notifications to userspace when NVDIMM-security is disabled / not present - Miscellaneous small fixups * tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: libnvdimm/region: Enable MAP_SYNC for volatile regions libnvdimm: prevent nvdimm from requesting key when security is disabled libnvdimm/region: Initialize bad block for volatile namespaces libnvdimm/nfit_test: Fix acpi_handle redefinition libnvdimm/altmap: Track namespace boundaries in altmap libnvdimm: Fix endian conversion issues  libnvdimm/dax: Pick the right alignment default when creating dax devices powerpc/book3s64: Export has_transparent_hugepage() related functions.
2019-09-24libnvdimm/region: Enable MAP_SYNC for volatile regionsAneesh Kumar K.V
Some environments want to use a host tmpfs/ramdisk to back guest pmem. While the data is not persisted relative to the host it *is* persisted relative to guest crashes / reboots. The guest is free to use dax and MAP_SYNC to keep filesystem metadata consistent with dax accesses without requiring guest fsync(). The guest can also observe that the region is volatile and skip cache flushing as global visibility is enough to "persist" data relative to the host staying alive over guest reset events. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Link: https://lore.kernel.org/r/20190924114327.14700-1-aneesh.kumar@linux.ibm.com [djbw: reword the changelog] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24libnvdimm: prevent nvdimm from requesting key when security is disabledDave Jiang
Current implementation attempts to request keys from the keyring even when security is not enabled. Change behavior so when security is disabled it will skip key request. Error messages seen when no keys are installed and libnvdimm is loaded: request-key[4598]: Cannot find command to construct key 661489677 request-key[4606]: Cannot find command to construct key 34713726 Cc: stable@vger.kernel.org Fixes: 4c6926a23b76 ("acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs") Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/156934642272.30222.5230162488753445916.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24libnvdimm/region: Initialize bad block for volatile namespacesAneesh Kumar K.V
We do check for a bad block during namespace init and that use region bad block list. We need to initialize the bad block for volatile regions for this to work. We also observe a lockdep warning as below because the lock is not initialized correctly since we skip bad block init for volatile regions. INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149 Call Trace: [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable) [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60 [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0 [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270 [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290 [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0 [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0 [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160 [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240 [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0 [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0 [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0 [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0 [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130 [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50 [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0 [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170 [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100 [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48 [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0 [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180 [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24libnvdimm/altmap: Track namespace boundaries in altmapAneesh Kumar K.V
With PFN_MODE_PMEM namespace, the memmap area is allocated from the device area. Some architectures map the memmap area with large page size. On architectures like ppc64, 16MB page for memap mapping can map 262144 pfns. This maps a namespace size of 16G. When populating memmap region with 16MB page from the device area, make sure the allocated space is not used to map resources outside this namespace. Such usage of device area will prevent a namespace destroy. Add resource end pnf in altmap and use that to check if the memmap area allocation can map pfn outside the namespace. On ppc64 in such case we fallback to allocation from memory. This fix kernel crash reported below: [ 132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0 [ 133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000 [ 133.464760] Faulting instruction address: 0xc00000000007580c [ 133.464766] Oops: Kernel access of bad area, sig: 11 [#1] [ 133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries ..... [ 133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0 [ 133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0 [ 133.464910] Call Trace: [ 133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable) [ 133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240 [ 133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590 [ 133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160 [ 133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0 [ 133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50 [ 133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400 [ 133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250 [ 133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110 [ 133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70 [ 133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0 [ 133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250 [ 133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70 [ 133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250 djbw: Aneesh notes that this crash can likely be triggered in any kernel that supports 'papr_scm', so flagging that commit for -stable consideration. Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions") Cc: <stable@vger.kernel.org> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Tested-by: Santosh Sivaraj <santosh@fossix.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24libnvdimm: Fix endian conversion issues Aneesh Kumar K.V
nd_label->dpa issue was observed when trying to enable the namespace created with little-endian kernel on a big-endian kernel. That made me run `sparse` on the rest of the code and other changes are the result of that. Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing") Fixes: 9dedc73a4658 ("libnvdimm/btt: Fix LBA masking during 'free list' population") Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190809074726.27815-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24libnvdimm/dax: Pick the right alignment default when creating dax devicesAneesh Kumar K.V
Allow arch to provide the supported alignments and use hugepage alignment only if we support hugepage. Right now we depend on compile time configs whereas this patch switch this to runtime discovery. Architectures like ppc64 can have THP enabled in code, but then can have hugepage size disabled by the hypervisor. This allows us to create dax devices with PAGE_SIZE alignment in this case. Existing dax namespace with alignment larger than PAGE_SIZE will fail to initialize in this specific case. We still allow fsdax namespace initialization. With respect to identifying whether to enable hugepage fault for a dax device, if THP is enabled during compile, we default to taking hugepage fault and in dax fault handler if we find the fault size > alignment we retry with PAGE_SIZE fault size. This also addresses the below failure scenario on ppc64 ndctl create-namespace --mode=devdax | grep align "align":16777216, "align":16777216 cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments 65536 16777216 daxio.static-debug -z -o /dev/dax0.0 Bus error (core dumped) $ dmesg | tail lpar: Failed hash pte insert with error -4 hash-mmu: mm: Hashing failure ! EA=0x7fff17000000 access=0x8000000000000006 current=daxio hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc000000501002b86 daxio[3860]: bus error (7) at 7fff17000000 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b0000+20000] daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120 daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 <f9490000> e93f0088 39290008 f93f0110 The failure was due to guest kernel using wrong page size. The namespaces created with 16M alignment will appear as below on a config with 16M page size disabled. $ ndctl list -Ni [ { "dev":"namespace0.1", "mode":"fsdax", "map":"dev", "size":5351931904, "uuid":"fc6e9667-461a-4718-82b4-69b24570bddb", "align":16777216, "blockdev":"pmem0.1", "supported_alignments":[ 65536 ] }, { "dev":"namespace0.0", "mode":"fsdax", <==== devdax 16M alignment marked disabled. "map":"mem", "size":5368709120, "uuid":"a4bdf81a-f2ee-4bc6-91db-7b87eddd0484", "state":"disabled" } ] Cc: linux-mm@kvack.org Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190905154603.10349-8-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-21Merge tag 'libnvdimm-for-5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "Some reworks to better support nvdimms on powerpc and an nvdimm security interface update: - Rework the nvdimm core to accommodate architectures with different page sizes and ones that can change supported huge page sizes at boot time rather than a compile time constant. - Introduce a distinct 'frozen' attribute for the nvdimm security state since it is independent of the locked state. - Miscellaneous fixups" * tag 'libnvdimm-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: libnvdimm: Use PAGE_SIZE instead of SZ_4K for align check libnvdimm/label: Remove the dpa align check libnvdimm/pfn_dev: Add page size and struct page size to pfn superblock libnvdimm/pfn_dev: Add a build check to make sure we notice when struct page size change libnvdimm/pmem: Advance namespace seed for specific probe errors libnvdimm/region: Rewrite _probe_success() to _advance_seeds() libnvdimm/security: Consolidate 'security' operations libnvdimm/security: Tighten scope of nvdimm->busy vs security operations libnvdimm/security: Introduce a 'frozen' attribute libnvdimm, region: Use struct_size() in kzalloc() tools/testing/nvdimm: Fix fallthrough warning libnvdimm/of_pmem: Provide a unique name for bus provider
2019-09-21Merge tag 'for-linus-hmm' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull hmm updates from Jason Gunthorpe: "This is more cleanup and consolidation of the hmm APIs and the very strongly related mmu_notifier interfaces. Many places across the tree using these interfaces are touched in the process. Beyond that a cleanup to the page walker API and a few memremap related changes round out the series: - General improvement of hmm_range_fault() and related APIs, more documentation, bug fixes from testing, API simplification & consolidation, and unused API removal - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and make them internal kconfig selects - Hoist a lot of code related to mmu notifier attachment out of drivers by using a refcount get/put attachment idiom and remove the convoluted mmu_notifier_unregister_no_release() and related APIs. - General API improvement for the migrate_vma API and revision of its only user in nouveau - Annotate mmu_notifiers with lockdep and sleeping region debugging Two series unrelated to HMM or mmu_notifiers came along due to dependencies: - Allow pagemap's memremap_pages family of APIs to work without providing a struct device - Make walk_page_range() and related use a constant structure for function pointers" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits) libnvdimm: Enable unit test infrastructure compile checks mm, notifier: Catch sleeping/blocking for !blockable kernel.h: Add non_block_start/end() drm/radeon: guard against calling an unpaired radeon_mn_unregister() csky: add missing brackets in a macro for tlb.h pagewalk: use lockdep_assert_held for locking validation pagewalk: separate function pointers from iterator data mm: split out a new pagewalk.h header from mm.h mm/mmu_notifiers: annotate with might_sleep() mm/mmu_notifiers: prime lockdep mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports mm/hmm: hmm_range_fault() infinite loop mm/hmm: hmm_range_fault() NULL pointer bug mm/hmm: fix hmm_range_fault()'s handling of swapped out pages mm/mmu_notifiers: remove unregister_no_release RDMA/odp: remove ib_ucontext from ib_umem RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm' RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address ...