summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-06-04virtio-mem: Offline and remove completely unplugged memory blocksDavid Hildenbrand
Let's offline+remove memory blocks once all subblocks are unplugged. We can use the new Linux MM interface for that. As no memory is in use anymore, this shouldn't take a long time and shouldn't fail. There might be corner cases where the offlining could still fail (especially, if another notifier NACKs the offlining request). Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-10-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04mm/memory_hotplug: Introduce offline_and_remove_memory()David Hildenbrand
virtio-mem wants to offline and remove a memory block once it unplugged all subblocks (e.g., using alloc_contig_range()). Let's provide an interface to do that from a driver. virtio-mem already supports to offline partially unplugged memory blocks. Offlining a fully unplugged memory block will not require to migrate any pages. All unplugged subblocks are PageOffline() and have a reference count of 0 - so offlining code will simply skip them. All we need is an interface to offline and remove the memory from kernel module context, where we don't have access to the memory block devices (esp. find_memory_block() and device_offline()) and the device hotplug lock. To keep things simple, allow to only work on a single memory block. Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04virtio-mem: Allow to offline partially unplugged memory blocksDavid Hildenbrand
Dropping the reference count of PageOffline() pages during MEM_GOING_ONLINE allows offlining code to skip them. However, we also have to clear PG_reserved, because PG_reserved pages get detected as unmovable right away. Take care of restoring the reference count when offlining is canceled. Clarify why we don't have to perform any action when unloading the driver. Also, let's add a warning if anybody is still holding a reference to unplugged pages when offlining. Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-8-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINEDavid Hildenbrand
virtio-mem wants to allow to offline memory blocks of which some parts were unplugged (allocated via alloc_contig_range()), especially, to later offline and remove completely unplugged memory blocks. The important part is that PageOffline() has to remain set until the section is offline, so these pages will never get accessed (e.g., when dumping). The pages should not be handed back to the buddy (which would require clearing PageOffline() and result in issues if offlining fails and the pages are suddenly in the buddy). Let's allow to do that by allowing to isolate any PageOffline() page when offlining. This way, we can reach the memory hotplug notifier MEM_GOING_OFFLINE, where the driver can signal that he is fine with offlining this page by dropping its reference count. PageOffline() pages with a reference count of 0 can then be skipped when offlining the pages (like if they were free, however they are not in the buddy). Anybody who uses PageOffline() pages and does not agree to offline them (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not decrement the reference count and make offlining fail when trying to migrate such an unmovable page. So there should be no observable change. Same applies to balloon compaction users (movable PageOffline() pages), the pages will simply be migrated. Note 1: If offlining fails, a driver has to increment the reference count again in MEM_CANCEL_OFFLINE. Note 2: A driver that makes use of this has to be aware that re-onlining the memory block has to be handled by hooking into onlining code (online_page_callback_t), resetting the page PageOffline() and not giving them to the buddy. Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Qian Cai <cai@lca.pw> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04virtio-mem: Paravirtualized memory hotunplug part 2David Hildenbrand
We also want to unplug online memory (contained in online memory blocks and, therefore, managed by the buddy), and eventually replug it later. When requested to unplug memory, we use alloc_contig_range() to allocate subblocks in online memory blocks (so we are the owner) and send them to our hypervisor. When requested to plug memory, we can replug such memory using free_contig_range() after asking our hypervisor. We also want to mark all allocated pages PG_offline, so nobody will touch them. To differentiate pages that were never onlined when onlining the memory block from pages allocated via alloc_contig_range(), we use PageDirty(). Based on this flag, virtio_mem_fake_online() can either online the pages for the first time or use free_contig_range(). It is worth noting that there are no guarantees on how much memory can actually get unplugged again. All device memory might completely be fragmented with unmovable data, such that no subblock can get unplugged. We are not touching the ZONE_MOVABLE. If memory is onlined to the ZONE_MOVABLE, it can only get unplugged after that memory was offlined manually by user space. In normal operation, virtio-mem memory is suggested to be onlined to ZONE_NORMAL. In the future, we will try to make unplug more likely to succeed. Add a module parameter to control if online memory shall be touched. As we want to access alloc_contig_range()/free_contig_range() from kernel module context, export the symbols. Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages are on the same node, in the same zone, and contain no holes. Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04virtio-mem: Paravirtualized memory hotunplug part 1David Hildenbrand
Unplugging subblocks of memory blocks that are offline is easy. All we have to do is watch out for concurrent onlining activity. Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-5-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04virtio-mem: Allow to specify an ACPI PXM as nidDavid Hildenbrand
We want to allow to specify (similar as for a DIMM), to which node a virtio-mem device (and, therefore, its memory) belongs. Add a new virtio-mem feature flag and export pxm_to_node, so it can be used in kernel module context. Acked-by: Michal Hocko <mhocko@suse.com> # for the export Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> # for the export Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-acpi@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-4-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04MAINTAINERS: Add myself as virtio-mem maintainerDavid Hildenbrand
Let's make sure patches/bug reports find the right person. Cc: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-3-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04virtio-mem: Paravirtualized memory hotplugDavid Hildenbrand
Each virtio-mem device owns exactly one memory region. It is responsible for adding/removing memory from that memory region on request. When the device driver starts up, the requested amount of memory is queried and then plugged to Linux. On request, further memory can be plugged or unplugged. This patch only implements the plugging part. On x86-64, memory can currently be plugged in 4MB ("subblock") granularity. When required, a new memory block will be added (e.g., usually 128MB on x86-64) in order to plug more subblocks. Only x86-64 was tested for now. The online_page callback is used to keep unplugged subblocks offline when onlining memory - similar to the Hyper-V balloon driver. Unplugged pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to skip them. User space is usually responsible for onlining the added memory. The memory hotplug notifier is used to synchronize virtio-mem activity against memory onlining/offlining. Each virtio-mem device can belong to a NUMA node, which allows us to easily add/remove small chunks of memory to/from a specific NUMA node by using multiple virtio-mem devices. Something that works even when the guest has no idea about the NUMA topology. One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many "sub-DIMMS". This patch directly introduces the basic infrastructure to implement memory unplug. Especially the memory block states and subblock bitmaps will be heavily used there. Notes: - In case memory is to be onlined by user space, we limit the amount of offline memory blocks, to not run out of memory. This is esp. an issue if memory is added faster than it is getting onlined. - Suspend/Hibernate is not supported due to the way virtio-mem devices behave. Limited support might be possible in the future. - Reloading the device driver is not supported. Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: linux-acpi@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-2-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04vdpasim: Fix some coccinelle warningsSamuel Zou
Fix below warnings reported by coccicheck: drivers/vdpa/vdpa_sim/vdpa_sim.c:104:1-10: WARNING: Assignment of 0/1 to bool variable drivers/vdpa/vdpa_sim/vdpa_sim.c:164:7-11: WARNING: Unsigned expression compared with zero: read <= 0 drivers/vdpa/vdpa_sim/vdpa_sim.c:169:7-12: WARNING: Unsigned expression compared with zero: write <= 0 1. The 'ready' variable in vdpasim_virtqueue struct is bool type. It is better to initialize vq->ready to false 2. Modify 'read' and 'write' variables type from size_t to ssize_t. And preserve the reverse christmas tree ordering of local variables. Fixes: 2c53d0f64c06 ("vdpasim: vDPA device simulator") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Samuel Zou <zou_wei@huawei.com> Link: https://lore.kernel.org/r/1588990802-28451-1-git-send-email-zou_wei@huawei.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04ifcvf: move IRQ request/free to status change handlersZhu Lingshan
This commit move IRQ request and free operations from probe() to VIRTIO status change handler to comply with VIRTIO spec. VIRTIO spec 1.1, section 2.1.2 Device Requirements: Device Status Field The device MUST NOT consume buffers or send any used buffer notifications to the driver before DRIVER_OK. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Link: https://lore.kernel.org/r/1589270444-3669-1-git-send-email-lingshan.zhu@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2020-06-04vhost: (cosmetic) remove a superfluous variable initialisationGuennadi Liakhovetski
Even the compiler is able to figure out that in this case the initialisation is superfluous. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com> Link: https://lore.kernel.org/r/20200527180541.5570-3-guennadi.liakhovetski@linux.intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req()Longpeng(Mike)
The src/dst length is not aligned with AES_BLOCK_SIZE(which is 16) in some testcases in tcrypto.ko. For example, the src/dst length of one of cts(cbc(aes))'s testcase is 17, the crypto_virtio driver will set @src_data_len=16 but @dst_data_len=17 in this case and get a wrong at then end. SRC: pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp (17 bytes) EXP: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc pp (17 bytes) DST: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00 (pollute the last bytes) (pp: plaintext cc:ciphertext) Fix this issue by limit the length of dest buffer. Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver") Cc: Gonglei <arei.gonglei@huawei.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: virtualization@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20200602070501.2023-4-longpeng2@huawei.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()Longpeng(Mike)
The system'll crash when the users insmod crypto/tcrypto.ko with mode=155 ( testing "authenc(hmac(sha1),cbc(aes))" ). It's caused by reuse the memory of request structure. In crypto_authenc_init_tfm(), the reqsize is set to: [PART 1] sizeof(authenc_request_ctx) + [PART 2] ictx->reqoff + [PART 3] MAX(ahash part, skcipher part) and the 'PART 3' is used by both ahash and skcipher in turn. When the virtio_crypto driver finish skcipher req, it'll call ->complete callback(in crypto_finalize_skcipher_request) and then free its resources whose pointers are recorded in 'skcipher parts'. However, the ->complete is 'crypto_authenc_encrypt_done' in this case, it will use the 'ahash part' of the request and change its content, so virtio_crypto driver will get the wrong pointer after ->complete finish and mistakenly free some other's memory. So the system will crash when these memory will be used again. The resources which need to be cleaned up are not used any more. But the pointers of these resources may be changed in the function "crypto_finalize_skcipher_request". Thus release specific resources before calling this function. Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver") Reported-by: LABBE Corentin <clabbe@baylibre.com> Cc: Gonglei <arei.gonglei@huawei.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: virtualization@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200123101000.GB24255@Red Acked-by: Gonglei <arei.gonglei@huawei.com> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20200602070501.2023-3-longpeng2@huawei.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04crypto: virtio: Fix src/dst scatterlist calculation in ↵Longpeng(Mike)
__virtio_crypto_skcipher_do_req() The system will crash when the users insmod crypto/tcrypt.ko with mode=38 ( testing "cts(cbc(aes))" ). Usually the next entry of one sg will be @sg@ + 1, but if this sg element is part of a chained scatterlist, it could jump to the start of a new scatterlist array. Fix it by sg_next() on calculation of src/dst scatterlist. Fixes: dbaf0624ffa5 ("crypto: add virtio-crypto driver") Reported-by: LABBE Corentin <clabbe@baylibre.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: virtualization@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200123101000.GB24255@Red Signed-off-by: Gonglei <arei.gonglei@huawei.com> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20200602070501.2023-2-longpeng2@huawei.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04virtio-balloon: Disable free page reporting if page poison reporting is not ↵Alexander Duyck
enabled We should disable free page reporting if page poisoning is enabled but we cannot report it via the balloon interface. This way we can avoid the possibility of corrupting guest memory. Normally the page poisoning feature should always be present when free page reporting is enabled on the hypervisor, however this allows us to correctly handle a case of the virtio-balloon device being possibly misconfigured. Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page reports to host") Cc: stable@vger.kernel.org Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: https://lore.kernel.org/r/20200508173732.17877.85060.stgit@localhost.localdomain Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04vhost_vdpa: disable doorbell mapping for !MMUMichael S. Tsirkin
There could be ways to support doorbell mapping with !MMU, but things like pgprot_noncached are not universally supported. Fixable, but just disable this for now. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04vhost_vdpa: support doorbell mapping via mmapJason Wang
Currently the doorbell is relayed via eventfd which may have significant overhead because of the cost of vmexits or syscall. This patch introduces mmap() based doorbell mapping which can eliminate the overhead caused by vmexit or syscall. To ease the userspace modeling of the doorbell layout (usually virtio-pci), this patch starts from a doorbell per page model. Vhost-vdpa only support the hardware doorbell that sit at the boundary of a page and does not share the page with other registers. Doorbell of each virtqueue must be mapped separately, pgoff is the index of the virtqueue. This allows userspace to map a subset of the doorbell which may be useful for the implementation of software assisted virtqueue (control vq) in the future. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200529080303.15449-5-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04vdpa: introduce get_vq_notification methodJason Wang
This patch introduces a new method in the vdpa_config_ops which reports the physical address and the size of the doorbell for a specific virtqueue. This will be used by the future patches that maps doorbell to userspace. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200529080303.15449-4-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04vhost: use mmgrab() instead of mmget() for non worker deviceJason Wang
For the device that doesn't use vhost worker and use_mm(), mmget() is too heavy weight and it may brings troubles for implementing mmap() support for vDPA device. This is because, an reference to the address space was held via mm_get() in vhost_dev_set_owner() and an reference to the file was held in mmap(). This means when process exits, the mm can not be released thus we can not release the file. This patch tries to use mmgrab() instead of mmget(), which allows the address space to be destroy in process exit without releasing the mm structure itself. This is sufficient for vDPA device which pin user pages and does not depend on the address space to work. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200529080303.15449-3-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-04vhost: allow device that does not depend on vhost workerJason Wang
vDPA device currently relays the eventfd via vhost worker. This is inefficient due the latency of wakeup and scheduling, so this patch tries to introduce a use_worker attribute for the vhost device. When use_worker is not set with vhost_dev_init(), vhost won't try to allocate a worker thread and the vhost_poll will be processed directly in the wakeup function. This help for vDPA since it reduces the latency caused by vhost worker. In my testing, it saves 0.2 ms in pings between VMs on a mutual host. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200529080303.15449-2-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-02MAINTAINERS: Add myself as virtio-balloon co-maintainerDavid Hildenbrand
As suggested by Michael, let's add me as co-maintainer of virtio-balloon. While at it, also add "include/linux/balloon_compaction.h" to the file list. Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200310115411.12760-1-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-02vhost: revert "vhost: disable for OABI"Michael S. Tsirkin
This reverts commit d085eb8ce727 ("vhost: disable for OABI") With commit "virtio: force spec specified alignment on types" in place, we force proper alignment for all structures, so there's no longer a reason to blacklist OABI. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-02virtio: force spec specified alignment on typesMichael S. Tsirkin
The ring element addresses are passed between components with different alignments assumptions. Thus, if guest/userspace selects a pointer and host then gets and dereferences it, we might need to decrease the compiler-selected alignment to prevent compiler on the host from assuming pointer is aligned. This actually triggers on ARM with -mabi=apcs-gnu - which is a deprecated configuration, but it seems safer to handle this generally. Note that userspace that allocates the memory is actually OK and does not need to be fixed, but userspace that gets it from guest or another process does need to be fixed. The later doesn't generally talk to the kernel so while it might be buggy it's not talking to the kernel in the buggy way - it's just using the header in the buggy way - so fixing header and asking userspace to recompile is the best we can do. I verified that the produced kernel binary on x86 is exactly identical before and after the change. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2020-06-02virtio-mmio: Delete an error message in vm_find_vqs()Markus Elfring
The function “platform_get_irq” can log an error already. Thus omit a redundant message for the exception handling in the calling function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Link: https://lore.kernel.org/r/9e27bc4a-cfa1-7818-dc25-8ad308816b30@web.de Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-02virtio: add VIRTIO_RING_NO_LEGACYMatej Genci
Add macro to disable legacy vring functions. Signed-off-by: Matej Genci <matej.genci@nutanix.com> Link: https://lore.kernel.org/r/20190911124942.243713-1-matej.genci@nutanix.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-05-31Linux 5.7Linus Torvalds
2020-05-31checkpatch/coding-style: deprecate 80-column warningJoe Perches
Yes, staying withing 80 columns is certainly still _preferred_. But it's not the hard limit that the checkpatch warnings imply, and other concerns can most certainly dominate. Increase the default limit to 100 characters. Not because 100 characters is some hard limit either, but that's certainly a "what are you doing" kind of value and less likely to be about the occasional slightly longer lines. Miscellanea: - to avoid unnecessary whitespace changes in files, checkpatch will no longer emit a warning about line length when scanning files unless --strict is also used - Add a bit to coding-style about alignment to open parenthesis Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-31Merge tag 'x86-urgent-2020-05-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A pile of x86 fixes: - Prevent a memory leak in ioperm which was caused by the stupid assumption that the exit cleanup is always called for current, which is not the case when fork fails after taking a reference on the ioperm bitmap. - Fix an arithmething overflow in the DMA code on 32bit systems - Fill gaps in the xstate copy with defaults instead of leaving them uninitialized - Revert: "Make __X32_SYSCALL_BIT be unsigned long" as it turned out that existing user space fails to build" * tag 'x86-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioperm: Prevent a memory leak when fork fails x86/dma: Fix max PFN arithmetic overflow on 32 bit systems copy_xstate_to_kernel(): don't leave parts of destination uninitialized x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"
2020-05-31Merge tag 'sched-urgent-2020-05-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single scheduler fix preventing a crash in NUMA balancing. The current->mm check is not reliable as the mm might be temporary due to use_mm() in a kthread. Check for PF_KTHREAD explictly" * tag 'sched-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Don't NUMA balance for kthreads
2020-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: "Another week, another set of bug fixes: 1) Fix pskb_pull length in __xfrm_transport_prep(), from Xin Long. 2) Fix double xfrm_state put in esp{4,6}_gro_receive(), also from Xin Long. 3) Re-arm discovery timer properly in mac80211 mesh code, from Linus Lüssing. 4) Prevent buffer overflows in nf_conntrack_pptp debug code, from Pablo Neira Ayuso. 5) Fix race in ktls code between tls_sw_recvmsg() and tls_decrypt_done(), from Vinay Kumar Yadav. 6) Fix crashes on TCP fallback in MPTCP code, from Paolo Abeni. 7) More validation is necessary of untrusted GSO packets coming from virtualization devices, from Willem de Bruijn. 8) Fix endianness of bnxt_en firmware message length accesses, from Edwin Peer. 9) Fix infinite loop in sch_fq_pie, from Davide Caratti. 10) Fix lockdep splat in DSA by setting lockless TX in netdev features for slave ports, from Vladimir Oltean. 11) Fix suspend/resume crashes in mlx5, from Mark Bloch. 12) Fix use after free in bpf fmod_ret, from Alexei Starovoitov. 13) ARP retransmit timer guard uses wrong offset, from Hongbin Liu. 14) Fix leak in inetdev_init(), from Yang Yingliang. 15) Don't try to use inet hash and unhash in l2tp code, results in crashes. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits) l2tp: add sk_family checks to l2tp_validate_socket l2tp: do not use inet_hash()/inet_unhash() net: qrtr: Allocate workqueue before kernel_bind mptcp: remove msk from the token container at destruction time. mptcp: fix race between MP_JOIN and close mptcp: fix unblocking connect() net/sched: act_ct: add nat mangle action only for NAT-conntrack devinet: fix memleak in inetdev_init() virtio_vsock: Fix race condition in virtio_transport_recv_pkt drivers/net/ibmvnic: Update VNIC protocol version reporting NFC: st21nfca: add missed kfree_skb() in an error path neigh: fix ARP retransmit timer guard bpf, selftests: Add a verifier test for assigning 32bit reg states to 64bit ones bpf, selftests: Verifier bounds tests need to be updated bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones bpf: Fix use-after-free in fmod_ret check net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta() net/mlx5e: Fix MLX5_TC_CT dependencies net/mlx5e: Properly set default values when disabling adaptive moderation net/mlx5e: Fix arch depending casting issue in FEC ...
2020-05-30l2tp: add sk_family checks to l2tp_validate_socketEric Dumazet
syzbot was able to trigger a crash after using an ISDN socket and fool l2tp. Fix this by making sure the UDP socket is of the proper family. BUG: KASAN: slab-out-of-bounds in setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78 Write of size 1 at addr ffff88808ed0c590 by task syz-executor.5/3018 CPU: 0 PID: 3018 Comm: syz-executor.5 Not tainted 5.7.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:382 __kasan_report.cold+0x20/0x38 mm/kasan/report.c:511 kasan_report+0x33/0x50 mm/kasan/common.c:625 setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78 l2tp_tunnel_register+0xb15/0xdd0 net/l2tp/l2tp_core.c:1523 l2tp_nl_cmd_tunnel_create+0x4b2/0xa60 net/l2tp/l2tp_netlink.c:249 genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline] genl_family_rcv_msg net/netlink/genetlink.c:718 [inline] genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469 genl_rcv+0x24/0x40 net/netlink/genetlink.c:746 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e6/0x810 net/socket.c:2352 ___sys_sendmsg+0x100/0x170 net/socket.c:2406 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2439 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x45ca29 Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007effe76edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004fe1c0 RCX: 000000000045ca29 RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 000000000000094e R14: 00000000004d5d00 R15: 00007effe76ee6d4 Allocated by task 3018: save_stack+0x1b/0x40 mm/kasan/common.c:49 set_track mm/kasan/common.c:57 [inline] __kasan_kmalloc mm/kasan/common.c:495 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:468 __do_kmalloc mm/slab.c:3656 [inline] __kmalloc+0x161/0x7a0 mm/slab.c:3665 kmalloc include/linux/slab.h:560 [inline] sk_prot_alloc+0x223/0x2f0 net/core/sock.c:1612 sk_alloc+0x36/0x1100 net/core/sock.c:1666 data_sock_create drivers/isdn/mISDN/socket.c:600 [inline] mISDN_sock_create+0x272/0x400 drivers/isdn/mISDN/socket.c:796 __sock_create+0x3cb/0x730 net/socket.c:1428 sock_create net/socket.c:1479 [inline] __sys_socket+0xef/0x200 net/socket.c:1521 __do_sys_socket net/socket.c:1530 [inline] __se_sys_socket net/socket.c:1528 [inline] __x64_sys_socket+0x6f/0xb0 net/socket.c:1528 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Freed by task 2484: save_stack+0x1b/0x40 mm/kasan/common.c:49 set_track mm/kasan/common.c:57 [inline] kasan_set_free_info mm/kasan/common.c:317 [inline] __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:456 __cache_free mm/slab.c:3426 [inline] kfree+0x109/0x2b0 mm/slab.c:3757 kvfree+0x42/0x50 mm/util.c:603 __free_fdtable+0x2d/0x70 fs/file.c:31 put_files_struct fs/file.c:420 [inline] put_files_struct+0x248/0x2e0 fs/file.c:413 exit_files+0x7e/0xa0 fs/file.c:445 do_exit+0xb04/0x2dd0 kernel/exit.c:791 do_group_exit+0x125/0x340 kernel/exit.c:894 get_signal+0x47b/0x24e0 kernel/signal.c:2739 do_signal+0x81/0x2240 arch/x86/kernel/signal.c:784 exit_to_usermode_loop+0x26c/0x360 arch/x86/entry/common.c:161 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] syscall_return_slowpath arch/x86/entry/common.c:279 [inline] do_syscall_64+0x6b1/0x7d0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x49/0xb3 The buggy address belongs to the object at ffff88808ed0c000 which belongs to the cache kmalloc-2k of size 2048 The buggy address is located 1424 bytes inside of 2048-byte region [ffff88808ed0c000, ffff88808ed0c800) The buggy address belongs to the page: page:ffffea00023b4300 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0xfffe0000000200(slab) raw: 00fffe0000000200 ffffea0002838208 ffffea00015ba288 ffff8880aa000e00 raw: 0000000000000000 ffff88808ed0c000 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88808ed0c480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88808ed0c500: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88808ed0c580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff88808ed0c600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88808ed0c680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 6b9f34239b00 ("l2tp: fix races in tunnel creation") Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: James Chapman <jchapman@katalix.com> Cc: Guillaume Nault <gnault@redhat.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30l2tp: do not use inet_hash()/inet_unhash()Eric Dumazet
syzbot recently found a way to crash the kernel [1] Issue here is that inet_hash() & inet_unhash() are currently only meant to be used by TCP & DCCP, since only these protocols provide the needed hashinfo pointer. L2TP uses a single list (instead of a hash table) This old bug became an issue after commit 610236587600 ("bpf: Add new cgroup attach type to enable sock modifications") since after this commit, sk_common_release() can be called while the L2TP socket is still considered 'hashed'. general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 7063 Comm: syz-executor654 Not tainted 5.7.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600 Code: 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e dd 04 00 00 48 8d 7d 08 44 8b 73 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 55 05 00 00 48 8d 7d 14 4c 8b 6d 08 48 b8 00 00 RSP: 0018:ffffc90001777d30 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff88809a6df940 RCX: ffffffff8697c242 RDX: 0000000000000001 RSI: ffffffff8697c251 RDI: 0000000000000008 RBP: 0000000000000000 R08: ffff88809f3ae1c0 R09: fffffbfff1514cc1 R10: ffffffff8a8a6607 R11: fffffbfff1514cc0 R12: ffff88809a6df9b0 R13: 0000000000000007 R14: 0000000000000000 R15: ffffffff873a4d00 FS: 0000000001d2b880(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006cd090 CR3: 000000009403a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: sk_common_release+0xba/0x370 net/core/sock.c:3210 inet_create net/ipv4/af_inet.c:390 [inline] inet_create+0x966/0xe00 net/ipv4/af_inet.c:248 __sock_create+0x3cb/0x730 net/socket.c:1428 sock_create net/socket.c:1479 [inline] __sys_socket+0xef/0x200 net/socket.c:1521 __do_sys_socket net/socket.c:1530 [inline] __se_sys_socket net/socket.c:1528 [inline] __x64_sys_socket+0x6f/0xb0 net/socket.c:1528 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x441e29 Code: e8 fc b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffdce184148 EFLAGS: 00000246 ORIG_RAX: 0000000000000029 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441e29 RDX: 0000000000000073 RSI: 0000000000000002 RDI: 0000000000000002 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000402c30 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: ---[ end trace 23b6578228ce553e ]--- RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600 Code: 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e dd 04 00 00 48 8d 7d 08 44 8b 73 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 55 05 00 00 48 8d 7d 14 4c 8b 6d 08 48 b8 00 00 RSP: 0018:ffffc90001777d30 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff88809a6df940 RCX: ffffffff8697c242 RDX: 0000000000000001 RSI: ffffffff8697c251 RDI: 0000000000000008 RBP: 0000000000000000 R08: ffff88809f3ae1c0 R09: fffffbfff1514cc1 R10: ffffffff8a8a6607 R11: fffffbfff1514cc0 R12: ffff88809a6df9b0 R13: 0000000000000007 R14: 0000000000000000 R15: ffffffff873a4d00 FS: 0000000001d2b880(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006cd090 CR3: 000000009403a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: James Chapman <jchapman@katalix.com> Cc: Andrii Nakryiko <andriin@fb.com> Reported-by: syzbot+3610d489778b57cc8031@syzkaller.appspotmail.com
2020-05-30net: qrtr: Allocate workqueue before kernel_bindChris Lew
A null pointer dereference in qrtr_ns_data_ready() is seen if a client opens a qrtr socket before qrtr_ns_init() can bind to the control port. When the control port is bound, the ENETRESET error will be broadcasted and clients will close their sockets. This results in DEL_CLIENT packets being sent to the ns and qrtr_ns_data_ready() being called without the workqueue being allocated. Allocate the workqueue before setting sk_data_ready and binding to the control port. This ensures that the work and workqueue structs are allocated and initialized before qrtr_ns_data_ready can be called. Fixes: 0c2204a4ad71 ("net: qrtr: Migrate nameservice to kernel from userspace") Signed-off-by: Chris Lew <clew@codeaurora.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30Merge branch 'mptcp-a-bunch-of-fixes'David S. Miller
Paolo Abeni says: ==================== mptcp: a bunch of fixes This patch series pulls together a few bugfixes for MPTCP bug observed while doing stress-test with apache bench - forced to use MPTCP and multiple subflows. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30mptcp: remove msk from the token container at destruction time.Paolo Abeni
Currently we remote the msk from the token container only via mptcp_close(). The MPTCP master socket can be destroyed also via other paths (e.g. if not yet accepted, when shutting down the listener socket). When we hit the latter scenario, dangling msk references are left into the token container, leading to memory corruption and/or UaF. This change addresses the issue by moving the token removal into the msk destructor. Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30mptcp: fix race between MP_JOIN and closePaolo Abeni
If a MP_JOIN subflow completes the 3whs while another CPU is closing the master msk, we can hit the following race: CPU1 CPU2 close() mptcp_close subflow_syn_recv_sock mptcp_token_get_sock mptcp_finish_join inet_sk_state_load mptcp_token_destroy inet_sk_state_store(TCP_CLOSE) __mptcp_flush_join_list() mptcp_sock_graft list_add_tail sk_common_release sock_orphan() <socket free> The MP_JOIN socket will be leaked. Additionally we can hit UaF for the msk 'struct socket' referenced via the 'conn' field. This change try to address the issue introducing some synchronization between the MP_JOIN 3whs and mptcp_close via the join_list spinlock. If we detect the msk is closing the MP_JOIN socket is closed, too. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30mptcp: fix unblocking connect()Paolo Abeni
Currently unblocking connect() on MPTCP sockets fails frequently. If mptcp_stream_connect() is invoked to complete a previously attempted unblocking connection, it will still try to create the first subflow via __mptcp_socket_create(). If the 3whs is completed and the 'can_ack' flag is already set, the latter will fail with -EINVAL. This change addresses the issue checking for pending connect and delegating the completion to the first subflow. Additionally do msk addresses and sk_state changes only when needed. Fixes: 2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30net/sched: act_ct: add nat mangle action only for NAT-conntrackwenxu
Currently add nat mangle action with comparing invert and orig tuple. It is better to check IPS_NAT_MASK flags first to avoid non necessary memcmp for non-NAT conntrack. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30devinet: fix memleak in inetdev_init()Yang Yingliang
When devinet_sysctl_register() failed, the memory allocated in neigh_parms_alloc() should be freed. Fixes: 20e61da7ffcf ("ipv4: fail early when creating netdev named all or default") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30virtio_vsock: Fix race condition in virtio_transport_recv_pktJia He
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the server on the guest, there will be a panic on a ThunderX2 (armv8a server): [ 463.718844] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 463.718848] Mem abort info: [ 463.718849] ESR = 0x96000044 [ 463.718852] EC = 0x25: DABT (current EL), IL = 32 bits [ 463.718853] SET = 0, FnV = 0 [ 463.718854] EA = 0, S1PTW = 0 [ 463.718855] Data abort info: [ 463.718856] ISV = 0, ISS = 0x00000044 [ 463.718857] CM = 0, WnR = 1 [ 463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008f6f6e9000 [ 463.718861] [0000000000000000] pgd=0000000000000000 [ 463.718866] Internal error: Oops: 96000044 [#1] SMP [...] [ 463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139 [ 463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018 [ 463.718982] pstate: 60400009 (nZCv daif +PAN -UAO) [ 463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719000] sp : ffff80002dbe3c40 [...] [ 463.719025] Call trace: [ 463.719030] virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719034] vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock] [ 463.719041] vhost_worker+0x100/0x1a0 [vhost] [ 463.719048] kthread+0x128/0x130 [ 463.719052] ret_from_fork+0x10/0x18 The race condition is as follows: Task1 Task2 ===== ===== __sock_release virtio_transport_recv_pkt __vsock_release vsock_find_bound_socket (found sk) lock_sock_nested vsock_remove_sock sock_orphan sk_set_socket(sk, NULL) sk->sk_shutdown = SHUTDOWN_MASK ... release_sock lock_sock virtio_transport_recv_connecting sk->sk_socket->state (panic!) The root cause is that vsock_find_bound_socket can't hold the lock_sock, so there is a small race window between vsock_find_bound_socket() and lock_sock(). If __vsock_release() is running in another task, sk->sk_socket will be set to NULL inadvertently. This fixes it by checking sk->sk_shutdown(suggested by Stefano) after lock_sock since sk->sk_shutdown is set to SHUTDOWN_MASK under the protection of lock_sock_nested. Signed-off-by: Jia He <justin.he@arm.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30Merge tag 'powerpc-5.7-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - a fix for the recent change to how we restore non-volatile GPRs, which broke our emulation of reading from the DSCR (Data Stream Control Register). - a fix for the recent rewrite of interrupt/syscall exit in C, we need to exclude KCOV from that code, otherwise it can lead to unrecoverable faults. Thanks to Daniel Axtens. * tag 'powerpc-5.7-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: Disable sanitisers for C syscall/interrupt entry/exit code powerpc/64s: Fix restore of NV GPRs after facility unavailable exception
2020-05-30Merge tag 'gpio-v5.7-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: "Here are some (very) late fixes for GPIO, none of them very serious except the one tagged for stable for enabling IRQ on open drain lines: - Fix probing of mvebu chips without PWM - Fix error path on ida_get_simple() on the exar driver - Notify userspace properly about line status changes when flags are changed on lines. - Fix a sleeping while holding spinlock in the mellanox driver. - Fix return value of the PXA and Kona probe calls. - Fix IRQ locking of open drain lines, it is fine to have IRQs on open drain lines flagged for output" * tag 'gpio-v5.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio: fix locking open drain IRQ lines gpio: bcm-kona: Fix return value of bcm_kona_gpio_probe() gpio: pxa: Fix return value of pxa_gpio_probe() gpio: mlxbf2: Fix sleeping while holding spinlock gpiolib: notify user-space about line status changes after flags are set gpio: exar: Fix bad handling for ida_simple_get error path gpio: mvebu: Fix probing for chips without PWM
2020-05-29drivers/net/ibmvnic: Update VNIC protocol version reportingThomas Falcon
VNIC protocol version is reported in big-endian format, but it is not byteswapped before logging. Fix that, and remove version comparison as only one protocol version exists at this time. Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29NFC: st21nfca: add missed kfree_skb() in an error pathChuhong Yuan
st21nfca_tm_send_atr_res() misses to call kfree_skb() in an error path. Add the missed function call to fix it. Fixes: 1892bf844ea0 ("NFC: st21nfca: Adding P2P support to st21nfca in Initiator & Target mode") Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29neigh: fix ARP retransmit timer guardHangbin Liu
In commit 19e16d220f0a ("neigh: support smaller retrans_time settting") we add more accurate control for ARP and NS. But for ARP I forgot to update the latest guard in neigh_timer_handler(), then the next retransmit would be reset to jiffies + HZ/2 if we set the retrans_time less than 500ms. Fix it by setting the time_before() check to HZ/100. IPv6 does not have this issue. Reported-by: Jianwen Ji <jiji@redhat.com> Fixes: 19e16d220f0a ("neigh: support smaller retrans_time settting") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29Merge tag 'mlx5-fixes-2020-05-28' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2020-05-28 This series introduces some fixes to mlx5 driver. v1->v2: - Fix bad sha1, Jakub. - Added one more patch by Pablo. net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta() Nothing major, the only patch worth mentioning is the suspend/resume crash fix by adding the missing pci device handlers, the fix is very straight forward and as Dexuan already expressed, the patch is important for Azure users to avoid crash on VM hibernation, patch is marked for -stable v4.6 below. Conflict note: ('net/mlx5e: Fix MLX5_TC_CT dependencies') has a trivial one line conflict with current net-next, which can be resolved by simply using the line from net-next. Please pull and let me know if there is any problem. For -stable v4.6 ('net/mlx5: Fix crash upon suspend/resume') For -stable v5.6 ('net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()') ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29Merge tag 'armsoc-fixes-v5.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "This time there is one fix for the error path in the mediatek cmdq driver (used by their video driver) and a couple of devicetree fixes, mostly for 32-bit ARM, and fairly harmless: - On OMAP2 there were a few regressions in the ethernet drivers, one of them leading to an external abort trap - One Raspberry Pi version had a misconfigured LED - Interrupts on Broadcom NSP were slightly misconfigured - One i.MX6q board had issues with graphics mode setting - On mmp3 there are some minor fixes that were submitted for v5.8 with a cc:stable tag, so I ended up picking them up here as well - The Mediatek Video Codec needs to run at a higher frequency than configured originally" * tag 'armsoc-fixes-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: ARM: dts: mmp3: Drop usb-nop-xceiv from HSIC phy ARM: dts: mmp3-dell-ariel: Fix the SPI devices ARM: dts: mmp3: Use the MMP3 compatible string for /clocks ARM: dts: bcm: HR2: Fix PPI interrupt types ARM: dts: bcm2835-rpi-zero-w: Fix led polarity ARM: dts/imx6q-bx50v3: Set display interface clock parents soc: mediatek: cmdq: return send msg error code arm64: dts: mt8173: fix vcodec-enc clock ARM: dts: Fix wrong mdio clock for dm814x ARM: dts: am437x: fix networking on boards with ksz9031 phy ARM: dts: am57xx: fix networking on boards with ksz9031 phy
2020-05-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2020-05-29 The following pull-request contains BPF updates for your *net* tree. We've added 6 non-merge commits during the last 7 day(s) which contain a total of 4 files changed, 55 insertions(+), 34 deletions(-). The main changes are: 1) minor verifier fix for fmod_ret progs, from Alexei. 2) af_xdp overflow check, from Bjorn. 3) minor verifier fix for 32bit assignment, from John. 4) powerpc has non-overlapping addr space, from Petr. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29Merge tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Cache tiering and cap handling fixups, both marked for stable" * tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-client: ceph: flush release queue when handling caps for unknown inode libceph: ignore pool overlay and cache logic on redirects