summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-09-28iommu/amd: Remove domain->updatedJoerg Roedel
This struct member was used to track whether a domain change requires updates to the device-table and IOMMU cache flushes. The problem is, that access to this field is racy since locking in the common mapping code-paths has been eliminated. Move the updated field to the stack to get rid of all potential races and remove the field from the struct. Fixes: 92d420ec028d ("iommu/amd: Relax locking in dma_ops path") Reviewed-by: Filippo Sironi <sironi@amazon.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2019-09-27Merge branch 'next-integrity' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity updates from Mimi Zohar: "The major feature in this time is IMA support for measuring and appraising appended file signatures. In addition are a couple of bug fixes and code cleanup to use struct_size(). In addition to the PE/COFF and IMA xattr signatures, the kexec kernel image may be signed with an appended signature, using the same scripts/sign-file tool that is used to sign kernel modules. Similarly, the initramfs may contain an appended signature. This contained a lot of refactoring of the existing appended signature verification code, so that IMA could retain the existing framework of calculating the file hash once, storing it in the IMA measurement list and extending the TPM, verifying the file's integrity based on a file hash or signature (eg. xattrs), and adding an audit record containing the file hash, all based on policy. (The IMA support for appended signatures patch set was posted and reviewed 11 times.) The support for appended signature paves the way for adding other signature verification methods, such as fs-verity, based on a single system-wide policy. The file hash used for verifying the signature and the signature, itself, can be included in the IMA measurement list" * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: ima: ima_api: Use struct_size() in kzalloc() ima: use struct_size() in kzalloc() sefltest/ima: support appended signatures (modsig) ima: Fix use after free in ima_read_modsig() MODSIGN: make new include file self contained ima: fix freeing ongoing ahash_request ima: always return negative code for error ima: Store the measurement again when appraising a modsig ima: Define ima-modsig template ima: Collect modsig ima: Implement support for module-style appended signatures ima: Factor xattr_verify() out of ima_appraise_measurement() ima: Add modsig appraise_type option for module-style appended signatures integrity: Select CONFIG_KEYS instead of depending on it PKCS#7: Introduce pkcs7_get_digest() PKCS#7: Refactor verify_pkcs7_signature() MODSIGN: Export module signature definitions ima: initialize the "template" field with the default template
2019-09-27Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "Highlights: - Add a new knfsd file cache, so that we don't have to open and close on each (NFSv2/v3) READ or WRITE. This can speed up read and write in some cases. It also replaces our readahead cache. - Prevent silent data loss on write errors, by treating write errors like server reboots for the purposes of write caching, thus forcing clients to resend their writes. - Tweak the code that allocates sessions to be more forgiving, so that NFSv4.1 mounts are less likely to hang when a server already has a lot of clients. - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be limited only by the backend filesystem and the maximum RPC size. - Allow the server to enforce use of the correct kerberos credentials when a client reclaims state after a reboot. And some miscellaneous smaller bugfixes and cleanup" * tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits) sunrpc: clean up indentation issue nfsd: fix nfs read eof detection nfsd: Make nfsd_reset_boot_verifier_locked static nfsd: degraded slot-count more gracefully as allocation nears exhaustion. nfsd: handle drc over-allocation gracefully. nfsd: add support for upcall version 2 nfsd: add a "GetVersion" upcall for nfsdcld nfsd: Reset the boot verifier on all write I/O errors nfsd: Don't garbage collect files that might contain write errors nfsd: Support the server resetting the boot verifier nfsd: nfsd_file cache entries should be per net namespace nfsd: eliminate an unnecessary acl size limit Deprecate nfsd fault injection nfsd: remove duplicated include from filecache.c nfsd: Fix the documentation for svcxdr_tmpalloc() nfsd: Fix up some unused variable warnings nfsd: close cached files prior to a REMOVE or RENAME that would replace target nfsd: rip out the raparms cache nfsd: have nfsd_test_lock use the nfsd_file cache nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache ...
2019-09-27Merge tag 'virtio-fs-5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse virtio-fs support from Miklos Szeredi: "Virtio-fs allows exporting directory trees on the host and mounting them in guest(s). This isn't actually a new filesystem, but a glue layer between the fuse filesystem and a virtio based back-end. It's similar in functionality to the existing virtio-9p solution, but significantly faster in benchmarks and has better POSIX compliance. Further permformance improvements can be achieved by sharing the page cache between host and guest, allowing for faster I/O and reduced memory use. Kata Containers have been including the out-of-tree virtio-fs (with the shared page cache patches as well) since version 1.7 as an experimental feature. They have been active in development and plan to switch from virtio-9p to virtio-fs as their default solution. There has been interest from other sources as well. The userspace infrastructure is slated to be merged into qemu once the kernel part hits mainline. This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi" * tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtio-fs: add virtiofs filesystem virtio-fs: add Documentation/filesystems/virtiofs.rst fuse: reserve values for mapping protocol
2019-09-27Merge tag '9p-for-5.4' of git://github.com/martinetd/linuxLinus Torvalds
Pull 9p updates from Dominique Martinet: "Some of the usual small fixes and cleanup. Small fixes all around: - avoid overlayfs copy-up for PRIVATE mmaps - KUMSAN uninitialized warning for transport error - one syzbot memory leak fix in 9p cache - internal API cleanup for v9fs_fill_super" * tag '9p-for-5.4' of git://github.com/martinetd/linux: 9p/vfs_super.c: Remove unused parameter data in v9fs_fill_super 9p/cache.c: Fix memory leak in v9fs_cache_session_get_cookie 9p: Transport error uninitialized 9p: avoid attaching writeback_fid on mmap with type PRIVATE
2019-09-27Merge tag 'riscv/for-v5.4-rc1-b' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull more RISC-V updates from Paul Walmsley: "Some additional RISC-V updates. This includes one significant fix: - Prevent interrupts from being unconditionally re-enabled during exception handling if they were disabled in the context in which the exception occurred Also a few other fixes: - Fix a build error when sparse memory support is manually enabled - Prevent CPUs beyond CONFIG_NR_CPUS from being enabled in early boot And a few minor improvements: - DT improvements: in the FU540 SoC DT files, improve U-Boot compatibility by adding an "ethernet0" alias, drop an unnecessary property from the DT files, and add support for the PWM device - KVM preparation: add a KVM-related macro for future RISC-V KVM support, and export some symbols required to build KVM support as modules - defconfig additions: build more drivers by default for QEMU configurations" * tag 'riscv/for-v5.4-rc1-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Avoid interrupts being erroneously enabled in handle_exception() riscv: dts: sifive: Drop "clock-frequency" property of cpu nodes riscv: dts: sifive: Add ethernet0 to the aliases node RISC-V: Export kernel symbols for kvm KVM: RISC-V: Add KVM_REG_RISCV for ONE_REG interface arch/riscv: disable excess harts before picking main boot hart RISC-V: Enable VIRTIO drivers in RV64 and RV32 defconfig RISC-V: Fix building error when CONFIG_SPARSEMEM_MANUAL=y riscv: dts: Add DT support for SiFive FU540 PWM driver
2019-09-27Merge tag 'nios2-v5.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 Pull nios2 fix from Ley Foon Tan: "Make sure the command line buffer is NUL-terminated" * tag 'nios2-v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2: nios2: force the string buffer NULL-terminated
2019-09-27Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull more KVM updates from Paolo Bonzini: "x86 KVM changes: - The usual accuracy improvements for nested virtualization - The usual round of code cleanups from Sean - Added back optimizations that were prematurely removed in 5.2 (the bare minimum needed to fix the regression was in 5.3-rc8, here comes the rest) - Support for UMWAIT/UMONITOR/TPAUSE - Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM - Tell Windows guests if SMT is disabled on the host - More accurate detection of vmexit cost - Revert a pvqspinlock pessimization" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (56 commits) KVM: nVMX: cleanup and fix host 64-bit mode checks KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386 KVM: x86: Don't check kvm_rebooting in __kvm_handle_fault_on_reboot() KVM: x86: Drop ____kvm_handle_fault_on_reboot() KVM: VMX: Add error handling to VMREAD helper KVM: VMX: Optimize VMX instruction error and fault handling KVM: x86: Check kvm_rebooting in kvm_spurious_fault() KVM: selftests: fix ucall on x86 Revert "locking/pvqspinlock: Don't wait if vCPU is preempted" kvm: nvmx: limit atomic switch MSRs kvm: svm: Intercept RDPRU kvm: x86: Add "significant index" flag to a few CPUID leaves KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zero KVM: x86/mmu: Explicitly track only a single invalid mmu generation KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call" KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete page first"" KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all pages"" KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch"" KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"" KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints"" ...
2019-09-27Merge tag 'pwm/for-5.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm Pull pwm updates from Thierry Reding: "Besides one new driver being added for the PWM controller found in various Spreadtrum SoCs, this series of changes brings a slew of, mostly minor, fixes and cleanups for existing drivers, as well as some enhancements to the core code. Lastly, Uwe is added to the PWM subsystem entry of the MAINTAINERS file, making official his role as a reviewer" * tag 'pwm/for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (34 commits) MAINTAINERS: Add myself as reviewer for the PWM subsystem MAINTAINERS: Add patchwork link for PWM entry MAINTAINERS: Add a selection of PWM related keywords to the PWM entry pwm: mediatek: Add MT7629 compatible string dt-bindings: pwm: Update bindings for MT7629 SoC pwm: mediatek: Update license and switch to SPDX tag pwm: mediatek: Use pwm_mediatek as common prefix pwm: mediatek: Allocate the clks array dynamically pwm: mediatek: Remove the has_clks field pwm: mediatek: Drop the check for of_device_get_match_data() pwm: atmel: Consolidate driver data initialization pwm: atmel: Remove unneeded check for match data pwm: atmel: Remove platform_device_id and use only dt bindings pwm: stm32-lp: Add check in case requested period cannot be achieved pwm: Ensure pwm_apply_state() doesn't modify the state argument pwm: fsl-ftm: Don't update the state for the caller of pwm_apply_state() pwm: sun4i: Don't update the state for the caller of pwm_apply_state() pwm: rockchip: Don't update the state for the caller of pwm_apply_state() pwm: Let pwm_get_state() return the last implemented state pwm: Introduce local struct pwm_chip in pwm_apply_state() ...
2019-09-27Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more io_uring updates from Jens Axboe: "Just two things in here: - Improvement to the io_uring CQ ring wakeup for batched IO (me) - Fix wrong comparison in poll handling (yangerkun) I realize the first one is a little late in the game, but it felt pointless to hold it off until the next release. Went through various testing and reviews with Pavel and peterz" * tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block: io_uring: make CQ ring wakeups be more efficient io_uring: compare cached_cq_tail with cq.head in_io_uring_poll
2019-09-27net: tap: clean up an indentation issueColin Ian King
There is a statement that is indented too deeply, remove the extraneous tab. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27Merge tag 'for-linus-2019-09-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few fixes/changes to round off this merge window. This contains: - Small series making some functional tweaks to blk-iocost (Tejun) - Elevator switch locking fix (Ming) - Kill redundant call in blk-wbt (Yufen) - Fix flush timeout handling (Yufen)" * tag 'for-linus-2019-09-27' of git://git.kernel.dk/linux-block: block: fix null pointer dereference in blk_mq_rq_timed_out() rq-qos: get rid of redundant wbt_update_limits() iocost: bump up default latency targets for hard disks iocost: improve nr_lagging handling iocost: better trace vrate changes block: don't release queue's sysfs lock during switching elevator blk-mq: move lockdep_assert_held() into elevator_exit
2019-09-27nfp: abm: fix memory leak in nfp_abm_u32_knode_replaceNavid Emamdoost
In nfp_abm_u32_knode_replace if the allocation for match fails it should go to the error handling instead of returning. Updated other gotos to have correct errno returned, too. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27mmc: host: sdhci-pci: Add Genesys Logic GL975x supportBen Chuang
Add support for the GL9750 and GL9755 chipsets. Enable v4 mode and wait 5ms after set 1.8V signal enable for GL9750/ GL9755. Fix the value of SDHCI_MAX_CURRENT register and use the vendor tuning flow for GL9750. Co-developed-by: Michael K Johnson <johnsonm@danlj.org> Signed-off-by: Michael K Johnson <johnsonm@danlj.org> Signed-off-by: Ben Chuang <ben.chuang@genesyslogic.com.tw> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-09-27tcp: better handle TCP_USER_TIMEOUT in SYN_SENT stateEric Dumazet
Yuchung Cheng and Marek Majkowski independently reported a weird behavior of TCP_USER_TIMEOUT option when used at connect() time. When the TCP_USER_TIMEOUT is reached, tcp_write_timeout() believes the flow should live, and the following condition in tcp_clamp_rto_to_user_timeout() programs one jiffie timers : remaining = icsk->icsk_user_timeout - elapsed; if (remaining <= 0) return 1; /* user timeout has passed; fire ASAP */ This silly situation ends when the max syn rtx count is reached. This patch makes sure we honor both TCP_SYNCNT and TCP_USER_TIMEOUT, avoiding these spurious SYN packets. Fixes: b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Reported-by: Marek Majkowski <marek@cloudflare.com> Cc: Jon Maxwell <jmaxwell37@gmail.com> Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2 Acked-by: Jon Maxwell <jmaxwell37@gmail.com> Tested-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Marek Majkowski <marek@cloudflare.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27sk_buff: drop all skb extensions on free and skb scrubbingFlorian Westphal
Now that we have a 3rd extension, add a new helper that drops the extension space and use it when we need to scrub an sk_buff. At this time, scrubbing clears secpath and bridge netfilter data, but retains the tc skb extension, after this patch all three get cleared. NAPI reuse/free assumes we can only have a secpath attached to skb, but it seems better to clear all extensions there as well. v2: add unlikely hint (Eric Dumazet) Fixes: 95a7233c452a ("net: openvswitch: Set OvS recirc_id from tc chain index") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidthKevin(Yudong) Yang
There was a bug in the previous logic that attempted to ensure gain cycling gets inflight above BDP even for small BDPs. This code correctly raised and lowered target inflight values during the gain cycle. And this code correctly ensured that cwnd was raised when probing bandwidth. However, it did not correspondingly ensure that cwnd was *not* raised in this way when *not* probing for bandwidth. The result was that small-BDP flows that were always cwnd-bound could go for many cycles with a fixed cwnd, and not probe or yield bandwidth at all. This meant that multiple small-BDP flows could fail to converge in their bandwidth allocations. Fixes: 3c346b233c68 ("tcp_bbr: fix bw probing to raise in-flight data for very small BDPs") Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27Merge branch 'for-5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux Pull thermal management updates from Zhang Rui: - Add Amit Kucheria as thermal subsystem Reviewer (Amit Kucheria) - Fix a use after free bug when unregistering thermal zone devices (Ido Schimmel) - Fix thermal core framework to use put_device() when device_register() fails (Yue Hu) - Enable intel_pch_thermal and MMIO RAPL support for Intel Icelake platform (Srinivas Pandruvada) - Add clock operations in qorip thermal driver, for some platforms with clock control like i.MX8MQ (Anson Huang) - A couple of trivial fixes and cleanups for thermal core and different soc thermal drivers (Amit Kucheria, Christophe JAILLET, Chuhong Yuan, Fuqian Huang, Kelsey Skunberg, Nathan Huckleberry, Rishi Gupta, Srinivas Kandagatla) * 'for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: MAINTAINERS: Add Amit Kucheria as reviewer for thermal thermal: Add some error messages thermal: Fix use-after-free when unregistering thermal zone device thermal/drivers/core: Use put_device() if device_register() fails thermal_hwmon: Sanitize thermal_zone type thermal: intel: Use dev_get_drvdata thermal: intel: int3403: replace printk(KERN_WARN...) with pr_warn(...) thermal: intel: int340x_thermal: Remove unnecessary acpi_has_method() uses thermal: int340x: processor_thermal: Add Ice Lake support drivers: thermal: qcom: tsens: Fix memory leak from qfprom read thermal: tegra: Fix a typo thermal: rcar_gen3_thermal: Replace devm_add_action() followed by failure action with devm_add_action_or_reset() thermal: armada: Fix -Wshift-negative-value dt-bindings: thermal: qoriq: Add optional clocks property thermal: qoriq: Use __maybe_unused instead of #if CONFIG_PM_SLEEP thermal: qoriq: Use devm_platform_ioremap_resource() instead of of_iomap() thermal: qoriq: Fix error path of calling qoriq_tmu_register_tmu_zone fail thermal: qoriq: Add clock operations drivers: thermal: processor_thermal_device: Export sysfs interface for TCC offset
2019-09-27Merge branch 'mlxsw-Various-fixes'David S. Miller
Ido Schimmel says: ==================== mlxsw: Various fixes This patchset includes two small fixes for the mlxsw driver and one patch which clarifies recently introduced devlink-trap documentation. Patch #1 clears the port's VLAN filters during port initialization. This ensures that the drop reason reported to the user is consistent. The problem is explained in detail in the commit message. Patch #2 clarifies the description of one of the traps exposed via devlink-trap. Patch #3 from Danielle forbids the installation of a tc filter with multiple mirror actions since this is not supported by the device. The failure is communicated to the user via extack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actionsDanielle Ratson
The ASIC can only mirror a packet to one port, but when user is trying to set more than one mirror action, it doesn't fail. Add a check if more than one mirror action was specified per rule and if so, fail for not being supported. Fixes: d0d13c1858a11 ("mlxsw: spectrum_acl: Add support for mirror action") Signed-off-by: Danielle Ratson <danieller@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27Documentation: Clarify trap's descriptionIdo Schimmel
Alex noted that the below description might not be obvious to all users. Clarify it by adding an example. Fixes: f3047ca01f12 ("Documentation: Add devlink-trap documentation") Reported-by: Alex Kushnarov <alexanderk@mellanox.com> Reviewed-by: Alex Kushnarov <alexanderk@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27mlxsw: spectrum: Clear VLAN filters during port initializationIdo Schimmel
When a port is created, its VLAN filters are not cleared by the firmware. This causes tagged packets to be later dropped by the ingress STP filters, which default to DISCARD state. The above did not matter much until commit b5ce611fd96e ("mlxsw: spectrum: Add devlink-trap support") where we exposed the drop reason to users. Without this patch, the drop reason users will see is not consistent. If a port is enslaved to a VLAN-aware bridge and a packet with an invalid VLAN tries to ingress the bridge, it will be dropped due to ingress STP filter. If the VLAN is later enabled and then disabled, the packet will be dropped by the ingress VLAN filter despite the above being a seemingly NOP operation. Fix this by clearing all the VLAN filters during port initialization. Adjust the test accordingly. Fixes: b5ce611fd96e ("mlxsw: spectrum: Add devlink-trap support") Reported-by: Alex Kushnarov <alexanderk@mellanox.com> Tested-by: Alex Kushnarov <alexanderk@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net: ena: clean up indentation issueColin Ian King
There memset is indented incorrectly, remove the extraneous tabs. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27NFC: st95hf: clean up indentation issueColin Ian King
The return statement is indented incorrectly, add in a missing tab and remove an extraneous space after the return Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27mmc: tegra: Implement ->set_dma_mask()Nicolin Chen
The SDHCI controller on Tegra186 supports 40-bit addressing, which is usually enough to address all of system memory. However, if the SDHCI controller is behind an IOMMU, the address space can go beyond. This happens on Tegra186 and later where the ARM SMMU has an input address space of 48 bits. If the DMA API is backed by this ARM SMMU, the top- down IOVA allocator will cause IOV addresses to be returned that the SDHCI controller cannot access. Unfortunately, prior to the introduction of the ->set_dma_mask() host operation, the SDHCI core would set either a 64-bit DMA mask if the controller claimed to support 64-bit addressing, or a 32-bit DMA mask otherwise. Since the full 64 bits cannot be addressed on Tegra, this had to be worked around in commit 68481a7e1c84 ("mmc: tegra: Mark 64 bit dma broken on Tegra186") by setting the SDHCI_QUIRK2_BROKEN_64_BIT_DMA quirk, which effectively restricts the DMA mask to 32 bits. One disadvantage of this is that dma_map_*() APIs will now try to use the swiotlb to bounce DMA to addresses beyond of the controller's DMA mask. This in turn caused degraded performance and can lead to situations where the swiotlb buffer is exhausted, which in turn leads to DMA transfers to fail. With the recent introduction of the ->set_dma_mask() host operation, this can now be properly fixed. For each generation of Tegra, the exact supported DMA mask can be configured. This kills two birds with one stone: it avoids the use of bounce buffers because system memory never exceeds the addressable memory range of the SDHCI controllers on these devices, and at the same time when an IOMMU is involved, it prevents IOV addresses from being allocated beyond the addressible range of the controllers. Since the DMA mask is now properly handled, the 64-bit DMA quirk can be removed. Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com> [treding@nvidia.com: provide more background in commit message] Tested-by: Nicolin Chen <nicoleotsuka@gmail.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Cc: stable@vger.kernel.org # v4.15 + Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-09-27mmc: sdhci: Let drivers define their DMA maskAdrian Hunter
Add host operation ->set_dma_mask() so that drivers can define their own DMA masks. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Tested-by: Nicolin Chen <nicoleotsuka@gmail.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Cc: stable@vger.kernel.org # v4.15 + Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-09-27mmc: sdhci-of-esdhc: set DMA snooping based on DMA coherenceRussell King
We must not unconditionally set the DMA snoop bit; if the DMA API is assuming that the device is not DMA coherent, and the device snoops the CPU caches, the device can see stale cache lines brought in by speculative prefetch. This leads to the device seeing stale data, potentially resulting in corrupted data transfers. Commonly, this results in a descriptor fetch error such as: mmc0: ADMA error mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x000040d8 mmc0: sdhci: Timeout: 0x00000003 | Int stat: 0x00000001 mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 mmc0: sdhci: Host ctl2: 0x00000000 mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c mmc0: sdhci: ============================================ mmc0: error -5 whilst initialising SD card but can lead to other errors, and potentially direct the SDHCI controller to read/write data to other memory locations (e.g. if a valid descriptor is visible to the device in a stale cache line.) Fix this by ensuring that the DMA snoop bit corresponds with the behaviour of the DMA API. Since the driver currently only supports DT, use of_dma_is_coherent(). Note that device_get_dma_attr() can not be used as that risks re-introducing this bug if/when the driver is converted to ACPI. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-09-27mmc: sdhci: improve ADMA error reportingRussell King
ADMA errors are potentially data corrupting events; although we print the register state, we do not usefully print the ADMA descriptors. Worse than that, we print them by referencing their virtual address which is meaningless when the register state gives us the DMA address of the failing descriptor. Print the ADMA descriptors giving their DMA addresses rather than their virtual addresses, and print them using SDHCI_DUMP() rather than DBG(). We also do not show the correct value of the interrupt status register; the register dump shows the current value, after we have cleared the pending interrupts we are going to service. What is more useful is to print the interrupts that _were_ pending at the time the ADMA error was encountered. Fix that too. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-09-27net: phy: micrel: add Asym Pause workaround for KSZ9021Hans Andersson
The Micrel KSZ9031 PHY may fail to establish a link when the Asymmetric Pause capability is set. This issue is described in a Silicon Errata (DS80000691D or DS80000692D), which advises to always disable the capability. Micrel KSZ9021 has no errata, but has the same issue with Asymmetric Pause. This patch apply the same workaround as the one for KSZ9031. Fixes: 3aed3e2a143c ("net: phy: micrel: add Asym Pause workaround") Signed-off-by: Hans Andersson <hans.andersson@cellavision.se> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net: socionext: ave: Avoid using netdev_err() before calling register_netdev()Kunihiko Hayashi
Until calling register_netdev(), ndev->dev_name isn't specified, and netdev_err() displays "(unnamed net_device)". ave 65000000.ethernet (unnamed net_device) (uninitialized): invalid phy-mode setting ave: probe of 65000000.ethernet failed with error -22 This replaces netdev_err() with dev_err() before calling register_netdev(). Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27ptp: correctly disable flags on old ioctlsJacob Keller
Commit 415606588c61 ("PTP: introduce new versions of IOCTLs", 2019-09-13) introduced new versions of the PTP ioctls which actually validate that the flags are acceptable values. As part of this, it cleared the flags value using a bitwise and+negation, in an attempt to prevent the old ioctl from accidentally enabling new features. This is incorrect for a couple of reasons. First, it results in accidentally preventing previously working flags on the request ioctl. By clearing the "valid" flags, we now no longer allow setting the enable, rising edge, or falling edge flags. Second, if we add new additional flags in the future, they must not be set by the old ioctl. (Since the flag wasn't checked before, we could potentially break userspace programs which sent garbage flag data. The correct way to resolve this is to check for and clear all but the originally valid flags. Create defines indicating which flags are correctly checked and interpreted by the original ioctls. Use these to clear any bits which will not be correctly interpreted by the original ioctls. In the future, new flags must be added to the VALID_FLAGS macros, but *not* to the V1_VALID_FLAGS macros. In this way, new features may be exposed over the v2 ioctls, but without breaking previous userspace which happened to not clear the flags value properly. The old ioctl will continue to behave the same way, while the new ioctl gains the benefit of using the flags fields. Cc: Richard Cochran <richardcochran@gmail.com> Cc: Felipe Balbi <felipe.balbi@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Christopher Hall <christopher.s.hall@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27lib: dimlib: fix help text typosRandy Dunlap
Fix help text typos for DIMLIB. Fixes: 4f75da3666c0 ("linux/dim: Move implementation to .c files") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Uwe Kleine-König <uwe@kleine-koenig.org> Cc: Tal Gilboa <talgi@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net: dsa: microchip: Always set regmap stride to 1Marek Vasut
The regmap stride is set to 1 for regmap describing 8bit registers already. However, for 16/32/64bit registers, the stride is 2/4/8 respectively. This is not correct, as the switch protocol supports unaligned register reads and writes and the KSZ87xx even uses such unaligned register accesses to read e.g. MIB counter. This patch fixes MIB counter access on KSZ87xx. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: George McCollister <george.mccollister@gmail.com> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Cc: Woojung Huh <woojung.huh@microchip.com> Fixes: 46558d601cb6 ("net: dsa: microchip: Initial SPI regmap support") Fixes: 255b59ad0db2 ("net: dsa: microchip: Factor out regmap config generation into common header") Reviewed-by: George McCollister <george.mccollister@gmail.com> Tested-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27Merge tag 'linux-watchdog-5.4-rc1' of ↵Linus Torvalds
git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - addition of AST2600, i.MX7ULP and F81803 watchdog support - removal of the w90x900 and ks8695 drivers - ziirave_wdt improvements - small fixes and improvements * tag 'linux-watchdog-5.4-rc1' of git://www.linux-watchdog.org/linux-watchdog: (51 commits) watchdog: f71808e_wdt: Add F81803 support watchdog: qcom: remove unnecessary variable from private storage watchdog: qcom: support pre-timeout when the bark irq is available watchdog: imx_sc: this patch just fixes whitespaces watchdog: apseed: Add access_cs0 option for alt-boot watchdog: aspeed: add support for dual boot watchdog: orion_wdt: use timer1 as a pretimeout watchdog: Add i.MX7ULP watchdog support dt-bindings: watchdog: Add i.MX7ULP bindings dt-bindings: watchdog: sun4i: Add the watchdog clock dt-bindings: watchdog: sun4i: Add the watchdog interrupts dt-bindings: watchdog: Convert Allwinner watchdog to a schema dt-bindings: watchdog: Add YAML schemas for the generic watchdog bindings watchdog: aspeed: Add support for AST2600 dt-bindings: watchdog: Add ast2600 compatible watchdog: ziirave_wdt: Update checked I2C functionality mask watchdog: ziirave_wdt: Drop ziirave_firm_write_block_data() watchdog: ziirave_wdt: Fix DOWNLOAD_START payload watchdog: ziirave_wdt: Drop status polling code watchdog: ziirave_wdt: Fix RESET_PROCESSOR payload ...
2019-09-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Add NFT_CHAIN_POLICY_UNSET to replace hardcoded -1 to specify that the chain policy is unset. The chain policy field is actually defined as an 8-bit unsigned integer. 2) Remove always true condition reported by smatch in chain policy check. 3) Fix element lookup on dynamic sets, from Florian Westphal. 4) Use __u8 in ebtables uapi header, from Masahiro Yamada. 5) Bogus EBUSY when removing flowtable after chain flush, from Laura Garcia Liebana. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27Merge tag 'drm-next-2019-09-27' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Dave Airlie: "Fixes built up over the past 1.5 weeks or so, it's two weeks of amdgpu, some core cleanups and some panfrost fixes. I also finally figured out why my desktop was slow to do a bunch of stuff (someone gave it an IPv6 address which can't reach anything!). core: - Some cleanups and fixes in the self-refresh helpers - Some cleanups and fixes in the atomic helpers amdgpu: - Fix a 64 bit divide - Prevent a memory leak in a failure case in dc - Load proper gfx firmware on navi14 variants - Add more navi12 and navi14 PCI ids - Misc fixes for renoir - Fix bandwidth issues with multiple displays on vega20 - Support for Dali - Fix a possible oops with KFD on hawaii - Fix for backlight level after resume on some APUs - Other misc fixes panfrost: - Multiple panfrost fixes for regulator support and page fault handling" * tag 'drm-next-2019-09-27' of git://anongit.freedesktop.org/drm/drm: (34 commits) drm/amd/display: prevent memory leak drm/amdgpu/gfx10: add support for wks firmware loading drm/amdgpu/display: include slab.h in dcn21_resource.c drm/amdgpu/display: fix 64 bit divide drm/panfrost: Prevent race when handling page fault drm/panfrost: Remove NULL checks for regulator drm/panfrost: Fix regulator_get_optional() misuse drm: Measure Self Refresh Entry/Exit times to avoid thrashing drm: Fix kerneldoc and remove unused struct member in self_refresh helper drm/atomic: Rename crtc_state->pageflip_flags to async_flip drm/atomic: Reject FLIP_ASYNC unconditionally drm/atomic: Take the atomic toys away from X drm/amdgpu: flag navi12 and 14 as experimental for 5.4 drm/kms: Duct-tape for mode object lifetime checks drm/amdgpu: add navi12 pci id drm/amdgpu: add navi14 PCI ID for work station SKU drm/amdkfd: Swap trap temporary registers in gfx10 trap handler drm/amd/powerplay: implement sysfs for getting dpm clock drm/amd/display: Restore backlight brightness after system resume drm/amd/display: Implement voltage limitation for dali ...
2019-09-27nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprsNavid Emamdoost
In nfp_flower_spawn_vnic_reprs in the loop if initialization or the allocations fail memory is leaked. Appropriate releases are added. Fixes: b94524529741 ("nfp: flower: add per repr private data for LAG offload") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprsNavid Emamdoost
In nfp_flower_spawn_phy_reprs, in the for loop over eth_tbl if any of intermediate allocations or initializations fail memory is leaked. requiered releases are added. Fixes: b94524529741 ("nfp: flower: add per repr private data for LAG offload") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net/sched: Set default of CONFIG_NET_TC_SKB_EXT to NPaul Blakey
This a new feature, it is preferred that it defaults to N. We will probe the feature support from userspace before actually using it. Fixes: 95a7233c452a ('net: openvswitch: Set OvS recirc_id from tc chain index') Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabledDavid Ahern
A user reported that vrf create fails when IPv6 is disabled at boot using 'ipv6.disable=1': https://bugzilla.kernel.org/show_bug.cgi?id=204903 The failure is adding fib rules at create time. Add RTNL_FAMILY_IP6MR to the check in vrf_fib_rule if ipv6_mod_enabled is disabled. Fixes: e4a38c0c4b27 ("ipv6: add vrf table handling code for ipv6 mcast") Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Patrick Ruddy <pruddy@vyatta.att-mail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27Merge tag 'ntb-5.4' of git://github.com/jonmason/ntbLinus Torvalds
Pull NTB updates from Jon Mason: "A few bugfixes and support for new AMD NTB hardware" * tag 'ntb-5.4' of git://github.com/jonmason/ntb: NTB: fix IDT Kconfig typos/spellos ntb_hw_amd: Add memory window support for new AMD hardware ntb_hw_amd: Add a new NTB PCI device ID NTB: ntb_transport: remove redundant assignment to rc ntb_hw_switchtec: make ntb_mw_set_trans() work when addr == 0 ntb: point to right memory window index
2019-09-27keys: Add Jarkko Sakkinen as co-maintainerJarkko Sakkinen
To address a major procedural concern on Linus's part the keyrings needs a co-maintainer. Suggested-by: David Howells <dhowells@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-09-27 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix libbpf's BTF dumper to not skip anonymous enum definitions, from Andrii. 2) Fix BTF verifier issues when handling the BTF of vmlinux, from Alexei. 3) Fix nested calls into bpf_event_output() from TCP sockops BPF programs, from Allan. 4) Fix NULL pointer dereference in AF_XDP's xsk map creation when allocation fails, from Jonathan. 5) Remove unneeded 64 byte alignment requirement of the AF_XDP UMEM headroom, from Bjorn. 6) Remove unused XDP_OPTIONS getsockopt() call which results in an error on older kernels, from Toke. 7) Fix a client/server race in tcp_rtt BPF kselftest case, from Stanislav. 8) Fix indentation issue in BTF's btf_enum_check_kflag_member(), from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27btrfs: qgroup: Fix reserved data space leak if we have multiple reserve callsQu Wenruo
[BUG] The following script can cause btrfs qgroup data space leak: mkfs.btrfs -f $dev mount $dev -o nospace_cache $mnt btrfs subv create $mnt/subv btrfs quota en $mnt btrfs quota rescan -w $mnt btrfs qgroup limit 128m $mnt/subv for (( i = 0; i < 3; i++)); do # Create 3 64M holes for latter fallocate to fail truncate -s 192m $mnt/subv/file xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null sync # it's supposed to fail, and each failure will leak at least 64M # data space xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null rm $mnt/subv/file sync done # Shouldn't fail after we removed the file xfs_io -f -c "falloc 0 64m" $mnt/subv/file [CAUSE] Btrfs qgroup data reserve code allow multiple reservations to happen on a single extent_changeset: E.g: btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M); btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M); btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M); Btrfs qgroup code has its internal tracking to make sure we don't double-reserve in above example. The only pattern utilizing this feature is in the main while loop of btrfs_fallocate() function. However btrfs_qgroup_reserve_data()'s error handling has a bug in that on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED flag but doesn't free previously reserved bytes. This bug has a two fold effect: - Clearing EXTENT_QGROUP_RESERVED ranges This is the correct behavior, but it prevents btrfs_qgroup_check_reserved_leak() to catch the leakage as the detector is purely EXTENT_QGROUP_RESERVED flag based. - Leak the previously reserved data bytes. The bug manifests when N calls to btrfs_qgroup_reserve_data are made and the last one fails, leaking space reserved in the previous ones. [FIX] Also free previously reserved data bytes when btrfs_qgroup_reserve_data fails. Fixes: 524725537023 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-27btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data spaceQu Wenruo
[BUG] Under the following case with qgroup enabled, if some error happened after we have reserved delalloc space, then in error handling path, we could cause qgroup data space leakage: From btrfs_truncate_block() in inode.c: ret = btrfs_delalloc_reserve_space(inode, &data_reserved, block_start, blocksize); if (ret) goto out; again: page = find_or_create_page(mapping, index, mask); if (!page) { btrfs_delalloc_release_space(inode, data_reserved, block_start, blocksize, true); btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true); ret = -ENOMEM; goto out; } [CAUSE] In the above case, btrfs_delalloc_reserve_space() will call btrfs_qgroup_reserve_data() and mark the io_tree range with EXTENT_QGROUP_RESERVED flag. In the error handling path, we have the following call stack: btrfs_delalloc_release_space() |- btrfs_free_reserved_data_space() |- btrsf_qgroup_free_data() |- __btrfs_qgroup_release_data(reserved=@reserved, free=1) |- qgroup_free_reserved_data(reserved=@reserved) |- clear_record_extent_bits(); |- freed += changeset.bytes_changed; However due to a completion bug, qgroup_free_reserved_data() will clear EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other than the correct BTRFS_I(inode)->io_tree. Since io_failure_tree is never marked with that flag, btrfs_qgroup_free_data() will not free any data reserved space at all, causing a leakage. This type of error handling can only be triggered by errors outside of qgroup code. So EDQUOT error from qgroup can't trigger it. [FIX] Fix the wrong target io_tree. Reported-by: Josef Bacik <josef@toxicpanda.com> Fixes: bc42bda22345 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-27block: fix null pointer dereference in blk_mq_rq_timed_out()Yufen Yu
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out() as following: [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040 [ 108.827059] PGD 0 P4D 0 [ 108.827313] Oops: 0000 [#1] SMP PTI [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431 [ 108.829503] Workqueue: kblockd blk_mq_timeout_work [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330 [ 108.838191] Call Trace: [ 108.838406] bt_iter+0x74/0x80 [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450 [ 108.839074] ? __switch_to_asm+0x34/0x70 [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f [ 108.840732] blk_mq_timeout_work+0x74/0x200 [ 108.841151] process_one_work+0x297/0x680 [ 108.841550] worker_thread+0x29c/0x6f0 [ 108.841926] ? rescuer_thread+0x580/0x580 [ 108.842344] kthread+0x16a/0x1a0 [ 108.842666] ? kthread_flush_work+0x170/0x170 [ 108.843100] ret_from_fork+0x35/0x40 The bug is caused by the race between timeout handle and completion for flush request. When timeout handle function blk_mq_rq_timed_out() try to read 'req->q->mq_ops', the 'req' have completed and reinitiated by next flush request, which would call blk_rq_init() to clear 'req' as 0. After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"), normal requests lifetime are protected by refcount. Until 'rq->ref' drop to zero, the request can really be free. Thus, these requests cannot been reused before timeout handle finish. However, flush request has defined .end_io and rq->end_io() is still called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq' can be reused by the next flush request handle, resulting in null pointer deference BUG ON. We fix this problem by covering flush request with 'rq->ref'. If the refcount is not zero, flush_end_io() return and wait the last holder recall it. To record the request status, we add a new entry 'rq_status', which will be used in flush_end_io(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org # v4.18+ Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> ------- v2: - move rq_status from struct request to struct blk_flush_queue v3: - remove unnecessary '{}' pair. v4: - let spinlock to protect 'fq->rq_status' v5: - move rq_status after flush_running_idx member of struct blk_flush_queue Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-27Merge branch 'qdisc-destroy'David S. Miller
Vlad Buslov says: ==================== Fix Qdisc destroy issues caused by adding fine-grained locking to filter API TC filter API unlocking introduced several new fine-grained locks. The change caused sleeping-while-atomic BUGs in several Qdiscs that call cls APIs which need to obtain new mutex while holding sch tree spinlock. This series fixes affected Qdiscs by ensuring that cls API that became sleeping is only called outside of sch tree lock critical section. ==================== Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net: sched: sch_sfb: don't call qdisc_put() while holding tree lockVlad Buslov
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for sfb: tc qdisc add dev ens1f0 handle 1: root sfb tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 tc qdisc change dev ens1f0 root handle 1: sfb Resulting dmesg: [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc [ 7265.941455] INFO: lockdep is turned off. [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ #721 [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 7265.945396] Call Trace: [ 7265.946709] dump_stack+0x85/0xc0 [ 7265.947994] ___might_sleep.cold+0xac/0xbc [ 7265.949282] __mutex_lock+0x5b/0x960 [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50 [ 7265.955478] tcf_block_put+0x50/0x70 [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq] [ 7265.957898] qdisc_destroy+0x5f/0x160 [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb] [ 7265.960304] tc_modify_qdisc+0x324/0x840 [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0 [ 7265.962692] ? netlink_deliver_tap+0x95/0x400 [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0 [ 7265.965064] netlink_rcv_skb+0x49/0x110 [ 7265.966251] netlink_unicast+0x171/0x200 [ 7265.967427] netlink_sendmsg+0x224/0x3f0 [ 7265.968595] sock_sendmsg+0x5e/0x60 [ 7265.969753] ___sys_sendmsg+0x2ae/0x330 [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0 [ 7265.972074] ? do_wp_page+0x9c/0x790 [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0 [ 7265.974407] __sys_sendmsg+0x59/0xa0 [ 7265.975591] do_syscall_64+0x5c/0xb0 [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 7265.977938] RIP: 0033:0x7f229069f7b8 [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8 [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003 [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0 [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In sfb_change() function use qdisc_purge_queue() instead of qdisc_tree_flush_backlog() to properly reset old child Qdisc and save pointer to it into local temporary variable. Put reference to Qdisc after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net: sched: multiq: don't call qdisc_put() while holding tree lockVlad Buslov
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for multiq: tc qdisc add dev ens1f0 root handle 1: multiq tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10 ethtool -L ens1f0 combined 2 tc qdisc change dev ens1f0 root handle 1: multiq Resulting dmesg: [ 5539.419344] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 5539.420945] in_atomic(): 1, irqs_disabled(): 0, pid: 27658, name: tc [ 5539.422435] INFO: lockdep is turned off. [ 5539.423904] CPU: 21 PID: 27658 Comm: tc Tainted: G W 5.3.0-rc8+ #721 [ 5539.425400] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 5539.426911] Call Trace: [ 5539.428380] dump_stack+0x85/0xc0 [ 5539.429823] ___might_sleep.cold+0xac/0xbc [ 5539.431262] __mutex_lock+0x5b/0x960 [ 5539.432682] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.434103] ? __nla_validate_parse+0x51/0x840 [ 5539.435493] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.436903] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 5539.438327] tcf_block_put_ext.part.0+0x21/0x50 [ 5539.439752] tcf_block_put+0x50/0x70 [ 5539.441165] sfq_destroy+0x15/0x50 [sch_sfq] [ 5539.442570] qdisc_destroy+0x5f/0x160 [ 5539.444000] multiq_tune+0x14a/0x420 [sch_multiq] [ 5539.445421] tc_modify_qdisc+0x324/0x840 [ 5539.446841] rtnetlink_rcv_msg+0x170/0x4b0 [ 5539.448269] ? netlink_deliver_tap+0x95/0x400 [ 5539.449691] ? rtnl_dellink+0x2d0/0x2d0 [ 5539.451116] netlink_rcv_skb+0x49/0x110 [ 5539.452522] netlink_unicast+0x171/0x200 [ 5539.453914] netlink_sendmsg+0x224/0x3f0 [ 5539.455304] sock_sendmsg+0x5e/0x60 [ 5539.456686] ___sys_sendmsg+0x2ae/0x330 [ 5539.458071] ? ___sys_recvmsg+0x159/0x1f0 [ 5539.459461] ? do_wp_page+0x9c/0x790 [ 5539.460846] ? __handle_mm_fault+0xcd3/0x19e0 [ 5539.462263] __sys_sendmsg+0x59/0xa0 [ 5539.463661] do_syscall_64+0x5c/0xb0 [ 5539.465044] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 5539.466454] RIP: 0033:0x7f1fe08177b8 [ 5539.467863] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 5539.470906] RSP: 002b:00007ffe812de5d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 5539.472483] RAX: ffffffffffffffda RBX: 000000005d8135e3 RCX: 00007f1fe08177b8 [ 5539.474069] RDX: 0000000000000000 RSI: 00007ffe812de640 RDI: 0000000000000003 [ 5539.475655] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000182e9b0 [ 5539.477203] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 5539.478699] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 Rearrange locking in multiq_tune() in following ways: - In loop that removes Qdiscs from disabled queues, call qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that is being destroyed. Save the Qdisc in temporary allocated array and call qdisc_put() on each element of the array after sch tree lock is released. This is safe to do because Qdiscs have already been reset by qdisc_purge_queue() inside sch tree lock critical section. - Do the same change for second loop that initializes Qdiscs for newly enabled queues in multiq_tune() function. Since sch tree lock is obtained and released on each iteration of this loop, just call qdisc_put() directly outside of critical section. Don't verify that old Qdisc is not noop_qdisc before releasing reference to it because such check is already performed by qdisc_put*() functions. Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net: sched: sch_htb: don't call qdisc_put() while holding tree lockVlad Buslov
Recent changes that removed rtnl dependency from rules update path of tc also made tcf_block_put() function sleeping. This function is called from ops->destroy() of several Qdisc implementations, which in turn is called by qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock, which results sleeping-while-atomic BUG. Steps to reproduce for htb: tc qdisc add dev ens1f0 root handle 1: htb default 12 tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10 tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps Resulting dmesg: [ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909 [ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc [ 4791.152805] INFO: lockdep is turned off. [ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G W 5.3.0-rc8+ #721 [ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 4791.155075] Call Trace: [ 4791.155803] dump_stack+0x85/0xc0 [ 4791.156529] ___might_sleep.cold+0xac/0xbc [ 4791.157251] __mutex_lock+0x5b/0x960 [ 4791.157966] ? console_unlock+0x363/0x5d0 [ 4791.158676] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.159395] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160103] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0 [ 4791.160815] tcf_block_put_ext.part.0+0x21/0x50 [ 4791.161530] tcf_block_put+0x50/0x70 [ 4791.162233] sfq_destroy+0x15/0x50 [sch_sfq] [ 4791.162936] qdisc_destroy+0x5f/0x160 [ 4791.163642] htb_change_class.cold+0x5df/0x69d [sch_htb] [ 4791.164505] tc_ctl_tclass+0x19d/0x480 [ 4791.165360] rtnetlink_rcv_msg+0x170/0x4b0 [ 4791.166191] ? netlink_deliver_tap+0x95/0x400 [ 4791.166907] ? rtnl_dellink+0x2d0/0x2d0 [ 4791.167625] netlink_rcv_skb+0x49/0x110 [ 4791.168345] netlink_unicast+0x171/0x200 [ 4791.169058] netlink_sendmsg+0x224/0x3f0 [ 4791.169771] sock_sendmsg+0x5e/0x60 [ 4791.170475] ___sys_sendmsg+0x2ae/0x330 [ 4791.171183] ? ___sys_recvmsg+0x159/0x1f0 [ 4791.171894] ? do_wp_page+0x9c/0x790 [ 4791.172595] ? __handle_mm_fault+0xcd3/0x19e0 [ 4791.173309] __sys_sendmsg+0x59/0xa0 [ 4791.174024] do_syscall_64+0x5c/0xb0 [ 4791.174725] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 4791.175435] RIP: 0033:0x7f0aa41497b8 [ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8 [ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003 [ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0 [ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000 In htb_change_class() function save parent->leaf.q to local temporary variable and put reference to it after sch tree lock is released in order not to call potentially sleeping cls API in atomic section. This is safe to do because Qdisc has already been reset by qdisc_purge_queue() inside sch tree lock critical section. Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>