summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)Author
2018-02-08Merge tag 'iommu-updates-v4.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU updates from Joerg Roedel: "This time there are not a lot of changes coming from the IOMMU side. That is partly because I returned from my parental leave late in the development process and probably partly because everyone was busy with Spectre and Meltdown mitigation work and didn't find the time for IOMMU work. So here are the few changes that queued up for this merge window: - 5-level page-table support for the Intel IOMMU. - error reporting improvements for the AMD IOMMU driver - additional DT bindings for ipmmu-vmsa (Renesas) - small fixes and cleanups" * tag 'iommu-updates-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu: Clean up of_iommu_init_fn iommu/ipmmu-vmsa: Remove redundant of_iommu_init_fn hook iommu/msm: Claim bus ops on probe iommu/vt-d: Enable 5-level paging mode in the PASID entry iommu/vt-d: Add a check for 5-level paging support iommu/vt-d: Add a check for 1GB page support iommu/vt-d: Enable upto 57 bits of domain address width iommu/vt-d: Use domain instead of cache fetching iommu/exynos: Don't unconditionally steal bus ops iommu/omap: Fix debugfs_create_*() usage iommu/vt-d: clean up pr_irq if request_threaded_irq fails iommu: Check the result of iommu_group_get() for NULL iommu/ipmmu-vmsa: Add r8a779(70|95) DT bindings iommu/ipmmu-vmsa: Add r8a7796 DT binding iommu/amd: Set the device table entry PPR bit for IOMMU V2 devices iommu/amd - Record more information about unknown events
2018-02-08Merge branch 'pcmcia' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia Pull pcmcia updates from Dominik Brodowski: "The linux-pcmcia mailing list was shut down, so offer an alternative path for patches in MAINTAINERS. Also, throw in two odd fixes for the pcmcia subsystem" * 'pcmcia' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia: pcmcia: soc_common: Handle return value of clk_prepare_enable pcmcia: use proper printk format for resource pcmcia: remove mailing list, update MAINTAINERS
2018-02-08Merge tag 'drm-for-v4.16-part2-fixes' of ↵Linus Torvalds
git://people.freedesktop.org/~airlied/linux Pull more drm updates from Dave Airlie: "Ben missed sending his nouveau tree, but he really didn't have much stuff in it: - GP108 acceleration support is enabled by "secure boot" support - some clockgating work on Kepler, and bunch of fixes - the bulk of the diff is regenerated firmware files, the change to them really isn't that large. Otherwise this contains regular Intel and AMDGPU fixes" * tag 'drm-for-v4.16-part2-fixes' of git://people.freedesktop.org/~airlied/linux: (59 commits) drm/i915/bios: add DP max link rate to VBT child device struct drm/i915/cnp: Properly handle VBT ddc pin out of bounds. drm/i915/cnp: Ignore VBT request for know invalid DDC pin. drm/i915/cmdparser: Do not check past the cmd length. drm/i915/cmdparser: Check reg_table_count before derefencing. drm/i915/bxt, glk: Increase PCODE timeouts during CDCLK freq changing drm/i915/gvt: Use KVM r/w to access guest opregion drm/i915/gvt: Fix aperture read/write emulation when enable x-no-mmap=on drm/i915/gvt: only reset execlist state of one engine during VM engine reset drm/i915/gvt: refine intel_vgpu_submission_ops as per engine ops drm/amdgpu: re-enable CGCG on CZ and disable on ST drm/nouveau/clk: fix gcc-7 -Wint-in-bool-context warning drm/nouveau/mmu: Fix trailing semicolon drm/nouveau: Introduce NvPmEnableGating option drm/nouveau: Add support for SLCG for Kepler2 drm/nouveau: Add support for BLCG on Kepler2 drm/nouveau: Add support for BLCG on Kepler1 drm/nouveau: Add support for basic clockgating on Kepler1 drm/nouveau/kms/nv50: fix handling of gamma since atomic conversion drm/nouveau/kms/nv50: use INTERPOLATE_257_UNITY_RANGE LUT on newer chipsets ...
2018-02-08Merge tag 'ceph-for-4.16-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "Things have been very quiet on the rbd side, as work continues on the big ticket items slated for the next merge window. On the CephFS side we have a large number of cap handling improvements, a fix for our long-standing abuse of ->journal_info in ceph_readpages() and yet another dentry pointer management patch" * tag 'ceph-for-4.16-rc1' of git://github.com/ceph/ceph-client: ceph: improving efficiency of syncfs libceph: check kstrndup() return value ceph: try to allocate enough memory for reserved caps ceph: fix race of queuing delayed caps ceph: delete unreachable code in ceph_check_caps() ceph: limit rate of cap import/export error messages ceph: fix incorrect snaprealm when adding caps ceph: fix un-balanced fsc->writeback_count update ceph: track read contexts in ceph_file_info ceph: avoid dereferencing invalid pointer during cached readdir ceph: use atomic_t for ceph_inode_info::i_shared_gen ceph: cleanup traceless reply handling for rename ceph: voluntarily drop Fx cap for readdir request ceph: properly drop caps for setattr request ceph: voluntarily drop Lx cap for link/rename requests ceph: voluntarily drop Ax cap for requests that create new inode rbd: whitelist RBD_FEATURE_OPERATIONS feature bit rbd: don't NULL out ->obj_request in rbd_img_obj_parent_read_full() rbd: use kmem_cache_zalloc() in rbd_img_request_create() rbd: obj_request->completion is unused
2018-02-08tuntap: add missing xdp flushJason Wang
When using devmap to redirect packets between interfaces, xdp_do_flush() is usually a must to flush any batched packets. Unfortunately this is missed in current tuntap implementation. Unlike most hardware driver which did XDP inside NAPI loop and call xdp_do_flush() at then end of each round of poll. TAP did it in the context of process e.g tun_get_user(). So fix this by count the pending redirected packets and flush when it exceeds NAPI_POLL_WEIGHT or MSG_MORE was cleared by sendmsg() caller. With this fix, xdp_redirect_map works again between two TAPs. Fixes: 761876c857cb ("tap: XDP support") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull more arm64 updates from Catalin Marinas: "As I mentioned in the last pull request, there's a second batch of security updates for arm64 with mitigations for Spectre/v1 and an improved one for Spectre/v2 (via a newly defined firmware interface API). Spectre v1 mitigation: - back-end version of array_index_mask_nospec() - masking of the syscall number to restrict speculation through the syscall table - masking of __user pointers prior to deference in uaccess routines Spectre v2 mitigation update: - using the new firmware SMC calling convention specification update - removing the current PSCI GET_VERSION firmware call mitigation as vendors are deploying new SMCCC-capable firmware - additional branch predictor hardening for synchronous exceptions and interrupts while in user mode Meltdown v3 mitigation update: - Cavium Thunder X is unaffected but a hardware erratum gets in the way. The kernel now starts with the page tables mapped as global and switches to non-global if kpti needs to be enabled. Other: - Theoretical trylock bug fixed" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (38 commits) arm64: Kill PSCI_GET_VERSION as a variant-2 workaround arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support arm/arm64: smccc: Implement SMCCC v1.1 inline primitive arm/arm64: smccc: Make function identifiers an unsigned quantity firmware/psci: Expose SMCCC version through psci_ops firmware/psci: Expose PSCI conduit arm64: KVM: Add SMCCC_ARCH_WORKAROUND_1 fast handling arm64: KVM: Report SMCCC_ARCH_WORKAROUND_1 BP hardening support arm/arm64: KVM: Turn kvm_psci_version into a static inline arm/arm64: KVM: Advertise SMCCC v1.1 arm/arm64: KVM: Implement PSCI 1.0 support arm/arm64: KVM: Add smccc accessors to PSCI code arm/arm64: KVM: Add PSCI_VERSION helper arm/arm64: KVM: Consolidate the PSCI include files arm64: KVM: Increment PC after handling an SMC trap arm: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls arm64: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls arm64: entry: Apply BP hardening for suspicious interrupts from EL0 arm64: entry: Apply BP hardening for high-priority synchronous exceptions arm64: futex: Mask __user pointers prior to dereference ...
2018-02-08Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio/vhost updates from Michael Tsirkin: "virtio, vhost: fixes, cleanups, features This includes the disk/cache memory stats for for the virtio balloon, as well as multiple fixes and cleanups" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost: don't hold onto file pointer for VHOST_SET_LOG_FD vhost: don't hold onto file pointer for VHOST_SET_VRING_ERR vhost: don't hold onto file pointer for VHOST_SET_VRING_CALL ringtest: ring.c malloc & memset to calloc virtio_vop: don't kfree device on register failure virtio_pci: don't kfree device on register failure virtio: split device_register into device_initialize and device_add vhost: remove unused lock check flag in vhost_dev_cleanup() vhost: Remove the unused variable. virtio_blk: print capacity at probe time virtio: make VIRTIO a menuconfig to ease disabling it all virtio/ringtest: virtio_ring: fix up need_event math virtio/ringtest: fix up need_event math virtio: virtio_mmio: make of_device_ids const. firmware: Use PTR_ERR_OR_ZERO() virtio-mmio: Use PTR_ERR_OR_ZERO() vhost/scsi: Improve a size determination in four functions virtio_balloon: include disk/file caches memory statistics
2018-02-08Merge ath-current from ↵Kalle Valo
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git ath.git fixes for 4.16. Major changes: ath10k * correct firmware RAM dump length for QCA6174/QCA9377 * add new QCA988X device id * fix a kernel panic during pci probe * revert a recent commit which broke ath10k firmware metadata parsing ath9k * fix a noise floor regression introduced during the merge window * add new device id
2018-02-08nfp: populate MODULE_VERSIONJakub Kicinski
DKMS and similar out-of-tree module replacement services use module version to make sure the out-of-tree software is not older than the module shipped with the kernel. We use the kernel version in ethtool -i output, put it into MODULE_VERSION as well. Reported-by: Jan Gutter <jan.gutter@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08nfp: limit the number of TSO segmentsJakub Kicinski
Most FWs limit the number of TSO segments a frame can produce to 64. This is for fairness and efficiency (of FW datapath) reasons. If a frame with larger number of segments is submitted the FW will drop it. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08nfp: forbid disabling hw-tc-offload on representors while offload activeJakub Kicinski
All netdevs which can accept TC offloads must implement .ndo_set_features(). nfp_reprs currently do not do that, which means hw-tc-offload can be turned on and off even when offloads are active. Whether the offloads are active is really a question to nfp_ports, so remove the per-app tc_busy callback indirection thing, and simply count the number of offloaded items in nfp_port structure. Fixes: 8a2768732a4d ("nfp: provide infrastructure for offloading flower based TC filters") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Tested-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08nfp: don't advertise hw-tc-offload on non-port netdevsJakub Kicinski
nfp_port is a structure which represents an ASIC port, both PCIe vNIC (on a PF or a VF) or the external MAC port. vNIC netdev (struct nfp_net) and pure representor netdev (struct nfp_repr) both have a pointer to this structure. nfp_reprs always have a port associated. nfp_nets, however, only represent a device port in legacy mode, where they are considered the MAC port. In switchdev mode they are just the CPU's side of the PCIe link. By definition TC offloads only apply to device ports. Don't set the flag on vNICs without a port (i.e. in switchdev mode). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Tested-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08nfp: bpf: require ETH tableJakub Kicinski
Upcoming changes will require all netdevs supporting TC offloads to have a full struct nfp_port. Require those for BPF offload. The operation without management FW reporting information about Ethernet ports is something we only support for very old and very basic NIC firmwares anyway. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Tested-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08Revert "ath10k: add sanity check to ie_len before parsing fw/board ie"Ryan Hsu
This reverts commit 9ed4f91628737c820af6a1815b65bc06bd31518f. The commit introduced a regression that over read the ie with the padding. - the expected IE information ath10k_pci 0000:03:00.0: found firmware features ie (1 B) ath10k_pci 0000:03:00.0: Enabling feature bit: 6 ath10k_pci 0000:03:00.0: Enabling feature bit: 7 ath10k_pci 0000:03:00.0: features ath10k_pci 0000:03:00.0: 00000000: c0 00 00 00 00 00 00 00 - the wrong IE with padding is read (0x77) ath10k_pci 0000:03:00.0: found firmware features ie (4 B) ath10k_pci 0000:03:00.0: Enabling feature bit: 6 ath10k_pci 0000:03:00.0: Enabling feature bit: 7 ath10k_pci 0000:03:00.0: Enabling feature bit: 8 ath10k_pci 0000:03:00.0: Enabling feature bit: 9 ath10k_pci 0000:03:00.0: Enabling feature bit: 10 ath10k_pci 0000:03:00.0: Enabling feature bit: 12 ath10k_pci 0000:03:00.0: Enabling feature bit: 13 ath10k_pci 0000:03:00.0: Enabling feature bit: 14 ath10k_pci 0000:03:00.0: Enabling feature bit: 16 ath10k_pci 0000:03:00.0: Enabling feature bit: 17 ath10k_pci 0000:03:00.0: Enabling feature bit: 18 ath10k_pci 0000:03:00.0: features ath10k_pci 0000:03:00.0: 00000000: c0 77 07 00 00 00 00 00 Tested-by: Mike Lothian <mike@fireburn.co.uk> Signed-off-by: Ryan Hsu <ryanhsu@codeaurora.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2018-02-08nfp: bpf: fix immed relocation for larger offsetsJakub Kicinski
Immed relocation is missing a shift which means for larger offsets the lower and higher part of the address would be ORed together. Fixes: ce4ebfd859c3 ("nfp: bpf: add helpers for updating immediate instructions") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-08Merge branches 'acpi-video', 'acpi-battery' and 'acpi-cppc'Rafael J. Wysocki
* acpi-video: ACPI / video: Use true for boolean value * acpi-battery: ACPI / battery: Add quirk for Asus UX360UA and UX410UAK * acpi-cppc: ACPI / CPPC: Use 64-bit arithmetic instead of 32-bit
2018-02-08Merge branches 'acpi-tables', 'acpi-bus' and 'acpi-processor'Rafael J. Wysocki
* acpi-tables: ACPI: SPCR: Make SPCR available to x86 ACPI / tables: Add IORT to injectable table list * acpi-bus: ACPI / bus: Parse tables as term_list for Dell XPS 9570 and Precision M5530 ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs ACPI / bus: Do not call _STA on battery devices with unmet dependencies PCI: acpiphp_ibm: prepare for acpi_get_object_info() no longer returning status ACPI: export acpi_bus_get_status_handle() * acpi-processor: ACPI / processor: Set default C1 idle state description ACPI: processor_perflib: Do not send _PPC change notification if not ready
2018-02-08Merge branch 'acpica'Rafael J. Wysocki
* acpica: ACPICA: Update version to 20180105 ACPICA: All acpica: Update copyrights to 2018 ACPICA: Add a missing pair of parentheses ACPICA: Prefer ACPI_TO_POINTER() over ACPI_ADD_PTR() ACPICA: Avoid NULL pointer arithmetic ACPICA: Linux: add support for X32 ABI compilation
2018-02-08drm/stm: ltdc: remove non-alpha color formats on layer 2 for older hwPhilippe CORNU
Hw older versions support non-alpha color formats derived from native alpha color formats only on the primary layer. For instance, RG16 native format without alpha works fine on 2nd layer but XR24 (derived color format from AR24) does not work on 2nd layer. Signed-off-by: Philippe Cornu <philippe.cornu@st.com> Reviewed-by: Yannick Fertré <yannick.fertre@st.com> Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20180201104243.20726-3-philippe.cornu@st.com
2018-02-08drm/stm: ltdc: add non-alpha color formatsPhilippe CORNU
ltdc supports natively some color formats with alpha (like ARGB8888, ARGB1555, ARGB4444...). Related non-alpha formats are supported too (ARGB8888->XRGB8888, ARGB4444->XRGB4444...) by adjusting ltdc blending factors. Note: Wayland/Weston requests by default the non-alpha XRGB8888 color format. Signed-off-by: Philippe Cornu <philippe.cornu@st.com> Reviewed-by: Yannick Fertré <yannick.fertre@st.com> Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20180201104243.20726-2-philippe.cornu@st.com
2018-02-08Merge branches 'pm-cpufreq', 'pm-cpuidle' and 'pm-domains'Rafael J. Wysocki
* pm-cpufreq: arm: imx: Add MODULE_ALIAS for cpufreq cpufreq: Add and use cpufreq_for_each_{valid_,}entry_idx() cpufreq: intel_pstate: Enable HWP during system resume on CPU0 cpufreq: scpi: fix error return code in scpi_cpufreq_init() cpufreq: scpi: fix static checker warning cdev isn't an ERR_PTR cpufreq: remove at32ap-cpufreq cpufreq: AMD: Ignore the check for ProcFeedback in ST/CZ cpufreq: Skip cpufreq resume if it's not suspended * pm-cpuidle: x86: PM: Make APM idle driver initialize polling state * pm-domains: PM / domains: Fix up domain-idle-states OF parsing
2018-02-08arm: imx: Add MODULE_ALIAS for cpufreqNicolas Chauvet
Without this, the imx6q-cpufreq driver isn't loaded automatically when built as a module Tested on wandboard quad with a fedora 27 kernel rpm Signed-off-by: Nicolas Chauvet <kwizart@gmail.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-02-08cpufreq: Add and use cpufreq_for_each_{valid_,}entry_idx()Dominik Brodowski
Pointer subtraction is slow and tedious. Therefore, replace all instances where cpufreq_for_each_{valid_,}entry loops contained such substractions with an iteration macro providing an index to the frequency_table entry. Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Link: http://lkml.kernel.org/r/20180120020237.GM13338@ZenIV.linux.org.uk Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-02-08cpufreq: intel_pstate: Enable HWP during system resume on CPU0Chen Yu
When maxcpus=1 is in the kernel command line, the BP is responsible for re-enabling the HWP - because currently only the APs invoke intel_pstate_hwp_enable() during their online process - which might put the system into unstable state after resume. Fix this by enabling the HWP explicitly on BP during resume. Reported-by: Doug Smythies <dsmythies@telus.net> Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Yu Chen <yu.c.chen@intel.com> [ rjw: Subject/changelog, minor modifications ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-02-08cpufreq: scpi: fix error return code in scpi_cpufreq_init()Wei Yongjun
Fix to return a negative error code from the clk_get() error handling case instead of 0, as done elsewhere in this function. Fixes: 343a8d17fa8d (cpufreq: scpi: remove arm_big_little dependency) Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-02-08ACPI: sbshc: remove raw pointer from printk() messageGreg Kroah-Hartman
There's no need to be printing a raw kernel pointer to the kernel log at every boot. So just remove it, and change the whole message to use the correct dev_info() call at the same time. Reported-by: Wang Qize <wang_qize@venustech.com.cn> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-02-08drm/bridge/synopsys: dsi: Add 1.31 version supportPhilippe Cornu
Add support for the Synopsys DesignWare MIPI DSI version 1.31 Two registers need to be updated/added for supporting 1.31: * PHY_TMR_CFG 0x9c (updated) 1.30 [31:24] phy_hs2lp_time [23:16] phy_lp2hs_time [14: 0] max_rd_time 1.31 [25:16] phy_hs2lp_time [ 9: 0] phy_lp2hs_time * PHY_TMR_RD_CFG 0xf4 (new) 1.31 [14: 0] max_rd_time Signed-off-by: Philippe Cornu <philippe.cornu@st.com> Reviewed-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180206084251.303-1-philippe.cornu@st.com
2018-02-08drm/bridge/synopsys: dsi: Add read featurePhilippe Cornu
This patch adds the DCS/GENERIC DSI read feature. Signed-off-by: Philippe Cornu <philippe.cornu@st.com> Reviewed-by: Andrzej Hajda <a.hajda@samsung.com> Reviewed-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180204213104.17834-1-philippe.cornu@st.com
2018-02-07net: ethernet: ti: cpsw: fix net watchdog timeoutGrygorii Strashko
It was discovered that simple program which indefinitely sends 200b UDP packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog timeout is triggered due to race between cpsw_ndo_start_xmit() and cpsw_tx_handler() [NAPI] cpsw_ndo_start_xmit() if (unlikely(!cpdma_check_free_tx_desc(txch))) { txq = netdev_get_tx_queue(ndev, q_idx); netif_tx_stop_queue(txq); ^^ as per [1] barier has to be used after set_bit() otherwise new value might not be visible to other cpus } cpsw_tx_handler() if (unlikely(netif_tx_queue_stopped(txq))) netif_tx_wake_queue(txq); and when it happens ndev TX queue became disabled forever while driver's HW TX queue is empty. Fix this, by adding smp_mb__after_atomic() after netif_tx_stop_queue() calls and double check for free TX descriptors after stopping ndev TX queue - if there are free TX descriptors wake up ndev TX queue. [1] https://www.kernel.org/doc/html/latest/core-api/atomic_ops.html Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07ibmvnic: Ensure that buffers are NULL after freeThomas Falcon
This change will guard against a double free in the case that the buffers were previously freed at some other time, such as during a device reset. It resolves a kernel oops that occurred when changing the VNIC device's MTU. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07ibmvnic: Fix rx queue cleanup for non-fatal resetsJohn Allen
At some point, a check was added to exit the polling routine during resets. This makes sense for most reset conditions, but for a non-fatal error, we expect the polling routine to continue running to properly clean up the rx queues. This patch checks if we are performing a non-fatal reset and if we are, continues normal polling operation. Signed-off-by: John Allen <jallen@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07i40e: Fix the number of queues available to be mapped for useAmritha Nambiar
Fix the number of queues per enabled TC and report available queues to the kernel without having to limit them to the max RSS limit so they are available to be mapped for XPS. This allows a queue per processing thread available for handling traffic for the given traffic class. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07sun: Add SPDX license tags to Sun network driversShannon Nelson
Add the appropriate SPDX license tags to the Sun network drivers as outlined in Documentation/process/license-rules.rst. Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07cxgb4: Fix error handling path in 'init_one()'Christophe JAILLET
Commit baf5086840ab1 ("cxgb4: restructure VF mgmt code") has reordered some code but an error handling label has not been updated accordingly. So fix it and free 'adapter' if 't4_wait_dev_ready()' fails. Fixes: baf5086840ab1 ("cxgb4: restructure VF mgmt code") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07Merge branch 'for-linus' into testJens Axboe
* for-linus: block, bfq: add requeue-request hook bcache: fix for data collapse after re-attaching an attached device bcache: return attach error when no cache set exist bcache: set writeback_rate_update_seconds in range [1, 60] seconds bcache: fix for allocator and register thread race bcache: set error_limit correctly bcache: properly set task state in bch_writeback_thread() bcache: fix high CPU occupancy during journal bcache: add journal statistic block: Add should_fail_bio() for bpf error injection blk-wbt: account flush requests correctly
2018-02-08Merge tag 'drm-intel-next-fixes-2018-02-07' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-intel into drm-next Fix for pcode timeouts on BXT and GLK, cmdparser fixes and fixes for new vbt version on CFL and CNL. GVT contains vGPU reset enhancement, which refines vGPU reset flow and the support of virtual aperture read/write when x-no-mmap=on is set in KVM, which is required by a test case from Redhat and also another fix for virtual OpRegion. * tag 'drm-intel-next-fixes-2018-02-07' of git://anongit.freedesktop.org/drm/drm-intel: drm/i915/bios: add DP max link rate to VBT child device struct drm/i915/cnp: Properly handle VBT ddc pin out of bounds. drm/i915/cnp: Ignore VBT request for know invalid DDC pin. drm/i915/cmdparser: Do not check past the cmd length. drm/i915/cmdparser: Check reg_table_count before derefencing. drm/i915/bxt, glk: Increase PCODE timeouts during CDCLK freq changing drm/i915/gvt: Use KVM r/w to access guest opregion drm/i915/gvt: Fix aperture read/write emulation when enable x-no-mmap=on drm/i915/gvt: only reset execlist state of one engine during VM engine reset drm/i915/gvt: refine intel_vgpu_submission_ops as per engine ops
2018-02-07Merge tag 'regulator-fix-v4.16-suspend' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fix from Mark Brown: "Fix suspend to idle. Testing on mainline after the initial regulator pull request went in identified a regression for suspend to idle due to it calling the suspend operations with states that it wasn't realized could happen, this patch fixes the problem" * tag 'regulator-fix-v4.16-suspend' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: Fix suspend to idle
2018-02-07Merge tag 'fbdev-v4.16' of git://github.com/bzolnier/linuxLinus Torvalds
Pull fbdev updates from Bartlomiej Zolnierkiewicz: "There is nothing really major here: - fix display-timings lookup in the Device Tree in atmel_lcdfb driver (Johan Hovold) - fix video mode and line_length to be set correctly in vfb driver (Pieter "PoroCYon" Sluys) - fix returning nonsensical values to the user-space on GIO_FONTX ioctl when using dummy console (Nicolas Pitre) - add missing license tag to mmpfb driver (Arnd Bergmann) - convert radeonfb and pxa3xx_gcu drivers to use ktime_get[_ts64]() instead of the deprecated do_gettimeofday() (Arnd Bergmann) - switch udlfb driver from using the pr_*() logging functions to the dev_*() ones + related cleanups (Ladislav Michl) - use __raw I/O accessors also on arm64 (Ji Zhang) - fix Kconfig help text for intelfb driver (Randy Dunlap) - do not duplicate features data in omapfb driver (Ladislav Michl) - misc cleanups (Colin Ian King, Markus Elfring, Rasmus Villemoes, Vasyl Gomonovych, Himanshu Jha, Michael Trimarchi)" * tag 'fbdev-v4.16' of git://github.com/bzolnier/linux: (25 commits) video: udlfb: Switch from the pr_*() to the dev_*() logging functions video: udlfb: Constify read only data video: fbdev/mmp: add MODULE_LICENSE console/dummy: leave .con_font_get set to NULL fbdev: mxsfb: use framebuffer_alloc in the correct way video: udlfb: Do not name private data 'dev' video: udlfb: Remove noisy warnings video: udlfb: Remove redundant gdev variable video: udlfb: Remove unnecessary local variable fbdev: auo_k190x: Use zeroing memory allocator instead of allocator/memset vfb: fix video mode and line_length being set when loaded fbdev: arm64 use __raw I/O memory api omapfb: dss: Do not duplicate features data video: fbdev: omap2: Use PTR_ERR_OR_ZERO() fbdev: au1200fb: delete duplicate header contents fbdev: pxa3xx: use ktime_get_ts64 for time stamps fbdev: radeon: use ktime_get() for HZ calibration video: smscufx: Improve a size determination in two functions video: udlfb: Delete an unnecessary return statement in two functions video: udlfb: Improve a size determination in dlfb_alloc_urb_list() ...
2018-02-07Merge tag 'platform-drivers-x86-v4.16-2' of ↵Linus Torvalds
git://git.infradead.org/linux-platform-drivers-x86 Pull more x86 platform-drivers updates from Andy Shevchenko: "The DEFINE_SHOW_ATTRIBUTE() macro was defined privately in three locations and is useful for new and old users to avoid a lot of code duplication. Move the macro to seq_file.h. Along with above, clean up three drivers to use that macro. This, due to dependencies, was sent separately since affected changes weren't upstream originally yet. The rationale of doing this now is to allow use of new macro in v4.17 cycle in a conflictless manner" * tag 'platform-drivers-x86-v4.16-2' of git://git.infradead.org/linux-platform-drivers-x86: platform/x86: samsung-laptop: Re-use DEFINE_SHOW_ATTRIBUTE() macro platform/x86: ideapad-laptop: Re-use DEFINE_SHOW_ATTRIBUTE() macro platform/x86: dell-laptop: Re-use DEFINE_SHOW_ATTRIBUTE() macro seq_file: Introduce DEFINE_SHOW_ATTRIBUTE() helper macro
2018-02-07drm/i915/bios: add DP max link rate to VBT child device structJani Nikula
Update VBT defs to reflect revision 216. While at it, default the expected child device struct size to sizeof the size rather than a hardcoded value. v2: Fix bit order (David) Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180118153310.32437-1-jani.nikula@intel.com (cherry picked from commit c4fb60b9aba9f939d3f8575df23fd8d5958ec6ed) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2018-02-08Merge branch 'drm-next-4.16' of git://people.freedesktop.org/~agd5f/linux ↵Dave Airlie
into drm-next A few more misc fixes for 4.16. * 'drm-next-4.16' of git://people.freedesktop.org/~agd5f/linux: drm/amdgpu: re-enable CGCG on CZ and disable on ST drm/amdgpu: disable coarse grain clockgating for ST drm/radeon: adjust tested variable drm/amdgpu: remove WARN_ON when VM isn't found v2 drm/amdgpu: fix locking in vega10_ih_prescreen_iv drm/amdgpu: fix another potential cause of VM faults drm/amdgpu: use queue 0 for kiq ring drm/ttm: Fix 'buf' pointer update in ttm_bo_vm_access_kmap() (v2) drm/ttm: fix missing parameter change for ttm_bo_cleanup_refs
2018-02-07Merge tag 'linux-watchdog-4.16-rc1' of ↵Linus Torvalds
git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - new watchdog device drivers for Realtek RTD1295 and Spreadtrum SC9860 platform - add support for the following devices: jz4780 SoC, AST25xx series SoC and r8a77970 SoC - convert to watchdog framework: i6300esb_wdt, xen_wdt and sp5100_tco - several fixes for watchdog core - remove at32ap700x and obsolete documentation - gpio: Convert to use GPIO descriptors - rename gemini into FTWDT010 as this IP block is generc from Faraday Technology - various clean-ups and small bugfixes - add Guenter Roeck as co-maintainer - change maintainers e-mail address * tag 'linux-watchdog-4.16-rc1' of git://www.linux-watchdog.org/linux-watchdog: (74 commits) documentation: watchdog: remove documentation of w83697hf_wdt/w83697ug_wdt documentation: watchdog: remove documentation for ixp2000 documentation: watchdog: remove documentation of at32ap700x_wdt watchdog: remove at32ap700x_wdt watchdog: sp5100_tco: Add support for recent FCH versions watchdog: sp5100-tco: Abort if watchdog is disabled by hardware watchdog: sp5100_tco: Use bit operations watchdog: sp5100_tco: Convert to use watchdog subsystem watchdog: sp5100_tco: Clean up function and variable names watchdog: sp5100_tco: Use dev_ print functions where possible watchdog: sp5100_tco: Match PCI device early watchdog: sp5100_tco: Clean up sp5100_tco_setupdevice watchdog: sp5100_tco: Use standard error codes watchdog: sp5100_tco: Use request_muxed_region where possible watchdog: sp5100_tco: Fix watchdog disable bit watchdog: sp5100_tco: Always use SP5100_IO_PM_{INDEX_REG,DATA_REG} watchdog: core: make sure the watchdog_worker is not deferred watchdog: mt7621: switch to using managed devm_watchdog_register_device() watchdog: mt7621: set WDOG_HW_RUNNING bit when appropriate watchdog: imx2_wdt: restore previous timeout after suspend+resume ...
2018-02-07bcache: fix for data collapse after re-attaching an attached deviceTang Junhui
back-end device sdm has already attached a cache_set with ID f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with another cache set, and it returns with an error: [root]# cd /sys/block/sdm/bcache [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach -bash: echo: write error: Invalid argument After that, execute a command to modify the label of bcache device: [root]# echo data_disk1 > label Then we reboot the system, when the system power on, the back-end device can not attach to cache_set, a messages show in the log: Feb 5 12:05:52 ceph152 kernel: [922385.508498] bcache: bch_cached_dev_attach() couldn't find uuid for sdm in set In sysfs_attach(), dc->sb.set_uuid was assigned to the value which input through sysfs, no matter whether it is success or not in bch_cached_dev_attach(). For example, If the back-end device has already attached to an cache set, bch_cached_dev_attach() would fail, but dc->sb.set_uuid was changed. Then modify the label of bcache device, it will call bch_write_bdev_super(), which would write the dc->sb.set_uuid to the super block, so we record a wrong cache set ID in the super block, after the system reboot, the cache set couldn't find the uuid of the back-end device, so the bcache device couldn't exist and use any more. In this patch, we don't assigned cache set ID to dc->sb.set_uuid in sysfs_attach() directly, but input it into bch_cached_dev_attach(), and assigned dc->sb.set_uuid to the cache set ID after the back-end device attached to the cache set successful. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-07bcache: return attach error when no cache set existTang Junhui
I attach a back-end device to a cache set, and the cache set is not registered yet, this back-end device did not attach successfully, and no error returned: [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach [root]# In sysfs_attach(), the return value "v" is initialized to "size" in the beginning, and if no cache set exist in bch_cache_sets, the "v" value would not change any more, and return to sysfs, sysfs regard it as success since the "size" is a positive number. This patch fixes this issue by assigning "v" with "-ENOENT" in the initialization. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-07bcache: set writeback_rate_update_seconds in range [1, 60] secondsColy Li
dc->writeback_rate_update_seconds can be set via sysfs and its value can be set to [1, ULONG_MAX]. It does not make sense to set such a large value, 60 seconds is long enough value considering the default 5 seconds works well for long time. Because dc->writeback_rate_update is a special delayed work, it re-arms itself inside the delayed work routine update_writeback_rate(). When stopping it by cancel_delayed_work_sync(), there should be a timeout to wait and make sure the re-armed delayed work is stopped too. A small max value of dc->writeback_rate_update_seconds is also helpful to decide a reasonable small timeout. This patch limits sysfs interface to set dc->writeback_rate_update_seconds in range of [1, 60] seconds, and replaces the hand-coded number by macros. Changelog: v2: fix a rebase typo in v4, which is pointed out by Michael Lyle. v1: initial version. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-07bcache: fix for allocator and register thread raceTang Junhui
After long time running of random small IO writing, I reboot the machine, and after the machine power on, I found bcache got stuck, the stack is: [root@ceph153 ~]# cat /proc/2510/task/*/stack [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache] [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache] [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache] [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache] [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache] [<ffffffff810a631f>] kthread+0xcf/0xe0 [<ffffffff8164c318>] ret_from_fork+0x58/0x90 [<ffffffffffffffff>] 0xffffffffffffffff [root@ceph153 ~]# cat /proc/2038/task/*/stack [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache] [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache] [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache] [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache] [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache] [<ffffffff812f702f>] kobj_attr_store+0xf/0x20 [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140 [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0 [<ffffffff811e069f>] SyS_write+0x7f/0xe0 [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1 The stack shows the register thread and allocator thread were getting stuck when registering cache device. I reboot the machine several times, the issue always exsit in this machine. I debug the code, and found the call trace as bellow: register_bcache() ==>run_cache_set() ==>bch_journal_replay() ==>bch_btree_insert() ==>__bch_btree_map_nodes() ==>btree_insert_fn() ==>btree_split() //node need split ==>btree_check_reserve() In btree_check_reserve(), It will check if there is enough buckets of RESERVE_BTREE type, since allocator thread did not work yet, so no buckets of RESERVE_BTREE type allocated, so the register thread waits on c->btree_cache_wait, and goes to sleep. Then the allocator thread initialized, the call trace is bellow: bch_allocator_thread() ==>bch_prio_write() ==>bch_journal_meta() ==>bch_journal() ==>journal_wait_for_write() In journal_wait_for_write(), It will check if journal is full by journal_full(), but the long time random small IO writing causes the exhaustion of journal buckets(journal.blocks_free=0), In order to release the journal buckets, the allocator calls btree_flush_write() to flush keys to btree nodes, and waits on c->journal.wait until btree nodes writing over or there has already some journal buckets space, then the allocator thread goes to sleep. but in btree_flush_write(), since bch_journal_replay() is not finished, so no btree nodes have journal (condition "if (btree_current_write(b)->journal)" never satisfied), so we got no btree node to flush, no journal bucket released, and allocator sleep all the times. Through the above analysis, we can see that: 1) Register thread wait for allocator thread to allocate buckets of RESERVE_BTREE type; 2) Alloctor thread wait for register thread to replay journal, so it can flush btree nodes and get journal bucket. then they are all got stuck by waiting for each other. Hua Rui provided a patch for me, by allocating some buckets of RESERVE_BTREE type in advance, so the register thread can get bucket when btree node splitting and no need to waiting for the allocator thread. I tested it, it has effect, and register thread run a step forward, but finally are still got stuck, the reason is only 8 bucket of RESERVE_BTREE type were allocated, and in bch_journal_replay(), after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left, then btree_check_reserve() is not satisfied anymore, so it goes to sleep again, and in the same time, alloctor thread did not flush enough btree nodes to release a journal bucket, so they all got stuck again. So we need to allocate more buckets of RESERVE_BTREE type in advance, but how much is enough? By experience and test, I think it should be as much as journal buckets. Then I modify the code as this patch, and test in the machine, and it works. This patch modified base on Hua Rui’s patch, and allocate more buckets of RESERVE_BTREE type in advance to avoid register thread and allocate thread going to wait for each other. [patch v2] ca->sb.njournal_buckets would be 0 in the first time after cache creation, and no journal exists, so just 8 btree buckets is OK. Signed-off-by: Hua Rui <huarui.dev@gmail.com> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-07bcache: set error_limit correctlyColy Li
Struct cache uses io_errors for two purposes, - Error decay: when cache set error_decay is set, io_errors is used to generate a small piece of delay when I/O error happens. - I/O errors counter: in order to generate big enough value for error decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a IO_ERROR_SHIFT). In function bch_count_io_errors(), if I/O errors counter reaches cache set error limit, bch_cache_set_error() will be called to retire the whold cache set. But current code is problematic when checking the error limit, see the following code piece from bch_count_io_errors(), 90 if (error) { 91 char buf[BDEVNAME_SIZE]; 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT, 93 &ca->io_errors); 94 errors >>= IO_ERROR_SHIFT; 95 96 if (errors < ca->set->error_limit) 97 pr_err("%s: IO error on %s, recovering", 98 bdevname(ca->bdev, buf), m); 99 else 100 bch_cache_set_error(ca->set, 101 "%s: too many IO errors %s", 102 bdevname(ca->bdev, buf), m); 103 } At line 94, errors is right shifting IO_ERROR_SHIFT bits, now it is real errors counter to compare at line 96. But ca->set->error_limit is initia- lized with an amplified value in bch_cache_set_alloc(), 1545 c->error_limit = 8 << IO_ERROR_SHIFT; It means by default, in bch_count_io_errors(), before 8<<20 errors happened bch_cache_set_error() won't be called to retire the problematic cache device. If the average request size is 64KB, it means bcache won't handle failed device until 512GB data is requested. This is too large to be an I/O threashold. So I believe the correct error limit should be much less. This patch sets default cache set error limit to 8, then in bch_count_io_errors() when errors counter reaches 8 (if it is default value), function bch_cache_set_error() will be called to retire the whole cache set. This patch also removes bits shifting when store or show io_error_limit value via sysfs interface. Nowadays most of SSDs handle internal flash failure automatically by LBA address re-indirect mapping. If an I/O error can be observed by upper layer code, it will be a notable error because that SSD can not re-indirect map the problematic LBA address to an available flash block. This situation indicates the whole SSD will be failed very soon. Therefore setting 8 as the default io error limit value makes sense, it is enough for most of cache devices. Changelog: v2: add reviewed-by from Hannes. v1: initial version for review. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-07bcache: properly set task state in bch_writeback_thread()Coly Li
Kernel thread routine bch_writeback_thread() has the following code block, 447 down_write(&dc->writeback_lock); 448~450 if (check conditions) { 451 up_write(&dc->writeback_lock); 452 set_current_state(TASK_INTERRUPTIBLE); 453 454 if (kthread_should_stop()) 455 return 0; 456 457 schedule(); 458 continue; 459 } If condition check is true, its task state is set to TASK_INTERRUPTIBLE and call schedule() to wait for others to wake up it. There are 2 issues in current code, 1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if another process changes the condition and call wake_up_process(dc-> writeback_thread), then at line 452 task state is set back to TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be waken up. 2, At line 454 if kthread_should_stop() is true, writeback kernel thread will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and call do_exit(). It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a warning message is reported by __might_sleep(): "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]". For the first issue, task state should be set before condition checks. Ineed because dc->writeback_lock is required when modifying all the conditions, calling set_current_state() inside code block where dc-> writeback_lock is hold is safe. But this is quite implicit, so I still move set_current_state() before all the condition checks. For the second issue, frankley speaking it does not hurt when kernel thread exits with TASK_INTERRUPTIBLE state, but this warning message scares users, makes them feel there might be something risky with bcache and hurt their data. Setting task state to TASK_RUNNING before returning fixes this problem. In alloc.c:allocator_wait(), there is also a similar issue, and is also fixed in this patch. Changelog: v3: merge two similar fixes into one patch v2: fix the race issue in v1 patch. v1: initial buggy fix. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-07bcache: fix high CPU occupancy during journalTang Junhui
After long time small writing I/O running, we found the occupancy of CPU is very high and I/O performance has been reduced by about half: [root@ceph151 internal]# top top - 15:51:05 up 1 day,2:43, 4 users, load average: 16.89, 15.15, 16.53 Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache KiB Swap: 65667068 total, 65667068 free, 0 used. 25136812 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2023 root 20 0 0 0 0 S 55.1 0.0 0:04.42 kworker/11:191 14126 root 20 0 0 0 0 S 42.9 0.0 0:08.72 kworker/10:3 9292 root 20 0 0 0 0 S 30.4 0.0 1:10.99 kworker/6:1 8553 ceph 20 0 4242492 1.805g 18804 S 30.0 2.9 410:07.04 ceph-osd 12287 root 20 0 0 0 0 S 26.7 0.0 0:28.13 kworker/7:85 31019 root 20 0 0 0 0 S 26.1 0.0 1:30.79 kworker/22:1 1787 root 20 0 0 0 0 R 25.7 0.0 5:18.45 kworker/8:7 32169 root 20 0 0 0 0 S 14.5 0.0 1:01.92 kworker/23:1 21476 root 20 0 0 0 0 S 13.9 0.0 0:05.09 kworker/1:54 2204 root 20 0 0 0 0 S 12.5 0.0 1:25.17 kworker/9:10 16994 root 20 0 0 0 0 S 12.2 0.0 0:06.27 kworker/5:106 15714 root 20 0 0 0 0 R 10.9 0.0 0:01.85 kworker/19:2 9661 ceph 20 0 4246876 1.731g 18800 S 10.6 2.8 403:00.80 ceph-osd 11460 ceph 20 0 4164692 2.206g 18876 S 10.6 3.5 360:27.19 ceph-osd 9960 root 20 0 0 0 0 S 10.2 0.0 0:02.75 kworker/2:139 11699 ceph 20 0 4169244 1.920g 18920 S 10.2 3.1 355:23.67 ceph-osd 6843 ceph 20 0 4197632 1.810g 18900 S 9.6 2.9 380:08.30 ceph-osd The kernel work consumed a lot of CPU, and I found they are running journal work, The journal is reclaiming source and flush btree node with surprising frequency. Through further analysis, we found that in btree_flush_write(), we try to get a btree node with the smallest fifo idex to flush by traverse all the btree nodein c->bucket_hash, after we getting it, since no locker protects it, this btree node may have been written to cache device by other works, and if this occurred, we retry to traverse in c->bucket_hash and get another btree node. When the problem occurrd, the retry times is very high, and we consume a lot of CPU in looking for a appropriate btree node. In this patch, we try to record 128 btree nodes with the smallest fifo idex in heap, and pop one by one when we need to flush btree node. It greatly reduces the time for the loop to find the appropriate BTREE node, and also reduce the occupancy of CPU. [note by mpl: this triggers a checkpatch error because of adjacent, pre-existing style violations] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-07bcache: add journal statisticTang Junhui
Sometimes, Journal takes up a lot of CPU, we need statistics to know what's the journal is doing. So this patch provide some journal statistics: 1) reclaim: how many times the journal try to reclaim resource, usually the journal bucket or/and the pin are exhausted. 2) flush_write: how many times the journal try to flush btree node to cache device, usually the journal bucket are exhausted. 3) retry_flush_write: how many times the journal retry to flush the next btree node, usually the previous tree node have been flushed by other thread. we show these statistic by sysfs interface. Through these statistics We can totally see the status of journal module when the CPU is too high. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>