Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"This time there are not a lot of changes coming from the IOMMU side.
That is partly because I returned from my parental leave late in the
development process and probably partly because everyone was busy with
Spectre and Meltdown mitigation work and didn't find the time for
IOMMU work. So here are the few changes that queued up for this merge
window:
- 5-level page-table support for the Intel IOMMU.
- error reporting improvements for the AMD IOMMU driver
- additional DT bindings for ipmmu-vmsa (Renesas)
- small fixes and cleanups"
* tag 'iommu-updates-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu: Clean up of_iommu_init_fn
iommu/ipmmu-vmsa: Remove redundant of_iommu_init_fn hook
iommu/msm: Claim bus ops on probe
iommu/vt-d: Enable 5-level paging mode in the PASID entry
iommu/vt-d: Add a check for 5-level paging support
iommu/vt-d: Add a check for 1GB page support
iommu/vt-d: Enable upto 57 bits of domain address width
iommu/vt-d: Use domain instead of cache fetching
iommu/exynos: Don't unconditionally steal bus ops
iommu/omap: Fix debugfs_create_*() usage
iommu/vt-d: clean up pr_irq if request_threaded_irq fails
iommu: Check the result of iommu_group_get() for NULL
iommu/ipmmu-vmsa: Add r8a779(70|95) DT bindings
iommu/ipmmu-vmsa: Add r8a7796 DT binding
iommu/amd: Set the device table entry PPR bit for IOMMU V2 devices
iommu/amd - Record more information about unknown events
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia
Pull pcmcia updates from Dominik Brodowski:
"The linux-pcmcia mailing list was shut down, so offer an alternative
path for patches in MAINTAINERS.
Also, throw in two odd fixes for the pcmcia subsystem"
* 'pcmcia' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia:
pcmcia: soc_common: Handle return value of clk_prepare_enable
pcmcia: use proper printk format for resource
pcmcia: remove mailing list, update MAINTAINERS
|
|
git://people.freedesktop.org/~airlied/linux
Pull more drm updates from Dave Airlie:
"Ben missed sending his nouveau tree, but he really didn't have much
stuff in it:
- GP108 acceleration support is enabled by "secure boot" support
- some clockgating work on Kepler, and bunch of fixes
- the bulk of the diff is regenerated firmware files, the change to
them really isn't that large.
Otherwise this contains regular Intel and AMDGPU fixes"
* tag 'drm-for-v4.16-part2-fixes' of git://people.freedesktop.org/~airlied/linux: (59 commits)
drm/i915/bios: add DP max link rate to VBT child device struct
drm/i915/cnp: Properly handle VBT ddc pin out of bounds.
drm/i915/cnp: Ignore VBT request for know invalid DDC pin.
drm/i915/cmdparser: Do not check past the cmd length.
drm/i915/cmdparser: Check reg_table_count before derefencing.
drm/i915/bxt, glk: Increase PCODE timeouts during CDCLK freq changing
drm/i915/gvt: Use KVM r/w to access guest opregion
drm/i915/gvt: Fix aperture read/write emulation when enable x-no-mmap=on
drm/i915/gvt: only reset execlist state of one engine during VM engine reset
drm/i915/gvt: refine intel_vgpu_submission_ops as per engine ops
drm/amdgpu: re-enable CGCG on CZ and disable on ST
drm/nouveau/clk: fix gcc-7 -Wint-in-bool-context warning
drm/nouveau/mmu: Fix trailing semicolon
drm/nouveau: Introduce NvPmEnableGating option
drm/nouveau: Add support for SLCG for Kepler2
drm/nouveau: Add support for BLCG on Kepler2
drm/nouveau: Add support for BLCG on Kepler1
drm/nouveau: Add support for basic clockgating on Kepler1
drm/nouveau/kms/nv50: fix handling of gamma since atomic conversion
drm/nouveau/kms/nv50: use INTERPOLATE_257_UNITY_RANGE LUT on newer chipsets
...
|
|
Pull ceph updates from Ilya Dryomov:
"Things have been very quiet on the rbd side, as work continues on the
big ticket items slated for the next merge window.
On the CephFS side we have a large number of cap handling
improvements, a fix for our long-standing abuse of ->journal_info in
ceph_readpages() and yet another dentry pointer management patch"
* tag 'ceph-for-4.16-rc1' of git://github.com/ceph/ceph-client:
ceph: improving efficiency of syncfs
libceph: check kstrndup() return value
ceph: try to allocate enough memory for reserved caps
ceph: fix race of queuing delayed caps
ceph: delete unreachable code in ceph_check_caps()
ceph: limit rate of cap import/export error messages
ceph: fix incorrect snaprealm when adding caps
ceph: fix un-balanced fsc->writeback_count update
ceph: track read contexts in ceph_file_info
ceph: avoid dereferencing invalid pointer during cached readdir
ceph: use atomic_t for ceph_inode_info::i_shared_gen
ceph: cleanup traceless reply handling for rename
ceph: voluntarily drop Fx cap for readdir request
ceph: properly drop caps for setattr request
ceph: voluntarily drop Lx cap for link/rename requests
ceph: voluntarily drop Ax cap for requests that create new inode
rbd: whitelist RBD_FEATURE_OPERATIONS feature bit
rbd: don't NULL out ->obj_request in rbd_img_obj_parent_read_full()
rbd: use kmem_cache_zalloc() in rbd_img_request_create()
rbd: obj_request->completion is unused
|
|
When using devmap to redirect packets between interfaces,
xdp_do_flush() is usually a must to flush any batched
packets. Unfortunately this is missed in current tuntap
implementation.
Unlike most hardware driver which did XDP inside NAPI loop and call
xdp_do_flush() at then end of each round of poll. TAP did it in the
context of process e.g tun_get_user(). So fix this by count the
pending redirected packets and flush when it exceeds NAPI_POLL_WEIGHT
or MSG_MORE was cleared by sendmsg() caller.
With this fix, xdp_redirect_map works again between two TAPs.
Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull more arm64 updates from Catalin Marinas:
"As I mentioned in the last pull request, there's a second batch of
security updates for arm64 with mitigations for Spectre/v1 and an
improved one for Spectre/v2 (via a newly defined firmware interface
API).
Spectre v1 mitigation:
- back-end version of array_index_mask_nospec()
- masking of the syscall number to restrict speculation through the
syscall table
- masking of __user pointers prior to deference in uaccess routines
Spectre v2 mitigation update:
- using the new firmware SMC calling convention specification update
- removing the current PSCI GET_VERSION firmware call mitigation as
vendors are deploying new SMCCC-capable firmware
- additional branch predictor hardening for synchronous exceptions
and interrupts while in user mode
Meltdown v3 mitigation update:
- Cavium Thunder X is unaffected but a hardware erratum gets in the
way. The kernel now starts with the page tables mapped as global
and switches to non-global if kpti needs to be enabled.
Other:
- Theoretical trylock bug fixed"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (38 commits)
arm64: Kill PSCI_GET_VERSION as a variant-2 workaround
arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support
arm/arm64: smccc: Implement SMCCC v1.1 inline primitive
arm/arm64: smccc: Make function identifiers an unsigned quantity
firmware/psci: Expose SMCCC version through psci_ops
firmware/psci: Expose PSCI conduit
arm64: KVM: Add SMCCC_ARCH_WORKAROUND_1 fast handling
arm64: KVM: Report SMCCC_ARCH_WORKAROUND_1 BP hardening support
arm/arm64: KVM: Turn kvm_psci_version into a static inline
arm/arm64: KVM: Advertise SMCCC v1.1
arm/arm64: KVM: Implement PSCI 1.0 support
arm/arm64: KVM: Add smccc accessors to PSCI code
arm/arm64: KVM: Add PSCI_VERSION helper
arm/arm64: KVM: Consolidate the PSCI include files
arm64: KVM: Increment PC after handling an SMC trap
arm: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls
arm64: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls
arm64: entry: Apply BP hardening for suspicious interrupts from EL0
arm64: entry: Apply BP hardening for high-priority synchronous exceptions
arm64: futex: Mask __user pointers prior to dereference
...
|
|
Pull virtio/vhost updates from Michael Tsirkin:
"virtio, vhost: fixes, cleanups, features
This includes the disk/cache memory stats for for the virtio balloon,
as well as multiple fixes and cleanups"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: don't hold onto file pointer for VHOST_SET_LOG_FD
vhost: don't hold onto file pointer for VHOST_SET_VRING_ERR
vhost: don't hold onto file pointer for VHOST_SET_VRING_CALL
ringtest: ring.c malloc & memset to calloc
virtio_vop: don't kfree device on register failure
virtio_pci: don't kfree device on register failure
virtio: split device_register into device_initialize and device_add
vhost: remove unused lock check flag in vhost_dev_cleanup()
vhost: Remove the unused variable.
virtio_blk: print capacity at probe time
virtio: make VIRTIO a menuconfig to ease disabling it all
virtio/ringtest: virtio_ring: fix up need_event math
virtio/ringtest: fix up need_event math
virtio: virtio_mmio: make of_device_ids const.
firmware: Use PTR_ERR_OR_ZERO()
virtio-mmio: Use PTR_ERR_OR_ZERO()
vhost/scsi: Improve a size determination in four functions
virtio_balloon: include disk/file caches memory statistics
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git
ath.git fixes for 4.16. Major changes:
ath10k
* correct firmware RAM dump length for QCA6174/QCA9377
* add new QCA988X device id
* fix a kernel panic during pci probe
* revert a recent commit which broke ath10k firmware metadata parsing
ath9k
* fix a noise floor regression introduced during the merge window
* add new device id
|
|
DKMS and similar out-of-tree module replacement services use
module version to make sure the out-of-tree software is not
older than the module shipped with the kernel. We use the
kernel version in ethtool -i output, put it into MODULE_VERSION
as well.
Reported-by: Jan Gutter <jan.gutter@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Most FWs limit the number of TSO segments a frame can produce
to 64. This is for fairness and efficiency (of FW datapath)
reasons. If a frame with larger number of segments is submitted
the FW will drop it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All netdevs which can accept TC offloads must implement
.ndo_set_features(). nfp_reprs currently do not do that, which
means hw-tc-offload can be turned on and off even when offloads
are active.
Whether the offloads are active is really a question to nfp_ports,
so remove the per-app tc_busy callback indirection thing, and
simply count the number of offloaded items in nfp_port structure.
Fixes: 8a2768732a4d ("nfp: provide infrastructure for offloading flower based TC filters")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Tested-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
nfp_port is a structure which represents an ASIC port, both
PCIe vNIC (on a PF or a VF) or the external MAC port. vNIC
netdev (struct nfp_net) and pure representor netdev (struct
nfp_repr) both have a pointer to this structure. nfp_reprs
always have a port associated. nfp_nets, however, only represent
a device port in legacy mode, where they are considered the
MAC port. In switchdev mode they are just the CPU's side of
the PCIe link.
By definition TC offloads only apply to device ports. Don't
set the flag on vNICs without a port (i.e. in switchdev mode).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Tested-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Upcoming changes will require all netdevs supporting TC offloads
to have a full struct nfp_port. Require those for BPF offload.
The operation without management FW reporting information about
Ethernet ports is something we only support for very old and very
basic NIC firmwares anyway.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Tested-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 9ed4f91628737c820af6a1815b65bc06bd31518f.
The commit introduced a regression that over read the ie with
the padding.
- the expected IE information
ath10k_pci 0000:03:00.0: found firmware features ie (1 B)
ath10k_pci 0000:03:00.0: Enabling feature bit: 6
ath10k_pci 0000:03:00.0: Enabling feature bit: 7
ath10k_pci 0000:03:00.0: features
ath10k_pci 0000:03:00.0: 00000000: c0 00 00 00 00 00 00 00
- the wrong IE with padding is read (0x77)
ath10k_pci 0000:03:00.0: found firmware features ie (4 B)
ath10k_pci 0000:03:00.0: Enabling feature bit: 6
ath10k_pci 0000:03:00.0: Enabling feature bit: 7
ath10k_pci 0000:03:00.0: Enabling feature bit: 8
ath10k_pci 0000:03:00.0: Enabling feature bit: 9
ath10k_pci 0000:03:00.0: Enabling feature bit: 10
ath10k_pci 0000:03:00.0: Enabling feature bit: 12
ath10k_pci 0000:03:00.0: Enabling feature bit: 13
ath10k_pci 0000:03:00.0: Enabling feature bit: 14
ath10k_pci 0000:03:00.0: Enabling feature bit: 16
ath10k_pci 0000:03:00.0: Enabling feature bit: 17
ath10k_pci 0000:03:00.0: Enabling feature bit: 18
ath10k_pci 0000:03:00.0: features
ath10k_pci 0000:03:00.0: 00000000: c0 77 07 00 00 00 00 00
Tested-by: Mike Lothian <mike@fireburn.co.uk>
Signed-off-by: Ryan Hsu <ryanhsu@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
|
|
Immed relocation is missing a shift which means for larger
offsets the lower and higher part of the address would be
ORed together.
Fixes: ce4ebfd859c3 ("nfp: bpf: add helpers for updating immediate instructions")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
* acpi-video:
ACPI / video: Use true for boolean value
* acpi-battery:
ACPI / battery: Add quirk for Asus UX360UA and UX410UAK
* acpi-cppc:
ACPI / CPPC: Use 64-bit arithmetic instead of 32-bit
|
|
* acpi-tables:
ACPI: SPCR: Make SPCR available to x86
ACPI / tables: Add IORT to injectable table list
* acpi-bus:
ACPI / bus: Parse tables as term_list for Dell XPS 9570 and Precision M5530
ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs
ACPI / bus: Do not call _STA on battery devices with unmet dependencies
PCI: acpiphp_ibm: prepare for acpi_get_object_info() no longer returning status
ACPI: export acpi_bus_get_status_handle()
* acpi-processor:
ACPI / processor: Set default C1 idle state description
ACPI: processor_perflib: Do not send _PPC change notification if not ready
|
|
* acpica:
ACPICA: Update version to 20180105
ACPICA: All acpica: Update copyrights to 2018
ACPICA: Add a missing pair of parentheses
ACPICA: Prefer ACPI_TO_POINTER() over ACPI_ADD_PTR()
ACPICA: Avoid NULL pointer arithmetic
ACPICA: Linux: add support for X32 ABI compilation
|
|
Hw older versions support non-alpha color formats derived
from native alpha color formats only on the primary layer.
For instance, RG16 native format without alpha works fine
on 2nd layer but XR24 (derived color format from AR24)
does not work on 2nd layer.
Signed-off-by: Philippe Cornu <philippe.cornu@st.com>
Reviewed-by: Yannick Fertré <yannick.fertre@st.com>
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20180201104243.20726-3-philippe.cornu@st.com
|
|
ltdc supports natively some color formats with alpha (like
ARGB8888, ARGB1555, ARGB4444...). Related non-alpha formats are
supported too (ARGB8888->XRGB8888, ARGB4444->XRGB4444...) by
adjusting ltdc blending factors.
Note: Wayland/Weston requests by default the non-alpha XRGB8888
color format.
Signed-off-by: Philippe Cornu <philippe.cornu@st.com>
Reviewed-by: Yannick Fertré <yannick.fertre@st.com>
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20180201104243.20726-2-philippe.cornu@st.com
|
|
* pm-cpufreq:
arm: imx: Add MODULE_ALIAS for cpufreq
cpufreq: Add and use cpufreq_for_each_{valid_,}entry_idx()
cpufreq: intel_pstate: Enable HWP during system resume on CPU0
cpufreq: scpi: fix error return code in scpi_cpufreq_init()
cpufreq: scpi: fix static checker warning cdev isn't an ERR_PTR
cpufreq: remove at32ap-cpufreq
cpufreq: AMD: Ignore the check for ProcFeedback in ST/CZ
cpufreq: Skip cpufreq resume if it's not suspended
* pm-cpuidle:
x86: PM: Make APM idle driver initialize polling state
* pm-domains:
PM / domains: Fix up domain-idle-states OF parsing
|
|
Without this, the imx6q-cpufreq driver isn't loaded
automatically when built as a module
Tested on wandboard quad with a fedora 27 kernel rpm
Signed-off-by: Nicolas Chauvet <kwizart@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Pointer subtraction is slow and tedious. Therefore, replace all instances
where cpufreq_for_each_{valid_,}entry loops contained such substractions
with an iteration macro providing an index to the frequency_table entry.
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/r/20180120020237.GM13338@ZenIV.linux.org.uk
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When maxcpus=1 is in the kernel command line, the BP is responsible
for re-enabling the HWP - because currently only the APs invoke
intel_pstate_hwp_enable() during their online process - which might
put the system into unstable state after resume.
Fix this by enabling the HWP explicitly on BP during resume.
Reported-by: Doug Smythies <dsmythies@telus.net>
Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Yu Chen <yu.c.chen@intel.com>
[ rjw: Subject/changelog, minor modifications ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Fix to return a negative error code from the clk_get() error handling
case instead of 0, as done elsewhere in this function.
Fixes: 343a8d17fa8d (cpufreq: scpi: remove arm_big_little dependency)
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
There's no need to be printing a raw kernel pointer to the kernel log at
every boot. So just remove it, and change the whole message to use the
correct dev_info() call at the same time.
Reported-by: Wang Qize <wang_qize@venustech.com.cn>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add support for the Synopsys DesignWare MIPI DSI version 1.31
Two registers need to be updated/added for supporting 1.31:
* PHY_TMR_CFG 0x9c (updated)
1.30 [31:24] phy_hs2lp_time
[23:16] phy_lp2hs_time
[14: 0] max_rd_time
1.31 [25:16] phy_hs2lp_time
[ 9: 0] phy_lp2hs_time
* PHY_TMR_RD_CFG 0xf4 (new)
1.31 [14: 0] max_rd_time
Signed-off-by: Philippe Cornu <philippe.cornu@st.com>
Reviewed-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180206084251.303-1-philippe.cornu@st.com
|
|
This patch adds the DCS/GENERIC DSI read feature.
Signed-off-by: Philippe Cornu <philippe.cornu@st.com>
Reviewed-by: Andrzej Hajda <a.hajda@samsung.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180204213104.17834-1-philippe.cornu@st.com
|
|
It was discovered that simple program which indefinitely sends 200b UDP
packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network
watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog
timeout is triggered due to race between cpsw_ndo_start_xmit() and
cpsw_tx_handler() [NAPI]
cpsw_ndo_start_xmit()
if (unlikely(!cpdma_check_free_tx_desc(txch))) {
txq = netdev_get_tx_queue(ndev, q_idx);
netif_tx_stop_queue(txq);
^^ as per [1] barier has to be used after set_bit() otherwise new value
might not be visible to other cpus
}
cpsw_tx_handler()
if (unlikely(netif_tx_queue_stopped(txq)))
netif_tx_wake_queue(txq);
and when it happens ndev TX queue became disabled forever while driver's HW
TX queue is empty.
Fix this, by adding smp_mb__after_atomic() after netif_tx_stop_queue()
calls and double check for free TX descriptors after stopping ndev TX queue
- if there are free TX descriptors wake up ndev TX queue.
[1] https://www.kernel.org/doc/html/latest/core-api/atomic_ops.html
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This change will guard against a double free in the case that the
buffers were previously freed at some other time, such as during
a device reset. It resolves a kernel oops that occurred when changing
the VNIC device's MTU.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
At some point, a check was added to exit the polling routine during resets.
This makes sense for most reset conditions, but for a non-fatal error, we
expect the polling routine to continue running to properly clean up the rx
queues. This patch checks if we are performing a non-fatal reset and if we
are, continues normal polling operation.
Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix the number of queues per enabled TC and report available queues
to the kernel without having to limit them to the max RSS limit so
they are available to be mapped for XPS. This allows a queue per
processing thread available for handling traffic for the given
traffic class.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the appropriate SPDX license tags to the Sun network drivers
as outlined in Documentation/process/license-rules.rst.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit baf5086840ab1 ("cxgb4: restructure VF mgmt code") has reordered
some code but an error handling label has not been updated accordingly.
So fix it and free 'adapter' if 't4_wait_dev_ready()' fails.
Fixes: baf5086840ab1 ("cxgb4: restructure VF mgmt code")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
* for-linus:
block, bfq: add requeue-request hook
bcache: fix for data collapse after re-attaching an attached device
bcache: return attach error when no cache set exist
bcache: set writeback_rate_update_seconds in range [1, 60] seconds
bcache: fix for allocator and register thread race
bcache: set error_limit correctly
bcache: properly set task state in bch_writeback_thread()
bcache: fix high CPU occupancy during journal
bcache: add journal statistic
block: Add should_fail_bio() for bpf error injection
blk-wbt: account flush requests correctly
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-next
Fix for pcode timeouts on BXT and GLK, cmdparser fixes and fixes
for new vbt version on CFL and CNL.
GVT contains vGPU reset enhancement, which refines vGPU reset flow
and the support of virtual aperture read/write when x-no-mmap=on
is set in KVM, which is required by a test case from Redhat and
also another fix for virtual OpRegion.
* tag 'drm-intel-next-fixes-2018-02-07' of git://anongit.freedesktop.org/drm/drm-intel:
drm/i915/bios: add DP max link rate to VBT child device struct
drm/i915/cnp: Properly handle VBT ddc pin out of bounds.
drm/i915/cnp: Ignore VBT request for know invalid DDC pin.
drm/i915/cmdparser: Do not check past the cmd length.
drm/i915/cmdparser: Check reg_table_count before derefencing.
drm/i915/bxt, glk: Increase PCODE timeouts during CDCLK freq changing
drm/i915/gvt: Use KVM r/w to access guest opregion
drm/i915/gvt: Fix aperture read/write emulation when enable x-no-mmap=on
drm/i915/gvt: only reset execlist state of one engine during VM engine reset
drm/i915/gvt: refine intel_vgpu_submission_ops as per engine ops
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"Fix suspend to idle.
Testing on mainline after the initial regulator pull request went in
identified a regression for suspend to idle due to it calling the
suspend operations with states that it wasn't realized could happen,
this patch fixes the problem"
* tag 'regulator-fix-v4.16-suspend' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: Fix suspend to idle
|
|
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
"There is nothing really major here:
- fix display-timings lookup in the Device Tree in atmel_lcdfb driver
(Johan Hovold)
- fix video mode and line_length to be set correctly in vfb driver
(Pieter "PoroCYon" Sluys)
- fix returning nonsensical values to the user-space on GIO_FONTX
ioctl when using dummy console (Nicolas Pitre)
- add missing license tag to mmpfb driver (Arnd Bergmann)
- convert radeonfb and pxa3xx_gcu drivers to use ktime_get[_ts64]()
instead of the deprecated do_gettimeofday() (Arnd Bergmann)
- switch udlfb driver from using the pr_*() logging functions to the
dev_*() ones + related cleanups (Ladislav Michl)
- use __raw I/O accessors also on arm64 (Ji Zhang)
- fix Kconfig help text for intelfb driver (Randy Dunlap)
- do not duplicate features data in omapfb driver (Ladislav Michl)
- misc cleanups (Colin Ian King, Markus Elfring, Rasmus Villemoes,
Vasyl Gomonovych, Himanshu Jha, Michael Trimarchi)"
* tag 'fbdev-v4.16' of git://github.com/bzolnier/linux: (25 commits)
video: udlfb: Switch from the pr_*() to the dev_*() logging functions
video: udlfb: Constify read only data
video: fbdev/mmp: add MODULE_LICENSE
console/dummy: leave .con_font_get set to NULL
fbdev: mxsfb: use framebuffer_alloc in the correct way
video: udlfb: Do not name private data 'dev'
video: udlfb: Remove noisy warnings
video: udlfb: Remove redundant gdev variable
video: udlfb: Remove unnecessary local variable
fbdev: auo_k190x: Use zeroing memory allocator instead of allocator/memset
vfb: fix video mode and line_length being set when loaded
fbdev: arm64 use __raw I/O memory api
omapfb: dss: Do not duplicate features data
video: fbdev: omap2: Use PTR_ERR_OR_ZERO()
fbdev: au1200fb: delete duplicate header contents
fbdev: pxa3xx: use ktime_get_ts64 for time stamps
fbdev: radeon: use ktime_get() for HZ calibration
video: smscufx: Improve a size determination in two functions
video: udlfb: Delete an unnecessary return statement in two functions
video: udlfb: Improve a size determination in dlfb_alloc_urb_list()
...
|
|
git://git.infradead.org/linux-platform-drivers-x86
Pull more x86 platform-drivers updates from Andy Shevchenko:
"The DEFINE_SHOW_ATTRIBUTE() macro was defined privately in three
locations and is useful for new and old users to avoid a lot of code
duplication.
Move the macro to seq_file.h.
Along with above, clean up three drivers to use that macro.
This, due to dependencies, was sent separately since affected changes
weren't upstream originally yet. The rationale of doing this now is to
allow use of new macro in v4.17 cycle in a conflictless manner"
* tag 'platform-drivers-x86-v4.16-2' of git://git.infradead.org/linux-platform-drivers-x86:
platform/x86: samsung-laptop: Re-use DEFINE_SHOW_ATTRIBUTE() macro
platform/x86: ideapad-laptop: Re-use DEFINE_SHOW_ATTRIBUTE() macro
platform/x86: dell-laptop: Re-use DEFINE_SHOW_ATTRIBUTE() macro
seq_file: Introduce DEFINE_SHOW_ATTRIBUTE() helper macro
|
|
Update VBT defs to reflect revision 216. While at it, default the
expected child device struct size to sizeof the size rather than a
hardcoded value.
v2: Fix bit order (David)
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180118153310.32437-1-jani.nikula@intel.com
(cherry picked from commit c4fb60b9aba9f939d3f8575df23fd8d5958ec6ed)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
into drm-next
A few more misc fixes for 4.16.
* 'drm-next-4.16' of git://people.freedesktop.org/~agd5f/linux:
drm/amdgpu: re-enable CGCG on CZ and disable on ST
drm/amdgpu: disable coarse grain clockgating for ST
drm/radeon: adjust tested variable
drm/amdgpu: remove WARN_ON when VM isn't found v2
drm/amdgpu: fix locking in vega10_ih_prescreen_iv
drm/amdgpu: fix another potential cause of VM faults
drm/amdgpu: use queue 0 for kiq ring
drm/ttm: Fix 'buf' pointer update in ttm_bo_vm_access_kmap() (v2)
drm/ttm: fix missing parameter change for ttm_bo_cleanup_refs
|
|
git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- new watchdog device drivers for Realtek RTD1295 and Spreadtrum SC9860
platform
- add support for the following devices: jz4780 SoC, AST25xx series SoC
and r8a77970 SoC
- convert to watchdog framework: i6300esb_wdt, xen_wdt and sp5100_tco
- several fixes for watchdog core
- remove at32ap700x and obsolete documentation
- gpio: Convert to use GPIO descriptors
- rename gemini into FTWDT010 as this IP block is generc from Faraday
Technology
- various clean-ups and small bugfixes
- add Guenter Roeck as co-maintainer
- change maintainers e-mail address
* tag 'linux-watchdog-4.16-rc1' of git://www.linux-watchdog.org/linux-watchdog: (74 commits)
documentation: watchdog: remove documentation of w83697hf_wdt/w83697ug_wdt
documentation: watchdog: remove documentation for ixp2000
documentation: watchdog: remove documentation of at32ap700x_wdt
watchdog: remove at32ap700x_wdt
watchdog: sp5100_tco: Add support for recent FCH versions
watchdog: sp5100-tco: Abort if watchdog is disabled by hardware
watchdog: sp5100_tco: Use bit operations
watchdog: sp5100_tco: Convert to use watchdog subsystem
watchdog: sp5100_tco: Clean up function and variable names
watchdog: sp5100_tco: Use dev_ print functions where possible
watchdog: sp5100_tco: Match PCI device early
watchdog: sp5100_tco: Clean up sp5100_tco_setupdevice
watchdog: sp5100_tco: Use standard error codes
watchdog: sp5100_tco: Use request_muxed_region where possible
watchdog: sp5100_tco: Fix watchdog disable bit
watchdog: sp5100_tco: Always use SP5100_IO_PM_{INDEX_REG,DATA_REG}
watchdog: core: make sure the watchdog_worker is not deferred
watchdog: mt7621: switch to using managed devm_watchdog_register_device()
watchdog: mt7621: set WDOG_HW_RUNNING bit when appropriate
watchdog: imx2_wdt: restore previous timeout after suspend+resume
...
|
|
back-end device sdm has already attached a cache_set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
another cache set, and it returns with an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
-bash: echo: write error: Invalid argument
After that, execute a command to modify the label of bcache
device:
[root]# echo data_disk1 > label
Then we reboot the system, when the system power on, the back-end
device can not attach to cache_set, a messages show in the log:
Feb 5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set
In sysfs_attach(), dc->sb.set_uuid was assigned to the value
which input through sysfs, no matter whether it is success
or not in bch_cached_dev_attach(). For example, If the back-end
device has already attached to an cache set, bch_cached_dev_attach()
would fail, but dc->sb.set_uuid was changed. Then modify the
label of bcache device, it will call bch_write_bdev_super(),
which would write the dc->sb.set_uuid to the super block, so we
record a wrong cache set ID in the super block, after the system
reboot, the cache set couldn't find the uuid of the back-end
device, so the bcache device couldn't exist and use any more.
In this patch, we don't assigned cache set ID to dc->sb.set_uuid
in sysfs_attach() directly, but input it into bch_cached_dev_attach(),
and assigned dc->sb.set_uuid to the cache set ID after the back-end
device attached to the cache set successful.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
I attach a back-end device to a cache set, and the cache set is not
registered yet, this back-end device did not attach successfully, and no
error returned:
[root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
[root]#
In sysfs_attach(), the return value "v" is initialized to "size" in
the beginning, and if no cache set exist in bch_cache_sets, the "v" value
would not change any more, and return to sysfs, sysfs regard it as success
since the "size" is a positive number.
This patch fixes this issue by assigning "v" with "-ENOENT" in the
initialization.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
dc->writeback_rate_update_seconds can be set via sysfs and its value can
be set to [1, ULONG_MAX]. It does not make sense to set such a large
value, 60 seconds is long enough value considering the default 5 seconds
works well for long time.
Because dc->writeback_rate_update is a special delayed work, it re-arms
itself inside the delayed work routine update_writeback_rate(). When
stopping it by cancel_delayed_work_sync(), there should be a timeout to
wait and make sure the re-armed delayed work is stopped too. A small max
value of dc->writeback_rate_update_seconds is also helpful to decide a
reasonable small timeout.
This patch limits sysfs interface to set dc->writeback_rate_update_seconds
in range of [1, 60] seconds, and replaces the hand-coded number by macros.
Changelog:
v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
v1: initial version.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After long time running of random small IO writing,
I reboot the machine, and after the machine power on,
I found bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
[<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
[<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
[<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
[<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[<ffffffff810a631f>] kthread+0xcf/0xe0
[<ffffffff8164c318>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
[<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
[<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
[<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
[<ffffffff812f702f>] kobj_attr_store+0xf/0x20
[<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
[<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
[<ffffffff811e069f>] SyS_write+0x7f/0xe0
[<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.
I reboot the machine several times, the issue always
exsit in this machine.
I debug the code, and found the call trace as bellow:
register_bcache()
==>run_cache_set()
==>bch_journal_replay()
==>bch_btree_insert()
==>__bch_btree_map_nodes()
==>btree_insert_fn()
==>btree_split() //node need split
==>btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c->btree_cache_wait, and goes to sleep.
Then the allocator thread initialized, the call trace is bellow:
bch_allocator_thread()
==>bch_prio_write()
==>bch_journal_meta()
==>bch_journal()
==>journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c->journal.wait until btree nodes writing
over or there has already some journal buckets space, then the
allocator thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)->journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.
Through the above analysis, we can see that:
1) Register thread wait for allocator thread to allocate buckets of
RESERVE_BTREE type;
2) Alloctor thread wait for register thread to replay journal, so it
can flush btree nodes and get journal bucket.
then they are all got stuck by waiting for each other.
Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator
thread. I tested it, it has effect, and register thread run a step
forward, but finally are still got stuck, the reason is only 8 bucket
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
then btree_check_reserve() is not satisfied anymore, so it goes to sleep
again, and in the same time, alloctor thread did not flush enough btree
nodes to release a journal bucket, so they all got stuck again.
So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough? By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.
This patch modified base on Hua Rui’s patch, and allocate more buckets
of RESERVE_BTREE type in advance to avoid register thread and allocate
thread going to wait for each other.
[patch v2] ca->sb.njournal_buckets would be 0 in the first time after
cache creation, and no journal exists, so just 8 btree buckets is OK.
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Struct cache uses io_errors for two purposes,
- Error decay: when cache set error_decay is set, io_errors is used to
generate a small piece of delay when I/O error happens.
- I/O errors counter: in order to generate big enough value for error
decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a
IO_ERROR_SHIFT).
In function bch_count_io_errors(), if I/O errors counter reaches cache set
error limit, bch_cache_set_error() will be called to retire the whold cache
set. But current code is problematic when checking the error limit, see the
following code piece from bch_count_io_errors(),
90 if (error) {
91 char buf[BDEVNAME_SIZE];
92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
93 &ca->io_errors);
94 errors >>= IO_ERROR_SHIFT;
95
96 if (errors < ca->set->error_limit)
97 pr_err("%s: IO error on %s, recovering",
98 bdevname(ca->bdev, buf), m);
99 else
100 bch_cache_set_error(ca->set,
101 "%s: too many IO errors %s",
102 bdevname(ca->bdev, buf), m);
103 }
At line 94, errors is right shifting IO_ERROR_SHIFT bits, now it is real
errors counter to compare at line 96. But ca->set->error_limit is initia-
lized with an amplified value in bch_cache_set_alloc(),
1545 c->error_limit = 8 << IO_ERROR_SHIFT;
It means by default, in bch_count_io_errors(), before 8<<20 errors happened
bch_cache_set_error() won't be called to retire the problematic cache
device. If the average request size is 64KB, it means bcache won't handle
failed device until 512GB data is requested. This is too large to be an I/O
threashold. So I believe the correct error limit should be much less.
This patch sets default cache set error limit to 8, then in
bch_count_io_errors() when errors counter reaches 8 (if it is default
value), function bch_cache_set_error() will be called to retire the whole
cache set. This patch also removes bits shifting when store or show
io_error_limit value via sysfs interface.
Nowadays most of SSDs handle internal flash failure automatically by LBA
address re-indirect mapping. If an I/O error can be observed by upper layer
code, it will be a notable error because that SSD can not re-indirect
map the problematic LBA address to an available flash block. This situation
indicates the whole SSD will be failed very soon. Therefore setting 8 as
the default io error limit value makes sense, it is enough for most of
cache devices.
Changelog:
v2: add reviewed-by from Hannes.
v1: initial version for review.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Kernel thread routine bch_writeback_thread() has the following code block,
447 down_write(&dc->writeback_lock);
448~450 if (check conditions) {
451 up_write(&dc->writeback_lock);
452 set_current_state(TASK_INTERRUPTIBLE);
453
454 if (kthread_should_stop())
455 return 0;
456
457 schedule();
458 continue;
459 }
If condition check is true, its task state is set to TASK_INTERRUPTIBLE
and call schedule() to wait for others to wake up it.
There are 2 issues in current code,
1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
another process changes the condition and call wake_up_process(dc->
writeback_thread), then at line 452 task state is set back to
TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be
waken up.
2, At line 454 if kthread_should_stop() is true, writeback kernel thread
will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
call do_exit(). It is not good to enter do_exit() with task state
TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
warning message is reported by __might_sleep(): "WARNING: do not call
blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".
For the first issue, task state should be set before condition checks.
Ineed because dc->writeback_lock is required when modifying all the
conditions, calling set_current_state() inside code block where dc->
writeback_lock is hold is safe. But this is quite implicit, so I still move
set_current_state() before all the condition checks.
For the second issue, frankley speaking it does not hurt when kernel thread
exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
makes them feel there might be something risky with bcache and hurt their
data. Setting task state to TASK_RUNNING before returning fixes this
problem.
In alloc.c:allocator_wait(), there is also a similar issue, and is also
fixed in this patch.
Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After long time small writing I/O running, we found the occupancy of CPU
is very high and I/O performance has been reduced by about half:
[root@ceph151 internal]# top
top - 15:51:05 up 1 day,2:43, 4 users, load average: 16.89, 15.15, 16.53
Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie
%Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free, 0 used. 25136812 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2023 root 20 0 0 0 0 S 55.1 0.0 0:04.42 kworker/11:191
14126 root 20 0 0 0 0 S 42.9 0.0 0:08.72 kworker/10:3
9292 root 20 0 0 0 0 S 30.4 0.0 1:10.99 kworker/6:1
8553 ceph 20 0 4242492 1.805g 18804 S 30.0 2.9 410:07.04 ceph-osd
12287 root 20 0 0 0 0 S 26.7 0.0 0:28.13 kworker/7:85
31019 root 20 0 0 0 0 S 26.1 0.0 1:30.79 kworker/22:1
1787 root 20 0 0 0 0 R 25.7 0.0 5:18.45 kworker/8:7
32169 root 20 0 0 0 0 S 14.5 0.0 1:01.92 kworker/23:1
21476 root 20 0 0 0 0 S 13.9 0.0 0:05.09 kworker/1:54
2204 root 20 0 0 0 0 S 12.5 0.0 1:25.17 kworker/9:10
16994 root 20 0 0 0 0 S 12.2 0.0 0:06.27 kworker/5:106
15714 root 20 0 0 0 0 R 10.9 0.0 0:01.85 kworker/19:2
9661 ceph 20 0 4246876 1.731g 18800 S 10.6 2.8 403:00.80 ceph-osd
11460 ceph 20 0 4164692 2.206g 18876 S 10.6 3.5 360:27.19 ceph-osd
9960 root 20 0 0 0 0 S 10.2 0.0 0:02.75 kworker/2:139
11699 ceph 20 0 4169244 1.920g 18920 S 10.2 3.1 355:23.67 ceph-osd
6843 ceph 20 0 4197632 1.810g 18900 S 9.6 2.9 380:08.30 ceph-osd
The kernel work consumed a lot of CPU, and I found they are running journal
work, The journal is reclaiming source and flush btree node with surprising
frequency.
Through further analysis, we found that in btree_flush_write(), we try to
get a btree node with the smallest fifo idex to flush by traverse all the
btree nodein c->bucket_hash, after we getting it, since no locker protects
it, this btree node may have been written to cache device by other works,
and if this occurred, we retry to traverse in c->bucket_hash and get
another btree node. When the problem occurrd, the retry times is very high,
and we consume a lot of CPU in looking for a appropriate btree node.
In this patch, we try to record 128 btree nodes with the smallest fifo idex
in heap, and pop one by one when we need to flush btree node. It greatly
reduces the time for the loop to find the appropriate BTREE node, and also
reduce the occupancy of CPU.
[note by mpl: this triggers a checkpatch error because of adjacent,
pre-existing style violations]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sometimes, Journal takes up a lot of CPU, we need statistics
to know what's the journal is doing. So this patch provide
some journal statistics:
1) reclaim: how many times the journal try to reclaim resource,
usually the journal bucket or/and the pin are exhausted.
2) flush_write: how many times the journal try to flush btree node
to cache device, usually the journal bucket are exhausted.
3) retry_flush_write: how many times the journal retry to flush
the next btree node, usually the previous tree node have been
flushed by other thread.
we show these statistic by sysfs interface. Through these statistics
We can totally see the status of journal module when the CPU is too
high.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|