Age | Commit message (Collapse) | Author |
|
This functions adds necessary APIs needed in JBD2 layer for fast
commits.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch adds fast commit area trackers in the journal_t
structure. These are initialized via the jbd2_fc_init() routine that
this patch adds. This patch also adds ext4/fast_commit.c and
ext4/fast_commit.h files for fast commit code that will be added in
subsequent patches in this series.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We are running out of mount option bits. Add handling for using
s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
ability to turn off the fast commit feature in Ext4.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
"ip addr show" command execute error when we have a physical
network card with a large number of VFs
The return value of if_nlmsg_size() in rtnl_calcit() will exceed
range of u16 data type when any network cards has a larger number of
VFs. rtnl_vfinfo_size() will significant increase needed dump size when
the value of num_vfs is larger.
Eventually we get a wrong value of min_ifinfo_dump_size because of overflow
which decides the memory size needed by netlink dump and netlink_dump()
will return -EMSGSIZE because of not enough memory was allocated.
So fix it by promoting min_dump_alloc data type to u32 to
avoid whole netlink message size overflow and it's also align
with the data type of struct netlink_callback{}.min_dump_alloc
which is assigned by return value of rtnl_calcit()
Signed-off-by: Di Zhu <zhudi21@huawei.com>
Link: https://lore.kernel.org/r/20201021020053.1401-1-zhudi21@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Based on the discussion in [0], update the bpf_redirect_neigh() helper to
accept an optional parameter specifying the nexthop information. This makes
it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without
incurring a duplicate FIB lookup - since the FIB lookup helper will return
the nexthop information even if no neighbour is present, this can simply
be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns
BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen.
[0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk
|
|
Cache the address space ID just like the slot ID. It will be used in
order to fill in the dirty ring entries.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201014182700.2888246-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM unconditionally provides PV features to the guest, regardless of the
configured CPUID. An unwitting guest that doesn't check
KVM_CPUID_FEATURES before use could access paravirt features that
userspace did not intend to provide. Fix this by checking the guest's
CPUID before performing any paravirtual operations.
Introduce a capability, KVM_CAP_ENFORCE_PV_FEATURE_CPUID, to gate the
aforementioned enforcement. Migrating a VM from a host w/o this patch to
a host with this patch could silently change the ABI exposed to the
guest, warranting that we default to the old behavior and opt-in for
the new one.
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Change-Id: I202a0926f65035b872bfe8ad15307c026de59a98
Message-Id: <20200818152429.1923996-4-oupton@google.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"A new driver this cycle is making the bulk of the changes and the
rx8010 driver has been rework to use the modern APIs.
Summary:
Subsystem:
- new generic DT properties: aux-voltage-chargeable,
trickle-voltage-millivolt
New driver:
- Microcrystal RV-3032
Drivers:
- ds1307: use aux-voltage-chargeable
- r9701, rx8010: modernization of the driver
- rv3028: fix clock output, trickle resistor values, RAM
configuration registers"
* tag 'rtc-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits)
rtc: r9701: set range
rtc: r9701: convert to devm_rtc_allocate_device
rtc: r9701: stop setting RWKCNT
rtc: r9701: remove useless memset
rtc: r9701: stop setting a default time
rtc: r9701: remove leftover comment
rtc: rv3032: Add a driver for Microcrystal RV-3032
dt-bindings: rtc: rv3032: add RV-3032 bindings
dt-bindings: rtc: add trickle-voltage-millivolt
rtc: rv3028: ensure ram configuration registers are saved
rtc: rv3028: factorize EERD bit handling
rtc: rv3028: fix trickle resistor values
rtc: rv3028: fix clock output support
rtc: mt6397: Remove unused member dev
rtc: rv8803: simplify the return expression of rv8803_nvram_write
rtc: meson: simplify the return expression of meson_vrtc_probe
rtc: rx8010: rename rx8010_init_client() to rx8010_init()
rtc: ds1307: enable rx8130's backup battery, make it chargeable optionally
rtc: ds1307: consider aux-voltage-chargeable
rtc: ds1307: store previous charge default per chip
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
- if a host can be a client, too, the I2C core can now use it to
emulate SMBus HostNotify support (STM32 and R-Car added this so far)
- also for client mode, a testunit has been added. It can create rare
situations on the bus, so host controllers can be tested
- a binding has been added to mark the bus as "single-master". This
allows for better timeout detections
- new driver for Mellanox Bluefield
- massive refactoring of the Tegra driver
- EEPROMs recognized by the at24 driver can now have custom names
- rest is driver updates
* 'i2c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (80 commits)
Documentation: i2c: add testunit docs to index
i2c: tegra: Improve driver module description
i2c: tegra: Clean up whitespaces, newlines and indentation
i2c: tegra: Clean up and improve comments
i2c: tegra: Clean up printk messages
i2c: tegra: Clean up variable names
i2c: tegra: Improve formatting of variables
i2c: tegra: Check errors for both positive and negative values
i2c: tegra: Factor out hardware initialization into separate function
i2c: tegra: Factor out register polling into separate function
i2c: tegra: Factor out packet header setup from tegra_i2c_xfer_msg()
i2c: tegra: Factor out error recovery from tegra_i2c_xfer_msg()
i2c: tegra: Rename wait/poll functions
i2c: tegra: Remove "dma" variable from tegra_i2c_xfer_msg()
i2c: tegra: Remove redundant check in tegra_i2c_issue_bus_clear()
i2c: tegra: Remove likely/unlikely from the code
i2c: tegra: Remove outdated barrier()
i2c: tegra: Clean up variable types
i2c: tegra: Reorder location of functions in the code
i2c: tegra: Clean up probe function
...
|
|
Pull ceph updates from Ilya Dryomov:
- a patch that removes crush_workspace_mutex (myself). CRUSH
computations are no longer serialized and can run in parallel.
- a couple new filesystem client metrics for "ceph fs top" command
(Xiubo Li)
- a fix for a very old messenger bug that affected the filesystem,
marked for stable (myself)
- assorted fixups and cleanups throughout the codebase from Jeff and
others.
* tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits)
libceph: clear con->out_msg on Policy::stateful_server faults
libceph: format ceph_entity_addr nonces as unsigned
libceph: fix ENTITY_NAME format suggestion
libceph: move a dout in queue_con_delay()
ceph: comment cleanups and clarifications
ceph: break up send_cap_msg
ceph: drop separate mdsc argument from __send_cap
ceph: promote to unsigned long long before shifting
ceph: don't SetPageError on readpage errors
ceph: mark ceph_fmt_xattr() as printf-like for better type checking
ceph: fold ceph_update_writeable_page into ceph_write_begin
ceph: fold ceph_sync_writepages into writepage_nounlock
ceph: fold ceph_sync_readpages into ceph_readpage
ceph: don't call ceph_update_writeable_page from page_mkwrite
ceph: break out writeback of incompatible snap context to separate function
ceph: add a note explaining session reject error string
libceph: switch to the new "osd blocklist add" command
libceph, rbd, ceph: "blacklist" -> "blocklist"
ceph: have ceph_writepages_start call pagevec_lookup_range_tag
ceph: use kill_anon_super helper
...
|
|
- Fix designware-ep Header Type check (Hou Zhiqiang)
- Use DBI accessors instead of own config accessors (Rob Herring)
- Allow overriding bridge pci_ops (Rob Herring)
- Allow root and child buses to have different pci_ops (Rob Herring)
- Add default dwc pci_ops.map_bus (Rob Herring)
- Use pci_ops for root config space accessors in al, exynos, histb,
keystone, kirin, meson, tegra (Rob Herring)
- Remove dwc own/other config accessor ops (Rob Herring)
- Use generic config accessors in dwc (Rob Herring)
- Also call .add_bus() callback for root bus (Rob Herring)
- Convert keystone .scan_bus() callback to use pci_ops.add_bus (Rob
Herring)
- Convert dwc to use pci_host_probe() (Rob Herring)
- Remove dwc root_bus pointer (Rob Herring)
- Remove storing of PCI resources in dwc-specific structs (Rob Herring)
- Simplify config space handling (Rob Herring)
- Drop keystone duplicated DT num-viewport handling (Rob Herring)
- Check CONFIG_PCI_MSI in dw_pcie_msi_init() instead of duplicating it in
all the drivers (Rob Herring)
- Remove imx6 duplicate PCIE_LINK_WIDTH_SPEED_CONTROL definition (Rob
Herring)
- Add dwc num_lanes for use when it's lacking from DT (Rob Herring)
- Ensure "Fast Link Mode" simulation environment setting is cleared (Rob
Herring)
- Drop meson duplicate number of lanes setup (Rob Herring)
- Drop meson unnecessary RC config space init (Rob Herring)
- Rework meson config and dwc port logic register accesses (Rob Herring)
- Use common PCI register definitions in imx6 and qcom (Rob Herring)
- Search for DesignWare PCIe Capability instead of hard-coding its location
(Rob Herring)
- Use common DesignWare register definitions in tegra (Rob Herring)
- Drop keystone unused DBI2 code (Rob Herring)
- Make dwc ATU accessors private (Rob Herring)
- Centralize link gen setting in dwc (Rob Herring)
- Set PORT_LINK_DLL_LINK_EN in common dwc setup code (Rob Herring)
- Drop intel-gw unnecessary DT 'device_type' checking (Rob Herring)
- Move intel-gw PCI_CAP_ID_EXP discovery to the single place it's used (Rob
Herring)
- Drop intel-gw unused max_width (Rob Herring)
- Move N_FTS (fast training sequence) setup to common dwc setup (Rob
Herring)
- Convert spear13xx, tegra194 to use DBI accessors (Rob Herring)
- Add multiple PFs support for DWC (Xiaowei Bao)
- Add MSI-X doorbell mode for endpoint mode (Xiaowei Bao)
- Update MSI/MSI-X capability management for endpoints (Xiaowei Bao)
- Add layerscape ls1088a and ls2088a compatible strings (Xiaowei Bao)
- Update layerscape MSI/MSI-X management (Xiaowei Bao)
- Use doorbell to support MSI-X on layerscape (Xiaowei Bao)
- Add layerscape endpoint mode support for ls1088a and ls2088a (Xiaowei
Bao)
- Add layerscape ls1088a node to DT (Xiaowei Bao)
- Add Freescale/Layerscape ls1088a to endpoint test (Xiaowei Bao)
- Add endpoint test driver data for Layerscape PCIe controllers (Hou
Zhiqiang)
- Fix 'cast truncates bits from constant value' warning (Gustavo Pimentel)
- Add uniphier iATU register description (Kunihiko Hayashi)
- Add common iATU register support (Kunihiko Hayashi)
- Remove keystone iATU register mapping in favor of generic dwc support
(Kunihiko Hayashi)
- Skip PCIE_MSI_INTR0* programming if MSI is disabled (Jisheng Zhang)
- Fix MSI page leakage in suspend/resume (Jisheng Zhang)
- Check whether link is up before attempting config access (best-effort fix
even though it's racy) (Hou Zhiqiang)
* remotes/lorenzo/pci/dwc:
PCI: dwc: Add link up check in dw_child_pcie_ops.map_bus()
PCI: dwc: Fix MSI page leakage in suspend/resume
PCI: dwc: Skip PCIE_MSI_INTR0* programming if MSI is disabled
PCI: keystone: Remove iATU register mapping
PCI: dwc: Add common iATU register support
dt-bindings: PCI: uniphier-ep: Add iATU register description
dt-bindings: PCI: uniphier: Add iATU register description
PCI: dwc: Fix 'cast truncates bits from constant value'
misc: pci_endpoint_test: Add driver data for Layerscape PCIe controllers
misc: pci_endpoint_test: Add LS1088a in pci_device_id table
PCI: layerscape: Add EP mode support for ls1088a and ls2088a
PCI: layerscape: Modify the MSIX to the doorbell mode
PCI: layerscape: Modify the way of getting capability with different PEX
PCI: layerscape: Fix some format issue of the code
dt-bindings: pci: layerscape-pci: Add compatible strings for ls1088a and ls2088a
PCI: designware-ep: Modify MSI and MSIX CAP way of finding
PCI: designware-ep: Move the function of getting MSI capability forward
PCI: designware-ep: Add the doorbell mode of MSI-X in EP mode
PCI: designware-ep: Add multiple PFs support for DWC
PCI: dwc: Use DBI accessors
PCI: dwc: Move N_FTS setup to common setup
PCI: dwc/intel-gw: Drop unused max_width
PCI: dwc/intel-gw: Move getting PCI_CAP_ID_EXP offset to intel_pcie_link_setup()
PCI: dwc/intel-gw: Drop unnecessary checking of DT 'device_type' property
PCI: dwc: Set PORT_LINK_DLL_LINK_EN in common setup code
PCI: dwc: Centralize link gen setting
PCI: dwc: Make ATU accessors private
PCI: dwc: Remove read_dbi2 code
PCI: dwc/tegra: Use common Designware port logic register definitions
PCI: dwc: Remove hardcoded PCI_CAP_ID_EXP offset
PCI: dwc/qcom: Use common PCI register definitions
PCI: dwc/imx6: Use common PCI register definitions
PCI: dwc/meson: Rework PCI config and DW port logic register accesses
PCI: dwc/meson: Drop unnecessary RC config space initialization
PCI: dwc/meson: Drop the duplicate number of lanes setup
PCI: dwc: Ensure FAST_LINK_MODE is cleared
PCI: dwc: Add a 'num_lanes' field to struct dw_pcie
PCI: dwc/imx6: Remove duplicate define PCIE_LINK_WIDTH_SPEED_CONTROL
PCI: dwc: Check CONFIG_PCI_MSI inside dw_pcie_msi_init()
PCI: dwc/keystone: Drop duplicated 'num-viewport'
PCI: dwc: Simplify config space handling
PCI: dwc: Remove storing of PCI resources
PCI: dwc: Remove root_bus pointer
PCI: dwc: Convert to use pci_host_probe()
PCI: dwc: keystone: Convert .scan_bus() callback to use add_bus
PCI: Also call .add_bus() callback for root bus
PCI: dwc: Use generic config accessors
PCI: dwc: Remove dwc specific config accessor ops
PCI: dwc: histb: Use pci_ops for root config space accessors
PCI: dwc: exynos: Use pci_ops for root config space accessors
PCI: dwc: kirin: Use pci_ops for root config space accessors
PCI: dwc: meson: Use pci_ops for root config space accessors
PCI: dwc: tegra: Use pci_ops for root config space accessors
PCI: dwc: keystone: Use pci_ops for config space accessors
PCI: dwc: al: Use pci_ops for child config space accessors
PCI: dwc: Add a default pci_ops.map_bus for root port
PCI: dwc: Allow overriding bridge pci_ops
PCI: dwc: Use DBI accessors instead of own config accessors
PCI: Allow root and child buses to have different pci_ops
PCI: designware-ep: Fix the Header Type check
|
|
- Remove useless __KERNEL__ preprocessor guard in sparc io_32.h (Lorenzo
Pieralisi)
- Move ioremap/iounmap declaration so it's visible in asm-generic/io.h
(Lorenzo Pieralisi)
- Fix memory leak in generic !CONFIG_GENERIC_IOMAP pci_iounmap()
implementation (Lorenzo Pieralisi)
* remotes/lorenzo/pci/pci-iomap:
asm-generic/io.h: Fix !CONFIG_GENERIC_IOMAP pci_iounmap() implementation
sparc32: Move ioremap/iounmap declaration before asm-generic/io.h include
sparc32: Remove useless io_32.h __KERNEL__ preprocessor guard
|
|
- Add ACPI APEI notifier chain for unknown (vendor) CPER records (Shiju
Jose)
- Add handling of HiSilicon HIP PCIe controller errors (Yicong Yang)
* remotes/lorenzo/pci/apei:
PCI: hip: Add handling of HiSilicon HIP PCIe controller errors
ACPI / APEI: Add a notifier chain for unknown (vendor) CPER records
|
|
- Remove unnecessary #includes (Gustavo Pimentel)
- Fix intel_mid_pci.c build error when !CONFIG_ACPI (Randy Dunlap)
- Use scnprintf(), not snprintf(), in sysfs "show" functions (Krzysztof
Wilczyński)
- Simplify pci-pf-stub by using module_pci_driver() (Liu Shixin)
- Print IRQ used by Link Bandwidth Notification (Dongdong Liu)
- Update sysfs mmap-related #ifdef comments (Clint Sbisa)
- Simplify pci_dev_reset_slot_function() (Lukas Wunner)
- Use "NULL" instead of "0" to fix sparse warnings (Gustavo Pimentel)
- Simplify bool comparisons (Krzysztof Wilczyński)
- Drop double zeroing for P2PDMA sg_init_table() (Julia Lawall)
* pci/misc:
PCI: v3-semi: Remove unneeded break
PCI/P2PDMA: Drop double zeroing for sg_init_table()
PCI: Simplify bool comparisons
PCI: endpoint: Use "NULL" instead of "0" as a NULL pointer
PCI: Simplify pci_dev_reset_slot_function()
PCI: Update mmap-related #ifdef comments
PCI/LINK: Print IRQ number used by port
PCI/IOV: Simplify pci-pf-stub with module_pci_driver()
PCI: Use scnprintf(), not snprintf(), in sysfs "show" functions
x86/PCI: Fix intel_mid_pci.c build error when ACPI is not enabled
PCI: Remove unnecessary header includes
|
|
- Remove unused pcibios_pm_ops (Vaibhav Gupta)
- Rename pci_dev.d3_delay to d3hot_delay (Krzysztof Wilczyński)
- Apply D2 transition delay as microseconds, not milliseconds (Bjorn
Helgaas)
* pci/pm:
PCI/PM: Revert "PCI/PM: Apply D2 delay as milliseconds, not microseconds"
PCI/PM: Remove unused PCI_PM_BUS_WAIT
PCI/PM: Rename pci_dev.d3_delay to d3hot_delay
PCI/PM: Remove unused pcibios_pm_ops
|
|
- Tone down message about missing optional MCFG (Jeremy Linton)
- Add schedule point in pci_read_config() (Jiang Biao)
- Add Ampere Altra SOC MCFG quirk (Tuan Phan)
- Add Kconfig options for MPS/MRRS strategy (Jim Quinlan)
* pci/enumeration:
PCI: Add Kconfig options for MPS/MRRS strategy
PCI/ACPI: Add Ampere Altra SOC MCFG quirk
PCI: Add schedule point in pci_read_config()
PCI/ACPI: Tone down missing MCFG message
|
|
An architecture may restrict host access to guest memory,
e.g. IBM s390 Secure Execution or AMD SEV.
Provide a new Kconfig entry the architecture can select,
CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS, when it provides
the arch_has_restricted_virtio_memory_access callback to advertise
to VIRTIO common code when the architecture restricts memory access
from the host.
The common code can then fail the probe for any device where
VIRTIO_F_ACCESS_PLATFORM is required, but not set.
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/1599728030-17085-2-git-send-email-pmorel@linux.ibm.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
NFS_FS=y as dependency of CONFIG_NFSD_V4_2_INTER_SSC still have
build errors and some configs with NFSD=m to get NFS4ERR_STALE
error when doing inter server copy.
Added ops table in nfs_common for knfsd to access NFS client modules.
Fixes: 3ac3711adb88 ("NFSD: Fix NFS server build errors")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Pull XArray updates from Matthew Wilcox:
- Fix the test suite after introduction of the local_lock
- Fix a bug in the IDA spotted by Coverity
- Change the API that allows the workingset code to delete a node
- Fix xas_reload() when dealing with entries that occupy multiple
indices
- Add a few more tests to the test suite
- Fix an unsigned int being shifted into an unsigned long
* tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
XArray: Fix xas_create_range for ranges above 4 billion
radix-tree: fix the comment of radix_tree_next_slot()
XArray: Fix xas_reload for multi-index entries
XArray: Add private interface for workingset node deletion
XArray: Fix xas_for_each_conflict documentation
XArray: Test marked multiorder iterations
XArray: Test two more things about xa_cmpxchg
ida: Free allocated bitmap in error path
radix tree test suite: Fix compilation
|
|
Pull NFS client updates from Anna Schumaker:
"Stable Fixes:
- Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+
- Fix nfs_path in case of a rename retry
- Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag
New features and improvements:
- Replace dprintk() calls with tracepoints
- Make cache consistency bitmap dynamic
- Added support for the NFS v4.2 READ_PLUS operation
- Improvements to net namespace uniquifier
Other bugfixes and cleanups:
- Remove redundant clnt pointer
- Don't update timeout values on connection resets
- Remove redundant tracepoints
- Various cleanups to comments
- Fix oops when trying to use copy_file_range with v4.0 source server
- Improvements to flexfiles mirrors
- Add missing 'local_lock=posix' mount option"
* tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (55 commits)
NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag
NFSv4: Fix up RCU annotations for struct nfs_netns_client
NFS: Only reference user namespace from nfs4idmap struct instead of cred
nfs: add missing "posix" local_lock constant table definition
NFSv4: Use the net namespace uniquifier if it is set
NFSv4: Clean up initialisation of uniquified client id strings
NFS: Decode a full READ_PLUS reply
SUNRPC: Add an xdr_align_data() function
NFS: Add READ_PLUS hole segment decoding
SUNRPC: Add the ability to expand holes in data pages
SUNRPC: Split out _shift_data_right_tail()
SUNRPC: Split out xdr_realign_pages() from xdr_align_pages()
NFS: Add READ_PLUS data segment support
NFS: Use xdr_page_pos() in NFSv4 decode_getacl()
SUNRPC: Implement a xdr_page_pos() function
SUNRPC: Split out a function for setting current page
NFS: fix nfs_path in case of a rename retry
fs: nfs: return per memcg count for xattr shrinkers
NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE
nfs: remove incorrect fallthrough label
...
|
|
Pull io_uring updates from Jens Axboe:
"A mix of fixes and a few stragglers. In detail:
- Revert the bogus __read_mostly that we discussed for the initial
pull request.
- Fix a merge window regression with fixed file registration error
path handling.
- Fix io-wq numa node affinities.
- Series abstracting out an io_identity struct, making it both easier
to see what the personality items are, and also easier to to adopt
more. Use this to cover audit logging.
- Fix for read-ahead disabled block condition in async buffered
reads, and using single page read-ahead to unify what
generic_file_buffer_read() path is used.
- Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel)
- Poll fix (Pavel)"
* tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits)
io_uring: use blk_queue_nowait() to check if NOWAIT supported
mm: use limited read-ahead to satisfy read
mm: mark async iocb read as NOWAIT once some data has been copied
io_uring: fix double poll mask init
io-wq: inherit audit loginuid and sessionid
io_uring: use percpu counters to track inflight requests
io_uring: assign new io_identity for task if members have changed
io_uring: store io_identity in io_uring_task
io_uring: COW io_identity on mismatch
io_uring: move io identity items into separate struct
io_uring: rely solely on work flags to determine personality.
io_uring: pass required context in as flags
io-wq: assign NUMA node locality if appropriate
io_uring: fix error path cleanup in io_sqe_files_register()
Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
io_uring: fix REQ_F_COMP_LOCKED by killing it
io_uring: dig out COMP_LOCK from deep call chain
io_uring: don't put a poll req under spinlock
io_uring: don't unnecessarily clear F_LINK_TIMEOUT
io_uring: don't set COMP_LOCKED if won't put
...
|
|
'clk-qcom', 'clk-prima2' and 'clk-bcm' into clk-next
- Support qcom SM8150/SM8250 video and display clks
- Change how qcom's display port clks work
* clk-ingenic:
clk: ingenic: Respect CLK_SET_RATE_PARENT in .round_rate
clk: ingenic: Don't tag custom clocks with CLK_SET_RATE_PARENT
clk: ingenic: Don't use CLK_SET_RATE_GATE for PLL
clk: ingenic: Use readl_poll_timeout instead of custom loop
clk: ingenic: Use to_clk_info() macro for all clocks
* clk-at91:
clk: at91: sam9x60: support only two programmable clocks
clk: at91: clk-sam9x60-pll: remove unused variable
clk: at91: clk-main: update key before writing AT91_CKGR_MOR
clk: at91: remove the checking of parent_name
* clk-kconfig:
clk: Restrict CLK_HSDK to ARC_SOC_HSDK
* clk-imx:
clk: imx8mq: Fix usdhc parents order
clk: imx: imx21: Remove clock driver
clk: imx: gate2: Fix a few typos
clk: imx: Fix and update kerneldoc
clk: imx: fix i.MX7D peripheral clk mux flags
clk: imx: fix composite peripheral flags
clk: imx: Correct the memrepair clock on imx8mp
clk: imx: Correct the root clk of media ldb on imx8mp
clk: imx: vf610: Add CRC clock
clk: imx: Explicitly include bits.h
clk: imx8qxp: Support building i.MX8QXP clock driver as module
clk: imx8m: Support module build
clk: imx: Add clock configuration for ARMv7 platforms
clk: imx: Support building i.MX common clock driver as module
clk: composite: Export clk_hw_register_composite()
clk: imx6sl: Use BIT(x) to avoid shifting signed 32-bit value by 31 bits
* clk-qcom:
clk: qcom: gdsc: Keep RETAIN_FF bit set if gdsc is already on
clk: qcom: Add display clock controller driver for SM8150 and SM8250
dt-bindings: clock: add QCOM SM8150 and SM8250 display clock bindings
clk: qcom: add video clock controller driver for SM8250
clk: qcom: add video clock controller driver for SM8150
dt-bindings: clock: add SM8250 QCOM video clock bindings
dt-bindings: clock: add SM8150 QCOM video clock bindings
dt-bindings: clock: combine qcom,sdm845-videocc and qcom,sc7180-videocc
clk: qcom: gcc-msm8994: Add missing clocks, resets and GDSCs
clk/qcom: fix spelling typo
clk: qcom: gcc-sdm660: Fix wrong parent_map
clk: qcom: dispcc: Update DP clk ops for phy design
clk: qcom: gcc-msm8939: remove defined but not used variables
clk: qcom: ipq8074: make pcie0_rchng_clk_src static
* clk-prima2:
clk: clk-prima2: fix return value check in prima2_clk_init()
* clk-bcm:
clk: bcm2835: add missing release if devm_clk_hw_register fails
clk: bcm: rpi: Add register to control pixel bvb clk
|
|
'clk-mediatek' into clk-next
- Small non-critical fixes for TI clk driver
- Support Mediatek MT8167 clks
* clk-simplify:
clk: mediatek: fix platform_no_drv_owner.cocci warnings
clk: mediatek: mt7629: simplify the return expression of mtk_infrasys_init
clk: mediatek: mt6797: simplify the return expression of mtk_infrasys_init
* clk-ti:
clk: ti: dra7: add missing clkctrl register for SHA2 instance
clk: ti: clockdomain: fix static checker warning
clk: ti: autoidle: add checks against NULL pointer reference
clk: keystone: sci-clk: add 10% slack to set_rate
clk: keystone: sci-clk: cache results of last query rate operation
clk: keystone: sci-clk: fix parsing assigned-clock data during probe
* clk-tegra:
clk: tegra: Drop !provider check in tegra210_clk_emc_set_rate()
* clk-rockchip:
clk: rockchip: Initialize hw to error to avoid undefined behavior
clk: rockchip: rk3399: Support module build
clk: rockchip: fix the clk config to support module build
clk: rockchip: Export some clock common APIs for module drivers
clk: rockchip: Export rockchip_register_softrst()
clk: rockchip: Export rockchip_clk_register_ddrclk()
clk: rockchip: Use clk_hw_register_composite instead of clk_register_composite calls
clk: rockchip: rk3308: drop unused mux_timer_src_p
* clk-mediatek:
clk: mediatek: Add MT8167 clock support
dt-bindings: clock: mediatek: add bindings for MT8167 clocks
clk: mediatek: add UART0 clock support
|
|
'clk-doc' and 'clk-unused' into clk-next
- Remove various unused variables in clk drivers
* clk-renesas:
clk: renesas: rcar-gen3: Update description for RZ/G2
clk: renesas: cpg-mssr: Add support for R-Car V3U
clk: renesas: cpg-mssr: Add register pointers into struct cpg_mssr_priv
clk: renesas: cpg-mssr: Use enum clk_reg_layout instead of a boolean flag
dt-bindings: clock: renesas,cpg-mssr: Document r8a779a0
dt-bindings: clock: Add r8a779a0 CPG Core Clock Definitions
dt-bindings: power: Add r8a779a0 SYSC power domain definitions
clk: renesas: rcar-gen2: Rename vsp1-(sy|rt) clocks to vsp(s|r)
clk: renesas: r8a7742: Add clk entry for VSPR
* clk-amlogic:
clk: meson: make shipped controller configurable
clk: meson: g12a: mark fclk_div2 as critical
clk: meson: axg-audio: fix g12a tdmout sclk inverter
clk: meson: axg-audio: separate axg and g12a regmap tables
clk: meson: add sclk-ws driver
* clk-allwinner:
clk: sunxi-ng: sun8i: r40: Use sigma delta modulation for audio PLL
clk: sunxi-ng: add support for the Allwinner A100 CCU
dt-bindings: clk: sunxi-ccu: add compatible string for A100 CCU and R-CCU
* clk-samsung:
clk: s2mps11: initialize driver via module_platform_driver
clk: samsung: Use cached clk_hws instead of __clk_lookup() calls
clk: samsung: exynos5420/5250: Add IDs to the CPU parent clk definitions
clk: samsung: Add clk ID definitions for the CPU parent clocks
clk: samsung: exynos5420: Avoid __clk_lookup() calls when enabling clocks
clk: samsung: exynos5420: Add definition of clock ID for mout_sw_aclk_g3d
clk: samsung: Keep top BPLL mux on Exynos542x enabled
* clk-doc:
clk: davinci: add missing kerneldoc
clk: fixed: add missing kerneldoc
* clk-unused:
clk: socfpga: agilex: Remove unused variable 'cntr_mux'
clk: si5341: drop unused 'err' variable
clk: mmp: pxa1928: drop unused 'clk' variable
clk: at91: drop unused at91sam9g45_pcr_layout
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply
Pull power supply and reset updates from Sebastian Reichel:
"Power-supply core:
- add wireless type
- properly document current direction
Battery/charger driver changes:
- new fuel-gauge/charger driver for RN5T618/RN5T619
- new charger driver for BQ25980, BQ25975 and BQ25960
- bq27xxx-battery: add support for TI bq34z100
- gpio-charger: convert to GPIO descriptors
- gpio-charger: add optional support for charge current limiting
- max17040: add support for max17041, max17043, max17044
- max17040: add support for max17048, max17049, max17058, max17059
- smb347-charger: add DT support
- smb247-charger: add SMB345 and SMB358 support
- simple-battery: add temperature properties
- lots of minor fixes, cleanups and DT binding YAML conversions
Reset drivers:
- ocelot: Add support for Sparx5"
* tag 'for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (81 commits)
power: reset: POWER_RESET_OCELOT_RESET should depend on Ocelot or Sparx5
power: supply: bq25980: Fix uninitialized wd_reg_val and overrun
power: supply: ltc2941: Fix ptr to enum cast
power: supply: test-power: revise parameter printing to use sprintf
power: supply: charger-manager: fix incorrect check on charging_duration_ms
power: supply: max17040: Fix ptr to enum cast
power: supply: bq25980: Fix uninitialized wd_reg_val
power: supply: bq25980: remove redundant zero check on ret
power: reset: ocelot: Add support for Sparx5
dt-bindings: reset: ocelot: Add Sparx5 support
power: supply: sbs-battery: keep error code when get_property() fails
power: supply: bq25980: Add support for the BQ259xx family
dt-binding: bq25980: Add the bq25980 flash charger
power: supply: fix spelling mistake "unprecise" -> "imprecise"
power: supply: test_power: add missing newlines when printing parameters by sysfs
power: supply: pm2301: drop duplicated i2c_device_id
power: supply: charger-manager: drop unused charger assignment
power: supply: rt9455: skip 'struct acpi_device_id' when !CONFIG_ACPI
power: supply: goldfish: skip 'struct acpi_device_id' when !CONFIG_ACPI
power: supply: bq25890: skip 'struct acpi_device_id' when !CONFIG_ACPI
...
|
|
This patch is to fix typo in the comment of helper pm_runtime_set_active().
Signed-off-by: Bean Huo <beanhuo@micron.com>
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
- A single patch to fix the Xen security issue XSA-331 (malicious
guests can DoS dom0 by triggering NULL-pointer dereferences or access
to stale data).
- A larger series to fix the Xen security issue XSA-332 (malicious
guests can DoS dom0 by sending events at high frequency leading to
dom0's vcpus being busy in IRQ handling for elongated times).
* tag 'for-linus-5.10b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: block rogue events for some time
xen/events: defer eoi in case of excessive number of events
xen/events: use a common cpu hotplug hook for event channels
xen/events: switch user event channels to lateeoi model
xen/pciback: use lateeoi irq binding
xen/pvcallsback: use lateeoi irq binding
xen/scsiback: use lateeoi irq binding
xen/netback: use lateeoi irq binding
xen/blkback: use lateeoi irq binding
xen/events: add a new "late EOI" evtchn framework
xen/events: fix race in evtchn_fifo_unmask()
xen/events: add a proper barrier to 2-level uevent unmasking
xen/events: avoid removing an event channel while handling it
|
|
After removing blk-sq code, there is no user of nr_batch_requests
and last_waited in kernel.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.10
- New page table code for both hypervisor and guest stage-2
- Introduction of a new EL2-private host context
- Allow EL2 to have its own private per-CPU variables
- Support of PMU event filtering
- Complete rework of the Spectre mitigation
|
|
nft_flow_rule_create
This patch fixes the issue due to:
BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2
net/netfilter/nf_tables_offload.c:40
Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244
The error happens when expr->ops is accessed early on before performing the boundary check and after nft_expr_next() moves the expr to go out-of-bounds.
This patch checks the boundary condition before expr->ops that fixes the slab-out-of-bounds Read issue.
Add nft_expr_more() and use it to fix this problem.
Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Due to a mismerge a bunch of prototypes that should have moved to
dma-map-ops.h are still in dma-mapping.h, fix that up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
In order to avoid tight event channel related IRQ loops add a new
framework of "late EOI" handling: the IRQ the event channel is bound
to will be masked until the event has been handled and the related
driver is capable to handle another event. The driver is responsible
for unmasking the event channel via the new function xen_irq_lateeoi().
This is similar to binding an event channel to a threaded IRQ, but
without having to structure the driver accordingly.
In order to support a future special handling in case a rogue guest
is sending lots of unsolicited events, add a flag to xen_irq_lateeoi()
which can be set by the caller to indicate the event was a spurious
one.
This is part of XSA-332.
Cc: stable@vger.kernel.org
Reported-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Wei Liu <wl@xen.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Support directly accessing host page cache from virtiofs. This can
improve I/O performance for various workloads, as well as reducing
the memory requirement by eliminating double caching. Thanks to Vivek
Goyal for doing most of the work on this.
- Allow automatic submounting inside virtiofs. This allows unique
st_dev/ st_ino values to be assigned inside the guest to files
residing on different filesystems on the host. Thanks to Max Reitz
for the patches.
- Fix an old use after free bug found by Pradeep P V K.
* tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
virtiofs: calculate number of scatter-gather elements accurately
fuse: connection remove fix
fuse: implement crossmounts
fuse: Allow fuse_fill_super_common() for submounts
fuse: split fuse_mount off of fuse_conn
fuse: drop fuse_conn parameter where possible
fuse: store fuse_conn in fuse_req
fuse: add submount support to <uapi/linux/fuse.h>
fuse: fix page dereference after free
virtiofs: add logic to free up a memory range
virtiofs: maintain a list of busy elements
virtiofs: serialize truncate/punch_hole and dax fault path
virtiofs: define dax address space operations
virtiofs: add DAX mmap support
virtiofs: implement dax read/write operations
virtiofs: introduce setupmapping/removemapping commands
virtiofs: implement FUSE_INIT map_alignment field
virtiofs: keep a list of free dax memory ranges
virtiofs: add a mount option to enable dax
virtiofs: set up virtio_fs dax_device
...
|
|
perf_event.h has macros that define the field offsets in the
data_src bitmask in perf records. The SNOOPX and REMOTE offsets
were both 37. These are distinct fields, and the bitfield layout
in perf_mem_data_src confirms that SNOOPX should be at offset 38.
Fixes: 52839e653b5629bd ("perf tools: Add support for printing new mem_info encodings")
Signed-off-by: Al Grant <al.grant@foss.arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/4ac9f5cc-4388-b34a-9999-418a4099415d@foss.arm.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull more Kunit updates from Shuah Khan:
- add Kunit to kernel_init() and remove KUnit from init calls entirely.
This addresses the concern that Kunit would not work correctly during
late init phase.
- add a linker section where KUnit can put references to its test
suites.
This is the first step in transitioning to dispatching all KUnit
tests from a centralized executor rather than having each as its own
separate late_initcall.
- add a centralized executor to dispatch tests rather than relying on
late_initcall to schedule each test suite separately. Centralized
execution is for built-in tests only; modules will execute tests when
loaded.
- convert bitfield test to use KUnit framework
- Documentation updates for naming guidelines and how
kunit_test_suite() works.
- add test plan to KUnit TAP format
* tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE
lib: kunit: add bitfield test conversion to KUnit
Documentation: kunit: add a brief blurb about kunit_test_suite
kunit: test: add test plan to KUnit TAP format
init: main: add KUnit to kernel init
kunit: test: create a single centralized executor for all tests
vmlinux.lds.h: add linker section for KUnit test suites
Documentation: kunit: Add naming guidelines
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
- Debugging for smp_call_function()
- RT raw/non-raw lock ordering fixes
- Strict grace periods for KASAN
- New smp_call_function() torture test
- Torture-test updates
- Documentation updates
- Miscellaneous fixes
[ This doesn't actually pull the tag - I've dropped the last merge from
the RCU branch due to questions about the series. - Linus ]
* tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
smp: Make symbol 'csd_bug_count' static
kernel/smp: Provide CSD lock timeout diagnostics
smp: Add source and destination CPUs to __call_single_data
rcu: Shrink each possible cpu krcp
rcu/segcblist: Prevent useless GP start if no CBs to accelerate
torture: Add gdb support
rcutorture: Allow pointer leaks to test diagnostic code
rcutorture: Hoist OOM registry up one level
refperf: Avoid null pointer dereference when buf fails to allocate
rcutorture: Properly synchronize with OOM notifier
rcutorture: Properly set rcu_fwds for OOM handling
torture: Add kvm.sh --help and update help message
rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05
torture: Update initrd documentation
rcutorture: Replace HTTP links with HTTPS ones
locktorture: Make function torture_percpu_rwsem_init() static
torture: document --allcpus argument added to the kvm.sh script
rcutorture: Output number of elapsed grace periods
rcutorture: Remove KCSAN stubs
rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp()
...
|
|
All users are gone now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add a proper helper to remap PFNs into kernel virtual space so that
drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add a flag so that vmap takes ownership of the passed in page array. When
vfree is called on such an allocation it will put one reference on each
page, and free the page array itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is usecase that System Management Software(SMS) want to give a
memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
case of Android, it is the ActivityManagerService.
The information required to make the reclaim decision is not known to the
app. Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall
process_madvise(2). It uses pidfd of an external process to give the
hint. It also supports vector address range because Android app has
thousands of vmas due to zygote so it's totally waste of CPU and power if
we should call the syscall one by one for each vma.(With testing 2000-vma
syscall vs 1-vector syscall, it showed 15% performance improvement. I
think it would be bigger in real practice because the testing ran very
cache friendly environment).
Another potential use case for the vector range is to amortize the cost
ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations. In
future, we could find more usecases for other advises so let's make it
happens as API since we introduce a new syscall at this moment. With
that, existing madvise(2) user could replace it with process_madvise(2)
with their own pid if they want to have batch address ranges support
feature.
ince it could affect other process's address range, only privileged
process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
UID) gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the API.
I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky. Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone. Thus,
I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
If someone want to add other hints, we could hear the usecase and review
it for each hint. It's safer for maintenance rather than introducing a
buggy syscall but hard to fix it later.
So finally, the API is as follows,
ssize_t process_madvise(int pidfd, const struct iovec *iovec,
unsigned long vlen, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges from external process as well as
local process. It provides the advice to address ranges of process
described by iovec and vlen. The goal of such advice is to improve
system or application performance.
The pidfd selects the process referred to by the PID file descriptor
specified in pidfd. (See pidofd_open(2) for further information)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
struct iovec {
void *iov_base; /* starting address */
size_t iov_len; /* number of bytes to be advised */
};
The iovec describes address ranges beginning at address(iov_base)
and with size length of bytes(iov_len).
The vlen represents the number of elements in iovec.
The advice is indicated in the advice argument, which is one of the
following at this moment if the target process specified by pidfd is
external.
MADV_COLD
MADV_PAGEOUT
Permission to provide a hint to external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
The process_madvise supports every advice madvise(2) has if target
process is in same thread group with calling process so user could
use process_madvise(2) to extend existing madvise(2) to support
vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred. The caller should check return value
to determine whether a partial advice occurred.
FAQ:
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How to guarantee the race(i.e., object validation) between when
giving a hint from an external process and get the hint from the target
process?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the space
target process can run between the time the process_madvise process
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization. It
also apply other APIs like move_pages, process_vm_write.
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where we need synchronization by using other API or
design from userside. It shouldn't be part of API itself. If someone
needs more fine-grained synchronization rather than process level,
there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
applicable via using last reserved argument of the API but I don't
think it's necessary right now since we have already ways to prevent
the race so don't want to add additional complexity with more
fine-grained optimization model.
To make the API extend, it reserved an unsigned long as last argument
so we could support it in future if someone really needs it.
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA. Furthermore, we want to act the hint in caller's context, not the
callee's, because the callee is usually limited in cpuset/cgroups or
even freezed state so they can't act by themselves quick enough, which
causes more thrashing/kill. It doesn't work if the target process are
ptraced(e.g., strace, debugger, minidump) because a process can have at
most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
vma of process address layout are changed - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object(i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
[minchan@kernel.org: fix process_madvise build break for arm64]
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
process_madvise syscall needs pidfd_get_pid function to translate pidfd to
pid so this patch move the function to kernel/pid.c.
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "introduce memory hinting API for external process", v9.
Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With
that, application could give hints to kernel what memory range are
preferred to be reclaimed. However, in some platform(e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.
To solve the concern, this patch introduces new syscall -
process_madvise(2). Bascially, it's same with madvise(2) syscall but it
has some differences.
1. It needs pidfd of target process to provide the hint
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
moment. Other hints in madvise will be opened when there are explicit
requests from community to prevent unexpected bugs we couldn't support.
3. Only privileged processes can do something for other process's
address space.
For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.
This patch (of 3):
In upcoming patches, do_madvise will be called from external process
context so we shouldn't asssume "current" is always hinted process's
task_struct.
Furthermore, we must not access mm_struct via task->mm, but obtain it via
access_mm() once (in the following patch) and only use that pointer [1],
so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
safe, so we can use them further down the call stack.
And let's pass current->mm as arguments of do_madvise so it shouldn't
change existing behavior but prepare next patch to make review easy.
[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If a memcg to charge can be determined (using remote charging API), there
are no reasons to exclude allocations made from an interrupt context from
the accounting.
Such allocations will pass even if the resulting memcg size will exceed
the hard limit, but it will affect the application of the memory pressure
and an inability to put the workload under the limit will eventually
trigger the OOM.
To use active_memcg() helper, memcg_kmem_bypass() is moved back to
memcontrol.c.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Remote memcg charging API uses current->active_memcg to store the
currently active memory cgroup, which overwrites the memory cgroup of the
current process. It works well for normal contexts, but doesn't work for
interrupt contexts: indeed, if an interrupt occurs during the execution of
a section with an active memcg set, all allocations inside the interrupt
will be charged to the active memcg set (given that we'll enable
accounting for allocations from an interrupt context). But because the
interrupt might have no relation to the active memcg set outside, it's
obviously wrong from the accounting prospective.
To resolve this problem, let's add a global percpu int_active_memcg
variable, which will be used to store an active memory cgroup which will
be used from interrupt contexts. set_active_memcg() will transparently
use current->active_memcg or int_active_memcg depending on the context.
To make the read part simple and transparent for the caller, let's
introduce two new functions:
- struct mem_cgroup *active_memcg(void),
- struct mem_cgroup *get_active_memcg(void).
They are returning the active memcg if it's set, hiding all implementation
details: where to get it depending on the current context.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently the remote memcg charging API consists of two functions:
memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
memcg value, which overwrites the memcg of the current task.
memalloc_use_memcg(target_memcg);
<...>
memalloc_unuse_memcg();
It works perfectly for allocations performed from a normal context,
however an attempt to call it from an interrupt context or just nest two
remote charging blocks will lead to an incorrect accounting. On exit from
the inner block the active memcg will be cleared instead of being
restored.
memalloc_use_memcg(target_memcg);
memalloc_use_memcg(target_memcg_2);
<...>
memalloc_unuse_memcg();
Error: allocation here are charged to the memcg of the current
process instead of target_memcg.
memalloc_unuse_memcg();
This patch extends the remote charging API by switching to a single
function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
which sets the new value and returns the old one. So a remote charging
block will look like:
old_memcg = set_active_memcg(target_memcg);
<...>
set_active_memcg(old_memcg);
This patch is heavily based on the patch by Johannes Weiner, which can be
found here: https://lkml.org/lkml/2020/5/28/806 .
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Schatzberg <dschatzberg@fb.com>
Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
j_submit|finish_inode_data_buffers()
Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().
The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.
Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
which builds fs/jbd2/commit.c and journal.c that define and
export the functions, so we can call directly in ext4/ocfs2.
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-3-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Export functions that implement the current behavior done
for an inode in journal_submit|finish_inode_data_buffers().
No functional change.
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-2-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
the struct name was modified long ago, but the comment still
use struct handle_s.
Signed-off-by: Hui Su <sh_def@163.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200922171231.GA53120@rlk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (on notification sent) or 1, the latter
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.
Clean this up properly, and define a proper enum for the notification
mode. Now we have:
- TWA_NONE. This is 0, same as before the original change, meaning no
notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
notification.
Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.
Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All the callers currently do this, clean it up and move the clearing
into tracehook_notify_resume() instead.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|