summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2020-10-22Merge tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "The one new feature this time, from Anna Schumaker, is READ_PLUS, which has the same arguments as READ but allows the server to return an array of data and hole extents. Otherwise it's a lot of cleanup and bugfixes" * tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux: (43 commits) NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copy SUNRPC: fix copying of multiple pages in gss_read_proxy_verf() sunrpc: raise kernel RPC channel buffer size svcrdma: fix bounce buffers for unaligned offsets and multiple pages nfsd: remove unneeded break net/sunrpc: Fix return value for sysctl sunrpc.transports NFSD: Encode a full READ_PLUS reply NFSD: Return both a hole and a data segment NFSD: Add READ_PLUS hole segment encoding NFSD: Add READ_PLUS data support NFSD: Hoist status code encoding into XDR encoder functions NFSD: Map nfserr_wrongsec outside of nfsd_dispatch NFSD: Remove the RETURN_STATUS() macro NFSD: Call NFSv2 encoders on error returns NFSD: Fix .pc_release method for NFSv2 NFSD: Remove vestigial typedefs NFSD: Refactor nfsd_dispatch() error paths NFSD: Clean up nfsd_dispatch() variables NFSD: Clean up stale comments in nfsd_dispatch() NFSD: Clean up switch statement in nfsd_dispatch() ...
2020-10-21ext4: fast commit recovery pathHarshad Shirwadkar
This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21jbd2: fast commit recovery pathHarshad Shirwadkar
This patch adds fast commit recovery support in JBD2. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-7-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: main fast-commit commit pathHarshad Shirwadkar
This patch adds main fast commit commit path handlers. The overall patch can be divided into two inter-related parts: (A) Metadata updates tracking This part consists of helper functions to track changes that need to be committed during a commit operation. These updates are maintained by Ext4 in different in-memory queues. Following are the APIs and their short description that are implemented in this patch: - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat operations - ext4_fc_track_range() - Track changed logical block offsets inodes - ext4_fc_track_inode() - Track inodes - ext4_fc_mark_ineligible() - Mark file system fast commit ineligible() - ext4_fc_start_update() / ext4_fc_stop_update() / ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These functions are useful for co-ordinating inode updates with commits. (B) Main commit Path This part consists of functions to convert updates tracked in in-memory data structures into on-disk commits. Function ext4_fc_commit() is the main entry point to commit path. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21jbd2: add fast commit machineryHarshad Shirwadkar
This functions adds necessary APIs needed in JBD2 layer for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-5-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4 / jbd2: add fast commit initializationHarshad Shirwadkar
This patch adds fast commit area trackers in the journal_t structure. These are initialized via the jbd2_fc_init() routine that this patch adds. This patch also adds ext4/fast_commit.c and ext4/fast_commit.h files for fast commit code that will be added in subsequent patches in this series. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-4-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: add fast_commit feature and handling for extended mount optionsHarshad Shirwadkar
We are running out of mount option bits. Add handling for using s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add ability to turn off the fast commit feature in Ext4. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-3-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21rtnetlink: fix data overflow in rtnl_calcit()Di Zhu
"ip addr show" command execute error when we have a physical network card with a large number of VFs The return value of if_nlmsg_size() in rtnl_calcit() will exceed range of u16 data type when any network cards has a larger number of VFs. rtnl_vfinfo_size() will significant increase needed dump size when the value of num_vfs is larger. Eventually we get a wrong value of min_ifinfo_dump_size because of overflow which decides the memory size needed by netlink dump and netlink_dump() will return -EMSGSIZE because of not enough memory was allocated. So fix it by promoting min_dump_alloc data type to u32 to avoid whole netlink message size overflow and it's also align with the data type of struct netlink_callback{}.min_dump_alloc which is assigned by return value of rtnl_calcit() Signed-off-by: Di Zhu <zhudi21@huawei.com> Link: https://lore.kernel.org/r/20201021020053.1401-1-zhudi21@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22bpf: Fix bpf_redirect_neigh helper api to support supplying nexthopToke Høiland-Jørgensen
Based on the discussion in [0], update the bpf_redirect_neigh() helper to accept an optional parameter specifying the nexthop information. This makes it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without incurring a duplicate FIB lookup - since the FIB lookup helper will return the nexthop information even if no neighbour is present, this can simply be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen. [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk
2020-10-21KVM: Cache as_id in kvm_memory_slotPeter Xu
Cache the address space ID just like the slot ID. It will be used in order to fill in the dirty ring entries. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20201014182700.2888246-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21kvm: x86: only provide PV features if enabled in guest's CPUIDOliver Upton
KVM unconditionally provides PV features to the guest, regardless of the configured CPUID. An unwitting guest that doesn't check KVM_CPUID_FEATURES before use could access paravirt features that userspace did not intend to provide. Fix this by checking the guest's CPUID before performing any paravirtual operations. Introduce a capability, KVM_CAP_ENFORCE_PV_FEATURE_CPUID, to gate the aforementioned enforcement. Migrating a VM from a host w/o this patch to a host with this patch could silently change the ABI exposed to the guest, warranting that we default to the old behavior and opt-in for the new one. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Change-Id: I202a0926f65035b872bfe8ad15307c026de59a98 Message-Id: <20200818152429.1923996-4-oupton@google.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21Merge tag 'rtc-5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "A new driver this cycle is making the bulk of the changes and the rx8010 driver has been rework to use the modern APIs. Summary: Subsystem: - new generic DT properties: aux-voltage-chargeable, trickle-voltage-millivolt New driver: - Microcrystal RV-3032 Drivers: - ds1307: use aux-voltage-chargeable - r9701, rx8010: modernization of the driver - rv3028: fix clock output, trickle resistor values, RAM configuration registers" * tag 'rtc-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits) rtc: r9701: set range rtc: r9701: convert to devm_rtc_allocate_device rtc: r9701: stop setting RWKCNT rtc: r9701: remove useless memset rtc: r9701: stop setting a default time rtc: r9701: remove leftover comment rtc: rv3032: Add a driver for Microcrystal RV-3032 dt-bindings: rtc: rv3032: add RV-3032 bindings dt-bindings: rtc: add trickle-voltage-millivolt rtc: rv3028: ensure ram configuration registers are saved rtc: rv3028: factorize EERD bit handling rtc: rv3028: fix trickle resistor values rtc: rv3028: fix clock output support rtc: mt6397: Remove unused member dev rtc: rv8803: simplify the return expression of rv8803_nvram_write rtc: meson: simplify the return expression of meson_vrtc_probe rtc: rx8010: rename rx8010_init_client() to rx8010_init() rtc: ds1307: enable rx8130's backup battery, make it chargeable optionally rtc: ds1307: consider aux-voltage-chargeable rtc: ds1307: store previous charge default per chip ...
2020-10-21Merge branch 'i2c/for-5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: - if a host can be a client, too, the I2C core can now use it to emulate SMBus HostNotify support (STM32 and R-Car added this so far) - also for client mode, a testunit has been added. It can create rare situations on the bus, so host controllers can be tested - a binding has been added to mark the bus as "single-master". This allows for better timeout detections - new driver for Mellanox Bluefield - massive refactoring of the Tegra driver - EEPROMs recognized by the at24 driver can now have custom names - rest is driver updates * 'i2c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (80 commits) Documentation: i2c: add testunit docs to index i2c: tegra: Improve driver module description i2c: tegra: Clean up whitespaces, newlines and indentation i2c: tegra: Clean up and improve comments i2c: tegra: Clean up printk messages i2c: tegra: Clean up variable names i2c: tegra: Improve formatting of variables i2c: tegra: Check errors for both positive and negative values i2c: tegra: Factor out hardware initialization into separate function i2c: tegra: Factor out register polling into separate function i2c: tegra: Factor out packet header setup from tegra_i2c_xfer_msg() i2c: tegra: Factor out error recovery from tegra_i2c_xfer_msg() i2c: tegra: Rename wait/poll functions i2c: tegra: Remove "dma" variable from tegra_i2c_xfer_msg() i2c: tegra: Remove redundant check in tegra_i2c_issue_bus_clear() i2c: tegra: Remove likely/unlikely from the code i2c: tegra: Remove outdated barrier() i2c: tegra: Clean up variable types i2c: tegra: Reorder location of functions in the code i2c: tegra: Clean up probe function ...
2020-10-21Merge tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: - a patch that removes crush_workspace_mutex (myself). CRUSH computations are no longer serialized and can run in parallel. - a couple new filesystem client metrics for "ceph fs top" command (Xiubo Li) - a fix for a very old messenger bug that affected the filesystem, marked for stable (myself) - assorted fixups and cleanups throughout the codebase from Jeff and others. * tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits) libceph: clear con->out_msg on Policy::stateful_server faults libceph: format ceph_entity_addr nonces as unsigned libceph: fix ENTITY_NAME format suggestion libceph: move a dout in queue_con_delay() ceph: comment cleanups and clarifications ceph: break up send_cap_msg ceph: drop separate mdsc argument from __send_cap ceph: promote to unsigned long long before shifting ceph: don't SetPageError on readpage errors ceph: mark ceph_fmt_xattr() as printf-like for better type checking ceph: fold ceph_update_writeable_page into ceph_write_begin ceph: fold ceph_sync_writepages into writepage_nounlock ceph: fold ceph_sync_readpages into ceph_readpage ceph: don't call ceph_update_writeable_page from page_mkwrite ceph: break out writeback of incompatible snap context to separate function ceph: add a note explaining session reject error string libceph: switch to the new "osd blocklist add" command libceph, rbd, ceph: "blacklist" -> "blocklist" ceph: have ceph_writepages_start call pagevec_lookup_range_tag ceph: use kill_anon_super helper ...
2020-10-21Merge branch 'remotes/lorenzo/pci/dwc'Bjorn Helgaas
- Fix designware-ep Header Type check (Hou Zhiqiang) - Use DBI accessors instead of own config accessors (Rob Herring) - Allow overriding bridge pci_ops (Rob Herring) - Allow root and child buses to have different pci_ops (Rob Herring) - Add default dwc pci_ops.map_bus (Rob Herring) - Use pci_ops for root config space accessors in al, exynos, histb, keystone, kirin, meson, tegra (Rob Herring) - Remove dwc own/other config accessor ops (Rob Herring) - Use generic config accessors in dwc (Rob Herring) - Also call .add_bus() callback for root bus (Rob Herring) - Convert keystone .scan_bus() callback to use pci_ops.add_bus (Rob Herring) - Convert dwc to use pci_host_probe() (Rob Herring) - Remove dwc root_bus pointer (Rob Herring) - Remove storing of PCI resources in dwc-specific structs (Rob Herring) - Simplify config space handling (Rob Herring) - Drop keystone duplicated DT num-viewport handling (Rob Herring) - Check CONFIG_PCI_MSI in dw_pcie_msi_init() instead of duplicating it in all the drivers (Rob Herring) - Remove imx6 duplicate PCIE_LINK_WIDTH_SPEED_CONTROL definition (Rob Herring) - Add dwc num_lanes for use when it's lacking from DT (Rob Herring) - Ensure "Fast Link Mode" simulation environment setting is cleared (Rob Herring) - Drop meson duplicate number of lanes setup (Rob Herring) - Drop meson unnecessary RC config space init (Rob Herring) - Rework meson config and dwc port logic register accesses (Rob Herring) - Use common PCI register definitions in imx6 and qcom (Rob Herring) - Search for DesignWare PCIe Capability instead of hard-coding its location (Rob Herring) - Use common DesignWare register definitions in tegra (Rob Herring) - Drop keystone unused DBI2 code (Rob Herring) - Make dwc ATU accessors private (Rob Herring) - Centralize link gen setting in dwc (Rob Herring) - Set PORT_LINK_DLL_LINK_EN in common dwc setup code (Rob Herring) - Drop intel-gw unnecessary DT 'device_type' checking (Rob Herring) - Move intel-gw PCI_CAP_ID_EXP discovery to the single place it's used (Rob Herring) - Drop intel-gw unused max_width (Rob Herring) - Move N_FTS (fast training sequence) setup to common dwc setup (Rob Herring) - Convert spear13xx, tegra194 to use DBI accessors (Rob Herring) - Add multiple PFs support for DWC (Xiaowei Bao) - Add MSI-X doorbell mode for endpoint mode (Xiaowei Bao) - Update MSI/MSI-X capability management for endpoints (Xiaowei Bao) - Add layerscape ls1088a and ls2088a compatible strings (Xiaowei Bao) - Update layerscape MSI/MSI-X management (Xiaowei Bao) - Use doorbell to support MSI-X on layerscape (Xiaowei Bao) - Add layerscape endpoint mode support for ls1088a and ls2088a (Xiaowei Bao) - Add layerscape ls1088a node to DT (Xiaowei Bao) - Add Freescale/Layerscape ls1088a to endpoint test (Xiaowei Bao) - Add endpoint test driver data for Layerscape PCIe controllers (Hou Zhiqiang) - Fix 'cast truncates bits from constant value' warning (Gustavo Pimentel) - Add uniphier iATU register description (Kunihiko Hayashi) - Add common iATU register support (Kunihiko Hayashi) - Remove keystone iATU register mapping in favor of generic dwc support (Kunihiko Hayashi) - Skip PCIE_MSI_INTR0* programming if MSI is disabled (Jisheng Zhang) - Fix MSI page leakage in suspend/resume (Jisheng Zhang) - Check whether link is up before attempting config access (best-effort fix even though it's racy) (Hou Zhiqiang) * remotes/lorenzo/pci/dwc: PCI: dwc: Add link up check in dw_child_pcie_ops.map_bus() PCI: dwc: Fix MSI page leakage in suspend/resume PCI: dwc: Skip PCIE_MSI_INTR0* programming if MSI is disabled PCI: keystone: Remove iATU register mapping PCI: dwc: Add common iATU register support dt-bindings: PCI: uniphier-ep: Add iATU register description dt-bindings: PCI: uniphier: Add iATU register description PCI: dwc: Fix 'cast truncates bits from constant value' misc: pci_endpoint_test: Add driver data for Layerscape PCIe controllers misc: pci_endpoint_test: Add LS1088a in pci_device_id table PCI: layerscape: Add EP mode support for ls1088a and ls2088a PCI: layerscape: Modify the MSIX to the doorbell mode PCI: layerscape: Modify the way of getting capability with different PEX PCI: layerscape: Fix some format issue of the code dt-bindings: pci: layerscape-pci: Add compatible strings for ls1088a and ls2088a PCI: designware-ep: Modify MSI and MSIX CAP way of finding PCI: designware-ep: Move the function of getting MSI capability forward PCI: designware-ep: Add the doorbell mode of MSI-X in EP mode PCI: designware-ep: Add multiple PFs support for DWC PCI: dwc: Use DBI accessors PCI: dwc: Move N_FTS setup to common setup PCI: dwc/intel-gw: Drop unused max_width PCI: dwc/intel-gw: Move getting PCI_CAP_ID_EXP offset to intel_pcie_link_setup() PCI: dwc/intel-gw: Drop unnecessary checking of DT 'device_type' property PCI: dwc: Set PORT_LINK_DLL_LINK_EN in common setup code PCI: dwc: Centralize link gen setting PCI: dwc: Make ATU accessors private PCI: dwc: Remove read_dbi2 code PCI: dwc/tegra: Use common Designware port logic register definitions PCI: dwc: Remove hardcoded PCI_CAP_ID_EXP offset PCI: dwc/qcom: Use common PCI register definitions PCI: dwc/imx6: Use common PCI register definitions PCI: dwc/meson: Rework PCI config and DW port logic register accesses PCI: dwc/meson: Drop unnecessary RC config space initialization PCI: dwc/meson: Drop the duplicate number of lanes setup PCI: dwc: Ensure FAST_LINK_MODE is cleared PCI: dwc: Add a 'num_lanes' field to struct dw_pcie PCI: dwc/imx6: Remove duplicate define PCIE_LINK_WIDTH_SPEED_CONTROL PCI: dwc: Check CONFIG_PCI_MSI inside dw_pcie_msi_init() PCI: dwc/keystone: Drop duplicated 'num-viewport' PCI: dwc: Simplify config space handling PCI: dwc: Remove storing of PCI resources PCI: dwc: Remove root_bus pointer PCI: dwc: Convert to use pci_host_probe() PCI: dwc: keystone: Convert .scan_bus() callback to use add_bus PCI: Also call .add_bus() callback for root bus PCI: dwc: Use generic config accessors PCI: dwc: Remove dwc specific config accessor ops PCI: dwc: histb: Use pci_ops for root config space accessors PCI: dwc: exynos: Use pci_ops for root config space accessors PCI: dwc: kirin: Use pci_ops for root config space accessors PCI: dwc: meson: Use pci_ops for root config space accessors PCI: dwc: tegra: Use pci_ops for root config space accessors PCI: dwc: keystone: Use pci_ops for config space accessors PCI: dwc: al: Use pci_ops for child config space accessors PCI: dwc: Add a default pci_ops.map_bus for root port PCI: dwc: Allow overriding bridge pci_ops PCI: dwc: Use DBI accessors instead of own config accessors PCI: Allow root and child buses to have different pci_ops PCI: designware-ep: Fix the Header Type check
2020-10-21Merge branch 'remotes/lorenzo/pci/pci-iomap'Bjorn Helgaas
- Remove useless __KERNEL__ preprocessor guard in sparc io_32.h (Lorenzo Pieralisi) - Move ioremap/iounmap declaration so it's visible in asm-generic/io.h (Lorenzo Pieralisi) - Fix memory leak in generic !CONFIG_GENERIC_IOMAP pci_iounmap() implementation (Lorenzo Pieralisi) * remotes/lorenzo/pci/pci-iomap: asm-generic/io.h: Fix !CONFIG_GENERIC_IOMAP pci_iounmap() implementation sparc32: Move ioremap/iounmap declaration before asm-generic/io.h include sparc32: Remove useless io_32.h __KERNEL__ preprocessor guard
2020-10-21Merge branch 'remotes/lorenzo/pci/apei'Bjorn Helgaas
- Add ACPI APEI notifier chain for unknown (vendor) CPER records (Shiju Jose) - Add handling of HiSilicon HIP PCIe controller errors (Yicong Yang) * remotes/lorenzo/pci/apei: PCI: hip: Add handling of HiSilicon HIP PCIe controller errors ACPI / APEI: Add a notifier chain for unknown (vendor) CPER records
2020-10-21Merge branch 'pci/misc'Bjorn Helgaas
- Remove unnecessary #includes (Gustavo Pimentel) - Fix intel_mid_pci.c build error when !CONFIG_ACPI (Randy Dunlap) - Use scnprintf(), not snprintf(), in sysfs "show" functions (Krzysztof Wilczyński) - Simplify pci-pf-stub by using module_pci_driver() (Liu Shixin) - Print IRQ used by Link Bandwidth Notification (Dongdong Liu) - Update sysfs mmap-related #ifdef comments (Clint Sbisa) - Simplify pci_dev_reset_slot_function() (Lukas Wunner) - Use "NULL" instead of "0" to fix sparse warnings (Gustavo Pimentel) - Simplify bool comparisons (Krzysztof Wilczyński) - Drop double zeroing for P2PDMA sg_init_table() (Julia Lawall) * pci/misc: PCI: v3-semi: Remove unneeded break PCI/P2PDMA: Drop double zeroing for sg_init_table() PCI: Simplify bool comparisons PCI: endpoint: Use "NULL" instead of "0" as a NULL pointer PCI: Simplify pci_dev_reset_slot_function() PCI: Update mmap-related #ifdef comments PCI/LINK: Print IRQ number used by port PCI/IOV: Simplify pci-pf-stub with module_pci_driver() PCI: Use scnprintf(), not snprintf(), in sysfs "show" functions x86/PCI: Fix intel_mid_pci.c build error when ACPI is not enabled PCI: Remove unnecessary header includes
2020-10-21Merge branch 'pci/pm'Bjorn Helgaas
- Remove unused pcibios_pm_ops (Vaibhav Gupta) - Rename pci_dev.d3_delay to d3hot_delay (Krzysztof Wilczyński) - Apply D2 transition delay as microseconds, not milliseconds (Bjorn Helgaas) * pci/pm: PCI/PM: Revert "PCI/PM: Apply D2 delay as milliseconds, not microseconds" PCI/PM: Remove unused PCI_PM_BUS_WAIT PCI/PM: Rename pci_dev.d3_delay to d3hot_delay PCI/PM: Remove unused pcibios_pm_ops
2020-10-21Merge branch 'pci/enumeration'Bjorn Helgaas
- Tone down message about missing optional MCFG (Jeremy Linton) - Add schedule point in pci_read_config() (Jiang Biao) - Add Ampere Altra SOC MCFG quirk (Tuan Phan) - Add Kconfig options for MPS/MRRS strategy (Jim Quinlan) * pci/enumeration: PCI: Add Kconfig options for MPS/MRRS strategy PCI/ACPI: Add Ampere Altra SOC MCFG quirk PCI: Add schedule point in pci_read_config() PCI/ACPI: Tone down missing MCFG message
2020-10-21virtio: let arch advertise guest's memory access restrictionsPierre Morel
An architecture may restrict host access to guest memory, e.g. IBM s390 Secure Execution or AMD SEV. Provide a new Kconfig entry the architecture can select, CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS, when it provides the arch_has_restricted_virtio_memory_access callback to advertise to VIRTIO common code when the architecture restricts memory access from the host. The common code can then fail the probe for any device where VIRTIO_F_ACCESS_PLATFORM is required, but not set. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Link: https://lore.kernel.org/r/1599728030-17085-2-git-send-email-pmorel@linux.ibm.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
2020-10-21NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copyDai Ngo
NFS_FS=y as dependency of CONFIG_NFSD_V4_2_INTER_SSC still have build errors and some configs with NFSD=m to get NFS4ERR_STALE error when doing inter server copy. Added ops table in nfs_common for knfsd to access NFS client modules. Fixes: 3ac3711adb88 ("NFSD: Fix NFS server build errors") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-20Merge tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarrayLinus Torvalds
Pull XArray updates from Matthew Wilcox: - Fix the test suite after introduction of the local_lock - Fix a bug in the IDA spotted by Coverity - Change the API that allows the workingset code to delete a node - Fix xas_reload() when dealing with entries that occupy multiple indices - Add a few more tests to the test suite - Fix an unsigned int being shifted into an unsigned long * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray: XArray: Fix xas_create_range for ranges above 4 billion radix-tree: fix the comment of radix_tree_next_slot() XArray: Fix xas_reload for multi-index entries XArray: Add private interface for workingset node deletion XArray: Fix xas_for_each_conflict documentation XArray: Test marked multiorder iterations XArray: Test two more things about xa_cmpxchg ida: Free allocated bitmap in error path radix tree test suite: Fix compilation
2020-10-20Merge tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "Stable Fixes: - Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+ - Fix nfs_path in case of a rename retry - Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag New features and improvements: - Replace dprintk() calls with tracepoints - Make cache consistency bitmap dynamic - Added support for the NFS v4.2 READ_PLUS operation - Improvements to net namespace uniquifier Other bugfixes and cleanups: - Remove redundant clnt pointer - Don't update timeout values on connection resets - Remove redundant tracepoints - Various cleanups to comments - Fix oops when trying to use copy_file_range with v4.0 source server - Improvements to flexfiles mirrors - Add missing 'local_lock=posix' mount option" * tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (55 commits) NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag NFSv4: Fix up RCU annotations for struct nfs_netns_client NFS: Only reference user namespace from nfs4idmap struct instead of cred nfs: add missing "posix" local_lock constant table definition NFSv4: Use the net namespace uniquifier if it is set NFSv4: Clean up initialisation of uniquified client id strings NFS: Decode a full READ_PLUS reply SUNRPC: Add an xdr_align_data() function NFS: Add READ_PLUS hole segment decoding SUNRPC: Add the ability to expand holes in data pages SUNRPC: Split out _shift_data_right_tail() SUNRPC: Split out xdr_realign_pages() from xdr_align_pages() NFS: Add READ_PLUS data segment support NFS: Use xdr_page_pos() in NFSv4 decode_getacl() SUNRPC: Implement a xdr_page_pos() function SUNRPC: Split out a function for setting current page NFS: fix nfs_path in case of a rename retry fs: nfs: return per memcg count for xattr shrinkers NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE nfs: remove incorrect fallthrough label ...
2020-10-20Merge tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "A mix of fixes and a few stragglers. In detail: - Revert the bogus __read_mostly that we discussed for the initial pull request. - Fix a merge window regression with fixed file registration error path handling. - Fix io-wq numa node affinities. - Series abstracting out an io_identity struct, making it both easier to see what the personality items are, and also easier to to adopt more. Use this to cover audit logging. - Fix for read-ahead disabled block condition in async buffered reads, and using single page read-ahead to unify what generic_file_buffer_read() path is used. - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel) - Poll fix (Pavel)" * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits) io_uring: use blk_queue_nowait() to check if NOWAIT supported mm: use limited read-ahead to satisfy read mm: mark async iocb read as NOWAIT once some data has been copied io_uring: fix double poll mask init io-wq: inherit audit loginuid and sessionid io_uring: use percpu counters to track inflight requests io_uring: assign new io_identity for task if members have changed io_uring: store io_identity in io_uring_task io_uring: COW io_identity on mismatch io_uring: move io identity items into separate struct io_uring: rely solely on work flags to determine personality. io_uring: pass required context in as flags io-wq: assign NUMA node locality if appropriate io_uring: fix error path cleanup in io_sqe_files_register() Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly" io_uring: fix REQ_F_COMP_LOCKED by killing it io_uring: dig out COMP_LOCK from deep call chain io_uring: don't put a poll req under spinlock io_uring: don't unnecessarily clear F_LINK_TIMEOUT io_uring: don't set COMP_LOCKED if won't put ...
2020-10-20Merge branches 'clk-ingenic', 'clk-at91', 'clk-kconfig', 'clk-imx', ↵Stephen Boyd
'clk-qcom', 'clk-prima2' and 'clk-bcm' into clk-next - Support qcom SM8150/SM8250 video and display clks - Change how qcom's display port clks work * clk-ingenic: clk: ingenic: Respect CLK_SET_RATE_PARENT in .round_rate clk: ingenic: Don't tag custom clocks with CLK_SET_RATE_PARENT clk: ingenic: Don't use CLK_SET_RATE_GATE for PLL clk: ingenic: Use readl_poll_timeout instead of custom loop clk: ingenic: Use to_clk_info() macro for all clocks * clk-at91: clk: at91: sam9x60: support only two programmable clocks clk: at91: clk-sam9x60-pll: remove unused variable clk: at91: clk-main: update key before writing AT91_CKGR_MOR clk: at91: remove the checking of parent_name * clk-kconfig: clk: Restrict CLK_HSDK to ARC_SOC_HSDK * clk-imx: clk: imx8mq: Fix usdhc parents order clk: imx: imx21: Remove clock driver clk: imx: gate2: Fix a few typos clk: imx: Fix and update kerneldoc clk: imx: fix i.MX7D peripheral clk mux flags clk: imx: fix composite peripheral flags clk: imx: Correct the memrepair clock on imx8mp clk: imx: Correct the root clk of media ldb on imx8mp clk: imx: vf610: Add CRC clock clk: imx: Explicitly include bits.h clk: imx8qxp: Support building i.MX8QXP clock driver as module clk: imx8m: Support module build clk: imx: Add clock configuration for ARMv7 platforms clk: imx: Support building i.MX common clock driver as module clk: composite: Export clk_hw_register_composite() clk: imx6sl: Use BIT(x) to avoid shifting signed 32-bit value by 31 bits * clk-qcom: clk: qcom: gdsc: Keep RETAIN_FF bit set if gdsc is already on clk: qcom: Add display clock controller driver for SM8150 and SM8250 dt-bindings: clock: add QCOM SM8150 and SM8250 display clock bindings clk: qcom: add video clock controller driver for SM8250 clk: qcom: add video clock controller driver for SM8150 dt-bindings: clock: add SM8250 QCOM video clock bindings dt-bindings: clock: add SM8150 QCOM video clock bindings dt-bindings: clock: combine qcom,sdm845-videocc and qcom,sc7180-videocc clk: qcom: gcc-msm8994: Add missing clocks, resets and GDSCs clk/qcom: fix spelling typo clk: qcom: gcc-sdm660: Fix wrong parent_map clk: qcom: dispcc: Update DP clk ops for phy design clk: qcom: gcc-msm8939: remove defined but not used variables clk: qcom: ipq8074: make pcie0_rchng_clk_src static * clk-prima2: clk: clk-prima2: fix return value check in prima2_clk_init() * clk-bcm: clk: bcm2835: add missing release if devm_clk_hw_register fails clk: bcm: rpi: Add register to control pixel bvb clk
2020-10-20Merge branches 'clk-simplify', 'clk-ti', 'clk-tegra', 'clk-rockchip' and ↵Stephen Boyd
'clk-mediatek' into clk-next - Small non-critical fixes for TI clk driver - Support Mediatek MT8167 clks * clk-simplify: clk: mediatek: fix platform_no_drv_owner.cocci warnings clk: mediatek: mt7629: simplify the return expression of mtk_infrasys_init clk: mediatek: mt6797: simplify the return expression of mtk_infrasys_init * clk-ti: clk: ti: dra7: add missing clkctrl register for SHA2 instance clk: ti: clockdomain: fix static checker warning clk: ti: autoidle: add checks against NULL pointer reference clk: keystone: sci-clk: add 10% slack to set_rate clk: keystone: sci-clk: cache results of last query rate operation clk: keystone: sci-clk: fix parsing assigned-clock data during probe * clk-tegra: clk: tegra: Drop !provider check in tegra210_clk_emc_set_rate() * clk-rockchip: clk: rockchip: Initialize hw to error to avoid undefined behavior clk: rockchip: rk3399: Support module build clk: rockchip: fix the clk config to support module build clk: rockchip: Export some clock common APIs for module drivers clk: rockchip: Export rockchip_register_softrst() clk: rockchip: Export rockchip_clk_register_ddrclk() clk: rockchip: Use clk_hw_register_composite instead of clk_register_composite calls clk: rockchip: rk3308: drop unused mux_timer_src_p * clk-mediatek: clk: mediatek: Add MT8167 clock support dt-bindings: clock: mediatek: add bindings for MT8167 clocks clk: mediatek: add UART0 clock support
2020-10-20Merge branches 'clk-renesas', 'clk-amlogic', 'clk-allwinner', 'clk-samsung', ↵Stephen Boyd
'clk-doc' and 'clk-unused' into clk-next - Remove various unused variables in clk drivers * clk-renesas: clk: renesas: rcar-gen3: Update description for RZ/G2 clk: renesas: cpg-mssr: Add support for R-Car V3U clk: renesas: cpg-mssr: Add register pointers into struct cpg_mssr_priv clk: renesas: cpg-mssr: Use enum clk_reg_layout instead of a boolean flag dt-bindings: clock: renesas,cpg-mssr: Document r8a779a0 dt-bindings: clock: Add r8a779a0 CPG Core Clock Definitions dt-bindings: power: Add r8a779a0 SYSC power domain definitions clk: renesas: rcar-gen2: Rename vsp1-(sy|rt) clocks to vsp(s|r) clk: renesas: r8a7742: Add clk entry for VSPR * clk-amlogic: clk: meson: make shipped controller configurable clk: meson: g12a: mark fclk_div2 as critical clk: meson: axg-audio: fix g12a tdmout sclk inverter clk: meson: axg-audio: separate axg and g12a regmap tables clk: meson: add sclk-ws driver * clk-allwinner: clk: sunxi-ng: sun8i: r40: Use sigma delta modulation for audio PLL clk: sunxi-ng: add support for the Allwinner A100 CCU dt-bindings: clk: sunxi-ccu: add compatible string for A100 CCU and R-CCU * clk-samsung: clk: s2mps11: initialize driver via module_platform_driver clk: samsung: Use cached clk_hws instead of __clk_lookup() calls clk: samsung: exynos5420/5250: Add IDs to the CPU parent clk definitions clk: samsung: Add clk ID definitions for the CPU parent clocks clk: samsung: exynos5420: Avoid __clk_lookup() calls when enabling clocks clk: samsung: exynos5420: Add definition of clock ID for mout_sw_aclk_g3d clk: samsung: Keep top BPLL mux on Exynos542x enabled * clk-doc: clk: davinci: add missing kerneldoc clk: fixed: add missing kerneldoc * clk-unused: clk: socfpga: agilex: Remove unused variable 'cntr_mux' clk: si5341: drop unused 'err' variable clk: mmp: pxa1928: drop unused 'clk' variable clk: at91: drop unused at91sam9g45_pcr_layout
2020-10-20Merge tag 'for-v5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply Pull power supply and reset updates from Sebastian Reichel: "Power-supply core: - add wireless type - properly document current direction Battery/charger driver changes: - new fuel-gauge/charger driver for RN5T618/RN5T619 - new charger driver for BQ25980, BQ25975 and BQ25960 - bq27xxx-battery: add support for TI bq34z100 - gpio-charger: convert to GPIO descriptors - gpio-charger: add optional support for charge current limiting - max17040: add support for max17041, max17043, max17044 - max17040: add support for max17048, max17049, max17058, max17059 - smb347-charger: add DT support - smb247-charger: add SMB345 and SMB358 support - simple-battery: add temperature properties - lots of minor fixes, cleanups and DT binding YAML conversions Reset drivers: - ocelot: Add support for Sparx5" * tag 'for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (81 commits) power: reset: POWER_RESET_OCELOT_RESET should depend on Ocelot or Sparx5 power: supply: bq25980: Fix uninitialized wd_reg_val and overrun power: supply: ltc2941: Fix ptr to enum cast power: supply: test-power: revise parameter printing to use sprintf power: supply: charger-manager: fix incorrect check on charging_duration_ms power: supply: max17040: Fix ptr to enum cast power: supply: bq25980: Fix uninitialized wd_reg_val power: supply: bq25980: remove redundant zero check on ret power: reset: ocelot: Add support for Sparx5 dt-bindings: reset: ocelot: Add Sparx5 support power: supply: sbs-battery: keep error code when get_property() fails power: supply: bq25980: Add support for the BQ259xx family dt-binding: bq25980: Add the bq25980 flash charger power: supply: fix spelling mistake "unprecise" -> "imprecise" power: supply: test_power: add missing newlines when printing parameters by sysfs power: supply: pm2301: drop duplicated i2c_device_id power: supply: charger-manager: drop unused charger assignment power: supply: rt9455: skip 'struct acpi_device_id' when !CONFIG_ACPI power: supply: goldfish: skip 'struct acpi_device_id' when !CONFIG_ACPI power: supply: bq25890: skip 'struct acpi_device_id' when !CONFIG_ACPI ...
2020-10-20PM: runtime: Fix typo in pm_runtime_set_active() helper commentBean Huo
This patch is to fix typo in the comment of helper pm_runtime_set_active(). Signed-off-by: Bean Huo <beanhuo@micron.com> [ rjw: Subject edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-20Merge tag 'for-linus-5.10b-rc1b-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: - A single patch to fix the Xen security issue XSA-331 (malicious guests can DoS dom0 by triggering NULL-pointer dereferences or access to stale data). - A larger series to fix the Xen security issue XSA-332 (malicious guests can DoS dom0 by sending events at high frequency leading to dom0's vcpus being busy in IRQ handling for elongated times). * tag 'for-linus-5.10b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/events: block rogue events for some time xen/events: defer eoi in case of excessive number of events xen/events: use a common cpu hotplug hook for event channels xen/events: switch user event channels to lateeoi model xen/pciback: use lateeoi irq binding xen/pvcallsback: use lateeoi irq binding xen/scsiback: use lateeoi irq binding xen/netback: use lateeoi irq binding xen/blkback: use lateeoi irq binding xen/events: add a new "late EOI" evtchn framework xen/events: fix race in evtchn_fifo_unmask() xen/events: add a proper barrier to 2-level uevent unmasking xen/events: avoid removing an event channel while handling it
2020-10-20block: remove unused members for io_contextYufen Yu
After removing blk-sq code, there is no user of nr_batch_requests and last_waited in kernel. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-20Merge tag 'kvmarm-5.10' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 5.10 - New page table code for both hypervisor and guest stage-2 - Introduction of a new EL2-private host context - Allow EL2 to have its own private per-CPU variables - Support of PMU event filtering - Complete rework of the Spectre mitigation
2020-10-20netfilter: nftables_offload: KASAN slab-out-of-bounds Read in ↵Saeed Mirzamohammadi
nft_flow_rule_create This patch fixes the issue due to: BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2 net/netfilter/nf_tables_offload.c:40 Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244 The error happens when expr->ops is accessed early on before performing the boundary check and after nft_expr_next() moves the expr to go out-of-bounds. This patch checks the boundary condition before expr->ops that fixes the slab-out-of-bounds Read issue. Add nft_expr_more() and use it to fix this problem. Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-20dma-mapping: move more functions to dma-map-ops.hChristoph Hellwig
Due to a mismerge a bunch of prototypes that should have moved to dma-map-ops.h are still in dma-mapping.h, fix that up. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-20xen/events: add a new "late EOI" evtchn frameworkJuergen Gross
In order to avoid tight event channel related IRQ loops add a new framework of "late EOI" handling: the IRQ the event channel is bound to will be masked until the event has been handled and the related driver is capable to handle another event. The driver is responsible for unmasking the event channel via the new function xen_irq_lateeoi(). This is similar to binding an event channel to a threaded IRQ, but without having to structure the driver accordingly. In order to support a future special handling in case a rogue guest is sending lots of unsolicited events, add a flag to xen_irq_lateeoi() which can be set by the caller to indicate the event was a spurious one. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-19Merge tag 'fuse-update-5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Support directly accessing host page cache from virtiofs. This can improve I/O performance for various workloads, as well as reducing the memory requirement by eliminating double caching. Thanks to Vivek Goyal for doing most of the work on this. - Allow automatic submounting inside virtiofs. This allows unique st_dev/ st_ino values to be assigned inside the guest to files residing on different filesystems on the host. Thanks to Max Reitz for the patches. - Fix an old use after free bug found by Pradeep P V K. * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) virtiofs: calculate number of scatter-gather elements accurately fuse: connection remove fix fuse: implement crossmounts fuse: Allow fuse_fill_super_common() for submounts fuse: split fuse_mount off of fuse_conn fuse: drop fuse_conn parameter where possible fuse: store fuse_conn in fuse_req fuse: add submount support to <uapi/linux/fuse.h> fuse: fix page dereference after free virtiofs: add logic to free up a memory range virtiofs: maintain a list of busy elements virtiofs: serialize truncate/punch_hole and dax fault path virtiofs: define dax address space operations virtiofs: add DAX mmap support virtiofs: implement dax read/write operations virtiofs: introduce setupmapping/removemapping commands virtiofs: implement FUSE_INIT map_alignment field virtiofs: keep a list of free dax memory ranges virtiofs: add a mount option to enable dax virtiofs: set up virtio_fs dax_device ...
2020-10-19perf: correct SNOOPX field offsetAl Grant
perf_event.h has macros that define the field offsets in the data_src bitmask in perf records. The SNOOPX and REMOTE offsets were both 37. These are distinct fields, and the bitfield layout in perf_mem_data_src confirms that SNOOPX should be at offset 38. Fixes: 52839e653b5629bd ("perf tools: Add support for printing new mem_info encodings") Signed-off-by: Al Grant <al.grant@foss.arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/4ac9f5cc-4388-b34a-9999-418a4099415d@foss.arm.com
2020-10-18Merge tag 'linux-kselftest-kunit-5.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull more Kunit updates from Shuah Khan: - add Kunit to kernel_init() and remove KUnit from init calls entirely. This addresses the concern that Kunit would not work correctly during late init phase. - add a linker section where KUnit can put references to its test suites. This is the first step in transitioning to dispatching all KUnit tests from a centralized executor rather than having each as its own separate late_initcall. - add a centralized executor to dispatch tests rather than relying on late_initcall to schedule each test suite separately. Centralized execution is for built-in tests only; modules will execute tests when loaded. - convert bitfield test to use KUnit framework - Documentation updates for naming guidelines and how kunit_test_suite() works. - add test plan to KUnit TAP format * tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE lib: kunit: add bitfield test conversion to KUnit Documentation: kunit: add a brief blurb about kunit_test_suite kunit: test: add test plan to KUnit TAP format init: main: add KUnit to kernel init kunit: test: create a single centralized executor for all tests vmlinux.lds.h: add linker section for KUnit test suites Documentation: kunit: Add naming guidelines
2020-10-18Merge tag 'core-rcu-2020-10-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: - Debugging for smp_call_function() - RT raw/non-raw lock ordering fixes - Strict grace periods for KASAN - New smp_call_function() torture test - Torture-test updates - Documentation updates - Miscellaneous fixes [ This doesn't actually pull the tag - I've dropped the last merge from the RCU branch due to questions about the series. - Linus ] * tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits) smp: Make symbol 'csd_bug_count' static kernel/smp: Provide CSD lock timeout diagnostics smp: Add source and destination CPUs to __call_single_data rcu: Shrink each possible cpu krcp rcu/segcblist: Prevent useless GP start if no CBs to accelerate torture: Add gdb support rcutorture: Allow pointer leaks to test diagnostic code rcutorture: Hoist OOM registry up one level refperf: Avoid null pointer dereference when buf fails to allocate rcutorture: Properly synchronize with OOM notifier rcutorture: Properly set rcu_fwds for OOM handling torture: Add kvm.sh --help and update help message rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05 torture: Update initrd documentation rcutorture: Replace HTTP links with HTTPS ones locktorture: Make function torture_percpu_rwsem_init() static torture: document --allcpus argument added to the kvm.sh script rcutorture: Output number of elapsed grace periods rcutorture: Remove KCSAN stubs rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp() ...
2020-10-18mm: remove alloc_vm_areaChristoph Hellwig
All users are gone now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: add a vmap_pfn functionChristoph Hellwig
Add a proper helper to remap PFNs into kernel virtual space so that drivers don't have to abuse alloc_vm_area and open coded PTE manipulation for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: add a VM_MAP_PUT_PAGES flag for vmapChristoph Hellwig
Add a flag so that vmap takes ownership of the passed in page array. When vfree is called on such an allocation it will put one reference on each page, and free the page array itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ [minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com [minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com [akpm@linux-foundation.org: fix patch ordering issue] [akpm@linux-foundation.org: fix arm64 whoops] [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian] [akpm@linux-foundation.org: fix i386 build] [sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au [sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au [minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com [yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com [minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com [akpm@linux-foundation.org: pidfd_get_pid() gained an argument] [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18pid: move pidfd_get_pid() to pid.cMinchan Kim
process_madvise syscall needs pidfd_get_pid function to translate pidfd to pid so this patch move the function to kernel/pid.c. Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/madvise: pass mm to do_madviseMinchan Kim
Patch series "introduce memory hinting API for external process", v9. Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that, application could give hints to kernel what memory range are preferred to be reclaimed. However, in some platform(e.g., Android), the information required to make the hinting decision is not known to the app. Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the concern, this patch introduces new syscall - process_madvise(2). Bascially, it's same with madvise(2) syscall but it has some differences. 1. It needs pidfd of target process to provide the hint 2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this moment. Other hints in madvise will be opened when there are explicit requests from community to prevent unexpected bugs we couldn't support. 3. Only privileged processes can do something for other process's address space. For more detail of the new API, please see "mm: introduce external memory hinting API" description in this patchset. This patch (of 3): In upcoming patches, do_madvise will be called from external process context so we shouldn't asssume "current" is always hinted process's task_struct. Furthermore, we must not access mm_struct via task->mm, but obtain it via access_mm() once (in the following patch) and only use that pointer [1], so pass it to do_madvise() as well. Note the vma->vm_mm pointers are safe, so we can use them further down the call stack. And let's pass current->mm as arguments of do_madvise so it shouldn't change existing behavior but prepare next patch to make review easy. [vbabka@suse.cz: changelog tweak] [minchan@kernel.org: use current->mm for io_uring] Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org [akpm@linux-foundation.org: fix it for upstream changes] [akpm@linux-foundation.org: whoops] [rdunlap@infradead.org: add missing includes] Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: John Dias <joaodias@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: Christian Brauner <christian@brauner.io> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: kmem: enable kernel memcg accounting from interrupt contextsRoman Gushchin
If a memcg to charge can be determined (using remote charging API), there are no reasons to exclude allocations made from an interrupt context from the accounting. Such allocations will pass even if the resulting memcg size will exceed the hard limit, but it will affect the application of the memory pressure and an inability to put the workload under the limit will eventually trigger the OOM. To use active_memcg() helper, memcg_kmem_bypass() is moved back to memcontrol.c. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: kmem: prepare remote memcg charging infra for interrupt contextsRoman Gushchin
Remote memcg charging API uses current->active_memcg to store the currently active memory cgroup, which overwrites the memory cgroup of the current process. It works well for normal contexts, but doesn't work for interrupt contexts: indeed, if an interrupt occurs during the execution of a section with an active memcg set, all allocations inside the interrupt will be charged to the active memcg set (given that we'll enable accounting for allocations from an interrupt context). But because the interrupt might have no relation to the active memcg set outside, it's obviously wrong from the accounting prospective. To resolve this problem, let's add a global percpu int_active_memcg variable, which will be used to store an active memory cgroup which will be used from interrupt contexts. set_active_memcg() will transparently use current->active_memcg or int_active_memcg depending on the context. To make the read part simple and transparent for the caller, let's introduce two new functions: - struct mem_cgroup *active_memcg(void), - struct mem_cgroup *get_active_memcg(void). They are returning the active memcg if it's set, hiding all implementation details: where to get it depending on the current context. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm, memcg: rework remote charging API to support nestingRoman Gushchin
Currently the remote memcg charging API consists of two functions: memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the memcg value, which overwrites the memcg of the current task. memalloc_use_memcg(target_memcg); <...> memalloc_unuse_memcg(); It works perfectly for allocations performed from a normal context, however an attempt to call it from an interrupt context or just nest two remote charging blocks will lead to an incorrect accounting. On exit from the inner block the active memcg will be cleared instead of being restored. memalloc_use_memcg(target_memcg); memalloc_use_memcg(target_memcg_2); <...> memalloc_unuse_memcg(); Error: allocation here are charged to the memcg of the current process instead of target_memcg. memalloc_unuse_memcg(); This patch extends the remote charging API by switching to a single function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg), which sets the new value and returns the old one. So a remote charging block will look like: old_memcg = set_active_memcg(target_memcg); <...> set_active_memcg(old_memcg); This patch is heavily based on the patch by Johannes Weiner, which can be found here: https://lkml.org/lkml/2020/5/28/806 . Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Schatzberg <dschatzberg@fb.com> Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18jbd2, ext4, ocfs2: introduce/use journal callbacks ↵Mauricio Faria de Oliveira
j_submit|finish_inode_data_buffers() Introduce journal callbacks to allow different behaviors for an inode in journal_submit|finish_inode_data_buffers(). The existing users of the current behavior (ext4, ocfs2) are adapted to use the previously exported functions that implement the current behavior. Users are callers of jbd2_journal_inode_ranged_write|wait(), which adds the inode to the transaction's inode list with the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree. Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2, which builds fs/jbd2/commit.c and journal.c that define and export the functions, so we can call directly in ext4/ocfs2. Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201006004841.600488-3-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>