path: root/mm/percpu.c
Age | Commit message | Author
2017-06-20 | percpu: expose statistics about percpu memory via debugfs | Dennis Zhou
There is limited visibility into the use of percpu memory leaving us unable to reason about correctness of parameters and overall use of percpu memory. These counters and statistics aim to help understand basic statistics about percpu memory such as number of allocations over the lifetime, allocation sizes, and fragmentation. New Config: PERCPU_STATS Signed-off-by: Dennis Zhou <dennisz@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-20 | percpu: migrate percpu data structures to internal header | Dennis Zhou
Migrates pcpu_chunk definition and a few percpu static variables to an internal header file from mm/percpu.c. These will be used with debugfs to expose statistics about percpu memory improving visibility regarding allocations and fragmentation. Signed-off-by: Dennis Zhou <dennisz@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-20 | percpu: add missing lockdep_assert_held to func pcpu_free_area | Dennis Zhou
Add a missing lockdep_assert_held for pcpu_lock to improve consistency and safety throughout mm/percpu.c. Signed-off-by: Dennis Zhou <dennisz@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-05-10 | mark most percpu globals as __ro_after_init | Daniel Micay
Moving pcpu_base_addr to this section comes from PaX where it's part of KERNEXEC. This extends it to the rest of the globals only written by the init code. Signed-off-by: Daniel Micay <danielmicay@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-04 | Merge branch 'sched/core' into locking/core | Thomas Gleixner
Required for the rtmutex/sched_deadline patches which depend on both branches
2017-03-26 | lockdep: Fix per-cpu static objects | Peter Zijlstra
Since commit 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") we try to collapse per-cpu locks into a single class by giving them all the same key. For this key we choose the canonical address of the per-cpu object, which would be the offset into the per-cpu area. This has two problems: - there is a case where we run !0 lock->key through static_obj() and expect this to pass; it doesn't for canonical pointers. - 0 is a valid canonical address. Cure both issues by redefining the canonical address as the address of the per-cpu variable on the boot CPU. Since I didn't want to rely on CPU0 being the boot-cpu, or even existing at all, track the boot CPU in a variable. Fixes: 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Borislav Petkov <bp@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-mm@kvack.org Cc: wfg@linux.intel.com Cc: kernel test robot <fengguang.wu@intel.com> Cc: LKP <lkp@01.org> Link: http://lkml.kernel.org/r/20170320114108.kbvcsuepem45j5cr@hirez.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-16 | locking/lockdep: Handle statically initialized PER_CPU locks properly | Thomas Gleixner
If a PER_CPU struct which contains a spin_lock is statically initialized via: DEFINE_PER_CPU(struct foo, bla) = { .lock = __SPIN_LOCK_UNLOCKED(bla.lock) }; then lockdep assigns a separate key to each lock because the logic for assigning a key to statically initialized locks is to use the address as the key. With per CPU locks the address is obviously different on each CPU. That's wrong, because all locks should have the same key. To solve this the following modifications are required: 1) Extend the is_kernel/module_percpu_addr() functions to hand back the canonical address of the per CPU address, i.e. the per CPU address minus the per CPU offset. 2) Check the lock address with these functions and if the per CPU check matches use the returned canonical address as the lock key, so all per CPU locks have the same key. 3) Move the static_obj(key) check into look_up_lock_class() so this check can be avoided for statically initialized per CPU locks. That's required because the canonical address fails the static_obj(key) check for obvious reasons. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Merged Dan's fixups for !MODULES and !SMP into this patch. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Murphy <dmurphy@ti.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170227143736.pectaimkjkan5kow@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
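The "per CPU address minus the per CPU offset" idea can be tried out in userspace. The sketch below is only an illustration under stated assumptions: `pcpu_area`, `DEMO_NR_CPUS`, `DEMO_UNIT_SIZE` and `percpu_canonical()` are hypothetical names standing in for the kernel's per-CPU units and for the extended is_kernel/module_percpu_addr() helpers, not the kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NR_CPUS   4   /* hypothetical CPU count */
#define DEMO_UNIT_SIZE 64  /* hypothetical size of one per-CPU unit */

/* Hypothetical stand-in for the per-CPU area: one unit per CPU. */
static unsigned char pcpu_area[DEMO_NR_CPUS][DEMO_UNIT_SIZE];

/*
 * If addr lies inside some CPU's unit, store the canonical address
 * (address minus that CPU's offset) in *can and return true.
 * Every CPU's copy of the same variable yields the same value, so it
 * can serve as one shared lock key.
 */
static bool percpu_canonical(const void *addr, uintptr_t *can)
{
	uintptr_t a = (uintptr_t)addr;

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		uintptr_t start = (uintptr_t)pcpu_area[cpu];

		if (a >= start && a < start + DEMO_UNIT_SIZE) {
			*can = a - start;
			return true;
		}
	}
	return false;
}
```

With this scheme a variable at the very start of the unit gets the canonical value 0, which is one of the problems the later "lockdep: Fix per-cpu static objects" commit describes.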
2017-03-06 | percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages | Tahsin Erdogan
Update to pcpu_nr_empty_pop_pages in pcpu_alloc() is currently done without holding pcpu_lock. This can lead to bad updates to the variable. Add missing lock calls. Fixes: b539b87fed37 ("percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated") Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v3.18+
2017-02-27 | scripts/spelling.txt: add "followings" pattern and fix typo instances | Masahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: followings||following While we are here, add a missing colon in the boilerplate in DT binding documents. The "you SoC" in allwinner,sunxi-pinctrl.txt was fixed as well. I reworded "as the followings:" to "as follows:" for drivers/usb/gadget/udc/renesas_usb3.c. Link: http://lkml.kernel.org/r/1481573103-11329-32-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 | Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu | Linus Torvalds
Pull percpu update from Tejun Heo: "This includes just one patch to reject non-power-of-2 alignments and trigger warning. Interestingly, this actually caught a bug in XEN ARM64" * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: ensure the requested alignment is power of two
2016-12-12 | mm/percpu.c: fix panic triggered by BUG_ON() falsely | zijun_hu
As shown by pcpu_build_alloc_info(), the number of units within a percpu group is deduced by rounding up the number of CPUs within the group to the @upa boundary. Therefore, the number of CPUs normally isn't equal to the number of units if it isn't aligned to @upa. However, pcpu_page_first_chunk() uses BUG_ON() to assert that the two numbers are equal, so the BUG_ON() may trigger a panic incorrectly. To fix this issue, the number of CPUs is rounded up before being compared with the number of units, and the BUG_ON() is replaced with a warning and the return of an error code, to keep the system alive as much as possible. Link: http://lkml.kernel.org/r/57FCF07C.2020103@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
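A minimal userspace sketch of the two comparisons described above. The helper names `units_match_old()` and `units_match_new()` are hypothetical; only `roundup()` mirrors the kernel macro of the same name.

```c
#include <stdbool.h>

/* Same arithmetic as the kernel's roundup(): works for any y > 0. */
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

/* Old assertion: CPU count compared with the unit count directly. */
static bool units_match_old(unsigned int nr_cpus, unsigned int nr_units)
{
	return nr_cpus == nr_units;
}

/* New check: CPU count is rounded up to @upa before the comparison. */
static bool units_match_new(unsigned int nr_cpus, unsigned int upa,
			    unsigned int nr_units)
{
	return roundup(nr_cpus, upa) == nr_units;
}
```

For a group with 6 CPUs and @upa == 4 the group holds 8 units, so the old comparison fails on a perfectly valid layout while the new one passes.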
2016-10-19 | percpu: ensure the requested alignment is power of two | zijun_hu
The percpu allocator assumes that the requested alignment is a power of two but hasn't been verifying the input. If the specified alignment isn't a power of two, the allocator can malfunction. Add the sanity check. The following is a detailed analysis of the effects of alignments which aren't a power of two. The alignment must at least be even, since the LSB of a chunk->map element is used as the free/in-use flag of an area; besides, the alignment must be a power of 2 too, since ALIGN() doesn't work correctly for other alignments but is used by pcpu_fit_in_area(). IOW, the current allocator only works for power-of-2 aligned area allocations. See the counterexample below for why an odd alignment doesn't work. Let's assume area [16, 36) is free but its previous one is in-use, and we want to allocate a @size == 8 and @align == 7 area. The larger area [16, 36) is eventually split into three areas [16, 21), [21, 29), [29, 36). However, because of how a chunk->map element is used, the actual offset of the target area [21, 29) is 21 but is recorded in the relevant element as 20; moreover, the residual tail free area [29, 36) is mistaken for in-use and is silently lost. Unlike macro roundup(), ALIGN(x, a) doesn't work if @a isn't a power of 2; for example, roundup(10, 6) == 12 but ALIGN(10, 6) == 10, and the latter result obviously isn't the desired one. tj: Code style and patch description updates. Signed-off-by: zijun_hu <zijun_hu@htc.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
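The roundup()/ALIGN() difference quoted in the message can be reproduced in userspace. In the sketch below the three macros and `is_power_of_2()` follow the arithmetic of the kernel helpers with the same names, reduced to `unsigned long`; `pcpu_align_ok()` is a hypothetical stand-in for the added sanity check, not the actual patch.

```c
#include <stdbool.h>

/* Mask-based alignment: only correct when a is a power of two. */
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a)         ALIGN_MASK((x), (unsigned long)(a) - 1)

/* Division-based rounding: correct for any y > 0. */
#define roundup(x, y)       ((((x) + ((y) - 1)) / (y)) * (y))

/* True when exactly one bit is set. */
static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical check: reject alignments the allocator cannot handle. */
static bool pcpu_align_ok(unsigned long align)
{
	return is_power_of_2(align);
}
```

With @a == 6 the mask is 5 (binary 101), so (10 + 5) & ~5 clears bits 0 and 2 of 15 and leaves 10, which is not a multiple of 6.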
2016-10-05 | mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk() | zijun_hu
In order to ensure the percpu group areas within a chunk aren't distributed too sparsely, pcpu_embed_first_chunk() goes to the error handling path when a chunk spans over 3/4 of the VMALLOC area. However, during the error handling, it forgets to free the memory allocated for all percpu groups, because it goes to label @out_free rather than @out_free_areas. This causes a memory leak if the rare case really happens. To fix the issue, we check the chunk's spanned area immediately after completing memory allocation for all percpu groups, and go to label @out_free_areas to free the memory and then return if the check fails. To verify the approach, we dump all memory allocated, then force the jump, then dump all memory freed; the result is okay after checking whether we free all memory we allocate in this function. BTW, the approach was chosen after considering the cases below: - we don't go to label @out_free directly to fix this issue, since we might free several allocated memory blocks twice - the aim of jumping after pcpu_setup_first_chunk() is to bypass freeing usable memory rather than to handle an error; moreover, the function does not return an error code in any case, it either panics due to BUG_ON() or returns 0. Signed-off-by: zijun_hu <zijun_hu@htc.com> Tested-by: zijun_hu <zijun_hu@htc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-05 | mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk() | zijun_hu
pcpu_embed_first_chunk() calculates the range a percpu chunk spans into @max_distance and uses it to ensure that a chunk is not too big compared to the total vmalloc area. However, during calculation, it used an incorrect top address by adding a unit size to the highest group's base address. This can make the calculated max_distance slightly smaller than the actual distance, although given the scale of values involved the error is very unlikely to have an actual impact. Fix this issue by adding the group's size instead of a unit size. BTW, the type of variable max_distance is changed from size_t to unsigned long too, based on the considerations below: - type unsigned long usually has the same width as IP core registers and can be applied here very well - make the @max_distance type consistent with the operands calculated against it, such as @ai->groups[i].base_offset and macro VMALLOC_TOTAL - type unsigned long is more universal than size_t; size_t is usually defined as unsigned int or unsigned long depending on the arch. Signed-off-by: zijun_hu <zijun_hu@htc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
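The difference between the two top-address calculations can be sketched as plain arithmetic. This is an illustration under assumptions: `struct demo_group`, `span_old()` and `span_new()` are hypothetical names, and the group's size is taken here to be its unit count times the unit size.

```c
/* Hypothetical, reduced view of one percpu group. */
struct demo_group {
	unsigned long base_offset; /* offset of the group from the chunk base */
	unsigned int nr_units;     /* units in this group */
};

/* Old calculation: highest base plus a single unit. */
static unsigned long span_old(const struct demo_group *highest,
			      unsigned long unit_size)
{
	return highest->base_offset + unit_size;
}

/* Fixed calculation: highest base plus the whole group (assumed size). */
static unsigned long span_new(const struct demo_group *highest,
			      unsigned long unit_size)
{
	return highest->base_offset + (unsigned long)highest->nr_units * unit_size;
}
```

The two results agree only when the highest group holds a single unit; otherwise the old value undercounts by (nr_units - 1) unit sizes.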
2016-05-25 | percpu: fix synchronization between synchronous map extension and chunk destruction | Tejun Heo
For non-atomic allocations, pcpu_alloc() can try to extend the area map synchronously after dropping pcpu_lock; however, the extension wasn't synchronized against chunk destruction and the chunk might get freed while extension is in progress. This patch fixes the bug by putting most of non-atomic allocations under pcpu_alloc_mutex to synchronize against pcpu_balance_work which is responsible for async chunk management including destruction. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: stable@vger.kernel.org # v3.18+ Fixes: 1a4d76076cda ("percpu: implement asynchronous chunk population")
2016-05-25 | percpu: fix synchronization between chunk->map_extend_work and chunk destruction | Tejun Heo
Atomic allocations can trigger async map extensions which are serviced by chunk->map_extend_work. pcpu_balance_work which is responsible for destroying idle chunks wasn't synchronizing properly against chunk->map_extend_work and may end up freeing the chunk while the work item is still in flight. This patch fixes the bug by rolling async map extension operations into pcpu_balance_work. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: stable@vger.kernel.org # v3.18+ Fixes: 9c824b6a172c ("percpu: make sure chunk->map array has available space")
2016-03-17 | mm: percpu: use pr_fmt to prefix output | Joe Perches
Use the normal mechanism to make the logging output consistently "percpu:" instead of a mix of "PERCPU:" and "percpu:" Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 | mm: convert printk(KERN_<LEVEL> to pr_<level> | Joe Perches
Most of the mm subsystem uses pr_<level> so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 | mm: coalesce split strings | Joe Perches
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 | mm: convert pr_warning to pr_warn | Joe Perches
There are a mixture of pr_warning and pr_warn uses in mm. Use pr_warn consistently. Miscellanea: - Coalesce formats - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22 | tree wide: use kvfree() than conditional kfree()/vfree() | Tetsuo Handa
There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: David Rientjes <rientjes@google.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Boris Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 | mm/percpu: use offset_in_page macro | Alexander Kuleshov
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
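The equivalence of the two spellings is easy to see in a userspace sketch. The `DEMO_` constants below are assumptions for illustration (4 KiB pages); the real PAGE_SHIFT is architecture-dependent, and `demo_offset_in_page()` only mirrors the arithmetic of the kernel macro.

```c
/* Assumed page geometry for the demo: 4 KiB pages. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Keeps only the low bits of an address, i.e. its offset within the page. */
#define demo_offset_in_page(p) ((unsigned long)(p) & ~DEMO_PAGE_MASK)
```

An address is page-aligned exactly when this offset is zero, which is the kind of test the open-coded (addr & ~PAGE_MASK) expression was used for.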
2015-07-21 | percpu: clean up of schunk->map[] assignment in pcpu_setup_first_chunk | Baoquan He
The original assignment is a little redundant. Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-06-24 | mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() | Larry Finger
Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the following INFO splat is logged: =============================== [ INFO: suspicious RCU usage. ] 4.1.0-rc7-next-20150612 #1 Not tainted ------------------------------- kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 3 locks held by systemd/1: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40 #1: (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540 #2: (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540 stack backtrace: CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1 Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014 Call Trace: dump_stack+0x4c/0x6e lockdep_rcu_suspicious+0xe7/0x120 ___might_sleep+0x1d5/0x1f0 __might_sleep+0x4d/0x90 kmem_cache_alloc+0x47/0x250 create_object+0x39/0x2e0 kmemleak_alloc_percpu+0x61/0xe0 pcpu_alloc+0x370/0x630 Additional backtrace lines are truncated. In addition, the above splat is followed by several "BUG: sleeping function called from invalid context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau, these are the clue to the fix. Routine kmemleak_alloc_percpu() always uses GFP_KERNEL for its allocations, whereas it should follow the gfp from its callers. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-24 | percpu: Fix trivial typos in comments | Yannick Guerrini
Change 'tranlated' to 'translated' Change 'mutliples' to 'multiples' Signed-off-by: Yannick Guerrini <yguerrini@tomshardware.fr> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-02-13 | percpu: use %*pb[l] to print bitmaps including cpumasks and nodemasks | Tejun Heo
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29 | percpu: off by one in BUG_ON() | Dan Carpenter
The unit_map[] array has "nr_cpu_ids" number of elements. It's allocated a few lines earlier in the function. So this test should be >= instead of >. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
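The boundary case can be checked with a small sketch. Both helper names are hypothetical; they only model the condition that would be passed to BUG_ON() before and after the fix.

```c
#include <stdbool.h>

/* Old condition: lets cpu == nr_cpu_ids through, one past the last slot. */
static bool cpu_out_of_range_old(unsigned int cpu, unsigned int nr_cpu_ids)
{
	return cpu > nr_cpu_ids;
}

/* Fixed condition: valid indices are 0 .. nr_cpu_ids - 1. */
static bool cpu_out_of_range_new(unsigned int cpu, unsigned int nr_cpu_ids)
{
	return cpu >= nr_cpu_ids;
}
```

With an array of 4 elements, index 4 is already out of bounds, and only the second condition reports it.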
2014-10-08 | percpu: fix how @gfp is interpreted by the percpu allocator | Tejun Heo
When @gfp is specified, the percpu allocator is interested in whether it contains all of GFP_KERNEL or not. If it does, the normal allocation path is taken; otherwise, the atomic allocation path. Unfortunately, pcpu_alloc() was incorrectly testing for whether @gfp contains any part of GFP_KERNEL. Fix it by testing "(gfp & GFP_KERNEL) != GFP_KERNEL" instead of "!(gfp & GFP_KERNEL)" to decide whether the allocation should be atomic or not. Signed-off-by: Tejun Heo <tj@kernel.org>
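The two tests quoted above differ whenever @gfp carries some but not all of the GFP_KERNEL bits. The sketch below uses made-up bit values: the `DEMO_GFP_*` names and numbers are assumptions for illustration, not the kernel's definitions, and only the two expressions come from the commit message.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real flags differ per kernel version. */
#define DEMO_GFP_WAIT   0x10u
#define DEMO_GFP_IO     0x40u
#define DEMO_GFP_FS     0x80u
#define DEMO_GFP_KERNEL (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_NOIO   (DEMO_GFP_WAIT) /* a strict subset of DEMO_GFP_KERNEL */

/* Old test: atomic only if gfp shares no bit at all with GFP_KERNEL. */
static bool is_atomic_old(unsigned int gfp)
{
	return !(gfp & DEMO_GFP_KERNEL);
}

/* Fixed test: atomic unless gfp contains every GFP_KERNEL bit. */
static bool is_atomic_new(unsigned int gfp)
{
	return (gfp & DEMO_GFP_KERNEL) != DEMO_GFP_KERNEL;
}
```

A mask such as `DEMO_GFP_NOIO` takes the normal (blocking) path under the old test and the atomic path under the fixed one.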
2014-09-21 | Revert "percpu: free percpu allocation info for uniprocessor system" | Guenter Roeck
This reverts commit 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system"). The commit causes a hang with a crisv32 image. This may be an architecture problem, but at least for now the revert is necessary to be able to boot a crisv32 image. Cc: Tejun Heo <tj@kernel.org> Cc: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system") Cc: stable@vger.kernel.org # Please don't apply 3189eddbcafc
2014-09-09 | percpu: fix locking regression in the failure path of pcpu_alloc() | Tejun Heo
While updating locking, b38d08f3181c ("percpu: restructure locking") broke pcpu_create_chunk() creation path in pcpu_alloc(). It returns without releasing pcpu_alloc_mutex. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Julia Lawall <julia.lawall@lip6.fr>
2014-09-02 | percpu: implement asynchronous chunk population | Tejun Heo
The percpu allocator now supports atomic allocations by only allocating from already populated areas but the mechanism to ensure that there's adequate amount of populated areas was missing. This patch expands pcpu_balance_work so that in addition to freeing excess free chunks it also populates chunks to maintain an adequate level of populated areas. pcpu_alloc() schedules pcpu_balance_work if the amount of free populated areas is too low or after an atomic allocation failure. * PERCPU_DYNAMIC_RESERVE is increased by two pages to account for PCPU_EMPTY_POP_PAGES_LOW. * pcpu_async_enabled is added to gate both async jobs - chunk->map_extend_work and pcpu_balance_work - so that we don't end up scheduling them while the needed subsystems aren't up yet. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: rename pcpu_reclaim_work to pcpu_balance_work | Tejun Heo
pcpu_reclaim_work will also be used to populate chunks asynchronously. Rename it to pcpu_balance_work in preparation. pcpu_reclaim() is renamed to pcpu_balance_workfn() and some of its local variables are renamed too. This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated | Tejun Heo
pcpu_nr_empty_pop_pages counts the number of empty populated pages across all chunks and chunk->nr_populated counts the number of populated pages in a chunk. Both will be used to implement pre/async population for atomic allocations. pcpu_chunk_[de]populated() are added to update chunk->populated, chunk->nr_populated and pcpu_nr_empty_pop_pages together. All successful chunk [de]populations should be followed by the corresponding pcpu_chunk_[de]populated() calls. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: make sure chunk->map array has available space | Tejun Heo
An allocation attempt may require extending chunk->map array which requires GFP_KERNEL context which isn't available for atomic allocations. This patch ensures that chunk->map array usually keeps some amount of available space by directly allocating buffer space during GFP_KERNEL allocations and scheduling async extension during atomic ones. This should make atomic allocation failures from map space exhaustion rare. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: implement [__]alloc_percpu_gfp() | Tejun Heo
Now that pcpu_alloc_area() can allocate only from populated areas, it's easy to add atomic allocation support to [__]alloc_percpu(). Update pcpu_alloc() so that it accepts @gfp and skips all the blocking operations and allocates only from the populated areas if @gfp doesn't contain GFP_KERNEL. New interface functions [__]alloc_percpu_gfp() are added. While this means that atomic allocations are possible, this isn't complete yet as there's no mechanism to ensure that certain amount of populated areas is kept available and atomic allocations may keep failing under certain conditions. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: indent the population block in pcpu_alloc() | Tejun Heo
The next patch will conditionalize the population block in pcpu_alloc() which will end up making a rather large indentation change obfuscating the actual logic change. This patch puts the block under "if (true)" so that the next patch can avoid indentation changes. The definitions of the local variables which are used only in the block are moved into the block. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: make pcpu_alloc_area() capable of allocating only from populated areas | Tejun Heo
Update pcpu_alloc_area() so that it can skip unpopulated areas if the new parameter @pop_only is true. This is implemented by a new function, pcpu_fit_in_area(), which determines the amount of head padding considering the alignment and populated state. @pop_only is currently always false but this will be used to implement atomic allocation. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: restructure locking | Tejun Heo
At first, the percpu allocator required a sleepable context for both alloc and free paths and used pcpu_alloc_mutex to protect everything. Later, pcpu_lock was introduced to protect the index data structure so that the free path can be invoked from atomic contexts. The conversion only updated what's necessary and left most of the allocation path under pcpu_alloc_mutex. The percpu allocator is planned to add support for atomic allocation and this patch restructures locking so that the coverage of pcpu_alloc_mutex is further reduced. * pcpu_alloc() now grabs pcpu_alloc_mutex only while creating a new chunk and populating the allocated area. Everything else is now protected solely by pcpu_lock. After this change, multiple instances of pcpu_extend_area_map() may race but the function already implements sufficient synchronization using pcpu_lock. This also allows multiple allocators to arrive at new chunk creation. To avoid creating multiple empty chunks back-to-back, a new chunk is created iff there is no other empty chunk after grabbing pcpu_alloc_mutex. * pcpu_lock is now held while modifying chunk->populated bitmap. After this, all data structures are protected by pcpu_lock. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: move region iterations out of pcpu_[de]populate_chunk() | Tejun Heo
Previously, pcpu_[de]populate_chunk() were called with the range which may contain multiple target regions in it and pcpu_[de]populate_chunk() iterated over the regions. This has the benefit of batching up cache flushes for all the regions; however, we're planning to add more bookkeeping logic around [de]population to support atomic allocations and this delegation of iterations gets in the way. This patch moves the region iterations out of pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and pcpu_reclaim() - so that we can later add logic to track more states around them. This change may make cache and tlb flushes more frequent but multi-region [de]populations are rare anyway and if this actually becomes a problem, it's not difficult to factor out cache flushes as separate callbacks which are directly invoked from percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02 | percpu: move common parts out of pcpu_[de]populate_chunk() | Tejun Heo
percpu-vm and percpu-km implement separate versions of pcpu_[de]populate_chunk() and some part which is or should be common are currently in the specific implementations. Make the following changes. * Allocate area clearing is moved from the pcpu_populate_chunk() implementations to pcpu_alloc(). This makes percpu-km's version noop. * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved to their respective callers so that they are applied to percpu-km too. This doesn't make any meaningful difference as both functions are noop for percpu-km; however, this is more consistent and will help implementing atomic allocation support. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-16 | percpu: free percpu allocation info for uniprocessor system | Honggang Li
Currently, only SMP systems free the percpu allocation info. Uniprocessor systems should free it too. For example, on an x86 UML virtual machine with 256MB memory, the UML kernel wastes one page of memory. Signed-off-by: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2014-06-19 | percpu: Use ALIGN macro instead of hand coding alignment calculation | Christoph Lameter
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-14 | percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree() | Jianyu Zhan
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) + BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long). It could hardly ever be bigger than PAGE_SIZE even for large-scale machines, but for consistency with its counterpart pcpu_mem_zalloc(), use pcpu_mem_free() instead. Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()") addressed this problem, but missed this one. tj: commit message updated Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online") Cc: stable@vger.kernel.org
2014-03-29 | percpu: renew the max_contig if we merge the head and previous block | Jianyu Zhan
During pcpu_alloc_area(), we might merge the current head with the previous block. Since we have calculated the max_contig using the size of previous block before we skip it, and now we update the size of previous block, so we should renew the max_contig. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-17 | percpu: allocation size should be even | Viro
723ad1d90b56 ("percpu: store offsets instead of lengths in ->map[]") updated percpu area allocator to use the lowest bit, instead of sign, to signify whether the area is occupied and forced min align to 2; unfortunately, it forgot to force the allocation size to be even causing malfunctions for the very rare odd-sized allocations. Always force the allocations to be even sized. tj: Wrote patch description. Original-patch-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
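Why an odd size breaks the encoding can be shown with a few helpers. This is a sketch, not the kernel code: `map_encode()`, `map_offset()`, `map_in_use()` and `force_even()` are hypothetical names that model one ->map[] entry as an even offset with the occupied flag in bit 0.

```c
#include <stdbool.h>

/* One map entry: byte offset with the in-use flag stored in bit 0. */
static int map_encode(int off, bool in_use)
{
	return off | (in_use ? 1 : 0);
}

/* Recover the offset by clearing the flag bit. */
static int map_offset(int entry)
{
	return entry & ~1;
}

/* Recover the flag. */
static bool map_in_use(int entry)
{
	return (entry & 1) != 0;
}

/* Round a request up to an even size so every area boundary stays even. */
static int force_even(int size)
{
	return (size + 1) & ~1;
}
```

An odd-sized area makes the next area start at an odd offset such as 21; stored as a free entry, it reads back as offset 20 and in use, so the flag bit and the offset collide.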
2014-03-07 | percpu: speed alloc_pcpu_area() up | Al Viro
If we know that first N areas are all in use, we can obviously skip them when searching for a free one. And that kind of hint is very easy to maintain. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 | percpu: store offsets instead of lengths in ->map[] | Al Viro
Current code keeps +-length for each area in chunk->map[]. It has several unpleasant consequences: * even if we know that first 50 areas are all in use, allocation still needs to go through all those areas just to sum their sizes, just to get the offset of free one. * freeing needs to find the array entry referring to the area in question; again, the need to sum the sizes until we reach the offset we are interested in. Note that offsets are monotonic, so simple binary search would do here. New data representation: array of <offset,in-use flag> pairs. Each pair is represented by one int - we use offset|1 for <offset, in use> and offset for <offset, free> (we make sure that all offsets are even). In the end we put a sentry entry - <total size, in use>. The first entry is <0, flag>; it would be possible to store together the flag for Nth area and offset for N+1st, but that leads to much hairier code. In other words, where the old variant would have 4, -8, -4, 4, -12, 100 (4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store <0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1> i.e. 0, 5, 13, 16, 21, 32, 133 This commit switches to new data representation and takes care of a couple of low-hanging fruits in free_pcpu_area() - one is the switch to binary search, another is not doing two memmove() when one would do. Speeding the alloc side up (by keeping track of how many areas in the beginning are known to be all in use) also becomes possible - that'll be done in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
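The worked example in the message (4, -8, -4, 4, -12, 100 becoming 0, 5, 13, 16, 21, 32, 133) can be reproduced with a short converter, and the new layout then admits a binary search by offset. Both functions are hypothetical helpers written for this illustration, not the kernel's routines.

```c
#include <stdlib.h>

/*
 * Convert the old encoding (positive length = free, negative = in use)
 * into the new one: one int per area holding offset | in_use_flag,
 * followed by a sentry entry <total size, in use>.
 * out must have room for n + 1 entries; returns the entry count.
 */
static int lengths_to_offsets(const int *len, int n, int *out)
{
	int off = 0;

	for (int i = 0; i < n; i++) {
		out[i] = off | (len[i] < 0 ? 1 : 0);
		off += abs(len[i]);
	}
	out[n] = off | 1; /* sentry */
	return n + 1;
}

/*
 * Binary search for the area that starts at byte offset off.
 * used counts all entries including the sentry, which is never a match.
 * Returns the index of the area, or -1 if no area starts there.
 */
static int find_area(const int *map, int used, int off)
{
	int lo = 0, hi = used - 2;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;
		int start = map[mid] & ~1; /* strip the in-use flag */

		if (start == off)
			return mid;
		if (start < off)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return -1;
}
```

Because every stored offset is even, masking bit 0 recovers the offset exactly, and the search no longer has to sum lengths from the start of the array.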
2014-03-07 | perpcu: fold pcpu_split_block() into the only caller | Al Viro
... and simplify the results a bit. Makes the next step easier to deal with - we will be changing the data representation for chunk->map[] and it's easier to do if the code in question is not split between pcpu_alloc_area() and pcpu_split_block(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-01-21 | Merge branch 'akpm' (incoming from Andrew) | Linus Torvalds
Merge first patch-bomb from Andrew Morton: - a couple of misc things - inotify/fsnotify work from Jan - ocfs2 updates (partial) - about half of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits) mm/migrate: remove unused function, fail_migrate_page() mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages mm/migrate: correct failure handling if !hugepage_migration_support() mm/migrate: add comment about permanent failure path mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure mm: compaction: reset scanner positions immediately when they meet mm: compaction: do not mark unmovable pageblocks as skipped in async compaction mm: compaction: detect when scanners meet in isolate_freepages mm: compaction: reset cached scanner pfn's before reading them mm: compaction: encapsulate defer reset logic mm: compaction: trace compaction begin and end memcg, oom: lock mem_cgroup_print_oom_info sched: add tracepoints related to NUMA task migration mm: numa: do not automatically migrate KSM pages mm: numa: trace tasks that fail migration due to rate limiting mm: numa: limit scope of lock for NUMA migrate rate limiting mm: numa: make NUMA-migrate related functions static lib/show_mem.c: show num_poisoned_pages when oom mm/hwpoison: add '#' to hwpoison_inject mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter ...
2014-01-21 | mm/percpu.c: use memblock apis for early memory allocations | Santosh Shilimkar
Switch to memblock interfaces for the early memory allocator instead of the bootmem allocator. No functional change in behavior from the bootmem users' point of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers built on top of memblock. And for the archs which still use bootmem, these new APIs just fall back to the existing bootmem APIs. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Walmsley <paul@pwsan.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Lindgren <tony@atomide.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>