summaryrefslogtreecommitdiff
path: root/block/blk-throttle.c
AgeCommit message (Collapse)Author
2015-02-20blk-throttle: check stats_cpu before reading it from sysfsThadeu Lima de Souza Cascardo
When reading blkio.throttle.io_serviced in a recently created blkio cgroup, it's possible to race against the creation of a throttle policy, which delays the allocation of stats_cpu. Like other functions in the throttle code, just checking for a NULL stats_cpu prevents the following oops caused by that race. [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020 [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1] [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4 [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5 [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000 [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980 [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0) [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 42008884 XER: 20000000 [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0 GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000 GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000 GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808 GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000 GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80 GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850 [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180 [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180 [ 1137.734943] Call Trace: [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable) [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0 [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70 [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150 [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50 [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510 [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200 [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80 [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0 [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100 [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98 [ 1137.735383] Instruction dump: [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24 [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018 And here is one code that allows to easily reproduce this, although this has first been found by running docker. void run(pid_t pid) { int n; int status; int fd; char *buffer; buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE); n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid); fd = open(CGPATH "/test/tasks", O_WRONLY); write(fd, buffer, n); close(fd); if (fork() > 0) { fd = open("/dev/sda", O_RDONLY | O_DIRECT); read(fd, buffer, 512); close(fd); wait(&status); } else { fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY); n = read(fd, buffer, BUFFER_SIZE); close(fd); } free(buffer); exit(0); } void test(void) { int status; mkdir(CGPATH "/test", 0666); if (fork() > 0) wait(&status); else run(getpid()); rmdir(CGPATH "/test"); } int main(int argc, char **argv) { int i; for (i = 0; i < NR_TESTS; i++) test(); return 0; } Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-09cgroup: remove sane_behavior support on non-default hierarchiesTejun Heo
sane_behavior has been used as a development vehicle for the default unified hierarchy. Now that the default hierarchy is in place, the flag became redundant and confusing as its usage is allowed on all hierarchies. There are gonna be either the default hierarchy or legacy ones. Let's make that clear by removing sane_behavior support on non-default hierarchies. This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of cgroup_on_dfl() with sane_behavior specific part dropped. On the default and legacy hierarchies w/o sane_behavior, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>
2014-06-09Merge branch 'for-3.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...
2014-05-13cgroup: replace cftype->write_string() with cftype->write()Tejun Heo
Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>
2014-05-02block/blk-throttle.c: fix return of 0/1 with return type boolFabian Frederick
Fix 4 coccinelle warnings. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-21block/blk-throttle.c: add static to blk_throtl_dispatch_work_fnFabian Frederick
blk_throtl_dispatch_work_fn is only used in blk-throttle.c Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-19cgroup: drop const from @buffer of cftype->write_string()Tejun Heo
cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-11cgroup: update the meaning of cftype->max_write_lenTejun Heo
cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars. There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1. .max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with. This will also make transition to kernfs smoother. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-01-30Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
2013-12-05cgroup: replace cftype->read_seq_string() with cftype->seq_show()Tejun Heo
In preparation of conversion to kernfs, cgroup file handling is updated so that it can be easily mapped to kernfs. This patch replaces cftype->read_seq_string() with cftype->seq_show() which is not limited to single_open() operation and will map directcly to kernfs seq_file interface. The conversions are mechanical. As ->seq_show() doesn't have @css and @cft, the functions which make use of them are converted to use seq_css() and seq_cft() respectively. In several occassions, e.f. if it has seq_string in its name, the function name is updated to fit the new method better. This patch does not introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Neil Horman <nhorman@tuxdriver.com>
2013-11-23block: Abstract out bvec iteratorKent Overstreet
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-11-13block: Use u64_stats_init() to initialize seqcountsPeter Zijlstra
Now that seqcounts are lockdep enabled objects, we need to explicitly initialize runtime allocated seqcounts so that lockdep can track them. Without this patch, Fengguang was seeing: [ 4.127282] INFO: trying to register non-static key. [ 4.128027] the code is fine but needs lockdep annotation. [ 4.128027] turning off the locking correctness validator. [ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2 [ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ ... ] [ 4.128027] Call Trace: [ 4.128027] [<7908e744>] ? console_unlock+0x353/0x380 [ 4.128027] [<79dc7cf2>] dump_stack+0x48/0x60 [ 4.128027] [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb [ 4.128027] [<7908a1c5>] lock_acquire+0x71/0x9a [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153 [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<794079aa>] blk_throtl_bio+0x1c3/0x485 ... Use u64_stats_init() for all affected data structures, which initializes the seqcount. Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-08cgroup: make css_for_each_descendant() and friends include the origin css in ↵Tejun Heo
the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-08cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroupTejun Heo
cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk>
2013-08-08cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-05-14blk-throttle: implement proper hierarchy supportTejun Heo
With the recent updates, blk-throttle is finally ready for proper hierarchy support. Dispatching now honors service_queue->parent_sq and propagates correctly. The only thing missing is setting ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup hierarchy. This patch updates throtl_pd_init() such that service_queues form the same hierarchy as the cgroup hierarchy if sane_behavior is enabled. As this concludes proper hierarchy support for blkcg, the shameful .broken_hierarchy tag is removed from blkio_subsys. v2: Updated blkio-controller.txt as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Li Zefan <lizefan@huawei.com>
2013-05-14blk-throttle: implement throtl_grp->has_rules[]Tejun Heo
blk_throtl_bio() has a quick exit path for throtl_grps without limits configured. It looks at the bps and iops limits and if both are not configured, the bio is issued immediately. While this is correct in the current flat hierarchy as each throtl_grp behaves completely independently, it would become wrong in proper hierarchy mode. A group without any limits could still be limited by one of its ancestors and bio's queued for such group should not bypass blk-throtl. As having a quick bypass mechanism is beneficial, this patch reimplements the mechanism such that it's correct even with proper hierarchy. throtl_grp->has_rules[] is added. These booleans are updated for the whole subtree whenever a config is updated so that has_rules[] of the whole subtree stays synchronized. They're also updated when a new throtl_grp comes online so that it can't escape the limits of its ancestors. As no throtl_grp has another throtl_grp as parent now, this patch doesn't yet make any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: Account for child group's start time in parent while bio climbs upVivek Goyal
With the planned proper hierarchy support, a bio will climb up the tree before actually being dispatched. This makes sure bio is also subjected to parent's throttling limits, if any. It might happen that parent is idle and when bio is transferred to parent, a new slice starts fresh. But that is incorrect as parents wait time should have started when bio was queued in child group and causes IOs to be throttled more than configured as they climb the hierarchy. Given the fact that we have not written hierarchical algorithm in a way where child's and parents time slices are synchronized, we transfer the child's start time to parent if parent was idling. If parent was busy doing dispatch of other bios all this while, this is not an issue. Child's slice start time is passed to parent. Parent looks at its last expired slice start time. If child's start time is after parents old start time, that means parent had been idle and after parent went idle, child had an IO queued. So use child's start time as parent start time. If parent's start time is after child's start time, that means, when IO got queued in child group, parent was not idle. But later it dispatched some IO, its slice got trimmed and then it went idle. After a while child's request got shifted in parent group. In this case use parent's old start time as new start time as that's the duration of slice we did not use. This logic is far from perfect as if there are multiple childs then first child transferring the bio decides the start time while a bio might have queued up even earlier in other child, which is yet to be transferred up to parent. In that case we will lose time and bandwidth in parent. This patch is just an approximation to make situation somewhat better. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-05-14blk-throttle: add throtl_qnode for dispatch fairnessTejun Heo
With flat hierarchy, there's only single level of dispatching happening and fairness beyond that point is the responsibility of the rest of the block layer and driver, which usually works out okay; however, with the planned hierarchy support, service_queue->bio_lists[] can be filled up by bios from a single source. While the limits would still be honored, it'd be very easy to starve IOs from siblings or children. To avoid such starvation, this patch implements throtl_qnode and converts service_queue->bio_lists[] to lists of per-source qnodes which in turn contains the bio's. For example, when a bio is dispatched from a child group, the bio doesn't get queued on ->bio_lists[] directly but it first gets queued on the group's qnode which in turn gets queued on service_queue->queued[]. When dispatching for the upper level, the ->queued[] list is consumed in round-robing order so that the dispatch windows is consumed fairly by all IO sources. There are two ways a bio can come to a throtl_grp - directly queued to the group or dispatched from a child. For the former throtl_grp->qnode_on_self[rw] is used. For the latter, the child's ->qnode_on_parent[rw]. Note that this means that the child which is contributing a bio to its parent should stay pinned until all its bios are dispatched to its grand-parent. This patch moves blkg refcnting from bio add/remove spots to qnode activation/deactivation so that the blkg containing an active qnode is always pinned. As child pins the parent, this is sufficient for keeping the relevant sub-tree pinned while bios are in flight. The starvation issue was spotted by Vivek Goyal. v2: The original patch used the same throtl_grp->qnode_on_self/parent for reads and writes causing RWs to be queued incorrectly if there already are outstanding IOs in the other direction. They should be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs can use different qnodes. Spotted by Vivek Goyal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: make throtl_pending_timer_fn() ready for hierarchyTejun Heo
throtl_pending_timer_fn() currently assumes that the parent_sq is the top level one and the bio's dispatched are ready to be issued; however, this assumption will be wrong with proper hierarchy support. This patch makes the following changes to make throtl_pending_timer_fn() ready for hiearchy. * If the parent_sq isn't the top-level one, update the parent throtl_grp's dispatch time and schedule the next dispatch as necessary. If the parent's dispatch time is now, repeat the function for the parent throtl_grp. * If the parent_sq is the top-level one, kick issue work_item as before. * The debug message printed by throtl_log() now prints out the service_queue's nr_queued[] instead of the total nr_queued as the latter becomes uninteresting and misleading with hierarchical dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: make tg_dispatch_one_bio() ready for hierarchyTejun Heo
tg_dispatch_one_bio() currently assumes that the parent_sq is the top level one and the bio being dispatched is ready to be issued; however, this assumption will be wrong with proper hierarchy support. This patch makes the following changes to make tg_dispatch_on_bio() ready for hiearchy. * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to transfer a bio from a child tg to its parent. * tg_dispatch_one_bio() is updated to distinguish whether its parent is another throtl_grp or the throtl_data. If former, the bio is transferred to the parent throtl_grp using throtl_add_bio_tg(). If latter, the bio is ready to be issued and put on the top-level service_queue's bio_lists[] and throtl_data->nr_queued is decremented. As all throtl_grps currently have the top level service_queue as their ->parent_sq, this patch in itself doesn't make any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: make blk_throtl_bio() ready for hierarchyTejun Heo
Currently, blk_throtl_bio() issues the passed in bio directly if it's within limits of its associated tg (throtl_grp). This behavior becomes incorrect with hierarchy support as the bio should be accounted to and throttled by the ancestor throtl_grps too. This patch makes the direct issue path of blk_throtl_bio() to loop until it reaches the top-level service_queue or gets throttled. If the former, the bio can be issued directly; otherwise, it gets queued at the first layer it was above limits. As tg->parent_sq is always the top-level service queue currently, this patch in itself doesn't make any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: make blk_throtl_drain() ready for hierarchyTejun Heo
The current blk_throtl_drain() assumes that all active throtl_grps are queued on throtl_data->service_queue, which won't be true once hierarchy support is implemented. This patch makes blk_throtl_drain() perform post-order walk of the blkg hierarchy draining each associated throtl_grp, which guarantees that all bios will eventually be pushed to the top-level service_queue in throtl_data. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: dispatch from throtl_pending_timer_fn()Tejun Heo
Currently, blk_throtl_dispatch_work_fn() is responsible for both dispatching bio's from throtl_grp's according to their limits and then issuing the dispatched bios. This patch moves the dispatch part to throtl_pending_timer_fn() so that the work item is kicked iff there are bio's to issue. This is to avoid work item execution at each step when hierarchy support is enabled. bio's will be dispatched towards the top-level service_queue from the timers at each layer and the work item will only be used to issue the bio's which reached the top-level service_queue. While fetching bio's to issue from bio_lists[], blk_throtl_dispatch_work_fn() fetches all READs before WRITEs. While the original code also dispatched READs first, if multiple throtl_grps are dispatched on the same run, WRITEs from throtl_grp which is dispatched first would precede READs from throtl_grps which are dispatched later. While this is a behavior change, given that the previous code already prioritized READs and block layer generally prioritizes and segregates READs from WRITEs, this isn't likely to make any noticeable differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: implement dispatch loopingTejun Heo
throtl_select_dispatch() only dispatches throtl_quantum bios on each invocation. blk_throtl_dispatch_work_fn() in turn depends on throtl_schedule_next_dispatch() scheduling the next dispatch window immediately so that undue delays aren't incurred. This effectively chains multiple dispatch work item executions back-to-back when there are more than throtl_quantum bios to dispatch on a given tick. There is no reason to finish the current work item just to repeat it immediately. This patch makes throtl_schedule_next_dispatch() return %false without doing anything if the current dispatch window is still open and updates blk_throtl_dispatch_work_fn() repeat dispatching after cpu_relax() on %false return. This change will help implementing hierarchy support as dispatching will be done from pending_timer and immediate reschedule of timer function isn't supported and doesn't make much sense. While this patch changes how dispatch behaves when there are more than throtl_quantum bios to dispatch on a single tick, the behavior change is immaterial. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: separate out throtl_service_queue->pending_timer from ↵Tejun Heo
throtl_data->dispatch_work Currently, throtl_data->dispatch_work is a delayed_work item which handles both delayed dispatch and issuing bios. The two tasks will be separated to support proper hierarchy. To prepare for that, this patch separates out the timer into throtl_service_queue->pending_timer from throtl_data->dispatch_work and make the latter a work_struct. * As the timer is now per-service_queue, it's initialized and del_sync'd as its corresponding service_queue is created and destroyed. The timer, when triggered, simply schedules throtl_data->dispathc_work for execution. * throtl_schedule_delayed_work() is renamed to throtl_schedule_pending_timer() and takes @sq and @expires now. * Simiarly, throtl_schedule_next_dispatch() now takes @sq, which should be the parent_sq of the service_queue which just got a new bio or updated. As the parent_sq is always the top-level service_queue now, this doesn't change anything at this point. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats ↵Tejun Heo
update with it With proper hierarchy support, a bio can be dispatched multiple times until it reaches the top-level service_queue and we don't want to update dispatch stats at each step. They are local stats and will be kept local. If recursive stats are necessary, they should be implemented separately and definitely not by updating counters recursively on each dispatch. This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gate stats update with it so that dispatch stats are updated only on the first time the bio is charged to a throtl_grp, which will always be the throtl_grp the bio was originally queued to. This means that REQ_THROTTLED would be set even for bios which don't get throttled. As we don't want bios to leave blk-throtl with the flag set, move REQ_THROTLLED clearing to the end of blk_throtl_bio() and clear if the bio is being issued directly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()Tejun Heo
Now that both throtl_data and throtl_grp embed throtl_service_queue, we can unify throtl_log() and throtl_log_tg(). * sq_to_tg() is added. This returns the throtl_grp a service_queue is embedded in. If the service_queue is the top-level one embedded in throtl_data, NULL is returned. * sq_to_td() is added. A service_queue is always associated with a throtl_data. This function finds the associated td and returns it. * throtl_log() is updated to take throtl_service_queue instead of throtl_data. If the service_queue is one embedded in throtl_grp, it prints the same header as throtl_log_tg() did. If it's one embedded in throtl_data, it behaves the same as before. This renders throtl_log_tg() unnecessary. Removed. This change is necessary for hierarchy support as we're gonna be using the same code paths to dispatch bios to intermediate service_queues embedded in throtl_grps and the top-level service_queue embedded in throtl_data. This patch doesn't make any behavior changes. v2: throtl_log() didn't print a space after blkg path. Updated so that it prints a space after throtl_grp path. Spotted by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: add throtl_service_queue->parent_sqTejun Heo
To prepare for hierarchy support, this patch adds throtl_service_queue->service_sq which points to the arent service_queue. Currently, for all service_queues embedded in throtl_grps, it points to throtl_data->service_queue. As throtl_data->service_queue doesn't have a parent its parent_sq is set to NULL. There are a number of functions which take both throtl_grp *tg and throtl_service_queue *parent_sq. With this patch, the parent service_queue can be determined from @tg and the @parent_sq arguments are removed. This patch doesn't make any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: generalize update_disptime optimization in blk_throtl_bio()Tejun Heo
When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it avoids invoking tg_update_disptime() and throtl_schedule_next_dispatch() if the tg already has bios queued in that direction. As a new bio is appeneded after the existing ones, it can't change the tg's next dispatch time or the parent's dispatch schedule. This optimization is currently open coded in blk_throtl_bio(). Whether the target biolist was occupied was recorded in a local variable and later used to skip disptime update. This patch moves generalizes it so that throtl_add_bio_tg() sets a new flag THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was added. tg_update_disptime() clears the flag automatically. blk_throtl_bio() is updated to simply test the flag before updating disptime. This patch doesn't make any functional differences now but will enable using the same optimization for recursive dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]Tejun Heo
throtl_service_queues will eventually form a tree which is anchored at throtl_data->service_queue and queue bios will climb the tree to the top service_queue to be executed. This patch makes the dispatch paths in blk_throtl_dispatch_work_fn() and blk_throtl_drain() to dispatch bios to throtl_data->service_queue.bio_lists[] instead of the on-stack bio_lists. This will keep the final dispatch to the top level service_queue share the same mechanism as dispatches through the rest of the hierarchy. As bio's should be issued in a sleepable context, blk_throtl_dispatch_work_fn() transfers all dispatched bio's from the service_queue bio_lists[] into an onstack one before dropping queue_lock and issuing the bio's. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: move bio_lists[] and friends to throtl_service_queueTejun Heo
throtl_service_queues will eventually form a tree which is anchored at throtl_data->service_queue and queue bios will climb the tree to the top service_queue to be executed. This patch moves bio_lists[] and nr_queued[] from throtl_grp to its service_queue to prepare for that. As currently only the throtl_data->service_queue is in use, this patch just ends up moving throtl_grp->bio_lists[] and ->nr_queued[] to throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: add throtl_grp->service_queueTejun Heo
Currently, there's single service_queue per queue - throtl_data->service_queue. All active throtl_grp's are queued on the queue and dispatched according to their limits. To support hierarchy, this will be expanded such that active throtl_grp's form a tree anchored at throtl_data->service_queue and chained through each intermediate throtl_grp's service_queue. This patch adds throtl_grp->service_queue to prepare for hierarchy support. The initialization function - throtl_service_queue_init() - is added and replaces the macro initializer. The newly added tg->service_queue isn't used yet. Following patches will do. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: reorganize throtl_service_queue passed around as argumentTejun Heo
throtl_service_queue will be the building block of hierarchy support and will form a tree. This patch updates its usages as arguments to reduce confusion. * When a service queue is used as the parent role - the host of the rbtree - use @parent_sq instead of @sq. * For functions taking both @tg and @parent_sq, reorder them so that the order is (@tg, @parent_sq) not the other way around. This makes the code follow the usual convention of specifying the primary target of the operation as the first argument. This patch doesn't make any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: pass around throtl_service_queue instead of throtl_dataTejun Heo
throtl_service_queue will be used as the basic block to implement hierarchy support. Pass around throtl_service_queue *sq instead of throtl_data *td in the following functions which will be used across multiple levels of hierarchy. * [__]throtl_enqueue/dequeue_tg() * throtl_add_bio_tg() * tg_update_disptime() * throtl_select_dispatch() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: add backlink pointer from throtl_grp to throtl_dataTejun Heo
Add throtl_grp->td so that the td (throtl_data) a given tg (throtl_grp) belongs to can be determined, and remove @td argument from functions which take both @td and @tg as the former now can be determined from the latter. This generally simplifies the code and removes a number of cases where @td is passed as an argument without being actually used. This will also help hierarchy support implementation. While at it, in multi-line conditions, move the logical operators leading broken lines to the end of the previous line. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: simplify throtl_grp flag handlingTejun Heo
blk-throttle is still using function-defining macros to define flag handling functions, which went out style at least a decade ago. Just define the flag as bitmask and use direct bit operations. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: rename throtl_rb_root to throtl_service_queueTejun Heo
throtl_rb_root will be expanded to cover more roles for hierarchy support. Rename it to throtl_service_queue and make its fields more descriptive. * rb -> pending_tree * left -> first_pending * count -> nr_pending * min_disptime -> first_pending_disptime This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: remove pointless throtl_nr_queued() optimizationsTejun Heo
throtl_nr_queued() is used in several places to avoid performing certain operations when the throtl_data is empty. This usually is useless as those paths usually aren't traveled if there's no bio queued. * throtl_schedule_delayed_work() skips scheduling dispatch work item if @td doesn't have any bios queued; however, the only case it can be called when @td is empty is from tg_set_conf() which isn't something we should be optimizing for. * throtl_schedule_next_dispatch() takes a quick exit if @td is empty; however, right after that it triggers BUG if the service tree is empty. The two conditions are equivalent and it can just test @st->count for the quick exit. * blk_throtl_dispatch_work_fn() skips dispatch if @td is empty. This work function isn't usually invoked when @td is empty. The only possibility is from tg_set_conf() and when it happens the normal dispatching path can handle empty @td fine. No need to add special skip path. This patch removes the above three unnecessary optimizations, which leave throtl_log() call in blk_throtl_dispatch_work_fn() the only user of throtl_nr_queued(). Remove throtl_nr_queued() and open code it in throtl_log(). I don't think we need td->nr_queued[] at all. Maybe we can remove it later. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: relocate throtl_schedule_delayed_work()Tejun Heo
Move throtl_schedule_delayed_work() above its first user so that the forward declaration can be removed. This patch is pure relocaiton. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: collapse throtl_dispatch() into the work functionTejun Heo
blk-throttle is about to go through major restructuring to support hierarchy. Do cosmetic updates in preparation. * s/throtl_data->throtl_work/throtl_data->dispatch_work/ * s/blk_throtl_work()/blk_throtl_dispatch_work_fn()/ * Collapse throtl_dispatch() into blk_throtl_dispatch_work_fn() This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: remove deferred config application mechanismTejun Heo
When bps or iops configuration changes, blk-throttle records the new configuration and sets a flag indicating that the config has changed. The flag is checked in the bio dispatch path and applied. This deferred config application was necessary due to limitations in blkcg framework, which haven't existed for quite a while now. This patch removes the deferred config application mechanism and applies new configurations directly from tg_set_conf(), which is simpler. v2: Dropped unnecessary throtl_schedule_delayed_work() call from tg_set_conf() as suggested by Vivek Goyal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-05-14blk-throttle: remove spurious throtl_enqueue_tg() call from ↵Tejun Heo
throtl_select_dispatch() throtl_select_dispatch() calls throtl_enqueue_tg() right after tg_update_disptime(), which always calls the function anyway. The call is, while harmless, unnecessary. Remove it. This patch doesn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2012-12-06block: Rename queue dead flagBart Van Assche
QUEUE_FLAG_DEAD is used to indicate that queuing new requests must stop. After this flag has been set queue draining starts. However, during the queue draining phase it is still safe to invoke the queue's request_fn, so QUEUE_FLAG_DYING is a better name for this flag. This patch has been generated by running the following command over the kernel source tree: git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' | xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \ -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \ sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \ include/linux/blkdev.h; \ sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \ -e 's/Dead queue/A dying queue/' block/blk-core.c Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Chanho Min <chanho.min@lge.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-21workqueue: use mod_delayed_work() instead of __cancel + queueTejun Heo
Now that mod_delayed_work() is safe to call from IRQ handlers, __cancel_delayed_work() followed by queue_delayed_work() can be replaced with mod_delayed_work(). Most conversions are straight-forward except for the following. * net/core/link_watch.c: linkwatch_schedule_work() was doing a quite elaborate dancing around its delayed_work. Collapse it such that linkwatch_work is queued for immediate execution if LW_URGENT and existing timer is kept otherwise. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-20workqueue: deprecate system_nrt[_freezable]_wqTejun Heo
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and convert all users to system[_freezable]_wq. If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant, so there's no reason to use system_nrt[_freezable]_wq. Please use system[_freezable]_wq instead. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com>
2012-06-25block: allocate io_context upfrontTejun Heo
Block layer very lazy allocation of ioc. It waits until the moment ioc is absolutely necessary; unfortunately, that time could be inside queue lock and __get_request() performs unlock - try alloc - retry dancing. Just allocate it up-front on entry to block layer. We're not saving the rain forest by deferring it to the last possible moment and complicating things unnecessarily. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-23blkcg: tg_stats_alloc_lock is an irq lockTejun Heo
tg_stats_alloc_lock nests inside queue lock and should always be held with irq disabled. throtl_pd_{init|exit}() were using non-irqsafe spinlock ops which triggered inverse lock ordering via irq warning via RCU freeing of blkg invoking throtl_pd_exit() w/o disabling IRQ. Update both functions to use irq safe operations. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> LKML-Reference: <1335339396.16988.80.camel@lappy> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-01Merge tag 'v3.4-rc5' into for-3.5/coreJens Axboe
The core branch is behind driver commits that we want to build on for 3.5, hence I'm pulling in a later -rc. Linux 3.4-rc5 Conflicts: Documentation/feature-removal-schedule.txt Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-20blkcg: collapse blkcg_policy_ops into blkcg_policyTejun Heo
There's no reason to keep blkcg_policy_ops separate. Collapse it into blkcg_policy. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>