2015-04-22modpost: expand pattern matching to support substring matchesPaul Gortmaker
Currently the match() function supports a leading * to match any prefix and a trailing * to match any suffix. However, there is currently no combination of the two that can be used to match whole families of functions that share a common substring. Here we expand the *foo and foo* matches to also support *foo*, with the goal of targeting compiler-generated symbol names that contain strings like ".constprop." and ".isra." Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
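A minimal userspace sketch of the extended matcher, for illustration only (the helper name and the fixed-size buffer are assumptions; modpost's real match() is structured differently):

    #include <string.h>

    /*
     * Sketch of the extended pattern matching:
     *   "foo"   exact match
     *   "foo*"  prefix match
     *   "*foo"  suffix match
     *   "*foo*" substring match (the new case)
     */
    static int pattern_match(const char *sym, const char *pat)
    {
        size_t slen = strlen(sym);
        size_t plen = strlen(pat);

        if (plen >= 2 && pat[0] == '*' && pat[plen - 1] == '*') {
            char core[128];

            if (plen - 2 >= sizeof(core))
                return 0;
            memcpy(core, pat + 1, plen - 2);
            core[plen - 2] = '\0';
            return strstr(sym, core) != NULL;          /* *foo* */
        }
        if (pat[0] == '*')
            return slen >= plen - 1 &&
                   !strcmp(sym + slen - (plen - 1), pat + 1); /* *foo */
        if (plen && pat[plen - 1] == '*')
            return !strncmp(sym, pat, plen - 1);       /* foo* */
        return !strcmp(sym, pat);                      /* foo */
    }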
2015-04-22modpost: do not try to match the SHT_NUL section.Quentin Casasnovas
Trying to match the SHT_NUL section isn't useful, and it has caused build failures on parisc and mn10300 since the addition of strict section white-listing and __ex_table sanitizing. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: 050e57fd5936 ("modpost: add strict white-listing when referencing....") Fixes: 52dc0595d540 ("modpost: handle relocations mismatch in __ex_table.") Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: fix extable entry size calculation.Quentin Casasnovas
As Guenter pointed out, we were never really calculating the extable entry size because the pointer arithmetic was simply wrong. We want to check we're handling the second relocation in __ex_table to infer an entry size, but we were using (void*) pointers instead of Elf_Rel[a]* ones. This fixes the problem by moving that check in the caller (since we can deal with different types of relocations) and add is_second_extable_reloc() to make the whole thing more readable. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
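To illustrate the bug class (hypothetical helpers, not modpost's exact code): with void * operands, + 1 advances one byte (a GNU extension), while with a typed relocation pointer it advances one whole record:

    #include <elf.h>
    #include <stdbool.h>

    /* wrong: void * arithmetic steps one *byte* at a time, so this
     * never identifies the second relocation entry */
    static bool is_second_reloc_bytes(void *first, void *cur)
    {
        return cur == first + 1;
    }

    /* right: typed arithmetic steps sizeof(Elf64_Rela) bytes */
    static bool is_second_reloc_entries(Elf64_Rela *first, Elf64_Rela *cur)
    {
        return cur == first + 1;
    }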
2015-04-22modpost: fix inverted logic in is_extable_fault_address().Quentin Casasnovas
As Guenter pointed out, we want to assert that extable_entry_size has been discovered, not the other way around. Moreover, this sanity check is only valid when we're not dealing with the first relocation in __ex_table, since we have not discovered the extable entry size at that point. This was leading to a divide-by-zero on some architectures and making the build fail. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: handle -ffunction-sectionsRusty Russell
52dc0595d540 introduced OTHER_TEXT_SECTIONS for identifying what sections could validly have __ex_table entries. Unfortunately, it wasn't tested with -ffunction-sections, which some architectures use. Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: Whitelist .text.fixup and .exception.textThierry Reding
32-bit and 64-bit ARM use these sections to store executable code, so they must be whitelisted in modpost's table of valid text sections. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22dmaengine: dw: don't prompt for DW_DMAC_COREVinod Koul
DW_DMAC_CORE is selected by the PCI or platform driver, so this symbol shouldn't be user-selectable; remove the prompt. Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2015-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds
Pull sparc fixes from David Miller:

 1) ldc_alloc_exp_dring() can be called from softints, so use GFP_ATOMIC. From Sowmini Varadhan.

 2) Some minor warning/build fixups for the new iommu-common code on certain archs and with certain debug options enabled. Also from Sowmini Varadhan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: Use GFP_ATOMIC in ldc_alloc_exp_dring() as it can be called in softirq context
  sparc64: Use M7 PMC write on all chips T4 and onward.
  iommu-common: rename iommu_pool_hash to iommu_hash_common
  iommu-common: fix x86_64 compiler warnings
2015-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller:
 "Just a few fixes trickling in at this point.

 1) If we see an attached socket on an skb in the ipv4 forwarding path, bail. This can happen due to races with FIB rule addition and deletion, and we should just drop such frames. From Sebastian Pöhn.

 2) pppoe receive should only accept packets destined for this host's MAC address. From Joakim Tjernlund.

 3) Handle checksum unwrapping properly in ppp receive when it's encapsulated in UDP in some way; fix from Tom Herbert.

 4) Fix some bugs in the mv88e6xxx DSA driver resulting from the conversion from register offset constants to mnemonic macros. From Vivien Didelot.

 5) Fix handling of HCA max message size in mlx4 adapters, from Eran Ben Elisha"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net/mlx4_core: Fix reading HCA max message size in mlx4_QUERY_DEV_CAP
  tcp: add memory barriers to write space paths
  altera tse: Error-Bit on tx-avalon-stream always set.
  net: dsa: mv88e6xxx: use PORT_DEFAULT_VLAN
  net: dsa: mv88e6xxx: fix setup of port control 1
  ppp: call skb_checksum_complete_unset in ppp_receive_frame
  net: add skb_checksum_complete_unset
  pppoe: Lacks DST MAC address check
  ip_forward: Drop frames with attached skb->sk
2015-04-21Merge tag 'omap-for-v4.1/prcm-dts-mfd-syscon-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into next/lateOlof Johansson
Merge "urgent omap boot fix for v4.1 if MFD_SYSCON is not set" from Tony Lindgren:

Urgent pull request for v4.1 to fix booting for custom kernel .config files that do not have MFD_SYSCON set. Omaps now have a dependency on MFD_SYSCON for the system control module generic register area and some clocks, with the changes done in the omap-for-v4.1/prcm-dts branch. This can be pulled on top of omap-for-v4.1/prcm-dts, or into fixes for v4.1. We already have a slight MFD_SYSCON dependency via REGULATOR_PBIAS for dual-voltage MMC cards on the first MMC bus for many devices, so from that point of view this can also be merged separately from omap-for-v4.1/prcm-dts.

* tag 'omap-for-v4.1/prcm-dts-mfd-syscon-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP2+: Fix booting with configs that don't have MFD_SYSCON

Signed-off-by: Olof Johansson <olof@lixom.net>
2015-04-22ACPI / EC: fix NULL pointer dereference in acpi_ec_remove_query_handler()Chris Bainbridge
Use list_for_each_entry_safe() for the iteration, because the handler may be freed inside the loop.

 BUG: unable to handle kernel NULL pointer dereference at 000000000000002c
 IP: [<ffffffff814d69c8>] acpi_ec_put_query_handler+0x7/0x1a
 Call Trace:
  acpi_ec_remove_query_handler+0x87/0x97
  acpi_smbus_hc_remove+0x2a/0x44 [sbshc]
  acpi_device_remove+0x7b/0x9a
  __device_release_driver+0x7e/0x110
  driver_detach+0xb0/0xc0
  bus_remove_driver+0x54/0xe0
  driver_unregister+0x2b/0x60
  acpi_bus_unregister_driver+0x10/0x12
  acpi_smb_hc_driver_exit+0x10/0x12 [sbshc]
  SyS_delete_module+0x1b8/0x210
  system_call_fastpath+0x12/0x6a

Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
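The fix pattern, as a hedged sketch with assumed structure and function names (the _safe iterator caches the next node, so the current entry may be freed inside the loop body):

    #include <linux/list.h>
    #include <linux/slab.h>

    struct query_handler {
        struct list_head node;
        u8 query_bit;
    };

    static void remove_query_handler(struct list_head *handlers, u8 bit)
    {
        struct query_handler *handler, *tmp;

        /* _safe variant: 'tmp' holds the next entry, so freeing
         * 'handler' inside the loop cannot corrupt the walk */
        list_for_each_entry_safe(handler, tmp, handlers, node) {
            if (handler->query_bit == bit) {
                list_del(&handler->node);
                kfree(handler);
            }
        }
    }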
2015-04-21Merge branch 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linuxLinus Torvalds
Pull parisc fixes from Helge Deller:

 "The patch by Guenter Roeck fixes the build on parisc which got broken because of commit f24ffde43237 ("parisc: expose number of page table levels on Kconfig level") and the patch from Matthew Wilcox converts our code to use the generic scatterlist.h header file"

* 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Replace PT_NLEVELS with CONFIG_PGTABLE_LEVELS
  parisc: Eliminate sg_virt_addr() and private scatterlist.h
2015-04-22md/raid5: don't do chunk aligned read on degraded array.Eric Mei
When the array is degraded, a read of data landing on a failed drive results in reading the rest of the data in the stripe, so a single sequential read reads the same data twice. This patch avoids chunk-aligned reads for degraded arrays. The downside is that it involves the stripe cache, which means associated CPU overhead and an extra memory copy.

Test Results: the following tests were done on an enterprise storage node with Seagate 6T SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9GHz), 10-disk MD RAID6 8+2, chunk size 128 KiB. I used FIO with direct-io, various bs sizes and enough queue depth, and tested sequential and 100% random read against 3 array configs: 1) optimal, as baseline; 2) degraded; 3) degraded with this patch. Kernel version is 4.0-rc3. Each individual test was only run once, so there might be some variation, but we just focus on the big trend.

Sequential Read:

  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch(MiB/s)
      1024            1608              656                          995
       512            1624              710                          956
       256            1635              728                          980
       128            1636              771                          983
        64            1612             1119                         1000
        32            1580             1420                         1004
        16            1368              688                          986
         8             768              647                          953
         4             411              413                          850

Random Read:

  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch(IOPS)
      1024            163             160                        156
       512            274             273                        272
       256            426             428                        424
       128            576             592                        591
        64            726             724                        726
        32            849             848                        837
        16            900             970                        971
         8            927             940                        929
         4            948             940                        955

Some notes:
 * In sequential + optimal, as bs gets smaller, the FIO thread becomes CPU bound.
 * In sequential + degraded, there's a big increase when bs is 64K and 32K, for which I don't have an explanation.
 * In sequential + degraded-with-patch, the MD thread mostly becomes CPU bound.

If you want, we can discuss specific data points in those tables. But in general it seems that with this patch we have more predictable, and in most cases significantly better, sequential read performance when the array is degraded, and almost no noticeable impact on random read. Performance is a complicated thing; the patch works well for this particular configuration, but may not be universal. For example, I imagine testing on an all-SSD array may give very different results. But I personally think that in most cases IO bandwidth is a more scarce resource than CPU.

Signed-off-by: Eric Mei <eric.mei@seagate.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: allow the stripe_cache to grow and shrink.NeilBrown
The default setting of 256 stripe_heads is probably much too small for many configurations, so it is best to make it auto-configure.

Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache, as the cost is greater than just having to read more data: it reduces parallelism.

Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset the memory balance, as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work, so we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size, until memory pressure puts it back again. It could take hours to reach a steady state.

The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: change ->inactive_blocked to a bit-flag.NeilBrown
This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripeNeilBrown
Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe() succeeds, do it inside the functions. Also choose the correct hash to handle next inside the functions. This removes duplication and will help with future new uses of {grow,drop}_one_stripe. This also fixes a minor bug where the "md/raid:%md: allocate XXkB" message always said "0kB". Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: pass gfp_t arg to grow_one_stripe()NeilBrown
This is needed for future improvement to stripe cache management. Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: introduce configuration option rmw_levelMarkus Stockhausen
Depending on the available coding, we allow optimized rmw logic for write operations. To support easier testing, this patch allows manual control of the rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level.

The configuration can handle three levels of control:

rmw_level=0: Disable rmw for all RAID types. Hardware-assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be beneficial for slow CPUs with hardware syndrome support and fast SSDs.

rmw_level=1: Estimate rmw IOs and rcw IOs, and execute rmw only if it will save IOs. This equals the "old" unpatched behaviour and will be the default.

rmw_level=2: Execute rmw even if the calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice, but it can be beneficial otherwise, e.g. RAID4 with a fast dedicated parity disk/SSD.

The option is implemented just to be forward-looking and will ONLY work with this patch!

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: activate raid6 rmw featureMarkus Stockhausen
Glue it all together. The raid6 rmw path should work the same as the already existing raid5 logic, so emulate the prexor handling/flags and split functions as needed:

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run, as we did before for the single parity.

3) Take care of the rmw run in ops_run_reconstruct6(). Again, process only the changed pages to get the syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & stop pages from that and call xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200rpm desktop/server disks. 300 seconds of random write with 8 threads onto a 3.2TB (10*400GB) RAID6 with 64K chunk and no spare (group_thread_cnt=4):

  bsize   rmw_level=1  rmw_level=0  rmw_level=1  rmw_level=0
          skip_copy=1  skip_copy=1  skip_copy=0  skip_copy=0
    4K      115 KB/s     141 KB/s     165 KB/s     140 KB/s
    8K      225 KB/s     275 KB/s     324 KB/s     274 KB/s
   16K      434 KB/s     536 KB/s     640 KB/s     534 KB/s
   32K      751 KB/s   1,051 KB/s   1,234 KB/s   1,045 KB/s
   64K    1,339 KB/s   1,958 KB/s   2,282 KB/s   1,962 KB/s
  128K    2,673 KB/s   3,862 KB/s   4,113 KB/s   3,898 KB/s
  256K    7,685 KB/s   7,539 KB/s   7,557 KB/s   7,638 KB/s
  512K   19,556 KB/s  19,558 KB/s  19,652 KB/s  19,688 KB/s

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: xor_syndrome() for SSE2Markus Stockhausen
The second (and last) optimized XOR syndrome calculation. This version supports right and left side optimization. All CPUs with an architecture older than Haswell will benefit from it. It should be noted that the SSE2 movntdq instruction kills performance for memory areas that are read and written simultaneously in chunks smaller than the cache line size, so use movdqa instead for the P/Q writes in the sse21 and sse22 XOR functions. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: xor_syndrome() for generic intMarkus Stockhausen
Start the algorithms with the very basic one. It is left and right optimized: we can avoid all calculations for unneeded pages above the right stop offset, and for pages below the left start offset we still need the syndrome multiplication, but without reading the data pages. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: improve test programMarkus Stockhausen
It is always helpful to have a test tool in place when implementing new data-critical algorithms. So add some test routines to the raid6 checker that can prove that the new xor_syndrome() works as expected. Run through all permutations of start/stop pages per algorithm and simulate a xor_syndrome()-assisted rmw run. After each rmw, check whether the recovery algorithm still confirms that the stripe is fine. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: delta syndrome functionsMarkus Stockhausen
v3: s-o-b comment, explanation of performance and decision for the start/stop implementation

Implementing rmw functionality for RAID6 requires optimized syndrome calculation. Up to now we can only generate a complete syndrome; the target P/Q pages are always overwritten. With this patch we provide a framework for in-place P/Q modification. In the first place, simply fill those functions with NULL values. xor_syndrome() has two additional parameters: start & stop. These indicate the first and last page that are changing during a rmw run. That makes it possible to avoid several unnecessary loops and speed up calculation.

The caller needs to implement the following logic to make the functions work (a hedged sketch of this protocol follows below):

1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source blocks inside P/Q between (and including) start and stop.

2) Modify any block with start <= block <= stop.

3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of source blocks into P/Q between (and including) start and stop.

Pages between start and stop that won't be changed should be filled with a pointer to the kernel zero page. The reasons for not taking NULL pages are:

1) The algorithms cross the whole source data line by line, thus avoiding additional branches.

2) Having a NULL page avoids calculating the XOR P parity but still needs calculation steps for the Q parity. Depending on the algorithm unrolling, that might be only a difference of 2 instructions per loop.

The benchmark numbers of the gen_syndrome() functions are displayed in the kernel log. Do the same for the xor_syndrome() functions. This will help to analyze performance problems and give a rough estimate of how well the algorithm works. The choice of the fastest algorithm will still depend on the gen_syndrome() performance.

With the start/stop page implementation the speed can vary a lot in real life. E.g. a change of page 0 & page 15 on a stripe will be harder to compute than the case where page 0 & page 1 are the XOR candidates. To not be too enthusiastic about the expected speeds, we run a worst-case test that simulates a change on the upper half of the stripe. So we do:

1) calculation of P/Q for the upper pages

2) continuation of Q for the lower (empty) pages

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
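Here is the promised sketch of the caller-side protocol. The write_new_data() helper and the exact parameter layout are assumptions for illustration, not the kernel's prototypes:

    /*
     * rmw update of pages [start, stop] of one stripe; ptrs[] holds the
     * per-disk pointers, with unchanged pages inside the range pointing
     * to the kernel zero page (never NULL).
     */
    static void rmw_update(int disks, int start, int stop,
                           size_t bytes, void **ptrs)
    {
        /* 1) "remove" old data of blocks start..stop from P/Q */
        xor_syndrome(disks, start, stop, bytes, ptrs);

        /* 2) modify any block with start <= block <= stop */
        write_new_data(ptrs, start, stop, bytes);    /* hypothetical helper */

        /* 3) "reinsert" the new data into P/Q */
        xor_syndrome(disks, start, stop, bytes, ptrs);
    }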
2015-04-22raid5: handle expansion/resync case with stripe batchingshli@kernel.org
expansion/resync can grab a stripe while the stripe is in a batch list. Since all stripes in a batch list must be in the same state, we can't allow some stripes to run into expansion/resync. So we delay expansion/resync for stripes in a batch list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22raid5: handle io error of batch listshli@kernel.org
If an IO error happens in any stripe of a batch list, the batch list will be split, and normal processing will then run for the stripes in the list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22RAID5: batch adjacent full stripe writeshli@kernel.org
The stripe cache is 4k in size, so even adjacent full stripe writes are handled in 4k units. Ideally we should use a bigger size for adjacent full stripe writes: a bigger stripe cache size means fewer stripes running in the state machine, which reduces CPU overhead, and it also leads to bigger IO sizes dispatched to the underlying disks.

With this patch, we automatically batch adjacent full stripe writes together. Such stripes are added to a batch list. Only the first stripe of the list is put on the handle_list and so runs handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer stripes running in handle_stripe() and we send the IO of the whole stripe list together, to increase IO size.

Stripes added to a batch list have some limitations: a batch list can only include full stripe writes and can't cross a chunk boundary, to make sure the stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write in add_stripe_bio will be blocked on overlap conflict till the batch list is handled. These limitations ensure that stripes in a batch list stay in exactly the same state over their life cycle.

I tested 160k random writes in a RAID5 array with 32k chunk size and 6 PCIe SSDs. This patch improves performance by around 30%, and the IO size to the underlying disks is exactly 32k. I also ran a 4k random write test on the same array to make sure performance isn't changed by the patch.

Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22raid5: track overwrite disk countshli@kernel.org
Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22raid5: add a new flag to track if a stripe can be batchedshli@kernel.org
A freshly created stripe with a write request can be batched. Any time the stripe is handled or a new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22raid5: use flex_array for scribble datashli@kernel.org
Use a flex_array for the scribble data. The next patch will batch several stripes together, so the scribble data should be able to cover several stripes; this patch therefore allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
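A hedged sketch of the flex_array allocation pattern (the helper name and error handling are assumed; the flex_array API itself is from <linux/flex_array.h>). A flex_array avoids requiring one large physically contiguous allocation once the scribble buffers grow to cover a whole chunk:

    #include <linux/flex_array.h>

    static struct flex_array *alloc_scribble(int elem_size, unsigned int nr)
    {
        struct flex_array *fa;

        fa = flex_array_alloc(elem_size, nr, GFP_KERNEL);
        if (!fa)
            return NULL;
        /* back every element with pages now, so later
         * flex_array_get() lookups cannot fail */
        if (flex_array_prealloc(fa, 0, nr, GFP_KERNEL)) {
            flex_array_free(fa);
            return NULL;
        }
        return fa;
    }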
2015-04-22md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raidHeinz Mauelshagen
The patch makes 3 references to mddev->queue in the raid0 personality conditional, in order to allow it to be accessed from dm-raid. This is mandatory, because md instances underneath dm-raid don't manage a request queue of their own, which would lead to oopses without the patch. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
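As a hedged sketch of the pattern (the specific call sites are assumed, not the patch's exact hunks), each queue tweak is simply guarded:

    /* mddev->queue is NULL when raid0 runs underneath dm-raid,
     * which manages the request queue itself */
    if (mddev->queue) {
        blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
        blk_queue_io_opt(mddev->queue,
                         (mddev->chunk_sectors << 9) * mddev->raid_disks);
    }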
2015-04-22md: allow resync to go faster when there is competing IO.NeilBrown
When md notices non-sync IO happening while it is trying to resync (or reshape or recover), it slows down to the set minimum. The default minimum might have made sense many years ago, but drives have become faster, and changing the default to match the times isn't really a long-term solution. This patch changes the code so that instead of waiting until the speed has dropped to the target, it just waits until pending requests have completed. This means that the delay inserted is a function of the speed of the devices.

Testing shows that:
 - for some loads, the resync speed is unchanged. For those loads, increasing the minimum doesn't change the speed either, so this is a good result. To increase resync speed under such loads we would probably need to increase the resync window size.
 - for other loads, resync speed does increase to a reasonable fraction (e.g. 20%) of the maximum possible, and throughput of the load only drops a little bit (e.g. 10%).
 - for other loads, throughput of the non-sync load drops quite a bit more. These seem to be latency-sensitive loads.

So it isn't a perfect solution, but it is mostly an improvement.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md: remove 'go_faster' option from ->sync_request()NeilBrown
This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md: don't require sync_min to be a multiple of chunk_size.NeilBrown
There is really no need for sync_min to be a multiple of chunk_size, and values read from it often aren't. That means you cannot read a value and expect to be able to write it back later. So remove the chunk_size check, and round down to a multiple of 4K to be sure everything works with 4K-sector devices. Signed-off-by: NeilBrown <neilb@suse.de>
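As a hedged sketch of the rounding (units assumed): md sysfs offsets like this are kept in 512-byte sectors, so a 4K boundary is 8 sectors, and round_down() handles the power-of-two alignment:

    #include <linux/kernel.h>

    /* 8 sectors * 512 bytes = one 4KiB block; round_down() requires
     * a power-of-two alignment, which 8 is */
    sector_t min = round_down(user_value, 8);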
2015-04-22Merge branch 'cluster' into for-nextNeilBrown
2015-04-22md-cluster: re-add capabilitiesGoldwyn Rodrigues
When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md: re-add a failed diskGoldwyn Rodrigues
This adds the capability of re-adding a failed disk by writing "re-add" to /sys/block/mdXX/md/dev-YYY/state. This facilitates adding disks which have encountered a temporary error, such as a network disconnection/hiccup in an iSCSI device, or a SAN cable disconnection which has been restored. In such a situation, you do not need to remove and re-add the device. Writing "re-add" to the failed device's state adds it back to the array and performs a recovery of only the blocks which were written after the device failed. This works for generic md and is not related to clustering; however, this patch eases the re-add operations listed above in clustered environments. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md-cluster: remove capabilitiesGoldwyn Rodrigues
This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md: Export and rename find_rdev_nr_rcuGoldwyn Rodrigues
This is required by the clustering module (patches to follow) to find the device to remove or re-add. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md: Export and rename kick_rdev_from_arrayGoldwyn Rodrigues
This export is required by the clustering module in order to coordinate removing/re-adding an rdev across all nodes. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md-cluster: correct the num for comparisonGuoqing Jiang
Since md-cluster node numbers start from zero, while cinfo->slot_number represents the dlm slot number, there is no need to check for equality. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-21net/mlx4_core: Fix reading HCA max message size in mlx4_QUERY_DEV_CAPEran Ben Elisha
Currently we parse max_msg_sz from the wrong offset in QUERY_DEV_CAP; fix it to use the right offset. Fixes: 0b131561a7d6 ('net/mlx4_en: Add Flow control statistics [..]') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-21ia64/PCI: Treat all host bridge Address Space Descriptors (even consumers) as windowsBjorn Helgaas
Prior to c770cb4cb505 ("PCI: Mark invalid BARs as unassigned"), if we tried to claim a PCI BAR but could not find an upstream bridge window that matched it, we complained but still allowed the device to be enabled. c770cb4cb505 broke devices that previously worked (mptsas and igb in the case Tony reported, but it could be any devices) because it marks those BARs as IORESOURCE_UNSET, which makes pci_enable_device() complain and return failure:

  igb 0000:81:00.0: can't enable device: BAR 0 [mem size 0x00020000] not assigned
  igb: probe of 0000:81:00.0 failed with error -22

The underlying cause is an ACPI Address Space Descriptor for a PCI host bridge window that is marked as "consumer". This is a firmware defect: resources that are produced on the downstream side of a bridge should be marked "producer". But rejecting these BARs that we previously allowed is a functionality regression, and firmware has not used the producer/consumer bit consistently, so we can't rely on it anyway. Stop checking the producer/consumer bit, and assume all bridge Address Space Descriptors are for bridge windows.

Note that this change does not affect I/O Port or Fixed Location I/O Port Descriptors, which are commonly used for the [io 0x0cf8-0x0cff] config access range. That range is a "consumer" range and should not be treated as a window.

Fixes: c770cb4cb505 ("PCI: Mark invalid BARs as unassigned") Link: https://bugzilla.kernel.org/show_bug.cgi?id=96961 Reported-and-tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-21sparc: Use GFP_ATOMIC in ldc_alloc_exp_dring() as it can be called in softirq contextSowmini Varadhan
Since it is possible for vnet_event_napi to end up doing vnet_control_pkt_engine -> ... -> vnet_send_attr -> vnet_port_alloc_tx_ring -> ldc_alloc_exp_dring -> kzalloc() (i.e., in softirq context), kzalloc() should be called with GFP_ATOMIC from ldc_alloc_exp_dring. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
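A minimal sketch of the fix idea (simplified signature; the real ldc_alloc_exp_dring() takes more arguments):

    #include <linux/slab.h>

    static void *alloc_exp_dring_buf(size_t len)
    {
        /* this path can be reached from softirq context, where
         * sleeping is forbidden; GFP_KERNEL may sleep, so the
         * allocation must use GFP_ATOMIC */
        return kzalloc(len, GFP_ATOMIC);
    }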
2015-04-21uapi: Remove kernel internal declarationAndreas Gruenbacher
The enum nfs4_acl_whotype is only used in nfsd's internal nfs4 acl representation, so no longer expose it to user space. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21nfsd: fix nfsd startup race triggering BUG_ONGiuseppe Cantavenera
nfsd triggered a BUG_ON in net_generic(...) when rpc_pipefs_event(...) in fs/nfsd/nfs4recover.c was called before nfsd_net_id was assigned. The following was observed on a MIPS 32-core processor:

 kernel: Call Trace:
 kernel: [<ffffffffc00bc5e4>] rpc_pipefs_event+0x7c/0x158 [nfsd]
 kernel: [<ffffffff8017a2a0>] notifier_call_chain+0x70/0xb8
 kernel: [<ffffffff8017a4e4>] __blocking_notifier_call_chain+0x4c/0x70
 kernel: [<ffffffff8053aff8>] rpc_fill_super+0xf8/0x1a0
 kernel: [<ffffffff8022204c>] mount_ns+0xb4/0xf0
 kernel: [<ffffffff80222b48>] mount_fs+0x50/0x1f8
 kernel: [<ffffffff8023dc00>] vfs_kern_mount+0x58/0xf0
 kernel: [<ffffffff802404ac>] do_mount+0x27c/0xa28
 kernel: [<ffffffff80240cf0>] SyS_mount+0x98/0xe8
 kernel: [<ffffffff80135d24>] handle_sys64+0x44/0x68
 kernel:
 kernel: Code: 0040f809 00000000 2e020001 <00020336> 3c12c00d 3c02801a de100000 6442eb98 0040f809
 kernel: ---[ end trace 7471374335809536 ]---

Fix this behaviour by calling register_pernet_subsys(&nfsd_net_ops) before registering rpc_pipefs_event(...) with the notifier chain.

Signed-off-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com> Signed-off-by: Lorenzo Restelli <lorenzo.restelli.ext@nokia.com> Reviewed-by: Kinglong Mee <kinglongmee@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
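A hedged sketch of the ordering fix (nfsd_net_ops and nfsd_net_id are from the commit text; the notifier-registration helper name is hypothetical and the init function is simplified):

    static int __init init_nfsd(void)
    {
        int rc;

        /* must run first: this assigns nfsd_net_id, which
         * rpc_pipefs_event() dereferences via net_generic() */
        rc = register_pernet_subsys(&nfsd_net_ops);
        if (rc)
            return rc;

        /* only now is it safe to hook the rpc_pipefs notifier
         * chain that can call rpc_pipefs_event() immediately */
        rc = register_rpc_pipefs_notifier();    /* hypothetical name */
        if (rc)
            unregister_pernet_subsys(&nfsd_net_ops);
        return rc;
    }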
2015-04-21nfsd: eliminate NFSD_DEBUGMark Salter
Commit f895b252d4edf ("sunrpc: eliminate RPC_DEBUG") introduced use of IS_ENABLED() in a uapi header, which leads to a build failure for userspace apps trying to use <linux/nfsd/debug.h>:

 linux/nfsd/debug.h:18:15: error: missing binary operator before token "("
  #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
                ^

Since this was only used to define NFSD_DEBUG if CONFIG_SUNRPC_DEBUG is enabled, replace instances of NFSD_DEBUG with CONFIG_SUNRPC_DEBUG.

Cc: stable@vger.kernel.org Fixes: f895b252d4edf ("sunrpc: eliminate RPC_DEBUG") Signed-off-by: Mark Salter <msalter@redhat.com> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
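For illustration, the shape of the breakage and of the fix (simplified; the exact header contents are assumed from the error message above):

    /* before: IS_ENABLED() is a kernel-internal macro, so userspace
     * including <linux/nfsd/debug.h> fails to preprocess this */
    #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
    # define NFSD_DEBUG 1
    #endif

    /* after: kernel code tests the Kconfig symbol directly, and the
     * uapi header no longer defines NFSD_DEBUG at all */
    #ifdef CONFIG_SUNRPC_DEBUG
    /* ... debug printks compiled in ... */
    #endif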
2015-04-21nfsd4: fix READ permission checkingJ. Bruce Fields
In the case where we already have a struct file (derived from a stateid), we still need to do permission checking; otherwise an unauthorized user could gain access to a file by sniffing or guessing somebody else's stateid. Cc: stable@vger.kernel.org Fixes: dc97618ddda9 "nfsd4: separate splice and readv cases" Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21nfsd4: disallow SEEK with special stateidsJ. Bruce Fields
If the client uses a special stateid then we'll pass a NULL file to vfs_llseek. Fixes: 24bab491220f ("NFSD: Implement SEEK") Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: stable@vger.kernel.org Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21sparc64: Use M7 PMC write on all chips T4 and onward.David S. Miller
They both work equally well, and the M7 implementation is simpler and cheaper (fewer register writes). With help from David Ahern. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-21MAINTAINERS: Remove Mohit Kumar (email bounces)Bjorn Helgaas
Email to Mohit Kumar <mohit.kumar@st.com> has been bouncing, so remove the address from MAINTAINERS and add an entry in CREDITS. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>