Age | Commit message (Collapse) | Author |
|
ip_route_me_harder can't create the route cache when the outdev is the same
with the indev for the skbs whichout a valid protocol set.
__mkroute_input functions has this check:
1998 if (skb->protocol != htons(ETH_P_IP)) {
1999 /* Not IP (i.e. ARP). Do not create route, if it is
2000 * invalid for proxy arp. DNAT routes are always valid.
2001 *
2002 * Proxy arp feature have been extended to allow, ARP
2003 * replies back to the same interface, to support
2004 * Private VLAN switch technologies. See arp.c.
2005 */
2006 if (out_dev == in_dev &&
2007 IN_DEV_PROXY_ARP_PVLAN(in_dev) == 0) {
2008 err = -EINVAL;
2009 goto cleanup;
2010 }
2011 }
This patch gives the new skb a valid protocol to bypass this check. In order
to make ipt_REJECT work with bridges, you also need to enable ip_forward.
This patch also fixes a regression. When we used skb_copy_expand(), we
didn't have this issue stated above, as the protocol was properly set.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch improves the situation in which the expectation table is
full for conntrack NAT helpers. Basically, we give up if we don't
find a place in the table instead of looping over nf_ct_expect_related()
with a different port (we should only do this if it returns -EBUSY, for
-EMFILE or -ESHUTDOWN I think that it's better to skip this).
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
|
|
Special care should be taken when slow path is hit in ip_fragment() :
When walking through frags, we transfert truesize ownership from skb to
frags. Then if we hit a slow_path condition, we must undo this or risk
uncharging frags->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.
Many thanks to Nick Bowler, who provided a very clean bug report and
test program.
Thanks to Jarek for reviewing my first patch and providing a V2
While Nick bisection pointed to commit 2b85a34e911 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older
(2.6.12-rc5)
A side effect is to extend work done in commit b2722b1c3a893e
(ip_fragment: also adjust skb->truesize for packets not owned by a
socket) to ipv6 as well.
Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If a RST comes in immediately after checking sk->sk_err, tcp_poll will
return POLLIN but not POLLOUT. Fix this by checking sk->sk_err at the end
of tcp_poll. Additionally, ensure the correct order of operations on SMP
machines with memory barriers.
Signed-off-by: Tom Marshall <tdm.code@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The family parameter xfrm_state_find is used to find a state matching a
certain policy. This value is set to the template's family
(encap_family) right before xfrm_state_find is called.
The family parameter is however also used to construct a temporary state
in xfrm_state_find itself which is wrong for inter-family scenarios
because it produces a selector for the wrong family. Since this selector
is included in the xfrm_user_acquire structure, user space programs
misinterpret IPv6 addresses as IPv4 and vice versa.
This patch splits up the original init_tempsel function into a part that
initializes the selector respectively the props and id of the temporary
state, to allow for differing ip address families whithin the state.
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Under load, netif_rx() can drop incoming packets but administrators dont
have a chance to spot which device needs some tuning (RPS activation for
example)
This patch adds rx_dropped accounting in vlans and tunnels.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ipv6 can be a module, we should test CONFIG_IPV6 and CONFIG_IPV6_MODULE
to enable ipv6 bits in ip_gre.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Related dicussion here : http://lkml.org/lkml/2010/9/3/16
Introduce a function br_parse_ip_options that will audit the
skb and possibly refill IP options before a packet enters the
IP stack. If no options are present, the function will zero out
the skb cb area so that it is not misinterpreted as options by some
unsuspecting IP layer routine. If packet consistency fails, drop it.
Signed-off-by: Bandan Das <bandan.das@stratus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When alloc_null_binding(), no IP_NAT_RNAGE_MAP_IPS in flags means no IP address
translation is needed. It isn't necessary to specify the address explicitly.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
|
|
other choice
Eliminate nf_nat_used_tuple() to save some CPU cycles when there is no
other choice.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
|
|
dev->ip_ptr is protected by rtnl and rcu.
Yet some places dont use appropriate primitives and/or locking rules.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As RTNL is held while doing tunnels inserts and deletes, we can remove
ipgre_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As RTNL is held while doing tunnels inserts and deletes, we can remove
ipip_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a static function nf_nat_csum() to replace the duplicate code in
nf_nat_mangle_udp_packet() and __nf_nat_mangle_tcp_packet().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
|
|
While integrating your man-pages patch for IP_NODEFRAG, I noticed
that this option is settable by setsockopt(), but not gettable by
getsockopt(). I suppose this is not intended. The (untested,
trivial) patch below adds getsockopt() support.
Signed-off-by: Michael kerrisk <mtk.manpages@gmail.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After all these years, it turns out that the
/proc/sys/net/ipv4/conf/*/force_igmp_version
parameter isn't fully implemented.
*Symptom*:
When set force_igmp_version to a value of 2, the kernel should only perform
multicast IGMPv2 operations (IETF rfc2236). An host-initiated Join message
will be sent as a IGMPv2 Join message. But if a IGMPv3 query message is
received, the host responds with a IGMPv3 join message. Per rfc3376 and
rfc2236, a IGMPv2 host should treat a IGMPv3 query as a IGMPv2 query and
respond with an IGMPv2 Join message.
*Consequences*:
This is an issue when a IGMPv3 capable switch is the querier and will only
issue IGMPv3 queries (which double as IGMPv2 querys) and there's an
intermediate switch that is only IGMPv2 capable. The intermediate switch
processes the initial v2 Join, but fails to recognize the IGMPv3 Join responses
to the Query, resulting in a dropped connection when the intermediate v2-only
switch times it out.
*Identifying issue in the kernel source*:
The issue is in this section of code (in net/ipv4/igmp.c), which is called when
an IGMP query is received (from mainline 2.6.36-rc3 gitweb):
...
A IGMPv3 query has a length >= 12 and no sources. This routine will exit after
line 880, setting the general query timer (random timeout between 0 and query
response time). This calls igmp_gq_timer_expire():
...
.. which only sends a v3 response. So if a v3 query is received, the kernel
always sends a v3 response.
IGMP queries happen once every 60 sec (per vlan), so the traffic is low. A
IGMPv3 query *is* a strict superset of a IGMPv2 query, so this patch properly
short circuit's the v3 behaviour.
One issue is that this does not address force_igmp_version=1. Then again, I've
never seen any IGMPv1 multicast equipment in the wild. However there is a lot
of v2-only equipment. If it's necessary to support the IGMPv1 case as well:
837 if (len == 8 || IGMP_V2_SEEN(in_dev) || IGMP_V1_SEEN(in_dev)) {
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use rcu_dereference_rtnl() helper
Change hard coded constants in fib_flag_trans()
7 -> RTN_UNREACHABLE
8 -> RTN_PROHIBIT
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
net/mac80211/main.c
|
|
xfrm4_tunnel_register() & xfrm6_tunnel_register() should
use rcu_assign_pointer() to make sure previous writes
(to handler->next) are committed to memory before chain
insertion.
deregister functions dont need a particular barrier.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).
Problem is that following sequence :
fd = socket(...)
connect(fd, &remote, ...)
not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)
Sequence is :
- autobind() : choose a random local port, insert socket in hash tables
[while local address is INADDR_ANY]
- connect() : set remote address and port, change local address to IP
given by a route lookup.
When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.
One solution to this problem is to rehash datagram socket if needed.
We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.
This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.
Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
friends.
inet_protos[] & inet6_protos[] are moved to read_mostly section
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Blackhole routes are used when xfrm_lookup() returns -EREMOTE (error
triggered by IKE for example), hence this kind of route is always
temporary and so we should check if a better route exists for next
packets.
Bug has been introduced by commit d11a4dc18bf41719c9f0d7ed494d295dd2973b92.
Signed-off-by: Jianzhao Wang <jianzhao.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Hi,
Here is one more of these warnings and a patch below:
Sep 5 23:52:33 del kernel: [46044.244833] ===================================================
Sep 5 23:52:33 del kernel: [46044.269681] [ INFO: suspicious rcu_dereference_check() usage. ]
Sep 5 23:52:33 del kernel: [46044.277000] ---------------------------------------------------
Sep 5 23:52:33 del kernel: [46044.285185] net/ipv4/fib_trie.c:1756 invoked rcu_dereference_check() without protection!
Sep 5 23:52:33 del kernel: [46044.293627]
Sep 5 23:52:33 del kernel: [46044.293632] other info that might help us debug this:
Sep 5 23:52:33 del kernel: [46044.293634]
Sep 5 23:52:33 del kernel: [46044.325333]
Sep 5 23:52:33 del kernel: [46044.325335] rcu_scheduler_active = 1, debug_locks = 0
Sep 5 23:52:33 del kernel: [46044.348013] 1 lock held by pppd/1717:
Sep 5 23:52:33 del kernel: [46044.357548] #0: (rtnl_mutex){+.+.+.}, at: [<c125dc1f>] rtnl_lock+0xf/0x20
Sep 5 23:52:33 del kernel: [46044.367647]
Sep 5 23:52:33 del kernel: [46044.367652] stack backtrace:
Sep 5 23:52:33 del kernel: [46044.387429] Pid: 1717, comm: pppd Not tainted 2.6.35.4.4a #3
Sep 5 23:52:33 del kernel: [46044.398764] Call Trace:
Sep 5 23:52:33 del kernel: [46044.409596] [<c12f9aba>] ? printk+0x18/0x1e
Sep 5 23:52:33 del kernel: [46044.420761] [<c1053969>] lockdep_rcu_dereference+0xa9/0xb0
Sep 5 23:52:33 del kernel: [46044.432229] [<c12b7235>] trie_firstleaf+0x65/0x70
Sep 5 23:52:33 del kernel: [46044.443941] [<c12b74d4>] fib_table_flush+0x14/0x170
Sep 5 23:52:33 del kernel: [46044.455823] [<c1033e92>] ? local_bh_enable_ip+0x62/0xd0
Sep 5 23:52:33 del kernel: [46044.467995] [<c12fc39f>] ? _raw_spin_unlock_bh+0x2f/0x40
Sep 5 23:52:33 del kernel: [46044.480404] [<c12b24d0>] ? fib_sync_down_dev+0x120/0x180
Sep 5 23:52:33 del kernel: [46044.493025] [<c12b069d>] fib_flush+0x2d/0x60
Sep 5 23:52:33 del kernel: [46044.505796] [<c12b06f5>] fib_disable_ip+0x25/0x50
Sep 5 23:52:33 del kernel: [46044.518772] [<c12b10d3>] fib_netdev_event+0x73/0xd0
Sep 5 23:52:33 del kernel: [46044.531918] [<c1048dfd>] notifier_call_chain+0x2d/0x70
Sep 5 23:52:33 del kernel: [46044.545358] [<c1048f0a>] raw_notifier_call_chain+0x1a/0x20
Sep 5 23:52:33 del kernel: [46044.559092] [<c124f687>] call_netdevice_notifiers+0x27/0x60
Sep 5 23:52:33 del kernel: [46044.573037] [<c124faec>] __dev_notify_flags+0x5c/0x80
Sep 5 23:52:33 del kernel: [46044.586489] [<c124fb47>] dev_change_flags+0x37/0x60
Sep 5 23:52:33 del kernel: [46044.599394] [<c12a8a8d>] devinet_ioctl+0x54d/0x630
Sep 5 23:52:33 del kernel: [46044.612277] [<c12aabb7>] inet_ioctl+0x97/0xc0
Sep 5 23:52:34 del kernel: [46044.625208] [<c123f6af>] sock_ioctl+0x6f/0x270
Sep 5 23:52:34 del kernel: [46044.638046] [<c109d2b0>] ? handle_mm_fault+0x420/0x6c0
Sep 5 23:52:34 del kernel: [46044.650968] [<c123f640>] ? sock_ioctl+0x0/0x270
Sep 5 23:52:34 del kernel: [46044.663865] [<c10c3188>] vfs_ioctl+0x28/0xa0
Sep 5 23:52:34 del kernel: [46044.676556] [<c10c38fa>] do_vfs_ioctl+0x6a/0x5c0
Sep 5 23:52:34 del kernel: [46044.688989] [<c1048676>] ? up_read+0x16/0x30
Sep 5 23:52:34 del kernel: [46044.701411] [<c1021376>] ? do_page_fault+0x1d6/0x3a0
Sep 5 23:52:34 del kernel: [46044.714223] [<c10b6588>] ? fget_light+0xf8/0x2f0
Sep 5 23:52:34 del kernel: [46044.726601] [<c1241f98>] ? sys_socketcall+0x208/0x2c0
Sep 5 23:52:34 del kernel: [46044.739140] [<c10c3eb3>] sys_ioctl+0x63/0x70
Sep 5 23:52:34 del kernel: [46044.751967] [<c12fca3d>] syscall_call+0x7/0xb
Sep 5 23:52:34 del kernel: [46044.764734] [<c12f0000>] ? cookie_v6_check+0x3d0/0x630
-------------->
This patch fixes the warning:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
net/ipv4/fib_trie.c:1756 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
1 lock held by pppd/1717:
#0: (rtnl_mutex){+.+.+.}, at: [<c125dc1f>] rtnl_lock+0xf/0x20
stack backtrace:
Pid: 1717, comm: pppd Not tainted 2.6.35.4a #3
Call Trace:
[<c12f9aba>] ? printk+0x18/0x1e
[<c1053969>] lockdep_rcu_dereference+0xa9/0xb0
[<c12b7235>] trie_firstleaf+0x65/0x70
[<c12b74d4>] fib_table_flush+0x14/0x170
...
Allow trie_firstleaf() to be called either under rcu_read_lock()
protection or with RTNL held. The same annotation is added to
node_parent_rcu() to prevent a similar warning a bit later.
Followup of commits 634a4b20 and 4eaa0e3c.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Actually iterate over the next-hops to make sure we have
a device match. Otherwise RP filtering is always elided
when the route matched has multiple next-hops.
Reported-by: Igor M Podlesny <for.poige@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Clean the code up according to Documentation/CodingStyle.
Don't initialize the variable dont_send in arp_process().
Remove the temporary varialbe flags in arp_state_to_flags().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Thanks to Ilpo Jarvinen, this updates also the initial window
setting for tcp_output with regard to RFC 5681.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tunnel4_handlers, tunnel64_handlers, tunnel6_handlers and
tunnel46_handlers are protected by RCU, but we dont use appropriate rcu
primitives to scan them. rcu_lock() is already held by caller.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp4_gro_receive() and tcp4_gro_complete() dont need to be exported.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tunnel4_handlers chain being scanned for each incoming packet,
make sure it doesnt share an often dirtied cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch consolidates initial-window code common to TCP and CCID-2:
* TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
* CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch provides a "user timeout" support as described in RFC793. The
socket option is also needed for the the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
upto 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in anyway when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The string clone is only used as a temporary copy of the argument val
within the while loop, and so it should be freed before leaving the
function. The call to strsep, however, modifies clone, so a pointer to the
front of the string is kept in saved_clone, to make it possible to free it.
The sematic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
local idexpression x;
expression E;
identifier l;
statement S;
@@
*x= \(kasprintf\|kstrdup\)(...);
...
if (x == NULL) S
... when != kfree(x)
when != E = x
if (...) {
<... when != kfree(x)
* goto l;
...>
* return ...;
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This issue come from ruby language community. Below test program
hang up when only run on Linux.
% uname -mrsv
Linux 2.6.26-2-486 #1 Sat Dec 26 08:37:39 UTC 2009 i686
% ruby -rsocket -ve '
BasicSocket.do_not_reverse_lookup = true
serv = TCPServer.open("127.0.0.1", 0)
s1 = TCPSocket.open("127.0.0.1", serv.addr[1])
s2 = serv.accept
s2.close
s1.write("a") rescue p $!
s1.write("a") rescue p $!
Thread.new {
s1.write("a")
}.join'
ruby 1.9.3dev (2010-07-06 trunk 28554) [i686-linux]
#<Errno::EPIPE: Broken pipe>
[Hang Here]
FreeBSD, Solaris, Mac doesn't. because Ruby's write() method call
select() internally. and tcp_poll has a bug.
SUS defined 'ready for writing' of select() as following.
| A descriptor shall be considered ready for writing when a call to an output
| function with O_NONBLOCK clear would not block, whether or not the function
| would transfer data successfully.
That said, EPIPE situation is clearly one of 'ready for writing'.
We don't have read-side issue because tcp_poll() already has read side
shutdown care.
| if (sk->sk_shutdown & RCV_SHUTDOWN)
| mask |= POLLIN | POLLRDNORM | POLLRDHUP;
So, Let's insert same logic in write side.
- reference url
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/31065
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/31068
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As discovered by Anton Blanchard, current code to autotune
tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and
sysctl_max_syn_backlog makes little sense.
The bigger a page is, the less tcp_max_orphans is : 4096 on a 512GB
machine in Anton's case.
(tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket))
is much bigger if spinlock debugging is on. Its wrong to select bigger
limits in this case (where kernel structures are also bigger)
bhash_size max is 65536, and we get this value even for small machines.
A better ground is to use size of ehash table, this also makes code
shorter and more obvious.
Based on a patch from Anton, and another from David.
Reported-and-tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As reported by Anton Blanchard when we use
percpu_counter_read_positive() to make our orphan socket limit checks,
the check can be off by up to num_cpus_online() * batch (which is 32
by default) which on a 128 cpu machine can be as large as the default
orphan limit itself.
Fix this by doing the full expensive sum check if the optimized check
triggers.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
|
|
Compiler is not smart enough to avoid a conditional branch.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
commit f3c5c1bfd430858d3a05436f82c51e53104feb6b
(netfilter: xtables: make ip_tables reentrant) forgot to
also compute the jumpstack size in the compat handlers.
Result is that "iptables -I INPUT -j userchain" turns into -j DROP.
Reported by Sebastian Roesner on #netfilter, closes
http://bugzilla.netfilter.org/show_bug.cgi?id=669.
Note: arptables change is compile-tested only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).
Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Via setsockopt it is possible to reduce the socket RX buffer
(SO_RCVBUF). TCP method to select the initial window and window scaling
option in tcp_select_initial_window() currently misbehaves and do not
consider a reduced RX socket buffer via setsockopt.
Even though the server's RX buffer is reduced via setsockopt() to 256
byte (Initial Window 384 byte => 256 * 2 - (256 * 2 / 4)) the window
scale option is still 7:
192.168.1.38.40676 > 78.47.222.210.5001: Flags [S], seq 2577214362, win 5840, options [mss 1460,sackOK,TS val 338417 ecr 0,nop,wscale 0], length 0
78.47.222.210.5001 > 192.168.1.38.40676: Flags [S.], seq 1570631029, ack 2577214363, win 384, options [mss 1452,sackOK,TS val 2435248895 ecr 338417,nop,wscale 7], length 0
192.168.1.38.40676 > 78.47.222.210.5001: Flags [.], ack 1, win 5840, options [nop,nop,TS val 338421 ecr 2435248895], length 0
Within tcp_select_initial_window() the original space argument - a
representation of the rx buffer size - is expanded during
tcp_select_initial_window(). Only sysctl_tcp_rmem[2], sysctl_rmem_max
and window_clamp are considered to calculate the initial window.
This patch adjust the window_clamp argument if the user explicitly
reduce the receive buffer.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
|
|
PPP: introduce "pptp" module which implements point-to-point tunneling protocol using pppox framework
NET: introduce the "gre" module for demultiplexing GRE packets on version criteria
(required to pptp and ip_gre may coexists)
NET: ip_gre: update to use the "gre" module
This patch introduces then pptp support to the linux kernel which
dramatically speeds up pptp vpn connections and decreases cpu usage in
comparison of existing user-space implementation
(poptop/pptpclient). There is accel-pptp project
(https://sourceforge.net/projects/accel-pptp/) to utilize this module,
it contains plugin for pppd to use pptp in client-mode and modified
pptpd (poptop) to build high-performance pptp NAS.
There was many changes from initial submitted patch, most important are:
1. using rcu instead of read-write locks
2. using static bitmap instead of dynamically allocated
3. using vmalloc for memory allocation instead of BITS_PER_LONG + __get_free_pages
4. fixed many coding style issues
Thanks to Eric Dumazet.
Signed-off-by: Dmitry Kozlov <xeb@mail.ru>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now cmpxchg() is available on all arches, we can use it in
build_ehash_secret() and rt_bind_peer() instead of using spinlocks.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
|
|
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.
The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().
http://marc.info/?l=linux-netdev&m=128084897415886&w=2
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After commit 24b36f019 (netfilter: {ip,ip6,arp}_tables: dont block
bottom half more than necessary), lockdep can raise a warning
because we attempt to lock a spinlock with BH enabled, while
the same lock is usually locked by another cpu in a softirq context.
Disable again BH to avoid these lockdep warnings.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Diagnosed-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_parse_md5sig_option doesn't check md5sig option (TCPOPT_MD5SIG)
length, but tcp_v[46]_inbound_md5_hash assume that it's at least 16
bytes long.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
drivers/net/e1000e/hw.h
net/bridge/br_device.c
net/bridge/br_input.c
|
|
6c79bf0f2440fd250c8fce8d9b82fcf03d4e8350 subtracts PPPOE_SES_HLEN from mtu at
the front of ip_fragment(). So the later subtraction should be removed. The
MTU of 802.1q is also 1500, so MTU should not be changed.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo>
----
net/ipv4/ip_output.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo>
Signed-off-by: David S. Miller <davem@davemloft.net>
|