Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
|
|
SO_REUSEPORT per EUID.
If there is no TCP_LISTEN socket on a ephemeral port, we can bind multiple
sockets having SO_REUSEADDR to the same port. Then if all sockets bound to
the port have also SO_REUSEPORT enabled and have the same EUID, all of them
can be listened. This is not safe.
Let's say, an application has root privilege and binds sockets to an
ephemeral port with both of SO_REUSEADDR and SO_REUSEPORT. When none of
sockets is not listened yet, a malicious user can use sudo, exhaust
ephemeral ports, and bind sockets to the same ephemeral port, so he or she
can call listen and steal the port.
To prevent this issue, we must not bind more than one sockets that have the
same EUID and both of SO_REUSEADDR and SO_REUSEPORT.
On the other hand, if the sockets have different EUIDs, the issue above does
not occur. After sockets with different EUIDs are bound to the same port and
one of them is listened, no more socket can be listened. This is because the
condition below is evaluated true and listen() for the second socket fails.
} else if (!reuseport_ok ||
!reuseport || !sk2->sk_reuseport ||
rcu_access_pointer(sk->sk_reuseport_cb) ||
(sk2->sk_state != TCP_TIME_WAIT &&
!uid_eq(uid, sock_i_uid(sk2)))) {
if (inet_rcv_saddr_equal(sk, sk2, true))
break;
}
Therefore, on the same port, we cannot do listen() for multiple sockets with
different EUIDs and any other listen syscalls fail, so the problem does not
happen. In this case, we can still call connect() for other sockets that
cannot be listened, so we have to succeed to call bind() in order to fully
utilize 4-tuples.
Summarizing the above, we should be able to bind only one socket having
SO_REUSEADDR and SO_REUSEPORT per EUID.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
exhausted.
Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.
The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.
You can reproduce this issue by following instructions below.
1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
2. set SO_REUSEADDR to two sockets.
3. bind two sockets to (localhost, 0) and the latter fails.
Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.
This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.
1. setsockopt(sk1, SO_REUSEADDR)
2. setsockopt(sk2, SO_REUSEADDR)
3. bind(sk1, saddr, 0)
4. bind(sk2, saddr, 0)
5. connect(sk1, daddr)
6. listen(sk2)
If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.
The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we get an ephemeral port, the relax is false, so the SO_REUSEADDR
conditions may be evaluated twice. We do not need to check the conditions
again.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There was a bug that was causing packets to be sent to the driver
without first calling dequeue() on the "child" qdisc. And the KASAN
report below shows that sending a packet without calling dequeue()
leads to bad results.
The problem is that when checking the last qdisc "child" we do not set
the returned skb to NULL, which can cause it to be sent to the driver,
and so after the skb is sent, it may be freed, and in some situations a
reference to it may still be in the child qdisc, because it was never
dequeued.
The crash log looks like this:
[ 19.937538] ==================================================================
[ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
[ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
[ 19.939612]
[ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
[ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
[ 19.941523] Call Trace:
[ 19.941774] <IRQ>
[ 19.941985] dump_stack+0x97/0xe0
[ 19.942323] print_address_description.constprop.0+0x3b/0x60
[ 19.942884] ? taprio_dequeue_soft+0x620/0x780
[ 19.943325] ? taprio_dequeue_soft+0x620/0x780
[ 19.943767] __kasan_report.cold+0x1a/0x32
[ 19.944173] ? taprio_dequeue_soft+0x620/0x780
[ 19.944612] kasan_report+0xe/0x20
[ 19.944954] taprio_dequeue_soft+0x620/0x780
[ 19.945380] __qdisc_run+0x164/0x18d0
[ 19.945749] net_tx_action+0x2c4/0x730
[ 19.946124] __do_softirq+0x268/0x7bc
[ 19.946491] irq_exit+0x17d/0x1b0
[ 19.946824] smp_apic_timer_interrupt+0xeb/0x380
[ 19.947280] apic_timer_interrupt+0xf/0x20
[ 19.947687] </IRQ>
[ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0
[ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
[ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
[ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
[ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
[ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
[ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 19.954408] ? default_idle_call+0x2e/0x70
[ 19.954816] ? default_idle+0x1f/0x2d0
[ 19.955192] default_idle_call+0x5e/0x70
[ 19.955584] do_idle+0x3d4/0x500
[ 19.955909] ? arch_cpu_idle_exit+0x40/0x40
[ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30
[ 19.956829] ? trace_hardirqs_on+0x30/0x160
[ 19.957242] cpu_startup_entry+0x19/0x20
[ 19.957633] start_secondary+0x2a6/0x380
[ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0
[ 19.958486] secondary_startup_64+0xa4/0xb0
[ 19.958921]
[ 19.959078] Allocated by task 33:
[ 19.959412] save_stack+0x1b/0x80
[ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 19.960222] kmem_cache_alloc+0xe4/0x230
[ 19.960617] __alloc_skb+0x91/0x510
[ 19.960967] ndisc_alloc_skb+0x133/0x330
[ 19.961358] ndisc_send_ns+0x134/0x810
[ 19.961735] addrconf_dad_work+0xad5/0xf80
[ 19.962144] process_one_work+0x78e/0x13a0
[ 19.962551] worker_thread+0x8f/0xfa0
[ 19.962919] kthread+0x2ba/0x3b0
[ 19.963242] ret_from_fork+0x3a/0x50
[ 19.963596]
[ 19.963753] Freed by task 33:
[ 19.964055] save_stack+0x1b/0x80
[ 19.964386] __kasan_slab_free+0x12f/0x180
[ 19.964830] kmem_cache_free+0x80/0x290
[ 19.965231] ip6_mc_input+0x38a/0x4d0
[ 19.965617] ipv6_rcv+0x1a4/0x1d0
[ 19.965948] __netif_receive_skb_one_core+0xf2/0x180
[ 19.966437] netif_receive_skb+0x8c/0x3c0
[ 19.966846] br_handle_frame_finish+0x779/0x1310
[ 19.967302] br_handle_frame+0x42a/0x830
[ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90
[ 19.968167] __netif_receive_skb_one_core+0x96/0x180
[ 19.968658] process_backlog+0x198/0x650
[ 19.969047] net_rx_action+0x2fa/0xaa0
[ 19.969420] __do_softirq+0x268/0x7bc
[ 19.969785]
[ 19.969940] The buggy address belongs to the object at ffff888112862840
[ 19.969940] which belongs to the cache skbuff_head_cache of size 224
[ 19.971202] The buggy address is located 140 bytes inside of
[ 19.971202] 224-byte region [ffff888112862840, ffff888112862920)
[ 19.972344] The buggy address belongs to the page:
[ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
[ 19.973930] flags: 0x8000000000010200(slab|head)
[ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
[ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
[ 19.975915] page dumped because: kasan: bad access detected
[ 19.976461] page_owner tracks the page as allocated
[ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
[ 19.978332] prep_new_page+0x24b/0x330
[ 19.978707] get_page_from_freelist+0x2057/0x2c90
[ 19.979170] __alloc_pages_nodemask+0x218/0x590
[ 19.979619] new_slab+0x9d/0x300
[ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0
[ 19.980421] __slab_alloc.constprop.0+0x30/0x60
[ 19.980870] kmem_cache_alloc+0x201/0x230
[ 19.981269] __alloc_skb+0x91/0x510
[ 19.981620] alloc_skb_with_frags+0x78/0x4a0
[ 19.982043] sock_alloc_send_pskb+0x5eb/0x750
[ 19.982476] unix_stream_sendmsg+0x399/0x7f0
[ 19.982904] sock_sendmsg+0xe2/0x110
[ 19.983262] ____sys_sendmsg+0x4de/0x6d0
[ 19.983660] ___sys_sendmsg+0xe4/0x160
[ 19.984032] __sys_sendmsg+0xab/0x130
[ 19.984396] do_syscall_64+0xe7/0xae0
[ 19.984761] page last free stack trace:
[ 19.985142] __free_pages_ok+0x432/0xbc0
[ 19.985533] qlist_free_all+0x56/0xc0
[ 19.985907] quarantine_reduce+0x149/0x170
[ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0
[ 19.986791] kmem_cache_alloc+0xe4/0x230
[ 19.987182] prepare_creds+0x24/0x440
[ 19.987548] do_faccessat+0x80/0x590
[ 19.987906] do_syscall_64+0xe7/0xae0
[ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 19.988775]
[ 19.988930] Memory state around the buggy address:
[ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 19.991529] ^
[ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 4cda75275f9f89f9485b0ca4d6950c95258a9bce
from net-next.
Brown bag time.
Michal noticed that this change doesn't work at all when
netif_set_real_num_tx_queues() gets called prior to an initial
dev_activate(), as for instance igb does.
Doing so dies with:
[ 40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
[ 40.586922] #PF: supervisor read access in kernel mode
[ 40.592668] #PF: error_code(0x0000) - not-present page
[ 40.598405] PGD 0 P4D 0
[ 40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
[ 40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G E 5.6.0-rc3-ethnl.50-default #1
[ 40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[ 40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
[ 40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
[ 40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
[ 40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
[ 40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
[ 40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
[ 40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
[ 40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
[ 40.699750] FS: 00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
[ 40.708782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
[ 40.723156] Call Trace:
[ 40.725888] dev_qdisc_set_real_num_tx_queues+0x61/0x90
[ 40.731725] netif_set_real_num_tx_queues+0x94/0x1d0
[ 40.737286] __igb_open+0x19a/0x5d0 [igb]
[ 40.741767] __dev_open+0xbb/0x150
[ 40.745567] __dev_change_flags+0x157/0x1a0
[ 40.750240] dev_change_flags+0x23/0x60
[...]
Fixes: 4cda75275f9f ("net: sched: make newly activated qdiscs visible")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
the following packetdrill script
socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 flags[flag_h] nokey>
< S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscale 8,mpcapable v1 flags[flag_h] key[skey=2]>
> . 1:1(0) ack 1 win 256 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h] key[ckey,skey]>
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(3, F_SETFL, O_RDWR) = 0
write(3, ..., 1000) = 1000
doesn't transmit 1KB data packet after a successful three-way-handshake,
using mp_capable with data as required by protocol v1, and write() hangs
forever:
PID: 973 TASK: ffff97dd399cae80 CPU: 1 COMMAND: "packetdrill"
#0 [ffffa9b94062fb78] __schedule at ffffffff9c90a000
#1 [ffffa9b94062fc08] schedule at ffffffff9c90a4a0
#2 [ffffa9b94062fc18] schedule_timeout at ffffffff9c90e00d
#3 [ffffa9b94062fc90] wait_woken at ffffffff9c120184
#4 [ffffa9b94062fcb0] sk_stream_wait_connect at ffffffff9c75b064
#5 [ffffa9b94062fd20] mptcp_sendmsg at ffffffff9c8e801c
#6 [ffffa9b94062fdc0] sock_sendmsg at ffffffff9c747324
#7 [ffffa9b94062fdd8] sock_write_iter at ffffffff9c7473c7
#8 [ffffa9b94062fe48] new_sync_write at ffffffff9c302976
#9 [ffffa9b94062fed0] vfs_write at ffffffff9c305685
#10 [ffffa9b94062ff00] ksys_write at ffffffff9c305985
#11 [ffffa9b94062ff38] do_syscall_64 at ffffffff9c004475
#12 [ffffa9b94062ff50] entry_SYSCALL_64_after_hwframe at ffffffff9ca0008c
RIP: 00007f959407eaf7 RSP: 00007ffe9e95a910 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f959407eaf7
RDX: 00000000000003e8 RSI: 0000000001785fe0 RDI: 0000000000000008
RBP: 0000000001785fe0 R8: 0000000000000000 R9: 0000000000000003
R10: 0000000000000007 R11: 0000000000000293 R12: 00000000000003e8
R13: 00007ffe9e95ae30 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
Fix it ensuring that socket state is TCP_ESTABLISHED on reception of the
third ack.
Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Locking newsk while still holding the listener lock triggered
a lockdep splat [1]
We can simply move the memcg code after we release the listener lock,
as this can also help if multiple threads are sharing a common listener.
Also fix a typo while reading socket sk_rmem_alloc.
[1]
WARNING: possible recursive locking detected
5.6.0-rc3-syzkaller #0 Not tainted
--------------------------------------------
syz-executor598/9524 is trying to acquire lock:
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
but task is already holding lock:
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sk_lock-AF_INET6);
lock(sk_lock-AF_INET6);
*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by syz-executor598/9524:
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445
stack backtrace:
CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
check_deadlock kernel/locking/lockdep.c:2411 [inline]
validate_chain kernel/locking/lockdep.c:2954 [inline]
__lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
lock_sock include/net/sock.h:1541 [inline]
inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
__sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
__sys_accept4+0x53/0x90 net/socket.c:1809
__do_sys_accept4 net/socket.c:1821 [inline]
__se_sys_accept4 net/socket.c:1818 [inline]
__x64_sys_accept4+0x93/0xf0 net/socket.c:1818
do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4445c9
Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000
Fixes: d752a4986532 ("net: memcg: late association of sock to memcg")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The Internet Assigned Numbers Authority (IANA) has recently assigned
a protocol number value of 143 for Ethernet [1].
Before this assignment, encapsulation mechanisms such as Segment Routing
used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated
payload is an Ethernet frame.
In this patch, we add the definition of the Ethernet protocol number to the
kernel headers and update the SRv6 L2 tunnels to use it.
[1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Acked-by: Ahmed Abdelsalam <ahmed.abdelsalam@gssi.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
By default, DSA drivers should configure CPU and DSA ports to their
maximum speed. In many configurations this is sufficient to make the
link work.
In some cases it is necessary to configure the link to run slower,
e.g. because of limitations of the SoC it is connected to. Or back to
back PHYs are used and the PHY needs to be driven in order to
establish link. In this case, phylink is used.
Only instantiate phylink if it is required. If there is no PHY, or no
fixed link properties, phylink can upset a link which works in the
default configuration.
Fixes: 0e27921816ad ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sparse reports a warning at netlink_seq_start()
warning: context imbalance in netlink_seq_start() - wrong count at exit
The root cause is the missing annotation at netlink_seq_start()
Add the missing __acquires(RCU) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sparse reports warning at tcp_child_process()
warning: context imbalance in tcp_child_process() - unexpected unlock
The root cause is the missing annotation at tcp_child_process()
Add the missing __releases(&((child)->sk_lock.slock)) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sparse reports warnings at raw_seq_start() and raw_seq_stop()
warning: context imbalance in raw_seq_start() - wrong count at exit
warning: context imbalance in raw_seq_stop() - unexpected unlock
The root cause is the missing annotations at raw_seq_start()
and raw_seq_stop()
Add the missing __acquires(&h->lock) annotation
Add the missing __releases(&h->lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In their .attach callback, mq[prio] only add the qdiscs of the currently
active TX queues to the device's qdisc hash list.
If a user later increases the number of active TX queues, their qdiscs
are not visible via eg. 'tc qdisc show'.
Add a hook to netif_set_real_num_tx_queues() that walks all active
TX queues and adds those which are missing to the hash list.
CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In one error case, tpacket_rcv drops packets after incrementing the
ring producer index.
If this happens, it does not update tp_status to TP_STATUS_USER and
thus the reader is stalled for an iteration of the ring, causing out
of order arrival.
The only such error path is when virtio_net_hdr_from_skb fails due
to encountering an unknown GSO type.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
caifdevs->list is traversed using list_for_each_entry_rcu()
outside an RCU read-side critical section but under the
protection of rtnl_mutex. Hence, add the corresponding lockdep
expression to silence the following false-positive warning:
[ 10.868467] =============================
[ 10.869082] WARNING: suspicious RCU usage
[ 10.869817] 5.6.0-rc1-00177-g06ec0a154aae4 #1 Not tainted
[ 10.870804] -----------------------------
[ 10.871557] net/caif/caif_dev.c:115 RCU-list traversed in non-reader section!!
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When trying to transmit to an unknown destination, the mesh code would
unconditionally transmit a HWMP PREQ even if HWMP is not the current
path selection algorithm.
Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Link: https://lore.kernel.org/r/20200305140409.12204-1-cavallar@lri.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add missing attribute validation for NL80211_ATTR_OPER_CLASS
to the netlink policy.
Fixes: 1057d35ede5d ("cfg80211: introduce TDLS channel switch commands")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-4-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add missing attribute validation for beacon report scanning
to the netlink policy.
Fixes: 1d76250bd34a ("nl80211: support beacon report scanning")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-3-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add missing attribute validation for critical protocol fields
to the netlink policy.
Fixes: 5de17984898c ("cfg80211: introduce critical protocol indication from user-space")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-2-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When pktgen is used to measure the performance of dev_queue_xmit()
packet handling in the core, it is preferable to not hand down
packets to a low-level Ethernet driver as it would distort the
measurements.
Allow using pktgen on the loopback device, thus constraining
measurements to core code.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During IB device removal, cancel the event worker before the device
structure is freed.
Fixes: a4cf0443c414 ("smc: introduce SMC as an IB-client")
Reported-by: syzbot+b297c6825752e7a07272@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rafał found an issue that for non-Ethernet interface, if we down and up
frequently, the memory will be consumed slowly.
The reason is we add allnodes/allrouters addressed in multicast list in
ipv6_add_dev(). When link down, we call ipv6_mc_down(), store all multicast
addresses via mld_add_delrec(). But when link up, we don't call ipv6_mc_up()
for non-Ethernet interface to remove the addresses. This makes idev->mc_tomb
getting bigger and bigger. The call stack looks like:
addrconf_notify(NETDEV_REGISTER)
ipv6_add_dev
ipv6_dev_mc_inc(ff01::1)
ipv6_dev_mc_inc(ff02::1)
ipv6_dev_mc_inc(ff02::2)
addrconf_notify(NETDEV_UP)
addrconf_dev_config
/* Alas, we support only Ethernet autoconfiguration. */
return;
addrconf_notify(NETDEV_DOWN)
addrconf_ifdown
ipv6_mc_down
igmp6_group_dropped(ff02::2)
mld_add_delrec(ff02::2)
igmp6_group_dropped(ff02::1)
igmp6_group_dropped(ff01::1)
After investigating, I can't found a rule to disable multicast on
non-Ethernet interface. In RFC2460, the link could be Ethernet, PPP, ATM,
tunnels, etc. In IPv4, it doesn't check the dev type when calls ip_mc_up()
in inetdev_event(). Even for IPv6, we don't check the dev type and call
ipv6_add_dev(), ipv6_dev_mc_inc() after register device.
So I think it's OK to fix this memory consumer by calling ipv6_mc_up() for
non-Ethernet interface.
v2: Also check IFF_MULTICAST flag to make sure the interface supports
multicast
Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <zajec5@gmail.com>
Fixes: 74235a25c673 ("[IPV6] addrconf: Fix IPv6 on tuntap tunnels")
Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when set link down")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If a TCP socket is allocated in IRQ context or cloned from unassociated
(i.e. not associated to a memcg) in IRQ context then it will remain
unassociated for its whole life. Almost half of the TCPs created on the
system are created in IRQ context, so, memory used by such sockets will
not be accounted by the memcg.
This issue is more widespread in cgroup v1 where network memory
accounting is opt-in but it can happen in cgroup v2 if the source socket
for the cloning was created in root memcg.
To fix the issue, just do the association of the sockets at the accept()
time in the process context and then force charge the memory buffer
already used and reserved by the socket.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The mptcp rcvbuf size is adjusted according to the subflow rcvbuf size.
This should not be done if userspace did set a fixed value.
Fixes: 600911ff5f72bae ("mptcp: add rmem queue accounting")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- Avoid RCU list-traversal in spinlock, by Sven Eckelmann
- Replace zero-length array with flexible-array member,
by Gustavo A. R. Silva
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Don't schedule OGM for disabled interface, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In our production environment we have faced with problem that updating
classid in cgroup with heavy tasks cause long freeze of the file tables
in this tasks. By heavy tasks we understand tasks with many threads and
opened sockets (e.g. balancers). This freeze leads to an increase number
of client timeouts.
This patch implements following logic to fix this issue:
аfter iterating 1000 file descriptors file table lock will be released
thus providing a time gap for socket creation/deletion.
Now update is non atomic and socket may be skipped using calls:
dup2(oldfd, newfd);
close(oldfd);
But this case is not typical. Moreover before this patch skip is possible
too by hiding socket fd in unix socket buffer.
New sockets will be allocated with updated classid because cgroup state
is updated before start of the file descriptors iteration.
So in common cases this patch has no side effects.
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 105e808c1da2 ("pie: remove pie_vars->accu_prob_overflows")
changes the scale of probability values in PIE from (2^64 - 1) to
(2^56 - 1). This affects the precision of tc_pie_xstats->prob in
user space.
This patch ensures user space is unaffected.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add TCP_NLA_BYTES_NOTSENT to SCM_TIMESTAMPING_OPT_STATS that reports
bytes in the write queue but not sent. This is the same metric as
what is exported with tcp_info.tcpi_notsent_bytes.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow adding hashed UDP sockets to sockmaps.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-9-lmb@cloudflare.com
|
|
Add basic psock hooks for UDP sockets. This allows adding and
removing sockets, as well as automatic removal on unhash and close.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-8-lmb@cloudflare.com
|
|
We can take advantage of the fact that both callers of
sock_map_init_proto are holding a RCU read lock, and
have verified that psock is valid.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-7-lmb@cloudflare.com
|
|
The init, close and unhash handlers from TCP sockmap are generic,
and can be reused by UDP sockmap. Move the helpers into the sockmap code
base and expose them. This requires tcp_bpf_get_proto and tcp_bpf_clone to
be conditional on BPF_STREAM_PARSER.
The moved functions are unmodified, except that sk_psock_unlink is
renamed to sock_map_unlink to better match its behaviour.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-6-lmb@cloudflare.com
|
|
We need to ensure that sk->sk_prot uses certain callbacks, so that
code that directly calls e.g. tcp_sendmsg in certain corner cases
works. To avoid spurious asserts, we must to do this only if
sk_psock_update_proto has not yet been called. The same invariants
apply for tcp_bpf_check_v6_needs_rebuild, so move the call as well.
Doing so allows us to merge tcp_bpf_init and tcp_bpf_reinit.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-4-lmb@cloudflare.com
|
|
Only update psock->saved_* if psock->sk_proto has not been initialized
yet. This allows us to get rid of tcp_bpf_reinit_sk_prot.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-3-lmb@cloudflare.com
|
|
The sock map code checks that a socket does not have an active upper
layer protocol before inserting it into the map. This requires casting
via inet_csk, which isn't valid for UDP sockets.
Guard checks for ULP by checking inet_sk(sk)->is_icsk first.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-2-lmb@cloudflare.com
|
|
In commit 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and
fallback to priority") croup classid reporting was fixed. But this works
only for TCP sockets because for other socket types icsk parameter can
be NULL and classid code path is skipped. This change moves classid
handling to inet_diag_msg_attrs_fill() function.
Also inet_diag_msg_attrs_size() helper was added and addends in
nlmsg_new() were reordered to save order from inet_sk_diag_fill().
Fixes: 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and fallback to priority")
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Convert zones_lock spinlock to zones_mutex mutex,
and struct (tcf_ct_flow_table)->ref to a refcount,
so that control path can use regular GFP_KERNEL allocations
from standard process context. This is more robust
in case of memory pressure.
The refcount is needed because tcf_ct_flow_table_put() can
be called from RCU callback, thus in BH context.
The issue was spotted by syzbot, as rhashtable_init()
was called with a spinlock held, which is bad since GFP_KERNEL
allocations can sleep.
Note to developers : Please make sure your patches are tested
with CONFIG_DEBUG_ATOMIC_SLEEP=y
BUG: sleeping function called from invalid context at mm/slab.h:565
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 9582, name: syz-executor610
2 locks held by syz-executor610/9582:
#0: ffffffff8a34eb80 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
#0: ffffffff8a34eb80 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x3f9/0xad0 net/core/rtnetlink.c:5437
#1: ffffffff8a3961b8 (zones_lock){+...}, at: spin_lock_bh include/linux/spinlock.h:343 [inline]
#1: ffffffff8a3961b8 (zones_lock){+...}, at: tcf_ct_flow_table_get+0xa3/0x1700 net/sched/act_ct.c:67
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 0 PID: 9582 Comm: syz-executor610 Not tainted 5.6.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
___might_sleep.cold+0x1f4/0x23d kernel/sched/core.c:6798
slab_pre_alloc_hook mm/slab.h:565 [inline]
slab_alloc_node mm/slab.c:3227 [inline]
kmem_cache_alloc_node_trace+0x272/0x790 mm/slab.c:3593
__do_kmalloc_node mm/slab.c:3615 [inline]
__kmalloc_node+0x38/0x60 mm/slab.c:3623
kmalloc_node include/linux/slab.h:578 [inline]
kvmalloc_node+0x61/0xf0 mm/util.c:574
kvmalloc include/linux/mm.h:645 [inline]
kvzalloc include/linux/mm.h:653 [inline]
bucket_table_alloc+0x8b/0x480 lib/rhashtable.c:175
rhashtable_init+0x3d2/0x750 lib/rhashtable.c:1054
nf_flow_table_init+0x16d/0x310 net/netfilter/nf_flow_table_core.c:498
tcf_ct_flow_table_get+0xe33/0x1700 net/sched/act_ct.c:82
tcf_ct_init+0xba4/0x18a6 net/sched/act_ct.c:1050
tcf_action_init_1+0x697/0xa20 net/sched/act_api.c:945
tcf_action_init+0x1e9/0x2f0 net/sched/act_api.c:1001
tcf_action_add+0xdb/0x370 net/sched/act_api.c:1411
tc_ctl_action+0x366/0x456 net/sched/act_api.c:1466
rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5440
netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2478
netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:672
____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2343
___sys_sendmsg+0x100/0x170 net/socket.c:2397
__sys_sendmsg+0xec/0x1b0 net/socket.c:2430
do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4403d9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd719af218 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004403d9
RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000005 R09: 00000000004002c8
R10: 0000000000000008 R11: 00000000000
Fixes: c34b961a2492 ("net/sched: act_ct: Create nf flow table per zone")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paul Blakey <paulb@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot found an interesting case of the kernel reading
an uninit-value [1]
Problem is in the handling of ETH_P_WCCP in gre_parse_header()
We look at the byte following GRE options to eventually decide
if the options are four bytes longer.
Use skb_header_pointer() to not pull bytes if we found
that no more bytes were needed.
All callers of gre_parse_header() are properly using pskb_may_pull()
anyway before proceeding to next header.
[1]
BUG: KMSAN: uninit-value in pskb_may_pull include/linux/skbuff.h:2303 [inline]
BUG: KMSAN: uninit-value in __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94
CPU: 1 PID: 11784 Comm: syz-executor940 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x220 lib/dump_stack.c:118
kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
__msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
pskb_may_pull include/linux/skbuff.h:2303 [inline]
__iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94
iptunnel_pull_header include/net/ip_tunnels.h:411 [inline]
gre_rcv+0x15e/0x19c0 net/ipv6/ip6_gre.c:606
ip6_protocol_deliver_rcu+0x181b/0x22c0 net/ipv6/ip6_input.c:432
ip6_input_finish net/ipv6/ip6_input.c:473 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
ip6_input net/ipv6/ip6_input.c:482 [inline]
ip6_mc_input+0xdf2/0x1460 net/ipv6/ip6_input.c:576
dst_input include/net/dst.h:442 [inline]
ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:306
__netif_receive_skb_one_core net/core/dev.c:5198 [inline]
__netif_receive_skb net/core/dev.c:5312 [inline]
netif_receive_skb_internal net/core/dev.c:5402 [inline]
netif_receive_skb+0x66b/0xf20 net/core/dev.c:5461
tun_rx_batched include/linux/skbuff.h:4321 [inline]
tun_get_user+0x6aef/0x6f60 drivers/net/tun.c:1997
tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026
call_write_iter include/linux/fs.h:1901 [inline]
new_sync_write fs/read_write.c:483 [inline]
__vfs_write+0xa5a/0xca0 fs/read_write.c:496
vfs_write+0x44a/0x8f0 fs/read_write.c:558
ksys_write+0x267/0x450 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__ia32_sys_write+0xdb/0x120 fs/read_write.c:620
do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410
entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7f62d99
Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000fffedb2c EFLAGS: 00000217 ORIG_RAX: 0000000000000004
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000020002580
RDX: 0000000000000fca RSI: 0000000000000036 RDI: 0000000000000004
RBP: 0000000000008914 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82
slab_alloc_node mm/slub.c:2793 [inline]
__kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4401
__kmalloc_reserve net/core/skbuff.c:142 [inline]
__alloc_skb+0x2fd/0xac0 net/core/skbuff.c:210
alloc_skb include/linux/skbuff.h:1051 [inline]
alloc_skb_with_frags+0x18c/0xa70 net/core/skbuff.c:5766
sock_alloc_send_pskb+0xada/0xc60 net/core/sock.c:2242
tun_alloc_skb drivers/net/tun.c:1529 [inline]
tun_get_user+0x10ae/0x6f60 drivers/net/tun.c:1843
tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026
call_write_iter include/linux/fs.h:1901 [inline]
new_sync_write fs/read_write.c:483 [inline]
__vfs_write+0xa5a/0xca0 fs/read_write.c:496
vfs_write+0x44a/0x8f0 fs/read_write.c:558
ksys_write+0x267/0x450 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__ia32_sys_write+0xdb/0x120 fs/read_write.c:620
do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410
entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139
Fixes: 95f5c64c3c13 ("gre: Move utility functions to common headers")
Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, user who is adding an action expects HW to report stats,
however it does not have exact expectations about the stats types.
That is aligned with TCA_ACT_HW_STATS_TYPE_ANY.
Allow user to specify the type of HW stats for an action and require it.
Pass the information down to flow_offload layer.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce flow_action_basic_hw_stats_types_check() helper and use it
in drivers. That sanitizes the drivers which do not have support
for action HW stats types.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Patches to bump position index from sysctl seq_next,
from Vasilin Averin.
2) Release flowtable hook from error path, from Florian Westphal.
3) Patches to add missing netlink attribute validation,
from Jakub Kicinski.
4) Missing NFTA_CHAIN_FLAGS in nf_tables_fill_chain_info().
5) Infinite loop in module autoload if extension is not available,
from Florian Westphal.
6) Missing module ownership in inet/nat chain type definition.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Set owner to THIS_MODULE, otherwise the nft_chain_nat module might be
removed while there are still inet/nat chains in place.
[ 117.942096] BUG: unable to handle page fault for address: ffffffffa0d5e040
[ 117.942101] #PF: supervisor read access in kernel mode
[ 117.942103] #PF: error_code(0x0000) - not-present page
[ 117.942106] PGD 200c067 P4D 200c067 PUD 200d063 PMD 3dc909067 PTE 0
[ 117.942113] Oops: 0000 [#1] PREEMPT SMP PTI
[ 117.942118] CPU: 3 PID: 27 Comm: kworker/3:0 Not tainted 5.6.0-rc3+ #348
[ 117.942133] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[ 117.942145] RIP: 0010:nf_tables_chain_destroy.isra.0+0x94/0x15a [nf_tables]
[ 117.942149] Code: f6 45 54 01 0f 84 d1 00 00 00 80 3b 05 74 44 48 8b 75 e8 48 c7 c7 72 be de a0 e8 56 e6 2d e0 48 8b 45 e8 48 c7 c7 7f be de a0 <48> 8b 30 e8 43 e6 2d e0 48 8b 45 e8 48 8b 40 10 48 85 c0 74 5b 8b
[ 117.942152] RSP: 0018:ffffc9000015be10 EFLAGS: 00010292
[ 117.942155] RAX: ffffffffa0d5e040 RBX: ffff88840be87fc2 RCX: 0000000000000007
[ 117.942158] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffffffffa0debe7f
[ 117.942160] RBP: ffff888403b54b50 R08: 0000000000001482 R09: 0000000000000004
[ 117.942162] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8883eda7e540
[ 117.942164] R13: dead000000000122 R14: dead000000000100 R15: ffff888403b3db80
[ 117.942167] FS: 0000000000000000(0000) GS:ffff88840e4c0000(0000) knlGS:0000000000000000
[ 117.942169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 117.942172] CR2: ffffffffa0d5e040 CR3: 00000003e4c52002 CR4: 00000000001606e0
[ 117.942174] Call Trace:
[ 117.942188] nf_tables_trans_destroy_work.cold+0xd/0x12 [nf_tables]
[ 117.942196] process_one_work+0x1d6/0x3b0
[ 117.942200] worker_thread+0x45/0x3c0
[ 117.942203] ? process_one_work+0x3b0/0x3b0
[ 117.942210] kthread+0x112/0x130
[ 117.942214] ? kthread_create_worker_on_cpu+0x40/0x40
[ 117.942221] ret_from_fork+0x35/0x40
nf_tables_chain_destroy() crashes on module_put() because the module is
gone.
Fixes: d164385ec572 ("netfilter: nat: add inet family nat support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Currently passive MPTCP socket can skip including the DACK
option - if the peer sends data before accept() completes.
The above happens because the msk 'can_ack' flag is set
only after the accept() call.
Such missing DACK option may cause - as per RFC spec -
unwanted fallback to TCP.
This change addresses the issue using the key material
available in the current subflow, if any, to create a suitable
dack option when msk ack seq is not yet available.
v1 -> v2:
- adavance the generated ack after the initial MPC packet
Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is similar to commit 674d9de02aa7 ("NFC: Fix possible memory
corruption when handling SHDLC I-Frame commands") and commit d7ee81ad09f0
("NFC: nci: Add some bounds checking in nci_hci_cmd_received()") which
added range checks on "pipe".
The "pipe" variable comes skb->data[0] in nfc_hci_msg_rx_work().
It's in the 0-255 range. We're using it as the array index into the
hdev->pipes[] array which has NFC_HCI_MAX_PIPES (128) members.
Fixes: 118278f20aa8 ("NFC: hci: Add pipes table to reference them with a tuple {gate, host}")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Invoke ndo_setup_tc() as appropriate to signal init / replacement,
destroying and dumping of pFIFO / bFIFO Qdisc.
A lot of the FIFO logic is used for pFIFO_head_drop as well, but that's a
semantically very different Qdisc that isn't really in the same boat as
pFIFO / bFIFO. Split some of the functions to keep the Qdisc intact.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Linux supports 22 different interrupt coalescing parameters.
No driver implements them all. Some drivers just ignore the
ones they don't support, while others have to carry a long
list of checks to reject unsupported settings.
To simplify the drivers add the ability to specify inside
ethtool_ops which parameters are supported and let the core
reject attempts to set any other one.
This commit makes the mechanism an opt-in, only drivers which
set ethtool_opts->coalesce_types to a non-zero value will have
the checks enforced.
The same mask is used for global and per queue settings.
v3: - move the (temporary) check if driver defines types
earlier (Michal)
- rename used_types -> nonzero_params, and
coalesce_types -> supported_coalesce_params (Alex)
- use EOPNOTSUPP instead of EINVAL (Andrew, Michal)
Leaving the long series of ifs for now, it seems nice to
be able to grep for the field and flag names. This will
probably have to be revisited once netlink support lands.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the commit e0a4b99773d3 ("hsr: use upper/lower device infrastructure"),
dev_get() was removed but dev_put() in the error path wasn't removed.
So, if creating hsr interface command is failed, the reference counter leak
of lower interface would occur.
Test commands:
ip link add dummy0 type dummy
ip link add ipvlan0 link dummy0 type ipvlan mode l2
ip link add ipvlan1 link dummy0 type ipvlan mode l2
ip link add hsr0 type hsr slave1 ipvlan0 slave2 ipvlan1
ip link del ipvlan0
Result:
[ 633.271992][ T1280] unregister_netdevice: waiting for ipvlan0 to become free. Usage count = -1
Fixes: e0a4b99773d3 ("hsr: use upper/lower device infrastructure")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
nft will loop forever if the kernel doesn't support an expression:
1. nft_expr_type_get() appends the family specific name to the module list.
2. -EAGAIN is returned to nfnetlink, nfnetlink calls abort path.
3. abort path sets ->done to true and calls request_module for the
expression.
4. nfnetlink replays the batch, we end up in nft_expr_type_get() again.
5. nft_expr_type_get attempts to append family-specific name. This
one already exists on the list, so we continue
6. nft_expr_type_get adds the generic expression name to the module
list. -EAGAIN is returned, nfnetlink calls abort path.
7. abort path encounters the family-specific expression which
has 'done' set, so it gets removed.
8. abort path requests the generic expression name, sets done to true.
9. batch is replayed.
If the expression could not be loaded, then we will end up back at 1),
because the family-specific name got removed and the cycle starts again.
Note that userspace can SIGKILL the nft process to stop the cycle, but
the desired behaviour is to return an error after the generic expr name
fails to load the expression.
Fixes: eb014de4fd418 ("netfilter: nf_tables: autoload modules from the abort path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|