The TCP SACK panic

LWN.net, https://lwn.net/Articles/791409/

Selective acknowledgment (SACK) is a technique used by TCP to help alleviate congestion that can arise due to the retransmission of dropped packets. It allows the endpoints to describe which pieces of the data they have received, so that only the missing pieces need to be retransmitted. However, a bug was recently found in the Linux implementation of SACK that allows remote attackers to panic the system by sending crafted SACK information.

Data sent via TCP is broken up into multiple segments based on the maximum segment size (MSS) specified by the other endpoint, or by some other network hardware along the path. Those segments are transmitted to that endpoint, which acknowledges that it has received them. Originally, those acknowledgments (ACKs) could only indicate that the receiver had gotten every segment up to the first gap; if one early segment was lost (e.g. dropped due to congestion), the receiver could only ACK the segments before the lost one. The status of the later segments is unknown to the sender, so it has to retransmit all of them, including many that were actually received.

In simplified form, sender A might send segments 20-50, with segments 23 and 37 getting dropped along the way. Receiver B can only ACK segments 20-22, so A must send 23-50 again. As might be guessed, if the link is congested such that segments are being dropped, sending a bunch of potentially redundant traffic is not going to help things.
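The simplified example above can be sketched in a few lines of Python. This is only an illustration of the cumulative-ACK rule using whole segment numbers; real TCP acknowledges byte sequence numbers, and the function name is made up for this sketch.

```python
def cumulative_ack_retransmit(sent, dropped):
    """Model plain cumulative ACKs (illustrative only).

    sent:    ordered list of segment numbers transmitted by A
    dropped: set of segment numbers lost on the way to B
    Returns (highest segment B can ACK, segments A must resend).
    """
    # B can only acknowledge up to the first gap in what it received.
    first_gap = next((s for s in sent if s in dropped), None)
    if first_gap is None:
        return sent[-1], []  # nothing lost, nothing to resend
    acked = [s for s in sent if s < first_gap]
    highest_acked = acked[-1] if acked else None
    # Everything from the gap onward has unknown status, so it is all resent.
    resend = [s for s in sent if s >= first_gap]
    return highest_acked, resend


sent = list(range(20, 51))   # segments 20-50
dropped = {23, 37}
highest_acked, resend = cumulative_ack_retransmit(sent, dropped)
# Segments that were resent even though B already had them.
redundant = [s for s in resend if s not in dropped]
```

In this model, 28 segments are retransmitted to recover two losses, so 26 of them are redundant.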

Selective acknowledgment was created as a mechanism to eliminate the redundant traffic. It was specified in 1996 in RFC 2018. The idea is that receiver B can ACK 20-22, 24-36, and 38-50 so that A need only resend 23 and 37. It seems like common sense at some level; if someone read off a string of 30 words and you missed the third, you wouldn't ask them to repeat the list starting at the third word.
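As a rough sketch of the same idea in code, the receiver's view can be reduced to a list of contiguous ranges, from which the sender works out what is missing. The function names are invented for this example, and it deliberately ignores real-world details: TCP uses byte sequence numbers rather than segment numbers, and a single ACK can carry only a few SACK blocks.

```python
def received_ranges(sent, dropped):
    """Group the segments B received into contiguous (first, last) ranges."""
    ranges = []
    for seg in sent:
        if seg in dropped:
            continue
        if ranges and ranges[-1][1] == seg - 1:
            # Extends the current contiguous range.
            ranges[-1] = (ranges[-1][0], seg)
        else:
            # A gap was seen, so start a new range.
            ranges.append((seg, seg))
    return ranges


def missing_segments(sent, ranges):
    """What A must resend once it knows which ranges B holds."""
    received = set()
    for first, last in ranges:
        received.update(range(first, last + 1))
    return [s for s in sent if s not in received]


sent = list(range(20, 51))            # segments 20-50
ranges = received_ranges(sent, {23, 37})
to_resend = missing_segments(sent, ranges)
```

With the ranges in hand, only the two lost segments need to go out again.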

In order to keep track of all of that, the network subsystem has some bookkeeping to do. It is in this record keeping that the bug was found.

The struct sk_buff (typically called an SKB) is a kernel data structure that is used to hold network data of various sorts, including for transmit queues, receive queues, SACK queues, and more. For reference, networking maintainer David Miller has a nice overview (if somewhat dated) of how SKBs are used in the kernel. Part of the bookkeeping for TCP is to keep track of the 32KB (64KB on PowerPC) fragments that the TCP data stream has been broken up into; it is in the interaction between fragments and SACK where the kernel went astray.

The struct tcp_skb_cb is a control buffer that tracks various things about a TCP packet, including the number of segments/fragments it has been broken up into. It does so for the generic segmentation offload (GSO) feature, which moves the segmentation of the packets as low as it can in the network stack, including possibly offloading it to the network hardware. The number of segments is stored in the tcp_gso_segs field, which is a two-byte unsigned integer. That works fine as long as the number of segments doesn't go beyond 65,535, the largest value that 16 bits can hold.
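To make the limit of a two-byte unsigned counter concrete, here is a small Python model of 16-bit arithmetic. It is not kernel code; the helper names are hypothetical and the masking simply imitates how an unsigned 16-bit field wraps around.

```python
U16_MAX = 0xFFFF  # 65,535: the largest value a two-byte unsigned integer holds


def add_u16(a, b):
    """Add two counts as a 16-bit unsigned field would, wrapping on overflow."""
    return (a + b) & U16_MAX


def sum_fits_in_u16(a, b):
    """True when the combined count can still be represented in 16 bits."""
    return a + b <= U16_MAX


wrapped = add_u16(40000, 30000)          # true sum is 70,000
fits = sum_fits_in_u16(40000, 30000)
```

Once the true sum passes 65,535, the stored value no longer reflects the real segment count, which is why any code that adds to such a field needs to check the result first.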

But that is just what can happen when SACK has been agreed upon by the endpoints, which is done when the connection is established. An attacker can use a small MSS value (perhaps the minimum of 48 bytes, which only leaves eight bytes for actual user data) and cause an overflow of tcp_gso_segs by carefully choosing which segments to acknowledge. Multiple SKBs will be coalesced by the kernel in order to more efficiently process blocks of unacknowledged segments, but doing so could overflow tcp_gso_segs.
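A back-of-the-envelope calculation shows why so small an MSS matters. The 32KB fragment size and the eight bytes of user data per segment come from the text above; the figure of 17 fragments per SKB is an assumption brought in from outside this article (it is the usual limit on systems with 4KB pages), so treat the result as illustrative.

```python
FRAGS_PER_SKB = 17            # ASSUMPTION: not stated in the article
FRAG_BYTES = 32 * 1024        # 32KB fragments, as described in the article
PAYLOAD_PER_SEGMENT = 8       # user data left per segment with an MSS of 48
U16_MAX = 0xFFFF              # capacity of the two-byte tcp_gso_segs field

# How many tiny segments a completely full SKB would represent.
bytes_per_skb = FRAGS_PER_SKB * FRAG_BYTES
segments = bytes_per_skb // PAYLOAD_PER_SEGMENT

overflows = segments > U16_MAX
```

Under these assumptions a full SKB stands for 69,632 segments, which is more than a 16-bit counter can record.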

That overflow would cause a BUG_ON() in tcp_shifted_skb() to be hit, leading to a kernel panic. This was the most serious of four SACK-related bugs found by Jonathan Looney at Netflix. Two other Linux bugs were reported, both leading to a SACK slowdown or excessive resource use, which could also lead to a denial of service. There is also a SACK slowness problem that Looney identified in FreeBSD 12 when using the RACK TCP stack. Netflix contributed RACK to FreeBSD just over a year ago.

The SACK panic has been designated as CVE-2019-11477; it is clearly the most severe of the Linux problems. CVE-2019-11478 is another denial of service; by crafting a sequence of SACKs, an attacker can cause fragmentation of the TCP transmission queue, leading to higher resource use. CVE-2019-11479 points out that the minimum MSS for Linux is hard-coded at 48, which means that a much larger amount of CPU, memory, and bandwidth could be consumed in sending relatively small amounts of user data. The fix for that is to give the administrator a sysctl knob (net.ipv4.tcp_min_snd_mss) to set the minimum value for MSS that the kernel will accept; it is left at 48 by default for compatibility, but it can now be easily changed.
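The cost of a tiny MSS can be illustrated by counting segments. In the sketch below, the eight-byte payload corresponds to the MSS of 48 discussed earlier in the article; the 1448-byte payload and the 1MB transfer size are arbitrary comparison values chosen for this example, not figures from the advisories.

```python
def segments_needed(data_bytes, payload_per_segment):
    """Number of segments required to carry data_bytes (ceiling division)."""
    return -(-data_bytes // payload_per_segment)


DATA_BYTES = 1_000_000                     # hypothetical transfer size
tiny = segments_needed(DATA_BYTES, 8)      # 8 bytes of user data per segment
typical = segments_needed(DATA_BYTES, 1448)  # ASSUMPTION: a common payload size
ratio = tiny / typical
```

Every one of those extra segments carries its own headers and needs its own processing, which is where the additional CPU, memory, and bandwidth go.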

The problems have been addressed; the Netflix report has links to the individual patches. Those patches were released as part of the 5.1.11, 4.19.52, 4.14.127, 4.9.182, and 4.4.182 stable updates that were made on June 17, the same day as the embargo was lifted. Distribution kernels have largely been updated at this point, so those who can upgrade should probably do so.
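For kernels built directly from the stable trees, the version numbers above can be turned into a quick check. This is a minimal sketch with hypothetical function names; it says nothing reliable about distribution kernels, which usually backport fixes without changing the upstream version number, and it returns None for any series not listed in the article.

```python
import re

# First fixed release in each stable series named in the article.
FIRST_FIXED = {(5, 1): 11, (4, 19): 52, (4, 14): 127, (4, 9): 182, (4, 4): 182}


def parse_release(release):
    """Turn a string such as '4.19.52-generic' into (4, 19, 52)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m[1]), int(m[2]), int(m[3] or 0)


def has_stable_fix(release):
    """True/False for the listed stable series, None if the series is unknown."""
    major, minor, patch = parse_release(release)
    needed = FIRST_FIXED.get((major, minor))
    if needed is None:
        return None
    return patch >= needed
```

The running kernel's release string is available from platform.release() in the standard library, so has_stable_fix(platform.release()) gives a first hint; the distributor's advisory remains the authoritative answer.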

There are various mitigations for those unable to update right away. Restricting the MSS to a reasonable value using iptables or via other means will thwart these attacks, but for CVE-2019-11477 and CVE-2019-11478 those mitigations are only effective if MTU probing is also disabled by setting the net.ipv4.tcp_mtu_probing sysctl to 0. Either of those CVEs can instead be thwarted by turning off SACK entirely by setting /proc/sys/net/ipv4/tcp_sack to 0. To avoid CVE-2019-11479, administrators simply need to filter out MSS values that are too low using one of the methods listed by Netflix.
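The two sysctl settings mentioned above can be inspected by reading their files under /proc/sys. The following is a small read-only sketch; the function names and the proc_root parameter are inventions of this example (the parameter exists only so the code can be pointed at a test directory), and it does not check whether any MSS filtering is in place.

```python
from pathlib import Path


def read_sysctl(name, proc_root="/proc/sys"):
    """Read an integer sysctl such as 'net.ipv4.tcp_sack'; None if unavailable."""
    path = Path(proc_root) / name.replace(".", "/")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def mitigation_status(proc_root="/proc/sys"):
    """Report whether SACK and MTU probing are currently switched off."""
    sack = read_sysctl("net.ipv4.tcp_sack", proc_root)
    probing = read_sysctl("net.ipv4.tcp_mtu_probing", proc_root)
    return {
        "sack_disabled": sack == 0,
        "mtu_probing_disabled": probing == 0,
    }
```

Calling mitigation_status() on a Linux system returns a dictionary of two booleans; a value that cannot be read is reported as not disabled, which errs on the side of caution.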

The Red Hat vulnerability report has lots of useful details, as does the Netflix report mentioned above. A remotely triggerable kernel crash is obviously a fairly nasty surprise with potentially wide-ranging impact. It is only the endpoints of a connection that are affected, however, which limits the damage somewhat. At least the servers and desktops can be updated, which may not be true of all the gear our traffic visits on the way to its destination.

