Linux mechanisms

  • The socket functions read/write and recv/send are used in substantially the same way; the latter pair simply takes an extra flags parameter (see the socket I/O functions). On a blocking socket, a read blocks waiting for data if the socket receive buffer is empty, and a write blocks waiting for buffer space to be freed if the socket send buffer does not have enough room for the data to be written. Both return as soon as the data has been copied to or from the buffer, but a successful write into the socket send buffer does not mean the data was delivered to the peer; for example, receiving a TCP RST tears down the connection and discards the buffered data. On a non-blocking socket, a read returns the EWOULDBLOCK error immediately if the receive buffer is empty; a write copies as many of the N bytes as the send buffer can hold and returns the number of bytes actually copied when there is some but not enough space, and returns EWOULDBLOCK when the send buffer has no space at all. The fcntl function can set the O_NONBLOCK flag on a file descriptor to make a socket non-blocking.
  • Blocking versus non-blocking is a property of the socket itself. Non-blocking sockets are typically used on servers to keep reads and writes from blocking threads. In Go, read/write appears blocking to the caller, but the underlying sockets are non-blocking, so many goroutines can be multiplexed over non-blocking I/O.

Reference:

On socket behavior in TCP/IP network programming

  • Linux process scheduling
  • IO multiplexing: a server uses select(), poll(), epoll(), etc. to monitor multiple descriptors at once; when a descriptor becomes ready (usually readable or writable), the program is notified and can perform the corresponding read or write. These mechanisms are still essentially synchronous IO.

Reference:

The differences among the I/O multiplexing calls select, poll, and epoll

  • Zero copy mainly reduces the number of copies between user space and kernel space. Common zero-copy techniques include mmap, sendfile, FileChannel, and DMA. A file being sent with sendfile cannot be modified by the user, while one mapped with mmap can. Starting with Linux 2.4, the kernel provides a scatter/gather DMA mode that reads data from the kernel buffer directly into the protocol engine, so the kernel no longer has to copy data into the socket buffer in kernel space; in this case a write only blocks when the buffer is full.

Reference:

On the zero-copy mechanisms under Linux

  • TCP
    • TCP's TIME_WAIT state serves two purposes:
      • It prevents leftover segments from a previous TCP connection (whose sequence numbers happen to fall within the valid range) from being accepted by a subsequent connection on the same address pair
      • It ensures that if the final ACK of the connection teardown is lost, the ACK can be retransmitted
  • False sharing: the byte-alignment principle
  • Linux network queues: the IP stack submits packets directly to the QDisc, which can apply policies to control traffic
    • BQL (Byte Queue Limits) automatically adjusts the length of the driver queue to prevent the bufferbloat latency caused by an oversized driver queue. TSO, GSO, UFO and GRO can increase the amount of data sitting in the driver queue, so these features can be turned off to reduce latency (at some cost in throughput).
    • QDisc (Queuing Disciplines) sits between the IP stack and the driver queue and implements traffic classification, prioritization, rate control, and so on. It can be configured with the tc command. There are three key concepts: qdiscs, classes, and filters
      • A qdisc queues traffic. Linux implements many qdiscs, each with its own queuing behavior. The qdisc interface allows queue management to be implemented without modifying the NIC driver or the IP stack. By default, each NIC is assigned a qdisc of type pfifo_fast.
      • Closely related to the qdisc is the class. An individual qdisc may implement its own classes to handle different kinds of traffic differently.
      • Filters are used to steer traffic to a particular qdisc or class.

  • TCP RTT and RTO
  • TCP congestion-avoidance algorithms: the default on current mainstream Linux is CUBIC, which can be checked with the ss -i command. TCP maintains both a sliding (receive) window and a congestion window. The amount of data the receiver can accept equals window_size * 2^tcp_window_scaling, and the receiver advertises its available buffer space to the peer in ACK packets; the congestion window can be read from the cwnd field of ss -i, and the data the sender has in flight may not exceed the minimum of cwnd and the receiver's advertised window.
The sliding window essentially describes the size of the receiver's data buffer, from which the TCP sender computes how much more data it can send. If the sender receives a TCP packet advertising a window size of 0, it stops sending data and waits until the receiver sends a datagram with a non-zero window size.
There are three terms associated with the sliding-window protocol:

Window closing: the left edge of the window moves toward the right edge; this happens as data is sent and acknowledged.
Window opening: the right edge of the window moves to the right; this happens after the receiving process has consumed acknowledged data.
Window shrinking: the right edge of the window moves to the left; this rarely happens.

The following looks at the Tahoe, Reno (fast retransmit and fast recovery), and CUBIC congestion algorithms. All three behave the same in the slow-start phase, using a timeout or duplicate acknowledgments to decide whether to switch to the congestion-avoidance phase; they differ in the congestion-avoidance phase itself. The Tahoe algorithm is described as follows (from TCP/IP Illustrated, Volume 1):

1) For a given connection, initialize cwnd to one segment and ssthresh to 65535 bytes.
2) The TCP output routine never sends more than the minimum of cwnd and the receiver's advertised window. Congestion avoidance is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver. The former is the sender's estimate of network congestion; the latter reflects the available buffer space at the receiver for this connection.
3) When congestion occurs (a timeout or the receipt of duplicate acknowledgments), ssthresh is set to one-half of the current window size (the minimum of cwnd and the receiver's advertised window, but at least two segments). Additionally, if the congestion was signaled by a timeout, cwnd is set to one segment (that is, slow start).
4) When new data is acknowledged by the other end, cwnd is increased, but the way it increases depends on whether we are performing slow start or congestion avoidance. If cwnd is less than or equal to ssthresh, we are doing slow start; otherwise, congestion avoidance. Slow start continues until we are halfway back to where we were when congestion occurred (since in step 3 we recorded half of the window size that got us into trouble), and then congestion avoidance takes over. Slow start begins with cwnd at one segment and increments it by one segment for each ACK received; this opens the window exponentially: send one segment, then two, then four, and so on. Congestion avoidance instead increases cwnd by 1/cwnd each time an ACK is received. Compared with slow start's exponential growth, this is additive increase: we want cwnd to grow by at most one segment per round-trip time (no matter how many ACKs arrive within that RTT), whereas slow start grows cwnd by the number of ACKs received per round-trip time.

The figure (from CSDN) shows that the Tahoe algorithm sets cwnd to 1 whenever congestion occurs, which greatly reduces transmission efficiency.

Reno improves on Tahoe with the fast-retransmit and fast-recovery algorithms:

Fast retransmit
The fast-retransmit algorithm first requires the receiver to send a duplicate acknowledgment immediately whenever it receives an out-of-order segment (so that the sender learns as soon as possible that a segment did not arrive), rather than waiting to piggyback the acknowledgment on outgoing data.
Suppose the receiver has received M1 and M2 and acknowledged each. Now assume M3 is lost but M4 arrives. The receiver cannot acknowledge M4, because M4 was received out of order. Under the basic reliable-transfer rules, the receiver could do nothing, or send another acknowledgment of M2 at a convenient time; but under fast retransmit it must immediately send a duplicate acknowledgment of M2, letting the sender learn early that M3 did not arrive. The sender then sends M5 and M6, and on receiving each of them the receiver again issues a duplicate acknowledgment of M2. The sender has now received four acknowledgments of M2, the last three being duplicates. Fast retransmit specifies that as soon as the sender receives three duplicate acknowledgments, it should immediately retransmit the segment the receiver has not yet received (M3), without waiting for M3's retransmission timer to expire. Because the sender retransmits unacknowledged segments early, fast retransmit can increase overall network throughput by about 20%.
Fast recovery
The fast-recovery algorithm is used together with fast retransmit, and it has two key points:
When the sender receives three consecutive duplicate acknowledgments, it performs the "multiplicative decrease" algorithm, halving the slow-start threshold ssthresh. This is done to forestall network congestion. Note that the slow-start algorithm is not executed next.
Since the sender now believes the network is probably not congested, it does not run slow start (that is, it does not set the congestion window cwnd to 1); instead it sets cwnd to the halved ssthresh value and then begins the congestion-avoidance ("additive increase") algorithm, so that the congestion window grows slowly and linearly.


As can be seen, Reno does not drop cwnd to 1 when congestion occurs, which improves transmission efficiency, and the fast-retransmit and fast-recovery mechanisms help detect congestion sooner.

Reno's congestion-avoidance phase still increases linearly, whereas CUBIC, which evolved from the BIC algorithm, uses a binary-search-style growth of the congestion window during congestion avoidance to find the optimal window, converging faster than linear increase. The figure comes from the paper.

The two figures below show congestion avoidance under Reno (first) and under CUBIC (second); CUBIC's congestion window can be seen to approach the optimum faster.

  • For TCP communication on the same host, jumbo packets may be used, with an MSS much larger than the usual 1460 bytes. To avoid the overhead of creating and transmitting a large number of MTU-sized packets in this case, Linux implements TSO, UFO and GSO; see the description below:
In order to avoid the overhead associated with a large number of packets on the transmit path, the Linux kernel implements several optimizations: TCP segmentation offload (TSO), 
UDP fragmentation offload (UFO) and generic segmentation offload (GSO). All of these optimizations allow the IP stack to create packets which are larger than the MTU of the outgoing NIC. 
For IPv4, packets as large as the IPv4 maximum of 65,536 bytes can be created and queued to the driver queue. 
In the case of TSO and UFO, the NIC hardware takes responsibility for breaking the single large packet into packets small enough to be transmitted on the physical interface. For NICs without hardware support, 
GSO performs the same operation in software immediately before queueing to the driver queue.
  • Linux namespaces:
    • mount namespace: isolates /proc/[pid]/mounts, /proc/[pid]/mountinfo, and /proc/[pid]/mountstats, so the mount command inside a container shows only the container's own mount information
    • pid namespace: by mounting a fresh /proc, the ps -ef command inside the container shows only the container's internal processes
    • cgroup namespace: the cgroup of a containerized process can be viewed from inside the container via /proc/$pid/cgroup, and from there the corresponding cgroup directory can be found
    • network namespace: if two processes belong to different network namespaces, their /proc/$pid/net contents differ; if they belong to the same namespace, their /proc/$pid/net contents are the same. Besides the /proc/$pid/net directory, the network namespace also isolates /sys/class/net, /proc/sys/net, sockets, and so on.

Origin www.cnblogs.com/charlieroro/p/11436331.html