Efficiently Building vivo's Enterprise-Level Network Traffic Analysis System

Author: Vivo Internet Server Team - Ming Yujia

As networks grow rapidly, network health directly affects an enterprise's daily revenue: every second of failure costs users and money. Quickly discovering network problems and locating abnormal traffic has therefore become a priority for large enterprises, and many network traffic analysis technologies have emerged in response.

1. Overview

Enterprises are constantly improving their network monitoring, but building a monitoring system inevitably runs into the following difficulties:

  1. Huge volumes of network traffic data: Because network traffic is large in scale and complex, it is hard to monitor and analyze such large amounts of data effectively.

  2. High cost of collecting and analyzing traffic data: Obtaining accurate traffic data requires efficient collection technology, large-capacity storage, and substantial development resources, which drives monitoring costs up sharply.

  3. Limited monitoring methods and poor scalability: Traditional monitoring covers only a few fixed data points and is hard to customize and extend for different network environments.

  4. Difficulty locating and solving problems quickly: Because traffic data is voluminous and changes frequently, finding the root cause of a problem often takes a great deal of time and effort.

Therefore, quickly discovering network problems and locating abnormal traffic at the lowest possible monitoring cost has become a priority for large enterprises, and many network traffic analysis technologies have emerged in response.

sFlow is one such efficient and flexible solution. It uses traffic sampling to extract part of the information in data packets, which makes continuous monitoring of large volumes of network traffic feasible. sFlow is also flexible to configure and scalable: it can be customized to actual needs and supports a wide range of network devices and protocols. These advantages have made sFlow widely used in modern network monitoring and management.

2. Common network traffic collection technology

Mainstream network traffic collection falls into two types: full traffic collection and sampled traffic collection.

2.1 Full traffic collection

Full traffic collection includes methods such as port mirroring and optical splitters. In a network with huge traffic, port mirroring not only increases the latency of the whole link but also adds load to network devices under high throughput. Optical splitters reduce link latency, but they come with a high purchase price. In addition, because the IDCs of large enterprises are large, the volume of full-traffic data grows sharply. Analyzing full-traffic data with a self-developed system requires substantial storage and computing resources as well as a software development cycle, which works against building the project quickly.

2.2 Sampled traffic collection

When no traffic analysis system exists yet, the advantages of sampled analysis stand out. Compared with full traffic collection, it is cheap to deploy and cheap to analyze, which makes it well suited to quickly locating abnormal traffic and analyzing trends and proportions in the network. The two main sampling methods are sFlow and NetFlow.

Of the two, sFlow covers a wider range of traffic. In an IDC internal environment that meets the hardware requirements, sampling with sFlow effectively reduces the load on network devices and provides real-time traffic monitoring for dealing with sudden network anomalies.

3. System design based on sFlow

3.1 Basic design

If the hardware requirements are met, the basic sFlow-based design is very simple: sFlow agent + sFlow collector + sFlow analyzer together close the data loop for the entire process.

sFlow agent: Enable sFlow on the relevant network devices, set parameters such as the sampling ratio, and specify the collector address; the ports then sample the traffic they send and receive. The more important question on the agent side is which network devices to collect from. Rather than enabling sFlow indiscriminately on every device, it is more meaningful to enable it on the core devices at the network border, because all external traffic must eventually pass through them. This monitors external traffic anomalies well while also reducing the data storage burden.

sFlow collector: Receives and parses the sFlow datagrams sampled and sent by the agents.

sFlow analyzer: Visualizes and displays the formatted data so that network administrators can observe and analyze it effectively.

3.2 Open source + self-developed: advanced architecture

With the basic architecture settled, the next questions are how to select components and how to add customized functions. The open source solution ElastiFlow provides a good example, and we extended it to support more customized functions.

sFlow agent: Port traffic samples are reported to a unified VIP (the sampling ratio officially needs to be 2^n). The VIP's load balancing (LB) capability distributes sFlow packets evenly to a fixed port on the collectors. Setting different sampling ratios for different network lines keeps accuracy higher on important lines while reducing data storage.
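
To see why per-line sampling ratios matter, note that at a 1-in-N ratio each sampled packet stands in for roughly N packets, so volumes have to be scaled back up before lines with different ratios are compared. The following is a minimal sketch of that scaling, not the production logic; the function names and the sample values are assumptions made for illustration.

```python
def is_power_of_two(ratio: int) -> bool:
    """Check the 2^n constraint on sampling ratios mentioned above."""
    return ratio > 0 and (ratio & (ratio - 1)) == 0


def estimate_bytes(samples):
    """Estimate real traffic volume from sampled packets.

    samples: iterable of (packet_size_bytes, sampling_ratio) pairs.
    Each sample represents about sampling_ratio packets of similar size.
    """
    return sum(size * ratio for size, ratio in samples)


# Hypothetical samples: an important line sampled 1-in-1024,
# a less important line sampled 1-in-8192.
important_line = [(1500, 1024), (64, 1024)]
other_line = [(1500, 8192)]
print(estimate_bytes(important_line))
print(estimate_bytes(other_line))
```

A lower ratio (more samples) gives a smoother estimate for the same traffic, at the price of more records to store.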

sFlow collector: Using the ELK suite for data collection and visual analysis is one of the more mature technical solutions, so on the collection side we use Logstash to collect and parse the raw packets. ElastiFlow's author uses Logstash's native UDP sFlow packet parsing component. In our tests, however, although this approach yields a well-structured data format, its parsing performance is very poor: under heavy data volume it drops a large number of packets, which reduces data accuracy. sflowtool, by contrast, performs very well because it is written in C; a single physical machine (32 cores, 64 GB of memory) can reach 100,000+ TPS. The data it produces from sFlow packets is less structured, but it can be cleaned and structured in the downstream analysis module. An example of data parsed by sflowtool is shown below. Logstash then sends the data to a Kafka message queue.

[root@server src]# ./sflowtool -l
FLOW,10.0.0.254,0,0,00902773db08,001083265e00,0x0800,0,0,10.0.0.1,10.0.0.254,17,0x00,64,35690,161,0x00,143,125,80
The fields after FLOW are, in order:
agent_address
inputPort
outputPort
src_MAC
dst_MAC
ethernet_type
in_vlan
out_vlan
src_IP
dst_IP
IP_protocol
ip_tos
ip_ttl
udp_src_port OR tcp_src_port OR icmp_type
udp_dst_port OR tcp_dst_port OR icmp_code
tcp_flags
packet_size
IP_size
sampling_rate
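
The cleaning step in the downstream module can start by turning each FLOW line into a named record. The sketch below assumes exactly the 19-field layout listed above (other sflowtool versions may emit a different layout); the function name and the combined names for the two port/ICMP fields are illustrative choices, not part of sflowtool.

```python
FLOW_FIELDS = [
    "agent_address", "inputPort", "outputPort", "src_MAC", "dst_MAC",
    "ethernet_type", "in_vlan", "out_vlan", "src_IP", "dst_IP",
    "IP_protocol", "ip_tos", "ip_ttl",
    "src_port_or_icmp_type", "dst_port_or_icmp_code",
    "tcp_flags", "packet_size", "IP_size", "sampling_rate",
]

# Fields printed as decimal integers and as 0x-prefixed hex, respectively.
DECIMAL_FIELDS = {
    "inputPort", "outputPort", "in_vlan", "out_vlan", "IP_protocol",
    "ip_ttl", "src_port_or_icmp_type", "dst_port_or_icmp_code",
    "packet_size", "IP_size", "sampling_rate",
}
HEX_FIELDS = {"ethernet_type", "ip_tos", "tcp_flags"}


def parse_flow_line(line):
    """Parse one 'FLOW,...' line into a dict, or return None if it does not match."""
    parts = line.strip().split(",")
    if parts[0] != "FLOW" or len(parts) != len(FLOW_FIELDS) + 1:
        return None
    record = dict(zip(FLOW_FIELDS, parts[1:]))
    try:
        for name in DECIMAL_FIELDS:
            record[name] = int(record[name])
        for name in HEX_FIELDS:
            record[name] = int(record[name], 16)
    except ValueError:
        return None  # malformed numeric field
    return record


sample = ("FLOW,10.0.0.254,0,0,00902773db08,001083265e00,0x0800,0,0,"
          "10.0.0.1,10.0.0.254,17,0x00,64,35690,161,0x00,143,125,80")
record = parse_flow_line(sample)
print(record["src_IP"], record["dst_IP"], record["sampling_rate"])
```

Returning None for lines that do not match lets the caller skip non-FLOW output without raising.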

sFlow analyzer: Consumes data from Kafka in real time, cleans and structures it, and uses third-party metadata to attach software-defined attributes to the parsed data for subsequent storage and analysis.

Database + display: Elasticsearch and Kibana are used for storage and visualization, and Metricbeat monitors Logstash's collection performance. As a BI-style data visualization solution, Kibana provides most of the charts and dashboards needed for visual analysis free of charge.

3.3 Software-defined analysis

With the raw data alone, we can already do basic session traffic analysis based on the IP five-tuple and similar fields. But traffic data can deliver far more value than that: combining it with other in-house platforms such as the CMDB makes it much more useful.

Network device dimension: Using the switch address and the inbound and outbound ports in the data, together with the collected switch port index configuration, we can determine whether traffic is inbound or outbound. Other attributes such as channel, line, and device name can also be assigned based on the network device IP.
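
The port-based direction check can be sketched as a lookup keyed by (switch IP, port index). The table contents, the role name "uplink", and the function name below are hypothetical; in practice the table would be built from the collected switch port configuration.

```python
# Hypothetical mapping: (switch IP, port index) -> attributes of that port.
PORT_INFO = {
    ("10.0.0.254", 1): {"role": "uplink", "line": "isp-a", "device": "border-sw-01"},
    ("10.0.0.254", 2): {"role": "internal", "line": "core", "device": "border-sw-01"},
}


def classify(agent_ip, input_port, output_port, port_info=PORT_INFO):
    """Label a sampled flow with direction, line, and device name."""
    src = port_info.get((agent_ip, input_port), {})
    dst = port_info.get((agent_ip, output_port), {})
    if src.get("role") == "uplink":
        direction, port = "inbound", src    # entered through an external-facing port
    elif dst.get("role") == "uplink":
        direction, port = "outbound", dst   # leaves through an external-facing port
    else:
        direction, port = "internal", dst
    return {
        "direction": direction,
        "line": port.get("line"),
        "device": port.get("device"),
    }


print(classify("10.0.0.254", 1, 2))
```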

IP dimension: The IP five-tuple opens up more ways to explore the data. From an IP address we can determine ownership information such as project and department, and we can also reverse-map it to domain names. This makes it possible to quickly identify the responsible business team when analyzing abnormal traffic, which greatly improves operations efficiency.
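
One way to attach ownership to an IP is a longest-prefix match against subnet records exported from the CMDB. This is only a sketch under that assumption: the subnets, project names, and department names below are made up, and a real CMDB may expose per-IP records instead.

```python
import ipaddress

# Hypothetical CMDB export: subnet -> ownership attributes.
CMDB_SUBNETS = {
    "10.0.0.0/16": {"project": "shared-infra", "department": "ops"},
    "10.0.1.0/24": {"project": "app-store", "department": "internet-services"},
}

# Parse the subnets once so each lookup only does containment checks.
_NETWORKS = [(ipaddress.ip_network(cidr), info) for cidr, info in CMDB_SUBNETS.items()]


def lookup_owner(ip, networks=_NETWORKS):
    """Return the attributes of the most specific subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, info in networks:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, info)
    return best[1] if best else None


print(lookup_owner("10.0.1.25"))
print(lookup_owner("10.0.9.9"))
```

A linear scan is fine for a small table; with many subnets, a prefix tree or a cached per-IP result would be the natural next step.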

3.4 Compressed storage and self-developed visualization

Elasticsearch's own data compression is not ideal, so data stored over long periods becomes large and bloated. Druid, an OLAP database, handles this problem well. Because the sampled data is strictly structured on the analysis side, it compresses well in Druid. Druid's built-in pre-aggregation capability also helps reduce the granularity of historical data and relieve storage pressure. Switching the storage engine means Kibana can no longer be used for general-purpose display, but a self-developed web service framework can handle flexible requirements and support more customized analysis.
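
The idea behind pre-aggregation can be illustrated in a few lines: rows that share the same time bucket and dimension values are merged, and only their metrics are summed. The sketch below imitates that idea in plain Python and does not use Druid's API; the record keys (ts, idc, direction, bytes) are assumed names for illustration.

```python
from collections import defaultdict


def rollup(records, granularity_s=60):
    """Merge records that share a time bucket and dimensions, summing metrics."""
    buckets = defaultdict(lambda: {"bytes": 0, "samples": 0})
    for r in records:
        bucket_ts = r["ts"] - r["ts"] % granularity_s   # truncate to bucket start
        key = (bucket_ts, r["idc"], r["direction"])
        buckets[key]["bytes"] += r["bytes"]
        buckets[key]["samples"] += 1
    return dict(buckets)


# Hypothetical enriched records (timestamps in seconds).
records = [
    {"ts": 120, "idc": "idc-a", "direction": "inbound", "bytes": 1000},
    {"ts": 150, "idc": "idc-a", "direction": "inbound", "bytes": 500},
    {"ts": 185, "idc": "idc-a", "direction": "inbound", "bytes": 200},
]
for key, metrics in sorted(rollup(records).items()):
    print(key, metrics)
```

Coarser granularity and fewer dimensions mean fewer stored rows, which is where the storage saving on historical data comes from.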

3.5 Lightweight stream processing model based on Celery

Although the traffic data has been sampled and refined, the overall data volume is still huge. Efficient, fast stream processing that keeps end-to-end system delay within 30s helps network managers find problems sooner. Besides traditional stream processing tools, Celery can also be used to build a lightweight, efficient, and easily extensible distributed stream processing cluster.

Celery is a simple, flexible, and reliable distributed system for processing large numbers of messages. It focuses on asynchronous task queues for real-time processing and also supports task scheduling. Building on Celery's real-time asynchronous processing, we designed a consumption chain of Celery beat → watcher → producer → consumer for stream processing.

Celery beat: Acts as the trigger for scheduled tasks, dispatching a new task to the watcher queue every 1s.

watcher worker: Takes a task from its queue and forwards it to the producer, applying congestion control to the producer queue according to the configured maximum queue length.

producer worker: Takes a task from its queue, fetches the collected traffic data from Kafka, sends it to the consumer queue in batches of the configured batch size, and applies congestion control to the consumer queue according to the configured maximum queue length.

consumer worker: Takes a task from its queue, uses the business information in the local or shared cache to clean and label the collected data, and writes the result to another Kafka queue or directly into the database.

Each role and node communicates through the Celery broker, which enables distributed cluster deployment. Consumer workers can be started in coroutine mode with eventlet to sustain highly concurrent consumption across the cluster.
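
The congestion control between the stages can be illustrated without a broker. The sketch below replaces the Celery queues with small in-memory bounded queues so that it runs on the standard library alone; it is not Celery code, and the class name, function names, and the choice to skip a tick when the next queue is full are all assumptions made for illustration.

```python
from collections import deque


class BoundedQueue:
    """In-memory stand-in for a broker queue with a maximum length."""

    def __init__(self, max_len):
        self.items = deque()
        self.max_len = max_len

    def has_room(self):
        return len(self.items) < self.max_len

    def put(self, item):
        if not self.has_room():
            return False          # congested: caller decides what to do
        self.items.append(item)
        return True

    def get(self):
        return self.items.popleft() if self.items else None


def watcher_step(watcher_q, producer_q):
    """Forward one tick to the producer queue; skip it if that queue is full."""
    tick = watcher_q.get()
    if tick is None:
        return False
    return producer_q.put(tick)


def producer_step(producer_q, consumer_q, fetch, batch_size):
    """On a tick, pull batches from the source while the consumer queue has room."""
    if producer_q.get() is None:
        return 0
    sent = 0
    while consumer_q.has_room():
        batch = fetch(batch_size)   # fetch stands in for reading from Kafka
        if not batch:
            break
        consumer_q.put(batch)
        sent += 1
    return sent


def consumer_step(consumer_q, label):
    """Clean and label one batch; the result would go to storage."""
    batch = consumer_q.get()
    return [label(record) for record in batch] if batch else []
```

Because the producer only fetches while the consumer queue has room, unprocessed data stays in the source rather than piling up in the broker.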

4. Application scenarios

4.1 Traffic analysis by data center

By matching IPs against the network CMDB, flow data can be aggregated per data center to obtain the overall external inbound and outbound traffic of each one. When an IDC exchanges traffic with the outside, the trend of overall traffic is a direct indicator of how much bandwidth is in use.
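
Bandwidth occupancy can be estimated from sampled records by scaling each sample by its sampling rate and dividing by the interval length and the link capacity. This is a rough sketch; the function name, the 60s interval, and the 100 Mbit/s capacity in the example are assumptions, not values from the production system.

```python
def bandwidth_utilization(sampled, interval_s, capacity_bps):
    """Estimate link utilization (0.0 to 1.0 and above) over one interval.

    sampled: iterable of (packet_size_bytes, sampling_rate) pairs seen
             during the interval.
    """
    estimated_bits = sum(size * rate for size, rate in sampled) * 8
    return estimated_bits / interval_s / capacity_bps


# Hypothetical interval: 50 samples of 1500-byte packets at 1-in-1000 sampling.
samples = [(1500, 1000)] * 50
print(bandwidth_utilization(samples, interval_s=60, capacity_bps=100_000_000))
```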

4.2 Associating network line information

By mapping network devices to logical information based on ip+ifindex, the core channel lines can be aggregated and displayed. For problems such as public network lines or dedicated lines running at full bandwidth, the line analysis makes it possible to pinpoint directly the time at which the fault first occurred.

4.3 IP session information mining

Although sFlow captures only the packet headers and not the payload, the IP five-tuple itself is of great value for network traffic analysis.

Using session information, we can accurately locate the IP ownership of abnormal traffic. With IP + service port, we can even identify the specific service and process that generated the abnormal traffic and decide on the next step. IP addresses can also be linked with the enterprise CMDB to find the resource group each IP belongs to, which yields the share of traffic generated by different departments or administrative groups. This also helps the relevant business teams become aware of abnormal traffic as soon as it occurs, so that they can be notified and the traffic brought under control.
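
A simple way to surface the sessions behind abnormal traffic is a top-N ranking by estimated bytes per (source IP, destination IP, service port). The sketch below assumes records shaped like dictionaries with the keys shown; those key names and the function name are illustrative.

```python
from collections import Counter


def top_talkers(records, n=3):
    """Rank (src IP, dst IP, dst port) sessions by estimated bytes."""
    totals = Counter()
    for r in records:
        key = (r["src_IP"], r["dst_IP"], r["dst_port"])
        # Scale each sample by its sampling rate to estimate real volume.
        totals[key] += r["packet_size"] * r["sampling_rate"]
    return totals.most_common(n)


# Hypothetical sampled records.
records = [
    {"src_IP": "10.0.0.1", "dst_IP": "10.0.0.254", "dst_port": 161,
     "packet_size": 143, "sampling_rate": 80},
    {"src_IP": "10.0.0.1", "dst_IP": "10.0.0.254", "dst_port": 161,
     "packet_size": 100, "sampling_rate": 80},
    {"src_IP": "10.0.0.2", "dst_IP": "10.0.0.254", "dst_port": 443,
     "packet_size": 1500, "sampling_rate": 80},
]
for session, estimated in top_talkers(records):
    print(session, estimated)
```

The top entries can then be joined with CMDB ownership data to find the team to notify.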

4.4 IP attribution analysis

Beyond internal information, the attribution data provided by network operators lets us check where IP access originates, carry out the related attribution analysis, and build dashboards.

5. Summary

Comprehensive, real-time monitoring and analysis of the network must rely on advanced and effective network monitoring protocols and technologies to keep up with growing business demands. sFlow-based traffic analysis is lightweight to build and, when abnormal traffic appears, supports a quick response based on traffic trends and distribution ratios. However, sFlow samples do not include packet payloads, so it cannot precisely locate or resolve network security issues such as SQL injection and data security incidents. Full traffic analysis should therefore remain an indispensable part of a future traffic analysis system. Combining the two provides more comprehensive and fine-grained traffic monitoring and safeguards the network security of the data center.

6. Future Outlook

Although sFlow is already widely used in network performance monitoring and management, the larger-scale traffic scenarios of the future will demand more capabilities:

1. Support for more protocols and applications: The sFlow monitoring approach applies not only to network traffic but also to application traffic, virtualized environments, cloud platforms, and so on. In the future, sFlow should support more protocols and applications to adapt better to new network environments.

2. Adaptive traffic collection: sFlow currently collects with fixed, preconfigured settings, which may not accurately reflect the real-time state of the network as traffic changes. In the future, sFlow monitoring should support adaptive collection that adjusts itself automatically to actual traffic changes.

3. Convenient management: sFlow configuration currently relies on network administrators configuring each switch, with no support for one-click distribution, automatic discovery, or quick adjustment of sampling ratios. In the future, an sFlow management platform is needed that can distribute commands conveniently and hot-load configuration changes.

Origin blog.csdn.net/vivo_tech/article/details/132097894