A new generation of network monitoring technology - Telemetry

1. Background of Telemetry

Traditional methods for monitoring network devices include SNMP, CLI, Syslog, NetStream, and sFlow, of which SNMP is the most widely used. As networks grow in scale, device count, and structural complexity, and monitoring requirements rise accordingly, these traditional methods show several shortcomings:

  • NetStream and sFlow mainly monitor traffic and do not cover other kinds of data such as CPU and memory usage, network congestion, or network event logs.

  • Neither Syslog nor CLI output has a unified structured data format, which makes them hard to maintain and extend.

  • Although SNMP covers a wide range of data, it is limited in monitoring frequency and method. SNMP mainly uses the traditional Pull Mode: a question-and-answer model in which the monitoring side polls each node, sends a request, and waits for the response. This gives poor timeliness and makes sudden events hard to catch. If the sampling period is too long, real-time performance suffers; if it is too short, the load on the monitored device increases.

  • SNMP does offer a push mechanism, SNMP Trap, but it can only push events and alarms. Monitoring data such as interface traffic cannot be collected and sent this way.
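The sampling-period trade-off in the SNMP bullet can be made concrete with a small simulation. This is a minimal sketch, not SNMP code: the traffic trace, the one-second burst, and the `poll_view` helper are invented for illustration, and a poller is modeled as seeing only the average rate between two reads of a counter.

```python
def poll_view(rates_per_second, interval):
    """Average rate a poller sees when it reads a counter every `interval` seconds."""
    samples = []
    for start in range(0, len(rates_per_second), interval):
        window = rates_per_second[start:start + interval]
        samples.append(sum(window) / len(window))
    return samples


# Hypothetical trace: 60 seconds at 100 units/s with a one-second burst at t=30
traffic = [100] * 60
traffic[30] = 1000

for interval in (30, 5, 1):
    samples = poll_view(traffic, interval)
    # Columns: polling period, number of requests sent, highest rate observed
    print(interval, len(samples), max(samples))
```

With a 30-second period the burst is averaged down to 130 and is easy to miss. With a 1-second period it is seen at its full value of 1000, but the poller has sent 60 requests instead of 2.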

Monitoring for large data networks should therefore be real-time, high-performance, and highly scalable. The monitoring data needs to be precise enough that micro-burst traffic can be detected in time and adjusted for quickly. At the same time, the monitoring process should have little impact on the functions and performance of the device itself, so that device and network utilization improve, and it should make operation and maintenance more visible, for example by exposing delay, forwarding paths, buffers, and packet loss. If network status is only obtained when an external application requests it over SNMP, the state of the network cannot be reflected in real time.

Disadvantages of traditional monitoring methods:

picture

SNMP has poor timeliness:

picture

Telemetry was born out of this need. The industry sometimes treats SNMP as a traditional form of telemetry and calls the current technology Streaming Telemetry; in this article, Telemetry means Streaming Telemetry. Telemetry is a new generation of network monitoring technology that collects data from devices remotely and at high speed. The device periodically and actively sends its information to the collector in Push Mode, which gives more real-time, faster, and more accurate network monitoring. Compared with SNMP, Telemetry lets network devices push status information on their own initiative, so the data is more timely.

2. Characteristics of Telemetry

Telemetry collects a wide range of data types, which can fully reflect network conditions.

It organizes data according to a unified YANG data model, can encode and decode it with Google's GPB (Google Protocol Buffers), XML, JSON, and other formats, and transmits it over protocols such as gRPC (Google Remote Procedure Call), which makes data access more efficient and integration with intelligent analysis systems more convenient. The data types that Telemetry can monitor are:

  • Network interface data: including network interface traffic, error rate, packet loss rate, etc.

  • Network device status: including CPU utilization, memory utilization, temperature, fan speed, etc.

  • Network traffic statistics: including the source IP address, destination IP address, port number and other information of the traffic.

  • QoS (Quality of Service) indicators: including delay, jitter, packet loss rate, etc.

  • Link status: including link bandwidth utilization, bandwidth utilization change trend, etc.

  • BGP (Border Gateway Protocol) information: including BGP routing table, AS (autonomous system) path, etc.

  • Network security information: including DDoS attacks, port scans, abnormal traffic, etc.

  • Network equipment performance indicators: including various hardware indicators, resource utilization, etc.

The device uses push mode to periodically and actively send monitoring data to the collector (sampling precision can reach the sub-second level, so problems can be located quickly).

Traditional SNMP monitoring relies mainly on the routing engine of the network device to process information. With vendor support, Telemetry can embed code at the ASIC level of the hardware board and export real-time data directly from the board. The board sends this data at line rate, which leaves the upper-layer routing engine free to concentrate on protocol processing and route calculation. Real-time data can fully support machine learning and intent analysis, and is very helpful for applications such as automation, traffic optimization, and micro-burst detection.

With Telemetry, one subscription yields N pushes from the device. The device is monitored continuously and repeated queries are avoided.

Traditional SNMP queries work on a one-question, one-answer basis. With 1,000 interactions in one minute, SNMP has to parse 1,000 query request messages, and the monitoring system must keep session information for each request in order to match the returned results. At the same time, the queried device has to interrupt other tasks to execute each query. This pull-based querying sends traffic in both directions, so it is both costly and slow to reflect changes. In large networks, devices such as routers and switches are often already under heavy load and cannot handle many query requests in a short period. Telemetry's push mode needs only one subscription, after which the device keeps pushing data to the monitoring system. No session state has to be maintained and data flows in one direction only, which suits high-rate monitoring data such as interface information.
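The arithmetic behind this comparison is simple enough to write down. The sketch below is a simplified model under stated assumptions: every pull sample costs one request plus one response, and a push subscription is counted as a single setup message followed by one message per sample. The function names are made up for this illustration.

```python
def pull_message_count(samples):
    """Pull mode: each sample needs a request and a matching response."""
    return 2 * samples


def push_message_count(samples):
    """Push mode: one subscription, then one pushed message per sample."""
    return 1 + samples


samples_per_minute = 1000
print(pull_message_count(samples_per_minute))  # messages exchanged in pull mode
print(push_message_count(samples_per_minute))  # messages exchanged in push mode
```

For the 1,000-sample case in the text this gives 2,000 messages for pull against 1,001 for push, and the pull side also has to track 1,000 outstanding requests.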

Comparison between Telemetry and SNMP methods:

picture

Supports variable frequency sampling and suppression functions.

In general, the analyzer needs a short sampling period to get data accurate enough for analysis. A short sampling period, however, produces a large amount of redundant data, which takes up storage space and makes the data harder to manage. If variable-frequency sampling is configured, Telemetry adjusts the collection period dynamically based on preset conditions (such as CPU utilization). While the monitored indicators are normal, the sampling frequency is lowered. When an indicator reaches its threshold, the sampling period is adjusted automatically according to the configuration and data is reported more frequently. Overall, this reduces the amount of data the analyzer receives.
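A minimal sketch of variable-frequency sampling follows. It assumes a single condition (CPU utilization) and two sampling periods; the threshold of 80%, the 60-second and 1-second periods, and both function names are illustrative values, not device defaults.

```python
def sampling_interval(cpu_percent, threshold=80.0, normal_s=60.0, alert_s=1.0):
    """Return the sampling period in seconds for the current CPU reading."""
    return alert_s if cpu_percent >= threshold else normal_s


def count_samples(cpu_by_second):
    """Count the samples taken over a per-second CPU trace."""
    taken, next_sample_at = 0, 0.0
    for second, cpu in enumerate(cpu_by_second):
        if second >= next_sample_at:
            taken += 1
            next_sample_at = second + sampling_interval(cpu)
    return taken


# Hypothetical trace: five quiet minutes, then 30 seconds above the threshold
trace = [20.0] * 300 + [95.0] * 30
print(count_samples(trace))  # samples taken with variable-frequency sampling
print(len(trace))            # samples a fixed 1-second period would take
```

On this trace the variable scheme takes 35 samples where a fixed 1-second period would take 330, while still sampling every second once the threshold is crossed.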

Taking the Huawei NE40E-M router as an example, if the current CPU utilization of the main control board is above 90%, Telemetry suspends all sampling tasks other than the CPU and memory sampling tasks. At this point Telemetry stops sending the collected data and enters the suppression state. After CPU utilization drops back to the threshold, the suppression is lifted. Once reporting resumes, the reporting period of some data may be lengthened.
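The suppression rule can be sketched as a filter over sampling tasks. This reads the paragraph as meaning that only the CPU and memory tasks keep running while suppression is active; the task names and the `tasks_to_run` helper are hypothetical, and the 90% threshold is the figure quoted above.

```python
EXEMPT_TASKS = {"cpu", "memory"}  # tasks that keep running while suppressed


def tasks_to_run(tasks, cpu_percent, threshold=90.0):
    """Return the sampling tasks that may run at the given CPU utilization."""
    if cpu_percent > threshold:
        # Suppression state: everything except the exempt tasks is paused
        return [task for task in tasks if task in EXEMPT_TASKS]
    return list(tasks)


all_tasks = ["cpu", "memory", "interface", "bgp"]
print(tasks_to_run(all_tasks, 95.0))  # suppressed
print(tasks_to_run(all_tasks, 40.0))  # normal operation
```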

3. Working principle of Telemetry

How Telemetry works:

picture

A complete telemetry system can be divided into five parts:

Subscribe to collect data

Data can be subscribed to in two ways: static subscription and dynamic subscription.

In a static subscription, the device acts as the client and the collector acts as the server. The device initiates the connection to the collector, then collects and sends data. This mode is mostly used for long-term inspection.

In a dynamic subscription, the device acts as the server and the collector acts as the client. The collector initiates the connection to the device, and the device then collects and sends data. This mode is mostly used for short-term monitoring.

Push collected data

Telemetry pushes the data, encapsulated in an encoded format, to the collector, which receives and stores it. There are two push methods: gRPC-based and UDP-based.

Read data

Both the monitored device and the collector encode and decode data through GPB together with the .proto file.

Take gRPC subscription push as an example:

  • The device obtains the available data through the YANG model (data source).

  • The device encodes the data through GPB together with the .proto file (data generation).

  • The collector subscribes to the data through gRPC (data subscription).

  • The device pushes the encoded data to the subscribed collector through gRPC (data push).

  • The collector decodes the data through GPB using a .proto file, which must match the .proto file used by the device.
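The requirement that both ends share the same .proto file can be illustrated with a small stand-in. The sketch below does not use Protocol Buffers: Python's `struct` module plays the role of a binary encoding whose layout must be agreed in advance, and the field names, values, and the `SCHEMA` dictionaries are invented for illustration.

```python
import struct

# Stand-in for the shared definition: field order plus a fixed binary layout
SCHEMA = {"fields": ("ent_index", "cpu_usage", "mem_usage"), "layout": "!IHH"}


def encode(schema, record):
    """Device side: pack the record's fields in schema order."""
    values = [record[name] for name in schema["fields"]]
    return struct.pack(schema["layout"], *values)


def decode(schema, payload):
    """Collector side: unpack the payload and name the fields from the schema."""
    values = struct.unpack(schema["layout"], payload)
    return dict(zip(schema["fields"], values))


record = {"ent_index": 16842753, "cpu_usage": 12, "mem_usage": 37}
payload = encode(SCHEMA, record)
print(decode(SCHEMA, payload) == record)  # same schema on both ends

# A collector with a different definition decodes without error but mislabels values
SWAPPED = {"fields": ("ent_index", "mem_usage", "cpu_usage"), "layout": "!IHH"}
print(decode(SWAPPED, payload))
```

The second decode does not fail; it silently reports the CPU figure as memory and the reverse, which is why a mismatched definition is hard to spot from the collector's output alone.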

Analyze data

The analyzer analyzes the collected data and sends the analysis results to the controller, so that the controller can configure and manage the network and optimize the network in a timely manner.

Adjust network parameters

The controller delivers the network configuration that needs to be adjusted to the device. After the configuration is delivered and takes effect, the newly collected data is reported to the collector. The analyzer can then check whether the optimization has the expected effect, and the cycle repeats until tuning is complete. The whole process thus forms a closed loop.

4. Application scenarios of Telemetry

1. This example uses a static subscription on a Huawei NE40E router to collect CPU and memory information, push it to Prometheus, and display it with Grafana:

Network structure:

picture

2. Use the system-view command on the router to enter the system view, then use display telemetry sensor-path to list the sampling paths supported by the device and find the sensor-path for the indicator you need:

Different sensor-paths represent different indicators:

picture

3. Use the telemetry command to enter the telemetry view and configure static subscription as follows:

Configure static subscription:

picture

4. Set up the telemetry server that receives the device's data. First, you need the proto files officially provided by Huawei, which are used for decoding; the link is attached at the end of this article. Then generate Python code from the proto files, either from the command line or with the open source run_codegen.py script.

The required corresponding proto files and run_codegen scripts:

picture
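For the command-line route, the Python gRPC tools package provides a protoc wrapper that can be run as `python -m grpc_tools.protoc`. The sketch below only builds that command. The `proto_file` directory matches the import path used by the server code, but the file names `huawei-grpc-dialout.proto` and `huawei-telemetry.proto` are inferred from the imported module names and should be treated as assumptions, as should the `protoc_command` helper itself.

```python
import os
import sys


def protoc_command(proto_dir, out_dir, proto_names):
    """Build the grpc_tools.protoc command line for the given proto files."""
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        f"-I{proto_dir}",                # where imports between proto files are resolved
        f"--python_out={out_dir}",       # message classes (*_pb2.py)
        f"--grpc_python_out={out_dir}",  # service stubs (*_pb2_grpc.py)
        *[os.path.join(proto_dir, name) for name in proto_names],
    ]


cmd = protoc_command(
    "proto_file",
    "proto_file",
    ["huawei-grpc-dialout.proto", "huawei-telemetry.proto", "huawei-devm.proto"],
)
print(" ".join(cmd))
# To generate the modules, run it with: subprocess.run(cmd, check=True)
```

protoc replaces hyphens in file names with underscores in the generated Python module names, so huawei-devm.proto becomes huawei_devm_pb2.py.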

huawei-devm.proto(device):

picture

Dynamically load to parse data:

picture

picture
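Dynamic loading here means turning a dotted string into a class at run time instead of importing every generated module up front. The sketch below shows the mechanism with `importlib`; it uses a standard-library class as a stand-in so that it runs without any generated modules, and the `resolve_decoder` helper and its `package` and `suffix` parameters are illustrative names.

```python
import importlib


def resolve_decoder(proto_path, package="", suffix=""):
    """Map a "<module>.<class>" string to the class object it names."""
    module_name, class_name = proto_path.split(".")[:2]
    module = importlib.import_module(f"{package}{module_name}{suffix}")
    return getattr(module, class_name)


# Standard-library stand-in: "fractions.Fraction" resolves to the Fraction class
decoder_cls = resolve_decoder("fractions.Fraction")
print(decoder_cls("3/4"))
```

In the server code in step 5, the same lookup is done with the package prefix 'proto_file.' and the module suffix '_pb2', and the resolved message class's FromString method decodes each row.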

5. Server code

from concurrent import futures
import time
import grpc
from proto_file import huawei_grpc_dialout_pb2_grpc
from proto_file import huawei_telemetry_pb2
import prometheus_client
from prometheus_client import Gauge
from prometheus_client.core import CollectorRegistry
import requests
import importlib

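SERVER_ADDRESS = " "       # gRPC listen address in host:port form (left blank here)
PUSHGATEWAY_ADDRESS = " "  # Pushgateway URL prefix; "/job/<name>" is appended below
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
registry = CollectorRegistry(auto_describe=False)
gaugeMap = {}  # metric name -> Gauge, so each Gauge is registered only once


def serve():
    # Create a gRPC server object
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # Register the listener service for Huawei telemetry data
    huawei_grpc_dialout_pb2_grpc.add_gRPCDataserviceServicer_to_server(
        TelemetryCpuInfo(), server)
    server.add_insecure_port(SERVER_ADDRESS)
    # Start the gRPC server
    server.start()
    # Loop forever so the main thread stays alive while the server listens
    try:
        while True:
            print("running------")
            time.sleep(_ONE_DAY_IN_SECONDS)
    except KeyboardInterrupt:
        server.stop(0)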

def is_number(s):
    # True if s parses as a float
    try:
        float(s)
        return True
    except ValueError:
        pass
    # True if s is a single Unicode numeric character
    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False

def toPushgateway(labelValue, parseData, count):
    labels = ["product_name", "subscription_id_str", "sensor_path", "node_id_str"]
    jobName = "pushgateway"
    url = PUSHGATEWAY_ADDRESS
    param = ""  # metric-name prefix built from the enclosing message names

    # Walk the text form of the decoded message line by line
    for line in str(parseData).split("\n"):
        if '{' in line:
            # A nested message opens: append its name to the prefix.
            # The prefix is never shortened on '}', so repeated blocks
            # get progressively longer prefixes.
            param += "".join(line.replace('{', '_').split())
            continue
        if ':' in line:
            name, value = line.split(':', 1)
            value = value.strip()
            if is_number(value):
                # The row index is appended to keep rows apart
                key = param + name.strip() + str(count)
                print("keys:" + key)
                if key in gaugeMap:
                    g = gaugeMap[key]
                else:
                    g = Gauge(key, "", labels, registry=registry)
                    gaugeMap[key] = g
                g.labels(product_name=labelValue[0],
                         subscription_id_str=labelValue[1],
                         sensor_path=labelValue[2],
                         node_id_str=labelValue[3]).set(float(value))
                # Push the whole registry after every update
                requests.post("%s/job/%s" % (url, jobName),
                              data=prometheus_client.generate_latest(registry))


# Servicer class that inherits from the Servicer base class in huawei_grpc_dialout_pb2_grpc
class TelemetryCpuInfo(huawei_grpc_dialout_pb2_grpc.gRPCDataserviceServicer):
    def dataPublish(self, request_iterator, context):
        # Each item in the stream is one message pushed by the device
        for i in request_iterator:
            print('############ start ############\n')
            telemetry_data = huawei_telemetry_pb2.Telemetry.FromString(i.data)
            print(telemetry_data)
            labels = [telemetry_data.product_name,
                      telemetry_data.subscription_id_str,
                      telemetry_data.sensor_path,
                      telemetry_data.node_id_str]
            count = 0
            for row_data in telemetry_data.data_gpb.row:
                print('-----------------')
                print('The proto path is :' + telemetry_data.proto_path)
                print('-----------------')
                # proto_path has the form "<module>.<root message>"
                module_name = telemetry_data.proto_path.split('.')[0]
                root_class = telemetry_data.proto_path.split('.')[1]
                decode_module = importlib.import_module('proto_file.' + module_name + '_pb2')
                # Decoder: getattr looks up the root message class in the dynamically
                # loaded module, and its FromString method decodes the row content
                decode_func = getattr(decode_module, root_class).FromString
                parsedata = decode_func(row_data.content)
                print('----------- content is -----------\n')
                print(parsedata)
                print(type(parsedata))
                toPushgateway(labelValue=labels, parseData=parsedata, count=count)
                count += 1
                print('----------- done -----------------')


if __name__ == '__main__':
    serve()
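To see which metric names the text-parsing loop in toPushgateway produces, the naming logic can be isolated from Prometheus and gRPC. The sketch below is a simplified restatement, not the server code itself: it checks numbers with float() only, and the sample text and its field names are a made-up example of what a decoded message might print.

```python
def metric_keys(text, count):
    """Return {metric name: value} for the numeric fields in protobuf text output."""
    prefix = ""
    keys = {}
    for line in text.split("\n"):
        if "{" in line:
            # Nested message opens: its name becomes part of the prefix
            prefix += "".join(line.replace("{", "_").split())
            continue
        if ":" in line:
            name, value = line.split(":", 1)
            try:
                number = float(value.strip())
            except ValueError:
                continue  # quoted strings and other non-numeric values are skipped
            keys[prefix + name.strip() + str(count)] = number
    return keys


# Hypothetical decoded row, in protobuf text format
sample = """cpuInfos {
  cpuInfo {
    entIndex: 16842753
    position: "1"
    systemCpuUsage: 9
  }
}"""

for key, value in metric_keys(sample, 0).items():
    print(key, value)
```

The nested message names are joined with underscores and the row index is appended, so systemCpuUsage in row 0 becomes cpuInfos_cpuInfo_systemCpuUsage0. The quoted position field is dropped because it is not numeric.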

6. Running the server parses the telemetry data and prints the results.

Data collected (1):

picture

Data collected (2):

picture

Data collected (3):

picture

7. On the router, check the subscription status with the command display telemetry subscription followed by the subscription name.

Static subscription related information:

picture

Grafana display interface:

picture

5. Summary

As a new generation of monitoring technology, Telemetry offers high scalability and strong real-time performance. It makes end-to-end network traffic visible, breaks open the "network black box", and provides an overall solution and the necessary technical support for fine-grained network operation and maintenance, in line with the monitoring requirements of large-scale data networks. However, Telemetry still has some limitations:

  • It is not well suited to small and medium-sized networks, because processing the large data streams it produces requires more resources.

  • SNMP is still supported by a very wide range of networked devices, such as printers, routers, and servers, whereas many older devices and programs do not support Telemetry. In addition, because the statistical principles differ, interface statistics from Telemetry may not match those queried through commands, MIB, or PM.

  • There is currently no consistent indicator path or protocol stack across vendors. At the encoding level there are XML, JSON, and GPB; at the communication level there are gRPC, RESTCONF, and NETCONF.

Even so, large-scale, high-performance network monitoring calls for a new approach, and Telemetry meets that need. It lets an intelligent operation and maintenance system manage more devices, delivers monitoring data with higher precision and better timeliness, and has little impact on the functions and performance of the device itself. It supplies the big data foundation needed to locate network problems quickly and to optimize network quality, turning network quality analysis into big data analysis and effectively supporting intelligent operation and maintenance. It is foreseeable that a variety of new network monitoring systems built around Telemetry will appear, and that their fine granularity and high precision will bring new methods and ideas to monitoring in big data network environments.

Origin blog.csdn.net/m0_59795797/article/details/134586060