The data transfer station of the advertising business system: "Log Center - Real-time Service Monitoring"

Log Center

The log center is the data transfer station of the advertising link. It monitors the robustness of the full-link services in real time and supports settlement, exposure, interaction, and other monitoring and reporting, playing a pivotal role in the back half of the link.

The log center contains multiple functional modules, which can be divided into three types according to their characteristics: real-time service monitoring, monitoring [exposure/interaction/win] reporting, and transfer settlement.

Real-time service monitoring - front-link log analysis

The ADX link currently contains multiple microservices/modules. To unify the data caliber across services, assess the overall robustness of the system, analyze business data growth points, and uncover the hidden risks lurking in the details, we converge the front-link logs into unified data indicators and build metrics-based real-time monitoring on top of trace logs.

Of course, behind this module there are also additional drivers such as cost and resource compression.

Log convergence - the "surgical opening"

According to the ADX architecture diagram in "Advertising Business System Details" (covering the three most complex businesses: advertising, recommendation, and search), the link includes five main services/modules: the front end, the traffic engine, bidding, portrait, and the delivery engine.

So how do we converge the log data in these modules and form a unified log trace?

Students familiar with building monitoring systems may think this is no problem at all: the classic EFK, Prometheus, Graphite, and many other mature wheels exist. Indeed. Students who are not familiar can refer to "Comparison of Monitoring Components" in the cloud native community's monitoring series for a quick overview.

Without further ado, here is the plan.

[Figure: data flow diagram of the five modules]
In the data flow diagram above, the five modules/microservices are deployed independently as Docker images [for Docker details, see "Docker Engineering Environment Construction and Introduction"]. Log data is transparently passed along in the resp (response) body and coupled by pvId/uuid.

The coupling forms a trace log at pv granularity. We then open a small opening at the front of the data flow: the single point through which the data flows out.
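A minimal sketch of the coupling described above, assuming each service attaches its own log fragment to the response keyed by pvId (all type and field names here are hypothetical, for illustration only):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TraceLog is a hypothetical per-service log record: each service appends its
// own fields and passes the record downstream inside the response ("resp")
// body, keyed by pvId so the fragments can later be joined.
type TraceLog struct {
	PvID    string            `json:"pv_id"`   // one page view / request
	Service string            `json:"service"` // e.g. "bidding", "portrait"
	Fields  map[string]string `json:"fields"`
}

// MergeTrace couples the per-service fragments that share a pvId into a single
// pv-granularity trace, which is what flows out of the "small opening".
func MergeTrace(pvID string, logs []TraceLog) map[string]map[string]string {
	trace := make(map[string]map[string]string)
	for _, l := range logs {
		if l.PvID == pvID {
			trace[l.Service] = l.Fields
		}
	}
	return trace
}

func main() {
	logs := []TraceLog{
		{PvID: "pv-1", Service: "bidding", Fields: map[string]string{"price": "3.2"}},
		{PvID: "pv-1", Service: "portrait", Fields: map[string]string{"tags": "auto"}},
		{PvID: "pv-2", Service: "bidding", Fields: map[string]string{"price": "1.1"}},
	}
	b, _ := json.Marshal(MergeTrace("pv-1", logs))
	fmt.Println(string(b))
}
```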

Note: the resp form is not optimal. Although its cost is extremely low, it can easily create bandwidth and I/O pressure. [This can be ignored in the ADX system because of its specific deployment method: to squeeze service performance to the extreme, the services are deployed on the same machine (details in a follow-up article). Other forms such as agent, SDK, or Filebeat can serve as alternatives.]

Just like a clinical operation, the full-link trace data is drawn out through this opening at the "throat". Since the scale of ADX data is positively correlated with business growth, we must account for the special case of traffic doubling.
Therefore we rely on the "peak shaving and valley filling" effect of middleware, pouring the data into Kafka and relaying it to the downstream analysis service and Hive storage.

  • Hive storage belongs to the asynchronous link; through data-mining engines such as Flink and Spark, it supports OLAP analysis and further assists business decision-making.
  • The analysis service belongs to the synchronous link; leveraging the excellent multi-dimensional collection, aggregation, and visualization capabilities of components such as Graphite, Prometheus, Zabbix, and Open-Falcon, it builds real-time monitoring covering both business and service to jointly help the business advance.
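The "peak shaving and valley filling" role that Kafka plays here can be sketched with an in-memory bounded buffer standing in for the broker (a simplification, not the real Kafka client): a bursty producer fills the buffer, back-pressure kicks in when it is full, and a consumer drains at its own pace toward the analysis service and Hive sinks.

```go
package main

import (
	"fmt"
	"sync"
)

// PeakShaver is a minimal in-memory stand-in for the Kafka layer: a bounded
// buffer absorbs write bursts from the front-link trace "opening".
type PeakShaver struct {
	buf chan string
}

func NewPeakShaver(capacity int) *PeakShaver {
	return &PeakShaver{buf: make(chan string, capacity)}
}

// Produce blocks only when the buffer is full (back-pressure on the burst).
func (p *PeakShaver) Produce(msg string) { p.buf <- msg }

// Close signals that no more messages will arrive.
func (p *PeakShaver) Close() { close(p.buf) }

// Drain consumes everything, fanning each record out to the downstream sinks,
// and returns the number of records delivered.
func (p *PeakShaver) Drain(sinks ...func(string)) int {
	n := 0
	for msg := range p.buf {
		for _, sink := range sinks {
			sink(msg)
		}
		n++
	}
	return n
}

func main() {
	ps := NewPeakShaver(1024)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // burst producer: the front-link trace opening
		defer wg.Done()
		for i := 0; i < 100; i++ {
			ps.Produce(fmt.Sprintf("trace-%d", i))
		}
		ps.Close()
	}()
	delivered := ps.Drain(
		func(m string) { /* -> real-time analysis service */ },
		func(m string) { /* -> Hive via Flink/Spark */ },
	)
	wg.Wait()
	fmt.Println("delivered:", delivered)
}
```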
Metrics-based log analysis - Prometheus & Graphite

Metrics is one of the three pillars of service observability; the other two are Log and Trace. [For details on observability, see "Cloud Native Hot Topics | What is Observability"]

First, technology selection. Why choose Prometheus & Graphite among so many components? The decision mainly involved the following process:

  • Market research
    • A complete research method and solution is required, covering open source products, competing products, and even the designs of leading companies in the industry
  • Combine with your own conditions
    • Look inward thoroughly, understand your own strengths and weaknesses, and combine the research results to find the best fit
  • Secondary development/customization
    • While landing the solution, consider the pain points of the current plan in light of the actual situation, and carry out supplementary or directional development


The original solution used only Prometheus, but it had two pain points. [For details about Prometheus, see "Prometheus? The Ancient Greek Titan? An Alien? No, a New Generation of Enterprise-Level Monitoring Components: Prometheus"]

  • The accuracy of Prometheus indicator data is not 100%
    • The solution is a dual monitoring link of Graphite + Prometheus for data support. This of course involves data redundancy, so core indicators are collected over both links while regular indicators go to Prometheus only.
  • After a Prometheus restart or outage, metrics are counted again from 0
    • Hot standby is used to avoid this.
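The restart-from-zero behavior comes from counters being monotonic only within one process lifetime. The standard heuristic (the one Prometheus's own `rate()`/`increase()` functions apply) is to treat any drop in a counter sample as a reset and count the new value as fresh growth. A small sketch:

```go
package main

import "fmt"

// IncreaseWithResets computes the total increase of a monotonically
// increasing counter over a series of samples, tolerating restarts: when a
// sample drops below its predecessor we assume the counter was reset to zero
// and count the new value as fresh growth.
func IncreaseWithResets(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	total := 0.0
	prev := samples[0]
	for _, v := range samples[1:] {
		if v >= prev {
			total += v - prev
		} else { // counter reset detected (e.g. service restart)
			total += v
		}
		prev = v
	}
	return total
}

func main() {
	// 5 -> 8 grows 3; restart resets to 2 (+2); 2 -> 4 grows 2; total 7.
	fmt.Println(IncreaseWithResets([]float64{5, 8, 2, 4}))
}
```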
How the monitoring service monitors itself & stays more robust than regular services

As a monitoring service, its core responsibility is to monitor other services.

Under this premise there is an implicit requirement: the monitoring service must be more robust than regular services. It must not go down before the services it monitors do.

Therefore, to guarantee strong high availability, the monitoring service has a highly scalable, high-performance architecture design, a flexible and autonomous scaling mechanism, two downgrade schemes, and a set of self-monitoring.

Highly scalable and high-performance architecture design

Within a single service instance, multi-coroutine concurrency is orchestrated: data is injected into an in-memory chan, and multiple coroutines are dynamically spawned to perform the business aggregation concurrently and output data indicators. This supports dynamic expansion while making the best use of the machine.

Two reliable downgrade schemes

To ensure the service can continuously output business data during traffic peaks, two downgrade schemes are designed.

  • Traffic sampling
    • Once the maximum carrying capacity is reached, sampling is applied step by step at gradient ratios of 80%, 40%, 20%, and 10%. If the data volume is still overloaded, the second scheme kicks in.
  • Keep the big, drop the small
    • Traffic is filtered with a funnel model: only the normal output of core data is guaranteed, and all other data tasks are abandoned.
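A sketch of the gradient sampling step, with two assumptions called out: the load-factor thresholds below are illustrative, not the production values, and the keep/drop decision hashes the pvId so that all records of one pv survive or fall together.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// SampleRate returns the gradient sampling ratio for a given load factor
// (current traffic / maximum carrying capacity). Thresholds are illustrative.
func SampleRate(loadFactor float64) float64 {
	switch {
	case loadFactor <= 1.0:
		return 1.0 // within capacity: keep everything
	case loadFactor <= 1.25:
		return 0.8
	case loadFactor <= 2.5:
		return 0.4
	case loadFactor <= 5.0:
		return 0.2
	default:
		return 0.1 // beyond this, the "keep the big" funnel takes over
	}
}

// Keep decides deterministically whether a pvId survives sampling, so a pv's
// trace is either fully kept or fully dropped.
func Keep(pvID string, rate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(pvID))
	return float64(h.Sum32()%1000) < rate*1000
}

func main() {
	fmt.Println(SampleRate(0.9), SampleRate(3.0), SampleRate(10.0))
	fmt.Println(Keep("pv-42", 1.0)) // rate 1.0 keeps every pv
}
```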
Monitoring the service itself

While the service monitors others, we also designed Check logic over the service's own data to ensure that it is clean and correct.

The dual-link mode was mentioned above: by fitting the two data series from the Graphite and Prometheus links against each other, a judgment can be made about the health of the service's own data.
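One way to implement that fit check is to compare the same core indicator as collected over the two links and flag any point whose relative deviation exceeds a tolerance (the 5% tolerance below is an illustrative assumption):

```go
package main

import (
	"fmt"
	"math"
)

// FitCheck compares a core indicator collected over the two links (Graphite
// vs Prometheus) and returns the indices whose relative deviation exceeds the
// tolerance; a non-empty result suggests the service's own data is suspect.
func FitCheck(graphite, prometheus []float64, tolerance float64) []int {
	var bad []int
	n := len(graphite)
	if len(prometheus) < n {
		n = len(prometheus)
	}
	for i := 0; i < n; i++ {
		base := math.Max(math.Abs(graphite[i]), math.Abs(prometheus[i]))
		if base == 0 {
			continue // both zero: nothing to compare
		}
		if math.Abs(graphite[i]-prometheus[i])/base > tolerance {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	g := []float64{100, 200, 300}
	p := []float64{101, 150, 301}
	fmt.Println(FitCheck(g, p, 0.05)) // index 1 deviates by 25%
}
```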

Flexible and autonomous expansion and contraction mechanism

The services are deployed in a distributed manner, and the linkage of three indicators (service entry traffic threshold, service exit failure rate, and service exit P90 latency threshold) provides the data basis for deciding when to scale out or in.

  • Redundancy: between the standby nodes and the expansion nodes, machine-scale redundancy is kept at around 1.05.
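The three-indicator linkage can be sketched as a decision function; the thresholds below are illustrative assumptions, not the production values, and a real implementation would also smooth the inputs over a window before deciding:

```go
package main

import "fmt"

// ScaleDecision links the three indicators described above.
// Returns +1 (scale out), -1 (scale in), or 0 (hold).
func ScaleDecision(entryQPS, maxQPS, failRate, p90ms float64) int {
	switch {
	// Any single pressure signal is enough to expand.
	case entryQPS > 0.8*maxQPS || failRate > 0.01 || p90ms > 200:
		return +1
	// Shrink only when every axis is comfortably idle.
	case entryQPS < 0.3*maxQPS && failRate < 0.001 && p90ms < 50:
		return -1
	default:
		return 0
	}
}

func main() {
	fmt.Println(ScaleDecision(9000, 10000, 0.001, 80)) // entry near capacity
	fmt.Println(ScaleDecision(2000, 10000, 0.0005, 30))
	fmt.Println(ScaleDecision(5000, 10000, 0.005, 100))
}
```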


The resulting monitoring effect

After the data indicators are aggregated, they are embedded into Grafana as data sources to provide real-time, diverse, multi-dimensional visualization. [For Grafana details, see "Build a Real-Time Monitoring System Based on Prometheus + Grafana in Five Minutes"]

[Figures: Grafana monitoring dashboards]

Supported business dimensions: exposure volume/ratio, material fill volume/ratio, volume/ratio of delivery-engine candidate types...; service dimensions: QPS, failure rate, SLA, HTTP code distribution...

Exposure data transfer and settlement

Exposure data is a particular concern of ADX; its data flow is an important bridge for communication and settlement, involving revenue ...


See the follow-up article!

Recommended reading:
The three top complex businesses of advertising, recommendation, and search: "Advertising Business System Details"
Advertising business system, inheriting the past and opening the future: "Message Center"
Advertising business system data transfer station: "Log Center - Real-time Service Monitoring"
Advertising business system data bridge: "Log Center - Exposure Data Transfer and Settlement"
Advertising business system core channel: "Log Center - S2S Monitoring and Reporting"
Advertising business system auxiliary decision-making: "AB Experimental Platform"
Advertising business system framework precipitation: smart fuse of the "Data Consumption Service Framework"
Advertising business system agile delivery: "Smart Flow Control"
Advertising business system business connection: "Deployment Based on Docker Containers"
Advertising business system: "PDB - Advertisement Delivery [Quantity and Price]"


Get it done with three lines of code - Reversing the linked list...
Kafka's high-throughput, high-performance core technology and best application scenarios...
How HTTPS ensures data transmission security - TLS protocol...
Build a real-time monitoring system based on Prometheus + Grafana in five minutes...


Origin blog.csdn.net/qq_34417408/article/details/128631539