Root cause analysis based on causal inference

Foreword

One subfield of research on the operation and maintenance of complex information systems is "root cause localization". A common problem definition in this subfield is: among the large number of monitoring variables of a complex system (such as CPU utilization and response time), find the "root cause", where "root cause" refers to a specific monitoring variable. Under this problem definition, recent papers mainly adopt depth-first search and random walk. But it is not clear why these two methods should be effective, and even the problem definition is circular: "root cause localization" just means "locating the root cause"!

Our work gives a formal definition of this problem in the language of causal inference, and then derives for it a simple method whose effectiveness can be clearly explained. This work has been published at KDD'22 [1]. Owing to our understanding at the time, time pressure, and the space limit when submitting the manuscript, some content was not explained clearly in the paper. This article expands on the fidelity assumption, the motivation for the structured graph construction, and other points that the paper does not explain clearly, and also serves as a Chinese-language introduction for domestic readers.

Problem Definition

Judea Pearl divides existing causal inference tasks into three levels, forming the causal ladder [2]. The first rung of the ladder concerns the joint probability distribution among variables. The second rung builds on the first and concerns how the distribution changes when we do something; in causal inference, this "doing something" is called an intervention. Counterfactual reasoning, on the third rung of the causal ladder, is characterized by imposing interventions that contradict the observed facts.

Judea Pearl's ladder of causation

We relate interventions in causal inference to faults in root cause analysis, treating the fault-free data and the fault data as samples of the distributions before and after an intervention. Based on this conceptual mapping, we cast root cause analysis as a new causal inference task, intervention recognition: identifying the intervened variable set M. The do operator here means assigning a set of variables to given values.

Formalizing root cause analysis as a new causal inference task: intervention recognition
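Stated compactly (our paraphrase, in the notation used throughout this article): given the variable set V, fault-free samples drawn from P(V), and fault-time samples drawn from P(V ∣ do(M=m)), intervention recognition asks for the intervened set M ⊆ V.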

Analysis

Fidelity Assumption

For a given variable set V, the do operator maps each possible intervention do(M=m), M ⊆ V, to a distribution over V, and thereby induces equivalence classes over interventions: [m1] = [m2] ⟺ P(V ∣ do(m1)) ≡ P(V ∣ do(m2)). In other words, the do operator defines a mapping from the set of possible interventions to the set of distributions over V.

  • When this mapping is injective (distinct interventions yield distinct distributions), intervention recognition amounts to computing the inverse of the do operator;
  • when it is not injective, intervention recognition does not always have a unique answer.

The figure below shows two worlds with different interventions but the same post-intervention distribution. When such situations arise, intervention recognition cannot draw a definitive conclusion.

For this example:

  • In the world on the left, where the response time is intervened on, the distribution of CPU utilization is the same as before the intervention.

  • In the world on the right, where the CPU utilization is intervened on, the conditional distribution of the response time given CPU utilization is retained from before the intervention.

Combining these two observations, the two post-intervention worlds look exactly the same as the world before the intervention. However, such scenarios are not realistic for root cause analysis: a fault that changes nothing observable needs no diagnosis. To rule out such impractical cases, we introduce the fidelity assumption: any intervention causes changes that can be observed.

Two worlds with different interventions but the same post-intervention distribution
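In the notation of the equivalence classes above, the fidelity assumption can be stated as requiring the mapping defined by the do operator to be one-to-one (our paraphrase, not the paper's exact wording):

m1 ≠ m2 ⟹ P(V ∣ do(m1)) ≢ P(V ∣ do(m2))

In particular, taking one of the two interventions to be the empty intervention, any fault produces a post-intervention distribution that differs observably from the fault-free distribution P(V).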

Causal Hierarchy Theorem

Under the aforementioned fidelity assumption, it can be proved that intervention recognition lies on the second rung of the causal ladder [1]. The Causal Hierarchy Theorem [2] states that the causal ladder collapses only on a set of models of measure zero; to answer a question on a given rung of the ladder, we therefore need knowledge from that rung or higher. Two corollaries follow.

  1. Solving the intervention recognition problem requires knowledge from the second rung of the causal ladder. Causal Bayesian networks [3] are a bridge connecting the first two rungs [2], which also explains why many existing root cause analysis works take constructing a causal graph among metrics or services as one of their steps.
  2. Solving the intervention recognition problem does not require counterfactual reasoning.

Intervention Recognition Criterion

Further analysis under the aforementioned fidelity assumption yields a theorem [1] for deciding whether a variable is a root cause, the intervention recognition criterion: check whether the conditional probability distribution of the variable given its parents in the causal graph has changed at fault time, that is,

Vi ∈ M ⟺ P(Vi ∣ pa(Vi), do(m)) ≠ P(Vi ∣ pa(Vi))
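As a toy illustration of this criterion (a minimal sketch on synthetic data; the variable names, the linear data-generating process, and the use of a two-sample Kolmogorov-Smirnov test are our own illustrative choices, not the method of [1]):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000

# Before the fault: cpu -> latency, with latency = 2 * cpu + noise.
cpu_pre = rng.normal(0.0, 1.0, n)
lat_pre = 2.0 * cpu_pre + rng.normal(0.0, 0.1, n)

# After the fault: do(cpu = 3) shifts CPU utilization, while the
# mechanism generating latency from cpu is left untouched.
cpu_post = rng.normal(3.0, 0.1, n)
lat_post = 2.0 * cpu_post + rng.normal(0.0, 0.1, n)

# cpu has no parents: compare its marginal distribution directly.
print(stats.ks_2samp(cpu_pre, cpu_post).pvalue)   # ~0: P(cpu) changed, cpu in M

# latency has parent cpu: compare P(latency | cpu) via the residuals
# of a regression fitted on pre-fault data only.
slope, intercept = np.polyfit(cpu_pre, lat_pre, 1)
res_pre = lat_pre - (slope * cpu_pre + intercept)
res_post = lat_post - (slope * cpu_post + intercept)
print(stats.ks_2samp(res_pre, res_post).pvalue)   # large: latency not in M

Note that the marginal distribution of latency changes drastically after the fault, yet the criterion blames only cpu, because the conditional distribution of latency given its parent is preserved.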

Causal Graph

The Causal Hierarchy Theorem tells us that solving the intervention recognition problem requires introducing knowledge from the second rung of the causal ladder. To do so, we first need to construct a causal graph.

Causal Discovery

After applying the PC algorithm and other causal discovery algorithms, existing root cause analysis works do not check whether the mined graph is correct. Prior to this work, we evaluated the performance of several algorithms on an open-source dataset [4]; the results are shown in the figure below [5]. None of the algorithms we tried was satisfactory on real problems.

Evaluating the ability of several correlation analysis and causal discovery algorithms to mine causal graphs on an open-source dataset
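For readers who want to reproduce such an evaluation, a typical starting point is an off-the-shelf implementation such as the PC algorithm in the causal-learn package (a minimal sketch; the data file and column layout are placeholders):

import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Rows are samples, columns are monitoring variables (CPU utilization,
# response time, ...); "metrics.csv" is a placeholder for a real dataset.
data = np.loadtxt("metrics.csv", delimiter=",", skiprows=1)

# Run PC with its default (Fisher-z) conditional independence test.
cg = pc(data, alpha=0.05)

# The estimated graph can then be compared edge by edge against a
# hand-labeled ground truth to compute precision and recall.
print(cg.G)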

Randomized Controlled Trials?

Randomized controlled trials are the gold standard for establishing causality. The author once tried to achieve the effect of controlled variables by adjusting the intensity of fault injection, but ran into trouble when trying to summarize the principles of fault injection [6]. Consider the example below: when injecting faults at the three counters A, B, and C on lines 4, 7, and 10 respectively, do we need to ensure that the sum of counter_B and counter_C always equals counter_A? If we decide that the equality need no longer hold only when B or C is intervened on, we have already assumed that A is a cause of B and C.

01 func foo() {
02     defer timer.ObserveDuration()
03     // A
04     counter_A += 1
05     if condition {
06         // B
07         counter_B += 1
08     } else {
09         // C
10         counter_C += 1
11     }
12 }

From this attempt at randomized controlled trials, we drew two insights:

  1. Fault injection itself relies on assumptions;
  2. the causality that a randomized controlled trial "finds" is an embodiment of those assumptions.

That being the case, one might as well build the causal graph among monitoring variables directly from assumptions, which leads to the structured graph construction described in the paper.

Structured Graph Construction

  1. We first classify monitoring variables into four meta-variables: Traffic, Saturation, Latency, and Errors. Directed edges among the four meta-variables encode causal hypotheses; for example, since user requests are the beginning of everything, Traffic is taken as a cause of the other meta-variables.
  2. Second, the graph over the four meta-variables is expanded according to the system architecture. For example, a database can be regarded as a resource of a service: the database-related part of the service's Saturation can be expanded into the database's own four meta-variables, which inherit the relations between the service's Saturation and the service's other meta-variables.
  3. Finally, we fill the monitoring variables into the corresponding meta-variables, completing the construction of the causal graph; a code sketch of these three steps follows.
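The sketch below mechanizes the three steps with networkx. It is a drastically simplified rendering: the exact edge set among the meta-variables, the way the expansion inherits edges, and the example metric names (qps, cpu_util, resp_time) are illustrative assumptions, not the paper's full specification.

import networkx as nx

# Step 1: causal hypotheses among the four meta-variables.
# Traffic is the beginning of a request, hence a cause of the rest.
def meta_graph(prefix):
    g = nx.DiGraph()
    for effect in ("Saturation", "Latency", "Errors"):
        g.add_edge(f"{prefix}.Traffic", f"{prefix}.{effect}")
    g.add_edge(f"{prefix}.Saturation", f"{prefix}.Latency")
    g.add_edge(f"{prefix}.Saturation", f"{prefix}.Errors")
    return g

# Step 2: expand along the system architecture. The database is a
# resource of the service, so its meta-variables stand in for the
# database-related part of the service's Saturation and inherit its
# edges (only two inherited edges are drawn to keep the sketch short).
g = nx.compose(meta_graph("service"), meta_graph("db"))
g.add_edge("service.Traffic", "db.Traffic")
g.add_edge("db.Latency", "service.Latency")

# Step 3: fill concrete monitoring variables into their meta-variables.
metrics = {"qps": "service.Traffic",
           "cpu_util": "db.Saturation",
           "resp_time": "service.Latency"}
for metric, meta in metrics.items():
    g.add_edge(meta, metric)

print(sorted(g.edges))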

Implementation

Applying the aforementioned intervention recognition criterion runs into the problem of insufficient data, in two respects:

  1. The need to stop losses quickly limits the time span of usable fault data, and our understanding of any given fault is limited;
  2. the fault-free data are also insufficient, manifested as a lack of overlap between the data distributions before and after the fault.

Insufficient data encountered when applying the intervention recognition criterion

Regression-Based Hypothesis Testing Methods

To reduce the need for fault data, we reformulate the intervention recognition criterion as a hypothesis test. The null hypothesis H0 is that a variable Vi has not been intervened on; we test whether a value Vi(t) observed after the fault still obeys the previous conditional distribution P(Vi(t) ∣ pa(t)(Vi)), and take the number of standard deviations by which Vi(t) deviates from the mean of that distribution as the anomaly score. To reduce the need for fault-free data, regression is used to extrapolate the limited fault-free data toward the data range at fault time.
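A minimal sketch of this test, assuming for illustration an ordinary least-squares regressor and Gaussian residuals:

import numpy as np

def anomaly_score(parents_pre, v_pre, parents_t, v_t):
    """z-score of the fault-time value v_t of a variable Vi under
    P(Vi | pa(Vi)) estimated from fault-free data by linear regression."""
    # Fit Vi ~ pa(Vi) on fault-free data (ordinary least squares).
    X = np.column_stack([parents_pre, np.ones(len(parents_pre))])
    coef, *_ = np.linalg.lstsq(X, v_pre, rcond=None)
    sigma = (v_pre - X @ coef).std()
    # Number of standard deviations between the fault-time observation
    # and the regression prediction at the fault-time parent values.
    pred_t = np.append(parents_t, 1.0) @ coef
    return abs(v_t - pred_t) / sigma

For example, with CPU utilization as the single parent of response time, anomaly_score(cpu_pre, rt_pre, cpu_t, rt_t) is large when the fault-time response time cannot be explained by the fault-time CPU utilization.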

Descendant Adjustment

When we know the specific form of the data-generating process [7], the regression model can perform better. In reality, we do not know that form, and the problem of insufficient data limits purely data-driven methods. For this reason, this work introduces descendant adjustment.

For example, when the system response time stays high, adding machines to reduce resource utilization may be an effective mitigation. Descendant adjustment expresses a preference for blaming resource utilization (the parent of response time in the causal graph) by adding the anomaly score of the response time onto it.
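A sketch of this adjustment (our illustrative rendering; the paper's exact weighting and propagation may differ): each node's anomaly score is added onto its parents, so that a cause such as resource utilization can outrank its symptom, the response time.

import networkx as nx

def adjusted_scores(g, score):
    # Add each node's anomaly score onto its parents, expressing a
    # preference for causes over symptoms.
    adjusted = dict(score)
    for child, s in score.items():
        for parent in g.predecessors(child):
            adjusted[parent] = adjusted.get(parent, 0.0) + s
    return adjusted

g = nx.DiGraph([("cpu_util", "resp_time")])
print(adjusted_scores(g, {"cpu_util": 1.2, "resp_time": 3.0}))
# {'cpu_util': 4.2, 'resp_time': 3.0}: the parent now ranks first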

Descendant adjustment helped the regression-based hypothesis testing method achieve better performance on the real datasets we employed. However, its effect needs to be evaluated on more datasets, or it may be replaced by other ways of dealing with insufficient data.

Other Questions

Why do we say that Sage's method design assumes the system itself is fault-free?

For a given data distribution P′(V), counterfactual reasoning first infers the values z′ of the hidden (exogenous) variables from the existing observations, and then applies an intervention to compute the counterfactual distribution P′(V ∣ z′, do(x)); this is the standard abduction-action-prediction procedure [2]. Sage [8] uses counterfactual reasoning to diagnose system performance problems, asking which intervention, once applied, would bring the system's overall latency Y back to normal.

Our work maps faults to interventions in causal inference. When the system itself is not intervened on, P′ ≡ P, and P′ can be learned from pre-fault data; when the system itself is intervened on, P′(V) ≡ P(V ∣ do(M)), which has to be learned from fault data. Sage trains its models on pre-fault data, thereby implicitly assuming that the system itself has not been intervened on, that is, that the system has no fault in the sense defined in our work.

If faults were regarded as something other than interventions, sitting on a lower rung of the causal ladder, then what root cause analysis is actually analyzing would need further exploration.

References

  1. Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22).
  2. On Pearl's Hierarchy and the Foundations of Causal Inference. https://causalai.net/r60.pdf
  3. Reference system: Guanhe causal analysis system, https://yinguo.grandhoo.com/home
