Paper Interpretation: PRIPEL: Privacy-Preserving Event Log Publishing Including Contextual Information

Problem addressed

Existing techniques are limited to the control flow of a process and ignore contextual information such as attribute values and durations. This rules out any form of process analysis that involves contextual factors (timestamps, attribute values). To bridge this gap, we introduce PRIPEL, a framework for privacy-aware event log publishing. Compared with existing work, PRIPEL takes a fundamentally different angle and ensures privacy at the level of individual cases rather than the complete log. In this way, contextual information and long-tail process behavior are preserved, so that a rich set of process analysis techniques can still be applied.

Introduction


Such a rich event log not only enables discovery of the control-flow model of the process, but also provides a starting point for multi-dimensional analysis that incorporates the impact of context on process execution. One example is predicting a patient's remaining waiting time [23] based on temporal information (e.g., arrival at night), patient characteristics (e.g., age and gender), and activity outcomes (e.g., the dispensed medication). Including such contextual information allows a fine-grained separation of classes of cases.

Event logs, especially those that include contextual information, may contain sensitive data about the individuals involved in process execution [26].

Excluding contextual factors prevents any fine-grained analysis that takes the specifics of different classes of cases into account. However, aggregating contextual information based on k-anonymity (see [12]) is not suited to overcoming this limitation. Such aggregation leads to the loss of long-tail process behavior, that is, the infrequent traces of rare cases, which are of particular importance for any analysis (e.g., because of their exceptional runtime characteristics).

The proposed method

Our idea is to ensure differential privacy of an event log on the basis of individual cases rather than the entire log. To this end, the PRIPEL framework exploits the parallel composition principle of differential privacy. Starting from a differentially private selection of activity sequences, contextual information from the original log is integrated through a sequence enrichment step. Subsequently, the integrated contextual information is anonymized following the principle of local differential privacy.

Advantages

Ensuring privacy at the level of individual cases is a fundamentally different perspective, which allows us to overcome the above-mentioned limitations of existing work. PRIPEL is the first approach that:

  • Ensures differential privacy not only for the control flow, but also for the contextual information in the event log.
  • At the same time retains large parts of the long-tail process behavior, while differential privacy ensures that personal data belonging to a specific individual can no longer be identified.

Concepts

Ensuring local differential privacy

An anonymization function is defined that inserts noise into the data to obscure information about individuals while retaining as many characteristics of the overall population as possible. Several such mechanisms have been developed for anonymizing various data types, including mechanisms that ensure differential privacy for numerical, categorical, and Boolean data.

  • Numerical data - Laplace mechanism. The Laplace mechanism [5] is an additive noise mechanism for numerical values. It draws noise from a Laplace distribution that is calibrated according to the privacy parameter ε and the sensitivity of the data distribution. The latter is defined as the maximal difference that a single individual can cause.
  • Boolean data - randomized response. To ensure differential privacy for Boolean data, randomized response can be used [37]. The algorithm is based on the following idea: a coin is tossed to decide whether the true value of an individual is revealed or a randomly chosen value is reported instead. Here, the randomization depends on the strength of the differential privacy guarantee. In this article, we use the so-called binary mechanism [16].
  • Categorical data - exponential mechanism. To handle categorical data, the exponential mechanism can be used [27]. It allows a utility difference to be defined between the different potential values of the categorical domain. The probability that a value is exchanged for another value depends on the utility loss this introduces.
  • Parallel composition of differential privacy. Given mechanisms that provide differential privacy for various data types, a key property of (local) differential privacy is that it is compositional. Intuitively, this means that when the results of multiple ε-differentially private mechanisms, each applied to a disjoint dataset, are merged, the merged result also provides ε-differential privacy [28].
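The three mechanisms listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the `rng` parameter, and the `utility` callback are hypothetical, and the randomized response variant shown here (keep the true value with probability e^ε / (1 + e^ε), otherwise flip it) is one common formulation that is assumed to correspond to the binary mechanism [16].

```python
import math
import random


def laplace_mechanism(value, sensitivity, epsilon, rng=random):
    """Add Laplace noise with scale sensitivity / epsilon to a numeric value."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise


def randomized_response(value, epsilon, rng=random):
    """Report the true Boolean with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return value if rng.random() < p_truth else not value


def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=random):
    """Sample a candidate with probability proportional to
    exp(epsilon * utility(candidate) / (2 * sensitivity))."""
    weights = [
        math.exp(epsilon * utility(c) / (2.0 * sensitivity)) for c in candidates
    ]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A smaller ε increases the noise scale of the Laplace mechanism, pushes the randomized response towards a fair coin flip, and flattens the sampling weights of the exponential mechanism, which is the privacy/utility trade-off discussed in the evaluation.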

PRIPEL framework

The framework takes an event log as input and transforms it into an anonymized log that includes contextual information and guarantees ε-differential privacy.

  • First, a trace variant query Q is applied. The query returns a multiset of activity sequences and ensures differential privacy from a control-flow perspective.
  • Second, the framework constructs new traces by enriching the activity sequences obtained from Q with contextual information (i.e., timestamps and attribute values) from the original log L. This sequence enrichment results in a matched event log Lm.
  • Finally, PRIPEL anonymizes the timestamps and attribute values individually, exploiting the parallel composition of differential privacy. The resulting event log L' guarantees differential privacy while largely retaining the information of the original log L.

Trace variant query

The first step of the framework targets the anonymization of the event log from a control-flow perspective. In particular, the framework applies a trace variant query, which returns a multiset of activity sequences that captures trace variants and their frequencies in a differentially private manner. This step is essential because even publishing only the activity sequences of an event log, i.e., with all attribute values and timestamps removed, can be enough to link the identity of individuals to infrequent activity sequences [12,25]. For example, an uncommon treatment path may suffice to reveal the identity of a particular patient. PRIPEL adopts a state-of-the-art realization of a privacy-preserving trace variant query [25]. It uses the Laplace mechanism (see Section 2.3 of the paper) to add noise to the result of the trace variant query. As shown in the exemplary query result in Table 2, this mechanism may alter the frequency of trace variants, remove variants entirely, and introduce new ones. Note that the size of the trace variant query result typically differs from the number of traces in the original log. The trace variant query used has two configuration parameters, n and k, which influence the prefix tree that the mechanism uses to generate the query result.
Parameter n sets the maximum depth of the prefix tree, which determines the maximum length of an activity sequence returned by the query. Parameter k is used to bound the state space of the mechanism in terms of the number of potential activity sequences that are explored. A higher k means that only more common prefixes are considered, which reduces the runtime but may negatively affect the utility of the resulting log.
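To give a feeling for how Laplace noise changes variant frequencies, the following toy sketch counts trace variants, perturbs each count, and prunes low counts. It is deliberately simplified and is not the prefix-tree mechanism of [25]: it only perturbs variants that already occur in the log, so it can change or drop variants but cannot introduce new ones, and it should not be treated as a differentially private query on its own. The names `noisy_variant_counts`, `max_len`, and `min_count` are hypothetical; `max_len` plays a role similar to n, and `min_count` is only loosely analogous to the pruning effect of k.

```python
import random
from collections import Counter


def noisy_variant_counts(traces, epsilon, max_len, min_count, rng=random):
    """Count trace variants, add Laplace noise to each count, prune rare ones."""
    # Truncate every trace to at most max_len activities and count variants.
    variants = Counter(tuple(trace[:max_len]) for trace in traces)
    # One case changes exactly one variant count by 1, so sensitivity is 1.
    scale = 1.0 / epsilon
    noisy = {}
    for variant, count in variants.items():
        # Difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy_count = round(count + noise)
        if noisy_count >= min_count:
            noisy[variant] = noisy_count
    return noisy


# Hypothetical example log: five cases follow A-B-C, one rare case follows A-C.
log = [["A", "B", "C"]] * 5 + [["A", "C"]]
print(noisy_variant_counts(log, epsilon=1.0, max_len=30, min_count=1,
                           rng=random.Random(42)))
```

With a small ε, the rare variant A-C is likely to be pruned or to have its frequency changed noticeably, which mirrors why the size of the query result usually differs from the number of traces in the original log.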

Case study

Sepsis log: 1,050 traces distributed over 846 trace variants.

Different parameter settings were tested.

The utility of event logs

Runtime

Three levels are considered:

  • Event level (attribute values)
  • Trace level (case duration)
  • Log level (process workload)

Data attribute values: At the event level, we compare the value distribution of data attributes in the anonymized logs with the original distribution. The Sepsis log mainly has Boolean attributes. The quality of their value distributions is easy to quantify, namely by comparing the fraction of true values in the anonymized log L' with the fraction in L. To illustrate the impact of the differential privacy parameter on the quality of attribute values, we evaluate the value distribution of the Boolean attribute InfectionSuspected. As shown in Table 4, this attribute is true for 81% of the cases in the original log. The anonymized distribution is reasonably well preserved for the highest ε value, which is the least strict privacy guarantee. There, 75% of the values are true. For stronger privacy guarantees, however, the accuracy of the distribution decreases, and for ε = 0.1 it is almost completely random. This means that the quality of attribute values can be preserved for certain privacy levels, but may suffer under stricter settings. Note that, because these results are obtained by anonymizing individual values, the reduced quality under stronger privacy guarantees is inherent to the notion of differential privacy and therefore independent of the specifics of the PRIPEL framework.
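The drift towards randomness can be reproduced with a short calculation. The sketch below assumes the randomized response formulation in which a Boolean value is kept with probability e^ε / (1 + e^ε) and flipped otherwise; whether this matches the binary mechanism used in the paper exactly is an assumption, so the numbers illustrate the trend and are not meant to reproduce Table 4. The function name and the ε values other than 0.1 are chosen for illustration only.

```python
import math


def expected_true_fraction(original_fraction, epsilon):
    """Expected share of True values after randomized response.

    Assumes each value is kept with probability e^eps / (1 + e^eps)
    and flipped otherwise (illustrative formulation).
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    kept_true = p_truth * original_fraction
    flipped_false = (1.0 - p_truth) * (1.0 - original_fraction)
    return kept_true + flipped_false


# 81% of the original cases have InfectionSuspected = true.
for eps in (0.1, 0.5, 1.0, 2.0):
    print(f"epsilon={eps}: {expected_true_fraction(0.81, eps):.3f}")
```

Under this assumption, ε = 0.1 yields an expected fraction of roughly 0.52, which is close to a fair coin and consistent with the observation that the distribution becomes almost random for strong privacy guarantees.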

Case duration: Next, we examine the accuracy of case durations in the anonymized logs. Unlike the quality of individual event attributes discussed above, the quality of case durations is affected by all three steps of the framework. When interpreting the results in Table 4, it is therefore important to keep in mind that the maximum trace length in the anonymized logs is bounded to 30 events (due to the choice of parameter n), whereas the original log contains traces of up to 370 events. Nevertheless, because of the added noise, longer durations can still be observed in the anonymized logs. Moreover, in all scenarios the mean case duration is much higher than the median case duration. This indicates that the log contains several outliers with long durations. All anonymized logs reveal this insight. We conclude that PRIPEL preserves insights at the trace level, such as case durations.

Process workload: Finally, at the log level, we consider the overall workload of the process in terms of the number of cases that are active at any given time. Since an anonymized event log can contain considerably more traces than the original log, Figure 3 shows the progression of the relative number of active cases over time. The red dots denote the original event log, and the blue triangles denote the anonymized event log with ε = 1.0. The figure clearly shows that the overall trend over time is sustained. However, the workload shown by the anonymized log is consistently higher than that of the original log. In addition, the anonymized log shows less extreme variation over time. This suggests that the necessary noise insertion smooths out some of the variability. Nevertheless, the results show that PRIPEL preserves utility for such log-level process analysis.
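The workload measure can be made concrete with a small sketch. It assumes that each case is represented by a hypothetical (start, end) pair of timestamps and that the relative number of active cases is obtained by dividing by the total number of cases in the log; the paper may normalize differently.

```python
from datetime import datetime


def active_cases(cases, when):
    """Count the cases whose [start, end] interval contains the time point."""
    return sum(1 for start, end in cases if start <= when <= end)


def relative_workload(cases, time_points):
    """Active cases at each time point, divided by the total number of cases."""
    total = len(cases)
    if total == 0:
        return [0.0 for _ in time_points]
    return [active_cases(cases, t) / total for t in time_points]


# Hypothetical cases given as (start, end) timestamps.
cases = [
    (datetime(2020, 1, 1), datetime(2020, 1, 10)),
    (datetime(2020, 1, 5), datetime(2020, 1, 6)),
    (datetime(2020, 2, 1), datetime(2020, 2, 2)),
]
print(relative_workload(cases, [datetime(2020, 1, 5, 12), datetime(2020, 3, 1)]))
```

Normalizing by the log size makes the original and the anonymized log comparable even when the anonymized log contains many more traces.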


Origin blog.csdn.net/weixin_42253964/article/details/108594854