Quantifying the Re-identification Risk of Event Logs for Process Mining

Summary

When publishing event logs, the risk of re-identification must be considered. In this article, we show how to quantify the re-identification risk using measures of individual uniqueness derived from the event log. We also report on a large-scale study that explored the uniqueness of individuals in a collection of publicly available event logs. Our results indicate that potentially up to all cases in an event log may be re-identified, which highlights the importance of privacy-preserving techniques in process mining.

Introduction

Because of well-known privacy threats, the willingness to publish event logs is low. However, publicly available event logs are necessary for evaluating process mining models [2-4], so it is necessary to discuss how event logs can be published safely. In this context, we believe it is important to understand the risk of data re-identification in event logs and process mining. With this insight, one can balance how much information an event log can share against how much should be anonymized to protect privacy. Although many examples have confirmed the general risk of re-identifying data [5-7], the re-identification risk of event logs has not received enough attention.

The purpose of this article is to raise awareness of the re-identification risk of event logs and to provide measures to quantify this risk. To this end, we provide a method for expressing the uniqueness of data, derived from the models commonly used in process mining. Each event recorded in an event log consists of specific data types, such as the activity name of the corresponding process step, the timestamp of its execution, and event attributes that capture the context and parameters of the activity. In addition, the sequence of events related to the same case of the process (also called a trace) has data attributes of its own, so-called case attributes, which contain general information about the case.

To extract sensitive information, an adversary uses background knowledge to link attributes of the target with the case and event attributes in the event log, for example by correlating publicly available sources. The higher the uniqueness of the event log, the higher the adversary's chance of identifying the target. Our method therefore determines the number of cases that are uniquely identifiable by a set of case attributes or a set of event attributes. We use this information to derive uniqueness measures for an event log as a basis for estimating the likelihood that a case is re-identified. To demonstrate the importance of uniqueness in event logs, we conducted a large-scale study on 12 publicly available event logs from the 4TU.Centre for Research Data repository. We classified the event logs and evaluated the uniqueness of the individuals involved in the cases. Our results for these logs indicate that, given prior knowledge, an adversary may be able to re-identify all cases. We show that an adversary needs only a few attributes of a trace to mount such an attack successfully. The contributions of this article can be summarized as follows:

  • We propose a method to quantify the privacy risks associated with event logs. In this way, we support identifying information that should be suppressed when publishing event logs, thereby promoting the responsible use of logs and paving the way for novel use cases based on event log analysis.
  • By reporting the results of a large-scale evaluation study, we emphasize the need to develop practical privacy-preserving techniques for event logs used in process analysis. Our notions of individual uniqueness may foster such efforts because they make the inherent privacy risks explicit.

The structure of this article is as follows. Section 2 illustrates the privacy threats in process mining. Section 3 introduces methods to quantify re-identification risks. We then analyze publicly available event logs, discuss the results, review related work, and conclude.

Privacy threats of process mining

Process mining uses event logs to discover and analyze business processes. An event log captures the execution of activities as events. A finite sequence of such events forms a trace, representing a single process instance (also called a case). For example, the treatment of a patient in the emergency room comprises many events, such as blood sampling and analysis, which together follow a specific structure determined by the process. The events related to a single patient therefore constitute a case. In addition, case attributes provide general information about the case, such as the patient's place of birth. Each event consists of various data types, such as the name of the activity, the timestamp of its execution, and event attributes. Event attributes are event-specific and may change over time, such as the temperature or the department performing the treatment. The main difference between the two is that case attributes do not change their value for a case during the observation period. Table 1 shows an example event log that captures an emergency room process.

Given the structure of an event log, several privacy threats can be identified. Linking cases to individuals can reveal sensitive information. For example, in the emergency room process, certain events can indicate that the patient has a certain condition. In general, case attributes can contain various kinds of sensitive data that reveal racial or ethnic origin, political opinions, religious or philosophical beliefs, and financial or health information. Similarly, an event log can reveal information about the productivity [8] or work schedules of hospital staff. Such employee surveillance is a serious privacy threat. It is therefore important to include privacy considerations in process mining projects. We assume that the adversary's goal is to identify individuals in the event log by linking it with external information.

Depending on the type of background information, different adversary models are possible. We assume targeted re-identification, that is, the adversary has information about a specific individual, comprising a subset of that individual's attribute values. Based on this, the adversary aims to reveal sensitive information, such as a diagnosis. Here, we assume the adversary knows that the individual is present in the event log. In this article, we consider uniqueness measures to quantify the risk of re-identification, thereby providing a basis for managing privacy considerations.

Re-identification of event logs

To apply uniqueness measures to cases, we aggregate all events that belong to the same case. This representation makes it easy to handle multiple events belonging to one case. Because case attributes do not change over time, they need to be considered only once, whereas event attributes may differ for each event, so their changes over time must be taken into account. Table 2 provides a corresponding example.

Each row in the table belongs to one case. The case attributes "gender" and "age" are listed in separate columns. The "Activity", "Timestamp", and "Arrival Channel" columns each contain an ordered list of values, one per event. For example, case 11 has only two events, so each of these lists has only two entries. Its second event, "Antibiotics" on March 4, 2019, has no "Arrival Channel" value (i.e., "none"). The uniqueness of the event log serves as the basis for estimating the likelihood of re-identifying a case. We studied a number of so-called projections, which can be regarded as a data minimization technique that reduces the potential re-identification risk in an event log. A projection is a subset of the attributes in the event log. Projections make it easy to assess risks in different scenarios. Table 3 summarizes the projections of the event log and their potential use in process mining. Projection A contains the sequence of all executed activities and their timestamps, while projection F contains only case attributes. It has been shown that even sparse projections of event logs pose privacy risks [4]. Therefore, our evaluation considers the re-identification risk of various projections.
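As a rough illustration of the aggregation and projection steps described above, the following Python sketch groups events by case and then keeps only a chosen subset of attributes. The event records, the field names, and the helpers `aggregate` and `project` are hypothetical and invented for this example; the data is not the data from Table 2, and the two projections shown do not claim to reproduce the definitions in Table 3.

```python
from collections import defaultdict

# Toy events, invented for illustration (not the data from Table 2).
EVENTS = [
    {"case": 1, "activity": "Registration", "timestamp": "2019-03-01", "arrival": "walk-in"},
    {"case": 1, "activity": "Triage", "timestamp": "2019-03-01", "arrival": None},
    {"case": 2, "activity": "Registration", "timestamp": "2019-03-02", "arrival": "ambulance"},
    {"case": 2, "activity": "Antibiotics", "timestamp": "2019-03-03", "arrival": None},
]
# Case attributes are stored once per case because they do not change over time.
CASE_ATTRIBUTES = {1: {"gender": "female", "age": 34}, 2: {"gender": "male", "age": 61}}


def aggregate(events, case_attributes):
    """Group events by case: one record per case with a time-ordered event list."""
    cases = defaultdict(lambda: {"events": []})
    for event in events:
        cases[event["case"]]["events"].append(event)
    for case_id, record in cases.items():
        # sort() is stable, so events with equal timestamps keep their input order.
        record["events"].sort(key=lambda e: e["timestamp"])
        record["case_attributes"] = dict(case_attributes.get(case_id, {}))
    return dict(cases)


def project(cases, event_keys, case_keys):
    """Keep only the selected event and case attributes (a projection)."""
    projected = {}
    for case_id, record in cases.items():
        points = tuple(tuple(e[k] for k in event_keys) for e in record["events"])
        attrs = tuple(record["case_attributes"].get(k) for k in case_keys)
        projected[case_id] = (attrs, points)
    return projected


cases = aggregate(EVENTS, CASE_ATTRIBUTES)
activities_and_time = project(cases, ["activity", "timestamp"], [])
activities_only = project(cases, ["activity"], [])
```

Representing each projected case as a pair of tuples makes cases directly comparable and hashable, which is convenient when counting how often a combination occurs.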

Uniqueness based on case attributes

In addition to unique identifiers (UIDs), so-called quasi-identifiers are also information that can be linked to an individual. A combination of quasi-identifiers may be sufficient to form a UID. In an event log, case attributes can be seen as quasi-identifiers. For example, in the BPI Challenge 2018 event log [9], the area of all parcels and the ID of the local department can be regarded as case attributes. Measuring uniqueness based on case attributes is a common method to quantify the risk of re-identification [10]. Uniqueness of cases, and thus of individuals, greatly increases the risk of re-identification. A single value of a case attribute does not by itself lead to identification. However, its combination with other attributes may result in unique cases. In particular, linking attributes with other sources of information may result in successful re-identification.

We define uniqueness as the fraction of unique cases in the event log. Let f_k be the frequency of the k-th combination of case attribute values in the sample. If f_k = 1, the case is unique, that is, no other case has the same case attribute values. The uniqueness based on case attributes is therefore defined as:

$$U_{case} = \frac{\sum_{k} I(f_k = 1)}{N}$$

where the indicator function I(f_k = 1) is 1 if the k-th combination is unique and 0 otherwise, and N is the total number of cases in the event log. Referring to the data in Table 2, the attribute value "gender: female" leads to two candidate cases (case 11 and one other case), that is, f_k = 2, which means that the combination is not unique. Taking "age" into account as an additional quasi-identifier makes all three listed cases unique, that is, U_case = 1. Since often only samples of event logs are released, we distinguish between sample uniqueness and population uniqueness. The fraction of unique cases in a sample is called sample uniqueness. Population uniqueness refers to the fraction of unique cases in the complete event log (i.e., the population). For a published event log, we can measure the sample uniqueness. Population uniqueness counts the cases that are unique in the sample and also unique in the underlying population from which the data was sampled. Usually, the event log is a sample from a population, and the original event log is not available. Therefore, population uniqueness cannot be measured and must be estimated.
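The sample uniqueness based on case attributes can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `case_uniqueness` and the toy cases are assumptions made for illustration, and the values are invented rather than copied from Table 2.

```python
from collections import Counter


def case_uniqueness(cases, attributes):
    """Fraction of cases whose combination of attribute values occurs exactly once."""
    if not cases:
        return 0.0
    combos = [tuple(case[a] for a in attributes) for case in cases]
    frequency = Counter(combos)  # f_k for every observed combination k
    unique = sum(1 for combo in combos if frequency[combo] == 1)
    return unique / len(cases)


# Toy cases, invented for illustration.
CASES = [
    {"gender": "male", "age": 40},
    {"gender": "female", "age": 40},
    {"gender": "female", "age": 70},
]

u_gender = case_uniqueness(CASES, ["gender"])
u_gender_age = case_uniqueness(CASES, ["gender", "age"])
```

In this toy data, "gender" alone leaves two cases indistinguishable, while adding "age" as a second quasi-identifier makes every case unique, mirroring the effect described above.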

There are several models for estimating population uniqueness from a sample. These methods extrapolate from the contingency table of the sample by fitting a specific distribution to the frequency counts in order to model the population uniqueness [10]. We use the method of Rocher et al. [7] to estimate population uniqueness. The authors model population uniqueness with Gaussian copulas, approximate the marginals from the sample, and estimate the likelihood that a case that is unique in the sample is also unique in the population. For this analysis, we assume that the event log is a published sample. By applying this method, we estimate the population uniqueness of cases based on their case attributes.

Uniqueness based on traces

Most published event logs used for process mining do not have many case attributes, only event attributes. For example, the sepsis event log [11] has only one case attribute ("age"). However, a case can also be unique because of its events. We therefore also use traces to measure uniqueness.

We assume that the adversary's main goal is to re-identify an individual given a number of known points and to reveal further sensitive points. We assume that the adversary has some background knowledge in the form of known points and can link these points with the event log. In particular, we assume the adversary knows that the individual is included in the event log. In other words, we treat the published event log as the population. As the example in Table 2 shows, all cases are unique even without considering case attributes: case 11 is uniquely identifiable by its second activity, "Antibiotics". By combining activities with the corresponding timestamps, cases 10 and 12 can be uniquely identified as well. For example, the adversary may obtain information about a patient's arrival (for example, "Check-in time: March 5, 2019"). This single point of the trace is sufficient for the adversary to identify the patient and reveal further information from the event log.

Therefore, we express the re-identification risk as the ratio of unique cases. The uniqueness of traces can be measured in a way similar to that of location trajectories [12,13]. In location trajectories, a point consists only of a location and a timestamp. In contrast, our points are not merely two-dimensional but multi-dimensional, comprising, among others, activities, resources, and timestamps. Let {c_i}, i = 1, ..., N, be an event log consisting of a set of N traces. Given a set of m random points of a trace (called M_p), we count the number of traces that contain this point set. If the point set M_p is contained in only one trace, that trace is unique. The uniqueness of traces given M_p is defined as

$$U_{trace} = \frac{\sum_{i=1}^{N} \delta_i}{N}$$

where δ_i = 1 if trace c_i is unique and 0 otherwise.
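The trace-based measure can be sketched as follows. This is a simplified sketch under stated assumptions, not the authors' code: the function `trace_uniqueness` and the toy traces are hypothetical, points are modeled as tuples, and a trace is taken to "contain" the point set if all points occur in it, ignoring order and multiplicity.

```python
import random


def trace_uniqueness(traces, m, seed=0):
    """Estimate the fraction of traces uniquely identified by m random points each."""
    if not traces:
        return 0.0
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    unique = 0
    for trace in traces:
        # M_p: m points drawn at random from this trace (all points if it is shorter).
        known = set(rng.sample(trace, min(m, len(trace))))
        matches = sum(1 for other in traces if known <= set(other))
        if matches == 1:  # only the trace itself contains every known point
            unique += 1
    return unique / len(traces)


# Toy traces of (activity, timestamp) points, invented for illustration.
TRACES = [
    [("Registration", "2019-03-01"), ("Triage", "2019-03-01")],
    [("Registration", "2019-03-01"), ("Antibiotics", "2019-03-02")],
    [("Registration", "2019-03-03"), ("Triage", "2019-03-03")],
]
# The same traces with timestamps removed (activities only).
ACTIVITY_TRACES = [[(activity,) for activity, _ in trace] for trace in TRACES]

u_two_points = trace_uniqueness(TRACES, m=2)
u_one_point = trace_uniqueness(TRACES, m=1)
u_activities = trace_uniqueness(ACTIVITY_TRACES, m=2)
```

With both points known, every toy trace is unique; dropping the timestamps makes the first and third traces indistinguishable, so only one of three traces remains unique. The result for a single known point depends on which point is drawn.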

Results

For the evaluation, we used the publicly available event logs of the 4TU.Centre for Research Data. We divide the event logs into real-life individual (R) and software (S) event logs. In real-life individual event logs, the case identifier refers to a natural person; for example, the ADL event log [14] records activities of daily living of individuals. In event logs involving software activities, events do not refer directly to natural persons but to technical components. For example, the BPI Challenge 2013 event log [15] contains events of an incident management system. Some software-related event logs even consist of a single case, which makes it more difficult to measure the uniqueness of cases. However, if a suitable identifier can be linked to the cases, the uniqueness of software-related event logs can also be measured. For example, the events in the BPI Challenge 2013 event log are processed by natural persons. With an appropriate conversion, the natural person can be used as the case identifier. In the following, we apply our methods only to real-life individual event logs (R). We measure uniqueness based on case attributes only for event logs that have more than one case attribute. Table 4 summarizes our classification, provides basic figures on the number of cases and activities, and indicates which uniqueness measures were applied. To improve readability and for ethical reasons (see Section 4.3 for details), we apply our methods and discuss intermediate results in detail only for the BPI Challenge 2018 [9] and sepsis [11] event logs. For all other event logs, we provide condensed and pseudonymized results. Please note that the order of the pseudonymized event logs in the following sections differs from that in Table 4, but the pseudonyms are used consistently throughout the evaluation.

The BPI Challenge 2018 event log was provided by the German company "Data Expert". It contains events related to applications for payments from the EU Agricultural Guarantee Fund. The event log contains 43,809 cases, each representing a farmer's application for direct payments, over a period of three years. We identified "Actual Payment" (PYMT), "Area" (ARA), "Department" (DPT), "Number of Parcels" (#PCL), "Small Farmer" (SF), "Young Farmer" (YF), "Year" (Y), and "Applicable Amount" (AMT) as case attributes. The data contributors generalized the attributes PYMT, #PCL, and AMT by grouping the values into 100 bins, where each bin is identified by its minimum value [9].

To determine the impact of case attributes, we evaluate uniqueness for various combinations of them. Specifically, we studied which combinations of attribute values make cases more distinguishable and therefore uniquely identifiable. The more extensive the adversary's background knowledge, the more likely an individual becomes identifiable.

The more case attributes are known, the more unique the cases become. We do not consider case attributes that encode activities of the event log (i.e., the first activity performed), because we assume that the adversary does not know the exact order in which the activities were performed.
Not all event logs show high uniqueness based on case attributes. For the BPI Challenge 2018 event log, however, even a few case attributes result in high uniqueness, which implies a high risk of re-identification.

The sepsis event log was obtained from the information system of a Dutch hospital. It contains events related to the logistics and treatment of patients who enter the emergency room with suspected sepsis, a life-threatening condition that requires immediate treatment. Originally, the event log was analyzed to determine whether the guidelines for the timely administration of antibiotics were followed and, more generally, to study the overall trajectory of patients [32]. The data has been made publicly available for research purposes [11].

Several measures have been taken to prevent identification, including:

  • Randomization of timestamps by perturbing the start of each case and shifting the timestamps of all subsequent events accordingly
  • Pseudonymization of discharge-related activities, such as "Release A"
  • Generalization of employee information by stating only the department
  • Pseudonymization of the working diagnosis
  • Generalization of age into groups of five years containing at least 10 people

The event log contains 1,049 cases involving 16 different activities. Each case represents the pathway of a natural person through the hospital. The average trace length is 14 points (minimum = 3, maximum = 185). In contrast to the BPI Challenge 2018 event log, the sepsis event log has only one attribute that can be used as a case attribute.

To estimate the uniqueness of traces, we use the method introduced in Section 3.2. A point in the sepsis event log consists of the activity, the timestamp, and the department currently responsible for treating the patient. "Age" is used as the case attribute. Since patients are treated in different departments, "department" does not meet the time-invariance criterion for case attributes (see Section 2).

For each case, we randomly select m points of its trace and count the number of traces that contain the same points. In other words, we look for other traces that contain, for example, the same activities carried out by the same departments. We select points at random to avoid making assumptions about the adversary's knowledge. We are aware that this may underestimate the risk of re-identification. Consequently, the high uniqueness in our results underlines the re-identification risk, because a more sophisticated, optimized selection of points may lead to even higher uniqueness. Figure 1 shows the uniqueness of traces for different numbers of points m and for different projections. As expected, more points generally lead to higher uniqueness. Assuming the timestamps were correct (which they are not, since they were randomized), projection A shows that four points consisting of activity and timestamp are sufficient to identify all traces. After generalizing the timestamps, that is, reducing their resolution to days, only 31% of the traces are unique when four points are considered, and only 70% are unique when all points of a trace are considered. The results thus clearly show the effect of generalization on the re-identification risk.
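The effect of generalizing timestamps can be illustrated with a short Python sketch. It assumes ISO 8601 timestamp strings and uses hypothetical helper names (`generalize_timestamp`, `generalize_trace`) and invented traces; it only shows the mechanism and says nothing about the figures reported for the sepsis event log.

```python
from datetime import datetime


def generalize_timestamp(timestamp, resolution="day"):
    """Reduce the resolution of an ISO 8601 timestamp string."""
    parsed = datetime.fromisoformat(timestamp)
    if resolution == "day":
        return parsed.strftime("%Y-%m-%d")
    if resolution == "hour":
        return parsed.strftime("%Y-%m-%dT%H")
    raise ValueError(f"unsupported resolution: {resolution}")


def generalize_trace(trace, resolution="day"):
    """Apply timestamp generalization to every (activity, timestamp) point."""
    return [(activity, generalize_timestamp(ts, resolution)) for activity, ts in trace]


# Two toy traces, invented for illustration, that differ only in the time of day.
TRACE_A = [("Registration", "2019-03-05T08:15:00"), ("Triage", "2019-03-05T08:40:00")]
TRACE_B = [("Registration", "2019-03-05T11:02:00"), ("Triage", "2019-03-05T11:30:00")]

distinct_before = TRACE_A != TRACE_B
distinct_after = generalize_trace(TRACE_A) != generalize_trace(TRACE_B)
```

At full resolution the two traces are distinguishable; at day resolution they collapse into the same sequence of points, so neither is unique any more.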

The privacy-enhancing effect of removing values from the event log becomes apparent when other projections are considered. For example, projection B omits the timestamps but otherwise assumes that the adversary has background knowledge of all activities, case attributes, and event attributes. Even so, omitting timestamps limits uniqueness to approximately 37%. Projection D, which still contains case attributes and activities, limits the uniqueness of traces to at most 9%. The uniqueness of traces remains stable beyond 64 points because only 2% of the traces have more than 64 points.

Our method of estimating uniqueness based on traces can be applied to all event logs classified as real-life individual (R). Figure 2 shows the uniqueness of all event logs for the different projections. We evaluate uniqueness given 10%, 50%, and 90% of the points of each trace, that is, we assume the adversary knows this share of the points of each case. A gray field without a number indicates that the projection cannot be evaluated because of missing attributes.

In Figure 2, we observe a trend similar to that of the sepsis event log: projection A generally results in high uniqueness. Omitting information, as represented by the various projections, reduces uniqueness. This becomes apparent when comparing projections B and C, where case attributes are removed. Projection E, which considers only activities, results in low uniqueness, except for event logs 5 and 9. We attribute this to the fact that these event logs contain many different activities and that trace lengths vary widely across cases. For event log 10, the uniqueness of projection B is significantly reduced. This can be explained by its small number of case attributes and small number of unique activities.

The most surprising result is event log 11, which has no unique cases. The main reasons are the daily resolution of its timestamps and its small number of unique activities. It is also worth noting that increasing the share of known points from 10% to 50% has a larger effect on uniqueness than increasing it from 50% to 90%. For example, the uniqueness of projection A of event log 10 increases from 62.4% in Fig. 2a to 73.7% in Fig. 2b. Given 90% of the points of a trace, we observe no further increase in the uniqueness of event log 10. The same can be observed for other event logs and other projections. The main reason is that trace lengths vary greatly.

Overall, we found that uniqueness based on traces is higher than uniqueness based on case attributes (see the results in Table 5). For example, event log 3 has a sample uniqueness of 1.1% based on case attributes, whereas its trace-based uniqueness for projection C is 84.4%. We conclude that traces are particularly vulnerable to re-identification attacks.

Discussion

Our results show that even for randomly selected points of a trace, the uniqueness of 11 of the 12 evaluated event logs is greater than 62%. More specific information, such as the order of individual activities, can lead to even greater uniqueness with fewer points. Adversaries can often use further knowledge about the process to predict certain activities, which is also confirmed in [33]. Nevertheless, the random selection clearly shows that little background knowledge is sufficient and that event logs carry a considerable re-identification risk. In contrast, generalizing attributes helps to reduce the risk [34]. However, the results show that combining multiple attributes (such as case attributes and activities) still produces unique cases. Combining this with a reduced resolution of values, for example publishing only the year of birth instead of the full date of birth, can reduce the risk of re-identification. Such generalization techniques can also be applied to timestamps, activities, or case attributes.

Following the principle of data minimization, that is, limiting the amount of personal data, omitting data is simply the most thorough way to reduce risk. This becomes clear when the projections are considered. These projections can therefore be used to reduce the risk of re-identification.

We applied our methods to published event logs in order to point out the re-identification risks in the area of process mining. To this end, we only quantify the risk and refrain from correlating the event logs with other data, since doing so could re-identify individuals. In addition, we took precautions during the evaluation, such as pseudonymizing the event logs, so as not to expose specific event logs or attribute results to them.

Related work

Re-identification attacks. In the past, many researchers have addressed and successfully carried out re-identification attacks [6,7,12,13,35,36]. Narayanan and Shmatikov [35] de-anonymized a data set containing movie ratings from Netflix by cross-correlating multiple data sets. Our adversary's goal is to re-identify a person, not to reconstruct all attribute values of a person. Therefore, we measure uniqueness, building on two well-known methods [7,12,13].

Rocher et al. [7] estimate population uniqueness based on given attribute values. We use their method to estimate uniqueness based on case attributes. Our method for estimating uniqueness based on traces relies on the method proposed in [12,13], which estimates uniqueness in mobility traces with location data. Because of the structure of event logs, these two methods alone are not sufficient to determine the uniqueness in an event log, and data preparation is required. For example, an event log has a specific format that must be converted before a uniqueness measure can be applied to its traces.

Privacy in process mining. Awareness of privacy issues in process mining has increased, especially since the General Data Protection Regulation (GDPR) came into effect [37]. Although the process mining manifesto [38] calls for balancing utility and privacy in process mining applications, the number of related contributions is still small. Fahrenkrog-Petersen et al. [4] addressed the problem of discovering the correct main process behavior while preserving privacy in the event log. Their algorithm guarantees k-anonymity and t-closeness while maximizing the utility of the sanitized event log. In general, k-anonymity aggregates data in such a way that each individual cannot be distinguished from at least k-1 others in the data set based on their attribute values [39,40]. However, it has been shown that neither k-anonymity nor t-closeness is sufficient to provide strong privacy guarantees [41].
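The k-anonymity property mentioned above can be checked with a short Python sketch. This only illustrates the definition on a table of quasi-identifiers; it is not the algorithm of [4], and the function names and toy records are assumptions made for this example.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every record is indistinguishable from at least k-1 other records."""
    return smallest_group(records, quasi_identifiers) >= k


# Toy records with generalized values, invented for illustration.
RECORDS = [
    {"age_group": "30-39", "gender": "female"},
    {"age_group": "30-39", "gender": "female"},
    {"age_group": "60-69", "gender": "male"},
    {"age_group": "60-69", "gender": "male"},
]

two_anonymous = is_k_anonymous(RECORDS, ["age_group", "gender"], k=2)
three_anonymous = is_k_anonymous(RECORDS, ["age_group", "gender"], k=3)
```

A unique case in the sense of the uniqueness measures is exactly a group of size one, so a data set with any unique case cannot satisfy k-anonymity for k greater than 1.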

To date, the strongest privacy model that provides provable privacy guarantees is differential privacy. Recently, it was incorporated into the first privacy-preserving technique for process mining [2]. This approach proposes a privacy engine that protects personal data by adding noise to queries. The privacy techniques of [2,4] have been combined in a web-based tool [3]. The pseudonymization of data sets related to process mining has been discussed in [42,43]. The values of the original data set are replaced by pseudonyms. However, encryption still allows adversaries with knowledge about the domain and the statistical distribution of the encrypted data to potentially re-identify individuals. In addition to technical privacy challenges for process mining, the approach of [44] also discusses organizational privacy challenges by means of a framework. Although this approach points out some privacy issues in process mining, it does not provide any technical solutions. Pika et al. [33] assess the suitability of existing privacy-preserving methods for process mining data. They propose a framework to support privacy-preserving process mining and analysis. Although Pika et al. analyze the suitability of existing data transformation methods for anonymizing process data, they do not provide a method to support the identification of information (such as atypical process behavior) that should be suppressed to reduce the risk of re-identifying subjects. Our measures fill this gap and help data owners identify unique cases with atypical process behavior.

In contrast to existing work on privacy-aware approaches for process mining, this article aims to quantify the re-identification risk. With this, data publishers can determine what information should be suppressed before publishing event logs for process mining. If the risk of re-identification is high, the approaches mentioned above may help to reduce it and thereby provide stronger privacy guarantees.

Conclusion

This paper identifies and evaluates the re-identification risk of event logs for process mining. We found serious privacy leaks in most of the event logs widely used in the community. To address this problem, we advocate the use of methods that estimate uniqueness, so that event log publishers can carefully assess their event logs before publication and decide whether certain information needs to be suppressed. In general, real-world data traces are a necessary means of evaluating and comparing algorithms. This article shows that, as a community, we must act more cautiously when publishing event logs, and it emphasizes the need to develop privacy-preserving techniques for event logs. We believe that this work will increase trust and the willingness to share event logs while providing privacy guarantees.

Origin blog.csdn.net/weixin_42253964/article/details/107647090