Mining Roles from Event Logs While Preserving Privacy

Summary

Process mining aims to provide insights into actual processes based on event data. Such data is widely available and usually contains private information about individuals. On the one hand, knowing who performed specific activities (the performers are called resources) enables resource behavior analysis such as role mining, and is essential for bottleneck analysis. On the other hand, event data with resource information is highly sensitive. Process mining should reveal insights in the form of annotated models, but should not reveal sensitive information about individuals.

In this article, we show that the problem cannot be solved by naive approaches such as encrypting the data: anonymized individuals can still be identified from a few carefully selected events. We therefore introduce a decomposition method and a collection of techniques that protect the private information of individuals while still allowing roles to be discovered and used for further bottleneck analysis, without revealing sensitive information about individuals. To evaluate the approach, we implemented an interactive environment and applied the method to several real-life and artificial event logs.

Introduction

We provide a privacy-preserving method for discovering roles from event logs. A decomposition method and several techniques are introduced to protect the private information of individuals in the event data against frequency-based attacks in this specific context. The discovered roles can replace the resources in the event data and be used for bottleneck analysis, so that personal identifiers no longer need to be processed. We evaluate the typical trade-off between privacy guarantees and loss of accuracy in our approach.

Preliminaries

In the following, we define sensitive frequencies based on the box plot of the activity frequencies, so that not only outliers but also all other unusual frequencies are classified as sensitive. Activities with sensitive frequencies are more likely to be identified by an adversary.
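
The exact bounds are not spelled out in this summary, so the following is only a rough sketch of the idea. It assumes, purely for illustration, that a frequency is sensitive when it lies outside the box of the box plot, that is, below the first quartile or above the third quartile. The function name `sensitive_activities` and the toy data are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def sensitive_activities(events):
    """Return the activities whose frequency lies outside the box [Q1, Q3].

    `events` is an iterable of (resource, activity) pairs.
    ASSUMPTION: "sensitive" means outside the inter-quartile box; the
    original definition may use different bounds.
    """
    freq = Counter(activity for _resource, activity in events)
    q1, _median, q3 = quantiles(freq.values(), n=4, method="inclusive")
    return {a for a, f in freq.items() if f < q1 or f > q3}


# Toy data (made up): one rare and one very frequent activity.
toy_events = (
    [("r1", "A")] * 1
    + [("r1", "B")] * 5
    + [("r2", "C")] * 5
    + [("r2", "D")] * 5
    + [("r3", "E")] * 20
)
print(sensitive_activities(toy_events))
```

With these toy counts both quartiles equal 5, so the rare activity A and the very frequent activity E are flagged.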

Role mining

When discovering a process model from an event log, the focus is on the process activities and their dependencies. When deriving roles and other organizational entities, the focus is on the relationships among individuals based on the activities they perform. The joint-activity metric used to discover roles and organizational structures treats each individual as a vector of the frequencies of the activities performed by that individual, and uses a similarity measure to compute the similarity between two such vectors. A social network between individuals is then constructed: if the similarity between two individuals is greater than a minimum threshold (Θ), they are connected by an undirected edge. Individuals in the same connected component are assumed to play the same role [4].

Consider Table 1, and assume that the order of activities in each vector is D, V, C, R, S. Then Paolo's vector is P = (0, 1, 1, 0, 0) and Monica's vector is M = (0, 1, 1, 0, 0). The two vectors are identical, so the similarity between them is 1. In this article, we use the resource activity matrix (RAM) defined below as the basis for extracting the vectors and deriving roles.

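Definition 8 (resource activity matrix, RAM): for a resource $r$ and activities $a_1, \dots, a_n$, $RAM_{EL}(r) = (RAM_{EL}(r, a_1), RAM_{EL}(r, a_2), \dots, RAM_{EL}(r, a_n))$, where $RAM_{EL}(r, a) = \sum_{\sigma \in EL} |[x \in \sigma \mid x = (r, a)]|$, i.e., the number of events in the event log $EL$ in which $r$ performs $a$.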
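
As a minimal sketch of Definition 8, the RAM can be computed by counting (resource, activity) pairs over all traces. The function name `resource_activity_matrix` and the two-trace toy log are assumptions made for illustration; the toy log is made up so that it reproduces the two vectors quoted above for Paolo and Monica.

```python
from collections import Counter


def resource_activity_matrix(event_log, activities):
    """Map each resource r to its vector (RAM(r, a1), ..., RAM(r, an)).

    `event_log` is a list of traces; each trace is a list of
    (resource, activity) events. `activities` fixes the column order.
    """
    counts = Counter(event for trace in event_log for event in trace)
    resources = sorted({resource for resource, _activity in counts})
    return {r: [counts[(r, a)] for a in activities] for r in resources}


# Hypothetical toy log, not the Table 1 of the original article.
toy_log = [
    [("Paolo", "V"), ("Paolo", "C")],
    [("Monica", "V"), ("Monica", "C")],
]
ram = resource_activity_matrix(toy_log, ["D", "V", "C", "R", "S"])
print(ram["Paolo"])   # [0, 1, 1, 0, 0]
print(ram["Monica"])  # [0, 1, 1, 0, 0]
```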

Table 2 is the RAM obtained from Table 1.

Definition 9 (joint activity social network, JSN): let $sim(r_1, r_2)$ be the similarity between the vectors $RAM_{EL}(r_1)$ and $RAM_{EL}(r_2)$, and let $\Theta$ be the similarity threshold. $E = \{(r_1, r_2) \in res(EL) \times res(EL) \mid sim(r_1, r_2) > \Theta\}$ is the set of undirected edges between resources, and $JSN_{EL} = (res(EL), E)$ is the joint activity social network.

Note that various similarity measures can be applied, for example Euclidean distance, Jaccard similarity, or Pearson correlation. Figure 1 shows the network and the roles obtained with a threshold of 0.1 and Pearson as the similarity measure.
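
The construction of the joint activity social network and the reading of roles as connected components can be sketched as follows. This is an illustrative sketch, not the authors' tool: the helper names are hypothetical, Pearson correlation is implemented by hand, and the third resource "Kim" with its vector is made up to show a resource that ends up in a separate role.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def joint_activity_network(ram, theta):
    """Undirected edges between resources whose similarity exceeds theta."""
    return {
        frozenset((r1, r2))
        for r1, r2 in combinations(ram, 2)
        if pearson(ram[r1], ram[r2]) > theta
    }


def roles(ram, edges):
    """Connected components of the network; each component is one role."""
    adjacent = {r: set() for r in ram}
    for edge in edges:
        a, b = tuple(edge)
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, components = set(), []
    for start in ram:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacent[node] - component)
        seen |= component
        components.append(component)
    return components


toy_ram = {
    "Paolo": [0, 1, 1, 0, 0],
    "Monica": [0, 1, 1, 0, 0],
    "Kim": [1, 0, 0, 1, 0],  # hypothetical resource
}
edges = joint_activity_network(toy_ram, theta=0.1)
print(roles(toy_ram, edges))
```

With the threshold 0.1, Paolo and Monica are connected and form one role, while the made-up resource is negatively correlated with both and stays on its own.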

Problem (attack analysis)

Here, we discuss the general issues of confidentiality/privacy in process mining, and then we focus on specific issues and attack models.

General problem


Consider Table 3, a fully encrypted event log containing surgery-related information. It includes the standard attributes (case ID, activity, resource, and timestamp). Process mining techniques need to preserve the distinction between values, so case IDs, activities, and resources are encrypted with a deterministic encryption method. Numerical data (i.e., timestamps) are encrypted with homomorphic encryption so that basic mathematical computations can still be applied. Although the fully encrypted event log appears to be secure, it is not.

An adversary can find the most frequent or least frequent activities and, given suitable background knowledge, simply replace the encrypted values with the real ones. In addition, the position of an activity can be used to infer sensitive information: for example, when an activity is always the first or last one in a trace, the real activity can be inferred from domain knowledge. Attacks of this kind are called frequency-based. Note that once the actual activity name has been inferred, the corresponding performer is likely to be identified as well.

In addition to the above attacks, other attributes can also be used to identify actual activities and resources. For example, when timestamps are encrypted with a deterministic homomorphic encryption method, the duration between two events can be derived. Based on background knowledge, it can then be inferred that the longest or shortest duration belongs to a specific event. When more attributes are available, they can be combined to infer further attributes.

These examples illustrate that, given specific domain knowledge, data can leak even from a fully encrypted basic event log. In addition, if mining techniques are applied to encrypted event logs, the results are encrypted as well, and data analysts cannot interpret them without decryption.

Attack analysis

Now we focus on the specific setting in which the goal is to extract roles without revealing who performed which activity. Roles can be derived from a simple event log, and in this setting activities are considered the sensitive attribute. Therefore, the activities are hashed, and we define H(A) as the universe of hashed activities, where H(X) = {H(x) | x ∈ X} and H is a one-way hash function; here we use SHA-256.

We assume that the frequencies of activities are the background knowledge (bk), where Ufrq = H(A)×N denotes the frequencies of the hashed activities; with this background knowledge, the actual activities can be revealed. For example, in the event log of Table 1, the least frequent activity is the "special situation", which can be revealed from background knowledge about frequencies. We regard this kind of information disclosure as activity disclosure (a type of attribute disclosure). Note that resources are usually not unique identifiers in an event log; nevertheless, they may be encrypted or hashed as well. Here, our focus is on activities. The challenge is to conceal the frequencies of activities, even though the activities are needed to measure the similarity of resources and to derive roles. When the background knowledge concerns traces (for example, the length of traces and the position of activities in traces), our method can also improve privacy.
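
The weakness of plain hashing can be illustrated with a short sketch. Because SHA-256 is deterministic, every occurrence of an activity maps to the same digest, so the frequency profile survives hashing unchanged. The function names, the activity names other than "special situation", and all counts below are hypothetical; the attacker is assumed to know the frequency of each real activity.

```python
import hashlib
from collections import Counter


def h(activity):
    """One-way hash of an activity name (SHA-256, hex digest)."""
    return hashlib.sha256(activity.encode("utf-8")).hexdigest()


def frequency_attack(hashed_events, background):
    """Map hashed activities back to real names by matching frequencies.

    `background` maps each real activity to its known frequency.
    Only frequencies that are unique on both sides are matched.
    """
    observed = Counter(hashed_events)
    known_multiplicity = Counter(background.values())
    observed_multiplicity = Counter(observed.values())
    known = {
        f: a for a, f in background.items() if known_multiplicity[f] == 1
    }
    return {
        digest: known[f]
        for digest, f in observed.items()
        if f in known and observed_multiplicity[f] == 1
    }


# Hypothetical log: activity names and counts are made up.
real_events = ["register"] * 5 + ["visit"] * 3 + ["special situation"] * 1
hashed_log = [h(a) for a in real_events]
background = {"register": 5, "visit": 3, "special situation": 1}

recovered = frequency_attack(hashed_log, background)
print(recovered[h("special situation")])  # special situation
```

Every hashed activity in this toy log has a distinct frequency, so all three are re-identified without ever inverting the hash function.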

Method

The idea is to decompose each activity into several other activities (substitutes), so that the frequency and the position of the activities are perturbed. At the same time, the similarities between resources should be preserved as closely as possible. To this end, we need to determine the number of substitutes for each activity and how to distribute the frequency of the main activity among its substitutes. We regard D(H(A)) as the universe of decomposed hashed activities, and obtain the sanitized event log as follows.

Definition 10 (sanitized event log): a sanitized event log can take one of three forms: multiset (mul), set, or trace.

The mul form can be used when traces do not need to be reconstructed from the sanitized event log. In this case, the sanitized event log fully protects the privacy of individuals and prevents attribute disclosure based on the assumed background knowledge. Moreover, the resource activity matrix and the corresponding joint activity social network can be derived directly from the sanitized event log.

Decomposition method

The number of substitutes (NSa) for each activity a should be chosen in such a way that activities with sensitive frequencies can no longer be identified. We introduce the following techniques.

  • Fixed value: a fixed value is used as the number of substitutes for every activity, i.e., for each activity a, NSa = n.
  • Selective: with this technique, only the sensitive frequencies are targeted, so only some of the activities with sensitive frequencies are decomposed. Substitutes are assigned according to a formula given as an image (not reproduced here).

Note that the goal is to perturb the bounds of the sensitive frequencies while keeping the number of activities after decomposition as small as possible.

  • Frequency-based: substitutes are allocated based on the relative frequency of the main activity. The number of substitutes for each activity is given by a formula shown as an image (not reproduced here).
    After the number of substitutes has been determined, the set of substitutes Suba = {sa1, sa2, …, saNSa} is chosen for each activity a; the substitute sets of different activities are disjoint.

To preserve the main characteristics of the vectors, we distribute the frequency of each main activity evenly among its substitutes. To this end, while traversing the event log, for each resource the i-th occurrence of an activity a is replaced by the assigned substitute sai.

We guarantee that if a resource performs an activity at least as often as another resource, it also performs each of the corresponding substitutes at least as often as that resource.
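
The substitution step and the ordering guarantee can be illustrated with a small sketch. It assumes a round-robin reading of the text: per resource, the i-th occurrence of activity a is mapped to substitute number i mod NSa, which spreads the frequency evenly. The function name `decompose`, the readable substitute names such as `a_0`, and the toy data are hypothetical; a real sanitized log would use hashed, unlinkable names.

```python
from collections import Counter, defaultdict


def decompose(events, ns):
    """Replace each activity by one of its substitutes, round-robin per resource.

    `events` is a list of (resource, activity) pairs and `ns` maps each
    activity to its number of substitutes (activities missing from `ns`
    keep a single substitute).
    """
    occurrences = defaultdict(int)  # (resource, activity) -> occurrences so far
    sanitized = []
    for resource, activity in events:
        i = occurrences[(resource, activity)]
        occurrences[(resource, activity)] += 1
        substitute = f"{activity}_{i % ns.get(activity, 1)}"
        sanitized.append((resource, substitute))
    return sanitized


# Toy data: r1 performs activity "a" five times, r2 three times.
toy_events = [("r1", "a")] * 5 + [("r2", "a")] * 3
sanitized_counts = Counter(decompose(toy_events, {"a": 2}))
print(sanitized_counts)
```

With two substitutes, r1 ends up with frequencies 3 and 2 and r2 with 2 and 1, so r1 still performs each substitute at least as often as r2, as the guarantee above requires.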

Privacy analysis

To analyze privacy, we measure the disclosure risk of the original event log and of the sanitized event log. Two factors are considered: the number of activities with sensitive frequencies, and the presence of actual activities with sensitive frequencies.

Evaluation

Figure 1 is the output of the Python-based tool used in this article. Two event logs are used: BPIC 2012 and BPIC 2017.

CN denotes the connected part and UC the unconnected part.

Table 4 shows the similarity when the fixed-value technique is used to determine the number of substitutes. The networks are almost identical and the accuracy is acceptable. As the number of substitutes increases, the average similarity decreases, which reflects the typical trade-off between accuracy and privacy. In addition, the networks are identical in the unconnected part: if two resources are not connected in the original JSN, they are not connected in the JSN obtained from the sanitized event log either. Figure 2 shows the results for various thresholds when the selective or the frequency-based technique is used. On average, the selective technique leads to more accurate results, whereas the frequency-based technique performs better on the unconnected part. Note that BPIC 2017 is larger than BPIC 2012 in terms of both resources and activities (Table 5).

Privacy

To evaluate the effect on privacy, we calculated the disclosure risk of the original event log and of the event logs sanitized with the different decomposition techniques. Table 6 and Table 7 show the parameters of the disclosure risk for BPIC 2012 and BPIC 2017, respectively. With the fixed-value technique, the disclosure risk (DR) is lower for larger values, because the number of substitutes in both event logs is then large. In addition, since the relative frequency of the least frequent activities is very low, the frequency-based technique does not affect the lower bound of the sensitive frequencies. This weakness can be mitigated by combining the technique with a fixed value, so that the number of substitutes is the relative frequency plus a fixed value.

To compare the introduced techniques, we take the minimum disclosure risk that all techniques can provide as the basis for comparison, and evaluate the accuracy and complexity of the different techniques at that same disclosure risk. Accuracy is the average similarity between the networks, and complexity is measured as the number of unique activities. Note that for the fixed-value technique we use the sanitized event log with the smallest NS that achieves this baseline disclosure risk. Table 8 and Table 9 show the results of this experiment for BPIC 2012 and BPIC 2017, respectively. In both event logs, the fixed-value technique provides more accurate results, while the selective technique results in lower complexity.

The explanations above and our experiments show that the decomposition method provides accurate and highly flexible protection for mining roles from event logs. For example, when the upper bound of the sensitive frequencies matters more and the accuracy of the unconnected part is more important, the frequency-based technique can be used.

Conclusion

In this article, we address privacy issues from the organizational perspective of process mining for the first time. We propose a method for discovering joint activity social networks and mining roles while preserving privacy. We introduced a decomposition method and a collection of techniques with which private information about individuals can be protected against frequency-based attacks. The discovered roles can replace the individuals in the event data for further performance and bottleneck analysis.

The method was evaluated on BPIC 2012 and BPIC 2017, and its effect on accuracy and privacy was demonstrated. To evaluate accuracy, we measured the similarity between the connected and the unconnected parts of the two networks separately, for different thresholds. In addition, we introduced three techniques for determining the number of substitutes in the decomposition method, and demonstrated their effect on accuracy and privacy when the frequencies of activities are assumed as background knowledge. In the future, other techniques could be explored, or combinations of techniques could be tailored to the characteristics of an event log.

Appendix

Pearson similarity measure

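
For reference, the standard Pearson correlation coefficient can be written in the notation of Definition 8 as follows. This is the textbook formula and is assumed, not confirmed, to match the measure used in the original article; the symbols $x$, $y$, $\bar{x}$, and $\bar{y}$ are introduced here for illustration, with $x = RAM_{EL}(r_1)$ and $y = RAM_{EL}(r_2)$.

```latex
sim(r_1, r_2) =
  \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
        \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

For the identical vectors P = M = (0, 1, 1, 0, 0) from the role mining section, the numerator and the denominator are both 1.2, so the similarity is 1.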


Origin blog.csdn.net/weixin_42253964/article/details/107525568