Supporting Confidentiality in Process Mining Using Abstraction and Encryption

Summary

Using abstraction, a method is proposed that enables us to hide confidential information in a controllable manner and to ensure that the desired process mining results are still retained.

The connector method is used as a technique to safely store the correlation between events.

Introduction

Focusing on the confidentiality of process mining, we aim to solve two important issues:

  • Protect sensitive data belonging to the organization
  • Protect personal information

We want to keep as little information as possible while still obtaining the same desired results. Here, the desired results are the process model and the social network. The model or network discovered from the anonymized event data should be consistent with the results obtained from the original event data.

Our approach is based on abstraction. In addition, we introduce the connector method, through which traces about individuals remain anonymous while process models and social networks remain discoverable. The proposed method allows us to obtain the same results from the secured event data as from the original event data, while unauthorized persons cannot access confidential information. The framework provides a security solution for cross-organizational process mining.

Related work

DFG

A directly-follows graph (DFG) is a graph in which nodes represent activities and arcs represent causality. When activity "a" is frequently followed directly by activity "b", "a" and "b" are connected by an arrow. The weight of the arrow indicates the frequency of the relation [19]. Most business process mining tools use DFGs. Unlike more advanced process discovery techniques (for example, those implemented in ProM), DFGs cannot express concurrency. Figure 1 shows the DFG generated from the event log in Table 1.
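As a rough illustration of how such a graph can be computed, the following sketch counts directly-follows pairs per case and keeps the arcs that reach a frequency threshold. The function name, the (case_id, activity) tuple layout, and the example log are assumptions made for this illustration; they are not taken from Table 1.

```python
from collections import Counter, defaultdict

def directly_follows_graph(events, threshold=1):
    """Build a DFG from (case_id, activity) pairs.

    The events are assumed to be ordered by time within each case.
    Returns a dict mapping (a, b) arcs to their frequency.
    """
    traces = defaultdict(list)
    for case_id, activity in events:
        traces[case_id].append(activity)

    counts = Counter()
    for trace in traces.values():
        # Every pair of neighbouring activities is one directly-follows occurrence.
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1

    # Keep only arcs that occur at least `threshold` times.
    return {arc: n for arc, n in counts.items() if n >= threshold}

# Hypothetical example log with three cases.
log = [
    (1, "register"), (1, "examine"), (1, "release"),
    (2, "register"), (2, "examine"), (2, "release"),
    (3, "register"), (3, "release"),
]
dfg = directly_follows_graph(log, threshold=2)
print(dfg)
```

With a threshold of 2, the arc from "register" to "release" is dropped because it occurs only once.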

Discover social networks

There are many ways to discover social networks from event logs, including metrics based on causality, joint activities, and joint cases [9]. However, we focus only on metrics based on causality. These metrics monitor, for individual cases, how work moves from resource to resource. For example, if there are two subsequent activities, the first executed by i and the second by j, then there is a handover relation from individual i to individual j. If there is also a causal dependency between the two activities, the relation becomes a real handover. Note that in this case the directly-follows relation between resources is not enough; a real causal dependency is required.

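The dependency measure (Equation 1) can be used to determine whether there is a real causal dependency between two activities a and b, by setting a threshold as the minimum required value [2]. In Equation 1, |a >_L b| denotes how often a is directly followed by b in the event log L:

$$
a \Rightarrow_L b =
\begin{cases}
\dfrac{|a >_L b| - |b >_L a|}{|a >_L b| + |b >_L a| + 1} & \text{if } a \neq b \\[2ex]
\dfrac{|a >_L a|}{|a >_L a| + 1} & \text{if } a = b
\end{cases}
\qquad (1)
$$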
When observing handovers, indirect succession can also be considered. For example, based on the event log in Table 1, there is a non-real handover relation between "Frank" and "Alex" with a depth of 3. It is non-real because there is no real causal dependency between all the corresponding activities. Figure 2 shows the causality-based network obtained from the event log in Table 1.
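The idea of a real handover can be sketched as follows. The sketch assumes that Equation 1 is the usual dependency measure from heuristic mining, (|a > b| - |b > a|) / (|a > b| + |b > a| + 1) for a ≠ b, and that a handover counts as real when this value reaches the threshold. Only direct succession (depth 1) is covered, and all function names and example data are hypothetical.

```python
from collections import Counter

def follows_counts(traces):
    """Count how often x is directly followed by y over all traces."""
    counts = Counter()
    for trace in traces:
        for x, y in zip(trace, trace[1:]):
            counts[(x, y)] += 1
    return counts

def dependency(counts, a, b):
    """Dependency measure between activities a and b (assumed form of Equation 1)."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    if a == b:
        return ab / (ab + 1)
    return (ab - ba) / (ab + ba + 1)

def real_handovers(traces, threshold):
    """traces: list of traces, each a list of (activity, resource) pairs.

    A handover from r1 to r2 is counted only if the two underlying
    activities have a dependency value of at least `threshold`.
    """
    act_counts = follows_counts([[a for a, _ in t] for t in traces])
    handovers = Counter()
    for trace in traces:
        for (a, r1), (b, r2) in zip(trace, trace[1:]):
            if dependency(act_counts, a, b) >= threshold:
                handovers[(r1, r2)] += 1
    return handovers

# Hypothetical traces: activities a, b, c executed by Ann, Bob, Cat.
traces = [
    [("a", "Ann"), ("b", "Bob"), ("c", "Cat")],
    [("a", "Ann"), ("b", "Bob"), ("c", "Cat")],
    [("a", "Ann"), ("c", "Cat"), ("b", "Bob")],
]
handovers = real_handovers(traces, threshold=0.5)
print(dict(handovers))
```

In this example Bob hands work to Cat twice, but b and c occur in both orders, so their dependency value (0.25) stays below the threshold and the handover is not counted as real.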

Cryptography

  • Symmetric cryptography: The same key is used to encrypt and decrypt messages. Symmetric systems process data faster than asymmetric systems because they usually use shorter keys. The Advanced Encryption Standard (AES) is a symmetric encryption algorithm [13].
  • Asymmetric cryptographic system: An asymmetric system uses the public key to encrypt a message and the private key to decrypt it, or vice versa. Using asymmetric systems enhances the security of communication. Rivest-Shamir-Adleman (RSA) is an asymmetric encryption algorithm.
  • Deterministic cryptographic system: A deterministic cryptographic system always produces the same ciphertext for a given plaintext and key, even over separate executions of the encryption algorithm.
  • Probabilistic cryptographic system: In contrast to a deterministic cryptographic system, a probabilistic cryptographic system uses randomness when encrypting. Therefore, encrypting the same plaintext several times produces different ciphertexts.
  • Homomorphic cryptographic system: A homomorphic cryptographic system allows computations on ciphertexts. For example, Paillier is a partially homomorphic cryptographic system [24].
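To make the probabilistic and homomorphic properties concrete, here is a toy Paillier sketch using only the Python standard library. The tiny primes, the function names, and the fixed randomness values in the example are assumptions for illustration; a real deployment would use a vetted library and keys of adequate size.

```python
import math
import secrets

def keygen(p=1009, q=1013):
    """Toy key generation with tiny primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is n + 1
    return n, (lam, mu)

def encrypt(n, m, r=None):
    """Encrypt m < n; a fresh random r makes the scheme probabilistic."""
    n2 = n * n
    if r is None:
        while True:
            r = secrets.randbelow(n - 1) + 1
            if math.gcd(r, n) == 1:
                break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

n, priv = keygen()
c1 = encrypt(n, 120, r=17)   # same plaintext ...
c2 = encrypt(n, 120, r=23)   # ... different randomness, different ciphertext
c3 = encrypt(n, 80)

# Multiplying ciphertexts adds the underlying plaintexts.
total = decrypt(n, priv, (c1 * c3) % (n * n))
print(c1 != c2, total)
```

The sum of two encrypted costs can therefore be computed without ever decrypting the individual values.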

Problem definition (attack analysis)

We use an example to illustrate the confidentiality challenges in process mining:

Consider Table 2, which describes a fully encrypted event log belonging to a hospital that performs surgeries. Since we need to preserve differences in order to find the sequence of activities for each case, discover the process model, and perform other analyses such as social network discovery, "Case ID", "Activity", and "Resource" are encrypted using a deterministic encryption method. Numerical data (i.e., "Timestamp" and "Cost") are encrypted using a homomorphic encryption method to preserve the ability to perform basic mathematical computations on the encrypted data. Now suppose we have background knowledge about the surgeons and the approximate cost of different types of surgery. The question is whether parts of the log can now be deanonymized.

Since "Cost" is encrypted using a homomorphic encryption method, the maximum encrypted "Cost" value corresponds to the actual maximum cost. From background knowledge we know, for example, that the most expensive event in the hospital was the brain surgery performed by "Doctor Jone" on "01/09/2018 at 12:00", and that the patient's name is "Judy". Since "Case ID", "Activity", and "Resource" are encrypted using a deterministic encryption method, we can replace all occurrences of these encrypted values with the corresponding plain values. Hence, encrypted data can be made visible without decryption. This example illustrates that, even given a fully encrypted event log, a small amount of contextual knowledge can lead to data leakage.

Given domain knowledge, several analyses can be performed to identify individuals or extract sensitive information from encrypted event logs. Below, we explain a few of them.

  • Exploring the length of traces: One can find the longest/shortest trace, and relevant background knowledge can then be used to infer the actual activities and the related cases.
  • Frequency mining: One can find the traces with the highest or lowest frequency, and relevant background knowledge can then be used to identify the corresponding cases and actual activities.
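Both analyses work on deterministically encrypted data because equal plaintexts map to equal tokens. The sketch below shows this on a hypothetical encrypted log; the opaque tokens and the function name are invented for illustration, and no key is involved at any point.

```python
from collections import Counter, defaultdict

def trace_profile(encrypted_events):
    """Group (case_token, activity_token) pairs into traces.

    Returns the length of every trace and the frequency of every
    trace variant, computed purely on the encrypted tokens.
    """
    traces = defaultdict(list)
    for case_token, activity_token in encrypted_events:
        traces[case_token].append(activity_token)
    lengths = {case: len(trace) for case, trace in traces.items()}
    variants = Counter(tuple(trace) for trace in traces.values())
    return lengths, variants

# Hypothetical deterministically encrypted log.
encrypted_log = [
    ("c#9f", "a#01"), ("c#9f", "a#07"), ("c#9f", "a#07"), ("c#9f", "a#03"),
    ("c#2b", "a#01"), ("c#2b", "a#03"),
    ("c#7e", "a#01"), ("c#7e", "a#03"),
]
lengths, variants = trace_profile(encrypted_log)
longest_case = max(lengths, key=lengths.get)
rarest_variant = min(variants, key=variants.get)
print(longest_case, rarest_variant)
```

An attacker who knows that only one patient went through an unusually long treatment could link the token "c#9f" to that patient without decrypting anything.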

These are just a few examples showing that encryption alone is not a solution. For example, [21] showed that encrypted mobility traces can be easily identified. In addition, any method that merely encrypts the entire event log has the following disadvantages:

  • Encrypted results: Because the results are encrypted, the data analyst cannot interpret them. For example, as data analysts we may want to know the most frequent paths after the "registration" activity; how can this activity be identified when activities are not in plain text? The only solution is to decrypt the results.
  • Impossibility of accuracy evaluation: How can we ensure that the results obtained from the encrypted event log are the same as those obtained from the plain event log? Again, decryption would be required.

Generally, as discussed in [12], encryption is a resource-consuming activity, and decryption consumes even more resources than encryption. These weaknesses indicate that encryption methods should be used wisely: one needs to evaluate carefully where they are beneficial and where they are unavoidable for providing confidentiality.

Here, we assume that the background knowledge can be any contextual knowledge about traces that leads to case disclosure: the frequency of traces, the length of traces, the exact/approximate time associated with a case, etc. Note that this background knowledge is assumed to be used by unauthorized persons who have access to the anonymized data. For example, given domain knowledge about the frequency of traces, one can guess the actual sequence of activities of a rare trace and the cases behind it (such as politicians, celebrities, etc.). Consequently, individuals or minorities and their private information would be disclosed. Therefore, case disclosure is the key type of data leakage that should be prevented.

Method

Figure 3 shows a framework that provides a solution for confidentiality when the desired result is a model. The framework is inspired by [5: Process discovery from event data: relating models and logs through abstractions], where abstractions are introduced as intermediate results relating models and logs. Here, the abstractions are the directly-follows matrix of activities (A-DFM) and the directly-follows matrix of resources (R-DFM). Figure 4 shows the A-DFM and R-DFM generated from the event log in Table 1. The A-DFM is regarded as an abstraction relating the log and the process model. Together, the R-DFM and A-DFM are regarded as an abstraction relating the log and the causality-based social network. As shown in Figure 3, there are three different environments and two confidentiality solutions:

  • Forbidden environment: In this environment, the actual information system runs and needs the real data. The real event log (EL) generated in this environment contains a lot of valuable confidential information, and nobody except a few authorized persons can access this data.
  • Internal environment: Only authorized stakeholders can access this environment. Data analysts can be considered authorized stakeholders with access to the internal event log. The event log in this environment is partially secure, the selected results (such as process models) produced in this environment are the same as the results produced in the forbidden environment, and data analysts can interpret the results without decryption.
  • External environment: In this environment, unauthorized outsiders can access the data. Such an environment can be used to provide computing infrastructure (for example, cloud solutions) for processing large data sets. In this environment, the event log should be fully secure, and the results are encrypted. Whenever a data analyst wants to interpret the results, they need to be decrypted and converted into the internal version. In addition, the results of the external environment do not have to be exactly the same as those of the internal environment, but they must allow the same interpretation.

Table 3 summarizes our assumptions about the internal and external environments. Note that in the forbidden environment, the main assumption is that only a very small number of highly trusted persons can access the data. Therefore, no confidentiality solution is required there. As described in the problem definition, contextual knowledge about traces is assumed as background knowledge. As shown in Figure 3, the desired results, namely the process model (PM) and the social network (SN), can be obtained in each environment. The internal confidentiality solution (ICS) and the external confidentiality solution (ECS) convert the original event log (EL) into the partially secure event log of the internal environment (EL'), and then into the fully secure event log of the external environment (EL'').

The abstractions, as intermediate results, are used to prove accuracy. Since an abstraction is the output of the last stage before the final result (only thresholds still need to be applied), equal abstractions yield the same final result. In addition, the reverse operation of the internal confidentiality solution (ICS^-1) and the reverse operation of the external confidentiality solution (ECS^-1) provide transparency. ICS and ECS are explained below.

Internal Confidentiality Solution (ICS)

For ICS, we combined several methods and introduced the connector method. Figure 5 outlines the anonymization process:

  • Filter and modify the input. The first step towards effective anonymization is to prepare the input data. To filter the input, a simple frequency limit can be set: while the event log is loaded, all traces that do not reach the minimum frequency are not transferred to the internal event log (EL').
  • Choose the plain data. As mentioned above, we need to produce interpretable results. Therefore, some parts of the event log remain in plain text in the internal version of the secure event log (EL'). We should determine which information and/or structure is strictly necessary for the desired analysis. Given the abstractions we consider (A-DFM and R-DFM), the only information needed is the directly-follows relation between activities/resources.
  • Encryption. There are two important choices here. The first is which columns of the event log should be encrypted. The second is which algorithm should be used. For the internal environment, since we want to retain the ability to apply basic mathematical computations to encrypted values, we use Paillier for numerical attributes (i.e., "Cost"). Note that the encrypted values shown in this article are not necessarily the real output of the encryption methods (they are just unintelligible text).
  • Make times relative. Timestamps need to be modified, because keeping the exact time of an event allows it to be identified. The naive approach of setting the start time of each trace to 0 would make it impossible to replay events and reconstruct the original log. Therefore, we make all events relative to another chosen time. This time can be kept secure together with the decryption key. Table 4 shows the first 10 rows of our example log after encrypting the cost and making the times relative (to "01-01-2018: 00.00").
  • Connector method. Using the connector method, we embed into EL' the structure from which the directly-follows relations can be extracted. Likewise, given the key and the relative time value, the connector method allows us to reconstruct the complete original event log. In the first step, the previous activity ("Prev. Activity") and previous resource ("Prev. Resource") columns are added in order to identify the arcs that are directly connected.
    In the second step, we find a way to securely save the information contained in the "Case ID" without allowing it to link the events. This is done by giving each row a random ID ("ID") and a previous ID ("Prev. ID"). These IDs uniquely identify the next event in a trace because, unlike activity names, IDs are not shared between events. The ID of the start event is always zero. Table 5 shows the log after adding "Prev. Activity", "Prev. Resource", "ID", and "Prev. ID".
    In the third step, since these columns contain the same information previously found in the "Case ID", they must be hidden and secured. This is done by concatenating the "ID" and "Prev. ID" of each row and encrypting the result with AES.
    Due to the nature of AES, neither the order nor the size of the IDs can be inferred from the encrypted value. The IDs can be concatenated in any style; in this example, we simply concatenate "ID" and "Prev. ID", so that, for example, the connector of the first row is "3100". To retrieve "ID" and "Prev. ID", one only needs to decrypt the "Connector" column and split the resulting number into two equal parts. This requires zero-padding whenever the two IDs differ in their number of digits, so that both halves have equal length. Table 6 shows the log after concatenating the ID columns and encrypting them as the connector.
    In the last step, we use the "Case ID" to anonymize the "Timestamp". The "Timestamp" attribute of each event is made relative to the preceding event with the same "Case ID". The exception is the first event of each trace, which remains unchanged. This still allows the duration of all arcs in the directly-follows graph to be computed, but it complicates identifying events based on the time at which they occurred. After creating the relative times, we can delete the "Case ID" and shuffle the order of all rows, finally obtaining the unconnected log in Table 7.
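The connector steps can be sketched roughly as follows. This is an illustration under several assumptions: the row layout, the function names, and the "START" marker for the first event of a trace are chosen for the example, the relative timestamps are left out, and the AES encryption of the connector is omitted because the Python standard library provides no AES (in practice the connector string would be encrypted before the log is shared).

```python
import random

def add_connectors(events, seed=None):
    """events: list of (case_id, activity, resource), ordered by time per case.

    Returns shuffled rows without the case ID. Each row carries the previous
    activity/resource and a zero-padded connector "<ID><Prev. ID>".
    """
    rng = random.Random(seed)
    # Unique, non-zero random IDs; 0 is reserved as the Prev. ID of start events.
    ids = rng.sample(range(1, 10 * len(events) + 1), len(events))
    width = len(str(max(ids)))
    last = {}  # case_id -> (activity, resource, id) of the previous event
    rows = []
    for (case_id, activity, resource), event_id in zip(events, ids):
        prev_act, prev_res, prev_id = last.get(case_id, ("START", "START", 0))
        rows.append({
            "activity": activity, "prev_activity": prev_act,
            "resource": resource, "prev_resource": prev_res,
            # Zero-padding keeps both halves of the connector equally long.
            "connector": f"{event_id:0{width}d}{prev_id:0{width}d}",
        })
        last[case_id] = (activity, resource, event_id)
    rng.shuffle(rows)  # destroy the original row order
    return rows

def split_connector(connector):
    """Split a (decrypted) connector into (ID, Prev. ID)."""
    half = len(connector) // 2
    return int(connector[:half]), int(connector[half:])

def rebuild_traces(rows):
    """Reverse direction: follow the ID chain to recover the activity traces."""
    by_prev = {}
    for row in rows:
        event_id, prev_id = split_connector(row["connector"])
        by_prev.setdefault(prev_id, []).append((event_id, row["activity"]))
    traces = []
    for event_id, activity in by_prev.get(0, []):
        trace = [activity]
        while event_id in by_prev:  # each non-zero ID has at most one successor
            event_id, activity = by_prev[event_id][0]
            trace.append(activity)
        traces.append(trace)
    return traces

events = [
    (1, "register", "Ann"), (2, "register", "Ann"), (1, "examine", "Bob"),
    (1, "release", "Cat"), (2, "release", "Cat"),
]
rows = add_connectors(events, seed=7)
print(sorted(rebuild_traces(rows)))
```

Without the connector values the shuffled rows expose only directly-follows pairs, while anyone holding the key can decrypt the connectors and rebuild the traces.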

Table 7 is the internal secure event log (EL'), which data analysts can use to create the A-DFM and R-DFM. Clearly, if process/social network discovery could be performed on the plain event log (EL), the abstraction obtained from EL (AEL) and the abstraction obtained from EL' (AEL') would be identical (i.e., both are the same A-DFM/R-DFM), and consequently the desired results would be the same. Note that when the desired result is a process model, the resource-related information (the "Resource" and "Prev. Resource" columns) can be deleted from Table 7. Likewise, when the desired result is a handover network, the activity-related information ("Activity" and "Prev. Activity") can be deleted, provided that real causality does not need to be considered.
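Because every row of the internal log already carries a current value and its predecessor, both matrices reduce to counting pairs, independent of row order. The following sketch assumes a hypothetical dictionary layout for the rows and a "START" marker for the first event of each trace; it does not reproduce Table 7.

```python
from collections import Counter

def dfm(rows, column):
    """Directly-follows matrix for one column of the internal log.

    Counts (previous value, value) pairs; the result is stored sparsely
    as a Counter instead of a full matrix.
    """
    return Counter((row["prev_" + column], row[column]) for row in rows)

# Hypothetical internal log; the row order is irrelevant.
internal_log = [
    {"activity": "release", "prev_activity": "examine",
     "resource": "Cat", "prev_resource": "Bob"},
    {"activity": "register", "prev_activity": "START",
     "resource": "Ann", "prev_resource": "START"},
    {"activity": "examine", "prev_activity": "register",
     "resource": "Bob", "prev_resource": "Ann"},
    {"activity": "register", "prev_activity": "START",
     "resource": "Ann", "prev_resource": "START"},
    {"activity": "release", "prev_activity": "register",
     "resource": "Cat", "prev_resource": "Ann"},
]
a_dfm = dfm(internal_log, "activity")
r_dfm = dfm(internal_log, "resource")
print(a_dfm)
print(r_dfm)
```

Dropping the two resource columns leaves the A-DFM unchanged, which matches the remark that resource information can be removed when only a process model is needed.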

Comparing Table 7 with the original log, one can see that the following questions can no longer be answered from EL': (1) Who is responsible for the activities of case c? (2) What is the sequence of activities for case c? (3) How long does it take to process case c? (4) For case c, what is the cost of activity a performed by resource r? (5) What is the length of case c? (6) What is the frequency of case c? The same holds for many other case-related questions.

It is also worth noting that, since we assume that the data in the internal environment is accessed by trusted internal personnel who already know the organizational structure, plain resources are not considered a privacy issue. In fact, EL' is a partially secure version of the event log in the sense that it contains the minimum level of information a data analyst needs to obtain the results. Although ICS does not preserve the standard event log format used by current process discovery techniques, the intermediate input it provides can be used by current tools.

In the external confidentiality solution (ECS), we need to avoid any form of data leakage and privacy risk based on the assumed background knowledge.

External Confidentiality Solution (ECS)

In the external environment, plain parts of the event log may cause data leakage. Therefore, the entire event log is encrypted. In addition, some attributes that may cause data leakage even in encrypted form are projected out. Below, we describe the two steps of ECS.

  • Encrypt the plain parts. In this step, activities and resources are encrypted using a deterministic encryption method such as AES. A deterministic method must be used because differences between values must be preserved in order to discover the DFM. Table 8 shows the result after encrypting activities and resources. After encryption, however, the "START" activity can no longer be recognized directly, and without it the beginnings of traces cannot be found. To identify the "START" activity (resource), we can scan the "Activity" ("Resource") and "Prev. Activity" ("Prev. Resource") columns: the values that appear in the "Prev. Activity" ("Prev. Resource") column but never in the "Activity" ("Resource") column are the "START" activities (resources).

  • Strengthen encryption and/or project the event log. As described in the problem definition, since the resource is encrypted using a deterministic encryption method and the cost using a homomorphic encryption method, differences are preserved. Therefore, by comparison, the minimum/maximum cost can be found, which can be used as knowledge to extract confidential or private information (for example, resource names). To reduce the impact of such analyses, the encryption can be strengthened and/or the event log can be projected. Here, we project out the cost, since it is not actually needed to obtain the desired results.
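The detection of the encrypted "START" value described in the first step is a set difference between the two columns. A minimal sketch, with invented opaque tokens standing in for deterministically encrypted activities:

```python
def find_start_tokens(rows, column):
    """Return the tokens that occur as a previous value but never as a current one."""
    current = {row[column] for row in rows}
    previous = {row["prev_" + column] for row in rows}
    return previous - current

# Hypothetical fully encrypted log: "x0" plays the role of the encrypted START.
external_log = [
    {"activity": "x1", "prev_activity": "x0"},
    {"activity": "x2", "prev_activity": "x1"},
    {"activity": "x3", "prev_activity": "x2"},
    {"activity": "x1", "prev_activity": "x0"},
    {"activity": "x3", "prev_activity": "x1"},
]
start_tokens = find_start_tokens(external_log, "activity")
print(start_tokens)
```

The same function can be applied to the resource columns by passing "resource" as the column name, assuming the rows use the corresponding keys.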

Evaluation

We consider three evaluation criteria for the proposed method, and additionally consider performance:

  • Ensuring confidentiality: As described in the Method section, we improve confidentiality by defining different environments and specifying the level of information accessible in each of them. In addition, using multiple encryption methods together with our connector method, which dissociates events from their cases, provides a high degree of confidentiality with respect to the assumed background knowledge.
  • Reversibility: Given the keys and the value used to make times relative, both ICS and ECS are reversible, which means that the proposed method provides transparency.
  • Accuracy: Using a case study, we show that the results obtained from the secured versions of the event log are exactly the same as the results obtained from the original event log.

To verify that AEL and AEL' are the same, we created a DFG from the original and from the internal version of the event log (with a threshold of 2000) and found them to be identical. The DFG created in the external environment is also the same. That is, in the different environments, all DFG-based process discovery algorithms lead to the same process model.

Performance

In Figure 10, the darker bars show the execution time of discovering the DFG from the original event log, and the lighter bars show the execution time of discovering the DFG from the secured event log. When choices or loops are added, the running time in milliseconds increases linearly.

Conclusion

This paper proposes a novel method to ensure confidentiality in process mining when the desired result is a model. We showed that confidentiality in process mining cannot be achieved merely by encrypting event logs. The new approach was introduced because there is always a trade-off between confidentiality and data utility. Therefore, we reasoned backwards from the desired results and how to obtain them with as little data as possible.

Here, process models and social networks are the desired results, and the confidentiality solutions are provided within a framework that can be extended to other forms of process mining; that is, different ICS and ECS activities can be explored for different process mining tasks. In addition, the proposed framework can be used in a cross-organizational setting, where each environment can cover the specific restrictions and authorizations of one party. In this article, we focus on causality-based social networks; other metrics may be explored in the future. Furthermore, a confidentiality measure could be defined in the future so that the effectiveness of different solutions in this research field can be quantified and compared.

We adopted a new method called the "connector", which can be used in any situation where associations need to be stored securely. To evaluate the proposed method, we implemented an interactive environment in Python and used a real-life log as a case study.

Origin blog.csdn.net/weixin_42253964/article/details/107484615