Paper interpretation: PRETSA: Event Log Sanitization for Privacy-aware Process Discovery

Problems in the field

Event logs recorded by information systems support comprehensive analysis of business processes. However, a log may contain sensitive information about individual employees involved in executing the process, which creates a risk of privacy disclosure attacks on the event log.

New method proposed

The paper introduces PRETSA, a new log sanitization algorithm that provides privacy guarantees in terms of k-anonymity and t-closeness. It prevents disclosure of an employee's identity, of their membership in the event log, and of their characterization based on sensitive attributes (such as performance information).

New conclusions reached

By transforming a prefix tree representation of the event log step by step, PRETSA preserves the utility of the sanitized log for discovering performance-annotated process models.

Motivating question

As process mining realizes its potential, organizations have intensified their efforts to record their processes accurately and in fine detail. However, once a process involves manual work, the resulting event log may allow sensitive conclusions to be drawn about individual employees.

Pseudonymization and anonymization methods [7] are insufficient in some application settings because they cannot prevent frequency-based attacks [8].

For example: if it is known that only one particular employee can perform a specific activity, replacing that employee's name with a pseudonym in the log does not protect their privacy.

Therefore, a different approach is to sanitize the data so that an explicit privacy guarantee holds.

Examples: k-anonymity and differential privacy. Both privacy notions have a parameter that tunes the strength of the guarantee, so there is a trade-off between privacy and utility.

Question: How can data be sanitized so that its utility is maximized under a given privacy guarantee?

Approach to the problem:

The problem is framed by an attack model and a type of process analysis. The attack model determines which privacy guarantee to consider, and the type of process analysis determines how the utility of the sanitized log is evaluated.

The attack model covers:

  • Identity disclosure: whether an event can be linked to a specific employee.
  • Membership disclosure: whether events of a specific employee are included in the log.
  • Attribute disclosure: whether an employee can be characterized by the attribute values of events.

Process analysis:

  • Process discovery: The utility of the sanitized log is evaluated by how much the model discovered from it differs from the model discovered from the original log. Process discovery techniques usually build on the directly-follows graph (DFG), which captures which activities directly follow each other and how often. To maximize the utility of the sanitized log, the DFG constructed from it should be similar to the DFG of the original log, so as many directly-follows relations as possible should be preserved.
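To make the DFG idea concrete, here is a minimal sketch in Python. It assumes a log is simply a list of traces and each trace is a list of activity names. The function names and the "fraction of preserved relations" measure are illustrative assumptions, not the utility measure used in the paper.

```python
from collections import Counter


def directly_follows_graph(traces):
    """Count how often activity a is directly followed by activity b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


def preserved_relations(original, sanitized):
    """Fraction of distinct directly-follows relations of the original log
    that still occur in the sanitized log (hypothetical utility measure)."""
    orig_edges = set(directly_follows_graph(original))
    san_edges = set(directly_follows_graph(sanitized))
    if not orig_edges:
        return 1.0
    return len(orig_edges & san_edges) / len(orig_edges)


# Toy logs with made-up traces of the order handling process.
original_log = [
    ["create PO", "update PO", "receive goods"],
    ["create PO", "receive goods"],
]
sanitized_log = [["create PO", "receive goods"]]

dfg = directly_follows_graph(original_log)
share = preserved_relations(original_log, sanitized_log)
```

In this toy example the original log has three distinct directly-follows relations and the sanitized log keeps one of them, so `share` is one third.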

Example of an attack:

To motivate the need for log sanitization, consider an order handling process. The process includes creating a purchase order (PO), updating a PO, receiving goods, and activities related to checking, paying and rejecting invoices.

For a malicious attacker, the sequence of events may be sufficient to link employees to the execution of certain events. This applies especially when the attacker has organizational knowledge (as a manager would). For example, the manager may know that for updated POs only four employees are allowed to check the corresponding invoices. By combining this background knowledge with the traces in the event log, the adversary can obtain sensitive information (identity disclosure or membership disclosure).

Consider a scenario where some orders have been updated after receipt. If the adversary knows that Su is one of the few employees allowed to check the corresponding invoices afterwards, the adversary can identify the specific events Su performed with high accuracy. The success of such an attack is bounded by a maximum probability: if an equivalence class contains 8 events and an employee handled 4 of them, the maximum probability of a correct assignment (successful disclosure) is 4/8 = 0.5.
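The arithmetic in this example can be written as a small helper. This is only a sketch of the ratio described above; the function name and parameter names are assumptions made for illustration.

```python
from fractions import Fraction


def max_disclosure_probability(events_by_employee, class_size):
    """Upper bound on the chance of correctly assigning an event in an
    equivalence class to one employee: the employee's share of the class."""
    if class_size <= 0 or not 0 <= events_by_employee <= class_size:
        raise ValueError("need 0 <= events_by_employee <= class_size, class_size > 0")
    return Fraction(events_by_employee, class_size)


# The example from the text: 4 of the 8 events in the class belong to one employee.
p = max_disclosure_probability(4, 8)
```

The larger the equivalence class relative to one employee's share of it, the lower this bound, which is the intuition behind requiring classes of a minimum size.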

Privacy guarantee: k-anonymity. A simple way to ensure k-anonymity is to delete all traces whose activity sequence occurs fewer than k times in the log. However, this discards a lot of information about the existence and frequency of other sequence variants.
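The naive approach can be sketched in a few lines of Python. This is a hedged illustration of the simple baseline, not of PRETSA itself; the function name `drop_rare_variants` and the toy traces are assumptions.

```python
from collections import Counter


def drop_rare_variants(traces, k):
    """Keep only traces whose full activity sequence (variant) occurs
    at least k times in the log."""
    variant_counts = Counter(tuple(trace) for trace in traces)
    return [trace for trace in traces if variant_counts[tuple(trace)] >= k]


# Toy log: one variant occurs three times, another only once.
log = [["a", "b"], ["a", "b"], ["a", "b"], ["a", "c"]]
kept = drop_rare_variants(log, k=2)
```

Here the trace `["a", "c"]` is removed entirely, including its first event `a`, although the prefix `a` is shared with every other trace. That lost information is what a finer-grained transformation tries to keep.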

PRETSA algorithm:

PRETSA is a prefix-tree-based event log sanitization algorithm for t-closeness. It uses k-anonymity and t-closeness to prevent the three kinds of attack. The algorithm constructs a prefix tree representation of the event log, in which each prefix of a sequence of executed activities observed in the log defines a separate equivalence class, annotated with its frequency and attribute values. It then transforms the tree step by step, relocating and merging subtrees until the required privacy guarantee holds. Because the transformation is fine-grained, the loss of log utility stays moderate.

  • Check privacy guarantee: Traverse the prefix tree depth-first until a node that violates the privacy guarantee is reached.
  • Tree update: Detach the violating node and its descendants from its ancestors.
  • Find the most similar remaining trace: Similarity is measured with the (Levenshtein) edit distance.
  • Rebuild the tree: Merge the pruned nodes into the most similar trace.
  • Termination: Transform the prefix tree iteratively until it has been traversed completely and no violation remains.
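The first and third steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it checks only k-anonymity (no t-closeness), represents traces as tuples of activity names, and omits the pruning and merging of subtrees. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    activity: str
    count: int = 0                       # number of traces sharing this prefix
    children: dict = field(default_factory=dict)


def build_prefix_tree(traces):
    """Each node is one prefix (equivalence class) annotated with its frequency."""
    root = Node("<root>")
    for trace in traces:
        node = root
        node.count += 1
        for activity in trace:
            node = node.children.setdefault(activity, Node(activity))
            node.count += 1
    return root


def first_violation(node, k, prefix=()):
    """Depth-first search for the first prefix seen in fewer than k traces."""
    for activity, child in node.children.items():
        path = prefix + (activity,)
        if child.count < k:
            return path
        found = first_violation(child, k, path)
        if found is not None:
            return found
    return None


def edit_distance(a, b):
    """Levenshtein distance between two activity sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x by y
        prev = curr
    return prev[-1]


def most_similar_trace(trace, candidates):
    """Candidate with the smallest edit distance to the given trace."""
    return min(candidates, key=lambda c: edit_distance(trace, c))


log = [("a", "b", "c")] * 3 + [("a", "b", "d")]
tree = build_prefix_tree(log)
violation = first_violation(tree, k=2)
target = most_similar_trace(violation, [("a", "b", "c"), ("a",)])
```

In this toy log the prefix `("a", "b", "d")` occurs only once, so it violates 2-anonymity. Its closest remaining trace is `("a", "b", "c")`, which is where the pruned events would be merged in the rebuild step.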

Experiment:

  1. k-anonymity affects the quality of the discovered process model, which is measured by fitness and precision.
  2. t-closeness affects the accuracy of the process model annotations (the execution time of an event: the difference between its start time and the start time of the next event).
  3. Three baselines:
  • An event log from which traces that do not satisfy k-anonymity are discarded.
  • An event log from which traces that do not satisfy t-closeness are discarded (t-closeness is violated when the execution-time distribution of an activity a within an equivalence class differs statistically from the overall execution-time distribution of a).
  • The original, unsanitized event log.
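The execution-time annotation mentioned in item 2 can be sketched as below. The sketch assumes a trace is a time-ordered list of (activity, start time) pairs; the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def execution_times(trace):
    """Execution time of each event in seconds: the start time of the next
    event minus the event's own start time. The last event of a trace has
    no successor and therefore gets no value."""
    durations = []
    for (activity, start), (_, next_start) in zip(trace, trace[1:]):
        durations.append((activity, (next_start - start).total_seconds()))
    return durations


# Hypothetical trace of the order handling process.
trace = [
    ("create PO", datetime(2020, 1, 6, 9, 0)),
    ("update PO", datetime(2020, 1, 6, 9, 30)),
    ("receive goods", datetime(2020, 1, 6, 11, 0)),
]
durations = execution_times(trace)
```

Averaging these values per activity over a log gives the kind of performance annotation whose accuracy is compared between the original and the sanitized log.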

Experimental results:

  • For simple event logs (in which most variants are fairly common), the impact is consistently small and the three methods differ little.
  • For complex (less structured) event logs, the methods differ considerably. PRETSA achieves the best results, which indicates that the algorithm is particularly useful for less structured processes.

Analysis

The heat map shows the impact of t-closeness on the accuracy of the execution time annotations. For given values of k and t, it indicates how close the average execution time of the activities in the sanitized log L' is to the average in the original log L.

Fitness and precision obtained when comparing the original event log with the process model discovered by applying the Inductive Miner to the sanitized event log (that is, the original log is compared with the model).

First, the Inductive Miner infrequent [19] discovers a process model from the sanitized event log L'. Then, the quality of the discovered model is quantified by its fitness [25] and precision [26] with respect to the original event log L.

[25] Conformance checking using cost-based fitness analysis
[26] Measuring precision of modeled behavior

Origin blog.csdn.net/weixin_42253964/article/details/106551667