Process model similarity

1. Semantic-level business process model similarity measurement technology 2016

Research on business process similarity measurement mainly compares three aspects: element labels, model logical structure, and behavioral semantics.

In comparisons of actual business process models, completely identical or completely different models are relatively rare. More often the two models are similar to some degree, so the concept of equivalence from process algebra research cannot distinguish "very similar" from "somewhat similar".

The activity labels are used as the basis for judging model similarity: the more identically labeled activities the two models share, the greater the model similarity, and vice versa. The algorithm is easy to implement and fast to compute, but it does not reflect structural differences between the process models.

The metric proposed by Minor is calculated from the number of nodes and edges that appear between the two models and the total number of nodes and edges of the two models. This method treats two nodes as the same only when they have the same label, so it cannot handle semantically heterogeneous labels.

The graph edit distance mainly examines the difference of graph structure, and lacks the expression and analysis of business process behavior information.

In practice, the business process models of a large organization are huge, often span multiple professional fields, and require a large number of people to model together. However, the modelers have different background knowledge, so the models contain many homographs (the same term with different meanings) and synonyms with different spellings. This semantic heterogeneity is particularly prominent when reference models are introduced into the modeling process. Model similarity measurement based on a common term set is therefore too idealized and is difficult to apply to an actual enterprise model library.

2. How to Make Process Model Matching Work Better? An Analysis of Current Similarity Measures 2017

Process model matching refers to automatically identifying the corresponding activities between two process models, that is, activities that represent the same or similar behavior. Because it generates these activity correspondences automatically, process model matching is a prerequisite for many advanced analysis techniques. Among other things, identifying activity correspondences is necessary for harmonizing process model variants [1,2], process model search [3,4], and detecting process model clones [5,6].

First, we conduct a structured literature review on process model matching. We provide an overview of existing matching systems and the specific techniques they use to identify similar activities.

Process model matching problem

Process model matching techniques aim to automatically identify activity correspondences that represent similar behavior in two models. Figure 1 illustrates the matching problem with the recruitment processes of two different companies. The gray shading highlights the correspondences between the two processes. For example, the activity "evaluation" from company B corresponds to the activities "check grade" and "check employment reference" from company A. These correspondences show that the two models differ in the terms they use (for example, "qualification assessment" vs. "ability test") and in their level of detail (for example, "evaluation" is described in more detail in company A's model). Given these differences, correctly identifying the correspondences between two process models can be a complex and challenging task. The moderate performance of existing matching techniques also highlights the complexity of the matching task.

Following Gal [14], we can subdivide the matching process into first-line matching and second-line matching. A first-line matcher takes the activity sets A1 and A2 of the two process models as input and generates a similarity matrix M(A1, A2) with |A1| rows and |A2| columns. This similarity matrix can be obtained, for example, by comparing the activity labels. A second-line matcher takes one or more similarity matrices generated by first-line matchers as input and converts them into a binary similarity matrix M(A1, A2) with entries of 0 or 1, where 1 represents a correspondence between two activities. It is important to note that first-line matching plays a particularly important role in the overall matching result. If the first-line matcher calculates a similarity of zero for two activities, the second-line matcher is unlikely to include that activity pair in the final set of correspondences.
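The two stages can be sketched in a few lines of Python. This is only an illustration: the word-overlap label similarity, the fixed threshold used as second-line matcher, and all function names are assumptions made for the example, not the matchers used by any particular system.

```python
from typing import Callable, List


def first_line_matcher(a1: List[str], a2: List[str],
                       sim: Callable[[str, str], float]) -> List[List[float]]:
    """Build the |A1| x |A2| similarity matrix from pairwise label similarity."""
    return [[sim(x, y) for y in a2] for x in a1]


def second_line_matcher(matrix: List[List[float]],
                        threshold: float = 0.5) -> List[List[int]]:
    """Turn a similarity matrix into a binary matrix (1 = correspondence).

    A plain threshold is an assumption; real second-line matchers can be
    more elaborate.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]


def word_overlap(l1: str, l2: str) -> float:
    """Toy label similarity: shared words divided by all distinct words."""
    w1, w2 = set(l1.lower().split()), set(l2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0


a1 = ["check grade", "check employment reference"]
a2 = ["evaluation", "check reference"]
m = first_line_matcher(a1, a2, word_overlap)
binary = second_line_matcher(m, threshold=0.5)
```

Note that "check grade" and "evaluation" share no words, so the first-line score is zero and the pair can never reach the final correspondence set, which is exactly the limitation described above.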

Overall, a matching system takes two process models as input and, at some stage, generates a set of activity correspondences.

Each row in Table 1 lists a type of measure used to identify activity correspondences. The "Total" column indicates the total number of matching systems using the corresponding measure type, and the "Reference" column shows the papers discussing these matching systems. Overall, Table 1 shows that we have identified 10 measure types, which are divided into syntactic and semantic measures.

Syntactic measures

Syntactic measures rely on simple string comparison and do not consider the meaning or context of words. The most widely used syntactic measures are distance-based, such as the Levenshtein distance. Given two labels l1 and l2, the Levenshtein distance counts the number of edit operations (i.e. insertions, deletions, and substitutions) required to convert l1 into l2. Another distance-based measure is the Jaro-Winkler distance, which works similarly but produces values between 0 and 1.
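As a minimal sketch, the Levenshtein distance can be computed with the standard dynamic-programming recurrence, keeping only two rows of the table. The function name `levenshtein` is chosen for this example.

```python
def levenshtein(l1: str, l2: str) -> int:
    """Number of insertions, deletions, and substitutions turning l1 into l2."""
    # previous[j] holds the distance between the processed prefix of l1
    # and the first j characters of l2.
    previous = list(range(len(l2) + 1))
    for i, c1 in enumerate(l1, start=1):
        current = [i]
        for j, c2 in enumerate(l2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # delete c1
                               current[j - 1] + 1,       # insert c2
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]
```

A distance can be turned into a similarity in [0, 1] by dividing it by the length of the longer label and subtracting the result from 1; this normalization is one common choice, not the only one.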

In addition to distance-based measures, many matching systems also rely on plain word comparison. Very common measures include the Jaccard and Dice coefficients, both of which calculate the similarity between two activity labels based on the number of shared and non-shared words.
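Both coefficients are easy to state on word sets. The sketch below assumes labels are split on whitespace and lower-cased, and it treats two empty labels as identical; both choices are assumptions of this example.

```python
def _words(label: str) -> set:
    """Lower-cased set of whitespace-separated words in a label."""
    return set(label.lower().split())


def jaccard(l1: str, l2: str) -> float:
    """|shared words| / |all distinct words|."""
    w1, w2 = _words(l1), _words(l2)
    if not w1 and not w2:
        return 1.0  # assumption: two empty labels count as identical
    return len(w1 & w2) / len(w1 | w2)


def dice(l1: str, l2: str) -> float:
    """2 * |shared words| / (|words in l1| + |words in l2|)."""
    w1, w2 = _words(l1), _words(l2)
    if not w1 and not w2:
        return 1.0  # same assumption as above
    return 2 * len(w1 & w2) / (len(w1) + len(w2))
```

For "check invoice" and "verify invoice", one of three distinct words is shared, so the Jaccard coefficient is 1/3, while the Dice coefficient is 2·1/(2+2) = 0.5.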

Another method based on word comparison is cosine similarity. To calculate the cosine similarity, each activity label is usually converted into a vector by weighting the frequency with which its words occur. The cosine similarity is then the cosine of the angle between the two activity vectors.
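A small sketch of this computation, assuming raw term frequencies as weights (schemes such as TF-IDF are also common) and whitespace tokenization:

```python
import math
from collections import Counter


def cosine_similarity(l1: str, l2: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two labels."""
    v1 = Counter(l1.lower().split())
    v2 = Counter(l2.lower().split())
    # Only words present in both labels contribute to the dot product.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty label is similar to nothing
    return dot / (norm1 * norm2)
```

For "check invoice" and "verify invoice" the dot product is 1 and both vectors have length √2, giving a cosine similarity of 0.5.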

Another way to consider word distribution is the Jensen-Shannon distance, which measures the similarity of two probability distributions. However, so far it has only been adopted by the method of Weidlich et al. [30].

A common preprocessing step is to consider the substring relationship between activity labels. For example, Dadashina et al. consider two activity labels l1 and l2 similar if l1 is a substring of l2 (or vice versa) [12]. Such labels are then excluded from further similarity considerations and simply receive a similarity score of 1.

Semantic measures

Semantic measures aim to consider the meaning of words. A very common strategy is to use the lexical database WordNet [35] to identify synonyms. Usually, matching systems check for synonyms in a preprocessing step and then apply other (usually syntactic) similarity measures [12].

The most prominent semantic measure is Lin similarity. Lin similarity calculates the semantic relatedness of two words from their information content with respect to the WordNet taxonomy. Because most activities contain multiple words, Lin similarity is usually combined with the bag-of-words model to measure the similarity between two activities.

The bag-of-words model converts activities into multisets of words, ignoring grammar and word order. The Lin similarity of two activities can then be obtained by identifying the word pairs from the two bags with the highest Lin score and calculating their average value. Other measures based on WordNet include Wu & Palmer and Lesk. The former calculates the similarity between two words by considering the path length between them in the WordNet taxonomy. The latter compares the WordNet dictionary definitions of two words. Some methods also directly check for hypernym relationships (hypernyms are more general words). For example, Hake et al. [12] treat "car" and "vehicle" as the same word, because "vehicle" is a hypernym of "car".
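The bag-of-words aggregation can be sketched independently of the word-level measure. In the example below, `word_sim` stands in for Lin similarity (or any other word measure); the symmetric best-match average is one plausible reading of "highest-scoring word pairs, averaged", and the toy score table is invented purely for illustration, not taken from WordNet.

```python
from typing import Callable


def bag_of_words_similarity(l1: str, l2: str,
                            word_sim: Callable[[str, str], float]) -> float:
    """Average, over all words of both labels, of the best score each word
    achieves against the words of the other label."""
    bag1, bag2 = l1.lower().split(), l2.lower().split()
    if not bag1 or not bag2:
        return 0.0
    best1 = [max(word_sim(w, v) for v in bag2) for w in bag1]
    best2 = [max(word_sim(v, w) for w in bag1) for v in bag2]
    return (sum(best1) + sum(best2)) / (len(bag1) + len(bag2))


# Hypothetical word-level scores, standing in for a WordNet-based measure.
TOY_SCORES = {frozenset(("car", "vehicle")): 0.8}


def toy_word_sim(w1: str, w2: str) -> float:
    """Identical words score 1; other pairs are looked up in the toy table."""
    if w1 == w2:
        return 1.0
    return TOY_SCORES.get(frozenset((w1, w2)), 0.0)


score = bag_of_words_similarity("repair car", "repair vehicle", toy_word_sim)
```

With the toy table, "repair" matches itself with 1.0 and "car"/"vehicle" match with 0.8 in both directions, so the score is (1.0 + 0.8 + 1.0 + 0.8) / 4 = 0.9.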

Syntactic measures, such as distance-based measures, play a major role. The disadvantage of edit-based distance measures is not only that they cannot recognize synonyms, but also that they tend to treat unrelated words as similar. For example, consider the unrelated words "contract" and "contact". The Levenshtein distance between these words is only 1, which suggests that they are very similar.

The most prominently used syntactic and semantic measures (i.e. Levenshtein distance and bag-of-words-based Lin similarity) are used to calculate the degree of matching of the correspondences.

Future work

  • Only use syntactic techniques for preprocessing: Syntactic techniques are very useful for identifying trivial or almost trivial correspondences. We found that the best-performing systems mainly use syntactic techniques as a preprocessing step: they first match identical and almost identical labels, and then apply semantic techniques.

3. Similarity of Business Process Models: Metrics and Evaluation 2011

A great deal of research has been done here on similarity measurement, and several similarity measurement algorithms are put forward; the simplest is the label matching similarity algorithm. A method of calculating similarity using causal footprints (CF) is also proposed. The main idea of that algorithm is to represent the process model as a vector, but because the vector contains too much redundant information, the high-dimensional vectors make the calculation very inefficient.

Large and complex organizations often maintain a repository of business process models in order to document and continuously improve their operations. This article addresses the problem of retrieving the process models in the repository that are most similar to a given process model or fragment thereof, that is, searching a repository for similar business process models.

Three similarity measures are proposed:

  • Label matching similarity, which compares the labels attached to process model elements;
  • Structural similarity, which compares element labels as well as the topology of the process models;
  • Behavioral similarity, which compares element labels as well as the causal relationships captured in the process models.

These similarity measures are evaluated experimentally in terms of precision and recall, as well as the correlation between each measure and human judgment. The experimental results show that all three measures produce comparable results, with structural similarity slightly better than the other two.

Significance

The management and use of large process model repositories require effective search techniques. For example, before adding a new process model to the repository, one needs to check that a similar model does not already exist, in order to prevent duplication. Similarly, in the context of company mergers, process analysts need to identify common or similar business processes to analyze their overlap and determine areas for consolidation.

These tasks require users to retrieve process models based on similarity to a given "search model". We use the term process model similarity query to refer to such search queries in the process model repository.

Traditional search engines are based on keyword search and text similarity. They do not consider the structure and behavioral semantics of process models.

This paper studies three similarity metrics designed to answer process model similarity queries.

  • The first metric is label-based. It exploits the fact that process models are composed of labeled nodes. The metric computes an optimal matching between the nodes of the process models by comparing their labels. Based on this matching, a similarity score is calculated that takes the overall size of the models into account.
  • The second metric is structural. It uses existing techniques for graph comparison based on graph edit distance [2]. This metric considers the node labels and the topology of the process models at the same time.
  • The third metric is behavioral. It considers the behavioral semantics of process models, in particular the causal relationships between activities in a model. These causal relationships are expressed in the form of causal footprints [3].

There are two ways to evaluate:

  • First, the classic concepts of precision and recall are used.
  • Second, the statistical correlation between the similarity scores given by each metric and the similarity scores given by human experts is calculated.

The evaluation results show that the similarity metrics that consider the structure and behavior of process models answer process model similarity queries better than a search engine.

EPC

In the field of business process modeling, there are many competing notations, including UML activity diagrams, the Business Process Modeling Notation (BPMN), event-driven process chains (EPC), workflow nets, etc. In this article, we use EPC as the process modeling notation.

Many large process model repositories are available as EPCs. In particular, the repository we used in the experiments consists of EPCs. The label similarity metric defined in this article is independent of the specific process modeling notation used, the structural similarity metric can be applied to any graph-based process modeling notation, and the behavioral similarity metric can be applied to any notation that can be mapped to causal footprints.

The EPC notation is a graph-based language for documenting the temporal and logical dependencies between functions and events in an organization.

A causal graph is a set of activities together with conditions on when these activities occur. A causal footprint is a relatively compact representation of a business process model. It has forward (look-ahead) and backward (look-back) links: a forward link (a, B) means that the occurrence of a leads to the occurrence of B, and a backward link (A, b) means that A must occur before b occurs.

Causal graph and causal footprint.

For example, a possible causal footprint of the EPC in Figure 1 includes the forward link ("Receipt", {"Verify Invoice", "Transfer to Warehouse"}) and the backward links ({"Receipt"}, "Verify Invoice") and ({"Receipt"}, "Transfer to Warehouse"). This example illustrates that a causal footprint is an approximation of the EPC's behavior, because there are multiple EPCs with the same causal footprint (for example, the connector could instead be an XOR-split). Similarly, this EPC has multiple possible causal footprints.

Similarity of process model elements

When comparing business process models, it is unrealistic to assume that their elements (nodes) are equivalent only if they have exactly the same label. Figure 2 is an example: a process modeler would consider the functions "customer inquiry processing" and "client inquiry query processing" to be the same, even though they have different labels.

Therefore, as a basis for measuring the similarity between business process models, we must be able to measure the similarity between their elements.

We considered three methods to measure the similarity between different process model elements:

  • 1) Syntactic similarity, in which only the syntax of the labels is considered;
  • 2) Semantic similarity, which abstracts from the syntax and looks at the meaning of the words in the labels;
  • 3) Context similarity, in which we consider not only the labels of the elements themselves, but also the context in which these elements occur.

All of these metrics (described below) result in a similarity score between 0 and 1, where 0 means no similarity and 1 means identical elements. It is therefore trivial to combine the metrics into a weighted similarity score.

Syntactic similarity

Given two labels, the syntactic similarity metric returns their degree of similarity as measured by the string edit distance. The string edit distance [11] is the number of atomic string operations required to get from one string to the other. These atomic string operations are: deleting a character, inserting a character, or substituting one character for another.

Semantic similarity

The semantic similarity of two labels is based on the degree of equivalence between the words they consist of. We assume that exact matches are preferred over synonym matches. Therefore, identical words receive an equivalence score of 1, while synonymous words receive an equivalence score of 0.75 (see the description below). The semantic similarity score is defined on this basis, taking the mapping of synonyms into account.
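The notes above give only the two weights, not the full formula, so the following sketch is one plausible way to combine them: words shared by both labels count with weight 1, remaining words that have a synonym in the other label count with weight 0.75, and the sum is divided by the total number of words. The normalization, the function names, and the tiny synonym table are all assumptions of this example.

```python
# Hypothetical synonym table; a real system would query a resource such as WordNet.
SYNONYMS = {
    frozenset(("customer", "client")),
    frozenset(("inquiry", "query")),
}


def are_synonyms(a: str, b: str) -> bool:
    """True if the (unordered) word pair is listed in the toy synonym table."""
    return frozenset((a, b)) in SYNONYMS


def semantic_similarity(l1: str, l2: str,
                        exact_weight: float = 1.0,
                        synonym_weight: float = 0.75) -> float:
    """Weighted share of words that match exactly or via a synonym."""
    w1, w2 = set(l1.lower().split()), set(l2.lower().split())
    if not w1 or not w2:
        return 0.0
    shared = w1 & w2
    # Each exactly matching word is counted once per label.
    score = exact_weight * 2 * len(shared)
    # Remaining words count if the other label contains a synonym.
    for words, others in ((w1 - shared, w2), (w2 - shared, w1)):
        score += synonym_weight * sum(
            1 for w in words if any(are_synonyms(w, o) for o in others))
    return score / (len(w1) + len(w2))


example = semantic_similarity("customer inquiry processing",
                              "client inquiry query processing")
```

In the example, two words match exactly (2·2·1 = 4) and three words match through a synonym (3·0.75 = 2.25), so the score is 6.25 / 7 ≈ 0.89.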

Context similarity

The third similarity metric also considers the model elements that precede and succeed two model elements when determining their similarity. This metric is particularly useful for EPCs, because in an EPC functions are always preceded and followed by events. Therefore, when comparing two functions, the context similarity metric takes the surrounding events into account.

Another process modeling technique for which context similarity is particularly useful is Petri nets, because in Petri nets transitions (activities) are always preceded and followed by places (and vice versa). We refer to the preceding model elements as the input context and to the succeeding model elements as the output context of a model element.

To determine the context similarity between business process model elements, we need a mapping between the elements in their input and output contexts. Such a mapping is itself based on a similarity metric, for example the syntactic or the semantic metric, and is called an equivalence mapping.

Label matching similarity

The label matching similarity score is the sum of the label similarity scores of the matched node pairs. To get a score between 0 and 1, we divide the sum by the total number of nodes.
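A rough sketch of this score follows. Several details are assumptions of the example rather than the paper's definition: node pairs are matched greedily by descending similarity instead of optimally, pairs below a cutoff are left unmatched, word-level Jaccard serves as the label similarity, and the sum is doubled before dividing by the combined node count so that two identical models score exactly 1.

```python
from typing import Callable, List


def word_jaccard(l1: str, l2: str) -> float:
    """Shared words divided by all distinct words of two labels."""
    w1, w2 = set(l1.lower().split()), set(l2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0


def label_matching_similarity(nodes1: List[str], nodes2: List[str],
                              sim: Callable[[str, str], float] = word_jaccard,
                              cutoff: float = 0.5) -> float:
    """Sum of label similarities over greedily matched node pairs,
    normalized by the total number of nodes in both models."""
    if not nodes1 or not nodes2:
        return 0.0
    candidates = sorted(
        ((sim(a, b), i, j)
         for i, a in enumerate(nodes1) for j, b in enumerate(nodes2)),
        reverse=True)
    used1, used2 = set(), set()
    total = 0.0
    for score, i, j in candidates:
        if score < cutoff:
            break  # remaining pairs are all below the cutoff
        if i in used1 or j in used2:
            continue  # each node may be matched at most once
        used1.add(i)
        used2.add(j)
        total += score
    return 2 * total / (len(nodes1) + len(nodes2))


model_a = ["receive goods", "verify invoice"]
model_b = ["receive goods", "check invoice", "archive"]
result = label_matching_similarity(model_a, model_b)
```

Here only "receive goods" is matched (score 1.0); "verify invoice" and "check invoice" score 1/3 and fall below the cutoff, so the result is 2·1.0 / 5 = 0.4.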

Structural similarity

The second similarity measure in our research is to measure the similarity of the structure of EPC by treating EPC as a labeled graph. If we think of EPC as a graph, its functions, events, and connectors are the nodes of the graph, and the arcs are the edges of the graph. For functions and events, their labels will become the labels of the corresponding nodes.

Then, we can assign a similarity score to two EPCs by calculating their graph edit distance [2]. The graph edit distance between two graphs is the minimum number of graph edit operations required to get from one graph to the other. Different graph edit operations can be considered. We consider the following: node deletion or insertion, node substitution (a node in one graph is mapped to a node in the other graph with a different label), and edge deletion or insertion. As with label matching similarity, the graph edit distance is obtained by first calculating a mapping between the EPC nodes and then calculating the optimal graph edit distance, from which the similarity score is derived.

4. Process model similarity measurement based on edit distance of transition label graph 2016

With a Petri net as input, the behavior-based process similarity is calculated.

Similar to the string edit distance, the graph edit distance [17] is the minimum number of transformation operations required to transform one graph into another. By defining the transformation operations (or edit operations) and their costs, the distance between two graphs can be quantified. Graph edit operations include the insertion, deletion, and substitution of nodes and edges.

(u→v) means replacing node u with node v, (u→ε) means deleting node u, and (ε→v) means inserting node v. Edge edit operations can be defined similarly.

As shown in Figure 1, converting Figure 1a into Figure 1f takes the following edit operations in succession: deleting the three edges (a, b), (b, c), and (d, e); deleting node b; inserting node f; inserting the two edges (e, f) and (d, f); and finally replacing the two nodes e and d.
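The cost of such an edit sequence can be tallied with a few lines of Python. The sketch below encodes each operation as a (source, target) pair with `None` standing for ε, assumes unit costs by default, and uses the hypothetical labels "e'" and "d'" as substitution targets because the notes do not say what e and d are replaced with. It computes the cost of one given edit path; the graph edit distance is the minimum of this cost over all possible paths.

```python
from typing import Dict, Hashable, List, Optional, Tuple

EPSILON = None  # stands for the empty element ε
Edit = Tuple[Optional[Hashable], Optional[Hashable]]


def classify(source: Optional[Hashable], target: Optional[Hashable]) -> str:
    """Name the edit operation encoded by a (source, target) pair."""
    if source is EPSILON and target is EPSILON:
        raise ValueError("an edit needs a source or a target")
    if target is EPSILON:
        return "delete"      # (u -> ε)
    if source is EPSILON:
        return "insert"      # (ε -> v)
    return "substitute"      # (u -> v)


def edit_path_cost(path: List[Edit],
                   costs: Optional[Dict[str, float]] = None) -> float:
    """Total cost of a sequence of node and edge edit operations."""
    if costs is None:
        costs = {"insert": 1.0, "delete": 1.0, "substitute": 1.0}
    return sum(costs[classify(s, t)] for s, t in path)


# Nodes are strings, edges are pairs of node names.
path = [
    (("a", "b"), EPSILON), (("b", "c"), EPSILON), (("d", "e"), EPSILON),
    ("b", EPSILON),
    (EPSILON, "f"),
    (EPSILON, ("e", "f")), (EPSILON, ("d", "f")),
    ("e", "e'"), ("d", "d'"),  # hypothetical substitution targets
]
total_cost = edit_path_cost(path)
```

With unit costs, the nine operations listed above give a path cost of 9.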

5. Process model similarity measurement based on the relationship between tasks 2017

Process model management includes model analysis, model retrieval, and model reuse [2]. Process model similarity measurement plays a very important role in all aspects of process model management.

Similarity measurement based on element label mapping relies on the pairwise comparison of node labels. It calculates the similarity by defining a mapping between the node labels of the two models. The label matching similarity is equal to the number of matched nodes divided by the total number of nodes.

Structural similarity measurement treats the model as a graph and uses common-subgraph isomorphism and graph edit distance to measure model similarity. The graph edit distance is defined as the minimum number of atomic graph operations required to transform one graph into another. Dijkman et al. [7] proposed a structural similarity measure in which each edit operation incurs a corresponding cost; the similarity is then obtained from the edit distance between one model and the other. Building on this algorithm, La Rosa et al. [8] proposed an algorithm that combines graph edit distance with activity matching.

Similarity measurement based on behavioral semantics extracts the behavioral relationships of a model (such as execution sequences and task relationships) and then calculates the similarity from them.

This paper’s model similarity measurement algorithm based on the relationship between tasks

  • 1) Complete prefix unfolding. First, the given Petri net model is unfolded into its complete prefix, so that all markings of the process model and the relationships between tasks are preserved, which makes it easier to extract the behavioral characteristics of the process model later.

  • 2) Node traversal numbering. The complete prefix unfolding is traversed breadth-first, layer by layer; its nodes are numbered according to the level at which they are visited, and the nodes and their corresponding numbers are stored.

  • 3) Find the relationships between tasks. Using a nearest common predecessor algorithm, the nearest common predecessor of every pair of tasks is computed and stored. Based on the computed nearest common predecessors and the handling of special structures, the relationship between each pair of tasks is determined.
  • 4) Model similarity calculation. On the basis of the set of relations between the corresponding tasks, the similarity between the models is calculated by the weighted similarity between the set of relations.
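Step 2 of the list above, the level-by-level numbering, can be sketched as a plain breadth-first traversal. The adjacency-dictionary encoding of the unfolding and the node names are assumptions of this example; the paper's actual data structures are not described in these notes.

```python
from collections import deque
from typing import Dict, List


def bfs_levels(successors: Dict[str, List[str]], roots: List[str]) -> Dict[str, int]:
    """Number every reachable node with the level at which BFS first visits it."""
    level = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in level:  # first visit determines the number
                level[nxt] = level[node] + 1
                queue.append(nxt)
    return level


# Hypothetical unfolding: place p0 enables t1, which forks into two branches.
unfolding = {
    "p0": ["t1"],
    "t1": ["p1", "p2"],
    "p1": ["t2"],
    "p2": ["t3"],
}
numbers = bfs_levels(unfolding, ["p0"])
```

The stored numbers can then be used when searching for the nearest common predecessor of two tasks in step 3, since a predecessor always has a smaller level number than its successors.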

The weights of the parallel, mutually exclusive, and causal relationships are each taken directly as the proportion of that relationship type in the total number of binary relationships; apart from this, the three relationship types are considered equally important.

Similarity calculation

Using the Jaccard coefficient, the similarity of the parallel, mutually exclusive, and causal relationships is calculated separately: for each relationship type, the similarity is the size of the intersection of the two models' relation sets divided by the size of their union.
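A minimal sketch of the weighted combination follows. The original formulas are not reproduced in these notes, so several points are assumptions: each model is given as a dictionary of relation sets keyed by "parallel", "exclusive", and "causal", relations are encoded as ordered task pairs, and the weight of a relation type is its share of all binary relations pooled over both models.

```python
from typing import Dict, Set, Tuple

Relation = Tuple[str, str]
KINDS = ("parallel", "exclusive", "causal")


def jaccard_sets(r1: Set[Relation], r2: Set[Relation]) -> float:
    """|intersection| / |union|; two empty sets contribute nothing."""
    union = r1 | r2
    return len(r1 & r2) / len(union) if union else 0.0


def relation_similarity(m1: Dict[str, Set[Relation]],
                        m2: Dict[str, Set[Relation]]) -> float:
    """Weighted sum of the per-type Jaccard similarities of two models."""
    total = sum(len(m1[k] | m2[k]) for k in KINDS)
    if total == 0:
        return 1.0  # assumption: two models without relations are identical
    score = 0.0
    for kind in KINDS:
        weight = len(m1[kind] | m2[kind]) / total  # share of this relation type
        score += weight * jaccard_sets(m1[kind], m2[kind])
    return score


model1 = {"parallel": {("a", "b")}, "exclusive": set(),
          "causal": {("a", "c"), ("b", "c")}}
model2 = {"parallel": {("a", "b")}, "exclusive": {("c", "d")},
          "causal": {("a", "c")}}
similarity = relation_similarity(model1, model2)
```

In the example the weights are 0.25, 0.25, and 0.5 and the per-type Jaccard values are 1, 0, and 0.5, giving an overall similarity of 0.5.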

6. Measuring Similarity between Semantic Business Process Models 2007

A similarity measurement method based on label semantics is proposed. This type of algorithm is conceptually simple and fast to compute, but it does not consider the topological structure or the behavioral semantics of the model, so its results are not accurate enough.

Origin blog.csdn.net/weixin_42253964/article/details/107830710