TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies

1. Introduction to the TANE algorithm

1.1. Functional dependency definition:

Given a relation r over schema R(U), a functional dependency X → Y holds if for any two tuples t1 and t2 of r, t1[X] = t2[X] implies t1[Y] = t2[Y] (equivalently: if t1[Y] ≠ t2[Y], then t1[X] ≠ t2[X]). We then say that X determines Y, or that Y depends on X.

The central task we consider: given a relation r, find all minimal non-trivial dependencies that hold in r.

1.2. Approximate functional dependencies

How should approximate dependencies be defined? The definition we use is based on the minimum number of tuples that must be removed from the relation r for X → A to hold in r. Dividing this minimum number by the size of r, i.e., the number of tuples in r, gives the degree of approximation: the smaller the ratio, the better X → A holds approximately.

e(X → A) = min{ |s| : s ⊆ r and X → A holds in r \ s } / |r|

The worst-case time complexity of the algorithm with respect to the number of attributes is exponential. However, the time complexity is only linear in the number of tuples (provided the set of dependencies does not change as the number of tuples grows). This linearity makes TANE particularly suitable for relations with large numbers of tuples.

Our search strategy:

  • Compute important information about attribute sets: partitions
  • Derive the dependencies from this information

2. Partition and dependencies

2.1. Partitions

A dependency X → A holds if all tuples that agree on X also agree on A. Equivalently: if for any two tuples t1 and t2 of r, t1[X] = t2[X] implies t1[A] = t2[A], then the dependency X → A holds.

Equivalence classes

The equivalence class [t]_X of a tuple t with respect to an attribute set X is the set of all tuples of a given relation instance that agree with t on X.

Partitions

The partition π_X over the attribute set X is the set of all equivalence classes of X in a given relation instance r.

This means that π_X is a set of disjoint sets of tuples (the equivalence classes). The tuples within each equivalence class have the same values on the attribute set X, and the union of the classes equals the relation instance r.

(figure: an example relation and its partitions)
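To make the partition concept concrete, here is a minimal Python sketch (the function name `partition`, the positional attribute indexing, and the toy relation `r` are illustrative assumptions, not from the paper):

```python
from collections import defaultdict

def partition(rows, attrs):
    """pi_X: group tuple ids of `rows` by their values on the attributes in `attrs`."""
    classes = defaultdict(set)
    for t, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(t)
    return list(classes.values())

# toy relation; attributes are the column indices 0, 1, 2
r = [(1, 'a', 10), (1, 'a', 20), (2, 'b', 20)]
print(partition(r, [0]))     # [{0, 1}, {2}]     -- pi over attribute 0
print(partition(r, [0, 2]))  # [{0}, {1}, {2}]   -- pi over attributes {0, 2}
```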

2.2. Partition refinement

The notion of partition refinement yields functional dependencies almost directly. A partition π_X refines another partition π_A if every equivalence class in π_X is a subset of some equivalence class in π_A.

(figure: illustration of partition refinement)

If each color represents an equivalence class (the same color means the same value), then in the left figure π_X refines π_A, but in the right figure, because of tuple 5, π_X is not a refinement of π_A.

Lemma 2.1

A functional dependency X → A holds if and only if the partition π_X refines the partition π_A. Intuition: each equivalence class in π_X is a subset of some equivalence class in π_A, so tuples that agree on X also agree on A.

Lemma 2.2

A functional dependency X → A holds if and only if |π_X| = |π_{X∪{A}}|.

Proof sketch: each additional attribute adds a constraint, so the partition over X∪{A} always refines the partition over X, i.e., π_{X∪{A}} is naturally a refinement of π_X. Unless π_{X∪{A}} equals π_X, the two cannot have the same number of equivalence classes.

What equality of the two means: the tuples that agree on X also agree on A.
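Lemma 2.2 suggests a direct validity test that only counts equivalence classes. A minimal sketch, assuming the same tuple-of-values representation as above (`fd_holds` is an illustrative name):

```python
def fd_holds(rows, X, A):
    """Lemma 2.2: X -> A holds iff |pi_X| == |pi_(X u {A})|."""
    def num_classes(attrs):
        return len({tuple(row[a] for a in attrs) for row in rows})
    return num_classes(X) == num_classes(list(X) + [A])

r = [(1, 'a'), (1, 'a'), (2, 'b')]
print(fd_holds(r, [0], 1))   # True: attribute 0 determines attribute 1
```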

2.3. Approximate dependencies

The error e(X → A) is the minimum fraction of tuples that must be deleted from the relation r so that X → A holds in r. It can be computed from π_X and π_{X∪{A}}: every equivalence class c of π_X is the union of one or more equivalence classes c1, c2, ... of π_{X∪{A}}; to make X → A hold, we keep one of these classes and delete the tuples of all the remaining ci. In the figure above, the equivalence class that is kept is {3,4}, and tuple 5 is deleted.

Therefore, the minimum number of tuples to remove from c is the size of c minus the size of the largest ci. Summing over all equivalence classes of π_X gives the total number of tuples to delete.

e(X → A) = 1 − ( Σ_{c ∈ π_X} max{ |c′| : c′ ∈ π_{X∪{A}}, c′ ⊆ c } ) / |r|

Understanding: the minimum number of tuples to delete from c is the size of c minus the size of the largest ci. In the figure below, the largest ci is {5,6,7,8}, of size 4, and c has 9 tuples, so the minimum number of deleted tuples is 9 − 4 = 5, i.e., tuples 3, 4, 9, 10, 11 are deleted.

(figure: an equivalence class c of π_X with 9 tuples, whose largest sub-class in π_{X∪{A}} is {5,6,7,8})
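The formula above can be computed directly. A minimal Python sketch (names are illustrative; `rows` is the relation, `X` a list of attribute indices, `A` a single attribute index):

```python
from collections import defaultdict

def error(rows, X, A):
    """e(X -> A): in each class c of pi_X, keep the largest sub-class
    of pi_(X u {A}) and count all other tuples as deleted."""
    cx = defaultdict(list)                  # class of pi_X -> tuple ids
    for t, row in enumerate(rows):
        cx[tuple(row[a] for a in X)].append(t)
    removed = 0
    for tuples in cx.values():
        sub = defaultdict(int)              # sizes of the sub-classes c_i
        for t in tuples:
            sub[rows[t][A]] += 1
        removed += len(tuples) - max(sub.values())
    return removed / len(rows)
```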

3. Search

3.1. Search strategy

To find all minimal non-trivial dependencies, TANE works as follows. It starts from singleton attribute sets and works through the set containment lattice toward larger attribute sets. When the algorithm processes a set X, it tests dependencies of the form X\{A} → A, where A ∈ X. This guarantees that only non-trivial dependencies [1] are considered. Working from small sets to large ones guarantees that only minimal dependencies are output, and it also allows the search space to be pruned effectively (see Figure 2).

A similar levelwise search strategy has been used successfully in many data mining applications. In addition to effective pruning, the efficiency of levelwise algorithms rests on using the results of previous levels to reduce the computation at each level.

In this section, we consider different aspects of the search, including effective pruning criteria for the levelwise algorithm in TANE and fast computation of partitions. Both tasks can be solved efficiently by using information from previous levels. Based on the material presented in this section, the precise algorithm is given in Section 4.

3.2. Simplifying the search space

3.2.1. Rhs candidates

TANE works through the lattice level by level. To test the minimality of a potential dependency X\{A} → A, we need to know whether Y\{A} → A holds for some proper subset Y of X. We store this information in the set C(Y) of rhs candidates of Y.

For a given set X, the initial set of rhs candidates of X is

C(X) = R \ C̄(X), where C̄(X) = { A ∈ X | X\{A} → A holds }.

C̄(X) is the set of attributes A of X for which X\{A} → A holds. If A ∈ C(X), then A does not depend on any proper subset of X (since X\{A} → A does not hold, Y\{A} → A cannot hold for any Y ⊆ X either). Spelled out:

① A ∉ C(X), i.e., A ∈ C̄(X): A belongs to X, and X\{A} → A holds (A depends on X\{A});

② A ∈ C(X): either A does not belong to X, or A belongs to X but X\{A} → A does not hold.

To find minimal dependencies, it suffices to test dependencies X\{A} → A where A ∈ X and A ∈ C(X\{B}) for all B ∈ X. This guarantees that Y\{A} → A does not hold for any proper subset Y of X.

The condition "A ∈ C(X\{B}) for all B ∈ X" splits into two cases:

① B ≠ A: here A ∈ X\{B}, so A ∈ C(X\{B}) means that X\{A, B} → A does not hold;

② B = A: A ∈ C(X\{A}) holds trivially, since A does not belong to X\{A}.

So A does not depend on any proper subset of X\{A}, and X\{A} → A is a minimal non-trivial functional dependency.

Example 2. To illustrate initial rhs candidates, let R = {A, B, C, D}, suppose TANE is considering the set X = {A, B, C}, and suppose {C} → A is a valid dependency. Since {C} → A holds, we have A ∉ C({A, C}) = C(X\{B}), which tells TANE that {B, C} → A is not minimal.

Indeed, C({A, C}) = R \ C̄({A, C}) = {A, B, C, D} \ {A} = {B, C, D}.

So for this B ∈ X we have A ∉ C(X\{B}), i.e., A depends on a proper subset of X, and therefore X\{A} → A is not minimal.

Pruning rules

TANE prunes the search space as follows: if C(X) = ∅, then C(Y) = ∅ for all supersets Y of X, so no dependency of the form Y\{A} → A can exist and the set Y need not be processed at all. A breadth-first search of the set containment lattice can use this information effectively, as shown in Figure 2.

If C(X) is empty, then C̄(X) is R (since every A ∈ C̄(X) must belong to X, this can only happen when X = R), i.e., every attribute depends on X. Then every attribute certainly depends on any superset Y of X as well, so C(Y) = ∅ and Y need not be considered.

Doubt: if C(X) = ∅ only when X = R, then X has no proper supersets at all. I don't really understand this part; I suspect C(X) should be defined as X \ C̄(X).

Figure 2. A pruned set containment lattice for {A, B, C, D}. After B is deleted, only the bold part is accessed by the levelwise algorithm.

3.2.2. Rhs+ candidate pruning

While the initial rhs candidates are sufficient to guarantee the minimality of the discovered dependencies, we use the improved rhs+ candidates C+(X) to prune the search space more effectively:

C+(X) = { A ∈ R | for all B ∈ X, X\{A, B} → B does not hold }

Note that A may equal B. The following lemma shows that rhs+ candidates can be used to test the minimality of dependencies, just like the initial rhs candidates.

Lemma 3.1

Let A ∈ X and let X\{A} → A be a valid functional dependency. Then X\{A} → A is a minimal functional dependency if and only if for all B ∈ X we have A ∈ C+(X\{B}).

A ∈ C+(X\{B}) means that A does not depend on the corresponding proper subsets of X.

If we replace C+(X\{B}) with C(X\{B}), the lemma still holds, but the rhs+ candidates have two advantages over the initial rhs candidates. First, we may encounter a B with A ∉ C+(X\{B}) earlier and stop checking sooner, saving time. Second, and more importantly, C+(X\{B}) can be empty for some B where C(X\{B}) cannot; with rhs+ candidates, the set X is then never processed at all, thanks to pruning.

How should this be understood?

The definition of C+(X) is based on a fundamental property of functional dependencies, as stated in the following lemma.

This part explains the transition from the initial rhs candidates to the rhs+ candidates.

Lemma 3.2

Let B ∈ X and let X\{B} → B be a valid dependency. If X → A holds, then X\{B} → A holds.

The lemma lets us drop extra attributes from the initial rhs candidate set C(X). Suppose X\{B} → B holds for some B ∈ X. Then, by the lemma, no dependency with X on the left-hand side can be minimal, since B could be removed from the left-hand side without changing validity. Therefore, we can safely delete the following set from C(X):

(R \ X) ∪ {B}

(for A ∉ X we have X\{A, B} = X\{B}, so X\{A, B} → B holds and A is no longer an rhs+ candidate; taking A = B removes B itself)

Example 3

(figure: an example of pruning C(X) down to C+(X))

Lemma 3.3

3.2.3. Key pruning

An attribute set X is a superkey if no two tuples agree on X, i.e., the partition π_X consists only of singleton equivalence classes.

(Compare: the primary key of a database table uniquely identifies each tuple.)

A set X is a key if it is a superkey and no proper subset of it is a superkey. Additional pruning can be applied when keys are found during the search for dependencies.

Lemma 3.4

Let B ∈ X, and X\{B}→B be a valid dependency. If X is a superkey, then X\{B} is a superkey.

Normally, a dependency X → A with A ∉ X is checked when X∪{A} is processed, because π_{X∪{A}} is needed to check its validity. However, if X is a superkey then X → A is always valid, and π_{X∪{A}} is not needed.

Now consider a superkey X that is not a key. Obviously, for all A ∉ X, the dependency X → A is not minimal. Furthermore, if A ∈ X and X\{A} → A holds, then by Lemma 3.4 X\{A} is a superkey, and we do not need π_X to test the validity of X\{A} → A. In other words, neither X nor π_X is used when looking for minimal dependencies. Therefore, we can delete all keys and their supersets.

Why are all keys and their supersets removed? Intuitively:

  • If C+(X) = ∅, delete X: X is useless, and it is cut so that its supersets are no longer considered.
  • If X is a (super)key, delete X: X is useful, but it has already been fully exploited, and it is cut so that its supersets are no longer considered.

3.3. Computing with partitions

Stripped partitions and the error e

We next describe two methods to reduce the time and space requirements of using partitions.

  • The first replaces partitions with a more compact representation, "stripped partitions".

  • The second is a method for quickly approximating the error e.

These methods optimize the algorithms described in the next section. Then, we describe how to efficiently compute partitions in the level-by-level TANE algorithm.

For both optimizations we need the notion of an approximate superkey. The error measure e can be generalized to other properties of relations; in particular, to the property of an attribute set being a superkey. We define e(X) as the minimum fraction of tuples that need to be removed from the relation r for X to be a superkey. If e(X) is small, then X is an approximate superkey. Using the equation e(X) = 1 − |π_X| / |r|, the error e(X) is easy to compute from the partition π_X.

Degree of approximation as a superkey: X should determine the remaining attributes R\X of the relation, so the smaller the fraction of tuples that must be removed, the closer X is to a superkey.

If X is a superkey, then |π_X| = |r|, so e(X) = 0.

3.3.1. Stripped partitions

π̂_X denotes the stripped version of the partition π_X, obtained by removing all equivalence classes of size 1. The intuition for discarding singleton equivalence classes is that a singleton equivalence class (on the left-hand side) cannot violate any dependency.

A stripped partition contains the same information as a full partition. For example, the value e(X) is easily calculated from the stripped partition using the equation:

e(X) = ( ‖π̂_X‖ − |π̂_X| ) / |r|

  • |π̂_X| denotes the order of the stripped partition: the number of equivalence classes in it
  • ‖π̂_X‖ denotes the sum of the sizes of all equivalence classes in the stripped partition π̂_X

Furthermore, the refinement relations of the partitions are the same, so Lemma 2.1 also holds for stripped partitions.

Lemma 2.2 does not hold for stripped partitions

(figure: a counterexample showing that Lemma 2.2 fails for stripped partitions)

Lemma 3.5

A functional dependency X → A holds if and only if e(X) = e(X∪{A}).

From the partitions we obtain the stripped partitions, from the stripped partitions we compute the errors e(X), and comparing these verifies X → A.
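A minimal sketch of this verification, assuming a stripped partition is represented as a list of sets of tuple ids (all names are illustrative):

```python
def e_superkey(stripped, n_tuples):
    """e(X) = (||pi^_X|| - |pi^_X|) / |r|, from the stripped partition."""
    return (sum(len(c) for c in stripped) - len(stripped)) / n_tuples

def fd_holds_stripped(pi_X, pi_XA, n_tuples):
    """Lemma 3.5: X -> A holds iff e(X) == e(X u {A})."""
    return e_superkey(pi_X, n_tuples) == e_superkey(pi_XA, n_tuples)
```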

3.3.2. Bounds on e

The following optimization saves time when judging approximate functional dependencies, because some values e(X → A) need not be computed at all.

For ①, if the lower bound already exceeds the threshold, the approximate dependency cannot hold; for ②, if the upper bound is already within the threshold, it must hold:

e(X) − e(X∪{A}) ≤ e(X → A) ≤ e(X)    (2)
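A small sketch of how the bounds can short-circuit the exact computation, assuming the inequality (2) as reconstructed above (the function name and the True/False/None protocol are illustrative):

```python
def approx_fd_by_bounds(e_X, e_XA, eps):
    """Decide e(X -> A) <= eps from the bounds when possible.
    Returns True or False, or None when the exact value is still needed."""
    if e_X - e_XA > eps:   # case 1: lower bound already exceeds the threshold
        return False
    if e_X <= eps:         # case 2: upper bound already within the threshold
        return True
    return None
```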

3.3.3. Computing Partitions

Partitions are not computed from scratch for each attribute set. Instead, as TANE works through the lattice, it computes each partition as the product of two previously computed partitions.

The product of two partitions π′ and π″, denoted π′·π″, is the least refined partition π that refines both π′ and π″. We have the following result:

Lemma 3.6:

π_X · π_Y = π_{X∪Y}, for all X, Y ⊆ R.

TANE computes partitions as follows:

① compute the partition for each single attribute;

② compute the partition of every attribute set X with at least two attributes as the product of the partitions of two different subsets of size |X| − 1 whose union is X.

(figure: computing the partition of X as a product of two partitions from the previous level)

Once TANE has the partition π_X, it computes the error e(X), which is used in validity tests based on Lemma 3.5. The full partition is needed only for computing the partitions of the next level.

After the initial setup of the partitions π_{A} for all A ∈ R, TANE operates on tuple identifiers only. This has two advantages. First, the actual attribute types and values can be discarded (only equality matters), and all computation is done on integers, so operations on partitions are simple and fast. Second, the identifiers of the violating tuples are readily available when computing approximate dependencies.

What does that last sentence mean? I don't understand it yet.

4. TANE Algorithm

TANE's pruning of the search space is based on the fact that, for a complete result, only the minimal functional dependencies need to be found. For effective pruning, the algorithm stores a set of rhs+ candidates C+(X) for each attribute combination X.

C+(X) = { A ∈ R | ∀B ∈ X: X\{A, B} → B does not hold }; the set C+(X) contains all attributes that may still depend on the set X.

4.1. TANE main algorithm

To find all valid minimal non-trivial dependencies, TANE searches the set containment lattice in a levelwise fashion. Level Li is the collection of attribute sets of size i that can still yield dependencies, based on the considerations of the previous sections. TANE starts from L1 = {{A} | A ∈ R}; using the information gathered during the run, L2 is computed from L1, L3 from L2, and so on.

Algorithm: TANE

Input: relation r on schema R

Output: the minimal non-trivial functional dependencies that hold in r

    1  L0 := {∅}
    2  C+(∅) := R
    3  L1 := { {A} | A ∈ R }
    4  i := 1
    5  while Li ≠ ∅ do
    6    COMPUTE_DEPENDENCIES(Li)
    7    PRUNE(Li)
    8    Li+1 := GENERATE_NEXT_LEVEL(Li)
    9    i := i + 1

Steps to sort out:

  • Level Li contains all attribute combinations of size i

  • Each attribute combination X stores a set of rhs+ candidates C+(X)

  • Line 6, compute_dependencies(Li), finds and outputs the minimal dependencies whose left-hand sides lie in this level, and updates the rhs+ candidate sets of this level.

  • Line 7, prune(Li), prunes the search space by deleting sets from Li:

    For each attribute set X in level Li: if C+(X) is empty, delete X from Li; if X is a key, then for each attribute A ∈ C+(X)\X:

    if A ∈ ⋂_{B∈X} C+((X∪{A})\{B}),

    then output X → A; finally, remove X from Li.

  • Line 8, generate_next_level, generates the level-(i+1) nodes from the level-i nodes.

  • From this overall flow it can be seen that the structure of the TANE algorithm is similar to that of the Apriori algorithm

  • Start at level 1 (attribute set of size 1) and move up level by level
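Putting these steps together, here is a hedged Python skeleton of the main loop (all names are illustrative; `compute_dependencies`, `prune`, and `generate_next_level` are the procedures sketched in the following subsections, and `valid` here uses the naive Lemma 2.2 test instead of partition-based errors):

```python
def tane(R, rows):
    """Levelwise main loop. R: attribute indices; rows: list of tuples."""
    cplus = {(): set(R)}                        # C+ of the empty set is R

    def valid(lhs, A):                          # naive Lemma 2.2 test
        n_x  = len({tuple(row[a] for a in lhs) for row in rows})
        n_xa = len({tuple(row[a] for a in list(lhs) + [A]) for row in rows})
        return n_x == n_xa

    def is_superkey(X):
        return len({tuple(row[a] for a in X) for row in rows}) == len(rows)

    level = [(a,) for a in sorted(R)]           # L1
    while level:
        compute_dependencies(level, cplus, R, valid)    # line 6
        prune(level, cplus, is_superkey)                # line 7
        level = generate_next_level(level)              # line 8
```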

4.2. Generating the next level

The generate_next_level(Li) procedure computes level Li+1 from level Li. Level Li+1 contains exactly those attribute sets of size i+1 all of whose subsets of size i are in Li. The pruning method guarantees that no dependencies are lost. The specification of generate_next_level(Li) is:

L_{i+1} = { X : |X| = i+1 and every Y ⊂ X with |Y| = i belongs to Li }

The algorithm is as follows:

    GENERATE_NEXT_LEVEL(Li)
    1  Li+1 := ∅
    2  for each K ∈ PREFIX_BLOCKS(Li) do
    3    for each pair {Y, Z} ⊆ K, Y ≠ Z, do
    4      X := Y ∪ Z
    5      if X\{A} ∈ Li for all A ∈ X then
    6        Li+1 := Li+1 ∪ {X}
    7  return Li+1

Steps to sort out:

  • PREFIX_BLOCKS(Li) on line 2 logically groups Li by common prefixes.
  • Lines 3 to 4 combine two different nodes of the same prefix block into a node of the next level.
  • Line 5 checks each newly generated set X of size i+1: only if all of its subsets of size i are in L[i] is X included in L[i+1].

Prefix block concept:

The prefix_blocks(Li) procedure divides Li into disjoint blocks as follows. Consider each set X ∈ Li as a sorted list of attributes. Two sets X, Y ∈ Li belong to the same prefix block if they have a common prefix of length i−1, i.e., they differ in exactly one attribute, and the differing attribute is the last one in each. Each prefix block forms a contiguous block in the lexicographic order of L[i]; thus, prefix blocks are easy to compute when Li is kept in lexicographic order.

Why prefix blocks? They guarantee that each generated X has exactly one attribute more than the sets of the previous level, and that each candidate X is generated only once.

Generation principle:

First, level L[i+1] is initialized to the empty set. Then, for each prefix block K of level Li and for each pair of distinct attribute sets Y, Z ∈ K, the union Y ∪ Z is assigned to X; if for every attribute A ∈ X the set X\{A} still belongs to level Li, then X is added to L[i+1].

Once all non-trivial, minimal functional dependencies of the given data set have been found, the algorithm terminates and no further level is generated.

Example:

In Figure 2, AB, AC, and AD belong to the same prefix block K; the pairs {Y, Z} are {AB, AC} (generating X = ABC), {AB, AD} (generating X = ABD), and {AC, AD} (generating X = ACD).
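A Python sketch of this generation step, assuming attribute sets are kept as sorted tuples (names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def generate_next_level(Li):
    """Combine pairs of sets sharing an (i-1)-prefix; keep X only when
    every (|X|-1)-subset of X is still present in Li (line 5)."""
    Li_set = set(Li)
    blocks = defaultdict(list)
    for X in sorted(Li):                    # sets are sorted attribute tuples
        blocks[X[:-1]].append(X)            # group by common prefix
    next_level = []
    for block in blocks.values():
        for Y, Z in combinations(block, 2):
            X = tuple(sorted(set(Y) | set(Z)))
            if all(tuple(a for a in X if a != A) in Li_set for A in X):
                next_level.append(X)
    return next_level
```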

4.3. Computing dependencies

The following is the dependency-computation procedure of the TANE algorithm:

    COMPUTE_DEPENDENCIES(Li)
    1  for each X ∈ Li do
    2    C+(X) := ⋂_{A∈X} C+(X\{A})
    3  for each X ∈ Li do
    4    for each A ∈ X ∩ C+(X) do
    5      if X\{A} → A is valid then
    6        output X\{A} → A
    7        remove A from C+(X)
    8        remove all B ∈ R\X from C+(X)

By Lemma 3.1 [2], steps 2, 4, and 5 guarantee that the output of this procedure is exactly the minimal dependencies of the form X\{A} → A, where X ∈ Li and A ∈ X.

The role of COMPUTE_DEPENDENCIES(Li):

① find the minimal dependencies whose left-hand sides lie in Li; ② compute the set C+(X) for every X ∈ Li. (The C+(X) set contains all attributes that may still depend on the set X, i.e., it is kept as large as theoretically possible.)

Steps to sort out:

  • Lines 1 and 2 compute the initial C+(X) of a node from the C+(X\{A}) sets of its parent nodes on the level above:

C+(X) = ⋂_{A∈X} C+(X\{A})

How to understand this step intuitively? Why can the C+ of the next level be computed from the C+ of the level above?

Take, for every A ∈ X, the attributes that may still depend on X\{A}, and intersect these sets: if some attribute is already ruled out in C+(X\{A}) for even one A, it must be ruled out for X as well.

  • Line 3 processes the nodes of level Li in turn

  • Lines 4-8 fully embody Lemma 3.1 and the intrinsic meaning of C+(X).

    principle:

    1. A ranges over X ∩ C+(X): the attributes of X that are still rhs candidates.

    2. If X\{A} → A holds, then X\{A} → A is a minimal functional dependency and is output, by Lemma 3.1 [2]:

      given that

      1. A ∈ X and X\{A} → A holds;
      2. A ∈ C+(X), so A does not depend on any proper subset of X.
    3. Remove A from C+(X), and remove from C+(X) all attributes that belong to R\X, so that the remaining C+(X) contains only attributes of X.

      (This removes every Z ∈ C+(X) for which X\{Z} → Z has just been output as minimal, and every Z that can no longer yield a minimal dependency.)

      In other words, once X\{A} → A is found to hold, no dependency X → B with B ∈ R\X can be minimal, since by Lemma 3.2 A could be dropped from its left-hand side. This matches the definition of C+: for B ∈ R\X we have X\{A, B} = X\{A}, so X\{A, B} → A holds and B ∉ C+(X); taking B = A likewise shows A ∉ C+(X).

      Note: why not also remove the attributes of X, leaving the rhs set empty? Because the remaining elements are a subset of X and are still needed: whether some X\{B} → B is minimal cannot be judged yet, so they are kept.

  • The validity test on line 5 is based on Lemma 3.5 [3]

  • Line 8 implements the difference between C+(X) and C(X). If this line were removed, the algorithm would still work correctly, but pruning would be less effective. (I don't fully understand this yet.)

  • Q: After compute_dependencies, C+(X) retains only elements of X; isn't C+(X)\X then empty? Why does the prune procedure still take elements from it?

A: It is non-empty exactly when X\{A} → A held for no A ∈ X ∩ C+(X). If even one held, then by line 8 C+(X) keeps only the elements of X and drops those outside X, because the corresponding dependencies are destined not to be minimal. So the pruning of C+ removes the hopeless B's and the A's whose dependency has already been found.


However, if X\{A} → A holds for no A in X ∩ C+(X), then there is still hope. But why must the dependency hold during key pruning? (I don't know.)
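A Python sketch of compute_dependencies that mirrors the eight pseudocode lines above (names are illustrative; `cplus` maps sorted attribute tuples to sets and is seeded with `cplus[()] = set(R)`, and `valid` is the test from the main-loop sketch):

```python
def compute_dependencies(Li, cplus, R, valid):
    for X in Li:                                         # lines 1-2
        parents = [cplus[tuple(a for a in X if a != A)] for A in X]
        cplus[X] = set.intersection(*parents)
    for X in Li:                                         # lines 3-8
        for A in sorted(set(X) & cplus[X]):
            lhs = tuple(a for a in X if a != A)
            if valid(lhs, A):                            # line 5
                print(f"{set(lhs)} -> {A}")              # line 6: minimal FD
                cplus[X].discard(A)                      # line 7
                cplus[X] -= set(R) - set(X)              # line 8
```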

4.4. Pruning the lattice

The following is the lattice-pruning procedure of the TANE algorithm: it removes the attribute sets X that no longer qualify.

    PRUNE(Li)
    1  for each X ∈ Li do
    2    if C+(X) = ∅ then
    3      delete X from Li
    4    if X is a (super)key then
    5      for each A ∈ C+(X)\X do
    6        if A ∈ ⋂_{B∈X} C+((X∪{A})\{B}) then
    7          output X → A
    8      delete X from Li

Steps to sort out:

  • Lines 1, 2, and 3 implement the pruning strategy for the rhs+ sets
  • Lines 4–8 implement the pruning strategy for keys
  • By Lemma 4.2 below: if X is a superkey and A ∈ X, the dependency X\{A} → A is valid and minimal if and only if X\{A} is a key and, for all B ∈ X, A ∈ C+(X\{B}). From this it can be seen that the output on line 7 is correct.

This procedure implements the two pruning rules described in Section 3:

① rhs+ pruning: if C+(X) = ∅, remove X from Li;

② key pruning: if X is a (super)key, remove X from Li.

Rhs+ pruning

By the first rule, X is removed if C+(X) = ∅; by the second rule, X is removed if X is a key. In the latter case, the algorithm may also output some dependencies. We show that pruning does not cause the algorithm to miss any dependencies.

Let's first consider pruning with an empty C+(X).

If C+(X) = ∅, then the dependency-computing loop on lines 4–8 of compute_dependencies and the loop on lines 5–7 of prune are never executed, so no new FD is output for X. Since C+(Y) = ∅ also holds for every Y ⊃ X, removing X has no effect on the output of the algorithm.

Now let's consider pruning keys. The correctness of pruning is based on the following lemma.

Key pruning

Lemma 4.2

Let X be a superkey and let A ∈ X. The dependency X\{A} → A is valid and minimal if and only if X\{A} is a key and, for all B ∈ X, A ∈ C+(X\{B}).

If X is a superkey and A ∈ X, how do we decide whether X\{A} → A is a minimal functional dependency?
If and only if:

  • Validity: X\{A} → A is valid exactly when X\{A} is a superkey; conversely, if X\{A} is not a superkey, then X\{A} → A cannot hold. (For minimality, X\{A} must moreover be a key.)

  • Minimality: A must not depend on any proper subset of X\{A}, which is equivalent to:

    • for all B ∈ X, A ∈ C+(X\{B}) (by the definition of C+);
    • i.e., A does not depend on the corresponding proper subsets of X.

    In summary, this is consistent with Lemma 3.1: X\{A} → A is valid and, for all B ∈ X, A ∈ C+(X\{B}), hence X\{A} → A is minimal.

Note: Lemma 4.2 [4] is in fact a combination of Lemma 3.1 [2] and Lemma 3.4 [5].

Lemma 4.2 leads to the following corollary, which is the core idea of key pruning:

Lemma 4.2 Corollary

During pruning, the dependency X → A is output on line 7 if and only if X is a superkey, A ∈ C+(X)\X, and for all B ∈ X, A ∈ C+((X∪{A})\{B}).

The second condition, "for all B ∈ X, A ∈ C+((X∪{A})\{B})", can equivalently be written as:

A ∈ ⋂_{B∈X} C+((X∪{A})\{B})

Question: how should this be read? What does A ∈ C+(X)\X mean, and how does it relate to Lemma 4.2?

Answer: by the time prune runs, the attributes of C+(X) that belong to X have already been considered (and none of them yielded a dependency; if one had, C+(X)\X would be empty by line 8), so only the attributes outside X remain, namely C+(X)\X. Since X is a superkey, X∪{A} is also a superkey; X∪{A} plays the role of X in Lemma 4.2 [4], X plays the role of X\{A}, and B is as in the lemma (clearly B ≠ A here). Hence (X∪{A})\{A} → A, i.e., X → A, is minimal.

Lemma 4.2 shows that such a dependency is valid and minimal. The lemma also shows that if a minimal dependency is not output by compute_dependencies because of pruning (precisely: because C+(X)\X is ignored there), it will be output by prune. So pruning works correctly.
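A matching Python sketch of prune (names are illustrative; `is_superkey` is the test from the main-loop sketch, and attribute sets are sorted tuples):

```python
def prune(Li, cplus, is_superkey):
    for X in list(Li):
        if not cplus.get(X):                        # lines 2-3: rhs+ pruning
            Li.remove(X)
        elif is_superkey(X):                        # lines 4-8: key pruning
            for A in sorted(cplus[X] - set(X)):
                parents = [tuple(sorted((set(X) | {A}) - {B})) for B in X]
                if all(A in cplus.get(Y, set()) for Y in parents):   # line 6
                    print(f"{set(X)} -> {A}")                        # line 7
            Li.remove(X)
```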

4.5. Computing Partitions

The algorithm above contains no explicit references to partitions. However, implementing the validity test on line 5 of the compute_dependencies procedure requires the values e(X) and e(X\{A}); likewise, the superkey test on line 4 of the prune procedure is based on e(X).

e(X) is defined as the minimum fraction of tuples that need to be removed from relation r in order for X to be a superkey. If e(X) is small, then X is an approximate superkey.

In TANE, these values are computed by formula (1) below; implementing the dependency test and the superkey test is the subject of this section.

e(X) = ( ‖π̂_X‖ − |π̂_X| ) / |r|    (1)

The values are computed from the stripped partitions, and the partitions themselves are computed as follows:

Initially, partitions are computed directly from the relation r for the singleton attribute sets. The partition π_{A} is computed from the column r[A] as follows.

What is a singleton attribute set: a set {A} consisting of a single attribute.

First, the column values are replaced by integers 1, 2, 3, ... so that the equivalence relation is preserved: equal values are replaced by the same integer and different values by different integers. This can be done in linear time, using a data structure such as a trie or a hash table that maps original values to integers. After this step the value t[A] is the identifier of the equivalence class [t]_{A} of π_{A}, and π_{A} is easy to construct.

Question: what is the point of this step? Replacing the original values with integers probably makes computing products of stripped partitions convenient.

Finally, the singleton equivalence classes of π_{A} are stripped away to form the stripped partition π̂_{A}.
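A minimal sketch of this initial step (the function name is illustrative; a Python dict plays the role of the hash table that renames values to integers):

```python
from collections import defaultdict

def single_attribute_partition(column):
    """Build the stripped partition pi^_{A} from one column of r."""
    classes = defaultdict(set)
    for t, v in enumerate(column):
        classes[v].add(t)                 # hashing plays the integer-renaming role
    return [c for c in classes.values() if len(c) > 1]   # strip singletons
```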

A partition for a larger attribute set X is computed when X is added to its level on line 6 of the generate_next_level procedure: there X = Y ∪ Z, and π_X is computed as the product of π_Y and π_Z.


The product of the partitions π_Y and π_Z is computed in linear time by the following procedure.

(I did not read this part of the pseudocode closely; it does not seem essential.)

(pseudocode: the paper's STRIPPED_PRODUCT procedure)
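For intuition, here is a hedged Python sketch of a linear-time stripped-partition product; it reimplements the idea described above rather than reproducing the paper's exact procedure (names are illustrative):

```python
from collections import defaultdict

def stripped_product(p1, p2, n_tuples):
    """Product of two stripped partitions: label every tuple with its
    class index in p1, then split each class of p2 along those labels."""
    T = [None] * n_tuples                # class index in p1, or None
    for i, c in enumerate(p1):
        for t in c:
            T[t] = i
    product = []
    for c in p2:
        groups = defaultdict(set)
        for t in c:
            if T[t] is not None:         # tuples stripped from p1 drop out
                groups[T[t]].add(t)
        product.extend(g for g in groups.values() if len(g) > 1)
    return product
```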

4.6. Approximate dependencies

For a given threshold ε, the TANE algorithm can be modified to compute all minimal approximate dependencies X → A with e(X → A) ≤ ε. The key modification is to change the validity test on line 5 of the compute_dependencies procedure to read:

Before the change:

    5  if X\{A} → A is valid then ...

After the change:

    5  if e(X\{A} → A) ≤ ε then ...

Also, pruning has to be slightly weakened by modifying line 8 of the compute_dependencies procedure:

Before the change:

    8  remove all B ∈ R\X from C+(X)

After the change:

    8  if X\{A} → A holds exactly (with error 0) then remove all B ∈ R\X from C+(X)

The algorithm above returns only the minimal approximate dependencies. In some applications it may also be useful to know approximate dependencies that are not minimal but have a smaller error. We leave the necessary modifications to the reader.

TANE first tries to resolve the test on line 5 using the bounds (2). For ①, if the lower bound exceeds the threshold, the approximate dependency does not hold; for ②, if the upper bound is within the threshold, it holds:

e(X) − e(X∪{A}) ≤ e(X → A) ≤ e(X)    (2)

If that fails, the exact value e(X\{A} → A) is computed from the partitions using the following procedure.

(pseudocode: the procedure that computes the exact error e(X → A) from π̂_X and π̂_{X∪{A}})

Note the similarity to the stripped-product procedure. Here, the table T must be initialized to 0 once at the start, but it does not need to be reinitialized afterwards.
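For intuition, here is a hedged Python sketch of this error computation; it follows the idea just described (store the size of each class of π̂_{X∪{A}} at one representative tuple) but is not the paper's exact procedure, and a dict stands in for the table T:

```python
def error_from_stripped(pi_X, pi_XA, n_tuples):
    """e(X -> A) from stripped partitions pi^_X and pi^_(X u {A})."""
    T = {}                                   # representative tuple -> |c''|
    for c2 in pi_XA:
        T[next(iter(c2))] = len(c2)
    removed = 0
    for c in pi_X:
        m = 1                                # singleton sub-classes have size 1
        for t in c:
            m = max(m, T.get(t, 0))          # largest sub-class seen in c
        removed += len(c) - m
    return removed / n_tuples
```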

5. TANE understanding

5.1. Efficiency

TANE is based on partitioning the set of rows according to attribute values, which makes testing the validity of functional dependencies fast even for large numbers of tuples. Partitions also make the discovery of approximate functional dependencies easy and efficient, and erroneous or exceptional rows are easy to identify.

5.2. Correctness

How do we guarantee that no dependencies are lost?

Computing functional dependencies

Because C+(X) stores all attributes that may still depend on X, and the dependencies we consider have the form X\{A} → A with A ∈ X, it suffices to check, one by one, the attributes A ∈ X ∩ C+(X): if X\{A} → A holds, it is a minimal functional dependency. This way, no non-trivial, minimal functional dependency over the attribute set X is missed.

Note: in the compute_dependencies procedure, the output FD is X\{A} → A, where A ∈ X.

The pruning procedure

The main purpose of the pruning procedure is to prune the attribute sets X of the current level, removing X whenever its supersets can no longer produce minimal functional dependencies; this saves time. Note that before prune is called, level Li has just been generated from the previous level; after the call, Li is pruned, so that generating the next level produces fewer sets and takes less time.

(a) Rhs+ pruning

If C+(X) = ∅, delete X: X is useless, and it is cut so that its supersets are no longer considered.

(b) Key pruning

Note: in the prune procedure (key pruning, to be precise), the output FD is X → A, where A ∈ C+(X)\X and, for all B ∈ X, A ∈ C+((X∪{A})\{B}).

Question 1: Why are some functional dependencies output during prune, after compute_dependencies has finished its output?

Answer: if a minimal dependency is not output by compute_dependencies because the C+(X)\X part is not considered there (only the attributes A of C+(X) that belong to X are), then it is output by the pruning procedure prune.

Question 2: Why is it legitimate for prune to output functional dependencies?

Answer: Lemma 4.2 and its corollary.

Question 3: When considering C+(X)\X, why must X be a (super)key? If it is not, can no A be found in C+(X)\X?

(I don't know; my brain is not enough qaq)

5.3. Reasonableness

For the algorithm itself, the partition computation, the level generation, and the other parts: understand the mechanism by which the code operates.

6. Summary

We have presented a new algorithm, TANE, for discovering functional and approximate dependencies. The method is based on considering partitions of the relation and deriving valid dependencies from the partitions. The algorithm searches for dependencies in a breadth-first, levelwise manner. We showed how to prune the search space effectively and how to compute partitions and dependencies efficiently. Experimental results and comparisons show that the algorithm is fast in practice and that its scale-up performance is better than that of previous methods. The method is suitable for relations with up to hundreds of thousands of tuples.

The approach works best when the dependencies are relatively small. When the size of the (minimal) dependencies is about half the number of attributes, the number of dependencies can be exponential in the number of attributes, which is more or less bad for any algorithm. When the dependencies are larger than that, a levelwise method that starts its search from small dependencies is clearly far from optimal. In principle, a levelwise search could start from large dependencies, but then the partitions could not be computed efficiently.

There are other interesting applications of partitions in data mining. Association rules between attribute-value pairs can be computed with small modifications to the present algorithm: equivalence classes correspond to particular value combinations of attribute sets, and association rules can be found by comparing equivalence classes rather than full partitions. A possible future research direction is to use the unified view that partitions provide of functional dependencies and association rules to find a proper generalization of both, and to develop algorithms for discovering such rules.


  1. Note: the dependencies discussed here are non-trivial functional dependencies; "A depends on X" means that X\{A} → A holds. ↩︎

  2. Lemma 3.1. Let A ∈ X and let X\{A} → A be a valid functional dependency. X\{A} → A is a minimal functional dependency if and only if, for all B ∈ X, A ∈ C+(X\{B}). ↩︎ ↩︎ ↩︎

  3. Lemma 3.5. A functional dependency X → A holds if and only if e(X) = e(X∪{A}). ↩︎

  4. Lemma 4.2. Let X be a superkey and let A ∈ X. The dependency X\{A} → A is valid and minimal if and only if X\{A} is a key and, for all B ∈ X, A ∈ C+(X\{B}). ↩︎ ↩︎

  5. Lemma 3.4. Let B ∈ X and let X\{B} → B be a valid dependency. If X is a superkey, then X\{B} is a superkey. ↩︎


Origin blog.csdn.net/He_r_o/article/details/118419121