Association Analysis

Purpose: to discover interesting relationships (association rules) between the items in a data set.
Basic concepts:
    1. Transaction: each record in the data set is called a transaction. For example, the data set in Example 1 contains four transactions.
    2. Item: each object that appears in a transaction is called an item, such as Cola or Egg.
    3. Itemset: a set of zero or more items is called an itemset, such as {Cola, Egg, Ham}.
    4. k-itemset: an itemset containing k items is called a k-itemset. For example, {Cola} is a 1-itemset and {Cola, Egg} is a 2-itemset.
    5. Support count: the number of transactions in which an itemset appears. For example, {Diaper, Beer} appears in transactions 002, 003 and 004, so its support count is 3.
    6. Support: the support count divided by the total number of transactions. For example, the total number of transactions above is 4 and the support count of {Diaper, Beer} is 3, so its support is 3 ÷ 4 = 75%, which means that 75% of customers bought Diaper and Beer at the same time.
    7. Frequent itemset: an itemset whose support is greater than or equal to a given threshold. For example, with the threshold set to 50%, {Diaper, Beer} (support 75%) is a frequent itemset.
    8. Antecedent and consequent: for the rule {Diaper} → {Beer}, {Diaper} is called the antecedent and {Beer} is called the consequent.
    9. Confidence: for the rule {Diaper} → {Beer}, the support count of {Diaper, Beer} divided by the support count of {Diaper} is the confidence of the rule. For example, the confidence of {Diaper} → {Beer} is 3 ÷ 3 = 100%, which means that 100% of the people who bought Diaper also bought Beer (see the sketch after this list).
    10. Strong association rule: a rule that meets both the minimum support threshold and the minimum confidence threshold. The ultimate goal of association analysis is to find strong association rules.
    11. Frequent k-itemset: a k-itemset that satisfies the minimum support threshold.
    12. Candidate k-itemset: a k-itemset formed by joining frequent (k-1)-itemsets.
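
To make these definitions concrete, here is a minimal Python sketch that computes support count, support, and confidence. Since the "Example 1" table is not reproduced in this post, the four transactions below are a hypothetical reconstruction that is merely consistent with the counts quoted above.

    # Hypothetical reconstruction of the four transactions of "Example 1";
    # only the counts quoted above ({Diaper, Beer} appearing in 002, 003
    # and 004) come from the text, the remaining items are illustrative.
    transactions = {
        "001": {"Cola", "Egg", "Ham"},
        "002": {"Cola", "Diaper", "Beer"},
        "003": {"Diaper", "Beer", "Ham"},
        "004": {"Diaper", "Beer"},
    }

    def support_count(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for items in transactions.values() if itemset <= items)

    def support(itemset):
        # Support count divided by the total number of transactions.
        return support_count(itemset) / len(transactions)

    def confidence(antecedent, consequent):
        # Support count of the union divided by that of the antecedent.
        return support_count(antecedent | consequent) / support_count(antecedent)

    print(support({"Diaper", "Beer"}))       # 0.75 -> 75% support
    print(confidence({"Diaper"}, {"Beer"}))  # 1.0  -> 100% confidence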

Example:

transaction number    items
0                     soymilk, lettuce
1                     lettuce, diapers, wine, beets
2                     soymilk, diapers, wine, orange juice
3                     lettuce, soymilk, diapers, wine
4                     lettuce, soymilk, diapers, orange juice

The support of an itemset is defined as the percentage of records in the data set that contain the itemset. As the table shows, the support of {soymilk} is 4/5 and the support of {soymilk, diapers} is 3/5. Support is defined for itemsets, so a minimum support can be set and only the itemsets that satisfy it are kept.
Confidence is defined for association rules. The confidence of the rule {diapers} → {wine} is defined as support({diapers, wine}) / support({diapers}). Since {diapers, wine} has a support of 3/5 and {diapers} has a support of 4/5, the rule {diapers} → {wine} has a confidence of 3/4.
This means that the rule holds for 75% of all records containing diapers.
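
These numbers are easy to verify in a few lines of Python:

    transactions = [
        {"soymilk", "lettuce"},
        {"lettuce", "diapers", "wine", "beets"},
        {"soymilk", "diapers", "wine", "orange juice"},
        {"lettuce", "soymilk", "diapers", "wine"},
        {"lettuce", "soymilk", "diapers", "orange juice"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    print(support({"soymilk"}))             # 0.8  (4/5)
    print(support({"soymilk", "diapers"}))  # 0.6  (3/5)
    # Confidence of {diapers} -> {wine}:
    print(support({"diapers", "wine"}) / support({"diapers"}))  # 0.75 (3/4)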

1) Apriori algorithm
The Apriori principle states that if an itemset is frequent, then all of its subsets are also frequent. More often used is its contrapositive: if an itemset is infrequent, then all of its supersets are also infrequent. For example, if {Diaper, Beer} turns out to be infrequent, there is no need to count any itemset that contains it.
Steps:
    1. First compute the support of each 1-itemset and filter out the frequent 1-itemsets.
    2. Then join the frequent 1-itemsets into candidate 2-itemsets, compute their support, and filter out the frequent 2-itemsets.
    3. Then form candidate 3-itemsets by joining and pruning, compute their support, and filter out the frequent 3-itemsets.
    4. Continue in the same way for candidate k-itemsets until no new frequent itemsets appear (see the sketch after these steps).
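
As a concrete illustration of these steps, here is a minimal Apriori sketch in Python. It is an illustrative implementation written for this post, not the original author's code; transactions is a list of sets and min_support is a fraction between 0 and 1.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {itemset: support} for every frequent itemset."""
        n = len(transactions)
        # Step 1: support of each 1-itemset, keep the frequent ones.
        items = {item for t in transactions for item in t}
        freq = {}
        current = []
        for item in sorted(items):
            s = sum(1 for t in transactions if item in t) / n
            if s >= min_support:
                freq[frozenset([item])] = s
                current.append(frozenset([item]))
        k = 2
        while current:
            # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
            candidates = set()
            for a, b in combinations(current, 2):
                union = a | b
                if len(union) == k:
                    # Prune step (Apriori principle): every (k-1)-subset
                    # of a candidate must itself be frequent.
                    if all(frozenset(sub) in freq
                           for sub in combinations(union, k - 1)):
                        candidates.add(union)
            # Count the surviving candidates against the whole data set
            # (this full scan at every level is the costly part).
            current = []
            for c in candidates:
                s = sum(1 for t in transactions if c <= t) / n
                if s >= min_support:
                    freq[c] = s
                    current.append(c)
            k += 1
        return freq

On the five-transaction example above, apriori(transactions, 0.6) keeps, for instance, {soymilk} with support 0.8 and {soymilk, diapers} and {diapers, wine} with support 0.6.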
Advantages:
    The Apriori property greatly improves the efficiency of generating frequent itemsets level by level; the algorithm is simple and easy to understand; and it places few requirements on the data set.
Disadvantages:
    1. The number of candidate k-itemsets can be huge.
    2. Verifying the candidate k-itemsets requires scanning the entire database at every level, which is very time-consuming.

2) FP-growth algorithm
Reference: http://blog.csdn.net/huagong_adu/article/details/17739247
Idea and algorithm steps: traverse the data set once to count how many times each item occurs, then remove the items that do not meet the minimum support. This yields the filtered frequent items, after which construction of the FP tree begins.
     
The process of building the FP tree is the process of adding the filtered transactions to the tree, which requires a second pass over the data set. During this pass only the frequent items are considered, and within each transaction they are sorted in descending order of support; the sorted transactions are then used to fill the tree.
The filling process is as follows: first build an empty tree (a single root node). When the first sorted transaction is traversed, add all of its items to the tree top-down as a chain of child nodes. For each subsequent transaction, walk down from the root: if the next frequent item already exists as a child of the current node, simply add 1 to that child's count; if it does not, add the item to the tree as a new child node. The remaining transactions are added in the same way until all of them are in the FP tree (a sketch of this construction follows below).
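
Below is a minimal sketch of this two-pass construction, assuming transactions is a list of item collections and min_count is an absolute support count. A complete implementation would also maintain a header table linking all tree nodes of each item, which the mining phase relies on; that bookkeeping is omitted here.

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item = item       # item name, or None for the root
            self.count = 1         # how many transactions share this prefix
            self.parent = parent
            self.children = {}     # item -> FPNode

    def build_fp_tree(transactions, min_count):
        # Pass 1: count each item and keep only those meeting min_count.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        frequent = {i: c for i, c in counts.items() if c >= min_count}

        # Pass 2: sort each transaction by descending support (ties broken
        # alphabetically) and insert it into the tree from the root down.
        root = FPNode(None, None)
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-frequent[i], i))
            node = root
            for item in items:
                if item in node.children:   # shared prefix: bump the count
                    node.children[item].count += 1
                else:                       # new branch under this node
                    node.children[item] = FPNode(item, node)
                node = node.children[item]
        return root, frequent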
After the tree is built, the frequent itemsets are finally generated by mining this FP tree (by extracting a conditional pattern base for each frequent item).
Application scenarios:
    Optimizing shelf product placement, optimizing the contents of mailed product catalogs, cross-selling and bundle selling, and anomaly identification.
Disadvantage: the FP tree has to be held in memory, so the algorithm is not suitable for very large data sets.

Sample transaction data:
milk, eggs, bread, potato chips
eggs, popcorn, chips, beer
eggs, bread, chips
milk, eggs, bread, popcorn, chips, beer
milk, bread, beer
eggs, bread, beer
milk, bread, potato chips
milk, eggs, bread, butter, potato chips
milk, eggs, butter, potato chips
