Introduction to association rule mining

Association analysis discovers interesting associations and correlations between items in large data sets. A typical example is market basket analysis. In the big data era, association analysis is one of the most common data mining tasks.

Overview

Association analysis is a simple and practical analytical technique. It looks for associations or correlations in a large data set in order to describe the rules and patterns by which certain attributes of things occur together.

Association analysis can discover, from large amounts of data, interdependencies and relationships among things, features, or data items that frequently occur together. These associations are not always known in advance; they are obtained by analyzing the associations in the data set.

Association analysis is of great value to business decisions. It is commonly used in physical stores and e-commerce for cross-category recommendation, shopping cart joint marketing, shelf layout and display, and joint promotions. The goals are to let associated products boost each other's sales, improve the user experience, reduce the time that staff and users have to invest, and find high-potential users.

Association analysis of a data set can yield rules of the form "the occurrence of certain events leads to the occurrence of certain other events".

For example, if "67% of customers who buy beer also buy diapers", then arranging beer and diapers sensibly on the shelves, or bundling them, can improve the supermarket's service quality and efficiency. If "students who do well in the 'C# language' course have an 88% chance of doing well in 'data structure'", then teaching results can be improved by strengthening the study of "C# language".

Market basket analysis is the typical example of association analysis. By discovering links between the different items that customers put into their shopping baskets, it analyzes customers' buying habits. Knowing which items customers frequently purchase together helps retailers develop marketing strategies. Other applications include price list design, product promotion, product placement, and customer segmentation based on buying patterns. Examples: selling shampoo and conditioner as a set; placing milk and bread close to each other; and showing which other products were bought by users who purchased a given product.

Associations are not limited to the products mentioned above. In medicine, researchers hope to find common characteristics of patients with a given disease among tens of thousands of medical records, in order to find better preventive measures. Similarly, analyzing users' bank credit card bills can reveal their consumption patterns, which helps in marketing the corresponding products. Association analysis now touches many aspects of people's lives and is a great help to production, marketing, and daily life.

Basic concepts

Frequent itemset mining can discover interesting associations or correlations between items in large transactional or relational data sets, which helps businesses make decisions and analyze customer buying habits. For example, Table 1 shows the transactions of several customers of a supermarket, where TID is the transaction number and Items lists the products bought in that transaction.

Table 1. Sample data set for association analysis
TID    Items
001    Cola, Egg, Ham
002    Cola, Diaper, Beer
003    Cola, Diaper, Beer, Ham
004    Diaper, Beer

Association analysis of this data set can identify association rules such as {Diaper} → {Beer}. The rule means that customers who purchase Diaper also tend to purchase Beer. The relationship is not guaranteed, but it is very likely, and that is enough to help the business adjust the placement of Diaper and Beer, for example by placing them close together or by promoting them as a bundle to increase sales.

The following basic concepts are commonly used in association analysis.

Name: Explanation
Transaction: Each transaction record is called a transaction. For example, Table 1 contains four transactions.
Item: Each product in a transaction is called an item, such as Diaper or Beer.
Itemset: A collection of zero or more items, such as {Beer, Diaper} or {Beer, Cola, Ham}.
k-itemset: An itemset containing k items. For example, {Cola, Beer, Ham} is a 3-itemset.
Support count: The number of transactions in which an itemset appears. For example, {Diaper, Beer} appears in transactions 002, 003, and 004, so its support count is 3.
Support: The support count divided by the total number of transactions. For example, the total number of transactions in Table 1 is 4 and the support count of {Diaper, Beer} is 3, so the support of {Diaper, Beer} is 75%, meaning that 75% of the transactions contain both Diaper and Beer.
Frequent itemset: An itemset whose support is greater than or equal to a given threshold. For example, with the threshold set to 50%, {Diaper, Beer} is a frequent itemset because its support is 75%.
Antecedent and consequent: For a rule {A} → {B}, {A} is called the antecedent and {B} the consequent.
Confidence: For a rule {A} → {B}, the support count of {A, B} divided by the support count of {A}. For example, the confidence of the rule {Diaper} → {Beer} is 3/3, i.e. 100%, meaning that everyone who bought Diaper also bought Beer.
Strong association rule: A rule whose support and confidence are greater than or equal to the minimum support threshold and the minimum confidence threshold, respectively. "Association rule" in the usual sense refers to a strong association rule. The ultimate goal of association analysis is to find strong association rules.
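
The definitions of support count, support, and confidence can be checked against Table 1 with a few lines of Python. This is only a minimal sketch: the function names (support_count, support, confidence) and the dictionary layout are assumptions made for illustration, not part of the original article.

```python
# Transactions from Table 1, keyed by TID (layout chosen for illustration).
transactions = {
    "001": {"Cola", "Egg", "Ham"},
    "002": {"Cola", "Diaper", "Beer"},
    "003": {"Cola", "Diaper", "Beer", "Ham"},
    "004": {"Diaper", "Beer"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def support(itemset, transactions):
    """Support count divided by the total number of transactions."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support count of antecedent plus consequent, over that of the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)


print(support_count({"Diaper", "Beer"}, transactions))  # 3
print(support({"Diaper", "Beer"}, transactions))        # 0.75
print(confidence({"Diaper"}, {"Beer"}, transactions))   # 1.0
```

The three printed values match the worked examples in the table: a support count of 3, a support of 75%, and a confidence of 100% for {Diaper} → {Beer}.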

Steps of association analysis

In general, for a given transaction data set, association analysis is the process of finding strong association rules that satisfy the user-specified minimum support and minimum confidence. It is usually divided into two major steps: discovering frequent itemsets and discovering association rules.

1. Discovering frequent itemsets

Discovering frequent itemsets means using the user-specified minimum support to find all frequent itemsets, that is, all itemsets whose support is not less than the minimum support set by the user.
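
As a rough illustration of this step, the sketch below enumerates every candidate itemset over the items of Table 1 and keeps those that reach the minimum support. It is a brute-force enumeration written for clarity, not the algorithm used in practice: the number of candidates grows exponentially with the number of items, and real algorithms prune the search. The function name frequent_itemsets and the 50% threshold are assumptions for the example.

```python
from itertools import combinations

# Transactions from Table 1.
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    # Try every non-empty combination of items (exhaustive, for illustration only).
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count / n >= min_support:
                result[cset] = count / n
    return result


freq = frequent_itemsets(transactions, min_support=0.5)
# Print smaller itemsets first, then alphabetically.
for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a 50% threshold, this yields nine frequent itemsets for Table 1, including {Beer, Diaper} with support 0.75. {Egg} is excluded because it appears in only one of the four transactions.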

In fact, these frequent itemsets may contain one another. For example, the itemset {Diaper, Beer, Cola} contains the itemset {Diaper, Beer}. In general, we only care about the so-called maximal frequent itemsets, that is, the frequent itemsets that are not contained in any other frequent itemset. Finding all frequent itemsets is the basis for forming association rules.

The number of frequent itemsets generated from a transaction data set can be very large. It is therefore useful to find a smaller, representative set of itemsets from which all other frequent itemsets can be derived.

Name: Explanation
Closed itemset: An itemset X is closed if none of its immediate supersets has the same support count as X.
Frequent closed itemset: An itemset X is a frequent closed itemset if it is closed and its support is greater than or equal to the minimum support threshold.
Maximal frequent itemset: An itemset X is a maximal frequent itemset if it is frequent and none of its immediate supersets is frequent.

Every maximal frequent itemset is closed, because no maximal frequent itemset can have the same support count as one of its immediate supersets: those supersets are not frequent, so their support counts fall below the threshold that the itemset itself meets. Maximal frequent itemsets effectively provide a compact representation of the frequent itemsets. In other words, they form the smallest set of itemsets from which all frequent itemsets can be derived.
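
The closed and maximal definitions can be tried out on Table 1 directly. The sketch below is an illustration under stated assumptions: a minimum support of 50% (a support count of at least 2 out of 4 transactions), exhaustive enumeration of all itemsets, and helper names (count, immediate_supersets) chosen for this example only.

```python
from itertools import combinations

# Transactions from Table 1.
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]
MIN_COUNT = 2  # assumed threshold: 50% of the 4 transactions

items = sorted(set().union(*transactions))


def count(itemset):
    """Support count: number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)


# Support count of every non-empty itemset over the items of Table 1.
counts = {
    frozenset(c): count(frozenset(c))
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
}


def immediate_supersets(itemset):
    """Itemsets obtained by adding exactly one more item."""
    return [itemset | {i} for i in items if i not in itemset]


frequent = {s for s, c in counts.items() if c >= MIN_COUNT}

# Closed: no immediate superset has the same support count.
closed_frequent = {
    s for s in frequent
    if all(counts[sup] != counts[s] for sup in immediate_supersets(s))
}

# Maximal: no immediate superset is frequent.
maximal_frequent = {
    s for s in frequent
    if all(sup not in frequent for sup in immediate_supersets(s))
}

print(sorted(sorted(s) for s in closed_frequent))
print(sorted(sorted(s) for s in maximal_frequent))
```

Under these assumptions, the frequent closed itemsets are {Cola}, {Beer, Diaper}, {Cola, Ham}, and {Beer, Cola, Diaper}. The maximal frequent itemsets are {Cola, Ham} and {Beer, Cola, Diaper}, both of which also appear among the closed ones.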

2. Discovering association rules

Discovering association rules means using the user-specified minimum confidence to find, within each maximal frequent itemset, the association rules whose confidence is not less than that minimum confidence.

Compared with the first step, the second step is relatively simple, because it only needs to enumerate all possible association rules based on the frequent itemsets already found. All association rules are generated from frequent itemsets, so they already satisfy the support threshold. The second step therefore only has to check the confidence threshold, and only the rules whose confidence is not less than the user-specified minimum confidence are kept.
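
A minimal sketch of this second step follows. It splits one frequent itemset into every possible antecedent and consequent and keeps the rules that reach the minimum confidence. The function name rules_from_itemset and the 70% threshold are assumptions for illustration. The sketch also assumes the itemset passed in is frequent, so every antecedent occurs in at least one transaction.

```python
from itertools import combinations

# Transactions from Table 1.
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]


def support_count(itemset):
    """Number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, min_confidence):
    """Yield (antecedent, consequent, confidence) for each qualifying split."""
    itemset = frozenset(itemset)
    total = support_count(itemset)
    # Every non-empty proper subset can serve as an antecedent.
    for k in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), k):
            a = frozenset(antecedent)
            conf = total / support_count(a)
            if conf >= min_confidence:
                yield a, itemset - a, conf


for a, c, conf in rules_from_itemset({"Beer", "Cola", "Diaper"}, min_confidence=0.7):
    print(sorted(a), "->", sorted(c), conf)
```

For {Beer, Cola, Diaper}, only {Beer, Cola} → {Diaper} and {Cola, Diaper} → {Beer} pass the 70% threshold, each with confidence 2/2. The single-item antecedents reach only 2/3, and so does {Beer, Diaper} → {Cola}.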

Origin blog.csdn.net/yuyuy0145/article/details/92430124