Common Data Mining Algorithms for IoT Big Data

1. Apriori Algorithm

1. Overview of Apriori Algorithm

The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. It was proposed by Rakesh Agrawal and Ramakrishnan Srikant. It uses an iterative method called level-wise search, in which k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found and denoted L1. L1 is used to find the set L2 of frequent 2-itemsets, L2 is used to find L3, and so on until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. To improve the efficiency of this level-by-level generation of frequent itemsets, an important property called the Apriori property is used to compress the search space. It states that, first, all non-empty subsets of a frequent itemset must also be frequent, and second, all supersets of an infrequent itemset are infrequent.

  • Association Analysis
      Association analysis is the task of finding interesting relationships in large-scale datasets. These relationships can take two forms: (1) frequent itemsets and (2) association rules.

    (1) Frequent Itemset
    Frequent itemset: a collection of items that often appear together.

    Quantification method: support. Support is the proportion of records in the dataset that contain the itemset. For example, in the dataset [[1, 3, 4], [2, 3, 5], [1, 2, 3], [2, 5]], the support of the itemset {2} is 3/4, and the support of the itemset {2, 3} is 1/2.
    (2) Association rules
    Association rule: implies that there may be a strong relationship between two itemsets.

Quantification method: confidence (also called credibility). Confidence is defined for an association rule such as {2}-->{3}. The confidence of this rule is support({2, 3}) / support({2}) = (1/2) / (3/4) = 2/3, which means that of all records containing {2}, 2/3 also contain {3} and thus match the rule.
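As a quick illustration, the following is a minimal Python sketch (my own example, not code from the original post) that computes the support and confidence values quoted above for the sample dataset:

def support(dataset, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in dataset if itemset <= set(transaction))
    return hits / len(dataset)

def confidence(dataset, antecedent, consequent):
    """Confidence of the rule antecedent --> consequent."""
    return support(dataset, set(antecedent) | set(consequent)) / support(dataset, antecedent)

dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3], [2, 5]]
print(support(dataset, {2}))          # 0.75  (= 3/4)
print(support(dataset, {2, 3}))       # 0.5   (= 1/2)
print(confidence(dataset, {2}, {3}))  # 0.666... (= 2/3)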

  • The Apriori principle
       Whether for frequent itemsets or association rules, support must be calculated. If the amount of data is small, the support of an itemset can be obtained by scanning all the data once per itemset, counting the number of occurrences of the itemset, and dividing by the total number of transaction records. But a dataset of N items has 2^N - 1 possible itemset combinations: even 4 items already require 15 passes over the data, and 100 items give 2^100 - 1 (roughly 1.27 × 10^30) possible itemset combinations. For a modern computer this would take an extremely long time to compute, not to mention that a store may carry hundreds or thousands of products. The Apriori algorithm reduces the computation time by narrowing down the itemsets that may be of interest.

If an itemset is frequent, then all of its subsets are also frequent: if {2, 3} is frequent, then {2} and {3} must also be frequent. Conversely, if an itemset is infrequent, then all of its supersets are also infrequent: if {2, 3} is infrequent, then {0, 2, 3}, {1, 2, 3}, and {0, 1, 2, 3} are also infrequent. So once the calculated support shows that {2, 3} is infrequent, the supports of {0, 2, 3}, {1, 2, 3}, and {0, 1, 2, 3} do not need to be calculated at all.

The Apriori principle can avoid the exponential growth of the number of itemsets, so that frequent itemsets can be calculated in a reasonable time.
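A small sketch of how this pruning is applied in code (the helper name has_infrequent_subset mirrors the pseudocode given later in this section; the data below is just the running example):

from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_minus_1):
    """Return True if any (k-1)-subset of `candidate` is not frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_k_minus_1
               for sub in combinations(candidate, k - 1))

# Suppose level 2 produced the frequent itemsets below, so {2, 3} turned out to be infrequent:
L2 = {frozenset(s) for s in ({0, 1}, {0, 2}, {1, 2})}
print(has_infrequent_subset(frozenset({0, 1, 2}), L2))  # False -> keep the candidate
print(has_infrequent_subset(frozenset({1, 2, 3}), L2))  # True  -> prune it without counting its support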

  • Apriori algorithm finds frequent itemsets
    Apriori algorithm process:

    1) First, generate the list C1 of candidate itemsets of size 1 from the dataset;

    2) Calculate the support of each element of C1 with the frequent-itemset filtering function and remove the items that do not satisfy the minimum support, giving the frequent 1-itemset list L1;

    3) Generate the candidate itemset list C2 with k = 2 from L1 using the candidate-generation function;

    4) Generate from C2, using the frequent-itemset filtering function, the frequent itemset list L2 with k = 2 that satisfies the minimum support;

    5) Increase the value of k and repeat steps 3) and 4) to generate Lk, until Lk is empty; then return the list L, which contains L1, L2, L3, ...

 

The following is an explanation of the function for creating candidate itemsets.

 

Generally, the initial itemsets have size 1, and itemsets of size k = 2 are generated from the itemsets of size k = 1, so the initial value of k is 2. To generate itemsets of size k, compare the first k-2 items of two (k-1)-itemsets. If those are equal, each of the two itemsets contributes one remaining element that differs, so the merged set has (k-2) + 1 + 1 = k items. (The example derivation below ignores the filtering of C2 into L2 and treats C2 directly as L2, which is equivalent to a minimum support of 0.) Comparing only the first k-2 elements reduces the number of passes over the list. For example, to build 3-element itemsets from {0, 1}, {0, 2}, {1, 2}: merging every pair of sets yields {0, 1, 2}, {0, 1, 2}, {0, 1, 2}, i.e. the same result replicated three times, which then still has to be deduplicated. If instead only the first k-2 = 1 element is compared and sets are merged only when that element matches, a single merge of {0, 1} and {0, 2} yields {0, 1, 2}, so there is no need to traverse the list afterwards to remove duplicates.
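A compact Python sketch of this candidate-generation step (a textbook-style implementation written for illustration, not taken from the original post; each itemset is handled in sorted order):

def apriori_gen(Lk_minus_1, k):
    """Generate candidate k-itemsets by merging (k-1)-itemsets whose first k-2 sorted elements are equal."""
    candidates = []
    items = [sorted(s) for s in Lk_minus_1]
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i][:k - 2] == items[j][:k - 2]:   # compare only the first k-2 elements
                candidates.append(frozenset(items[i]) | frozenset(items[j]))
    return candidates

L2 = [frozenset(s) for s in ({0, 1}, {0, 2}, {1, 2})]
print(apriori_gen(L2, 3))   # [frozenset({0, 1, 2})] -- produced exactly once, no deduplication needed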

Algorithm: generation of frequent itemsets in the Apriori algorithm

5. Pseudocode:

Input: data set D; minimum support threshold min_sup

Output: frequent itemsets L in D

(1)  L1 = find_frequent_1-itemsets( D );
(2)  for ( k = 2; Lk−1 ≠ ∅; k++ )
(3)  {
(4)      Ck = apriori_gen( Lk−1 );              // generate candidate itemsets
(5)      for all transactions t ∈ D
(6)      {
(7)          Ct = subset( Ck, t );              // identify all candidates contained in t
(8)          for all candidates c ∈ Ct
(9)          {
(10)             c.count++;                     // increment the support count
(11)         }
(12)     }
(13)     Lk = { c ∈ Ck | c.count ≥ min_sup }    // extract the frequent k-itemsets
(14) }
(15) return L = ∪k Lk;

procedure apriori_gen( Lk−1 )
(1)  for each itemset l1 ∈ Lk−1
(2)      for each itemset l2 ∈ Lk−1
(3)          if ( l1[1]=l2[1] ) ∧ … ∧ ( l1[k-2]=l2[k-2] ) ∧ ( l1[k-1]<l2[k-1] ) then
(4)          {
(5)              c = join( l1, l2 );                      // join step: generate candidates
(6)              if has_infrequent_subset( c, Lk−1 ) then
(7)                  delete c;                            // prune step: remove infrequent candidates
(8)              else
(9)                  add c to Ck;
(10)         }
(11) return Ck;

procedure has_infrequent_subset( c, Lk−1 )
// use prior knowledge (the Apriori property) to test whether a candidate can be frequent
(1)  for each ( k-1 )-subset s of c
(2)      if s ∉ Lk−1 then
(3)          return TRUE;
(4)  return FALSE;
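For reference, the following is a minimal runnable Python translation of the pseudocode above (my own sketch, not code from the referenced blog). It reuses the apriori_gen function from the earlier candidate-generation sketch and, for brevity, filters candidates through the support scan rather than an explicit has_infrequent_subset call:

def create_c1(dataset):
    """Candidate 1-itemsets: every distinct item as a frozenset."""
    return [frozenset([item]) for item in sorted({i for t in dataset for i in t})]

def scan_d(dataset, candidates, min_support):
    """Count candidates in the transactions and keep those meeting min_support."""
    counts = {}
    for t in dataset:
        t = set(t)
        for c in candidates:
            if c <= t:
                counts[c] = counts.get(c, 0) + 1
    n = float(len(dataset))
    support_data = {c: counts.get(c, 0) / n for c in candidates}
    Lk = [c for c in candidates if support_data[c] >= min_support]
    return Lk, support_data

def apriori(dataset, min_support=0.5):
    """Level-wise search: L1 -> C2 -> L2 -> ... until Lk is empty."""
    C1 = create_c1(dataset)
    L1, support_data = scan_d(dataset, C1, min_support)
    L = [L1]
    k = 2
    while L[-1]:                                  # stop when the previous level is empty
        Ck = apriori_gen(L[-1], k)                # candidate generation, as sketched earlier
        Lk, sup_k = scan_d(dataset, Ck, min_support)
        support_data.update(sup_k)
        L.append(Lk)
        k += 1
    return L[:-1], support_data                   # drop the trailing empty level

dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3], [2, 5]]
L, support_data = apriori(dataset, min_support=0.5)
for level, Lk in enumerate(L, start=1):
    print("L%d:" % level, [sorted(s) for s in Lk])
# L1: [[1], [2], [3], [5]]
# L2: [[1, 3], [2, 3], [2, 5]]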

---- Part of the content refers to: Summary of the Principles of the Apriori Algorithm, by Liu Jianping Pinard, Blog Garden (cnblogs.com)

  2. AdaBoost Algorithm
  • Introduction

The AdaBoost algorithm is a boosting method that combines multiple weak classifiers into a strong classifier.

AdaBoost, the abbreviation of "Adaptive Boosting" in English, was proposed by Yoav Freund and Robert Schapire in 1995.

Its adaptiveness lies in the fact that the weights of the samples misclassified by the previous weak classifier are increased, and the samples with updated weights are used again to train the next weak classifier. In each round of training, a new weak classifier is trained on the whole (re-weighted) sample set, producing new sample weights and the voting weight of that weak classifier, and the process iterates until the predetermined error rate or the specified maximum number of iterations is reached.

The relationship between population, sample, and individual needs to be clarified:

Population: N. Samples: {ni}, i from 1 to M. Individual: an element of a sample; for example, if n1 = (1, 2), then sample n1 contains two individuals.

Algorithm principle

(1) Initialize the weight distribution of the training data (each sample): If there are N samples, each training sample point is given the same weight at the beginning: 1/N.

(2) Train weak classifiers. During training, if a sample has been classified accurately, its weight is decreased when constructing the next training set; conversely, if a sample has been misclassified, its weight is increased. At the same time, the voting weight of the weak classifier is obtained. The sample set with updated weights is then used to train the next classifier, and the whole training process continues iteratively.

(3) Combine the weak classifiers obtained in each round into a strong classifier. After training ends, a weak classifier with a small classification error rate receives a larger voting weight and plays a greater role in the final classification function, while a weak classifier with a large classification error rate receives a smaller voting weight and plays a smaller role. In other words, weak classifiers with low error rates account for a larger proportion of the final classifier, and vice versa.

2. Algorithm process:

Step 1:

Initialize the weight distribution of the training data (one weight per sample). At initialization, every training sample is given the same weight w = 1/N, where N is the total number of samples:

$$D_1 = (w_{11}, \dots, w_{1i}, \dots, w_{1N}), \qquad w_{1i} = \frac{1}{N}, \quad i = 1, 2, \dots, N$$

D1 denotes the weight distribution of the samples at the first iteration, and w11 denotes the weight of the first sample at the first iteration; N is the total number of samples.

Step 2: Perform M iterations, m = 1, 2, ..., M, where m is the iteration index.

a) Learn a weak classifier using the training sample set with weight distribution Dm:

$$G_m(x): \mathcal{X} \rightarrow \{-1, +1\}$$

This says that the weak classifier of the m-th iteration assigns a sample x to either -1 or +1. According to what criterion is this weak classifier obtained?

Criterion: the weak classifier minimizes the error function, that is, the sum of the weights of the misclassified samples:

$$e_m = \sum_{i=1}^{N} w_{mi}\, I\bigl(G_m(x_i) \neq y_i\bigr)$$

b) Calculate the voting weight of the weak classifier Gm(x). The voting weight am represents the importance of Gm(x) in the final classifier; here em is the value εm of the error function from the previous step:

$$\alpha_m = \frac{1}{2}\ln\frac{1 - e_m}{e_m}$$

This value increases as em decreases, i.e. a classifier with a small error rate is more important in the final classifier.

c) Update the weight distribution of the training samples for the next iteration. The weights of misclassified samples increase, while the weights of correctly classified samples decrease:

$$D_{m+1} = (w_{m+1,1}, \dots, w_{m+1,i}, \dots, w_{m+1,N}), \qquad w_{m+1,i} = \frac{w_{mi}}{Z_m}\exp\bigl(-\alpha_m y_i G_m(x_i)\bigr), \quad i = 1, 2, \dots, N$$

Dm+1 is the weight distribution used in the next iteration, and wm+1,i is the weight of the i-th sample in the next iteration. Here yi is the class label (1 or -1) of the i-th sample, and Gm(xi) is the classification (1 or -1) of sample xi by the weak classifier; if the classification is correct, yi·Gm(xi) equals 1, otherwise -1. Zm is a normalization factor that makes the weights of all samples sum to 1:

$$Z_m = \sum_{i=1}^{N} w_{mi}\exp\bigl(-\alpha_m y_i G_m(x_i)\bigr)$$

Step 3: After the iterations are finished, combine the weak classifiers:

$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$

Then apply the sign function, which returns 1 if its argument is greater than 0, -1 if it is less than 0, and 0 if it equals 0, to obtain the final strong classifier G(x):

$$G(x) = \mathrm{sign}\bigl(f(x)\bigr) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$$
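To make the three steps concrete, here is a small from-scratch sketch in Python/NumPy (my own illustration, not code from the referenced post). It uses depth-1 decision trees from scikit-learn as the weak classifiers and assumes labels in {-1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, M=20):
    """Train M weak classifiers (decision stumps) with AdaBoost; y must be in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                       # Step 1: uniform initial weights
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # Step 2a: fit a weak classifier on weighted data
        pred = stump.predict(X)
        e = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)   # weighted error e_m
        alpha = 0.5 * np.log((1 - e) / e)         # Step 2b: voting weight alpha_m
        w = w * np.exp(-alpha * y * pred)         # Step 2c: raise weights of misclassified samples
        w /= w.sum()                              # normalization factor Z_m
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 3: sign of the weighted vote of all weak classifiers."""
    f = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(f)

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # a toy two-class problem
stumps, alphas = adaboost_train(X, y, M=20)
print("training accuracy:", np.mean(adaboost_predict(stumps, alphas, X) == y))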

 

Using the forward stagewise additive model (put simply: instead of solving for all n terms jointly, solve for one term at a time and then find the next one on that basis, iterating n times), the AdaBoost algorithm can be viewed as minimizing the exponential error

$$E = \sum_{n=1}^{N} \exp\bigl(-t_n f_m(x_n)\bigr)$$

where tn is the correct label of sample n and fm is the combination of the first m classifiers,

$$f_m(x) = \frac{1}{2}\sum_{l=1}^{m} \alpha_l y_l(x)$$

(the factor 1/2 appears here because the derivation being followed uses am = (1/2)·ln((1-εm)/εm); this makes no essential difference, it is merely a factor of 1/2 more or less).

Then, assume that the first m-1 classifiers and their coefficients have already been determined. Simplifying E gives

$$E = \sum_{n=1}^{N} w_n^{(m)} \exp\Bigl(-\tfrac{1}{2}\, t_n\, \alpha_m\, y_m(x_n)\Bigr)$$

where $w_n^{(m)} = \exp\bigl(-t_n f_{m-1}(x_n)\bigr)$ is a constant with respect to the m-th step.

Splitting the sum into the set Tm of correctly classified samples and the set Mm of misclassified samples,

$$E = e^{-\alpha_m/2}\sum_{n\in T_m} w_n^{(m)} + e^{\alpha_m/2}\sum_{n\in M_m} w_n^{(m)} = \bigl(e^{\alpha_m/2} - e^{-\alpha_m/2}\bigr)\sum_{n=1}^{N} w_n^{(m)} I\bigl(y_m(x_n)\neq t_n\bigr) + e^{-\alpha_m/2}\sum_{n=1}^{N} w_n^{(m)}$$

The formula is not difficult; it becomes clear after reading it a few times.
At this point it can be seen that minimizing E with respect to ym actually amounts to minimizing

$$\sum_{n=1}^{N} w_n^{(m)} I\bigl(y_m(x_n)\neq t_n\bigr)$$

What is this quantity? Looking back, it is exactly the criterion used when searching for a weak classifier!
Then, after the weak classifier ym is obtained, minimizing E with respect to am yields am and, in turn, the sample-weight update, where εm is the weighted error rate:

$$\varepsilon_m = \frac{\sum_{n\in M_m} w_n^{(m)}}{\sum_{n=1}^{N} w_n^{(m)}}, \qquad \alpha_m = \ln\frac{1-\varepsilon_m}{\varepsilon_m}$$

3. Practical applications:

(1) Binary classification or multi-class classification
(2) Feature selection
(3) As a baseline for classification tasks

4. Code:

#encoding=utf-8

import pandas as pd
import time

from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed in newer scikit-learn versions
from sklearn.metrics import accuracy_score

from sklearn.ensemble import AdaBoostClassifier

if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train_binary.csv', header=0) 
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly select 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training...') 
    # n_estimators is the number of weak classifiers to combine;
    # algorithm can be 'SAMME' or 'SAMME.R' (the default); 'SAMME.R' uses the real boosting algorithm, 'SAMME' the discrete one
    clf = AdaBoostClassifier(n_estimators=100, algorithm='SAMME.R')
    clf.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)

------- Part of the content refers to: The Principle and Derivation of the AdaBoost Algorithm, by liuwu265, Blog Garden (cnblogs.com)


Origin blog.csdn.net/m0_72237363/article/details/130604707