Summary data mining algorithms


This post summarizes the data mining algorithms encountered in my studies: their ideas, derivations, implementations, and so on.

ID3

  1. Introduction:

    The ID3 algorithm is a decision tree classifier: it classifies data by a set of rules organized as a decision tree. Each classification starts at the root node, and each leaf node represents a possible classification result.

  2. Split criterion: information gain
    $$
    Gain(S, A) = Entropy(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} Entropy(S_v)
    $$
    where V(A) is the set of values of attribute A, S is the sample set, and $S_v$ is the subset of $S$ whose value on attribute A equals v.

  3. Procedure: at each node, select the attribute with the highest information gain among those not yet used as the splitting criterion, and repeat until the decision tree classifies the training examples perfectly.

  4. Tags: supervised learning algorithm, cross entropy

  5. Implementation: Java

  6. Example:

    Watermelons are divided into good melons (1) and bad melons (0). Attributes:

    - Percussion sound: crisp (a1), dull (a2)
    - Colour: dark green (b1), light (b2)

    1. a1 b1 1

    2. a1 b1 1

    3. a2 b1 1

    4. a2 b2 0

    5. a1 b2 0
      $$
      Entropy(Start) = -\frac{3}{5}\log_2(\frac{3}{5}) - \frac{2}{5}\log_2(\frac{2}{5}) = 0.971 \\
      Entropy(sound) = \frac{3}{5}(-\frac{1}{3}\log_2(\frac{1}{3}) - \frac{2}{3}\log_2(\frac{2}{3})) + \frac{2}{5}(-\frac{1}{2}\log_2(\frac{1}{2}) - \frac{1}{2}\log_2(\frac{1}{2})) = 0.951 \\
      Entropy(colour) = \frac{3}{5}(-\frac{3}{3}\log_2(\frac{3}{3})) + \frac{2}{5}(-\frac{2}{2}\log_2(\frac{2}{2})) = 0
      $$
      In this example, colour is clearly the better splitting criterion, because its information gain is larger.
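The entropy and information-gain computation for this watermelon example can be sketched in a few lines of Python (a minimal sketch: the `entropy` and `info_gain` helpers and the attribute encoding are mine, not from any library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(samples, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(s[attr] for s in samples):
        sub = [l for s, l in zip(samples, labels) if s[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

# The five watermelon samples: (percussion sound, colour) -> good (1) / bad (0)
samples = [{"sound": "a1", "colour": "b1"},
           {"sound": "a1", "colour": "b1"},
           {"sound": "a2", "colour": "b1"},
           {"sound": "a2", "colour": "b2"},
           {"sound": "a1", "colour": "b2"}]
labels = [1, 1, 1, 0, 0]

print(round(entropy(labels), 3))                      # Entropy(Start)
print(round(info_gain(samples, labels, "sound"), 3))
print(round(info_gain(samples, labels, "colour"), 3))
```

ID3 would pick `colour` here, since its information gain is the larger of the two.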

C4.5

  1. Introduction:

    The core of C4.5 is the same as ID3, but the approach differs: C4.5 uses the information gain ratio as the splitting criterion, overcoming ID3's tendency to favour attributes with many values (splitting on a many-valued attribute often yields relatively pure subsets, so its information gain is relatively large).

  2. Split criterion:
    $$
    GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}
    $$
    The denominator is the split information, calculated as:
    $$
    SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}
    $$
    where c is the number of distinct values attribute A takes in the sample set.

  3. Implementation: Python

  4. Tags: supervised learning algorithm, information gain ratio, tree pruning during construction
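The split-information term can be sketched as follows (a minimal sketch: the `split_information` helper is illustrative, and the sample values reuse the watermelon percussion-sound attribute from the ID3 section):

```python
from collections import Counter
from math import log2

def split_information(values):
    """SplitInformation(S, A): entropy of the attribute's own value distribution."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(values)

# Values of "percussion sound" over the five watermelon samples: a 3/5 vs 2/5 split
values = ["a1", "a1", "a2", "a2", "a1"]
print(round(split_information(values), 3))
```

A many-valued attribute spreads the samples over many small groups, which drives `split_information` up and the gain ratio down; that is exactly the correction over plain information gain.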

CART algorithm

  1. Introduction:

    CART is a decision tree classification algorithm. The tree it builds is a binary tree, so the chosen attribute is split at its best binary partition of values; every feature can be split in two.

  2. Split criterion: Gini index
    $$
    Gini(A) = 1 - \sum_{k=1}^{C} p_k^2
    $$
    $p_k$ is the probability of each class; the smaller the Gini index, the higher the purity of the classification. Its effect is similar to entropy.

    For example, with features (home owner, marital status, annual income) -> whether the loan defaulted, marital status takes values (single, married, divorced). To split on marital status, pick one attribute value as one class and the rest as the other, compute the three resulting Gini indices, take the split rule with the highest Gini gain, and proceed to the next step.

    For an attribute with a continuous range of values, sort the values from small to large, take the midpoint of each adjacent pair as a binary split point, compute the Gini index of each split, and keep the optimal split rule.

  3. Stopping condition: the Gini index of the sample set falls below a predetermined threshold (the samples are essentially of the same class).

  4. Implementation: Python

  5. Tags: Gini index, threshold stopping, supervised learning algorithm, binary node splitting
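The binary Gini split described above can be sketched like this (illustrative data loosely following the loan-default example; the helper names and the sample values are mine):

```python
from collections import Counter

def gini(labels):
    """Gini = 1 - sum_k p_k^2; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """CART-style split on a categorical attribute: try each value v as
    {v} vs rest, keep the split with the lowest weighted Gini index."""
    n = len(labels)
    best = None
    for v in set(values):
        left = [l for x, l in zip(values, labels) if x == v]
        right = [l for x, l in zip(values, labels) if x != v]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (v, score)
    return best

# marital status -> defaulted loan (1) / not (0); illustrative toy data
values = ["single", "married", "divorced", "married", "single", "married"]
labels = [1, 0, 1, 0, 1, 0]
print(best_binary_split(values, labels))  # ('married', 0.0): a pure split
```

Splitting off `married` leaves both sides pure (weighted Gini 0), so CART would stop there under the threshold rule.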

AdaBoost (a bagging-style boosting algorithm)

  1. Bagging: a bagging algorithm has multiple classifiers vote on the same instance; the final classification is the one with the most votes.

  2. AdaBoost: give each classifier a weight, so the combined decision is more reasonable. An example:

    For example, suppose you are ill and see n doctors at n hospitals, and each doctor writes you a prescription. In the final tally, the prescription that appears most often is the most likely to be the best solution; this is easy to understand, and it is exactly the idea of bagging.
    AdaBoost's core idea is still based on bagging, with a small improvement: above, every doctor's vote counts the same, i.e. they have equal standing. If we add a weight instead, giving doctors in big cities a higher weight and doctors in small towns a lower one, then combining by a weighted sum is more reasonable. That is AdaBoost. AdaBoost is an iterative algorithm that stops only when the final classification error rate falls below a threshold. It trains different classifiers, called weak classifiers, on the same training set, then combines them by weighted sum into a composite classifier, which is a strong classifier.
  3. Training process:

    a. For the training set weighted by distribution $D_t$, train a weak classifier $h_t$.

    b. Compute the error rate of $h_t$ on the weighted data; $Pr$ is taken over the point weights $D_t$, and at the start all points are assigned equal initial weights:
    $$
    \epsilon_t = Pr_{i \sim D_t}[h_t(x_i) \neq y_i]
    $$

    $$
    choose\;\; \alpha_t = \frac{1}{2}\ln(\frac{1-\epsilon_t}{\epsilon_t})
    $$

    c. Increase the weights of misclassified points and lower the weights of correctly classified ones, highlighting the hard data points. $Z_t$ is a normalization factor, so that $D_{t+1}$ remains a probability distribution:

    $$
    update\;\; D_{t+1}(i) = \frac{D_t(i)}{Z_t}\times\begin{cases} e^{-\alpha_t} & if\; h_t(x_i) = y_i \\ e^{\alpha_t} & otherwise \end{cases}
    $$

    d. When the final classification error rate is below a certain threshold, training stops.

    e. Output the final prediction function:
    $$
    H(x) = sign(\sum_{t=1}^{T} \alpha_t h_t(x))
    $$
    sign is the sign function: if the value is positive, classify as +1, otherwise as -1.

  4. Why the weights of misclassified points are increased: if the next classifier misclassifies these points again, its overall error rate rises, which makes its $\alpha_t$ small, so that classifier ends up with a low weight in the combined classifier. In other words, the algorithm gives good classifiers a higher weight in the ensemble and poor classifiers a lower one.

  5. Implementation: Java

  6. Tags: supervised learning algorithm, multi-classifier ensemble, repeated iterations on the same training set
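The training loop above can be sketched with threshold stumps as weak classifiers (a minimal sketch on a toy 1-D dataset; all function names and the perfect-stump shortcut are mine, not part of the canonical algorithm):

```python
import math

def train_adaboost(X, y, T=5):
    """AdaBoost with threshold stumps h(x) = s * sign(x - thr) on 1-D data;
    labels y are in {-1, +1}."""
    n = len(X)
    D = [1.0 / n] * n                       # initial weights: all equal
    ensemble = []                           # list of (alpha, threshold, sign)
    for _ in range(T):
        # pick the stump with the lowest weighted error epsilon_t
        best = None
        for thr in X:
            for s in (+1, -1):
                preds = [s if x > thr else -s for x in X]
                eps = sum(d for d, p, yi in zip(D, preds, y) if p != yi)
                if best is None or eps < best[0]:
                    best = (eps, thr, s, preds)
        eps, thr, s, preds = best
        if eps == 0 or eps >= 0.5:
            if eps == 0:                    # perfect stump: give it a large vote
                ensemble.append((5.0, thr, s))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # raise weights of misclassified points, lower the rest, renormalize
        D = [d * math.exp(-alpha if p == yi else alpha)
             for d, p, yi in zip(D, preds, y)]
        Z = sum(D)                          # Z_t keeps D_{t+1} a distribution
        D = [d / Z for d in D]
        ensemble.append((alpha, thr, s))
    return ensemble

def predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    total = sum(a * (s if x > thr else -s) for a, thr, s in ensemble)
    return 1 if total > 0 else -1

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1, 1, 1, -1, -1, -1]
model = train_adaboost(X, y)
print([predict(model, x) for x in X])  # [1, 1, 1, -1, -1, -1]
```

On harder data the loop runs several rounds, and the `D` update is exactly the $D_{t+1}$ formula above: misclassified points are scaled by $e^{\alpha_t}$, the rest by $e^{-\alpha_t}$.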

Apriori algorithm

  1. The Apriori algorithm mines frequent itemsets: it finds combinations of items that often appear together, and from these combinations derives association rules. For example:

    instant noodles -> ham [support = 2%][confidence = 70%] is the notation for an association rule; support and confidence are the two metrics that measure whether a rule is useful.
    A few concepts:
    - Support: 2% of all transactions contain both instant noodles and ham.
    - Confidence: of all customers who bought instant noodles, 70% also bought ham.
    - Itemset: a set of items; association rules are derived from itemsets.
    - Support count: the number of transactions containing an itemset.
    - Frequent itemset: an itemset whose support count exceeds the threshold.
    - Confidence: confidence(noodles -> ham) = P(ham | noodles); once the frequent itemsets are known, the confidence is known.
  2. Algorithm theory

    Algorithm has two steps:

    1. Find all frequent itemsets

      a. Scan all transactions to obtain the candidate 1-itemsets C1.

      b. Compare support counts against the threshold and discard itemsets below it, giving the frequent 1-itemsets L1.

      c. Second iteration, join step: compute $L1 \Join L1$ to derive the candidate sets.

      d. Pruning step: cut candidate itemsets that contain an infrequent subset, deriving C2.

      e. Remove all candidates whose support is below the threshold, obtaining L2.

      f. Third iteration, join step: derive the candidate set.

      g. Pruning step: remove 3-itemsets containing a 2-subset not in L2, deriving C3.

      h. Compare counts against the threshold to get the final L3.

      i. Iterate until $C_n$ is empty; the algorithm then ends, and we have obtained all frequent itemsets.

    2. Generate strong association rules from the frequent itemsets

      Take the nonempty subsets of each $L_n$ and combine them to obtain the association rules.

  3. Algorithm Evaluation

    Apriori must generate a large number of candidate sets and repeatedly scan the transaction data to count supports, which makes it relatively inefficient.
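The two phases can be sketched with Python sets (a minimal sketch: the toy transactions, the support-count threshold of 2, and the `apriori` helper are all illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [set(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    frequent = {}
    Lk = {c: n for c, n in count(items).items() if n >= min_support}  # L1
    k = 1
    while Lk:
        frequent.update(Lk)
        k += 1
        # join step: unions of two frequent (k-1)-itemsets of size k
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # pruning step: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c: n for c, n in count(candidates).items() if n >= min_support}
    return frequent

transactions = [{"noodles", "ham"}, {"noodles", "ham", "cola"},
                {"noodles", "cola"}, {"ham", "cola"}]
freq = apriori(transactions, min_support=2)
pair = frozenset({"noodles", "ham"})
# confidence(noodles -> ham) = support(noodles, ham) / support(noodles)
print(freq[pair] / freq[frozenset({"noodles"})])
```

Note the repeated `count` calls: each level rescans every transaction, which is exactly the inefficiency pointed out in the evaluation above.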

EM algorithm

  1. Algorithms Introduction

    The EM algorithm is a maximum likelihood method for estimating the parameters of a probability model when the data set is incomplete, has missing data, or contains latent variables. Since we cannot directly maximize $l(\theta)$, we repeatedly construct a lower bound on $l(\theta)$ (E-step) and then maximize that lower bound (M-step).

  2. The principle of maximum likelihood estimation (likelihood: from known results, infer the parameters)

    Usually we predict results from known conditions; maximum likelihood estimation works in reverse: we already know the result, and we seek the parameter values under which that result is most likely to appear, taking those as the estimate. Equivalently: if we know one parameter setting makes the observed sample most probable, we would certainly not pick a setting under which the sample has small probability, so we simply take that parameter as the estimate of the true value.

    General procedure for maximizing the likelihood function:

    • Write down the likelihood function
      $$
      L(\theta) = L(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta) \\
      \hat{\theta} = \arg\max\; L(\theta)
      $$

    • Take the logarithm of the likelihood function, writing it as a sum

    • Take the partial derivative of $\theta$ in each dimension, i.e. the gradient; n unknown parameters give n equations, whose solution is the extreme point of the likelihood function, yielding the n parameters

  3. EM algorithm principle

    $Q_i(z^{(i)})$ is the distribution of the latent variable z for sample i, with $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$. From (1) to (2), multiply the numerator and denominator inside the sum by the same quantity.
    $$
    l(\theta) = \ln L(\theta) = \ln\prod_{i=1}^{n} p(x_i; \theta) = \sum_{i=1}^{n}\ln\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \;\;\;(1) \\
    = \sum_i \log\sum_{z^{(i)}} Q_i(z^{(i)})\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \;\;\;(2) \\
    \geq \sum_i\sum_{z^{(i)}} Q_i(z^{(i)}) \log\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \;\;\;(3)\;\; from\; Jensen
    $$
    The step from (2) to (3) is justified as follows:

    Let Y be a function of the random variable X, Y = g(X), with g continuous. If X is a discrete variable with distribution $P(X = x_k) = p_k$, and $\sum_{k=1}^{\infty} g(x_k)p_k$ converges absolutely, then:
    $$
    E(Y) = E[g(X)] = \sum_{k=1}^{\infty} g(x_k)p_k
    $$
    For the problem above, X is $z^{(i)}$, $Q_i(z^{(i)})$ plays the role of $p_k$, and g maps $z^{(i)}$ to $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$; Y is that ratio.

    Combined with Jensen's inequality:
    $$
    E[f(X)] \leq f(E[X])
    $$
    where f is a concave function (here f = log is concave): the chord connecting any two points of a concave function lies below the function itself.

    Jensen's inequality holds with equality when the random variable is a constant, which gives:
    $$
    \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c \\
    because\;\; \sum_{z^{(i)}} Q_i(z^{(i)}) = 1 \\
    i.e.\;\; \sum_z p(x^{(i)}, z; \theta) = c \\
    it\; follows\; that\;\; Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} \\
    = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} \\
    = p(z^{(i)} | x^{(i)}; \theta)
    $$

  4. EM algorithm flow
    $$
    \sum_i \log p(x^{(i)}; \theta) = \sum_i \log\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)
    $$

    • E (Expectation) step
      $$
      Q_i(z^{(i)}) := p(z^{(i)} | x^{(i)}; \theta)
      $$
      i.e. using the initial values or the parameters from the previous iteration, compute the posterior probability of the latent variable — in effect its expectation — as the current estimate of the latent variable.

    • M (Maximization) step: maximize the likelihood lower bound to obtain the new parameter values
      $$
      \theta := \arg\max_\theta \sum_i\sum_{z^{(i)}} Q_i(z^{(i)}) \log\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
      $$

  5. Reference Knowledge:

    If the second derivative of a function is positive, the function is convex (it opens upward).
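The E and M steps can be illustrated on a 1-D mixture of two unit-variance Gaussians with equal mixing weights, estimating only the two means (a minimal sketch: the data, initial values, and function names are mine):

```python
import math

def em_two_gaussians(xs, mu1, mu2, iters=30):
    """EM for a 1-D mixture of two Gaussians with unit variance and equal
    mixing weights; only the two means are estimated."""
    for _ in range(iters):
        # E step: Q_i(z) = p(z | x_i; mu), the responsibility of component 1
        r1 = []
        for x in xs:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            r1.append(p1 / (p1 + p2))
        # M step: maximizing the lower bound gives responsibility-weighted means
        mu1 = sum(r * x for r, x in zip(r1, xs)) / sum(r1)
        mu2 = sum((1 - r) * x for r, x in zip(r1, xs)) / sum(1 - r for r in r1)
    return mu1, mu2

xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 4.0, 4.5, 5.0, 5.5, 6.0]
mu1, mu2 = em_two_gaussians(xs, mu1=-1.0, mu2=6.0)
print(round(mu1, 2), round(mu2, 2))  # close to the two cluster centres 0 and 5
```

Each iteration first fixes the parameters to compute the posterior over the latent component assignment (E step), then fixes those posteriors to re-maximize the bound in closed form (M step), exactly as in the flow above.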


Origin www.cnblogs.com/dongxiong/p/11705847.html