Machine Learning Series - Ensemble Learning (1): The Bagging Algorithm in Detail


Back to the Ensemble Learning catalog

Previous: Machine Learning Series - Decision Trees (6): Cross-Validation Evaluation Metrics in Detail

Next: Machine Learning Series - Ensemble Learning (2): The Random Forest Algorithm in Detail

 

Series Contents

Machine Learning Series - Ensemble Learning (1): The Bagging Algorithm in Detail

Machine Learning Series - Ensemble Learning (2): The Random Forest Algorithm in Detail

Machine Learning Series - Ensemble Learning (3): The Boosting Algorithm in Detail

Machine Learning Series - Ensemble Learning (4): The AdaBoost Algorithm in Detail

Machine Learning Series - Ensemble Learning (5): The Gradient Boosting Algorithm in Detail

Machine Learning Series - Ensemble Learning (6): The GBDT Algorithm in Detail

Machine Learning Series - Ensemble Learning (7): The XGBoost Algorithm in Detail

Machine Learning Series - Ensemble Learning (8): The ball49_pred Project (Lottery Prediction)

Machine Learning Series - Ensemble Learning (9): The hotel_pred Project (Hotel Prediction)

 

This section covers the Bagging algorithm in detail; the next section covers the Random Forest algorithm.

 

I. Overview

1. Definition

    Ensemble Learning (EL), simply put, combines multiple weak classifiers into one strong classifier and then uses it to make predictions on the data, thereby improving the overall generalization ability of the classifier.

 

2. Methods of ensemble learning

    (1). Bagging: bootstrap aggregating, sampling with replacement

    (2). Random Forest: Bagging + Decision Tree

    (3). Boosting: boosting methods

    (4). AdaBoost

    (5). Gradient Boost

    (6). GBDT

    (7). XGBoost

 

II. Specific algorithms

1. Bagging algorithm

    (1). The Bagging algorithm (Bootstrap aggregating, the bootstrap aggregation algorithm), also known as the bagging algorithm. Bagging can be combined with other classification and regression algorithms to improve their accuracy and stability by reducing the variance of the results, which helps avoid overfitting.
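
    To see this variance-reduction effect in practice, here is a small comparison sketch (the iris data set is used purely for illustration; the scores it prints are not results from this article):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)

# A single (fully grown) decision tree versus a bagged ensemble of trees.
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(n_estimators=50, random_state=0)

print("tree   :", cross_val_score(single_tree, x, y, cv=5).mean())
print("bagging:", cross_val_score(bagged_trees, x, y, cv=5).mean())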

 

    (2). There are many variants of Bagging; the main difference between them is how the random subsets of the training set are drawn:

     ①. If the random subsets are drawn from the samples without replacement, the method is called Pasting.

     ②. If the samples are drawn with replacement, the method is called Bagging.

     ③. If the random subsets are drawn from the set of features, the method is called Random Subspaces.

     ④. If the estimators are built on random subsets of both the samples and the features, the method is called Random Patches.

 

    (3). In sklearn, all Bagging methods are provided by a single BaggingClassifier (or BaggingRegressor) meta-estimator, with the strategy for drawing random subsets specified by user parameters. max_samples and max_features control the size of the subsets (in terms of samples and features), while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement. When a subset of the samples is used, setting oob_score=True allows the out-of-bag samples to be used to estimate the generalization accuracy.
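
    For reference, a minimal sketch of how these parameters map onto the variants listed above (the base estimator is left at its default decision tree, and the subset sizes are arbitrary illustrative values):

from sklearn.ensemble import BaggingClassifier

# Pasting: random subsets of the samples, drawn without replacement.
pasting = BaggingClassifier(n_estimators=20, max_samples=0.8, bootstrap=False)

# Bagging: samples drawn with replacement; oob_score=True evaluates
# generalization accuracy on the out-of-bag samples.
bagging = BaggingClassifier(n_estimators=20, max_samples=1.0,
                            bootstrap=True, oob_score=True)

# Random Subspaces: random subsets of the features only.
subspaces = BaggingClassifier(n_estimators=20, max_features=0.5,
                              bootstrap=False, bootstrap_features=False)

# Random Patches: random subsets of both samples and features.
patches = BaggingClassifier(n_estimators=20, max_samples=0.8,
                            max_features=0.5, bootstrap=True,
                            bootstrap_features=True)

    After fitting the Bagging variant, the out-of-bag accuracy estimate is available in its oob_score_ attribute.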

 

    (4). In Bagging, a given sample may be drawn several times or may never be drawn at all. The probability that a particular sample never appears in a bootstrap sample of size $N$ is $(1 - \tfrac{1}{N})^{N}$. Taking the limit:

                 $\lim_{n \rightarrow \infty } (1 - \tfrac{1}{n})^{n} = \tfrac{1}{e} \approx 0.368$

           So roughly 63.2% of the original samples appear in each Bagging data set, and the remaining out-of-bag samples can be used to evaluate the generalization accuracy of the model.
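
    This figure is easy to verify numerically; a minimal sketch with NumPy (the sample size and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
n = 10000                                   # size of the original sample set
idx = rng.integers(0, n, size=n)            # one bootstrap sample (with replacement)
covered = np.unique(idx).size / n           # fraction of distinct samples drawn
print("in-bag fraction:", covered)          # ~0.632
print("out-of-bag fraction:", 1 - covered)  # ~0.368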

 

    (5). The final prediction

     ①. For classification tasks, a simple vote is used: each classifier casts one vote for a class (probabilities can also be averaged).

     ②. For regression tasks, simple averaging is used: the final result is the mean of all the classifiers' outputs.

           Although the random sampling introduced by Bagging slightly increases the bias, averaging over multiple models in the ensemble generally yields a better overall model.
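
    A concrete illustration of the two aggregation rules (a minimal NumPy sketch; the predicted values are made up for illustration):

import numpy as np

# Classification: each row is one base classifier's predicted class labels.
clf_preds = np.array([[0, 1, 2],
                      [0, 2, 2],
                      [1, 1, 2]])
# Majority vote per column (i.e. per test sample).
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, clf_preds)
print(votes)                   # [0 1 2]

# Regression: each row is one base regressor's predictions.
reg_preds = np.array([[2.1, 3.0],
                      [1.9, 3.4],
                      [2.0, 3.2]])
print(reg_preds.mean(axis=0))  # arithmetic mean, e.g. [2.0 3.2]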

 

    (6). Bagging algorithm flow

     Input: sample set $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{m}, y_{m})\}$, a weak learning algorithm, and the number of weak-learner iterations $T$.

     Output: the final strong classifier $f(x)$.

     ①. For $t = 1, 2, \dots, T$:

       a. Perform the $t$-th round of random sampling on the training set, drawing with replacement a total of $m$ times, to obtain a sample set $D_{t}$ containing $m$ samples.

       b. Use the sample set $D_{t}$ to train the $t$-th weak learner $G_{t}(x)$.

     ②. If the prediction task is classification, the final category is the one that receives the most votes from the $T$ weak learners.

           If the prediction task is regression, the final output is the arithmetic mean of the regression results of the $T$ weak learners.
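
    A minimal from-scratch sketch of this flow, assuming scikit-learn's DecisionTreeClassifier as the weak learner and integer class labels (the class name BaggingFromScratch and its parameters are purely illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier


class BaggingFromScratch(object):
    def __init__(self, n_estimators=20):
        self.n_estimators = n_estimators
        self.estimators = []

    def fit(self, x, y):
        x, y = np.asarray(x), np.asarray(y)
        m = x.shape[0]
        rng = np.random.default_rng(0)
        for _ in range(self.n_estimators):
            # Step ①a: draw m samples with replacement to form D_t.
            idx = rng.integers(0, m, size=m)
            # Step ①b: train the t-th weak learner G_t(x) on D_t.
            tree = DecisionTreeClassifier().fit(x[idx], y[idx])
            self.estimators.append(tree)
        return self

    def predict(self, x):
        # Step ②: majority vote over the T weak learners.
        preds = np.array([est.predict(x) for est in self.estimators])
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)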

 

    (7). The main steps of the bootstrapping method

     ①. Repeatedly draw $n$ samples from the sample set $D$.

     ②. For each sub-sample set, perform statistical learning to obtain a hypothesis $H_{i}$.

     ③. Combine the individual hypotheses to form the final hypothesis $H_{final}$.

     ④. Use the final hypothesis for the specific classification task.
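
    Step ① corresponds directly to scikit-learn's resample utility (a small sketch; the toy array below is arbitrary):

import numpy as np
from sklearn.utils import resample

D = np.arange(10)                       # a toy sample set D
D_t = resample(D, replace=True, n_samples=len(D), random_state=0)
print(D_t)                              # one bootstrap draw, duplicates allowed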

 

    (8). Bagging summary

     ①. Bagging improves the generalization error by reducing the variance of the base classifiers.

     ②. Its performance depends on the stability of the base classifiers.

           If the base classifiers are unstable, Bagging helps to reduce the error caused by random fluctuations in the training data.

           If the base classifiers are stable, the ensemble error is mainly caused by the bias of the base classifiers.

     ③. Since every sample has the same probability of being selected, Bagging does not focus on any particular instance of the training data set.

 

    (9). Code demonstration

#!/usr/bin/env python
# _*_ coding:utf-8 _*_
# ============================================
# @Time     : 2020/01/10 20:05
# @Author   : WanDaoYi
# @FileName : bagging_test.py
# ============================================

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt
import matplotlib
# Use a font that can also render Chinese characters in plots
font = {"family": "SimHei"}
matplotlib.rc("font", **font)


# Load the iris data set
iris_info = load_iris()

# Read the iris feature data (x) into a DataFrame
iris_data = pd.DataFrame(iris_info.data)
print(iris_data.head())

# Set the column names to the feature names
iris_data.columns = iris_info.feature_names
print(iris_data.columns)

# Get the iris label data
label_info = iris_info.target

# Append the labels to the feature data so x and y form one complete data set
iris_data["Species"] = label_info

print(iris_data.head())

# The first 4 columns are x: sepal and petal length/width
x = iris_data.iloc[:, : 4]
# The last column is y: the flower species
y = iris_data.iloc[:, -1]

# Split into training and validation sets
# train_size: proportion of the data used for training
# random_state: random seed, so the split is reproducible
x_train, x_val, y_train, y_val = train_test_split(x, y, train_size=0.7, random_state=42)

# Bagging ensemble: n_estimators=20 builds 20 base learners,
# each trained on a bootstrap sample drawn with replacement
bagging_demo = BaggingClassifier(n_estimators=20)
# Train the model
bagging_demo.fit(x_train, y_train)
# Predict on the validation set
y_pred = bagging_demo.predict(x_val)

acc_score = accuracy_score(y_val, y_pred)
print("acc_score: {}".format(acc_score))

plt.plot(x_val, y_val, 'r+', label='Iris-true')
plt.plot(x_val, y_pred, 'g.', label='Iris-pred')
plt.legend()
plt.show()


 

 


Source: blog.csdn.net/qq_38299170/article/details/103833113