A Study Guide to the sklearn Library (Part 1)

Getting to know sklearn

sklearn official website: https://scikit-learn.org/stable/

Since its release in 2007, scikit-learn (usually shortened to sklearn) has become one of the important Python machine learning libraries. It supports classification, regression, clustering, and dimensionality reduction algorithms, and also includes three supporting modules: feature extraction, data preprocessing, and model evaluation.

sklearn is an extension of SciPy, built on NumPy, Matplotlib, and other libraries. It is well documented, easy to get started with, and has a rich API. It also packages a large number of machine learning algorithms and ships with many built-in datasets, which makes it a very good library for beginners.

sklearn and machine learning

Types of learning

Supervised learning

Understanding: the model learns to predict results from labeled input-output pairs, that is, it learns from examples that come with the correct answer
Applications: classification, regression

Unsupervised Learning

Understanding: the input data has no labels and no correct answers; the model simply looks for patterns in the data
Applications: clustering, dimensionality reduction
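
In scikit-learn terms, the difference between the two styles shows up in what is passed to `fit`. The following is a minimal sketch, assuming the built-in iris data; the choice of `KNeighborsClassifier` and `KMeans`, and the value `n_clusters=3`, are assumptions made only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are passed to fit(), so the model learns from known answers.
clf = KNeighborsClassifier()
clf.fit(X, y)
supervised_pred = clf.predict(X[:5])

# Unsupervised: only X is passed; KMeans groups the samples without ever seeing y.
# n_clusters=3 is an assumption made for this example.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

print("predicted classes:", supervised_pred)
print("cluster ids of first 5 samples:", cluster_ids[:5])
```

Note that the cluster ids are arbitrary group numbers; they are not guaranteed to line up with the class labels.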

Semi-supervised learning

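Understanding: sits between supervised and unsupervised learning, since only part of the data is labeled. Reinforcement learning is yet another style of learning.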

Datasets

Types of dataset

Training set: the data used to train the model (more than 50% of the data)
Test set: the data used to test the model (25%)
Validation set: the data used to tune hyperparameters (25%)
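
One way to obtain such a three-way split is to call `train_test_split` twice. This is a sketch under assumptions rather than a fixed recipe: the 50/25/25 proportions follow the rough figures above, the iris data is used only as a stand-in, and `random_state=0` is added purely so the example is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: keep 50% for training, set the other 50% aside.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=0
)
# Second split: divide the remaining half into validation and test sets
# (about 25% of the full data each).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is then used while choosing hyperparameters, and the test set is touched only once, for the final evaluation.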

Cross-validation

Understanding: the dataset is divided into N parts; the model is trained on N-1 parts and tested on the remaining one, with each part taking a turn as the test part. 5-fold cross-validation is the usual choice.
Advantages: makes full use of the data, which helps improve the model
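
The procedure described above can be sketched with `cross_val_score`. The example assumes the iris data and a default `KNeighborsClassifier`, both picked only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier()

# cv=5: split the data into 5 folds; each fold serves once as the test part
# while the other 4 folds are used for training.
scores = cross_val_score(knn, X, y, cv=5)

print("score per fold:", scores)
print("mean score:", scores.mean())
```

The mean of the five scores is usually reported as the cross-validated score of the model.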

Model evaluation

Variance
Bias
Bias-variance trade-off
Confusion matrix counts
- True positive (TP): a target correctly identified as a target
- False positive (FP): a non-target wrongly identified as a target
- True negative (TN): a non-target correctly identified as a non-target
- False negative (FN): a target wrongly identified as a non-target
Metrics
- Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
- Precision (P) = TP / (TP + FP)
- Recall (R) = TP / (TP + FN)
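
As a small illustration, the three metrics can be computed directly from the four counts, with precision taken as TP / (TP + FP). The function name `classification_metrics` and the counts below are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of everything predicted as target, how much really is
    recall = tp / (tp + fn)     # of all real targets, how many were found
    return accuracy, precision, recall


# Hypothetical counts, chosen only to illustrate the formulas.
acc, p, r = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(acc, p, r)
```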

An intuitive explanation: a small example of precision and recall, found online

Suppose a pond holds 1400 carp, 300 shrimp, and 300 turtles (2000 animals in total). I want to catch carp. One cast of the net pulls up 700 carp, 200 shrimp, and 100 turtles (1000 animals in total).

Precision: (carp caught / total animals caught) = 700 / (700 + 200 + 100) = 70%

Recall: (carp caught / total carp in the pond) = 700 / 1400 = 50%
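
The pond example can be checked with `sklearn.metrics`. In this sketch the encoding is an assumption made for illustration: label 1 means carp, label 0 means shrimp or turtle, and "predicted 1" means the animal ended up in the net.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Caught in the net: 700 carp + 300 others (200 shrimp, 100 turtles).
# Left in the pond:  700 carp + 300 others (100 shrimp, 200 turtles).
y_true = np.array([1] * 700 + [0] * 300 + [1] * 700 + [0] * 300)
# The first 1000 animals were caught (predicted 1), the last 1000 were not.
y_pred = np.array([1] * 1000 + [0] * 1000)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print(precision, recall, accuracy)
```

Precision comes out as 0.7 and recall as 0.5, matching the hand calculation. Accuracy is also 0.5 here, because the 700 carp and 300 other animals on the correct side of the net make up 1000 of the 2000 animals.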

These rates come up most often in binary classification; I will cover them in a separate article later.

Structure of the official sklearn documentation

The algorithms in the sklearn library fall into four main categories: classification, regression, clustering, and dimensionality reduction.
They include linear models, decision trees, SVM, KNN, random forests, AdaBoost, stochastic gradient descent, Bagging, ExtraTrees, and more.

preprocessing: data preprocessing module
impute: missing-value handling module
feature_selection: feature selection module
decomposition: dimensionality reduction module
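
To give a feel for these four modules, here is a minimal sketch that uses one class from each on the iris data. The particular classes (`SimpleImputer`, `StandardScaler`, `SelectKBest`, `PCA`), the artificially removed value, and the choice of keeping 2 features or components are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Remove one value on purpose so the imputer has something to fill.
X_missing = X.copy()
X_missing[0, 0] = np.nan

# impute: replace missing values with the column mean.
X_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)

# preprocessing: scale each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_filled)

# feature_selection: keep the 2 features with the highest ANOVA F-score.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X_scaled, y)

# decomposition: project the 4 scaled features onto 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

print(X_filled.shape, X_scaled.shape, X_selected.shape, X_reduced.shape)
```

All four classes share the same `fit` / `transform` interface, which is what makes them easy to chain.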

sklearn quick start

The usual workflow for a traditional machine learning task is: get the data -> data preprocessing -> feature engineering (selection, vectorization, etc.) -> train the model -> evaluate the model -> predict

We start with iris, a classic classification dataset in statistical learning. It has 150 samples in total, each with four feature variables and one categorical variable.

Feature variables

sepal length: length of the sepal
sepal width: width of the sepal
petal length: length of the petal
petal width: width of the petal
Category: one of iris-setosa, iris-versicolor, or iris-virginica

from sklearn import datasets  # use the built-in datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the data
# The dataset is a dict-like (JSON-like) object: {data: [[...]], target_names: ..., ...}
iris = datasets.load_iris()
# 2. Feature engineering: get the feature variables and the target variable
iris_X = iris.data
iris_y = iris.target

print("Data preparation + feature engineering--")
print('X_shape:', iris_X.shape, 'y_shape:', iris_y.shape)  # check the dimensions
print('y_target:', iris_y)

# 2. Feature engineering: split into training and test sets
# (no random_state is set, so the split and the score vary from run to run)
X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.25)

# 3. Train the model
print("Start training---")
knn = KNeighborsClassifier()  # create the estimator instance
knn.fit(X_train, y_train)

print("Model parameters:", knn.get_params())

# 4. Evaluate the model

print("True values:", y_test)
print("Predicted values:", knn.predict(X_test))

score = knn.score(X_test, y_test)
print("Prediction score:", round(score, 3))

Data preparation + feature engineering--
X_shape: (150, 4) y_shape: (150,)
y_target: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Start training---

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

Model parameters: {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
True values: [0 1 0 2 1 1 0 0 2 0 2 2 0 1 1 2 0 1 1 0 0 0 2 0 1 0 1 1 2 1 1 0 1 1 1 1 1
 2]
Predicted values: [0 1 0 2 1 2 0 0 2 0 2 2 0 1 1 2 0 1 1 0 0 0 2 0 1 0 1 1 2 1 2 0 2 1 1 1 1
 2]
Prediction score: 0.921
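
The last step of the workflow, prediction, means applying the trained model to a sample it has not been scored on. Here is a minimal, self-contained sketch of that step. The measurements in `new_sample` are hypothetical, and `random_state=0` is an assumption added only to make the example repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)
knn = KNeighborsClassifier().fit(X_train, y_train)

# A hypothetical new flower: sepal length, sepal width, petal length, petal width (cm).
new_sample = [[5.1, 3.5, 1.4, 0.2]]

# predict() returns the class index; target_names maps it back to a species name.
predicted_class = knn.predict(new_sample)[0]
predicted_name = iris.target_names[predicted_class]
print("predicted class:", predicted_class, "->", predicted_name)
```

Short sepals and very small petals are typical of setosa, so this sample is assigned to that class.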

That was a brief introduction; the details come next.