Revisiting the sklearn Library: Study Notes (Part 1)

Understanding sklearn

Official sklearn website: https://scikit-learn.org/stable/

Since the project started in 2007, scikit-learn (sklearn for short) has become one of the most important machine learning libraries for Python. It supports classification, regression, clustering, dimensionality reduction and other machine learning algorithms, and it also covers three supporting areas: feature extraction, data processing and model evaluation.

sklearn began as an extension of SciPy and builds on NumPy, Matplotlib and related libraries. It is well documented, easy to get started with and has a rich API. It packages a large number of machine learning algorithms and ships with a number of built-in data sets, which makes it a very good library for beginners.
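As a small illustration of the built-in data sets, the sketch below loads three of the bundled toy data sets and records their shapes. The choice of these three loaders is only an example; the library offers more.

```python
from sklearn import datasets

# Three of the small data sets bundled with sklearn (no download needed).
loaders = (datasets.load_iris, datasets.load_wine, datasets.load_digits)

shapes = {}
for loader in loaders:
    bunch = loader()  # dict-like object with .data, .target and more
    shapes[loader.__name__] = bunch.data.shape
    print(loader.__name__, "-> samples x features:", bunch.data.shape)
```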

sklearn and machine learning

Learning styles

Supervised learning

  • Understanding: learn from labeled experience data, that is, pairs of inputs and known outputs, in order to predict the result for new inputs; in other words, learn from examples that come with the correct answer
  • Applications: classification, regression

Unsupervised Learning

  • Understanding: the input data has no labels and no correct answers; the algorithm simply looks for patterns in the data
  • Applications: clustering, dimensionality reduction

Semi-supervised learning

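  • Sits between supervised and unsupervised learning, using both labeled and unlabeled data; reinforcement learning is a further, separate learning style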
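The difference between the first two learning styles shows up directly in the call to fit: a supervised estimator receives the labels, an unsupervised one does not. The following is a minimal sketch on the iris data; the choice of LogisticRegression and KMeans, and their parameter values, are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are passed to fit(), so the model learns
# from examples that come with the correct answer.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
supervised_pred = clf.predict(X)

# Unsupervised: only X is passed; the cluster ids it returns are
# arbitrary group numbers, not the original class labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

print("predicted classes:", sorted(set(supervised_pred)))
print("cluster ids:", sorted(set(cluster_ids)))
```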

Data sets

Partitions

  • Training set: the data used to train the model (usually more than 50% of the data)
  • Test set: the data used to test the model (about 25%)
  • Validation set: the data used to tune the hyperparameters (about 25%)
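One possible way to obtain the three partitions is to call train_test_split twice. This is a sketch that assumes the roughly 50% / 25% / 25% proportions listed above; the variable names and the random_state value are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out 25% of all samples as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 2: split the remaining 75%; one third of it (about 25% of the
# total) becomes the validation set, the rest is the training set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1 / 3, random_state=0
)

print("train / validation / test sizes:", len(X_train), len(X_val), len(X_test))
```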

Cross-validation

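  • Understanding: the data set is divided into N parts; the model is trained on N-1 parts and tested on the remaining one, and this is repeated so that each part serves as the test part once. 5-fold cross-validation is the usual choice.
  • Advantages: makes full use of the data, since every sample is used for both training and testing, which helps when evaluating and improving the model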
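A 5-fold run can be sketched with cross_val_score. The KNN classifier and the iris data are chosen here only to match the quick-start example later in the article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# cv=5: the data is split into 5 parts; each part is used once for
# testing while the other 4 parts are used for training.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

print("score per fold:", scores)
print("mean score:", scores.mean())
```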

Model evaluation

  • Variance
  • Bias
  • Bias-variance trade-off

  • Counts

    • True positive (TP): a target correctly identified as a target
    • False positive (FP): a non-target wrongly identified as a target
    • True negative (TN): a non-target correctly identified as a non-target
    • False negative (FN): a target wrongly identified as a non-target
  • Metrics

    • Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
    • Precision (P) = TP / (TP + FP)
    • Recall (R) = TP / (TP + FN)
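To make the counts and metrics concrete, the sketch below tallies TP, FP, TN and FN by hand and applies the formulas, using the standard definition of precision, TP / (TP + FP). The two label lists are made-up illustration data, with 1 standing for "target" and 0 for "non-target".

```python
# Hypothetical labels for illustration: 1 = target, 0 = non-target.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # target found
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-target taken for target
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-target rejected
fn = sum(t == 1 and p == 0 for t, p in pairs)  # target missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```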

An intuitive example for precision and recall, found online:

Say a pond holds 1400 carp, 300 shrimp and 300 turtles (2000 animals in total). I want to catch carp. One cast of the net pulls up 700 carp, 200 shrimp and 100 turtles (1000 animals in total).

Precision: (number of carp caught / total number caught) = 700 / (700 + 200 + 100) = 70%

Recall: (number of carp caught / total number of carp in the pond) = 700 / 1400 = 50%

Binary classification illustrates these metrics best; they will get a whole article of their own later.
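The same arithmetic can be checked in a few lines. This sketch treats "caught in the net" as a positive prediction and "carp" as the target; the counts come straight from the example, and the derived TN and accuracy are an extra step not stated in the text.

```python
# Counts from the pond example.
pond = {"carp": 1400, "shrimp": 300, "turtle": 300}
net = {"carp": 700, "shrimp": 200, "turtle": 100}
target = "carp"

tp = net[target]                         # carp that were caught
fp = sum(net.values()) - tp              # shrimp and turtles that were caught
fn = pond[target] - tp                   # carp left in the pond
tn = sum(pond.values()) - tp - fp - fn   # non-carp left in the pond

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision:", precision, "recall:", recall, "accuracy:", accuracy)
```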

Structure of the official sklearn documentation

The algorithms in the sklearn library fall into four main categories: classification, regression, clustering and dimensionality reduction.
They include linear models, decision trees, SVM, KNN, random forests, AdaBoost, stochastic gradient descent, Bagging, ExtraTrees and more.

  • preprocessing: data preprocessing module
  • impute: missing value handling module
  • feature_selection: feature selection module
  • decomposition: dimensionality reduction module
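The four modules can be combined in one pipeline. The sketch below is only one possible arrangement: the missing value is injected artificially, and the chosen classes (SimpleImputer, StandardScaler, SelectKBest, PCA) and their parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = X.copy()
X[0, 0] = np.nan  # artificially remove one value to give impute something to do

pipe = make_pipeline(
    SimpleImputer(strategy="mean"),  # impute: fill the missing value
    StandardScaler(),                # preprocessing: zero mean, unit variance
    SelectKBest(f_classif, k=3),     # feature_selection: keep the 3 best features
    PCA(n_components=2),             # decomposition: reduce to 2 dimensions
)
X_out = pipe.fit_transform(X, y)

print("shape before:", X.shape, "shape after:", X_out.shape)
```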

Quick start with sklearn

The usual workflow for a traditional machine learning task is: get the data -> preprocess the data -> feature engineering (selection, vectorization, etc.) -> train the model -> evaluate the model -> predict

The first example uses iris, the classic classification data set from statistical learning. It has 150 samples, each with four feature variables and one categorical variable.

Feature variables

  • sepal length: length of the sepal
  • sepal width: width of the sepal
  • petal length: length of the petal
  • petal width: width of the petal
  • Category (the target variable): one of iris-setosa, iris-versicolor, iris-virginica
from sklearn import datasets  # use a built-in data set
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Get the data
# load_iris() returns a dict-like object: {data: [[...]], target_names: ..., ...}
iris = datasets.load_iris()
# 2. Feature engineering - extract the feature variables and the target variable
iris_X = iris.data
iris_y = iris.target

print("Data preparation + feature engineering --")
print('X_shape:', iris_X.shape, 'y_shape:', iris_y.shape)  # check the dimensions
print('y_target:', iris_y)

# 2. Feature engineering - split into training and test sets
# (no random_state is set, so the split differs from run to run)
X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.25)

# 3. Train the model
print("Start training ---")
knn = KNeighborsClassifier()  # create the estimator instance
knn.fit(X_train, y_train)

print("Model parameters:", knn.get_params())

# 4. Evaluate the model

print("True values:", y_test)
print("Predicted values:", knn.predict(X_test))

score = knn.score(X_test, y_test)
print("Prediction score:", round(score, 3))

Output of one run:

Data preparation + feature engineering --
X_shape: (150, 4) y_shape: (150,)
y_target: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Start training ---
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
Model parameters: {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
True values: [0 1 0 2 1 1 0 0 2 0 2 2 0 1 1 2 0 1 1 0 0 0 2 0 1 0 1 1 2 1 1 0 1 1 1 1 1
 2]
Predicted values: [0 1 0 2 1 2 0 0 2 0 2 2 0 1 1 2 0 1 1 0 0 0 2 0 1 0 1 1 2 1 2 0 2 1 1 1 1
 2]
Prediction score: 0.921
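The score returned by knn.score for a classifier is the accuracy on the test set. As a check, the sketch below copies the true and predicted labels printed in the run above and recomputes the score with sklearn.metrics; the confusion matrix additionally shows where the errors fall. The variable names are arbitrary.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from the printed output of the run above (38 test samples).
y_true = [0, 1, 0, 2, 1, 1, 0, 0, 2, 0, 2, 2, 0, 1, 1, 2, 0, 1, 1,
          0, 0, 0, 2, 0, 1, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 2]
y_pred = [0, 1, 0, 2, 1, 2, 0, 0, 2, 0, 2, 2, 0, 1, 1, 2, 0, 1, 1,
          0, 0, 0, 2, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 1, 1, 1, 1, 2]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("accuracy:", round(acc, 3))
print(cm)
```

In this particular run all three mistakes are samples of class 1 (versicolor) that were predicted as class 2 (virginica).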

That concludes the brief introduction; the following articles go into the details.


Origin www.cnblogs.com/chenjieyouge/p/11741441.html