Using sklearn for machine learning and data mining

sklearn (scikit-learn) is one of the most important Python libraries for machine learning and data mining; it implements a wide variety of algorithms. This post gives an overview of the sklearn framework.

sklearn Overview

I. The sklearn algorithm library

The algorithm library covers four main categories: classification, regression, clustering, and dimensionality reduction. Among them:

  • Common regressors: linear models, decision trees, SVM, KNN; ensemble regressors: Random Forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
  • Common classifiers: linear models, decision trees, SVM, KNN, Naive Bayes; ensemble classifiers: Random Forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
  • Common clustering: K-means, hierarchical clustering, DBSCAN
  • Common dimensionality reduction: LinearDiscriminantAnalysis, PCA

II. The main steps of a machine learning application with sklearn

(1) Loading a dataset

Datasets generally come from three sources: the datasets bundled with sklearn, which can be loaded directly; synthetic data generated by sklearn; and your own data brought in from outside.
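
A minimal sketch of all three sources (the CSV filename is a hypothetical placeholder):

from sklearn import datasets
import numpy as np

# 1. a dataset bundled with sklearn
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 2. synthetic data generated by sklearn
X_gen, y_gen = datasets.make_classification(n_samples=100, n_features=4, random_state=0)

# 3. your own data, e.g. loaded from a CSV file
my_data = np.loadtxt('mydata.csv', delimiter=',')  # hypothetical file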

(2) Data preprocessing / feature engineering / data visualization

Data visualization, feature engineering, and data preprocessing are the most important operations on the data itself, and they play a major role in how well a model fits.

For feature engineering, see my other blog post.

For data preprocessing, see my other blog post.

For data visualization, see my other blog post.

Data preprocessing includes:

  • Dimensionality reduction (sklearn.decomposition)
  • Handling missing values (see the sketch after this list)
  • Data normalization (from sklearn import preprocessing)
  • Data standardization (preprocessing.StandardScaler().fit(traindata))
  • Splitting the dataset (from sklearn.model_selection import train_test_split)
  • Feature selection (sklearn.feature_selection)
  • Feature transformation (one-hot encoding), etc.
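
As a sketch of two of these steps, missing-value imputation and dataset splitting (SimpleImputer lives in sklearn.impute from version 0.20 on; the toy array is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1., 2.], [np.nan, 3.], [7., 6.], [4., np.nan]])
y = np.array([0, 1, 0, 1])

X_filled = SimpleImputer(strategy='mean').fit_transform(X)  # fill NaNs with column means
X_train, X_test, y_train, y_test = train_test_split(
    X_filled, y, test_size=0.25, random_state=0)  # hold out 25% as a test set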

sklearn provides many such utilities (see the API docs for the full list). They generally share three methods:

fit(): learns the internal parameters from the data.
transform(): transforms the data.
fit_transform(): combines fit and transform in one call.

Most of them live in the sklearn.preprocessing package.
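
For example, the standard pattern with StandardScaler (a sketch; traindata and testdata stand for any numeric arrays):

import numpy as np
from sklearn import preprocessing

traindata = np.array([[1., 2.], [3., 4.], [5., 6.]])
testdata = np.array([[2., 3.]])

scaler = preprocessing.StandardScaler().fit(traindata)  # fit: estimate mean and std
train_scaled = scaler.transform(traindata)              # transform: apply them
test_scaled = scaler.transform(testdata)                # reuse the training-set parameters
# or fit and transform the training data in one call:
train_scaled = preprocessing.StandardScaler().fit_transform(traindata)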

Scaling:

  • MinMaxScaler: min-max scaling
  • Normalizer: scales each sample so that its norm is 1
  • StandardScaler: scales each feature to mean 0 and variance 1

Encoding:

  • LabelEncoder: converts strings to integer labels
  • OneHotEncoder: represents a categorical feature as a binary (one-hot) vector
  • Binarizer: binarizes numerical features
  • MultiLabelBinarizer: binarizes multi-label targets
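
A short sketch of the encoders (the toy data is made up for illustration):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, Binarizer

le = LabelEncoder()
labels = le.fit_transform(['red', 'green', 'red', 'blue'])   # strings -> integers

ohe = OneHotEncoder()
onehot = ohe.fit_transform(np.array([[0], [1], [2]]))        # integers -> one-hot matrix

binarizer = Binarizer(threshold=0.5)
binary = binarizer.fit_transform(np.array([[0.2, 0.8]]))     # values -> 0/1 by threshold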

(3) Model selection

Choose a model that suits the data. The candidates are the ones listed in Part I: the common and ensemble regressors and classifiers, the clustering algorithms (K-means, hierarchical clustering, DBSCAN), and the dimensionality-reduction methods (LinearDiscriminantAnalysis, PCA).

A model typically has two main methods:
fit(): trains the algorithm and sets the internal parameters. Takes two arguments: the training set and its labels.
predict(): predicts the labels of the test set passed as its argument.
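
For example, with a decision tree on the iris data (a sketch):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)      # training set and its labels
y_pred = clf.predict(X_test)   # predicted labels for the test set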

(4) Model evaluation

For model evaluation, see my other blog post.
For cross-validation methods, see my other blog post.

Cross-validation and model scoring

(from sklearn.model_selection import cross_val_score)

Package: sklearn.cross_validation (renamed sklearn.model_selection in newer versions)

KFold: K-fold cross-validation iterator. Takes the number of elements, the number of folds, and whether to shuffle.
LeaveOneOut: leave-one-out cross-validation iterator
LeavePOut: leave-p-out cross-validation iterator
LeaveOneLabelOut: leave-one-label-out cross-validation iterator
LeavePLabelOut: leave-p-label-out cross-validation iterator

Common functions

train_test_split: splits the data into a training set and a test set (not K-fold)
cross_val_score: cross-validation score; cv can be set to an instance of one of the classes above
cross_val_predict: cross-validated predictions
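
A sketch combining these on the iris data:

from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
clf = DecisionTreeClassifier()

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=kf)    # one score per fold
preds = cross_val_predict(clf, X, y, cv=kf)   # cross-validated predictions
print(scores.mean())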

Diagnosing fitting problems (over- and underfitting)

  • Learning curve: from sklearn.model_selection import learning_curve
  • Validation curve: from sklearn.model_selection import validation_curve
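
A sketch of both diagnostics (max_depth and its range are chosen just for illustration):

from sklearn import datasets
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# learning curve: score as a function of training-set size
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(), X, y, cv=5)

# validation curve: score as a function of one hyperparameter
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name='max_depth', param_range=range(1, 10), cv=5)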

Package: sklearn.metrics
sklearn.metrics contains scoring functions, performance metrics, pairwise metrics, and distance computations.

Classification metrics

Most take y_true and y_pred as parameters.

accuracy_score: classification accuracy
confusion_matrix: confusion matrix
classification_report: classification report
precision_recall_fscore_support: computes precision, recall, F-score, and support
jaccard_similarity_score: Jaccard similarity
hamming_loss: Hamming loss
zero_one_loss: 0-1 loss
hinge_loss: hinge loss
log_loss: log loss
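
A quick sketch with a few of these (the label vectors are made up):

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))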

Regression metrics

explained_variance_score: explained variance regression score
mean_absolute_error: mean absolute error
mean_squared_error: mean squared error
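
Used the same way (toy values for illustration):

from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(explained_variance_score(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))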

(5) Saving the model

  • Save with pickle
import pickle

# save the model
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)

# load the model
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)
model.predict(X_test)
  • sklearn's own joblib utility
from sklearn.externals import joblib  # in sklearn >= 0.23, use: import joblib

# save the model
joblib.dump(model, 'model.pickle')

# load the model
model = joblib.load('model.pickle')

III. sklearn tips

(1) Grid search

Grid search finds the best hyperparameters.

Package: sklearn.grid_search (sklearn.model_selection in newer versions)

GridSearchCV: searches the given parameter grid for the best parameters
ParameterGrid: a grid of parameters
ParameterSampler: a generator that samples parameters from a given distribution
RandomizedSearchCV: randomized search over hyperparameters

Use best_estimator_.get_params() to retrieve the best parameters.

Sample code

from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()  # the iris dataset bundled with scikit-learn
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

alg = DecisionTreeClassifier()
parameters = {'max_depth': range(1, 100), 'min_samples_split': range(2, 30)}  # parameter space

def fit_model(alg, parameters):
    # accuracy is used as the score here; roc_auc_score needs probability
    # estimates on multiclass data, so it would fail on the 3-class iris set
    grid = GridSearchCV(alg, parameters, scoring='accuracy', cv=5)
    grid = grid.fit(X_train, y_train)
    print(grid.best_params_)
    print(grid.best_score_)
    return grid

fit_model(alg, parameters)

(2) Pipelines (Pipeline)

Package: sklearn.pipeline

Pipeline functions and benefits:

Pipeline wraps and manages all the steps as one streamlined unit, so a parameter set can easily be reused on new datasets.
It keeps a record of the operations in each step (making experiments easy to reproduce).
Each step is encapsulated.
It keeps code complexity under control.
It can be combined with grid search for parameter selection.
Calling fit and predict directly trains and predicts with all the models in the pipeline.

Basic use

The input to a pipeline is a series of data-mining steps, of which the last must be an estimator and all preceding ones transformers. The dataset is processed by each transformer in turn, with each output passed as the next step's input. Finally, the estimator at the end of the pipeline classifies the data (a model, i.e. anything with a fit method).
Each step is described as a tuple ('name', step). Now let's create the pipeline:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.neighbors import KNeighborsClassifier

    scaling_pipeline = Pipeline([
        ('scale', MinMaxScaler()),
        ('predict', KNeighborsClassifier())
    ])
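
With a train/test split as in the earlier examples, training and prediction then each take a single call:

    scaling_pipeline.fit(X_train, y_train)     # runs MinMaxScaler, then fits KNN
    y_pred = scaling_pipeline.predict(X_test)  # scales X_test, then predicts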

(3) Ensemble learning in sklearn

sklearn implements several well-known ensemble methods, including bagging, boosting, and stacking; ensemble models often achieve better results.
See my other blog post on ensemble learning.
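
As a minimal sketch, a bagging-style ensemble (a random forest) evaluated with cross-validation:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())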


Source: blog.csdn.net/qq_39751437/article/details/91786404