Python's sklearn study notes



Preface: This article is a study note.

Introduction to sklearn

scikit-learn is a simple and efficient tool for data mining and data analysis.
It depends on NumPy, SciPy and matplotlib.

It mainly includes the following parts:

  • By functionality:
    • Classification
    • Regression
    • Clustering
    • Dimensionality reduction
    • Model selection
    • Preprocessing
  • By API module:
    • sklearn.base: Base classes and utility functions
    • sklearn.cluster: Clustering
    • sklearn.cluster.bicluster: Biclustering
    • sklearn.covariance: Covariance Estimators
    • sklearn.model_selection: Model Selection
    • sklearn.datasets: Datasets
    • sklearn.decomposition: Matrix Decomposition
    • sklearn.dummy: Dummy estimators
    • sklearn.ensemble: Ensemble Methods
    • sklearn.exceptions: Exceptions and warnings
    • sklearn.feature_extraction: Feature Extraction
    • sklearn.feature_selection: Feature Selection
    • sklearn.gaussian_process: Gaussian Processes
    • sklearn.isotonic: Isotonic regression
    • sklearn.kernel_approximation: Kernel Approximation
    • sklearn.kernel_ridge: Kernel Ridge Regression
    • sklearn.discriminant_analysis: Discriminant Analysis
    • sklearn.linear_model: Generalized Linear Models
    • sklearn.manifold: Manifold Learning
    • sklearn.metrics: Metrics
    • sklearn.mixture: Gaussian Mixture Models
    • sklearn.multiclass: Multiclass and multilabel classification
    • sklearn.multioutput: Multioutput regression and classification
    • sklearn.naive_bayes: Naive Bayes
    • sklearn.neighbors: Nearest Neighbors
    • sklearn.neural_network: Neural network models
    • sklearn.calibration: Probability Calibration
    • sklearn.cross_decomposition: Cross decomposition
    • sklearn.pipeline: Pipeline
    • sklearn.preprocessing: Preprocessing and Normalization
    • sklearn.random_projection: Random projection
    • sklearn.semi_supervised: Semi-Supervised Learning
    • sklearn.svm: Support Vector Machines
    • sklearn.tree: Decision Tree
    • sklearn.utils: Utilities

At my current rookie level, the parts I use most often are clustering, classification (svm, tree, linear regression, etc.), decomposition, preprocessing, and metrics, so I will start learning from these.

cluster

Reading the sklearn.cluster API, you will find that it contains two main kinds of things: classes for the various clustering methods, such as cluster.KMeans, and clustering functions that can be called directly, such as

sklearn.cluster.k_means(X, n_clusters, init='k-means++', 
    precompute_distances='auto', n_init=10, max_iter=300, 
    verbose=False, tol=0.0001, random_state=None, 
    copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)
  
  
Therefore, in actual use, there are two corresponding methods.

There are a total of 9 clustering methods in sklearn.cluster, namely:

  • AffinityPropagation: Affinity Propagation
  • AgglomerativeClustering: Hierarchical Clustering
  • Birch
  • DBSCAN
  • FeatureAgglomeration: Feature Agglomeration
  • KMeans: K-Means Clustering
  • MiniBatchKMeans
  • MeanShift
  • SpectralClustering: Spectral Clustering

Take our most familiar KMeans as an example:

Using the class constructor to build a KMeans clusterer

First, the constructor of KMeans in the API is:

sklearn.cluster.KMeans(n_clusters=8,
    init='k-means++',
    n_init=10,
    max_iter=300,
    tol=0.0001,
    precompute_distances='auto',
    verbose=0,
    random_state=None,
    copy_x=True,
    n_jobs=1,
    algorithm='auto'
    )
  
  

The meaning of the parameters:

  • n_clusters: The number of clusters, that is, how many categories you want to cluster the data into
  • init: How the initial cluster centers are obtained
  • n_init: The number of times the algorithm is run with different initial cluster centers (the best result is kept)
  • max_iter: The maximum number of iterations (the kmeans algorithm is iterative)
  • tol: Tolerance, i.e. the convergence condition on the kmeans criterion
  • precompute_distances: Whether to precompute distances
  • verbose: Verbosity mode (I don't fully understand what it does; generally just keep the default)
  • random_state: The random state used to generate the initial cluster centers
  • copy_x: A flag for whether the data may be modified; if True, the data is copied so the original is not modified
  • n_jobs: Parallelism setting
  • algorithm: The kmeans implementation to use, one of 'auto', 'full', 'elkan', where 'full' means the classical EM-style implementation

Although there are many parameters, default values are provided for all of them, so we generally don't need to pass them in explicitly; pass only the ones you actually need. Here is a simple example:

import numpy as np
from sklearn.cluster import KMeans
data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features

# suppose we want a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)  # build the clusterer
estimator.fit(data)  # run the clustering
label_pred = estimator.labels_  # get the cluster labels
centroids = estimator.cluster_centers_  # get the cluster centers
inertia = estimator.inertia_  # get the final value of the clustering criterion
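
If you do want to override some of the defaults, just pass them to the constructor. A minimal sketch, reusing data and KMeans from the example above (the parameter values here are purely illustrative):

estimator = KMeans(n_clusters=3, init='random', n_init=20, max_iter=500, random_state=42)
estimator.fit(data)
print(estimator.labels_[:10])  # first ten cluster labels
print(estimator.inertia_)      # final value of the clustering criterion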
  
  

Using the k_means function directly:

import numpy as np
from sklearn import cluster
data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
k = 3  # suppose we want to cluster into 3 clusters
[centroid, label, inertia] = cluster.k_means(data, k)
  
  

Of course, the other methods are used in a similar way; for the specifics, refer to the API. (Learn to read the API; get used to reading the API.)
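
As an illustration that the other methods follow the same pattern, here is a minimal DBSCAN sketch on the same kind of random data (the eps and min_samples values are purely illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

data = np.random.rand(100, 3)        # random data: 100 samples, 3 features
db = DBSCAN(eps=0.3, min_samples=5)  # neighborhood radius and minimum neighbors (illustrative)
labels = db.fit_predict(data)        # cluster labels; -1 marks points treated as noise
print(labels)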

classification

Classification is the most important part of data mining or machine learning. However, because the classic classification methods each have their own distinct mechanisms, sklearn does not provide a single unified classifier class; instead, each family of methods has its own module.
Commonly used classification methods are:

  • KNN nearest neighbors: sklearn.neighbors
  • Logistic regression: sklearn.linear_model.LogisticRegression
  • SVM support vector machine: sklearn.svm
  • Naive Bayes: sklearn.naive_bayes
  • Decision tree: sklearn.tree
  • Neural network: sklearn.neural_network

Let's take KNN (specifically, nearest-neighbors classification) as an example to see how these methods are used:

import numpy as np
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()
n_neighbors = 15
X = iris.data[:, :2]  # we only take the first two features; we could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

weights = 'distance'  # can also be set to 'uniform'
clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)

# if you have test data, just predict with the following calls
# for example, xx, yy here is constructed test data on a grid
h = 0.02  # step size of the grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z is the predicted labels
  
  

Another example is svm:

from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]

# build a support vector classification model
clf = svm.SVC()

# fit the training data to obtain the trained model parameters
clf.fit(X, y)

# predict for the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])

# print the predicted values
print(res)


# get support vectors
print("support vectors:", clf.support_vectors_)

# get indices of support vectors
print("indices of support vectors:", clf.support_)

# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)
  
  

Of course, SVM also has a corresponding regression model, SVR:

from sklearn import svm
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)
res = clf.predict([[1, 1]])
print(res)
  
  

And another example is logistic regression:

from sklearn import linear_model
X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = linear_model.LogisticRegression(C=1e5)

# we create an instance of the LogisticRegression classifier and fit the data
logreg.fit(X, y)

res = logreg.predict([[2, 2]])
print(res)
  
  

preprocessing

The part of this module I usually use is the scaling operations. There are several kinds of scalers, including:

  • StandardScaler
  • MaxAbsScaler
  • MinMaxScaler
  • RobustScaler
  • Normalizer
  • and other preprocessing operations

Correspondingly, there are functions that can be used directly: scale(), maxabs_scale(), minmax_scale(), robust_scale(), normalize().

For example:

import numpy as np
from sklearn import preprocessing
X = np.random.rand(3, 4)


# using the scaler class
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)


# using the scale function directly
X_scaled_convinent = preprocessing.minmax_scale(X)
  
  

decomposition

Let's talk about NMF and PCA, since these two are the more commonly used ones.

import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
model = NMF(n_components=2, init='random', random_state=0)
model.fit(X)

print(model.components_)
print(model.reconstruction_err_)
print(model.n_iter_)
  
  

Note the difference between the class's fit() and the fit_transform() method: the former only fits the model and does not return the decomposed data, while the latter fits the model and, in addition, returns the data after the NMF transformation.
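
For example, a minimal sketch reusing the same X as above:

from sklearn.decomposition import NMF
import numpy as np

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
model = NMF(n_components=2, init='random', random_state=0)
W = model.fit_transform(X)  # the transformed data W, with X approximately equal to W @ H
H = model.components_       # the learned components H
print(W.shape, H.shape)     # (6, 2) (2, 2)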

PCA is similar, but without those initialization parameters, as follows:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
model = PCA(n_components=2)
model.fit(X)

print(model.components_)
print(model.n_components_)
print(model.explained_variance_)
print(model.explained_variance_ratio_)
print(model.mean_)
print(model.noise_variance_)
  
  

metrics

The above clustering and classification tasks all require a final evaluation.

Classification

For classification, for example, the following evaluation metrics are commonly used:

  • accuracy_score
  • auc
  • f1_score
  • fbeta_score
  • hamming_loss
  • hinge_loss
  • jaccard_similarity_score
  • log_loss
  • recall_score

The following example finds the accuracy of the classification results:

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
ac = accuracy_score(y_true, y_pred)
print(ac)  # fraction of correctly classified samples
ac2 = accuracy_score(y_true, y_pred, normalize=False)
print(ac2)  # number of correctly classified samples
  
  

The other metrics are used in a similar way.
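
For instance, a minimal f1_score sketch on a small binary labelling (the labels are made up for illustration):

from sklearn.metrics import f1_score
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(f1_score(y_true, y_pred))  # harmonic mean of precision and recall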

Regression

Regression-related metrics include, but are not limited to, the following (a short sketch follows the list):

  • mean_absolute_error
  • mean_squared_error
  • median_absolute_error
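
A minimal sketch of these (the values are made up for illustration):

from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mean_absolute_error(y_true, y_pred))    # mean of the absolute errors
print(mean_squared_error(y_true, y_pred))     # mean of the squared errors
print(median_absolute_error(y_true, y_pred))  # median of the absolute errors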

clustering

There are the following commonly used evaluation indicators (internal and external):

  • adjusted_mutual_info_score
  • adjusted_rand_score
  • completeness_score
  • homogeneity_score
  • normalized_mutual_info_score
  • silhouette_score
  • v_measure_score

The following example computes the NMI (normalized mutual information) of a clustering result; the other metrics are used similarly.

from sklearn.metrics import normalized_mutual_info_score

y_pred = [0, 0, 1, 1, 2, 2]
y_true = [1, 1, 2, 2, 3, 3]

nmi = normalized_mutual_info_score(y_true, y_pred)
print(nmi)
  
  

Of course, there are many other metrics; refer to the API.

datasets

sklearn itself also provides several common datasets, such as iris, diabetes, digits, covtype, kddcup99, boston, and breast_cancer. They can be loaded with functions like sklearn.datasets.load_iris, which returns a dataset object; the data and labels are obtained as follows.

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data 
y = iris.target 
  
  

In addition to these common datasets, the datasets module also provides many data manipulation functions, such as load_files, load_svmlight_file, and many data generators.
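
For instance, a minimal sketch with one of the data generators, make_blobs:

from sklearn.datasets import make_blobs

# generate 100 samples with 2 features, grouped around 3 centers
X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
print(X.shape, y.shape)  # (100, 2) (100,)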

pandas.io also provides many methods for loading external data (such as csv, excel, json, sql, etc.).
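
For example, a minimal pandas sketch (the file name and column names here are hypothetical):

import pandas as pd

df = pd.read_csv('mydata.csv')           # hypothetical CSV file
X = df[['feature1', 'feature2']].values  # hypothetical feature columns
y = df['label'].values                   # hypothetical label column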

You can also fetch datasets from the mldata repository.

Python's capabilities here are quite powerful.

Of course, you can also load a dataset by writing your own file-reading function.
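
A minimal sketch of such a self-written loader, assuming a whitespace-delimited text file whose last column is the label (the file name is hypothetical):

import numpy as np

def read_file(path):
    # load a whitespace-delimited text file; the last column is the label
    raw = np.loadtxt(path)
    return raw[:, :-1], raw[:, -1]

X, y = read_file('mydata.txt')  # hypothetical file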

concluding remarks

The above mainly covers the functions I use most frequently. Once you are familiar with Python, just read the scikit-learn API and everything else is straightforward.

In addition, if necessary, you can read the source code of these commonly used functions to deepen your understanding of the principles of common data mining algorithms.

Reposted from:
http://blog.csdn.net/lilianforever/article/details/53780613


