Preface: This article is a study note.
Introduction to sklearn
scikit-learn is a simple and effective tool for data mining and analysis.
It depends on NumPy, SciPy, and matplotlib.
It mainly includes the following parts:
- In terms of function:
- Classification
- Regression
- Clustering
- Dimensionality reduction
- Model selection
- Preprocessing
- From the API module:
- sklearn.base: Base classes and utility functions
- sklearn.cluster: Clustering
- sklearn.cluster.bicluster: Biclustering
- sklearn.covariance: Covariance Estimators
- sklearn.model_selection: Model Selection
- sklearn.datasets: Datasets
- sklearn.decomposition: Matrix Decomposition
- sklearn.dummy: Dummy estimators
- sklearn.ensemble: Ensemble Methods
- sklearn.exceptions: Exceptions and warnings
- sklearn.feature_extraction: Feature Extraction
- sklearn.feature_selection: Feature Selection
- sklearn.gaussian_process: Gaussian Processes
- sklearn.isotonic: Isotonic regression
- sklearn.kernel_approximation: Kernel Approximation
- sklearn.kernel_ridge: Kernel Ridge Regression
- sklearn.discriminant_analysis: Discriminant Analysis
- sklearn.linear_model: Generalized Linear Models
- sklearn.manifold: Manifold Learning
- sklearn.metrics: Metrics
- sklearn.mixture: Gaussian Mixture Models
- sklearn.multiclass: Multiclass and multilabel classification
- sklearn.multioutput: Multioutput regression and classification
- sklearn.naive_bayes: Naive Bayes
- sklearn.neighbors: Nearest Neighbors
- sklearn.neural_network: Neural network models
- sklearn.calibration: Probability Calibration
- sklearn.cross_decomposition: Cross decomposition
- sklearn.pipeline: Pipeline
- sklearn.preprocessing: Preprocessing and Normalization
- sklearn.random_projection: Random projection
- sklearn.semi_supervised: Semi-Supervised Learning
- sklearn.svm: Support Vector Machines
- sklearn.tree: Decision Trees
- sklearn.utils: Utilities
At my current beginner level, the parts I use most are clustering, classification (SVM, trees, linear models, etc.), decomposition, preprocessing, and metrics, so I will start learning from these.
cluster
Reading the sklearn.cluster API, you can see that it has two main kinds of content: classes for the various clustering methods, such as cluster.KMeans, and functions that run a clustering method directly, such as
sklearn.cluster.k_means(X, n_clusters, init='k-means++',
precompute_distances='auto', n_init=10, max_iter=300,
verbose=False, tol=0.0001, random_state=None,
copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)
So in practice there are two corresponding ways to run a clustering method.
sklearn.cluster contains 9 clustering classes in total:
- AffinityPropagation: Affinity Propagation
- AgglomerativeClustering: Hierarchical Clustering
- Birch
- DBSCAN
- FeatureAgglomeration: Feature Agglomeration
- KMeans: K-Means Clustering
- MiniBatchKMeans
- MeanShift
- SpectralClustering: Spectral Clustering
Take KMeans, the most familiar one, as an example:
Use the class constructor to build the KMeans clusterer
First, the constructor of KMeans in the API is:
sklearn.cluster.KMeans(n_clusters=8,
init='k-means++',
n_init=10,
max_iter=300,
tol=0.0001,
precompute_distances='auto',
verbose=0,
random_state=None,
copy_x=True,
n_jobs=1,
algorithm='auto'
)
The meaning of the parameters:
- n_clusters: the number of clusters, i.e. how many groups to cluster the data into
- init: how the initial cluster centers are chosen
- n_init: the number of times the algorithm is run with different initial cluster centers; the best run is kept
- max_iter: the maximum number of iterations (the k-means algorithm is iterative)
- tol: the tolerance, i.e. the convergence threshold of the k-means criterion
- precompute_distances: whether to compute distances in advance
- verbose: verbosity mode, i.e. how much progress output is printed (generally left at the default)
- random_state: the random state used to generate the initial cluster centers
- copy_x: whether to copy the data; if True, the original data is not modified
- n_jobs: the parallelism setting
- algorithm: the k-means implementation to use, one of 'auto', 'full', or 'elkan', where 'full' means the classical EM-style algorithm
Although there are many parameters, all of them have default values, so we generally only need to pass the ones our task actually requires. Here is a simple example:
import numpy as np
from sklearn.cluster import KMeans
data = np.random.rand(100, 3)  # random data: 100 samples, 3 features
# suppose we want a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)  # construct the clusterer
estimator.fit(data)  # run the clustering
label_pred = estimator.labels_  # cluster label of each sample
centroids = estimator.cluster_centers_  # cluster centers
inertia = estimator.inertia_  # final value of the clustering criterion
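To make the role of random_state and n_init from the parameter list more concrete, here is a minimal sketch. The seed values and the variable names (km_a, km_b, new_points) are my own choices for illustration, not something from the API documentation. It checks that two clusterers built with the same random_state end up with the same centers, and uses predict() to assign new samples to the learned clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)   # fixed seed so the data is reproducible
data = rng.rand(100, 3)          # 100 samples, 3 features

# Two clusterers with the same random_state draw the same initial centers.
km_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(data)
km_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(data)

same_result = np.allclose(km_a.cluster_centers_, km_b.cluster_centers_)
print("identical centers:", same_result)

# predict() assigns each new sample to its nearest learned center.
new_points = rng.rand(5, 3)
new_labels = km_a.predict(new_points)
print("labels for new points:", new_labels)
```

Fixing random_state is mainly useful when you want a result you can reproduce later; leaving it as None gives a different initialization on each run.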
Use the k_means function directly:
import numpy as np
from sklearn import cluster
data = np.random.rand(100, 3)  # random data: 100 samples, 3 features
k = 3  # suppose we want 3 clusters
centroid, label, inertia = cluster.k_means(data, k)
Of course, the other methods are used in a similar way; for the specifics, refer to the API. (Learn to read the API, and get used to reading it.)
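As a rough illustration of how similar the other clustering classes are, the sketch below runs AgglomerativeClustering and DBSCAN on the same data. The data (two well-separated blobs) and the values eps=0.5 and min_samples=5 are assumptions picked for this toy case and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.RandomState(0)
# Two well-separated blobs of 50 points each, around (0, 0) and (5, 5).
data = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                  rng.normal(5.0, 0.1, size=(50, 2))])

# Same pattern as KMeans: construct, fit, then read labels_.
agg = AgglomerativeClustering(n_clusters=2).fit(data)
db = DBSCAN(eps=0.5, min_samples=5).fit(data)

agg_labels = agg.labels_
db_labels = db.labels_   # DBSCAN marks noise points with the label -1

print("agglomerative clusters:", len(set(agg_labels)))
print("DBSCAN clusters:", len(set(db_labels) - {-1}))
```

Note that DBSCAN does not take a number of clusters; it infers it from the density parameters, which is one reason the constructor arguments still have to be looked up per method.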
classification
Classification is one of the most important parts of data mining and machine learning. However, because the classic classification methods work through quite different mechanisms, sklearn does not seem to provide a single dedicated classifier module; the methods are spread across several modules.
Commonly used classification methods are:
- KNN (nearest neighbors): sklearn.neighbors
- Logistic regression: sklearn.linear_model.LogisticRegression
- SVM (support vector machine): sklearn.svm
- Naive Bayes: sklearn.naive_bayes
- Decision tree: sklearn.tree
- Neural network: sklearn.neural_network
Take KNN (mainly Nearest Neighbors Classification) as an example to see how these methods are used:
import numpy as np
from sklearn import neighbors, datasets
# import some data to play with
iris = datasets.load_iris()
n_neighbors = 15
h = .02  # step size of the mesh; 0.02 is an assumed value
X = iris.data[:, :2]  # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
weights = 'distance'  # can also be set to 'uniform'
clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)
# if you have test data, just predict with the following functions
# for example, xx, yy is constructed test data
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z is the label_pred
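The example above predicts on a constructed grid. When real test data is available, a common pattern is to hold out part of the dataset and measure accuracy on it. The following is a minimal sketch of that idea; the 70/30 split and random_state=0 are arbitrary choices of mine.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples as test data, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = neighbors.KNeighborsClassifier(n_neighbors=15, weights='distance')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # predicted labels for the test set
test_accuracy = clf.score(X_test, y_test)  # mean accuracy on the test set
print("test accuracy:", test_accuracy)
```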
Another example is svm:
from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]
# build the support vector classification model
clf = svm.SVC()
# fit the training data to learn the model parameters
clf.fit(X, y)
# predict for the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])
# print the predictions
print(res)
# get support vectors
print("support vectors:", clf.support_vectors_)
# get indices of support vectors
print("indices of support vectors:", clf.support_)
# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)
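SVC uses an RBF kernel by default. As a small variation on the same toy data, the sketch below switches to a linear kernel (my choice for illustration) and looks at decision_function, whose sign shows which side of the separating boundary a point falls on.

```python
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

# Linear kernel instead of the default RBF kernel.
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

test_points = [[2., 2.], [-1., -1.]]
scores = clf.decision_function(test_points)  # signed distance-like scores
pred = clf.predict(test_points)              # class labels

print("decision scores:", scores)
print("predictions:", pred)
```

A positive score corresponds to class 1 and a negative score to class 0 in this two-class setting.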
Of course, SVM also has a corresponding regression model, SVR:
from sklearn import svm
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)
res = clf.predict([[1, 1]])
print(res)
logistic regression
from sklearn import linear_model
X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = linear_model.LogisticRegression(C=1e5)
# create an instance of the logistic regression classifier and fit the data
logreg.fit(X, y)
res = logreg.predict([[2, 2]])
print(res)
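Besides hard labels, logistic regression can also report class probabilities through predict_proba. Here is a minimal sketch on the same toy data; the two query points are my own picks.

```python
import numpy as np
from sklearn import linear_model

X = [[0, 0], [1, 1]]
y = [0, 1]

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)

# One row per query point, one column per class (ordered as logreg.classes_).
proba = logreg.predict_proba([[2, 2], [0.5, 0.5]])
row_sums = np.sum(proba, axis=1)   # each row of probabilities sums to 1

print("classes:", logreg.classes_)
print("probabilities:", proba)
```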
preprocessing
The part of this module I use most is scaling. There are many kinds of scalers, including:
- StandardScaler
- MaxAbsScaler
- MinMaxScaler
- RobustScaler
- Normalizer
- and other preprocessing operations
Correspondingly, there are functions that can be used directly: scale(), maxabs_scale(), minmax_scale(), robust_scale(), normalize().
For example:
import numpy as np
from sklearn import preprocessing
X = np.random.rand(3, 4)
# using the scaler class
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# using the scale function
X_scaled_convenient = preprocessing.minmax_scale(X)
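One practical difference between the scaler classes and the direct functions is that a fitted scaler object remembers its statistics, so the same transformation can be reapplied to new data. The sketch below shows this with StandardScaler; the train/test arrays are made-up data for illustration.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
X_train = rng.rand(20, 4) * 10   # made-up training data
X_test = rng.rand(5, 4) * 10     # made-up test data

scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_std = scaler.transform(X_test)        # reuse the same mean/std

print("train column means:", X_train_std.mean(axis=0))  # close to 0
print("train column stds:", X_train_std.std(axis=0))    # close to 1
print("learned means:", scaler.mean_)
```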
decomposition
Here I cover NMF and PCA, the two most commonly used methods.
import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
from sklearn.decomposition import NMF
model = NMF(n_components=2, init='random', random_state=0)
model.fit(X)
print(model.components_)
print(model.reconstruction_err_)
print(model.n_iter_)
A note on the difference between this class's fit() and fit_transform(): the former only trains the model and does not return the data transformed by NMF, while the latter both trains the model on the data and returns the transformed data.
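The difference can be seen directly in a small sketch. The names W, H, and X_approx are my own, and max_iter=1000 is just an assumed value to give the solver room to converge.

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

model = NMF(n_components=2, init='random', random_state=0, max_iter=1000)

fitted = model.fit(X)        # fit() returns the estimator itself
W = model.fit_transform(X)   # fit_transform() returns the transformed data
H = model.components_        # the learned components

X_approx = W @ H             # NMF approximates X as the product W * H
print("fit returns the model:", fitted is model)
print("W shape:", W.shape, "H shape:", H.shape)
print("reconstruction norm:", np.linalg.norm(X - X_approx))
```

If you only need the components, fit() is enough; if you need the new representation of the data, use fit_transform(), or fit() followed by transform().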
PCA is similar, but without those initialization parameters, as follows:
import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(X)
print(model.components_)
print(model.n_components_)
print(model.explained_variance_)
print(model.explained_variance_ratio_)
print(model.mean_)
print(model.noise_variance_)
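PCA is often used to actually reduce the number of features rather than just to inspect the fitted attributes. A minimal sketch on the same data, assuming we keep a single component (my choice), looks like this:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)               # project onto 1 component
X_restored = pca.inverse_transform(X_reduced)  # map back to 2 features

ratio = pca.explained_variance_ratio_[0]       # share of variance kept
print("reduced shape:", X_reduced.shape)
print("restored shape:", X_restored.shape)
print("explained variance ratio:", ratio)
```

Because almost all of the variance in this data lies along the first feature, one component keeps nearly all of it.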
metrics
The above clustering and classification tasks all require a final evaluation.
Classification
For classification, the commonly used evaluation metrics include:
- accuracy_score
- auc
- f1_score
- fbeta_score
- hamming_loss
- hinge_loss
- jaccard_similarity_score
- log_loss
- recall_score
- …
The following example finds the accuracy of the classification results:
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
ac = accuracy_score(y_true, y_pred)  # fraction of correct predictions
print(ac)
ac2 = accuracy_score(y_true, y_pred, normalize=False)  # number of correct predictions
print(ac2)
The use of other indicators is similar.
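For instance, a few of the other metrics can be called on the same labels as follows. For multiclass labels, recall_score and f1_score need an averaging mode; average='macro' here is my choice for the sketch.

```python
from sklearn.metrics import accuracy_score, f1_score, hamming_loss, recall_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

acc = accuracy_score(y_true, y_pred)                 # 2 of 4 correct: 0.5
ham = hamming_loss(y_true, y_pred)                   # 2 of 4 wrong: 0.5
rec = recall_score(y_true, y_pred, average='macro')  # per-class recall 1, 0, 0, 1
f1 = f1_score(y_true, y_pred, average='macro')       # per-class F1 1, 0, 0, 1

print(acc, ham, rec, f1)
```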
Regression
Regression-related metrics include, but are not limited to, the following:
- mean_absolute_error
- mean_squared_error
- median_absolute_error
- …
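These are called the same way as the classification metrics. Here is a small sketch with made-up target values, chosen so the results are easy to check by hand (the absolute errors are 0.5, 0.5, 0, and 1).

```python
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error)

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)      # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
mse = mean_squared_error(y_true, y_pred)       # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
medae = median_absolute_error(y_true, y_pred)  # median of 0, 0.5, 0.5, 1 = 0.5

print(mae, mse, medae)
```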
Clustering
Commonly used evaluation metrics (internal and external) include:
- adjusted_mutual_info_score
- adjusted_rand_score
- completeness_score
- homogeneity_score
- normalized_mutual_info_score
- silhouette_score
- v_measure_score
- …
The following example computes the NMI (normalized mutual information) of a clustering result; the other metrics are used similarly.
from sklearn.metrics import normalized_mutual_info_score
y_pred = [0, 0, 1, 1, 2, 2]
y_true = [1, 1, 2, 2, 3, 3]
nmi = normalized_mutual_info_score(y_true, y_pred)
print(nmi)
Of course, there are many other metrics; see the API reference.
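To illustrate the internal/external distinction mentioned above: external metrics such as adjusted_rand_score compare predicted labels against known true labels, while an internal metric such as silhouette_score only needs the data and the predicted labels. The sketch below uses a tiny made-up dataset of three well-separated pairs of points.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

y_true = [1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2]

# External: needs ground-truth labels; label names themselves do not matter.
ari = adjusted_rand_score(y_true, y_pred)

# Internal: needs only the data and the predicted labels.
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]], dtype=float)
sil = silhouette_score(X, y_pred)

print("adjusted Rand index:", ari)
print("silhouette score:", sil)
```

The adjusted Rand index is 1.0 here because the two labelings describe the same partition, just with different label names.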
datasets
sklearn itself also provides several common datasets, such as iris, diabetes, digits, covtype, kddcup99, boston, and breast_cancer, which can be loaded with functions similar to sklearn.datasets.load_iris. Each returns a dataset object, from which the data and labels are obtained as follows.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
In addition to these common datasets, the datasets module also provides many data loading functions, such as load_files and load_svmlight_file, as well as many data generators.
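As an example of the data generators, make_blobs and make_classification produce synthetic data for clustering and classification experiments. The sizes and random_state values below are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_blobs, make_classification

# Three Gaussian blobs in 2 dimensions, handy for testing clustering.
X_blobs, y_blobs = make_blobs(n_samples=150, centers=3, n_features=2,
                              random_state=0)

# A synthetic two-class classification problem with 5 features.
X_clf, y_clf = make_classification(n_samples=100, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

print("blobs:", X_blobs.shape, "labels:", sorted(set(y_blobs)))
print("classification:", X_clf.shape, "labels:", sorted(set(y_clf)))
```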
pandas.io also provides many methods to load external data (such as csv, excel, json, sql, etc.).
You can also get datasets from the mldata repository.
Python is quite powerful in this respect.
Of course, you can also load a dataset by writing your own file-reading function.
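A hand-written reader might look like the sketch below. The function name read_dataset and the file layout (a header row, numeric feature columns, and an integer label in the last column) are assumptions made for illustration; an in-memory string stands in for a real file so the example runs on its own.

```python
import csv
import io

import numpy as np


def read_dataset(file_obj):
    """Hypothetical helper: parse CSV text whose last column is the label."""
    reader = csv.reader(file_obj)
    header = next(reader)                     # first row holds column names
    rows = [row for row in reader if row]     # skip blank lines
    features = np.array([[float(v) for v in row[:-1]] for row in rows])
    labels = np.array([int(row[-1]) for row in rows])
    return header, features, labels


# An in-memory stand-in for open("some_file.csv").
sample = io.StringIO("f1,f2,label\n5.1,3.5,0\n4.9,3.0,0\n6.3,3.3,2\n")
header, X, y = read_dataset(sample)

print(header)
print(X.shape, y)
```

The resulting X and y arrays can be passed to fit() exactly like iris.data and iris.target.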
concluding remarks
The above covers the functions I use most frequently. Once you are familiar with Python, reading the scikit-learn API should be enough to handle most tasks.
In addition, if needed, you can read the source code of these commonly used functions to deepen your understanding of the principles behind common data mining algorithms.
Forwarding link:
http://blog.csdn.net/lilianforever/article/details/53780613