[Data Science] Scikit-learn [Scikit-learn, loading data, training set and test set data, creating models, model fitting, fitting data and models, evaluating model performance, model adjustment]


1. Scikit-learn

Scikit-learn is an open-source Python library that provides machine learning, preprocessing, cross-validation, and visualization algorithms through a unified interface.


>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)

  The code above uses the scikit-learn library to carry out a k-nearest neighbors (KNN) classification workflow and obtain the prediction accuracy of the KNN classifier on the iris data set.


2. Load data

  The data scikit-learn works with are numbers stored as NumPy arrays or SciPy sparse matrices. Other data types, such as Pandas data frames, are also supported as long as they can be converted to numeric arrays.

>>> import numpy as np
# Import NumPy for numerical computation and array operations
>>> X = np.random.random((10, 5))
>>> y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])
>>> X[X < 0.7] = 0

  Above, NumPy is used to generate a random matrix X with 10 rows and 5 columns in which every element smaller than 0.7 is set to 0, together with a NumPy array y containing one gender label per row of X. These data can be used for subsequent analysis, modeling, or other tasks.


3. Training set and test set data

The train_test_split function of the scikit-learn library is used to split a data set into a training set and a test set.

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  This yields the feature data X_train and X_test for the training and test sets, together with the corresponding label arrays y_train and y_test. The model can then be trained on the training set, and the test set can be used to evaluate its performance and accuracy.
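
  As a quick sanity check (a minimal sketch using the X and y created above), the shapes of the resulting arrays can be inspected; with the default 25% test split of the 10-sample data set, 7 samples go to training and 3 to testing.

>>> X_train.shape, X_test.shape
# ((7, 5), (3, 5)): 7 training samples and 3 test samples, each with 5 features
>>> y_train.shape, y_test.shape
# ((7,), (3,)): the matching label arrays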


4. Create a model

4.1 Supervised learning estimators

4.1.1 Linear regression

  We use the LinearRegression class of the scikit-learn library to create a linear regression model object.

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()

  This creates a linear regression model object lr. (The normalize=True argument seen in older tutorials has been removed from recent scikit-learn releases; standardize the features beforehand, for example with StandardScaler, if needed.)
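
  A minimal usage sketch on made-up data (X_demo and y_demo below are illustrative, not from the article):

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])    # hypothetical single-feature inputs
>>> y_demo = np.array([2.0, 4.1, 5.9, 8.2])            # hypothetical targets, roughly y = 2x
>>> lr_demo = LinearRegression().fit(X_demo, y_demo)
>>> lr_demo.coef_, lr_demo.intercept_                  # learned slope and intercept
>>> lr_demo.predict([[5.0]])                           # prediction for a new sample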


4.1.2 Support Vector Machine (SVM)

  We use the SVC class of the scikit-learn library to create a support vector machine (SVM) model object.

>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')

  This creates a support vector machine model object svc that uses a linear kernel function for classification.


4.1.3 Naive Bayes

  Using the GaussianNB class of the scikit-learn library, create a Naive Bayes model object.

>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()

  This creates a Naive Bayes model object gnb. Naive Bayes is a commonly used probabilistic model that is well suited to classification problems.
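
  A brief usage sketch, assuming the X_train, y_train, and X_test arrays created in the earlier sections:

>>> gnb.fit(X_train, y_train)   # estimate per-class feature means and variances
>>> gnb.predict(X_test)         # most likely class for each test sample
>>> gnb.predict_proba(X_test)   # class membership probabilities for each test sample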


4.1.4 KNN

  Use the neighbors module of the scikit-learn library to create a k-nearest neighbors (KNN) classifier object.

>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)

  This creates a KNN classifier object knn that classifies each sample according to its nearest neighbors. KNN is an instance-based learning method that predicts a sample's label from the labels of its nearest training samples.


4.2 Unsupervised learning estimators

4.2.1 Principal component analysis (PCA)

  Using the PCA class of the scikit-learn library, create a principal component analysis (PCA) object.

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)

  This creates a PCA object pca that can be used for dimensionality reduction or feature extraction. PCA is a commonly used dimensionality-reduction technique that maps high-dimensional data into a lower-dimensional space while retaining the main structure of the data; here n_components=0.95 keeps as many components as are needed to explain 95% of the variance.
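
  Because n_components=0.95 is a fraction, the number of components is chosen automatically. A small sketch (assuming the X_train array from the earlier sections) that inspects what was kept:

>>> X_reduced = pca.fit_transform(X_train)
>>> pca.n_components_               # how many components were needed to reach 95% of the variance
>>> pca.explained_variance_ratio_   # fraction of variance explained by each kept component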


4.2.2 K Means

  Using the KMeans class of the scikit-learn library, create a K-Means clusterer object.

>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)

  This creates a K-Means clusterer object k_means that groups data points according to the distances between them. K-Means is a commonly used clustering algorithm: it splits the data points into a predefined number of clusters so that points within a cluster are as similar as possible while points in different clusters differ as much as possible.


5. Model fitting

5.1 Supervised learning

  The following shows the process of fitting three different supervised learning models to the training data.

# Fit data and models
>>> lr.fit(X, y)
# Fit the linear regression model to the data set (X, y); X is the feature matrix and y the target vector
# (for regression the targets must be numeric, e.g. the label-encoded y from section 6.4)
>>> knn.fit(X_train, y_train)
# Fit the k-nearest neighbors (KNN) classifier to the training set (X_train, y_train)
>>> svc.fit(X_train, y_train)
# Fit the support vector machine (SVM) to the training set (X_train, y_train)

5.2 Unsupervised learning

The process of clustering and dimensionality reduction on the training set is as follows.

>>> k_means.fit(X_train) # Fit the model to the training data
>>> pca_model = pca.fit_transform(X_train) # Fit the model and transform the data in one step

  The K-Means and PCA models are fitted to the given training data. For the K-Means algorithm, the model learns the optimal cluster centers; for the PCA algorithm, it learns the principal-component projection that best captures the variance in the data. These fitting operations produce models or transformers that can then be used to cluster or reduce the dimensionality of new data.
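
  Once fitted, both objects can be applied to new data; a minimal sketch (assuming the X_test split from section 3):

>>> X_test_reduced = pca.transform(X_test)   # project test samples onto the learned components
>>> k_means.predict(X_test)                  # assign each test sample to its nearest learned cluster center
>>> k_means.cluster_centers_                 # coordinates of the fitted cluster centers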


6. Data preprocessing

6.1 Standardization

  The following is the process of standardizing the training and test set data.

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

  This creates a StandardScaler object scaler and fits it on the training data. The scaler can then be used to standardize both the training and test sets so that the features share the same scale and range, which improves training and prediction for many models.
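
  A quick check of the effect (a sketch using the arrays just computed): after standardization each column of the training data should have mean close to 0 and standard deviation close to 1.

>>> standardized_X.mean(axis=0)   # approximately 0 for every column
>>> standardized_X.std(axis=0)    # approximately 1 for every column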


6.2 Normalization

  Next we normalize the training and test set data.

>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

  This creates a Normalizer object scaler and fits it on the training data. The scaler can then normalize the training and test sets so that each sample's feature vector has unit norm. Normalization makes the feature vectors of different samples more comparable and helps the training and prediction of certain machine learning algorithms.


6.3 Binarization

  Here is the process of binarizing the data set .

>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)

  This creates a Binarizer object binarizer with threshold=0.0 and uses it to binarize the data set X, saving the result in the variable binary_X: values above the threshold become 1 and the rest become 0. Binarization makes features from different samples more directly comparable and helps the training and prediction of certain machine learning algorithms.


6.4 Encoding categorical features

The process of label-encoding the target variable using LabelEncoder.

>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)

  This creates a LabelEncoder object enc and uses its fit_transform method to label-encode the target variable y. The encoded result overwrites the original y, so that each original label is replaced with a corresponding integer code. Label encoding is commonly used to convert non-numeric target variables into a numeric form that models can accept for training and prediction.
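
  A small self-contained sketch of what LabelEncoder does (the label list below is made up for illustration):

>>> from sklearn.preprocessing import LabelEncoder
>>> enc_demo = LabelEncoder()
>>> enc_demo.fit_transform(['M', 'F', 'F', 'M'])   # array([1, 0, 0, 1]): classes are sorted, so 'F' -> 0 and 'M' -> 1
>>> enc_demo.classes_                              # array(['F', 'M'])
>>> enc_demo.inverse_transform([0, 1])             # back to the original labels: array(['F', 'M'])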


6.5 Imputing missing values

  Use an imputer object to fill the missing values in the data set. (The old sklearn.preprocessing.Imputer class has been removed from recent scikit-learn releases; its replacement is sklearn.impute.SimpleImputer.)

>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)

  This creates a SimpleImputer object imp and uses its fit_transform method to fill the missing values (encoded here as 0) in the training set X_train: each missing entry is replaced by the mean of its column. Imputation is commonly used in the preprocessing stage to make the data set complete before modeling.
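
  For reference, a tiny self-contained sketch (a made-up array whose missing entries are encoded as np.nan, which is SimpleImputer's default):

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> X_missing = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
>>> SimpleImputer(strategy='mean').fit_transform(X_missing)
# array([[1. , 2. ],
#        [4. , 3. ],
#        [7. , 2.5]]): each NaN is replaced by the mean of its column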


6.6 Generate polynomial features

  Use PolynomialFeatures to expand the features into polynomial terms.

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)

  This creates a PolynomialFeatures object poly and uses its fit_transform method to expand the data set X into polynomial features. The expanded result contains combinations of the original features of every total degree up to 5, plus a bias column. Polynomial expansion is often used to increase model capacity so that nonlinear relationships between features can be captured, improving the model's predictive power.
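
  As a sanity check on how quickly the feature count grows, a short sketch (assuming the (10, 5) matrix X from section 2 and a scikit-learn version >= 1.0 for get_feature_names_out):

>>> poly2 = PolynomialFeatures(degree=2)
>>> poly2.fit_transform(X).shape        # (10, 21): bias column + 5 linear + 15 degree-2 terms
>>> poly2.get_feature_names_out()[:4]   # e.g. ['1', 'x0', 'x1', 'x2']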


7. Evaluate model performance

7.1 Classification metrics

7.1.1 Accuracy

  Use the estimator score method or the metric scoring functions in scikit-learn to evaluate the accuracy of the model.

>>> knn.score(X_test, y_test)
# Estimator score method
>>> from sklearn.metrics import accuracy_score
# Metric scoring function
>>> accuracy_score(y_test, y_pred)

  Both the estimator score method and metric scoring functions can be used to evaluate the model's accuracy on the test set. The score method is called directly on the model object, while a metric function such as accuracy_score takes the true labels and the predicted labels and computes the score from them. These evaluation methods help us understand a model's performance and compare different models.


7.1.2 Classification report

  Use scikit-learn's classification_report function to generate a report with the precision, recall, F1-score, and support of the classification model.

>>> from sklearn.metrics import classification_report
# Precision, recall, F1-score and support
>>> print(classification_report(y_test, y_pred))

  The classification_report function generates an evaluation report for a classification model. The report includes precision, recall, F1-score, and support for each class, together with macro and weighted averages. These metrics help us evaluate the model's performance on each class and give a detailed picture of its behavior.


7.1.3 Confusion matrix

Use the confusion_matrix function in scikit-learn to generate the confusion matrix of a classification model.

>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))

  The confusion_matrix function produces the classification model's confusion matrix, which shows in matrix form how the samples of each class were classified, including the numbers of true positives, false positives, false negatives, and true negatives. From the confusion matrix we can evaluate the model's performance on each class and analyze where misclassifications occur.
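
  For a binary problem the four counts can be read off directly; a self-contained sketch with made-up labels:

>>> from sklearn.metrics import confusion_matrix
>>> y_true_bin = [0, 1, 1, 0, 1, 0]   # hypothetical true labels
>>> y_pred_bin = [0, 1, 0, 0, 1, 1]   # hypothetical predictions
>>> tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()
>>> tn, fp, fn, tp
# (2, 1, 1, 2): 2 true negatives, 1 false positive, 1 false negative, 2 true positives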


7.2 Regression metrics

7.2.1 Mean absolute error

  Use the mean_absolute_error function in scikit-learn to calculate the mean absolute error (Mean Absolute Error) of the regression model.

>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]    # example ground-truth values
>>> y_pred = [2.5, 0.0, 2]   # example predictions of the same length
>>> mean_absolute_error(y_true, y_pred)

  Use the mean_absolute_error function to calculate the mean absolute error of the regression model. This metric measures the average absolute deviation between the model's predictions and the true values; the smaller the value, the more accurate the predictions. Mean absolute error helps you evaluate the performance of a regression model and compare different models.


7.2.2 Mean square error

Use the mean_squared_error function in scikit-learn to calculate the mean squared error (MSE) of the regression model.

>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)

  Use the mean_squared_error function to calculate the mean squared error of the regression model. This metric measures the average squared difference between the model's predictions and the true values; the smaller the value, the more accurate the predictions. Mean squared error helps us evaluate the performance of a regression model and compare different models.


7.2.3 R² score

  Use the r2_score function in scikit-learn to calculate the R² score (coefficient of determination) of the regression model.

>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)

By running this code you can use the r2_score function to compute the R² score (also called the coefficient of determination or goodness of fit) of a regression model. A perfect fit gives a score of 1; a model that always predicts the mean of the targets scores 0, and the score can even be negative for models that fit worse than that. Values closer to 1 indicate a better fit to the data. The R² score helps us evaluate the performance of a regression model and compare it with other models.
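
  Putting the three regression metrics together on one small made-up example (values chosen so the results are easy to check by hand):

>>> from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
>>> y_true_demo = [3.0, -0.5, 2.0, 7.0]             # hypothetical ground truth
>>> y_pred_demo = [2.5, 0.0, 2.0, 8.0]              # hypothetical predictions
>>> mean_absolute_error(y_true_demo, y_pred_demo)   # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
>>> mean_squared_error(y_true_demo, y_pred_demo)    # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
>>> r2_score(y_true_demo, y_pred_demo)              # about 0.949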


7.3 Clustering metrics

7.3.1 Adjusted Rand index

Use the adjusted_rand_score function in scikit-learn to calculate the Adjusted Rand Index of a clustering result.

>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred) 

  Use the adjusted_rand_score function to calculate the adjusted Rand index of a clustering result. The adjusted Rand index ranges from -1 to 1: values close to 1 indicate high agreement between the clustering and the true labels, values close to 0 indicate agreement no better than a random partition, and negative values indicate agreement worse than random. The adjusted Rand index helps evaluate the performance of a clustering algorithm and compare it with other algorithms.


7.3.2 Homogeneity

Use the homogeneity_score function   in scikit-learn to calculate the homogeneity score (Homogeneity Score) of the clustering results.

>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred) 

  Use the homogeneity_score function to calculate the homogeneity of a clustering result. The homogeneity score ranges from 0 to 1: values close to 1 mean that each cluster contains samples from only one true class, while values close to 0 mean the clusters mix samples from many classes. The homogeneity score helps evaluate how well a clustering algorithm keeps samples of the same class together and allows comparison with other algorithms.


7.3.3 V-measure

Use the v_measure_score function in scikit-learn to calculate the V-measure score of the clustering results.

>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred) 

  Use the v_measure_score function to calculate the V-measure of a clustering result. The V-measure is the harmonic mean of homogeneity and completeness and ranges from 0 to 1: the closer the value is to 1, the more homogeneous and complete the clustering is, while values close to 0 indicate low clustering quality.
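
  The three clustering metrics share the same call signature, so they can be compared side by side on one small made-up example:

>>> from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score
>>> labels_true = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth partition
>>> labels_pred = [0, 0, 1, 2, 2, 2]   # hypothetical clustering result
>>> adjusted_rand_score(labels_true, labels_pred)
>>> homogeneity_score(labels_true, labels_pred)
>>> v_measure_score(labels_true, labels_pred)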


7.4 Cross-validation

Use the cross_val_score function in scikit-learn to perform cross-validation and compute model performance scores.

>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))

  Use the cross_val_score function to cross-validate a model and compute its performance scores. Cross-validation evaluates a model more comprehensively and makes the evaluation more reliable: the input data is divided into several folds, each fold is used in turn as the validation set while the remaining folds are used to fit the model, and a performance score is computed for each fold. The scores from all folds are then summarized (typically averaged) to obtain the model's overall performance estimate.
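
  As a small follow-up (a sketch reusing knn, X_train, and y_train from the earlier sections), the per-fold scores can be summarized into a single estimate:

>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(knn, X_train, y_train, cv=4)
>>> scores                        # one score per fold
>>> scores.mean(), scores.std()   # average performance and its spread across the folds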


8. Model adjustment

8.1 Grid search

Use the GridSearchCV class in scikit-learn to perform a grid search for parameter tuning.

>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

  By running this code you can use the GridSearchCV class to exhaustively search the model's parameter space and find the best parameter combination. Grid search helps find parameter settings that improve model performance: every parameter combination is cross-validated, and the best one is selected according to the performance evaluation metric.
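
  Once the search has finished, the fitted GridSearchCV object exposes the tuned model directly; a minimal sketch (continuing from the grid object above, with X_test and y_test assumed available from section 3):

>>> print(grid.best_params_)          # the full best parameter combination found
>>> best_knn = grid.best_estimator_   # a KNN model refitted on the training set with those parameters
>>> best_knn.score(X_test, y_test)    # evaluate the tuned model on held-out data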


8.2 Random parameter optimization

Use the RandomizedSearchCV class in scikit-learn to perform a randomized search for parameter tuning.

>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)

  Use the RandomizedSearchCV class to sample random parameter combinations from the model's parameter space and find the best combination encountered. Compared with grid search, randomized search can explore a large parameter space more efficiently. Each sampled combination is cross-validated, and the best one is selected according to the performance evaluation metric.

Origin: blog.csdn.net/m0_65748531/article/details/133465300