Python: Using the scikit-learn (sklearn) library

Introduction to SkLearn

The scikit-learn library is one of today's most popular machine learning libraries and can be used to solve classification and regression problems.

Data preprocessing

from sklearn import preprocessing

Label encoding

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
X['a'] = encoder.fit_transform(X['a'])	# fit the encoder and transform the column
# e.g. if X is the iris dataset and 'a' is the species column, the values in 'a' are automatically replaced with 0, 1, 2, ...
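
The encoder can also map the integer codes back to the original labels with inverse_transform. A minimal, self-contained sketch (the species values below are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical species column, standing in for the iris example above
X = pd.DataFrame({'a': ['setosa', 'versicolor', 'virginica', 'setosa']})

encoder = LabelEncoder()
X['a'] = encoder.fit_transform(X['a'])      # strings become integer codes: [0, 1, 2, 0]
print(encoder.inverse_transform(X['a']))    # back to the original string labels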

Data set processing

  1. Standardization

The formula is (X - mean) / std, computed separately for each attribute/column.

Subtract the mean of each attribute (by column) and divide by its standard deviation. The result is that, for each attribute/column, the data are centered around 0 with a variance of 1.

from sklearn.preprocessing import scale
X_scaled = scale(X)	# each column now has mean 0 and unit variance
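
As a quick check (a minimal sketch; the toy array below is made up for illustration), each scaled column should end up with mean ≈ 0 and standard deviation ≈ 1:

import numpy as np
from sklearn.preprocessing import scale

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # toy data
X_scaled = scale(X)
print(X_scaled.mean(axis=0))    # approximately [0. 0.]
print(X_scaled.std(axis=0))     # approximately [1. 1.]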

The StandardScaler class has the advantage that it saves the parameters (mean, variance) learned from the training set, so the same object can then be used directly to transform the test set data.

from sklearn.preprocessing import StandardScaler
trans = StandardScaler().fit(X)	# learn the mean and variance of X
X = trans.transform(X)			# apply the learned scaling to X
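
A minimal sketch of that workflow (train_X and test_X are assumed to come from the train/test split described in the next section): fit the scaler on the training data only, then reuse the same parameters on the test data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(train_X)      # learn mean and variance from the training set only
train_X_scaled = scaler.transform(train_X)
test_X_scaled = scaler.transform(test_X)    # apply the same parameters to the test set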

Build training set and test set

from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3)	# 70% for training, 30% for testing
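
Two commonly used optional parameters of train_test_split are random_state and stratify; a minimal sketch (X and y assumed as above):

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y,         # keep the class proportions of y in both subsets
)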

Build a predictive model

The basic idea of each model is not described here.

K nearest neighbor algorithm (KNN)

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()	# create the model
knn_clf.fit(train_X, train_y)		# train the model
knn_clf.predict(test_X)				# run the model on the test set and return the predictions
knn_clf.score(test_X, test_y)		# evaluate the model (mean accuracy)

# Note: the label encoder fitted earlier can be used to map the predictions back to the original labels
# encoder.inverse_transform(knn_clf.predict(test_X))
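
KNeighborsClassifier takes an n_neighbors parameter (default 5) that controls how many neighbors vote on each prediction. A minimal sketch of trying a few values (train_X, train_y, test_X, test_y assumed from the split above):

from sklearn.neighbors import KNeighborsClassifier

for k in (3, 5, 7):
    knn_clf = KNeighborsClassifier(n_neighbors=k)   # k neighbors vote on each prediction
    knn_clf.fit(train_X, train_y)
    print(k, knn_clf.score(test_X, test_y))         # accuracy for this choice of k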

Naive Bayes Algorithm

from sklearn.naive_bayes import GaussianNB
bayes_clf = GaussianNB()

Decision tree algorithm

from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()

Logistic regression algorithm

from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(solver='saga', max_iter=1000)

Support Vector Machine Algorithm

from sklearn.svm import SVC
svm_clf = SVC()

Random forest (ensemble method)

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier()

AdaBoost (ensemble method)

from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier()

Gradient boosting decision tree, GBDT (ensemble method)

from sklearn.ensemble import GradientBoostingClassifier
gbdt_clf = GradientBoostingClassifier()
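
All of the classifiers above share the same fit / predict / score interface, so any of them can be dropped into the KNN example. A minimal sketch comparing a few of them (train_X, train_y, test_X, test_y assumed from the earlier split):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

classifiers = {
    'decision tree': DecisionTreeClassifier(),
    'random forest': RandomForestClassifier(),
    'gbdt': GradientBoostingClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(train_X, train_y)                 # train on the training split
    print(name, clf.score(test_X, test_y))    # mean accuracy on the test split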

Cross-validation

from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# The data are split into 5 folds; each fold serves once as the test set, giving five scores in the end
# scoring='accuracy': the evaluation metric is accuracy; it can be omitted to use the default
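
cross_val_score returns a NumPy array, so the fold scores are usually summarized with their mean and standard deviation; a short sketch (clf, X, y assumed as above):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(scores)                         # five accuracy values, one per fold
print(scores.mean(), scores.std())    # average accuracy and its spread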

Save model

import joblib	# older scikit-learn versions used: from sklearn.externals import joblib (removed in newer releases)
joblib.dump(clf, 'clf.pkl')	# save the model to disk
clf = joblib.load('clf.pkl')	# load the model back

Origin blog.csdn.net/seek0226/article/details/108202437