sklearn Modules in Detail

Copyright notice: This is the author's original article, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_42297855/article/details/97917976

import sklearn

linear_model

Generalized linear models in detail

  1. .LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)
  2. .Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
  3. .RidgeCV(alphas=(0.1, 1.0, 10.0), fit_intercept=True, normalize=False, scoring=None, cv=None, gcv_mode=None, store_cv_values=False)
  4. .Lasso
  5. .MultiTaskLasso
  6. .ElasticNet
  7. .MultiTaskElasticNet
  8. .LassoLars
  9. .OrthogonalMatchingPursuit.orthogonal_mp
  10. .BayesianRidge
  11. .ARDRegression
  12. .LogisticRegression
  13. .SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False)

loss: classification loss functions: 'hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'; regression loss functions: 'squared_loss', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'
penalty: the penalty (regularization) term; default 'l2'
max_iter: the maximum number of iterations.

  14. .SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False)

loss: same as above.

  15. .Perceptron
  16. .PassiveAggressiveClassifier
  17. .HuberRegressor

Methods

Below, clf refers to any of the classifiers or regressors above.

  1. clf.fit(X_train,y_train)
  2. clf.predict(X_test)

Attributes

  1. clf.coef_: the non-constant (feature) coefficients.
  2. clf.intercept_: the constant term (intercept).
  3. clf.decision_function: the decision function.
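
As a minimal, self-contained sketch of the fit/predict workflow above (the toy data is purely illustrative):

from sklearn.linear_model import LinearRegression
import numpy as np

# Toy data: y ≈ 2x + 1 with a little noise
rng = np.random.RandomState(0)
X_train = rng.rand(50, 1)
y_train = 2 * X_train.ravel() + 1 + 0.01 * rng.randn(50)

clf = LinearRegression(fit_intercept=True)
clf.fit(X_train, y_train)
print(clf.coef_, clf.intercept_)  # roughly [2.] and 1.0
clf.predict([[0.5]])              # roughly [2.]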

discriminant_analysis

Discriminant analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA: linear discriminant analysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA: quadratic discriminant analysis

  1. LDA(solver='svd', shrinkage=None, priors=None, n_components=None, store_covariance=False, tol=0.0001)
  • Parameters

solver: 'svd' (singular value decomposition) is the default solver; it does not compute the covariance matrix, so it is recommended for data with many features. 'lsqr': least-squares solution. 'eigen': eigenvalue decomposition.
shrinkage: the shrinkage parameter; it can improve the estimate of the covariance matrix when the number of training samples is small compared with the number of features. It may be set to 'auto' or a number in [0, 1]; 'auto' requires solver to be 'lsqr' or 'eigen'.

[Figure from the sklearn documentation showing how shrinkage performs at different sample sizes.]

priors
n_components: the number of components kept for dimensionality reduction (at most n_classes - 1).
store_covariance
tol: the threshold used for rank estimation in the SVD solver.

  • Attributes

coef_ :
Weight vector(s).

intercept_ :
Intercept term.

covariance_ :
Covariance matrix (shared by all classes).

explained_variance_ratio_ :
Percentage of variance explained by each of the selected components. If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0. Only available when eigen or svd solver is used.

means_ : array-like, shape (n_classes, n_features)
Class means.

priors_ : array-like, shape (n_classes,)
Class priors (sum to 1).

scalings_ : array-like, shape (rank, n_classes - 1)
Scaling of the features in the space spanned by the class centroids.

xbar_ : array-like, shape (n_features,)
Overall mean.

classes_ : array-like, shape (n_classes,)
Unique class labels.

  • Methods

decision_function(self, X) Predict confidence scores for samples.
fit(self, X, y) Fit LinearDiscriminantAnalysis model according to the given training data and parameters.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
predict(self, X) Predict class labels for samples in X.
predict_log_proba(self, X) Estimate log probability.
predict_proba(self, X) Estimate probability.
score(self, X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Project data to maximize class separation.
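
A brief sketch of LDA used both as a classifier and for dimensionality reduction (the built-in iris data is used only for illustration):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
lda = LDA(solver='svd', n_components=2)
X_proj = lda.fit_transform(X, y)      # project onto the 2 discriminant axes
lda.predict(X[:5])                    # class labels for the first 5 samples
print(lda.explained_variance_ratio_)  # available for the 'svd' and 'eigen' solvers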

  2. QDA(priors=None, reg_param=0.0, store_covariance=False, tol=0.0001)


.kernel_ridge

from sklearn.kernel_ridge import KernelRidge

  1. KernelRidge(alpha=1, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None)
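
A minimal sketch of kernel ridge regression on synthetic data (the rbf kernel and gamma value are arbitrary illustrative choices):

from sklearn.kernel_ridge import KernelRidge
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.randn(100)

model = KernelRidge(alpha=1.0, kernel='rbf', gamma=5.0)
model.fit(X, y)
model.predict(X[:3])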


.svm

from sklearn import svm

Classification

  1. svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)

C: the regularization parameter
kernel: the kernel function: 'linear', 'poly' (polynomial), 'rbf', 'sigmoid', or a custom kernel.
degree: when kernel='poly', the degree of the polynomial.
gamma: for non-linear kernels, the value of the kernel coefficient γ.
coef0: when kernel is 'poly' or 'sigmoid', the independent term r.
shrinking: whether to use the shrinking heuristic
probability: whether to enable probability estimates
tol: the stopping tolerance
cache_size: the kernel cache size (MB)
class_weight: used for class-imbalance problems; per-sample weights can instead be passed to fit via sample_weight.
verbose
max_iter
decision_function_shape: 'ovo' means one-vs-one; 'ovr' means one-vs-rest.
random_state: omitted
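
A usage sketch with the parameters above (gamma='scale' is assumed here as the modern replacement for the deprecated 'auto' default):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = svm.SVC(C=1.0, kernel='rbf', gamma='scale', decision_function_shape='ovr')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)  # mean accuracy on the test set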


  2. svm.NuSVC(nu=0.5, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)

  3. svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)

Regression

  1. svm.SVR(kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)

  2. svm.NuSVR(nu=0.5, C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, tol=0.001, cache_size=200, verbose=False, max_iter=-1)

  3. svm.LinearSVR(epsilon=0.0, tol=0.0001, C=1.0, loss='epsilon_insensitive', fit_intercept=True, intercept_scaling=1.0, dual=True, verbose=0, random_state=None, max_iter=1000)

Attributes:
support_vectors_: the support vectors
support_: the indices of the support vectors
n_support_: the number of support vectors for each class
.decision_function
dual_coef_: the dual coefficients: y_i * alpha_i for classification, alpha_i - alpha_i^* for regression.
intercept_: omitted



.neighbors

  1. .NearestNeighbors(n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None, **kwargs)

n_neighbors: the number of neighbors, i.e., how many of the nearest samples are used.
radius
algorithm: the search algorithm: 'auto', 'ball_tree', 'kd_tree', 'brute'


  2. .KDTree

  3. .BallTree

Methods: kneighbors(X) returns the distances to and the indices of the nearest neighbors; radius_neighbors(X) returns the neighbors within a given radius.
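
A small sketch of the kneighbors query (toy points for illustration):

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
nn = NearestNeighbors(n_neighbors=2, algorithm='auto').fit(X)
distances, indices = nn.kneighbors([[1.1, 1.0]])
print(indices)    # the two training points closest to the query
print(distances)  # their distances to the query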



.tree

  1. .DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

  2. .DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)

Methods:
clf.predict_proba([[2., 2.]]): predicts the probability of each class, i.e., the fraction of training samples of the same class in the leaf the sample falls into.
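
A short sketch of the behavior described above (toy data only):

from sklearn.tree import DecisionTreeClassifier

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
y = [0, 0, 1, 1]
clf = DecisionTreeClassifier(criterion='gini').fit(X, y)
clf.predict_proba([[2., 2.]])  # fraction of same-class training samples in the leaf
clf.predict([[2., 2.]])        # array([1])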



.ensemble

Ensemble learning

  1. .AdaBoostRegressor(base_estimator=None, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)

n_estimators: the number of base learners
base_estimator: the type of base learner; defaults to .tree.DecisionTreeRegressor(max_depth=3)


  2. .AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
  3. .BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
  4. .BaggingRegressor(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
  5. .RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
  6. .RandomForestRegressor(n_estimators='warn', criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)
  7. .ExtraTreesClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
  8. .ExtraTreesRegressor(n_estimators='warn', criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)
  9. .VotingClassifier(estimators, voting='hard', weights=None, n_jobs=None, flatten_transform=True)

voting: the default 'hard' uses majority voting on the predicted class labels; 'soft' averages the predicted class probabilities and picks the class with the largest average (every estimator must support predict_proba). See the sketch after this list.

  10. .VotingRegressor(estimators, weights=None, n_jobs=None)
  11. .GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)
  12. .GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)
  13. .HistGradientBoostingClassifier(loss='auto', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring=None, validation_fraction=0.1, n_iter_no_change=None, tol=1e-07, verbose=0, random_state=None): performs much better than GradientBoostingClassifier when the dataset is large.
  14. .HistGradientBoostingRegressor(loss='least_squares', learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=256, scoring=None, validation_fraction=0.1, n_iter_no_change=None, tol=1e-07, verbose=0, random_state=None)
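
A sketch of the soft-voting behavior described above (the choice of base estimators is an arbitrary illustration):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
eclf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=100))],
    voting='soft')  # average predict_proba outputs, then take the argmax
eclf.fit(X, y)
eclf.predict(X[:3])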


.feature_selection

.VarianceThreshold(threshold=0.0)
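
A minimal sketch: with the default threshold of 0.0, only zero-variance (constant) features are removed:

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0], [0, 1, 4], [0, 1, 1]]
sel = VarianceThreshold(threshold=0.0)
sel.fit_transform(X)  # the first (constant) column is dropped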



.preprocessing

from sklearn import preprocessing

Data standardization

  1. preprocessing.StandardScaler()
    The scaler stores the mean and standard deviation of the training data, so the same standardization can later be applied to test data.
    Example:
scaler = preprocessing.StandardScaler().fit(X_train) 
X_test_transformed = scaler.transform(X_test)
  2. preprocessing.scale(X_train): standardizes the data to zero mean and unit variance.
  3. min_max_scaler = preprocessing.MinMaxScaler(): scales the data to the [0, 1] range.
    Example:
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)
  4. max_abs_scaler = preprocessing.MaxAbsScaler(): same usage as MinMaxScaler, but the range becomes [-1, 1].

Feature encoding

  1. preprocessing.OneHotEncoder
  2. preprocessing.OrdinalEncoder
  3. preprocessing.LabelEncoder
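
A short sketch of the three encoders (the category values are illustrative):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

X = [['red'], ['green'], ['red']]
OneHotEncoder().fit_transform(X).toarray()     # one binary column per category
OrdinalEncoder().fit_transform(X)              # integer codes, one per feature column
LabelEncoder().fit_transform(['a', 'b', 'a'])  # for target labels: array([0, 1, 0])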

model_selection

from sklearn import model_selection

Dataset splitting

  1. X_train, X_test, y_train, y_test = model_selection.train_test_split(data, target, test_size=0.4, random_state=0, stratify=None)

test_size: the proportion of the dataset to allocate to the test set
stratify: the variable used for stratified sampling; the split preserves this variable's class distribution.

  2. ShuffleSplit

Shuffles the data, then splits it.
n_splits: the k value, i.e., how many times to re-shuffle and split the data.

Cross-validation

  1. model_selection.KFold(n_splits='warn', shuffle=False, random_state=None)

Example: kf = KFold(n_splits=2)
kf.split(X_train, y_train)

n_splits
shuffle: whether to shuffle the data before splitting; default False.

  2. RepeatedKFold

n_splits
n_repeats: the number of repetitions

  3. LeaveOneOut
  4. LeavePOut

p: the value of p

  5. StratifiedKFold
  6. GroupKFold
  7. LeaveOneGroupOut
  8. LeavePGroupsOut
  9. GroupShuffleSplit
  10. TimeSeriesSplit
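
A sketch of iterating over the KFold splits (synthetic data for illustration):

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate a model on each fold here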

Hyperparameter search

1. model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score=False): grid search
Parameters:

estimator: the learner
param_grid: the parameter space, given as a dict; to search several parameter spaces, wrap them in a list.
scoring: the evaluation criterion; if not specified, the learner's own scoring method is used
n_jobs: the number of jobs to run in parallel; the default None means 1; -1 uses all CPUs.
iid
refit
cv: see below under cross_val_score
verbose
pre_dispatch
error_score
return_train_score

Attributes:

cv_results_: the results of the grid search
best_estimator_: the best learner
best_params_: the best parameters
best_score_: the best score
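
A sketch of a grid search over SVC hyperparameters (the grid values are arbitrary illustrations):

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
search.best_estimator_.score(X_test, y_test)  # refit=True retrains on the whole training set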

2. model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score='raise-deprecating', return_train_score=False): random search
Parameters:

estimator: same as above
param_distributions: the parameter distributions, written like param_grid above, except that each dict value is a random distribution; if a list is given instead, it is sampled uniformly.
n_iter
scoring: same as above
n_jobs: same as above
iid
refit
cv: same as above
verbose
pre_dispatch
random_state: same as above
error_score
return_train_score

The attributes are the same as for GridSearchCV.

Evaluation

1. cross_val_score
Example: scores = cross_val_score(clf, data, target, cv=5)
clf: the classifier
data
target
cv: when cv is an integer, the KFold or stratified-fold strategy is used by default; the latter is used when the estimator derives from ClassifierMixin. Other cross-validation iterators, or a custom iterator, can also be supplied.
scoring: the scoring method; see the sklearn scoring documentation for details
2. cross_validate

.cross_validation

from sklearn.cross_validation import KFold

Note: the old sklearn.cross_validation module is deprecated and was removed in scikit-learn 0.20; use sklearn.model_selection instead.

.metrics

Evaluation metrics in detail
Classifier metrics:

  1. .accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

normalize: by default, returns the accuracy; if False, returns the number of correctly classified samples.

  2. .balanced_accuracy_score(y_true, y_pred, sample_weight=None, adjusted=False)

adjusted

  3. .average_precision_score(y_true, y_score, average='macro', pos_label=1, sample_weight=None)
  4. .recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
  5. .precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
  6. .f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
  7. .log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None, labels=None)
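
A small sketch of the basic classification metrics (hand-made labels for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
accuracy_score(y_true, y_pred)   # 0.75
precision_score(y_true, y_pred)  # 1.0
recall_score(y_true, y_pred)     # 0.5
f1_score(y_true, y_pred)         # 0.666...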

scorer.make_scorer
mean_squared_error
average_precision_score
brier_score_loss
log_loss
jaccard_score
roc_auc_score

Clustering

adjusted_mutual_info_score
adjusted_rand_score
completeness_score
fowlkes_mallows_score
homogeneity_score
mutual_info_score
normalized_mutual_info_score
v_measure_score

Regression

explained_variance_score
max_error
mean_absolute_error
mean_squared_error
mean_squared_log_error
median_absolute_error
r2_score

pipeline

1. make_pipeline
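
A sketch of chaining preprocessing and a model with make_pipeline (the step choices are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(C=1.0))  # the scaler is fit on the data passed to fit
pipe.fit(X, y)
pipe.predict(X[:3])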

datasets

1. load_iris
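
A one-line sketch:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # features of shape (150, 4) and labels of shape (150,)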

