A Detailed Guide to sklearn Modules (2) (last updated: 2019/10/22)

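I. .preprocessing

1.1 Data standardization

1.1.1 Standardizing to zero mean and unit variance

  1. .StandardScaler(copy=True, with_mean=True, with_std=True): this scaler stores the mean and standard deviation learned from the training data, so exactly the same standardization can then be applied to the test data. It only shifts and rescales each feature; it does not make the data normally distributed.
  2. .scale(X, axis=0, with_mean=True, with_std=True, copy=True): the function form, which standardizes X in a single call without storing the statistics for later reuse.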

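1.1.2 Scaling to a range

  1. .MinMaxScaler(feature_range=(0, 1), copy=True): rescales each feature linearly into the given range.
  2. .MaxAbsScaler(copy=True): used the same way as MinMaxScaler, but divides each feature by its maximum absolute value, so the result lies in [-1, 1].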
  3. .minmax_scale(X, feature_range=(0, 1), axis=0, copy=True)
  4. .maxabs_scale(X, axis=0, copy=True)
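
As a minimal sketch of the two range scalers above (the toy matrix is made up for illustration), the following compares what each one does to the same columns:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Toy data (an assumption for illustration): two features on different scales.
X = np.array([[1.0, -10.0],
              [2.0,   0.0],
              [3.0,   5.0]])

# MinMaxScaler maps each column linearly onto feature_range.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# MaxAbsScaler divides each column by its largest absolute value,
# so zeros stay zero and signs are preserved.
X_maxabs = MaxAbsScaler().fit_transform(X)

print(X_minmax)
print(X_maxabs)
```

Because MaxAbsScaler does not shift the data, it is often the safer choice for sparse input.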

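1.1.3 Scaler methods

  1. .fit(X_train): learns the scaling statistics from the training data.
  2. .transform(X_test): applies the statistics learned by fit to new data.
  3. .fit_transform(X_train): fits on the data and transforms it in one call.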
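
The usual pattern with these methods is to fit on the training set only and reuse the fitted scaler on the test set. A small sketch with StandardScaler, using made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test data for illustration.
X_train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                      # learns mean_ and scale_ from the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuses the training mean and standard deviation

# fit_transform is a shortcut for fit followed by transform on the same data.
X_train_std_again = StandardScaler().fit_transform(X_train)

print(scaler.mean_)
print(X_test_std)
```

Fitting on the test data as well would leak test-set statistics into preprocessing, which is why transform is called on X_test without a second fit.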

1.1.4 Normalization

  1. .normalize(X, norm='l2', axis=1, copy=True, return_norm=False): with the default axis=1, each sample (row) is rescaled to unit norm.
  2. .Normalizer(norm='l2', copy=True)
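
Unlike the scalers in the previous subsections, normalization works row by row. A short sketch on a made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

# Made-up data: each row is one sample.
X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# L2 normalization: each row is divided by its Euclidean length.
X_l2 = normalize(X, norm='l2', axis=1)

# The class form does the same job and can be placed in a pipeline.
# With norm='l1', each row is divided by the sum of its absolute values.
X_l1 = Normalizer(norm='l1').fit_transform(X)

print(X_l2)  # first row becomes [0.6, 0.8]
print(X_l1)  # first row becomes [3/7, 4/7]
```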

1.1.5 Non-linear transformations

1.2 Feature encoding

  1. .OneHotEncoder(n_values=None, categorical_features=None, categories=None, drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

sparse: defaults to True, in which case the output is a sparse matrix; call .toarray() on the result to get a dense array, or pass sparse=False to get an array directly.

  1. .OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
  2. .LabelEncoder()
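
A minimal sketch of the three encoders on a made-up categorical column. It relies on the default sparse output of OneHotEncoder and converts it with .toarray(), as described above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Made-up feature column with three categories.
X = np.array([['red'], ['green'], ['blue'], ['green']])

# One-hot: one column per category (categories are sorted: blue, green, red).
onehot = OneHotEncoder()
X_onehot = onehot.fit_transform(X).toarray()

# Ordinal: each category becomes its index in the sorted category list.
ordinal = OrdinalEncoder()
X_ordinal = ordinal.fit_transform(X)

# LabelEncoder is meant for the target vector y, not for feature columns.
le = LabelEncoder()
y_encoded = le.fit_transform(['cat', 'dog', 'cat'])

print(X_onehot)
print(X_ordinal.ravel())   # [2. 1. 0. 1.]
print(y_encoded)           # [0 1 0]
```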

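II. .model_selection

2.1 Splitting datasets

  1. X_train, X_test, y_train, y_test = .train_test_split(data, target, test_size=0.4, random_state=0, stratify=None)

test_size: the proportion of the data assigned to the test set.
n_splits: the number of shuffle-and-split iterations to generate; this is a parameter of the ShuffleSplit classes listed after these notes, not of train_test_split.
stratify: the variable to stratify on; the split preserves the class proportions of this variable.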
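
A small sketch of a stratified split, assuming a made-up balanced dataset of ten samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 5 per class.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# stratify=y keeps the 50/50 class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)   # (6, 2) (4, 2)
print(np.bincount(y_train), np.bincount(y_test))
```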

  1. .ShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None): shuffles the data, then splits it.
  2. .StratifiedShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None): the same, but each split preserves the class proportions.

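2.2 Dataset splitters for cross-validation

  1. model_selection.KFold(n_splits='warn', shuffle=False, random_state=None)

Example: kf = KFold(n_splits=2)
kf.split(X_train, y_train)

n_splits: the k in k-fold.
shuffle: whether to shuffle the data before splitting; defaults to False.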
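
Note that kf.split returns a generator of (train indices, test indices) pairs rather than the data itself. A sketch on a made-up four-sample dataset:

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up data: 4 samples.
X = np.arange(8).reshape(4, 2)
y = np.array([0, 1, 0, 1])

kf = KFold(n_splits=2)  # shuffle=False, so folds are contiguous blocks

folds = []
for train_idx, test_idx in kf.split(X, y):
    # Index the arrays with the returned positions to get each fold's data.
    X_tr, X_te = X[train_idx], X[test_idx]
    folds.append((train_idx.tolist(), test_idx.tolist()))

print(folds)  # [([2, 3], [0, 1]), ([0, 1], [2, 3])]
```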

  1. .RepeatedKFold(n_splits=5, n_repeats=10, random_state=None): repeats .KFold n_repeats times, with a different randomization in each repetition.
  2. .LeaveOneOut()
  3. .LeavePOut(p)
  4. .StratifiedKFold(n_splits=3, shuffle=False, random_state=None)
  5. .GroupKFold
  6. .LeaveOneGroupOut
  7. .LeavePGroupsOut
  8. .GroupShuffleSplit
  9. .TimeSeriesSplit(n_splits='warn', max_train_size=None)
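
Of the splitters listed above, TimeSeriesSplit behaves least like plain k-fold: the training window grows and always precedes the test window. A sketch on six made-up, time-ordered samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Made-up time-ordered data: 6 observations.
X = np.arange(6).reshape(6, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X)]

for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# [0, 1, 2] [3]
# [0, 1, 2, 3] [4]
# [0, 1, 2, 3, 4] [5]
```

No shuffling takes place, so future observations never leak into the training folds.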



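Hyperparameter search

1. model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score=False): grid search
Parameters:

param_grid: the parameter space, given as a dict mapping parameter names to lists of values to try. To search several parameter spaces, wrap the dicts in a list.
scoring: the evaluation criterion; if not specified, the estimator's own score method is used.
n_jobs: the number of jobs to run in parallel. The default None means 1; -1 means use all CPUs.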
iid
refit
verbose
pre_dispatch
error_score
return_train_score

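Attributes:

cv_results_: the full results of the grid search, as a dict of arrays.
best_estimator_: the best estimator found (available when refit is enabled).
best_params_: the best parameter setting found.
best_score_: the mean cross-validated score of the best parameter setting.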
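
A minimal sketch of a grid search with two parameter spaces wrapped in a list. The estimator (SVC), the iris dataset and the candidate values are all assumptions chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two parameter spaces: gamma is only searched for the RBF kernel.
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]},
]

search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
print(len(search.cv_results_['params']))  # 3 + 3 * 2 = 9 candidates
```

With refit=True (the default), search.best_estimator_ is already refitted on the full data and can be used for prediction directly.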


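2. model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score='raise-deprecating', return_train_score=False): randomized search
Parameters:

estimator: (omitted)
param_distributions: the parameter distributions. Written like param_grid above, except that each dict value can be a random distribution (an object with an rvs method, such as those in scipy.stats); if a list is given instead, its values are sampled uniformly.
n_iter
scoring: (omitted)
n_jobs: (omitted)
iid
refit
cv: (omitted)
verbose
pre_dispatch
random_state: (omitted)
error_score
return_train_score

The attributes are the same as for GridSearchCV.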
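
A sketch of a randomized search that mixes a continuous distribution with a list. The estimator, dataset and ranges are assumptions for illustration only:

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    'C': uniform(loc=0.1, scale=10),    # continuous: sampled from U(0.1, 10.1)
    'fit_intercept': [True, False],     # list: values sampled uniformly
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=5,          # number of parameter settings sampled
    cv=3,
    random_state=0,    # makes the sampled settings reproducible
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```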


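Simple evaluation with cross-validation

  1. .cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score='raise-deprecating')

Example: scores = cross_val_score(clf, data, target, cv=5)
cv: when cv is an integer, KFold or StratifiedKFold is used by default; the latter is used if the estimator is a classifier (derives from ClassifierMixin) and y is binary or multiclass. You can also pass another cross-validation iterator or a custom iterable of splits.
scoring: the scoring method to use, given as a scorer name string or a scorer callable.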
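
Expanding the one-line example above into something runnable, with a decision tree and the iris dataset as stand-ins for clf, data and target:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# cv=5 with a classifier uses stratified 5-fold splitting.
scores = cross_val_score(clf, X, y, cv=5)

# A different metric can be requested by name through scoring.
f1_scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

print(scores, scores.mean())
print(f1_scores.mean())
```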


  1. cross_validate
  2. cross_val_predict


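III. .cross_validation

See the companion article "Cross-validation explained". Note that sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20; its splitters and helpers now live in .model_selection.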

IV. .metrics

Best read alongside the companion article "Evaluation metrics explained".

4.1 Custom evaluation metrics

  1. .make_scorer
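
As a sketch of how make_scorer is typically used: it wraps a metric function, plus any fixed keyword arguments, into a scorer object that can be passed as scoring. The choice of fbeta_score with beta=2 here is just an illustrative assumption:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Wrap fbeta_score with fixed beta and averaging into a scorer.
f2_macro = make_scorer(fbeta_score, beta=2, average='macro')

X, y = load_iris(return_X_y=True)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring=f2_macro)

print(scores.mean())
```

For loss functions, where lower is better, pass greater_is_better=False so that the scorer negates the value and model selection still maximizes it.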

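4.2 Classifier evaluation metrics

  1. .accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

normalize: by default the function returns the fraction of correctly classified samples; if False, it returns the number of correctly classified samples.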

  1. .balanced_accuracy_score(y_true, y_pred, sample_weight=None, adjusted=False)

adjusted

  1. .average_precision_score(y_true, y_score, average='macro', pos_label=1, sample_weight=None)
  2. .recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
  3. .precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
  4. .f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

average: the averaging mode for multiclass or multilabel targets, for example 'micro' or 'macro'.
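
To make the effect of normalize and average concrete, here is a small worked sketch on made-up three-class labels:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Made-up labels: one class-0 sample is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

acc = accuracy_score(y_true, y_pred)                         # 5/6
n_correct = accuracy_score(y_true, y_pred, normalize=False)  # 5

# 'micro' pools all decisions; 'macro' averages the per-class scores.
p_micro = precision_score(y_true, y_pred, average='micro')   # 5/6
p_macro = precision_score(y_true, y_pred, average='macro')   # (1 + 2/3 + 1) / 3 = 8/9
r_macro = recall_score(y_true, y_pred, average='macro')      # (2/3 + 1 + 1) / 3 = 8/9
f_macro = f1_score(y_true, y_pred, average='macro')          # (0.8 + 0.8 + 1) / 3

print(acc, n_correct, p_micro, p_macro, r_macro, f_macro)
```

The default average='binary' only works for two-class targets, so one of the other modes has to be chosen for multiclass data like this.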

  1. .log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None, labels=None)

average_precision_score
brier_score_loss
jaccard_score

  1. .roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
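
Unlike the label-based metrics, log_loss and roc_auc_score take scores or probabilities rather than hard predictions. A sketch with made-up binary probabilities for the positive class:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Made-up binary labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC: fraction of (negative, positive) pairs ranked correctly, here 3 of 4.
auc = roc_auc_score(y_true, y_score)

# Log loss: mean negative log-probability assigned to the true class.
ll = log_loss(y_true, y_score)

print(auc)  # 0.75
print(ll)
```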

4.3 Clustering evaluation metrics

  1. .adjusted_rand_score(labels_true, labels_pred): adjusted Rand index (ARI)
  2. .mutual_info_score(labels_true, labels_pred, contingency=None): mutual information
  3. .adjusted_mutual_info_score(labels_true, labels_pred, average_method='warn')
  4. .normalized_mutual_info_score(labels_true, labels_pred, average_method='warn')
  5. .completeness_score(labels_true, labels_pred): completeness
  6. .homogeneity_score(labels_true, labels_pred): homogeneity
  7. .homogeneity_completeness_v_measure(labels_true, labels_pred, beta=1.0)
  8. .v_measure_score(labels_true, labels_pred, beta=1.0)
  9. .fowlkes_mallows_score(labels_true, labels_pred, sparse=False)
  10. .silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds): silhouette coefficient
  11. .calinski_harabasz_score(X, labels)
  12. .davies_bouldin_score(X, labels)
  13. .contingency_matrix(labels_true, labels_pred, eps=None, sparse=False)
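
A short sketch of how a few of these label-based metrics behave, on made-up labelings. The point is that they compare partitions, so the actual cluster ids do not matter:

```python
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, v_measure_score)

labels_true = [0, 0, 1, 1]

# Same partition with the cluster names swapped: still a perfect score.
labels_renamed = [1, 1, 0, 0]
ari = adjusted_rand_score(labels_true, labels_renamed)     # 1.0

# Every point in its own cluster: each cluster is pure (homogeneous),
# but every class is split across clusters (not complete).
labels_split = [0, 1, 2, 3]
h = homogeneity_score(labels_true, labels_split)    # 1.0
c = completeness_score(labels_true, labels_split)   # 0.5
v = v_measure_score(labels_true, labels_split)      # harmonic mean of h and c, 2/3

print(ari, h, c, v)
```

Metrics such as silhouette_score take the data X instead of ground-truth labels and can be used when no true labeling is available.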

4.4 Regression evaluation metrics

  1. .mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
  2. .max_error(y_true, y_pred)
  3. .mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')

explained_variance_score
mean_squared_log_error
median_absolute_error
r2_score
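
A worked sketch of the main regression metrics on four made-up predictions:

```python
from sklearn.metrics import (max_error, mean_absolute_error,
                             mean_squared_error, r2_score)

# Made-up targets and predictions; the errors are 0.5, -0.5, 0 and -1.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)    # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
mae = mean_absolute_error(y_true, y_pred)   # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
worst = max_error(y_true, y_pred)           # 1.0
r2 = r2_score(y_true, y_pred)               # 1 - 1.5 / 29.1875

print(mse, mae, worst, r2)
```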

V. .pipeline

  1. .make_pipeline
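
A minimal sketch of make_pipeline, chaining a scaler and a classifier (both chosen here only for illustration). Step names are generated automatically from the lowercased class names:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# Inside cross-validation, the scaler is refitted on each training fold,
# so no statistics leak from the held-out fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```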

VI. .datasets

  1. .load_iris
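
For reference, a quick sketch of what load_iris returns:

```python
from sklearn.datasets import load_iris

# The default return value is a Bunch, a dict-like object with attribute access.
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                   # (150, 4)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
print(iris.feature_names)

# return_X_y=True skips the Bunch and returns the two arrays directly.
X2, y2 = load_iris(return_X_y=True)
```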

VII. .feature_extraction (feature extraction)

7.1

  1. .text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>): tokenizes a single column of text (one document per entry) and counts the occurrences of each token.
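
A small sketch of CountVectorizer on three made-up documents. With the default token_pattern shown above, tokens need at least two word characters, so the single-letter word "a" is dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: one document per list entry.
docs = ["the cat sat", "the cat sat on the mat", "a dog"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse document-term matrix

# vocabulary_ maps each token to its column index (columns are sorted by token).
vocab = sorted(vec.vocabulary_)
print(vocab)             # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(counts.toarray())
# [[1 0 0 0 1 1]
#  [1 0 1 1 1 2]
#  [0 1 0 0 0 0]]
```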

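VIII. .feature_selection

See the companion article "Feature selection in one article".

  1. .VarianceThreshold(threshold=0.0): filter method, variance threshold; removes features whose variance does not exceed the threshold.
  2. .SelectKBest(score_func=<function f_classif>, k=10)

score_func: the scoring function used by the filter method; defaults to f_classif, the ANOVA F-value.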

  1. .SelectPercentile(score_func=<function f_classif>, percentile=10)
  2. .SelectFpr(score_func=<function f_classif>, alpha=0.05)
  3. .SelectFdr(score_func=<function f_classif>, alpha=0.05)
  4. .SelectFwe(score_func=<function f_classif>, alpha=0.05)
  5. .GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05): a general-purpose univariate feature selector.
  6. .f_regression(X, y, center=True)
  7. .mutual_info_regression(X, y, discrete_features='auto', n_neighbors=3, copy=True, random_state=None): mutual information.
  8. .chi2(X, y)
  9. .f_classif(X, y)
  10. .mutual_info_classif(X, y, discrete_features='auto', n_neighbors=3, copy=True, random_state=None)
  11. .RFE(estimator, n_features_to_select=None, step=1, verbose=0): recursive feature elimination (RFE), generally classed as a wrapper method rather than an embedded one.
  12. .SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None): embedded method using a model of your choice.
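
A sketch of three of the selectors above on the iris dataset. The extra constant column and the choice of LogisticRegression inside RFE are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter by variance: append a constant column and watch it get dropped.
X_const = np.hstack([X, np.ones((X.shape[0], 1))])
X_var = VarianceThreshold(threshold=0.0).fit_transform(X_const)

# Filter by univariate score: keep the 2 features with the highest F-value.
kbest = SelectKBest(score_func=f_classif, k=2)
X_kbest = kbest.fit_transform(X, y)

# RFE: repeatedly fit the model and discard the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

print(X_var.shape)            # (150, 4)
print(X_kbest.shape)          # (150, 2)
print(kbest.get_support())    # boolean mask of the kept columns
print(rfe.support_)
```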

Reposted from blog.csdn.net/weixin_42297855/article/details/100987130