Machine Learning from Scratch: Study Notes (18) - Pandas Notes (Part 2)


1. Counting the occurrences of each element in a column

Thanks to @waple_0820 for providing these two approaches:

df.groupby(['colname'], as_index=False)['colname'].agg({'cnt': 'count'})  # approach 1
df['colname'].value_counts()                                              # approach 2
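
Note that approach 1 relies on passing a dict to .agg() to rename the output column, which newer pandas versions no longer support. A minimal sketch of equivalent counts on current pandas, using a hypothetical toy DataFrame:

import pandas as pd

# hypothetical toy data
df = pd.DataFrame({'colname': ['a', 'b', 'a', 'c', 'a']})

print(df.groupby('colname').size().reset_index(name='cnt'))  # DataFrame with the counts in a 'cnt' column
print(df['colname'].value_counts())                          # Series of counts, sorted by frequency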

2. Splitting the original data into train and test sets

sklearn provides a function called train_test_split that splits the data in exactly this way. The usage is as follows:

from sklearn.model_selection import train_test_split
X = df.drop('class', axis=1)
Y = df['class']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 42)

X and Y are the features and labels extracted from the original dataset; calling train_test_split() then splits them into training and test sets. test_size is the fraction of all samples that goes into the test set. random_state is a seed for the random number generator: if you set it, the split is identical on every run; if you leave it unset, the training and test samples differ from run to run, although their sizes stay the same.
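
A minimal sketch of the random_state behaviour, reusing the X and Y defined above (the variable names here are just for illustration):

from sklearn.model_selection import train_test_split

# same seed -> identical split on every call
X_a, X_b, _, _ = train_test_split(X, Y, test_size=0.33, random_state=42)
X_c, X_d, _, _ = train_test_split(X, Y, test_size=0.33, random_state=42)
print(X_b.index.equals(X_d.index))   # True

# no seed -> a (usually) different split on each call, but the same sizes
X_e, X_f, _, _ = train_test_split(X, Y, test_size=0.33)
print(len(X_f) == len(X_b))          # True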

3. Evaluating a model's precision, recall and F1

These again come from functions in sklearn; the usage is as follows:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("ACC", accuracy_score(Y_test, Y_pred))                      # accuracy
print("PREC", precision_score(Y_test, Y_pred, average="micro"))   # precision
print("REC", recall_score(Y_test, Y_pred, average="micro"))       # recall
print("F-score", f1_score(Y_test, Y_pred, average="micro"))       # F1

In these metric functions, the first argument is the true labels of the test data and the second is the predicted results. They also take several other parameters, whose roles can be seen in the docstring:

y_true : 1d array-like, or label indicator array / sparse matrix
    Ground truth (correct) target values.

y_pred : 1d array-like, or label indicator array / sparse matrix
    Estimated targets as returned by a classifier.

labels : list, optional
    The set of labels to include when ``average != 'binary'``, and their
    order if ``average is None``. Labels present in the data can be
    excluded, for example to calculate a multiclass average ignoring a
    majority negative class, while labels not present in the data will
    result in 0 components in a macro average. For multilabel targets,
    labels are column indices. By default, all labels in ``y_true`` and
    ``y_pred`` are used in sorted order.

    .. versionchanged:: 0.17
       parameter *labels* improved for multiclass problem.

pos_label : str or int, 1 by default
    The class to report if ``average='binary'`` and the data is binary.
    If the data are multiclass or multilabel, this will be ignored;
    setting ``labels=[pos_label]`` and ``average != 'binary'`` will report
    scores for that label only.

average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', \
                   'weighted']
    This parameter is required for multiclass/multilabel targets.
    If ``None``, the scores for each class are returned. Otherwise, this
    determines the type of averaging performed on the data:

    ``'binary'``:
        Only report results for the class specified by ``pos_label``.
        This is applicable only if targets (``y_{true,pred}``) are binary.
    ``'micro'``:
        Calculate metrics globally by counting the total true positives,
        false negatives and false positives.
    ``'macro'``:
        Calculate metrics for each label, and find their unweighted
        mean.  This does not take label imbalance into account.
    ``'weighted'``:
        Calculate metrics for each label, and find their average, weighted
        by support (the number of true instances for each label). This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall.
    ``'samples'``:
        Calculate metrics for each instance, and find their average (only
        meaningful for multilabel classification where this differs from
        :func:`accuracy_score`).

sample_weight : array-like of shape = [n_samples], optional
    Sample weights.

The first two parameters need no further explanation.

labels is the set of labels to include when average != 'binary', and their order when average is None. Labels present in the data can be excluded, for example to compute a multiclass average that ignores a majority negative class, while labels not present in the data contribute 0 to a macro average. For multilabel targets, labels are column indices. By default, all labels appearing in y_true and y_pred are used, in sorted order.

pos_label can be a str or an int and defaults to 1. It is only used for binary problems (average='binary'); if the data are multiclass or multilabel it is ignored.

average controls how a binary metric is extended to multiclass or multilabel problems: the data are treated as a collection of binary problems, one per class, the metric is computed for each of them, and the per-class scores are then averaged, which is useful in many situations. The average parameter specifies how this averaging is done.

average = binary: for a binary problem, report the result only for the class specified by pos_label.

average = macro: compute the mean of the per-class binary metrics, giving every class the same weight. This can highlight performance on infrequent classes, but it also assumes all classes are equally important, so the (usually poor) score on a rare class can strongly pull down the macro average.


average = micro: give every sample-class pair an equal contribution to the overall metric (apart from the effect of sample_weight). Rather than summing the metric per class, it sums the numerators and denominators that make up the per-class metrics and computes a single overall quotient. Micro-averaging is typically used in multilabel settings.

average = weighted: average the per-class binary metrics, weighting each class's score by its support (the number of true instances of that class). This can replace macro-averaging when the label counts are imbalanced.

average = samples: only applies to multilabel problems. It does not compute a per-class metric; instead, for each sample in the evaluation data it computes the metric over that sample's true and predicted labels, and then returns the (sample_weight-weighted) average over samples.

average = None returns an array containing the score for each class.
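
A minimal sketch with made-up labels, showing how average and labels change the result of f1_score:

from sklearn.metrics import f1_score

# hypothetical 3-class ground truth and predictions
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

print(f1_score(y_true, y_pred, average=None))        # per-class F1, roughly [0.8, 0.5, 0.67]
print(f1_score(y_true, y_pred, average='micro'))      # one score from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average='macro'))      # unweighted mean of the per-class scores
print(f1_score(y_true, y_pred, average='weighted'))   # mean weighted by each class's support
print(f1_score(y_true, y_pred, labels=[1, 2], average='macro'))  # average over classes 1 and 2 only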

4. Working with Series objects

Sometimes indexing with loc or iloc returns a Series. Accessing a Series directly gives you the data together with its index, so the following can be used to get the index and the values separately.

obj.values  # get the values
obj.index   # get the index
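
A minimal sketch with a made-up DataFrame; newer pandas versions also provide .to_numpy(), which the pandas docs recommend over .values:

import pandas as pd

# hypothetical toy data
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

row = df.loc['x']            # a Series with index ['a', 'b'] and values [1, 3]
print(row.index.tolist())    # ['a', 'b']
print(row.values)            # array([1, 3])
print(row.to_numpy())        # same values; preferred in newer pandas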
