Machine Learning from Scratch: Study Notes (18) - Pandas Notes (Part 2)


1. Counting the occurrences of each element in a column

Thanks to @waple_0820 for providing these two approaches:

df.groupby(['colname'], as_index=False)['colname'].agg({'cnt': 'count'})  # approach 1
df['colname'].value_counts()                                              # approach 2
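
Note that approach 1 relies on passing a dict to .agg() to rename the output column, which newer pandas versions no longer support. A minimal sketch of equivalent counts on current pandas, using a hypothetical toy DataFrame:

import pandas as pd

# hypothetical toy data
df = pd.DataFrame({'colname': ['a', 'b', 'a', 'c', 'a']})

print(df.groupby('colname').size().reset_index(name='cnt'))  # DataFrame with the counts in a 'cnt' column
print(df['colname'].value_counts())                          # Series of counts, sorted by frequency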

2. Splitting the original data into train and test sets

sklearn provides a function called train_test_split that splits the data in exactly this way. The usage is as follows:

from sklearn.model_selection import train_test_split
X = df.drop('class', axis=1)
Y = df['class']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 42)

X and Y are the features and labels extracted from the original dataset; calling train_test_split() then splits them into training and test sets. test_size is the fraction of all samples that goes into the test set. random_state is a seed for the random number generator: if you set it, the split is identical on every run; if you leave it unset, the training and test samples differ from run to run, although their sizes stay the same.
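
A minimal sketch of the random_state behaviour, reusing the X and Y defined above (the variable names here are just for illustration):

from sklearn.model_selection import train_test_split

# same seed -> identical split on every call
X_a, X_b, _, _ = train_test_split(X, Y, test_size=0.33, random_state=42)
X_c, X_d, _, _ = train_test_split(X, Y, test_size=0.33, random_state=42)
print(X_b.index.equals(X_d.index))   # True

# no seed -> a (usually) different split on each call, but the same sizes
X_e, X_f, _, _ = train_test_split(X, Y, test_size=0.33)
print(len(X_f) == len(X_b))          # True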

3. Evaluating a model's precision, recall and F1

These again come from functions in sklearn; the usage is as follows:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("ACC", accuracy_score(Y_test, Y_pred))                      # accuracy
print("PREC", precision_score(Y_test, Y_pred, average="micro"))   # precision
print("REC", recall_score(Y_test, Y_pred, average="micro"))       # recall
print("F-score", f1_score(Y_test, Y_pred, average="micro"))       # F1

In these metric functions, the first argument is the true labels of the test data and the second is the predicted results. They also take several other parameters, whose roles can be seen in the docstring:

y_true : 1d array-like, or label indicator array / sparse matrix
    Ground truth (correct) target values.

y_pred : 1d array-like, or label indicator array / sparse matrix
    Estimated targets as returned by a classifier.

labels : list, optional
    The set of labels to include when ``average != 'binary'``, and their
    order if ``average is None``. Labels present in the data can be
    excluded, for example to calculate a multiclass average ignoring a
    majority negative class, while labels not present in the data will
    result in 0 components in a macro average. For multilabel targets,
    labels are column indices. By default, all labels in ``y_true`` and
    ``y_pred`` are used in sorted order.

    .. versionchanged:: 0.17
       parameter *labels* improved for multiclass problem.

pos_label : str or int, 1 by default
    The class to report if ``average='binary'`` and the data is binary.
    If the data are multiclass or multilabel, this will be ignored;
    setting ``labels=[pos_label]`` and ``average != 'binary'`` will report
    scores for that label only.

average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', \
                   'weighted']
    This parameter is required for multiclass/multilabel targets.
    If ``None``, the scores for each class are returned. Otherwise, this
    determines the type of averaging performed on the data:

    ``'binary'``:
        Only report results for the class specified by ``pos_label``.
        This is applicable only if targets (``y_{true,pred}``) are binary.
    ``'micro'``:
        Calculate metrics globally by counting the total true positives,
        false negatives and false positives.
    ``'macro'``:
        Calculate metrics for each label, and find their unweighted
        mean.  This does not take label imbalance into account.
    ``'weighted'``:
        Calculate metrics for each label, and find their average, weighted
        by support (the number of true instances for each label). This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall.
    ``'samples'``:
        Calculate metrics for each instance, and find their average (only
        meaningful for multilabel classification where this differs from
        :func:`accuracy_score`).

sample_weight : array-like of shape = [n_samples], optional
    Sample weights.

The first two parameters need no further explanation.

labels is the set of labels to include when average != 'binary', and their order when average is None. Labels present in the data can be excluded, for example to compute a multiclass average that ignores a majority negative class, while labels not present in the data contribute 0 to a macro average. For multilabel targets, labels are column indices. By default, all labels appearing in y_true and y_pred are used, in sorted order.

pos_label can be a str or an int and defaults to 1. It is only used for binary problems (average='binary'); if the data are multiclass or multilabel it is ignored.

average controls how a binary metric is extended to multiclass or multilabel problems: the data are treated as a collection of binary problems, one per class, the metric is computed for each of them, and the per-class scores are then averaged, which is useful in many situations. The average parameter specifies how this averaging is done.

average = binary: for a binary problem, report the result only for the class specified by pos_label.

average = macro: compute the mean of the per-class binary metrics, giving every class the same weight. This can highlight performance on infrequent classes, but it also assumes all classes are equally important, so the (usually poor) score on a rare class can strongly pull down the macro average.


average = micro: give every sample-class pair an equal contribution to the overall metric (apart from the effect of sample_weight). Rather than summing the metric per class, it sums the numerators and denominators that make up the per-class metrics and computes a single overall quotient. Micro-averaging is typically used in multilabel settings.

average = weighted: average the per-class binary metrics, weighting each class's score by its support (the number of true instances of that class). This can replace macro-averaging when the label counts are imbalanced.

average = samples: only applies to multilabel problems. It does not compute a per-class metric; instead, for each sample in the evaluation data it computes the metric over that sample's true and predicted labels, and then returns the (sample_weight-weighted) average over samples.

average = None returns an array containing the score for each class.
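
A minimal sketch with made-up labels, showing how average and labels change the result of f1_score:

from sklearn.metrics import f1_score

# hypothetical 3-class ground truth and predictions
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

print(f1_score(y_true, y_pred, average=None))        # per-class F1, roughly [0.8, 0.5, 0.67]
print(f1_score(y_true, y_pred, average='micro'))      # one score from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average='macro'))      # unweighted mean of the per-class scores
print(f1_score(y_true, y_pred, average='weighted'))   # mean weighted by each class's support
print(f1_score(y_true, y_pred, labels=[1, 2], average='macro'))  # average over classes 1 and 2 only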

4. Working with Series objects

Sometimes indexing with loc or iloc returns a Series. Accessing a Series directly gives you the data together with its index, so the following can be used to get the index and the values separately.

obj.values  # get the values
obj.index   # get the index
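
A minimal sketch with a made-up DataFrame; newer pandas versions also provide .to_numpy(), which the pandas docs recommend over .values:

import pandas as pd

# hypothetical toy data
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

row = df.loc['x']            # a Series with index ['a', 'b'] and values [1, 3]
print(row.index.tolist())    # ['a', 'b']
print(row.values)            # array([1, 3])
print(row.to_numpy())        # same values; preferred in newer pandas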
