Pandas Study Notes

http://blog.csdn.net/qq_16234613/article/details/64217337

1. Reading data

import pandas as pd
from IPython.display import display  # already available inside Jupyter notebooks

data = pd.read_csv("census.csv")

# Success: show the first record
display(data.head(n=1))
# data is a DataFrame
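
As a self-contained sketch of the same call, the snippet below reads CSV text from an in-memory buffer instead of census.csv. The column names and rows are made up for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for census.csv (made-up rows)
csv_text = "age,income\n39,<=50K\n50,>50K\n38,<=50K\n"

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(type(data).__name__)  # DataFrame
print(data.head(n=1))       # first record only
print(data.shape)           # (number of rows, number of columns)
```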

Read and write functions

Reader                         Writer
read_csv                       to_csv
read_excel                     to_excel
read_hdf                       to_hdf
read_sql                       to_sql
read_json                      to_json
read_msgpack (experimental)    to_msgpack (experimental)
read_html                      to_html
read_gbq (experimental)        to_gbq (experimental)
read_stata                     to_stata
read_sas
read_clipboard                 to_clipboard
read_pickle                    to_pickle    (faster than CSV)

Note: read_msgpack and to_msgpack were removed in pandas 1.0.

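# Save as a CSV file
# (test_df and predictions are assumed to be defined already)
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': predictions})
submission.to_csv("submission.csv", index=False)
# index=False leaves the row index (row labels) out of the file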
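
To see what the index argument changes, here is a minimal sketch that writes the same frame both ways into strings. The PassengerId and Survived values are illustrative, not taken from the original data.

```python
import pandas as pd

submission = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

# With no path argument, to_csv returns the CSV text as a string
with_index = submission.to_csv()
without_index = submission.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header lists only the two data columns
```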

2. Selecting columns

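# Select the column named income
income_raw = data['income']
income = data.loc[:, 'income']
# Select the two columns named name and income
sub_data = data.loc[:, ['name', 'income']]
# Select the column at position 1; positions start at 0, so this is the second column
sub_Data1 = data.iloc[:, 1]
# ':' selects every row; a slice also works, e.g. the first two rows, [0, 2)
sub_Data1 = data.iloc[0:2, 1]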
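
The selections above can be tried on a small frame. This is only a sketch; the three rows and the extra age column are invented for the example.

```python
import pandas as pd

# Made-up data with the column names used in the notes
data = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "income": ["<=50K", ">50K", "<=50K"],
    "age": [39, 50, 38],
})

one_col = data["income"]                     # a Series
same_col = data.loc[:, "income"]             # the same Series, label-based
two_cols = data.loc[:, ["name", "income"]]   # a DataFrame with two columns
by_pos = data.iloc[:, 1]                     # column at position 1, i.e. income
first_two = data.iloc[0:2, 1]                # rows 0 and 1 of that column

print(by_pos.name)
print(list(first_two))
```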

3. Counting occurrences of each value

income_value_counts = income.value_counts()
# The result is a pandas.Series
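
A small sketch of value_counts on made-up data; it also shows the normalize=True option, which returns proportions instead of counts.

```python
import pandas as pd

# Made-up income labels
income = pd.Series(["<=50K", ">50K", "<=50K", "<=50K"])

counts = income.value_counts()
print(type(counts).__name__)   # Series
print(counts["<=50K"])         # 3
print(counts[">50K"])          # 1

# normalize=True gives the share of each value
shares = income.value_counts(normalize=True)
print(shares["<=50K"])         # 0.75
```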

4. Series: basic concepts and attributes

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes.

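A Series is a one-dimensional array whose elements carry labels, that is, an index. The statistical methods inherited from numpy.ndarray have been overridden to skip missing data such as NaN automatically.

Operators between Series (+, -, /, *, **) work element by element, with elements matched by index label rather than by position, so the two Series need not have the same length. In other words, a Series keeps NumPy's array operations (filtering with a Boolean array, scalar multiplication, applying math functions) while preserving the link between each index label and its value.

(This is similar to NumPy's ndarray, although an ndarray combines operands by broadcasting on shape rather than by aligning labels. An official ndarray example:)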

>>> x = np.array([1, 2])
>>> y = np.array([[3], [4]])

>>> x
array([1, 2])

>>> y
array([[3],
       [4]])

>>> x + y
array([[4, 5],
       [5, 6]])
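
For comparison with the shape-based NumPy example above, here is a minimal sketch of label alignment between two Series of different lengths. The labels and values are made up.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

# Values are matched by label; "x" has no partner in b, so its result is NaN
total = a + b
print(total)

# Statistical methods skip NaN by default
print(total.mean())   # 17.5

# Boolean filtering keeps the labels attached to the values
print(a[a > 1])
```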

A bit more background:

Python has a Boolean type with only two values, True and False, and it supports the following operations:

AND: the result is True only when both Boolean values are True.

True and True   # ==> True
True and False   # ==> False
False and True   # ==> False
False and False   # ==> False

OR: the result is True as long as at least one Boolean value is True.

True or True   # ==> True
True or False   # ==> True
False or True   # ==> True
False or False   # ==> False

NOT: turns True into False and False into True:

not True   # ==> False
not False   # ==> True

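Boolean operations are used for conditional logic: depending on whether the result is True or False, the program runs different code.

In Python, and, or and not can also be applied to values of other types. Consider the following code:

a = True
print(a and 'a=T' or 'a=F')

The result is not a Boolean but the string 'a=T'. Why?

Because Python treats 0, the empty string '' and None as false and treats other numbers and non-empty strings as true, and because the and/or operators return one of their operands rather than a Boolean:

True and 'a=T' evaluates to 'a=T'
'a=T' or 'a=F' then evaluates to 'a=T' as well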
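
The same evaluation can be checked step by step. The sketch below (Python 3) also shows a pitfall of this idiom: when the middle operand is itself falsy, the expression falls through to the last operand, which is why a conditional expression is usually the safer way to write it.

```python
a = True
result = a and 'a=T' or 'a=F'
print(result)          # a=T

b = False
other = b and 'b=T' or 'b=F'
print(other)           # b=F

# Pitfall: '' is falsy, so the "true" branch is lost
pitfall = True and '' or 'fallback'
print(repr(pitfall))   # 'fallback'

# A conditional expression keeps the empty string
safe = '' if True else 'fallback'
print(repr(safe))      # ''
```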

Boolean values can also be used in arithmetic, where True counts as 1 and False as 0:
a = True
b = False
c = a + b
print(c)   # prints 1

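So prediction accuracy can be computed as follows (all the background above was really just there to make sense of the one line below...). truth == pred compares element by element and gives a Boolean Series; because True counts as 1 and False as 0, its mean is the fraction of correct predictions.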

def accuracy_score(truth, pred):
    """Return the accuracy of pred relative to truth."""

    # Make sure the number of predictions matches the number of outcomes
    if len(truth) == len(pred):

        # Compute the prediction accuracy as a percentage
        return "Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)

    else:
        return "Number of predictions does not match number of outcomes!"
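
A worked example of the key expression, using four made-up labels, may make the arithmetic concrete:

```python
import pandas as pd

# Made-up ground truth and predictions
truth = pd.Series([1, 0, 1, 1])
pred = pd.Series([1, 0, 0, 1])

matches = truth == pred       # True, True, False, True
accuracy = matches.mean()     # True counts as 1, so this is 3 / 4

print(matches.sum())                     # 3
print("{:.2f}%".format(accuracy * 100))  # 75.00%
```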

By contrast, using + on Python's built-in list type concatenates the two lists, just as it does for strings.
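
A quick side-by-side sketch of the difference, with arbitrary values:

```python
import numpy as np

joined_list = [1, 2] + [3, 4]                  # concatenation
joined_str = "ab" + "cd"                       # concatenation
summed = np.array([1, 2]) + np.array([3, 4])   # element-wise addition

print(joined_list)   # [1, 2, 3, 4]
print(joined_str)    # abcd
print(summed)        # [4 6]
```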

5. Getting the value for a key in a Series

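# Number of respondents whose income is above $50,000.
# Suppose the income column contains only the values '<=50K' and '>50K',
# and income.value_counts() prints:
#  <=50K    34014
#  >50K     11208

n_greater_50k = income.value_counts().get('>50K')
# value_counts() sorts by count in descending order, so the label of the most frequent value is
n_most_group = income.value_counts().index[0]

A Series pairs an array of values with an index; .values returns the underlying values as a NumPy array.

The i-th element (by position) can be read with a.values[i], or equivalently with a.iloc[i].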
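
The lookups in this section can be sketched on a tiny made-up Series. Note that get also accepts a default for keys that are absent, which avoids a KeyError.

```python
import pandas as pd

# Made-up income labels
income = pd.Series(["<=50K", ">50K", "<=50K", "<=50K", ">50K"])
counts = income.value_counts()

n_greater_50k = counts.get(">50K")     # lookup by label
n_unknown = counts.get("unknown", 0)   # default when the label is absent
most_common = counts.index[0]          # label with the highest count
top_count = counts.values[0]           # positional access on the NumPy array
same_count = counts.iloc[0]            # positional access without leaving pandas

print(n_greater_50k, n_unknown, most_common, top_count, same_count)
```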

6. Splitting the data

# Split the data into features and the corresponding labels
income_raw = data['income']
features_raw = data.drop('income', axis = 1)
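
A minimal sketch on made-up data shows that drop returns a new DataFrame and leaves the original untouched. The hours column is invented for the example.

```python
import pandas as pd

# Made-up rows
data = pd.DataFrame({
    "age": [39, 50],
    "hours": [40, 13],
    "income": ["<=50K", ">50K"],
})

income_raw = data["income"]
features_raw = data.drop("income", axis=1)   # same as data.drop(columns="income")

print(list(features_raw.columns))   # ['age', 'hours']
print(list(data.columns))           # ['age', 'hours', 'income']
```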

7. Applying a function to each row or column

import numpy as np

def num_missing(x):
    # Count the missing values in one row or column
    return x.isnull().sum()

# Apply to each column:
print(data.apply(num_missing, axis=0))
# Apply to each row:
print(data.apply(num_missing, axis=1).head())

# Log-transform the skewed features; note the lambda
skewed = ['capital-gain', 'capital-loss']
features_raw[skewed] = data[skewed].apply(lambda x: np.log(x + 1))
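
Here is a self-contained sketch of both uses of apply on made-up numbers. It uses np.log1p, which computes log(1 + x) and is swapped in here as an equivalent of the np.log(x + 1) expression above.

```python
import numpy as np
import pandas as pd

# Made-up values, including some missing entries
data = pd.DataFrame({
    "capital-gain": [0.0, 2174.0, np.nan],
    "capital-loss": [0.0, np.nan, np.nan],
})

# axis=0 passes each column to the function, axis=1 passes each row
missing_per_column = data.apply(lambda col: col.isnull().sum(), axis=0)
missing_per_row = data.apply(lambda row: row.isnull().sum(), axis=1)

# Log transform; zeros stay at zero because log(1 + 0) = 0
logged = data.apply(lambda col: np.log1p(col))

print(missing_per_column)
print(missing_per_row)
print(logged)
```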

8. For
