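pandas data structures: Series
A Series is a one-dimensional, array-like object made up of a sequence of values and an associated set of data labels, called its index.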
In [15]: obj = pd.Series([4,7,-5,3])
In [16]: obj
Out[16]:
0 4
1 7
2 -5
3 3
dtype: int64
In [17]: obj.values
Out[17]: array([ 4, 7, -5, 3])
In [18]: obj.index
Out[18]: RangeIndex(start=0, stop=4, step=1)
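In the printed output, the index is on the left and the values are on the right. When no index is given, pandas assigns the default integer labels 0 through N-1.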
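As a minimal, self-contained sketch of the same idea outside the IPython session: the snippet below builds the Series and pulls out its two parts. It uses `to_numpy()` in place of `.values`; that substitution is mine and not part of the original session.

```python
import pandas as pd

# Build a Series without an explicit index; pandas assigns labels 0..n-1.
obj = pd.Series([4, 7, -5, 3])

values = obj.to_numpy()   # the underlying NumPy array of values
labels = list(obj.index)  # the default integer labels as a plain list

print(values)
print(labels)  # [0, 1, 2, 3]
```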
In [21]: obj2 = pd.Series([4,7,-5,3],index=['a','b','c','d'])
In [22]: obj2
Out[22]:
a 4
b 7
c -5
d 3
dtype: int64
In [23]: obj2.index
Out[23]: Index([u'a', u'b', u'c', u'd'], dtype='object')
In [24]: obj2['a']
Out[24]: 4
In [25]: obj2[obj2>0]
Out[25]:
a 4
b 7
d 3
dtype: int64
In [26]: obj2*2
Out[26]:
a 8
b 14
c -10
d 6
dtype: int64
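You can specify the index yourself and then select values by label. Filtering with a boolean condition and arithmetic such as multiplication both preserve the link between each label and its value.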
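The three operations shown in the session (label lookup, boolean filtering, element-wise arithmetic) can be collected into one runnable sketch. The variable names `single`, `positive`, and `doubled` are illustrative only.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

single = obj2["a"]         # label lookup returns a scalar
positive = obj2[obj2 > 0]  # a boolean mask keeps only the matching labels
doubled = obj2 * 2         # arithmetic is applied element by element

print(single)                # 4
print(list(positive.index))  # ['a', 'b', 'd']
print(doubled["c"])          # -10
```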
In [27]: 'b' in obj2
Out[27]: True
In [28]: 'f' in obj2
Out[28]: False
In [29]: '4' in obj2
Out[29]: False
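A Series can be thought of as a fixed-length, ordered dict: it maps index labels to data values. As with a dict, the in operator tests the labels (the keys), not the values.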
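A short sketch of the dict-like behaviour, assuming the same `obj2` as in the session. It shows that membership is tested against labels rather than values, and that `Series.get` behaves like `dict.get` for a missing label. The variable names are illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

has_label = "b" in obj2               # True: 'b' is an index label
has_value = 4 in obj2                 # False: 4 is a value, not a label
value_present = 4 in obj2.to_numpy()  # True: test the values explicitly

# Like dict.get, Series.get returns a default when the label is missing.
fallback = obj2.get("f", 0)

print(has_label, has_value, value_present, fallback)
```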
In [31]: sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':500}
In [32]: obj3 = pd.Series(sdata)
In [33]: obj3
Out[33]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 500
dtype: int64
In [37]: sdata_index = ['a','b','c','Texas']
In [38]: obj4 = pd.Series(sdata,index=sdata_index)
In [39]: obj4
Out[39]:
a NaN
b NaN
c NaN
Texas 71000.0
dtype: float64
In [40]: pd.isnull(obj4)
Out[40]:
a True
b True
c True
Texas False
dtype: bool
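When an index is passed along with a dict, labels that have no matching key get NaN, and the values become floats to accommodate it. pd.isnull reports which entries are missing.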
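A minimal sketch of detecting and handling the missing entries. It uses a list for the index so the label order is deterministic, and the method form `isna()`, which gives the same result as `pd.isnull`. `n_missing` and `present` are illustrative names.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 500}

# Labels absent from the dict ('a', 'b', 'c') become NaN.
obj4 = pd.Series(sdata, index=["a", "b", "c", "Texas"])

missing = obj4.isna()           # same result as pd.isnull(obj4)
n_missing = int(missing.sum())  # each True counts as 1
present = obj4.dropna()         # keep only the non-missing entries

print(n_missing)            # 3
print(list(present.index))  # ['Texas']
```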
DataFrame
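A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on).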
In [44]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year'
...: :[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}
In [45]: data
Out[45]:
{'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002]}
In [46]: fram = pd.DataFrame(data)
In [47]: fram
Out[47]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
In [48]: pd.DataFrame(data,columns=['year','state','pop'])
Out[48]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
The columns argument sets the order in which the columns appear.
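As a sketch, the column order can be fixed at construction time, as in the session, or changed afterwards by selecting columns with a list. The second approach is an addition of mine, not something the original session shows.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}

# Choose the column order when building the frame.
ordered = pd.DataFrame(data, columns=["year", "state", "pop"])

# Or reorder an existing frame by selecting columns with a list.
fram = pd.DataFrame(data)
reordered = fram[["pop", "year", "state"]]

print(list(ordered.columns))    # ['year', 'state', 'pop']
print(list(reordered.columns))  # ['pop', 'year', 'state']
```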
In [51]: fram2 = pd.DataFrame(data,columns=['year','state','pop','debt']
...: ,index=['one','two','three','four','five'])
In [52]: fram2
Out[52]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
In [53]: fram2['state']
Out[53]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
In [54]: fram2.year
Out[54]:
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
In [56]: fram2.loc['four']
Out[56]:
year 2001
state Nevada
pop 2.4
debt NaN
Name: four, dtype: object
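A DataFrame column can be retrieved by name, either dict-style (fram2['state']) or as an attribute (fram2.year), and a row can be retrieved by its index label with loc. A column passed in columns that is not present in the data, such as debt, is filled with NaN.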
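A compact sketch of the access patterns, assuming the same data as the session. It also shows one caveat worth knowing: attribute access does not work for the `pop` column, because `DataFrame.pop` is an existing method, so bracket access is the safer habit. `iloc` (positional access) is included for comparison and is not in the original session.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
fram2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop"],
    index=["one", "two", "three", "four", "five"],
)

state_col = fram2["state"]       # column by name; works for any column name
row_by_label = fram2.loc["four"]  # row by index label
row_by_position = fram2.iloc[3]   # the same row by integer position
cell = fram2.loc["four", "pop"]   # single value: row label, column name

# fram2.pop is the DataFrame.pop method, not the 'pop' column.
is_method = callable(fram2.pop)

print(row_by_label["state"])  # Nevada
print(cell)                   # 2.4
```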
In [58]: fram2['debt'] = 16.5
In [59]: fram2
Out[59]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
In [60]: fram2['debt'] = np.arange(5)
In [61]: fram2
Out[61]:
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
In [62]: va = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
In [64]: fram2['debt'] = va
In [65]: fram2
Out[65]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
In [67]: fram2['eastern'] = fram2.state == 'Ohio'
In [68]: fram2
Out[68]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
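Columns are assigned by name. A scalar is broadcast to every row, and an array must match the number of rows. When a Series is assigned, its labels are aligned with the DataFrame's index, and rows with no matching label are set to NaN. Assigning to a column name that does not exist yet creates the column.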
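The assignment rules can be sketched on a smaller, made-up frame (three rows instead of five, chosen only to keep the example short):

```python
import numpy as np
import pandas as pd

fram2 = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

fram2["debt"] = 16.5          # a scalar is broadcast to every row
fram2["debt"] = np.arange(3)  # an array must match the number of rows

# A Series is aligned on index labels; rows without a match get NaN.
va = pd.Series([-1.2, -1.5], index=["two", "three"])
fram2["debt"] = va

# Assigning to a new name creates the column.
fram2["eastern"] = fram2["state"] == "Ohio"

print(fram2)
```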
In [69]: del fram2['eastern']
In [70]: fram2
Out[70]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
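del removes a column. In the pandas version used here, a column returned by indexing is a view on the underlying data, not a copy, so in-place modifications to it are reflected in the original DataFrame. Newer pandas versions with Copy-on-Write enabled behave differently, so call the column's copy method explicitly whenever an independent copy is needed.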
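A small sketch of removing columns and of taking an explicit copy, on a made-up two-row frame. `drop` is shown as the non-mutating alternative to `del`; it is not part of the original session.

```python
import pandas as pd

fram = pd.DataFrame(
    {"year": [2000, 2001], "pop": [1.5, 1.7], "debt": [0.0, 1.0]}
)

# drop returns a new DataFrame and leaves the original unchanged.
without_debt = fram.drop(columns=["debt"])

# An explicit copy is independent of the original frame.
pop_copy = fram["pop"].copy()
pop_copy.iloc[0] = 99.0

# del removes the column from the original frame in place.
del fram["debt"]

print(list(fram.columns))  # ['year', 'pop']
print(fram.loc[0, "pop"])  # 1.5
```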