Data Analysis with Python - Getting Started with Pandas
- Built on NumPy
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
1. Two data structures
1. Series
A Series resembles a Python dictionary: it maps index labels to values.
Create Series
# With no index specified, a default integer index from 0 to N-1 is created
In [54]: obj = Series([1,2,3,4,5])
In [55]: obj
Out[55]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
# Specify the index explicitly
In [56]: obj1 = Series([1,2,3,4,5],index=['a','b','c','d','e'])
In [57]: obj1
Out[57]:
a 1
b 2
c 3
d 4
e 5
dtype: int64
# Convert a Python dict to a Series
In [63]: dic = {'a':1,'b':2,'c':3}
In [64]: obj2 = Series(dic)
In [65]: obj2
Out[65]:
a 1
b 2
c 3
dtype: int64
Array operations on Series (filtering against boolean arrays, scalar multiplication, applying functions, etc.) still preserve the correspondence between indices and values.
If no value can be found for an index label, the result for that label is NaN. In arithmetic operations, Series are aligned automatically by index label, and a label present in only one operand also yields NaN.
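The alignment behaviour described above can be sketched as follows. The two Series and their values are made up for illustration; only the labels 'b' and 'c' appear in both.

```python
import pandas as pd

# Illustrative data: the two indexes overlap only on 'b' and 'c'.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on index labels; 'a' and 'd' exist on one side only,
# so their results are NaN.
total = s1 + s2
print(total)

# Boolean filtering keeps each value paired with its original label.
print(s1[s1 > 1])
```

Because alignment is by label rather than by position, the order of the labels in each operand does not matter.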
2. DataFrame
A DataFrame is a tabular data structure with both row and column indices.
Create DataFrame
# Pass in a dict of equal-length lists
In [75]: data = {'name':['nadech','bob'],'age':[23,25],'sex':['male','female']}
In [76]: DataFrame(data)
Out[76]:
age name sex
0 23 nadech male
1 25 bob female
# Specify the column order
In [77]: DataFrame(data,columns=['sex','name','age'])
Out[77]:
sex name age
0 male nadech 23
1 female bob 25
# Create a DataFrame from a nested dict
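The notes give no example for the nested-dict case, so here is a minimal sketch. The dict `nested` and the name `frame3` are hypothetical; the data reuses the names and ages from the example above.

```python
import pandas as pd

# Hypothetical nested dict: outer keys become column labels,
# inner keys become row labels.
nested = {
    'age': {'nadech': 23, 'bob': 25},
    'sex': {'nadech': 'male', 'bob': 'female'},
}
frame3 = pd.DataFrame(nested)
print(frame3)

# Transposing swaps the roles of the outer and inner keys.
print(frame3.T)
```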
Operations on DataFrames
# Get a single column
In [82]: frame['age']  # equivalent to frame.age
Out[82]:
0 23
1 25
Name: age, dtype: int64
# Assign a value to a column
In [86]: frame2
Out[86]:
age sex name grade
0 23 male nadech NaN
1 25 female bob NaN
In [87]: frame2['grade']=12
In [88]: frame2
Out[88]:
age sex name grade
0 23 male nadech 12
1 25 female bob 12
Index object
In [14]: index = frame.index
In [15]: index
Out[15]: RangeIndex(start=0, stop=3, step=1)
# Index objects are immutable
In [16]: index[0]=3
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
2. Basic functions
1. Reindexing of Series and DataFrame
#Series
In [25]: obj = Series(['nadech','aguilera','irenieee'],index=['a','b','c'])
In [26]: obj
Out[26]:
a nadech
b aguilera
c irenieee
dtype: object
In [27]: obj.reindex(['c','b','a'])
Out[27]:
c irenieee
b aguilera
a nadech
dtype: object
#DataFrame
In [21]: frame
Out[21]:
one two three
a 0 1 2
b 3 4 5
c 6 7 8
# A list passed directly reindexes the rows
In [22]: frame.reindex(['c','b','a'])
Out[22]:
one two three
c 6 7 8
b 3 4 5
a 0 1 2
# Reindexing the columns requires the columns argument
In [24]: frame.reindex(columns=['three','two','one'])
Out[24]:
three two one
a 2 1 0
b 5 4 3
c 8 7 6
2. Delete the item on the specified axis
#Series
In [28]: obj.drop('c')
Out[28]:
a nadech
b aguilera
dtype: object
In [30]: obj.drop(['b','a'])
Out[30]:
c irenieee
dtype: object
#DataFrame
For a DataFrame, drop() removes rows by label by default; removing columns requires axis=1.
In [39]: frame
Out[39]:
one two three
a 0 1 2
b 3 4 5
c 6 7 8
In [40]: frame.drop('a')
Out[40]:
one two three
b 3 4 5
c 6 7 8
In [41]: frame.drop('one',axis=1)
Out[41]:
two three
a 1 2
b 4 5
c 7 8
3. Indexing, Selection and Filtering
Series indexing
In [8]: obj
Out[8]:
a 0
b 1
c 2
d 3
dtype: int32
In [9]: obj['a']
Out[9]: 0
In [10]: obj[0]
Out[10]: 0
# Note: slicing by label includes the end point, unlike slicing by 0-N position
In [11]: obj[2:3]
Out[11]:
c 2
dtype: int32
In [12]: obj['c':'d']
Out[12]:
c 2
d 3
dtype: int32
DataFrame indexing
# Select columns of frame
In [24]: frame
Out[24]:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
In [25]: frame['one']
Out[25]:
a 0
b 4
c 8
d 12
Name: one, dtype: int32
In [26]: frame[['one','two']]
Out[26]:
one two
a 0 1
b 4 5
c 8 9
d 12 13
# Select rows of frame by label (.loc replaces .ix, which was removed in pandas 1.0)
In [33]: frame.loc['a']
Out[33]:
one 0
two 1
three 2
four 3
Name: a, dtype: int32
In [31]: frame.loc[['a','b']]
Out[31]:
one two three four
a 0 1 2 3
b 4 5 6 7
# Select rows and columns at the same time
In [35]: frame.loc[['a','b'],['one','two']]
Out[35]:
one two
a 0 1
b 4 5
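Older pandas code often used .ix for row selection; it was removed in pandas 1.0. A minimal sketch of the same selections with label-based .loc and position-based .iloc, rebuilding a 4x4 frame like the one above (the construction with np.arange is an assumption about how the frame was made):

```python
import numpy as np
import pandas as pd

# Assumed construction of a frame matching the one shown above.
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

row_a = frame.loc['a']                          # one row, selected by label
block = frame.loc[['a', 'b'], ['one', 'two']]   # rows and columns by label
same_block = frame.iloc[:2, :2]                 # the same block by position

# Filtering: keep the rows where column 'three' is greater than 5.
filtered = frame[frame['three'] > 5]
print(filtered)
```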
4. Arithmetic operations and data alignment
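# When the operands have different indexes, the result index is their union, with NaN where labels do not overlap; the methods below accept a fill_value argument
- add()
- sub()
- div()
- mul()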
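A minimal sketch of the difference between the plain operator and the method form with fill_value. The two small frames are made up for illustration.

```python
import pandas as pd

# Illustrative frames whose row indexes overlap only on 'b'.
df1 = pd.DataFrame({'one': [1.0, 2.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'one': [10.0, 20.0]}, index=['b', 'c'])

# The plain operator leaves NaN wherever a label is missing on one side.
plain = df1 + df2

# With fill_value=0, the missing side is treated as 0 before adding.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

Note that fill_value only substitutes for a value missing on one side; a cell missing on both sides would still be NaN.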
5. Operations on Series and DataFrame
# The Series index is matched against the DataFrame columns, then broadcast down the rows
In [46]: frame
Out[46]:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
In [47]: obj = frame.loc['a']
In [48]: obj
Out[48]:
one 0
two 1
three 2
four 3
Name: a, dtype: int32
In [49]: frame - obj
Out[49]:
one two three four
a 0 0 0 0
b 4 4 4 4
c 8 8 8 8
d 12 12 12 12
# You can instead match the Series against the DataFrame's row index and broadcast across the columns
In [51]: frame
Out[51]:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
In [52]: obj2 = Series(np.arange(4),index=['a','b','c','d'])
In [53]: obj2
Out[53]:
a 0
b 1
c 2
d 3
dtype: int32
In [54]: frame.sub(obj2,axis=0) # axis=0 matches on the row index, axis=1 on the columns
Out[54]:
one two three four
a 0 1 2 3
b 3 4 5 6
c 6 7 8 9
d 9 10 11 12
5. Sort
#sort by index on axis
#Series
In [6]: obj
Out[6]:
a 0
c 1
b 2
d 3
In [8]: obj.sort_index()
Out[8]:
a 0
b 2
c 1
d 3
dtype: int32
#DataFrame
frame.sort_index()
frame.sort_index(axis=1)
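The two DataFrame calls above are shown without output, so here is a small sketch of what they do. The frame is made up, with deliberately unsorted labels, and ascending=False is included as an extra illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with unsorted row and column labels.
frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['b', 'a'],
                     columns=['two', 'three', 'one'])

by_rows = frame.sort_index()            # sort the row labels
by_cols = frame.sort_index(axis=1)      # sort the column labels
descending = frame.sort_index(axis=1, ascending=False)

print(by_rows)
print(by_cols)
```

Both calls return a new object and leave `frame` unchanged.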
6. obj.index.is_unique
Can be used to determine whether the index is unique
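A short sketch of why uniqueness matters, using a made-up Series with a repeated label: selecting a duplicated label returns a Series, while a unique label returns a scalar.

```python
import pandas as pd

# Illustrative Series in which the label 'a' appears twice.
dup = pd.Series(range(4), index=['a', 'a', 'b', 'c'])

print(dup.index.is_unique)   # the index contains a duplicate label

both_a = dup['a']            # duplicated label: a Series with two entries
only_b = dup['b']            # unique label: a single scalar value
print(both_a)
print(only_b)
```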
3. Aggregate and Calculate Descriptive Statistics
Descriptive and summary statistics
- count: number of non-NA values
- describe: summary statistics for a Series or for each DataFrame column
- min/max: minimum and maximum values of each column
- argmin/argmax: integer positions of the minimum and maximum values
- idxmin/idxmax: index labels of the minimum and maximum values
- quantile: sample quantile
- sum: sum of each column
- mean: mean of each column
- median: median (50% quantile) of each column
- mad: mean absolute deviation from the mean (removed in pandas 2.0)
- var: variance of each column
- std: standard deviation of each column
- skew: skewness of the sample values (third moment)
- kurt: kurtosis of the sample values (fourth moment)
- cumsum: cumulative sum of the sample values
- cummin/cummax: cumulative minimum and cumulative maximum
- cumprod: cumulative product
- diff: first-order difference
- pct_change: percent change
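A few of the methods listed above, sketched on a small made-up frame that contains missing values. By default these reductions skip NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data with some missing values.
df = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

col_sums = df.sum()         # one value per column, NaN skipped
row_sums = df.sum(axis=1)   # one value per row
largest = df.idxmax()       # index label of each column's maximum
running = df.cumsum()       # cumulative sum down each column
summary = df.describe()     # count, mean, std, min, quartiles, max

print(col_sums)
print(largest)
print(summary)
```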
Unique values, value counts, and membership for a Series
- obj.unique() returns an array of the unique values
- obj.value_counts() counts the number of occurrences of each value
- pd.value_counts(obj.values) computes the same counts through the top-level function
- isin([]) tests whether each value of the Series is contained in the sequence passed in
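The three Series methods above, sketched on a made-up Series of letters:

```python
import pandas as pd

# Illustrative data with repeated values.
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()        # unique values, in order of first appearance
counts = obj.value_counts()   # occurrences per value, most frequent first
mask = obj.isin(['b', 'c'])   # boolean Series: is each value 'b' or 'c'?
subset = obj[mask]            # keep only the matching values

print(uniques)
print(counts)
print(subset)
```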
4. Handling missing data
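Methods for handling NaN
- dropna: removes missing values
- fillna: fills missing values with a given value or by a fill method
- isnull: returns a boolean mask marking which values are missing
- notnull: the negation of isnull
DataFrame.dropna() is more involved: by default it drops every row that contains a NaN, while how='all' drops only the rows that are entirely NaN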
In [49]: fram1
Out[49]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [50]: cleaned = fram1.dropna()
In [51]: cleaned
Out[51]:
0 1 2
0 1.0 6.5 3.0
In [52]: fram1.dropna(how='all')
Out[52]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
# To drop columns in the same way, pass axis=1
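A sketch of the column case, using the same values as fram1 above plus an extra all-NaN column added here purely for illustration. The thresh argument, which keeps rows with at least that many non-NaN values, is shown as a related option.

```python
import numpy as np
import pandas as pd

# Same values as fram1 above.
fram1 = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.0]])
fram1[3] = np.nan   # illustrative extra column that is entirely NaN

# axis=1 applies the rule to columns: only column 3 is all NaN.
cols_kept = fram1.dropna(axis=1, how='all')

# thresh=2 keeps the rows that have at least two non-NaN values.
rows_kept = fram1.dropna(thresh=2)

print(cols_kept)
print(rows_kept)
```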
Fill missing values
obj.fillna(value) fills every missing value with the same value
fram.fillna({1:0.1,2:0.2}) For a DataFrame, a dict specifies a separate fill value for each column
# Passing method (e.g. 'ffill') fills each column from the previous non-null value; limit caps the number of consecutive fills in each column
inplace=True modifies the existing object; by default fillna returns a new object
In [57]: df
Out[57]:
0 1 2
0 -0.018286 0.246567 1.115108
1 0.722105 0.984472 -1.709935
2 1.477394 NaN 1.362234
3 0.077912 NaN 0.414627
4 0.530048 NaN NaN
5 0.294424 NaN NaN
In [58]: df.fillna(method='ffill')
Out[58]:
0 1 2
0 -0.018286 0.246567 1.115108
1 0.722105 0.984472 -1.709935
2 1.477394 0.984472 1.362234
3 0.077912 0.984472 0.414627
4 0.530048 0.984472 0.414627
5 0.294424 0.984472 0.414627
In [59]: df.fillna(method='ffill',limit=2)
Out[59]:
0 1 2
0 -0.018286 0.246567 1.115108
1 0.722105 0.984472 -1.709935
2 1.477394 0.984472 1.362234
3 0.077912 0.984472 0.414627
4 0.530048 NaN 0.414627
5 0.294424 NaN 0.414627
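A compact sketch of the fill options on a small made-up frame. It uses the ffill() method, which performs the same forward fill as fillna(method='ffill') and takes the same limit argument.

```python
import numpy as np
import pandas as pd

# Illustrative frame with trailing NaN runs in columns 1 and 2.
df = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                   1: [0.5, np.nan, np.nan, np.nan],
                   2: [9.0, 8.0, np.nan, np.nan]})

per_column = df.fillna({1: 0.1, 2: 0.2})   # a different fill value per column
forward = df.ffill()                        # carry the last valid value forward
capped = df.ffill(limit=2)                  # at most 2 consecutive fills

print(per_column)
print(forward)
print(capped)
```

All three calls return new objects, so `df` itself still contains its NaN values afterwards.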
5. Hierarchical Index
A DataFrame and a hierarchically indexed Series can be converted into each other
frame.stack() / unstack()
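A minimal sketch of the round trip on a small made-up frame: stack() pivots the columns into an inner index level, and unstack() pivots that level back into columns.

```python
import numpy as np
import pandas as pd

# Illustrative 2x3 frame.
frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['a', 'b'],
                     columns=['one', 'two', 'three'])

# stack(): columns become the inner level of a two-level index on a Series.
stacked = frame.stack()
print(stacked)

# Values in the stacked Series are addressed by (row label, column label).
print(stacked.loc[('a', 'two')])

# unstack(): the inner level is pivoted back into columns.
restored = stacked.unstack()
print(restored)
```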