Data Analysis with Python - Getting Started with Pandas

  • Built on NumPy
  • Typical imports: from pandas import Series, DataFrame and import pandas as pd

1. Two data structures

1.Series

A Series is similar to a Python dictionary: it pairs each index label with a value.

Create Series

# No index specified: a default integer index 0..N-1 is created
In [54]: obj = Series([1,2,3,4,5])

In [55]: obj
Out[55]:
0    1
1    2
2    3
3    4
4    5
dtype: int64
# Specify the index explicitly
In [56]: obj1 = Series([1,2,3,4,5],index=['a','b','c','d','e'])

In [57]: obj1
Out[57]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

# Convert a Python dict to a Series
In [63]: dic = {'a':1,'b':2,'c':3}

In [64]: obj2 = Series(dic)

In [65]: obj2
Out[65]:
a    1
b    2
c    3
dtype: int64

Array operations on a Series (filtering with a boolean array, scalar multiplication, applying functions, and so on) preserve the link between each index label and its value.
If no value can be found for an index label, the value is shown as NaN. In arithmetic operations, the operands are aligned automatically by index label, and any label present in only one operand produces NaN in the result.
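
The two points above can be sketched as follows. The variable names and values are made up for illustration and do not come from the original session.

```python
import pandas as pd

# Building a Series from a dict with an explicit index:
# a label that is missing from the dict ('d') gets NaN.
prices = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['a', 'b', 'c', 'd'])

# Boolean filtering and scalar multiplication keep each label with its value.
big = prices[prices > 1]
doubled = prices * 2

# Arithmetic aligns on labels; labels present on only one side yield NaN.
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
total = left + right
print(total)
```

Here only 'b' and 'c' exist in both operands, so only those two labels carry a numeric sum.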

2.DataFrame

A DataFrame is a tabular data structure with both row and column indices.

Create DataFrame

# Pass in a dict of equal-length lists
In [75]: data = {'name':['nadech','bob'],'age':[23,25],'sex':['male','female']}

In [76]: DataFrame(data)
Out[76]:
   age    name     sex
0   23  nadech    male
1   25     bob  female

# Specify the column order
In [77]: DataFrame(data,columns=['sex','name','age'])
Out[77]:
      sex    name  age
0    male  nadech   23
1  female     bob   25
# A DataFrame can also be created from a nested dict
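
A minimal sketch of the nested-dict case, assuming the same example people as above. The 'grade' entry is a hypothetical addition to show what happens when an inner key is missing.

```python
import pandas as pd

# Outer keys become columns, inner keys become row labels.
nested = {
    'age': {'nadech': 23, 'bob': 25},
    'sex': {'nadech': 'male', 'bob': 'female'},
    'grade': {'nadech': 12},   # no entry for 'bob', so that cell becomes NaN
}
frame = pd.DataFrame(nested)
print(frame)
```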

Operations on DataFrames

# Get a single column
In [82]: frame['age']   # or, equivalently, frame.age
Out[82]:
0    23
1    25
Name: age, dtype: int64

# Assign a value to a column
In [86]: frame2
Out[86]:
   age     sex    name grade
0   23    male  nadech   NaN
1   25  female     bob   NaN

In [87]: frame2['grade']=12

In [88]: frame2
Out[88]:
   age     sex    name  grade
0   23    male  nadech     12
1   25  female     bob     12

Index object

In [14]: index = frame.index

In [15]: index
Out[15]: RangeIndex(start=0, stop=3, step=1)
# Index objects are immutable
In [16]: index[0]=3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

2. Basic functions

1. Reindexing of Series and DataFrame

#Series
In [25]: obj = Series(['nadech','aguilera','irenieee'],index=['a','b','c'])

In [26]: obj
Out[26]:
a      nadech
b    aguilera
c    irenieee
dtype: object

In [27]: obj.reindex(['c','b','a'])
Out[27]:
c    irenieee
b    aguilera
a      nadech
dtype: object


#DataFrame
In [21]: frame
Out[21]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8
# A list passed directly reindexes the rows
In [22]: frame.reindex(['c','b','a'])
Out[22]:
   one  two  three
c    6    7      8
b    3    4      5
a    0    1      2
# Reindexing the columns requires the columns parameter
In [24]: frame.reindex(columns=['three','two','one'])
Out[24]:
   three  two  one
a      2    1    0
b      5    4    3
c      8    7    6
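
The examples above only reorder existing labels. As a hedged sketch of what reindex does with a label that is not there yet, using a frame rebuilt to match the one shown (the extra label 'd' is an assumption for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'],
                     columns=['one', 'two', 'three'])

# A label that does not exist yet ('d') is added as a row of NaN.
with_new_row = frame.reindex(['a', 'b', 'c', 'd'])

# fill_value replaces the NaN that reindexing would otherwise introduce.
filled = frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)

# Rows and columns can be reindexed in a single call.
both = frame.reindex(index=['c', 'a'], columns=['three', 'one'])
print(both)
```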

2. Delete the item on the specified axis

#Series
In [28]: obj.drop('c')
Out[28]:
a      nadech
b    aguilera
dtype: object

In [30]: obj.drop(['b','a'])
Out[30]:
c    irenieee
dtype: object

#DataFrame

For a DataFrame, drop removes rows by label by default; to remove columns, specify axis=1.

In [39]: frame
Out[39]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8

In [40]: frame.drop('a')
Out[40]:
   one  two  three
b    3    4      5
c    6    7      8

In [41]: frame.drop('one',axis=1)
Out[41]:
   two  three
a    1      2
b    4      5
c    7      8

3. Indexing, Selection and Filtering

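Series index
In [8]: obj
Out[8]:
a    0
b    1
c    2
d    3
dtype: int32

In [9]: obj['a']
Out[9]: 0

In [10]: obj[0]   # positional lookup; recent pandas versions prefer obj.iloc[0]
Out[10]: 0

# Note: slicing by position (0..N-1) excludes the end point, while slicing by label includes it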
In [11]: obj[2:3]
Out[11]:
c    2
dtype: int32

In [12]: obj['c':'d']
Out[12]:
c    2
d    3
dtype: int32
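
The difference between the two kinds of slice can be made explicit with .loc (labels) and .iloc (positions). A small sketch, with the Series rebuilt to match the one above:

```python
import pandas as pd

obj = pd.Series(range(4), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['a']           # lookup by label
by_position = obj.iloc[0]         # lookup by position

pos_slice = obj.iloc[2:3]         # end position is excluded
label_slice = obj.loc['c':'d']    # end label is included

print(pos_slice)
print(label_slice)
```

Using .loc and .iloc avoids any ambiguity about whether an integer means a label or a position.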

DataFrame index

# Select columns of frame
In [24]: frame
Out[24]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [25]: frame['one']
Out[25]:
a     0
b     4
c     8
d    12
Name: one, dtype: int32

In [26]: frame[['one','two']]
Out[26]:
   one  two
a    0    1
b    4    5
c    8    9
d   12   13
# Select rows of frame by label (.ix was removed in pandas 1.0; .loc is the label-based indexer)
In [33]: frame.loc['a']
Out[33]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32

In [31]: frame.loc[['a','b']]
Out[31]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7

# Select rows and columns at the same time
In [35]: frame.loc[['a','b'],['one','two']]
Out[35]:
   one  two
a    0    1
b    4    5
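
As a sketch of the same selections in current pandas, with the frame rebuilt to match the one shown. The boolean filter on column 'three' is an illustrative assumption, not part of the original session.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

row_a = frame.loc['a']                         # one row, by label
sub = frame.loc[['a', 'b'], ['one', 'two']]    # rows and columns, by label
same_sub = frame.iloc[:2, :2]                  # the same block, by position

# Filtering: keep the rows where column 'three' is greater than 5.
mask_rows = frame[frame['three'] > 5]
print(mask_rows)
```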

4. Arithmetic operations and data alignment

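# When the operands have different indexes, the result index is their union, with NaN where the labels do not overlap; the methods below accept a fill_value argument
  • add()
  • sub()
  • div()
  • mul()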
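
A minimal sketch of the alignment behavior and of fill_value, using two small hypothetical frames (df1 and df2 are not from the original notes):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape(2, 2),
                   index=['a', 'b'], columns=['one', 'two'])
df2 = pd.DataFrame(np.arange(6.).reshape(2, 3),
                   index=['b', 'c'], columns=['one', 'two', 'three'])

# The + operator: the result covers the union of labels, and every cell
# that is missing from either operand becomes NaN.
plain = df1 + df2

# add() with fill_value: a cell missing from one operand is treated as 0.
# A cell missing from both operands still comes out as NaN.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

In this sketch the cell ('a', 'three') exists in neither frame, so it stays NaN even with fill_value.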

5. Operations on Series and DataFrame

# The Series index is matched against the DataFrame's columns, then the operation is broadcast down the rows
In [46]: frame
Out[46]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [47]: obj = frame.loc['a']

In [48]: obj
Out[48]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32

In [49]: frame - obj
Out[49]:
   one  two  three  four
a    0    0      0     0
b    4    4      4     4
c    8    8      8     8
d   12   12     12    12

# You can instead match the Series against the DataFrame's row index and broadcast across the columns
In [51]: frame
Out[51]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [52]: obj2 = Series(np.arange(4),index=['a','b','c','d'])

In [53]: obj2
Out[53]:
a    0
b    1
c    2
d    3
dtype: int32

In [54]: frame.sub(obj2,axis=0)   # axis=0 matches on the row index, axis=1 on the columns
Out[54]:
   one  two  three  four
a    0    1      2     3
b    3    4      5     6
c    6    7      8     9
d    9   10     11    12

6. Sort

#Sort by index along an axis

#Series
In [6]: obj
Out[6]:
a    0
c    1
b    2
d    3
dtype: int32

In [8]: obj.sort_index()
Out[8]:
a    0
b    2
c    1
d    3
dtype: int32

#DataFrame
frame.sort_index()         # sort by row labels
frame.sort_index(axis=1)   # sort by column labels

7. obj.index.is_unique can be used to determine whether the index labels are unique
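
A short sketch of both points, assuming a Series like the one above plus a hypothetical Series dup with a repeated label:

```python
import pandas as pd

obj = pd.Series([0, 1, 2, 3], index=['a', 'c', 'b', 'd'])

by_label = obj.sort_index()                      # ascending by label
by_label_desc = obj.sort_index(ascending=False)  # descending by label

# With duplicate labels, a label lookup returns every matching value.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
print(obj.index.is_unique)
print(dup.index.is_unique)
print(dup.loc['a'])
```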

3. Aggregate and Calculate Descriptive Statistics

Descriptive and summary statistics

  • count returns the number of non-NA values
  • describe computes summary statistics for a Series or for each DataFrame column
  • min/max return the minimum and maximum values (per column for a DataFrame)
  • argmin/argmax return the integer positions of the minimum and maximum
  • idxmin/idxmax return the index labels of the minimum and maximum
  • quantile computes a sample quantile
  • sum() calculates the sum of each column
  • mean() calculates the mean of each column
  • median calculates the median (50% quantile) of each column
  • mad() calculates the mean absolute deviation from the mean (removed in pandas 2.0)
  • var computes the variance of each column
  • std calculates the standard deviation of each column
  • skew computes the skewness of the sample values (third moment)
  • kurt computes the kurtosis of the sample values (fourth moment)
  • cumsum computes the cumulative sum of the sample values
  • cummin/cummax compute the cumulative minimum and cumulative maximum
  • cumprod computes the cumulative product
  • diff computes the first-order difference
  • pct_change calculates the percent change
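
A few of the methods in the list above, sketched on a small made-up frame (the column names x and y and their values are assumptions for illustration). NaN values are skipped by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0],
                   'y': [10.0, 20.0, 30.0, 40.0]})

counts = df.count()            # non-NA values per column
col_sums = df.sum()            # column sums, NaN skipped
row_means = df.mean(axis=1)    # axis=1 aggregates across each row
max_labels = df.idxmax()       # index label of each column's maximum
running = df['y'].cumsum()     # cumulative sum
steps = df['y'].diff()         # first-order difference
change = df['y'].pct_change()  # fractional change from the previous row
summary = df.describe()        # count, mean, std, min, quartiles, max

# Mean absolute deviation computed by hand, since mad() is gone in pandas 2.0.
mad_y = (df['y'] - df['y'].mean()).abs().mean()

print(summary)
```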

Unique values, value counts, and membership tests for a Series

  • obj.unique() returns an array of the unique values
  • obj.value_counts() counts the number of occurrences of each value
  • pd.value_counts(obj.values) gives the same counts through the top-level function (deprecated in recent pandas versions in favor of the method)
  • isin([]) tests whether each value of the Series is contained in the sequence passed in
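
A minimal sketch of these three methods on a made-up Series of letters:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()         # unique values, in order of first appearance
counts = obj.value_counts()    # occurrences of each value
mask = obj.isin(['b', 'c'])    # True where the value is 'b' or 'c'
subset = obj[mask]             # the boolean mask can filter the Series

print(uniques)
print(counts)
```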

4. Handling missing data

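Methods for handling NaN

  • dropna drops the rows or columns that contain missing values
  • fillna replaces missing values with a given value or a fill method
  • isnull returns a boolean mask marking the missing values
  • notnull is the negation of isnull

dropna on a DataFrame is more involved

By default it drops every row that contains any NaN, while how='all' drops only the rows that are entirely NaN.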

In [49]: fram1
Out[49]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [50]: cleaned = fram1.dropna()

In [51]: cleaned
Out[51]:
     0    1    2
0  1.0  6.5  3.0

In [52]: fram1.dropna(how='all')
Out[52]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

# To drop columns in the same way, pass axis=1
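
A sketch of the row and column cases together, with fram1 rebuilt to match the frame shown above. The extra all-NaN column 3 is an assumption added so that the axis=1 case has something to drop.

```python
import numpy as np
import pandas as pd

fram1 = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.0]])

any_rows = fram1.dropna()             # drop rows containing any NaN
all_rows = fram1.dropna(how='all')    # drop only rows that are entirely NaN

# Add a hypothetical column that is entirely NaN, then drop it by column.
wide = fram1.copy()
wide[3] = np.nan
no_empty_cols = wide.dropna(axis=1, how='all')

print(no_empty_cols)
```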

Fill missing values

obj.fillna(value) fills every missing value with the same value.
fram.fillna({1:0.1,2:0.2}) For a DataFrame, a dict fills the missing values of each named column with that column's value.
Passing method (for example method='ffill') fills each column from the last non-missing value above it, and limit caps the number of consecutive fills in each column.
fillna returns a new object by default; inplace=True modifies the existing object instead.

In [57]: df
Out[57]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394       NaN  1.362234
3  0.077912       NaN  0.414627
4  0.530048       NaN       NaN
5  0.294424       NaN       NaN

In [58]: df.fillna(method='ffill')
Out[58]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048  0.984472  0.414627
5  0.294424  0.984472  0.414627

In [59]: df.fillna(method='ffill',limit=2)
Out[59]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048       NaN  0.414627
5  0.294424       NaN  0.414627
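
The fill options can be sketched on a small made-up frame. Recent pandas versions deprecate fillna(method='ffill') in favor of the ffill() method, so this sketch uses ffill(); the values are assumptions chosen to make the results easy to check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                   1: [0.5, np.nan, np.nan, np.nan],
                   2: [9.0, 8.0, np.nan, np.nan]})

const = df.fillna(0)                     # one value for every gap
per_col = df.fillna({1: 0.1, 2: 0.2})    # a different value per column
forward = df.ffill()                     # carry the last valid value down
limited = df.ffill(limit=2)              # at most 2 consecutive fills

# fillna returns a new object, so df itself still has its gaps.
# With inplace=True the object is modified and nothing is returned.
work = df.copy()
work.fillna(0, inplace=True)

print(limited)
```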
    

5. Hierarchical Index

A DataFrame and a hierarchically indexed Series can be converted into each other
frame.stack() / unstack()
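
A minimal sketch of the round trip, assuming a small frame with made-up labels. The second half builds a hierarchically indexed Series directly to show that unstack fills missing combinations with NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['a', 'b'],
                     columns=['one', 'two', 'three'])

# stack moves the column labels into an inner index level,
# giving a Series indexed by (row label, column label).
stacked = frame.stack()

# unstack moves the inner level back out into columns.
restored = stacked.unstack()

# A two-level index can also be built directly from two lists.
s = pd.Series([1, 2, 3], index=[['a', 'a', 'b'], [1, 2, 1]])
inner = s.loc['a']     # select by the outer level
table = s.unstack()    # ('b', 2) does not exist, so that cell is NaN

print(stacked)
print(table)
```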
