"Python Machine Learning" Pandas Data Analysis

Pandas is an open-source Python library for data processing and analysis. It provides high-performance, easy-to-use data structures and analysis tools that make data science in Python simple and efficient. Pandas is built on NumPy, so it integrates seamlessly with other NumPy-based libraries such as SciPy and scikit-learn.

The two main data structures in Pandas

  • Series: a one-dimensional labeled array that can hold data of different types, such as integers, floating-point numbers, and strings. A Series carries an index, which makes it resemble a Python dictionary, but with many more capabilities.

  • DataFrame: a two-dimensional labeled data structure, similar to a table or spreadsheet. It consists of columns that share a common index, and each column can have a different data type. DataFrame offers operations such as filtering, sorting, grouping, merging, and aggregation for working efficiently with large datasets.

Supported data processing and analysis tasks

  • Data import and export
  • Data cleaning and preprocessing
  • Data filtering and selection
  • Data sorting, ranking, and aggregation
  • Missing-value handling
  • Group operations
  • Pivot tables
  • Time-series analysis
  • Merging and joining multiple datasets

Pandas provides a wealth of features that make it one of the most popular and widely used libraries in the Python data science ecosystem.

import pandas as pd
import numpy as np

1. Series

1. Construction and initialization

  1. Series is a one-dimensional data structure; by default Pandas uses 0 through n-1 as the index of a Series;
>>> s = pd.Series([1, 3, 'Beijing', 3.14, -123, 'Year!'])
>>> s
0          1
1          3
2    Beijing
3       3.14
4       -123
5      Year!
dtype: object
  1. Specify the index yourself;
>>> s = pd.Series([1, 3, 'Beijing', 3.14, -123, 'Year!'], index=['A', 'B', 'C', 'D', 'E','G'])
>>> s
A          1
B          3
C    Beijing
D       3.14
E       -123
G      Year!
dtype: object
  1. Construct a Series directly from a dictionary, since a Series is itself a collection of key-value pairs;
>>> cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000,
...           'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None}
>>> apts = pd.Series(cities)
>>> apts
Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    25000.0
Suzhou           NaN
dtype: float64

2. Select data

  1. Select data by index
>>> apts['Hangzhou']
20000.0

>>> apts[['Hangzhou', 'Beijing', 'Shenzhen']]
Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
dtype: float64

>>> # boolean indexing
>>> apts[apts < 50000]
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64

>>> # how boolean indexing works
>>> less_than_50000 = apts < 50000
>>> less_than_50000
Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou      True
Guangzhou     True
Suzhou       False
dtype: bool

>>> apts[less_than_50000]
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64
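Boolean masks can also be combined with `&`, `|`, and `~`; each condition needs its own parentheses because the bitwise operators bind tighter than comparisons. A minimal sketch reusing the `apts` data from above:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000,
                  'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None})

# parentheses are required: & binds tighter than >= and <
mid_range = apts[(apts >= 25000) & (apts < 55000)]
print(mid_range)  # Shenzhen and Guangzhou

# ~ negates a mask; NaN compares as False, so ~ keeps the Suzhou (NaN) row
print(apts[~(apts < 50000)])
```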

3. Element assignment

>>> print("Old value: ", apts['Shenzhen'])
Old value:  50000.0

>>> apts['Shenzhen'] = 55000
>>> print("New value: ", apts['Shenzhen'])
New value:  55000.0

>>> print(apts[apts < 50000])
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64

>>> apts[apts <= 50000] = 40000
>>> print(apts[apts < 50000])
Hangzhou     40000.0
Guangzhou    40000.0
dtype: float64

4. Mathematical operations

>>> apts / 2
Beijing      27500.0
Shanghai     30000.0
Shenzhen     27500.0
Hangzhou     20000.0
Guangzhou    20000.0
Suzhou           NaN
dtype: float64

>>> np.square(apts)
Beijing      3.025000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Hangzhou     1.600000e+09
Guangzhou    1.600000e+09
Suzhou                NaN
dtype: float64

>>> cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000,
...                   'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
>>> cars
Beijing      300000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
Guangzhou    200000
Chongqing    150000
dtype: int64

>>> # arithmetic aligns on the index
>>> cars + apts * 100
Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     5800000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64
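When the NaN produced by non-overlapping labels is unwanted, the method forms of the operators (`add`, `sub`, `mul`, `div`) accept a `fill_value`. A small sketch of the idea:

```python
import pandas as pd

a = pd.Series({'Beijing': 1, 'Shanghai': 2})
b = pd.Series({'Beijing': 10, 'Hangzhou': 20})

print(a + b)                   # Hangzhou and Shanghai align to nothing -> NaN
print(a.add(b, fill_value=0))  # the missing side is treated as 0 instead
```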

5. Missing data

>>> print('Hangzhou' in apts)
True

>>> print('Hangzhou' in cars)
False

>>> apts.notnull()
Beijing       True
Shanghai      True
Shenzhen      True
Hangzhou      True
Guangzhou     True
Suzhou       False
dtype: bool

>>> print(apts.isnull())
Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou     False
Guangzhou    False
Suzhou        True
dtype: bool

>>> print(apts[apts.isnull()])
Suzhou   NaN
dtype: float64
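Beyond detecting missing values, `dropna` and `fillna` remove or replace them. A minimal sketch on a Series shaped like `apts`:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000.0, 'Shanghai': 60000.0, 'Suzhou': None})

print(apts.dropna())             # drops the Suzhou row
print(apts.fillna(0))            # replaces NaN with a constant
print(apts.fillna(apts.mean()))  # or with a statistic of the non-missing values
```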

2. DataFrame

A DataFrame is a table: where a Series is a one-dimensional array, a DataFrame is two-dimensional, comparable to an Excel sheet. A DataFrame can also be viewed as a collection of Series that share the same index.

1. Construct and select data

  1. Constructed from a dictionary;
>>> data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
...         'year': [2016, 2017, 2016, 2017, 2016, 2016],
...         'population': [2100, 2300, 1000, 700, 500, 500]}
>>> d1 = pd.DataFrame(data)
>>> print(d1)
        city  year  population
0    Beijing  2016        2100
1   Shanghai  2017        2300
2  Guangzhou  2016        1000
3   Shenzhen  2017         700
4   Hangzhou  2016         500
5  Chongqing  2016         500
  1. Traverse all rows via values;
>>> for row in d1.values:
...     print(row)
['Beijing' 2016 2100]
['Shanghai' 2017 2300]
['Guangzhou' 2016 1000]
['Shenzhen' 2017 700]
['Hangzhou' 2016 500]
['Chongqing' 2016 500]
  1. Select columns and traverse them together with zip;
>>> for row in zip(d1['city'], d1['year'], d1['population']):
...     print(row)
('Beijing', 2016, 2100)
('Shanghai', 2017, 2300)
('Guangzhou', 2016, 1000)
('Shenzhen', 2017, 700)
('Hangzhou', 2016, 500)
('Chongqing', 2016, 500)

>>> print(d1.columns)
Index(['city', 'year', 'population'], dtype='object')
  1. Reorder the columns;
>>> print(pd.DataFrame(data, columns=['year', 'city', 'population']))
   year       city  population
0  2016    Beijing        2100
1  2017   Shanghai        2300
2  2016  Guangzhou        1000
3  2017   Shenzhen         700
4  2016   Hangzhou         500
5  2016  Chongqing         500
  1. Specify both row and column indexes;
>>> frame2 = pd.DataFrame(data,
...                       columns=['year', 'city', 'population', 'debt'],
...                       index=['one', 'two', 'three', 'four', 'five', 'six'])
>>> print(frame2)
       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN

>>> print(frame2['city'])
one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object

>>> print(frame2.year)
one      2016
two      2017
three    2016
four     2017
five     2016
six      2016
Name: year, dtype: int64

>>> # loc for label-based indexing, iloc for positional indexing
>>> print(frame2.loc['three'])
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object

>>> print(frame2.iloc[2].copy())
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object
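Single cells can be read the same two ways; `at` and `iat` are the scalar-only fast paths. A minimal sketch on a frame shaped like `frame2`:

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2016, 2017], 'city': ['Beijing', 'Shanghai']},
                      index=['one', 'two'])

print(frame2.loc['two', 'city'])  # label-based lookup of one cell
print(frame2.iloc[1, 1])          # the same cell by position
print(frame2.at['two', 'city'])   # scalar fast path, labels only
```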

2. Element assignment

  1. Entire column assignment (single value);
>>> frame2['debt'] = 100
>>> print(frame2)
       year       city  population  debt
one    2016    Beijing        2100   100
two    2017   Shanghai        2300   100
three  2016  Guangzhou        1000   100
four   2017   Shenzhen         700   100
five   2016   Hangzhou         500   100
six    2016  Chongqing         500   100
  1. Entire column assignment (array value);
>>> frame2.debt = np.arange(6)
>>> print(frame2)
       year       city  population  debt
one    2016    Beijing        2100     0
two    2017   Shanghai        2300     1
three  2016  Guangzhou        1000     2
four   2017   Shenzhen         700     3
five   2016   Hangzhou         500     4
six    2016  Chongqing         500     5
  1. Use a Series to specify which index labels to modify and the corresponding values; labels not specified default to NaN;
>>> val = pd.Series([100, 200, 300], index=['two', 'three', 'five'])
>>> frame2['debt'] = val
>>> print(frame2)
       year       city  population   debt
one    2016    Beijing        2100    NaN
two    2017   Shanghai        2300  100.0
three  2016  Guangzhou        1000  200.0
four   2017   Shenzhen         700    NaN
five   2016   Hangzhou         500  300.0
six    2016  Chongqing         500    NaN
  1. Assign to a column that does not exist yet (creates a new column);
>>> frame2['western'] = (frame2.city == 'Chongqing')
>>> print(frame2)
       year       city  population   debt  western
one    2016    Beijing        2100    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six    2016  Chongqing         500    NaN     True
  1. Transpose of DataFrame;
>>> pop = {'Beijing': {2016: 2100, 2017: 2200},
...        'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
>>> frame3 = pd.DataFrame(pop)
>>> print(frame3)
      Beijing  Shanghai
2016   2100.0      2500
2017   2200.0      2600
2015      NaN      2400

>>> print(frame3.T)
            2016    2017    2015
Beijing   2100.0  2200.0     NaN
Shanghai  2500.0  2600.0  2400.0
  1. Reorder the rows;
>>> pd.DataFrame(pop, index=[2015, 2016, 2017])
      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500
2017   2200.0      2600
  1. Initialize data with slices;
>>> pdata = {'Beijing': frame3['Beijing'][:-1],
...          'Shanghai': frame3['Shanghai'][:-1]}
>>> pd.DataFrame(pdata)
      Beijing  Shanghai
2016   2100.0      2500
2017   2200.0      2600
  1. Specify the name of the index and the name of the column;
>>> frame3.index.name = 'year'
>>> frame3.columns.name = 'city'
>>> print(frame3)
city  Beijing  Shanghai
year
2016   2100.0      2500
2017   2200.0      2600
2015      NaN      2400

>>> print(frame2.values)
[[2016 'Beijing' 2100 nan False]
 [2017 'Shanghai' 2300 100.0 False]
 [2016 'Guangzhou' 1000 200.0 False]
 [2017 'Shenzhen' 700 nan False]
 [2016 'Hangzhou' 500 300.0 False]
 [2016 'Chongqing' 500 nan True]]

>>> print(frame2)
      year       city  population   debt  western
one    2016    Beijing        2100    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six    2016  Chongqing         500    NaN     True

>>> print(type(frame2.values))
<class 'numpy.ndarray'>

3. Index

1. Index object

>>> obj = pd.Series(range(3), index=['a', 'b', 'c'])
>>> index = obj.index
>>> index
Index(['a', 'b', 'c'], dtype='object')

>>> index[1:]
Index(['b', 'c'], dtype='object')

>>> # the Index object is immutable and cannot be modified
>>> # index[1]='d'

>>> index = pd.Index(np.arange(3))
>>> obj2 = pd.Series([2, 5, 7], index=index)
>>> print(obj2)
0    2
1    5
2    7
dtype: int64

>>> print(obj2.index is index)
True

>>> 2 in obj2.index
True

>>> pop = {'Beijing': {2016: 2100, 2017: 2200},
...        'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
>>> frame3 = pd.DataFrame(pop)
>>> print('Shanghai' in frame3.columns)
True

>>> print('2015' in frame3.index)
False

>>> print(2015 in frame3.index)
True

2. Index index and slice

>>> obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
>>> print(obj)
a    0
b    1
c    2
d    3
dtype: int64

>>> print(obj['b'])
1
  1. Use the default numeric index;
>>> print(obj[3])
3

>>> print(obj[[1, 3]])
b    1
d    3
dtype: int64

>>> print(obj[['b', 'd']])
b    1
d    3
dtype: int64
  1. conditional filter;
>>> print(obj[obj < 2])
a    0
b    1
dtype: int32
  1. slice filtering and assignment
>>> print(obj['b':'c'])
b    1
c    2
dtype: int64

>>> obj['b':'c'] = 5
>>> print(obj)
a    0
b    5
c    5
d    3
dtype: int64
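Note the asymmetry the example above relies on: label slices include both endpoints, while positional slices exclude the right one. A minimal contrast:

```python
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

print(obj.loc['b':'c'])  # label slice: 'c' is included (2 elements)
print(obj.iloc[1:2])     # positional slice: stop index excluded (1 element)
```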
  1. Indexing a DataFrame;
>>> a = np.arange(9).reshape(3, 3)
>>> print(a)
[[0 1 2]
 [3 4 5]
 [6 7 8]]

>>> frame = pd.DataFrame(a,
...                      index=['a', 'c', 'd'],
...                      columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
>>> frame
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8

>>> frame['Hangzhou']
a    0
c    3
d    6
Name: Hangzhou, dtype: int64

>>> frame[['Shenzhen', 'Nanjing']]
   Shenzhen  Nanjing
a         1        2
c         4        5
d         7        8

>>> frame[:2]
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5

>>> frame.loc['a']
Hangzhou    0
Shenzhen    1
Nanjing     2
Name: a, dtype: int64

>>> frame.loc[['a', 'd'], ['Shenzhen', 'Nanjing']]
   Shenzhen  Nanjing
a         1        2
d         7        8

>>> frame.loc[:'c', 'Hangzhou']
a    0
c    3
Name: Hangzhou, dtype: int64
  1. Conditional selection on a DataFrame;
>>> frame[frame.Hangzhou > 1]
   Hangzhou  Shenzhen  Nanjing
c         3         4        5
d         6         7        8

>>> frame < 5
   Hangzhou  Shenzhen  Nanjing
a      True      True     True
c      True      True    False
d     False     False    False

>>> frame[frame < 5] = 0
>>> frame
   Hangzhou  Shenzhen  Nanjing
a         0         0        0
c         0         0        5
d         6         7        8
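The same replacement can be done without mutating the frame: `where` keeps the values that satisfy the condition and substitutes the rest, returning a new object. A sketch reproducing the result above:

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Hangzhou', 'Shenzhen', 'Nanjing'])

# keep values >= 5, replace everything else with 0
zeroed = frame.where(frame >= 5, 0)
print(zeroed)                     # same values as after frame[frame < 5] = 0
print(frame.loc['a', 'Nanjing'])  # the original frame is untouched
```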

3. reindex

  1. Rearrange according to the new index order;
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
>>> print(obj)
d    4.5
b    7.2
a   -5.3
c    3.2
dtype: float64

>>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
>>> obj2
a   -5.3
b    7.2
c    3.2
d    4.5
e    NaN
dtype: float64
  1. Fill the specified value on the new index;
>>> obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.2
d    4.5
e    0.0
dtype: float64
  1. Forward-fill from the nearest previous value on the new index;
>>> obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
>>> obj3
0      blue
2    purple
4    yellow
dtype: object

>>> # forward
>>> obj3.reindex(range(6), method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
  1. Backward-fill from the nearest following value on the new index;
>>> # backward
>>> obj3.reindex(range(6), method='bfill')
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
  1. Reindex the DataFrame;
>>> frame = pd.DataFrame(np.arange(9).reshape(3, 3),
...                      index=['a', 'c', 'd'],
...                      columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
>>> frame
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8

>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
   Hangzhou  Shenzhen  Nanjing
a       0.0       1.0      2.0
b       NaN       NaN      NaN
c       3.0       4.0      5.0
d       6.0       7.0      8.0
  1. Re-specify columns;
>>> frame.reindex(columns=['Shenzhen', 'Hangzhou', 'Chongqing'])
   Shenzhen  Hangzhou  Chongqing
a         1         0        NaN
c         4         3        NaN
d         7         6        NaN

>>> frame3 = frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(
...     columns=['Shenzhen', 'Hangzhou', 'Chongqing'])
>>> print(frame3)
   Shenzhen  Hangzhou  Chongqing
a         1         0        NaN
b         1         0        NaN
c         4         3        NaN
d         7         6        NaN

>>> print(frame3.loc[['a', 'b', 'd', 'c'],
...       ['Shenzhen', 'Hangzhou', 'Chongqing']])
   Shenzhen  Hangzhou  Chongqing
a         1         0        NaN
b         1         0        NaN
d         7         6        NaN
c         4         3        NaN

4. drop

  1. Drop entries by index label in Series and DataFrame;
>>> print(obj3)
0      blue
2    purple
4    yellow
dtype: object

>>> obj4 = obj3.drop(2)
>>> print(obj4)
0      blue
4    yellow
dtype: object

>>> print(obj3.drop([2, 4]))
0    blue
dtype: object
>>> print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8

>>> print(frame.drop(['a', 'c']))
   Hangzhou  Shenzhen  Nanjing
d         6         7        8
  1. Delete columns in DataFrame;
>>> print(frame.drop('Shenzhen', axis=1))
   Hangzhou  Nanjing
a         0        2
c         3        5
d         6        8

>>> print(frame.drop(['Shenzhen', 'Hangzhou'], axis=1))
   Nanjing
a        2
c        5
d        8
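Since pandas 0.21, `drop` also accepts `index=` and `columns=` keywords, which read more clearly than passing `axis=`:

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Hangzhou', 'Shenzhen', 'Nanjing'])

print(frame.drop(columns='Shenzhen'))                   # same as axis=1
print(frame.drop(index=['a', 'c'], columns='Nanjing'))  # rows and columns at once
```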

5. Hierarchical Indexing

  1. Hierarchical indexing in a Series;
>>> data = pd.Series(np.random.randn(10),
...                  index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd'],
...                         [1, 2, 3, 1, 2, 1, 2, 3, 1, 2]])
>>> data
a  1   -0.587772
   2    0.597073
   3   -2.354382
b  1    1.403719
   2   -0.612704
c  1   -1.409393
   2    2.098933
   3    0.076322
d  1    0.295683
   2    1.188039
dtype: float64

>>> data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2),
            ('c', 3),
            ('d', 1),
            ('d', 2)],
           )

>>> data.b
1    1.403719
2   -0.612704
dtype: float64

>>> data['b':'c']
b  1    1.403719
   2   -0.612704
c  1   -1.409393
   2    2.098933
   3    0.076322
dtype: float64

>>> data[:2]
a  1   -0.587772
   2    0.597073
dtype: float64
  1. unstack and stack;
>>> # convert a Series with hierarchical indexing into a DataFrame
>>> data.unstack()
          1         2         3
a -0.587772  0.597073 -2.354382
b  1.403719 -0.612704       NaN
c -1.409393  2.098933  0.076322
d  0.295683  1.188039       NaN

>>> type(data.unstack())
pandas.core.frame.DataFrame

>>> # convert the DataFrame back into a Series with hierarchical indexing
>>> data.unstack().stack()
a  1   -0.587772
   2    0.597073
   3   -2.354382
b  1    1.403719
   2   -0.612704
c  1   -1.409393
   2    2.098933
   3    0.076322
d  1    0.295683
   2    1.188039
dtype: float64
  1. Hierarchical indexing in a DataFrame;
>>> frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
...                      index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
...                      columns=[['Beijing', 'Beijing', 'Shanghai'],
...                               ['apts', 'cars', 'apts']])
>>> frame
    Beijing      Shanghai
       apts cars     apts
a 1       0    1        2
  2       3    4        5
b 1       6    7        8
  2       9   10       11

>>> frame.index.names = ['key1', 'key2']
>>> frame.columns.names = ['city', 'type']
>>> frame
city      Beijing      Shanghai
type         apts cars     apts
key1 key2
a    1          0    1        2
     2          3    4        5
b    1          6    7        8
     2          9   10       11

>>> frame.loc['a', 1]
city      type
Beijing   apts    0
          cars    1
Shanghai  apts    2
Name: (a, 1), dtype: int64

>>> frame.loc['a', 2]['Beijing']
type
apts    3
cars    4
Name: (a, 2), dtype: int64

>>> frame.loc['a', 2]['Beijing']['apts'] # equivalent to frame.loc['a', 2]['Beijing', 'apts']
3
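Cross-sections through a MultiIndex can also be taken with `xs`, which selects on an inner level directly instead of chaining lookups. A sketch on the same frame shape:

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Beijing', 'Beijing', 'Shanghai'],
                              ['apts', 'cars', 'apts']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['city', 'type']

print(frame.xs(2, level='key2'))               # every row whose key2 == 2
print(frame.xs('apts', axis=1, level='type'))  # every 'apts' column
```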

4. Concatenate and Append

>>> df1 = pd.DataFrame({'apts': [55000, 60000],
...                     'cars': [200000, 300000]},
...                    index=['Shanghai', 'Beijing'])
>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> df2 = pd.DataFrame({'apts': [25000, 20000],
...                     'cars': [150000, 120000]},
...                    index=['Hangzhou', 'Najing'])
>>> print(df2)
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000

>>> df3 = pd.DataFrame({'apts': [30000, 10000],
...                     'cars': [180000, 100000]},
...                    index=['Guangzhou', 'Chongqing'])
>>> print(df3)
            apts    cars
Guangzhou  30000  180000
Chongqing  10000  100000

1. Vertical concat

>>> frames = [df1, df2, df3]
>>> print(frames)
[           apts    cars
Shanghai  55000  200000
Beijing   60000  300000,            apts    cars
Hangzhou  25000  150000
Najing    20000  120000,             apts    cars
Guangzhou  30000  180000
Chongqing  10000  100000]

>>> result = pd.concat(frames)
>>> print(result)
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   25000  150000
Najing     20000  120000
Guangzhou  30000  180000
Chongqing  10000  100000
>>> # give each concatenated part its own key
>>> result2 = pd.concat(frames, keys=['x', 'y', 'z'])
>>> print(result2)
              apts    cars
x Shanghai   55000  200000
  Beijing    60000  300000
y Hangzhou   25000  150000
  Najing     20000  120000
z Guangzhou  30000  180000
  Chongqing  10000  100000

>>> result2.loc['y']
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000

2. Horizontal concat

>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
...                    index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
>>> print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

>>> result3 = pd.concat([result, df4], axis=1, sort=True)
>>> print(result3)
              apts      cars  salaries
Beijing    60000.0  300000.0   30000.0
Chongqing  10000.0  100000.0       NaN
Guangzhou  30000.0  180000.0   20000.0
Hangzhou   25000.0  150000.0       NaN
Najing     20000.0  120000.0       NaN
Shanghai   55000.0  200000.0   30000.0
Suzhou         NaN       NaN   10000.0
Tianjin        NaN       NaN   15000.0

>>> # convert the DataFrame into a Series with hierarchical indexing
>>> print(result3.stack())
Beijing    apts         60000.0
           cars        300000.0
           salaries     30000.0
Chongqing  apts         10000.0
           cars        100000.0
Guangzhou  apts         30000.0
           cars        180000.0
           salaries     20000.0
Hangzhou   apts         25000.0
           cars        150000.0
Najing     apts         20000.0
           cars        120000.0
Shanghai   apts         55000.0
           cars        200000.0
           salaries     30000.0
Suzhou     salaries     10000.0
Tianjin    salaries     15000.0
dtype: float64

3. Concat with inner join

  1. Perform inner join according to index;
>>> print(result)
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   25000  150000
Najing     20000  120000
Guangzhou  30000  180000
Chongqing  10000  100000

>>> print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

>>> result3 = pd.concat([result, df4], axis=1, join='inner')
>>> print(result3)
           apts    cars  salaries
Shanghai   55000  200000     30000
Beijing    60000  300000     30000
Guangzhou  30000  180000     20000

4. append (deprecated)

  1. vertical concat;
>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> print(df2)
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000

>>> df1.append(df2)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Hangzhou  25000  150000
Najing    20000  120000
  1. Append with non-overlapping columns (rows are stacked and the columns are unioned);
>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

>>> df1.append(df4, sort=True)
              apts      cars  salaries
Shanghai   55000.0  200000.0       NaN
Beijing    60000.0  300000.0       NaN
Suzhou         NaN       NaN   10000.0
Beijing        NaN       NaN   30000.0
Shanghai       NaN       NaN   30000.0
Guangzhou      NaN       NaN   20000.0
Tianjin        NaN       NaN   15000.0
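`append` was removed entirely in pandas 2.0; `pd.concat` covers the uses shown above:

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000, 60000], 'cars': [200000, 300000]},
                   index=['Shanghai', 'Beijing'])
df2 = pd.DataFrame({'apts': [25000, 20000], 'cars': [150000, 120000]},
                   index=['Hangzhou', 'Najing'])

# replacement for df1.append(df2)
stacked = pd.concat([df1, df2])
print(stacked)
```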

5. Concatenate Series and DataFrame

  1. Concatenate a Series as a column;
>>> s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
>>> s1
Shanghai    60
Beijing     50
Name: meal, dtype: int64

>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> print(s1)
Shanghai    60
Beijing     50
Name: meal, dtype: int64

>>> print(pd.concat([df1, s1], axis=1))
           apts    cars  meal
Shanghai  55000  200000    60
Beijing   60000  300000    50
  1. Concatenate a Series as a row;
>>> s2 = pd.Series([18000, 12000], index=['apts', 'cars'], name='Xiamen')
>>> s2
apts    18000
cars    12000
Name: Xiamen, dtype: int64

>>> print(df1.append(s2))
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Xiamen    18000   12000
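Since `append` is deprecated (and removed in pandas 2.0), the modern way to attach a Series as a row is to concatenate its one-row transpose:

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000, 60000], 'cars': [200000, 300000]},
                   index=['Shanghai', 'Beijing'])
s2 = pd.Series([18000, 12000], index=['apts', 'cars'], name='Xiamen')

# to_frame().T turns the Series into a one-row DataFrame labeled 'Xiamen'
result = pd.concat([df1, s2.to_frame().T])
print(result)
```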

5. Merge

1. Inner join on the specified column

>>> df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
...                     'cars': [200000, 300000, 250000],
...                     'cities': ['Shanghai', 'Beijing', 'Shenzhen']})
>>> df1
    apts    cars    cities
0  55000  200000  Shanghai
1  60000  300000   Beijing
2  58000  250000  Shenzhen

>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
...                     'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
>>> df4
   salaries     cities
0     10000     Suzhou
1     30000    Beijing
2     30000   Shanghai
3     20000  Guangzhou
4     15000    Tianjin

>>> pd.merge(df1, df4, on='cities')
    apts    cars    cities  salaries
0  55000  200000  Shanghai     30000
1  60000  300000   Beijing     30000

2. Outer join on the specified column

>>> pd.merge(df1, df4, on='cities', how='outer')
      apts      cars     cities  salaries
0  55000.0  200000.0   Shanghai   30000.0
1  60000.0  300000.0    Beijing   30000.0
2  58000.0  250000.0   Shenzhen       NaN
3      NaN       NaN     Suzhou   10000.0
4      NaN       NaN  Guangzhou   20000.0
5      NaN       NaN    Tianjin   15000.0

3. Right join on the specified column

>>> pd.merge(df1, df4, on='cities', how='right')
      apts      cars     cities  salaries
0      NaN       NaN     Suzhou     10000
1  60000.0  300000.0    Beijing     30000
2  55000.0  200000.0   Shanghai     30000
3      NaN       NaN  Guangzhou     20000
4      NaN       NaN    Tianjin     15000

4. Left join on the specified column

>>> pd.merge(df1, df4, on='cities', how='left')
    apts    cars    cities  salaries
0  55000  200000  Shanghai   30000.0
1  60000  300000   Beijing   30000.0
2  58000  250000  Shenzhen       NaN
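When it matters which side each row came from, `merge` can add a `_merge` column via `indicator=True`. A small sketch in the spirit of the `df1`/`df4` examples above:

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                    'cities': ['Shanghai', 'Beijing', 'Shenzhen']})
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000],
                    'cities': ['Suzhou', 'Beijing', 'Shanghai']})

merged = pd.merge(df1, df4, on='cities', how='outer', indicator=True)
print(merged)  # _merge is one of: left_only, right_only, both
```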

6. Join

>>> df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
...                     'cars': [200000, 300000, 250000]},
...                    index=['Shanghai', 'Beijing', 'Shenzhen'])
>>> df1
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Shenzhen  58000  250000

>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
...                    index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
>>> df4
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

1. Inner join on the index

>>> df1.join(df4)
           apts    cars  salaries
Shanghai  55000  200000   30000.0
Beijing   60000  300000   30000.0
Shenzhen  58000  250000       NaN

2. Outer join on the index

>>> df1.join(df4, how='outer')
              apts      cars  salaries
Beijing    60000.0  300000.0   30000.0
Guangzhou      NaN       NaN   20000.0
Shanghai   55000.0  200000.0   30000.0
Shenzhen   58000.0  250000.0       NaN
Suzhou         NaN       NaN   10000.0
Tianjin        NaN       NaN   15000.0

>>> pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
              apts      cars  salaries
Beijing    60000.0  300000.0   30000.0
Guangzhou      NaN       NaN   20000.0
Shanghai   55000.0  200000.0   30000.0
Shenzhen   58000.0  250000.0       NaN
Suzhou         NaN       NaN   10000.0
Tianjin        NaN       NaN   15000.0

7. Group By

1. Group summation

>>> salaries = pd.DataFrame({
...     'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
...     'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
...     'Salary': [10000, 2000, 4000, 5000, 18000, 25000, 3000, 4000],
...     'Bonus': [3000, 1000, 1000, 1200, 4000, 2300, 500, 1000]
... })
>>> salaries
   Name  Year  Salary  Bonus
0  July  2016   10000   3000
1   Chu  2016    2000   1000
2   Chu  2016    4000   1000
3   Lin  2016    5000   1200
4  July  2017   18000   4000
5  July  2017   25000   2300
6   Chu  2017    3000    500
7  July  2017    4000   1000

>>> group_by_name = salaries.groupby('Name')
>>> group_by_name
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11c48e550>

>>> group_by_name.aggregate(sum)
      Year  Salary  Bonus
Name
Chu   6049    9000   2500
July  8067   57000  10300
Lin   2016    5000   1200

>>> group_by_name.sum()
      Year  Salary  Bonus
Name
Chu   6049    9000   2500
July  8067   57000  10300
Lin   2016    5000   1200

>>> group_by_name_year = salaries.groupby(['Name', 'Year'])
>>> group_by_name_year.sum()
           Salary  Bonus
Name Year
Chu  2016    6000   2000
     2017    3000    500
July 2016   10000   3000
     2017   47000   7300
Lin  2016    5000   1200

>>> group_by_name_year.size()
Name  Year
Chu   2016    2
      2017    1
July  2016    1
      2017    3
Lin   2016    1
dtype: int64
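Several statistics can be requested in one pass with `agg`; named aggregation (pandas 0.25+) keeps the result columns flat. A sketch on the same salaries data:

```python
import pandas as pd

salaries = pd.DataFrame({
    'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
    'Salary': [10000, 2000, 4000, 5000, 18000, 25000, 3000, 4000],
})

# each keyword names an output column as (source column, aggregation)
stats = salaries.groupby('Name').agg(
    total=('Salary', 'sum'),
    average=('Salary', 'mean'),
    entries=('Salary', 'size'),
)
print(stats)
```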

2. Display summary statistics for each group

>>> group_by_name_year.describe()
          Salary                                                         \
           count          mean           std      min      25%      50%
Name Year
Chu  2016    2.0   3000.000000   1414.213562   2000.0   2500.0   3000.0
     2017    1.0   3000.000000           NaN   3000.0   3000.0   3000.0
July 2016    1.0  10000.000000           NaN  10000.0  10000.0  10000.0
     2017    3.0  15666.666667  10692.676622   4000.0  11000.0  18000.0
Lin  2016    1.0   5000.000000           NaN   5000.0   5000.0   5000.0

                            Bonus                                           \
               75%      max count         mean         std     min     25%
Name Year
Chu  2016   3500.0   4000.0   2.0  1000.000000     0.00000  1000.0  1000.0
     2017   3000.0   3000.0   1.0   500.000000         NaN   500.0   500.0
July 2016  10000.0  10000.0   1.0  3000.000000         NaN  3000.0  3000.0
     2017  21500.0  25000.0   3.0  2433.333333  1504.43788  1000.0  1650.0
Lin  2016   5000.0   5000.0   1.0  1200.000000         NaN  1200.0  1200.0


              50%     75%     max
Name Year
Chu  2016  1000.0  1000.0  1000.0
     2017   500.0   500.0   500.0
July 2016  3000.0  3000.0  3000.0
     2017  2300.0  3150.0  4000.0
Lin  2016  1200.0  1200.0  1200.0

8. Application case (sum the number of cyclists for each weekday)

1. Read bikes.csv into a DataFrame

read_csv API reference

bikes.csv records data on Montreal bicycle paths: there are 7 paths, and the file records the number of riders on each path for every day;

>>> pd.set_option('display.max_columns', 60)
>>> bikes = pd.read_csv('data/bikes.csv', encoding='latin1')
>>> bikes
    Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)
0                     01/01/2012;35;;0;38;51;26;10;16;
1                     02/01/2012;83;;1;68;153;53;6;43;
2                   03/01/2012;135;;2;104;248;89;3;58;
3                  04/01/2012;144;;1;116;318;111;8;61;
4                  05/01/2012;197;;2;124;330;97;13;95;
..                                                 ...
305      01/11/2012;2405;;1208;1701;3082;2076;165;2461
306        02/11/2012;1582;;737;1109;2277;1392;97;1888
307          03/11/2012;844;;380;612;1137;713;105;1302
308          04/11/2012;966;;446;710;1277;692;197;1374
309      05/11/2012;2247;;1170;1705;3221;2143;179;2430

[310 rows x 1 columns]

Reload bikes.csv with the columns properly split on the `;` separator;

>>> bikes = pd.read_csv('data/bikes.csv', sep=';',
>>>                     parse_dates=['Date'], encoding='latin1', dayfirst=True, index_col='Date')
>>> bikes
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2
...             ...                                ...                    ...
2012-11-01     2405                                NaN                   1208
2012-11-02     1582                                NaN                    737
2012-11-03      844                                NaN                    380
2012-11-04      966                                NaN                    446
2012-11-05     2247                                NaN                   1170

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95
...                   ...            ...      ...           ...      ...
2012-11-01           1701           3082     2076           165     2461
2012-11-02           1109           2277     1392            97     1888
2012-11-03            612           1137      713           105     1302
2012-11-04            710           1277      692           197     1374
2012-11-05           1705           3221     2143           179     2430

            St-Urbain (données non disponibles)
Date
2012-01-01                                  NaN
...                                         ...
2012-11-03                                  NaN
2012-11-04                                  NaN
2012-11-05                                  NaN

[310 rows x 9 columns]
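The effect of `sep`, `parse_dates`, `dayfirst` and `index_col` can be sketched on a tiny in-memory CSV (the data below is a made-up subset of the real file):

```python
import io
import pandas as pd

# Hypothetical miniature of bikes.csv: ';'-separated, day-first dates.
csv_text = (
    "Date;Berri 1;Rachel1\n"
    "01/01/2012;35;16\n"
    "02/01/2012;83;43\n"
)

df = pd.read_csv(io.StringIO(csv_text), sep=';',
                 parse_dates=['Date'], dayfirst=True, index_col='Date')
print(df)
print(df.index.dtype)  # datetime64[ns]
```

With `dayfirst=True`, `02/01/2012` is parsed as January 2nd rather than February 1st, which is what the Montreal data requires.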

2. View data samples

  1. Use head and slice to get the first 5 rows;
>>> bikes.head(5)
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                                  NaN
2012-01-02                                  NaN
2012-01-03                                  NaN
2012-01-04                                  NaN
2012-01-05                                  NaN

>>> bikes[:5]
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                                  NaN
2012-01-02                                  NaN
2012-01-03                                  NaN
2012-01-04                                  NaN
2012-01-05                                  NaN
  1. Use copy to copy the selected part;
>>> berri_bikes = bikes[['Berri 1']].copy()
>>> berri_bikes.head()
            Berri 1
Date
2012-01-01       35
2012-01-02       83
2012-01-03      135
2012-01-04      144
2012-01-05      197

>>> berri_bikes.index
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
               '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
               '2012-01-09', '2012-01-10',
               ...
               '2012-10-27', '2012-10-28', '2012-10-29', '2012-10-30',
               '2012-10-31', '2012-11-01', '2012-11-02', '2012-11-03',
               '2012-11-04', '2012-11-05'],
              dtype='datetime64[ns]', name='Date', length=310, freq=None)

>>> # Which day of the month each date is
>>> berri_bikes.index.day
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            ...
            27, 28, 29, 30, 31,  1,  2,  3,  4,  5],
           dtype='int64', name='Date', length=310)

>>> # Which day of the week each date is (Monday=0)
>>> berri_bikes.index.weekday
Int64Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
            ...
            5, 6, 0, 1, 2, 3, 4, 5, 6, 0],
           dtype='int64', name='Date', length=310)
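The `.day` and `.weekday` accessors above can be illustrated on a small, hypothetical DatetimeIndex:

```python
import pandas as pd

# Three consecutive days starting 2012-01-01, which was a Sunday
idx = pd.date_range('2012-01-01', periods=3, freq='D')
print(idx.day.tolist())      # [1, 2, 3]
print(idx.weekday.tolist())  # [6, 0, 1]  (Monday=0, so Sunday=6)
```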

3. dropna

1. Drop every row that contains any NaN;
>>> bikes.dropna()
Empty DataFrame
Columns: [Berri 1, Brébeuf (données non disponibles), Côte-Sainte-Catherine, Maisonneuve 1, Maisonneuve 2, du Parc, Pierre-Dupuy, Rachel1, St-Urbain (données non disponibles)]
Index: []
1. Drop only the rows that are entirely NaN;
>>> bikes.dropna(how='all').head()
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                                  NaN
2012-01-02                                  NaN
2012-01-03                                  NaN
2012-01-04                                  NaN
2012-01-05                                  NaN
1. Drop the columns that are entirely NaN;
>>> bikes.dropna(axis=1, how='all').head()
            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  \
Date
2012-01-01       35                      0             38             51
2012-01-02       83                      1             68            153
2012-01-03      135                      2            104            248
2012-01-04      144                      1            116            318
2012-01-05      197                      2            124            330

            du Parc  Pierre-Dupuy  Rachel1
Date
2012-01-01       26            10       16
2012-01-02       53             6       43
2012-01-03       89             3       58
2012-01-04      111             8       61
2012-01-05       97            13       95
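The three dropna variants can be sketched on a toy DataFrame (the column names and values are made up):

```python
import numpy as np
import pandas as pd

# 'c' is entirely NaN, mimicking the unavailable routes in bikes.csv
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, 6.0],
                   'c': [np.nan, np.nan, np.nan]})

print(df.dropna())                   # 'c' is NaN everywhere, so every row goes -> empty
print(df.dropna(how='all'))          # no row is entirely NaN, so all rows stay
print(df.dropna(axis=1, how='all'))  # only column 'c' is dropped
```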

4. fillna

1. Fill missing data (a single row);
>>> row = bikes.iloc[0].copy()
>>> print(row)
Berri 1                                35.0
Brébeuf (données non disponibles)       NaN
Côte-Sainte-Catherine                   0.0
Maisonneuve 1                          38.0
Maisonneuve 2                          51.0
du Parc                                26.0
Pierre-Dupuy                           10.0
Rachel1                                16.0
St-Urbain (données non disponibles)     NaN
Name: 2012-01-01 00:00:00, dtype: float64

>>> # Mean of the row (NaN values are skipped)
>>> print(row.mean())
25.142857142857142

>>> print(row.fillna(row.mean()))
Berri 1                                35.000000
Brébeuf (données non disponibles)      25.142857
Côte-Sainte-Catherine                   0.000000
Maisonneuve 1                          38.000000
Maisonneuve 2                          51.000000
du Parc                                26.000000
Pierre-Dupuy                           10.000000
Rachel1                                16.000000
St-Urbain (données non disponibles)    25.142857
Name: 2012-01-01 00:00:00, dtype: float64
1. Fill missing data (the whole DataFrame);
>>> # Row mean (across columns) for every row;
>>> m = bikes.mean(axis=1)
>>> print(m)
Date
2012-01-01      25.142857
2012-01-02      58.142857
2012-01-03      91.285714
2012-01-04     108.428571
2012-01-05     122.571429
                 ...
2012-11-01    1871.142857
2012-11-02    1297.428571
2012-11-03     727.571429
2012-11-04     808.857143
2012-11-05    1870.714286
Length: 310, dtype: float64

>>> # Fill the missing entries of each row with that row's mean: iterate over the
>>> # columns and fill each column's NaN entries with the corresponding element of m
>>> for i, col in enumerate(bikes):
>>>     bikes.iloc[:, i] = bikes.iloc[:, i].fillna(m)
>>>     print(i, col)
0 Berri 1
1 Brébeuf (données non disponibles)
2 Côte-Sainte-Catherine
3 Maisonneuve 1
4 Maisonneuve 2
5 du Parc
6 Pierre-Dupuy
7 Rachel1
8 St-Urbain (données non disponibles)

>>> bikes.head()
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                          25.142857                      0
2012-01-02       83                          58.142857                      1
2012-01-03      135                          91.285714                      2
2012-01-04      144                         108.428571                      1
2012-01-05      197                         122.571429                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                            25.142857
2012-01-02                            58.142857
2012-01-03                            91.285714
2012-01-04                           108.428571
2012-01-05                           122.571429
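The explicit column loop can also be written as a single `apply` call; a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame: rows are dates, 'z' is entirely NaN (like the St-Urbain column)
df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 6.0], 'z': [np.nan, np.nan]},
                  index=pd.to_datetime(['2012-01-01', '2012-01-02']))
m = df.mean(axis=1)  # row means: 2.0 and 4.0

# For each column, fillna(m) aligns m on the date index,
# so every NaN is replaced by its row's mean.
filled = df.apply(lambda col: col.fillna(m))
print(filled)
```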

5. Calculate the sum of the number of cyclists on each weekday for a single route

  1. Add a weekday column;
>>> berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday
>>> berri_bikes[:5]
            Berri 1  weekday
Date
2012-01-01       35        6
2012-01-02       83        0
2012-01-03      135        1
2012-01-04      144        2
2012-01-05      197        3
1. Group by weekday and sum;
>>> weekday_counts = berri_bikes.groupby('weekday').aggregate(sum)
>>> weekday_counts
         Berri 1
weekday
0         134298
1         135305
2         152972
3         160131
4         141771
5         101578
6          99310
1. Replace the numeric index with weekday names;
>>> weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday',
>>>                         'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> weekday_counts
           Berri 1
Monday      134298
Tuesday     135305
Wednesday   152972
Thursday    160131
Friday      141771
Saturday    101578
Sunday       99310
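The same weekday aggregation can be sketched on toy data with `Index.day_name()`, which avoids the manual index replacement (the counts below are made up):

```python
import pandas as pd

# One made-up count per day for one week, starting Sunday 2012-01-01
dates = pd.date_range('2012-01-01', periods=7, freq='D')
counts = pd.DataFrame({'Berri 1': [35, 83, 135, 144, 197, 146, 98]},
                      index=dates)

# Group directly by the weekday name derived from the index
by_day = counts.groupby(counts.index.day_name()).sum()
print(by_day)
```

Note that `day_name()` labels sort alphabetically rather than in calendar order, which is why the original text replaces the numeric 0-6 index instead.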

6. Calculate the sum of the number of cyclists on each weekday for all routes

1. Sum by date (total number of riders per day across all routes);
>>> bikes = pd.read_csv('data/bikes.csv', sep=';',
>>>                     parse_dates=['Date'], encoding='latin1', dayfirst=True, index_col='Date')
>>> bikes_sum = bikes.sum(axis=1).to_frame()
>>> print(bikes_sum.head())
                0
Date
2012-01-01  176.0
2012-01-02  407.0
2012-01-03  639.0
2012-01-04  759.0
2012-01-05  858.0
  1. Add weekday column;
>>> print(bikes_sum.index)
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
               '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
               '2012-01-09', '2012-01-10',
               ...
               '2012-10-27', '2012-10-28', '2012-10-29', '2012-10-30',
               '2012-10-31', '2012-11-01', '2012-11-02', '2012-11-03',
               '2012-11-04', '2012-11-05'],
              dtype='datetime64[ns]', name='Date', length=310, freq=None)

>>> bikes_sum.loc[:, 'weekday'] = bikes_sum.index.weekday
>>> bikes_sum.head()
                0  weekday
Date
2012-01-01  176.0        6
2012-01-02  407.0        0
2012-01-03  639.0        1
2012-01-04  759.0        2
2012-01-05  858.0        3
1. Group by weekday and sum (total number of riders for each day of the week);
>>> weekday_counts = bikes_sum.groupby('weekday').aggregate(sum)
>>> weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday',
>>>                         'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> weekday_counts
                  0
Monday     714963.0
Tuesday    698582.0
Wednesday  789722.0
Thursday   829069.0
Friday     738772.0
Saturday   516701.0
Sunday     518047.0
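The whole case study condenses to two operations: a row-wise sum, then a groupby on the weekday. A minimal sketch on made-up route data:

```python
import numpy as np
import pandas as pd

# Four days of made-up counts for two hypothetical routes
dates = pd.date_range('2012-01-01', periods=4, freq='D')
bikes = pd.DataFrame({'route_a': [10, 20, 30, 40],
                      'route_b': [1, 2, np.nan, 4]}, index=dates)

daily_total = bikes.sum(axis=1)  # NaN values are skipped by default
weekday_total = daily_total.groupby(daily_total.index.weekday).sum()
print(weekday_total)
```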

PS: Friends from all walks of life are welcome to read and comment. Thank you for your likes, follows, and bookmarks!


Origin blog.csdn.net/ChaoMing_H/article/details/129827999