Pandas is an open-source Python library for data processing and analysis. It provides high-performance, easy-to-use data structures and analysis tools that make data science in Python simple and efficient. Because Pandas is built on NumPy, it integrates seamlessly with many other NumPy-based libraries such as SciPy and scikit-learn.
The two main data structures in Pandas:
- Series: a one-dimensional labeled array that can hold different types of data, such as integers, floating-point numbers, and strings. A Series has an index, which makes it similar to a Python dictionary but with many more features.
- DataFrame: a two-dimensional labeled data structure, similar to a table or spreadsheet. It consists of columns that share a common index, and each column can have its own data type. A DataFrame provides operations such as filtering, sorting, grouping, merging, and aggregation for working efficiently with large datasets.
Supported data processing and analysis tasks:
- Data import and export
- Data cleaning and preprocessing
- Data filtering and selection
- Data sorting, ranking, and aggregation
- Missing-value handling
- Group operations
- Pivot tables
- Time series analysis
- Merging and joining multiple datasets
Pandas provides a wealth of features that make it one of the most popular and widely used libraries in the Python data science ecosystem.
import pandas as pd
import numpy as np
1. Series
1. Construction and initialization
- A Series is a one-dimensional data structure; by default, Pandas uses 0 to n-1 as its index;
>>> s = pd.Series([1, 3, 'Beijing', 3.14, -123, 'Year!'])
>>> s
0 1
1 3
2 Beijing
3 3.14
4 -123
5 Year!
dtype: object
- Specify the index yourself;
>>> s = pd.Series([1, 3, 'Beijing', 3.14, -123, 'Year!'], index=['A', 'B', 'C', 'D', 'E','G'])
>>> s
A 1
B 3
C Beijing
D 3.14
E -123
G Year!
dtype: object
- Construct a Series directly from a dictionary, since a Series is itself a set of key-value pairs;
>>> cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000,
...           'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None}
>>> apts = pd.Series(cities)
>>> apts
Beijing 55000.0
Shanghai 60000.0
Shenzhen 50000.0
Hangzhou 20000.0
Guangzhou 25000.0
Suzhou NaN
dtype: float64
2. Select data
- Select data by index
>>> apts['Hangzhou']
20000.0
>>> apts[['Hangzhou', 'Beijing', 'Shenzhen']]
Hangzhou 20000.0
Beijing 55000.0
Shenzhen 50000.0
dtype: float64
>>> # boolean indexing
>>> apts[apts < 50000]
Hangzhou 20000.0
Guangzhou 25000.0
dtype: float64
>>> # how boolean indexing works
>>> less_than_50000 = apts < 50000
>>> less_than_50000
Beijing False
Shanghai False
Shenzhen False
Hangzhou True
Guangzhou True
Suzhou False
dtype: bool
>>> apts[less_than_50000]
Hangzhou 20000.0
Guangzhou 25000.0
dtype: float64
3. Element assignment
>>> print("Old value: ", apts['Shenzhen'])
Old value: 50000.0
>>> apts['Shenzhen'] = 55000
>>> print("New value: ", apts['Shenzhen'])
New value: 55000.0
>>> print(apts[apts < 50000])
Hangzhou 20000.0
Guangzhou 25000.0
dtype: float64
>>> apts[apts <= 50000] = 40000
>>> print(apts[apts < 50000])
Hangzhou 40000.0
Guangzhou 40000.0
dtype: float64
4. Mathematical operations
>>> apts / 2
Beijing 27500.0
Shanghai 30000.0
Shenzhen 27500.0
Hangzhou 20000.0
Guangzhou 20000.0
Suzhou NaN
dtype: float64
>>> np.square(apts)
Beijing 3.025000e+09
Shanghai 3.600000e+09
Shenzhen 3.025000e+09
Hangzhou 1.600000e+09
Guangzhou 1.600000e+09
Suzhou NaN
dtype: float64
>>> cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000,
...                   'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
>>> cars
Beijing 300000
Shanghai 400000
Shenzhen 300000
Tianjin 200000
Guangzhou 200000
Chongqing 150000
dtype: int64
>>> # arithmetic aligns on the index
>>> cars + apts * 100
Beijing 5800000.0
Chongqing NaN
Guangzhou 4200000.0
Hangzhou NaN
Shanghai 6400000.0
Shenzhen 5800000.0
Suzhou NaN
Tianjin NaN
dtype: float64
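Index alignment produces NaN for labels that appear in only one Series. If that is not what you want, `Series.add` with `fill_value` treats a missing entry as a default value instead; a sketch with a pared-down version of the data above:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000, 'Hangzhou': 20000})
cars = pd.Series({'Beijing': 300000, 'Tianjin': 200000})
# fill_value=0 substitutes 0 for a label missing from one side
# instead of propagating NaN
total = cars.add(apts * 100, fill_value=0)
print(total)
```

Here Hangzhou and Tianjin each get a real number rather than NaN; only labels missing from both sides would remain NaN.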
5. Missing data
>>> print('Hangzhou' in apts)
True
>>> print('Hangzhou' in cars)
False
>>> apts.notnull()
Beijing True
Shanghai True
Shenzhen True
Hangzhou True
Guangzhou True
Suzhou False
dtype: bool
>>> print(apts.isnull())
Beijing False
Shanghai False
Shenzhen False
Hangzhou False
Guangzhou False
Suzhou True
dtype: bool
>>> print(apts[apts.isnull()])
Suzhou NaN
dtype: float64
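Beyond detecting missing values with `isnull`/`notnull`, you can remove or replace them directly with `dropna` and `fillna`; a minimal sketch:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000.0, 'Suzhou': None})
# dropna returns a copy without the NaN entries
print(apts.dropna())
# fillna returns a copy with NaN replaced by the given value
print(apts.fillna(0))
```

Both return new Series; the original `apts` is left unchanged.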
2. DataFrame
A DataFrame is a table: where a Series is a one-dimensional array, a DataFrame is a two-dimensional one, comparable to an Excel sheet. A DataFrame can also be viewed as a collection of Series that share an index.
1. Construct and select data
- Constructed from a dictionary;
>>> data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
...         'year': [2016, 2017, 2016, 2017, 2016, 2016],
...         'population': [2100, 2300, 1000, 700, 500, 500]}
>>> d1 = pd.DataFrame(data)
>>> print(d1)
city year population
0 Beijing 2016 2100
1 Shanghai 2017 2300
2 Guangzhou 2016 1000
3 Shenzhen 2017 700
4 Hangzhou 2016 500
5 Chongqing 2016 500
- Iterate over full rows;
>>> for row in d1.values:
...     print(row)
['Beijing' 2016 2100]
['Shanghai' 2017 2300]
['Guangzhou' 2016 1000]
['Shenzhen' 2017 700]
['Hangzhou' 2016 500]
['Chongqing' 2016 500]
- Select columns and iterate over them together;
>>> for row in zip(d1['city'], d1['year'], d1['population']):
...     print(row)
('Beijing', 2016, 2100)
('Shanghai', 2017, 2300)
('Guangzhou', 2016, 1000)
('Shenzhen', 2017, 700)
('Hangzhou', 2016, 500)
('Chongqing', 2016, 500)
>>> print(d1.columns)
Index(['city', 'year', 'population'], dtype='object')
- Reorder the columns;
>>> print(pd.DataFrame(data, columns=['year', 'city', 'population']))
year city population
0 2016 Beijing 2100
1 2017 Shanghai 2300
2 2016 Guangzhou 1000
3 2017 Shenzhen 700
4 2016 Hangzhou 500
5 2016 Chongqing 500
- Row and column indexing;
>>> frame2 = pd.DataFrame(data,
...                       columns=['year', 'city', 'population', 'debt'],
...                       index=['one', 'two', 'three', 'four', 'five', 'six'])
>>> print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN
>>> print(frame2['city'])
one Beijing
two Shanghai
three Guangzhou
four Shenzhen
five Hangzhou
six Chongqing
Name: city, dtype: object
>>> print(frame2.year)
one 2016
two 2017
three 2016
four 2017
five 2016
six 2016
Name: year, dtype: int64
>>> # loc is label-based indexing; iloc is positional indexing
>>> print(frame2.loc['three'])
year 2016
city Guangzhou
population 1000
debt NaN
Name: three, dtype: object
>>> print(frame2.iloc[2].copy())
year 2016
city Guangzhou
population 1000
debt NaN
Name: three, dtype: object
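The distinction between `loc` and `iloc` matters as soon as the index is not the default 0..n-1; a minimal self-contained sketch:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2016, 2017]}, index=['one', 'two'])
# loc addresses a cell by labels, iloc by integer positions;
# here both expressions refer to the same cell
print(frame.loc['two', 'year'])
print(frame.iloc[1, 0])
```

With an integer index the two can silently diverge, so being explicit about which one you mean avoids subtle bugs.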
2. Element assignment
- Entire column assignment (single value);
>>> frame2['debt'] = 100
>>> print(frame2)
year city population debt
one 2016 Beijing 2100 100
two 2017 Shanghai 2300 100
three 2016 Guangzhou 1000 100
four 2017 Shenzhen 700 100
five 2016 Hangzhou 500 100
six 2016 Chongqing 500 100
- Entire column assignment (array of values);
>>> frame2.debt = np.arange(6)
>>> print(frame2)
year city population debt
one 2016 Beijing 2100 0
two 2017 Shanghai 2300 1
three 2016 Guangzhou 1000 2
four 2017 Shenzhen 700 3
five 2016 Hangzhou 500 4
six 2016 Chongqing 500 5
- Use a Series to specify which index labels to modify and the corresponding values; labels that are not specified default to NaN;
>>> val = pd.Series([100, 200, 300], index=['two', 'three', 'five'])
>>> frame2['debt'] = val
>>> print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 100.0
three 2016 Guangzhou 1000 200.0
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 300.0
six 2016 Chongqing 500 NaN
- Assign to a column that does not yet exist to create a new column;
>>> frame2['western'] = (frame2.city == 'Chongqing')
>>> print(frame2)
year city population debt western
one 2016 Beijing 2100 NaN False
two 2017 Shanghai 2300 100.0 False
three 2016 Guangzhou 1000 200.0 False
four 2017 Shenzhen 700 NaN False
five 2016 Hangzhou 500 300.0 False
six 2016 Chongqing 500 NaN True
- Transpose of DataFrame;
>>> pop = {'Beijing': {2016: 2100, 2017: 2200},
...        'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
>>> frame3 = pd.DataFrame(pop)
>>> print(frame3)
Beijing Shanghai
2016 2100.0 2500
2017 2200.0 2600
2015 NaN 2400
>>> print(frame3.T)
2016 2017 2015
Beijing 2100.0 2200.0 NaN
Shanghai 2500.0 2600.0 2400.0
- Reorder the rows;
>>> pd.DataFrame(pop, index=[2015, 2016, 2017])
Beijing Shanghai
2015 NaN 2400
2016 2100.0 2500
2017 2200.0 2600
- Initialize data with slices;
>>> pdata = {'Beijing': frame3['Beijing'][:-1],
...          'Shanghai': frame3['Shanghai'][:-1]}
>>> pd.DataFrame(pdata)
Beijing Shanghai
2016 2100.0 2500
2017 2200.0 2600
- Specify the name of the index and the name of the column;
>>> frame3.index.name = 'year'
>>> frame3.columns.name = 'city'
>>> print(frame3)
city Beijing Shanghai
year
2016 2100.0 2500
2017 2200.0 2600
2015 NaN 2400
>>> print(frame2.values)
[[2016 'Beijing' 2100 nan False]
[2017 'Shanghai' 2300 100.0 False]
[2016 'Guangzhou' 1000 200.0 False]
[2017 'Shenzhen' 700 nan False]
[2016 'Hangzhou' 500 300.0 False]
[2016 'Chongqing' 500 nan True]]
>>> print(frame2)
year city population debt western
one 2016 Beijing 2100 NaN False
two 2017 Shanghai 2300 100.0 False
three 2016 Guangzhou 1000 200.0 False
four 2017 Shenzhen 700 NaN False
five 2016 Hangzhou 500 300.0 False
six 2016 Chongqing 500 NaN True
>>> print(type(frame2.values))
<class 'numpy.ndarray'>
3. Index
1. Index object
>>> obj = pd.Series(range(3), index=['a', 'b', 'c'])
>>> index = obj.index
>>> index
Index(['a', 'b', 'c'], dtype='object')
>>> index[1:]
Index(['b', 'c'], dtype='object')
>>> # an Index is immutable and cannot be modified in place
>>> # index[1]='d'
>>> index = pd.Index(np.arange(3))
>>> obj2 = pd.Series([2, 5, 7], index=index)
>>> print(obj2)
0 2
1 5
2 7
dtype: int64
>>> print(obj2.index is index)
True
>>> 2 in obj2.index
True
>>> pop = {'Beijing': {2016: 2100, 2017: 2200},
...        'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
>>> frame3 = pd.DataFrame(pop)
>>> print('Shanghai' in frame3.columns)
True
>>> print('2015' in frame3.index)
False
>>> print(2015 in frame3.index)
True
2. Indexing and slicing
>>> obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
>>> print(obj)
a 0
b 1
c 2
d 3
dtype: int64
>>> print(obj['b'])
1
- Use positional (integer) indexing, which works here because the labels are not integers;
>>> print(obj[3])
3
>>> print(obj[[1, 3]])
b 1
d 3
dtype: int64
>>> print(obj[['b', 'd']])
b 1
d 3
dtype: int64
- Conditional filtering;
>>> print(obj[obj < 2])
a 0
b 1
dtype: int32
- Slice filtering and assignment (note that label slices include the endpoint);
>>> print(obj['b':'c'])
b 1
c 2
dtype: int64
>>> obj['b':'c'] = 5
>>> print(obj)
a 0
b 5
c 5
d 3
dtype: int64
- Indexing a DataFrame;
>>> a = np.arange(9).reshape(3, 3)
>>> print(a)
[[0 1 2]
[3 4 5]
[6 7 8]]
>>> frame = pd.DataFrame(a,
...                      index=['a', 'c', 'd'],
...                      columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
>>> frame
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
>>> frame['Hangzhou']
a 0
c 3
d 6
Name: Hangzhou, dtype: int64
>>> frame[['Shenzhen', 'Nanjing']]
Shenzhen Nanjing
a 1 2
c 4 5
d 7 8
>>> frame[:2]
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
>>> frame.loc['a']
Hangzhou 0
Shenzhen 1
Nanjing 2
Name: a, dtype: int64
>>> frame.loc[['a', 'd'], ['Shenzhen', 'Nanjing']]
Shenzhen Nanjing
a 1 2
d 7 8
>>> frame.loc[:'c', 'Hangzhou']
a 0
c 3
Name: Hangzhou, dtype: int64
- Conditional selection on a DataFrame;
>>> frame[frame.Hangzhou > 1]
Hangzhou Shenzhen Nanjing
c 3 4 5
d 6 7 8
>>> frame < 5
Hangzhou Shenzhen Nanjing
a True True True
c True True False
d False False False
>>> frame[frame < 5] = 0
>>> frame
Hangzhou Shenzhen Nanjing
a 0 0 0
c 0 0 5
d 6 7 8
3. reindex
- Rearrange according to the new index order;
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
>>> print(obj)
d 4.5
b 7.2
a -5.3
c 3.2
dtype: float64
>>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
>>> obj2
a -5.3
b 7.2
c 3.2
d 4.5
e NaN
dtype: float64
- Fill a specified value at new index positions;
>>> obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a -5.3
b 7.2
c 3.2
d 4.5
e 0.0
dtype: float64
- Forward-fill: fill each new index position with the most recent preceding value;
>>> obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
>>> obj3
0 blue
2 purple
4 yellow
dtype: object
>>> # forward
>>> obj3.reindex(range(6), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
- Backward-fill: fill each new index position with the next available value;
>>> # backward
>>> obj3.reindex(range(6), method='bfill')
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
- Reindex the DataFrame;
>>> frame = pd.DataFrame(np.arange(9).reshape(3, 3),
...                      index=['a', 'c', 'd'],
...                      columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
>>> frame
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
Hangzhou Shenzhen Nanjing
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
- Re-specify columns;
>>> frame.reindex(columns=['Shenzhen', 'Hangzhou', 'Chongqing'])
Shenzhen Hangzhou Chongqing
a 1 0 NaN
c 4 3 NaN
d 7 6 NaN
>>> frame3 = frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(
...     columns=['Shenzhen', 'Hangzhou', 'Chongqing'])
>>> print(frame3)
Shenzhen Hangzhou Chongqing
a 1 0 NaN
b 1 0 NaN
c 4 3 NaN
d 7 6 NaN
>>> print(frame3.loc[['a', 'b', 'd', 'c'],
...                  ['Shenzhen', 'Hangzhou', 'Chongqing']])
Shenzhen Hangzhou Chongqing
a 1 0 NaN
b 1 0 NaN
d 7 6 NaN
c 4 3 NaN
4. drop
- Drop entries by index label in a Series or rows in a DataFrame;
>>> print(obj3)
0 blue
2 purple
4 yellow
dtype: object
>>> obj4 = obj3.drop(2)
>>> print(obj4)
0 blue
4 yellow
dtype: object
>>> print(obj3.drop([2, 4]))
0 blue
dtype: object
>>> print(frame)
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
>>> print(frame.drop(['a', 'c']))
Hangzhou Shenzhen Nanjing
d 6 7 8
- Drop columns from a DataFrame;
>>> print(frame.drop('Shenzhen', axis=1))
Hangzhou Nanjing
a 0 2
c 3 5
d 6 8
>>> print(frame.drop(['Shenzhen', 'Hangzhou'], axis=1))
Nanjing
a 2
c 5
d 8
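`drop` returns a modified copy and leaves the original untouched unless `inplace=True` is passed. Newer pandas versions (0.21 and later) also accept a `columns=` keyword equivalent to `axis=1`; a sketch:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
# equivalent to frame.drop('Shenzhen', axis=1)
print(frame.drop(columns='Shenzhen'))
# the original frame still has all three columns
print(frame.columns.tolist())
```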
5. Hierarchical Indexing
- Hierarchical indexing on a Series;
>>> data = pd.Series(np.random.randn(10),
...                  index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd'],
...                         [1, 2, 3, 1, 2, 1, 2, 3, 1, 2]])
>>> data
a 1 -0.587772
2 0.597073
3 -2.354382
b 1 1.403719
2 -0.612704
c 1 -1.409393
2 2.098933
3 0.076322
d 1 0.295683
2 1.188039
>>> data.index
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 2),
('c', 1),
('c', 2),
('c', 3),
('d', 1),
('d', 2)],
)
>>> data.b
1 1.403719
2 -0.612704
dtype: float64
>>> data['b':'c']
b 1 1.403719
2 -0.612704
c 1 -1.409393
2 2.098933
3 0.076322
dtype: float64
>>> data[:2]
a 1 -0.587772
2 0.597073
dtype: float64
- unstack and stack;
>>> # convert a hierarchically indexed Series into a DataFrame
>>> data.unstack()
1 2 3
a -0.587772 0.597073 -2.354382
b 1.403719 -0.612704 NaN
c -1.409393 2.098933 0.076322
d 0.295683 1.188039 NaN
>>> type(data.unstack())
pandas.core.frame.DataFrame
>>> # convert a DataFrame back into a hierarchically indexed Series
>>> data.unstack().stack()
a 1 -0.587772
2 0.597073
3 -2.354382
b 1 1.403719
2 -0.612704
c 1 -1.409393
2 2.098933
3 0.076322
d 1 0.295683
2 1.188039
dtype: float64
- Hierarchical indexing on a DataFrame;
>>> frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
...                      index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
...                      columns=[['Beijing', 'Beijing', 'Shanghai'],
...                               ['apts', 'cars', 'apts']])
>>> frame
Beijing Shanghai
apts cars apts
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
>>> frame.index.names = ['key1', 'key2']
>>> frame.columns.names = ['city', 'type']
>>> frame
city Beijing Shanghai
type apts cars apts
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
>>> frame.loc['a', 1]
city type
Beijing apts 0
cars 1
Shanghai apts 2
Name: (a, 1), dtype: int64
>>> frame.loc['a', 2]['Beijing']
type
apts 3
cars 4
Name: (a, 2), dtype: int64
>>> frame.loc['a', 2]['Beijing']['apts'] # equivalent: frame.loc['a', 2]['Beijing', 'apts']
3
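With named index levels you can also aggregate over one level, for example summing out `key2` so that one row per `key1` value remains; a sketch using the same shape of data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Beijing', 'Beijing', 'Shanghai'],
                              ['apts', 'cars', 'apts']])
frame.index.names = ['key1', 'key2']
# sum over the key2 level, keeping one row per key1 value
print(frame.groupby(level='key1').sum())
```

For `key1 == 'a'` the rows `[0, 1, 2]` and `[3, 4, 5]` collapse to `[3, 5, 7]`, and similarly for `'b'`.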
4. Concatenate and Append
>>> df1 = pd.DataFrame({'apts': [55000, 60000],
...                     'cars': [200000, 300000]},
...                    index=['Shanghai', 'Beijing'])
>>> print(df1)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
>>> df2 = pd.DataFrame({'apts': [25000, 20000],
...                     'cars': [150000, 120000]},
...                    index=['Hangzhou', 'Najing'])
>>> print(df2)
apts cars
Hangzhou 25000 150000
Najing 20000 120000
>>> df3 = pd.DataFrame({'apts': [30000, 10000],
...                     'cars': [180000, 100000]},
...                    index=['Guangzhou', 'Chongqing'])
>>> print(df3)
apts cars
Guangzhou 30000 180000
Chongqing 10000 100000
1. Vertical concat
>>> frames = [df1, df2, df3]
>>> print(frames)
[ apts cars
Shanghai 55000 200000
Beijing 60000 300000, apts cars
Hangzhou 25000 150000
Najing 20000 120000, apts cars
Guangzhou 30000 180000
Chongqing 10000 100000]
>>> result = pd.concat(frames)
>>> print(result)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
Guangzhou 30000 180000
Chongqing 10000 100000
>>> # tag each part of the concat with a key
>>> result2 = pd.concat(frames, keys=['x', 'y', 'z'])
>>> print(result2)
apts cars
x Shanghai 55000 200000
Beijing 60000 300000
y Hangzhou 25000 150000
Najing 20000 120000
z Guangzhou 30000 180000
Chongqing 10000 100000
>>> result2.loc['y']
apts cars
Hangzhou 25000 150000
Najing 20000 120000
2. Horizontal concat
>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
...                    index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
>>> print(df4)
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
>>> result3 = pd.concat([result, df4], axis=1, sort=True)
>>> print(result3)
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Chongqing 10000.0 100000.0 NaN
Guangzhou 30000.0 180000.0 20000.0
Hangzhou 25000.0 150000.0 NaN
Najing 20000.0 120000.0 NaN
Shanghai 55000.0 200000.0 30000.0
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0
>>> # convert the DataFrame into a hierarchically indexed Series
>>> print(result3.stack())
Beijing apts 60000.0
cars 300000.0
salaries 30000.0
Chongqing apts 10000.0
cars 100000.0
Guangzhou apts 30000.0
cars 180000.0
salaries 20000.0
Hangzhou apts 25000.0
cars 150000.0
Najing apts 20000.0
cars 120000.0
Shanghai apts 55000.0
cars 200000.0
salaries 30000.0
Suzhou salaries 10000.0
Tianjin salaries 15000.0
dtype: float64
3. Concat with join
- Perform an inner join on the index;
>>> print(result)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
Guangzhou 30000 180000
Chongqing 10000 100000
>>> print(df4)
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
>>> result3 = pd.concat([result, df4], axis=1, join='inner')
>>> print(result3)
apts cars salaries
Shanghai 55000 200000 30000
Beijing 60000 300000 30000
Guangzhou 30000 180000 20000
4. append (deprecated; removed in pandas 2.0)
- Vertical concat;
>>> print(df1)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
>>> print(df2)
apts cars
Hangzhou 25000 150000
Najing 20000 120000
>>> df1.append(df2)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
- Horizontal concat;
>>> print(df1)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
>>> print(df4)
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
>>> df1.append(df4, sort=True)
apts cars salaries
Shanghai 55000.0 200000.0 NaN
Beijing 60000.0 300000.0 NaN
Suzhou NaN NaN 10000.0
Beijing NaN NaN 30000.0
Shanghai NaN NaN 30000.0
Guangzhou NaN NaN 20000.0
Tianjin NaN NaN 15000.0
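Since `append` was removed in pandas 2.0, the drop-in replacement is `pd.concat`; a self-contained sketch with pared-down frames (only the `apts` column):

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000, 60000]}, index=['Shanghai', 'Beijing'])
df2 = pd.DataFrame({'apts': [25000, 20000]}, index=['Hangzhou', 'Najing'])
# pd.concat([df1, df2]) replaces the removed df1.append(df2)
combined = pd.concat([df1, df2])
print(combined)
```

The row order and index labels are preserved exactly as `append` would have produced them.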
5. Concatenate Series and DataFrame
- Concatenate a Series as a column;
>>> s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
>>> s1
Shanghai 60
Beijing 50
Name: meal, dtype: int64
>>> print(df1)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
>>> print(s1)
Shanghai 60
Beijing 50
Name: meal, dtype: int64
>>> print(pd.concat([df1, s1], axis=1))
apts cars meal
Shanghai 55000 200000 60
Beijing 60000 300000 50
- Append a Series as a row;
>>> s2 = pd.Series([18000, 12000], index=['apts', 'cars'], name='Xiamen')
>>> s2
apts 18000
cars 12000
Name: Xiamen, dtype: int64
>>> print(df1.append(s2))
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Xiamen 18000 12000
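In pandas 2.0, where `append` no longer exists, a named Series can be appended as a row by converting it to a one-row DataFrame first; a sketch with a simplified one-row `df1`:

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000], 'cars': [200000]}, index=['Shanghai'])
s2 = pd.Series([18000, 12000], index=['apts', 'cars'], name='Xiamen')
# to_frame().T turns the Series into a one-row DataFrame whose
# row label is the Series name, 'Xiamen'
combined = pd.concat([df1, s2.to_frame().T])
print(combined)
```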
5. Merge
1. Inner join on a specified column
>>> df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
...                     'cars': [200000, 300000, 250000],
...                     'cities': ['Shanghai', 'Beijing', 'Shenzhen']})
>>> df1
apts cars cities
0 55000 200000 Shanghai
1 60000 300000 Beijing
2 58000 250000 Shenzhen
>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
...                     'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
>>> df4
salaries cities
0 10000 Suzhou
1 30000 Beijing
2 30000 Shanghai
3 20000 Guangzhou
4 15000 Tianjin
>>> pd.merge(df1, df4, on='cities')
apts cars cities salaries
0 55000 200000 Shanghai 30000
1 60000 300000 Beijing 30000
2. Outer join on a specified column
>>> pd.merge(df1, df4, on='cities', how='outer')
apts cars cities salaries
0 55000.0 200000.0 Shanghai 30000.0
1 60000.0 300000.0 Beijing 30000.0
2 58000.0 250000.0 Shenzhen NaN
3 NaN NaN Suzhou 10000.0
4 NaN NaN Guangzhou 20000.0
5 NaN NaN Tianjin 15000.0
3. Right join on a specified column
>>> pd.merge(df1, df4, on='cities', how='right')
apts cars cities salaries
0 NaN NaN Suzhou 10000
1 60000.0 300000.0 Beijing 30000
2 55000.0 200000.0 Shanghai 30000
3 NaN NaN Guangzhou 20000
4 NaN NaN Tianjin 15000
4. Left join on a specified column
>>> pd.merge(df1, df4, on='cities', how='left')
apts cars cities salaries
0 55000 200000 Shanghai 30000.0
1 60000 300000 Beijing 30000.0
2 58000 250000 Shenzhen NaN
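When debugging joins, `merge`'s `indicator=True` option adds a `_merge` column recording which side each row came from; a sketch with simplified frames:

```python
import pandas as pd

df1 = pd.DataFrame({'cities': ['Shanghai', 'Shenzhen'], 'apts': [55000, 58000]})
df4 = pd.DataFrame({'cities': ['Shanghai', 'Suzhou'], 'salaries': [30000, 10000]})
# _merge takes the values 'both', 'left_only', or 'right_only'
merged = pd.merge(df1, df4, on='cities', how='outer', indicator=True)
print(merged)
```

This makes it easy to spot keys that failed to match on one side of the join.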
6. Join
>>> df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
...                     'cars': [200000, 300000, 250000]},
...                    index=['Shanghai', 'Beijing', 'Shenzhen'])
>>> df1
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Shenzhen 58000 250000
>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
...                    index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
>>> df4
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
1. Inner join on the index
>>> df1.join(df4)
apts cars salaries
Shanghai 55000 200000 30000.0
Beijing 60000 300000 30000.0
Shenzhen 58000 250000 NaN
2. Outer join on the index
>>> df1.join(df4, how='outer')
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Guangzhou NaN NaN 20000.0
Shanghai 55000.0 200000.0 30000.0
Shenzhen 58000.0 250000.0 NaN
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0
>>> pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Guangzhou NaN NaN 20000.0
Shanghai 55000.0 200000.0 30000.0
Shenzhen 58000.0 250000.0 NaN
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0
7. Group By
1. Group summation
>>> salaries = pd.DataFrame({
...     'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
...     'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
...     'Salary': [10000, 2000, 4000, 5000, 18000, 25000, 3000, 4000],
...     'Bonus': [3000, 1000, 1000, 1200, 4000, 2300, 500, 1000]})
>>> salaries
Name Year Salary Bonus
0 July 2016 10000 3000
1 Chu 2016 2000 1000
2 Chu 2016 4000 1000
3 Lin 2016 5000 1200
4 July 2017 18000 4000
5 July 2017 25000 2300
6 Chu 2017 3000 500
7 July 2017 4000 1000
>>> group_by_name = salaries.groupby('Name')
>>> group_by_name
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11c48e550>
>>> group_by_name.aggregate(sum)
Year Salary Bonus
Name
Chu 6049 9000 2500
July 8067 57000 10300
Lin 2016 5000 1200
>>> group_by_name.sum()
Year Salary Bonus
Name
Chu 6049 9000 2500
July 8067 57000 10300
Lin 2016 5000 1200
>>> group_by_name_year = salaries.groupby(['Name', 'Year'])
>>> group_by_name_year.sum()
Salary Bonus
Name Year
Chu 2016 6000 2000
2017 3000 500
July 2016 10000 3000
2017 47000 7300
Lin 2016 5000 1200
>>> group_by_name_year.size()
Name Year
Chu 2016 2
2017 1
July 2016 1
2017 3
Lin 2016 1
dtype: int64
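`agg` also accepts a different function per column, for example the mean salary but the maximum bonus per name; a sketch on a pared-down version of the data:

```python
import pandas as pd

salaries = pd.DataFrame({
    'Name': ['July', 'Chu', 'Chu', 'Lin'],
    'Salary': [10000, 2000, 4000, 5000],
    'Bonus': [3000, 1000, 1000, 1200]})
# one aggregation function per column
stats = salaries.groupby('Name').agg({'Salary': 'mean', 'Bonus': 'max'})
print(stats)
```

Passing function names as strings (`'mean'`, `'max'`) avoids the deprecation warnings newer pandas versions emit for Python builtins like `sum`.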
2. Display summary statistics per group
>>> group_by_name_year.describe()
Salary \
count mean std min 25% 50%
Name Year
Chu 2016 2.0 3000.000000 1414.213562 2000.0 2500.0 3000.0
2017 1.0 3000.000000 NaN 3000.0 3000.0 3000.0
July 2016 1.0 10000.000000 NaN 10000.0 10000.0 10000.0
2017 3.0 15666.666667 10692.676622 4000.0 11000.0 18000.0
Lin 2016 1.0 5000.000000 NaN 5000.0 5000.0 5000.0
Bonus \
75% max count mean std min 25%
Name Year
Chu 2016 3500.0 4000.0 2.0 1000.000000 0.00000 1000.0 1000.0
2017 3000.0 3000.0 1.0 500.000000 NaN 500.0 500.0
July 2016 10000.0 10000.0 1.0 3000.000000 NaN 3000.0 3000.0
2017 21500.0 25000.0 3.0 2433.333333 1504.43788 1000.0 1650.0
Lin 2016 5000.0 5000.0 1.0 1200.000000 NaN 1200.0 1200.0
50% 75% max
Name Year
Chu 2016 1000.0 1000.0 1000.0
2017 500.0 500.0 500.0
July 2016 3000.0 3000.0 3000.0
2017 2300.0 3150.0 4000.0
Lin 2016 1200.0 1200.0 1200.0
8. Application case (calculate the sum of the number of cyclists on each working day)
1. Read bikes.csv into a DataFrame
bikes.csv records Montreal bicycle-path data: there are 7 paths, and the file records the daily rider count on each path;
>>> pd.set_option('display.max_columns', 60)
>>> bikes = pd.read_csv('data/bikes.csv', encoding='latin1')
>>> bikes
Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;
.. ...
305 01/11/2012;2405;;1208;1701;3082;2076;165;2461
306 02/11/2012;1582;;737;1109;2277;1392;97;1888
307 03/11/2012;844;;380;612;1137;713;105;1302
308 04/11/2012;966;;446;710;1277;692;197;1374
309 05/11/2012;2247;;1170;1705;3221;2143;179;2430
[310 rows x 1 columns]
Load bikes.csv into a DataFrame with the columns split on the correct separator;
>>> bikes = pd.read_csv('data/bikes.csv', sep=';', parse_dates=['Date'],
...                     encoding='latin1', dayfirst=True, index_col='Date')
>>> bikes
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine \
Date
2012-01-01 35 NaN 0
2012-01-02 83 NaN 1
2012-01-03 135 NaN 2
2012-01-04 144 NaN 1
2012-01-05 197 NaN 2
... ... ... ...
2012-11-01 2405 NaN 1208
2012-11-02 1582 NaN 737
2012-11-03 844 NaN 380
2012-11-04 966 NaN 446
2012-11-05 2247 NaN 1170
Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 \
Date
2012-01-01 38 51 26 10 16
2012-01-02 68 153 53 6 43
2012-01-03 104 248 89 3 58
2012-01-04 116 318 111 8 61
2012-01-05 124 330 97 13 95
... ... ... ... ... ...
2012-11-01 1701 3082 2076 165 2461
2012-11-02 1109 2277 1392 97 1888
2012-11-03 612 1137 713 105 1302
...
2012-11-03 NaN
2012-11-04 NaN
2012-11-05 NaN
[310 rows x 9 columns]
2. View data samples
- Use head and slice to get the first 5 rows;
>>> bikes.head(5)
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine \
Date
2012-01-01 35 NaN 0
2012-01-02 83 NaN 1
2012-01-03 135 NaN 2
2012-01-04 144 NaN 1
2012-01-05 197 NaN 2
Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 \
Date
2012-01-01 38 51 26 10 16
2012-01-02 68 153 53 6 43
2012-01-03 104 248 89 3 58
2012-01-04 116 318 111 8 61
2012-01-05 124 330 97 13 95
St-Urbain (données non disponibles)
Date
2012-01-01 NaN
2012-01-02 NaN
2012-01-03 NaN
2012-01-04 NaN
2012-01-05 NaN
>>> bikes[:5]
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine \
Date
2012-01-01 35 NaN 0
2012-01-02 83 NaN 1
2012-01-03 135 NaN 2
2012-01-04 144 NaN 1
2012-01-05 197 NaN 2
Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 \
Date
2012-01-01 38 51 26 10 16
2012-01-02 68 153 53 6 43
2012-01-03 104 248 89 3 58
2012-01-04 116 318 111 8 61
2012-01-05 124 330 97 13 95
St-Urbain (données non disponibles)
Date
2012-01-01 NaN
2012-01-02 NaN
2012-01-03 NaN
2012-01-04 NaN
2012-01-05 NaN
- Use copy to take an independent copy of the selected columns;
>>> berri_bikes = bikes[['Berri 1']].copy()
>>> berri_bikes.head()
Berri 1
Date
2012-01-01 35
2012-01-02 83
2012-01-03 135
2012-01-04 144
2012-01-05 197
>>> berri_bikes.index
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
'2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
'2012-01-09', '2012-01-10',
...
'2012-10-27', '2012-10-28', '2012-10-29', '2012-10-30',
'2012-10-31', '2012-11-01', '2012-11-02', '2012-11-03',
'2012-11-04', '2012-11-05'],
dtype='datetime64[ns]', name='Date', length=310, freq=None)
>>> # day of the month for each date
>>> berri_bikes.index.day
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
...
27, 28, 29, 30, 31, 1, 2, 3, 4, 5],
dtype='int64', name='Date', length=310)
>>> # day of the week for each date (Monday=0)
>>> berri_bikes.index.weekday
Int64Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
...
5, 6, 0, 1, 2, 3, 4, 5, 6, 0],
dtype='int64', name='Date', length=310)
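A DatetimeIndex can also give weekday names directly via `day_name()`, which saves the manual 0-to-Monday mapping; a sketch with a small synthetic index (assuming bikes.csv may not be at hand):

```python
import pandas as pd

# 2012-01-01 was a Sunday, matching weekday 6 in the output above
idx = pd.date_range('2012-01-01', periods=3, freq='D')
print(idx.day_name().tolist())
```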
3. dropna
- Drop every row that contains any NaN;
>>> bikes.dropna()
Empty DataFrame
Columns: [Berri 1, Brébeuf (données non disponibles), Côte-Sainte-Catherine, Maisonneuve 1, Maisonneuve 2, du Parc, Pierre-Dupuy, Rachel1, St-Urbain (données non disponibles)]
Index: []
- Drop only rows that are entirely NaN;
>>> bikes.dropna(how='all').head()
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine \
Date
2012-01-01 35 NaN 0
2012-01-02 83 NaN 1
2012-01-03 135 NaN 2
2012-01-04 144 NaN 1
2012-01-05 197 NaN 2
Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 \
Date
2012-01-01 38 51 26 10 16
2012-01-02 68 153 53 6 43
2012-01-03 104 248 89 3 58
2012-01-04 116 318 111 8 61
2012-01-05 124 330 97 13 95
St-Urbain (données non disponibles)
Date
2012-01-01 NaN
2012-01-02 NaN
2012-01-03 NaN
2012-01-04 NaN
2012-01-05 NaN
- Drop columns that are entirely NaN;
>>> bikes.dropna(axis=1, how='all').head()
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 \
Date
2012-01-01 35 0 38 51
2012-01-02 83 1 68 153
2012-01-03 135 2 104 248
2012-01-04 144 1 116 318
2012-01-05 197 2 124 330
du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 26 10 16
2012-01-02 53 6 43
2012-01-03 89 3 58
2012-01-04 111 8 61
2012-01-05 97 13 95
4. fillna
- Fill missing data (single row);
>>> row = bikes.iloc[0].copy()
>>> print(row)
Berri 1 35.0
Brébeuf (données non disponibles) NaN
Côte-Sainte-Catherine 0.0
Maisonneuve 1 38.0
Maisonneuve 2 51.0
du Parc 26.0
Pierre-Dupuy 10.0
Rachel1 16.0
St-Urbain (données non disponibles) NaN
Name: 2012-01-01 00:00:00, dtype: float64
>>> # row mean
>>> print(row.mean())
25.142857142857142
>>> print(row.fillna(row.mean()))
Berri 1 35.000000
Brébeuf (données non disponibles) 25.142857
Côte-Sainte-Catherine 0.000000
Maisonneuve 1 38.000000
Maisonneuve 2 51.000000
du Parc 26.000000
Pierre-Dupuy 10.000000
Rachel1 16.000000
St-Urbain (données non disponibles) 25.142857
Name: 2012-01-01 00:00:00, dtype: float64
- Fill missing data (entire DataFrame);
>>> # compute the mean of each row
>>> m = bikes.mean(axis=1)
>>> print(m)
Date
2012-01-01 25.142857
2012-01-02 58.142857
2012-01-03 91.285714
2012-01-04 108.428571
2012-01-05 122.571429
...
2012-11-01 1871.142857
2012-11-02 1297.428571
2012-11-03 727.571429
2012-11-04 808.857143
2012-11-05 1870.714286
Length: 310, dtype: float64
>>> # fill each column's missing entries with the corresponding row mean:
>>> # iterate over the columns and fill each NaN from the matching position in m
>>> for i, col in enumerate(bikes):
...     bikes.iloc[:, i] = bikes.iloc[:, i].fillna(m)
...     print(i, col)
0 Berri 1
1 Brébeuf (données non disponibles)
2 Côte-Sainte-Catherine
3 Maisonneuve 1
4 Maisonneuve 2
5 du Parc
6 Pierre-Dupuy
7 Rachel1
8 St-Urbain (données non disponibles)
>>> bikes.head()
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine \
Date
2012-01-01 35 25.142857 0
2012-01-02 83 58.142857 1
2012-01-03 135 91.285714 2
2012-01-04 144 108.428571 1
2012-01-05 197 122.571429 2
Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 \
Date
2012-01-01 38 51 26 10 16
2012-01-02 68 153 53 6 43
2012-01-03 104 248 89 3 58
2012-01-04 116 318 111 8 61
2012-01-05 124 330 97 13 95
St-Urbain (données non disponibles)
Date
2012-01-01 25.142857
2012-01-02 58.142857
2012-01-03 91.285714
2012-01-04 108.428571
2012-01-05 122.571429
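The column loop above can be written without explicit iteration by applying the fill column by column; a minimal sketch with a toy frame, assuming `m` holds the per-row means as before:

```python
import numpy as np
import pandas as pd

bikes = pd.DataFrame({'A': [1.0, np.nan], 'B': [3.0, 4.0]})
m = bikes.mean(axis=1)                      # per-row means: [2.0, 4.0]
# fillna aligns m on the row index, so each column's NaNs
# are filled from the corresponding row mean
filled = bikes.apply(lambda col: col.fillna(m))
print(filled)
```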
5. Calculate the sum of the number of cyclists on each weekday for a single route
- Add a weekday column;
>>> berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday
>>> berri_bikes[:5]
Berri 1 weekday
Date
2012-01-01 35 6
2012-01-02 83 0
2012-01-03 135 1
2012-01-04 144 2
2012-01-05 197 3
- Group summation by weekday;
>>> weekday_counts = berri_bikes.groupby('weekday').aggregate(sum)
>>> weekday_counts
Berri 1
weekday
0 134298
1 135305
2 152972
3 160131
4 141771
5 101578
6 99310
- Replace the default index with weekday names;
>>> weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday',
...                         'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> weekday_counts
Berri 1
Monday 134298
Tuesday 135305
Wednesday 152972
Thursday 160131
Friday 141771
Saturday 101578
Sunday 99310
6. Calculate the sum of the number of cyclists on each weekday for all routes
- Sum across routes by date (total riders per day over all routes);
>>> bikes = pd.read_csv('data/bikes.csv', sep=';', parse_dates=['Date'],
...                     encoding='latin1', dayfirst=True, index_col='Date')
>>> bikes_sum = bikes.sum(axis=1).to_frame()
>>> print(bikes_sum.head())
0
Date
2012-01-01 176.0
2012-01-02 407.0
2012-01-03 639.0
2012-01-04 759.0
2012-01-05 858.0
- Add weekday column;
>>> print(bikes_sum.index)
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
'2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
'2012-01-09', '2012-01-10',
...
'2012-10-27', '2012-10-28', '2012-10-29', '2012-10-30',
'2012-10-31', '2012-11-01', '2012-11-02', '2012-11-03',
'2012-11-04', '2012-11-05'],
dtype='datetime64[ns]', name='Date', length=310, freq=None)
>>> bikes_sum.loc[:, 'weekday'] = bikes_sum.index.weekday
>>> bikes_sum.head()
0 weekday
Date
2012-01-01 176.0 6
2012-01-02 407.0 0
2012-01-03 639.0 1
2012-01-04 759.0 2
2012-01-05 858.0 3
- Group-sum by weekday (total riders on each weekday across all routes);
>>> weekday_counts = bikes_sum.groupby('weekday').aggregate(sum)
>>> weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday',
...                         'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> weekday_counts
0
Monday 714963.0
Tuesday 698582.0
Wednesday 789722.0
Thursday 829069.0
Friday 738772.0
Saturday 516701.0
Sunday 518047.0
- Previous: "Python Machine Learning" Numpy Matrix Operation
- Next: "Python Machine Learning" Matplotlib Data Exploration
PS: Everyone is welcome to read and comment; thanks for your likes, follows, and bookmarks!