Pandas data analysis 1
foreword
One line to check whether pandas is installed:
pip list
One line to install it:
pip install pandas
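The installation can also be checked from inside Python. This is only a small sketch; the variable name installed_version is made up for the example.

```python
import pandas as pd

# pd.__version__ holds the version string of the installed pandas package.
installed_version = pd.__version__
print(installed_version)
```

If the import fails with ModuleNotFoundError, pandas is not installed in the interpreter that is running the script.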
Two basic data structures
One-dimensional array Series
Array initialization:
By default, the values come from a list and the index is an integer range starting at 0
import pandas as pd
# Initialize a Series from a list; the elements may have mixed types
series1 = pd.Series([1, 'a', [1.1, 1.2], 'string'])
print(series1)
print(series1[3])
The index parameter specifies the index:
The index does not have to be 0, 1, 2; it can also hold other types, such as strings
import numpy as np
series2 = pd.Series([1, 'a', [1.1, 1.2], np.nan], index=['甲', '乙', '丙', '丁'])
print(series2)
print(series2['甲'])
A dictionary specifies both index and values:
Each key in the dictionary becomes the index label of one row in the Series
dictionary = {
'甲':1, '乙':'a', '丙':[123, 456], '丁':np.nan}
series3 = pd.Series(dictionary)
print(series3)
print(series3['丁'])
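As a small illustration of the construction styles above, the sketch below builds the same Series once from a list with an explicit index and once from a dictionary, then reads one element by label and by position. The labels x, y, z and all variable names are made up for this example.

```python
import pandas as pd

# Same data, two construction styles.
from_list = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
from_dict = pd.Series({'x': 10, 'y': 20, 'z': 30})

by_label = from_list['y']         # lookup by index label
by_position = from_list.iloc[1]   # lookup by integer position

print(by_label, by_position)      # 20 20
print(from_list.equals(from_dict))  # True
```

Using .iloc for positions keeps the intent clear when the index labels are not integers.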
Two-dimensional array DataFrame:
Initialization with columns:
columns lists the column labels and is defined independently of data
import pandas as pd
import numpy as np
# Write the data row by row
data = [
['a', 1, 111],
['b', 2, 222],
['c', 3, 333]
]
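# List the column labels in the same order as the values in each row
columns = ['name', 'class', 'score']
# Pass columns by keyword: the second positional argument of pd.DataFrame is index
dataframe = pd.DataFrame(data, columns=columns) # optional: dtype=...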
print(dataframe)
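The second positional parameter of pd.DataFrame is index, not columns, so the column labels are safest passed by keyword. The sketch below contrasts the two calls on the same data; the names as_index and as_columns are made up for the example.

```python
import pandas as pd

data = [
    ['a', 1, 111],
    ['b', 2, 222],
    ['c', 3, 333],
]
columns = ['name', 'class', 'score']

# Positional: the list is taken as the row index.
as_index = pd.DataFrame(data, columns)
# Keyword: the list is taken as the column labels.
as_columns = pd.DataFrame(data, columns=columns)

print(list(as_index.index))      # ['name', 'class', 'score']
print(list(as_index.columns))    # [0, 1, 2]
print(list(as_columns.columns))  # ['name', 'class', 'score']
```

The positional call only goes unnoticed here because the data happens to have as many rows as there are labels.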
Dictionary initialization:
Each key in the dictionary becomes one column of the DataFrame. No row index is
given here, so the default integer index 0, 1, 2, ... is used. A value that is
missing from a column is represented by numpy.nan
import pandas as pd
import numpy as np
# Write the dictionary column by column
dictionary = {
'name' : ['a', 'b', 'c'],
'class': [1, 2, 3],
'score': [111, 222, 333]
}
dataframe = pd.DataFrame(dictionary) #dtype=float
print(dataframe)
print(dataframe[['name', 'score']])
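To see where missing values come from, the sketch below builds a DataFrame from two Series whose indexes do not fully overlap. The Series names and values are made up for the example.

```python
import pandas as pd

names = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
scores = pd.Series([111, 222], index=[0, 1])  # no entry for row 2

# Columns are aligned on the union of the two indexes.
df = pd.DataFrame({'name': names, 'score': scores})
print(df)

# Row 2 has no score, so pandas stores NaN there
# and the score column becomes float64.
print(pd.isna(df.loc[2, 'score']))  # True
print(df['score'].dtype)            # float64
```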
Series and DataFrame synthesis:
A **column** of a DataFrame can be filled with a Series
# The data argument accepts several kinds of values
# Array of dates
dates = pd.date_range('20130101', periods=4)
# Series
series = pd.Series(1, index=list(range(4)), dtype='float32')
# NumPy array
np_array = np.array([3] * 4, dtype='int32')
# Categorical
categorical = pd.Categorical(["test", "train", "test", "train"])
dataframe1 = pd.DataFrame({
'A': 1., # a scalar is broadcast to every row; an array is not and must match the row count
'B': dates,
'C': series,
'D': np_array,
'E': categorical,
'F': 'foo'
})
print(dataframe1)
Basic functions for getting started:
Indexing data:
- Sample data:
import pandas as pd
dataframe = pd.DataFrame({
'a' : [1, 2, 3, 4, 5],
'b' : ['a', 'b', 'c', 'd', 'e'],
'c' : [2.1, 2.2, 2.3, 2.4, 2.5]
})
print(dataframe)
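slice:
[['key1', 'key2', 'key3', ...]] selects columns by label; a single column can also be read as an attribute
print(dataframe.a)
# equivalent to:
print(dataframe['a'])
# multiple columns:
print(dataframe[['b', 'c']])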
loc:
.loc[a:b, ['key1', 'key2', 'key3', ...]]
Selects by label: rows run from label a through label b, and unlike ordinary Python slicing, b is included
# all rows, column c
print(dataframe.loc[:, ['c']])
# rows 0 through 3 (inclusive), columns c and a
print(dataframe.loc[0:3, ['c', 'a']])
iloc:
.iloc[a:b:c, d:e:f]
Rows start at position a and advance by c each step, stopping before b.
Columns start at position d and advance by f each step, stopping before e
.iloc[a:b:c, [d, e]]
Rows start at position a and advance by c each step, stopping before b; only columns d and e are taken
print(dataframe.iloc[0:1, 0:3:2])
print(dataframe.iloc[0:4, [0, 2]])
- iloc works purely on integer positions and cannot take a column label directly; after all, it starts with i, for integer
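The difference between the two indexers is easiest to see side by side. The sketch below reuses the sample data and compares the same slice bounds under loc and iloc; the result names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['a', 'b', 'c', 'd', 'e'],
    'c': [2.1, 2.2, 2.3, 2.4, 2.5],
})

# loc is label-based and includes the end label.
by_label = df.loc[0:3, ['c', 'a']]
print(list(by_label.index))     # [0, 1, 2, 3]

# iloc is position-based and excludes the end position.
by_position = df.iloc[0:3, [2, 0]]
print(list(by_position.index))  # [0, 1, 2]

# Steps work on both axes: every second row, every second column.
stepped = df.iloc[0:5:2, 0:3:2]
print(list(stepped.index))      # [0, 2, 4]
print(list(stepped.columns))    # ['a', 'c']
```

With the default integer index the labels and positions coincide, which is why the off-by-one between loc and iloc is easy to overlook.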
Conditional indexing:
Comparing a column with a value:
# print the whole matching rows
print(dataframe[dataframe.a > 3])
# print only some columns of the matching rows
print(dataframe.loc[dataframe.a > 2, ['b', 'c']])
Testing whether a value is in a list:
print(dataframe[dataframe['a'].isin([3, 5])])
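Conditions like the ones above are boolean Series, so they can be combined. The sketch below is one possible way to do that with & and |; each condition needs its own parentheses because these operators bind more tightly than comparisons. The variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['a', 'b', 'c', 'd', 'e'],
    'c': [2.1, 2.2, 2.3, 2.4, 2.5],
})

# Both conditions must hold (logical AND).
both = df[(df['a'] > 2) & (df['c'] < 2.5)]
print(both['a'].tolist())    # [3, 4]

# At least one condition must hold (logical OR).
either = df[(df['a'] == 1) | df['b'].isin(['e'])]
print(either['a'].tolist())  # [1, 5]
```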
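Deleting data without modifying the original:
- dataframe2 is assumed to be a DataFrame with columns a, b and c, like the sample data above; each call below returns a new DataFrame
Delete rows with index[]:
# df.index[[1, 2, 3]]: delete the rows that these index entries point to
dataframe2.drop(dataframe2.index[[1, 2]])
Delete columns with [] + axis:
dataframe2.drop(['a', 'b'], axis = 1)
Delete rows with a given value in a column:
dataframe2[~(dataframe2['a'] == 3)]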
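A runnable sketch of the three deletions, assuming dataframe2 looks like the earlier sample data (that assumption and the result names are not from the original). It also checks that the source frame is left untouched.

```python
import pandas as pd

# Assumed contents of dataframe2, copied from the sample data.
dataframe2 = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['a', 'b', 'c', 'd', 'e'],
    'c': [2.1, 2.2, 2.3, 2.4, 2.5],
})

without_rows = dataframe2.drop(dataframe2.index[[1, 2]])
without_columns = dataframe2.drop(['a', 'b'], axis=1)
without_value = dataframe2[~(dataframe2['a'] == 3)]

print(list(without_rows.index))       # [0, 3, 4]
print(list(without_columns.columns))  # ['c']
print(without_value['a'].tolist())    # [1, 2, 4, 5]
print(dataframe2.shape)               # (5, 3), the original is unchanged
```

To keep a result, assign it to a name as done here; calling drop without using its return value has no visible effect.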
Data preprocessing:
- The given data is as follows:
df1 = pd.DataFrame({
'id': [1,2,3],
'a': [11,22,33],
'b': [111,222,333],})
df2 = pd.DataFrame({
'id': [4,5,6],
'd': [1.1,2.2,3.3]})
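concat stitching:
axis chooses the direction: axis=0 stacks the frames vertically and aligns them on column names, while axis=1 places them side by side and aligns them on the row index.
join decides what happens to labels that are not shared on the other axis: outer keeps all of them and fills the gaps with NaN, inner keeps only the labels present in every frame.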
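axis0 + outer:
# stack rows, outer join: the columns are the union (id, a, b, d); the original index labels are kept, missing cells become NaN
pd.concat([df1, df2], axis=0, join='outer')
axis0 + inner:
# stack rows, inner join: only the columns shared by both frames (id) are kept
pd.concat([df1, df2], axis=0, join='inner')
axis1 + outer:
# side by side, outer join: rows are aligned on the index; df1 and df2 share the index 0..2, so no NaN appears
pd.concat([df1, df2], axis=1, join='outer')
axis1 + inner:
# side by side, inner join: only index labels present in both frames are kept; here that is again 0..2
pd.concat([df1, df2], axis=1, join='inner')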
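The four combinations can be compared by looking at the shapes of the results. In the sketch below, df2_shifted is a hypothetical copy of df2 with index 1, 2, 3, added only so that outer and inner give different results on axis=1.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'a': [11, 22, 33], 'b': [111, 222, 333]})
df2 = pd.DataFrame({'id': [4, 5, 6], 'd': [1.1, 2.2, 3.3]})

# axis=0: rows are stacked, columns are aligned by name.
rows_outer = pd.concat([df1, df2], axis=0, join='outer')
rows_inner = pd.concat([df1, df2], axis=0, join='inner')
print(rows_outer.shape)          # (6, 4)
print(list(rows_outer.index))    # [0, 1, 2, 0, 1, 2]
print(list(rows_inner.columns))  # ['id']

# axis=1: frames sit side by side, rows are aligned by index label.
df2_shifted = df2.copy()
df2_shifted.index = [1, 2, 3]    # hypothetical, partially overlapping index
cols_outer = pd.concat([df1, df2_shifted], axis=1, join='outer')
cols_inner = pd.concat([df1, df2_shifted], axis=1, join='inner')
print(cols_outer.shape)          # (4, 5)
print(cols_inner.shape)          # (2, 5)
```

The repeated index labels in rows_outer can be replaced with a fresh 0..n-1 range by passing ignore_index=True to pd.concat.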
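merge splicing:
concat with axis=1 can only align on the index.
merge can join on a chosen key column and is not limited to the index
# the default is how='inner'; df1 and df2 above share no id values, so this particular result is empty
pd.merge(df1, df2, on = 'id')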
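A sketch of merging on a key column, using two small hypothetical frames (left and right) whose id values partly overlap so that the join types give different results. The how parameter selects the join type; inner is the default.

```python
import pandas as pd

# Hypothetical frames with partially overlapping ids.
left = pd.DataFrame({'id': [1, 2, 3], 'a': [11, 22, 33]})
right = pd.DataFrame({'id': [2, 3, 4], 'd': [2.2, 3.3, 4.4]})

inner = pd.merge(left, right, on='id')                    # ids in both frames
outer = pd.merge(left, right, on='id', how='outer')       # ids in either frame
left_join = pd.merge(left, right, on='id', how='left')    # every id from left

print(inner['id'].tolist())          # [2, 3]
print(sorted(outer['id'].tolist()))  # [1, 2, 3, 4]
print(left_join['id'].tolist())      # [1, 2, 3]
print(left_join['d'].isna().tolist())  # [True, False, False]
```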
drop_duplicates() deduplication:
Deduplication targets rows that are completely identical. The surviving rows
keep their original index labels, so the index may no longer be consecutive.
- Data are as follows:
import pandas as pd
dataframe = pd.DataFrame({
'id' : [1, 2, 3, 3, 5],
'value' : [1, 2, 3, 3, 5]
})
keep='first':
- Keep the first row of each group of duplicates
dataframe.drop_duplicates(keep='first')
keep='last':
- Keep the last row of each group of duplicates
dataframe.drop_duplicates(keep='last')
keep=False:
- Keep none of the duplicated rows; every row that has a duplicate is dropped
dataframe.drop_duplicates(keep=False)
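The three keep settings differ only in which index labels survive. The sketch below prints them for the data above (with the second column named value here) and shows reset_index(drop=True) as one way to get a consecutive index again; the result names are made up for the example.

```python
import pandas as pd

dataframe = pd.DataFrame({
    'id': [1, 2, 3, 3, 5],
    'value': [1, 2, 3, 3, 5],
})

first = dataframe.drop_duplicates(keep='first')
last = dataframe.drop_duplicates(keep='last')
neither = dataframe.drop_duplicates(keep=False)

print(list(first.index))    # [0, 1, 2, 4]
print(list(last.index))     # [0, 1, 3, 4]
print(list(neither.index))  # [0, 1, 4]

# Renumber the rows 0..n-1 and discard the old labels.
renumbered = first.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```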
Remove NaNs:
any mode:
If a row/column contains at least one NaN, that row/column is discarded
# drop every row that contains a NaN
dataframe3.dropna(how = 'any', axis=0)
# drop every column that contains a NaN
dataframe3.dropna(how = 'any', axis=1)
all mode:
A row/column is discarded only if all of its values are NaN
# drop every row whose values are all NaN
dataframe3.dropna(how = 'all', axis = 0)
# drop every column whose values are all NaN
dataframe3.dropna(how = 'all', axis = 1)
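The original does not show the contents of dataframe3, so the sketch below assumes a small frame with one all-NaN row, one all-NaN column, and one isolated NaN. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Assumed contents: row 3 and column 'empty' are entirely NaN,
# and column 'b' has one extra NaN in row 1.
dataframe3 = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan],
    'b': [4.0, np.nan, 6.0, np.nan],
    'c': [7.0, 8.0, 9.0, np.nan],
    'empty': [np.nan] * 4,
})

# 'all' mode removes only fully empty rows/columns.
no_empty_rows = dataframe3.dropna(how='all', axis=0)
no_empty_cols = dataframe3.dropna(how='all', axis=1)
print(list(no_empty_rows.index))    # [0, 1, 2]
print(list(no_empty_cols.columns))  # ['a', 'b', 'c']

# 'any' mode removes anything touched by a NaN.
cleaned = no_empty_cols.dropna(how='all', axis=0)
complete_rows = cleaned.dropna(how='any', axis=0)
complete_cols = cleaned.dropna(how='any', axis=1)
print(list(complete_rows.index))    # [0, 2]
print(list(complete_cols.columns))  # ['a', 'c']
```

Applying how='any' directly to this dataframe3 would remove every row, because the all-NaN column puts a NaN in each of them.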
Fill NaNs:
fillna():
Every NaN is replaced with the given value
dataframe3.fillna(value = 0)
ffill:
Fill each NaN with the value from the previous row of the same column:
# fillna(method='ffill') is deprecated since pandas 2.1; ffill() is the direct equivalent
dataframe3.ffill(axis = 0)
Fill each NaN with the value from the previous column of the same row:
dataframe3.ffill(axis = 1)
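A small sketch of the filling strategies on a hypothetical two-column frame (the name df and its values are made up for the example). It uses DataFrame.ffill, the method form of forward filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 5.0, np.nan],
})

# Replace every NaN with a constant.
zeros = df.fillna(value=0)
print(zeros['y'].tolist())   # [0.0, 5.0, 0.0]

# Forward fill down each column: take the value from the previous row.
down = df.ffill(axis=0)
print(down['x'].tolist())    # [1.0, 1.0, 3.0]

# Forward fill along each row: take the value from the previous column.
across = df.ffill(axis=1)
print(across['y'].tolist())  # [1.0, 5.0, 3.0]
```

A NaN with nothing before it stays NaN: the first y value after filling down, and the x value of row 1 after filling across.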