Foreword
This article will be updated continually, recording the common pandas DataFrame operations that come up in daily work.
1. DataFrame creation
1. Create from a list (or numpy.ndarray)
import pandas as pd

data = [['Jack', 10], ['Tom', 12], ['Lucy', 13]]
columns = ['Name', 'Age']
df_by_list = pd.DataFrame(data, columns=columns)
print(df_by_list)
output:
Name Age
0 Jack 10
1 Tom 12
2 Lucy 13
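The heading also mentions numpy.ndarray; a minimal sketch. Note that a mixed-type 2-D array is stored as strings, so the Age column needs an explicit cast afterwards:

```python
import numpy as np
import pandas as pd

# A 2-D ndarray works just like the nested list above
arr = np.array([['Jack', 10], ['Tom', 12], ['Lucy', 13]])
df_by_ndarray = pd.DataFrame(arr, columns=['Name', 'Age'])
# The mixed array is stored as strings, so cast Age back to int
df_by_ndarray['Age'] = df_by_ndarray['Age'].astype(int)
print(df_by_ndarray)
```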
2. Create from a dictionary
row = {
'Name': ['Jack', 'Tom', 'Lucy'],
'Age': [10, 12, 13]
}
df_by_dict = pd.DataFrame(row)
print(df_by_dict)
output:
Name Age
0 Jack 10
1 Tom 12
2 Lucy 13
3. Read from a csv file
The csv file layout (screenshot not reproduced here):
df = pd.read_csv('city.csv')
print(df.head(5))
output:
id name province city
0 101010100 北京 北京市 北京市
1 101010200 海淀 北京市 海淀
2 101010300 朝阳 北京市 朝阳
3 101010400 顺义 北京市 顺义
4 101010500 怀柔 北京市 怀柔
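Since city.csv is not shipped with this article, here is a self-contained sketch that reads the same kind of data from an in-memory buffer; the dtype parameter keeps the id column as a string instead of letting pandas parse it as int64 (the column names follow the example above):

```python
import io
import pandas as pd

csv_text = "id,name,province,city\n101010100,北京,北京市,北京市\n101010200,海淀,北京市,海淀\n"
# dtype prevents pandas from parsing the id column as int64
df = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})
print(df.head(5))
```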
2. Query
1. Direct indexing on df
① Query a column
names = df['Name'].tolist()
print(names)
output:
['Jack', 'Tom', 'Lucy']
② Query multiple columns
names = df[['Name','Age']]
print(names)
output:
Name Age
0 Jack 10
1 Tom 12
2 Lucy 13
③ Conditional query
ages = df[(df['Age'] > 10) & (df['Age'] < 13)]
print(ages)
output:
Name Age
1 Tom 12
2. The query() method
① Conditional query
result = df.query('Age > 10 & Age < 13')
print(result)
output:
Name Age
1 Tom 12
② Queries with variables (use @variable)
names = ['Tom', 'Lily', 'Sam']
result = df.query('Name not in @names')
print(result)
output:
Name Age
0 Jack 10
2 Lucy 13
3. Query a row's index value
For example, to find the row index of the row whose Name field is Tom:
print(df)
index = df[df['Name'] == "Tom"].index.tolist()[0]  # look up the index
print("Row index of Tom:", index)
output:
Name Age
0 Jack 10
1 Tom 12
2 Lucy 13
Row index of Tom: 1
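Indexing with [0] raises an IndexError when no row matches; a guarded variant (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Tom', 'Lucy'], 'Age': [10, 12, 13]})
# Collect matching index labels first, then fall back to None when empty
matches = df.index[df['Name'] == 'Tom'].tolist()
index = matches[0] if matches else None
print(index)  # 1
```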
4. Fuzzy query (the column must be a string type)
For example, to fuzzy-query the Sdate field for rows containing 2023:
data = [['20201001', 10], ['20201002', 12], ['20201003', 13],['20231003', 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
df = df[df['Sdate'].str.contains('2023')] # fuzzy (substring) query
print(df)
output:
Sdate type
3 20231003 13
str.contains also accepts regular expressions; for example, to query the rows starting with 2023:
data = [['20201001', 10], ['20201002', 12], ['20202303', 13],['20231003', 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
df = df[df['Sdate'].str.contains('^2023')] # regular expression
print(df)
output:
Sdate type
3 20231003 13
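For a plain prefix match, str.startswith is an alternative to the '^2023' regex (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Sdate': ['20201001', '20201002', '20202303', '20231003'],
                   'type': [10, 12, 13, 13]})
# Plain prefix test, no regex engine involved
result = df[df['Sdate'].str.startswith('2023')]
print(result)
```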
3. Add
1. Add a column
① Direct assignment: appends a new column at the end
df['Gender'] = ['M', 'M', 'F']
print(df)
output:
Name Age Gender
0 Jack 10 M
1 Tom 12 M
2 Lucy 13 F
② insert method: lets you specify the insert position
df.insert(0, 'Gender', ['M', 'M', 'F'])
print(df)
output:
Gender Name Age
0 M Jack 10
1 M Tom 12
2 F Lucy 13
2. Add rows
① loc indexer: add one row
df.loc[len(df.index)] = ('Lily', 20)
print(df)
output:
Name Age
0 Jack 10
1 Tom 12
2 Lucy 13
3 Lily 20
Note: if the target label already exists (i.e., it is not one past the last row), the existing row is overwritten, for example:
df.loc[1] = ('Lily', 20)
print(df)
output:
Name Age
0 Jack 10
1 Lily 20
2 Lucy 13
② Add multiple rows (concat)
data1 = [['Lily', 23], ['Sam', 35]]
columns1 = ['Name', 'Age']
df1 = pd.DataFrame(data1, columns=columns1)
df2 = pd.concat([df, df1], ignore_index=True)
print(df2)
output:
Name Age
0 Jack 10
1 Tom 12
2 Lucy 13
3 Lily 23
4 Sam 35
Notes:
1. ignore_index=True resets the row index of the result
2. The append method is deprecated (removed in pandas 2.0); use concat instead
3. If the two DataFrames do not share the same columns, concat keeps the union of the columns and fills the missing cells with NaN
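When the two frames do not have exactly the same columns, concat keeps the union of the columns and NaN-fills the gaps instead of raising an error; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack'], 'Age': [10]})
df1 = pd.DataFrame({'Name': ['Sam'], 'Height': [170]})
# Columns present in only one frame are NaN-filled in the result
df2 = pd.concat([df, df1], ignore_index=True)
print(df2)
```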
4. Update
1. Update whole rows (the update method)
data1 = [['Lily', 23], ['Sam', 35]]
columns1 = ['Name', 'Age']
new_df = pd.DataFrame(data1, columns=columns1)
df.update(new_df)
print(df)
output:
Name Age
0 Lily 23.0
1 Sam 35.0
2 Lucy 13.0
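update() aligns on index and column labels and skips NaN values in the incoming frame, so a cell can be left untouched on purpose; a minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Jack', 'Tom', 'Lucy'], 'Age': [10, 12, 13]})
new_df = pd.DataFrame({'Name': ['Lily', np.nan], 'Age': [23, 35]})
df.update(new_df)  # the NaN in new_df leaves 'Tom' in place
print(df)
```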
2. Update a single value
① Modify by positional index (iloc):
df.iloc[0, 1] = 25 # 0 = first row by position, 1 = second column
print(df)
output:
Name Age
0 Jack 25
1 Tom 12
2 Lucy 13
② Modify by index label (loc):
df.loc[0, 'Age'] = 25 # the row whose index label is 0
print(df)
output:
Name Age
0 Jack 25
1 Tom 12
2 Lucy 13
3. Change the data type of an entire column
For example, change the Sdate column from a numeric type to a string type:
data = [[20201001, 10], [20201002, 12], [20201003, 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
print(df)
print("Sdate type before:", df['Sdate'].dtypes)
df['Sdate'] = pd.Series(df['Sdate'], dtype="string") # change the type
print("Sdate type after:", df['Sdate'].dtypes)
output:
Sdate type
0 20201001 10
1 20201002 12
2 20201003 13
Sdate type before: int64
Sdate type after: string
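The same conversion can be written more compactly with astype (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Sdate': [20201001, 20201002, 20201003], 'type': [10, 12, 13]})
# astype('string') has the same effect as pd.Series(..., dtype="string") here
df['Sdate'] = df['Sdate'].astype('string')
print(df['Sdate'].dtype)
```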
4. Reformat a date column (string/object type)
For example, convert the '20201001' format of the Sdate column to the '2020-10-01' format:
data = [['20201001', 10], ['20201002', 12], ['20201003', 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
print(df)
# pd.to_datetime(df['Sdate']) converts the Sdate column to the datetime64[ns] type
df['Sdate'] = pd.to_datetime(df['Sdate']).dt.strftime('%Y-%m-%d') # format as a string
print(df)
output:
Sdate type
0 20201001 10
1 20201002 12
2 20201003 13
Sdate type
0 2020-10-01 10
1 2020-10-02 12
2 2020-10-03 13
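Passing an explicit format to pd.to_datetime avoids format guessing and is noticeably faster on large columns; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Sdate': ['20201001', '20201002', '20201003'], 'type': [10, 12, 13]})
# format='%Y%m%d' tells pandas exactly how to parse the input strings
df['Sdate'] = pd.to_datetime(df['Sdate'], format='%Y%m%d').dt.strftime('%Y-%m-%d')
print(df)
```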
5. Delete
1. Delete rows
df = df.drop(df[(df['Age'] > 10) & (df['Age'] < 13)].index)
print(df)
output:
Name Age
0 Jack 10
2 Lucy 13
2. Delete a column
df = df.drop('Age', axis=1)
print(df)
output:
Name
0 Jack
1 Tom
2 Lucy
Note:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)
- labels: the row or column labels to delete, given as a list
- axis: defaults to 0 (delete rows); must be set to 1 when deleting columns
- index: directly specifies the rows to delete; pass a list to delete multiple rows
- columns: directly specifies the columns to delete; pass a list to delete multiple columns
- inplace: defaults to False, so the delete does not change the original data; with inplace=True the original data is modified
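The index=/columns= keywords cover both cases without an axis argument; a sketch combining them:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Tom', 'Lucy'], 'Age': [10, 12, 13]})
# Delete the row with label 1 and the Age column in one call
result = df.drop(index=[1], columns=['Age'])
print(result)
```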
6. Traversal
for index, row in df.iterrows():
    print(index, row['Name'], row['Age'])
output:
0 Jack 10
1 Tom 12
2 Lucy 13
Note: iterrows() yields a tuple (index, row), where index is the row index and row is a Series holding all of that row's data, which can be accessed by column name
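itertuples() is usually faster than iterrows() and yields namedtuples whose fields are accessed as attributes; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Tom', 'Lucy'], 'Age': [10, 12, 13]})
# Each row is a namedtuple; the Index field holds the row label
for row in df.itertuples():
    print(row.Index, row.Name, row.Age)
```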
7. Conversion
1. Converting between dictionaries and DataFrames
Reference article: https://blog.csdn.net/m0_43609475/article/details/125328938
2. Data type conversion
df = pd.read_csv('energy.csv', encoding='gb2312')
print(df.dtypes)
df['能量值'] = df['能量值'].astype(object)
print("=====================================")
print(df.dtypes)
output:
日期 object
能量值 int64
电量值 float64
dtype: object
=====================================
日期 object
能量值 object
电量值 float64
dtype: object
3. Convert NaN values to None
Reason: pandas represents missing values as NaN; before inserting into a database, NaN must be converted to None, otherwise an error is raised
df = pd.read_csv('energy.csv', encoding='gb2312')
print(df)
# df.astype(object) ==> DataFrame: first cast every column to the object type
# df.where(cond, value) ==> DataFrame: keeps the original value where cond is True and fills the given value where it is False
# pd.notnull(df) ==> DataFrame: returns a boolean df, False at NaN positions, True everywhere else
print("=====================================")
df = df.astype(object).where(pd.notnull(df), None)
print(df)
output:
日期 能量值 电量值
0 2020-06-06 2900 NaN
1 2020-06-07 3300 0.0
2 2020-06-08 666 666.0
=====================================
日期 能量值 电量值
0 2020-06-06 2900 None
1 2020-06-07 3300 0.0
2 2020-06-08 666 666.0
8. Others
1. Remove rows containing NaN values
df = pd.read_csv('energy.csv', encoding='gb2312')
print(df)
print("==========================================")
result = df.drop(df[df.isnull().T.any()].index)
print(result)
output:
日期 能量值 电量值
0 2020-06-06 2900 NaN
1 2020-06-07 3300 0.0
2 2020-06-08 666 666.0
==========================================
日期 能量值 电量值
1 2020-06-07 3300 0.0
2 2020-06-08 666 666.0
Explanation, step by step:
df = pd.read_csv('energy.csv', encoding='gb2312')
print(df)
print("==========================================")
print("df.isnull():")
print(df.isnull())
print("==========================================")
print("df.isnull().T:")
print(df.isnull().T)
print("==========================================")
print("df.isnull().T.any():")
print(df.isnull().T.any()) # any() ==> Series: whether any element along the axis is True
print("==========================================")
print("df[df.isnull().T.any()]:")
print(df[df.isnull().T.any()])
print("==========================================")
print("df[df.isnull().T.any()].index:")
print(df[df.isnull().T.any()].index)
output:
日期 能量值 电量值
0 2020-06-06 2900 NaN
1 2020-06-07 3300 0.0
2 2020-06-08 666 666.0
==========================================
df.isnull():
日期 能量值 电量值
0 False False True
1 False False False
2 False False False
==========================================
df.isnull().T:
0 1 2
日期 False False False
能量值 False False False
电量值 True False False
==========================================
df.isnull().T.any():
0 True
1 False
2 False
dtype: bool
==========================================
df[df.isnull().T.any()]:
日期 能量值 电量值
0 2020-06-06 2900 NaN
==========================================
df[df.isnull().T.any()].index:
Int64Index([0], dtype='int64')
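The whole isnull().T.any() chain above can be replaced by the built-in dropna(); a sketch on an inline frame (energy.csv is not reproduced here, so the data mirrors the example output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'日期': ['2020-06-06', '2020-06-07', '2020-06-08'],
                   '能量值': [2900, 3300, 666],
                   '电量值': [np.nan, 0.0, 666.0]})
# Drop every row that contains at least one NaN
result = df.dropna()
print(result)
```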
2. Join (merge) operation
df1 = pd.read_csv('energy.csv', encoding='gb2312')
df2 = pd.read_csv('energy.csv', encoding='gb2312')
result = pd.merge(df1, df2, how='left', on=['日期']) # df1 left join df2
print(result)
output:
日期 能量值_x 电量值_x 能量值_y 电量值_y
0 2020-06-06 2900 NaN 2900 NaN
1 2020-06-07 3300 0.0 3300 0.0
2 2020-06-08 666 666.0 666 666.0
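The _x/_y endings in the output come from merge's default suffixes for overlapping non-key columns; they can be customized with the suffixes parameter (a sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'日期': ['2020-06-06'], '能量值': [2900]})
df2 = pd.DataFrame({'日期': ['2020-06-06'], '能量值': [2900]})
# Overlapping non-key columns get the given suffixes instead of _x/_y
result = pd.merge(df1, df2, how='left', on=['日期'], suffixes=('_left', '_right'))
print(result.columns.tolist())
```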