The pandas library: common DataFrame operations


Foreword

This article will be updated over time, recording common operations on the pandas DataFrame that come up in daily work.


1. DataFrame creation

1. Create from a list (or numpy.ndarray)

import pandas as pd

data = [['Jack', 10], ['Tom', 12], ['Lucy', 13]]
columns = ['Name', 'Age']
df_by_list = pd.DataFrame(data, columns=columns)
print(df_by_list)

output:

   Name  Age
0  Jack   10
1   Tom   12
2  Lucy   13

2. Create from a dictionary

row = {
    'Name': ['Jack', 'Tom', 'Lucy'],
    'Age': [10, 12, 13]
}
df_by_dict = pd.DataFrame(row)
print(df_by_dict)

output:

   Name  Age
0  Jack   10
1   Tom   12
2  Lucy   13

3. Read from a CSV file

The city.csv file looks like this (screenshot omitted):

df = pd.read_csv('city.csv')
print(df.head(5))

output:

          id name province city
0  101010100   北京      北京市  北京市
1  101010200   海淀      北京市   海淀
2  101010300   朝阳      北京市   朝阳
3  101010400   顺义      北京市   顺义
4  101010500   怀柔      北京市   怀柔
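
If a column like id should stay a string (for example to preserve leading zeros), read_csv accepts a dtype mapping. A self-contained sketch, using io.StringIO as a stand-in for city.csv:

```python
import io
import pandas as pd

# Stand-in for city.csv; in practice pd.read_csv('city.csv') reads from disk.
csv_text = "id,name\n101010100,Beijing\n101010200,Haidian\n"
# dtype keeps the id column as a string instead of parsing it as int64.
df = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})
print(df)
```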

2. Query

1. Direct indexing on the df

① Query a single column

names = df['Name'].tolist()
print(names)

output:

['Jack', 'Tom', 'Lucy']

② Query multiple columns

names = df[['Name','Age']]
print(names)

output:

   Name  Age
0  Jack   10
1   Tom   12
2  Lucy   13

③ Conditional query

ages = df[(df['Age'] > 10) & (df['Age'] < 13)]
print(ages)

output:

  Name  Age
1  Tom   12

2. The query() method

① Conditional query

result = df.query('Age > 10 & Age < 13')
print(result)

output:

  Name  Age
1  Tom   12

② Queries with variables (use @variable)

names = ['Tom', 'Lily', 'Sam']
result = df.query('Name not in @names')
print(result)

output:

   Name  Age
0  Jack   10
2  Lucy   13

3. Query a row's index

For example, to find the row index where the Name field is Tom:

print(df)
index = df[df['Name'] == "Tom"].index.tolist()[0]  # look up the index
print("Index of Tom's row:", index)

output:

   Name  Age
0  Jack   10
1   Tom   12
2  Lucy   13
Index of Tom's row: 1
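
Indexing with [0] raises an IndexError when no row matches. A defensive variant, sketched with the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Tom', 'Lucy'], 'Age': [10, 12, 13]})
matches = df.index[df['Name'] == 'Tom'].tolist()
index = matches[0] if matches else None  # avoid IndexError on an empty match
print(index)
```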

4. Fuzzy query (the column must be string type)

For example, a fuzzy query on the Sdate field for data containing 2023:

data = [['20201001', 10], ['20201002', 12], ['20201003', 13],['20231003', 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
df = df[df['Sdate'].str.contains('2023')]  # fuzzy query
print(df)

output:

      Sdate  type
3  20231003    13

str.contains also accepts regular expressions. For example, to query the data starting with 2023:

data = [['20201001', 10], ['20201002', 12], ['20202303', 13],['20231003', 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
df = df[df['Sdate'].str.contains('^2023')]  # regular expression
print(df)

output:

      Sdate  type
3  20231003    13
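
For a plain prefix match, str.startswith avoids regex escaping entirely. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Sdate': ['20201001', '20202303', '20231003'],
                   'type': [10, 13, 13]})
# startswith does a literal prefix test, no regex involved
result = df[df['Sdate'].str.startswith('2023')]
print(result)
```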

3. Add

1. Add columns

① Direct assignment: appends a new column at the end

df['Gender'] = ['M', 'M', 'F']
print(df)

output:

   Name  Age Gender
0  Jack   10      M
1   Tom   12      M
2  Lucy   13      F

② insert method: lets you specify the insertion position

df.insert(0, 'Gender', ['M', 'M', 'F'])
print(df)

output:

  Gender  Name  Age
0      M  Jack   10
1      M   Tom   12
2      F  Lucy   13

2. Add row

① loc: add a single row

df.loc[len(df.index)] = ('Lily', 20)
print(df)

output:

   Name  Age
0  Jack   10
1   Tom   12
2  Lucy   13
3  Lily   20

Note: if the label already exists, loc overwrites that row instead of appending. For example:

df.loc[1] = ('Lily', 20)
print(df)

output:

   Name  Age
0  Jack   10
1  Lily   20
2  Lucy   13

② Add multiple rows

data1 = [['Lily', 23], ['Sam', 35]]
columns1 = ['Name', 'Age']
df1 = pd.DataFrame(data1, columns=columns1)
df2 = pd.concat([df, df1], ignore_index=True)
print(df2)

output:

   Name  Age
0  Jack   10
1   Tom   12
2  Lucy   13
3  Lily   23
4   Sam   35

Note:
1. ignore_index=True resets the index
2. The append method is deprecated (removed in pandas 2.0); use concat instead
3. concat aligns on column names; columns missing from one frame are filled with NaN
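
Note that concat does not strictly require identical column names: it keeps the union of columns and fills the missing cells with NaN, as this sketch shows:

```python
import pandas as pd

df_a = pd.DataFrame({'Name': ['Jack'], 'Age': [10]})
df_b = pd.DataFrame({'Name': ['Sam'], 'City': ['NY']})
# Columns present in only one frame are kept and padded with NaN
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```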

4. Update

1. Update entire rows

data1 = [['Lily', 23], ['Sam', 35]]
columns1 = ['Name', 'Age']
new_df = pd.DataFrame(data1, columns=columns1)
df.update(new_df)
print(df)

output:

   Name   Age
0  Lily  23.0
1   Sam  35.0
2  Lucy  13.0
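
update aligns on both the index and the column labels and skips NaN cells in the other frame, so it can also patch a single cell. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Tom', 'Lucy'], 'Age': [10, 12, 13]})
# update matches rows by index label and columns by name;
# only the overlapping non-NaN values are replaced
new_df = pd.DataFrame({'Age': [99]}, index=[1])
df.update(new_df)
print(df)
```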

2. Update a value

① Modify by positional index (iloc):

df.iloc[0, 1] = 25  # 0 is the first row by position, 1 is the second column
print(df)

output:

   Name  Age
0  Jack   25
1   Tom   12
2  Lucy   13

② Modify by index label (loc):

df.loc[0, 'Age'] = 25  # the row whose index label is 0
print(df)

output:

   Name  Age
0  Jack   25
1   Tom   12
2  Lucy   13

3. Change the data type of an entire column

For example, change the Sdate column from a numeric type to a string type:

data = [[20201001, 10], [20201002, 12], [20201003, 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
print(df)
print("Sdate original dtype:", df['Sdate'].dtypes)

df['Sdate'] = pd.Series(df['Sdate'], dtype="string")  # change the dtype
print("Sdate new dtype:", df['Sdate'].dtypes)

output:

      Sdate  type
0  20201001    10
1  20201002    12
2  20201003    13
Sdate original dtype: int64
Sdate new dtype: string
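
An equivalent and more idiomatic form uses astype directly on the column (sketch):

```python
import pandas as pd

df = pd.DataFrame({'Sdate': [20201001, 20201002], 'type': [10, 12]})
# astype('string') converts the int64 column to the pandas string dtype
df['Sdate'] = df['Sdate'].astype('string')
print(df.dtypes)
```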

4. Reformat a date column (string/object type)

For example, convert the '20201001' format of the Sdate column to the '2020-10-01' format:

data = [['20201001', 10], ['20201002', 12], ['20201003', 13]]
columns = ['Sdate', 'type']
df = pd.DataFrame(data, columns=columns)
print(df)
# pd.to_datetime(df['Sdate']) converts the Sdate column to the datetime64[ns] type
df['Sdate'] = pd.to_datetime(df['Sdate']).dt.strftime('%Y-%m-%d')  # reformat
print(df)

output:

      Sdate  type
0  20201001    10
1  20201002    12
2  20201003    13
        Sdate  type
0  2020-10-01    10
1  2020-10-02    12
2  2020-10-03    13
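
When the input format is known, passing format= to to_datetime removes any parsing ambiguity and is faster on large frames (sketch):

```python
import pandas as pd

df = pd.DataFrame({'Sdate': ['20201001', '20201002']})
# format='%Y%m%d' tells pandas exactly how to parse each string
df['Sdate'] = pd.to_datetime(df['Sdate'], format='%Y%m%d').dt.strftime('%Y-%m-%d')
print(df)
```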

5. Delete

1. Delete rows

df = df.drop(df[(df['Age'] > 10) & (df['Age'] < 13)].index)
print(df)

output:

   Name  Age
0  Jack   10
2  Lucy   13

2. Delete a column

df = df.drop('Age', axis=1)
print(df)

output:

   Name
0  Jack
1   Tom
2  Lucy

Note:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)

  • labels: the row or column labels to delete, given as a list
  • axis: defaults to 0 (delete rows); must be set to 1 when deleting columns
  • index: directly specifies the rows to delete; pass a list to delete multiple rows
  • columns: directly specifies the columns to delete; pass a list to delete multiple columns
  • inplace: defaults to False, so drop returns a new frame without changing the original; with inplace=True the original data is modified
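
Using the index= and columns= keywords, rows and columns can be dropped in a single call (sketch):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Tom', 'Lucy'], 'Age': [10, 12, 13]})
# drop row label 0 and the Age column in one call, no axis argument needed
df2 = df.drop(index=[0], columns=['Age'])
print(df2)
```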

6. Traversal

for index, row in df.iterrows():
    print(index)
    print(row['Name'])
    print(row['Age'])

output:

0
Jack
10
1
Tom
12
2
Lucy
13

Note: iterrows() yields (index, row) tuples, where index is the row index and row is a Series holding the whole row's data, accessible by field name.
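
When you only need the values, itertuples() is considerably faster than iterrows() and preserves column dtypes (sketch):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Tom'], 'Age': [10, 12]})
# each row comes back as a named tuple: row.Index, row.Name, row.Age
rows = [(row.Index, row.Name, row.Age) for row in df.itertuples()]
print(rows)
```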


7. Conversion

1. Converting between dictionaries and DataFrames

Reference article: https://blog.csdn.net/m0_43609475/article/details/125328938

2. Data type conversion

df = pd.read_csv('energy.csv', encoding='gb2312')
print(df.dtypes)
df['能量值'] = df['能量值'].astype(object)
print("=====================================")
print(df.dtypes)

output:

日期      object
能量值      int64
电量值    float64
dtype: object
=====================================
日期      object
能量值     object
电量值    float64
dtype: object

3. Convert NaN values to None

Reason: pandas represents missing values as NaN. Before inserting rows into a database, NaN must be converted to None, otherwise the insert raises an error.

df = pd.read_csv('energy.csv', encoding='gb2312')
print(df)
# df.astype(object) ==> DataFrame: first cast every column to object
# df.where(cond, value) ==> DataFrame: keeps the original value where cond is True, fills value where it is False
# pd.notnull(df) ==> DataFrame: returns a boolean df, False at NaN positions and True elsewhere
print("=====================================")
df = df.astype(object).where(pd.notnull(df), None)
print(df)

output:

           日期   能量值    电量值
0  2020-06-06  2900    NaN
1  2020-06-07  3300    0.0
2  2020-06-08   666  666.0
=====================================
           日期   能量值    电量值
0  2020-06-06  2900   None
1  2020-06-07  3300    0.0
2  2020-06-08   666  666.0

8. Others

1. Remove rows with NaN values

df = pd.read_csv('energy.csv', encoding='gb2312')
print(df)
print("==========================================")
result = df.drop(df[df.isnull().T.any()].index)
print(result)

output:

           日期   能量值    电量值
0  2020-06-06  2900    NaN
1  2020-06-07  3300    0.0
2  2020-06-08   666  666.0
==========================================
           日期   能量值    电量值
1  2020-06-07  3300    0.0
2  2020-06-08   666  666.0
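
The built-in dropna() does the same thing more directly. A sketch with a small numeric frame standing in for energy.csv:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [np.nan, 5.0, 6.0]})
# dropna() removes every row that contains at least one NaN
result = df.dropna()
print(result)
```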

Explanation:

df = pd.read_csv('energy.csv', encoding='gb2312')
print(df)
print("==========================================")
print("df.isnull():")
print(df.isnull())
print("==========================================")
print("df.isnull().T:")
print(df.isnull().T)
print("==========================================")
print("df.isnull().T.any():")
print(df.isnull().T.any())  # any() ==> Series: whether any element along the axis is True
print("==========================================")
print("df[df.isnull().T.any()]:")
print(df[df.isnull().T.any()])
print("==========================================")
print("df[df.isnull().T.any()].index:")
print(df[df.isnull().T.any()].index)

output:

          日期   能量值    电量值
0  2020-06-06  2900    NaN
1  2020-06-07  3300    0.0
2  2020-06-08   666  666.0
==========================================
df.isnull():
      日期    能量值    电量值
0  False  False   True
1  False  False  False
2  False  False  False
==========================================
df.isnull().T:
         0      1      2
日期   False  False  False
能量值  False  False  False
电量值   True  False  False
==========================================
df.isnull().T.any():
0     True
1    False
2    False
dtype: bool
==========================================
df[df.isnull().T.any()]:
           日期   能量值  电量值
0  2020-06-06  2900  NaN
==========================================
df[df.isnull().T.any()].index:
Int64Index([0], dtype='int64')

2. Join (merge) operation

df1 = pd.read_csv('energy.csv', encoding='gb2312')
df2 = pd.read_csv('energy.csv', encoding='gb2312')
result = pd.merge(df1, df2, how='left', on=['日期']) # df1 left join df2
print(result)

output:

           日期  能量值_x  电量值_x  能量值_y  电量值_y
0  2020-06-06   2900    NaN   2900    NaN
1  2020-06-07   3300    0.0   3300    0.0
2  2020-06-08    666  666.0    666  666.0
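
The suffixes= and indicator= parameters make merge output easier to read: suffixes renames the overlapping columns, and indicator records which side each row came from (sketch):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'val': [3, 4]})
# suffixes renames the clashing 'val' columns; indicator adds a _merge column
result = pd.merge(left, right, how='left', on='key',
                  suffixes=('_l', '_r'), indicator=True)
print(result)
```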

Origin blog.csdn.net/bradyM/article/details/125485280