Handling missing and duplicate data with pandas: methods and practice

1. Missing data

Missing data usually means that a field in a data record has no value. In practical applications it can also mean a gap in a time series: for example, data is collected on the hour and the reading for a certain hour is absent, so a whole row is missing.

If the affected records are not simply deleted, the gaps are generally filled in. The common approach is to fill with 0 or with an empirical value; a more principled method is linear interpolation, or a more sophisticated algorithm.
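
The three filling strategies mentioned above can be compared on a small example. This is a minimal sketch: the Series s and the empirical value 5.5 are made up for illustration and are not part of the article's data.

```python
import pandas as pd

# Hypothetical hourly readings with one missing value
s = pd.Series([7.0, 6.0, None, 4.0])

filled_zero = s.fillna(0)                       # fill with 0
filled_const = s.fillna(5.5)                    # fill with an assumed empirical value
filled_linear = s.interpolate(method='linear')  # linear interpolation

print(filled_linear.tolist())  # [7.0, 6.0, 5.0, 4.0]
```

Linear interpolation places the missing point halfway between its neighbours, which is usually closer to reality for slowly changing measurements than a fixed constant.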

1.1. Completing a time series

For example, given an hourly time series in which the 03:00 record is missing, fill in the gap by interpolation and expand the data to a half-hour interval.

Code 1.

import pandas as pd

key = ['getTime', 'temp', 'text', 'humidity']
# 'text' is the weather description: '晴' = sunny, '阴' = overcast
data = [['2023-04-30T00:00', 7, '晴', 57],
        ['2023-04-30T01:00', 6, '晴', 58],
        ['2023-04-30T02:00', 6, '阴', 55],
        ['2023-04-30T04:00', 4, '晴', 50]]
df = pd.DataFrame(data, columns=key)
# Parse the timestamp strings and use them as the index
df.index = pd.to_datetime(df['getTime'])

The following fills in the missing time points and linearly interpolates the numeric columns.

Code 2.

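df1 = df.resample('30min').asfreq()  # upsample; the inserted rows are null
df1[['temp', 'humidity']] = df1[['temp', 'humidity']].interpolate(method='linear')  # numeric columns only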

Note: to complete a time series this way, the index of the DataFrame must itself be a time series (a DatetimeIndex).
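
The requirement in the note can be checked before resampling. The sketch below uses a hypothetical two-row frame named raw; it converts the string timestamps with pd.to_datetime, confirms the index type, and then upsamples to 30 minutes.

```python
import pandas as pd

# Hypothetical two-row sample; timestamps are stored as strings
raw = pd.DataFrame({'getTime': ['2023-04-30T00:00', '2023-04-30T01:00'],
                    'temp': [7.0, 6.0]})

# Convert the strings to datetimes and use them as the index
raw.index = pd.to_datetime(raw['getTime'])
ts = raw[['temp']]
print(isinstance(ts.index, pd.DatetimeIndex))  # True

# Upsample to 30 minutes; the new 00:30 row is interpolated
half_hourly = ts.resample('30min').interpolate(method='linear')
print(half_hourly['temp'].tolist())  # [7.0, 6.5, 6.0]
```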

Alternatively, create a new table indexed by the complete time series, then use pd.merge to bring in the existing data; the missing time points appear as rows of null values.

Code 3.

times = pd.date_range('2023-04-30 00:00', '2023-04-30 04:59', freq='1h')  # same as the data above: standard international time (UTC)
# times = pd.date_range('2023-04-30 00:00', '2023-04-30 04:59', freq='1h', tz='Asia/Shanghai')
df0 = pd.DataFrame(index=times)
df0 = pd.merge(left=df0, right=df, left_index=True, right_index=True, how='left')
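
As a side note, the same alignment can also be obtained with reindex instead of pd.merge. This is a different technique from the one used in Code 3, shown here only as a minimal sketch on a hypothetical frame named obs.

```python
import pandas as pd

# Hypothetical hourly series with the 03:00 record missing
idx = pd.to_datetime(['2023-04-30 00:00', '2023-04-30 01:00',
                      '2023-04-30 02:00', '2023-04-30 04:00'])
obs = pd.DataFrame({'temp': [7, 6, 6, 4]}, index=idx)

# Complete hourly range
full = pd.date_range('2023-04-30 00:00', '2023-04-30 04:00', freq='1h')

# reindex aligns the rows to the full range; hours without data become NaN
aligned = obs.reindex(full)
print(aligned)
```

The merge in Code 3 is more flexible when the new table carries columns of its own; reindex is shorter when only the index has to be completed.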

1.2. Missing data items

1.2.1. Linear interpolation

Code 4.

df0[['temp','humidity']] = df0[['temp','humidity']].interpolate(method='linear')  

Alternatively, interpolate linearly while completing the time series; see Code 2 for details.

1.2.2. Copy the previous data

For non-numeric data, you can copy the value of the previous record. The same approach also works for numeric data.

Code 5.

df0[['text', 'getTime']] = df0[['text', 'getTime']].ffill()
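
The behaviour of forward filling can be seen on a small example. The Series text below is hypothetical; the sketch also shows the backward variant bfill() and the limit parameter, which are not used in the article.

```python
import pandas as pd

# Hypothetical weather labels with two consecutive gaps
text = pd.Series(['sunny', None, None, 'cloudy'])

copied_prev = text.ffill()         # copy the previous value forward
copied_next = text.bfill()         # copy the next value backward
copied_once = text.ffill(limit=1)  # fill at most one consecutive gap

print(copied_prev.tolist())  # ['sunny', 'sunny', 'sunny', 'cloudy']
```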

1.2.3. Filling nulls with a constant

For example, fill every null value in the result of Code 3 with the constant 6.

Code 6.

df0.fillna(6, inplace=True) 
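
Filling the whole table with one constant also writes that constant into text columns. When that is not wanted, fillna accepts a dict that maps each column to its own fill value. A minimal sketch on a hypothetical frame named gaps, with 'unknown' as an assumed placeholder label:

```python
import pandas as pd

# Hypothetical frame with gaps in a numeric and a text column
gaps = pd.DataFrame({'temp': [7.0, None, 4.0],
                     'text': ['sunny', None, 'cloudy']})

# Each column gets its own fill value
filled = gaps.fillna({'temp': 6, 'text': 'unknown'})
print(filled)
```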

2. Delete duplicate rows

First, construct duplicate data by appending rows of the table to itself.

Code 7.

# Append the first two records of the same table
df2 = pd.concat([df, df.head(2)])

Here, head(2) returns the first two records of the table.

2.1. Delete exact duplicate rows

Delete duplicate records.

Code 8.

df2 = df2.drop_duplicates()

Note: this removes only rows whose values are identical in every column.
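
Before deleting, it can help to see which rows would be treated as duplicates. The sketch below uses duplicated() on a hypothetical table named rows.

```python
import pandas as pd

# Hypothetical table whose last row repeats the first
rows = pd.DataFrame({'temp': [7, 6, 7],
                     'text': ['sunny', 'sunny', 'sunny']})

# duplicated() marks rows that repeat an earlier row in every column
dup_mask = rows.duplicated()
print(dup_mask.tolist())  # [False, False, True]

deduped = rows.drop_duplicates()
```

The second row shares its text value with the first but differs in temp, so it is not an exact duplicate and is kept.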

2.2. Deduplication by column

Deduplication can also be based on one or more columns; among rows that share the same values in those columns, the first occurrence is retained.

Code 9.

df2 = df2.drop_duplicates('text',keep='first')

The general form of the call, with 'A', 'B', 'C' standing for column names, is:

df.drop_duplicates(subset=['A','B','C'],keep='first',inplace=True)

The parameters are as follows:

  • subset: the name(s) of the column(s) used to identify duplicates. The default is None, which means all columns are compared.
  • keep: one of three values: 'first', 'last', or False. The default is 'first', which keeps only the first occurrence of each duplicate and deletes the rest. 'last' keeps only the last occurrence, and False removes all duplicates.
  • inplace: a Boolean, False by default, which means a copy with the duplicates removed is returned. If True, the duplicates are removed directly from the original data.
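
The effect of the three keep values is easiest to see side by side. A minimal sketch on a hypothetical frame named w:

```python
import pandas as pd

# Hypothetical data: 'sunny' appears twice, 'cloudy' once
w = pd.DataFrame({'text': ['sunny', 'cloudy', 'sunny'],
                  'temp': [7, 6, 4]})

first = w.drop_duplicates(subset=['text'], keep='first')       # keeps rows 0 and 1
last = w.drop_duplicates(subset=['text'], keep='last')         # keeps rows 1 and 2
unique_only = w.drop_duplicates(subset=['text'], keep=False)   # keeps row 1 only

print(first.index.tolist(), last.index.tolist(), unique_only.index.tolist())
```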

3. Modify data by condition

Some data values need to be modified according to a condition. A common approach is to call a function with apply(); you can also modify the data directly through the loc indexer. The example works on the DataFrame built in Code 1.

This article uses the loc method to modify data according to conditions.

Code 10.

# Select the rows from 01:00 onward and add 10 to both columns
mask = df.index >= pd.to_datetime('2023-04-30 01:00')
df.loc[mask, ['temp', 'humidity']] = df.loc[mask, ['temp', 'humidity']] + 10

Besides the index, a specific column can also be used as the query condition.
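
A condition on a column works the same way as a condition on the index. The sketch below uses a hypothetical frame named d and an assumed threshold of 6; it also shows an apply() variant of a conditional change for comparison.

```python
import pandas as pd

# Hypothetical readings
d = pd.DataFrame({'temp': [7, 6, 6, 4],
                  'humidity': [57, 58, 55, 50]})

# apply(): compute an adjusted copy of 'temp' with a function
adjusted = d['temp'].apply(lambda t: t + 10 if t < 6 else t)

# loc: boolean mask built from a column instead of the index
cold = d['temp'] < 6
d.loc[cold, 'humidity'] = d.loc[cold, 'humidity'] + 10

print(d['humidity'].tolist())  # [57, 58, 55, 60]
```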

4. Finding and handling null values

4.1. Querying null values

Find the rows where one field is null and replace another field in the same row. For example, in the result of Code 2, find the rows where "text" is null and replace their "temp" value with the value of "humidity".

Code 11.

mask = df1['text'].isnull()  # rows where 'text' is null
df1.loc[mask, 'temp'] = df1.loc[mask, 'humidity']

Next, from the result of Code 2, select only the rows whose "text" is not null.

Code 12.

df1 = df1.loc[~df1['text'].isnull()]

Here, ~ is the negation operator, and the .isnull() method tests whether a value is null.
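
The relationship between the null mask and its negation can be checked on a small example. The Series labels below is hypothetical.

```python
import pandas as pd

# Hypothetical column with one null entry
labels = pd.Series(['sunny', None, 'cloudy'])

is_null = labels.isnull()  # True where the value is missing
not_null = ~is_null        # ~ inverts the boolean mask

print(is_null.sum())                      # number of nulls: 1
print(not_null.equals(labels.notnull()))  # True
```

Summing the mask is a quick way to count nulls in a column before deciding whether to fill or drop them.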

4.2. Remove null values

From the result of Code 2, delete the rows that contain null values.

Code 13.

df1 = df1.dropna()
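
By default dropna() removes a row if any of its values is null. The how and subset parameters narrow this down, as the sketch below shows on a hypothetical frame named t.

```python
import pandas as pd

# Hypothetical frame: row 1 lacks 'text', row 2 lacks every value
t = pd.DataFrame({'temp': [7.0, 6.5, None],
                  'text': ['sunny', None, None]})

any_null = t.dropna()                # drop rows with at least one null
all_null = t.dropna(how='all')       # drop rows where every value is null
by_col = t.dropna(subset=['temp'])   # only consider nulls in 'temp'

print(any_null.index.tolist(), all_null.index.tolist(), by_col.index.tolist())
```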

Origin blog.csdn.net/xiaoyw/article/details/130339513