Data preparation (preprocessing) usually involves several kinds of data processing: data cleaning (handling missing values and outliers), data transformation (for example, normalization), and data reduction (for example, attribute reduction, which selects a subset of representative attributes). Python offers many quick ways to do this preprocessing.

Take missing-value handling in data cleaning as an example. Real data sets often contain missing values (NaN), which need special treatment. The isnan() function in numpy tests for missing values, and Series and DataFrame objects make detecting and handling them convenient. For example, df.dropna() deletes every row that contains NaN (NA), while df.dropna(how='all') discards only rows in which every value is NaN.

Missing values can also be filled in. Simple options fill with 0, the mean, the median, or the mode. Alternatively, interpolation builds an interpolation function f(x) from the known points and uses f(xi) to approximate the missing value at xi; common methods include Lagrange interpolation and Newton interpolation. For worked interpolation examples, see section 6.5 next week.

For the commonly used simple fill, df.fillna(value) replaces NaN with the given value, such as 0 or the mean (for example, df.fillna(0) replaces NaN with 0). Its method parameter can instead specify the direction in which missing values are filled, for example:

import pandas as pd

fruit_df = pd.Series(['apple', 'orange', 'pear'], index=[0, 2, 5])
fruit_df = fruit_df.reindex(range(7))
fruit_df
0 apple
1 NaN
2 orange
3 NaN
4 NaN
5 pear
6 NaN
dtype: object
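
Before moving on to directional filling, the detection, dropping, and constant-fill calls mentioned above can be sketched on a small DataFrame. The column names and values here are made up for illustration; only the pandas and numpy calls come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns; the last row is entirely missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
})

# np.isnan works on numeric arrays and returns a boolean mask.
mask = np.isnan(df["a"].to_numpy())

dropped_any = df.dropna()            # keeps only rows with no NaN at all
dropped_all = df.dropna(how="all")   # drops only rows where every value is NaN
filled_zero = df.fillna(0)           # replaces every NaN with 0
filled_mean = df.fillna(df.mean())   # replaces NaN with each column's mean

print(mask)
print(dropped_any)
print(dropped_all)
print(filled_zero)
print(filled_mean)
```

Filling with df.mean() works per column because fillna aligns the resulting Series with the column labels.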

Setting the inplace parameter to True modifies the original object fruit_df directly; otherwise the filled result is returned and the original object remains unchanged.

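# Note: pandas 2.1 and later deprecate fillna(method=...); fruit_df.ffill(inplace=True) is the equivalent call there.
fruit_df.fillna(method='ffill', inplace=True)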
print(fruit_df)
0 apple
1 apple
2 orange
3 orange
4 orange
5 pear
6 pear
dtype: object
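
The difference between filling in place and getting a filled copy can be checked with a minimal sketch. It assumes the Series.ffill() method, which is the method form of a forward fill, and uses a hypothetical variable name s so that it does not disturb fruit_df.

```python
import pandas as pd

# Same setup as above: reindexing introduces NaN at the new labels.
s = pd.Series(["apple", "orange", "pear"], index=[0, 2, 5]).reindex(range(7))

# Without inplace, a new filled Series is returned and s keeps its NaN values.
filled = s.ffill()
print(int(s.isna().sum()))        # missing values still present in s
print(int(filled.isna().sum()))   # missing values left in the filled copy

# With inplace=True, s itself is modified.
s.ffill(inplace=True)
print(s.tolist() == filled.tolist())
```

Working on the returned copy is usually easier to reason about, since the original data stays available for comparison.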

ffill replaces each NaN with the previous non-missing value, and bfill replaces it with the next non-missing value. Choose the fill method according to the characteristics of the data; the wrong choice can leave values unfilled or filled incorrectly. Which of the following is the output of this code?

fruit_df = pd.Series(['apple', 'orange', 'pear'], index=[0, 2, 5])
fruit_df = fruit_df.reindex(range(7))
fruit_df.fillna(method='bfill', inplace=True)
print(fruit_df)

A.
0 apple
2 orange
5 pear
dtype: object

B.
0 apple
1 orange
2 orange
3 pear
4 pear
5 pear
6 NaN
dtype: object

C.
0 apple
1 apple
2 orange
3 orange
4 orange
5 pear
6 pear
dtype: object

D.
0 apple
1 orange
2 orange
3 pear
4 pear
5 pear
6 pear
dtype: object
Correct answer: B. With bfill there is no later value to copy into index 6, so it stays NaN.
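
The opening paragraph also mentions interpolation-based filling, where a function f(x) is built from the known points and f(xi) estimates the missing value. As a rough illustration only, here is a plain-Python sketch of Lagrange interpolation. The function name lagrange_interpolate and the sample points are assumptions made for this example, not taken from the original article or from section 6.5.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial L_i(x): equals 1 at xi and 0 at every other node.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Hypothetical series with the value at x = 2 missing; known values follow x**2.
known_x = [0, 1, 3, 4]
known_y = [0.0, 1.0, 9.0, 16.0]

estimate = lagrange_interpolate(known_x, known_y, 2)
print(estimate)  # close to 4.0, up to floating-point rounding
```

In practice only a few neighbouring points are normally used for each missing value, because a high-degree polynomial through many points can oscillate strongly.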

Origin blog.csdn.net/immenselee/article/details/87932813