In "Data Cleansing (a) (Data Analysis / Pandas Data Munging / Wrangling) with pandas", we introduced some of the pandas commands frequently used for data cleaning.
Now let's look at the specific steps for cleaning this data set:
   Transaction_ID Transaction_Date  Product_ID Quantity Unit_Price Total_Price
0               1       2010-08-21           2        1         30          30
1               2       2011-05-26           4        1         40          40
2               3       2011-06-16           3      NaN         32          32
3               4       2012-08-26           2        3         55         165
4               5       2013-06-06           4        1        124         124
5               1       2010-08-21           2        1         30          30
6               7       2013-12-30
7               8       2014-04-24           2        2        NaN         NaN
8               9       2015-04-24           4        3         60        1800
9              10       2016-05-08           4        4          9          36
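For readers who want to follow along, the data set above can be reconstructed roughly as follows (a sketch based on the listing; the blank cells in row 6 are assumed to be literal spaces, which is why they are not yet counted as missing):

```python
import numpy as np
import pandas as pd

# A rough reconstruction of the sample data set; single spaces stand in
# for the cells that print as blank in the listing above.
transactions = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 1, 7, 8, 9, 10],
    'Transaction_Date': pd.to_datetime([
        '2010-08-21', '2011-05-26', '2011-06-16', '2012-08-26', '2013-06-06',
        '2010-08-21', '2013-12-30', '2014-04-24', '2015-04-24', '2016-05-08']),
    'Product_ID': [2, 4, 3, 2, 4, 2, ' ', 2, 4, 4],
    'Quantity': [1, 1, np.nan, 3, 1, 1, ' ', 2, 3, 4],
    'Unit_Price': [30, 40, 32, 55, 124, 30, ' ', np.nan, 60, 9],
    'Total_Price': [30, 40, 32, 165, 124, 30, ' ', np.nan, 1800, 36],
})
print(transactions.shape)  # (10, 6)
```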
1. View the number of rows and columns:
print(transactions.shape)
(10, 6)
There are 10 rows and 6 columns in total.
2. View the data type of each column:
print(transactions.dtypes)
Transaction_ID int64
Transaction_Date datetime64[ns]
Product_ID object
Quantity object
Unit_Price object
Total_Price object
The Transaction_ID column is an integer, Transaction_Date is a datetime, and the remaining columns are of type object.
3. The two steps above can also be replaced with the info() command:
print(transactions.info())
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
Transaction_ID      10 non-null int64
Transaction_Date    10 non-null datetime64[ns]
Product_ID          10 non-null object
Quantity            9 non-null object
Unit_Price          9 non-null object
Total_Price         9 non-null object
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 560.0+ bytes
None
RangeIndex: 10 entries indicates 10 rows in total; Data columns (total 6 columns) indicates 6 columns; below that, each column is listed with its count of non-null values and its type.
4. See which rows and which columns contain missing values, and how many rows and columns have missing values in total:
print("Which rows have missing values:")
print(transactions.apply(lambda x: sum(x.isnull()), axis=1))
Which rows have missing values:
0    0
1    0
2    1
3    0
4    0
5    0
6    0
7    2
8    0
9    0
print("Which columns have missing values:")
print(transactions.apply(lambda x: sum(x.isnull()), axis=0))
Which columns have missing values:
Transaction_ID      0
Transaction_Date    0
Product_ID          0
Quantity            1
Unit_Price          1
Total_Price         1
print("Total number of rows with missing values:")
print(len(transactions.apply(lambda x: sum(x.isnull()), axis=1).nonzero()[0]))
Total number of rows with missing values:
2
print("Total number of columns with missing values:")
print(len(transactions.apply(lambda x: sum(x.isnull()), axis=0).nonzero()[0]))
Total number of columns with missing values:
3
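Note that Series.nonzero(), used above, was deprecated and later removed in newer pandas versions; an equivalent count can be obtained with isnull().any(). A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: one clean row, one row missing both values, one missing 'b'.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, np.nan, np.nan]})

rows_with_na = int(df.isnull().any(axis=1).sum())  # rows containing any NaN
cols_with_na = int(df.isnull().any(axis=0).sum())  # columns containing any NaN
print(rows_with_na, cols_with_na)  # 2 2
```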
Note that this data set contains some cells holding only spaces, which should be treated as missing values; the spaces therefore need to be converted to NaN:
transactions=transactions.applymap(lambda x: np.NaN if str(x).isspace() else x)
The data set now looks like this:
   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0
1               2       2011-05-26         4.0       1.0        40.0
2               3       2011-06-16         3.0       NaN        32.0
3               4       2012-08-26         2.0       3.0        55.0
4               5       2013-06-06         4.0       1.0       124.0
5               1       2010-08-21         2.0       1.0        30.0
6               7       2013-12-30         NaN       NaN         NaN
7               8       2014-04-24         2.0       2.0         NaN
8               9       2015-04-24         4.0       3.0        60.0
9              10       2016-05-08         4.0       4.0         9.0

   Total_Price
0         30.0
1         40.0
2         32.0
3        165.0
4        124.0
5         30.0
6          NaN
7          NaN
8       1800.0
9         36.0
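A minimal sketch of the space-to-NaN conversion on a toy column (in recent pandas, DataFrame.map is the preferred name for applymap):

```python
import numpy as np
import pandas as pd

# One cell is a lone space rather than a real missing value.
df = pd.DataFrame({'x': ['1', ' ', '3']})
print(df['x'].isnull().sum())  # 0 -- the space is not seen as missing

# Convert whitespace-only cells to NaN, element by element.
df = df.applymap(lambda v: np.nan if str(v).isspace() else v)
print(df['x'].isnull().sum())  # 1 -- now it is
```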
5. Remove rows with missing values:
transactions.dropna(inplace=True)
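dropna() also takes parameters for dropping rows more selectively; a small sketch with illustrative values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.nan],
                   'c': [7.0, 8.0, 9.0]})
print(df.dropna().shape)              # (1, 3): drop any row containing a NaN
print(df.dropna(thresh=2).shape)      # (3, 3): keep rows with >= 2 non-null values
print(df.dropna(subset=['a']).shape)  # (2, 3): only consider column 'a'
```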
6. Of course, we may choose not to remove the missing values but to fill them instead, which suits small data sets; here we pick backward fill:
transactions.fillna(method='backfill',inplace=True)
The data set now looks like this:
   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0
1               2       2011-05-26         4.0       1.0        40.0
2               3       2011-06-16         3.0       3.0        32.0
3               4       2012-08-26         2.0       3.0        55.0
4               5       2013-06-06         4.0       1.0       124.0
5               1       2010-08-21         2.0       1.0        30.0
6               7       2013-12-30         2.0       2.0        60.0
7               8       2014-04-24         2.0       2.0        60.0
8               9       2015-04-24         4.0       3.0        60.0
9              10       2016-05-08         4.0       4.0         9.0

   Total_Price
0         30.0
1         40.0
2         32.0
3        165.0
4        124.0
5         30.0
6       1800.0
7       1800.0
8       1800.0
9         36.0
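In pandas 2.x the method= argument to fillna() is deprecated in favor of the bfill()/ffill() methods; a minimal sketch of the difference between the two fill directions:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0] -- backward fill (backfill)
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0] -- forward fill
```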
7. Now try filling with the mean instead:
transactions.fillna(transactions.mean(),inplace=True)
The data set now looks like this:
   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0
1               2       2011-05-26         4.0       1.0        40.0
2               3       2011-06-16         3.0       2.0        32.0
3               4       2012-08-26         2.0       3.0        55.0
4               5       2013-06-06         4.0       1.0       124.0
5               1       2010-08-21         2.0       1.0        30.0
6               7       2013-12-30         3.0       2.0        47.5
7               8       2014-04-24         2.0       2.0        47.5
8               9       2015-04-24         4.0       3.0        60.0
9              10       2016-05-08         4.0       4.0         9.0

   Total_Price
0        30.000
1        40.000
2        32.000
3       165.000
4       124.000
5        30.000
6       282.125
7       282.125
8      1800.000
9        36.000
* Note: if the data set contains outliers, remove the outliers first and then fill with the mean.
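A minimal sketch of mean filling, including how a single outlier skews the fill value (which is what the note above warns about):

```python
import numpy as np
import pandas as pd

s = pd.Series([30.0, 40.0, np.nan])
print(s.fillna(s.mean()).tolist())  # [30.0, 40.0, 35.0]

# With an outlier present, the mean used for the fill is badly skewed:
s2 = pd.Series([30.0, 40.0, 1800.0, np.nan])
print(s2.fillna(s2.mean()).tolist())  # the gap is filled with ~623.3
```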
8. Try filling by interpolation as well; linear interpolation is the default, which applies when a column's data has a linear relationship:
transactions.interpolate(inplace=True)
Obviously that assumption does not hold here, so the updated data set is not shown; the command is for demonstration only.
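A minimal sketch of linear interpolation on a column where it does fit:

```python
import numpy as np
import pandas as pd

# Evenly spaced values with gaps: linear interpolation recovers them exactly.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```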
9. Display rows with duplicate values:
print(transactions.duplicated())
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
The output shows that the row with index 5 is a duplicate.
10. Remove the duplicate rows:
transactions.drop_duplicates(inplace=True)
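A minimal sketch of both duplicate commands on a toy frame, including the keep parameter for choosing which copy survives:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1], 'value': [10, 20, 10]})
print(df.duplicated().tolist())    # [False, False, True]
print(df.drop_duplicates().shape)  # (2, 2): the later copy is dropped
print(df.drop_duplicates(keep='last')['id'].tolist())  # [2, 1]: keep the last copy
```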
11. Here we chose mean filling for the missing values. Now we need to find outliers; first, the describe() method gives an overall summary of the numerical data:
print(transactions.describe())
       Transaction_ID  Product_ID  Quantity  Unit_Price  Total_Price
count        9.000000    9.000000  9.000000    9.000000     9.000000
mean         5.444444    3.111111  2.111111   49.444444   310.138889
std          3.205897    0.927961  1.054093   31.850672   567.993712
min          1.000000    2.000000  1.000000    9.000000    30.000000
25%          3.000000    2.000000  1.000000   32.000000    36.000000
50%          5.000000    3.000000  2.000000   47.500000   124.000000
75%          8.000000    4.000000  3.000000   55.000000   282.125000
max         10.000000    4.000000  4.000000  124.000000  1800.000000
Check the maxima and minima of Quantity, Unit_Price, and Total_Price. Quantity looks fine, but the maxima of Unit_Price and Total_Price look suspicious: multiplying the largest Quantity by the largest Unit_Price gives only 496, far below 1800, so the Total_Price maximum is judged an outlier. Whether the Unit_Price maximum is a problem cannot be decided from the summary alone; after inspecting the data set it looks fine, so it is kept.
12. Find the outliers with the IQR rule: values outside the upper and lower limits are treated as outliers (the upper limit is usually Q3 + 1.5 * IQR, the lower limit Q1 - 1.5 * IQR):
IQR=transactions.describe().loc['75%','Total_Price']-transactions.describe().loc['25%','Total_Price']
upper_extreme=transactions.describe().loc['75%','Total_Price']+1.5*IQR
lower_extreme=transactions.describe().loc['25%','Total_Price']-1.5*IQR
print(transactions.loc[((transactions['Total_Price']>upper_extreme) | (transactions['Total_Price']<lower_extreme))])
   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
8               9       2015-04-24         4.0       3.0        60.0

   Total_Price
8       1800.0
First compute the IQR of Total_Price, then its upper and lower limits from it, and finally select the rows containing outliers.
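The same limits can be computed more directly with quantile() instead of reading them out of describe(); a sketch on illustrative values:

```python
import pandas as pd

s = pd.Series([30.0, 36.0, 124.0, 282.125, 1800.0])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Values beyond Q3 + 1.5*IQR or below Q1 - 1.5*IQR count as outliers:
outliers = s[(s > q3 + 1.5 * iqr) | (s < q1 - 1.5 * iqr)]
print(outliers.tolist())  # [1800.0]
```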
13. Find the outliers with a box plot:
import matplotlib.pyplot as plt
fig,ax=plt.subplots()
ax.boxplot(transactions['Total_Price'])
plt.show()
The plot shows the outliers in the Total_Price column as individual points.
14. Handle the outlier. Inspecting the data set suggests that this row's Total_Price has an extra 0 appended, so replace 1800 with 180:
transactions.replace(1800,180,inplace=True)
The data set now looks like this:
   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0
1               2       2011-05-26         4.0       1.0        40.0
2               3       2011-06-16         3.0       2.0        32.0
3               4       2012-08-26         2.0       3.0        55.0
4               5       2013-06-06         4.0       1.0       124.0
6               7       2013-12-30         3.0       2.0        47.5
7               8       2014-04-24         2.0       2.0        47.5
8               9       2015-04-24         4.0       3.0        60.0
9              10       2016-05-08         4.0       4.0         9.0

   Total_Price
0        30.000
1        40.000
2        32.000
3       165.000
4       124.000
6       282.125
7       282.125
8       180.000
9        36.000
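Note that replace(1800, 180) swaps every matching value anywhere in the frame; if other columns could legitimately contain 1800, a more targeted assignment through .loc (sketched below) is safer:

```python
import pandas as pd

df = pd.DataFrame({'Total_Price': [30.0, 1800.0, 36.0],
                   'Unit_Price': [30.0, 60.0, 9.0]})
# Only fix Total_Price where it equals 1800, leaving other columns untouched:
df.loc[df['Total_Price'] == 1800, 'Total_Price'] = 180
print(df['Total_Price'].tolist())  # [30.0, 180.0, 36.0]
```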
This concludes the cleaning of this data set. Because the data set is very simple, it serves only as a demonstration; real-world data sets are far more complex, and cleaning them may also involve data type conversion, data format conversion, adding features, and so on.