Data Cleaning with pandas (Part 2) (Data Analysis / Pandas / Data Munging / Wrangling)

In "Data Cleaning with pandas (Part 1)", we introduced some frequently used pandas commands for data cleaning.

 

Now let's walk through the specific steps to clean this data set:

   Transaction_ID Transaction_Date Product_ID Quantity Unit_Price Total_Price
0               1       2010-08-21          2        1         30          30
1               2       2011-05-26          4        1         40          40
2               3       2011-06-16          3      NaN         32          32
3               4       2012-08-26          2        3         55         165
4               5       2013-06-06          4        1        124         124
5               1       2010-08-21          2        1         30          30
6               7       2013-12-30                                           
7               8       2014-04-24          2        2        NaN         NaN
8               9       2015-04-24          4        3         60        1800
9              10       2016-05-08          4        4          9          36
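
If you want to follow along, the transactions DataFrame can be reconstructed with a snippet like the following. This is a sketch based on the table above; the blank cells in row 6 are stored as single-space strings to mimic the original file, and the column names are taken from the table.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data set from the table above; the blank cells in
# row 6 are stored as single-space strings, as in the original file.
transactions = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5, 1, 7, 8, 9, 10],
    'Transaction_Date': pd.to_datetime(
        ['2010-08-21', '2011-05-26', '2011-06-16', '2012-08-26', '2013-06-06',
         '2010-08-21', '2013-12-30', '2014-04-24', '2015-04-24', '2016-05-08']),
    'Product_ID': [2, 4, 3, 2, 4, 2, ' ', 2, 4, 4],
    'Quantity': [1, 1, np.nan, 3, 1, 1, ' ', 2, 3, 4],
    'Unit_Price': [30, 40, 32, 55, 124, 30, ' ', np.nan, 60, 9],
    'Total_Price': [30, 40, 32, 165, 124, 30, ' ', np.nan, 1800, 36],
})

print(transactions.shape)
```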

 

1. Check the number of rows and columns:

print(transactions.shape)
(10, 6)

The data set has 10 rows and 6 columns in total.

 

2. View the data type of each column:

print(transactions.dtypes)
Transaction_ID               int64
Transaction_Date    datetime64[ns]
Product_ID                  object
Quantity                    object
Unit_Price                  object
Total_Price                 object

The Transaction_ID column is integer, the Transaction_Date column is datetime, and the remaining columns are object.

 

3. The above two steps can also be replaced by the info() command:

print(transactions.info())

RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
Transaction_ID      10 non-null int64
Transaction_Date    10 non-null datetime64[ns]
Product_ID          10 non-null object
Quantity            9 non-null object
Unit_Price          9 non-null object
Total_Price         9 non-null object
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 560.0+ bytes
None

RangeIndex: 10 entries indicates there are 10 rows in total, Data columns (total 6 columns) indicates 6 columns in total, and the line for each column shows its number of non-null values and its type.

 

4. Check which rows and which columns have missing values, and how many rows and columns contain missing values in total:

print("Which rows have missing values:")
print(transactions.apply(lambda x: sum(x.isnull()), axis=1))
Which rows have missing values:
0    0
1    0
2    1
3    0
4    0
5    0
6    0
7    2
8    0
9    0

 

print("Which columns have missing values:")
print(transactions.apply(lambda x: sum(x.isnull()), axis=0))
Which columns have missing values:
Transaction_ID      0
Transaction_Date    0
Product_ID          0
Quantity            1
Unit_Price          1
Total_Price         1

 

print("How many rows have missing values in total:")
print(len(transactions.apply(lambda x: sum(x.isnull()), axis=1).nonzero()[0]))
How many rows have missing values in total:
2

 

print("How many columns have missing values in total:")
print(len(transactions.apply(lambda x: sum(x.isnull()), axis=0).nonzero()[0]))
How many columns have missing values in total:
3
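
Note that Series.nonzero() has been removed in recent pandas versions; on newer releases the same counts can be obtained with isnull().any(), for example (a small standalone sketch on a toy frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0],
                   'c': [7.0, 8.0, 9.0]})

# Number of rows that contain at least one missing value.
print(int(df.isnull().any(axis=1).sum()))
# Number of columns that contain at least one missing value.
print(int(df.isnull().any(axis=0).sum()))
```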

 

Note that this data set contains some cells holding only spaces that should be treated as missing values, so we need to convert those spaces to NaN:

transactions=transactions.applymap(lambda x: np.NaN if str(x).isspace() else x)
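
An equivalent whitespace-to-NaN conversion can also be done with replace and a regular expression (a small standalone sketch; regex replacement only touches string cells, so numeric values pass through unchanged):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, ' ', 3], 'b': ['x', 'y', '   ']})
# Cells consisting only of whitespace become NaN; everything else is kept.
df = df.replace(r'^\s+$', np.nan, regex=True)
print(int(df.isnull().sum().sum()))
```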

 

The data set now looks like this:

   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0   
1               2       2011-05-26         4.0       1.0        40.0   
2               3       2011-06-16         3.0       NaN        32.0   
3               4       2012-08-26         2.0       3.0        55.0   
4               5       2013-06-06         4.0       1.0       124.0   
5               1       2010-08-21         2.0       1.0        30.0   
6               7       2013-12-30         NaN       NaN         NaN   
7               8       2014-04-24         2.0       2.0         NaN   
8               9       2015-04-24         4.0       3.0        60.0   
9              10       2016-05-08         4.0       4.0         9.0   

   Total_Price  
0         30.0  
1         40.0  
2         32.0  
3        165.0  
4        124.0  
5         30.0  
6          NaN  
7          NaN  
8       1800.0  
9         36.0  

 

5. Remove rows with missing values:

transactions.dropna(inplace=True)
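
dropna also takes parameters that control which rows are dropped; a quick standalone illustration (a hypothetical toy frame, not the transactions data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

print(len(df.dropna()))              # drop rows with any missing value
print(len(df.dropna(how='all')))     # drop only rows that are entirely NaN
print(len(df.dropna(subset=['a'])))  # consider only column 'a'
```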

 

6. Alternatively, instead of removing the missing values we can fill them; this approach suits cases where the amount of data is small. Here we fill backwards:

transactions.fillna(method='backfill',inplace=True)
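
On recent pandas versions fillna(method='backfill') is deprecated in favor of the bfill()/ffill() methods, which behave the same way; a minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.bfill().tolist())  # each NaN takes the next valid value
print(s.ffill().tolist())  # each NaN takes the previous valid value
```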

 

The data set now looks like this:

   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0   
1               2       2011-05-26         4.0       1.0        40.0   
2               3       2011-06-16         3.0       3.0        32.0   
3               4       2012-08-26         2.0       3.0        55.0   
4               5       2013-06-06         4.0       1.0       124.0   
5               1       2010-08-21         2.0       1.0        30.0   
6               7       2013-12-30         2.0       2.0        60.0   
7               8       2014-04-24         2.0       2.0        60.0   
8               9       2015-04-24         4.0       3.0        60.0   
9              10       2016-05-08         4.0       4.0         9.0   

   Total_Price  
0         30.0  
1         40.0  
2         32.0  
3        165.0  
4        124.0  
5         30.0  
6       1800.0  
7       1800.0  
8       1800.0  
9         36.0

 

7. Now try filling with the mean instead:

transactions.fillna(transactions.mean(),inplace=True)
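
Depending on the pandas version, calling DataFrame.mean() on a frame that contains non-numeric columns may warn or raise; restricting the fill to the numeric columns is more robust (a sketch on a toy frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'when': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    'x': [1.0, np.nan, 3.0],
})

# Fill only the numeric columns with their respective column means.
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
print(df['x'].tolist())
```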

 

The data set now looks like this:

   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0   
1               2       2011-05-26         4.0       1.0        40.0   
2               3       2011-06-16         3.0       2.0        32.0   
3               4       2012-08-26         2.0       3.0        55.0   
4               5       2013-06-06         4.0       1.0       124.0   
5               1       2010-08-21         2.0       1.0        30.0   
6               7       2013-12-30         3.0       2.0        47.5   
7               8       2014-04-24         2.0       2.0        47.5   
8               9       2015-04-24         4.0       3.0        60.0   
9              10       2016-05-08         4.0       4.0         9.0   

   Total_Price  
0       30.000  
1       40.000  
2       32.000  
3      165.000  
4      124.000  
5       30.000  
6      282.125  
7      282.125  
8     1800.000  
9       36.000  

 

* Note: if the data set contains outliers, they should be removed before filling with the mean.

 

8. We can also try filling by interpolation; linear interpolation is the default, which is appropriate when the column's values follow a linear relationship:

transactions.interpolate(inplace=True)

Obviously this data does not satisfy that assumption, so the updated data set is not shown here; the command is included only for demonstration.
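
On a column whose values really are linear in the row position, interpolate() recovers the gaps exactly; a minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
# Default linear interpolation over the positional index.
print(s.interpolate().tolist())
```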

 

9. Show which rows are duplicates:

print(transactions.duplicated())
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False

This shows that the row with index 5 is a duplicate.

 

10. Remove the duplicates:

transactions.drop_duplicates(inplace=True)
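
drop_duplicates can also compare only a subset of columns, or keep the last occurrence instead of the first; a standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

print(df.drop_duplicates().index.tolist())              # keep first occurrence
print(df.drop_duplicates(keep='last').index.tolist())   # keep last occurrence
print(df.drop_duplicates(subset=['a']).index.tolist())  # compare column 'a' only
```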

 

11. Here we continue with the version where missing values were filled with the mean. Now we need to find the outliers; first, the describe() method gives an overview of the numerical columns:

print(transactions.describe())
       Transaction_ID  Product_ID  Quantity  Unit_Price  Total_Price
count        9.000000    9.000000  9.000000    9.000000     9.000000
mean         5.444444    3.111111  2.111111   49.444444   310.138889
std          3.205897    0.927961  1.054093   31.850672   567.993712
min          1.000000    2.000000  1.000000    9.000000    30.000000
25%          3.000000    2.000000  1.000000   32.000000    36.000000
50%          5.000000    3.000000  2.000000   47.500000   124.000000
75%          8.000000    4.000000  3.000000   55.000000   282.125000
max         10.000000    4.000000  4.000000  124.000000  1800.000000

Reviewing the minimum and maximum of Quantity, Unit_Price, and Total_Price: Quantity looks fine, but the maxima of Unit_Price and Total_Price look suspicious. Multiplying the maximum Quantity by the maximum Unit_Price gives 496, far below 1800, so the maximum Total_Price is judged to be an outlier. Whether the maximum Unit_Price is a problem cannot be decided from the statistics alone; after inspecting the data set it looks fine, so it is kept.

 

12. Find the outliers with a mask: values outside the upper and lower limits are treated as outliers (the upper limit is usually Q3 + 1.5 * IQR, and the lower limit Q1 - 1.5 * IQR):

IQR=transactions.describe().loc['75%','Total_Price']-transactions.describe().loc['25%','Total_Price']
upper_extreme=transactions.describe().loc['75%','Total_Price']+1.5*IQR
lower_extreme=transactions.describe().loc['25%','Total_Price']-1.5*IQR
print(transactions.loc[((transactions['Total_Price']>upper_extreme) | (transactions['Total_Price']<lower_extreme))])
   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
8               9       2015-04-24         4.0       3.0        60.0   

   Total_Price  
8       1800.0  

We first compute the IQR of Total_Price, then derive its upper and lower limits, and finally select the rows that contain outliers.
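
The same limits can also be computed directly with quantile() instead of reading them out of describe(); a sketch using the nine Total_Price values from the deduplicated, mean-filled data:

```python
import pandas as pd

total_price = pd.Series([30.0, 40.0, 32.0, 165.0, 124.0,
                         282.125, 282.125, 1800.0, 36.0])

q1, q3 = total_price.quantile(0.25), total_price.quantile(0.75)
iqr = q3 - q1
upper, lower = q3 + 1.5 * iqr, q1 - 1.5 * iqr
print(total_price[(total_price > upper) | (total_price < lower)].tolist())
```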

 

13. Find the outlier with a box plot:

import matplotlib.pyplot as plt
fig,ax=plt.subplots()

ax.boxplot(transactions['Total_Price'])
plt.show()

The box plot displays the outliers in the Total_Price column as dots.

 

14. Handle the outlier. Looking at the data set, the Total_Price in this row appears to have an extra trailing 0, so we replace 1800 with 180:

transactions.replace(1800,180,inplace=True)
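
DataFrame.replace substitutes the value wherever it appears in the frame; if 1800 could also occur in other columns, a column-targeted assignment is safer (a sketch on a toy frame):

```python
import pandas as pd

df = pd.DataFrame({'Unit_Price': [30.0, 60.0],
                   'Total_Price': [30.0, 1800.0]})

# Only touch Total_Price, leaving any 1800s in other columns alone.
df.loc[df['Total_Price'] == 1800, 'Total_Price'] = 180
print(df['Total_Price'].tolist())
```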

 

The data set now looks like this:

   Transaction_ID Transaction_Date  Product_ID  Quantity  Unit_Price  \
0               1       2010-08-21         2.0       1.0        30.0   
1               2       2011-05-26         4.0       1.0        40.0   
2               3       2011-06-16         3.0       2.0        32.0   
3               4       2012-08-26         2.0       3.0        55.0   
4               5       2013-06-06         4.0       1.0       124.0   
6               7       2013-12-30         3.0       2.0        47.5   
7               8       2014-04-24         2.0       2.0        47.5   
8               9       2015-04-24         4.0       3.0        60.0   
9              10       2016-05-08         4.0       4.0         9.0   

   Total_Price  
0       30.000  
1       40.000  
2       32.000  
3      165.000  
4      124.000  
6      282.125  
7      282.125  
8      180.000  
9       36.000 

 

This concludes the cleaning of this data set. Since the data set is very simple, it serves only as a demonstration; real-world data sets are much more complex, and cleaning them may additionally involve data type conversion, data format conversion, adding features, and so on.

 


Origin www.cnblogs.com/HuZihu/p/11404528.html