Python - frequently used data cleaning methods - handling duplicates

In data processing, raw data generally requires cleaning work: checking whether the data set contains duplicates, whether values are missing, whether the data is complete and consistent, and whether there are anomalies, and then handling whatever problems are found. Below we study one frequently used data cleaning method.

1. Handling duplicate observations

Duplicate observations are rows that appear more than once in the data set. Duplicates affect the accuracy of data analysis and mining results, so before analysis and modeling the data must be checked for duplicate observations; if any exist, they need to be removed.

Duplicate observations can easily arise during data collection, for example with a web crawler. The following table shows part of a data set of download counts for e-commerce APPs in an APP market, obtained by a crawler.


By observing the table, we can see that the CD product from Dangdang appears three times. If the collected data were not 10 rows but 10 million lines or more, it would be impossible to detect duplicates by eye.
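The original table is an image and is not reproduced here; as a stand-in, here is a small illustrative DataFrame (the APP names and download counts are made up) in which one row deliberately appears three times, mimicking the Dangdang CD example:

```python
import pandas as pd

# Hypothetical stand-in for the crawled APP-market data;
# the 'Dangdang CD' row appears three times on purpose.
df = pd.DataFrame({
    'appname': ['JD', 'Taobao', 'Dangdang CD', 'Suning', 'Dangdang CD',
                'Vipshop', 'Amazon', 'Dangdang CD', 'Pinduoduo', 'Tmall'],
    'downloads': [101, 85, 33, 60, 33, 44, 70, 33, 52, 98],
})

# duplicated() marks every occurrence after the first as a duplicate,
# so three identical rows yield two flagged duplicates
print(df.duplicated().sum())  # -> 2
```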

Here we look at how to check for duplicates with Python, and how to remove the duplicate entries.

Code:

import pandas as pd
df = pd.read_excel(r'D:\data_test04.xlsx')
print('Does the data set contain duplicate observations: \n', any(df.duplicated()))

out:

Does the data set contain duplicate observations: 
 True

These two simple lines of code check whether the data set contains duplicate records. The duplicated() method (the English word means "repeated") returns a test result for each row of the data set. To get the most direct answer, you can pass that result to the any() function, which evaluates multiple conditions and returns True as long as at least one of them is True. Here any() returns True, indicating that the data set does contain duplicate observations.
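As a minimal sketch with made-up values, this is how duplicated() and any() work together: duplicated() yields one boolean flag per row, and any() collapses those flags into a single answer.

```python
import pandas as pd

# Tiny example: the third row repeats the first
df = pd.DataFrame({'name': ['a', 'b', 'a'], 'value': [1, 2, 1]})

flags = df.duplicated()   # boolean Series: one flag per row
print(flags.tolist())     # -> [False, False, True]
print(any(flags))         # -> True: at least one duplicate row exists
```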

Delete the duplicate observations from the data set:

df.drop_duplicates(inplace=True)
df

 

As the results above show, the original 10 rows are reduced to 7 after deduplication; the deleted rows are numbered 3, 8, and 9. The inplace parameter of this method is set to True, which means the operation is performed directly on the original data set.
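Beyond inplace, drop_duplicates has a few other parameters worth knowing. A short sketch with made-up data (subset, keep, and reset_index are standard pandas features, not part of the original example):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'a', 'c'], 'value': [1, 2, 1, 3]})

# Default: compare all columns, keep the first occurrence of each duplicate
deduped_all = df.drop_duplicates()
print(len(deduped_all))  # -> 3

# subset limits the comparison to chosen columns; keep='last' keeps
# the final occurrence; reset_index renumbers the rows afterwards
deduped = df.drop_duplicates(subset=['name'], keep='last').reset_index(drop=True)
print(deduped['name'].tolist())  # -> ['b', 'a', 'c']
```

Note that drop_duplicates keeps the original row labels; calling reset_index(drop=True) afterwards gives a clean 0..n index, which is often what you want after deleting rows.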


Origin: www.cnblogs.com/tinglele527/p/11910693.html