8 Python data-cleaning code snippets, ready to copy and use



Data cleaning is an unavoidable step in data analysis and in training models on data, and it is also where data scientists and programmers spend most of their effort.

The data-cleaning snippets here have two advantages. First, each is written as a function, so it can be used directly without changing parameters. Second, they are very simple: the longest is only 11 lines, comments included.

When introducing each snippet, Lee describes what it is for, and the code itself is commented.

You may want to bookmark this article and use it as a toolbox.

Data-cleaning code covering 8 common scenarios

The data-cleaning code covers 8 scenarios in total:

dropping multiple columns, changing data types, converting categorical variables to numeric variables, checking for missing data, removing strings in a column, removing white space in a column, concatenating two columns of strings (conditionally), and converting timestamps (from string to datetime format)

Dropping multiple columns

Not every column is useful during data analysis, and df.drop makes it easy to delete the columns you specify.

 

def drop_multiple_col(col_names_list, df):
    '''
    AIM    -> Drop multiple columns based on their column names
    INPUT  -> List of column names, df
    OUTPUT -> updated df with dropped columns
    ------
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df
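As a quick illustration of the same idea, the sketch below drops two columns from a small made-up DataFrame. The column names and values are hypothetical. It also uses the `columns=` keyword without `inplace=True`, so `drop` returns a new DataFrame and leaves the input unchanged, which is a different design choice from the in-place function above.

```python
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "tmp_1": [0.1, 0.2, 0.3],
    "tmp_2": [True, False, True],
})

# columns=[...] is equivalent to passing the labels with axis=1.
# Without inplace=True, drop returns a new DataFrame and df is not modified.
trimmed = df.drop(columns=["tmp_1", "tmp_2"])

print(list(trimmed.columns))
print(list(df.columns))
```

Returning a new DataFrame makes it easier to keep the raw data around for comparison; the in-place form avoids holding two copies when memory is tight.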

Converting data types

When a dataset is large, converting the data types helps save memory.

def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Changing dtypes to save memory
    INPUT  -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
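To see the memory effect, one option is to compare `DataFrame.memory_usage` before and after the conversion. The sketch below does this on made-up data; the column names are hypothetical, and it assumes every value fits in the 32-bit types (larger integers would overflow and `float32` loses precision).

```python
import pandas as pd

# Hypothetical sample data; pandas typically stores these as int64 and float64.
df = pd.DataFrame({
    "count": range(1000),
    "price": [i * 0.5 for i in range(1000)],
})

# Total bytes used by the frame (index included) before the conversion.
before = df.memory_usage(deep=True).sum()

# Downcast to 32-bit types; assumes the values fit in int32 / float32.
df["count"] = df["count"].astype("int32")
df["price"] = df["price"].astype("float32")

# Total bytes after the conversion.
after = df.memory_usage(deep=True).sum()

print(before, after)
```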

Converting categorical variables to numeric variables

Some machine learning models require variables in numeric format, so categorical variables must first be converted to numeric ones. You can also keep the categorical variables for data visualization.

def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {'col_1': {'YES': 1, 'NO': 0},
                  'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
    df.replace(num_encode, inplace=True)
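If the categorical columns should be kept for visualization, one possible variation is to write the encoded values to new columns instead of overwriting the originals. The sketch below assumes the same mapping as above; the sample data and the `_num` suffix are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical sample data using the same column names as the mapping.
df = pd.DataFrame({
    "col_1": ["YES", "NO", "YES"],
    "col_2": ["WON", "LOSE", "DRAW"],
})

num_encode = {
    "col_1": {"YES": 1, "NO": 0},
    "col_2": {"WON": 1, "LOSE": 0, "DRAW": 0},
}

# Series.map looks each value up in the dict; values missing from the dict
# would become NaN. The "_num" suffix is an arbitrary naming choice.
for col, mapping in num_encode.items():
    df[col + "_num"] = df[col].map(mapping)

print(df)
```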

Checking for missing data

If you want to check how much data is missing in each column, the following code is the fastest way. It gives you a better picture of which columns have more missing data, which helps you decide the next steps of data cleaning and analysis.

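def check_missing_data(df):
    # assumed body, not present in the source text: count the missing values in each column
    return df.isnull().sum()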
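A minimal sketch of a per-column missing-data check is shown below on made-up data. It relies on `isnull().sum()` to count missing values per column; sorting in descending order and the share of missing rows are additions made here for illustration, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with some missing entries.
df = pd.DataFrame({
    "a": [1.0, float("nan"), 3.0, float("nan")],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

# Number of missing values per column, largest first.
missing_counts = df.isnull().sum().sort_values(ascending=False)

# Fraction of rows that are missing in each column.
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```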

Removing strings in a column

Sometimes newline characters or other odd symbols show up in a string column; df['col_1'].replace takes care of them easily.

def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
    df['col_1'] = df['col_1'].replace('\n', '', regex=True)
    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'] = df['col_1'].replace('&#.*', '', regex=True)
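The two replacements can be tried on a small made-up column, as sketched below. The sample strings are hypothetical. The result is assigned back to the column rather than using `inplace=True` on a selected column, since in-place changes on a selection may not propagate to the DataFrame in recent pandas versions.

```python
import pandas as pd

# Hypothetical sample data: a trailing newline and an HTML-entity remnant.
df = pd.DataFrame({"col_1": ["line one\n", "price&#36;5", "plain"]})

# With regex=True, Series.replace substitutes inside each string.
# Remove newline characters.
df["col_1"] = df["col_1"].replace(r"\n", "", regex=True)

# Remove "&#" and everything after it.
df["col_1"] = df["col_1"].replace(r"&#.*", "", regex=True)

print(df["col_1"].tolist())
```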

Removing white space in a column

When data is messy, anything can happen. Strings often have white space at the beginning, and the following code is handy for removing that leading white space from a column.

def remove_col_white_space(df, col):
    # remove white space at the beginning of string
    df[col] = df[col].str.lstrip()
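Note that `str.lstrip` only touches the start of each string. The sketch below contrasts it with `str.strip`, which trims both ends; the column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with leading and trailing spaces.
df = pd.DataFrame({"city": ["  Paris", " Rome ", "Oslo  "]})

# lstrip removes leading white space only; trailing spaces remain.
df["left"] = df["city"].str.lstrip()

# strip removes white space at both ends.
df["both"] = df["city"].str.strip()

print(df["left"].tolist())
print(df["both"].tolist())
```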

Concatenating two columns of strings (conditionally)

This code helps when you want to concatenate two string columns only under a certain condition. For example, you can check whether the first column ends with certain letters and, if it does, concatenate it with the second column.

If necessary, the letters at the end can be removed after the concatenation is done.

def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    col_new.replace('pil', '', regex=True, inplace=True)  # replace the 'pil' with an empty string
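A runnable sketch of the conditional concatenation on made-up data follows. The sample values are hypothetical, and it differs slightly from the function above: rows are selected with `.loc`, and the combined Series is kept in a variable so the result can be inspected or assigned to a column afterwards.

```python
import pandas as pd

# Hypothetical sample data; one value in col_1 is missing.
df = pd.DataFrame({
    "col_1": ["aspil", "bcd", None, "xpil"],
    "col_2": ["_one", "_two", "_three", "_four"],
})

# True only where col_1 ends with 'pil'; na=False treats missing values as non-matches.
mask = df["col_1"].str.endswith("pil", na=False)

# Concatenate the two columns for the matching rows only.
col_new = df.loc[mask, "col_1"] + df.loc[mask, "col_2"]

# Remove the literal 'pil' marker wherever it occurs in the combined strings.
col_new = col_new.str.replace("pil", "", regex=False)

print(col_new)
```

The result keeps the original row labels of the matching rows, so it can be written back with something like `df.loc[mask, "col_1"] = col_new` if that is the goal.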

 

Converting timestamps (from string to datetime format)

When dealing with time-series data, we are likely to come across timestamp columns stored as strings.

This means we need to convert the strings to datetime format (or another format, depending on our needs) so that the data can be analyzed meaningfully.
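One common way to do this conversion is `pd.to_datetime` with an explicit format string. The sketch below is a minimal example under assumptions: the column name `transaction_date`, the sample values, and the format `%Y-%m-%d %H:%M:%S.%f` are all hypothetical and need to match the actual data.

```python
import pandas as pd

# Hypothetical sample data: timestamps stored as strings.
df = pd.DataFrame({
    "transaction_date": [
        "2019-01-17 08:30:00.123",
        "2019-01-18 21:05:10.000",
    ],
})

# Parse the strings into datetime64 values; the format must match the strings.
df["timestamp"] = pd.to_datetime(
    df["transaction_date"], format="%Y-%m-%d %H:%M:%S.%f"
)

# Once parsed, the .dt accessor exposes date parts for analysis.
print(df["timestamp"].dt.year.tolist())
print(df.dtypes)
```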

 


----------------
Disclaimer: This is an original article by CSDN blogger "HelloWorld_MHC", released under the CC 4.0 BY-SA license. When reproducing it, please include the original source link and this statement.
Original link: https://blog.csdn.net/weixin_38175358/article/details/86521813


Origin www.cnblogs.com/bighammerdata/p/11647832.html