1. Pandas data cleaning
Data cleaning is the process of handling data that cannot be used as-is.
Many data sets contain missing values, incorrectly formatted data, wrong data, or duplicate data. To make data analysis more accurate, such data has to be dealt with first.
The examples below read their sample data from property-data.csv. That data contains four kinds of empty values:
- n/a
- NA
- -
- na
dropna()
This method can delete rows containing empty fields. The syntax is as follows:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter description:
- axis: The default is 0, which removes every row that contains an empty value. With axis=1, every column that contains an empty value is removed instead.
- how: The default is 'any', which removes a row (or column) if any of its values is NA. With how='all', a row (or column) is removed only when all of its values are NA.
- thresh: The minimum number of non-null values a row (or column) must contain to be kept.
- subset: The columns to check. To check several columns, pass a list of column names.
- inplace: If set to True, the source data is modified directly and None is returned.
Note: By default, the dropna() method returns a new DataFrame and does not modify the source data.
To modify the source DataFrame directly, pass the inplace=True parameter.
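The parameters described above can be sketched on a small in-memory DataFrame. The column names and values here are made up for illustration and are not taken from property-data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three columns with a few missing values.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [7.0, 8.0, 9.0, np.nan],
})

any_dropped = df.dropna()              # drop rows with at least one NaN
all_dropped = df.dropna(how="all")     # drop rows where every value is NaN
thresh_kept = df.dropna(thresh=2)      # keep rows with at least 2 non-null values
subset_kept = df.dropna(subset=["a"])  # only look at column "a"

# Each call above returned a new object, so df still has four rows here.
df.dropna(inplace=True)                # modifies df itself and returns None
print(df)
```

Only the last call changes df; the earlier results are independent copies.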
isnull()
This method reports, for each cell, whether it is empty:
import pandas as pd
df = pd.read_csv('property-data.csv')
# Print the column, then a True/False mask that marks its empty cells.
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
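By default, read_csv() treats markers such as "n/a" and "NA" as missing, but not "na" or "--". One way to handle that is the na_values parameter of read_csv(). The sketch below uses a hypothetical inline CSV string in place of a file on disk:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "PID,NUM_BEDROOMS\n1,3\n2,n/a\n3,NA\n4,na\n5,--\n"

# Default parsing: "n/a" and "NA" become NaN, "na" and "--" stay as strings.
default_df = pd.read_csv(io.StringIO(csv_text))
default_mask = default_df["NUM_BEDROOMS"].isnull()

# Markers listed in na_values are added to the default set.
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])
custom_mask = custom_df["NUM_BEDROOMS"].isnull()

print(default_mask.tolist())
print(custom_mask.tolist())
```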
fillna()
This method replaces empty fields with a specified value. The example below replaces every empty cell with 12345:
import pandas as pd
df = pd.read_csv('property-data.csv')
df.fillna(12345, inplace = True)
print(df.to_string())
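A single replacement value is rarely right for every column. As a minimal sketch, fillna() also accepts a dict that maps column names to replacement values. The data below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a text column.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 2.0],
    "street": ["PUTNAM", None, "LEXINGTON"],
})

# The dict gives each column its own replacement value.
filled = df.fillna({"rooms": 0, "street": "UNKNOWN"})

print(filled)
```

Because inplace is not set, df keeps its gaps and filled is a new DataFrame.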
The mean(), median(), and mode() methods calculate the mean (the sum of all values divided by their count), the median (the middle value once the values are sorted), and the mode (the value that occurs most frequently) of a column. The example below uses mean() to replace empty cells:
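import pandas as pd
df = pd.read_csv('property-data.csv')
# Compute the column mean (empty cells are skipped).
x = df["ST_NUM"].mean()
# Fill the empty cells with the mean and assign the column back to df.
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())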
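To compare the three statistics side by side, here is a small sketch on a hypothetical Series. Note that mode() returns a Series, because several values can tie for most frequent, so the first entry is taken:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
s = pd.Series([104.0, 197.0, np.nan, 201.0, 203.0, 203.0])

mean_value = s.mean()        # NaN is skipped by default
median_value = s.median()
mode_value = s.mode()[0]     # mode() returns a Series; take its first entry

# Three alternative ways to fill the gap.
filled_mean = s.fillna(mean_value)
filled_median = s.fillna(median_value)
filled_mode = s.fillna(mode_value)

print(mean_value, median_value, mode_value)
```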
The duplicated() and drop_duplicates() methods help clean duplicate data. For each row, duplicated() returns True if the row repeats an earlier row, and False otherwise.
import pandas as pd
person = {
    "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
    "age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
print(df.duplicated())
The result of running the program is:
0    False
1    False
2     True
3    False
dtype: bool
drop_duplicates()
This method removes duplicate rows:
import pandas as pd
persons = {
    "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
    "age": [50, 40, 40, 23]
}
df = pd.DataFrame(persons)
df.drop_duplicates(inplace = True)
print(df)
The result of running the program is:
     name  age
0  Google   50
1  Runoob   40
3  Taobao   23
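By default, a row counts as a duplicate only when all of its columns match an earlier row. The sketch below, which changes one age from the example above so the two 'Runoob' rows differ, shows how the subset and keep parameters of drop_duplicates() alter that:

```python
import pandas as pd

# Same names as above, but the two 'Runoob' rows now differ in age.
df = pd.DataFrame({
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 41, 23],
})

# No row is an exact repeat, so nothing is removed.
full_row = df.drop_duplicates()

# Compare only the 'name' column; the first match is kept by default.
by_name = df.drop_duplicates(subset=["name"])

# keep="last" keeps the last match instead.
by_name_last = df.drop_duplicates(subset=["name"], keep="last")

print(by_name)
print(by_name_last)
```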
2. Commonly used functions in Pandas
1. Read data
No. | Function | Description |
---|---|---|
1 | pd.read_csv(filename) | Read a CSV file |
2 | pd.read_excel(filename) | Read an Excel file |
3 | pd.read_sql(query, connection_object) | Read data from a SQL database |
4 | pd.read_json(json_string) | Read data from a JSON string |
5 | pd.read_html(url) | Read tables from an HTML page |
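As a rough sketch of two of these readers, the following parses hypothetical CSV and JSON text from in-memory buffers instead of files, so it runs without any external data:

```python
import io
import pandas as pd

# Hypothetical inline data standing in for files.
csv_text = "name,age\nGoogle,50\nRunoob,40\n"
json_text = '[{"name": "Google", "age": 50}, {"name": "Runoob", "age": 40}]'

# Both readers accept file-like objects as well as paths.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_csv)
print(df_json)
```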
2. View data
No. | Function | Description |
---|---|---|
1 | df.head(n) | Display the first n rows of data |
2 | df.tail(n) | Display the last n rows of data |
3 | df.info() | Display data information, including column names, data types, and non-null counts |
4 | df.describe() | Display basic statistics of the data, including mean, standard deviation, maximum, and minimum |
5 | df.shape | The number of rows and columns of the data, as a tuple |
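A short sketch of these inspection calls, using the hypothetical name and age data from the duplicate examples:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 40, 23],
})

first_two = df.head(2)      # first 2 rows
last_one = df.tail(1)       # last row
n_rows, n_cols = df.shape   # shape is an attribute, not a method
summary = df.describe()     # statistics for the numeric columns only

df.info()                   # prints column names, dtypes, non-null counts
print(summary)
```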
3. Data cleaning
No. | Function | Description |
---|---|---|
1 | df.dropna() | Remove rows or columns containing missing values |
2 | df.fillna(value) | Replace missing values with a specified value |
3 | df.replace(old_value, new_value) | Replace a specified value with a new value |
4 | df.duplicated() | Check whether there is duplicate data |
5 | df.drop_duplicates() | Remove duplicate data |
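Of these functions, replace() is the only one not demonstrated earlier. A minimal sketch, assuming a hypothetical yes/no column that contains one invalid entry:

```python
import numpy as np
import pandas as pd

# Hypothetical flag column; "12" is not a valid Y/N answer.
df = pd.DataFrame({"owner": ["Y", "N", "12", "Y"]})

# Turn the invalid entry into a missing value.
cleaned = df.replace("12", np.nan)

# A dict replaces several values in one call.
relabeled = cleaned.replace({"Y": "yes", "N": "no"})

print(relabeled)
```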
4. Data selection and slicing
No. | Function | Description |
---|---|---|
1 | df[column_name] | Select a specified column |
2 | df.loc[row_index, column_name] | Select data by label |
3 | df.iloc[row_index, column_index] | Select data by position |
4 | df.ix[row_index, column_name] | Select data by label or position (removed in pandas 1.0; use loc or iloc instead) |
5 | df.filter(items=[column_name1, column_name2]) | Select specified columns |
6 | df.filter(regex='regex') | Select columns whose names match a regular expression |
7 | df.sample(n) | Randomly select n rows of data |
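The difference between label-based and position-based selection is easiest to see with a non-numeric index. The sketch below uses hypothetical data and an index of letters:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Google", "Runoob", "Taobao"],
        "age": [50, 40, 23],
        "code": [1, 2, 3],
    },
    index=["a", "b", "c"],
)

col = df["age"]                   # a single column, as a Series
by_label = df.loc["b", "age"]     # row label "b", column label "age"
by_pos = df.iloc[1, 1]            # second row, second column

subset = df.filter(items=["name", "age"])   # columns by exact name
regex_cols = df.filter(regex="^a")          # columns starting with "a"

# random_state makes the random sample reproducible.
sampled = df.sample(n=2, random_state=0)

print(by_label, by_pos)
```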
5. Data sorting
No. | Function | Description |
---|---|---|
1 | df.sort_values(column_name) | Sort by the values of a specified column |
2 | df.sort_values([column_name1, column_name2], ascending=[True, False]) | Sort by the values of multiple columns |
3 | df.sort_index() | Sort by index |
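A small sketch of the three sorting calls on hypothetical data. In the multi-column case, rows are ordered by age ascending and ties are broken by name descending:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Taobao", "Google", "Runoob", "Baidu"],
    "age": [23, 50, 40, 40],
})

by_age = df.sort_values("age")

# One ascending flag per sort column.
by_age_then_name = df.sort_values(["age", "name"], ascending=[True, False])

# sort_index() restores the original row order here.
restored = by_age.sort_index()

print(by_age_then_name)
```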
6. Data grouping and aggregation
No. | Function | Description |
---|---|---|
1 | df.groupby(column_name) | Group by a specified column |
2 | df.groupby(column_name).aggregate(function_name) | Aggregate the grouped data |
3 | df.pivot_table(values, index, columns, aggfunc) | Generate a pivot table |
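Grouping and aggregation usually appear together. The following sketch uses a hypothetical sales table; agg() is the short alias of aggregate():

```python
import pandas as pd

# Hypothetical sales figures per city and year.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 20, 30, 50],
})

# One aggregate per group.
total_by_city = df.groupby("city")["sales"].sum()

# Several aggregates at once.
stats_by_city = df.groupby("city")["sales"].agg(["mean", "max"])

# Cities as rows, years as columns, summed sales as cell values.
pivot = df.pivot_table(values="sales", index="city", columns="year", aggfunc="sum")

print(pivot)
```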
7. Data merging
No. | Function | Description |
---|---|---|
1 | pd.concat([df1, df2]) | Combine multiple data frames by row or column |
2 | pd.merge(df1, df2, on=column_name) | Merge two data frames on specified columns |
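As a minimal sketch of the difference, concat() stacks frames, while merge() matches rows on a key column. The two frames below are hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Google", "Runoob", "Taobao"]})
right = pd.DataFrame({"id": [2, 3, 4], "age": [40, 23, 30]})

# Stack rows; ignore_index renumbers the result from 0.
stacked = pd.concat([left, left], ignore_index=True)

# The default merge is an inner join: only ids present in both frames.
inner = pd.merge(left, right, on="id")

# how="outer" keeps ids from either frame and fills gaps with NaN.
outer = pd.merge(left, right, on="id", how="outer")

print(inner)
print(outer)
```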
8. Data selection and filtering
No. | Function | Description |
---|---|---|
1 | df.loc[row_indexer, column_indexer] | Select rows and columns by label |
2 | df.iloc[row_indexer, column_indexer] | Select rows and columns by position |
3 | df[df['column_name'] > value] | Select rows whose value in a column meets a condition |
4 | df.query('column_name > value') | Select rows whose value in a column meets a condition, using a string expression |
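Boolean indexing and query() express the same filter in two ways. A short sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 40, 23],
})

# Boolean mask: the comparison yields one True/False per row.
mask_rows = df[df["age"] > 30]

# The same condition as a string expression.
query_rows = df.query("age > 30")

# Combine conditions with & and wrap each one in parentheses.
combined = df[(df["age"] > 30) & (df["name"] != "Google")]

print(mask_rows)
print(combined)
```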
9. Data statistics and description
No. | Function | Description |
---|---|---|
1 | df.describe() | Calculate basic statistics such as mean, standard deviation, minimum, and maximum |
2 | df.mean() | Calculate the mean of each column |
3 | df.median() | Calculate the median of each column |
4 | df.mode() | Calculate the mode of each column |
5 | df.count() | Count the non-missing values in each column |
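Applied to a whole DataFrame, these methods work column by column and skip missing values. A minimal sketch with hypothetical numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; column "x" has one missing value.
df = pd.DataFrame({
    "x": [1.0, 2.0, 2.0, np.nan],
    "y": [10.0, 20.0, 30.0, 40.0],
})

means = df.mean()       # one value per column, NaN skipped
medians = df.median()
modes = df.mode()       # a DataFrame, since a column can have several modes
counts = df.count()     # non-missing values per column

print(means)
print(counts)
```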