Python machine learning: drop() deletes rows and columns

In feature engineering and when splitting data sets, the drop() function comes in handy. It makes it easy to remove data from a DataFrame, whether whole rows or whole columns.

The commonly used parameters of drop() are shown below. Rows to delete are specified with index, and columns to delete with columns:

DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)

Parameters:

labels: The label(s) of the rows or columns to delete, either a single label or a list of labels. Whether they refer to rows or columns is decided by axis.

axis: Whether labels refers to rows or columns: 0 (or 'index') for rows, 1 (or 'columns') for columns. The default is 0.

index: The row label(s) to delete, either a single label or a list of labels. index=labels is equivalent to labels, axis=0.

columns: The name(s) of the column(s) to delete, either a single name or a list of names. columns=labels is equivalent to labels, axis=1.

inplace: Whether to modify the original DataFrame. The default is False, which leaves the original untouched and returns a new DataFrame. With inplace=True, the original is modified and drop() returns None.
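The parameter descriptions above can be checked with a small sketch. The DataFrame here is toy data invented for illustration, not the data set used later in this article; it shows that labels with axis and the index/columns keywords are interchangeable, and what inplace changes.

```python
import pandas as pd

# Toy data, made up for illustration only
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["r0", "r1", "r2"],
)

# Two equivalent ways to drop column "b"
by_axis = df.drop("b", axis=1)
by_keyword = df.drop(columns="b")

# Two equivalent ways to drop rows "r0" and "r2"
rows_by_axis = df.drop(["r0", "r2"], axis=0)
rows_by_keyword = df.drop(index=["r0", "r2"])

# drop() returns a new DataFrame by default, so df is unchanged
print(df.shape)  # (3, 3)

# With inplace=True the object itself is modified and the call returns None
df_copy = df.copy()
result = df_copy.drop(columns=["a", "c"], inplace=True)
print(result, list(df_copy.columns))  # None ['b']
```

The keyword forms (index=, columns=) tend to read more clearly than axis=0 or axis=1, since the call itself says what is being removed.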


Delete columns

Use case 1: remove unwanted features.

For example, features that have little influence on the result, such as independent variables unrelated to the dependent variable, can be deleted. To avoid multicollinearity, one of each group of strongly correlated independent variables should also be deleted.

df = data.drop(['RowNumber','CustomerId','Surname'], axis=1)
df

Code explanation:

data is the data set, and the list in square brackets names the 3 fields to be deleted. Passing the list of column names is enough; there is no need to select the columns first with data[[...]];

axis=1 means the operation applies to columns;
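For the multicollinearity case mentioned in this use case, one possible way to pick the columns to drop is sketched below. This is not the article's method, only a common heuristic; the data, the column names x1 to x3, and the 0.9 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: x2 is almost a copy of x1
df = pd.DataFrame(
    {
        "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "x2": [1.1, 2.0, 3.1, 4.0, 5.1],
        "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)

threshold = 0.9  # hypothetical cutoff; choose one that suits your data

# Absolute pairwise correlations, keeping only the upper triangle
# so that each pair of columns is looked at once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)                # ['x2']
print(list(reduced.columns))  # ['x1', 'x3']
```

Which column of a correlated pair is kept depends here only on column order, so in practice the choice is usually reviewed by hand.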



Use case 2: delete the dependent variable

# Independent variables and dependent variable
x_data = df.drop(['Exited'],axis=1)
y_data = df['Exited']
x_data

Code explanation:

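Pass the field to be deleted to drop(); here this removes the column named "Exited" and returns the remaining columns as x_data. df itself is not modified, so df['Exited'] is still available for y_data;

['Exited'] is the dependent variable we want to remove. A single field can be written as a one-element list like this, or as the plain string 'Exited';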
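The same split can be tried on a small stand-in DataFrame. The columns CreditScore and Age and all the values below are invented for illustration; only the column name Exited comes from the example above.

```python
import pandas as pd

# Toy stand-in for the data set; columns other than "Exited" are hypothetical
df = pd.DataFrame(
    {
        "CreditScore": [619, 608, 502, 699],
        "Age": [42, 41, 42, 39],
        "Exited": [1, 0, 1, 0],
    }
)

x_data = df.drop(columns="Exited")  # same result as df.drop(['Exited'], axis=1)
y_data = df["Exited"]

print(list(x_data.columns))    # ['CreditScore', 'Age']
print(y_data.tolist())         # [1, 0, 1, 0]
print("Exited" in df.columns)  # True: df still contains the target column
```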



Delete rows

Use case 3: when splitting the data set, first sample the training set, then remove the sampled rows from the full data set. What remains is the test set.

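# Training set: a random 80% of the rows (random_state makes it reproducible)
train_data = data.sample(frac=0.8, random_state=0)
# Test set: the rows that were not sampled
test_data = data.drop(train_data.index)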

 

Code explanation:

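Pass row index labels to drop() to delete those rows;

train_data is the sampled training set, and train_data.index holds the index labels of its rows;

axis=0 means that rows are deleted; it is the default value, so it can be omitted. Because drop() matches rows by label, this split assumes that the index of data contains no duplicate labels;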
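A quick sanity check of this split, as a minimal sketch on toy data (the feature and label columns are invented for illustration): the two sets should not overlap, and together they should cover the original rows.

```python
import pandas as pd

# Toy data with a default, unique RangeIndex (0..9)
data = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

train_data = data.sample(frac=0.8, random_state=0)
test_data = data.drop(train_data.index)

print(len(train_data), len(test_data))  # 8 2

# The two sets share no rows and together cover the original data
assert train_data.index.intersection(test_data.index).empty
assert len(train_data) + len(test_data) == len(data)

# drop() matches on labels, so the index must be unique for this to hold
assert data.index.is_unique
```

If the index does contain duplicate labels, calling data.reset_index(drop=True) before sampling gives every row its own label.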


Origin blog.csdn.net/Sukey666666/article/details/128927802