[Python] Data mining analysis and cleaning: a summary of discretization methods


Foreword

Discretization is an important part of data cleaning. Later steps such as standardization, outlier handling, and modeling all require text data to be converted into discrete codes first. Here I divide discretization into two categories: discretization of numerical data and discretization of character data.


1. Discretization of character data

The purpose of discretizing character data is to let subsequent data cleaning run normally, because many cleaning operations cannot be applied to columns that contain text. The 'report type', 'accounting standard', and 'currency code' columns are used as examples here.

1.1 One-hot encoding

One-hot encoding turns each distinct value that appears in a column into its own 0/1 column, so a single column becomes multi-dimensional.

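import pandas as pd
# '会计准则' is the 'accounting standard' column; data is assumed to be an existing DataFrame
emb_dummies_df = pd.get_dummies(data['会计准则'], prefix='会计准则')
# prefix is the string prepended to each value to form the new column names
emb_dummies_df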


One-hot encoding converts the values of a column into a multi-dimensional numeric representation, but it increases the number of dimensions and therefore the amount of computation. You can also cluster the data first (for example with k-means) and then encode the clusters.
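
As a self-contained illustration of one-hot encoding, here is a minimal sketch on toy data. The column name `accounting_standard` and its values are hypothetical stand-ins, not the dataset used in this post.

```python
import pandas as pd

# Toy data; the column name and values are hypothetical stand-ins.
df = pd.DataFrame({"accounting_standard": ["CAS", "IFRS", "CAS", "US GAAP"]})

# One new 0/1 column per distinct value. dtype=int keeps the output numeric
# across pandas versions (newer versions default to bool).
dummies = pd.get_dummies(
    df["accounting_standard"], prefix="accounting_standard", dtype=int
)

# Attach the indicator columns and drop the original text column.
encoded = pd.concat([df.drop(columns="accounting_standard"), dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 among the new columns, which is why the number of columns grows with the number of distinct values.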

1.2 Factorize (integer) encoding

The previous method turns a single column into multi-dimensional data, using 1 and 0 to indicate whether each value is present. Factorizing instead maps the values in the column to integer codes, one per class. Note that pd.factorize numbers the classes 0, 1, ..., n-1 in order of first appearance, not 1 to n. The code and an example are given below.

data['会计准则'] = pd.factorize(data['会计准则'])[0]
data[['会计准则']]
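
To show what `pd.factorize` returns, here is a small sketch on hypothetical values (not the dataset above). It assumes the default behavior, in which missing values receive the sentinel code -1.

```python
import pandas as pd

# Hypothetical values standing in for the accounting-standard column.
s = pd.Series(["CAS", "IFRS", "CAS", "US GAAP", None])

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(s)
print(codes)    # one integer code per row, starting at 0; missing values get -1
print(uniques)  # uniques[code] recovers the original label

# Map codes back to labels, skipping the -1 sentinel for missing values.
decoded = [uniques[c] if c >= 0 else None for c in codes]
print(decoded)
```

Keeping `uniques` around is what makes the encoding reversible, which the `[0]` indexing in the one-liner above discards.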


Interim summary: everything above discretizes a single column. Besides these methods, there are encodings such as TF-IDF, which are often used in text analysis. If time permits, I may cover them in a later update.

2. Discretization of numerical data

2.1 Binning (data binning)

Binning divides the data into intervals, such as 1-30, 30-100, and 100-1000, so that each value falls into an interval class that can then be analyzed accordingly.
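
The ranges mentioned above can be tried directly with `pd.cut`. This is a minimal sketch; the sample values are made up, and only the edges 1, 30, 100, 1000 come from the text.

```python
import pandas as pd

# Hypothetical amounts; the edges mirror the 1-30, 30-100, 100-1000 ranges.
values = pd.Series([5, 30, 45, 100, 250, 999])
edges = [1, 30, 100, 1000]

# Default right=True gives intervals (1, 30], (30, 100], (100, 1000].
right_closed = pd.cut(values, edges)

# right=False flips them to [1, 30), [30, 100), [100, 1000).
left_closed = pd.cut(values, edges, right=False)

print(pd.DataFrame({
    "value": values,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

The boundary values 30 and 100 land in different bins depending on `right`, so it is worth deciding which end is closed before analyzing the groups.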

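import numpy as np
import pandas as pd

# Sample values for illustration only; they are not from the original post.
# train_data and df below are assumed to be existing DataFrames.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
age_bins = [18, 25, 35, 60, 100]

# Intervals follow mathematical notation: a parenthesis is an open end and a
# square bracket is a closed end. right=False closes the left end instead.
print(pd.cut(ages, age_bins, right=False))  # cut splits at the given edges

# Bin a single column by quantiles
train_data['Fare_bin'] = pd.qcut(train_data['Fare'], 5)  # 5 = five quantile bins

# Bin by custom ranges
bins = [0, 59, 70, 80, 100]
df['Categories'] = pd.cut(df['score'], bins)  # each value in bins is an interval edge

# Custom bin (interval) names can be supplied through labels
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
age_groups = pd.cut(ages, age_bins, labels=group_names)
print(age_groups)
print(pd.Series(age_groups).value_counts())

# If a number of bins is passed instead of bin edges, pandas computes
# equal-width bins from the minimum and maximum of the data
data2 = np.random.rand(20)
print(pd.cut(data2, 4, precision=2))  # precision=2 limits the edges to two decimal places

# qcut bins by sample quantiles. Depending on the distribution, cut usually does not
# put the same number of points in each bin, whereas qcut yields bins of roughly equal size
data3 = np.random.randn(1000)  # normally distributed
cats = pd.qcut(data3, 4)
print(pd.Series(cats).value_counts())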

Data binning is a method of discretizing continuous variables, which divides the continuous data range into several ordered, non-overlapping intervals, and then maps the data into the corresponding intervals.
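
To make the difference between equal-width binning (`pd.cut`) and quantile binning (`pd.qcut`) concrete, here is a small sketch on synthetic skewed data. The seed, distribution, and sample size are arbitrary choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic skewed data; the seed and distribution are arbitrary choices.
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=1.0, size=1000))

equal_width = pd.cut(x, 4)   # 4 intervals of (nearly) equal length
equal_count = pd.qcut(x, 4)  # 4 intervals cut at the quartiles

# With skewed data most points fall into the first equal-width bin,
# while the quantile bins each hold a quarter of the points.
print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

Equal-width bins keep the interval lengths comparable; quantile bins keep the group sizes comparable. Which one fits depends on whether the analysis cares more about the value ranges or about balanced groups.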

The significance of data binning is:

Reduced complexity: For some machine learning algorithms, handling continuous variables increases computational complexity. Binning transforms continuous variables into discrete ones, which reduces that complexity and makes missing values and outliers easier to handle (for example, by giving them their own bin).

Improved prediction accuracy: In some scenarios, discretized data reveals the relationship between variables more clearly and can improve the prediction accuracy of the model. For example, in a credit scoring model, dividing income into tiers can better capture the non-linear relationship between income and default rates.

Ease of interpretation and visualization: Discretized data is easier to interpret and visualize. For example, in marketing analysis, dividing age into several groups can more clearly show the demographic distribution and consumption habits of different age groups.
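
The interpretation benefit can be sketched with a per-group summary. The customer records, age bands, and labels below are all made up for illustration.

```python
import pandas as pd

# Made-up customer records, for illustration only.
customers = pd.DataFrame({
    "age":   [19, 23, 31, 38, 45, 52, 67, 71],
    "spend": [120, 150, 300, 280, 410, 390, 200, 180],
})

# Hypothetical age bands; the edges and labels are assumptions.
customers["age_group"] = pd.cut(
    customers["age"],
    bins=[0, 30, 50, 120],
    labels=["under_30", "30_to_50", "over_50"],
)

# Per-group summary that is easy to read or to plot as a bar chart.
summary = customers.groupby("age_group", observed=True)["spend"].agg(["count", "mean"])
print(summary)
```

A table with one row per age band is usually easier to explain than a scatter of raw ages against spend, at the cost of hiding variation inside each band.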

Summary

Discretization of continuous variables:
Discretization of continuous variables divides the continuous data range into several ordered, non-overlapping intervals and then maps the data into the corresponding intervals. The discretized data can reveal the relationship between variables more clearly and can improve the prediction accuracy of the model. In addition, it can reduce computational complexity, make missing values and outliers easier to handle, and make interpretation and visualization easier.

Character discretization:
Character discretization converts character data into discrete data. The discretized data can be better applied to algorithms such as classification, clustering, and association rule mining. For example, in text classification, after converting the text into a bag-of-words model, each word can be converted into a feature by discretization, and the text can be converted into a vector. In addition, character discretization can also facilitate data processing, such as data deduplication, data compression, etc.
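
The bag-of-words idea mentioned above can be sketched in a few lines. The corpus is made up, and splitting on whitespace is a simplifying assumption; real text usually needs proper tokenization.

```python
from collections import Counter

import pandas as pd

# Tiny made-up corpus; whitespace tokenization is a simplifying assumption.
docs = ["profit rose sharply", "profit fell", "revenue rose"]
tokens = [doc.split() for doc in docs]

# Vocabulary: every distinct word, sorted for a stable column order.
vocab = sorted({word for doc in tokens for word in doc})

# One row per document, one column per word, cell = word count.
rows = [[Counter(doc)[word] for word in vocab] for doc in tokens]
bow = pd.DataFrame(rows, columns=vocab)
print(bow)
```

Each document is now a fixed-length numeric vector, which is the form that classification and clustering algorithms expect.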

Origin blog.csdn.net/weixin_47058355/article/details/128894091