1. Overview
Discretization maps values from an infinite (continuous) space onto a finite set of values. It is mostly applied to continuous data; after processing, the value domain of an attribute changes from continuous to discrete.
Although discretization is usually applied to continuous data, it can also be applied to data that is already discrete. This happens when the existing categories are too fine-grained, too trivial, or inconsistent with business logic, so the data must be aggregated or repartitioned.
2. Discretization for time data
Discretization of time data is mainly used to aggregate data and convert granularity when time is the main feature. After discretization, scattered time values become higher-level time features. In a dataset, time may appear as the row sequence (index) or as a column (dimension) that records a characteristic of the data. Common discretization operations for time data fall into two categories:
1. Discretization within a day. Generally, the timestamp is converted to a second, minute, or hour, or to morning/afternoon (am/pm).
2. Discretization at daily granularity or coarser. Generally, the date is converted to a week number, day of the week, month, workday or weekend, quarter, year, etc.
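Both categories can be sketched with the standard library alone. The function name `time_features` and the keys of the returned dictionary are illustrative assumptions, and the workday flag ignores public holidays.

```python
from datetime import datetime

def time_features(ts: str) -> dict:
    """Derive coarser time features from a 'YYYY-MM-DD HH:MM:SS' string."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        # Within-day features
        "hour": dt.hour,
        "am_pm": "am" if dt.hour < 12 else "pm",
        # Daily granularity or coarser
        "weekday": dt.weekday(),            # Monday = 0 ... Sunday = 6
        "is_workday": dt.weekday() < 5,     # assumption: Monday to Friday, no holiday calendar
        "iso_week": dt.isocalendar()[1],    # ISO 8601 week number
        "month": dt.month,
        "quarter": (dt.month - 1) // 3 + 1,
        "year": dt.year,
    }

features = time_features("2017-04-30 19:24:13")
print(features)
```

Which of these features is worth keeping depends on the analysis; a model of daily traffic, for instance, may only need the hour and the workday flag.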
3. Discretization for multi-valued discrete data
Discretization of multi-valued discrete data means that the data to be discretized is not numerical but categorical or ordinal. For example, a user income variable may originally be divided into 10 intervals; if a new modeling requirement needs only 4 intervals, the original 10 intervals must be merged.
Multi-valued discrete data may also need to be discretized again when the logic of the existing division is flawed and the data has to be re-divided. For example, a user activity variable was originally divided into three categories: high value, medium value, and low value. As the business develops, the new user activity variable is defined with four categories: high value, medium value, low value, and negative value. The data from the different category schemes must then be discretized again under uniform rules.
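Merging categories usually comes down to a lookup table from old labels to new ones. The sketch below merges 10 income brackets into 4; the bracket labels, the `merge_map` table, and the `remap` helper are all hypothetical and only illustrate the idea.

```python
# Hypothetical income brackets; the labels are illustrative only.
merge_map = {
    "0-10": "0-30", "10-20": "0-30", "20-30": "0-30",
    "30-40": "30-60", "40-50": "30-60", "50-60": "30-60",
    "60-70": "60-90", "70-80": "60-90", "80-90": "60-90",
    ">90": ">90",
}

def remap(values, mapping):
    """Replace each fine-grained label with its coarse label.

    Raises KeyError for labels missing from the mapping, so that
    unexpected categories are not silently dropped.
    """
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        raise KeyError(f"no mapping for labels: {unknown}")
    return [mapping[v] for v in values]

coarse = remap(["0-10", "40-50", ">90", "20-30"], merge_map)
print(coarse)
```

Failing on unknown labels is a design choice: a silent default would hide exactly the inconsistent categories that re-discretization is meant to clean up.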
4. Discretization for continuous data
Discretizing continuous data is the main application of discretization. The results of these algorithms are class or attribute labels rather than numerical values. The results fall into two categories: one divides continuous data into a set of specific intervals, such as {(0,10], (10,20], (20,50], (50,100]}; the other assigns continuous data to specific classes, such as class 1, class 2, and class 3. Common methods for discretizing continuous data include:
Quantile method: discretize using quantiles such as quartiles, quintiles, or deciles. This method is simple to apply.
Distance interval method: discretize using equal-width intervals or custom intervals. This is flexible and can meet custom requirements. In addition, this method (especially with equal-width intervals) preserves the shape of the original data distribution well.
Frequency interval method: sort the data, then discretize it into intervals of equal frequency or of a specified frequency. This transforms the data into an approximately uniform distribution. The advantage is that every interval holds the same number of observations; the disadvantage is that the distribution of the original data is changed.
Clustering method: use a clustering algorithm, for example K-means, to divide the sample set into several discrete clusters.
Chi-square method: use a chi-square-based discretization method to find the adjacent intervals best suited to merging and merge them into larger intervals.
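The chi-square approach is the least obvious of the methods listed, so here is a minimal sketch in the spirit of ChiMerge. It assumes each sample carries a class label (the method is supervised) and stops at a fixed number of intervals rather than at a chi-square significance threshold, which is a simplification. The names `chi2_pair` and `chimerge` are illustrative, not a library API.

```python
from collections import Counter

def chi2_pair(a: Counter, b: Counter) -> float:
    """Chi-square statistic of the 2 x k table built from two intervals' class counts."""
    classes = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, max_intervals):
    """Merge adjacent intervals bottom-up; return the lower bound of each remaining interval."""
    # Start with one interval per distinct value, each holding its class counts.
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    lowers = sorted(counts)
    rows = [counts[v] for v in lowers]
    while len(rows) > max_intervals:
        # The adjacent pair with the lowest chi-square has the most similar class mix.
        chis = [chi2_pair(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
        i = chis.index(min(chis))
        rows[i] = rows[i] + rows[i + 1]      # Counter addition sums the class counts
        del rows[i + 1], lowers[i + 1]
    return lowers

cut_points = chimerge([1, 2, 3, 4, 10, 11, 12, 13],
                      [0, 0, 0, 0, 1, 1, 1, 1],
                      max_intervals=2)
print(cut_points)
```

In this toy input the two classes are perfectly separated by value, so the merging ends with one interval starting at 1 and one starting at 10.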
5. Binarization for continuous data
In many scenarios a feature needs to be binarized: each data point is compared with a threshold, set to one fixed value (such as 1) if it is greater than the threshold, and set to another fixed value (such as 0) if it is not, which yields a dataset with only two values. Binarization presupposes that all attribute values in the dataset have the same or similar meaning.
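The rule fits in a few lines of plain Python. In this sketch the `binarize` helper is hypothetical, the sample values are illustrative, and a value exactly equal to the threshold maps to 0, the same convention scikit-learn's `Binarizer` follows.

```python
def binarize(values, threshold):
    """Map values strictly greater than the threshold to 1 and all others to 0."""
    return [1 if v > threshold else 0 for v in values]

incomes = [10.40, 4.68, 3.84, 3.70, 4.25]     # illustrative values
mean_income = sum(incomes) / len(incomes)     # use the mean as the threshold
flags = binarize(incomes, mean_income)
print(mean_income, flags)
```

Because one large value pulls the mean upward, a mean threshold can leave very few points above it; the median is a common alternative when a balanced split is wanted.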
6. Code practice: Python data discretization processing
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import preprocessing
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_table(r'F:\小橙书\chapter3\data7.txt', names=['id', 'amount', 'income', 'datetime', 'age'])
df.head(5)
| | id | amount | income | datetime | age |
---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 2017-04-30 19:24:13 | 0-10 |
1 | 15062 | 4024 | 4.68 | 2017-04-27 22:44:59 | 70-80 |
2 | 15028 | 6359 | 3.84 | 2017-04-27 10:07:55 | 40-50 |
3 | 15012 | 7759 | 3.70 | 2017-04-04 07:28:18 | 30-40 |
4 | 15021 | 331 | 4.25 | 2017-04-08 11:14:00 | 70-80 |
6.1 Discretization of temporal data
# Convert the whole column at once and keep only the day of the week (Monday=0 ... Sunday=6)
df['datetime'] = pd.to_datetime(df['datetime']).dt.weekday
df.head(5)
| | id | amount | income | datetime | age |
---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 6 | 0-10 |
1 | 15062 | 4024 | 4.68 | 3 | 70-80 |
2 | 15028 | 6359 | 3.84 | 3 | 40-50 |
3 | 15012 | 7759 | 3.70 | 1 | 30-40 |
4 | 15021 | 331 | 4.25 | 5 | 70-80 |
6.2 Discretization of Multivalued Data
# Lookup table from the original 10 age groups to 3 coarser groups
map_df=pd.DataFrame([['0-10','0-40']
,['10-20','0-40']
,['20-30','0-40']
,['30-40','0-40']
,['40-50','40-80']
,['50-60','40-80']
,['60-70','40-80']
,['70-80','40-80']
,['80-90','>80']
,['>90','>80']],columns=['age','age2'])
# Inner join on the shared 'age' column; rows whose age label is missing from map_df are dropped
df_tmp = df.merge(map_df, on='age', how='inner')
df_tmp
| | id | amount | income | datetime | age | age2 |
---|---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 6 | 0-10 | 0-40 |
1 | 15064 | 7952 | 4.40 | 0 | 0-10 | 0-40 |
2 | 15080 | 503 | 5.72 | 5 | 0-10 | 0-40 |
3 | 15068 | 1668 | 3.19 | 5 | 0-10 | 0-40 |
4 | 15019 | 6710 | 3.20 | 0 | 0-10 | 0-40 |
... | ... | ... | ... | ... | ... | ... |
95 | 15098 | 2014 | 3.03 | 6 | 60-70 | 40-80 |
96 | 15046 | 6215 | 5.09 | 2 | 80-90 | >80 |
97 | 15095 | 5294 | 3.74 | 1 | 80-90 | >80 |
98 | 15074 | 5381 | 3.28 | 6 | 80-90 | >80 |
99 | 15074 | 4834 | 3.92 | 2 | 80-90 | >80 |
100 rows × 6 columns
df = df_tmp.drop(columns='age')
df.head(5)
| | id | amount | income | datetime | age2 |
---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 6 | 0-40 |
1 | 15064 | 7952 | 4.40 | 0 | 0-40 |
2 | 15080 | 503 | 5.72 | 5 | 0-40 |
3 | 15068 | 1668 | 3.19 | 5 | 0-40 |
4 | 15019 | 6710 | 3.20 | 0 | 0-40 |
6.3 Discretization of continuous data
Method 1: Customize the binning interval to achieve discretization
# Custom bin edges; by default pd.cut builds right-closed intervals such as (0, 200]
bins = [0, 200, 1000, 5000, 10000]
df['amount1'] = pd.cut(df['amount'], bins)
df.head(5)
| | id | amount | income | datetime | age2 | amount1 |
---|---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 6 | 0-40 | (1000, 5000] |
1 | 15064 | 7952 | 4.40 | 0 | 0-40 | (5000, 10000] |
2 | 15080 | 503 | 5.72 | 5 | 0-40 | (200, 1000] |
3 | 15068 | 1668 | 3.19 | 5 | 0-40 | (1000, 5000] |
4 | 15019 | 6710 | 3.20 | 0 | 0-40 | (5000, 10000] |
Method 2: Discretization using clustering
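data = df.loc[:, 'amount']
# KMeans expects a 2-D array of shape (n_samples, n_features)
data_reshape = data.values.reshape(-1, 1)
model_kmeans = KMeans(n_clusters=4, random_state=0)
# Cluster ids are arbitrary labels; they are not ordered by amount
kmeans_result = model_kmeans.fit_predict(data_reshape)
df['amount2'] = kmeans_result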
df.head(5)
| | id | amount | income | datetime | age2 | amount1 | amount2 |
---|---|---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 6 | 0-40 | (1000, 5000] | 2 |
1 | 15064 | 7952 | 4.40 | 0 | 0-40 | (5000, 10000] | 1 |
2 | 15080 | 503 | 5.72 | 5 | 0-40 | (200, 1000] | 2 |
3 | 15068 | 1668 | 3.19 | 5 | 0-40 | (1000, 5000] | 2 |
4 | 15019 | 6710 | 3.20 | 0 | 0-40 | (5000, 10000] | 1 |
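In the output above the cluster ids carry no order: a small amount falls in cluster 2 while a large one falls in cluster 1. If ordered bins are wanted, one possible post-processing step is to renumber the clusters by their centers. The sketch below assumes NumPy is available; `order_labels_by_center` is a hypothetical helper, and the example arrays are made-up stand-ins for `kmeans_result` and `model_kmeans.cluster_centers_`.

```python
import numpy as np

def order_labels_by_center(labels, centers):
    """Renumber cluster labels so that 0 is the cluster with the smallest center."""
    centers = np.asarray(centers).ravel()
    order = np.argsort(centers)             # order[k] = old label of the k-th smallest center
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))     # rank[old_label] = new, ordered label
    return rank[np.asarray(labels)]

# Illustrative values only, standing in for kmeans_result and model_kmeans.cluster_centers_
example_labels = [2, 1, 2, 0, 3]
example_centers = [[3000.0], [7500.0], [900.0], [5200.0]]
ordered = order_labels_by_center(example_labels, example_centers)
print(ordered.tolist())
```

With the real model the call would look like `order_labels_by_center(kmeans_result, model_kmeans.cluster_centers_)`.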
Method 3: Discretization using quartiles
labels=['bad','medium','good','awesome']
df['amount3']=pd.qcut(df['amount'],4,labels=labels)
df.head(5)
| | id | amount | income | datetime | age2 | amount1 | amount2 | amount3 |
---|---|---|---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 6 | 0-40 | (1000, 5000] | 2 | bad |
1 | 15064 | 7952 | 4.40 | 0 | 0-40 | (5000, 10000] | 1 | awesome |
2 | 15080 | 503 | 5.72 | 5 | 0-40 | (200, 1000] | 2 | bad |
3 | 15068 | 1668 | 3.19 | 5 | 0-40 | (1000, 5000] | 2 | bad |
4 | 15019 | 6710 | 3.20 | 0 | 0-40 | (5000, 10000] | 1 | awesome |
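Method 1 (fixed or equal-width intervals) and Method 3 (quantiles) behave very differently on skewed data. The following sketch contrasts them on a small made-up sample that is not taken from data7.txt; `labels=False` makes pandas return bin indices instead of interval labels.

```python
import pandas as pd

# Illustrative, deliberately skewed sample
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: 4 bins of equal length between the minimum and the maximum
width_codes = pd.cut(sample, bins=4, labels=False)

# Equal-frequency: bin edges at the quartiles, so each bin holds about the same number of rows
freq_codes = pd.qcut(sample, q=4, labels=False)

print(width_codes.tolist())
print(freq_codes.tolist())
```

The equal-width bins keep the shape of the distribution, with seven values crowded into the first bin and the outlier alone in the last, while the quartile bins hold two values each and flatten the distribution.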
6.4 Binarization of continuous data
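# Use the mean income as the threshold
binarizer_scaler = preprocessing.Binarizer(threshold=df['income'].mean())
# Binarizer expects a 2-D array of shape (n_samples, n_features)
df_income = df.loc[:, 'income'].values.reshape(-1, 1)
income_tmp = binarizer_scaler.fit_transform(df_income)
# Flatten the (n, 1) result back to 1-D before assigning it as a column
df['income1'] = income_tmp.ravel()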
df.head(5)
| | id | amount | income | datetime | age2 | amount1 | amount2 | amount3 | income1 |
---|---|---|---|---|---|---|---|---|---|
0 | 15093 | 1390 | 10.40 | 6 | 0-40 | (1000, 5000] | 2 | bad | 1.0 |
1 | 15064 | 7952 | 4.40 | 0 | 0-40 | (5000, 10000] | 1 | awesome | 1.0 |
2 | 15080 | 503 | 5.72 | 5 | 0-40 | (200, 1000] | 2 | bad | 1.0 |
3 | 15068 | 1668 | 3.19 | 5 | 0-40 | (1000, 5000] | 2 | bad | 0.0 |
4 | 15019 | 6710 | 3.20 | 0 | 0-40 | (5000, 10000] | 1 | awesome | 0.0 |