Python data discretization

1. Overview

Discretization maps values from an infinite (continuous) space onto a finite set. It is mostly applied to continuous data; after processing, the value domain changes from a continuous attribute to a discrete one.

Although discretization usually targets continuous data, it can also be applied to data that is already discrete. This is typically the case when the existing categories are too complicated, too fine-grained, or inconsistent with business logic, so the data needs to be aggregated or repartitioned.

2. Discretization for time data

The discretization of time data is mainly used for aggregation and granularity conversion when time is the main feature. After discretization, fine-grained time values are converted into higher-level time features. In a dataset, time may appear as the ordering of the rows (a time series) or as a column (dimension) that records a characteristic of the data. Common discretization operations for time data fall into two categories:

1. Discretization of the time of day. The timestamp is generally converted to a second, minute, or hour, or to morning/afternoon (am/pm).

2. Discretization at daily granularity and above. The date is generally converted to a week number, day of the week, working day or weekend, month, quarter, year, and so on.
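Both categories can be sketched with the pandas `.dt` accessor. This is a minimal illustration, not part of the original code: the three timestamps are sample values, and the output column names (`hour`, `am_pm`, and so on) are chosen here for readability.

```python
import pandas as pd

# Sample timestamps (illustrative values only)
ts = pd.Series(pd.to_datetime([
    "2017-04-30 19:24:13",
    "2017-04-27 22:44:59",
    "2017-04-04 07:28:18",
]))

features = pd.DataFrame({
    # Time-of-day features
    "hour": ts.dt.hour,
    "am_pm": ts.dt.hour.map(lambda h: "am" if h < 12 else "pm"),
    # Features at daily granularity and above
    "weekday": ts.dt.weekday,   # Monday=0 ... Sunday=6
    "month": ts.dt.month,
    "quarter": ts.dt.quarter,
    "year": ts.dt.year,
})
print(features)
```

Each derived column has far fewer distinct values than the raw timestamp, which is what makes it usable as a grouping key or categorical feature.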

3. Discretization for multi-valued discrete data

Discretization of multi-valued discrete data applies when the data to be discretized is not numerical but categorical or ordinal. For example, a user income variable may originally be divided into 10 intervals; if a new modeling requirement needs only 4 intervals, the original 10 intervals have to be merged.

Multi-valued discrete data may also need to be re-discretized when the logic of the original division no longer fits. For example, a user activity variable was originally divided into three categories: high value, medium value, and low value. As the business develops, the new user activity variable is defined with four categories: high value, medium value, low value, and negative value. The data from the old and new categorizations then has to be discretized with uniform rules.
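The income example (10 intervals merged into 4) can be sketched as a dictionary lookup with `Series.map`. The band labels and the grouping rule below are hypothetical; in practice they would come from the business definition.

```python
import pandas as pd

# Hypothetical original income bands (10 intervals) observed for five users
income_band = pd.Series(["0-1k", "3k-4k", "9k-10k", "5k-6k", "1k-2k"])

# Hypothetical merge rule: 10 fine-grained bands -> 4 coarse bands
merge_rule = {
    "0-1k": "low", "1k-2k": "low", "2k-3k": "low",
    "3k-4k": "medium", "4k-5k": "medium", "5k-6k": "medium",
    "6k-7k": "high", "7k-8k": "high", "8k-9k": "high",
    "9k-10k": "top",
}

merged = income_band.map(merge_rule)
print(merged.tolist())

# Bands missing from the rule would map to NaN, so check for gaps explicitly
unmapped = merged.isna().sum()
print("unmapped rows:", unmapped)
```

Unlike an inner join against a mapping table, `map` keeps every row and its original order, so unmapped categories show up as missing values instead of silently disappearing.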

4. Discretization for continuous data

Discretization of continuous data is the main application of discretization. The results are class or attribute labels rather than numerical values. They fall into two categories: one divides continuous data into a set of specific intervals, such as {(0,10], (10,20], (20,50], (50,100]}; the other divides continuous data into specific classes, such as class 1, class 2, and class 3. Common methods for discretizing continuous data include:

Quantile method: discretize using quantiles such as quartiles, quintiles, or deciles. This method is simple to apply.

Distance interval method: discretize using equal-width intervals or custom intervals. This approach is flexible and can meet custom requirements. In addition, it (especially with equal-width intervals) preserves the shape of the original data distribution fairly well.

Frequency interval method: sort the data, then discretize into intervals of equal frequency or of a specified frequency. This transforms the data into an approximately uniform distribution. The advantage is that every interval holds the same number of observations; the disadvantage is that the shape of the original distribution is changed.

Clustering method: use a clustering algorithm, for example K-means, to divide the sample set into several discrete clusters.

Chi-square method: use a chi-square-based discretization method to find the best adjacent intervals to merge, and merge them into larger intervals.
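The chi-square method is the only one in this list that the practice section does not cover, so here is a minimal sketch in the style of the ChiMerge algorithm. It assumes a class label is available for each sample, which the text above does not state. The helper names `chi2_adjacent` and `chimerge`, the stopping rule (a maximum number of intervals), and the toy data are all illustrative choices.

```python
import numpy as np

def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic of the 2 x n_classes table formed by two intervals."""
    observed = np.vstack([counts_a, counts_b]).astype(float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()
    nonzero = expected > 0  # skip classes absent from both intervals
    diff2 = (observed - expected) ** 2
    return float((diff2[nonzero] / expected[nonzero]).sum())

def chimerge(values, classes, max_intervals):
    """Return the lower bounds of the merged intervals (hypothetical helper)."""
    values = np.asarray(values)
    classes = np.asarray(classes)
    class_ids = np.unique(classes)
    # Start with one interval per distinct value, in ascending order
    lower_bounds = list(np.unique(values))
    counts = [
        np.array([np.sum((values == v) & (classes == c)) for c in class_ids])
        for v in lower_bounds
    ]
    # Repeatedly merge the adjacent pair whose class distributions are most alike
    while len(lower_bounds) > max(max_intervals, 1):
        chi2 = [chi2_adjacent(counts[i], counts[i + 1])
                for i in range(len(counts) - 1)]
        i = int(np.argmin(chi2))
        counts[i] = counts[i] + counts[i + 1]
        del counts[i + 1], lower_bounds[i + 1]
    return [float(b) for b in lower_bounds]

# Toy data: small amounts belong to class 0, large amounts to class 1
amount = [1, 2, 3, 4, 10, 11, 12, 13]
label = [0, 0, 0, 0, 1, 1, 1, 1]
print(chimerge(amount, label, max_intervals=2))  # [1.0, 10.0]
```

A low chi-square value means the two neighbouring intervals have similar class distributions, so merging them loses little information. Implementations often stop on a chi-square significance threshold rather than a fixed interval count.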

5. Binarization for continuous data

In many scenarios we need to binarize a feature: compare each data point with a threshold, set it to one fixed value (such as 1) if it is greater than the threshold, and to another fixed value (such as 0) otherwise. The result is a binarized dataset with only two values. Binarization presupposes that all attribute values in the dataset represent the same or similar meanings.
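The thresholding rule can be sketched in a few lines of NumPy. The income values are sample data, and using the mean as the threshold is just one possible choice.

```python
import numpy as np

# Sample income values (illustrative)
income = np.array([10.40, 4.68, 3.84, 3.70, 4.25])

# Assumed threshold: the sample mean (5.374 for these values)
threshold = income.mean()

# 1 where the value is strictly greater than the threshold, 0 otherwise
binary = np.where(income > threshold, 1, 0)
print(binary)
```

Values exactly equal to the threshold fall on the 0 side here, so the choice between `>` and `>=` should be made deliberately.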

6. Code practice: data discretization in Python

import pandas as pd
from sklearn.cluster import KMeans
from sklearn import preprocessing
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df=pd.read_table(r'F:\小橙书\chapter3\data7.txt',names=['id','amount','income','datetime','age'])
df.head(5)

	id	amount	income	datetime	age
0	15093	1390	10.40	2017-04-30 19:24:13	0-10
1	15062	4024	4.68	2017-04-27 22:44:59	70-80
2	15028	6359	3.84	2017-04-27 10:07:55	40-50
3	15012	7759	3.70	2017-04-04 07:28:18	30-40
4	15021	331	4.25	2017-04-08 11:14:00	70-80

6.1 Discretization of temporal data

# Parse the timestamp strings, then replace them with the day of the week (Monday=0 ... Sunday=6)
df['datetime']=pd.to_datetime(df['datetime']).dt.weekday
df.head(5)

	id	amount	income	datetime	age
0	15093	1390	10.40	6	0-10
1	15062	4024	4.68	3	70-80
2	15028	6359	3.84	3	40-50
3	15012	7759	3.70	1	30-40
4	15021	331	4.25	5	70-80

6.2 Discretization of multi-valued data

map_df=pd.DataFrame([['0-10','0-40']
                    ,['10-20','0-40']
                    ,['20-30','0-40']
                    ,['30-40','0-40']
                    ,['40-50','40-80']
                    ,['50-60','40-80']
                    ,['60-70','40-80']
                    ,['70-80','40-80']
                    ,['80-90','>80']
                    ,['>90','>80']],columns=['age','age2'])
# Inner join on the old age band; rows whose band is missing from map_df are dropped
df_tmp=df.merge(map_df,on='age',how='inner')
df_tmp

	id	amount	income	datetime	age	age2
0	15093	1390	10.40	6	0-10	0-40
1	15064	7952	4.40	0	0-10	0-40
2	15080	503	5.72	5	0-10	0-40
3	15068	1668	3.19	5	0-10	0-40
4	15019	6710	3.20	0	0-10	0-40
...	...	...	...	...	...	...
95	15098	2014	3.03	6	60-70	40-80
96	15046	6215	5.09	2	80-90	>80
97	15095	5294	3.74	1	80-90	>80
98	15074	5381	3.28	6	80-90	>80
99	15074	4834	3.92	2	80-90	>80

100 rows × 6 columns

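# Drop the original age band; the positional axis argument drop('age',1) no longer works in pandas 2.x
df=df_tmp.drop(columns='age')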
df.head(5)

	id	amount	income	datetime	age2
0	15093	1390	10.40	6	0-40
1	15064	7952	4.40	0	0-40
2	15080	503	5.72	5	0-40
3	15068	1668	3.19	5	0-40
4	15019	6710	3.20	0	0-40

6.3 Discretization of continuous data

Method 1: Discretization using custom bin intervals

bins=[0,200,1000,5000,10000]
df['amount1']=pd.cut(df['amount'],bins)
df.head(5)

	id	amount	income	datetime	age2	amount1
0	15093	1390	10.40	6	0-40	(1000, 5000]
1	15064	7952	4.40	0	0-40	(5000, 10000]
2	15080	503	5.72	5	0-40	(200, 1000]
3	15068	1668	3.19	5	0-40	(1000, 5000]
4	15019	6710	3.20	0	0-40	(5000, 10000]
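One detail of `pd.cut` is worth checking on real data: by default the intervals are open on the left and closed on the right, and values outside the bins become NaN. The sketch below uses the same bin edges with a few made-up amounts chosen to hit the boundaries.

```python
import pandas as pd

bins = [0, 200, 1000, 5000, 10000]
# Made-up amounts chosen to hit the edges of the bins
amount = pd.Series([0, 150, 5000, 12000])

# Default: intervals are (left, right], so 0 and 12000 fall outside every bin
default = pd.cut(amount, bins)
print(default.isna().tolist())

# include_lowest=True closes the first interval on the left, so 0 is kept
closed_low = pd.cut(amount, bins, include_lowest=True)
print(closed_low.isna().tolist())
```

If the data can exceed the last edge, extending the bins (for example with `float('inf')` as the final edge) avoids silently producing missing values.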

Method 2: Discretization using clustering

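# KMeans expects a 2-D array, so reshape the column to (n_samples, 1)
data=df.loc[:,'amount']
data_reshape=data.values.reshape(-1,1)
model_kmeans=KMeans(n_clusters=4,random_state=0)
# The cluster labels (0-3) are arbitrary identifiers, not ordered by amount
kmeans_result=model_kmeans.fit_predict(data_reshape)
df['amount2']=kmeans_result
df.head(5)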

	id	amount	income	datetime	age2	amount1	amount2
0	15093	1390	10.40	6	0-40	(1000, 5000]	2
1	15064	7952	4.40	0	0-40	(5000, 10000]	1
2	15080	503	5.72	5	0-40	(200, 1000]	2
3	15068	1668	3.19	5	0-40	(1000, 5000]	2
4	15019	6710	3.20	0	0-40	(5000, 10000]	1
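The cluster ids in `amount2` carry no order: in the output above, cluster 2 holds small amounts and cluster 1 holds large ones. If ordinal codes are wanted, one option is to renumber the clusters by their centers. The sketch below is a possible post-processing step, not part of the original code; the sample amounts, `n_clusters=3`, and `n_init=10` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# A handful of illustrative amounts, sorted for readability
amount = np.array([331, 503, 1390, 1668, 4024, 6359, 6710, 7759, 7952]).reshape(-1, 1)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
raw_labels = model.fit_predict(amount)

# Rank the clusters by their center: rank[c] is the ordinal code of raw cluster c
order = np.argsort(model.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

ordered_labels = rank[raw_labels]
print(ordered_labels)
```

After renumbering, code 0 always denotes the cluster with the smallest center, so the codes can be compared across runs with different random seeds.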

Method 3: Discretization using quartiles

labels=['bad','medium','good','awesome']
df['amount3']=pd.qcut(df['amount'],4,labels=labels)
df.head(5)

	id	amount	income	datetime	age2	amount1	amount2	amount3
0	15093	1390	10.40	6	0-40	(1000, 5000]	2	bad
1	15064	7952	4.40	0	0-40	(5000, 10000]	1	awesome
2	15080	503	5.72	5	0-40	(200, 1000]	2	bad
3	15068	1668	3.19	5	0-40	(1000, 5000]	2	bad
4	15019	6710	3.20	0	0-40	(5000, 10000]	1	awesome
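To see where the quartile boundaries actually fall, `pd.qcut` can return the bin edges through `retbins=True`. This small sketch uses eight sample amounts rather than the full dataset.

```python
import pandas as pd

# Eight illustrative amounts
amount = pd.Series([331, 503, 1390, 1668, 4024, 6359, 6710, 7759])
labels = ['bad', 'medium', 'good', 'awesome']

# retbins=True also returns the computed bin edges (5 edges for 4 bins)
binned, edges = pd.qcut(amount, 4, labels=labels, retbins=True)
print(edges)
print(binned.value_counts(sort=False))
```

Because the bins are equal-frequency, each label receives the same number of observations here (two per bin), while the interval widths differ.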

6.4 Binarization of continuous data

# Use the mean income as the threshold: values above it become 1, the rest 0
binarizer_scaler=preprocessing.Binarizer(threshold=df['income'].mean())
df_income=df.loc[:,'income'].values.reshape(-1,1)
income_tmp=binarizer_scaler.fit_transform(df_income)
# Flatten the (n_samples, 1) result to one dimension before assigning it as a column
df['income1']=income_tmp.ravel()
df.head(5)

	id	amount	income	datetime	age2	amount1	amount2	amount3	income1
0	15093	1390	10.40	6	0-40	(1000, 5000]	2	bad	1.0
1	15064	7952	4.40	0	0-40	(5000, 10000]	1	awesome	1.0
2	15080	503	5.72	5	0-40	(200, 1000]	2	bad	1.0
3	15068	1668	3.19	5	0-40	(1000, 5000]	2	bad	0.0
4	15019	6710	3.20	0	0-40	(5000, 10000]	1	awesome	0.0