Python data discretization

1. Overview

        Discretization maps values from an infinite (continuous) space onto a finite set of values. It is applied mostly to continuous data; after processing, the value domain of the attribute changes from continuous to discrete.

        Although discretization usually targets continuous data, it is often applied to data that is already discrete. This happens when the existing categories are too complicated, too fine-grained, or inconsistent with the business logic, so the data must be aggregated or repartitioned.

2. Discretization for time data

        Discretizing time data is mainly used to aggregate data and convert its granularity when time is the main feature. After discretization, scattered time values are converted into higher-level time features. In a dataset, time may serve as the row index (a time series) or be stored as a column (dimension) describing each record. Common discretization operations for time data fall into two categories:

1. Discretization within a day. The timestamp is generally converted to a second, minute, or hour, or to morning/afternoon (am/pm).

2. Discretization at daily granularity or coarser. The date is generally converted to a week number, day of the week, month, workday or day off, quarter, year, etc.
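
Both categories can be sketched with the pandas `.dt` accessor. This is a minimal illustration rather than the only way to do it; the three timestamps and the column names (`hour`, `am_pm`, `is_weekend`, and so on) are assumptions chosen for the example.

```python
import pandas as pd

# A few sample timestamps, used for illustration only.
ts = pd.Series(pd.to_datetime([
    "2017-04-30 19:24:13",
    "2017-04-27 22:44:59",
    "2017-04-04 07:28:18",
]))

features = pd.DataFrame({
    # Within-day granularity
    "hour": ts.dt.hour,
    "am_pm": ts.dt.hour.map(lambda h: "am" if h < 12 else "pm"),
    # Daily granularity or coarser
    "weekday": ts.dt.weekday,            # Monday=0 ... Sunday=6
    "is_weekend": ts.dt.weekday >= 5,    # Saturday or Sunday
    "month": ts.dt.month,
    "quarter": ts.dt.quarter,
})
print(features)
```

Whether a date counts as a workday usually also depends on a holiday calendar, which this sketch does not model.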

3. Discretization for multi-valued discrete data 

        Discretization of multi-valued discrete data means that the data to be discretized is not numerical but categorical or ordinal. For example, the user income variable may originally be divided into 10 intervals; if a new modeling requirement needs only 4 intervals, the original 10 intervals must be merged.

        Multi-valued discrete data may also need to be re-discretized when the logic of the original division no longer holds. For example, the user activity variable was originally divided into three categories: high value, medium value, and low value. To meet new business needs, it is redefined with four categories: high value, medium value, low value, and negative value. In this case, the data in the different categories must be re-divided according to a single, uniform set of rules.
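
Merging categories can be sketched as a dictionary lookup with `Series.map`. The level names (`L1` to `L10`), the four target groups, and the merge rule itself are hypothetical and only illustrate the 10-to-4 income example.

```python
import pandas as pd

# Hypothetical income levels (10 original categories), invented for illustration.
income_level = pd.Series(["L1", "L4", "L7", "L10", "L3"])

# Hypothetical merge rule: 10 original levels -> 4 coarser groups.
merge_rule = {
    "L1": "low", "L2": "low", "L3": "low",
    "L4": "lower-middle", "L5": "lower-middle",
    "L6": "upper-middle", "L7": "upper-middle", "L8": "upper-middle",
    "L9": "high", "L10": "high",
}

income_group = income_level.map(merge_rule)

# Categories missing from the rule would become NaN, so check for them explicitly.
assert income_group.notna().all()
print(income_group.tolist())
```

A dictionary keeps the rule in one place, which helps when the same uniform rule has to be applied to several datasets.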

4. Discretization for continuous data 

        Discretizing continuous data is the main application of discretization. The results of these methods are class or attribute labels rather than numerical values. They fall into two categories: one divides continuous data into a set of specific intervals, such as {(0,10], (10,20], (20,50], (50,100]}; the other divides continuous data into specific classes, such as class 1, class 2, and class 3. Common methods for discretizing continuous data include:

Quantile method: discretize using quantiles such as quartiles, quintiles, or deciles. This method is simple to apply.

Distance interval method: discretize using equal-width intervals or custom intervals. This approach is flexible and can meet custom requirements. In addition, it (especially with equal-width intervals) preserves the shape of the original data distribution well.

Frequency interval method: sort the data according to its frequency distribution, then discretize by equal frequency or by specified frequencies. This method transforms the data into a uniform distribution. The advantage is that every interval contains the same number of observations; the disadvantage is that the shape of the original distribution is changed.

Clustering method: use a clustering algorithm such as K-means to divide the sample set into several discrete clusters.

Chi-square method: use a chi-square-based criterion to find the adjacent intervals that are best merged, and merge them into larger intervals.
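
The difference between the distance interval and frequency interval methods is easiest to see on skewed data. The sketch below uses `pd.cut` for equal-width bins and `pd.qcut` for equal-frequency bins; the sample values and the bin counts (4 and 5) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical skewed sample, invented for illustration.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100])

# Equal-width: 4 intervals of (roughly) the same length over the value range.
equal_width = pd.cut(values, bins=4)

# Equal-frequency: 5 intervals that each hold the same number of observations.
equal_freq = pd.qcut(values, q=5)

print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```

With equal-width bins most observations fall into the first interval, which mirrors the skew of the original data; with equal-frequency bins every interval holds two observations, so the skew is no longer visible.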

5. Binarization for continuous data

        In many scenarios, we need to binarize a feature: compare each data point with a threshold, set it to one fixed value (such as 1) if it is greater than the threshold and to another fixed value (such as 0) otherwise, and thereby obtain a binarized dataset with only two possible values. Binarization presupposes that all attribute values in the dataset represent the same or similar meanings.
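
The threshold comparison can be sketched in a few lines of NumPy. The income values are invented for illustration, and using the mean as the threshold is just one possible choice.

```python
import numpy as np

# Hypothetical income values, invented for illustration.
income = np.array([10.40, 4.68, 3.84, 3.70, 4.25])
threshold = income.mean()

# 1 where the value is strictly greater than the threshold, otherwise 0.
binary = (income > threshold).astype(int)
print(threshold, binary)
```

In this sketch a value exactly equal to the threshold is mapped to 0; whether the boundary belongs to the upper or lower class is a design choice that should be stated explicitly.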

6. Code practice: Python data discretization processing

import pandas as pd
from sklearn.cluster import KMeans
from sklearn import preprocessing

# Raw string so that the backslashes in the Windows path are not treated as escapes
df = pd.read_table(r'F:\小橙书\chapter3\data7.txt', names=['id', 'amount', 'income', 'datetime', 'age'])
df.head(5)
      id  amount  income             datetime    age
0  15093    1390   10.40  2017-04-30 19:24:13   0-10
1  15062    4024    4.68  2017-04-27 22:44:59  70-80
2  15028    6359    3.84  2017-04-27 10:07:55  40-50
3  15012    7759    3.70  2017-04-04 07:28:18  30-40
4  15021     331    4.25  2017-04-08 11:14:00  70-80

6.1 Discretization of temporal data 

# Convert the timestamp column to the day of the week (Monday=0 ... Sunday=6)
df['datetime'] = pd.to_datetime(df['datetime']).dt.weekday
df.head(5)
      id  amount  income  datetime    age
0  15093    1390   10.40         6   0-10
1  15062    4024    4.68         3  70-80
2  15028    6359    3.84         3  40-50
3  15012    7759    3.70         1  30-40
4  15021     331    4.25         5  70-80

6.2 Discretization of Multivalued Data

# Mapping table from the original 10 age groups to 3 coarser groups
map_df = pd.DataFrame([['0-10', '0-40'],
                       ['10-20', '0-40'],
                       ['20-30', '0-40'],
                       ['30-40', '0-40'],
                       ['40-50', '40-80'],
                       ['50-60', '40-80'],
                       ['60-70', '40-80'],
                       ['70-80', '40-80'],
                       ['80-90', '>80'],
                       ['>90', '>80']], columns=['age', 'age2'])
# Attach the new group to each record by joining on the original age group
df_tmp = df.merge(map_df, on='age', how='inner')
df_tmp
       id  amount  income  datetime    age   age2
0   15093    1390   10.40         6   0-10   0-40
1   15064    7952    4.40         0   0-10   0-40
2   15080     503    5.72         5   0-10   0-40
3   15068    1668    3.19         5   0-10   0-40
4   15019    6710    3.20         0   0-10   0-40
..    ...     ...     ...       ...    ...    ...
95  15098    2014    3.03         6  60-70  40-80
96  15046    6215    5.09         2  80-90    >80
97  15095    5294    3.74         1  80-90    >80
98  15074    5381    3.28         6  80-90    >80
99  15074    4834    3.92         2  80-90    >80

100 rows × 6 columns

# Drop the original age column and keep only the merged groups
df = df_tmp.drop(columns='age')
df.head(5)
      id  amount  income  datetime  age2
0  15093    1390   10.40         6  0-40
1  15064    7952    4.40         0  0-40
2  15080     503    5.72         5  0-40
3  15068    1668    3.19         5  0-40
4  15019    6710    3.20         0  0-40

6.3 Discretization of continuous data

Method 1: Customize the binning interval to achieve discretization

# Custom bin edges; each interval is open on the left and closed on the right
bins = [0, 200, 1000, 5000, 10000]
df['amount1'] = pd.cut(df['amount'], bins)
df.head(5)
      id  amount  income  datetime  age2        amount1
0  15093    1390   10.40         6  0-40   (1000, 5000]
1  15064    7952    4.40         0  0-40  (5000, 10000]
2  15080     503    5.72         5  0-40    (200, 1000]
3  15068    1668    3.19         5  0-40   (1000, 5000]
4  15019    6710    3.20         0  0-40  (5000, 10000]

Method 2: Discretization using clustering

# KMeans expects a 2-D array, so reshape the column to (n_samples, 1)
data = df.loc[:, 'amount']
data_reshape = data.values.reshape(-1, 1)
model_kmeans = KMeans(n_clusters=4, random_state=0)
kmeans_result = model_kmeans.fit_predict(data_reshape)
# The cluster labels are arbitrary identifiers, not an ordering of the amounts
df['amount2'] = kmeans_result
df.head(5)
      id  amount  income  datetime  age2        amount1  amount2
0  15093    1390   10.40         6  0-40   (1000, 5000]        2
1  15064    7952    4.40         0  0-40  (5000, 10000]        1
2  15080     503    5.72         5  0-40    (200, 1000]        2
3  15068    1668    3.19         5  0-40   (1000, 5000]        2
4  15019    6710    3.20         0  0-40  (5000, 10000]        1
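
Because K-means labels are arbitrary, a label of 2 does not mean "larger than cluster 1". If ordered bins are wanted, one option is to rank the clusters by their centers. The sketch below is an assumption-laden illustration, not part of the original workflow: the amounts are invented, and `n_clusters=3` and `n_init=10` are chosen only for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical amounts forming three well-separated groups, invented for illustration.
amount = np.array([100, 120, 150, 2000, 2100, 2200, 9000, 9100]).reshape(-1, 1)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
raw_labels = model.fit_predict(amount)

# Sort the clusters by center value, then build a lookup from raw label to rank.
order = np.argsort(model.cluster_centers_.ravel())
rank_of = np.empty_like(order)
rank_of[order] = np.arange(len(order))

# After relabeling, 0 is the lowest-amount cluster and 2 the highest.
ordered_labels = rank_of[raw_labels]
print(ordered_labels)
```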

Method 3: Discretization using quartiles

# Split into quartiles and name them from lowest to highest
labels = ['bad', 'medium', 'good', 'awesome']
df['amount3'] = pd.qcut(df['amount'], 4, labels=labels)
df.head(5)
      id  amount  income  datetime  age2        amount1  amount2  amount3
0  15093    1390   10.40         6  0-40   (1000, 5000]        2      bad
1  15064    7952    4.40         0  0-40  (5000, 10000]        1  awesome
2  15080     503    5.72         5  0-40    (200, 1000]        2      bad
3  15068    1668    3.19         5  0-40   (1000, 5000]        2      bad
4  15019    6710    3.20         0  0-40  (5000, 10000]        1  awesome

6.4 Binarization of continuous data

# Use the mean income as the threshold: values above it become 1, the rest 0
binarizer_scaler = preprocessing.Binarizer(threshold=df['income'].mean())
# Binarizer expects a 2-D array, so reshape the column to (n_samples, 1)
df_income = df.loc[:, 'income'].values.reshape(-1, 1)
income_tmp = binarizer_scaler.fit_transform(df_income)
# Flatten back to 1-D before assigning it as a new column
df['income1'] = income_tmp.ravel()
df.head(5)
      id  amount  income  datetime  age2        amount1  amount2  amount3  income1
0  15093    1390   10.40         6  0-40   (1000, 5000]        2      bad      1.0
1  15064    7952    4.40         0  0-40  (5000, 10000]        1  awesome      1.0
2  15080     503    5.72         5  0-40    (200, 1000]        2      bad      1.0
3  15068    1668    3.19         5  0-40   (1000, 5000]        2      bad      0.0
4  15019    6710    3.20         0  0-40  (5000, 10000]        1  awesome      0.0

Origin blog.csdn.net/weixin_60200880/article/details/127397124