[Data analysis] Data preprocessing - data discretization
data discretization
- data discretization
- Continuous data is often too fine-grained, making relationships between values hard to analyze
- Dividing it into discrete intervals reveals correlations in the data and makes it easier for algorithms to process
- Students' grades : 100-point scale scores are expressed using five-point scale discretization
- A (85 points or above), B, C, D, F (below 60 points)
- Human age : discretized into different age groups (cited from WHO)
- Minors: 0 to 17 years old;
- Youth: 18 to 45 years old;
- Middle-aged people: 46 to 69 years old;
- Elderly: over 70 years old.
- 365 days a year : discretized as 12 months or four seasons
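Discretizing the grade example above amounts to a threshold lookup. A minimal sketch; the B/C/D cut points are assumptions for illustration, only ">= 85 is A" and "< 60 is F" come from the text:

```python
import bisect

# Illustrative cut points for the five-point scale; only ">= 85 is A"
# and "< 60 is F" come from the text, the B/C/D edges are assumed.
BOUNDS = [60, 70, 78, 85]          # right-open interval edges
GRADES = ["F", "D", "C", "B", "A"]

def discretize_score(score):
    """Map a 100-point score to a five-point letter grade."""
    return GRADES[bisect.bisect_right(BOUNDS, score)]

print(discretize_score(92))  # A
print(discretize_score(59))  # F
```

The same lookup pattern works for the age-group and month/season examples: only the boundary list and the labels change.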
unsupervised discretization
- Binning
- Sort the data and divide it into bins of equal depth
- Smooth each bin by its mean, by its median, by its boundaries, etc.
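Equal-depth binning with bin-mean smoothing can be sketched as follows (the sample data is made up for illustration):

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and split them into bins of equal depth
    (the last bin absorbs any remainder)."""
    s = sorted(values)
    depth = len(s) // n_bins
    return [s[i * depth:(i + 1) * depth] if i < n_bins - 1 else s[i * depth:]
            for i in range(n_bins)]

def smooth_by_mean(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(data, 3)
print(bins)                  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_mean(bins))  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

Smoothing by median or by bin boundaries follows the same shape: only the value substituted inside each bin changes.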
- Clustering: detect and remove noisy data
- Cluster similar data into clusters
- Calculate a value for each cluster to discretize the data for that cluster
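Clustering-based discretization can be sketched with a tiny 1-D k-means: similar values are grouped into clusters, and each cluster's center serves as the discretized value. The data and initial centers below are illustrative assumptions:

```python
def kmeans_1d(values, centers, n_iter=20):
    """Tiny 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its cluster."""
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            clusters[i].append(v)
        new_centers = []
        for i, c in enumerate(clusters):
            new_centers.append(sum(c) / len(c) if c else centers[i])
        centers = new_centers
    return centers, clusters

data = [1, 2, 3, 10, 11, 12, 30, 31]
centers, clusters = kmeans_1d(data, [0.0, 15.0, 40.0])
print(centers)   # [2.0, 11.0, 30.5]  -- the discretized value per cluster
print(clusters)  # [[1, 2, 3], [10, 11, 12], [30, 31]]
```

Values far from every center (singleton clusters) are candidates for the noisy data mentioned above.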
supervised discretization
Supervised discretization—entropy-based discretization
- Entropy is used to measure the degree of uncertainty of the system
- Entropy was introduced into information theory by Claude Elwood Shannon, borrowed from thermodynamic entropy, so it is also called Shannon entropy
- Shannon proposed the concept of information entropy, laying the foundation for information theory and digital communication; he is known as the "father of information theory"
information entropy
- Information entropy: measure the degree of uncertainty of the system
- amount of information
- Let P(x) denote the probability of an event x
- Then the self-information of event x is -log P(x), with value range [0, +∞)
- information entropy
- On average, the amount of self-information we obtain when an event occurs
- That is, entropy is the expectation of self-information: H(X) = -Σ_x P(x) log P(x)
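These definitions translate directly into code. A minimal sketch using base-2 logarithms (so entropy is measured in bits):

```python
import math

def self_information(p):
    """Self-information of an event with probability p: -log2(p)."""
    return -math.log2(p)

def entropy(probs):
    """Entropy = expected self-information over the distribution."""
    return sum(p * self_information(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -- a fair coin flip carries one bit
print(entropy([1.0]))       # 0.0 -- a certain event carries no information
```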
Entropy and Data Discretization
- How does entropy relate to data discretization? - degree of uncertainty
- When the data points are in order (e.g., the letters of the word ENTROPY appear in sequence), the meaning is easy to read, the degree of certainty is high, and the corresponding information entropy is small.
- When the data points are completely scrambled, the meaning is hard to recover, uncertainty rises, and the corresponding information entropy increases.
- Goal: after discretization, the data within each interval should be more certain (higher "purity"), so entropy is used to guide the discretization.
Entropy-based discretization
- Divide the data on the x-axis
- Entropy—calculating uncertainty and impurity
- Assuming the data has been discretized, the entropy of an interval t after discretization is:
- Entropy(t) = -Σ_j p(j|t) · log2 p(j|t)
- where p(j|t) is the probability (relative frequency) of the jth class within interval t; the logarithm is usually base 2
- Calculate Entropy for a single interval
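The per-interval entropy above can be computed directly from the class counts inside the interval; a minimal sketch:

```python
import math

def interval_entropy(class_counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t) for one interval t,
    given the count of each class j inside the interval."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An interval holding 3 samples of one class and 1 of another:
print(round(interval_entropy([3, 1]), 3))  # 0.811 -- impure interval
print(interval_entropy([4, 0]) == 0.0)     # True  -- pure interval
```

A discretization is then scored by the weighted sum of the interval entropies; splits that lower this sum produce purer intervals.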