[Data analysis] Data preprocessing: data discretization and information entropy

Data discretization

  • Data discretization
    • Continuous data is too fine-grained, which makes relationships between data points hard to analyze
    • Dividing values into discrete intervals reveals correlations in the data and makes it easier for algorithms to process
      • Student grades: 100-point scores can be discretized into a five-level scale
        • A (85 points or more), B, C, D, F (less than 60 points)
      • Human age: discretized into age groups (per the WHO)
        • Minors: 0 to 17 years old
        • Youth: 18 to 45 years old
        • Middle-aged: 46 to 69 years old
        • Elderly: 70 years old and over
      • 365 days a year: discretized into 12 months or four seasons
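The grade example can be sketched with `pandas.cut`, which maps continuous values into labeled intervals. This is a minimal sketch: the text only gives the A (≥85) and F (<60) cutoffs, so the edges for B, C, and D below are assumptions for illustration.

```python
import pandas as pd

# 100-point scores to be discretized into a five-level scale
scores = pd.Series([92, 78, 61, 85, 45, 70])

# Left-closed interval edges; only the 60 and 85 cutoffs come from the
# text -- the 70 and 80 edges for D/C/B are assumed for the example
bins = [0, 60, 70, 80, 85, 100]
labels = ["F", "D", "C", "B", "A"]

# right=False makes intervals [0, 60), [60, 70), ..., [85, 100)
grades = pd.cut(scores, bins=bins, labels=labels, right=False)
print(grades.tolist())
```

The same call works for the age-group example: swap in the WHO edges `[0, 18, 46, 70, 120]` with labels for minors, youth, middle-aged, and elderly.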

Unsupervised discretization

  • Binning
    1. Sort the data and divide it into equal-depth (equal-frequency) bins
    2. Smooth by bin mean, bin median, or bin boundaries, etc.
  • Clustering: detect and remove noisy data
    • Cluster similar data points into clusters
    • Compute a representative value for each cluster and use it to discretize that cluster's data
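The binning steps above can be sketched with NumPy: sort, split into equal-depth bins, then replace each value with its bin mean. The sample values are illustrative, not from the text.

```python
import numpy as np

# Sorted data split into 3 equal-depth bins (3 values per bin),
# then smoothed by replacing each value with its bin's mean
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.array_split(data, 3)

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```

Smoothing by bin median or bin boundary works the same way: replace `b.mean()` with `np.median(b)`, or snap each value to whichever of `b[0]` / `b[-1]` is closer.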

Supervised discretization

Supervised discretization: entropy-based discretization

  • Entropy measures the degree of uncertainty of a system
    • Claude Elwood Shannon carried the concept of entropy from thermodynamics into information theory, so it is also called Shannon entropy
    • By proposing information entropy, Shannon laid the foundation for information theory and digital communication, and is known as the "father of information theory"

Information entropy

  • Information entropy: measures the degree of uncertainty of a system
    • Amount of information (self-information)
      • Let the probability of an event x be P(x)
      • Then the self-information of event x is I(x) = -log P(x), with range [0, +∞)
  • Information entropy
    • On average, the amount of self-information we obtain when an event occurs
    • That is, entropy is the expectation of self-information: H(X) = E[I(x)] = -Σ_x P(x) log P(x)
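The definition above translates directly into code: entropy is the probability-weighted sum of self-information terms. A minimal sketch using only the standard library, with log base 2 so the result is in bits:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits.

    Terms with p == 0 are skipped, following the convention 0 * log 0 = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has maximum uncertainty for two outcomes: 1 bit
print(entropy([0.5, 0.5]))
# A certain event carries no uncertainty: 0 bits
print(entropy([1.0]))
```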

Entropy and Data Discretization

  • How does entropy relate to data discretization? Through the degree of uncertainty
    • When the data points spell out a complete word (e.g., ENTROPY), its meaning is easy to understand: the degree of certainty is high, and the corresponding information entropy is small
    • When the data points are completely scrambled, the meaning is hard to understand: uncertainty is high, and the corresponding information entropy is large
    • Goal: after discretization, the data in each interval should have higher certainty (also called "purity"), which is why entropy is used to guide discretization

Entropy-based discretization

  • Split the data along the x-axis
  • Entropy: measuring the uncertainty (impurity) of an interval
    • Assuming the data has been discretized, compute the entropy of an interval t after discretization:
      • Entropy(t) = -Σ_j p(j | t) log2 p(j | t)
      • where p(j | t) is the probability of the jth class within interval t; the logarithm is usually taken base 2
    • For a single interval, entropy is 0 when all points belong to one class (pure) and is maximal when the classes are evenly mixed
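One step of entropy-based discretization can be sketched as follows: try every cut point on the x-axis and keep the one that minimizes the weighted entropy of the two resulting intervals. The data and the single-split scope are assumptions for illustration; full methods (e.g., MDLP) apply this recursively with a stopping criterion.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(xs, ys):
    """Pick the cut point on x minimizing the size-weighted entropy of the
    two resulting intervals (one step of entropy-based discretization)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_w, best_cut = float("inf"), None
    for i in range(1, n):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        w = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if w < best_w:
            # Place the cut midway between the neighboring x values
            best_w, best_cut = w, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut

# Class "a" clusters at small x, class "b" at large x; the best cut
# separates them into two pure (zero-entropy) intervals
xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))
```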

Origin blog.csdn.net/weixin_56462041/article/details/129706665