Spark - Skewness and Kurtosis for Data Skew in Practice (via ChatGPT-4)

Table of contents

1. Introduction

2. Introduction to Skewness

3. Introduction to Kurtosis

4. Implementation of Skewness and Kurtosis

  1. Spark implementation

  2. Custom implementation

5. Skewness and Kurtosis Plotting

  1. Skewness Hist Plot

  2. Kurtosis Hist Plot

6. Summary


1. Introduction

In the previous hands-on post on Flink data skew, we looked at the hash algorithm behind Flink's keyBy and proposed a more general way to verify skewed data, one part of which is to apply the original hash algorithm to the keys and inspect the resulting partition distribution. With ChatGPT-4 being so popular recently, I also asked it how Spark can detect skewed data statistically, and got the following answer:

I have to say the suggestions it gave are quite comprehensive and reliable. The last one recommends the skewness operator for detecting data skew. Both skewness and kurtosis are aggregate (Agg) operators in Spark SQL, and since ChatGPT-4 recommends them, let's learn about them below.

2. Introduction to Skewness

Skewness is a measure of the direction and degree of asymmetry of a data distribution, also known as the skewness coefficient. It characterizes how asymmetric the probability density curve is relative to the mean; intuitively it shows up as the relative length of the tails of the density curve. Skewness is defined as

    S = k3 / k2^(3/2)

where k2 and k3 are the second- and third-order central moments, respectively.

The skewness of a normal distribution is 0, and its two tails are symmetric. Let S denote skewness: S < 0 indicates negative skew (a left-skewed distribution), S > 0 indicates positive skew (a right-skewed distribution), and S ≈ 0 means the distribution is roughly symmetric. Because the shapes of left- and right-skewed distributions often run against visual intuition, people frequently mix the two up. A reliable way to tell them apart is to look at the tails: whichever side has the longer tail is the side the distribution is skewed toward. For example, for a right-skewed (positively skewed) distribution with S > 0, the right tail is the longer one.

Tips:

Right skewed distribution - mean > median > mode

Left skewed distribution - mode > median > mean
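As a quick, made-up illustration of the tip above (not from the original post), a small right-skewed sample already shows the ordering mean > median > mode:

    // Toy right-skewed sample: most values are small, one large value stretches the right tail
    val xs = Seq(1.0, 2.0, 2.0, 3.0, 4.0, 12.0)
    val mean = xs.sum / xs.size                                       // 4.0
    val sorted = xs.sorted
    val median = (sorted(xs.size / 2 - 1) + sorted(xs.size / 2)) / 2  // 2.5
    val mode = xs.groupBy(identity).maxBy(_._2.size)._1               // 2.0
    // mean (4.0) > median (2.5) > mode (2.0), as expected for a right-skewed sample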

 

3. Introduction to Kurtosis

Originally GPT only mentioned using skewness to check data skew, but Spark SQL also provides a kurtosis operator. The two are computed very similarly, one from the third-order moment and one from the fourth-order moment, so we will use both later. (As an aside: in Chinese, 峰度 "kurtosis" sounds exactly like 风度 "demeanor", and my input method happily mixed them up, haha.) Kurtosis, also called the kurtosis coefficient, characterizes how peaked the probability density curve is around the mean. Kurtosis is defined as

    K = m4 / m2^2 - 3

where m4 is the fourth-order sample central moment and m2 is the second-order central moment, i.e., the sample variance. The 3 is subtracted so that the kurtosis of a normal distribution is 0.

If the kurtosis K is greater than 0, the distribution is called leptokurtic (sharp-peaked); if K is less than 0, it is called platykurtic (flat-peaked). As the figure shows, the larger K is, the sharper the peak around the mean.
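For intuition (a standard result, not derived in the original post), a uniform distribution on [a, b] is flatter than the normal distribution, and plugging its central moments into the definition above gives a negative excess kurtosis:

    m_2 = \frac{(b-a)^2}{12}, \qquad m_4 = \frac{(b-a)^4}{80}

    K = \frac{m_4}{m_2^{2}} - 3 = \frac{144}{80} - 3 = -1.2

This is also why the near-uniform data we simulate later ends up with K < 0.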

4. Implementation of Skewness and Kurtosis

1. Spark implementation

- Initialize SparkSession

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.{DataFrame, SparkSession, functions}
    import org.apache.spark.sql.functions._
    import scala.collection.mutable.ArrayBuffer

    val conf = (new SparkConf).setMaster("local[*]").setAppName("SparkSQLAgg")

    val spark = SparkSession
      .builder
      .config(conf)
      .getOrCreate()

    val sc = spark.sparkContext

- Build random data

Use scala.util.Random to construct the data. Values in 0-50 occur more often than values in 50-100 (half of the samples are drawn uniformly from 0-100, the other half uniformly from 0-50), so the 50-100 range forms a longer, thinner right tail. This is a right-skewed, i.e., positively skewed, distribution, so we expect S > 0.

    val dataBuffer = new ArrayBuffer[Double]()

    val random = scala.util.Random

    (0 to 100000).foreach(num => {
      if (num < 50000) {
        dataBuffer.append(100 * random.nextDouble())
      } else {
        dataBuffer.append(50 * random.nextDouble())
      }
    })

- Calculate skewness and kurtosis

Build a DataFrame from the simulated data and call the aggregate functions directly to get the statistics. The resulting skewness is greater than 0, which matches our expectation above; you can tweak the simulated data and watch how the skewness changes.

    import spark.implicits._

    val skewRDD = sc.parallelize(dataBuffer)

    val skewDF = skewRDD.toDF("key1")

    // "key1" is the column to compute skewness on; agg returns a DataFrame with a
    // single row containing the requested statistics for that column
    val skewnessValues = skewDF.agg(avg("key1"), stddev_pop("key1"), var_pop("key1"), skewness("key1"), kurtosis("key1"))

    skewnessValues.show()
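As a side note (not shown in the original post), skewness and kurtosis are ordinary Spark SQL aggregate functions, so the same statistics can also be obtained through the SQL interface:

    // Equivalent query via the SQL interface, reusing skewDF from above
    skewDF.createOrReplaceTempView("skew_data")
    spark.sql("SELECT skewness(key1) AS skew, kurtosis(key1) AS kurt FROM skew_data").show()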

 

2. Custom implementation

Based on spark.sql.functions and the definition formulas, we can also implement skewness and kurtosis ourselves:

- Skewness

  /*
    skewness_pop = E [((X - mu_pop) / stddev_pop) ^ 3]
      X: the random variable
      mu_pop: population mean
      stddev_pop: population standard deviation

    sqrt(n) * m3 / sqrt(m2 * m2 * m2)
    where if m refers to (X - mu), then m2 refers to (X - mu)^2 and m3 refers to (X - mu)^3.

    skewness_samp = SUM(i=1 to n) [(xi - mu_samp) ^ 3] / n
                    --------------------------------------
                           stddev_samp ^ 3

   */
  def calculateSkewness(df: DataFrame, column: String): Double = {

    val mean = calculateMean(df, column)
    val stdDev = calculateStdDev(df, column)

    val totalNum = df.count()

    val thirdCentralSampleMoment = df.agg(
      functions.sum(functions.pow(functions.column(column) - mean, 3) / totalNum)
    ).head.getDouble(0)

    val thirdPowerOfSampleStdDev = scala.math.pow(stdDev, 3)

    thirdCentralSampleMoment / thirdPowerOfSampleStdDev
  }

- Kurtosis

  /*
    kurtosis_pop = E [((X - mu_pop) / stddev_pop) ^ 4]
    excess_kurtosis_pop = kurtosis_pop - 3.0
      X: the random variable
      mu_pop: population_mean
      stddev_pop: population standard deviation

    n * m4 / (m2 * m2) - 3.0
    where if m refers to (X - mu), then m2 refers to (X - mu)^2 and m4 refers to (X - mu)^4.

    kurtosis_samp =   SUM(i=1 to n) [(xi - mu_samp) ^ 4] / n
                      --------------------------------------
                               stddev_samp ^ 4
    excess_kurtosis_samp = kurtosis_samp - 3.0

   */
  def calculateExcessKurtosis(df: DataFrame, column: String): Double = {

    val mean = calculateMean(df, column)
    val stdDev = calculateStdDev(df, column)

    val totalNum = df.count()

    val fourthCentralSampleMoment = df.agg(
      functions.sum(functions.pow(functions.column(column) - mean, 4) / totalNum)
    ).head.getDouble(0)

    val fourthPowerOfSampleStdDev = scala.math.pow(stdDev, 4)

    (fourthCentralSampleMoment / fourthPowerOfSampleStdDev) - 3
  }

 

- Helper functions

Careful readers may notice that the agg statistical functions come in xxx_pop and xxx_samp flavors. The _pop variants use the population formula (dividing by n), while the _samp variants use the sample formula with Bessel's correction (dividing by n - 1); both scan the full dataset. For large datasets the difference between the two is negligible, so choose whichever matches how you view your data; the quick check after the helper functions below makes the distinction concrete.

  def calculateMean(df: DataFrame, column: String): Double = {
    df.agg(avg(column)).head.getDouble(0) // mean
  }

  def calculateVar(df: DataFrame, column: String): Double = {
    df.agg(var_pop(column)).head.getDouble(0) // population variance
  }

  def calculateStdDev(df: DataFrame, column: String): Double = {
    df.agg(stddev_pop(column)).head.getDouble(0) // population standard deviation
  }
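To make the _pop / _samp distinction concrete, here is a quick check on a tiny, made-up DataFrame (not part of the original post):

    // var_pop divides by n, var_samp divides by n - 1 (Bessel's correction)
    // (assumes spark.implicits._ imported earlier is still in scope)
    val tinyDF = Seq(1.0, 2.0, 3.0, 4.0).toDF("v")
    tinyDF.agg(var_pop("v"), var_samp("v")).show()
    // var_pop(v)  = 1.25    (sum of squared deviations 5.0, divided by 4)
    // var_samp(v) = 1.6667  (5.0 divided by 3)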

- Calculation test

    val selfSkewness = calculateSkewness(skewDF, "key1")
    val selfKurtosis = calculateExcessKurtosis(skewDF, "key1")
    println(s"Skewness: ${selfSkewness} Kurtosis: ${selfKurtosis}")

The values computed by hand from the formulas are consistent with the results of the official API:
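If you prefer to verify this programmatically rather than by eye, a quick sanity check could look like the following (the tolerance is an arbitrary choice of mine):

    // Compare the hand-rolled values with Spark's built-in aggregates
    val sparkSkewness = skewDF.agg(skewness("key1")).head.getDouble(0)
    val sparkKurtosis = skewDF.agg(kurtosis("key1")).head.getDouble(0)
    assert(math.abs(selfSkewness - sparkSkewness) < 1e-6)
    assert(math.abs(selfKurtosis - sparkKurtosis) < 1e-6)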

 

5. Skewness and Kurtosis Plotting

For the simulated data above, the computed skewness S > 0 indicates a right-skewed distribution (long right tail), and the kurtosis K < 0 indicates a platykurtic distribution, i.e., a flatter peak than the normal distribution. Rather than exporting the data to Python for plotting, we can plot directly in Scala with breeze.plot._. Add the following dependency first (this artifact targets Scala 2.11):

        <!-- https://mvnrepository.com/artifact/org.scalanlp/breeze-viz -->
        <dependency>
            <groupId>org.scalanlp</groupId>
            <artifactId>breeze-viz_2.11</artifactId>
            <version>1.0-RC2</version>
        </dependency>
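If your project uses sbt instead of Maven, the equivalent dependency line (same artifact and version) would be:

        libraryDependencies += "org.scalanlp" %% "breeze-viz" % "1.0-RC2"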

1. Skewness Hist Plot

import breeze.plot._ 

  def plotHistogram(data: Array[Double]): Unit = {

    val f = Figure()
    val p = f.subplot(0)

    // add a histogram of the current data distribution
    p += hist(data, 100)

    p.xlabel = "X-Axis"
    p.ylabel = "Y-Axis"
    f.saveas("./histogram.png")
  }

Sampling the original data RDD and passing the result into plotHistogram gives the corresponding data distribution. The long right tail is clearly visible, consistent with S > 0, i.e., a right-skewed distribution.
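The exact sampling call is not shown in the original post; one possible sketch, using RDD.takeSample with an arbitrarily chosen sample size, is:

    // Sample up to 10,000 points from the RDD (sample size chosen arbitrarily) and plot them
    val sampled: Array[Double] = skewRDD.takeSample(withReplacement = false, num = 10000)
    plotHistogram(sampled)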

2. Kurtosis Hist Plot

  def plotHistogram(data: Array[Double]): Unit = {

    val f = Figure()
    val p = f.subplot(0)

    // add a histogram of the normalized data distribution
    p += hist(normalization(data), 100)

    // add a histogram of N(0, 1) samples for comparison
    val g = breeze.stats.distributions.Gaussian(0, 1)
    p += hist(g.sample(data.length),100)

    p.xlabel = "X-Axis"
    p.ylabel = "Y-Axis"
    f.saveas("./histogram.png")
  }

The data is normalized and then overlaid with samples drawn from an N(0, 1) normal distribution. The blue histogram is the normalized distribution of the original data; with K < 0, it is platykurtic, flatter than the normal curve.
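The normalization helper used above is not shown in the original snippet; a minimal sketch, assuming it simply standardizes the data to zero mean and unit variance, could be:

  // Assumed helper: standardize samples so they are comparable with N(0, 1)
  def normalization(data: Array[Double]): Array[Double] = {
    val mean = data.sum / data.length
    val variance = data.map(x => (x - mean) * (x - mean)).sum / data.length
    val stdDev = math.sqrt(variance)
    data.map(x => (x - mean) / stdDev)
  }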

 

6. Summary

Although I had hands-on experience with Spark data skew before, and learned the concepts of skewness and kurtosis back when studying mathematical statistics, it never occurred to me to apply these statistical measures to analyzing data skew in practice. It took a reminder from ChatGPT-4 to bring that knowledge back, and some of the implementation code above was also produced by ChatGPT-4. One has to marvel at how fast AI is developing; as programmers, we still need to keep improving ourselves and become the irreplaceable ones.

Origin: blog.csdn.net/BIT_666/article/details/129629932