Spark mlib official document learning and translation notes (2)

Basic statistics

Correlation analysis and hypothesis testing

Calculating the correlation of two columns of data is a common operation in statistics. In spark.ml, it provides the flexibility to calculate the correlation of multiple columns of data. Supported correlation coefficient calculation methods are

Pearson correlation coefficient and Spearman correlation coefficient.

The Pearson correlation coefficient formula is actually the cosine formula of the angle between the vectors:


cos(a,b)=a·b/(|a|*|b|)




CorrelationUse the Dataset composed of vectors to calculate the correlation matrix. The output is a DataFrame containing a correlation matrix of vector columns

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)

val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println("Spearman correlation matrix:\n" + coeff2.toString)

hypothetical test

Hypothesis test is used to detect whether the result is statistically significant, whether the event result is accidental, spark.ml currently provides the pearson chi-square test for calculating the independent test

ChiSquareTestPearson's independence test is generated for each feature and label. Each feature, (feature, label) combination is converted into a contingency matrix and then calculated by chi-square statistics.

All labels and features must be classified.


The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest

val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)

val df = data.toDF("label", "features")
val chi = ChiSquareTest.test(df, "features", "label").head
println("pValues = " + chi.getAs[Vector](0))
println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", "]"))
println("statistics = " + chi.getAs[Vector](2))

The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/

The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/

The applications of each distribution are as follows:

1. Finding the mean when the variance is known is the Z test

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The mean variance is unknown. Finding the variance is the chi-square test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.

When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/

The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:

1. The mean is Z test when the variance is known.

2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)

3. The variance of the mean is unknown and finding the variance is the X ^ 2 test

4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com: I am a statistic
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/

发布了30 篇原创文章 · 获赞 74 · 访问量 23万+

Guess you like

Origin blog.csdn.net/ruiyiin/article/details/77113581