Basic statistics
Correlation analysis and hypothesis testing
Calculating the correlation of two columns of data is a common operation in statistics. In spark.ml, it provides the flexibility to calculate the correlation of multiple columns of data. Supported correlation coefficient calculation methods are
Pearson correlation coefficient and Spearman correlation coefficient.
cos(a,b)=a·b/(|a|*|b|)
Correlation
Use the Dataset composed of vectors to calculate the correlation matrix. The output is a DataFrame containing a correlation matrix of vector columns
import org.apache.spark.ml.linalg.{Matrix, Vectors} import org.apache.spark.ml.stat.Correlation import org.apache.spark.sql.Row val data = Seq( Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))), Vectors.dense(4.0, 5.0, 0.0, 3.0), Vectors.dense(6.0, 7.0, 0.0, 8.0), Vectors.sparse(4, Seq((0, 9.0), (3, 1.0))) ) val df = data.map(Tuple1.apply).toDF("features") val Row(coeff1: Matrix) = Correlation.corr(df, "features").head println("Pearson correlation matrix:\n" + coeff1.toString) val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head println("Spearman correlation matrix:\n" + coeff2.toString)
hypothetical test
Hypothesis test is used to detect whether the result is statistically significant, whether the event result is accidental, spark.ml currently provides the pearson chi-square test for calculating the independent test
ChiSquareTest
Pearson's independence test is generated for each feature and label. Each feature, (feature, label) combination is converted into a contingency matrix and then calculated by chi-square statistics.
All labels and features must be classified.
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.stat.ChiSquareTest val data = Seq( (0.0, Vectors.dense(0.5, 10.0)), (0.0, Vectors.dense(1.5, 20.0)), (1.0, Vectors.dense(1.5, 30.0)), (0.0, Vectors.dense(3.5, 30.0)), (0.0, Vectors.dense(3.5, 40.0)), (1.0, Vectors.dense(3.5, 40.0)) ) val df = data.toDF("label", "features") val chi = ChiSquareTest.test(df, "features", "label").head println("pValues = " + chi.getAs[Vector](0)) println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", "]")) println("statistics = " + chi.getAs[Vector](2))
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows:
1. Finding the mean when the variance is known is the Z test
2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean)
3. The mean variance is unknown. Finding the variance is the chi-square test
4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations.
When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/
The applications of each distribution are as follows: 1. The mean is Z test when the variance is known. 2. The mean of unknown variance is the t test (sample standard deviation s replaces the population standard deviation R, and the population mean is inferred from the sample mean) 3. The variance of the mean is unknown and finding the variance is the X ^ 2 test 4. When the mean variance of two normal distribution samples is unknown, it is F test to find the ratio of the variance of the two populations. Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing. Author ID at applysquare.com: I am a statistic Url:https://www.applysquare.com/topic-en/RwNU7JdnY/