Spark MLlib official documentation: study and translation notes (2)

Basic statistics

Correlation and hypothesis testing

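Computing the correlation between two series of data is a common operation in statistics. spark.ml provides the flexibility to compute pairwise correlations among many series. The correlation methods currently supported are Pearson's correlation and Spearman's correlation.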

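Pearson's correlation coefficient is essentially the cosine of the angle between two vectors, taken after each vector has had its mean subtracted:


cos(a, b) = a·b / (|a| * |b|)

where a_i = x_i - mean(x) and b_i = y_i - mean(y).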
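To see the cosine view concretely, here is a minimal plain-Scala sketch (no Spark required). The object and helper names (PearsonAsCosine, center, cosine, pearson) are made up for illustration and are not part of spark.ml. It shows that Pearson's coefficient is the cosine of the two vectors after their means are subtracted, so shifting one vector by a constant changes the raw cosine but not the correlation.

```scala
// Illustrative sketch only: Pearson correlation as the cosine of the angle
// between mean-centered vectors. All names here are hypothetical helpers.
object PearsonAsCosine {
  def dot(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Seq[Double]): Double = math.sqrt(dot(a, a))

  // cos(a, b) = a·b / (|a| * |b|)
  def cosine(a: Seq[Double], b: Seq[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  // Subtract the mean from every element.
  def center(a: Seq[Double]): Seq[Double] = {
    val mean = a.sum / a.size
    a.map(_ - mean)
  }

  // Pearson correlation = cosine of the centered vectors.
  def pearson(x: Seq[Double], y: Seq[Double]): Double =
    cosine(center(x), center(y))

  def main(args: Array[String]): Unit = {
    val x = Seq(1.0, 2.0, 3.0, 4.0)
    val y = x.map(_ + 10.0) // same shape as x, shifted by a constant

    println("raw cosine = " + cosine(x, y))  // noticeably below 1.0 (about 0.946)
    println("pearson    = " + pearson(x, y)) // approximately 1.0
  }
}
```

The raw cosine is sensitive to the offset of the data, while the centered version is not, which is why the centering step matters.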




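Correlation computes the correlation matrix for an input Dataset of Vectors using the specified method. The output is a DataFrame that contains the correlation matrix of the vector column.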

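import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
// Needed for toDF; assumes a SparkSession named `spark` (predefined in spark-shell).
import spark.implicits._

// Four observations with four features each, mixing sparse and dense vectors.
val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

// Wrap each vector in a Tuple1 so the Seq becomes a one-column DataFrame.
val df = data.map(Tuple1.apply).toDF("features")
// Pearson is the default method.
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)

// Pass "spearman" to use Spearman's rank correlation instead.
val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println("Spearman correlation matrix:\n" + coeff2.toString)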
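The notes above do not spell out what Spearman's coefficient is, so here is a small plain-Scala sketch under the usual textbook definition: Pearson correlation applied to the ranks of the data, with tied values sharing their average rank. This is an assumption-laden illustration, not spark.ml's implementation, and SpearmanSketch, ranks, pearson and spearman are hypothetical names. The sample data are columns 0 and 3 of the four example vectors.

```scala
// Illustrative sketch only: Spearman correlation as Pearson correlation on ranks.
object SpearmanSketch {
  // 1-based average ranks; tied values share the mean of their positions.
  def ranks(xs: Seq[Double]): Seq[Double] = {
    val sorted = xs.zipWithIndex.sortBy(_._1)
    val out = Array.ofDim[Double](xs.size)
    var i = 0
    while (i < sorted.size) {
      var j = i
      // Extend j to the end of the run of values equal to sorted(i).
      while (j + 1 < sorted.size && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avg = (i + j) / 2.0 + 1.0 // mean of the 1-based positions i+1 .. j+1
      (i to j).foreach(k => out(sorted(k)._2) = avg)
      i = j + 1
    }
    out.toSeq
  }

  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    val col0 = Seq(1.0, 4.0, 6.0, 9.0)  // feature 0 of the example vectors
    val col3 = Seq(-2.0, 3.0, 8.0, 1.0) // feature 3 of the example vectors
    println("spearman(col0, col3) = " + spearman(col0, col3)) // approximately 0.4
  }
}
```

Because only the ordering of the values matters, any strictly increasing transformation of a column leaves this coefficient unchanged.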

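Hypothesis testing

Hypothesis testing is used to determine whether a result is statistically significant, that is, whether or not it occurred by chance. spark.ml currently supports Pearson's chi-squared tests for independence.

ChiSquareTest conducts Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix, from which the chi-squared statistic is computed.

All label and feature values must be categorical.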


import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
// Needed for toDF; assumes a SparkSession named `spark` (predefined in spark-shell).
import spark.implicits._

// (label, features) pairs; every label and feature value is treated as a category.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)

val df = data.toDF("label", "features")
// The result is a single row holding one entry per feature in each column.
val chi = ChiSquareTest.test(df, "features", "label").head
println("pValues = " + chi.getAs[Vector](0))
println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", "]"))
println("statistics = " + chi.getAs[Vector](2))
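To make the contingency-matrix step concrete, the following plain-Scala sketch recomputes the chi-squared statistic by hand for the same six rows. It is an illustration only, not how spark.ml implements the test; ChiSquareSketch and chiSquare are hypothetical names, and p-values are left out because they would need a chi-squared CDF.

```scala
// Illustrative sketch only: Pearson's chi-squared independence statistic
// computed from the contingency table of one categorical feature and the label.
object ChiSquareSketch {
  // Returns (statistic, degreesOfFreedom).
  def chiSquare(feature: Seq[Double], label: Seq[Double]): (Double, Int) = {
    val n = feature.size.toDouble
    val rows = feature.distinct // feature categories
    val cols = label.distinct   // label categories

    // Observed counts per (feature, label) cell, plus the marginal totals.
    val observed = feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val rowTotal = feature.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val colTotal = label.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }

    // Sum over all cells of (observed - expected)^2 / expected,
    // where expected = rowTotal * colTotal / n under independence.
    val stat = (for (r <- rows; c <- cols) yield {
      val expected = rowTotal(r) * colTotal(c) / n
      val obs = observed.getOrElse((r, c), 0.0)
      (obs - expected) * (obs - expected) / expected
    }).sum

    (stat, (rows.size - 1) * (cols.size - 1))
  }

  def main(args: Array[String]): Unit = {
    val label = Seq(0.0, 0.0, 1.0, 0.0, 0.0, 1.0)
    val feature0 = Seq(0.5, 1.5, 1.5, 3.5, 3.5, 3.5)
    val feature1 = Seq(10.0, 20.0, 30.0, 30.0, 40.0, 40.0)

    println(chiSquare(feature0, label)) // statistic about 0.75, 2 degrees of freedom
    println(chiSquare(feature1, label)) // statistic about 1.5, 3 degrees of freedom
  }
}
```

For the first feature the contingency table has three rows (0.5, 1.5, 3.5) and two columns (labels 0 and 1), which gives (3 - 1) * (2 - 1) = 2 degrees of freedom.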


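Typical uses of each distribution:

1. Testing a mean when the population variance is known: Z-test.

2. Testing a mean when the population variance is unknown: t-test (the sample standard deviation s replaces the population standard deviation σ, and the population mean is inferred from the sample mean).

3. Testing a variance when both the mean and the variance are unknown: chi-squared test.

4. Testing the ratio of the variances of two normal populations when the means and variances of both are unknown: F-test.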
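
For reference, the test statistics usually associated with these four cases can be written as below. This is a standard textbook summary added for illustration, assuming normally distributed samples and the respective null hypotheses; the notation (sample mean \bar{x}, hypothesized values \mu_0 and \sigma_0, sample size n, sample standard deviation s) is introduced here and does not come from the original notes.

```latex
\begin{align*}
Z &= \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1) \\
t &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \sim t(n - 1) \\
\chi^2 &= \frac{(n - 1)\, s^2}{\sigma_0^2} \sim \chi^2(n - 1) \\
F &= \frac{s_1^2}{s_2^2} \sim F(n_1 - 1,\ n_2 - 1)
\end{align*}
```

Note that the chi-squared test for a variance in item 3 is a different procedure from the Pearson chi-squared independence test that ChiSquareTest runs; the two only share the same family of distributions.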

Copyright is reserved by the author. Please quote the source for citation and contact the author for reproducing.
Author ID at applysquare.com:我为统计狂
Url:https://www.applysquare.com/topic-en/RwNU7JdnY/


Reposted from blog.csdn.net/ruiyiin/article/details/77113581