In this chapter, we will cover the following recipes:
Package imports and initial setup for vectors and matrices
Creating a DenseVector and setting it up with Spark 2.0
Creating a SparseVector and setting it up with Spark 2.0
Creating a DenseMatrix and setting it up with Spark 2.0
Using a sparse local matrix in Spark 2.0
Performing vector arithmetic with Spark 2.0
Performing matrix arithmetic with Spark 2.0
Distributed matrices in the Spark 2.0 ML library
Exploring RowMatrix in Spark 2.0
Exploring distributed IndexedRowMatrix in Spark 2.0
Exploring distributed CoordinateMatrix in Spark 2.0
Exploring distributed BlockMatrix in Spark 2.0
Linear algebra is the cornerstone of machine learning (ML) and mathematical programming (MP). When dealing with Spark's machine learning libraries, you must understand that the vector/matrix structures provided by Scala (imported by default) are different from the Vector and Matrix facilities provided by Spark's ML and MLlib libraries. The latter, backed by RDDs, are the required data structures if you want to use Spark (that is, parallelism) for large-scale matrix/vector computations out of the box (for example, SVD implementations with higher numerical precision, used in some cases for derivative pricing and risk analytics). The Scala vector/matrix libraries provide a rich set of linear algebra operations, such as dot products and addition, which still have their place in an ML pipeline. In summary, the main difference between using Scala Breeze and Spark ML/MLlib is that the Spark facilities are backed by RDDs, which allow distributed, concurrent computation and resiliency without any additional concurrency modules or extra work (for example, Akka + Breeze).
Almost all machine learning algorithms use some form of classification or regression mechanism (not necessarily linear) to train a model, and then minimize error by comparing the training output with the actual output. For example, any implementation of a recommender system in Spark relies heavily on matrix factorization, approximation, or singular value decomposition (SVD). Another area of interest in machine learning, dimensionality reduction in large data sets, is principal component analysis (PCA), which relies heavily on linear algebra, factorization, and matrix manipulation.
When we first examined the source code of the Spark ML and MLlib algorithms in Spark 1.x, we quickly noticed that RDD-backed Vectors and Matrices formed the basis of many important algorithms.
When we revisited the source code of Spark 2.0 and its machine learning libraries, we noticed some interesting changes that need to be considered going forward. Here is an example of such a change from Spark 1.6.2 to Spark 2.0.0 that affected some of our linear algebra code:
In earlier versions (Spark 1.6.x), you could convert a DenseVector or SparseVector (see https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/mllib/linalg/Vectors.html) directly by calling the toBreeze() function, as the following code shows:
In Spark 2.0, the toBreeze() function has not only been renamed asBreeze(), it has also been made private.
To solve this problem, use one of the following code snippets to convert the previous vector to a commonly used BreezeVector instance:
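Since asBreeze() is private in Spark 2.0, a public workaround is to rebuild the Breeze vector from the raw array. Here is a minimal sketch (assuming the spark-mllib and Breeze dependencies are on the classpath; the object name is ours):

```scala
import org.apache.spark.mllib.linalg.Vectors
import breeze.linalg.{DenseVector => BreezeVector}

object ToBreezeWorkaround {
  def main(args: Array[String]): Unit = {
    val sparkVec = Vectors.dense(2.0, 3.0, 4.0)
    // toArray is public, so we can rebuild a Breeze vector from the raw values
    val breezeVec = new BreezeVector(sparkVec.toArray)
    println(breezeVec)
  }
}
```

The same pattern works for a SparseVector, since toArray materializes it as a dense array first.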
Scala is a concise language in which the object-oriented and functional programming paradigms coexist without conflict. Although the functional style is generally preferred in machine learning pipelines, there is nothing wrong with using an object-oriented approach for initial data collection and representation.
In terms of large-scale distributed matrices, our experience shows that when dealing with very large matrices (on the order of 10^9 to 10^13, and up to 10^27, elements), you must study the network cost of shuffles alongside the row-wise operations themselves. In our experience, a combination of local and distributed matrix/vector operations (for example, dot product, multiplication, and so on) works best when operating at scale.
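To illustrate mixing local and distributed operations, the sketch below performs a dot product locally with Breeze while delegating a matrix multiplication to an RDD-backed RowMatrix. This is a minimal example under the assumption of a local SparkSession; the object and value names are ours:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import breeze.linalg.{DenseVector => BreezeVector}

object MixedOpsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("mixedOps").getOrCreate()

    // Local (driver-side) operation: a Breeze dot product
    val localDot = new BreezeVector(Array(1.0, 2.0)).dot(new BreezeVector(Array(3.0, 4.0)))
    println(s"local dot product = $localDot")

    // Distributed operation: multiply an RDD-backed RowMatrix by a local matrix
    val rows = spark.sparkContext.parallelize(Seq(Vectors.dense(1.0, 0.0), Vectors.dense(0.0, 1.0)))
    val distMat = new RowMatrix(rows)
    val product = distMat.multiply(Matrices.dense(2, 2, Array(2.0, 0.0, 0.0, 2.0)))
    product.rows.collect().foreach(println)

    spark.stop()
  }
}
```

The local dot product never leaves the driver, while RowMatrix.multiply is executed across the cluster; choosing which side of that boundary each operation lives on is the design decision discussed above.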
The following figure describes the taxonomy of the available Spark vectors and matrices:

Spark vectors and matrices:
  Local vector: dense, sparse
  Local matrix
  Distributed matrix: RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix
Package import and initial setting of vectors and matrices
Before programming with Spark's vector and matrix artifacts, we first need to import the correct packages and then set up a SparkSession in order to gain access to the cluster handle. In this short recipe, we highlight a comprehensive set of package imports that cover most linear algebra operations in Spark. Subsequent recipes will include only the exact subset required for each specific program.
package chapter02
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.{SparkSession}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
import breeze.linalg.{DenseVector => BreezeVector}
import Array._
import org.apache.spark.mllib.linalg.DenseMatrix
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.log4j.Logger
import org.apache.log4j.Level
object MyVectorMatrix {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
// setup SparkSession to use for interactions with Spark
val spark = SparkSession
.builder
.master("local[*]")
.appName("myVectorMatrix")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()
val xyz = Vectors.dense("2".toDouble, "3".toDouble, "4".toDouble)
println(xyz)
val CustomerFeatures1: Array[Double] = Array(1,3,5,7,9,1,3,2,4,5,6,1,2,5,3,7,4,3,4,1)
val CustomerFeatures2: Array[Double] = Array(2,5,5,8,6,1,3,2,4,5,2,1,2,5,3,2,1,1,1,1)
val ProductFeatures1: Array[Double] = Array(0,1,1,0,1,1,1,0,0,1,1,1,1,0,1,2,0,1,1,0)
val x = Vectors.dense(CustomerFeatures1)
val y = Vectors.dense(CustomerFeatures2)
val z = Vectors.dense(ProductFeatures1)
val a = new BreezeVector(x.toArray)//x.asBreeze
val b = new BreezeVector(y.toArray)//y.asBreeze
val c = new BreezeVector(z.toArray)//z.asBreeze
val NetCustPref = a+b
val dotprod = c.dot(NetCustPref)
println("Net Customer Preference calculated by Scala Vector operations = \n",NetCustPref)
println("Customer Pref DOT Product calculated by Scala Vector operations =",dotprod)
val a2=a.toDenseVector
val b2=b.toDenseVector
val c2=c.toDenseVector
val NetCustPref2 = NetCustPref.toDenseVector
println("Net Customer Pref converted back to a Dense Vector =",NetCustPref2)
val denseVec1 = Vectors.dense(5,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,9)
val sparseVec1 = Vectors.sparse(20, Array(0,2,18,19), Array(5, 3, 8,9))
println(denseVec1.size)
println(denseVec1.numActives)
println(denseVec1.numNonzeros)
println("denseVec1 presentation = ",denseVec1)
println(sparseVec1.size)
println(sparseVec1.numActives)
println(sparseVec1.numNonzeros)
println("sparseVec1 presentation = ",sparseVec1)
//println("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
val ConvertedDenseVect : DenseVector= sparseVec1.toDense
val ConvertedSparseVect : SparseVector= denseVec1.toSparse
println("ConvertedDenseVect =", ConvertedDenseVect)
println("ConvertedSparseVect =", ConvertedSparseVect)
println("Sparse Vector Representation = ",sparseVec1)
println("Converting Sparse Vector back to Dense Vector",sparseVec1.toDense)
println("Dense Vector Representation = ",denseVec1)
println("Converting Dense Vector to Sparse Vector",denseVec1.toSparse)
// Spark Example
// 23.0 34.3 21.3
// 11.0 33.0 22.6
// 17.0 24.5 22.2
// will be stored in column-major order as: 23.0, 11.0, 17.0, 34.3, 33.0, 24.5, 21.3, 22.6, 22.2
val denseMat1 = Matrices.dense(3,3,Array(23.0, 11.0, 17.0, 34.3, 33.0, 24.5, 21.3,22.6,22.2))
val MyArray1= Array(10.0, 11.0, 20.0, 30.3)
val denseMat3 = Matrices.dense(2,2,MyArray1)
println("denseMat1=",denseMat1)
println("denseMat3=",denseMat3)
val v1 = Vectors.dense(5,6,2,5)
val v2 = Vectors.dense(8,7,6,7)
val v3 = Vectors.dense(3,6,9,1)
val v4 = Vectors.dense(7,4,9,2)
val Mat11 = Matrices.dense(4,4,v1.toArray ++ v2.toArray ++ v3.toArray ++ v4.toArray)
println("Mat11=\n", Mat11)
println("Number of Columns=",denseMat1.numCols)
println("Number of Rows=",denseMat1.numRows)
println("Number of Active elements=",denseMat1.numActives)
println("Number of Non Zero elements=",denseMat1.numNonzeros)
println("denseMat1 representation of a dense matrix and its value=\n",denseMat1)
val sparseMat1= Matrices.sparse(3,2 ,Array(0,1,3), Array(0,1,2), Array(11,22,33))
println("Number of Columns=",sparseMat1.numCols)
println("Number of Rows=",sparseMat1.numRows)
println("Number of Active elements=",sparseMat1.numActives)
println("Number of Non Zero elements=",sparseMat1.numNonzeros)
println("sparseMat1 representation of a sparse matrix and its value=\n",sparseMat1)
/*
From the Apache Spark manual pages, an example used to define Matrices.sparse():
1.0 0.0 4.0
0.0 3.0 5.0
2.0 0.0 6.0
values=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], rowIndices=[0, 2, 1, 0, 1, 2], colPointers=[0, 2, 3, 6]
*/
val sparseMat33= Matrices.sparse(3,3 ,Array(0, 2, 3, 6) ,Array(0, 2, 1, 0, 1, 2),Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
println(sparseMat33)
val denseFeatureVector= Vectors.dense(1,2,1)
val result0 = sparseMat33.multiply(denseFeatureVector)
println("SparseMat33 =", sparseMat33)
println("denseFeatureVector =", denseFeatureVector)
println("SparseMat33 * DenseFeatureVector =", result0)
//println("*****************************************************************************")
val denseVec13 = Vectors.dense(5,3,0)
println("denseVec13 =", denseVec13)
println("denseMat1 =", denseMat1)
val result3 = denseMat1.multiply(denseVec13)
println("denseMat1 * denseVec13 =", result3)
val transposedMat1= sparseMat1.transpose
println("Original sparseMat1 =", sparseMat1)
println("transposedMat1=",transposedMat1)
val transposedMat2= denseMat1.transpose
println("Original denseMat1 =", denseMat1)
println("transposedMat2=" ,transposedMat2)
println("================================================================================")
val denseMat33: DenseMatrix= new DenseMatrix(3, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0,7.0,8.0,9.0))
val identityMat33: DenseMatrix = new DenseMatrix(3, 3, Array(1.0, 0.0, 0.0, 0.0,1.0,0.0,0.0,0.0,1.0))
val result2 =denseMat33.multiply(identityMat33)
println(result2)
println(denseMat33.multiply(denseMat33)) // A * A in action; note denseMat33 is not symmetric, so A.transpose does not equal A
println("denseMat33 =", denseMat33)
println("Matrix transposed twice", denseMat33.transpose.transpose)
println("denseMat33 =", denseMat33)
/* Vector arithmetic */
val w1 = Vectors.dense(1,2,3)
val w2 = Vectors.dense(4,-5,6)
val w3 = new BreezeVector(w1.toArray)//w1.asBreeze
val w4= new BreezeVector(w2.toArray)// w2.asBreeze
println("w3 + w4 =",w3+w4)
println("w3 - w4 =",w3-w4)
println("w3 * w4 =",w3.dot(w4))
val sv1 = Vectors.sparse(10, Array(0,2,9), Array(5, 3, 13))
val sv2 = Vectors.dense(1,0,1,1,0,0,1,0,0,13)
println("sv1 - Sparse Vector = ",sv1)
println("sv2 - Dense Vector = ",sv2)
// println("sv1 * sve2 =", sv1.asBreeze.dot(sv2.asBreeze))
println("sv1 * sv2 =", new BreezeVector(sv1.toArray).dot(new BreezeVector(sv2.toArray)))
// Matrix multiplication
val dMat1: DenseMatrix= new DenseMatrix(2, 2, Array(1.0, 3.0, 2.0, 4.0))
val dMat2: DenseMatrix = new DenseMatrix(2, 2, Array(2.0,1.0,0.0,2.0))
println("dMat1 =",dMat1)
println("dMat2 =",dMat2)
println("dMat1 * dMat2 =", dMat1.multiply(dMat2)) //A x B
println("dMat2 * dMat1 =", dMat2.multiply(dMat1)) //B x A not the same as A xB
val m = new RowMatrix(spark.sparkContext.parallelize(Seq(Vectors.dense(4, 3), Vectors.dense(3, 2))))
val svd = m.computeSVD(2, true)
val v = svd.V
val sInvArray = svd.s.toArray.toList.map(x => 1.0 / x).toArray
val sInverse = new DenseMatrix(2, 2, Matrices.diag(Vectors.dense(sInvArray)).toArray)
val uArray = svd.U.rows.collect.toList.map(_.toArray.toList).flatten.toArray
val uTranspose = new DenseMatrix(2, 2, uArray) // already transposed because DenseMatrix has a column-major orientation
val inverse = v.multiply(sInverse).multiply(uTranspose)
// -1.9999999999998297 2.999999999999767
// 2.9999999999997637 -3.9999999999996767
println("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
println(inverse)
val dataVectors = Seq(
Vectors.dense(0.0, 1.0, 0.0),
Vectors.dense(3.0, 1.0, 5.0),
Vectors.dense(0.0, 7.0, 0.0)
)
val identityVectors = Seq(
Vectors.dense(1.0, 0.0, 0.0),
Vectors.dense(0.0, 1.0, 0.0),
Vectors.dense(0.0, 0.0, 1.0)
)
val dd = dataVectors.map(x => x.toArray).flatten.toArray
dd.foreach(println(_))
val dm00: Matrix = Matrices.dense(3, 3, dd)
print("==============================")
print("\n", dm00)
val distMat33 = new RowMatrix(spark.sparkContext.parallelize(dataVectors))
println("distMat33 columns - Count =", distMat33.computeColumnSummaryStatistics().count)
println("distMat33 columns - Mean =", distMat33.computeColumnSummaryStatistics().mean)
println("distMat33 columns - Variance =", distMat33.computeColumnSummaryStatistics().variance)
println("distMat33 columns - Covariance =", distMat33.computeCovariance())
val distMatIdent33 = new RowMatrix(spark.sparkContext.parallelize(identityVectors))
val flatArray = identityVectors.map(x => x.toArray).flatten.toArray
flatArray.foreach(println(_))
// flatten it so we can use it in the Matrices.dense API call
val dmIdentity: Matrix = Matrices.dense(3, 3, flatArray)
val distMat44 = distMat33.multiply(dmIdentity)
println("distMat44 columns - Count =", distMat44.computeColumnSummaryStatistics().count)
println("distMat44 columns - Mean =", distMat44.computeColumnSummaryStatistics().mean)
println("distMat44 columns - Variance =", distMat44.computeColumnSummaryStatistics().variance)
println("distMat44 columns - Covariance =", distMat44.computeCovariance())
val distInxMat1 = spark.sparkContext.parallelize( List( IndexedRow( 0L, dataVectors(0)), IndexedRow( 1L, dataVectors(1)), IndexedRow( 1L, dataVectors(2))))
println("distinct elements=", distInxMat1.distinct().count())
val CoordinateEntries = Seq(
MatrixEntry(1, 6, 300),
MatrixEntry(3, 1, 5),
MatrixEntry(1, 7, 10)
)
val distCordMat1 = new CoordinateMatrix(spark.sparkContext.parallelize(CoordinateEntries.toList))
println("First Row (MatrixEntry) =",distCordMat1.entries.first())
val distBlkMat1 = distCordMat1.toBlockMatrix().cache()
distBlkMat1.validate()
println("Is block empty =", distBlkMat1.blocks.isEmpty())
spark.stop()
}
}