Apache Spark 2.x Machine Learning Cookbook (2): Learning Linear Algebra with Spark

In this chapter, we will cover the following recipes:

Package imports and initial settings for vectors and matrices
Creating a DenseVector and setting it up with Spark 2.0
Creating a SparseVector and setting it up with Spark 2.0
Creating a DenseMatrix and setting it up with Spark 2.0
Using a sparse local matrix in Spark 2.0
Performing vector arithmetic with Spark 2.0
Performing matrix arithmetic with Spark 2.0
Distributed matrices in the Spark 2.0 ML library
Exploring RowMatrix in Spark 2.0
Exploring IndexedRowMatrix in Spark 2.0
Exploring CoordinateMatrix in Spark 2.0
Exploring BlockMatrix in Spark 2.0

Linear algebra is the cornerstone of machine learning (ML) and mathematical programming (MP). When dealing with Spark's machine learning libraries, you must understand that the Vector/Matrix structures provided by Scala (via the Breeze library, which Spark imports by default) are different from the Vector and Matrix facilities provided by Spark ML and MLlib. The latter, backed by RDDs, are the required data structures if you want to use Spark (that is, its parallelism) out of the box for large-scale matrix/vector computations, for example, alternative SVD implementations with higher numerical precision, of the kind used in derivative pricing and risk analytics. The Scala vector/matrix libraries provide a rich set of linear algebra operations, such as dot products and addition, which still have their place in an ML pipeline. All in all, the main difference between using Breeze and the Spark or Spark ML facilities is that the Spark facilities are backed by RDDs, which give you distributed, concurrent computation and resiliency at the same time, without any additional concurrency modules or extra work (for example, Akka plus Breeze).
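As a minimal sketch of this distinction (assuming only that Breeze is on the classpath, which it is when you depend on Spark MLlib, and that sc stands for an existing SparkContext; the full recipe below shows how to build one via SparkSession):

// Local (Breeze): plain single-JVM linear algebra
import breeze.linalg.{DenseVector => BDV}
val localSum = BDV(1.0, 2.0, 3.0) + BDV(4.0, 5.0, 6.0)
println(localSum) // DenseVector(5.0, 7.0, 9.0)

// Distributed (Spark MLlib): the matrix rows live in an RDD across the cluster
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val distRows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val distMat = new RowMatrix(distRows)
println(distMat.computeColumnSummaryStatistics().mean)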

Almost all machine learning algorithms use some form of classification or regression mechanism (not necessarily linear) to train the model, and then minimize the error by comparing the training output with the actual output. For example, any implementation of the recommendation system in Spark will rely heavily on matrix factorization, factorization, approximation, or singular value decomposition (SVD). Another area of ​​interest in machine learning that deals with dimensionality reduction in large data sets is principal component analysis (PCA), which relies heavily on linear algebra, factorization, and matrix processing.

When we first examined the source code of the Spark ML and MLlib algorithms in Spark 1.x, we quickly noticed that RDD-backed vectors and matrices form the basis of many important algorithms.

When we revisited the source code of Spark 2.0 and its machine learning libraries, we noticed some interesting changes that need to be taken into account going forward. Here is an example of such a change, from Spark 1.6.2 to Spark 2.0.0, that affected some of our linear algebra code:

In previous versions (Spark 1.6.x), you could convert a DenseVector or SparseVector (see https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/mllib/linalg/Vectors.html) directly to a Breeze vector by using the toBreeze() function, as the following code shows:
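A sketch of the old usage, reconstructed from the description above (Spark 1.6.x only; it no longer compiles against Spark 2.0):

// Spark 1.6.x: toBreeze was still accessible from user code
val sparkVec = org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0, 3.0)
val breezeVec = sparkVec.toBreeze // returned a breeze.linalg.Vector[Double]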

In Spark 2.0, the toBreeze() function has not only been renamed to asBreeze(), it has also been downgraded to a private function.

To solve this problem, use a snippet such as the following to convert such a vector into the commonly used Breeze DenseVector instance:
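A minimal workaround sketch, which simply copies the values out through toArray (the same pattern the full recipe below uses):

import breeze.linalg.{DenseVector => BreezeVector}
import org.apache.spark.mllib.linalg.Vectors

val sparkVec = Vectors.dense(1.0, 2.0, 3.0)
// Rebuild the vector on the Breeze side from the raw array of values
val breezeVec = new BreezeVector(sparkVec.toArray)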

Scala is a concise language in which the object-oriented and functional programming paradigms coexist without conflict. Although functional programming is the preferred style in the machine learning world, there is nothing wrong with using an object-oriented approach for the initial data collection and representation stages.

In terms of large-scale distributed matrices, our experience shows that when dealing with very large matrix sets, on the order of 10^9 to 10^13 and even up to 10^27 elements, you must pay close attention to cross-network operations and to how local and distributed row operations are mixed. In our experience, a combination of local and distributed matrix/vector operations (for example, dot products and multiplications) works best when operating at scale.
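As a small illustration of that mix (a sketch only, again assuming sc is an existing SparkContext): a distributed RowMatrix can be multiplied directly by a small local matrix held on the driver.

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Distributed side: rows stored as an RDD across the cluster
val distMat = new RowMatrix(sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))))
// Local side: a 2 x 2 identity matrix on the driver (column-major order)
val localIdentity = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
// The product is computed in parallel, one distributed row at a time
val product = distMat.multiply(localIdentity)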

The following outline shows the classification of the available Spark vectors and matrices:

Local
  vector: dense, sparse
  matrix
Distributed
  RowMatrix
  IndexedRowMatrix
  CoordinateMatrix
  BlockMatrix

Package import and initial setting of vectors and matrices

Before programming with Spark, or using its vector and matrix artifacts, we need to first import the right packages and then set up a SparkSession so that we can gain access to the cluster handle. In this short recipe, we highlight a fairly comprehensive set of packages that covers most of the linear algebra operations in Spark. Each subsequent recipe will include the exact subset required for that specific recipe.

package chapter02

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.{SparkSession}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
import breeze.linalg.{DenseVector => BreezeVector}
import Array._
import org.apache.spark.mllib.linalg.DenseMatrix
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.log4j.Logger
import org.apache.log4j.Level

object MyVectorMatrix {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    // setup SparkSession to use for interactions with Spark
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myVectorMatrix")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()


    val xyz = Vectors.dense("2".toDouble, "3".toDouble, "4".toDouble)
    println(xyz)

    val CustomerFeatures1: Array[Double] = Array(1,3,5,7,9,1,3,2,4,5,6,1,2,5,3,7,4,3,4,1)
    val CustomerFeatures2: Array[Double] = Array(2,5,5,8,6,1,3,2,4,5,2,1,2,5,3,2,1,1,1,1)
    val ProductFeatures1: Array[Double]  = Array(0,1,1,0,1,1,1,0,0,1,1,1,1,0,1,2,0,1,1,0)

    val x = Vectors.dense(CustomerFeatures1)
    val y = Vectors.dense(CustomerFeatures2)
    val z = Vectors.dense(ProductFeatures1)

    val a = new BreezeVector(x.toArray)//x.asBreeze
    val b = new BreezeVector(y.toArray)//y.asBreeze
    val c = new BreezeVector(z.toArray)//z.asBreeze

    val NetCustPref = a+b
    val dotprod = c.dot(NetCustPref)

    println("Net Customer Preference calculated by Scala Vector operations = \n",NetCustPref)
    println("Customer Pref DOT Product calculated by Scala Vector operations =",dotprod)

    val a2=a.toDenseVector
    val b2=b.toDenseVector
    val c2=c.toDenseVector

    val NetCustPref2 = NetCustPref.toDenseVector
    println("Net Customer Pref converted back to Spark Dense Vactor =",NetCustPref2)

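    // denseVec1 stores all 20 values; sparseVec1 stores only the size (20),
    // the indices of the non-zero entries, and their values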
    val denseVec1 = Vectors.dense(5,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,9)
    val sparseVec1 = Vectors.sparse(20, Array(0,2,18,19), Array(5, 3, 8,9))

    println(denseVec1.size)
    println(denseVec1.numActives)
    println(denseVec1.numNonzeros)
    println("denceVec1 presentation = ",denseVec1)

    println(sparseVec1.size)
    println(sparseVec1.numActives)
    println(sparseVec1.numNonzeros)
    println("sparseVec1 presentation = ",sparseVec1)

    //println("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    val ConvertedDenseVect : DenseVector= sparseVec1.toDense
    val ConvertedSparseVect : SparseVector= denseVec1.toSparse
    println("ConvertedDenseVect =", ConvertedDenseVect)
    println("ConvertedSparseVect =", ConvertedSparseVect)

    println("Sparse Vector Representation = ",sparseVec1)
    println("Converting Sparse Vector back to Dense Vector",sparseVec1.toDense)

    println("Dense Vector Representation = ",denseVec1)
    println("Converting Dense Vector to Sparse Vector",denseVec1.toSparse)

    // Spark Example
    // 23.0 34.3 21.3
    // 11.0 33.0 22.6
    // 17.0 24.5 22.2
    // will be Stored as 23.0, 11.0, 17.0, 34.3, 33.0, 24.5, 21.3,22.6,22.2

    val denseMat1 = Matrices.dense(3,3,Array(23.0, 11.0, 17.0, 34.3, 33.0, 24.5, 21.3,22.6,22.2))

    val MyArray1= Array(10.0, 11.0, 20.0, 30.3)
    val denseMat3 = Matrices.dense(2,2,MyArray1)

    println("denseMat1=",denseMat1)
    println("denseMat3=",denseMat3)

    val v1 = Vectors.dense(5,6,2,5)
    val v2 = Vectors.dense(8,7,6,7)
    val v3 = Vectors.dense(3,6,9,1)
    val v4 = Vectors.dense(7,4,9,2)

    val Mat11 = Matrices.dense(4,4,v1.toArray ++ v2.toArray ++ v3.toArray ++ v4.toArray)
    println("Mat11=\n", Mat11)

    println("Number of Columns=",denseMat1.numCols)
    println("Number of Rows=",denseMat1.numRows)
    println("Number of Active elements=",denseMat1.numActives)
    println("Number of Non Zero elements=",denseMat1.numNonzeros)
    println("denseMat1 representation of a dense matrix and its value=\n",denseMat1)

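    // Matrices.sparse uses compressed sparse column (CSC) storage:
    // column pointers, row indices, and the non-zero values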
    val sparseMat1= Matrices.sparse(3,2 ,Array(0,1,3), Array(0,1,2), Array(11,22,33))
    println("Number of Columns=",sparseMat1.numCols)
    println("Number of Rows=",sparseMat1.numRows)
    println("Number of Active elements=",sparseMat1.numActives)
    println("Number of Non Zero elements=",sparseMat1.numNonzeros)
    println("sparseMat1 representation of a sparse matrix and its value=\n",sparseMat1)

    /*
    From Manual pages of Apache Spark to use as an example to Define Matrices.sparse()
    1.0 0.0 4.0
    0.0 3.0 5.0
    2.0 0.0 6.0
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], rowIndices=[0, 2, 1, 0, 1, 2], colPointers=[0, 2, 3, 6]
    */
    val sparseMat33= Matrices.sparse(3,3 ,Array(0, 2, 3, 6) ,Array(0, 2, 1, 0, 1, 2),Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    println(sparseMat33)
    val denseFeatureVector= Vectors.dense(1,2,1)

    val result0 = sparseMat33.multiply(denseFeatureVector)
    println("SparseMat33 =", sparseMat33)
    println("denseFeatureVector =", denseFeatureVector)
    println("SparseMat33 * DenseFeatureVector =", result0)

    //println("*****************************************************************************")
    val denseVec13 = Vectors.dense(5,3,0)
    println("denseVec2 =", denseVec13)
    println("denseMat1 =", denseMat1)
    val result3= denseMat1.multiply(denseVec13)
    println("denseMat1 * denseVect13 =", result3)

    val transposedMat1= sparseMat1.transpose
    println("Original sparseMat1 =", sparseMat1)
    println("transposedMat1=",transposedMat1)

    val transposedMat2= denseMat1.transpose
    println("Original sparseMat1 =", denseMat1)
    println("transposedMat2=" ,transposedMat2)

    println("================================================================================")

    val denseMat33: DenseMatrix= new DenseMatrix(3, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0,7.0,8.0,9.0))
    val identityMat33: DenseMatrix = new DenseMatrix(3, 3, Array(1.0, 0.0, 0.0, 0.0,1.0,0.0,0.0,0.0,1.0))
    val result2 =denseMat33.multiply(identityMat33)
    println(result2)

    println(denseMat33.multiply(denseMat33)) // proof in action: A * A is not equal to A (denseMat33 is not an identity matrix)

    println("denseMat33 =", denseMat33)
    println("Matrix transposed twice", denseMat33.transpose.transpose)
    println("denseMat33 =", denseMat33)

    /* Vector arithmetic */
    val w1 = Vectors.dense(1,2,3)
    val w2 = Vectors.dense(4,-5,6)
    val w3 = new BreezeVector(w1.toArray)//w1.asBreeze
    val w4=  new BreezeVector(w2.toArray)// w2.asBreeze
    println("w3 + w4 =",w3+w4)
    println("w3 - w4 =",w3+w4)
    println("w3 * w4 =",w3.dot(w4))
    val sv1 = Vectors.sparse(10, Array(0,2,9), Array(5, 3, 13))
    val sv2 = Vectors.dense(1,0,1,1,0,0,1,0,0,13)
    println("sv1 - Sparse Vector = ",sv1)
    println("sv2 - Dense  Vector = ",sv2)
    //    println("sv1  * sve2  =", sv1.asBreeze.dot(sv2.asBreeze))
    println("sv1  * sv2  =", new BreezeVector(sv1.toArray).dot(new BreezeVector(sv2.toArray)))


    // Matrix multiplication
    val dMat1: DenseMatrix= new DenseMatrix(2, 2, Array(1.0, 3.0, 2.0, 4.0))
    val dMat2: DenseMatrix = new DenseMatrix(2, 2, Array(2.0,1.0,0.0,2.0))
    println("dMat1 =",dMat1)
    println("dMat2 =",dMat2)
    println("dMat1 * dMat2 =", dMat1.multiply(dMat2)) //A x B
    println("dMat2 * dMat1 =", dMat2.multiply(dMat1)) //B x A   not the same as A xB

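    // Pseudo-inverse via SVD: since A = U * S * V^T, the inverse is V * S^-1 * U^T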
    val m = new RowMatrix(spark.sparkContext.parallelize(Seq(Vectors.dense(4, 3), Vectors.dense(3, 2))))
    val svd = m.computeSVD(2, true)
    val v = svd.V
    val sInvArray = svd.s.toArray.toList.map(x => 1.0 / x).toArray
    val sInverse = new DenseMatrix(2, 2, Matrices.diag(Vectors.dense(sInvArray)).toArray)
    val uArray = svd.U.rows.collect.toList.map(_.toArray.toList).flatten.toArray
    val uTranspose = new DenseMatrix(2, 2, uArray) // already transposed because DenseMatrix has a column-major orientation
    val inverse = v.multiply(sInverse).multiply(uTranspose)
    // -1.9999999999998297  2.999999999999767
    // 2.9999999999997637   -3.9999999999996767
    println("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    println(inverse)


    val dataVectors = Seq(
      Vectors.dense(0.0, 1.0, 0.0),
      Vectors.dense(3.0, 1.0, 5.0),
      Vectors.dense(0.0, 7.0, 0.0)
    )

    val identityVectors = Seq(
      Vectors.dense(1.0, 0.0, 0.0),
      Vectors.dense(0.0, 1.0, 0.0),
      Vectors.dense(0.0, 0.0, 1.0)
    )

    val dd = dataVectors.map(x => x.toArray).flatten.toArray
    dd.foreach(println(_))

    val dm00: Matrix = Matrices.dense(3, 3, dd)
    print("==============================")
    print("\n", dm00)

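    // RowMatrix: a distributed matrix whose rows live in an RDD of local vectors
    // (the rows carry no meaningful indices)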
    val distMat33 = new RowMatrix(spark.sparkContext.parallelize(dataVectors))

    println("distMatt33 columns - Count =", distMat33.computeColumnSummaryStatistics().count)
    println("distMatt33 columns - Mean =", distMat33.computeColumnSummaryStatistics().mean)
    println("distMatt33 columns - Variance =", distMat33.computeColumnSummaryStatistics().variance)
    println("distMatt33 columns - CoVariance =", distMat33.computeCovariance())

    val distMatIdent33 = new RowMatrix(spark.sparkContext.parallelize(identityVectors))

    val flatArray = identityVectors.map(x => x.toArray).flatten.toArray
    flatArray.foreach(println(_))

    // flatten it so we can use it in the Matrices.dense API call
    val dmIdentity: Matrix = Matrices.dense(3, 3, flatArray)

    val distMat44 = distMat33.multiply(dmIdentity)
    println("distMatt44 columns - Count =", distMat44.computeColumnSummaryStatistics().count)
    println("distMatt44 columns - Mean =", distMat44.computeColumnSummaryStatistics().mean)
    println("distMatt44 columns - Variance =", distMat44.computeColumnSummaryStatistics().variance)
    println("distMatt44 columns - CoVariance =", distMat44.computeCovariance())

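    // Rows for an IndexedRowMatrix: each IndexedRow pairs an explicit long index
    // with a local vector. Two rows below share index 1L, but distinct() compares
    // whole IndexedRow objects, so all three rows are counted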
    val distInxMat1 = spark.sparkContext.parallelize( List( IndexedRow( 0L, dataVectors(0)), IndexedRow( 1L, dataVectors(1)), IndexedRow( 1L, dataVectors(2))))

    println("distinct elements=", distInxMat1.distinct().count())

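    // CoordinateMatrix entries: (row, column, value) triples, a good fit for
    // very large, very sparse matrices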
    val CoordinateEntries = Seq(
      MatrixEntry(1, 6, 300),
      MatrixEntry(3, 1, 5),
      MatrixEntry(1, 7, 10)
    )

    val distCordMat1 = new CoordinateMatrix(spark.sparkContext.parallelize(CoordinateEntries.toList))
    println("First Row (MarixEntry) =",distCordMat1.entries.first())

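    // BlockMatrix: the coordinate matrix regrouped into a grid of local matrix
    // blocks; validate() checks that the blocks are consistently sized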
    val distBlkMat1 =  distCordMat1.toBlockMatrix().cache()
    distBlkMat1.validate()
    println("Is block empty =", distBlkMat1.blocks.isEmpty())

    spark.stop()
  }

}
