Apache Spark 2.x Machine Learning Cookbook(2)

Chapter 1: Practical Machine Learning with Spark Using Scala

In this chapter, we will cover:
Downloading and installing the JDK
Downloading and installing IntelliJ
Downloading and installing Spark
Configuring IntelliJ to work with Spark and run Spark ML sample code
Running sample ML code from Spark
Identifying data sources for practical machine learning
Running your first Apache Spark 2.0 program with the IntelliJ IDE
How to add graphics to your Spark program

Introduction
With the recent advances in cluster computing and the rise of big data, the field of machine learning has been pushed to the forefront of computing. The need for an interactive platform that enables data science at scale has long been a dream that has now become a reality.

The following three areas jointly promote and accelerate the development of interactive data science:

Apache Spark: A unified data science and technology platform that combines a fast compute engine and fault-tolerant data structures into a well-designed and integrated offering
Machine learning: A branch of artificial intelligence that enables machines to imitate some of the tasks originally reserved for the human brain
Scala: A modern JVM-based language that builds on traditional languages, but unites functional and object-oriented concepts without the verbosity of those languages

First, we need to set up the development environment. The recipes in this chapter give you detailed instructions for installing and configuring the IntelliJ IDE, the Scala plugin, and Spark, which together make up that environment. After the development environment is set up, we will proceed to run one of the Spark ML sample programs to test the setup.

Spark provides an easy-to-use distributed framework within a unified technology stack, which has made it the preferred platform for data science projects, which more often than not require iterative algorithms that eventually converge on a solution. Because of the way they work internally, these algorithms generate a large amount of intermediate results that must be passed from one stage to the next during the intermediate steps. The need for an interactive tool with a robust, native distributed machine learning library (MLlib) rules out a disk-based approach for most data science projects.

Spark takes a different approach to cluster computing: it addresses the problem as a technology stack rather than as an ecosystem. The combination of a large number of centrally managed libraries with a lightning-fast compute engine that supports fault-tolerant data structures has positioned Spark to displace Hadoop as the preferred big data analytics platform.

Spark has a modular approach, as shown in the following figure:

Machine learning
The purpose of machine learning is to produce machines and devices that can imitate human intelligence and automate some of the tasks traditionally reserved for the human brain. Machine learning algorithms are designed to scan very large data sets in a relatively short time and to provide approximate answers that would take humans much longer to work out.

The field of machine learning can be organized in many ways. At a high level, it splits into supervised learning and unsupervised learning. Supervised learning algorithms are ML algorithms that use a training set (that is, labeled data) to compute a probability distribution or graphical model, which in turn allows them to classify new data points without further human intervention. Unsupervised learning algorithms draw inferences from data sets consisting of input data without labeled responses.
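To make the distinction concrete, here is a minimal sketch (the object and column names are purely illustrative) that fits a supervised classifier and an unsupervised clustering model on a few toy points using Spark's DataFrame-based ML API; the labeled set carries a label column, while the clustering input does not:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SupervisedVsUnsupervised extends App {
  val spark = SparkSession.builder().master("local[*]").appName("supervisedVsUnsupervised").getOrCreate()
  import spark.implicits._

  // Supervised: each feature vector comes with a label to learn from
  val labeled = Seq(
    (0.0, Vectors.dense(0.1, 0.2)),
    (0.0, Vectors.dense(0.2, 0.1)),
    (1.0, Vectors.dense(0.9, 0.8)),
    (1.0, Vectors.dense(0.8, 0.9))
  ).toDF("label", "features")
  val classifier = new LogisticRegression().fit(labeled)

  // Unsupervised: no labels; the algorithm infers structure (here, two clusters)
  val unlabeled = labeled.select("features")
  val clusters = new KMeans().setK(2).fit(unlabeled)

  println(s"cluster centers: ${clusters.clusterCenters.mkString(", ")}")
  spark.stop()
}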

Out of the box, Spark provides a rich set of ML algorithms that can be deployed on large data sets without further coding. The following figure depicts Spark's MLlib algorithms as a mind map. Spark's MLlib is designed to take advantage of parallelism while providing fault-tolerant distributed data structures. Spark refers to such data structures as Resilient Distributed Datasets, or RDDs:
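As a quick illustration (a minimal sketch; the object name RDDIntro is just for this example), an RDD can be created by parallelizing a local collection. Transformations on it are lazy and recorded as a lineage graph, which is how Spark recomputes lost partitions instead of replicating the data:

import org.apache.spark.sql.SparkSession

object RDDIntro extends App {
  val spark = SparkSession.builder().master("local[*]").appName("rddIntro").getOrCreate()

  // parallelize() distributes a local collection across the cluster as an RDD
  val numbers = spark.sparkContext.parallelize(1 to 1000)

  // map() is a lazy transformation; nothing runs until an action such as sum()
  val squares = numbers.map(n => n.toDouble * n)
  println(squares.sum())   // the action triggers the computation

  spark.stop()
}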

Scala
Scala is a modern programming language that is emerging as an alternative to traditional programming languages such as Java and C++. Scala is a JVM-based language that not only offers a concise syntax free of traditional boilerplate code, but also combines object-oriented and functional programming into an extremely crisp and extraordinarily powerful type-safe language.

Scala takes a flexible and expressive approach, which makes it ideal for interacting with Spark's MLlib. The fact that Spark itself is written in Scala is strong evidence that Scala is a full-blown programming language that can be used to create complex system code with demanding performance requirements.

Scala builds on the tradition of Java by addressing some of Java's shortcomings, while avoiding an all-or-nothing approach. Scala code compiles to Java bytecode, which lets it coexist and interoperate with the rich set of Java libraries. The ability to use Java libraries from Scala (and vice versa) gives software engineers continuity and a rich environment in which to build modern, complex machine learning systems without departing entirely from the Java tradition and code base.
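For example, here is a small sketch (all names are illustrative) that mixes a plain java.util.ArrayList and a Java time class into Scala code, then bridges back to a Scala collection:

import java.time.LocalDate
import java.util.{ArrayList => JArrayList}

import scala.collection.JavaConverters._   // converters between Java and Scala collections

object JavaInteropExample extends App {
  // Use a Java collection directly from Scala
  val javaList = new JArrayList[String]()
  javaList.add("Spark")
  javaList.add("MLlib")

  // Convert it to an immutable Scala List and carry on in Scala style
  val scalaList = javaList.asScala.toList
  println(s"${scalaList.mkString(", ")} as of ${LocalDate.now()}")
}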

Scala fully supports a feature-rich functional programming paradigm, with standard support for lambdas, currying, type inference, immutability, lazy evaluation, and a pattern-matching paradigm reminiscent of Perl but without the cryptic syntax. Scala also supports algebra-friendly data types, anonymous functions, covariance, contravariance, and higher-order functions, which makes it very well suited to machine learning programming.
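The short sketch below (all names are illustrative) touches several of these features: immutability, lambdas and higher-order functions, currying, lazy evaluation, and pattern matching:

object ScalaFeatures extends App {
  // Immutable collection with inferred types
  val xs = List(1, 2, 3, 4)

  // Lambdas passed to higher-order functions
  val doubled = xs.map(x => x * 2)

  // Currying: arguments supplied in separate parameter lists
  def scale(factor: Int)(x: Int): Int = factor * x
  val triple = scale(3) _            // partially applied function
  println(doubled.map(triple))       // List(6, 12, 18, 24)

  // Lazy evaluation: the right-hand side runs only on first access
  lazy val expensive = { println("computed once"); 42 }
  println(expensive)

  // Pattern matching with a guard
  xs.headOption match {
    case Some(n) if n > 0 => println(s"starts with the positive number $n")
    case _                => println("empty or non-positive")
  }
}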

This is the hello world program in Scala:

object HelloWorld extends App {
  println("Hello World!")
}

Compile and run Hello World in Scala as follows:

scalac HelloWorld.scala
scala HelloWorld

This Apache Spark machine learning cookbook takes a practical approach that offers developers a multi-disciplinary view. The book focuses on machine learning and on the interaction and cohesion of Apache Spark and Scala. We also take the extra step of teaching you how to set up and run a comprehensive development environment that developers are familiar with, and provide code snippets that you can run in an interactive shell without the modern facilities that an IDE provides:

Apache Spark ---> machine learning ---> IDE ---> Scala

Software versions and libraries used in this book


The following table provides a detailed list of the software versions and libraries used in this book. If you follow the installation instructions in this chapter, your setup will include most of the items listed here. Any other JAR or library files required by a particular recipe are covered by additional installation instructions in that recipe:

The other required JARs are as follows:

Miscellaneous JARs                    Version
bliki-core                            3.0.19
breeze-viz                            0.12
Cloud9                                1.5.0
hadoop-streaming                      2.2.0
JCommon                               1.0.23
JFreeChart                            1.0.19
lucene-analyzers-common               6.0.0
lucene-core                           6.0.0
scopt                                 3.3.0
spark-streaming-flume-assembly        2.0.0
spark-streaming-kafka-0-8-assembly    2.0.0

We also tested all the recipes in this book on Spark 2.1.1 and found that the programs ran as expected. It is recommended that you use the software versions and libraries listed in these tables for learning purposes. To keep up with the rapidly changing Spark landscape and documentation, the API links to the Spark documentation mentioned in this book point to the latest version of Spark 2.x, while the API references in the recipes specifically target Spark 2.0.0. All Spark documentation links provided in this book point to the latest documentation on the Spark website; if you need the documentation for a specific version of Spark (for example, Spark 2.0.0), look up that release on the Spark website.

For clarity, we have kept the code as simple as possible rather than demonstrating the advanced features of Scala.

The following is a list of open source data sources that are worth exploring if you want to develop applications in this area:

UCI machine learning repository: https://archive.ics.uci.edu/ml/index.php

Kaggle datasets: https://www.kaggle.com/competitions

MLdata.org: http://mldata.org

Google Trends: http://www.google.com/trends/explore

The CIA World Factbook: https://www.cia.gov/library/publications/the-world-factbook/

 

See also

UCI machine learning repository: This is an extensive library with search functionality. At the time of writing, there were more than 350 datasets. You can go to https://archive.ics.uci.edu/ml/index.html to see all the datasets, or look for a specific set using a simple search (Ctrl + F).
Kaggle datasets: You need to create an account, but you can download any sets both for learning and for competing in machine learning competitions. https://www.kaggle.com/competitions provides details for exploring and learning more about Kaggle and the inner workings of machine learning competitions.
MLdata.org: A public site open to all, with a repository of datasets for machine learning enthusiasts.
Google Trends: You can find statistics on search volume (as a proportion of total search) for any given term since 2004 at http://www.google.com/trends/explore.
The CIA World Factbook: https://www.cia.gov/library/publications/the-world-factbook/ provides information on the history, population, economy, government, infrastructure, and military of 267 countries.

Other sources for machine learning data:
SMS spam data: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Financial dataset from Lending Club: https://www.lendingclub.com/info/download-data.action
Research data from Yahoo: http://webscope.sandbox.yahoo.com/index.php
Amazon AWS public datasets: http://aws.amazon.com/public-data-sets/
Labeled visual data from ImageNet: http://www.image-net.org
Census datasets: http://www.census.gov
Compiled YouTube dataset: http://netsg.cs.sfu.ca/youtubedata/
Collected rating data from the MovieLens site: http://grouplens.org/datasets/movielens/
Enron dataset available to the public: http://www.cs.cmu.edu/~enron/
Dataset for the classic book The Elements of Statistical Learning: http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html
IMDB movie dataset: http://www.imdb.com/interfaces
Million Song dataset: http://labrosa.ee.columbia.edu/millionsong/
Datasets for speech and audio: http://labrosa.ee.columbia.edu/projects/
Face recognition data: http://www.face-rec.org/databases/
Social science data: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies
Bulk datasets from Cornell University: http://arxiv.org/help/bulk_data_s3
Project Gutenberg datasets: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
Datasets from the World Bank: http://data.worldbank.org
Lexical database from WordNet: http://wordnet.princeton.edu
Collision data from the NYPD: http://nypd.openscrape.com/#/
Dataset for congressional roll calls and others: http://voteview.com/dwnl.htm
Large graph datasets from Stanford: http://snap.stanford.edu/data/index.html
Rich set of data from Datahub: https://datahub.io/dataset
Yelp's academic dataset: https://www.yelp.com/academic_dataset
Source of data from GitHub: https://github.com/caesar0301/awesome-public-datasets
Dataset archives from Reddit: https://www.reddit.com/r/datasets/
 

There are some specialized datasets (for example, text analytics in Spanish, and gene and IMF data) that might be of some interest to you:
Datasets from Colombia (in Spanish): http://www.datos.gov.co/frm/buscador/frmBuscador.aspx
Dataset from cancer studies: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
Research data from Pew: http://www.pewinternet.org/datasets/
Data from the state of Illinois, USA: https://data.illinois.gov
Data from freebase.com: http://www.freebase.com
Datasets from the UN and its associated agencies: http://data.un.org
International Monetary Fund datasets: http://www.imf.org/external/data.htm
UK government data: https://data.gov.uk
Open data from Estonia: http://pub.stat.ee/px-web.2001/Dialog/statfile1.asp
Many ML libraries in R containing data that can be exported as CSV: https://www.r-project.org
Gene expression datasets: http://www.ncbi.nlm.nih.gov/geo/
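To build the examples in this chapter with Maven, the dependencies below can be declared in the project's pom.xml; this particular setup uses the Scala 2.11 artifacts of Spark 2.2.2 together with JFreeChart 1.5.0.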

<dependencies>
    <!-- Spark core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.2.2</version>
    </dependency>
    <!-- Spark MLlib -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.2.2</version>
    </dependency>
    <!-- JFreeChart for plotting -->
    <dependency>
        <groupId>org.jfree</groupId>
        <artifactId>jfreechart</artifactId>
        <version>1.5.0</version>
    </dependency>
</dependencies>
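The first program below creates a local SparkSession, parallelizes two arrays of doubles into RDDs, zips them together into (x, y) pairs, prints the pairs, and then prints a few simple aggregates: the sum of x, the sum of y, the sum of the products x*y, and the number of pairs.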


 

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object MyFirstSpark20 {
  def main(args: Array[String]): Unit = {
    // Reduce Spark's console logging to errors only
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkSession running locally on all cores
    val session: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("myFirstSpark20")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

    val x = Array(1.0, 5.0, 8.0, 10.0, 15.0, 21.0, 27.0, 30.0, 38.0, 45.0, 50.0, 64.0)
    val y = Array(5.0, 1.0, 4.0, 11.0, 25.0, 18.0, 33.0, 20.0, 30.0, 43.0, 55.0, 57.0)

    // Distribute the two local arrays as RDDs
    val xRDD = session.sparkContext.parallelize(x)
    val yRDD = session.sparkContext.parallelize(y)

    // Pair the two RDDs element by element: (x1, y1), (x2, y2), ...
    val zipRDD = xRDD.zip(yRDD)
    for (elem <- zipRDD.collect()) {
      println(elem)
    }

    // Simple aggregations over the RDDs
    val xSum: Double = xRDD.sum()
    val ySum: Double = yRDD.sum()
    val xySum = zipRDD.map { case (a, b) => a * b }.sum()
    val count = zipRDD.count()

    println(xSum)
    println(ySum)
    println(xySum)
    println(count)

    session.stop()
  }
}
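To add graphics to a Spark program, the next example uses JFreeChart: it parallelizes a shuffled sequence of (value, index) pairs into an RDD, collects the pairs into an XYSeries, and displays the resulting line chart in a ChartFrame window.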
import java.awt.Color

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.jfree.chart.plot.{PlotOrientation, XYPlot}
import org.jfree.chart.{ChartFactory, ChartFrame, JFreeChart}
import org.jfree.data.xy.{XYSeries, XYSeriesCollection}

import scala.util.Random


object MyChart {

  def show(chart: JFreeChart): Unit = {
    val frame = new ChartFrame("plot", chart)
    frame.pack()
    frame.setVisible(true)
  }

  def configurePlot(plot: XYPlot): Unit = {
    plot.setBackgroundPaint(Color.WHITE)
    plot.setDomainGridlinePaint(Color.BLACK)
    plot.setRangeGridlinePaint(Color.BLACK)
    plot.setOutlineVisible(false)
  }

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.ERROR)

    // setup SparkSession to use for interactions with Spark
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myChart")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()


    // Create an RDD of (value, index) pairs from a shuffled sequence 1..15
    val data = spark.sparkContext.parallelize(Random.shuffle(1 to 15).zipWithIndex)

    data.foreach(println)

    // Collect the RDD back to the driver and load the pairs into a JFreeChart XY series
    val xy = new XYSeries("")
    data.collect().foreach { case (y: Int, x: Int) => xy.add(x, y) }
    val dataset = new XYSeriesCollection(xy)

    val chart = ChartFactory.createXYLineChart(
      "MyChart",  // chart title
      "x",               // x axis label
      "y",                   // y axis label
      dataset,                   // data
      PlotOrientation.VERTICAL,
      false,                    // include legend
      true,                     // tooltips
      false                     // urls
    )

    val plot = chart.getXYPlot()
    configurePlot(plot)
    show(chart)
    spark.stop()
  }
}

 

 

 

 

 

 

 

Origin blog.csdn.net/wangjunji34478/article/details/105594054