Spark Basics

Table of contents

1. Introduction to Spark

1.1. What is Spark?

1.2. Characteristics of Spark

1.3. Spark Ecosystem

1.4. Components of Spark Core

1.5. Spark installation process

1.5.1. Basic environment: install a Linux system, the Java environment, and the Hadoop environment

1.5.2. Download the Spark package and decompress it

1.5.3. Edit the profile

1.5.4. Run spark-shell

2. Spark cluster construction

2.1. Spark deployment mode

2.2. Why choose Spark On YARN

2.3. Spark On YARN mode

2.4. Start Spark

3. Scala basic syntax, methods, and functions

3.1. Basic Scala syntax: variable definitions, conditionals, and loops

3.1.1. Variable definition

3.1.2. If expression

3.1.3. For loop

3.1.4. Do-while loop

3.2. Mutable and immutable collections

3.2.1. Basic syntax of immutable collections

3.2.2. Basic syntax of mutable collections

3.3. Mutable and immutable arrays

3.3.1. Basic syntax of immutable arrays

3.3.2. Basic syntax of mutable arrays

3.4. Mutable and immutable lists

3.4.1. Basic syntax of immutable List

3.4.2. Basic syntax of mutable List

3.5. Tuples

3.6. Mutable and immutable maps (Map)

3.6.1. Basic syntax of immutable Map

3.6.2. Basic syntax of mutable Map

4. Object-oriented programming

4.1. What is object-oriented programming?

4.2. What is a class?

4.3. What is an object?

4.4. The basic structure of classes and objects

4.5. Constructor

4.6. Case classes and case objects

5. Spark RDD programming

5.1. RDD transformation operations

5.2. RDD action operations

5.3. RDD partitions

5.4. RDD caching (persistence)

5.5. Key-value pair RDDs: operations, reading, and writing

5.6. Experiment

6. Spark SQL

6.1. What is a DataFrame?

6.2. Conversion between DataFrame, Dataset, and RDD

6.2.1. From RDD to DataFrame

6.2.2. From DataFrame to RDD


1. Introduction to Spark

1.1. What is Spark?

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing.

1.2. Characteristics of Spark

1. Fast: Spark supports iterative computation on data held in memory.

2. Easy to use: applications can be written in Scala, Java, Python, and other languages, with concise syntax.

3. General purpose: the Spark ecosystem contains a rich set of components.

4. Runs everywhere: Spark is highly adaptable and can access many different data sources.

1.3. Spark Ecosystem

The Spark ecosystem is built around Spark Core. It reads data from HDFS, Amazon S3, and HBase, and uses Mesos, YARN, or Spark's own Standalone manager as the resource manager to schedule jobs and run application computations. The applications come from different components: batch processing with Spark Shell / spark-submit, real-time processing with Spark Streaming, queries with Spark SQL, machine learning with MLlib, graph processing with GraphX, and so on.

1.4. Components of Spark Core

Spark Core is the core of the Spark framework. It implements Spark's basic functionality, including the modules for task scheduling, memory management, fault recovery, and interaction with storage systems.

1) It provides a distributed parallel computing framework based on the directed acyclic graph (DAG), together with a cache mechanism that supports repeated iterative computation and data sharing, greatly reducing the cost of reading data between iterations.

2) The RDD introduced by Spark is a collection of read-only objects distributed across multiple compute nodes. These collections are resilient: if part of a dataset is lost, it can be rebuilt from its "lineage", which guarantees high fault tolerance.

3) It moves computation to the data rather than moving the data: each RDD partition reads a block of the distributed file system into the memory of its node for computation.

1.5. Spark installation process

1.5.1. Basic environment: install a Linux system, the Java environment, and the Hadoop environment

1.5.2. Download the Spark package and decompress it

tar -zxvf spark-3.0.3-bin-hadoop2.7.tgz

1.5.3. Edit the profile

vim /etc/profile

Append the Spark configuration below; be careful not to overwrite the existing content:

export SPARK_HOME=/home/spark/spark-3.0.3-bin-hadoop2.7
export PATH=$PATH:${SPARK_HOME}/bin

Finally, refresh the configuration

source /etc/profile

1.5.4. Run spark-shell

After the Spark shell starts successfully, you will see the command prompt "scala>" at the end of the output.

Use the command ":quit" to exit the Spark shell, or press the "Ctrl + D" key combination.
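
As a quick sanity check, you can run a small computation directly in the shell; spark-shell already provides a SparkContext bound to the name sc:

val rdd = sc.parallelize(1 to 100)
println(rdd.sum())   // 5050.0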


2. Spark cluster construction

2.1. Spark deployment mode

1) Single-machine (Local) mode: Spark runs on a single machine.

2) Pseudo-distributed (Standalone) mode: uses the simple cluster manager that ships with Spark.

3) Distributed mode: Spark On YARN mode uses YARN as the cluster manager; Spark On Mesos mode uses Mesos as the cluster manager.

2.2. Why choose Spark On YARN

Setting up the Spark On YARN mode is relatively simple: you only need to install Spark on one node of the YARN cluster, and that node then serves as the client for submitting Spark applications to the cluster.

2.3. Spark On YARN mode

Spark On YARN offers two deploy modes, client and cluster; the main difference is where the Driver program runs.

client: the Driver runs on the client machine. This mode is suitable for interactive use and debugging, when you want to see the application's output immediately.

cluster: the Driver runs inside the ApplicationMaster (AM) launched by the ResourceManager (RM). This mode is suitable for production environments.

2.4. Start Spark

spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
/home/spark/spark-3.0.3-bin-hadoop2.7/examples/jars/spark-examples_2.12-3.0.3.jar

3. Scala basic syntax, methods, and functions

3.1. Basic Scala syntax: variable definitions, conditionals, and loops

3.1.1. Variable definition
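
In Scala, val defines an immutable reference and var defines a mutable one; types can be written explicitly or left to type inference. A minimal example:

val appName: String = "Spark"   // immutable; reassigning it is a compile-time error
var count = 0                   // mutable; the type Int is inferred
count = count + 1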

3.1.2. If expression
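
In Scala, if is an expression that returns a value, so it can appear on the right-hand side of a definition. For example:

val score = 75
val grade = if (score >= 60) "pass" else "fail"
println(grade)   // pass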

3.1.3. For loop
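
A for loop iterates over a range or a collection; "to" includes the upper bound, "until" excludes it, and a guard can filter elements. For example:

for (i <- 1 to 3) println(i)                   // 1, 2, 3
for (i <- 1 until 3) println(i)                // 1, 2
for (i <- 1 to 10 if i % 2 == 0) println(i)    // even numbers only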

3.1.4. Do-while loop
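
A do-while loop runs its body once before checking the condition. For example:

var n = 0
do {
  println(n)
  n += 1
} while (n < 3)   // prints 0, 1, 2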

 

3.2. Mutable and immutable collections

Mutable collections can be updated or extended in place: their elements can be added, modified, and removed. In short, the collection itself can change dynamically.

Immutable collections, in contrast, never change. Add, remove, and update operations can still be simulated, but each of them returns a new collection and leaves the original collection unchanged. In short, the collection itself cannot change dynamically.

3.2.1. Basic syntax of immutable collections

var/val variableName = Set[ElementType]()
var/val variableName = Set(element1, element2, element3, ...)
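
For instance, with the default immutable Set (no import required):

val nums = Set(1, 2, 3)
val more = nums + 4          // returns a new Set(1, 2, 3, 4); nums is unchanged
println(more.contains(4))    // true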

3.2.2. Basic syntax of mutable collections

A mutable set is one whose elements and length can change. It is created the same way as an immutable set, but the mutable collection class must be imported first, as the sketch below shows.
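
A small sketch of a mutable Set (note the required import):

import scala.collection.mutable.Set

val s = Set(1, 2, 3)
s += 4         // add an element in place
s -= 1         // remove an element in place
println(s)     // e.g. Set(2, 3, 4); Set ordering is not guaranteed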

3.3. Mutable and immutable arrays

3.3.1. Basic syntax of immutable arrays

var/val variableName = new Array[ElementType](arrayLength)
var/val variableName = Array(element1, element2, ...)

3.3.2. Basic syntax of mutable arrays

var/val variableName = new ArrayBuffer[ElementType](initialLength)
var/val variableName = ArrayBuffer(element1, element2, ...)
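
A brief example contrasting the two (ArrayBuffer requires an import):

import scala.collection.mutable.ArrayBuffer

val fixed = Array(1, 2, 3)       // fixed length; elements can still be updated
fixed(0) = 10
val buf = ArrayBuffer(1, 2, 3)   // variable length
buf += 4                         // append in place
buf.remove(0)                    // remove the first element
println(buf.mkString(", "))      // 2, 3, 4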

3.4. Mutable and immutable lists

For an immutable List, both the elements and the length are fixed; a mutable list (ListBuffer) can be modified in place.

3.4.1. Basic syntax of immutable List

val/var variableName = List(element1, element2, element3, ...)

3.4.2. Basic syntax of mutable List

val/var variableName = ListBuffer[DataType]()
val/var variableName = ListBuffer(element1, element2, element3, ...)
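
For example (ListBuffer lives in the mutable package and needs an import):

import scala.collection.mutable.ListBuffer

val xs = List(1, 2, 3)         // immutable
val ys = 0 :: xs               // prepends, returning a new List(0, 1, 2, 3)
val lb = ListBuffer(1, 2, 3)   // mutable
lb += 4                        // append in place
println(lb)                    // ListBuffer(1, 2, 3, 4)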

3.5. Tuples

val/var tupleName = (element1, element2, element3, ...)
val/var tupleName = element1 -> element2
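
Tuple elements are accessed with ._1, ._2, and so on; the arrow syntax only builds a two-element tuple (a pair):

val t = ("spark", 3, true)
println(t._1)          // spark
println(t._2)          // 3
val pair = "a" -> 1    // equivalent to ("a", 1)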

3.6. Mutable and immutable maps (Map)

3.6.1. Basic syntax of immutable Map

val/var map = Map(key -> value, key -> value, key -> value, ...)
val/var map = Map((key, value), (key, value), (key, value), ...)

3.6.2. Basic syntax of mutable Map

The definition syntax is the same as for an immutable Map, but you must first import the package: import scala.collection.mutable.Map
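
A short sketch of both kinds of Map:

val m = Map("a" -> 1, "b" -> 2)   // immutable
val m2 = m + ("c" -> 3)           // returns a new Map; m is unchanged

import scala.collection.mutable.Map
val mm = Map("a" -> 1)
mm("b") = 2                       // add or update in place
println(mm)                       // e.g. Map(a -> 1, b -> 2)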

4. Object-oriented programming

4.1. What is object-oriented programming?

Object-oriented programming is a programming paradigm. It builds on procedural programming and organizes a program around objects, that is, instances of classes.

4.2. What is a class?

A class is a collection of attributes and behaviors; it is an abstract concept.

4.3. What is an object?

Objects are concrete instances of classes.

4.4. The basic structure of classes and objects

Syntax for creating a class: class ClassName { attributes and behaviors }. Note: if the class is empty and has no members, the braces {} can be omitted.

Syntax for creating an object: val objectName = new ClassName()
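
A minimal sketch of a class and an object created from it:

class Student {
  var name: String = ""        // attribute
  def sayHello(): Unit = {     // behavior
    println(s"Hello, I am $name")
  }
}

val stu = new Student()
stu.name = "Tom"
stu.sayHello()   // Hello, I am Tom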

4.5. Constructor

When an object is created, the class's constructor is called automatically. So far we have only used the default primary constructor. Besides the primary constructor, auxiliary constructors can be defined as needed; any constructor other than the primary constructor is called an auxiliary constructor.

Classification: 1. Primary constructor. 2. Auxiliary constructors.
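
A small sketch showing a primary constructor and one auxiliary constructor (an auxiliary constructor is defined with this and must first call another constructor):

class Person(val name: String, val age: Int) {   // primary constructor
  def this(name: String) = this(name, 0)         // auxiliary constructor
}

val p1 = new Person("Alice", 30)
val p2 = new Person("Bob")   // uses the auxiliary constructor; age defaults to 0
println(p2.age)              // 0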

4.6. Case classes and case objects

Case class: in Scala, a case class is a special kind of class that is generally used to hold data. It is widely used in concurrent programming and in frameworks such as Spark and Flink.

Case object: in Scala, a singleton object declared with the case keyword is called a case object; it has no primary constructor.
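
A brief sketch (a case class gets apply, copy, and value-based equality generated automatically; case objects are often used as markers or messages):

case class Point(x: Int, y: Int)
case object Start

val p = Point(1, 2)         // no 'new' needed: the compiler generates apply
println(p == Point(1, 2))   // true: equality compares values
val q = p.copy(y = 5)       // Point(1, 5)
println(Start)              // Start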


5. Spark RDD programming

5.1. RDD transformation operations

filter(func): returns a new dataset containing the elements for which the function func returns true

map(func): passes each element through the function func and returns the results as a new dataset

flatMap(func): similar to map(), but each input element can be mapped to zero or more output elements

groupByKey(): when applied to a dataset of (K, V) key-value pairs, returns a new dataset of (K, Iterable[V]) pairs

reduceByKey(func): when applied to a dataset of (K, V) key-value pairs, returns a new dataset of (K, V) pairs in which the values for each key are aggregated with the function func
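
A short sketch exercising these transformations, assuming a SparkContext is available as sc (as in spark-shell); nothing is computed until an action such as collect() is called:

val nums = sc.parallelize(List(1, 2, 3, 4))
val even = nums.filter(_ % 2 == 0)                       // keeps 2 and 4
val doubled = nums.map(_ * 2)                            // 2, 4, 6, 8
val words = sc.parallelize(List("a b", "b c")).flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // (a,1), (b,2), (c,1)
counts.collect().foreach(println)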

5.2. RDD action operations

count(): returns the number of elements in the dataset

collect(): returns all elements in the dataset as an array

first(): returns the first element in the dataset

take(n): returns the first n elements in the dataset as an array

reduce(func): aggregates the elements of the dataset using the function func, which takes two arguments and returns one value

foreach(func): applies the function func to each element of the dataset
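
A quick illustration of these actions on a small RDD (again assuming sc is available):

val data = sc.parallelize(List(5, 3, 8))
println(data.count())           // 3
println(data.first())           // 5
data.take(2).foreach(println)   // 5, 3
println(data.reduce(_ + _))     // 16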

  

5.3. RDD partitions

An RDD is a resilient distributed dataset. An RDD is usually very large, so it is divided into many partitions, which are stored on different nodes.
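
The number of partitions can be requested when an RDD is created, and inspected or changed afterwards. A small sketch:

val rdd = sc.parallelize(1 to 100, 4)   // request 4 partitions
println(rdd.getNumPartitions)           // 4
val rdd2 = rdd.repartition(8)           // reshuffles the data into 8 partitions
println(rdd2.getNumPartitions)          // 8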

5.4. RDD caching (persistence)

In Spark, RDDs use a lazy evaluation mechanism: transformations are not executed when they are declared, and every call to an action triggers a computation from scratch. This is expensive for iterative algorithms, which often need to reuse the same intermediate data many times.

The point of caching: the second and subsequent actions can simply reuse the values cached by the first action, avoiding repeated computation.
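
A minimal sketch of caching (cache() is shorthand for persist() with the default MEMORY_ONLY storage level; the input path is hypothetical):

val logs = sc.textFile("data.txt")                      // hypothetical input file
val errors = logs.filter(_.contains("ERROR")).cache()
println(errors.count())     // first action: computes the RDD and caches it
println(errors.count())     // second action: served from the cache
errors.unpersist()          // release the cached data when it is no longer needed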

5.5. Key-value pair RDDs: operations, reading, and writing

1) Creating a key-value pair RDD

val pairRDD = lines.flatMap(line => line.split(" ")).map(word => (word, 1))
pairRDD.foreach(println)

2) Reading data from a file to create an RDD

val textFile = sc.textFile(".....")

5.6. Experiment

1. There is a local file word.txt containing many lines of text; each line consists of several words separated by spaces. The statements below can be used for word-frequency statistics (that is, to count how many times each word occurs).
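
A minimal word-count sketch, assuming sc is available and using an illustrative local path for word.txt:

val lines = sc.textFile("file:///home/spark/word.txt")   // hypothetical path to word.txt
val wordCounts = lines
  .flatMap(line => line.split(" "))    // split each line into words
  .map(word => (word, 1))              // pair each word with an initial count of 1
  .reduceByKey(_ + _)                  // sum the counts for each word
wordCounts.collect().foreach(println)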

2. Write records to different files according to the last digit of the key.

package com.qst.rdd

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// A custom partitioner must extend the org.apache.spark.Partitioner class
class MyPartitioner(numParts: Int) extends Partitioner {

  // override the number of partitions
  override def numPartitions: Int = numParts

  // override the function that maps a key to a partition number
  override def getPartition(key: Any): Int = {
    key.toString.toInt % 10
  }
}

object MyPartitioner {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("persistDemo")
    val sc = new SparkContext(conf)
    // simulate data spread over 5 partitions
    val data = sc.parallelize(1 to 10, 5)
    // turn each element into the form (element, 1)
    data.map((_, 1))
      // MyPartitioner takes the number of partitions; passing 10 turns the original 5 partitions into 10
      .partitionBy(new MyPartitioner(10))
      // map each tuple back to its first element,
      // i.e. keep only the element and drop the 1 added before partitioning
      .map(_._1)
      // save the result to HDFS with saveAsTextFile;
      // Spark automatically writes one output file per partition
      .saveAsTextFile("hdfs://192.168.74.80:9000/output6")
    sc.stop()
  }
}

6. Spark SQL

6.1. What is a DataFrame?

Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: in addition to the data itself, it also records the data's structural information, that is, its schema.
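
For example, a DataFrame built from a few rows carries column names and types that can be inspected with printSchema(). This assumes a SparkSession is available as spark (as it is in spark-shell); the column names are illustrative:

import spark.implicits._
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
df.printSchema()   // root |-- name: string |-- age: integer
df.show()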

6.2. Conversion between DataFrame, Dataset, and RDD

6.2.1. From RDD to DataFrame

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object SparkSQLDemo03 {
  // case class describing one record
  case class Person(id: Int, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    // preparation: create the SparkSession
    val spark = SparkSession.builder().appName(this.getClass.getName).master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("WARN")
    // 1. convert through the case class
    val linesRDD = sc.textFile("file/person.txt")
    // 1.1. RDD[String] => RDD[Person]
    val personRDD: RDD[Person] = linesRDD.map(x => {
      val arr = x.split(",")
      Person(arr(0).toInt, arr(1), arr(2).toInt)
    })
    import spark.implicits._ // implicit conversions for toDF / toDS
    // 1.2. RDD + case class => DataFrame / Dataset
    val personDF: DataFrame = personRDD.toDF()
    val personDS: Dataset[Person] = personRDD.toDS()
    personDF.show()
    personDS.show()
    // stop the SparkSession
    spark.stop()
  }
}

6.2.2. From DataFrame to RDD

// pDF here is an existing DataFrame (for example, the personDF built in the previous section)
val rdd = pDF.rdd
println(rdd)                     // prints the RDD reference
rdd.collect().foreach(println)   // prints each Row
// stop the SparkSession
spark.stop()
