Article directory
- 0. Learning objectives of this lecture
- 1. Case study: Spark RDD implements word count
- (1) Case overview
- (2) Implementation steps
- 1. Create a new Spark project managed by Maven
- 2. Add Scala and Spark dependencies
- 3. Create a WordCount object
- 4. Analyze the program code
- 5. Compile and package the Spark project
- 6. Upload the Spark application to the master virtual machine
- 7. Start the HDFS service
- 8. Start the Spark cluster
- 9. Upload the word file to the specified directory in HDFS
- 10. Execute the WordCount program
- 11. View program execution results
0. Learning objectives of this lecture
- Spark RDD implements word count
- Spark RDD implements grouped TopN
- Spark RDD implements secondary sorting
- Spark RDD implements average score calculation
- Spark RDD implements inverted-index statistics of daily new users
- Spark RDD reads and writes HBase
- Spark RDD solves the data skew problem
1. Case study: Spark RDD implements word count
(1) Case overview
- Word counting is a classic introductory program for distributed computing, and it has many implementations, such as MapReduce; with the RDD operators provided by Spark, word counting can be implemented even more easily.
- Create a new Maven-managed Spark project in IntelliJ IDEA, write Spark's WordCount program in Scala in the project, then package the project and submit it to a Spark cluster (Standalone mode) to run.
(2) Implementation steps
1. Create a new Spark project managed by Maven
- Select File→New→Project... in IDEA. In the pop-up window, select the Maven item on the left, check the "Create from archetype" checkbox on the right, and select the org.scala-tools.archetypes:scala-archetype-simple item that appears below (this means the scala-archetype-simple template is used to build the Maven project).
- Fill in the GroupId and ArtifactId in the pop-up window, keep the default Version, and then click the Next button.
- In the pop-up window, select the path of the Maven installation's main directory, the path of the Maven configuration file settings.xml, and the path of the local Maven repository, and then click the Next button.
- In the pop-up window, the project name is WordCount, which is the value of ArtifactId set previously. Of course, it can also be modified. Then click the Finish button.
2. Add Scala and Spark dependencies
- Start spark-shell; you can see that Spark 2.4.4 uses Scala 2.11.12.
- Add the Scala 2.11.12 and Spark 2.4.4 dependencies in the pom.xml file, and add the Maven build plugins.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>net.hw.spark</groupId>
    <artifactId>WordCount</artifactId>
    <version>1.0-SNAPSHOT</version>
    <inceptionYear>2008</inceptionYear>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <!-- use the _2.11 artifact to match scala-library 2.11.12 -->
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.4</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- set the entry class of the Spark application -->
                            <mainClass>net.hw.spark.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.3.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
3. Create a WordCount object
- Create a WordCount object in the net.hw.spark package.
package net.hw.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Function: count the number of words
 * Author: Hua Wei
 * Date: April 17, 2022
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf object to store the application's configuration
    val conf = new SparkConf()
      .setAppName("Spark-WordCount")    // set the application name, shown in the Spark WebUI
      .setMaster("spark://master:7077") // set the access address of the cluster's Master node
    // Create a SparkContext object, the entry point for submitting a Spark application
    val sc = new SparkContext(conf)
    // Read the file at the given path (the first program argument) into an RDD
    val rdd: RDD[String] = sc.textFile(args(0))
    // Process the RDD
    rdd.flatMap(_.split(" "))   // split each element on spaces and merge the results into a new RDD
      .map((_, 1))              // put each word and the number 1 into a tuple, i.e. (word, 1)
      .reduceByKey(_ + _)       // aggregate by key, accumulating the values of identical keys
      .sortBy(_._2, false)      // sort by word count in descending order
      .saveAsTextFile(args(1))  // save the result to the given path (the second program argument)
    // Stop the SparkContext and end the job
    sc.stop()
  }
}
4. Analyze the program code
- The setMaster() method of the SparkConf object sets the URL to which the Spark application is submitted. For Standalone cluster mode, it is the access address of the Master node; for local (single-machine) mode, change the address to local, local[N], or local[*], meaning 1, N, or all available CPU cores respectively. Local mode can run programs directly in the IDE, with no Spark cluster required.
- Setting it here is not mandatory: if it is omitted, it must be specified with the --master parameter when submitting the program to the cluster with spark-submit.
- The SparkContext object initializes the core components required for the Spark application to run and is a very important object in the entire Spark application. The object named sc created by default after Spark Shell starts is this object.
- The textFile() method takes the path of the data source, which can be an external data source (HDFS, S3, etc.) or a local file system (Windows or Linux). The path can take three forms:
(1) A file path: for example textFile("/input/data.txt"), which reads only the specified file.
(2) A directory path: for example textFile("/input/words/"), which reads all files in the words directory, excluding subdirectories.
(3) A path containing wildcards: for example textFile("/input/words/*.txt"), which reads all TXT files in the words directory.
- This method splits the content of the read file by lines to form an RDD collection. Assuming the read file is words.txt, the specific data conversion process of the above code is shown in the following figure.
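The same data conversion process can be traced on ordinary Scala collections, which offer the same flatMap and map operators as RDDs. The sketch below uses hypothetical sample lines standing in for words.txt, and emulates reduceByKey with groupBy plus a per-key sum (on an RDD, reduceByKey does this merging across partitions):

```scala
// Hypothetical sample lines standing in for the contents of words.txt
val lines = Seq("hello spark", "hello scala", "hello world")

val counts = lines
  .flatMap(_.split(" "))  // split each line into words on spaces
  .map((_, 1))            // pair each word with the number 1, i.e. (word, 1)
  .groupBy(_._1)          // emulate reduceByKey: bring all pairs for a word together
  .map { case (word, ones) => (word, ones.map(_._2).sum) } // sum the 1s per word
  .toSeq
  .sortBy(-_._2)          // descending by count, like sortBy(_._2, false)

println(counts) // (hello,3) comes first; the three count-1 words follow
```

Each stage of this pipeline corresponds to one arrow in the conversion figure: lines become words, words become (word, 1) tuples, tuples are merged by key, and the totals are sorted.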
5. Compile and package the Spark project
- Expand the Maven Projects window on the right side of IDEA and double-click the package item to compile and package the WordCount project.
- Two jar packages are generated, one without dependencies and one with dependencies; we use the jar package without dependencies.
6. Upload the Spark application to the master virtual machine
- Upload WordCount-1.0-SNAPSHOT.jar to the /home/howard directory on the master virtual machine.
7. Start the HDFS service
- Execute the command: start-dfs.sh
8. Start the Spark cluster
- Execute the command: $SPARK_HOME/sbin/start-all.sh
9. Upload the word file to the specified directory in HDFS
- Create the word file words.txt and upload it to the /wordcount directory in HDFS.
10. Execute the WordCount program
(1) Submit the application to the cluster to run
- Execute the command:
spark-submit --master spark://master:7077 --class net.hw.spark.WordCount WordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount hdfs://master:9000/wordcount_output
(2) Command parameter analysis
- --master: the access address of the Spark Master node. Since the address has already been specified by the setMaster() method in the WordCount program, this parameter can be omitted.
- --class: the fully qualified name of the main class of the Spark WordCount program (package name + class name).
- hdfs://master:9000/wordcount: the source path of the word data. All files under this path will participate in the statistics.
- hdfs://master:9000/wordcount_output: the output path of the statistics result. As with MapReduce, this directory must not exist in advance; Spark will create it automatically.
(3) Spark WebUI interface to view application information
- While the application is running, you can visit Spark's WebUI at http://master:8080/ to view the status of running applications (completed applications can also be viewed there).
- You can see an application named Spark-WordCount running; this is the value set by setAppName("Spark-WordCount") in the WordCount program.
- While the application is running, you can also visit Spark's WebUI at http://master:4040/ to view the status of running jobs, including the job ID, job description, job running time, the number of stages that have run, the total number of stages, the number of tasks the job has run, etc. (Once the job finishes, this interface is no longer accessible.)
- Clicking the hyperlink in the rectangular box jumps to the job details page, which displays the running stages (Active Stages) and the stages waiting to run (Pending Stages), including the stage ID, stage description, stage submission time, stage running time, the number of tasks included in the stage, the number of tasks already run, etc.
- Click the DAG Visualization hyperlink in the rectangular box to view the DAG visualization of this job.
- You can see that this job is divided into two stages. Since the reduceByKey() operation produces a wide dependency, the division is made before the reduceByKey() operation.
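Why reduceByKey() forces a stage boundary can be seen in miniature with plain Scala collections: flatMap and map work element by element, so they can run inside each partition independently (a narrow dependency), while reduceByKey must first bring together every pair with the same key from all partitions (a wide dependency, which is what the shuffle does). A sketch over two hypothetical "partitions" of (word, 1) pairs:

```scala
// Two hypothetical partitions of (word, 1) pairs as they exist before the shuffle
val partition0 = Seq(("hello", 1), ("spark", 1))
val partition1 = Seq(("hello", 1), ("scala", 1))

// Narrow dependency: a map can run inside each partition independently
val upper = Seq(partition0, partition1).map(_.map { case (w, n) => (w.toUpperCase, n) })

// Wide dependency: reduceByKey first has to regroup every pair with the same
// key across all partitions (this cross-partition regrouping is the shuffle)...
val shuffled = (partition0 ++ partition1).groupBy(_._1)

// ...and only then can it sum the values for each key
val counts = shuffled.map { case (w, ones) => (w, ones.map(_._2).sum) }

println(counts) // hello -> 2, spark -> 1, scala -> 1 (map order may vary)
```

Everything up to and including map((_, 1)) stays within Stage 0; the regrouping step is where Spark cuts the job into a second stage.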
11. View program execution results
- Execute the command: hdfs dfs -ls /wordcount_output to view the generated result files.
- As you can see, like MapReduce, Spark generates multiple files in the results directory: _SUCCESS is the execution status file, and the result data is stored in the files part-00000 and part-00001.
- Execute the command: hdfs dfs -cat /wordcount_output/* to view the data in the result files.
- At this point, the Scala-language Spark version of the WordCount program has run successfully.