Spark basics study notes: Spark RDD case analysis

1. Case analysis: Spark RDD implements word counting

(1) Case overview

Word counting is a classic entry-level program for learning distributed computing, and there are many ways to implement it, such as MapReduce; with the RDD operators provided by Spark, word counting can be implemented far more concisely.
In IntelliJ IDEA, create a new Maven-managed Spark project, write Spark's WordCount program in Scala, and finally package the project and submit it to a Spark cluster (Standalone mode) to run.
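
As a taste of how concise the RDD API is, the same count can be run interactively in spark-shell (a minimal sketch; the input path is a placeholder, and sc is the SparkContext that spark-shell creates automatically):

sc.textFile("hdfs://master:9000/wordcount") // read the text files into an RDD
  .flatMap(_.split(" "))                    // split each line into words
  .map((_, 1))                              // pair each word with the number 1
  .reduceByKey(_ + _)                       // sum the 1s for identical words
  .take(10)                                 // fetch the first ten (word, count) pairs
  .foreach(println)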

(2) Implementation steps

1. Create a new Spark project managed by Maven

Select File→New→Project... in IDEA. In the pop-up window, select the Maven item on the left, check the Create from archetype check box on the right, and select the org.scala-tools.archetypes:scala-archetype-simple item that appears below (meaning the scala-archetype-simple template is used to build the Maven project).

Fill in the GroupId and ArtifactId in the pop-up window, keep the default setting for Version, and click the Next button.

2. Add Scala and Spark dependencies

Start spark-shell; you can see that Spark 2.4.4 uses Scala 2.11.12.
Add the Scala 2.11.12 and Spark 2.4.4 dependencies in the pom.xml file, and add the Maven build plugins:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>net.py.spark</groupId>
    <artifactId>WordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.4</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- Set the entry (main) class of the Spark application -->
                            <mainClass>net.py.spark.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.3.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

3. Create a WordCount object

Create a WordCount object in the net.py.spark package.

package net.py.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf object to store the application's configuration
    val conf = new SparkConf()
      .setAppName("Spark-WordCount") // Set the application name, shown in the Spark WebUI
      .setMaster("spark://master:7077") // Set the access address of the cluster's Master node
    // Create a SparkContext object, the entry point for submitting a Spark application
    val sc = new SparkContext(conf)
    // Read the file content at the given path (the first program argument) into an RDD
    val rdd: RDD[String] = sc.textFile(args(0))
    // Process the RDD
    rdd.flatMap(_.split(" ")) // Split each element of the RDD on spaces and flatten the results into a new RDD
      .map((_, 1)) // Pair each word with the number 1, i.e. (word, 1)
      .reduceByKey(_ + _) // Aggregate by key, summing the values of identical keys
      .sortBy(_._2, false) // Sort by word count in descending order
      .saveAsTextFile(args(1)) // Save the result to the given path (the second program argument)
    // Stop the SparkContext to end the job
    sc.stop()
  }
}
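
For quick debugging inside IDEA before packaging (a common variation, not part of the original cluster flow), the hard-coded Master address can be replaced with local mode; the rest of the program stays unchanged:

val conf = new SparkConf()
  .setAppName("Spark-WordCount")
  .setMaster("local[*]") // run locally, one worker thread per CPU core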

4. Upload the Spark application to the master virtual machine

Upload WordCount-1.0-SNAPSHOT.jar to the /home/py directory of the master virtual machine.
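
The jar comes out of the Maven build; a minimal sketch of building and copying it over (run from the project root; the root@master login and the /home/py directory match the environment used below, adjust if yours differs):

mvn clean package
scp target/WordCount-1.0-SNAPSHOT.jar root@master:/home/py

Note that the assembly plugin additionally produces a WordCount-1.0-SNAPSHOT-jar-with-dependencies.jar; the plain jar is usually sufficient here, since the Standalone cluster already provides the Spark and Scala runtimes.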

5. Start the HDFS service

Execute the command: start-dfs.sh
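
As an optional sanity check (assuming a typical layout where the master runs the NameNode and the workers run DataNodes), the HDFS daemons can be listed by executing the command: jps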

6. Start the Spark cluster

Execute the command: $SPARK_HOME/sbin/start-all.sh
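
Running jps again should now additionally show a Master process on the master node and a Worker process on each worker node.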

7. Upload the word file to the specified directory of HDFS

Create the word file word.txt.

Upload it to the /wordcount directory of HDFS.
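
A minimal sketch of the upload, assuming word.txt sits in the current local directory:

hdfs dfs -mkdir -p /wordcount
hdfs dfs -put word.txt /wordcount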

8. Execute the WordCount program

(1) Submit the application to run in the cluster

Execute the command:

[root@master home]# spark-submit --master spark://master:7077 --class net.py.spark.WordCount WordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount hdfs://master:9000/wordcount_output

(2) Command parameter analysis

--master: the access address of the Spark Master node. Since the address is already specified by the setMaster() method in the WordCount program, this parameter can be omitted.
--class: the fully qualified name (package name + class name) of the main class of the Spark WordCount program.
hdfs://master:9000/wordcount: the source path for the word count. All files under this path take part in the counting.
hdfs://master:9000/wordcount_output: the output path for the results. As with MapReduce, this directory must not exist in advance; Spark will create it automatically. The commands below show how to inspect the result files.
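
Once the job completes, the result files can be inspected with the HDFS shell (part-* is Spark's default naming for saveAsTextFile output):

hdfs dfs -ls /wordcount_output
hdfs dfs -cat /wordcount_output/part-*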

(3) View application information on the Spark WebUI interface

While the application is running, you can visit Spark's WebUI at http://master:8080/ to view the status of the running program (completed applications can also be viewed there).
