Spark basic study notes 22: Spark RDD case analysis

0. Learning objectives of this lecture

  1. Spark RDD implementation of word count
  2. Spark RDD implementation of grouping to find the Top N
  3. Spark RDD implementation of secondary sorting
  4. Spark RDD implementation of average score calculation
  5. Spark RDD implementation of inverted index statistics for daily new users
  6. Reading and writing HBase with Spark RDD
  7. Solving data skew problems with Spark RDD

1. Case study: Spark RDD implements word count

(1) Case overview

  • Word counting is an introductory program for learning distributed computing, and it has many implementations, such as MapReduce; with the RDD operators provided by Spark, word counting can be implemented more easily.
  • Create a new Maven-managed Spark project in IntelliJ IDEA, write Spark's WordCount program in the project using the Scala language, then package the project and submit it to the Spark cluster (Standalone mode) to run.

(2) Implementation steps

1. Create a new Spark project managed by Maven

  • Select File→New→Project... in IDEA. In the pop-up window, select the Maven item on the left, check the Create from archetype checkbox on the right, and select the org.scala-tools.archetypes:scala-archetype-simple item that appears below (this means the scala-archetype-simple template is used to build the Maven project).

  • Fill in the GroupId and ArtifactId in the pop-up window, keep the default Version, and then click the Next button

  • In the pop-up window, select the path of the Maven installation's main directory, the path of the Maven configuration file settings.xml, and the path of the local Maven repository, and then click the Next button

  • In the pop-up window, the project name is WordCount, which is the value of ArtifactId set previously (it can also be changed here); then click the Finish button.

2. Add Scala and Spark dependencies

  • Start spark-shell; you can see that Spark 2.4.4 uses Scala 2.11.12
  • Add the Scala 2.11.12 and Spark 2.4.4 dependencies and the Maven build plugins to the pom.xml file
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>net.hw.spark</groupId>
  <artifactId>WordCount</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.11.12</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.4</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <!-- Set the entry (main) class of the Spark application -->
              <mainClass>net.hw.spark.WordCount</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.3.2</version>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <phase>process-resources</phase>
            <goals>
              <goal>add-source</goal>
              <goal>compile</goal>
            </goals>
          </execution>
          <execution>
            <id>scala-test-compile</id>
            <phase>process-test-resources</phase>
            <goals>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

3. Create a WordCount object

  • Create a WordCount object in the net.hw.spark package
package net.hw.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Function: count the number of words
  * Author: 华卫
  * Date: April 17, 2022
  */
object WordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf object to hold the application's configuration
    val conf = new SparkConf()
      .setAppName("Spark-WordCount") // set the application name shown in the Spark WebUI
      .setMaster("spark://master:7077") // set the access address of the cluster's Master node
    // Create a SparkContext object, the entry point for submitting a Spark application
    val sc = new SparkContext(conf)
    // Read the file at the given path (the first program argument) into an RDD
    val rdd: RDD[String] = sc.textFile(args(0))
    // Process the RDD
    rdd.flatMap(_.split(" ")) // split each element on spaces and flatten the results into a new RDD
      .map((_, 1)) // pair each word with the number 1, i.e. (word, 1)
      .reduceByKey(_ + _) // aggregate by key, summing the counts of identical words
      .sortBy(_._2, false) // sort by word count in descending order
      .saveAsTextFile(args(1)) // save the result to the given path (the second program argument)
    // Stop the SparkContext and end the job
    sc.stop()
  }
}

4. Analyze the program code

  • The setMaster() method of the SparkConf object sets the URL to which the Spark application is submitted. In Standalone cluster mode it is the access address of the Master node; in local (single-machine) mode the address must be changed to local, local[N], or local[*], which mean using 1, N, or all available CPU cores, respectively. In local mode the program can be run directly in the IDE, with no Spark cluster required (see the sketch after this list).
  • Setting it here is not strictly necessary: if it is omitted, the master URL must be specified with the --master parameter of spark-submit when the program is submitted to the cluster.
  • The SparkContext object initializes the core components required for the Spark application to run and is a very important object in the entire application. The object named sc that is created by default after spark-shell starts is an instance of this class.
  • The textFile() method takes the path of the data source, which can be an external data source (HDFS, S3, etc.) or a local file system (Windows or Linux). The path can be given in three ways:
    (1) File path: e.g. textFile("/input/data.txt") reads only the specified file.
    (2) Directory path: e.g. textFile("/input/words/") reads all files in the words directory, excluding subdirectories.
    (3) Path with wildcards: e.g. textFile("/input/words/*.txt") reads all TXT files in the words directory.
  • This method splits the content of the read file into lines and forms an RDD. Assuming the input file is words.txt, the data conversion performed by the code above is illustrated by the local sketch below.
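  • The following minimal local-mode sketch (not part of the original project; the three sample lines are hypothetical) traces the same operator chain on in-memory data and can be run directly in the IDE without a cluster.

// A local-mode sketch of the WordCount operator chain on hypothetical sample data
package net.hw.spark

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocalSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs the job inside the IDE using all available CPU cores; no cluster is needed
    val conf = new SparkConf().setAppName("WordCount-Local").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Hypothetical content of words.txt, one element per line
    val lines = sc.parallelize(Seq("hello spark", "hello hadoop", "hello world"))
    lines.flatMap(_.split(" "))   // hello, spark, hello, hadoop, hello, world
      .map((_, 1))                // (hello,1), (spark,1), (hello,1), (hadoop,1), (hello,1), (world,1)
      .reduceByKey(_ + _)         // (hello,3), (spark,1), (hadoop,1), (world,1)
      .sortBy(_._2, false)        // (hello,3) first, then the words with count 1
      .collect()                  // bring the small result back to the driver instead of saving to HDFS
      .foreach(println)
    sc.stop()
  }
}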

5. Compile and package the Spark project

  • Expand the Maven Projects window on the right side of IDEA, double-click the package item, and compile and package the WordCount project
  • Packaging generates two jar files, one without dependencies and one with dependencies; we use the jar without dependencies
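  • The jar without dependencies is sufficient here because the Spark cluster already provides the Spark and Scala libraries at runtime. An optional refinement (a sketch, not part of the pom above) is to declare such dependencies with provided scope, which also keeps the jar-with-dependencies small:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.4</version>
  <!-- provided: supplied by the cluster at runtime, so it is not packaged into the assembly jar -->
  <scope>provided</scope>
</dependency>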

6. Upload the Spark application to the master virtual machine

  • Upload WordCount-1.0-SNAPSHOT.jar to the /home/howard directory on the master virtual machine

7. Start the HDFS service

  • Execute the command: start-dfs.sh
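  • Optionally, verify that the HDFS daemons are up by executing the command: jps (on a typical layout the master shows NameNode and SecondaryNameNode and the worker nodes show DataNode; the exact process list depends on your cluster configuration)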

8. Start the Spark cluster

  • Execute the command: $SPARK_HOME/sbin/start-all.sh

9. Upload the word file to the specified directory in HDFS

  • Create the word file words.txt
  • Upload it to the /wordcount directory in HDFS
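  • For reference, the upload can be done with standard HDFS commands; the sample content of words.txt below is hypothetical, substitute your own file:

# create a sample words.txt (hypothetical content)
cat > words.txt << EOF
hello spark
hello hadoop
hello world
EOF
# create the target directory in HDFS and upload the file
hdfs dfs -mkdir -p /wordcount
hdfs dfs -put words.txt /wordcount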

10. Execute the WordCount program

(1) Submit the application to the cluster to run

  • Execute the command: spark-submit --master spark://master:7077 --class net.hw.spark.WordCount WordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount hdfs://master:9000/wordcount_output

(2) Command parameter analysis

  • --master: the access URL of the Spark Master node. Since the URL has already been specified by the setMaster() method in the WordCount program, this parameter can be omitted here.
  • --class: the fully qualified name of the main class of the Spark WordCount program (package name plus class name).
  • hdfs://master:9000/wordcount: the source path of the word data. All files under this path participate in the count.
  • hdfs://master:9000/wordcount_output: the output path of the result. As with MapReduce, this directory must not exist beforehand; Spark creates it automatically.
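  • As a concrete illustration of the point above: if setMaster() were removed from the program, the master would instead be chosen at submission time. For example (a sketch assuming the same jar and input path; the output directory must not already exist):

# submit to the standalone cluster
spark-submit --master spark://master:7077 --class net.hw.spark.WordCount \
  WordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount hdfs://master:9000/wordcount_output

# or run the same jar locally with all available cores for a quick test
spark-submit --master local[*] --class net.hw.spark.WordCount \
  WordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount hdfs://master:9000/wordcount_output2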

(3) Spark WebUI interface to view application information

  • While the application is running, you can visit Spark's WebUI at http://master:8080/ to view the status information of running applications (completed applications can also be viewed)
  • It can be seen that an application named Spark-WordCount is running; this name is the value set by setAppName("Spark-WordCount") in the WordCount program.
  • While the application is running, you can also visit Spark's WebUI at http://master:4040/ to view the status of the running jobs, including the job ID, description, running time, the number of stages currently running, the total number of stages, the number of tasks already completed, and so on. (Once the job finishes, this interface is no longer accessible.)
  • Clicking the hyperlink in the rectangular box jumps to the job details page, which shows the running stages (Active Stages) and the stages waiting to run (Pending Stages), including each stage's ID, description, submission time, running time, the number of tasks it contains, the number of tasks already run, and so on.
  • Click the hyperlink (DAG Visualization) in the rectangular box to view the DAG visualization of this job
  • It can be seen that this job is divided into two stages. Since the reduceByKey() operation produces a wide dependency, the stage division happens before the reduceByKey() operation.
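  • The same stage boundary can also be seen without the WebUI by printing the RDD lineage in spark-shell (a sketch; the exact RDD names in the output vary by Spark version):

// Build the same pipeline and print its lineage; the indentation change in the
// output marks the shuffle introduced by reduceByKey, i.e. the stage boundary.
val counts = sc.textFile("hdfs://master:9000/wordcount")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)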

11. View program execution results

  • Execute the command hdfs dfs -ls /wordcount_output to view the generated result files
  • As you can see, like MapReduce, Spark generates multiple files in the results directory. _SUCCESS is the execution status file, and the result data is stored in the files part-00000 and part-00001.
  • Execute the command hdfs dfs -cat /wordcount_output/* to view the data in the result files
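  • Because saveAsTextFile() writes each RDD element's string form, every line in the part files is a (word,count) tuple. With the hypothetical sample content used earlier, the output would look roughly like:

(hello,3)
(spark,1)
(hadoop,1)
(world,1)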
  • With that, the Spark WordCount program written in Scala runs successfully.

Origin: blog.csdn.net/howard2005/article/details/124232431