Spark Learning (1): Using Spark for WordCount (word count)

Spark is a fast, general-purpose, and scalable big data analytics engine written in Scala.

1. Brief introduction

The following is a brief introduction to Spark; essentially, it covers what makes Spark good. If you are not interested, you can skip straight to writing the project.

Features

1) Fast: computation happens in memory, so it is fast.

2) Easy to use: offers Java, Python, and Scala APIs with simple syntax.

3) General-purpose: provides a unified set of solutions for both offline (batch) and real-time big data processing, which is convenient.

4) Compatible: running on the JVM, it integrates well with other big data technologies, for example by running on YARN.

2. Write a project

Let’s first write the simplest possible application so everyone can get a feel for Spark. Using Spark is actually similar to using other frameworks: the first step is to add the required jar dependencies.

1. Jar dependencies

First create an ordinary Maven project, then add the following dependencies to the project's pom.xml file. Any dependency that is already present does not need to be added again.

Note that I am using IntelliJ IDEA as the IDE, and the Scala plug-in needs to be installed in IDEA beforehand.

<dependencies>
        <!-- Scala dependencies: start -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>2.12.8</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.8</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>2.12.8</version>
            <scope>compile</scope>
        </dependency>

        <!-- Scala dependencies: end -->

        <!-- Spark dependencies: start -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <!-- Spark dependencies: end -->
    </dependencies>

    <build>
        <plugins>
            <!-- This plugin compiles Scala code into class files -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.6</version>
                <executions>
                    <execution>
                        <!-- Bind this execution to Maven's compile phase -->
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

2. Create a data set

We need a data set for Spark to compute on. Create a data folder in the project directory to hold the data, then create a file named word inside it with the following content.

java~hadoop~java
html~js
java
js~jquery

The next task is to use Spark to count how many times each word appears in the whole file.

3. Write the processing class and run the calculation

Create a package under the java folder (you can rename the folder to scala if you prefer), then create a WordCount.scala file in it. Note that it must be declared as an object, not a class. Using Spark roughly follows these steps:

1. Create the Spark context

2. Read the data file

3. Transform the data into a suitable format

4. Run the statistical calculation

The specific processing code is as follows

import org.apache.spark.SparkContext

object WordCount {

  def main(args: Array[String]): Unit = {

    // Create the Spark context.
    /*
    The master parameter selects the run mode:
    1. local mode
       Everything runs in a single thread with no parallelism; usually used for practice.
       local[K]: runs with K worker threads.
       local[*]: runs with as many worker threads as there are CPU cores.
    2. Standalone mode
       Runs on a Spark standalone cluster.
    3. YARN mode
       Runs on YARN, with no extra cluster to deploy.
    4. Mesos mode
       Runs on Mesos, an open-source distributed resource management framework under Apache.
    */
    val spark = new SparkContext("local",
      "WordCount")

    // Read the data; "word" is the file we just created.
    val data = spark.textFile("data/word")

    // Transform the data into a format that is convenient to compute on.
    val result = data
      .flatMap(_.split("~")) // split each record on ~; _ is one record (line) of data, since Spark splits the file on newlines
      .map((_, 1))           // map each word to a pair, e.g. (java,1), (html,1)
      .countByKey()          // final step: count how many times each key occurs

    println(result)
  }

}
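A note on the last step: countByKey returns the counts to the driver as a local Map, which is fine for a small file like this but does not scale to very large key sets. A commonly used alternative that keeps the aggregation distributed is reduceByKey. The sketch below is my own variant, not part of the original code, and assumes the same data/word file:

import org.apache.spark.SparkContext

object WordCountReduce {

  def main(args: Array[String]): Unit = {
    // local[*] uses as many worker threads as there are CPU cores
    val spark = new SparkContext("local[*]", "WordCountReduce")

    val counts = spark.textFile("data/word")
      .flatMap(_.split("~"))  // split each line into words
      .map((_, 1))            // pair each word with 1
      .reduceByKey(_ + _)     // sum the 1s per word on the executors

    // collect() brings the (word, count) pairs back to the driver for printing
    counts.collect().foreach(println)

    spark.stop()
  }

}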

Run the main function to execute the whole program. At the end, the number of occurrences of each word is printed. The output is as follows:

Map(java -> 3, hadoop -> 1, jquery -> 1, js -> 2, html -> 1)

This is the simplest Spark use case, but it is also one of the most common patterns. Many tasks can be built on this example, such as counting how many times each product ID appears in a shopping data set and using that count as a measure of the product's popularity.
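As a rough illustration of that idea (the file name and format here are made up for the example, not taken from the original project), suppose each line of a file data/orders contains one product ID; the same pattern then gives a popularity count per product:

import org.apache.spark.SparkContext

object ProductPopularity {

  def main(args: Array[String]): Unit = {
    val spark = new SparkContext("local", "ProductPopularity")

    // Hypothetical input: one product ID per line, e.g. "p1001"
    val popularity = spark.textFile("data/orders")
      .map(id => (id.trim, 1))  // pair each product ID with 1
      .countByKey()             // occurrences per product ID = popularity score

    println(popularity)

    spark.stop()
  }

}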

The code is available here:
https://gitee.com/lihao2/blog-code.git
(the project name is the same as the blog title)

In the next article, I will use the methods from this example to count the most popular types of food in a certain city. Follow the blogger so you don't get lost!

Origin: https://blog.csdn.net/lihao1107156171/article/details/115044456