Spark basics study notes: a Spark SQL case study

1. Using Spark SQL to implement word frequency statistics

(1) Data source - words.txt

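As an illustration, a words.txt with one or more space-separated words per line might look like this (hypothetical sample, not the original data):

hello world
hello spark
hello scala
spark sql

Assuming such a file, it can be uploaded to the HDFS path that the program reads in step (6):

hdfs dfs -mkdir -p /input
hdfs dfs -put words.txt /input/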

(2) Create a Maven project


(3) Add dependencies and build plugins


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>net.py.wc</groupId>
    <artifactId>SparkSQLWordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.3.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
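
With the two plugins above, building is a single Maven command; assuming the pom shown here, the assembly plugin emits target/SparkSQLWordCount-1.0-SNAPSHOT-jar-with-dependencies.jar alongside the regular jar:

mvn clean package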

(4) Modify the source directory name

Rename the java source directory to scala.
Then, in the pom.xml file, set the source directory to match.
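A minimal sketch of the relevant pom.xml addition, assuming sources now live under src/main/scala:

<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
        <!-- plugins as configured above -->
    </plugins>
</build>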

(5) Create a log4j properties file

In the src/main/resources directory, create log4j.properties with the following content:

log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

(6) Create a word frequency statistics singleton object

Create the net.py.wc package and, inside it, the SparkSQLWordCount singleton object.

package net.py.wc

import org.apache.spark.sql.{Dataset, SparkSession}

object SparkSQLWordCount {

  def main(args: Array[String]): Unit = {
    // Set the Hadoop user name property, otherwise access is denied when running locally
    System.setProperty("HADOOP_USER_NAME", "root")
    // Create or get a SparkSession
    val spark = SparkSession.builder()
      .appName("SparkSQLWordCount")
      .master("local[*]")
      .getOrCreate()
    // Read the word file from HDFS
    val lines: Dataset[String] = spark.read.textFile("hdfs://master:9000/input/words.txt")
    // Show the contents of the lines dataset
    lines.show()
    // Import the Spark session's implicit conversions
    import spark.implicits._
    // Split each line on spaces and flatten into a dataset of single words
    val words: Dataset[String] = lines.flatMap(_.split(" "))
    // Show the contents of the words dataset
    words.show()
    // Rename the default column "value" to "word" and convert to a DataFrame
    val df = words.withColumnRenamed("value", "word").toDF()
    // Show the DataFrame contents
    df.show()
    // Create a temporary view based on the DataFrame
    df.createTempView("v_words")
    // Run a SQL group-by query to compute the word frequencies
    val wc = spark.sql(
      """
        |select word, count(*) as count
        |  from v_words group by word
        |  order by count desc
        |""".stripMargin)
    // Show the word frequency results
    wc.show()
    // Close the session
    spark.close()
  }
}
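
For comparison, the same aggregation can be written with the DataFrame API instead of a SQL string. This is a minimal sketch, not part of the original program; it assumes it is placed inside the same main method, after import spark.implicits._, so that the $ column syntax is available:

// Equivalent word count using the DataFrame API, without a temporary view
val wc2 = df.groupBy("word")   // group rows by the "word" column
  .count()                     // adds a "count" column per group
  .orderBy($"count".desc)      // highest frequency first
wc2.show()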

(7) Start the program and view the results

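With the hypothetical sample words.txt from step (1), the final wc.show() would print something like the table below; the relative order of the words tied at count 1 is not guaranteed:

+-----+-----+
| word|count|
+-----+-----+
|hello|    3|
|spark|    2|
|world|    1|
|scala|    1|
|  sql|    1|
+-----+-----+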


Source: https://blog.csdn.net/py20010218/article/details/125343256