Spark Streaming实时计算实例

一、实验内容
编写Spark Steaming应用程序，实现实时词频统计。
二、实验步骤
1．运行nc，模拟数据源。nc -lk 9999启动服务端且监听Socket服务。
命令：

nc -lk 9999

2．创建一个Maven工程，在pom.xml文件中添加Spark Streaming依赖。
依赖代码：

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hw</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <scala.version>2.12.15</scala.version>
        <hadoop.version>2.7.4</hadoop.version>
        <spark.version>3.1.2</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>1.1.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${
    
    scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>2.7.4</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testOutputDirectory>src/test/scala</testOutputDirectory>
    </build>
</project>

依赖需要根据自己的idea版本需求和电脑安装的插件来配置，配置完成会自己启动下载，如果没有下载，只需要点击idea右侧测定maevn就会自动下载了。

3．编写程序，使用updateStateByKey()方法对nc客户端不断输入的内容进行实时的词频统计。

上机操作

一、nc客户端
1．在虚拟机linux中启动nc客户端的命令：

nc -lk 9999

输入：

sgdhdj

此处字母随便写。

2．nc客户端截图（要有数据）
在这里插入图片描述

二、实时词频统计客户端

程序（代码要有注释说明）

  package cn.itcast.dstream
  import org.apache.spark.streaming.dstream.{
    
    DStream, ReceiverInputDStream}
  import org.apache.spark.streaming.{
    
    Seconds, StreamingContext}
  import org.apache.spark.{
    
    SparkConf, SparkContext}

  object UpdateSateByKeyTest {
    
    
    //newValues 表示当前批次汇总成的(word,1)中相同单词的所有的1
    def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    
    
      val newCount =runningCount.getOrElse(0)+newValues.sum
      Some(newCount)
    }
    def main(args: Array[String]): Unit = {
    
    
      //创建sparkConf参数
      val sparkConf: SparkConf = new SparkConf().setAppName("UpdateSateByKeyTest").setMaster("local[2]")
      //构建sparkContext对象，它是所有对象的源头
      val sc: SparkContext = new SparkContext(sparkConf)
      //设置日志的级别
      sc.setLogLevel("WARN")
      //构建StreamingContext对象，需要两个参数，每个批处理的时间间隔
      val scc: StreamingContext = new StreamingContext(sc, Seconds(5))
      //设置checkpoint路径，当前项目下有一个cy目录
      scc.checkpoint("./cy")
      //注册一个监听的IP地址和端口  用来收集数据
      val lines: ReceiverInputDStream[String] = scc.socketTextStream("192.168.118.128", 9999)
      //切分每一行记录
      val word: DStream[String] = lines.flatMap(_.split(" "))
      //每个单词记为1
      val wordAndOne: DStream[(String, Int)] = word.map((_, 1))
      //累计统计单词出现的次数
      val result: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunction)
      result.print()
      //打印输出结果
      scc.start()
      //开启流式计算
      scc.awaitTermination()
      //保持程序运行，除非被打断
    }


}

2．客户端截图（要有实时计算的结果）
在这里插入图片描述

Spark Streaming实时计算实例

Spark Streaming实时计算实例

上机操作

猜你喜欢