Spark Streaming real-time computing instance

1. Experimental content
Write a Spark Streaming application that performs real-time word frequency statistics.
2. Experimental steps
1. Run nc to simulate the data source: nc -lk 9999 starts a server that listens for Socket connections on port 9999 (-l puts nc in listen mode, -k keeps it listening after a client disconnects).
Command:

nc -lk 9999

2. Create a Maven project and add the Spark Streaming dependencies to the pom.xml file.
Dependency code:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hw</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <scala.version>2.12.15</scala.version>
        <hadoop.version>2.7.4</hadoop.version>
        <spark.version>3.1.2</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>1.1.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>2.7.4</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
    </build>
</project>

The dependency versions need to match your own IDEA setup and the plug-ins installed on your machine (the Spark 3.1.2 artifacts above are built for Scala 2.12, so keep the Scala version at 2.12.x). After the configuration is saved, the download normally starts automatically; if it does not, click the Maven refresh (reimport) button in the Maven tool window on the right side of IDEA and the dependencies will be downloaded.

3. Write a program that uses the updateStateByKey() method to perform cumulative, real-time word frequency statistics on the content continuously entered in the nc client.
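
Before looking at the full program, here is a minimal sketch (with assumed sample values) of the contract updateStateByKey() expects from its update function: for every word, Spark passes in all the new 1s from the current batch together with the previously accumulated count, and the function returns the new running total. Essentially the same function appears in the program further below.

// Sketch of the update function used by updateStateByKey(); sample values are assumed for illustration
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(runningCount.getOrElse(0) + newValues.sum)
}

updateFunction(Seq(1, 1), None)    // first batch: a word appears twice, no previous state -> Some(2)
updateFunction(Seq(1), Some(2))    // next batch: the same word appears once more -> Some(3)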

3. Computer operation

1. nc client
1. Command to start the nc client in the Linux virtual machine:

nc -lk 9999

Enter:

sgdhdj

Any letters or words can be typed here; each line you send becomes one record in the input stream.
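
For counts that are easier to verify, you can also send space-separated words, for example (arbitrary sample input, assumed here for illustration):

hadoop spark spark
spark streaming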

2. nc client screenshot (must show the input data)

2. Real-time word frequency statistics client

  1. Program (the code must be commented)
package cn.itcast.dstream

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object UpdateSateByKeyTest {

  // newValues holds all the 1s of the same word aggregated from the (word, 1) pairs of the current batch;
  // runningCount is the count accumulated from previous batches
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
  }

  def main(args: Array[String]): Unit = {
    // Create the SparkConf configuration
    val sparkConf: SparkConf = new SparkConf().setAppName("UpdateSateByKeyTest").setMaster("local[2]")
    // Build the SparkContext object, the entry point for everything else
    val sc: SparkContext = new SparkContext(sparkConf)
    // Set the log level
    sc.setLogLevel("WARN")
    // Build the StreamingContext object; the second argument is the batch interval (5 seconds here)
    val scc: StreamingContext = new StreamingContext(sc, Seconds(5))
    // Set the checkpoint path (required by updateStateByKey); a cy directory is created under the current project
    scc.checkpoint("./cy")
    // Register the IP address and port to listen on for collecting data (the host where nc is running)
    val lines: ReceiverInputDStream[String] = scc.socketTextStream("192.168.118.128", 9999)
    // Split each line into words
    val word: DStream[String] = lines.flatMap(_.split(" "))
    // Map each word to (word, 1)
    val wordAndOne: DStream[(String, Int)] = word.map((_, 1))
    // Accumulate the number of occurrences of each word across batches
    val result: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunction)
    // Print the output
    result.print()
    // Start the streaming computation
    scc.start()
    // Keep the program running until it is terminated
    scc.awaitTermination()
  }
}

2. Client screenshot (must show the real-time calculation results)
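For reference, once words are sent from nc, the program prints a cumulative count at the end of every 5-second batch; the console output looks roughly like this (the timestamp and counts below are illustrative only and depend on what was typed):

-------------------------------------------
Time: 1682908800000 ms
-------------------------------------------
(spark,3)
(hadoop,1)
(streaming,1)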

Origin: blog.csdn.net/qq_62127918/article/details/130415019