Spark Streaming example

Spark Streaming Overview

Internally, it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.
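To make this micro-batch model concrete, here is a minimal sketch (the object name, host, port, and interval are illustrative, not from the original post): foreachRDD exposes the one RDD that each batch interval produces for the Spark engine.

import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchModelSketch {
  def main(args: Array[String]): Unit = {
    // One batch every 5 seconds: the stream is processed as a series of RDDs
    val ssc = new StreamingContext("local[2]", "BatchModelSketch", Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch interval hands one RDD of received lines to the Spark engine
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}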

Example

Before going deeper into how to write your own Spark Streaming program, let's take a quick look at what a basic Spark Streaming program looks like. Suppose we want to count the number of words in text data received from a data server listening on a TCP socket. All you need to do is the following:
First, create a StreamingContext, the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads and a batch interval of one second (the full example below uses ten seconds instead):
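In code, that first step looks like this (mirroring the official quick example; the full program below uses a ten-second interval instead):

import org.apache.spark._
import org.apache.spark.streaming._

// Two working threads and a batch interval of 1 second
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))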

Maven

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spark.version>2.2.1</spark.version>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <!-- For Spark 2.x the Kafka integration is published as spark-streaming-kafka-0-8
           (or -0-10); the old spark-streaming-kafka_2.11 1.6.2 artifact targets Spark 1.6 -->
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.11.8</version>
    </dependency>


  </dependencies>
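If you build with sbt instead of Maven, an equivalent build.sbt would look roughly like this (a sketch using the same versions as the POM above):

// build.sbt -- sbt equivalent of the Maven dependencies above
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.2.1",
  "org.apache.spark" %% "spark-streaming" % "2.2.1",
  "junit"            %  "junit"           % "4.11" % Test
)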
Scala code

package com.hx.test

import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * fileName : Test11StreamingWordCount 
 * Created by 970655147 on 2016-02-12 13:21.
 */
object Test11StreamingWordCount {

  // Word count based on Spark Streaming
  // Environment: Windows 7 + Spark 1.2 + JDK 1.7 + Scala 2.10.4
  // 1. Start netcat [nc -l -p 9999]
  // 2. Start this program
  // 3. Type data into the netcat command line
  // 4. Go back to the console and check the result [within 10s]
  // *******************************************
  // Each print() produces one block of output per batch
  // -------------------------------------------
  // Time: 1455278620000 ms
  // -------------------------------------------
  // Another Information !
  // *******************************************
  // inputText : sdf sdf lkj lkj lkj lkj
  // MappedRDD[23] at count at Test11StreamingWordCount.scala:39
  // 2
  // (sdf,2), (lkj,4)
  def main(args: Array[String]): Unit = {

    // Create a StreamingContext with a local master
    // Spark Streaming needs at least two working threads
    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(10))
    // Create a DStream that will connect to serverIP:serverPort, like localhost:9999
    val lines = ssc.socketTextStream("192.168.47.141", 9999)
    // Split each received line into words on spaces
    val words = lines.flatMap(_.split(" "))
    // Count the occurrences of each word within this batch
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    // Print the first ten elements of each RDD to the console
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()

  }

}

The words DStream is then mapped (a one-to-one transformation) to a DStream of (word, 1) pairs, which is then reduced (a Spark operation) to get the frequency of each word in each batch of data. Finally, wordCounts.print() prints a few of the counts generated in every batch.
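Spelled out with explicit types, and continuing from the words DStream defined in the example above, these two transformations look like this (a sketch added for clarity):

import org.apache.spark.streaming.dstream.DStream

// One-to-one mapping: each word becomes a (word, 1) pair
val pairs: DStream[(String, Int)] = words.map(word => (word, 1))
// reduceByKey sums the 1s for each distinct word within a single batch
val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)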

Note that when these lines are executed, Spark Streaming has only set up the computation it will perform once started; no real processing has begun yet. To actually start processing after all the transformations have been set up, we call the following two operations:

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
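For completeness, the StreamingContext API also offers a bounded wait and an explicit stop; a short sketch (the timeout value is illustrative):

// Block for at most 60 seconds instead of waiting indefinitely
ssc.awaitTerminationOrTimeout(60000)
// Stop the streaming context and the underlying SparkContext,
// letting already-received batches finish processing first
ssc.stop(stopSparkContext = true, stopGracefully = true)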

If you have already downloaded and built Spark, you can run this example as follows. First you need to run Netcat (a small utility found on most Unix-like systems) as a data server:

$ nc -l 9999

Then, any lines typed into the terminal running netcat will be counted and printed to the screen every second.

# TERMINAL 1:
# Running Netcat
$ nc -l 9999
hello world
# TERMINAL 2: RUNNING network_wordcount.py
$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
(hello,1)
(world,1)

These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and provide the developer with a convenient higher-level API. Some of these operations are described in detail below.
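As a small preview, here are a few more operations from the DStream API applied to the lines and words streams of the example above (a sketch; behavior as documented in the Spark Streaming API):

// Keep only the lines that contain the word "error"
val errorLines = lines.filter(_.contains("error"))
// Per-batch frequency of each word (equivalent to the map + reduceByKey above)
val counts = words.countByValue()
// Number of elements in each batch, as a single-element DStream
val numWords = words.count()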


Source: blog.csdn.net/qq_34219959/article/details/102807169