Spark Learning Journey (4) Use of Streaming

Spark Streaming is similar to Apache Storm and is used for stream processing. Stream processing here means working with real-time data: the earlier posts processed offline data by reading complete data files directly, whereas a streaming job keeps listening for new data and processes each piece as soon as it arrives. According to the official documentation, Spark Streaming offers high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ and plain TCP sockets. Once data has been ingested, it can be manipulated with Spark's high-level primitives such as map, reduce, join and window, and the results can be written to many destinations, such as HDFS or a database. In addition, Spark Streaming integrates smoothly with MLlib (machine learning) and GraphX.

Spark Streaming uses a discretized stream, called a DStream, as its abstraction. A DStream is a sequence of data received over time. Internally, the data received in each time interval is stored as an RDD, and a DStream is the sequence of these RDDs (hence the name "discretized").

DStreams can be created from various input sources, such as Flume, Kafka or HDFS. A DStream supports two kinds of operations: transformations, which produce a new DStream, and output operations, which write data to an external system. DStreams provide many of the operations available on RDDs, plus new time-related operations such as sliding windows.
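
To get a feel for the two kinds of operations, here is a minimal sketch (the object name, host and port are illustrative; the socket source matches the producer used later in this post). map is a transformation that lazily builds a new DStream, while print is an output operation that triggers the computation on each batch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamOpsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("dstream-ops").setMaster("local[2]"), Seconds(5))

    val lines   = ssc.socketTextStream("localhost", 9888) // input DStream
    val lengths = lines.map(_.length)                     // transformation: builds a new DStream
    lengths.print()                                       // output operation: runs once per batch

    ssc.start()
    ssc.awaitTermination()
  }
}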

Simple use of Streaming

First we create the code that sends the messages.

// Send data over a socket
import java.io.PrintWriter
import java.net.ServerSocket

object CreateData {

  def main(args: Array[String]): Unit = {

    // Listen on a socket and push data to every client that connects
    val listener = new ServerSocket(9888)
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run() = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream, true)
          while (true) {
            Thread.sleep(1000)
            val context1 = "张三~李四~王五~张三"
            out.write(context1 + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
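
If you want to sanity-check the producer before wiring up Spark, you can connect with any plain TCP client, for example nc localhost 9888 (assuming netcat is installed); a new "~"-separated line should arrive roughly once per second.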

Then we create the streaming application that receives and processes the data.

Note: each receiver runs as a long-running task inside a Spark executor and therefore occupies one of the CPU cores allocated to the application. In addition, there must be spare cores left to process the data. This means that to run multiple receivers you need at least as many cores as there are receivers, plus the cores required for the actual computation.

For example, to run 10 receivers in a streaming application, at least 11 CPU cores must be allocated to it.

Put simply, it is best not to use local or local[1] as the master; use local[n] with n of at least 2.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {

  def main(args: Array[String]): Unit = {

    // Seconds(...) is the batch interval
    val ssc = new StreamingContext(new SparkConf()
      .setAppName("stream")
      .setMaster("local[2]"), Seconds(10))

    ssc.sparkContext.setLogLevel("WARN")

    // Set the checkpoint directory
    ssc.checkpoint("./check")

    val data = ssc.socketTextStream("localhost", 9888)

    data.print()

    ssc.start()
    ssc.awaitTermination()
  }

}

As the code above shows, Spark Streaming follows a "micro-batch" architecture: stream processing is turned into a series of small, continuous batch jobs.

Commonly used operations

The commonly used DStream methods fall roughly into transformations and output operations, and transformations are further divided into stateless and stateful ones.

Common transformations

The common transformations are much like their RDD counterparts; the main difference is that the RDD type is replaced by a DStream. A short sketch after the table below illustrates a few of them.

method name: description
map(func): passes each element of the source DStream through the function func, producing a new DStream.
flatMap(func): similar to map, but each input item can be mapped to 0 or more output items.
filter(func): keeps only the records of the source DStream for which func returns true, producing a new DStream.
repartition(numPartitions): changes the parallelism level of the DStream by creating more or fewer partitions.
union(otherStream): unions the source DStream with another DStream to get a new DStream.
count(): counts the number of elements in each RDD of the source DStream, producing a new DStream of single-element RDDs.
reduce(func): aggregates the elements of each RDD in the source DStream with the function func (which takes two arguments and returns one value), producing a DStream of single-element RDDs.
reduceByKey(func, [numTasks]): when called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated with the reduce function. Note: by default this uses Spark's default number of parallel tasks (2 in local mode; in cluster mode it is determined by spark.default.parallelism). The optional numTasks argument sets a different number of tasks.
join(otherStream, [numTasks]): when called on DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs.
cogroup(otherStream, [numTasks]): when called on DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (Seq[V], Seq[W])) pairs.
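
Here is a sketch that chains several of the transformations above on the "~"-separated socket source from the producer example; the object name and the batch interval are illustrative choices.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("transforms").setMaster("local[2]"), Seconds(5))

    val counts = ssc.socketTextStream("localhost", 9888)
      .flatMap(_.split("~"))     // one line -> many names
      .filter(_.nonEmpty)        // drop empty tokens
      .map((_, 1))               // to (K, V) pairs
      .reduceByKey(_ + _)        // per-batch counts

    counts.print()               // output operation

    ssc.start()
    ssc.awaitTermination()
  }
}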

Stateful transformation: updateStateByKey

In streaming it is sometimes necessary to maintain state across batches, for example to count how often each word has appeared over all batches so far. updateStateByKey() provides access to a per-key state variable for a DStream of key-value pairs. Given a DStream of (key, event) pairs and a function that specifies how to update each key's state from new events, it builds a new DStream whose contents are (key, state) pairs.

We illustrate this method with a word-count example; the code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateDemo {

  def main(args: Array[String]): Unit = {

    // Create the StreamingContext
    val ssc = new StreamingContext(
      new SparkConf()
        .setAppName("streaming")
        .setMaster("local[2]"), Seconds(5)
    )

    ssc.checkpoint("./check")

    val data = ssc
      .socketTextStream("127.0.0.1", 9888)

    // Count the occurrences of each name
    val names = data.flatMap(
      line => {
        val name = line.split("~")
        name
      }
    ).map((_, 1))

    val result = names.updateStateByKey[Int](
      (values: Seq[Int], state: Option[Int]) => {
        // (values for this key in the current batch, previously accumulated state) => new state
        // Get the previously accumulated count, defaulting to 0
        var count = state.getOrElse(0)
        // Add up the values from the current batch
        for (value <- values) {
          count += value
        }
        // Some is the non-empty subclass of Option
        Some(count)
      }
    )

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
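
Note that stateful transformations such as updateStateByKey require a checkpoint directory, which is why ssc.checkpoint("./check") is set before the stream is defined; without it the job fails at startup.
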
Storing streaming results in a database

Database access is handled through JDBC. Using word-frequency counting as the example again, the following shows how to store the results in a database. First create the table:

CREATE TABLE `word` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `key` varchar(255) DEFAULT NULL,
  `value` int(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=25 DEFAULT CHARSET=utf8;

Then write the data through JDBC:

import java.sql.{Connection, DriverManager}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingToMysql {

  def main(args: Array[String]): Unit = {

    // StreamingContext(Spark configuration, batch interval)
    val ssc = new StreamingContext(
      new SparkConf().setAppName("streaming")
        .setMaster("local[2]"), Seconds(3)
    )
    ssc.sparkContext.setLogLevel("ERROR")

    // Set the checkpoint directory; a streaming job runs 24*7
    ssc.checkpoint("./check_port")

    println("Streaming starts here")

    val data = ssc.socketTextStream("127.0.0.1", 9888)

    val mapData = data
      .flatMap(_.split("~"))
      .map((_, 1)) // (张三,1), (李四,1)

    // updateStateByKey[type of the value in the returned DStream]
    val result = mapData.updateStateByKey[Int](
      // values => the values for the same key in this batch,
      //           e.g. for key 张三 something like (1,1,1,...)
      // state  => the previously computed state for this key, e.g. (张三,12)
      (values: Seq[Int], state: Option[Int]) => {
        var count = state.getOrElse(0)
        for (v <- values) {
          count += v
        }
        Some(count)
      }
    )

    result.print()
    result.foreachRDD(
      item => {
        // Skip empty RDDs
        if (!item.isEmpty()) {
          // Iterate over the records in the RDD and write each one to MySQL
          item.foreach {
            case (key, count) => {
              // Get a connection
              val connection = getConnection()
              /* Define the SQL; the word table must be created beforehand */
              val sql = "insert into `word` (`key`,`value`) values (?,?)"
              /* Create a PreparedStatement to execute the SQL */
              val state = connection.prepareStatement(sql)
              /* Fill in the placeholders (the ? positions) */
              state.setString(1, key)
              state.setInt(2, count)
              /* Execute the SQL */
              state.execute()
              /* Release the resources */
              state.close()
              connection.close()
            }
          }
        }
      }
    )

    ssc.start()
    ssc.awaitTermination()

  }

  def getConnection(): Connection = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection(
      "jdbc:mysql://127.0.0.1:3306/ssm",
      "root", "root")
  }
}
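
Opening a new JDBC connection for every record is fine for a demo but expensive in practice. A common refinement, sketched below under the same assumptions (the result DStream and getConnection() defined above), is to open one connection per partition with foreachPartition and reuse it for all records in that partition:

// One connection per partition instead of one per record
result.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    if (records.hasNext) {
      val connection = getConnection()
      val sql = "insert into `word` (`key`,`value`) values (?,?)"
      val statement = connection.prepareStatement(sql)
      records.foreach { case (key, count) =>
        statement.setString(1, key)
        statement.setInt(2, count)
        statement.execute()
      }
      statement.close()
      connection.close()
    }
  }
}
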
Streaming window operations

The data stream is processed batch by batch in time order, which brings in another concept: the time window, referred to below simply as the window. A window groups together the data of several consecutive batches.

Each window is defined by a window length and a sliding interval, both of which must be integer multiples of the StreamingContext's batch interval. The window length is the span of time whose received RDDs are included in each computation, and the sliding interval is how often a new windowed computation is triggered. For example, with a 5-second batch interval, a 15-second window with a 5-second slide recomputes every 5 seconds over the last three batches.

Here is a simple window example; the data producer is still the code above.


import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowDemo {

  def main(args: Array[String]): Unit = {

    // Create the StreamingContext
    val ssc = new StreamingContext(
      new SparkConf()
        .setAppName("streaming")
        .setMaster("local[2]"), Seconds(5)
    )

    ssc.sparkContext.setLogLevel("ERROR")
    ssc.checkpoint("./stream_checkpoint")

    val data = ssc
      .socketTextStream("127.0.0.1", 9888)

    // Create a window 15 seconds long that slides every 5 seconds
    val winData = data.window(Seconds(15), Seconds(5))
    winData.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Common window methods

method name: description
window(windowLength, slideInterval): returns a new DStream computed from windowed batches of the source DStream.
countByWindow(windowLength, slideInterval): returns a sliding-window count of the elements in the stream.
reduceByWindow(func, windowLength, slideInterval): creates a new single-element stream by aggregating the elements of the stream over a sliding window using the function func.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): when called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated with the reduce function over the batches in the sliding window. Note: by default this uses Spark's default number of parallel tasks (2 in local mode; in cluster mode it is determined by spark.default.parallelism). The optional numTasks argument sets a different number of tasks.
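
As an example, a windowed version of the earlier word count might look like the sketch below, which assumes the 5-second-batch StreamingContext ssc from the window example above; the 30-second window and 10-second slide are arbitrary choices that are multiples of the batch interval, as required.

// Count each name over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = ssc.socketTextStream("127.0.0.1", 9888)
  .flatMap(_.split("~"))
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()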

Origin blog.csdn.net/lihao1107156171/article/details/115587995