Part 4|Spark Streaming Programming Guide (1)

Spark Streaming is a stream processing framework built on Spark Core and a very important part of Spark. It was introduced in Spark 0.7.0 in February 2013 and has since become a stream processing platform widely used in enterprises. In July 2016, Structured Streaming was introduced in Spark 2.0 and reached production level in Spark 2.2. Structured Streaming is a stream processing engine built on Spark SQL; users can perform stream processing with the Dataset/DataFrame API, and it is still evolving rapidly across versions. Note that this article does not cover Structured Streaming in depth; it mainly discusses Spark Streaming, including the following:

  • Introduction to Spark Streaming
  • Transformations and Output Operations
  • Spark Streaming data sources (Sources)
  • Spark Streaming Data Sink (Sinks)

Introduction to Spark Streaming

What is DStream

Spark Streaming is built on Spark Core's RDDs. At the same time, Spark Streaming introduces a new concept: the DStream (Discretized Stream), which represents a continuous stream of data. The DStream abstraction is the stream processing model of Spark Streaming. Internally, Spark Streaming splits the input data into segments by time interval (for example, 1 second) and converts each segment into an RDD in Spark. This sequence of segments is a DStream, and operations on a DStream are ultimately translated into operations on the corresponding RDDs. As shown below:

[Figure: a DStream as a sequence of RDDs, one per batch interval]

As the figure above shows, the low-level RDD transformations are carried out by the Spark engine. The DStream operations hide most of these low-level details and provide users with a more convenient high-level API.

Computation model

In Flink, batch processing is a special case of stream processing, so Flink is a natural stream processing engine. This is not the case with Spark Streaming: Spark Streaming treats stream processing as a special case of batch processing. In other words, Spark Streaming is not a pure real-time stream processing engine; internally it uses a micro-batch model, treating the stream as a series of batch jobs over small time intervals (the batch interval). The batch interval should be chosen according to the latency requirements of the business; intervals on the order of seconds or minutes are achievable.

Spark Streaming stores the data received during each short interval in the cluster and then applies a series of operators (map, reduce, groupBy, etc.) to it. The execution process is shown in the figure below:

[Figure: the input data stream is divided into batches, which the Spark engine processes to produce batches of results]

As shown above, Spark Streaming divides the input data stream into small batches; each batch is represented as an RDD, and these batches are stored in memory. A Spark job is launched to process each batch, thereby implementing a stream processing application.

The working mechanism of Spark Streaming

Overview


  • In Spark Streaming, a Receiver component runs as a long-running task on an Executor
  • Each Receiver is responsible for one input DStream (for example, a file stream that reads data from files, a socket stream, or an input stream read from Kafka)
  • Through its input DStreams, Spark Streaming connects to external data sources and reads their data

Implementation details


  • 1. Start the StreamingContext
  • 2. The StreamingContext starts the receivers, which run as long-lived tasks on Executors and continuously receive data from the sources. There are two main types of receiver: reliable receivers, which send an acknowledgment to the data source once the data has been received and stored in Spark, and unreliable receivers, which send no acknowledgment. The received data is cached in the memory of the worker node and also replicated to the memory of nodes hosting other executors for fault tolerance.
  • 3. The StreamingContext periodically (at each batch interval) triggers a job to process the data.
  • 4. The results are output.

Spark Streaming programming steps

The analysis above gives a preliminary understanding of Spark Streaming. So how do you write a Spark Streaming application? A Spark Streaming application generally includes the following steps:

  • 1. Create a StreamingContext
  • 2. Create an input DStream to define the input source
  • 3. Define the processing logic by applying transformation and output operations to the DStream
  • 4. Call streamingContext.start() to start receiving and processing data
  • 5. Call streamingContext.awaitTermination() to wait for processing to end
  import org.apache.log4j.{Level, Logger}
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StartSparkStreaming {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("Streaming")
      // 1. Create the StreamingContext
      val ssc = new StreamingContext(conf, Seconds(5))
      Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
      Logger.getLogger("org.apache.hadoop").setLevel(Level.OFF)
      // 2. Create the input DStream
      val lines = ssc.socketTextStream("localhost", 9999)
      // 3. Define the stream processing logic
      val count = lines.flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      // 4. Output the result
      count.print()
      // 5. Start the computation
      ssc.start()
      // 6. Wait for termination
      ssc.awaitTermination()
    }
  }

Transformations and Output Operations

DStreams are immutable, which means their contents cannot be changed directly; instead, a series of transformations is applied to a DStream to implement the desired application logic. Each transformation creates a new DStream that represents the transformed data of the parent DStream. DStream transformations are lazy: they are only executed once an output operation runs, and the operations that trigger execution are called output operations.

Transformations

Spark Streaming provides a rich set of transformation operations, divided into stateless transformations and stateful transformations. In addition, Spark Streaming provides a number of window operations; note that window operations are also stateful. The details are as follows:

Stateless transformation

A stateless transformation means that each micro-batch is processed independently of the others, i.e., the current result is not affected by previous results. Most Spark Streaming operators are stateless, such as the common map(), flatMap(), reduceByKey(), and so on.

  • map(func)

Use the func function to convert each element of the source DStream to get a new DStream

    /** Return a new DStream by applying a function to all elements of this DStream. */
    def map[U: ClassTag](mapFunc: T => U): DStream[U] = ssc.withScope {
      new MappedDStream(this, context.sparkContext.clean(mapFunc))
    }
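A quick usage sketch (assuming lines is the DStream[String] created by socketTextStream in the earlier word-count example):

// Input:  spark flink
// Output: SPARK FLINK
val upperLines = lines.map(_.toUpperCase)
upperLines.print()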
  • flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items

  /**
   * Return a new DStream by applying a function to all elements of this DStream,
   * and then flattening the results
   */
  def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U] = ssc.withScope {
    new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
  }
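For example (again a sketch, with lines as a DStream[String]):

// Input:  spark spark flink
// Output: spark, spark, flink  (one element per word)
val words = lines.flatMap(_.split(" "))
words.print()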
  • filter (func)

Return a new DStream containing only the items in the source DStream that satisfy the function func

  /** Return a new DStream containing only the elements that satisfy a predicate. */
  def filter(filterFunc: T => Boolean): DStream[T] = ssc.withScope {
    new FilteredDStream(this, context.sparkContext.clean(filterFunc))
  }
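For example, keeping only the lines that mention "spark" (a sketch, with lines as above):

val sparkLines = lines.filter(_.contains("spark"))
sparkLines.print()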
  • repartition (numPartitions)

Change the degree of parallelism of DStream by creating more or fewer partitions

  /**
   * Return a new DStream with an increased or decreased level of parallelism. Each RDD in the
   * returned DStream has exactly numPartitions partitions.
   */
  def repartition(numPartitions: Int): DStream[T] = ssc.withScope {
    this.transform(_.repartition(numPartitions))
  }
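For example (a sketch; the partition count of 4 is an arbitrary choice):

// Each RDD of the returned DStream has exactly 4 partitions
val repartitioned = lines.repartition(4)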

  • reduce(func)

Use the function func to aggregate the elements of each RDD in the source DStream and return a new DStream of single-element RDDs

  /**
   * Return a new DStream in which each RDD has a single element generated by reducing each RDD
   * of this DStream.
   */
  def reduce(reduceFunc: (T, T) => T): DStream[T] = ssc.withScope {
    this.map((null, _)).reduceByKey(reduceFunc, 1).map(_._2)
  }
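For example, picking the longest line of each batch (a sketch, with lines as above):

// Each output RDD holds a single element: the longest line of the batch
val longestLine = lines.reduce((a, b) => if (a.length >= b.length) a else b)
longestLine.print()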

  • count()

Count the number of elements in each RDD in the source DStream

  /**
   * Return a new DStream in which each RDD has a single element generated by counting each RDD
   * of this DStream.
   */
  def count(): DStream[Long] = ssc.withScope {
    this.map(_ => (null, 1L))
        .transform(_.union(context.sparkContext.makeRDD(Seq((null, 0L)), 1)))
        .reduceByKey(_ + _)
        .map(_._2)
  }
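For example (a sketch, with lines as above):

// Each output RDD holds a single element: the number of lines in the batch
val lineCount = lines.count()
lineCount.print()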
  • union(otherStream)

Return a new DStream containing the elements of the source DStream combined with those of the other DStream

  /**
   * Return a new DStream by unifying data of another DStream with this DStream.
   * @param that Another DStream having the same slideDuration as this DStream.
   */
  def union(that: DStream[T]): DStream[T] = ssc.withScope {
    new UnionDStream[T](Array(this, that))
  }
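For example, merging two socket streams into one DStream (a sketch; the second port 9998 is an assumption):

// Both DStreams come from the same StreamingContext and share the same batch interval
val lines2 = ssc.socketTextStream("localhost", 9998)
val merged = lines.union(lines2)
merged.print()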
  • countByValue()

Applied to a DStream of element type K, it returns a new DStream of (K, Long) key-value pairs, where the value of each key is its number of occurrences in each RDD of the source DStream. For example, for lines.flatMap(_.split(" ")).countByValue().print() with input spark spark flink, the output is (spark,2),(flink,1). That is, the elements are grouped by value, and the elements in each group are counted.

As can be seen from the source code, the underlying implementation is map((_, 1L)).reduceByKey((x: Long, y: Long) => x + y, numPartitions): each element is first mapped to a tuple whose key is the element's value, and the tuples are then aggregated by key.

  /**
   * Return a new DStream in which each RDD contains the counts of each distinct value in
   * each RDD of this DStream. Hash partitioning is used to generate
   * the RDDs with `numPartitions` partitions (Spark's default number of partitions if
   * `numPartitions` not specified).
   */
  def countByValue(numPartitions: Int = ssc.sc.defaultParallelism)(implicit ord: Ordering[T] = null)
      : DStream[(T, Long)] = ssc.withScope {
    this.map((_, 1L)).reduceByKey((x: Long, y: Long) => x + y, numPartitions)
  }

  • reduceByKey(func, [numTasks])

When called on a DStream of (K,V) key-value pairs, it returns a new DStream of (K,V) pairs in which the values for each key are aggregated using the reduce function func

For example: lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).print()

For input: spark spark flink, output: (spark,2),(flink,1)

  /**
   * Return a new DStream by applying `reduceByKey` to each RDD. The values for each key are
   * merged using the associative and commutative reduce function. Hash partitioning is used to
   * generate the RDDs with Spark's default number of partitions.
   */
  def reduceByKey(reduceFunc: (V, V) => V): DStream[(K, V)] = ssc.withScope {
    reduceByKey(reduceFunc, defaultPartitioner())
  }

  • join(otherStream, [numTasks])

When applied to two DStreams (one containing (K,V) key-value pairs, one containing (K,W) key-value pairs), return a new DStream containing (K, (V, W)) key-value pairs

  /**
   * Return a new DStream by applying 'join' between RDDs of `this` DStream and `other` DStream.
   * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
   */
  def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))] = ssc.withScope {
    join[W](other, defaultPartitioner())
  }
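For example, joining two keyed DStreams built from the same lines (a sketch, analogous to the cogroup example below):

// Input on both streams: spark flink
// Output: (spark,(1,1)), (flink,(1,1))
val left = lines.flatMap(_.split(" ")).map((_, 1))
val right = lines.flatMap(_.split(" ")).map((_, 1))
left.join(right).print()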
  • cogroup(otherStream, [numTasks])

When applied to two DStreams (one of (K,V) key-value pairs and one of (K,W) key-value pairs), return a new DStream of (K, (Iterable[V], Iterable[W])) tuples

// Input:  spark
// Output: (spark,(CompactBuffer(1),CompactBuffer(1)))
val DS1 = lines.flatMap(_.split(" ")).map((_,1))
val DS2 = lines.flatMap(_.split(" ")).map((_,1))
DS1.cogroup(DS2).print()
  /**
   * Return a new DStream by applying 'cogroup' between RDDs of `this` DStream and `other` DStream.
   * Hash partitioning is used to generate the RDDs with Spark's default number
   * of partitions.
   */
  def cogroup[W: ClassTag](
      other: DStream[(K, W)]): DStream[(K, (Iterable[V], Iterable[W]))] = ssc.withScope {
    cogroup(other, defaultPartitioner())
  }
  • transform(func)

Create a new DStream by applying an RDD-to-RDD function to each RDD of the source DStream. Any RDD operation can be used inside transform, which makes it very flexible

// Input:  spark spark flink
// Output: (spark,2), (flink,1)
val lines = ssc.socketTextStream("localhost", 9999)
val resultDStream = lines.transform(rdd => {
  rdd.flatMap(_.split("\\W")).map((_, 1)).reduceByKey(_ + _)
})
resultDStream.print()
  /**
   * Return a new DStream in which each RDD is generated by applying a function
   * on each RDD of 'this' DStream.
   */
  def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = ssc.withScope {
    val cleanedF = context.sparkContext.clean(transformFunc, false)
    transform((r: RDD[T], _: Time) => cleanedF(r))
  }

Stateful transformation

A stateful transformation means that micro-batches are not processed independently of each other: processing the current micro-batch depends on the results of previous micro-batches. Common stateful transformations include countByValueAndWindow, reduceByKeyAndWindow, mapWithState, updateStateByKey, etc. In fact, all window-based operations are stateful, because they track data across the entire window.

For stateful transformations and window operations, see below.

Output Operations

Output operations write a DStream to external storage systems or print it to the console. As mentioned above, Spark Streaming transformations are lazy, so an output operation is needed to trigger the computation; its role is similar to an RDD action. For details, see Spark Streaming Data Sinks (Sinks) below.

Spark Streaming data sources (Sources)

Spark Streaming aims to be a general-purpose stream processing framework. To achieve this goal, it uses Receivers to integrate various data sources. However, for some data sources (such as Kafka), Spark Streaming supports a Direct approach to receiving data, which performs better than the Receiver-based approach.

Receiver-based approach

[Figure: the Receiver collects incoming records into blocks during each batch interval and hands each block to Spark for processing]

The job of a Receiver is to collect data from the data source and transmit it to Spark Streaming. The basic principle is that, as data keeps arriving, it is collected and packed into blocks during the corresponding batch interval; when a batch interval completes, the collected blocks are sent to Spark for processing.

As shown above: when Spark Streaming starts, the receiver begins collecting data. At the end of batch interval t0 (that is, once the data for that period has been collected), block #0 is sent to Spark for processing. At time t2, Spark is processing the data block of batch interval t1, while the receiver keeps collecting block #2 for batch interval t2.

Common Receiver-based data sources include Kafka, Kinesis, Flume, and Twitter. In addition, users can extend the Receiver abstract class and implement its onStart() and onStop() methods to build a custom Receiver (a sketch is shown below). This article does not discuss Receiver-based data sources further; it mainly explains the Direct-based Kafka data source in detail.
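As a rough illustration of that extension point, here is a minimal custom Receiver sketch that reads lines from a TCP socket; the class name, host, and port are placeholders, and error handling is omitted:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // onStart() must return quickly, so the blocking read loop runs in its own thread
  override def onStart(): Unit = {
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to clean up here: the reading thread stops once isStopped() returns true
  override def onStop(): Unit = {}

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(
      new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
    var line = reader.readLine()
    while (!isStopped() && line != null) {
      store(line) // hand the record over to Spark Streaming
      line = reader.readLine()
    }
    reader.close()
    socket.close()
    // If we exited because the connection dropped (not because the receiver was stopped),
    // ask Spark to restart the receiver
    if (!isStopped()) restart("Connection closed, trying to reconnect")
  }
}

// Usage: val customLines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))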

Direct-based approach

Spark 1.3 introduced this new Direct approach without Receivers to provide stronger end-to-end guarantees. Instead of using Receivers to receive data, it periodically queries Kafka for the latest offset of each topic+partition and accordingly defines the offset range to process in each batch. When the job that processes the data is launched, Kafka's simple consumer API is used to read the defined offset ranges (similar to reading files from a file system). Note that this feature was introduced in the Scala and Java API in Spark 1.3 and in the Python API in Spark 1.4.

The Direct-based approach has the following advantages:

  • Simplified parallel reads

To read multiple partitions, you do not need to create multiple input DStreams and then union them. Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one correspondence between Kafka partitions and RDD partitions.

  • High performance

To guarantee zero data loss with the Receiver-based approach, the WAL (write-ahead log) mechanism must be enabled. This is quite inefficient, because the data is effectively copied twice: Kafka itself already has a highly reliable replication mechanism, and another copy is written to the WAL. The Direct-based approach does not depend on a Receiver and does not need the WAL; as long as the data is replicated in Kafka, it can be recovered from Kafka's replicas.

  • Exactly-once semantics

The Receiver-based approach uses Kafka's high-level API to store consumed offsets in ZooKeeper, which is the traditional way of consuming Kafka data. Combined with the WAL, it can guarantee zero data loss (high reliability), but it cannot guarantee exactly-once semantics, because Spark and ZooKeeper may get out of sync. The Direct-based approach uses Kafka's simple API, and Spark Streaming itself tracks the consumed offsets and stores them in the checkpoint, so Spark stays consistent with itself and can guarantee that each record is processed exactly once.

Spark Streaming integrates Kafka

How to use

Use KafkaUtils to add Kafka data source, the source code is as follows:

  def createDirectStream[K, V](
      ssc: StreamingContext,
      locationStrategy: LocationStrategy,
      consumerStrategy: ConsumerStrategy[K, V]
    ): InputDStream[ConsumerRecord[K, V]] = {
    val ppc = new DefaultPerPartitionConfig(ssc.sparkContext.getConf)
    createDirectStream[K, V](ssc, locationStrategy, consumerStrategy, ppc)
  }

Explanation of specific parameters:

  • K : Kafka message key type

  • V : Type of Kafka message value

  • ssc:StreamingContext

  • locationStrategy : the LocationStrategy, which schedules consumers for topic partitions onto Executors, trying to keep each consumer as close as possible to its leader partition. This can improve performance, but the placement is only a hint, not a guarantee. The following strategies are available:

    • PreferBrokers: use this only when the Spark executors run on the same nodes as the Kafka brokers
    • PreferConsistent: used in most cases; distributes partitions evenly across all Executors
    • PreferFixed: place specific topic partitions on specific hosts; used when the data load is unbalanced

    Note: PreferConsistent is used in most cases; the other two strategies are only for specific scenarios. This setting is only a hint, and the actual placement is still adjusted automatically according to the cluster's resources.

  • consumerStrategy : consumption strategy, there are three main ways:

    • Subscribe: subscribe to a fixed collection of topic names
    • SubscribePattern: subscribe to all topics whose names match a regular expression
    • Assign: subscribe to a fixed collection of topic+partition pairs

    Note: Use the Subscribe strategy in most cases; the three strategies are sketched below.
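A sketch of the three strategies (the topic names, the pattern, the partitions, and the abbreviated kafkaParams map below are placeholders; a complete parameter map appears in the use case that follows):

import java.util.regex.Pattern

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Abbreviated Kafka consumer settings; see the full map in the use case below
val kafkaParams = Map[String, Object]("bootstrap.servers" -> "kms-1:9092", "group.id" -> "group0")

// 1. Subscribe: a fixed collection of topic names (the common case)
val subscribe = ConsumerStrategies.Subscribe[String, String](
  Array("topicA", "topicB"), kafkaParams)

// 2. SubscribePattern: every topic whose name matches a regular expression
val subscribePattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("topic.*"), kafkaParams)

// 3. Assign: an explicit collection of topic+partition pairs
val assign = ConsumerStrategies.Assign[String, String](
  Array(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1)), kafkaParams)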

Use Cases

import java.sql.{Connection, PreparedStatement}

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object TolerateWCTest {

  def createContext(checkpointDirectory: String): StreamingContext = {

    val sparkConf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // Maximum number of records read per second from each Kafka partition; not set by default
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // Number of consecutive retries the driver makes to fetch the latest offsets of each
      // leader partition; the default is 1, i.e., at most 2 attempts.
      // Only applies to the new Kafka direct stream API.
      .set("spark.streaming.kafka.maxRetries", "2")
      .setAppName("TolerateWCTest")

    val ssc = new StreamingContext(sparkConf, Seconds(3))
    ssc.checkpoint(checkpointDirectory)
    val topic = Array("testkafkasource2")
    val kafkaParam = Map[String, Object](
      "bootstrap.servers" -> "kms-1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group0",
      "auto.offset.reset" -> "latest", // default is latest
      "enable.auto.commit" -> (false: java.lang.Boolean)) // default is true; false means offsets are committed manually

    val lines = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topic, kafkaParam))

    val words = lines.flatMap(_.value().split(" "))
    val wordDstream = words.map(x => (x, 1))
    val stateDstream = wordDstream.reduceByKey(_ + _)

    stateDstream.cache()
    // Set relative to the batch interval: it must not be lower than the batch interval,
    // otherwise an error is thrown. Here it is set to 2x the batch interval.
    stateDstream.checkpoint(Seconds(6))

    // Save the DStream to a MySQL database
    stateDstream.foreachRDD(rdd =>
      rdd.foreachPartition { record =>
        var conn: Connection = null
        var stmt: PreparedStatement = null
        // Get one connection per partition
        conn = ConnectionPool.getConnection
        // Iterate over the records of the partition and insert them into the database
        // using that single connection
        while (record.hasNext) {
          val wordcounts = record.next()
          val sql = "insert into wctbl(word,count) values (?,?)"
          stmt = conn.prepareStatement(sql)
          stmt.setString(1, wordcounts._1.trim)
          stmt.setInt(2, wordcounts._2.toInt)
          stmt.executeUpdate()
        }
        // Return the connection to the pool when done
        ConnectionPool.returnConnection(conn)
      })
    ssc
  }

  def main(args: Array[String]) {

    val checkpointDirectory = "hdfs://kms-1:8020/docheckpoint"

    val ssc = StreamingContext.getOrCreate(
      checkpointDirectory,
      () => createContext(checkpointDirectory))
    ssc.start()
    ssc.awaitTermination()
  }
}

Spark Streaming Data Sink (Sinks)

Introduction to Output Operation

Spark Streaming provides the following built-in output operations:

  • print()

Print the data to standard output. If no argument is passed, the first 10 elements of each batch are printed by default

  • saveAsTextFiles(prefix, [suffix])

Store the DStream contents to the file system; the file name for each batch interval is prefix-TIME_IN_MS[.suffix]
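A one-line usage sketch (assuming wordCounts is a DStream of (word, count) pairs as in the word-count examples in this article; the HDFS path is a placeholder):

// Each batch is written to a new directory named wc-<TIME_IN_MS>.txt under /streaming
wordCounts.saveAsTextFiles("hdfs://kms-1:8020/streaming/wc", "txt")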

  • saveAsObjectFiles(prefix, [suffix])

Save the DStream contents as a SequenceFile of serialized Java objects. The file name for each batch interval is prefix-TIME_IN_MS[.suffix]. This method is not available in the Python API.

  • saveAsHadoopFiles(prefix, [suffix])

Save the DStream contents as Hadoop files. The file name for each batch interval is prefix-TIME_IN_MS[.suffix]. This method is not available in the Python API.

  • foreachRDD(func)

A general-purpose output operator: the function func writes the data of each RDD to an external storage system, for example to files or to a database.

 /**
   * Apply a function to each RDD in this DStream. This is an output operator, so
   * 'this' DStream will be registered as an output stream and therefore materialized.
   */
  def foreachRDD(foreachFunc: RDD[T] => Unit): Unit = ssc.withScope {
    val cleanedF = context.sparkContext.clean(foreachFunc, false)
    foreachRDD((r: RDD[T], _: Time) => cleanedF(r), displayInnerRDDOps = true)
  }

  /**
   * Apply a function to each RDD in this DStream. This is an output operator, so
   * 'this' DStream will be registered as an output stream and therefore materialized.
   */
  def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit = ssc.withScope {
    // because the DStream is reachable from the outer object here, and because
    // DStreams can't be serialized with closures, we can't proactively check
    // it for serializability and so we pass the optional false to SparkContext.clean
    foreachRDD(foreachFunc, displayInnerRDDOps = true)
  }

  private def foreachRDD(
      foreachFunc: (RDD[T], Time) => Unit,
      displayInnerRDDOps: Boolean): Unit = {
    new ForEachDStream(this,
      context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
  }

foreachRDD is a very important operation; users can use it to write processed data to external storage. There are a few details to watch out for when using foreachRDD, analyzed below:

When writing data to MySQL, you need a database connection. A user may inadvertently create the connection object on the Spark driver and then use it on the workers to write data to the external system, as in the following code:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // ① note: this code runs on the driver
  rdd.foreach { record =>
    connection.send(record) // ② note: this code runs on the workers
  }
}

Important: the usage above is wrong, because the connection object would need to be serialized and sent from the driver to the worker nodes. Connection objects usually cannot be serialized, so they cannot be shipped across nodes, and the code above throws a serialization error. The correct approach is to create the connection on the worker node, that is, inside rdd.foreach, as follows:

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

The approach above solves the serialization problem, but it creates a connection for every single record. Creating a connection object usually has a non-trivial cost, so frequently creating and destroying connections reduces overall throughput. A better approach is to replace rdd.foreach with rdd.foreachPartition: instead of creating a connection per record, create one connection per RDD partition, which greatly reduces the connection-creation overhead.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

In fact, the usage above can be optimized further by reusing connection objects across multiple RDDs or batches. Users can maintain a static pool of connection objects and reuse them when pushing multiple batches of RDDs to the external system, further reducing the overhead:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  
  }
}

Use Cases

  • Simulate database connection pool
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.LinkedList;

/**
 * A simple connection pool
 */
public class ConnectionPool {

    // Static queue of connections
    private static LinkedList<Connection> connectionQueue;

    /**
     * Load the JDBC driver
     */
    static {
        try {
            Class.forName("com.mysql.jdbc.Driver");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    /**
     * Get a connection; synchronized to control concurrent access from multiple threads
     *
     * @return
     */
    public synchronized static Connection getConnection() {
        try {
            if (connectionQueue == null) {
                connectionQueue = new LinkedList<Connection>();
                for (int i = 0; i < 10; i++) {
                    Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/wordcount", "root",
                            "123qwe");
                    connectionQueue.push(conn);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return connectionQueue.poll();
    }

    /**
     * Return a connection to the pool when done
     */
    public static void returnConnection(Connection conn) {
        connectionQueue.push(conn);
    }

}

  • Real-time statistics written to MySQL
import java.sql.{Connection, PreparedStatement}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    // Save the results to MySQL
    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        var conn: Connection = null
        var stmt: PreparedStatement = null
        // Get one connection per partition
        conn = ConnectionPool.getConnection
        // Iterate over the records of the partition and insert them into the database
        // using that single connection
        while (partition.hasNext) {
          val wordcounts = partition.next()
          val sql = "insert into wctbl(word,count) values (?,?)"
          stmt = conn.prepareStatement(sql)
          stmt.setString(1, wordcounts._1.trim)
          stmt.setInt(2, wordcounts._2.toInt)
          stmt.executeUpdate()
        }
        // Return the connection to the pool when done
        ConnectionPool.returnConnection(conn)
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}

Summary

Due to space limitations, this article mainly discusses the Spark Streaming execution mechanism, transformations and output operations, Spark Streaming data sources (Sources), and Spark Streaming data sinks (Sinks). The next article will cover time-based window operations, stateful computation, checkpointing, performance tuning, and more.


Source: blog.csdn.net/jmx_bigdata/article/details/107676548