Detailed explanation of Spark Streaming based on Kafka Direct

Original blog post: https://blog.csdn.net/erfucun/article/details/52275369

This blog post mainly covers the following topics:

1. The working principle and mechanism of SparkStreaming on Kafka Direct
2. A hands-on case of SparkStreaming on Kafka Direct
3. Source code analysis of SparkStreaming on Kafka Direct

One: SparkStreaming on Kafka Direct working mechanism:

1. Features of Direct mode:

(1) The Direct approach operates directly on the metadata of the underlying Kafka, so if a computation fails the data can be re-read and re-processed; in other words, the data is guaranteed to be processed. "Pulling data" here means that the RDD pulls the data directly from Kafka when it executes.
(2) Because Kafka is operated on directly, Kafka effectively acts as your underlying file system, and strict transactional consistency can be guaranteed: the data is definitely processed, and processed only once. The Receiver approach cannot guarantee this, because the data held by the Receiver and the offsets in ZooKeeper may get out of sync, so Spark Streaming may consume data repeatedly; this can be worked around with tuning, but it is clearly not as convenient as Direct. With the Direct API, Spark Streaming itself is responsible for tracking the offsets of the consumed data and saving them to the checkpoint, so the offsets and the data stay in sync and records are not duplicated (here you have to define your own way of saving to the checkpoint). Even after a restart there is no duplication, because the offsets are already in the checkpoint. However, when the program is upgraded, the old checkpoint cannot be read. How do we deal with a checkpoint that becomes invalid on upgrade? When upgrading, simply read the backup you specified yourself; in other words, the checkpoint location can be specified manually, which again preserves the exactly-once transactional semantics. So how do you checkpoint manually? When building the StreamingContext there is the getOrCreate API, which reads the checkpoint contents and lets you specify where the checkpoint lives, for example as follows:

private static JavaStreamingContext createContext(
            String checkpointDirectory, SparkConf conf) {
        // Build a fresh StreamingContext and register the checkpoint directory
        System.out.println("Creating new context");

        SparkConf sparkConf = conf;
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf,Durations.seconds(5));
        ssc.checkpoint(checkpointDirectory);
        return ssc; 
    }
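For reference, here is a minimal usage sketch of getOrCreate built on the createContext method above (a sketch only, assuming Spark 1.4+ where the Java API takes a Function0; the checkpoint path and configuration are illustrative):

// Illustrative values; in a real job the checkpoint directory would usually be on HDFS.
final String checkpointDirectory = "/tmp/spark-streaming-checkpoint";
final SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingOnKafkaDirect");

// getOrCreate either restores the StreamingContext from the checkpoint directory or,
// if no readable checkpoint exists, calls the factory to build a fresh one.
// Note that the whole DStream graph must be defined inside the factory for recovery to work.
JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkpointDirectory,
        new org.apache.spark.api.java.function.Function0<JavaStreamingContext>() {
            @Override
            public JavaStreamingContext call() throws Exception {
                return createContext(checkpointDirectory, conf);
            }
        });
jsc.start();
jsc.awaitTermination();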

And what if, after recovering from the checkpoint, there is too much data to be processed? The options are: 1) limit the ingestion rate; 2) increase the processing power of the machines; 3) put the data into a buffer pool.
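For the first option, here is a hedged configuration sketch using two standard Spark properties (spark.streaming.backpressure.enabled needs Spark 1.5 or later, spark.streaming.kafka.maxRatePerPartition applies to the Direct approach, and the numbers are only illustrative):

SparkConf conf = new SparkConf()
        .setAppName("SparkStreamingOnKafkaDirect")
        // let Spark throttle the ingestion rate dynamically, based on scheduling delay
        .set("spark.streaming.backpressure.enabled", "true")
        // hard upper bound: at most 10,000 records per Kafka partition per second
        .set("spark.streaming.kafka.maxRatePerPartition", "10000");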

(3) Since the underlying layer reads the data directly, there is no Receiver; Kafka is simply queried periodically (at each batch interval). When the data is processed, the Kafka Consumer API is used to fetch the data for a specific offset range from Kafka. An obvious performance benefit of accessing Kafka through the Direct API is that if you read from multiple Kafka partitions, Spark creates the corresponding RDD partitions as well: the RDD partitions and the Kafka partitions map one-to-one (in the Receiver approach the two have nothing to do with each other). The advantage is that when your RDD reads from Kafka at the bottom layer, a Kafka partition is effectively like a block on HDFS, which is consistent with data locality: the RDD partition and the Kafka data are in the same place, so the place where the data is read, the place where it is processed and the program that drives the processing can all sit on the same machine, which can greatly improve performance. The disadvantage is that, because the RDD and Kafka partitions are one-to-one, increasing the degree of parallelism is more troublesome: raising it still means repartition, i.e. repartitioning, which generates a shuffle and is very time-consuming. Perhaps a future version will allow the ratio to be configured freely instead of being fixed at one-to-one. Because a higher degree of parallelism makes better use of the cluster's computing resources, this is very meaningful. (A repartition sketch follows after point (4).)
(4) There is no need to enable the WAL mechanism. From the point of view of zero data loss the efficiency is greatly improved, and at least twice the disk space is saved. And fetching data from Kafka is definitely faster than fetching it from HDFS, because Kafka uses zero copy.
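As promised in point (3), a minimal repartition sketch, assuming the jsc, kafkaParameters and topics objects from the case in section Two below: the stream created by the Direct API has one RDD partition per Kafka partition, so widening the parallelism requires an explicit repartition, at the cost of a shuffle.

JavaPairInputDStream<String, String> direct = KafkaUtils.createDirectStream(jsc,
        String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParameters,
        topics);

// Each batch RDD has exactly as many partitions as the Kafka topic being consumed;
// repartition(40) spreads the records over 40 partitions, but triggers a shuffle.
JavaPairDStream<String, String> widened = direct.repartition(40);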

2. Comparison of SparkStreaming on Kafka Direct and Receiver:

From a high-level perspective, the previous integration with Kafka (the Receiver approach) using the WAL works as follows:

1) Kafka Receivers running on Spark workers/executors continuously read data from Kafka, which uses the high-level consumer API in Kafka.

2) The received data is stored in the memory of the Spark workers/executors and also written to the WAL. Only after the received data has been persisted to the log do the Kafka Receivers update the Kafka offsets in ZooKeeper.
3) The received data and the WAL storage location information are stored reliably; if a failure occurs in the meantime, this information is used to recover from the error and continue processing the data.
[Figure: the Receiver-based approach with WAL]
This approach ensures that no data received from Kafka is lost. But in the case of a failure there is a good chance that some data will be processed more than once. This happens when some received data has been reliably saved to the WAL, but the system fails before the Kafka offsets in ZooKeeper are updated. This leads to an inconsistency: Spark Streaming knows that the data has been received, but Kafka thinks it has not been delivered, so when the system returns to normal, Kafka sends the data again.

The reason for this inconsistency is that the two systems cannot atomically record the information about what has already been received. To solve this, only one system should maintain the consistent view of what has been sent or received, and that system needs enough control to recover from failures. Based on these considerations, the community decided to store all consumption offsets in Spark Streaming alone, and to use Kafka's low-level consumer API to restore data from arbitrary positions.

To build this system, the newly introduced Direct API takes a completely different approach from Receivers and WALs. Instead of starting a Receiver to continuously receive data from Kafka and write it to the WAL, it simply determines, for each batch interval, the range of offsets to be read; when the job for that batch later runs, the data corresponding to those offsets is read from Kafka. The offsets are also stored reliably (checkpointed) and can be read back directly when recovering from a failure.
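To make this concrete, here is a hedged sketch against the 0.8 direct connector's Java API (HasOffsetRanges and OffsetRange live in org.apache.spark.streaming.kafka; directStream stands for the JavaPairInputDStream created by KafkaUtils.createDirectStream as in section Two): each batch RDD carries the offset ranges it was built from, and the application can read them inside foreachRDD and store them wherever it keeps its own offsets.

directStream.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd) throws Exception {
        // The underlying KafkaRDD implements HasOffsetRanges.
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        for (OffsetRange range : offsetRanges) {
            // topic, partition and the [fromOffset, untilOffset) range read in this batch
            System.out.println(range.topic() + " " + range.partition() + " "
                    + range.fromOffset() + " " + range.untilOffset());
        }
        return null;
    }
});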

[Figure: the Direct API approach]

Note that after a failure Spark Streaming can re-read and re-process those data segments from Kafka. However, because of the exactly-once semantics, the final result after reprocessing is the same as the result with no failure.

Thus, the Direct API eliminates the need for WALs and Receivers, and ensures that each Kafka record is effectively received by Spark Streaming exactly once. This allows Spark Streaming and Kafka to be integrated nicely. Overall, these features make stream processing pipelines more fault-tolerant, efficient, and easy to use.

Two: a hands-on case of SparkStreaming on Kafka Direct:

1. The source code information is as follows:

package com.dt.spark.SparkApps.sparkstreaming;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import kafka.serializer.StringDecoder;
import scala.Tuple2;

public class SparkStreamingOnKafkaDirect {

    public static void main(String[] args) {
/*      Step 1: configure SparkConf.
        1. At least two threads are needed, because a Spark Streaming application needs at least one thread to keep
        receiving data in a loop and at least one thread to process the received data (otherwise, with no thread
        left for processing, memory and disk would become overwhelmed as time goes by).
        2. On a cluster, every Executor generally has more than one thread anyway. For a Spark Streaming application,
        how many cores should each Executor be given? Based on our past experience, about 5 cores is optimal (as the
        joke goes, an odd number of cores performs best, e.g. 3, 5 or 7 cores).
*/
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingOnKafkaDirect");
/*      SparkConf conf = new SparkConf().setMaster("spark://Master:7077").setAppName("SparkStreamingOnKafkaDirect");
        Step 2: create the JavaStreamingContext.
        1. This is the starting point of all the functionality of a Spark Streaming application and the core of its
        scheduling. The StreamingContext can be built from a SparkConf, or it can be restored from persisted
        StreamingContext contents (the typical scenario is that the Driver is restarted after a crash: since Spark
        Streaming is meant to run continuously, 24/7, the Driver must pick up its previous state after restarting,
        and this recovery is based on the earlier checkpoint).
        2. Several StreamingContext objects can be created in one Spark Streaming application, but the one that is
        currently running must be closed before the next one is used. This gives us an important insight: Spark
        Streaming is just another application on top of Spark Core; for the Spark Streaming framework to run, the
        Spark engineer only needs to write the business logic.
*/
        @SuppressWarnings("resource")
        JavaStreamingContext jsc = new JavaStreamingContext(conf,Durations.seconds(10));

/*      Step 3: create the Spark Streaming input source (an input DStream).
        1. The input source can be based on files, HDFS, Flume, Kafka, sockets and so on.
        2. Here the data comes from Kafka through the Direct API: at every batch interval Spark Streaming queries
        Kafka directly for the offset ranges to read (the topic must of course exist, and data is produced into it
        as the business requires; as far as the Spark Streaming application is concerned, the processing flow is the
        same whether or not there is data).
        3. If a batch interval frequently contains no data, continually launching empty jobs wastes scheduling
        resources, since nothing is actually computed; production-grade code therefore usually checks whether there
        is data before submitting a job, and skips the submission if there is none.
    The concrete parameters of createDirectStream used in this case are:
        the JavaStreamingContext instance,
        the key/value classes and their Kafka decoder classes,
        the Kafka parameters (here metadata.broker.list, the list of brokers to read metadata and data from),
        and the set of topics to consume.
*/
        Map<String,String> kafkaParameters = new HashMap<String,String>();
        kafkaParameters.put("metadata.broker.list",
                "Master:9092,Worker1:9092,Worker2:9092");

        Set<String> topics = new HashSet<String>();
        topics.add("SparkStreamingDirected");

        JavaPairInputDStream<String,String> lines = KafkaUtils.createDirectStream(jsc,
                String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParameters,
                topics);
    /*
     * Step 4: from here on, program against the DStream just as you would against an RDD. The reason is that a
     * DStream is the template from which RDDs are generated (a class, if you like): before Spark Streaming performs
     * any computation, the operations on each batch's DStream are in essence translated into operations on RDDs.
     * Apply Transformation-level processing to the initial DStream.
     * */
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<Tuple2<String, String>,String>(){ // in Scala, thanks to SAM conversion, this could be written as: val words = lines.flatMap { line => line.split(" ") }

            @Override
            public Iterable<String> call(Tuple2<String,String> tuple) throws Exception {

                return Arrays.asList(tuple._2.split(" "));// return the words as an Iterable
            }
        });
//      Continue with Transformation-level operations on the initial DStream:
        // on top of the word split, count each word occurrence as 1, i.e. word => (word, 1)
        JavaPairDStream<String,Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {

            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word,1);
            }

        });
        // on the basis of counting each occurrence as 1, compute the total number of occurrences of each word

         JavaPairDStream<String,Integer> wordsCount = pairs.reduceByKey(new Function2<Integer,Integer,Integer>(){

            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        /*
         * The print here does not by itself trigger job execution, because everything is under the control of the
         * Spark Streaming framework; for Spark, whether a real job is actually triggered depends on the configured
         * Duration (batch interval).
         * Note that for a Spark Streaming application to execute a concrete job, the DStream must have an output
         * operation. There are many kinds of output operations, such as print, saveAsTextFile, saveAsHadoopFiles and
         * so on. The most important of them is foreachRDD, because the results of Spark Streaming processing are
         * usually written to Redis, a database, a dashboard and the like; foreachRDD is what completes these tasks,
         * and you can decide for yourself where the data ends up!
         * */
         wordsCount.print();

//       The Spark Streaming execution engine, i.e. the Driver, starts running; the Driver runs in a new thread, and
//       internally it receives messages from the application itself and from the Executors.
         jsc.start();
         jsc.awaitTermination();
         jsc.close();
    }

}

The next steps are: 
(1) First, start the ZooKeeper service from the bin directory on the machines where ZooKeeper is installed:

(2) Next, start the Kafka service on every machine from the Kafka bin directory:
1) nohup ./kafka-server-start.sh ../config/server.properties &

2) ./kafka-topics.sh --create --zookeeper Master:2181,Worker1:2181,Worker2:2181 --replication-factor 3 --partitions 1 --topic SparkStreamingDirected

3) ./kafka-console-producer.sh --broker-list Master:9092,Worker1:9092,Worker2:9092 --topic SparkStreamingDirected

(3) Type some data into the producer console.

(4) At this point, you can observe the word-count output on the Eclipse console.
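For example, typing a line such as

hello spark hello kafka

into the producer console should make the Eclipse console print, once per batch interval, something roughly like the following (the timestamp is illustrative):

-------------------------------------------
Time: 1472113210000 ms
-------------------------------------------
(hello,2)
(spark,1)
(kafka,1)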
Three: SparkStreaming on Kafka Direct source code analysis

1. First of all, in KafkaUtils we can see that the source comments for createDirectStream are very detailed; they mainly describe how to access the offsets, and how to expose the RDDs we need through foreachRDD, and so on.

[Screenshot: source comments of KafkaUtils.createDirectStream]

2. Here we can see a very detailed description of each parameter:
[Screenshot: parameter descriptions of createDirectStream]

3. We can see that a DirectKafkaInputDStream is created:
[Screenshot: creation of DirectKafkaInputDStream]

4. Inside DirectKafkaInputDStream we can see that it creates a DirectKafkaInputDStreamCheckpointData:
[Screenshot: DirectKafkaInputDStreamCheckpointData inside DirectKafkaInputDStream]

5. Through DirectKafkaInputDStreamCheckpointData we can see that the checkpointing of offsets can be customized; a sketch follows the screenshot below:

[Screenshot: DirectKafkaInputDStreamCheckpointData]
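A hedged sketch of such custom offset handling, using the createDirectStream overload that accepts explicit starting offsets (spark-streaming-kafka 0.8 Java API; TopicAndPartition and MessageAndMetadata come from the Kafka client libraries, jsc and kafkaParameters are assumed to be set up as in section Two, and the offset value is illustrative):

// Offsets the application saved by itself (e.g. in ZooKeeper, a database or a file).
Map<TopicAndPartition, Long> fromOffsets = new HashMap<TopicAndPartition, Long>();
fromOffsets.put(new TopicAndPartition("SparkStreamingDirected", 0), 12345L);

JavaInputDStream<String> restored = KafkaUtils.createDirectStream(jsc,
        String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        String.class,                 // record type produced by the message handler below
        kafkaParameters,
        fromOffsets,
        new Function<MessageAndMetadata<String, String>, String>() {
            @Override
            public String call(MessageAndMetadata<String, String> msg) throws Exception {
                return msg.message();  // keep only the message payload
            }
        });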

For more of the source code, please explore it yourself.
Additional notes:

Spark Streaming can process many kinds of data sources, such as databases, HDFS, server logs and network streams. Its power goes beyond the scenarios you can imagine, yet it is often left unused; the real reason is usually that Spark and Spark Streaming themselves are not well understood.
