4、Spark Streaming + Kafka

A, Receiver mode

1, Receiver mode schematic

[image: receiver mode schematic]

After the Spark Streaming application starts, a receiver task runs on an Executor and Kafka pushes data to it. Received data is persisted at the default storage level
MEMORY_AND_DISK_SER_2 (this level can be changed), so while storing the data the receiver also replicates it to another node. Once the replica is written, the consumer offset is updated in
ZooKeeper and the receiver reports the locations of the data blocks to the ReceiverTracker on the Driver. Finally, the Driver distributes tasks to the appropriate nodes based on data locality.


2, Receiver mode: problems and solutions

If the Driver process dies, the Executors under it are killed as well. When the consumer offset has already been updated in ZooKeeper but the corresponding data has not yet been processed, a Driver failure means that data can never be recovered: it is effectively lost.


How is this problem solved?
Enable the WAL (write ahead log) mechanism: as data arrives it is not only replicated to another node but also backed up to HDFS (and the receiver's persistence level should accordingly be downgraded to MEMORY_AND_DISK),
which guarantees the data's safety. However, writes to HDFS are expensive, and since the ZooKeeper offset is only updated and the position only reported after the backup completes, job execution time grows,
so task latency increases.
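In Spark's configuration, the WAL described above is switched on with a single flag; a sketch follows (it also requires a checkpoint directory on HDFS to be set on the StreamingContext, since that is where the log files are written):

```properties
# spark-defaults.conf (or set programmatically on SparkConf)
# enable the receiver write-ahead log; log files go under the checkpoint directory
spark.streaming.receiver.writeAheadLog.enable=true
```

With the WAL on, the storage level passed to createStream should be downgraded to StorageLevel.MEMORY_AND_DISK, because the HDFS log already provides the second replica.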


3, Receiver mode summary

1. Kafka has two consumer APIs:
    1. High Level Consumer API: the consumer cannot maintain its own offsets, and when using the high-level API there is no handle on data loss.
    The Kafka + Spark Streaming receiver mode is implemented on top of the High Level Consumer API.
    2. Simple Consumer API: the consumer can manage its own offsets.
    
2. The failure scenario:
    In Kafka + Spark Streaming receiver mode, once the offset has been updated in ZooKeeper, if the Driver dies, the Executors under it are killed with it, and any received-but-unprocessed data is lost.

    How is this solved?
    Enable the WAL (Write Ahead Log) mechanism: back the data up to HDFS first, and only then update the offset in ZooKeeper. With the WAL enabled, the receiver's storage level should be downgraded
    by dropping the "_2" replication suffix. Enabling the WAL increases the application's processing time.
    
3. Receiver mode relies on ZooKeeper to manage offsets.

4. What determines the parallelism of receiver mode? spark.streaming.blockInterval (default 200ms).
    While receiving, the receiver writes the data out as one block every spark.streaming.blockInterval. With batchInterval = 5s, 25 blocks are generated per batch.
    A batch is wrapped into an RDD, and each block becomes one partition of that RDD, so blocks map one-to-one to RDD partitions.
    
How can the parallelism of receiver mode be increased?
    With batchInterval fixed, decrease spark.streaming.blockInterval to increase the number of partitions in the RDDs of the generated DStream.
    It is recommended, however, that spark.streaming.blockInterval not be set below 50ms.
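The arithmetic above can be checked with a few lines of plain Java (no Spark required; the numbers mirror the defaults quoted in the text):

```java
public class BlockParallelism {
    public static void main(String[] args) {
        long batchIntervalMs = 5000;  // batchInterval = 5s
        long blockIntervalMs = 200;   // spark.streaming.blockInterval = 200ms

        // one block is cut every blockInterval, so blocks (= RDD partitions) per batch:
        long blocksPerBatch = batchIntervalMs / blockIntervalMs;
        System.out.println(blocksPerBatch);           // 25

        // lowering blockInterval to the recommended minimum of 50ms:
        System.out.println(batchIntervalMs / 50);     // 100
    }
}
```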


4, Receiver mode WordCount example

package cn.spark.study.streaming;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

/**
 * Real-time WordCount based on the Kafka receiver approach
 * @author Administrator
 *
 */
public class KafkaReceiverWordCount {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("KafkaWordCount");  
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        
        // Use KafkaUtils.createStream() to create an input DStream for Kafka
        Map<String, Integer> topicThreadMap = new HashMap<String, Integer>();
        // how many threads to use to pull this topic's data
        topicThreadMap.put("WordCount", 1);
        
        // createStream takes four arguments:
        // 1) the StreamingContext;  2) the ZK quorum;  3) a consumer group id of your choosing;
        // 4) the per-topic number of threads to consume with
        JavaPairReceiverInputDStream<String, String> lines = KafkaUtils.createStream(
                jssc, 
                "192.168.1.135:2181,192.168.1.136:2181,192.168.1.137:2181", 
                "DefaultConsumerGroup", 
                topicThreadMap);
        
        // WordCount logic
        JavaDStream<String> words = lines.flatMap(
                
                new FlatMapFunction<Tuple2<String,String>, String>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Iterable<String> call(Tuple2<String, String> tuple)
                            throws Exception {
                        return Arrays.asList(tuple._2.split(" "));  
                    }
                    
                });
        
        JavaPairDStream<String, Integer> pairs = words.mapToPair(
                
                new PairFunction<String, String, Integer>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Tuple2<String, Integer> call(String word)
                            throws Exception {
                        return new Tuple2<String, Integer>(word, 1);
                    }
                    
                });
        
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
                
                new Function2<Integer, Integer, Integer>() {
            
                    private static final long serialVersionUID = 1L;

                    @Override
                    public Integer call(Integer v1, Integer v2) throws Exception {
                        return v1 + v2;
                    }
                    
                });
        
        wordCounts.print();  
        
        jssc.start();
        jssc.awaitTermination();
        jssc.close();
    }
    
}






## Run the program in Eclipse

## Create a topic
[root@spark1 kafka]# bin/kafka-topics.sh --zookeeper 192.168.1.135:2181,192.168.1.136:2181,192.168.1.137:2181 --topic WordCount --replication-factor 1 --partitions 1 --create

## Start a console producer, type some data, and watch the counts printed by the program
[root@spark1 kafka]# bin/kafka-console-producer.sh --broker-list 192.168.1.135:9092,192.168.1.136:9092,192.168.1.137:9092 --topic WordCount


B, Direct mode

1, Direct mode schematic

[image: Direct mode schematic]


2, Understanding Direct mode

Direct mode uses Kafka's Simple Consumer API.

Direct mode treats Kafka as a store of data: instead of passively receiving data, Spark Streaming actively pulls it. Consumer offsets are not managed through ZooKeeper either; Spark Streaming maintains
them itself. By default the consumed offsets live in memory, and if a checkpoint directory is configured they are also saved in the checkpoint. Managing them in ZooKeeper yourself is also possible.

The parallelism of the RDDs in the DStream produced by Direct mode equals the number of partitions of the topic being read.
It is best to configure a checkpoint when using Direct mode.


3, Direct mode WordCount example

package cn.spark.study.streaming;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

/**
 * Real-time WordCount based on the Kafka Direct approach
 * @author Administrator
 *
 */
public class KafkaDirectWordCount {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("KafkaDirectWordCount");  
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        
        // First, build a map of Kafka parameters
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", 
                "192.168.1.135:9092,192.168.1.136:9092,192.168.1.137:9092");
        
        // Then build a set holding the topics to read;
        // the Direct API can read several topics in parallel
        Set<String> topics = new HashSet<String>();
        topics.add("WordCount");
        
        // Create the input DStream
        JavaPairInputDStream<String, String> lines = KafkaUtils.createDirectStream(
                jssc, 
                String.class, 
                String.class, 
                StringDecoder.class, 
                StringDecoder.class, 
                kafkaParams, 
                topics);
        
        // WordCount logic
        JavaDStream<String> words = lines.flatMap(
                
                new FlatMapFunction<Tuple2<String,String>, String>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Iterable<String> call(Tuple2<String, String> tuple)
                            throws Exception {
                        return Arrays.asList(tuple._2.split(" "));  
                    }
                    
                });
        
        JavaPairDStream<String, Integer> pairs = words.mapToPair(
                
                new PairFunction<String, String, Integer>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Tuple2<String, Integer> call(String word) throws Exception {
                        return new Tuple2<String, Integer>(word, 1);
                    }
                    
                });
        
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
                
                new Function2<Integer, Integer, Integer>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Integer call(Integer v1, Integer v2) throws Exception {
                        return v1 + v2;
                    }
                    
                });
        
        wordCounts.print();
        
        jssc.start();
        jssc.awaitTermination();
        jssc.close();
    }
    
}





## Run and verify, similar to receiver mode


C, Manual offset management

1, Options for managing offsets manually

Manage offsets yourself in ZooKeeper;

Manage them in MySQL;

Manage them in HBase;
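Whichever store is used, the pattern is the same: read the last committed offset, process the batch, then write the new offset back only after processing succeeds. Below is a minimal in-memory sketch of that pattern; the map stands in for the external store, and all names are illustrative, not part of any real API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ManualOffsetSketch {
    // stands in for an external store such as ZooKeeper, MySQL or HBase
    static Map<String, Long> offsetStore = new HashMap<>();

    static long readOffset(String topicPartition) {
        return offsetStore.getOrDefault(topicPartition, 0L);
    }

    static void processBatch(String topicPartition, List<String> records) {
        long from = readOffset(topicPartition);
        // ... process records covering offsets [from, from + records.size()) here ...
        long newOffset = from + records.size();
        // commit only after processing succeeded (at-least-once semantics)
        offsetStore.put(topicPartition, newOffset);
    }

    public static void main(String[] args) {
        processBatch("mytopic-0", List.of("a", "b", "c"));
        System.out.println(readOffset("mytopic-0")); // 3
    }
}
```

If the process dies between processing and the commit, the batch is re-read and re-processed on restart, so the data is duplicated rather than lost.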


2, Code

package com.manage;

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import com.google.common.collect.ImmutableMap;
import com.manage.getOffset.GetTopicOffsetFromKafkaBroker;
import com.manage.getOffset.GetTopicOffsetFromZookeeper;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import kafka.cluster.Broker;

import com.fasterxml.jackson.databind.ObjectMapper;

import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetRequest;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import scala.Tuple2;

public class UseZookeeperManageOffset {
    /**
     * Log with log4j; the logger is bound to UseZookeeperManageOffset.class, the class producing the log
     */
    static final Logger logger = Logger.getLogger(UseZookeeperManageOffset.class);
    
    
    public static void main(String[] args) {
        /**
         * Load the log4j configuration file so logging works
         */
        ProjectUtil.LoadLogConfig();
        logger.info("project is starting...");
        
        /**
         * Get, from the Kafka cluster, the largest produced-message offset of each partition of the topic
         */
        Map<TopicAndPartition, Long> topicOffsets = GetTopicOffsetFromKafkaBroker.getTopicOffsets("node1:9092,node2:9092,node3:9092", "mytopic");
        
        /**
         * Get, from ZooKeeper, the offset up to which the consumer has consumed each partition of the topic
         */
        Map<TopicAndPartition, Long> consumerOffsets = 
                GetTopicOffsetFromZookeeper.getConsumerOffsets("node3:2181,node4:2181,node5:2181","zhy","mytopic");
        
        /**
         * Merge the two offset maps obtained above.
         *     The idea:
         *         If consumer offsets can be read from ZooKeeper, the offsets currently in ZooKeeper win.
         *         Otherwise, no offset exists in ZooKeeper for this consumer group and topic, meaning the group
         *             is consuming the topic for the first time, so start from the largest offsets in the topic.
         */
        if(null!=consumerOffsets && consumerOffsets.size()>0){
            topicOffsets.putAll(consumerOffsets);
        }
        /**
         * Uncommenting the loop below resets every partition's offset to 0, i.e. consume from the very beginning.
         */
//        for(Map.Entry<TopicAndPartition, Long> item:topicOffsets.entrySet()){
//          item.setValue(0l);
//        }
        
        /**
         * Build the Spark Streaming program and consume from the current offsets
         */
        JavaStreamingContext jsc = SparkStreamingDirect.getStreamingContext(topicOffsets,"zhy");
        jsc.start();
        jsc.awaitTermination();
        jsc.close();
        
    }
}
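The merge step above reduces to a map override: start from the broker's largest offsets and let whatever ZooKeeper holds take precedence. Stripped of the Kafka types, the logic looks like this (TopicAndPartition replaced by a plain String key purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetMergeSketch {
    /** The broker's largest offsets win only when ZooKeeper has nothing for the group. */
    static Map<String, Long> mergeOffsets(Map<String, Long> largestFromBroker,
                                          Map<String, Long> consumedFromZk) {
        Map<String, Long> merged = new HashMap<>(largestFromBroker);
        if (consumedFromZk != null && !consumedFromZk.isEmpty()) {
            merged.putAll(consumedFromZk); // ZooKeeper's offsets take precedence
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> broker = Map.of("mytopic-0", 500L, "mytopic-1", 700L);
        // the group has consumed partition 0 up to 120 but has never touched partition 1
        Map<String, Long> zk = Map.of("mytopic-0", 120L);
        Map<String, Long> merged = mergeOffsets(broker, zk);
        System.out.println(merged.get("mytopic-0") + " " + merged.get("mytopic-1")); // 120 700
    }
}
```

Note that a partition missing from ZooKeeper (here mytopic-1) keeps the broker's largest offset, which is exactly the first-time-consumption behavior described in the comment block above.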


package com.manage;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;

public class ProjectUtil {
    /**
     * Configure log4j logging
     */
    static final Logger logger = Logger.getLogger(UseZookeeperManageOffset.class);
    /**
     * Load log4j.properties. By default it is read from src; if it lives elsewhere, it must be loaded manually as done here
     */
    public static void LoadLogConfig() {
        PropertyConfigurator.configure("d:/eclipse4.7WS/SparkStreaming_Kafka_Manage/resource/log4j.properties"); 
    }
    
    /**
     * Load the configuration file.
     * The directory containing config.properties must be marked as a resource directory.
     * @return
     */
    public static Properties loadProperties() {

        Properties props = new Properties();
        InputStream inputStream = Thread.currentThread().getContextClassLoader().getResourceAsStream("config.properties");
        if(null != inputStream) {
            try {
                props.load(inputStream);
            } catch (IOException e) {
                logger.error(String.format("Config.properties file not found in the classpath"));
            }
        }
        return props;

    }
    
    public static void main(String[] args) {
        Properties props = loadProperties();
        String value = props.getProperty("hello");
        System.out.println(value);
    }
}


Origin www.cnblogs.com/weiyiming007/p/11401591.html