I. Introduction to Spark Streaming
-----------------------------------------------------------
1. Overview
Spark Streaming is an extension of Spark Core for scalable, high-throughput, fault-tolerant stream processing of live data.
Data can be ingested from sources such as Kafka, Flume, or TCP sockets, and processed with high-level functions such as map, reduce, filter, join, and window. The processed results can be pushed out to databases or HDFS.
Processed data streams can also feed into machine learning and graph processing algorithms.
2. Internal execution flow:
Spark Streaming receives live input data streams and divides the data into batches; the Spark engine processes each batch and produces the final result stream, also in batches.
3. Discretized stream, or DStream
A DStream represents a continuous stream of data.
DStreams can be created from input sources such as Kafka or Flume, or by applying high-level transformations to other DStreams (much like RDDs).
Internally, a DStream is represented as a sequence of RDDs.
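Because each batch of a DStream is an ordinary RDD, RDD-level code can be applied per batch. A minimal illustration (Java; assumes an existing JavaDStream<String> named lines, as in the demos below) using foreachRDD:
//each batch of the DStream arrives as an ordinary RDD
lines.foreachRDD((rdd, time) -> {
System.out.println("batch at " + time + " contains " + rdd.count() + " elements");
});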
II. Programming Spark Streaming
------------------------------------------------------------
1. Add the dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
2. Write the Scala program -- SparkStreamingDemo
package com.test.spark.streaming.scala
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
object SparkStreamingDemo {
def main(args: Array[String]): Unit = {
//note: with local[n], n must be > 1 (one thread for the receiver, at least one for processing)
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
//create the streaming context with a 1-second batch interval
val ssc = new StreamingContext(conf, Seconds(1))
//create a socket text stream
val lines = ssc.socketTextStream("localhost", 9999)
//split each line into words
val words = lines.flatMap(_.split(" "))
//pair each word with a count of 1
val pairs = words.map(word => (word, 1))
//sum the counts for each word within the batch
val wordCounts = pairs.reduceByKey(_ + _)
//print the first ten elements of each batch to the console
wordCounts.print()
//start the computation
ssc.start()
//wait for the computation to terminate
ssc.awaitTermination()
}
}
3. First start an nc server to produce the data source
cmd> nc -l -p 9999
4. Run the Scala program and check the results
run
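For example (illustrative only; the header is the format DStream.print() emits, the timestamp is hypothetical), typing into the nc window:
hello world hello
should produce, at the next 1-second batch, console output like:
-------------------------------------------
Time: 1489479151000 ms
-------------------------------------------
(hello,2)
(world,1)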
5. Package the jar and run it on Ubuntu
$> nc -lk 9999
$> spark-submit --master spark://s100:7077 --name netwc --class com.test.spark.streaming.scala.SparkStreamingDemo TestSpark-1.0-SNAPSHOT.jar
6. Java version
package com.test.spark.streaming.java;
import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;
import java.util.Arrays;
public class WCDemoJava {
public static void main(String [] args)
{
SparkConf conf = new SparkConf().setMaster("spark://s100:7077").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("s100", 9999);
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);
wordCounts.print();
jssc.start(); // Start the computation
try {
jssc.awaitTermination(); // Wait for the computation to terminate
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
III. DStreams -- discretized streams
------------------------------------------------------------
[Notes]
1. Once a context has been started, no new streaming computations can be set up or added to it.
2. Once a context has been stopped, it cannot be restarted.
3. Only one StreamingContext can be active in a JVM at the same time.
4. stop() on a StreamingContext also stops the SparkContext by default; to stop only the StreamingContext, pass false:
ssc.stop(false); // ssc.stop(true) stops the SparkContext as well
5. A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext is stopped before the next one is created, as the sketch below shows.
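A minimal sketch of point 5 (Java; assumes an existing JavaStreamingContext jssc):
//capture the SparkContext before stopping the streaming context
JavaSparkContext sc = jssc.sparkContext();
//stop only the streaming context; false keeps the SparkContext alive
jssc.stop(false);
//the same SparkContext can now back a new StreamingContext
JavaStreamingContext jssc2 = new JavaStreamingContext(sc, Durations.seconds(2));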
IV. Receivers
----------------------------------------------------------
1. Overview
A Receiver ingests data from a source and stores it in memory for Spark to process.
2. Sources
Basic sources: file systems and sockets, supported by the built-in API.
Advanced sources: Kafka, Flume, etc., which require additional dependencies in pom.xml.
3. Note
In local mode, never run with a single thread: with local[n], n must be greater than the number of receivers, as the sketch below illustrates.
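A sketch of the thread rule (Java; hosts and ports are examples): two socket receivers occupy two threads, so local[n] needs n >= 3 to leave at least one core for processing:
SparkConf conf = new SparkConf().setMaster("local[3]").setAppName("twoReceivers");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
//each socketTextStream call starts one receiver, occupying one thread
JavaDStream<String> s1 = jssc.socketTextStream("localhost", 9999);
JavaDStream<String> s2 = jssc.socketTextStream("localhost", 9998);
//union the two streams and process them as one
JavaDStream<String> all = s1.union(s2);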
V. Integrating Spark Streaming with Kafka
-------------------------------------------------------------
1. Start the Kafka cluster
a. Start ZooKeeper
b. Start Kafka on [s200,s300,s400]
$> /soft/kafka/bin/kafka-server-start.sh -daemon /soft/kafka/config/server.properties
c. Verify that Kafka started successfully
$> netstat -ano | grep 9092
d. Create a Kafka topic
$> kafka-topics.sh --create --zookeeper s100:2181 --replication-factor 3 --partitions 3 --topic sparktopic1
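To verify that the topic exists (this Kafka version still addresses the cluster via --zookeeper):
$> kafka-topics.sh --list --zookeeper s100:2181
$> kafka-topics.sh --describe --zookeeper s100:2181 --topic sparktopic1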
2. Add to pom.xml
...
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
3. Write the Java code
package com.test.spark.kafka;
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import scala.Tuple2;
public class SparkKafkaDemo {
public static void main(String [] args)
{
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "s200:9092,s300:9092,s400:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "g6");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("sparktopic1");
JavaInputDStream<ConsumerRecord<String, String>> stream =
KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
);
//extract the message value from each ConsumerRecord
JavaDStream<String> js = stream.map(
new Function<ConsumerRecord<String, String>, String>() {
@Override
public String call(ConsumerRecord<String, String> v1) throws Exception {
return v1.value();
}
}
);
//split each message into words
JavaDStream<String> js1 = js.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String s) throws Exception {
List<String> list = new ArrayList<String>();
String [] strs = s.split(" ");
for (String ss : strs){
list.add(ss);
}
return list.iterator();
}
});
JavaPairDStream<String,Integer> js2 = js1.mapToPair( x -> new Tuple2<>(x,1));
JavaPairDStream<String,Integer> js3 = js2.reduceByKey((x, y) -> x + y);
js3.print();
jssc.start();
try {
jssc.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
4. Start a Kafka console producer and test
$> kafka-console-producer.sh --broker-list s200:9092 --topic sparktopic1
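For example (illustrative only; the job above uses 10-second batches, and the timestamp is hypothetical), typing into the producer console:
hello spark hello
should make the streaming job print, at the next batch, output like:
-------------------------------------------
Time: 1489479160000 ms
-------------------------------------------
(spark,1)
(hello,2)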
VI. The updateStateByKey function
-----------------------------------------------------------------
1. Applies a state update function to every key/value pair of a DStream, producing a new DStream.
2. Two steps:
a. Define the state --- the state can be of any type.
b. Define the state update function.
3. Explanation
Take word counting: mapToPair turns each word into a Tuple2<"word",1>, and updateStateByKey is applied to the transformed DStream.
The update rule keeps the existing state for the key "word" (starting at 1); each time the key "word" appears again, the state value is incremented, so the returned pair becomes <"word",2>, and so on.
Because stream computation proceeds batch by batch, a key's state spans batches, and the rule keeps updating and carrying it forward.
The resulting counts are therefore not per-batch word counts but running totals over all batches since the start.
4. Code demo
package com.test.spark.streaming.java;
import org.apache.spark.*;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
public class WCDemoJava {
public static void main(String [] args)
{
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
JavaDStream<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String s) throws Exception {
List<String> list = new ArrayList<String>();
String [] strs = s.split(" ");
for (String ss : strs){
list.add(ss);
}
return list.iterator();
}
}
);
//updateStateByKey requires a checkpoint directory to persist state across batches
jssc.checkpoint("file:///d:/share/spark/checkpoint");
JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> pairs1 = pairs.updateStateByKey(new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
public Optional<Integer> call(List<Integer> v1, Optional<Integer> v2) throws Exception {
//start from the previous state (0 if the key has none yet)
Integer newState = v2.isPresent()? v2.get() : 0;
System.out.println("oldState : " + newState );
for (Integer i : v1) {
//update the old state [add the occurrences from this batch]
newState = newState + i;
}
return Optional.of(newState);
}
});
pairs1.print();
jssc.start(); // Start the computation
try {
jssc.awaitTermination(); // Wait for the computation to terminate
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
VII. Streaming window operations (spanning batches)
-----------------------------------------------------
1. batch interval: the batch duration, the smallest unit of time.
2. window length: the duration of the window; it spans batches and must be a positive multiple of the batch interval.
3. slide interval: the interval at which the window operation is computed, i.e. how often the window is evaluated; it must be a multiple of the batch interval and is the gap between consecutive windows [in batches].
For example, with a 5-second batch interval, a 15-second window length, and a 10-second slide interval, a computation runs every 10 seconds over the last 3 batches, so consecutive windows overlap by one batch.
4. Common window operations:
a.window(windowLength, slideInterval)
//Return a new DStream which is computed based on windowed batches of the source DStream.
//applies to a non-pair DStream; returns a new DStream containing all batches that fall within the window
b.countByWindow(windowLength, slideInterval)
//Return a sliding window count of elements in the stream.
//applies to a non-pair DStream; counts the elements over all batches in the window
c.reduceByWindow(func, windowLength, slideInterval)
//Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func.
//The function should be associative and commutative so that it can be computed correctly in parallel.
//applies to a non-pair DStream; aggregates the elements of all batches in the window using func
d.reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
//When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs
//where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
//Note: By default, this uses Spark's default number of parallel tasks (2 for local mode,
//and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping.
//You can pass an optional numTasks argument to set a different number of tasks.
//applies to a pair<k,v> DStream; within the window, aggregates the values of equal keys using func
e.reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
//A more efficient version of the above reduceByKeyAndWindow()
//where the reduce value of each window is calculated incrementally using the reduce values of the previous window.
//This is done by reducing the new data that enters the sliding window,
//and “inverse reducing” the old data that leaves the window.
//An example would be that of “adding” and “subtracting” counts of keys as the window slides.
//However, it is applicable only to “invertible reduce functions”,
//that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc).
//Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
//Note that checkpointing must be enabled for using this operation.
f.countByValueAndWindow(windowLength, slideInterval, [numTasks])
//When called on a DStream of (K, V) pairs,
//returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window.
//Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
//applies to a pair<k,v> DStream; within the window, each pair<k,v> across all batches becomes a key, and its number of occurrences is returned as the value
5. Examples
a. reduceByKeyAndWindow():
// Reduce last 30 seconds of data, every 10 seconds // window length = 30 // slide interval = 10
scala> val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
// Reduce last 30 seconds of data, every 10 seconds
java> JavaPairDStream<String, Integer> windowedWordCounts = pairs.reduceByKeyAndWindow((i1, i2) -> i1 + i2, Durations.seconds(30), Durations.seconds(10));
b. Code
package com.test.spark.streaming.java;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Seconds;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
public class WCSparkDemo1 {
public static void main(String[] args) throws Exception {
SparkConf conf = new SparkConf();
conf.setAppName("wc");
conf.setMaster("local[4]");
//create the Spark streaming application context
JavaStreamingContext jsc = new JavaStreamingContext(conf, Seconds.apply(5));
jsc.checkpoint("file:///d:/share/spark/check");
//create the socket discretized stream
JavaReceiverInputDStream<String> sock = jsc.socketTextStream("localhost",9999);
//flatten: split each line into words
JavaDStream<String> wordsDS = sock.flatMap(new FlatMapFunction<String,String>() {
public Iterator<String> call(String str) throws Exception {
List<String> list = new ArrayList<String>() ;
String[] arr = str.split(" ");
for(String s : arr){
list.add(s);
}
return list.iterator();
}
});
//plain window [1 batch = 5s, 3 batches per window, 2-batch slide: every 10 seconds, take the last 15 seconds of batches]
JavaDStream<String> wordsDS1 = wordsDS.window(Seconds.apply(15), Seconds.apply(10));
wordsDS1.print();
// //total element count within the window [1 batch = 5s, 3 batches per window, 2-batch slide: computed every 10 seconds over the last 15 seconds of batches]
// JavaDStream js1 = wordsDS.countByWindow(Seconds.apply(15), Seconds.apply(10));
// js1.print();
//map to tuples
// JavaPairDStream<String,Integer> pairDS = wordsDS.mapToPair(new PairFunction<String, String, Integer>() {
// public Tuple2<String, Integer> call(String s) throws Exception {
// return new Tuple2<String,Integer>(s,1);
// }
// }) ;
//windowed count: each incoming pair becomes the key, the value is the number of occurrences of that pair
//input
//hello world
//hello world
//hello world
//hello world
//hello world
//output
// ((hello,1),5)
// ((world,1),5)
// JavaPairDStream countDS = pairDS.countByValueAndWindow(Seconds.apply(15), Seconds.apply(10));
// countDS.print();
// //windowed aggregation by key [1 batch = 5s, 3 batches per window, 2-batch slide: computed every 10 seconds over the last 15 seconds of batches]
// JavaPairDStream<String,Integer> countDS = pairDS.reduceByKeyAndWindow(new Function2<Integer, Integer, Integer>() {
// public Integer call(Integer v1, Integer v2) throws Exception {
// return v1 + v2;
// }
// },Seconds.apply(15), Seconds.apply(10));
//print
// countDS.print();
jsc.start();
jsc.awaitTermination();
jsc.stop();
}
}
VIII. Implementing Spark Streaming fault tolerance in production
-------------------------------------------------------------
1. Two concepts
a. Spark Driver
//the host that runs the user's program code
b. Executors
//run the jobs submitted by the Spark driver and host additional components such as receivers
//a receiver ingests data and stores it in memory as blocks, replicating each block to another executor for fault tolerance
//at the end of each batch interval the received blocks form a new RDD of the DStream, which is handed downstream for processing
//if a receiver fails, a receiver on another executor is started to take over data ingestion
2. Failure types and remedies
a. If an executor fails, all received but unprocessed data on it is lost. The remedy is a write-ahead log (WAL, the same technique used by HBase, whose WALs live on HDFS): incoming data is first persisted to HDFS or S3 (see the sketch after this list).
b. If the driver fails, the driver program stops, all executors lose their connection, and the computation halts. The remedy requires both configuration and code changes.
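A minimal sketch of enabling the WAL (Java; the property spark.streaming.receiver.writeAheadLog.enable is real, the HDFS path is hypothetical). The checkpoint directory must be on reliable storage because the WAL files live under it, and a serialized storage level without in-memory replication suffices since the WAL already persists the data. StorageLevel is org.apache.spark.storage.StorageLevel:
SparkConf conf = new SparkConf()
.setMaster("local[2]")
.setAppName("walDemo")
.set("spark.streaming.receiver.writeAheadLog.enable", "true");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
//the WAL files are written under the checkpoint directory
jssc.checkpoint("hdfs://s100:8020/spark/checkpoint");
//no in-memory replication needed: the WAL already duplicates the data durably
JavaReceiverInputDStream<String> lines =
jssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER());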
3. Recovering from driver failure
a. Step 1: configure the driver program to restart automatically; this is done through the specific cluster manager (see the spark-submit sketch after step b).
b. Step 2: on restart, resume from where the crash occurred; the checkpoint mechanism provides this.
1) Create a checkpoint
//the directory can be local or on HDFS
jsc.checkpoint("d://....");
2) Create the context with JavaStreamingContext.getOrCreate
//instead of creating the streaming context with new,
//use the factory method JavaStreamingContext.getOrCreate() to create it;
//it first checks the checkpoint directory for a previously running job, and only creates a new context if none is found.
JavaStreamingContext jsc = JavaStreamingContext.getOrCreate("file:///d:/scala/check", new Function0<JavaStreamingContext>() {
public JavaStreamingContext call() throws Exception {
JavaStreamingContext jsc = new JavaStreamingContext(conf, Seconds.apply(2));
jsc.checkpoint("file:///d:/scala/check");
return jsc;
}
});
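For step a under the standalone cluster manager, the driver is restarted automatically when the job is submitted in cluster deploy mode with the --supervise flag (a sketch reusing the master and jar names from the earlier examples):
$> spark-submit --master spark://s100:7077 --deploy-mode cluster --supervise --class com.test.spark.streaming.java.SparkStreamingFaultTolerant TestSpark-1.0-SNAPSHOT.jar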
c. Write the fault-tolerance test code; the computation must be written inside Function0's call method.
package com.test.spark.streaming.java;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
public class SparkStreamingFaultTolerant {
public static void main(String [] args)
{
//create a JavaStreamingContext factory
Function0<JavaStreamingContext> contextFactory = new Function0<JavaStreamingContext>() {
//called only when the context is created for the first time
public JavaStreamingContext call() {
SparkConf conf = new SparkConf();
conf.setMaster("local[4]");
conf.setAppName("wc");
JavaStreamingContext jssc = new JavaStreamingContext(conf,new Duration(2000));
JavaDStream<String> lines = jssc.socketTextStream("localhost",9999);
/******* transformation code goes here ***********/
JavaDStream<Long> dsCount = lines.countByWindow(new Duration(24 * 60 * 60 * 1000),new Duration(2000));
dsCount.print();
//set the checkpoint directory
jssc.checkpoint("file:///d:/share/spark/checkpoint");
return jssc;
}
};
//on a restart after failure, the context is rebuilt from the checkpoint.
JavaStreamingContext context = JavaStreamingContext.getOrCreate("file:///d:/share/spark/checkpoint", contextFactory);
context.start();
try {
context.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
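To exercise the recovery path: run the program, send some data through nc, kill the process, then start it again. On the second run, JavaStreamingContext.getOrCreate finds the checkpoint directory and rebuilds the context from it, so the factory's call() method is not invoked again and the windowed count resumes instead of restarting from zero.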