Big Data Exploration (9): Real-Time Log Stream Processing in Practice with log4j + Flume + Kafka + Spark Streaming
1. Real-Time Stream Processing
1.1 Real-Time Computing
Similar in spirit to a real-time system (a system that must respond to requests within strict time limits). In stock trading, for example, market data changes constantly, and decisions often have to be made within seconds or even milliseconds. Put simply, a task must be computed within a very short unit of time, and that computation usually happens over and over again.
1.2 Stream Computing
Stream computing usually means data flowing continuously through a system that computes on it without stopping. There is not necessarily a strict time limit here: from the moment data enters the system to the moment a result is produced can take a long time. Examples include system log data or daily user browsing data on an e-commerce site.
1.3 Real-Time Stream Computing
Combining real-time computing with streaming data gives real-time stream computing, which is what big data usually calls real-time stream processing: data is produced continuously, and at the same time the computation faces strict time limits. For example, product recommendations on e-commerce sites today change right after you click on an item; they are computed for you in real time. Likewise, when your phone credit or data plan is nearly used up, a top-up package is recommended to you in real time.
2. Real-Time Stream Processing in Practice
This example is adapted from the imooc hands-on video course "Spark Streaming Real-Time Stream Processing Project in Practice"; take a look if you are interested.
2.1 A Continuous Stream of Data
Here log4j simulates a continuous stream of log data: start a process and let it print without stopping. A minimal configuration looks like this:
log4j.rootLogger=INFO,stdout
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
Then write a small log-printing program, for example:
import org.apache.log4j.Logger;

public class LoggerGenerator {

    private static Logger logger = Logger.getLogger(LoggerGenerator.class.getName());

    public static void main(String[] args) throws InterruptedException {
        int index = 0;
        while (true) {
            Thread.sleep(1000);                // emit one log line per second
            logger.info("value : " + index++);
        }
    }
}
2.2 Collecting the Data in Real Time
Flume can collect the log data in real time. To integrate it with log4j, the log4j configuration file needs the following:
log4j.rootLogger=INFO,stdout,flume
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = wds
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true
The Flume agent, meanwhile, is configured as follows:
agent1.sources=avro-source
agent1.channels=logger-channel
agent1.sinks=logger-sink
# define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=0.0.0.0
agent1.sources.avro-source.port=41414
# define channel
agent1.channels.logger-channel.type=memory
# define sink
agent1.sinks.logger-sink.type = logger
agent1.sources.avro-source.channels=logger-channel
agent1.sinks.logger-sink.channel=logger-channel
For now the agent uses a logger sink, simply to verify that data can be collected at all. When building a project, don't try to get everything working in one go: test each step as you finish it, so errors are easy to locate and don't pile up.
Start Flume:
flume-ng agent \
--name agent1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/streaming.conf \
-Dflume.root.logger=INFO,console
Note: this may fail with java.lang.ClassNotFoundException: org.apache.flume.clients.log4jappender.Log4jAppender. Add the following dependency to the project to resolve it:
<dependency>
    <groupId>org.apache.flume.flume-ng-clients</groupId>
    <artifactId>flume-ng-log4jappender</artifactId>
    <version>1.6.0</version>
</dependency>
Now start the log generator, and you can see that Flume is collecting data:
2018-12-07 21:39:03,204 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236744447, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 30 value : 0 }
2018-12-07 21:39:03,609 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236745545, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 31 value : 1 }
2018-12-07 21:39:04,611 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236746548, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 32 value : 2 }
2018-12-07 21:39:05,614 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236747550, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 33 value : 3 }
2018-12-07 21:39:06,617 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236748554, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 34 value : 4 }
2018-12-07 21:39:07,620 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236749556, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 35 value : 5 }
2018-12-07 21:39:08,626 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236750560, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 36 value : 6 }
2018-12-07 21:39:09,632 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236751568, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 37 value : 7 }
2018-12-07 21:39:10,636 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236752572, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 38 value : 8 }
2018-12-07 21:39:11,638 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236753575, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 39 value : 9 }
2018-12-07 21:39:12,640 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{flume.client.log4j.timestamp=1544236754577, flume.client.log4j.logger.name=LoggerGenerator, flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8} body: 76 61 6C 75 65 20 3A 20 31 30 value : 10 }
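The logger sink prints each event's body as a hex dump followed by its text. As a sanity check, the bytes shown above can be decoded by hand. The small standalone snippet below is not part of the pipeline, just an illustration: it decodes the first event's body.

```java
// HexBodyDecoder: decodes the space-separated hex dump that Flume's logger
// sink prints as the event "body" back into the original UTF-8 string.
public class HexBodyDecoder {

    static String decode(String hex) {
        String[] parts = hex.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // body of the first event shown above
        System.out.println(decode("76 61 6C 75 65 20 3A 20 30")); // prints "value : 0"
    }
}
```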
2.3 Buffering the Data in a Message Queue
The data Flume collects is now written to Kafka, which acts as a buffer; Flume plays the role of the producer. To connect Flume to Kafka, modify the Flume configuration as follows:
agent1.sources=avro-source
agent1.channels=logger-channel
agent1.sinks=kafka-sink
# define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=0.0.0.0
agent1.sources.avro-source.port=41414
# define channel
agent1.channels.logger-channel.type=memory
# define sink
agent1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.brokerList = wds:9092
agent1.sinks.kafka-sink.topic = streamingtopic
agent1.sinks.kafka-sink.batchSize = 20
agent1.sinks.kafka-sink.requiredAcks = 1
agent1.sources.avro-source.channels=logger-channel
agent1.sinks.kafka-sink.channel=logger-channel
Kafka depends on ZooKeeper, so start ZooKeeper first:
$ZK_HOME/bin/zkServer.sh start
Start Kafka, pointing it at its configuration file:
bin/kafka-server-start.sh config/server.properties
Create the topic:
kafka-topics.sh --create --zookeeper wds:2181 --replication-factor 1 --partitions 1 --topic streamingtopic
Start a console consumer to verify that messages come through:
kafka-console-consumer.sh --zookeeper wds:2181 --topic streamingtopic
Output arrives in groups of 20 (set by agent1.sinks.kafka-sink.batchSize = 20 in the Flume configuration), which proves the link works:
[hadoop@wds ~]$ kafka-console-consumer.sh --zookeeper wds:2181 --topic streamingtopic
value : 0
value : 1
value : 2
value : 3
value : 4
value : 5
value : 6
value : 7
value : 8
value : 9
value : 10
value : 11
value : 12
value : 13
value : 14
value : 15
value : 16
value : 17
value : 18
value : 19
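The consumer sees messages in bursts of 20 because the Kafka sink buffers events and delivers them once batchSize is reached. The toy class below is only an illustration of that buffering idea, not Flume's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// BatchBuffer: a toy model of a sink with batchSize = N. Events accumulate
// in a buffer and are delivered downstream as a whole batch once the
// buffer holds N events.
public class BatchBuffer {

    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();

    BatchBuffer(int batchSize) { this.batchSize = batchSize; }

    void append(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flushed.add(new ArrayList<>(buffer)); // deliver the whole batch at once
            buffer.clear();
        }
    }

    List<List<String>> getFlushed() { return flushed; }

    public static void main(String[] args) {
        BatchBuffer sink = new BatchBuffer(20);
        for (int i = 0; i < 45; i++) sink.append("value : " + i);
        // 45 events with batchSize 20 -> two full batches delivered, 5 still buffered
        System.out.println(sink.getFlushed().size()); // prints 2
    }
}
```

This is why, with the generator emitting one line per second, the console consumer stays silent for about 20 seconds and then prints 20 lines at once.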
2.4 Processing the Data in Real Time
Use Spark Streaming to consume the messages from Kafka. Here the Receiver-based approach is used; a minimal example:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaReceiverWordCount {

  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      System.err.println("Usage: KafkaReceiverWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, group, topics, numThreads) = args

    val sparkConf = new SparkConf().setAppName("KafkaReceiverWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

    // Connect Kafka to Spark Streaming via the receiver-based API
    val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)

    // _1 (the message key) is unused; count the messages in each batch and print the result
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
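Conceptually, counting per 5-second batch just means grouping arriving events by the batch interval they fall into and counting each group. The standalone sketch below illustrates only that grouping idea; it is not Spark code and its names are invented for this example:

```java
import java.util.Map;
import java.util.TreeMap;

// BatchCounter: groups event timestamps (in ms) into fixed-length batch
// windows and counts the events per window -- the same kind of per-batch
// count that the Streaming job prints every interval.
public class BatchCounter {

    static Map<Long, Integer> countPerBatch(long[] timestamps, long batchMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : timestamps) {
            long window = t / batchMs;             // index of the batch this event falls into
            counts.merge(window, 1, Integer::sum); // increment that batch's count
        }
        return counts;
    }

    public static void main(String[] args) {
        // ten events, one per second -> two 5-second batches of five events each
        long[] ts = new long[10];
        for (int i = 0; i < 10; i++) ts[i] = i * 1000L;
        System.out.println(countPerBatch(ts, 5000L));
    }
}
```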
Note that so far both the log generator and the Streaming job ran locally. What changes in production?
1) Package a jar and run the LoggerGenerator class from it.
2) Flume and Kafka are used exactly as before.
3) The Spark Streaming job is also packaged as a jar and submitted to the cluster with spark-submit, in local/yarn/standalone/mesos mode.
Build the jar with mvn assembly:assembly -Dmaven.test.skip=true so that the Kafka-related jars are bundled in; mark dependencies the cluster already supplies as provided.
When packaging, move LoggerGenerator out of the test sources into a package under the main Java sources, so it can be run directly.
After a successful build, run it on the server:
java -cp spark-test-1.0-jar-with-dependencies.jar com.wds.streaming.LoggerGenerator
The -cp flag adds the jar to the classpath, so the Java class loader looks inside it for matching classes; this way no main class needs to be specified at packaging time, which is convenient. Then submit the job with spark-submit:
spark-submit \
--class com.wds.streaming.KafkaStreamingApp \
--master local[2] \
--name KafkaStreamingApp \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.2 \
/home/hadoop/hadoop/lib/spark-test-1.0-jar-with-dependencies.jar wds:2181 test streamingtopic 2
The generator's log output looks like this:
[hadoop@wds lib]$ java -cp spark-test-1.0-jar-with-dependencies.jar com.wds.streaming.LoggerGenerator
log4j:ERROR Could not find value for key log4j.appender.flume.layout
2018-12-07 22:32:27,447 [main] [org.apache.flume.api.NettyAvroRpcClient] [WARN] - Using default maxIOWorkers
2018-12-07 22:32:28,681 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 0
2018-12-07 22:32:29,751 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 1
2018-12-07 22:32:30,753 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 2
2018-12-07 22:32:31,755 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 3
2018-12-07 22:32:32,758 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 4
2018-12-07 22:32:33,760 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 5
2018-12-07 22:32:34,762 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 6
2018-12-07 22:32:35,764 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 7
2018-12-07 22:32:36,765 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 8
2018-12-07 22:32:37,767 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 9
2018-12-07 22:32:38,770 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 10
2018-12-07 22:32:39,772 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 11
2018-12-07 22:32:40,775 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 12
2018-12-07 22:32:41,777 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 13
2018-12-07 22:32:42,779 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 14
2018-12-07 22:32:43,782 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 15
2018-12-07 22:32:44,784 [main] [com.wds.streaming.LoggerGenerator] [INFO] - value : 16
Spark Streaming's output:
18/12/07 22:32:50 INFO executor.Executor: Running task 0.0 in stage 12.0 (TID 10)
18/12/07 22:32:50 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 2 blocks
18/12/07 22:32:50 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
18/12/07 22:32:50 INFO executor.Executor: Finished task 0.0 in stage 12.0 (TID 10). 1705 bytes result sent to driver
18/12/07 22:32:50 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 12.0 (TID 10) in 6 ms on localhost (executor driver) (1/1)
18/12/07 22:32:50 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
18/12/07 22:32:50 INFO scheduler.DAGScheduler: ResultStage 12 (print at KafkaStreamingApp.scala:23) finished in 0.006 s
18/12/07 22:32:50 INFO scheduler.DAGScheduler: Job 6 finished: print at KafkaStreamingApp.scala:23, took 0.014111 s
-------------------------------------------
Time: 1544239970000 ms
-------------------------------------------
20
At this point the whole pipeline is connected and the test succeeds.
3. References
- The imooc course "Spark Streaming Real-Time Stream Processing Project in Practice"
- Official documentation for Flume, Kafka, and Spark