Use scenario: We run many tasks on our machines every day. When a task fails, the error often goes unnoticed, so by the time it is discovered the task has already been down for a long time.
Solution: Monitor the log output of these tasks in real time with a Flume + Kafka + Spark Streaming pipeline; when an Error appears in a log, send an email to the person in charge of the project.
Objective: Become familiar with this small Flume + Kafka + Spark Streaming real-time log-analysis project so that the framework can be applied more effectively to real projects.
One, Flume
Flume collects, aggregates, and transmits the log data to Kafka. Multiple sources can be configured to pick up the logs of multiple tasks and deliver them through a Kafka sink. The configuration file is as follows:
#define agent
agent_log.sources = s1 s2
agent_log.channels = c1
agent_log.sinks = k1
#define sources.s1
agent_log.sources.s1.type=exec
agent_log.sources.s1.command=tail -F /data/log1.log
#define sources.s2
agent_log.sources.s2.type=exec
agent_log.sources.s2.command=tail -F /data/log2.log
# Define interceptors
agent_log.sources.s1.interceptors = i1
agent_log.sources.s1.interceptors.i1.type = static
agent_log.sources.s1.interceptors.i1.preserveExisting = false
agent_log.sources.s1.interceptors.i1.key = projectName
agent_log.sources.s1.interceptors.i1.value= project1
agent_log.sources.s2.interceptors = i2
agent_log.sources.s2.interceptors.i2.type = static
agent_log.sources.s2.interceptors.i2.preserveExisting = false
agent_log.sources.s2.interceptors.i2.key = projectName
agent_log.sources.s2.interceptors.i2.value= project2
#define channels
agent_log.channels.c1.type = memory
agent_log.channels.c1.capacity = 1000
agent_log.channels.c1.transactionCapacity = 1000
#define sinks
# Set the sink type to Kafka
agent_log.sinks.k1.type= org.apache.flume.sink.kafka.KafkaSink
# Set Kafka's broker address and port number
agent_log.sinks.k1.brokerList=cdh1:9092,cdh2:9092,cdh3:9092
# Set Kafka's Topic
agent_log.sinks.k1.topic=result_log
# Include the Flume event headers
agent_log.sinks.k1.useFlumeEventFormat = true
# Set serialization
agent_log.sinks.k1.serializer.class=kafka.serializer.StringEncoder
agent_log.sinks.k1.partitioner.class=org.apache.flume.plugins.SinglePartition
agent_log.sinks.k1.partition.key=1
agent_log.sinks.k1.request.required.acks=0
agent_log.sinks.k1.max.message.size=1000000
agent_log.sinks.k1.producer.type=sync
agent_log.sinks.k1.custom.encoding=UTF-8
# bind the sources and sinks to the channels
agent_log.sources.s1.channels=c1
agent_log.sources.s2.channels=c1
agent_log.sinks.k1.channel=c1
Start Flume with the flume-ng command:
flume-ng agent -c /etc/flume-ng/conf -f result_log.conf -n agent_log
Two, Kafka
Kafka is a messaging system that can buffer messages. Flume delivers the collected logs to the Kafka message queue (Flume acts as the producer), and Spark Streaming then consumes them, which ensures that no data is lost. For more background on Kafka, see: https://www.cnblogs.com/likehua/p/3999538.html
# Create the result_log topic
kafka-topics --zookeeper cdh1:2181,cdh2:2181,cdh3:2181 --create --topic result_log --partitions 3 --replication-factor 1
# Test - list the Kafka topics to check that result_log was created
kafka-topics --list --zookeeper cdh1:2181,cdh2:2181,cdh3:2181
# Test - start a console consumer to verify that the Flume-to-Kafka link works
kafka-console-consumer --bootstrap-server cdh1:9092,cdh2:9092,cdh3:9092 --topic result_log
Three, Spark Streaming (programming language: Scala; development tool: IDEA)
Create a new Maven project and add the required dependencies to pom.xml. // see the project code for details
We use Zookeeper to manage the Spark Streaming consumer offsets. First create the streaming context:
val ssc = new StreamingContext(sc, Durations.seconds(60))
Then call
KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams, newOffset))
to establish the connection with Kafka; it returns an InputDStream, from which the data stream is obtained. Process each batch with
stream.foreachRDD(rdd => { /* handler */ }) // process the data stream
and finally start the computation with ssc.start().
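Putting the pieces together, here is a minimal sketch of the streaming job. It assumes the broker list and topic from the Flume sink above; the consumer group id is made up; readOffsetsFromZk, saveOffsetsToZk, and sendAlertMail are hypothetical stand-ins for the project's Zookeeper offset management and mail code; and each record value is treated as a plain string (with useFlumeEventFormat=true the value is really an Avro-serialized Flume event, so the real project decodes it first).

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils}

object LogMonitor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("monitorLog")
    val ssc = new StreamingContext(conf, Durations.seconds(60)) // one batch per minute

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "cdh1:9092,cdh2:9092,cdh3:9092", // same brokers as the Flume sink
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "result_log_monitor",                     // hypothetical consumer group
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)      // offsets are managed by hand
    )
    val topics = Array("result_log")

    // Restore the offsets saved in Zookeeper (hypothetical helper below).
    val newOffset: Map[TopicPartition, Long] = readOffsetsFromZk()

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams, newOffset))

    stream.foreachRDD { rdd =>
      // Keep only lines containing "Error" and bring them back to the driver as strings.
      val errors = rdd.filter(_.value().contains("Error")).map(_.value()).collect()
      errors.foreach(sendAlertMail) // mail helper, sketched in the next section
      saveOffsetsToZk(rdd)          // persist this batch's offsets back to Zookeeper
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Hypothetical stub: the real project reads the last saved offsets from Zookeeper.
  def readOffsetsFromZk(): Map[TopicPartition, Long] = Map.empty

  // Hypothetical stub: the real project writes these offset ranges to Zookeeper.
  def saveOffsetsToZk(rdd: RDD[ConsumerRecord[String, String]]): Unit = {
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // write `ranges` to Zookeeper here
  }

  // Placeholder; the HtmlEmail version is sketched below.
  def sendAlertMail(errorLine: String): Unit = println(s"ALERT: $errorLine")
}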
The mail-sending function uses the HtmlEmail class from the org.apache.commons.mail package (commons-email); build the message and call HtmlEmail.send() to deliver it.
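For reference, here is a minimal sketch of the mail helper using commons-email 1.5 (already listed in the --jars below); the SMTP host, credentials, and addresses are placeholders:

import org.apache.commons.mail.HtmlEmail

// Minimal sketch of the alert mail; host, credentials, and addresses are placeholders.
def sendAlertMail(errorLine: String): Unit = {
  val email = new HtmlEmail()
  email.setHostName("smtp.example.com")       // placeholder SMTP server
  email.setAuthentication("user", "password") // placeholder credentials
  email.setCharset("UTF-8")
  email.setFrom("monitor@example.com")
  email.addTo("owner@example.com")            // the person in charge of the project
  email.setSubject("Task log ERROR alert")
  email.setHtmlMsg(s"<p>Error detected in the task log:</p><pre>$errorLine</pre>")
  email.send()                                // throws EmailException on failure
}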
Finally, write a start.sh script that launches the Spark Streaming program, and run it with sh start.sh:
#!/bin/bash
export HADOOP_USER_NAME=hdfs
spark2-submit \
--master yarn \
--deploy-mode client \
--executor-cores 3 \
--num-executors 10 \
--driver-memory 2g \
--executor-memory 1G \
--conf spark.default.parallelism=30 \
--conf spark.storage.memoryFraction=0.5 \
--conf spark.shuffle.memoryFraction=0.3 \
--conf spark.reducer.maxSizeInFlight=128m \
--driver-class-path mysql-connector-java-5.1.38.jar \
--jars mysql-connector-java-5.1.38.jar,qqwry-java-0.7.0.jar,fastjson-1.2.47.jar,spark-streaming-kafka-0-10_2.11-2.2.0.jar,hive-hbase-handler-1.1.0-cdh5.13.0.jar,commons-email-1.5.jar,commons-email-1.5-sources.jar,mail-1.4.7.jar \
--class com.lin.monitorlog.mianer.Handler \
monitorLog.jar