Use case: We run many scheduled tasks on our machines every day, and when a task fails the error often goes unnoticed; by the time someone discovers the problem, the task has already been down for a long time.

Solution: monitor the log output of these tasks in real time with a Flume + Kafka + Spark Streaming pipeline, and when an Error appears in a log, send an alert email to the person in charge of the project.

Objective: become familiar with real-time log analysis on the Flume + Kafka + Spark Streaming stack through this small project, so that the approach can be applied to real projects more effectively.

One, Flume

Flume collects, aggregates, and forwards the log data to Kafka. The logs of multiple tasks can be fed through multiple sources into a single Kafka sink. The configuration file is as follows:


#define agent
agent_log.sources = s1 s2
agent_log.channels = c1
agent_log.sinks = k1

#define sources.s1
agent_log.sources.s1.type = exec
agent_log.sources.s1.command = tail -F /data/log1.log

#define sources.s2
agent_log.sources.s2.type = exec
agent_log.sources.s2.command = tail -F /data/log2.log

# define interceptors: tag each source's events with a projectName header
agent_log.sources.s1.interceptors = i1
agent_log.sources.s1.interceptors.i1.type = static
agent_log.sources.s1.interceptors.i1.preserveExisting = false
agent_log.sources.s1.interceptors.i1.key = projectName
agent_log.sources.s1.interceptors.i1.value = project1

agent_log.sources.s2.interceptors = i2
agent_log.sources.s2.interceptors.i2.type = static
agent_log.sources.s2.interceptors.i2.preserveExisting = false
agent_log.sources.s2.interceptors.i2.key = projectName
agent_log.sources.s2.interceptors.i2.value = project2

#define channels
agent_log.channels.c1.type = memory
agent_log.channels.c1.capacity = 1000
agent_log.channels.c1.transactionCapacity = 1000

#define sinks
# set the sink type to Kafka
agent_log.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka broker addresses and ports
agent_log.sinks.k1.brokerList = cdh1:9092,cdh2:9092,cdh3:9092
# Kafka topic to write to
agent_log.sinks.k1.topic = result_log
# keep the Flume event headers (projectName) with the message
agent_log.sinks.k1.useFlumeEventFormat = true
# serialization and producer settings
agent_log.sinks.k1.serializer.class = kafka.serializer.StringEncoder
agent_log.sinks.k1.partitioner.class = org.apache.flume.plugins.SinglePartition
agent_log.sinks.k1.partition.key = 1
agent_log.sinks.k1.request.required.acks = 0
agent_log.sinks.k1.max.message.size = 1000000
agent_log.sinks.k1.agent_log.type = sync
agent_log.sinks.k1.custom.encoding = UTF-8

# bind the sources and sinks to the channel
agent_log.sources.s1.channels = c1
agent_log.sources.s2.channels = c1
agent_log.sinks.k1.channel = c1

Start the Flume agent with the flume-ng command:

flume-ng agent -c /etc/flume-ng/conf -f result_log.conf -n agent_log


Two, Kafka

Kafka is a messaging system that can buffer messages. Flume delivers the collected logs into a Kafka message queue (with Flume acting as the producer), and Spark Streaming then consumes them from there, which ensures that no data is lost. For more on Kafka, see: https://www.cnblogs.com/likehua/p/3999538.html

# Create the result_log topic
kafka-topics --zookeeper cdh1:2181,cdh2:2181,cdh3:2181 --create --topic result_log --partitions 3 --replication-factor 1

# Test: list the Kafka topics to check that result_log was created
kafka-topics --list --zookeeper cdh1:2181,cdh2:2181,cdh3:2181

# Test: start a console consumer to verify that the Flume-to-Kafka link works
kafka-console-consumer --bootstrap-server cdh1:9092,cdh2:9092,cdh3:9092 --topic result_log
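
To check the whole Flume-to-Kafka link end to end, you can append a test line to one of the monitored log files (paths as in the Flume configuration above) and watch it appear in the console consumer:

echo "test Error line" >> /data/log1.log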


Three, Spark Streaming (programming language: Scala; development tool: IDEA)

Create a new Maven project and add the dependencies to pom.xml (see the project code for the full file).
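
For reference, a minimal dependency set might look like the sketch below. The versions are inferred from the jars passed to spark2-submit in the start script at the end of this article, so treat them as assumptions and adjust to your cluster:

<dependencies>
  <!-- Spark Streaming core; provided by the cluster at runtime -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
  </dependency>
  <!-- Kafka 0.10 direct-stream integration (KafkaUtils.createDirectStream) -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
  <!-- commons-email, used for the HtmlEmail alert below -->
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-email</artifactId>
    <version>1.5</version>
  </dependency>
</dependencies>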


We use ZooKeeper to manage the Spark Streaming consumer offsets. First create the StreamingContext, then call KafkaUtils.createDirectStream to establish the connection with Kafka; it returns an InputDStream from which each batch of the data stream is read and processed:

val ssc = new StreamingContext(sc, Durations.seconds(60))   // 60-second batches
val stream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams, newOffset))
stream.foreachRDD(rdd => { /* process one batch of log lines */ })   // process the data stream
ssc.start()   // start the streaming job
ssc.awaitTermination()   // keep the driver running
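
The call above assumes topics, kafkaParams, and newOffset are already defined. A minimal sketch of what they might look like, with the broker list taken from the Flume sink configuration; the consumer group id is a placeholder, and loading the previously saved offsets from ZooKeeper is elided:

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val topics = Array("result_log")                        // the Kafka topic written by Flume
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "cdh1:9092,cdh2:9092,cdh3:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "log_monitor",                          // placeholder consumer group
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)    // offsets are saved to ZooKeeper by hand
)
// offsets restored from ZooKeeper on restart; an empty map falls back to auto.offset.reset
val newOffset = Map[TopicPartition, Long]()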

The alert mail is sent with the HtmlEmail class from the org.apache.commons.mail package (commons-email): configure the message, then call HtmlEmail.send to deliver it.
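
A minimal sketch of such a helper, assuming an SMTP server and mail addresses (smtp.example.com and both addresses are placeholders):

import org.apache.commons.mail.HtmlEmail

def sendAlert(project: String, errorLine: String): Unit = {
  val email = new HtmlEmail()
  email.setHostName("smtp.example.com")                     // placeholder SMTP server
  email.setAuthentication("alert@example.com", "password")  // placeholder credentials
  email.setCharset("UTF-8")
  email.setFrom("alert@example.com")
  email.addTo("owner@example.com")                          // the project's person in charge
  email.setSubject(s"[$project] Error detected in task log")
  email.setHtmlMsg(s"<p>Error line:</p><pre>$errorLine</pre>")
  email.send()
}

Inside foreachRDD, each batch can then be filtered for lines containing Error, and sendAlert called with the projectName header that the Flume static interceptor attached.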


Write a start.sh script that launches the Spark Streaming program, then start it with sh start.sh:


#!/bin/bash
export HADOOP_USER_NAME=hdfs
spark2-submit \
--master yarn \
--deploy-mode client \
--executor-cores 3 \
--num-executors 10 \
--driver-memory 2g \
--executor-memory 1G \
--conf spark.default.parallelism=30 \
--conf spark.storage.memoryFraction=0.5 \
--conf spark.shuffle.memoryFraction=0.3 \
--conf spark.reducer.maxSizeInFlight=128m \
--driver-class-path mysql-connector-java-5.1.38.jar \
--jars mysql-connector-java-5.1.38.jar,qqwry-java-0.7.0.jar,fastjson-1.2.47.jar,spark-streaming-kafka-10_2.11-2.2.0.jar,hive-hbase-handler-1.1.0-cdh5.13.0.jar,commons-email-1.5.jar,commons-email-1.5-sources.jar,mail-1.4.7.jar \
--class com.lin.monitorlog.mianer.Handler \
monitorLog.jar

 


