Handling multi-line log records in a Flume Source

ExecSource reads the log one line at a time with readLine() and uses each line as the body of a Flume event. This is perfectly fine for most log records, where each record ends on a single line:

2016-03-18 17:53:48,374 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(7217)) - there are no corrupt file blocks.
2016-03-18 17:53:48,278 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(7217)) - there are no corrupt file blocks.
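The line-per-event behaviour described above can be pictured with a small Java sketch. This is a simplified stand-in, not Flume's ExecSource code: it reads from a String rather than a process's standard output, and the class name LinePerEventReader and method readBodies are hypothetical names chosen for illustration. Each line simply becomes one byte array, playing the role of an event body.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of "one line = one event body"; not Flume code. */
final class LinePerEventReader {

    /** Returns one body (UTF-8 bytes) for every line found in the input. */
    static List<byte[]> readBodies(String commandOutput) {
        List<byte[]> bodies = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(commandOutput))) {
            String line;
            // readLine() strips the line terminator and returns null at end of input.
            while ((line = reader.readLine()) != null) {
                bodies.add(line.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bodies;
    }

    public static void main(String[] args) {
        String output = "INFO first record\n"
                + "ERROR second record\n"
                + "\tat com.example.Foo.bar(Foo.java:1)\n";
        // Two logical records, but three lines, so three bodies are produced.
        System.out.println(readBodies(output).size()); // prints 3
    }
}
```

Note that the ERROR record and its stack frame end up in separate bodies, which is exactly the situation that becomes a problem for multi-line records.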

 

However, for some ERROR log records that carry a stack trace, treating each line as a separate Flume event is a big problem. Intuitively, several lines need to be treated as a single Flume event. For example, the following log record should become one Flume event, not 24 (it spans 24 lines in total):

2016-03-18 17:53:40,278 ERROR [HiveServer2-Handler-Pool: Thread-26]: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
	at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:178)
	at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
	at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
	at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
	at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
	... 4 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:196)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
	... 10 more

 

My implementation works as follows: inspect the beginning of each line. If it meets a certain condition, the line is treated as the start of a new log record; otherwise, it is treated as part of the previous log record.
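The idea above can be sketched in a few lines of Java. This is a minimal illustration under my own assumptions, not the code of the project discussed in this post: the class name LineGrouper, the method group, and the level-prefix condition used in main are all hypothetical. The start-of-record condition is passed in as a predicate so that any rule can be plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper that merges continuation lines into the previous record. */
final class LineGrouper {

    /**
     * A line for which startsRecord is true opens a new record;
     * every other line is appended to the record currently being built.
     */
    static List<String> group(List<String> lines, Predicate<String> startsRecord) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            // Assumption: a continuation line seen before any start line
            // opens a record of its own instead of being dropped.
            if (current == null || startsRecord.test(line)) {
                if (current != null) {
                    records.add(current.toString()); // flush the finished record
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line);
            }
        }
        if (current != null) {
            records.add(current.toString()); // flush the last record
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "ERROR something failed",
                "\tat com.example.Foo.bar(Foo.java:1)",
                "INFO all good");
        // Illustrative condition only: a record starts with a log level.
        List<String> records =
                group(lines, l -> l.startsWith("ERROR") || l.startsWith("INFO"));
        System.out.println(records.size()); // prints 2
    }
}
```

One design point this batch version hides: when lines arrive as a stream, the last record can only be emitted once the next start line shows up or the stream ends, so a streaming source would probably also need some kind of flush timeout.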

For example, for the log above (that is, a log that follows the standard log4j layout), check whether the beginning of each line matches the following regular expression:

\s?\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d,\d\d\d

 

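A line that matches is treated as the start of a new log record; a line that does not match is part of the previous log record (the one that began with the required format).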
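To make the check concrete, here is a small Java sketch that applies the regular expression above to a few lines taken from the sample stack trace. The class name RecordStartMatcher and the method startsNewRecord are hypothetical, and using Matcher.lookingAt() to anchor the match at the beginning of the line is my assumption about one reasonable way to do it, not necessarily what the original source does.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that tests whether a line begins a new log record. */
final class RecordStartMatcher {

    // The regular expression from the text, escaped for a Java string literal.
    static final Pattern RECORD_START = Pattern.compile(
            "\\s?\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d:\\d\\d,\\d\\d\\d");

    static boolean startsNewRecord(String line) {
        // lookingAt() must match from the first character of the line,
        // but does not require the whole line to match.
        return RECORD_START.matcher(line).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsNewRecord(
                "2016-03-18 17:53:40,278 ERROR [HiveServer2-Handler-Pool: Thread-26]: "
                + "Error occurred during processing of message.")); // prints true
        System.out.println(startsNewRecord(
                "\tat java.lang.Thread.run(Thread.java:745)")); // prints false
        System.out.println(startsNewRecord(
                "Caused by: java.net.SocketException: Connection reset")); // prints false
    }
}
```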

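Of course, I also added a configurable regex option that defines which line beginning counts as the start of a log record, so different sources can be configured differently to suit their needs.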
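As a rough picture of what a per-source regex setting could look like, the sketch below reads a regex from Java properties text and falls back to the log4j-style default when a source has none. The property key lineStartRegex, the class LineStartRegexConfig, and its methods are hypothetical names for illustration; check the project's README for the real option name. One detail worth knowing: in the Java properties format (which, as far as I know, is what Flume agent configuration files use), a backslash is an escape character, so regex backslashes have to be doubled in the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

/** Hypothetical per-source lookup of the "start of record" regex. */
final class LineStartRegexConfig {

    // Fallback: the log4j-style timestamp prefix from the text.
    static final String DEFAULT_REGEX =
            "\\s?\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d:\\d\\d,\\d\\d\\d";

    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Reads "<sourcePrefix>.lineStartRegex" (hypothetical key) or uses the default. */
    static Pattern patternFor(Properties props, String sourcePrefix) {
        return Pattern.compile(
                props.getProperty(sourcePrefix + ".lineStartRegex", DEFAULT_REGEX));
    }

    public static void main(String[] args) {
        // File content: a1.sources.r1.lineStartRegex = ^\\d{2}:\\d{2}:\\d{2}
        // (backslashes doubled in the file, and doubled again in this Java literal).
        Properties props = load(
                "a1.sources.r1.lineStartRegex = ^\\\\d{2}:\\\\d{2}:\\\\d{2}\n");

        // r1 has its own rule; r2 is not configured and falls back to the default.
        Pattern r1 = patternFor(props, "a1.sources.r1");
        Pattern r2 = patternFor(props, "a1.sources.r2");
        System.out.println(r1.matcher("12:00:01 started").lookingAt()); // prints true
        System.out.println(r2.matcher("2016-03-18 17:53:40,278 ERROR x").lookingAt()); // prints true
    }
}
```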

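With this constraint in place, it is possible to write an ExecSource that treats certain groups of lines as a single Flume event. I have open-sourced it on GitHub. If you are interested, you are welcome to try it out, and any suggestions or corrections are welcome: MultiLineExecSource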

github.com/qwurey/flume-source-multiline

 

This version is based on flume-ng-core 1.6.0.

Reposted from: http://blog.csdn.net/asia_kobe/article/details/51003173
