Flume collects and processes log files

 

  1. Introduction to Flume

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data, provided by Cloudera. Flume supports customizing the various data senders in a logging system to collect data; at the same time, it provides the ability to perform simple processing on the data and write it to various data receivers (which are also customizable).

  2. System functions

    2.1. Log collection

Flume was originally a log collection system developed by Cloudera and is now maintained as a project under the Apache Software Foundation. Flume supports customizing the various data senders in a logging system to collect data.

    2.2. Data processing

Flume provides the ability to perform simple processing on data and write it to various data receivers (customizable). It can collect data from sources such as syslog (the syslog logging system, supporting both TCP and UDP) and exec (command execution).

    2.3. Way of working

Flume adopts a multi-Master approach. To keep configuration data consistent, Flume[1] introduces ZooKeeper to store it; ZooKeeper itself guarantees the consistency and high availability of the configuration data and can notify the Flume Master nodes when the configuration changes. The Gossip protocol is used to synchronize data among the Flume Masters.

    2.4. Process structure

The structure of Flume is divided into three main parts: the source, the channel, and the sink. The source is responsible for collecting logs; the channel is responsible for transport and temporary storage; the sink is the destination, which saves the collected logs. In actual log collection, the appropriate types of source, channel, and sink are selected and configured according to the type of logs to be collected and the storage requirements, so that the logs can be collected and saved.

  3. Flume log collection solution

    3.1. Requirements analysis

      3.1.1. Log classification

Operating system: Linux

Log update type: new log files are generated, and new records are appended to the end of existing logs

      3.1.2. Collection time requirements

Collection cycle: short (within one day)

    3.2. Collection plan

      3.2.1. Collection framework

The process of using Flume to collect log files is relatively simple: you only need to select the appropriate source, channel, and sink and configure them. If you have special needs, you can also do secondary development to meet them.

The specific process is: configure an agent according to the requirements, selecting a suitable source, channel, and sink, and then start the agent to begin collecting logs, as in the sketch below.
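
As a sketch of that process, here is a minimal single-node agent configuration in the style of the Flume user guide. The agent name a1, the component names r1/c1/k1, the netcat source, and the file name example.conf are placeholders chosen for illustration:

# example.conf: a minimal single-node agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source listening on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# logger sink that writes events to Flume's own log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

The agent can then be started with something like: flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console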

      3.2.2. Source

Flume provides a variety of sources for users to choose from, covering most log collection needs as far as possible. Common source types include avro, exec, netcat, spooling-directory, and syslog. For the specific scope of use and configuration methods, see the source section in the appendix.

      3.2.3. Channel

Channels in Flume are not as prominent as sources and sinks, but they are an integral part that cannot be ignored. The most commonly used channel is the memory channel; other types include the JDBC channel, the file channel, and custom channels. For details, see the channel section in the appendix.

      3.2.4. Sink

Flume also offers many kinds of sinks. Commonly used ones include avro, logger, HDFS, HBase, and file_roll; other types include thrift, IRC, and custom sinks. For the specific scope of use and usage methods, see the sink section in the appendix.

  4. Flume log processing

Flume can not only collect logs but also perform simple processing on them. At the source, interceptors can filter and extract the important content of the log body. At the channel, logs can be classified by header so that different types of logs are put into different channels. At the sink, the body content can be further filtered and classified through regex serialization.

    4.1. Flume Source Interceptors

Flume can extract important information through interceptors and add it to the event header. Commonly used interceptors include the timestamp, host, and UUID interceptors. Users can also write regex filter interceptors according to their own needs to filter out log content in specific formats, as in the sketch below.
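
A minimal sketch of attaching interceptors to a source; the names a1, r1, i1-i3 and the regex ERROR are illustrative placeholders, and the source itself is assumed to be configured elsewhere:

# timestamp and host interceptors add header fields to each event
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
# regex_filter keeps events matching the pattern (set excludeEvents = true to drop them instead)
a1.sources.r1.interceptors.i3.type = regex_filter
a1.sources.r1.interceptors.i3.regex = ERROR
a1.sources.r1.interceptors.i3.excludeEvents = false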

    4.2. Flume Channel Selectors

Flume can route different logs into different channels as required. There are two selector types: replicating and multiplexing. With replicating, logs are not grouped: every log is sent to every channel, and no channel is treated differently. With multiplexing, logs are classified according to a specified header and put into different channels according to the classification rules, so the logs are deliberately sorted. A sketch follows.
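
A minimal sketch of a multiplexing channel selector; the header name state and its mapped values are placeholders (replicating is the default behavior when no selector is configured):

a1.sources = r1
a1.channels = c1 c2
a1.sources.r1.channels = c1 c2
# route events by the value of the "state" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.ERROR = c1
a1.sources.r1.selector.mapping.INFO = c2
a1.sources.r1.selector.default = c2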

    4.3. Flume Sink Processors

Flume can also process logs at the sink. Common sink processors include custom, failover, load balancing, and default. As with interceptors, users can apply regex filtering according to special needs. Unlike at the interceptor, content filtered by regex serialization at the sink is not added to the header, so the header does not become bloated. A sketch of a failover sink group follows.
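
A minimal sketch of a failover sink processor over a sink group; the names g1, k1, and k2 are placeholders, and the two sinks are assumed to be configured elsewhere:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# the sink with the higher priority is preferred; k1 acts as the backup
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# time in milliseconds a failed sink is penalized before being retried
a1.sinkgroups.g1.processor.maxpenalty = 10000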

 

  5. Appendix

    5.1. Common sources

      5.1.1. avro source

The avro source listens on a specified port and collects the logs sent to it. When using the avro source, you need to specify the host IP and port number to listen on. A specific example is given below:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

      5.1.2. exec source

The exec source reads logs through a specified command. When using exec, you need to specify the shell command that reads the log. A specific example is given below:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

      5.1.3. spooling-directory source

The spooling-directory source reads the logs in a folder. When using it, you specify a folder, and all files in that folder are read. Note that files in the folder must not be modified while they are being read, and their file names must not be changed either. A specific example is given below:

agent-1.channels = ch-1
agent-1.sources = src-1
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true

      5.1.4. syslog source

The syslog source reads system logs through the syslog protocol and comes in TCP and UDP variants. When using it, you need to specify the host and port. A UDP example is given below:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

    5.2. Common channels

There are not many channel types in Flume. The most commonly used is the memory channel; an example is given below:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

    5.3. Common sinks

      5.3.1. logger sink

As the name suggests, the logger sink writes the collected logs to Flume's own log. It is a very simple but very practical sink; a minimal example follows.
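
A minimal sketch of a logger sink configuration; the agent and component names are placeholders, and a source feeding channel c1 is assumed:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1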

      5.3.2. avro sink

The avro sink sends the received logs to a specified host and port so that the next hop in a cascade of agents can receive them. When using it, you need to specify the destination IP and port. An example is as follows:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

      5.3.3. file_roll sink

The file_roll sink writes the logs collected within a given period to a specified file. The user specifies a directory and a rolling interval and then starts the agent. For each interval a new file is created in the directory, and all logs collected during that interval are written into it; when the next interval begins, a new file is created and writing continues there, and so on. A specific example is given below:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
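
The rolling cycle mentioned above is controlled by the sink.rollInterval property (in seconds; 0 disables rolling). For example, to roll a new file every hour one might add:

# hypothetical addition to the example above: roll once per hour
a1.sinks.k1.sink.rollInterval = 3600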

      5.3.4. hdfs sink

The hdfs sink is somewhat similar to file_roll: both write the collected logs to newly created files. The difference is that file_roll stores files on the local file system, while the hdfs sink stores them on the HDFS distributed file system, and new files can be rolled based on time, file size, or the number of collected events. A specific example is as follows:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
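
Note that the time-based escape sequences in hdfs.path require a timestamp on each event, supplied either by a timestamp interceptor at the source or by letting the sink use the agent's local time, for example:

# alternative to the timestamp interceptor: use the agent's local time for %y-%m-%d etc.
a1.sinks.k1.hdfs.useLocalTimeStamp = true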

      5.3.5. hbase sink

HBase is a database that can store logs. When using the hbase sink, you need to specify the table name and column family in which the logs will be stored; the agent can then insert the collected logs into the database one by one. An example is as follows:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
