Flume basic features (1.7)

    Apache Flume is a distributed, reliable, and efficient log data collection component; we usually use Flume to collect log files from the many servers scattered across a cluster into a central data platform, solving the problem that data spread across discrete log files is hard to view and analyze. Of course, Flume does not only collect log files; it also supports collecting message data over TCP, UDP, and so on. Either way, the problem we ultimately solve is "collecting discrete data". Let's first describe a few concepts:

    1. Event: message, or event. The unit of data transmission in Flume is the "event". Flume splits parsed log data and received TCP data into events and transmits them in the internal Flow.

    2. Agent: a Flume process deployed on a host machine close to the data source (such as log files), usually used to collect, filter, and sort data. A Flume Agent usually needs to "modify" the source data and forward it to a remote Collector.

    3. Collector: another Flume process (also an Agent), used to receive messages sent by Flume Agents. Compared with an Agent, the messages "collected" by a Collector usually come from multiple servers; its role is to "aggregate", "clean", "classify", and "filter" messages, and it is responsible for saving them and forwarding them downstream.

    4. Source: one of Flume's internal components, used to parse raw data and encapsulate it into events, or to receive Flume events sent by a client. For the Flume process, the Source is the front end of the entire data flow (Data Flow); it "generates" events. (Read side)

    5. Channel: one of Flume's internal components, the channel used to "transmit" events. A Channel usually provides features such as data "buffering" and "flow control"; the upstream end of a Channel is a Source and the downstream end is a Sink. If you are familiar with the pipeline-style streaming data model, this concept should be very easy to understand.

    6. Sink: one of Flume's internal components, used to send internal events to third-party components through an appropriate protocol. For example, a Sink can write events to local disk files, send them to another Flume over TCP using the Avro protocol, or deliver them to other data storage platforms such as Kafka; the Sink ultimately removes events from the internal data stream. (Write side)

 

    Internal component link relationships:

    1. A Source can send events to one or more Channels; usually a Source corresponds to one Channel. If a Source sends events to multiple Channels, the "selector" mechanism (see below) is required.

    2. The Channel is the node that links the Flow together: its upstream is the Source and its downstream is the Sink. Multiple Sources can be attached to one Channel, that is, multiple Sources can send events to the same Channel. At the same time, multiple Sinks can consume messages from one Channel, which requires the "sink processor" mechanism (see below).

    3. The upstream of a sink is a channel, and a sink can only consume messages from one channel.

    4. The Source puts messages into the Channel, and the Sink consumes messages from the Channel; both happen within internal transactions. A Channel is usually implemented as a bounded BlockingQueue: if the Channel is full, the Source's put operation is rejected with an exception and can be retried later; if the Channel is empty, the Sink cannot obtain messages.

 

1. Architecture

    1. Data flow model

    Each Flume Event consists of a "byte payload" and an optional set of string properties. If you are familiar with JMS programming, you can think of the "byte payload" as the body of the Event: a byte array that carries the actual message content. In addition, an Event has headers, a KV structure used to hold properties of the event.
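
    As a concrete illustration, here is a minimal Java sketch (the class name and header values are purely illustrative) that builds such an Event with Flume's EventBuilder: the byte array becomes the body/payload and the map becomes the headers.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventDemo {
    public static void main(String[] args) {
        // headers: optional KV string properties attached to the event
        Map<String, String> headers = new HashMap<>();
        headers.put("host", "web-01");
        headers.put("logType", "access");

        // body: the byte payload carrying the actual message content
        Event event = EventBuilder.withBody(
                "GET /index.html 200".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
        System.out.println(event.getHeaders());
    }
}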

    There are multiple components inside the Flume Agent process (a JVM); they parse the source data into events and forward them from the source to other destinations (hops) through a specific Flow.

    A Flume Source consumes events sent to it by an external data source; the external data source sends messages to the Flume Agent in a format supported by that Flume Source. For example, a Flume Avro Source can receive messages from Avro clients or from the Avro Sinks of other Flume agents; similarly, a Flume Thrift Source can receive messages sent by Thrift clients or by the Thrift Sinks of other Flume agents. When a Flume Source receives a message, it stores it in one or more Channels. A Channel is a passive store that holds messages until a Flume Sink consumes them; for example, FileChannel is backed by the local file system (it appends messages to local files). The Sink removes messages from the Channel and either sends them to a third-party (external) storage platform (for example, the HDFS Sink saves messages in HDFS) or forwards them to the Flume Source of the next-level Flume Agent (the next hop in a multi-level architecture). Inside an Agent, the Source and the Sink both run asynchronously against the messages batched in the Channel. (The working principle of each component will be explained in detail later, based on the source code.)

 

    2. Complex Flows

    Flume allows developers to build multi-hop Flow models in which messages pass through multiple Flume Agents before reaching their final destination; it also allows building fan-in and fan-out Flow structures, as well as contextual routing and failover patterns.

 

    3. Reliability

    Messages (in batches) pass through each agent's Channel and are then delivered to the next agent or to the final storage platform. A message is removed from the Channel only after it has been received and saved by the next agent or the final storage platform. This single-hop transport semantic is how Flume provides end-to-end reliability for the data flow.

    Flume uses a transactional approach to ensure reliable message delivery (this is very important). The store and retrieve operations of Sources and Sinks are wrapped in transactions provided by the Channel, which guarantees the reliability of point-to-point transmission for a batch of messages within the Flow (source -> channel -> sink). Even in multi-level Flows, the data transfer between the upstream sink and the downstream source runs within their respective transactions, ensuring that the data is safely stored in the downstream Channel.
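
    The simplified Java sketch below (not Flume's actual source, just an assumed outline of the put side) shows what this transactional pattern looks like around a Channel; the take side in a Sink follows the same pattern with channel.take() instead of channel.put().

import org.apache.flume.Channel;
import org.apache.flume.ChannelException;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

public class TransactionalPut {
    // Simplified sketch of the transactional put that a Source performs.
    public static void put(Channel channel, Event event) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            channel.put(event);   // throws ChannelException if the Channel is full
            tx.commit();          // the event is now safely stored in the Channel
        } catch (ChannelException e) {
            tx.rollback();        // nothing is left half-written; the Source can retry later
            throw e;
        } finally {
            tx.close();
        }
    }
}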

 

    4. Resilience

    Flume supports a persistent FileChannel, that is, Channel messages can be saved in the local file system, and this Channel supports data recovery. A MemoryChannel is also supported: a memory-based queue that is very efficient, but if the Agent process fails, the messages still sitting in the Channel are lost (and cannot be recovered).

 

2. Installation and use

    1. Flume is developed in Java, so a JDK needs to be installed on the host machine first; version 1.7+ is recommended. Installing Flume itself is not complicated: you only need to prepare a Flume configuration file that declares the respective properties of the sources, channels, and sinks, and the Flow relationships between them.

    2. Each component (source, channel, sink) in a Flow has a name, a type, and a specific set of configuration options. For example, the Avro source needs to specify the hostname and local port to bind to, the memory channel needs to specify its capacity, and the HDFS sink needs to declare the HDFS URI and file path.

    3. Component relationship

    Ultimately, the Agent needs to know the relationships between the components in order to build the Flow model. After declaring the configuration of each source, channel, and sink, we specify the connections between them: which channels each source writes messages to, and which channel each sink reads messages from.

 

    4. Start the Agent

    There is a flume-ng script in the bin directory that can be used to start the agent. Before starting Flume, we usually adjust the relevant JVM parameters; you can add the relevant configuration to flume-env.sh, for example:

export JAVA_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5445 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -verbose:gc -server -Xms512M -Xmx512M -XX:NewRatio=3 -XX:SurvivorRatio=8 -XX:MaxMetaspaceSize=128M -XX:+UseConcMarkSweepGC -XX:CompressedClassSpaceSize=128M -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/flume/logs/server-gc.log.$(date +%Y%m%d%H%M) -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=64M"

 

    After that, we create a flume-conf.properties file in the conf directory to declare the Flume components; an example is as follows:

agent.channels=ch-spooling
agent.sources=src-spooling
agent.sinks=sink-avro-spooling

##spooling
agent.channels.ch-spooling.type=file
agent.channels.ch-spooling.checkpointDir=/opt/flume/.flume/file-channel/ch-spooling/checkpoint
agent.channels.ch-spooling.dataDirs=/opt/flume/.flume/file-channel/ch-spooling/data
agent.channels.ch-spooling.capacity=1000000
agent.channels.ch-spooling.transactionCapacity=10000

agent.sources.src-spooling.type=spooldir
agent.sources.src-spooling.channels=ch-spooling
agent.sources.src-spooling.spoolDir=/opt/deploy/tomcat/order-center/logs
agent.sources.src-spooling.deletePolicy=immediate
#agent.sources.src-spooling.deletePolicy=never
agent.sources.src-spooling.includePattern=((access)|(order-center)).*\.log.+
agent.sources.src-spooling.ignorePattern=^.*\.gz$
agent.sources.src-spooling.consumeOrder=oldest
agent.sources.src-spooling.recursiveDirectorySearch=false
agent.sources.src-spooling.batchSize=1000
agent.sources.src-spooling.inputCharset=UTF-8
agent.sources.src-spooling.decodeErrorPolicy=IGNORE

agent.sinks.sink-avro-spooling.channel=ch-spooling
agent.sinks.sink-avro-spooling.type=avro
agent.sinks.sink-avro-spooling.hostname=10.0.1.100
agent.sinks.sink-avro-spooling.port=9011
agent.sinks.sink-avro-spooling.batch-size=1000
agent.sinks.sink-avro-spooling.compression-type=deflate

 

    Then we can start flume as follows:

bin/flume-ng agent --conf /opt/flume/conf --conf-file /opt/flume/conf/flume-conf.properties --name agent -Dflume.root.logger=INFO,LOGFILE -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true

   

    The configuration file above declares that the agent name is "agent"; all configuration keys must be prefixed with the agent name, which defines the "namespace" of the configuration. Multiple agents can be declared in one configuration file. In this configuration file, one source, one channel, and one sink are declared.

    When starting, we use "--conf-file" to specify the path of the configuration file and "--name" to specify the name of the agent to load.

 

    5. Output the original data in the log

    Many times, especially during development and testing, we need to inspect the data passing through Flume:

    1) By specifying "-Dorg.apache.flume.log.printconfig=true", you can view the configuration information of flume in the startup log.

    2) Through "-Dorg.apache.flume.log.rawdata=true", you can view the raw data of messages in flume, including headers and body content.

    3) Through "-Dflume.root.logger=DEBUG,console" (the general production environment is INFO,LOGFILE), the level of the logger and the output terminal for printing can be declared.

 

    6. ZooKeeper-based configuration management:

    Usually, the Flume configuration file is saved locally on the agent machine. If your Flume cluster is large, adjusting configurations becomes troublesome. We can instead save the Flume configuration in ZooKeeper and then specify the ZooKeeper address and path:

bin/flume-ng agent --conf /opt/flume/conf -z zk1:2181,zk2:2181 -p /flume --name agent

 

    Personally, I think using ZooKeeper to store the Flume configuration increases management complexity; after all, operating ZooKeeper also has a certain technical threshold. We can solve this problem with a "Jenkins + central configuration" approach instead: keep the Flume configuration on a central configuration machine and use Jenkins to deploy Flume uniformly. Before deploying and starting Flume, the configuration file is synchronized to the Flume agent machine via ssh, and then Flume is started. (This is what I am currently using.)

 

    7. Third-party plug-ins or dependent libraries

    Flume itself already ships with a fairly rich set of components, but in many cases we may need to extend it, for example with self-developed Flume interceptors, sinks, and so on. In that case we need to put our own jars on Flume's CLASSPATH. In Flume's "plugins.d" directory, you can create a subdirectory per plugin; each subdirectory should contain the following three subdirectories, for example plugins.d/my-ext/ (a sample layout follows the list):

    1) lib: The jar of this plugin.

    2) libext: The jar that this plugin depends on.

    3) native: native libraries that this plugin depends on, such as ".so" files.
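
    For instance, a plugin named "my-ext" would be laid out roughly like this (the jar and library names are illustrative):

plugins.d/my-ext/lib/my-flume-ext.jar
plugins.d/my-ext/libext/third-party-dependency.jar
plugins.d/my-ext/native/libsomething.so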

 

3. Complex Design

    1. Multi-agent

    

 

    To allow messages to pass through multiple agents or hops, the sink of the previous agent and the source of the current agent need to use Avro RPC, and the two need to agree on the hostname (IP) and port.
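
    A minimal sketch of this pairing (the agent names, IP, and port below are illustrative): the upstream agent's Avro sink points at the host and port on which the downstream agent's Avro source listens.

## upstream agent: Avro sink pointing at the downstream agent
upstream.sinks.avro-out.type=avro
upstream.sinks.avro-out.channel=ch1
upstream.sinks.avro-out.hostname=10.0.1.200
upstream.sinks.avro-out.port=9011

## downstream agent: Avro source listening on the agreed port
downstream.sources.avro-in.type=avro
downstream.sources.avro-in.channels=ch1
downstream.sources.avro-in.bind=0.0.0.0
downstream.sources.avro-in.port=9011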

 

    2. Data consolidation (Consolidation)

    A more common scenario: many clients that generate logs send data to a few agents attached to the storage system. For example, agents collect logs from hundreds of web servers and then send them to several agents that write to an HDFS cluster.

    In this multi-tier architecture, the Flume agents in the first tier all use Avro sinks that point to the Avro source of a single remote agent (in the current version you can also use a Thrift sink + Thrift source between agents). The source of the second-tier agent merges the messages received from multiple agents into one channel, and this channel is then consumed by the sink of that agent and written to the target storage platform. (Why use a multi-tier architecture instead of having every agent write directly to the target storage? Think about it.)
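
    A minimal sketch of such a second-tier ("consolidation") agent, with an Avro source receiving from the first tier and an HDFS sink writing to the cluster (the agent name, port, and HDFS path are illustrative):

collector.sources=avro-in
collector.channels=ch1
collector.sinks=hdfs-out

collector.sources.avro-in.type=avro
collector.sources.avro-in.channels=ch1
collector.sources.avro-in.bind=0.0.0.0
collector.sources.avro-in.port=9011

collector.channels.ch1.type=file

collector.sinks.hdfs-out.channel=ch1
collector.sinks.hdfs-out.type=hdfs
collector.sinks.hdfs-out.hdfs.path=hdfs://namenode/flume/logs/%Y-%m-%d
collector.sinks.hdfs-out.hdfs.fileType=DataStream
## escape sequences in hdfs.path need a timestamp; use the agent's local time here
collector.sinks.hdfs-out.hdfs.useLocalTimeStamp=true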

 

    3. Multiplexing the flow

    Flume supports replicating or multiplexing a message flow to one or more destinations: the flow multiplexer can either replicate events or selectively route them to one or more channels.

    In the example above, the source of agent "foo" fans the message flow out to three different channels. The fan-out can be "replicating" or "multiplexing". In the replicating case, each event is sent to all three channels. In the multiplexing case, an event is delivered to a subset of the available channels according to the matching rules and results in the configuration; for example, an event with a "txnType" property should go to "channel1" and "channel3" when the value is "customer", to "channel2" when the value is "vendor", and to "channel3" otherwise. This value-to-channel mapping can be specified in the configuration file.
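
    Expressed as configuration, the "txnType" example above would look roughly like this (the source and channel names are illustrative):

agent.sources.s1.channels=channel1 channel2 channel3
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=txnType
agent.sources.s1.selector.mapping.customer=channel1 channel3
agent.sources.s1.selector.mapping.vendor=channel2
agent.sources.s1.selector.default=channel3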

 

4. Configuration (brief description)

    From the example above we already know that the configuration file needs to declare the properties of three kinds of components, "source", "channel", and "sink", and that each key needs to be prefixed with the agent name, such as "agent".

<agent-name>.sources=<source1> <source2> ##Multiple values separated by spaces
<agent-name>.channels=<channel1> <channel2>
<agent-name>.sinks=<sink1> <sink2>

##Declaring properties of components one by one
<agent-name>.sources.<source-name>.type=<type>
...

##Declare the Flow connection relationship of the component
<agent-name>.sources.<source-name>.channels=<channel1> <channel2> ...

<agent-name>.sinks.<sink-name>.channel=<channel>
##Note: each sink can only consume from one channel
##Each source can write to multiple channels, depending on the fan-out ("multiplexing") configuration

 

    Regarding "fan-out" streams:

    As mentioned above, Flume supports fanning out a message flow from one source to multiple channels. There are two fan-out models: replicating and multiplexing. In "replicating" mode, messages are sent to all of the specified channels (replication); in "multiplexing" mode, messages are sent only to the channels that match, according to the configured matching and mapping relationships. To implement fan-out, the list of channels and the fan-out strategy need to be specified on the source: set the source's "selector" attribute to "replicating" or "multiplexing". By default, the selector type is "replicating":

<agent-name>.sources.<source1>.channels=<channel1> <channel2>
<agent-name>.sources.<source1>.selector.type=replicating ##or multiplexing

<agent-name>.sinks.<sink1>.channel=<channel1>
...

 

    For "multiplexing", other configuration items are required, and the mapping relationship between event attributes and channels must be configured. The selector will detect the properties configured in the event headers. If the value matches, it will send the message to the corresponding channel, otherwise it will be sent to the default channel specified in the configuration:

<agent-name>.sources.<source1>.selector.type=multiplexing
<agent-name>.sources.<source1>.selector.header=<someHeader>
<agent-name>.sources.<source1>.selector.mapping.<value1>=<channel1>
##When the value of someHeader is value1, the message is sent to channel1
<agent-name>.sources.<source1>.selector.mapping.<value2>=<channel2> <channel3>
<agent-name>.sources.<source1>.selector.default=<channel1>
##If no match is successful, use the channel specified by default.
##Special attention: This header must be included in the event, and the value cannot be null, otherwise it will not match any channel.

 

5. Flume Sources (briefly)

    The Source component can receive data from a TCP connection or parse log entries from a local file, encapsulate the data into events, and deliver the events to the internal channels; the Source is the front end of the data flow in a Flume agent. At present, the more commonly used source types built into Flume are:

    1) Avro Source: Based on TCP and Avro data protocols, this source acts as the server side of Avro RPC to receive Avro data sent by Client.

    2) Thrift Source: Based on TCP and the Thrift data protocol, this source acts as the server side of Thrift RPC to receive Thrift data sent by clients.

    3) Spooling Directory Source: Watches a local directory and parses existing (or newly added) files into events. This source is usually used to collect "historical log files", such as the log files that are added each day.

    4) Taildir Source: Similar to the "tail" command, it detects whether specified files have new (appended) data and encapsulates the appended data into events. The processed position in each file is recorded after every operation, and the next operation continues from that position. This is very useful for collecting "live logs". (A sample configuration follows this list.)

    5) Kafka Source: Acts as a Kafka consumer; you specify a list of Kafka topics and it consumes messages from them.

    6) There are other sources, such as: Syslog, NetCat, HTTP, etc.
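
    A sample Taildir Source configuration (reusing the channel from the earlier example; the paths are illustrative):

agent.sources.src-tail.type=TAILDIR
agent.sources.src-tail.channels=ch-spooling
agent.sources.src-tail.positionFile=/opt/flume/.flume/taildir_position.json
agent.sources.src-tail.filegroups=f1
agent.sources.src-tail.filegroups.f1=/opt/deploy/tomcat/order-center/logs/access.log
agent.sources.src-tail.batchSize=1000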

 

6. Flume Sinks (briefly)

    1) HDFS Sink: Writes messages to the HDFS file system; supports features such as automatic path creation and file rolling.

    2) Avro Sink: One of the most commonly used sinks, which transmits messages to the remote server through Avro RPC. Usually used in multi-tier architectures, it is the sink recommended by Flume.

    3) Thrift Sink: Same as above.

    4) File Roll Sink: Write messages to the local file system, support splitting by time, and support custom path management. (many times we need to extend it)

    5) Null Sink: In some scenarios, it is very useful to directly discard the message.

    6) Kafka Sink: Writes messages to Kafka; one of the most commonly used sinks. This sink acts as the Kafka producer side.
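
    A sample Kafka Sink configuration (the broker address, topic, and component names are illustrative; the property names follow the 1.7 Kafka sink, so verify them against the version you deploy):

agent.sinks.sink-kafka.channel=ch-spooling
agent.sinks.sink-kafka.type=org.apache.flume.sink.kafka.KafkaSink
agent.sinks.sink-kafka.kafka.bootstrap.servers=10.0.1.100:9092
agent.sinks.sink-kafka.kafka.topic=order-center-logs
agent.sinks.sink-kafka.flumeBatchSize=1000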

 

7. Flume Channels (briefly)

    1) Memory Channel: Stores events in memory in a BlockingQueue; it has weak data reliability but the highest efficiency, and is usually suitable for real-time data transfer. (A sample configuration follows this list.)

    2) File Channel: Stores events in local files; it offers high data reliability but lower efficiency, and is usually used to transmit data with high reliability requirements.

    3) Others: such as the JDBC Channel, the Kafka Channel, and the experimental Spillable Memory Channel (backed by both memory and files).
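
    A sample Memory Channel configuration (the capacity values are illustrative and should be tuned to your traffic):

agent.channels.ch-memory.type=memory
agent.channels.ch-memory.capacity=100000
agent.channels.ch-memory.transactionCapacity=1000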

 

8. Flume Selectors (selector)

    We mentioned the Selector mechanism above: it is used to route messages in the Source, delivering events from a Source to the appropriate Channel according to certain conditions. Two selectors are currently supported: replicating and multiplexing; the default is "replicating".

    1. Replicating: Replication, that is, each event will be transmitted to multiple channels in a "replicated" manner.

agent.sources=s1
agent.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.optional=c3

 

    The selector has two attributes, "type" and "optional". "type" specifies the selector type and must be "replicating". "optional" marks optional channels: of the channels declared as "c1 c2 c3", a failure to write a message to c1 or c2 fails the transaction, while c3 is optional, so a failure to write to c3 is simply ignored.

 

    2. Multiplexing: multiplexing, that is, each message is sent to a channel (or subset of channels) in the Channels list according to a certain strategy; events are distributed across these Channels.

agent.sources=s1
agent.channels=c1 c2 c3 c4
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=state
agent.sources.s1.selector.mapping.CZ=c1
agent.sources.s1.selector.mapping.US=c2 c3
agent.sources.s1.selector.default=c4

 

    This type of selector needs mappings declared for routing messages. "header" specifies the header to match; if the header's value matches an entry in the mapping list, the message is sent to the channel(s) corresponding to that mapping. If nothing matches, it is sent to the channel specified by "default". Note that this header must exist in the event and its value cannot be null, otherwise the message will not be delivered.

 

9. Flume Sink Processors (processor)

    An advanced feature: a sink group treats multiple sinks as a whole and implements "load balancing" or "failover" across the sinks in the group. Two processors are currently supported: load_balance and failover.

agent.sinkgroups = g1
agent.sinkgroups.g1.sinks=sink1 sink2
agent.sinkgroups.g1.processor.type=load_balance

 

    1) Failover Sink Processor

    Multiple sinks are declared in a group; as long as one sink is still alive, messages will be processed and delivered. The principle is fairly simple: when a sink throws an exception while sending messages, it is marked as "failed" and added to the failed-sinks list, and the sink with the highest priority is selected from the alive-sinks list to take over and handle subsequent messages until it, too, throws an exception. Sinks marked as failed are probed intermittently: the failed-sinks list is traversed and each sink is given a chance to send messages; if it succeeds, it is moved back to the alive-sinks list. (The internals will be explained based on the source code later.)

    At any time, only one sink in the group is responsible for message delivery; the other sinks act only as "backups", which matches the semantics of failover.

agent.sinkgroups = g1
agent.sinkgroups.g1.sinks=sink1 sink2
agent.sinkgroups.g1.processor.type=failover
agent.sinkgroups.g1.processor.priority.sink1=5
agent.sinkgroups.g1.processor.priority.sink2=10
##For failed sinks, the maximum backoff time (milliseconds); after it expires,
##the sink is asked to send messages again to verify liveness.
agent.sinkgroups.g1.processor.maxpenalty=10000

 

    2) Load balancing Sink Processor

    Supports load balancing across multiple sinks. The sink selection mechanism can be "random" or "round_robin"; the default is "round_robin". When processing messages, the processor selects a sink from the sinks list according to the configured mechanism and uses that sink to consume messages (take messages from the Channel); if that sink cannot deliver the messages, the processor reselects another sink, and if all sinks fail to deliver, an exception is eventually thrown.

    If backoff is enabled, failed sinks are added to a "blacklist" and kept there for a period of time; after the timeout, the failed sink is re-added to the selection list (it may still be unavailable; if it fails again, its timeout increases, up to maxTimeout). Sinks on the "blacklist" do not participate when the processor selects a sink.

agent.sinkgroups=g1
agent.sinkgroups.g1.sinks=s1 s2
##must be load_balance
agent.sinkgroups.g1.processor.type=load_balance
##Enable backoff
agent.sinkgroups.g1.processor.backoff=true
##Selection mechanism: random, round_robin
agent.sinkgroups.g1.processor.selector=random
##backoff maximum time, milliseconds
agent.sinkgroups.g1.processor.selector.maxTimeout=30000

 

10. Event Serializers (message serialization)

    Serialization determines how an Event is serialized when a sink transmits it. Serialization corresponds to deserialization, so the serializer of the current sink should match the deserializer used on the remote side.

    1. Body Text Serializer

    Alias (shorthand): text. Writes the Event body directly to the output stream; the headers of the event are ignored.

 

agent.sinks=s1
agent.sinks.s1.type=file_roll ##Write event to local disk
agent.sinks.s1.directory=/logs/flume
agent.sinks.s1.serializer=text
agent.sinks.s1.serializer.appendNewline=true
 

 

    This serializer has only one attribute, "appendNewline", which controls whether a newline is appended after the body data is written.

 

    2. Avro Event Serializer

    This serialization can be used for AvroSink, or when writing events to Avro serialized files.

 

11. Flume Interceptors

    A very important feature: we can use interceptors to modify or drop events. Flume supports chained interceptors, that is, multiple interceptors are executed in the order they are declared in the configuration, and events (usually in batches) pass through each interceptor in turn. In an interceptor we can modify an event's headers and even its body; if the interceptor decides to discard an event, it simply does not include it in the returned event list, and to discard all of them it just returns an empty list.

 

agent.sources=s1
agent.channels=c1
agent.sources.s1.interceptors=i1 i2
agent.sources.s1.interceptors.i1.type=host
agent.sources.s1.interceptors.i2.type=timestamp
 

 

    Note that only the source component supports interceptors, that is, events can be adjusted by interceptors before they are delivered to the channel. For a custom interceptor, the fully qualified class name needs to be declared in the "type" attribute (for example: com.test.flume.interceptors.MyInterceptor$Builder), and the custom interceptor needs to implement the "org.apache.flume.interceptor.Interceptor" interface; note that a "Builder" must be declared for the interceptor.
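
    The sketch below shows what such a custom interceptor might look like (the class, package, and header name are hypothetical; only the org.apache.flume.interceptor.Interceptor interface and its Builder come from Flume). It stamps every event with a constant header and drops events with an empty body; it would then be enabled with agent.sources.s1.interceptors.i1.type=com.test.flume.interceptors.MyInterceptor$Builder.

package com.test.flume.interceptors;

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class MyInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        if (event.getBody() == null || event.getBody().length == 0) {
            return null;                      // returning null drops the event
        }
        event.getHeaders().put("checked", "true");
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>(events.size());
        for (Event e : events) {
            Event intercepted = intercept(e);
            if (intercepted != null) {
                out.add(intercepted);         // events not added here are discarded
            }
        }
        return out;
    }

    @Override
    public void close() { }

    // The Builder that the "type" property points to.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}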

 

    1. Timestamp interceptor

    Adds a header to the event's headers whose value is the timestamp at which the event was processed. If the header already exists in the event (for example, a previous agent added it), "preserveExisting" decides whether to keep the original value. The shorthand for this interceptor is: timestamp.

 

    2. Host interceptor

    Adds a header to the event's headers whose value is the hostname or IP of the machine where the agent is running; this interceptor's shorthand is: host.

   

    3. Static interceptor

    Add a constant to headers; shorthand: static.

 

agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=project
agent.sources.s1.interceptors.i1.value=order_center

 

 

    4. UUID interceptor: Add a globally unique UUID header to the event; abbreviated as: UUID.

    5. Regex filter interceptor: Matches the event body against a regular expression; based on the match result, the event can be kept or discarded.
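
    For example, a sketch that drops blank-bodied events (the regex is illustrative):

agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=regex_filter
agent.sources.s1.interceptors.i1.regex=^\s*$
agent.sources.s1.interceptors.i1.excludeEvents=true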
