Syncing log data to the cloud with the Apache Flume Datahub Sink plug-in

Product used in this article

Alibaba Cloud DTplus - Big Data Computing Service MaxCompute. Product page: https://www.aliyun.com/product/odps


Introduction

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large volumes of log data from many different sources into a centralized data store, and it supports a wide range of Source and Sink plug-ins. This article describes how to use the Datahub Sink plug-in for Apache Flume to upload log data to Datahub in real time.

Environment requirements

  • JDK (1.7 or above, 1.7 recommended)
  • Flume-NG 1.x
  • Apache Maven 3.x
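
Before deploying the plug-in, you can quickly sanity-check the environment. This is just a minimal check, assuming java, mvn, and the flume-ng script are already installed at the paths shown:

$ java -version
$ mvn -v
$ {YOUR_FLUME_DIRECTORY}/bin/flume-ng version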

Plug-in deployment

Download the plug-in archive

$ wget "http://odps-repo.oss-cn-hangzhou.aliyuncs.com/data-collectors%2Faliyun-flume-datahub-sink-2.0.2.tar.gz" -O aliyun-flume-datahub-sink-2.0.2.tar.gz

Extract the plug-in archive

$ tar zxvf aliyun-flume-datahub-sink-2.0.2.tar.gz
$ ls flume-datahub-sink
lib    libext

Deploy the Datahub Sink plug-in

Move the extracted flume-datahub-sink folder into the plugins.d directory of the Apache Flume installation:

$ mkdir {YOUR_FLUME_DIRECTORY}/plugins.d
$ mv flume-datahub-sink {YOUR_FLUME_DIRECTORY}/plugins.d/

After the move, verify that the Datahub Sink plug-in is in the expected directory:

$ ls {YOUR_FLUME_DIRECTORY}/plugins.d
flume-datahub-sink

Configuration Example

For an introduction to Flume's architecture, principles, and core components, refer to "Flume-ng principles and usage". This section builds an example that uses Flume with the Datahub Sink to parse structured data from a log file and upload it to a Datahub Topic.

The log file to be uploaded has the following format (one record per line, with fields separated by commas):

# test_basic.log
some,log,line1
some,log,line2
...
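
If you do not have such a log file on hand, you can create a small sample for this walkthrough. The path {YOUR_LOG_DIRECTORY} is the same placeholder used in the Flume configuration below, and the contents are purely illustrative:

$ cat > {YOUR_LOG_DIRECTORY}/test_basic.log <<EOF
some,log,line1
some,log,line2
EOF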

We will create a Datahub Topic and write the first and second columns of each log line into the Topic as a record.

Create a Datahub Topic

Create the Topic in the Datahub WebConsole with the schema (string c1, string c2). The rest of this article assumes the Topic is named test_topic.

Flume configuration file

Create a file named datahub_basic.conf in the conf/ folder of the Flume installation directory and enter the following:

# A single-node Flume configuration for Datahub
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = cat {YOUR_LOG_DIRECTORY}/test_basic.log

# Describe the sink
a1.sinks.k1.type = com.aliyun.datahub.flume.sink.DatahubSink
a1.sinks.k1.datahub.accessID = {YOUR_ALIYUN_DATAHUB_ACCESS_ID}
a1.sinks.k1.datahub.accessKey = {YOUR_ALIYUN_DATAHUB_ACCESS_KEY}
a1.sinks.k1.datahub.endPoint = {YOUR_ALIYUN_DATAHUB_END_POINT}
a1.sinks.k1.datahub.project = test_project
a1.sinks.k1.datahub.topic = test_topic
a1.sinks.k1.batchSize = 1
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ,
a1.sinks.k1.serializer.fieldnames = c1,c2,
a1.sinks.k1.serializer.charset = UTF-8
a1.sinks.k1.shard.number = 1
a1.sinks.k1.shard.maxTimeOut = 60

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The serializer configuration here splits each comma-separated input line into three fields and ignores the third one; the trailing comma in fieldnames is what leaves the third field unmapped.
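
For example, under this configuration the first sample line is parsed as follows (shown only to illustrate the fieldnames setting):

input line : some,log,line1
c1 -> some
c2 -> log
(third field "line1" is ignored)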

Start Flume

After the configuration is complete, start Flume, specifying the agent name and the path to the configuration file. Adding the -Dflume.root.logger=INFO,console option prints the log output to the console in real time.

$ cd {YOUR_FLUME_DIRECTORY}
$ bin/flume-ng agent -n a1 -c conf -f conf/datahub_basic.conf -Dflume.root.logger=INFO,console

When the write succeeds, the log output looks like this:

...
Write success. Event count: 2
...
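
Note that the exec source configured above runs cat once, so it uploads the current contents of the file and then stops producing events. To keep collecting lines as they are appended to the log, a common Flume exec-source pattern (not specific to the Datahub plug-in) is to follow the file with tail instead:

# Follow the log file for newly appended lines instead of reading it once
a1.sources.r1.command = tail -F {YOUR_LOG_DIRECTORY}/test_basic.log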

Data usage

Once the log data has been uploaded to Datahub through Flume, you can analyze it in real time with StreamCompute, for example to compute per-page PV/UV statistics from website access logs. In addition, you can configure a Datahub Connector to archive the data into MaxCompute for subsequent offline analysis.

When archiving to MaxCompute, the data usually needs to be partitioned. Datahub-to-MaxCompute archiving can automatically create partitions based on the partition fields of the MaxCompute table, provided that the field names and types in MaxCompute and Datahub match exactly. To partition automatically by the time the log is uploaded, you need to add the MaxCompute partition field and its time format to the configuration in the example above. For example, to create a partition per hour, add the following configuration:

a1.sinks.k1.maxcompute.partition.columns = pt
a1.sinks.k1.maxcompute.partition.values = %Y%m%d%H

Note: the pt field must exist in both the Datahub Topic and the MaxCompute table, and it must be a partition column of the MaxCompute table.
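
For reference, a matching MaxCompute table could be created with DDL like the following. This is a sketch only: the table name test_table is hypothetical, and the statement would be run in your usual MaxCompute SQL client (for example odpscmd):

CREATE TABLE test_table (
    c1 STRING,
    c2 STRING
)
PARTITIONED BY (pt STRING);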
