Real-time streaming ingestion into the data warehouse is common practice at large companies. This article uses the taildir source, which Flume supports as of version 1.8. It is widely used and has the following characteristics:
1. Uses regular expressions to match file names in the monitored directory
2. As soon as data is written to a monitored file, Flume forwards it to the designated sink
3. High reliability; no data is lost
4. Does not modify the tracked files in any way; they are never renamed or deleted
5. Windows is not supported, and binary files cannot be read; text files are read line by line
Taking a startup-log Flume stream as an example, the following describes how the stream is landed in HDFS, on top of which the data is then loaded into the ODS layer.
1.1 taildir source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/hoult/servers/conf/startlog_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/hoult/servers/logs/start/.*log
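The positionFile is what lets taildir resume after a restart: Flume periodically records the inode, last-read offset, and path of every tracked file there. Its content is a JSON array along these lines (values are illustrative):
[{"inode":2496272,"pos":166,"file":"/opt/hoult/servers/logs/start/test.log"}]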
1.2 hdfs sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/start/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = startlog.
# File roll policy (roll by file size, 32 MB)
a1.sinks.k1.hdfs.rollSize = 33554432
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of events flushed to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 100
# Use the agent's local time for the %Y-%m-%d escape
a1.sinks.k1.hdfs.useLocalTimeStamp = true
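With this roll policy, output files roll only when they reach 32 MB: rollCount = 0 and rollInterval = 0 disable rolling by event count and by time, and minBlockReplicas = 1 keeps HDFS block-replication events from triggering premature rolls. Once data has flowed through, the result can be checked from the command line (the date is illustrative):
hdfs dfs -ls /user/data/logs/start/2020-07-30/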
1.3 Agent configuration
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# taildir source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/hoult/servers/conf/startlog_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/hoult/servers/logs/start/.*log
# memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 2000
# hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/start/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = startlog.
# File roll policy (roll by file size, 32 MB)
a1.sinks.k1.hdfs.rollSize = 33554432
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of events flushed to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 1000
# Use the agent's local time for the %Y-%m-%d escape
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Save the configuration above as /opt/hoult/servers/conf/flume-log2hdfs.conf.
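Before starting, make sure the local directories referenced by the configuration exist, otherwise the agent has nothing to tail. For example:
mkdir -p /opt/hoult/servers/conf
mkdir -p /opt/hoult/servers/logs/start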
1.4 Start
flume-ng agent --conf-file /opt/hoult/servers/conf/flume-log2hdfs.conf -name a1 -Dflume.root.logger=INFO,console
Starting it this way will report an error; the following parameters must first be added to $FLUME_HOME/conf/flume-env.sh:
export JAVA_OPTS="-Xms4000m -Xmx4000m -Dcom.sun.management.jmxremote"
# To make flume-env.sh take effect, the configuration directory must also be specified on the command line
flume-ng agent --conf /opt/hoult/servers/flume-1.9.0/conf --conf-file /opt/hoult/servers/conf/flume-log2hdfs.conf -name a1 -Dflume.root.logger=INFO,console
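Once the agent is running, appending a line to any file matched by the filegroup should shortly produce output in HDFS; a minimal smoke test (the file name is illustrative):
echo "test start log" >> /opt/hoult/servers/logs/start/test.log
hdfs dfs -ls /user/data/logs/start/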
1.5 Use a custom interceptor to replace the Flume Agent's local time with the timestamp carried in the log
Test it first with a netcat source → logger sink:
# a1 is the agent's name; the source, channel, and sink are named r1, c1, and k1 respectively
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source
a1.sources.r1.type = netcat
a1.sources.r1.bind = linux121
a1.sources.r1.port = 9999
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.hoult.flume.CustomerInterceptor$Builder
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
# sink
a1.sinks.k1.type = logger
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
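This configuration references the custom class com.hoult.flume.CustomerInterceptor$Builder, so the compiled interceptor jar must be on Flume's classpath before the agent can start. A common approach is to package it and copy the jar into Flume's lib directory (the artifact name below is illustrative):
mvn clean package
cp target/flume-interceptor-1.0.jar /opt/hoult/servers/flume-1.9.0/lib/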
The main code of the interceptor is as follows:
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Charsets;
import com.google.common.base.Strings;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CustomerInterceptor implements Interceptor {
    private static final DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMMdd");

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        // Read the event body as a UTF-8 string
        String eventBody = new String(event.getBody(), Charsets.UTF_8);
        // Get the event headers
        Map<String, String> headerMap = event.getHeaders();
        final String[] bodyArr = eventBody.split("\\s+");
        try {
            String jsonStr = bodyArr[6];
            if (Strings.isNullOrEmpty(jsonStr)) {
                return null;
            }
            // Parse the string into a JSON object
            JSONObject jsonObject = JSON.parseObject(jsonStr);
            String timestampStr = jsonObject.getString("time");
            // Convert the epoch-millisecond timestamp to a date string (format: yyyyMMdd)
            long timeStamp = Long.valueOf(timestampStr);
            String date = formatter.format(LocalDateTime.ofInstant(Instant.ofEpochMilli(timeStamp), ZoneId.systemDefault()));
            headerMap.put("logtime", date);
            event.setHeaders(headerMap);
        } catch (Exception e) {
            // Malformed lines get a sentinel value instead of being dropped
            headerMap.put("logtime", "unknown");
            event.setHeaders(headerMap);
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>();
        for (Event event : events) {
            Event outEvent = intercept(event);
            if (outEvent != null) {
                out.add(outEvent);
            }
        }
        return out;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new CustomerInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
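With the logtime header in place, the hdfs sink can partition output by the time carried in the log instead of the agent's local time, which is the point of the interceptor. A minimal sketch of the relevant sink settings, assuming the same agent naming as above:
a1.sinks.k1.hdfs.path = /user/data/logs/start/%{logtime}/
a1.sinks.k1.hdfs.useLocalTimeStamp = false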
Start up:
flume-ng agent --conf /opt/hoult/servers/flume-1.9.0/conf --conf-file /opt/hoult/servers/conf/flume-test.conf -name a1 -Dflume.root.logger=INFO,console
# Test
telnet linux121 9999
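The interceptor takes the seventh whitespace-separated field of each line (bodyArr[6]) and expects it to be a JSON object with an epoch-millisecond time field, so lines typed into telnet should follow that shape. An illustrative input line:
2020-07-30 14:18:47 INFO [main] com.hoult.AppStart - {"time":1596089927000}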
Wu Xie, Xiao San Ye, a little rookie in backend development, big data, and artificial intelligence. Follow me for more.