Data Warehouse Construction in Practice (3)

Question guide:

1. How is the cluster planned for the data warehouse's data collection module?
2. How is the log generation module of the data warehouse configured?
3. How is Kafka optimized in the data collection module?

1. Data collection module

[1] Building the Linux environment

For the Linux configuration, see the blog post "Linux basic configuration".



[2] Building the Hadoop environment



1) Basic environment creation

[node01]
cd ~
mkdir bin
cd bin
vim xsync
======================= script begins ========================
#!/bin/bash
#1 Get the number of input arguments; exit immediately if there are none
pcount=$#
if((pcount==0)); then
echo no args;
exit;
fi

#2 Get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname

#3 Resolve the parent directory to an absolute path
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir

#4 Get the current user name
user=`whoami`

#5 Loop over the cluster nodes and sync the file to each of them
for host in node01 node02 node03; do
        echo ------------------- $host --------------
        rsync -rvl $pdir/$fname $user@$host:$pdir
done

======================= script ends ========================
chmod 770 xsync
sudo rm -rf /opt/*
sudo mkdir /opt/modules
sudo mkdir /opt/software
sudo mkdir -p /opt/tmp/logs
sudo chown zsy:zsy -R /opt
xsync /opt/*

[node02/node03]
sudo chown zsy:zsy -R /opt

 2) JDK installation

Note: before installing, remove the JDK that comes with the system.

[node01]
tar -zxf /opt/software/jdk-8u144-linux-x64.tar.gz -C /opt/modules
sudo vim /etc/profile.d/java.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
xsync /opt/modules/jdk1.8.0_144
sudo scp /etc/profile.d/java.sh node02:/etc/profile.d/java.sh
sudo scp /etc/profile.d/java.sh node03:/etc/profile.d/java.sh
[node02/node03]
source /etc/profile
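To confirm that the JDK is visible on every node, a quick check (the expected values follow from the paths used above):

# run on node01, node02 and node03 after sourcing /etc/profile
java -version        # should report java version "1.8.0_144"
echo $JAVA_HOME      # should print /opt/modules/jdk1.8.0_144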

Note: as you can see, I put the JDK environment variables into a file ending in .sh under /etc/profile.d instead of editing /etc/profile directly. Why does this work? Let's first look at the common ways to configure environment variables:
1) Edit /etc/profile: used for system-wide settings such as $PATH; the variables defined there apply to every user. After editing, run source /etc/profile so the current shell picks up the change.
2) Edit ~/.bashrc: used for per-user settings; the variables only apply to that user. The file is read every time that user starts a bash shell, and it in turn loads /etc/bashrc, which iterates over the files ending in .sh under /etc/profile.d and sources them. /etc/profile does the same for login shells, which is why dropping a .sh file into /etc/profile.d makes the variables available in both cases.
3) In short: a login shell (logging in with a user name) automatically loads /etc/profile; a non-login shell (for example, running a command over ssh) does not load /etc/profile, but it does load ~/.bashrc.
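For reference, a typical CentOS /etc/profile (and /etc/bashrc) picks those files up with a loop roughly like this; the exact contents vary by distribution:

# excerpt of the typical /etc/profile logic that sources /etc/profile.d/*.sh
for i in /etc/profile.d/*.sh ; do
    if [ -r "$i" ]; then
        . "$i"
    fi
done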

3) Zookeeper

For the specific installation steps, see the blog post [Zookeeper] Introduction to Zookeeper.

4) Hadoop

For the specific installation steps, see the blog post [Hadoop] HadoopHA Highly Available and Fully Distributed Construction.

5) Flume

For the specific installation steps, see the blog post [Flume] Introduction to Flume (1).

Notes on the Flume design choices:

[1] Source


1) How to choose between Taildir Source and Exec Source?

Taildir Source has advantages over Exec Source and Spooling Directory Source: it supports resumable reads (breakpoint continuation) and can monitor multiple directories. Before Flume 1.7 you had to write a custom Source that recorded the read position of each file to achieve breakpoint continuation.

Exec Source can collect data in real time, but data is lost if Flume is not running or the shell command fails.

Spooling Directory Source monitors a directory but does not support resuming from a breakpoint.

2) How should batchSize be set?

When each event is around 1 KB, a batchSize of 500-1000 is appropriate (the default is 100).

[2] Channel

We use Kafka Channel, which removes the need for a Sink and improves efficiency.
Note: before Flume 1.7, Kafka Channel was rarely used, because the parseAsFlumeEvent setting did not take effect: whether it was set to true or false, the data was always wrapped as a FlumeEvent, so Flume's headers were mixed into the content written to the Kafka message. That is not what we want here; we only need the content itself.

[3] Architecture diagram



[4] Write the Flume configuration file that collects the log data and sends it to Kafka (remember to synchronize the configuration to the other collection nodes)

# Note 1: we use the TAILDIR Source to monitor multiple directories with automatic breakpoint continuation (requires Flume 1.7+)
# Note 2: we use Kafka Channel instead of a Kafka Sink to improve efficiency

a1.sources=r1
a1.channels=c1 c2

# configure source
a1.sources.r1.type = TAILDIR
# File where read positions are persisted for breakpoint continuation
a1.sources.r1.positionFile = /opt/modules/flume/log_position/log_position.json
# Multiple file groups can be monitored; we only need one, so only f1 is defined
a1.sources.r1.filegroups = f1
# Files monitored by file group f1
a1.sources.r1.filegroups.f1 = /opt/tmp/logs/app.+
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1 c2
# Attach the interceptors
a1.sources.r1.interceptors = i1 i2
# Custom interceptors: i1 is the ETL interceptor, i2 is the log-type interceptor
# (comments are kept on their own lines so they are not read as part of the value)
a1.sources.r1.interceptors.i1.type = com.zsy.flume.interceptor.LogETLInterceptor$Builder
a1.sources.r1.interceptors.i2.type = com.zsy.flume.interceptor.LogTypeInterceptor$Builder
# Route events from the source to different channels based on the topic header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = topic
a1.sources.r1.selector.mapping.topic_start = c1
a1.sources.r1.selector.mapping.topic_event = c2

# configure channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = node01:9092,node02:9092,node03:9092
# start logs go to channel c1 / topic_start
a1.channels.c1.kafka.topic = topic_start
a1.channels.c1.parseAsFlumeEvent = false
a1.channels.c1.kafka.consumer.group.id = flume-consumer

a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.kafka.bootstrap.servers = node01:9092,node02:9092,node03:9092
# event logs go to channel c2 / topic_event
a1.channels.c2.kafka.topic = topic_event
a1.channels.c2.parseAsFlumeEvent = false
a1.channels.c2.kafka.consumer.group.id = flume-consumer

[5] Custom interceptor

Create a Maven project and add the following dependencies:

<dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.7.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

Steps to write a custom interceptor:
① Define a class that implements Flume's Interceptor interface
② Override its four methods:

  • initialize: initialization
  • intercept(Event): single-event processing
  • intercept(List<Event>): multi-event processing
  • close: release resources

③ Create a static inner Builder class that returns an instance of the interceptor class
④ Package the project and upload the jar

The code is as follows:

1) com.zsy.flume.interceptor.LogETLInterceptor

package com.zsy.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class LogETLInterceptor implements Interceptor {
    @Override
    public void initialize() {

    }

    @Override
    public Event intercept(Event event) {
        // ETL: clean the data
        // 1. Get the raw log line from the event body
        byte[] body = event.getBody();

        String log = new String(body, Charset.forName("UTF-8"));

        // 2. Validate according to the log type
        if (log.contains("start")) {
            // Validate a start log
            if (LogUtils.validateStart(log)) {
                return event;
            }
        } else {
            // Validate an event log
            if (LogUtils.validateEvent(log)) {
                return event;
            }
        }
        return null;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        ArrayList<Event> interceptors = new ArrayList<>();
        // Process each event and keep only the ones that pass validation
        for (Event event : events) {
            Event intercept = intercept(event);
            if (intercept != null) {
                interceptors.add(intercept);
            }
        }
        return interceptors;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new LogETLInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

2) com.zsy.flume.interceptor.LogTypeInterceptor

package com.zsy.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LogTypeInterceptor implements Interceptor {
    @Override
    public void initialize() {

    }

    @Override
    public Event intercept(Event event) {
        // Determine the log type (start vs event) and record it in the event header

        byte[] body = event.getBody();
        String log = new String(body, Charset.forName("UTF-8"));

        // Get the event headers
        Map<String, String> headers = event.getHeaders();

        // Set the topic header according to the log type
        if(log.contains("start")){
            headers.put("topic","topic_start");
        }else{
            headers.put("topic","topic_event");
        }

        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        ArrayList<Event> interceptors = new ArrayList<>();
        for (Event event : events) {
            Event intercept = intercept(event);
            interceptors.add(intercept);
        }
        return interceptors;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder{
        @Override
        public Interceptor build() {
            return new LogTypeInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

3) The utility class com.zsy.flume.interceptor.LogUtils

package com.zsy.flume.interceptor;

import org.apache.commons.lang.math.NumberUtils;

public class LogUtils {
    // Validate a start log
    public static boolean validateStart(String log) {
        if (log == null) {
            return false;
        }

        // The data must start with { and end with }
        if (!log.trim().startsWith("{") || !log.trim().endsWith("}")) {
            return false;
        }
        return true;
    }

    // Validate an event log
    public static boolean validateEvent(String log) {
        // Expected format: server timestamp | log content
        // The log content must start with { and end with }
        if (log == null) {
            return false;
        }

        // Split on the | separator
        String[] logContents = log.split("\\|");
        if(logContents.length != 2){
            return false;
        }

        // Validate the server timestamp (must be exactly 13 digits)
        if(logContents[0].length() != 13 || !logContents[0].matches("[0-9]{13}")){
//        if(logContents[0].length() != 13 || !NumberUtils.isDigits(logContents[0])){
            return false;
        }

        // Validate the log content format
        if (!logContents[1].trim().startsWith("{") || !logContents[1].trim().endsWith("}")) {
            return false;
        }
        return true;
    }
}

4) Package the project and upload the jar without dependencies to Flume's lib directory; Flume automatically loads every jar under lib.
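A sketch of the packaging and upload steps; the artifact name flume-interceptor-1.0-SNAPSHOT.jar is an assumption (it depends on the artifactId and version in your pom), and the Flume home is /opt/modules/flume as elsewhere in this article:

# build both jars (with and without dependencies) from the project root
mvn clean package

# copy the jar WITHOUT dependencies into Flume's lib directory on node01
cp target/flume-interceptor-1.0-SNAPSHOT.jar /opt/modules/flume/lib/

# distribute it to the other collection node(s) with the xsync script from earlier
xsync /opt/modules/flume/lib/flume-interceptor-1.0-SNAPSHOT.jar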

[6] Flume start/stop script

#! /bin/bash
case $1 in
"start"){
for i in node01 node02
do
echo " -------- starting collection flume on $i -------"
ssh $i "nohup /opt/modules/flume/bin/flume-ng agent --conf-file /opt/modules/flume/conf/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/dev/null 2>&1 &"
done
};;
"stop"){
for i in node01 node02
do
echo " -------- stopping collection flume on $i -------"
ssh $i "ps -ef | grep file-flume-kafka | grep -v grep |awk '{print \$2}' | xargs kill"
done
};;
esac

Note 1: nohup keeps the command running after you log out or the terminal is closed, i.e. the process is not hung up.
Note 2: awk's default field separator is whitespace.
Note 3: xargs takes the output of the previous command and passes it as arguments to the command that follows.
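As a quick sanity check, the same pipeline without the kill shows which PIDs the stop branch would target; run it on node01 or node02 while the agent is up:

ps -ef | grep file-flume-kafka | grep -v grep | awk '{print $2}'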

6) Kafka installation

For the specific installation steps, see the blog post [Kafka] Introduction to Kafka Analysis (1).
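The collection configuration above writes to topic_start and topic_event, so these topics need to exist (unless topic auto-creation is enabled). A minimal sketch, assuming Kafka lives under /opt/modules/kafka and ZooKeeper listens on port 2181 on the three nodes; the partition and replication counts are illustrative:

cd /opt/modules/kafka
# topic for start logs
bin/kafka-topics.sh --zookeeper node01:2181,node02:2181,node03:2181 \
  --create --topic topic_start --partitions 3 --replication-factor 2
# topic for event logs
bin/kafka-topics.sh --zookeeper node01:2181,node02:2181,node03:2181 \
  --create --topic topic_event --partitions 3 --replication-factor 2
# verify
bin/kafka-topics.sh --zookeeper node01:2181,node02:2181,node03:2181 --list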

7) Log generation

Prerequisite: package the log-generation code written earlier and put the jar on the servers.

Starting the log generator:

[1] Code parameter description

// Argument 1: delay in milliseconds between two log records; defaults to 0
Long delay = args.length > 0 ? Long.parseLong(args[0]) : 0L;
// Argument 2: number of loop iterations (records to generate); defaults to 1000
int loop_len = args.length > 1 ? Integer.parseInt(args[1]) : 1000;

[2] Copy the generated jar package log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar to the /opt directory on the node01 server and synchronize it to node02.

[3] Execute the jar program on node01

Method 1:

java -classpath log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar com.zsy.appclient.AppMain  > /dev/null 2>&1

Note:
If no main class was specified when the jar was packaged, use -classpath and give the fully qualified name of the main class after the jar.

Method 2:

java -jar log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar > /dev/null 2>&1

Note:
If the main class was specified when the jar was packaged, -jar can be used and the fully qualified main class name does not need to be given.
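A usage sketch with explicit arguments (the values are illustrative):

# delay 300 ms between records, loop 2000 times
java -jar /opt/log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar 300 2000 > /dev/null 2>&1 &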

Note: /dev/null is the Linux null device; everything written to it is discarded, which is why it is commonly called the "black hole".

  • Standard input, fd 0: input from the keyboard, /proc/self/fd/0
  • Standard output, fd 1: output to the screen (the console), /proc/self/fd/1
  • Standard error, fd 2: error output to the screen (the console), /proc/self/fd/2

So > /dev/null 2>&1 discards both standard output and standard error.


[4] Check the generated log data in the /opt/tmp/logs directory that we configured.
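A quick check on node01; the exact file names depend on the generator, so a wildcard is used:

ls -l /opt/tmp/logs/
# follow the log file(s) as they are written
tail -f /opt/tmp/logs/app*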

[5] Script

For convenience, we use scripts to generate the data.

The log data generation script is as follows

#! /bin/bash
for i in node01 node02
do
        echo "========== generating log data on $i ... =========="
        ssh $i "java -jar /opt/log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar  $1 $2  >/dev/null 2>&1 &"
done

The time synchronization script (a temporary helper, only needed later when the cluster time has to be synchronized) is as follows:

#!/bin/bash
for i in node01 node02 node03
do
        echo "========== $i =========="
        ssh -t $i "sudo date -s $1"
done

Parameter note:
We use -t above because the remote command uses sudo; -t makes ssh allocate a pseudo-terminal so that sudo can run properly. No need to dig deeper: whenever the remote command uses sudo, just add -t after ssh.

8) Flume consumes the Kafka data and stores it in HDFS

[1] node01 and node02 were configured above to collect the log data and send it to Kafka; now we consume the Kafka data on node03 and store it on HDFS. The architecture diagram is as follows:



The Flume configuration is as follows:

## Components
a1.sources=r1 r2
a1.channels=c1 c2
a1.sinks=k1 k2

## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = node01:9092,node02:9092,node03:9092
a1.sources.r1.kafka.topics = topic_start

## source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = node01:9092,node02:9092,node03:9092
a1.sources.r2.kafka.topics = topic_event

## channel1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/modules/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/modules/flume/data/behavior1/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6

## channel2
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/modules/flume/checkpoint/behavior2
a1.channels.c2.dataDirs = /opt/modules/flume/data/behavior2/
a1.channels.c2.maxFileSize = 2146435071
a1.channels.c2.capacity = 1000000
a1.channels.c2.keep-alive = 6

## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-

##sink2
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-

## Avoid generating lots of small files
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollInterval = 3600
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0

## Write the output files as LZO-compressed streams
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
a1.sinks.k2.hdfs.codeC = lzop

## Bind sources and sinks to channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
a1.sources.r2.channels = c2
a1.sinks.k2.channel= c2

[2] The difference between FileChannel and MemoryChannel
 

  • MemoryChannel is faster, but because the data sits in the JVM heap, the data is lost if the Agent process dies; it suits scenarios that can tolerate some data loss
  • FileChannel is slower than MemoryChannel, but data safety is higher: if the Agent process dies, the data can be recovered from disk


[3] FileChannel optimization

Configure dataDirs to point to multiple paths, each on a different hard disk, to increase Flume's throughput. The official description is:

Comma separated list of directories for storing log files. Using
multiple directories on separate disks can improve file channel
peformance

checkpointDir and backupCheckpointDir should also be placed on different hard disks where possible, so that if the checkpoint is corrupted the data can be recovered quickly from backupCheckpointDir. A sketch of such a configuration follows.
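A minimal sketch of what the c1 channel section could look like with these optimizations, assuming two extra disks mounted at /data1 and /data2 (the mount points and the dual-checkpoint switch are illustrative, not part of the original configuration); the same idea applies to c2:

## spread the channel data over two physical disks
a1.channels.c1.dataDirs = /data1/flume/data/behavior1,/data2/flume/data/behavior1
## keep the checkpoint and its backup on different disks
a1.channels.c1.checkpointDir = /data1/flume/checkpoint/behavior1
a1.channels.c1.useDualCheckpoints = true
a1.channels.c1.backupCheckpointDir = /data2/flume/checkpoint_backup/behavior1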

[4] Sink: HDFS Sink
 

  • (1) What is the impact of storing a large number of small files on HDFS?

 

  • Metadata level: each small file has its own metadata (file path, file name, owner, group, permissions, creation time, and so on), and all of it is kept in NameNode memory. Too many small files therefore consume a large amount of NameNode memory and hurt the NameNode's performance and lifespan
  • Compute level: by default, MapReduce starts one map task per small file, which greatly hurts computing performance; many small files also increase disk seek time

 

  • (2) HDFS small file processing


With their official default values, the three parameters hdfs.rollInterval, hdfs.rollSize and hdfs.rollCount produce many small files on HDFS. With the values used above, hdfs.rollInterval=3600, hdfs.rollSize=134217728 and hdfs.rollCount=0, their combined effect is:

  • (1) A new file is rolled once the current file reaches 128 MB (134217728 bytes)
  • (2) A new file is rolled once the current file has been open for more than 3600 seconds


9) Producing data

Finally, after all this preparation, we can produce data and store it on HDFS. Let's walk through the overall process!

The process is as follows:

1) Start Zookeeper
2) Start Hadoop cluster
3) Start Kafka
4) Start Flume
5) Generate the data

Now we can check the data on HDFS.
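A quick check from the command line; the date directory depends on when the data was generated:

hadoop fs -ls /origin_data/gmall/log/topic_start
hadoop fs -ls /origin_data/gmall/log/topic_event
# list the files of a specific day, for example:
hadoop fs -ls /origin_data/gmall/log/topic_start/2020-09-14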



Flume memory optimization

1) Problem description: when the consuming Flume is started, the following exception may be thrown:

ERROR hdfs.HDFSEventSink: process failed
java.lang.OutOfMemoryError: GC overhead limit exceeded

2) Solution steps:
 

  • (1) Add the following configuration to /opt/modules/flume/conf/flume-env.sh on node01:
export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
  • (2) Synchronize the configuration to the other servers


3) Flume memory parameter setting and optimization:

  • The JVM heap is generally set to 4 GB or higher, and Flume is deployed on a dedicated server (e.g. 4 cores, 8 threads, 16 GB of memory)
  • -Xmx and -Xms should be set to the same value to reduce the performance impact of heap resizing; inconsistent settings easily lead to frequent full GC
  • -Xms is the initial (minimum) JVM heap size allocated at startup; -Xmx is the maximum allowed heap size, allocated on demand. If the two differ, the heap has to grow when memory runs short, which easily triggers frequent full GC

Start/stop script for the whole data collection pipeline:

1) vim cluster.sh

#!/bin/bash
case $1 in
"start"){
echo " -------- starting the cluster -------"
# start the Zookeeper cluster
zk.sh start
sleep 1s;
echo " -------- starting the hadoop cluster -------"
/opt/modules/hadoop/sbin/start-dfs.sh
ssh node02 "/opt/modules/hadoop/sbin/start-yarn.sh"
sleep 7s;
# start the Flume collection agents
f1.sh start
# start the Kafka cluster
kk.sh start
sleep 7s;
# start the Flume consumer agent
f2.sh start
};;
"stop"){
echo " -------- stopping the cluster -------"
# stop the Flume consumer agent
f2.sh stop
# stop the Kafka cluster
kk.sh stop
sleep 7s;
# stop the Flume collection agents
f1.sh stop
echo " -------- stopping the hadoop cluster -------"
ssh node02 "/opt/modules/hadoop/sbin/stop-yarn.sh"
/opt/modules/hadoop/sbin/stop-dfs.sh
sleep 7s;
# stop the Zookeeper cluster
zk.sh stop
};;
esac
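Usage sketch, assuming cluster.sh is saved in ~/bin like xsync and made executable:

chmod 770 cluster.sh
cluster.sh start   # ZooKeeper -> HDFS/YARN -> collection Flume -> Kafka -> consumer Flume
cluster.sh stop    # tears everything down in the reverse order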


At this point, data production, cleaning, and transfer to HDFS are complete.
Next, we will start building Hive.
