Flume Data Acquisition


1.1 Data processing pipeline

1.2 Environment preparation

1.2.1 Cluster process viewing script

(1) Create the script xcall.sh in the /home/bigdata_admin/bin directory

[bigdata_admin@hadoop102 bin]$ vim xcall.sh

(2) Write the following content in the script

#! /bin/bash



for i in hadoop102 hadoop103 hadoop104

do

    echo --------- $i ----------

    ssh $i "$*"

done

(3) Modify the script execution permission

[bigdata_admin@hadoop102 bin]$ chmod 777 xcall.sh

(4) Run the script

[bigdata_admin@hadoop102 bin]$ xcall.sh jps

1.2.2 Hadoop installation

1) Installation steps

Omitted.

2) Project experience

(1) HDFS multi-directory storage

This does not need to be configured in the virtual machine environment used for this project, since each virtual machine has only one disk.

1. Production environment server disk status

2. Configure multiple directories in the hdfs-site.xml file, and pay attention to the access rights of the newly mounted disk.

The path where an HDFS DataNode stores its data is determined by the dfs.datanode.data.dir parameter, whose default value is file://${hadoop.tmp.dir}/dfs/data. If the server has multiple disks, this parameter must be modified accordingly. For the disk layout shown in the figure above, it should be set to the following value.

<property>

    <name>dfs.datanode.data.dir</name>

<value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>

</property>

Note: The disks mounted on each server may differ, so the multi-directory configuration can vary from node to node and should be set individually on each node.

(2) Cluster data balancing

1. Data balancing between nodes

Start the data balancer:

start-balancer.sh -threshold 10

The parameter value 10 means that the disk space utilization of the nodes in the cluster should not differ by more than 10%; it can be adjusted according to the actual situation.

Stop the data balancer:

stop-balancer.sh

2. Data balancing between disks

Generate a balancing plan (with only one disk, no plan will be generated):

hdfs diskbalancer -plan hadoop103

Execute the balancing plan:

hdfs diskbalancer -execute hadoop103.plan.json

Check the status of the current balancing task:

hdfs diskbalancer -query hadoop103

Cancel the balancing task:

hdfs diskbalancer -cancel hadoop103.plan.json
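Note: on some Hadoop versions the disk balancer is disabled by default and must be enabled before these commands work. A minimal sketch of the relevant hdfs-site.xml property (the property name comes from the Hadoop documentation; enabling it here is an assumption about this setup):

<property>
    <name>dfs.disk.balancer.enabled</name>
    <value>true</value>
</property>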

(3) Hadoop parameter tuning

1. HDFS parameter tuning (hdfs-site.xml)

The number of NameNode RPC server threads that listen to requests from clients. If dfs.namenode.servicerpc-address is not configured, the NameNode RPC server threads listen to requests from all nodes.

The NameNode has a pool of worker threads that handles concurrent heartbeats from DataNodes and concurrent metadata operations from clients.

For a large cluster, or a cluster with many clients, the default value of 10 for dfs.namenode.handler.count usually needs to be increased.

<property>

    <name>dfs.namenode.handler.count</name>

    <value>10</value>

</property>

The recommended value is dfs.namenode.handler.count = 20 × ln(ClusterSize) (natural logarithm); for example, for a cluster of 8 nodes this parameter should be set to 41. The value can be computed with a short Python snippet, as shown below.

[bigdata_admin@hadoop102 ~]$ python

Python 2.7.5 (default, Apr 11 2018, 07:36:10)

[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> import math

>>> print int(20*math.log(8))

41

>>> quit()

2. YARN parameter tuning (yarn-site.xml)

Scenario: 7 machines in total, hundreds of millions of records per day; data flow: data source -> Flume -> Kafka -> HDFS -> Hive.

Problem: HiveQL is mainly used for statistics. There is no data skew, small files have been merged, JVM reuse is enabled, I/O is not a bottleneck, and less than 50% of memory is used. Yet jobs still run very slowly, and when the data volume peaks the whole cluster goes down. What can be optimized in this situation?

Solution:

Memory is underutilized. This is usually governed by two YARN settings: the maximum amount of memory a single task can request, and the amount of memory available on a single Hadoop node. Adjusting these two parameters improves memory utilization.

(a) yarn.nodemanager.resource.memory-mb

The total amount of physical memory that YARN can use on the node. The default is 8192 (MB). Note that if the node has less than 8 GB of memory, this value must be reduced; YARN does not automatically detect the node's total physical memory.

(b) yarn.scheduler.maximum-allocation-mb

The maximum amount of physical memory that a single task can request. The default is 8192 (MB).
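For example, on a node with 64 GB of physical memory where roughly 48 GB is handed to YARN and a single container may request up to 16 GB, the two properties could be set as sketched below (the values are illustrative assumptions, not taken from the original project):

<property>
    <!-- Total physical memory YARN may use on this NodeManager (illustrative) -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>49152</value>
</property>
<property>
    <!-- Maximum physical memory a single container can request (illustrative) -->
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
</property>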

1.2.3 Zookeeper installation

1) Installation steps

Omitted.

2) ZK cluster start and stop script

(1) Create a script in the /home/bigdata_admin/bin directory of hadoop102

[bigdata_admin@hadoop102 bin]$ vim zk.sh

Write the following in the script.

#!/bin/bash



case $1 in

"start"){

for i in hadoop102 hadoop103 hadoop104

do

        echo ---------- starting zookeeper on $i ------------

ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"

done

};;

"stop"){

for i in hadoop102 hadoop103 hadoop104

do

        echo ---------- stopping zookeeper on $i ------------

ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"

done

};;

"status"){

for i in hadoop102 hadoop103 hadoop104

do

        echo ---------- zookeeper status on $i ------------

ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"

done

};;

esac

(2) Grant the script execute permission

[bigdata_admin@hadoop102 bin]$ chmod 777 zk.sh

(3) Start the Zookeeper cluster

[bigdata_admin@hadoop102 module]$ zk.sh start

(4) Stop the Zookeeper cluster

[bigdata_admin@hadoop102 module]$ zk.sh stop

1.2.4 Kafka installation

1) Installation steps

Omitted.

1.2.5 Flume installation

According to the collection channel plan, a Flume agent needs to be deployed on each of the three nodes hadoop102, hadoop103, and hadoop104. You can follow the steps below to install it on hadoop102 first and then distribute it.

1) Installation steps

Omitted.

2) Distribute Flume to hadoop103 and hadoop104

[bigdata_admin@hadoop102 ~]$ xsync /opt/module/flume/

3) Project experience

(1) Heap memory adjustment

The Flume heap memory is usually set to 4 GB or higher. The configuration is as follows:

Modify the /opt/module/flume/conf/flume-env.sh file and configure the following parameter (this is not configured in the virtual machine environment):

export JAVA_OPTS="-Xms4096m -Xmx4096m -Dcom.sun.management.jmxremote"

Note:

-Xms indicates the minimum size of JVM Heap (heap memory), initial allocation;

-Xmx indicates the maximum allowed size of JVM Heap (heap memory), allocated on demand.

1.3 Log Collection Flume

1.3.1 Log collection Flume configuration overview

According to the plan, the user behavior log files to be collected are distributed across two log servers, hadoop102 and hadoop103, so a log collection Flume agent needs to be configured on both nodes. The log collection Flume agent needs to read the contents of the log files, validate the log format (JSON), and then send the validated logs to Kafka.

Here we choose TaildirSource and KafkaChannel, and configure a log validation interceptor.

The reasons for choosing TailDirSource and KafkaChannel are as follows:

1) TailDirSource

TailDirSource has the following advantages over ExecSource and SpoolingDirectorySource:

TailDirSource supports resuming from a breakpoint and monitoring multiple directories. Before Flume 1.6, a custom Source was needed to record the read position of each file in order to resume from a breakpoint.

ExecSource can collect data in real time, but data is lost if the Flume process is not running or the shell command fails.

SpoolingDirectorySource monitors a directory and supports breakpoint resume.
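For reference, TaildirSource keeps its read offsets in the JSON file configured via positionFile (taildir_position.json in this project), which is how breakpoint resume works. A minimal illustrative example of its contents, with made-up inode, offset, and path values, might look like:

[{"inode":2496272,"pos":1024,"file":"/opt/module/applog/log/app.2023-06-18.log"}]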

2) KafkaChannel

Using KafkaChannel eliminates the need for a Sink and improves efficiency.

The key configuration of log collection Flume is as follows:

1.3.2 Log collection Flume configuration practice

1) Create a Flume configuration file

(1) Create file_to_kafka.conf under the job directory of Flume on the hadoop102 node.

[bigdata_admin@hadoop102 flume]$ mkdir job

[bigdata_admin@hadoop102 flume]$ vim job/file_to_kafka.conf

The content of the configuration file is as follows:

# Name the agent components

a1.sources = r1

a1.channels = c1



# Describe the source

a1.sources.r1.type = TAILDIR

a1.sources.r1.filegroups = f1

a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*

a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json

a1.sources.r1.interceptors =  i1

a1.sources.r1.interceptors.i1.type = com.bigdata_admin.flume.interceptor.ETLInterceptor$Builder



# Describe the channel

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel

a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092

a1.channels.c1.kafka.topic = topic_log

a1.channels.c1.parseAsFlumeEvent = false



# Bind the source to the channel (no sink is needed when using KafkaChannel)

a1.sources.r1.channels = c1

(2) Distribute configuration files to hadoop103

[bigdata_admin@hadoop102 flume]$ xsync job

2) Write a Flume interceptor

(1) Create a Maven project flume-interceptor

(2) Create package: com.bigdata_admin.flume.interceptor

(3) Add the following configuration to the pom.xml file

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.62</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

(4) Create the JSONUtils class under the com.bigdata_admin.flume.interceptor package

package com.bigdata_admin.flume.interceptor;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONException;

public class JSONUtils {
    public static boolean isJSONValidate(String log){
        try {
            JSON.parse(log);
            return true;
        }catch (JSONException e){
            return false;
        }
    }
}

(5) Create the ETLInterceptor class under the com.bigdata_admin.flume.interceptor package

package com.bigdata_admin.flume.interceptor;



import com.alibaba.fastjson.JSON;

import org.apache.flume.Context;

import org.apache.flume.Event;

import org.apache.flume.interceptor.Interceptor;



import java.nio.charset.StandardCharsets;

import java.util.Iterator;

import java.util.List;



public class ETLInterceptor implements Interceptor {



    @Override

    public void initialize() {



    }



    @Override

    public Event intercept(Event event) {



        byte[] body = event.getBody();

        String log = new String(body, StandardCharsets.UTF_8);



        if (JSONUtils.isJSONValidate(log)) {

            return event;

        } else {

            return null;

        }

    }



    @Override

    public List<Event> intercept(List<Event> list) {



        Iterator<Event> iterator = list.iterator();



        while (iterator.hasNext()){

            Event next = iterator.next();

            if(intercept(next)==null){

                iterator.remove();

            }

        }



        return list;

    }



    public static class Builder implements Interceptor.Builder{



        @Override

        public Interceptor build() {

            return new ETLInterceptor();

        }

        @Override

        public void configure(Context context) {



        }

    }



    @Override

    public void close() {



    }

}

(6) Package the project

flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar

(7) Put the packaged jar into the /opt/module/flume/lib directory on hadoop102 and hadoop103.

1.3.3 Log collection Flume test

1) Start Zookeeper and Kafka clusters

2) Start the log collection Flume agent on hadoop102

[bigdata_admin@hadoop102 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/file_to_kafka.conf -Dflume.root.logger=info,console

3) Start a Kafka Console-Consumer

[bigdata_admin@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_log

4) Generate mock data

[bigdata_admin@hadoop102 ~]$ lg.sh

5) Observe whether the Kafka consumer can consume the data

1.3.4 Flume start and stop script for log collection

1) Distribute the log collection Flume configuration file and interceptor

If the above test passes, copy the Flume configuration file and the interceptor jar package from hadoop102 to the other log server.

[bigdata_admin@hadoop102 flume]$ scp -r job hadoop103:/opt/module/flume/

[bigdata_admin@hadoop102 flume]$ scp lib/flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar hadoop103:/opt/module/flume/lib/

2) For convenience, here is a script to start and stop the log collection Flume process

(1) Create the script f1.sh in the /home/bigdata_admin/bin directory of the hadoop102 node

[bigdata_admin@hadoop102 bin]$ vim f1.sh

Fill in the following content in the script.

#!/bin/bash



case $1 in

"start"){

        for i in hadoop102 hadoop103

        do

                echo " --------启动 $i 采集flume-------"

                ssh $i "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf/ -f /opt/module/flume/job/file_to_kafka.conf >/dev/null 2>&1 &"

        done

};;

"stop"){

        for i in hadoop102 hadoop103

        do

                echo " --------停止 $i 采集flume-------"

                ssh $i "ps -ef | grep file_to_kafka | grep -v grep |awk  '{print \$2}' | xargs -n1 kill -9 "

        done



};;

esac

(2) Grant the script execute permission

[bigdata_admin@hadoop102 bin]$ chmod 777 f1.sh

(3) f1 start

[bigdata_admin@hadoop102 module]$ f1.sh start

(4) f1 stop

[bigdata_admin@hadoop102 module]$ f1.sh stop

1.4 Log Consumption Flume

1.4.1 Log consumption Flume configuration overview

According to the plan, this Flume agent needs to send the topic_log data from Kafka to HDFS, and it must separate the user behavior logs by the day they were generated, writing data from different days to different date-based paths in HDFS.

Here we select KafkaSource, FileChannel, and HDFSSink.

The key configuration is as follows:

1.4.2 Log consumption Flume configuration practice

1) Create a Flume configuration file

Create kafka_to_hdfs_log.conf under the job directory of Flume on hadoop104 node.

[bigdata_admin@hadoop104 flume]$ vim job/kafka_to_hdfs_log.conf

The content of the configuration file is as follows:

## Components

a1.sources=r1

a1.channels=c1

a1.sinks=k1



## source1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource

a1.sources.r1.batchSize = 5000

a1.sources.r1.batchDurationMillis = 2000

a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092

a1.sources.r1.kafka.topics=topic_log

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type = com.bigdata_admin.flume.interceptor.TimeStampInterceptor$Builder



## channel1

a1.channels.c1.type = file

a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1

a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1/

a1.channels.c1.maxFileSize = 2146435071

a1.channels.c1.capacity = 1000000

a1.channels.c1.keep-alive = 6





## sink1

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_log/%Y-%m-%d

a1.sinks.k1.hdfs.filePrefix = log-

a1.sinks.k1.hdfs.round = false





a1.sinks.k1.hdfs.rollInterval = 10

a1.sinks.k1.hdfs.rollSize = 134217728

a1.sinks.k1.hdfs.rollCount = 0



## Output file type (compressed stream with gzip)

a1.sinks.k1.hdfs.fileType = CompressedStream

a1.sinks.k1.hdfs.codeC = gzip



## Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel= c1

Notes on configuration optimization:

(1) FileChannel optimization

Configure dataDirs to point to multiple paths, with each path on a different hard disk, to increase Flume throughput.

The official description is as follows:

Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance

checkpointDir and backupCheckpointDir should also be placed on different hard disks where possible, so that if the checkpoint is corrupted, data can be quickly restored from backupCheckpointDir.
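A sketch of such a multi-disk FileChannel layout is shown below; the /data1, /data2, and /data3 mount points are assumptions for illustration, and backupCheckpointDir only takes effect when useDualCheckpoints is enabled:

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data1/flume/checkpoint/behavior1
a1.channels.c1.useDualCheckpoints = true
a1.channels.c1.backupCheckpointDir = /data2/flume/checkpoint-backup/behavior1
a1.channels.c1.dataDirs = /data2/flume/data/behavior1,/data3/flume/data/behavior1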

(2) HDFS sink optimization

①What is the impact of storing a large number of small files in HDFS?

Metadata: each small file has its own metadata (file path, file name, owner, group, permissions, creation time, and so on), all of which is kept in NameNode memory. Too many small files therefore occupy a large amount of NameNode memory, hurting the NameNode's performance and lifespan.

Computation: by default, MapReduce launches a separate Map task for each small file, which greatly hurts computing performance. Many small files also increase disk seek time.

②HDFS small file processing

With their official default values, the three roll parameters hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount cause many small files to be written to HDFS.

Setting hdfs.rollInterval=3600, hdfs.rollSize=134217728, and hdfs.rollCount=0 together has the following effect (see the sketch after this list):

  • When a file reaches 128 MB, it rolls over and a new file is created
  • When a file has been open for more than 3600 seconds, it rolls over and a new file is created
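A minimal sketch of the corresponding HDFS sink settings with these recommended values is shown below (note that the configuration file earlier in this section uses hdfs.rollInterval = 10, which rolls files much more aggressively):

a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0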

2) Write a Flume interceptor

(1) Create a TimeStampInterceptor class under the com.bigdata_admin.flume.interceptor package

package com.bigdata_admin.flume.interceptor;



import com.alibaba.fastjson.JSONObject;

import org.apache.flume.Context;

import org.apache.flume.Event;

import org.apache.flume.interceptor.Interceptor;



import java.nio.charset.StandardCharsets;

import java.util.ArrayList;

import java.util.List;

import java.util.Map;



public class TimeStampInterceptor implements Interceptor {



    private ArrayList<Event> events = new ArrayList<>();



    @Override

    public void initialize() {



    }



    @Override

    public Event intercept(Event event) {



        Map<String, String> headers = event.getHeaders();

        String log = new String(event.getBody(), StandardCharsets.UTF_8);



        JSONObject jsonObject = JSONObject.parseObject(log);



        String ts = jsonObject.getString("ts");

        headers.put("timestamp", ts);



        return event;

    }



    @Override

    public List<Event> intercept(List<Event> list) {

        events.clear();

        for (Event event : list) {

            events.add(intercept(event));

        }



        return events;

    }



    @Override

    public void close() {



    }



    public static class Builder implements Interceptor.Builder {

        @Override

        public Interceptor build() {

            return new TimeStampInterceptor();

        }



        @Override

        public void configure(Context context) {

        }

    }

}

(2) Repackage

flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar

(3) Put the repackaged jar into the /opt/module/flume/lib directory on hadoop104.

1.4.3 Log consumption Flume test

1) Start Zookeeper and Kafka clusters

2) Start log collection Flume

[bigdata_admin@hadoop102 ~]$ f1.sh start

3) Start the log consumption Flume agent on hadoop104

[bigdata_admin@hadoop104 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_log.conf -Dflume.root.logger=info,console

4) Generate mock data

[bigdata_admin@hadoop102 ~]$ lg.sh

5) Observe whether data appears in HDFS

1.4.4 Flume start and stop script for log consumption

If the above tests pass, for convenience, create a start-stop script for Flume here.

1) Create the script f2.sh in the /home/bigdata_admin/bin directory of the hadoop102 node

[bigdata_admin@hadoop102 bin]$ vim f2.sh

Fill in the following content in the script:

#!/bin/bash



case $1 in

"start")

        echo " --------启动 hadoop104 日志数据flume-------"

        ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_log.conf >/dev/null 2>&1 &"

;;

"stop")



        echo " --------停止 hadoop104 日志数据flume-------"

        ssh hadoop104 "ps -ef | grep kafka_to_hdfs_log | grep -v grep |awk '{print \$2}' | xargs -n1 kill"

;;

esac

2) Grant the script execute permission

[bigdata_admin@hadoop102 bin]$ chmod 777 f2.sh

3) f2 start

[bigdata_admin@hadoop102 module]$ f2.sh start

4) f2 stop

[bigdata_admin@hadoop102 module]$ f2.sh stop

1.5 Acquisition channel start/stop script

1) Create the script cluster.sh in the /home/bigdata_admin/bin directory

[bigdata_admin@hadoop102 bin]$ vim cluster.sh

Fill in the following content in the script:

#!/bin/bash



case $1 in

"start"){

        echo ================== starting the cluster ==================



        # Start the Zookeeper cluster

        zk.sh start



        # Start the Hadoop cluster

        cdh.sh start



        # Start the Kafka cluster

        kf.sh start



        # Start the Flume collection agents

        f1.sh start



        # Start the Flume consumption agent

        f2.sh start



        };;

"stop"){

        echo ================== stopping the cluster ==================



        # Stop the Flume consumption agent

        f2.sh stop



        # Stop the Flume collection agents

        f1.sh stop



        # Stop the Kafka cluster

        kf.sh stop



        # Stop the Hadoop cluster

        cdh.sh stop



# Loop until all Kafka processes in the cluster have stopped

# xcall.sh is the script written earlier; it runs the given command on every node of the cluster. Here, xcall.sh jps lists the Java processes on all nodes

# grep Kafka filters for the Kafka processes

# wc -l counts the lines; each process occupies one line in the jps output, so the line count equals the process count

# $() captures the output of the command inside the parentheses as a value

# So the command below counts the Kafka processes that have not yet stopped and assigns that count to the kafka_count variable



kafka_count=$(xcall.sh jps | grep Kafka | wc -l)



# Check whether kafka_count is greater than zero. If it is, some Kafka processes have not yet stopped, and Zookeeper must not be stopped yet, because Kafka depends on its Zookeeper nodes; stopping Zookeeper before the Kafka processes have stopped may prevent them from shutting down cleanly. So while the Kafka process count is greater than zero, the loop sleeps for one second and recounts the Kafka processes; only when the count reaches zero does the loop exit and the next step (stopping the Zookeeper cluster) proceed

while [ $kafka_count -gt 0 ]

do

sleep 1

kafka_count=$( xcall.sh jps | grep Kafka | wc -l)

            echo "当前未停止的 Kafka 进程数为 $kafka_count"

done



        # Stop the Zookeeper cluster

        zk.sh stop

};;

esac

2) Grant the script execute permission

[bigdata_admin@hadoop102 bin]$ chmod u+x cluster.sh

3) Start the cluster

[bigdata_admin@hadoop102 module]$ cluster.sh start

4) Stop the cluster

[bigdata_admin@hadoop102 module]$ cluster.sh stop
