Collecting user behavior data with Flume + Kafka + HDFS

Table of contents

Requirement background

Solution

Specific steps

1. Install, deploy, and start Hadoop

2. Install Flume under Windows

3. Flume configuration file

4. Start Flume

5. Test

Summary

Pitfalls encountered


Requirement background

In our project, user behavior data (and other data) needs to be loaded into the big data warehouse, and a Kafka service is already available.

Solution

We can use Flume to consume real-time data from Kafka and dump it to HDFS.

After the data lands on HDFS, it is loaded into a Hive table with the LOAD DATA command; Hive then processes the user behavior data, and the results are finally written to MySQL for presentation to the client.
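
For reference, the load step could look like the sketch below; the table name ods_event_log, the partition column dt, and the date are illustrative assumptions, while the path matches the HDFS sink path configured later:

-- a minimal sketch of loading one day's dump into an existing Hive ODS table (names and date are illustrative)
LOAD DATA INPATH '/warehouse/dd/bigdata/ods/tmp/applogs/2021-09-27' INTO TABLE ods_event_log PARTITION (dt='2021-09-27');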

Specific steps

1. Install, deploy, and start Hadoop

For specific steps, see: Windows10 install Hadoop3.3.0_xieedeni's blog - CSDN blog

Install Hive3.1.2 on Windows 10_xieedeni's Blog - CSDN Blog

Note: I installed Hadoop 3.3.0, Kafka is a Tencent Cloud instance, and Flume 1.9 is the recommended version here.

2. Install Flume under Windows

1. Download Flume 1.9

The official download address is http://www.apache.org/dyn/closer.lua/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz

If that address is slow, you can use this mirror instead: https://download.csdn.net/download/xieedeni/24882711

2. Unzip apache-flume-1.9.0-bin

3. Configure the Flume environment variables
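
A minimal sketch, assuming the archive was extracted to D:\work\soft (adjust the path to your own installation):

rem set FLUME_HOME persistently; open a new command prompt afterwards for it to take effect
setx FLUME_HOME "D:\work\soft\apache-flume-1.9.0-bin"
rem then add %FLUME_HOME%\bin to the Path environment variable in the system settings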

3. Flume configuration file 

1. Create the Kafka-to-HDFS configuration file %FLUME%/conf/kafka2hdfs.conf:

agent.sources = kafka_source
agent.channels = mem_channel
agent.sinks = hdfs_sink
# source configuration
agent.sources.kafka_source.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka_source.channels = mem_channel
agent.sources.kafka_source.batchSize = 5000
agent.sources.kafka_source.kafka.bootstrap.servers = kafka1:6003
agent.sources.kafka_source.kafka.topics = flume-collect
agent.sources.kafka_source.kafka.consumer.group.id = group-1
# Kafka access protocol
agent.sources.kafka_source.kafka.consumer.security.protocol = SASL_PLAINTEXT
agent.sources.kafka_source.kafka.consumer.sasl.mechanism = PLAIN
agent.sources.kafka_source.kafka.consumer.sasl.kerberos.service.name = kafka
# sink configuration
agent.sinks.hdfs_sink.type = hdfs
agent.sinks.hdfs_sink.channel = mem_channel
agent.sinks.hdfs_sink.hdfs.path = hdfs://127.0.0.1:9000/warehouse/dd/bigdata/ods/tmp/applogs/%Y-%m-%d
agent.sinks.hdfs_sink.hdfs.filePrefix = ods_event_log-
agent.sinks.hdfs_sink.hdfs.fileSuffix = .log
agent.sinks.hdfs_sink.hdfs.rollSize = 0  
agent.sinks.hdfs_sink.hdfs.rollCount = 0  
agent.sinks.hdfs_sink.hdfs.rollInterval = 3600  
agent.sinks.hdfs_sink.hdfs.threadsPoolSize = 30
agent.sinks.hdfs_sink.hdfs.fileType=DataStream    
agent.sinks.hdfs_sink.hdfs.useLocalTimeStamp=true
agent.sinks.hdfs_sink.hdfs.writeFormat=Text
# channel configuration
agent.channels.mem_channel.type = memory
agent.channels.mem_channel.capacity = 100000
agent.channels.mem_channel.transactionCapacity = 10000

Parameter Description:

a. Read the official documentation carefully, otherwise you will hit many pitfalls along the way: Flume 1.9.0 User Guide — Apache Flume

b. The Kafka security protocol is a real trap. There is plenty of information online, but most of it is incomplete:

# Kafka access protocol
agent.sources.kafka_source.kafka.consumer.security.protocol = SASL_PLAINTEXT
agent.sources.kafka_source.kafka.consumer.sasl.mechanism = PLAIN
agent.sources.kafka_source.kafka.consumer.sasl.kerberos.service.name = kafka

As you can see, the Kafka protocol here is SASL_PLAINTEXT. If you need a different mechanism, refer to the official documentation.

2. Since the protocol is SASL_PLAINTEXT, the following settings are also required

a. Copy %FLUME%/conf/flume-env.sh.template to flume-env.sh in the same folder, with the content:

export JAVA_HOME=D:\work\jdk1.8.0_291

b. Copy %FLUME%/conf/flume-env.ps1.template to flume-env.ps1 in the same folder, with the content:

$JAVA_OPTS="-Djava.security.auth.login.config=D:\work\soft\apache-flume-1.9.0-bin\conf\kafka_client_jaas.conf"

$FLUME_CLASSPATH="D:\work\soft\apache-flume-1.9.0-bin\lib"

This involves a key file, kafka_client_jaas.conf, which is needed when Kafka is accessed over the public network using the SASL_PLAINTEXT protocol.

c. Create the file %FLUME%/conf/kafka_client_jaas.conf under conf, with the content:

KafkaClient {  
	org.apache.kafka.common.security.plain.PlainLoginModule required
	username="ckafka-123#kafka"  
	password="123";  
};

The username here takes the form "instance id#username".

4. Start Flume

Prerequisites for starting: the Kafka service is running and the topic has been created; the Hadoop service is running, the database has been created, and the HDFS path must be writable by Flume.
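
As a rough sketch, the HDFS side can be prepared like this; the path mirrors hdfs.path in kafka2hdfs.conf, and the wide-open permissions are only meant for a local test:

rem create the target directory used by the HDFS sink and relax its permissions for testing
hdfs dfs -mkdir -p /warehouse/dd/bigdata/ods/tmp/applogs
hdfs dfs -chmod -R 777 /warehouse/dd/bigdata/ods/tmp/applogs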

Start command:

cd %FLUME_HOME%/bin
flume-ng agent -c %FLUME_HOME%/conf -n agent -f %FLUME_HOME%/conf/kafka2hdfs.conf &

Parameters:
--conf or -c: the configuration directory, containing flume-env.sh and the log4j configuration (example: --conf conf)
--conf-file or -f: the agent configuration file (example: --conf-file conf/flume.conf)
--name or -n: the agent name (example: --name a1)

Once it starts successfully, Flume prints its startup log.

If no detailed log output appears, adjust %FLUME%/conf/log4j.properties.
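
For example, in the stock Flume 1.9 log4j.properties the root logger is controlled by a single property; pointing it at the console appender makes the log visible in the terminal (a sketch, assuming the default file):

# in %FLUME%/conf/log4j.properties: send the root logger to the console appender at DEBUG level
flume.root.logger=DEBUG,console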

5. Test

Produce a message to Kafka; Flume consumes it and the data lands on HDFS.
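
One way to test, sketched below; it assumes the Kafka command-line tools are available and that a client.properties file (a hypothetical name) carries the same SASL settings as the source; the Tencent Cloud console can also be used to send a test message:

rem produce a test message to the topic (client.properties is a hypothetical file with the SASL settings)
kafka-console-producer.bat --broker-list kafka1:6003 --topic flume-collect --producer.config client.properties
rem verify that data is landing under the date directory written by the HDFS sink
hdfs dfs -ls /warehouse/dd/bigdata/ods/tmp/applogs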

Summary

I am a novice and rather slow; it took nearly three days of research to get this working, and I hit many pitfalls along the way. The information available online is not complete enough, and that makes this kind of task hard for someone who does not yet understand it. In the end I reached my goal, and in hindsight the problems were really not difficult, but I ran into obstacles at every step of the process. So I am recording the problems I encountered for future reference and sharing them with anyone who needs them. Don't give up; the sun always comes out after the storm.

Pitfalls encountered

1. After starting Flume, the log stops at (lifecycleSupervisor-1-0) [INFO - org.apache.kafka.common.utils.AppInfoParser$AppInfo.<init>(AppInfoParser.java:110)] Kafka commitId : xxxxxx

Nothing is printed after that, and there is no indication of whether the Kafka topic has been connected. There is no error message, and when Kafka produces a message, Flume receives nothing and gives no response.

Cause: check whether Kafka has a security policy enabled. If it has, you need to configure the protocol:

# Kafka access protocol
agent.sources.kafka_source.kafka.consumer.security.protocol = SASL_PLAINTEXT
agent.sources.kafka_source.kafka.consumer.sasl.mechanism = PLAIN
agent.sources.kafka_source.kafka.consumer.sasl.kerberos.service.name = kafka

See step 3 for the method.

This is a beginner's pitfall: if you miss this setting, Flume cannot connect to Kafka.

2. The configuration file is correct and Kafka is connected, but when Flume tries to write the received messages to HDFS it reports an error:

[ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V

Congratulations, this should be the last pit.

Cause: a jar package conflict.

The fix is to copy Hadoop's hadoop-3.3.0/share/hadoop/common/lib/guava-27.0-jre.jar into Flume's %FLUME_HOME%/lib, delete the guava-11.0.2.jar that ships with Flume from %FLUME_HOME%/lib, and then restart Flume.
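
A sketch of those two operations on Windows, assuming Hadoop and Flume both live under D:\work\soft (adjust to your own paths):

rem copy Hadoop's newer guava into Flume's lib and remove the old one, then restart Flume
copy D:\work\soft\hadoop-3.3.0\share\hadoop\common\lib\guava-27.0-jre.jar D:\work\soft\apache-flume-1.9.0-bin\lib\
del D:\work\soft\apache-flume-1.9.0-bin\lib\guava-11.0.2.jar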
