Flume + Kafka + Hive: collecting user behavior data

Table of contents

Requirement background

Solution

Specific steps

1. Install and deploy Hadoop and start Hadoop

2. Install Flume under Windows

3. Flume configuration file 

4. Hive configuration file and startup

5. Format of the Kafka data message

6. Start Flume

7. Test

Summary

Pitfalls encountered

Flume + Kafka + HDFS


Requirement background

In this project, user behavior data (and other data) needs to be loaded into the big data warehouse, and a Kafka service is already available.

Solution

We can pull real-time data from Kafka with Flume and dump it to HDFS.

Once the data lands in HDFS, it is loaded into a Hive table with the LOAD DATA command; Hive then processes the user behavior data, and the results are finally written to MySQL for presentation to the client.
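As a rough sketch of that load step (the HDFS path and table name below are placeholders for illustration only; they are not the names used later in this article):

-- load one day's dumped behavior logs from HDFS into a partitioned Hive table
LOAD DATA INPATH '/origin_data/behavior/log/2021-08-06'
INTO TABLE dd_database_bigdata.ods_event_log PARTITION (dt='2021-08-06');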

Specific steps

1. Install and deploy Hadoop and start Hadoop

For specific steps, see:

Windows10 install Hadoop3.3.0_xieedeni's blog - CSDN blog

Install Hive3.1.2 on Windows 10_xieedeni's Blog - CSDN Blog

Note: the versions used here are Hadoop 3.3.0 and Hive 3.1.2, Kafka is a Tencent Cloud (CKafka) instance, and Flume 1.9 is the recommended Flume version.

2. Install Flume under Windows

1. Download Flume 1.9

The official download address is http://www.apache.org/dyn/closer.lua/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz

If that address is slow to download from, you can use this mirror instead: https://download.csdn.net/download/xieedeni/24882711

2. Unzip apache-flume-1.9.0-bin

3. Configure flume environment variables
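For example (a minimal sketch for the current cmd session, assuming the unzip location used later in this article; for a permanent setting use the system environment variables dialog or setx):

set FLUME_HOME=D:\work\soft\apache-flume-1.9.0-bin
set PATH=%PATH%;%FLUME_HOME%\bin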

3. Flume configuration file 

1. Create the Flume configuration file that connects Kafka to Hive, %FLUME%/conf/kafka2hive.conf:

# in this case called 'agent'
agent.sources = kafka_source
agent.channels = mem_channel
agent.sinks = hive_sink
# source configuration
agent.sources.kafka_source.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka_source.channels = mem_channel
agent.sources.kafka_source.batchSize = 5000
agent.sources.kafka_source.kafka.bootstrap.servers = ckafka-1:6003
agent.sources.kafka_source.kafka.topics = flume-collect
#agent.sources.kafka_source.kafka.topics = bi-collect
agent.sources.kafka_source.kafka.consumer.group.id = group-1
# Kafka access protocol
agent.sources.kafka_source.kafka.consumer.security.protocol = SASL_PLAINTEXT
agent.sources.kafka_source.kafka.consumer.sasl.mechanism = PLAIN
agent.sources.kafka_source.kafka.consumer.sasl.kerberos.service.name = kafka

# Hive Sink
agent.sinks.hive_sink.type = hive
agent.sinks.hive_sink.channel = mem_channel
agent.sinks.hive_sink.hive.metastore = thrift://localhost:9083
agent.sinks.hive_sink.hive.database = dd_database_bigdata
agent.sinks.hive_sink.hive.table = dwd_base_event_log_b
# which partition the collected data is written to
agent.sinks.hive_sink.hive.partition = %Y-%m-%d

agent.sinks.hive_sink.hive.txnsPerBatchAsk = 2
# write to Hive in batches
agent.sinks.hive_sink.batchSize = 10
# serializer
#agent.sinks.hive_sink.serializer = DELIMITED
agent.sinks.hive_sink.serializer = JSON
# the default delimiter is ,
agent.sinks.hive_sink.serializer.delimiter = "\t"
agent.sinks.hive_sink.serializer.serdeSeparator = '\t'

agent.sinks.hive_sink.serializer.fieldnames = biz_id,biz_type,behavior_type,behavior_value,user_id,longitude,latitude,ip,request_ip,app_version,app_id,device_id,device_type,network,mobile_type,os,session_id,trace_id,parent_trace_id,page_id,current_time_millis,sign,timestamp,token

# channel configuration
agent.channels.mem_channel.type = memory
agent.channels.mem_channel.capacity = 100000
agent.channels.mem_channel.transactionCapacity = 10000

Parameter Description:

a. Read the official documentation carefully, otherwise you will hit many pitfalls along the way: Flume 1.9.0 User Guide — Apache Flume

b. The Kafka security protocol is a real trap. There is a lot of information online, but most write-ups are incomplete. The key lines are:

# Kafka access protocol
agent.sources.kafka_source.kafka.consumer.security.protocol = SASL_PLAINTEXT
agent.sources.kafka_source.kafka.consumer.sasl.mechanism = PLAIN
agent.sources.kafka_source.kafka.consumer.sasl.kerberos.service.name = kafka

As you can see, the Kafka connection here uses SASL_PLAINTEXT. If you need a different mechanism, refer to the official documentation.

2. Since the security protocol is SASL_PLAINTEXT, the following settings are also required.

a. Copy %FLUME%/conf/flume-env.sh.template to flume-env.sh in the same folder, with the content:

export JAVA_HOME=D:\work\jdk1.8.0_291

b. Copy %FLUME%/conf/flume-env.ps1.template to flume-env.ps1 in the same folder, with the content:

$JAVA_OPTS="-Djava.security.auth.login.config=D:\work\soft\apache-flume-1.9.0-bin\conf\kafka_client_jaas.conf"

$FLUME_CLASSPATH="D:\work\soft\apache-flume-1.9.0-bin\lib"

This references a key file, kafka_client_jaas.conf, which holds the credentials used when Kafka is accessed over the public network with SASL_PLAINTEXT.

c. Create the file %FLUME%/conf/kafka_client_jaas.conf (i.e. put it under conf) with the content:

KafkaClient {  
	org.apache.kafka.common.security.plain.PlainLoginModule required
	username="ckafka-123#kafka"  
	password="123";  
};

The username here takes the form "instance id#username" (the Tencent Cloud CKafka convention).

4. Hive configuration file and startup

1. Modify the %HIVE_HOME%/conf/hive-site.xml file; in particular, make sure transaction support is enabled:

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
  <description>Whether to print the names of the columns in query output.</description>
</property>
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
  <description>Whether to include the current database in the Hive prompt.</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://xxx:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
	<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1:3306/hive?serverTimezone=UTC&amp;useSSL=false&amp;allowPublicKeyRetrieval=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/xxx/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
<property>
	  <name>hive.exec.parallel</name>
	  <value>true</value>
	  <description>Whether to execute jobs in parallel</description>
</property>
 
<property>
    	<name>hive.support.concurrency</name>
    	<value>true</value>
</property>
 
<property>
    	<name>hive.enforce.bucketing</name>
    	<value>true</value>
</property>
 
<property>
    	<name>hive.exec.dynamic.partition.mode</name>
    	<value>nonstrict</value>
</property>
<property>
    	<name>hive.txn.manager</name>
    	<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
 
<property>
    	<name>hive.compactor.initiator.on</name>
    	<value>true</value>
</property>
 
<property>
    	<name>hive.compactor.worker.threads</name>
    	<value>1</value>
</property>

2. Create the target table in Hive

USE dd_database_bigdata;
DROP TABLE IF EXISTS dwd_base_event_log_b;
CREATE TABLE dwd_base_event_log_b
(
     `biz_id` STRING COMMENT 'business id',
     `biz_type` STRING COMMENT 'content type',
     `behavior_type` STRING COMMENT 'behavior type',
     `behavior_value` STRING COMMENT 'behavior result, extension field',
     `user_id` STRING COMMENT 'user id, 0 when not logged in',
     `longitude` STRING COMMENT 'longitude',
     `latitude` STRING COMMENT 'latitude',
     `ip` STRING COMMENT 'ip address',
     `request_ip` STRING,
     `app_version` STRING COMMENT 'app version',
     `app_id` STRING COMMENT 'reporting source, appid',
     `device_id` STRING COMMENT 'device id',
     `device_type` STRING COMMENT 'device type: Android, iOS, mini program, PC, unknown',
     `network` STRING COMMENT 'network type: wifi, cellular',
     `mobile_type` STRING COMMENT 'phone model: iPhone X, Xiaomi 11, ...',
     `os` STRING COMMENT 'terminal operating system and version',
     `session_id` STRING COMMENT 'id of a single user visit (session)',
     `trace_id` STRING COMMENT 'unique behavior id',
     `parent_trace_id` STRING COMMENT 'parent behavior id',
     `page_id` STRING COMMENT 'page id',
     `current_time_millis` STRING COMMENT 'time',
     `sign` STRING COMMENT 'signature',
     `timestamp` STRING COMMENT 'date',
     `token` STRING COMMENT 'request token'
)
COMMENT 'base detail table of behavior event logs (bucketed)'
PARTITIONED BY (`dt` STRING)
stored as orc
LOCATION '/warehouse/dd/bigdata/dwd/dwd_base_event_log_b/'
tblproperties('transactional'='true');
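A quick sanity check (plain HiveQL) that the table really was created as transactional, which the Hive sink requires (see the pitfalls section below):

USE dd_database_bigdata;
SHOW TBLPROPERTIES dwd_base_event_log_b;
-- expect transactional=true in the output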

3. Start the Hive metastore

cd %HIVE_HOME%/bin
hive --service metastore &
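You can confirm that the metastore is up by checking that port 9083 (the port in hive.metastore.uris above) is listening, e.g. on Windows:

netstat -ano | findstr 9083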

5. Format of the Kafka data message

Note: do not convert the JSON object to a String and send that escaped string; otherwise the data structure will be wrong once it is stored in Hive. Correct examples:

{"id":"16","biz_id":"9","biz_type":"article","behavior_type":"content_share_weixin","behavior_value":"","user_id":"0","longitude":"113.8078063723414","latitude":"34.79383784587102","ip":"192.168.1.45","request_ip":"","app_version":"1.0","app_id":"210207512489024309","device_id":"C61319F8-E851-4C32-BFD2-7B137F3DF052","device_type":"iOS","network":"wifi","mobile_type":"iPhone 7","os":"14.7.1","session_id":"00000000000000000000004143195282","trace_id":"16282122332630026","parent_trace_id":"","page_id":"","create_time":"6/8/2021 09:10:35"}
{"id":"17","biz_id":"9","biz_type":"article","behavior_type":"content_share_weixin","behavior_value":"","user_id":"0","longitude":"113.8078063723414","latitude":"34.79383784587102","ip":"172.20.10.2","request_ip":"","app_version":"1.0","app_id":"210207512489024309","device_id":"C61319F8-E851-4C32-BFD2-7B137F3DF052","device_type":"iOS","network":"wifi","mobile_type":"iPhone 7","os":"14.7.1","session_id":"00000000000000000000003311508828","trace_id":"16282123402150022","parent_trace_id":"","page_id":"","create_time":"6/8/2021 09:12:21"}

Wrong example: the whole JSON body is wrapped in an extra layer of quotes ("") and sent as an escaped string; that is incorrect.
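For reference, here is a minimal producer sketch in Java that sends the JSON body directly rather than as a quoted/escaped string. The broker address, topic, and SASL credentials are the ones assumed elsewhere in this article; the class name is hypothetical and the JSON value is shortened for readability, but in practice it must contain every field listed in serializer.fieldnames:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BehaviorLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "ckafka-1:6003");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Same SASL_PLAINTEXT settings as the Flume Kafka source above.
        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"ckafka-123#kafka\" password=\"123\";");

        // The message value is the raw JSON object itself,
        // NOT a JSON string wrapped in an extra layer of quotes.
        String json = "{\"biz_id\":\"9\",\"biz_type\":\"article\","
                + "\"behavior_type\":\"content_share_weixin\",\"user_id\":\"0\","
                + "\"device_type\":\"iOS\",\"network\":\"wifi\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("flume-collect", json));
        }
    }
}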

6. Start Flume

The prerequisites for starting are: the Kafka service is running and the topic has been created; the Hadoop services are running; the Hive database and table have been created; and Flume needs write permission on the HDFS warehouse directory.
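A minimal, development-only way to prepare that directory (a sketch, assuming the LOCATION used in the table DDL above; chmod 777 is convenient for local testing but far too permissive for production):

hdfs dfs -mkdir -p /warehouse/dd/bigdata/dwd/dwd_base_event_log_b
hdfs dfs -chmod -R 777 /warehouse/dd/bigdata/dwd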

Start command:

cd %FLUME_HOME%/bin
flume-ng agent -c %FLUME_HOME%/conf -n agent -f %FLUME_HOME%/conf/kafka2hive.conf &

The main options are:

--conf or -c: the configuration directory, containing flume-env.sh and the log4j configuration, e.g. --conf conf
--conf-file or -f: the path of the agent configuration file, e.g. --conf-file conf/flume.conf
--name or -n: the agent name, e.g. --name a1

Once it starts successfully, you should see the agent start-up logs.

If no detailed log output appears, modify %FLUME%/conf/log4j.properties.
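For example (assuming the stock log4j.properties that ships with Flume 1.9), pointing the root logger at the console makes the agent output directly visible:

# in %FLUME%/conf/log4j.properties
flume.root.logger=DEBUG,console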

7. Test

Produce a message to Kafka; Flume consumes it and writes it into Hive.
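One quick way to send a test message is the Kafka console producer (a sketch, assuming the Kafka CLI tools are available and that client.properties contains the same SASL settings as the Flume source above); paste one of the JSON lines from step 5 as the message body:

kafka-console-producer.sh --broker-list ckafka-1:6003 --topic flume-collect --producer.config client.properties

Then verify in Hive: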

select * from dwd_base_event_log_test;

Summary

I am still a novice, and getting this working took nearly three days of research, with plenty of pitfalls along the way. The information available online is incomplete, which makes things hard for anyone who doesn't already understand them. In the end I reached my goal, and the problems themselves turned out not to be difficult; I just ran into obstacles at every step. So I am recording the issues I hit for future reference and sharing them with anyone who needs them. Don't give up: the sun always comes out after the storm.

Pitfalls encountered

1. After starting Flume, the log stops at (lifecycleSupervisor-1-0) [INFO - org.apache.kafka.common.utils.AppInfoParser$AppInfo.<init>(AppInfoParser.java:110)] Kafka commitId : xxxxxx

There is no further output and no indication of whether the agent connected to the Kafka topic: no error appears, and after Kafka produces a message, Flume receives nothing and gives no response.

Reason: check whether Kafka has a security policy enabled. If it does, you need to configure the access protocol:

# Kafka access protocol
agent.sources.kafka_source.kafka.consumer.security.protocol = SASL_PLAINTEXT
agent.sources.kafka_source.kafka.consumer.sasl.mechanism = PLAIN
agent.sources.kafka_source.kafka.consumer.sasl.kerberos.service.name = kafka

See step 3 for the method.

This is a beginner's pit: if this setting is missing, Flume cannot connect to Kafka at all.

2. The configuration file is correct and the connection to Kafka works, but after receiving a message Flume reports an error when writing to Hive:

org.apache.hive.hcatalog.streaming.InvalidTable: 
Invalid table db:dd_database_bigdata, table:dwd_base_event_log_test: is not an Acid table

Reason: the Hive table was not created correctly. It must be created with tblproperties('transactional'='true').

3. After receiving a Kafka message, writing it to Hive fails, and the cause is visible in the exception: org.apache.flume.EventDeliveryException: java.lang.ArrayIndexOutOfBoundsException: 6

Cause: the message data is incomplete. The expected format is:

{"id":"16","biz_id":"9","biz_type":"article","behavior_type":"content_share_weixin","behavior_value":"","user_id":"0","longitude":"113.8078063723414","latitude":"34.79383784587102","ip":"192.168.1.45","request_ip":"","app_version":"1.0","app_id":"210207512489024309","device_id":"C61319F8-E851-4C32-BFD2-7B137F3DF052","device_type":"iOS","network":"wifi","mobile_type":"iPhone 7","os":"14.7.1","session_id":"00000000000000000000004143195282","trace_id":"16282122332630026","parent_trace_id":"","page_id":"","create_time":"6/8/2021 09:10:35"}
{"id":"17","biz_id":"9","biz_type":"article","behavior_type":"content_share_weixin","behavior_value":"","user_id":"0","longitude":"113.8078063723414","latitude":"34.79383784587102","ip":"172.20.10.2","request_ip":"","app_version":"1.0","app_id":"210207512489024309","device_id":"C61319F8-E851-4C32-BFD2-7B137F3DF052","device_type":"iOS","network":"wifi","mobile_type":"iPhone 7","os":"14.7.1","session_id":"00000000000000000000003311508828","trace_id":"16282123402150022","parent_trace_id":"","page_id":"","create_time":"6/8/2021 09:12:21"}

All table fields must be present in the message body. If fields are missing, for example a message like {"id":"16","biz_id":"9"}, this error is raised.

4. If Hive and Flume are not on the same server, %HIVE_HOME% may not be readable by Flume, so the following may be required:

Copy the package under %HIVE_HOME%/hcatalog/share/hcatalog to %FLUME_HOME%/lib.
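On Windows, for example (a sketch; adjust the paths to your own installation):

copy "%HIVE_HOME%\hcatalog\share\hcatalog\*.jar" "%FLUME_HOME%\lib\"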

5. Note that Hive column names are not case-sensitive and are stored and displayed uniformly in lowercase.

For example, with this CREATE TABLE statement:

CREATE TABLE dwd_base_event_log_ddbi
(
     `id` STRING COMMENT 'behavior type id',
     `bizId` STRING COMMENT 'business id',
     `bizType` STRING COMMENT 'content type'
)
COMMENT 'base detail table of behavior event logs (test)'
clustered by(id) into 2 buckets stored as orc
LOCATION '/warehouse/dd/bigdata/dwd/dwd_base_event_log_ddbi/'
tblproperties('transactional'='true');

The actual table structure is equivalent to

CREATE TABLE dwd_base_event_log_ddbi
(
     `id` STRING COMMENT 'behavior type id',
     `bizid` STRING COMMENT 'business id',
     `biztype` STRING COMMENT 'content type'
)
COMMENT 'base detail table of behavior event logs (test)'
clustered by(id) into 2 buckets stored as orc
LOCATION '/warehouse/dd/bigdata/dwd/dwd_base_event_log_ddbi/'
tblproperties('transactional'='true');

If the data structure of the kafka message is

{"id":"16","bizId":"9","bizType":"article"}

Then Flume will report an error when it consumes the Kafka message and tries to write it into Hive, saying that the field bizId does not exist in table dwd_base_event_log_ddbi. Either send lowercase field names in the message, or make sure serializer.fieldnames matches the lowercase column names.

Flume + Kafka + HDFS

To collect the data into HDFS instead, see: flume+kafka+hdfs collects user behavior data_xieedeni's Blog - CSDN Blog

Origin blog.csdn.net/xieedeni/article/details/120524694