This article is from the "Dark Horse Programmer" hudi course
6. Chapter 6 Hudi Case Study
6.1 Case Architecture
6.2 Business Data
6.2.1 Message Data Format
6.2.2 Data Generation
6.3 Qimo Data Collection
6.3.1 What is Apache Flume
6.3.2 Apache Flume Operating Mechanism
6.3.3 Apache Flume Installation and Deployment
6.3.4 Apache Flume Getting Started Program
6.3.5 Qimo Social Data Collection
6.4 Real-time Storage of Qimo Data
6.4.1 Create Module
6.4.2 Encapsulate Entity Class
6.4.3 Write Streaming Program
6.4.3.1 Build SparkSession Instance Object
6.4.3.2 Consume Kafka Data
6.4.3.3 Print console
6.4.3.4 Data analysis and conversion
6.4.3.5 Save Hudi table
6.4.4 Streaming program running
6.5 Integrated Hive indicator analysis
6.5.1 Create Hive table
6.5.2 Business indicator analysis
6.6 Spark offline Index analysis
6.6.1 Requirement description
6.6.2 Create database table
6.6.3 Write index analysis program
6.6.3.1 Load Hudi table data
6.6.3.2 Parse IP address and select fields
6.6.3.3 Business indicator analysis
6.6.4 Report program operation
6.7 FineBI report visualization
6.7.1 Install FineBI
6.7.2 Configure data source
6.7.3 Add data set
6.7.4 Create dashboard
6.7.5 Column chart: Top10 users send information Volume
6.7.6 Pie Chart: Information Volume of Top 10 Provinces
6.7.7 Map: Information Volume of Each Province
6. Chapter 6 Hudi case study
Qimo Social is a company specializing in customer service systems, and Chuanzhi Education has built its customer service system on top of it. A large number of users chat through the system every day. Chuanzhi Education wants to store these chat records and, at the same time, compute statistics on the daily message volume in real time. The task is to design a solution for both the data storage and the real-time statistical analysis.
The requirements are as follows:
- Choose a suitable storage system for the data that supports basic queries
- Compute the total number of messages in real time
- Compute the total number of messages sent and received in each region in real time
- Compute the number of messages sent and received by each customer in real time
6.1 Case Architecture
Qimo user chat data is collected in real time and buffered in a Kafka message queue. The data is then processed and converted in real time and stored in a Hudi table. Finally, Hive and Spark compute the business indicators, and the results are displayed in FineBI visual reports.
1. Apache Flume: distributed real-time log data collection framework
The business side continuously writes data into a directory, so the data must be collected in real time. Flume is a tool built for data collection; for example, it can monitor a directory and collect a new file as soon as it appears.
2. Apache Kafka: distributed message queue
If messages are produced very quickly, Flume also collects them at a high rate, so a container that can absorb the data quickly is needed before the later processing and conversion steps. The data collected by Flume is therefore written to Kafka for transport. Kafka is also the messaging system used by all business lines of the group to connect to downstream services (offline or real-time).
3. Apache Spark: distributed in-memory computing engine for offline and streaming data analysis
The Qimo case requires real-time collection, which means each record must be processed as soon as it is produced. This calls for a stream processing framework; either Structured Streaming or Flink would work.
In addition, the daily user message data is analyzed by business indicator and the results are stored in a MySQL database; Spark SQL is used for this part.
4. Apache Hudi: data lake framework
Qimo user chat data is finally stored in a Hudi table (underlying storage: the HDFS distributed file system). Hudi manages the data files in a unified way and is later integrated with Spark and Hive for business indicator analysis.
5. Apache Hive: big data warehouse framework
Hive is integrated with the Hudi table, so the Qimo chat data can be analyzed by writing SQL directly.
6. MySQL: relational database
The results of the business indicator analysis are stored in MySQL, which makes it convenient to display indicator reports later.
7. FineBI: reporting tool
A commercial charting tool from Fanruan that simplifies chart creation.
6.2 Business data
This case provides a tool for generating simulated Qimo social message data. The tool is packaged as a jar that can be deployed directly on the business side to produce data.
6.2.1 Message data format
User chat data is stored in the log file in text format, which contains 20 fields, as shown in the following figure:
Sample data:
Fields in each record are separated by the control character \001.
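As a quick way to check records in this format, the following is a minimal sketch in plain Scala that splits one line on the \001 separator and verifies the field count of 20 stated above. The object name RecordSplitter and its members are hypothetical and are not part of the course code.

```scala
object RecordSplitter {
  // Field separator of the log files: the control character U+0001 (written \001 in the text)
  val Separator: String = 1.toChar.toString

  // Number of fields per record, as stated in the section above
  val FieldCount: Int = 20

  /** Splits one raw log line into its fields.
    * Returns None when the line does not contain exactly FieldCount fields.
    */
  def split(line: String): Option[Array[String]] = {
    // limit = -1 keeps trailing empty fields, which the default split would drop
    val fields = line.split(Separator, -1)
    if (fields.length == FieldCount) Some(fields) else None
  }
}
```

Returning an Option lets a caller skip malformed lines instead of failing on an out-of-range index.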
6.2.2 Data Generation
Run the jar package 7Mo_DataGen.jar with the parameters below to generate simulated user chat data and write it to log files.
- Step 1. Create the directory for the source file
mkdir -p /export/data/7mo_init
- Step 2. Upload the data generation program
cd /export/data/7mo_init
rz
- Step 3. Create the directory for the simulated data
mkdir -p /export/data/7mo_data
- Step 4. Run the program to generate data
# 1. Syntax
java -jar /export/data/7mo_init/7Mo_DataGen.jar <source data path> <simulated data path> <interval between records in ms>
# 2. Test: generate one record every 500 ms
java -jar /export/data/7mo_init/7Mo_DataGen.jar \
/export/data/7mo_init/7Mo_Data.xlsx \
/export/data/7mo_data \
500
- Step 5. View the generated data
6.3 Qimo data collection
Qimo has many highly active users, so the volume of chat data is large (daily increment: 25 GB to 30 GB) and must be collected in real time. Apache Flume is chosen as the collection framework.
6.3.1 What is Apache Flume
Apache Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data. It was originally developed by Cloudera. URL: http://flume.apache.org/
At its core, Flume collects data from a data source (source) and delivers it to a specified destination (sink). To make delivery reliable, the data is first buffered in a channel before it is sent to the sink. Only after the data has actually reached the sink does Flume delete its buffered copy.
Flume currently has two versions:
- Flume 0.9.x is collectively referred to as Flume OG (original generation)
- Flume 1.x is collectively referred to as Flume NG (next generation)
Flume NG refactored the core components, core configuration, and code architecture, so it differs greatly from Flume OG. Another reason for the change was that Flume was brought under the Apache umbrella, at which point Cloudera Flume was renamed Apache Flume.
6.3.2 Apache Flume operating mechanism
The core role in a Flume system is the agent, a Java process that generally runs on the log collection node.
Each agent acts as a data transporter and contains three components:
- Source: the collection component, which connects to the data source and obtains data;
- Sink: the delivery component, which sends the collected data to the next-level agent or to the final storage system;
- Channel: the transport channel inside the agent, which passes data from the source to the sink.
Throughout the transmission process, data flows as events. The event is the basic unit of data transfer inside Flume.
An event encapsulates the transmitted data; for a text file, it is usually one line. The event is also the basic unit of a transaction. It flows from the source to the channel and then to the sink, is represented as a byte array, and can carry headers. An event is the smallest complete unit of data travelling from an external data source to an external destination.
A complete event consists of event headers and an event body, where the event body is the log record collected by Flume.
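The source, channel, and sink roles described above can be illustrated with a toy model. This is only a sketch in plain Scala: Event and ToyChannel are hypothetical names invented for illustration and are not the Flume API. It shows the one behaviour the text emphasises, namely that a buffered event is removed from the channel only after the sink has delivered it.

```scala
import scala.collection.mutable

// Toy model only: these types are NOT the Flume API.
// An event carries optional headers and a byte-array body.
final case class Event(headers: Map[String, String], body: Array[Byte])

final class ToyChannel(capacity: Int) {
  private val buffer = mutable.Queue.empty[Event]

  // Source side: buffer an event; returns false when the channel is full.
  def put(event: Event): Boolean =
    if (buffer.size < capacity) {
      buffer.enqueue(event)
      true
    } else false

  // Sink side: offer the oldest event to `deliver` and remove it only on success.
  def drainOne(deliver: Event => Boolean): Boolean =
    buffer.headOption match {
      case Some(event) if deliver(event) =>
        buffer.dequeue()
        true
      case _ => false
    }

  // Number of events currently buffered.
  def size: Int = buffer.size
}
```

If the delivery function returns false, the event stays in the buffer and can be retried, which mirrors the reliability guarantee described in section 6.3.1.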
6.3.3 Apache Flume installation and deployment
Installing Apache Flume is simple: extract the archive and then configure the JDK environment variable.
Step 1. Upload and extract
# Upload
cd /export/software
rz apache-flume-1.9.0-bin.tar.gz
# Extract, rename, and create a symbolic link
tar -zxf apache-flume-1.9.0-bin.tar.gz -C /export/server
cd /export/server
mv apache-flume-1.9.0-bin flume-1.9.0-bin
ln -s flume-1.9.0-bin flume
Step 2. Modify flume-env.sh
cd /export/server/flume/conf
mv flume-env.sh.template flume-env.sh
vim flume-env.sh
# Line 22: set the JDK path
export JAVA_HOME=/export/server/jdk
6.3.4 Apache Flume Getting Started Program
Requirement: listen on a port of the server (for example, 44444) and collect the data sent to that port.
- Step 1. Determine the three components
- Source component: a network component that can listen on a port is required; Apache Flume provides NetCat TCP Source
- Channel component: a fast transport channel is required (memory component); Apache Flume provides Memory Channel
- Sink component: the data only needs to be printed (log component); Apache Flume provides Logger Sink
- Step 2. Write the collection configuration file: netcat_source_logger_sink.properties
cd /export/server/flume/conf
vim netcat_source_logger_sink.properties
The content is as follows:
# Part 1: name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Part 2: describe and configure the source component r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = node1.itcast.cn
a1.sources.r1.port = 44444
# Part 3: describe and configure the sink component k1
a1.sinks.k1.type = logger
# Part 4: describe and configure the channel component; an in-memory buffer is used here
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Part 5: describe and configure how source, channel, and sink are connected
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 3. Start Flume, specifying the collection configuration file
/export/server/flume/bin/flume-ng agent -n a1 \
-c conf -f /export/server/flume/conf/netcat_source_logger_sink.properties \
-Dflume.root.logger=INFO,console
Parameter description:
-c conf specifies the directory that contains Flume's own configuration files
-f specifies the collection configuration file written above (here netcat_source_logger_sink.properties)
-n a1 specifies the name of the agent
- Step 4. Test: once the agent is running, send data to the port it listens on so that the agent has data to collect.
- install telnet
yum -y install telnet
- On any machine that can communicate with the agent node, execute the following command
telnet node1.itcast.cn 44444
6.3.5 Qimo Social Data Collection
Characteristics of the Qimo social data source: messages are continuously written to files in a directory. Functional requirement: monitor the files in that directory in real time and collect new data into Kafka as soon as it appears.
- Step 1. Determine the three components
- Source component: a file source that can monitor a directory; Apache Flume provides Taildir Source
- Channel component: a memory channel is generally chosen (more efficient); Apache Flume provides Memory Channel
- Sink component: a sink that writes to Kafka; Apache Flume provides Kafka Sink
- Step 2. Write the collection configuration file: 7mo_mem_kafka.properties
vim /export/server/flume/conf/7mo_mem_kafka.properties
The content is as follows:
# define a1
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1
a1.sources.s1.type = TAILDIR
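# Position file that records how far each monitored file has been read
a1.sources.s1.positionFile = /export/server/flume/position/taildir_7mo_kafka.json
# Group all monitored data sources into file groups
a1.sources.s1.filegroups = f1
# Define f1: all files in the monitored directory
a1.sources.s1.filegroups.f1 = /export/data/7mo_data/.*
# Add a key-value pair to the header of every event collected from f1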
a1.sources.s1.headers.f1.type = 7mo
a1.sources.s1.fileHeader = true
#define c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
#define k1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = 7MO-MSG
a1.sinks.k1.kafka.bootstrap.servers = node1.itcast.cn:9092
a1.sinks.k1.kafka.flumeBatchSize = 10
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 100
#bind
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
- Step 3, start ZK service and Kafka service
/export/server/zookeeper/bin/zkServer.sh start
/export/server/kafka/bin/kafka-server-start.sh -daemon /export/server/kafka/config/server.properties
- Step 4, create a topic
/export/server/kafka/bin/kafka-topics.sh --create \
--topic 7MO-MSG --partitions 3 --replication-factor 2 \
--bootstrap-server node1.itcast.cn:9092
- Step 5, start flume: specify the acquisition configuration file
/export/server/flume/bin/flume-ng agent \
-n a1 -c /export/server/flume/conf/ \
-f /export/server/flume/conf/7mo_mem_kafka.properties \
-Dflume.root.logger=INFO,console
- Step 6. Start the simulated data
java -jar /export/data/7mo_init/7Mo_DataGen.jar \
/export/data/7mo_init/7Mo_Data.xlsx \
/export/data/7mo_data \
5000
Check whether there is data in Kafka Topic:
6.4 Real-time storage of Qimo data
Write a Spark Structured Streaming program that consumes the social data from Kafka in real time, converts it (field extraction and so on), and saves it to a Hudi table. The table type is MOR (Merge On Read).
6.4.1 Creating modules
Create a Maven module for a program based on the Spark framework and add the related dependencies. The project structure is as follows:
The dependencies in the module's pom.xml:
<repositories>
<repository>
<id>aliyun</id>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
<repository>
<id>jboss</id>
<url>http://repository.jboss.com/nexus/content/groups/public</url>
</repository>
</repositories>
<properties>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.0.0</spark.version>
<hadoop.version>2.7.3</hadoop.version>
<hudi.version>0.9.0</hudi.version>
<mysql.version>5.1.48</mysql.version>
</properties>
<dependencies>
<!-- Scala language dependency -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark Core dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark SQL dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Structured Streaming + Kafka dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Hadoop Client dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- hudi-spark3 -->
<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-spark3-bundle_2.12</artifactId>
<version>${hudi.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- hudi-hive-sync -->
<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-hive-sync</artifactId>
<version>${hudi.version}</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.4.13</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.12</version>
</dependency>
<dependency>
<groupId>org.lionsoul</groupId>
<artifactId>ip2region</artifactId>
<version>1.7.2</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
</dependency>
</dependencies>
<build>
<outputDirectory>target/classes</outputDirectory>
<testOutputDirectory>target/test-classes</testOutputDirectory>
<resources>
<resource>
<directory>${project.basedir}/src/main/resources</directory>
</resource>
</resources>
<!-- Maven compiler plugins -->
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
Hudi table data is stored in an HDFS directory, so the HDFS configuration files are placed in the module's resources directory.
6.4.2 Encapsulating Entity Classes
The entity class for Qimo social data is MomoMessage, defined as a Scala case class.
package cn.itcast.hudi.momo
/**
* Case class that encapsulates one Momo chat record
*/
case class MomoMessage(
msg_time: String,
sender_nickyname: String,
sender_account: String,
sender_sex: String,
sender_ip: String,
sender_os: String,
sender_phone_type: String,
sender_network: String,
sender_gps: String,
receiver_nickyname: String,
receiver_ip: String,
receiver_account: String,
receiver_os: String,
receiver_phone_type: String,
receiver_network: String,
receiver_gps: String,
receiver_sex: String,
msg_type: String,
distance: String,
message: String
)
Later, the social data consumed from Kafka is parsed and wrapped into objects of this class.
6.4.3 Writing streaming programs
Create the object MomoStreamHudi and write its main method. Following the five steps of a streaming program, the code structure is as follows:
package cn.itcast.hudi.momo
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StringType
/**
* Structured Streaming program:
consumes Momo chat data from Kafka in real time, converts it, and saves it to a Hudi table
*/
object MomoStreamHudi {
def main(args: Array[String]): Unit = {
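// step1: build the SparkSession instance
val spark: SparkSession = createSparkSession(this.getClass)
// step2: consume data from Kafka in real time
// (topic names are case-sensitive; the topic created in section 6.3.5 is 7MO-MSG)
val kafkaStreamDF: DataFrame = readFromKafka(spark, "7MO-MSG")
// step3: extract the data and convert data types
val streamDF: DataFrame = process(kafkaStreamDF)
// step4: save the data to the Hudi table: MOR (Merge On Read)
//printToConsole(streamDF)
saveToHudi(streamDF)
// step5: after the streaming application starts, wait for termination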
spark.streams.active.foreach(
query => println(s"Query: ${query.name} is Running .............")
)
spark.streams.awaitAnyTermination()
}
}
6.4.3.1 Construct SparkSession instance object
Since Spark 2.x, SparkSession is the program entry point. Whether for Spark SQL batch processing or Structured Streaming, a program first creates a SparkSession object. This is wrapped in the method createSparkSession:
/**
* Create the SparkSession instance and set its basic properties
*/
def createSparkSession(clazz: Class[_]): SparkSession = {
SparkSession.builder()
// Use the class passed in by the caller as the application name
.appName(clazz.getSimpleName.stripSuffix("$"))
.master("local[2]")
// Serialization: Kryo
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Number of partitions and parallelism for shuffles
.config("spark.default.parallelism", 2)
.config("spark.sql.shuffle.partitions", 2)
.config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
.getOrCreate()
}
6.4.3.2 Consuming Kafka data
The method readFromKafka consumes topic data from Kafka, given the topic name and the address of the Kafka brokers.
/**
* Consume data in real time from the given Kafka topic
*/
def readFromKafka(spark: SparkSession, topicName: String): DataFrame = {
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "node1.itcast.cn:9092")
.option("subscribe", topicName)
.option("startingOffsets", "latest")
.option("maxOffsetsPerTrigger", 100000)
.option("failOnDataLoss", "false")
.load()
}
6.4.3.3 Print Console
The method printToConsole prints the streaming data to the console, which is convenient for testing during development.
def printToConsole(streamDF: DataFrame): Unit = {
streamDF.writeStream
.outputMode(OutputMode.Append())
.queryName("query-hudi-momo")
.format("console")
.option("numRows", "10")
.option("truncate", "false")
.option("checkpointLocation", "/datas/hudi-struct-ckpt-0")
.start()
}
6.4.3.4 Data Analysis and Transformation
The data consumed from Kafka is first parsed and wrapped into the MomoMessage entity class. Three fields are then added to supply the core fields of the Hudi table: message_id, the primary key of each record; day, the partition field; and ts, the precombine (merge) field.
/**
* Convert the data read from Kafka: extract the value of every field as a String so that it can be saved to the Hudi table
*/
def process(streamDF: DataFrame): DataFrame = {
import streamDF.sparkSession.implicits._
/*
2021-11-25 20:52:58牛星海17870843110女156.35.36.204IOS 9.0华为 荣耀Play4T4G91.319474,29.033363成紫57.54.100.313946849234Android 6.0OPPO A11X4G84.696447,30.573691 女TEXT78.22KM有一种想见不敢见的伤痛,这一种爱还埋藏在我心中,让我对你的思念越来越浓,我却只能把你你放在我心中。
*/
// 1- Extract the message payload
val messageStreamDF: DataFrame = streamDF.selectExpr("CAST(value AS STRING) message")
// 2- Parse the data and wrap it in the entity class
val momoStreamDS: Dataset[MomoMessage] = messageStreamDF
.as[String] // convert to Dataset[String]
.map(message => {
// The separator is the control character U+0001; limit -1 keeps trailing empty fields
val array = message.split("\u0001", -1)
val momoMessage = MomoMessage(
array(0), array(1), array(2), array(3), array(4), array(5), array(6), array(7),
array(8), array(9),array(10), array(11), array(12), array(13), array(14),
array(15), array(16), array(17), array(18), array(19)
)
// Return the entity object
momoMessage
})
// 3- Add the Hudi fields: primary key message_id, precombine field ts, partition field day
val hudiStreamDF = momoStreamDS.toDF()
.withColumn("ts", unix_timestamp($"msg_time").cast(StringType))
.withColumn(
"message_id",
concat($"sender_account", lit("_"), $"ts", lit("_"), $"receiver_account")
)
.withColumn("day", substring($"msg_time", 0, 10))
hudiStreamDF
}
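The three derived fields can be checked for a single record without starting Spark. The following is a minimal sketch in plain Scala that mirrors the column expressions above; HudiKeyFields and Keys are hypothetical names, and the explicit time zone parameter is an assumption of this sketch (Spark's unix_timestamp interprets the string in the session time zone).

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

object HudiKeyFields {
  // msg_time format as it appears in the sample record
  private val MsgTimeFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  final case class Keys(ts: String, messageId: String, day: String)

  /** Derives ts, message_id, and day for one record.
    * `zone` is an assumption of this sketch; it stands in for the Spark session time zone.
    */
  def derive(msgTime: String, senderAccount: String, receiverAccount: String, zone: ZoneId): Keys = {
    // ts: epoch seconds of msg_time, kept as a String like the other columns
    val ts = LocalDateTime.parse(msgTime, MsgTimeFormat).atZone(zone).toEpochSecond.toString
    // message_id: sender account, ts, and receiver account joined with underscores
    val messageId = s"${senderAccount}_${ts}_${receiverAccount}"
    // day: the first ten characters of msg_time (yyyy-MM-dd)
    Keys(ts, messageId, msgTime.substring(0, 10))
  }
}
```

Note that two messages between the same pair of accounts within the same second produce the same message_id, so Hudi would keep only one of them after the upsert.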
6.4.3.5 Saving the Hudi table
Use the foreachBatch method to save each micro-batch of the streaming DataFrame to the Hudi table; the required table properties must be specified.
/**
* Save the streaming DataFrame to a Hudi table; the table type can be COW or MOR
*/
def saveToHudi(streamDF: DataFrame): Unit = {
streamDF.writeStream
.outputMode(OutputMode.Append())
.queryName("query-hudi-momo")
// Save each micro-batch
.foreachBatch((batchDF: Dataset[Row], batchId: Long) => {
println(s"============== BatchId: $batchId start ==============")
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
batchDF.write
.format("hudi")
.mode(SaveMode.Append)
.option(TBL_NAME.key, "7mo_msg_hudi")
.option(TABLE_TYPE.key(), "MERGE_ON_READ")
.option(RECORDKEY_FIELD_NAME.key(), "message_id")
.option(PRECOMBINE_FIELD_NAME.key(), "ts")
.option(PARTITIONPATH_FIELD_NAME.key(), "day")
.option(HIVE_STYLE_PARTITIONING_ENABLE.key(), "true")
// Number of partitions used when inserts or upserts cause a shuffle
.option("hoodie.insert.shuffle.parallelism", "2")
.option("hoodie.upsert.shuffle.parallelism", "2")
// Storage path of the table data
.save("/hudi-warehouse/7mo_msg_hudi")
})
.option("checkpointLocation", "/datas/hudi-struct-ckpt")
.start()
}
The Structured Streaming program is now complete. Next, start each component service for testing.
6.4.4 Streaming program running
Start the ZK, Kafka, and HDFS services, run the streaming application, and finally start the Flume agent and the data simulation program. Then check the storage directory of the Hudi table.
# NameNode and DataNode
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
# ZK and Kafka services
/export/server/zookeeper/bin/zkServer.sh start
/export/server/kafka/bin/kafka-server-start.sh -daemon /export/server/kafka/config/server.properties
# Flume Agent
/export/server/flume/bin/flume-ng agent \
-c conf/ \
-n a1 \
-f /export/server/flume/conf/7mo_mem_kafka.properties \
-Dflume.root.logger=INFO,console
# Data simulation program
java -jar /export/data/7mo_init/7Mo_DataGen.jar \
/export/data/7mo_init/7Mo_Data.xlsx \
/export/data/7mo_data/ \
5000
Hudi storage directory structure:
At this point, the entire pipeline for storing Qimo social data in the Hudi table in real time is complete:
6.5 Integrated Hive indicator analysis
Associate the Hudi table data with the Hive table, and use clients such as beeline to write SQL to analyze the Hudi table data.
6.5.1 Create Hive table
Start the Hive MetaStore service and HiveServer2 service, and then start the beeline client:
/export/server/hive/bin/start-metastore.sh
/export/server/hive/bin/start-hiveserver2.sh
/export/server/hive/bin/start-beeline.sh
Write the DDL statement that creates a Hive table mapped to the Hudi table, setting the InputFormat implementation class.
-- Create a Hive table mapped to the Hudi table
CREATE EXTERNAL TABLE db_hudi.tbl_7mo_hudi(
msg_time String,
sender_nickyname String,
sender_account String,
sender_sex String,
sender_ip String,
sender_os String,
sender_phone_type String,
sender_network String,
sender_gps String,
receiver_nickyname String,
receiver_ip String,
receiver_account String,
receiver_os String,
receiver_phone_type String,
receiver_network String,
receiver_gps String,
receiver_sex String,
msg_type String,
distance String,
message String,
message_id String,
ts String
)
PARTITIONED BY (day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/hudi-warehouse/7mo_msg_hudi' ;
Because the Hudi table is partitioned, the partition information must be added manually:
alter table db_hudi.tbl_7mo_hudi
add if not exists partition(day = '2021-11-27') location '/hudi-warehouse/7mo_msg_hudi/day=2021-11-27' ;
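Because a partition has to be registered for every new day, the statement above can be generated rather than typed by hand. The following is a small sketch in plain Scala that only builds the SQL text; HivePartitionDdl is a hypothetical helper, and how the statement is then submitted to Hive (beeline, JDBC, a scheduler) is left open.

```scala
object HivePartitionDdl {
  /** Builds the ALTER TABLE statement that registers one day partition.
    * `day` must look like yyyy-MM-dd, matching the Hive-style directories day=yyyy-MM-dd.
    */
  def addPartition(table: String, basePath: String, day: String): String = {
    require(day.matches("""\d{4}-\d{2}-\d{2}"""), s"unexpected day format: $day")
    // Drop a trailing slash so the location never contains a double slash
    val root = basePath.stripSuffix("/")
    s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION (day = '$day') LOCATION '$root/day=$day'"
  }
}
```

For example, calling addPartition with db_hudi.tbl_7mo_hudi, /hudi-warehouse/7mo_msg_hudi, and 2021-11-27 yields a statement equivalent to the one shown above.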
Query the first 10 records of the Hive table:
SELECT
msg_time, sender_nickyname, receiver_nickyname, ts
FROM db_hudi.tbl_7mo_hudi
WHERE day = '2021-11-27'
limit 10 ;
6.5.2 Analysis of business indicators
Write SQL for simple statistical analysis of the Qimo social data. Because the data volume is small, enable local mode for execution.
set hive.exec.mode.local.auto=true;
set hive.mapred.mode=nonstrict;
- Indicator 1: total message volume
WITH tmp AS (
SELECT COUNT(1) AS momo_total FROM db_hudi.tbl_7mo_hudi WHERE day = '2021-11-27'
)
SELECT "全国" AS momo_name, momo_total FROM tmp;
- Indicator 2: number of messages sent by each user (top 10)
WITH tmp AS (
SELECT
sender_nickyname, COUNT(1) momo_total
FROM db_hudi.tbl_7mo_hudi
WHERE day = '2021-11-27' GROUP BY sender_nickyname
)
SELECT
sender_nickyname AS momo_name, momo_total
FROM tmp
ORDER BY momo_total DESC LIMIT 10;
- Indicator 3: number of messages received by each user (top 10)
WITH tmp AS (
SELECT
receiver_nickyname, COUNT(1) momo_total
FROM db_hudi.tbl_7mo_hudi
WHERE day = '2021-11-27' GROUP BY receiver_nickyname
)
SELECT
receiver_nickyname AS momo_name, momo_total
FROM tmp
ORDER BY momo_total DESC LIMIT 10;
- Indicator 4: number of messages by sender gender and receiver gender
SELECT
sender_sex, receiver_sex, COUNT(1) momo_total
FROM db_hudi.tbl_7mo_hudi
WHERE day = '2021-11-27' GROUP BY sender_sex, receiver_sex;
6.6 Spark offline indicator analysis
Write a Spark SQL program that loads the Hudi table data into a DataFrame, analyzes it with SQL according to the business indicators, and saves the results to a MySQL database table. The flow diagram is as follows:
6.6.1 Requirements Description
The statistics on Qimo social message data must cover the following requirements:
- 1) Count the total number of messages
- 2) Count the number of messages sent and received by each region (province), based on the IP address
- 3) Count the number of messages sent and received by each user
6.6.2 Creating database tables
The final results of the above requirements are stored in a single MySQL table: 7mo.7mo_report.
The field 7mo_category indicates the indicator type:
- 1: nationwide message volume
- 2: message volume sent by each province
- 3: message volume received by each province
- 4: message volume sent by each user
- 5: message volume received by each user
In MySQL, create the database 7mo and the table 7mo_report. The DDL statements are as follows:
-- Create the database
CREATE DATABASE IF NOT EXISTS 7mo ;
-- Create the table
CREATE TABLE IF NOT EXISTS `7mo`.`7mo_report` (
`7mo_name` varchar(100) NOT NULL,
`7mo_total` bigint(20) NOT NULL,
`7mo_category` varchar(100) NOT NULL,
PRIMARY KEY (`7mo_name`, `7mo_category`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ;
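To make the table layout concrete, the following sketch in plain Scala builds report rows of the shape (name, total, category) from an in-memory list of names. ReportRows and ReportRow are hypothetical names used only for illustration; the real program computes the same rows with Spark SQL.

```scala
object ReportRows {
  // One row of 7mo_report: 7mo_name, 7mo_total, 7mo_category
  final case class ReportRow(name: String, total: Long, category: String)

  /** Counts how often each key occurs and tags every row with the category code.
    * Rows are ordered by descending total, then by name.
    */
  def countBy(keys: Seq[String], category: String): Seq[ReportRow] =
    keys.groupBy(identity).toSeq
      .map { case (name, group) => ReportRow(name, group.size.toLong, category) }
      .sortBy(row => (-row.total, row.name))

  // Category "1": a single nationwide total row
  def total(messageCount: Long, name: String = "全国"): ReportRow =
    ReportRow(name, messageCount, "1")
}
```

The same user name can appear under category 4 (sent) and category 5 (received), which is why the primary key combines 7mo_name with 7mo_category.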
6.6.3 Writing an indicator analysis program
Create the object MomoSQLHudi and write its main method. The code structure is as follows:
package cn.itcast.hudi.momo
import org.apache.spark.sql.{DataFrame, Dataset, Row, SaveMode, SparkSession}
import org.lionsoul.ip2region.{DataBlock, DbConfig, DbSearcher}
/**
* Spark SQL program: analyze the Hudi table data with DSL and SQL, and save the results to a MySQL database table
*/
object MomoSQLHudi {
def main(args: Array[String]): Unit = {
// step1: build the SparkSession instance
val spark: SparkSession = createSparkSession(this.getClass)
// step2: load the Hudi table data from the given storage path
val hudiDF: DataFrame = loadHudiTable(spark, "/hudi-warehouse/7mo_msg_hudi")
//println(s"Count = ${hudiDF.count()}")
//hudiDF.printSchema()
//hudiDF.show(numRows = 10, truncate = false)
// step3: ETL: select fields and resolve IP addresses to provinces
val etlDF: DataFrame = etl(hudiDF)
//println(s"Count = ${etlDF.count()}")
//etlDF.printSchema()
//etlDF.show(numRows = 100, truncate = false)
// step4: business indicator analysis
process(etlDF)
// Application finished: release resources
spark.stop()
}
}
The SparkSession object is created with the same createSparkSession method used in the real-time storage program.
6.6.3.1 Loading Hudi table data
Use the Spark DataSource API to load the Hudi table data from the given storage path. This is wrapped in the method loadHudiTable.
/**
* Load the Hudi table data from the given storage path and return a DataFrame
*/
def loadHudiTable(spark: SparkSession, tablePath: String): DataFrame = {
val dataframe = spark.read
.format("hudi")
.load(tablePath)
// Return the data
dataframe
}
6.6.3.2 Parse IP address and select fields
To resolve an IP address to a province, the third-party library ip2region is recommended (official site: https://gitee.com/lionsoul/ip2region/). It is used as follows:
- Step 1. Copy the IP data file ip2region.db to the dataset directory of the project
- Step 2. Add the dependency in Maven
<dependency>
<groupId>org.lionsoul</groupId>
<artifactId>ip2region</artifactId>
<version>1.7.2</version>
</dependency>
- Step 3. Use ip2region
Register a custom UDF that takes an IP address, parses it, and returns the province.
Besides resolving the IP address to a province, the fields required by the business indicators must also be selected. This is wrapped in the method etl:
/**
* Select fields and resolve IP addresses to provinces
*/
def etl(dataframe: DataFrame): DataFrame = {
val session: SparkSession = dataframe.sparkSession
// 1- Register a custom UDF that resolves an IP address to a province
session.udf.register(
"ip_to_province",
(ip: String) => {
// Build the DbSearcher object
val dbSearcher = new DbSearcher(new DbConfig(), "dataset/ip2region.db")
// Look up the IP address
val dataBlock: DataBlock = dbSearcher.btreeSearch(ip)
// Example region: 中国|0|海南省|海口市|教育网
val region: String = dataBlock.getRegion
// Split the string and take the province
val Array(_, _, province, _, _) = region.split("\\|")
// Return the province
province
}
)
// 2- Select fields and resolve the IP addresses
dataframe.createOrReplaceTempView("view_tmp_momo")
val etlDF: DataFrame = session.sql(
"""
|SELECT
| day, sender_nickyname, receiver_nickyname,
| ip_to_province(sender_ip) AS sender_province,
| ip_to_province(receiver_ip) AS receiver_province
|FROM
| view_tmp_momo
|""".stripMargin
)
// 返回结果数据
etlDF
}
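The pattern match in the UDF above expects exactly five fields and throws a MatchError on any other region string. The following is a minimal sketch of a more defensive extraction in plain Scala, with no Spark or ip2region dependency. The object RegionParser, the method provinceOf, and the "unknown" placeholder are hypothetical names introduced for illustration, and the sketch assumes that missing fields are reported as 0, as in the sample region string above.

```scala
object RegionParser {
  // Placeholder returned when no province can be extracted (assumption of this sketch).
  val Unknown = "unknown"

  /** Extract the province (third field) from a region string such as
    * "中国|0|海南省|海口市|教育网"; fall back to Unknown on malformed input. */
  def provinceOf(region: String): String =
    Option(region)                                   // guard against null
      .map(_.split("\\|", -1))                       // keep empty trailing fields
      .collect { case parts if parts.length >= 3 => parts(2).trim }
      .filter(p => p.nonEmpty && p != "0")           // "0" is treated as "no value"
      .getOrElse(Unknown)
}
```

Inside the UDF, the destructuring line could then be replaced by a call such as RegionParser.provinceOf(region), so that one unexpected region string does not fail the whole job.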
6.6.3.3 Analysis of business indicators
Register the DataFrame as a temporary view, compute each indicator with a SQL statement, then union all indicator results and save them.
/**
 * Analyze the data according to the business indicators.
 */
def process(dataframe: DataFrame): Unit = {
  val session: SparkSession = dataframe.sparkSession
  // 1 - Register the DataFrame as a temporary view
  dataframe.createOrReplaceTempView("view_tmp_etl")
  // 2.1 - Indicator 1: total number of messages ("全国" means nationwide)
  val reportAllTotalDF: DataFrame = session.sql(
    """
      |WITH tmp AS (
      |  SELECT COUNT(1) AS 7mo_total FROM view_tmp_etl
      |)
      |SELECT "全国" AS 7mo_name, 7mo_total, "1" AS 7mo_category FROM tmp;
      |""".stripMargin
  )
  // 2.2 - Indicator 2: messages sent per province
  val reportSenderProvinceTotalDF: DataFrame = session.sql(
    """
      |WITH tmp AS (
      |  SELECT sender_province, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY sender_province
      |)
      |SELECT sender_province AS 7mo_name, 7mo_total, "2" AS 7mo_category FROM tmp;
      |""".stripMargin
  )
  // 2.3 - Indicator 3: messages received per province
  val reportReceiverProvinceTotalDF: DataFrame = session.sql(
    """
      |WITH tmp AS (
      |  SELECT receiver_province, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY receiver_province
      |)
      |SELECT receiver_province AS 7mo_name, 7mo_total, "3" AS 7mo_category FROM tmp;
      |""".stripMargin
  )
  // 2.4 - Indicator 4: messages sent per user
  val reportSenderNickyNameTotalDF: DataFrame = session.sql(
    """
      |WITH tmp AS (
      |  SELECT sender_nickyname, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY sender_nickyname
      |)
      |SELECT sender_nickyname AS 7mo_name, 7mo_total, "4" AS 7mo_category FROM tmp;
      |""".stripMargin
  )
  // 2.5 - Indicator 5: messages received per user
  val reportReceiverNickyNameTotalDF: DataFrame = session.sql(
    """
      |WITH tmp AS (
      |  SELECT receiver_nickyname, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY receiver_nickyname
      |)
      |SELECT receiver_nickyname AS 7mo_name, 7mo_total, "5" AS 7mo_category FROM tmp;
      |""".stripMargin
  )
  // 3 - Union all indicators and save the report to the MySQL database
  val reportTotalDF: Dataset[Row] = reportAllTotalDF
    .union(reportSenderProvinceTotalDF)
    .union(reportReceiverProvinceTotalDF)
    .union(reportSenderNickyNameTotalDF)
    .union(reportReceiverNickyNameTotalDF)
  // reportTotalDF.show(500, truncate = false)
  reportTotalDF
    .coalesce(1)
    .write
    .mode(SaveMode.Append)
    .format("jdbc")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("url",
      "jdbc:mysql://node1.itcast.cn:3306/?useUnicode=true&characterEncoding=utf-8&useSSL=false")
    .option("dbtable", "7mo.7mo_report")
    .option("user", "root")
    .option("password", "123456")
    .save()
}
Here the results are saved to the MySQL table directly through the JDBC external data source built into SparkSQL. Because the write uses SaveMode.Append, every run of the program appends a new set of rows to the table.
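To make the shape of the report table easier to see, the same five indicators can be sketched on ordinary Scala collections, without Spark. This is only an illustration of the (name, total, category) layout. The case classes Message and ReportRow and the methods countBy and buildReport are hypothetical names that do not appear in the course code.

```scala
object ReportSketch {
  // Hypothetical in-memory stand-in for one row returned by etl.
  final case class Message(
      senderNickname: String,
      receiverNickname: String,
      senderProvince: String,
      receiverProvince: String)

  // One report row, mirroring the columns 7mo_name, 7mo_total, 7mo_category.
  final case class ReportRow(name: String, total: Long, category: String)

  // Count messages per key and tag every resulting row with the category code.
  private def countBy(messages: Seq[Message], category: String)(
      key: Message => String): Seq[ReportRow] =
    messages
      .groupBy(key)
      .map { case (name, group) => ReportRow(name, group.size.toLong, category) }
      .toSeq

  // Concatenate the five indicators, as the chained union calls do in process.
  def buildReport(messages: Seq[Message]): Seq[ReportRow] =
    Seq(ReportRow("全国", messages.size.toLong, "1")) ++
      countBy(messages, "2")(_.senderProvince) ++
      countBy(messages, "3")(_.receiverProvince) ++
      countBy(messages, "4")(_.senderNickname) ++
      countBy(messages, "5")(_.receiverNickname)
}
```

Every indicator lands in the same three columns, and the category code is the only thing that distinguishes them. This is why the later queries always filter on 7mo_category first.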
6.6.4 Run the report program
After development is complete, run the Spark program: it loads the Hudi table data, computes the business indicators, and stores the results in the MySQL database.
- View the data in the MySQL table
- Query the top 5 rows of each indicator
-- Category 1: nationwide total message volume
(SELECT 7mo_name, 7mo_total, "全国总信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 1)
UNION
-- Category 2: messages sent per province
(SELECT 7mo_name, 7mo_total, "省份发送信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 2 ORDER BY 7mo_total DESC LIMIT 5)
UNION
-- Category 3: messages received per province
(SELECT 7mo_name, 7mo_total, "省份接收信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 3 ORDER BY 7mo_total DESC LIMIT 5)
UNION
-- Category 4: messages sent per user
(SELECT 7mo_name, 7mo_total, "用户发送信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 4 ORDER BY 7mo_total DESC LIMIT 5)
UNION
-- Category 5: messages received per user
(SELECT 7mo_name, 7mo_total, "用户接收信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 5 ORDER BY 7mo_total DESC LIMIT 5);
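Each parenthesized branch of the query above filters one category, sorts it by total in descending order, and keeps five rows. The same selection can be sketched on Scala collections as follows. ReportRow and topN are hypothetical names used only for illustration.

```scala
object TopNSketch {
  // Mirrors the report table columns 7mo_name, 7mo_total, 7mo_category.
  final case class ReportRow(name: String, total: Long, category: String)

  // Equivalent of: WHERE 7mo_category = category ORDER BY 7mo_total DESC LIMIT n
  def topN(rows: Seq[ReportRow], category: String, n: Int): Seq[ReportRow] =
    rows
      .filter(_.category == category)
      .sortBy(row => -row.total)   // descending by total
      .take(n)
}
```

For example, TopNSketch.topN(rows, "4", 5) would return the five users who sent the most messages, assuming rows holds the contents of the report table.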
6.7 FineBI report visualization
Use FineBI to connect to the MySQL database, load the business indicator report data, and display it in different charts.
6.7.1 Install FineBI
FineBI is a business intelligence product from Fanruan Software Co., Ltd. It is positioned as a self-service big data analysis tool that helps business staff and data analysts carry out problem-oriented exploratory analysis. Official website: https://www.finebi.com/
FineBI installation: refer to the "FineBI Windows Edition Installation Manual". After installation, start FineBI, log in, and get familiar with the basic pages.
- Start and log in
- Contents: home screen and help documentation
- Dashboard: used to build all visual reports
- Data preparation: used to configure data sources for the various reports
- Management system: used to administer the whole FineBI installation: user management, data source management, plug-in management, authority management, etc.
6.7.2 Configuring Data Sources
Create a MySQL database connection: [Management System] -> [Data Connection] -> [Data Connection Management]
Fill in the MySQL database connection information:
Data connection name: node1-mysql
Username: root
Password: 123456
Data connection URL: jdbc:mysql://node1.itcast.cn:3306/7mo?useUnicode=true&characterEncoding=utf8
6.7.3 Add dataset
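Add the business report table 7mo_report from the MySQL database: select [Data Preparation], then add the group [Qimo Data] and the business package [Qimo Report].
Open [Qimo Report], add a table, and choose the [SQL Data Set] method:
Enter the table name and the SQL statement. The CASE expression maps category codes 1 to 5 to display labels: total messages, messages sent per province, messages received per province, messages sent per user, and messages received per user.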
SELECT
  7mo_name, 7mo_total,
  CASE 7mo_category
    WHEN '1' THEN '总消息量'
    WHEN '2' THEN '各省份发送量'
    WHEN '3' THEN '各省份接收量'
    WHEN '4' THEN '各用户发送量'
    WHEN '5' THEN '各用户接收量'
  END AS 7mo_category
FROM 7mo.7mo_report
6.7.4 Create a dashboard
First create a dashboard named [Qimo Social Data Statistical Report], as shown in the figure below:
Next, select the template style [preset style 5] for the dashboard: the dark blue ocean background.
- First, add the title: [Others] -> [Text Components]
Enter the name of the dashboard: Qimo Social Data Statistical Report
- Second, add a text component to display the total number of messages
Select the table added earlier, 7mo_report_mysql, then choose the value field and filter on the category field, as shown in the figure below:
6.7.5 Column chart: Top 10 users by messages sent
The 10 users who sent the most messages are displayed in a column chart.
- Step 1: add a component, select [column chart], and fill in the title.
- Step 2: select the fields and set the filtering and display options.
The data displayed is the per-user statistics of messages sent.
Only the 10 users who sent the most messages are shown, which requires a filter.
The columns are sorted in descending order of messages sent.
6.7.6 Pie chart: Top 10 provinces by messages sent
The number of messages sent by the top 10 provinces is displayed in a pie chart. The steps are as follows:
- Step 1: add a component, select [Pie Chart], and fill in the title.
- Step 2: select the fields and set the filtering and display options.
Filter the data to keep only the per-province statistics of messages sent.
Then keep only the 10 provinces that sent the most messages:
In the pie chart above, the data labels are displayed outside the pie; the settings are as follows:
6.7.7 Map: Message volume by province
The number of messages sent by each province is displayed on a map. The steps are as follows:
- Step 1: add a component, select [Area Map], and fill in the title.
- Step 2: select the province field and map it to a geographic role.
- Step 3: select the fields and set the filtering and display options.
Filter the data to keep only the per-province statistics of messages sent.