06_Hudi Case Study: Apache Flume log collection, SparkSession data processing, writing data into Kafka, saving Hudi tables, integrating Hive for indicator analysis, loading Hudi table data, FineBI report visualization, and more.

This article is from the "Dark Horse Programmer" Hudi course.

6. Chapter 6 Hudi Case Study
6.1 Case Architecture
6.2 Business Data
6.2.1 Message Data Format
6.2.2 Data Generation
6.3 Qimo Data Collection
6.3.1 What is Apache Flume
6.3.2 Apache Flume Operating Mechanism
6.3.3 Apache Flume Installation and Deployment
6.3.4 Apache Flume Getting Started Program
6.3.5 Qimo Social Data Collection
6.4 Real-time Storage of Qimo Data
6.4.1 Creating Modules
6.4.2 Encapsulating Entity Classes
6.4.3 Writing Streaming Programs
6.4.3.1 Construct SparkSession Instance Object
6.4.3.2 Consuming Kafka Data
6.4.3.3 Print Console
6.4.3.4 Data Analysis and Transformation
6.4.3.5 Saving the Hudi Table
6.4.4 Streaming program running
6.5 Integrated Hive indicator analysis
6.5.1 Create Hive table
6.5.2 Business indicator analysis
6.6 Spark Offline Indicator Analysis
6.6.1 Requirement description
6.6.2 Create database table
6.6.3 Write index analysis program
6.6.3.1 Load Hudi table data
6.6.3.2 Parse IP address and select fields
6.6.3.3 Business indicator analysis
6.6.4 Report program operation
6.7 FineBI report visualization
6.7.1 Install FineBI
6.7.2 Configure data source
6.7.3 Add data set
6.7.4 Create dashboard
6.7.5 Column Chart: Top 10 Users by Messages Sent
6.7.6 Pie Chart: Top 10 Provinces by Messages Sent
6.7.7 Map: Message Volume by Province

6. Chapter 6 Hudi Case Study

Qimo Social is a company specializing in customer service systems, and Chuanzhi Education has built its customer service system on top of Qimo Social. A large number of users chat every day. Chuanzhi Education now wants to store these chat records and, at the same time, perform real-time statistical analysis of the daily message volume. Design how to implement both the data storage and the real-time statistical analysis.
The requirements are as follows:

    1. Choose a reasonable storage container for the data, one that also supports basic query work
    2. Count the total message volume in real time
    3. Count the number of messages sent and received in each region in real time
    4. Count the number of messages sent and received by each user in real time

6.1 Case Architecture

Collect Qimo user chat data in real time, buffer the messages in a Kafka message queue, process and convert the data in real time, and store the messages in a Hudi table. Finally, use Hive and Spark to compute business indicators and display them in FineBI visual reports.
insert image description here

  • 1. Apache Flume: distributed real-time log data collection framework
    Business-side data is continuously produced into a directory, so it must be collected in real time. Flume is a tool built specifically for data collection: for example, it can monitor the files in a directory and collect each new file as soon as it is generated.

  • 2. Apache Kafka: distributed message queue
    Even when messages arrive very quickly, Flume can still collect them efficiently, so a container is needed that can absorb the data just as quickly; the data also has to be processed and converted later. The data collected by Flume is therefore written into Kafka for message transmission. Kafka is also the message system used uniformly by all business lines of the group to connect with subsequent (offline or real-time) business.

  • 3. Apache Spark: distributed in-memory computing engine for offline and streaming data analysis and processing
    The Qimo social case requires real-time collection, which means each piece of data should be processed as soon as it arrives, so a stream processing framework is needed; Structured Streaming or Flink would both work.
    In addition, the daily user message data is analyzed according to business indicators and finally stored in a MySQL database; SparkSQL is chosen for this.

  • 4. Apache Hudi: data lake framework
    The Qimo user chat message data is ultimately stored in a Hudi table (underlying storage: the HDFS distributed file system), which manages the data files in a unified way and is later integrated with Spark and Hive for business indicator analysis.

  • 5. Apache Hive: big data warehouse framework
    Integrated with the Hudi table; to analyze the Qimo chat data, just write SQL directly.

  • 6. MySQL: Relational database
    Store the analysis results of business indicators in the MySQL database, which is convenient for displaying indicator reports later.

  • 7. FineBI: Reporting tool
    A commercial charting tool of Fanruan Company, which makes chart making easier

6.2 Business data

For this case, a tool is provided specifically for producing Qimo social message data; it can be deployed directly on the business side to generate data. The next step is to deploy the tool's jar package and use it to produce data.

6.2.1 Message data format

User chat data is stored in a log file in text format; each record contains 20 fields, as shown in the following figure:
insert image description here

Sample data:
insert image description here
The fields in each record are separated by the \001 character.
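
Splitting on this delimiter is easy to verify in a few lines of Scala. The sketch below is not part of the course code; the object name DelimiterDemo and the placeholder field values are invented for illustration, and \u0001 is simply the Unicode escape for \001:

object DelimiterDemo extends App {
  // Build one fake record: 20 invented placeholder fields joined by the \001 character.
  val rawLine: String = (1 to 20).map(i => s"field$i").mkString("\u0001")

  // Split back on the same delimiter, exactly as the streaming job does later.
  val fields: Array[String] = rawLine.split("\u0001")
  println(fields.length)   // 20
  println(fields(0))       // the first field is msg_time in the real data
}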

6.2.2 Data Generation

Run the jar package 7Mo_DataGen.jar with the required parameters to simulate user chat data and write it into the log file.
insert image description here

  • The first step, create the original file directory
mkdir -p /export/data/7mo_init
  • The second step, upload the simulation data program
cd /export/data/7mo_init
rz

insert image description here

  • The third step, create a simulation data directory
mkdir -p /export/data/7mo_data

insert image description here

  • The fourth step, run the program to generate data
# 1. Syntax
java -jar /export/data/7mo_init/7Mo_DataGen.jar <original data path> <simulated data output path> <random interval between records in ms>

# 2. Test: generate one record every 500 ms
java -jar /export/data/7mo_init/7Mo_DataGen.jar \
/export/data/7mo_init/7Mo_Data.xlsx \
/export/data/7mo_data \
500
  • The fifth step, view the generated data
    insert image description here

6.3 Qimo data collection

Because Qimo has a large number of highly active users, the chat data volume is relatively large (daily increment: 25 GB to 30 GB) and must be collected in real time. The framework chosen here is Apache Flume.

6.3.1 What is Apache Flume

Apache Flume is a highly available, highly reliable, distributed software for massive log collection, aggregation and transmission, originally provided by Cloudera. Website: http://flume.apache.org/
insert image description here

The core of Flume is to collect data from a data source (source) and deliver it to a specified destination (sink). To make sure delivery succeeds, the data is first buffered (channel) before being sent to the sink; only after the data has truly arrived at the sink does Flume delete its own buffered copy.

insert image description here

Flume currently has two major versions:
  • Flume 0.9X versions are collectively referred to as Flume OG (original generation)
  • Flume 1.X versions are collectively referred to as Flume NG (next generation)
    Flume NG went through a reconstruction of its core components, core configuration and code architecture, so it is very different from Flume OG. Another reason for the change is that Flume was brought under the Apache umbrella, and Cloudera Flume was renamed Apache Flume.

6.3.2 Apache Flume operating mechanism

The core role in the Flume system is the agent, which is itself a Java process and generally runs on a log collection node.
insert image description here

Each agent is essentially a data forwarder with three components inside:

  • Source: the collection source, used to connect to the data source and obtain data;
  • Sink: the sink, the destination for the collected data, used to deliver data to the next-level agent or to the final storage system;
  • Channel: the data transmission channel inside the agent, used to move data from the source to the sink.
    Throughout the transmission process, what flows is the event, the most basic unit of data transfer inside Flume.
    insert image description here

An event encapsulates the data being transmitted; for a text file it is usually one line (one record). The event is also the basic unit of a transaction: it flows from the source to the channel and then to the sink. An event is a byte array that can carry headers (header information), and it represents the smallest complete unit of data from an external data source to an external destination.
insert image description here

A complete event consists of the event headers and the event body, where the event body is the log record collected by Flume.
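
Conceptually, an event is just a set of headers plus a byte-array body. The sketch below is only an illustration (Flume's real event type is the Java interface org.apache.flume.Event with getHeaders/getBody methods); the case class name is made up:

// Conceptual sketch of a Flume event: header key/value pairs plus a byte-array body.
case class FlumeEventSketch(headers: Map[String, String], body: Array[Byte])

// One collected log line becomes the body; headers carry metadata such as the "type" header set in 6.3.5.
val event = FlumeEventSketch(Map("type" -> "7mo"), "one log record".getBytes("UTF-8"))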

6.3.3 Apache Flume installation and deployment

Installing Apache Flume is very simple: just unzip the archive and then configure the JDK environment variable.
The first step, upload and decompress

# 上传
cd /export/software
rz apache-flume-1.9.0-bin.tar.gz

# 解压,重命名及创建软链接
tar -zxf apache-flume-1.9.0-bin.tar.gz -C /export/server

cd /export/server
mv apache-flume-1.9.0-bin flume-1.9.0-bin
ln -s flume-1.9.0-bin flume

The second step, modify flume-env.sh

cd /export/server/flume/conf
mv flume-env.sh.template  flume-env.sh

vim flume-env.sh
# 22行:修改JDK路径
export JAVA_HOME=/export/server/jdk

insert image description here

6.3.4 Apache Flume Getting Started Program

Requirement description: Listen to a certain port number (for example: 44444) on the server, and collect data sent to this port.
insert image description here

  • Step 1. Determine the three major components
  1. Source component: a component (a network component) that can listen on the port number is required.
    Provided by Apache Flume: NetCat TCP Source
    insert image description here

  2. Channel component: a fast transmission pipeline (a memory component) is needed.
    Provided by Apache Flume: Memory Channel

  3. Sink component: here the data only needs to be printed out (a log component).
    Provided by Apache Flume: Logger Sink

  • Step 2. Write the collection configuration file: netcat_source_logger_sink.properties
cd /export/server/flume/conf
vim netcat_source_logger_sink.properties

The content is as follows:

# 第一部分: 定义这个agent中各组件的名字
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#第二部分:  描述和配置source组件:r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = node1.itcast.cn
a1.sources.r1.port = 44444

# 第三部分: 描述和配置sink组件:k1
a1.sinks.k1.type = logger

# 第四部分: 描述和配置channel组件,此处使用是内存缓存的方式
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 第五部分: 描述和配置source  channel   sink之间的连接关系
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1
  • Step 3, start flume: specify the acquisition configuration file
/export/server/flume/bin/flume-ng agent -n a1  \
-c conf -f /export/server/flume/conf/netcat_source_logger_sink.properties \
-Dflume.root.logger=INFO,console

Parameter description:
  -c conf   specifies the directory containing Flume's own configuration files
  -f /export/server/flume/conf/netcat_source_logger_sink.properties   specifies the collection configuration we wrote
  -n a1     specifies the name of this agent
  • Step 4, test the connection: the agent must be started first; then send data to the port the agent is listening on so that it has data to collect.
  1. Install telnet
yum -y install telnet
  2. On any machine that can communicate with the agent node, execute the following command
telnet node1.itcast.cn  44444

insert image description here
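
If telnet is not available, the same test can also be driven from Scala. The sketch below is not part of the course material (the object name NetcatSourceTest is invented); it opens a TCP connection to the NetCat source and sends one line, which should then appear in the Logger Sink output:

import java.io.PrintWriter
import java.net.Socket

object NetcatSourceTest extends App {
  // Host and port must match a1.sources.r1.bind / a1.sources.r1.port in the configuration above.
  val socket = new Socket("node1.itcast.cn", 44444)
  val out = new PrintWriter(socket.getOutputStream, true)
  out.println("hello flume")   // one test line for the agent to collect
  out.close()
  socket.close()
}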

6.3.5 Qimo Social Data Collection

Characteristics of the Qimo social data source: messages are continuously written to files in a directory. Functional requirement: monitor the files in that directory in real time and collect each new file into Kafka as soon as it appears.
insert image description here

  • Step 1. Determine the three major components
  1. Source component: a file source component that can monitor a directory.
    Provided by Apache Flume: Taildir Source
  2. Channel component: a memory component is generally chosen (more efficient).
    Provided by Apache Flume: Memory Channel
  3. Sink component: a sink component that writes to Kafka.
    Provided by Apache Flume: Kafka Sink
  • Step 2. Write the collection configuration file: 7mo_mem_kafka.properties
vim /export/server/flume/conf/7mo_mem_kafka.properties

The content is as follows:

# define a1
a1.sources = s1 
a1.channels = c1
a1.sinks = k1

#define s1
a1.sources.s1.type = TAILDIR
#指定一个元数据记录文件
a1.sources.s1.positionFile = /export/server/flume/position/taildir_7mo_kafka.json
#将所有需要监控的数据源变成一个组
a1.sources.s1.filegroups = f1
#指定了f1是谁:监控目录下所有文件
a1.sources.s1.filegroups.f1 = /export/data/7mo_data/.*
#指定f1采集到的数据的header中包含一个KV对
a1.sources.s1.headers.f1.type = 7mo
a1.sources.s1.fileHeader = true

#define c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

#define k1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = 7MO-MSG
a1.sinks.k1.kafka.bootstrap.servers = node1.itcast.cn:9092
a1.sinks.k1.kafka.flumeBatchSize = 10
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 100

#bind
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
  • Step 3, start ZK service and Kafka service
/export/server/zookeeper/bin/zkServer.sh start 
/export/server/kafka/bin/kafka-server-start.sh -daemon /export/server/kafka/config/server.properties
  • Step 4, create a topic
/export/server/kafka/bin/kafka-topics.sh --create \
--topic 7MO-MSG  --partitions 3 --replication-factor 2 \
--bootstrap-server node1.itcast.cn:9092
  • Step 5, start flume: specify the acquisition configuration file
/export/server/flume/bin/flume-ng agent \
-n a1 -c /export/server/flume/conf/ \
-f /export/server/flume/conf/7mo_mem_kafka.properties \
-Dflume.root.logger=INFO,console
  • Step 6. Start the simulated data
java -jar /export/data/7mo_init/7Mo_DataGen.jar \
/export/data/7mo_init/7Mo_Data.xlsx \
/export/data/7mo_data \
5000

Check whether the Kafka topic now contains data:
insert image description here
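
Besides checking with the Kafka command-line tools, the topic can also be spot-checked from spark-shell. This is only a sketch, assuming the spark-sql-kafka-0-10 connector (added to the module's pom.xml in 6.4.1) is on the classpath and that spark is the spark-shell session:

// Batch-read the 7MO-MSG topic from the beginning and show a few raw records.
val checkDF = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "node1.itcast.cn:9092")
  .option("subscribe", "7MO-MSG")
  .option("startingOffsets", "earliest")
  .load()

checkDF.selectExpr("CAST(value AS STRING)").show(10, truncate = false)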

6.4 Real-time storage of Qimo data

Write a Structured Streaming program in Spark that consumes the social data from Kafka in real time, converts it (extracting data fields, etc.), and finally saves it into a Hudi table; the table type is MOR (Merge On Read).
insert image description here

6.4.1 Creating modules

Create a Maven module, write the program based on the Spark framework, and add the related dependencies. The project structure is as follows:
insert image description here

The dependencies in the module's pom.xml:

<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.12.10</scala.version>
    <scala.binary.version>2.12</scala.binary.version>
    <spark.version>3.0.0</spark.version>
    <hadoop.version>2.7.3</hadoop.version>
    <hudi.version>0.9.0</hudi.version>
    <mysql.version>5.1.48</mysql.version>
</properties>

<dependencies>
    <!-- 依赖Scala语言 -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>

    <!-- Spark Core 依赖 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL 依赖 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Structured Streaming + Kafka  依赖 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>

    <!-- Hadoop Client 依赖 -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <!-- hudi-spark3 -->
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-spark3-bundle_2.12</artifactId>
        <version>${hudi.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- hudi-spark3 -->
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-hive-sync</artifactId>
        <version>${hudi.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.4.13</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.12</version>
    </dependency>

    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>ip2region</artifactId>
        <version>1.7.2</version>
    </dependency>

    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
    </dependency>

</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven 编译的插件 -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

The Hudi table data is stored in an HDFS directory, so the HDFS file system configuration files are placed in the module's resources directory.
insert image description here

6.4.2 Encapsulating Entity Classes

The Qimo social data is encapsulated in an entity class, MomoMessage, defined as a Scala case class.

package cn.itcast.hudi.momo

/**
 * 封装Momo聊天记录实体样例类CaseClass
 */
case class MomoMessage(
                         msg_time: String,
                         sender_nickyname: String,
                         sender_account: String,
                         sender_sex: String,
                         sender_ip: String,
                         sender_os: String,
                         sender_phone_type: String,
                         sender_network: String,
                         sender_gps: String,
                         receiver_nickyname: String,
                         receiver_ip: String,
                         receiver_account: String,
                         receiver_os: String,
                         receiver_phone_type: String,
                         receiver_network: String,
                         receiver_gps: String,
                         receiver_sex: String,
                         msg_type: String,
                         distance: String,
                         message: String
                      )

Later, the social data consumed from Kafka will be parsed and encapsulated into objects of this entity class.
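
As a quick illustration, one raw record can be split on \001 and mapped onto MomoMessage as shown below. This is only a sketch with invented placeholder values, not part of the course code; the real parsing happens inside the streaming job in section 6.4.3.4:

// A fake record: 20 placeholder fields joined by the \001 delimiter (\u0001 escape).
val rawLine: String = (1 to 20).map(i => s"field$i").mkString("\u0001")

val array = rawLine.split("\u0001")
val message: MomoMessage = MomoMessage(
  array(0), array(1), array(2), array(3), array(4), array(5), array(6), array(7),
  array(8), array(9), array(10), array(11), array(12), array(13), array(14),
  array(15), array(16), array(17), array(18), array(19)
)
println(message.msg_time)   // "field1" for this placeholder record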

6.4.3 Writing streaming programs

Create an object MomoStreamHudi, write the main method following the five steps of writing a streaming program, and lay out the code structure as shown below:

package cn.itcast.hudi.momo

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StringType

/**
 * 编写StructuredStreaming流式程序:
实时消费Kafka中Momo聊天数据,进行转换处理,保存至Hudi表,并且自动同步至Hive表
 */
object MomoStreamHudi {
    
    
   
   def main(args: Array[String]): Unit = {
    
    
      // step1、构建SparkSession实例对象
      val spark: SparkSession = createSparkSession(this.getClass)
      
      // step2、从Kafka实时消费数据
      val kafkaStreamDF: DataFrame = readFromKafka(spark, "7mo-msg")
      
      // step3、提取数据,转换数据类型
      val streamDF: DataFrame = process(kafkaStreamDF)
      
      // step4、保存数据至Hudi表中:MOR(读取时保存)
      //printToConsole(streamDF)
      saveToHudi(streamDF)
      
      // step5、流式应用启动以后,等待终止
      spark.streams.active.foreach(query =>
         println(s"Query: ${query.name} is Running .............")
      )
      spark.streams.awaitAnyTermination()
   }

}

6.4.3.1 Construct SparkSession instance object

Starting from Spark 2.x, the program entry point is SparkSession; whether it is SparkSQL batch processing or Structured Streaming stream computation, the program first creates a SparkSession object. This is encapsulated in the method createSparkSession:

/**
 * 创建SparkSession会话实例对象,基本属性设置
 */
def createSparkSession(clazz: Class[_]): SparkSession = {
    
    
   SparkSession.builder()
      .appName(this.getClass.getSimpleName.stripSuffix("$"))
      .master("local[2]")
      // 设置序列化方式:Kryo
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // 设置属性:Shuffle时分区数和并行度
      .config("spark.default.parallelism", 2)
      .config("spark.sql.shuffle.partitions", 2)
          .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
      .getOrCreate()
}

6.4.3.2 Consuming Kafka data

Encapsulated in the method readFromKafka: consume data from the Kafka topic, specifying the topic name and the Kafka broker addresses.

/**
 * 指定Kafka Topic名称,实时消费数据
 */
def readFromKafka(spark: SparkSession, topicName: String): DataFrame = {
    
    
   spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "node1.itcast.cn:9092")
      .option("subscribe", topicName)
      .option("startingOffsets", "latest")
      .option("maxOffsetsPerTrigger", 100000)
      .option("failOnDataLoss", "false")
      .load()
}

6.4.3.3 Print Console

Printing the streaming data to the console is encapsulated in the method printToConsole; it is convenient for testing during development.

def printToConsole(streamDF: DataFrame): Unit = {
    
    
   streamDF.writeStream
      .outputMode(OutputMode.Append())
      .queryName("query-hudi-momo")
          .format("console")
          .option("numRows", "10")
          .option("truncate", "false")
      .option("checkpointLocation", "/datas/hudi-struct-ckpt-0")
          .start()
}

6.4.3.4 Data Analysis and Transformation

For the data consumed from Kafka, first parse and encapsulate it into the entity class MomoMessage, then add fields to construct the three core fields of the Hudi table: message_id (the primary key of each record), day (the partition field) and ts (the data merge / pre-combine field).

/**
 * 对Kafka获取数据,进行转换操作,获取所有字段的值,转换为String,以便保存Hudi表
 */
def process(streamDF: DataFrame): DataFrame = {
    
    
   import streamDF.sparkSession.implicits._
   
   /*
      2021-11-25 20:52:58牛星海17870843110女156.35.36.204IOS 9.0华为 荣耀Play4T4G91.319474,29.033363成紫57.54.100.313946849234Android 6.0OPPO A11X4G84.696447,30.573691 女TEXT78.22KM有一种想见不敢见的伤痛,这一种爱还埋藏在我心中,让我对你的思念越来越浓,我却只能把你你放在我心中。
    */
   // 1-提取Message消息数据
   val messageStreamDF: DataFrame = streamDF.selectExpr("CAST(value AS STRING) message")
   
   // 2-解析数据,封装实体类
   val momoStreamDS: Dataset[MomoMessage] = messageStreamDF
      .as[String] // 转换为Dataset
      .map(message => {
    
    
         val array = message.split("\001")
         val momoMessage = MomoMessage(
            array(0), array(1), array(2), array(3), array(4), array(5), array(6), array(7),
            array(8), array(9), array(10), array(11), array(12), array(13), array(14),
            array(15), array(16), array(17), array(18), array(19)
         )
         // 返回实体类
         momoMessage
      })
   
   // 3-为Hudi表添加字段:主键id、数据聚合字段ts、分区字段day
   val hudiStreamDF = momoStreamDS.toDF()
      .withColumn("ts", unix_timestamp($"msg_time").cast(StringType))
      .withColumn(
         "message_id",
         concat($"sender_account", lit("_"), $"ts", lit("_"), $"receiver_account")
      )
      .withColumn("day", substring($"msg_time", 0, 10))
   
   hudiStreamDF
}

6.4.3.5 Saving the Hudi table

Use the foreachBatch method to save each micro-batch of the streaming DataFrame to the Hudi table; the necessary Hudi properties must be specified.

/**
 * 将流式数据集DataFrame保存至Hudi表,分别表类型:COW和MOR
 */
def saveToHudi(streamDF: DataFrame): Unit = {
    
    
   streamDF.writeStream
      .outputMode(OutputMode.Append())
      .queryName("query-hudi-momo")
      // 针对每微批次数据保存
      .foreachBatch((batchDF: Dataset[Row], batchId: Long) => {
    
    
         println(s"============== BatchId: $batchId start ==============")
         
         import org.apache.hudi.DataSourceWriteOptions._
         import org.apache.hudi.config.HoodieWriteConfig._
         import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
         
         batchDF.write
            .format("hudi")
            .mode(SaveMode.Append)
            .option(TBL_NAME.key, "7mo_msg_hudi")
            .option(TABLE_TYPE.key(), "MERGE_ON_READ")
            .option(RECORDKEY_FIELD_NAME.key(), "message_id")
            .option(PRECOMBINE_FIELD_NAME.key(), "ts")
            .option(PARTITIONPATH_FIELD_NAME.key(), "day")
            .option(HIVE_STYLE_PARTITIONING_ENABLE.key(), "true")
            // 插入数据,产生shuffle时,分区数目
            .option("hoodie.insert.shuffle.parallelism", "2")
            .option("hoodie.upsert.shuffle.parallelism", "2")
            // 表数据存储路径
            .save("/hudi-warehouse/7mo_msg_hudi")
      })
      .option("checkpointLocation", "/datas/hudi-struct-ckpt")
      .start()
}

At this point the Structured Streaming program is written; next, start each component service and test it.

6.4.4 Streaming program running

Start the services (ZooKeeper, Kafka and HDFS), then run the streaming application, and finally run the Flume agent and the data simulation program; afterwards, check the Hudi table storage directory.

# NameNode和DataNode
hadoop-daemon.sh start namenode 
hadoop-daemon.sh start datanode

# ZK服务和Kafka服务
/export/server/zookeeper/bin/zkServer.sh start 
/export/server/kafka/bin/kafka-server-start.sh -daemon /export/server/kafka/config/server.properties

# Flume Agent
/export/server/flume/bin/flume-ng agent \
-c conf/ \
-n a1 \
-f /export/server/flume/conf/7mo_mem_kafka.properties \
-Dflume.root.logger=INFO,console

# 模拟数据程序
java -jar /export/data/7mo_init/7Mo_DataGen.jar \
/export/data/7mo_init/7Mo_Data.xlsx \
/export/data/7mo_data/ \
5000

Hudi storage directory structure:
insert image description here
At this point, the entire pipeline for storing Qimo social data into the Hudi table in real time is complete:
insert image description here
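
Before moving on, the ingested data can be sanity-checked from spark-shell. This is only a sketch, assuming the Hudi Spark bundle is on the classpath and that spark is the spark-shell session:

// Load the Hudi table by its storage path and inspect a few of the fields written by the streaming job.
val msgDF = spark.read.format("hudi").load("/hudi-warehouse/7mo_msg_hudi")
println(s"records so far: ${msgDF.count()}")
msgDF.select("message_id", "day", "ts").show(5, truncate = false)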

6.5 Integrated Hive indicator analysis

Associate the Hudi table data with the Hive table, and use clients such as beeline to write SQL to analyze the Hudi table data.
insert image description here

6.5.1 Create Hive table

Start the Hive MetaStore service and HiveServer2 service, and then start the beeline client:

/export/server/hive/bin/start-metastore.sh
/export/server/hive/bin/start-hiveserver2.sh

/export/server/hive/bin/start-beeline.sh

insert image description here
Write a DDL statement to create the Hive table, map it to the Hudi table location, and set the InputFormat implementation class.

# 创建Hive表,映射到Hudi表
CREATE EXTERNAL TABLE db_hudi.tbl_7mo_hudi(
  msg_time             String,
  sender_nickyname     String,
  sender_account       String,
  sender_sex           String,
  sender_ip            String,
  sender_os            String,
  sender_phone_type    String,
  sender_network       String,
  sender_gps           String,
  receiver_nickyname   String,
  receiver_ip          String,
  receiver_account     String,
  receiver_os          String,
  receiver_phone_type  String,
  receiver_network     String,
  receiver_gps         String,
  receiver_sex         String,
  msg_type             String,
  distance             String,
  message              String,
  message_id           String,
  ts                   String       
)
PARTITIONED BY (day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/hudi-warehouse/7mo_msg_hudi' ;

Since the Hudi table is partitioned, the partition information needs to be added manually:

alter table db_hudi.tbl_7mo_hudi 
add if not exists partition(day = '2021-11-27') location '/hudi-warehouse/7mo_msg_hudi/day=2021-11-27' ;

insert image description here

Query the first 10 records of the Hive table:

SELECT
  msg_time, sender_nickyname, receiver_nickyname, ts 
FROM db_hudi.tbl_7mo_hudi 
WHERE day = '2021-11-27'
limit 10 ;

insert image description here

6.5.2 Analysis of business indicators

Write SQL to perform simple indicator analysis on the Qimo social data. Since the data volume is small, enable local mode for execution.

set hive.exec.mode.local.auto=true;
set hive.mapred.mode=nonstrict;
  • Indicator 1: total message volume
WITH tmp AS (
  SELECT COUNT(1) AS momo_total  FROM db_hudi.tbl_7mo_hudi WHERE day = '2021-11-27'
)
SELECT "全国" AS momo_name, momo_total FROM tmp;

insert image description here

  • Indicator 2: messages sent by each user
WITH tmp AS (
  SELECT 
    sender_nickyname, COUNT(1) momo_total 
  FROM db_hudi.tbl_7mo_hudi 
  WHERE day = '2021-11-27' GROUP BY sender_nickyname
)
SELECT 
  sender_nickyname AS momo_name, momo_total
FROM tmp 
ORDER BY momo_total DESC LIMIT 10;

insert image description here

  • Indicator 3: messages received by each user
WITH tmp AS (
  SELECT 
    receiver_nickyname, COUNT(1) momo_total 
  FROM db_hudi.tbl_7mo_hudi 
  WHERE day = '2021-11-27' GROUP BY receiver_nickyname
)
SELECT 
  receiver_nickyname AS momo_name, momo_total  
FROM tmp 
ORDER BY momo_total DESC LIMIT 10;

insert image description here

  • Indicator 4: message volume by sender and receiver gender
SELECT 
  sender_sex, receiver_sex, COUNT(1) momo_total 
FROM db_hudi.tbl_7mo_hudi 
WHERE day = '2021-11-27' GROUP BY sender_sex, receiver_sex;

insert image description here

6.6 Spark offline indicator analysis

Write a SparkSQL program, load the Hudi table data and encapsulate it into a DataFrame, write SQL to analyze the data according to the needs of business indicators, and finally save it in the MySQL database table. The flow diagram is as follows:
insert image description here

6.6.1 Requirements Description

For the statistical analysis of the Qimo social message data, the following requirements are defined:

  • 1) Count the total number of messages
  • 2) Count the number of messages sent and received by each region (province), resolved from the IP address
  • 3) Count the number of messages sent and received by each user in the Qimo social messages
    insert image description here

6.6.2 Creating database tables

Store the final results of the above business requirements in a single table of the MySQL database: 7mo.7mo_report.
insert image description here

Among them, the field: 7mo_category indicates the indicator type:

  • 1: Indicates national information volume statistics
  • 2: Indicates the statistics of the amount of information sent by each province
  • 3: Indicates the statistics of the amount of information received by each province
  • 4: Indicates the statistics of the amount of information sent by the user
  • 5: Indicates the statistics of the amount of information received by the user
    In the MySQL database, create a database 7mo and a table 7mo_report; the corresponding DDL statements are as follows:
-- 创建数据库
CREATE DATABASE IF NOT EXISTS 7mo ;
-- 创建表
CREATE TABLE IF NOT EXISTS `7mo`.`7mo_report` (
    `7mo_name` varchar(100) NOT NULL,
    `7mo_total` bigint(20) NOT NULL,
    `7mo_category` varchar(100) NOT NULL,
    PRIMARY KEY (`7mo_name`, `7mo_category`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ;

6.6.3 Writing an indicator analysis program

Create an object MomoSQLHudi, write the main method following the usual steps of a SparkSQL program, and lay out the code structure as follows:

package cn.itcast.hudi.momo

import org.apache.spark.sql.{DataFrame, Dataset, Row, SaveMode, SparkSession}
import org.lionsoul.ip2region.{DataBlock, DbConfig, DbSearcher}

/**
 * 编写SparkSQL程序,基于DSL和SQL分析Hudi表数据,最终保存值MySQL数据库表中
 */
object MomoSQLHudi {
    
    
   
   def main(args: Array[String]): Unit = {
    
    
      // step1、构建SparkSession实例对象
      val spark: SparkSession = createSparkSession(this.getClass)
      
      // step2、加载Hudi表数据,指定Hudi数据存储路径
      val hudiDF: DataFrame = loadHudiTable(spark, "/hudi-warehouse/7mo_msg_hudi")
      //println(s"Count = ${hudiDF.count()}")
      //hudiDF.printSchema()
      //hudiDF.show(numRows = 10, truncate = false)
      
      // step3、数据ETL转换:提取字段,解析IP为省份和城市
      val etlDF: DataFrame = etl(hudiDF)
      //println(s"Count = ${etlDF.count()}")
      //etlDF.printSchema()
      //etlDF.show(numRows = 100, truncate = false)
      
      // step4、业务指标分析
      process(etlDF)
      
      // 应用结束,关闭资源
      spark.stop()
   }
}

The SparkSession object is created with the method createSparkSession, the same as in the earlier real-time storage program.

6.6.3.1 Loading Hudi table data

Use the Spark DataSource external data source API to load the Hudi table data, specifying the data storage path; this is encapsulated in the method loadHudiTable.

/**
 * 指定Hudi表数据存储path,加载Hudi表数据,返回DataFrame
 */
def loadHudiTable(spark: SparkSession, tablePath: String): DataFrame = {
    
    
   val dataframe = spark.read
      .format("hudi")
      .load(tablePath)
   
   // 返回数据
   dataframe
}

6.6.3.2 Parse IP address and select fields

To resolve an IP address to a province, the third-party library ip2region is recommended (project URL: https://gitee.com/lionsoul/ip2region/). To use the ip2region library:

  • The first step, copy the IP dataset [ip2region.db] to the [dataset] directory under the project
    insert image description here

  • The second step, add dependencies in Maven

<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>ip2region</artifactId>
    <version>1.7.2</version>
</dependency>
  • The third step, use ip2region (a standalone sketch follows this list)
    insert image description here
    A custom UDF function is used: it takes the IP address, parses it, and returns the province:
    insert image description here
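
Since the screenshots are not reproduced here, the standalone sketch below shows the same ip2region calls that the etl method in the next section relies on. It assumes dataset/ip2region.db sits under the project working directory; the object name Ip2RegionDemo is invented, and the sample IP comes from the test data shown earlier:

import org.lionsoul.ip2region.{DataBlock, DbConfig, DbSearcher}

object Ip2RegionDemo extends App {
  // Build the searcher against the local ip2region.db file.
  val dbSearcher = new DbSearcher(new DbConfig(), "dataset/ip2region.db")

  // Look up one IP address; the region string has the form 国家|区域|省份|城市|ISP,
  // e.g. 中国|0|海南省|海口市|教育网, and the province is the third part.
  val dataBlock: DataBlock = dbSearcher.btreeSearch("156.35.36.204")
  val region: String = dataBlock.getRegion
  val province: String = region.split("\\|")(2)
  println(s"$region -> $province")
}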

In addition to parsing the IP address into a province, the fields involved in the business requirements also need to be selected. This is encapsulated in the method etl; the code is as follows:

/**
 * 提取字段数据和转换经纬度为省份城市
 */
def etl(dataframe: DataFrame): DataFrame = {
    
    
   val session: SparkSession = dataframe.sparkSession
   
   // 1-自定义UDF函数,解析IP地址为省份和城市
   session.udf.register(
      "ip_to_province",
      (ip: String) => {
    
    
         // 构建DbSearch对象
         val dbSearcher = new DbSearcher(new DbConfig(), "dataset/ip2region.db")
         
         // 依据IP地址解析
         val dataBlock: DataBlock = dbSearcher.btreeSearch(ip)
         // 中国|0|海南省|海口市|教育网
         val region: String = dataBlock.getRegion
         // 分割字符串,获取省份和城市
         val Array(_, _, province, _, _) = region.split("\\|")
         // 返回Region对象
         province
      }
   )
   
   // 2-提取字段和解析IP
   dataframe.createOrReplaceTempView("view_tmp_momo")
   val etlDF: DataFrame = session.sql(
      """
        |SELECT
        |  day, sender_nickyname, receiver_nickyname,
        |  ip_to_province(sender_ip) AS sender_province,
        |  ip_to_province(receiver_ip) AS receiver_province
        |FROM
        |  view_tmp_momo
        |""".stripMargin
   )
   
   // 返回结果数据
   etlDF
}

6.6.3.3 Analysis of business indicators

Register DataFrame as a temporary view, write SQL statements for analysis, and finally merge and save all indicator results.

/**
 * 按照业务指标分析数据
 */
def process(dataframe: DataFrame): Unit = {
    
    
   val session: SparkSession = dataframe.sparkSession
   
   // 1-将DataFrame注册为临时视图
   dataframe.createOrReplaceTempView("view_tmp_etl")
   // 2-指标1:统计总消息量
   val reportAllTotalDF: DataFrame = session.sql(
      """
        |WITH tmp AS (
        |  SELECT COUNT(1) AS 7mo_total  FROM view_tmp_etl
        |)
        |SELECT "全国" AS 7mo_name, 7mo_total, "1" AS 7mo_category FROM tmp;
        |""".stripMargin
   )
   // 2-指标2:统计各省份发送消息量
   val reportSenderProvinceTotalDF: DataFrame = session.sql(
      """
        |WITH tmp AS (
        |  SELECT sender_province, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY sender_province
        |)
        |SELECT sender_province AS 7mo_name, 7mo_total, "2" AS 7mo_category FROM tmp;
        |""".stripMargin
   )
   // 2-指标3:统计各省份接收消息量
   val reportReceiverProvinceTotalDF: DataFrame = session.sql(
      """
        |WITH tmp AS (
        |  SELECT receiver_province, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY receiver_province
        |)
        |SELECT receiver_province AS 7mo_name, 7mo_total, "3" AS 7mo_category FROM tmp;
        |""".stripMargin
   )
   // 2-指标4:统计各个用户, 发送消息量
   val reportSenderNickyNameTotalDF: DataFrame = session.sql(
      """
        |WITH tmp AS (
        |  SELECT sender_nickyname, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY sender_nickyname
        |)
        |SELECT sender_nickyname AS 7mo_name, 7mo_total, "4" AS 7mo_category FROM tmp;
        |""".stripMargin
   )
   // 2-指标5:统计各个用户, 接收消息量
   val reportReceiverNickyNameTotalDF: DataFrame = session.sql(
      """
        |WITH tmp AS (
        |  SELECT receiver_nickyname, COUNT(1) AS 7mo_total FROM view_tmp_etl GROUP BY receiver_nickyname
        |)
        |SELECT receiver_nickyname AS 7mo_name, 7mo_total, "5" AS 7mo_category FROM tmp;
        |""".stripMargin
   )
   // 3-保存报表至MySQL数据库
   val reportTotalDF: Dataset[Row] = reportAllTotalDF
      .union(reportSenderProvinceTotalDF)
      .union(reportReceiverProvinceTotalDF)
      .union(reportSenderNickyNameTotalDF)
      .union(reportReceiverNickyNameTotalDF)
   // reportTotalDF.show(500, truncate = false)
   reportTotalDF
      .coalesce(1)
          .write
          .mode(SaveMode.Append)
          .format("jdbc")
          .option("driver", "com.mysql.jdbc.Driver")
          .option("url", 
"jdbc:mysql://node1.itcast.cn:3306/?useUnicode=true&characterEncoding=utf-8&useSSL=false")
          .option("dbtable", "7mo.7mo_report")
          .option("user", "root")
          .option("password", "123456")
          .save()
}

Here, SparkSQL's external JDBC data source is used directly to save the results into the MySQL table.
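
As an optional spot check (a sketch, not part of the course code, run for example from spark-shell where spark is the session), the saved report can be read back from MySQL with the same JDBC options and printed:

val reportCheckDF = spark.read
  .format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", "jdbc:mysql://node1.itcast.cn:3306/?useUnicode=true&characterEncoding=utf-8&useSSL=false")
  .option("dbtable", "7mo.7mo_report")
  .option("user", "root")
  .option("password", "123456")
  .load()

reportCheckDF.show(20, truncate = false)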

6.6.4 Run the report program

After the development is completed, the Spark program loads the Hudi table data, calculates according to the business indicators, and stores the results in the MySQL database.

  • View MySQL database table data
    insert image description here

  • Query the top 5 rows for each indicator

(SELECT 7mo_name, 7mo_total, "全国总信息量" AS "7mo.category"
FROM 7mo.7mo_report WHERE 7mo_category = 1)
UNION
(SELECT 7mo_name, 7mo_total, "省份发送信息量" AS "7mo.category"
FROM 7mo.7mo_report WHERE 7mo_category = 2 ORDER BY 7mo_total DESC LIMIT 5)
UNION
(SELECT 7mo_name, 7mo_total, "省份接收信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 3 ORDER BY 7mo_total DESC LIMIT 5)
UNION
(SELECT 7mo_name, 7mo_total, "用户发送信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 4 ORDER BY 7mo_total DESC LIMIT 5)
UNION
(SELECT 7mo_name, 7mo_total, "用户接收信息量" AS "7mo.category"
 FROM 7mo.7mo_report WHERE 7mo_category = 5 ORDER BY 7mo_total DESC LIMIT 5);

insert image description here

6.7 FineBI report visualization

Use FineBI to connect to the MySQL database, load the business indicator report data, and display it in different charts.
insert image description here

6.7.1 Install FineBI

FineBI is a Business Intelligence product launched by Fanruan Software Co., Ltd. It is positioned as a self-service big data analysis tool that helps enterprise business staff and data analysts carry out problem-oriented exploratory analysis. Official website: https://www.finebi.com/
insert image description here

For FineBI installation, refer to the "FineBI Windows Edition Installation Manual". After the installation is complete, start FineBI, log in, and get familiar with the basic pages.

  • start login
    insert image description here

  • Directory: home page and help documentation
    insert image description here
    insert image description here

  • Dashboard: used to build all visual reports
    insert image description here

  • Data preparation: used to configure data sources for the various reports
    insert image description here

  • Management system: used to administer the entire FineBI installation: user management, data source management, plug-in management, permission management, etc.
    insert image description here

6.7.2 Configuring Data Sources

Create a MySQL database connection: [Management System] -> [Data Connection] -> [Data Connection Management]
insert image description here
insert image description here

Fill in the MySQL database connection information:

Data connection name: node1-mysql
Username: root
Password: 123456
Data connection URL: jdbc:mysql://node1.itcast.cn:3306/7mo?useUnicode=true&characterEncoding=utf8

insert image description here
insert image description here

6.7.3 Add dataset

Add the business report table from the MySQL database, 7mo_report: select [Data Preparation], then add a group [Qimo Data] and a business package [Qimo Report].
insert image description here

Click into [Qimo Report], add a table, and use the [SQL Data Set] method:
insert image description here

Enter table name and SQL statement

SELECT
  7mo_name, 7mo_total,
  CASE 7mo_category
      WHEN '1' THEN '总消息量'
      WHEN '2' THEN '各省份发送量'
      WHEN '3' THEN '各省份接收量'
      WHEN '4' THEN '各用户发送量'
      WHEN '5' THEN '各用户接收量'
  END AS 7mo_category
FROM 7mo.7mo_report 

insert image description here

6.7.4 Create a dashboard

First build a dashboard with the name: [Qimo Social Data Statistical Report], as shown in the figure below:
insert image description here
Next, select a template style for the dashboard [preset style 5]: dark blue ocean background.
insert image description here

  • First add the title: [Others] -> [Text Components]
    insert image description here

Enter the name of the dashboard: Qimo Social Data Statistical Report
insert image description here

  • Second, add a text component to display the total number of messages
insert image description here

Select the table added earlier: 7mo_report_mysql
insert image description here
As shown in the figure below, select the field values and filter by field category
insert image description here

6.7.5 Column Chart: Top 10 Users by Messages Sent

The top 10 users who send the most messages are displayed in a column chart.

  • Step 1, add components, select [column chart], fill in the title name.
    insert image description here
  • Step 2. Select the relevant fields, and set up filtering and display
    insert image description here
    The data displayed here is the statistics of the message volume sent by each user.
    insert image description here

In addition, only the Top 10 users with the largest sending volume should be displayed, so a filter is required.
insert image description here
When displaying the column chart, sort in descending order by the volume of messages sent.
insert image description here

6.7.6 Pie Chart: Top 10 Provinces by Messages Sent

In the form of a pie chart, the amount of information sent by the Top 10 provinces is displayed. The specific operation is as follows:

  • Step 1. Add a component, select [Pie Chart], and fill in the title name.
    insert image description here

  • Step 2. Select the relevant fields, and set up filtering and display
    insert image description here
    Filter to obtain the statistics of the message volume sent by each province.
    insert image description here
    In addition, keep only the Top 10 provinces with the largest sending volume:
    insert image description here

In the pie chart above, the values are displayed on the outer ring; the settings are as follows:
insert image description here

6.7.7 Map: Message Volume by Province

In the form of a map, the amount of information sent by each province is displayed. The specific operation is as follows:

  • Step 1. Add a component, select [Area Map], and fill in the title name.
    insert image description here
  • Step 2. Select the province field to map to the geographic role
    insert image description here
  • Step 3. Select different fields, set related filtering and display
    insert image description here

Filter to obtain the statistics of the message volume sent by each province.
insert image description here

Origin blog.csdn.net/toto1297488504/article/details/132257418