E-mail of the author of the article: [email protected] Address: Huizhou, Guangdong
▲ Purpose of this chapter
⚪ Understand data collection in the TELECOM project;
⚪ Understand data cleaning in the TELECOM project;
⚪ Understand data export in the TELECOM project;
⚪ Understand data visualization in the TELECOM project;
⚪ Learn about other aspects of the TELECOM project.
1. Data collection
1. In a real production environment, telecom traffic logs are not generated on just one server; every server produces its own logs. So we first build a Flume fan-in flow model, and then write the aggregated data to HDFS for storage.
2. Steps:
a. Create a directory for the logs on the second and third servers (here the second and third servers act as the servers where the logs are generated).
cd /home
mkdir telecomlog
b. Enter the directory and upload or download the log file into it (in production, the logs would be generated there in real time).
cd telecomlog/
# Download the sample log from the cloud host
wget http://bj-yzjd.ufile.cn-north-02.ucloud.cn/103_20150615143630_00_00_000_2.csv
c. Collect the logs on the second and third servers and forward them to the first server for fan-in.
cd /home/software/apache-flume-1.9.0-bin/data
# Edit the configuration file
vim telecomlog.conf
# Add the following content to the file
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# The logs land in a fixed directory, so monitor that directory
# for changes: when a new file appears, collect its contents
a1.sources.s1.type = spooldir
# Specify the directory to listen to
a1.sources.s1.spoolDir = /home/telecomlog
# Configure Channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# Configure the Sink: send the collected data to the first server (hadoop01)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop01
a1.sinks.k1.port = 8090
# bind
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
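In the channel settings above, `capacity` bounds the total number of events the memory channel may buffer, while `transactionCapacity` bounds how many events one put/take transaction may move, so `transactionCapacity` must not exceed `capacity`. The following toy Python model (an illustration only, not Flume's actual code) sketches this relationship:

```python
# Toy model of a Flume memory channel (illustration only, not Flume code).
# `capacity` bounds the total buffered events; `transaction_capacity` bounds
# how many events a single put or take batch may move.
from collections import deque


class MemoryChannel:
    def __init__(self, capacity, transaction_capacity):
        assert transaction_capacity <= capacity
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self.queue = deque()

    def put_batch(self, events):
        if len(events) > self.transaction_capacity:
            raise ValueError("batch exceeds transactionCapacity")
        if len(self.queue) + len(events) > self.capacity:
            # A real Flume agent would roll the transaction back here
            raise RuntimeError("channel full")
        self.queue.extend(events)

    def take_batch(self, n):
        n = min(n, self.transaction_capacity, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]


# Mirror the configured values: capacity=10000, transactionCapacity=1000
channel = MemoryChannel(capacity=10000, transaction_capacity=1000)
channel.put_batch([f"event-{i}" for i in range(1000)])
batch = channel.take_batch(1000)
```

If a sink drains more slowly than the sources fill, the channel eventually reaches `capacity` and puts start failing, which is why the buffer is sized well above the per-transaction batch.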
d. On the first server, receive the fanned-in data and write it to HDFS.
cd /home/software/apache-flume-1.9.0-bin/data/
# Edit the configuration file
vim telecomlog.conf
# Add the following content to the file
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# Receive the data sent from the second and third servers
a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
# Add a timestamp header to each event (used by the HDFS sink's date escapes)
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = timestamp
# Configure Channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# Configure the Sink
# Write the data to HDFS, storing it in a separate directory per day
a1.sinks.k1.type = hdfs
# Specify the storage path of the data on HDFS
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/telecomlog/reporttime=%Y-%m-%d
# Specify the storage type of the file on HDFS
a1.sinks.k1.hdfs.fileType = DataStream
# Roll files on a time interval only (every 3600 s); size- and count-based rolling are disabled
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# bind
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
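The timestamp interceptor on the source is what makes the `%Y-%m-%d` escape in `hdfs.path` work: the HDFS sink reads each event's `timestamp` header (milliseconds since the epoch) and expands the date escapes into the actual directory name. A minimal Python sketch of that substitution (assuming UTC for simplicity; Flume uses the agent's local timezone unless `hdfs.timeZone` is set):

```python
from datetime import datetime, timezone


def expand_hdfs_path(ts_millis,
                     pattern="hdfs://hadoop01:9000/telecomlog/reporttime=%Y-%m-%d"):
    """Mimic the HDFS sink expanding date escapes in hdfs.path using the
    event's 'timestamp' header (milliseconds since the epoch, UTC assumed)."""
    return datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).strftime(pattern)


# An event stamped during 2015-06-15 (UTC) lands in that day's directory
path = expand_hdfs_path(1434350190000)
```

Because the directory name encodes the day, each day's events accumulate under their own `reporttime=...` path, which is what lets Hive later treat `reporttime` as a partition column.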
e. Start HDFS.
start-dfs.sh
f. Start Flume on the first server.
../bin/flume-ng agent -n a1 -c ../conf -f telecomlog.conf -Dflume.root.logger=INFO,console
g. Start Flume on the second and third servers.
../bin/flume-ng agent -n a1 -c ../conf -f telecomlog.conf -Dflume.root.logger=INFO,console
2. Data cleaning
1. Flume has now collected the data onto HDFS, so the next step is to create a table in Hive to manage the raw data.
# start YARN
start-yarn.sh
# Enter the lib directory under the HBase installation directory
cd /home/software/hbase-2.4.2/lib
# Enter the third-party client subdirectory
cd client-facing-thirdparty/
# Rename (back up) the logging jars so they do not conflict with Hadoop/Hive's logging libraries
mv commons-logging-1.2.jar commons-logging-1.2.bak
mv log4j-1.2.17.jar log4j-1.2.17.bak
mv slf4j-log4j12-1.7.30.jar slf4j-log4j12-1.7.30.bak
# Start the Hive service processes in the background
hive --service metastore &
hive --service hiveserver2 &
# Enter the Hive client
hive
# Create the database
create database telecom;
# Use the database