CDH 6.3.2 software service installation
CDH
Overview
CDH (Cloudera's Distribution Including Apache Hadoop) is a set of big data solutions provided by Cloudera.
CDH is built on the Apache Hadoop ecosystem, which includes Hadoop core components (such as HDFS, YARN and MapReduce) and other related open source technologies (such as Hive, HBase, Spark, Impala, etc.). By integrating these components, Cloudera provides enterprises with a stable, reliable, and scalable data processing platform.
In CDH, Cloudera Manager is a key component used to manage and monitor the entire cluster. It provides an easy-to-use web interface for cluster configuration, software installation, performance monitoring, and troubleshooting.
Architecture
Hadoop core components:
Hadoop Distributed File System (HDFS): a distributed file system for storing and managing large-scale datasets.
Yet Another Resource Negotiator (YARN): allocates and manages cluster resources for running applications.
MapReduce: a distributed data-processing framework for executing large-scale processing jobs on the cluster.
Data storage and processing components:
Hive: a Hadoop-based data warehouse infrastructure that provides an SQL-like query language for convenient data analysis and processing.
HBase: a distributed, column-oriented NoSQL database suited to highly scalable real-time reads and writes.
Spark: a fast, general-purpose big data processing engine that supports batch processing, interactive queries, and stream processing.
Impala: a high-performance SQL query engine for real-time queries over data stored in HDFS and HBase.
Solr: an open-source, high-performance search platform for building real-time search and large-scale analytics applications.
Sqoop: a tool for transferring data between Hadoop and relational databases.
Data integration and stream processing components:
Kafka: a high-throughput, distributed streaming platform for processing real-time data streams.
Flume: a distributed system for efficiently and reliably collecting, aggregating, and moving data from multiple sources.
Security and management components:
Cloudera Manager: a comprehensive management platform for cluster configuration, deployment, monitoring, and administration.
Apache Sentry: provides fine-grained access control and permission management to protect sensitive data.
Apache Knox: provides a single access point and API gateway for securely accessing and managing a Hadoop cluster.
Create database
Create the database required by each component
CREATE DATABASE hive DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
CREATE DATABASE oozie DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
CREATE DATABASE hue DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
CREATE DATABASE sentry DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
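The Cloudera Manager wizard later connects to these databases as a dedicated user, so each database also needs a grant. A minimal sketch, assuming one MySQL user per service; the user names and passwords below are placeholders, not part of the original setup - adjust them (and the allowed hosts) for your site:

```shell
# Hypothetical grants for the service databases created above (MySQL 5.x syntax).
# User names and passwords are placeholders - replace them with your own.
mysql -uroot -p <<'SQL'
GRANT ALL PRIVILEGES ON hive.*   TO 'hive'@'%'   IDENTIFIED BY 'hive_pw';
GRANT ALL PRIVILEGES ON oozie.*  TO 'oozie'@'%'  IDENTIFIED BY 'oozie_pw';
GRANT ALL PRIVILEGES ON hue.*    TO 'hue'@'%'    IDENTIFIED BY 'hue_pw';
GRANT ALL PRIVILEGES ON sentry.* TO 'sentry'@'%' IDENTIFIED BY 'sentry_pw';
FLUSH PRIVILEGES;
SQL
```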
Install Kafka service
In CDH, the installation and configuration methods of each component service are similar. Here we take the installation of Kafka component service as an example.
Add service
On the homepage, click Add Service
Enter the service list and select the Kafka service
Select three machines for Kafka Broker
Configuration
On the Review Changes page, adjust the heap memory size and leave the other settings at their defaults.
Waiting for installation
Kafka command usage
Create Kafka Topic
/opt/cloudera/parcels/CDH/bin/kafka-topics --bootstrap-server node03:9092,node04:9092,node05:9092 --create --replication-factor 1 --partitions 1 --topic test
kafka-topics --bootstrap-server node03:9092,node04:9092,node05:9092 --create --replication-factor 1 --partitions 1 --topic test
View Kafka Topic
/opt/cloudera/parcels/CDH/bin/kafka-topics --bootstrap-server node03:9092,node04:9092,node05:9092 --list
kafka-topics --bootstrap-server node03:9092,node04:9092,node05:9092 --list
Delete Kafka Topic
/opt/cloudera/parcels/CDH/bin/kafka-topics --delete --bootstrap-server node03:9092,node04:9092,node05:9092 --topic test
kafka-topics --delete --bootstrap-server node03:9092,node04:9092,node05:9092 --topic test
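After creating a topic, a quick produce/consume round trip with the console clients (shipped in the same parcel bin directory) confirms the brokers are healthy. A sketch assuming the same node03-05 brokers and the test topic created above:

```shell
# Produce a few messages to the test topic (type lines, then Ctrl+C to exit)
/opt/cloudera/parcels/CDH/bin/kafka-console-producer \
  --broker-list node03:9092,node04:9092,node05:9092 --topic test

# In another shell, consume them from the beginning of the topic
/opt/cloudera/parcels/CDH/bin/kafka-console-consumer \
  --bootstrap-server node03:9092,node04:9092,node05:9092 --topic test --from-beginning
```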
Installation of other components
In CDH, the installation of each component service is similar. Refer to the above Kafka installation to install the following common components.
Install Flume service
Choose to add Flume service
Select which services Flume depends on
Assign the node where Flume Agent is located
Install Hive service
Choose to add Hive service
Add Hive service to Cluster
Configure the Hive metadata database
Exception occurred:
Solution: copy mysql-connector-java.jar and distribute it to the /usr/share/java/ directory of each node.
[root@node01 ~]# ./sync.sh /usr/share/java/mysql-connector-java.jar
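The sync.sh helper used here is site-specific and not shown in this document; a hypothetical minimal version, which copies a file to the same absolute path on every other node, might look like the following (the node names are assumptions - edit them for your cluster):

```shell
#!/bin/bash
# sync.sh - hypothetical sketch: copy a file to the same absolute path on each node.
# Assumes passwordless SSH as root; node names below are placeholders.
FILE=$1
for host in node02 node03 node04 node05; do
  scp "$FILE" "root@${host}:${FILE}"
done
```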
Retest
Use default configuration
Automatically start the Hive process after installation
Note: After installing Spark, configure Hive On Spark, and then restart Hive
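For reference, the switch behind "Hive On Spark" is Hive's execution-engine property. In CM it is set through the Hive service configuration page; per session it can be toggled like this (a sketch, not a CM-specific step):

```sql
-- Per-session switch; globally this is hive.execution.engine in hive-site.xml (managed by CM)
SET hive.execution.engine=spark;
```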
Install Spark service
Add Spark service
CDH 6.x ships with Spark 2.4, so no upgrade is needed.
Assign nodes
Just select the default for all cluster settings
Waiting for installation
Click to restart
After installing Spark, configure Hive On Spark, and then restart Hive
Install OOZIE service
Choose to add OOZIE service
Allocate nodes
Configure Oozie metadata
Use default configuration. Wait for installation and start Oozie.
Install HUE service
Choose to add Hue service
Allocate nodes
Configure the Hue metadata database
Wait for installation; the Hue process starts automatically.
Install Flink service
Flink is also a common service in the big data field. However, CDH 6.3.2 does not ship a Flink service, so Flink must be compiled manually.
Download relevant configuration package
Check the version information of each CDH component. The Flink version matching Hive 2.1.1 is flink-1.13.6.
Download the Flink installation package
wget https://archive.apache.org/dist/flink/flink-1.13.6/flink-1.13.6-bin-scala_2.11.tgz
Download the Flink source code package
wget https://archive.apache.org/dist/flink/flink-1.13.6/flink-1.13.6-src.tgz
Install Maven
Download Maven
wget https://archive.apache.org/dist/maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz
Reference: Maven installation and configuration
Flink’s CDH version compilation configuration
Unpack the Flink source and binary packages (later steps assume the source tree at flink-src and the binary distribution at /usr/local/flink):
tar -zxvf flink-1.13.6-src.tgz
mv flink-1.13.6 flink-src
tar -zxvf flink-1.13.6-bin-scala_2.11.tgz
mv flink-1.13.6 /usr/local/flink
Modify the Hadoop and Hive version properties in flink-src/pom.xml:
<flink.hadoop.version>3.0.0-cdh6.3.2</flink.hadoop.version>
<hive.version>2.1.1-cdh6.3.2</hive.version>
Add the following content inside the <repositories> tag:
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
<repository>
<id>confluent-repo</id>
<url>https://packages.confluent.io/maven/</url>
</repository>
</repositories>
Modify the hive connector's pom.xml:
vim /root/flink-src/flink-connectors/flink-sql-connector-hive-2.3.9/pom.xml
Change the hive-exec dependency to the CDH build:
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.1.1-cdh6.3.2</version>
Compile Flink
mvn clean install -DskipTests -Dfast -Drat.skip=true -Dhadoop.version=3.0.0-cdh6.3.2 -Dinclude-hadoop -Dscala-2.11 -T10C
Copy the successfully compiled flink-sql-connector-hive jar to Flink's lib directory
[root@node01 ~]# cp flink-src/flink-connectors/flink-sql-connector-hive-2.2.0/target/flink-sql-connector-hive-2.2.0_2.11-1.13.6.jar /usr/local/flink/lib/
# Copy hive-exec-2.1.1-cdh6.3.2.jar and libfb303-0.9.3.jar
[root@node01 ~]# cp /opt/cloudera/parcels/CDH/jars/hive-exec-2.1.1-cdh6.3.2.jar /usr/local/flink/lib/
[root@node01 ~]# cp /opt/cloudera/parcels/CDH/jars/libfb303-0.9.3.jar /usr/local/flink/lib/
Copy related hadoop packages
[root@node01 ~]# cp /opt/cloudera/parcels/CDH/jars/hadoop-common-3.0.0-cdh6.3.2.jar /usr/local/flink/lib/
[root@node01 ~]# cp /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-common-3.0.0-cdh6.3.2.jar /usr/local/flink/lib/
[root@node01 ~]# cp /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-core-3.0.0-cdh6.3.2.jar /usr/local/flink/lib/
[root@node01 ~]# cp /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-hs-3.0.0-cdh6.3.2.jar /usr/local/flink/lib/
[root@node01 ~]# cp /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-3.0.0-cdh6.3.2.jar /usr/local/flink/lib/
Build Flink's parcel package and csd file
1. Package the complete Flink directory, whose lib now contains the copied jars.
[root@node01 ~]# cd /usr/local
[root@node01 local]# tar -zcvf flink-1.13.6-cdh6.3.2.tgz flink
2. Download the build script
[root@node01 local]# yum install git
[root@node01 local]# git clone https://github.com/YUjichang/flink-parcel.git
Accelerated mirror: git clone https://gitclone.com/github.com/YUjichang/flink-parcel.git
3. Modify script configuration
[root@node01 local]# cd flink-parcel/
[root@node01 flink-parcel]# vim flink-parcel.properties
# Location of the Flink package
FLINK_URL=/usr/local/flink-1.13.6-cdh6.3.2.tgz
# Flink version
FLINK_VERSION=1.13.6
# Extension version
EXTENS_VERSION=CDH6.3.2
# OS version (CentOS here)
OS_VERSION=7
# Minimum supported CDH version
CDH_MIN_FULL=6.0
# Maximum supported CDH version
CDH_MAX_FULL=6.4
CDH_MIN=5
CDH_MAX=6
4. Run the build.sh script to start building parcel and csd
[root@node01 flink-parcel]# ./build.sh parcel
[root@node01 flink-parcel]# ./build.sh csd
5. After the build completes, the generated Flink parcel and csd files are:
FLINK_ON_YARN-1.13.6.jar
FLINK-1.13.6-CDH6.3.2-el7.parcel
FLINK-1.13.6-CDH6.3.2-el7.parcel.sha
manifest.json
6. Add Flink service to CM
cp FLINK-1.13.6-CDH6.3.2-el7.parcel /opt/cloudera/parcel-repo/
cp FLINK-1.13.6-CDH6.3.2-el7.parcel.sha /opt/cloudera/parcel-repo/
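Cloudera Manager validates the parcel against the 40-hex-character SHA-1 stored in the .parcel.sha file and refuses to distribute a parcel whose checksum does not match, so the .sha must be regenerated whenever the parcel is repackaged. Demonstrated here on a stand-in file so the example is self-contained (substitute the real parcel name):

```shell
# Create a stand-in file; with a real parcel, skip this line and use
# FLINK-1.13.6-CDH6.3.2-el7.parcel instead of demo.parcel.
printf 'stand-in parcel bytes' > demo.parcel

# The .sha file must contain only the 40-hex-character SHA-1 of the parcel.
sha1sum demo.parcel | awk '{print $1}' > demo.parcel.sha
cat demo.parcel.sha
```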
Add Flink service directly to CM
1. Unpack the prebuilt parcel package and copy it to Cloudera's parcel-repo
[root@node01 local]# tar -zxvf flink-1.13.6-cdh6.3.2_parcel.tar.gz
[root@node01 local]# cd flink-1.13.6-cdh6.3.2/
[root@node01 flink-1.13.6-cdh6.3.2]# ll
total 377276
-rwxrwxrwx 1 root root 386296010 Aug 30 11:51 FLINK-1.13.6-CDH6.3.2-el7.parcel
-rwxrwxrwx 1 root root 40 Aug 30 11:51 FLINK-1.13.6-CDH6.3.2-el7.parcel.sha
-rwxrwxrwx 1 root root 21123 Aug 30 11:51 FLINK_ON_YARN-1.13.6.jar
-rwxrwxrwx 1 root root 841 Aug 30 11:52 manifest.json
[root@node01 flink-1.13.6-cdh6.3.2]# cp FLINK-1.13.6-CDH6.3.2-el7.parcel /opt/cloudera/parcel-repo/
[root@node01 flink-1.13.6-cdh6.3.2]# cp FLINK-1.13.6-CDH6.3.2-el7.parcel.sha /opt/cloudera/parcel-repo/
2. Add the Flink parcel's metadata to the manifest.json file in Cloudera's parcel-repo
vim /opt/cloudera/parcel-repo/manifest.json
[
  {
    "components": [
      {
        "pkg_version": "",
        "version": "6",
        "name": "",
        "pkg_release": ""
      }
    ],
    "hash": "",
    "parcelName": "",
    "replaces": ""
  },
  {
    "components": [
      {
        "pkg_version": "flink1.13.6",
        "version": "flink1.13.6",
        "name": "flink",
        "pkg_release": "cdh6.3.2"
      }
    ],
    "hash": "4e1a65e353d2e36c7e9d12a912eb8516a7f486f5",
    "parcelName": "FLINK-1.13.6-CDH6.3.2-el7.parcel",
    "replaces": "flink"
  }
]
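Because manifest.json is edited by hand here, it is worth checking that the result is still valid JSON before restarting anything - a malformed manifest breaks parcel distribution for the whole repo. A self-contained demonstration on a stand-in manifest (on a real CM node, point the check at /opt/cloudera/parcel-repo/manifest.json instead):

```shell
# Stand-in manifest so the example runs anywhere; use the real path on a CM node.
cat > demo-manifest.json <<'EOF'
{"parcels": [{"parcelName": "FLINK-1.13.6-CDH6.3.2-el7.parcel",
              "hash": "4e1a65e353d2e36c7e9d12a912eb8516a7f486f5"}]}
EOF

# json.tool exits non-zero on a syntax error, so OK is printed only for valid JSON.
python3 -m json.tool demo-manifest.json > /dev/null && echo "manifest OK"
```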
3. Copy FLINK_ON_YARN to cloudera’s csd
[root@node01 software]# cd flink-1.13.6-cdh6.3.2/
[root@node01 flink-1.13.6-cdh6.3.2]# cp FLINK_ON_YARN-1.13.6.jar /opt/cloudera/csd/
[root@node01 flink-1.13.6-cdh6.3.2]# systemctl restart cloudera-scm-server
4. After restarting, allocate and activate on the CM page
5. Add Flink service configuration and cluster planning
Restart
Verify Flink service
1. Run the WordCount example in yarn-per-job mode to test Flink on YARN
[root@node01 ~]# chmod 777 /opt/cloudera/parcels/FLINK/bin/flink
[root@node01 ~]# sudo -u hdfs /opt/cloudera/parcels/FLINK/bin/flink run -t yarn-per-job /opt/cloudera/parcels/FLINK/lib/flink/examples/batch/WordCount.jar
Printing result to stdout. Use --output to specify output path.
2023-08-16 10:37:33,318 WARN org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The configuration directory ('/etc/flink/conf') already contains a LOG4J config file.If you want to use logback, then please delete or rename the log configuration file.
2023-08-16 10:37:33,589 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2023-08-16 10:37:33,747 INFO org.apache.hadoop.conf.Configuration [] - resource-types.xml not found
2023-08-16 10:37:33,747 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils [] - Unable to find 'resource-types.xml'.
2023-08-16 10:37:33,801 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Cluster specification: ClusterSpecification{
masterMemoryMB=2048, taskManagerMemoryMB=2048, slotsPerTaskManager=1}
2023-08-16 10:37:36,652 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Submitting application master application_1692149627327_0002
2023-08-16 10:37:36,893 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl [] - Submitted application application_1692149627327_0002
2023-08-16 10:37:36,894 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Waiting for the cluster to be allocated
2023-08-16 10:37:36,895 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Deploying cluster, current state ACCEPTED
2023-08-16 10:37:43,446 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - YARN application has been deployed successfully.
2023-08-16 10:37:43,447 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface node05:8080 of application 'application_1692149627327_0002'.
Job has been submitted with JobID c0b4b89406f6fee0a4c3f6b95bb0ee67
Program execution finished
Job with JobID c0b4b89406f6fee0a4c3f6b95bb0ee67 has finished.
Job Runtime: 12175 ms
Accumulator Results:
- 7611dc575cfdcecc9d3528d9326c6aba (java.util.ArrayList) [170 elements]
(a,5)
(action,1)
(after,1)
(against,1)
(all,2)
(and,12)
(arms,1)
(arrows,1)
(awry,1)
(ay,1)
(bare,1)
(be,4)
(bear,3)
(bodkin,1)
(bourn,1)
2. Browser access: http://IP:8088/cluster
View execution
3. Verify Hive_FlinkSQL
[root@node01 ~]# /opt/cloudera/parcels/FLINK/bin/flink-sql-client
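Inside the SQL client, the Hive integration can be smoke-tested by registering a Hive catalog. A sketch assuming the Hive client configuration lives under /etc/hive/conf (the catalog name myhive is arbitrary):

```sql
-- Register a Hive catalog backed by the cluster's metastore
-- ('hive-conf-dir' path is an assumption; adjust to your deployment)
CREATE CATALOG myhive WITH (
    'type' = 'hive',
    'hive-conf-dir' = '/etc/hive/conf'
);
USE CATALOG myhive;
SHOW TABLES;
```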