Introduction
CDH is Cloudera's Hadoop distribution and is fully open source. Compared with stock Apache Hadoop it offers better component compatibility, security, and stability, and it is a common architecture choice among Internet companies.
CDH version: CDH 5.12.2, Parcel
Hardware preparation
1. If you are using cloud hosts, just check that the configuration meets the requirements.
2. Following the minimum-footprint principle, prepare 6 physical hosts as the base. The general configuration is as follows.
PS: Specific gateway equipment will not be discussed here.
3. System version: CentOS 7.8 64-bit, minimal install
Software preparation
The following operations need to be run on all hosts
1. Install basic network tools
yum install net-tools ntp
2. Install the basic JAVA environment
RPM package name: jdk-8u151-linux-x64.rpm
Installation method: rpm -ivh jdk-8u151-linux-x64.rpm
PS: basically any JDK 8 release newer than this will do
3. Modify the host name and host configuration
Plan the distribution of roles across the hosts in advance according to your actual situation.
The current host mapping is as follows; write it into the system /etc/hosts file on every host:
172.16.3.11 master01
172.16.3.12 master02
172.16.3.101 node01
172.16.3.102 node02
172.16.3.103 node03
172.16.3.104 node04
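Since every host needs the identical mapping, a small script can generate the snippet once and push it out. A minimal sketch, assuming passwordless root SSH is already set up between the hosts (the ssh line is left commented so the script is safe to dry-run):

```shell
#!/bin/sh
# Generate the shared /etc/hosts snippet once, then distribute it.
# Hostnames and IPs are the ones used in this document.
cat > /tmp/cdh-hosts-snippet <<'EOF'
172.16.3.11 master01
172.16.3.12 master02
172.16.3.101 node01
172.16.3.102 node02
172.16.3.103 node03
172.16.3.104 node04
EOF

for h in master01 master02 node01 node02 node03 node04; do
  echo "sync to $h"
  # On a real deployment (assumes passwordless root SSH):
  # ssh "root@$h" 'cat >> /etc/hosts' < /tmp/cdh-hosts-snippet
done
```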
2. Update the host name
Modify the hostname boot configuration file to make sure the hostname does not change after a restart:
[root@localhost ~]# cat /etc/sysconfig/network
# Created by anaconda
NETWORKING=yes
HOSTNAME=master01
On CentOS 7 you can change it directly, without restarting (hostnamectl also persists the name to /etc/hostname):
hostnamectl set-hostname master01
4. Modify system parameters to ensure the cluster runs normally
Cloudera recommends setting /proc/sys/vm/swappiness to a maximum of 10; the default is 60:
echo 10 > /proc/sys/vm/swappiness
Transparent huge page compression is enabled by default and may cause major performance issues. To disable it, run:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
These settings do not survive a reboot, so add the same commands to an initialization script such as /etc/rc.local so they are reapplied when the system restarts.
Add the following lines to rc.local:
echo 10 > /proc/sys/vm/swappiness
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# iptables: choose whether to disable it according to your actual situation
iptables -F
service ntpd restart
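The THP files report the active setting in brackets (for example `always madvise [never]`), which makes scripted verification easy. A minimal helper sketch; the function name `thp_active` is our own, not a system command:

```shell
#!/bin/sh
# Extract the bracketed (active) value from a transparent_hugepage
# file's contents, e.g. "always madvise [never]" -> "never".
thp_active() {
  printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# On a real host, feed it the live file:
#   thp_active "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
thp_active "always madvise [never]"   # prints: never
```

A post-reboot health check can then assert the result is `never` and alert otherwise.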
Modify the system limits
In /etc/security/limits.conf, add the following configuration before "# End of file":
* soft nproc 65536
* hard nproc 65536
* soft nofile 65536
* hard nofile 65536
Then log out and log back in for the limits to take effect
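To script the limits change idempotently (safe to re-run on every host), something like the sketch below works. `LIMITS_FILE` defaults to a scratch path here purely for illustration; on a real host it would be /etc/security/limits.conf:

```shell
#!/bin/sh
# Append each limit line only if it is not already present (idempotent).
LIMITS_FILE=${LIMITS_FILE:-/tmp/limits.conf.demo}   # real host: /etc/security/limits.conf
add_limit() {
  grep -qxF "$1" "$LIMITS_FILE" 2>/dev/null || printf '%s\n' "$1" >> "$LIMITS_FILE"
}
add_limit '* soft nproc 65536'
add_limit '* hard nproc 65536'
add_limit '* soft nofile 65536'
add_limit '* hard nofile 65536'
add_limit '* soft nofile 65536'   # duplicate call: no second line is appended
```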
5. Turn off all types of firewalls
iptables -F
setenforce 0   # disables SELinux until reboot; set SELINUX=disabled in /etc/selinux/config to keep it off
6. Time synchronization
CDH is strict about time zones and clock consistency across hosts.
Use the ntpd service to automatically synchronize time for the current time zone:
yum install ntp
Use tzselect to adjust the current time zone if needed.
Finally, start the ntpd service:
service ntpd restart
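To verify that sync is actually working, `ntpq -p` lists peers with their offsets. The helper below (our own illustration, not part of the ntp package) pulls the offset column out of a peer line so you could script an alert threshold; the sample line is made up:

```shell
#!/bin/sh
# `ntpq -p` columns: remote refid st t when poll reach delay offset jitter
# The offset (field 9) is in milliseconds.
peer_offset_ms() {
  printf '%s\n' "$1" | awk '{print $9}'
}

sample='*time.example.org 10.0.0.1 2 u 60 64 377 1.234 0.567 0.089'
peer_offset_ms "$sample"   # prints: 0.567
# Live usage sketch: ntpq -p | tail -n +3 | while read -r line; do ...; done
```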
Master node installation
First confirm which version to install.
Check the version information at the link below and decide:
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_vd.html#cmvd_topic_1
As of 2021-03-19, the information is as follows
Cloudera Manager is available in the following releases:
Cloudera Manager 5.16.2 is the current release of Cloudera Manager 5.16.
Cloudera Manager 5.15.2, 5.14.4, 5.13.3, 5.12.2, 5.11.2, 5.10.2, 5.9.3, 5.8.5, 5.7.6, 5.6.1, 5.5.6, 5.4.10, 5.3.10, 5.2.7, 5.1.6, and 5.0.7 are previous stable releases of Cloudera Manager 5.15, 5.14, 5.13, 5.12, 5.11, 5.10, 5.9, 5.8, 5.7, 5.6, 5.5, 5.4, 5.3, 5.2, 5.1, and 5.0 respectively.
Self-built yum source
Install the yum source corresponding to the current system.
The first way is to use the official source directly.
For a CentOS 7 system, install the following release package:
rpm -Uvh http://archive.cloudera.com/cdh5/one-click-install/redhat/7/x86_64/cloudera-cdh-5-0.x86_64.rpm
The second way is to build a local source (recommended, as it makes subsequent node installation easier)
Operating environment: a fresh host, e.g. 192.168.1.100, running CentOS 7
1. First pull the repo file for the corresponding online version
rpm -Uvh http://archive.cloudera.com/cdh5/one-click-install/redhat/7/x86_64/cloudera-cdh-5-0.x86_64.rpm
[root@master01 parcel-repo]# cat /etc/yum.repos.d/cloudera-manager.repo
[cloudera-manager]
name = Cloudera Manager, Version 5.12.2
baseurl = https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/5.12.2/
gpgkey = https://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
2. Install the local source tool
yum install -y yum-utils createrepo httpd
3. Start httpd
service httpd start
4. Sync the corresponding repo
reposync -r cloudera-manager
5. Create the corresponding repo path
mkdir -p /var/www/html/mirrors/cdh/
cp -r cloudera-manager/ /var/www/html/mirrors/cdh/
cd /var/www/html/mirrors/cdh/
createrepo .
Once this completes, the local source has been built successfully.
Then modify the repo file to point at it:
[root@master01 parcel-repo]# cat /etc/yum.repos.d/cloudera-manager.repo
[cloudera-manager]
name = Cloudera Manager, Version 5.12.2
baseurl = http://192.168.1.100/mirrors/cdh/cloudera-manager/
gpgkey = https://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
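Since every node needs this same repo file, it can be generated rather than hand-edited. A sketch, with the mirror IP from this document's example parameterized; `OUT` defaults to a scratch path here, and on a real node it would be /etc/yum.repos.d/cloudera-manager.repo:

```shell
#!/bin/sh
# Generate the client-side repo file pointing at the local mirror.
MIRROR=${MIRROR:-192.168.1.100}
OUT=${OUT:-/tmp/cloudera-manager.repo}   # real node: /etc/yum.repos.d/cloudera-manager.repo
cat > "$OUT" <<EOF
[cloudera-manager]
name = Cloudera Manager, Version 5.12.2
baseurl = http://$MIRROR/mirrors/cdh/cloudera-manager/
gpgkey = https://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
EOF
echo "wrote $OUT"
```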
Install the server
yum install cloudera-manager-daemons cloudera-manager-server
Install the agent
Synchronize the /etc/yum.repos.d/cloudera-manager.repo file to each node, then
execute yum install cloudera-manager-agent on each node.
This way the packages install over the intranet from the local source, avoiding painfully slow downloads.
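The sync-and-install step can be scripted in one loop. A sketch assuming passwordless root SSH to the nodes, with a DRY_RUN guard (on by default here) so it just prints what it would do:

```shell
#!/bin/sh
# Push the repo file to every other node and install the agent there.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "DRY: $*"; else "$@"; fi; }

for h in master02 node01 node02 node03 node04; do
  run scp /etc/yum.repos.d/cloudera-manager.repo "root@$h:/etc/yum.repos.d/"
  run ssh "root@$h" yum install -y cloudera-manager-agent
done
```

Set DRY_RUN=0 (and verify SSH access first) to actually execute the rollout.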
Install MySQL
The CDH cluster can be backed by several databases; here we choose MySQL.
MySQL 5.5.6 installation
rpm -Uvh http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
After extraction succeeds, two yum files are generated under /etc/yum.repos.d.
Then start the installation:
yum install mysql-server -y
Start it:
service mysqld restart
The account and password used here are:
mysql -uroot -p111111
PS: use a more complex password than this in practice.
Run the CDH database preparation script
/usr/share/cmf/schema/scm_prepare_database.sh mysql -uroot -p111111 --scm-host localhost scm scm scm_password
Install the JDBC connector JAR
mkdir -p /usr/share/java/
cp mysql-connector-java.jar /usr/share/java/
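A quick pre-flight check before first starting the server is to confirm the connector JAR is where CM looks for it. The helper below parameterizes the directory (`JAVA_SHARE`) purely so the check can be exercised anywhere; on a real host it is /usr/share/java:

```shell
#!/bin/sh
# Check that the MySQL JDBC driver is in place.
JAVA_SHARE=${JAVA_SHARE:-/usr/share/java}
connector_present() {
  [ -e "$JAVA_SHARE/mysql-connector-java.jar" ]
}

if connector_present; then
  echo "mysql-connector-java.jar found in $JAVA_SHARE"
else
  echo "mysql-connector-java.jar MISSING from $JAVA_SHARE"
fi
```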
Preparations before deploying the cluster
Put the required parcel package in the corresponding path on master01 in advance;
otherwise it will have to be downloaded, which is slower.
After the server-side installation completes, master01 will be listening on an additional port, 7180:
http://172.16.1.11:7180/
Default credentials: admin / admin
Then follow the steps below to build the cluster.
Tick agree
Choose the free version
Well... the free version is of course not as good as the enterprise version; which to choose depends on your specific situation.
Choose parcel installation and corresponding version
Continue directly
Enter the account password corresponding to the host
Because the agent has been installed beforehand, the next node deployment will be easier
PS: The IPs in the screenshots differ from the examples in this document; deploy according to your actual IPs.
The parcel package has been deployed before, so it will be faster here
Based on actual business needs, I chose Spark to match my requirements.
Node distribution according to the actual situation
NameNode: Mainly used to store HDFS metadata information, such as namespace information, block information, etc. When it is running, this information is stored in memory. But this information can also be persisted to disk.
Secondary NameNode: its whole purpose is to provide a checkpoint in HDFS. It is just a helper node for the NameNode. Note that it is a checkpoint node, not a backup node. Pay special attention!!!
Balancer: Balance the data space utilization rate between nodes.
HttpFS: an HTTP interface to HDFS provided by Cloudera. Through the WebHDFS REST API, you can read from and write to HDFS.
NFS Gateway: the HDFS NFS gateway allows clients to mount HDFS and interact with it through NFS as if it were part of the local file system. The gateway supports NFSv3.
DataNode: the node where the actual data blocks are stored
Hive Gateway: Hive's default gateway; by default, every node must have one
Hive Metastore Server: Hive metadata access entry, using the Thrift protocol to provide cross-language access to hive metadata.
WebHCat Server: WebHCat provides a REST interface so that users can execute Hive DDL operations, run Hive HQL tasks, and run MapReduce tasks over the secure HTTPS protocol.
HiveServer2: the access point for the data stored in Hive. It likewise uses the Thrift protocol to provide cross-language access to Hive data, for example remote access from Python, Java, and other languages; the beeline client also accesses data through HiveServer2.
PS: Roughly speaking, for a table in Hive, you access the table's metadata through the Metastore Server, and the table's actual contents through HiveServer2.
Hue Server: Hue Server is a web application built on the Django Python web framework.
Load Balancer: Hue's load balancer
Service Monitor: collects health and metrics information about services
Activity Monitor: collects information about service activity
Host Monitor: collects health and metrics information about hosts
Event Server: Aggregate component events and use them for alerts and searches
Alert Publisher: Generate and provide alerts for specific types of events
Oozie: a workflow scheduler system for managing Apache Hadoop jobs, integrated with the Hadoop technology stack. It supports multiple types of Hadoop jobs (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, DistCp, and Spark) as well as system-specific jobs (such as Java programs and shell scripts).
History Server: Historical task record
Gateway: Spark's node scheduling gateway
Resource Manager: resource allocation and scheduling
Job History: historical task scheduling records
Node Manager: resource and task management of a single node
Server: the ZooKeeper server; at least three nodes, or five if resources allow, mainly used for configuration management, distributed synchronization, etc.
The corresponding database connection can be configured directly in the next step, as shown in the figure below.
Next step --> installation will proceed automatically according to the previous deployment.
Finally, the cluster is up and running. Cheers!!!
Summary
CDH provides relatively complete components and management mechanisms, but that does not mean maintenance and optimization are unnecessary. We will cover optimization topics gradually in follow-up posts.