Start from scratch: building a CDH big data cluster

Introduction

       CDH is Cloudera's Hadoop distribution, and it is completely open source. It is more compatible, safer, and more stable than stock Apache Hadoop, and it is a common choice among Internet companies.

CDH version: CDH 5.12.2, Parcel

Hardware preparation

1. If you are using cloud hosts, just check that the instance specs meet the requirements.
2. Following the minimum-footprint principle, prepare 6 physical hosts as the baseline. A typical configuration is as follows.
PS: Gateway and network equipment are out of scope here.


3. System version: CentOS 7.8 64-bit, minimal installation

Software preparation

The following operations need to be run on all hosts

1. Install basic network tools

yum install net-tools ntp

2. Install the basic JAVA environment

RPM package name: jdk-8u151-linux-x64.rpm
Installation method: rpm -ivh jdk-8u151-linux-x64.rpm
PS: basically any JDK 8 release at or above this version will do

3. Modify the host names and the hosts configuration

First, plan how roles will be distributed across the hosts according to your actual situation.
The current layout is as follows; write it into /etc/hosts on every host:

172.16.3.11    master01
172.16.3.12    master02
172.16.3.101    node01
172.16.3.102    node02
172.16.3.103    node03
172.16.3.104    node04
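Once /etc/hosts is finalized on one machine, a small loop can push it to the rest of the cluster. This is a sketch only; it assumes root SSH access to every host and the hostnames listed above:

```shell
# Push the finished hosts file to every other cluster member.
# Assumes root SSH access (by key or password) to each host.
for h in master02 node01 node02 node03 node04; do
    scp /etc/hosts root@"$h":/etc/hosts
done
```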

Update the host name:
modify the hostname boot configuration file to ensure that the hostname does not change after a restart.

[root@localhost ~]# cat /etc/sysconfig/network
# Created by anaconda
NETWORKING=yes
HOSTNAME=master01

To change it immediately without restarting:

hostnamectl set-hostname master01

4. Modify the system parameters to ensure the normal operation of the cluster

Cloudera recommends setting /proc/sys/vm/swappiness to a maximum of 10 (the default is 60):
echo 10 > /proc/sys/vm/swappiness

Transparent huge page compression is enabled by default and can cause major performance issues. Disable it by running:

echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled

Add the same commands to an initialization script such as /etc/rc.local so they are applied again when the system restarts.

Add the following options in rc.local

echo 10 >  /proc/sys/vm/swappiness
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# iptables: decide whether to disable it according to your actual situation

iptables -F
service ntpd restart
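One caveat: on CentOS 7, /etc/rc.d/rc.local is not executable by default, so the lines added above would silently never run at boot. A quick fix plus a sanity check (sketch):

```shell
# Make rc.local executable so the boot-time settings actually run (CentOS 7)
chmod +x /etc/rc.d/rc.local

# Verify the current runtime values after applying the changes
cat /proc/sys/vm/swappiness                       # should print 10
cat /sys/kernel/mm/transparent_hugepage/enabled   # "never" should be the bracketed choice
```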

Modify the system limits:
in /etc/security/limits.conf, add the following configuration before the "# End of file" line.

*        soft    nproc  65536 
*        hard    nproc  65536 
*        soft    nofile  65536 
*        hard    nofile  65536

Then log out and log back in for the new limits to take effect.
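After re-logging in, the new limits can be confirmed with ulimit (the expected values assume the limits.conf entries above were applied):

```shell
# Check the per-process limits in the new session
ulimit -n   # max open files; should now report 65536
ulimit -u   # max user processes; should now report 65536
```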

5. Turn off the firewall and SELinux

iptables -F     # flush firewall rules
setenforce 0    # put SELinux in permissive mode

6. Time synchronization

CDH is strict about time zone and clock consistency across hosts.
Use the ntpd service to automatically synchronize the time for the current time zone:
yum install ntp
Use tzselect to adjust the current time zone if needed.

Finally, restart the ntpd service:
service ntpd restart
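To have ntpd come back after a reboot and to confirm it is actually syncing, something like the following helps (a sketch; the peer list depends on your ntp.conf):

```shell
# Enable ntpd at boot and confirm synchronization status
chkconfig ntpd on      # CentOS 7 also accepts: systemctl enable ntpd
ntpq -p                # peer list; '*' marks the currently selected source
ntpstat                # one-line summary: synchronised / unsynchronised
```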

Master node installation

First decide which version to install. Check the release information here:
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_vd.html#cmvd_topic_1

As of 2021-03-19, the information is as follows

Cloudera Manager is available in the following releases:
Cloudera Manager 5.16.2 is the current release of Cloudera Manager 5.16.
Cloudera Manager 5.15.2, 5.14.4, 5.13.3, 5.12.2, 5.11.2, 5.10.2, 5.9.3, 5.8.5, 5.7.6, 5.6.1, 5.5.6, 5.4.10, 5.3.10, 5.2.7, 5.1.6, and 5.0.7 are previous stable releases of Cloudera Manager 5.14, 5.13, 5.12, 5.11, 5.10, 5.9, 5.8, 5.7, 5.6, 5.5, 5.4, 5.3, 5.2, 5.1, and 5.0 respectively.

Self-built yum source

Install the yum source matching the current system.

The first way: use the official source directly

For a CentOS 7 system, install the one-click repo RPM:
rpm -Uvh http://archive.cloudera.com/cdh5/one-click-install/redhat/7/x86_64/cloudera-cdh-5-0.x86_64.rpm

The second way: build a local source (recommended; it makes later node installs much faster)

Environment: one spare host, e.g. 192.168.1.100, running CentOS 7.
1. First pull the repo file for the corresponding version:
rpm -Uvh http://archive.cloudera.com/cdh5/one-click-install/redhat/7/x86_64/cloudera-cdh-5-0.x86_64.rpm
[root@master01 parcel-repo]# cat /etc/yum.repos.d/cloudera-manager.repo
[cloudera-manager]
name = Cloudera Manager, Version 5.12.2
baseurl = https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/5.12.2/
gpgkey = https://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1

2. Install the local source tool
yum install -y yum-utils createrepo httpd

3. Start httpd
service httpd start

4. Sync the corresponding repo locally:
reposync -r cloudera-manager

5. Create the corresponding repo path
mkdir -p /var/www/html/mirrors/cdh/
cp -r cloudera-manager/ /var/www/html/mirrors/cdh/
cd /var/www/html/mirrors/cdh/
createrepo .
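Before pointing any repo files at it, the mirror can be sanity-checked from another host (a sketch; assumes the mirror host is 192.168.1.100 as above):

```shell
# yum needs repodata/repomd.xml to be reachable over HTTP for the mirror to work
curl -sf http://192.168.1.100/mirrors/cdh/cloudera-manager/repodata/repomd.xml \
  | head -n 3 && echo "mirror OK"
```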

When that finishes, the local source has been successfully built.
Then modify the repo file to point at it:
[root@master01 parcel-repo]# cat /etc/yum.repos.d/cloudera-manager.repo
[cloudera-manager]
name = Cloudera Manager, Version 5.12.2
baseurl = http://192.168.1.100/mirrors/cdh/cloudera-manager/
gpgkey = https://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1

Install the server

yum install cloudera-manager-daemons cloudera-manager-server

Install the agent

Synchronize the /etc/yum.repos.d/cloudera-manager.repo file to each node and
run yum install cloudera-manager-agent on each node.
This way everything installs over the intranet, avoiding the embarrassment of slow downloads.
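The two steps above can be combined into one loop run from master01 (a sketch; assumes root SSH access and the hostnames from /etc/hosts):

```shell
# Distribute the repo file and install the agent on every node
for h in master02 node01 node02 node03 node04; do
    scp /etc/yum.repos.d/cloudera-manager.repo root@"$h":/etc/yum.repos.d/
    ssh root@"$h" 'yum install -y cloudera-manager-agent'
done
```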

Install MySQL

A CDH cluster can be backed by many databases; here we choose MySQL.

MySQL 5.5.6 installation

Install MySQL:
rpm -Uvh http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
After this succeeds, two yum repo files are generated under /etc/yum.repos.d.

Then start the installation:
yum install mysql-server -y

Start it:
service mysqld restart

The account and password set here are as follows:
mysql -uroot -p111111
PS: please use a more complex password than this.

Run the CDH schema preparation script:
/usr/share/cmf/schema/scm_prepare_database.sh mysql -uroot -p111111 --scm-host localhost scm scm scm_password

Copy in the JDBC connector jar:
mkdir -p /usr/share/java/
cp mysql-connector-java.jar  /usr/share/java/
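If the schema script succeeded, the scm database and user should now exist. A quick check (sketch, using the credentials chosen above):

```shell
# Confirm the scm database was created and the scm user can reach it
mysql -uroot -p111111 -e "SHOW DATABASES LIKE 'scm';"
mysql -uscm -pscm_password -e "SELECT 1;" scm
```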

Preparations before deploying the cluster

Put the required parcel packages into the corresponding path on master01 (the default is /opt/cloudera/parcel-repo);
otherwise they will have to be downloaded, which is slower.

After the server-side installation completes, master01 listens on an additional port, 7180:
http://172.16.3.11:7180/
Default login: admin / admin.
Then follow the steps below to build the cluster.

Tick "agree"


Choose the free version

Well... the free version is certainly not as capable as the enterprise version; which one to use depends on your situation.

Choose parcel installation and corresponding version


Continue directly


Enter the account and password for the hosts


Because the agent has been installed beforehand, the next node deployment will be easier

PS: The IPs in the screenshots differ from this document's examples; deploy according to your actual IPs.

The parcel package has been deployed before, so it will be faster here


Choose services according to your actual business; I chose Spark to match my actual needs.


Node distribution according to the actual situation

NameNode: stores HDFS metadata, such as namespace information and block information. At runtime this information is held in memory, and it can also be persisted to disk.
Secondary NameNode: its whole purpose is to provide checkpoints in HDFS. It is just a helper to the NameNode. Note that it is not a backup node but a checkpoint node. Pay special attention!
Balancer: balances data-space utilization across nodes.
HttpFS: an HTTP interface to HDFS provided by Cloudera; through the WebHDFS REST API you can read and write HDFS.
NFS Gateway: the HDFS NFS gateway allows clients to mount HDFS and interact with it over NFS as if it were part of the local file system. The gateway supports NFSv3.
DataNode: the node where the actual data blocks are stored.

Hive Gateway: Hive's default gateway; by default, every node must have one.
Hive Metastore Server: the access point for Hive metadata; uses the Thrift protocol to provide cross-language access to Hive metadata.
WebHCat Server: WebHCat provides a REST interface that lets users execute Hive DDL operations, run Hive HQL tasks, and run MapReduce tasks over the secure HTTPS protocol.
HiveServer2: the access point for the data stored in Hive; it likewise uses the Thrift protocol to provide cross-language access, so Python, Java, and other clients can query Hive data remotely. The beeline client also accesses data through HiveServer2.
PS: roughly speaking, for a table in Hive, you access the table's metadata through the Metastore Server
and the table's actual contents through HiveServer2.

Hue Server: a web application built on the Django Python web framework.
Load Balancer: Hue's load balancing.

Service Monitor: collects health and metric information about services.
Activity Monitor: collects information about service activity.
Host Monitor: collects health and metric information about hosts.
Event Server: aggregates component events and uses them for alerting and search.
Alert Publisher: generates and delivers alerts for specific types of events.

Oozie: a workflow scheduler system for managing Apache Hadoop jobs, integrated with the Hadoop technology stack. It supports multiple types of Hadoop jobs (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, DistCp, and Spark) as well as system-specific jobs (such as Java programs and shell scripts).

History Server: records of completed Spark jobs.
Gateway: Spark's node scheduling gateway.

Resource Manager: resource allocation and scheduling.
Job History: records of completed task scheduling.
Node Manager: per-node resource and task management.

ZooKeeper Server: at least three nodes, or five if resources allow; mainly used for configuration management, distributed synchronization, and so on.

The corresponding database connection can be configured directly in the next step, as shown in the following figure


Next step: the installation proceeds automatically according to the previous deployment choices.


Finally, the cluster is up and running. Cheers!


Summary

       CDH provides fairly complete components and management mechanisms, but that does not mean maintenance and optimization are unnecessary. We will cover optimization topics gradually in follow-up posts.


Origin blog.51cto.com/14839701/2665703