CDH Cluster Layout

Recently I started working with big data technology. My company uses CDH as its big data cluster platform, so I tried installing it myself. When it came to the final division of roles, however, I was at a loss because I didn't know enough about the architecture and principles of the components and servers. I came across this article by accident, and I recommend reading it carefully.

Cloudera Platform Software Architecture

[Figure: Cloudera platform software architecture]

Cloudera's software architecture includes the following modules: system deployment and management, data storage, resource management, processing engine, security, data management, tool library, and access interface. Role information for some key components:

[Tables: role information for key components]
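For orientation, here is a rough sketch in Python of how these modules map to representative CDH components. The mapping is illustrative only, not an official Cloudera list:

```python
# Illustrative sketch (not from the original figures): a rough mapping of
# Cloudera's architecture modules to representative CDH components.
# The component choices are typical examples, not an exhaustive list.
CDH_MODULES = {
    "deployment and management": ["Cloudera Manager", "Cloudera Management Service"],
    "data storage":              ["HDFS", "HBase", "Kudu"],
    "resource management":       ["YARN", "ZooKeeper"],
    "processing engine":         ["MapReduce", "Spark", "Impala"],
    "security":                  ["Kerberos", "Sentry"],
    "data management":           ["Hive Metastore", "Cloudera Navigator"],
    "tool library":              ["Sqoop", "Oozie"],
    "access interface":          ["Hue", "JDBC/ODBC"],
}

for module, components in CDH_MODULES.items():
    print(f"{module:28s} -> {', '.join(components)}")
```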

Hardware Configuration

Cluster servers are divided into management nodes and worker nodes according to the tasks they undertake. The management roles of each component are generally deployed on management nodes, while storage, container, and compute roles are generally deployed on worker nodes. The specific configuration of the cluster also differs by business type:

1. Real-time stream processing cluster: real-time stream processing on Hadoop places high demands on node memory and CPU, and the message throughput of stream processing based on Spark Streaming increases roughly linearly with the number of nodes.

[Figure: hardware configuration for a real-time stream processing cluster]


2. Online analytics cluster: online analytics services are generally based on MPP SQL engines such as Impala. Complex SQL computations demand large amounts of memory, so 128 GB or more should be configured per node.

[Figure: hardware configuration for an online analytics cluster]


3. Cloud storage cluster: cloud storage workloads center on the storage and computation of massive data and files, emphasizing per-node storage capacity and cost, so relatively cheap SATA hard drives are configured to meet capacity and cost requirements. A back-of-the-envelope sizing sketch for these three profiles follows the list.

[Figure: hardware configuration for a cloud storage cluster]
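As a rough illustration of how these profiles translate into node counts, here is a sizing sketch in Python. The per-node throughput, disk capacity, and overhead figures are assumptions for illustration only; real planning should use numbers measured on your own hardware and workload.

```python
# Back-of-the-envelope cluster sizing sketch. All per-node figures below are
# illustrative planning assumptions, not benchmark results.

def streaming_nodes(msgs_per_sec, msgs_per_sec_per_node=50_000):
    # Spark Streaming throughput scales roughly linearly with node count,
    # so sizing reduces to a division (assumed 50k msg/s per node).
    return -(-msgs_per_sec // msgs_per_sec_per_node)  # ceiling division

def storage_nodes(raw_tb, disk_tb_per_node=48, replication=3,
                  usable_ratio=0.75):
    # Cloud storage clusters are sized by capacity: raw data is multiplied
    # by the HDFS replication factor, and some disk space is reserved for
    # temporary files and OS overhead (usable_ratio).
    required_tb = raw_tb * replication / usable_ratio
    return -(-int(required_tb) // disk_tb_per_node)

print("stream nodes: ", streaming_nodes(msgs_per_sec=400_000))
print("storage nodes:", storage_nodes(raw_tb=500))
```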

Role Assignments

Small Cluster

Small-scale clusters are generally built to support a single dedicated service; limited by the storage and processing capacity of the cluster, they are not suitable for multi-service environments. Such a cluster can be deployed as an HBase cluster, or as an analytics cluster including YARN and Impala. In a small-scale cluster, to make the most of the cluster's storage and processing capacity, nodes are often reused to a high degree. The following figure shows a typical small-scale deployment:

[Figure: typical small-scale cluster deployment]

For roles that need more than two nodes to support HA, a tool node is allocated in the cluster to carry them; other tool roles, which do not consume many resources themselves, can be deployed on it at the same time:

[Figure: roles deployed on the tool node]

The remaining nodes can be deployed as pure worker nodes, hosting roles such as:

[Figure: roles deployed on the worker nodes]
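Since the original figures are images, here is a sketch of what such a heavily reused small-cluster layout might look like, expressed as a Python mapping from host to roles. The hostnames and exact role mix are illustrative assumptions, not Cloudera's official recommendation:

```python
# One possible small-cluster role layout (hostnames and the exact role mix
# are assumptions for illustration). Master and tool roles are stacked on a
# few nodes; the remaining nodes are pure workers.
SMALL_CLUSTER = {
    "master1":  ["NameNode", "JournalNode", "ZooKeeper", "ResourceManager"],
    "master2":  ["NameNode (standby)", "JournalNode", "ZooKeeper",
                 "ResourceManager (standby)"],
    "utility1": ["JournalNode", "ZooKeeper", "Cloudera Manager",
                 "Hive Metastore", "HiveServer2", "Oozie", "Hue"],
    # worker1..workerN: storage and compute roles, reused heavily
    "workerN":  ["DataNode", "NodeManager", "Impala Daemon",
                 "HBase RegionServer"],
}

for host, roles in SMALL_CLUSTER.items():
    print(f"{host:9s}: {', '.join(roles)}")
```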

Medium Cluster

A medium-sized cluster generally has around 20 to 200 nodes, with planned data storage reaching the hundreds of terabytes, which suits the data platform of a medium-sized enterprise or of a business department within a large enterprise. Node reuse can be reduced, and nodes can be divided into management nodes, master nodes, tool nodes, and worker nodes.

[Figure: node division in a medium-sized cluster]

Install Cloudera Manager and the Cloudera Management Service on the management node.

The master nodes host the management roles of the CDH services and the HA components, which can be deployed as follows:

[Figure: roles deployed on the master nodes]

Tool nodes can host some of the following roles:

[Figure: roles deployed on the tool nodes]

The deployment of worker nodes is similar to that of a small cluster:

[Figure: roles deployed on the worker nodes]
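Once the nodes are divided this way, it is useful to audit the actual role-to-host assignments. A minimal sketch using the legacy cm_api Python client for the Cloudera Manager 5.x REST API; the host and credentials are placeholders:

```python
# Sketch: list every role and the host it runs on, to audit the node
# division described above. Assumes the legacy cm_api client (CM 5.x);
# host, credentials, and cluster contents are placeholders.
from cm_api.api_client import ApiResource

api = ApiResource('cm-host.example.com', username='admin', password='admin')

for cluster in api.get_all_clusters():
    print("cluster:", cluster.displayName)
    for service in cluster.get_all_services():
        for role in service.get_all_roles():
            # role.type is e.g. NAMENODE, DATANODE, IMPALAD ...
            print(f"  {service.name:12s} {role.type:20s} on {role.hostRef.hostId}")
```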

Large Cluster

A large-scale cluster generally has more than 200 nodes, with storage capacity in the hundreds of terabytes or even petabytes, which suits a large enterprise building a company-wide data platform. The deployment scheme does not differ much from that of a medium-sized cluster; the main difference is enhanced availability for some master roles.

[Figure: large-scale cluster deployment]

The number of HDFS JournalNodes increases from 3 to 5, the number of ZooKeeper servers and HBase Masters likewise increases from 3 to 5, and the number of Hive Metastores increases from 1 to 3.
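To make the change concrete, here is a sketch of the HA-related configuration properties these counts feed into. The hostnames are placeholders, and Cloudera Manager would normally generate these values itself:

```python
# Sketch of the HA-related properties that grow with cluster scale
# (hostnames and the nameservice ID are placeholders). Shows where the
# 5 JournalNodes, 5 ZooKeeper servers, and 3 Hive Metastores end up.
journal_nodes = [f"jn{i}.example.com" for i in range(1, 6)]  # 5 JournalNodes
zk_servers    = [f"zk{i}.example.com" for i in range(1, 6)]  # 5 ZooKeeper servers
metastores    = [f"ms{i}.example.com" for i in range(1, 4)]  # 3 Hive Metastores

ha_props = {
    # hdfs-site.xml: quorum journal for NameNode HA
    "dfs.namenode.shared.edits.dir":
        "qjournal://" + ";".join(f"{h}:8485" for h in journal_nodes) + "/nameservice1",
    # core-site.xml: ZooKeeper ensemble used for failover coordination
    "ha.zookeeper.quorum": ",".join(f"{h}:2181" for h in zk_servers),
    # hive-site.xml: clients fail over across multiple metastores
    "hive.metastore.uris": ",".join(f"thrift://{h}:9083" for h in metastores),
}

for key, value in ha_props.items():
    print(key, "=", value)
```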

Network Topology

Single-Rack Deployment

In a small cluster, or a single-rack cluster, all nodes are connected to the same access-layer switch stack. The access-layer switches are stacked, making them mutually redundant and increasing switch throughput. The two network interfaces of each node are configured in active/standby or load-balancing mode and connected to the two switches respectively. In this deployment model, the access-layer switches also act as the aggregation layer.

[Figure: single-rack network topology]

Multi-Rack Deployment

In the multi-rack deployment mode, in addition to the access-layer switches, aggregation-layer switches are required to connect the access-layer switches and carry cross-rack data access.

[Figure: multi-rack network topology]
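Cross-rack data placement relies on Hadoop's rack awareness: the net.topology.script.file.name property in core-site.xml points to a script that Hadoop calls with one or more IPs or hostnames as arguments, and that must print one rack path per argument. A minimal Python sketch, with an assumed IP-prefix-to-rack mapping:

```python
#!/usr/bin/env python
# Sketch of a Hadoop rack-awareness topology script. The prefix-to-rack
# mapping below is an illustrative assumption. Hadoop passes one or more
# IPs/hostnames as arguments and expects one rack path per argument on
# stdout; /default-rack is the conventional fallback.
import sys

RACK_MAP = {
    "10.0.1": "/rack1",  # nodes under access switch 1
    "10.0.2": "/rack2",  # nodes under access switch 2
}

for addr in sys.argv[1:]:
    prefix = addr.rsplit(".", 1)[0]  # e.g. "10.0.1.15" -> "10.0.1"
    print(RACK_MAP.get(prefix, "/default-rack"))
```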


Actual Deployment Example

When assigning roles across racks, in order to prevent a failed access-layer switch from making the cluster unavailable, highly available roles need to be deployed under different access-layer switches (note: under different access-layer switches, not merely under different physical racks; customers often connect machines in different physical racks to the same access-layer switch). The following is a physical deployment example of 80 nodes.

[Figure: physical deployment example of 80 nodes]
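A simple way to enforce this rule is to check each HA role's hosts against a host-to-switch map. A sketch in Python; the switch map and role layout below are illustrative assumptions, and in practice both could come from a CMDB or the Cloudera Manager API:

```python
# Sketch: verify that every HA role spans at least two access-layer
# switches. host_to_switch and ha_roles are illustrative assumptions.
host_to_switch = {
    "node01": "sw-a", "node02": "sw-a",
    "node41": "sw-b", "node42": "sw-b",
}

ha_roles = {
    "NameNode":    ["node01", "node41"],
    "JournalNode": ["node01", "node02", "node41"],
    "ZooKeeper":   ["node01", "node02", "node42"],
}

for role, hosts in ha_roles.items():
    switches = {host_to_switch[h] for h in hosts}
    status = "OK" if len(switches) >= 2 else "AT RISK (single switch)"
    print(f"{role:12s} spans {sorted(switches)} -> {status}")
```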

Origin: blog.csdn.net/selectgoodboy/article/details/86747525