Big Data Platform Graduation Project 01: Building an HDP Cluster with Docker

Foreword

Many people have asked me how to approach this kind of "XXXX based on a big data platform" graduation project. You can refer to the article I wrote earlier about my own big data graduation design, Graduation design based on big data platform. This article is a refinement of that earlier work.

Personally, I think such a project can be divided into two parts. The first is building the base platform, for example a Hadoop cluster or a Kafka cluster.

The second part is building the upper-layer applications, such as data analysis on top of the big data platform and visualizations like large-screen dashboards. The former provides the basic platform capabilities that give the whole design its big data character; the latter provides the application layer, which is mainly what lets others see what you have actually done with the platform.

A few days ago, with some spare time, I used Ambari to build an HDP Hadoop cluster on Docker containers running inside a single virtual machine. This article draws on that experience to elaborate on the first part and offer a new approach.

Approach

All kinds of problems came up during the cluster build. Thinking through each problem and looking up material turned out to be quite interesting.

As I wrote in the previous article, the Hadoop platform in my graduation project was built on three virtual machines, using the Apache version of Hadoop.

The drawback of the Apache version is that there is no unified management and control platform:

  1. During installation, you have to manually distribute the installation packages and run the startup commands on each node.
  2. For later node maintenance and for starting and stopping services, you also have to log in to each machine and run commands.

With three virtual machines, every startup took real effort. So I decided to use Ambari to build an HDP Hadoop cluster on Docker containers, something a single virtual machine can handle.
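To make the contrast concrete, here is a hedged sketch of the manual routine the Apache version implies; the hostnames, version, and paths are assumptions, not my original setup.

```bash
# Sketch of the manual Apache-Hadoop routine (hostnames/paths are assumptions).
# Distribute the package to every node by hand:
for node in node2 node3; do
  scp hadoop-2.7.3.tar.gz "$node":/opt/
done

# Then start the services from the master node:
/opt/hadoop-2.7.3/sbin/start-dfs.sh
/opt/hadoop-2.7.3/sbin/start-yarn.sh
# ...and repeat similar steps for ZooKeeper, Kafka, etc. after every reboot.
```

Ambari replaces all of this with a web console and one management agent per node.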

Overall structure

The overall architecture design and technology selection are based on my own needs; treat them as a reference point.

1. Technology selection

Both the host and the Docker containers run CentOS 7. I tried CentOS 8, but it did not work. The main components are:

  1. Docker: containers that replace virtual machine nodes in the cluster.
  2. docker-compose: orchestrates the containers; manages and starts all of them together.
  3. Ambari: version 2.7.3. Visually installs, monitors, and manages the whole cluster.
  4. HDP: version 3.1. Includes Hadoop (HDFS, YARN), Spark, Kafka, ZooKeeper, and other services.
  5. MySQL: Ambari's metadata database; it will also be used later by the applications.

In addition, some shell scripting is required.
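As an example of the glue scripting involved, here is a hedged sketch of registering MySQL as Ambari's metadata database. The credentials are placeholders and the JDBC driver path is an assumption; ambari-server setup walks through the remaining choices interactively.

```bash
# Create Ambari's metadata database and user (credentials are placeholders).
mysql -u root -p <<'SQL'
CREATE DATABASE ambari;
CREATE USER 'ambari'@'%' IDENTIFIED BY 'bigdata';
GRANT ALL PRIVILEGES ON ambari.* TO 'ambari'@'%';
SQL

# Load the MySQL schema that ships with the ambari-server package:
mysql -u ambari -pbigdata ambari \
  < /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql

# Register the JDBC driver, then run the setup wizard and pick
# "Existing MySQL/MariaDB Database" when prompted:
ambari-server setup --jdbc-db=mysql \
  --jdbc-driver=/usr/share/java/mysql-connector-java.jar
ambari-server setup
```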

2. Architecture design

(Architecture diagram)

Platform overview

This is Ambari's home dashboard, where you can see metrics such as HDFS storage and memory usage.

Hadoop cluster

The Hadoop cluster has four nodes: an active NameNode, a standby NameNode, and two DataNodes.

Click the NameNode UI link on the right to open the Hadoop cluster's own web UI.
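You can cross-check what the dashboard shows from the command line. A minimal sketch, assuming the HDFS client is on the PATH inside a cluster container and that the NameNode container is reachable as namenode:

```bash
# Live DataNodes, capacity, and usage, mirroring the dashboard metrics:
hdfs dfsadmin -report

# The NameNode web UI listens on port 9870 in Hadoop 3.x (50070 in 2.x):
curl -s http://namenode:9870/ | head
```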

Cluster nodes

Hosts here is the total number of cluster nodes, which is also the number of Docker containers. Because memory is limited, each container runs several services.

For example, the kafka1 node runs both Kafka and ZooKeeper.

Environment preparation

When I was building the cluster on Docker, 90% of my time went into environment preparation, and likewise 90% of the problems I ran into were in this step.

1. Virtual machine preparation

My architecture uses a single virtual machine, with all other nodes replaced by Docker containers. You can think of Docker as a lightweight virtual machine.

My reasons for choosing Docker:

  1. I find it very interesting, and I wanted to challenge one of my weak spots.
  2. A virtual machine may take up 20 GB of storage, while a Docker image only takes a few hundred MB.
  3. You only need to start one virtual machine; Docker runs on it as an application service.

That said, here I actually recommend using 3 to 4 virtual machines instead. Docker itself is difficult for many people, and turning containers into cluster nodes takes a lot of time.

2. Docker container preparation

If you insist on using Docker, read through this step. Building the node's Docker image here took me many attempts.

Dockerfile

We need to write a Dockerfile based on CentOS 7 to build the system image for the Docker containers. Since the containers replace virtual machines, the environment inside each container must match a virtual machine's. The Dockerfile therefore has to meet the following conditions (a sketch follows the list):

  1. Open port 22 and start the sshd service
  2. Install and configure the JDK and Scala
  3. Generate keys and configure passwordless SSH login
  4. Python 2.7 (ships with CentOS 7)
  5. Install some packages via yum, such as chrony
  6. Configure hosts
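Here is a minimal sketch of such a Dockerfile. The JDK and Scala tarball names and versions are assumptions (they are expected to sit next to the Dockerfile), the package list is illustrative, and hosts entries are left to docker-compose.

```dockerfile
# Minimal sketch: a CentOS 7 image that behaves like a cluster node.
FROM centos:7

# sshd plus basics; chrony keeps node clocks in sync, which Ambari checks.
RUN yum install -y openssh-server openssh-clients sudo chrony which && \
    yum clean all

# Host keys, plus a baked-in root key pair for passwordless SSH between
# nodes (fine for a local demo, not for anything exposed).
RUN ssh-keygen -A && \
    mkdir -p /root/.ssh && chmod 700 /root/.ssh && \
    ssh-keygen -t rsa -f /root/.ssh/id_rsa -N '' && \
    cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys && \
    chmod 600 /root/.ssh/authorized_keys

# JDK and Scala: tarball names/versions here are assumptions.
ADD jdk-8u201-linux-x64.tar.gz /opt/
ADD scala-2.11.12.tgz /opt/
ENV JAVA_HOME=/opt/jdk1.8.0_201
ENV PATH=$PATH:$JAVA_HOME/bin:/opt/scala-2.11.12/bin

EXPOSE 22
# Run sshd in the foreground so the container stays alive.
CMD ["/usr/sbin/sshd", "-D"]
```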

At the Dockerfile-writing stage I consulted a lot of material and rebuilt the image over and over before it finally worked.

docker-compose

docker-compose is an orchestration tool for Docker containers. You write a yaml configuration file, and then all the containers can be started or stopped together with docker-compose start/stop.

Here centos_hdp is the image I built myself; ports exposes the container's ports, and volumes mounts host directories into the container.
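A minimal sketch of such a docker-compose.yml, showing just two of the nodes. The hostnames, ports, and mount paths are assumptions; the real file declares every node in the cluster.

```yaml
version: "2"

services:
  ambari:
    image: centos_hdp          # the image built from the Dockerfile above
    hostname: ambari
    container_name: ambari
    ports:
      - "8080:8080"            # Ambari web UI on the host
      - "9870:9870"            # NameNode web UI (Hadoop 3.x)
    volumes:
      - ./data/ambari:/data    # persist data on the host
    networks:
      - hdp

  kafka1:
    image: centos_hdp
    hostname: kafka1
    container_name: kafka1
    networks:
      - hdp

networks:
  hdp:
    driver: bridge
```

docker-compose up -d brings the whole set up in the background; docker-compose stop and docker-compose start manage them afterwards.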

3. Download the installation packages

In my 2016 graduation project, every component of the big data platform was downloaded and installed separately: the Hadoop package from the Hadoop website, the Kafka package from the Kafka website, and so on, in whatever version you wanted.

With an Ambari-based installation, all components come in the HDP installation package, but that package is quite large, about 10 GB.

ambari-2.7.3.0-centos7.tar.gz
HDP-3.1.0.0-centos7-rpm.tar.gz
HDP-UTILS-1.1.0.22-centos7.tar.gz
HDP-GPL-3.1.0.0-centos7-gpl.tar.gz

The above is the list of required installation packages. After downloading them, put them on a locally hosted HTTP server, which Ambari will pull from during installation.
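A hedged sketch of hosting the packages locally; the directory, port, and resulting URLs are assumptions, and any static file server would do.

```bash
# Unpack the downloaded repos into a directory served over HTTP:
mkdir -p /var/www/html/repo
for f in ambari-2.7.3.0-centos7.tar.gz \
         HDP-3.1.0.0-centos7-rpm.tar.gz \
         HDP-UTILS-1.1.0.22-centos7.tar.gz \
         HDP-GPL-3.1.0.0-centos7-gpl.tar.gz; do
  tar -xzf "$f" -C /var/www/html/repo
done

# Python 2's built-in server is the quickest option on CentOS 7:
cd /var/www/html/repo && python -m SimpleHTTPServer 8000
```

In Ambari's cluster install wizard, replace the public repository base URLs with the corresponding http://<host-ip>:8000/... paths so everything installs from the local mirror.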

Epilogue

This article mainly covered the architecture design and implementation approach for building a big data cluster; later articles will discuss building the upper-layer applications. I am also learning front-end development now and want to implement some web applications myself. On big data cluster construction, backend implementation, and front-end technology, feel free to message me privately and join the group to exchange ideas.

Building Hadoop with Ambari on Docker is hard; if you try it, proceed carefully.

Thank you for every encounter
