Super simple CDH6 deployment and experience (stand-alone version)

Why it is super simple

With ansible, most of the CDH6 deployment work is automated, which reduces the chance of manual errors. In this walkthrough we run ansible playbooks on a computer with ansible installed (macOS or Linux), remotely operate a CentOS server, deploy CDH6 on it, and verify that the deployment succeeded.

Learning ansible

If you want to learn more about ansible, please refer to "Ansible2.4 Installation and Experience".

Why deploy standalone CDH6

A single-node CDH6 is mainly useful as a learning and development environment for big data technology; it is not suitable for production.

Overview of this walkthrough

This walkthrough covers deployment, startup, and verification. The whole process is shown in the following figure:
Insert picture description here

Full text outline

This article consists of the following chapters:

  1. Environmental information
  2. Download file
  3. File placement
  4. CDH server settings
  5. ansible parameter settings
  6. Deployment
  7. Restart the CDH server
  8. Start
  9. Settings
  10. Fix problems (HDFS and YARN)
  11. Experience HDFS and Spark

Environmental information

The setup for this walkthrough is shown in the following figure. A MacBook Pro with ansible 2.9 installed acts as the ansible server and runs the playbook scripts, which remotely operate a CentOS server to deploy and start CDH6:
Insert picture description here
In the figure above, the computer with the blue background can run macOS or Linux. The computer with the yellow background, which runs CDH6, must run CentOS 7.7 (sorry, with my limited resources I have not tried other systems).

The versions used in this walkthrough are as follows:

  1. ansible server: macOS Catalina 10.15 (CentOS7.7 was also successfully tested)
  2. CDH server: CentOS Linux release 7.7.1908
  3. cm version: 6.1.0
  4. parcel version: 6.1.1
  5. jdk version: 8u191

Download file (ansible server)

All the files used in this walkthrough are listed in the following table:

No.  File name                                             Description
1    jdk-8u191-linux-x64.tar.gz                            Linux JDK installation package
2    mysql-connector-java-5.1.34.jar                       JDBC driver for MySQL
3    cloudera-manager-server-6.1.0-769885.el7.x86_64.rpm   CM server installation package
4    cloudera-manager-daemons-6.1.0-769885.el7.x86_64.rpm  CM daemons installation package
5    cloudera-manager-agent-6.1.0-769885.el7.x86_64.rpm    CM agent installation package
6    CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel             CDH offline installation package (parcel)
7    CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha         Checksum file for the CDH offline installation package
8    hosts                                                 Remote host configuration used by ansible; records the CDH6 server's information
9    ansible.cfg                                           Configuration used by ansible
10   cdh-single-install.yml                                ansible playbook used to deploy CDH
11   cdh-single-start.yml                                  ansible playbook used to start CDH for the first time

Download addresses for the 11 files above:

  1. jdk-8u191-linux-x64.tar.gz: available from Oracle's official website. I have also packaged jdk-8u191-linux-x64.tar.gz and mysql-connector-java-5.1.34.jar together and uploaded them to CSDN, so both can be downloaded at once: https://download.csdn.net/download/boling_cavalry/12098987
  2. mysql-connector-java-5.1.34.jar: available from the Maven central repository, or from the same CSDN package as item 1.
  3. cloudera-manager-server-6.1.0-769885.el7.x86_64.rpm: https://archive.cloudera.com/cm6/6.1.0/redhat7/yum/RPMS/x86_64/cloudera-manager-server-6.1.0-769885.el7.x86_64.rpm
  4. cloudera-manager-daemons-6.1.0-769885.el7.x86_64.rpm: https://archive.cloudera.com/cm6/6.1.0/redhat7/yum/RPMS/x86_64/cloudera-manager-daemons-6.1.0-769885.el7.x86_64.rpm
  5. cloudera-manager-agent-6.1.0-769885.el7.x86_64.rpm: https://archive.cloudera.com/cm6/6.1.0/redhat7/yum/RPMS/x86_64/cloudera-manager-agent-6.1.0-769885.el7.x86_64.rpm
  6. CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel: https://archive.cloudera.com/cdh6/6.1.1/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel
  7. CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha: https://archive.cloudera.com/cdh6/6.1.1/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha256 (after downloading, change the extension from .sha256 to .sha)
  8. hosts, ansible.cfg, cdh-single-install.yml, cdh-single-start.yml: these four files are in my GitHub repository at https://github.com/zq2599/blog_demos. The repository contains several folders; the files are in the folder named ansible-cdh6-single, as shown in the red box below:
    Insert picture description here
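
Before copying the large parcel to the server, it can be worth confirming that the download is intact. The following is a minimal Python sketch, assuming the downloaded .sha256 file (renamed to .sha) holds the hex SHA-256 digest of the parcel as its first whitespace-separated token. The function names sha256_of and parcel_matches are made up for this illustration:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parcel_matches(parcel_path, sha_path):
    """Compare the parcel's digest with the first token of the .sha file.

    Assumption: the .sha file stores a hex SHA-256 digest as its first token.
    """
    with open(sha_path, "r", encoding="ascii") as f:
        tokens = f.read().split()
    if not tokens:
        return False
    return sha256_of(parcel_path) == tokens[0].lower()


# Example (file names from the table above):
# parcel_matches("CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel",
#                "CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha")
```

If the function returns False, downloading the parcel again is probably cheaper than debugging a failed distribution later.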

File placement (ansible server)

Once you have downloaded the 11 files above, place them as follows so that the deployment can complete successfully:

  1. Create a new folder named playbooks under the home directory: mkdir ~/playbooks
  2. Put these four files into the playbooks folder: hosts, ansible.cfg, cdh-single-install.yml, cdh-single-start.yml
  3. Create a new subfolder named cdh6 in the playbooks folder;
  4. Put the remaining seven files into the cdh6 folder: jdk-8u191-linux-x64.tar.gz, mysql-connector-java-5.1.34.jar, cloudera-manager-server-6.1.0-769885.el7.x86_64.rpm, cloudera-manager-daemons-6.1.0-769885.el7.x86_64.rpm, cloudera-manager-agent-6.1.0-769885.el7.x86_64.rpm, CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel, CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha
  5. After placement, the directories and files look like the figure below. Again, the playbooks folder must be placed in the home directory (i.e. ~/):
    Insert picture description here
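
Since a misplaced file only shows up as a failure partway through the playbook, a quick pre-flight check can save time. The sketch below lists whichever of the 11 files are missing from the layout described above; missing_files is a hypothetical helper written for this article, not part of ansible:

```python
from pathlib import Path

# Files expected directly in ~/playbooks
TOP_LEVEL = [
    "hosts",
    "ansible.cfg",
    "cdh-single-install.yml",
    "cdh-single-start.yml",
]

# Files expected in ~/playbooks/cdh6
CDH6 = [
    "jdk-8u191-linux-x64.tar.gz",
    "mysql-connector-java-5.1.34.jar",
    "cloudera-manager-server-6.1.0-769885.el7.x86_64.rpm",
    "cloudera-manager-daemons-6.1.0-769885.el7.x86_64.rpm",
    "cloudera-manager-agent-6.1.0-769885.el7.x86_64.rpm",
    "CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel",
    "CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha",
]


def missing_files(root):
    """Return the expected files (relative to root) that do not exist."""
    root = Path(root)
    expected = [root / name for name in TOP_LEVEL]
    expected += [root / "cdh6" / name for name in CDH6]
    return [str(p.relative_to(root)) for p in expected if not p.is_file()]


# Example: an empty list means the layout is complete.
# print(missing_files(Path.home() / "playbooks"))
```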

CDH server settings

In this walkthrough, the CDH server's hostname is deskmini and its IP address is 192.168.50.134. The following operations are required:

  1. Make sure the CDH server accepts SSH logins (user name + password);
  2. SSH to the machine where CDH will be deployed;
  3. Check that the /etc/hostname file is correct, as shown below:
    Insert picture description here
  4. Edit the /etc/hosts file and map the server's own IP address to its hostname, as shown in the red box below. This step turns out to be very important: if you skip it, deployment may get stuck at the "distribution" stage, and the agent log shows the parcel download progress staying at zero percent:
    Insert picture description here
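
Because a missing mapping only shows up much later as a stalled parcel distribution, it may help to check the file right away. A minimal sketch, assuming the usual /etc/hosts layout of an IP address followed by one or more names; hosts_entry_ok is a hypothetical helper name:

```python
def hosts_entry_ok(hosts_text, hostname, ip):
    """Return True if some line of the hosts file maps ip to hostname."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if fields[0] == ip and hostname in fields[1:]:
            return True
    return False


# Example, run on the CDH server:
# with open("/etc/hosts") as f:
#     print(hosts_entry_ok(f.read(), "deskmini", "192.168.50.134"))
```

The check deliberately rejects a line that only maps the hostname to 127.0.0.1, since the step above asks for the server's own IP address.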

ansible parameter settings (ansible server)

Setting the ansible parameters is simple: just record the information of the machine where CDH will be deployed (IP address, login account, password, and so on) in the ~/playbooks/hosts file. The content is as follows; change deskmini, ansible_host, ansible_port, ansible_user, and ansible_password to match your own environment:

[cdh_group]
deskmini ansible_host=192.168.50.134 ansible_port=22 ansible_user=root ansible_password=888888

Deployment (ansible server)

  1. Enter the ~/playbooks directory;
  2. Check that ansible can operate the CDH server remotely. Run the command ansible deskmini -a "free -m"; if everything is normal, it displays the memory information of the CDH server, as shown below:
    Insert picture description here
  3. Run this command to start the deployment: ansible-playbook cdh-single-install.yml
  4. The deployment involves time-consuming operations such as online installation and file transfer, so please be patient (about half an hour). I ran into a network problem during deployment and the playbook failed and exited. Once the network is back to normal, simply re-run the command above; ansible keeps the operations idempotent, so re-running is safe;
  5. A successful deployment looks like this:
    Insert picture description here
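
Because re-running the playbook is safe, the retry after a network failure can also be automated. The sketch below is only an illustration: run_until_success is a hypothetical helper, not part of ansible, and it simply re-invokes the ansible-playbook command from step 3 until it exits with status 0. The attempt count and delay are arbitrary defaults:

```python
import subprocess
import time


def run_until_success(cmd, attempts=3, delay_s=30.0, run=subprocess.run):
    """Run cmd until it exits with status 0.

    Returns the number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. The run parameter exists so a
    fake runner can be injected for testing.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # give the network time to recover
    raise RuntimeError(f"{cmd!r} still failing after {attempts} attempts")


# Example (run from ~/playbooks on the ansible server):
# run_until_success(["ansible-playbook", "cdh-single-install.yml"])
```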

Restart the CDH server

The deployment modifies the selinux and swap settings, which only take effect after the operating system restarts, so please restart the CDH server.

Start (ansible server)

  1. Wait for the CDH server to restart successfully;
  2. Log in to the ansible server and enter the ~/playbooks directory;
  3. Execute this command to start initializing the database and then start CDH: ansible-playbook cdh-single-start.yml
  4. After the startup is complete, the following information is output:
    Insert picture description here
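
The web page of Cloudera Manager may need a little while before it answers. As a hedged convenience, the sketch below polls a TCP port until it accepts connections; wait_for_port is a hypothetical helper, the timeouts are arbitrary, and 7180 is the CM web port used in the next section:

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=5.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=3):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example: wait up to ten minutes for the CM web page.
# wait_for_port("192.168.50.134", 7180)
```

An open port only shows that something is listening; the login page itself is still the real confirmation.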

Settings (Webpage)

CDH has now started, and the CDH server exposes a web interface that can be operated from a browser:

  1. Open http://192.168.50.134:7180 in a browser, as shown below; the account and password are both admin:
    Insert picture description here
  2. Keep clicking Next; on the edition selection page, choose the 60-day trial:
    Insert picture description here
  3. On the host selection page, you can see deskmini:
    Insert picture description here
  4. Select the CDH version in the red box below. Because the corresponding offline package has already been copied to CM's local repository, nothing needs to be downloaded:
    Insert picture description here
  5. The download completes instantly; wait for distribution, unpacking, and activation:
    Insert picture description here
  6. On the service selection page, I chose Data Engineering here because Spark is required:
    Insert picture description here
  7. On the machine selection page, choose deskmini for everything:
    Insert picture description here
  8. On the database setup page, keep the settings the same as in the figure below: the database host is localhost, and for each database the database name, user name, and password are identical, namely hive, amon, rman, oozie, and hue:
    Insert picture description here
  9. On the parameter setting page, adjust the storage paths to suit your disks. For example, my /home directory has plenty of space, so I moved the paths under /home:
    Insert picture description here
  10. Wait for the startup to complete:
    Insert picture description here
  11. After the startup completes, the page looks like this:
    Insert picture description here
    At this point, all services have been started, but there are two minor problems that need to be fixed.

Fix HDFS issues

  1. The overall status of the services is shown in the figure below. The HDFS service reports a problem; click the icon in the red box:
    Insert picture description here
  2. Click the item in the red box below:
    Insert picture description here
  3. The details of the fault are shown in the following figure. It is the common under-replication problem (not enough replicas):
    Insert picture description here
  4. As shown in the figure below, change the HDFS configuration dfs.replication from 3 to 1, then save the change:
    Insert picture description here
  5. Restart the service:
    Insert picture description here
  6. After the change above, the replication factor is set to 1, but the replica count of the existing files has not been updated and must be set again. SSH into the deskmini machine;
  7. Run the command vi /etc/passwd and find the entry for the hdfs account, as shown in the red box below. A shell such as /sbin/nologin makes switching to the hdfs account fail:
    Insert picture description here
  8. Change the content in the red box above to /bin/bash; after the change it looks like the red box below:
    Insert picture description here
  9. Run the command su - hdfs to switch to the hdfs account, then run the following command to set the replica count:
hadoop fs -setrep -R 1 /
  10. All services are now normal:
    Insert picture description here

Adjust YARN parameters to avoid spark-shell startup failure

  1. By default, the memory that YARN allocates to a container is too small, which makes spark-shell fail to start. The YARN memory parameters need to be adjusted:
    Insert picture description here
  2. On the YARN configuration page, adjust the two parameters yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb. I set both to 8G here (please adjust according to your computer's actual hardware), as shown below:
    Insert picture description here
  3. Restart YARN:
    Insert picture description here
  4. Before running the spark-shell command, first run su - hdfs to switch to the hdfs account;
  5. This time spark-shell finally enters interactive mode successfully:
    Insert picture description here
    At this point, CDH6 deployment, startup, and settings are complete. Next, let's try out the big data services.
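
Why does a small container limit stop spark-shell from starting? A rough sketch of the arithmetic, assuming Spark's documented defaults on YARN (1024 MB of executor memory plus a memory overhead of 10% with a 384 MB minimum; these numbers come from the Spark documentation, not from this article). The container request has to fit under yarn.scheduler.maximum-allocation-mb, and the 1024 MB limit in the usage note is only an illustrative value, not a measured CDH default:

```python
MIN_OVERHEAD_MB = 384     # assumed: Spark's documented minimum memory overhead
OVERHEAD_FRACTION = 0.10  # assumed: Spark's documented default overhead factor


def container_request_mb(executor_memory_mb):
    """Memory YARN is asked for: executor memory plus overhead."""
    overhead = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FRACTION))
    return executor_memory_mb + overhead


def fits(executor_memory_mb, max_allocation_mb):
    """True if the request fits under yarn.scheduler.maximum-allocation-mb."""
    return container_request_mb(executor_memory_mb) <= max_allocation_mb


print(container_request_mb(1024))  # 1024 + 384 = 1408
print(fits(1024, 1024))            # hypothetical small limit
print(fits(1024, 8192))            # the 8G setting chosen above
```

With these assumptions a default executor asks for 1408 MB, so a 1024 MB limit is too small while the 8G setting leaves plenty of room.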

Experience HDFS and Spark

Next, let's run a Spark task, the classic WordCount:

  1. Prepare a text file with English content. You can download this file: https://raw.githubusercontent.com/zq2599/blog_demos/master/files/GoneWiththeWind.txt
  2. Log in via SSH and switch to the hdfs account;
  3. Create an HDFS folder:
hdfs dfs -mkdir /input
  4. Upload the text file to the /input directory:
hdfs dfs -put ./GoneWiththeWind.txt /input
  5. Run the spark-shell command to enter interactive mode;
  6. Enter the following command to run a WordCount task (192.168.50.134 is the IP address of deskmini):
sc.textFile("hdfs://192.168.50.134:8020/input/GoneWiththeWind.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://192.168.50.134:8020/output")
  7. After it finishes, download the result files:
hdfs dfs -get /output/*
  8. The command above downloads the Spark result files part-00000 and part-00001 to the local machine. View them with vi; as the following figure shows, WordCount ran successfully:
    Insert picture description here
  9. View the job history in a browser at http://192.168.50.134:18088, where you can see the details of this task:
    Insert picture description here
    So far, CDH6 deployment, setup, and trial are complete. If you are setting up your own learning or development environment, I hope this article gives you a useful reference.
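
For readers new to Spark, the one-line spark-shell command above may be easier to follow next to a plain-Python sketch of the same counting logic. This is an illustration on an in-memory list of lines, not the Spark job itself: word_count is a hypothetical name, nothing is distributed, and edge cases such as trailing spaces may be tokenized slightly differently than by Scala's split:

```python
from collections import Counter


def word_count(lines):
    """Count words the way the job does: split on a single space, then sum."""
    counts = Counter()
    for line in lines:
        # flatMap(_.split(" ")) followed by map((_, 1)) and reduceByKey(_ + _)
        counts.update(line.split(" "))
    return dict(counts)


print(word_count(["a b a", "b c"]))
```

Each entry of the resulting dictionary corresponds to one (word, count) pair in the part-00000 and part-00001 files.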

In-depth customization

Although this walkthrough avoids much of the manual work of a traditional deployment, its drawback is also obvious: all paths, file names, and service versions are hard-coded and cannot be configured. ansible does support variables, but too many variables would become a burden of their own. So if you need to change a version or a path, I recommend editing cdh-single-install.yml and cdh-single-start.yml directly; all the file and version information is in there.

You are welcome to follow my public account: programmer Xinchen

Origin blog.csdn.net/boling_cavalry/article/details/105356266