Hadoop pseudo-distributed deployment-CentOS

Written up front: the blogger is a "little pig" who moved from hands-on development work into training. The nickname comes from "Peng Peng" (Pumbaa) in the cartoon "The Lion King", who always faces everything around him with an optimistic, positive attitude. My technical path has led from Java full-stack engineering into big data development and data mining, where I have had some modest success. I would like to share with you what I have picked up along the way, and I hope it helps you on your own learning journey. Through this effort the blogger also hopes to build up a complete technical library; any exceptions, errors, and precautions related to the technical points in the article will be listed at the end, and contributions of material in any form are welcome.

  • Please point out any errors in the article; I will make sure to correct them promptly.
  • If you have any questions you would like to discuss and learn about, please contact me: [email protected].
  • The style of the articles varies from column to column, and each is self-contained; corrections for any shortcomings are welcome.

Hadoop pseudo-distributed deployment-CentOS

Keywords in this article: Hadoop, pseudo-distributed, installation and deployment, CentOS

I. Introduction to Hadoop

The Hadoop software library is a framework that uses a simple programming model to perform distributed processing of large data sets across clusters.

1. Hadoop development history and ecosystem

  • Hadoop originated from the Apache Nutch project, which started in 2002 as one of the sub-projects of Apache Lucene.
  • In February 2006, it became a complete, independent piece of software and was named Hadoop.
  • In January 2008, Hadoop became a top-level Apache project.
  • In July 2009, MapReduce and HDFS became independent sub-projects of Hadoop.
  • In May 2010, Avro graduated from the Hadoop project and became a top-level Apache project.
  • In May 2010, HBase graduated from the Hadoop project and became a top-level Apache project.
  • In September 2010, Hive graduated from the Hadoop project and became a top-level Apache project.
  • In September 2010, Pig graduated from the Hadoop project and became a top-level Apache project.
  • In January 2011, ZooKeeper graduated from the Hadoop project and became a top-level Apache project.
  • In December 2011, Hadoop 1.0.0 was released.
  • In October 2012, Impala joined the Hadoop ecosystem.
  • In October 2013, Hadoop 2.2.0, the first generally available release of the 2.x line, was released.
  • In February 2014, Spark became a top-level Apache project.
  • In December 2017, Hadoop 3.0.0 was released.

2. Hadoop core functions and advantages

  • Distributed storage system: HDFS

HDFS is the abbreviation of Hadoop Distributed File System. It is one of the core projects in the Hadoop ecosystem and the basis of data storage management in distributed computing.

  • Distributed computing framework: MR

MapReduce is a computing model whose core idea is "divide and conquer"; it can be used for massively parallel computation over terabyte-scale data sets. The Map phase produces intermediate results in the form of key-value pairs; the Reduce phase then processes all of the "values" that share the same "key" among those intermediate results to obtain the final result.
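As a rough analogy only (an ordinary shell pipeline, not Hadoop itself; input.txt is a placeholder file name), a word count follows the same map / shuffle / reduce shape:

# "Map": split the text into one word per line (each word acts as a key),
# "shuffle/sort": group identical keys together, then
# "reduce": count the occurrences of each key.
tr -s ' ' '\n' < input.txt | sort | uniq -c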

  • Resource management platform: YARN

YARN (Yet Another Resource Negotiator) is Hadoop's resource manager. It provides unified resource management and scheduling for upper-layer applications, improving cluster resource utilization, unified management, and data sharing.

  • High scalability

Hadoop is a highly scalable storage platform that can store and distribute data across clusters of hundreds of inexpensive servers operating in parallel. It breaks the limitation of traditional relational databases, which cannot handle data at this volume, and can provide computing power over terabyte-scale data.

  • Low cost

Hadoop can combine inexpensive machines into server clusters to distribute and process data, keeping costs low; learners and ordinary users can easily deploy a Hadoop environment on their own PCs.

  • High efficiency

Hadoop processes data tasks concurrently and can move data between nodes, keeping the load on each node dynamically balanced.

  • Fault tolerance

Hadoop automatically maintains multiple copies of data; if a computing task fails, Hadoop can reassign and reprocess the work of the failed node.

3. Introduction to deployment methods

  • Stand-alone mode

Stand-alone (local) mode is the simplest installation mode. Because Hadoop itself is written in Java, it can run as long as the Java environment variables are configured. With this deployment method we do not need to modify any configuration files or start any services; we only need to unpack the archive and configure the environment variables.
Although the configuration is simple, very little can be done with it: since no daemons are running, services such as distributed data storage and resource scheduling are unavailable. It is, however, convenient for testing MapReduce programs, as in the example below.
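For reference, a minimal local-mode test in the spirit of the official quick start might look like this (it assumes Hadoop 2.9.2 has been unpacked and the commands are run from inside the unpacked directory):

# Use some of the shipped configuration files as input text
mkdir input
cp etc/hadoop/*.xml input
# Run the bundled example job: search the input for strings matching the regular expression
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'
# Results are written to the local output directory
cat output/*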

  • Pseudo-distribution pattern

Pseudo-distributed mode is the most commonly used mode in the learning phase: all processes run on the same machine. In this mode you can simulate the behavior of fully distributed mode and carry out essentially all of its operations; pseudo-distributed mode is a special case of fully distributed mode.

  • Fully distributed mode

In fully distributed mode, the master node and worker nodes are specified in the configuration files, and you can decide which machines run which services to balance cost and efficiency. Enterprises mainly adopt the fully distributed mode, with clusters ranging from dozens to hundreds of nodes. In the learning phase, if your personal PC is powerful enough, several virtual machines can be used instead.

II. Hadoop download

As a software learner and developer, you should cultivate the good habit of going to the official website to look things up, and stay away from one-click installers, "software managers", and the like; keep everything under your own control and hold yourself to a rigorous standard. Keep at it!

1. Download URL

Searching for Hadoop on Baidu, the first couple of results show the website we need. Hadoop currently belongs to the Apache Foundation, so when opening the website, make sure the domain is apache.org.

After entering the Hadoop official website, click Download to open the download interface: https://hadoop.apache.org/releases.html .

2. Version selection

We are using the open-source community edition; the current mainstream versions are 2.x.y and 3.x.y.

When choosing a Hadoop version, we should consider compatibility with the other software in the ecosystem. There are two general approaches:

  • Manually select the version and build according to the compatibility requirements of each component
  • Use CDH (Cloudera's Distribution Including Apache Hadoop) to automatically select the version and solve compatibility issues

In the learning phase, since the operations involved are relatively simple, there is no need to pay special attention to version compatibility, but it is recommended to understand and practice both approaches.

3. Download the installation package

This article uses version 2.9.2 for the demonstration. Source is the source code and Binary is the package we need; click the binary link of the corresponding version to enter the download page.

Click any mirror address to start the download, for example: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz .
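If you prefer to download directly on the server, the same package can also be fetched from the Apache archive with wget (the URL below assumes version 2.9.2):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz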

III. Installation steps

1. Prerequisite environment

Before configuring Hadoop, the JDK must be configured, and any historical versions need to be uninstalled before installing. For detailed steps, please refer to another article of mine: JDK decompression installation-CentOS.

  • Query historical versions (if none are found, skip the next step)
rpm -qa|grep java
rpm -qa|grep jdk
  • Uninstall historical versions (run as the root user)
rpm -e --nodeps <full package name, obtained from the query above>
  • Unpack
tar -zvxf jdk-8u251-linux-x64.tar.gz
  • Configure environment variables (global configuration as an example; run as the root user)
vi /etc/profile

# Add the following at the end of the file
JAVA_HOME=/opt/jdk1.8.0_251
PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

export JAVA_HOME
export PATH
  • Refresh environment variables
source /etc/profile
  • Use command to test
java -version

2. Hadoop installation

For the Hadoop software, a separate user is usually created to manage it. The following steps use an ordinary user named hadoop as an example.

# Create the hadoop user
useradd hadoop
# Set a password for the hadoop user
passwd hadoop
# Switch to the hadoop user
su - hadoop
  • Unpack

Open a new session as the hadoop user and upload the Hadoop software package.

tar -zxvf hadoop-2.9.2.tar.gz
  • Configure environment variables (user variables as an example)
vi ~/.bash_profile

# Add the following at the end of the file
HADOOP_HOME=/home/hadoop/hadoop-2.9.2
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export HADOOP_HOME
export PATH
  • Refresh environment variables
source ~/.bash_profile
  • Use command to test
hadoop version

3. Hadoop configuration

To run Hadoop in pseudo-distributed mode, the configuration must be modified first, and the daemons (processes running in the background) then started.

  • Path of the configuration files

The configuration files that need to be modified are located in the etc/hadoop folder under the Hadoop installation directory.

$HADOOP_HOME/etc/hadoop

  • hadoop-env.sh

This file is called and executed when Hadoop starts. JAVA_HOME must be set in it to the installation location of the Java environment Hadoop depends on (line 25 of the file); make sure there is no pound sign (#) before the export.

export JAVA_HOME=/opt/jdk1.8.0_251
  • core-site.xml

Hadoop core configuration file; all configuration items can be found in the official documentation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml .

<configuration>
    <!-- Directory for Hadoop temporary files; defaults to a path under /tmp -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadoop-2.9.2/data</value>
    </property>
    <!-- URI of the NameNode (protocol, host name, and port) -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop:9000</value>
    </property>
    <!-- Time (in minutes) before files deleted to the trash are permanently removed; defaults to 0 -->
    <property>
        <name>fs.trash.interval</name>
        <value>60</value>
    </property>
</configuration>
  • hdfs-site.xml

The core configuration file of HDFS; all configuration items can be found in the official documentation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml .

<configuration>
    <!-- Number of block replicas; defaults to 3 -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- Disable permission checking (convenient for development and learning); defaults to true -->
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <!-- HTTP address and port of the NameNode web UI -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop:50070</value>
    </property>
</configuration>
  • mapred-site.xml (renamed from mapred-site.xml.template)

The configuration file for Hadoop's computation (MapReduce) module; all configuration items can be found in the official documentation: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml .
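Since Hadoop 2.x ships this file only as a template, it typically has to be copied (or renamed) first, assuming you are in $HADOOP_HOME/etc/hadoop:

cp mapred-site.xml.template mapred-site.xml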

<configuration>
    <!-- Run MapReduce jobs on yarn; the default is local, and classic is also an option -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
  • yarn-site.xml

The configuration file for YARN resource scheduling; all configuration items can be found in the official documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-common/yarn-default.xml .

<configuration>
    <!-- Host name of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop</value>
    </property>
    <!-- Amount of physical memory, in MB, that can be allocated to containers; -1 means allocate automatically; the default is 8192 MB -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1536</value>
    </property>
    <!-- List of auxiliary services run on the NodeManager; set to mapreduce_shuffle for MapReduce, separating multiple services with commas -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
  • slaves

The configuration file listing the slave (worker) nodes; it is recommended to configure it with host names.

hadoop

4. SSH password-free login configuration

The SSH service is installed and started by default on CentOS. Configuring password-free (key-based) login makes Hadoop much more convenient to use.

  • Configure host name mapping (run as the root user)

To make maintenance and use easier, host names are used in the configuration files, so before starting, make sure the host name can be resolved to an IP address. Be careful not to map it to 127.0.0.1; if you previously configured stand-alone mode, this needs to be modified.

# Check the IP address of this machine
ifconfig
# Edit the host name mapping file and add the mapping at the end
vi /etc/hosts

172.16.147.128 hadoop
# After configuring, verify with the ping command (Ctrl + C to stop)
ping hadoop



  • Generate a key pair
# Just keep pressing Enter through all of the prompts
ssh-keygen -t rsa
  • Configure password-free login to this machine
# The first "hadoop" is the user name and the second is the host name; adjust them to your actual setup
ssh-copy-id hadoop@hadoop
# Enter the hadoop user's password once to complete the verification
  • Verify with the remote login command
# A host-verification prompt may appear on the first login; type yes and press Enter. If you are logged in without being asked for a password, the configuration succeeded
ssh hadoop@hadoop

5. Cluster startup and confirmation

  • Format namenode

Hadoop needs to be initialized the first time it is used; this operation only has to be performed once. After it completes, the corresponding folders are created automatically in the directory specified in core-site.xml.

hdfs namenode -format
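As an optional sanity check (the path below follows the hadoop.tmp.dir value configured earlier; the NameNode metadata is placed under dfs/name by default), the newly created metadata directory can be listed:

ls /home/hadoop/hadoop-2.9.2/data/dfs/name/current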
  • Start the Hadoop process

Since the environment variables have been configured, you can directly execute the script from the sbin directory: start-all.sh; to stop, execute stop-all.sh.

start-all.sh
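In Hadoop 2.x, start-all.sh is marked as deprecated; if you prefer, the HDFS and YARN daemons can be started separately, and the web interfaces provide an extra check (the NameNode port matches the hdfs-site.xml configuration above, the ResourceManager port is YARN's default):

# Start HDFS (NameNode, DataNode, SecondaryNameNode) and YARN (ResourceManager, NodeManager) separately
start-dfs.sh
start-yarn.sh
# Web UIs: NameNode at http://hadoop:50070, ResourceManager at http://hadoop:8088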
  • jps command verification

jps is a command that comes with the JDK; it lists the Java processes of the current user. If it is not available, check the JDK environment variable configuration.

jps


If 5 processes appear (not counting jps itself), the startup succeeded: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. Note that if a process disappears shortly after startup, something is wrong with it, and you need to investigate according to the log files.
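As a further optional check that MapReduce on YARN works end to end, the bundled wordcount example can be run against HDFS; the input and output paths below are only illustrative:

# Put some text files into HDFS as input
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input
# Run the bundled wordcount example on YARN
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /user/hadoop/input /user/hadoop/output
# Inspect the result
hdfs dfs -cat /user/hadoop/output/part-r-00000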

