Introduction to Big Data (1): Hadoop pseudo-distributed installation

1 Introduction

1.1 Definition of Big Data

Big data refers to data sets so large that they far exceed the capabilities of traditional database software tools for acquisition, storage, management, and analysis. It is commonly described by four characteristics (the 4Vs): massive data volume, fast data flow, diverse data types, and low value density. IBM extends this to 5Vs: Volume (scale), Velocity (speed), Variety (diversity), Value (low value density), and Veracity (authenticity).

1.2 Introduction to Hadoop Ecosystem

2 Hadoop pseudo-distributed installation

2.1 Install CentOS 7 on VMware

There are plenty of virtual machine installation tutorials online, so this article won't go into the details. Here is a blog post I followed at the time: VMware installation of CentOS 7, a super detailed walkthrough.

2.2 Early preparations

2.2.1 Set the virtual machine IP address

Because the virtual machine obtains its IP address automatically by default, the address may change as the network environment changes. To avoid this, we first assign a fixed IP address.

Click the network icon in the upper right corner of the desktop --> Wired --> Wired Settings --> the settings (gear) icon, then select IPv4 and enter the IP address in the Addresses column (note: the IP address should preferably be in the same network segment as the host).
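If you prefer the command line, you can instead edit the interface configuration file directly. Below is a minimal sketch assuming the interface is named ens33 and the 192.168.1.x addressing used later in this article; adjust the names and addresses to your own network.

# Edit the config file for your network interface (the name, e.g. ens33, may differ)
vim /etc/sysconfig/network-scripts/ifcfg-ens33

# Example static settings (values are illustrative, adjust to your network)
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.11
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
DNS1=192.168.1.1

# Restart networking to apply the change
systemctl restart network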

2.2.2 Modify the host name

To permanently change the hostname, use the following shell command to set it to hadoop0.

hostnamectl set-hostname hadoop0

Then map the local IP address to the hostname in the hosts file:

vim /etc/hosts

# Add the following line at the end of the file
192.168.1.11 hadoop0
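You can verify that both changes took effect:

hostname             # should print hadoop0
ping -c 1 hadoop0    # should resolve to 192.168.1.11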

2.2.3 Connect to the virtual machine via SSH

Before connecting to the virtual machine via SSH, check whether port 22 is open. If it is not, refer to an online tutorial to enable it.

netstat -tunlp | grep 22
# or check the status of the sshd service
systemctl status sshd
# the older service form also works on CentOS 7
service sshd status

It is recommended to connect with SSH client software such as MobaXterm; WinSCP is handy for file transfers.
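Note that Hadoop's start scripts also use SSH to launch the daemons on the local machine, so it is worth configuring passwordless SSH to the node itself now. A minimal sketch (press Enter at each ssh-keygen prompt to accept the defaults):

ssh-keygen -t rsa
ssh-copy-id hadoop0
ssh hadoop0    # should now log in without asking for a password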

2.2.4 Install JDK

Download jdk-8u291-linux-x64.tar.gz from the official website, then use WinSCP or MobaXterm to upload it to the /usr/local directory of CentOS 7. Switch to /usr/local with the cd command and extract the archive with tar.

cd /usr/local/
tar -xvf jdk-8u291-linux-x64.tar.gz

After extraction is complete, you can use the following commands to configure the environment variables.

# Rename the extracted directory to jdk
mv jdk1.8.0_291/ jdk

# Then add the JDK installation directory /usr/local/jdk to the PATH variable in /etc/profile
vim /etc/profile
export JAVA_HOME=/usr/local/jdk
export PATH=$PATH:$JAVA_HOME/bin

# Make the environment variables take effect
source /etc/profile

Run the java -version command to check the JDK version number and confirm that the JDK is configured correctly.
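A quick sanity check after reloading /etc/profile:

java -version    # should report version 1.8.0_291
which java       # should resolve to /usr/local/jdk/bin/java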

2.3 Install Hadoop

2.3.1 Download Hadoop

Download hadoop-3.0.0.tar.gz from the official website and use WinSCP or MobaXterm to upload it to the /usr/local directory of CentOS 7 in preparation for installation.

cd /usr/local/
tar -xvf hadoop-3.0.0.tar.gz
mv hadoop-3.0.0 hadoop

2.3.2 Configure environment variables

vim /etc/profile

# Configure environment variables
export JAVA_HOME=/usr/local/jdk
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Set the user accounts for Hadoop's five daemons
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

# Make the environment variables take effect immediately
source /etc/profile
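After reloading the profile, you can verify that the Hadoop binaries are on the PATH:

hadoop version    # should report Hadoop 3.0.0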

2.3.3 Configure hadoop-env.sh

The hadoop-env.sh file stores the global settings used by all Hadoop shell commands, such as JAVA_HOME and HADOOP_CONF_DIR, and runtime variables such as the Hadoop heap size and JVM memory options.

Switch to the directory /usr/local/hadoop/etc/hadoop/ where the Hadoop configuration file is located, and then modify the JDK path of hadoop-env.sh.

cd /usr/local/hadoop/etc/hadoop/

vim hadoop-env.sh

# Around line 37, uncomment the JAVA_HOME line and set it as follows
export JAVA_HOME=/usr/local/jdk

2.3.4 Configure core-site.xml

vim core-site.xml

<!-- Configure the HDFS access URL -->
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop0:9000/</value>
                <description>NameNode URI</description>
        </property>
</configuration>

For more configuration information, see core-default.xml

2.3.5 Configure hdfs-site.xml

Configure the storage paths for the NameNode's metadata and the DataNode's block data, as well as the HTTP ports of the NameNode and SecondaryNameNode. For a single-node pseudo-distributed setup you may also want to set dfs.replication to 1, since the default replication factor of 3 cannot be satisfied with only one DataNode.

<configuration>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///usr/local/hadoop/data/datanode</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///usr/local/hadoop/data/namenode</value>
        </property>
        <property>
                <name>dfs.namenode.http-address</name>
                <value>hadoop0:50070</value>
        </property>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>hadoop0:50090</value>
        </property>
</configuration>

For more configuration information, see hdfs-default.xml

2.3.6 Configure yarn-site.xml

Configure YARN's NodeManager and ResourceManager, including the shuffle auxiliary service and the ResourceManager's service ports.

<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>hadoop0:8025</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>hadoop0:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>hadoop0:8050</value>
        </property>

</configuration>

For more configuration information, see yarn-default.xml

2.3.7 Format and start Hadoop

At this point the configuration of Hadoop is complete, but we need to format the NameNode before starting Hadoop for the first time.

hdfs namenode -format

If no error is reported (the output typically includes a "successfully formatted" message), the formatting succeeded.

Note: if Hadoop reports errors during use, or fails to start, you may need to reformat. To reformat, stop Hadoop, delete the data and logs folders under the Hadoop installation directory, and then format again:

stop-all.sh
cd /usr/local/hadoop/
rm -rf data/ logs/
hdfs namenode -format

Use the start-all.sh command to start all of Hadoop's processes; similarly, stop-all.sh stops them all.

2.3.8 Verifying Hadoop

We can use the jps command to view the Hadoop-related processes (jps is a small tool provided by the JDK for viewing current Java processes; the name is short for Java Virtual Machine Process Status Tool).
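On a pseudo-distributed node started with start-all.sh, jps should list the five daemons configured earlier, plus jps itself (the PIDs will differ):

jps
# NameNode
# DataNode
# SecondaryNameNode
# ResourceManager
# NodeManager
# Jps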

You can use Hadoop commands to view the files on HDFS:

hadoop fs -ls /

At present there are no files on HDFS; HDFS itself will be covered in detail in the next section.
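If you want to confirm that HDFS is writable, you can create a throwaway directory (the name /test here is arbitrary) and list it:

hadoop fs -mkdir /test    # create a test directory
hadoop fs -ls /           # /test should now appear
hadoop fs -rm -r /test    # clean up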

You can also check on the running Hadoop services through a browser: the NameNode web UI is at http://hadoop0:50070 (as configured in hdfs-site.xml above), and the YARN ResourceManager UI defaults to port 8088.

2.3.9 Conclusion

To close, here is a link to the official Hadoop documentation: Hadoop 3.0 official documentation

 
