Big Data 01 - Hadoop 3.3.1 pseudo-distributed installation

Hadoop

Introduction

  Hadoop is an open-source distributed computing platform under the Apache Software Foundation, which provides users with a distributed infrastructure whose low-level system details are kept transparent. Hadoop is developed in the Java language, has good cross-platform support, and can be deployed on clusters of inexpensive computers.
  The core of Hadoop is the distributed file system HDFS (Hadoop Distributed File System) and MapReduce. HDFS is an open-source implementation of the Google File System (GFS). It is a distributed file system designed for ordinary hardware, with high read/write throughput, good fault tolerance and scalability, and support for large-scale distributed data storage; its redundant storage of data keeps the data safe. MapReduce is an open-source implementation of Google's MapReduce, which lets users develop parallel applications without knowing the low-level details of the distributed system. Using MapReduce to process data on the distributed file system keeps data analysis and processing efficient. With the help of Hadoop, programmers can easily write distributed parallel programs and run them on clusters of inexpensive computers to store and compute over massive amounts of data.
  Hadoop is widely recognized as the industry's standard open-source big data software, providing massive data processing capability in a distributed environment. Almost all mainstream vendors, including Google, Yahoo, Microsoft, Cisco and Taobao, provide development tools, open-source software, commercial tools or technical services around Hadoop.

Characteristics

  Hadoop is a software framework for the distributed processing of large amounts of data in a reliable, efficient, and scalable way. It has the following characteristics:

High reliability: with redundant data storage, even if one copy fails, the other copies can keep serving requests normally; Hadoop's bit-by-bit storage and processing of data has proven trustworthy.
Efficiency: as a parallel distributed computing platform, Hadoop builds on two core techniques, distributed storage and distributed processing, and can efficiently handle petabyte-scale data. Hadoop can move data dynamically between nodes and keeps each node dynamically balanced, so processing is very fast.
High scalability: Hadoop is designed to run efficiently and stably on clusters of inexpensive computers and can scale out to thousands of nodes.
High fault tolerance: data is stored redundantly, multiple copies are saved automatically, and failed tasks are automatically reassigned.
Low cost: Hadoop runs on clusters of inexpensive computers, so the cost is low, and ordinary users can easily set up a Hadoop environment on their own PC. Compared with all-in-one appliances, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, which greatly reduces a project's software cost.
Runs on the Linux platform: Hadoop is developed in the Java language and runs well on Linux.
Support for multiple programming languages: applications on Hadoop can also be written in other languages such as C++.

The only officially supported operating platform for Hadoop is Linux

Prerequisite environment configuration

Download links

VMware download address:
https://customerconnect.vmware.com/cn/downloads/details?downloadGroup=WKST-1701-WIN&productId=1376&rPId=100679
Ubuntu download address:
https://ftp.sjtu.edu.cn/ubuntu-cd/22.04.1/ubuntu-22.04.1-desktop-amd64.iso
jdk8 download address:
https://www.oracle.com/java/technologies/downloads/#java8
Hadoop download address:
http://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz

Install VMware

(Screenshots of the VMware installation steps omitted.)

Create a virtual machine

(Screenshots of the virtual machine creation wizard omitted.)

Install VMware Tools

VMware Tools was not detected in the virtual machine and the "Reinstall VMware Tools" menu item was grayed out, so I shut the virtual machine down first and then installed VMware Tools manually.


Then I answered y to every prompt. Partway through, an error dialog suddenly popped up; I am not sure what the problem was, so I just clicked Cancel and ignored it, and in the end the installer reported that VMware Tools was installed successfully.

# cd to your own user directory
cd /home/xx
# Unpack the installer archive
tar -zxvf VMwareTools-10.3.23-17030940.tar.gz
# Enter the unpacked folder
cd vmware-tools-distrib
# Grant execute permission
sudo chmod 777 vmware-install.pl
# Run the installer (it is actually a Perl install script)
sudo ./vmware-install.pl
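
After rebooting the virtual machine, you can check whether the tools are actually running. A small sketch, assuming the standard VMware Tools command-line utility was installed by the script above:

# Should print the installed VMware Tools version if the installation succeeded
vmware-toolbox-cmd -v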

Shared folder

For convenience, set up a shared folder.
Create a new folder and complete the configuration.

Details: when installing samba, apt initially reported an error:
The following packages have unmet dependencies:

samba: Depends: python3 (< 3.11) but 3.10.6-1~22.04 is to be installed
       Depends: samba-common (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed
       Depends: samba-common-bin (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed
       Depends: python3:any but it is a virtual package
       Depends: libwbclient0 (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed
       Depends: samba-libs (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed


# Install the Samba server
sudo apt-get install samba

# Set an SMB password for user xx
sudo smbpasswd -a xx
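
The share itself is defined in Samba's configuration file. As a rough sketch, assuming the shared directory is /home/xx/文档/share2 (the path used later in this post) and the Samba user is xx, a minimal share section could be appended to /etc/samba/smb.conf as follows; adjust the names and paths to your own setup:

# Append a minimal [share2] definition to smb.conf (path, share name and user are assumptions)
sudo tee -a /etc/samba/smb.conf > /dev/null <<'EOF'
[share2]
   path = /home/xx/文档/share2
   browseable = yes
   read only = no
   valid users = xx
EOF

# Restart the Samba service so the new share takes effect
sudo systemctl restart smbd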

Press Win+R on the Windows host and open the share; a dialog box pops up, and you enter the user name and password just set with smbpasswd (I forgot to take a screenshot of this step).
After that, both Windows 10 and Ubuntu can access the shared folder share2.

Install Java

# Unpack `jdk-8u311-linux-x64.tar.gz` from the /home/xx/文档/share2 directory into the `/opt` directory
sudo tar -xzvf /home/xx/文档/share2/jdk-8u311-linux-x64.tar.gz -C /opt

# Rename the jdk1.8.0_311 directory to java with the following command:
sudo mv /opt/jdk1.8.0_311/ /opt/java

# Change the owner of the java directory:
sudo chown -R xx:xx /opt/java

Modify system environment variables

# Open the /etc/profile file
sudo vim /etc/profile

# Add the following lines at the end of the file
#java
export JAVA_HOME=/opt/java
export PATH=$JAVA_HOME/bin:$PATH

Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and exit.

# Run the following command so that the environment variables take effect
source /etc/profile

# After the command above, the `java` commands under the `JAVA_HOME` directory are available. Verify the installation by checking the version number
java -version
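
As an extra sanity check, a small sketch assuming the variables above were added to /etc/profile and sourced: confirm that the java on the PATH is the one unpacked under /opt/java.

# Both should point at the JDK under /opt/java
echo $JAVA_HOME
which java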


SSH login permission setting

  For Hadoop's pseudo-distributed and fully distributed modes, the Hadoop name node (NameNode) has to start the Hadoop daemons on every machine in the cluster, which is done through SSH login. Hadoop does not provide a way to type in SSH passwords during this process, so for the NameNode to log in to each machine smoothly, every machine in the cluster must be configured so that the name node can reach it over SSH without a password.

# To enable passwordless SSH login, first let the NameNode generate its own SSH key pair
ssh-keygen -t rsa # When prompted, just keep pressing Enter

# Send the public key to the other machines in the cluster: append the contents of id_rsa.pub to the ~/.ssh/authorized_keys file on every machine that should accept passwordless SSH login
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# or, equivalently:
cat /home/xx/.ssh/id_rsa.pub >> /home/xx/.ssh/authorized_keys

# If ssh refuses the connection, install the SSH server
sudo apt-get install openssh-server

# Use the ssh localhost command to check whether a password is still required
ssh localhost
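
If the key was appended correctly, ssh localhost should open a shell without asking for a password. As an alternative to the cat command above, the ssh-copy-id helper that ships with OpenSSH performs the same appending; a minimal sketch:

# Appends ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on the target host (here, localhost)
ssh-copy-id localhost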


Hadoop pseudo-distributed installation

Install the stand-alone version of Hadoop

Place hadoop-3.3.1.tar.gz in a location of your choice, for example a folder under /home/xx/文档/share2. Make sure the owner and group of the folder are both your Hadoop user (xx in this walkthrough).

# Install Hadoop
# Unpack `hadoop-3.3.1.tar.gz` into the `/opt` directory
sudo tar -xzvf /home/xx/文档/share2/hadoop-3.3.1.tar.gz -C /opt/

# For convenience, rename `hadoop-3.3.1` to `hadoop`
sudo mv /opt/hadoop-3.3.1/ /opt/hadoop

# Change the owner and group of the hadoop directory
sudo chown -R xx:xx /opt/hadoop


# Modify the system environment variables
# Open the `/etc/profile` file
sudo vim /etc/profile

Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and exit.

# At the end of the file, add the following

#hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
# Make the environment variables take effect
source /etc/profile

# Check the version number to verify that the installation succeeded
hadoop version


# For the stand-alone installation, first edit the `hadoop-env.sh` file, which configures the environment variables Hadoop runs with
cd /opt/hadoop/
vim etc/hadoop/hadoop-env.sh

Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and exit.

# At the end of the file, add the following
export JAVA_HOME=/opt/java/

The Hadoop distribution also comes with some example programs that we can run to check whether the installation succeeded; here we run the grep example. The steps are as follows:

  1. Create a new folder input under the /opt/hadoop/ directory to hold the input data;
  2. Copy the configuration files under the etc/hadoop/ folder into the input folder;
  3. Run the grep example, which writes its results into a new output folder under the hadoop directory;
  4. View the contents of the output data.
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input output 'dfs[a-z.]+'
cat output/*

The output means that across all of the copied configuration files only one word matches the regular expression dfs[a-z.]+, and that word is printed together with its count.
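
For reference, with unmodified default configuration files the only match is usually the word dfsadmin, so the output of cat output/* typically looks like the line below; treat the exact count as dependent on your own configuration files:

# Typical contents of output/* for the grep example
1       dfsadmin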

Hadoop pseudo-distributed installation

A pseudo-distributed installation simulates a small cluster on a single machine. Whenever Hadoop runs as a cluster, whether pseudo-distributed or truly distributed, the components have to be told how to work together through configuration files.
For the pseudo-distributed configuration, we need to modify four files: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.

# Open the `core-site.xml` file
vim /opt/hadoop/etc/hadoop/core-site.xml

Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and exit.

Add the following configuration between the <configuration> and </configuration> tags

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

The format of the core-site.xml configuration file is very simple: the <name> tag gives the name of a configuration item, and the <value> tag gives its value. For this file we only need to specify the address and port number of HDFS; following the official documentation, the port number is set to 9000.

# Open the `hdfs-site.xml` file
vim /opt/hadoop/etc/hadoop/hdfs-site.xml

# Add the following configuration between the `<configuration>` and `</configuration>` tags
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

For the hdfs-site.xml file we set replication to 1, which is the minimum value Hadoop allows for the number of copies of the same data block in the HDFS file system (the default is 3).

# Open the `mapred-site.xml` file
vim /opt/hadoop/etc/hadoop/mapred-site.xml

# Add the following configuration between the `<configuration>` and `</configuration>` tags
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>


# Modify the yarn-site.xml configuration file
vim /opt/hadoop/etc/hadoop/yarn-site.xml

# Add the following configuration between the `<configuration>` and `</configuration>` tags
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

For this experiment, the configuration above is all that is required. The official documentation describes the remaining configuration items in detail for anyone who is interested: https://hadoop.apache.org/docs/stable

After the configuration is complete, the file system needs to be initialized first. Since many of Hadoop's tasks are carried out on its built-in HDFS file system, HDFS must be initialized before any further computing tasks can run. The initialization command is as follows:

# Format the distributed file system
hdfs namenode -format
# Start all of Hadoop's processes. The startup messages show that all startup information is written to the corresponding log files; if a startup error occurs, check the relevant error log
/opt/hadoop/sbin/start-all.sh

# View the Hadoop processes: this lists all `Java` processes
jps
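
On a healthy pseudo-distributed setup, jps should list roughly the following daemons; the PIDs below are placeholders and the order may differ:

# Example jps output (PIDs are placeholders)
12001 NameNode
12135 DataNode
12320 SecondaryNameNode
12502 ResourceManager
12688 NodeManager
12899 Jps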

Hadoop WebUI management interface

You can view Hadoop (YARN ResourceManager) information by visiting http://localhost:8088 in a browser.
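
If you prefer the terminal, a quick check that the ResourceManager web UI is reachable (assuming the default port 8088) is:

# Prints the beginning of the YARN ResourceManager cluster page if the UI is up
curl -s http://localhost:8088/cluster | head -n 20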

Test HDFS cluster and MapReduce task program

Use the WordCount sample program that comes with Hadoop to check the cluster. Perform the following operations on the master node to create the HDFS directories required for running MapReduce tasks:

hadoop fs -mkdir /user
hadoop fs -mkdir /user/xx
hadoop fs -mkdir /input
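
You can list the HDFS root to confirm that the directories were created:

# Should show the /input and /user directories that were just created
hadoop fs -ls /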

Create a test file

vim /home/xx/test

Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and exit.

In the test file, add the following content:

Hello world!
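
Equivalently, the test file can be created with a single shell command (same path as above), without opening an editor:

# Write the test line directly to the file
echo 'Hello world!' > /home/xx/test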


# Upload the test file to the Hadoop HDFS cluster directory
hadoop fs -put /home/xx/test /input

# Run the wordcount program
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /out

# View the execution result:
hadoop fs -ls /out
# The result contains a `_SUCCESS` file, which shows that the Hadoop cluster job ran successfully.

# View the actual output
hadoop fs -text /out/part-r-00000
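
With the single-line test file above, the word counts are easy to predict; the output should look like the two lines below (word and count, separated by a tab):

# Expected contents of /out/part-r-00000 for the "Hello world!" input
Hello	1
world!	1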


Learning reference

https://github.com/datawhalechina/juicy-bigdata
