Table of contents
Hadoop
Introduction
Hadoop is an open-source distributed computing platform under the Apache Software Foundation, which provides users with a distributed infrastructure whose underlying details are transparent. Hadoop is developed in the Java language, has good cross-platform portability, and can be deployed on clusters of inexpensive computers.
The core of Hadoop is the distributed file system HDFS (Hadoop Distributed File System) and MapReduce. HDFS is an open-source implementation of the Google File System (GFS). It is a distributed file system designed for ordinary hardware environments, with high read/write throughput, good fault tolerance and scalability, and support for large-scale data; its redundant storage of data copies protects data safety well. MapReduce is an open-source implementation of Google MapReduce, which lets users develop parallel applications without knowing the underlying details of the distributed system. Using MapReduce to process data on the distributed file system keeps data analysis and processing efficient. With the help of Hadoop, programmers can easily write distributed parallel programs and run them on cheap computer clusters to store and process massive amounts of data.
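The map/shuffle/reduce flow can be sketched with ordinary shell tools (a rough analogy only, not how Hadoop is actually invoked): splitting text into words plays the role of the map phase, `sort` stands in for the shuffle that groups identical keys, and `uniq -c` acts as the reduce that counts each group.

```shell
# A word-count "MapReduce" in miniature, using only coreutils:
#   map:     emit one word per line
#   shuffle: sort brings identical keys together
#   reduce:  uniq -c counts each group
printf 'hello world\nhello hadoop\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c
```

Hadoop performs the same grouping and aggregation, but spreads the map and reduce work across the nodes of a cluster.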
Hadoop is widely recognized as the industry's standard open-source software for big data, providing massive data processing capability in a distributed environment. Almost all mainstream vendors, such as Google, Yahoo, Microsoft, Cisco, and Taobao, provide development tools, open-source software, commercial tools, or technical services around Hadoop.
Characteristics
Hadoop is a software framework capable of distributed processing of large amounts of data, and it processes data in a reliable, efficient, and scalable manner. It has the following characteristics:
High reliability: thanks to redundant data storage, even if one copy fails, the other copies keep serving requests normally, so Hadoop's storage and processing of data can be trusted.
Efficiency: as a parallel distributed computing platform, Hadoop combines its two core techniques of distributed storage and distributed processing to handle PB-scale data efficiently. Hadoop can move data dynamically between nodes and keep the load across nodes balanced, so processing is very fast.
High scalability: Hadoop is designed to run efficiently and stably on cheap computer clusters, and can be extended to thousands of computer nodes.
High fault tolerance: redundant data storage is adopted, multiple copies of data are automatically saved, and failed tasks can be automatically redistributed.
Low cost: Hadoop runs on clusters of inexpensive computers, so the cost is low, and ordinary users can easily set up a Hadoop environment on their own PC. Compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.
Running on the Linux platform: Hadoop is developed based on the Java language and can run well on the Linux platform.
Support for multiple programming languages: Applications on Hadoop can also be written in other languages such as C++.
The only officially supported operating platform for Hadoop is Linux
Prerequisite environment configuration
download link
VMWare download address:
https://customerconnect.vmware.com/cn/downloads/details?downloadGroup=WKST-1701-WIN&productId=1376&rPId=100679
Ubuntu download address:
https://ftp.sjtu.edu.cn/ubuntu-cd/22.04.1/ubuntu-22.04.1-desktop-amd64.iso
jdk8 download address:
https://www.oracle.com/java/technologies/downloads/#java8
Hadoop download address:
http://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Install VMware
create virtual machine
Install VMware Tools
VMware Tools was not found in the virtual machine and the "Reinstall VMware Tools" option was grayed out, so I tried installing VMware Tools manually, shutting down the virtual machine first.
Then I answered y to every prompt. Partway through, an error popped up; I don't know what the problem was, so I just ignored it and clicked Cancel, and in the end the installer reported that VMware Tools was installed successfully.
# cd to your own user directory
cd /home/xx
# Unpack the installer archive
tar -zxvf VMwareTools-10.3.23-17030940.tar.gz
# Enter the unpacked folder
cd vmware-tools-distrib
# Grant execute permission
sudo chmod 777 vmware-install.pl
# Run the installer (it is really just a Perl install script)
sudo ./vmware-install.pl
Shared folder
For convenience, set up a shared folder.
Create a new folder and complete the configuration.
Details:
The following packages have unmet dependencies:
samba: Depends: python3 (< 3.11) but 3.10.6-1~22.04 is to be installed
Depends: samba-common (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed
Depends: samba-common-bin (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed
Depends: python3:any but it is a virtual package
Depends: libwbclient0 (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed
Depends: samba-libs (= 2:4.15.13+dfsg-0ubuntu1) but 2:4.15.13+dfsg-0ubuntu1 is to be installed
sudo apt-get install samba
sudo smbpasswd -a xx
Press Win+R to open the Run dialog; a box pops up where you enter the user name and password just set (forgot to take a screenshot).
Both Windows 10 and Ubuntu can then access the shared folder share2.
Install Java
# Unpack `jdk-8u311-linux-x64.tar.gz` from the /home/xx/文档/share2 directory into the `/opt` directory
sudo tar -xzvf /home/xx/文档/share2/jdk-8u311-linux-x64.tar.gz -C /opt
# Rename the jdk1.8.0_311 directory to java with the following command:
sudo mv /opt/jdk1.8.0_311/ /opt/java
# Change the owner of the java directory:
sudo chown -R xx:xx /opt/java
Modify system environment variables
# Open the /etc/profile file
sudo vim /etc/profile
# Append the following at the end of the file
#java
export JAVA_HOME=/opt/java
export PATH=$JAVA_HOME/bin:$PATH
Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and quit.
# Run the following command to make the environment variables take effect
source /etc/profile
# After the command above, the `java` commands can be found via the `JAVA_HOME` directory. Verify the installation by checking the version number
java -version
java -version
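Prepending `$JAVA_HOME/bin` matters: it puts this JDK ahead of any system-wide Java on the lookup path. A quick sanity check of the ordering (using the /opt/java path chosen above):

```shell
# After the two export lines, the first directory searched on PATH
# should be the JDK's bin directory
JAVA_HOME=/opt/java
PATH=$JAVA_HOME/bin:$PATH
echo "$PATH" | cut -d: -f1   # → /opt/java/bin
```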
SSH login permission setting
For Hadoop's pseudo-distributed and fully distributed modes, the Hadoop name node (NameNode) needs to start the Hadoop daemon processes on all machines in the cluster, which it does over SSH. Hadoop provides no way to type an SSH password during this process, so for startup to proceed smoothly, every machine must be configured so that the name node can log in to it via SSH without a password.
# To enable passwordless SSH login, first have the NameNode generate its own SSH key
ssh-keygen -t rsa # After running this command, just press Enter at every prompt
# Send the public key to the other machines in the cluster: append the contents of id_rsa.pub to the ~/.ssh/authorized_keys file on each machine that should allow passwordless SSH login
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Or run the following command:
cat /home/datawhale/.ssh/id_rsa.pub >> /home/datawhale/.ssh/authorized_keys
# If ssh refuses the connection, install the SSH server
sudo apt-get install openssh-server
# Use the ssh localhost command to check whether a password is still required.
ssh localhost
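If `ssh localhost` still asks for a password after the key is in place, a common culprit is file permissions: sshd refuses to use `authorized_keys` when the `.ssh` directory or the file itself is writable by group or others. Tightening them is a low-risk fix (a sketch; run as the same user that will log in):

```shell
# sshd ignores keys in ~/.ssh if the permissions are too open
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh                    # directory: owner only
chmod 600 ~/.ssh/authorized_keys    # key file: owner read/write only
```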
Hadoop pseudo-distributed installation
Install the stand-alone version of Hadoop
Place hadoop-3.3.1.tar.gz in a location of your choice, such as a folder under /home/xx/文档/share2. Note that both the user and the group of the folder must be hadoop.
# Install Hadoop
# Unpack `hadoop-3.3.1.tar.gz` into the `/opt` directory
sudo tar -xzvf /home/xx/文档/share2/hadoop-3.3.1.tar.gz -C /opt/
# For convenience, also rename `hadoop-3.3.1` to `hadoop`
sudo mv /opt/hadoop-3.3.1/ /opt/hadoop
# Change the owner and group of the hadoop directory
sudo chown -R xx:xx /opt/hadoop
# Modify the system environment variables
# Open the `/etc/profile` file
sudo vim /etc/profile
Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and quit.
# Append the following at the end of the file
#hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
# Make the environment variables take effect
source /etc/profile
# Verify the installation by checking the version number
hadoop version
# For a stand-alone installation, first edit the `hadoop-env.sh` file, which configures the environment variables Hadoop runs with
cd /opt/hadoop/
vim etc/hadoop/hadoop-env.sh
Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and quit.
# Append the following at the end of the file
export JAVA_HOME=/opt/java/
Hadoop also ships with example programs that can be run to test the installation, such as WordCount; here we run the grep example to check whether Hadoop was installed successfully. The steps are as follows:
- Create a new folder named input under the /opt/hadoop/ directory to hold the input data;
- Copy the configuration files under the etc/hadoop/ folder into the input folder;
- Run the grep example from the examples jar, which writes its results into a new output folder under the hadoop directory (the job creates this folder itself, so it must not already exist);
- View the contents of the output data.
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input output 'dfs[a-z.]+'
cat output/*
The output means that, among all the configuration files, only one word matches the regular expression dfs[a-z.]+, and that match is printed together with its count.
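The pattern dfs[a-z.]+ matches "dfs" followed by one or more lowercase letters or dots. It can be tried outside Hadoop with grep -E (the sample strings below are made up for illustration):

```shell
# -E enables extended regular expressions; -o prints only the matching part.
# The Hadoop grep example applies the same pattern to the copied XML files.
printf 'dfsadmin\nhdfs.xml\nnothing here\n' | grep -oE 'dfs[a-z.]+'
# → dfsadmin
# → dfs.xml
```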
Hadoop pseudo-distributed installation
A pseudo-distributed installation simulates a small cluster on a single machine. Whenever Hadoop runs as a cluster, whether pseudo-distributed or truly distributed, the cooperation of the components must be set up through configuration files.
For the pseudo-distributed configuration, we need to modify four files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
# Open the `core-site.xml` file
vim /opt/hadoop/etc/hadoop/core-site.xml
Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and quit.
Add the following configuration between the <configuration> and </configuration> tags:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
The format of the core-site.xml configuration file is simple: the <name> tag gives the name of a configuration item, and the <value> tag sets its value. For this file, we only need to specify the address and port number of HDFS; the port is set to 9000 following the official documentation.
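A quick way to double-check the saved value without starting Hadoop is to grep the file. The snippet below runs against a temporary copy so it is self-contained; on a real install, point the grep at /opt/hadoop/etc/hadoop/core-site.xml instead.

```shell
# Write a sample core-site.xml mirroring the configuration above,
# then extract the configured HDFS address from it
cat > /tmp/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
grep -A1 '<name>fs.defaultFS</name>' /tmp/core-site.xml \
  | grep -oE 'hdfs://[^<]+'
# → hdfs://localhost:9000
```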
# Open the `hdfs-site.xml` file
vim /opt/hadoop/etc/hadoop/hdfs-site.xml
# Add the following configuration between the `<configuration>` and `</configuration>` tags
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
For the hdfs-site.xml file, we set the replication value to 1, the minimum number of copies of the same piece of data that Hadoop keeps in the HDFS file system.
# Open the `mapred-site.xml` file
vim /opt/hadoop/etc/hadoop/mapred-site.xml
# Add the following configuration between the `<configuration>` and `</configuration>` tags
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
# Modify the yarn-site.xml configuration file
vim /opt/hadoop/etc/hadoop/yarn-site.xml
# Add the following configuration between the `<configuration>` and `</configuration>` tags
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
For this experiment, the configuration above is enough to run everything. The official documentation describes the remaining configuration items in detail for interested readers (URL: https://hadoop.apache.org/docs/stable).
After the configuration is complete, the file system must first be initialized. Since many Hadoop tasks run on the built-in HDFS file system, HDFS must be initialized before further computing tasks can be performed. The initialization command is as follows:
# Format the distributed file system
hdfs namenode -format
# Start all Hadoop processes. The prompts show that all startup information is written to the corresponding log files; if a startup error occurs, check the matching error log
/opt/hadoop/sbin/start-all.sh
# View the Hadoop processes: lists all `Java` processes
jps
Hadoop WebUI management interface
You can view Hadoop information by visiting the web interface at http://localhost:8088.
Test HDFS cluster and MapReduce task program
Use the WordCount sample program that ships with Hadoop to check the cluster. Run the following on the master node to create the HDFS directories needed to execute MapReduce tasks:
hadoop fs -mkdir /user
hadoop fs -mkdir /user/xx
hadoop fs -mkdir /input
Create a test file
vim /home/xx/test
Press i to enter insert mode and add the content at the end of the file;
press the Esc key to stop editing;
type :wq and press Enter to save and quit.
Add the following to the test file:
Hello world!
# Upload the test file to the Hadoop HDFS cluster directory
hadoop fs -put /home/xx/test /input
# Run the wordcount program
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /out
# View the result of the run:
hadoop fs -ls /out
# The listing contains a `_SUCCESS` file, which indicates that the Hadoop cluster ran the job successfully.
# View the concrete output
hadoop fs -text /out/part-r-00000