Big Data Offline Phase 02: Apache Hadoop

Introduction to Hadoop

Hadoop is an open-source software framework implemented in Java under the Apache Software Foundation. It is a platform for developing and running large-scale data processing, and it allows distributed processing of large datasets across large clusters of computers using a simple programming model.

In a narrow sense, Hadoop refers to the Apache open-source framework itself, whose core components are:

  • HDFS (Hadoop Distributed File System): solves massive data storage
  • YARN (a framework for job scheduling and cluster resource management): solves resource and task scheduling
  • MapReduce (a distributed computing programming framework): solves massive data computation
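
To make the division of labor concrete, here is a minimal command-line sketch of the three components cooperating on an already running cluster, using the examples jar that ships with Hadoop 3.3.0 (the jar path and the input file words.txt are illustrative; adjust them for your installation): HDFS holds the input and output, YARN schedules the job, and MapReduce performs the computation.

	# store the input in HDFS
	hdfs dfs -mkdir -p /wordcount/input
	hdfs dfs -put words.txt /wordcount/input
	# submit the bundled word-count job; YARN schedules it across the cluster
	hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /wordcount/input /wordcount/output
	# read the result back from HDFS
	hdfs dfs -cat /wordcount/output/part-r-00000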

In a broad sense, Hadoop usually refers to a broader concept - the Hadoop ecosystem.

Hadoop has since grown into a huge system. As the ecosystem grows, more and more new projects have emerged, some of which are not managed by Apache. These projects are good supplements to Hadoop or provide higher-level abstractions.

A brief history of Hadoop development

Hadoop was created by Doug Cutting, the founder of Apache Lucene. It originated from Nutch, a sub-project of Lucene. Nutch's design goal was to build a large-scale search engine for the entire web, including functions such as web crawling, indexing, and querying. However, as the number of crawled web pages grew, it ran into a serious scalability problem: how to store and index billions of web pages.

In 2003, Google published a paper that offered a feasible solution to this problem. The paper described a piece of Google's production architecture, the Google File System (GFS), which could meet the storage needs of the large files generated during web crawling and indexing.

In 2004, Google published a paper introducing its MapReduce system to the world.

Meanwhile, the Nutch developers completed corresponding open-source implementations of HDFS and MapReduce, which were then split out of Nutch into the independent project Hadoop. In January 2008, Hadoop became a top-level Apache project, ushering in its period of rapid development.

In 2006, Google published a paper on BigTable, which prompted the development of HBase.

Therefore, the development of Hadoop and its ecosystem is inseparable from Google's contributions.

Hadoop cluster construction

Release versions

Hadoop distributions are divided into open-source community editions and commercial editions.

The community edition is the version maintained by the Apache Software Foundation, the officially maintained version line.

https://hadoop.apache.org/

The commercial editions of Hadoop are versions released by third-party companies on the basis of the community edition, with modifications, integration, and compatibility testing of the various service components. The better-known ones include Cloudera's CDH, MapR, and Hortonworks (HDP).

https://www.cloudera.com/products/open-source/apache-hadoop/key-cdh-components.html

Hadoop's versioning is unusual: multiple branches are developed in parallel. At a high level, there are three major version series: 1.x, 2.x, and 3.x.

Hadoop 1.0 consists of the distributed file system HDFS and the offline computing framework MapReduce. Its architecture is outdated and has been phased out.

Hadoop 2.0 includes the distributed file system HDFS, the resource management system YARN, and the offline computing framework MapReduce. Compared with Hadoop 1.0, it is more powerful, offers better scalability and performance, and supports multiple computing frameworks.

Hadoop 3.0 adds a series of functional enhancements over Hadoop 2.0. It has now stabilized, although some ecosystem components may not yet have been upgraded to integrate with it.

This course uses Apache Hadoop 3.3.0.

Cluster Introduction

Specifically, a Hadoop cluster comprises two clusters: an HDFS cluster and a YARN cluster. The two are logically separate, but they are often physically co-located.

The HDFS cluster is responsible for storing massive data. Its main roles are:

NameNode, DataNode, SecondaryNameNode

The YARN cluster is responsible for resource scheduling during massive data computation. Its main roles are:

ResourceManager, NodeManager
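
Each role corresponds to a separate Java daemon process. As a hedged sketch using the Hadoop 3.x per-daemon commands (the cluster start scripts normally do this for you), each daemon is started on the node that hosts that role:

	# HDFS daemons
	hdfs --daemon start namenode
	hdfs --daemon start datanode
	hdfs --daemon start secondarynamenode
	# YARN daemons
	yarn --daemon start resourcemanager
	yarn --daemon start nodemanager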

There are three Hadoop deployment modes: Standalone mode, Pseudo-Distributed mode, and Cluster mode. The first two are deployed on a single machine.

Standalone mode is also called local mode: a single machine runs a single Java process, and it is mainly used for debugging.

In pseudo-distributed mode, the NameNode and DataNode of HDFS and the ResourceManager and NodeManager of YARN all run on one machine, each started as a separate Java process; this mode is likewise mainly used for debugging.

Cluster mode is used for production deployment: N hosts form a Hadoop cluster, with the master and slave roles deployed on different machines.

We take 3 nodes as an example, with roles assigned as follows:

node1: NameNode, DataNode, ResourceManager

node2: SecondaryNameNode, DataNode, NodeManager

node3: DataNode, NodeManager
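
With the roles assigned, a minimal bring-up sketch looks like the following, assuming Hadoop 3.3.0 is already unpacked and configured under /export/server/hadoop-3.3.0 on every node (that path is just this course's convention). The workers file tells the start scripts which hosts run the slave daemons:

	# on node1, list the slave hosts
	vi /export/server/hadoop-3.3.0/etc/hadoop/workers
		node1
		node2
		node3
	# format the HDFS metadata once, on the NameNode host (node1) only
	hdfs namenode -format
	# start the HDFS cluster and the YARN cluster from node1
	start-dfs.sh
	start-yarn.sh
	# on each node, jps should show the daemons running there
	jps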


Basic server environment preparation

1.0 Configure networking on each virtual machine (use NAT mode)

	1.1 Set each virtual machine's hostname
		vi /etc/hostname

		node1.itcast.cn

	1.2 Map hostnames to IP addresses
		vi /etc/hosts

		192.168.227.151	node1.itcast.cn node1
		192.168.227.152	node2.itcast.cn node2
		192.168.227.153	node3.itcast.cn node3

	1.3 Disable the firewall
		# check the firewall status
		systemctl status firewalld.service
		# stop the firewall
		systemctl stop firewalld.service
		# keep the firewall from starting at boot
		systemctl disable firewalld.service

	1.4 Configure passwordless SSH (from node1 to node1, node2, node3)
	# generate an SSH key pair on node1
	ssh-keygen -t rsa    (press Enter at every prompt)
	# this produces two files: id_rsa (private key) and id_rsa.pub (public key)
	# copy the public key to each machine that should accept passwordless logins
	ssh-copy-id node1
	ssh-copy-id node2
	ssh-copy-id node3
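
	A quick sanity check (run on node1): each command below should print the remote hostname without prompting for a password.

	ssh node1 hostname
	ssh node2 hostname
	ssh node3 hostname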
	
	1.5 Synchronize the cluster time
		yum install ntpdate
		ntpdate cn.pool.ntp.org
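
	Optionally, keep the clocks from drifting by re-running ntpdate from cron. A minimal sketch (the 30-minute interval and the NTP server are just examples):

		# edit the root crontab
		crontab -e
		# add a line like this to re-sync every 30 minutes
		*/30 * * * * /usr/sbin/ntpdate cn.pool.ntp.org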

JDK environment installation

2.1 Upload the JDK
	jdk-8u65-linux-x64.tar.gz

2.2 Extract the JDK
	# make sure the target directory exists first
	mkdir -p /export/server
	tar -zxvf jdk-8u65-linux-x64.tar.gz -C /export/server

2.3 Add Java to the environment variables
	vim /etc/profile
	# append at the end of the file
	export JAVA_HOME=/export/server/jdk1.8.0_65
	export PATH=$PATH:$JAVA_HOME/bin
	export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

	# reload the configuration
	source /etc/profile
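
	To confirm the installation took effect:

	# should print the 1.8.0_65 version banner and the JDK path
	java -version
	echo $JAVA_HOME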
