Hadoop study notes (1): A basic understanding of Hadoop and Big Data

Introduction to Big Data

What is big data?

Big Data: data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. They are massive, fast-growing, and diverse information assets that require new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization capabilities.

Big data storage units
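
For reference, the units that come up in these notes run, from smallest to largest: Byte, KB, MB, GB, TB, PB, EB, ZB, YB. The sketch below assumes the binary convention, where each unit is 1024 times the previous one (some vendors use 1000 instead). The names `UNITS`, `to_bytes`, and `convert` are made up for illustration.

```python
# Storage units from smallest to largest.
# Assumption: binary convention, each step is a factor of 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * 1024 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a value between any two units listed in UNITS."""
    return to_bytes(value, from_unit) / 1024 ** UNITS.index(to_unit)


if __name__ == "__main__":
    print(convert(1, "PB", "TB"))   # 1 PB expressed in TB -> 1024.0
    print(convert(5, "EB", "PB"))   # 5 EB expressed in PB -> 5120.0
```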

Features of Big Data

  • Volume (large amount): To date, all printed material produced by humans amounts to about 200 PB of data, and all the words ever spoken by humankind amount to roughly 5 EB. A typical personal computer's hard drive currently holds on the order of terabytes, while the data held by some large enterprises is already approaching the exabyte scale.
  • Velocity (high speed): This is the most significant feature that distinguishes big data from traditional data mining. In its "Digital Universe" report, IDC projected that the global volume of data would reach 35.2 ZB by 2020. Faced with such vast amounts of data, processing efficiency is vital.
  • Variety (diversity): This diversity means data can be divided into structured and unstructured data. Compared with the structured, mostly text-based data that was easy to store in conventional databases, there is more and more unstructured data, including web logs, audio, video, images, and location information. These many data types place higher demands on data processing capabilities.
  • Value (low value density): Value density is inversely proportional to the total volume of data. How to quickly "purify" the valuable data has become a pressing problem in the big data context.

 

Big Data Technology Ecosystem


Introduction to Hadoop

What is Hadoop?

  1. Hadoop is a distributed system infrastructure developed by the Apache Foundation.

  2. It mainly solves two problems: storing massive amounts of data, and analyzing and computing over that data.

  3. Broadly speaking, "Hadoop" usually refers to a wider concept: the Hadoop ecosystem.

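History of Hadoop

Lucene is an open-source framework created by Doug Cutting. Written in Java, it implements full-text search similar to Google's and provides the architecture of a full-text search engine, including a complete query engine and indexing engine. At the end of 2001, Lucene became a subproject of the Apache Foundation.

When dealing with massive amounts of data, Lucene faced the same difficulties as Google: the data was hard to store and retrieval was slow. It is fair to say that Google is the source of Hadoop's ideas, through its three papers on big data: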

  • GFS ---> HDFS
  • MapReduce ---> MR
  • BigTable ---> HBase

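In 2003-2004, Google published some details of the ideas behind GFS and MapReduce. Building on these, Doug Cutting and others spent two years of their spare time implementing a DFS and a MapReduce mechanism, which sharply improved Nutch's performance.

In 2005, Hadoop was formally brought into the Apache Foundation as part of Nutch, a subproject of Lucene.

In March 2006, MapReduce and the Nutch Distributed File System (NDFS) were both moved into a project named Hadoop.

The name Hadoop comes from a toy elephant belonging to Doug Cutting's son.

Hadoop was thus born and developed rapidly, marking the arrival of the big data era.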

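The three major Hadoop distributions

Apache, Cloudera, Hortonworks

  • Apache: the original version, and the best one for beginners to learn with

  • Cloudera: widely used in large internet companies; its product is called CDH

  • Hortonworks: has good documentation; it appeared about two years after Cloudera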

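Advantages of Hadoop

  • High reliability: Hadoop maintains multiple replicas of the data at the lowest level, so the failure of a compute element or storage device does not cause data loss.

  • High scalability: task data is distributed across the cluster, which can easily be scaled out to thousands of nodes.

  • High efficiency: following the MapReduce model, Hadoop works in parallel to speed up task processing.

  • High fault tolerance: failed tasks are automatically reassigned.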

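Differences between Hadoop 1.x and Hadoop 2.x

HDFS overview

NameNode: stores file metadata, such as file names, the directory structure, and file attributes, as well as the block list of each file and the DataNodes on which each block resides
DataNode: stores file block data, along with the checksums of that block data, in its local file system
Secondary NameNode: an auxiliary background program that monitors the state of HDFS and takes a snapshot of the HDFS metadata at regular intervals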
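
The division of labor between NameNode and DataNode can be pictured with a toy in-memory model. This is only a sketch under assumptions, not the HDFS API: the class and function names (`DataNode`, `NameNode`, `put`, `get`), the 8-byte block size, the replication factor of 2, and the round-robin placement are all invented for the demo. CRC32 stands in for whatever checksum a real cluster uses.

```python
import zlib

BLOCK_SIZE = 8     # bytes per block; tiny on purpose (demo assumption)
REPLICATION = 2    # replicas per block; must not exceed the number of DataNodes


class DataNode:
    """Stores block data plus a checksum for every block it holds."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}                       # block_id -> (data, checksum)

    def store(self, block_id, data):
        self.blocks[block_id] = (data, zlib.crc32(data))

    def read(self, block_id):
        data, checksum = self.blocks[block_id]
        if zlib.crc32(data) != checksum:       # detect corrupted block data
            raise IOError(f"checksum mismatch for {block_id} on {self.name}")
        return data


class NameNode:
    """Holds metadata only: the block list per file and where each block lives."""

    def __init__(self):
        self.files = {}                        # filename -> [block_id, ...]
        self.locations = {}                    # block_id -> [DataNode name, ...]


def put(namenode, datanodes, filename, content):
    """Split content into blocks, replicate each block, record the metadata."""
    block_ids = []
    for index, offset in enumerate(range(0, len(content), BLOCK_SIZE)):
        block_id = f"{filename}#blk{index}"
        # Hypothetical placement policy: round-robin over distinct DataNodes.
        targets = [datanodes[(index + r) % len(datanodes)]
                   for r in range(REPLICATION)]
        for node in targets:
            node.store(block_id, content[offset:offset + BLOCK_SIZE])
        namenode.locations[block_id] = [node.name for node in targets]
        block_ids.append(block_id)
    namenode.files[filename] = block_ids


def get(namenode, datanodes, filename, failed=()):
    """Reassemble a file, skipping DataNodes whose names are listed in `failed`."""
    by_name = {node.name: node for node in datanodes}
    parts = []
    for block_id in namenode.files[filename]:
        alive = [n for n in namenode.locations[block_id] if n not in failed]
        if not alive:
            raise IOError(f"all replicas of {block_id} are unavailable")
        parts.append(by_name[alive[0]].read(block_id))
    return b"".join(parts)


if __name__ == "__main__":
    nodes = [DataNode(f"dn{i}") for i in range(3)]
    nn = NameNode()
    put(nn, nodes, "notes.txt", b"hadoop stores files as replicated blocks")
    print(nn.files["notes.txt"])                        # block list kept by the NameNode
    print(get(nn, nodes, "notes.txt", failed={"dn0"}))  # still readable with dn0 down
```

Because every block has a second replica on another node, the file can still be read when one DataNode is marked as failed, which is the idea behind the "high reliability" point listed earlier.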

YARN overview


MapReduce overview
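
MapReduce divides a computation into a map phase, which turns input records into key/value pairs, and a reduce phase, which combines all values that share a key. The classic illustration is a word count. The sketch below runs in a single Python process and does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for the example, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```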

Origin www.cnblogs.com/wbyixx/p/10984267.html