Data too big? Then you should understand the Hadoop Distributed File System

We rarely stop to ask where all this data comes from, and we have not had good enough technical capacity to deal with it.

Data volumes grow as networked devices multiply

The development of networks has undoubtedly paved the way for the era of big data and smart computing. Research firms forecast that the number of networked devices worldwide keeps growing; in some countries there are already more than two networked devices per capita. Such a huge and growing fleet of devices, together with ever-faster networks, is making society's data volume grow rapidly. Smart-city and safe-city projects are built on video surveillance and similar sources, and video data has become an important part of the big data era.

Research into robotics, AI, and machine learning is making data an essential ingredient of the assisted lives of the future. Driverless vehicles and courier robots have already appeared; on the one hand they demonstrate the value of data, and on the other they are constantly collecting new data, with analysis feeding applications in turn.

Who will handle data that is too big?

Once data has been generated and collected, that part of the work is complete. The next problem to crack is how to read and write that data efficiently.

With the era of massive data upon us, distributed storage and the reading and writing of large files have become hot topics. Dealing with the storage, analysis, and retrieval of an ever-growing number of large files is a problem every business must overcome.

Hadoop's prototype dates back to 2002 and Apache Nutch, an open-source search engine implemented in Java. After Google published its academic paper on the Google File System (GFS), the Nutch project built a distributed file storage system called NDFS on those ideas. Then, based on Google's published MapReduce paper, Nutch implemented parallel computing for the analysis of large data sets (greater than 1 TB). Finally, Yahoo hired Doug Cutting, who upgraded NDFS and MapReduce and named the result Hadoop; this is how HDFS (the Hadoop Distributed File System) took shape.

It should be said that Hadoop exists for big data: HDFS provides high-throughput data access for applications with very large data sets. Three characteristics stand out in Hadoop's design: it is suited to storing large files, it runs on ordinary, inexpensive servers, and the access pattern it handles best is write once, read many times.
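
To make the write-once, read-many pattern concrete, here is a minimal sketch against Hadoop's Java FileSystem API. The NameNode address, path, and file contents are hypothetical placeholders, and the replication factor and block size are passed explicitly only to show where those knobs live; in a real cluster they normally come from configuration files.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; usually set in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/events.log");

        // Write once: create() with an explicit replication factor (3)
        // and a 64 MB block size, matching the classic HDFS defaults.
        try (FSDataOutputStream out = fs.create(
                file, true, 4096, (short) 3, 64L * 1024 * 1024)) {
            out.write("one immutable record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: the file can now be opened and streamed
        // any number of times, but not modified in place.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```

Note that create() hands back a stream you write exactly once; afterwards the file can be read repeatedly, but HDFS offers no in-place updates.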

Of course, HDFS has its drawbacks. For one, it is not suitable for scenarios that demand low latency: Hadoop exists to move big data and is designed for high data throughput, which inevitably comes at the cost of high latency. HDFS is also a poor fit for large numbers of small files: the NameNode keeps the metadata for every file and block in memory (a commonly cited rule of thumb is about 150 bytes of NameNode heap per file, directory, or block object), so millions of small files will exhaust its memory.

2. Hadoop concepts: a quick primer

Now that we have covered Hadoop's origins and the scenarios it suits today, let me give you a quick primer on its basic architecture and main concepts.

NameNode: the NameNode manages the file directory tree, the mapping between files and blocks, and the mapping between blocks and DataNodes. It is kept on a single dedicated host; naturally, if that host fails, the NameNode is lost and a standby host must be started to run the NameNode.

DataNode: responsible for storage; most of the fault-tolerance mechanisms are implemented on the DataNodes. They are distributed across inexpensive machines and store the block files.

MapReduce: put simply, MapReduce is a programming model for extracting and analyzing elements from massive source data and returning a result set. Storing files distributed across disks is the first step; pulling the content we need out of that mass of data is what MapReduce does (see the word-count sketch after this list).

Block: also called a data block; the default size is 64 MB. Each block is stored as multiple replicas on several DataNodes, three copies by default.

Rack: a server rack. The three replicas of a block are usually spread across two or more racks.
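
To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop's classic Java MapReduce API: the map phase emits (word, 1) pairs and the reduce phase sums them per word. The input and output paths are hypothetical placeholders; treat this as an illustrative sketch of the model rather than code from any particular cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in each input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts gathered for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths for the demo.
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as a combiner pre-aggregates counts on the map side, which cuts down the data shuffled across the network between the two phases.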
