Does HDFS store data the way you think it does?

Yesterday I installed a pseudo-distributed Hadoop environment; today let's go through the most basic HDFS operations. Beginners run into a handful of common errors when first starting with HDFS, so let's demonstrate them one by one in practice.

1. HDFS has its own storage space, so you cannot operate on it directly with the Linux ls and mkdir commands

1) Create a directory in HDFS:

cd /usr/local/hadoop/bin

./hdfs dfs -mkdir /mx

2) Check in the HDFS root directory that the directory was created successfully:

./hdfs dfs -ls /

3) Upload a local Linux file into HDFS:

 ./hdfs dfs -put /home/hadoop/myLocalFile.txt  /mx

Check that the file was uploaded successfully:

./hdfs dfs -ls /mx

Check that the file's contents are intact:

./hdfs dfs -cat /mx/myLocalFile.txt
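For completeness (this step is not part of the original walkthrough), the reverse direction, copying the file from HDFS back to the local Linux filesystem, uses -get; the local destination below is just an example path:

./hdfs dfs -get /mx/myLocalFile.txt /home/hadoop/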

2. Programs cannot operate on HDFS files with the java.io.File class; the Hadoop HDFS jars must be imported

When developing a Java program in MyEclipse, you cannot use the java.io.File class to manipulate files stored in HDFS. Instead, import three jar packages into the project, hadoop-hdfs.jar, hadoop-hdfs-nfs.jar, and hadoop-common.jar, and operate on HDFS through the classes they provide, as sketched below.
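As a minimal sketch (not the article's own code), assuming a pseudo-distributed cluster whose fs.defaultFS is hdfs://localhost:9000, reading the file uploaded earlier through the org.apache.hadoop.fs.FileSystem API looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address of the pseudo-distributed cluster; adjust to your core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf); // HDFS handle, not java.io.File

        // Read back the file uploaded earlier with "hdfs dfs -put".
        Path file = new Path("/mx/myLocalFile.txt");
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}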

 

3. The default block size in Hadoop is 128 MB. Why?

The block is the smallest storage unit of HDFS. Starting with version 2.0, Hadoop raised the default block size from 64 MB to 128 MB, mainly to reduce the share of time lost to disk seeks: if the seek time is about 10 ms and the transfer rate is 100 MB/s, the block must be large enough that the seek time is only 1% of the transfer time (the arithmetic is spelled out below). By contrast, a typical disk sector is 512 bytes, so an HDFS block is far larger than the operating system's underlying storage blocks.
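Spelling out that rule of thumb with the numbers above:

transfer time >= seek time / 1% = 10 ms / 0.01 = 1 s
block size ~= transfer rate x transfer time = 100 MB/s x 1 s = 100 MB, rounded up to 128 MB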

When a file is uploaded to HDFS and its size exceeds the configured block size, the file is split and stored as multiple blocks, and those blocks can be placed on different DataNodes; throughout this process, HDFS guarantees that each individual block is stored whole on a single DataNode. Note, however, that if a file does not reach 128 MB, it does not occupy a full block's worth of disk space. You can inspect the block layout with fsck, as shown below.
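To see the layout in practice, the hdfs fsck tool lists each block of a file and the DataNodes holding it (shown here for the file uploaded earlier):

./hdfs fsck /mx/myLocalFile.txt -files -blocks -locations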

Some people ask: wouldn't it be better to set the block size even larger, to reduce seek time further? There is a reason this is not recommended: a MapReduce map task generally processes one block, so if the block size is too large, a file yields too few blocks, and hence too few map tasks, to exploit the parallelism of distributed computation. For example, a 1 GB file stored as 128 MB blocks can feed 8 parallel map tasks, but only 2 with 512 MB blocks.

 

4. An HDFS DataNode generally runs on the same machine as a TaskTracker, for better performance

Hadoop achieves high performance by running a map task on the node that already stores its input data (the data in HDFS); this is what is called data locality.

If they were different machines, the map task would first have to download the block from another HDFS machine to the machine running the task and only then compute, adding unnecessary time.

 

5. What is the difference between the secondary namenode and the namenode?

The NameNode manages the metadata of the entire file system, that is, the record of which block has been placed on which machine; this record is what is known as metadata.

The purpose of the secondary namenode is to help the NameNode merge the edit log, reducing NameNode startup time. If the primary namenode fails (assuming the data was not backed up in time), data can be recovered from the SecondaryNameNode. How often this checkpointing happens is configurable, as sketched below.
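As a sketch (these are the standard Hadoop 2.x property names, shown with their usual defaults), the checkpoint frequency is controlled in hdfs-site.xml:

<!-- hdfs-site.xml: how often the SecondaryNameNode merges edits into the fsimage -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>  <!-- checkpoint at least once an hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>  <!-- or after this many uncheckpointed transactions -->
</property>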

I hope this article helps you.

For more content, updated in real time, please visit my official account.

 



Origin: blog.csdn.net/qq_29718979/article/details/90814339