[Hadoop] Study Notes (2): HDFS Operating Principles and Usage

These are study notes transcribed from the Shiyanlou (实验楼) lab courses.

 

1. HDFS Principles

HDFS (Hadoop Distributed File System) is a distributed file system. It is highly fault tolerant, provides high-throughput access to data, and is well suited to applications with large data sets, offering a fault-tolerant, high-throughput storage solution for massive amounts of data.

  • High-throughput access: HDFS distributes each block across different racks. When a user accesses data, HDFS serves it from the server that is nearest to the user and has the lowest access load. Because blocks are replicated on different racks, access is no longer tied to a single copy, so reads are fast and efficient. In addition, HDFS can read from multiple servers in the cluster in parallel, which increases the read/write bandwidth for file access.
  • High fault tolerance: System failures are inevitable, so recovering data and tolerating faults after a failure is critical. HDFS ensures data reliability in several ways: multiple replicas are distributed across physically different servers, data checksums are verified, and continuous background self-checks keep the data consistent, providing a high degree of fault tolerance.
  • Linear scalability: Because block metadata is stored on the NameNode and block files are distributed across DataNodes, scaling out only requires adding DataNodes. The system can be expanded without stopping the service and without manual intervention.

1.1 HDFS architecture

[Figure: HDFS architecture, showing the NameNode, Secondary NameNode, and DataNodes]

The figure above shows the Master/Slave structure of HDFS, which is divided into three roles: NameNode, Secondary NameNode, and DataNode.

  • NameNode: The Master node; in Hadoop 1.X there is only one. It manages the HDFS namespace and the block mapping information, configures the replication policy, and handles client requests (the sketch after this list shows how a client can query that block mapping);
  • Secondary NameNode: Assists the NameNode and shares part of its workload. It periodically merges the fsimage and fsedits files and pushes the result back to the NameNode, and in an emergency it can help recover the NameNode;
  • DataNode: The Slave nodes, which actually store the data. They perform block reads and writes and report the blocks they store to the NameNode;
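
To make the NameNode's block mapping concrete, here is a minimal sketch (not part of the original lab; class name and the file URI argument are placeholders) that asks HDFS, through the client FileSystem API, which DataNodes hold each block of a file:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: print which DataNodes hold each block of an HDFS file.
public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // an HDFS file URI, e.g. a file uploaded in the exercises below
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // The NameNode keeps the namespace and the block-to-DataNode mapping;
        // getFileBlockLocations exposes that mapping to the client.
        FileStatus status = fs.getFileStatus(new Path(uri));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("block " + i + " offset " + blocks[i].getOffset()
                    + " hosts " + java.util.Arrays.toString(blocks[i].getHosts()));
        }
    }
}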

2.2 HDFS read operation

[Figure: HDFS read flow between the client, NameNode, and DataNodes]

  1. The client opens the file it wants to read by calling open() on a FileSystem object; for HDFS, this object is an instance of DistributedFileSystem;
  2. DistributedFileSystem calls the NameNode over RPC to determine the locations of the first blocks of the file. For each block, the NameNode returns the addresses of the DataNodes holding a replica (as many as the replication factor), sorted according to the Hadoop cluster topology so that the DataNode closest to the client comes first;
  3. The first two steps return an FSDataInputStream, which wraps a DFSInputStream that manages the streams to the NameNode and DataNodes. The client calls the read() method on this input stream;
  4. The DFSInputStream, which has stored the DataNode addresses for the first block of the file, connects to the nearest DataNode. By calling read() repeatedly on the stream, data is transferred from the DataNode to the client;
  5. When the end of a block is reached, DFSInputStream closes the connection to that DataNode and looks for the best DataNode for the next block. These operations are transparent to the client, which simply sees a continuous stream;
  6. Once the client has finished reading, it calls close() on the FSDataInputStream to close the stream (a client-side sketch of this flow follows this list).
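
A minimal client-side sketch of this read flow is given below; it is essentially a compressed version of the FileSystemCat example in section 3.1, with comments mapping the API calls to the numbered steps above. The class name is a placeholder.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of the HDFS read path; comments map to the numbered steps above.
public class ReadFlowSketch {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // an HDFS file URI
        Configuration conf = new Configuration();

        // Step 1: obtain a FileSystem; for an hdfs:// URI this is a DistributedFileSystem.
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;
        try {
            // Steps 2-3: open() asks the NameNode (via RPC) for the block locations
            // and returns an FSDataInputStream wrapping a DFSInputStream.
            in = fs.open(new Path(uri));

            // Steps 4-5: repeated read() calls stream data from the nearest DataNode,
            // switching DataNodes at block boundaries transparently.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close the stream when done.
            IOUtils.closeStream(in);
        }
    }
}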

 

2.3 HDFS write operation

 

[Figure: HDFS write flow between the client, NameNode, and DataNode pipeline]

  1. The client creates a new file by calling the create() method on DistributedFileSystem;
  2. DistributedFileSystem calls the NameNode over RPC to create the new file, with no blocks associated yet. Before creating it, the NameNode performs various checks, such as whether the file already exists and whether the client has permission to create it. If the checks pass, the NameNode logs a record for the new file; otherwise it throws an IO exception;
  3. The first two steps return an FSDataOutputStream. Similar to reading a file, the FSDataOutputStream wraps a DFSOutputStream, which coordinates with the NameNode and the DataNodes. The client starts writing data to the DFSOutputStream, which splits the data into small packets and writes them to an internal queue called the "data queue";
  4. The DataStreamer consumes the data queue. It first asks the NameNode which DataNodes are most suitable to store the new block; if the replication factor is 3, for example, it finds the three most suitable DataNodes and arranges them in a pipeline. The DataStreamer sends each packet from the queue to the first DataNode in the pipeline, the first DataNode forwards the packet to the second, and so on;
  5. DFSOutputStream also keeps a queue of packets called the "ack queue", which waits for acknowledgements from the DataNodes. Only when every DataNode in the pipeline has confirmed receipt is the corresponding packet removed from the ack queue;
  6. When the client has finished writing data, it calls close() on the stream;
  7. The DataStreamer flushes the remaining packets into the pipeline and waits for their acknowledgements. After the last ack is received, it notifies the NameNode to mark the file as complete (a client-side sketch of the write flow follows this list).
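
Similarly, the sketch below shows the write flow from the client's side. The data queue, ack queue, and DataStreamer are internal to DFSOutputStream, so the client only sees create(), write(), and close(); the class name, destination URI, and sample payload are placeholders, not part of the original lab.

import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Progressable;

// Sketch of the HDFS write path; comments map to the numbered steps above.
public class WriteFlowSketch {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // destination HDFS URI for the new file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        OutputStream out = null;
        try {
            // Steps 1-3: create() asks the NameNode to record the new file (with no
            // blocks yet) and returns an FSDataOutputStream wrapping a DFSOutputStream.
            out = fs.create(new Path(uri), new Progressable() {
                @Override
                public void progress() {
                    System.out.print(".");  // called as data is pushed to the pipeline
                }
            });

            // Steps 4-5: written bytes are split into packets, queued, and streamed
            // through the DataNode pipeline; acks come back on the ack queue.
            out.write("hello hdfs\n".getBytes("UTF-8"));
        } finally {
            // Steps 6-7: close() flushes the remaining packets, waits for the acks,
            // and tells the NameNode the file is complete.
            if (out != null) out.close();
        }
    }
}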

2.4 Commonly used HDFS commands

1. hadoop fs (programmatic equivalents of these commands are sketched after the list)

hadoop fs -ls /                        # list the contents of an HDFS directory
hadoop fs -lsr                         # list a directory recursively
hadoop fs -mkdir /user/hadoop          # create an HDFS directory
hadoop fs -put a.txt /user/hadoop/     # upload a local file to HDFS
hadoop fs -get /user/hadoop/a.txt /    # download an HDFS file to the local file system
hadoop fs -cp src dst                  # copy within HDFS
hadoop fs -mv src dst                  # move/rename within HDFS
hadoop fs -cat /user/hadoop/a.txt      # print a file's contents
hadoop fs -rm /user/hadoop/a.txt       # delete a file
hadoop fs -rmr /user/hadoop/a.txt      # delete recursively
hadoop fs -text /user/hadoop/a.txt     # print a file as text
hadoop fs -copyFromLocal localsrc dst  # similar to hadoop fs -put
hadoop fs -moveFromLocal localsrc dst  # upload a local file to HDFS and delete the local copy
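
For reference, most of these shell commands have programmatic equivalents on the FileSystem API used throughout the examples below. The sketch here (class name and paths are placeholders, not part of the original notes) shows a few of them.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: FileSystem API calls corresponding to common "hadoop fs" commands.
public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // uses the default file system from the configuration

        fs.mkdirs(new Path("/user/hadoop"));                                         // hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("a.txt"), new Path("/user/hadoop/a.txt"));     // -put / -copyFromLocal
        fs.copyToLocalFile(new Path("/user/hadoop/a.txt"), new Path("/tmp/a.txt"));  // -get
        fs.rename(new Path("/user/hadoop/a.txt"), new Path("/user/hadoop/b.txt"));   // -mv
        for (FileStatus s : fs.listStatus(new Path("/"))) {                          // -ls /
            System.out.println(s.getPath());
        }
        fs.delete(new Path("/user/hadoop/b.txt"), false);                            // -rm
    }
}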

2. hadoop dfsadmin

Runs an HDFS dfsadmin (administration) client.

# report basic file system information and statistics
hadoop dfsadmin -report

hadoop dfsadmin -safemode enter | leave | get | wait
# Safe mode maintenance command. Safe mode is a NameNode state in which the NameNode:
# 1. does not accept changes to the namespace (read-only)
# 2. does not replicate or delete blocks
# The NameNode enters safe mode automatically at startup and leaves it automatically once the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it must also be left manually.

3. hadoop fsck

Runs the HDFS file system checking utility.

usage:

hadoop fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

4. start-balancer.sh

start-balancer.sh starts the HDFS balancer, which redistributes data blocks evenly across DataNodes. The HDFS-related APIs can be looked up on the official Apache website.


 

3. Test Example 1

Compile and run Example 3.2 from "Hadoop: The Definitive Guide" on the Hadoop cluster to read the contents of a file stored in HDFS.

3.1 running code

import java.io.InputStream;

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open( new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

3.2 implementation

3.2.1 Creating Code Directory

Configure the local host name hadoop. Enter the shiyanlou user's password when prompted by sudo, and append hadoop to the end of the last line of /etc/hosts.

sudo vim /etc/hosts
# append hadoop to the end of the last line; after the change it looks similar to the following (use the Tab key to add the spacing):
# 172.17.2.98 f738b9456777 hadoop
ping hadoop

Start Hadoop with the following commands:

cd /app/hadoop-1.1.2/bin
./start-all.sh
jps # check the running processes and make sure both NameNode and DataNode have started

Create the myclass and input directories under /app/hadoop-1.1.2 using the following commands:

cd /app/hadoop-1.1.2
rm -rf myclass input
mkdir -p myclass input


3.2.2 Create an example file and upload it to HDFS

Enter the /app/hadoop-1.1.2/input directory and create the file quangle.txt in that directory:

cd /app/hadoop-1.1.2/input
touch quangle.txt
vi quangle.txt

The content is:

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.


Use the following commands to create the /class4 directory in HDFS:

hadoop fs -mkdir /class4
hadoop fs -ls /

Note: if the hadoop command reports an error, run source hadoop-env.sh again. The same applies to the later experiments.

(To run the hadoop command directly, /app/hadoop-1.1.2 needs to be added to the PATH.)


Upload the example file to the /class4 folder in HDFS:

cd /app/hadoop-1.1.2/input
hadoop fs -copyFromLocal quangle.txt /class4/quangle.txt
hadoop fs -ls /class4


3.2.3 Configuring the local environment

Configure hadoop-env.sh in the /app/hadoop-1.1.2/conf directory, as shown below:

cd /app/hadoop-1.1.2/conf
sudo vi hadoop-env.sh

Add the HADOOP_CLASSPATH variable with the value /app/hadoop-1.1.2/myclass, and after setting it, source the configuration file so that the setting takes effect:

export HADOOP_CLASSPATH=/app/hadoop-1.1.2/myclass


3.2.4 write code

Enter the /app/hadoop-1.1.2/myclass directory and create the FileSystemCat.java code file in that directory with the following commands:

cd /app/hadoop-1.1.2/myclass/
vi FileSystemCat.java

Enter the code listed in section 3.1.


3.2.5 Compile the code

In the /app/hadoop-1.1.2/myclass directory, compile the code with the following command:

javac -classpath ../hadoop-core-1.1.2.jar FileSystemCat.java


3.2.6 Use the compiled code to read the HDFS file

Use the following command to read the contents of /class4/quangle.txt in HDFS:

hadoop FileSystemCat /class4/quangle.txt


 

4. Test Example 2

Generate a text file of about 100 bytes in the local file system, and write a program to read this file and write its bytes 101 through 120 to HDFS as a new file.

4.1 implementation code

// Note: delete any Chinese comments before compiling!
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class LocalFile2Hdfs {
    public static void main(String[] args) throws Exception {

        // Read the source file path and the target file location from the arguments
        String local = args[0];
        String uri = args[1];

        FileInputStream in = null;
        OutputStream out = null;
        Configuration conf = new Configuration();
        try {
            // Open the local source file for reading
            in = new FileInputStream(new File(local));

            // Get the target FileSystem and create the target file on HDFS
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            out = fs.create(new Path(uri), new Progressable() {
                @Override
                public void progress() {
                    System.out.println("*");
                }
            });

            // Skip the first 100 bytes
            in.skip(100);
            byte[] buffer = new byte[20];

            // Read 20 bytes starting from position 101 into the buffer
            int bytesRead = in.read(buffer);
            if (bytesRead >= 0) {
                out.write(buffer, 0, bytesRead);
            }
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}

4.2 implementation

4.2.1 write code

Enter the /app/hadoop-1.1.2/myclass directory and create the LocalFile2Hdfs.java code file in that directory with the following commands:

cd /app/hadoop-1.1.2/myclass/
vi LocalFile2Hdfs.java

Enter the code listed in section 4.1.


4.2.2 Compile the code

In the /app/hadoop-1.1.2/myclass directory, compile the code with the following command:

javac -classpath ../hadoop-core-1.1.2.jar LocalFile2Hdfs.java


4.2.3 create a test file

Enter the /app/hadoop-1.1.2/input directory and create the file local2hdfs.txt in that directory:

cd /app/hadoop-1.1.2/input/
vi local2hdfs.txt

The content is:

Washington (CNN) -- Twitter is suing the U.S. government in an effort to loosen restrictions on what the social media giant can say publicly about the national security-related requests it receives for user data.
The company filed a lawsuit against the Justice Department on Monday in a federal court in northern California, arguing that its First Amendment rights are being violated by restrictions that forbid the disclosure of how many national security letters and Foreign Intelligence Surveillance Act court orders it receives -- even if that number is zero.
Twitter vice president Ben Lee wrote in a blog post that it's suing in an effort to publish the full version of a "transparency report" prepared this year that includes those details.
The San Francisco-based firm was unsatisfied with the Justice Department's move in January to allow technological firms to disclose the number of national security-related requests they receive in broad ranges.


4.2.4 Use the compiled code to upload part of the file to HDFS

Use the following commands to read bytes 101-120 of local2hdfs.txt and write them to HDFS as a new file:

cd /app/hadoop-1.1.2/input
hadoop LocalFile2Hdfs local2hdfs.txt /class4/local2hdfs_part.txt
hadoop fs -ls /class4


4.2.5 Verify the result

Use the following command to read the content of local2hdfs_part.txt:

hadoop fs -cat /class4/local2hdfs_part.txt


 

5. Test Example 3

This is the reverse of Test Example 2: generate a text file of about 100 bytes in HDFS, and write a program to read this file and write its bytes 101 through 120 to the local file system as a new file.

5.1 implementation code

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class Hdfs2LocalFile {
    public static void main(String[] args) throws Exception {

        String uri = args[0];
        String local = args[1];

        FSDataInputStream in = null;
        OutputStream out = null;
        Configuration conf = new Configuration();
        try {
            // Open the source file on HDFS and create the local target file
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            in = fs.open(new Path(uri));
            out = new FileOutputStream(local);

            // Skip the first 100 bytes, then read the next 20 into the buffer
            byte[] buffer = new byte[20];
            in.skip(100);
            int bytesRead = in.read(buffer);
            if (bytesRead >= 0) {
                out.write(buffer, 0, bytesRead);
            }
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }   
    }
}

5.2 implementation

5.2.1 write code

Enter the /app/hadoop-1.1.2/myclass directory and create the Hdfs2LocalFile.java code file in that directory with the following commands:

cd /app/hadoop-1.1.2/myclass/
vi Hdfs2LocalFile.java

Enter the code listed in section 5.1.


5.2.2 Compile the code

In the /app/hadoop-1.1.2/myclass directory, compile the code with the following command:

javac -classpath ../hadoop-core-1.1.2.jar Hdfs2LocalFile.java


5.2.3 create a test file

Enter the /app/hadoop-1.1.2/input directory and create the file hdfs2local.txt in that directory:

cd /app/hadoop-1.1.2/input/
vi hdfs2local.txt

The content is:

The San Francisco-based firm was unsatisfied with the Justice Department's move in January to allow technological firms to disclose the number of national security-related requests they receive in broad ranges.
"It's our belief that we are entitled under the First Amendment to respond to our users' concerns and to the statements of U.S. government officials by providing information about the scope of U.S. government surveillance -- including what types of legal process have not been received," Lee wrote. "We should be free to do this in a meaningful way, rather than in broad, inexact ranges."


From the /app/hadoop-1.1.2/input directory, upload the file to the /class4/ folder in HDFS:

hadoop fs -copyFromLocal hdfs2local.txt /class4/hdfs2local.txt
hadoop fs -ls /class4/


5.2.4 Use the compiled code to write part of the HDFS file to the local file system

Use the following command to read bytes 101-120 of hdfs2local.txt and write them to the local file system as a new file:

hadoop Hdfs2LocalFile /class4/hdfs2local.txt hdfs2local_part.txt


5.2.5 Verify the result

Use the following command to read the content of hdfs2local_part.txt:

cat hdfs2local_part.txt



Source: blog.csdn.net/YYIverson/article/details/101109081