Big Data Technology - HDFS

hadoop

Theoretical basis: GFS → HDFS; MapReduce → MapReduce; BigTable → HBase

Project Website: http://hadoop.apache.org/

Download Path: https://archive.apache.org/dist/hadoop/common/

Main modules

  • Hadoop Common

    The common base module, including RPC calls, socket communication, etc.

  • Hadoop Distributed File System

    hdfs, the distributed file system used to store data

  • Hadoop YARN

    Resource coordination framework

  • Hadoop MapReduce

    Big data computing framework

  • Hadoop Ozone

    Object storage framework

  • Hadoop Submarine

    Machine learning engine

Distributed File System (hdfs)

Basics: the Hadoop Distributed File System is Hadoop's file management layer. It decouples the file system from the underlying disks and serves as the storage layer on which MapReduce computation runs.

By default hdfs splits a file into 128 MB blocks (the last block may be smaller). The byte offset of each block determines its order within the file, and each block has three replicas by default.

Stored files cannot be modified in place; they can be appended to (not recommended). hdfs is therefore generally used to store historical data.

The block size and the number of replicas can be specified when a file is stored; after the file has been written, only the replica count can still be changed.
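
A minimal Java sketch of setting these values through the hdfs client API (the local file name and HDFS path are made-up examples; FileSystem.create with explicit replication/block size and setReplication are standard client calls):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class BlockSettingsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(true);      // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Write a file with an explicit replication factor (2) and block size (64 MB)
        Path dst = new Path("/sxt/bigdata/sample.txt");     // hypothetical HDFS path
        InputStream in = new FileInputStream("sample.txt"); // hypothetical local file
        OutputStream out = fs.create(dst, true, 4096, (short) 2, 64L * 1024 * 1024);
        IOUtils.copyBytes(in, out, conf);                   // copies and closes both streams

        // After the file is written, only the replica count can still be changed
        fs.setReplication(dst, (short) 3);

        fs.close();
    }
}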

hdfs storage principle

The basic version

Macro level

NameNode (NN)

  • Receives client requests

  • Manages the distributed file system; its key job is maintaining the mappings among files, blocks, and nodes:

    • the distributed directory tree

    • the mapping between directories and files

    • the mapping between files and blocks, and between each block and the DataNodes (DNs) that store it (see the client-side sketch after this list)

  • The block-to-DN map is held in the NN's memory

    This makes lookups fast,

    but memory is volatile, so the map is lost if the NN goes down,

    and hdfs is not suited to large numbers of small files, whose metadata would occupy too much NN memory.

  • The NN keeps a heartbeat with the DNs

    • At startup, the NN collects the block reports sent by the DNs and builds the block-to-DN mapping

    • After startup:

      each DN reports to the NN every 3 seconds (by default) that it is alive;

      if a DN has been out of contact for about 10 minutes, the NN considers it lost and re-replicates the blocks that were on it to other DNs.
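
A minimal client-side sketch of querying this block-to-DN mapping through the FileSystem API (the HDFS path is a made-up example):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(true);
        FileSystem fs = FileSystem.get(conf);

        // Ask the NameNode which DataNodes hold each block of the file
        FileStatus status = fs.getFileStatus(new Path("/sxt/bigdata/sample.txt")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // The offset orders the blocks within the file; hosts are the DNs storing the replicas
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}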

DataNode(DN)

  • Stores the actual file data, i.e. the blocks

  • Each block has an accompanying piece of metadata (checksums) used to detect whether the block is corrupted (see the checksum sketch after this list)

  • Keeps a heartbeat with the NN

    At startup: verifies the integrity of the locally stored blocks, then reports the block information to the NN

    After startup: reports to the NN every 3 seconds that it is alive

  • The DN's storage medium is the local hard disk
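
As a related illustration, a client can ask hdfs for the file-level checksum derived from this per-block metadata; a minimal sketch (the path is a made-up example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(true);
        FileSystem fs = FileSystem.get(conf);

        // The checksum is computed by the DataNodes from the block-level metadata
        FileChecksum checksum = fs.getFileChecksum(new Path("/shsxt/java/123.txt")); // hypothetical path
        System.out.println(checksum.getAlgorithmName() + ": " + checksum);

        fs.close();
    }
}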

SecondaryNameNode

The SecondaryNameNode addresses the problem that the NN's in-memory state is volatile. The main scheme is: snapshot (fsimage) + edit log.

The edit logs and snapshots are stored under /var/sxt/hadoop/ha/dfs/name/current, where the VERSION file holds the cluster information.

Every hdfs operation is appended to an edit log file (e.g. edits_inprogress_0000000000000010924).

hdfs defines checkpoint conditions on the edit log; when either condition is met, a log merge is performed:

  • fs.checkpoint.size, default 64 MB: the edit log size that triggers a checkpoint

  • fs.checkpoint.period, default 3600 seconds: the time interval between checkpoints

Log consolidation process

  • The SecondaryNameNode pulls the current edit log from the NN; the NN creates a new edit log to record subsequent hdfs operations

  • The SecondaryNameNode merges the pulled edit log with its local snapshot, producing a new snapshot file and an md5 verification file (fsimage_0000000000000010755 and fsimage_0000000000000010755.md5)

  • The SecondaryNameNode sends the generated snapshot and its checksum back to the NN, so a copy of the snapshot also exists on the SecondaryNameNode

  • The NN verifies the received snapshot and renames the original edit log into a historical log file (edits_0000000000000010922-0000000000000010923)

At the next boot, the NN only needs to restore from the snapshot and replay at most 64 MB of edit log, so startup is fast.

On startup, the NN merges the fsimage from the last shutdown with the latest edit log to produce a new fsimage.

 

 

Micro level

 

 

 

HA version

Macro level

 

Micro level

 


 

 

When the NN shuts down it does not persist the block-to-DN mapping; at startup it relies on the DN block-report mechanism to rebuild the map.

 

 

Building the hdfs environment

Basic configuration (done for both of the setups below)

  • Three or more hosts with passwordless (key-based) SSH between them; the JDK is installed on each host and the JAVA_HOME environment variable is configured

  • Prepare in advance on the master node:

    • Extract hadoop-2.6.5.tar.gz to the /opt/sxt directory

      • tar -zxvf hadoop-2.6.5.tar.gz

      • mv hadoop-2.6.5 /opt/sxt/

      • cd /opt/sxt/hadoop-2.6.5/etc/hadoop/

    • /opt/sxt/hadoop-2.6.5/etc/hadoop/ is the configuration directory

      Set the JAVA_HOME path in hadoop-env.sh (line 25), mapred-env.sh (line 16), and yarn-env.sh (line 23); mapred-env.sh and yarn-env.sh may be left unchanged

hadoop 1.0-style setup (non-HA)

It includes one NN, one SecondaryNameNode, and three DNs

  • Install hadoop

  • Modify the configuration files under /opt/sxt/hadoop-2.6.5/etc/hadoop

    • core-site.xml: configure the NN address and the hadoop data path

      <property>
      <name>fs.defaultFS</name>
      <value>hdfs://node1:9000</value>
      </property>
      <property>
      <name>hadoop.tmp.dir</name>
      <value>/var/sxt/hadoop/full</value>
      </property>
    • hdfs-site.xml: configure hdfs

      Mainly configures the SecondaryNameNode addresses and the number of file replicas

      <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>node2:50090</value>
      </property>
      <property>
      <name>dfs.namenode.secondary.https-address</name>
      <value>node2:50091</value>
      </property>
      <property>
      <name>dfs.replication</name>
      <value>2</value>
      </property>
    • slaves: configure the DNs

      node1 
      node2
      node3
  • Copy the hadoop files to the other nodes and create the hadoop data directory /var/sxt/hadoop/full

  • Configure the environment variables with vim /etc/profile, copy the file to the other nodes, and apply it everywhere with source /etc/profile

    export JAVA_HOME=/usr/java/jdk1.7.0_67
    export HADOOP_HOME=/opt/sxt/hadoop-2.6.5
    export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  • Format the NN node: hdfs namenode -format

  • Start the cluster with start-dfs.sh and check the processes on each node with jps

  • Shut down and take a VM snapshot of each node

hadoop 2.x HA cluster setup

ZooKeeper cluster

  • Extract ZooKeeper

    • tar -zxvf zookeeper-3.4.6.tar.gz

    • mv zookeeper-3.4.6 /opt/sxt/

    • cd /opt/sxt/zookeeper-3.4.6/conf

  • Create the configuration file

    • cp zoo_sample.cfg zoo.cfg — copy the sample to create the configuration file

    • Modify the following in zoo.cfg:

      • dataDir=/var/sxt/zookeeper — specifies the ZooKeeper data directory

      • clientPort=2181 — specifies the ZooKeeper client access port

      • server.1=node1:2888:3888, server.2=node2:2888:3888, server.3=node3:2888:3888 — specify each server's internal communication port (2888) and leader-election port (3888)

  • Configure the ZooKeeper environment variables

    vim /etc/profile

    export JAVA_HOME=/usr/java/jdk1.7.0_67
    export ZOOKEEPER_HOME=/opt/sxt/zookeeper-3.4.6
    export PATH=$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$PATH
  • Copy the ZooKeeper installation and the environment variable file to the other hosts

    • scp -r /opt/sxt/zookeeper-3.4.6/ root@node2:/opt/sxt/

    • scp -r /opt/sxt/zookeeper-3.4.6/ root@node3:/opt/sxt/

    • scp -r /etc/profile root@node2:/etc/profile

    • scp -r /etc/profile root@node3:/etc/profile

  • Run on all three nodes:

    • mkdir -p /var/sxt/zookeeper — create the ZooKeeper data directory, matching the path specified in zoo.cfg

    • source /etc/profile — apply the environment variables

  • Assign each ZooKeeper node its server id (myid), matching the server.N entries in zoo.cfg; when other election factors are equal, the node with the higher id has higher priority

    • On node1: echo 1 > /var/sxt/zookeeper/myid

    • On node2: echo 2 > /var/sxt/zookeeper/myid

    • On node3: echo 3 > /var/sxt/zookeeper/myid

  • Start the ZooKeeper cluster and check its status

    • zkServer.sh start

    • zkServer.sh status — should show 1 leader and 2 followers

    • zkServer.sh stop

  • Shut down and take a VM snapshot of each node

Hadoop HA cluster

  • Configure the core file core-site.xml

    • fs.defaultFS — the hadoop cluster (nameservice) name, matching dfs.nameservices in hdfs-site.xml

    • ha.zookeeper.quorum — the ZooKeeper cluster hosts and ports

    • hadoop.tmp.dir — the hadoop file storage path

      <configuration>
      <property>
      <name>fs.defaultFS</name>
      <value>hdfs://shsxt</value>
      </property>
      <property>
      <name>ha.zookeeper.quorum</name>
      <value>node1:2181,node2:2181,node3:2181</value>
      </property>
      <property>
      <name>hadoop.tmp.dir</name>
      <value>/var/sxt/hadoop/ha</value>
      </property>
      </configuration>
  • Configure the hdfs file hdfs-site.xml

    • dfs.nameservices — the NN nameservice (cluster) name

    • dfs.ha.namenodes.shsxt — the logical names of the NNs in the cluster; the "shsxt" in the property name must match the nameservice name

    • dfs.namenode.rpc-address.shsxt.nn1 and dfs.namenode.rpc-address.shsxt.nn2 — each NN's host name and RPC port, used by clients (e.g. the Eclipse plugin)

    • dfs.namenode.http-address.shsxt.nn1 and dfs.namenode.http-address.shsxt.nn2 — each NN's host name and HTTP (web UI) port

    • dfs.namenode.shared.edits.dir — the JournalNode host names and ports

    • dfs.journalnode.edits.dir — the JournalNode edits directory

    • dfs.client.failover.proxy.provider.shsxt — the client failover proxy class; the default class is used

    • dfs.ha.fencing.methods — the fencing methods: sshfence fences the failed NN over SSH, and shell(true) is added as a fallback so that split brain is avoided even when the old NN cannot be reached

    • dfs.ha.fencing.ssh.private-key-files — the private key for passwordless SSH; DSA is used here

    • dfs.ha.automatic-failover.enabled — enable automatic failover

    • dfs.replication — the number of file replicas

      <property>
      <name>dfs.nameservices</name>
      <value>shsxt</value>
      </property>
      <property>
      <name>dfs.ha.namenodes.shsxt</name>
      <value>nn1,nn2</value>
      </property>
      <property>
      <name>dfs.namenode.rpc-address.shsxt.nn1</name>
      <value>node1:8020</value>
      </property>
      <property>
      <name>dfs.namenode.rpc-address.shsxt.nn2</name>
      <value>node2:8020</value>
      </property>
      <property>
      <name>dfs.namenode.http-address.shsxt.nn1</name>
      <value>node1:50070</value>
      </property>
      <property>
      <name>dfs.namenode.http-address.shsxt.nn2</name>
      <value>node2:50070</value>
      </property>
      <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://node1:8485;node2:8485;node3:8485/shsxt</value>
      </property>
      <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/var/sxt/hadoop/ha/jn</value>
      </property>
      <property>
      <name>dfs.client.failover.proxy.provider.shsxt</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
      <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
      <value>shell(true)</value>
      </property>
      <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/root/.ssh/id_dsa</value>
      </property>
      <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
      </property>
      <property>
      <name>dfs.replication</name>
      <value>2</value>
      </property>
  • Configure the DN nodes in slaves

    node1
    node2
    node3

  • Configure the environment variable file /etc/profile

    export JAVA_HOME=/usr/java/jdk1.7.0_67
    export HADOOP_HOME=/opt/sxt/hadoop-2.6.5
    export ZOOKEEPER_HOME=/opt/sxt/zookeeper-3.4.6
    export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$PATH
  • Copy the environment variable file and the hadoop installation to the other nodes

    • scp /etc/profile root@node2:/etc/profile

    • scp /etc/profile root@node3:/etc/profile

    • scp -r /opt/sxt/hadoop-2.6.5/ root@node2:/opt/sxt/

    • scp -r /opt/sxt/hadoop-2.6.5/ root@node3:/opt/sxt/

  • Apply the environment variables: [123] source /etc/profile (the bracketed numbers indicate which nodes run the command, here node1-node3)

  • Create the hadoop data directory: [123] mkdir -p /var/sxt/hadoop/ha/jn

  • Start ZooKeeper

    • [123] zkServer.sh start

    • [123] zkServer.sh status

  • Start JournalNode

    • [123]hadoop-daemon.sh start journalnode

  • Format the NameNode

    • [1] hdfs namenode -format — format the primary NN

    • [1] hadoop-daemon.sh start namenode — start the primary NN by itself

    • [2] hdfs namenode -bootstrapStandby — bootstrap (format) the standby NN

  • Format ZKFC: [1] hdfs zkfc -formatZK

  • Start the cluster: [1] start-dfs.sh

    • After the first start, hadoop is started with this command alone; note that ZooKeeper must be started beforehand

  • Store a file and view the result

    hdfs dfs -mkdir -p /shsxt/java

    hdfs dfs -D dfs.blocksize=1048576 -put jdk-7u67-linux-x64.rpm /user/root

  • Shut down and take a VM snapshot of each node

hdfs commands

Web access addresses

The NN HTTP addresses configured in hdfs-site.xml:

  • 192.168.163.201:50070 and 192.168.163.202:50070

  • node1:50070 and node2:50070 (on Windows you need to configure the hosts file)

Starting and stopping the cluster

Start the cluster

  • On all hosts: zkServer.sh start

  • On one NN node: start-dfs.sh

Stop the cluster

  • On the NN node: stop-dfs.sh

  • On all nodes: zkServer.sh stop

View the cluster's running state with jps

jps shows the running processes on each host:

1393 NameNode NN node 
1486 DataNode DN node 
1644 JournalNode JN node 
1799 DFSZKFailoverController ZKFC failover controller 
1274 QuorumPeerMain ZK (zookeeper node) 
1891 Jps

Starting and stopping a single node

  • hadoop-daemon.sh start namenode — start a single NN

  • hadoop-daemon.sh stop namenode — stop a single NN

Upload and download commands

Create an hdfs directory

hdfs dfs -mkdir -p <hdfs directory>

e.g. hdfs dfs -mkdir -p /sxt/bigdata creates the /sxt/bigdata directory in hadoop

Upload a file

hdfs dfs -D dfs.blocksize=<block size in bytes> -put <local file> <hdfs directory>

e.g. hdfs dfs -D dfs.blocksize=1048576 -put tomcat /sxt/bigdata

Access from Eclipse

Configuring Windows

Extract Eclipse and place the hadoop-eclipse-plugin-2.6.0.jar file into Eclipse's plugins directory.

Extract hadoop-2.6.5.tar.gz into the D:\worksoft\ directory. Extract bin.zip and copy its contents into the bin directory of the extracted Hadoop folder under D:\worksoft\, replacing the existing files.

Configure environment variables

  • HADOOP_HOME

    D:\worksoft\hadoop-2.6.5.tar.gz

  • HADOOP_USER_NAME

    root

  • Add to PATH: %HADOOP_HOME%/bin;%HADOOP_HOME%/sbin

Connection Configuration

Open Eclipse and open the Map/Reduce view. Create a new connection in the Map/Reduce Locations view; for an HA cluster, configure one connection for each of the two NN nodes.

  • Location name: any custom connection name

  • DFS Master: uncheck the default option

    Change Host to node1 (the NameNode)

    Change Port to 8020, matching the rpc-address property in hdfs-site.xml

  • Change the user name to root

Test uploading and downloading files in the DFS Locations panel.

Java upload and download test

  • Create a Java project, import the IDE's built-in JUnit 4 jars, and import the custom hadoop jar packages

  • Copy the jar files of each module and of its lib directory (121 jars) from the hadoop software's share folder into a single folder, and add them to the project as a user-defined dependency

  • Import the cluster's core-site.xml and hdfs-site.xml into the project as resource files

Code section

Configure the hadoop/hdfs connection in @Before and close the connection in @After; the test methods use it in between.

  • Before

    • Configuration config = new Configuration(true)

    • fileSystem = FileSystem.get(config)

  • After

    • fileSystem.close()

Upload and download using the Configuration, FileSystem, Path, and IOUtils classes provided by hadoop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class MyHDFS {
    // Shared by all test methods
    Configuration config;
    FileSystem fileSystem;

    @Before
    public void init() throws IOException {
        // Initialization: read the configuration (core-site.xml / hdfs-site.xml on the classpath)
        config = new Configuration(true);
        // Get the distributed file system
        fileSystem = FileSystem.get(config);
    }

    @After
    public void destroy() throws IOException {
        // Tear down the connection
        fileSystem.close();
    }

    // Verify that a directory exists
    @Test
    public void exists() throws IOException {
        Path path = new Path("/shsxt/java");
        System.out.println(fileSystem.exists(path));
    }

    // Verify file upload
    @Test
    public void upload() throws Exception {
        // Create a local input stream from a text file
        InputStream in = new FileInputStream("D:\\123.txt");
        // Get an output stream for the target path in the distributed file system
        OutputStream out = fileSystem.create(new Path("/shsxt/java/123.txt"));
        // Copy the stream using hadoop's utility class
        IOUtils.copyBytes(in, out, config);
    }

    // Verify file download
    @Test
    public void download() throws Exception {
        // Byte stream that receives the downloaded content
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Open an input stream for the file on hdfs
        InputStream in = fileSystem.open(new Path("/shsxt/java/123.txt"));
        // Read the stream into the byte buffer
        int len;
        byte[] buffer = new byte[1024];
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        String words = new String(out.toByteArray(), "GB2312");
        // Print the downloaded content
        System.out.println(words);
    }
}
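
For completeness, FileSystem also provides one-call convenience methods for the same upload and download operations; a minimal sketch (the local and hdfs paths are made-up examples, not part of the original test class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(true);
        FileSystem fs = FileSystem.get(conf);

        // Upload: copy a local file into hdfs in one call
        fs.copyFromLocalFile(new Path("D:\\123.txt"), new Path("/shsxt/java/123.txt"));

        // Download: copy an hdfs file back to the local file system
        fs.copyToLocalFile(new Path("/shsxt/java/123.txt"), new Path("D:\\123-copy.txt"));

        fs.close();
    }
}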

 

 


Source: www.cnblogs.com/javaxiaobu/p/11702996.html