Three minutes to understand Hadoop

Hadoop Overview

I. Big Data Thinking

1. What is Big Data thinking?

Divide and conquer: decompose a complex problem into several smaller, equivalent parts, solve each part to obtain its intermediate result, then combine the intermediate results of all parts into the final result of the whole problem.

Parallelism: the key to increasing processing speed is distributed computing; move the computation to the data rather than moving the data to the computation.

II. Hadoop History

1. The three Google papers

GFS ---- HDFS

MapReduce ---- MapReduce

BigTable ---- HBase

2. Hadoop Modules

Hadoop Common: base module; RPC calls and socket communication

Hadoop HDFS (Hadoop Distributed File System): distributed file system for storing large amounts of data

Hadoop YARN: resource scheduling and coordination framework

Hadoop MapReduce: framework for large-scale data computation

Hadoop Ozone: object storage framework

Hadoop Submarine: machine learning engine

3. Distributed File System

1. Distributed File System Architecture

FS File System

A file system is a tool for managing the files on a hard disk.

It decouples files from the underlying hard disk for the operating system and its users.

DFS: Distributed File System

Data is stored across multiple computers.

There are many distributed file systems.

HDFS is the storage foundation for MapReduce computation.

2. Principles of the Distributed Architecture
  • How to Split

    • Data is stored on the hard disk as an array of bytes

    • Splitting a file into two is equivalent to splitting its byte array into two

      • 888 KB (909,312 bytes)

      • 444 KB (454,656 bytes)

      • 444 KB (454,656 bytes)

    • If the two byte arrays are merged back together, the file is restored to its original form

    • If a file is large, it is cut into N parts, which corresponds to cutting the byte array into N sub-arrays (sketched with shell commands after this list)

      • How are the parts spliced back together? In offset order, e.g. 10, 20, 30, 40

    • To record which position each sub-block (sub-array of bytes) belongs to, its offset within the whole file is recorded

      • Like an array index (subscript), the offset lets the data be located quickly

  • Split size

    • After splitting, the block sizes should be consistent

      • If the sizes differ, it is hard to compute a block's position from its offset

      • If block sizes are inconsistent, the time to pull data from multiple nodes is also inconsistent

      • In distributed computing, each machine should take roughly the same time to produce its result

      • When designing distributed algorithms, non-uniform data makes the algorithm hard to design

    • In Hadoop 1.x the default block size is 64 MB; in Hadoop 2.x and later it is 128 MB

    • Within the same file, all blocks are exactly the same size except the last block

    • Different files may use different block sizes

    • Number of blocks = ceil(total size / block size) (see the shell sketch after this list)

      • 1024 MB file / 100 MB blocks → 11 blocks

      • 10 MB file / 1 MB blocks → 10 blocks

    • Problem

      • Cutting strictly at block boundaries can split one complete record across two blocks

  • Data Security

    • Data is backed up as multiple copies (replicas)

    • By default, each block has three replicas

    • The number of replicas should not exceed the number of nodes

  • Data Rules

    • Once a file is stored in HDFS, its data cannot be modified

      • Modification would change the offsets

      • Modification would cause data skew

      • Modifying data would trigger a butterfly effect (one small change cascades through the following blocks)

    • Appending is possible, but not recommended

    • HDFS generally stores historical data
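
The splitting and block-count rules above can be sketched with ordinary Linux shell tools. This only illustrates the idea; it is not how HDFS itself is implemented, and the file name and the 1 MB block size are made up for the example:

# number of blocks = ceil(total size / block size)
FILE=bigfile.bin
BLOCK=1048576                             # example block size: 1 MB
SIZE=$(stat -c %s "$FILE")                # total size in bytes
echo $(( (SIZE + BLOCK - 1) / BLOCK ))    # integer ceiling division

# cut the file into fixed-size blocks; every block is BLOCK bytes except the last one
split -b $BLOCK -d "$FILE" block_

# the numeric suffix records each block's order (its offset in the original file),
# so concatenating the blocks in order restores the original file
cat block_* > restored.bin
cmp "$FILE" restored.bin && echo "restored correctly"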

3. Node Roles

NameNode: management node

DataNode: stores the data

III. Building a Pseudo-Distributed Cluster

1. Cloning a virtual machine

Follow the Linux setup method from before: modify the IP address and modify the hostname.

2. Set Up Passwordless SSH Login

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

ssh-copy-id -i ~/.ssh/id_rsa.pub root@node01 (enables passwordless login to the IP/hostname set up in the previous step)
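
To check that passwordless login works, connect once by hand; it should not prompt for a password (root@node01 stands for whatever user and host were used in the previous step):

ssh root@node01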

3. Upload the Hadoop archive, extract it, and move it to the specified path

tar -zxvf hadoop-2.6.5.tar.gz

mv hadoop-2.6.5 /opt/sxt

4. Set Environment Variables

vim /etc/profile

Add the Hadoop path: export HADOOP_HOME=/opt/sxt/hadoop-2.6.5 (adjust to the actual installation path)

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
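
Reload the profile so the new variables take effect in the current shell:

source /etc/profile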

5. Modify the Hadoop Configuration Files

File Path: /opt/sxt/hadoop-2.6.5/etc/hadoop

(1) Set the JDK path in the environment scripts

hadoop-env.sh: set the JDK installation path (around line 25)

mapred-env.sh: set the JDK installation path (around line 16)

yarn-env.sh: set the JDK installation path (around line 23)
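
In each of these three scripts the line to change is the JAVA_HOME export. A sketch of what it might look like, assuming the JDK 7u67 RPM used later in this guide was installed to its default directory (adjust to the actual path):

export JAVA_HOME=/usr/java/jdk1.7.0_67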

(2) Modify the core configuration files

In core-site.xml, add:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/var/sxt/hadoop/local</value>
</property>

In hdfs-site.xml, add:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node01:50090</value>
</property>

6. Format the NameNode

hdfs namenode -format

7. Start HDFS

start-dfs.sh
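
To confirm that the daemons are running, the JDK's jps tool lists the running Java processes; in this pseudo-distributed setup you should see NameNode, DataNode, and SecondaryNameNode:

jps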

8. Access the NameNode Web UI

http://192.168.61.200:50070

9. Use commands to create directories and upload files

hdfs dfs -mkdir -p /user/root

hdfs dfs -put apache-tomcat-7.0.61.tar.gz /user/root

hdfs dfs -D dfs.blocksize=1048576 -put jdk-7u67-linux-x64.rpm /user/root
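
To see how the 1 MB-block upload was actually split into blocks, hdfs fsck can report the file's block information (the path assumes the put command above succeeded):

hdfs fsck /user/root/jdk-7u67-linux-x64.rpm -files -blocks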

10. View the related information in the web UI

Origin www.cnblogs.com/ruanjianwei/p/11780929.html