Hadoop Overview
I. Big Data Thinking
1. What is big data thinking?
Divide and conquer: decompose a complex problem into several smaller parts of equivalent form, solve each part to obtain an intermediate result, then combine the intermediate results of all the parts into the final result of the whole problem.
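The divide-and-conquer idea above can be sketched in a few lines of Python (an illustration only, not Hadoop code — the function name and chunking scheme are made up for the example):

```python
# A minimal divide-and-conquer sketch: split a large input into chunks,
# compute a partial (intermediate) result for each chunk independently,
# then combine the intermediate results into the final answer.
def chunked_sum(data, chunk_size):
    # Split: cut the problem into smaller parts of equivalent form
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Conquer: solve each part on its own (the intermediate results)
    partials = [sum(c) for c in chunks]
    # Combine: merge the intermediate results into the final result
    return sum(partials)

print(chunked_sum(list(range(1, 101)), 10))  # 5050
```

In MapReduce terms, the per-chunk step corresponds to the map phase and the combination step to the reduce phase.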
II. Hadoop History
1. The three Google papers
GFS ---- HDFS
MapReduce ---- MapReduce
BigTable ---- HBase
2. Hadoop Modules
Hadoop Common: basic module; RPC calls and socket communication
Hadoop Distributed File System (HDFS): distributed file system for storing massive amounts of data
Hadoop YARN: resource coordination framework
Hadoop MapReduce: large-scale data computation framework
Hadoop Ozone: object storage framework
Hadoop Submarine: machine learning engine
3. Distributed File System
1. Distributed File System Architecture
FS (File System): a file system is a tool for managing files on a hard disk; it decouples the operating system's file operations from the hard drive itself.
DFS (Distributed File System): stores data across multiple computers. There are many distributed file systems; HDFS is the storage foundation for MapReduce computation.
2. Principles of the distributed architecture
- How to split
  - Data is stored on the hard disk as a byte array
  - Splitting a file into two parts is equivalent to splitting the byte array into two
  - Example: an 888 KB file (909,312 bytes) splits into 444 KB (454,656 bytes) + 444 KB (454,656 bytes)
  - If the two arrays are merged back together, the file is restored to its original form
  - If the file is large, it is cut into N parts, corresponding to N byte sub-arrays
  - How are the pieces spliced back together? They must be reassembled in their original order
  - To record where each sub-block (byte sub-array) belongs, record the sub-block's offset within the whole file
  - Like an index (subscript) into an array, the offset lets us locate data quickly
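The split-by-offset idea above can be sketched as follows (a toy illustration, not HDFS internals — the function names and the 300-byte block size are invented for the example):

```python
# Sketch: split a byte array into fixed-size blocks and record each block's
# offset within the whole file, so the file can be reassembled later.
def split_into_blocks(data: bytes, block_size: int):
    blocks = []
    for offset in range(0, len(data), block_size):
        # Store (offset, sub-array); the offset acts like an array index
        blocks.append((offset, data[offset:offset + block_size]))
    return blocks

def merge_blocks(blocks):
    # Reassemble by sorting the sub-blocks on their recorded offsets
    return b"".join(chunk for _, chunk in sorted(blocks))

original = bytes(range(256)) * 4              # a 1024-byte "file"
blocks = split_into_blocks(original, 300)     # blocks of 300, 300, 300, 124 bytes
assert merge_blocks(blocks) == original       # restored to its original form
```

Because every block except the last has the same size, a block's position in the file can be computed directly from its offset.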
- Block size
  - After splitting, the block sizes should be consistent
  - If the sizes differ, it is hard to compute a block's position from its offset
  - Inconsistent block sizes also make the amount of data pulled from different nodes inconsistent
  - With uniform blocks, each machine in a distributed computation finishes in roughly the same time
  - When designing distributed algorithms, non-uniform data makes the algorithm hard to design
  - In Hadoop 1.x the default block size is 64 MB; in Hadoop 2.x and later, the default is 128 MB
  - Within one file, all blocks must be exactly the same size, except possibly the last block
  - Different files may use different block sizes
  - Number of blocks = ceil(total size / block size)
  - 1024 MB with 100 MB blocks → 11 blocks
  - 10 MB with 1 MB blocks → 10 blocks
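The block-count formula and the two examples above can be checked directly (a small sketch; the helper name is invented for the example):

```python
import math

# Number of blocks = ceil(total size / block size)
def block_count(total_bytes: int, block_size: int) -> int:
    return math.ceil(total_bytes / block_size)

MB = 1024 * 1024
print(block_count(1024 * MB, 100 * MB))  # 11 (the last block holds only 24 MB)
print(block_count(10 * MB, 1 * MB))      # 10 (divides evenly, no short block)
```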
- Problem
  - The cut is purely by size: a complete logical record may end up split across two blocks
- Data safety
  - Keep multiple backup copies of the data
  - Each block has 3 replicas by default
  - The number of replicas should not exceed the number of nodes
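The replica-placement constraint above can be sketched as follows (a simplified illustration, not HDFS's actual rack-aware placement policy — the function name and node names are made up):

```python
# Sketch: place block replicas on distinct nodes. The effective replica
# count cannot exceed the number of available DataNodes, because two
# copies on the same node add no extra safety.
def place_replicas(nodes, replication=3):
    effective = min(replication, len(nodes))
    return nodes[:effective]

print(place_replicas(["node01", "node02", "node03", "node04"]))  # 3 replicas
print(place_replicas(["node01", "node02"]))                      # capped at 2
```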
- Data rules
  - Once a file is stored in HDFS, its data cannot be modified
  - Modification would change the offsets of the blocks that follow
  - Modification could cause data skew
  - A modification would have a butterfly effect, cascading through all subsequent blocks
  - Appending is allowed, but not recommended
  - HDFS generally stores historical data
3. Node roles
NameNode: management node (stores the metadata)
DataNode: stores the actual data
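The division of roles can be sketched with a toy model (an illustration only, not the real HDFS protocol — the class layout, block IDs, and node names are invented for the example):

```python
# Toy sketch of the role split: the NameNode keeps only metadata (which
# blocks make up a file, and which DataNode holds each block), while
# DataNodes hold the actual block bytes.
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}              # block_id -> raw bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    def __init__(self):
        self.metadata = {}            # filename -> [(block_id, datanode_name)]

    def register_block(self, filename, block_id, datanode):
        self.metadata.setdefault(filename, []).append((block_id, datanode.name))

nn = NameNode()
dn = DataNode("node01")
dn.store("blk_0", b"hello ")
nn.register_block("demo.txt", "blk_0", dn)
print(nn.metadata["demo.txt"])  # [('blk_0', 'node01')]
```

A client first asks the NameNode where a file's blocks live, then reads the bytes directly from the DataNodes.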
III. Building a Pseudo-Distributed Cluster
1. Clone a virtual machine
Follow the earlier Linux setup method: modify the IP address and the hostname
2. Set up passwordless SSH login
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected] (use the IP configured in the previous step)
3. Upload the Hadoop archive, extract it, and move it to the specified path
tar -zxvf hadoop-2.6.5.tar.gz
mv hadoop-2.6.5 /opt/sxt
4. Set environment variables
vim /etc/profile
Add the Hadoop path: export HADOOP_HOME=/opt/sxt/hadoop-2.6.5 (adjust to your actual installation path)
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
5. Modify the configuration files
File path: /opt/sxt/hadoop-2.6.5/etc/hadoop
(1) Set the JDK path in the environment scripts
hadoop-env.sh: set the JDK installation path (line 25)
mapred-env.sh: set the JDK installation path (line 16)
yarn-env.sh: set the JDK installation path (line 23)
(2) Modify the core configuration files
In core-site.xml, add:
<property>
<name>fs.defaultFS</name>
<value>hdfs://node01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/sxt/hadoop/local</value>
</property>
In hdfs-site.xml, add:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node01:50090</value>
</property>
6. Format
hdfs namenode -format
7. Start
start-dfs.sh
8. Access the NameNode web UI (http://node01:50070)
9. Create a directory and upload files using HDFS commands
hdfs dfs -mkdir -p /user/root
hdfs dfs -put apache-tomcat-7.0.61.tar.gz /user/root
hdfs dfs -D dfs.blocksize=1048576 -put jdk-7u67-linux-x64.rpm /user/root