What is Hadoop, why it appears, and what kind of problems it solves:
1. Reduces costs
2. A framework -- the framework for handling big data
3. As the amount of data grows, so does the time the business needs to process it
The three Google papers (the "troika" that started the Hadoop era):
Google File System
Google Bigtable
Google MapReduce
Personally I think they are worth a look if you are curious.
Four data sources:
User behavior data (feeds recommendation systems)
-> search habits
-> consumption records (Alipay, WeChat)
Business data:
-> data generated inside the company
Web crawler collection:
-> Python, Java
Log files on production machines:
-> production log files
Three major Hadoop distributions:
Apache -- the Apache top-level project
CDH -- Cloudera
HDP -- Hortonworks
Distributed:
Distributed storage plus distributed computing; the final results are written back to one or more files
Hadoop ecosystem:
Early on: HDFS + MR
Now: HDFS + Hive + Storm + Spark
[Three operating modes of hadoop]
Local (Standalone) Mode: developers use it for debugging; files are stored on the local file system.
Pseudo-Distributed Mode: also for developer debugging; HDFS is built locally with all daemons on one machine, sitting between standalone and fully distributed.
1. The company gives you the data, and you build Hadoop and Hive on your own Windows machine.
2. There may be a test cluster with the data already loaded and CM (required); log in to the server and submit jobs locally.
CM: Cloudera Manager (remember it needs a lot of memory when you are learning with it)
Fully-Distributed Mode: how a fully distributed (cluster) production environment is built. HA: high availability -- e.g. if a node suddenly fails, the cluster must remain usable. HA is available in most big data frameworks.
[hadoop environment deployment - JDK part]
1. Modify permissions:
chown -R username:username /opt/
In companies, CM files are owned by root; reasons to work as a non-root user:
1. Limits the damage if you delete the wrong thing
2. rm -rf /xx is hardly ever used in companies; instead, use mv to move things to a tmp directory and decide after about a week whether to delete them for good (a mistaken deletion can even carry legal responsibility). Example: Hearthstone was down for 12 hours because data was deleted by mistake, and in the end it could not be restored.
2. Unzip the JDK to the specified directory, it is recommended not to install it in a user's home directory:
tar -zxvf xxxx -C /opt/modules/
3. Add environment variables
Use root to modify the /etc/profile file and configure the JDK environment variables
#JAVA_HOME
export JAVA_HOME=<jdk install directory>
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
4. Verify: java -version
jps can view java process
echo $JAVA_HOME
[hadoop pseudo-distributed environment deployment--hadoop part]
1. Unzip hadoop to the directory
tar -zxvf xxx -C /opt/modules/
2. Clean up the hadoop directory: delete the hadoop/share/doc directory to save disk space (check disk usage with df -h)
3. Modify the hadoop/etc/hadoop/hadoop-env.sh file
Modify hadoop/etc/hadoop/mapred-env.sh file
Modify the hadoop/etc/hadoop/yarn-env.sh file
In each of them, specify the Java installation path (JAVA_HOME)
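As a sketch of this step (the JDK path and the config directory below are assumed examples, not fixed values), appending an explicit JAVA_HOME to the three env scripts means the daemons no longer depend on the login shell's environment:

```shell
# Hard-code JAVA_HOME into the three env scripts.
# HADOOP_CONF and the JDK path are assumed examples -- adjust to your setup.
HADOOP_CONF=${HADOOP_CONF:-/tmp/demo-hadoop-conf}
mkdir -p "$HADOOP_CONF"
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
  echo 'export JAVA_HOME=/opt/modules/jdk1.8.0_121' >> "$HADOOP_CONF/$f"
done
# confirm each script now pins the Java path
grep -l 'JAVA_HOME' "$HADOOP_CONF"/*.sh
```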
4. Note: the four core modules of hadoop correspond to four default configuration files
Specify the default file system as HDFS: this is the file system's access entry, i.e. the machine where the namenode runs
Port 9000 was used by early hadoop 1.x; hadoop 2.x uses 8020
This port is for internal communication between nodes, using the RPC mechanism
5. Modify the hadoop/etc/hadoop/core-site.xml file
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hostname:8020</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/modules/hadoop-2.5.0/data/tmp</value>
</property>
6. Note: /tmp is the system's temporary directory; on every restart its files may be deleted by preset scripts.
So re-point the files hadoop generates to a custom path: anything left in /tmp can be wiped, and the safety of the data files cannot be guaranteed.
7. Modify the hadoop/etc/hadoop/hdfs-site.xml file
Specify the number of replicas for HDFS files. The default is 3; on a single machine set it to 1. This number should not exceed the number of datanode nodes.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
8. Modify the hadoop/etc/hadoop/slaves file
-> Specify the machine location of the slave node, add hostname
9. Format the namenode
bin/hdfs namenode -format
10. Start command
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
11. View HDFS external UI interface
hostname or IP address followed by port 50070; external access goes over HTTP
(property: dfs.namenode.http-address, default port 50070)
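If the web address needs to be set explicitly, the property goes in hdfs-site.xml; 0.0.0.0 means listen on all interfaces, and since 50070 is already the default this entry is usually optional:

```
<property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:50070</value>
</property>
```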
12. Test the HDFS environment
Create a folder; HDFS has the concept of a user home directory, just like Linux
bin/hdfs dfs -mkdir -p /user/test/input
13. Upload files to hdfs
bin/hdfs dfs -put etc/hadoop/core-site.xml etc/hadoop/hdfs-site.xml /
14. Read the hdfs file
bin/hdfs dfs -text /core-site.xml
15. Download the file to the local (specify where to download and rename it to get-site.xml)
bin/hdfs dfs -get /core-site.xml /tmp/get-site.xml
[Defects of HDFS]
-> The files stored in hdfs cannot be modified
-> hdfs does not support multi-user concurrent writing
-> hdfs is not suitable for storing a large number of small files
[yarn configuration]
1. Modify the hadoop/etc/hadoop/mapred-site.xml file
Specifies that the mapreduce computing model runs on yarn
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
2. Modify the hadoop/etc/hadoop/yarn-site.xml file
Specify the auxiliary service the nodemanager runs for mapreduce (the shuffle service)
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
3. Specify the resourcemanager master node machine. This is optional -- the default is the local machine -- but once it is set, trying to start the resourcemanager on a different machine will report an error.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hostname</value>
</property>
4. Start yarn
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
5. View the yarn web page
hostname:8088
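A quick way to confirm the daemons came up is jps. The sketch below feeds jps-style sample output (the pids are invented; the process names are the real daemon names) through a small check function, so the logic is visible without a running cluster:

```shell
# Verify that the expected daemon names appear in jps-style output.
# In real use you would pass "$(jps)" instead of the sample string.
check_daemons() {
  local jps_output="$1"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    if ! echo "$jps_output" | grep -q "$daemon"; then
      echo "MISSING: $daemon"
      return 1
    fi
  done
  echo "all daemons up"
}

# invented sample output standing in for a live `jps` call
sample="1234 NameNode
2345 DataNode
3456 ResourceManager
4567 NodeManager"
check_daemons "$sample"
```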
6. Test run a mapreduce, wordcount word count case
A mapreduce job can be divided into five stages:
input -> map() -> shuffle -> reduce() -> output
Note: to run mapreduce on yarn, the job must be packaged as a jar
Create a new data file to test mapreduce
Upload the data file from local to HDFS:
bin/hdfs dfs -put /opt/love.txt /user/test/input
Use official examples: share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar
7. Run
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/test/input/love.txt /user/test/output
View: bin/hdfs dfs -cat /user/test/output/part*
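The five stages can be imitated locally with a plain shell pipeline (the contents of love.txt here are invented for illustration): tr plays map, sort plays shuffle, and uniq -c plays reduce.

```shell
# invented sample input standing in for the real love.txt
printf 'hello hadoop\nhello hdfs\n' > /tmp/love.txt

# map: emit one word per line; shuffle: sort groups identical words;
# reduce: uniq -c counts each group
tr -s ' ' '\n' < /tmp/love.txt | sort | uniq -c
```

The pipeline prints each distinct word with its count, just like the wordcount example's part-r-00000 output.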
[HDFS Architecture]
1. Data block block
2. The default size of each block is 128 MB; the size can be customized by the user
3. If you modify it, write it to hdfs-site.xml
For every file block, the namenode creates a piece of metadata; this metadata also takes up space and lives in the namenode's memory
(related: secondarynamenode, HA)
<property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>
        The default block size (in bytes) for new files.
        The following suffixes can be used (case insensitive):
        k (kilo), m (mega), g (giga), t (tera), p (peta), e (exa) to specify the size (such as 128k, 512m, 1g, etc.),
        or provide the complete size in bytes (e.g. 134217728 for 128 MB).
    </description>
</property>
4. A 500 MB file with the default 128 MB block size: [128, 128, 128, 116] -- four blocks
5. If the size of a file is less than the size of the block: it will not occupy the space of the entire block
6. Storage Mode:
hdfs will be divided into blocks by default, and the size can be set
There are different ways to set:
1) Through the create method of the HDFS API, you can specify the block size of the created file (any size)
2) In hive, it can be set in hive-site.xml: the block size hive outputs (it can be greater than 128)
eg: When I store a 129 MB file, how many blocks are there? Two in total (128 + 1)
Computing over the data:
When files on HDFS go through a mapreduce job, by default each map takes 128 MB of input (the same as the block size).
So how many maps will this 129 MB file start?
Answer: 1. mapreduce has a rule: if what remains is less than 128 * 1.1, it is folded into the last map instead of starting a new one, to avoid wasting resources. Naturally, this can only happen to the final split.
eg: a 522 MB file -- how many maps process it? (4)
If you are not sure, work it out.
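The two different counts (storage blocks vs map splits) can be checked with a little integer arithmetic. This is a sketch of the rule as described above, not the exact FileInputFormat code:

```shell
# number of 128 MB HDFS storage blocks for a file of $1 MB (round up)
blocks() {
  echo $(( ($1 + 127) / 128 ))
}

# number of map input splits, honoring the 1.1 "slop" on the last piece
splits() {
  local size=$1 n=0
  # "size * 10 > 128 * 11" is "size > 128 * 1.1" in integer arithmetic
  while [ $(( size * 10 )) -gt $(( 128 * 11 )) ]; do
    size=$(( size - 128 ))
    n=$(( n + 1 ))
  done
  [ "$size" -gt 0 ] && n=$(( n + 1 ))
  echo "$n"
}

echo "129 MB: $(blocks 129) blocks, $(splits 129) map(s)"
echo "522 MB: $(blocks 522) blocks, $(splits 522) map(s)"
```

For 129 MB this gives 2 blocks but only 1 map (129 < 140.8); for 522 MB, 5 blocks and 4 maps (the final 138 MB remainder fits under the slop).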
Remember:
HDFS is not suited to storing very many small files.
Merging them into large files can be considered, but the effect is not dramatic.
Alibaba open-sourced TFS (the Taobao File System), which took HDFS as a reference.
7. Mechanisms that keep the data safe
Replica count:
A file is written as multiple copies placed on different machine nodes
After the file is split into blocks, each block is replicated
8. Placement strategy:
The first replica of a block: if the client runs on a machine inside the cluster, it is placed on that machine
If the client is not in the cluster, a node is picked at random
The second replica is placed on a random node on a different rack from the first
The third replica is placed on a different random node on the same rack as the second
Other points:
Load balancing: keep blocks evenly distributed
Rack awareness mechanism
Block scanning mechanism:
HDFS generates checksums for files and verifies them periodically; if a block is corrupted, reading it will produce an error
Block repair (manual):
Take the machine node holding the block out of service (the disk may be damaged or full, or it may be a process problem)
Many big data frameworks have a balancer for load balancing