What is Hadoop, why it appears, and what kind of problems it solves:
1. Reduces costs
2. A framework -- the framework for handling big data
3. As the amount of data grows, so does the time the business needs to process it
The three Google papers (the "troika" that started the Hadoop era):
Google File System
Google Bigtable
Google MapReduce
Personally I think they are worth a look if you are curious.
Four data sources:
User behavior data (feeds recommendation systems)
-> search habits
-> consumption records (Alipay, WeChat)
Business data:
-> data generated inside the company
Web crawler collection:
-> Python, Java
Log files on production machines:
-> production log files
Three major Hadoop distributions:
Apache -- the Apache top-level project
CDH -- Cloudera
HDP -- Hortonworks
Distributed:
Distributed storage plus distributed computing; the final results are written back to one or more files
Hadoop ecosystem:
Early on: HDFS + MR
Now: HDFS + Hive + Storm + Spark
[Three operating modes of hadoop]
Local (Standalone) Mode: developers use it for debugging; files are stored on the local file system.
Pseudo-Distributed Mode: also for developer debugging; HDFS is built locally with all daemons on one machine, sitting between standalone and fully distributed.
1. The company gives you the data, and you build Hadoop and Hive on your own Windows machine.
2. There may be a test cluster with the data already loaded and CM (required); log in to the server and submit jobs locally.
CM: Cloudera Manager (remember it needs a lot of memory when you are learning with it)
Fully-Distributed Mode: how a fully distributed (cluster) production environment is built. HA: high availability -- e.g. if a node suddenly fails, the cluster must remain usable. HA is available in most big data frameworks.
[hadoop environment deployment - JDK part]
1. Modify permissions:
chown -R username:username /opt/
In companies, CM files are owned by root; reasons to work as a non-root user:
1. Limits the damage if you delete the wrong thing
2. rm -rf /xx is hardly ever used in companies; instead, use mv to move things to a tmp directory and decide after about a week whether to delete them for good (a mistaken deletion can even carry legal responsibility). Example: Hearthstone was down for 12 hours because data was deleted by mistake, and in the end it could not be restored.
2. Unzip the JDK to the specified directory, it is recommended not to install it in a user's home directory:
tar -zxvf xxxx -C /opt/modules/
3. Add environment variables
Use root to modify the /etc/profile file and configure the JDK environment variables
#JAVA_HOME
export JAVA_HOME=<jdk install directory>
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
4. Verify: java -version
jps can view java process
echo $JAVA_HOME
[hadoop pseudo-distributed environment deployment--hadoop part]
1. Unzip hadoop to the directory
tar -zxvf xxx -C /opt/modules/
2. Clean up the hadoop directory: delete the hadoop/share/doc directory to save disk space (check disk usage with df -h)
3. Modify the hadoop/etc/hadoop/hadoop-env.sh file
Modify hadoop/etc/hadoop/mapred-env.sh file
Modify the hadoop/etc/hadoop/yarn-env.sh file
In each of them, specify the Java installation path (JAVA_HOME)
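As a sketch of this step (the JDK path and the config directory below are assumed examples, not fixed values), appending an explicit JAVA_HOME to the three env scripts means the daemons no longer depend on the login shell's environment:

```shell
# Hard-code JAVA_HOME into the three env scripts.
# HADOOP_CONF and the JDK path are assumed examples -- adjust to your setup.
HADOOP_CONF=${HADOOP_CONF:-/tmp/demo-hadoop-conf}
mkdir -p "$HADOOP_CONF"
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
  echo 'export JAVA_HOME=/opt/modules/jdk1.8.0_121' >> "$HADOOP_CONF/$f"
done
# confirm each script now pins the Java path
grep -l 'JAVA_HOME' "$HADOOP_CONF"/*.sh
```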
4. Note: the four core modules of hadoop correspond to four default configuration files
Specify the default file system as HDFS: this is the file system's access entry, i.e. the machine where the namenode runs
Port 9000 was used by early hadoop 1.x; hadoop 2.x uses 8020
This port is for internal communication between nodes, using the RPC mechanism
5. Modify the hadoop/etc/hadoop/core-site.xml file
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hostname:8020</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/modules/hadoop-2.5.0/data/tmp</value>
</property>
6. Note: /tmp is the system's temporary directory; on every restart its files may be deleted by preset scripts.
So re-point the files hadoop generates to a custom path: anything left in /tmp can be wiped, and the safety of the data files cannot be guaranteed.
7. Modify the hadoop/etc/hadoop/hdfs-site.xml file
Specify the number of replicas for HDFS files. The default is 3; on a single machine set it to 1. This number should not exceed the number of datanode nodes.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
8. Modify the hadoop/etc/hadoop/slaves file
-> Specify the machine location of the slave node, add hostname
9. Format the namenode
bin/hdfs namenode -format
10. Start command
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
11. View HDFS external UI interface
hostname or IP address followed by port 50070; external access goes over HTTP
(property: dfs.namenode.http-address, default port 50070)
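If the web address needs to be set explicitly, the property goes in hdfs-site.xml; 0.0.0.0 means listen on all interfaces, and since 50070 is already the default this entry is usually optional:

```
<property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:50070</value>
</property>
```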
12. Test the HDFS environment
Create a folder; HDFS has the concept of a user home directory, just like Linux
bin/hdfs dfs -mkdir -p /user/test/input
13. Upload files to hdfs
bin/hdfs dfs -put etc/hadoop/core-site.xml etc/hadoop/hdfs-site.xml /
14. Read the hdfs file
bin/hdfs dfs -text /core-site.xml
15. Download the file to the local (specify where to download and rename it to get-site.xml)
bin/hdfs dfs -get /core-site.xml /tmp/get-site.xml
[Defects of HDFS]
-> The files stored in hdfs cannot be modified
-> hdfs does not support multi-user concurrent writing
-> hdfs is not suitable for storing a large number of small files
[yarn configuration]
1. Modify the hadoop/etc/hadoop/mapred-site.xml file
Specifies that the mapreduce computing model runs on yarn
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
2. Modify the hadoop/etc/hadoop/yarn-site.xml file
Specify the auxiliary service the nodemanager runs for mapreduce (the shuffle service)
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
3. Specify the resourcemanager master node machine. This is optional -- the default is the local machine -- but once it is set, trying to start the resourcemanager on a different machine will report an error.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hostname</value>
</property>
4. Start yarn
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
5. View the yarn web page
hostname:8088
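A quick way to confirm the daemons came up is jps. The sketch below feeds jps-style sample output (the pids are invented; the process names are the real daemon names) through a small check function, so the logic is visible without a running cluster:

```shell
# Verify that the expected daemon names appear in jps-style output.
# In real use you would pass "$(jps)" instead of the sample string.
check_daemons() {
  local jps_output="$1"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    if ! echo "$jps_output" | grep -q "$daemon"; then
      echo "MISSING: $daemon"
      return 1
    fi
  done
  echo "all daemons up"
}

# invented sample output standing in for a live `jps` call
sample="1234 NameNode
2345 DataNode
3456 ResourceManager
4567 NodeManager"
check_daemons "$sample"
```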
6. Test run a mapreduce, wordcount word count case
A mapreduce job can be divided into five stages:
input -> map() -> shuffle -> reduce() -> output
Note: to run mapreduce on yarn, the job must be packaged as a jar
Create a new data file to test mapreduce
Upload the data file from local to HDFS:
bin/hdfs dfs -put /opt/love.txt /user/test/input
Use official examples: share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar
7. Run
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/test/input/love.txt /user/test/output
View: bin/hdfs dfs -cat /user/test/output/part*
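The five stages can be imitated locally with a plain shell pipeline (the contents of love.txt here are invented for illustration): tr plays map, sort plays shuffle, and uniq -c plays reduce.

```shell
# invented sample input standing in for the real love.txt
printf 'hello hadoop\nhello hdfs\n' > /tmp/love.txt

# map: emit one word per line; shuffle: sort groups identical words;
# reduce: uniq -c counts each group
tr -s ' ' '\n' < /tmp/love.txt | sort | uniq -c
```

The pipeline prints each distinct word with its count, just like the wordcount example's part-r-00000 output.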
[HDFS Architecture]
1. Data block block
2. The default size of each block is 128 MB; the size can be customized by the user
3. If you modify it, write it to hdfs-site.xml
For every file block, the namenode creates a piece of metadata; this metadata also takes up space and lives in the namenode's memory
(related: secondarynamenode, HA)
<property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>
        The default block size (in bytes) for new files.
        The following suffixes can be used (case insensitive):
        k (kilo), m (mega), g (giga), t (tera), p (peta), e (exa) to specify the size (such as 128k, 512m, 1g, etc.),
        or provide the complete size in bytes (e.g. 134217728 for 128 MB).
    </description>
</property>
4. A 500 MB file with the default 128 MB block size: [128, 128, 128, 116] -- four blocks
5. If the size of a file is less than the size of the block: it will not occupy the space of the entire block
6. Storage Mode:
hdfs will be divided into blocks by default, and the size can be set
There are different ways to set:
1) Through the create method of the HDFS API, you can specify the block size of the created file (any size)
2) In hive, it can be set in hive-site.xml: the block size hive outputs (it can be greater than 128)
eg: When I store a 129 MB file, how many blocks are there? Two in total (128 + 1)
Computing over the data:
When files on HDFS go through a mapreduce job, by default each map takes 128 MB of input (the same as the block size).
So how many maps will this 129 MB file start?
Answer: 1. mapreduce has a rule: if what remains is less than 128 * 1.1, it is folded into the last map instead of starting a new one, to avoid wasting resources. Naturally, this can only happen to the final split.
eg: a 522 MB file -- how many maps process it? (4)
If you are not sure, work it out.
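The two different counts (storage blocks vs map splits) can be checked with a little integer arithmetic. This is a sketch of the rule as described above, not the exact FileInputFormat code:

```shell
# number of 128 MB HDFS storage blocks for a file of $1 MB (round up)
blocks() {
  echo $(( ($1 + 127) / 128 ))
}

# number of map input splits, honoring the 1.1 "slop" on the last piece
splits() {
  local size=$1 n=0
  # "size * 10 > 128 * 11" is "size > 128 * 1.1" in integer arithmetic
  while [ $(( size * 10 )) -gt $(( 128 * 11 )) ]; do
    size=$(( size - 128 ))
    n=$(( n + 1 ))
  done
  [ "$size" -gt 0 ] && n=$(( n + 1 ))
  echo "$n"
}

echo "129 MB: $(blocks 129) blocks, $(splits 129) map(s)"
echo "522 MB: $(blocks 522) blocks, $(splits 522) map(s)"
```

For 129 MB this gives 2 blocks but only 1 map (129 < 140.8); for 522 MB, 5 blocks and 4 maps (the final 138 MB remainder fits under the slop).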
Remember:
HDFS is not suited to storing very many small files.
Merging them into large files can be considered, but the effect is not dramatic.
Alibaba open-sourced TFS (the Taobao File System), which took HDFS as a reference.
7. Mechanisms that keep the data safe
Replica count:
A file is written as multiple copies placed on different machine nodes
After the file is split into blocks, each block is replicated
8. Placement strategy:
The first replica of a block: if the client runs on a machine inside the cluster, it is placed on that machine
If the client is not in the cluster, a node is picked at random
The second replica is placed on a random node on a different rack from the first
The third replica is placed on a different random node on the same rack as the second
Other points:
Load balancing: keep blocks evenly distributed
Rack awareness mechanism
Block scanning mechanism:
HDFS generates checksums for files and verifies them periodically; if a block is corrupted, reading it will produce an error
Block repair (manual):
Take the machine node holding the block out of service (the disk may be damaged or full, or it may be a process problem)
Many big data frameworks have a balancer for load balancing