The road to big data advancement: building Hadoop, day one

What Hadoop is, why it exists, and what kinds of problems it solves:

 1. Reduce costs

 2. A framework -- a framework for solving big data problems

 3. As the amount of data grows, business processing time grows; Hadoop addresses this

Three Google papers (the "troika" of the hadoop era):

       Google File System

       Google  Bigtable

       Google  MapReduce

Personally, I think they are worth a read when you are curious.

Four data sources:

     User behavior data (recommendation systems)

                 -> Search habits

                 -> Consumption records: Alipay, WeChat

   Business data:

                -> Data generated inside the company

   Web crawler collection:

                -> Python, Java

   Log files on production machines:

               -> Production log files


There are three major distributions of Hadoop:

       Apache -- the Apache top-level project

       CDH    -- Cloudera

       HDP    -- Hortonworks

Distributed:

Distributed storage and distributed computing, with the final results written back to one or more files

Hadoop ecosystem:

        Early on: HDFS + MapReduce

        Now: HDFS + Hive + Storm + Spark

[Three operating modes of Hadoop]

Local (Standalone) Mode: used by developers for local debugging; files are stored in the local file system

Pseudo-Distributed Mode: used by developers for debugging; HDFS is built locally on a single machine in pseudo-distributed form. Common situations:

                                           1. The company gives you the data, and you build Hadoop and Hive on your own Windows machine

                                           2. There may be a test cluster with the data already in place, managed by CM (required); you log in to the server and submit jobs locally

                                                        CM: Cloudera Manager (remember it needs a lot of memory when you are learning it)

Fully-Distributed Mode: how do you build a fully distributed (cluster) production environment?!!! HA: high availability, e.g. if a node suddenly fails, the cluster must remain available. HA is possible in most big data frameworks

[hadoop environment deployment - JDK part]

1. Modify permissions:

       chown -R username:username /opt/

  In the company: CM's files are owned by root; the reasons for working as a non-root user:

       1. To avoid deleting the wrong thing

        2. rm -rf /xx is almost never used in the company

              Instead, use mv to move the data to the tmp directory, and only decide after about a week whether to delete it for good (you may be held legally responsible otherwise). Hearthstone was once down for 12 hours because data was deleted by mistake, and in the end it was not restored.
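A minimal sketch of that habit (the paths here are hypothetical):

       # move instead of deleting outright; /opt/old_data is a hypothetical path
       mv /opt/old_data /tmp/old_data.bak

       # a week later, once everyone agrees it is safe, remove it for good
       rm -rf /tmp/old_data.bak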

2. Unzip the JDK to the specified directory, it is recommended not to install it in a user's home directory: 

   tar -zxvf xxxx -C /opt/modules/

3. Add environment variables

   Use root to modify the /etc/profile file and configure the JDK environment variables

    #JAVA_HOME

    export JAVA_HOME=<jdk install directory>

    export PATH=$PATH:$JAVA_HOME/bin

source /etc/profile

4. Verify: java -version 

      jps can be used to view Java processes

     echo $JAVA_HOME
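As a concrete sketch, assuming the JDK was unpacked to /opt/modules/jdk1.8.0_181 (a hypothetical path; use whatever directory your tar command produced), the /etc/profile entries and a quick check look like:

     #JAVA_HOME
     export JAVA_HOME=/opt/modules/jdk1.8.0_181    # hypothetical install path
     export PATH=$PATH:$JAVA_HOME/bin

     # reload the profile and verify
     source /etc/profile
     java -version
     echo $JAVA_HOME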

[hadoop pseudo-distributed environment deployment--hadoop part]

1. Unzip hadoop to the directory

    tar -zxvf xxx  -C /opt/modules/

2. Clean up the hadoop directory: delete the hadoop/share/doc directory to save disk space; check disk usage with df -h

3. Modify the hadoop/etc/hadoop/hadoop-env.sh file

   Modify hadoop/etc/hadoop/mapred-env.sh file

   Modify the hadoop/etc/hadoop/yarn-env.sh file

 All three specify the Java installation path
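A minimal sketch of the change, reusing the hypothetical JDK path from above; the same export line goes into hadoop-env.sh, mapred-env.sh, and yarn-env.sh:

    # in hadoop/etc/hadoop/hadoop-env.sh (and mapred-env.sh, yarn-env.sh)
    export JAVA_HOME=/opt/modules/jdk1.8.0_181    # hypothetical install path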

4. Note: The four core modules in hadoop correspond to four default configuration files

          Specify HDFS as the default file system, the access entry point for the file system, and the machine where the namenode runs

          Port 9000 was used by early hadoop 1.x; hadoop 2.x now uses 8020

          This port is used for internal communication between nodes, over the RPC mechanism

5. Modify the hadoop/etc/hadoop/core-site.xml file

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hostname:8020</value>
</property>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/modules/hadoop-2.5.0/data/tmp</value>
</property>

6. Note: /tmp represents the temporary storage directory. Every time the system restarts, the files in it will be deleted according to the preset script.

          So re-point the files Hadoop generates to a custom path: since /tmp gets emptied, the safety of data files kept there cannot be guaranteed
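A minimal sketch, matching the hadoop.tmp.dir value configured above (adjust the path if your hadoop directory differs):

    # create the custom data directory so hadoop does not rely on /tmp
    mkdir -p /opt/modules/hadoop-2.5.0/data/tmp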

7. Modify the hadoop/etc/hadoop/hdfs-site.xml file

         Specify the number of replicas HDFS keeps for each file. The default is 3; since this is a single machine, set it to 1. This number should not be larger than the number of datanode nodes.

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

8. Modify the hadoop/etc/hadoop/slaves file

   -> Specifies the machines that act as slave nodes; add the hostname
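A minimal sketch, run from the hadoop directory (replace hostname with your machine's real hostname):

    # the slaves file holds one hostname per line
    echo "hostname" > etc/hadoop/slaves
    cat etc/hadoop/slaves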

9. Format the namenode

    bin/hdfs namenode -format

10. Start command

    sbin/hadoop-daemon.sh start namenode

    sbin/hadoop-daemon.sh start datanode

11. View HDFS external UI interface

     hostname or IP address followed by port 50070; external communication goes over HTTP

      dfs.namenode.http-address 50070
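A quick sanity check, assuming the namenode and datanode were started as above (hostname is your machine's hostname):

     # jps should list the NameNode and DataNode processes
     jps
     # then open http://hostname:50070 in a browser, or probe it from the shell
     curl -I http://hostname:50070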

12. Test the HDFS environment

   Create a folder; HDFS has the concept of a user home directory, just like Linux

     bin/hdfs dfs -mkdir -p /user/test/input

13. Upload files to hdfs

   bin/hdfs dfs -put  etc/hadoop/core-site.xml etc/hadoop/hdfs-site.xml  /

14. Read the hdfs file

    bin/hdfs dfs -text  /core-site.xml

15. Download the file to the local (specify where to download and rename it to get-site.xml)

     bin/hdfs dfs -get /core-site.xml  /tmp/get-site.xml


[Defects of HDFS]

-> Files stored in HDFS cannot be modified in place

-> HDFS does not support concurrent writes to the same file by multiple users

-> HDFS is not suitable for storing a large number of small files

[yarn configuration]

1. Modify the hadoop/etc/hadoop/mapred-site.xml file

       Specifies that the mapreduce computing model runs on yarn

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

2. Modify the hadoop/etc/hadoop/yarn-site.xml file

       Specifies the auxiliary service the nodemanager runs for mapreduce (the shuffle service)

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

3. Specify the resourcemanager master node machine (also in yarn-site.xml). This is optional; by default it is the local machine, but once a hostname is specified here, starting the resourcemanager on any other machine will report an error.

    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hostname</value>
    </property>

4. Start yarn

       sbin/yarn-daemon.sh start resourcemanager

       sbin/yarn-daemon.sh start nodemanager 

5. View the yarn web page

     hostname:8088
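A quick check, mirroring the HDFS one above:

     # jps should now also list ResourceManager and NodeManager
     jps
     curl -I http://hostname:8088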

6. Test run a mapreduce, wordcount word count case

    A mapreduce can be divided into five stages

           input -> map() -> shuffle -> reduce() -> output

            Steps: to run mapreduce on yarn, you need a jar package

                                   Create a new data file to test mapreduce with

                                  Upload the data file from local to HDFS

                                 bin/hdfs dfs -put /opt/love.txt /user/test/input

                                 Use official examples: share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar 

7. Run

bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/test/input/love.txt /user/test/output

View: bin/hdfs dfs -cat /user/test/output/part*

[HDFS Architecture]

     1. Data blocks

     2. The default size of each block is 128 MB; the size can be customized by the user

    3. If you modify it, write it into hdfs-site.xml

For each file block, the namenode creates a piece of metadata; that metadata also takes up space and lives in the namenode's memory

secondarynamenode, HA

    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
        <description>
            The default block size for new files, in bytes.
            The following suffixes can be used (case insensitive):
            k (kilo), m (mega), g (giga), t (tera), p (peta), e (exa) to specify the size (like 128k, 512m, 1g, etc.),
            or provide the full size in bytes (e.g. 134217728 for 128 MB).
        </description>
    </property>

4. A 500 MB file with the default 128 MB block size is split into four blocks: 128 + 128 + 128 + 116 MB (the last block is only partly full, with 12 MB of the block size unused)

5. If the size of a file is less than the size of the block: it will not occupy the space of the entire block

6. Storage mode:

        hdfs splits files into blocks by default, and the size can be set

                       There are different ways to set it:

                               1) Through the create method of the hdfs api, you can specify the block size of the file being created (arbitrary)

                               2) In hive, it can be set in hive-site.xml: the block size of hive's output (which can be greater than 128)

                     e.g. when I store a 129 MB file, how many blocks are there? Two blocks in total (128 + 1)

Computing over the data:

       Files on hdfs are processed by mapreduce. By default each map reads one input split of 128 MB (the same as the block size).

                    So how many maps will my 129 MB file start?

                   Answer: 1. Mapreduce has a rule: if the remaining data is no more than 128 * 1.1 MB, it is kept as one final split instead of being cut again, to avoid wasting resources. Naturally, this can only happen for the last piece of a file.

              e.g. a 522 MB file: how many maps process it? (4) -- see the sketch below

              If you are not sure, work it through
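A minimal sketch of that reasoning in shell, assuming a 128 MB split size and the 1.1 slop factor described above (integer arithmetic only, so the 1.1 comparison is written as *10 versus *11):

    # estimate how many map splits a file of FILE_MB megabytes produces
    FILE_MB=522
    BLOCK_MB=128
    REMAIN=$FILE_MB
    SPLITS=0
    # keep cutting full 128 MB splits while more than 1.1 blocks remain
    while [ $((REMAIN * 10)) -gt $((BLOCK_MB * 11)) ]; do
        REMAIN=$((REMAIN - BLOCK_MB))
        SPLITS=$((SPLITS + 1))
    done
    # whatever is left (at most 1.1 blocks' worth) becomes the final split
    SPLITS=$((SPLITS + 1))
    echo "$FILE_MB MB -> $SPLITS map tasks"   # prints: 522 MB -> 4 map tasks

Plugging in FILE_MB=129 gives 1 map task, matching the answer above.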

Remember:

    hdfs is not suitable for storing too many small files 

                 One option is to merge them into larger files, though the effect is not always obvious

                  Alibaba open-sourced TFS, the Taobao File System, which references hdfs

7. Mechanisms that keep data safe

                  Number of replicas

                   A file is written as multiple replicas, placed on different machine nodes

                  After the file is split into blocks, each block is replicated (see the sketch below)
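A minimal sketch of checking and changing replication on an existing HDFS path (using the core-site.xml file uploaded earlier; any path works):

       # show the blocks, replica count and locations of a file
       bin/hdfs fsck /core-site.xml -files -blocks -locations

       # change the replication factor of a path (keep it at 1 on this single-node setup;
       # on a real cluster a value like 3 makes sense)
       bin/hdfs dfs -setrep -w 1 /core-site.xml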

8. Placement strategy:

               First replica of a block: if the client is on a machine in the cluster, the first replica is placed on that machine

              If the client is not in the cluster, the first replica is placed on a randomly chosen node

             The second replica is placed on a node on a different rack from the first, chosen randomly

              The third replica is placed on a different node on the same rack as the second, chosen randomly

              Any further replicas:

               placed for load balancing, spread evenly

        Rack Awareness Mechanism          

        Block scanning mechanism

       hdfs generates checksums for blocks and verifies them periodically; if a block is corrupted, the operation that reads it will report an error

        Repair of blocks (manual)

         Take the machine node holding the block offline (the disk may be damaged or full, or it may be a process problem)

         Many big data frameworks have a balancer for load balancing; a sketch follows below
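A minimal sketch of running the HDFS balancer by hand after adding or repairing nodes (the threshold is the allowed percentage difference in disk usage between datanodes):

        # spread blocks so every datanode is within 10% of the cluster average usage
        sbin/start-balancer.sh -threshold 10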
