Learning to use Hadoop



1. What is the role of hadoop?

What is Hadoop?

Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. It is designed for offline, large-scale batch analysis, not for online transaction processing with random reads and writes of individual records. Roughly, Hadoop = HDFS (the distributed file system, i.e. the storage layer) + MapReduce (the data-processing layer). The data can come from any source; compared with a relational database, Hadoop handles semi-structured and unstructured data better and more flexibly, because whatever form the data takes, it is eventually converted into key/value pairs, the basic data unit. Processing is expressed in a functional MapReduce style rather than in SQL: SQL is a declarative query language, while MapReduce jobs are written as scripts and code. Developers who are used to SQL on relational databases can use the open-source tool Hive on top of Hadoop instead.

| Component | Role | Start command |
| --- | --- | --- |
| HDFS | distributed file system | sbin/start-dfs.sh |
| NameNode | master node | sbin/hadoop-daemon.sh start namenode |
| DataNode | data storage node | sbin/hadoop-daemon.sh start datanode |
| YARN | resource scheduling framework | sbin/start-yarn.sh |
| ResourceManager | global resource manager | sbin/yarn-daemon.sh start resourcemanager |
| NodeManager | per-node resource and task manager | sbin/yarn-daemon.sh start nodemanager |
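
As a quick sanity check after starting these daemons, jps shows which of them are running on a node. A minimal sketch, assuming the commands are run from the Hadoop installation directory as in the rest of this article:

#Start HDFS and YARN from the master node
sbin/start-dfs.sh
sbin/start-yarn.sh

#List the Java daemons running on the current node
jps
#A working node should show some combination of NameNode, DataNode, ResourceManager and NodeManager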

What can Hadoop do?

Hadoop is good at log analysis. Facebook uses Hive for log analysis; in 2009, 30% of Facebook's non-programmers used HiveQL for data analysis. Taobao search also uses Hive for custom filtering. Pig can be used for more advanced data processing: Twitter and LinkedIn use it to find people you may know, achieving a recommendation effect similar to Amazon's collaborative filtering, and Taobao's product recommendation works the same way. At Yahoo!, 40% of Hadoop jobs are run with Pig, including spam identification and filtering and user-profile modeling. (Updated August 25, 2012: Tmall's recommendation system uses Hive, with Mahout being tried on a small scale.)

Building a Hadoop HA (High Availability) Cluster

1 Common cluster configuration files

  1. hdfs-site.xml

    <configuration>
      <!--  <property>
                <name>dfs.replication</name>
        <value>1</value>
      </property>-->
      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop-3:50090</value>
      </property>
    </configuration>
    
  2. core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-1:8020</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/data</value>
      </property>
    </configuration>
    
  3. slaves

    hadoop-1
    hadoop-2
    hadoop-3
    
  4. yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-2</value>
      </property>
      <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>106800</value>
      </property>
    </configuration> 
    
  5. mapred-site.xml

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop-1:10020</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop-1:19888</value>
      </property>
    </configuration>
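
These configuration files must be identical on every node. Below is a rough sketch of pushing them out and bringing a plain (non-HA) cluster up for the first time; the /opt/hadoop target path is only an example and should be replaced by the actual installation directory:

    #Copy the configuration to the other nodes (the target path is an example)
    scp -r etc/hadoop/ hadoop-2:/opt/hadoop/etc/
    scp -r etc/hadoop/ hadoop-3:/opt/hadoop/etc/

    #Format the namenode once, then start HDFS and YARN
    bin/hdfs namenode -format
    sbin/start-dfs.sh
    sbin/start-yarn.sh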
    

2 High availability cluster configuration

  1. hdfs-site.xml

    <configuration>
      <property>
        <!-- Logical service name for the NameNode cluster -->
        <name>dfs.nameservices</name>
        <value>ns1</value>
      </property>
      <property>
        <!-- The NameNodes contained in this nameservice, and their IDs -->
        <name>dfs.ha.namenodes.ns1</name>
        <value>nn1,nn2</value>
      </property>
      <property>
        <!-- RPC address and port of the NameNode named nn1; RPC is used to talk to the DataNodes -->
        <name>dfs.namenode.rpc-address.ns1.nn1</name>
        <value>hadoop-1:8020</value>
      </property>
      <property>
        <!-- RPC address and port of the NameNode named nn2; RPC is used to talk to the DataNodes -->
        <name>dfs.namenode.rpc-address.ns1.nn2</name>
        <value>hadoop-2:8020</value>
      </property>
      <property>
        <!-- HTTP address and port of the NameNode named nn1 (web UI) -->
        <name>dfs.namenode.http-address.ns1.nn1</name>
        <value>hadoop-1:50070</value>
      </property>
      <property>
        <!-- HTTP address and port of the NameNode named nn2 (web UI) -->
        <name>dfs.namenode.http-address.ns1.nn2</name>
        <value>hadoop-2:50070</value>
      </property>
      <property>
        <!-- JournalNodes the NameNodes use to share the edit log -->
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hadoop-1:8485;hadoop-2:8485;hadoop-3:8485/ns1</value>
      </property>
      <property>
        <!-- Directory on the JournalNodes where the edits are stored -->
        <name>dfs.journalnode.edits.dir</name>
        <value>/opt/data/tmp/dfs/jn</value>
      </property>
      <property>
        <!-- Proxy class clients use to connect to the currently active NameNode -->
        <name>dfs.client.failover.proxy.provider.ns1</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
      <property>
        <!-- Fencing method used during failover: sshfence logs in to the old active NameNode over SSH and kills the process -->
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
      </property>
      <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
      </property>
      <property>
        <!-- Enable automatic failover -->
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>dfs.replication.max</name>
        <value>32767</value>
      </property>
    </configuration>
    
  2. core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns1</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
      </property>
      <property>
        <name>dfs.nameservices</name>
        <value>ns1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/data</value>
      </property>
      <property>
        <name>ha.zookeeper.quorum</name>
        <value>hadoop-1:2181,hadoop-2:2181,hadoop-3:2181</value>
      </property>
    </configuration>
  3. slaves

    hadoop-1
    hadoop-2
    hadoop-3
    
  4. yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>106800</value>
      </property>
      <property>
        <!-- Enable ResourceManager HA -->
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
      </property>
      <property>
        <!-- An id for the ResourceManager HA cluster -->
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yarn-cluster</value>
      </property>
      <property>
        <!-- IDs of the ResourceManager HA nodes -->
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm12,rm13</value>
      </property>
      <property>
        <!-- Host of the first ResourceManager -->
        <name>yarn.resourcemanager.hostname.rm12</name>
        <value>hadoop-2</value>
      </property>
      <property>
        <!-- Host of the second ResourceManager -->
        <name>yarn.resourcemanager.hostname.rm13</name>
        <value>hadoop-3</value>
      </property>
      <property>
        <!-- ZooKeeper nodes used by ResourceManager HA -->
        <name>yarn.resourcemanager.zk-address</name>
        <value>hadoop-1:2181,hadoop-2:2181,hadoop-3:2181</value>
      </property>
      <property>
        <!-- Recover ResourceManager state after a restart or failover -->
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
      </property>
      <property>
        <!-- Store ResourceManager state in ZooKeeper -->
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
      </property>
    </configuration>
    
  5. mapred-site.xml

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop-1:10020</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop-1:19888</value>
      </property>
    </configuration>
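
With the HA configuration in place, the order of the first start matters. The following is a rough sketch of one working order, assembled from the commands explained later in this article; the prompt shows which node each command is run on, and ZooKeeper is assumed to be installed on all three nodes:

    #1. Start ZooKeeper on hadoop-1, hadoop-2 and hadoop-3, then start a journalnode on each of the three nodes
    [root@hadoop-1 hadoop]# sbin/hadoop-daemon.sh start journalnode

    #2. Format the namenode on hadoop-1 and start it
    [root@hadoop-1 hadoop]$ bin/hdfs namenode -format
    [root@hadoop-1 hadoop]$ sbin/hadoop-daemon.sh start namenode

    #3. Bootstrap the standby namenode on hadoop-2
    [root@hadoop-2 hadoop]$ bin/hdfs namenode -bootstrapStandby

    #4. Create the HA zNode in ZooKeeper, then start HDFS and YARN
    [root@hadoop-1 hadoop]$ bin/hdfs zkfc -formatZK
    [root@hadoop-1 hadoop]$ sbin/start-dfs.sh
    [root@hadoop-1 hadoop]$ sbin/start-yarn.sh
    [root@hadoop-2 hadoop]$ sbin/yarn-daemon.sh start resourcemanager
    [root@hadoop-3 hadoop]$ sbin/yarn-daemon.sh start resourcemanager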
    

A record of the commands used while building the Hadoop HA high-availability cluster

1. Upload files to HDFS

#Create a directory in HDFS
[root@hadoop-1 hadoop]$ bin/hdfs dfs -mkdir /input

#Upload a file
[root@hadoop-1 hadoop]$ bin/hdfs dfs -put /opt/data/wc.input /input/wc.input

#Run the example job
[root@hadoop-1 hadoop]$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /input/wc.input /output

#List the HDFS root directory
[root@hadoop-1 hadoop]$ hadoop fs -ls /   or   bin/hdfs dfs -ls /

#Create a directory named test under the root directory
[root@hadoop-1 hadoop]$ hadoop fs -mkdir /test   or   bin/hdfs dfs -mkdir /test

2. Check whether the file was uploaded to the specified HDFS directory

[root@hadoop-1 hadoop]$ bin/hdfs dfs -ls /input

The uploaded file should appear in the listing.
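
The contents of the uploaded file, and later the result of the wordcount job, can also be viewed directly from HDFS. A small sketch (part-r-00000 is the usual name of the reducer output file of the example job):

#Show the uploaded input file
[root@hadoop-1 hadoop]$ bin/hdfs dfs -cat /input/wc.input

#Show the wordcount result written by the example job
[root@hadoop-1 hadoop]$ bin/hdfs dfs -cat /output/part-r-00000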

3. Ways to format the namenode

  1. format directly

    [root@hadoop-1 hadoop]$ bin/hdfs namenode -format
    
  2. Cluster node formatting - keep the clusterId of all nodes (namenode, datanode, and journalnode) consistent

    [root@hadoop-1 hadoop]$ bin/hdfs namenode -format -clusterId hadoop-federation-clusterId
    
  3. Bootstrap the standby namenode (a namenode in standby state does not process client requests)

    [root@hadoop-1 hadoop]$ bin/hdfs namenode -bootstrapStandby
    

4. Check whether the namenode is working and the state of the cluster

[root@hadoop-1 hadoop]# hdfs dfsadmin -report
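
dfsadmin -report mainly shows capacity and datanode status. In the HA setup, the haadmin command shows directly which namenode is active and which is standby (nn1 and nn2 are the IDs defined in hdfs-site.xml):

#Query the HA state of each namenode (prints active or standby)
[root@hadoop-1 hadoop]# bin/hdfs haadmin -getServiceState nn1
[root@hadoop-1 hadoop]# bin/hdfs haadmin -getServiceState nn2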

5. Check whether the service is started

  1. Use jps to see if there is a namenode

  2. If it is not there, start it with

    [root@hadoop-1 hadoop]$ sbin/hadoop-daemon.sh start namenode
    
  3. Or start namenode with

    [hadoop@bigdata-senior02 hadoop-2.5.0]$ sbin/start-dfs.sh 
    
  4. If the startup fails, it is usually because the namenode has been formatted more than once. You need to clean out the data, name, and jn (journalnode) directories generated under the hadoop.tmp.dir path configured in core-site.xml (see the sketch after this list):

      <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
      </property>
    

    Then start the namenode again using the commands in steps 2 or 3 above

  5. If it still fails, you need to check the error log

    [root@hadoop-1 hadoop]# cd logs
    [root@hadoop-1 logs]# tail -fn 300 hadoop-root-namenode-hadoop-1.log

    Then search for the error message (e.g. on Baidu or Google) to track down the cause
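
The cleanup mentioned in step 4 can look roughly like this; /opt/data/tmp comes from hadoop.tmp.dir, it must be done on every affected node, and it wipes all existing HDFS data:

    #Stop HDFS first
    [root@hadoop-1 hadoop]# sbin/stop-dfs.sh

    #Remove the data generated by earlier formats (destroys existing HDFS data)
    [root@hadoop-1 hadoop]# rm -rf /opt/data/tmp/dfs/data /opt/data/tmp/dfs/name /opt/data/tmp/dfs/jn

    #Re-format the namenode and start again
    [root@hadoop-1 hadoop]# bin/hdfs namenode -format
    [root@hadoop-1 hadoop]# sbin/start-dfs.sh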

6. Datanode failed to start

  1. Check whether the DataNode starts and enter the command jps

  2. If it is found that there is no datanode, execute the following

    [root@hadoop-1 hadoop]# sbin/hadoop-daemon.sh start datanode
    
  3. If it still does not start after running this, check whether core-site.xml configures the storage paths used by the namenode and datanode

      <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/data</value>
      </property>
    
  4. If the configuration does not take effect (add it if it is missing and restart the service with the command from step 5-3), enter this path and check whether the current directory has been generated

    [root@hadoop-1 hadoop]# cd /opt/data/tmp/dfs/data/current/
    

    If a VERSION file is there, the datanode has been started at least once. Check whether the clusterID in the namenode's VERSION file matches the one in the datanode's VERSION file (see the sketch after this list); if not, change the datanode's value to match the namenode's. You can also delete files that are not needed at runtime, such as share/doc and logs/hadoop-root*, to keep the datanode running smoothly:

    #Delete the Hadoop documentation files
    [root@hadoop-1 hadoop]# cd share/doc
    [root@hadoop-1 doc]# rm -rf *
    
    #Delete the Hadoop log files
    [root@hadoop-1 hadoop]# cd logs/
    [root@hadoop-1 logs]# rm -rf hadoop-root*
    
  5. Start the datanode again

    [root@hadoop-1 hadoop]# sbin/hadoop-daemon.sh start datanode
    

    At this point most people will be able to restart the datanode successfully. If it still fails to start, the cause is usually the configuration files; in that case, go back over the configuration against a detailed reference such as https://blog.csdn.net/hliq5399/article/details/78193113/ , which explains every file in detail and is well suited to developers building their first hadoop cluster.
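
The VERSION check mentioned in step 4 can be done like this; the paths follow the dfs.namenode.name.dir and dfs.datanode.data.dir settings above, and the clusterID lines are the ones to compare:

    #clusterID recorded by the namenode
    [root@hadoop-1 hadoop]# cat /opt/data/tmp/dfs/name/current/VERSION

    #clusterID recorded by the datanode - it must match the namenode's
    [root@hadoop-1 hadoop]# cat /opt/data/tmp/dfs/data/current/VERSION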

7. Check whether the Linux servers can reach each other

  1. Use ping to see if you can reach each other's IP

    [root@hadoop-1 hadoop]# ping hadoop-2
    


  2. Since most of the configuration refers to the servers by hostname (IP alias), name resolution must be configured and the hostname of every machine must actually have been changed; if it has not, the steps below will not work. https://blog.csdn.net/qq_22310551/article/details/84966044 covers hostname configuration thoroughly. /etc/hosts should contain:

    	[root@hadoop-1 hadoop]# cat /etc/hosts
    	#127.0.0.1   localhost 
    	#::1         localhost 
    	#192.168.149.110  hadoop-1
    	#127.0.0.1   localhost
    	192.168.149.110 hadoop-1
    	192.168.149.120 hadoop-2
    	192.168.149.130 hadoop-3
    

The code is as follows (example):
1. Configure the virtual machine hostname;

vi /etc/sysconfig/network
#Set it like this
NETWORKING=yes
HOSTNAME=hadoop-1

#Close the file and check the result of the change
more /etc/sysconfig/network
hostname
more /proc/sys/kernel/hostname

#Then reboot the virtual machine so the change is permanent
reboot

#Remember: /etc/hosts has nothing to do with hostname; it is only a DNS-style name-to-IP mapping

2. Configure the hostnames on Windows as well, so the Linux servers can be reached from Windows by name instead of IP

C:\Windows\System32\drivers\etc

#Open the hosts file and add entries such as
192.168.149.110 hadoop-1
192.168.149.120 hadoop-2
192.168.149.130 hadoop-3

After that, you can use hadoop-1 instead of 192.168.149.110 to access the server.

For rapid deployment, the journalnode data can be copied to another node with: scp -r /opt/data/tmp/dfs/jn/ hadoop-2:/opt/data/tmp/dfs/



  3. Check whether the services are listening. Generally, port 8020 is the namenode's port once it is started; if it cannot be reached, the namenode is unusable. Port 8485 is the journalnode's port.

    #Check which ports are listening
    [root@hadoop-1 hadoop]# netstat -tpnl
    


  4. A port bound to IP 0.0.0.0 can be reached remotely in several ways:
    for example 0.0.0.0:80 can be accessed as 127.0.0.1:80, as the server IP:80, or through a DNS name:80

    #Access hadoop-1 by hostname (DNS resolution)
    [root@hadoop-1 hadoop]# curl hadoop-1:80
    

    Or access hadoop-1's port 80 from node hadoop-2:

    [root@hadoop-2 hadoop]# curl hadoop-1:80
    

    A port bound to 127.0.0.1 can only be accessed locally; for example, a service on hadoop-1 listening on 127.0.0.1:6010 cannot be reached with curl from the hadoop-2 node.
    A port bound to 192.168.149.110 can be reached remotely from other servers.

  5. Use telnet to check whether you can reach the services you started, for example:

    	[root@hadoop-1 hadoop]# telnet hadoop-1 8020
    		Trying 192.168.149.110...
    		Connected to hadoop-1.
    		Escape character is '^]'.
    

    The connection succeeds. If telnet is not installed, install it first:

    #Install
    yum install telnet-server -y
    yum install xinetd -y
    yum install telnet
    #Start the service
    systemctl start telnet.socket
    #Enable at boot
    systemctl enable telnet.socket

  6. If you still cannot access it, it may be because a login/password is required (assuming the firewall is already off). Set up passwordless access with SSH:

     #Generate a key pair (press Enter at every prompt)
     [root@hadoop-1 hadoop]$ ssh-keygen -t rsa 
     #Distribute the public key
     [root@hadoop-1 hadoop]$ ssh-copy-id hadoop-1
     [root@hadoop-1 hadoop]$ ssh-copy-id hadoop-2
     [root@hadoop-1 hadoop]$ ssh-copy-id hadoop-3
    

    In this way, hadoop-1 can reach hadoop-1, hadoop-2, and hadoop-3 over SSH without a password.
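
    To verify, an ssh from hadoop-1 to the other nodes should no longer ask for a password. Note that the sshfence method configured in hdfs-site.xml also needs the two namenode hosts to reach each other over SSH, so the same ssh-keygen/ssh-copy-id steps should be repeated on hadoop-2:

    #Should print the remote hostname without prompting for a password
    [root@hadoop-1 hadoop]$ ssh hadoop-2 hostname
    [root@hadoop-1 hadoop]$ ssh hadoop-3 hostname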

8. Start the ResourceManager. If this service is not running, YARN jobs (such as the wordcount example above) cannot run

  1. Start all services at once (including the ResourceManager)

    [root@hadoop-1 hadoop]# sbin/start-all.sh 
    
  2. Start services individually

    #Start yarn
    [root@hadoop-1 hadoop]$ sbin/start-yarn.sh
    #Start a resourcemanager on each of the designated servers
    [root@hadoop-2 hadoop]$ sbin/yarn-daemon.sh start resourcemanager
    [root@hadoop-3 hadoop]$ sbin/yarn-daemon.sh start resourcemanager
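
    Once both resourcemanagers are up, their HA state can be checked with rmadmin; rm12 and rm13 are the IDs defined in yarn-site.xml, and one should report active while the other reports standby:

    #Query the HA state of each resourcemanager
    [root@hadoop-2 hadoop]$ bin/yarn rmadmin -getServiceState rm12
    [root@hadoop-2 hadoop]$ bin/yarn rmadmin -getServiceState rm13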

9. If, after the services are started, the namenode shown at http://hadoop-1:50070/dfshealth.html#tab-overview is in standby state, it can be forced to active manually

#nn1 is the namenode ID defined in the configuration file
[root@hadoop-1 hadoop]$ bin/hdfs haadmin -transitionToActive -forcemanual nn1

[root@hadoop-1 hadoop]$ vi etc/hadoop/hdfs-site.xml
<property>
    <!-- HTTP address and port of the NameNode named nn1 (web UI) -->
    <name>dfs.namenode.http-address.ns1.nn1</name>
    <value>hadoop-1:50070</value>
</property>

10. How to start the journalnode (the node that stores the shared edit log)

[root@hadoop-1 hadoop]# sbin/hadoop-daemon.sh start journalnode

11. Create the zNode used by HDFS HA in ZooKeeper

[root@hadoop-1 hadoop]$ bin/hdfs zkfc -formatZK
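
zkfc -formatZK talks to the ZooKeeper quorum configured in ha.zookeeper.quorum, so ZooKeeper must already be running on hadoop-1, hadoop-2, and hadoop-3. A sketch, assuming a standard ZooKeeper installation with its bin directory on the PATH:

#Start ZooKeeper on each node and check its state
zkServer.sh start
zkServer.sh status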

12. Start namenode, datanode, journalnode, and zkfc

[root@hadoop-1 hadoop]$ sbin/start-dfs.sh
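
After everything is started, jps on each node should show the expected daemons: NameNode and DFSZKFailoverController on hadoop-1 and hadoop-2, and a DataNode and JournalNode on all three nodes, plus the YARN daemons started earlier. Since passwordless SSH is already set up, a quick check over all nodes can look like this:

#List the Hadoop daemons running on every node
for host in hadoop-1 hadoop-2 hadoop-3; do
  echo "== $host =="
  ssh "$host" jps
done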

Summary

Tip: this article only lists some of the problems the author ran into during development, far from all of them. Most of them were caused by formatting the namenode too many times. As long as you are willing to put in the effort, it will work out in the end.
