Installation and deployment of Oozie (with the CDH version of Hadoop)

One: Introduction to Oozie

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:
1. Workflow definitions.
2. Currently running workflow instances, including each instance's state and variables.

Oozie integrates with the rest of the Hadoop ecosystem and supports multiple types of Hadoop jobs (MapReduce, Hive, Sqoop, etc.) as well as system-specific jobs (such as Java programs and shell scripts).

An Oozie workflow is a set of actions (for example, Hadoop MapReduce jobs) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions execute. The graph is described in hPDL (an XML process definition language).
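
As a taste of hPDL, a minimal workflow with a single shell action might look like the sketch below; the workflow name, script name, and HDFS path are placeholders, and the schema versions should be checked against your own Oozie release:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>hello.sh</exec>
            <file>/user/root/hello.sh#hello.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>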

Two: Oozie's functional modules and nodes
1. Modules

  1. Workflow
    executes process nodes in sequence; supports fork (split into multiple concurrent paths) and join (merge multiple paths back into one)
  2. Coordinator
    triggers workflows on a schedule (a sketch follows this list)
  3. Bundle Job
    binds multiple Coordinators together
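
For the Coordinator mentioned above, a definition that runs a workflow once a day looks roughly like this; the times, timezone, application path, and schema version here are placeholder values:

<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2020-08-01T00:00Z" end="2020-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>hdfs://hadoop102:8020/user/root/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>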

2. Common nodes:

  1. Control Flow Nodes
    are generally defined at the beginning or end of a workflow (start, end, kill, etc.) and provide the workflow's execution-path mechanisms (decision, fork, join, etc.). The main control flow tags are listed here (see the fork/join sketch after this list):
1. <start /> - marks the start of the workflow
2. <end /> - marks the end of the workflow
3. <decision /> - implements switch-style branching; used together with the <switch><case /><default /></switch> tags
4. <sub-workflow> - invokes a child workflow
5. <kill /> - the node jumped to after an error, to perform the related handling
6. <fork /> - runs parts of the workflow concurrently
7. <join /> - ends the concurrent execution (used together with fork)
  2. Action Nodes
    are the nodes responsible for performing concrete actions, such as copying files or executing a shell script.
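A rough sketch of how fork and join pair up in hPDL (the node and action names are illustrative, not from a real workflow):

<fork name="forking">
    <path start="action-a"/>
    <path start="action-b"/>
</fork>
<!-- action-a and action-b each transition to "joining" on success -->
<join name="joining" to="next-node"/>

Every branch started by the fork must transition to the same join before the workflow continues.
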
Since Oozie needs good compatibility with the other frameworks, and for convenience in the follow-up work, we will deploy a CDH version of Hadoop here alongside the existing Apache version of Hadoop.

Three: Simple deployment of CDH version of hadoop:
(Note: this section is built on top of the Apache Hadoop cluster I set up last time, where the JDK and the other prerequisites were already installed; see the previous blog at https://blog.csdn.net/weixin_44080445/article/details/106009359 for that setup.)

1. On hadoop102, create a new cdh directory under /opt/module

2. Extract the CDH version of Hadoop under /opt/software to /opt/module/cdh
[root@hadoop102 software]# tar -zxvf hadoop-2.5.0-cdh5.3.6.tar.gz -C /opt/module/cdh/
While we are at it, also extract Oozie to /opt/module:
[root@hadoop102 software]# tar -zxvf oozie-4.0.0-cdh5.3.6.tar.gz -C /opt/module/

3. Configure these 8 files in the /opt/module/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop directory: hadoop-env.sh, mapred-env.sh, yarn-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml.template, yarn-site.xml, and slaves.

First, echo $JAVA_HOME to find your own JDK installation path
[root@hadoop102 hadoop]# echo $JAVA_HOME
/opt/module/jdk1.8.0_144

1) Configure hadoop-env.sh: replace the JAVA_HOME setting with your actual JDK path

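The change in hadoop-env.sh (and likewise in mapred-env.sh and yarn-env.sh below) is just this one line, using the JAVA_HOME we found above:

export JAVA_HOME=/opt/module/jdk1.8.0_144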

2) Configure mapred-env.sh: remove the comment marker at the start of the JAVA_HOME line, then fill in the actual JDK path (the same export line as above)


3) Configure yarn-env.sh: likewise, remove the comment marker and fill in the actual JDK path

4) core-site.xml: fill in the following content. Note that the hostname of the Oozie Server and the user groups allowed to be proxied by Oozie are both configured for the root user here; if you run as user xxx, change root to xxx in the property names (hadoop.proxyuser.xxx.hosts and hadoop.proxyuser.xxx.groups).

<!-- NameNode address for HDFS -->
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://hadoop102:8020</value>
</property>

<!-- Storage directory for files Hadoop generates at runtime -->
<property>
	<name>hadoop.tmp.dir</name>
	<value>/opt/module/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp</value>
</property>

<!-- Hostname of the Oozie Server -->
<property>
	<name>hadoop.proxyuser.root.hosts</name>
	<value>*</value>
</property>

<!-- User groups allowed to be proxied by Oozie -->
<property>
	<name>hadoop.proxyuser.root.groups</name>
	<value>*</value>
</property>


5) hdfs-site.xml: add the following content.

<!-- Number of HDFS replicas -->
<property>
	<name>dfs.replication</name>
	<value>3</value>
</property>

<!-- Host for the secondary NameNode -->
<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>hadoop104:50090</value>
</property>

6) mapred-site.xml.template: note that you must rename the file first, i.e. mv mapred-site.xml.template mapred-site.xml, and then add the following content to the file.

<!-- Run MapReduce on YARN -->
<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>

<!-- History server address -->
<property>
	<name>mapreduce.jobhistory.address</name>
	<value>hadoop102:10020</value>
</property>

<!-- History server web address -->
<property>
	<name>mapreduce.jobhistory.webapp.address</name>
	<value>hadoop102:19888</value>
</property>

7) yarn-site.xml: add the following content.

<!-- How Reducers fetch data -->
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>

<!-- ResourceManager address for YARN -->
<property>
	<name>yarn.resourcemanager.hostname</name>
	<value>hadoop103</value>
</property>

<!-- Enable log aggregation -->
<property>
	<name>yarn.log-aggregation-enable</name>
	<value>true</value>
</property>

<!-- Keep logs for 7 days -->
<property>
	<name>yarn.log-aggregation.retain-seconds</name>
	<value>604800</value>
</property>

8) slaves: write it according to your own cluster hostnames; my three hostnames are hadoop102, hadoop103, and hadoop104

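With the hostnames above, the slaves file contains exactly:

hadoop102
hadoop103
hadoop104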

4. Copy the configuration to the hadoop103 and hadoop104 machines. Note that what is copied in this step is the /opt/module/cdh directory:

[root@hadoop102 cdh]# scp -r /opt/module/cdh root@hadoop103:/opt/module/
[root@hadoop102 cdh]# scp -r /opt/module/cdh root@hadoop104:/opt/module/

5. After copying, go to the hadoop-2.5.0-cdh5.3.6 directory on hadoop102 and format the NameNode

[root@hadoop102 hadoop-2.5.0-cdh5.3.6]# bin/hdfs namenode -format

You can see that the CDH version of Hadoop has been formatted successfully


6. Start the CDH version of Hadoop
[root@hadoop102 hadoop-2.5.0-cdh5.3.6]# sbin/start-dfs.sh

Then go to hadoop103 and start YARN
[root@hadoop103 hadoop-2.5.0-cdh5.3.6]# sbin/start-yarn.sh

Then start the history server on hadoop102

[root@hadoop102 hadoop-2.5.0-cdh5.3.6]# sbin/mr-jobhistory-daemon.sh start historyserver

At this point, we can check the processes on the three machines:
[root@hadoop102 myscripts]# ./showjps.sh
==================== root@hadoop102 ====================
7904 JobHistoryServer
7953 Jps
7418 NameNode
7501 DataNode
7741 NodeManager
==================== root@hadoop103 ====================
7410 NodeManager
7749 Jps
7224 DataNode
7323 ResourceManager
==================== root@hadoop104 ====================
7281 SecondaryNameNode
7206 DataNode
7355 NodeManager
7499 Jps
[root@hadoop102 myscripts]#
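
showjps.sh is just a small convenience script; here is a minimal sketch of what it does, assuming passwordless ssh between the three hosts and that jps is on the PATH of non-interactive shells:

#!/bin/bash
# Run jps on every node of the cluster and label the output
for host in hadoop102 hadoop103 hadoop104; do
    echo "==================== root@$host ===================="
    ssh root@$host jps
done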

Finally, we check the web UI for the corresponding information. The CDH version of Hadoop's page is blue, while the Apache version's page is cyan. Pay particular attention to port 8020: if the 8020 we configured earlier does not appear, there is a problem with the configuration.
That concludes the installation of the CDH version of Hadoop.


Four: Oozie installation and deployment

After installing the CDH version of Hadoop above, we can now install Oozie directly.

1. Decompress the Oozie tar package. This was already done alongside Hadoop in the extraction step above, so it is not repeated. Next, go to /opt/module, enter the Oozie directory, and decompress the oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz tar package. Because the decompressed content has the same name as the current oozie-4.0.0-cdh5.3.6 directory, we extract it to the parent directory:

[root@hadoop102 oozie-4.0.0-cdh5.3.6]# tar -zxvf oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz -C ../


After unzipping, you will find an additional hadooplibs directory


2. Create a libext directory inside the Oozie directory (note that the directory name cannot be changed; it must be called libext, as the official documentation emphasizes)

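Creating it is a single command:

[root@hadoop102 oozie-4.0.0-cdh5.3.6]# mkdir libext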

3. Copy the dependent jar package

1) The official documentation says to copy all jar files in hadooplibs into libext. Enter hadooplibs; the directory we want to copy from is hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6.

Execute:
[root@hadoop102 oozie-4.0.0-cdh5.3.6]# cp hadooplibs/hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6/* ./libext/

(After execution, run ll to check that the corresponding files are now in the libext directory.)

2) Copy ext-2.2.zip from /opt/software into libext (note: do not decompress ext-2.2.zip; copy it as-is).

[root@hadoop102 oozie-4.0.0-cdh5.3.6]# cp /opt/software/ext-2.2.zip ./libext/

3) Copy the MySQL driver into libext, because Oozie's metadata is also stored in MySQL; adjust the path to wherever your own MySQL driver jar lives.

[root@hadoop102 oozie-4.0.0-cdh5.3.6]# cp /opt/software/mysql-libs/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar ./libext/

4. Modify Oozie's configuration file

The file to modify is oozie-site.xml under oozie-4.0.0-cdh5.3.6/conf. Because the file is long, the relevant entries are hard to spot at a glance, so we turn on line numbers in vi: after opening the file, press Esc, then type :set nu and press Enter to show the line numbers.

1) The JDBC driver, around line 140: change the value to com.mysql.jdbc.Driver, because it defaults to the Derby database


2) Change the value around line 150 to jdbc:mysql://hadoop102:3306/oozie. This configures the database address Oozie needs. Note that my hostname is hadoop102; change this to match your own hostname


3) Change the value around line 158 to root, because my MySQL user is root


4) Fill in the MySQL password around line 166. Note that the value ships with a space in it; that space must be deleted first, otherwise an error will occur!


5) Line 233 tells Oozie where the Hadoop configuration files are: there may be more than one Hadoop cluster on your Linux machine, and Oozie needs to know which cluster to connect to. Here I point Oozie at our CDH version of Hadoop.


Another thing to note: the value on line 233 begins with *= ; do not delete that part. After it, fill in the path to your CDH Hadoop configuration directory.
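
Putting steps 1) through 5) together, the five entries in oozie-site.xml end up looking roughly like this (the property names are the ones at those line numbers in this Oozie release's conf file; the password and paths are mine, so substitute your own):

<property>
	<name>oozie.service.JPAService.jdbc.driver</name>
	<value>com.mysql.jdbc.Driver</value>
</property>
<property>
	<name>oozie.service.JPAService.jdbc.url</name>
	<value>jdbc:mysql://hadoop102:3306/oozie</value>
</property>
<property>
	<name>oozie.service.JPAService.jdbc.username</name>
	<value>root</value>
</property>
<property>
	<name>oozie.service.JPAService.jdbc.password</name>
	<value>123456</value>
</property>
<property>
	<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
	<value>*=/opt/module/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop</value>
</property>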


Once the above is set, press Esc, then type :set nonu and press Enter to hide the line numbers again. Now save the file and exit.

Five: Initialize Oozie

1. Create the Oozie database in MySQL: run mysql -uroot -p123456 to enter MySQL, then create the oozie database:
create database oozie;

2. Start the CDH version of Hadoop. This was covered in point 6 of section Three (simple deployment of the CDH version of Hadoop), so it is not repeated here.

3. Upload the yarn.tar.gz file in the Oozie directory to HDFS. Note that you do not decompress yarn.tar.gz yourself; just run the command, and Oozie will upload and unpack the tar package automatically.

[root@hadoop102 oozie-4.0.0-cdh5.3.6]# bin/oozie-setup.sh sharelib create -fs hdfs://hadoop102:8020 -locallib oozie-sharelib-4.0.0-cdh5.3.6-yarn.tar.gz

After execution, let's go to the /user/root/share/lib/lib_20200801123328 path in HDFS to take a look


Here you can see that the lib_20200801123328 directory is named after the creation time and contains many files. Do not delete anything inside it later, or errors will occur when working on projects. Also note that the command above must not be executed twice: each run happens at a different time and would create another time-stamped copy of the sharelib.

4. Create the oozie.sql file.
Before creating the file, we can take a look at the oozie database on hadoop102: there is nothing in it yet.


Execute the following:
[root@hadoop102 oozie-4.0.0-cdh5.3.6]# bin/ooziedb.sh create -sqlfile oozie.sql -run

You can see that Oozie's database now contains many tables, because Oozie has a lot of data to store in it.


5. Package the project to generate the oozie.war package; a success message appears when it finishes.
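
The packaging is done with the prepare-war subcommand of oozie-setup.sh, which bundles the jars and ext-2.2.zip from libext into oozie.war (check this against the documentation for your Oozie release):

[root@hadoop102 oozie-4.0.0-cdh5.3.6]# bin/oozie-setup.sh prepare-war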


Six: Starting and stopping Oozie

To start and stop Oozie, we do not need to go through the initialization in section Five every time; that initialization is only needed once, when Oozie is first installed.

Start:
[root@hadoop102 oozie-4.0.0-cdh5.3.6]# bin/oozied.sh start

Stop:
[root@hadoop102 oozie-4.0.0-cdh5.3.6]# bin/oozied.sh stop

After startup, you can see an extra Bootstrap process; this is Oozie's process.


Then we visit Oozie in a browser; by default the web console listens on port 11000, at http://hadoop102:11000/oozie. If the interface appears, the installation succeeded.


A possible problem here: if the page does not display completely, it may be a browser issue; switching to Firefox or Chrome is recommended.

