Using Kettle to extract MySQL data into an HBase cluster on a schedule (Part 1)

Foreword

A recent project at work involves big data and extracting the underlying data, so I spent some time studying the relevant technology. I had already organized the material into documents and uploaded them to CSDN, but uploaded resources can no longer be marked as free downloads (the system automatically assigns a five-point price), so I have decided to write it up on the blog instead. If anything here is wrong, please correct me.

 

Brief introduction

1. Hadoop: a big data processing framework with three basic components: HDFS, YARN, and MapReduce

2. HBase: a distributed storage system for structured data, used in conjunction with Hadoop

3. Kettle: an open-source ETL tool used for data extraction

As the title says, when a relational database (such as MySQL or Oracle) refreshes data at the second level, and both the current data and the historical data need to be recorded for later review, the data volume becomes very large; Hadoop is a very good fit for storing it. I have broken the end goal into five steps, described in five articles:

1) Set up the Hadoop cluster environment

2) Use Kettle to extract MySQL data into Hadoop

3) Set up HBase on top of the Hadoop cluster

4) Extract MySQL data into an HBase table

5) Run the extraction task on a schedule

 

Building the Hadoop cluster environment

 

Linux user and group commands

 

1. Add a new group zu1 to the system, specifying the new group's identification number (GID) as 888.

groupadd -g 888  zu1

 

 

2. Delete the existing group zu1

groupdel  zu1

 

3. Change the GID of group zu1 to 10000 and rename it to zu2

groupmod -g 10000 -n zu2 zu1

 

4. Switch the current user's active group to root, provided that root is actually one of the user's primary or supplementary groups

 newgrp root

 

5. View group information: cat /etc/group

 

Each record in /etc/group has four fields:

  The first field: the group name;

  The second field: the group password;

  The third field: the GID;

  The fourth field: the list of users in the group, separated by commas (,); this field can be empty; if it is empty, the group's members are just the users whose primary GID is this group's GID.
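For example, a hypothetical entry for the group created above might look like this (the x is a placeholder in the password field):

zu1:x:888:wangkang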

 

 

6. Query which groups user wangkang belongs to

[root@localhost ~]# groups wangkang

wangkang : zu1

7. Remove user wangkang from group zu1

gpasswd -d wangkang zu1

8. Add user wangkang to group zu1
usermod -a -G zu1 wangkang

 

Preparation Before Installation

Java Version: 1.8

Hadoop Version: 2.7.7

Two CentOS 7 virtual machines, each with its hostname changed, 2 GB of memory, and a fixed IP address. My virtual machines are:

master 192.168.93.131 (master server)

slave1 192.168.93.132 (slave server)

1. As the root user

Create a user named wangkang on both machines, with home directory /home/wangkang (this can also be done directly when installing the system).

Add the user wangkang to the root group.
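One possible set of commands for these two steps, run as root on each machine (shown only as a sketch):

# useradd -m -d /home/wangkang wangkang   (create the user and its home directory)

# passwd wangkang   (set the user's password)

# usermod -a -G root wangkang   (add wangkang to the root group as a supplementary group)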

2. As the root user

Add the following to the /etc/hosts file on both machines:

192.168.93.131 master

192.168.93.132 slave1

This allows the machines to resolve each other by hostname instead of IP address.

3. As the wangkang user

Download the Java installation package to the home directory /home/wangkang (you can also download it elsewhere and upload it to the Linux machine), then enter the commands:

$ cd ~   (enter the home directory)

$ tar -zxvf jdk-8u80-linux-x64.tar.gz   (decompress)

4. As the wangkang user

Because the master and slave servers need to access each other, we need to set up passwordless SSH login.

Run on both machines:

$cd ~

$ ssh-keygen -t rsa   (generate a key pair on the local machine)

Run on the master host:

$ cd ~/.ssh/   (enter the hidden .ssh folder)

$ ssh-copy-id 192.168.93.131   (append the generated key to the master's own authorized_keys file)

$ scp /home/wangkang/.ssh/authorized_keys 192.168.93.132:/home/wangkang/.ssh/

(copy the master's authorized_keys file to slave1)

Run on slave1:

$ ssh-copy-id 192.168.93.132   (append slave1's generated key to the local authorized_keys file)

Note: at this point the authorized_keys file contains the public keys of both machines.

$ scp /home/wangkang/.ssh/authorized_keys 192.168.93.131:/home/wangkang/.ssh/

(copy slave1's authorized_keys file back to the master)

The passwordless login configuration is now complete.
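If passwordless login still prompts for a password afterwards, a common cause is the permissions on the .ssh directory; a possible fix, on both machines:

$ chmod 700 ~/.ssh

$ chmod 600 ~/.ssh/authorized_keys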

Verify Success:

On the master:

$ ssh slave1   (log in to slave1; the first login still requires a password)

After entering the password you are logged in to slave1.

$ exit   (log out)

$ ssh slave1   (the second login requires no password)

Do the same on slave1, logging in to the master, to verify that passwordless login works in both directions.

 

Hadoop installation and configuration

The whole process is done as the wangkang user on the master host.

1. As with the JDK, upload or download the Hadoop archive to wangkang's home directory, then decompress it.

2. Create new directories inside the decompressed Hadoop directory:

$cd ~/hadoop-2.7.7

$mkdir tmp

$mkdir hdfs

$mkdir hdfs/data

$mkdir hdfs/name

3. Modify the Hadoop configuration files

Enter ~/hadoop-2.7.7/etc/hadoop.
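Note that each property snippet below goes inside the file's existing <configuration> element; as a sketch, core-site.xml ends up shaped like this:

<configuration>

        <property>

                <name>fs.defaultFS</name>

                <value>hdfs://192.168.93.131:9000</value>

        </property>

        <!-- remaining properties from section 1) below -->

</configuration>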

1) core-site.xml, add

<property>

        <name>fs.defaultFS</name>

        <value>hdfs://192.168.93.131:9000</value>

</property>

<property>

        <name>hadoop.tmp.dir</name>

        <value>file:/home/wangkang/hadoop-2.7.7/tmp</value>

</property>

<property>

        <name>io.file.buffer.size</name>

        <value>131072</value>

</property>

2) hdfs-site.xml, add

<property>

        <name>dfs.namenode.name.dir</name>

        <value>file:/home/wangkang/hadoop-2.7.7/hdfs/name</value>

</property>

<property>

        <name>dfs.datanode.data.dir</name>

        <value>file:/home/wangkang/hadoop-2.7.7/hdfs/data</value>

</property>

<property>

        <name>dfs.replication</name>

        <value>1</value>

</property>

<property>

        <name>dfs.namenode.secondary.http-address</name>

        <value>192.168.93.131:9001</value>

</property>

<property>

        <name>dfs.namenode.servicerpc-address</name>

        <value>192.168.93.131:10000</value>

</property>

<property>

        <name>dfs.webhdfs.enabled</name>

        <value>true</value>

</property>

3) yarn-site.xml, add

<property>

        <name>yarn.nodemanager.aux-services</name>

        <value>mapreduce_shuffle</value>

</property>

<property>

        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

        <value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

<property>

        <name>yarn.resourcemanager.address</name>

        <value>192.168.93.131:8032</value>

</property>

<property>

        <name>yarn.resourcemanager.scheduler.address</name>

        <value>192.168.93.131:8030</value>

</property>

<property>

        <name>yarn.resourcemanager.resource-tracker.address</name>

        <value>192.168.93.131:8031</value>

</property>

<property>

        <name>yarn.resourcemanager.admin.address</name>

        <value>192.168.93.131:8033</value>

</property>

<property>

        <name>yarn.resourcemanager.webapp.address</name>

        <value>192.168.93.131:8088</value>

</property>

<property>

        <name>yarn.nodemanager.resource.memory-mb</name>

        <value>1024</value>

</property>

 

4) mapred-site.xml (if this file does not exist, copy mapred-site.xml.template to mapred-site.xml), add
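Assuming you are inside ~/hadoop-2.7.7/etc/hadoop, the copy can be done with:

$ cp mapred-site.xml.template mapred-site.xml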

<property>

        <name>mapreduce.framework.name</name>

        <value>yarn</value>

</property>

<property>

        <name>mapreduce.jobhistory.address</name>

        <value>192.168.93.131:10020</value>

</property>

<property>

        <name>mapreduce.jobhistory.webapp.address</name>

        <value>192.168.93.131:19888</value>

</property>

 

 

5) slaves file (called workers in Hadoop 3.0), add

192.168.93.132

6) hadoop-env.sh, add

export JAVA_HOME=/home/wangkang/jdk1.8.0_75

7) yarn-env.sh, add

export JAVA_HOME=/home/wangkang/jdk1.8.0_75

8) Copy the configured Hadoop folder from the master host to slave1

$cd ~

$scp -r ./hadoop-2.7.7  192.168.93.132:/home/wangkang
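The JDK also needs to be present on slave1 for the environment variables below to work. If it was only unpacked on the master, one possible way to copy it over (adjust the directory name to whatever your JDK unpacked into):

$scp -r ~/jdk1.8.0_131  192.168.93.132:/home/wangkang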

Environment variable configuration

As the root user on both machines, open the /etc/profile file and add the environment variables:

export JAVA_HOME=/home/wangkang/jdk1.8.0_131

export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export HADOOP_HOME=/home/wangkang/hadoop-2.7.7

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_YARN_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/lib

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native

 

Run the following command to make the environment variables take effect immediately:

#source /etc/profile

After running this command, if you close the terminal and open a new one, you will find the environment variables no longer take effect and you have to run #source /etc/profile again.

The solution is to reboot the machine; after that the environment variables stay in effect.

If the environment variables are not in effect, many commands will fail with a "command not found" prompt, for example jps, hadoop fs, and start-all.sh.
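A quick way to confirm the variables are active in the current shell:

$ echo $HADOOP_HOME   (should print /home/wangkang/hadoop-2.7.7)

$ which hadoop   (should resolve to a path under the Hadoop installation)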

Initialize and verify (all steps as the wangkang user)

Execute the following commands on the master machine:

$ hdfs namenode -format

(Format HDFS. This operation should only be performed once; running it multiple times can cause problems. When you see "... /name has been successfully formatted", the format succeeded.)

$ start-dfs.sh   (start HDFS; the first start is slow)

$ start-yarn.sh   (start YARN)

$ jps   (use the jps command to view the Java processes on the master machine), which should look roughly as follows
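With the configuration above, which places the NameNode, SecondaryNameNode, and ResourceManager on the master, the list should contain roughly these processes (the process IDs will differ on your machine):

NameNode

SecondaryNameNode

ResourceManager

Jps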

 

Execute the same command, $ jps, on the slave1 machine:
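On slave1, which runs the worker daemons, the list should contain roughly:

DataNode

NodeManager

Jps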

 

To view the NameNode web page, open a browser inside the master virtual machine and enter 192.168.93.131:50070 to reach the page.

However, opening 192.168.93.131:50070 from the host machine outside the virtual machine did not work.

After checking, I found that the virtual machine's firewall had not been turned off; once it was disabled, the page became accessible.

 

As the root user on both master and slave1:

# sudo firewall-cmd --state   (check the firewall status)

# sudo systemctl stop firewalld.service   (stop the firewall; this only applies until the next reboot, after which the firewall will be running again)

# sudo systemctl disable firewalld.service   (disable the firewall so it will not start on the next boot)


 

Running Hadoop's built-in MapReduce example, wordcount

wordcount is a jar-packaged, simple MapReduce program that counts how many times each word appears in a document.

1. Create a file called test.txt under /home/wangkang and write a few words in it.

 

2. Create a directory named /word in the HDFS file system

$hadoop  fs  -mkdir  -p  /word

3. Upload the test file to the /word directory in HDFS

$hadoop  fs  -put  /home/wangkang/test.txt  /word/
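Optionally, confirm the upload:

$hadoop  fs  -ls  /word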

4. The jar package we need ships with the Hadoop installation, at

/home/wangkang/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar. Run it with the following command:

$hadoop  jar  /home/wangkang/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar  wordcount  /word  /out

Where:

wordcount is an alias mapped inside this jar;

/word is the path in the HDFS file system where the files to be processed are stored;

/out is the path where the results are stored. You choose this path, and the system creates it automatically in the HDFS file system. However, this path must not exist before the command is run, otherwise the job fails.
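If a previous run has already created /out, it can be removed before re-running with:

$hadoop  fs  -rm  -r  /out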

 

The execution result is shown in the figure.

 

 

5. View the results of the run

$hadoop  fs  -text  /out/part-r-00000
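Assuming, purely as an example, that test.txt contained the words "hello hadoop hello world", the output would look like:

hadoop   1

hello    2

world    1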

 

 

6. Problems you may encounter during execution

Problem 1.

Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=1024

You may hit this error on the first run. The message says the job requested 1536 MB of memory, but the maximum we configured is only 1024 MB.

Solution:

First stop all Hadoop services ($stop-all.sh).

Open hadoop-2.7.7/etc/hadoop/yarn-site.xml.

Change the yarn.nodemanager.resource.memory-mb property from the original 1024 to 2048 (the updated property is shown after these steps).

Note that this needs to be changed on both the master and slave1 machines.

After the change, restart the services ($start-all.sh).
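The updated property in yarn-site.xml, for reference:

<property>

        <name>yarn.nodemanager.resource.memory-mb</name>

        <value>2048</value>

</property>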

Problem 2.

The job hangs while running the jar.

 Running job: job_1559553475936_0001

It gets stuck at this point and does not move.

Solution:

After looking into it, I learned that the virtual machines had too little memory; my two virtual machines each had 2 GB.

First stop all services, then shut down the virtual machines, increase each machine's memory to 4 GB, and restart them.

 

 

 

Conclusion:

Reference book: 《Hadoop构建数据仓库实践》

 


Origin: blog.csdn.net/github_39538842/article/details/92578831