Foreword
My recent project work involves big data and extracting data into an underlying store, so I spent some time studying the relevant technology. I had compiled the relevant documents and uploaded them to CSDN, but resources I upload there can no longer be made free to download (the system automatically charges five points), so I decided to write it up here on the blog instead. If anything is wrong, please correct me.
Brief introduction
1. Hadoop: a big data processing framework with three basic components: HDFS, YARN, and MapReduce
2. HBase: a distributed storage system for structured data, used in conjunction with Hadoop
3. Kettle: an open-source ETL tool used for data extraction
As the title says, when using a relational database (such as MySQL or Oracle), if the data is refreshed at the second level and both current and historical data need to be recorded for future queries, the volume of data becomes very large. Hadoop is well suited for storing it. I have divided the path to the final goal into five steps, and likewise into five articles:
1) Setting up a Hadoop cluster environment
2) Using Kettle to extract data from MySQL into Hadoop
3) Setting up HBase on top of the Hadoop cluster
4) Extracting MySQL table data into HBase
5) Executing the extraction task on a schedule
Setting up the Hadoop cluster environment
Linux commands
1. Add a new group zu1 to the system, specifying 888 as the new group's identification number (GID)
groupadd -g 888 zu1
2. Delete the existing group zu1
groupdel zu1
3. Change the GID of group zu1 to 10000 and rename it zu2
groupmod -g 10000 -n zu2 zu1
4. Switch the current user's group to root, provided the user actually has root as a primary or supplementary group
newgrp root
5. View group information: cat /etc/group
Each record in /etc/group has four fields:
The first field: the group name;
The second field: the group password;
The third field: the GID;
The fourth field: the user list, with users separated by commas (,); this field can be empty; if it is empty, the group's members are the users whose primary GID matches this group's GID.
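As a quick sketch of how these fields line up, a record can be split with cut; the record below is a made-up example for illustration, not taken from a real system:

```shell
# Split a sample /etc/group record into its four colon-separated fields.
# This record is a made-up example for illustration only.
record="zu1:x:888:wangkang,otheruser"

name=$(echo "$record" | cut -d: -f1)      # group name
gid=$(echo "$record" | cut -d: -f3)       # GID
members=$(echo "$record" | cut -d: -f4)   # comma-separated member list

echo "group=$name gid=$gid members=$members"
```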
6. Query which groups user wangkang belongs to
[root@localhost ~]# groups wangkang
wangkang : zu1
7. Remove user wangkang from group zu1
gpasswd -d wangkang zu1
8. Add user wangkang to group zu1
usermod -a -G zu1 wangkang
Preparation Before Installation
Java Version: 1.8
Hadoop Version: 2.7.7
Two CentOS 7 virtual machines, each with its hostname modified, 2 GB of memory, and a fixed IP address. My virtual machines are:
master 192.168.93.131 (master server)
slave1 192.168.93.132 (slave server)
1. As the root user
Create a user wangkang on both machines with home directory /home/wangkang (this can also be done directly when installing the system).
Add the user wangkang to the root group.
2. As the root user
Add the following to the /etc/hosts file on both machines:
192.168.93.131 master
192.168.93.132 slave1
This is used to resolve the hostnames.
3. As the wangkang user
Download the Java installation package to the home directory /home/wangkang (you can also download it elsewhere and upload it to the Linux system), then enter the commands:
$cd ~ (enter the home directory)
$tar -zxvf jdk-8u80-linux-x64.tar.gz (decompress)
4. As the wangkang user
Because the master and slave servers need to access each other, we need to set up passwordless login.
Execute on both machines:
$cd ~
$ssh-keygen -t rsa (generate a key pair on the local machine)
Execute on the master host:
$cd ~/.ssh/ (enter the hidden .ssh folder)
$ssh-copy-id 192.168.93.131 (append the generated public key to the machine's own authorized_keys file)
$scp /home/wangkang/.ssh/authorized_keys 192.168.93.132:/home/wangkang/.ssh/
(copy the master's authorized_keys file to slave1)
Execute on slave1:
$ssh-copy-id 192.168.93.132 (append slave1's generated public key to the authorized_keys file)
Note: at this point authorized_keys contains the public keys of both machines.
$scp /home/wangkang/.ssh/authorized_keys 192.168.93.131:/home/wangkang/.ssh/
(copy slave1's authorized_keys file back to the master)
Passwordless login configuration is complete.
Verify success:
On the master:
$ssh slave1 (log in to slave1; the first login requires a password)
After entering the password, you are logged in to slave1
$exit (exit)
$ssh slave1 (the second login requires no password)
Verify passwordless login from slave1 to the master in the same way.
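The key-exchange steps above can be sketched in one place. The snippet below imitates what ssh-copy-id does, but uses a throwaway temporary directory instead of the real ~/.ssh so it is safe to run anywhere; it assumes ssh-keygen is installed:

```shell
# Imitate ssh-copy-id in a throwaway temp directory (not the real ~/.ssh).
dir=$(mktemp -d)

# Generate an RSA key pair non-interactively: -N '' means no passphrase,
# -f gives the output path, -q suppresses the banner.
ssh-keygen -t rsa -N '' -f "$dir/id_rsa" -q

# ssh-copy-id essentially appends the public key to authorized_keys on the
# target machine; here we append to a local file to show the effect.
cat "$dir/id_rsa.pub" >> "$dir/authorized_keys"
chmod 600 "$dir/authorized_keys"   # sshd ignores keys with loose permissions

wc -l < "$dir/authorized_keys"     # one line per authorized public key
```

After both machines' public keys have been appended, authorized_keys holds two such lines, which is why either host can then log in to the other without a password.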
Hadoop installation and configuration
The whole process is done as the wangkang user on the master host.
1. As with the JDK, upload or download the Hadoop archive to wangkang's home directory, then decompress it.
2. Create new directories inside the decompressed hadoop directory:
$cd ~/hadoop-2.7.7
$mkdir tmp
$mkdir hdfs
$mkdir hdfs/data
$mkdir hdfs/name
3. Modify the Hadoop configuration files
Enter ~/hadoop-2.7.7/etc/hadoop
1) core-site.xml, add
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.93.131:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/wangkang/hadoop-2.7.7/tmp</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
2) hdfs-site.xml, add
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/wangkang/hadoop-2.7.7/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/wangkang/hadoop-2.7.7/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>192.168.93.131:9001</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address</name>
<value>192.168.93.131:10000</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
3) yarn-site.xml, add
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.93.131:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>192.168.93.131:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>192.168.93.131:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>192.168.93.131:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>192.168.93.131:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
4) mapred-site.xml (if the file does not exist, find mapred-site.xml.template and copy it to a file named mapred-site.xml), add
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>192.168.93.131:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>192.168.93.131:19888</value>
</property>
5) slaves file (called workers in Hadoop 3.0), add
192.168.93.132
6) hadoop-env.sh, add
export JAVA_HOME=/home/wangkang/jdk1.8.0_75
7) yarn-env.sh, add
export JAVA_HOME=/home/wangkang/jdk1.8.0_75
8) Copy the configured hadoop folder from the master host to slave1
$cd ~
$scp -r ./hadoop-2.7.7 192.168.93.132:/home/wangkang
Configuring environment variables
As the root user on each of the two machines, open the /etc/profile file and add the environment variables:
export JAVA_HOME=/home/wangkang/jdk1.8.0_131
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/home/wangkang/hadoop-2.7.7
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/lib
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
Execute this command to make the environment variables take effect immediately:
#source /etc/profile
After this command takes effect, if you close and then reopen the terminal, you will find the environment variables are gone, and you have to run #source /etc/profile again.
One solution is to shut down and reboot, after which the environment variables stay in effect.
When the environment variables are not in effect, many script commands will fail with a "command not found" style prompt, for example jps, hadoop fs, and start-all.sh.
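The behavior above comes from how source works: it only modifies the current shell, so every newly opened terminal starts without the variables until /etc/profile is read again. A small sketch, using a temporary file rather than the real /etc/profile:

```shell
# source only affects the shell that runs it, which is why a new terminal
# "loses" the variables. Demonstrate with a temp file, not /etc/profile.
profile=$(mktemp)
echo 'export DEMO_HOME=/opt/demo' > "$profile"

# A child shell that sources the file sees the variable...
with_source=$(sh -c ". $profile; echo \$DEMO_HOME")

# ...but a fresh child shell that does not source it sees nothing.
without_source=$(sh -c 'echo $DEMO_HOME')

echo "sourced: $with_source"
echo "not sourced: [$without_source]"
```

Rather than rebooting, a common alternative is to append source /etc/profile to ~/.bashrc, so every new terminal picks the variables up automatically.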
Initialize and run verification (all as the wangkang user)
Execute the following command on the master machine:
$hdfs namenode -format
(Format HDFS. Perform this operation only once; running it multiple times may cause problems. When you see "/name has been successfully formatted", it indicates formatting succeeded.)
$start-dfs.sh (start HDFS; the first start is slow)
$start-yarn.sh (start YARN)
$jps (use the jps command to view the Java processes on the master machine), as follows
Run the jps command on the slave1 machine as well to check its processes.
View the NameNode web page: open a browser on the master virtual machine and enter 192.168.93.131:50070 to reach the page.
However, accessing the address 192.168.93.131:50070 from the host machine did not work.
After checking, I found the virtual machine's firewall had not been turned off; after turning it off, the page can be accessed.
As the root user on the master and slave1 respectively:
#firewall-cmd --state (check the firewall status)
#systemctl stop firewalld.service (turn off the firewall; this only lasts for the current boot, and the firewall will be on again after a restart)
#systemctl disable firewalld.service (disable the firewall by default, so it will not start on the next boot)
As shown in the figure.
Running the wordcount MapReduce example that ships with Hadoop
wordcount is a jar package containing a simple MapReduce implementation that counts how many times each word appears in a document.
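Before running the job on the cluster, the same count can be sanity-checked locally with a plain shell pipeline (the sample text here is made up):

```shell
# A local equivalent of wordcount: one word per line, then sort and count.
printf 'hello world\nhello hadoop\n' > /tmp/wordcount_sample.txt

# tr splits words onto separate lines; uniq -c counts adjacent duplicates,
# which is why the input must be sorted first; sort -rn puts the most
# frequent word on top.
tr -s ' ' '\n' < /tmp/wordcount_sample.txt | sort | uniq -c | sort -rn
```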
1. Create a file named test.txt under /home/wangkang and write a few words into it, as shown
2. Create a path named word in the HDFS file system
$hadoop fs -mkdir -p /word
3. Upload the test file to the /word path in HDFS
$hadoop fs -put /home/wangkang/test.txt /word/
4. The jar package is included in the Hadoop installation directory, at the path /home/wangkang/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar. Execute this jar with the command:
$hadoop jar /home/wangkang/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /word /out
where:
wordcount is an alias mapped inside the jar;
/word is the path in the HDFS file system where the files to be processed are stored;
/out is the path where the results are stored. You specify this path yourself, and the system creates it automatically in the HDFS file system. However, this path must not exist before the command runs, otherwise the job will fail.
The execution result is shown in the figure.
5. View the execution result
$hadoop fs -text /out/part-r-00000
6. Problems you may run into during execution.
Problem 1.
Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=1024
You may see this error the first time you run the job. It means the task requested 1536 MB of memory, but the configured maximum is only 1024 MB.
Solution:
First stop all Hadoop services ($stop-all.sh).
Find hadoop-2.7.7/etc/hadoop/yarn-site.xml.
Change the yarn.nodemanager.resource.memory-mb property from 1024 to 2048.
Note that this must be changed on both the master and slave1 machines.
After the change, restart the services ($start-all.sh).
Problem 2.
The job hangs when the jar is executed:
Running job: job_1559553475936_0001
It gets stuck at this point and goes no further.
Solution:
After consulting some references I learned that the virtual machines had too little memory; my two virtual machines had 2 GB each.
First stop all the services, then shut down the virtual machines, increase each one's memory to 4 GB, and restart.
The end:
Reference book: 《Hadoop构建数据仓库实践》 (Building a Data Warehouse Practice with Hadoop)