Notes on the pitfalls encountered and the environment deployment process.
Build a pseudo-distributed Hadoop
- First, install ZooKeeper. That installation is straightforward, so not much to say about it.
- The second, more involved task is installing OpenSSH. My Linux system is the CentOS 7 minimal edition, so there is a fair amount of preparation before OpenSSH can be built.
The tar packages that need to be installed are:
- libpcap-1.8.1.tar.gz
- zlib-1.2.8.tar.gz
- perl-5.22.4.tar.gz
- openssl-1.0.2j.tar.gz
- openssh-7.2p2.tar.gz
The install order is perl first, then zlib (zlib's build depends on perl5), and after that the order does not matter. The main purpose of installing OpenSSH is to set up passwordless login, which makes building Hadoop easier.
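The passwordless-login setup itself can be sketched as follows (a minimal sketch; run as the user that will start Hadoop, so that the start scripts can ssh to localhost without a prompt):

```shell
# Generate a key pair with an empty passphrase (skip if one already exists)
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa -q
# Authorize our own public key for login
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify: "ssh localhost" should now log in without asking for a password.
```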
- Install hadoop.
You need to configure the Java environment variables as well as the Hadoop environment variables. Sometimes JAVA_HOME fails to load; the fix (easy to find on Baidu) is to hard-code JAVA_HOME at around line 25 of the hadoop-env.sh configuration file. The main things to pay attention to are the three configuration files core-site.xml, hdfs-site.xml, and yarn-site.xml.
core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.1.103:9000/</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/u/hadoop-2.7.6/tmp</value>
    </property>
</configuration>
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.1.103:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.1.103:50090</value>
    </property>
    <!-- Number of HDFS replicas (for a single-node pseudo-distributed setup, 1 is usually enough) -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <!-- NameNode storage path -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/u/hadoop-2.7.6/namenode</value>
    </property>
    <!-- DataNode storage path -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/u/hadoop-2.7.6/datanode</value>
    </property>
</configuration>
yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.1.103</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
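With the configuration files in place, the environment variables mentioned above also need to be exported. A sketch of the profile entries (the JDK path below is an assumption, adjust it to your actual install), plus the one-time NameNode format that must happen before the first start:

```shell
# Assumed paths -- change JAVA_HOME to wherever your JDK actually lives
export JAVA_HOME=/usr/java/jdk1.8.0_171
export HADOOP_HOME=/home/u/hadoop-2.7.6
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Before the very first start, format the NameNode (one time only):
# hdfs namenode -format
```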
Then go to the hadoop-2.7.6/sbin/ directory and execute start-all.sh to start all roles at once. After 2.x starts successfully, jps should show processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:
Running the first WordCount program
The book I read is Hadoop in Action. The programs in it are a bit dated, so the WordCount I wrote from it failed with many ClassNotFound errors. However, Hadoop ships with its own WordCount example, and you can read its source code. Use the bundled wordcount to test whether this environment works.
- Generate the input file
echo "I love Java I love Hadoop I love BigData Good Good Study, Day Day Up" > wc.txt
- Create a directory on HDFS and upload wc.txt to it
hdfs dfs -mkdir -p /input/wordcount
hdfs dfs -put wc.txt /input/wordcount
- Then it can be executed. Note that the output directory must not already exist; the job creates it and fails if it is present.
hadoop jar /home/u/hadoop-2.7.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input/wordcount /output/wordcount
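As a quick sanity check of what the job should produce, the same counts can be reproduced locally with standard shell tools (a cross-check only, not part of the Hadoop job):

```shell
# Count word occurrences in the test sentence locally, most frequent first
echo "I love Java I love Hadoop I love BigData Good Good Study, Day Day Up" \
  | tr ' ' '\n' | sort | uniq -c | sort -rn
```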
result: