The first WordCount

A record of the pitfalls encountered while deploying the environment and running the first job.

Build a pseudo-distributed Hadoop

  1. First, the environment needs ZooKeeper installed. It installs easily, so there is not much to say about it.
  2. The second, more involved step is installing OpenSSH. My Linux system is the CentOS 7 minimal install, so there is quite a bit of preparation before OpenSSH itself can be installed.
    The tar packages that need to be installed are:
    • libpcap-1.8.1.tar.gz
    • zlib-1.2.8.tar.gz
    • perl-5.22.4.tar.gz
    • openssl-1.0.2j.tar.gz
    • openssh-7.2p2.tar.gz
      The build order is perl first, then zlib (zlib depends on perl5), and after that the order does not matter.
      The main purpose of installing OpenSSH is to set up password-free login, which makes working with Hadoop much easier (the key setup is sketched just before the start-up step below).
  3. Install hadoop.
    You need to configure the Java environment variables as well as the Hadoop environment variables. A common problem is that JAVA_HOME cannot be loaded; the fix (easy to find with a quick search) is to edit roughly line 25 of the hadoop-env.sh configuration file and set JAVA_HOME explicitly.
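    A minimal sketch of that fix, assuming Hadoop 2.7.6 is installed under /home/u as in the configs below; the JDK path is only an example, substitute your own install location:

    # /home/u/hadoop-2.7.6/etc/hadoop/hadoop-env.sh, around line 25
    # was: export JAVA_HOME=${JAVA_HOME}   (sometimes not resolved)
    export JAVA_HOME=/usr/local/jdk1.8.0_171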
    The main things to note are the three configuration files core-site.xml, hdfs-site.xml and yarn-site.xml.

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.1.103:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/u/hadoop-2.7.6/tmp</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
   <name>dfs.namenode.http-address</name>
   <value>192.168.1.103:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>192.168.1.103:50090</value>
  </property>
  <!-- Specify the number of HDFS replicas -->
  <property>
   <name>dfs.replication</name>
   <value>3</value>
  </property>
  <!-- Specify the NameNode storage path -->
  <property>
   <name>dfs.namenode.name.dir</name>
   <value>/home/u/hadoop-2.7.6/namenode</value>
  </property>
  <!-- Specify the DataNode storage path -->
  <property>
   <name>dfs.datanode.data.dir</name>
   <value>/home/u/hadoop-2.7.6/datanode</value>
  </property>
</configuration>

yarn-site.xml

<configuration>

<!-- Site specific YARN configuration properties -->
   <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.103</value>
   </property>
   <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
</configuration>
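
Before the first start, two more things are needed: password-free SSH to localhost (the reason OpenSSH was installed in step 2 above) and a one-time format of the NameNode. A minimal sketch, assuming the default key location and the 192.168.1.103 address used in the configs above:

    ssh-keygen -t rsa                                # accept the defaults
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # authorize the key for this account
    chmod 600 ~/.ssh/authorized_keys
    ssh 192.168.1.103                                # should now log in without a password

    hdfs namenode -format                            # run once, before the very first start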

Then go to the hadoop-2.7.6/sbin/ directory and execute start-all.sh to start all of the roles at once. After a 2.x pseudo-distributed setup starts successfully, running jps should show all of the HDFS and YARN daemons:
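Roughly like this (the process IDs are placeholders; only the daemon names matter):

    2481 NameNode
    2602 DataNode
    2794 SecondaryNameNode
    2945 ResourceManager
    3051 NodeManager
    3380 Jps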

Running the first WordCount program

The book I was reading is Hadoop in Action. The code in it is a bit dated, so the WordCount I wrote myself from it ran into a lot of ClassNotFound errors. However, Hadoop ships with its own WordCount example, and its source code can be read, so the bundled WordCount can be used to test whether this environment actually works; a sketch of its structure follows.
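
For reference, the bundled example is essentially the standard WordCount from the Hadoop MapReduce tutorial, written against the newer org.apache.hadoop.mapreduce API; the class shipped in hadoop-mapreduce-examples-2.7.6.jar may differ in small details, but the structure is this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: split each line into tokens and emit (word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}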

  1. Generate an input file
    echo "I love Java I love Hadoop I love BigData Good Good Study, Day Day Up" > wc.txt
  2. Create a directory on HDFS and upload wc.txt to it
    hdfs dfs -mkdir -p /input/wordcount
    hdfs dfs -put wc.txt /input/wordcount
  3. Then the job can be run. Note that the output directory must not already exist; Hadoop creates it and the job fails if it does.
    hadoop jar /home/u/hadoop-2.7.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input/wordcount /output/wordcount

result:
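The original post showed the counts as a screenshot. They can also be printed from the shell (the part file name assumes the default single reducer); with the wc.txt above the output should be roughly:

    hdfs dfs -cat /output/wordcount/part-r-00000

    BigData 1
    Day     2
    Good    2
    Hadoop  1
    I       3
    Java    1
    Study,  1
    Up      1
    love    3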
