Submitting a Storm Job on E-MapReduce to Process Kafka Data

This article demonstrates how to deploy a Storm cluster and a Kafka cluster on E-MapReduce, and how to run a Storm job that consumes data from Kafka.

<h2 class="title sectiontitle" id="h2-url-1">Environment Preparation</h2> 
<div class="p">
 This walkthrough was tested in the Hangzhou region with EMR-3.8.0. The component versions used are: 
 <ul class="ul" id="ul-djq-zkj-gfb"> 
  <li class="li">Kafka: 2.11_1.0.0</li> 
  <li class="li">Storm: 1.0.1</li> 
 </ul> 
</div> 
<p class="p">This article uses the Alibaba Cloud EMR service to provision the Kafka cluster automatically; for the detailed procedure, see <a class="xref" href="https://help.aliyun.com/document_detail/35223.html?spm=a2c4e.11153940.blogcont637482.18.3e1625a1TUjLXZ" target="_blank">Create a Cluster</a>. </p> 
<ul class="ul" id="ul-ckz-rlj-gfb"> 
 <li class="li">Create a Hadoop cluster<br><img class="image" id="image-ild-wlj-gfb" src="http://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/21765/153811638912655_zh-CN.png"><br></li> 
 <li class="li">Create a Kafka cluster<br><img class="image" id="image-jfy-ylj-gfb" src="http://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/21765/153811638912657_zh-CN.png"><br>
  <div class="note note note-note"> 
   <div class="note-icon-wrapper">
    <i class="icon-note note"></i>
   </div> 
   <div class="note-content">
<strong>Note</strong> 
    <ul class="ul" id="ul-uys-1mj-gfb"> 
     <li class="li">If you use the classic network, place the Hadoop cluster and the Kafka cluster in the same security group. This saves you from configuring security group rules and avoids connectivity problems.</li> 
     <li class="li">If you use a VPC, place the Hadoop cluster and the Kafka cluster in the same VPC/VSwitch and the same security group. This likewise avoids extra network and security group configuration and prevents connectivity problems.</li> 
     <li class="li">If you are familiar with ECS networking and security groups, configure them as needed.</li> 
    </ul> 
   </div> 
  </div> </li> 
 <li class="li">Configure the Storm environment 
  <div class="p">
   A job that consumes from Kafka will fail on a freshly created cluster, because the Storm runtime is missing a number of required dependencies, listed below: 
   <ul class="ul" id="ul-zfw-dmj-gfb"> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/org/apache/curator/curator-client/2.10.0/curator-client-2.10.0.jar" target="_blank">curator-client</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/org/apache/curator/curator-framework/2.10.0/curator-framework-2.10.0.jar" target="_blank">curator-framework</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/org/apache/curator/curator-recipes/2.10.0/curator-recipes-2.10.0.jar" target="_blank">curator-recipes</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.jar" target="_blank">json-simple</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar" target="_blank">metrics-core</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/org/scala-lang/scala-library/2.11.7/scala-library-2.11.7.jar" target="_blank">scala-library</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/org/apache/zookeeper/zookeeper/3.4.6/zookeeper-3.4.6.jar" target="_blank">zookeeper</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/commons-cli/commons-cli/1.3.1/commons-cli-1.3.1.jar" target="_blank">commons-cli</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.jar" target="_blank">commons-collections</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar" target="_blank">commons-configuration</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/org/htrace/htrace-core/3.0.4/htrace-core-3.0.4.jar" target="_blank">htrace-core</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/org/slf4j/jcl-over-slf4j/1.6.6/jcl-over-slf4j-1.6.6.jar" target="_blank">jcl-over-slf4j</a></li> 
    <li class="li"><a class="xref" href="http://central.maven.org/maven2/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar" target="_blank">protobuf-java</a></li> 
   </ul> 
  </div> 
  <div class="p">
   The dependency versions above have been verified to work. If you introduce other dependencies during your own testing, add them to the Storm lib directory as well. The procedure is as follows:
   <br>
   <img class="image" id="image-jx1-wnj-gfb" src="http://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/21765/153811638912659_zh-CN.png">
   <br>
  </div> 
  <div class="p">
   The steps above must be performed on every node where Storm runs (the Hadoop cluster). Once done, restart the Storm service from the E-MapReduce console, as shown:
   <br>
   <img class="image" id="image-ott-znj-gfb" src="http://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/21765/153811639012660_zh-CN.png">
   <br>
  </div> 
  <div class="p">
   Check the operation history and wait for the Storm restart to complete:
   <br>
   <img class="image" id="image-tky-34j-gfb" src="http://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/21765/153811639012661_zh-CN.png">
   <br>
  </div> </li> 
</ul> 
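The dependency setup above can be scripted. The sketch below is an illustration, not part of the official docs: it derives the Maven Central download URL for each jar listed, and the actual fetch into the Storm lib directory is left commented out. It assumes the default EMR path <span class="ph filepath">/usr/lib/storm-current/lib/</span>; adjust if your layout differs.

```shell
# Build Maven Central URLs for the required jars (group:artifact:version).
deps="
org.apache.curator:curator-client:2.10.0
org.apache.curator:curator-framework:2.10.0
org.apache.curator:curator-recipes:2.10.0
com.googlecode.json-simple:json-simple:1.1
com.yammer.metrics:metrics-core:2.2.0
org.scala-lang:scala-library:2.11.7
org.apache.zookeeper:zookeeper:3.4.6
commons-cli:commons-cli:1.3.1
commons-collections:commons-collections:3.2.2
commons-configuration:commons-configuration:1.6
org.htrace:htrace-core:3.0.4
org.slf4j:jcl-over-slf4j:1.6.6
com.google.protobuf:protobuf-java:2.5.0
"
for dep in $deps; do
  group=$(echo "$dep" | cut -d: -f1 | tr . /)      # dots in the group id become path segments
  artifact=$(echo "$dep" | cut -d: -f2)
  version=$(echo "$dep" | cut -d: -f3)
  url="https://repo1.maven.org/maven2/$group/$artifact/$version/$artifact-$version.jar"
  echo "$url"
  # On each Storm node, uncomment to fetch the jar into Storm's lib directory:
  # wget -q "$url" -P /usr/lib/storm-current/lib/
done
```

Run this on every Storm node (or fetch the jars once and distribute them with scp), then restart the Storm service from the console.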


<h2 class="title sectiontitle" id="h2-url-2">Develop the Storm and Kafka Job</h2> 
<ul class="ul" id="ul-b1x-sqj-gfb"> 
 <li class="li"> 
  <div class="p">
   E-MapReduce provides ready-made sample code that you can use directly: 
   <ul class="ul" id="ul-ixn-q4j-gfb"> 
    <li class="li"><a class="xref" href="https://github.com/aliyun/aliyun-emapreduce-demo" target="_blank">e-mapreduce-demo</a></li> 
    <li class="li"><a class="xref" href="https://github.com/aliyun/aliyun-emapreduce-sdk" target="_blank">e-mapreduce-sdk</a></li> 
   </ul> 
  </div> </li> 
 <li class="li">Prepare topic data 
  <ol class="ol" id="ol-w5c-wqj-gfb"> 
    <li class="li">Log on to the Kafka cluster.</li> 
    <li class="li">Create a test topic with 10 partitions and 2 replicas<pre class="pre codeblock"><code>/usr/lib/kafka-current/bin/kafka-topics.sh --partitions 10 --replication-factor 2 --zookeeper emr-header-1:/kafka-1.0.0 --topic test --create</code></pre></li> 
    <li class="li">Write 100 records to the test topic<pre class="pre codeblock"><code>/usr/lib/kafka-current/bin/kafka-producer-perf-test.sh --num-records 100 --throughput 10000 --record-size 1024 --producer-props bootstrap.servers=emr-worker-1:9092 --topic test</code></pre></li> 
  </ol> 
  <div class="note note note-note"> 
   <div class="note-icon-wrapper">
    <i class="icon-note note"></i>
   </div> 
   <div class="note-content">
     <strong>Note</strong> Run the commands above on the emr-header-1 node of the Kafka cluster; they can also be run from a client machine. 
   </div> 
  </div> </li> 
 <li class="li">Run the Storm job 
  <div class="p">
   Log on to the Hadoop cluster and copy the <span class="ph filepath">examples-1.1-shaded.jar</span> built from the sample code above to emr-header-1 (here it is placed in the root user's home directory). Submit the job:
   <pre class="pre codeblock"><code>/usr/lib/storm-current/bin/storm jar examples-1.1-shaded.jar com.aliyun.emr.example.storm.StormKafkaSample test aaa.bbb.ccc.ddd hdfs://emr-header-1:9000 sample</code></pre>
  </div> </li> 
 <li class="li">Check the job 
  <ul class="ul" id="ul-pv3-zqj-gfb"> 
   <li class="li">Check Storm's running state 
    <div class="p">
      There are two ways to access the web UIs of services on the cluster: 
     <ul class="ul" id="ul-fl4-zqj-gfb"> 
      <li class="li">Via Knox; see <a class="xref" href="https://help.aliyun.com/document_detail/62675.html" target="_blank">Knox Usage Instructions</a></li> 
      <li class="li">Via an SSH tunnel; see <a class="xref" href="https://help.aliyun.com/document_detail/28187.html?spm=a2c4g.11186623.6.640.24b454c4CAFUqC" target="_blank">Log On to a Cluster Using SSH</a></li> 
     </ul> 
    </div> 
    <div class="p">
      This article uses the SSH tunnel; the access URL is
      <span class="ph filepath">http://localhost:9999/index.html </span>. You can see the topology we just submitted; click it to view the execution details:
     <br>
     <img class="image" id="image-gl4-zqj-gfb" src="http://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/assets/img/21765/153811639012663_zh-CN.png">
     <br>
    </div> </li> 
   <li class="li">Check the HDFS output 
    <ul class="ul" id="ul-tkr-brj-gfb"> 
     <li class="li">View the HDFS file output<pre class="pre codeblock"><code>[root@emr-header-1 ~]# hadoop fs -ls /foo/
-rw-r--r-- 3 root hadoop 615000 2018-02-11 13:37 /foo/bolt-2-0-1518327393692.txt
-rw-r--r-- 3 root hadoop 205000 2018-02-11 13:37 /foo/bolt-2-0-1518327441777.txt
[root@emr-header-1 ~]# hadoop fs -cat /foo/bolt-2-0-1518327441777.txt | wc -l
200</code></pre></li> 
     <li class="li">Write another 120 records to Kafka<pre class="pre codeblock"><code>[root@emr-header-1 ~]# /usr/lib/kafka-current/bin/kafka-producer-perf-test.sh --num-records 120 --throughput 10000 --record-size 1024 --producer-props bootstrap.servers=emr-worker-1:9092 --topic test
120 records sent, 816.326531 records/sec (0.80 MB/sec), 35.37 ms avg latency, 134.00 ms max latency, 35 ms 50th, 39 ms 95th, 41 ms 99th, 134 ms 99.9th.</code></pre></li> 
     <li class="li">View the HDFS file output again<pre class="pre codeblock"><code>[root@emr-header-1 ~]# hadoop fs -cat /foo/bolt-2-0-1518327441777.txt | wc -l
320</code></pre></li> 
    </ul> </li> 
  </ul> </li> 
</ul> 
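As a sanity check after the topic-preparation and producer steps above, you can total the per-partition offsets of the topic to confirm that all the records landed in Kafka. Kafka ships a GetOffsetShell tool for this (the cluster command is shown commented out below); its output is one <code>topic:partition:offset</code> line per partition, which a short awk one-liner sums. The contents written to <span class="ph filepath">offsets.txt</span> here are stand-in illustration data, not real cluster output:

```shell
# On emr-header-1, dump the latest offset of each partition of the test topic:
# /usr/lib/kafka-current/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
#     --broker-list emr-worker-1:9092 --topic test --time -1 > offsets.txt

# Stand-in data in GetOffsetShell's topic:partition:offset format:
printf 'test:0:10\ntest:1:12\ntest:2:8\n' > offsets.txt

# Total the offsets across partitions; on a real cluster this should equal
# the number of records produced so far.
awk -F: '{sum += $3} END {print sum}' offsets.txt
```

With the stand-in data above the total printed is 30; on the real cluster the total should match the record count reported by kafka-producer-perf-test.sh.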


<h2 class="title sectiontitle" id="h2-url-3">Summary</h2> 
<p class="p">At this point, we have successfully deployed a Storm cluster and a Kafka cluster on E-MapReduce and run a Storm job that consumes Kafka data. E-MapReduce also supports the Spark Streaming and Flink components, which can just as easily run on a Hadoop cluster to process Kafka data. </p> 
<div class="note note note-note"> 
 <div class="note-icon-wrapper">
  <i class="icon-note note"></i>
 </div> 
 <div class="note-content">
  <strong>Note</strong> 
  <p class="p">Because E-MapReduce does not offer a dedicated Storm cluster type, we created a Hadoop cluster and installed the Storm component on it. If you do not need the other components, you can easily stop them in the E-MapReduce console, which turns the Hadoop cluster into a pure Storm cluster.</p> 
 </div> 
</div> 


(The author of this article is a documentation engineer for Alibaba Cloud big data products.)


Reprinted from yq.aliyun.com/articles/646474