Near-real-time log processing with Flume + Kafka + Spark Streaming + HDFS on CDH 5.7.5, using CM and IDEA (tested and working)

  • The main goal of this article is to implement the following:

  1. Collect logs with Flume;

  2. Deliver the collected logs to Kafka;

  3. Process the logs pulled from Kafka with Spark Streaming;

  4. Store the processed results in a designated HDFS directory.

  • Flume-to-Kafka configuration

a1.sources = r1
a1.channels = c1
a1.sinks = s1

# source configuration
a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /usr/local/soft/flume/flume_dir/kafka.log
a1.sources.r1.channels=c1

# channel configuration
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# Kafka sink
a1.sinks.s1.type=org.apache.flume.sink.kafka.KafkaSink

# Kafka broker addresses and ports
a1.sinks.s1.brokerList=manager:9092,namenode:9092,datanode:9092

# Kafka topic
a1.sinks.s1.topic=realtime

# serialization
a1.sinks.s1.serializer.class=kafka.serializer.StringEncoder
a1.sinks.s1.channel=c1

 

Note three key settings in this configuration file:

1. The settings:

  a. a1.sources.r1.command=tail -F /usr/local/soft/flume/flume_dir/kafka.log

  b. a1.sinks.s1.brokerList=manager:9092,namenode:9092,datanode:9092

  c. a1.sinks.s1.topic=realtime

2. From these settings it is clear that:

  a. We need to create a file kafka.log under /usr/local/soft/flume/flume_dir/ and write content into it (covered below);

  b. Flume connects to Kafka at manager:9092,namenode:9092,datanode:9092 (double-check these addresses);

  c. Flume publishes everything it collects to the Kafka topic realtime, so after starting ZooKeeper and Kafka we need to open a terminal that consumes the topic realtime. That lets us watch Flume and Kafka working together.

  • Writing the test script kafka_output.sh
  1. Create an empty file kafka.log under /usr/local/soft/flume/flume_dir/. Then, in the root user's home directory, create a script kafka_output.sh that appends content to kafka.log (be sure to make it executable; see the commands after the script). The script is as follows:

#!/bin/bash
for((i=0;i<=1000;i++));
do echo "kafka_test-$i">>/usr/local/soft/flume/flume_dir/kafka.log;
done
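
For reference, a minimal way to make the script executable and run it, assuming it sits in root's home directory as described above:

chmod +x ~/kafka_output.sh
~/kafka_output.sh

# watch the file grow while the script runs
tail -f /usr/local/soft/flume/flume_dir/kafka.log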

  2. Start ZooKeeper and Kafka from Cloudera Manager (CM).

  3. Create the topic realtime on the Kafka cluster:

kafka-topics.sh --create --zookeeper manager:2181,namenode:2181,datanode:2181 --replication-factor 3 --partitions 1 --topic realtime
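
Optionally, confirm the topic exists with the standard Kafka CLI, using the same ZooKeeper quorum as above:

kafka-topics.sh --describe --zookeeper manager:2181,namenode:2181,datanode:2181 --topic realtime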

  4. Open a new terminal and, from the Kafka installation directory, run the following command to consume the topic realtime:

kafka-console-consumer.sh --zookeeper manager:2181,namenode:2181,datanode:2181 --from-beginning --topic realtime

 

  5. Start Flume (starting it from CM also works):

1) cd /opt/cloudera/parcels/CDH-5.7.5-1.cdh5.7.5.p0.3
2) bin/flume-ng agent --conf conf --conf-file etc/flume-ng/conf.empty/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console

  6. Run the kafka_output.sh script (watch both the contents of kafka.log and what the consumer terminal receives).
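
If everything is wired correctly, the consumer terminal should echo the lines the script appends:

kafka_test-0
kafka_test-1
kafka_test-2
...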

 

Flume agent console (screenshot):

Kafka consumer terminal (screenshot):

Spark Streaming log processing:

 

  • pom dependencies

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.realtime.kafka</groupId>
        <artifactId>kafka2hdfs</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <properties>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
            <scala.version>2.11.8</scala.version>
            <spark.version>2.1.0</spark.version>
            <hadoop.version>2.6.5</hadoop.version>
            <jackson.version>2.6.2</jackson.version>
            <encoding>UTF-8</encoding>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
                <version>${scala.version}</version>
            </dependency>
           
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <version>${spark.version}</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.11</artifactId>
                <version>${spark.version}</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming_2.11</artifactId>
                <version>${spark.version}</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11 -->
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
                <version>${spark.version}</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume_2.11 -->
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming-flume_2.11</artifactId>
                <version>${spark.version}</version>
            </dependency>
           
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>${hadoop.version}</version>
            </dependency>
            
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
                <version>5.1.44</version>
            </dependency>
    
            <!-- json -->
            <dependency>
                <groupId>org.json</groupId>
                <artifactId>json</artifactId>
                <version>20090211</version>
            </dependency>
            <!--jackson json -->
            <dependency>
                <groupId>com.fasterxml.jackson.core</groupId>
                <artifactId>jackson-core</artifactId>
                <version>${jackson.version}</version>
            </dependency>
            <dependency>
                <groupId>com.fasterxml.jackson.core</groupId>
                <artifactId>jackson-databind</artifactId>
                <version>${jackson.version}</version>
            </dependency>
            <dependency>
                <groupId>com.fasterxml.jackson.core</groupId>
                <artifactId>jackson-annotations</artifactId>
                <version>${jackson.version}</version>
            </dependency>
            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>1.1.41</version>
            </dependency>
        </dependencies>
    
        <build>
            <pluginManagement>
                <plugins>
                    <plugin>
                        <groupId>net.alchim31.maven</groupId>
                        <artifactId>scala-maven-plugin</artifactId>
                        <version>3.2.2</version>
                    </plugin>
                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-compiler-plugin</artifactId>
                        <version>3.5.1</version>
                    </plugin>
                </plugins>
            </pluginManagement>
            <plugins>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <executions>
                        <execution>
                            <id>scala-compile-first</id>
                            <phase>process-resources</phase>
                            <goals>
                                <goal>add-source</goal>
                                <goal>compile</goal>
                            </goals>
                        </execution>
                        <execution>
                            <id>scala-test-compile</id>
                            <phase>process-test-resources</phase>
                            <goals>
                                <goal>testCompile</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
    
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <executions>
                        <execution>
                            <phase>compile</phase>
                            <goals>
                                <goal>compile</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
    
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.4.3</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.SF</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META-INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>
    
  • Spark Streaming code
package test

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

/**
  * @author LuJuHui
  * @date 2018/7/31
  */
object Kafka2Scala2WC {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Kafka2Scala2WC").setMaster("local[3]") // local[3] = run locally with 3 worker threads
    val ssc = new StreamingContext(conf, Seconds(5)) // process the data in 5-second batches
    val kafkaParams = Map[String, Object](
      /* Kafka broker addresses and ports */
      "bootstrap.servers" -> "manager:9092,namenode:9092,datanode:9092",
      /* key/value deserializers */
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      /* Kafka consumer group id */
      "group.id" -> "kafka_wc",
      /* where to start reading when no committed offset exists */
      "auto.offset.reset" -> "latest",
      /* whether offsets are committed automatically */
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    /* the Kafka topic created earlier */
    val topics = Array("realtime") // multiple topics are possible: "topicA", "topicB"

    /* create a direct stream against Kafka */
    val data = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    /* first-pass processing: extract the value of each consumed record */
    val lines = data.map(_.value())

    /* flatten the lines and split them on spaces */
    val words = lines.flatMap(_.split(" "))

    /* use each word as the key and 1 as the value, forming (K, V) pairs */
    val wordAndOne = words.map((_, 1))
    val reduced = wordAndOne.reduceByKey(_ + _)

    /* Save the results to HDFS; on CDH 5.7.5 the NameNode listens on the default port 8020.
     * saveAsTextFile fails if the output directory already exists, so each batch is
     * written to its own subdirectory keyed by the batch time. */
    reduced.foreachRDD((rdd, time) => {
      rdd.saveAsTextFile(s"hdfs://namenode:8020/user/kafka/data/${time.milliseconds}")
    })

    /* print the results to the console */
    reduced.print()

    /* start the Spark Streaming job */
    ssc.start()

    /* block until the job terminates */
    ssc.awaitTermination()
  }
}


[Note: when creating the Maven project, be sure to add dependency jars that match your Scala version, or the build will fail.]
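
To run the job on the cluster instead of inside IDEA, here is a sketch of packaging and submitting it. The jar name follows the artifactId and version in the pom above, and spark-submit is assumed to be on the PATH; note that setMaster("local[3]") in the code overrides any --master flag, so remove it from the code for a real cluster run:

# build the fat jar (the shade plugin runs in the package phase)
mvn clean package
spark-submit --class test.Kafka2Scala2WC target/kafka2hdfs-1.0-SNAPSHOT.jar

After a few batches, the word-count output should appear under /user/kafka/data on HDFS, one timestamped subdirectory per batch:

hdfs dfs -ls /user/kafka/data
hdfs dfs -cat /user/kafka/data/<batch-timestamp>/part-*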

 

 


Reposted from blog.csdn.net/lukabruce/article/details/81302209