When deploying Flume across servers, keep the company's security policy in mind: if something fails, it may not be a configuration problem, so check with the operations team.
The business requirements are:
1. Deploy Flume on a server outside the cluster and collect data across servers.
2. Consume the collected data in real time with Kafka + Spark Streaming.
3. Save the results to HDFS.
Flume configuration on Server A:
flume_kafka_source.conf
a1.sources = r1
a1.channels = c1
a1.sinks =s1
# Source configuration
a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /home/diamonds/ztx/test/kafka.log
# Channel configuration
a1.channels.c1.type=memory
a1.channels.c1.capacity=20000
a1.channels.c1.transactionCapacity=1000
# Describe the sink
a1.sinks.s1.type = avro
# Public IP of Server B
a1.sinks.s1.hostname = 118.31.75.XXX
a1.sinks.s1.port = 33333
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
Flume configuration on Server B:
flume_kafka_sink.conf
# Agent name and the names of the source, channel, and sink
a1.sources = r1
a1.sinks = s1
a1.channels = c1
# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 33333
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
# Configure the Kafka sink
a1.sinks.s1.type= org.apache.flume.sink.kafka.KafkaSink
# Kafka broker addresses and ports
a1.sinks.s1.brokerList=sparkmaster:9092,datanode1:9092,datanode2:9092
# Kafka topic
a1.sinks.s1.topic=realtime
# Serializer class
a1.sinks.s1.serializer.class=kafka.serializer.StringEncoder
a1.sinks.s1.channel=c1
a1.sources.r1.channels=c1
Maven dependencies:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.version>2.11.8</scala.version>
<hadoop.version>2.7.4</hadoop.version>
<spark.version>2.0.2</spark.version>
<encoding>UTF-8</encoding>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- Spark Streaming dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.2</version>
</dependency>
<!-- Spark Streaming Flume integration dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.11</artifactId>
<version>2.0.2</version>
</dependency>
<!-- Spark Streaming Kafka integration dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.0.2</version>
</dependency>
<!--log4j-->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.14</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass></mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
Kafka + Spark Streaming code:
package cn.test.spark
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
object SparkStreamingKafkaDirect {
  //Production environment Kafka broker list
  val ks="sparkmaster:9092,datanode1:9092,datanode2:9092"

  def main(args: Array[String]): Unit = {
    //Create SparkConf
    // val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingKafkaDirect")
    //Local mode
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingKafkaDirect").setMaster("local[3]")
    //Create the SparkContext
    val sparkContext = new SparkContext(sparkConf)
    //Set the log level to WARN
    sparkContext.setLogLevel("WARN")
    //Create the StreamingContext; Seconds(...) is the batch interval in seconds
    val streamingContext = new StreamingContext(sparkContext,Seconds(6))
    //Checkpoint directory (stores the topic offsets)
    streamingContext.checkpoint("./spark_kafka01")
    //Prepare the Kafka parameters
    val kafkaParams=Map("metadata.broker.list"->ks,"group.id"->"ztx01")
    //Prepare the topics
    val topics = Set("realtime")
    //Create a direct stream from Kafka
    val dstream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](streamingContext,kafkaParams,topics)
    //Extract the message values
    val line: DStream[String] = dstream.map(_._2)
    //Save the results to HDFS
    // line.saveAsTextFiles("/user/hdfs/source/log/kafkaSs/qwer")
    //Print the results
    line.print()
    //Start the streaming computation
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
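To actually save the stream to HDFS (requirement 3), the commented-out saveAsTextFiles call can be replaced by a foreachRDD that coalesces each batch before writing, which also helps with the small-file issue mentioned in the PS notes below. A minimal sketch to place before streamingContext.start(); the output path and coalesce(1) are illustrative assumptions, not values from a verified deployment:

    // Hedged sketch: write each non-empty batch to HDFS as a single file per batch.
    // The path and the coalesce(1) partition count are assumptions; tune them for your data volume.
    line.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.coalesce(1)
          .saveAsTextFile(s"/user/hdfs/source/log/kafkaSs/batch-${time.milliseconds}")
      }
    }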
Kafka is already installed on Server B, so start Kafka first, then:
1. Create the topic. Note that the command below uses --describe, which only shows an existing topic; if realtime does not exist yet, create it first (see the hedged example after this step).
./kafka-topics --describe --zookeeper 10.10.0.27:2181,10.10.0.8:2181,10.10.0.127:2181 --topic realtime
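A hedged create command in case the topic does not exist yet (the partition count and replication factor are illustrative; adjust them to your cluster):
./kafka-topics --create --zookeeper 10.10.0.27:2181,10.10.0.8:2181,10.10.0.127:2181 --replication-factor 1 --partitions 3 --topic realtime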
2. Start a console consumer to verify that messages arrive on the topic
./kafka-console-consumer --zookeeper 10.10.0.27:2181,10.10.0.8:2181,10.10.0.127:2181 --from-beginning --topic realtime
3. Start the Flume agent on Server B
bin/flume-ng agent -c conf -f conf/flume_kafka_sink.conf -name a1 -Dflume.root.logger=DEBUG,console
4. Then start the Flume agent on Server A:
bin/flume-ng agent -c conf -f conf/flume_kafka_source.conf -name a1 -Dflume.root.logger=DEBUG,console
(To run an agent in the background, prefix the command with nohup and append &.)
5. Submit the Spark job
./spark2-submit \
--class cn.test.spark.SparkStreamingKafkaDirect \
--master yarn \
--executor-memory 1g \
--total-executor-cores 2 \
/opt/cloudera/parcels/SPARK2/bin/sparkstreaming-kafka.jar
(Check that the Scala and Spark versions used in IDEA are compatible with the Spark version on the cluster; a mismatch will cause errors. Run spark-shell to see the cluster's Spark version.)
6. Create a .log file for testing
7. Create a shell script that keeps writing data into the .log file, as sketched below
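A minimal sketch of such a script, assuming the path tailed by the exec source in flume_kafka_source.conf (adjust the path and message format to your setup):
#!/bin/bash
# Append one test line per second to the file tailed by the Flume exec source
while true; do
  echo "test message $(date +%s)" >> /home/diamonds/ztx/test/kafka.log
  sleep 1
done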
PS:
- netstat -ntlp  # list all listening TCP ports
- Check what is being monitored (tailing the contents of a single file vs. watching for file changes in a directory).
- Double-check the name of the topic you create.
- Check the address Flume uses to connect to Kafka (whether it is mapped, and whether to use the public or the private network).
- When saving to HDFS, the YARN/streaming configuration needs tuning to reduce the number of small files generated (see the foreachRDD sketch above).