Quick Learning Big Data -- Spark Streaming Summary (25)

Copyright notice: this is the author's original article and may not be reproduced without permission. https://blog.csdn.net/xfg0218/article/details/82381690

Spark Streaming Summary

Official documentation

http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html

Overview

Spark Streaming is similar to Apache Storm and is used for processing streaming data. According to the official documentation, Spark Streaming features high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once the data has been ingested, it can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be stored in many places, such as HDFS, Redis, or an HBase database. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.

Spark Streaming overview diagram

 

As the diagram shows, Spark Streaming only performs the processing step in the middle of the pipeline.

 

 

 

As shown, the Spark ecosystem is made up of components that support one another and can be combined within a computation, which greatly increases development efficiency and capability.

 

What is a DStream

Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous data stream, either the input stream or the result stream produced by applying Spark primitives. Internally, a DStream is represented as a series of consecutive RDDs, each holding the data of one time interval, as shown below:

Operations on the data are carried out on a per-RDD basis.

The computation itself is performed by the Spark engine.

 

1-1)、DStream operations

The primitives on a DStream are similar to those on an RDD and fall into two categories: Transformations and Output Operations. In addition, there are some special transformation primitives, such as updateStateByKey(), transform(), and the various window-related primitives.

1-2)、Transformations on DStreams

reduce(func)

Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.

countByValue()

When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.

reduceByKey(func, [numTasks])

When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.

join(otherStream, [numTasks])

When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.

cogroup(otherStream, [numTasks])

When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.

transform(func)

Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.

updateStateByKey(func)

Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

 

 

 

1-3)、Special Transformations

 

  1. UpdateStateByKey Operation

The updateStateByKey primitive is used to keep history across batches; the stateful Word Count example relies on this feature. Without updateStateByKey, each batch is analyzed, its result is emitted, and nothing is retained afterwards (see the sketch below).
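
A minimal sketch of the simplest form of updateStateByKey, assuming a DStream[(String, Int)] named wordAndOne and a StreamingContext named ssc (the names and the checkpoint path are placeholders, not from the original post):

// The state must be checkpointed when updateStateByKey is used
ssc.checkpoint("hdfs://hadoop1:9000/ck")
// newValues holds this batch's values for a key; runningCount is the accumulated state
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))
val totalCounts = wordAndOne.updateStateByKey[Int](updateFunc)
totalCounts.print()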

 

  2. Transform Operation

The transform primitive allows an arbitrary RDD-to-RDD function to be applied to a DStream, which makes it easy to extend the Spark API. MLlib (machine learning) and GraphX are also combined with streaming through this function; a small example follows.
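
For example (a hedged sketch; the blacklist contents and the wordAndOne stream are made-up placeholders), transform can join every batch of a DStream with a static RDD:

// Static blacklist RDD built once on the driver
val blacklist = ssc.sparkContext.parallelize(List("spam", "ads")).map((_, true))
// Drop all (word, count) pairs whose word appears in the blacklist
val cleaned = wordAndOne.transform { rdd =>
  rdd.leftOuterJoin(blacklist)
     .filter { case (_, (_, flag)) => flag.isEmpty }
     .map { case (word, (count, _)) => (word, count) }
}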

 

  3. Window Operations

Window operations are somewhat similar to state in Storm: by setting a window length and a slide interval you can repeatedly compute over the most recent portion of the stream.

 

 

 

reduceByKeyAndWindow(_+_, _-_, Seconds(6), Seconds(10)) passes two functions and is a performance optimization: instead of recomputing the whole window from scratch, the previous window's result is reused by adding the data that enters the window and subtracting the data that leaves it, which makes the computation faster (see the sketch below).
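
A hedged sketch of this incremental form (wordAndOne and the checkpoint path are placeholders; a checkpoint directory is required because the inverse function keeps intermediate state, and the window and slide durations must both be multiples of the batch interval):

ssc.checkpoint("E://ck-window")
val windowedCounts = wordAndOne.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // add the counts of batches that enter the window
  (a: Int, b: Int) => a - b,   // subtract the counts of batches that leave the window
  Seconds(6),                  // window length
  Seconds(10))                 // slide interval
windowedCounts.print()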

Output Operations on DStreams

Output operations write a DStream's data to an external database or file system. Only when an output operation primitive is invoked (analogous to an RDD action) does the streaming program actually start computing.

 

Output Operation

Meaning

print()

Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.

saveAsTextFiles(prefix, [suffix])

Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

saveAsObjectFiles(prefix, [suffix])

Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

saveAsHadoopFiles(prefix, [suffix])

Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

foreachRDD(func)

The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
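
A common pattern with foreachRDD is to create the connection on the executors, once per partition, rather than once on the driver. A hedged sketch (dstream, createConnection, and send are placeholders, not APIs from this post):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // The connection is created on the executor, once per partition, because
    // connection objects usually cannot be serialized and shipped from the driver.
    val connection = createConnection()                                // placeholder
    partitionOfRecords.foreach(record => connection.send(record))     // placeholder
    connection.close()
  }
}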

 

 

 

Real-time WordCount with Spark Streaming

1-1)、Diagram

 

1-2)、Install nc

[root@hadoop1 ~]# yum install -y nc

Loaded plugins: fastestmirror, langpacks

Loading mirror speeds from cached hostfile

 * base: mirrors.neusoft.edu.cn

 * extras: mirrors.neusoft.edu.cn

 * updates: mirrors.nwsuaf.edu.cn

updates/7/x86_64/primary_db    FAILED                                            99% [======================================================================-]  70 kB/s | 7.0 MB  00:00:00 ETA

http://mirrors.nwsuaf.edu.cn/centos/7.2.1511/updates/x86_64/repodata/f444054b66ff65397e29b26ca982cb38039365a4dcb20acc5876a487ac88d867-primary.sqlite.bz2: [Errno -1] Metadata file does not match checksum

Trying other mirror.

^Cdates/7/x86_64/primary_db                                                      78% [========================================================               ]  76 kB/s | 5.6 MB  00:00:20 ETA

 

 

1-3)、Common options

[root@hadoop1 ~]# nc -help

Ncat 6.40 ( http://nmap.org/ncat )

Usage: ncat [options] [hostname] [port]

 

Options taking a time assume seconds. Append 'ms' for milliseconds,

's' for seconds, 'm' for minutes, or 'h' for hours (e.g. 500ms).

  -4                         Use IPv4 only

  -6                         Use IPv6 only

  -U, --unixsock             Use Unix domain sockets only

  -C, --crlf                 Use CRLF for EOL sequence

  -c, --sh-exec <command>    Executes the given command via /bin/sh

  -e, --exec <command>       Executes the given command

      --lua-exec <filename>  Executes the given Lua script

  -g hop1[,hop2,...]         Loose source routing hop points (8 max)

  -G <n>                     Loose source routing hop pointer (4, 8, 12, ...)

  -m, --max-conns <n>        Maximum <n> simultaneous connections

  -h, --help                 Display this help screen

  -d, --delay <time>         Wait between read/writes

  -o, --output <filename>    Dump session data to a file

  -x, --hex-dump <filename>  Dump session data as hex to a file

  -i, --idle-timeout <time>  Idle read/write timeout

  -p, --source-port port     Specify source port to use

  -s, --source addr          Specify source address to use (doesn't affect -l)

  -l, --listen               Bind and listen for incoming connections

  -k, --keep-open            Accept multiple connections in listen mode

  -n, --nodns                Do not resolve hostnames via DNS

  -t, --telnet               Answer Telnet negotiations

  -u, --udp                  Use UDP instead of default TCP

      --sctp                 Use SCTP instead of default TCP

  -v, --verbose              Set verbosity level (can be used several times)

  -w, --wait <time>          Connect timeout

      --append-output        Append rather than clobber specified output files

      --send-only            Only send data, ignoring received; quit on EOF

      --recv-only            Only receive data, never send anything

      --allow                Allow only given hosts to connect to Ncat

      --allowfile            A file of hosts allowed to connect to Ncat

      --deny                 Deny given hosts from connecting to Ncat

      --denyfile             A file of hosts denied from connecting to Ncat

      --broker               Enable Ncat's connection brokering mode

      --chat                 Start a simple Ncat chat server

      --proxy <addr[:port]>  Specify address of host to proxy through

      --proxy-type <type>    Specify proxy type ("http" or "socks4")

      --proxy-auth <auth>    Authenticate with HTTP or SOCKS proxy server

      --ssl                  Connect or listen with SSL

      --ssl-cert             Specify SSL certificate file (PEM) for listening

      --ssl-key              Specify SSL private key (PEM) for listening

      --ssl-verify           Verify trust and domain name of certificates

      --ssl-trustfile        PEM file containing trusted SSL certificates

      --version              Display Ncat's version information and exit

 

See the ncat(1) manpage for full options, descriptions and usage examples

 

 

1-4)、Start nc

[root@hadoop1 ~]# nc -lk 8888

dfhf

fbfr

ere

Gfr

 

1-5)、Implementation


import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Administrator on 2016/11/6.
  */
object SparkstringTest {
  def main(args: Array[String]) {
    // Load the winutils path so Hadoop works on Windows
    System.setProperty("hadoop.home.dir",
      "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
    // Initialize the Spark configuration
    val conf = new SparkConf().setAppName("SparkstringTest").setMaster("local[2]")
    // Create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(conf, Seconds(5))
    // Connect to the socket source
    val textStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop2", 8888)
    // Split each line into words
    val map: DStream[String] = textStream.flatMap(_.split(" "))
    // Map each word to a (word, 1) pair
    val map1: DStream[(String, Int)] = map.map((_, 1))
    // Sum the counts for each word within the batch
    val key: DStream[(String, Int)] = map1.reduceByKey(_ + _)
    // Print the processed data
    key.print()
    // Start the computation
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
    ssc.awaitTermination()
  }
}

 

 

key.print() shows the first 10 elements of each batch by default.

 

1-6)、Check the results

*************

-------------------------------------------

Time: 1478415525000 ms

-------------------------------------------

(dff,1)

(a,2)

(dfed,1)

 

************

 

 

reduceByKey only aggregates the data of the current batch; it does not accumulate counts across batches.

 

Reading data from a TCP port and accumulating the counts

Prepare the JARs

Required dependencies

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.1</version>
</dependency>

 

Diagram

 

 

UpdateStateByKey implementation

1-1)、Implementation

package streams

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Admin on 2016/8/26.
  */
object StateFulStreamingWordCount {

  val updateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    //it.map(t => (t._1, t._2.sum + t._3.getOrElse(0)))
    it.map { case (x, y, z) => (x, y.sum + z.getOrElse(0)) }
  }

  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir",
      "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4")
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StateFulStreamingWordCount").setMaster("local[2]")
    // Create the StreamingContext and set the batch interval
    val ssc = new StreamingContext(conf, Seconds(5))
    // Set the checkpoint directory
    ssc.checkpoint("E://ck0826")
    // Create a DStream from the socket source
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop2", 8888)
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val result: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    // Print; by default the first 10 elements are shown
    result.print()
    // Start the computation
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
  }
}

 

1-2)、Send data

[root@hadoop2 sbin]# nc -lk 8888

a b c d e f

a b  c d e f

djf ffgrg rghr rigrg righrg

a b c d e f

d d d d d d d

 

1-3)、Check the results

*********************

 

-------------------------------------------

Time: 1478421805000 ms

-------------------------------------------

(d,10)

(ffgrg,1)

(b,3)

(,1)

(f,3)

(djf,1)

(e,3)

(rghr,1)

(rigrg,1)

(a,3)

 

1-4)、Set the log level

package streams

import org.apache.log4j.{Logger, Level}
import org.apache.spark.Logging

object LoggerLevels extends Logging {

  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}

 

ReduceByKeyAndWindow implementation

1-1)、Implementation

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object WindowReduce {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4")
    val conf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[4]")
    val sc = new SparkContext(conf)
    // Create the StreamingContext with a 2-second batch interval
    val ssc = new StreamingContext(sc, Seconds(2))
    // Socket data source
    val lines = ssc.socketTextStream("skycloud1", 8888, StorageLevel.MEMORY_ONLY_SER)
    val words = lines.flatMap(_.split(" "))
    // Window operation: count the words within each window
    val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(6), Seconds(10))
    // Print the data
    wordCounts.print()
    // Start the computation
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
  }
}

 

 

Seconds(2): the batch interval is 2 seconds.

Seconds(6): the window length is 6 seconds, i.e. the most recent 6 seconds of data.

Seconds(10): the slide interval, i.e. how often the window computation runs. Both the window length and the slide interval must be multiples of the batch interval.

1-2)、Check the results

-------------------------------------------

Time: 1487556534000 ms

-------------------------------------------


-------------------------------------------

Time: 1487556554000 ms

-------------------------------------------

(edef,2)

(de,1)

(wedfe,1)

(wefef,1)

(ewfef,1)

 

**********************************

 

 

For details see: http://blog.csdn.net/xfg0218/article/details/56008383

 

Spark with Flume

1-1)、Upload the JARs to Flume's lib directory

JAR download link (contact the author if it cannot be downloaded):

Link: http://pan.baidu.com/s/1kVz3bvT  password: btnf

 

commons-lang3-3.3.2.jar  scala-library-2.10.5.jar    spark-streaming-flume-sink_2.10-1.6.1.jar

 

1-2)、Edit the Flume configuration file

[root@hadoop1 configurationFile]# vi flume-poll.conf

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

 

# source

a1.sources.r1.type = spooldir

a1.sources.r1.spoolDir = /usr/local/flume/testDate

a1.sources.r1.fileHeader = true

 

# Describe the sink

a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink

a1.sinks.k1.hostname = hadoop1

a1.sinks.k1.port = 8888

 

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

 

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

 

1-3)、Start Flume

[root@hadoop1 configurationFile]# flume-ng agent -n a1 -c conf -f flume-poll.conf  -Dflume.root.logger=WARN,console

***************

16/11/06 01:48:46 INFO sink.SparkSink: Starting Avro server for sink: k1

16/11/06 01:48:46 INFO sink.SparkSink: Blocking Sink Runner, sink will continue to run..

 

1-4)、Flume dependency for Spark

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>

1-5)、Implementation

package streams

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumeStreamingWordCount {

  def main(args: Array[String]) {
    // Load the winutils path so Hadoop works on Windows
    System.setProperty("hadoop.home.dir",
      "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
    val conf = new SparkConf().setAppName("FlumeStreamingWordCount").setMaster("local[2]")
    LoggerLevels.setStreamingLogLevels()
    // Create the StreamingContext and set the batch interval
    val ssc = new StreamingContext(conf, Seconds(5))
    // Create a polling DStream from the Flume sink
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, Array(new InetSocketAddress("hadoop1", 8888)), StorageLevel.MEMORY_AND_DISK)
    // Extract the event bodies from the Flume events and split them into words
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    // Print
    result.print()
    // Start the computation
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
  }
}

 

 

1-6)、Test data

[root@hadoop1 conf]# cp flume-conf.properties.template  /usr/local/flume/testDate/

 

1-7)、Check the results

**********************

-------------------------------------------

Time: 1478427325000 ms

-------------------------------------------

(Unless,1)

(config,1)

(this,5)

(KIND,,1)

(case,,1)

(is,3)

(under,4)

(follows.,1)

(memoryChannel,3)

(sinks,1)

Spark with Kafka

1-1)、Start Kafka

[root@hadoop1 start_sh]# cat kafka_start.sh

cat /usr/local/start_sh/slave |while read line

do

{

echo $line

ssh $line "source /etc/profile;nohup kafka-server-start.sh  /usr/local/kafka/config/server.properties  > /dev/null 2>&1&"

}&

wait

done

1-2)、Create a topic

[root@hadoop1 start_sh]# kafka-topics.sh --create --zookeeper hadoop1:2181 --replication-factor 2  --partitions 3 --topic lines

Created topic "lines".

1-3)、List all topics

[root@hadoop1 start_sh]# kafka-topics.sh --list --zookeeper hadoop1:2181

lines

1-4)、Describe the topic

[root@hadoop1 start_sh]# kafka-topics.sh --describe --zookeeper hadoop1:2181 --topic lines

Topic:lines PartitionCount:3 ReplicationFactor:2 Configs:

Topic: lines Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0

Topic: lines Partition: 1 Leader: 2 Replicas: 2,1 Isr: 2,1

Topic: lines Partition: 2 Leader: 0 Replicas: 0,2 Isr: 0,2

1-5)、Start a producer and send messages

[root@hadoop1 start_sh]# kafka-console-producer.sh --broker-list  hadoop1:9092 --topic lines

aaaaaaaaaa

bbbbbbbbbbbbbbbbbbbbbbbbbbb

ccccccccccccccccccccccccccccccc

1-6)、Start a consumer and consume the data

[root@hadoop1 start_sh]# kafka-console-consumer.sh --zookeeper hadoop1:2181 --from-beginning --topic lines

aaaaaaaaaa

bbbbbbbbbbbbbbbbbbbbbbbbbbb

ccccccccccccccccccccccccccccccc

1-7)、Code

package streams

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by root on 2016/5/21.
  */
object KafkaWordCount {

  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    //iter.flatMap(it=>Some(it._2.sum + it._3.getOrElse(0)).map(x=>(it._1,x)))
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(i => (x, i)) }
  }

  def main(args: Array[String]) {
    LoggerLevels.setStreamingLogLevels()
    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Set the checkpoint directory
    ssc.checkpoint("c://ck200")
    //"alog-2016-04-16,alog-2016-04-17,alog-2016-04-18"
    //"Array((alog-2016-04-16, 2), (alog-2016-04-17, 2), (alog-2016-04-18, 2))"
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK_SER)
    val words = data.map(_._2).flatMap(_.split(" "))
    val wordCounts = words.map((_, 1)).updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

 

ssc.checkpoint() sets the checkpoint directory on the StreamingContext; the corresponding method on the underlying SparkContext is sc.setCheckpointDir(). Checkpointing exists mainly so that long-running computations do not accumulate an ever-growing lineage and state.
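
As a small sketch (the directory paths are placeholders, not from the original post), the two calls look like this:

// Checkpoint directory for the streaming state (required by updateStateByKey and
// the inverse-function form of reduceByKeyAndWindow)
ssc.checkpoint("hdfs://hadoop1:9000/ck-streaming")
// Checkpoint directory for plain RDD checkpointing on the underlying SparkContext
ssc.sparkContext.setCheckpointDir("hdfs://hadoop1:9000/ck-rdd")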

1-8)、Program arguments

hadoop1:2181,hadoop2:2181,hadoop3:2181 g1 lines 1

 

1-9)、Test data

[root@hadoop1 start_sh]# kafka-console-producer.sh --broker-list  hadoop1:9092 --topic lines

aaaa

bbb

ccc

dddd

eeee

fff

gggg

 

1-10)、Check the results

**************************

-------------------------------------------

Time: 1478444825000 ms

-------------------------------------------

(aaaa,1)

(bbb,1)

(dddd,1)

(ddddddddddddd,1)

(eeee,1)

(fff,1)

(f,1)

(gggg,1)

(fffffffffffffff,1)

(ccc,1)

1-11)、Submit to the cluster and check the results

A)、Run the program

[root@hadoop1 sparkJar]# spark-submit  --class streams.KafkaWordCount --master spark://hadoop1:7077,hadoop2:7077 --executor-memory 1g --total-executor-cores 2 /usr/local/spark/sparkJar/sparkKafka.jar  hadoop1:2181,hadoop2:2181,hadoop3:2181  sparkKafka lines 2

 

Spark with Redis

1-1)、Produce data into Kafka


import java.util.Properties

import kafka.javaapi.producer.Producer
import kafka.producer.{KeyedMessage, ProducerConfig}
import org.codehaus.jettison.json.JSONObject
import scala.util.Random

object KafkaEventProducer {

  private val users = Array("4A4D769EB9679C054DE81B973ED5D768",
    "8dfeb5aaafc027d89349ac9a20b3930f",
    "011BBF43B89BFBF266C865DF0397AA71",
    "f2a8474bf7bd94f0aabbd4cdd2c06dcf",
    "068b746ed4620d25e26055a9f804385f",
    "97edfc08311c70143401745a03a50706",
    "d7f141563005d1b5d0d3dd30138f3f62",
    "c8ee90aade1671a21336c721512b817a",
    "6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")

  private val random = new Random()
  private var pointer = -1

  def getUserID: String = {
    pointer = pointer + 1
    if (pointer >= users.length) {
      pointer = 0
      users(pointer)
    } else {
      users(pointer)
    }
  }
  def click(): Double = {
    random.nextInt(10)
  }
  def main(args: Array[String]) {
    val topic = "user_event"
    // Multiple broker addresses can be listed here, separated by commas
    val brokers = "hadoop1:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val kafka = new ProducerConfig(props)
    val producer = new Producer[String, String](kafka)
    while (true) {
      val event = new JSONObject()
      event.put("uid", getUserID)
      event.put("event_time", System.currentTimeMillis.toString)
      event.put("os_type", "Android")
      event.put("click_count", click)
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message   sent:" + event)
      Thread.sleep(2000)
    }
  }
}

 

1-2)、Connect to Redis

package test

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.codehaus.jettison.json.JSONObject


object UserClickCountAnalytics {
  def main(args: Array[String]) {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }
    val conf = new SparkConf().setMaster(masterUrl).setAppName(this.getClass.getName)
    val ssc = new StreamingContext(conf, Seconds(5))

    // The topic must match the one used by the producer above
    val topic = Set("user_event")
    val brokers = "hadoop1:9092"
    val kafkaParames = Map[String, String]("metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")
    // Redis database index and the hash key that stores per-user click counts
    val dbindex = 1
    val clickHashKey = "app:users::click"
    // Direct stream: the key/value decoders must be given explicitly
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParames, topic)
    // Parse each Kafka message value into a JSON object
    val events = kafkaStream.flatMap(line => {
      Some(new JSONObject(line._2))
    })

    // Extract (uid, click_count) pairs and sum the clicks per user within the batch
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val uid = pair._1
          val clickCount = pair._2
          // Get a connection from the pool, increment the count in Redis, return the connection
          val jedis = RedisClient.pool.getResource
          jedis.select(dbindex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          RedisClient.pool.returnResource(jedis)
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}

1-3)、Redis connection pool

package test

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool

object RedisClient extends Serializable {
  val redisHost = "hadoop1"
  val redisPort = "6379".toInt
  val redisTimeout = 300000
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
  lazy val hook = new Thread {
    override def run = {
      println("hook thread:" + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}

 

Some indicates that a value is present, which makes the later operations easier; otherwise errors around None/Nothing would have to be dealt with.

Several ways for Spark Streaming to obtain its data source

1-1)、Creating the context directly against the cluster

val ssc = new StreamingContext("spark://hadoop1:7077", "WordCount", Seconds(1), [sparkHome], [jars])

 

The first argument specifies the master URL of the cluster, and the third argument sets the size of Spark's batch window; here it means that a Spark job processes the data once every second.

 

1-2)、Processing data received from a port

val lines = ssc.socketTextStream("localhost", 9999)

This receives data over the network and then processes it.
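
As a minimal sketch combining the two forms above (the master URL, host, and port are placeholders), the StreamingContext can also be built directly from a master URL and an application name, and the socket DStream is then processed as usual:

import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Constructor form: master URL, application name, batch interval (sparkHome/jars are optional)
    val ssc = new StreamingContext("spark://hadoop1:7077", "WordCount", Seconds(1))
    // Receive data over the network and process it
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}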

Summary of Spark big data processing techniques

概述

The following notes are a summary of the book《spark 大数据技术》(Spark Big Data Technology); the content is simple and easy to follow. If you run into any problems you can contact the author, or visit: http://blog.csdn.net/xfg0218/article/details/55272083

Materials: link: http://pan.baidu.com/s/1pKRAXjt  password: ohpj

Chapter 1

1-1)、The expressive power of RDDs

A)、Iterative computation

B)、Relational queries

C)、MapReduce batch processing

D)、Stream computation

1-2)、Spark subsystems

1-3)、The Spark ecosystem

A)、Spark Core

B)、Spark SQL

C)、Spark Streaming

D)、GraphX

E)、MLlib

1-4)、Characteristics of the Spark ecosystem

Chapter 2

1-1)、Spark RDDs and the programming interface

A)、Concepts in Spark programming

B)、Initializing the context

C)、Spark RDD

1-1)、RDD partitions

1-2)、Preferred locations of an RDD

1-3)、RDD dependencies

1-4)、partitions

1-5)、preferredLocations

1-6)、dependencies

1-7)、compute

1-8)、partitioner

D)、Creation operations

1-1)、Creating an RDD from a collection

1-2)、Basic RDD transformations

A)、Storage operations

B)、Repartitioning an RDD

C)、Set operations

D)、The zip family

E)、Key-value RDD transformations

F)、combineByKey

G)、Actions

For details see the following:

 

Chapter 3

1-1)、Spark execution modes and principles

A)、Standalone mode

B)、YARN mode

Details below

 

Chapter 4

1-1)、Spark scheduling and management principles

A)、The concept of Spark scheduling

B)、Logical concepts of the job scheduling module

1-1)、SparkContext

A)、DAGScheduler

B)、The logic is based on the Akka Actor mechanism

Details below

 

Chapter 5

1-1)、Spark storage management

A)、Architecture of storage management

1-1)、Communication layer

1-2)、Storage layer

1-3)、Persisting shuffle data

1-4)、How shuffle blocks are stored

1-5)、The two ways of reading and transferring shuffle data

B)、Persistence options supported by Spark

1-1)、StorageLevel

Details below

 

 

Chapter 6

1-1)、The Stage page

A)、Running stages (action stages)

B)、The two stage scheduling modes

1-2)、The Storage page

Details below

 

Chapter 7

1-1)、Spark architecture, installation, and deployment

A)、How to handle OutOfMemory exceptions

B)、Low data-processing throughput

C)、Finding out why Shark is slower than Hive

Details below

 

Chapter 8

1-1)、User-defined functions

1-2)、CLI commands related to user-defined function extensions

1-3)、Key points about UDFs

Details below

 

 

Chapter 9

1-1)、Spark SQL

A)、The four steps of the SQL engine

B)、Initialization

C)、Type conversion

D)、Common methods

Details below

 

Chapter 10

1-1)、Spark Streaming

A)、Input sources

B)、actorStream

C)、Transformations

D)、Window-based transformations

E)、Output operations

1-2)、Performance tuning

A)、Optimizing running time

B)、Optimizing memory usage

Details below

 

 

 
