Spark Streaming Programming in Practice (Development Examples)

This section describes how to write Spark Streaming applications, progressing from simple to more difficult examples that show how several core concepts are used to solve practical problems.

Data Stream Simulator

To demonstrate realistic scenarios, the examples need access to a steady stream of data. To make the demonstration environment more realistic, we therefore first define a data stream simulator. The simulator's main job is to listen on a specified Socket port; when an external program connects through that port and requests data, the simulator periodically fetches random lines from a specified file and sends them to the external program.

The code of the data stream simulator is as follows.

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

object StreamingSimulation {

  // Define a method that returns a random integer in [0, length)
  def index(length: Int) = {
    import java.util.Random
    val rdm = new Random
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    // The simulator takes three parameters: the file path, the port number,
    // and the send interval (in milliseconds)
    if (args.length != 3) {
      System.err.println("Usage: <filename> <port> <millisecond>")
      System.exit(1)
    }

    // Read the specified file and get its total number of lines
    val filename = args(0)
    val lines = Source.fromFile(filename).getLines.toList
    val filerow = lines.length

    // Listen on the specified port and accept connection requests from external programs
    val listener = new ServerSocket(args(1).toInt)
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream(), true)
          while (true) {
            Thread.sleep(args(2).toLong)
            // Once a client is connected, fetch a random line from the file and send it
            val content = lines(index(filerow))
            println(content)
            out.write(content + "\n")
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
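
Before wiring the simulator into Spark Streaming, it can be sanity-checked with a small standalone client. The following is a minimal sketch (the class name and the localhost:9999 address are assumptions for illustration) that connects to a running simulator and prints the first five lines received.

import java.net.Socket
import scala.io.Source

object SimulatorTestClient {
  def main(args: Array[String]) {
    // Connect to the simulator (assumed to be listening on localhost:9999)
    val socket = new Socket("localhost", 9999)
    val in = Source.fromInputStream(socket.getInputStream).getLines
    // Print the first five lines received, one per simulator interval
    in.take(5).foreach(println)
    socket.close()
  }
}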

In the IDEA packaging configuration interface:

  • First, add the Scala Jar packages to the ClassPath (/app/scala-2.10.4/lib/scala-swing.jar, /app/scala-2.10.4/lib/scala-library.jar, /app/scala-2.10.4/lib/scala-actors.jar).
  • Then click "Build" → "Build Artifacts" and select the "Build" or "Rebuild" action.
  • Finally, copy the packaged file to the Spark root directory with the following commands.

cd /home/hadoop/IdeaProjects/out/artifacts/LearnSpark_jar
cp LearnSpark.jar /app/hadoop/spark-1.1.0/

Example 1: File Reading Demo

In this example, Spark Streaming monitors the files in a directory, picks up the data that changes within each interval, and then computes the word count for that time period.

The code is as follows.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object FileWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("File Word Count").setMaster("local[2]")

    // Create the StreamingContext from the Spark configuration;
    // the batch interval here is 20 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(20))
    // Specify the monitored directory, here /home/hadoop/temp/
    val lines = ssc.textFileStream("/home/hadoop/temp/")
    // Count the words in files newly added to the directory and print the result
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    // Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}

Running this code involves three steps.

1) Create the directory monitored by Streaming.

Create the /home/hadoop/temp directory for Spark Streaming to monitor, then periodically add files to it; Spark Streaming will count the words in the newly added files, as shown below.
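
For instance, a file can be written elsewhere and then moved into the monitored directory. The file name and contents below are made up for illustration; moving the file in atomically is more reliable than writing it in place, because textFileStream only picks up files that newly appear in the directory.

$echo "spark streaming spark" > /tmp/words.txt
$mv /tmp/words.txt /home/hadoop/temp/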

2) Use the following command to start the Spark cluster.

$cd /app/hadoop/spark-1.1.0
$sbin/start-all.sh

3) Run the Streaming program in IDEA.

When running this example in IDEA, no input parameters need to be configured. As the program runs, a timestamp is printed in the log for each interval. If a file is added to the monitored directory, the output under the corresponding timestamp shows the word counts of the file newly added in that time period.
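
The batch output printed by wordCounts.print() has the same shape as in the later examples. For the file added above, an illustrative batch (not captured from an actual run) would look like this:

—————————
Time: ... ms
—————————
(spark,2)
(streaming,1)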

Example 2: Network Data Demo

In this example, the data stream simulator sends simulated data at a frequency of one line per second. Spark Streaming reads the data from the Socket stream every 20 seconds and processes what it has received, printing the word frequencies for that period only; that is, there is no state shared between processing periods.

The code is as follows.

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel

object NetworkWordCount {
  def main(args: Array[String]) {
    // "local[2]" provides two threads: one for the Socket receiver, one for processing
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(20))

    // Receive data through a Socket; the host name and port number are required,
    // and the received data is stored in memory and on disk
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)

    // Split the received data and count the words
    val words = lines.flatMap(_.split(","))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Running this code involves four steps.

1) Start the data stream simulator.

Start the data stream simulator with Socket port number 9999 and a send frequency of one line per second. In this example the simulator sends the data file /home/hadoop/upload/class7/people.txt, whose content is as follows.

1 Michael
2 Andy
3 Justin
4

The command to start the data stream simulator is as follows.

$cd /app/hadoop/spark-1.1.0
$java -cp LearnSpark.jar class7.StreamingSimulation
/home/hadoop/upload/class7/people.txt 9999 1000

Until a client connects, the simulator blocks and waits.

2) Run the Streaming program in IDEA.

When running this example in IDEA, the host name and port number of the Socket connection need to be configured as run parameters; here the host name is hadoop1 and the port number is 9999. An alternative launch method is sketched below.
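
Alternatively, the packaged jar can be launched from the command line with spark-submit. The sketch below assumes the class lives in the class7 package, like the simulator; adjust the class name to match your project.

$cd /app/hadoop/spark-1.1.0
$bin/spark-submit --master local[2] --class class7.NetworkWordCount LearnSpark.jar hadoop1 9999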

3) Observe the data sent by the simulator.

Once the Spark Streaming program in IDEA has established a connection with the simulator, the simulator detects the external connection and starts sending data, fetching a random line from the specified file every second. Figure 1 is a screenshot of the simulator sending data.

Figure 1  Screenshot of the simulator sending data

4) Observe the statistical results.

The statistics can be observed in the IDEA run window. The number of words Spark Streaming counts in each 20-second period is exactly the number of words sent during those 20 seconds: at one line per second each batch contains 20 words, and in the output below 2 + 9 + 9 = 20.

—————————
Time: 1436919540000 ms
—————————
(Andy,2)
(Michael,9)
(Justin,9)

Example 3: Stateful Demo

This example demonstrates Spark Streaming's stateful operations. The data stream simulator sends simulated data at a frequency of one line per second; Spark Streaming reads the data from the Socket stream every 5 seconds and processes what it has received, printing the frequency of each word accumulated since processing started.

That is, the output of each period includes not only the statistics for the data received during that period, but also those for all previous periods. In contrast with Example 2, the time periods in this example are related to one another through state.

The code is as follows.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StatefulWordCount {
  def main(args: Array[String]) {
    if (args.length != 2) {
      System.err.println("Usage: StatefulWordCount <hostname> <port>")
      System.exit(1)
    }
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Define the state update function: values holds the word counts of the
    // current batch, state holds the accumulated count from previous periods
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Create the StreamingContext; Spark Streaming runs at an interval of 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    // Use the current directory as the checkpoint directory (required by updateStateByKey)
    ssc.checkpoint(".")
    // Receive the data sent over the Socket
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(","))
    val wordCounts = words.map(x => (x, 1))

    // Use updateStateByKey to update the state, keeping a running total per word
    val stateDstream = wordCounts.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
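
The update function can also be checked in isolation, for example in the Scala REPL; the argument values below are made up for illustration.

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}

// A key with a previous total of 3 that occurs twice in the current batch
updateFunc(Seq(1, 1), Some(3))  // Some(5)
// A key seen for the first time has no previous state
updateFunc(Seq(1), None)        // Some(1)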

Start the application in IDEA and then start the data stream simulator, following the same procedure as in Example 2.

Looking at the IDEA run window, you can observe that the total word count is 0 the first time, 5 the second time, and 5(N-1) the Nth time (at one line per second and a 5-second interval, each batch adds 5 words); that is, the count is the running total of all the words received while the program has been running.

———————-
Time: 1436919611000 ms
———————-

———————-
Time: 1436919615000 ms
———————-
(Andy,2)
(Michael,1)
(Justin,2)

Example 4: Window Demo

This example demonstrates Spark Streaming's window operations. The data stream simulator sends simulated data at a frequency of one line per second; Spark Streaming reads the data from the Socket stream every 10 seconds and processes what it has received, printing the frequency of each word within the current window.

Compared with the previous examples, the statistics here are computed with the reduceByKeyAndWindow() method, whose parameters specify the window length and the sliding time interval (both must be multiples of the batch interval).

The code is as follows.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object WindowWordCount {
  def main(args: Array[String]) {
    if (args.length != 4) {
      System.err.println("Usage: WindowWordCount <hostname> <port> <windowDuration> <slideDuration>")
      System.exit(1)
    }
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("WindowWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Create the StreamingContext with a batch interval of 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    // Use the current directory as the checkpoint directory
    ssc.checkpoint(".")
    // Receive data through a Socket; the host name and port number are required,
    // and the received data is stored in memory in serialized form
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)
    val words = lines.flatMap(_.split(","))
    // Window operation: the first variant recomputes the whole window each time;
    // the second variant (used here) works incrementally, adding the values
    // entering the window and subtracting those leaving it
    // val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(args(2).toInt), Seconds(args(3).toInt))
    val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(args(2).toInt), Seconds(args(3).toInt))
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
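
The bookkeeping done by the incremental variant can be illustrated with plain numbers (the per-batch counts below are made up): when the window slides, the new total equals the old total plus the counts entering the window minus the counts leaving it.

// Per-batch counts for one word; window length = 3 batches, slide = 1 batch
val batchCounts = Seq(2, 5, 1, 4)
val oldWindow = batchCounts.slice(0, 3).sum                 // 2 + 5 + 1 = 8
val newWindow = oldWindow + batchCounts(3) - batchCounts(0) // 8 + 4 - 2 = 10
// Identical to recomputing the new window from scratch
assert(newWindow == batchCounts.slice(1, 4).sum)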

Start the application in IDEA and then start the data stream simulator, following the same procedure as in Example 2.

In the IDEA run window you can observe that the first count is 4, the second 14, and the Nth 10(N-1) + 4 (at one line per second and a 10-second slide interval, each period adds 10 words); with the window settings used here, the count amounts to the total number of words received since the program started.

———————-
Time: 1436919674000 ms
———————-
(Andy,1)
(Michael,2)
(Justin,1)

———————-
Time: 1436919675000 ms
———————-
(Andy,4)
(Michael,5)
(Justin,5)
