1. Introduction to Spark Streaming
1) Definition
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
2. Installing the nc service and running Spark Streaming
1) Install the nc command online
- rpm -ivh nc-1.84-22.el6.x86_64.rpm (preferred)
#Install
Upload the nc-1.84-22.el6.x86_64.rpm package to the software directory, and then install
[kfk@bigdata-pro02 softwares]$ sudo rpm -ivh nc-1.84-22.el6.x86_64.rpm
Preparing... ########################################### [100%]
1:nc ########################################### [100%]
[kfk@bigdata-pro02 softwares]$ which nc
/usr/bin/nc
#start up
nc -lk 9999 (this works like a receiver, listening on the port)
After starting it, you can type data below it, and the Spark side then computes word-frequency statistics (as shown in 2))
- yum install -y nc
2) Run Spark Streaming's WordCount example
bin/run-example --master local[2] streaming.NetworkWordCount localhost 9999
#data input
#Result statistics
Note: the effect above is only visible after lowering the log level to WARN; otherwise the output is buried in INFO logs and hard to observe.
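A quick way to do this in spark-shell is to lower the log level on the SparkContext (a sketch; for a standalone program you could instead adjust conf/log4j.properties):

//sc is the SparkContext that spark-shell provides
sc.setLogLevel("WARN")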
3) Pipe a file into nc as its input, then observe Spark Streaming's computed results
cat test.txt | nc -lk 9999
The file's contents:
hadoop storm spark
hbase spark flume
spark dajiangtai spark
hdfs mapreduce spark
hive hdfs solr
spark flink storm
hbase storm es
3. Working principle of Spark Streaming
1) Spark Streaming data flow processing
Spark Streaming splits the live input stream into small batches at a fixed interval; the Spark engine processes each batch as an RDD and emits the results batch by batch.
2) The working principle of the receiver
A receiver runs inside an executor and continuously collects incoming records, storing them in memory as blocks (replicated for fault tolerance); at each batch interval, the blocks received so far become the RDD for that batch.
3) Comprehensive working principle
4. Spark Streaming programming model
1) Two ways of StreamingContext initialization
#Method 1: reuse an existing SparkContext (e.g. the sc provided by spark-shell)
val ssc = new StreamingContext(sc, Seconds(5))
#Method 2: create from a new SparkConf
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
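For reference, the second approach as a self-contained sketch with the imports it needs:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

//local[2]: one thread for the receiver, one for processing; 1-second batches
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))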
2) Cluster test
#Start spark
bin/spark-shell --master local[2]
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
// Exiting paste mode, now interpreting.
#Enter data on the nc server side
spark
hive hbase
hadoop hive
hbase hbase
spark hadoop
hive hbase
spark Hadoop
#Result statistics
5. Spark Streaming reads Socket stream data
1) To run a Streaming program in spark-shell, either the number of local threads must be greater than 1 or the shell must run against a cluster; with a single thread the receiver occupies it and no threads are left to process the batches.
bin/spark-shell --master local[2]
bin/spark-shell --master spark://bigdata-pro01.kfk.com:7077
2) Spark running modes
3) Spark Streaming reads Socket stream data
a) Write test code and run it locally
TestStreaming.scala
package com.zimo.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @author Zimo
  * @date 2019/4/29
  */
object TestStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("streaming")
      .getOrCreate()

    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
    //Listen on a network port -- arg 1: hostname, arg 2: port (an optional third arg sets the storage level); this creates the lines stream
    val lines = ssc.socketTextStream("bigdata-pro02.kfk.com", 9999)
    val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    words.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
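To run this from IDEA you also need the Spark Streaming dependency in pom.xml. A sketch (the version below is an assumption; match it to your cluster's Spark and Scala versions):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <!-- assumed version; align with your cluster -->
    <version>2.2.0</version>
</dependency>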
b) Start the nc service to send data
nc -lk 9999
spark hadoop
spark hadoop
hive hbase
spark hadoop
6. Spark Streaming saves data to external systems
1) Save to a MySQL database
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("bigdata-pro02.kfk.com", 9999)
val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

//Open one JDBC connection per partition instead of per record
words.foreachRDD(rdd => rdd.foreachPartition(partition => {
  Class.forName("com.mysql.jdbc.Driver")
  val conn = DriverManager
    .getConnection("jdbc:mysql://bigdata-pro01.kfk.com:3306/test", "root", "root")
  try {
    for (row <- partition) {
      val sql = "insert into webCount(titleName,count) values('" + row._1 + "'," + row._2 + ")"
      conn.prepareStatement(sql).executeUpdate()
    }
  } finally {
    conn.close()
  }
}))
ssc.start()
ssc.awaitTermination()
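Before running, make sure the webCount table exists in the test database (the same schema as the create statement used later in this post):

create table webCount( titleName varchar(255) CHARACTER SET utf8 DEFAULT NULL, count int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8;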
Then input data on the nc server side, and the statistical results will be written into the webCount table in the database.
mysql> select * from webCount;
+-----------+-------+
| titleName | count |
+-----------+-------+
| hive | 4 |
| spark | 4 |
| hadoop | 4 |
| hbase | 5 |
+-----------+-------+
4 rows in set (0.00 sec)
2) Save to HDFS
This method is simpler than writing to the database; if you are interested, refer to the code below to test it yourself.
Special note: each time it is executed, the contents of the HDFS output are reset and overwritten!
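A minimal sketch based on the word-count DStream (words) from the MySQL example above (the HDFS path is a placeholder for your own):

//Spark Streaming writes one directory per batch, named prefix-<timestamp>.suffix
words.saveAsTextFiles("hdfs://bigdata-pro01.kfk.com:9000/user/kfk/streaming/wordCount", "txt")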
7. Structured Streaming programming model
1) Complete output mode
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._
val lines = spark.readStream
.format("socket")
.option("host", "bigdata-pro02.kfk.com")
.option("port", 9999)
.load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
2) Update output mode
In complete mode, if you keep typing on the nc server side, every trigger counts the new input together with all historical input. If the outputMode is changed to "update", the statistics are still maintained over the historical input, but only the rows whose counts were updated by the latest input are displayed.
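Only the sink definition changes; the rest stays as in the complete-mode example:

val query = wordCounts.writeStream.outputMode("update").format("console").start()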
3) Append output mode
If you change the outputMode to "append", the code also needs a slight modification, because append mode does not allow the streaming aggregation used above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._
val lines = spark.readStream
.format("socket")
.option("host", "bigdata-pro02.kfk.com")
.option("port", 9999)
.load()
val words = lines.as[String].flatMap(_.split(" ")).map(x => (x, 1))
val query = words.writeStream.outputMode("append").format("console").start()
As you can see, this mode simply appends each new input to the output.
8. Real-time data processing business analysis
9. Spark Streaming and Kafka integration
1) Preparation
According to the official website's requirements, our previous Kafka version is too low; we need to download at least version 0.10.0.
Download address http://kafka.apache.org/downloads
Modifying the configuration is straightforward: copy in and replace the /config folder we configured before, create new kafka-logs and logs folders as in the original setup, and then correct the paths in the configuration files.
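Concretely, the entries to check in config/server.properties are the ones carrying paths and addresses (the values below are placeholders for your own layout):

# config/server.properties -- adjust to the new installation's directories
log.dirs=/opt/modules/kafka_2.11-0.10.0.0/kafka-logs
zookeeper.connect=bigdata-pro01.kfk.com:2181,bigdata-pro02.kfk.com:2181,bigdata-pro03.kfk.com:2181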
2) Write test code and start running
Upload the package (do this on all 3 nodes)
Start spark-shell
Paste in the code
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "bigdata-pro01.kfk.com:9092")
.option("subscribe", "weblogs")
.load()
import spark.implicits._
val lines = df.selectExpr("CAST(value AS STRING)").as[String]
val words = lines.flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream
.outputMode("update")
.format("console")
.start()
query.awaitTermination()
At this point, be sure to keep Kafka and the producer running:
bin/kafka-console-producer.sh --broker-list bigdata-pro01.kfk.com:9092 --topic weblogs
Enter a few words on the producer side
Go back to the spark-shell interface to see the statistical results
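The console sink prints a small result table per trigger, roughly in this shape (values here are purely illustrative):

-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|hadoop|    2|
| spark|    1|
+------+-----+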
10. Complete real-time data analysis based on structured stream
Let's first clear the contents of the webCount table in MySQL's test database
Open IDEA; we will write two programs
package com.spark.test

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime

/**
  * Created by Zimo on 2017/10/16.
  */
object StructuredStreamingKafka {

  case class Weblog(datatime: String,
                    userid: String,
                    searchname: String,
                    retorder: String,
                    cliorder: String,
                    cliurl: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("streaming").getOrCreate()

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "bigdata-pro01.kfk.com:9092")
      .option("subscribe", "weblogs")
      .load()

    import spark.implicits._
    val lines = df.selectExpr("CAST(value AS STRING)").as[String]
    val weblog = lines.map(_.split(","))
      .map(x => Weblog(x(0), x(1), x(2), x(3), x(4), x(5)))
    val titleCount = weblog
      .groupBy("searchname").count().toDF("titleName", "count")

    val url = "jdbc:mysql://bigdata-pro01.kfk.com:3306/test"
    val username = "root"
    val password = "root"
    val writer = new JDBCSink(url, username, password)

    val query = titleCount.writeStream
      .foreach(writer)
      .outputMode("update")
      //.format("console")
      .trigger(ProcessingTime("5 seconds"))
      .start()
    query.awaitTermination()
  }
}
package com.spark.test

import java.sql.{Connection, DriverManager, ResultSet, SQLException, Statement}

import org.apache.spark.sql.{ForeachWriter, Row}

/**
  * Created by Zimo on 2017/10/17.
  */
class JDBCSink(url: String, username: String, password: String) extends ForeachWriter[Row] {

  var statement: Statement = _
  var resultSet: ResultSet = _
  var connection: Connection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    Class.forName("com.mysql.jdbc.Driver")
    connection = DriverManager.getConnection(url, username, password)
    statement = connection.createStatement()
    true
  }

  override def process(value: Row): Unit = {
    val titleName = value.getAs[String]("titleName").replaceAll("[\\[\\]]", "")
    val count = value.getAs[Long]("count")

    //Upsert: update the row if the title already exists, otherwise insert it
    val querySql = "select 1 from webCount where titleName = '" + titleName + "'"
    val updateSql = "update webCount set count = " + count + " where titleName = '" + titleName + "'"
    val insertSql = "insert into webCount(titleName,count) values('" + titleName + "'," + count + ")"

    try {
      val resultSet = statement.executeQuery(querySql)
      if (resultSet.next()) {
        statement.executeUpdate(updateSql)
      } else {
        statement.execute(insertSql)
      }
    } catch {
      case ex: SQLException => println("SQLException: " + ex.getMessage)
      case ex: Exception => println("Exception: " + ex.getMessage)
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    //Release JDBC resources
    if (statement != null) {
      statement.close()
    }
    if (connection != null) {
      connection.close()
    }
  }
}
Add this dependency package in the pom.xml file
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.27</version>
</dependency>
A note on versions: the dependency version you choose should match the corresponding jar in your cluster, otherwise errors may occur. You can check the versions under Hive's lib directory for reference.
Keep the cluster's HDFS, HBase, YARN, and ZooKeeper services running
Start Flume on nodes 1 and 2. Before starting, modify Flume's configuration file: since we changed the JDK and Kafka versions, the configuration needs updating (do this on all 3 nodes)
Start flume on node 1
Start kafka on node 1
bin/kafka-server-start.sh config/server.properties
Start flume on node 2
Start the data-generation script on node 2 to produce data in real time
Back in IDEA, run the program
Go back to MySQL and check the webCount table; data is already coming in
To avoid garbled Chinese characters, we modify MySQL's configuration file (my.cnf) as follows
[client]
socket=/var/lib/mysql/mysql.sock
default-character-set=utf8
[mysqld]
character-set-server=utf8
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
[mysql]
default-character-set=utf8
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
Delete the table
Recreate the table
create table webCount( titleName varchar(255) CHARACTER SET utf8 DEFAULT NULL, count int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Run the program again
You can see there are no more garbled Chinese characters; we can also connect to MySQL with a visualization tool to view the data
The above is the main content of this section. It reflects the blogger's own learning process; I hope it can give you some guidance, and if it is useful I hope you will support it. If it is not useful to you, I ask your understanding, and please point out any mistakes. If you are looking forward to more, you can follow the blogger to get updates as soon as possible, thank you! Reprinting is also welcome, but the original address must be marked in a prominent position in the post, and the right of interpretation belongs to the blogger!