Spark-Streaming real-time data analysis

 


1. Introduction to Spark Streaming

1) Definition

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

http://ke.dajiangtai.com/content/6919/1.png

2. Install the nc service and run Spark Streaming

1) Install the nc command online

  • rpm -ivh nc-1.84-22.el6.x86_64.rpm (preferred)

#Install

Upload the nc-1.84-22.el6.x86_64.rpm package to the software directory, and then install

[kfk@bigdata-pro02 softwares]$ sudo rpm -ivh nc-1.84-22.el6.x86_64.rpm
Preparing...                ########################################### [100%]
   1:nc                     ########################################### [100%]

[kfk@bigdata-pro02 softwares]$ which nc
/usr/bin/nc

#start up

nc -lk 9999    (this works like a receiver listening on port 9999)

After nc starts, you can type data into this terminal, and the word frequencies can then be computed on the Spark side (as shown in step 2) below).

  • yum install -y nc

2) Run WordCount of Spark Streaming

bin/run-example --master local[2] streaming.NetworkWordCount localhost 9999

#data input

#Result statistics

Note: the output above is only visible after raising the log level to WARN; otherwise it is drowned out by INFO logging and hard to observe.
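One way to do this in spark-shell is to raise the log level for the current session (alternatively, set it in conf/log4j.properties):

// Suppress INFO logging so the per-batch output is easy to read
sc.setLogLevel("WARN")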

3) Pipe a file into nc as its input, then observe the results computed by Spark Streaming

cat test.txt | nc -lk 9999

The specific content of the file

hadoop  storm   spark
hbase   spark   flume
spark   dajiangtai     spark
hdfs    mapreduce      spark
hive    hdfs    solr
spark   flink   storm
hbase   storm   es

3. Working principle of Spark Streaming

1) Spark Streaming data flow processing

http://ke.dajiangtai.com/content/6919/2.png

2) The working principle of the receiver

http://ke.dajiangtai.com/content/6919/3.png

http://ke.dajiangtai.com/content/6919/4.png

http://ke.dajiangtai.com/content/6919/5.png

http://ke.dajiangtai.com/content/6919/6.png

3) Comprehensive working principle

http://ke.dajiangtai.com/content/6919/7.png

http://ke.dajiangtai.com/content/6919/8.png

4.Spark Streaming programming model

1) Two ways of StreamingContext initialization

#Option 1: reuse an existing SparkContext (for example, the sc provided by spark-shell)

val ssc = new StreamingContext(sc, Seconds(5))

#Option 2: create it from a SparkConf

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

2) Cluster test

#Start spark

bin/spark-shell --master local[2]

scala> :paste

// Entering paste mode (ctrl-D to finish)

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

// Exiting paste mode, now interpreting.

 

#Enter data on the nc server side

spark
hive hbase
hadoop hive
hbase hbase
spark hadoop
hive hbase
spark Hadoop

#Result statistics

5.Spark Streaming reads Socket stream data

1) When running a Streaming program in spark-shell, either use more than one local thread or run against a cluster, because the receiver itself occupies one thread.

bin/spark-shell --master local[2]
bin/spark-shell --master spark://bigdata-pro01.kfk.com:7077

2) spark running mode

http://ke.dajiangtai.com/content/6919/9.png

3) Spark Streaming reads Socket stream data

a) Write test code and run it locally

TestStreaming.scala



package com.zimo.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  *
  * @author Zimo
  * @date 2019/4/29
  *
  */
object TestStreaming {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("streaming")
      .getOrCreate()

    val sc = spark.sparkContext



    // Listen on a network port (parameter 1: hostname, parameter 2: port,
    // parameter 3: storage level, left at its default here) and create the lines stream
    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("bigdata-pro02.kfk.com", 9999)
    val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    words.print()
    ssc.start()
    ssc.awaitTermination()

  }
}

b) Start the nc service to send data

nc -lk 9999

spark hadoop
spark hadoop
hive hbase
spark hadoop

6.Spark Streaming saves data to external systems

1) Save to mysql database

import java.sql.DriverManager
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val lines = ssc.socketTextStream("bigdata-pro02.kfk.com", 9999)
val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// For each batch, open one JDBC connection per partition and insert every (word, count) pair
words.foreachRDD(rdd => rdd.foreachPartition(partition => {
  Class.forName("com.mysql.jdbc.Driver")
  val conn = DriverManager
    .getConnection("jdbc:mysql://bigdata-pro01.kfk.com:3306/test", "root", "root")
  try {
    for (row <- partition) {
      val sql = "insert into webCount(titleName,count) values('" + row._1 + "'," + row._2 + ")"
      conn.prepareStatement(sql).executeUpdate()
    }
  } finally {
    conn.close()
  }
}))
ssc.start()
ssc.awaitTermination()

Then input data on the nc server side; the word counts of each batch are inserted into the webCount table of the test database (the table must already exist; see the create table statement in section 10).

mysql> select * from webCount;
+-----------+-------+
| titleName | count |
+-----------+-------+
| hive      |     4 |
| spark     |     4 |
| hadoop    |     4 |
| hbase     |     5 |
+-----------+-------+
4 rows in set (0.00 sec)
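Since the insert above is built by string concatenation, a parameterized PreparedStatement is a slightly safer variant. The sketch below is an illustration, not the original code; it assumes the same webCount schema and writes the same (word, count) pairs with bind parameters:

words.foreachRDD(rdd => rdd.foreachPartition(partition => {
  // Sketch only: same write path as above, but with bind parameters instead of string concatenation
  val conn = DriverManager
    .getConnection("jdbc:mysql://bigdata-pro01.kfk.com:3306/test", "root", "root")
  val stmt = conn.prepareStatement("insert into webCount(titleName, count) values(?, ?)")
  try {
    for ((word, count) <- partition) {
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}))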

2) Save to HDFS

This method is simpler than writing to the database. If you are interested, please refer to the following code to test it yourself.

http://ke.dajiangtai.com/content/6919/11.png

Special note: every time the job is executed, the contents of the HDFS output files are reset and overwritten!
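For reference, here is a minimal sketch of such a job (an illustration only, not the exact code from the screenshot; the output path is just an example):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val lines = ssc.socketTextStream("bigdata-pro02.kfk.com", 9999)
val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// saveAsTextFiles creates one output directory per batch, named <prefix>-<batch time in ms>,
// so the on-HDFS layout may differ from the overwrite behaviour described in the note above
words.saveAsTextFiles("/user/kfk/streaming/wordcount")
ssc.start()
ssc.awaitTermination()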

7. Structured Streaming programming model

1) complete output mode

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "bigdata-pro02.kfk.com")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream.outputMode("complete").format("console").start()

2) update output mode

In complete mode, every trigger prints the full result table, so as you keep typing on the nc side the counts always reflect both the new input and all historical input. If the outputMode is changed to "update", the counts are still accumulated over the history, but each trigger only displays the rows whose values changed since the last batch.
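The only change needed is the output mode on the query (everything else stays the same as in the complete-mode example above):

// Same aggregation; only rows whose counts changed in the current trigger are printed
val query = wordCounts.writeStream.outputMode("update").format("console").start()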

3) append output mode

If you change the outputMode to "append", the code also needs to be slightly modified:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "bigdata-pro02.kfk.com")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" ")).map(x => (x, 1))
val query = words.writeStream.outputMode("append").format("console").start()

As you can see, this mode simply appends each new input. Append mode does not support the running aggregation used in the earlier examples (without watermarking), which is why the groupBy/count step is dropped here.

8. Real-time data processing business analysis

9. Spark Streaming and Kafka integration

1) Preparation

According to the official website's requirements, the Kafka version we installed earlier is too old; we need to download a release of at least 0.10.0.

Download address http://kafka.apache.org/downloads

Modifying the configuration is simple: copy our previously configured /config folder over the new installation's, create new kafka-logs and logs folders as in the original setup, and then adjust the paths in the configuration files accordingly.

2) Write test code and start running

We upload the package (do this on all 3 nodes).

start spark-shell

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322010955932-1783709958.png

Copy the following code into the shell:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bigdata-pro01.kfk.com:9092")
  .option("subscribe", "weblogs")
  .load()

import spark.implicits._
val lines = df.selectExpr("CAST(value AS STRING)").as[String]
val words = lines.flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()

At this time, be sure to keep kafka and the producer turned on:

bin/kafka-console-producer.sh --broker-list bigdata-pro01.kfk.com:9092 --topic weblogs

Enter a few words on the producer's side

Go back to the spark-shell interface to see the statistical results

10. Complete real-time data analysis based on Structured Streaming

Let's first clear the contents of the webCount table in MySQL's test database.

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322114150081-1902027289.png

Open IDEA; we will write two programs.

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322225746195-1933346147.png

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322230140026-170366579.png

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322230200976-303812454.png

 

StructuredStreamingKafka.scala

package com.spark.test

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime


/**
  * Created by Zimo on 2017/10/16.
  */

object StructuredStreamingKafka {

  case class Weblog(datatime:String,
                    userid:String,
                    searchname:String,
                    retorder:String,
                    cliorder:String,
                    cliurl:String)

  def main(args: Array[String]): Unit = {

    val spark  = SparkSession.builder()
      .master("local[2]")
      .appName("streaming").getOrCreate()

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "bigdata-pro01.kfk.com:9092")
      .option("subscribe", "weblogs")
      .load()

    import spark.implicits._
    val lines = df.selectExpr("CAST(value AS STRING)").as[String]
    val weblog = lines.map(_.split(","))
      .map(x => Weblog(x(0), x(1), x(2),x(3),x(4),x(5)))
    val titleCount = weblog
      .groupBy("searchname").count().toDF("titleName","count")
    val url ="jdbc:mysql://bigdata-pro01.kfk.com:3306/test"
    val username="root"
    val password="root"
    val writer = new JDBCSink(url,username,password)
    val query = titleCount.writeStream
      .foreach(writer)
      .outputMode("update")
        //.format("console")
      .trigger(ProcessingTime("5 seconds"))
      .start()
    query.awaitTermination()
  }

}
JDBCSink.scala

package com.spark.test

import java.sql._
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.{ForeachWriter, Row}

/**
  * Created by Zimo on 2017/10/17.
  */
class JDBCSink(url:String, username:String,password:String) extends ForeachWriter[Row]{

  var statement: Statement = _
  var resultSet: ResultSet = _
  var connection: Connection = _
  override def open(partitionId: Long, version: Long): Boolean = {
    Class.forName("com.mysql.jdbc.Driver")
    //  connection = new MySqlPool(url,username,password).getJdbcConn();
    connection = DriverManager.getConnection(url, username, password)
    statement = connection.createStatement()
    true
  }

  override def process(value: Row): Unit = {
    val titleName = value.getAs[String]("titleName").replaceAll("[\\[\\]]","")
    val count = value.getAs[Long]("count");

    val querySql = "select 1 from webCount " +
      "where titleName = '"+titleName+"'"

    val updateSql = "update webCount set " +
      "count = "+count+" where titleName = '"+titleName+"'"

    val insertSql = "insert into webCount(titleName,count)" +
      "values('"+titleName+"',"+count+")"

    try {
      val resultSet = statement.executeQuery(querySql)
      if(resultSet.next()){
        statement.executeUpdate(updateSql)
      }else{
        statement.execute(insertSql)
      }
    } catch {
      // SQLException and RuntimeException must be matched before the broader Exception case,
      // otherwise those cases would be unreachable
      case ex: SQLException => println("SQLException")
      case ex: RuntimeException => println("RuntimeException")
      case ex: Exception => println("Exception")
      case ex: Throwable => println("Throwable")
    }

  }

  override def close(errorOrNull: Throwable): Unit = {
//    if(resultSet.wasNull()){
//      resultSet.close()
//    }
    // Only close resources that were actually created
    if (statement != null) {
      statement.close()
    }
    if (connection != null) {
      connection.close()
    }
  }
}

 

Add this dependency package in the pom.xml file

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322232057028-1016980854.png

<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.27</version>
</dependency>

Note that the version of this dependency should match the connector version already used in your cluster, otherwise errors may be reported; the jars under hive's lib directory are a good reference for which version to pick.

Keep the cluster's dfs, hbase, yarn, and zookeeper services running.

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322231010171-803607012.png

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322231027392-1471559359.png

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322231051438-199414215.png

Start flume on node 1 and node 2. Before starting, modify the flume configuration: since we changed the JDK and Kafka versions, the configuration file needs to be updated (change it on all 3 nodes).

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322230310836-1178269231.png

Start flume on node 1

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322230609636-1099721704.png

Start kafka on node 1

bin/kafka-server-start.sh config/server.properties

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322230705621-1488899955.png

Start flume on node 2

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322230822747-2147304152.png

Start data generation on node 2 to produce data in real time.

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322230952822-1218764676.png

Back in IDEA, let's run the program.

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322231215101-809927133.png

Go back to mysql and check the webCount table; data is already coming in.

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180322231711813-1200711695.png

To avoid garbled Chinese characters, we modify the MySQL configuration as follows (setting the character set to utf8):

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180323101547610-1665572357.png

 https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180323101606325-2117330682.png

 

[client]
socket=/var/lib/mysql/mysql.sock
default-character-set=utf8

[mysqld]
character-set-server=utf8
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

[mysql]
default-character-set=utf8

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

 

Delete the old webCount table

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180323101707279-514845126.png

Recreate the table:

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180323101755737-410818476.png

create table webCount(
  titleName varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  count int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

run the program again

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180323101901780-79374838.png

 https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180323101922746-2012918589.png

You can see that there are no more garbled Chinese characters, and we can also connect to mysql with a visualization tool to view the results.

https://images2018.cnblogs.com/blog/1023171/201803/1023171-20180323102046986-882146045.png


The above is the main content of this section. It comes from the blogger's own learning process, and I hope it can give you some guidance. If it is useful, I hope you will support it; if it is not useful to you, I hope you will forgive me and point out any mistakes. If you are looking forward to more, you can follow the blogger to get updates as soon as possible, thank you! Reprinting is also welcome, but the original address must be marked in an obvious place in the post, and the right of interpretation belongs to the blogger!


Origin blog.csdn.net/py_123456/article/details/89710068