Flume+Kafka+Spark Streaming+MySQL real-time log analysis


Download the source code of this case
Link: https://pan.baidu.com/s/1IzOvSCtLvZzj81XZaYl6CQ
Extraction code: i6i8

Background of the project

In the era of rapid Internet development, more and more people obtain information or run their own business online. After building a website, app or mini-program and operating and maintaining it for a while, many teams find that the growth of page views and user numbers has stalled, yet they have no idea where to start redesigning it, because they do not understand their users' browsing preferences or the make-up of their user groups. The server logs clearly record every visit and the users' browsing preferences, but it is difficult to filter high-quality information out of a large volume of logs in a timely and effective way with ordinary methods. Spark Streaming is a real-time stream computing framework that can analyze data quickly in real time; combined with Flume and Kafka, it can achieve statistical analysis of the data with nearly zero delay.

Case requirements

Requirement: analyze the server log data in real time and compute information such as the number of page views within a given time period.

Technologies used: Flume → Kafka → Spark Streaming → MySQL database

Case architecture

(Figure: case architecture)

In this architecture, Flume monitors the log file in real time; when new data appears in the log file it is sent to Kafka, Spark Streaming receives it and analyzes it in real time, and the analysis results are finally saved in the MySQL database.

One, analysis

1. Log analysis

When a webpage on the server is visited through a browser, a log record is generated for each visit. Each record contains the visitor IP, access time, requested address, status code and time consumed, as shown in the following figure:

(Figure: sample server log records)

Two, log collection

Step 1: edit the collection configuration

Flume monitors the content of the server log file in real time: every time a new log record is generated it is collected, and the collected records are sent to Kafka. The Flume configuration is shown below.
(Figure: Flume agent configuration)
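
The screenshot of the configuration is not reproduced here, so the snippet below is only a rough sketch of what such an agent configuration (the conf/access_log-HDFS.properties file referenced in the startup command) might look like. It assumes an exec source tailing the access log (the log path is hypothetical) and a Kafka sink writing to the testSpark topic on broker 192.168.10.10:9092, matching the topic and broker used by the Spark Streaming code later in this article; Kafka sink property names differ between Flume versions, so check the user guide for the version you are running.

# sketch of the Flume agent a1 (names and paths are assumptions)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# exec source: tail the web server access log in real time (hypothetical path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /usr/local/nginx/logs/access.log
a1.sources.r1.channels = c1

# memory channel buffers events between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Kafka sink: forward every log line to the testSpark topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = testSpark
a1.sinks.k1.brokerList = 192.168.10.10:9092
a1.sinks.k1.channel = c1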

Step 2: start the collection

After editing the configuration, start Flume to monitor the server log: enter the Flume installation directory and execute the following command.

[root@master flume]# bin/flume-ng agent --name a1 --conf conf  --conf-file conf/access_log-HDFS.properties  -Dflume.root.logger=INFO,console

The effect is shown in the figure below.

(Figure: Flume startup output)
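
Before Flume can publish to Kafka, the testSpark topic must exist. If it has not been created yet, it can be created with a command along the following lines (the ZooKeeper address master:2181 is an assumption; adjust it to your environment), and a console consumer can be used to confirm that log lines are arriving:

[root@master kafka]# bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic testSpark
[root@master kafka]# bin/kafka-console-consumer.sh --zookeeper master:2181 --topic testSpark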

Three, write Spark Streaming code

Step 1: create a project

(Figure: creating a new project in IDEA)

Step 2: choose to create a Scala project

(Figure: selecting a Scala project)

Step 3: set the project name, the project path and the Scala version, then complete the creation

(Figure: setting the project name, path and Scala version)

Step 4: create a Scala class file

Right-click "src" in the project directory and select "New" -> "Package" to create a package named "com.wordcountdemo", then right-click the package and select "New" -> "Scala Class" to create a class named wordcount.

(Figure: creating the package and the Scala class)

Step 5: Import dependent packages

Import the Spark dependency package in IDEA: select "File" -> "Project Structure" -> "Libraries" from the menu, click the "+" button and choose the "Java" option, locate the spark-assembly-1.6.1-hadoop2.6.0.jar dependency in the pop-up dialog box, and click "OK" to load all dependencies into the project. The result is shown in the figure below.

(Figure: loading the Spark dependency package)

Step 6: Import all the classes needed by this program

Note that three jar packages that are not in spark2 are used here: kafka_2.11-0.8.2.1.jar,
metrics-core-2.2.0.jar, spark-streaming-kafka_2.11-1.6.3.jar.

import java.sql.DriverManager                                  // database connection
import kafka.serializer.StringDecoder                          // decode Kafka messages as strings
import org.apache.spark.streaming.dstream.DStream              // the input data stream type
import org.apache.spark.streaming.kafka.KafkaUtils             // connect to Kafka
import org.apache.spark.streaming.{Seconds, StreamingContext}  // real-time stream processing
import org.apache.spark.SparkConf                              // Spark program configuration

The result is shown in the figure.

(Figure: import statements in IDEA)

Step 7: Create the main function and Spark program entry.

def main(args: Array[String]): Unit = {
  // create the SparkConf and the StreamingContext
  val conf = new SparkConf().setAppName("Consumer")
  val ssc = new StreamingContext(conf,Seconds(20))  // receive and compute once every 20 seconds
}

The result is shown in the figure.

(Figure: creating the StreamingContext)
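
Note that the SparkConf above does not set a master URL; it relies on the --master option passed to spark-submit in the compile-and-run section later. For a quick test run directly inside IDEA, the master could instead be set in code, for example:

// only for local testing inside the IDE; when submitting with spark-submit,
// leave the master unset and pass it with the --master option instead
val conf = new SparkConf().setAppName("Consumer").setMaster("local[2]")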

Step 8: Set the Kafka broker address and port, and the topic from which to receive the data

// Kafka broker address
val kafkaParam = Map("metadata.broker.list" -> "192.168.10.10:9092")
// set the topic
val topic = "testSpark".split(",").toSet
// receive the Kafka data
val logDStream: DStream[String] = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParam,topic).map(_._2)

Step 9: Data analysis

After the data is received, analyze it: split each server log line on spaces (the code below assumes the visitor IP is the first space-separated field and the requested address is the ninth), count the number of page views, user registrations and bouncing visitors during the period, and convert each statistic into a key-value pair DStream.

// split the received data
val RDDIP = logDStream.transform(rdd => rdd.map(x => x.split(" ")))
// analyze the data
val pv = RDDIP.map(x => x(0)).count().map(x => ("pv", x))   // page views
val jumper = RDDIP.map(x => x(0)).map((_, 1)).reduceByKey(_ + _).filter(x => x._2 == 1).map(x => x._1).count.map(x => ("jumper", x))   // bouncing visitors (IPs with only one request)
val reguser = RDDIP.filter(_(8).replaceAll("\"", "").toString == "/member.php?mod=register&inajax=1").count.map(x => ("reguser", x))   // number of user registrations
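
As a concrete example of what these three statistics mean: if a 20-second batch contains 5 log lines coming from 3 distinct IP addresses, of which 2 IP addresses appear only once, and 1 of the lines requests /member.php?mod=register&inajax=1, then the batch produces ("pv", 5), ("jumper", 2) and ("reguser", 1).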

Step 10: Save the calculation results

Traverse each result DStream, take the value out of each key-value pair, and save the analysis results into the pvtab, jumpertab and regusetab tables respectively; finally, start the Spark Streaming program.

  pv.foreachRDD(line =>line.foreachPartition(rdd=>{
      rdd.foreach(word=>{
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("yyyy-MM-dd H:mm:ss")
        val dateFf= format.format(new java.util.Date())
        val sql = "insert into pvtab(time,pv) values("+"'"+dateFf+"'," +"'"+word._2+"')"
        conn.prepareStatement(sql).executeUpdate()
      })
      }))
    jumper.foreachRDD(line =>line.foreachPartition(rdd=>{
      rdd.foreach(word=>{
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("yyyy-MM-dd H:mm:ss")
        val dateFf= format.format(new java.util.Date())
        val sql = "insert into jumpertab(time,jumper) values("+"'"+dateFf+"'," +"'"+word._2+"')"
        conn.prepareStatement(sql).executeUpdate()
    })
    }))
    reguser.foreachRDD(line =>line.foreachPartition(rdd=>{
      rdd.foreach(word=>{
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("yyyy-MM-dd H:mm:ss")
        val dateFf= format.format(new java.util.Date())
        val sql = "insert into regusetab(time,reguse) values("+"'"+dateFf+"'," +"'"+word._2+"')"
        conn.prepareStatement(sql).executeUpdate()
     })
    }))
    ssc.start()        // start the Spark Streaming program

The result is shown in the figure.
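
One thing worth noting about the JDBC code above: the SQL strings are built by string concatenation and a new connection is opened for every record without ever being closed, which is acceptable for a demo but leaks connections over time. The sketch below shows the same pvtab write (table and column names taken from the code above) using one connection per partition, a parameterized PreparedStatement and explicit clean-up; it is an optional variant, not part of the original program.

pv.foreachRDD(rdd => rdd.foreachPartition(partition => {
  // one connection and one prepared statement per partition instead of one per record
  val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
  val stmt = conn.prepareStatement("insert into pvtab(time, pv) values (?, ?)")
  val format = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  try {
    partition.foreach { case (_, count) =>
      stmt.setString(1, format.format(new java.util.Date()))   // batch timestamp
      stmt.setLong(2, count)                                    // page views in this batch
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}))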

Step 11: database design

Create a database named "test", and create three tables in it named "jumpertab", "pvtab" and "regusetab". The table structures are shown in the figures below.

jumpertab table

(Figure: jumpertab table structure)

pvtab table

(Figure: pvtab table structure)

regusetab table

(Figure: regusetab table structure)
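
The table structures from the screenshots are not reproduced here; the DDL below is a minimal sketch that matches the INSERT statements used in the program (the database, table and column names come from the code, while the column types are assumptions):

CREATE DATABASE IF NOT EXISTS test;
USE test;

-- page views per batch interval
CREATE TABLE pvtab (
  time VARCHAR(64),
  pv   VARCHAR(32)
);

-- visitors with only a single request per batch interval
CREATE TABLE jumpertab (
  time   VARCHAR(64),
  jumper VARCHAR(32)
);

-- user registrations per batch interval
CREATE TABLE regusetab (
  time   VARCHAR(64),
  reguse VARCHAR(32)
);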

Four, compile and run

Package the program into a jar file and submit it to the cluster to run.

Step 1: add the project to a JAR file and set the file name

Select the "File"-"Project Structure" command, select the "Artifacts" button in the pop-up dialog box, select "JAR" -> "Empty" under "+" and set the JAR at "NAME" in the pop-up dialog box The name of the file is "WordCount", and double-click "'firstSpark'compile output" under "firstSpark" on the right to load it to the left, indicating that the project has been added to the JAR package and then click the "OK" button, as shown in the figure below Show.

(Figure: artifact settings)

Step 2: generate the jar package

Click the "Build" -> "Build Artifacts..." button in the menu bar and click the "Build" button in the pop-up dialog box. After the jar package is generated, the project root directory will automatically create an out directory. You can see the generated in the directory jar package, the result is shown in the figure below.

(Figure: the generated jar package in the out directory)

Step 3: submit and run the Spark Streaming program

[root@master bin]# ./spark-submit --master local[*] --class  com.spark.streaming.sparkword /usr/local/Streaminglog.jar 
Note that the value passed to --class must match the package and object name of your own code; for the complete listing at the end of this article it would be spark.kafkaspark.

The result is shown in the figure below

(Figure: spark-submit output)

Step 4: View the database

(Figure: query results in the MySQL database)
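
To verify that the results are being written, the tables can be queried directly in MySQL, for example:

SELECT * FROM pvtab ORDER BY time DESC LIMIT 10;
SELECT * FROM jumpertab ORDER BY time DESC LIMIT 10;
SELECT * FROM regusetab ORDER BY time DESC LIMIT 10;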

The complete code is as follows

package spark
import java.sql.DriverManager
import java.util.Calendar

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf
object kafkaspark {
  def main(args: Array[String]): Unit = {
    // create the SparkConf and the StreamingContext
    val conf = new SparkConf().setAppName("Consumer")
    val ssc = new StreamingContext(conf,Seconds(1))
    val kafkaParam = Map("metadata.broker.list" -> "192.168.10.10:9092")
    val topic = "testSpark".split(",").toSet
    // receive the Kafka data
    val logDStream: DStream[String] = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParam,topic).map(_._2)
    // split the received data
    val RDDIP =logDStream.transform(rdd=>rdd.map(x=>x.split(" ")))
    // analyze the data
    val pv = RDDIP.map(x=>x(0)).count().map(x=>("pv",x))
    val jumper = RDDIP.map(x=>x(0)).map((_,1)).reduceByKey(_+_).filter(x=>x._2 == 1).map(x=>x._1).count.map(x=>("jumper",x))
    val reguser =RDDIP.filter(_(8).replaceAll("\"","").toString == "/member.php?mod=register&inajax=1").count.map(x=>("reguser",x))

    // save the analysis results to the MySQL database
      pv.foreachRDD(line =>line.foreachPartition(rdd=>{
          rdd.foreach(word=>{
            val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
            val format = new java.text.SimpleDateFormat("H:mm:ss")
            val dateFf= format.format(new java.util.Date())
            var cal:Calendar=Calendar.getInstance()
            cal.add(Calendar.SECOND,-1)
            var Beforeasecond=format.format(cal.getTime())
            val date = Beforeasecond.toString+"-"+dateFf.toString
            val sql = "insert into pvtab(time,pv) values("+"'"+date+"'," +"'"+word._2+"')"
            conn.prepareStatement(sql).executeUpdate()
          })
          }))
    jumper.foreachRDD(line =>line.foreachPartition(rdd=>{
      rdd.foreach(word=>{
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("H:mm:ss")
        val dateFf= format.format(new java.util.Date())
        var cal:Calendar=Calendar.getInstance()
        cal.add(Calendar.SECOND,-1)
        var Beforeasecond=format.format(cal.getTime())
        val date = Beforeasecond.toString+"-"+dateFf.toString
        val sql = "insert into jumpertab(time,jumper) values("+"'"+date+"'," +"'"+word._2+"')"
        conn.prepareStatement(sql).executeUpdate()
      })
    }))
    reguser.foreachRDD(line =>line.foreachPartition(rdd=>{
      rdd.foreach(word=>{
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("H:mm:ss")
        val dateFf= format.format(new java.util.Date())
        var cal:Calendar=Calendar.getInstance()
        cal.add(Calendar.SECOND,-1)
        var Beforeasecond=format.format(cal.getTime())
        val date = Beforeasecond.toString+"-"+dateFf.toString
        val sql = "insert into regusetab(time,reguse) values("+"'"+date+"'," +"'"+word._2+"')"
        conn.prepareStatement(sql).executeUpdate()
      })
    }))
    val num = logDStream.map(x=>(x,1)).reduceByKey(_+_)
    num.print()
    // start the streaming job
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}


Origin: blog.csdn.net/qq_36816848/article/details/113827079