Spark project in practice: data cleaning

Log file: https://pan.baidu.com/s/1Eve8GmGi21JLV70fqJjmQw 
Extraction code: 3xsp

Tools used: IntelliJ IDEA, Maven

This project uses Spark to perform data cleaning and daily user retention analysis:

Table of contents

1. Build the environment

2. Data cleaning

3. User daily retention analysis

4. Source code:


1. Build the environment

Configure pom.xml

    <repositories>
        <repository>
            <id>aliyunmaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
        <repository>
            <id>spring-milestones</id>
            <name>Spring Milestones</name>
            <url>https://repo.spring.io/milestone</url>
        </repository>
    </repositories>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.13</artifactId>
            <version>3.2.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/junit/junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.2</version>
            <scope>test</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.13.8</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>3.2.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.13</artifactId>
            <version>3.2.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.28</version>
        </dependency>
    </dependencies>

Install the Scala plugin:

File -> Settings -> Plugins

2. Data cleaning

Using the DataFrame abstraction of Spark SQL, the cleaned data can be written into MySQL. The transformation of the raw log can be summarized as:

RDD[String] -> RDD[Array[String]] -> RDD[Row] -> DataFrame -> write to MySQL
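
Condensed to its skeleton, the whole cleaning job is just this chain. A minimal sketch (assuming the schema, url and prop defined in the later steps; the input path is a placeholder):

    val spark = SparkSession.builder().master("local[1]").appName("DataClear").getOrCreate()
    val sc = spark.sparkContext
    val rowRDD = sc.textFile("path/to/test.log")              // RDD[String]
      .map(_.split("\t"))                                     // RDD[Array[String]]
      .filter(_.length == 8)
      .map(a => Row.fromSeq(a.map(_.trim).toSeq))             // RDD[Row]
    val df = spark.createDataFrame(rowRDD, schema)            // DataFrame (schema defined in step 2)
    df.write.mode("overwrite").jdbc(url, "logDetail", prop)   // write to MySQL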

Before cleaning the data, you need to understand the format of the web log. Fields within a log line are separated by "\t" (the Tab character). Below is a typical web log entry and its fields:

event_time = 2018-09-04T20:27:31+08:00	
url = http://datacenter.bdqn.cn/logs/user?actionBegin=1536150451540&actionClient=Mozilla%2F5.0+%28Windows+NT+10.0%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F58.0.3029.110+Safari%2F537.36+SE+2.X+MetaSr+1.0&actionEnd=1536150451668&actionName=startEval&actionTest=0&actionType=3&actionValue=272090&clientType=001_kgc&examType=001&ifEquipment=web&isFromContinue=false&skillIdCount=0&skillLevel=0&testType=jineng&userSID=B842B843AE317425D53D0C567A903EF7.exam-tomcat-node3.exam-tomcat-node3&userUID=272090&userUIP=1.180.18.157	
method = GET	
status = 200	
sip = 192.168.168.64	
user_uip = -	
action_prepend = -	
action_client = Apache-HttpClient/4.1.2 (java 1.5)
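
Everything of interest lives in the url field's query string: key=value pairs joined by "&", with values still percent-encoded. A small sketch of the split that the cleaning step performs later (the query string below is truncated from the sample above):

    val query = "actionBegin=1536150451540&actionName=startEval&userUID=272090&userUIP=1.180.18.157"
    val params = query.split("&")
      .map(_.split("="))
      .filter(_.length == 2)
      .map(kv => kv(0) -> kv(1))
      .toMap
    println(params.getOrElse("userUID", ""))   // 272090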

1) Convert RDD[String] to RDD[Row], filtering out log lines that do not have exactly 8 fields

    val linesRDD = sc.textFile("C:/Users/Lenovo/Desktop/Working/Python/data/test.log")
    import spark.implicits._    

    val line1 = linesRDD.map(x => x.split("\t"))
        //line1.foreach(println)
    val rdd = line1
      .filter(x => x.length == 8)
      .map(x => Row(x(0).trim, x(1).trim, x(2).trim, x(3).trim, x(4).trim, x(5).trim, x(6).trim, x(7).trim))
        //rdd.foreach(println)
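
If you want to see how many lines the 8-field filter drops, a quick sanity check (assuming the RDDs defined above):

    val total = line1.count()
    val kept  = line1.filter(_.length == 8).count()
    println(s"kept $kept of $total log lines")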

2) Convert RDD[Row] to a DataFrame and establish the initial column mapping

//    Build the mapping between the RDD and the table schema
    val schema = StructType(Array(
      StructField("event_time", StringType),
      StructField("url", StringType),
      StructField("method", StringType),
      StructField("status", StringType),
      StructField("sip", StringType),
      StructField("user_uip", StringType),
      StructField("action_prepend", StringType),
      StructField("action_client", StringType)
    ))
    val orgDF = spark.createDataFrame(rdd, schema)
    //    orgDF.show(5)

3) Split the url field on "&" and "=" into individual parameters

    //Deduplicate, drop rows whose status code is not 200, and drop rows with an empty event_time
    //distinct compares and deduplicates entire rows; dropDuplicates deduplicates by the specified columns.
    val ds1 = orgDF.dropDuplicates("event_time", "url")
      .filter(x => x(3) == "200")
      .filter(x => StringUtils.isNotEmpty(x(0).toString))

    //Split the url on "&" and "=" to extract the following parameters:
    //userUID, userSID, userUIP, actionClient, actionBegin, actionEnd, actionType,
    //actionPrepend, actionTest, ifEquipment, actionName, id, progress

    //Build the key-value mapping of url parameters as a Map
    val dfDetail = ds1.map(row => {
      val urlArray = row.getAs[String]("url").split("\\?")
      var map = Map("params" -> "null")
      if (urlArray.length == 2) {
        map = urlArray(1).split("&")
          .map(x => x.split("="))
          .filter(_.length == 2)
          .map(x => (x(0), x(1)))
          .toMap
      }
      (
        //map holds the fields parsed from the url; row holds the fields of the original DataFrame
        row.getAs[String]("event_time"),
        row.getAs[String]("user_uip"),
        row.getAs[String]("method"),
        row.getAs[String]("status"),
        row.getAs[String]("sip"),
        map.getOrElse("actionBegin", ""),
        map.getOrElse("actionEnd", ""),
        map.getOrElse("userUID", ""),
        map.getOrElse("userSID", ""),
        map.getOrElse("userUIP", ""),
        map.getOrElse("actionClient", ""),
        map.getOrElse("actionType", ""),
        map.getOrElse("actionPrepend", ""),
        map.getOrElse("actionTest", ""),
        map.getOrElse("ifEquipment", ""),
        map.getOrElse("actionName", ""),
        map.getOrElse("progress", ""),
        map.getOrElse("id", "")
      )
    }).toDF()
//    dfDetail.show(5)
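
The comment at the top of this step notes that distinct deduplicates whole rows while dropDuplicates works on named columns; a tiny illustration of the difference (df is a hypothetical DataFrame with columns a and b):

    df.distinct()               // keeps rows that differ in any column
    df.dropDuplicates("a")      // keeps one row per distinct value of column a
    df.dropDuplicates("a", "b") // keeps one row per distinct (a, b) pair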

4) Rebuild the schema over the flattened fields and write the result into the database

    val detailRDD = dfDetail.rdd
    val detailSchema = StructType(Array(
      StructField("event_time", StringType),
      StructField("user_uip", StringType),
      StructField("method", StringType),
      StructField("status", StringType),
      StructField("sip", StringType),
      StructField("actionBegin", StringType),
      StructField("actionEnd", StringType),
      StructField("userUID", StringType),
      StructField("userSID", StringType),
      StructField("userUIP", StringType),
      StructField("actionClient", StringType),
      StructField("actionType", StringType),
      StructField("actionPrepend", StringType),
      StructField("actionTest", StringType),
      StructField("ifEquipment", StringType),
      StructField("actionName", StringType),
      StructField("progress", StringType),
      StructField("id", StringType)
    ))

    val detailDF = spark.createDataFrame(detailRDD, detailSchema)

    //    overwrite replaces the existing table, append adds to it
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "******")
    prop.put("driver","com.mysql.jdbc.Driver")
    val url = "jdbc:mysql://localhost:3306/python_db"
    println("开始写入数据库")
    detailDF.write.mode("overwrite").jdbc(url,"logDetail",prop)
    println("完成写入数据库")

3. User daily retention analysis

  1. Find m, the number of new users (registrations) on day N.
  2. Find n, the number of those users who also sign in on day N+1.
  3. Retention rate = n / m * 100%. (A toy numerical example follows.)
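
For example (made-up numbers): if 100 users register on 2018-09-04 and 30 of them sign in on 2018-09-05, then m = 100, n = 30, and the day-1 retention for 2018-09-04 is 30 / 100 * 100% = 30%.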

1) Load the cleaned table and extract the registration and sign-in records

    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "******")
    prop.put("driver", "com.mysql.jdbc.Driver")
    val url = "jdbc:mysql://localhost:3306/python_db"
    val dataFrame = spark.read.jdbc(url, "logdetail", prop)

    //All registration records (userUID, register_time, registration action)
    val registerDF = dataFrame
      .filter(dataFrame("actionName") === ("Registered"))
      .select("userUID","event_time", "actionName")
      .withColumnRenamed("event_time","register_time")
      .withColumnRenamed("userUID","regUID")
//    registerDF.show(5)
    //The raw timestamp looks like 2018-09-04T20:27:31+08:00; keep only the first 10 characters (yyyy-MM-dd)
    val registDF2  = registerDF
      .select(registerDF("regUID"),registerDF("register_time")
        .substr(1,10).as("register_date"),registerDF("actionName"))
      .distinct()
//    registDF2.show(5)


    //All sign-in records (userUID, signing_time, sign-in action)
    val signinDF = dataFrame.filter(dataFrame("actionName") === ("Signin"))
      .select("userUID","event_time", "actionName")
      .withColumnRenamed("event_time","signing_time")
      .withColumnRenamed("userUID","signUID")
//    signinDF.show(5)
    val signiDF2 = signinDF
      .select(signinDF("signUID"),signinDF("signing_time")
        .substr(1,10).as("signing_date"),signinDF("actionName"))
      .distinct()
//    signiDF2.show(5)
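
The filters above assume the action labels in the table are exactly "Registered" and "Signin"; a quick way to confirm which labels actually occur is:

    dataFrame.select("actionName").distinct().show(false)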

2) Find n, the number of day-N registrants who also sign in on day N+1, and m, the number of new users on day N

    //Inner join the registration and sign-in DataFrames on matching userUID
    val joinDF = registDF2
      .join(signiDF2,signiDF2("signUID") === registDF2("regUID"),joinType = "inner")
//    joinDF.show(5)

    //Use Spark's built-in datediff to count users who sign in exactly one day after registering (n)
    val frame = joinDF
      .filter(datediff(joinDF("signing_date"),joinDF("register_date")) === 1)
      .groupBy(joinDF("register_date")).count()
      .withColumnRenamed("count","signcount")
//    frame.show(5)

    //Count the number of new users per registration date (m)
    val frame1 = registDF2
      .groupBy(registDF2("register_date")).count()
      .withColumnRenamed("count","regcount")
//    frame1.show(5)
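
For readers who prefer SQL, the same join, filter, and aggregation can be expressed through temporary views; a sketch assuming the registDF2 and signiDF2 DataFrames built above:

    registDF2.createOrReplaceTempView("reg")
    signiDF2.createOrReplaceTempView("sign")
    val frameSql = spark.sql(
      """SELECT r.register_date, COUNT(*) AS signcount
        |FROM reg r JOIN sign s ON s.signUID = r.regUID
        |WHERE datediff(s.signing_date, r.register_date) = 1
        |GROUP BY r.register_date""".stripMargin)
    // frameSql.show(5)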

3) Retention rate = n/m*100%

    //Join m and n into a single table
    val frame2 = frame
      .join(frame1,"register_date")
    frame2.show()

    //Add a 留存率 (retention rate) column with value n/m, giving the day-1 retention for each registration date
    frame2.withColumn("留存率",frame2("signcount")/frame2("regcount"))
      .show()
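
The code above stores the ratio n/m; if you want the column as a percentage, matching the formula n/m*100%, one possible variant using Spark's built-in round function is:

    import org.apache.spark.sql.functions.round
    frame2.withColumn("留存率", round(frame2("signcount") / frame2("regcount") * 100, 2))
      .show()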

 

4. Source code:

DataClear.scala

package spark

import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
import java.util.Properties

object DataClear {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("DataClear").getOrCreate()
    val sc = spark.sparkContext
    val linesRDD = sc.textFile("C:/Users/Lenovo/Desktop/Working/Python/data/test.log")
    import spark.implicits._
    val line1 = linesRDD.map(x => x.split("\t"))
        //line1.foreach(println)
    val rdd = line1
      .filter(x => x.length == 8)
      .map(x => Row(x(0).trim, x(1).trim, x(2).trim, x(3).trim, x(4).trim, x(5).trim, x(6).trim, x(7).trim))
        //rdd.foreach(println)

    //    Build the mapping between the RDD and the table schema
    val schema = StructType(Array(
      StructField("event_time", StringType),
      StructField("url", StringType),
      StructField("method", StringType),
      StructField("status", StringType),
      StructField("sip", StringType),
      StructField("user_uip", StringType),
      StructField("action_prepend", StringType),
      StructField("action_client", StringType)
    ))
    val orgDF = spark.createDataFrame(rdd, schema)
    //    orgDF.show(5)

    //Deduplicate, drop rows whose status code is not 200, and drop rows with an empty event_time
    //distinct compares and deduplicates entire rows; dropDuplicates deduplicates by the specified columns.
    val ds1 = orgDF.dropDuplicates("event_time", "url")
      .filter(x => x(3) == "200")
      .filter(x => StringUtils.isNotEmpty(x(0).toString))

    //Split the url on "&" and "=" to extract the following parameters:
    //userUID, userSID, userUIP, actionClient, actionBegin, actionEnd, actionType,
    //actionPrepend, actionTest, ifEquipment, actionName, id, progress

    val dfDetail = ds1.map(row => {
      val urlArray = row.getAs[String]("url").split("\\?")
      var map = Map("params" -> "null")
      if (urlArray.length == 2) {
        map = urlArray(1).split("&")
          .map(x => x.split("="))
          .filter(_.length == 2)
          .map(x => (x(0), x(1)))
          .toMap
      }
      (
        row.getAs[String]("event_time"),
        row.getAs[String]("user_uip"),
        row.getAs[String]("method"),
        row.getAs[String]("status"),
        row.getAs[String]("sip"),
        map.getOrElse("actionBegin", ""),
        map.getOrElse("actionEnd", ""),
        map.getOrElse("userUID", ""),
        map.getOrElse("userSID", ""),
        map.getOrElse("userUIP", ""),
        map.getOrElse("actionClient", ""),
        map.getOrElse("actionType", ""),
        map.getOrElse("actionPrepend", ""),
        map.getOrElse("actionTest", ""),
        map.getOrElse("ifEquipment", ""),
        map.getOrElse("actionName", ""),
        map.getOrElse("progress", ""),
        map.getOrElse("id", "")

      )
    }).toDF()
//    dfDetail.show(5)

    val detailRDD = dfDetail.rdd
    val detailSchema = StructType(Array(
      StructField("event_time", StringType),
      StructField("user_uip", StringType),
      StructField("method", StringType),
      StructField("status", StringType),
      StructField("sip", StringType),
      StructField("actionBegin", StringType),
      StructField("actionEnd", StringType),
      StructField("userUID", StringType),
      StructField("userSID", StringType),
      StructField("userUIP", StringType),
      StructField("actionClient", StringType),
      StructField("actionType", StringType),
      StructField("actionPrepend", StringType),
      StructField("actionTest", StringType),
      StructField("ifEquipment", StringType),
      StructField("actionName", StringType),
      StructField("progress", StringType),
      StructField("id", StringType)
    ))

    val detailDF = spark.createDataFrame(detailRDD, detailSchema)
    detailDF.show(10)


    //    overwrite replaces the existing table, append adds to it
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "******")
    prop.put("driver","com.mysql.jdbc.Driver")
    val url = "jdbc:mysql://localhost:3306/python_db"
    println("开始写入数据库")
    detailDF.write.mode("overwrite").jdbc(url,"logDetail",prop)
    println("完成写入数据库")

  }
}

UserAnalysis.scala

package spark

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.datediff

object UserAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("userAnalysis").master("local").getOrCreate()
    val sc = spark.sparkContext
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "******")
    prop.put("driver", "com.mysql.jdbc.Driver")
    val url = "jdbc:mysql://localhost:3306/python_db"
    val dataFrame = spark.read.jdbc(url, "logdetail", prop)
    dataFrame.show(10)

    //All registration records (userUID, register_time, registration action)
    val registerDF = dataFrame.filter(dataFrame("actionName") === ("Registered"))
      .select("userUID","event_time", "actionName")
      .withColumnRenamed("event_time","register_time")
      .withColumnRenamed("userUID","regUID")
//    registerDF.show(5)
    //The raw timestamp looks like 2018-09-04T20:27:31+08:00; keep only the first 10 characters (yyyy-MM-dd)
    val registDF2  = registerDF
      .select(registerDF("regUID"),registerDF("register_time")
        .substr(1,10).as("register_date"),registerDF("actionName"))
      .distinct()
//    registDF2.show(5)


    //All sign-in records (userUID, signing_time, sign-in action)
    val signinDF = dataFrame.filter(dataFrame("actionName") === ("Signin"))
      .select("userUID","event_time", "actionName")
      .withColumnRenamed("event_time","signing_time")
      .withColumnRenamed("userUID","signUID")
//    signinDF.show(5)
    val signiDF2 = signinDF
      .select(signinDF("signUID"),signinDF("signing_time")
        .substr(1,10).as("signing_date"),signinDF("actionName"))
      .distinct()
//    signiDF2.show(5)

    //Inner join the registration and sign-in DataFrames on matching userUID
    val joinDF = registDF2
      .join(signiDF2,signiDF2("signUID") === registDF2("regUID"),joinType = "inner")
//    joinDF.show(5)

    //Use Spark's built-in datediff to count users who sign in exactly one day after registering (n)
    val frame = joinDF
      .filter(datediff(joinDF("signing_date"),joinDF("register_date")) === 1)
      .groupBy(joinDF("register_date")).count()
      .withColumnRenamed("count","signcount")
//    frame.show(5)

    //Count the number of new users per registration date (m)
    val frame1 = registDF2
      .groupBy(registDF2("register_date")).count()
      .withColumnRenamed("count","regcount")
//    frame1.show(5)

    //Join m and n into a single table
    val frame2 = frame
      .join(frame1,"register_date")
//    frame2.show()

    //Add a 留存率 (retention rate) column with value n/m, giving the day-1 retention for each registration date
    frame2.withColumn("留存率",frame2("signcount")/frame2("regcount"))
      .show()

    sc.stop()
  }
}

Origin: blog.csdn.net/Legosnow/article/details/124169161