Spark data analysis and processing


The data files used in this project can be downloaded from the link in the original post (extraction code: 6xdx).
Note: all the code in this article is run in the spark-shell environment.

Use case 1: Data cleaning

Read the log file and convert it to an RDD[Row]

  • Split each line on the Tab character
  • Keep only records that split into exactly 8 fields
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Define the schema: eight string columns, all nullable
val schemaString = "event_time url method status sip user_uip action_prepend action_client"
val fields = schemaString.split("\\s+").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Split each line on Tab and keep only records with exactly 8 fields
val rdd = sc.textFile("file:///data/test.log").map(_.split("\t")).filter(_.size==8)
val rowRDD = rdd.map(x=>Row(x(0), x(1),x(2),x(3), x(4),x(5),x(6), x(7)))
val df = spark.createDataFrame(rowRDD,schema)
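As a quick sanity check before cleaning, the schema and a few parsed rows can be inspected directly in the spark-shell:

df.printSchema()    // all eight columns should be nullable strings
df.show(5, false)   // peek at the first few parsed rows
df.count()          // number of well-formed (8-field) records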

Clean the data

The log splits into the following fields:
event_time
url
method
status
sip
user_uip
action_prepend
action_client
  • Deduplicate the data on the first two columns (event_time and url)
  • Keep only rows whose status code is 200
  • Filter out rows with an empty event_time
  • Split the url query string on "&" and "=" into key-value pairs

Save the data: write it to a MySQL table

// Deduplicate on the first two columns using the DataFrame's dropDuplicates method
// Keep only rows with status code 200
// Filter out rows whose event_time is empty
val ds = df.dropDuplicates("event_time","url")
  .filter(x => x(3) == "200")
  .filter(x => x(0) != null && x(0).toString.trim.nonEmpty)
// Split the url on "?", take the second segment, split it on "&" into parameters, then split each parameter on "=" and collect the key-value pairs into a Map
val ds2 = ds.map(x => {
  val s = x.getAs[String]("url").split("\\?")(1)
  val m = s.split("&").map(_.split("=")).filter(_.size == 2).map(kv => (kv(0), kv(1))).toMap
  (x.getAs[String]("event_time"),
   m.getOrElse("userUID", ""), m.getOrElse("userSID", ""),
   m.getOrElse("actionBegin", ""), m.getOrElse("actionEnd", ""),
   m.getOrElse("actionType", ""), m.getOrElse("actionName", ""),
   m.getOrElse("actionValue", ""), m.getOrElse("actionTest", ""),
   m.getOrElse("ifEquipment", ""),
   x.getAs[String]("method"), x.getAs[String]("status"),
   x.getAs[String]("sip"), x.getAs[String]("user_uip"),
   x.getAs[String]("action_prepend"), x.getAs[String]("action_client"))
})
val df2 = ds2.toDF("event_time", "user_uid", "user_sid", "action_begin",
      "action_end", "action_type", "action_name", "action_value",
      "action_test", "if_equipment", "method", "status", "sip",
      "user_uip", "action_prepend", "action_client")
// Save the data: write it to a MySQL table
val url = "jdbc:mysql://localhost:3306/test"
val prop = new java.util.Properties
prop.setProperty("user", "root")
prop.setProperty("password", "sunyong")
prop.setProperty("driver","com.mysql.jdbc.Driver")
df2.write.mode("overwrite").jdbc(url,"logs",prop)
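To confirm the write, the logs table can be read back through the same JDBC connection and inspected:

val check = spark.read.jdbc(url, "logs", prop)
check.count()   // number of cleaned rows written
check.show(3)   // spot-check a few rows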

Use case 2: User retention analysis

Calculate the user’s next day retention rate

  • Find the total number of users who registered on day 1, call it n
  • Take the intersection of the user IDs that registered that day and the user IDs that signed in the next day; its size m is the number of users retained on the next day
  • Next-day retention rate = m / n * 100%
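For example (illustrative numbers only): if 200 users register on day 1 and 80 of them sign in on day 2, the next-day retention rate is 80 / 200 * 100% = 40%.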

Calculate the user's 7-day retention rate in the same way, with a 7-day gap

val logs = spark.read.jdbc(url,"logs2",prop)
logs.cache()
// 1. Extract the rows with action_name == 'Registered' as a registration table
val registered = logs.filter($"action_name" === "Registered").withColumnRenamed("event_time","register_time").select("user_uid","register_time")
// 2. Extract the rows with action_name == 'Signin' as a sign-in table
val signin = logs.filter($"action_name" === "Signin").withColumnRenamed("event_time","signin_time").select("user_uid","signin_time")
// 3. Join the registration table and the sign-in table on user_uid
val joined = registered.join(signin, registered("user_uid") === signin("user_uid"), "left").drop(signin("user_uid"))
joined.createOrReplaceTempView("j")
// Take the first 10 characters of each timestamp and convert them to Unix timestamps
val register2signin = spark.sql("select user_uid,register_time,signin_time,unix_timestamp(substr(register_time,1,10),'yyyy-MM-dd')registered_date ,unix_timestamp(substr(signin_time,1,10),'yyyy-MM-dd') signin_date from j")
// 4. For next-day retention keep rows whose dates differ by 1 day; for 7-day retention keep rows whose dates differ by 7 days
// 5. Group by date and count to obtain the next-day and 7-day retention numbers
// Next-day retention (86400 seconds = 1 day)
val day_retention = register2signin.filter($"registered_date" === $"signin_date" - 86400).groupBy($"registered_date",$"user_uid").count().groupBy("registered_date").count()
// 7-day retention (604800 seconds = 7 days)
val week_retention = register2signin.filter($"registered_date" === $"signin_date" - 604800).groupBy("registered_date","user_uid").count().groupBy("registered_date").count()
// Write to the database
day_retention.write.mode("overwrite").jdbc(url, "day_retention", prop)    // only two days of data, so this table ends up with a single row
week_retention.write.mode("overwrite").jdbc(url, "week_retention", prop)  // with only two days of data, 7-day retention cannot be computed, so this table is empty
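The two tables above only store the retained-user counts (the m in the formula). A minimal follow-up sketch, assuming the per-day registration counts serve as the n in the formula, that joins the two to produce the next-day retention rate:

import org.apache.spark.sql.functions._
// n: number of users who registered on each date
val registerCount = registered
  .withColumn("registered_date", unix_timestamp(substring($"register_time", 1, 10), "yyyy-MM-dd"))
  .groupBy("registered_date")
  .agg(countDistinct("user_uid").as("n"))
// m: retained users per date (day_retention computed above), joined with n to get m / n * 100%
val dayRate = day_retention.withColumnRenamed("count", "m")
  .join(registerCount, Seq("registered_date"))
  .withColumn("day_retention_rate", $"m" / $"n" * 100)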

Use case 3: Active user analysis

Statistical analysis requirements

  • Read the database and count the number of active users per day
  • Statistical rule: users who have watched or bought a course count as active users
  • Deduplicate by user ID
val logs = spark.read.jdbc(url,"logs2",prop)
// 1. Filter out the course-learning and course-purchase log entries
// 2. Take the first ten characters of event_time as the date
// 3. Deduplicate by user id
// 4. Group by date
// 5. Count the number of users
val activeUserCount = logs.filter($"action_name" === "StartLearn" || $"action_name" === "BuyCourse")
  .map(x => (x.getAs("user_uid").toString, x.getAs("event_time").toString.substring(0, 10)))
  .withColumnRenamed("_2", "date")   // the second tuple element is the date
  .distinct()
  .groupBy("date")
  .count()
  .orderBy("date")
  .cache()
// Write to MySQL
activeUserCount.write.mode("overwrite").jdbc(url, "activeUserCount", prop)
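For reference, the same daily count can be expressed with built-in column functions instead of a typed map; this is only an alternative sketch, assuming event_time always starts with a yyyy-MM-dd date:

import org.apache.spark.sql.functions._
val activeUserCount2 = logs
  .filter($"action_name" === "StartLearn" || $"action_name" === "BuyCourse")
  .select($"user_uid", substring($"event_time", 1, 10).as("date"))  // keep only uid and date
  .distinct()                                                       // one row per (uid, date)
  .groupBy("date").count()
  .orderBy("date")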

Use case 4: Analysis of active user geographic information

Statistical analysis requirements

  • Read raw log data
  • Get the user's access IP from the parsed log data
  • Obtain the province and city corresponding to each IP from an IP library (read the IP file, convert it, and save it to MySQL)
  • Calculate the percentage of users in each region
// Read the data from MySQL
val url = "jdbc:mysql://localhost:3306/test"
val prop = new java.util.Properties
prop.setProperty("user", "root")
prop.setProperty("password", "sunyong")
prop.setProperty("driver","com.mysql.jdbc.Driver")
val logs = spark.read.jdbc(url, "logs2", prop).cache()
// Count the total number of log records
val cnt = logs.count()
// Register UDFs: 1. convert an IP address string to an integer  2. compute each region's proportion
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._
val rate: UserDefinedFunction = udf((x: Double) => x / cnt)
val ip2Int: UserDefinedFunction = udf((x: String) => {
  val y = x.split("\\.")
  y(0).toLong * 256 * 256 * 256 + y(1).toLong * 256 * 256 + y(2).toLong * 256 + y(3).toLong
})
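// Quick check of the ip2Int arithmetic with an illustrative address (not from the original data):
//   "1.2.3.4" -> 1*256*256*256 + 2*256*256 + 3*256 + 4 = 16909060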
// The IP address file was generated with the Chunzhen (纯真) IP lookup tool
// Convert the IP data file to integer form and store it in MySQL
case class IP(startip:String,endip:String,city:String,company:String)
val df = sc.textFile("file:///data/IP.txt").map(_.split("\\s+")).filter(_.size==4).map(x=>IP(x(0),x(1),x(2),x(3))).toDF
// Convert the IP ranges to integer form and save them to the database
val ip = df.withColumn("startIp_Int",ip2Int($"startip")).withColumn("endIp_Int",ip2Int($"endip")).drop("startip").drop("endip")
ip.write.mode("overwrite").jdbc(url, "ip", prop)
// Read the IP data back and create a temporary view
val ipDF = spark.read.jdbc(url, "ip", prop).cache()
ipDF.createOrReplaceTempView("ip")
// Process the log data and create a temporary view
val regionIp = logs.select("user_uid","user_uip").filter(x => x.getAs("user_uip").toString.size>1).withColumn("ip",ip2Int($"user_uip")).drop("user_uip")

regionIp.createOrReplaceTempView("region")

// Join the IP table with the log view and count distinct users per region
val grouped = spark.sql("select * from ip i join region r on r.ip between i.startIp_Int and i.endIp_Int").select("user_uid","city").groupBy("city", "user_uid").count().dropDuplicates("user_uid").groupBy("city").count()

// Save the final result to the database
grouped.orderBy(grouped("count").desc).withColumn("rate", rate(grouped.col("count"))).write.mode("overwrite").jdbc(url, "region", prop)
  • The result is shown in the screenshot in the original post
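Note that cnt above is the total number of log records, so the rate column is each city's share of log lines. If the goal is instead each region's share of distinct users, a possible variant (an assumption on my part, not the original calculation) is:

// Alternative denominator: total number of distinct users rather than total log lines
val totalUsers = logs.select("user_uid").distinct().count()
val regionUserRate = grouped.withColumn("rate", grouped("count") / totalUsers)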

Use case 5: In-depth analysis of user browsing

Statistical analysis requirements

  • Read the log data, using the day as the unit of measurement; a depth value represents how deeply a user browsed
  • Count the number of users at each depth level and the number of visitors to each URL, so that pages can be optimized to improve the site's conversion rate and user stickiness
  • Calculation rule: the number of URLs visited is used as the depth value, e.g.
    1) a user browses three pages today;
    2) a URL is visited by 50 people today
val logs = spark.read.jdbc(url,"logs2",prop)
// 1. Keep rows whose action_prepend field is longer than 10 characters
// 2. Convert the event_time column to a date
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._
val time2date:UserDefinedFunction = udf((x:String)=>x.substring(0, 10))
val filtered = logs.filter(x => x.getAs("action_prepend").toString.length > 10).withColumn("date", time2date(logs.col("event_time"))).cache()
// 3. Group by "date", "user_uid" and "action_prepend" and count, then group by "date" and "user_uid" and count again
// 4. Count per URL
// Number of URLs each user browsed per day (browsing depth)
val user_url = filtered.groupBy("date", "user_uid", "action_prepend").count().groupBy("date", "user_uid").count().orderBy("date", "count")
// Number of users who browsed each URL per day
val url_count = filtered.groupBy("date", "action_prepend", "user_uid").count().groupBy("date", "action_prepend").count().orderBy("date", "count")
// Write to the database
user_url.write.mode("overwrite").jdbc(url, "user_url", prop)
url_count.write.mode("overwrite").jdbc(url, "url_count", prop)

Origin: blog.csdn.net/sun_0128/article/details/107995928