Background
This exercise mines user behavior from the consumption and access records of a website. The test data files have been uploaded to CSDN; the download address is: https://download.csdn.net/download/u013560925/10342224
a. Data format
The data comes in two formats, JSON and Parquet. As a columnar format, Parquet has clear advantages in storage space and query efficiency and is well suited to production use; for details see another post: https://blog.csdn.net/u013560925/article/details/79516741 (read/write performance and methods for CSV, Parquet, and ORC). The data has two parts, user information and user behavior logs, as follows:
user.json
userID: String, name: String, registeredTime: String — user ID / user name / registration time
log.json
logID: Long, userID: Long, time: String, typed: Long, consumed: Double — log ID / user ID / timestamp / behavior type / consumption amount. Behavior type: 1 = purchase, 0 = browse.
The files logparquet.parquet and userparquet.parquet have the same schemas as above.
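The typed Dataset code below (`.as[UserLog]`, `LogOnce`, `ConsumedOnce`) relies on case classes that the original post does not show. A minimal sketch, with names and field types inferred from the schemas above and the code below (so an assumption, not the author's exact definitions), might look like:

```scala
// Hypothetical case classes matching the schemas above; field names and
// types are inferred from log.json and the Dataset code in this post.
case class UserInfo(userID: Long, name: String, registeredTime: String)
case class UserLog(logID: Long, userID: Long, time: String, typed: Long, consumed: Double)
case class LogOnce(logID: Long, userID: Long, count: Long)           // +1 / -1 visit marker
case class ConsumedOnce(logID: Long, userID: Long, consumed: Double) // signed consumption

val log  = UserLog(1L, 42L, "2016-10-05 09:30:00", 0, 0.0)
val once = LogOnce(log.logID, log.userID, 1)
```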
b. The mining problems covered this time are as follows:
1. Count the Top 5 users with the most visits in a given time period
2. Count the Top 5 users with the most purchases in a given time period
3. Count the users with the largest increase in visits in a given week
4. Count the users with the largest increase in consumption in a given week
5. Count the Top 10 users with the most visits in the two weeks after registration
6. Count the Top 10 users with the most total purchases in the two weeks after registration
c. Version notes
The Spark and Spark SQL versions used here are as follows; with different versions the API usage may differ:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
    <scope>compile</scope>
</dependency>
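For readers using sbt instead of Maven, the equivalent coordinates would be along these lines (an assumption about the build tool; the post itself only shows the Maven form):

```scala
// sbt equivalent of the Maven dependencies above; %% appends the
// Scala binary version (_2.11) to the artifact name automatically.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)
```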
Main text
0. Data reading and SparkSession initialization
initialization:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation) // warehouseLocation defined elsewhere
  .enableHiveSupport()
  .getOrCreate()

// Note: these imports must come after the session is created
import spark.implicits._                 // implicit conversions, e.g. .as[T]
import org.apache.spark.sql.functions._  // built-in functions used in agg
Data read:
// .parquet() already implies the format, so format("parquet") is redundant
val userInfo = spark.read.parquet("/user/wangqi/userparquet.parquet")
val userLog  = spark.read.parquet("/user/wangqi/logparquet.parquet")
1. Count the Top 5 with the most visits in a specific time period
The focus here is the filter method; alias is also used to rename a newly generated column.
val startTime = "2016-10-01"
val endTime   = "2016-11-01"

userLog.filter("time >= '" + startTime + "' and time <= '" + endTime + "' and typed = 0")
  .join(userInfo, userLog("userID") === userInfo("userID"))
  .groupBy(userInfo("name"), userInfo("userID"))
  .agg(count(userLog("userID")).alias("userLogCount"))
  .sort(col("userLogCount").desc)
  .limit(5)
  .show()
2. Count the Top 5 with the most purchases in a specific time period
For example, the period 2016-10-01 ~ 2016-11-01, reusing startTime and endTime from above:
userLog.filter("time >= '" + startTime + "' and time <= '" + endTime + "' and typed = 1")
  .join(userInfo, userInfo("userID") === userLog("userID"))
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(round(sum(userLog("consumed")), 2).alias("totalConsumed"))
  .sort(col("totalConsumed").desc) // use col to select the newly generated column, descending
  .limit(5)
  .show()
3. Count the users with the largest increase in visits in a certain week
val userLogDS = userLog.as[UserLog]
  .filter("time >= '2016-10-08' and time <= '2016-10-14' and typed = '0'")
  .map(log => LogOnce(log.logID, log.userID, 1))      // this week: +1
  .union(userLog.as[UserLog]
    .filter("time >= '2016-10-01' and time <= '2016-10-07' and typed = '0'")
    .map(log => LogOnce(log.logID, log.userID, -1)))  // previous week: -1

userLogDS.join(userInfo.as[UserInfo], userLogDS("userID") === userInfo("userID"))
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(sum(userLogDS("count")).alias("viewCountIncreased"))
  .sort(col("viewCountIncreased").desc) // use col to select the newly generated column, descending
  .limit(5)
  .show()
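The +1/-1 union trick above can be illustrated with plain Scala collections (toy, made-up data rather than the real logs): tag this week's events with +1 and last week's with -1, then sum per user; the sign of the sum gives the direction of change.

```scala
// Toy illustration of the +1/-1 union trick (made-up data).
case class Tagged(userID: Long, count: Int)

val thisWeek = Seq(Tagged(1, 1), Tagged(1, 1), Tagged(2, 1))    // user 1: 2 visits, user 2: 1
val lastWeek = Seq(Tagged(1, -1), Tagged(2, -1), Tagged(2, -1)) // user 1: 1 visit,  user 2: 2

// Union, then sum per user — mirroring groupBy(...).agg(sum("count"))
val delta: Map[Long, Int] =
  (thisWeek ++ lastWeek).groupBy(_.userID).map { case (id, rows) =>
    id -> rows.map(_.count).sum
  }
// user 1 gained one visit, user 2 lost one
```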
4. Count the users with the largest increase in consumption in a certain week
val userLogDs = userLog.as[UserLog]
  .filter("time >= '2016-10-08' and time <= '2016-10-14' and typed = '1'")
  .map(log => ConsumedOnce(log.logID, log.userID, log.consumed))    // this week: +amount
  .union(userLog.as[UserLog]
    .filter("time >= '2016-10-01' and time <= '2016-10-07' and typed = '1'")
    .map(log => ConsumedOnce(log.logID, log.userID, -log.consumed))) // previous week: -amount

userLogDs.join(userInfo.as[UserInfo], userLogDs("userID") === userInfo("userID"))
  .groupBy(userInfo("userID"), userInfo("name")) // qualify with userInfo to avoid ambiguity after the join
  .agg(round(sum(col("consumed")), 2).alias("viewConsumedIncreased"))
  .sort(col("viewConsumedIncreased").desc)
  .limit(5)
  .show()
5. Count the Top 10 users with the most visits in the two weeks after registration
The filter combines multiple join conditions with &&; note that columns must be selected as userLog("columnName").
userLog.join(userInfo, userInfo("userID") === userLog("userID"))
  .filter(userInfo("registeredTime") >= "2016-10-01" &&
          userInfo("registeredTime") <= "2016-10-14" &&
          userLog("time") >= userInfo("registeredTime") &&
          userLog("time") <= date_add(userInfo("registeredTime"), 14) &&
          userLog("typed") === 0) // the column is "typed", not "type"
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(count(userLog("time")).alias("logTimes"))
  .sort(col("logTimes").desc)
  .limit(10) // Top 10, as the task states
  .show()
6. Count the top 10 people with the most purchases in the first two weeks after registration
userLog.join(userInfo, userInfo("userID") === userLog("userID"))
  .filter(userInfo("registeredTime") >= "2016-10-01" &&
          userInfo("registeredTime") <= "2016-10-14" &&
          userLog("time") >= userInfo("registeredTime") &&
          userLog("time") <= date_add(userInfo("registeredTime"), 14) &&
          userLog("typed") === 1)
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(round(sum(userLog("consumed")), 2).alias("totalConsumed"))
  .sort(col("totalConsumed").desc)
  .limit(10)
  .show()
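The two-week window expressed by date_add(registeredTime, 14) can be checked in plain Scala with java.time (the dates here are made-up examples, not taken from the dataset):

```scala
import java.time.LocalDate

// Mirror of the filter: registeredTime <= time <= registeredTime + 14 days
val registered = LocalDate.parse("2016-10-03")
val windowEnd  = registered.plusDays(14) // what date_add(registeredTime, 14) computes

def inWindow(t: LocalDate): Boolean =
  !t.isBefore(registered) && !t.isAfter(windowEnd)
```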
Conclusion
I forgot to take a screenshot of the running results; they can be added later if needed. This exercise used the Spark 2.2 API and Datasets to manipulate the data, mainly to get familiar with the filter function, combined with groupBy, sort, agg, and other functions to analyze user behavior.