Spark Practice (3): Mining User Consumption Records

background

This exercise mines user behavior from the consumption and access records of a website. The test data files have been uploaded to CSDN and can be downloaded from: https://download.csdn.net/download/u013560925/10342224

a. Data format

The data comes in two formats, JSON and Parquet. As a columnar format, Parquet has clear advantages in storage footprint and query efficiency, which makes it well suited to production use; for details see another post: https://blog.csdn.net/u013560925/article/details/79516741 (read/write performance and methods for csv, parquet, and orc). The data itself has two parts, user.json and userLog.json, holding user information and user behavior respectively. The details are as follows:

user.json

userID:String,name:String,registeredTime:String
User ID / User Name / Registration Time

log.json

logID: Long, userID: Long, time: String, typed: Long, consumed: Double
Log ID/User ID/Timestamp/Behavior Type/Consumption Amount
Behavior type: 1 = purchase, 0 = browse

The Parquet files logparquet.parquet and userparquet.parquet share the same schemas as above.
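
The Dataset examples in sections 3 and 4 below call .as[UserLog] and map rows to helper types, which requires matching case classes. The original post never shows them, so here is a minimal sketch that follows the declared schemas; the field names of the two helper classes (count, consumed) are assumptions inferred from how the code uses them:

// Case classes matching the schemas above (a sketch; not shown in the
// original post). Declare them at top level, not inside a method, so that
// spark.implicits._ can derive encoders for them.
case class UserInfo(userID: String, name: String, registeredTime: String)
case class UserLog(logID: Long, userID: Long, time: String, typed: Long, consumed: Double)
// Helper rows for the week-over-week comparisons in sections 3 and 4:
case class LogOnce(logID: Long, userID: Long, count: Int)
case class ConsumedOnce(logID: Long, userID: Long, consumed: Double)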

b. The mining problems covered this time are as follows:

1. Count the Top 5 users with the most visits in a given time period
2. Count the Top 5 users with the most purchases in a given time period
3. Count the users whose visits increased the most in a certain week
4. Count the users whose consumption increased the most in a certain week
5. Count the top 10 users with the most visits in the two weeks after registration
6. Count the top 10 users with the highest total purchases in the two weeks after registration

c. Version Notes

The Spark and Spark SQL versions used here are listed below; if your versions differ, the API usage may also differ:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
    <scope>compile</scope>
</dependency>
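
If you build with sbt instead of Maven, the equivalent dependencies (same coordinates and versions) would be a sketch like this:

// assumes scalaVersion := "2.11.x", so %% resolves to the _2.11 artifacts
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)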

Main content

0. Data reading and SparkSession initialization

Initialization:

import org.apache.spark.sql.SparkSession

// Any writable directory works as the warehouse location; the original post
// does not show this value, so a local default is assumed here.
val warehouseLocation = "spark-warehouse"

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

// Note: the following imports must be added separately.
import spark.implicits._                 // implicit conversions (e.g. .as[T])
import org.apache.spark.sql.functions._  // built-in functions used in agg (count, sum, round, ...)

Data read:

// .format("parquet") is redundant when calling .parquet(...) directly
val userInfo = spark.read.parquet("/user/wangqi/userparquet.parquet")
val userLog = spark.read.parquet("/user/wangqi/logparquet.parquet")
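
Since the download also ships the same data as user.json and log.json, the JSON variants can be read the same way (the HDFS paths here are assumptions mirroring the Parquet ones):

val userInfoJson = spark.read.json("/user/wangqi/user.json")
val userLogJson = spark.read.json("/user/wangqi/log.json")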

1. Count the Top 5 users with the most visits in a given time period

The focus here is the filter method; alias is also used to name the newly generated aggregate column.

val startTime = "2016-10-01"
val endTime = "2016-11-01"

userLog.filter("time >= '" + startTime + "' and time <= '" + endTime + "' and typed = 0")
  .join(userInfo, userLog("userID") === userInfo("userID"))
  .groupBy(userInfo("name"), userInfo("userID"))
  .agg(count(userLog("userID")).alias("userLogCount"))
  .sort(col("userLogCount").desc)
  .limit(5)
  .show()
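
Building the predicate by string concatenation works, but the same filter can also be written with Column expressions, which avoids quoting mistakes. A sketch of the equivalent query (between is inclusive on both ends, matching >= and <=):

// Equivalent filter using Column expressions instead of a SQL string:
userLog.filter(col("time").between(startTime, endTime) && col("typed") === 0)
  .join(userInfo, userLog("userID") === userInfo("userID"))
  .groupBy(userInfo("name"), userInfo("userID"))
  .agg(count(userLog("userID")).alias("userLogCount"))
  .sort(col("userLogCount").desc)
  .limit(5)
  .show()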

2. Count the Top 5 users with the most purchases in a given time period

Count the Top 5 users who spent the most in a given time period, for example 2016-10-01 ~ 2016-11-01:

userLog.filter("time >= '" + startTime + "' and time <= '" + endTime + "' and typed = 1")
      .join(userInfo, userInfo("userID") === userLog("userID"))
      .groupBy(userInfo("userID"),userInfo("name"))
      .agg(round(sum(userLog("consumed")), 2).alias("totalConsumed"))
      .sort(col("totalConsumed").desc) //Use col to select a newly generated column and sort in descending order
      .limit(5)
      .show
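
Since spark-sql is on the classpath and a SparkSession is available, the same Top 5 purchase query can also be written in plain SQL after registering temp views; a sketch assuming the DataFrames defined above:

userLog.createOrReplaceTempView("user_log")
userInfo.createOrReplaceTempView("user_info")

spark.sql(
  """
    |SELECT i.userID, i.name, ROUND(SUM(l.consumed), 2) AS totalConsumed
    |FROM user_log l
    |JOIN user_info i ON l.userID = i.userID
    |WHERE l.time >= '2016-10-01' AND l.time <= '2016-11-01' AND l.typed = 1
    |GROUP BY i.userID, i.name
    |ORDER BY totalConsumed DESC
    |LIMIT 5
  """.stripMargin).show()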

3. Count the users whose visits increased the most in a certain week

To find the users whose visit count grew the most week over week, weight each visit from last week as -1 and each visit from this week as 1, then union the two datasets and sum per user; a negative total means the user visited more last week.
val userLogDS = userLog.as[UserLog].filter("time >= '2016-10-08' and time <= '2016-10-14' and typed = 0")
  .map(log => LogOnce(log.logID, log.userID, 1))
  .union(userLog.as[UserLog].filter("time >= '2016-10-01' and time <= '2016-10-07' and typed = 0")
    .map(log => LogOnce(log.logID, log.userID, -1)))

userLogDS.join(userInfo.as[UserInfo], userLogDS("userID") === userInfo("userID"))
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(sum(userLogDS("count")).alias("viewCountIncreased"))
  .sort(col("viewCountIncreased").desc) // use col to reference the newly generated column and sort descending
  .limit(5)
  .show()
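
The same week-over-week delta can also be computed in a single pass, weighting each row +1 or -1 with when() instead of building and unioning two datasets. This variant is a sketch, not from the original post:

// Weight each visit: +1 if it falls in this week, -1 if in last week.
val weighted = userLog
  .filter("typed = 0 and time >= '2016-10-01' and time <= '2016-10-14'")
  .withColumn("weight", when(col("time") >= "2016-10-08", 1).otherwise(-1))

weighted.join(userInfo, weighted("userID") === userInfo("userID"))
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(sum(weighted("weight")).alias("viewCountIncreased"))
  .sort(col("viewCountIncreased").desc)
  .limit(5)
  .show()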

4. Count the users with the largest increase in consumption in a certain week

val userLogDs = userLog.as[UserLog].filter("time >= '2016-10-08' and time <= '2016-10-14' and typed = 1")
  .map(log => ConsumedOnce(log.logID, log.userID, log.consumed))
  .union(userLog.as[UserLog].filter("time >= '2016-10-01' and time <= '2016-10-07' and typed = 1")
    .map(log => ConsumedOnce(log.logID, log.userID, -log.consumed)))

userLogDs.join(userInfo.as[UserInfo], userLogDs("userID") === userInfo("userID"))
  .groupBy(userInfo("userID"), userInfo("name")) // qualify the columns: a bare col("userID") is ambiguous after the join
  .agg(round(sum(col("consumed")), 2).alias("viewConsumedIncreased"))
  .sort(col("viewConsumedIncreased").desc)
  .limit(5)
  .show()


5. Count the top 10 users with the most visits in the two weeks after registration

This query filters on multiple conditions across the joined tables; the conditions are chained with &&, and note that columns must be referenced as userLog("columnName"). An explicitly typed variant of the date logic is sketched after the code below.

userLog.join(userInfo, userInfo("userID") === userLog("userID"))
  .filter(userInfo("registeredTime") >= "2016-10-01"
    && userInfo("registeredTime") <= "2016-10-14"
    && userLog("time") >= userInfo("registeredTime")
    && userLog("time") <= date_add(userInfo("registeredTime"), 14)
    && userLog("typed") === 0)
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(count(userLog("time")).alias("logTimes"))
  .sort(col("logTimes").desc)
  .limit(10)
  .show()
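
Because time and registeredTime are stored as strings, the comparison against date_add (which returns a date column) relies on Spark's implicit casts. A sketch, not from the original post, that makes the date types explicit with to_date:

userLog.join(userInfo, userInfo("userID") === userLog("userID"))
  .filter(to_date(userInfo("registeredTime")).between("2016-10-01", "2016-10-14")
    && to_date(userLog("time")) >= to_date(userInfo("registeredTime"))
    && to_date(userLog("time")) <= date_add(to_date(userInfo("registeredTime")), 14)
    && userLog("typed") === 0)
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(count(userLog("time")).alias("logTimes"))
  .sort(col("logTimes").desc)
  .limit(10)
  .show()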


6. Count the top 10 users with the highest total purchases in the two weeks after registration

userLog.join(userInfo, userInfo("userID") === userLog("userID"))
  .filter(userInfo("registeredTime") >= "2016-10-01"
    && userInfo("registeredTime") <= "2016-10-14"
    && userLog("time") >= userInfo("registeredTime")
    && userLog("time") <= date_add(userInfo("registeredTime"), 14)
    && userLog("typed") === 1)
  .groupBy(userInfo("userID"), userInfo("name"))
  .agg(round(sum(userLog("consumed")), 2).alias("totalConsumed"))
  .sort(col("totalConsumed").desc)
  .limit(10)
  .show()


Conclusion

I forgot to take a screenshot of the running results; they can be added later if needed. This exercise uses the Spark 2.2 API and Datasets to manipulate the data. The main goal was to get familiar with the filter function, combined with groupBy, sort, agg, and other functions, for analyzing user behavior.
