Spark - a small practice in movie data mining (1)

background

       There are some public datasets on the Internet that we can use for practice. This time we use a common movie rating dataset, which is easy to find with a quick web search. Only the format of the dataset is given here:

 1.users.dat 

UserID::Gender::Age::Occupation::Zip-Code

2.ratings.dat

UserID::MovieID::Rating::Timestamp

3.movies.dat

MovieID::Title::Genres

This practice will be implemented in three ways: RDD, DataFrame and Dataset. The list of issues involved is as follows:

 (1) Among all movies, the movie with the highest average rating (best reputation) and the movie watched by the most people (most popular)

 (2) Analyze the top 10 most popular movies for men and the top 10 most popular movies for women

 (3) TopN analysis of the favorite movies of target users in a given age group


 The Spark Core and Spark SQL versions used here are listed below; if your versions differ, the API usage may differ as well:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
    <scope>compile</scope>
</dependency>


Main text

0. Data processing

    This part mainly converts between RDD, DataFrame and Dataset formats to prepare the data for later use.

a. Read the data as RDDs

    val usersRDD = sc.textFile(dataPath + "users.dat")
    val moviesRDD = sc.textFile(dataPath + "movies.dat")
    val occupationsRDD = sc.textFile(dataPath + "occupations.dat")
    val ratingsRDD = sc.textFile(dataPath + "ratings.dat")
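
For completeness, a minimal sketch of the setup assumed above (the spark session, the sc SparkContext and a dataPath); the master, app name and path are placeholders to adapt to your environment:

import org.apache.spark.sql.SparkSession

// Minimal local setup; "data/ml-1m/" is a hypothetical path to wherever the .dat files live
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("MovieAnalysis")
  .getOrCreate()
val sc = spark.sparkContext
val dataPath = "data/ml-1m/"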

b. RDD to DataFrame

     Taking users.dat as an example (the other files are handled similarly):

// Required imports for the schema and Row APIs
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// Provide structure information: a StructType adds schema (metadata) on top of the raw RDD
val schemaforusers = StructType("UserID::Gender::Age::OccupationID::Zip_Code".split("::").
  map(column => StructField(column, StringType, true)))
// RDD[String] -> RDD[Row]
val usersRDDRows = usersRDD.map(_.split("::")).map(line => Row(line(0).trim, line(1).trim,
  line(2).trim, line(3).trim, line(4).trim)) // Turn each record into a Row
// Row RDD + schema -> DataFrame
val usersDataFrame = spark.createDataFrame(usersRDDRows, schemaforusers)
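
Once the DataFrame exists it can be inspected or queried with SQL; a small usage sketch (the view name "users" is just an example):

usersDataFrame.printSchema() // show the schema attached above
usersDataFrame.createOrReplaceTempView("users")
spark.sql("SELECT Gender, count(*) AS cnt FROM users GROUP BY Gender").show()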

     If some columns need other types, such as DoubleType, we can add them separately, e.g. for ratings:

val schemaforratings = StructType("UserID::MovieID".split("::").
      map(column => StructField(column, StringType, true))).
      add("Rating", DoubleType, true).
      add("Timestamp",StringType, true)

     If the data being read already carries structure information but has only been read in as an RDD, you can import the implicit conversions and call rdd.toDF() directly.

     If there is no column information, it needs to be specified:

import spark.implicits._
// rdd here is assumed to be an RDD of pairs; toDF assigns the column names
val testDF = rdd.map { line =>
      (line._1, line._2)
    }.toDF("col1", "col2")

c. RDD to Dataset

    Converting to a Dataset is much more convenient if you already have a DataFrame; it takes two steps:

//1. Define a case class
 case class User(UserID:String, Gender:String, Age:String, OccupationID:String, Zip_Code:String)

 //2.as
 val usersDataSet = usersDataFrame.as[User]

     You can also build a Dataset directly from the RDD, but you need to import the implicit conversions:

 
// Import the implicit conversions after creating the SparkSession
import spark.implicits._
// Define a case class
case class User(UserID: String, Gender: String, Age: String, OccupationID: String, Zip_Code: String)

// Convert directly from the raw RDD by mapping each line into the case class, then toDS
val usersDataSet = usersRDD.map(_.split("::")).map(line =>
  User(line(0).trim, line(1).trim, line(2).trim, line(3).trim, line(4).trim)).toDS()
       A more generic example with the implicit toDS conversion:

import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // Define field names and types
val testDS = rdd.map { line =>
      Coltest(line._1, line._2)
    }.toDS()



1. The movie with the highest average score (best reputation) and the most watched movie (most popular) among all movies

a. RDD implementation

   The RDD implementation works record by record; compared with DataFrame and Dataset, an RDD lets you manipulate every field of every line freely. For the highest average score, the idea is: map each record of ratings.dat to (MovieID, (Rating, 1)), reduceByKey to get each movie's total score and total number of ratings, and divide the two to get the average rating. The code is as follows:

val ratings = ratingsRDD.map(_.split("::")).map(x => (x(0), x(1), x(2))).cache() // Keep (UserID, MovieID, Rating); cached for reuse
    ratings.map(x => (x._2, (x._3.toDouble, 1))) // Format as Key-Value: (MovieID, (Rating, 1))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // Sum to get each movie's total score and number of ratings
      .map(x => (x._2._1 / x._2._2, x._1)) // Compute the average rating as the key
      .sortByKey(false) // Sort in descending order
      .take(10) // Take the Top 10
      .foreach(println) // Print to console
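
The "most watched" half of the question is not shown above; a minimal sketch with the same ratings RDD, counting one view per rating record:

// Most popular: count how many rating records (i.e. viewers) each movie has
ratings.map(x => (x._2, 1)) // (MovieID, 1)
  .reduceByKey(_ + _) // total number of ratings per movie
  .map(x => (x._2, x._1)) // swap so we can sort by the count
  .sortByKey(false) // sort in descending order
  .take(10) // Take the Top 10
  .foreach(println)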

b.dataframe implementation

   The DataFrame API is much richer, so we do not need to touch individual elements: select the needed columns, group, average and sort in one go. The code below gives the Top 10 by average rating:

ratingsDataFrame.select("MovieID", "Rating").groupBy("MovieID").
      avg("Rating").orderBy($"avg(Rating)".desc).show(10)

c.dataset implementation

 The Dataset and DataFrame APIs are basically the same, so the implementation is identical. The Top 10 by average rating:

ratingsDataSet.select("MovieID", "Rating").groupBy("MovieID").
      avg("Rating").orderBy($"avg(Rating)".desc).show(10)



2. Analyze the top 10 most popular movies for men and the top 10 most popular movies for women

a.rdd implementation

      The gender attribute is in users.dat while the rating records are in ratings.dat, so a join is needed. Since the joined data is used several times, it is cached; then filter keeps the male or female records, and the rest is the same as in question 1. The code is as follows:

    val male = "M"
    val female = "F"
    val genderRatings = ratings.map(x => (x._1, (x._1, x._2, x._3))).join(
      usersRDD.map(_.split("::")).map(x => (x(0), x(1)))).cache() //Because it will be used multiple times later, cache it
    genderRatings.take(10).foreach(println)

    / / Filter out the rating of the new information
    val maleFilteredRatings = genderRatings.filter(x => x._2._2.equals("M")).map(x => x._2._1)
    val femaleFilteredRatings = genderRatings.filter(x => x._2._2.equals("F")).map(x => x._2._1)

     maleFilteredRatings.map(x => (x._2, (x._3.toDouble, 1))) // Format as (MovieID, (Rating, 1))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // Total score and number of ratings per movie
      .map(x => (x._2._1 / x._2._2, x._1)) // Average rating as the key
      .sortByKey(false) // Sort in descending order
      .map(x => (x._2, x._1)) // Swap back to (MovieID, average rating)
      .take(10) // Take the Top 10
      .foreach(println) // Print to console

     femaleFilteredRatings.map(x => (x._2, (x._3.toDouble, 1))) // Format as (MovieID, (Rating, 1))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // Total score and number of ratings per movie
      .map(x => (x._2._1 / x._2._2, x._1)) // Average rating as the key
      .sortByKey(false) // Sort in descending order
      .map(x => (x._2, x._1)) // Swap back to (MovieID, average rating)
      .take(10) // Take the Top 10
      .foreach(println) // Print to console

b.dataframe implementation

   The idea is the same, only the API calls differ. The code is as follows:

    val male = "M"
    val female = "F"
    val genderRatingsDataFrame = ratingsDataFrame.join(usersDataFrame, "UserID").cache()


    / / Filter out the scoring information of different genders of men and women
    val maleFilteredRatingsDataFrame = genderRatingsDataFrame.filter("Gender= 'M'").select("MovieID", "Rating")
    val femaleFilteredRatingsDataFrame = genderRatingsDataFrame.filter("Gender= 'F'").select("MovieID", "Rating")

     // Group by movie, compute the average rating, then sort
    maleFilteredRatingsDataFrame.groupBy("MovieID").avg("Rating").orderBy($"avg(Rating)".desc).show(10)

    
    femaleFilteredRatingsDataFrame.groupBy("MovieID").avg("Rating").orderBy($"avg(Rating)".desc, $"MovieID".desc).show(10)

c.dataset implementation

   The Dataset implementation mirrors the DataFrame one, so it is not repeated in full; a typed sketch is given below for reference.
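
A minimal sketch, assuming ratingsDataSet and usersDataSet were built from the case classes in section 0 (the join itself is untyped and yields a DataFrame):

val genderRatingsDS = ratingsDataSet.join(usersDataSet, "UserID").cache()

genderRatingsDS.filter($"Gender" === "M")
  .groupBy("MovieID").avg("Rating")
  .orderBy($"avg(Rating)".desc).show(10)

genderRatingsDS.filter($"Gender" === "F")
  .groupBy("MovieID").avg("Rating")
  .orderBy($"avg(Rating)".desc).show(10)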



3. TopN analysis of favorite movies of target users at a certain age

a. RDD implementation

   For users in a given age range, such as 18 < age < 25, filtering on a raw age inside Spark can create a lot of extra work. Instead, the ages can be bucketed in advance, e.g. mapping 18 <= age < 25 to 18 and 25 <= age < 40 to 25, so the Spark job only needs an equality filter. This bucketing can be done before the Spark job, or with a Hive built-in function; the code below filters on Age == "18" on exactly this assumption. A small illustrative sketch of the idea follows.
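
A minimal sketch of that bucketing idea; the band boundaries just follow the example in the text and are illustrative, not taken from the dataset's documentation:

// Map a raw age to the lower bound of its band, so a range filter becomes an equality filter
def ageBucket(age: Int): String =
  if (age < 18) "1"
  else if (age < 25) "18" // 18 <= age < 25 -> "18"
  else if (age < 40) "25" // 25 <= age < 40 -> "25"
  else "40"               // illustrative catch-all bucket

// e.g. ageBucket(21) == "18", so "users aged 18-24" becomes the equality filter Age == "18"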

    This problem requires a join, and joins can easily cause data skew and a large amount of computation. If one side of the join is small enough to fit in memory, a map-side join can be used instead (Hive's built-in mapjoin). In Spark this can be done with a broadcast variable: broadcast the set of UserIDs that match the target age, so that each executor holds a single copy shared by all of its tasks (i.e. it is shared at the executor level), and a map/filter then replaces the join. That is what the following code does.

// Filter out the target users (those whose Age field is "18")
val targetQQUsers = usersRDD.map(_.split("::")).map(x => (x(0), x(2))).filter(_._2.equals("18"))
// Collect the matching UserIDs to the driver as a HashSet so the set can be broadcast
import scala.collection.immutable.HashSet
val targetQQUsersSet = HashSet() ++ targetQQUsers.map(_._1).collect()
// Broadcast the set of target UserIDs to every executor
val targetQQUsersBroadcast = sc.broadcast(targetQQUsersSet)
// Build a MovieID -> Title map on the driver
val movieID2Name = moviesRDD.map(_.split("::")).map(x => (x(0), x(1))).collect.toMap
// 1. Split each line  2. Keep (UserID, MovieID)  3. Filter by the broadcast user set
// 4. Map to (MovieID, 1)  5. reduceByKey to count views  6. Sort and take the Top 10
ratingsRDD.map(_.split("::")).map(x => (x(0), x(1))).filter(x =>
      targetQQUsersBroadcast.value.contains(x._1)
    ).map(x => (x._2, 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).
      sortByKey(false).map(x => (x._2, x._1)).take(10).
      map(x => (movieID2Name.getOrElse(x._1, null), x._2)).foreach(println)

b.dataframe implementation

   The DataFrame implementation is simpler, using a regular join instead of the map-side join:

ratingsDataFrame.join(usersDataFrame, "UserID").filter("Age = '18'").groupBy("MovieID").
      count().orderBy($"count".desc).show(10)

c.dataset implementation

 The Dataset version is essentially the same, here additionally joining moviesDataSet to show the movie titles:

ratingsDataSet.join(usersDataSet, "UserID").filter("Age = '18'").groupBy("MovieID").
      count().join(moviesDataSet, "MovieID").select("Title", "count").sort($"count".desc).show(10)


Summary

1. Differences between RDD and DataFrame/Dataset programming:

  RDD programming uses relatively simple functions to manipulate every field of every record; the process is more involved but gives more control.

  DataFrame and Dataset code is simpler and operates on whole columns through built-in operators rather than on individual elements of a row, at the cost of fine-grained control.

2. The difference between dataframe and dataset:

   A DataFrame is just Dataset[Row], i.e. a special case of Dataset. In our implementation the difference is not very visible, but a Dataset is strongly typed. For example, when reading a field of a row:

DataFrame:
row.getString(0)
or
row.col("department")

Dataset:
With a Dataset[Person], each record's values are read through typed accessors such as person.getName(), which guarantees type safety at compile time.
  Also, about the schema: a DataFrame carries a schema while an RDD does not. The schema defines the "data structure" of each row, just like the columns of a relational table, and it specifies how many columns a DataFrame has and what they are.
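
A minimal sketch of that contrast, assuming the User case class and the usersDataFrame / usersDataSet built in section 0:

// Untyped DataFrame access: fields come back by position/type, checked only at runtime
val firstRow = usersDataFrame.first()
val genderFromRow = firstRow.getString(1) // relies on knowing that column 1 is Gender and is a String

// Typed Dataset access: field names and types are checked at compile time
val firstUser = usersDataSet.first()
val genderFromUser = firstUser.Gender // compile error if the field name is wrong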
