Background
There are several public datasets on the Internet that we can practice with. This time we use a common movie rating dataset, which is easy to find with a search engine. Only the format of the dataset files is given here:
1.users.dat
UserID::Gender::Age::Occupation::Zip-Code
2.ratings.dat
UserID::MovieID::Rating::Timestamp
3.movies.dat
MovieID::Title::Genres
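To make the "::"-delimited format above concrete, here is a small standalone sketch that parses sample lines of each file type. The specific rows are hypothetical examples that follow the stated formats, not values taken from the dataset.

```scala
// Hypothetical sample lines in the "::"-delimited format described above.
object FormatCheck {
  def main(args: Array[String]): Unit = {
    val userLine   = "1::F::1::10::48067"                               // UserID::Gender::Age::Occupation::Zip-Code
    val ratingLine = "1::1193::5::978300760"                            // UserID::MovieID::Rating::Timestamp
    val movieLine  = "1::Toy Story (1995)::Animation|Children's|Comedy" // MovieID::Title::Genres

    // "::" is a two-character delimiter; String.split takes it as a regex,
    // and ':' is not a regex metacharacter, so this splits on the literal "::"
    val userFields = userLine.split("::")
    println(userFields.mkString(", "))  // 1, F, 1, 10, 48067
    assert(userFields.length == 5)
    assert(ratingLine.split("::").length == 4)
    // Genres are further separated by '|' within the third field
    assert(movieLine.split("::")(2).split('|').length == 3)
  }
}
```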
This practice will be implemented in three ways: RDD, DataFrame, and Dataset. The questions to be answered are as follows:
(1) Among all movies, the movie with the highest average rating (best word of mouth) and the movie with the most viewers (most popular)
(2) Analyze the top 10 most popular movies for men and the top 10 most popular movies for women
(3) TopN analysis of the favorite movies of target users in a certain age group
The Spark and Spark SQL version information used here is as follows; if your versions differ, the API usage may also differ:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
    <scope>compile</scope>
</dependency>
Main text
0. Data processing
This part mainly converts between the RDD, DataFrame, and Dataset formats, to prepare for what follows.
a. Reading RDDs
val usersRDD = sc.textFile(dataPath + "users.dat")
val moviesRDD = sc.textFile(dataPath + "movies.dat")
val occupationsRDD = sc.textFile(dataPath + "occupations.dat")
val ratingsRDD = sc.textFile(dataPath + "ratings.dat")
b. RDD to DataFrame
Take the data in users.dat as an example; the other files are handled similarly:
// Provide the structure information
val schemaforusers = StructType("UserID::Gender::Age::OccupationID::Zip_Code".split("::")
  .map(column => StructField(column, StringType, true)))
// Use the StructType to format the Users data, i.e. add metadata on top of the RDD
// RDD -> RDD[Row]: turn each record into a Row
val usersRDDRows = usersRDD.map(_.split("::"))
  .map(line => Row(line(0).trim, line(1).trim, line(2).trim, line(3).trim, line(4).trim))
// Combine the RDD[Row] with the schema to get a DataFrame
val usersDataFrame = spark.createDataFrame(usersRDDRows, schemaforusers)
If some columns have other types, such as DoubleType, we can add them separately, as with the ratings:
val schemaforratings = StructType("UserID::MovieID".split("::")
    .map(column => StructField(column, StringType, true)))
  .add("Rating", DoubleType, true)
  .add("Timestamp", StringType, true)
If the data being read already carries structure information but has only been read in as an RDD of tuples, you can import the implicit conversions and call rdd.toDF() directly. If there is no column information, it must be specified:
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")
c. RDD to Dataset
With a DataFrame in hand, converting to a Dataset is much more convenient. It takes two steps:
// 1. Define a case class
case class User(UserID: String, Gender: String, Age: String, OccupationID: String, Zip_Code: String)
// 2. Call as[...]
val usersDataSet = usersDataFrame.as[User]
You can also go from an RDD to a Dataset directly, but you need to import the implicit conversions:
// Import the implicit conversions after creating the SparkSession
import sparkSession.implicits._
// Define a case class
case class User(UserID: String, Gender: String, Age: String, OccupationID: String, Zip_Code: String)
// Map each line of the RDD into the case class, then convert to a Dataset
val usersDataSet = usersRDD.map(_.split("::"))
  .map(line => User(line(0).trim, line(1).trim, line(2).trim, line(3).trim, line(4).trim))
  .toDS()
More generally, with the implicit conversions:
import spark.implicits._
// Define the field names and types
case class Coltest(col1: String, col2: Int) extends Serializable
val testDS = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS()
1. The movie with the highest average rating (best word of mouth) and the most-watched movie (most popular) among all movies
a. RDD implementation
The RDD implementation tends to process each row element by element; compared with DataFrame and Dataset, an RDD makes it easy to manipulate every field of every row. For the highest average rating, the idea is to map each record in ratings.dat to a key-value pair (movieID, (rating, 1)), then use reduceByKey to get each movie's total rating and total number of viewers, and divide the two to get the average. The code is as follows:
val ratings = ratingsRDD.map(_.split("::")).map(x => (x(0), x(1), x(2))).cache() // keep (UserID, MovieID, Rating)
ratings.map(x => (x._2, (x._3.toDouble, 1)))         // format as Key-Value: (MovieID, (Rating, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // reduce to get each movie's total score and number of ratings
  .map(x => (x._2._1 / x._2._2, x._1))               // compute the average rating per movie
  .sortByKey(false)                                  // sort in descending order
  .take(10)                                          // take the Top 10
  .foreach(println)                                  // print to the console
b. DataFrame implementation
The DataFrame API is richer, and we no longer need to manipulate individual elements: selecting the required columns, grouping, averaging, and sorting are done in one go. The code for the top 10 highest-rated movies is as follows:
ratingsDataFrame.select("MovieID", "Rating").groupBy("MovieID")
  .avg("Rating").orderBy($"avg(Rating)".desc).show(10)
c. Dataset implementation
The Dataset API is basically the same as the DataFrame API, and the implementation is identical. Top 10 highest-rated:
ratingsDataSet.select("MovieID", "Rating").groupBy("MovieID")
  .avg("Rating").orderBy($"avg(Rating)".desc).show(10)
2. Analyze the top 10 most popular movies for men and the top 10 most popular movies for women
a. RDD implementation
The gender attribute lives in users.dat while the rating information is in ratings.dat, so a join is needed. For efficiency, the joined data is cached; then a filter separates male and female viewers, and the rest is the same as in question 1. The code is as follows:
val male = "M"
val female = "F"
val genderRatings = ratings.map(x => (x._1, (x._1, x._2, x._3)))
  .join(usersRDD.map(_.split("::")).map(x => (x(0), x(1))))
  .cache() // cached because it is used more than once below
genderRatings.take(10).foreach(println)
// Filter out the ratings of each gender
val maleFilteredRatings = genderRatings.filter(x => x._2._2.equals("M")).map(x => x._2._1)
val femaleFilteredRatings = genderRatings.filter(x => x._2._2.equals("F")).map(x => x._2._1)
maleFilteredRatings.map(x => (x._2, (x._3.toDouble, 1)))   // format as Key-Value: (MovieID, (Rating, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))       // total score and number of ratings per movie
  .map(x => (x._2._1 / x._2._2, x._1))                     // average rating per movie
  .sortByKey(false)                                        // sort in descending order
  .map(x => (x._2, x._1))
  .take(10)                                                // take the Top 10
  .foreach(println)                                        // print to the console
femaleFilteredRatings.map(x => (x._2, (x._3.toDouble, 1))) // same pipeline for the female viewers
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .map(x => (x._2._1 / x._2._2, x._1))
  .sortByKey(false)
  .map(x => (x._2, x._1))
  .take(10)
  .foreach(println)
b. DataFrame implementation
The idea is the same; only the functions used are different. The code is as follows:
val male = "M"
val female = "F"
val genderRatingsDataFrame = ratingsDataFrame.join(usersDataFrame, "UserID").cache()
// Filter out the ratings of male and female viewers separately
val maleFilteredRatingsDataFrame = genderRatingsDataFrame.filter("Gender = 'M'").select("MovieID", "Rating")
val femaleFilteredRatingsDataFrame = genderRatingsDataFrame.filter("Gender = 'F'").select("MovieID", "Rating")
// Group -- compute the average rating -- sort
maleFilteredRatingsDataFrame.groupBy("MovieID").avg("Rating").orderBy($"avg(Rating)".desc).show(10)
femaleFilteredRatingsDataFrame.groupBy("MovieID").avg("Rating").orderBy($"avg(Rating)".desc, $"MovieID".desc).show(10)
c. Dataset implementation
The Dataset implementation is similar to the DataFrame one, so it is not repeated here.
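For completeness, here is a sketch of what the Dataset version could look like, using tiny in-memory sample data in place of the real files; the case classes and sample values are assumptions, and the pipeline mirrors the DataFrame version above.

```scala
// A sketch of the Dataset version of question 2, on small in-memory sample data.
import org.apache.spark.sql.SparkSession

case class Rating(UserID: String, MovieID: String, Rating: Double, Timestamp: String)
case class User(UserID: String, Gender: String, Age: String, OccupationID: String, Zip_Code: String)

object GenderTopN {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("GenderTopN").getOrCreate()
    import spark.implicits._

    // Tiny sample data standing in for ratings.dat / users.dat
    val ratingsDataSet = Seq(
      Rating("1", "10", 5.0, "0"), Rating("2", "10", 3.0, "0"),
      Rating("1", "20", 4.0, "0"), Rating("2", "20", 5.0, "0")).toDS()
    val usersDataSet = Seq(
      User("1", "M", "18", "0", "00000"),
      User("2", "F", "25", "0", "00000")).toDS()

    // Join once and cache, exactly as in the DataFrame version
    val genderRatings = ratingsDataSet.join(usersDataSet, "UserID").cache()
    // Group -- average -- sort for each gender
    genderRatings.filter($"Gender" === "M")
      .groupBy("MovieID").avg("Rating")
      .orderBy($"avg(Rating)".desc).show(10)
    genderRatings.filter($"Gender" === "F")
      .groupBy("MovieID").avg("Rating")
      .orderBy($"avg(Rating)".desc).show(10)

    spark.stop()
  }
}
```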
3. TopN analysis of the favorite movies of target users in a certain age group
a. RDD implementation
For users in a particular age range, such as 18 <= age < 25, filtering on raw ages inside Spark can generate a lot of computation. Instead, we can bucket ages in advance: map ages in 18 <= age < 25 to 18, ages in 25 <= age < 40 to 25, and so on, so the Spark job only needs an equality filter. This pre-bucketing can be done before the Spark job, or with a Hive built-in function. (In this dataset the Age field is already bucketed this way, which is why the code below can simply filter on "18".)
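The bucketing idea above can be sketched as a small helper; the object name AgeBucket and the exact upper boundary are assumptions for illustration, with the bucket values (18 and 25) taken from the ranges mentioned in the text.

```scala
// A hypothetical pre-bucketing helper: collapse a raw age into the lower
// bound of its range, so the Spark job only needs an equality filter.
object AgeBucket {
  // Bucket boundaries follow the ranges in the text: [18, 25) -> 18, [25, 40) -> 25
  def bucket(age: Int): Int =
    if (age < 18) 1
    else if (age < 25) 18
    else if (age < 40) 25
    else 40

  def main(args: Array[String]): Unit = {
    println(bucket(21)) // 21 falls in 18 <= age < 25, so prints 18
    assert(bucket(21) == 18)
    assert(bucket(30) == 25)
  }
}
```

With ages stored as buckets, a filter like `_._2.equals("18")` selects the whole 18-24 range in one pass.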
This problem requires a join, and joins can easily lead to data skew and heavy computation. If one side of the join is small enough to fit in memory, we can use a map-side join, which is a built-in Hive feature; in Spark, the same effect is achieved with a broadcast variable. For example, we broadcast the set of user IDs matching the target age, so that it is shared by all tasks within each executor (i.e. at the executor level). In the code below, a map operation over the broadcast set replaces the join.
// Filter out the target users (requires import scala.collection.mutable.HashSet)
val targetQQUsers = usersRDD.map(_.split("::")).map(x => (x(0), x(2))).filter(_._2.equals("18"))
// Convert to a set so it can be broadcast
val targetQQUsersSet = HashSet() ++ targetQQUsers.map(_._1).collect()
// Broadcast
val targetQQUsersBroadcast = sc.broadcast(targetQQUsersSet)
// Build a map from MovieID to title
val movieID2Name = moviesRDD.map(_.split("::")).map(x => (x(0), x(1))).collect.toMap
// 1. Split the lines  2. Select (UserID, MovieID)  3. Filter users via the broadcast variable
// 4. Map to (MovieID, 1)  5. reduceByKey to count  6. Sort and take the Top 10
ratingsRDD.map(_.split("::")).map(x => (x(0), x(1)))
  .filter(x => targetQQUsersBroadcast.value.contains(x._1))
  .map(x => (x._2, 1))
  .reduceByKey(_ + _)
  .map(x => (x._2, x._1))
  .sortByKey(false)
  .map(x => (x._2, x._1))
  .take(10)
  .map(x => (movieID2Name.getOrElse(x._1, null), x._2))
  .foreach(println)
b. DataFrame implementation
The DataFrame implementation is relatively simple, using a plain join instead of the map-side join:
ratingsDataFrame.join(usersDataFrame, "UserID").filter("Age = '18'").groupBy("MovieID")
  .count().orderBy($"count".desc).show(10)
c. Dataset implementation
The Dataset version is essentially the same, here with the movie titles joined in:
ratingsDataSet.join(usersDataSet, "UserID").filter("Age = '18'").groupBy("MovieID")
  .count().join(moviesDataSet, "MovieID").select("Title", "count").sort($"count".desc).show(10)
Summary
1. The difference between RDD on one side and DataFrame/Dataset on the other:
RDD programming uses relatively primitive functions to manipulate every element of every row; the process is more involved, but the degree of control is high.
DataFrame and Dataset programming is simpler, emphasizing declarative, column-oriented operations rather than work on individual elements of a row, at the cost of lower control.
2. The difference between DataFrame and Dataset:
A DataFrame is a Dataset[Row], i.e. a special case of Dataset. In this exercise the difference is not particularly obvious, but a Dataset is strongly typed. For example, when getting an element from a row:
DataFrame: row.getString(0) or row.col("department"). Dataset: with a Dataset[Person], typed accessors such as person.getName can guarantee type safety at compile time. As for the schema: a DataFrame carries a schema of named but untyped columns, like the "columns" of a relational database table, which specifies how many columns there are and their names; a Dataset additionally binds each row to a concrete JVM type instead of a generic Row.
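The type-safety contrast can be shown in a few lines; the Person case class here is a hypothetical example (not part of the dataset), and only the untyped Row API from spark-sql is used.

```scala
// A minimal sketch of the type-safety difference between Row and a typed object.
import org.apache.spark.sql.Row

case class Person(name: String, age: Int)

object TypeSafetyDemo {
  def main(args: Array[String]): Unit = {
    // DataFrame style: a Row is untyped, so mistakes surface only at runtime
    val row: Row = Row("Alice", 30)
    val name = row.getString(0)   // works; row.getString(1) would throw a ClassCastException at runtime
    // Dataset style: a typed object, so mistakes surface at compile time
    val person = Person("Alice", 30)
    val typedName = person.name   // person.salary would simply not compile
    assert(name == typedName)
  }
}
```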