Overview
This article introduces the data sources for the project and the process of loading the data into the database.
1 Data Acquisition
We use Scrapy to crawl Douban movie data, and then use the MovieLens dataset to generate the rating data.
1.1 Data set acquisition
- Dataset: the MovieLens dataset, downloadable from the MovieLens official website
- The dataset includes the movies, ratings, and tags files
1.2 Data crawling
- Use Scrapy with XPath to crawl Douban movie data and save it as a CSV file named movie.csv
- Preprocess the crawled data, including field selection and cleanup of special characters
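The preprocessing step can be sketched in plain Scala. The field names and cleaning rules below are illustrative assumptions, not the original crawler's code:

```scala
// Illustrative preprocessing helpers (assumed rules, not the original script):
// trim whitespace, drop surrounding quotes, and replace embedded commas so
// the output stays splittable on ','.
object Preprocess {
  // Normalize one raw field value.
  def cleanField(raw: String): String =
    raw.trim.stripPrefix("\"").stripSuffix("\"").trim.replace(",", "/")

  // Keep only the wanted fields, in the output column order.
  def selectFields(row: Map[String, String], wanted: Seq[String]): Seq[String] =
    wanted.map(k => cleanField(row.getOrElse(k, "")))
}
```

Replacing embedded commas (rather than quoting them) keeps the later `split(",")` in the loader simple; that is a design choice assumed here.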
1.3 Data Conversion
- Since we lack rating data for the crawled movies, we use the MovieLens ratings file to generate it.
- The movies file in the MovieLens dataset contains 2791 movies, so we keep only the first 2791 crawled movies in movie.csv.
- We replace each crawled movie's ID with the corresponding MovieLens movie ID, so every movie in the final data has matching rating data.
- In the end we only need two files: movie.csv and ratings.csv.
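The truncate-and-remap step above can be sketched in plain Scala. This is an illustration under the assumptions stated in the comments, not the original conversion script (file I/O is omitted):

```scala
// Sketch of the ID-remapping step: keep the first n crawled movies and
// overwrite their IDs with the MovieLens movie IDs, so the MovieLens ratings
// line up with the crawled movies. Assumes each line's first field is the ID
// and that fields contain no embedded commas.
object Remap {
  def remapIds(crawled: Seq[String], movielensIds: Seq[Int]): Seq[String] = {
    val n = movielensIds.length                 // e.g. 2791 MovieLens movies
    crawled.take(n).zip(movielensIds).map { case (line, mid) =>
      val fields = line.split(",", -1)          // -1 keeps empty trailing fields
      (mid.toString +: fields.tail).mkString(",")
    }
  }
}
```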
(1) Movie data
The table format is:
mid,title,desc,minute,year,language,genres,actors,director
(2) Rating data
userID,mid,score,timestamp
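A minimal parse of one line of each format, following the schemas above (splitting on ',' assumes no embedded commas; the sample values in the comments are hypothetical):

```scala
// Minimal per-line parsers for the two formats above.
object Formats {
  // movie: mid,title,desc,minute,year,language,genres,actors,director
  def parseMovieMid(line: String): Int = line.split(",")(0).trim.toInt

  // rating: userID,mid,score,timestamp
  def parseRating(line: String): (Int, Int, Double, Int) = {
    val a = line.split(",")
    (a(0).trim.toInt, a(1).trim.toInt, a(2).trim.toDouble, a(3).trim.toInt)
  }
}
```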
2 Load data into MongoDB database
We choose the MongoDB database for the following reasons:
- With tens of millions of documents (nearly 10 GB of data), queries on an indexed ID are no slower than in MySQL, and queries on non-indexed fields are an overall win
- It supports deep (nested) queries
Next, we deploy MongoDB on a cloud server; the host connects to the database remotely and loads the files into it.
2.1 MongoDB installation
- Installation tutorial: Install MongoDB on Linux
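For the host to connect to MongoDB remotely, mongod must listen on the public interface and should require authentication. A minimal sketch of the relevant settings in /etc/mongod.conf (the values are assumptions for this setup):

```yaml
# /etc/mongod.conf (fragment): allow remote connections and require auth.
net:
  port: 27017
  bindIp: 0.0.0.0          # listen on all interfaces, not just localhost
security:
  authorization: enabled   # require username/password, as in the connection URI
```

With authorization enabled, the user and password referenced in the connection URI must be created in MongoDB beforehand.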
2.2 maven dependencies
The Maven dependency versions are as follows.
Note: the Spark version must match the version running on the Spark cluster.
- Scala: 2.11.8
- Spark: 2.3.0
<properties>
<scala.version>2.11.8</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.0</version>
</dependency>
</dependencies>
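The loader in 2.3 also uses Spark SQL, the MongoDB Spark connector, and Casbah, which need their own dependencies. The artifact versions below are assumptions chosen to match Spark 2.3.0 and Scala 2.11; verify them against your cluster before use:

```xml
<!-- Assumed additional dependencies for the data loader; verify versions. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.0</version>
</dependency>
<dependency>
  <groupId>org.mongodb.spark</groupId>
  <artifactId>mongo-spark-connector_2.11</artifactId>
  <version>2.3.0</version>
</dependency>
<dependency>
  <groupId>org.mongodb</groupId>
  <artifactId>casbah-core_2.11</artifactId>
  <version>3.1.1</version>
</dependency>
```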
2.3 Data loading program
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import com.mongodb.casbah.Imports._

// Case classes matching the two file formats above
case class Movie(mid: Int, title: String, desc: String, minute: String, year: String,
                 language: String, genres: String, actors: String, director: String)
case class Rating(uid: Int, mid: Int, score: Double, timestamp: Int)
// MongoDB connection settings
case class MongoConfig(uri: String, db: String)

// Main program for loading the data
object DataLoader {

  val MONGODB_MOVIE_COLLECTION = "Movie"
  val MONGODB_RATING_COLLECTION = "Rating"

  val config = Map(
    "spark.cores" -> "local[*]",
    "mongo.uri" -> "mongodb://root:123456@<server-public-IP>:27017/recommender",
    "mongo.db" -> "recommender"
  )

  // File locations
  val MOVIE_DATA_PATH = "F:\\1-project\\offline\\src\\main\\resources\\file\\movie.csv"
  val RATING_DATA_PATH = "F:\\1-project\\offline\\src\\main\\resources\\file\\ratings.csv"

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("DataLoader")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    import spark.implicits._

    // Load the movie data and convert it to a DataFrame.
    // Note: split(",") assumes no field contains an embedded comma.
    val movieRDD = spark.sparkContext.textFile(MOVIE_DATA_PATH)
    val movieDF = movieRDD.map(
      item => {
        val attr = item.split(",")
        Movie(attr(0).toInt, attr(1).trim, attr(2).trim, attr(3).trim, attr(4).trim,
          attr(5).trim, attr(6).trim, attr(7).trim, attr(8).trim)
      }
    ).toDF()

    // Load the rating data and convert it to a DataFrame
    val ratingRDD = spark.sparkContext.textFile(RATING_DATA_PATH)
    val ratingDF = ratingRDD.map(item => {
      val attr = item.split(",")
      Rating(attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt)
    }).toDF()

    implicit val mongoConfig = MongoConfig(config("mongo.uri"), config("mongo.db"))
    // Save the data to MongoDB
    storeDataInMongoDB(movieDF, ratingDF)
    spark.stop()
  }

  def storeDataInMongoDB(movieDF: DataFrame, ratingDF: DataFrame)(implicit mongoConfig: MongoConfig): Unit = {
    // Open a MongoDB connection
    val mongoClient = MongoClient(MongoClientURI(mongoConfig.uri))

    // Write the DataFrames to the corresponding MongoDB collections
    movieDF.write
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_MOVIE_COLLECTION)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()
    ratingDF.write
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    // Create indexes on the collections
    mongoClient(mongoConfig.db)(MONGODB_MOVIE_COLLECTION).createIndex(MongoDBObject("mid" -> 1))
    mongoClient(mongoConfig.db)(MONGODB_RATING_COLLECTION).createIndex(MongoDBObject("uid" -> 1))
    mongoClient(mongoConfig.db)(MONGODB_RATING_COLLECTION).createIndex(MongoDBObject("mid" -> 1))
    mongoClient.close()
  }
}
2.4 View data
- Use the tool Mongo Management Studio to check whether the load succeeded