[Movie Recommendation System] Data crawling and loading into a MongoDB database

Overview

This article mainly introduces the data sources and the process of loading the data into the database.

1 Data Acquisition

We use Scrapy to crawl Douban movie data, and then build the rating data from the MovieLens dataset.

1.1 Data set acquisition

1.2 Data crawling

  • Use Scrapy with XPath to crawl Douban movie data, and save the result as a CSV file named movie.csv
  • Preprocess the crawled data: select the relevant fields and clean up problem characters

1.3 Data Conversion

  • Because we lack rating data, we build it from the MovieLens rating file.
  • The movie file movie.csv in the MovieLens dataset contains 2791 movies in total, so we simply keep the first 2791 crawled records.
  • We then replace each crawled movie ID with the corresponding movie ID from the MovieLens movie.csv, so every movie we keep has matching rating data.
  • In the end we only need two files: movie.csv and rating.csv
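The ID-replacement step above can be sketched in plain Scala. This is only an illustration of the idea (the object and method names, and the sample rows in the usage note, are made up; the article's actual conversion script is not shown):

```scala
// Sketch: align crawled movies with the MovieLens rating file by swapping in
// MovieLens movie IDs. Only the first `limit` crawled rows are kept (2791 in
// the article), matched positionally against the MovieLens IDs.
object ConvertIds {
  def replaceIds(crawled: Seq[String], movielensMids: Seq[String], limit: Int): Seq[String] =
    crawled.take(limit).zip(movielensMids).map { case (row, mid) =>
      // The first CSV field is the crawled movie ID; replace it, keep the rest.
      val fields = row.split(",", -1)
      (mid +: fields.tail).mkString(",")
    }
}
```

For example, `ConvertIds.replaceIds(Seq("101,Titanic,desc"), Seq("1"), 1)` yields `Seq("1,Titanic,desc")`: the crawled ID 101 is replaced by the MovieLens ID 1, so this movie now matches MovieLens ratings with mid 1.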

(1) Movie data

The table format is:
mid,title,desc,minute,year,language,genres,actors,director

(2) Rating data

uid,mid,score,timestamp

2 Load data into MongoDB database

We chose MongoDB for the following reasons:

  • With tens of millions of documents and close to 10 GB of data, queries on an indexed ID are no slower than in MySQL, while queries on non-indexed fields are an overall win
  • It supports deep (nested-document) queries

Next, we deploy MongoDB on a cloud server, connect to the database remotely from the host machine, and load the files into it.

2.1 MongoDB installation
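The article does not spell out the installation steps. A minimal sketch for an Ubuntu cloud server, assuming the distribution's stock `mongodb` package and its old-style config file at `/etc/mongodb.conf` (adapt the package name, config path, and service name to your distribution and MongoDB version):

```shell
# Install MongoDB from the distribution repository (Ubuntu)
sudo apt-get update
sudo apt-get install -y mongodb

# Allow remote connections: bind to all interfaces, then restart the service.
# Also open port 27017 in the cloud provider's security group / firewall.
sudo sed -i 's/^bind_ip.*/bind_ip = 0.0.0.0/' /etc/mongodb.conf
sudo systemctl restart mongodb
```

Since the connection URI in section 2.3 authenticates as root, you also need to create that user and enable authentication in MongoDB; those steps are omitted here.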

2.2 maven dependencies

The Maven dependencies and their versions are as follows.

Note: the Spark version must match the version running on the Spark cluster.

  • Scala: 2.11.8
  • Spark: 2.3.0


<properties>
  <scala.version>2.11.8</scala.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
  </dependency>

  <!-- spark-sql provides SparkSession / DataFrame -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
  </dependency>
  <!-- MongoDB Spark connector, used by format("com.mongodb.spark.sql") -->
  <dependency>
    <groupId>org.mongodb.spark</groupId>
    <artifactId>mongo-spark-connector_2.11</artifactId>
    <version>2.3.0</version>
  </dependency>
  <!-- Casbah Scala driver, used below to create indexes -->
  <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>casbah-core_2.11</artifactId>
    <version>3.1.1</version>
  </dependency>
</dependencies>

2.3 Data loading program

// Main data-loading program
import com.mongodb.casbah.commons.MongoDBObject
import com.mongodb.casbah.{MongoClient, MongoClientURI}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

// Movie data: mid,title,desc,minute,year,language,genres,actors,director
case class Movie(mid: Int, title: String, desc: String, minute: String, year: String,
                 language: String, genres: String, actors: String, director: String)

// Rating data: uid,mid,score,timestamp
case class Rating(uid: Int, mid: Int, score: Double, timestamp: Int)

// MongoDB connection configuration
case class MongoConfig(uri: String, db: String)

object DataLoader {

  val MONGODB_MOVIE_COLLECTION = "Movie"
  val MONGODB_RATING_COLLECTION = "Rating"
  val config = Map(
    "spark.cores" -> "local[*]",
    "mongo.uri" -> "mongodb://root:123456@<server public IP>:27017/recommender",
    "mongo.db" -> "recommender"
  )
  // File locations
  val MOVIE_DATA_PATH = "F:\\1-project\\offline\\src\\main\\resources\\file\\movie.csv"
  val RATING_DATA_PATH = "F:\\1-project\\offline\\src\\main\\resources\\file\\ratings.csv"

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("DataLoader")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    import spark.implicits._

    // Load the raw movie data
    val movieRDD = spark.sparkContext.textFile(MOVIE_DATA_PATH)

    // Convert to a DataFrame
    val movieDF = movieRDD.map(
      item => {
        val attr = item.split(",")
        Movie(attr(0).toInt, attr(1).trim, attr(2).trim, attr(3).trim, attr(4).trim,
          attr(5).trim, attr(6).trim, attr(7).trim, attr(8).trim)
      }
    ).toDF()

    val ratingRDD = spark.sparkContext.textFile(RATING_DATA_PATH)

    val ratingDF = ratingRDD.map(item => {
      val attr = item.split(",")
      Rating(attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt)
    }).toDF()

    implicit val mongoConfig = MongoConfig(config("mongo.uri"), config("mongo.db"))
    // Save the data to MongoDB
    storeDataInMongoDB(movieDF, ratingDF)

    spark.stop()
  }

  def storeDataInMongoDB(movieDF: DataFrame, ratingDF: DataFrame)(implicit mongoConfig: MongoConfig): Unit = {

    // Open a new MongoDB connection
    val mongoClient = MongoClient(MongoClientURI(mongoConfig.uri))

    // Write the DataFrames into the corresponding MongoDB collections
    movieDF.write
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_MOVIE_COLLECTION)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    ratingDF.write
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    // Create indexes on the collections
    mongoClient(mongoConfig.db)(MONGODB_MOVIE_COLLECTION).createIndex(MongoDBObject("mid" -> 1))
    mongoClient(mongoConfig.db)(MONGODB_RATING_COLLECTION).createIndex(MongoDBObject("uid" -> 1))
    mongoClient(mongoConfig.db)(MONGODB_RATING_COLLECTION).createIndex(MongoDBObject("mid" -> 1))

    mongoClient.close()
  }
}

2.4 View data

  • Use Mongo Management Studio to check that the data was loaded successfully


Origin blog.csdn.net/weixin_40433003/article/details/132049111