Spark-Based Movie Recommendation System (Recommendation System, Part 1)

Part IV - Recommendation Systems - Projects

Business Background:

Fast: Apache Spark computes in memory.
General: a one-stop solution for many kinds of problems, including ad hoc SQL queries, stream computing, data mining, and other computation.
Complete ecosystem.
Once you have Spark, it can significantly speed up most enterprise big data scenarios.

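Recommendation systems, typified by "Guess you like" features, show up everywhere from food and clothing to housing and travel.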

Project Background:

This project is a movie recommendation system based on Apache Spark.
Technical approach: offline recommendation + real-time recommendation

Project Architecture:

[Figure: project architecture diagram]

  • Storage layer: HDFS as the underlying storage, with Hive as the data warehouse (Hive Metastore: manages the schema of Hive data)
  • Offline data processing: SparkSQL (serves as the query engine and handles data ETL)
  • Real-time data processing: Kafka + Spark Streaming
  • Application layer: MLlib produces the model using the ALS algorithm
  • Data presentation and integration: Zeppelin

    Selection considerations:
    Whether in performance, stability, or throughput, HDFS holds the dominant position among mainstream storage file systems.
    If HDFS storage still feels too slow, SSDs or other options can be used.

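      Data processing layer components:
      Hive can be chosen as the compute engine when the data volume is not very large or the real-time requirements are not that strict.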
    
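      The message queue is generally still Kafka; on the consumer side, Flink, Storm, and others can also be used.
      The advantage of Spark Streaming is that it already integrates well with the other components.
      Here we write a KafkaProducer job that puts data into Kafka in real time.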
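
A minimal sketch of what such a producer job does, written in Python for brevity. It uses the standard library's `queue.Queue` as an in-process stand-in for a Kafka topic, so nothing here is the Kafka API. The event fields (`userId`, `movieId`, `rating`, `timestamp`), the function names, and the sample ratings are all assumptions for illustration. A real job would hand the same serialized bytes to a Kafka producer client, and Spark Streaming would consume them in micro-batches.

```python
import json
import queue
import time


def make_rating_event(user_id, movie_id, rating, timestamp=None):
    """Serialize one rating as UTF-8 JSON bytes, the payload a producer would send."""
    event = {
        "userId": user_id,
        "movieId": movie_id,
        "rating": rating,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")


def produce(topic, ratings):
    """Push each (user, movie, rating, timestamp) tuple onto the stand-in topic."""
    for user_id, movie_id, rating, ts in ratings:
        topic.put(make_rating_event(user_id, movie_id, rating, ts))


def consume_batch(topic, max_events):
    """Drain up to max_events messages, roughly what one micro-batch would see."""
    batch = []
    while len(batch) < max_events:
        try:
            payload = topic.get_nowait()
        except queue.Empty:
            break
        batch.append(json.loads(payload.decode("utf-8")))
    return batch


# Stand-in for a Kafka topic, fed with hypothetical sample ratings.
ratings_topic = queue.Queue()
produce(ratings_topic, [(1, 10, 4.0, 1000), (1, 20, 3.5, 1001), (2, 10, 5.0, 1002)])
first_batch = consume_batch(ratings_topic, max_events=2)
second_batch = consume_batch(ratings_topic, max_events=2)
```

Keeping serialization in its own function means the payload format can be checked without a running broker.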
    
      Application layer: MLlib is Spark's packaged library for data mining and machine learning; ALS is one of its algorithms.
      http://spark.apache.org/docs/1.6.3/mllib-guide.html
      http://spark.apache.org/docs/latest/ml-guide.html
      TensorFlow leans toward deep learning.
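
To make the idea behind ALS (alternating least squares) concrete, here is a deliberately tiny sketch in plain Python. It is not MLlib's implementation: it uses a single latent factor per user and per item, runs on one machine, and the function names, the toy ratings, and the `reg` value are assumptions for illustration. MLlib's ALS applies the same alternate-and-solve idea with rank-k factors on distributed data.

```python
def als_rank1(ratings, reg=0.01, iterations=20):
    """Fit r[u, i] ~ x[u] * y[i] by alternating least squares with one latent factor.

    ratings: dict mapping (user, item) -> rating.
    Returns (x, y): dicts of per-user and per-item factors.
    """
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    x = {u: 1.0 for u in users}
    y = {i: 1.0 for i in items}

    for _ in range(iterations):
        # Fix the item factors; each user factor then has a closed-form ridge solution.
        for u in users:
            num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(y[i] ** 2 for (uu, i) in ratings if uu == u)
            x[u] = num / den
        # Fix the user factors; solve for each item factor the same way.
        for i in items:
            num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(x[u] ** 2 for (u, ii) in ratings if ii == i)
            y[i] = num / den
    return x, y


def predict(x, y, user, item):
    """Predicted rating is the product of the user and item factors."""
    return x[user] * y[item]


# Hypothetical toy data: user "b" rates everything half as high as user "a".
observed = {("a", "m1"): 4.0, ("a", "m2"): 2.0, ("b", "m1"): 2.0}
x, y = als_rank1(observed)
guess = predict(x, y, "b", "m2")  # the rating that was never observed
```

Because the toy ratings follow an exact rank-1 pattern, the unobserved rating for `("b", "m2")` comes out close to 1, half of what user "a" gave the same movie.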
    
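      Zeppelin: offers a variety of charts and integrates with more components, but its job scheduling is somewhat weaker.
      HUE: data display + job scheduling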
    
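      The system uses standalone mode, which is simpler.
      If you only have a Spark environment, use standalone mode.
      With Hadoop + Spark, Spark on YARN is recommended.
      Spark on Docker: tasks are packaged as individual Docker containers that do not depend on the physical machine environment, and resources can be allocated to each container more precisely.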

Main modules:

  • Storage module: build and configure HDFS as the distributed storage system, with HBase and MySQL as alternatives

  • ETL module: loads, cleans, and processes the raw data, and prepares the data required by the model training module and the recommendation module.

  • Model training module: responsible for producing the model, as well as finding the best model

  • Recommendation module: covers offline recommendation and real-time recommendation. Offline recommendation writes its results into the storage system.
    Real-time recommendation produces messages to the message queue in real time, consumes them to generate real-time recommendation results, and finally stores those results in the storage module.

  • Data presentation module: responsible for displaying the data used in the project

Data flow:

[Figure: data flow diagram]
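
One common way to "find the best model", as the model training module is described above, is to train one model per hyperparameter candidate and keep the one with the lowest RMSE on held-out ratings. The sketch below shows only that selection loop. The model in it (a user's mean rating shrunk toward the global mean) is a stand-in chosen to keep the example short, not the project's ALS model, and all names and sample ratings are assumptions.

```python
import math


def rmse(predict, ratings):
    """Root-mean-square error of predict(user, item) over (user, item, rating) triples."""
    squared = [(predict(u, i) - r) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


def train_damped_user_mean(train, damping):
    """Stand-in model: each user's mean rating, shrunk toward the global mean."""
    global_mean = sum(r for _, _, r in train) / len(train)
    totals, counts = {}, {}
    for u, _, r in train:
        totals[u] = totals.get(u, 0.0) + r
        counts[u] = counts.get(u, 0) + 1

    def predict(user, item):
        n = counts.get(user, 0)
        if n + damping == 0:
            return global_mean  # unknown user and no damping: fall back to the mean
        return (totals.get(user, 0.0) + damping * global_mean) / (n + damping)

    return predict


def select_best(train, validation, candidates, train_fn):
    """Train one model per candidate hyperparameter; keep the lowest validation RMSE."""
    best = None
    for param in candidates:
        model = train_fn(train, param)
        score = rmse(model, validation)
        if best is None or score < best[1]:
            best = (param, score, model)
    return best


# Hypothetical (user, movie, rating) triples.
train = [(1, 10, 5.0), (1, 20, 5.0), (2, 10, 1.0), (2, 20, 1.0)]
validation = [(1, 30, 4.0), (2, 30, 2.0)]
best_param, best_rmse, best_model = select_best(
    train, validation, [0.0, 2.0, 10.0], train_damped_user_mean
)
```

With an ALS model, `train_fn` would presumably take a combination of rank, iteration count, and regularization instead of a single damping value; the selection loop stays the same.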

Key points and difficulties in system development:

Data warehouse preparation: Spark + Hive for data ETL, Zeppelin + Hive for data display
Data processing:
Real-time data processing:
1. Data timeliness, completeness, and consistency
2. Making sure the application does not crash once started, or that it can be brought back up promptly after a crash while keeping data processing consistent

Extensions:

1. How should a data warehouse be understood? There are two kinds: one is represented by the data products of IBM and Microsoft, and the other is Hadoop + Hive.
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Structure can be projected onto data already in storage.
A command line tool and a JDBC driver are provided to connect users to Hive.

2. Data source preparation:
Data source: the MovieLens open dataset
http://files.grouplens.org/datasets/movielens
http://files.grouplens.org/datasets/movielens/ml-latest.zip

[root@hadoop001 ml-latest]# pwd
/root/data/ml/ml-latest
[root@hadoop001 ml-latest]# ll -h
total 1.9G
-rw-r--r--. 1 root root 1.3M Oct 17 13:41 links.txt
-rw-r--r--. 1 root root 2.8M Oct 17 16:06 movies.txt
-rw-r--r--. 1 root root 725M Oct 17 16:07 ratings.txt
-rw-r--r--. 1 root root  38M Oct 17 16:08 tags.txt
[root@hadoop001 ml-latest]# 
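
Before loading these files anywhere, it helps to see what cleaning a ratings file involves. The sketch below assumes `ratings.txt` keeps the layout of the MovieLens `ratings.csv` it appears to come from: a header line followed by `userId,movieId,rating,timestamp` rows, with ratings between 0.5 and 5.0. The function name and the sample rows are made up for illustration, and the project itself would do this step with Spark rather than plain Python.

```python
import csv
import io


def parse_ratings(lines):
    """Yield (user_id, movie_id, rating, timestamp) tuples, skipping bad rows."""
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # wrong number of fields
        try:
            user_id, movie_id = int(row[0]), int(row[1])
            rating, timestamp = float(row[2]), int(row[3])
        except ValueError:
            continue  # header row or a non-numeric field
        if 0.5 <= rating <= 5.0:
            yield user_id, movie_id, rating, timestamp


# Hypothetical sample: one header, two valid rows, three rows to be dropped.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,307,3.5,1256677221\n"
    "1,481,,1256677456\n"
    "2,170,9.0,1192913581\n"
    "broken line\n"
    "4,1,4.0,1113765937\n"
)
clean = list(parse_ratings(sample))
```

To run it against the real file, pass an open file object instead of the `StringIO` sample, for example `open("ratings.txt", newline="")`.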

The next step is to start coding...

If you have any questions, feel free to leave a comment and discuss.
More articles in the Spark-based movie recommendation system series: https://blog.csdn.net/liuge36/column/info/29285


Origin www.cnblogs.com/liuge36/p/11713141.html