7 Recommendation system architecture design

  • 7.1: Basic model of a recommendation system
  • 7.2: Common recommendation system architectures
  • 7.3: Common components (software commonly used in the architecture)
  • 7.4: Some common problems

7.1 Basic model of a recommendation system

  • A recommendation system is a supervised learning system
    • Supervised learning is divided into learning and prediction
  • The learning system trains on a given set of training samples
    • Training produces a model, which is then used by the prediction system
  • The prediction system uses the model to give predictions for the given test samples

[Figure 7.1]

  • The purpose of the recommendation system
    • Train the model effectively in the learning system so that the results of the prediction system are close to the true results of the test samples
  • What is predicted
    • Whether the user likes a piece of news, a song, or a video
    • The probability that the user buys a certain product
  • Optimizing the recommendation system
    • Through models, algorithms, data, and features,
    • improve the accuracy of the prediction results,
    • so that the items recommended to the user are closer to the user's true preferences
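
The learning/prediction split described above can be sketched as follows. This is a minimal illustration, assuming a tiny logistic-regression model trained by stochastic gradient descent; the feature layout, the sample data, and the names `train` and `predict` are assumptions made for the example, not part of the original system.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=200, lr=0.5):
    """Learning system: fit logistic-regression weights on training samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict(model, x):
    """Prediction system: probability that the user likes the item."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical features: [watched_similar_video, is_new_user]
train_x = [[1, 0], [1, 1], [0, 0], [0, 1]]
train_y = [1, 1, 0, 0]  # 1 = liked, 0 = not liked
model = train(train_x, train_y)
```

Calling `predict(model, [1, 0])` then returns the estimated probability that a user with those features likes the item.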

  • In practice, the learning system processes a larger volume of user data with more dimensions, and the recommendation models it uses are also more complex
  • Common models: the collaborative model, the content model, and the knowledge model
  • The collaborative model guesses what a user likes based on what that user's friends like;
  • The content model works from the items themselves: a user who likes A may also like B;
  • The knowledge model recommends according to the constraints the user specifies and the user's needs
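
As a rough illustration of the collaborative idea ("guess what I like from what my friends like"), the sketch below counts how many friends like each item the user has not seen yet. The data layout and the function name `recommend_from_friends` are hypothetical; real collaborative models are considerably more elaborate.

```python
from collections import Counter

def recommend_from_friends(user, likes, friends, top_n=3):
    """Rank unseen items by how many of the user's friends like them."""
    seen = likes.get(user, set())
    counts = Counter()
    for friend in friends.get(user, []):
        for item in likes.get(friend, set()):
            if item not in seen:
                counts[item] += 1
    return [item for item, _ in counts.most_common(top_n)]

# Hypothetical data: who likes what, and who is friends with whom
likes = {"me": {"A"}, "f1": {"A", "B"}, "f2": {"B", "C"}, "f3": {"B"}}
friends = {"me": ["f1", "f2", "f3"]}
suggestions = recommend_from_friends("me", likes, friends)
```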

  • The architecture of a recommendation system starts from the basic supervised-learning model and is customized to the needs of the business, until it is polished into a system that fits those needs.
  • In the learning system, data must be reported, cleaned, and turned into features.
  • A platform for storing and processing data is needed.
  • Depending on the amount and type of data, the learning system may need to be customized.
  • In the prediction system, prediction must be turned into a service and wrapped as an API that the business can call.
  • The reliability and scalability of the online service must also be ensured.

  • The rest of this chapter first introduces common architectures of recommendation systems,
    • then, building on those architectures, some common components of each module,
    • and finally some common problems of recommendation systems.

7.2 Common architectures of recommendation systems

  • The recommendation system architectures introduced in this section are not mutually exclusive; a real recommendation system may use one or more of them.
  • In practice, the architectures introduced here can serve as a starting point for design. They should be combined with independent thinking about your own business characteristics to design a system that suits your business.

  • By how quickly the system responds to user behavior: recommendation systems based on offline training and those based on online training.
  • By the learning method used: recommendation systems using traditional machine learning and those using deep learning.
  • Because of its business importance, one more category is introduced separately: the content-oriented recommendation system.
  • At the end of each subsection, problems encountered in actual system design are introduced
    • for design reference.

7.2.1 Architecture design of recommendation system based on offline training

  • "Offline" training
    • Trains on historical data from a period of time (one week or several weeks)
    • The model iteration cycle is long (measured in hours)
    • What the model fits is the user's medium- and long-term interests
    • Typical uses: mobile application markets, music recommendation
  • "Online" training refers to incremental, real-time training
    • The model must respond quickly to each training sample
    • For example, if the user is watching a food video and stays on it for a long time,
    • the recommendation system detects this short-term interest and recommends more similar videos next
    • The training data is updated at a frequency measured in seconds
    • Typical uses: news, shopping, short-video recommendation
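
To make the contrast concrete, here is a minimal sketch of an incremental ("online") update, in which the model adjusts its weights after every single sample instead of being retrained on weeks of history. The class name `OnlineLogisticModel`, the learning rate, and the two-feature layout are assumptions for illustration only.

```python
import math

class OnlineLogisticModel:
    """Logistic model whose weights are updated one sample at a time."""

    def __init__(self, n_features, lr=0.1):
        self.weights = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        z = sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Incremental step from a single (features, label) sample."""
        err = self.predict(x) - y
        self.weights = [w - self.lr * err * xi
                        for w, xi in zip(self.weights, x)]

online_model = OnlineLogisticModel(n_features=2)
food_video = [1.0, 0.0]  # hypothetical one-hot features: [is_food, is_sports]
before = online_model.predict(food_video)
for _ in range(20):      # the user keeps watching food videos
    online_model.update(food_video, 1)
after = online_model.predict(food_video)
```

After a handful of positive samples, `after` is higher than `before`, which is the short-term interest effect described above.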

  • Models commonly used in recommendation systems based on offline training: logistic regression, gradient boosting decision trees (GBDT),
    • factorization machines

[Figure 7.2]

  • Data reporting and offline training make up the learning system
  • Real-time calculation and A/B testing make up the prediction system
  • There is also an online storage module,
    • which stores the model and the feature information the model needs, for the real-time calculation module to call
  • The modules in the figure form two data streams: training and prediction.
    • The training data stream collects business data, finally generates the model, and stores it in the online storage module;
    • The prediction data stream accepts prediction requests from the business and reaches the real-time calculation module through the A/B testing module to obtain the prediction result.
  • The training data stream has to process a large amount of training data, and its update cycle is long (measured in hours),
    • so the corresponding architecture is called an offline-training-based architecture
  • The prediction data stream serves the online business, and its latency must stay within tens of milliseconds.
    • The two data streams therefore place different architectural requirements on their modules.
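
One common way for an A/B testing module to route prediction requests is to hash the user ID into a stable bucket, so that the same user always sees the same model variant. The sketch below assumes this hashing approach; the function name `assign_bucket`, the experiment name, and the 50/50 split are hypothetical and not taken from the original architecture.

```python
import hashlib

def assign_bucket(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if position < treatment_share else "control"

# Hypothetical experiment name; the same user always lands in the same bucket
bucket = assign_bucket("user-42", "new-ranking-model")
```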

  • Data reporting collects business data to form training samples
  • It is divided into collection, verification, cleaning, and conversion
  • Collection: data is gathered from the business.
  • Collection is business-driven and covers several dimensions: items, users, and scenes,
    • The quality of the core data samples must be guaranteed.
    • Quantify everything; the finer the granularity, the better
  • Verification: check the accuracy of the reported data to avoid reporting-logic errors, data misalignment, or missing data
  • Cleaning: to keep the data trustworthy, dirty data must be cleaned up.
    • Common cleaning steps: null-value checks, outlier checks, type checks, and deduplication
  • Conversion: transform the collected data into the sample format required for training,
    • and save it to the offline storage module.
  • Data quality is very important: whether the prediction results are accurate
    • depends on the strength of the model, but even more on the quality and quantity of the training data.
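
The cleaning steps listed above (null-value check, type check, outlier check, deduplication) can be sketched as below. The record fields (`user_id`, `item_id`, `dwell_seconds`, `timestamp`) and the outlier bounds are assumptions chosen for the example.

```python
def clean_records(records):
    """Drop records that fail basic quality checks; keep the first of each duplicate."""
    cleaned, seen = [], set()
    for rec in records:
        user = rec.get("user_id")
        item = rec.get("item_id")
        dwell = rec.get("dwell_seconds")
        # Null-value check
        if user is None or item is None or dwell is None:
            continue
        # Type check (bool is excluded even though it is a subclass of int)
        if isinstance(dwell, bool) or not isinstance(dwell, (int, float)):
            continue
        # Outlier check: assumed bounds of 0 seconds to 24 hours
        if dwell < 0 or dwell > 24 * 3600:
            continue
        # Deduplication on (user, item, timestamp)
        key = (user, item, rec.get("timestamp"))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"user_id": "u1", "item_id": "v1", "dwell_seconds": 30, "timestamp": 1},
    {"user_id": "u1", "item_id": "v1", "dwell_seconds": 30, "timestamp": 1},   # duplicate
    {"user_id": None, "item_id": "v2", "dwell_seconds": 10, "timestamp": 2},   # null user
    {"user_id": "u2", "item_id": "v3", "dwell_seconds": "12", "timestamp": 3}, # wrong type
    {"user_id": "u3", "item_id": "v4", "dwell_seconds": -5, "timestamp": 4},   # outlier
]
clean = clean_records(raw)
```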

[Figure 7.3]

  • The offline training module is divided into offline storage and offline calculation
  • Offline storage requires a distributed file system or storage platform
  • Common offline calculation operations: sampling, feature engineering, model training, similarity calculation

  • Sampling: design the samples well to provide high-quality input for model training, and thus train a better model.
  • Define positive and negative samples reasonably. In practice, positive and negative samples are often imbalanced
    • This can be addressed with penalty weights, combination methods, and similar techniques.
    • Design positive and negative samples based on an understanding of the business.
  • When designing samples, try to keep the number of samples per user balanced.
  • For malicious traffic inflation and bot users, deduplicate samples to keep the per-user sample counts balanced.
  • Give due consideration to sample diversity: enrich the sample sources by collecting user samples that are independent of the current recommendation algorithm.
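
Two of the sampling ideas above can be sketched briefly: capping the number of samples per user (so that bots or heavy users do not dominate), and computing penalty weights for imbalanced classes. The inverse-frequency weighting used here is one common heuristic and is an assumption, as are the function names and the sample layout.

```python
from collections import Counter, defaultdict

def cap_samples_per_user(samples, max_per_user):
    """Keep at most max_per_user samples for each user, in arrival order."""
    kept, per_user = [], defaultdict(int)
    for sample in samples:
        if per_user[sample["user_id"]] < max_per_user:
            per_user[sample["user_id"]] += 1
            kept.append(sample)
    return kept

def class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}

samples = [{"user_id": "bot", "label": 0}] * 5 + [{"user_id": "u1", "label": 1}]
balanced = cap_samples_per_user(samples, max_per_user=2)
weights = class_weights([s["label"] for s in balanced])
```

The resulting weights can then be passed to a training routine as per-class penalties on the loss.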

  • Feature engineering uses domain knowledge to extract as much information as possible from the raw data; the resulting features are used to improve model training.
  • Feature selection picks the most statistically significant subset of features from the feature set, using an evaluation function, a stopping criterion, and a verification process.
  • Feature extraction uses component analysis, discriminant analysis, and other methods to transform and combine the original features into new core features with business or statistical meaning.
  • Feature combination uses multi-modal embedding and other methods to combine feature vectors from users, items, and context so that their information is complementary.
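
The roles of the evaluation function and the stopping criterion in feature selection can be illustrated with greedy forward selection, which is one possible strategy among many. The function `forward_select`, the `min_gain` threshold, and the toy evaluation function below are all hypothetical; in a real system `evaluate` would train and validate a model on the chosen features.

```python
def forward_select(features, evaluate, min_gain=0.01):
    """Greedily add the feature that improves the evaluation score the most.

    Stops when the best remaining feature improves the score by less
    than min_gain (the stopping criterion).
    """
    selected = []
    best_score = evaluate(selected)
    remaining = list(features)
    while remaining:
        score, feature = max((evaluate(selected + [f]), f) for f in remaining)
        if score - best_score < min_gain:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected

# Toy evaluation function: each feature contributes a fixed, made-up gain
toy_gain = {"age": 0.3, "city": 0.2, "noise": 0.001}
chosen = forward_select(toy_gain, lambda feats: sum(toy_gain[f] for f in feats))
```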

  • After sampling and feature engineering, model training takes the given data set and produces a model that describes the mapping between input and output variables.
  • In practice, because large-scale training sets must be handled, linear-time algorithms that can be trained in a distributed way are generally chosen.

[Figure 7.4]

  • Besides the modules of the recommendation system shown in Figure 7.1 and Figure 7.2, there is also an online storage module
  • Online services have strict latency requirements.
  • When a user opens the app, they expect a fast response
  • This requires the recommendation system to process the user request and return the recommendation results within tens of milliseconds.
  • Online services therefore need a dedicated online storage module
    • that stores model and feature data for online use
  • The online storage module needs local memory or distributed memory
  • To make online storage as fast as possible, open-source software can be customized with, for example, caching strategies, incremental update strategies, delayed-expiration strategies, and SSDs
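
A caching strategy with expiration can be sketched as a small in-memory cache placed in front of a slower feature store. This is only an illustration under assumed names (`FeatureCache`, `load_fn`) and an assumed policy (least-recently-used eviction plus a time-to-live); it does not describe any specific open-source product.

```python
import time
from collections import OrderedDict

class FeatureCache:
    """LRU cache with per-entry expiry in front of a slower feature store."""

    def __init__(self, load_fn, max_size=1000, ttl_seconds=60.0,
                 clock=time.monotonic):
        self.load_fn = load_fn          # hypothetical loader for the backing store
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # injectable clock, useful for testing
        self._entries = OrderedDict()   # key -> (value, time loaded)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self.ttl_seconds:
            self._entries.move_to_end(key)      # mark as most recently used
            return entry[0]
        value = self.load_fn(key)               # miss or expired: reload
        self._entries[key] = (value, now)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)   # evict least recently used
        return value
```

A typical use would be `cache = FeatureCache(load_fn=fetch_user_features)`, where `fetch_user_features` is a hypothetical function that reads from the distributed store.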

[Figure 7.5]

  • The real-time recommendation module makes predictions for new requests from the business
  • For example, when the user opens the app, the app sends a request to the server in the background,
    • On receiving the request, the server estimates the user's preferences from the user's history in the application market
    • It then returns a list of recommended applications to the mobile app, which presents it to the user in its interface.
  • The real-time calculation module performs the following computations:
    • (1) Obtain user features: using the user ID in the request, the system reads the user's profile and historical behavior from the online storage module and builds the user's model features;
    • (2) Call the recommendation model: with the user features, call the algorithm model of the recommendation system to obtain the user's preference probability for each item in the app candidate pool;
    • (3) Sort the results: sort the candidate pool by score, then return the result list to the mobile app.
  • The real-time calculation module needs to read a lot of data from the online storage module
    • and complete a large amount of model scoring in a short time,
    • so the module has high performance requirements.
  • This module needs a distributed computing framework to complete its computing tasks.
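
The three steps of the real-time calculation module can be sketched as a single request handler. Everything here is a simplified assumption: `online_store` is a plain dictionary standing in for the online storage module, and `category_score` is a toy scoring function standing in for the trained model.

```python
def handle_request(user_id, online_store, model_score, candidate_pool, top_k=3):
    """Serve one recommendation request in three steps."""
    # (1) Obtain user features from online storage
    user_features = online_store.get(user_id, {})
    # (2) Call the model to score every candidate for this user
    scored = [(model_score(user_features, item), item) for item in candidate_pool]
    # (3) Sort by score, highest first, and return the top of the list
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]

def category_score(user_features, item):
    """Toy model: score is the user's affinity for the item's category."""
    return user_features.get("liked_categories", {}).get(item["category"], 0.0)

online_store = {"u1": {"liked_categories": {"games": 0.9, "music": 0.4}}}
candidate_pool = [
    {"app": "tax", "category": "finance"},
    {"app": "radio", "category": "music"},
    {"app": "chess", "category": "games"},
]
recommended = handle_request("u1", online_store, category_score, candidate_pool, top_k=2)
```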

  • When the business's item list is too large,
    • scoring every item in real time with a complex model
    • takes too long
  • So recommendation list generation is split into recall and ranking
  • Recall: select a candidate set (hundreds of items) from a very large pool (millions)
  • Ranking: score the relatively small candidate set obtained by recall with the ranking model
  • After the recommendation list is ranked,
    • for diversity and operational reasons,
    • a third step is added: re-ranking and filtering, which processes the finely ranked list
  • Re-ranking and filtering gives users some exploratory content,
    • keeps what users see on the platform from being too homogeneous, and filters out vulgar or illegal content
  • The architecture is shown in Figure 7.6

[Figure 7.6]
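
The recall, ranking, and re-ranking/filtering stages can be sketched as three small functions chained together. The recall rule (category match), the `quality` score, and the per-category cap used for diversity are illustrative assumptions; injecting exploratory content is left out to keep the sketch short.

```python
def recall(user, items, limit):
    """Cheap first pass: keep items in categories the user has shown interest in."""
    hits = [it for it in items if it["category"] in user["categories"]]
    return hits[:limit]

def rank(user, candidates, score_fn):
    """Score the small candidate set with the (more expensive) ranking model."""
    return sorted(candidates, key=lambda it: score_fn(user, it), reverse=True)

def rerank_and_filter(ranked, blocked_ids, max_per_category, top_k):
    """Drop blocked items and cap each category to limit homogeneity."""
    result, per_category = [], {}
    for it in ranked:
        if it["id"] in blocked_ids:
            continue
        if per_category.get(it["category"], 0) >= max_per_category:
            continue
        per_category[it["category"]] = per_category.get(it["category"], 0) + 1
        result.append(it)
        if len(result) == top_k:
            break
    return result

user = {"categories": {"food", "travel"}}
items = [
    {"id": 1, "category": "food", "quality": 0.9},
    {"id": 2, "category": "food", "quality": 0.8},
    {"id": 3, "category": "food", "quality": 0.7},
    {"id": 4, "category": "travel", "quality": 0.6},
    {"id": 5, "category": "sports", "quality": 0.95},
]
candidates = recall(user, items, limit=100)
ranked = rank(user, candidates, lambda u, it: it["quality"])
final = rerank_and_filter(ranked, blocked_ids={2}, max_per_category=1, top_k=3)
```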

7.3 Common components of recommendation system

7.3.1 Common components for data reporting

  • Apache Kafka is an open-source stream processing platform
  • It is a high-throughput, low-latency processing framework for real-time data sources.
  • Logically, it is a distributed implementation of multi-producer, multi-consumer queues
  • Messages are managed by topic; a topic can have multiple producers and consumers
  • Producers produce messages and push them to a topic, and consumers that subscribe to the topic pull messages from it
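
The push/pull model described above can be illustrated with a toy in-memory broker. This is not the Kafka client API and ignores partitions, persistence, and networking; the class `ToyBroker` and its methods are hypothetical. Each consumer here tracks its own read position, so every consumer sees every message on the topic.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a topic-based message queue."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only message log
        self._offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Producer side: push a message to the end of the topic's log."""
        self._logs[topic].append(message)

    def pull(self, topic, consumer, max_messages=10):
        """Consumer side: pull unread messages and advance this consumer's offset."""
        start = self._offsets[(topic, consumer)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(topic, consumer)] = start + len(batch)
        return batch

broker = ToyBroker()
broker.publish("clicks", {"user": "u1", "item": "v1"})  # producer A
broker.publish("clicks", {"user": "u2", "item": "v7"})  # producer B
first_batch = broker.pull("clicks", consumer="trainer")
second_batch = broker.pull("clicks", consumer="trainer")
monitor_batch = broker.pull("clicks", consumer="monitor", max_messages=1)
```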

[Figure 7.7]

7.3.2 Common components for offline storage

  • HDFS (Hadoop Distributed File System) is a widely used distributed file system.
  • It offers low cost, high reliability, and high throughput.
  • Its fault-tolerance mechanism allows HDFS to be built on inexpensive hardware and still provide reliable storage even when components fail.

  • Hive is a Hadoop-based data warehouse with fairly complete SQL functionality,
    • and it uses HDFS as the underlying storage
  • People familiar with SQL can work with the data without complicated programming.
  • It handles data at the scale of hundreds of PB and supports the storage of structured data.

7.3.3 Common components for offline computing

  • Apache Spark is a high-performance distributed computing framework based on in-memory data processing.
  • Its simple, flexible, and powerful APIs help users develop efficient programs for complex data analysis.
  • Spark provides a MapReduce-style computing model similar to Hadoop's, but unlike Hadoop it keeps intermediate data in memory-based data structures, which makes it better suited to workloads that require multiple iterations.
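
For readers unfamiliar with the MapReduce computing model mentioned above, the sketch below shows the idea in plain Python on a single machine: a map phase emits key/value pairs and a reduce phase aggregates the values per key. It is not Spark or Hadoop code, and the record fields are assumptions.

```python
from collections import defaultdict
from functools import reduce

def map_phase(records):
    """Emit an (item_id, 1) pair for every click record."""
    return [(r["item_id"], 1) for r in records if r["action"] == "click"]

def reduce_phase(pairs):
    """Group the pairs by key and sum the values of each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

logs = [
    {"item_id": "a", "action": "click"},
    {"item_id": "b", "action": "view"},
    {"item_id": "a", "action": "click"},
    {"item_id": "b", "action": "click"},
]
click_counts = reduce_phase(map_phase(logs))
```

In a distributed framework the same two phases run in parallel across many machines, with the grouping step handled by the framework.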

  • TensorFlow is an open-source software framework for numerical computation using data flow graphs.
  • Its powerful and diverse APIs let researchers develop a wide range of applications.

  • Distributed TensorFlow provides support for parameter servers.
  • Unlike other parameter servers,
    • TensorFlow's parameter server updates the parameters implicitly,
    • so programmers do not have to push and pull these parameters manually.
  • This makes developing on a TensorFlow-based parameter server easy
  • The main task in writing a distributed TensorFlow program
    • becomes how to distribute the parameters sensibly across the different parameter servers,
    • configured through the cluster configuration interface, the device placement interface, and the synchronization mode interface.
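
The question of how to spread parameters across parameter servers can be illustrated without TensorFlow itself. The sketch below assumes a simple greedy policy (place the largest parameter on the currently least-loaded server); the function `place_parameters` and the parameter names and sizes are hypothetical. Only the `/job:ps/task:N` device-string form follows TensorFlow's naming convention.

```python
def place_parameters(param_sizes, num_servers):
    """Assign each parameter to the least-loaded server, largest parameters first."""
    loads = [0] * num_servers
    placement = {}
    by_size = sorted(param_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        target = loads.index(min(loads))  # first server with the smallest load
        placement[name] = f"/job:ps/task:{target}"
        loads[target] += size
    return placement, loads

# Hypothetical parameter sizes (number of values in each tensor)
sizes = {"embedding": 100, "dense1": 60, "dense2": 50, "bias": 5}
placement, loads = place_parameters(sizes, num_servers=2)
```

With these sizes the two servers end up holding 105 and 110 values, which is more even than placing parameters in declaration order.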

7.3.4 Common components for online storage

Origin blog.csdn.net/zhoutianzi12/article/details/105619174