Big data user portrait system architecture design


The user portrait system is a general-purpose and widely used system. As the architecture diagram shows, computation is split by timeliness into offline calculation and real-time calculation.

Offline calculation generally recomputes all users in full every night, or recomputes on demand only the batch of users whose data has changed. It mainly uses Hive SQL statements, Spark data processing, or machine learning algorithms to calculate user loyalty models, user value models, user mental models, and so on.

Real-time calculation means that user behavior data collected by Flume from real-time logs is sent to a Kafka message queue, where a stream computing framework such as Flink, Storm, or Spark Streaming consumes and processes it in real time, triggers the real-time calculation model, and, once the calculation completes, writes the updated user portrait data to the search index. When personalized recommendation or operational promotion needs the portrait data of one or more users, the results can be looked up directly in the search index within milliseconds and returned quickly to the caller. From the computing point of view, the architecture is thus roughly divided into two lines: offline processing and real-time processing.
Below we look at each architecture module in detail from top to bottom as shown in Figure 2.3:


Figure 2.3 User portrait system architecture diagram

1. User portrait data warehouse construction and data extraction part

(1) Collect the MySQL business databases related to user portraits and extract them to the Hadoop platform incrementally every day. The first run, of course, requires a full initialization load. Sqoop can be used as the data transfer tool; it imports data into Hive on Hadoop in distributed batches;
(2) Flume distributed log collection gathers real-time user behavior, event-tracking data, etc. from the web servers. By specifying a source and a sink, the data can be transmitted directly to the Hadoop platform.
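
The daily incremental extraction in (1) usually relies on a watermark: remember the largest modification time already imported and pull only rows changed after it. Below is a minimal sketch of that idea. An in-memory SQLite database stands in for the MySQL business database, and the user_info table, its columns, and the extract_increment function are hypothetical names chosen for illustration. Sqoop's incremental import mode works on a similar check-column and last-value principle, but this is not Sqoop code.

```python
import sqlite3


def extract_increment(conn, last_watermark):
    """Return the rows changed after last_watermark and the new watermark."""
    rows = conn.execute(
        "SELECT user_id, city, updated_at FROM user_info "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# In-memory SQLite stands in for the MySQL business database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_info (user_id INTEGER, city TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO user_info VALUES (?, ?, ?)",
    [
        (1, "Beijing", "2020-11-01 08:00:00"),
        (2, "Shanghai", "2020-11-02 09:30:00"),
        (3, "Shenzhen", "2020-11-03 10:15:00"),
    ],
)

# First run: full initialization. An empty watermark sorts before every timestamp.
full_rows, watermark = extract_increment(conn, "")

# A business update happens, then the next daily run picks up only that row.
conn.execute(
    "UPDATE user_info SET city = 'Hangzhou', updated_at = '2020-11-04 07:00:00' "
    "WHERE user_id = 2"
)
delta_rows, watermark = extract_increment(conn, watermark)
```

The first call returns every row, and the second returns only the row for user 2, which is the behavior the full initialization followed by daily increments is meant to achieve.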

2. Layered design and processing of the big data platform and the user portrait data mart

Build data marts related to user portraits on the big data platform with a layered design; the reasoning is similar to that for the recommendation and search data marts.

3. Offline calculation part

(1) Hive SQL can compute a user portrait attribute from part of the user data. Particularly complex user attributes, such as those that require machine learning, can be handled on the Spark platform described next.
(2) Spark loads user data from the Hadoop platform. It can process part of the data and use machine learning algorithms to calculate complex user attributes such as user loyalty models, user value models, and user mental models. These models do not necessarily require machine learning; rules can also be used.
(3) Whether computed with Hive SQL or processed by Spark, the final user model results are stored in the Hive warehouse on Hadoop. A separate Spark job then loads the user portrait model and updates the Solr or ES search index, so that the online interface can access it in real time. In addition, the Hive user portrait tables stored on Hadoop are also used to serve custom requests from other departments of the company: Hive SQL is executed asynchronously on demand, the results are written to a local file, and the file is either distributed to the requester's server or made available at an address returned to the requester, so that other departments can fetch the file themselves with wget.
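
As noted in (2), a user value model can be built from rules instead of machine learning. The following is a minimal sketch of such a rule-based model, assuming three inputs per user (last order date, order count, total spend). The thresholds, tier names, and the value_tier function are hypothetical and chosen only for illustration; in production the same rules would more likely be expressed in Hive SQL or a Spark job.

```python
from datetime import date


def value_tier(last_order, order_count, total_spend, today):
    """Score a user from 0 to 3 with simple rules and map the score to a tier."""
    score = 0
    if (today - last_order).days <= 30:  # recency: ordered within the last 30 days
        score += 1
    if order_count >= 5:  # frequency: at least 5 orders
        score += 1
    if total_spend >= 1000:  # monetary: spent at least 1000
        score += 1
    return ["low", "medium", "high", "core"][score]


today = date(2020, 11, 20)

# Hypothetical per-user statistics: (last order date, order count, total spend).
users = {
    "u1": (date(2020, 11, 15), 8, 2500.0),
    "u2": (date(2020, 6, 1), 2, 300.0),
    "u3": (date(2020, 11, 1), 1, 1200.0),
}

# One portrait attribute per user, ready to be stored and indexed.
profiles = {
    uid: {"value_tier": value_tier(*stats, today)} for uid, stats in users.items()
}
```

Rules like these are easy to explain to the operations team and can later be replaced by a trained model without changing the shape of the stored attribute.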

4. Real-time calculation part

(1) Flume collects user behavior data from real-time logs and sends it to the Kafka message queue, where a stream computing framework such as Flink, Storm, or Spark Streaming consumes and processes it in real time, triggers the real-time calculation model, and, once the calculation completes, writes the new user portrait data to the search index. Real-time calculation is on-demand: it is triggered only when a user's behavior changes. If nothing changes, no behavior data for that user enters the message queue, so naturally no calculation is triggered.
(2) Where real-time calculation is needed, in addition to updating the Solr or ES index you can also write the results to HBase and then build a Hive-to-HBase mapping table, which lets you run statistical analysis on the real-time user portrait data in HBase. HBase Shell scripts can also be used, but they are not as convenient or flexible as Hive SQL.
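
The on-demand nature of real-time calculation can be sketched as follows. This is an illustration only: a collections.deque stands in for the Kafka topic, plain dictionaries stand in for the profile store and the search index, and the event fields and the process_events function are hypothetical names. A real deployment would use a Flink, Storm, or Spark Streaming job instead.

```python
from collections import deque


def process_events(queue, profiles, index):
    """Drain the queue, update each affected profile, and refresh its index entry."""
    touched = set()
    while queue:
        event = queue.popleft()
        profile = profiles.setdefault(
            event["user_id"], {"clicks": 0, "last_category": None}
        )
        if event["type"] == "click":
            profile["clicks"] += 1
            profile["last_category"] = event["category"]
        touched.add(event["user_id"])
    # Only users who produced events are written to the index.
    for user_id in touched:
        index[user_id] = dict(profiles[user_id])  # copy so the index holds a snapshot
    return touched


profiles = {
    "u1": {"clicks": 3, "last_category": "books"},
    "u2": {"clicks": 1, "last_category": "toys"},
}
index = {}

# Stand-in for messages arriving on the queue: u2 has no new behavior.
queue = deque(
    [
        {"user_id": "u1", "type": "click", "category": "phones"},
        {"user_id": "u3", "type": "click", "category": "shoes"},
    ]
)
touched = process_events(queue, profiles, index)
```

User u2 generated no events, so nothing is recomputed or re-indexed for that user, while u1 is updated and the previously unseen u3 gets a new profile.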

5. Solr/ES search engine part

This is the core component that serves real-time user portrait data at millisecond latency. It supports not only queries by user ID but also exact filtering on any custom query field. In addition, because it is a search engine, it naturally supports fuzzy, relevance-ranked keyword searches.

6. Java Web millisecond-level real-time user portrait interface service

(1) Because the Solr/ES search engines we use are developed in Java, Java is recommended for the web interface as well.
(2) This web interface serves callers in real time. For example, the recommendation interface can fetch a given user's portrait directly by user ID and get the corresponding data back within a few milliseconds. You can also search for the top N user portraits by other filter conditions or specified keywords. Note that this method does not return every user who meets the filter conditions. It typically returns the top few dozen or few hundred, and at most a few thousand per request. Returning too many makes the request slow and may bring down the web server, such as Tomcat.
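
The two access patterns described in (2), lookup by user ID and a capped top-N search, can be sketched as below. The article recommends Java for the actual service; Python is used here only to keep the illustration short. A dictionary stands in for the Solr/ES index, and MAX_PAGE_SIZE, the field names, and both functions are hypothetical.

```python
MAX_PAGE_SIZE = 1000  # hypothetical hard cap that protects the web server


def get_profile(index, user_id):
    """Point lookup by user ID; returns None when the user is unknown."""
    return index.get(user_id)


def search_profiles(index, filters, top_n=10, sort_key="score"):
    """Exact-match filtering that returns only the top_n best-scoring users."""
    top_n = max(1, min(top_n, MAX_PAGE_SIZE))  # never return an unbounded result
    matches = [
        {"user_id": uid, **doc}
        for uid, doc in index.items()
        if all(doc.get(field) == value for field, value in filters.items())
    ]
    matches.sort(key=lambda doc: doc.get(sort_key, 0), reverse=True)
    # The total is reported separately so callers can see the full match count.
    return {"total": len(matches), "users": matches[:top_n]}


# A dictionary stands in for the search index (assumption).
index = {
    "u1": {"city": "Beijing", "gender": "F", "score": 0.9},
    "u2": {"city": "Beijing", "gender": "M", "score": 0.7},
    "u3": {"city": "Shanghai", "gender": "F", "score": 0.8},
    "u4": {"city": "Beijing", "gender": "F", "score": 0.95},
}

one_user = get_profile(index, "u3")
result = search_profiles(index, {"city": "Beijing"}, top_n=2)
```

Clamping top_n on the server side, rather than trusting the caller, is what keeps a careless request from pulling the whole index through the web tier.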

7. Web self-service backend with real-time display and asynchronously triggered retrieval of user portraits

(1) Why a web self-service backend? In this kind of application scenario, the operations team typically needs to select a group of users for advertising. They specify the filter conditions in the web backend and click asynchronous acquisition, which asynchronously triggers a designated Hive SQL statement or another processing program, such as a Spark or Spark SQL job, in the background. Querying all matching users from the user portrait data mart produces a relatively large user set, not a few thousand but usually hundreds of thousands or several million, which is written to a file. When the asynchronous calculation finishes, a file address is returned, and the operations staff can download the file for subsequent processing.
(2) What does real-time display with asynchronous triggering mean? Real-time display applies to the set of users being screened: you can first call the search interface in real time to see what the first sample records look like and how many users match, that is, how many users the promotion can reach. Because the search interface returns paged data in milliseconds, you can quickly see the approximate effect. Obtaining the data asynchronously, by contrast, generally takes a long time, from at least a few minutes to several hours, and after waiting that long the data may turn out not to be what you wanted. Real-time display therefore serves to verify quickly that the data is what you want before fetching it asynchronously in bulk.
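
The preview-then-export workflow can be sketched as follows, under loud assumptions: a small in-memory list stands in for the user portrait data mart, a thread pool stands in for the asynchronous Hive SQL or Spark job, and the preview and export_to_file functions, the file name, and the CSV columns are all hypothetical.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def preview(users, condition, sample_size=3):
    """Fast check: how many users match, plus a small sample to inspect."""
    matched = [u for u in users if condition(u)]
    return {"total": len(matched), "sample": matched[:sample_size]}


def export_to_file(users, condition, out_dir):
    """Slow path: write every matching user to a file and return its location."""
    path = os.path.join(out_dir, "user_export.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "city"])
        for u in users:
            if condition(u):
                writer.writerow([u["user_id"], u["city"]])
    return path


def in_beijing(user):
    return user["city"] == "Beijing"


# Ten fake users stand in for the user portrait data mart (assumption).
users = [
    {"user_id": i, "city": "Beijing" if i % 2 == 0 else "Shanghai"}
    for i in range(10)
]

# Step 1: verify the filter quickly before committing to the long job.
quick_look = preview(users, in_beijing)

# Step 2: trigger the export in the background and collect the file address.
out_dir = tempfile.mkdtemp()
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(export_to_file, users, in_beijing, out_dir)
    export_path = future.result()
```

In a real backend the caller would not block on the result; the job would record the file address when it finishes and the operations staff would download it later.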

This is essentially what a user portrait system architecture looks like, and it is similar from company to company. The user portrait system is a general-purpose, core system. If the company has the budget, a dedicated user portrait team is usually set up to be responsible for research and development in this area.
As the architecture of the systems above shows, distributed artificial intelligence applications built on big data generally require mastery of the core big data platforms and frameworks, such as Hadoop, Hive, HBase, and Spark. Distributed machine learning is also built on them, so the following chapter focuses on these core big data frameworks.

Summary

This article covered the architecture design of a big data user portrait system. For more content, see the book "Distributed Machine Learning in Practice" (Artificial Intelligence Science and Technology Series), edited by Chen Jinglei, Tsinghua University Press.


Origin blog.csdn.net/weixin_52610848/article/details/109890183