How do you build personalized recommendations from 0 to 1?

Author: Zeng Qinbang, Senior Technical Manager, 58.com

Edited: Zhou Xiaoxia

Content source: 58 Technical Salon

Produced by: DataFun community

Note: Reprints are welcome; please leave a message in the comments section.

Introduction: With the rapid development of technology, the Internet is now used in nearly every field, and more and more enterprises favor Internet-based recruitment. Online recruitment has no geographic restrictions, wide coverage, low cost, good targeting, convenience, and timeliness, so it has been widely adopted, and 58 Recruitment is the largest platform in the industry. This talk covers how 58 Recruitment uses personalized recommendation to serve job seekers and recruiters at scale. The topic is building personalized recommendation for 58 Recruitment from zero to one, covered in three parts:

Recruitment Business

Personalized recommendation practice

Lessons learned and future plans

- Recruitment Business -

  1. 58 Recruitment Business Profile

In 2018 China's total population exceeded 1.39 billion, of which about 770 million were employed, a huge base for recruitment. The primary, secondary, and tertiary industries accounted for 26.11%, 27.57%, and 46.32% of employment respectively. The tertiary industry has the largest share, but in developed countries its share has reached 70% to 80%, so as the economy develops, China's job market and employment distribution will continue to change substantially. In August 2019 the surveyed urban unemployment rate was 5.2%, and 4.5% among people aged 25 to 59, while more than 8 million new graduates enter the job market each year. As the leader of China's Internet recruitment industry, 58 Recruitment serves small and medium enterprises and job seekers on the order of ten million every day, and the platform generates on the order of ten million connections a day, helping a large number of job seekers find work.

58 Recruitment mainly serves job seekers and recruiters. Next we describe the platform's overall flow from the job seeker's point of view:

Search for positions based on preferences and click through to view the details.

Submit a resume to a position of interest, or communicate further with the recruiter through the platform's online chat tool or by phone.

If both sides reach agreement, proceed to the interview and onboarding.

Compared with a traditional recommendation system, the 58 Recruitment funnel is longer and deeper, and part of the conversion cannot be fully captured by the platform. This creates the difficulties and challenges of personalized recommendation for 58 Recruitment.

  2. 58 Recruitment recommendation scenarios

58 Recruitment recommendation scenarios mainly target job seekers on the C side and enterprises on the B side. Recommended content includes jobs, tags, enterprises, and resumes.

Typical C-side recommendation scenarios for job seekers include:

App home page and recruitment category page: mainly aggregated job zones and the job feed.

Category recommendation: when a user clicks a category, related jobs are recommended.

Similar recommendation: after a user clicks a specific job, similar jobs are shown below it.

  3. Main problems in 58 Recruitment recommendation

Compared with recommendation in other industries, 58 Recruitment faces the following typical problems:

Massive data computation: most companies face this, so it is not described in detail here.

Cold start: 58.com is a multi-service platform covering recruitment, real estate, yellow pages, second-hand goods, and more. Job seekers who enter the recruitment section are currently not forced to fill in a resume, which leads to a cold-start problem for users without resumes.

Sparsity and real-time behavior: part of the 58 Recruitment user base is blue-collar, and their behavior on the platform is short-lived, continuous, and sparse; a user may search actively for two days and then become inactive. In addition, when users return to the platform, their job intent may have changed. Some want a different kind of job (for example, a former waiter now looking for courier work), while others may be moving up the career path of their original occupation. The system needs to account for all of this.

Resource allocation: first, how to effectively identify the true intent of enterprises and job seekers, allocate resources to those who can form valid connections, and treat users with bad intent differently. Second, resources are limited on both the B side and the C side: a recruiter has a limited number of openings, and job seekers and recruiters can only interact with a limited number of counterparts. This is very different from e-commerce recommendation such as Taobao, where the supply of a commodity is effectively unlimited.

- Personalized recommendation practice -

  1. Implementing 58 Recruitment personalized recommendation

The implementation flow of 58 Recruitment personalized recommendation is similar to that of most recommendation systems. It includes four core modules: understanding user intent, content recall, content ranking, and content display. Below we introduce the key implementation of each module together with the relevant business characteristics.

  2. How do we understand users?

58 Recruitment understands users mainly through their "words" and "actions", which together reveal a user's real intent. The attributes we focus on include job-search intent, personal attributes, and external image in the recruitment industry (pictured above left). Around the content and behavior that job seekers and recruiters generate on the platform, we built the corresponding knowledge graph and user portraits.

2.1 Identifying insincere users

Before understanding users, we first need to identify users with no real recruiting or job-search intent and treat them differently. Typical examples are frequently posting content that contains contact information, or posting highly tempting but false and malicious information, in order to lead users off the platform for conversion. We summarized the main business characteristics of such users as follows:

Exposed contact information

Content that does not form coherent sentences

Highly tempting offers

Abnormally "active" on the platform

Based on these characteristics, our main identification methods include:

Traditional keyword and regular expression matching, for example for contact-related terms such as "WeChat" and "QQ".

For obfuscated variants of such information, recognition based on pinyin plus a sliding window.

Mining with named entity recognition (NER), such as BiLSTM + CRF.

Recognition with classification algorithms, such as fastText and CNN.
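The first two methods in the list above can be sketched roughly as follows. This is a minimal illustration, not the production rules: the keyword list, the separator set, and the window thresholds are all assumptions, and character normalization stands in for the pinyin conversion step, which would need a pronunciation dictionary.

```python
import re
import unicodedata

# Hypothetical keyword list; a real rule set would be far larger.
CONTACT_KEYWORDS = re.compile(r"wechat|weixin|微信|qq|vx")
# Map Chinese numerals to ASCII digits so spelled-out phone numbers are caught.
CN_DIGITS = str.maketrans("零一二三四五六七八九", "0123456789")
# Characters commonly inserted to break up keywords and numbers.
SEPARATORS = re.compile(r"[\s\-_.*,、。]+")


def normalize(text):
    """Fold full-width characters, map Chinese numerals, strip separators, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(CN_DIGITS)
    return SEPARATORS.sub("", text).lower()


def has_contact_keyword(text):
    """Keyword + regex check on the normalized text."""
    return bool(CONTACT_KEYWORDS.search(normalize(text)))


def has_digit_run(text, window=8, min_digits=7):
    """Slide a fixed-size window over the text and flag windows that are mostly digits."""
    s = normalize(text)
    for i in range(max(1, len(s) - window + 1)):
        if sum(ch.isdigit() for ch in s[i:i + window]) >= min_digits:
            return True
    return False
```

Normalizing before matching is what gives the rules some resistance to simple obfuscation such as inserted spaces or mixed numeral styles.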

While identifying insincere users, we summarized the following tips:

Know your adversary: identifying these users is a typical adversarial scenario. Spend more time thinking about how to build the defensive capability of the strategy, and add features that are strongly resistant to evasion to the model (such as text converted to its pinyin form).

Combine firm and soft measures: penalize different types of problem users differently. For users who do great harm to other user groups on the platform, apply severe penalties, combined with legal means. For borderline cases, prefer softer measures (such as lowering the display weight of their content) to reduce direct confrontation.

2.2 Knowledge graph construction

A knowledge graph is a very complex system with multiple stages: multi-source heterogeneous data collection -> knowledge acquisition -> knowledge representation -> knowledge fusion -> knowledge reasoning -> knowledge management. Given the topic and time constraints, we focus here on our exploration of NER. Recruitment scenarios contain a great deal of text, and NER can effectively extract the key information from it, further improving the system's ability to understand content in a structured way.

NER was carried out in two stages:

Stage 1: starting from the structured entity words the platform already has for some positions, and the large volume of semi-structured job descriptions, we used a bootstrapping method to mine entities in rapid iterations and combined it with semi-manual labeling to build a fairly complete sample set for deep learning.

Stage 2: using the output of stage 1 as input, we built entity recognition with a BiLSTM + CRF network at its core, with two optimizations that achieved good results. The first is at the input layer: optimizing it to work with words as well as characters, supported by a thesaurus built specifically for the recruitment domain. The second is augmenting the training samples: replacing entities with similar words and similar entities to obtain a larger sample set, and selectively feeding the model's recognition results back into the training set for retraining, which reduces the dependence on labeled data. NER is still being optimized; recognition accuracy currently averages above 0.75 and exceeds 0.9 for some entity types.
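The entity-replacement augmentation described in stage 2 could look roughly like the sketch below. It assumes BIO-style tags and, for brevity, only single-token entities; the `similar` dictionary of interchangeable entities is a hypothetical input that would come from the domain thesaurus.

```python
import random


def augment(tokens, tags, similar, rng, n=2):
    """Create up to n new samples by swapping tagged entities for similar ones.

    tokens / tags: parallel lists, tags use "O" for non-entities.
    similar: hypothetical mapping from an entity token to interchangeable tokens.
    """
    samples = []
    for _ in range(n):
        new_tokens = [
            rng.choice(similar[tok]) if tag != "O" and tok in similar else tok
            for tok, tag in zip(tokens, tags)
        ]
        if new_tokens != tokens:
            # Tags are unchanged because each entity is replaced by one of the same type.
            samples.append((new_tokens, list(tags)))
    return samples
```

Because the labels carry over unchanged, every generated sentence is a labeled sample obtained without extra annotation effort.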

2.3 Building user portraits

The user portrait is a foundational module of a personalized recommendation system and determines how accurately user intent is understood. Following a tag-based approach, we combine statistical rules, traditional classification models, and deep models to capture users' interests from their behavior and build both short-term and long-term user portraits.

Based on statistical rules: user portraits are updated in near real time, computed over time windows with a time decay factor, behavior weight factors, and tag confidence weights. A deep understanding of the business scenario and a reasonable mathematical design are key. For example, when using list-page click data, the tags shown explicitly on the list page should be treated differently from the interest tags hidden in the detail page, to avoid introducing artificial noise.
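As a rough illustration of the rule-based scoring just described, the sketch below combines a behavior weight, a tag confidence, and an exponential time decay. The specific weights, the half-life, and the event layout are assumptions made for the example, not values from the talk.

```python
from collections import defaultdict

# Hypothetical behavior weights: deeper funnel actions count more.
BEHAVIOR_WEIGHT = {"click": 1.0, "apply": 3.0, "chat": 4.0}
HALF_LIFE_HOURS = 48.0  # assumed: a signal loses half its weight every two days


def tag_scores(events, now_hours):
    """Aggregate interest per tag.

    events: iterable of (tag, behavior, timestamp_hours, tag_confidence).
    """
    scores = defaultdict(float)
    for tag, behavior, ts, confidence in events:
        decay = 0.5 ** ((now_hours - ts) / HALF_LIFE_HOURS)
        scores[tag] += BEHAVIOR_WEIGHT[behavior] * confidence * decay
    return dict(scores)
```

A lower tag confidence is one way to express the list-page versus detail-page distinction: a tag the user could not see when clicking contributes less than one displayed explicitly.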

Based on traditional classification: classification algorithms are used to fill in user attributes, identify abnormal users and abnormal behavior, and classify users in other ways. Not all users leave a detailed resume, so we organize samples from users' historical behavior and existing resumes and train models that can effectively predict gender, age, expected position, and other user information, easing the cold-start problem caused by missing or incomplete resumes. Meanwhile, by looking at how focused a user's behavior is, the models can identify abnormal data and distinguish users with a clear job target from users with divergent interests. This lets us remove noisy samples to improve accuracy and customize differentiated strategies for different users, improving the overall descriptive power of recommendation.

Based on behavior sequences: with statistical rules and traditional classification, a usable portrait can be built, but they capture little of the information between a user's successive behaviors. We organize search, browsing, resume submission, online chat, and other actions into an event sequence and train models such as LSTM, GRU, and Attention to predict user interest. This work is currently in the exploration and evaluation stage.

  3. Recall module

58 Recruitment's recall has continuously evolved around three levels: the individual, the group, and the whole population. Each level meets different recall needs, and the three are combined to serve all types of scenarios. From 2016 until now we have gone through several stages: context-based content recall, collaborative filtering, fine-grained portraits, and deep recall. The recall module has evolved into three core strategies: precise recall combining user context with portraits, collaborative filtering recall, and deep vectorized recall.

3.1 Precise recall based on context + user portrait

This is one of the most common recall methods in the industry. Its core is rewriting the user's request with the help of a rich portrait. In most scenarios the conditions a user actively searches or clicks are limited, so we use the historical interests in the user portrait and the entity relationships in the knowledge graph to expand or rewrite the request along multiple dimensions such as position, work location, salary, and industry, and recall multiple jobs that match the user.

Main advantages of this strategy: good interpretability and low implementation cost. Its shortcoming and difficulty is heavy dependence on the accuracy of the mined tags.

3.2 Collaborative filtering improved for business characteristics

Collaborative filtering is a classic recall method in recommendation systems. It uses users' behavior on items to mine associations between users and between items. The recruitment business has a huge number of job seekers whose behavior is short-term and sparse, so we chose item-based collaborative filtering, and we also wanted to fold near-real-time behavior into the service.

For the implementation, we referred to the paper Tencent published in 2015, "TencentRec: Real-time Stream Recommendation in Practice". We give different weights to clicks, resume submissions, online chats, and other behaviors on a job and fuse them into a multi-behavior score; we design a user penalty factor based on the length and quality of the user's behavior sequence; and we use a time decay factor to strengthen the expression of recent behavior. The design of these three factors is basically the same as in the paper. In addition, for the special nature of our business, we improved the job similarity calculation by adding a job similarity control, so that users with divergent job targets do not distort the relationships between jobs. After the algorithm went online, both click-through rate and submission rate improved, and the related-jobs recommendation on the detail page improved by more than 25%.
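The combination of the three factors and the similarity control can be sketched as below. This is a batch, in-memory approximation that only loosely follows the structure described above; the action weights, the half-life, the logarithmic user penalty, and the `same_family` predicate are all assumptions for illustration, and a real system would update these counts incrementally on a stream.

```python
import math
from collections import defaultdict
from itertools import combinations

ACTION_WEIGHT = {"click": 1.0, "apply": 2.0, "chat": 3.0}  # hypothetical weights


def item_similarity(logs, now, half_life=24.0, same_family=lambda a, b: True):
    """Item-based CF over implicit feedback.

    logs: {user: [(job, action, timestamp_hours), ...]}
    same_family: hypothetical similarity control; pairs it rejects are never linked.
    """
    item_sum = defaultdict(float)
    pair_sum = defaultdict(float)
    for events in logs.values():
        rating = {}
        for job, action, ts in events:
            # Multi-behavior fusion with time decay: keep the strongest decayed signal.
            w = ACTION_WEIGHT[action] * 0.5 ** ((now - ts) / half_life)
            rating[job] = max(rating.get(job, 0.0), w)
        # User penalty: users who touch many jobs contribute less per job.
        penalty = 1.0 / math.log(2.0 + len(rating))
        for job, r in rating.items():
            item_sum[job] += r * penalty
        for a, b in combinations(sorted(rating), 2):
            if same_family(a, b):
                pair_sum[(a, b)] += min(rating[a], rating[b]) * penalty
    return {
        (a, b): s / math.sqrt(item_sum[a] * item_sum[b])
        for (a, b), s in pair_sum.items()
    }
```

Passing a stricter `same_family` (for example, one that compares job categories) keeps a user who browses unrelated jobs from creating a link between them.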

3.3 Exploring deep Embedding recall

Collaborative filtering has delivered good business gains, but it depends on the user-item behavior matrix, which inherently limits what it can express in sparse scenarios. Moreover, a considerable part of 58 Recruitment's traffic comes from third-, fourth-, and fifth-tier cities, and data sparsity is even more pronounced in these lower-tier cities. To address this, we wanted to mine more information from behavior data, which naturally led to vectorized Embedding recall based on deep learning. We borrowed the core ideas of YouTube's DNN recall and adapted them to the current state of our business.

Job vectorization: we treat a job seeker's behavior as a sequence with context and learn vector representations using the word2vec idea. The input includes job features, features of the enterprise that owns the job, and job seeker feedback features. When constructing the output, the deeper a behavior is in the funnel, the larger the context window chosen for it, with the average length of users' behavior sequences used as the reference value for the window. For new jobs with no historical user behavior, we use the job's structured text information and apply average pooling over tag vectors obtained from historical training as the initial vector, which addresses cold start.
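The cold-start step for new jobs amounts to averaging known tag vectors. A minimal sketch, assuming the trained tag vectors are available as a plain dictionary (the names here are illustrative):

```python
def cold_start_vector(job_tags, tag_vectors):
    """Average-pool the vectors of a new job's known tags.

    Returns None when none of the tags has a trained vector.
    """
    known = [tag_vectors[t] for t in job_tags if t in tag_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]
```

Once the job accumulates real behavior, its trained vector can replace this initial estimate.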

User vectorization: we build a multi-class neural network. The Embedding layer directly reuses the vectors of the jobs the user has interacted with, carried over from job vectorization, and the user's resume and portrait information are fed in as vectors for training. Ideally the top layer is an extreme multi-class classifier, with jobs on which the user actually acted as positive samples and jobs with no behavior as negative samples, trained by optimizing a loss function. However, 58 Recruitment has millions of positions, and extreme classification at that scale requires enormous computation that current resources cannot support. We therefore down-sample negatives: from the jobs in the cities the job seeker is interested in that the user did not act on, we randomly draw a certain proportion as negative samples. Online, user behavior is collected in real time and user vectors are updated over a sliding window.
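The negative down-sampling step might be sketched like this. The data layout (`jobs_by_city`, a list of cities of interest) and the fixed negatives-per-positive ratio are assumptions for the example.

```python
import random


def sample_negatives(positives, jobs_by_city, cities, ratio, rng):
    """Draw negatives from jobs in the user's cities of interest that the user never acted on.

    ratio: assumed number of negatives per positive sample.
    """
    pool = sorted(
        {job for city in cities for job in jobs_by_city.get(city, ())} - set(positives)
    )
    k = min(len(pool), ratio * len(positives))
    return rng.sample(pool, k)
```

Restricting the pool to the user's cities yields harder negatives than sampling from the whole catalog, since jobs in other cities are trivially irrelevant.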

Online service: drawing on Facebook's FAISS, when a user sends a request online, we use the job seeker's vector to retrieve the TopN most similar jobs and return them to the recommendation system.
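Conceptually, the online lookup is a nearest-neighbor search. The sketch below uses a brute-force cosine scan in plain Python purely to show the contract; it is a stand-in for a vector index such as FAISS, which would replace the scan at production scale.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_jobs(user_vec, job_vecs, n):
    """Return the ids of the n jobs whose vectors are most similar to the user vector."""
    return heapq.nlargest(n, job_vecs, key=lambda job: cosine(user_vec, job_vecs[job]))
```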

Vectorized Embedding recall is still in early exploration. Much work remains on samples, input features, and network parameter tuning. Once we see more significant business gains, we will share further.

  4. Ranking iteration history

Compared with other recommendation scenarios, the 58 Recruitment funnel is deeper and the business is typically two-sided. While the system keeps optimizing job seekers' clicks and resume submissions, it also needs to consider whether the recruiter behind the job gives effective feedback so that a two-sided connection forms, bringing the prediction target closer to the end of the job-search chain. Following the business goals of different periods, we went through several major stages.

Stage 1: with increasing click volume as the main goal, we built a click-through rate (CTR) prediction model from zero to one and developed the basic modeling framework, including feature engineering, the CTR model, an AB experiment framework, and online serving. With few staff at this stage, we established a general ranking framework and model service and supported business growth at the click level.

Stage 2: we went deeper into business goals, moving from clicks to one-sided connections and then to two-sided connections. On top of the CTR model we added CVR estimation and ROR estimation for two-sided connections. At the same time we built targeted tooling, including a feature production Pipeline, an upgraded AB experiment framework with a configuration center, and visual analysis and monitoring of model features, and we decoupled algorithm and engineering dependencies so that algorithm and engineering staff could iterate in parallel more efficiently.

Stage 3: exploring deep learning algorithms such as Wide & Deep, DeepFM, multi-task learning, and reinforcement learning, to improve the algorithms' ability to express high-dimensional features and the representational power of the prediction model. These are expected to be fully rolled out in 2020, reaching a more desirable state of iteration.

4.1 Connection conversion prediction model

The 58 Recruitment conversion prediction model is a multi-objective task; its design and implementation are shown in the figure above. On top of shared feature and sample construction, different algorithms model CTR (click) estimation, CVR (one-sided connection) estimation, and ROR (two-sided connection) estimation, and the models are finally fused to support online ranking.
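The talk does not give the fusion formula. One common choice, shown here purely as an assumed example, is a weighted product of the three estimates, where the exponents control how much each objective influences the final order.

```python
def fused_score(ctr, cvr, ror, w_ctr=1.0, w_cvr=1.0, w_ror=1.0):
    """Assumed fusion: weighted product of the three predicted probabilities."""
    return (ctr ** w_ctr) * (cvr ** w_cvr) * (ror ** w_ror)


def rank(candidates, **weights):
    """candidates: {job_id: (ctr, cvr, ror)}; returns job ids, best first."""
    return sorted(
        candidates,
        key=lambda job: fused_score(*candidates[job], **weights),
        reverse=True,
    )
```

With equal exponents, a job with a lower click probability can still rank first if it is more likely to lead to a connection.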

The overall ranking implementation follows common practice. The more crucial points in the process are summarized below:

Sample processing: the focus is on reducing sample noise, and we made several optimizations. We remove abnormal users and abnormal data, including data from users without recruitment intent and fake clicks. We added tracking of real exposure and dwell time, removed items that users scrolled past in the feed without actually seeing, and added dwell time to model training as a sample confidence weight. We sample at the job-seeker level and remove, as far as possible, contradictory positive and negative samples for the same job.

Feature engineering: watch and monitor both explicit and implicit feature changes. In particular, when the product adjusts how information is displayed on list pages, features and models must be adjusted and iterated promptly. Because of the nature of the 58 Recruitment business, real-time features are very important, and mechanisms are needed to guarantee feature consistency and avoid feature leakage across time as well as inconsistencies between online and offline features.

Model: cultivate an understanding of the model itself rather than simply looking at offline AUC or online AB conversion results. Analyze how features are expressed from several angles and compare features and models before and after each iteration; this can effectively improve the efficiency of model experimentation.
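The sample-processing point above can be illustrated with a small cleaning routine. The record layout, the dwell-time cap, and the rule that a positive overrides a negative for the same user and job are assumptions chosen for the sketch.

```python
def build_samples(impressions, max_dwell=30.0):
    """Turn raw impressions into weighted training samples.

    impressions: dicts with keys user, job, seen, clicked, dwell_seconds.
    Returns a list of (user, job, label, weight).
    """
    best = {}
    for imp in impressions:
        if not imp["seen"]:
            continue  # scrolled past without real exposure: not a valid negative
        key = (imp["user"], imp["job"])
        label = 1 if imp["clicked"] else 0
        # Dwell time as confidence for positives; negatives keep full weight.
        weight = min(imp["dwell_seconds"], max_dwell) / max_dwell if label else 1.0
        # Resolve contradictory samples: one label per (user, job), positive wins.
        if key not in best or label > best[key][0]:
            best[key] = (label, weight)
    return [(user, job, label, weight) for (user, job), (label, weight) in best.items()]
```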

4.2 Feature production implementation

Building a feature Pipeline reduces duplicated feature engineering work and significantly improves the efficiency of model iteration. Its core functions are implemented in a configurable way and integrate sample sampling, feature transformation, feature crossing, and feature discretization. The resulting training samples are, on the one hand, delivered for model training and evaluation, and on the other hand output to a visualization platform to support analysis.

4.3 Model serving implementation

Online model services need regular updates and a large number of AB experiments. As the service evolved, we built the current model Serving framework, which automatically loads and unloads models and updates them on a schedule, and which is more extensible, making it easy to plug in models from different algorithms. Offline, the feature Pipeline builds incremental training data; the model training module fetches the Base model file, initializes from it, and runs incremental training; if evaluation shows nothing abnormal, the system stores the model file in the model repository and HDFS. Online, when models are added to or removed from the model repository, a hot load or unload command is issued to update the online service's memory. When an online ranking request uses a model, the model's lifecycle in the repository is refreshed in real time; models that go unused for a long time are automatically deleted from the repository.
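The hot load, unload, and lifecycle behavior described above can be pictured with a small in-memory registry. This is a hypothetical sketch: class and method names are invented for illustration, models are plain callables, and timestamps are passed in explicitly instead of reading a clock.

```python
class ModelRegistry:
    """Keeps loaded models in memory and evicts those unused for longer than ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._models = {}  # name -> [model, last_used_timestamp]

    def load(self, name, model, now):
        """Hot-load (or replace) a model without restarting the service."""
        self._models[name] = [model, now]

    def unload(self, name):
        self._models.pop(name, None)

    def names(self):
        return sorted(self._models)

    def predict(self, name, features, now):
        """Score a request and refresh the model's last-used time."""
        entry = self._models[name]
        entry[1] = now
        return entry[0](features)

    def evict_stale(self, now):
        """Unload models that no request has touched within the TTL."""
        stale = [n for n, (_, last) in self._models.items() if now - last > self.ttl_seconds]
        for name in stale:
            self.unload(name)
        return stale
```

Refreshing the timestamp on every request is what lets experiments that have ended age out on their own, while models still receiving traffic stay loaded.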

The Serving framework automates the management of online models, but we cannot hand everything over to the system; we still need to watch how models change. On the one hand, in the offline model evaluation step, besides automatically monitoring evaluation metrics such as AUC, we also monitor the model's in-memory size and its feature representations. On the other hand, we monitor changes in online business conversion metrics; when a metric fluctuates sharply, an alert is raised and the model is checked manually.

4.4 Re-ranking mechanism

Because of the special nature of the business, CTR estimation, CVR (one-sided connection) estimation, and ROR (two-sided connection) estimation alone are still insufficient to support ranking. Their descriptive power falls short in the following areas:

Jobs concern both individual livelihoods and the national economy, so this is a very serious matter, and content quality is the basic guarantee. But connection prediction models cannot effectively describe quality: some positions connect very efficiently yet are problematic. The recommendation system therefore needs more quality-related factors.

Conversion rate is not equivalent to a two-sided match. Online recruitment cannot track the interview and onboarding stages well, and a two-sided connection between a job seeker and a recruiter may form for other reasons (such as misjudging oneself or the other party). The system therefore needs controls on how well the two sides match.

Wasted resources: for most users, job seeking and recruiting are periodic behaviors, and a job that has already been filled may still be shown online. The system needs to model activity and the job cycle and reduce the corresponding waste.

In response to these needs, the system added a re-ranking mechanism that processes content in stages: low-quality content is suppressed or even filtered out in the coarse-ranking stage, and inactive or poorly matched content is down-weighted in the re-ranking stage. This protects the quality of the platform's ecosystem and increases the number of effective connections.
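A compact sketch of this staged processing follows. The field names, the quality threshold, and the penalty multipliers are assumptions for illustration; the talk does not specify the actual rules.

```python
def rerank(candidates, min_quality=0.3, inactive_penalty=0.5, mismatch_penalty=0.7):
    """Filter low-quality jobs, then down-weight inactive or poorly matched ones.

    candidates: dicts with keys job, score, quality, active, matched.
    Returns job ids ordered by adjusted score, best first.
    """
    adjusted = []
    for c in candidates:
        if c["quality"] < min_quality:
            continue  # coarse stage: drop low-quality content outright
        score = c["score"]
        if not c["active"]:
            score *= inactive_penalty
        if not c["matched"]:
            score *= mismatch_penalty
        adjusted.append((score, c["job"]))
    adjusted.sort(key=lambda pair: pair[0], reverse=True)
    return [job for _, job in adjusted]
```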

  5. List display content control

For content display, we also use algorithms to improve the interpretability of results and give users more valuable information to support their decisions. With a personalized tag and highlight mining model, the core features from the deeper prediction models are packaged as tags and shown on list pages, such as distance, benefits, and job popularity. We also use natural language generation (NLG) to automatically produce short descriptions for display, making up for the limited information in a plain job title.

  6. AB experiment configuration center

The recommendation system includes several core modules: recall, filtering, ranking, and display. Each module has long-term needs for experimental iteration. We built an AB experiment configuration center with visual configuration, linked to online services and the data analysis platform, so that experiments can be iterated more flexibly and efficiently.

  7. Overall technical framework

Through continuous evolution, 58 Recruitment personalized recommendation arrived at the technical framework shown in the figure above. The offline part contains the data warehouse layer, the mining layer (knowledge graph, user portraits, prediction models), and the knowledge and data storage layer. The online part contains data services and the recommendation engine. Behavior data generated online flows in real time to the offline computation and mining modules, whose results feed back online to deliver a personalized experience.

- Lessons learned and future plans -

The figure above summarizes where the optimization gains of the 58 Recruitment recommendation system came from over the last four years; in order of contribution: recall, features, data, algorithms, display style, and engineering. A deep understanding of both the business and the algorithms, and attention to algorithmic detail, are what make the work accumulate. Putting more effort into samples and features early on not only yields good business growth but also lays the foundation for later deep learning algorithms. Develop tooling first wherever possible, since it improves overall iteration efficiency.

Core future work:

Fully explore and roll out multi-task learning, reinforcement learning, and related methods.

Bring together internal and external resources to enrich recruitment data sources and improve user portrait coverage, to better support a personalized experience for every user.

Guest speaker

Zeng Qinbang

58.com | Senior Technical Manager
--END--

Origin www.cnblogs.com/datafuntalk/p/11870357.html