Research on Person-post Matching Algorithm Based on Deep Reinforcement Learning

1. Demand Analysis

        The HR-oriented job matching function helps HR screen resumes efficiently. Given the requirements of a particular position, the model selects the resumes that best match those requirements from the resume database and recommends them to HR. Common job requirements include age, education, and years of work experience. A resume, in turn, has the following features: the applicant's age, education, graduating school, years of work experience, and target position. The model should match and recommend resumes based on the job requirements and these resume features.

2. Algorithm selection

        This is essentially a recommendation problem, and common recommendation systems fall into three categories: collaborative filtering, content-based recommendation, and hybrid recommendation. Although traditional recommendation algorithms can solve most information filtering problems, they struggle with data sparsity, cold start, repeated recommendation, and so on [1]. The authors of this project found that deep reinforcement learning can achieve better resume recommendation results.

3. Deep reinforcement learning

1. Reinforcement Learning

        Reinforcement learning is grounded in the Markov decision process. Its basic idea is to have an agent take actions in an environment so as to maximize its cumulative reward: the agent is punished for bad behavior and rewarded for good behavior.

Figure 1 Markov decision model

We start the agent in an initial environment. It does not yet have any reward, but it has a state (S_t).

Then, in each iteration, the agent takes the current state (S_t), chooses the best action (A_t) based on the model's prediction, and executes it in the environment. The environment then returns the reward (R_{t+1}) for that action, the new state (S_{t+1}), and a flag indicating whether the new state is terminal. This process repeats until termination.
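As a minimal sketch of this interaction loop (the env and agent objects and their methods are placeholders for illustration, not part of this project's code):

# Minimal agent-environment interaction loop; env/agent are hypothetical objects.
state = env.reset()                                # initial state S_t, no reward yet
done = False
while not done:
    action = agent.act(state)                      # choose A_t for the current S_t
    next_state, reward, done = env.step(action)    # environment returns R_{t+1}, S_{t+1}, terminal flag
    state = next_state                             # continue from the new state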

More details about reinforcement learning can be found at:

Literature [2]: https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288

Literature [3]: https://medium.com/towards-data-science/reinforcement-learning-explained-visually-part-3-model-free-solutions-step-by-step-c4bbb2b72dcf

Literature [4]: Research on Path Planning Algorithms for Intelligent Robots Based on Reinforcement Learning (with Code)

The most commonly used algorithm in reinforcement learning is Q-learning.

The Q-learning algorithm uses a table of state-action values (the Q table). Each row of the table corresponds to a state and each column to an action, and each cell contains the estimated Q value for that state-action pair.

All Q values are initialized to zero. As the agent interacts with the environment and receives feedback, the algorithm iteratively improves these Q values until they converge to the optimal values.
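As a minimal illustration, a Q-table update could look like the sketch below (the state/action counts, learning rate alpha, and discount factor gamma are assumed example values, not taken from this project):

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))    # the Q table, initialized to all zeros
alpha, gamma = 0.1, 0.9                # assumed learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])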

For the detailed theory of the Q-learning algorithm, please refer to:

Literature [5]: https://medium.com/towards-data-science/reinforcement-learning-explained-visually-part-4-q-learning-step-by-step-b65efb731d3e

And literature [4].

2. Deep reinforcement learning

This part is the focus of this article. Deep reinforcement learning methods fall into three types: (1) value-based algorithms such as the Deep Q Network (DQN); (2) Policy Gradient algorithms; (3) Actor-Critic algorithms.

        This project uses the Deep Q Network (DQN) from deep reinforcement learning. To understand why DQN exists, we must start from the Q-learning algorithm. Q-learning uses a data structure, the Q table, to record the Q value determined by each state and action. In real scenarios, however, the state space is often large and the action space may be large as well, so maintaining a huge Q table requires a lot of system overhead.

Figure 2 Qtable

If a function (a Q-function) could take a state-action pair as input and output the corresponding Q value, i.e. map a state to the Q values of all actions that can be performed from that state, performance would improve greatly, as illustrated below:

Figure 3 Q-function

A neural network is well suited to approximate this function. The DQN algorithm uses a deep neural network to improve the reinforcement learning algorithm.

Figure 4 Deep Q Network (DQN)

        This algorithm has two key features: one is experience replay, and the other is a dual-network design with a Q Network and a Target Network.

2.1 Experience Replay

        Training an AI decision-making model requires continuously collecting feedback after the model acts in the environment, but how data are collected, and which data, directly affects training. If samples are collected one at a time, each sample and its gradient have too much variance, and the neural network weights will not converge. Collecting a batch of samples sequentially also has a flaw: it can lead to "catastrophic forgetting". For example, imagine a robot in one corner of a large factory. Every time it walks, it learns only from experience gathered in that small corner, so the collected data and learned experience are confined to it; this is analogous to sequential batch samples. If the robot is then taken to a completely different corner of the factory, it starts learning from the new corner and soon forgets everything it learned in the original one. The cause is that the training samples are too strongly correlated.

If instead we use a pool, load each collected batch of samples into it, and, once enough samples have accumulated, randomly draw a batch from the pool each time and feed it to the neural network, then the samples learned from each time are diverse. This greatly reduces the correlation between samples and improves learning.

It is as if parents collect all the questions their children got wrong and ask them to redo them. The more varied the collected wrong questions, the better the children can learn from different types of problems, and the better they will do on new exercises.
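A minimal sketch of such a pool, assuming transitions are stored as (state, action, reward, next_state, done) tuples and that the capacity and batch size are arbitrary example values:

import random
from collections import deque

replay_pool = deque(maxlen=10000)       # old samples are discarded once the pool is full

def store(transition):
    replay_pool.append(transition)      # load each collected sample into the pool

def sample_batch(batch_size=32):
    # Randomly draw a diverse batch, which breaks the correlation between
    # consecutive samples and reduces "catastrophic forgetting".
    return random.sample(replay_pool, batch_size)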

2.2 Dual Neural Network: Q Network and Target Network

        The original setup has only a single Q Network. The Q Network predicts the Q value of each action in a given state, and the agent then decides based on the Q values of the different actions. But when we train the neural network, the loss is computed and backpropagated to update every neuron's weights at each step. With a single network, a phenomenon called "bootstrapping" occurs: at each iteration the Q Network approaches itself and uses itself as the target, yet because its weights change every time it is trained, the pre-update network is forever chasing the post-update network and can never catch up, as if you were trying to chase your own shadow. This brings great instability to model training.

If we create a second neural network with exactly the same structure, but, unlike the Q Network, update its weights only at fixed intervals (by copying the Q Network's weights into it), then the Q Network has a stable target it can gradually approach. We call this second network the "Target Network".

To give an analogy: a person starts out very poor and would be satisfied with owning 1 million. If every time his wealth grows he raises the bar, no longer satisfied with 1 million but wanting 10 million, then he is always chasing a higher goal and will never be satisfied. If instead the goal stays fixed at the 1 million he set when he had nothing, then once he has 800,000 he only needs another 200,000 to be satisfied, and every amount he earns brings him closer. The Target Network is like that fixed goal of 1 million, so each training step actually makes progress toward it.
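A minimal sketch of the dual-network setup, assuming a hypothetical build_model() helper that constructs the Q Network architecture in Keras and an example copy interval of 100 steps:

q_network = build_model()
target_network = build_model()
target_network.set_weights(q_network.get_weights())    # start with identical weights

SYNC_INTERVAL = 100    # assumed interval (in steps) at which the target is refreshed

def maybe_sync(step):
    # Only every SYNC_INTERVAL steps does the Target Network copy the Q Network's
    # weights; in between, the Q Network has a stable target to approach.
    if step % SYNC_INTERVAL == 0:
        target_network.set_weights(q_network.get_weights())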

2.3 DQN algorithm flow

        Initially, the state is fed into the Q Network, which predicts the Q value of each action in that state. Following the ε-greedy method, with a certain probability a random action is chosen for execution; otherwise the action with the largest Q value is executed. After execution, the environment returns a feedback value (reward) for the action, and the transition (current state, current action action(i), reward, next state) is stored in the experience replay pool. This process repeats until the agent reaches a terminal state, and then the Q Network weights are updated as follows: a batch of samples is drawn from the experience replay pool and fed to the Q Network and the Target Network. The Q Network uses each sample's current state and action(i) to compute that sample's Q value (QValue), while the Target Network uses each sample's next state and reward to compute the sample's target value. The target value is calculated as:

Figure 5 Calculation of Target Value

is_end_j indicates whether the agent has reached the terminal state, i.e. the end point. If it has, the target value is simply the feedback value of the chosen action action(i) in this state. If not, the target value equals the feedback value plus the discount factor times the maximum of the Target Network's values over the actions available in the next state. Each sample's target value is computed this way.
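Written out from this description (a reconstruction, since Figure 5 is not reproduced here), the target value y_j of sample j is:

y_j = R_j, if is_end_j is true (terminal state)
y_j = R_j + γ · max_{a'} Q'(S_{j+1}, a'), otherwise

where Q' denotes the Target Network's prediction and γ is the discount factor.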

    After obtaining the Q Network's predicted value and the Target Network's target value, we use the mean square error (MSE) formula:

Figure 6 Mean square error formula

We use it to compute the Q Network's loss, then apply a gradient-based optimizer and backpropagate to update the Q Network's weights. The following is the flow of the entire model:

Figure 7 DQN algorithm flow
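Below is a minimal sketch of one such update step in TensorFlow/Keras. It assumes Keras models q_network and target_network, a Keras optimizer, and that each stored transition also carries a terminal flag (dones); it is an illustration of the procedure above, not the project's exact code.

import numpy as np
import tensorflow as tf

def train_step(q_network, target_network, batch, optimizer, gamma=0.85):
    states, actions, rewards, next_states, dones = batch
    # Target Network predicts Q values for the next states.
    next_q = target_network.predict(next_states, verbose=0)
    # Target value: reward at terminal states, otherwise reward + gamma * max_a' Q'.
    targets = rewards + (1.0 - dones) * gamma * np.max(next_q, axis=1)
    targets = tf.constant(targets, dtype=tf.float32)
    with tf.GradientTape() as tape:
        q_values = q_network(states)                         # Q(s, ·) for the whole batch
        mask = tf.one_hot(actions, q_values.shape[1])        # select the executed action
        q_taken = tf.reduce_sum(q_values * mask, axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))  # MSE between target and prediction
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
    return loss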

The DQN algorithm is described in detail in [6], with step-by-step analysis and wonderful illustrations:

Literature [6]: https://medium.com/towards-data-science/reinforcement-learning-explained-visually-part-5-deep-q-networks-step-by-step-5a5317197f4b

4. Modeling of the Person-post Matching Model

        Because this function is built for HR, we can view job matching as using one position to match against the resumes in the resume set. Matching a resume is treated as an action, so the number of actions in the action space equals the number of resumes; each resume corresponds to one action, and feedback is given according to how well the resume's features match the position's requirements. The better the match, the higher the resume's feedback value. In this model the position can be regarded as the agent, and each match to a resume is an action.

What remains to be defined is the state:

The initial state is a resume chosen at random from the resume set. During training, the resume selected by each decision becomes the next state. In other words, every state is itself a resume.

  • State space S: the set of resumes
  • Action space A: the set of resumes
  • Recruiter feedback R: feedback based on how well a resume matches the job's requirements
  • Discount factor γ: the discount applied to returns at different times

 Figure 8 Schematic diagram of the algorithm
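A minimal sketch of this formulation is shown below. The MatchEnv class, the resume/job records, and compute_reward are illustrative names (compute_reward follows the reward design described in section 5 below), not the project's actual code.

import random

class MatchEnv:
    def __init__(self, resumes, job):
        self.resumes = resumes      # state space S = action space A = the resume set
        self.job = job              # the position plays the role of the agent

    def reset(self):
        # Initial state: a resume chosen at random from the resume set.
        self.state = random.randrange(len(self.resumes))
        return self.state

    def step(self, action):
        # The recommended resume becomes the next state; the feedback depends on
        # how well that resume matches the job (see the reward design below).
        reward = compute_reward(self.resumes[action], self.job)
        self.state = action
        return self.state, reward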

5. Problems encountered during coding and debugging

1. What if a job seeker lists two target positions in a resume?

The data set is an Excel file, and both target positions are stored in a single cell, for example:

Figure 9 Problems in the target post column

        We can use the split function to separate the cell on the "," character and divide the original target-position column into two columns (aim_job1, aim_job2). If a resume does not specify a target position, both columns are set to none; if one target position is given, it goes into aim_job1 and aim_job2 is set to none; if two target positions are given, they go into aim_job1 and aim_job2 in turn.
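A minimal sketch with pandas (the file name and the original column name aim_job are assumptions):

import pandas as pd

df = pd.read_excel("resumes.xlsx")
# Split the single cell on ',' into at most two columns.
jobs = df["aim_job"].fillna("None").str.split(",", expand=True)
df["aim_job1"] = jobs[0].fillna("None")
# Some rows may have no second target position at all.
df["aim_job2"] = jobs[1].fillna("None") if jobs.shape[1] > 1 else "None"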

2. How to design the reward and punishment function?

Rewards and punishments should be determined by how well the job's feature values match the resume's feature values:

First, encode the resume's feature values and the position's feature values, replacing each string with a number:

Corresponding codes for academic qualifications: 'Technical Secondary School': 1, 'Junior College': 2, 'Undergraduate': 3, 'Master': 4, 'PhD': 5

Corresponding codes of target positions: 'Product Operation': 0, 'Graphic Designer': 1, 'Finance': 2, 'Marketing': 3, 'Project Supervisor': 4, 'Development Engineer': 5, 'Clerk': 6, 'E-commerce Operation': 7, 'Human Resource Management': 8, 'Risk Control Specialist': 9, 'None': 10

(1) If the resume's target position is not this position, the reward is -30.

(2) If the resume's target position is this position, but education, working years, or age do not meet the job requirements, points are deducted from the reward according to weights. Here the weight of education is 2 and the weight of working years (experience) is 1, and the reward is calculated as:

self.reward = -5 + self.W[0] * (resume_experience - experience_min) \
              + self.W[1] * (resume_education - education_required)

(3) If the resume's target position is this position and the education, working years, and age all meet the job requirements, then the higher the education and the longer the experience, the larger the resume's reward. This is achieved by taking the differences between the resume's education and working-years values and the corresponding values required by the position: the larger the differences, the higher the education and the longer the working years. The same weights are incorporated into the reward calculation:

5 + 1 * (resume_experience - job_experience_min) + 2 * (resume_education - job_education_required)
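Putting the three cases together, a sketch of the reward function could look like this. The field names on resume and job (and the simplified age check) are illustrative assumptions; the base values -30, -5, +5 and the weights 1 and 2 follow the description above.

def compute_reward(resume, job, w_experience=1, w_education=2):
    # Case (1): the resume's target positions do not include this job at all.
    if job.job_code not in (resume.aim_job1, resume.aim_job2):
        return -30
    exp_diff = resume.experience - job.experience_min
    edu_diff = resume.education - job.education_required
    # Case (2): the position matches, but some requirement is not met.
    if exp_diff < 0 or edu_diff < 0 or resume.age > job.age_max:
        return -5 + w_experience * exp_diff + w_education * edu_diff
    # Case (3): all requirements met; larger differences mean a larger reward.
    return 5 + w_experience * exp_diff + w_education * edu_diff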

3. How to sample, compute the loss, and backpropagate to update the weights?

Draw a batch of samples from the experience replay pool, take out each field of the batch (for example the current states) and pack the values into arrays, giving four arrays in total. Send the array of next states to the Target Network and the array of current states to the Q Network, extract the QValue of the corresponding action with a one-hot mask, and compute the loss from it together with the target value calculated by the Target Network from the rewards.
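A short sketch of that batching step, reusing the replay pool and transition layout (including the terminal flag) from the earlier sketches:

import random
import numpy as np

batch = random.sample(replay_pool, batch_size)
# Regroup the fields of the sampled transitions into one array per field.
states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
# next_states go to the Target Network; states, together with the one-hot action
# mask, go to the Q Network, as in the training-step sketch above.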

4. Error: ValueError: You called set_weights(weights) on layer "match_model_1" with a weight list of length 4, but the layer was expecting 0 weights. Provided weights: [array([[-1.56753018e-01, 5.38734198e-02, 2.8747...

After repeated debugging, the author found that this was caused by a mismatch between the step interval at which the Q Network copies its weights to the Target Network and the number of samples drawn from the experience replay pool each time. If the number of samples drawn per batch is too large while the copy interval is too small, this error appears; the theoretical reason is still unclear to the author, and anyone who can explain it is welcome to point it out in the comments. The error is worked around indirectly later.

5. The algorithm gets stuck in a local optimum and converges too slowly

We set the training length to 5000 iterations:

Set the training parameters as:

learning_rate=0.10, discount_factor=0.85, exploration_rate=0.5

Figure 10 Poor algorithm convergence

We can see that the average reward keeps fluctuating widely, and the Q-value ranking of the resumes looks like this:

Q value sorting results:

1 Num: 38, Name: Hong Zifen, Age: 38, Experience: 12, Education: 3, School: Beijing Normal University, aim_job_1: 3, aim_job_2: 4 Q value: 0.91416574

2 Num: 22, Name: Lei Jinbao, Age: 38, Experience: 9, Education: 4, School: China University of Geosciences, aim_job_1: 4, aim_job_2: 10 Q value: 0.826052

3 Num: 16, Name: Liu Ziting, Age: 36, Experience: 10, Education: 3, School: Wuhan University, aim_job_1: 3, aim_job_2: 4 Q value: 0.77591

4 Num: 23, Name: Wu Meilong, Age: 37, Experience: 6, Education: 4, School: Hunan University, aim_job_1: 4, aim_job_2: 10 Q value: 0.7495003

5 Num: 18, Name: Lu Zhiying, Age: 38, Experience: 12, Education: 3, School: Wuhan University of Technology, aim_job_1: 3, aim_job_2: 4 Q value: 0.73388803

6 Num: 48, Name: Lin Yingwei, Age: 34, Experience: 7, Education: 3, School: Nankai University, aim_job_1: 4, aim_job_2: 10 Q value: 0.61048335

7 Num: 21, Name: Zheng Yiwen, Age: 39, Experience: 8, Education: 4, School: Huazhong Agricultural University, aim_job_1: 4, aim_job_2: 10 Q value: 0.604043

8 Num: 55, Name: Yao Yangyun, Age: 35, Experience: 6, Education: 4, School: Communication University of China, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

9 Num: 58, Name: Tang Xinyi, Age: 34, Experience: 11, Education: 1, School: Hebei Industrial Vocational and Technical College, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

10 Num: 53, Name: Zheng Xingyu, Age: 26, Experience: 3, Education: 1, School: Xinjiang Agricultural Vocational and Technical College, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

11 Num: 35, Name: Lin Jialun, Age: 28, Experience: 5, Education: 2, School: Shanghai Jiaotong University, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

12 Num: 45, Name: Hu Tai, Age: 33, Experience: 10, Education: 2, School: Sun Yat-sen University, aim_job_1: 0, aim_job_2: 4 Q value: 0.0

13 Num: 59, Name: Chen Zhengsheng, Age: 32, Experience: 10, Education: 1, School: Beijing Medical College, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

14 Num: 60, Name: Li Shushu, Age: 28, Experience: 5, Education: 1, School: Fujian Vocational College of Shipbuilding and Transportation, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

15 Num: 44, Name: Zheng Yaqian, Age: 36, Experience: 7, Education: 4, School: Harbin Institute of Technology, aim_job_1: 4, aim_job_2: 10 Q value: 0.0

16 Num: 41, Name: Yuwen Ren, Age: 29, Experience: 3, Education: 3, School: Shanghai Jiao Tong University, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

17 Num: 100, Name: Lin Yuting, Age: 31, Experience: 7, Education: 2, School: Beijing Normal University, aim_job_1: 0, aim_job_2: 10 Q value: 0.0

18 Num: 25, Name: Wang Meizhu, Age: 35, Experience: 4, Education: 4, School: Hunan Normal University, aim_job_1: 4, aim_job_2: 10 Q value: 0.0

19 Num: 32, Name: Wu Meiyu, Age: 34, Experience: 8, Education: 2, School: University of Electronic Science and Technology of China Chengdu College, aim_job_1: 4, aim_job_2: 10 Q value: 0.0

20 Num: 29, Name: Cao Minyou, Age: 29, Experience: 6, Education: 2, School: Shandong University, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

21 Num: 69, Name: Liu Tingbao, Age: 26, Experience: 6, Education: 0, School: Jinan Media School, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

22 Num: 24, Name: Wu Xinzhen, Age: 36, Experience: 7, Education: 4, School: Central South University, aim_job_1: 4, aim_job_2: 10 Q value: 0.0

23 Num: 20, Name: Li Yungui, Age: 40, Experience: 14, Education: 3, School: Central China Normal University, aim_job_1: 3, aim_job_2: 4 Q value: 0.0

24 Num: 19, Name: Fang Yiqiang, Age: 39, Experience: 13, Education: 3, School: Zhongnan University of Economics and Law, aim_job_1: 3, aim_job_2: 4 Q value: 0.0

25 Num: 17, Name: Rong Zikang, Age: 37, Experience: 11, Education: 3, School: Huazhong University of Science and Technology, aim_job_1: 3, aim_job_2: 4 Q value: 0.0

26 Num: 11, Name: Li Zhongbing, Age: 24, Experience: 5, Education: 0, School: Shunde Technical Secondary School, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

27 Num: 5, Name: Jiang Yiyun, Age: 28, Experience: 5, Education: 2, School: South China University of Technology, aim_job_1: 4, aim_job_2: 10 Q value: 0.0

28 Num: 3, Name: Lin Wenshu, Age: 28, Experience: 4, Education: 2, School: Southern University of Science and Technology, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

29 Num: 2, Name: Lin Guorui, Age: 34, Experience: 11, Education: 2, School: Communication University of China, aim_job_1: 3, aim_job_2: 10 Q value: 0.0

30 Num: 68, Name: Lin Chengchen, Age: 25, Experience: 7, Education: 0, School: Jinan New Technology Application School, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

31 Num: 50, Name: Li Xiuling, Age: 32, Experience: 7, Education: 2, School: Communication University of China, aim_job_1: 0, aim_job_2: 4 Q value: 0.0

32 Num: 84, Name: Liu Xiaozi, Age: 30, Experience: 6, Education: 2, School: Southeast University, aim_job_1: 4, aim_job_2: 10 Q value: 0.0

33 Num: 96, Name: Du Yi, Age: 30, Experience: 8, Education: 2, School: Beijing City University, aim_job_1: 2, aim_job_2: 10 Q value: 0.0

34 Num: 92, Name: Shen Huimei, Age: 30, Experience: 6, Education: 2, School: Shanghai Ocean University, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

35 Num: 75, Name: Lian Shuzhong, Age: 32, Experience: 8, Education: 3, School: Sun Yat-sen University, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

36 Num: 90, Name: Huang Kanggang, Age: 29, Experience: 8, Education: 2, School: University of Science and Technology Beijing, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

37 Num: 88, Name: Wu Tingting, Age: 29, Experience: 3, Education: 3, School: Central University of Finance and Economics, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

38 Num: 87, Name: Lin Yizi, Age: 26, Experience: 0, Education: 3, School: Renmin University of China, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

39 Num: 89, Name: Yang Yijun, Age: 27, Experience: 4, Education: 2, School: Beijing Tsinghua University, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

40 Num: 97, Name: Pan Xiaodong, Age: 23, Experience: 0, Education: 2, School: Central Academy of Chinese Opera, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

41 Num: 98, Name: Zhou Zhihe, Age: 27, Experience: 4, Education: 2, School: Shenzhen University, aim_job_1: 0, aim_job_2: 4 Q value: 0.0

42 Num: 43, Name: Lin Shimei, Age: 34, Experience: 11, Education: 2, School: Chongqing University, aim_job_1: 3, aim_job_2: 4 Q value: -0.17997088

43 Num: 4, Name: Lin Yanan, Age: 28, Experience: 5, Education: 2, School: South China Normal University, aim_job_1: 4, aim_job_2: 10 Q value: -0.35405895

44 Num: 42, Name: Li Zhi, Age: 32, Experience: 9, Education: 2, School: Tongji University, aim_job_1: 4, aim_job_2: 10 Q value: -0.36112082

……

        We found that some resumes whose target position matches the job still have a Q value of 0, and that Li Yungui's experience and education are better than those of Hong Zifen, who ranks first, yet his Q value is 0. This shows that the resumes with a Q value of 0 were never trained on at all: the algorithm falls into a local optimum at some point. The usual way to prevent this is to adjust the hyperparameters, especially exploration_rate, because this value determines how likely the algorithm is to select a random action; increasing it helps avoid falling into a local optimum, especially because our exploration_rate also decays with the number of iterations.
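For reference, a minimal sketch of ε-greedy selection with decay (the decay factor 0.995 is an example value, not the project's setting):

import random
import numpy as np

def choose_action(q_values, exploration_rate):
    # With probability exploration_rate pick a random action, otherwise exploit.
    if random.random() < exploration_rate:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

exploration_rate = 1.0
exploration_rate *= 0.995     # decays as the iterations proceed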

Now adjust the parameters to:

learning_rate=0.10, discount_factor=0.85, exploration_rate=1

Figure 11 Algorithm convergence results

We find that the algorithm converges quickly, but not perfectly, and the Q-value ranking has fallen into a local optimum again:

Q value sorting results:

1 Num: 48, Name: Lin Yingwei, Age: 34, Experience: 7, Education: 3, School: Nankai University, aim_job_1: 4, aim_job_2: 10 Q value: 2.521024

2 Num: 20, Name: Li Yungui, Age: 40, Experience: 14, Education: 3, School: Central China Normal University, aim_job_1: 3, aim_job_2: 4 Q value: 2.4513242

3 Num: 19, Name: Fang Yiqiang, Age: 39, Experience: 13, Education: 3, School: Zhongnan University of Economics and Law, aim_job_1: 3, aim_job_2: 4 Q value: 2.3784618

4 Num: 17, Name: Rong Zikang, Age: 37, Experience: 11, Education: 3, School: Huazhong University of Science and Technology, aim_job_1: 3, aim_job_2: 4 Q value: 2.372804

5 Num: 24, Name: Wu Xinzhen, Age: 36, Experience: 7, Education: 4, School: Central South University, aim_job_1: 4, aim_job_2: 10 Q value: 2.3390896

6 Num: 22, Name: Lei Jinbao, Age: 38, Experience: 9, Education: 4, School: China University of Geosciences, aim_job_1: 4, aim_job_2: 10 Q value: 2.235163

7 Num: 18, Name: Lu Zhiying, Age: 38, Experience: 12, Education: 3, School: Wuhan University of Technology, aim_job_1: 3, aim_job_2: 4 Q value: 2.152625

8 Num: 38, Name: Hong Zifen, Age: 38, Experience: 12, Education: 3, School: Beijing Normal University, aim_job_1: 3, aim_job_2: 4 Q value: 2.1127188

9 Num: 16, Name: Liu Ziting, Age: 36, Experience: 10, Education: 3, School: Wuhan University, aim_job_1: 3, aim_job_2: 4 Q value: 2.1125493

10 Num: 21, Name: Zheng Yiwen, Age: 39, Experience: 8, Education: 4, School: Huazhong Agricultural University, aim_job_1: 4, aim_job_2: 10 Q value: 2.1001434

11 Num: 44, Name: Zheng Yaqian, Age: 36, Experience: 7, Education: 4, School: Harbin Institute of Technology, aim_job_1: 4, aim_job_2: 10 Q value: 2.089555

12 Num: 25, Name: Wang Meizhu, Age: 35, Experience: 4, Education: 4, School: Hunan Normal University, aim_job_1: 4, aim_job_2: 10 Q value: 1.840682

13 Num: 23, Name: Wu Meilong, Age: 37, Experience: 6, Education: 4, School: Hunan University, aim_job_1: 4, aim_job_2: 10 Q value: 1.2011623

14 Num: 83, Name: Peng Zhengren, Age: 25, Experience: 2, Education: 2, School: Huazhong University of Science and Technology, aim_job_1: 0, aim_job_2: 10 Q value: 0.0

15 Num: 35, Name: Lin Jialun, Age: 28, Experience: 5, Education: 2, School: Shanghai Jiaotong University, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

16 Num: 43, Name: Lin Shimei, Age: 34, Experience: 11, Education: 2, School: Chongqing University, aim_job_1: 3, aim_job_2: 4 Q value: 0.0

17 Num: 52, Name: Ye Weizhi, Age: 34, Experience: 3, Education: 4, School: Communication University of China, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

18 Num: 62, Name: Lin Yahui, Age: 24, Experience: 5, Education: 0, School: Guangzhou Light Industry Vocational School, aim_job_1: 0, aim_job_2: 10 Q value: 0.0

19 Num: 66, Name: Lai Shuzhen, Age: 27, Experience: 1, Education: 0, School: Changsha Vocational and Technical College, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

20 Num: 15, Name: Hong Zhenxia, Age: 23, Experience: 4, Education: 0, School: Shenzhen Technical Secondary School, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

21 Num: 8, Name: Lin Zifan, Age: 28, Experience: 6, Education: 1, School: Yangjiang Vocational College, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

22 Num: 6, Name: Liu Bohong, Age: 31, Experience: 9, Education: 1, School: Guangdong Teachers College, aim_job_1: 10, aim_job_2: 10 Q value: 0.0

23 Num: 4, Name: Lin Yanan, Age: 28, Experience: 5, Education: 2, School: South China Normal University, aim_job_1: 4, aim_job_2: 10 Q value: 0.0

24 Num: 45, Name: Hu Tai, Age: 33, Experience: 10, Education: 2, School: Sun Yat-sen University, aim_job_1: 0, aim_job_2: 4 Q value: -0.3499998

25 Num: 42, Name: Li Zhi, Age: 32, Experience: 9, Education: 2, School: Tongji University, aim_job_1: 4, aim_job_2: 10 Q value: -0.36110973

This shows that even with the exploration rate raised to 1, it is hard to avoid the situation where, by the time the terminal state of the final iteration is reached, the Q values of some actions have never been predicted, i.e. those actions were never executed, and they may contain the optimal solution.

The solution is to make the terminal condition of the first iteration require that all actions have been executed:

if episode == 0:
    # In the first episode, only treat the state as terminal once every action
    # index (12 through 111) has appeared in action_union, i.e. every action
    # has been executed at least once.
    if set(range(12, 112)).issubset(set(action_union)):
        done = True
    else:
        done = False

In this way, every action is executed at least once, and the reward signal is kept from being too sparse.

Figure 12 The algorithm converges quickly 

Figure 13 Three-dimensional state-action corresponding Q value diagram

6. The problem of low accuracy

When testing again at the end, we found that the accuracy stayed low for a long stretch of the iterations and did not fully agree with the standard results. The author considers the cause to be that the number of samples drawn from experience replay each time is limited, while by the end of training the total number of samples in the replay pool is far larger than each draw. So the author made the draw grow: every time the standard result is hit, the number of samples drawn each time is increased by 100, so that the neural network learns from more samples and predicts more accurately.
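A minimal sketch of that adjustment (hits_standard_result is a hypothetical flag meaning "the recommendation matched the standard result this iteration"):

if hits_standard_result:
    batch_size += 100                              # learn from more samples next time
batch_size = min(batch_size, len(replay_pool))     # never request more than is stored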

Figure 14 Algorithm accuracy increases with the number of iterations

6. Summary and Outlook

For a person-post matching problem based on feature matching, I personally think DQN is not an ideal solution; natural language processing techniques should be used instead. Due to the limited project time, there was no time to change the approach and develop a new model from scratch. An Actor-Critic algorithm could also be used to build the person-post matching model and should give better results.

7. References

[1] Lu Yamin. Video recommendation algorithm based on the combination of FM and DQN [J]. Computer and Digital Engineering, 2021, 49(09): 1771-1776.

[2] https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288

[3] https://medium.com/towards-data-science/reinforcement-learning-explained-visually-part-3-model-free-solutions-step-by-step-c4bbb2b72dcf

[4]  Research on Path Planning Algorithms for Intelligent Robots Based on Reinforcement Learning (with Code)

[5] https://medium.com/towards-data-science/reinforcement-learning-explained-visually-part-4-q-learning-step-by-step-b65efb731d3e

[6] https://medium.com/towards-data-science/reinforcement-learning-explained-visually-part-5-deep-q-networks-step-by-step-5a5317197f4b

[7]  Reinforcement Learning (9) Deep Q-Learning Advanced Nature DQN - Liu Jianping Pinard - Blog Garden (cnblogs.com)

[8] Application of Reinforcement Learning in Echeng Technology Personnel Matching System | Heart of Machine (jiqizhixin.com)

[9]  Application of Reinforcement Learning in Talent Recommendation - Analysis Intelligence (xiaoxizn.com) 

[10] What is Reinforcement Learning DQN Experience Replay?
