ASE (Advanced Software Engineering) First Pair Programming Assignment

Gold Point Game Bot

Bot8 reporting in

1. Problem Definition

a) Problem Description

  • N players each submit a rational number strictly between 0 and 100 (0 and 100 excluded) to the server. At the end of each round the server computes the average of all submitted numbers and multiplies it by 0.618 (the so-called golden ratio constant) to obtain the gold point G. The player whose number is closest to G (by absolute difference) earns N points, the player farthest from G loses 2 points, and everyone else scores 0. When only one player participates, no points are awarded.
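As a minimal sketch of the scoring rule (the function name and tie-breaking are our own choices; the post does not show the real server code):

```python
def score_round(numbers):
    """Score one round: `numbers` holds each player's submission.

    Returns (G, scores) where scores[i] is player i's score this round.
    Tie-breaking is unspecified in the rules; here the first index wins.
    """
    n = len(numbers)
    g = sum(numbers) / n * 0.618            # the gold point G
    scores = [0] * n
    if n <= 1:                              # a lone participant scores nothing
        return g, scores
    dists = [abs(x - g) for x in numbers]
    scores[dists.index(min(dists))] += n    # closest to G earns N points
    scores[dists.index(max(dists))] -= 2    # farthest from G loses 2 points
    return g, scores
```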

b) Initial Thoughts

  • Our first impression of the problem: if nobody wants to take risks, the result will get closer and closer to 0
  • Clearly some players will perturb the game to make it more interesting, and handling those perturbations is critical (a perturbation here means submitting an obviously wrong value to shift the gold point)
  • What we have to do is predict the gold point of the next round

2. Modeling Approach

a) Environmental Analysis

  • After a few simple matches against the bots in Room 1, we found that the gold point fluctuates violently, which makes accurate prediction much harder
  • Because of these dramatic swings, there is no reason to look back at the previous 100 or even 10 gold points; a handful is enough, at most five (in the end we use only the last two)
  • In the data, subtle changes among small values clearly matter more than dramatic swings among large values, so the representation should account for this (e.g. fine-grained bins for small values, coarse bins for large ones)
  • As for perturbations, perturbing upward (submitting 50-100) has an obvious effect, while perturbing downward (submitting near 0) does not and is easily sniped by others, so we only ever perturb upward
  • Perturbing continuously actually has little effect, because other players' strategies quickly adapt to the new environment

b) Model Selection

  • Prompted by the teaching assistant, we directly compared a non-machine-learning algorithm, Q-learning, and DQN, trying to choose the right strategy
  • First we considered the perturbation policy; the purpose of perturbing is to keep other players' machine-learning algorithms from learning a correct model. The concrete candidate schedules: perturb 10 consecutive rounds out of every 20, perturb 5 consecutive rounds out of every 10, perturb every round, never perturb, and perturb in random Poisson-distributed consecutive bursts (sketched after this list)
  • When we perturb, we still predict the gold point under the assumption that nobody else is perturbing, and take that as its value
  • From the environmental analysis above, the state does not need to store many past numbers; the core issue is how to extract features from the most recent gold points into our state representation
  • Since the inputs are few and the features simple, we believed DQN would not add much and would only slow down training
  • In the end we settled on Q-learning
  • What remains is the action set: the previous gold point, the previous gold point * 0.618, and actions that try to snipe other players' perturbations
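The candidate perturbation schedules can be sketched as simple round predicates; the Poisson variant below is our reading of that strategy, and its parameters are assumptions rather than values from our experiments:

```python
import math
import random

def fixed_burst_schedule(period, burst):
    """Perturb for `burst` consecutive rounds out of every `period`,
    e.g. (20, 10) or (10, 5); period == burst means perturbing every round."""
    return lambda round_no: round_no % period < burst

def poisson_burst_schedule(start_prob=0.1, mean_len=3.0):
    """Start a burst at random; burst lengths are Poisson-distributed."""
    remaining = [0]
    def should_perturb(_round_no):
        if remaining[0] > 0:
            remaining[0] -= 1
            return True
        if random.random() < start_prob:
            remaining[0] = max(1, _poisson(mean_len)) - 1
            return True
        return False
    return should_perturb

def _poisson(lam):
    """Knuth's method for sampling a Poisson variate."""
    l, k, p = math.exp(-lam), 0, 1.0
    while p > l:
        k += 1
        p *= random.random()
    return k - 1
```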

c) Q-learning Introduction

  • Similar to dynamic programming, there are states and state transitions; each transition produces a reward, and our goal is to learn a table of state-action values.
  • For the table to be learned, every action needs feedback that tells us which action to prefer the next time we encounter the same state.
  • The specific update formula: NewValue = CurrentValue + lr * [Reward + discount_rate * (highest value among the possible actions from the new state s') - CurrentValue]
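In code, this update is a single line on a lookup table; a minimal sketch with placeholder hyperparameters (the post does not give the actual values):

```python
from collections import defaultdict

Q = defaultdict(float)      # (state, action) -> estimated value
LR, DISCOUNT = 0.1, 0.9     # placeholder learning rate and discount rate

def q_update(state, action, reward, new_state, all_actions):
    """One Q-learning step: pull Q(s, a) toward the bootstrapped target."""
    best_next = max(Q[(new_state, a)] for a in all_actions)
    Q[(state, action)] += LR * (reward + DISCOUNT * best_next - Q[(state, action)])
```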

d) Implementation Details

  • The model is pre-trained offline, loaded at startup, and can be saved at any moment
  • After testing, we perturb with a randomly drawn probability: if we perturbed this round, the probability is suppressed for the following rounds; if we did not, it is raised
  • Rounds in which we perturbed the gold point ourselves are excluded from Q-learning training, so that the model takes shape as quickly as possible
  • The state representation considers only the difference between the previous gold point and the current one, mapped into discrete states with non-uniform intervals (small differences amplified, large ones compressed, for the reason given in the environmental analysis; see the sketch after this list)
  • The action set, besides the previous gold point, the previous gold point * 0.618, and sniping other players' perturbations, also includes small plus/minus offsets (0.001 at first, 0.011 in the end; in practice the point is just to be a hair above or below the other bots)
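A sketch of the state and action encoding described in this list. The bin edges are illustrative; only their shape (dense near zero, sparse far away) reflects the design. We map each action to a single number for brevity, though the bot actually submits two numbers per round:

```python
import bisect

# Non-uniform bin edges: small gold-point differences get fine bins,
# large differences get coarse ones (illustrative values, not the shipped ones).
BIN_EDGES = [0.05, 0.1, 0.2, 0.5, 1, 2, 4, 8, 16, 32, 64, 100]

def encode_state(prev_gold, curr_gold):
    """State = bucketized magnitude of the difference between consecutive
    gold points, plus its sign."""
    delta = curr_gold - prev_gold
    return (bisect.bisect_left(BIN_EDGES, abs(delta)), delta >= 0)

def candidate_numbers(prev_gold, offset=0.011):
    """Action -> submitted number. The anti-perturbation 'snipe' actions
    from the post are omitted; their exact values were not given."""
    return {
        "prev": prev_gold,
        "prev_golden": prev_gold * 0.618,
        "prev_plus": prev_gold + offset,
        "prev_minus": prev_gold - offset,
    }
```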

e) Workflow

  • Use Q-learning to predict an action, which yields the two numbers we submit
  • Decide each round whether to perturb: perturbation has four stages with probabilities (0.05, 0.2, 0.4, 0.7); after every round without a perturbation we advance one stage with probability 0.9, and whenever a perturbation fires the stage resets (i.e. back to the 0.05 stage); a sketch follows this list
  • If we do perturb, number2 is replaced by a random number in (50, 100), while number1 is re-estimated under the assumption that nobody else perturbs
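The staged decision above translates almost line for line into code (a sketch; the class name is ours, and capping at the last stage is our assumption since the post does not say what happens beyond stage four):

```python
import random

STAGE_PROBS = (0.05, 0.2, 0.4, 0.7)    # perturbation probability per stage

class StagedPerturber:
    def __init__(self):
        self.stage = 0

    def decide(self):
        """Return True if this round perturbs, and update the stage."""
        if random.random() < STAGE_PROBS[self.stage]:
            self.stage = 0                 # perturbing resets to the 0.05 stage
            return True
        if random.random() < 0.9:          # otherwise advance with prob. 0.9
            self.stage = min(self.stage + 1, len(STAGE_PROBS) - 1)
        return False

def perturb_number2():
    """Replace number2 with a random number in (50, 100)."""
    return random.uniform(50, 100)
```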

3. Results Analysis

a) Scores from the two matches
  • First match: 1690 points (1000 rounds)
  • Second match: 13850 points (10000 rounds)

b) Reflections
  • First place in both matches, which exceeded our expectations; we did not expect the results to be this good ^v^
  • In fact, before the cross-team matches we had our own bots PK each other in a private room, and took the one that did best over many rounds as the final bot
  • If each round required submitting three numbers, simply changing the action set would work, though the model's quality and training time would need re-tuning
  • With more participants, we might need to reconsider whether our perturbation still pays off
  • To sneak in a personal note: my teammate CHF is just too strong; I barely had to lift a finger and lay back comfortably (salted-fish pose). Slipping away now~
