[ASE Advanced Software Engineering] First Pair Programming Assignment

Problem Definition

For the specific rules, see the course handout. Roughly, the rules are as follows:

N classmates (N is typically greater than 10) each submit a rational number between 0 and 100 (exclusive) to the referee. The referee computes the average of all submitted numbers and multiplies it by 0.618 (the so-called golden-ratio constant) to obtain the value G. The student whose submission is closest to G (in absolute difference) receives N points, the student whose submission is furthest from G receives -2 points, and all other students receive 0 points.
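
To make the scoring concrete, here is a minimal sketch of how the referee's computation could look. The function and the example numbers are our own illustration, not the course framework's code, and ties are ignored for simplicity:

```python
def score_round(submissions):
    """Score one round of the golden-point game.

    submissions: dict mapping player id -> submitted number (0 < x < 100).
    Returns a dict mapping player id -> points for this round.
    """
    n = len(submissions)
    g = sum(submissions.values()) / n * 0.618             # the golden point G
    dist = {p: abs(x - g) for p, x in submissions.items()}
    closest = min(dist, key=dist.get)                      # closest to G: +N points
    furthest = max(dist, key=dist.get)                     # furthest from G: -2 points
    scores = {p: 0 for p in submissions}
    scores[closest] = n
    scores[furthest] = -2
    return scores

# example with 11 players
nums = {f"p{i}": x for i, x in enumerate([18, 25, 30, 33, 35, 40, 42, 50, 60, 75, 90])}
print(score_round(nums))   # p2 (30) is closest to G ≈ 27.98, p10 (90) is furthest
```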

Difficulties:

  1. Unknown environment. As an arcade-style game, the golden-point game has fairly simple rules, so the game situation depends largely on the strategies of the other players and is hard to predict. This is unlike chess or Go, where the situation and its development are largely constrained by the rules of the game and are therefore much more predictable. As a result, the game's "optimal policy" also depends heavily on the other players' strategies, which is why pretrained models perform unsatisfactorily, sometimes even worse than models trained from scratch.
  2. Lack of data. Because of point 1, we, like most of the teams, chose to train the model from scratch starting at the beginning of the match. However, each round of the match yields only one new data point, and such a small amount of data is not conducive to model convergence. In addition, while writing code and tuning parameters, playing against others on the shared server takes at least 3 s per data point, which makes effective parameter tuning difficult.
  3. Sparse reward. This is a common problem in many RL settings. In this game the reward is 0 in most rounds, which makes learning difficult.

In addition, I personally have some doubts about the game itself: the rules naturally bring to mind a possible "best strategy" that RL could learn, but the rules also make the game feel so random that, as a human, I am confused by it. Even if I myself were an RL agent, it might be hard to learn a strategy better than wild guessing, or I might need to analyze the other players' policies to some degree in order to choose a better one. The results show that some teams did score highly, but does the game itself really have a "best strategy"? Personally, I am not sure. (Of course, this does not prevent RL from learning something; I just personally find the game a bit strange.)

Modeling method

The core algorithm

We use Q-learning. The algorithm maintains a Q-table, which records the expected reward \(Q(s, a)\) of performing action \(a\) in state \(s\). Once the Q-table is sufficiently accurate, in each state \(s\) we can select the action \(a^* = \arg\max_a Q(s, a)\) that maximizes the expected reward.

In practice, the Q-table is initially unknown and must be learned from the environment. Specifically, the model initializes the Q-table randomly, tries different actions in different states, and gradually corrects the estimated Q-table based on the observed reward feedback, making it more accurate and better able to guide the model's actions. The overall flow is: initialize the Q-table, choose an action in the current state, observe the reward, update the Q-table, and repeat.
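
Below is a minimal sketch of such a tabular Q-learning loop; the hyperparameters, the state/action encodings, and the function names are illustrative assumptions rather than the exact code we ran:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate

Q = defaultdict(float)                   # Q[(state, action)] -> estimated expected reward

def choose_action(state, actions):
    """Epsilon-greedy selection over the current Q-table."""
    if random.random() < EPSILON:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

def q_update(state, action, reward, next_state, actions):
    """Standard Q-learning update after observing (s, a, r, s')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# one hypothetical round: in state 3 the bot tried action "a0" and received reward 0
q_update(3, "a0", 0, 4, ["a0", "a1", "a2"])
```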

In this assignment, we 1. added some more reasonable actions, and 2. to address difficulty 2, constructed more data usable for learning. See the Implementation section for the specific details.

Motivation

Because of difficulty 1 above, we considered it necessary to train from scratch, which makes the scarcity of data a serious problem. We therefore believed that models with many parameters, such as neural networks, which are harder to make converge, were not suitable, and that we should instead use the simplest form of Q-learning.

Q-learning does not consider correlations between states or between actions; it stores a separate Q value for each pair \((s, a)\), so if the state space or action space is large, the model becomes too difficult to converge. Therefore, we kept the state space and action space small and manually designed only a few of the more useful strategies as actions.

Implementation

  1. Improved actions. For the actions already provided in the demo, we changed the initial value each action outputs, bringing it closer to the initial values commonly seen in the statistics of past data. Since two numbers must be submitted each round and many of the original actions output two identical numbers, which is not conducive to scoring, we introduced some randomness into the second number. In addition, we introduced some new actions: outputting a random number within a range where the golden point often appears, outputting a large number, or outputting a perturbed value.
  2. Constructed data. Because the server API exposes all the numbers submitted in each round, during training we can construct data of the form "if I had taken a different action in the previous round, what reward would I have received". This expands the amount of data by a factor of \(N_A\), where \(N_A\) is the number of actions (see the sketch after this list).
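
A sketch of how point 2 could look in code: given the numbers everyone submitted in a round (available from the server API), we can replay the round once per candidate action and record the reward each action would have earned. The action definitions and helper names below are illustrative, and for simplicity each player submits only one number:

```python
import random

def golden_point(numbers):
    return sum(numbers) / len(numbers) * 0.618

# a few illustrative hand-designed actions: each maps the history of
# past golden points to the number the bot would submit
ACTIONS = {
    "last_g":        lambda history: history[-1],                 # repeat the last golden point
    "scaled_last_g": lambda history: history[-1] * 0.618,         # shrink it once more
    "random_range":  lambda history: random.uniform(10.0, 30.0),  # range where G often falls
}

def counterfactual_samples(state, history, others_numbers, n_players):
    """Expand one observed round into one (state, action, reward) sample per action."""
    samples = []
    for name, act in ACTIONS.items():
        my_num = act(history)
        nums = others_numbers + [my_num]      # replay the round with our hypothetical number
        g = golden_point(nums)
        dists = [abs(x - g) for x in nums]
        if dists[-1] == min(dists):           # ours would have been closest to G
            reward = n_players
        elif dists[-1] == max(dists):         # ours would have been furthest from G
            reward = -2
        else:
            reward = 0
        samples.append((state, name, reward))
    return samples
```

Each such sample can then be fed into the Q-learning update above, multiplying the usable data per round by the number of actions.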

Result analysis

In the first match, we finished sixth. Between the two matches, we fine-tuned the range of the action outputs and other details according to the golden-point data, which paid off: we finished fourth in the second match.

Reflection and summary

  1. Did the results of the golden-point game match your expectations?

    The result of the first match was poor, but it improved slightly after we made the changes described above. Overall, our bot's performance was not outstanding. The main reason may be that the state design was too simple: we used the number of rises and falls among the last 10 golden points as the state, which can hardly capture the current situation completely (a sketch of this state encoding appears after this Q&A). In addition, we understand that the better-performing groups used DQN, which may indicate that plain Q-learning's learning ability is still limited.

  2. Before the official match, what strategy did you use to evaluate the quality of your model?

    We ran our bot in room0 and room1, joining the ongoing matches there to assess the model's performance. Later, when room0 and room1 became too slow, we also joined some rooms created by other students to evaluate the bot. In addition, we recorded the bot's reward in each round and observed its learning curve to assess its learning ability.

  3. If each round required submitting 3 numbers, or if more participants joined the competition, would your method still be applicable?

    For three numbers, Q-learning may still be applicable, but we would need to analyze the data and design strategies better suited to submitting three numbers. If the action space became too large, it might be necessary to go beyond a plain Q-table and model the relationships between actions.

    For more contestants, the method still applies.

  4. Please evaluate your partners' work, referring to the discussion of the sandwich method, and suggest areas where your partners could improve.

    My partners are Wei Tianxin and Wu Ziwei; both are excellent students who made many contributions to our project.

    Wei Tianxin knows RL well, quickly analyzed the problem and its possible solutions, and actively wrote and debugged code.

    Wu Ziwei analyzed the data, made corresponding improvements, and also contributed a lot of code.

    In addition, because we studied abroad this summer and were out of the country, communication with the other students and the teaching assistants was limited. Each of us also had our own research work over the summer, and the project deadline happened to fall right around the time our summer research ended and we returned home. As a result, just before the deadline some of us had just returned and were jet-lagged, some were still busy wrapping up their summer research, and some were about to board a plane, so overall we were relatively short on time. My teammates are excellent, and I think we can be satisfied with this result.
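
As mentioned in the answer to question 1, our state was based on how many of the last 10 golden points rose or fell. The following is a rough reconstruction of that encoding; details such as the handling of the window size are assumptions and may differ from the code we actually submitted:

```python
def encode_state(golden_points, window=10):
    """Count how many of the last `window` golden-point transitions were rises."""
    recent = golden_points[-(window + 1):]     # need window + 1 values for `window` differences
    rises = sum(1 for prev, cur in zip(recent, recent[1:]) if cur > prev)
    return rises                               # an integer in [0, window]: a very small state space

# example: a short history of golden points
print(encode_state([20.1, 19.5, 18.7, 19.0, 18.2, 17.9, 18.5, 18.1, 17.6, 17.8, 17.3]))  # -> 3
```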
