Reinforcement Learning: An Introduction — Chapter 2 Reading Notes

Part I: Tabular Solution Methods

In this part we describe almost all of the core ideas of reinforcement learning in their simplest form. Here the state and action spaces are small enough that the estimated value functions can be represented exactly as arrays or tables. In these cases the methods can often find exactly the optimal value function and the optimal policy. This contrasts with the next part, where the methods find only approximate solutions but apply to much larger problems.

The first chapter of this part describes a special case of the reinforcement learning problem in which there is only a single situation: the bandit problem. The following chapter presents the general problem formulation that is used throughout the rest of the book, finite Markov decision processes, whose core ideas include the Bellman equation and value functions.

The next three chapters introduce three fundamental classes of methods for solving finite Markov decision problems: dynamic programming, Monte Carlo methods, and temporal-difference learning. Each class of methods has its own strengths and weaknesses, summarized below.

| Method | Evaluation |
| --- | --- |
| Dynamic programming | Well developed mathematically, but requires a complete and accurate model of the environment. |
| Monte Carlo methods | No environment model required and conceptually simple, but not well suited to incremental, step-by-step computation. |
| Temporal-difference learning | No environment model required and fully incremental, but more complex to analyze. |

The final two chapters of this part describe how these three classes of methods can be combined to get the best of each. The first describes how the strengths of Monte Carlo methods can be combined with those of temporal-difference learning via multi-step bootstrapping, and the last shows how temporal-difference learning can be combined with model learning and planning methods (such as dynamic programming) into a complete, unified solution to the tabular reinforcement learning problem.

Chapter 2 Multi-armed Bandits

The most important feature distinguishing reinforcement learning from other kinds of learning is that it uses training information that evaluates the actions taken rather than instructs by giving the correct actions. Purely evaluative feedback depends entirely on the action taken, whereas purely instructive feedback is independent of the action taken.

In this chapter we study the evaluative aspect of reinforcement learning in its simplest form, one that involves only a single situation. Studying this nonassociative setting simplifies the full reinforcement learning problem and makes it easier to see how evaluative feedback differs from, and can be combined with, instructive feedback.

2.1 A k-armed Bandit Problem

Simply put, you repeatedly face a choice among k different options; after each choice you receive a reward, and the goal is to maximize the total reward over time. Today the term "bandit problem" is generally used for this kind of problem.

Each action has an expected reward, called the value of the action.

At : the action selected at step t
q*(a) : the true value of action a, i.e. its expected reward
Qt(a) : the estimate of q*(a) at step t

the greedy action: the action with the largest current estimated value, i.e. the a that maximizes Qt(a); selecting it is exploiting your current knowledge.

explore: select a non-greedy action (one without the largest estimate), with the goal of improving the estimate of that action's value so that it gets closer to the true q*(a).

Methods that explore usually perform worse in the short run, but as the process continues they perform better, because the improved estimates allow them to exploit better actions later.

Whether it is better to explore or to exploit depends on the precision of the value estimates, their uncertainties, and the number of remaining steps. There are many sophisticated methods for balancing the two. In this book we do not worry about these sophisticated balancing algorithms; we only need to worry about balancing exploration and exploitation at all.

2.2 Action-value Methods

A natural way to estimate the value of an action is to average the rewards actually received when that action was selected (the sample-average method): Qt(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t).

The greedy method always selects the action with the largest estimated value. Because it only exploits the existing estimates, it never explores new actions. A simple alternative is the ε-greedy method: with small probability ε, select an action at random (uniformly over all actions) instead of the greedy one. This forces continued exploration and usually works better in the long run.
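To make the method concrete, here is a minimal sketch of ε-greedy action selection; the function and variable names (select_action, q_est) are illustrative, not from the book.

```python
import random

def select_action(q_est, epsilon):
    """Epsilon-greedy: with probability epsilon pick a random arm, otherwise be greedy."""
    if random.random() < epsilon:
        return random.randrange(len(q_est))                # explore: uniform random arm
    return max(range(len(q_est)), key=lambda a: q_est[a])  # exploit: greedy arm
```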

2.3 The 10-armed Testbed

In the testbed, the true action values q*(a) of the ten actions were drawn from a normal (Gaussian) distribution with mean 0 and variance 1, and each actual reward was then drawn from a normal distribution with mean q*(a) and variance 1.


Rt: reward at step t
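The testbed itself is easy to simulate. A minimal sketch, assuming numpy and with hypothetical helper names (make_testbed, pull):

```python
import numpy as np

def make_testbed(k=10, seed=None):
    """True values q*(a) ~ N(0, 1); each reward ~ N(q*(a), 1)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, size=k)      # one true value per arm
    def pull(a):
        return rng.normal(q_true[a], 1.0)      # noisy reward for arm a
    return q_true, pull
```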

We compare the performance of the greedy algorithm and the ε-greedy algorithm on this testbed.


The ε-greedy methods eventually outperform the greedy method, which often gets stuck exploiting a suboptimal action. Between the two ε values, ε = 0.1 explores more and finds the optimal action earlier, but ε = 0.01 improves more slowly and eventually performs better, because once it has found the optimal action it selects it more often.

The advantage of ε-greedy over greedy also depends on the task. For example, with noisier rewards (say, reward variance 10) the advantage of exploration is even larger, while with reward variance 0 the greedy method would know the true value of each action after trying it once and could perform better. However, even in the deterministic case ε-greedy wins if we weaken some assumptions, for example if the bandit task is nonstationary and the true action values change over time. In that case exploration is needed to keep track of the changes, and nonstationarity is the case most commonly encountered in reinforcement learning.

2.4 Incremental Implementation

So far the action-value estimates have been computed as sample averages of observed rewards. We now consider how to compute these averages in a computationally efficient, incremental way.

Ri : the reward received after the i-th selection of this action

Qn : the estimate of this action's value after it has been selected n - 1 times, i.e. the average of its first n - 1 rewards

Qn = (R1 + R2 + ... + R(n-1)) / (n - 1)

Rewriting this average incrementally gives the update

Qn+1 = Qn + (1/n) [Rn - Qn]

This formula shows that only Qn and n need to be stored for each action, and only a small, constant amount of computation is needed for each new reward.

This update rule is an instance of a general form that occurs frequently throughout the book:

NewEstimate = OldEstimate + StepSize[Target-OldEstimate]

Explanation: [Target - OldEstimate] is the error in the estimate; the update reduces this error by moving the estimate a fraction of the way toward the target.

In general we use αt(a), or simply α, to denote the step-size parameter.
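A minimal sketch of this general update rule as a helper function (names are illustrative):

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Sample-average case: the step size is 1/n, where n is the number of times
# this action has been selected so far.
# q[a] = incremental_update(q[a], reward, 1.0 / n)
```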

2.5 Tracking a Nonstationary Problem

The sample-average methods above are appropriate for stationary bandit problems, in which the reward probabilities do not change over time.

But most problems we encounter in reinforcement learning are effectively nonstationary. In such cases it makes sense to give more weight to recent rewards than to older ones. One of the most popular ways of doing this is to use a constant step-size parameter:

Qn+1 = Qn + α[Rn - Qn]

where α ∈ (0, 1] is a constant step-size parameter.


Expanding this recursion shows that Qn+1 is a weighted average of the past rewards and the initial estimate Q1:

Qn+1 = (1 - α)^n Q1 + sum_{i=1..n} α (1 - α)^(n-i) Ri

Because the weight given to each reward decays exponentially with how long ago it was received, this is called an exponential recency-weighted average.
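To see the difference in behavior, here is a small illustrative simulation (not from the book) of a single drifting arm, comparing the sample-average step size 1/n with a constant step size α = 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)
q_true = 0.0                      # true value of one arm, drifting over time
q_avg = q_const = 0.0             # sample-average vs constant-alpha estimates
alpha, n = 0.1, 0

for t in range(10_000):
    q_true += rng.normal(0.0, 0.01)     # slow random-walk drift of the true value
    r = rng.normal(q_true, 1.0)         # noisy reward
    n += 1
    q_avg += (r - q_avg) / n            # 1/n step size: weights all rewards equally
    q_const += alpha * (r - q_const)    # constant step size: emphasizes recent rewards

# After the loop, q_const typically tracks the drifting q_true much more closely.
```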

Sometimes it is convenient to let the step size αn(a) vary from step to step. In the sample-average method αn(a) = 1/n, and by the law of large numbers this is guaranteed to converge to the true action value.

However, not every choice of αn(a) guarantees convergence. A well-known result from stochastic approximation theory gives the conditions required to assure convergence with probability 1:

sum_{n=1..∞} αn(a) = ∞    and    sum_{n=1..∞} αn(a)^2 < ∞

The first condition guarantees that the steps are large enough to eventually overcome any initial conditions or random fluctuations, and the second guarantees that the steps eventually become small enough to assure convergence.
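As a quick check of these conditions for the two step-size choices discussed above (standard series facts, not from the notes):

```latex
% alpha_n = 1/n (sample averages) satisfies both conditions:
\sum_{n=1}^{\infty} \tfrac{1}{n} = \infty,
\qquad
\sum_{n=1}^{\infty} \tfrac{1}{n^{2}} = \tfrac{\pi^{2}}{6} < \infty .

% A constant alpha_n = \alpha > 0 satisfies the first condition but not the second:
\sum_{n=1}^{\infty} \alpha = \infty,
\qquad
\sum_{n=1}^{\infty} \alpha^{2} = \infty ,
% so the estimates never fully converge and instead keep tracking recent rewards,
% which is exactly what we want on a nonstationary problem.
```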

Step-size sequences that satisfy both conditions often converge very slowly, and because most problems of interest in reinforcement learning are nonstationary, such sequences are mostly used in theoretical work and rarely in practical applications.

2.6 Optimistic Initial Values

The methods discussed so far all depend to some extent on the initial action-value estimates Q1(a). In statistical terms, they are biased by their initial estimates. For the sample-average method the bias disappears once every action has been selected at least once; for the constant-α method the bias is permanent, though it decreases over time. In practice this is usually not a problem, and the initial values can even be used deliberately, to supply prior knowledge or to achieve a desired effect.

Setting optimistically high initial action values is a simple way to encourage exploration.
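A minimal sketch of optimistic initialization on the 10-armed testbed, using the settings of the book's comparison (Q1(a) = 5, ε = 0, constant α = 0.1); the function names are illustrative:

```python
import numpy as np

k, alpha = 10, 0.1
q_est = np.full(k, 5.0)    # optimistic initial estimates, far above any true q*(a)

def greedy_step(pull):
    """Pure greedy selection; the optimism itself drives the early exploration."""
    a = int(np.argmax(q_est))
    r = pull(a)                          # e.g. the testbed's pull() sketched earlier
    q_est[a] += alpha * (r - q_est[a])   # each update pulls the inflated estimate down
    return a, r
```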


In the book's comparison on the 10-armed testbed, the optimistic greedy method performs worse at first, because its inflated estimates force it to try every action repeatedly, but it eventually outperforms the realistic ε-greedy method because its exploration naturally dies out over time.

Limitations: optimistic initial values are quite effective on stationary problems, but they are of little help on the nonstationary problems that are most common in reinforcement learning, because their drive to explore is inherently temporary: it acts only at the beginning, whereas nonstationary problems require ongoing exploration.

Nevertheless, optimistic initialization is simple, and it is often used in combination with other methods.

2.7 Upper-Confidence-Bound Action Selection

Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. The ε-greedy method explores by choosing among the non-greedy actions uniformly at random, with no preference for actions that are nearly greedy or particularly uncertain. A better approach is upper confidence bound (UCB) action selection, which accounts for both how close each estimate is to the maximum and how uncertain it is:

At = argmax_a [ Qt(a) + c * sqrt( ln t / Nt(a) ) ]

Here Nt(a) is the number of times action a has been selected prior to time t, c > 0 controls the degree of exploration (the confidence level), and ln t is the natural logarithm of t, which grows ever more slowly as t increases.

Further explanation: whenever a is selected, Nt(a) increases and the uncertainty term for a shrinks; meanwhile, for every action not selected, ln t still increases while Nt(a) does not, so their uncertainty terms grow and they eventually get selected again.
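A minimal sketch of UCB action selection (illustrative names; c = 2 is the value used in the book's comparison):

```python
import math

def ucb_action(q_est, counts, t, c=2.0):
    """UCB: argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]; untried arms go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                     # an untried arm is treated as maximally uncertain
    return max(range(len(q_est)),
               key=lambda a: q_est[a] + c * math.sqrt(math.log(t) / counts[a]))
```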


UCB generally performs better than the ε-greedy algorithm because its non-greedy selections are directed by how uncertain each action's estimate is, rather than being uniformly random.

Limitations: UCB is harder than ε-greedy to extend to the more general reinforcement learning settings considered in the rest of the book.

One difficulty is dealing with nonstationary problems, which requires methods more complex than those of Section 2.5. Another is dealing with large state spaces, particularly when using the function approximation methods developed in Part II of this book.

2.8 Gradient Bandit Algorithm

So far we have estimated action values and used those estimates to select actions. This is generally a good approach, but it is not the only one. In this section we consider learning a numerical preference Ht(a) for each action. The larger the preference, the more often the action is selected, but the preference has no interpretation in terms of reward; only the relative preferences between actions matter.

The probability of selecting each action is given by a soft-max distribution:

πt(a) = Pr{At = a} = e^(Ht(a)) / sum_b e^(Ht(b))

Here πt(a) is the probability of selecting action a at time t. Initially all preferences are equal (e.g. H1(a) = 0 for all a), so all actions start out equally likely to be selected.

The preferences are updated after each step by stochastic gradient ascent:

Ht+1(At) = Ht(At) + α (Rt - R̄t) (1 - πt(At))
Ht+1(a)  = Ht(a)  - α (Rt - R̄t) πt(a)        for all a ≠ At

Here R̄t is the average of all rewards up through and including time t; it serves as a baseline. If the current reward is above the baseline, the probability of selecting At in the future is increased; if it is below the baseline, that probability is decreased. The non-selected actions move in the opposite direction.
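A minimal sketch of one step of a gradient bandit algorithm, assuming numpy; the function name and signature are illustrative:

```python
import numpy as np

def gradient_bandit_step(h, avg_reward, t, pull, alpha=0.1, rng=np.random):
    """One step of a gradient bandit: soft-max selection, then preference update."""
    pi = np.exp(h - h.max())
    pi /= pi.sum()                                  # soft-max action probabilities
    a = rng.choice(len(h), p=pi)                    # sample an action
    r = pull(a)
    avg_reward += (r - avg_reward) / t              # incremental baseline (average reward)
    one_hot = np.zeros(len(h)); one_hot[a] = 1.0
    h += alpha * (r - avg_reward) * (one_hot - pi)  # gradient ascent on the preferences
    return h, avg_reward, a, r
```

The single vectorized update line is equivalent to the two-case update above: the selected action gets the (1 - π) factor and all other actions get the -π factor.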


2.9 Associative Search (Contextual Bandits)

So far in this chapter we have considered only nonassociative tasks, in which there is no need to associate different actions with different situations. In the general reinforcement learning task, however, there is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations. Nonassociative tasks pave the way for the full associative problem.

As an example, suppose there are several different k-armed bandit tasks, and at each step you face one of them selected at random. This cannot be handled well by the methods described so far unless the true action values change only slowly; from the learner's point of view it looks like a single, nonstationary bandit task.

If, however, each situation comes with some signal identifying it, then you can learn a policy that maps each situation to the action that is best in it. This is an associative search task, often called a contextual bandit in the literature. Associative search lies between the k-armed bandit problem and the full reinforcement learning problem: it is like full reinforcement learning in that it involves learning a policy, but like the bandit problem in that each action affects only the immediate reward. If actions are also allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.
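As an illustration of the idea (not an algorithm from the book), one simple baseline is to keep a separate ε-greedy value table for each observed context:

```python
import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """One independent epsilon-greedy value table per observed context."""
    def __init__(self, k, epsilon=0.1):
        self.k, self.epsilon = k, epsilon
        self.q = defaultdict(lambda: [0.0] * k)   # per-context value estimates
        self.n = defaultdict(lambda: [0] * k)     # per-context selection counts

    def act(self, context):
        if random.random() < self.epsilon:
            return random.randrange(self.k)
        q = self.q[context]
        return max(range(self.k), key=lambda a: q[a])

    def update(self, context, action, reward):
        self.n[context][action] += 1
        self.q[context][action] += (reward - self.q[context][action]) / self.n[context][action]
```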

2.10 Summary

This chapter introduced several ways of balancing exploration and exploitation:

The ε-greedy method explores by choosing randomly among the non-greedy actions a small fraction of the time, while the UCB method deterministically prefers actions whose estimates are uncertain because they have been sampled few times. Gradient bandit algorithms do not estimate action values at all, but instead learn action preferences and select actions probabilistically via a soft-max distribution. Finally, simply initializing the estimates optimistically lets even a purely greedy method explore substantially.



In comparing these methods we consider not only their performance at their best parameter setting, but also how sensitive each is to its parameter value. All of them are fairly insensitive, performing well over roughly an order of magnitude of parameter values.

Overall, the UCB method performed the best.

These methods of balancing exploration and exploitation are still far from a fully satisfactory solution. One well-studied approach for the k-armed bandit problem is to compute special action values called Gittins indices. This approach is not general, however, and it assumes complete knowledge of the prior distribution over possible problems. Another approach is Bayesian, sometimes called posterior sampling or Thompson sampling, in which the evolving posterior beliefs about the action values become part of the state of the problem. In principle the optimal balance of exploration and exploitation could then be computed exactly, but the space of possible futures is so large that exact computation is impractical, although it might be approximated efficiently. This turns the bandit problem into an instance of the full reinforcement learning problem, which could be attacked with approximate reinforcement learning methods, but that is beyond the scope of this book.









