93 Fuzzy Q-learning and Dynamic Fuzzy Q-learning

Introduction

In the reinforcement learning paradigm, an agent receives from its environment a scalar reward value called \(reinforcement\). This feedback is rather poor: it can be boolean (true, false) or fuzzy (bad, fair, very good, ...), and, moreover, it may be delayed. A sequence of control actions is often executed before receiving any information on the quality of the whole sequence. Therefore, it is difficult to evaluate the contribution of an individual action.

Q-learning

Q-learning is a form of competitive learning which provides agents with the capability of learning to act optimally by evaluating the consequences of actions. Q-learning keeps a Q-function which attempts to estimate the discounted future reinforcement for taking actions from given states. A Q-function is a mapping from state-action pairs to predicted reinforcement. In order to explain the method, we adopt the implementation proposed by Bersini.
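To make the idea of a Q-function concrete, here is a minimal tabular Q-learning sketch in Python. The names (`Q`, `update`, `greedy_action`, `ALPHA`, `GAMMA`) and the constants are illustrative assumptions; the update shown is the standard one-step Q-learning rule, not yet Bersini's cell-based implementation described next.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

# The Q-function: a mapping from (state, action) pairs to predicted
# discounted future reinforcement, stored here as a simple table.
Q = defaultdict(float)

def update(state, action, reward, next_state, actions):
    """Standard one-step Q-learning update after observing a transition."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def greedy_action(state, actions):
    """Act optimally w.r.t. the current estimates: pick the highest-Q action."""
    return max(actions, key=lambda a: Q[(state, a)])
```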

  1. The state space, \(U\subset R^{n}\), is partitioned into hypercubes or cells. Among these cells we can distinguish: (a) one particular cell, called the target cell, to which the quality value +1 is assigned; (b) a subset of cells, called the viability zone, which the process must not leave, with quality value 0 (this notion of viability zone comes from Aubin and eliminates strong constraints on a reference trajectory for the process); (c) the remaining cells, called the failure zone, with quality value -1. A sketch of such a partition is given after this list.
  2. In each cell, a set of \(J\) agents compete to control a process. With \(M\) cells, the agent \(j\), \(j \in \{1,\ldots, J\}\), acting in cell \(c\), \(c\in\{1,\ldots,M\}\), is characterized by its quality value Q[c,j]. The probability that agent \(j\) in cell \(c\) will be selected is given by a Boltzmann distribution (see the second sketch after this list).
  3. The selected agent controls the process as long as the process stays in the cell. When the process leaves cell \(c\) to enter a cell \(c^{'}\) at time step \(t\), another agent is selected for cell \(c^{'}\), and the Q-value of the previous agent is incremented by:
    \[\Delta Q[c,j] = \alpha \{ r(t)+ \gamma \underset{k}{\max}\,Q[c^{'},k]-Q[c,j] \} \]
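A minimal sketch of step 1, assuming a 2-D state space discretised into a 10x10 grid of cells. The bounds of \(U\), the target cell at the origin and the rectangular viability zone are illustrative assumptions, not from the text.

```python
import numpy as np

BINS = 10              # cells per dimension (assumed)
LOW, HIGH = -1.0, 1.0  # assumed bounds of U in each dimension

def cell_index(state):
    """Map a continuous 2-D state to the index of its hypercube/cell."""
    coords = np.clip((np.asarray(state) - LOW) / (HIGH - LOW), 0.0, 1.0 - 1e-9)
    i, j = (coords * BINS).astype(int)
    return int(i) * BINS + int(j)

def quality(cell):
    """+1 for the target cell, 0 inside the viability zone, -1 elsewhere."""
    target = cell_index((0.0, 0.0))              # assumed target: the origin
    if cell == target:
        return +1
    i, j = divmod(cell, BINS)
    in_viability = 2 <= i <= 7 and 2 <= j <= 7   # assumed viability zone
    return 0 if in_viability else -1
```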
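And a sketch of steps 2 and 3: Boltzmann (softmax) selection among the \(J\) competing agents of a cell, and the quality update applied to the previous agent when the process changes cell. The temperature, the table sizes and the learning constants are assumed values.

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.9   # learning rate and discount factor (assumed)
TEMPERATURE = 0.5         # Boltzmann temperature (assumed)
M, J = 100, 5             # number of cells and of competing agents per cell

Q = np.zeros((M, J))      # Q[c, j]: quality value of agent j in cell c

def select_agent(c):
    """Pick an agent for cell c with Boltzmann probabilities over Q[c, :]."""
    logits = Q[c] / TEMPERATURE
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(J, p=probs)

def on_cell_change(c, j, r, c_next):
    """Update the previous agent's quality when the process leaves cell c."""
    Q[c, j] += ALPHA * (r + GAMMA * Q[c_next].max() - Q[c, j])
```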

Reposted from www.cnblogs.com/jtailong/p/11773689.html