Related Work
To learn a general state-to-action mapping for a high-dimensional system, optimizing the mapping on a 'trial and error' basis is limited, because a single trajectory (i.e., one episode) visits only a very small part of the state space.
To address this issue, methods that learn a policy from optimized trajectories have been proposed.
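The idea of distilling optimized trajectories into a policy can be viewed as supervised regression on the (state, action) pairs those trajectories visit. A minimal sketch, assuming a hypothetical linear system and a known linear "expert" controller standing in for the trajectory optimizer (the matrices `A`, `B`, `K` are illustrative, not from the source):

```python
import numpy as np

# Hypothetical linear system x' = A x + B u, with an assumed expert
# gain K playing the role of a trajectory optimizer's output.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.2, 2.1]])  # assumed expert feedback gain

# Collect (state, action) pairs from several "optimized" trajectories,
# each started from a different random initial state.
states, actions = [], []
for _ in range(20):                  # 20 episodes
    x = rng.normal(size=(2, 1))
    for _ in range(50):              # 50 steps per episode
        u = -K @ x                   # expert action at this state
        states.append(x.ravel())
        actions.append(u.ravel())
        x = A @ x + B @ u            # step the dynamics

X = np.array(states)                 # (1000, 2) state matrix
U = np.array(actions)                # (1000, 1) action matrix

# Fit a linear policy u = W^T x by least squares over all episodes,
# pooling data that no single trajectory could cover alone.
W, *_ = np.linalg.lstsq(X, U, rcond=None)
print(np.allclose(W.T, -K, atol=1e-6))  # → True: the expert is recovered
```

Because the pooled data spans far more of the state space than any one episode, the regression recovers a policy that generalizes across initial conditions, which is the motivation for trajectory-based policy learning described above.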