Paper translation - STUN: Reinforcement-Learning-Based Optimization of Kernel Scheduler Parameters (3)

Continued from the previous article: STUN: Reinforcement-Learning-Based Optimization of Kernel Scheduler Parameters (2)

3. Background

3.1 Q-Learning

Reinforcement learning is a machine learning approach that learns a policy for choosing, in the current state, the action that maximizes the sum of future reward values. It consists of an environment and an agent, as shown in Figure 1. The agent is the entity that decides the next action and interacts with the environment through actions, while the environment responds with states and rewards. By interacting with the environment iteratively, the agent establishes an optimal policy.

Figure 1. Reinforcement learning structure
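As a rough illustration of the agent-environment loop in Figure 1, the sketch below models the interaction as two objects exchanging states, actions, and rewards. The `Environment` and `Agent` classes, their method names, and the placeholder dynamics are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of the agent-environment interaction loop (illustrative only).
class Environment:
    def reset(self):
        """Return the initial state."""
        return 0

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        return 0, 1.0, True  # placeholder dynamics

class Agent:
    def act(self, state):
        """Decide the next action for the given state."""
        return 0  # placeholder decision

env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)               # agent chooses an action
    state, reward, done = env.step(action)  # environment responds with a state and a reward
```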

Reinforcement learning is well suited to problems that involve trade-offs, particularly those in which the reward values for short-term and long-term actions can be stated explicitly. It is therefore used in various fields such as robot control and games; Atari games and DOTA 2 are well-known examples trained with deep reinforcement learning.

Many algorithms can be used to implement reinforcement learning; among them we applied the Q-Learning algorithm. The agent in Q-Learning maintains a Q-table, which records the Q-value, i.e., the value of every action that can be taken in each state of the problem. The Q-Learning algorithm works as follows: when the agent first starts learning, all values in the Q-table are initialized to 0. If the Q-values in the current state are 0, the agent chooses an action at random and updates the corresponding Q-value in the Q-table; otherwise, the agent chooses the action with the largest Q-value so as to maximize the reward. The agent repeats these steps until an optimal policy is found.
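The paragraph above does not spell out how a Q-value is updated. The standard tabular Q-Learning update rule (a well-known formula, not quoted from the paper) with learning rate $\alpha$ and discount factor $\gamma$ is:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```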

Q-Learning is a type of model-free reinforcement learning that uses the Bellman equation to find the actions with the highest total reward. In reinforcement learning, a model predicts the state changes and rewards of the environment. The advantage of model-free reinforcement learning is that it is easy to implement and tune.
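For concreteness, here is a minimal sketch of tabular Q-Learning following the steps described above (Q-table initialized to zero, random action when the entries are zero or with a small probability, greedy action otherwise, Bellman-style update). The environment interface follows the `reset()`/`step()` sketch above, and the hyperparameter values are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning sketch; `env` follows the reset()/step() interface above."""
    # Q-table: every (state, action) value starts at 0.
    q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Random action with small probability or when all Q-values are still 0;
            # otherwise pick the action with the largest Q-value (exploitation).
            if random.random() < epsilon or all(q[(state, a)] == 0 for a in range(n_actions)):
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman-style update of the Q-value.
            best_next = max(q[(next_state, a)] for a in range(n_actions))
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```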

Knowledge supplement

The following is from: What is Reinforcement Learning?

Definition of Reinforcement Learning

Reinforcement Learning (RL) is a machine learning method that emphasizes learning how to make decisions through interaction with the environment. It does not require data to be given in advance; instead, it obtains learning signals and updates model parameters by receiving rewards (feedback) from the environment for its actions. Deep learning models can be combined with reinforcement learning to form deep reinforcement learning.

The agent learns through trial-and-error actions; the reward obtained by interacting with the environment guides its behavior, and the goal is for the agent to obtain the maximum reward.

Its inspiration comes from behaviorism in psychology: under the rewards or punishments given by the environment, organisms gradually form expectations about stimuli and develop habitual behaviors that yield the greatest benefit. Reinforcement learning focuses on online learning and tries to maintain a balance between exploration and exploitation.

Terminology in Reinforcement Learning

  • Agent: The learner or decision maker in the reinforcement learning process. The agent interacts with the environment and takes actions to achieve specific goals.
  • Environment: The environment in which the agent operates. It provides observations to the agent, and the agent's actions can affect the state of the environment.
  • State: A representation of the agent's current situation in the environment. A state can be fully observable or partially observable.
  • Action: A decision made by an agent that affects its interaction with the environment.
  • Reward: The immediate feedback signal an agent receives after taking an action. Rewards reflect the desirability of actions taken in a particular state.
  • Policy: The strategy by which the agent chooses an action; it can be deterministic or stochastic.
  • Value function: A function that estimates the expected cumulative reward an agent can obtain starting from a given state and following a specific policy (formalized after this list).
  • Q-function: A function that estimates the expected cumulative reward an agent can obtain starting from a given state, taking a specific action, and then following a specific policy (also formalized after this list).
  • Exploration vs. Exploitation: A trade-off between trying new actions to discover their consequences (exploration) and choosing actions known to yield high returns (exploitation).
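For reference, the value function and Q-function mentioned above can be written, for a policy $\pi$ with discount factor $\gamma$, as follows (standard definitions, not taken from the original article):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]
```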

Three Approaches to Reinforcement Learning

  1. Value-based: the goal is to optimize the value function V(s). The value function tells us the maximum expected future reward the agent can obtain in each state, and it is then used to decide which action to choose at each step.
  2. Policy-based: the policy function π(s) is optimized directly and the value function is discarded. The policy evaluates the agent's behavior at a given point in time and links each state with its corresponding best action.
  3. Model-based: a model representing the behavior of the environment is created. (Less commonly used.)

Some classic algorithms for reinforcement learning

  • Value Iteration: A dynamic programming technique that iteratively updates the value function until it converges to an optimal value function.
  • Q-learning: A model-free, off-policy algorithm that learns an optimal Q-function by iteratively updating its estimates based on observed transitions and rewards.
  • SARSA: A model-free, on-policy algorithm that learns a Q-function by updating its estimates based on the actions actually taken by the current policy.
  • Deep Q-Network (DQN): An extension of Q-learning that uses deep neural networks to approximate the Q-function, enabling RL to scale to high-dimensional state spaces.
  • Policy Gradient Methods: A family of algorithms that directly optimize policies by adjusting their parameters based on the gradient of the expected cumulative reward.
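To make the DQN idea above concrete, here is a minimal sketch of a neural-network Q-function approximator. It assumes PyTorch is available; the layer sizes and the state/action dimensions are arbitrary placeholders, not values from the article.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a) for all actions at once, as in DQN."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection: pick the action with the largest predicted Q-value.
q_net = QNetwork(state_dim=4, n_actions=2)       # hypothetical dimensions
state = torch.zeros(1, 4)                        # hypothetical observation
action = q_net(state).argmax(dim=1)              # exploitation step
```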


Source: blog.csdn.net/phmatthaus/article/details/131418243