Deep reinforcement learning overview

Reinforcement learning originates from behaviorism in psychology: under the reward and punishment signals fed back by the environment, an agent gradually forms a behavioral strategy that maximizes its benefit. Compared with supervised learning, reinforcement learning does not require a labeled sample set prepared in advance; instead, it guides the learning of a strategy by continually trying actions and observing the feedback they produce. Compared with unsupervised learning, reinforcement learning does not merely explore the structure of the data; it establishes a mapping from input (state) to output (action) by interacting with the environment, in order to obtain the optimal strategy.

Characteristics of reinforcement learning:

  1. Trial-and-error learning: the agent interacts with the environment and learns the best strategy through trial and error at each step, without external guidance.
  2. Delayed feedback: the agent's trial and error receives feedback from the environment, and it may need to wait until the end of an episode to receive any feedback at all.
  3. Sequential learning: the training process of reinforcement learning unfolds over time.
  4. Correlation between steps: the current behavior affects subsequent states and behaviors.
  5. Balance of exploration and exploitation: at the beginning of training the agent leans toward exploration, so its behavior is somewhat random and it tries many possibilities; after many rounds of training the proportion of exploration is reduced (see the sketch after this list).
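To make item 5 concrete, here is a minimal epsilon-greedy sketch: the agent starts almost fully random and gradually relies more on its learned value estimates. The function name, the decay factor 0.995, and the floor of 0.05 are illustrative assumptions, not values from the original text.

```python
import random

# Illustrative value estimates for three actions (left, right, up).
q_values = [0.0, 0.0, 0.0]

def choose_action(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

epsilon = 1.0                                # start almost fully exploratory
for episode in range(1000):
    action = choose_action(q_values, epsilon)
    # ... interact with the environment and update q_values here ...
    epsilon = max(0.05, epsilon * 0.995)     # gradually shift toward exploitation
```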

Basic concepts of reinforcement learning

  • agent

An agent inevitably interacts with the environment and must learn how the environment responds to the actions it takes; this is a trial-and-error learning approach involving many trials.

In reinforcement learning, the state represents the agent's current situation, and the agent performs actions to explore the environment. A minimal interaction loop is sketched below.
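The sketch below shows the state/action/feedback cycle as a standard interaction loop. The Gymnasium API and the CartPole-v1 environment are used only as an illustration; any environment exposing reset()/step() would work the same way, and the policy here is simply random.

```python
import gymnasium as gym

# A minimal agent-environment interaction loop.
env = gym.make("CartPole-v1")
state, info = env.reset()

for t in range(200):
    action = env.action_space.sample()                       # random policy for now
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:                               # episode ended
        state, info = env.reset()

env.close()
```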

  • policy

Defines the behavior criteria of the agent in a given state.

The policy function (which can be continuous or discrete) is a mapping from the agent's state to the action it takes in that state. It is usually written \(π(a_t|s_t)\), the conditional probability distribution of the action \(a_t\) given the state \(s_t\).

For example, consider a Mario-style game in which Mario's task is to collect more gold coins while avoiding obstacles. The policy function π(a|s) returns a probability, which lies in the interval [0, 1].

\(π(a|s)=p(A=a|S=s)\)

Mario can move in three directions, so to achieve good results the policy assigns the following probabilities to the three directions (sampled in the sketch after this list):

  1. π(left | s)=0.2
  2. π(right | s)=0.1
  3. π(up | s)=0.7
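For a single state, this policy can be represented as a table of action probabilities and sampled from directly; the numbers below simply mirror the three probabilities listed above, and the dictionary/function names are illustrative.

```python
import random

# The policy pi(a|s) for one state, mirroring the probabilities above.
policy = {"left": 0.2, "right": 0.1, "up": 0.7}

def sample_action(policy):
    """Draw an action according to the conditional distribution pi(a|s)."""
    actions, probs = zip(*policy.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy))   # prints "up" roughly 70% of the time
```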

Reprinted from: my.oschina.net/u/3768341/blog/10322379