Wasserstein distance, contraction mappings, and modern RL theory

Based on "Wasserstein Distance, Contraction Mapping, and Modern RL Theory" by Kowshik Chilamkurthy (Medium)

1. Introduction

        Concepts and relationships that mathematicians explore with little or no application in mind often become, decades later, unexpected solutions to problems they never imagined. Riemann's geometry was developed for purely mathematical reasons, with no application in sight, and was later used by Einstein to describe the structure of space-time in general relativity.

2. Reinforcement Learning (RL) Concepts

        In reinforcement learning (RL), agents seek optimal policies for sequential decision problems. A common approach models the expectation of the return, i.e., the value. Recent advances under the banner of "distributional RL", however, focus on the full distribution of the random return R received by the agent. The state-action value can then be viewed explicitly as a random variable Z whose expected value is Q.

Equation 1: Ordinary Bellman operator B

The ordinary Bellman operator B (Eq. 1) plays a crucial role in approximating Q: we iteratively minimize the squared (L2) distance between Q and BQ, which is the essence of TD learning.
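Equation 1 appears as an image in the original post and is not reproduced here. For reference, a standard form of the Bellman evaluation operator for a policy π is the following (the notation x, a, R, γ is assumed to match the post):

```latex
(\mathcal{B}^{\pi} Q)(x, a)
  = \mathbb{E}\big[ R(x, a) \big]
  + \gamma \, \mathbb{E}_{x' \sim P(\cdot \mid x, a),\; a' \sim \pi(\cdot \mid x')}\big[ Q(x', a') \big]
```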

Equation 2: Distributional Bellman operator Ⲧπ

Similarly, the distributional Bellman operator Ⲧπ approximates Z by iteratively minimizing the distance between Z and ⲦπZ.
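Equation 2 is likewise an image in the original. The standard form of the distributional Bellman operator, as in Bellemare et al. (2017), is a distributional equality rather than an equality of expectations:

```latex
(\mathcal{T}^{\pi} Z)(x, a)
  \overset{D}{=} R(x, a) + \gamma \, Z(X', A'),
  \qquad X' \sim P(\cdot \mid x, a),\; A' \sim \pi(\cdot \mid X')
```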

Z and ⲦπZ are not vectors but distributions. How do we calculate the distance between two probability distributions? There are many candidates (the KL divergence, among other divergences and metrics), but here we are particularly interested in the Wasserstein distance.

3. What is the Wasserstein distance?

        The Russian mathematician Leonid Vaseršteĭn introduced the concept in 1969. The Wasserstein distance is a measure of the distance between two probability distributions. It is also known as the earth mover's distance (EMD), because informally it can be interpreted as the minimum cost of moving and reshaping a pile of dirt from the shape of one probability distribution into the shape of the other.

The earth mover's distance, image source: author

The Wasserstein metric dp between two cumulative distribution functions F and G is defined as:

Equation 3: Wasserstein Metric

where the infimum is taken over all pairs of random variables (U, V) with respective cumulative distribution functions F and G. dp(F, G) can also be written as:

Equation 4: Wasserstein Metric
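Equations 3 and 4 appear as images in the original post. The standard definitions they correspond to are, assuming the usual notation, the coupling form and the equivalent inverse-CDF (quantile) form:

```latex
% Equation 3 (coupling form): the infimum ranges over pairs (U, V)
% with marginal CDFs F and G
d_p(F, G) = \inf_{(U, V)} \big\| U - V \big\|_p

% Equation 4 (quantile form)
d_p(F, G) = \left( \int_0^1 \big| F^{-1}(u) - G^{-1}(u) \big|^p \, du \right)^{1/p}
```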

Example

Let's look at a simple case first: Suppose we have two discrete distributions f(x) and g(x), defined as follows:

f(1) = 0.1, f(2) = 0.2, f(3) = 0.4, f(4) = 0.3
g(1) = 0.2, g(2) = 0.1, g(3) = 0.2, g(4) = 0.5

Let's calculate the distance for p = 1. Following the earth-mover interpretation, we sweep across the support and track δi, the cumulative amount of probability mass that still needs to be moved after visiting point i (δi+1 = δi + f(i) − g(i), with δ0 = 0):

δ1 = 0.0 + 0.1 − 0.2 = −0.1
δ2 = −0.1 + 0.2 − 0.1 = 0.0
δ3 = 0.0 + 0.4 − 0.2 = 0.2
δ4 = 0.2 + 0.3 − 0.5 = 0.0

Therefore the Wasserstein metric (dp, with p = 1) is ∑|δi| = 0.1 + 0.0 + 0.2 + 0.0 = 0.3.
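A small Python check of this example (a sketch; it assumes SciPy is available and uses the 1-Wasserstein distance on the support {1, 2, 3, 4}):

```python
import numpy as np
from scipy.stats import wasserstein_distance

support = np.array([1, 2, 3, 4])
f = np.array([0.1, 0.2, 0.4, 0.3])   # weights of f
g = np.array([0.2, 0.1, 0.2, 0.5])   # weights of g

# 1-Wasserstein distance between the two weighted discrete distributions
w1 = wasserstein_distance(support, support, u_weights=f, v_weights=g)

# Manual check: sum of absolute CDF differences (unit spacing between support points)
deltas = np.cumsum(f - g)
w1_manual = np.sum(np.abs(deltas))

print(w1, w1_manual)  # both print 0.3 (up to floating-point error)
```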

4. Why choose the Wasserstein distance?

        Unlike the Kullback-Leibler (KL) divergence, the Wasserstein metric is a true probability metric: it takes into account both the probabilities of the various outcome events and the distances between them, and it provides a meaningful and smooth measure of the distance between distributions. These properties make the Wasserstein distance well suited to domains where the underlying similarity of outcomes is more important than the likelihood of an exact match.
        

Example generated by Python, image credit: author

Right: the KL divergence is the same between the red and blue distributions, while the Wasserstein distance measures the work required to transport the probability mass from the red distribution to the blue one.

Left: this is where the KL divergence runs into trouble. It stays the same regardless of how far, or in which direction, the probability mass has to be moved, because it compares the distributions point by point and ignores the geometry of the outcome space. It therefore gives us no way to reason about how far apart the two distributions really are.
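A small Python illustration of this contrast (a sketch; the exact distributions in the author's figure are not reproduced, so two Gaussian-shaped histograms shifted by different amounts on a common grid are assumed):

```python
import numpy as np
from scipy.stats import norm, entropy, wasserstein_distance

# Common grid and normalized "histograms": a blue distribution and two red ones
# shifted by different amounts. A tiny floor keeps the KL divergence finite.
x = np.linspace(-10, 10, 2001)

def hist(mean, scale=0.5, floor=1e-12):
    p = norm.pdf(x, loc=mean, scale=scale) + floor
    return p / p.sum()

blue = hist(0.0)
for shift in (2.0, 6.0):
    red = hist(shift)
    kl = entropy(blue, red)                      # KL(blue || red)
    w1 = wasserstein_distance(x, x, blue, red)   # 1-Wasserstein on the grid
    print(f"shift={shift}: KL={kl:.2f}, W1={w1:.2f}")

# For the larger shift the two bumps no longer overlap, so the KL value is
# dominated by the arbitrary floor (it would be infinite without it) and says
# little about how far apart the distributions are, while W1 simply grows with
# the distance the mass must travel (about 2 vs. about 6 here).
```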

5. ɣ-contraction

        Contraction mappings play a key mathematical role in the classical analysis of reinforcement learning. Let's first define a contraction.

5.1 Contraction Mapping

        A function (or operator, or mapping) f defined on the elements of a metric space (X, d) is a contraction if there exists some constant ɣ ∈ [0, 1) such that for any two elements X₁ and X₂ of the metric space the following holds:

        Equation 5: Contraction mapping

        This means that after applying the map f(·) to the elements X₁ and X₂, their distance from each other shrinks by at least a factor of ɣ, i.e., the new distance is at most ɣ times the original one.
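Equation 5 is an image in the original; the inequality it refers to is the standard contraction condition on a metric space (X, d):

```latex
d\big(f(X_1), f(X_2)\big) \le \gamma \, d(X_1, X_2), \qquad 0 \le \gamma < 1
```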

5.2 Contraction in RL

        Proving contraction is important because it justifies the use of the distance metric itself. The distributional operator Ⲧπ is used to estimate Z(x, a), and it turns out that Ⲧπ is a contraction in dp, which means that all moments of Z also converge, and do so exponentially fast.

        Equation 6: ɣ contraction

        Contraction means that applying the operator Ⲧπ to two different distributions shortens the distance between them, so the choice of distance metric is important. Let us now try to prove that the distributional operator Ⲧπ is a contraction in the Wasserstein distance dp.
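Equation 6 is shown as an image in the original. In the notation of Bellemare et al. (2017), the statement being proved is that Ⲧπ is a ɣ-contraction in the maximal (supremal) form of the Wasserstein metric:

```latex
\bar{d}_p\big(\mathcal{T}^{\pi} Z_1, \mathcal{T}^{\pi} Z_2\big)
  \le \gamma \, \bar{d}_p(Z_1, Z_2),
\qquad
\bar{d}_p(Z_1, Z_2) := \sup_{x, a} d_p\big(Z_1(x, a), Z_2(x, a)\big)
```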

5.3 Proof

        Three important properties of the Wasserstein metric help us demonstrate the contraction.
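The three properties and the proof itself appear as images in the original post and are not reproduced here. For reference, a standard sketch along the lines of Bellemare et al. (2017) uses the facts that dp does not increase when the same independent random variable is added to both arguments, scales by |a| under multiplication by a scalar a, and, when comparing two mixtures with the same mixing weights, is bounded by the supremum over the mixture components:

```latex
\begin{aligned}
d_p\big(\mathcal{T}^{\pi} Z_1(x, a), \mathcal{T}^{\pi} Z_2(x, a)\big)
  &= d_p\big(R(x, a) + \gamma Z_1(X', A'),\; R(x, a) + \gamma Z_2(X', A')\big) \\
  &\le d_p\big(\gamma Z_1(X', A'),\; \gamma Z_2(X', A')\big)
      && \text{(shift invariance)} \\
  &= \gamma \, d_p\big(Z_1(X', A'),\; Z_2(X', A')\big)
      && \text{(scaling)} \\
  &\le \gamma \sup_{x', a'} d_p\big(Z_1(x', a'), Z_2(x', a')\big)
      = \gamma \, \bar{d}_p(Z_1, Z_2)
      && \text{(mixture bound)}
\end{aligned}
```

Taking the supremum over (x, a) on the left-hand side then yields Equation 6.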

6. Conclusion

        In this blog, we defined the Wasserstein distance and discussed its advantages and disadvantages, and we justified its use as the distance metric for the distributional Bellman operator by showing that the operator is a contraction under it. But this is only the end of the beginning: the Wasserstein distance presents challenges when computing stochastic gradients, which makes it difficult to use directly with function approximation. In my next blog, I will discuss how to approximate the Wasserstein metric using quantile regression.

7. References

  1. What is the advantage of the Wasserstein metric compared to the Kullback-Leibler divergence? Cross Validated.
  2. https://runzhe-yang.science/2017-10-04-contraction/#contraction-property
  3. Bellemare, M. G., Dabney, W., Munos, R. A Distributional Perspective on Reinforcement Learning. ICML 2017.
