Wasserstein Distance, Contraction Mapping, and Modern RL Theory | by Kowshik chilamkurthy | Medium
1. Description
Concepts and relationships that mathematicians explore for their own sake often become, decades later, unexpected solutions to problems they never imagined in the first place. Riemann's geometry was developed for purely mathematical reasons, with no application in mind, and was later used by Einstein to describe the structure of space-time in general relativity.
2. Reinforcement Learning Concepts
In reinforcement learning (RL), agents seek optimal policies for sequential decision problems. The common approach models the expectation of the return, the value Q. Recent advances under the banner of "distributional RL", however, focus on the full distribution of the random return R received by the agent: the state-action value is treated explicitly as a random variable Z whose expected value is Q.
Equation 1: Ordinary Bellman operator B

(BQ)(x, a) = 𝔼[R(x, a)] + ɣ 𝔼[Q(x′, a′)]

The ordinary Bellman operator B (Eq. 1) plays a crucial role in approximating the value Q: TD learning iteratively minimizes the squared distance between Q and BQ.
Equation 2: Distributional Bellman operator Tπ

TπZ(x, a) :=D R(x, a) + ɣ Z(x′, a′)    (equality in distribution)

Similarly, the distributional Bellman operator Tπ approximates the return distribution Z by iteratively minimizing the distance between Z and TπZ.
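As a concrete sketch, assuming a single state-action pair with made-up reward and return parameters, one application of the distributional Bellman operator on a sample-based representation of Z might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9  # discount factor

# Toy setup: the return distribution Z for a single (state, action)
# pair is represented by Monte Carlo samples (parameters are made up).
z_samples = rng.normal(loc=1.0, scale=0.5, size=10_000)

# One application of the distributional Bellman operator:
# Tπ Z(x, a) =_D R(x, a) + gamma * Z(x', a'), sampled elementwise.
rewards = rng.normal(loc=2.0, scale=0.1, size=z_samples.shape)
tz_samples = rewards + gamma * z_samples

# The ordinary Bellman operator B acts on the expectation Q = E[Z]:
q = z_samples.mean()
bq = rewards.mean() + gamma * q

# Taking expectations of Tπ Z recovers BQ, i.e. E[Tπ Z] = B E[Z].
print(abs(tz_samples.mean() - bq) < 1e-6)  # True
```

Taking the mean of the backed-up samples recovers the ordinary Bellman backup, which is exactly the sense in which Q is the expectation of Z.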
But Z and TπZ are not vectors; they are distributions. How do we calculate the distance between two probability distributions? There are many candidates (the KL divergence among them), but we are particularly interested in the Wasserstein distance.
3. What is the Wasserstein distance?
The Russian mathematician Leonid Vaseršteĭn introduced the concept in 1969. The Wasserstein distance is a measure of the distance between two probability distributions. It is also known as the earth mover's distance (EMD) because, informally, it can be interpreted as the minimum cost of moving and reshaping a pile of dirt from the shape of one probability distribution into the shape of another.
The earth mover's distance, image source: author
The Wasserstein metric dp between two cumulative distribution functions F and G is defined as:

Equation 3: Wasserstein metric

dp(F, G) = inf(U,V) (𝔼[|U − V|ᵖ])^(1/p)
where the infimum is taken over all pairs of random variables (U, V) with respective cumulative distributions F and G. dp(F, G) can equivalently be written in terms of inverse CDFs:

Equation 4: Wasserstein metric

dp(F, G) = (∫₀¹ |F⁻¹(u) − G⁻¹(u)|ᵖ du)^(1/p)
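Equation 4 suggests a simple way to estimate dp from samples: for equal-size samples, the empirical inverse CDFs are just the sorted arrays. A minimal sketch for p = 1, with made-up Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(0.0, 1.0, 100_000)  # samples from F
v = rng.normal(3.0, 1.0, 100_000)  # samples from G, shifted by 3

# For equal-size samples, the empirical inverse CDFs in Equation 4 are
# the sorted arrays, so for p = 1 the integral becomes a mean over
# matched order statistics.
d1 = np.mean(np.abs(np.sort(u) - np.sort(v)))
print(abs(d1 - 3.0) < 0.05)  # d1 is close to the mean shift of 3.0
```

For two Gaussians that differ only by a shift, the 1-Wasserstein distance is exactly the size of the shift, which the estimate recovers.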
Example
Let's look at a simple case first: Suppose we have two discrete distributions f(x) and g(x), defined as follows:
f(1) = 0.1, f(2) = 0.2, f(3) = 0.4, f(4) = 0.3
g(1) = 0.2, g(2) = 0.1, g(3) = 0.2, g(4) = 0.5
Let's calculate Equation 3 for p = 1 by taking the differences of the cumulative distributions, δi = F(i) − G(i):
δ1 = 0.1 − 0.2 = −0.1
δ2 = 0.3 − 0.3 = 0.0
δ3 = 0.7 − 0.5 = 0.2
δ4 = 1.0 − 1.0 = 0.0
Therefore the Wasserstein metric dp = ∑|δi| = 0.1 + 0.0 + 0.2 + 0.0 = 0.3.
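This calculation is easy to check numerically: for p = 1 on a unit-spaced support, the metric reduces to the sum of absolute differences of the CDFs.

```python
import numpy as np

f = np.array([0.1, 0.2, 0.4, 0.3])
g = np.array([0.2, 0.1, 0.2, 0.5])

# delta_i = F(i) - G(i): differences of the cumulative distributions
deltas = np.cumsum(f) - np.cumsum(g)
d1 = np.abs(deltas).sum()
print(round(d1, 6))  # 0.3
```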
4. Why choose Wasserstein distance
Unlike the Kullback-Leibler divergence, the Wasserstein metric is a true probability metric: it accounts for both the probabilities of and the distances between the various outcome events, and it provides a meaningful, smooth measure of the distance between distributions. These properties make the Wasserstein distance well suited to domains where the underlying similarity of outcomes matters more than the likelihood of an exact match.
Example generated by Python, image credit: author
Right: the KL divergence is the same between the red and blue distributions in both cases, while the Wasserstein distance measures the work required to move the probability mass from the red state to the blue state.
Left: the Wasserstein distance does have a limitation. It measures only how much mass is moved and how far, not in which direction, so transfers of equal magnitude in opposite directions give the same distance and we have no way to reason about direction.
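The difference between the two metrics is easy to demonstrate with point masses on a small support. The helper functions `kl` and `w1` below are illustrative sketches, not library APIs:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence with a small epsilon to avoid log(0) (illustrative)
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def w1(p, q):
    # 1-Wasserstein distance on a unit-spaced support via CDF differences
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

base = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
near = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # mass one step away
far  = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # mass three steps away

# KL sees "near" and "far" as equally (maximally) dissimilar to base...
print(abs(kl(base, near) - kl(base, far)) < 1e-6)  # True
# ...while W1 grows with the distance the mass must travel.
print(w1(base, near), w1(base, far))  # 1.0 3.0
```

The disjoint supports make the KL divergence blind to how far apart the outcomes are, which is exactly the failure the Wasserstein distance avoids.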
5. ɣ-contraction
Contraction mappings play a key mathematical role in the classical analysis of reinforcement learning. Let's first define a contraction.
5.1 Contraction Mapping
A function (or operator, or mapping) f defined on the elements of a metric space (X, d) is a contraction if there exists some constant ɣ ∈ [0, 1) such that for any two elements X₁ and X₂ of the metric space the following holds:
Equation 5: Contraction mapping

d(f(X₁), f(X₂)) ≤ ɣ · d(X₁, X₂)
This means that after applying the map f(·) to the elements X₁ and X₂, their distance from each other shrinks by at least a factor ɣ.
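A toy contraction on the real line, with made-up constants, shows why this property matters: iterating a ɣ-contraction collapses any two starting points onto a unique fixed point, much as repeated Bellman backups converge to the true value function.

```python
# f(x) = gamma * x + c satisfies |f(x1) - f(x2)| = gamma * |x1 - x2|,
# so it is a gamma-contraction on the real line (constants are made up).
gamma, c = 0.9, 1.0
x1, x2 = 100.0, -50.0
for _ in range(200):
    x1, x2 = gamma * x1 + c, gamma * x2 + c

# Both iterates converge to the unique fixed point x* = c / (1 - gamma).
fixed_point = c / (1 - gamma)  # 10.0
print(abs(x1 - fixed_point) < 1e-6, abs(x1 - x2) < 1e-6)  # True True
```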
5.2 Contraction in RL
Proving contraction is important because it justifies the use of the distance metric itself. The distributional operator Tπ is used to estimate Z(x, a), and it turns out that Tπ is a ɣ-contraction in the maximal Wasserstein metric d̄p (the supremum of dp over state-action pairs), which implies that all moments of Z also converge exponentially fast.
Equation 6: ɣ-contraction

d̄p(TπZ₁, TπZ₂) ≤ ɣ · d̄p(Z₁, Z₂)
The contraction property shows that applying the operator Tπ to two different distributions shortens the distance between them, so the choice of distance metric is important. Let us now try to prove that the distributional operator Tπ is a contraction in the Wasserstein distance dp.
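Before the formal proof, the claim can be checked empirically: apply a sample-based Bellman backup, with made-up reward and return distributions, to two different Z's and compare their Wasserstein distance before and after.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n = 0.8, 50_000

def w1(a, b):
    # empirical 1-Wasserstein distance via matched order statistics
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

# Two candidate return distributions for the same state-action pair
# (parameters chosen for illustration only)
z1 = rng.normal(0.0, 1.0, n)
z2 = rng.normal(4.0, 2.0, n)

# Sample-based distributional Bellman backup with a shared reward term
r = rng.normal(1.0, 0.5, n)
tz1 = r + gamma * z1
tz2 = r + gamma * z2

# The backup shrinks the distance by roughly the factor gamma.
before, after = w1(z1, z2), w1(tz1, tz2)
print(round(after / before, 2))  # approximately gamma = 0.8
```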
5.3 Proof
Three properties of the Wasserstein metric, standard in the distributional RL literature, help us demonstrate the contraction. For a scalar a and a random variable A independent of U and V:

1. dp(aU, aV) ≤ |a| · dp(U, V)
2. dp(A + U, A + V) ≤ dp(U, V)
3. dp(AU, AV) ≤ ‖A‖p · dp(U, V)

Writing TπZ(x, a) =D R(x, a) + ɣ Z(x′, a′), property 2 strips the shared reward term and property 1 pulls out the factor ɣ, giving dp(TπZ₁(x, a), TπZ₂(x, a)) ≤ ɣ · d̄p(Z₁, Z₂); taking the supremum over (x, a) yields Equation 6.
6. Conclusion
In this blog, we defined the Wasserstein distance, discussed its advantages and disadvantages, and justified its use as a distance metric for the distributional Bellman operator by showing that the operator is a contraction under it. But that's just the end of the beginning: the Wasserstein distance presents challenges when computing stochastic gradients, which makes it ineffective with function approximation. In my next blog, I will discuss how to approximate the Wasserstein metric using quantile regression.