Optimal Transport and Its Application to Fairness

Optimal transport originated in economics as a tool for deciding how best to allocate resources. The theory itself can be traced back to 1781, when Gaspard Monge studied the most efficient way of transporting earth to build military fortifications. In a nutshell, optimal transport is the problem of moving all of a resource (say, iron) from a collection of starting points (iron mines) to a collection of endpoints (iron factories) while minimizing the total distance the resource has to travel. Mathematically, we want to find a function that maps each origin to a destination while minimizing the total distance between each origin and its corresponding destination. Although the description is innocuous, little progress was made on this original formulation of the problem for well over a century.
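In modern notation, Monge's problem asks for a single map $T$ that pushes the distribution of origins $\mu$ onto the distribution of destinations $\nu$ at minimal cost:

$$\min_{T \,:\, T_{\#}\mu = \nu} \int \|x - T(x)\| \, d\mu(x),$$

where the constraint $T_{\#}\mu = \nu$ says that sending each origin $x$ to $T(x)$ must exactly reproduce the destination distribution.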

The first real breakthrough came in the 1940s, when the Soviet mathematician Leonid Kantorovich adapted the problem into its modern form, now known as the Monge-Kantorovich formulation. The novelty is to allow iron from the same mine to be split among different factories: for example, 60% of a mine's iron can be sent to one factory, while the remaining 40% goes to another. Mathematically, the assignment is no longer a function, since the same origin may now map to many destinations; instead, it is called a coupling. The relationship between the origin and destination distributions is shown in the diagram below: picking a mine from the blue distribution (origins) and moving vertically shows how its iron is spread across the distribution of factories (destinations).

 
[Figure: a coupling between the distribution of mines (blue) and the distribution of factories (magenta)]
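To make the coupling idea concrete, here is a minimal sketch that solves a tiny transport problem as a linear program with scipy; the two mines, two factories, and all of the costs are made-up numbers for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up example: 2 mines (supplies) and 2 factories (demands).
supply = np.array([0.6, 0.4])   # fraction of all iron sitting at each mine
demand = np.array([0.5, 0.5])   # fraction of all iron each factory needs
cost = np.array([[1.0, 3.0],    # cost[i, j] = distance from mine i to factory j
                 [2.0, 1.0]])

n, m = cost.shape
c = cost.ravel()                # decision variable: flattened coupling P[i, j]

# Row sums: each mine ships out exactly its supply.
A_rows = np.zeros((n, n * m))
for i in range(n):
    A_rows[i, i * m:(i + 1) * m] = 1.0
# Column sums: each factory receives exactly its demand.
A_cols = np.zeros((m, n * m))
for j in range(m):
    A_cols[j, j::m] = 1.0

res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
              b_eq=np.concatenate([supply, demand]), bounds=(0, None))
P = res.x.reshape(n, m)
print("optimal coupling:\n", P)        # mass sent from each mine to each factory
print("total transport cost:", res.fun)
```

In the optimal coupling found here, the first mine's iron is split between both factories, which is exactly the kind of plan Monge's map-based formulation cannot express.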

As part of this new formulation, Kantorovich introduced an important quantity now called the Wasserstein distance. Like the distance between two points on a map, the Wasserstein distance (also known as the earth mover's distance, a nod to its original earth-moving context) measures how far apart two distributions are, such as the blue and magenta distributions in this example. For instance, if all of the iron mines were far away from all of the iron factories, the Wasserstein distance between the distribution of mine locations and the distribution of factory locations would be very large. Even with these improvements, it was still not clear that a single best way to transport resources actually exists, let alone what that way is. The theory finally began to develop rapidly in the 1990s, when advances in mathematical analysis and optimization led to solutions in important special cases. It was also around this time, and into the 21st century, that optimal transport began to permeate other fields, such as particle physics, fluid dynamics, and even statistics and machine learning.
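Concretely, for origin and destination distributions $\mu$ and $\nu$, the (first-order) Wasserstein distance is the cost of the best possible coupling:

$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int d(x, y) \, d\pi(x, y),$$

where $\Pi(\mu, \nu)$ is the set of all couplings with marginals $\mu$ and $\nu$, and $d(x, y)$ is the distance from mine $x$ to factory $y$.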

Modern Optimal Transport
With this explosion of new theory, optimal transport has been at the core of many of the statistical and AI algorithms that have emerged over the last two decades. In virtually every statistical algorithm, data is modeled, explicitly or implicitly, as having some underlying probability distribution. For example, if you collect data on the income of individuals in different countries, each country has a probability distribution over the income of its population. If we want to compare two countries based on the income distributions of their populations, we need a way to measure the gap between the two distributions. This is exactly why optimal transport (in particular, the Wasserstein distance) is so useful in data science. The Wasserstein distance is not the only measure of how far apart two probability distributions are, however. In fact, two alternatives, the L-2 distance and the Kullback-Leibler (KL) divergence, are historically more common because of their connections to physics and information theory. The main advantage of the Wasserstein distance over these alternatives is that it takes both the values and their probabilities into account, whereas the L-2 distance and the KL divergence consider only the probabilities. The figure below shows an artificial dataset on the incomes of three fictional countries.

In this case, since the distributions do not overlap, the L-2 distance (or KL divergence) between the blue and magenta distributions is about the same as the L-2 distance between the blue and green distributions. The Wasserstein distance between the blue and magenta distributions, on the other hand, is much smaller than the Wasserstein distance between the blue and green distributions, because it accounts for the difference in values (the horizontal separation). This property makes the Wasserstein distance ideal for quantifying differences between distributions, and between datasets in particular.
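The following sketch reproduces this effect with three made-up Gaussian samples standing in for the blue, magenta, and green income distributions; it uses scipy's wasserstein_distance together with a simple histogram-based L-2 distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Made-up stand-ins for the three income distributions in the figure:
blue    = rng.normal(loc=20, scale=2, size=10_000)   # incomes centered at 20
magenta = rng.normal(loc=30, scale=2, size=10_000)   # shifted right a little
green   = rng.normal(loc=80, scale=2, size=10_000)   # shifted right a lot

def l2_distance(a, b, bins=200, lo=0.0, hi=100.0):
    """Squared L-2 distance between histogram density estimates on a shared grid."""
    grid = np.linspace(lo, hi, bins + 1)
    pa, _ = np.histogram(a, bins=grid, density=True)
    pb, _ = np.histogram(b, bins=grid, density=True)
    return np.sum((pa - pb) ** 2) * (grid[1] - grid[0])

# The Wasserstein distance grows with the horizontal gap between distributions...
print(wasserstein_distance(blue, magenta))   # roughly 10
print(wasserstein_distance(blue, green))     # roughly 60
# ...while the L-2 distance saturates once the distributions stop overlapping.
print(l2_distance(blue, magenta))
print(l2_distance(blue, green))              # about the same as the line above
```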

Fairness via Optimal Transport
With massive amounts of data collected every day and machine learning becoming more prevalent across many industries, data scientists must be increasingly careful not to let their analyses and algorithms perpetuate existing biases and stereotypes in the data. For example, if a dataset of home mortgage approvals contains information on the race of applicants, and minorities were discriminated against during collection because of the methods used or unconscious bias, then a model trained on that data will reflect the underlying prejudice. Optimal transport can be used to help mitigate this bias and improve fairness in two ways. The first and simplest is to use the Wasserstein distance to check whether there is potential bias in a dataset. For example, we can estimate the Wasserstein distance between the distribution of loan amounts approved for women and the distribution of loan amounts approved for men; if the distance is large, i.e., statistically significant, we might suspect an underlying bias. This idea of testing whether there is a difference between two groups is known in statistics as a two-sample hypothesis test.
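As an illustration, here is one simple way such a test could be run: a permutation test that uses the Wasserstein distance as its test statistic. The loan amounts below are hypothetical numbers invented for the example.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_permutation_test(x, y, n_permutations=2000, seed=0):
    """Two-sample permutation test with the Wasserstein distance as statistic.

    Returns the observed distance and a p-value for the null hypothesis
    that x and y were drawn from the same distribution.
    """
    rng = np.random.default_rng(seed)
    observed = wasserstein_distance(x, y)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                  # random relabeling of the two groups
        stat = wasserstein_distance(pooled[:len(x)], pooled[len(x):])
        if stat >= observed:
            count += 1
    # Add-one smoothing keeps the estimated p-value away from an exact zero.
    return observed, (count + 1) / (n_permutations + 1)

# Hypothetical approved loan amounts (in $1000s) for the two groups:
loans_women = np.array([110.0, 95, 120, 100, 105, 90, 115, 98])
loans_men   = np.array([130.0, 140, 125, 150, 135, 145, 128, 138])
dist, p = wasserstein_permutation_test(loans_women, loans_men)
print(f"Wasserstein distance = {dist:.1f}, p-value = {p:.3f}")  # small p: suspect bias
```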

Alternatively, optimal transport can be used to enforce fairness in a model even when the underlying dataset is itself biased. From a practical standpoint this is very useful, since many real datasets exhibit some degree of bias, and collecting unbiased data can be prohibitively expensive, time-consuming, or simply infeasible. So, however imperfect the data, it is much more practical to use what we have available and make sure our models mitigate the bias. This is achieved by enforcing a constraint known as strong demographic parity, which forces model predictions to be statistically independent of any sensitive attributes. One approach is to map the distribution of model predictions to an adjusted predicted distribution that does not depend on the sensitive attributes. Adjusting the predictions also changes the performance and accuracy of the model, however, so there is a trade-off between model performance and how much the model depends on the sensitive attributes (i.e., fairness).
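Formally, strong demographic parity requires the entire distribution of the prediction $\hat{Y}$ to be the same for every group of the sensitive attribute $S$:

$$\mathbb{P}(\hat{Y} \le t \mid S = s) = \mathbb{P}(\hat{Y} \le t) \quad \text{for every threshold } t \text{ and every group } s,$$

in other words, $\hat{Y}$ is statistically independent of $S$, not merely equal in average across groups.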

The optimal transport approach changes the predictions as little as possible, preserving model performance as much as it can, while still guaranteeing that the new predictions are independent of the sensitive attributes. The distribution the adjusted predictions follow is called the Wasserstein barycenter, which has been the subject of much research over the past decade. The Wasserstein barycenter is analogous to the mean of a set of probability distributions, in that it minimizes the total (Wasserstein) distance from itself to all of the other distributions. The figure below shows three distributions (green, blue, and magenta) and their Wasserstein barycenter in red.

In the example above, suppose we built a model to predict a person's age and income. The dataset contains a sensitive attribute, marital status, which can take three possible values: single (blue), married (green), and widowed/divorced (magenta). The scatterplot shows the distribution of model predictions for each value, and we want to adjust these so that the new model predictions are blind to a person's marital status. Using optimal transport, we can map each of these distributions to the red barycenter. Because all three groups then map to the same distribution, we can no longer infer a person's marital status from their predicted income and age, or vice versa, while the barycenter preserves the fidelity of the model as much as possible.
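In one dimension the barycenter has a convenient closed form: its quantile function is the weighted average of the groups' quantile functions. The sketch below uses that fact to adjust a single predicted score (say, income); the group names and numbers are hypothetical, and the full two-dimensional age-income picture would need a genuine multivariate barycenter solver.

```python
import numpy as np

def barycenter_repair(scores_by_group):
    """Map each group's 1-D predictions onto their common Wasserstein barycenter.

    Each prediction is replaced by the weighted average of every group's
    quantile at that prediction's own within-group quantile level.
    """
    groups = list(scores_by_group)
    sizes = np.array([len(scores_by_group[g]) for g in groups], dtype=float)
    weights = sizes / sizes.sum()      # weight groups by their share of the data

    repaired = {}
    for g in groups:
        s = np.asarray(scores_by_group[g], dtype=float)
        # Quantile level of each prediction within its own group.
        u = (np.argsort(np.argsort(s)) + 0.5) / len(s)
        # Barycenter quantile function = weighted average of group quantiles.
        repaired[g] = sum(
            w * np.quantile(np.asarray(scores_by_group[h], dtype=float), u)
            for h, w in zip(groups, weights)
        )
    return repaired

# Hypothetical predicted incomes (in $1000s) for three marital-status groups:
rng = np.random.default_rng(1)
preds = {
    "single":  rng.normal(40, 5, 500),
    "married": rng.normal(55, 5, 500),
    "widowed": rng.normal(48, 5, 500),
}
fair = barycenter_repair(preds)
# After the mapping all three groups share (approximately) one distribution,
# so marital status can no longer be inferred from the adjusted predictions.
print({g: round(float(np.mean(v)), 1) for g, v in fair.items()})
```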

The growing ubiquity of data and the use of machine learning models in business and government decision-making have given rise to new social and ethical concerns about ensuring their fair application. Many datasets contain some sort of bias due to how they were collected, so it is important that models trained on them do not exacerbate this bias or any historical discrimination. Optimal transport is just one way to tackle this problem, and it has gained momentum in recent years. Today there are fast and efficient ways to compute optimal transport maps and distances, making the approach suitable for modern, large datasets. As we rely ever more on data-driven models and insights, fairness has become, and will continue to be, a central issue in data science, and optimal transport will play a key role in achieving it.
