What exactly is stable learning? And how does it relate to deep learning, causal learning, and transfer learning?

Machine Learning | Stable Learning | DGBR

Deep Learning | Transfer Learning | Causal Learning

As is well known, deep learning is an important research direction within machine learning. It relies heavily on data analysis, data mining, and high-performance computing, and therefore places extremely demanding requirements on servers: traditional air cooling can no longer meet the heat-dissipation needs, and emerging liquid-cooling technology is needed to achieve energy saving, emission reduction, quiet operation, and high efficiency. Beyond deep learning, machine learning has also made great progress in causal learning, virtual simulation, and medical research and development. Although machine learning has succeeded in many fields, the potential risk of spurious correlations still limits the application of these models in many risk-sensitive domains. Stable learning was proposed to meet this challenge: it tries to build more reliable machine learning models without sacrificing model performance.

On February 23 this year, Associate Professor Cui Peng of the Department of Computer Science at Tsinghua University and Susan Athey of Stanford University (a member of the US National Academy of Sciences and an international authority on causality) published a perspective paper titled "Stable Learning Establishes Some Common Ground Between Causal Inference and Machine Learning" in the journal Nature Machine Intelligence (2020 impact factor 15.51). The paper explores and summarizes the role of causal inference in machine learning and artificial intelligence, and argues that machine learning and causal inference should reach common ground, with stable learning moving toward that goal. Based on these viewpoints, this article summarizes a series of advances in stable learning.

AI's current challenges

Looking back over the past 20 years of artificial intelligence (AI), its progress has been closely tied to the growth of the Internet economy, and AI has been applied in many scenarios such as online search and product recommendation. In these scenarios the cost of a wrong decision is small (for example, recommending a product the user is not interested in), so users have relatively low requirements for the stability and reliability of AI models. Today, AI is gradually penetrating fields that are closely related to people's lives and have a major impact on human survival and development, including medical care, justice, and transportation. In this context, the reliability and stability of AI models have become increasingly important and largely determine the extent to which we can use and rely on AI to assist decision-making.

We believe there are two important problems with current AI models in practice. First, models lack interpretability: people cannot understand the logic and reasons behind a model's judgments, so when making a decision they can only accept or reject the model's answer wholesale. We believe this can be addressed by establishing a human-in-the-loop mechanism for cooperative decision-making. Second, models lack stable performance in unknown environments. Most current AI models rely on the i.i.d. assumption (independent and identically distributed data), i.e., that the training and test data follow similar distributions. In practice, however, the possible data distributions cannot be fully foreseen (the test distribution is unknown), and model performance can then no longer be guaranteed. This article focuses on the second problem: the stability of model performance in unknown environments.

Take as an example an AI application that recognizes whether a dog is present in a picture. The left side of the figure shows a set of training images containing dogs, most of which have grass backgrounds. On the test set, the model judges pictures with grass backgrounds well (top right), but its accuracy drops on pictures with non-grass backgrounds (bottom right).

New Advances in Causal Learning: Deep Stable Learning

Deep learning has made unprecedented progress in many research fields, especially computer vision (e.g., image recognition and object detection), and the performance of deep models depends on how well they fit the training data. When the distribution of the training data (the data available before deployment) differs from that of the test data (the instances encountered in practice), a traditional deep model's thorough fit to the training data causes its predictions on the test data to fail, reducing confidence in the model when it is applied in different environments. To address this generalization problem under distribution shift, Cui Peng's team proposed deep stable learning to improve the accuracy and stability of models in arbitrary unknown application environments.

The figure above shows the similarities and differences among the common i.i.d. learning, transfer learning, and stable learning settings. An i.i.d. model is trained and tested on data drawn from the same distribution, and the goal is to improve accuracy on the test set, which imposes strong requirements on the test environment. Transfer learning also aims to improve accuracy on the test set, but allows the test distribution to differ from the training distribution; however, both i.i.d. learning and transfer learning require the test distribution to be known. Stable learning, by contrast, aims to reduce the variance of model accuracy across a variety of sample distributions while preserving the average accuracy. In theory, stable learning can therefore perform better across test sets with different distributions.

1. Stable learning based on essential characteristics

Existing deep learning models try to use the correlations between all observable features and labels for learning and prediction, but the features correlated with labels in the training data are not necessarily the essential features of the corresponding categories. The basic idea of deep stable learning is to extract the essential features of each category, remove irrelevant features and spurious associations, and make predictions based only on essential features (those causally related to the label). As shown in the figure below, when the training environment is complex and strongly correlated with the sample labels, traditional convolutional networks such as ResNet cannot distinguish essential features from environmental features and therefore use all features for prediction, whereas StableNet can separate essential features from environmental features and attend only to the former. As a result, StableNet can make stable predictions no matter how the environment (domain) changes.

Saliency maps of a traditional deep model and the deep stable learning model: the brighter a region, the larger its contribution to the prediction. The two models attend to clearly different features: StableNet focuses on the object itself, while the traditional deep model also attends to environmental features.

Most existing stable learning methods target linear models; for neural network models, causality can be inferred through confounder balancing. Specifically, suppose we want to infer the causal relationship between variable A and variable B in the presence of a confounder C. Taking A to be a binary variable (0 or 1), we divide the samples into two groups according to the value of A (A=0 or A=1) and assign each sample a weight so that the distribution of the confounder C is the same in both groups, i.e., D(C|A=0) = D(C|A=1), where D denotes the distribution. We can then determine whether A has a causal effect on B by checking whether D(B|A=0) and D(B|A=1) differ.
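
To make the balancing idea concrete, here is a minimal sketch in Python. It uses inverse propensity weighting, a standard way to make D(C|A=0) match D(C|A=1); the papers discussed here learn the weights directly rather than through a propensity model, and the variable names and synthetic data below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: binary treatment A depends on confounder C, outcome B has a
# true causal effect of 2.0 from A plus a direct effect of C.
rng = np.random.default_rng(0)
n = 5000
C = rng.normal(size=(n, 3))                              # observed confounders
A = (C[:, 0] + rng.normal(size=n) > 0).astype(int)       # treatment depends on C
B = 2.0 * A + C[:, 0] + rng.normal(size=n)               # outcome, true effect 2.0

# Balancing weights via inverse propensity scores P(A=1 | C).
p = LogisticRegression().fit(C, A).predict_proba(C)[:, 1]
w = np.where(A == 1, 1.0 / p, 1.0 / (1.0 - p))

# After reweighting, the difference of weighted outcome means estimates the causal effect.
effect = (np.sum(w * A * B) / np.sum(w * A)
          - np.sum(w * (1 - A) * B) / np.sum(w * (1 - A)))
print(f"estimated causal effect of A on B: {effect:.2f}")  # close to 2.0
```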

In computer vision scenarios, the features produced by a convolutional network are continuous-valued in every dimension and exhibit complex nonlinear dependencies, so the confounder-balancing method above cannot be applied directly to eliminate the correlations between features. Moreover, the training sets used in deep learning are usually large and the deep features are high-dimensional, so global sample weights cannot be computed directly. The problem this work addresses is how to find, within a deep network, a set of sample weights that makes all variables mutually independent: choosing any variable as the target, its distribution should not change with the values of the other variables.

2. De-correlation of deep features based on random Fourier features

However, the dimensions of deep features have complex dependencies, and removing only the linear correlations between variables is not enough to eliminate the spurious correlations between irrelevant features and labels. A natural idea is therefore to use a kernel method to map the features into a high-dimensional space, but kernel mapping expands the feature map to infinitely many dimensions, making the correlations between the mapped dimensions impossible to compute. Given the excellent properties of Random Fourier Features (RFF) for approximating kernel functions and measuring feature independence, this work uses RFF to map the original features into a higher-dimensional space (which can be understood as expanding the feature dimension); eliminating linear correlations among the new features then guarantees strict independence of the original features, as shown in the figure below.
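
The following sketch illustrates the idea, assuming NumPy and Gaussian-kernel RFFs with arbitrary illustrative dimensions: two features whose linear correlation is essentially zero still show a clearly non-zero cross-covariance once mapped to RFF space, which is the signal the decorrelation weights are trained to suppress.

```python
import numpy as np

def rff(x, n_features=32, gamma=1.0, seed=0):
    """Map a column vector x of shape (n, 1) to (n, n_features) random Fourier features."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

def weighted_cross_cov(u, v, w):
    """Frobenius norm of the weighted cross-covariance between two RFF blocks."""
    w = w / w.sum()
    u_c = u - (w[:, None] * u).sum(axis=0)
    v_c = v - (w[:, None] * v).sum(axis=0)
    cov = (w[:, None] * u_c).T @ v_c
    return np.linalg.norm(cov)

# Two nonlinearly dependent features: linear correlation misses the dependence,
# the cross-covariance of their RFF maps exposes it.
x = np.random.default_rng(1).normal(size=(2000, 1))
y = x ** 2
print(np.corrcoef(x[:, 0], y[:, 0])[0, 1])                  # roughly 0
print(weighted_cross_cov(rff(x), rff(y), np.ones(len(x))))  # clearly > 0
```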

3. Global optimization of sample weights

The formulation above requires learning a specific weight for each training sample, but in practice, especially for deep learning tasks, using all samples to learn sample weights globally incurs enormous computation and storage overhead. In addition, when the network is optimized with SGD, only part of the samples are visible to the model in each iteration, so the feature vectors of all samples cannot be obtained at once. This work therefore proposes saving and reloading sample features and sample weights: at the end of each training iteration, the current sample features and weights are fused with the stored ones and saved, and at the beginning of the next iteration they are reloaded as global prior knowledge of the training data to optimize the new round of sample weights, as shown in the figure below.
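
A minimal sketch of this save-and-reload idea, assuming PyTorch; the buffer size and the simple concatenate-and-truncate fusion rule are assumptions, not the paper's exact mechanism.

```python
import torch

class GlobalFeatureBuffer:
    """Keep features and sample weights from earlier iterations as a global prior."""

    def __init__(self, max_size=4096):
        self.max_size = max_size
        self.feats, self.weights = None, None

    def fused_batch(self, batch_feats, batch_weights):
        """Concatenate the current mini-batch with the stored global prior."""
        if self.feats is None:
            return batch_feats, batch_weights
        return (torch.cat([self.feats, batch_feats], dim=0),
                torch.cat([self.weights, batch_weights], dim=0))

    def update(self, batch_feats, batch_weights):
        """At the end of an iteration, store a detached copy of the batch."""
        feats, weights = batch_feats.detach(), batch_weights.detach()
        if self.feats is None:
            self.feats, self.weights = feats, weights
        else:
            self.feats = torch.cat([self.feats, feats], dim=0)[-self.max_size:]
            self.weights = torch.cat([self.weights, weights], dim=0)[-self.max_size:]
```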

The structure of StableNet is shown in the figure below. The input image is passed through a convolutional network to extract visual features, which then feed two branches: the upper branch is the sample-weight learning sub-network, and the lower branch is a conventional classification network. The final training loss is the sample-weighted classification loss. The LSWD module (Learning Sample Weights for Decorrelation) uses RFF to learn sample weights that make every feature dimension independent.
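
As a rough illustration (assuming PyTorch), the final objective can be written as a per-sample cross-entropy weighted by the learned sample weights; normalizing the raw weights with a softmax is an illustrative choice rather than StableNet's exact scheme.

```python
import torch
import torch.nn.functional as F

def weighted_classification_loss(logits, labels, raw_sample_weights):
    """Per-sample cross-entropy weighted by learned sample weights."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")   # (batch,)
    w = torch.softmax(raw_sample_weights, dim=0) * len(labels)       # mean weight about 1
    return (w * per_sample).mean()
```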

Taking dog recognition as an example, suppose most dogs in the training samples are on grass and a few are on the beach. After sample reweighting, the visual features of each image become dimension-wise independent, i.e., the dog features are statistically uncorrelated with the grass and beach features. The classifier then finds it easier to attend to dog-related features when predicting whether a dog is present (if it focused on features such as grass or sand, the prediction loss would rise sharply). As a result, whether a test-time dog is on grass or on a beach, StableNet can give accurate predictions based on essential features, realizing generalization to OOD data.

4. Domain generalization tasks with broader implications

In conventional domain generalization (DG) tasks, the different source domains in the training set have similar sizes and clear heterogeneity. In practical applications, however, most datasets are mixtures of several latent source domains: when the heterogeneity among source domains is unclear or not explicitly labeled, it is hard to assume that every source domain contributes roughly the same amount of data. To verify the generalization performance of StableNet more comprehensively, this work proposes three new domain generalization tasks that simulate more general and challenging distribution-shift scenarios.

1. Unbalanced Domain Generalization

For domain generalization problems in which the source domains are unclear, it is too idealistic to assume that the source domains are of similar size. A more general assumption is that the amount of data from different source domains may differ, possibly by a lot; the model's ability to generalize to unknown target domains in this case better matches practical applications. In the dog-recognition example, we can hardly assume that pictures with grass, beach, or water backgrounds appear in equal numbers; in reality, dogs appear more often on grass and less often in water. The model's predictions must therefore not be misled by the grass background that frequently co-occurs with dogs, so this task is both more general and markedly harder than balanced domain generalization.

The experimental results using ResNet-18 as the feature extraction network are shown in the table below. StableNet achieves the best performance on the PACS and VLCS datasets.

2. Domain generalization with missing categories

We consider a more challenging situation that often arises in real scenarios: data for some categories is missing in some source domains, while the model must recognize all categories at test time. For example, birds often appear in trees but rarely in water, and fish often appear in fish tanks and rarely in trees, so not every source domain necessarily contains every category. This scenario demands stronger generalization, and because each source domain contains only some categories, the spurious association between domain-related features and labels is stronger and more likely to mislead the classifier.

The table below shows the experimental results. Because of their requirements on domain heterogeneity and category completeness, many existing domain generalization methods are not significantly better than ResNet, while StableNet achieves the best results on PACS, VLCS, and NICO.

3. Adversarial domain generalization

An even harder scenario is one in which, for any given class, the dominant source domain differs from the dominant target domain. For example, the dogs in the training data are mostly on grass and the cats mostly indoors, while in the test data the dogs are mostly indoors and the cats mostly on grass. If the model cannot distinguish essential features from domain-related features, it will be misled by domain information into wrong predictions. The table below shows the experimental results on the MNIST-M dataset: StableNet remains significantly better than the other methods, and as the proportion of the dominant domain increases, ResNet's performance drops markedly while StableNet's advantage grows.

Main methods of stable learning

The DGBR algorithm was the first to address stable prediction in the setting of binary predictor variables (features) and a binary response variable. Since then, a series of stable learning methods have been proposed for stable prediction under different settings. These later methods are not limited to the causal inference perspective, but also draw on statistical learning and optimization; this section introduces them in turn.

1. Variable decorrelation based on sample weighting

Cui Peng's team further explored stable prediction under model misspecification (i.e., a mismatch between the model and the data-generating mechanism). Zheyan Shen et al. studied how collinearity among the variables of a linear model affects prediction stability, and proposed a general data pre-processing method that reweights the training samples to remove correlations between predictor variables (features) and thereby reduce the effect of collinearity. Kuang Kun et al. further improved the DGBR algorithm and proposed Decorrelated Weighted Regression (DWR), which combines a variable-decorrelation regularizer with weighted regression and solves stable prediction for continuous predictor variables (features).
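
The sketch below, assuming PyTorch, shows the flavor of such decorrelated weighted regression: sample weights are trained to shrink the weighted covariance between every pair of predictors, and a weighted least-squares fit is then performed under those weights. The optimizer settings and the single-shot (rather than alternating) schedule are simplifications of the published algorithm.

```python
import torch

def learn_decorrelation_weights(X, n_iter=500, lr=0.05):
    """Learn sample weights that suppress pairwise correlations among the columns of X."""
    n, p = X.shape
    raw_w = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([raw_w], lr=lr)
    for _ in range(n_iter):
        w = torch.softmax(raw_w, dim=0)               # positive weights summing to 1
        mean = (w[:, None] * X).sum(dim=0)
        Xc = X - mean
        cov = (w[:, None] * Xc).T @ Xc                # weighted covariance matrix
        off_diag = cov - torch.diag(torch.diag(cov))
        loss = (off_diag ** 2).sum()                  # penalize pairwise correlation
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(raw_w, dim=0).detach()

def weighted_least_squares(X, y, w):
    """Solve argmin_beta sum_i w_i (y_i - x_i beta)^2 via the normal equations."""
    W = torch.diag(w)
    return torch.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```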

Although removing the correlations between all variables is an appealing way to find causal relationships, balance covariates, and achieve stable prediction, it comes at the cost of a greatly reduced effective sample size, which can be disastrous for training machine learning models. Zheyan Shen et al. therefore proposed a variable-decomposition algorithm based on variable clustering, called Differentiated Variable Decorrelation (DVD), which exploits unlabeled data from different environments. The key observation is that preserving the correlations among causal variables does not necessarily make the model unstable in unknown environments. Using the stability of inter-feature correlations between the training data and the unlabeled data as the clustering criterion, the predictor variables (features) are clustered into groups, some of which represent sets of features with a causal effect on the outcome variable; it then suffices to isolate these clusters when balancing confounders. Since the number of clusters is much smaller than the feature dimension, DVD maintains a larger effective sample size than the sample-weighting method DWR.

Under the same settings, the Differentiated Variable Decorrelation method (DVD) keeps a larger effective sample size than the method that indiscriminately removes correlations among all variables (DWR).

2. Stable adversarial learning

Because machine learning algorithms based on empirical risk minimization try to exploit all correlations found in the training data, they are vulnerable to distribution shift. Cui Peng's team proposed the Stable Adversarial Learning (SAL) algorithm to solve this problem in a more principled and unified way. SAL uses heterogeneous data sources to construct a more practical uncertainty set and performs differentiated robust optimization, in which covariates are treated differently according to the stability of their relationship with the target.

Specifically, the method builds on the framework of Wasserstein distributionally robust learning (WDRL). The uncertainty set is characterized anisotropically according to the stability of each covariate across environments, so that unstable covariates receive stronger adversarial perturbations than stable ones. A collaborative algorithm is designed to jointly optimize the differentiation of covariates and the adversarial training of the model parameters.
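
A minimal sketch of the anisotropic adversarial idea, assuming PyTorch: each covariate gets its own perturbation budget, larger for covariates judged unstable. The inner-loop settings and the squared loss are illustrative assumptions, not SAL's exact formulation.

```python
import torch

def anisotropic_adversarial_loss(model, x, y, budgets, steps=3, step_size=0.1):
    """budgets: tensor of shape (n_features,); larger entries allow larger perturbations."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()                                  # ascend on the loss
            delta.copy_(torch.maximum(torch.minimum(delta, budgets), -budgets))  # anisotropic box
    return loss_fn(model(x + delta.detach()), y)
```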

In the experiments, the SAL algorithm was compared with the empirical risk minimization (ERM), Wasserstein distributionally robust learning (WDRL), and invariant risk minimization (IRM) frameworks:

Figure: (a) test performance in each environment; (b) test performance with respect to the uncertainty-set radius; (c) learned coefficients of S and V with respect to the radius.

The experimental results show that SAL treats each covariate anisotropically to achieve more realistic robustness, constructs a better uncertainty set, and delivers consistently better performance across data with different distributions, verifying the effectiveness of the algorithm.

3. Heterogeneous risk minimization

Similarly, machine learning algorithms based on empirical risk minimization typically generalize poorly when they exploit all correlations found in the training data, because those correlations are not stable under distribution shift. Cui Peng's team proposed the Heterogeneous Risk Minimization (HRM) framework, which jointly learns the latent heterogeneity among the data and the invariant relationships, achieving stable prediction under distribution shift.

HRM framework

The overall framework is shown in the figure. It consists of two modules: a front-end Mc for heterogeneity identification and a back-end Mp for invariant prediction. Given heterogeneous data, the heterogeneity identification module Mc first learns a representation ψ(x) of the variant features and uses it to infer the heterogeneous environments εlearn. The out-of-distribution prediction module Mp then uses the learned environments to learn the invariant representation φ(x) and the invariant prediction model F(φ(x)). Afterwards, the variant part ψ(x) is derived again to further strengthen Mc. For the conversion step, this work uses feature selection, so that more variant features can be identified while more invariant features are learned.
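
The following heavily simplified, runnable sketch (NumPy/scikit-learn) conveys only the alternating structure: latent environments are inferred by clustering residuals, and the predictor is refit with weights that emphasize the worse environment. Both the clustering criterion and the reweighting rule are illustrative stand-ins for Mc and Mp, not the HRM algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def simplified_hrm(X, y, n_envs=2, n_rounds=3):
    """Alternate between inferring latent environments (Mc) and refitting a predictor (Mp)."""
    model = LinearRegression().fit(X, y)
    envs = np.zeros(len(X), dtype=int)
    for _ in range(n_rounds):
        # Mc: identify latent heterogeneity from residual patterns.
        resid = (y - model.predict(X)).reshape(-1, 1)
        envs = KMeans(n_clusters=n_envs, n_init=10).fit_predict(resid)
        # Mp: up-weight the worse-performing environment and refit the predictor.
        env_mse = np.array([np.mean(resid[envs == e] ** 2) for e in range(n_envs)])
        w = env_mse[envs] / env_mse.mean()          # larger error -> larger weight
        model = LinearRegression().fit(X, y, sample_weight=w)
    return model, envs
```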

HRM is an optimization framework that jointly learns the latent heterogeneity among data and an invariant predictor. Even under distribution shift, the framework retains good generalization ability.

To verify the effectiveness of the framework, Cui Peng's team compared HRM with the Empirical Risk Minimization (ERM) framework, the Distributionally Robust Optimization (DRO) framework, the Environment Inference for Invariant Learning (EIIL) framework, and the Invariant Risk Minimization (IRM) framework with environment labels εtr.

Experiments show that, compared with the baselines, HRM achieves near-perfect results in both average performance and stability; in particular, the variance of the loss across environments is close to 0. Moreover, HRM does not require environment labels, which confirms that its clustering algorithm can mine the latent heterogeneity within the data.

Experiments were then run on three real-world prediction scenarios: car insurance prediction, income prediction, and house price prediction.

Prediction results for real-world scenarios: (a) training and test accuracy for car insurance prediction (left subplot: training results for the 5 settings; right subplot: the corresponding test results); (b) misclassification rate for income prediction; (c) prediction error for house price prediction.

The experimental results show that HRM consistently achieves the best performance across all tasks and almost all test environments. HRM can effectively reveal and fully exploit the intrinsic heterogeneity of the training data for invariant learning, and it relaxes the requirement for environment labels, opening a new direction for invariant learning. It can cover a wide range of applications such as healthcare, finance, and marketing.

4. Theoretical Explanation of Stable Learning

Covariate-shift generalization is a classic case of out-of-distribution (OOD) generalization: the model must perform well on an unknown test distribution whose gap with the training distribution is reflected in a shift of the covariates. On several learning models, from regression algorithms to deep neural networks, stable learning algorithms have shown some effectiveness for covariate-shift generalization. Cui Peng's team took a step toward a theoretical analysis by interpreting stable learning algorithms as a feature-selection process.

Specifically, they first define a set of variables called the minimal stable variable set: the smallest, optimal set of variables for covariate-shift generalization under common loss functions (including squared loss and binary cross-entropy loss). They then show that, under ideal conditions, stable learning algorithms can identify the variables in this set. These results shed light on why stable learning works for covariate-shift generalization.

The framework of a typical stable learning algorithm is shown in the figure. The algorithm usually consists of two steps: importance sampling and weighted least squares. Under ideal conditions, a stable learning algorithm can identify the minimal stable variable set, the smallest set of variables that provides good predictions under covariate shift.
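
As an illustration of this two-step template (importance sampling, then weighted least squares), the sketch below estimates importance weights with the standard density-ratio "classifier trick" and passes them to a weighted regression; the exact weighting scheme used in the cited work may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def stable_wls(X_train, y_train, X_target):
    """Two-step template: importance weights, then weighted least squares."""
    n, m = len(X_train), len(X_target)
    # Step 1: importance weights via a probabilistic classifier that separates
    # training covariates from target covariates (density-ratio estimation).
    Z = np.vstack([X_train, X_target])
    d = np.concatenate([np.zeros(n), np.ones(m)])
    p = LogisticRegression(max_iter=1000).fit(Z, d).predict_proba(X_train)[:, 1]
    w = (p / (1.0 - p)) * (n / m)                  # approximates q(x) / p(x)
    # Step 2: weighted least squares under those importance weights.
    return LinearRegression().fit(X_train, y_train, sample_weight=w)
```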

The minimal stable variable set is closely related to the Markov boundary, and stable learning helps identify the Markov boundary to a certain extent. Furthermore, if the goal is covariate-shift generalization, the Markov boundary is not necessary, whereas the minimal stable variable set is both sufficient and optimal.

Compared with the Markov boundary, the minimal stable variable set brings two advantages:

① Accurately discovering the Markov boundary hinges on conditional independence tests, which are difficult and unreliable on finite samples; the minimal stable variable set avoids this dependence.

② In several common machine learning tasks, including regression and binary classification, not all variables in the Markov boundary are useful. The minimal stable variable set is shown to be a subset of the Markov boundary that excludes the variables useless for covariate-shift generalization.

Applications of stable learning

1. Stable learning on graphs

1. Learning stable graphs from multiple environments with selection bias

Today, graphs have become a general and powerful representation for describing the rich relationships among different types of entities through the underlying structure they encode. However, the data collection process behind graph construction is fraught with known or unknown sample selection biases, especially in non-stationary and heterogeneous environments, where spurious associations between entities can arise. To learn stable graphs from multiple environments with selection bias, Cui Peng's team designed an unsupervised Stable Graph Learning (SGL) framework for learning stable graphs from collective data. It consists of a GCN (Graph Convolutional Network) module and an E-VAE (element-wise VAE) module for high-dimensional sparse collective data.

The task of stable graph learning is to learn a graph Gs that represents an unbiased connection structure. Because a graph is generated from data collected in a particular environment, selection bias in the collection process introduces spurious correlations between elements and causes the graph to perform poorly in other environments. The SGL framework addresses this problem well and can be decomposed into two steps: graph-based collective-data generation and stable graph learning. The stable graph learning process is illustrated in the figure below.

In the simulation experiments, as shown in the figure, the SGL framework is far more stable in almost all settings; in particular, when the difference between the two environments is more significant, it achieves higher average accuracy than all baseline methods.

Simulation results. Each subplot corresponds to one experiment, and the purple curve shows the performance of the graph Gs generated by the SGL framework.

Correspondingly, in real-world experiments, Cui Peng's team studied stable graph structures in a common practical application: product recommendation.

As the table below shows, the graph Gs generated by the SGL framework balances the correlations in the two environments and achieves the highest average prediction rate most stably.

Purchase Behavior Prediction with Exposure Bias Using Item Embeddings Learned from Item Networks

As shown in the table below, the SGL framework compensates well for the information lost in any single environment and, by learning the essential relationships between products, generates the graph Gs with the best overall performance.

Predicting Purchase Behavior of Different Gender Groups Using Item Embeddings Learned from Item Networks

Data selection bias in graph construction leads biased graph structures to perform poorly in non-i.i.d. scenarios. The proposed SGL framework improves the generalization ability of the learned graph and adapts well to different types of graphs and collected data.

2. Stable Prediction of Graphs with Agnostic Distribution Shifts

Graph Neural Networks (GNNs) have been shown to be effective on various graph tasks when the training and test data are randomly split. In practical applications, however, the distribution of the training graphs may differ from that of the test graphs, and the test distribution is always unknown when the GNN is trained. We therefore face an agnostic distribution shift between training and testing in graph learning, which leads to unstable inference with traditional GNNs across different test environments.

To solve this problem, Kuang Kun's team at Zhejiang University proposed a new stable prediction framework for GNNs that enables locally and globally stable learning and prediction on graphs and reduces the training loss in heterogeneous environments, so that GNNs generalize well. In other words, the framework captures the stable properties of each node and learns node representations and predictions on that basis (local stability), while regularizing the training of the GNN across heterogeneous environments (global stability). The essence of the method is shown in the figure.

Overall structure

It consists of two fundamental components: locally stable learning, which captures the properties of each target node that are stable across environments during representation learning, and globally stable learning, which explicitly balances the different training environments.

In the graph-benchmark experiments, Kuang Kun's team used the OGB-Arxiv dataset and the traditional Citeseer benchmark, building two-layer GCN and GAT models; for a fair comparison, all other methods (including theirs) also use two graph layers. The number of hidden units is 250 for OGB-Arxiv and 64 for Citeseer, and the learning rate is 0.002.

The stable prediction framework gives more stable experimental results. Most GNNs suffer from distribution shift and perform poorly when the test distribution differs greatly from the training distribution (e.g., the right side of panel (a)). Although the stable prediction framework sacrifices some performance in test environments whose distributions are closer to training (e.g., the left side of panel (a)), it obtains a significantly higher Average_Score and a lower Stability_Error.

To demonstrate the effectiveness of the framework in practical applications, Kuang Kun's team collected real-world noisy datasets and ran experiments on the user-item bipartite graph of a recommendation system. The results show that the stable prediction framework achieves significantly more stable results than the other baselines.

Results on real-world recommendation datasets with distribution shifts caused by node attributes 

2. Stable Learning in Deep Neural Networks

Approaches based on deep neural networks achieve impressive performance when the test data and training data share similar distributions, but they can fail otherwise. Removing the effect of distribution shift between training and test data is therefore crucial for building deep models with reliable performance. Cui Peng's team proposed to address this by learning training-sample weights that eliminate dependencies between features, which helps the deep model get rid of spurious associations and focus instead on the true relationship between discriminative features and labels.

Cui Peng's team proposed a method called StableNet. It addresses the distribution-shift problem by globally weighting samples to directly decorrelate all features of every input sample, thereby removing the statistical dependence between relevant and irrelevant features. StableNet is a novel nonlinear feature-decorrelation method based on Random Fourier Features (RFF) with linear computational complexity. It also provides an effective optimization mechanism that perceives and eliminates correlations globally by iteratively saving and reloading model features and weights, which in turn reduces storage and computation when the training data is large. Furthermore, as shown in Figure 16, StableNet can effectively suppress irrelevant features (e.g., water) and rely on the truly relevant features for prediction, resulting in more stable performance in non-stationary, in-the-wild environments.

When the training images for recognizing dogs contain a lot of water, the StableNet model mainly focuses on dogs

The overall architecture of StableNet

To cover more common and challenging cases of distribution shift, Cui Peng's team used four experimental settings: unbalanced, flexible, adversarial, and classic. Under each setting, StableNet outperforms the other methods to varying degrees.

In the ablation studies, the feature dimension is further reduced by randomly selecting different proportions of features for the dependence computation. The figure below shows the experimental results with random Fourier features of different dimensions.

Results of Ablation Study

An intuitive way to interpret an image classification model is to identify the pixels that have the greatest influence on its final decision. To show whether the model focuses on the object or on the context (domain) when making predictions, the gradient of the class score with respect to the input pixels is visualized as a saliency image, shown in the figure.
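
A minimal sketch of this gradient-based saliency computation, assuming PyTorch and any image classifier `model`; preprocessing and normalization are omitted.

```python
import torch

def saliency_map(model, image, target_class):
    """Gradient of the class score w.r.t. the input pixels, reduced to an (H, W) map."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)   # (1, C, H, W)
    score = model(x)[0, target_class]                     # class score before softmax
    score.backward()
    # Maximum absolute gradient over the color channels gives the per-pixel saliency.
    return x.grad.detach().abs().max(dim=1)[0].squeeze(0)
```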

Saliency images for StableNet. The brighter a pixel is, the more it contributes to the prediction

The experimental results show that StableNet can eliminate the statistical dependence between relevant and irrelevant features through sample weighting, effectively removing the irrelevant features and using the truly relevant ones for prediction.

3. Stable learning and fairness

Today, fairness has become an important concern in decision-making systems, and many scholars have proposed different notions of fairness to measure how unfair an algorithm is. Pearl examined the case of gender bias in graduate admissions at Berkeley: the data show that male applicants were admitted at a higher rate overall, but the picture changes once the choice of department is taken into account. Bias caused by department choice should be considered fair, yet the traditional notion of group fairness cannot make this judgment because it ignores department choice. Inspired by this, causality-based notions of fairness emerged: the authors first assume a causal graph over the features and then define the unfair causal effect of a sensitive attribute on the outcome as the measure. However, these notions require very strong assumptions and do not scale. In practice there is often a set of variables, which we call fairness variables, that are pre-decision covariates such as a user's own choices.

Fairness variables do not affect how we evaluate the fairness of a decision-support algorithm. Cui Peng's team therefore defined conditional fairness, a more reasonable fairness measure obtained by conditioning on the fairness variables. By choosing different fairness variables, they showed that traditional fairness notions, such as statistical parity and equal opportunity, are special cases of conditional fairness. They also proposed a Derivable Conditional Fairness Regularizer (DCFR), which can be integrated into any decision model to trade off the accuracy and fairness of algorithmic decisions.
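
For intuition, the sketch below (NumPy) evaluates a conditional version of the parity criterion: the gap between sensitive groups is measured within each stratum of the fairness variable and then averaged. DCFR turns a differentiable surrogate of such a quantity into a training regularizer; this snippet only computes the evaluation metric and assumes every stratum contains both sensitive groups.

```python
import numpy as np

def conditional_parity_gap(y_pred, sensitive, fairness_var):
    """Average positive-rate gap between sensitive groups within each stratum of the
    fairness variable (e.g., chosen department); 0 means conditionally fair."""
    gaps = []
    for f in np.unique(fairness_var):
        m = fairness_var == f
        rate_a = y_pred[m & (sensitive == 1)].mean()
        rate_b = y_pred[m & (sensitive == 0)].mean()
        gaps.append(abs(rate_a - rate_b))
    return float(np.mean(gaps))
```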

DCFR framework

For a fair comparison, the experiments use as baselines fairness-aware algorithms that also rely on adversarial representation learning: UNFAIR, ALFR, CFAIR, and LAFTR with its variants LAFTR-DP and LAFTR-EO.

Accuracy-fairness trade-off curves for three fairness metrics (left to right) on three datasets (top to bottom: income dataset, Dutch census dataset, COMPAS dataset). DCFR is shown in bold.

DCFR clearly has the advantage in the experiments, achieving a better accuracy-fairness trade-off. For the statistical parity and equal opportunity tasks, the degenerate variants of DCFR achieve performance comparable to, and sometimes better than, the state-of-the-art baselines designed specifically for those tasks. In summary, DCFR is effective on real datasets, performs well on the conditional fairness objective, and performs better as the number of fairness variables increases.

4. Stable Learning and Domain Adaptation

The original definition of stable learning does not require any information about the target domain. Domain adaptation, by contrast, makes use of target-domain information; the work here can be understood as extending the original scope of stable learning.

Research has shown that representations learned by deep neural networks can be transferred to other domains where labeled data is insufficient and used for similar prediction tasks. However, as we move to higher layers of the network, the representations become more task-specific and less general. To alleviate this, research on deep domain adaptation forces deep models to learn representations that transfer better across domains, typically by incorporating domain adaptation methods into the deep learning pipeline. Yet correlations are not always transferable. Liu Huan's team at Arizona State University (ASU) proposed a Deep Causal representation learning framework for unsupervised Domain Adaptation (DCDAN) to learn transferable feature representations for prediction in the target domain, as shown in Figure 22. In essence, it simulates a virtual target domain using reweighted samples from the source domain and estimates the causal effect of the features on the outcome.

DCDAN includes a regularization term that learns balancing weights for the source data by balancing the distribution of the feature representations learned from the data; these weights help the model capture the causal effect of features on the target variable rather than mere correlations. The model also includes a weighted loss function for the deep network, where each sample's weight comes from the regularization term; this loss is responsible for learning predictive, domain-invariant features, and a classifier (the causal mechanism) maps the learned representations to the output. Embedding the weight-learning component into the model pipeline and learning the weights jointly with the representations not only benefits from deep models but also yields causal features that are transferable and predictive for the target domain.

Example images and heatmaps generated by DCDAN: (a) an example image from the dataset; (b) the ground-truth causal features of (a), extracted from the VQA-X dataset; (c) the heatmap generated by DCDAN.

To verify the effectiveness of the framework, Liu Huan's team at Arizona State University (ASU) used ResNet-50, DDC, DAN, Deep CORAL, DANN, and HAFN as baselines.

In the experiments, DCDAN outperforms the baselines in many cases, showing that it can perform unsupervised domain adaptation and learn causal representations effectively. This also confirms that causal feature representations help learn transferable features across domains, and that a good trade-off between the causal loss and the classification loss leads to more transferable features.

Research progress on causal-inspired stable learning

1. Cui Peng, Tsinghua University: Some thoughts on out-of-distribution generalization and stable learning

In recent years, the out-of-distribution (OOD) generalization problem has attracted wide interest from researchers in machine learning, computer vision, and related fields. Taking supervised learning as an example, we want to find a model f with parameters θ that minimizes the expected loss between the prediction f_θ(x) and the label y:

min_θ E_{(x, y) ~ P_test} [ L(f_θ(x), y) ]

In principle, the test-time data distribution is unknown. To make this objective tractable, traditional machine learning assumes that the training and test data are independent and identically distributed, which simplifies the problem so that we can search for a function f with parameters θ under the training distribution alone.

However, this simplified setting does not meet the needs of many practical applications, where it is often impossible to guarantee that the test distribution matches the training distribution. The model f_θ learned as above then comes with no theoretical guarantee, and its performance in the real test environment may fall far short of what was observed during training in the lab. For this reason, some researchers have begun to study learning in out-of-distribution scenarios.

Depending on what is known about the test distribution, out-of-distribution learning splits into two technical paths:

(1) Out-of-distribution domain adaptation: part of the test data (the target domain) is known, and domain adaptation / transfer learning techniques are used to adapt a model trained on the training data (the source domain) to the different target distribution.

(2) Out-of-distribution generalization: The test data distribution is completely unknown.

In the traditional i.i.d. learning scenario, model generalization is a form of interpolation; in the out-of-distribution scenario, generalization requires extrapolation.

As shown in the figure above, in the i.i.d. scenario a model with too few parameters underfits the data, while one with too many parameters may overfit. The authors of the paper "Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks" argue that over-parameterized deep networks may generalize well because they directly fit the data points with something akin to a piecewise-linear interpolation.

If we only observe a small part of the whole, we need to extrapolate to the unobserved data. Traditionally this requires carefully designed experiments to infer out-of-distribution situations from a small amount of observed data; in the process we bring in general laws distilled from extensive human experience to make the extrapolation possible.

 

As the saying goes, "respond to all changes with what does not change": invariance is the basis of extrapolation. For example, Newton observed an apple falling from a tree, derived the law of universal gravitation, and could then extrapolate it to the motion of other objects.

In the i.i.d. scenario, since we believe the training and test distributions are the same, our goal is to fit the data, and "correlation" is naturally a good statistical indicator. In the OOD scenario, we instead aim to find "invariance", through two paths:

(1) Causal inference

(2) Find invariance from change

1. Causal inference

Causal inference is a science of invariance. In a classical causal model, we control for X and look for the effect of a change in T on Y. Specifically, with observational data, we use sample reweighting to make the samples with T=1 and T=0 have similar distributions of X; if Y then differs significantly between the two groups, T has a causal effect on Y. In this case the estimated causal effect of T on Y is, on average, invariant to changes in X.

To bring invariance into the learning framework, we examine the effect of multiple input variables on the predictability of the output variable. Under the stable learning framework, we try to find a suitable set of sample weights and then regress the output variable on the input variables after reweighting; the resulting regression coefficients are the ones consistent with the causal relationships. A model trained this way has OOD generalization ability.

2. Find invariance from change

Change and invariance are a unity of opposites. In machine learning, "change" in the data refers to the heterogeneity present in the training data (e.g., different image backgrounds, different types of objects). We cannot define such heterogeneity manually, however, because we cannot guarantee that the data satisfies the invariance constraints along every feature dimension.

A feasible approach is therefore to find invariance from the latent heterogeneity. We assume the environments are unknown but that some heterogeneity exists. We first discover the heterogeneity in the data, then discover invariance based on that heterogeneity, and in turn use the invariance to improve how we learn the changing (heterogeneous) part; this process iterates.

3. Directions for stable learning

Under the stable learning framework, we use data from a heterogeneous distribution to learn a model, hoping for certain performance guarantees when the learned model is applied to a series of unknown datasets. Beyond demonstrating the validity of such models experimentally, we hope to develop a theoretical foundation for them.

2. Zhang Xingxuan, Tsinghua University: StableNet - Deep Stable Learning for Out-of-Distribution Generalization

Let us discuss deep stable learning concretely. Suppose that in my training pictures many dogs are on grass and a smaller number are on other backgrounds. The model can usually give accurate predictions for dogs on grass; for backgrounds it has seen less often it may not always predict correctly, though it can often still give a similar prediction; but when a background appears that it has never seen before, the model is very likely to predict incorrectly. This distribution-shift problem therefore poses a serious challenge to current deep networks.

For current deep learning networks built on the i.i.d. assumption, generalization is poor when the training and test distributions differ. As shown in the figure above, if the training set contains many dogs with grass backgrounds, the network can usually predict accurately when the test image also shows a dog standing on grass; but if the test image has a background that appears rarely or never in the training set, the network is likely to predict poorly. This distribution-shift problem is one of the major challenges facing current deep networks.

The reason for these problems is that what the network learns is likely just correlation in the data. In the figure above, because the training set contains many samples of "dog standing on the grass", a correlation is established between grass features and dog image features, and hence between grass features and the "dog" label; as a result, prediction performance drops on test images with other backgrounds.

To address this, we instead try to extract causal features (e.g., part-whole causal relations). Under the stable learning framework, we focus on the causal features of the object itself rather than on the features of the environment.

As shown in the figure above, the ResNet-18 network (second row) attends not only to the dog's features but also to irrelevant background features, whereas StableNet focuses mainly on the dog itself.

Specifically, we adopt a global balancing method to extract causal features. Treating any feature as the intervention (treatment), we weight the training samples to eliminate the statistical dependence among features, cutting the link between background features and causal features, and ultimately finding more causal features to achieve more stable prediction.

Previous stable learning methods were mainly developed for simpler models (e.g., linear models), where the goal was mainly to remove linear correlations between features. In deep networks, however, the dependencies among features are usually complex and nonlinear. StableNet therefore first maps all features into their random Fourier feature space, lifting features from a lower-dimensional space into a higher-dimensional one; it then removes the linear correlations among the mapped features. In this way both linear and nonlinear correlations between the original features are removed, ensuring strict independence of the features.

Furthermore, earlier global reweighting methods need to operate on all samples at once. In deep learning, the training set is generally very large, so we cannot weight all samples globally. We therefore propose a pre-storage mechanism that stores the features and sample weights the network has already seen and combines them with the current features to reweight in each new training round.

The network architecture of StableNet is shown in the figure above. It has two branches: the lower branch is a standard image classification network, and the upper branch performs sample reweighting after the RFF mapping. Because the two branches are separable, StableNet can be plugged into any deep learning architecture.

Currently, in domain generalization tasks in computer vision, we often assume that the heterogeneity in the training data is significant and that the sample sizes of each domain are comparable. This somewhat limits the validation of OOD generalization methods in the CV domain.

The authors constructed experimental environments with unbalanced numbers of images per domain based on the PACS and VLCS datasets; some domains dominate and induce stronger spurious associations. Under this setting, StableNet has the best generalization performance compared with the baselines.

In a more flexible OOD generalization scenario, different categories of images may be in different domains. In this scenario, StableNet still outperforms all compared baselines.

In the adversarial OOD generalization scenario, the spurious association between domain and label is strong (e.g., most digit-1 images in the training set are green and most digit-2 images are yellow, while at test time the colors are reversed). StableNet outperforms existing methods in almost all experimental settings.

3. Kuang Kun, Zhejiang University: Causal Generalization Through Instrumental Variable Regression

1. Causality and Stable Learning

As mentioned above, existing association-based machine learning algorithms have a certain instability. Researchers therefore proposed the stable prediction/learning framework, which focuses on making accurate and stable predictions on unknown test data.

Existing machine learning algorithms are unstable because they are association-driven while the data contains many biases, which may lead the model to extract non-causal features (spurious associations) and makes the model uninterpretable and unstable. We therefore try to recover the causal relationship between each feature and the label Y, thereby finding the causal features.

In 2018, Cui Peng, Kuang Kun, and others proposed causal regularization, which learns global sample weights that make the variables mutually independent. Applying this technique to models such as logistic regression and shallow deep networks yields a measurable performance improvement. This process of finding causal relationships requires all features to be observable, but sometimes some causal features are not observed.

2. Instrumental variable regression

In causal science, researchers have long handled unobserved variables with instrumental variables. As shown in the figure above, suppose we need to estimate the causal effect of T (the treatment) on Y (the outcome), with U an unobserved variable. An instrumental variable Z must satisfy three conditions: (1) Z is correlated with T; (2) Z is independent of U; (3) Z affects Y only through T.

After finding a suitable instrumental variable Z, we can estimate the causal effect of T on Y by two-stage least squares. In the first stage, we regress T on Z to obtain the fitted value T̂ = E[T | Z]; in the second stage, we regress Y on T̂ to estimate the causal function between T and Y. In the example in the lower-left corner of the figure above, the yellow curve is the result of direct neural-network regression, and the red curve is the result of two-stage least-squares regression with the instrumental variable; the red curve fits the true function better.
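
The sketch below (NumPy/scikit-learn, synthetic data with a true effect of 2.0) shows two-stage least squares in its simplest form and how it corrects the bias that an unobserved confounder introduces into naive regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: Z is a valid instrument, U is an unobserved confounder,
# and the true causal effect of T on Y is 2.0.
rng = np.random.default_rng(0)
n = 20000
U = rng.normal(size=n)                        # unobserved confounder
Z = rng.normal(size=n)                        # instrument: affects T, not Y directly
T = 1.5 * Z + U + rng.normal(size=n)
Y = 2.0 * T + 3.0 * U + rng.normal(size=n)

naive = LinearRegression().fit(T.reshape(-1, 1), Y)          # biased by U
stage1 = LinearRegression().fit(Z.reshape(-1, 1), T)         # stage 1: T on Z
T_hat = stage1.predict(Z.reshape(-1, 1))
stage2 = LinearRegression().fit(T_hat.reshape(-1, 1), Y)     # stage 2: Y on T_hat

print(f"naive OLS estimate: {naive.coef_[0]:.2f}")   # biased upward
print(f"2SLS estimate:      {stage2.coef_[0]:.2f}")  # close to 2.0
```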

The original instrumental-variable regression method relies on strong linearity assumptions. For this reason, researchers have in recent years proposed nonlinear instrumental-variable regression algorithms (for example, DeepIV and KernelIV). In the first stage, we regress T on Z and the observed covariates X to obtain T̂ = E[T | Z, X]; in the second stage, we regress Y on T̂ and X, where the regression function is now nonlinear.

However, in experiments, methods such as DeepIV and KernelIV did not perform as well as expected, because the first-stage regression introduces a confounding bias into the second stage. Here, we consider introducing confounder balancing into instrumental variable regression to address this problem. Specifically, after the first-stage regression, we learn a balanced representation of the observed confounders that is (approximately) independent of the fitted treatment; then, in the second stage, we regress Y on the fitted treatment together with this balanced representation.

When using the original instrumental variable regression method, we usually need to specify an instrumental variable in advance. In the paper "Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition", Kuang Kun et al. propose, given the treatment T, the outcome Y, the observed confounders X, and the unobserved confounders U, to disentangle instrumental-variable representations from the observed confounders X. Although the separated instrumental variable may not have a clear physical meaning, it satisfies the three properties an instrumental variable must satisfy, so the generated instruments can help us estimate the relationship between T and Y. Specifically, mutual information is used to enforce the required conditional (in)dependence between representations, and representation learning realizes the decoupling.
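
The sketch below only illustrates the structure of such a decomposition; AutoIV itself enforces its constraints with mutual-information estimators, whereas here crude cross-correlation penalties stand in for them, and all module and variable names are mine.

```python
import torch
import torch.nn as nn

def corr_penalty(a, b):
    """Squared cross-covariance between two representations
    (a crude stand-in for a mutual-information penalty)."""
    a = a - a.mean(0)
    b = b - b.mean(0)
    cov = a.T @ b / a.shape[0]
    return (cov ** 2).mean()

class IVDecomposer(nn.Module):
    """Split observed covariates X into an instrument-like part z and a
    confounder-like part c."""
    def __init__(self, d_x, d_rep=8):
        super().__init__()
        self.enc_z = nn.Sequential(nn.Linear(d_x, 32), nn.ReLU(), nn.Linear(32, d_rep))
        self.enc_c = nn.Sequential(nn.Linear(d_x, 32), nn.ReLU(), nn.Linear(32, d_rep))
        self.t_from_z = nn.Linear(d_rep, 1)            # z should predict the treatment T
        self.y_from_tc = nn.Linear(d_rep + 1, 1)       # c (together with T) should predict Y

    def loss(self, x, t, y):
        z, c = self.enc_z(x), self.enc_c(x)
        # (1) relevance: z predicts T
        l_rel = ((self.t_from_z(z).squeeze(-1) - t) ** 2).mean()
        # (2) outcome fit through T and c only
        y_hat = self.y_from_tc(torch.cat([c, t.unsqueeze(-1)], dim=-1)).squeeze(-1)
        l_fit = ((y_hat - y) ** 2).mean()
        # (3) exclusion (approximate): z should be unrelated to the leftover of Y
        l_excl = corr_penalty(z, (y - y_hat.detach()).unsqueeze(-1))
        # (4) disentangle z from c
        l_dis = corr_penalty(z, c)
        return l_rel + l_fit + l_excl + l_dis
```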

3. Causal generalization through instrumental variable regression

Instrumental variable regression can be used in tasks such as domain generalization, invariant causal prediction, and causal transfer learning. Taking domain generalization as an example, the task is to predict Y from X given data collected from several different observational environments. We want to learn invariance from multiple data domains (environments) so that the predictive model is robust in all possible environments.

When solving the domain generalization problem through instrumental variable regression, we first describe the data generation process (DGP) in each domain with a causal diagram. For domain m, the generated sample X is affected not only by the domain-invariant features of the object but possibly also by domain-specific features (such as lighting or weather); likewise, when labeling samples, the annotator is influenced not only by the sample's content features but also by domain-specific features.

Here, we assume that all domains share the same invariant features and that the relationship between X and Y is invariant. Viewing the data generation processes of multiple domains together, the samples in a domain n can serve exactly as instrumental variables for the samples in another domain m, satisfying the three conditions of an instrumental variable listed above. Therefore, we can learn the causal function f between X and Y through instrumental variable regression.

In the concrete solution, we first run the instrumental-variable regression to estimate the first-stage quantities, and then use these approximations to learn the invariant function. It is worth noting that when performing domain generalization via instrumental variables, we only need labeled data Y in one domain; the other domains only need unlabeled data X.

4. Liu Jiashuo, Tsinghua University: From heterogeneous data to out-of-distribution generalization

1. Background of out-of-distribution generalization

Empirical risk minimization (ERM) is currently the most commonly used optimization principle: it minimizes the average loss over all data points, giving every sample the same weight 1/N. As shown in the figure above, when the data are heterogeneous, the samples are not evenly distributed across groups. Optimization via ERM may therefore pay more attention to the groups that appear frequently while neglecting the contribution of rare groups to the loss.
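
In symbols, the ERM objective described here (uniform weight 1/N on every sample, with loss ℓ and model f_θ) is:

$$\min_{\theta}\;\frac{1}{N}\sum_{i=1}^{N}\ell\big(f_\theta(x_i),\,y_i\big)$$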

Specifically, in real scenarios the data collected from different sources may be unevenly distributed and exhibit a certain degree of heterogeneity. When the model is optimized with ERM, a high overall accuracy may simply reflect near-perfect predictions on the majority of the dataset, while prediction performance on minority groups is not necessarily good.

As shown in the figure above, when the training data distribution is consistent with the test data distribution, if the ERM algorithm is used for optimization, the generalization performance of the model is theoretically guaranteed. However, if the distribution of the data is skewed, the generalization performance of the model obtained by the ERM algorithm may be poor.

Therefore, we should fully consider the heterogeneity of the data, design a more reasonable risk minimization method, and apply appropriate weights to different sample points, so that the model has better predictive ability for both the majority group and the minority group, thereby improving the model generalization performance.

As shown in the figure above, the OOD generalization problem aims to guarantee the generalization ability of the model under distribution shift, that is, to find, through "min-max" optimization, a set of parameters that keeps the model's performance acceptable even in the worst environment. Under distribution shift, the joint distribution of X and Y differs across the environments in which the data were collected.
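
One common way to write the min-max objective described above (notation mine; e ranges over the environments the model may face, each with its own distribution P^e):

$$\theta^{*}=\arg\min_{\theta}\;\max_{e\in\mathrm{supp}(\mathcal{E})}\;\mathbb{E}_{(X,Y)\sim P^{e}}\big[\ell\big(f_\theta(X),\,Y\big)\big]$$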

2. Heterogeneous risk minimization (HRM)

We attempt to address the OOD generalization problem from the perspective of invariant learning. Here, we assume there exists a feature representation satisfying two assumptions: (1) invariance: across environments, the relationship between this representation and the label Y is stable and unchanged; (2) sufficiency: the label Y can be fully generated from this representation. Under these assumptions, predicting with this representation achieves stable, accurate prediction across environments, and the representation constitutes the invariant features with causal effects.
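
Writing the assumed representation as $\Phi(X)$ (notation mine), the two assumptions can be stated roughly as:

$$\text{Invariance:}\quad P^{e_1}\big(Y\mid\Phi(X)\big)=P^{e_2}\big(Y\mid\Phi(X)\big)\ \ \text{for all environments } e_1, e_2$$

$$\text{Sufficiency:}\quad Y=g\big(\Phi(X)\big)+\varepsilon,\qquad \varepsilon\perp\Phi(X)$$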

Finding such invariant features requires strong constraints on the environments. Many existing invariant learning methods aim to find features with the above properties from multiple explicitly labeled environments. In real situations, however, many datasets are mixtures collected from several different sources, and explicit environment labels that are truly useful for model learning are often unavailable.

To exploit the heterogeneity present in such mixed data, Liu Jiashuo et al. proposed the Heterogeneous Risk Minimization (HRM) framework. We first assume that the data contain a varying component whose relationship with Y differs across environments, alongside the invariant component.

Next, we define the heterogeneous risk minimization problem as follows: given a mixed dataset D with heterogeneity and no environment labels, learn a set of invariant features so that the model has better OOD generalization ability.

As shown in the figure above, the HRM framework consists of two modules:

(1) a heterogeneity identification module, which infers latent environments from the mixed data;

(2) an invariance prediction module, which learns invariant features and the predictor.

Over successive iterations, the two modules reinforce each other.

Specifically, the heterogeneity identification module first learns the unstable (variant) features in the mixed dataset and uses them to partition the data into environments with strong heterogeneity. The invariant learning module then learns invariant features across these inferred environments.

Because the variant and invariant parts of the data are interdependent, the learned invariant features can in turn be transformed to refine the variant features, so the two modules promote each other. To obtain cleaner theoretical guarantees, the paper focuses on relatively simple data, where the variant and invariant feature sets are obtained through the simple feature-selection procedure shown in the figure above.
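
As an illustration of this alternating structure (not the authors' HRM implementation), the sketch below infers environments by clustering the residuals of a regression on the currently "variant" features, and then trains across those inferred environments with a variance-of-risks penalty, a V-REx-style stand-in for HRM's invariant-learning objective. X and y are assumed to be plain float tensors for tabular data, and variant_idx is the current guess of the variant feature indices.

```python
import torch
from sklearn.cluster import KMeans

def infer_environments(X, y, variant_idx, n_envs=2):
    """Heterogeneity identification (simplified): regress y on the currently
    'variant' features and cluster the residuals, so samples whose variant
    features relate to y differently end up in different environments."""
    Xv = X[:, variant_idx]
    beta = torch.linalg.lstsq(Xv, y.unsqueeze(-1)).solution
    resid = (y.unsqueeze(-1) - Xv @ beta).numpy()
    return KMeans(n_clusters=n_envs, n_init=10).fit_predict(resid)

def invariant_risk(model, X, y, env_ids, penalty_weight=10.0):
    """Invariance prediction (simplified): mean per-environment risk plus the
    variance of those risks, pushing the model to perform evenly across the
    inferred environments."""
    risks = []
    for e in set(env_ids.tolist()):
        mask = torch.as_tensor(env_ids == e)
        pred = model(X[mask]).squeeze(-1)
        risks.append(((pred - y[mask]) ** 2).mean())
    risks = torch.stack(risks)
    return risks.mean() + penalty_weight * risks.var()

# Alternating loop (sketch): infer environments from the variant features,
# minimize invariant_risk across them, use the result to re-estimate which
# features are variant, and repeat until the partition stabilizes.
```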

3. Kernel heterogeneous risk minimization (KerHRM)

The HRM algorithm cannot handle complex data (e.g., images or text). With KerHRM, Liu Jiashuo et al. extended HRM to these more complex data types.

Building on the HRM algorithm, Liu Jiashuo et al. introduced the Neural Tangent Kernel (NTK) into KerHRM. According to NTK theory, training a neural network (e.g., an MLP) is approximately equivalent to linear regression in a complicated feature space.

As in formula (5) in the figure above, assume the neural network has parameters w and input data X. Taking a first-order Taylor expansion of the network output with respect to the parameters, the network's effect is equivalent to a linear operation on the gradient features. Therefore, with NTK techniques we can convert a complex neural-network computation into linear regression on neural tangent features.
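
The first-order expansion referred to here, taken around the initialization $w_0$, is:

$$f(x;w)\;\approx\;f(x;w_0)+\nabla_w f(x;w_0)^{\top}(w-w_0)$$

so, as far as training is concerned, the network acts like a linear model on the fixed gradient features $\nabla_w f(x;w_0)$.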

Through the above methods, we can apply HRM to more complex data while retaining the characteristics of the HRM framework. KerHRM distinguishes stable and unstable parts of the data by constructing a set of orthogonal kernels.

4. Simulation experiment: Colored MNIST

The authors use the same experimental setting as the paper "Invariant Risk Minimization" to test KerHRM on the Colored MNIST dataset. In this setting, digits 0-4 are relabeled as class "0" and digits 5-9 as class "1", turning the ten-way problem into binary classification. Most images of class "0" are then dyed one color and most images of class "1" another color, constructing a spurious correlation between the digit label and the color. At test time, the coloring is flipped, at which point the performance of conventional machine learning models typically drops sharply.
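
A minimal sketch of this kind of coloring scheme (the correlation strength, the two-channel coloring, and the absence of label noise are illustrative simplifications of the setup described in the paper):

```python
import torch
from torchvision import datasets

def make_colored_mnist(train=True, flip_color=False, p_corr=0.9, root="./data"):
    """Binarize MNIST labels (digits 0-4 -> class 0, digits 5-9 -> class 1) and
    color each digit so that color is spuriously correlated with the binary
    label; setting flip_color=True reverses the correlation, as at test time."""
    ds = datasets.MNIST(root, train=train, download=True)
    x = ds.data.float() / 255.0                    # (N, 28, 28) grayscale digits
    y = (ds.targets >= 5).long()                   # binary label
    match = torch.rand(len(y)) < p_corr            # with prob p_corr, color matches label
    color = torch.where(match, y, 1 - y)
    if flip_color:
        color = 1 - color
    imgs = torch.zeros(len(y), 2, 28, 28)          # two color channels
    imgs[torch.arange(len(y)), color] = x          # write the digit into its channel
    return imgs, y
```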

The experimental results are shown in the figure above. As the number of iterations increases, the heterogeneity of the environments learned by KerHRM gradually increases, test accuracy increases with it, and the gap between training and test accuracy narrows. OOD generalization performance is therefore positively correlated with the heterogeneity of the constructed environments; heterogeneity matters a great deal for OOD generalization, and the quality (heterogeneity) of environment labels strongly affects generalization performance.

5. He Yue, Tsinghua University: Out-of-distribution generalization image data set - NICO

1. Non-independent and identically distributed image classification

Image classification is one of the most fundamental and important tasks in computer vision. Under the traditional independent and identically distributed (IID) assumption, existing deep learning models can already achieve good test-time prediction performance by minimizing the empirical loss on the training set. However, datasets collected in the real world rarely satisfy the IID assumption, and it is almost impossible for the training set to cover all data distributions present in the test samples. In that case, if we still optimize the model by minimizing its empirical loss on the training set, its performance at test time often degrades severely.

As shown in the figure above, the backgrounds of cats and dogs in the training data and test data are very different, and the assumption of independent and identical distribution is not satisfied. The deep learning model may mistake the background as the standard for classifying pictures. Humans naturally have a strong generalization ability for such classification problems, and a good classification model should also be insensitive to changes in this background distribution.

We refer to this problem as a non-IID image classification problem, where the data distributions in the training and test sets are different. This type of problem contains two subtasks:

(1) Targeted Non-IID image classification: part of the information about the test set is known. We can use methods such as transfer learning to adapt the currently trained model to the data distribution of the target domain and achieve better prediction performance.

(2) General Non-IID image classification: Using mechanisms such as invariance, the learned model can be generalized to any unknown data distribution with a high accuracy rate.

In fact, the learning problem in non-IID scenarios is very important for computer vision tasks. In scenarios such as autonomous driving and automatic rescue, we hope that the model can quickly identify uncommon but very dangerous situations.

2. Measuring the difference in data distribution

To characterize the difference between distributions, we define a metric called NI. To compute NI, we use a pretrained general-purpose vision model to extract image features, compute the first-order moment distance between the two distributions at the feature level, and normalize by the variance of the distribution. Extensive experiments show that NI is a robust measure of the difference between image distributions. Moreover, under limited sampling, data distribution bias is ubiquitous, and as the distribution bias grows stronger, the error rate of the classification model increases.
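
The description above suggests a normalized first-moment distance of roughly the following form (notation mine; the exact definition in the NICO paper may differ), where $\phi$ is the pretrained feature extractor, $\mu_\phi$ the mean feature vector, and $\sigma_\phi$ a variance-based normalizer:

$$\mathrm{NI}(D_1,D_2)\;=\;\frac{\big\lVert\,\mu_\phi(D_1)-\mu_\phi(D_2)\,\big\rVert_2}{\sigma_\phi(D_1\cup D_2)}$$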

In fact, distribution shift is widespread in benchmark datasets such as PASCAL VOC, ImageNet, and MSCOCO. Taking ImageNet as an example, we first selected 10 common animal classes, then selected different subclasses of each class to form three datasets A, B, and C.

Next, we collected a fixed set of test samples. Measuring NI, we found that distribution deviations do exist between the different datasets, but they are weak, uncontrollable, and essentially random in magnitude. To promote research on OOD generalization in vision, we constructed NICO, a visual dataset with pronounced and adjustable distribution deviations.

3. NICO data set

First, we consider decomposing visual concepts of subject and context from images. As shown in the figure above, the subject may be a cat or a dog, and the context may be concepts such as the subject's posture, background, and color. By combining different subjects and contexts in training and testing, we can form differences in data distributions.

The concept of context comes from the real world: contexts can be described from many angles and then used to construct biased data distributions. When a context-subject combination is meaningful, it is easy to collect enough images for it.

The currently public NICO dataset has the hierarchical structure shown in the figure above: the two superclasses, Animal and Vehicle, contain 9-10 subject classes each, and each subject class has its own set of context concepts. We want the contexts to be as diverse as possible, the subject-context combinations to be meaningful, and the contexts to overlap to some extent. In addition, we require the number of samples for each subject-context combination to be as balanced as possible, and the variance across contexts to be as large as possible.

Compared with the classical independent and identically distributed data set, the image classification task on the NICO data set is more challenging because NICO introduces the concept of context, and the images are non-centralized and irregular.

With limited samples, sampling inevitably introduces some degree of distribution deviation, caused by the nature of the images themselves and by differences in sampling scale. In the NICO dataset, we simulate an approximately IID scenario by random sampling. Compared with ImageNet, NICO's decentralized objects and explicit contexts do make its recognition task harder.

4. OOD generalization - proportional bias

When there is a "proportional bias" between the test and training distributions, both the training set and the test set contain all contexts of every class, but different contexts are chosen as the dominant context (the context accounting for a relatively high share of the collected images) in training and in testing. Setting different dominant contexts for training and testing naturally produces a difference in data distribution.

Here we also define the "dominant ratio", the ratio of the number of samples in the dominant context to the number of samples in all other contexts. As shown in the figure above, as the dominant ratio increases, the distribution difference between training and test data grows, and its impact on model accuracy grows with it.
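
A small hypothetical helper showing how a proportional-bias split with a given dominant ratio could be assembled (function and variable names are mine, not part of the NICO toolkit):

```python
import random
from collections import defaultdict

def proportional_bias_split(samples, dominant_context, dominant_ratio, n_total, seed=0):
    """samples: list of (image, class_label, context_label) tuples for one class.
    Returns a subset in which the number of dominant-context samples divided by
    the number of other-context samples is roughly dominant_ratio, while every
    context still appears."""
    rng = random.Random(seed)
    by_ctx = defaultdict(list)
    for s in samples:
        by_ctx[s[2]].append(s)
    n_dom = round(n_total * dominant_ratio / (dominant_ratio + 1))
    others = [c for c in by_ctx if c != dominant_context]
    split = rng.sample(by_ctx[dominant_context], min(n_dom, len(by_ctx[dominant_context])))
    per_ctx = max(1, (n_total - n_dom) // max(1, len(others)))
    for c in others:
        split += rng.sample(by_ctx[c], min(per_ctx, len(by_ctx[c])))
    rng.shuffle(split)
    return split
```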

5. OOD generalization - compositional bias

"Composition bias" simulates the spatio-temporal constraints in our sampling of training data and test data. Under this setting, the training set does not contain all categories of context, and some contexts in the test set have not been seen in training. As the types of context contained in the training set decrease, the difference between the data distribution of the test set and the training set increases, and the effect of model learning becomes worse and worse.

To obtain an even larger distribution bias, we can combine compositional bias with proportional bias: among the contexts visible in the training set, we additionally make certain contexts dominant. That is, we control the degree of distribution deviation by simultaneously adjusting the number of visible contexts and the dominant ratio, and then observe how the model performs under different degrees of bias.

6. OOD generalization - adversarial bias

In the "adversarial bias" setting, we take the samples of certain classes as the positive class and the samples of the other classes as the negative class. We then arrange for a particular context to appear only in the positive class in the training set and only in the negative class in the test set. The model then incorrectly associates that context with the positive class, which hurts test-time performance. We call this the confusing context; as its proportion increases, the model's learning of the positive class becomes increasingly dominated by the spurious association.

Blue Ocean Brain Deep Learning Solution

Machine learning models have been successful in many Internet-oriented scenarios. In applications such as click prediction or image classification, the cost of a wrong decision is low, so practitioners optimize AI technology in a "performance-driven" way: they care only about the performance the model achieves on the target task and pay little attention to the risk of technical failure. When the task environment changes and predictions go wrong, the black-box model is simply retrained and updated frequently to keep prediction performance up.

However, in areas closely tied to social life, such as healthcare, industrial manufacturing, finance, and justice, the consequences of wrong predictions by machine learning models are often unacceptable; these are therefore called risk-sensitive scenarios. Because of data-acquisition difficulties and ethical constraints, retraining a model in such scenarios every time the environment changes can be very expensive, so properties beyond short-term predictive performance also matter. To bring machine learning models into more risk-sensitive scenarios, we need to carefully analyze the technical risks they face and take measures to overcome them.

Blue Ocean Brain offers a liquid-cooled stable learning solution for deep learning, machine learning, and causal learning researchers, AI developers, and data scientists. Through integrated delivery of software and hardware, it provides end-to-end capabilities covering data labeling, model generation, model training, and model inference service deployment, lowering the technical threshold for using AI so that customers can focus on the business itself and develop and launch AI services quickly.

The solution provides a one-stop deep learning platform service with a large number of built-in, optimized network models and algorithms, helping users adopt deep learning conveniently and efficiently and offering model training, evaluation, and prediction through flexible scheduling and on-demand services.

1. Advantages

1. Better energy saving

The energy consumption of the computer-room air-conditioning system is reduced by 70%; server fan power consumption is reduced by 70%-80%; the liquid cooling system supports free cooling all year round with PUE < 1.1, and the overall air-liquid hybrid cooling system achieves PUE < 1.2.

2. Higher device reliability

At full load, the CPU core temperature is about 40-50°C, roughly 30°C lower than with air cooling; the server system temperature is about 20°C lower than with air cooling.

3. Better performance

The operating temperatures of the CPU and memory drop substantially, enabling overclocking and raising computing-cluster performance by about 5%.

4. Lower noise

The water-circulation noise of the liquid-cooled part is extremely low, and the fan speed of the air-cooled part is reduced, cutting noise by about 30 dB; full-load operation stays below 60 dB.

5. Higher power density

The power density of a single cabinet can reach more than 25kW, which is greatly improved compared with the air-cooled heat dissipation method

2. Liquid-cooled server architecture

The hyper-converged architecture serves as both the computing resource pool and the distributed storage resource pool, greatly simplifying the data center infrastructure. Through software-defined compute virtualization and a distributed storage architecture, it achieves no single point of failure, no single-point bottleneck, elastic expansion, and linear performance growth. A simple, unified management interface provides unified monitoring, management, and operation and maintenance of the data center's compute, storage, network, and virtualization resources.

The computing and storage resource pools formed by the hyper-converged infrastructure can be deployed directly by cloud computing platforms, serving IaaS, PaaS, and SaaS platforms such as OpenStack, EDP, Docker, Hadoop, and HPC, and supporting upper-layer application systems and application clusters. Meanwhile, the distributed storage architecture simplifies disaster recovery, enabling intra-city active-active data and remote disaster recovery. The existing hyper-converged infrastructure can also be extended to the public cloud, so private cloud services can be migrated to public cloud services with ease.

3. Customer benefits

1. Save energy

In the original data center, power costs account for the largest share of total cost of ownership (TCO). The solution realizes on-demand power supply and cooling for IT equipment, so that the capacity of the power and cooling systems better matches the load, improving efficiency and reducing over-provisioning.

2. Operation and maintenance supervision

It helps customers achieve multi-level, fine-grained energy consumption management for the data center, locating sources of excess consumption through various reports to save energy and reduce consumption. Asset management helps users formulate maintenance plans, provides proactive early warning, dynamically adjusts maintenance plans, and outputs optimization plans based on actual conditions, building a complete asset management capability.
