Causal Inference 6--Multi-task Learning (Personal Notes)

Table of contents

1 Multi-task learning

1.1 Problem description

1.2 Dataset

1.3 Network structure

1.4 Results

1.5 Multitasking code

2 Causal inference using a multi-task approach

2.1 DRNet

2.2 DragonNet

2.3 Deep counterfactual networks with propensity-dropout

2.4 VCNet

3 Thoughts


1 Multi-task learning

Reference code: Keras-mmoe/census_income_demo.py (Drawbridge/Keras-mmoe on GitHub)

Reference article: Recommendation System (16), Multi-task Learning: Google MMoE Principles and Practice

1.1 Problem description

Deep neural networks have been adopted more and more widely in recent years, for example in recommender systems. A recommender system usually has to optimize several objectives at once: in movie recommendation, for instance, it must predict not only whether the user will buy a movie but also how the user will rate it. Multi-task learning models have therefore become an active research topic.

1.2 Dataset

1.3 Network structure

1.4 Results

1.5 Multitasking code

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop

# Two inputs (order features, text features) feed three output heads.
model = Model(inputs=[inputOrdInfo, inputTextInfo],
              outputs=[output, outputTextInfo, outputOrdInfo])

# Alternatives tried: Adam(learning_rate=0.005), Adam with
# ExponentialDecay(initial_learning_rate=0.0015, decay_steps=100, decay_rate=0.95),
# and SGD(learning_rate=0.001).
rmsprop = RMSprop(learning_rate=0.005)  # `lr` is the deprecated spelling

# One weight per head, in the order of `outputs` (0.2, 0.8, 0.001 was also tried).
model.compile(optimizer=rmsprop, loss='binary_crossentropy',
              loss_weights=[0.19, 0.8, 0.01])

# Both callbacks watch the validation loss of the main head
# (the metric name implies that this head's layer is named `output`).
checkpoint = ModelCheckpoint('./bestModel.h5', monitor='val_output_loss',
                             save_best_only=True, save_weights_only=True, mode='min')
earlystopping = EarlyStopping(monitor='val_output_loss', min_delta=0,
                              patience=2, mode='min')

# All three heads are trained against the same label. `checkpoint` is defined
# above, but only early stopping is passed to fit().
model.fit([X_train_order, X_train_text], [y_train, y_train, y_train],
          batch_size=64, epochs=15,
          validation_data=([X_val_ord_info, X_val_text], [y_val, y_val, y_val]),
          shuffle=True,
          callbacks=[earlystopping],
          sample_weight=[W_train, W_train, W_train])
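
To make the role of loss_weights concrete, here is a minimal pure-Python sketch of the quantity Keras minimizes for a multi-output model: the weighted sum of the per-head losses. The helper names (binary_crossentropy, weighted_multitask_loss) and the toy predictions are made up for illustration, and sample_weight is left out to keep the sketch short.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of scalar labels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def weighted_multitask_loss(y_true, head_preds, loss_weights):
    """Weighted sum of per-head losses; every head is scored on the same labels."""
    per_head = [binary_crossentropy(y_true, preds) for preds in head_preds]
    total = sum(w * l for w, l in zip(loss_weights, per_head))
    return total, per_head


# Toy batch: one label vector, three heads (illustrative numbers only).
y = [1, 0, 1, 1]
preds = [
    [0.9, 0.2, 0.8, 0.7],  # main head
    [0.6, 0.4, 0.7, 0.5],  # text head
    [0.5, 0.5, 0.5, 0.5],  # order head
]
total, per_head = weighted_multitask_loss(y, preds, [0.19, 0.8, 0.01])
print(total, per_head)
```

With weights of 0.19, 0.8 and 0.01, the text head dominates the gradient, while the order head contributes almost nothing to the total.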

2 Causal inference using a multi-task approach

This section looks at methods that use multi-task learning to estimate causal effects. It draws mainly on the multi-task architectures studied for recommendation systems and adds some supplementary notes.

2.1 DRNet

Learning Counterfactual Representations for Estimating Individual Dose-Response Curves

  1. The parameters of the L1 base layers are trained on all samples, whereas the parameters of each L2 treatment layer are trained only on the samples that received the corresponding treatment.
  2. It can be applied to more complex intervention scenarios that combine a discrete intervention with a continuous one (such as a dosage); each combination is learned by its own head network.
  3. An easy-to-understand example: we want to test the effects of different drugs on different patients. t = 0, ..., k-1 index the patient groups: t = 0 is the normal group, and t = 1, ..., k-1 are the diabetic group, the hypertensive group, and other patient groups. The drug dosage level m takes the values a, b, c, for low, medium, and high dose, and a separate head network is learned for each combination of t and m. Each treatment layer is further subdivided into E head layers (the original figure shows only the E = 3 head layers for treatment t = 0).
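
The routing described in the list above can be sketched as follows. This is only an illustration under the assumption that each treatment's dosage range is cut into E equal-width strata, one per head; the function names and the example dosage ranges are hypothetical.

```python
def head_index(dose, dose_min, dose_max, num_heads):
    """Map a dose in [dose_min, dose_max] to one of num_heads equal-width strata."""
    if not dose_min <= dose <= dose_max:
        raise ValueError("dose outside the valid range for this treatment")
    width = (dose_max - dose_min) / num_heads
    # The upper bound of the range belongs to the last stratum.
    return min(int((dose - dose_min) / width), num_heads - 1)


def route(t, dose, dose_ranges, num_heads=3):
    """Return the (treatment layer, head) pair that handles the sample (t, dose)."""
    low, high = dose_ranges[t]
    return t, head_index(dose, low, high, num_heads)


# Hypothetical dosage ranges for two treatments.
dose_ranges = {0: (0.0, 1.0), 1: (0.0, 30.0)}

print(route(0, 0.1, dose_ranges))   # low dose of treatment 0
print(route(1, 15.0, dose_ranges))  # medium dose of treatment 1
```

During training, a sample would update only the shared base layers, its own treatment layer, and the single head returned by route.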

2.2 DragonNet

Adapting Neural Networks for the Estimation of Treatment Effects

  • DragonNet (learns non-linear relationships): a two-stage method that first learns a representation model and then learns the inference model on top of it.

If the propensity-score head is removed, the network structure is the same as that of TARNet, which the paper later uses as a comparison baseline. The propensity-score term of the loss makes the network automatically down-weight features that are poorly correlated with g(x), which helps feature selection. The paper then introduces targeted regularization to further improve the loss.
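
As a rough sketch of how these loss terms fit together, the function below adds a squared-error outcome loss, a cross-entropy propensity loss weighted by alpha, and a targeted-regularization term weighted by beta. The name dragonnet_loss, the argument layout, and the default weights are assumptions for illustration. In the paper epsilon is a trainable parameter, whereas here it is a plain argument.

```python
import math


def dragonnet_loss(y, t, q0, q1, g, epsilon=0.0, alpha=1.0, beta=1.0):
    """Mean per-sample loss: outcome + alpha * propensity + beta * targeted term.

    y: observed outcomes, t: treatments (0 or 1),
    q0, q1: predicted outcomes under control / treatment, g: predicted propensities.
    """
    total = 0.0
    for yi, ti, q0i, q1i, gi in zip(y, t, q0, q1, g):
        gi = min(max(gi, 1e-6), 1.0 - 1e-6)  # keep the log and the division finite
        q_t = q1i if ti == 1 else q0i        # prediction for the observed arm
        outcome = (q_t - yi) ** 2
        propensity = -(ti * math.log(gi) + (1 - ti) * math.log(1 - gi))
        # Targeted regularization: perturb the prediction along t/g - (1-t)/(1-g).
        h = ti / gi - (1 - ti) / (1 - gi)
        targeted = (yi - (q_t + epsilon * h)) ** 2
        total += outcome + alpha * propensity + beta * targeted
    return total / len(y)


# Toy check: perfect outcome predictions, uninformative propensity of 0.5.
y, t = [1.0, 0.0], [1, 0]
q0, q1, g = [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]
base = dragonnet_loss(y, t, q0, q1, g)
perturbed = dragonnet_loss(y, t, q0, q1, g, epsilon=0.1)
print(base, perturbed)
```

Setting alpha = 0 and beta = 0 leaves only the outcome loss, which corresponds to the TARNet-style objective mentioned above.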

2.3 Deep counterfactual networks with propensity-dropout

Abstract: We propose a new method for inferring individualized causal effects of treatments (interventions) from observational data. Our approach conceptualizes causal inference as a multitask learning problem; we use a deep multitask network with a set of shared layers between factual and counterfactual outcomes, and a set of outcome-specific layers. The effect of selection bias in the observational data is mitigated by a propensity-dropout regularization scheme, where the network thins out each training example with a dropout probability that depends on the associated propensity score. The network is trained in alternating stages; in each stage we use training examples from one of the two potential outcomes (treated and control populations) to update the weights of the shared layers and the respective outcome-specific layers. Experiments based on data from real-world observational studies demonstrate that our algorithm outperforms state-of-the-art algorithms.

Code: GitHub, Shantanu48114860/Deep-Counterfactual-Networks-with-Propensity-Dropout: a PyTorch implementation of the paper "Deep Counterfactual Networks with Propensity-Dropout" (https://arxiv.org/pdf/1706.05966.pdf)

  1. The model adopts the idea of multi-objective modeling and puts the Treatment group and Control group samples in the same model to reduce model redundancy.
  2. The left part is a multi-objective framework. The samples of the Treatment group and the Control group pass through a shared layer and then through their own independent network layers, which learn the Treatment model and the Control model.
  3. The Propensity Network on the right mainly controls the complexity of the left model through a dropout rate derived from the propensity score. If the treated and control samples are well separated, the dropout makes the left model simpler; if they are not well separated, the left model is allowed to be more complex.
  4. During training, the samples of the Treatment group and the Control group are trained separately. On odd-numbered iterations the Treatment group samples are trained; on even-numbered iterations the Control group samples are trained. (The code below counts epochs from 0, so epochs 0, 2, 4, ... train the Treatment group.)
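
Points 3 and 4 can be sketched in a few lines. The dropout rule below, 1 - gamma/2 - H(p)/2 with H the entropy of the propensity score, is how I read the paper. Using base-2 entropy is my own assumption, chosen so that a perfectly balanced sample (p = 0.5) gets zero dropout when gamma = 1. The function names are illustrative.

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def dropout_probability(propensity, gamma=1.0):
    """Higher dropout for samples whose treatment assignment is nearly certain."""
    return 1.0 - gamma / 2.0 - 0.5 * binary_entropy(propensity)


def group_for_epoch(epoch):
    """Alternating schedule with zero-based epochs: even -> treated, odd -> control."""
    return "treated" if epoch % 2 == 0 else "control"


for p in (0.5, 0.9, 1.0):
    print(p, round(dropout_probability(p), 3))
print([group_for_epoch(e) for e in range(4)])
```

A propensity near 0 or 1 means the sample sits in a region with strong selection bias, so the network is thinned more aggressively for it.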

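Network:

If a parameter has requires_grad=False but is still registered in the optimizer, no gradient is computed for it, the optimizer skips it, and the program does not raise an error. One caveat: the optimizer only skips parameters whose .grad is None. If a stale gradient tensor is left in place (for example, zero_grad(set_to_none=False) zeroes it instead of clearing it), optimizers with momentum such as Adam can still move the frozen parameter.

network.hidden1_Y1.weight.requires_grad = False
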
import torch
import torch.nn as nn
import torch.optim as optim

from DCN import DCN


class DCN_network:
    def train(self, train_parameters, device):
        epochs = train_parameters["epochs"]
        treated_batch_size = train_parameters["treated_batch_size"]
        control_batch_size = train_parameters["control_batch_size"]
        lr = train_parameters["lr"]
        shuffle = train_parameters["shuffle"]
        model_save_path = train_parameters["model_save_path"].format(epochs, lr)
        treated_set = train_parameters["treated_set"]
        control_set = train_parameters["control_set"]

        print("Saved model path: {0}".format(model_save_path))

        treated_data_loader = torch.utils.data.DataLoader(treated_set,
                                                          batch_size=treated_batch_size,
                                                          shuffle=shuffle,
                                                          num_workers=1)

        control_data_loader = torch.utils.data.DataLoader(control_set,
                                                          batch_size=control_batch_size,
                                                          shuffle=shuffle,
                                                          num_workers=1)
        network = DCN(training_flag=True).to(device)
        optimizer = optim.Adam(network.parameters(), lr=lr)
        lossF = nn.MSELoss()
        min_loss = 100000.0
        dataset_loss = 0.0
        print(".. Training started ..")
        print(device)
        for epoch in range(epochs):
            network.train()
            total_loss = 0
            train_set_size = 0

            if epoch % 2 == 0:
                dataset_loss = 0
                # train treated
                network.hidden1_Y1.weight.requires_grad = True
                network.hidden1_Y1.bias.requires_grad = True
                network.hidden2_Y1.weight.requires_grad = True
                network.hidden2_Y1.bias.requires_grad = True
                network.out_Y1.weight.requires_grad = True
                network.out_Y1.bias.requires_grad = True

                network.hidden1_Y0.weight.requires_grad = False
                network.hidden1_Y0.bias.requires_grad = False
                network.hidden2_Y0.weight.requires_grad = False
                network.hidden2_Y0.bias.requires_grad = False
                network.out_Y0.weight.requires_grad = False
                network.out_Y0.bias.requires_grad = False

                for batch in treated_data_loader:
                    covariates_X, ps_score, y_f, y_cf = batch
                    covariates_X = covariates_X.to(device)
                    ps_score = ps_score.squeeze().to(device)

                    train_set_size += covariates_X.size(0)
                    treatment_pred = network(covariates_X, ps_score)
                    # treatment_pred[0] -> y1
                    # treatment_pred[1] -> y0
                    predicted_ITE = treatment_pred[0] - treatment_pred[1]
                    true_ITE = y_f - y_cf
                    # Move the target to `device`; the prediction is already there,
                    # so no separate CUDA branch is needed.
                    loss = lossF(predicted_ITE.float(),
                                 true_ITE.float().to(device))
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    total_loss += loss.item()
                dataset_loss = total_loss

            elif epoch % 2 == 1:
                # train controlled
                network.hidden1_Y1.weight.requires_grad = False
                network.hidden1_Y1.bias.requires_grad = False
                network.hidden2_Y1.weight.requires_grad = False
                network.hidden2_Y1.bias.requires_grad = False
                network.out_Y1.weight.requires_grad = False
                network.out_Y1.bias.requires_grad = False

                network.hidden1_Y0.weight.requires_grad = True
                network.hidden1_Y0.bias.requires_grad = True
                network.hidden2_Y0.weight.requires_grad = True
                network.hidden2_Y0.bias.requires_grad = True
                network.out_Y0.weight.requires_grad = True
                network.out_Y0.bias.requires_grad = True

                for batch in control_data_loader:
                    covariates_X, ps_score, y_f, y_cf = batch
                    covariates_X = covariates_X.to(device)
                    ps_score = ps_score.squeeze().to(device)

                    train_set_size += covariates_X.size(0)
                    treatment_pred = network(covariates_X, ps_score)
                    # treatment_pred[0] -> y1
                    # treatment_pred[1] -> y0
                    predicted_ITE = treatment_pred[0] - treatment_pred[1]
                    true_ITE = y_cf - y_f
                    # Move the target to `device`; the prediction is already there,
                    # so no separate CUDA branch is needed.
                    loss = lossF(predicted_ITE.float(),
                                 true_ITE.float().to(device))
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    total_loss += loss.item()
                dataset_loss = dataset_loss + total_loss

            print("epoch: {0}, train_set_size: {1} loss: {2}".
                  format(epoch, train_set_size, total_loss))

            if epoch % 2 == 1:
                print("Treated + Control loss: {0}".format(dataset_loss))
                # if dataset_loss < min_loss:
                #     print("Current loss: {0}, over previous: {1}, Saving model".
                #           format(dataset_loss, min_loss))
                #     min_loss = dataset_loss
                #     torch.save(network.state_dict(), model_save_path)

        torch.save(network.state_dict(), model_save_path)

    @staticmethod
    def eval(eval_parameters, device):
        print(".. Evaluation started ..")
        treated_set = eval_parameters["treated_set"]
        control_set = eval_parameters["control_set"]
        model_path = eval_parameters["model_save_path"]
        network = DCN(training_flag=False).to(device)
        network.load_state_dict(torch.load(model_path, map_location=device))
        network.eval()
        treated_data_loader = torch.utils.data.DataLoader(treated_set,
                                                          shuffle=False, num_workers=1)
        control_data_loader = torch.utils.data.DataLoader(control_set,
                                                          shuffle=False, num_workers=1)

        err_treated_list = []
        err_control_list = []

        for batch in treated_data_loader:
            covariates_X, ps_score, y_f, y_cf = batch
            covariates_X = covariates_X.to(device)
            ps_score = ps_score.squeeze().to(device)
            treatment_pred = network(covariates_X, ps_score)

            predicted_ITE = treatment_pred[0] - treatment_pred[1]
            true_ITE = y_f - y_cf
            # Move the target to `device` so both operands are on the same device.
            diff = true_ITE.float().to(device) - predicted_ITE.float()

            err_treated_list.append(diff.item())

        for batch in control_data_loader:
            covariates_X, ps_score, y_f, y_cf = batch
            covariates_X = covariates_X.to(device)
            ps_score = ps_score.squeeze().to(device)
            treatment_pred = network(covariates_X, ps_score)

            predicted_ITE = treatment_pred[0] - treatment_pred[1]
            true_ITE = y_cf - y_f
            # Move the target to `device` so both operands are on the same device.
            diff = true_ITE.float().to(device) - predicted_ITE.float()
            err_control_list.append(diff.item())

        # print(err_treated_list)
        # print(err_control_list)
        return {
            "treated_err": err_treated_list,
            "control_err": err_control_list,
        }

We refer to our latent outcome model as a Deep Counterfactual Network (DCN), and we use the acronym DCN-pd to refer to a DCN with propensity-dropout regularization. Since our model captures both propensity scores and outcomes, it is a doubly-robust model.

2.4 VCNet

Lizhen Nie, Mao Ye, Qiang Liu, Dan L. Nicolae. VCNet and Functional Targeted Regularization For Learning Causal Effects of Continuous Treatments. arXiv: Learning, 2021.

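Question:

1. In multi-target training, each sample (X, T, Y) has only one observed label: the model outputs x -> [y1, y2], but a data row looks like x -> [y1, null]. How should such rows be trained?

Answer: Train them so that the parameters belonging to the missing label are not updated.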
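
One way to read this answer is to mask the loss: a head whose label is missing contributes nothing to the loss, so it receives no gradient and its parameters stay unchanged for that row. The sketch below shows only the masking arithmetic in plain Python. The function names are hypothetical and the squared error is a stand-in for whatever loss each head actually uses.

```python
def masked_squared_error(preds, labels):
    """Squared error summed over the heads that have a label; None means missing."""
    loss, used = 0.0, 0
    for p, y in zip(preds, labels):
        if y is None:
            continue  # missing label: no loss term, hence no gradient for this head
        loss += (p - y) ** 2
        used += 1
    return loss, used


def batch_loss(batch_preds, batch_labels):
    """Average the masked loss over all observed (row, head) pairs in the batch."""
    total, count = 0.0, 0
    for preds, labels in zip(batch_preds, batch_labels):
        loss, used = masked_squared_error(preds, labels)
        total += loss
        count += used
    return total / count if count else 0.0


# Row 1 observes only y1, row 2 observes only y2.
labels = [[0.0, None], [None, 1.0]]
loss_a = batch_loss([[1.0, 5.0], [2.0, 3.0]], labels)
loss_b = batch_loss([[1.0, -9.0], [7.0, 3.0]], labels)  # masked predictions changed
print(loss_a, loss_b)
```

Because the masked predictions do not affect the loss, changing them leaves the result unchanged, which is the "parameters are not updated" behaviour in miniature.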

3 Thoughts

1. Causal inference can be used to correct a correlation-based model. What is the relationship between causal inference and machine learning?

2. What problems do causal inference and machine learning each solve?

Answer: Machine learning solves the prediction problem without knowing the cause; causal inference starts from the cause and predicts the result.

3. What problems cannot be solved by machine learning?

Answer: Problems that require intervening or exploring the cause; machine learning does neither.

Answer: Questions of cause and effect cannot be answered.

4. What problems cannot be solved by causal inference?

5. Is the non-intervention problem still a causal problem?

Answer: I don't understand.

6. How can the bias be corrected?

Origin blog.csdn.net/as472780551/article/details/128622691