Causal inference with DoWhy: exploring the reasons for hotel booking cancellations

0x01. DoWhy case analysis

This case study is again based on Microsoft's official open-source documentation. If you want to know more, please go to Microsoft's official website.

Background:
There can be different reasons for canceling a hotel reservation. The customer may request something that cannot be provided (for example, parking), the customer may later find out that the hotel did not meet their request, or the customer may simply cancel their entire trip. Some of these issues, like parking, are something hotels can handle, while others, like trip cancellations, are outside the hotel's control. In any case, we'd like to better understand what factors lead to cancellations.

In this example, our research question is to estimate the effect that assigning a customer a different room from the one they reserved has on the cancellation of that booking. The gold standard for answering such a question is a randomized controlled trial (RCT), in which each customer is randomly assigned to one of two interventions: the same room as reserved, or a different one. Since a hotel is unlikely to run such an experiment, we instead work from historical booking data together with explicit causal assumptions.
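To see why randomization is the reference point, here is a minimal simulation sketch. Everything in it is hypothetical (the probabilities, the assumed true effect of -0.10, and the `changed` confounder); it only illustrates that a naive comparison recovers the true effect when assignment is random, and is distorted when assignment depends on a confounder.

```python
import random

def simulate(randomized: bool, n: int = 200_000, seed: int = 0) -> float:
    """Return the naive difference in cancellation rates (treated minus control)."""
    rng = random.Random(seed)
    treated_total = treated_cancel = control_total = control_cancel = 0
    for _ in range(n):
        # Hypothetical confounder: the guest changed the booking at some point.
        changed = rng.random() < 0.3
        if randomized:
            # Coin flip, independent of everything else (the RCT case).
            treated = rng.random() < 0.5
        else:
            # Assignment depends on the confounder (the observational case).
            treated = rng.random() < (0.6 if changed else 0.1)
        # Assumed ground truth: a different room lowers P(cancel) by 0.10.
        p_cancel = 0.5 - 0.10 * treated - 0.30 * changed
        canceled = rng.random() < p_cancel
        if treated:
            treated_total += 1
            treated_cancel += canceled
        else:
            control_total += 1
            control_cancel += canceled
    return treated_cancel / treated_total - control_cancel / control_total

rct_diff = simulate(randomized=True)   # close to the assumed effect of -0.10
obs_diff = simulate(randomized=False)  # noticeably more negative: confounded
print(round(rct_diff, 3), round(obs_diff, 3))
```

In the non-randomized run, the gap between the two groups mixes the treatment effect with the effect of `changed`, which is exactly the problem the rest of this article tries to handle.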

We consider which factors can cause a hotel reservation to be cancelled. A simple, experience-based picture is shown in the figure:
[Figure: candidate factors that may lead to a hotel booking cancellation]

0x02. Start the experiment

0x02_1. Import packages and read the data

For a detailed explanation of the data, please refer to: Portal

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import dowhy

dataset = pd.read_csv('https://raw.githubusercontent.com/Sid-darthvader/DoWhy-The-Causal-Story-Behind-Hotel-Booking-Cancellations/master/hotel_bookings.csv')
dataset.columns

The dataset contains the following columns:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

0x02_2. Data processing

By merging features, we reduce the number of original columns and obtain more meaningful variables. The newly created features are:

  • Total Stay = stays_in_weekend_nights + stays_in_week_nights
  • Guests = adults + children + babies
  • Different_room_assigned = 1 if reserved_room_type and assigned_room_type are different, 0 otherwise.

These are basic pandas operations:

# Total stay in nights
dataset['total_stay'] = dataset['stays_in_week_nights']+dataset['stays_in_weekend_nights']
# Total number of guests
dataset['guests'] = dataset['adults']+dataset['children'] +dataset['babies']
# Creating the different_room_assigned feature
dataset['different_room_assigned']=0
slice_indices =dataset['reserved_room_type']!=dataset['assigned_room_type']
dataset.loc[slice_indices,'different_room_assigned']=1
# Deleting older features
dataset = dataset.drop(['stays_in_week_nights','stays_in_weekend_nights','adults','children','babies'
                        ,'reserved_room_type','assigned_room_type'],axis=1)
dataset.columns

After this processing, the columns are:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'meal', 'country', 'market_segment',
       'distribution_channel', 'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'booking_changes', 'deposit_type',
       'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date', 'total_stay', 'guests',
       'different_room_assigned'],
      dtype='object')
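The same three features can be checked on a tiny hand-made frame. This is only a sketch: the three rows below are invented, and the missing value in children is an assumption added to show how a missing addend propagates into guests (such rows are removed by the dropna call later on).

```python
import pandas as pd

# Hypothetical three-row sample that reuses the dataset's column names.
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2, 1],
    "stays_in_week_nights": [3, 5, 0],
    "adults": [2, 1, 2],
    "children": [0.0, 1.0, None],  # assumed missing entry in the last row
    "babies": [0, 0, 1],
    "reserved_room_type": ["A", "A", "D"],
    "assigned_room_type": ["A", "C", "D"],
})

toy = toy.assign(
    total_stay=toy["stays_in_week_nights"] + toy["stays_in_weekend_nights"],
    # NaN in any addend makes the sum NaN.
    guests=toy["adults"] + toy["children"] + toy["babies"],
    # Element-wise comparison, then cast True/False to 1/0.
    different_room_assigned=(
        toy["reserved_room_type"] != toy["assigned_room_type"]
    ).astype(int),
)
print(toy[["total_stay", "guests", "different_room_assigned"]])
```

The comparison-and-cast form gives the same 0/1 column as initializing with 0 and then setting 1 through .loc.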

Next, handle missing values and drop columns that are not needed:

# Country, agent, company contain 488, 16340, 112593 missing entries
dataset.isnull().sum()
dataset = dataset.drop(['agent','company'],axis=1)
# Replace missing countries with the most frequently occurring country
dataset['country']= dataset['country'].fillna(dataset['country'].mode()[0])

dataset = dataset.drop(['reservation_status','reservation_status_date','arrival_date_day_of_month'],axis=1)
dataset = dataset.drop(['arrival_date_year'],axis=1)
dataset = dataset.drop(['distribution_channel'], axis=1)

# Convert the 0/1 treatment and outcome variables to booleans (True/False)
dataset['different_room_assigned'] = dataset['different_room_assigned'].astype(bool)
dataset['is_canceled'] = dataset['is_canceled'].astype(bool)
dataset.dropna(inplace=True)
print(dataset.columns)
dataset.iloc[:, 5:20].head(100)

The data output results are as follows:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_month',
       'arrival_date_week_number', 'meal', 'country', 'market_segment',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'booking_changes', 'deposit_type',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'total_stay', 'guests', 'different_room_assigned'],
      dtype='object')

Restrict the data to bookings made without a deposit, and count how many of them were cancelled:

dataset = dataset[dataset.deposit_type=="No Deposit"]
dataset.groupby(['deposit_type','is_canceled']).count()

The result is as follows:
[Image: table of booking counts grouped by deposit_type and is_canceled]

Make a deep copy of the data:

dataset_copy = dataset.copy(deep=True)

0x02_3. Simple analysis hypothesis

After the data preprocessing is complete, we first examine the relationship between variables. For the two variables of interest, is_canceled and different_room_assigned, we randomly draw 1000 observations and count how many of them have the same value for both variables (frequent agreement would hint at a relationship between them), repeat this 10000 times, and take the average. The code is as follows:

counts_sum=0
# range(10000) runs exactly 10000 times, matching the divisor below
for i in range(10000):
    rdf = dataset.sample(1000)
    counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
    counts_sum+= counts_i
counts_sum/10000

The resulting expected count is about 518 out of 1000, so the two variables agree only about half of the time, and we cannot yet judge whether there is a causal relationship. Let's look at the expected count of agreement when no changes were made during the booking process (that is, the variable booking_changes is 0):

counts_sum=0
# Same procedure, restricted to bookings without any changes
for i in range(10000):
    rdf = dataset[dataset["booking_changes"]==0].sample(1000)
    counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
    counts_sum+= counts_i
counts_sum/10000

The result is about 492. Next, we compute the expected count when changes did occur during the booking process (booking_changes > 0):

counts_sum=0
# Same procedure, restricted to bookings with at least one change
for i in range(10000):
    rdf = dataset[dataset["booking_changes"]>0].sample(1000)
    counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
    counts_sum+= counts_i
counts_sum/10000

The result is about 663, which is clearly different from before. We can loosely regard booking_changes as a confounding factor. Similarly, we examine other variables and form assumptions that serve as prior knowledge for the causal analysis. DoWhy does not require complete prior knowledge; variables whose relationships are left unspecified are treated as potential confounders.
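The repeated sampling above estimates a quantity that can also be computed directly: 1000 times the share of rows in which the two flags agree. The sketch below shows this on synthetic data. The column names match the dataset, but all probabilities are made up, so the printed numbers are not the 518 / 492 / 663 reported here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000
# Synthetic stand-in for the hotel data (hypothetical probabilities).
booking_changes = rng.poisson(0.3, size=n)
changed = booking_changes > 0
toy = pd.DataFrame({
    "booking_changes": booking_changes,
    "different_room_assigned": rng.random(n) < np.where(changed, 0.30, 0.10),
    "is_canceled": rng.random(n) < np.where(changed, 0.15, 0.40),
})

def expected_agreement(df: pd.DataFrame, sample_size: int = 1000) -> float:
    """Expected number of sampled rows in which both flags have the same value."""
    return sample_size * (df["is_canceled"] == df["different_room_assigned"]).mean()

overall = expected_agreement(toy)
no_changes = expected_agreement(toy[toy["booking_changes"] == 0])
with_changes = expected_agreement(toy[toy["booking_changes"] > 0])

# Monte Carlo check with fewer repetitions than in the article, to keep it fast.
counts = []
for seed in range(200):
    rdf = toy.sample(1000, random_state=seed)
    counts.append((rdf["is_canceled"] == rdf["different_room_assigned"]).sum())
sampled = float(np.mean(counts))

print(round(overall, 1), round(sampled, 1), round(no_changes, 1), round(with_changes, 1))
```

The direct computation removes the sampling noise, which makes differences between subgroups such as booking_changes == 0 and booking_changes > 0 easier to compare.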

0x03. Use DoWhy to estimate causal effects

We follow DoWhy's four steps of causal analysis (model, identify, estimate, refute), as follows.

0x03_1. Create a causal graph

There is no need to create a complete causal graph at this stage; even a partial graph is enough, and DoWhy can work out the rest. We translate some plausible assumptions into a causal graph:

  • market_segment has two values: "TA" refers to "travel agent" and "TO" refers to "tour operator", so it should affect lead_time (i.e. the number of days between booking and arrival).
  • country may determine whether a person books early (i.e. affects lead_time) and which meal they prefer (i.e. affects meal).
  • lead_time affects the waiting time for bookings (days_in_waiting_list).
  • The waiting time (days_in_waiting_list), the total length of stay (total_stay), and the number of guests (guests) affect whether the reservation is cancelled.
  • The number of previous bookings that were not cancelled (previous_bookings_not_canceled) affects whether the guest is a repeat guest (is_repeated_guest); both variables also affect whether the booking is cancelled.
  • booking_changes affects whether the customer is assigned a different room, and also affects the cancellation of the reservation.

It is unlikely that booking_changes is the only confounder; there are probably other, unobserved confounders that affect both the intervention and the outcome.

import pygraphviz
causal_graph = """digraph {
different_room_assigned[label="Different Room Assigned"];
is_canceled[label="Booking Cancelled"];
booking_changes[label="Booking Changes"];
previous_bookings_not_canceled[label="Previous Booking Retentions"];
days_in_waiting_list[label="Days in Waitlist"];
lead_time[label="Lead Time"];
market_segment[label="Market Segment"];
country[label="Country"];
U[label="Unobserved Confounders",observed="no"];
is_repeated_guest;
total_stay;
guests;
meal;
hotel;
U->{different_room_assigned,required_car_parking_spaces,guests,total_stay,total_of_special_requests};
market_segment -> lead_time;
lead_time->is_canceled; country -> lead_time;
different_room_assigned -> is_canceled;
country->meal;
lead_time -> days_in_waiting_list;
days_in_waiting_list ->{is_canceled,different_room_assigned};
previous_bookings_not_canceled -> is_canceled;
previous_bookings_not_canceled -> is_repeated_guest;
is_repeated_guest -> {different_room_assigned,is_canceled};
total_stay -> is_canceled;
guests -> is_canceled;
booking_changes -> different_room_assigned; booking_changes -> is_canceled;
hotel -> {different_room_assigned,is_canceled};
required_car_parking_spaces -> is_canceled;
total_of_special_requests -> {booking_changes,is_canceled};
country->{hotel, required_car_parking_spaces,total_of_special_requests};
market_segment->{hotel, required_car_parking_spaces,total_of_special_requests};
}"""
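As a quick sanity check on a graph like this, one can list the nodes that point directly at both the treatment and the outcome. The sketch below uses a small hypothetical excerpt of the graph and a hand-written helper (parse_edges is not part of DoWhy). It only finds direct common parents, which is a simplification of the backdoor criterion that DoWhy applies, and the parser does not handle chained statements such as a -> b -> c.

```python
import re

def parse_edges(dot: str) -> set:
    """Extract (source, target) pairs from 'a -> b' and 'a -> {b, c}' statements."""
    edges = set()
    for src, dst in re.findall(r"(\w+)\s*->\s*(\{[^}]*\}|\w+)", dot):
        for target in dst.strip("{}").split(","):
            if target.strip():
                edges.add((src, target.strip()))
    return edges

def parents(edges: set, node: str) -> set:
    """Nodes with an edge pointing directly at `node`."""
    return {src for src, dst in edges if dst == node}

# Hypothetical excerpt of the causal graph, in the same DOT style.
toy_graph = """digraph {
booking_changes -> different_room_assigned; booking_changes -> is_canceled;
hotel -> {different_room_assigned,is_canceled};
lead_time -> is_canceled;
different_room_assigned -> is_canceled;
}"""

edges = parse_edges(toy_graph)
common_causes = parents(edges, "different_room_assigned") & parents(edges, "is_canceled")
print(sorted(common_causes))  # ['booking_changes', 'hotel']
```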

Based on the above causal diagram, build a model:

model= dowhy.CausalModel(
        data = dataset,
        graph=causal_graph.replace("\n", " "),
        treatment="different_room_assigned",
        outcome='is_canceled')
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

The resulting model is shown in the figure:
[Figure: causal graph rendered by model.view_model()]
(PS: I don't know why the rendered diagram is so blurry; that is just how it comes out.)

0x03_2. Identify causal effects

We say that an intervention (treatment) causes an outcome if and only if, all else being equal, a change in the intervention leads to a change in the outcome. The causal effect is the amount by which the outcome changes when the intervention changes by one unit. Below we use properties of the causal graph to identify the estimand of the causal effect.
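For a binary intervention such as ours, this definition is commonly written as the average treatment effect. The symbols below are notation introduced here for illustration: T stands for different_room_assigned, Y for is_canceled, and do(·) denotes setting the intervention by hand rather than merely observing it.

```latex
\mathrm{ATE}
  \;=\; \mathbb{E}\bigl[\,Y \mid \operatorname{do}(T = 1)\,\bigr]
  \;-\; \mathbb{E}\bigl[\,Y \mid \operatorname{do}(T = 0)\,\bigr]
```

Because Y is 0 or 1 here, the ATE can be read as a difference in cancellation probabilities.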

identified_estimand = model.identify_effect()
print(identified_estimand)

The output is as follows:

Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
            d
──────────────────────────(E[is_canceled|guests,booking_changes,days_in_waiting_list,is_repeated_guest,hotel,lead_time,total_of_special_requests,required_car_parking_spaces,total_stay])
d[different_room_assigned]

Estimand assumption 1, Unconfoundedness: If U→{different_room_assigned} and U→is_canceled then P(is_canceled|different_room_assigned,guests,booking_changes,days_in_waiting_list,is_repeated_guest,hotel,lead_time,total_of_special_requests,required_car_parking_spaces,total_stay,U) = P(is_canceled|different_room_assigned,guests,booking_changes,days_in_waiting_list,is_repeated_guest,hotel,lead_time,total_of_special_requests,required_car_parking_spaces,total_stay)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
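The backdoor estimand says: compare treated and untreated bookings within groups that share the same values of the adjustment variables, then average over those groups. A minimal sketch with a single hypothetical binary confounder w on synthetic data follows; the assumed true effect is -0.10, and none of the numbers come from the hotel dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000
# Hypothetical confounder w, treatment t, outcome y (all coded as 0/1).
w = (rng.random(n) < 0.3).astype(int)
t = (rng.random(n) < np.where(w == 1, 0.6, 0.1)).astype(int)
# Assumed ground truth: treatment lowers P(y = 1) by 0.10.
y = (rng.random(n) < (0.5 - 0.10 * t - 0.30 * w)).astype(int)
df = pd.DataFrame({"w": w, "t": t, "y": y})

# Naive contrast: ignores w, so it mixes the effect of t with confounding.
naive = df.loc[df["t"] == 1, "y"].mean() - df.loc[df["t"] == 0, "y"].mean()

# Backdoor adjustment: within-stratum contrasts, weighted by P(w).
stratum_means = df.groupby(["w", "t"])["y"].mean().unstack("t")
stratum_effect = stratum_means[1] - stratum_means[0]
weights = df["w"].value_counts(normalize=True)
adjusted = float((stratum_effect * weights).sum())

print(round(naive, 3), round(adjusted, 3))
```

With many adjustment variables, as in the estimand above, exact stratification is no longer practical, which is why the next step uses propensity scores.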

0x03_3. Causal Effect Estimation

Based on the identified estimand, we can now estimate the causal effect from the actual data. As stated earlier, the causal effect is the amount by which the outcome changes when the intervention changes by one unit. DoWhy supports a variety of estimation methods, and the estimate it reports is a single mean value. The code looks like this:

estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_stratification",target_units="ate")
# ATE = Average Treatment Effect
# ATT = Average Treatment Effect on Treated (i.e. those who were assigned a different room)
# ATC = Average Treatment Effect on Control (i.e. those who were not assigned a different room)
print(estimate)

The output is as follows:

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
            d
──────────────────────────(E[is_canceled|guests,booking_changes,days_in_waiting_list,is_repeated_guest,hotel,lead_time,total_of_special_requests,required_car_parking_spaces,total_stay])
d[different_room_assigned]

Estimand assumption 1, Unconfoundedness: If U→{different_room_assigned} and U→is_canceled then P(is_canceled|different_room_assigned,guests,booking_changes,days_in_waiting_list,is_repeated_guest,hotel,lead_time,total_of_special_requests,required_car_parking_spaces,total_stay,U) = P(is_canceled|different_room_assigned,guests,booking_changes,days_in_waiting_list,is_repeated_guest,hotel,lead_time,total_of_special_requests,required_car_parking_spaces,total_stay)

## Realized estimand
b: is_canceled~different_room_assigned+guests+booking_changes+days_in_waiting_list+is_repeated_guest+hotel+lead_time+total_of_special_requests+required_car_parking_spaces+total_stay
Target units: ate

## Estimate
Mean value: -0.23589558877628428
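The three target_units options differ only in which population the stratum-level effects are averaged over. The sketch below uses two invented strata to make that concrete; it is a conceptual illustration under stated assumptions, not a description of DoWhy's internal implementation.

```python
# Hypothetical strata:
# (share of all bookings, P(treated | stratum), treatment effect in stratum)
strata = [
    (0.7, 0.1, -0.30),
    (0.3, 0.6, -0.10),
]

def weighted_effect(weights, effects):
    """Average the stratum effects using the given (unnormalized) weights."""
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

effects = [e for _, _, e in strata]
shares = [s for s, _, _ in strata]
treated_mass = [s * p for s, p, _ in strata]        # where the treated units are
control_mass = [s * (1 - p) for s, p, _ in strata]  # where the control units are

ate = weighted_effect(shares, effects)        # weight by everyone
att = weighted_effect(treated_mass, effects)  # weight by treated units
atc = weighted_effect(control_mass, effects)  # weight by control units
print(round(ate, 3), round(att, 3), round(atc, 3))  # -0.24 -0.156 -0.268
```

When the effect is the same in every stratum, the three quantities coincide; they diverge only when the effect varies across strata and the treated units are distributed differently from the control units.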

0x03_4. Refute the results

In fact, the causal conclusion above does not come from the data alone but from the assumptions we made (that is, the causal graph we provided); the data is only used for statistical estimation. Therefore, we need to check how robust the estimate is to those assumptions. DoWhy supports a variety of robustness checks for this purpose. Here are a few of them:

  • Add a random common cause. If the assumptions are correct, the causal effect should not change much when a randomly generated confounder is added.

refute1_results=model.refute_estimate(identified_estimand, estimate,
        method_name="random_common_cause")
print(refute1_results)

The result is as follows:

Refute: Add a random common cause
Estimated effect:-0.23589558877628428
New effect:-0.2387463245344759
p value:0.19999999999999996

  • Placebo treatment. If the intervention is replaced with a random variable, the causal effect should be close to 0 if the assumptions are correct.

refute2_results=model.refute_estimate(identified_estimand, estimate,
        method_name="placebo_treatment_refuter")
print(refute2_results)

The result is as follows:

Refute: Use a Placebo Treatment
Estimated effect:-0.23589558877628428
New effect:8.214081490552981e-05
p value:0.98

  • Data subset validation. The causal effect is re-estimated on a subset of the data; it should change little if the assumptions are correct.

refute3_results=model.refute_estimate(identified_estimand, estimate,
        method_name="data_subset_refuter")
print(refute3_results)

The result is as follows:

Refute: Use a subset of data
Estimated effect:-0.23589558877628428
New effect:-0.23518131741789491
p value:0.8
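The logic of the placebo and subset checks can be reproduced by hand on synthetic data. The sketch below uses a simple stratified estimator and one hypothetical binary confounder. The placebo is built here by permuting the treatment column, which is one way to obtain a treatment that cannot affect the outcome; it does not claim to mirror DoWhy's defaults.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
w = rng.integers(0, 2, size=n)  # hypothetical binary confounder
t = (rng.random(n) < np.where(w == 1, 0.6, 0.1)).astype(int)
# Assumed ground truth: treatment lowers P(y = 1) by 0.10.
y = (rng.random(n) < (0.5 - 0.10 * t - 0.30 * w)).astype(int)

def adjusted_effect(t, y, w):
    """Stratify on w; average treated-minus-control contrasts by stratum size."""
    effect = 0.0
    for level in np.unique(w):
        mask = w == level
        contrast = y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()
        effect += contrast * mask.mean()
    return effect

original = adjusted_effect(t, y, w)

# Placebo treatment: a shuffled treatment column should show no effect.
placebo = adjusted_effect(rng.permutation(t), y, w)

# Data subset: the estimate on a random 80% of the rows should barely move.
idx = rng.choice(n, size=int(0.8 * n), replace=False)
subset = adjusted_effect(t[idx], y[idx], w[idx])

print(round(original, 3), round(placebo, 3), round(subset, 3))
```

Passing such checks does not prove the causal graph is right; it only shows that the estimate does not fail in the ways these particular checks can detect.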

It can be seen that our estimate passes the above checks (that is, it behaves as expected in each of them). Therefore, based on the results of the estimation phase, we conclude that, among the No Deposit bookings analyzed here, assigning a different room (different_room_assigned = 1) results in an average cancellation probability (is_canceled) about 23.6 percentage points lower than assigning the room that was originally reserved (different_room_assigned = 0).

0x04. Reference

  1. Microsoft DoWhy Official Tutorial Hotel Booking Case Actual Combat
  2. Getting started with the causal inference framework DoWhy

Origin blog.csdn.net/l8947943/article/details/129495025