Welcome to the third assignment of Lesson Two. In this assignment, we will use Python to build some of the statistical models we learned about last week and use them to estimate survival for a dataset of lymphoma patients. We will also evaluate these models and interpret their outputs. Along the way, you'll learn about the following:
- Censored Data
- Kaplan-Meier Estimates
- Subgroup Analysis
Assignment walkthrough
Assignment name:
Survival Estimates that Vary with Time.ipynb
Assignment location:
github --> bharathikannann/AI-for-Medicine-Specialization-deeplearning.ai --> AI for Medical Prognosis --> Week 3
Import packages
import lifelines
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from util import load_data
from lifelines import KaplanMeierFitter as KM
from lifelines.statistics import logrank_test
- lifelines: an open source data analysis library, mainly used for survival analysis and reliability analysis. It provides common survival analysis tools, such as the Kaplan-Meier curve and the Cox proportional hazards model, which make it convenient to process and analyze survival data.
- numpy: the basic package for scientific computing in Python. It provides multidimensional array and matrix operations and is the basis of many scientific computing and data analysis libraries.
- pandas: one of the core libraries for data analysis in Python, mainly used for data cleaning, processing, conversion and analysis.
- matplotlib: one of the most popular plotting libraries in Python.
Load the dataset
The dataset used in this experiment is the lymphoma data provided by lifelines.
data = load_data()
print("data shape: {}".format(data.shape))
data.head()
The "Time" column lists how long the patient was observed before death or censoring.
The "Event" column shows whether death was observed: it is 1 if the event (patient death) was observed, and 0 if the data were censored.
Censoring here means that observation ended before any event was observed. For example, suppose patients are followed in hospital for up to 100 days. If a patient dies after 44 days, the record is Time = 44, Event = 1. If a patient is discharged after 100 days and dies 3 days later (103 days in total), the death is not observed in our pipeline, so the corresponding row is Time = 100, Event = 0. If a patient survives 25 years after admission, the record is still Time = 100, Event = 0.
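The 100-day example above can be sketched in a few lines. This is only an illustration: the helper `observe` and its arguments are hypothetical names, and treating a death on exactly day 100 as observed is an assumption, since the text does not say.

```python
def observe(days_until_death, follow_up_days=100):
    """Return the (Time, Event) pair recorded for one patient.

    days_until_death: true survival time in days (hypothetical input; in a
        real study it is unknown for censored patients).
    follow_up_days: length of the observation window.
    """
    if days_until_death <= follow_up_days:
        # Death happened inside the window, so the event is observed.
        return days_until_death, 1
    # The patient was still alive when follow-up ended: censored.
    return follow_up_days, 0


# The three patients from the example: 44 days, 103 days, 25 years.
records = [observe(44), observe(103), observe(25 * 365)]
print(records)  # [(44, 1), (100, 0), (100, 0)]
```

The last two patients end up with identical rows even though their true survival times differ enormously, which is exactly the information that censoring hides.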
Censored data
We can plot a histogram of the Time column to see how long cases were observed before censoring or an event.
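A minimal sketch of such a histogram is shown here. The `toy` DataFrame holds made-up values standing in for the lymphoma data, and the output file name is arbitrary; with the real data you would pass `data["Time"]` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the lymphoma data (hypothetical values).
toy = pd.DataFrame({
    "Time":  [44, 100, 100, 12, 73, 100, 58, 5],
    "Event": [1, 0, 0, 1, 1, 0, 0, 1],
})

fig, ax = plt.subplots()
# counts[i] is the number of patients whose Time falls in bin i
counts, edges, _ = ax.hist(toy["Time"], bins=10)
ax.set_xlabel("Observation time before death or censoring (days)")
ax.set_ylabel("Frequency (number of patients)")
fig.savefig("time_histogram.png", dpi=100)
```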
Exercise 1
Write a function that calculates the fraction of observations that were censored.
def frac_censored(df):
    """
    Return fraction of observations which were censored.

    Args:
        df (dataframe): dataframe which contains column 'Event' which is
                        1 if an event occurred (death)
                        0 if the event did not occur (censored)
    Returns:
        frac_censored (float): fraction of cases which were censored.
    """
    result = 0.0
    ### START CODE HERE ###
    # number of censored rows (Event == 0) divided by the total number of rows
    result = len(df[df["Event"] == 0]) / len(df)
    ### END CODE HERE ###
    return result
Survival estimate
To illustrate the advantages of the Kaplan-Meier method, we will start with a naive estimator of the survival function S(t), the probability of surviving past time t. To estimate this quantity, we divide the number of people known to have survived past time t by the number of people who were not censored before time t.
Expressed as:
$$\hat{S}(t) = \frac{|X_t|}{|M_t|}$$
where $X_t$ is the set of people known to be alive past time t and $M_t$ is the set of everyone who was not censored before time t:
$$X_t = \{i : T_i > t\}, \qquad M_t = \{i : e_i = 1 \text{ or } T_i > t\}$$
Note that $M_t$ excludes anyone who was censored by time t ($e_i = 0$ and $T_i \le t$), because their survival status at time t is unknown. In the definition of $M_t$, the condition $e_i = 1$ covers two cases: death before time t and death after time t. The condition $T_i > t$ also covers two cases: censoring after time t and death after time t. The two conditions obviously overlap, but combining them with a logical "or" gives the total number of people who were not censored before time t.
This is a bit tangled, so it is worth working through it slowly yourself.
Once it clicks, the code makes it look very simple.
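To make the two sets concrete, here is a small sketch with plain Python sets. The five patients and their values are made up for illustration.

```python
# Hypothetical toy cohort: patient id -> (T_i, e_i)
patients = {
    "a": (3, 1),  # died at time 3
    "b": (4, 0),  # censored at time 4
    "c": (6, 1),  # died at time 6
    "d": (8, 0),  # censored at time 8
    "e": (9, 1),  # died at time 9
}
t = 5

# X_t: everyone known to be alive past time t
X_t = {i for i, (T, e) in patients.items() if T > t}
# M_t: everyone who died at some point, or was still under observation past t
M_t = {i for i, (T, e) in patients.items() if e == 1 or T > t}

print(sorted(X_t))          # ['c', 'd', 'e']
print(sorted(M_t))          # ['a', 'c', 'd', 'e']
print(len(X_t) / len(M_t))  # 0.75
```

Patient "b" is the only one missing from M_t: they were censored at time 4, so nothing is known about them at t = 5.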
Simple Estimation Function
def naive_estimator(t, df):
    """
    Return naive estimate for S(t), the probability
    of surviving past time t. Given by number
    of cases who survived past time t divided by the
    number of cases who weren't censored before time t.

    Args:
        t (int): query time
        df (dataframe): survival data. Has a Time column,
                        which says how long until that case
                        experienced an event or was censored,
                        and an Event column, which is 1 if an event
                        was observed and 0 otherwise.
    Returns:
        S_t (float): estimator for survival function evaluated at t.
    """
    S_t = 0.0
    ### START CODE HERE ###
    X_t = len(df[df["Time"] > t])
    M_t = len(df[(df["Time"] > t) | (df["Event"] == 1)])
    S_t = X_t / M_t
    ### END CODE HERE ###
    return S_t
Now that we know how to estimate the survival probability at a single time t, we can use a for loop to estimate it over a range of consecutive values of t.
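A minimal sketch of that loop, assuming a DataFrame with the same Time and Event columns. The `toy` data and the query range are hypothetical, and `naive_estimator` is restated compactly so the snippet runs on its own.

```python
import pandas as pd


def naive_estimator(t, df):
    """Naive S(t): cases surviving past t divided by cases not censored before t."""
    X_t = len(df[df["Time"] > t])
    M_t = len(df[(df["Time"] > t) | (df["Event"] == 1)])
    return X_t / M_t


# Hypothetical toy data, not the lymphoma dataset.
toy = pd.DataFrame({"Time": [3, 4, 6, 8, 9], "Event": [1, 0, 1, 0, 1]})

query_times = list(range(0, 10))
naive_curve = []
for t in query_times:
    naive_curve.append(naive_estimator(t, toy))

for t, s in zip(query_times, naive_curve):
    print(t, round(s, 3))
```

One caveat: if every remaining case is censored at or before the query time and no deaths were recorded, M_t is empty and the division fails, so real code may want to guard against that.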
Kaplan Meier estimates
We next compare this with the Kaplan-Meier estimate. In the cell below, write a function that computes the Kaplan-Meier estimate of $S(t)$ at each distinct time in the dataset.
Recall the Kaplan-Meier estimate:
$$S(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$
where $t_i$ are the event times observed in the dataset, $d_i$ is the number of deaths at time $t_i$, and $n_i$ is the number of people known to have survived up to time $t_i$.
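As a quick worked example, take three patients with (Time, Event) = (5, 0), (10, 1), (15, 0). Times at which nobody dies contribute a factor of 1 because $d_i = 0$, so only the death at time 10 lowers the estimate:

```latex
\begin{aligned}
S(5)  &= \left(1 - \tfrac{0}{3}\right) = 1.0 \\
S(10) &= \left(1 - \tfrac{0}{3}\right)\left(1 - \tfrac{1}{2}\right) = 0.5 \\
S(15) &= \left(1 - \tfrac{0}{3}\right)\left(1 - \tfrac{1}{2}\right)\left(1 - \tfrac{0}{1}\right) = 0.5
\end{aligned}
```

At time 10 only two patients are still at risk ($n_i = 2$), because the patient censored at time 5 has already left the risk set.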
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def HomemadeKM(df):
    """
    Return KM estimate evaluated at every distinct
    time (event or censored) recorded in the dataset.
    Event times and probabilities should begin with
    time 0 and probability 1.

    Example:

    input:
         Time  Event
    0       5      0
    1      10      1
    2      15      0

    correct output:
    event_times: [0, 5, 10, 15]
    S: [1.0, 1.0, 0.5, 0.5]

    Args:
        df (dataframe): dataframe which has columns for Time
                        and Event, defined as usual.
    Returns:
        event_times (list of ints): array of unique event times
                                    (begins with 0).
        S (list of floats): array of survival probabilities, so that
                            S[i] = P(T > event_times[i]). This
                            begins with 1.0 (since no one dies at time
                            0).
    """
    # individuals are considered to have survival probability 1
    # at time 0
    event_times = [0]
    p = 1.0
    S = [p]

    ### START CODE HERE ###
    # get collection of unique observed event times
    observed_event_times = df["Time"].unique()
    # sort event times
    observed_event_times = sorted(observed_event_times)
    # iterate through event times
    for t in observed_event_times:
        # compute n_t, number of people who survive to time t
        n_t = len(df[df["Time"] >= t])
        # compute d_t, number of people who die at time t
        d_t = len(df[(df["Time"] == t) & (df["Event"] == 1)])
        # update p
        p = p * (1 - (d_t / n_t))
        # update S and event_times
        S.append(p)
        event_times.append(t)
    ### END CODE HERE ###

    return event_times, S
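To see how the two estimators differ, the sketch below evaluates both on a small made-up dataset. The functions are compact restatements of the naive and Kaplan-Meier estimators (the name `km_estimate` is hypothetical), and the toy values are not from the lymphoma data.

```python
import pandas as pd


def naive_estimator(t, df):
    """Naive S(t): cases surviving past t divided by cases not censored before t."""
    X_t = len(df[df["Time"] > t])
    M_t = len(df[(df["Time"] > t) | (df["Event"] == 1)])
    return X_t / M_t


def km_estimate(df):
    """Kaplan-Meier estimate at every distinct recorded time, starting at (0, 1.0)."""
    event_times, S, p = [0], [1.0], 1.0
    for t in sorted(df["Time"].unique()):
        n_t = len(df[df["Time"] >= t])                          # still at risk at t
        d_t = len(df[(df["Time"] == t) & (df["Event"] == 1)])   # deaths at t
        p = p * (1 - d_t / n_t)
        event_times.append(int(t))
        S.append(p)
    return event_times, S


# Hypothetical toy data.
toy = pd.DataFrame({"Time": [3, 4, 6, 8, 9], "Event": [1, 0, 1, 0, 1]})

event_times, km_curve = km_estimate(toy)
naive_curve = [naive_estimator(t, toy) for t in event_times]

for t, n, k in zip(event_times, naive_curve, km_curve):
    print(f"t={t}: naive={n:.3f}  KM={k:.3f}")
```

On this toy data the naive estimate is never above the Kaplan-Meier estimate. The naive estimator throws away censored patients entirely once they are censored, while Kaplan-Meier still counts them as survivors for the period in which they were observed.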
Subgroup analysis
We can see that, in addition to the columns for time and censoring information, the dataset has a column called Stage_group.
A value of 1 in this column represents a patient with stage III cancer, and 2 represents stage IV. We want to compare the survival functions of these two groups of patients.
This time, we will use the KaplanMeierFitter class in lifelines. Run the code block below to fit and plot the Kaplan-Meier curves for each group.
S1 = data[data.Stage_group == 1]
km1 = KM()
km1.fit(S1.loc[:, 'Time'], event_observed = S1.loc[:, 'Event'], label = 'Stage III')
S2 = data[data.Stage_group == 2]
km2 = KM()
km2.fit(S2.loc[:, "Time"], event_observed = S2.loc[:, 'Event'], label = 'Stage IV')
ax = km1.plot(ci_show=False)
km2.plot(ax = ax, ci_show=False)
plt.xlabel('time')
plt.ylabel('Survival probability estimate')
plt.savefig('two_km_curves', dpi=300)
Note: If you are not familiar with KaplanMeierFitter, you can check out the following article:
Survival analysis tool: Detailed explanation of the Kaplan-Meier Fitter class in Python
Compare the survival estimates of the two groups at the same time points
survivals = pd.DataFrame([90, 180, 270, 360], columns = ['time'])
survivals.loc[:, 'Group 1'] = km1.survival_function_at_times(survivals['time']).values
survivals.loc[:, 'Group 2'] = km2.survival_function_at_times(survivals['time']).values
Log-Rank test
To check whether there is a statistically significant difference between the survival curves, we can perform a log-rank test. This test gives the probability of observing data at least this extreme if the two curves were truly the same. The derivation of the log-rank test is a bit complicated, but fortunately lifelines has a simple function to calculate it.
def logrank_p_value(group_1_data, group_2_data):
    result = logrank_test(group_1_data.Time, group_2_data.Time,
                          group_1_data.Event, group_2_data.Event)
    return result.p_value
logrank_p_value(S1, S2)
result = 0.009588929834755544
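A p-value below 0.05 is conventionally considered statistically significant, so with p ≈ 0.0096 we conclude that the survival curves of the Stage III and Stage IV groups differ significantly.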
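For readers curious about what a log-rank test computes, here is a from-scratch sketch of the textbook two-group statistic in pure Python. It is not lifelines' implementation, and the function name and toy inputs are hypothetical. At each death time it compares the deaths observed in group A with the number expected if both groups shared one survival curve, then refers the resulting statistic to a chi-square distribution with one degree of freedom.

```python
import math


def logrank_sketch(times_a, events_a, times_b, events_b):
    """Return (chi-square statistic, p-value) for a two-group log-rank test."""
    # Pool both groups, tagging each row with its group (0 = A, 1 = B).
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    death_times = sorted({t for t, e, _ in pooled if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in death_times:
        n = sum(1 for T, _, _ in pooled if T >= t)               # at risk, both groups
        n_a = sum(1 for T, _, g in pooled if T >= t and g == 0)  # at risk, group A
        d = sum(1 for T, e, _ in pooled if T == t and e == 1)    # deaths at t
        d_a = sum(1 for T, e, g in pooled if T == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # the variance term is zero when only one case is at risk
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # nothing in the data separates the groups
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy groups: A dies early, B dies late.
stat, p = logrank_sketch([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(round(stat, 2), round(p, 4))
```

Two identical groups give a statistic of 0 and a p-value of 1, while the clearly separated toy groups above give a small p-value.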
Congratulations!
You have completed the third assignment of Lesson Two. You have studied the Kaplan-Meier estimator, a basic nonparametric estimator in survival analysis. Next week we will learn how to account for patient covariates in survival estimates.