AI for Medical Prognosis (Course 2), Week 3 Assignment: Survival Analysis of Lymphoma Patients with the Kaplan-Meier Method

Welcome to the third assignment of Course 2. In this assignment, we will use Python to build the statistical models introduced last week and apply them to survival estimation on a dataset of lymphoma patients. We will also evaluate these models and interpret their outputs. Along the way, you'll work with the following:

  • Censored Data
  • Kaplan-Meier Estimates
  • Subgroup Analysis

Assignment overview

Assignment name:
Survival Estimates that Vary with Time.ipynb

Assignment location:
github --> bharathikannann/AI-for-Medicine-Specialization-deeplearning.ai --> AI for Medical Prognosis --> Week 3

Import packages

import lifelines
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from util import load_data

from lifelines import KaplanMeierFitter as KM
from lifelines.statistics import logrank_test
  1. lifelines is an open-source library for survival analysis and reliability analysis. It provides common survival analysis tools such as the Kaplan-Meier estimator and the Cox proportional hazards model, making it convenient to process and analyze survival data.

  2. numpy is the fundamental package for scientific computing in Python. It supports multidimensional arrays and matrix operations and underlies many scientific computing and data analysis libraries.

  3. pandas is one of the core Python data analysis libraries, used mainly for data cleaning, processing, transformation, and analysis.

  4. matplotlib is one of the most popular plotting libraries in Python.

Load the dataset

The dataset used in this experiment is the lymphoma data provided by lifelines.

data = load_data()
print("data shape: {}".format(data.shape))
data.head()
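The util module ships with the assignment and is not reproduced in this post. As a rough sketch (an assumption, not the actual helper), load_data could simply wrap the lymphoma dataset bundled with lifelines and rename its Censor column to Event:

# Hypothetical sketch of load_data; the real util.py comes with the assignment.
from lifelines.datasets import load_lymphoma

def load_data():
    df = load_lymphoma()                          # assumed columns: Stage_group, Time, Censor
    df = df.rename(columns={"Censor": "Event"})   # the notebook expects an "Event" column
    return df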


The "Time" column lists how long the patient lived before death or censoring.

The "Event" column shows whether death was observed: Event is 1 if the event (patient death) was observed, and 0 if the data were censored.

Censoring here means that observation ended without the event being observed. For example, suppose we follow patients for at most 100 days. If a patient dies after 44 days, that patient's row is recorded as Time = 44, Event = 1. If a patient is discharged after 100 days and dies 3 days later (103 days in total), the event is not observed within our follow-up window, so the corresponding row is Time = 100, Event = 0. If a patient survives 25 years after admission, their data are still recorded as Time = 100, Event = 0.
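As a quick illustration, the three scenarios above would look like this (a made-up toy table, not rows from the lymphoma dataset):

import pandas as pd

# Toy example: three hypothetical patients matching the scenarios described above
toy = pd.DataFrame({
    "Time":  [44, 100, 100],   # days observed
    "Event": [1,  0,   0]      # 1 = death observed, 0 = censored
})
print(toy)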

Censored data

We can plot a survival time histogram to see how long cases survived before censoring or events.
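A minimal sketch of such a histogram, assuming the data frame loaded above with its Time and Event columns:

# Histogram of observed times, split into censored cases and observed deaths
plt.hist(data[data["Event"] == 0]["Time"], bins=20, alpha=0.5, label="censored")
plt.hist(data[data["Event"] == 1]["Time"], bins=20, alpha=0.5, label="event (death)")
plt.xlabel("Time")
plt.ylabel("Number of cases")
plt.legend()
plt.show()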

Exercise 1

Write a function that calculates the proportion of censored data.

def frac_censored(df):
    """
    Return percent of observations which were censored.
    
    Args:
        df (dataframe): dataframe which contains column 'Event' which is 
                        1 if an event occurred (death)
                        0 if the event did not occur (censored)
    Returns:
        frac_censored (float): fraction of cases which were censored. 
    """
    result = 0.0
    
    ### START CODE HERE ###
    result = len(df[df["Event"] == 0]) / len(df)
    
    ### END CODE HERE ###
    
    return result
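For example, calling the function on the loaded data frame prints the censored fraction (output not shown here):

print("Fraction censored: {:.3f}".format(frac_censored(data)))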

Survival estimate

To illustrate the advantages of the Kaplan-Meier method, we will start with a naive estimator of the survival function. To estimate this quantity, we divide the number of people known to be alive past time t by the number of people who were not censored before time t.

Expressed as:
$$\hat{S}(t) = \frac{|X_t| \;(\text{people known to be alive past time } t)}{|M_t| \;(\text{people not censored before time } t)}$$

where

$X_t = \{i : T_i > t\}$,
$M_t = \{i : e_i = 1 \text{ or } T_i > t\}$.

Note that $M_t$ excludes anyone censored before time $t$ ($e_i = 0$ and $T_i < t$), because their survival status at time $t$ is unknown. In the definition of $M_t$, the condition $e_i = 1$ covers two cases: death before time $t$ and death after time $t$. The condition $T_i > t$ also covers two cases: censoring after time $t$ and death after time $t$. The two conditions overlap, but combined with "or" they give exactly the set of people who were not censored before time $t$.

It's a bit convoluted, so take a moment to work through it yourself.

Once you understand it, the code itself is very simple.

Simple Estimation Function

def naive_estimator(t, df):
    """
    Return naive estimate for S(t), the probability
    of surviving past time t. Given by number
    of cases who survived past time t divided by the
    number of cases who weren't censored before time t.
    
    Args:
        t (int): query time
        df (dataframe): survival data. Has a Time column,
                        which says how long until that case
                        experienced an event or was censored,
                        and an Event column, which is 1 if an event
                        was observed and 0 otherwise.
    Returns:
        S_t (float): estimator for survival function evaluated at t.
    """
    S_t = 0.0
    
    ### START CODE HERE ###
    X_t = len(df[df["Time"]>t])
    M_t = len(df[(df["Time"]>t) | (df["Event"]==1)])
    S_t = X_t/M_t
    ### END CODE HERE ###
    
    return S_t

Now that we know how to estimate survival at a single time t, we can use a for loop to compute the estimate over a range of times, as in the sketch below.
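A minimal sketch of that loop, assuming the data frame and the naive_estimator function defined above:

# Evaluate the naive estimator at every integer time up to the longest observed time
max_time = int(data["Time"].max())
times = range(0, max_time + 1)
naive_S = [naive_estimator(t, data) for t in times]

plt.plot(times, naive_S, label="naive estimator")
plt.xlabel("Time")
plt.ylabel("Estimated survival probability")
plt.legend()
plt.show()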

Kaplan Meier estimates

We next compare this with the Kaplan-Meier estimate. In the cell below, write a function that computes the Kaplan-Meier estimate of $S(t)$ at each distinct time in the dataset.

Recall the Kaplan-Meier estimate:
$$S(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

where the $t_i$ are the event times observed in the dataset, $d_i$ is the number of deaths at time $t_i$, and $n_i$ is the number of people known to be alive (at risk) just before time $t_i$.

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def HomemadeKM(df):
    """
    Return KM estimate evaluated at every distinct
    time (event or censored) recorded in the dataset.
    Event times and probabilities should begin with
    time 0 and probability 1.
    
    Example:
    
    input: 
    
         Time  Censor
    0     5       0
    1    10       1
    2    15       0
    
    correct output: 
    
    event_times: [0, 5, 10, 15]
    S: [1.0, 1.0, 0.5, 0.5]
    
    Args:
        df (dataframe): dataframe which has columns for Time
                          and Event, defined as usual.
                          
    Returns:
        event_times (list of ints): array of unique event times
                                      (begins with 0).
        S (list of floats): array of survival probabilites, so that
                            S[i] = P(T > event_times[i]). This 
                            begins with 1.0 (since no one dies at time
                            0).
    """
    # individuals are considered to have survival probability 1
    # at time 0
    event_times = [0]
    p = 1.0
    S = [p]
    
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    
    # get collection of unique observed event times
    observed_event_times = df["Time"].unique()
  
    # sort event times
    observed_event_times = sorted(observed_event_times)
    
    # iterate through event times
    for t in observed_event_times:
  
        # compute n_t, number of people who survive to time t
        n_t = len(df[df["Time"]>=t])
  
        # compute d_t, number of people who die at time t
        d_t = len(df[(df["Time"]==t) & (df["Event"]==1)])
        
        # update p, the running product of (1 - d_t/n_t)
        p = p * (1 - (d_t/n_t))

        # update S and event_times
        S.append(p)
        event_times.append(t)

    
    ### END CODE HERE ###
  
    return event_times, S
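As a quick sanity check, the output can be drawn as a step function (a sketch, assuming the data frame loaded above):

# Plot the homemade Kaplan-Meier estimate as a step function
event_times, S = HomemadeKM(data)
plt.step(event_times, S, where="post", label="Kaplan-Meier (homemade)")
plt.xlabel("Time")
plt.ylabel("Estimated survival probability")
plt.legend()
plt.show()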

Subgroup analysis

We can see that in addition to the columns for time and censoring information, there is a column called Stage_group in the dataset.

A value of 1 in this column represents a patient with stage III cancer and 2 represents stage IV. We want to compare the survival functions of these two groups of patients.

This time, we will use the KaplanMeierFitter class from lifelines. Run the code block below to fit and plot the Kaplan-Meier curves for each group.

S1 = data[data.Stage_group == 1]
km1 = KM()
km1.fit(S1.loc[:, 'Time'], event_observed = S1.loc[:, 'Event'], label = 'Stage III')

S2 = data[data.Stage_group == 2]
km2 = KM()
km2.fit(S2.loc[:, "Time"], event_observed = S2.loc[:, 'Event'], label = 'Stage IV')

ax = km1.plot(ci_show=False)
km2.plot(ax = ax, ci_show=False)
plt.xlabel('time')
plt.ylabel('Survival probability estimate')
plt.savefig('two_km_curves', dpi=300)

Note: If you are not familiar with how to use KaplanMeierFitter, you can check out the following reference:
Survival analysis tool: a detailed explanation of the KaplanMeierFitter class in Python

Compare the survival of the two groups at the same time points:

survivals = pd.DataFrame([90, 180, 270, 360], columns = ['time'])
survivals.loc[:, 'Group 1'] = km1.survival_function_at_times(survivals['time']).values
survivals.loc[:, 'Group 2'] = km2.survival_function_at_times(survivals['time']).values
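To make the comparison easier to read, one small illustrative addition (not part of the original assignment) is to tabulate the difference and print the result:

# Difference in estimated survival between the two stage groups at each time point
survivals["Difference"] = survivals["Group 1"] - survivals["Group 2"]
print(survivals)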

Log-Rank test

To assess whether there is a statistically significant difference between the survival curves, we can perform a log-rank test. This test gives the probability of observing data at least as extreme as ours if the two underlying survival curves were actually the same. The derivation of the log-rank test is somewhat involved, but fortunately lifelines provides a simple function to compute it.

def logrank_p_value(group_1_data, group_2_data):
    result = logrank_test(group_1_data.Time, group_2_data.Time,
                          group_1_data.Event, group_2_data.Event)
    return result.p_value

logrank_p_value(S1, S2)

result = 0.009588929834755544

Since p < 0.05, the difference in survival between the two groups is considered statistically significant.
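If you want more than the bare p-value, the StatisticalResult object returned by logrank_test also exposes a print_summary method (a sketch, reusing S1 and S2 from the subgroup analysis above):

# Print the full log-rank test summary (test statistic, p-value, etc.)
lr_result = logrank_test(S1.Time, S2.Time, S1.Event, S2.Event)
lr_result.print_summary()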

Congratulations!

You have completed the third assignment of Course 2. You have studied the Kaplan-Meier estimator, a basic nonparametric estimator in survival analysis. Next week we will learn how to take patient covariates into account in survival estimates.

Original post: blog.csdn.net/u014264373/article/details/130712244