Survival function


A survival function gives the probability that a patient, device, or other object of interest will survive beyond a given time. It is also called the survivor function or reliability function. The term reliability function is common in engineering, while survival function is used in a wider range of applications, including human mortality. The survival function is the complementary cumulative distribution function of the lifetime. Sometimes complementary cumulative distribution functions in general are referred to as survival functions.

1. Definition

Let the lifetime $T$ be a continuous random variable on the interval $[0,\infty)$ with cumulative distribution function $F(t)$ and probability density function $f(t)$. Its survival function or reliability function is:

$$S(t)=P(\{T>t\})=\int_{t}^{\infty}f(u)\,\mathrm{d}u=1-F(t)$$

2. Examples of Survival Functions

The figure below shows examples of hypothetical survival functions. The x-axis is time. The y-axis is the proportion of subjects surviving. The graphs show the probability that a subject survives beyond time $t$.

Figure four survival functions


For example, for survival function 1, the probability that the survival time exceeds $t = 2$ months is 0.37. That is, 37% of subjects survive beyond 2 months.

Graph Survival Function 1


For survival function 2, the probability that the survival time exceeds $t = 2$ months is 0.97. That is, 97% of subjects survive beyond 2 months.

Graph Survival Function 2


Median survival can be determined from the survival function. For example, for survival function 2, 50% of subjects survive beyond 3.72 months. Therefore, the median survival time is 3.72 months.
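Reading the median off a curve amounts to solving $S(t) = 0.5$. The sketch below does this by bisection. The function name `median_survival` and the exponential curves are illustrative assumptions and are not the curves from the figures; the rate 0.5 per month is chosen only because it gives $S(2) \approx 0.37$, similar to survival function 1. When the curve is still above 0.5 at the end of the observation window, the function returns `None`, because the median cannot be determined from the observed range.

```python
import math

def median_survival(S, t_max, tol=1e-9):
    """Smallest t in [0, t_max] with S(t) <= 0.5, found by bisection.

    Returns None when S(t_max) > 0.5, i.e. the median lies beyond the
    observation window. Assumes S is continuous and non-increasing.
    """
    if S(t_max) > 0.5:
        return None
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if S(mid) > 0.5:
            lo = mid      # still above 50% survival: median is later
        else:
            hi = mid      # at or below 50% survival: median is earlier
    return hi

# Hypothetical exponential curve with rate 0.5 per month (assumption).
rate = 0.5
S_exp = lambda t: math.exp(-rate * t)

median = median_survival(S_exp, t_max=10.0)   # close to ln(2) / 0.5
# Hypothetical slowly decaying curve: more than half survive past 10 months.
slow = median_survival(lambda t: math.exp(-0.01 * t), t_max=10.0)  # None
```

For an exponential curve the result can be checked against the closed form $\ln 2 / \lambda$.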

Median survival of graph survival function


In some cases, the median survival cannot be determined from the graph. For example, for survival function 4, more than 50% of the subjects survived beyond the 10-month observation period.

The median survival time in the graph is greater than 10 months


Survival functions are one of several ways to describe and display survival data. Another useful way to display data is a graph showing the distribution of survival times across subjects. Olkin, p. 426 gives the following example of survival data: the number of hours between successive failures of an air conditioning system was recorded. The times between successive failures are 1, 3, 5, 7, 11, 11, 11, 12, 14, 14, 14, 16, 16, 20, 21, 23, 42, 47, 52, 62, 71, 71, 87, 90, 95, 120, 120, 225, 246 and 261 hours. The mean time between failures is 59.6 hours. This mean will be used shortly to fit a theoretical curve to the data. The figure below shows the distribution of the time between failures. The blue tick marks beneath the graph are the actual hours between successive failures.

Figure AC failure frequency distribution


The distribution of failure times is overlaid with a curve representing an exponential distribution. For this example, the exponential distribution approximates the distribution of failure times. The exponential curve is a theoretical distribution fitted to the actual failure times. This particular exponential curve is specified by the parameter lambda, $\lambda = 1/(\text{mean time between failures}) = 1/59.6 = 0.0168$. If time can take any positive value, the distribution of failure times is called a probability density function (pdf). In equations, the pdf is written $f(t)$. If time can only take discrete values (e.g. 1 day, 2 days, etc.), the distribution of failure times is called a probability mass function (pmf). Most survival analysis methods assume that time can take any positive value, so that $f(t)$ is the pdf. If an exponential function is used to approximate the observed times between AC failures, then the exponential curve gives the probability density function $f(t)$.
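The fit described above can be reproduced in a few lines. This is a minimal sketch that uses the 30 failure times quoted earlier; the names `failure_gaps`, `mean_gap`, `lam` and `exp_pdf` are illustrative and not from the source.

```python
import math

# Hours between consecutive air-conditioning failures (data quoted above).
failure_gaps = [1, 3, 5, 7, 11, 11, 11, 12, 14, 14, 14, 16, 16, 20, 21, 23,
                42, 47, 52, 62, 71, 71, 87, 90, 95, 120, 120, 225, 246, 261]

mean_gap = sum(failure_gaps) / len(failure_gaps)   # mean time between failures
lam = 1.0 / mean_gap                               # exponential rate parameter

def exp_pdf(t, lam):
    """Exponential density f(t) = lam * exp(-lam * t) for t >= 0."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0

print(f"mean = {mean_gap:.1f} hours, lambda = {lam:.4f} per hour")
```

The printed values agree with the mean of 59.6 hours and $\lambda = 0.0168$ given in the text.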

Another useful way to display survival data is a graph showing the cumulative failures up to each point in time. These data can be shown as cumulative counts or as cumulative proportions of failures. The graph below shows the cumulative probability (or proportion) of failures of the air conditioning system up to each time. The black stepped line shows the cumulative proportion of failures. For each step, there is a blue tick at the bottom of the graph marking an observed failure time. The smooth red line represents the exponential curve fitted to the observed data.

The plot of the cumulative probability of failure up to each point in time is called the cumulative distribution function, or CDF. In survival analysis, the cumulative distribution function gives the probability that the survival time is less than or equal to a specific time $t$.

Let $T$ be the survival time, which is any positive number. A specific time is denoted by the lowercase letter $t$. The cumulative distribution function of $T$ is the function:

$$F(t)=\operatorname{P}(T\leq t)$$

where the right-hand side represents the probability that the random variable $T$ is less than or equal to $t$. If time can take any positive value, the cumulative distribution function $F(t)$ is the integral of the probability density function $f(t)$.

For the air conditioner example, the CDF chart below illustrates that the probability of a failure time of 100 hours or less is 0.81, estimated using an exponential curve fitted to the data.

Figure AC time to failure of 100 hours or less


An alternative to plotting the probability of a failure time of less than or equal to 100 hours is to plot the probability of a failure time of greater than 100 hours. The probability that the failure time is greater than 100 hours must be 1 minus the probability that the failure time is less than or equal to 100 hours, because the total probabilities must sum to 1.

This gives:

$$P(\text{failure time} > 100\ \text{hours}) = 1 - P(\text{failure time} \leq 100\ \text{hours}) = 1 - 0.81 = 0.19$$

This relationship generalizes to all failure times:

$$P(T > t) = 1 - P(T \leq t) = 1 - \text{cumulative distribution function}$$
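As a quick numerical check of this relationship for the air conditioner example, the following sketch evaluates the exponential CDF and its complement at 100 hours, using the rate $\lambda = 1/59.6$ from the text. The function names are illustrative.

```python
import math

lam = 1 / 59.6   # rate from the mean time between failures given in the text

def exp_cdf(t, lam):
    """F(t) = P(T <= t) = 1 - exp(-lam * t) for an exponential lifetime."""
    return 1.0 - math.exp(-lam * t)

def exp_survival(t, lam):
    """S(t) = P(T > t) = 1 - F(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

p_fail_by_100 = exp_cdf(100, lam)        # about 0.81
p_survive_100 = exp_survival(100, lam)   # about 0.19
```

The two probabilities sum to 1 at every $t$, which is exactly the complement relationship stated above.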

This relationship is shown in the figure below. The left panel is the cumulative distribution function, that is, $P(T \leq t)$. The right panel is $P(T > t) = 1 - P(T \leq t)$, which is the survival function $S(t)$. The fact that $S(t) = 1 - \text{CDF}$ is the reason why another name for the survival function is the complementary cumulative distribution function.

Graph Survival Function for 1 - CDF

3. Parametric survival function

In some cases, such as the air conditioner example, the distribution of survival times can be well approximated by a function such as the exponential distribution. Several distributions are commonly used in survival analysis, including the exponential, Weibull, gamma, normal, lognormal, and log-logistic. These distributions are defined by parameters. For example, the normal (Gaussian) distribution is defined by two parameters, the mean and the standard deviation. A survival function defined by parameters is said to be parametric.

In the four survival function plots shown above, the shape of each survival function is defined by a specific probability distribution: survival function 1 is defined by an exponential distribution, 2 by a Weibull distribution, 3 by a log-logistic distribution, and 4 by another Weibull distribution.

3.1 Exponential survival function

For an exponential survival distribution, the probability of failure is the same in every time interval, regardless of the age of the individual or device. This fact leads to the "memoryless" property of the exponential survival distribution: the subject's age has no effect on the probability of failure in the next time interval. The exponential can be a good model for the lifetime of a system in which parts are replaced as they fail. It can also be used to model the survival of living organisms over short intervals. It is unlikely to be a good model of an organism's complete lifespan. As Efron and Hastie (p. 134) point out, "If human life spans were exponential, then there would be no old or young, only the lucky or the unlucky".
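The memoryless property can be checked numerically: the conditional probability of surviving a further $t$ units, given survival to age $s$, equals $S(s+t)/S(s)$, which for the exponential does not depend on $s$. The rate and ages below are hypothetical values chosen only for illustration.

```python
import math

def exp_survival(t, lam):
    """S(t) = exp(-lam * t) for an exponential lifetime."""
    return math.exp(-lam * t)

def conditional_survival(s, t, lam):
    """P(T > s + t | T > s) = S(s + t) / S(s)."""
    return exp_survival(s + t, lam) / exp_survival(s, lam)

lam = 0.1                          # hypothetical constant hazard rate
fresh = exp_survival(5, lam)       # a new unit surviving 5 more time units
# Units already aged 0, 10 and 50 surviving 5 more time units.
aged = [conditional_survival(s, 5, lam) for s in (0, 10, 50)]
```

Every entry of `aged` equals `fresh` up to floating-point rounding, so age carries no information about future failure.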

3.2 Weibull survival function

A key assumption of the exponential survival function is that the hazard rate is constant. For instance, if the proportion of males who die each year is a constant 10%, the hazard rate is constant. The assumption of constant hazard may not be appropriate. For example, in most organisms, the risk of dying in old age is greater than in middle age, that is, the hazard rate increases over time. For some diseases, such as breast cancer, the risk of recurrence is lower after 5 years, that is, the hazard rate decreases over time. The Weibull distribution extends the exponential distribution to allow for constant, increasing, or decreasing hazard rates.
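The three hazard regimes can be illustrated with the common shape and scale parameterisation of the Weibull distribution, $S(t) = \exp[-(t/\text{scale})^{\text{shape}}]$, whose hazard is $(\text{shape}/\text{scale})\,(t/\text{scale})^{\text{shape}-1}$. This parameterisation and the parameter values are assumptions of the sketch; the source does not specify them.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Hazard of a Weibull lifetime with S(t) = exp(-(t/scale)**shape), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

times = [0.5, 1.0, 2.0, 4.0]

# shape < 1: decreasing hazard; shape == 1: constant (exponential);
# shape > 1: increasing hazard.
hazards_by_shape = {
    shape: [weibull_hazard(t, shape) for t in times]
    for shape in (0.5, 1.0, 2.0)
}

for shape, hazards in hazards_by_shape.items():
    print(shape, [round(h, 3) for h in hazards])
```

With shape 1 the hazard is flat, which recovers the exponential case as a special case of the Weibull.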

3.3 Other parametric survival functions

There are several other parametric survival functions that may fit a particular data set better, including the normal, lognormal, log-logistic, and gamma. A parametric distribution can be chosen for a particular application using graphical methods or formal tests of fit. These distributions and tests are described in survival analysis textbooks. Lawless covers parametric models extensively.

Parametric survival functions are often used in manufacturing applications, in part because of their ability to estimate survival functions beyond the observation period. However, proper use of parametric functions requires that the data be well modeled by the chosen distribution. Nonparametric survival functions provide a useful alternative when a suitable distribution is not available, or cannot be specified prior to a clinical trial or experiment.

4. Nonparametric Survival Functions

A parametric model of survival may not be possible or desirable. In these cases, the most common way to model the survival function is the nonparametric Kaplan–Meier estimator.

5. Properties

  • Every survival function $S(t)$ is monotonically decreasing (non-increasing), that is, $S(u)\leq S(t)$ for all $u>t$.
    • It is a property of a random variable that maps a set of events, usually associated with the mortality or failure of some system, onto time.
  • The time $t=0$ represents some origin, typically the beginning of a study or the start of operation of some system. $S(0)$ is commonly unity, but can be less to represent the probability that the system fails immediately upon operation.
  • Since the CDF is a right-continuous function, the survival function $S(t)=1-F(t)$ is also right-continuous.
  • The survival function can be related to the probability density function $f(t)$ and the hazard function $\lambda(t)$:

$$f(t)=-S'(t)$$
$$\lambda(t)=-\frac{\mathrm{d}}{\mathrm{d}t}\log S(t)$$

so:

$$S(t)=\exp\left[-\int_{0}^{t}\lambda(t')\,\mathrm{d}t'\right]$$

  • Expected survival time:

$$\mathbb{E}(T)=\int_{0}^{\infty}S(t)\,\mathrm{d}t$$
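Two of these properties can be checked numerically for a simple case. The sketch below assumes an exponential survival function with a hypothetical constant hazard of 0.5; it recovers the hazard from $-\mathrm{d}\log S/\mathrm{d}t$ by a finite difference and approximates the expected lifetime by integrating $S(t)$ with the trapezoidal rule. The helper names and the truncation point of the integral are assumptions of the sketch.

```python
import math

lam = 0.5  # hypothetical constant hazard

def S(t):
    """Exponential survival function S(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def hazard_from_S(t, h=1e-6):
    """lambda(t) = -d/dt log S(t), via a central finite difference."""
    return -(math.log(S(t + h)) - math.log(S(t - h))) / (2 * h)

def expected_lifetime(surv, upper=200.0, n=200_000):
    """Trapezoidal approximation of the integral of surv(t) over [0, upper].

    The infinite upper limit is truncated at `upper`, which is harmless
    here because the exponential tail beyond it is negligible.
    """
    dt = upper / n
    total = 0.5 * (surv(0.0) + surv(upper))
    total += sum(surv(i * dt) for i in range(1, n))
    return total * dt

hazard_at_3 = hazard_from_S(3.0)      # close to lam = 0.5
mean_life = expected_lifetime(S)      # close to 1 / lam = 2.0
```

For the exponential case both results have closed forms, $\lambda(t) = \lambda$ and $\mathbb{E}(T) = 1/\lambda$, so the numerical values can be compared against them directly.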

6. Kaplan–Meier estimator

The Kaplan–Meier estimator, also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. In other fields, Kaplan–Meier estimators may be used to measure the length of time people remain unemployed after a job loss, the time to failure of machine parts, or how long fleshy fruits remain on plants before they are removed by frugivores. The estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the Journal of the American Statistical Association. The journal editor, John Tukey, convinced them to combine their work into a single paper, which has been cited more than 61,800 times since its publication in 1958.

The estimator of the survival function $S(t)$ (the probability that life is longer than $t$) is given by:

$$\widehat{S}(t)=\prod_{i:\ t_{i}\leq t}\left(1-\frac{d_{i}}{n_{i}}\right)$$

where $t_{i}$ is a time when at least one event happened, $d_{i}$ is the number of events (e.g., deaths) that happened at time $t_{i}$, and $n_{i}$ is the number of individuals known to have survived (not yet had an event or been censored) up to time $t_{i}$.
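The product formula can be sketched in plain Python as follows. The function name `kaplan_meier`, the input layout (one observed time and one event flag per subject) and the toy data are assumptions made for illustration. Subjects censored at a time $t_i$ are counted as still at risk at $t_i$, which is the usual convention.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] at each distinct event time.

    times  : observed time for each subject (event or censoring time)
    events : 1 if the event was observed at that time, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)  # d_i per time
    leaving = Counter(times)      # subjects leaving the risk set at each time
    n_at_risk = len(times)        # n_i: everyone is at risk before the first time
    s_hat = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            s_hat *= 1.0 - d / n_at_risk   # multiply in the factor (1 - d_i / n_i)
            curve.append((t, s_hat))
        n_at_risk -= leaving[t]            # events and censored both leave
    return curve

# Hypothetical toy data: five subjects, two of them right-censored.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

For the toy data the factors are $(1 - 1/5)$, $(1 - 1/4)$ and $(1 - 1/2)$, so the estimate steps down to 0.8, 0.6 and 0.3 at times 1, 2 and 3. The censored observation at time 4 changes the risk set but produces no step.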

6.1 Basic concepts

The plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, given a sufficiently large sample size, approaches the true survival function for the population. The value of the survival function between successive distinct sampled observations ("clicks") is assumed to be constant.

An important advantage of the Kaplan–Meier curve is that the method can take into account certain types of censored data, particularly right-censoring, which occurs if a patient withdraws from the study, is lost to follow-up, or has had no event by the last follow-up. On the plot, small vertical tick marks indicate individual patients whose survival times have been right-censored. When no truncation or censoring occurs, the Kaplan–Meier curve is the complement of the empirical distribution function.

In medical statistics, a typical application might involve grouping patients into categories, for instance, those with a gene A profile and those with a gene B profile. In the graph, patients with gene B die more quickly than patients with gene A. After two years, about 80 percent of patients with gene A survive, but less than half of those with gene B.

To generate a Kaplan–Meier estimator, at least two pieces of data are required for each patient (or each subject): the status at last observation (event occurrence or right-censoring) and the time to event (or time to censoring). If the survival functions of two or more groups are to be compared, a third piece of data is required: the group assignment of each subject.

7. Proportional hazards model

Proportional hazards models are a class of survival models in statistics. Survival models relate the time that passes before some event occurs to one or more covariates that may be associated with that quantity of time. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For example, taking a drug may halve a person's hazard rate for a stroke, or changing the material from which a manufactured part is made may double its hazard rate for failure. Other types of survival models, such as accelerated failure time models, do not exhibit proportional hazards. Accelerated failure time models describe situations in which the biological or mechanical life history of an event is accelerated (or decelerated).

An example of a Kaplan–Meier plot for two conditions associated with patient survival.

7.1 Background

A survival model can be viewed as consisting of two parts: the underlying baseline hazard function, usually denoted $\lambda_{0}(t)$, which describes how the risk of the event per time unit changes over time at baseline levels of the covariates; and the effect parameters, which describe how the hazard varies in response to explanatory covariates. A typical medical example would include covariates such as treatment assignment, as well as patient characteristics such as age at the start of the study, sex, and the presence of other diseases at the start of the study, in order to reduce variability and/or control for confounding.

The proportional hazards condition states that covariates are multiplicatively related to the hazard. For example, in the simplest case of fixed coefficients, a drug treatment might halve a subject's hazard at any given time $t$, while the baseline hazard may vary. Note, however, that this does not double the lifetime of the subject; the precise effect of the covariates on the lifetime depends on the form of $\lambda_{0}(t)$. Covariates are not limited to binary predictors; in the case of a continuous covariate $x$, it is usually assumed that the hazard responds exponentially, so that each unit increase in $x$ results in a proportional scaling of the hazard.
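To see how the effect on lifetime depends on the baseline hazard, the sketch below compares median lifetimes under a halved hazard for two hypothetical Weibull baselines. Under proportional hazards, multiplying the hazard by a factor $r$ gives $S(t) = S_0(t)^{r}$; with the assumed baseline $S_0(t) = \exp[-(t/\text{scale})^{\text{shape}}]$, the median has a closed form. The function name and parameter values are illustrative assumptions.

```python
import math

def weibull_median(shape, scale, hazard_ratio=1.0):
    """Median lifetime when a Weibull baseline hazard is multiplied by hazard_ratio.

    Baseline:             S0(t) = exp(-(t / scale) ** shape)
    Proportional hazards: S(t)  = S0(t) ** hazard_ratio
    Solving S(t) = 0.5 gives t = scale * (ln 2 / hazard_ratio) ** (1 / shape).
    """
    return scale * (math.log(2) / hazard_ratio) ** (1.0 / shape)

median_gain = {}
for shape in (1.0, 2.0):   # constant hazard vs. hazard increasing with time
    base = weibull_median(shape, scale=10.0)
    treated = weibull_median(shape, scale=10.0, hazard_ratio=0.5)
    median_gain[shape] = treated / base
```

With a constant baseline hazard (shape 1) halving the hazard happens to double the median, while with an increasing baseline hazard (shape 2) the median grows only by a factor of $\sqrt{2}$. The same hazard ratio therefore translates into different lifetime gains depending on $\lambda_0(t)$.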

7.2 Problem Definition

Let $\tau \geq 0$ be a random variable, which we think of as the time until an event of interest takes place. As indicated above, the goal is to estimate the survival function $S$ underlying $\tau$. This function is defined as:

$$S(t)=\operatorname{Prob}(\tau>t),\quad\text{where } t=0,1,\dots \text{ is the time.}$$

Let $\tau_{1},\dots,\tau_{n}\geq 0$ be independent, identically distributed random variables whose common distribution is that of $\tau$: $\tau_{j}$ is the random time when some event $j$ happened. The data available for estimating $S$ is not $(\tau_{j})_{j=1,\dots,n}$, but the list of pairs $((\tilde{\tau}_{j},c_{j}))_{j=1,\dots,n}$ where, for $j\in[n]:=\{1,2,\dots,n\}$, $c_{j}\geq 0$ is a fixed, deterministic integer, the censoring time of event $j$, and $\tilde{\tau}_{j}=\min(\tau_{j},c_{j})$. In particular, the information available about the timing of event $j$ is whether the event happened before the fixed time $c_{j}$ and, if so, the actual time of the event is also available. The challenge is to estimate $S(t)$ given this data.
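The observation scheme can be made concrete with a small sketch that turns hypothetical true event times and censoring times into the observed pairs. The function name `observe`, the extra boolean flag, and the choice to treat a tie ($\tau_j = c_j$) as an observed event are assumptions of this sketch.

```python
def observe(event_times, censor_times):
    """Build the observed data (tau_tilde_j, c_j, seen_j) for each j.

    tau_tilde_j = min(tau_j, c_j) is what the analyst actually records.
    seen_j is True when the event was observed (tau_j <= c_j); treating
    a tie as observed is an assumption of this sketch.
    """
    data = []
    for tau, c in zip(event_times, censor_times):
        data.append((min(tau, c), c, tau <= c))
    return data

taus = [3, 8, 5, 12]   # hypothetical true event times (not fully known in practice)
cs = [10, 6, 5, 10]    # hypothetical fixed censoring times
pairs = observe(taus, cs)
```

Only `pairs` is available to the estimator: for the second and fourth subjects the true times 8 and 12 are hidden, and all that is known is that the event had not occurred by times 6 and 10.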

7.3 The Cox model

7.3.1 Introduction

Sir David Cox observed that if the proportional hazards assumption holds (or is assumed to hold), then it is possible to estimate the effect parameters, denoted $\beta_{i}$ below, without any consideration of the full hazard function. This approach to survival data is called the application of the Cox proportional hazards model, sometimes abbreviated to Cox model or proportional hazards model. However, Cox also noted that the biological interpretation of the proportional hazards assumption can be tricky.

Let $X_{i}=(X_{i1},\dots,X_{ip})$ be the realized values of the covariates for subject $i$. The hazard function for the Cox proportional hazards model has the following form:

$$\begin{aligned}\lambda(t\mid X_{i})&=\lambda_{0}(t)\exp(\beta_{1}X_{i1}+\cdots+\beta_{p}X_{ip})\\&=\lambda_{0}(t)\exp(X_{i}\cdot\beta).\end{aligned}$$

This expression gives the hazard function at time $t$ for subject $i$ with covariate vector (explanatory variables) $X_{i}$. Note that between subjects, the baseline hazard $\lambda_{0}(t)$ is the same (it does not depend on $i$). The only difference between subjects' hazards comes from the baseline scaling factor $\exp(X_{i}\cdot\beta)$.
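Because the baseline hazard is shared, it cancels when the hazards of two subjects are compared. The sketch below evaluates the Cox hazard for two hypothetical subjects and shows that their ratio is the same at every time. The coefficients, covariates and baseline hazard are invented for illustration only.

```python
import math

def cox_hazard(t, x, beta, baseline_hazard):
    """lambda(t | x) = lambda_0(t) * exp(x . beta)."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard(t) * math.exp(linear_predictor)

def hazard_ratio(x_a, x_b, beta):
    """Ratio lambda(t | x_a) / lambda(t | x_b); lambda_0(t) cancels."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_a, x_b)))

# Hypothetical model: covariates are (treated indicator, age in years).
beta = [math.log(0.5), 0.03]       # treatment halves the hazard (assumption)
baseline = lambda t: 0.01 * t      # hypothetical increasing baseline hazard

treated = [1, 60]
control = [0, 60]

ratio_at_2 = cox_hazard(2, treated, beta, baseline) / cox_hazard(2, control, beta, baseline)
ratio_at_7 = cox_hazard(7, treated, beta, baseline) / cox_hazard(7, control, beta, baseline)
ratio_closed_form = hazard_ratio(treated, control, beta)
```

All three ratios equal 0.5 up to rounding, even though the hazards themselves grow with $t$. This time-independence of the ratio is what the word "proportional" refers to.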


  • references

wiki: Survival function

wiki: Kaplan–Meier estimator

wiki: Proportional hazards model

Origin blog.csdn.net/qq_32515081/article/details/130190109