April 2, 2019

Using Survival Methods for Business Forecasting

by Business Forecasting
in Advanced Analytics

Introduction

Survival methods are a cornerstone of how the Business Forecasting team at Enova performs its analysis and modeling. Before we jump headfirst into the survival methods used at Enova, it is worthwhile to first look at what it actually means to perform such an analysis.

As such, this blog post is organized into three short sections: a brief introduction to survival analysis in general, the justification for the use of survival methods at Enova, and finally a brief description of how Enova’s Business Forecasting team performs survival modeling and analysis.

Section 1: A brief introduction to Survival Analysis

Survival methods attempt to describe and / or predict the survival profile of every given observation. They often do not start off with a singular response (dependent variable), but a tuple that denotes an event and a corresponding time duration. Indeed, survival models are quite different from classification models (that classify observations into different categories) and regression models (that attempt to obtain a real number value for each observation) in both the manner of describing the profile of each observation and the techniques employed to analyze the given data.

There are several assumptions that underlie the use of survival methods in statistics and data science. One of the most fundamental is that each observation is exposed to at least one type of event. Without this premise, it would not be logical to describe an observation as having “survived” or “died”. The exposure to such an event can be thought of as a “risk” and is often described through measures such as the survival probability, the hazard rate, and the cumulative hazard rate.

Censoring is the phenomenon where an event was not observed but could happen or could have happened. Right-censoring describes the case where the study ended or the observation left the study before the event happened, and left-censoring describes the case where the event had already happened before the study began. Right-censoring is the most common type of censoring, and we will not touch further on left-censoring in this blog post. Notably, survival methods are attractive candidates for modeling and analysis precisely because they can appropriately handle censored observations, which still carry pertinent information.

Let’s take a look at a simple example:

Suppose we have a machine that would eventually fail. Suppose further that the machine failed after surviving for 80 days. If we only managed to observe the machine for 79 days, the observation would be right-censored, and our response would be encoded as (time=79, event=0). If instead we observed the machine until its failure, the response would be encoded as (time=80, event=1). One important point: if the observation is right-censored, the response time is the total time of observation, whereas if it is not right-censored, the response time is the time from the start of observation until the event.

Let’s suppose further that the machine is exposed to both the risk of failure and a power-outage. We have a scenario of competing-risks, where each observation is exposed to the risk of multiple events occurring, all of which compete for the “death” of the observation. In such a scenario, how we encode the event given the tuple-response constraint matters tremendously. If the machine is not censored and failed at day 80, our response would be (time=80, event=1), if the event refers explicitly to “machine failure”. On the flip side, if the event refers explicitly to “power-outage”, then the response would be (time=80, event=0) even though the machine did fail at day 80.
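As a minimal sketch of the encoding described above, the snippet below reproduces the three responses from the machine example. The helper name `encode_response`, its parameters, and the cause labels "failure" and "power_outage" are illustrative assumptions, not part of any library.

```python
from typing import NamedTuple, Optional


class Response(NamedTuple):
    time: int   # days observed (if censored) or days until the machine stopped
    event: int  # 1 if the event of interest was observed, else 0


def encode_response(observed_days: int,
                    end_day: Optional[int],
                    end_cause: Optional[str],
                    event_of_interest: str) -> Response:
    """Hypothetical helper: encode one machine as a (time, event) tuple."""
    # Right-censored: we stopped watching before the machine stopped
    if end_day is None or observed_days < end_day:
        return Response(observed_days, 0)
    # The machine stopped while observed; event=1 only for the cause of interest
    return Response(end_day, int(end_cause == event_of_interest))


print(encode_response(79, 80, "failure", "failure"))       # Response(time=79, event=0)
print(encode_response(80, 80, "failure", "failure"))       # Response(time=80, event=1)
print(encode_response(80, 80, "failure", "power_outage"))  # Response(time=80, event=0)
```

Note that the last two calls describe the same machine; only the choice of the event of interest changes the encoded response.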

Of course, once we have many such observations, we can obtain several measures of “risk” from the sample. The most common and straightforward approaches are to look at the hazard rate, the cumulative hazard rate, and the survival probability.

Briefly, the hazard rate is the probability of death at a given time, conditional on the observation having survived up until that time. If we have a random variable T denoting the time until event for a given observation, then in discrete time the hazard rate can be expressed as h(t) = P(T=t|T>t-1). Correspondingly, the cumulative hazard rate is simply the cumulative sum of the hazard rate up to a given time-point: C(t) = sum(h(z) | 0<z<=t). Lastly, the survival probability is the probability that the observation survives past a given time t: S(t) = P(T>t).
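For reference, here are the same three measures in LaTeX, assuming discrete integer time-points z = 1, 2, ..., t. The product form of S(t) is not stated in the text above; it follows from surviving past t meaning not dying at each time-point up to t.

```latex
h(t) = P(T = t \mid T > t-1), \qquad
C(t) = \sum_{z=1}^{t} h(z), \qquad
S(t) = P(T > t) = \prod_{z=1}^{t} \bigl(1 - h(z)\bigr)
```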

Next, how do we obtain unbiased and consistent estimators for these measures? This depends on whether the problem at hand is parametric or not. If the problem is parametric, then this can be answered relatively easily given the random variable(s) and the parametric distribution(s) of choice. This commonly goes along the lines of obtaining the likelihood function, taking derivatives with respect to the specified parameters to obtain their maximum likelihood estimators, estimating the value of these maximum likelihood estimates from the data, and finally reinserting the estimated parameters into the expressions for the hazard rate, cumulative hazard rate, survival probability, etc. If the problem is non-parametric, then it can be shown that the maximum likelihood estimates of the cumulative hazard rate and survival probability are given by the Nelson-Aalen estimator and the Kaplan-Meier estimator respectively, with Greenwood’s formula providing the variance estimate for the latter. The hazard rate can then be obtained via simple subtraction for discrete times (due to the probability measure being defined on subsets of the event space), or differentiation for continuous times. For the rest of this blogpost, we will treat this as a non-parametric / semi-parametric problem to keep the conversation more flexible.

To concretely illustrate how we can estimate these measures of risk from a sample, let’s start off with a tiny sample of 5 observations:

Observation Number	Time to event	Event
1	3	0
2	4	1
3	2	1
4	5	0
5	6	1

For this example, we assume discrete time-points for ease of illustration. Here, our response is the tuple: (time to event, event). If we assume that all observations began at the same time, then observations 1 and 4 are right-censored.

Estimated hazard rate:

h(1) = P(T=1|T>0) = 0

h(2) = P(T=2|T>1) = 0.2 (because of all the observations that survive past t=1, one out of five died at t=2)

h(3) = P(T=3|T>2) = 0

h(4) = P(T=4|T>3) = 0.333 (because of all the observations that survive past t=3, one out of three died at t=4)

h(5) = P(T=5|T>4) = 0

h(6) = P(T=6|T>5) = 1 (because the final observation died)

Cumulative hazard rate (Nelson-Aalen estimator):

C(1) = 0

C(2) = 0.2

C(3) = 0.2

C(4) = 0.533 = 0.2 + 0.333

C(5) = 0.533

C(6) = 1.533

Survival probabilities (Kaplan-Meier estimator):

S(1) = 1

S(2) = 0.8 = (1-0.2)

S(3) = 0.8 (because no observations died at t=3)

S(4) = 0.533 = 0.8*(1- 1/3)

S(5) = 0.533 (because no observations died at t=5)

S(6) = 0 (because beyond t=6, no observations are known to be alive)
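As a quick check of the numbers above, here is a minimal pure-Python sketch that recomputes the hazard rate, the Nelson-Aalen cumulative hazard, and the Kaplan-Meier survival probability for the 5-observation sample. The function name `survival_estimates` is our own choice for illustration, and the sketch assumes discrete integer time-points starting at t=1.

```python
def survival_estimates(times, events):
    """Return {t: (hazard, cumulative_hazard, survival)} for t = 1..max(times)."""
    results = {}
    cumulative_hazard = 0.0
    survival = 1.0
    for t in range(1, max(times) + 1):
        # Observations still under observation at time t (not yet dead or censored)
        at_risk = sum(1 for time in times if time >= t)
        # Observed deaths at exactly time t (censored observations do not count)
        deaths = sum(1 for time, event in zip(times, events)
                     if time == t and event == 1)
        hazard = deaths / at_risk if at_risk else 0.0
        cumulative_hazard += hazard   # Nelson-Aalen: running sum of hazards
        survival *= 1.0 - hazard      # Kaplan-Meier: running product of (1 - hazard)
        results[t] = (hazard, cumulative_hazard, survival)
    return results


times = [3, 4, 2, 5, 6]    # "Time to event" column
events = [0, 1, 1, 0, 1]   # "Event" column (0 = right-censored)

for t, (h, c, s) in survival_estimates(times, events).items():
    print(f"t={t}: h={h:.3f}  C={c:.3f}  S={s:.3f}")
# t=1: h=0.000  C=0.000  S=1.000
# t=2: h=0.200  C=0.200  S=0.800
# t=3: h=0.000  C=0.200  S=0.800
# t=4: h=0.333  C=0.533  S=0.533
# t=5: h=0.000  C=0.533  S=0.533
# t=6: h=1.000  C=1.533  S=0.000
```

The censored observations (1 and 4) never count as deaths, but they do remain in the at-risk set up to and including their censoring time, which is how they contribute information to the estimates.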

Graphically, the estimated survival probabilities would conform to this form:

[Figure: estimated Kaplan-Meier survival curve with shaded confidence interval]

Note that the large grey shaded region corresponds to the confidence interval. The CI is wide (i.e. the estimated standard error is large) because this example has only 5 data points, of which only 3 are observed deaths, each at a different time.

Likewise, the cumulative hazard rates would look like this:

[Figure: estimated Nelson-Aalen cumulative hazard curve]

All these estimates provide us with a solid foundation for describing the survival profile of the sample and should be viewed in the same manner as other summary statistics such as the mean and standard deviation. From the perspective of parametric statistics, one could think of these estimates as maximum likelihood estimators for “hidden” parameters of the survival distribution.

Section 2: Why do we use survival methods at Enova?

Enova, being an online lender, has a collection of loans with outstanding accounts receivable for each product. A major impetus to forecast each loan’s performance throughout its entire lifetime comes from the introduction of a new accounting standard by the Financial Accounting Standards Board (FASB) in June 2016. The Current Expected Credit Loss (CECL) standard requires lenders to estimate allowances for credit losses over the lifetime of their financial instruments. To put it simply, lenders are expected to budget an allowance for loan losses to offset potential charge-off risks on all scheduled future payments on each loan until the end of its loan term.

Ideally, we would collect on the entirety of every loan’s scheduled payments into the future on the scheduled due dates, but in reality, each loan is exposed to multiple “risks”, including the risk of charging-off.

Hence, we primarily apply survival techniques with 2 main objectives: 1) to remain compliant with CECL standards, we must determine how much to budget for loan loss reserves over the entirety of every loan’s lifetime, and 2) we must accurately reflect the changes in the performance of each loan throughout its lifetime in order to estimate the present Lifetime Value of every loan on the books. With these objectives, we are especially concerned with the charge-off event (where a loan cannot be repaid and is recorded in our accounting books as a loss).

Section 3: How do we use survival approaches at Enova’s Business Forecasting team?

As mentioned, the survival response is not a singular value but instead (at least) a tuple denoting an event along with the time to event.

For each loan, it seems intuitive to define T as T = time of last observation of loan with active status – funding time, and the event of interest as the charge-off event.

However, if we were to have the event denote the charge-off event, and T = time of last observation of loan with active status – funding time, we face a problem: what about competing risks? We chose the most straightforward approach, to instead re-express T as T = time of last observation of loan until charge-off status – funding time. This means that any loan that pays off the entirety of its due payments will not be treated as terminated but instead lives forever. In doing so, we must be aware of the fact that the entire analysis is now from the perspective of a singular risk, almost akin to “marginalizing” over the other risks. Therefore, there will be several estimates, e.g. time to failure, that would not be accurately obtained from such a perspective, since each loan could have “failed” from other non-charge-off events.
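To make this single-risk encoding concrete, here is a minimal sketch under stated assumptions: the function name `loan_response`, the status labels ("active", "paid_off", "charged_off"), and the use of days as the time unit are all hypothetical choices for illustration, not a description of Enova's actual data model.

```python
from datetime import date
from typing import Optional, Tuple


def loan_response(funding_date: date,
                  snapshot_date: date,
                  status: str,
                  charge_off_date: Optional[date] = None) -> Tuple[int, int]:
    """Encode a loan as (time in days, event), where event=1 means charge-off."""
    if status == "charged_off":
        if charge_off_date is None:
            raise ValueError("a charged-off loan needs a charge-off date")
        return (charge_off_date - funding_date).days, 1
    # Active and paid-off loans alike are treated as still "alive" at the
    # snapshot: a paid-off loan is not terminated, it simply never charges off.
    return (snapshot_date - funding_date).days, 0


funded = date(2019, 1, 1)
snapshot = date(2019, 4, 1)

print(loan_response(funded, snapshot, "active"))                         # (90, 0)
print(loan_response(funded, snapshot, "paid_off"))                       # (90, 0)
print(loan_response(funded, snapshot, "charged_off", date(2019, 3, 2)))  # (60, 1)
```

Under this encoding a paid-off loan keeps accumulating observation time at every later snapshot, which is exactly the “lives forever” behavior described above and the reason quantities such as time to failure should be read with care.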

Now, each observation can be associated with a set of features that describe it. This means that we can take this a step further and attempt the use of survival models instead of summary statistics to describe the likely survival distribution of individual observations.

Here at Enova, we are, of course, at liberty to continually predict on our active portfolio to obtain an updated survival forecast for every loan as new information about each loan is registered in our database. Empirically, we have verified that periodic and frequent retraining of models improves our model performance by a meaningful margin. We choose to follow the monthly accounting cycles to continually retrain our survival models and predict on our active portfolio to obtain the updated forecasts. We must therefore define a concept of observation time, which is our “snapshot” time that abides by monthly accounting cycles and is equivalent in concept to our prediction time. This concept of observation time helps us reason about notions such as “in the past” or “into the future”, which are very pertinent perspectives to consider in production.

At this juncture, it is crucial that our models (and the data utilized):

Are not biased towards over- or under-predicting the likelihood of charge-offs
Do not overfit on previous versions of each loan and regurgitate the non-event prediction in later iterations
Are not trained on a future version of a loan to predict on a past version of a loan during validation / testing
Do not violate the independent and identically distributed assumption behind most supervised machine learning models
Do not have features that contain leaked information from the future
Do not predict on past time-points of each loan

There are many steps one can take to address the challenges above, and addressing them is key to successful survival modeling. Indeed, Enova’s Business Forecasting team has devised a series of steps to overcome these issues, giving us models that are reliable, predictive, and accurate. That being said, there are certainly many different ways a data scientist can approach these challenges, and the right choice is highly contextual to the problem at hand.
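As one generic illustration (not Enova's actual procedure, which this post does not describe), a split on observation time is a simple way to avoid training on a future version of a loan while testing on a past one. The function name `temporal_split` and the record fields `loan_id` and `observation_time` are hypothetical.

```python
from datetime import date


def temporal_split(snapshots, cutoff):
    """Split loan snapshots so that training data never postdates test data."""
    # Everything observed on or before the cutoff may be used for training
    train = [s for s in snapshots if s["observation_time"] <= cutoff]
    # Only strictly later snapshots are held out for validation / testing
    test = [s for s in snapshots if s["observation_time"] > cutoff]
    return train, test


# Hypothetical monthly snapshots of two loans
snapshots = [
    {"loan_id": "A", "observation_time": date(2019, 1, 31)},
    {"loan_id": "A", "observation_time": date(2019, 2, 28)},
    {"loan_id": "A", "observation_time": date(2019, 3, 31)},
    {"loan_id": "B", "observation_time": date(2019, 2, 28)},
    {"loan_id": "B", "observation_time": date(2019, 3, 31)},
]

train, test = temporal_split(snapshots, cutoff=date(2019, 2, 28))
print(len(train), len(test))  # 3 2
```

A split like this addresses only the ordering problem; repeated snapshots of the same loan are still correlated, so the independence concern in the list above needs separate treatment.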

To supplement, what are our choice(s) of model architectures?

We employ both Random Survival Forests and Cox Proportional Hazards models in our survival modeling.

Our primary reasons for using Random Survival Forests:

Non-parametric and non-linear
Provides unbiased Kaplan-Meier-like estimations
Reasonable number of hyperparameters for a tree-based algorithm
Provides importance scores of variables
Regularized via randomization

Our primary reasons for using Cox Proportional Hazards models:

Semi-parametric
Ease of interpretation
Provides unbiased Breslow-like estimations for baseline cumulative hazard (a very close approximation of the Nelson-Aalen estimator)
Quick computation speed
Intuitive empirical risk minimization via partial likelihood maximization

To keep this blogpost concise, we will not delve further into how exactly the algorithms work, or into their fundamental assumptions and outcomes. We suggest the reader familiarize themselves with the literature behind these two models, as they are fundamental model architectures in survival modeling.

As a final touch, how do we evaluate performance for survival models?

In common classification or regression settings, this is rather straightforward. The measurement of loss via the loss function, the confusion matrix, auROC, MSE, etc. are very common metrics to evaluate the performance of the model both in and out of sample / time. However, this gets pretty tricky when we are dealing with predictions in time as in a survival prediction outcome.

Enova’s Business Forecasting team is not a fan of choosing performance metrics blindly according to the type of machine learning task without contextualizing them to the business problem. Given a set of predicted survival probabilities, if all the probabilities are of importance, then metrics such as the Concordance Index or the Brier score might make sense; in contrast, if the main concern is a set of predictions at a particular point in time, then it turns into a classification problem and traditional metrics like auROC or a confusion matrix might make more sense.

As a quick overview, the Concordance Index is a macro-level index that compares a set of predicted survival probabilities against the survival responses. Suppose we randomly pick a pair of observations, x and y, with their corresponding survival responses (Tx, Ex) and (Ty, Ey). From their predicted survival probabilities, something of a “predicted survival time” is obtained, Px and Py. The pair is considered correct if the predictions are ordered the same way as the observed times, e.g. Px>Py and Tx>Ty. Because of censoring, only pairs whose true ordering can actually be determined are counted, i.e. pairs in which the observation with the shorter time experienced the event. The Concordance Index is then the proportion of these comparable pairs that are correct. The Brier score is fundamentally the mean squared error between the predicted survival probabilities and the observed survival status, evaluated across time-points.

Conclusion

To wrap up, survival methods are a really useful way to describe the survival profile and distributions of numerous random variables, especially if each observation has something of a “life”. Yet, the data scientist must maintain a high level of caution and vigilance in dealing with the data and be well-versed with the underlying theory to avoid making grave mistakes. Certainly, the set of possible mistakes (and the set of corresponding solutions) is conditional on the type of problem and the data in question, but we hope that our sharing of the possible pitfalls in our problem and some of the generic techniques employed would help the reader get more acquainted with the various perspectives to view their survival problem.

References

Throughout this blogpost, we have referred to several concepts / algorithms / metrics / standards. Here are some reference materials for the reader to get more acquainted with some of these concepts:

On CECL: https://www.fasb.org/cs/ContentServer?cid=1176168232900&d=&pagename=FASB%2FFASBContent_C%2FNewsPage
On Greenwood’s Formula and the proof of consistency of the Kaplan-Meier/Nelson-Aalen estimator: https://data.princeton.edu/pop509/NonParametricSurvival.pdf
On Random Survival Forests: https://arxiv.org/pdf/0811.1645.pdf
On Cox Proportional Hazards Model: https://www4.stat.ncsu.edu/~dzhang2/st745/chap6.pdf
On the Breslow Estimator: http://dlin.web.unc.edu/files/2013/04/Lin07.pdf
On Concordance Index: https://papers.nips.cc/paper/3375-on-ranking-in-survival-analysis-bounds-on-the-concordance-index.pdf
