## Practical Application to Rehospitalization

Survival models are great for predicting the time until an event occurs. These models can be used in a wide variety of use cases, including predictive maintenance (forecasting when a machine is likely to break down), marketing analytics (anticipating customer churn), patient monitoring (predicting when a patient is likely to be re-hospitalized), and many more.

By combining machine learning with survival models, the resulting models can benefit from the high predictive power of the former while retaining the framework and typical outputs of the latter (such as the survival probability or hazard curve over time). For more information, see the first article of this series here.

Nonetheless, in practice, ML-based survival models still require extensive feature engineering, and thus prior business knowledge and intuition, to achieve satisfying results. So, why not use deep learning models instead to bridge the gap?

## Objective

This article focuses on how deep learning can be combined with the survival analysis framework to solve use cases such as predicting the likelihood of a patient being (re)hospitalized.

After reading this article, you will understand:

- How can deep learning be leveraged for survival analysis?
- What are the common deep learning models in survival analysis and how do they work?
- How can these models be applied concretely to hospitalization forecasting?

This article is the second part of the series on survival analysis. If you are not familiar with survival analysis, it is best to start by reading the first one here. The experiments described in the article were carried out using the libraries scikit-survival, pycox, and plotly. You can find the code here on GitHub.

## 1.1. Problem statement

Let’s start by describing the problem at hand.

We are interested in predicting the likelihood that a given patient will be rehospitalized, given the available information about their health status. More specifically, we would like to estimate this probability at different time points after the last visit. Such an estimate is essential to monitor patient health and mitigate the risk of relapse.

This is a typical survival analysis problem. The data consists of three elements:

Patient’s baseline data including:

- Demographics: age, gender, locality (rural or urban)
- Patient history: smoking, alcohol, diabetes mellitus, hypertension, etc.
- Laboratory results: hemoglobin, total lymphocyte count, platelets, glucose, urea, creatinine, etc.
- More information about the source dataset is available here.

A time t and an event indicator δ ∈ {0, 1}:

- If the event occurs during the observation period, t is equal to the time between the moment the data were collected and the moment the event (i.e., rehospitalization) is observed. In that case, δ = 1.
- If not, t is equal to the time between the moment the data were collected and the last contact with the patient (e.g., end of study). In that case, δ = 0.
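
As a minimal sketch of this encoding, the snippet below builds the pair (t, δ) from made-up follow-up records. The array names `days_to_rehosp` and `days_to_last_contact` and all values are hypothetical and do not come from the source dataset.

```python
import numpy as np

# Hypothetical follow-up records (values are made up for illustration).
# days_to_rehosp is NaN when no rehospitalization was observed.
days_to_rehosp = np.array([12.0, np.nan, 45.0, np.nan])
days_to_last_contact = np.array([12.0, 90.0, 45.0, 30.0])

# Event indicator delta: 1 if rehospitalization was observed, 0 if censored.
events = (~np.isnan(days_to_rehosp)).astype(int)

# Duration t: time to the event when observed, else time to last contact.
durations = np.where(events == 1, days_to_rehosp, days_to_last_contact)
```

Patients 2 and 4 are right-censored here: all we know is that they were not rehospitalized before their last contact.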

⚠️ With this description, why use survival analysis methods when the problem is so similar to a regression task? The original paper gives a fairly good explanation of the main reason:

“If one chooses to use standard regression methods, the right-censored data becomes a type of missing data. It is usually removed or imputed, which may introduce bias into the model. Therefore, modeling right-censored data requires special attention, hence the use of a survival model.” Source [2]

## 1.2. DeepSurv

**Approach**

Let’s move on to the theoretical part with a little refresher on the hazard function.

“The hazard function is the probability an individual will not survive an extra infinitesimal amount of time δ, given they have already survived up to time t. Thus, a greater hazard signifies a greater risk of death.”

Source [2]

Similar to the Cox proportional hazards (CPH) model, DeepSurv is based on the assumption that the hazard function is the product of two functions:

- **The baseline hazard function**, λ_0(t).
- **The risk score**, r(x) = exp(h(x)). It models how the hazard function varies from the baseline for a given individual given the observed covariates.

More on CPH models in the first article of this series.

The function h(x) is commonly called the **log-risk function**. This is precisely the function that the DeepSurv model aims to model.

In fact, CPH models assume that *h(x)* is a linear function: h(x) = β · x. Fitting the model thus consists of computing the weights *β* that optimize the objective function. However, the linear proportional hazards assumption does not hold in many applications. This justifies the need for a more complex non-linear model that is ideally capable of handling large volumes of data.
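
To make the product form concrete, here is a small sketch of the hazard λ(t | x) = λ_0(t) · exp(h(x)) with the linear log-risk h(x) = β · x assumed by CPH. The function name `cph_hazard`, the coefficients, and the patients are made up for illustration.

```python
import numpy as np

def cph_hazard(baseline_hazard, x, beta):
    """Hazard lambda(t | x) = lambda_0(t) * exp(beta . x) at one time point."""
    log_risk = np.dot(beta, x)  # h(x), linear under the CPH assumption
    return baseline_hazard * np.exp(log_risk)

beta = np.array([0.5, -0.2])      # made-up coefficients
patient_a = np.array([1.0, 0.0])  # made-up covariates
patient_b = np.array([0.0, 0.0])

# The hazard ratio between two patients does not depend on the baseline value.
ratio_low = cph_hazard(0.01, patient_a, beta) / cph_hazard(0.01, patient_b, beta)
ratio_high = cph_hazard(0.30, patient_a, beta) / cph_hazard(0.30, patient_b, beta)
```

Both ratios equal exp(0.5): the baseline cancels out, which is the "proportional hazards" property. DeepSurv keeps this structure and only replaces the linear h(x) with a neural network.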

**Architecture**

In this context, how can the DeepSurv model provide a better alternative? Let’s start by describing it. According to the original paper, it is a “deep feed-forward neural network which predicts the effects of a patient’s covariates on their hazard rate parameterized by the weights of the network θ.” [2]

How does it work?

‣ The input to the network is the baseline data x.

‣ The network propagates the inputs through a number of hidden layers with weights θ. The hidden layers consist of fully-connected layers with nonlinear activation functions, followed by dropout.

‣ The final layer is a single node that performs a linear combination of the hidden features. The output of the network is taken as the predicted log-risk function.

Source [2]

As a result of this architecture, the model is very flexible. Hyperparameter search techniques are typically used to determine the number of hidden layers, the number of nodes in each layer, the dropout probability, and other settings.

What about the objective function to optimize?

- CPH models are trained to optimize the Cox partial likelihood. It consists of calculating, for each patient *i* at time *Ti*, the probability that the event has happened, considering all the individuals still at risk at time *Ti*, and then multiplying all these probabilities together. You can find the exact mathematical formula here [2].
- Similarly, the objective function of DeepSurv is the average negative log of the same partial likelihood, with an additional term that serves to regularize the network weights. [2]
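
The following is a minimal NumPy sketch of that objective, written for readability rather than speed. It assumes no tied event times and at least one observed event, and it omits the regularization term. The function name is illustrative and is not part of pycox.

```python
import numpy as np

def neg_log_partial_likelihood(log_risk, durations, events):
    """Average negative log Cox partial likelihood (no ties, no regularization)."""
    log_risk = np.asarray(log_risk, dtype=float)
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    total = 0.0
    for i in np.flatnonzero(events == 1):
        at_risk = durations >= durations[i]  # individuals still at risk at T_i
        total += log_risk[i] - np.log(np.sum(np.exp(log_risk[at_risk])))
    return -total / events.sum()

# Three patients with identical predicted log-risk; the last one is censored.
loss = neg_log_partial_likelihood(
    log_risk=[0.0, 0.0, 0.0], durations=[1.0, 2.0, 3.0], events=[1, 1, 0]
)
```

With identical log-risks, each event contributes the log of the size of its risk set (3, then 2), so the loss is log(6) / 2. Note that the censored patient never adds a term of their own but still appears in the risk sets, which is how censored observations inform the fit.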

**Code sample**

Here is a small code snippet to give an idea of how such a model is implemented using the pycox library. The complete code can be found in the notebook examples of the library here [6].

```python
import torchtuples as tt
from pycox.models import CoxPH
from pycox.evaluation import EvalSurv

# x_train, y_train, val, x_test, durations_test and events_test are assumed
# to be prepared beforehand, as in the pycox example notebooks.

# Step 1: Neural net
# Simple MLP with two hidden layers, ReLU activations, batch norm and dropout
in_features = x_train.shape[1]
num_nodes = [32, 32]
out_features = 1
batch_norm = True
dropout = 0.1
output_bias = False

net = tt.practical.MLPVanilla(in_features, num_nodes, out_features, batch_norm,
                              dropout, output_bias=output_bias)
model = CoxPH(net, tt.optim.Adam)

# Step 2: Model training
batch_size = 256
epochs = 512
callbacks = [tt.callbacks.EarlyStopping()]
verbose = True

model.optimizer.set_lr(0.01)
log = model.fit(x_train, y_train, batch_size, epochs, callbacks, verbose,
                val_data=val, val_batch_size=batch_size)

# Step 3: Prediction
_ = model.compute_baseline_hazards()
surv = model.predict_surv_df(x_test)

# Step 4: Evaluation
ev = EvalSurv(surv, durations_test, events_test, censor_surv='km')
ev.concordance_td()
```

## 1.3. DeepHit

**Approach**

Instead of making strong assumptions about the distribution of survival times, what if we could train a deep neural network that learns it directly?

This is the case with the DeepHit model. Specifically, it brings two significant improvements over previous approaches:

- It does not rely on any assumptions about the underlying stochastic process. Thus, the network learns to model how the relationship between the covariates and the risk evolves over time.
- It can handle competing risks (e.g., simultaneously modeling the risks of being rehospitalized and dying) through a multi-task learning architecture.

**Architecture**

As described here [3], DeepHit follows the common architecture of multi-task learning models. It consists of two main parts:

- A shared subnetwork, where the model learns from the data a general representation useful for all the tasks.
- Task-specific subnetworks, where the model learns more task-specific representations.

However, the architecture of the DeepHit model differs from typical multi-task learning models in two aspects:

- It includes a residual connection between the initial covariates and the input of the task-specific subnetworks.
- It uses only one softmax output layer. As a result, the model does not learn the marginal distribution of competing events but the joint distribution.

The figures below show the case where the model is trained simultaneously on two tasks.

The output of the DeepHit model is a vector *y* for each subject. It gives the probability that the subject will experience the event k ∈ {1, 2} at each time step *t* within the observation time.
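
To illustrate what a single softmax over all (event, time) pairs implies, here is a simplified sketch for one subject. The random logits stand in for the network output, and the grid of 2 events by 5 time steps is an arbitrary assumption. Actual implementations may differ in detail, for example by reserving extra probability mass for surviving past the last time step.

```python
import numpy as np

rng = np.random.default_rng(0)
num_events, num_times = 2, 5

# Stand-in for the raw network output of one subject:
# one logit per (event, time) pair.
logits = rng.normal(size=(num_events, num_times))

# A single softmax over all (event, time) pairs gives a joint distribution.
exp_logits = np.exp(logits - logits.max())
pmf = exp_logits / exp_logits.sum()

# Cumulative incidence of each event up to each time index.
cif = np.cumsum(pmf, axis=1)

# Probability that no event of any kind has occurred up to each time index.
overall_survival = 1.0 - cif.sum(axis=0)
```

Because the probabilities sum to one across both events and all time steps, the cumulative incidence curves of the competing events are coupled: more mass on one event necessarily leaves less for the other.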

## 2.1. Methodology

**Data**

The data set was divided into three parts: a training set (60% of the data), a validation set (20%), and a test set (20%). The training and validation sets were used to optimize the neural networks during training, and the test set was used for the final evaluation.
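
A 60/20/20 split of this kind can be sketched as follows. This is only an illustration with a hypothetical helper `split_indices` and a plain random shuffle; the exact splitting procedure used in the experiments is not specified here.

```python
import numpy as np

def split_indices(n_samples, seed=0):
    """Shuffle indices and split them into 60% / 20% / 20% parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = (n_samples * 6) // 10  # integer arithmetic avoids float rounding
    n_val = (n_samples * 2) // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
```

The resulting index arrays can then be used to slice the covariates, durations, and event indicators consistently.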

**Benchmark**

The performance of the deep learning models was compared to a benchmark of models including CoxPH and ML-based survival models (Gradient Boosting and SVM). More information on these models is available in the first article of the series.

**Metrics**

Two metrics were used to evaluate the models:

- Concordance index (C-index): it measures the ability of the model to provide a reliable ranking of survival times based on individual risk scores. It is computed as the proportion of concordant pairs in a dataset.
- Brier score: it is a time-dependent extension of the mean squared error to right-censored data. In other words, it represents the average squared distance between the observed survival status and the predicted survival probability.
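
As an illustration of the "proportion of concordant pairs" idea, here is a plain, unoptimized sketch of Harrell's C-index. It is an assumption that this simple variant conveys the intuition well enough; the time-dependent version returned by `concordance_td` in pycox is computed differently.

```python
import numpy as np

def concordance_index(durations, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk."""
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event has the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable

durations = np.array([5.0, 10.0, 15.0, 20.0])
events = np.array([1, 0, 1, 1])
perfect = concordance_index(durations, events, np.array([4.0, 3.0, 2.0, 1.0]))
inverted = concordance_index(durations, events, np.array([1.0, 2.0, 3.0, 4.0]))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means the ranking is fully inverted.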

## 2.2. Results

In terms of C-index, the performance of the deep learning models is considerably better than that of the ML-based survival analysis models. Moreover, there is almost no difference between the performance of the DeepSurv and DeepHit models.

In terms of Brier score, the DeepSurv model stands out from the others.

- When examining the curve of the Brier score as a function of time, the curve of the DeepSurv model is lower than the others, which reflects better accuracy.

- This observation is confirmed when considering the integration of the score over the same time interval.

Note that the Brier score was not computed for the SVM, as this score is only applicable to models that are able to estimate a survival function.

Finally, deep learning models can be used for survival analysis just as well as statistical models. Here, for instance, we can see the survival curves of randomly selected patients. Such outputs can bring many benefits, in particular allowing a more effective follow-up of the patients who are most at risk.

✔️ Survival models are very useful for predicting the time it takes for an event to occur.

✔️ They can help address many use cases by providing a learning framework and techniques, as well as useful outputs such as the probability of survival or the hazard curve over time.

✔️ They are even indispensable in this type of use case to exploit all the data, including the censored observations (when the event does not occur during the observation period, for example).

✔️ ML-based survival models tend to perform better than statistical models (more information here). However, they require high-quality feature engineering based on solid business intuition to achieve satisfactory results.

✔️ This is where deep learning can bridge the gap. Deep learning-based survival models like DeepSurv or DeepHit have the potential to perform better with less effort!

✔️ Nevertheless, these models are not without drawbacks. They require a large dataset for training and the fine-tuning of multiple hyperparameters.

[1] Bollepalli, S.C.; Sahani, A.K.; Aslam, N.; Mohan, B.; Kulkarni, K.; Goyal, A.; Singh, B.; Singh, G.; Mittal, A.; Tandon, R.; Chhabra, S.T.; Wander, G.S.; Armoundas, A.A. An Optimized Machine Learning Model Accurately Predicts In-Hospital Outcomes at Admission to a Cardiac Unit. Diagnostics 2022, 12, 241.

[2] Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T., & Kluger, Y. (2016). DeepSurv: Personalized Treatment Recommender System Using A Cox Proportional Hazards Deep Neural Network, ArXiv

[3] Laura Löschmann, Daria Smorodina, Deep Learning for Survival Analysis, Seminar Information Systems (WS19/20), February 6, 2020

[4] Lee, Changhee et al. DeepHit: A Deep Learning Approach to Survival Analysis With Competing Risks. AAAI Conference on Artificial Intelligence (2018).

[5] Wikipedia, *Proportional hazards model*

[6] Pycox library