library(tidymodels)
library(censored)
#> Loading required package: survival
How long until building complaints are dispositioned? A survival analysis case study
Learn how to use tidymodels for survival analysis.
Introduction
To use code in this article, you will need to install the following packages: aorsf, censored, glmnet, modeldatatoo, and tidymodels.
Survival analysis is a field of statistics and machine learning for analyzing the time to an event. While it has its roots in medical research, the event of interest can be anything from customer churn to machine failure. Methods from survival analysis take into account that some observations may not yet have experienced the event of interest and are thus censored.
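As a quick illustration of censoring (these two observations are made up, not taken from the complaints data): suppose one complaint was resolved after 5 days and another is still open after 8 days. The survival package encodes such data with Surv(), marking the censored observation with a "+".

library(survival)

# first complaint resolved after 5 days (event = 1),
# second complaint still open after 8 days (event = 0, i.e., censored)
Surv(time = c(5, 8), event = c(1, 0))
# prints as: 5  8+   (the "+" marks the censored observation)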
Here we want to predict the time it takes for a complaint to be dispositioned1 by the Department of Buildings in New York City. We are going to walk through a complete analysis from beginning to end, showing how to analyze time-to-event data.
Let’s start with loading the tidymodels and censored packages (the parsnip extension package for survival analysis models).
The buildings complaints data
The city of New York publishes data on the complaints received by the Department of Buildings. The data includes information on the type of complaint, the date it was entered in their records, the date it was dispositioned, and the location of the building the complaint was about. We are using a subset of the data, available in the modeldatatoo package.
building_complaints <- modeldatatoo::data_building_complaints()
glimpse(building_complaints)
#> Rows: 4,234
#> Columns: 11
#> $ days_to_disposition <dbl> 72, 1, 41, 45, 16, 62, 56, 11, 35, 38, 39, 106, 1,…
#> $ status <chr> "ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE", …
#> $ year_entered <fct> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 20…
#> $ latitude <dbl> 40.66173, 40.57668, 40.73242, 40.68245, 40.63156, …
#> $ longitude <dbl> -73.98297, -74.00453, -73.87630, -73.79367, -73.99…
#> $ borough <fct> Brooklyn, Brooklyn, Queens, Queens, Brooklyn, Quee…
#> $ special_district <fct> None, None, None, None, None, None, None, None, No…
#> $ unit <fct> Q-L, Q-L, SPOPS, Q-L, BKLYN, Q-L, Q-L, SPOPS, Q-L,…
#> $ community_board <fct> 307, 313, 404, 412, 312, 406, 306, 306, 409, 404, …
#> $ complaint_category <fct> 45, 45, 49, 45, 31, 45, 45, 49, 45, 45, 45, 4A, 31…
#> $ complaint_priority <fct> B, B, C, B, C, B, B, C, B, B, B, B, C, C, B, B, B,…
Before we dive into survival analysis, let’s get an impression of how the complaints are distributed across the city. We have complaints in all five boroughs, albeit with a somewhat lower density of complaints in Staten Island.
Building complaints in New York City (closed complaints in purple, active complaints in pink).
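The code for this map is not shown in the article; a rough ggplot2 sketch along the following lines could produce a similar figure (the colors are only chosen to match the caption):

building_complaints %>%
  ggplot(aes(longitude, latitude, color = status)) +
  geom_point(alpha = 0.3, size = 0.8) +
  scale_color_manual(values = c(CLOSED = "purple", ACTIVE = "pink")) +
  labs(title = "Building complaints in New York City", color = "Status")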
In the dataset, we can see the days_to_disposition as well as the status of the complaint. For a complaint with the status "ACTIVE", the time to disposition is censored: we know that it has taken at least that long, but not how long it will take to be completely resolved.
The standard form for time-to-event data is a Surv object, which captures the time as well as the event status. As with all transformations of the response, it is advisable to do this before heading into the model fitting process with tidymodels.
building_complaints <- building_complaints %>%
  mutate(
disposition_surv = Surv(days_to_disposition, status == "CLOSED"),
.keep = "unused"
)
Data splitting and resampling
For our resampling strategy, let’s use a 3-way split into training, validation, and test sets.
set.seed(403)
complaints_split <- initial_validation_split(building_complaints)
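By default, initial_validation_split() uses 60% of the data for training and 20% each for validation and testing. Different proportions can be requested via the prop argument; the split below is only an example, not what we use here:

# hypothetical alternative split: 70% training, 15% validation, 15% testing
initial_validation_split(building_complaints, prop = c(0.7, 0.15))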
First, let’s pull out the training data and have a brief look at the response using a Kaplan-Meier curve.
complaints_train <- training(complaints_split)

survfit(disposition_surv ~ 1, data = complaints_train) %>% plot()
We can see that the majority of complaints are dispositioned relatively quickly, but some complaints are still active after 100 days.
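The base R plot() method is all we need for a quick look. If a ggplot2 version is preferred, one option (a sketch, not part of the original analysis) is to tidy the survfit object with broom and plot the step function:

survfit(disposition_surv ~ 1, data = complaints_train) %>%
  broom::tidy() %>%
  ggplot(aes(time, estimate)) +
  geom_step() +
  labs(x = "Days to disposition", y = "Probability of not yet being dispositioned")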
A first model
The censored package includes parametric, semi-parametric, and tree-based models for this type of analysis. To start, we are fitting a parametric survival model with the default of assuming a Weibull distribution on the time to disposition. We’ll explore the more flexible models once we have a sense of how well this more restrictive model performs on this dataset.
survreg_spec <- survival_reg() %>%
  set_engine("survival") %>%
set_mode("censored regression")
We have several missing values in complaint_priority that we are turning into a separate category, "unknown". We are also combining the less common categories for community_board and unit into an "other" category to reduce the number of levels in the predictors. The complaint category often does not tell us much more than the unit, with several complaint categories being handled by a specific unit only. This can lead to the model being unable to estimate some of the coefficients. Since our goal here is only to get a rough idea of how well the model performs, we are removing the complaint category for now.
rec_other <- recipe(disposition_surv ~ ., data = complaints_train) %>%
  step_unknown(complaint_priority) %>%
step_rm(complaint_category) %>%
step_novel(community_board, unit) %>%
step_other(community_board, unit, threshold = 0.02)
We combine the recipe and the model into a workflow. This allows us to easily resample the model because all preprocessing steps are applied to the training set and the validation set for us.
survreg_wflow <- workflow() %>%
  add_recipe(rec_other) %>%
add_model(survreg_spec)
To fit and evaluate the model, we need the training and validation sets. While we can access them each on their own, validation_set() extracts them both, in a manner that emulates a single resample of the data. This enables us to use fit_resamples() and other tuning functions in the same way as if we had used some other resampling scheme (such as cross-validation).
We are calculating several performance metrics: the Brier score, its integrated version, the area under the ROC curve, and the concordance index. Note that all of these are used in a version tailored to survival analysis. The concordance index uses the predicted event time to measure the model’s ability to rank the observations correctly. The Brier score and the ROC curve use the predicted probability of survival at a given time. We evaluate these metrics every 30 days up to 300 days, as provided in the eval_time argument. The Brier score is a measure of the accuracy of the predicted probabilities, while the ROC curve is a measure of the model’s ability to discriminate between events and non-events at the given time point. Because these metrics are defined “at a given time,” they are also referred to as dynamic metrics.
For more information see the Dynamic Performance Metrics for Event Time Data article.
complaints_rset <- validation_set(complaints_split)

survival_metrics <- metric_set(brier_survival_integrated, brier_survival,
                               roc_auc_survival, concordance_survival)
evaluation_time_points <- seq(0, 300, 30)

set.seed(1)
survreg_res <- fit_resamples(
  survreg_wflow,
  resamples = complaints_rset,
metrics = survival_metrics,
eval_time = evaluation_time_points,
control = control_resamples(save_pred = TRUE)
)
The structure of survival model predictions is slightly different from classification and regression model predictions:
preds <- collect_predictions(survreg_res)
preds
#> # A tibble: 847 × 6
#> .pred .pred_time id .row disposition_surv .config
#> <list> <dbl> <chr> <int> <Surv> <chr>
#> 1 <tibble [11 × 3]> 96.6 validation 2541 35+ Preprocessor1…
#> 2 <tibble [11 × 3]> 18.7 validation 2542 129+ Preprocessor1…
#> 3 <tibble [11 × 3]> 29.5 validation 2543 4+ Preprocessor1…
#> 4 <tibble [11 × 3]> 29.8 validation 2544 5+ Preprocessor1…
#> 5 <tibble [11 × 3]> 24.8 validation 2545 1+ Preprocessor1…
#> 6 <tibble [11 × 3]> 58.4 validation 2546 76+ Preprocessor1…
#> 7 <tibble [11 × 3]> 71.3 validation 2547 51+ Preprocessor1…
#> 8 <tibble [11 × 3]> 102. validation 2548 44+ Preprocessor1…
#> 9 <tibble [11 × 3]> 47.1 validation 2549 15+ Preprocessor1…
#> 10 <tibble [11 × 3]> 28.5 validation 2550 61+ Preprocessor1…
#> # ℹ 837 more rows
The predicted survival time is in the .pred_time column and the predicted survival probabilities are in the .pred list column.
preds$.pred[[6]]
#> # A tibble: 11 × 3
#> .eval_time .pred_survival .weight_censored
#> <dbl> <dbl> <dbl>
#> 1 0 1 1
#> 2 30 0.554 1.04
#> 3 60 0.360 1.19
#> 4 90 0.245 NA
#> 5 120 0.171 NA
#> 6 150 0.121 NA
#> 7 180 0.0874 NA
#> 8 210 0.0637 NA
#> 9 240 0.0468 NA
#> 10 270 0.0347 NA
#> 11 300 0.0259 NA
For each observation, .pred contains a tibble with the evaluation time .eval_time and the corresponding survival probability .pred_survival. The column .weight_censored contains the weights used in the calculation of the dynamic performance metrics.
For details on the weights see the Accounting for Censoring in Performance Metrics for Event Time Data article.
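To work with these nested predictions directly, for example to plot predicted survival curves for a handful of complaints, the .pred list column can be unnested. This is a small sketch, not part of the original article; tidyr and ggplot2 are attached via tidymodels:

preds %>%
  slice(1:5) %>%
  select(.row, .pred) %>%
  unnest(.pred) %>%
  ggplot(aes(.eval_time, .pred_survival, color = factor(.row))) +
  geom_line() +
  labs(x = "Evaluation Time", y = "Predicted Survival Probability", color = "Observation")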
Of the metrics we calculated with these predictions, let’s take a look at the ROC AUC first.
collect_metrics(survreg_res) %>%
filter(.metric == "roc_auc_survival") %>%
ggplot(aes(.eval_time, mean)) +
geom_line() +
labs(x = "Evaluation Time", y = "Area Under the ROC Curve")
We can discriminate between events and non-events reasonably well, especially in the first 30 and 60 days. How about the probabilities that the categorization into event and non-event is based on?
collect_metrics(survreg_res) %>%
filter(.metric == "brier_survival") %>%
ggplot(aes(.eval_time, mean)) +
geom_line() +
labs(x = "Evaluation Time", y = "Brier Score")
The accuracy of the predicted probabilities is generally good, albeit lowest for evaluation times of 30 and 60 days. The integrated Brier score is a measure of the overall accuracy of the predicted probabilities.
collect_metrics(survreg_res) %>%
filter(.metric == "brier_survival_integrated")
#> # A tibble: 1 × 7
#> .metric .estimator .eval_time mean n std_err .config
#> <chr> <chr> <dbl> <dbl> <int> <dbl> <chr>
#> 1 brier_survival_integrated standard NA 0.0512 1 NA Preproce…
Which metric to optimise for depends on whether separation or calibration is more important in the modeling problem at hand. We’ll go with calibration here. Since we don’t have a particular evaluation time that we want to predict well at, we are going to use the integrated Brier score as our main performance metric.
Try out more models
Lumping factor levels together based on frequencies can lead to a loss of information, so let’s also try some different approaches. We can let a random forest model group the factor levels via the tree splits. Alternatively, we can turn the factors into dummy variables and use a regularized model to select relevant factor levels.
First, let’s create the recipes for these two approaches:
rec_unknown <- recipe(disposition_surv ~ ., data = complaints_train) %>%
  step_unknown(complaint_priority)

rec_dummies <- rec_unknown %>%
  step_novel(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
Next, let’s create the model specifications and tag several hyperparameters for tuning. For the random forest, we are using the "aorsf" engine for accelerated oblique random survival forests. An oblique tree can split on linear combinations of the predictors, i.e., it provides more flexibility in the splits than a tree which splits on a single predictor. For the regularized model, we are using the "glmnet" engine for a semi-parametric Cox proportional hazards model.
oblique_spec <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("aorsf") %>%
  set_mode("censored regression")

oblique_wflow <- workflow() %>%
  add_recipe(rec_unknown) %>%
  add_model(oblique_spec)

coxnet_spec <- proportional_hazards(penalty = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("censored regression")

coxnet_wflow <- workflow() %>%
  add_recipe(rec_dummies) %>%
  add_model(coxnet_spec)
We can tune workflows with any of the tune_*() functions such as tune_grid() for grid search or tune_bayes() for Bayesian optimization. Here we are using grid search for simplicity.
set.seed(1)
oblique_res <- tune_grid(
  oblique_wflow,
  resamples = complaints_rset,
grid = 10,
metrics = survival_metrics,
eval_time = evaluation_time_points,
control = control_grid(save_workflow = TRUE)
)
#> i Creating pre-processing data to finalize unknown parameter: mtry

set.seed(1)
coxnet_res <- tune_grid(
  coxnet_wflow,
  resamples = complaints_rset,
grid = 10,
metrics = survival_metrics,
eval_time = evaluation_time_points,
control = control_grid(save_workflow = TRUE)
)
So do any of these models perform better than the parametric survival model?
show_best(oblique_res, metric = "brier_survival_integrated", n = 5)
#> # A tibble: 5 × 9
#> mtry min_n .metric .estimator .eval_time mean n std_err .config
#> <int> <int> <chr> <chr> <dbl> <dbl> <int> <dbl> <chr>
#> 1 8 16 brier_survival… standard NA 0.0468 1 NA Prepro…
#> 2 4 6 brier_survival… standard NA 0.0469 1 NA Prepro…
#> 3 6 23 brier_survival… standard NA 0.0469 1 NA Prepro…
#> 4 7 26 brier_survival… standard NA 0.0470 1 NA Prepro…
#> 5 9 38 brier_survival… standard NA 0.0471 1 NA Prepro…
show_best(coxnet_res, metric = "brier_survival_integrated", n = 5)
#> # A tibble: 5 × 8
#> penalty .metric .estimator .eval_time mean n std_err .config
#> <dbl> <chr> <chr> <dbl> <dbl> <int> <dbl> <chr>
#> 1 0.00750 brier_surviv… standard NA 0.0496 1 NA Prepro…
#> 2 0.000000262 brier_surviv… standard NA 0.0506 1 NA Prepro…
#> 3 0.0000000591 brier_surviv… standard NA 0.0506 1 NA Prepro…
#> 4 0.00000000200 brier_surviv… standard NA 0.0506 1 NA Prepro…
#> 5 0.0000588 brier_surviv… standard NA 0.0506 1 NA Prepro…
The best regularized Cox model performs a little better than the parametric survival model, with an integrated Brier score of 0.0496 compared to 0.0512 for the parametric model. The random forest performs better still, with an integrated Brier score of 0.0468.
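As a quick side-by-side view of these results (a sketch that merely rearranges the metrics reported above), the integrated Brier scores of the three candidates can be collected into a single tibble:

bind_rows(
  collect_metrics(survreg_res) %>% mutate(model = "parametric survival"),
  show_best(coxnet_res, metric = "brier_survival_integrated", n = 1) %>%
    mutate(model = "regularized Cox"),
  show_best(oblique_res, metric = "brier_survival_integrated", n = 1) %>%
    mutate(model = "oblique random forest")
) %>%
  filter(.metric == "brier_survival_integrated") %>%
  select(model, mean, std_err)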
The final model
We chose the random forest model as the final model. So let’s finalize the workflow by replacing the tune() placeholders with the best hyperparameters.
param_best <- select_best(oblique_res, metric = "brier_survival_integrated")

last_oblique_wflow <- finalize_workflow(oblique_wflow, param_best)
We can now fit the final model on the training data and evaluate it on the test data.
set.seed(2)
last_oblique_fit <- last_fit(
  last_oblique_wflow,
  split = complaints_split,
  metrics = survival_metrics,
  eval_time = evaluation_time_points
)
collect_metrics(last_oblique_fit) %>%
filter(.metric == "brier_survival_integrated")
#> # A tibble: 1 × 5
#> .metric .estimator .estimate .eval_time .config
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 brier_survival_integrated standard 0.0430 NA Preprocessor1_Model1
The Brier score across the different evaluation time points is also very similar between the validation set and the test set.
brier_val <- collect_metrics(oblique_res) %>%
  filter(.metric == "brier_survival") %>%
filter(mtry == param_best$mtry, min_n == param_best$min_n) %>%
mutate(Data = "Validation")
brier_test <- collect_metrics(last_oblique_fit) %>%
  filter(.metric == "brier_survival") %>%
mutate(Data = "Testing") %>%
rename(mean = .estimate)
bind_rows(brier_val, brier_test) %>%
ggplot(aes(.eval_time, mean, col = Data)) +
geom_line() +
labs(x = "Evaluation Time", y = "Brier Score")
To finish, we can extract the fitted workflow to either predict directly on new data or deploy the model.
complaints_model <- extract_workflow(last_oblique_fit)

complaints_5 <- testing(complaints_split) %>% slice(1:5)
predict(complaints_model, new_data = complaints_5, type = "time")
#> # A tibble: 5 × 1
#> .pred_time
#> <dbl>
#> 1 82.2
#> 2 42.6
#> 3 99.4
#> 4 79.8
#> 5 80.5
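The fitted workflow can also produce survival probabilities at specific evaluation times by using type = "survival" together with eval_time; the times below are just an example:

# survival probabilities for the same five complaints at 30, 60, and 90 days
predict(complaints_model, new_data = complaints_5,
        type = "survival", eval_time = c(30, 60, 90))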
For more information on survival analysis with tidymodels see the survival analysis tag.
Session information
#> ─ Session info ─────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.0 (2024-04-24)
#> os macOS Sonoma 14.4.1
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Los_Angeles
#> date 2024-06-26
#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ─────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> aorsf * 0.1.5 2024-05-30 [1] CRAN (R 4.4.0)
#> broom * 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)
#> censored * 0.3.2 2024-06-11 [1] CRAN (R 4.4.0)
#> dials * 1.2.1 2024-02-22 [1] CRAN (R 4.4.0)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
#> ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
#> glmnet * 4.1-8 2023-08-22 [1] CRAN (R 4.4.0)
#> infer * 1.0.7 2024-03-25 [1] CRAN (R 4.4.0)
#> modeldatatoo 0.3.0 2024-03-29 [1] CRAN (R 4.4.0)
#> parsnip * 1.2.1 2024-03-22 [1] CRAN (R 4.4.0)
#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
#> recipes * 1.0.10 2024-02-18 [1] CRAN (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
#> rsample * 1.2.1 2024-03-25 [1] CRAN (R 4.4.0)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
#> tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.4.0)
#> tune * 1.2.1 2024-04-18 [1] CRAN (R 4.4.0)
#> workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.4.0)
#> yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.4.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#>
#> ────────────────────────────────────────────────────────────────────
Footnotes
In this context, the term disposition means that a decision or resolution regarding the complaint has been reached, concluding the process.↩︎