--- title: "Flexible Estimation of Odds Ratio Curves: Introducing the flexOR Package" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Flexible Estimation of Odds Ratio Curves: Introducing the flexOR Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` **Marta Azevedo** *Center of Mathematics, University of Minho, Guimarães, Portugal* Email: marta.vasconcelos4@gmail.com **Luís Meira-Machado** *Center of Mathematics, University of Minho, Guimarães, Portugal* Email: lmachado@math.uminho.pt **Artur Araújo** *Department of Statistics and Operations Research, University of Vigo, Vigo, Spain* Email: artur.stat@gmail.com ## Abstract Understanding the relationship between continuous predictors and binary outcomes is a common challenge in statistical analysis. Traditional parametric regression methods often impose rigid functional forms, limiting their adaptability to complex datasets. To address this constraint and enhance flexibility, various smoothing techniques, particularly those based on splines, have been widely explored. Expressing results in terms of splines-based odds ratio (OR) curves provides a subtle understanding of the impact each continuous covariate has on the outcome. In this vignette, we present **flexOR**, an **R** package designed to offer a robust framework for pointwise nonparametric estimation of odds ratio curves for continuous predictors. To better understand the effects that each continuous covariate has on the outcome, results can be expressed in terms of splines-based odds ratio (OR) curves, taking a specific covariate value as reference. **flexOR** enables users to estimate odds ratio curves without imposing severe assumptions on the underlying functional form, allowing for a refined exploration of relationships with binary outcomes. The package incorporates various options for automatically selecting degrees of freedom in multivariable models, enhancing its adaptability to diverse datasets. Additionally, **flexOR** features intuitive visualization functions, facilitating the interpretation and presentation of estimated odds ratio curves. **Keywords:** logistic models; generalized additive models; odds ratio; reference value; smoothing splines. ## Introduction Logistic regression models (Hosmer and Lemeshow, 2013) stand as essential tools in statistical analysis, especially when dealing with binary outcome variables. Unlike linear regression tailored for continuous dependent variables, logistic regression is specifically designed to predict the probability of an event. This makes it particularly useful in scenarios such as estimating the likelihood of a patient developing coronary disease or experiencing a specific medical outcome. Within logistic regression, addressing the nonlinear effects of continuous predictors is a crucial challenge. Traditional approaches involve either categorizing predictors with dummy variables or incorporating them into a polynomial model. However, these methods have limitations, leading to challenges in determining optimal cutpoints and risking power loss when estimating odds ratios. To address these issues, generalized additive models (GAMs) (Hastie and Tibshirani, 1990; Wood, 2017) offer an alternative solution. Particularly, the use of smoothing splines within GAMs enhances the capacity to model nonlinear effects more effectively. 

Splines provide flexibility in representing nonlinear relationships, mitigating the risk of overfitting associated with high-order polynomials. Smoothing splines offer a balance between flexibility and simplicity by applying a penalty term to control the smoothness of the fitted curve. This vignette explores the use of additive models, focusing on smoothing splines as a well-established approach to capture nonlinear effects in logistic regression analyses.

Despite the advantages of splines and smoothing splines, the challenge of selecting the number and placement of the knots that define the smooth curve remains. This arbitrary determination may obscure important features of the data, requiring a careful balance to avoid oversmoothing or undersmoothing. Various methods, including Akaike's Information Criterion (AIC) (Akaike, 1974) and the Bayesian Information Criterion (BIC) (Schwarz, 1978), have been proposed to address this issue. However, these criteria pose challenges in multivariable settings.

The odds ratio (OR) function is a predominant metric for characterizing the impact of continuous covariates in the additive logistic model. Building on previous work in clinical survival studies (Meira-Machado et al., 2013), we aim to estimate odds ratios and their confidence intervals through a nonparametric approach, taking a specific reference value for the covariate.

The **flexOR** package introduces the *dfgam* function, which offers a systematic means to determine the optimal number of degrees of freedom in the multivariable additive logistic model. The *dfgam* function allows users to obtain degrees of freedom using criteria such as AIC, AICc, and BIC, providing a versatile tool for model selection. It can also extract degrees of freedom using other criteria implemented in the **mgcv** R package by Simon Wood (Wood, 2006; Wood, 2017), including restricted maximum likelihood (REML) and generalized cross-validation with a Cp penalty (GCV.Cp). The choice among these criteria depends on the characteristics of the data and the modeling objectives, emphasizing the importance of factors such as sample size and the nature of the relationships being modeled; a schematic call illustrating how the criterion is selected is sketched below.

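
The sketch below is illustrative only and is not evaluated: the data set `mydata`, the response `y`, and the predictors `x1`, `x2`, and `z` are placeholders, while the argument names follow the *dfgam* call used later in this vignette.

```{r dfgam_schematic, eval = FALSE}
# Schematic only (not run): placeholder data and variable names.
# The method argument selects the criterion used to choose the degrees of
# freedom of the nonlinear terms (e.g. "AIC", "AICc", "BIC").
df.aic <- dfgam(response = "y", nl.predictors = c("x1", "x2"),
                other.predictors = c("z"), smoother = "s",
                method = "AIC", data = mydata)
df.bic <- dfgam(response = "y", nl.predictors = c("x1", "x2"),
                other.predictors = c("z"), smoother = "s",
                method = "BIC", data = mydata)
df.aic$df  # degrees of freedom chosen for each nonlinear predictor
```
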

The next section describes the features and functions of the **flexOR** package using a real data application on diabetes.

## Pima Indians Diabetes Database

This section explores the Pima Indians Diabetes Database, a widely recognized dataset in the domains of machine learning and statistics. Originating from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) in the 1990s, the dataset focuses on the Pima Indian population in Arizona, USA. Comprising various demographic, clinical, and diagnostic measurements, the dataset includes information on whether each individual developed diabetes. Researchers and data scientists frequently employ this dataset to construct and evaluate predictive models for diabetes, utilizing features such as age (years), BMI (body mass index), blood pressure, diabetes pedigree function, and other health-related variables. The dataset is conveniently accessible through the **mlbench** R package.

The dataset comprises 768 observations on 9 variables, covering pregnancies, glucose level, blood pressure, triceps skinfold thickness, insulin level, body mass index, pedigree, age, and diabetes status. The structure of the R object, shown below, details the data type and values of each variable.

```{r load_data}
library(mlbench)

data("PimaIndiansDiabetes2")
str(PimaIndiansDiabetes2)
```

After identifying the variables to be included in the model and determining those requiring a nonlinear effect through smoothing splines, we used the *dfgam* function to obtain the degrees of freedom that minimize the AIC of the model. The resulting degrees of freedom for the nonlinear predictors, age and body mass index (mass), were 3.3 and 4.1, respectively. These degrees of freedom were then incorporated into the generalized additive model (GAM) fitted with the *gam* function:

```{r create_df2}
library(flexOR)
library(gam)

df2 <- dfgam(response = "diabetes", nl.predictors = c("age", "mass"),
             other.predictors = c("pedigree"), smoother = "s",
             method = "AIC", data = PimaIndiansDiabetes2)
df2$df

m2 <- gam(diabetes ~ s(age, df = 3.3) + s(mass, df = 4.1) + pedigree,
          data = PimaIndiansDiabetes2, family = binomial)
```

Next, we apply the *flexOR* function, whose output is used to plot the smooth log odds ratio curve. This curve offers valuable insight into the relationship between the risk of diabetes and body mass index.

```{r apply_flexOR}
or2 <- flexOR(data = PimaIndiansDiabetes2, response = "diabetes",
              formula = ~ s(age, 3.3) + s(mass, 4.1) + pedigree)

plot(
  x = or2,
  predictor = "mass",
  ref.value = 40,
  ref.label = "Ref. value",
  col.area = c("grey75", "grey90"),
  main = "Smooth odds ratio for mass",
  xlab = "Body mass index",
  ylab = "Log Odds Ratio (Ln OR)",
  lty = c(1, 2, 2, 3, 3),
  round.x = 1,
  conf.level = c(0.8, 0.95)
)
```

This figure illustrates the relationship between body mass index (BMI) and the risk of diabetes within the Pima Indian population in Arizona, USA. The log odds ratio (Ln OR) is displayed together with 80% (grey) and 95% (light grey) confidence bands, using a reference value of 40 for BMI. Users may choose a single confidence level, although two are also possible, as demonstrated in the command above; a sketch with a single level and a different reference value follows this paragraph. Additionally, the `ylog` argument makes it possible to generate a plot that is not on the log scale. Normal BMI values typically fall within the range of 19 to 25. In our context, individuals with a BMI lower than 40 show a lower risk of diabetes. The risk of diabetes follows a distinctive pattern: a rapid increase up to a BMI of about 30, followed by a relatively stable risk between 30 and 40, and a notable, accelerated rise beyond a BMI of 40.

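
For instance, the hypothetical call below (not evaluated) sketches the same curve with a single 95% confidence band and an alternative reference value of 25, the upper end of the normal BMI range; these argument values are illustrative choices, not part of the original analysis.

```{r plot_single_band, eval = FALSE}
# Illustrative sketch (not run): one confidence band and a different reference value.
plot(
  x = or2,
  predictor = "mass",
  ref.value = 25,          # illustrative reference value (upper end of normal BMI)
  ref.label = "Ref. value",
  main = "Smooth odds ratio for mass",
  xlab = "Body mass index",
  ylab = "Log Odds Ratio (Ln OR)",
  round.x = 1,
  conf.level = 0.95        # a single confidence level
)
```
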

Plotly's **R** graphing library offers a powerful tool for creating interactive, publication-quality graphs. Going beyond static representations, interactive plots allow users to zoom, pan, and hover over data points, making it easier to explore the estimated curve and its confidence bands. The commands below use **plotly** to reproduce the smoothed log odds ratio curve for body mass index with two shaded confidence bands; the corresponding figure follows.

```{r plotly}
library(plotly)

p <- plot(
  x = or2,
  predictor = "mass",
  ref.value = 40,
  ref.label = "Reference Label",
  main = "Smooth odds ratio for mass",
  xlab = "Mass",
  ylab = "Log Odds Ratio (Ln OR)",
  lty = c(1, 2, 2, 3, 3),
  #xlim = c(18, 67),
  #ylim = c(-4.5, 4.5),
  round.x = 1,
  conf.level = c(0.8, 0.95)
)

tmat <- p$estimates
xref <- p$xref
mdata <- or2$dataset
jj <- match(sort(unique(mdata$mass)), mdata$mass)

# Plotly to get shaded (two-level) confidence bands
fig <- plot_ly(x = mdata$mass[jj], y = tmat[jj, 5], type = 'scatter', mode = 'lines',
               line = list(color = 'transparent'), showlegend = FALSE, name = '80%UCI')
fig <- fig %>% add_trace(y = ~tmat[jj, 3], type = 'scatter', mode = 'lines',
                         fill = 'tonexty', fillcolor = 'rgba(0,100,80,0.3)',
                         line = list(color = 'transparent'), showlegend = FALSE, name = '95%UCI')
fig <- fig %>% add_trace(y = ~tmat[jj, 2], type = 'scatter', mode = 'lines',
                         fill = 'tonexty', fillcolor = 'rgba(0,100,80,0.3)',
                         line = list(color = 'transparent'), showlegend = FALSE, name = '95%LCI')
fig <- fig %>% add_trace(y = ~tmat[jj, 4], type = 'scatter', mode = 'lines',
                         fill = 'tonexty', fillcolor = 'rgba(0,100,80,0.3)',
                         line = list(color = 'transparent'), showlegend = FALSE, name = '80%LCI')
fig <- fig %>% add_trace(y = ~tmat[jj, 1], type = 'scatter', mode = 'lines',
                         line = list(color = 'rgb(0,100,80)'), showlegend = FALSE, name = 'LnOR')
fig <- fig %>% add_annotations(
  x = xref,
  #y = floor(min(tmat[jj,])),
  #y = min(tmat[jj,]),
  y = floor_to(min(tmat[jj,]), to = 0.5),
  xref = "x", yref = "y", axref = "x", ayref = "y",
  text = paste("Ref. value =", xref),
  showarrow = TRUE,
  ax = xref,
  ay = max(tmat[jj,]) / 2
)
fig <- fig %>% layout(
  #title = ""
  plot_bgcolor = 'rgb(229,229,229)',
  xaxis = list(title = "Body mass index",
               gridcolor = 'rgb(255,255,255)',
               showgrid = TRUE, showline = FALSE, showticklabels = TRUE,
               tickcolor = 'rgb(127,127,127)', ticks = 'outside',
               zeroline = FALSE),
  yaxis = list(title = "Log Odds Ratio (Ln OR)",
               gridcolor = 'rgb(255,255,255)',
               showgrid = TRUE, showline = FALSE, showticklabels = TRUE,
               tickcolor = 'rgb(127,127,127)', ticks = 'outside',
               #range = c(-4.5,4.5),
               zeroline = FALSE))
fig
```

This figure illustrates the intricate relationship between body mass index and the risk of diabetes among the Pima Indian population in Arizona, USA. The log odds ratio (Ln OR) is presented alongside 80% and 95% confidence bands, using a BMI reference value of 40.

The **flexOR** package also enables users to generate predictions from the object *or2* returned by the *flexOR* function. The output provides predicted values, along with confidence intervals, for the log odds ratio at different values of the predictor body mass index. The reference value is set at 40 and the confidence level at 95%. The resulting table displays the reference value, the log odds ratios, and the corresponding lower and upper bounds for the given prediction values.

```{r flexOR1}
pdval <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 65)

predict(or2, predictor = "mass", ref.value = 40, conf.level = 0.95,
        prediction.values = pdval, ref.label = "Ref.")
```

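
Note that, by construction, the log odds ratio is zero at the reference value itself, since the smooth effect is compared with itself at that point:

$$
\ln \widehat{\operatorname{OR}}(x_{\mathrm{ref}} \mid x_{\mathrm{ref}}) \;=\; \hat{f}(x_{\mathrm{ref}}) - \hat{f}(x_{\mathrm{ref}}) \;=\; 0,
$$

so the row for the prediction value 40 corresponds to a log odds ratio of 0 (an odds ratio of 1), and the remaining rows are interpreted relative to that reference.
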

The package relies primarily on the **gam** package by Trevor Hastie, a choice driven by that package's ability to let users specify the desired degrees of freedom for each nonlinear term. The following commands refit the model with the degrees of freedom obtained through the AIC method and print a detailed summary. The summary includes the deviance residuals, the null and residual deviance, the AIC, and ANOVA tables for both the parametric and the nonparametric effects, providing a comprehensive overview of the model's performance and of the significance of each predictor.

```{r flexOR2}
m2 <- gam(diabetes ~ s(age, df = 3.3) + s(mass, df = 4.1) + pedigree,
          data = PimaIndiansDiabetes2, family = binomial)
summary(m2)
```

A good alternative is the `method = "GCV.Cp"` option, which yields comparable degrees of freedom. Below we show how the **mgcv** R package can be used.

```{r flexOR3}
m2.mgcv <- mgcv::gam(diabetes ~ s(age) + s(mass) + pedigree,
                     data = PimaIndiansDiabetes2, family = binomial,
                     method = "GCV.Cp")
m2.mgcv
summary(m2.mgcv)$edf
summary(m2.mgcv)
```

The output shown above presents the results of a logistic additive regression model that incorporates smooth terms for age and mass, along with a parametric term for pedigree. The estimated degrees of freedom for the smooth terms are 3.304 for age and 3.960 for mass, contributing to a total of 9.26 estimated degrees of freedom. The Un-Biased Risk Estimate (UBRE) score is low, at 0.07247985, suggesting a favorable model fit. The parametric coefficients show an intercept of -1.3344 and a pedigree coefficient of 0.9883, both statistically significant. The approximate significance of the smooth terms indicates highly significant contributions of age and mass to the model. The adjusted R-squared is 0.212, and the model explains 19.2% of the deviance. The model's performance metrics and the significance of the predictors suggest that it provides a reasonable fit to the data, as supported by the low p-values and the UBRE score.

The following commands can be used to obtain the smooth odds ratio (not on the log scale) for age.

```{r plot}
p2 <- plot(
  x = or2,
  predictor = "age",
  ref.value = 50,
  ref.label = "Reference Label",
  main = "Smooth odds ratio for age",
  xlab = "Age (years)",
  ylab = "Log Odds Ratio (Ln OR)",
  lty = c(1, 2, 2, 3, 3),
  round.x = 1,
  conf.level = 0.95
)

tmat <- exp(p2$estimates)  # odds ratio scale
xref <- p2$xref
mdata <- or2$dataset
jj <- match(sort(unique(mdata$age)), mdata$age)

fig <- plot_ly(x = mdata$age[jj], y = tmat[jj, 3], type = 'scatter', mode = 'lines',
               line = list(color = 'transparent'), showlegend = FALSE, name = '95%UCI')
fig <- fig %>% add_trace(y = ~tmat[jj, 1], type = 'scatter', mode = 'lines',
                         fill = 'tonexty', fillcolor = 'rgba(0,100,80,0.3)',
                         line = list(color = 'rgb(0,100,80)'), showlegend = FALSE, name = 'OR')
fig <- fig %>% add_trace(y = ~tmat[jj, 2], type = 'scatter', mode = 'lines',
                         fill = 'tonexty', fillcolor = 'rgba(0,100,80,0.3)',
                         line = list(color = 'transparent'), showlegend = FALSE, name = '95%LCI')
fig <- fig %>% add_annotations(
  x = xref,
  y = floor_to(min(tmat[jj,]), to = 0.5),
  xref = "x", yref = "y", axref = "x", ayref = "y",
  text = paste("Ref. value =", xref),
value =",xref), showarrow = T, ax = xref, ay = 1.05) fig <- fig %>% layout(#title = "" plot_bgcolor='rgb(229,229,229)', xaxis = list(title = " ", gridcolor = 'rgb(255,255,255)', showgrid = TRUE, showline = FALSE, showticklabels = TRUE, tickcolor = 'rgb(127,127,127)', ticks = 'outside', tickvals = c(20,30,40,50,60,70,80), zeroline = FALSE), yaxis = list(title = "Log Odds Ratio (Ln OR)", gridcolor = 'rgb(255,255,255)', showgrid = TRUE, showline = FALSE, showticklabels = TRUE, tickcolor = 'rgb(127,127,127)', ticks = 'outside', #range = c(-0.5,3.5), zeroline = FALSE)) fig ``` ## Acknowledgements The authors gratefully acknowledge financial support from Portuguese funds through FCT - Fundação para a Ciência e a Tecnologia, under Projects UIDB/00013/2020, UIDP/00013/2020, and EXPL/MAT-STA/0956/2021. ## Related References 1. Akaike, H. (1974). A new look at the statistical model identification. *IEEE Transactions on Automatic Control*, **19**(6), 716–723. [doi:10.1109/TAC.1974.1100705](https://doi.org/10.1109/TAC.1974.1100705) 2. Azevedo, M., Meira-Machado, L., Gude, F., and Araújo, A. (2024). Pointwise Nonparametric Estimation of Odds Ratio Curves with R: Introducing the flexOR Package. *Applied Sciences*, **14**(9), 1-17. [doi:10.3390/app14093897](https://doi.org/10.3390/app14093897) 3. Cadarso-Suárez, C. and Meira-Machado, L. and Kneib, T. and Gude, F. (2010). Flexible hazard ratio curves for continuous predictors in multi-state models: an application to breast cancer data. *Statistical Modelling*, **10**(3), 291–314. [doi:10.1177/1471082X0801000303](https://doi.org/10.1177/1471082X0801000303) 4. de Boor, C. (2001). *A Practical Guide to Splines: Revised Edition*, Springer, New York, NY. 5. Hastie, T. J. and Tibshirani, R. J. (1990). *Generalized Additive Models*, Chapman & Hall/CRC, New York, NY. 6. Hosmer, D. W. and Lemeshow, S. and Sturdivant, R. X. (2013). *Applied Logistic Regression: Third Edition*, John Wiley and Sons Inc., New York, NY. 7. Hurvich, C. M. and Simonoff, J. S. and Tsai, C. (1998). Smoothing parameter selection in nonparametric regression using an improved akaike information criterion. *Journal of the Royal Statistical Society Series B: Statistical Methodology*, **60**(2), 271–293. [doi:10.1111/1467-9868.00125](https://doi.org/10.1111/1467-9868.00125) 8. Meira-Machado, L. and Cadarso-Suárez, C. and Gude, F. and Araújo, A. (2013). smoothHR: An R Package for Pointwise Nonparametric Estimation of Hazard Ratio Curves of Continuous Predictors. *Computational and Mathematical Methods in Medicine*, **2013**, 11 pages. [doi:10.1155/2013/745742](https://doi.org/10.1155/2013/745742) 9. Royston, P. and Altman, D. G. and Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. *Statistics in Medicine*, **25**(1), 127–141. [doi:10.1002/sim.2331](https://doi.org/10.1002/sim.2331) 10. Schwarz, G. (1978). Estimating the dimension of a model. *Annals of Statistics*, **6**(2), 461–464. [doi:10.1214/aos/1176344136](https://doi.org/10.1214/aos/1176344136) 11. Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R: Second Edition*, Chapman & Hall/CRC, London, UK. 12. Wood, S. N. and Pya, N. and Safken, B. (2016). Smoothing Parameter and Model Selection for General Smooth Models. *Journal of the American Statistical Association*, **111**(516), 1548-1563. [doi:10.1080/01621459.2016.1180986](https://doi.org/10.1080/01621459.2016.1180986)