Merge pull request #26 from cmusso86/update_release
Completing the vignette and attempting a call-out
cmusso86 authored Jun 30, 2024
2 parents 94d02e4 + 139857b commit dcfbc98
Showing 7 changed files with 88 additions and 14 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,12 @@ Authors@R:
role = c("aut","ths", 'cph'),
email = "[email protected]",
comment = c(ORCID = "0000-0003-2009-4844")))
Description: Enables the diagnostics and enhancement of regression model calibration.It offers both global and local visualization tools for calibration diagnostics and provides one recalibration method: Torres R, Nott DJ, Sisson SA, Rodrigues T, Reis JG, Rodrigues GS (2024) <doi:10.48550/arXiv.2403.05756>. The method leverages on Probabilistic Integral Transform (PIT) values to both evaluate and perform the calibration of statistical models.
Description: Enables the diagnostics and enhancement of regression model calibration. It offers both global and local visualization tools for calibration diagnostics and provides one recalibration method: Torres R, Nott DJ, Sisson SA, Rodrigues T, Reis JG, Rodrigues GS (2024) <doi:10.48550/arXiv.2403.05756>. The method leverages Probability Integral Transform (PIT) values to both evaluate and perform the calibration of statistical models. For a more detailed description of the package, please refer to the bachelor's thesis linked below.
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
URL: https://github.com/cmusso86/recalibratiNN, https://cmusso86.github.io/recalibratiNN/
URL: https://bdm.unb.br/handle/10483/38504, https://github.com/cmusso86/recalibratiNN, https://cmusso86.github.io/recalibratiNN/
BugReports: https://github.com/cmusso86/recalibratiNN/issues
Imports:
stats(>= 3.0.0),
Expand Down
3 changes: 2 additions & 1 deletion R/recalibrate.R
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
#' @description
#' This function offers recalibration techniques for regression models that assume Gaussian distributions by using the
#' Mean Squared Error (MSE) as the loss function. Based on the work by Torres R. et al. (2024), it supports
#' both local and global recalibration approaches to provide samples from a recalibrated predictive distribution.
#' both local and global recalibration approaches to provide samples from a recalibrated predictive distribution. A detailed algorithm can also be found in Musso C. (2023).
#'
#' @param yhat_new Numeric vector with predicted response values for the new (or test) set.
#' @param space_cal Numeric matrix or data frame representing the covariates/features of the calibration/validation set,
Expand Down Expand Up @@ -43,6 +43,7 @@
#'
#' @references
#' \insertRef{torres2024}{recalibratiNN}
#' \insertRef{musso2023}{recalibratiNN}
#'
#' @examples
#'
Expand Down
2 changes: 2 additions & 0 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@ output: github_document
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
message = FALSE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "80%",
Expand Down
11 changes: 9 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,11 +46,15 @@ download.
``` r
if(!require(pacman)) install.packages("pacman")
pacman::p_load_current_gh("cmusso86/recalibratiNN")
#> crayon (1.5.2 -> 1.5.3) [CRAN]
#> cli (3.6.2 -> 3.6.3) [CRAN]
#>
#> The downloaded binary packages are in
#> /var/folders/rp/h9_9qkdd7c57z9_hytk4306h0000gn/T//Rtmpx2IcOw/downloaded_packages
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#> checking for file ‘/private/var/folders/rp/h9_9qkdd7c57z9_hytk4306h0000gn/T/RtmpCPgUPL/remotes159257e41e1ea/cmusso86-recalibratiNN-c947b5d/DESCRIPTION’ ... ✔ checking for file ‘/private/var/folders/rp/h9_9qkdd7c57z9_hytk4306h0000gn/T/RtmpCPgUPL/remotes159257e41e1ea/cmusso86-recalibratiNN-c947b5d/DESCRIPTION’
#> checking for file ‘/private/var/folders/rp/h9_9qkdd7c57z9_hytk4306h0000gn/T/Rtmpx2IcOw/remotes17f90582977a6/cmusso86-recalibratiNN-94d02e4/DESCRIPTION’ ... ✔ checking for file ‘/private/var/folders/rp/h9_9qkdd7c57z9_hytk4306h0000gn/T/Rtmpx2IcOw/remotes17f90582977a6/cmusso86-recalibratiNN-94d02e4/DESCRIPTION’
#> ─ preparing ‘recalibratiNN’:
#> checking DESCRIPTION meta-information ... ✔ checking DESCRIPTION meta-information
#> checking DESCRIPTION meta-information ... ✔ checking DESCRIPTION meta-information
#> ─ installing the package to process help pages
#> Loading required namespace: recalibratiNN
#> ─ saving partial Rd database
Expand All @@ -60,6 +64,9 @@ pacman::p_load_current_gh("cmusso86/recalibratiNN")
#> WARNING: Added dependency on R >= 3.5.0 because serialized objects in
#> serialize/load version 3 cannot be read in older versions of R.
#> File(s) containing such objects:
#> ‘recalibratiNN/inst/extdata/mse_cal.rds’
#> ‘recalibratiNN/inst/extdata/y_hat_cal.rds’
#> ‘recalibratiNN/inst/extdata/y_hat_test.rds’
#> ‘recalibratiNN/vignettes/mse_cal.rds’
#> ‘recalibratiNN/vignettes/y_hat_cal.rds’
#> ‘recalibratiNN/vignettes/y_hat_test.rds’
Expand Down
11 changes: 11 additions & 0 deletions inst/REFERENCES.bib
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,14 @@ @article{torres2024
acm_classes={G.3; I.5.1; I.6.4},
doi={10.48550/arXiv.2403.05756}
}


@misc{musso2023,
  author = {Carolina Musso},
  title = {Recalibration of Gaussian Neural Network Regression Models: The RecalibratiNN Package},
  year = {2023},
  month = {Dec},
  howpublished = {Undergraduate Thesis (Bachelor in Statistics), University of Brasília},
  note = {Available at: \url{https://bdm.unb.br/handle/10483/38504}}
}
3 changes: 2 additions & 1 deletion man/recalibrate.Rd


68 changes: 60 additions & 8 deletions vignettes/simple_mlp.Rmd
Original file line number Diff line number Diff line change
@@ -1,22 +1,59 @@
---
title: "ANN ajusted to bidimensional data"
subtitle: "A visual example of how to recalibrate a neural network"
title: "Recalibrating the predictions of an ANN."
subtitle: "A visual example of recalibration using bidimensional data."
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{ANN adjusted to bidimensional data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
header-includes:
- \usepackage{amsmath}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
message = FALSE,
comment = "#>"
)
```

## PROBLEM

The calibration of a model can be evaluated by comparing observed values with their respective estimated conditional (or predictive) distributions. This evaluation can be conducted globally, examining overall calibration, or locally, investigating calibration in specific regions of the covariate space. To illustrate how the package can improve a model's calibration, let's consider an artificial example.

```{r setup}
library(recalibratiNN)
```

In the following example, we will recalibrate the predictions of an artificial neural network (ANN) fitted to non-linear, heteroscedastic data. First, we simulate the data as follows:


We define the sample size:
\begin{equation}
n = 10000
\end{equation}

The vectors \(x_1\) and \(x_2\) are generated from uniform distributions:
\begin{equation}
x_1 \sim \text{Uniform}(-3, 3)
\end{equation}
\begin{equation}
x_2 \sim \text{Uniform}(-5, 5)
\end{equation}

We define the function \(\mu\) as:
\begin{equation}
\mu(x) = \left| x_1^3 - 50 \sin(x_2) + 30 \right|
\end{equation}

The response variable \(y\) is generated from a normal distribution with mean \(\mu\) and standard deviation \(20 \left| \frac{x_2}{x_1 + 10} \right|\):
\begin{equation}
y \sim \mathcal{N}\left(\mu, 20 \left| \frac{x_2}{x_1 + 10} \right|\right)
\end{equation}
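The simulation defined above can be sketched in R as follows (the seed and object names are assumptions for illustration; the vignette generates the data in a hidden chunk that may differ):

```r
# Simulate non-linear, heteroscedastic data as described above
set.seed(42)  # assumed seed, for reproducibility

n  <- 10000
x1 <- runif(n, -3, 3)
x2 <- runif(n, -5, 5)

# Conditional mean and heteroscedastic standard deviation
mu    <- abs(x1^3 - 50 * sin(x2) + 30)
sigma <- 20 * abs(x2 / (x1 + 10))

y <- rnorm(n, mean = mu, sd = sigma)
```

Because the standard deviation depends on `x1` and `x2`, a model that assumes a single fixed variance (such as one estimated via a global MSE) will necessarily be miscalibrated in parts of the covariate space.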

```{r echo = F}
library(glue)
library(RANN)
Expand Down Expand Up @@ -56,6 +93,7 @@ y_test <- y[(split2*n+1):n]
```

Now, this toy model was trained using the Keras framework with a TensorFlow backend. The architecture consists of 3 hidden layers with ReLU activation functions and dropout for regularization, as follows:

```{r, eval=F}
model_nn <- keras_model_sequential()
Expand Down Expand Up @@ -118,22 +156,27 @@ y_hat_cal <- readRDS(file_path2)|> as.numeric()
file_path3 <- system.file("extdata", "y_hat_test.rds", package = "recalibratiNN")
y_hat_test <- readRDS(file_path3)|> as.numeric()
```

## MISCALIBRATION DIAGNOSTICS

Now, we can evaluate the calibration of the model with the appropriate functions of the recalibratiNN package. First, we will calculate the Probability Integral Transform (PIT) values using the `PIT_global` function and visualize them using the `gg_PIT_global` function.

As the resulting graph shows, the model is globally miscalibrated: its PIT values deviate markedly from a uniform distribution.

```{r}
## Global calibrations
pit <- PIT_global(ycal = y_cal,
yhat = y_hat_cal,
mse = MSE_cal)
gg_PIT_global(pit)
```

For comparison, we will also calculate the local PIT values using the local functions. This is important because the model may be well calibrated globally but not locally. In other words, it may exhibit varying or even opposing patterns of miscalibration throughout the covariate space, which can be compensated for when analyzed globally.

Here, we can see that the model is miscalibrated in different ways across regions of the covariate space.

```{r}
pit_local <- PIT_local(xcal = x_cal,
ycal = y_cal,
Expand All @@ -143,9 +186,16 @@ pit_local <- PIT_local(xcal = x_cal,
gg_PIT_local(pit_local,
facet = TRUE)
```

Since this example uses bidimensional data, we can visualize the calibration of the model over a surface representing the covariate space. In this graph, we use a 95% confidence interval centered on the mean predicted by the model, with a fixed variance estimated by the Mean Squared Error (MSE). When the true value falls within the interval, the point is colored greenish; when it falls outside, it is colored dark blue.

::: {.callout-note}
Note that this visualization is not part of the recalibratiNN package since it can only be applied to bidimensional data, which is not typically the case when adjusting neural networks. This example was used specifically to demonstrate (mis)calibration visually and to make the concept more tangible.
:::

The following graph illustrates the original coverage of the model, which is around 90%. Thus, globally, we observe that the model underestimates the true uncertainty of the data (90% < 95%). However, despite the global coverage being approximately 90%, there are specific regions where the model consistently makes more incorrect predictions (falling well below the 95% mark), while predicting accurately (100%) within other regions. Although this last part may initially seem favorable (more accuracy is typically desirable), it indicates that the model does not adequately capture the uncertainty of the predictions (it overestimates it in those regions). This highlights the importance of interpreting predictions probabilistically, considering a distribution rather than just a point prediction.

```{r}
coverage_model <- tibble(
x1cal = x_test[,1],
Expand All @@ -160,19 +210,21 @@ mutate(lwr = qnorm(0.05, y_hat, sqrt(MSE_cal)),
)
coverage_model |>
arrange(CI) |>
ggplot() +
geom_point(aes(x1cal,
x2cal,
color = CI),
alpha = 0.8)+
alpha = 0.9,
size = 3)+
labs(x="x1" , y="x2",
title = glue("Original coverage: {coverage_model$coverage[1]} %"))+
scale_color_manual("Confidence Interval",
values = c("in" = "aquamarine3",
"out" = "steelblue4"))+
theme_classic()
```

## RECALIBRATION
```{r}
recalibrated <-
recalibrate(
Expand Down
