Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Un expected behavior when predicting spatial regression models #47

Open
MGousseff opened this issue Mar 18, 2024 · 7 comments
Open

Un expected behavior when predicting spatial regression models #47

MGousseff opened this issue Mar 18, 2024 · 7 comments

Comments

@MGousseff
Copy link

Thank you for the great spatial tools you provide to the community.

Maybe I'm doing something wrong as I'm rather new to spatial regressions, but here is a behavior I find strange.

if I fit a spatial regression model the way it is suggested in the man page, it works and I get some fitted values in the object I get. For instance :
formula1<-paste0( "to_predict ~ predictor_1 + predictor_2") fit.sem1<-lagsarlm(formula1, data=df_b, listw = neighbs_weights_b)

will work. Then my fit.sem object has a $fitted.values and it seems everything works. df_b is an sf object, and neghbs_weights_b have been produced as intended.

But when I try to use this model to predict my dependent variable on another location, using something like :
predict.sem2.arras<-predict(object=fit.sem2, newdata = df_a, listw = neighbs_weights_a, pred.type = "TS", legacy.mixed=TRUE, power=TRUE)

I get an error message :

Error in x$terms (explo_spat_cor_models.R#210): $ operator is invalid for atomic vectors

It is a bit confusing, as I thought my data was wrong, like missing one of the predictors. I checked the way the weights are provided, including how the row.names need to be specified so that they are different in the training area and the testing area.

I found, thanks to the code of an article by Thibault Laurent (thank him for making it available), that the proper use seems to be:

  1. Fit the non-spatial linear model on the "training data".
  2. Feed the resulting linear model to lagsarlm in lieu of just the formula
  3. Use the resulting spatial model to predict on the new area (providing that all the weights are properly computed and fed to the function, o course).

All in all, as it is working, maybe it is just that I'm not aware of practice that may seem obvious to the community, and if so, sorry for the inconvenience, but as an R user, it is rather puzzling that feeding the formula "almost" works and feeding the linear object "fully" works. Maybe just modifying the error message in predict.sarlm would be useful ?

If the issue is considered useful and the behavior can not be reproduced, I'll try and provide a small data set to reproduce it.

@rsbivand
Copy link
Member

Could you perhaps provide a minimal reproducible example? Might something in these lecture notes help https://rsbivand.github.io/PG_AGII_2sem/11_talk.html ?

@MGousseff
Copy link
Author

Thank you very much for the link. I will try to produce minimal data for a reproducible example as soon as possible, as the example you linked to seems to be the same situation. I'll try it as soon as possible. I still have to see what is the best way to provide the minimal data and how to look them here.

@MGousseff
Copy link
Author

Broke my hand so coding is slow. Still trying to understand my bug and produce minimal example, sorry for the delay

@adrianauscanga
Copy link

adrianauscanga commented Jun 28, 2024

Dear Roger,

I've been running some SAR models with the spatialreg package to analyze raster data in a project which main objective is to improve forest structural models by incorporating data on recent forest disturbance. The SAR models are working fine but I would like to use them for making predictions. Thus I split the data into a training data set (70% of data) and a test data set (30% of data), ran a SAR model using the errorsarlm function with the training data set and use the predict function to predict values in the test data set. Although everything runs, it looks like the predict function ignores the spatial component when using a new data set for making predictions. I was wondering if I'm doing something wrong here and if you have any suggestions for making predictions with SAR models using training and testing datasets. I've used the examples in your lectures for writing this script and I also read the papers you recommended on GitHub Issue #45.
Below you will find an example of the script and attached is a sample of the data set I'm working with.
The original data set has 4484 observations and 33 variables (including geographic location X and Y). Most of the variables are continuous and comprise information about environmental conditions (topography, soil) and forest disturbance. The variable I'm trying to predict is LAI (stands for leaf area index and it's called lai_harv on data set). Data was originally a raster with several layers that I transformed into a data frame using the centroid of each pixel as X and Y locations (in UTM).
Thank you very much for your time and help!
Best,

Adriana


lai_df.csv
# R script for Github Issue --making predictions with SAR ----

## Load libraries ----

library(tidyverse)
library(caret)
library(spatialreg)

## Load data ----

### This is a subset of the original data, which is 4484x33

#load("data/L2/lai_df.RData")
read_csv("data/L2/lai_df.csv")

## Data partition ----

### Set seed for random number generation
set.seed(2024)

### Split data 
index <- caret::createDataPartition(lai_df$lai_harv,
                                    p= 0.7,
                                    list = F,
                                    times = 1)

### Create train_df and test_df using data partition
train_df <- lai_df[index,]

test_df <- lai_df[-index,]


## Neighborhood matrix ----

loc_dist <- 36.48985
### I'm using 36.48 m as neighborhood distance

### X and Y of full data frame
loc_all <- (lai_df[,2:3])

### X and Y of rain data
loc_train <- (train_df[,2:3])

### Test data
loc_test <- (test_df[,2:3])

# Make list of neighborhoods using loc_dist (min distance between points)
neighbors_dist_all <- spdep::dnearneigh(loc_all, 0, d2 = loc_dist, longlat = F)
neighbors_dist_train <- spdep::dnearneigh(loc_train, 0, d2 = loc_dist, longlat = F)
neighbors_dist_test <- spdep::dnearneigh(loc_test, 0, d2 = loc_dist, longlat = F)

# Spatial weights

sp_all <- spdep::nb2listw(neighbors_dist_all, style = "W", zero.policy = T)
sp_train <- spdep::nb2listw(neighbors_dist_train, style = "W", zero.policy = T)
sp_test <- spdep::nb2listw(neighbors_dist_test, style = "W", zero.policy = T)

## SAR ----

#### Compute SAR model with errorsarlm function (which is the one that works with the predict function)
sar_lai <- errorsarlm(formula = lai_harv ~ dtm_harv + summer_insolation_harv + roughness + ndvi_growMag + ndmi_sdIndex + x_sc + ndvi_2018 + tcw_2018,
                       data = train_df,
                       listw = sp_train,
                       zero.policy = T)

summary(sar_lai)

### Prediction within data set
predict(sar_lai)

### Prediction using the test data set
lai_predictions_sar <- predict(sar_lai, newdata = test_df, listw = sp_all, zero.policy = T, pred.type = "TS")

### what happens with the spatial component of the model?

@rsbivand
Copy link
Member

Fitting a spatial error model may change the fitted regression coefficients compared to least squares, so the prediction will differ. The spatial term applies to the error, which is not observed, so as we saw from email exchange in April, there is no clear path forward. May I add (some of) my email comments to this thread?

Since April, I gave a talk in part about this, slides at: https://rsbivand.github.io/nem24_talk/, sources at https://github.com/rsbivand/nem24_talk, talk recording at: https://nhh.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=d26410ee-6243-48ce-96dd-b18400beb764, and see rsbivand/nem24_talk#1.

@adrianauscanga
Copy link

Dear Roger,
Thank you very much for your reply and the resources! Sorry to hear there's no clear path forward but glad to know there's research being developed on the topic. Of course, feel free to add your email comments to this thread.
Best,

Adriana

@rsbivand
Copy link
Member

Comments from email in April:

Please see these two recent issues on the spatialreg repository: #45, #47. Please follow up the references there, and maybe contribute there - preferably #47 which is still open, also consider providing there a minimal reproducible example. Using predictions to test/validation data sets to assess model goodness of fit with spatial data is always highly problematic, as the references in #45 show. You are right that there is no predict() method for spautolm objects. For SAR, use errorsarlm(), for CAR use some other fitting function in another package, mgcv::gam() with an mrf smooth for example, or INLA::inla.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants