cv.rfsi {meteo}        R Documentation

Nested k-fold cross-validation for Random Forest Spatial Interpolation (RFSI)

Description

Nested k-fold cross-validation for Random Forest Spatial Interpolation (RFSI) (Sekulić et al. 2020). It is based on the rfsi, pred.rfsi, and tune.rfsi functions. Currently, only spatial (leave-location-out) cross-validation is implemented. Temporal and spatio-temporal cross-validation will be implemented in the future.

Usage

cv.rfsi(formula,
        data,
        data.staid.x.y.time = c(1,2,3,4),
        obs,
        obs.staid.time = c(1,2),
        stations,
        stations.staid.x.y = c(1,2,3),
        zero.tol = 0,
        use.idw = FALSE,
        s.crs = NA,
        t.crs = NA,
        tgrid,
        tgrid.n=10,
        tune.type = "LLO",
        k = 5,
        seed=42,
        folds,
        fold.column,
        acc.metric,
        output.format = "data.frame",
        cpus = detectCores()-1,
        progress = TRUE,
        ...)

Arguments

formula

formula; Specifies the dependent variable and covariates (without nearest observations and distances to them). If z~1, an RFSI model with only nearest observations and distances to them as covariates will be cross-validated.

data

STFDF-class, STSDF-class or data.frame; Contains the dependent variable (observations) and covariates used for making an RFSI model. If it is a data.frame object, it should have the following columns: station ID (staid), longitude (x), latitude (y), time of the observation (time), observation value (obs) and covariates (cov1, cov2, ...). If covariates are missing, the RFSI model with nearest observations and distances to them as covariates (formula=z~1) will be cross-validated.

data.staid.x.y.time

numeric or character vector; Positions or names of the station ID (staid), longitude (x), latitude (y) and time columns in data if data is data.frame. Default is c(1,2,3,4).

obs

data.frame; Contains the dependent variable (observations) and covariates in space and time. It should have the following columns: station ID (staid), time of the observation (time), observation value (obs) and covariates (cov1, cov2, ...). This object is used together with stations (see below) to create an STFDF-class object (if the data object is missing), which is then used for making an RFSI model. If covariates are missing, the RFSI model with nearest observations and distances to them as covariates (formula=z~1) will be cross-validated.

obs.staid.time

numeric or character vector; Positions or names of the station ID (staid) and time columns in obs. Default is c(1,2).

stations

data.frame; It should have the following columns: station ID (staid), longitude (x) and latitude (y) of the stations. This object is used together with obs (see above) if the data object is missing.

stations.staid.x.y

numeric or character vector; Positions or names of the station ID (staid), longitude (x) and latitude (y) columns in stations. Default is c(1,2,3).

zero.tol

numeric; A distance value below (or equal to) which locations are considered as duplicates. Default is 0. See rm.dupl.

use.idw

logical; Whether IDW predictions from the n.obs nearest observations are calculated and used as a covariate (see function near.obs). Default is FALSE.

s.crs

Source CRS-class of observations (data). If NA, it is read from data.

t.crs

Target CRS-class for observations (data) reprojection. If NA, it will be set to s.crs. Note that observations should be in a projected CRS so that nearest observations can be found using Euclidean distances (see function near.obs).

tgrid

data.frame; Possible tuning parameters for nested folds. The columns are named the same as the tuning parameters. Possible tuning parameters are: n.obs, num.trees, mtry, min.node.size, sample.fraction, splitrule, and idw.p.

tgrid.n

numeric; Number of randomly chosen tgrid combinations used for nested tuning of RFSI. If it exceeds the number of combinations in tgrid, it is reduced to that number.
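
As a rough illustration of the random selection described above, the base R sketch below draws tgrid.n rows from a small tuning grid. It is not the package's internal code; the grid values and the names n_pick and tgrid_subset are made up for this example.

```r
# Hypothetical, small tuning grid; column names follow the tgrid argument.
tgrid <- expand.grid(n.obs = 2:3, mtry = 3:4, min.node.size = c(2, 5))
tgrid.n <- 5

# Never ask for more combinations than the grid holds.
n_pick <- min(tgrid.n, nrow(tgrid))

set.seed(42)  # reproducible draw
# Sample row indices without replacement and keep those combinations.
tgrid_subset <- tgrid[sample(nrow(tgrid), n_pick), ]
```

Each row of tgrid_subset is one candidate parameter combination that nested tuning would evaluate.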

tune.type

character; Type of cross-validation used for nested tuning: leave-location-out ("LLO"), leave-time-out ("LTO"), or leave-location-time-out ("LLTO"). Default is "LLO". "LTO" and "LLTO" are not implemented yet.

k

numeric; Number of random folds created for cross-validation and nested tuning with the CreateSpacetimeFolds function. The folds for cross-validation are created only if the folds or fold.column parameters are missing. Default is 5.

seed

numeric; Random seed that will be used to generate folds for cross-validation and nested tuning, with CreateSpacetimeFolds function.

folds

numeric or character vector; Fold assignment of each observation in data, used for cross-validation. Used if the fold.column parameter is not specified. Note that folds for nested tuning will be created with the CreateSpacetimeFolds function.

fold.column

numeric or character; Column name or number showing the position of the variable in data that represents the folds used for cross-validation. Note that folds for nested tuning will be created with the CreateSpacetimeFolds function.
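
A leave-location-out fold vector keeps all observations of a station in the same fold. The sketch below builds such a column with base R on a toy data.frame; toy_obs, station_fold and the column name fold are illustrative, and the package itself creates folds with CreateSpacetimeFolds when none are supplied.

```r
# Toy observations: 6 hypothetical stations, 3 time steps each.
set.seed(42)
toy_obs <- data.frame(staid = rep(c("A", "B", "C", "D", "E", "F"), each = 3),
                      time  = rep(1:3, times = 6),
                      tempc = rnorm(18))

k <- 3  # number of folds
station_ids <- unique(as.character(toy_obs$staid))

# Assign each station (not each observation) to one of k folds at random.
station_fold <- sample(rep(seq_len(k), length.out = length(station_ids)))
names(station_fold) <- station_ids

# Propagate the station-level fold to every observation of that station.
toy_obs$fold <- unname(station_fold[as.character(toy_obs$staid)])
```

When data is a data.frame, such a column could then be referenced through fold.column (for example fold.column = "fold").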

acc.metric

character; Accuracy metric that will be used as the criterion for choosing an optimal RFSI model in nested tuning. Possible values for regression: "ME", "MAE", "RMSE" (default), "R2", "CCC". Possible values for classification: "Accuracy", "Kappa" (default), "AccuracyLower", "AccuracyUpper", "AccuracyNull", "AccuracyPValue", "McnemarPValue".

output.format

character; Format of the output: "STFDF", "STSDF" or "data.frame" (default).

cpus

numeric; Number of processing units. Default is detectCores()-1.

progress

logical; Whether a progress bar is shown. Default is TRUE.

...

Further arguments passed to ranger.

Value

An STFDF-class, STSDF-class or data.frame object (depending on the output.format argument), with columns:

obs

Observations.

pred

Predictions from cross-validation.

folds

Folds used for cross-validation.

Author(s)

Aleksandar Sekulic asekulic@grf.bg.ac.rs

References

Sekulić, A., Kilibarda, M., Heuvelink, G. B., Nikolić, M. & Bajat, B. Random Forest Spatial Interpolation. Remote. Sens. 12, 1687, https://doi.org/10.3390/rs12101687 (2020).

See Also

near.obs, rfsi, pred.rfsi, tune.rfsi

Examples

library(meteo)
library(sp)
library(spacetime)
library(gstat)
library(plyr)
library(CAST)
library(doParallel)
library(ranger)
# prepare data
# load observation - data.frame of mean temperatures
data(dtempc)
data(stations)

serbia= point.in.polygon(stations$lon, stations$lat, c(18,22.5,22.5,18), c(40,40,46,46))
st= stations[ serbia!=0, ]
dtempc <- dtempc[dtempc$staid %in% st$staid, ]
dtempc <- dtempc[complete.cases(dtempc),]

# create STFDF
stfdf <- meteo2STFDF(dtempc,st)
# Adding CRS
stfdf@sp@proj4string <- CRS('+proj=longlat +datum=WGS84')

# load covariates for mean temperatures
data(regdata)
data(tregcoef)
# str(regdata)
regdata@sp@proj4string <- CRS('+proj=longlat +datum=WGS84')

# Overlay observations with covariates
time <- index(stfdf@time)
covariates.df <- as.data.frame(regdata)
names_covar <- names(tregcoef[[1]])[-1]
for (covar in names_covar){
  nrowsp <- length(stfdf@sp)
  regdata@sp=as(regdata@sp,'SpatialPixelsDataFrame')
  ov <- sapply(time, function(i) 
    if (covar %in% names(regdata@data)) {
      if (as.Date(i) %in% as.Date(index(regdata@time))) {
        over(stfdf@sp, as(regdata[, i, covar], 'SpatialPixelsDataFrame'))[, covar]
      } else {
        rep(NA, length(stfdf@sp))
      }
    } else {
      over(stfdf@sp, as(regdata@sp[covar], 'SpatialPixelsDataFrame'))[, covar]
    }
  )
  # ov <- do.call('cbind', ov)
  ov <- as.vector(ov)
  if (all(is.na(ov))) {
    stop(paste('There is no overlay of data with covariates!', sep = ""))
  }
  stfdf@data[covar] <- ov
}

# remove stations out of covariates
for (covar in names_covar){
  # count NAs per stations
  numNA <- apply(matrix(stfdf@data[,covar],
                        nrow=nrowsp,byrow=FALSE), MARGIN=1,
                 FUN=function(x) sum(is.na(x)))
  # Remove stations out of covariates
  rem <- numNA != length(time)
  stfdf <-  stfdf[rem,drop=FALSE]
}

# Remove dates out of covariates
rm.days <- c()
for (t in 1:length(time)) {
  if(sum(complete.cases(stfdf[, t]@data)) == 0) {
    rm.days <- c(rm.days, t)
  }
}
if(!is.null(rm.days)){
  stfdf <- stfdf[,-rm.days]
}

formula = 'tempc ~ temp_geo + modis + dem + twi'  # without nearest obs
t.crs=CRS("+proj=utm +zone=34 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")
s.crs=NA

# making tgrid
n.obs <- 2:3
min.node.size <- 2:10
sample.fraction <- seq(1, 0.632, -0.05) # 0.632 without / 1 with replacement
splitrule <- "variance"
ntree <- 250 # 500
mtry <- 3:(2+2*max(n.obs))
tgrid = expand.grid(min.node.size=min.node.size, num.trees=ntree,
                    mtry=mtry, n.obs=n.obs, sample.fraction=sample.fraction)

# Cross-validation of RFSI
rfsi_cv <- cv.rfsi(formula=formula, # without nearest obs
                   data= stfdf,
                   zero.tol=0,
                   s.crs=s.crs,
                   t.crs=t.crs,
                   tgrid=tgrid,
                   tgrid.n=5,
                   tune.type = "LLO",
                   k = 3, # number of folds
                   seed = 42,
                   acc.metric = "RMSE",
                   output.format = "data.frame", # "STFDF",
                   cpus = 2,# detectCores()-1,
                   progress=FALSE,
                   # ranger parameters
                   importance = "impurity")

summary(rfsi_cv)
# stplot(rfsi_cv[, , "pred"])
# stplot(rfsi_cv[, , "obs"])
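
Once cross-validation has finished, accuracy can be summarised from the observed and predicted values. The helper below, cv_accuracy, is a hypothetical name and not part of meteo; it assumes a data.frame with the obs and pred columns listed under Value and is shown here on toy numbers.

```r
# Hypothetical helper: regression accuracy from a data.frame with obs and pred.
cv_accuracy <- function(cv_df) {
  err <- cv_df$pred - cv_df$obs                      # prediction errors
  c(ME   = mean(err),                                # mean error (bias)
    MAE  = mean(abs(err)),                           # mean absolute error
    RMSE = sqrt(mean(err^2)),                        # root mean squared error
    R2   = 1 - sum(err^2) /
               sum((cv_df$obs - mean(cv_df$obs))^2)) # 1 - SSE/SST
}

# Toy stand-in for cross-validation output.
toy_cv <- data.frame(obs = c(1, 2, 3, 4), pred = c(1.5, 2, 2.5, 4))
acc <- cv_accuracy(toy_cv)

# With real results (assuming output.format = "data.frame"):
# cv_accuracy(rfsi_cv)
```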


[Package meteo version 1.0-1 Index]