
Reduce disk space requirement for eddy covariance download #131

@s-kganz

Description


Is your feature request related to a problem? Please describe.

Requesting several seasons of eddy covariance data can require a large amount of storage because all levels of the data product are bundled together. In my case, I only work with the level 4 products. Downloading the raw data takes about 60 GB of storage on disk before stacking. After running stackEddy, I am left with a 33 MB table with the NSAE data and QC flags I care about.

This discourages reproducibility because (1) downloading takes a long time, (2) it is antisocial to download tens of GB to a collaborator's machine, and (3) it encourages hosting a processed data table outside of NEON to get around (1) and (2).

Describe the solution you'd like

The optimal solution would be to let users download eddy covariance data directly in FLUXNET format. I know this partially exists already on the AmeriFlux data portal, but many sites have no FLUXNET-formatted data at all. This happens to affect my main study site (WREF), so here I am. (Note this also means I have to run REddyProc myself, potentially with different settings than site managers would prefer.)

Another option is to download only the desired data level. However, I imagine this would require backend changes to the API that are not feasible.

A third option is to modify the zipsByProduct -> stackEddy workflow to operate one site-month at a time, instead of processing all site-months together as done in this tutorial. This works, but deleting the intermediate files is error-prone (unlink doesn't even raise a warning if it fails), and you still have to wait for 60 GB to download overall.
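To illustrate the error-prone deletion step: unlink returns 0 on success and 1 on failure, but it does so invisibly and without a warning, so failures are easy to miss. A defensive wrapper along these lines (an illustrative sketch, not part of neonUtilities) makes the failure loud:

```r
# unlink() returns 0 for success and 1 for failure, silently.
# This hypothetical helper turns a silent failure into an error.
safe_unlink <- function(path) {
  status <- unlink(path, recursive = TRUE)
  if (status != 0 || dir.exists(path)) {
    stop("Failed to delete: ", path)
  }
  invisible(status)
}
```

Something like this has to be repeated around every intermediate-file cleanup in the one-site-month-at-a-time workflow, which is part of why a smaller download would be preferable.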

Describe alternatives you've considered

Right now I'm running zipsByProduct and stackEddy one site-month at a time, deleting any intermediate products along the way so that only ~250 MB of disk space is needed at any one time. A brief reprex:

library(neonUtilities)
library(foreach)
library(dplyr)

tdir <- tempdir()
fpath <- file.path(tdir, "filesToStack00200")

# Download five site-months of H2O/CO2 NSAE
site_mos <- paste0("2019-0", seq(5, 9))

vars <- c(
  "timeBgn", "timeEnd",
  "data.fluxCo2.nsae.flux",
  "qfqm.fluxCo2.nsae.qfFinl",
  "data.fluxH2o.nsae.flux",
  "qfqm.fluxH2o.nsae.qfFinl"
)

wref_nsae <- foreach(sm=site_mos, .combine=rbind) %do% {
  zipsByProduct(
    "DP4.00200.001",
    site="WREF",
    startdate=sm,
    enddate=sm,
    savepath=tdir,
    check.size=FALSE
  )
  
  myeddy <- stackEddy(fpath)[["WREF"]] %>%
    select(all_of(vars))
  
  unlink(fpath, recursive=TRUE)
  stopifnot(!dir.exists(fpath))
  
  myeddy
}

With my machine/internet this takes about 2 hours to download all the flux data I work with.

Additional context

I think this package is filling a really important role in the research community. I'd love to be able to write a paper and have a script linked that will run the entire analysis all the way through generating figures that appear in the manuscript. Having more flexibility in how flux data are downloaded would make this goal much more achievable.
