
Reduce disk space requirement for eddy covariance download #131

@s-kganz

Description


Is your feature request related to a problem? Please describe.

Requesting several seasons of eddy covariance data can require a large amount of storage because all levels of the data product are bundled together. In my case, I only work with the level 4 products. Downloading the raw data takes about 60 GB of storage on disk before stacking. After running stackEddy, I am left with a 33 MB table with the NSAE data and QC flags I care about.

This discourages reproducibility because (1) downloading takes a long time, (2) it is antisocial to download tens of GB to a collaborator's machine, and (3) it encourages hosting a processed data table outside of NEON to get around (1) and (2).

Describe the solution you'd like

The optimal solution would be to let users download eddy covariance data directly in FLUXNET format. I know this partially exists already on the AmeriFlux data portal, but many sites have no FLUXNET-formatted data at all. This happens to affect my main study site (WREF), so here I am. (Note this also means I have to run REddyProc myself, potentially with different settings than site managers would prefer.)

Another option is to download only the desired data level. However, I imagine this would require backend changes to the API that are not feasible.

A third option is to modify the zipsByProduct -> stackEddy workflow to operate one site-month at a time, instead of processing all site-months together as done in this tutorial. This works, but deleting the intermediate files is error-prone (unlink doesn't even raise a warning if it fails), and you still have to wait for 60 GB to download overall.
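To illustrate the error-prone deletion step: unlink returns 0 on success and 1 on failure, but it does so invisibly and without a warning, so failures are easy to miss. A defensive wrapper along these lines (an illustrative sketch, not part of neonUtilities) makes the failure loud:

```r
# unlink() returns 0 for success and 1 for failure, silently.
# This hypothetical helper turns a silent failure into an error.
safe_unlink <- function(path) {
  status <- unlink(path, recursive = TRUE)
  if (status != 0 || dir.exists(path)) {
    stop("Failed to delete: ", path)
  }
  invisible(status)
}
```

Something like this has to be repeated around every intermediate-file cleanup in the one-site-month-at-a-time workflow, which is part of why a smaller download would be preferable.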

Describe alternatives you've considered

Right now I'm running zipsByProduct and stackEddy one site-month at a time, deleting any intermediate products along the way so that only ~250 MB of disk space is needed at any one time. A brief reprex:

library(neonUtilities)
library(foreach)
library(dplyr)

tdir <- tempdir()
fpath <- file.path(tdir, "filesToStack00200")

# Download five site-months of H2O/CO2 NSAE
site_mos <- paste0("2019-0", seq(5, 9))

vars <- c(
  "timeBgn", "timeEnd",
  "data.fluxCo2.nsae.flux",
  "qfqm.fluxCo2.nsae.qfFinl",
  "data.fluxH2o.nsae.flux",
  "qfqm.fluxH2o.nsae.qfFinl"
)

wref_nsae <- foreach(sm=site_mos, .combine=rbind) %do% {
  zipsByProduct(
    "DP4.00200.001",
    site="WREF",
    startdate=sm,
    enddate=sm,
    savepath=tdir,
    check.size=FALSE
  )
  
  myeddy <- stackEddy(fpath)[["WREF"]] %>%
    select(all_of(vars))
  
  unlink(fpath, recursive=TRUE)
  stopifnot(!dir.exists(fpath))
  
  myeddy
}

With my machine/internet this takes about 2 hours to download all the flux data I work with.

Additional context

I think this package is filling a really important role in the research community. I'd love to be able to write a paper and have a script linked that will run the entire analysis all the way through generating figures that appear in the manuscript. Having more flexibility in how flux data are downloaded would make this goal much more achievable.
