An Intake-based catalog of FRESCA datasets and supporting tools
The purpose of `fresca-catalog` is to provide an interface to the FRESCA dataset inventory that facilitates exploration and discovery. To do this, `fresca-catalog` makes use of the open-source package Intake to represent the FRESCA datasets as a unified, machine-readable catalog. `fresca-catalog` also defines custom functionality for querying and filtering the catalog, as well as for curating the catalog as it grows.
In order to use `fresca-catalog`, you'll need to install either Anaconda or Miniconda to handle environment management. Once one of those is installed, the next step is to clone this repository:
```
git clone git@github.com:axiom-data-science/fresca-catalog.git
```
Now navigate into the resulting `fresca-catalog` directory, and create the Conda environment for this project with:
```
conda env update
```
This could take a while. After it finishes running (hopefully successfully), you can activate the `fresca` environment with:
```
conda activate fresca
```
Now all of `fresca-catalog`'s dependencies are installed and accessible, and you can get started running code in JupyterLab by running:
```
jupyter lab
```
And if you prefer another IDE to JupyterLab, like Visual Studio Code, then by all means use that instead.
`fresca-catalog` addresses two primary use cases:
- Loading and interactively exploring the FRESCA Intake catalog via a Jupyter notebook
- Rebuilding the FRESCA Intake catalog in order to add new dataset entries or update existing entries
`fresca-catalog` contains three key components to enable these use cases:
- `full_catalog.yml`: The Intake catalog itself, in its on-disk YAML representation
- `fresca_catalog/`: The Python package containing the logic required for building and interacting with that Intake catalog
- `base_catalog.yml`: The minimal "seed" catalog required to programmatically build the "enriched" `full_catalog.yml`
We'll explore these two use cases and the components listed above in the following sections.
One benefit of Intake is that its catalogs can easily be serialized to disk as YAML files. This makes them portable and easy to version. In `fresca-catalog` we store the full FRESCA Intake catalog as `full_catalog.yml`. Below, we'll cover the basics of querying and filtering the full catalog, but for details and to follow along, please open `example.ipynb`.
Loading the FRESCA Intake catalog is as simple as:
```python
import intake

catalog = intake.open_catalog('full_catalog.yml')
```
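Once loaded, everyday Intake operations apply. Here's a minimal sketch; the `'some_dataset'` entry name is a placeholder, and `list(catalog)` will show the real names:

```python
# List the names of all dataset entries in the catalog
print(list(catalog))

# Read one entry into memory; for tabular sources this typically
# returns a pandas DataFrame ('some_dataset' is a placeholder name)
df = catalog['some_dataset'].read()
```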
`catalog` is just a plain vanilla Intake catalog and, as such, supports anything described in the official Intake documentation. But in order to make use of `catalog` for querying and filtering, we've defined a variety of helper functions within `fresca_catalog`. These include functions to help facilitate:
- Searching for variables of interest within the catalog
- Selecting variables and other parameters of interest (e.g. datasets, time range, and bounding box) using interactive GUI selectors
- Filtering the full catalog down to a subset using saved selections
- Plotting a catalog's data availability across a variety of dimensions
Details on the usage of these functions can be found in `example.ipynb`, via the docstrings in `fresca_catalog`'s modules, or by calling `help()` on any function of interest. For example, to learn how to use the function `search_catalog_variables`, simply run `help(search_catalog_variables)`:
```
search_catalog_variables(catalog, query=None, case_insensitive=True, fuzzy=False) -> List[str]

Searches the catalog for variables.

Parameters
----------
catalog : Catalog
    The catalog to search.
query : str or list, optional
    The query or list of queries to search for. If None, returns all variables. Default is None.
case_insensitive : bool, optional
    Whether to perform a case-insensitive search. Default is True.
fuzzy : bool, optional
    Whether to perform a fuzzy search. Default is False.

Returns
-------
list
    A list of matching variables.
```
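Based on that signature, a call might look like the following sketch (the import path is an assumption here; check `fresca_catalog`'s modules for the real one):

```python
# Import path is an assumption; see fresca_catalog's modules for the real one
from fresca_catalog.catalog import search_catalog_variables

# Fuzzy, case-insensitive search for temperature-like variable names
matches = search_catalog_variables(catalog, query='temperature', fuzzy=True)
print(matches)
```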
The purpose of all of `fresca_catalog`'s helper functions is to assist in narrowing the search down to a set of datasets that will comprise a sub-catalog, which can then be used for more serious scientific analysis. Let's say this searching, filtering, and plotting of the full catalog results in the definition of a new catalog called `my_catalog`. Rather than needing to rebuild it every time you want to explore it, you can simply save it to disk as a YAML file, in the same way we do below with `full_catalog.yml`. For example:
```python
my_catalog.to_yaml_file('my_catalog.yml')
```
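The saved sub-catalog can then be reloaded in a later session just like the full catalog:

```python
import intake

# Reload the saved sub-catalog, exactly as with full_catalog.yml
my_catalog = intake.open_catalog('my_catalog.yml')
```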
For more details on how to use these various selection, filtering, and plotting functions, please see `example.ipynb`.
Another benefit of Intake is the ease with which you can programmatically build a new catalog from minimal input. In cases where we have a new dataset entry to add to the catalog, or we need to update an existing dataset entry, we'll want to take advantage of this functionality.
The key ingredient driving this programmatic generation is `base_catalog.yml`. For example, a `base_catalog.yml` with two entries could look like:
```yaml
- dataset_id: CRCP_Carbonate_Chemistry_Atlantic
  server_url: https://www.ncei.noaa.gov/erddap
  type: erddap
- lat_col: lat_dec
  lon_col: lon_dec
  time_col: datetime
  type: csv
  url: https://files.axds.co/tmp/SFER_data.csv
```
`base_catalog.yml` currently supports two different types of entries: `csv` types and `erddap` types. Because ERDDAP servers provide metadata in addition to the data itself, the `erddap` type entry is very minimal, requiring just the `server_url`, the `dataset_id`, and the `type`. CSVs, on the other hand, embed less structured metadata, and thus require a more verbose entry, specifically the addition of `lat_col`, `lon_col`, and `time_col` to support automated metadata extraction for those columns.
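So, registering a new CSV dataset amounts to appending an entry like this hypothetical one to `base_catalog.yml` (the URL and column names are placeholders, not a real dataset):

```yaml
# Hypothetical new entry; the URL and column names are placeholders
- type: csv
  url: https://example.com/data/new_survey.csv
  lat_col: latitude
  lon_col: longitude
  time_col: sample_time
```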
From this minimal `base_catalog.yml`, regenerating `full_catalog.yml` is as simple as activating the environment with `conda activate fresca` and running:
```
python -m fresca_catalog.catalog build base_catalog.yml full_catalog.yml
```
If you're more comfortable in a Jupyter notebook, the same thing can be accomplished by running a cell with the following:
```python
from fresca_catalog.catalog import build_catalog

build_catalog('base_catalog.yml', 'full_catalog.yml')
```
This should result in a modified `full_catalog.yml` being written to disk for you to explore. If the new `full_catalog.yml` contains new or updated dataset entries and represents the latest and greatest canonical version of the catalog that everyone should be using, then it should also be committed to the repository and pushed so others can access it.
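Before committing, a quick sanity check is to load the rebuilt catalog and confirm that the new or updated entries show up:

```python
import intake

# Load the freshly rebuilt catalog and list its entries
catalog = intake.open_catalog('full_catalog.yml')
print(list(catalog))
```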