An Intake-based catalog of FRESCA datasets and supporting tools
The purpose of `fresca-catalog` is to provide an interface to the FRESCA dataset inventory that facilitates exploration and discovery. To do this, `fresca-catalog` makes use of the open-source package Intake to represent the FRESCA datasets as a unified, machine-readable catalog. `fresca-catalog` also defines custom functionality for querying and filtering the catalog, as well as for curating the catalog as it grows.
In order to use `fresca-catalog`, you'll need to install either Anaconda or Miniconda to handle environment management. Once one of those is installed, the next step is to clone this repository:
```
git clone git@github.com:axiom-data-science/fresca-catalog.git
```
Now navigate into the resulting `fresca-catalog` directory, and create the Conda environment for this project with:
```
conda env update
```
This could take a while. After it finishes running (hopefully successfully), you can activate the `fresca` environment with:
```
conda activate fresca
```
Now all of `fresca-catalog`'s dependencies are installed and accessible, and you can get started running code in JupyterLab by running:
```
jupyter lab
```
And if you prefer another IDE to JupyterLab, like Visual Studio Code, then by all means use that instead.
`fresca-catalog` addresses two primary use cases:
- Loading and interactively exploring the FRESCA Intake catalog via a Jupyter notebook
- Rebuilding the FRESCA Intake catalog in order to add new dataset entries or update existing entries
`fresca-catalog` contains three key components to enable these use cases:
- `full_catalog.yml`: The Intake catalog itself, in its on-disk YAML representation
- `fresca_catalog/`: The Python package containing the logic required for building and interacting with that Intake catalog
- `base_catalog.yml`: The minimal "seed" catalog required to programmatically build the "enriched" `full_catalog.yml`
We'll explore these two use cases and the components listed above in the following sections.
One benefit of Intake is that its catalogs can easily be serialized to disk as YAML files. This makes them portable and easy to version. In `fresca-catalog` we store the full FRESCA Intake catalog as `full_catalog.yml`. Below, we'll cover the basics of querying and filtering the full catalog, but for details and to follow along, please open `example.ipynb`.
Loading the FRESCA Intake catalog is as simple as:
```python
import intake

catalog = intake.open_catalog('full_catalog.yml')
```
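Once loaded, everyday Intake operations apply. Here's a minimal sketch; the `'some_dataset'` entry name is a placeholder, and `list(catalog)` will show the real names:

```python
# List the names of all dataset entries in the catalog
print(list(catalog))

# Read one entry into memory; for tabular sources this typically
# returns a pandas DataFrame ('some_dataset' is a placeholder name)
df = catalog['some_dataset'].read()
```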
`catalog` is just a plain vanilla Intake catalog and, as such, supports anything described in the official Intake documentation. But in order to make use of `catalog` for querying and filtering, we've defined a variety of helper functions within `fresca_catalog`. These include functions to help facilitate:
- Searching for variables of interest within the catalog
- Selecting variables and other parameters of interest (e.g. datasets, time range, and bounding box) using interactive GUI selectors
- Filtering the full catalog down to a subset using saved selections
- Plotting a catalog's data availability across a variety of dimensions
Details on the usage of these functions can be found in `example.ipynb`, via the docstrings in `fresca_catalog`'s modules, or by calling `help()` on any function of interest. For example, to learn how to use the function `search_catalog_variables`, simply run `help(search_catalog_variables)`:
```
search_catalog_variables(catalog, query=None, case_insensitive=True, fuzzy=False) -> List[str]

Searches the catalog for variables.

Parameters
----------
catalog : Catalog
    The catalog to search.
query : str or list, optional
    The query or list of queries to search for. If None, returns all variables. Default is None.
case_insensitive : bool, optional
    Whether to perform a case-insensitive search. Default is True.
fuzzy : bool, optional
    Whether to perform a fuzzy search. Default is False.

Returns
-------
list
    A list of matching variables.
```
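Based on that signature, a call might look like the following sketch (the import path is an assumption here; check `fresca_catalog`'s modules for the real one):

```python
# Import path is an assumption; see fresca_catalog's modules for the real one
from fresca_catalog.catalog import search_catalog_variables

# Fuzzy, case-insensitive search for temperature-like variable names
matches = search_catalog_variables(catalog, query='temperature', fuzzy=True)
print(matches)
```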
The purpose of all of `fresca_catalog`'s helper functions is to assist in narrowing the search down to a set of datasets that will comprise a sub-catalog, which can then be used for more serious scientific analysis. Let's say this searching, filtering, and plotting of the full catalog results in the definition of a new catalog called `my_catalog`. Rather than needing to rebuild it every time you want to explore it, you can simply save it to disk as a YAML file, in the same way we do below with `full_catalog.yml`. For example:
```python
my_catalog.to_yaml_file('my_catalog.yml')
```
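The saved sub-catalog can then be reloaded in a later session just like the full catalog:

```python
import intake

# Reload the saved sub-catalog, exactly as with full_catalog.yml
my_catalog = intake.open_catalog('my_catalog.yml')
```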
For more details on how to use these various selection, filtering, and plotting functions, please see `example.ipynb`.
Another benefit of Intake is the ease with which you can programmatically build a new catalog from minimal input. In cases where we have a new dataset entry to add to the catalog, or we need to update an existing dataset entry, we'll want to take advantage of this functionality.
The key ingredient driving this programmatic generation is `base_catalog.yml`. For example, a `base_catalog.yml` with two entries could look like:
```yaml
- dataset_id: CRCP_Carbonate_Chemistry_Atlantic
  server_url: https://www.ncei.noaa.gov/erddap
  type: erddap
- lat_col: lat_dec
  lon_col: lon_dec
  time_col: datetime
  type: csv
  url: https://files.axds.co/tmp/SFER_data.csv
```
`base_catalog.yml` currently supports two different types of entries: `csv` types and `erddap` types. Because ERDDAP servers provide metadata in addition to the data itself, the `erddap` type entry is very minimal, requiring just the `server_url`, the `dataset_id`, and the `type`. CSVs, on the other hand, embed less structured metadata, and thus require a more verbose entry, specifically the addition of `lat_col`, `lon_col`, and `time_col` to support automated metadata extraction for those columns.
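So, registering a new CSV dataset amounts to appending an entry like this hypothetical one to `base_catalog.yml` (the URL and column names are placeholders, not a real dataset):

```yaml
# Hypothetical new entry; the URL and column names are placeholders
- type: csv
  url: https://example.com/data/new_survey.csv
  lat_col: latitude
  lon_col: longitude
  time_col: sample_time
```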
From this minimal `base_catalog.yml`, regenerating `full_catalog.yml` is as simple as activating the environment with `conda activate fresca` and running:
```
python -m fresca_catalog.catalog build base_catalog.yml full_catalog.yml
```
If you're more comfortable in a Jupyter notebook, the same thing can be accomplished by running a cell with the following:
```python
from fresca_catalog.catalog import build_catalog

build_catalog('base_catalog.yml', 'full_catalog.yml')
```
This should result in a modified `full_catalog.yml` being written to disk for you to explore. If the new `full_catalog.yml` contains new or updated dataset entries and represents the latest and greatest canonical version of the catalog that everyone should be using, then it should also be committed to the repository and pushed so others can access it.
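Before committing, a quick sanity check is to load the rebuilt catalog and confirm that the new or updated entries show up:

```python
import intake

# Load the freshly rebuilt catalog and list its entries
catalog = intake.open_catalog('full_catalog.yml')
print(list(catalog))
```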