
API Brainstorm Thread #252

Open
khusmann opened this issue Jul 18, 2024 · 0 comments

I'm starting this thread to brainstorm some of the ideas I mention in #198 and #251. It leans into the idea of data packages and table resources being their own class, not just lightweight descriptors. In this approach:

  • A data package object would be a list of resource objects. Properties would be stored in its attributes, and be accessed with get_prop() and set_props(). These functions would ensure the object was always valid.
  • A table resource object would be a list of field objects. Properties would be stored in its attributes, as with data package objects.
  • Reading a table resource object with read_resource() would return something that is both a tibble AND a table resource object, so you could manage a data frame and its frictionless metadata together.
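
To make the first bullet concrete, here is a rough base-R sketch of how properties could live in an attribute while the list items stay free for resources. Everything in it is an assumption for illustration: the constructor `new_pkg_sketch()`, the `"props"` attribute name, and the single validation rule on `id` are hypothetical, and `get_prop()` / `set_props()` are minimal stand-ins for the proposed functions, not an existing implementation.

```r
# Hypothetical sketch: none of these helpers exist in the package today.

# A package object is a named list of resources; package-level properties
# live in a "props" attribute instead of being list items.
new_pkg_sketch <- function(resources = list(), props = list()) {
  structure(resources, props = props, class = "pkg_sketch")
}

# Read one property from the "props" attribute.
get_prop <- function(x, name) {
  attr(x, "props")[[name]]
}

# Update properties, validating them before the object is modified.
set_props <- function(x, ...) {
  new <- list(...)
  if (length(new) > 0 && (is.null(names(new)) || any(names(new) == ""))) {
    stop("All properties must be named.")
  }
  # Illustrative rule only: `id`, when given, must be one non-empty string.
  id <- new[["id"]]
  if (!is.null(id) && !(is.character(id) && length(id) == 1 && nzchar(id))) {
    stop("`id` must be a single non-empty string.")
  }
  # modifyList() drops entries whose new value is NULL, so `path = NULL`
  # style calls remove a property.
  attr(x, "props") <- utils::modifyList(attr(x, "props"), new)
  x
}

pkg <- new_pkg_sketch(
  resources = list(deployments = list(), observations = list()),
  props = list(name = "example_package", id = "old-id")
)
pkg <- set_props(pkg, id = "new-id")
get_prop(pkg, "id")
names(pkg)
```

Because the properties sit in an attribute, `names(pkg)` only lists resources, which is what lets `pkg$deployments` refer to a child resource.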

Although it does introduce a lot of implementation complexity in some areas, I think it potentially simplifies user experience and reduces complexity in other areas:

  • users no longer have to keep their loaded data frames synchronized with their descriptor metadata, because a loaded resource tibble IS a table resource object in all of its metadata glory
  • we can easily carry context around in an object (e.g. the working directory of a descriptor; see #251, "Should we use resource as an argument?") without it polluting the rest of the descriptor attributes
  • validation is streamlined because properties are always modified through functions that ensure the object stays valid
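
As a rough illustration of the "carry context around" point, the sketch below keeps a working directory in its own attribute so it never shows up in the descriptor. The constructor `new_resource_sketch()`, the helper `resolve_path()`, the `"directory"` attribute, and the example paths are all hypothetical; `get_descriptor()` here is a stand-in for the proposed function.

```r
# Hypothetical sketch: attribute and helper names are illustrative only.

# A resource object keeps descriptor properties in "props" and
# non-descriptor context (here, a working directory) in "directory".
new_resource_sketch <- function(props, directory) {
  structure(list(), props = props, directory = directory,
            class = "resource_sketch")
}

# The descriptor is built from the properties only, so context never leaks.
get_descriptor <- function(x) {
  attr(x, "props")
}

# Context is still available to functions that need it, e.g. for
# resolving a relative path against the descriptor's directory.
resolve_path <- function(x) {
  file.path(attr(x, "directory"), attr(x, "props")$path)
}

rsc <- new_resource_sketch(
  props = list(name = "deployments", path = "deployments.csv"),
  directory = "/data/example"
)
names(get_descriptor(rsc))
#> [1] "name" "path"
resolve_path(rsc)
#> [1] "/data/example/deployments.csv"
```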

It's also a pretty big departure from the current architecture, so I totally understand if you're not wanting to go this direction... I'm mostly sharing this to just get more ideas / possibilities flowing.

pkg <- example_package()

pkg
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `get_descriptor()` to print the Data Package as a list.

# Instead of using `unclass()`, we use `get_descriptor()` to convert the
# data package object into a raw descriptor object (list)

get_descriptor(pkg)
#> $name
#> [1] "example_package"
#> 
#> $id
#> [1] "115f49c1-8603-463e-a908-68de98327266"
#> 
#> $created
#> [1] "2021-03-02T17:22:33Z"
#> 
#> $image
#> ...

# Instead of setting properties directly on the data package object, we get
# and set properties using `get_prop()` and `set_props()`. This allows us to
# validate the properties before setting them, so the data package object
# is always guaranteed to be valid.

get_prop(pkg, "id")
#> [1] "115f49c1-8603-463e-a908-68de98327266"

pkg <- set_props(pkg, id = "new-id")

get_prop(pkg, "id")
#> [1] "new-id"

# Because all properties are stored as attributes in the data package object,
# we can have the object's items refer directly to the child resources
# of the data package:

pkg$deployments
#> A Table Resource with 5 fields:
#> • deployment_id (string)
#> • longitude (number)
#> • latitude (number)
#> • start (date)
#> • comments (string)
#> Use `get_descriptor()` to print the Table Resource as a list.
#> Use `read_resource()` to load the data of this Table Resource.

# As with a data package object, we can use `get_descriptor()` to convert
# the resource object into a raw descriptor object (list)

get_descriptor(pkg$deployments)
#> $name
#> [1] "deployments"
#> 
#> $path
#> [1] "<...>"
#> 
#> $profile
#> [1] "tabular-data-resource"
#> 
#> $title
#> [1] "Camera trap deployments"
#> ...

# As with data package objects, we use get_prop() and set_props() to work with
# properties:

get_prop(pkg$deployments, "title")
#> [1] "Camera trap deployments"

pkg$deployments <- set_props(pkg$deployments, title = "Camera trap deployments (modified)")

get_prop(pkg$deployments, "title")
#> [1] "Camera trap deployments (modified)"

# We let the child items of table resource objects refer to field objects:

pkg$deployments$deployment_id
#> A Field:
#> • name: deployment_id
#> • type: string
#> • constraints: {required: TRUE, unique: TRUE}
#> Use `get_descriptor()` to print the Field as a list.

# And as usual, we can convert to raw descriptor via `get_descriptor()`:

get_descriptor(pkg$deployments$deployment_id)
#> $name
#> [1] "deployment_id"
#> 
#> $type
#> [1] "string"
#> 
#> $constraints
#> $constraints$required
#> [1] TRUE
#> 
#> $constraints$unique
#> [1] TRUE

# (Also, `get_prop()` and `set_props()` would work with field objects)

# Where this approach gets really interesting is when we start loading the data
# from resources:

rsc <- read_resource(pkg$deployments)

rsc
#> # A Table Resource tibble: 3 × 5
#>   deployment_id longitude latitude start      comments
#>   <chr>             <dbl>    <dbl> <date>     <chr>
#> 1 1                  4.62     50.8 2020-09-25  NA
#> 2 2                  4.64     50.8 2020-10-01 "On \"forêt\" road."
#> 3 3                  4.65     50.8 2020-10-05 "Malfunction/no photos, data"

# Notice the header in the printout -- this is not your average tibble!
# What we get here is a subclassed tibble allowing it to be both a tibble AND
# keep track of the resource metadata simultaneously. This means `get_prop()`
# and `set_props()` can still be used!

get_prop(rsc, "title")
#> [1] "Camera trap deployments (modified)"

rsc <- set_props(rsc, title = "Camera trap deployments (modified again)")

get_prop(rsc, "title")
#> [1] "Camera trap deployments (modified again)"

# We can also still use `get_descriptor()` with the tibble!

get_descriptor(rsc)
#> $name
#> [1] "deployments"
#> 
#> $path
#> [1] "<...>"
#> 
#> $profile
#> [1] "tabular-data-resource"
#> 
#> $title
#> [1] "Camera trap deployments (modified again)"
#> ...

# Properties of fields could be set in tidy pipelines, and new fields
# could be created by adding columns:

library(dplyr) # for mutate()

rsc <- rsc |>
  mutate(
    deployment_id = set_props(deployment_id, title = "New deployment ID title"),
  ) |>
  mutate(
    new_field = start + 1,
    new_field = set_props(new_field, title = "The day after the start day"),
  )

# What's cool about this is that `get_descriptor()` on the resource tibble
# would now include the new field in the resulting schema.

# And we can update our package with the new resource at any time:

pkg$deployments <- rsc

# We could also update the resource's path to control how the resource
# will be saved when we write the package to disk:

pkg$deployments <- set_props(pkg$deployments, path = "deployments_new.csv")

# Or set the path to NULL to have the resource embed the tibble data in the
# "data" prop when it's converted to a descriptor:

pkg$deployments <- set_props(pkg$deployments, path = NULL)
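
The data-frame half of this idea could be prototyped roughly as below. This is only a sketch under several assumptions: it uses a base data frame rather than a tibble, the constructor `as_resource_df()` and helper `field_type()` are hypothetical, the type mapping is deliberately crude, and `set_props()` / `get_descriptor()` are minimal stand-ins that only handle column vectors and resource data frames respectively.

```r
# Hypothetical sketch with a base data frame; the proposal itself uses a tibble.

# Attach resource properties to a data frame and add a subclass.
as_resource_df <- function(df, props) {
  structure(df, props = props, class = c("resource_df_sketch", class(df)))
}

# Stand-in for set_props() on a column: field properties ride along as an
# attribute of the column vector.
set_props <- function(col, ...) {
  attr(col, "props") <- utils::modifyList(as.list(attr(col, "props")), list(...))
  col
}

# Crude mapping from R column types to field types (illustrative only).
field_type <- function(col) {
  if (inherits(col, "Date")) "date"
  else if (is.integer(col)) "integer"
  else if (is.numeric(col)) "number"
  else if (is.logical(col)) "boolean"
  else "string"
}

# Build the descriptor from the resource properties plus one field entry
# per column, so newly added columns appear in the schema automatically.
get_descriptor <- function(x) {
  fields <- lapply(names(x), function(nm) {
    col <- x[[nm]]
    c(list(name = nm, type = field_type(col)), as.list(attr(col, "props")))
  })
  c(attr(x, "props"), list(schema = list(fields = fields)))
}

rsc <- as_resource_df(
  data.frame(
    deployment_id = c("1", "2"),
    start = as.Date(c("2020-09-25", "2020-10-01")),
    stringsAsFactors = FALSE
  ),
  props = list(name = "deployments")
)
rsc$new_field <- set_props(rsc$start + 1, title = "The day after the start day")

desc <- get_descriptor(rsc)
str(desc$schema$fields[[3]])
```

A real tibble subclass would also need methods so that dplyr verbs keep the extra class and attributes; that part is not covered here.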