
Dataset granularities: Abstract vs. Versioned vs. Layer

timrdf edited this page Mar 17, 2013 · 26 revisions

When csv2rdf4lod converts tabular data to RDF, it also asserts metadata about the RDF using the conversion vocabulary. One kind of metadata that it asserts describes void:Datasets, which let us group collections of triples hierarchically. csv2rdf4lod groups triples into three specific types of void:Dataset:

  • conversion:AbstractDataset (subclass of void:Dataset)
    • conversion:VersionedDataset (subclass of void:Dataset that is a void:subset of conversion:AbstractDataset)
      • conversion:LayerDataset (subclass of void:Dataset that is a void:subset of conversion:VersionedDataset)

So, the "largest" void:Dataset in the list above is conversion:AbstractDataset, and the "smallest" is conversion:LayerDataset.
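As a rough illustration of the nesting above, the following Python sketch lists the statements that would tie one branch of the hierarchy together. The function name and the "ex:" identifiers are hypothetical, and the prefixed names are plain strings rather than resolved URIs; this is not csv2rdf4lod's actual output.

```python
def hierarchy_triples(abstract, versioned, layer):
    """Return (subject, predicate, object) tuples for one branch of the hierarchy."""
    return [
        (abstract, "rdf:type", "conversion:AbstractDataset"),
        (versioned, "rdf:type", "conversion:VersionedDataset"),
        (layer, "rdf:type", "conversion:LayerDataset"),
        # void:subset points from the larger dataset to the smaller one.
        (abstract, "void:subset", versioned),
        (versioned, "void:subset", layer),
    ]


# Hypothetical identifiers, used only to show the shape of the statements.
for s, p, o in hierarchy_triples("ex:abstract", "ex:versioned", "ex:layer"):
    print(f"{s} {p} {o} .")
```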

As described in the naming phase, three provenance-related aspects are used to organize any third party data that csv2rdf4lod retrieves, converts, and republishes. This allows us to clearly distinguish where our data came from, and which version we actually got. Each aspect is given a short, URI-friendly identifier string:

  • source identifier - indicates the person/agent/organization that provided the data. For example, "epa-gov" provides a variety of datasets.
  • dataset identifier - indicates the logical group of data. For example, the EPA distinguishes between the datasets "photochemical-assessment-monitoring-stations-pams" and "historical-radnet-air-quality-data".
  • version identifier - indicates the version of a particular dataset. For example, you could have retrieved version "r23" (a designation from EPA themselves) three years ago, or I could have retrieved version "2013-Mar-14" today (because the EPA didn't provide a version designation for what they provided). For naming conventions, see naming.

To see the "Source/Dataset/Version" naming and the resulting "Abstract/Versioned/Layer" dataset hierarchy in action, check out how it's used to (attempt to) keep the White House Visitor Access Records organized.

  • A conversion:AbstractDataset is the union of all triples from all its VersionedDatasets.
  • A conversion:VersionedDataset is created each time a group of files is retrieved from the third party. All triples produced from these files are grouped into the same conversion:VersionedDataset.
  • A conversion:LayerDataset is created each time the curator wants to produce a new RDF structure from the files in a conversion:VersionedDataset. Often, there are only two LayerDatasets: raw and enhancement/1, which are, respectively, the naive conversion (with minimal interpretation and curation) and a first enhancement (a human-curated restructuring to reuse vocabulary, entities, etc.). If some data users start using enhancement/1 and the curator wishes to create an even better RDF representation, they can create a third (or fourth, etc.) LayerDataset to provide the new value while maintaining backward compatibility for those who are using previous layers.
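The grouping described in these three points can be modeled with plain Python sets. This is a minimal sketch, assuming each layer is simply a set of triples and that a VersionedDataset is the union of its layers; the sample triples and function names are made up for illustration.

```python
# Hypothetical triples, keyed by version identifier and then by layer identifier.
layers_by_version = {
    "2011-Jan-11": {
        "raw": {("s1", "p", "o1")},
        "enhancement/1": {("s1", "p2", "o1b")},
    },
    "2013-Mar-14": {
        "raw": {("s2", "p", "o2")},
    },
}


def versioned_dataset(layers):
    """Union of the triples in every layer of one version (an assumption of this sketch)."""
    return set().union(*layers.values())


def abstract_dataset(versions):
    """Union of the triples in every VersionedDataset."""
    return set().union(*(versioned_dataset(layers) for layers in versions.values()))


everything = abstract_dataset(layers_by_version)
# Each version's triples are contained in the abstract dataset.
assert versioned_dataset(layers_by_version["2011-Jan-11"]) <= everything
```

Adding an enhancement/2 layer to a version in this model leaves the raw and enhancement/1 sets untouched, which mirrors the backward-compatibility point above.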

Given the base URI of our Linked Data site (say, http://lod.mine.org), we can create URIs for different groups of triples (void:Datasets).

  • http://lod.mine.org/source/epa-gov/dataset/photochemical-assessment-monitoring-stations-pams (a conversion:AbstractDataset) is the group of all triples from the following:
    • http://lod.mine.org/source/epa-gov/dataset/photochemical-assessment-monitoring-stations-pams/version/2011-Jan-11 (a conversion:VersionedDataset) is the group of all triples that are produced from the group of files retrieved on January 11th, 2011.
    • http://lod.mine.org/source/epa-gov/dataset/photochemical-assessment-monitoring-stations-pams/version/2013-Mar-14 (a conversion:VersionedDataset) is the group of all triples that are produced from the group of files retrieved on March 14th, 2013. Hopefully, the curator ensured that these files were different enough from those retrieved on January 11th.
  • http://lod.mine.org/source/epa-gov/dataset/historical-radnet-air-quality-data (a conversion:AbstractDataset) is the group of all triples from the following:
    • http://lod.mine.org/source/epa-gov/dataset/historical-radnet-air-quality-data/version/r23 (a conversion:VersionedDataset) is the group of triples that are produced from the group of files that are known as "r23", which is a hypothetical version designator that was provided and maintained by the source organization (EPA).
      • http://lod.mine.org/source/epa-gov/dataset/historical-radnet-air-quality-data/version/r23/conversion/raw (a conversion:LayerDataset) is the group of triples that are produced from a naive interpretation of the files from version "r23".
      • http://lod.mine.org/source/epa-gov/dataset/historical-radnet-air-quality-data/version/r23/conversion/enhancement/1 (a conversion:LayerDataset) is the group of triples that are produced from the first curation-level conversion of the files from version "r23". "enhancement/2", "enhancement/3", etc. name other LayerDatasets that group the triples produced from varying interpretations of how the RDF created from "r23"'s data files should look.
    • http://lod.mine.org/source/epa-gov/dataset/historical-radnet-air-quality-data/version/2013-Mar-14 (a conversion:VersionedDataset) is the group of triples that are produced from the group of files whose latest modification date was March 14, 2013 (this is another way to designate a version identifier -- the csv2rdf4lod curator must make an informed decision about how to name versions and when to create new versions).
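The URI pattern in the examples above can be sketched as a few small Python helpers. The function names are hypothetical and the code is not part of csv2rdf4lod; it only mirrors the source/dataset/version/conversion path segments shown in the list, using the example base URI from this page.

```python
BASE_URI = "http://lod.mine.org"  # the example base URI used on this page


def abstract_dataset_uri(source, dataset, base=BASE_URI):
    """URI of the conversion:AbstractDataset for a source and dataset identifier."""
    return f"{base}/source/{source}/dataset/{dataset}"


def versioned_dataset_uri(source, dataset, version, base=BASE_URI):
    """URI of the conversion:VersionedDataset for one version identifier."""
    return f"{abstract_dataset_uri(source, dataset, base)}/version/{version}"


def layer_dataset_uri(source, dataset, version, layer, base=BASE_URI):
    """URI of a conversion:LayerDataset; layer is e.g. "raw" or "enhancement/1"."""
    return f"{versioned_dataset_uri(source, dataset, version, base)}/conversion/{layer}"


print(layer_dataset_uri("epa-gov", "historical-radnet-air-quality-data", "r23", "enhancement/1"))
```

Because each helper builds on the one before it, a layer's URI always extends its version's URI, which in turn extends the abstract dataset's URI.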
