dcat:distribution, model fix, inspect API updates #911

Open
canwaf opened this issue Apr 12, 2024 · 1 comment

canwaf (Contributor) commented Apr 12, 2024

With the now-yanked csvcubed 0.5.0 release, we adopted the following change to the object model:

<4g-coverage.csv#dataset> <http://purl.org/dc/terms/description> "4G coverage in the UK by geographic area" ;
	<http://purl.org/dc/terms/title> "4G Coverage in the UK" ;
	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#Attachable>, <http://purl.org/linked-data/cube#DataSet>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.w3.org/ns/dcat#Distribution>, <http://www.w3.org/ns/dcat#Resource> .

This affects csvcubed's inspect command, which runs the select_catalog_metadata query (https://github.com/GSS-Cogs/csvcubed/blob/main/src/csvcubed/inspect/sparql_handler/sparql_queries/select_catalog_metadata.sparql). That query primarily looks for a dcat:Dataset:

        SELECT DISTINCT ?dataset
        WHERE {
            GRAPH ?someGraph {
                ?dataset a dcat:Dataset.
            }
        }

That type is no longer present, although it should be. Consider the application profile in which the CSV-W is the distribution. That leads us to the following:

<4g-coverage.csv#csvqb> a <http://purl.org/linked-data/cube#Attachable>, <http://purl.org/linked-data/cube#DataSet>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.w3.org/ns/dcat#Distribution>, <http://www.w3.org/ns/dcat#Resource> ;
    <http://www.w3.org/ns/dcat#isDistributionOf> <4g-coverage.csv#dataset> .
<4g-coverage.csv#dataset> <http://purl.org/dc/terms/description> "4G coverage in the UK by geographic area" ;
	<http://purl.org/dc/terms/title> "4G Coverage in the UK" .

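So the catalogue metadata is attached to the dataset, while the CSV-W's primary subject is now the resource typed as qb:Attachable, qb:DataSet, dcat:Distribution, etc.

Provided <4g-coverage.csv#dataset> is also typed as dcat:Dataset (the type is not shown in the snippet above), the SPARQL query should be able to remain unchanged.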
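To illustrate why the existing query stops matching under the 0.5.0 model and matches again under the proposed one, here is a rough plain-Python stand-in for the `?dataset a dcat:Dataset` triple pattern. It is only a sketch: the triples are abbreviated from the snippets in this issue, prefixed names are used as plain strings, and the `dcat:Dataset` type on `#dataset` in the proposed model is an assumption, since the Turtle above does not show it.

```python
RDF_TYPE = "rdf:type"


def find_dcat_datasets(triples):
    """Mimic `?dataset a dcat:Dataset` over (subject, predicate, object) tuples."""
    return sorted({s for (s, p, o) in triples if p == RDF_TYPE and o == "dcat:Dataset"})


# Abbreviated from the csvcubed 0.5.0 output: #dataset is a dcat:Distribution only.
model_0_5_0 = {
    ("4g-coverage.csv#dataset", RDF_TYPE, "qb:DataSet"),
    ("4g-coverage.csv#dataset", RDF_TYPE, "dcat:Distribution"),
    ("4g-coverage.csv#dataset", "dcterms:title", "4G Coverage in the UK"),
}

# Abbreviated from the proposed model: #csvqb is the distribution of #dataset.
proposed_model = {
    ("4g-coverage.csv#csvqb", RDF_TYPE, "qb:DataSet"),
    ("4g-coverage.csv#csvqb", RDF_TYPE, "dcat:Distribution"),
    ("4g-coverage.csv#csvqb", "dcat:isDistributionOf", "4g-coverage.csv#dataset"),
    # Assumption: this type is not shown in the Turtle snippet in this issue.
    ("4g-coverage.csv#dataset", RDF_TYPE, "dcat:Dataset"),
    ("4g-coverage.csv#dataset", "dcterms:title", "4G Coverage in the UK"),
}

print(find_dcat_datasets(model_0_5_0))     # []
print(find_dcat_datasets(proposed_model))  # ['4g-coverage.csv#dataset']
```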

The metadata attached to the dcat:Distribution should be, at most, the following. (Note: these are not requirements. We should add whatever we can already fill in from what we have, and nothing new, please.)

classDiagram

class Distribution["Distribution a dcat:Distribution"] {
    +dcterms:identifier ∋ rdfs:Literal as xsd:string
    +dcterms:created ∋ rdfs:Literal as xsd:dateTime
    +dcterms:creator ∋ foaf:Agent
    +dcterms:issued ∋ rdfs:Literal as xsd:dateTime
    +prov:wasDerivedFrom ∋ [prov:Entity]
    +prov:wasGeneratedBy ∋ prov:Activity
    +dcat:downloadURL ∋ rdfs:Resource
    +dcat:byteSize ∋ rdfs:Literal as xsd:nonNegativeInteger
    +dcat:mediaType ∋ dcterms:MediaType
    +wdrs:describedBy ∋ rdfs:Resource
    +spdx:checksum ∋ spdx:Checksum
}

tl;dr: the main subject of the CSV-W metadata file should be <dataset.csv#csvqb>, which dcat:isDistributionOf the dcat:Dataset. The dcat:Dataset is the resource that should have the catalogue metadata attached to it.

SarahJohnsonONS (Contributor) commented Jun 3, 2024

Currently, cubes built with csvcubed v0.4.10 or lower cannot be inspected with csvcubed v0.5.0 or greater, because the primary identifier has changed from some-dataset.csv#dataset to some-dataset.csv#csvqb. To accommodate this change, a new distribution_uri property has been added to the CatalogMetadata class, and the select_catalog_metadata SPARQL query has been updated to extract the value of this property when it is present.

Additional information on the version of csvcubed used to build the cube is also now available in the metadata JSON file, which may also be leveraged to determine how the cube should be inspected.

The distribution_uri value is not present in cubes built using older versions of csvcubed, so the inspect command fails when a newer version of csvcubed is run against them. This is because the MetadataPrinter class now uses distribution_uri in the get_primary_csv_url() method via DataCubeRepository.get_cube_identifiers_for_dataset(). There will be other places with a similar discrepancy, but this is where I would start.

Possible solutions:

  • Use the csvcubed-build-activity information to extract the version of csvcubed used to build the cube, and use this to implement different versions of the inspect command. Build activity information available in different versions of csvcubed is below.
  • Use the presence or absence of distribution_uri in the select_catalog_metadata SPARQL results to implement different versions of the inspect command.
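
As a rough illustration of the second option, the sketch below picks the primary identifier based on whether distribution_uri came back from the query. `CatalogMetadataResult` and `primary_cube_identifier` are hypothetical names used only for this example; they are not the real csvcubed classes or methods, and the real CatalogMetadata class has more fields than shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogMetadataResult:
    """Hypothetical stand-in for one select_catalog_metadata result row."""

    dataset_uri: str
    # Absent (None) for cubes built with csvcubed < 0.5.0.
    distribution_uri: Optional[str] = None


def primary_cube_identifier(result: CatalogMetadataResult) -> str:
    """Return the URI that inspect should treat as the cube's primary identifier."""
    if result.distribution_uri is not None:
        # Cube built with csvcubed >= 0.5.0: the distribution is the primary subject.
        return result.distribution_uri
    # Older cube: fall back to the dataset URI, which was the primary identifier.
    return result.dataset_uri


old_cube = CatalogMetadataResult(dataset_uri="some-dataset.csv#dataset")
new_cube = CatalogMetadataResult(
    dataset_uri="some-dataset.csv#dataset",
    distribution_uri="some-dataset.csv#csvqb",
)
print(primary_cube_identifier(old_cube))  # some-dataset.csv#dataset
print(primary_cube_identifier(new_cube))  # some-dataset.csv#csvqb
```

This keeps the branching in one place, so callers such as get_primary_csv_url() would not need to know which csvcubed version built the cube.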

Build activity information

csvcubed version < 0.5.0

...
{
    "@id": "aged-16-to-64-years-level-3-or-above-qualifications.csv#dataset",
    "http://www.w3.org/ns/prov#wasGeneratedBy": [
        {
            "@id": "aged-16-to-64-years-level-3-or-above-qualifications.csv#csvcubed-build-activity"
        }
    ]
}
...
{
    "@id": "aged-16-to-64-years-level-3-or-above-qualifications.csv#csvcubed-build-activity",
    "@type": [
        "http://www.w3.org/2000/01/rdf-schema#Resource",
        "http://www.w3.org/ns/prov#Activity"
    ],
    "http://www.w3.org/ns/prov#used": [
        {
            "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.4.10"
        }
    ]
}
...

csvcubed version >= 0.5.0

...
{
    "@id": "some-title.csv#csvqb",
    "http://www.w3.org/ns/prov#wasDerivedFrom": [
        {
            "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.5.0"
        }
    ],
    "http://www.w3.org/ns/prov#wasGeneratedBy": [
        {
            "@id": "some-title.csv#csvcubed-build-activity"
        }
    ]
}
...
{
    "@id": "some-title.csv#csvcubed-build-activity",
    "@type": [
        "http://www.w3.org/ns/prov#Activity",
        "http://www.w3.org/2000/01/rdf-schema#Resource"
    ],
    "http://www.w3.org/ns/prov#used": [
        {
            "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.5.0"
        }
    ]
},
{
    "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.5.0",
    "@type": [
        "http://www.w3.org/ns/prov#Entity",
        "http://www.w3.org/2000/01/rdf-schema#Resource"
    ],
    "http://purl.org/dc/terms/title": [
        {
            "@language": "en",
            "@value": "csvcubed v0.5.0"
        }
    ],
    "http://www.w3.org/ns/prov#hasPrimarySource": [
        {
            "@id": "https://pypi.org/project/csvcubed/0.5.0"
        }
    ],
    "http://www.w3.org/ns/prov#wasGeneratedBy": [
        {
            "@id": "some-title.csv#csvcubed-build-activity"
        }
    ]
}
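
The first option could be sketched along these lines. It assumes the RDF in the metadata file has already been loaded as a flat list of JSON-LD node objects shaped like the excerpts above; how that list is obtained is left out. `csvcubed_build_version` and `uses_distribution_uri` are hypothetical helper names, not existing csvcubed functions. The version is parsed from the release tag URL under prov:used, which appears in both the old and new excerpts.

```python
import re
from typing import List, Optional, Tuple

PROV_USED = "http://www.w3.org/ns/prov#used"
BUILD_ACTIVITY_SUFFIX = "#csvcubed-build-activity"
# Matches release tag URLs such as .../csvcubed/releases/tag/v0.4.10
RELEASE_TAG = re.compile(r"/csvcubed/releases/tag/v(\d+)\.(\d+)\.(\d+)")


def csvcubed_build_version(nodes: List[dict]) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) of the release used by the build activity, if any."""
    for node in nodes:
        if not node.get("@id", "").endswith(BUILD_ACTIVITY_SUFFIX):
            continue
        for used in node.get(PROV_USED, []):
            match = RELEASE_TAG.search(used.get("@id", ""))
            if match:
                major, minor, patch = match.groups()
                return int(major), int(minor), int(patch)
    return None


def uses_distribution_uri(nodes: List[dict]) -> bool:
    """Hypothetical helper: True when the cube was built with csvcubed >= 0.5.0."""
    version = csvcubed_build_version(nodes)
    return version is not None and version >= (0, 5, 0)


old_cube = [
    {
        "@id": "some-dataset.csv#csvcubed-build-activity",
        "@type": ["http://www.w3.org/ns/prov#Activity"],
        PROV_USED: [
            {"@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.4.10"}
        ],
    }
]
print(csvcubed_build_version(old_cube), uses_distribution_uri(old_cube))
# (0, 4, 10) False
```

Comparing integer tuples rather than version strings matters here: as strings, "0.4.10" and "0.5.0" do not order reliably, whereas (0, 4, 10) < (0, 5, 0) behaves as intended.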

SarahJohnsonONS added a commit that referenced this issue Jul 17, 2024
* Updating the release version in pyproject.toml

* test commit

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* Tidy up

* tidy up

* Working

* Added comments

* fixed pyright errors

* more pyright

* Changed #csvqb to #qbDataSet

* PR comments addressed

* poetry lock

* poetry lock

* oops

* small change

---------

Co-authored-by: Auto-version-incrementer <[email protected]>