Incomplete content_type subsection. #42

Closed
Bonnarel opened this issue Apr 20, 2020 · 20 comments

Comments

@Bonnarel
Contributor

Bonnarel commented Apr 20, 2020

Recent semantics discussion addressed the use case of linking sibling or alternate science datasets to the main item. In the end, it was decided that the right place to specify the dataproduct_type of those datasets is a standardized media type parameter in the content_type FIELD. This has to be explained in the section. See PR #43.

@pdowler
Collaborator

pdowler commented Nov 4, 2020

The PR discussion brought up the issue that mime type parameters really should be defined by the same authority that defines the mime type, so the idea of adding content={dataproduct_type} (or some other ivoa vocabulary value) to application/fits (eg) now seems unacceptable.

The next best alternative would be to add a new (optional in 1.1) field, say content_qualifier, where one could use a vocabulary term to describe the logical content (as opposed to the format in content_type). Since the ObsCore standard looks to be moving toward dataproduct_type being a vocabulary (and we could treat the current list of words as such now), allowing vocabulary terms would satisfy the current and future use cases.

Detail: do we define a default vocabulary and allow terms from that to be used "unqualified" (eg image or #image instead of http://ivoa.net/ObsCore/dataproduct_type#image <-- totally made up fully qualified vocabulary term) -- or do we always require fully qualified values? Having a default vocabulary kind of anchors us (again) to the idea that DataLink is for data and files and not more generically "links to resources"... something we've tried to generalise in the current revision.

So, do we allow bare vocabulary terms from any (IVOA) vocabulary -- image (or #image) and galaxy (or #galaxy) -- or fully qualified vocabulary terms (identifiers)?

@pdowler
Collaborator

pdowler commented Nov 4, 2020

I can volunteer to write this and create a PR, but I'd like to wait for PR #50 because that introduces optional fields and this would definitely create a merge conflict if done in parallel.

@msdemlei
Collaborator

> The next best alternative would be to add a new (optional in 1.1)
> field, say content_qualifier where one could use a vocabulary term to

I always cringe when "about the same thing" is done differently in two
different standards. So... what's the difference between this and
obscore dataproduct_type? Is the rationale for this difference really
so significant to justify inventing something new and forcing adopters
to learn yet another thing?

> Detail: do we define a default vocabulary and allow terms from that to
> be used "unqualified" (eg image or #image instead of
> http://ivoa.net/ObsCore/dataproduct_type#image <-- totally made up

There's http://www.ivoa.net/rdf/product-type that ought to become
adopted with SimpleDALRegExt. And I'm pretty sure we should just use
that.

> So, do we allow bare vocabulary terms from any (IVOA) vocabulary --
> image (or #image) and galaxy (or #galaxy) -- or fully qualified
> vocabulary terms (identifiers)?

This is a bit tricky -- for internal (datalink) consistency, I'd say we
should do it like with semantics: What's in there is a URI relative to
http://www.ivoa.net/rdf/product-type. This will make #image just work,
and if people really want, they can add fully qualified URIs.

Given that's what datalink does elsewhere, I'd say we can't really do it
differently here.
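
To illustrate the relative-URI rule described above, here is a minimal Python sketch. It assumes the column value is treated as a URI reference resolved against http://www.ivoa.net/rdf/product-type, which is the proposal in this comment and not settled text. The helper name `resolve_term` and the example.org vocabulary are made up for the illustration; only `urllib.parse.urljoin` from the standard library is used.

```python
from urllib.parse import urljoin

# Base URI of the vocabulary named in this thread. Using it as the default
# for the new column is an assumption taken from the proposal above.
PRODUCT_TYPE_BASE = "http://www.ivoa.net/rdf/product-type"


def resolve_term(value, base=PRODUCT_TYPE_BASE):
    """Resolve a column value as a URI reference relative to `base`."""
    return urljoin(base, value)


# A fragment-only reference is appended to the base URI.
print(resolve_term("#image"))
# → http://www.ivoa.net/rdf/product-type#image

# A fully qualified URI (hypothetical vocabulary) passes through unchanged.
print(resolve_term("http://example.org/vocab#galaxy"))
# → http://example.org/vocab#galaxy
```

Under this rule a bare `image` without the leading `#` is resolved as a relative path and replaces the last path segment of the base, so it does not end up in the product-type vocabulary at all. That is one reason the `#image` form is the one that "just works".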

If we started from scratch, I'd not do it this way again and instead say
"it's terms from product-type, full stop, no fooling around with
concatenating URIs".

This is because I'm now convinced that hierarchy-aware matching
("anything that is image or narrower") is an important use case in this
kind of thing; and that, really, won't ever work when you allow terms
from everywhere. That's why I'm against repeating the # hack in, say,
obscore, or in SimpleDALRegExt. I might add some text explaining why
datalink differs in this respect from what's done elsewhere in the VO if
VocinVO2 becomes REC before Datalink 1.1, just so adopters don't curse
us too badly.
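
The hierarchy-aware matching mentioned above ("anything that is image or narrower") can be sketched as follows. This is a toy example: the `BROADER` table, its terms, and the function names are invented for illustration and are not taken from the actual product-type vocabulary.

```python
# Toy vocabulary: maps each term to its broader term (None for top level).
# The terms and the hierarchy are invented for illustration only.
BROADER = {
    "image": None,
    "mosaic": "image",      # hypothetical narrower term
    "thumbnail": "image",   # hypothetical narrower term
    "spectrum": None,
}


def is_same_or_narrower(term, target, broader=BROADER):
    """Return True if `term` equals `target` or has it as an ancestor."""
    seen = set()
    while term is not None and term not in seen:
        if term == target:
            return True
        seen.add(term)
        # Terms unknown to this vocabulary have no broader term.
        term = broader.get(term)
    return False


def matching(values, target):
    """Keep the column values that are `target` or narrower."""
    return [v for v in values if is_same_or_narrower(v.lstrip("#"), target)]


values = ["#image", "#mosaic", "#spectrum", "http://example.org/vocab#picture"]
print(matching(values, "image"))
# → ['#image', '#mosaic']
```

The last value shows the problem: a term from some other vocabulary is opaque to the client, so it can never match, even if its author meant it as a kind of image.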

@Bonnarel
Contributor Author

Bonnarel commented Nov 18, 2020

Hi all,

> > The next best alternative would be to add a new (optional in 1.1)
> > field, say content_qualifier where one could use a vocabulary term to
>
> I always cringe when "about the same thing" is done differently in two
> different standards. So... what's the difference between this and
> obscore dataproduct_type? Is the rationale for this difference really
> so significant to justify inventing something new and forcing adopters
> to learn yet another thing?

Well, I tend to agree with Pat there. I think we have to be cautious about adding new columns all the time in the future. So having two columns that qualify the content of the link independently of its relation to "#this" (content_type and content_qualifier) should be enough. We will still have plenty of use cases where we will not use a dataproduct_type to qualify the target because it is simply inappropriate. But if the target is a VOEvent, the content can be a classification tag of that VOEvent, or if the semantics is "metadata", the content_qualifier could tell us: "provenance" record, ObsCore record, SSA record, proprietary, etc.
This more or less requires integrating the vocabulary namespace into the value of this new content_qualifier field.

> > Detail: do we define a default vocabulary and allow terms from that to
> > be used "unqualified" (eg image or #image instead of
> > http://ivoa.net/ObsCore/dataproduct_type#image <-- totally made up
>
> There's http://www.ivoa.net/rdf/product-type that ought to become
> adopted with SimpleDALRegExt. And I'm pretty sure we should just use
> that.
>
> > So, do we allow bare vocabulary terms from any (IVOA) vocabulary --
> > image (or #image) and galaxy (or #galaxy) -- or fully qualified
> > vocabulary terms (identifiers)?
>
> This is a bit tricky -- for internal (datalink) consistency, I'd say we
> should do it like with semantics: What's in there is a URI relative to
> http://www.ivoa.net/rdf/product-type. This will make #image just work,
> and if people really want, they can add fully qualified URIs.
>
> Given that's what datalink does elsewhere, I'd say we can't really do it
> differently here.

So http://www.ivoa.net/rdf/product-type has to be the default namespace for this field (either stated in the spec or advertised, in the manner of an XSD namespace, at the beginning of the VOTable).

So anything that is not a dataproduct_type from the IVOA vocabulary has to contain an explicit namespace.

> If we started from scratch, I'd not do it this way again and instead say
> "it's terms from product-type, full stop, no fooling around with
> concatenating URIs".
>
> This is because I'm now convinced that hierarchy-aware matching
> ("anything that is image or narrower") is an important use case in this
> kind of thing; and that, really, won't ever work when you allow terms
> from everywhere. That's why I'm against repeating the # hack in, say,
> obscore, or in SimpleDALRegExt. I might add some text explaining why
> datalink differs in this respect from what's done elsewhere in the VO if
> VocinVO2 becomes REC before Datalink 1.1, just so adopters don't curse
> us too badly.

@Bonnarel
Contributor Author

Bonnarel commented Dec 9, 2020

For a while I also volunteered to write this one.
I didn't create a pull request because Pat wrote that there may be a conflict with PR #50,
so here is the proposal:
The value may be null (blank)
if unknown and will typically be null for links to services.

\subsubsection{content_qualifier}

The content_qualifier column is optional. If present, it tells the client the nature of the thing or service they will receive or access if they use the link, in other words the target. If the target is a data product, the field SHOULD contain one of the terms defined in the IVOA dataproduct_type vocabulary, which is considered the default vocabulary. For other kinds of target, the field MAY contain a term defined in another IVOA or proprietary vocabulary, referred to by its URI.

\subsection{Successful Requests}
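
As a rough sketch of how a client might read the proposed column, the following Python distinguishes the three cases the draft text allows: no value, a term from the default vocabulary, and a term given by URI. The function name and the returned labels are hypothetical, and the draft does not say whether default-vocabulary terms carry a leading `#`, so both spellings are accepted here as an assumption.

```python
from urllib.parse import urlparse


def classify_qualifier(value):
    """Classify a content_qualifier value according to the draft text."""
    if value is None or value.strip() == "":
        # The column is optional and individual values may be blank.
        return "absent"
    if urlparse(value.strip()).scheme:
        # A value with a URI scheme names a term in some other vocabulary.
        return "qualified"
    # Anything else is taken to be a term from the default vocabulary.
    return "default-vocabulary"


for value in (None, "#image", "image", "http://example.org/vocab#tutorial"):
    print(repr(value), classify_qualifier(value))
```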

I eventually created the PR for the small subsection because it does not conflict with the table where optional FIELDs will be listed. See the discussion on this in PR #50.

@Bonnarel Bonnarel mentioned this issue Dec 9, 2020
@Bonnarel
Contributor Author

Bonnarel commented Dec 9, 2020

Possible solution in PR #56 (DataLink-#51)

@pdowler
Collaborator

pdowler commented May 13, 2021

Coming back to this now that the other PRs are merged. Before we discuss the name of the column in the links table, the more fundamental question is whether there is a single vocabulary which defines the values or are there several, some of which have not been created yet?

> If we started from scratch, I'd not do it this way again and instead say
> "it's terms from product-type, full stop, no fooling around with
> concatenating URIs".

I'm not sure what you mean by "do it this way again". (I thought) I understand that there are two orthogonal vocabulary concepts:

  0. you have a vocabulary with a set of words (hierarchical: wider and narrower)
  1. you have multiple vocabularies with different sets of words (different namespace)
  2. you can have a vocabulary that extends another (adds words), usually to add narrower terms (don't know if there is a way to add a new base term and for that to be any different from just a term in a different vocabulary)

Are you saying you don't like allowing terms from multiple vocabularies in this new "content_qualifier" column? (using #1) If so, the "fooling around" is caused by allowing unqualified bare terms from a default vocabulary.

Are you saying you don't like extensions of a single-mandated vocabulary? (using #2). If so, the "fooling around" (like in semantics column) is because we allowed the unqualified bare terms #this which seemed kind of cute at the time. Maybe we don't have a well specified way for people to declare and use extensions but that's really important so people can put prototype terms into use...

I just don't see how the product-type vocabulary can satisfy all the use cases and I don't see adding a column for each new vocabulary, so:

My position right now:

  • semantics continues to mandate the single vocabulary, therefore unqualified terms are allowed
  • content_qualifier (not in love with the name) allows fully qualified terms from any vocabulary; no default; I could possibly get behind restricting to "any ivoa vocabulary", depending on your position on extensions

We (CADC) use fully qualified terms in semantics that are not in the core vocab; I consider them prototype in nature and just haven't got around to the VEP stage.

@msdemlei
Collaborator

msdemlei commented May 14, 2021 via email

@pdowler
Collaborator

pdowler commented May 14, 2021

OK, I get the objection to the wild west of arbitrary full URLs to something on the internet; I don't think it would magically work either and they are just opaque identifiers to s/w (a human could go get the definition of a term).

I re-read what I think is the original post on this (issue #44) and in there I noted a couple of rather simple things that maybe are enough to get by for some time. First, there is a (proposed) "tabular" or "table" value in the product-type vocabulary; assuming such a VEP was accepted this would nominally be the way to link to "records" (query results).

If you saw a links response with:

id   semantics    product_type  content_type      ...
id1  #this        #image        application/fits  ...
id1  #derivation  #table        application/fits  ...

You could infer that the second link was to a fits file with a table in it, but does #derivation tell you what's in the table? what is a row in that table? is it clear that it is an extracted source? if not, how could we make that clear?
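
A small Python sketch of that inference, using the two example rows above. It assumes the rows have already been parsed into dictionaries; the `select` helper is hypothetical and the column name `product_type` is the working name used in this comment, not a decided one.

```python
# The two example rows from the links response above, as a client might
# hold them after parsing (the trailing "..." columns are omitted).
links = [
    {"id": "id1", "semantics": "#this",
     "product_type": "#image", "content_type": "application/fits"},
    {"id": "id1", "semantics": "#derivation",
     "product_type": "#table", "content_type": "application/fits"},
]


def select(rows, **criteria):
    """Return the rows whose columns equal all the given values."""
    return [r for r in rows
            if all(r.get(col) == val for col, val in criteria.items())]


# "FITS files that contain a table": two columns combine to say this.
tables = select(links, product_type="#table", content_type="application/fits")
for row in tables:
    # The semantics column only says how the link relates to #this;
    # it does not say what a row of that table represents.
    print(row["id"], row["semantics"])
```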

The answer could be a narrower term than #derivation that said something about what kind of derivation: same data but processed to be "better" vs information extracted vs astronomical sources extracted ...

So I guess if both datalink/core and product-type vocabularies grow sufficiently, aren't too rigid and don't become a huge mess then we'd be OK with a product_type column restricted to values from that vocabulary. The combinations from two vocabularies will make this quite flexible... I suspect 3 such things would be too much.

Francois - do you think this will work for the use cases from Ada and others you mentioned?

Aside: At CADC we have a handful of astronomers and data-scientists that use our services a lot; they are pseudo-representative of the community (pseudo because they know too much now). I am keenly aware of how much they hate it when things change and if you give them something simple they get used to you can never go back and generalize it in a way that makes it more complex. As a result, I am extremely leery of simple-looking things that look like short cuts unless I have sketched out the general solution and I know the shortcut is not going to bite me later. So like Markus, I don't think I grok the general problem here (lack of use cases) and that makes me a little worried that we'll regret something. OTOH, if we just think about it as "used to be able to say one thing about a link" and "now you can say two things about a link" then that helps.

@Bonnarel
Contributor Author

Bonnarel commented May 20, 2021

> OK, I get the objection to the wild west of arbitrary full URLs to something on the internet; I don't think it would magically work either and they are just opaque identifiers to s/w (a human could go get the definition of a term).
>
> I re-read what I think is the original post on this (issue #44) and in there I noted a couple of rather simple things that maybe are enough to get by for some time. First, there is a (proposed) "tabular" or "table" value in the product-type vocabulary; assuming such a VEP was accepted this would nominally be the way to link to "records" (query results).
>
> If you saw a links response with:
>
> id   semantics    product_type  content_type      ...
> id1  #this        #image        application/fits  ...
> id1  #derivation  #table        application/fits  ...
>
> You could infer that the second link was to a fits file with a table in it, but does #derivation tell you what's in the table? what is a row in that table? is it clear that it is an extracted source? if not, how could we make that clear?
>
> The answer could be a narrower term than #derivation that said something about what kind of derivation: same data but processed to be "better" vs information extracted vs astronomical sources extracted ...
>
> So I guess if both datalink/core and product-type vocabularies grow sufficiently, aren't too rigid and don't become a huge mess then we'd be OK with a product_type column restricted to values from that vocabulary. The combinations from two vocabularies will make this quite flexible... I suspect 3 such things would be too much.
>
> Francois - do you think this will work for the use cases from Ada and others you mentioned?

Well, there are two levels of answer:
1) If we consider the original use case, where #this is a "source" or "detection" in a catalog and #link is a time series "of #this", then product_type combined with one of the semantics terms (coderived, derived, counterpart, progenitor) is certainly enough. And content_type will tell us about the format.
But
2) - When #link is not a data product, product_type is useless. That is not a problem per se: we can leave it empty. But maybe we want to say more about what it is in that case. Imagine #link is "Documentation". Is that a tutorial? A refereed article? A simple HTML page? A GitHub repository? Where do we put this information if the new field is reserved for the dataproduct_type vocabulary?
- In Ada's proposal there were 4 levels:
Level 0 - Data-format (fits, VOTable, PDF, png, …)
Level 1 - Data-type (tabular, image, spectrum, cube, text, …)
Level 2 - Data-information (Documentation, Calibration, Log, Preview, …)
Level 3 - Data-relation (Derived from, Progenitor of, Sibling of, ...)
Levels 0 and 1 will be covered by content_type and product_type. Level 3 is obviously covered by semantics. My personal opinion is that the examples for level 2 are also a kind of relationship between #this and the #link, so they are well covered by semantics. But it may happen that something in her level 2 could be covered by data-type (the very nature of documentation, for example). And I think we will sooner or later need a new "metadata" semantics term, the nature of which could be an "obscore record" or a "provenance record" or ...

---> Could we find a more generic term than product_type for describing the nature of the #link? (I understand that content_qualifier is ruled out.)
---> Can we consider that the default vocabulary there is the dataproduct_type one, and that we allow alternative IVOA terms given as complete URIs if needed?

> Aside: At CADC we have a handful of astronomers and data-scientists that use our services a lot; they are pseudo-representative of the community (pseudo because they know too much now). I am keenly aware of how much they hate it when things change and if you give them something simple they get used to you can never go back and generalize it in a way that makes it more complex. As a result, I am extremely leery of simple-looking things that look like short cuts unless I have sketched out the general solution and I know the shortcut is not going to bite me later. So like Markus, I don't think I grok the general problem here (lack of use cases) and that makes me a little worried that we'll regret something. OTOH, if we just think about it as "used to be able to say one thing about a link" and "now you can say two things about a link" then that helps.

I'm not sure I follow this aside. What do you consider to be the shortcut there? Using the same field for different vocabularies?

@pdowler
Collaborator

pdowler commented May 20, 2021

As long as the product-type vocabulary, which says "what something is" expands to include terms beyond what ObsCore uses (different kinds of science data) it could be a general purpose way to augment the content_type.

The level 3 and 4 examples above are both using terms from the datalink/core vocabulary; it could be that we have created some confusion with the content of that vocabulary... is there a use case where you would want to specify one of those level 3 and one of those level 4 terms? If so, is it feasible to split the datalink/core vocab into two actually distinct vocabularies (I'm skeptical)? what about simply allowing multiple terms to be used to describe a link that has a complicated multi-faceted relationship to #this? I do in fact have a use case that suggests this and I don't want to get that mixed up with use of product-type, but in general being able to put multiple terms might be an alternative.

On the aside: the "simple thing" I am potentially nervous about is being strict about product_type column being just for terms from the product-type vocab, and then future evolution of that vocab is also strict and not being able to use it for other use cases. The other obvious thing I could see doing is linking to an instance of a data model and for that I'd expect to say content_type=application/x-votable+xml product_type="instance(s) of ObsCore" or something like that. So do we eventually add a base term "model" and narrower terms like "ObsCore" and "Source" and "Cube" to the product-type vocabulary? We could go that way and I'd feel a lot better about adding a strict product_type column to links now if I heard "heh, that sounds cool - we could do some VEPs for that in the near future".

@msdemlei
Collaborator

msdemlei commented May 21, 2021 via email

@Bonnarel
Contributor Author

Bonnarel commented May 23, 2021 via email

@Bonnarel
Contributor Author

Bonnarel commented May 23, 2021 via email

@msdemlei
Collaborator

msdemlei commented May 25, 2021 via email

@msdemlei
Collaborator

msdemlei commented May 25, 2021 via email

@Bonnarel
Contributor Author

> As long as the product-type vocabulary, which says "what something is" expands to include terms beyond what ObsCore uses (different kinds of science data) it could be a general purpose way to augment the content_type.
>
> The level 3 and 4 examples above are both using terms from the datalink/core vocabulary; it could be that we have created some confusion with the content of that vocabulary... is there a use case where you would want to specify one of those level 3 and one of those level 4 terms? If so, is it feasible to split the datalink/core vocab into two actually distinct vocabularies (I'm skeptical)? what about simply allowing multiple terms to be used to describe a link that has a complicated multi-faceted relationship to #this? I do in fact have a use case that suggests this and I don't want to get that mixed up with use of product-type, but in general being able to put multiple terms might be an alternative.

Well, I think it could be interesting to allow some combinations of terms in semantics. I see some use cases, as I explained on the semantics mailing list in the VEP006 discussion. But if it is to be useful for clients, should we not restrict the allowed combinations to some predefined list?

> On the aside: the "simple thing" I am potentially nervous about is being strict about product_type column being just for terms from the product-type vocab, and then future evolution of that vocab is also strict and not being able to use it for other use cases. The other obvious thing I could see doing is linking to an instance of a data model and for that I'd expect to say content_type=application/x-votable+xml product_type="instance(s) of ObsCore" or something like that. So do we eventually add a base term "model" and narrower terms like "ObsCore" and "Source" and "Cube" to the product-type vocabulary? We could go that way and I'd feel a lot better about adding a strict product_type column to links now if I heard "heh, that sounds cool - we could do some VEPs for that in the near future".

@pdowler
Collaborator

pdowler commented Oct 14, 2021

optional content_qualifier field added in PR 57 to resolve this issue.

@pdowler pdowler closed this as completed Oct 14, 2021
@Bonnarel
Contributor Author

Bonnarel commented Oct 14, 2021 via email
