alternative links #117

pdowler · 2024-12-11T17:04:51Z

In the links response, the provider may want to convey multiple URLs to the same content. There is currently no mechanism to tell clients that these links are alternatives and that they should choose one of them.

Use case 1: links that return different formats

Use case 2: links to different storage locations

Use case 3: links with different transport protocols

pdowler · 2024-12-11T17:14:04Z

Proposal from DAL running meeting 19 on 2024-12-11:

Introduce a new optional column in the links table with an opaque value that is common to a set of alternatives.

new column: name="alt_key" datatype="char" arraysize="..."

example:

ID   semantics  access_url                       alt_key
--------------------------------------------------------
id1  #this      https://zone1.example.net/file1  abc
id1  #this      https://zone2.example.net/file1  abc
id1  #this      https://zone1.example.net/file2

In this example, the primary data for id1 is two files, and one file is available from two locations: the first two links are alternatives (same alt_key) and the third link is independent.
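A minimal sketch of how a client might read the example above, assuming the rows have already been parsed into dictionaries and that a missing alt_key arrives as an empty string. The function name `split_by_alt_key` is made up for illustration and is not part of any DataLink library.

```python
from collections import defaultdict

# Rows from the example table; an empty alt_key means "no alternatives".
LINKS = [
    {"ID": "id1", "semantics": "#this",
     "access_url": "https://zone1.example.net/file1", "alt_key": "abc"},
    {"ID": "id1", "semantics": "#this",
     "access_url": "https://zone2.example.net/file1", "alt_key": "abc"},
    {"ID": "id1", "semantics": "#this",
     "access_url": "https://zone1.example.net/file2", "alt_key": ""},
]


def split_by_alt_key(rows):
    """Return (groups, independent); groups maps alt_key -> list of rows."""
    groups = defaultdict(list)
    independent = []
    for row in rows:
        if row["alt_key"]:
            groups[row["alt_key"]].append(row)
        else:
            independent.append(row)
    return dict(groups), independent


groups, independent = split_by_alt_key(LINKS)
for key, rows in groups.items():
    print(key, [r["access_url"] for r in rows])
print("independent:", [r["access_url"] for r in independent])
```

The two file1 links end up in one group keyed by "abc", and the file2 link stays on its own.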

msdemlei · 2024-12-12T10:13:03Z

On Wed, Dec 11, 2024 at 09:14:27AM -0800, Patrick Dowler wrote: Proposal from DAL running meeting 19 on 2024-12-11: Introduce a new optional column in the links table with an opaque value that is common to a set of alternatives.

I suspect, too, that this is the least ugly general treatment of the problem. Still, it *is* ugly. One obvious problem is that with the multi-ID capability of datalink (you probably expected me to moan about this again), you would either have to make sure that the strings somehow are made unique (so that there's no abc with id-1 *and* id-2 at the same time), which will make for long and probably not very readable alt_keys. Or, and I think that's what I'd suggest, make alt_key *plus* the id the group key. I think that's reasonable because no conceivable use case would require us to form cross-id groups. We could then even make the alt_key a small integer, because I suspect having it a string will tempt people to put things in there that really ought to be in semantics or local_semantics or content_qualifier. Finally, from a consumer side the simplest solution would be to have a booleanesque column pick_one that's true if semantics is to be interpreted as alternatives and false or NULL otherwise. This would, for all I can see, work for the cases that have been proposed so far. It would not work if we had multiple equivalence groups per semantics class. Do we expect that? If we do, would it still be smarter to define the grouping columns as (id, semantics, alt_key)?

pdowler · 2024-12-12T20:42:30Z

I agree that specifying uniqueness as {ID,alt_key}, which would allow alt_key values to be pretty small and easily generated, is a plus. The alternative is generating sufficiently long random keys or using UUIDs, which are large for no good reason. I agree it is pretty much nonsense to have alternatives across different IDs.

I'm not against making it a smallint... I almost wrote in the draft proposal that we require a very short string (like arraysize="4" ish), so I was thinking their uniqueness was within a small set of links for the same ID.

I don't think the boolean would work for us because we have multiple {ID,semantics} X multiple locations for each, so I think having alt_key orthogonal is the best general approach.
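To illustrate the objection, here is a rough sketch comparing the two ideas on a made-up response in which id1 has two files, each available from two locations. The rows, the `pick_one` values, and both function names are assumptions for illustration only; `pick_one` is interpreted as "all rows with the same ID and semantics form one set of alternatives", which is one reading of the suggestion above.

```python
from collections import defaultdict

# Hypothetical rows: (ID, semantics, access_url, pick_one, alt_key)
ROWS = [
    ("id1", "#this", "https://zone1.example.net/file1", True, "a"),
    ("id1", "#this", "https://zone2.example.net/file1", True, "a"),
    ("id1", "#this", "https://zone1.example.net/file2", True, "b"),
    ("id1", "#this", "https://zone2.example.net/file2", True, "b"),
]


def groups_from_pick_one(rows):
    """Boolean column: one group per (ID, semantics) where pick_one is true."""
    groups = defaultdict(list)
    for id_, semantics, url, pick_one, _alt_key in rows:
        if pick_one:
            groups[(id_, semantics)].append(url)
    return dict(groups)


def groups_from_alt_key(rows):
    """Opaque key column: one group per (ID, alt_key)."""
    groups = defaultdict(list)
    for id_, _semantics, url, _pick_one, alt_key in rows:
        if alt_key:
            groups[(id_, alt_key)].append(url)
    return dict(groups)


print(len(groups_from_pick_one(ROWS)), "group(s) with pick_one")
print(len(groups_from_alt_key(ROWS)), "group(s) with alt_key")
```

With the boolean, all four links collapse into a single group, so a client picking one would miss a file; with alt_key the two files remain separate groups of two locations each.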

While it's probably true that uniqueness is within the same semantics as well, I don't see how that buys us anything, and maybe it blocks a specialised use... like two links with wider and narrower semantics that are alternatives??

ID   semantics      alt_key  ...  description
id1  #progenitor    alt1     ...  "the default progenitor"
id1  #progenitor-X  alt1     ...  "a more specific progenitor"

where progenitor-X is a narrow term under progenitor. I don't have a specific example.

msdemlei · 2024-12-13T09:26:18Z

On Thu, Dec 12, 2024 at 08:42:54PM +0000, Patrick Dowler wrote: While it's probably true that uniqueness is within the same semantics as well, but I don't see how that buys us anything and maybe it blocks a specialised use... like two links with wider and narrower semantics that are alternatives?? ``` ID semantics alt_key ... description id1 #progenitor alt1 ... "the default progenitor" id1 #progenitor-X alt1 ... "a more specific progenitor" ``` where progenitor-X is a narrow term under progenitor. I don't have a specific example.

Hm... I think that if we consider this kind of thing, we should be explicit about it. You see, the first thing I do when presenting a datalink result (e.g., https://github.com/msdemlei/datalink-xslt.git) is group by semantics. I hence would not even know how to express such an equivalence across the different concepts, and it would certainly take conscious effort to make whatever we want work. Do we know what we want? In a display like what https://dc.g-vo.org/shomydl/q/f/form produces, how would this be shown, if at all? But that's perhaps a detail that we can work out later. And should certainly not keep us from drafting some standards language. Pat? Me? Someone else?

pdowler · 2024-12-13T19:58:58Z

I think what we want is to declare which links are alternatives to other links, and that should be orthogonal to any other concern. So in principle just grouping by alt_key is sufficient to make a small set of links from which the client chooses one.

In order to ensure uniqueness of those groups in a large links response (e.g. many IDs -> many more links) we could make it easier to generate alt_key values by saying they only have to be unique within a specific ID value. That makes the spec a little more complicated and maybe makes the implementation easier, but honestly, if alt_key is a string I would just choose to generate random string codes of length 8-ish and be confident it would be OK. While we could certainly say that is the expectation, I don't think it really makes it easier to work with, so I would avoid the complexity and unforeseen consequences.
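As a sketch of the "random string codes of length 8-ish" idea, a service could draw keys from the Python standard library's `secrets` module. The function names are made up for illustration, and the retry loop is an assumption about how a service might guard against the rare duplicate within one response.

```python
import secrets


def new_alt_key() -> str:
    """Return an opaque 8-character key (4 random bytes, hex encoded)."""
    return secrets.token_hex(4)


def assign_alt_keys(n_groups: int) -> list[str]:
    """Generate n_groups distinct keys for one links response."""
    keys: set[str] = set()
    while len(keys) < n_groups:
        # Adding to a set silently drops a duplicate, so the loop retries.
        keys.add(new_alt_key())
    return sorted(keys)


print(assign_alt_keys(3))
```

Hex keys carry no ordering or ranking that a client could be tempted to interpret; they are only compared for equality.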

So my position is that for spec simplicity I would fall on the side of "all links with the same value of alt_key are alternatives and the client should choose one rather than use all of them". This accomplishes the goal and places slightly more burden on the implementation to take care when assigning alt_key values, but for the use cases we have I do not think that is hard to do.

I will likely prototype this in early January (for the multi-location use case) and I don't think we need to go beyond discussing here until that happens.

As for a UI to handle this... it depends on how you want to present the "choices" and allow the user to "choose". I think we should just concentrate on conveying the correct information, which is the relationship between two or more links in the result.

msdemlei · 2024-12-16T07:09:17Z

On Fri, Dec 13, 2024 at 11:59:19AM -0800, Patrick Dowler wrote: As for a UI to handle this... it depends on how you want to present the "choices" and allow the user to "choose". I think we just concentrate on conveying the correct information, which is the relationship between 2 or more links in the result.

While I agree that there is not much more to discuss before implementation, let me disagree here: if you want to add useful features to protocols, thinking about how the users will consume them is, in my experience, the most useful guideline by a wide margin. If people designed from the user back, it tended to be a good design. If people designed from what seemed convenient to data publishers, it tended to not work out well, not even for the next data publisher. And that again brings me to a fairly firm impression that we either say alt_key is per semantics, or semantics is per alt_key (ugh), or that the two intertwine in... ugh... ways. I grant you that orthogonality is a nice concept, but in practice clients have to give a consistent picture, and there semantics and alt_key simply are not orthogonal. After all, the question that started this is "what does it mean if there are multiple rows with the same semantics?"

gmantele · 2024-12-16T09:09:12Z

I agree with @msdemlei. The user experience must really be the most important consideration, especially with DataLink, which is not very easy or intuitive for users (and implementers).

The problem with random strings is that they carry no useful meaning for users. But what actually bothers me in this proposed solution is that we have no way to say which alternatives are being offered to the user; we have random keys with no meaning (and so I agree with Markus that integers would be easier to generate and easier for a human being to identify). How does the user know which alternative he/she wants? semantics and local_semantics should be enough to answer this question. So, as François B. suggested, maybe we already have our solution here: local_semantics could be used to make links with the same semantics alternatives to each other. Then there would be no "randomly generated and meaningless" grouping keys. Is that not enough? Or have I missed something?

pdowler · 2024-12-16T17:58:45Z

The problem is that local_semantics is a mechanism that already means something else and links with different ID values already have the same local_semantics value and that means something specific to a user interface that is helping a user pick links. So no, it is not a solution.

"alternative links" is completely orthogonal to every other aspect of the links response. It could be a choice between two file formats for the same data, or two locations for the same file. Either way, the client just has to know that there is a choice to be made. In the case of two locations, all the client will see is that the two access_url values are different: they won't have any good reason to prefer one over the other, but at least they will know to not download both.

I prefer alt_key to have no meaning because if it does, even something implied like using integer instead of random string, people will make assumptions (order, rank, whatever).

Admittedly, this is more complex than it looked at the outset, and there are subtle aspects.... I did have another solution that I have designed and considered (but not implemented) that I did not bring up in the meeting. It is essentially the 3rd class of solutions - service descriptor - which is more general but less optimal in terms of number of requests required... maybe we are attempting premature optimization and should consider it. I will post that idea separately.

pdowler · 2024-12-16T18:11:36Z

So, if alt_key is too complicated the other option is service descriptors. The pro: it's a more general solution that can take advantage of other existing tech. The cons: clients that grok the semantics of the service descriptor need to make additional requests and in most cases different kinds of alternatives imply different kinds of services.

Use case 1: links that return different formats
For this use case, the natural way to proceed would be a service that honours the HTTP Accept header and can return the content in different formats (or maybe use the DALI RESPONSEFORMAT param?). The service could provide access to existing files or perform content transformation on-the-fly.
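For illustration only, this is roughly what the client side of Accept-based negotiation could look like with the Python standard library. The URL and the media types chosen are hypothetical, and the sketch only builds the request; it does not contact any service.

```python
from urllib.request import Request


def build_format_request(url: str, media_types: list[str]) -> Request:
    """Build a GET request listing acceptable formats in the Accept header."""
    return Request(url, headers={"Accept": ", ".join(media_types)})


req = build_format_request(
    "https://example.net/data/file1",  # hypothetical service URL
    ["application/fits", "image/png;q=0.5"],  # prefer FITS, accept PNG
)
print(req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would send it; the format actually returned would be reported in the response's Content-Type header.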

Use case 2: links to different storage locations && Use case 3: links with different transport protocols
For this use case, the transfer negotiation API in VOSpace does exactly this. The client specifies an identifier for the target data/file and a set of transfer protocols it knows how to use, and the server returns a set of URLs for all the locations and protocols available. This is a proven mechanism, maybe a little dated but easily updated to support request/response in something other than XML. More specifically, it could more naturally allow clients to say "I know how to use S3" in the request and the server to say "here is the S3 object identifier" in the response (details TBD).

Now, I'm not saying this is the simplest to implement but it is more robustly and clearly specified than trying to wedge alternatives into the links response.

msdemlei · 2024-12-17T09:50:22Z

On Mon, Dec 16, 2024 at 10:11:58AM -0800, Patrick Dowler wrote: So, if `alt_key` is too complicated the other option is **service

No, I don't think alt_key is too complicated. I think we just need to acknowledge that it means "rows that have the same (id, semantics, alt_key) triple are alternatives". Basically, it's just a completely normal GROUP BY. And I highly prefer it to going through service descriptors because it's more visible and works nicely even when the client does not explicitly support alt_key.
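The difference between the two candidate group keys can be shown on the #progenitor example from earlier in the thread. This is a sketch of the GROUP BY idea in plain Python; `group_by` is a made-up helper, not an existing DataLink function.

```python
from itertools import groupby
from operator import itemgetter

# The two rows from the earlier wider/narrower semantics example.
ROWS = [
    {"ID": "id1", "semantics": "#progenitor", "alt_key": "alt1"},
    {"ID": "id1", "semantics": "#progenitor-X", "alt_key": "alt1"},
]


def group_by(rows, *columns):
    """Group rows on the given columns, like SQL GROUP BY."""
    key = itemgetter(*columns)
    ordered = sorted(rows, key=key)  # groupby needs sorted input
    return {k: list(g) for k, g in groupby(ordered, key=key)}


by_pair = group_by(ROWS, "ID", "alt_key")
by_triple = group_by(ROWS, "ID", "semantics", "alt_key")
print(len(by_pair), "group(s) on (ID, alt_key)")
print(len(by_triple), "group(s) on (ID, semantics, alt_key)")
```

Grouping on (ID, alt_key) treats the two rows as alternatives, while adding semantics to the key splits them into two single-row groups, which is exactly the point under debate.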

gmantele · 2024-12-17T12:47:14Z

What I like about the local_semantics solution is that it is a custom vocabulary, so it does not require adding a column. However, I completely agree that this field already has a meaning, which makes it hard to use for something like data location or transport protocol. So, OK, local_semantics is not a good solution.

I also agree with @msdemlei: service descriptors are relatively hidden, and needing one more step to reach other ways of getting the data seems too discouraging in terms of UX.

I agree too that alt_key seems like a better solution, but... it is ugly. It looks like an ugly trick to do something that DataLink is not currently able to do but should be. In particular, I don't like the fact that the content may appear to mean something to humans but is actually meaningless to the machine except for grouping. But I don't have a better alternative to propose yet, except integer values (although, as you said, they may suggest ordering, priority, ...).

pdowler · 2024-12-17T17:22:18Z

OK, I was just showing what the alternative looks like. Yes, service descriptors are more complex from a usage/UX point of view. No argument there :-)

My disagreement is that I think it should just be "rows with the same ID" and not include semantics, because alternatives in general could have different semantics. I am specifically thinking of wider or narrower terms, but there could be other scenarios where the service wants the client to choose. I think the only disagreement is including or not including semantics in specifying the alt grouping. I don't see why we would restrict this unnecessarily when it is well defined either way.

So, can we not just start with "rows that have the same ID,alt_key are alternatives"? That makes it easy enough to implement and does not restrict usage... prototypes will tell if we need more.

msdemlei · 2024-12-18T08:48:06Z

On Tue, Dec 17, 2024 at 09:23:03AM -0800, Patrick Dowler wrote: including or not including semantics in specifying the alt grouping. I don't see why we would restrict this unnecessarily when it is well defined either way.

If "Well defined" includes "clients will know what to do", I'd mildly dispute the "well defined" part :-) But:

So, can we not just start with "rows that have the same ID,alt_key are alternatives"? That makes it easy enough to implement and does not restrict usage... prototypes will tell if we need more.

Well, let's start with that. But if no credible use case surfaces for "alt_key across semantics" until we're through, let's seriously reconsider that decision; as I said, I wouldn't know how to even show such a thing in my Datalink XSLT/js, and I doubt anyone else trying to do anything sensible with semantics will.

pdowler · 2024-12-18T18:16:29Z

current status/decision:

We will go ahead with prototyping an optional alt_key column in the links response to group multiple alternative links with the same ID. This means that {ID,alt_key} specifies a group.
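A sketch of what a client might do under this rule, assuming rows parsed into dictionaries with a null alt_key represented as `None`. The function name `choose_links`, the id2 rows, and the "first one wins" policy are all assumptions for illustration; a real client could apply any preference (protocol, location) when picking within a group.

```python
def choose_links(rows):
    """Keep rows without an alt_key plus the first row of each (ID, alt_key) group."""
    seen = set()
    chosen = []
    for row in rows:
        alt_key = row.get("alt_key")
        if alt_key:
            group = (row["ID"], alt_key)
            if group in seen:
                continue  # an alternative from this group was already chosen
            seen.add(group)
        chosen.append(row)
    return chosen


# Hypothetical multi-ID response: "abc" is reused under id2 but forms a separate group.
LINKS = [
    {"ID": "id1", "access_url": "https://zone1.example.net/file1", "alt_key": "abc"},
    {"ID": "id1", "access_url": "https://zone2.example.net/file1", "alt_key": "abc"},
    {"ID": "id1", "access_url": "https://zone1.example.net/file2", "alt_key": None},
    {"ID": "id2", "access_url": "https://zone1.example.net/file3", "alt_key": "abc"},
    {"ID": "id2", "access_url": "https://zone2.example.net/file3", "alt_key": "abc"},
]

urls = [row["access_url"] for row in choose_links(LINKS)]
print(urls)
```

Because the group key includes ID, the service can reuse short alt_key values across IDs without the client merging unrelated links.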

We will revisit the question of uniqueness and whether or not it should also include semantics (so alternatives have to have the same ID and semantics) later once we have some experience and are more informed by use cases.

@pdowler will create a PR to update doc status to WD-DataLink-1.2 and document the current alt_key idea (probably a separate PR once we're back in WD).

pdowler added the TBD label Dec 11, 2024
