Make immutability of fragments proper part of the specification #64

Open
smessie opened this issue Mar 26, 2025 · 9 comments

Comments

@smessie

smessie commented Mar 26, 2025

In the current specification, an LDES is introduced as a collection of immutable objects. However, the specification does not define how fragments SHOULD be marked as immutable. The only mention of this is in a non-normative note, which suggests using the HTTP Cache-Control header.

I propose including a normative definition for marking fragments as immutable. Specifically, the specification should state that:

  1. Fragments SHOULD include the HTTP header: Cache-Control: public, max-age=604800, immutable
  2. Fragments SHOULD include an RDF property: ldes:immutable true

While the first approach is already mentioned in a note, I propose formalizing it and adding the second approach for greater flexibility.

Consider a scenario where a publisher has limited resources to host an LDES and relies on a simple HTTP server (e.g., python -m http.server, GitHub Pages, or similar tools). These often do not support configuring HTTP headers for specific pages.

Furthermore, since LDES metadata is represented in RDF, it is inconsistent that immutability is only indicated via an HTTP header. Including an ldes:immutable property ensures a more complete RDF-based solution without adding system overhead. Clients still need to process an immutable fragment fully once before they can determine that they do not need to fetch it again.

That said, some systems may already leverage HTTP headers for this purpose. To accommodate both approaches, I propose that the specification include both methods as normative recommendations.
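
To make the two proposed signals concrete, here is a minimal client-side sketch, assuming triples are already parsed into plain (subject, predicate, object) tuples. The predicate IRI for ldes:immutable is hypothetical: it is the property proposed in this issue, not something the current specification defines, and the function names are illustrative only.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical predicate: `ldes:immutable` is the property proposed in this
# issue; it is not defined by the current LDES specification.
LDES_IMMUTABLE = "https://w3id.org/ldes#immutable"

Triple = Tuple[str, str, object]


def header_marks_immutable(cache_control: Optional[str]) -> bool:
    """True if the Cache-Control value contains the `immutable` directive."""
    if not cache_control:
        return False
    directives = {part.strip().lower() for part in cache_control.split(",")}
    return "immutable" in directives


def body_marks_immutable(fragment_iri: str, triples: Iterable[Triple]) -> bool:
    """True if the fragment carries the (proposed) `ldes:immutable true` triple."""
    return any(
        s == fragment_iri and p == LDES_IMMUTABLE and o is True
        for s, p, o in triples
    )


def is_immutable(
    fragment_iri: str, cache_control: Optional[str], triples: Iterable[Triple]
) -> bool:
    """Treat a fragment as immutable if either signal says so."""
    return header_marks_immutable(cache_control) or body_marks_immutable(
        fragment_iri, triples
    )
```

With this shape, a fragment served from a static host without header control can still be recognized as immutable through the RDF body alone.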

@pietercolpaert
Member

pietercolpaert commented Mar 26, 2025

I wonder if your proposed ldes:immutable would still be necessary if we change the state management for an LDES Event Source (i.e. a view that uses the ldes:timestampPath in their fragmentation) to take the timestamp into account as proposed here: rdf-connect/ldes-client#19

@smessie
Author

smessie commented Mar 26, 2025

I can think of at least two scenarios where this approach might fall short:

  1. Relations without timestampPath as a value
    If relations do not use the timestampPath property (e.g., using the default tree:Relation or fragmenting on another property), the client cannot reliably determine whether new members will be added to a given fragment. The publisher may know that a fragment linking to the next fragment is complete, but without an explicit indicator, the client cannot be certain. Should we enforce that every relation must use timestampPath?
  2. Intermediate nodes in a tree structure
    In some cases, intermediate nodes in a tree already contain all relations to their child nodes. As long as new members are added to a leaf node, all intermediate nodes along the path from the leaf to the root will not be considered immutable by the client. However, the publisher knows these nodes are immutable and could explicitly inform the client.

I like the idea of a smarter client that can infer immutability even when the publisher does not explicitly mark it. However, having an explicit immutability strategy would still offer benefits in some scenarios.

@pietercolpaert
Member

In general, what I’d like to see is an analysis in spec text of a #31, and then analyze what kind of extra features we need to make it faster. I rest my case that I think that ldes:timestampPath and ldes:EventSource might be a sufficient and simple solution, given that keeping the state can also work in that way. I may be wrong though...

On 1: I think a running example would be where you have an event source that is first geospatially fragmented, and only then chronologically. I think in this case you will never be able to put an immutable flag on the root node that contains the geospatial fragments anyway. I’d like to see when in the algorithm we really need a flag like this to do something useful.

On 2: I think that’s a feature rather than a bug: this way we might be more efficient in handling rebalancing, as the next time you resume your synchronization, you might see the more efficient rebalanced tree right away?

I think both 1 and 2 are good examples of why I’d explicitly like to standardize the replication algorithm first, and then take this issue further if it is still needed.

@smessie
Author

smessie commented Mar 26, 2025

So I understand it correctly that you propose to get rid of the HTTP header as well?

Consider the already implemented TimeBucketFragmentation of @ajuvercr where you have a fragmentation on e.g. /year/month/day. You can configure it to contain Y members in /year, and when Y is exceeded, it will create child buckets and relations for /year/month where another M members can be placed, after which the same happens for /year/month/day housing D members each. This fragmentation enables a client to find and process members fast while traversing the tree, and is a perfect example of an LDES structure that we can find out there in the wild.
If you now propose to remove the ability to mark the fragments /year and /year/month as immutable while new members are being added to a /year/month/day fragment, that would mean that on every poll interval Y+M+D members need to be processed instead of only D, even though we know that only the D members can have changed.
This is even an example where we do use the timestampPath in the relations.
You could also consider the equivalent fragmentation for a PrefixFragmentation, with year=first-char, month=second-char, day=third-char.

You can see this as a feature while working out a rebalancing fragmentation, and in that case you should just not mark fragments as immutable yet, but in many other scenarios this will potentially cause a painful performance degradation.
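
As a rough illustration of the cost argument above, the following sketch counts the members a client re-processes on one poll along the path /year, /year/month, /year/month/day. The function name and the sample capacities are made up for the example; they are not taken from the TimeBucketFragmentation implementation.

```python
def members_per_poll(y: int, m: int, d: int, ancestors_immutable: bool) -> int:
    """Members re-processed on one poll of /year -> /year/month -> /year/month/day.

    y, m, d are the member capacities of the year, month, and day buckets.
    If the full year and month buckets are marked immutable, only the day
    bucket is processed again; otherwise all three are.
    """
    return d if ancestors_immutable else y + m + d


# Assumed capacities, chosen only for illustration.
Y, M, D = 1000, 500, 100
without_flag = members_per_poll(Y, M, D, ancestors_immutable=False)
with_flag = members_per_poll(Y, M, D, ancestors_immutable=True)
print(without_flag, with_flag)
```

Under these assumed numbers the client handles Y+M+D members per poll without the flag and only D with it, and the gap repeats on every poll interval.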

@ajuvercr

I always thought that we 'ignored' rebalancing with relation to immutability of fragments, as you could always make the point that rebalancing is an option thus never make a fragmentation immutable, and instead add support for rebalancing by changing mutable fragments to containing a relation to the new equivalent node.
Personally I like the idea of adding an immutable triple to immutable fragments, as this reflects the difference in replication logic and the transportation layer, you can derive replication logic from the transportation layer, but not the other way around.
Side note: I wouldn't make it Cache-Control: public, max-age=604800, immutable, I think adding immutable alone is enough, as some pages might not be public and should only be cached on the client side.
Even this might break authentication though ...

@pietercolpaert
Member

So I understand it correctly that you propose to get rid of the HTTP header as well?

No, that should remain the default case.

Your problem statement however started with environments like github pages in which we want to make sure clients do not need to refetch the same info over and over again because of the inability to control the headers. I thought this initial problem could be solved using smarter state management for LDES specifically.

If we indeed want to put headers in the body of the RDF, we might however want to look at existing vocabularies that already do that, and probably want to discuss this for all of TREE instead of just LDES.

Consider the already implemented TimeBucketFragmentation

But a year bucket that already passed will never be processed again, no? And we could still use cache immutability together with this technique, although I think then the necessity of adding headers in the body might not be as important anymore.

@pietercolpaert
Member

I always thought that we 'ignored' rebalancing with relation to immutability of fragments, as you could always make the point that rebalancing is an option thus never make a fragmentation immutable, and instead add support for rebalancing by changing mutable fragments to containing a relation to the new equivalent node. Personally I like the idea of adding an immutable triple to immutable fragments, as this reflects the difference in replication logic and the transportation layer, you can derive replication logic from the transportation layer, but not the other way around.

Ok, I can see the potential benefit here. So you wouldn’t describe it using an HTTP header vocabulary, but only talk about immutability of the tree:Node pages, right?

Side note: I wouldn't make it Cache-Control: public, max-age=604800, immutable, I think adding immutable alone is enough, as some pages might not be public and should only be cached on the client side. Even this might break authentication though ...

The idea here was to provide a fallback for caches that do not yet support the more recent immutable directive. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Cache-Control (→ check for immutable)
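
The two positions on the header value can be combined in a small helper, sketched here under the assumption that a publisher knows whether a page may be stored by shared caches. The function name and the shared parameter are hypothetical; public, private, max-age, and immutable are standard Cache-Control directives.

```python
def cache_control_value(max_age_seconds: int = 604800, shared: bool = True) -> str:
    """Build a Cache-Control value for an immutable fragment.

    max-age is kept as a fallback for caches that ignore `immutable`.
    `private` restricts storage to the end client's own cache, which may
    suit pages that are not public.
    """
    visibility = "public" if shared else "private"
    return f"{visibility}, max-age={max_age_seconds}, immutable"


print(cache_control_value())
print(cache_control_value(shared=False))
```

Whether private is sufficient for authenticated pages is left open here, as it is in the discussion.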

@smessie
Author

smessie commented Mar 26, 2025

No, that should remain the default case.

Then I am still convinced that using an HTTP header should not be the only way to mark fragments as immutable, and adding it as a property in the RDF body should be considered too.

Your problem statement however started with environments like github pages in which we want to make sure clients do not need to refetch the same info over and over again because of the inability to control the headers.

Just a simple example to showcase that only considering headers is not always enough.

If we indeed want to put headers in the body of the RDF,

I'm not proposing that. That does not sound like a good way to go to me. We are talking about marking fragments as immutable and only that.
It was proposed earlier as part of a note in the spec to do so by setting a specific HTTP header.
Here, I am proposing to do so by setting an ldes:immutable property on the fragment in the RDF body.

But a year bucket that already passed will never be processed again, no?

If it has passed it won't, but I'm talking about adding members to a day bucket where that day falls within the year fragment we are considering here. In this scenario, your proposal for smart state management will process the year fragment again. The year fragment spans a whole year, so the last processed timestamp falls within its bounds; your algorithm will therefore still consider this year fragment mutable and refetch it every poll interval.
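
The argument can be sketched as follows. This is only one possible reading of timestamp-based state management, not the algorithm from rdf-connect/ldes-client#19; the function name, parameters, and dates are assumptions made for illustration.

```python
from datetime import datetime, timezone


def needs_refetch(
    bucket_end: datetime,
    last_processed: datetime,
    marked_immutable: bool = False,
) -> bool:
    """Decide whether a time bucket must be fetched again on the next poll.

    Assumed rule: without an explicit flag, a bucket may still change as long
    as its end lies after the last processed member timestamp.
    """
    if marked_immutable:
        return False
    return last_processed < bucket_end


utc = timezone.utc
last_seen = datetime(2025, 3, 26, tzinfo=utc)
end_of_2024 = datetime(2025, 1, 1, tzinfo=utc)
end_of_2025 = datetime(2026, 1, 1, tzinfo=utc)

print(needs_refetch(end_of_2024, last_seen))  # a year that has passed
print(needs_refetch(end_of_2025, last_seen))  # the current year bucket
print(needs_refetch(end_of_2025, last_seen, marked_immutable=True))
```

Under this reading, the current year bucket is refetched on every poll even when it is already full, unless the publisher marks it immutable explicitly.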

So you wouldn’t describe it using an HTTP header vocabulary, but only talk about immutability of the tree:Node pages, right?

Agreed with everything @ajuvercr mentioned and with this quoted conclusion.

@pietercolpaert
Member

Alright, I can live with the ldes:immutable proposal, but I’d like to only merge this in the spec when we have clarity on the full algorithm as part of the new standardization trajectory at SEMIC.
