Consistent/reproducible TTL and TRIG formatting with Jena? #2672

fkleedorfer · 2024-08-31T12:39:15Z

I am looking into a way of working with RDF files (TTL, mostly) in a git repo. The data can contain blank nodes. I would like to use a formatter via a build tool so as to have nice human-readable files that are always formatted the same way, such that a git diff only shows changes to the actual data. The only problem I see is the ordering of blank nodes. Is there a way to achieve consistent ordering of blank nodes in TTL output with Jena?

... Edit: also asking for a friend: atextor/turtle-formatter#8

afs · 2024-09-01T20:37:27Z

Collaborator

Firstly - be careful what you ask for! Setting Jena up to process blank nodes as given is fragile and is not RDF. Processing RDF as a graph is not the same as editing the file.

If a file is read in twice, in the absence of any other information, RDF requires the blank nodes be kept apart. That's why the parser creates large unique ids.

One approach is to output blank nodes as <_:BlankNodeLabel>, Jena's extension for IRIs that look like blank nodes. When parsed, they create exactly the same blank node.

Another approach: The parsers can run in a non-compliant mode whereby blank node labels are preserved. Coupled with a custom writer, formatting might be preserved.

The "might" is because adding triples may make a major change to the internal indexing, which uses hash maps. The triples to be output may come out in a very different order for a small change to the graph. The writer is going to have to slurp the whole graph and output it in its own defined order, i.e. sort the data.
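The "sort the data" step can be sketched without any Jena API. This is a minimal illustration, assuming the graph has already been serialised as one N-Triples line per triple; `SortedTriples` and its method are hypothetical names, not part of Jena. Sorting the lines gives an order that depends only on the set of triples, not on hash-map iteration order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Jena: orders serialised triples deterministically.
final class SortedTriples {
    /** Returns the lines sorted lexicographically, independent of the input order. */
    static List<String> sorted(Collection<String> ntriplesLines) {
        List<String> copy = new ArrayList<>(ntriplesLines);
        Collections.sort(copy); // String natural order (UTF-16 code units)
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<http://example/s2> <http://example/p> \"b\" .",
            "<http://example/s1> <http://example/p> \"a\" .");
        sorted(lines).forEach(System.out::println);
    }
}
```

This only gives stable diffs if the blank node labels in the lines are themselves stable between runs, which is the harder part of the problem.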

What are you going to do about [ :p 123 ] . - that is, a blank node with no label? When running with LabelToNode.createUseLabelAsGiven(), Jena names them incrementally from _:0000.

See also #2549 - RDFWriter does not expose a way to set the node to label mapping. That has to be done at a lower level.

Example code:

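        import org.apache.jena.atlas.io.AWriter;
        import org.apache.jena.atlas.io.IO;
        import org.apache.jena.graph.Graph;
        import org.apache.jena.riot.RDFParser;
        import org.apache.jena.riot.lang.LabelToNode;
        import org.apache.jena.riot.out.NodeFormatter;
        import org.apache.jena.riot.out.NodeFormatterNT;
        import org.apache.jena.riot.system.StreamRDF;
        import org.apache.jena.riot.system.StreamRDFOps;
        import org.apache.jena.riot.writer.WriterStreamRDFPlain;

        // Parse, keeping blank node labels exactly as written in the file.
        Graph graph = RDFParser.source("D.ttl").labelToNode(LabelToNode.createUseLabelAsGiven()).toGraph();
        // N-Triples node formatter that prints the blank node label unencoded.
        NodeFormatter fmt = new NodeFormatterNT() {
            @Override
            public void formatBNode(AWriter w, String label) {
                w.print("_:");
                // Alternative: w.print(NodeFmtLib.encodeBNodeLabel(label));
                w.print(label);
            }
        };
        AWriter out = IO.wrapUTF8(System.out);
        // Stream the graph to the plain writer using the custom formatter.
        StreamRDF stream = new WriterStreamRDFPlain(out, fmt);
        StreamRDFOps.graphToStream(graph, stream);
        out.flush();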

1 reply

fkleedorfer Sep 2, 2024
Author

Thanks!!

The <_:BlankNodeLabel> approach is "proprietary", I'd like to avoid that.

The second approach still requires labeled blank nodes, which we don't have (and would like to avoid).

Would the RDF Dataset Canonicalization algorithm help at all? If I don't misunderstand the algorithm, the N-th degree hashes for blank nodes and therefore their ordering wrt other blank nodes in the naming scheme should be stable except if there is a change to the blank node's triples - so it would provide a not-so-bad solution which would only occasionally reorder blank nodes unexpectedly... correct?

If so, can we access those hashes for ordering?
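For reference, the first-degree hash in RDFC-1.0 can be sketched in plain Java. This is an illustration under assumptions, not a Jena API: `FirstDegreeHash` and `Triple` are hypothetical names, terms are plain strings already in N-Triples syntax, only the default graph is handled, and the n-degree step that separates blank nodes with identical first-degree hashes is left out. For each triple that mentions the blank node, the node itself is written as _:a and every other blank node as _:z; the lines are sorted and hashed with SHA-256.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not a Jena API. Terms are strings in N-Triples syntax,
// e.g. "<http://example/p>", "_:b0", "\"123\"".
final class FirstDegreeHash {
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // The reference blank node becomes _:a, every other blank node becomes _:z.
    private static String mask(String term, String reference) {
        if (!term.startsWith("_:")) return term;
        return term.equals(reference) ? "_:a" : "_:z";
    }

    static String hash(String reference, List<Triple> graph) {
        List<String> lines = new ArrayList<>();
        for (Triple t : graph) {
            if (!t.s.equals(reference) && !t.o.equals(reference)) continue; // not mentioned
            lines.add(mask(t.s, reference) + " " + t.p + " " + mask(t.o, reference) + " .\n");
        }
        Collections.sort(lines); // UTF-16 order; same as code point order for BMP-only data
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(String.join("", lines).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every Java platform
        }
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
            new Triple("_:x", "<http://example/p>", "\"1\""),
            new Triple("<http://example/s>", "<http://example/q>", "_:x"));
        System.out.println(hash("_:x", graph));
    }
}
```

The hash depends only on the triples that mention the node, so renaming the label or changing unrelated triples leaves it unchanged, while adding or removing a triple on that node changes it.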

afs · 2024-09-02T11:03:50Z

Collaborator

The <_:BlankNodeLabel> approach is "proprietary", I'd like to avoid that.

True.
You might try to skolemize as described in RDF Concepts:
https://www.w3.org/TR/rdf12-concepts/#section-skolemization
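A minimal sketch of what skolemization amounts to, assuming terms are plain strings in N-Triples syntax. `Skolemizer` is a hypothetical name, and `https://example.org` is a placeholder authority. The `/.well-known/genid/` path is the one RDF Concepts suggests for Skolem IRIs. The sketch reuses the existing blank node label for readability; RDF Concepts asks for a globally unique IRI per replaced node, so a real system would normally mint a fresh identifier instead.

```java
// Hypothetical sketch, not a Jena API: rewrites blank node terms as Skolem IRIs.
final class Skolemizer {
    private final String base;

    Skolemizer(String authority) {
        // Well-known path suggested by RDF Concepts for Skolem IRIs.
        this.base = authority + "/.well-known/genid/";
    }

    /** Blank node terms ("_:label") become IRIs; all other terms pass through unchanged. */
    String skolemize(String term) {
        if (!term.startsWith("_:")) return term;
        return "<" + base + term.substring(2) + ">";
    }

    public static void main(String[] args) {
        Skolemizer sk = new Skolemizer("https://example.org");
        System.out.println(sk.skolemize("_:b0"));
        System.out.println(sk.skolemize("<http://example/s>"));
    }
}
```

Once the nodes are IRIs, they sort and diff like any other IRI, at the cost of the file no longer containing blank nodes.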

The second approach still requires labeled blank nodes,

Unlabelled blank nodes become _:0000 onwards, numbered in encounter order, and the parser output is deterministic given the order in the file. Obviously, if you add something in the middle, it has a knock-on effect.
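The knock-on effect can be illustrated with a small sketch. `EncounterOrderLabels` is a hypothetical stand-in for the parser's labelling, and the four-digit zero-padded format after _:0000 is an assumption. Each anonymous node gets the next counter value in file order, so inserting one node in the middle shifts every later label.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Jena's implementation: labels anonymous nodes in encounter order.
final class EncounterOrderLabels {
    /** Maps each anonymous node (identified here by a placeholder name) to its label. */
    static Map<String, String> label(List<String> nodesInFileOrder) {
        Map<String, String> labels = new LinkedHashMap<>();
        int counter = 0;
        for (String node : nodesInFileOrder) {
            // Assumed label format: _:0000, _:0001, ...
            labels.put(node, String.format(Locale.ROOT, "_:%04d", counter++));
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(label(List.of("A", "B")));      // B is second in the file
        System.out.println(label(List.of("A", "X", "B"))); // X inserted before B shifts B's label
    }
}
```

In a git diff this shows up as changes on every triple that mentions a blank node after the insertion point, even though only one node was added.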

Not sure about RDF Dataset Canonicalization - it might help, but it is choosing the blank node hash. As you note, it's only stable if no changes happen, which might or might not suit your situation.

3 replies

fkleedorfer Sep 3, 2024
Author

Thanks, I'll start with the default behavior for now and check back if I run into trouble.

fkleedorfer Sep 6, 2024
Author

Turns out, when I try with

RDFDataMgr.write(StringWriter, Model, RDFFormat.TURTLE_PRETTY)

not even the URI nodes are sorted consistently across multiple runs on the same data (so there is no need to look at the blank nodes at this point).

Is there a way to achieve that?

afs Sep 9, 2024
Collaborator

Let's focus on a shared example.

What scale of graph do you want to handle?
Can you provide an example turtle file we can discuss?

RDFFormat.TURTLE_PRETTY may well reorder. It's also quite extensible by subclassing.
It also hides blank nodes for RDF collections (AKA lists) and for singly-connected blank nodes, which it writes as nested predicate-object lists such as :s :p [ :q 123 ], and that may make the problem harder.

fkleedorfer · 2024-09-09T15:37:07Z

Author

My approach is to extend turtle-formatter with @afs's suggestion to intercept the parser, here: BlankNodeOrderAwareTurtleParser.java. This may also be extended to keep track of the ordering of URI resources, and maybe even to preserve comments.

2 replies

afs Sep 9, 2024
Collaborator

It is easy enough to write an insertion-order preserving graph implementation. It would not be as scalable, but millions of triples is probably doable.

Relying on hash tables alone is fragile - add one triple and the table may resize, which changes iteration order significantly.
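The core of an insertion-order preserving store can be sketched with a LinkedHashSet. This is only an illustration: `InsertionOrderTriples` is a hypothetical name, triples are plain strings, and wiring it into Jena's Graph interface is not shown. Unlike a plain HashSet, iteration order here is the order of first insertion and does not change when the table resizes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not a Jena Graph implementation: a set of triples
// that iterates in the order the triples were first added.
final class InsertionOrderTriples {
    private final Set<String> triples = new LinkedHashSet<>();

    /** Adds a triple; re-adding an existing triple keeps its original position. */
    boolean add(String triple) { return triples.add(triple); }

    boolean remove(String triple) { return triples.remove(triple); }

    /** Snapshot of the triples in insertion order. */
    List<String> inOrder() { return new ArrayList<>(triples); }

    public static void main(String[] args) {
        InsertionOrderTriples store = new InsertionOrderTriples();
        store.add(":c :p 3 .");
        store.add(":a :p 1 .");
        store.add(":b :p 2 .");
        store.inOrder().forEach(System.out::println);
    }
}
```

The extra linked list costs memory per entry, which is the scalability trade-off; pattern lookups would still need their own indexes on top of this.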

fkleedorfer Sep 9, 2024
Author

It is easy enough to write an insertion-order preserving graph implementation. It would not be as scalable, but millions of triples is probably doable.

That would be more than enough.
