Is there a good way of doing bulk updates w/ SPARQLUpdateStore? #423

Closed
pudo opened this issue Aug 23, 2014 · 9 comments

Comments

@pudo

pudo commented Aug 23, 2014

I want to speed up some imports, so I've just made this update buffer. Is there a less insane way of doing this?

@uholzer
Contributor

uholzer commented Aug 25, 2014

Not really. There are solutions for special cases, however:

When you just want to add triples, use addN.

If you need to create/update/delete whole graphs, check whether your endpoint supports the graph store HTTP Protocol.
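For the whole-graph case, a minimal sketch of what a SPARQL 1.1 Graph Store HTTP Protocol request could look like, using only the Python standard library. The endpoint URL, the graph name, and the helper `graph_store_request` are assumptions made for illustration; they are not part of rdflib, and the endpoint path in particular depends on how your server is configured. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def graph_store_request(endpoint, graph_uri, turtle, replace=True):
    """Build (but do not send) a Graph Store HTTP Protocol request.

    PUT replaces the contents of the named graph, POST merges into it.
    `graph_store_request` is a hypothetical helper, not an rdflib function.
    """
    # The target graph is identified indirectly via the ?graph= parameter.
    url = endpoint + "?" + urlencode({"graph": graph_uri})
    return Request(
        url,
        data=turtle.encode("utf-8"),
        method="PUT" if replace else "POST",
        headers={"Content-Type": "text/turtle"},
    )


req = graph_store_request(
    "http://localhost:3030/ds/data",       # assumed endpoint, adjust to your server
    "http://example.org/graphs/import-1",  # assumed graph name
    "<http://example.org/a> <http://example.org/b> <http://example.org/c> .\n",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would upload the whole graph in one HTTP request, instead of one update per triple.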

@pudo pudo closed this as completed Aug 26, 2014
@uholzer
Contributor

uholzer commented Aug 26, 2014

By the way, it just occurred to me that there is indeed a transactional interface (Store.commit and Store.rollback, called by Graph.commit and Graph.rollback). So maybe it would be better to implement this interface in a subclass of SPARQLUpdateStore. Of course, the implementation would look exactly like your solution.
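To make the idea concrete, here is a rough sketch of a commit/rollback-style buffer. It is deliberately a standalone class rather than a SPARQLUpdateStore subclass, so it makes no assumptions about the store's internals. The names `UpdateBuffer`, `send`, and `max_pending` are hypothetical, and terms are assumed to be already serialized in N-Triples syntax.

```python
class UpdateBuffer:
    """Collect triples locally and flush them as one SPARQL update on commit.

    Hypothetical illustration; this is not rdflib's API.
    """

    def __init__(self, send, max_pending=1000):
        self.send = send                # callable taking one SPARQL update string
        self.max_pending = max_pending  # flush automatically at this many triples
        self.pending = []

    def add(self, s, p, o):
        # s, p, o are strings such as "<http://example.org/a>" or '"literal"'.
        self.pending.append((s, p, o))
        if len(self.pending) >= self.max_pending:
            self.commit()

    def commit(self):
        # Send all pending triples in a single INSERT DATA request.
        if not self.pending:
            return
        body = "\n".join(f"  {s} {p} {o} ." for s, p, o in self.pending)
        self.send("INSERT DATA {\n" + body + "\n}")
        self.pending = []

    def rollback(self):
        # Only unsent triples can be discarded; batches that were already
        # flushed are on the server and are not undone here.
        self.pending = []


sent = []  # stands in for an HTTP POST to the update endpoint
buf = UpdateBuffer(sent.append, max_pending=2)
buf.add("<http://example.org/a>", "<http://example.org/p>", '"discarded"')
buf.rollback()
buf.add("<http://example.org/a>", "<http://example.org/p>", '"1"')
buf.add("<http://example.org/a>", "<http://example.org/p>", '"2"')  # triggers a flush
buf.commit()  # nothing left to send
print(len(sent))
```

Note that the automatic flush trades true rollback semantics for bounded memory use; a subclass that must honour Graph.rollback strictly would have to drop it.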

@pudo
Author

pudo commented Aug 27, 2014

@uholzer on a related note, is there any write-up on which triplestores actually work with rdflib, and what dance one has to dance to make that happen? Fuseki has worked for me but is really slow, Virtuoso and Stardog don't seem to get along with RDFLib --

@gromgull
Member

I've not touched this code in a while, but when I wrote it I tested against Fuseki (and only fuseki :)

It will always be kind of slow as long as you are using the SPO store interface (add/remove/slicing/subjects/etc.). Serializing and deserializing everything over HTTP eats pretty much any advantage you gain from using a faster non-Python-based store.

@pudo
Author

pudo commented Aug 27, 2014

@gromgull Oh, I'm not actually so concerned about write speed, but their SPARQL interface just doesn't seem to scale at all. I'm doing a reasonably complex graph query and it takes 14s to come back - which just makes it not an option for a production web application.

@wwaites
Member

wwaites commented Aug 27, 2014

Ages ago I wrote some bindings via pyodbc for Virtuoso,

 https://bitbucket.org/ww/virtuoso

I seem to remember it was very picky about its idea of transaction
isolation and locking -- much more so than any other database that I
have used. If you can get past that, there might be some mileage in
it. I think it presented itself as an rdflib store...

There might also be a way to compose your query differently for Fuseki
and Jena to make it run faster. I've worked with Dave Reynolds (@der)
before who knows the internal details and may be able to help.

Best,
-w

@pudo
Author

pudo commented Aug 27, 2014

Yeah, I've seen the package, but a) I'm scared of things that haven't been maintained in more than three years (as you've seen, I've already had to get into telescope much more than I wanted); and b) I just want to keep the deployment process reasonably simple. HTTP helps a lot there, while ODBC just seems like an unnecessary hurdle.

I really need to work on these queries, but my sense is that it's just incredibly easy to slow the whole thing down to a crawl. Perhaps I'm doing some stuff fundamentally wrong, though.

Here's more discussion on the subject: uf6/design#6

@uholzer
Contributor

uholzer commented Aug 27, 2014

As @gromgull said, using SPARQLStore's SPO interface is slow because it has to run a query for every operation. If a single SPARQL query executes slowly on a SPARQL endpoint, that is entirely the endpoint's fault (or you wrote a difficult query). All you can do is try different endpoints, including ones you may never have heard of. I think SWI-Prolog also has one, which performs well when using the in-memory store. I have never done a proper comparison, though, and don't know of one.

@pudo
Author

pudo commented Aug 27, 2014

@uholzer thanks for your advice. I understand that the speed of SPARQL query execution is a backend (or query implementation) issue; that's why I was trying out different servers before I realized that rdflib only really connects to Fuseki (from what I gather).

On the whole, my sense is that this entire ecosystem is more tailored to the demands of an academic environment rather than user-facing web apps; so I should probably just go with Neo4J for graph storage like everybody else :)
