Mixing RDF and Neo4J #3
Comments
Hey! I don't think I can manage provenance inside neo4j in the way that I want to (at the statement level) so my plan for Open Oil is to move to a triple store next. It would help if you gave me an example of a query that you'd like to run that is currently not working well. The gist of my response is: I think ideally whatever storage you choose should be able to generate an RDF representation (I think I should be able to generate RDF from Neo4j (iilab/openoil@1d2c213) for instance), and your RDF generation Gist seems like it should be able to do that. Whether you need a triple store or not depends on how granular you need the provenance data to be. The moment you start storing some aspect of your data in a triple store, though, is I think the moment where you need to consider storing your main data in the triplestore. From a data modelling standpoint, if you generate RDF it should probably be in quads or named graphs, but you could probably still implement the logic of that with your current backend (maybe by introducing a new table, however, just to capture the meta-statements, instead of the added metadata field you currently have in your main graph table). I think having both ElasticSearch and Neo4j is generally not necessary, but I'd need to understand what you mean by "Prefix lookup". Neo4j embeds a Lucene full text search engine that should cover most of your needs, I think, if you were to use it. I like the idea of building an MQL query end point to Neo4j. But maybe for now it's simpler to convert your MQL generation widgets into Cypher generating widgets? Do you have some examples of common structures of the MQL queries you end up generating? I don't claim to be able to deal with the generalised problem, but maybe I could give it a shot that would work for Grano's use cases. It would then be interesting to look into MQL to SPARQL.
It feels like Gremlin is currently the closest to being able to pivot between graph querying languages (https://groups.google.com/forum/#!topic/gremlin-users/7QS7j9aA5NA) but it doesn't support MQL as far as I'm aware. But I suspect MQL is simple... Seems like giving Cayley a kick in the tires might be instructive. But also give a shot to:
Hope that helps. |
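To make the "quads or named graphs" suggestion above a bit more concrete, here is a minimal sketch of statement-level provenance written out as N-Quads. It is only an illustration under assumptions: the `example.org` IRIs, the helper names, and the choice of one named graph per statement are all made up for this example, and `prov:wasDerivedFrom` from PROV-O is just one possible provenance predicate, not something Grano or Open Oil is known to use.

```python
# Hypothetical sketch: one named graph per statement, plus a provenance
# triple (in the default graph) that says where that graph came from.
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Quads expects."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def nquad(subject, predicate, obj, graph=None):
    """Serialise one statement; the graph label is the optional fourth term."""
    terms = [subject, predicate, obj]
    if graph is not None:
        terms.append(graph)
    return " ".join(terms) + " ."


def statement_with_source(subject, predicate, obj, graph, source):
    """Return the statement in its named graph and a triple about that graph."""
    return [
        nquad(subject, predicate, obj, graph),
        nquad(graph, iri(PROV_DERIVED), source),
    ]


# All IRIs below are placeholders for illustration only.
lines = statement_with_source(
    iri("http://example.org/entity/acme"),
    iri("http://example.org/prop/name"),
    literal('Acme "Oil" Ltd'),
    iri("http://example.org/statement/1"),
    iri("http://example.org/source/annual-report"),
)
print("\n".join(lines))
```

The same two-part shape (a statement, plus a meta-statement about its graph label) is what a separate "meta-statements" table in a SQL backend would have to capture.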
Thanks, Jun, for that comprehensive reply and the valuable thoughts! To begin, where I'm currently stuck: doing bidirectional queries on my SQL-based graph. When you're storing a graph in SQL, it'll always be directed, because the edge/link table has to have a source and a target column. This isn't a huge problem if you know which way the links you're looking for are pointed, but if you don't know (or don't care), the only way to query this is to do a query that either includes a big UNION or generates an impossibly big join table (for my 30k-link dataset, the full bidirectional join is 19 million rows). This leads me to think I may have reached the limits of what I can do in SQL. Now, to your questions:
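For readers following along, the UNION approach described above looks roughly like this for a single hop. This is a sketch only: the `link` table, its `source`/`target` columns, and the sample rows are assumptions for illustration, not Grano's actual schema.

```python
import sqlite3

# Hypothetical directed link table, as described in the comment above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link (source TEXT NOT NULL, target TEXT NOT NULL);
    CREATE INDEX link_source ON link (source);
    CREATE INDEX link_target ON link (target);
    INSERT INTO link VALUES ('a', 'b'), ('c', 'a'), ('b', 'd');
""")


def neighbours(conn, node):
    """Return every node linked to `node`, ignoring edge direction."""
    rows = conn.execute(
        """
        SELECT target AS other FROM link WHERE source = :n
        UNION
        SELECT source AS other FROM link WHERE target = :n
        ORDER BY other
        """,
        {"n": node},
    )
    return [row[0] for row in rows]


print(neighbours(conn, "a"))
```

One hop is cheap because each half of the UNION can use its own index. The pain starts with multi-hop paths, where every additional hop needs the same two-way lookup again and the intermediate results multiply.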
Since I wrote this issue, I've realised that simple indexes are not optional; they're needed in order to be able to do any type of meaningful updates on the dataset. So the flat-file option is pretty much dead unless I want to start implementing my own indexes (at that point I'm just building a graph database). The next most conservative thing to do, IMO, would be to use a single-table relational DB as the quad store. That's not performant enough to do full-blown SPARQL, but it should be OK as a sort of master data storage from which more query-friendly forms (like Neo4J or even in-memory graphs) can be derived. I do understand your reluctance to do this type of double storage, but I just don't feel the one-size-fits-all thing exists. On MQL: I just don't think that one can safely expose Cypher; there are probably a dozen ways to circumvent the term filter that you've proposed previously. And once you're off to parse this thing in any meaningful way, why not go and do a nice query API like MQL? In any case, it's a bit of a gimmick - and won't determine any of the major architecture choices. Finally, the databases: as a general thing, I would like to keep this reasonably simple to deploy on a reasonably-sized machine. Titan hits me very badly on this front. I've worked with Virtuoso in the past, and have sworn to myself that I will never again touch it with a stick. It's a mad piece of software; needs to die. Not a rational argument here, so much as an irrational fear :) That leaves Orient and Cayley: Cayley is really awesome from what I've seen, including native quad support. But it's just very, very new and it looks like some fairly basic bits are missing (e.g. I think it may only do bulk import at launch at this time?). Orient looks really cool, going to explore that more today. Haven't seen any mentions of quads yet, though. |
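A rough sketch of what the single-table quad store mentioned above could look like, assuming SQLite. The table name, column names, index choices, and the `ex:`/`src:` identifiers are all hypothetical; the point is just that the fourth column carries the source, so the same statement can be recorded once per source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quad (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL,
        graph     TEXT NOT NULL,  -- assumed here to identify the source
        PRIMARY KEY (subject, predicate, object, graph)
    );
    CREATE INDEX quad_pos ON quad (predicate, object, subject);
    CREATE INDEX quad_graph ON quad (graph);
""")


def add_quad(conn, s, p, o, g):
    """Store one statement; re-adding an identical quad is a no-op."""
    conn.execute("INSERT OR IGNORE INTO quad VALUES (?, ?, ?, ?)", (s, p, o, g))


def match(conn, s=None, p=None, o=None, g=None):
    """Return quads matching a pattern, where None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("subject", s), ("predicate", p),
                          ("object", o), ("graph", g)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    query = ("SELECT subject, predicate, object, graph FROM quad" + where
             + " ORDER BY subject, predicate, object, graph")
    return conn.execute(query, params).fetchall()


# The same statement asserted by two (made-up) sources.
add_quad(conn, "ex:acme", "ex:operates", "ex:field1", "src:report-2014")
add_quad(conn, "ex:acme", "ex:operates", "ex:field1", "src:press-release")
print(match(conn, s="ex:acme"))
```

Single-pattern lookups like this are fine in SQL. Anything that chains patterns together means self-joins on this one table, which is why it is pitched here as master storage rather than as the query engine.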
hey. :) i'm working on something similar to uf6 (open-app, openappjs), and for the data subsystem i was thinking of building on top of leveldb and levelgraph, as the level and levelgraph ecosystems allow you to "build your own database" using modules. |
Hey Michael! Levelgraph looks pretty cool, it'd be very interesting to find out if there is a) quad support and b) non-JS bindings or a REST API! |
a) you can add additional properties to triples like so. they can be accessed during query filters, but the additional properties are not by default indexed like the triples are. there's also an issue up about proper named graphs: levelgraph/levelgraph#43. in general, the "build your own database" approach has less "batteries included", so it's really dependent on what you want, just figured i might as well share. |
Hey @pudo Re. limits of SQL, makes a lot of sense. Seems, as you say in 1/, that you're hitting the "I am building my own triple(quad) store" barrier indeed. Re. 2/ that sounds messy. I didn't mean impossible, I meant increased complexity and failure modes. Re. 3/ & 4/ awesome! Let's! Re. Indices and rebuilding a graph database. Yup. Re. Yes, you read my mind, this gets negative points for me because of the increased system complexity and managing multiple "conceptual views" (your graph system which holds the data, and your meta-graph system which holds the provenance) on top of what is essentially just the graph (if it allows quads). I think at some point you'll also be dealing with some "relational" integrity issues when your main graph evolves and needs to be kept linked properly to your meta-graph. Also I think there's a paradoxical thing going on here. If you add an index to your triple in your current SQL main database you're in fact implementing a quad. If you store quads in your secondary database, then aren't you essentially duplicating the triple information in your main database and adding a "column"? Re. one size fits all. The problem as I see it with regards to moving your main storage to a graph DB is mostly about changing a lot of the plumbing which you invested a lot of effort in, but I don't see why it wouldn't "fit". What are the things that would not work on a graph DB? (Sorry, I think if I read more of the Grano code I'd be able to answer this myself, but I'm sure it'll be faster if you explain.) Re. building a quad store in SQL. Maybe provenance queries, which are our main use case for now I guess, could be predictable enough in their structure to optimise your quad store for this type of query, but I'm worried that you would also run into the same performance and giant-join problems for anything that tries to treat the quad graph really as a graph, rather than as a provenance store. But maybe that's enough.
We would need to be a bit more granular about the type of queries we think would be run for the type of applications we're interested in, don't you think? Moving the "choosing a graph database" discussion (and @ahdinosaur's contributions) to a new issue! #4 |
You're convincing me, @jmatsushita - maybe it is worth finding another "one size fits all" solution. I'm beginning to be somewhat attracted by RDF as the standardized SPARQL Query/Update mechanism means that you can swap out backends easily (or so the theory goes). For grano specifically, there might be a migration path where I move the core graph to a triplestore first and keep the rest (users, projects, ...) in a sqlite database before finally moving everything over. |
Awesome. Yes, keeping users and projects aside makes a lot of sense to me.
|
@jmatsushita I think I need a bit of advice from you.
The SQL implementation of grano is meeting a few limitations when querying the graph in depth, so I'm trying to think of alternative approaches. At the moment, the best thing I was able to come up with is a wild, poly-backend mix of RDF, ElasticSearch and Neo4J.
RDF would be stored in hashed flat files and contain full provenance and alternative values. This would be the master data, but it wouldn't be easily queryable. Queries would therefore be handled by ElasticSearch for simple stuff (like prefix lookups) and Neo4J for more complex queries (I still want to wrap it in an MQL front-end because exposing CYPHER to the web seems insane).
I still want to have a bespoke web interface for entering, exploring and editing data (i.e. in news orgs), but that would basically interact with a simplified web API...
I'm wondering a) what do you think of this? b) would that bring our projects closer together, or are you thinking of ways to handle provenance inside of Neo4J? I just can't get myself to trust that unholy thing as a main data store :)