ARQ optimizer vs TDB2 optimizer #1659
Replies: 3 comments
-
Nobel nominees are humans (RDFS domain). I don't think AIs are permitted. So the first triple pattern isn't needed because it is a check that "human" has been explicitly declared. Data quality issue.
You can turn it off per endpoint, per dataset or server-wide. (Not per query because that exposes a DoS attack.) It is a relatively new optimization, so turning it off is likely to have a less serious impact. Or rewrite in the style of a data check: reverse the triple patterns:
or be explicit:
-
Thanks Andy, and I agree with everything you're saying. The point is that the queries I report here are not "my" queries; others are asking me "why is it so slow?". I know that I can try to reorder things or rewrite queries, but that needs an understanding of SPARQL as well as of the dataset statistics. In addition, the example query here is from a question answering challenge on Wikidata. I don't care about it, but people in our group are using those queries to evaluate their research. Indeed, they could rewrite all the queries in the benchmark, but they asked me "why ..." first. I already provided them with the rewritten query when they asked, but you know, people are lazy and expect the system to do most if not all of the work ...
They also asked me about other queries like
SELECT DISTINCT ?result WHERE {
  ?film wdt:P1476 ?result ; # has title (this is the huge part with 43 190 482 bindings)
        wdt:P31/wdt:P279* wd:Q11424 ; # type or subtype film
        wdt:P179 wd:Q22092344 # part of Star Wars universe
}
Here the reason is not the ARQ optimizer; rather, the TDB optimizer won't reorder the triple patterns because the second contains a property path, so statistics wouldn't be taken into account.
True, though Wikidata doesn't use RDFS at all, nor would it be the domain of the property, which is just "nominated for". Indeed I agree with you, but we know the real-world domain and expect the data to reflect this. Anyway, thanks for the feedback. I'll leave the optimizer as is for now.
-
A specific transform for the DBpedia service would be possible, though I'd be interested, and a bit surprised, if always turning the transform off didn't work for you.
Please - there is only so much the project can do with the limited resources available. There have been several reports from you and colleagues. It is easy to raise an issue (though it has normally needed further refinement that could have been included earlier); it is harder to do anything about it in the timescale you seem to be seeking.
Try without the optimization step.
A research solution is a DBpedia-transform that is tuned initially by hand, but could be configured by machine learning.
As I have said elsewhere, I am concerned about solving issues that are due to other triplestores' current behaviour, and also concerned about making use-case-specific changes to Jena that negatively impact the user community as a whole. That is what extensions are for.
Paths
That's a different situation. As you know, there is #1629 (GH does not have "epics" but that is what it is). There is upcoming work for 4.7.0, #1638, that is waiting for feedback as to whether it has negative effects. It does not address everything but it should improve one of the cases you have raised.
Before moving on to another issue, can we please complete work items?
-
Hi all,
Here is a query which has bad performance on our Wikidata dataset:
It basically asks "How many people nominated for the Nobel Prize in Chemistry never won it?"
The data is loaded into a TDB2 database, and stats have been computed. Clearly, the first triple pattern returns way more results than the second:
TDB Stats part
so I initially thought TDB2 will reorder the BGP based on the stats.
But it didn't. The reason is the ARQ optimizer, which does what I guess is some kind of FILTER push or FILTER placement. The algebra shows this:
Before optimizations:
After optimizations:
I think this prevents the TDB optimizer at execution time from reordering the initial BGP, which does happen if I remove the FILTER expression from the query.
For testing I set
in the assembler file. In total this makes a runtime difference of 600s vs 1s, as the join 40 000 x 40 000 000 would indeed be the "better" direction.
Any ideas how to handle this? The ARQ optimizer doesn't know about statistics, and the TDB optimizer can't reorder.
Disabling the FILTER placement optimizations seems to be the wrong way, as I do not know all the consequences of doing so.
Thanks in advance for discussing.
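To make the join-direction argument concrete, here is a minimal Python sketch. It is not how TDB2 or ARQ work internally: the `Pattern` type, the `reorder_by_stats` and `outer_probes` helpers, and the pattern labels are hypothetical, and the cost model (one index probe per binding of the first pattern) is a deliberate simplification. The two counts are the 40 000 and 40 000 000 mentioned above.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    label: str              # hypothetical description of a triple pattern
    estimated_matches: int  # count that a statistics file might report


def reorder_by_stats(patterns: list[Pattern]) -> list[Pattern]:
    """Put the most selective pattern (fewest estimated matches) first."""
    return sorted(patterns, key=lambda p: p.estimated_matches)


def outer_probes(ordered: list[Pattern]) -> int:
    """Simplified cost model: an index nested-loop join issues one index
    probe into the remaining patterns per binding of the first pattern."""
    return ordered[0].estimated_matches


# Order as written in the query: the large type check comes first.
as_written = [
    Pattern("type check (human)", 40_000_000),
    Pattern("nominated for the prize", 40_000),
]
reordered = reorder_by_stats(as_written)

print("probes, order as written:", outer_probes(as_written))
print("probes, reordered by stats:", outer_probes(reordered))
print("ratio:", outer_probes(as_written) // outer_probes(reordered))
```

Under this model, starting from the selective pattern needs about a thousand times fewer probes, which points the same way as the observed 600s vs 1s gap without claiming to explain it exactly.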