ARQ optimizer vs TDB2 optimizer #1659

LorenzBuehmann · 2022-12-02T08:36:36Z

Hi all,

here is a query that performs badly on our Wikidata dataset:

PREFIX  wd:   <http://www.wikidata.org/entity/>
PREFIX  wdt:  <http://www.wikidata.org/prop/direct/>

SELECT  (COUNT(DISTINCT ?peo) AS ?result)
WHERE
  { ?peo  wdt:P31    wd:Q5 ; # all humans
          wdt:P1411  wd:Q44585 # award nominates
    FILTER NOT EXISTS { ?peo  wdt:P166  wd:Q44585 } # award not won
  }

It basically answers "How many people nominated for the Nobel Prize in Chemistry never won it?"

The data is loaded into a TDB2 database, and stats have been computed. Clearly, the first triple pattern returns way more results than the second:

TDB Stats part

(prefix ((wdt: <http://www.wikidata.org/prop/direct/>))
(stats
  (meta
    (timestamp "2022-10-20T09:56:42.767+00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
    (run@ "2022/10/20 09:56:42 UTC")
    (count 6586140827))

((VAR <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5>) 43061704)
(<http://www.wikidata.org/prop/direct/P1411> 45911)

so I initially thought TDB2 would reorder the BGP based on the stats.

But it didn't. The reason is the ARQ optimizer, which performs what I guess is some kind of FILTER pushdown, i.e. FILTER placement. The algebra shows this:

Before optimizations:

(base <http://example/base/>
  (prefix ((wd: <http://www.wikidata.org/entity/>)
           (wdt: <http://www.wikidata.org/prop/direct/>))
    (project (?result)
      (extend ((?result ?.0))
        (group () ((?.0 (count distinct ?peo)))
          (filter (notexists (bgp (triple ?peo wdt:P166 wd:Q44585)))
            (bgp
              (triple ?peo wdt:P31 wd:Q5)
              (triple ?peo wdt:P1411 wd:Q44585)
            )))))))

After optimizations:

(base <http://example/base/>
  (prefix ((wd: <http://www.wikidata.org/entity/>)
           (wdt: <http://www.wikidata.org/prop/direct/>))
    (project (?result)
      (extend ((?result ?.0))
        (group () ((?.0 (count distinct ?peo)))
          (sequence
            (filter (notexists (bgp (triple ?peo wdt:P166 wd:Q44585)))
              (bgp (triple ?peo wdt:P31 wd:Q5)))
            (bgp (triple ?peo wdt:P1411 wd:Q44585))))))))

I think this prevents the TDB optimizer from reordering the initial BGP at execution time. The reordering does happen if I remove the FILTER expression from the query.
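To illustrate why the split matters, here is a toy Python model. It is not Jena's actual implementation: it simply assumes a reorderer that sorts the triple patterns inside a single BGP by estimated match count and never moves patterns between BGPs. The function names and data layout are made up for the illustration; the counts are the ones from the stats excerpt above.

```python
# Toy model only: names and data layout are hypothetical, not Jena APIs.
# Estimated match counts from the stats excerpt. The P1411 figure is the
# predicate-level count, used here as the estimate for the whole pattern.
STATS = {
    ("?peo", "wdt:P31", "wd:Q5"): 43_061_704,
    ("?peo", "wdt:P1411", "wd:Q44585"): 45_911,
}

def reorder_bgp(bgp):
    """Sort the patterns of ONE BGP by estimated count, most selective first."""
    return sorted(bgp, key=lambda pattern: STATS[pattern])

def reorder_sequence(bgps):
    """Reorder inside each BGP, but never move patterns between BGPs."""
    return [reorder_bgp(bgp) for bgp in bgps]

humans = ("?peo", "wdt:P31", "wd:Q5")
nominees = ("?peo", "wdt:P1411", "wd:Q44585")

# Without filter placement: one BGP with two patterns, so reordering can act.
single_bgp = reorder_sequence([[humans, nominees]])

# With filter placement: two one-pattern BGPs, so nothing is left to reorder.
split_bgps = reorder_sequence([[humans], [nominees]])

print(single_bgp)
print(split_bgps)
```

In this model the single BGP ends up with the nominee pattern first, while the split form keeps the human pattern in front because each BGP holds only one pattern.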

For testing I set

ja:context [ ja:cxtName "arq:optFilterPlacement" ;  ja:cxtValue "false" ] ;

in the assembler file. In total this makes a runtime difference of 600s vs 1s, since joining from the roughly 40 000 nominee bindings into the roughly 40 000 000 human bindings is indeed the "better" direction.

Any ideas how to handle this? The ARQ optimizer doesn't know about the statistics, and the TDB optimizer can't reorder once the BGP has been split.
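A back-of-the-envelope sketch of why the direction matters, assuming an index nested-loop join in which every binding of the first pattern triggers one index lookup for the second. That execution model is an assumption for illustration, not something measured here. The sketch uses the exact counts from the stats excerpt rather than the rounded figures.

```python
# Rough cost model (assumption): one index probe per binding of the outer pattern.
HUMANS = 43_061_704    # (?peo wdt:P31 wd:Q5), from the stats excerpt
NOMINATIONS = 45_911   # wdt:P1411, predicate-level count from the stats excerpt

# Probes into the inner pattern for each join direction.
probes_humans_first = HUMANS
probes_nominations_first = NOMINATIONS

ratio = probes_humans_first / probes_nominations_first

print(f"humans first:      {probes_humans_first:,} probes")
print(f"nominations first: {probes_nominations_first:,} probes")
print(f"ratio: about {ratio:.0f}x")
```

Under this simplified model the gap is around three orders of magnitude, which is in the same rough range as the observed 600s vs 1s. Real timings also depend on caching and index layout.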

Disabling the FILTER placement optimizations seems to be the wrong way as I do not know all consequences for it.

Thanks in advance for discussing.

afs (Collaborator) · 2022-12-06T14:09:53Z

> Clearly

Nobel nominees are humans (RDFS domain). I don't think AI's are permitted.

So the first triple pattern isn't needed because it is a check that "human" has been explicitly declared. Data quality issue.

> Disabling the FILTER placement optimizations seems to be the wrong way as I do not know all consequences for it.

You can turn it off per endpoint, per dataset or server-wide. (Not per query because that exposes a DOS attack.)

It is a relatively new optimization, so it is likely that turning it off would have less serious impact.

Or rewrite in the style of a data check: reverse the triple patterns:

   {
       ?peo   wdt:P1411  wd:Q44585 .
       ?peo   wdt:P31    wd:Q5     .
       FILTER NOT EXISTS { ?peo  wdt:P166  wd:Q44585 }
   }

or be explicit:

   { {
       ?peo   wdt:P1411  wd:Q44585 . # award nominates
       FILTER NOT EXISTS { ?peo  wdt:P166  wd:Q44585 } # award not won
     }
     ?peo   wdt:P31    wd:Q5     . # all humans
   }

LorenzBuehmann (Author) · 2022-12-07T07:54:02Z

Thanks Andy, I agree with everything you're saying. The point is, the queries I report here are not "my" queries; others are asking me "why is it so slow". I know that I can try to reorder things or rewrite queries, but that needs an understanding of SPARQL as well as of the dataset statistics. In addition, the example query here is from a question answering challenge on Wikidata. I don't care about it myself, but people in our group are using those to evaluate their research. Indeed, they could rewrite all the queries in the benchmark, but they asked me "why ..." first. I already gave them the rewritten query when they asked, but you know, people are lazy and expect the system to do most if not all of the work ...

They asked me also about other queries like

SELECT DISTINCT ?result WHERE {
?film wdt:P1476 ?result ; # has title (this is the huge part with 43 190 482 bindings)
         wdt:P31/wdt:P279* wd:Q11424 ; # type or subtype film
         wdt:P179 wd:Q22092344 # part of Star Wars universe
}

Here the reason is not the ARQ optimizer. Instead, the TDB optimizer won't reorder the triple patterns because the second one contains a property path, so statistics aren't taken into account.

> Nobel nominees are humans (RDFS domain). I don't think AI's are permitted.

True, though Wikidata doesn't use RDFS at all, nor would humans be the domain of the property, which is just "nominated for". Still, I agree with you: we know the real-world domain and expect the data to reflect it.

Anyway, thanks for the feedback. I'll leave the optimizer as is for now.

afs (Collaborator) · 2022-12-08T18:56:42Z

> the point is

A specific transform for DBpedia service would be possible though I'd be interested and a bit surprised if turning the transform off always didn't work for you.

> people in our group are using those to evaluate their research

Please - there is only so much the project can do with the limited resources available. There have been several reports from you and colleagues. It is easy to raise an issue (though it has normally needed further refinement that could have been included earlier); it is harder to do anything about it in the timescale you seem to be seeking.

Try without the optimization step.

A research solution is a DBpedia-transform that is tuned initially by hand, but could be configured by machine learning.

As I have said elsewhere, I am concerned about solving issues that are due to other triplestores' current behaviour and also concerned about making use-case-specific changes to Jena that negatively impact the user community as a whole. That is what extensions are for.

Paths

That's a different situation.

As you know, there is #1629 (GH does not have "epics" but that is what it is).

There is upcoming work for 4.7.0, #1638, that is awaiting feedback as to whether it has negative effects. It does not address everything, but it should improve one of the cases you have raised.

Before moving on to another issue, can we please complete work items?
