The `LuceneEnv` and `LuceneQueryBuilder` components are rather messy, and it's often unclear how they should be used and what their methods were designed for. For instance, `searchConceptByConceptAccessionExact()` and similar `*Exact` methods were searching accessions by keyword, so not so exact. I've now fixed them, but the search is still case-insensitive, because the Lucene standard analyzer can't do case-sensitive indexing and searching, and it isn't easy to switch to another analyzer. Read on for details.
For the same reason, I've had to introduce an analyzer (`DEFAULTANALYZER`) that uses `PerFieldAnalyzerWrapper` to apply different analyzers to different fields: for example, concept class ID uses the keyword analyzer, while concept attribute value uses the standard analyzer. The rationale is that the ID fields are to be indexed and searched by full identity, while the others are meant for user free-text searches and hence are best served by the standard analyzer (ie, tokenisation, stop words, case insensitivity, etc). In fact, if fields like concept class ID or data source name are indexed and searched with the standard analyzer, we get a number of problems, like upper-case strings not working at all when saved as `StringField`, or unwanted substring matching (eg, "00633" matches both "00633" and "go 00633", which, in general, is wrong). Details are discussed here and some tests of mine are here.
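To make the substring-matching problem concrete, here is a minimal, dependency-free sketch. It is not Lucene code: `tokenise()` is only a rough stand-in for what a standard-style analyzer does (lower-casing, splitting on non-alphanumerics), and all class and method names are made up for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Plain-Java model of token-based vs whole-value matching (assumption: not the real Lucene behaviour in every detail). */
class IdMatchingSketch {

    /** Rough stand-in for a standard-style analyzer: lower-case, then split on non-alphanumerics. */
    static Set<String> tokenise(String value) {
        Set<String> tokens = new HashSet<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Token-style match: every query token must occur among the indexed tokens. */
    static boolean tokenMatch(String indexed, String query) {
        return tokenise(indexed).containsAll(tokenise(query));
    }

    /** Keyword-style match: the whole value must be identical, case included. */
    static boolean keywordMatch(String indexed, String query) {
        return indexed.equals(query);
    }

    public static void main(String[] args) {
        // Token matching finds "00633" inside "go 00633", which is wrong for IDs.
        System.out.println(tokenMatch("go 00633", "00633"));   // true
        // Whole-value matching only accepts the identical string.
        System.out.println(keywordMatch("go 00633", "00633")); // false
        System.out.println(keywordMatch("00633", "00633"));    // true
    }
}
```

This is the behaviour difference the keyword analyzer is meant to give the ID fields, while free-text fields keep the token-based matching.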
Switching the accession fields isn't so easy and I'll do it later. The problem is that they're saved with a field name like `ConceptAccession_<dataSourceId>`, eg, `ConceptAccession_GO`. This doesn't fit the way `PerFieldAnalyzerWrapper` works (ie, it uses a map of field name -> analyzer); plus, it doesn't seem to play well with un-tokenised fields that can be multi-valued.
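The field-name mismatch can be sketched without Lucene as follows. The enum stands in for the actual analyzers, and `ConceptClass_id` and `DataSource_id` are hypothetical field names (only the `ConceptAccession_` prefix comes from the real index). A prefix rule like `prefixAwareLookup()` is just one conceivable workaround, not what the code currently does, and it doesn't address the multi-value problem.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of a field name -> analyzer lookup (assumption: names are illustrative). */
class FieldAnalyzerLookupSketch {

    /** Stand-in for the two analyzers discussed above. */
    enum AnalyzerKind { KEYWORD, STANDARD }

    /** Prefix used for accession fields, eg, ConceptAccession_GO. */
    static final String ACCESSION_PREFIX = "ConceptAccession_";

    /** Exact field names mapped to an analyzer; both keys are hypothetical. */
    static final Map<String, AnalyzerKind> EXACT = new HashMap<>();
    static {
        EXACT.put("ConceptClass_id", AnalyzerKind.KEYWORD);
        EXACT.put("DataSource_id", AnalyzerKind.KEYWORD);
    }

    /** Map-only lookup: unknown field names fall back to the default analyzer. */
    static AnalyzerKind exactLookup(String field) {
        return EXACT.getOrDefault(field, AnalyzerKind.STANDARD);
    }

    /** Same lookup, plus a prefix rule for the dynamically named accession fields. */
    static AnalyzerKind prefixAwareLookup(String field) {
        if (field.startsWith(ACCESSION_PREFIX)) return AnalyzerKind.KEYWORD;
        return exactLookup(field);
    }

    public static void main(String[] args) {
        // The map can't know every data source ID in advance, so this falls back.
        System.out.println(exactLookup("ConceptAccession_GO"));       // STANDARD
        System.out.println(prefixAwareLookup("ConceptAccession_GO")); // KEYWORD
    }
}
```

With a map-only lookup, every `ConceptAccession_<dataSourceId>` name would have to be registered up front, which is why the accession fields don't fit this approach as they stand.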
As the last link suggests, the proper solution is to store separate documents for the 1-n accession values (ie, one Lucene document with concept ID + accession + data source for each accession, which might result in multiple documents of this type sharing the same concept ID, or even the same concept ID + data source).
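A rough sketch of that layout, again in plain Java rather than Lucene: each map models one small document, and the keys `conceptId`, `dataSource` and `accession` are hypothetical field names. The accession "00634" is an invented sample value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java model of "one document per accession" (assumption: field names are illustrative). */
class AccessionDocumentsSketch {

    /** Builds one flat record per accession; each String[] is {dataSourceId, accession}. */
    static List<Map<String, String>> toAccessionDocs(int conceptId, List<String[]> accessions) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String[] acc : accessions) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("conceptId", Integer.toString(conceptId));
            doc.put("dataSource", acc[0]);
            doc.put("accession", acc[1]);
            docs.add(doc);
        }
        return docs;
    }

    /** Exact, case-sensitive search on fixed field names; returns the matching concept IDs. */
    static Set<String> findConceptIds(List<Map<String, String>> docs, String dataSource, String accession) {
        Set<String> ids = new TreeSet<>();
        for (Map<String, String> doc : docs) {
            if (dataSource.equals(doc.get("dataSource")) && accession.equals(doc.get("accession"))) {
                ids.add(doc.get("conceptId"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String[]> accessions = new ArrayList<>();
        accessions.add(new String[] {"GO", "00633"});
        accessions.add(new String[] {"GO", "00634"}); // same concept ID + data source, second document
        List<Map<String, String>> docs = toAccessionDocs(1, accessions);
        System.out.println(docs.size());                          // 2
        System.out.println(findConceptIds(docs, "GO", "00633"));  // [1]
    }
}
```

Because the field names are now fixed, each of them could be given its own analyzer through a plain field name -> analyzer map, and no single field needs to hold multiple un-tokenised values.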
But even before that, it would be worth checking the latest fixes mentioned above. Regarding this:
Knetminer doesn't seem to be affected.
As for Ondex, there are a few plugins that use the accession search methods mentioned above, though most of them don't seem to be in use. These are:
- `decypher` module (`Blast`, `Decypher`, `Hmmer`, `Mapping`), this is in the `modules-opt` subtree
- `generic` module
- `relationneighbours/Filter`
- `net.sourceforge.ondex.mapping.accessionbased.Mapping` (seems no longer in use, replaced by `lowmemoryaccessionbased`)
- `net.sourceforge.ondex.mapping.lowmemoryaccessionbased.Mapping` (fixed)
- `go` module (I think it was replaced by the OWL parser).