Lucene module, search by accession and other issues #24

Open
marco-brandizi opened this issue Jun 5, 2020 · 0 comments

The LuceneEnv and LuceneQueryBuilder components are rather messy, and it's often unclear how they should be used and what their methods were designed for. For instance, searchConceptByConceptAccessionExact() and the similar *Exact methods were searching accessions by keywords, so they weren't really exact. I've now fixed them, but the search is still case-insensitive, because the Lucene standard analyzer can't deal with case-sensitive indexing and searching, and it isn't easy to switch to another analyzer (read on for details).

For the same reason, I've had to introduce an analyzer (DEFAULTANALYZER) that uses PerFieldAnalyzerWrapper to apply different analyzers to different fields: fields like concept class ID use the keyword analyzer, while fields like concept attribute value use the standard analyzer. The rationale is that ID fields are to be indexed and searched with a full-identity criterion, while the others are dedicated to user free-text searches and hence are best served by the standard analyzer (ie, tokenisation, stop words, case insensitivity, etc). In fact, if fields like Concept Class ID or data source name are indexed and searched with the standard analyzer, we get a number of problems, like upper-case strings not working at all when saved as StringField, or unwanted substring matching (eg, "00633" matches both "00633" and "go 00633", which, in general, is wrong). Details are discussed here and some tests of mine are here.
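
To make the "00633" example concrete, here is a minimal sketch in plain Java. It does not use Lucene: tokenise() is only a crude stand-in I'm assuming for what the standard analyzer does (lower-casing and splitting on non-alphanumerics), and the class and method names are hypothetical, not part of Ondex.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration, not Lucene and not Ondex code.
class AccessionMatchSketch {

    // Crude stand-in for standard-analyzer tokenisation (an assumption):
    // lower-case the value and split it on runs of non-alphanumeric characters.
    static List<String> tokenise(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+"));
    }

    // Tokenised matching: a hit as soon as any token of the stored value
    // equals the (lower-cased) query.
    static boolean tokenisedMatch(String stored, String query) {
        return tokenise(stored).contains(query.toLowerCase(Locale.ROOT));
    }

    // Keyword-style matching: the whole stored value must equal the query,
    // including its case.
    static boolean keywordMatch(String stored, String query) {
        return stored.equals(query);
    }
}
```

With this sketch, tokenisedMatch("go 00633", "00633") is a hit, which is the unwanted behaviour described above, while keywordMatch("go 00633", "00633") is not.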

Switching the accession fields in the same way isn't so easy and I'll do it later. The problem with them is that they're saved with a field name like ConceptAccession_<dataSourceId>, eg, ConceptAccession_GO. This doesn't fit the way PerFieldAnalyzerWrapper works (ie, it uses a map of field name -> analyzer); plus, it doesn't seem to play well with un-tokenised fields that can be multi-valued.
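
The mismatch can be sketched in plain Java, without Lucene. Analyzers are represented here just by their names as strings, and the field name "ConceptClassId" as well as the class and method names are hypothetical. An exact-name map has no entry for a field whose name embeds the data source ID, so such fields silently fall back to the default; something prefix-aware would be needed instead.

```java
import java.util.Map;

// Hypothetical illustration: analyzers are represented by plain strings.
class FieldAnalyzerLookupSketch {

    static final String ACCESSION_PREFIX = "ConceptAccession_";

    // Exact-name lookup, ie, what a fixed "field name -> analyzer" map gives.
    static String exactLookup(Map<String, String> analyzers, String field, String fallback) {
        return analyzers.getOrDefault(field, fallback);
    }

    // Prefix-aware lookup: any ConceptAccession_<dataSourceId> field is
    // treated as a keyword field, everything else uses the exact-name map.
    static String prefixAwareLookup(Map<String, String> analyzers, String field, String fallback) {
        if (field.startsWith(ACCESSION_PREFIX)) {
            return "keyword";
        }
        return exactLookup(analyzers, field, fallback);
    }
}
```

With a map that only knows "ConceptClassId", exactLookup() returns the fallback for ConceptAccession_GO, whereas prefixAwareLookup() returns "keyword" for it.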

As the last link above suggests, the proper solution is to store separate documents for the 1-n accession values (ie, one Lucene document with concept ID + accession + data source for each accession, which might result in multiple documents of this type sharing the same concept ID, or even the same concept ID + data source).
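
As a rough sketch of that layout, in plain Java rather than actual Lucene documents: AccessionDoc is a hypothetical stand-in for one such document, and toDocs() flattens the accessions of a concept into one entry per accession. With fixed field names like these, each one could be given its own analyzer, and an exact search becomes a plain equality check on data source + accession.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the one-document-per-accession layout.
class AccessionDocSketch {

    // Stand-in for one Lucene document: concept ID + data source + accession.
    static final class AccessionDoc {
        final String conceptId;
        final String dataSource;
        final String accession;

        AccessionDoc(String conceptId, String dataSource, String accession) {
            this.conceptId = conceptId;
            this.dataSource = dataSource;
            this.accession = accession;
        }
    }

    // Flattens "data source -> accessions" into one document per accession.
    // Several documents may share the same concept ID, or even the same
    // concept ID + data source.
    static List<AccessionDoc> toDocs(String conceptId, Map<String, List<String>> accessionsBySource) {
        List<AccessionDoc> docs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : accessionsBySource.entrySet()) {
            for (String accession : entry.getValue()) {
                docs.add(new AccessionDoc(conceptId, entry.getKey(), accession));
            }
        }
        return docs;
    }

    // Exact (whole-value, case-sensitive) search on data source + accession.
    static boolean hasExactAccession(List<AccessionDoc> docs, String dataSource, String accession) {
        for (AccessionDoc doc : docs) {
            if (doc.dataSource.equals(dataSource) && doc.accession.equals(accession)) {
                return true;
            }
        }
        return false;
    }
}
```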

But even before that, it would be worth checking the latest fixes mentioned above. Regarding this:

Knetminer doesn't seem to be affected.
As for Ondex, there are a couple of plugins that use the accession search methods mentioned above, but they don't seem to be in use. These are:

  • decypher module (Blast, Decypher, Hmmer, Mapping), this is in the modules-opt subtree
  • generic module
    • relationneighbours/Filter
    • net.sourceforge.ondex.mapping.accessionbased.Mapping (seems no longer in use, replaced by lowmemoryaccessionbased)
    • net.sourceforge.ondex.mapping.lowmemoryaccessionbased.Mapping (fixed)
  • Mappers in go module (I think it was replaced by the OWL parser).