Refactor search interface to connect to Solr using Django Haystack #674

Closed
jeancochrane opened this issue Mar 24, 2020 · 3 comments

Comments

@jeancochrane

In the time since we wrote the search interface and search indexing script for the app, DataMade has learned a lot about how best to integrate Django and Solr. (You can see our current best-practices doc here.) Most pertinently, we've standardized on Haystack as our go-to library for integrating the two services. Haystack provides a connection layer that automates much of the custom code that we wrote in the search interface and indexing script, and does so in a cleaner fashion that is easier to maintain.

Beyond maintainability, I also worry about the security of our current custom implementation. According to our error logging, the search interface is by far the most popular target for bots scanning for security vulnerabilities (we get a couple of notifications per week showing scanners trying different permutations of query parameters). So far I have no reason to believe that there are vulnerabilities in our current code, but because so much of our custom code sends requests directly to Solr, I'm concerned that our attack surface is larger than it would be with a framework like Haystack.
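To make the attack-surface concern concrete, here is a minimal sketch of the kind of input handling that custom Solr code has to get right by hand. It escapes the characters that Lucene's standard query syntax treats as operators. The function name `escape_solr_term` is hypothetical and is not taken from this project's code.

```python
import re

# Characters and operator pairs with special meaning in Lucene/Solr
# query syntax: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
_SPECIAL = re.compile(r'(&&|\|\||[+\-!(){}\[\]^"~*?:\\/])')


def escape_solr_term(text: str) -> str:
    """Prefix each special token with a backslash so Solr reads it literally.

    Hypothetical helper for illustration only.
    """
    return _SPECIAL.sub(lambda match: "\\" + match.group(0), text)


if __name__ == "__main__":
    # A scanner-style input that tries to inject a field query.
    print(escape_solr_term("name:* OR id:[0 TO 9]"))
```

Escaping alone does not cover everything (for example, it does not neutralise bare `AND`/`OR` keywords), which is part of the argument for delegating query construction to a maintained library.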

We've been thinking about ways to onboard @beamalsky to the internals of this project while giving her some more experience with heavyweight search, and I think this particular maintenance task is a great candidate. The scope of the issue would be to preserve all existing features of the search interface, while swapping out the custom pysolr implementation for a Haystack connection layer. If this is something you're interested in doing @tlongers, we would be happy to cover half of the development cost.
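For a sense of what the swap looks like, Haystack declares its Solr connection in Django settings instead of in hand-written request code. The following is a minimal sketch: the host, port, and the core name `wwic` are placeholder assumptions, not this project's real values.

```python
# settings.py (fragment)
# Assumed values: a Solr instance on localhost with a core named "wwic".
HAYSTACK_CONNECTIONS = {
    "default": {
        # Haystack's built-in Solr backend, which wraps pysolr internally.
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        # URL of the Solr core that Haystack should query and index into.
        "URL": "http://127.0.0.1:8983/solr/wwic",
    },
}
```

With a connection like this in place, indexing and querying would go through Haystack's `SearchIndex` and `SearchQuerySet` classes instead of direct pysolr calls.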

@tlongers
Member

tlongers commented Mar 24, 2020

Thanks for the proposal Jean. I share your concerns about maintainability and security of this aspect of the platform, and DataMade's offer to split the cost is very generous. Off list, I'd find a time estimate a useful addition to this discussion.

We should remember that, as well as being the main search function that users rely on, Solr also does the heavy lifting on the command charts (see #416). We have a couple of other open issues on Solr, including #296 on adding search operators, and there is some distant chatter about enabling location-based search on our data. We can't consider the site's own search function without also thinking about how users discover our material, and what preferences they may have for refining how they explore it. On that issue, we have also discussed improving how WWIC exposes results to external search engines (see #357).

So, we have a cluster of issues here about how users actually get at the information they are looking for, which we need to explore more. What's best for them? For example, would removing Solr and replacing it with some bare SQL, accompanied by a massive scrub down of the way results are shown and filtered/faceted and by deeper search engine indexing, yield a better outcome for the user?

There are times I am not convinced that Solr is helping me out as much as it could. A good example of where Solr returns a sub-optimal result set is when we query for the unit name 44 Zona, for which the best hit would be 44 Zona Militar. Solr returns a subordinate of 44 Zona Militar as the first result, but the 44 Zona Militar unit record itself appears ~230 rows down (page 4, if you paginate at 50/page). In using WWIC, I feel I come across stuff like this regularly and perhaps have just got used to it. Can we improve the accuracy and relevance of the results returned by Solr, which is where the user comes into contact with this pretty hefty component of WWIC? We looked at this before (see #272) in the early days but have not revisited it much since.
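One common way to approach this kind of relevance tuning in Solr is the `edismax` query parser, where `qf` lists the fields to search and `pf` adds a boost when the query terms appear together as a phrase. The sketch below only builds the request parameters. The field name `name_text` and the boost value are assumptions for illustration, and none of this has been tested against WWIC data, so it may not change the ranking in the 44 Zona case (subordinate unit names may contain the same phrase).

```python
def build_search_params(user_query: str, rows: int = 50) -> dict:
    """Build Solr request parameters that favour phrase matches.

    Hypothetical helper; `name_text` is an assumed field name.
    """
    return {
        "q": user_query,
        "defType": "edismax",   # extended DisMax query parser
        "qf": "name_text",      # fields searched for the individual terms
        "pf": "name_text^10",   # extra boost when the terms match as a phrase
        "rows": rows,           # page size, matching the 50/page example
    }


if __name__ == "__main__":
    print(build_search_params("44 Zona"))
```

A dictionary like this could be passed as keyword arguments to whichever Solr client is in use; comparing the resulting order with and without the `pf` entry would show whether the boost helps.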

It has also been on my mind that Solr could play a part in resolving our long-standing challenge of managing and querying geospatial information, but I am not knowledgeable enough on this yet.

So, to wrap up. My questions would be:

  • what's the investment?
  • what does the user get out of it?
  • how does a better Solr integration help us with other challenges that are on our plate?

@jeancochrane
Author

Thanks for the detailed reply Tom, you raise some excellent points. Here are some answers to your questions:

Can we improve the accuracy and relevance of the results returned by Solr, which is where the user comes into contact with this pretty hefty component of WWIC?

I think we can, but I think it will be orthogonal to this effort, which is more about improving the cleanliness of the code. I strongly agree with your broader point that the biggest user-facing issue with the search right now is that the way it retrieves results isn't always intuitive.

what's the investment?

I'll send you a time estimate for this over email.

what does the user get out of it?

As I've scoped the issue, the user wouldn't see any changes. The end goal would be giving us more peace of mind and setting ourselves up so that we can move faster next time we want to spend time improving search.

how does a better Solr integration help us with other challenges that are on our plate?

I think the Solr integration is an important piece of the overall technical debt that's accumulating in the app as time goes on. If we're planning on rethinking the way we perform search in the app in a major way, it probably doesn't make sense for us to spend time making our current implementation more maintainable. But if we like the general architecture of search and want to continue to build on it in the coming years, I think we'll have to face this technical debt eventually.

@jeancochrane jeancochrane added this to the 2021 maintenance milestone Nov 12, 2020
@jeancochrane
Author

I double-checked and confirmed that Haystack supports Solr's geospatial search, so we should be able to move forward with geo queries at a later date if we do this refactor.
