Refactor search interface to connect to Solr using Django Haystack #674
Comments
Thanks for the proposal Jean. I share your concerns about the maintainability and security of this aspect of the platform, and DataMade's offer to split the cost is very generous. Off list, I'd find a time estimate a useful addition to this discussion.
We should remember that as well as being the main search function users rely on, Solr also does the heavy lifting on the command charts (see #416). We have a couple of other open issues on Solr, including #296 on adding search operators, and there is some distant chatter about enabling location-based search on our data.
We can't consider the site's own search function without also thinking about how users discover our material, and what preferences they may have for refining how they explore it. On that issue, we have also discussed improving how WWIC exposes results to external search engines (see #357).
So, we have a cluster of issues here about how users actually get at the information they are looking for, which we need to explore more. What's best for them? For example, would removing Solr and replacing it with some bare SQL, accompanied by a massive scrub down of the way results are shown and filtered/faceted, along with deeper search engine indexing, yield a better outcome for the user?
There are times I am not convinced that Solr is helping me out as much as it could. A good example of where Solr returns a sub-optimal result set is when we query for the unit name
It has also been on my mind that Solr could play a part in resolving our long-standing challenge with managing and querying geospatial information, but I am not knowledgeable enough on this yet.
So, to wrap up. My questions would be:
Thanks for the detailed reply Tom, you raise some excellent points. Here are some answers to your questions:
I think we can, but I think it will be orthogonal to this effort, which is more about improving the cleanliness of the code. I strongly agree with your broader point that the biggest user-facing issue with the search right now is that the way it retrieves results isn't always intuitive.
I'll send you a time estimate for this over email.
As I've scoped the issue, the user wouldn't see any changes. The end goal would be giving us more peace of mind and setting ourselves up so that we can move faster next time we want to spend time improving search.
I think the Solr integration is an important piece of the overall technical debt that's accumulating in the app as time goes on. If we're planning on rethinking the way we perform search in the app in a major way, it probably doesn't make sense for us to spend time making our current implementation more maintainable. But if we like the general architecture of search and want to continue to build on it in the coming years, I think we'll have to face this technical debt eventually.
I double-checked and confirmed that Haystack supports Solr's geospatial search, so we should be able to move forward with geo queries at a later date if we do this refactor.
In the time since we wrote the search interface and search indexing script for the app, DataMade has learned a lot about how best to integrate Django and Solr. (You can see our current best-practices doc here.) Most pertinently, we've standardized on Haystack as our go-to library for integrating the two services. Haystack provides a connection layer that automates much of the custom code that we wrote in the search interface and indexing script, and does so in a cleaner fashion that is easier to maintain.
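For readers who haven't used Haystack, here is a minimal sketch of what the connection layer looks like on the Django settings side. The engine path is Haystack's Solr backend; the URL and the core name `example_core` are placeholders for illustration, not this project's actual configuration.

```python
# Hypothetical Django settings fragment for a Haystack-to-Solr connection.
# The URL and core name below are assumptions, not values from this project.
HAYSTACK_CONNECTIONS = {
    "default": {
        # Haystack's bundled Solr backend (it uses pysolr under the hood).
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        # Placeholder address of a local Solr core.
        "URL": "http://127.0.0.1:8983/solr/example_core",
    },
}
```

With a connection like this in place, index definitions and queries go through Haystack rather than through hand-built requests to Solr.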
Beyond maintainability, I also worry about the security of our current custom implementation. According to our error logging, the search interface is by far the most popular target for bots scanning for security vulnerabilities (we get a couple of notifications per week showing scanners trying different permutations of query parameters). So far I don't have any reason to believe that there are vulnerabilities in our current code, but the fact that we have so much custom code that sends requests directly to Solr makes me concerned that our attack surface is larger than it would be with a framework like Haystack.
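To make the attack-surface concern concrete, here is a rough sketch of the kind of input handling that custom query-building code has to get right on its own. It escapes the characters that the Lucene/Solr standard query parser treats as syntax. The function name `escape_solr_term` is hypothetical and this is not the project's current code; it only illustrates the category of logic a framework would otherwise centralize.

```python
import re

# Characters and operators with special meaning in the Lucene/Solr
# standard query parser: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
_SOLR_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_solr_term(text: str) -> str:
    """Backslash-escape query syntax in user-supplied text (illustrative helper)."""
    return _SOLR_SPECIAL.sub(r"\\\1", text)


if __name__ == "__main__":
    # A scanner-style input that tries to inject a field query.
    print(escape_solr_term("name:* OR id:[1 TO 5]"))
```

Every code path that builds a query string by hand needs this sort of treatment, which is part of why a smaller amount of custom code is easier to reason about.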
We've been thinking about ways to onboard @beamalsky to the internals of this project while giving her some more experience with heavyweight search, and I think this particular maintenance task is a great candidate. The scope of the issue would be to preserve all existing features of the search interface, while swapping out the custom pysolr implementation for a Haystack connection layer. If this is something you're interested in doing @tlongers, we would be happy to cover half of the development cost.