Description
Hey! We have several different object types and a unified search across them - namely we have objects like:
- Workflows - with names, descriptions, folders
- Notebooks - with titles and content
- Actions - with just the action name
- Launch configs - with just the launch config name
- etc
Each of these has a different set of field weightings. For example, for workflows the fields in order of importance are name, description, and then the enclosing folder. We implement these weightings at query time with BoostQuery (snippet below).
```rust
// Add term queries for all words except the last one.
if words.len() > 1 {
    for word in &words[0..words.len() - 1] {
        for (field, weight) in self.weighted_search_fields.values() {
            let term = Term::from_field_text(*field, word);
            let term_query = build_term_query(term);
            let weighted_query = Box::new(BoostQuery::new(
                term_query,
                // Boost the term query by the field weight, normalized by the total weight so the final
                // score is in the range of roughly 0-5. Complex queries might have a score exceeding 5.
                *weight * SCORE_BOOST_FACTOR / self.normalizing_factor,
            ));
            subqueries.push((Occur::Should, weighted_query));
        }
    }
}
```
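To make the boost arithmetic concrete, here is a minimal, library-independent sketch of the normalization. It assumes `normalizing_factor` is the sum of the field weights (as the comment in the snippet suggests) and that `SCORE_BOOST_FACTOR` is 5.0. The helper `field_boosts` and the example weights are hypothetical and only for illustration.

```rust
/// Assumed value for illustration; the real constant may differ.
const SCORE_BOOST_FACTOR: f32 = 5.0;

/// Computes the boost for each field as
/// `weight * SCORE_BOOST_FACTOR / normalizing_factor`,
/// where `normalizing_factor` is assumed to be the sum of all field weights.
fn field_boosts(weights: &[(&str, f32)]) -> Vec<(String, f32)> {
    let normalizing_factor: f32 = weights.iter().map(|(_, w)| *w).sum();
    weights
        .iter()
        .map(|(field, w)| {
            (
                field.to_string(),
                *w * SCORE_BOOST_FACTOR / normalizing_factor,
            )
        })
        .collect()
}
```

With hypothetical workflow weights of 3, 2 and 1 for name, description and folder, the boosts sum to `SCORE_BOOST_FACTOR`. These boosts multiply the underlying term scores, which is why the final score is only roughly bounded.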
Currently, we've structured this as multiple Tantivy full-text searchers - one for each data source, where we define a schema for each object type. Then, when we have a search (the user enters a search term on the command palette), we run the search across these different searchers asynchronously, and return an aggregated ranked set of results.
However, we've seen that the number of threads we spin up scales proportionally with the number of data sources, which isn't great (related to #702).
An approach we're considering is the following:
- Define a unified schema with all possible fields from every object type, with no inherent weightings/boosts
  - Objects like Actions would just have empty fields for any that aren't relevant for that object type
- Extend the query-time piece to filter by object type first, and then use type-conditional BoostQuery clauses to account for the weights
This would result in a single searcher running async.
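As a rough illustration of the type-conditional part, the sketch below builds a library-independent "plan" with one group per object type. Each group stands for a required filter on the object type plus a set of boosted term clauses, normalized per type. Every name here (`ObjectType`, `weights_for`, `BoostedTerm`, `TypeGroup`, `build_plan`), the weights, and the value of `SCORE_BOOST_FACTOR` are hypothetical. For simplicity it treats every word as a plain term clause, unlike the snippet above, which handles the last word separately.

```rust
/// Assumed value for illustration; the real constant may differ.
const SCORE_BOOST_FACTOR: f32 = 5.0;

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum ObjectType {
    Workflow,
    Notebook,
    Action,
    LaunchConfig,
}

/// Hypothetical per-type field weights; fields not listed get no clause.
fn weights_for(t: ObjectType) -> &'static [(&'static str, f32)] {
    match t {
        ObjectType::Workflow => &[("name", 3.0), ("description", 2.0), ("folder", 1.0)],
        ObjectType::Notebook => &[("title", 2.0), ("content", 1.0)],
        ObjectType::Action => &[("name", 1.0)],
        ObjectType::LaunchConfig => &[("name", 1.0)],
    }
}

/// One boosted term clause: match `word` in `field`, scaled by `boost`.
#[derive(Debug, PartialEq)]
struct BoostedTerm {
    field: &'static str,
    word: String,
    boost: f32,
}

/// One group per object type: the document must have this type AND
/// match at least one of the boosted clauses in `any_of`.
#[derive(Debug, PartialEq)]
struct TypeGroup {
    object_type: ObjectType,
    any_of: Vec<BoostedTerm>,
}

/// Builds one group per requested type, normalizing boosts by that
/// type's total weight so each type's scores land in a similar range.
fn build_plan(words: &[&str], types: &[ObjectType]) -> Vec<TypeGroup> {
    let mut plan = Vec::new();
    for &object_type in types {
        let weights = weights_for(object_type);
        let total: f32 = weights.iter().map(|(_, w)| *w).sum();
        let mut any_of = Vec::new();
        for word in words {
            for &(field, weight) in weights {
                any_of.push(BoostedTerm {
                    field,
                    word: word.to_string(),
                    boost: weight * SCORE_BOOST_FACTOR / total,
                });
            }
        }
        plan.push(TypeGroup { object_type, any_of });
    }
    plan
}
```

When translating a plan like this into Tantivy queries, our understanding is that Should clauses placed next to a Must clause only contribute to scoring. The boosted clauses for a type would therefore probably need to live in their own nested boolean query that is itself required alongside the type filter, with the per-type groups combined as alternatives at the top level.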
Wanted to check if this is the recommended approach for this sort of search across different object types w/ different schemas? Thanks!