Best approach to handling searches across multiple object schemas #2665

@Advait-M

Hey! We have several different object types and a unified search across them - namely we have objects like:

  • Workflows - with names, descriptions, folders
  • Notebooks - with titles and content
  • Actions - with just the action name
  • Launch configs - with just the launch config name
  • etc

Each of these has a different set of field weightings. For workflows, for example, the fields in order of importance are name, description, and then the enclosing folder. We implement these weightings at query time with BoostQuery (snippet below).

        // Add term queries for all words except the last one.
        if words.len() > 1 {
            for word in &words[0..words.len() - 1] {
                for (field, weight) in self.weighted_search_fields.values() {
                    let term = Term::from_field_text(*field, word);
                    let term_query = build_term_query(term);
                    let weighted_query = Box::new(BoostQuery::new(
                        term_query,
                        // Boost the term query by the field weight, normalized by the total weight so the final
                        // score is in the range of roughly 0-5. Complex queries might have a score exceeding 5.
                        *weight * SCORE_BOOST_FACTOR / self.normalizing_factor,
                    ));
                    subqueries.push((Occur::Should, weighted_query));
                }
            }
        }
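To make the normalization in the snippet concrete, here is a minimal std-only sketch of the boost arithmetic. It rests on assumptions not stated in the issue: `SCORE_BOOST_FACTOR` is taken to be 5.0 (suggested by the "roughly 0-5" comment), the normalizing factor is taken to be the sum of the field weights, and `normalized_boosts` is a hypothetical helper name.

```rust
/// Assumed value; the issue only says scores land in "roughly 0-5".
const SCORE_BOOST_FACTOR: f32 = 5.0;

/// Hypothetical helper: computes `weight * SCORE_BOOST_FACTOR / total_weight`
/// for every field, where `total_weight` is assumed to be the sum of weights.
fn normalized_boosts(weights: &[(&str, f32)]) -> Vec<(String, f32)> {
    let total: f32 = weights.iter().map(|(_, w)| *w).sum();
    if total <= 0.0 {
        // No usable weights, so there is nothing to boost.
        return Vec::new();
    }
    weights
        .iter()
        .map(|(field, w)| (field.to_string(), w * SCORE_BOOST_FACTOR / total))
        .collect()
}
```

With illustrative weights of 3, 2, and 1 for name, description, and folder, the boosts come out to 2.5, about 1.67, and about 0.83. They sum to 5, so a word that matches every field once contributes at most the full boost factor before Tantivy's own term scoring is applied.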

Currently, we've structured this as multiple Tantivy full-text searchers, one per data source, each with its own schema for that object type. When the user enters a search term in the command palette, we run the search across these searchers asynchronously and return an aggregated, ranked set of results.

However, we've seen that the number of threads we spin up scales proportionally with the number of data sources, which isn't great (related to #702).

An approach we're considering is the following:

  • Define a unified schema with all possible fields from every object type, with no inherent weightings/boosts
  • Objects like Actions would just leave empty any fields that aren't relevant to that object type
  • Extend the query-time logic to filter by object type first, and then use type-conditional BoostQuery clauses to account for the weights
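The query shape described in these bullets could look roughly like the sketch below. It is a simplified std-only model, not Tantivy's API: the `Query` and `Occur` enums only mirror the idea of Tantivy's boolean, term, and boost queries, and the `object_type` field, `TypeWeights`, and `build_unified_query` are hypothetical names. Weights are assumed to be positive and are normalized per type.

```rust
/// Simplified stand-ins for a boolean query tree (not Tantivy's own types).
#[derive(Debug, Clone, PartialEq)]
enum Occur { Must, Should }

#[derive(Debug, Clone, PartialEq)]
enum Query {
    Term { field: String, text: String },
    Boost { inner: Box<Query>, factor: f32 },
    Bool(Vec<(Occur, Query)>),
}

/// Hypothetical per-type weight table for the unified schema.
struct TypeWeights {
    object_type: &'static str,
    fields: Vec<(&'static str, f32)>,
}

/// Builds one clause per object type: a required type filter plus a
/// required group of boosted field matches, weighted for that type only.
fn build_unified_query(word: &str, types: &[TypeWeights]) -> Query {
    let per_type: Vec<(Occur, Query)> = types
        .iter()
        .filter(|t| !t.fields.is_empty())
        .map(|t| {
            // Assumes positive weights, so `total` is non-zero.
            let total: f32 = t.fields.iter().map(|(_, w)| *w).sum();
            let field_clauses: Vec<(Occur, Query)> = t
                .fields
                .iter()
                .map(|(field, weight)| {
                    let term = Query::Term {
                        field: field.to_string(),
                        text: word.to_string(),
                    };
                    let boosted = Query::Boost {
                        inner: Box::new(term),
                        factor: *weight / total,
                    };
                    (Occur::Should, boosted)
                })
                .collect();
            let type_filter = Query::Term {
                field: "object_type".to_string(),
                text: t.object_type.to_string(),
            };
            let clause = Query::Bool(vec![
                (Occur::Must, type_filter),
                (Occur::Must, Query::Bool(field_clauses)),
            ]);
            (Occur::Should, clause)
        })
        .collect();
    Query::Bool(per_type)
}
```

Because the type filter is a required clause, a field shared across types (such as a name field) can carry a different weight for each object type without the boosts leaking between types.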

This would result in a single searcher running async.

Is this the recommended approach for searching across different object types with different schemas? Thanks!
