
Investigate dedicated full-text search with e.g. Meilisearch or Elasticsearch #3713

Closed
4 tasks done
poVoq opened this issue Jul 24, 2023 · 16 comments
Labels
area: search enhancement New feature or request

Comments

@poVoq

poVoq commented Jul 24, 2023

Requirements

  • Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a feature request? Do not put multiple feature requests in one issue.
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Is your proposal related to a problem?

The search in Lemmy could be much improved by utilizing an existing search middleware that indexes in memory. This is common practice in other Fediverse software, and as an optional feature it would be useful for bigger instances.

Describe the solution you'd like.

Elasticsearch is the most commonly supported option, but its fork OpenSearch or the alternative Meilisearch would also be good choices.

Meilisearch already has an easy-to-use Rust integration: https://github.com/meilisearch/meilisearch-rust

Describe alternatives you've considered.

n/a

Additional context

No response

@poVoq poVoq added the enhancement New feature or request label Jul 24, 2023
@phiresky
Collaborator

Probably better to start by using PostgreSQL FTS and only move to more complex solutions if that actually turns out not to be enough.

@marsara9
Contributor

marsara9 commented Jul 25, 2023

For the short term there's always my project: https://www.search-lemmy.com which uses postgresql FTS just like @phiresky suggests.

It only supports posts and communities at the moment, but more will come in the future.

Long term, @phiresky, feel free to steal the queries from my project. In theory Lemmy would just need an additional computed column on each object to store the tsvector data, and then you'd query against that.

@poVoq
Author

poVoq commented Jul 25, 2023

The main postgres database already seems to be a bottleneck, and search is a database-heavy operation.

Offloading that to a separate system would benefit larger instances a lot, I think, and Meilisearch comes with quite a few additional benefits like fuzzy search and spelling-error correction.

@marsara9
Contributor

marsara9 commented Jul 25, 2023

@poVoq if you think this is best kept as a separate service, can you raise an issue on https://www.github.com/marsara9/lemmy-search so I can explore whether Meilisearch might be a better fit?

I assume that's what you're suggesting with "a separate system"?

@phiresky
Collaborator

> The main postgres database seems to be already a bottleneck

The database is not a bottleneck currently - poorly written and redundant queries are a bottleneck. We're far from the scalability limits of PostgreSQL.

@poVoq
Author

poVoq commented Jul 25, 2023

@marsara9 no, I meant a separate system interfacing with Lemmy directly similar to how Pict-rs is used for media, which is what my original proposal is about.

@marsara9
Contributor

Gotcha. Technically the project I'm working on can run alongside Lemmy itself. I'd just need to redo the crawler to use the existing DB rather than relying on API calls to fetch the data.

Assuming you use the existing DB, again you just need a new column on each object type to store the tsvector data that you ultimately query against. Query times this way are also sub-second. You can test that out on my website; just make sure not to apply any filters (those will slow down the query a bit).

The other catch is that this can only be a local search. If a remote fetch is required, the query time will go up substantially.

@phiresky
Collaborator

Probably don't need new columns at all for basic search; just an index `ON to_tsvector(body)` is enough.

@marsara9
Contributor

I tried that in my project and performance was abysmal. Once I added a computed column, I was back to less than a second for all queries (that didn't involve one of my custom filters).

@phiresky
Collaborator

phiresky commented Jul 25, 2023

That doesn't really make sense; performance should be the same for searches. The index stores the computed value just as if you had added a computed column; only INSERTs and UPDATEs will be slower. Did you check the query plans? Maybe your expression differed between the index and what you actually searched for.

The docs also mention you must pass two arguments to to_tsvector for it to work: "Notice that the 2-argument version of to_tsvector is used. Only text search functions that specify a configuration name can be used in expression indexes (Section 11.7)."

@phiresky
Collaborator

Just tried it with:

```sql
create index on comment using gin(to_tsvector('english', content));
explain analyze select * from comment where to_tsvector('english', content) @@ to_tsquery('foo & bar');
```

Full-text searches take 0.5 ms on a production instance with 700k comments.

@marsara9
Contributor

Yeah, I probably had something misconfigured the first time I tried setting it up.

But I'm currently using

```sql
ALTER TABLE posts
    ADD COLUMN IF NOT EXISTS com_search TSVECTOR
    GENERATED ALWAYS AS (
        to_tsvector('english', "name") || ' ' || to_tsvector('english', coalesce(body, ''))
    ) STORED;
```

and

```sql
CREATE INDEX IF NOT EXISTS idx_search ON posts USING GIN(com_search);
```

And it's working beautifully, so no reason to mess with it at the moment.
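For reference, a search against a generated column like this might look as follows (a sketch only; the `posts` and `com_search` names follow the snippet above, and `websearch_to_tsquery`, available since PostgreSQL 11, is just one way to parse free-form user input):

```sql
-- Sketch: query the generated tsvector column through the GIN index.
-- websearch_to_tsquery('english', 'foo bar') turns user input into 'foo' & 'bar'.
SELECT id, "name"
FROM posts
WHERE com_search @@ websearch_to_tsquery('english', 'foo bar')
ORDER BY ts_rank(com_search, websearch_to_tsquery('english', 'foo bar')) DESC
LIMIT 20;
```

`ts_rank` orders results by relevance; for plain existence checks the `WHERE` clause alone is enough and stays index-only.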

@phiresky phiresky changed the title Optional full-text search with Meilisearch or Elasticsearch Investigate dedicated full-text search with e.g. Meilisearch or Elasticsearch Jul 27, 2023
@ancientmarinerdev

ancientmarinerdev commented Jul 27, 2023

This is a critical point that I made on a PR, and probably should have raised in an issue. IMHO this is critical in the medium to long term, once efficiencies and bottlenecks have been tightened up. I've copied across a few of my comments.

#3719 (comment)

"Should search query the DB directly? Shouldn't something like Elasticsearch be considered, so it can sit in front of or alongside the persistence layer? It might open more options for caching also. If some common queries can be cached, load could be reduced, even if it is a very short-lived cache. The more elaborate the search, the less likely there is to be a hit, but splitting queries out, or any technology that can help support that, might cut a significant load from the DB, which would help massively with scalability. Someone will inevitably try expensive queries to kill servers. Being defensive here could be important.

There is probably some investigation that could go into this, such as types of queries, replication of data, etc. Community names could probably be quite an easy cache, as searching them is a commonly undertaken activity."

#3719 (comment)

"It would add complexity, that is something I am not going to disagree with, but I don't think it adds unjustifiable complexity. Most websites running at country scale will use Lucene, Solr, or more recently Elasticsearch to cover search. Taking load off the database is critical, because all extra concurrent load makes all transactions slower, until it gets to the point where it buckles. Even if it doesn't buckle, the delayed response times impact users and their experience, and eventually doubts start to build up about whether this is good or reliable.

I suggested the following because, from what I have seen, these technologies are favoured at scale. I don't know of any large website that allows database searching without any form of caching."

"Most top websites, when accessing immutable content, will try to cache first to cut load. If a query is run more than 96 times a day (i.e. more than once per 15-minute window, assuming the queries are evenly spread), a 15-minute cache will reduce the amount of work: you return a cached result rather than doing the same computation again. Yes, the data can be a bit stale, but who needs search data to be that real-time? Even an hour-long cache is hardly an issue from a search perspective.

In tech, it's important to use the best tool for the job; it isn't always advisable to stick with simple stacks when the demands are greater. Over the last few weekends, there have been bad actors really testing the limits of Lemmy, and they seem quite motivated. By allowing search straight onto the DB, you're putting the DB in the hands of bad actors, which is a very risky move. So far, it's not going smoothly. They're going to keep probing and poking where there are weaknesses."

@WayneSheppard

Is there any evidence that searching accounts for more than 1% of the load on the DB? 1000 searches per day is nothing compared to 1000 inserts per second from federation. IMO, a better idea of the performance impact would help prioritize this request.

@Nutomic
Member

Nutomic commented Oct 2, 2023

This is unnecessary for now.

@Nutomic Nutomic closed this as completed Oct 2, 2023
@codenyte

Perhaps this could get implemented as a plugin, using the new plugin API (#4695)
