Benchmarks? #5
Apologies, I just saw this ticket. Well, what would you be interested in? Indexing time? Search time? Etc.
I would be interested in both indexing and search time side by side (https://sematext.com/blog/2015/05/27/side-by-side-with-elasticsearch-and-solr-performance-and-scalability). Please include Bleve (https://github.com/blevesearch/bleve) in the benchmark, and put them alongside Elasticsearch and Solr in the table of features (https://sematext.com/blog/2017/06/19/solr-vs-elasticsearch-differences) if possible.
The guy above is asking too much, imho. I'm only interested (for now) in an in-memory search benchmark compared to Lucene + tantivy, to see the C++ difference.
For what it's worth, you may want to check https://www.bestprice.gr/ -- Trinity replaced the previous-generation Trinity tech there a few days ago. It's in Greek, but you may get a sense of its speed and accuracy. We'll see to those benchmarks; we don't actively use Lucene, so we'd need to build whatever's required to run them. Our ops folks will look into it. @ddorian Please note that Trinity supports any configuration and arrangement of sources -- in-memory, on-disk, whatever. As long as you provide the implementation, it doesn't care. It's very flexible that way: you could use it to build e.g. a Gmail clone where you rely on both on-disk segments and in-memory segments, periodically flushing (or not), with separate segments/indices for e.g. the is:starred operator, where you want a flag set/unset per message but don't want to re-index the whole message every time.
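To make the pluggable-sources idea concrete, here is a minimal sketch. The names (`IndexSource`, `InMemorySource`, `search`) are illustrative, not Trinity's actual API: a query runs against whatever mix of sources the application registers, e.g. on-disk segments plus a live in-memory one.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical source abstraction: the engine only needs a way to ask
// each source for the documents matching a term; storage is up to you.
struct IndexSource {
    virtual ~IndexSource() = default;
    // Document IDs (ascending) that contain `term` in this source.
    virtual std::vector<uint32_t> postings(const std::string &term) const = 0;
};

// One possible implementation: a live, mutable in-memory segment.
struct InMemorySource final : IndexSource {
    std::map<std::string, std::set<uint32_t>> index;

    void insert(uint32_t docId, const std::string &term) { index[term].insert(docId); }

    std::vector<uint32_t> postings(const std::string &term) const override {
        auto it = index.find(term);
        if (it == index.end()) return {};
        return {it->second.begin(), it->second.end()};
    }
};

// A query executes against every registered source and merges candidates,
// so on-disk segments and in-memory segments can coexist transparently.
std::vector<uint32_t> search(const std::vector<std::unique_ptr<IndexSource>> &sources,
                             const std::string &term) {
    std::set<uint32_t> merged;
    for (const auto &src : sources)
        for (uint32_t id : src->postings(term)) merged.insert(id);
    return {merged.begin(), merged.end()};
}
```

An `is:starred`-style flag index would just be another small source in that vector, re-indexed independently of the message bodies.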
@markpapadakis Was it worth it, building a custom library? Do you have replication/sharding on the site (you might want to look at netflix/dynomite or Redis Cluster/modules if not)? I'm only interested in seeing the difference for fully in-memory: Java vs. C++ (with SIMD etc.) vs. Rust (tantivy), also based on this blog post: http://blog.mikemccandless.com/2013/06/screaming-fast-lucene-searches-using-c.html
@ddorian Yes, it was absolutely worth it. It took maybe 2-3 months for Trinity, and maybe another month for integrating it into BestPrice.gr (and soon, elsewhere) -- which includes figuring out the kinks and building a powerful query rewrite engine (it uses Trinity's queries_rewrite.h functionality). Re: replication, for BestPrice specifically, we flush segments, periodically merge them, and just sync them over to the various nodes -- but that's because it was a very simple design and we didn't need anything else at the time, though it should be easy to implement most if not any design. Other than explicit use of SIMD in Trinity, the compiler makes use of SIMD instructions -- it certainly helps. Trinity also makes use of prefetching, which provided a useful boost as well. Beyond that, if there is ever a need for it (it's very fast as it is, so there's no pressing need to do it now), we will just compile the resulting execution tree (a query is compiled to an execution tree/plan, which is then executed) to machine code and execute it directly (like Bing does). It shouldn't be too hard.
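The execution-tree idea can be sketched roughly like this. The node types below are hypothetical, not Trinity's actual plan nodes: a parsed query becomes a tree of operators, which an interpreter then walks per candidate document (the step that could, in principle, be replaced by compiled machine code).

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Illustrative execution-tree nodes: a query like
// (apple AND pie) OR cake compiles to a tree of these.
struct Node {
    virtual ~Node() = default;
    // Interpret this subtree against one document's term set.
    virtual bool matches(const std::set<std::string> &docTerms) const = 0;
};

struct Term final : Node {
    std::string t;
    explicit Term(std::string s) : t(std::move(s)) {}
    bool matches(const std::set<std::string> &d) const override { return d.count(t) != 0; }
};

struct And final : Node {
    std::unique_ptr<Node> l, r;
    And(std::unique_ptr<Node> a, std::unique_ptr<Node> b) : l(std::move(a)), r(std::move(b)) {}
    bool matches(const std::set<std::string> &d) const override {
        return l->matches(d) && r->matches(d);
    }
};

struct Or final : Node {
    std::unique_ptr<Node> l, r;
    Or(std::unique_ptr<Node> a, std::unique_ptr<Node> b) : l(std::move(a)), r(std::move(b)) {}
    bool matches(const std::set<std::string> &d) const override {
        return l->matches(d) || r->matches(d);
    }
};
```

JIT-compiling such a tree would mean emitting straight-line code for the `&&`/`||` structure instead of paying a virtual call per node.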
@ddorian Also, re: that blog post, there are 2 codecs included in Trinity. One of them is based on Lucene's default codec -- more or less the same design, with a few changes. It uses FastPFOR, which deals with blocks of integers and relies on SIMD instructions for encoding/decoding (Lucene uses a PFOR variant itself). It's really, really fast. (You can also use MaskedVByte instead -- see
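For context on why block codecs like FastPFOR work so well here: posting lists store ascending document IDs, so they are first turned into small deltas ("gaps") before bit-packing, and small gaps fit in very few bits. A minimal sketch of just that transform (illustrative, not Trinity's code; the actual bit-packing step is omitted):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Turn ascending document IDs into gaps: {100, 103, 150} -> {100, 3, 47}.
// The gaps are small, so a bit-packing codec can store them compactly.
std::vector<uint32_t> toGaps(const std::vector<uint32_t> &docIds) {
    std::vector<uint32_t> gaps;
    uint32_t prev = 0;
    for (uint32_t id : docIds) {
        gaps.push_back(id - prev); // assumes strictly ascending IDs
        prev = id;
    }
    return gaps;
}

// Inverse transform: prefix-sum the gaps back into document IDs.
std::vector<uint32_t> fromGaps(const std::vector<uint32_t> &gaps) {
    std::vector<uint32_t> ids;
    uint32_t acc = 0;
    for (uint32_t g : gaps) {
        acc += g;
        ids.push_back(acc);
    }
    return ids;
}
```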
Have you ever checked https://github.com/powturbo/TurboPFor ? https://twitter.com/powturbo (mostly interested in the search-engine/big-data stuff). What about sharding? Or do you not need it yet?
@ddorian Yes, I have. It's excellent. You can replace Lemire's FastPFOR impl. with powturbo's. We have no applications or services that require sharding yet. Queries take a few ms, and if you are dealing with multiple index sources, you can use
Mark, just curious if you've looked at something like this: http://www.scylladb.com/2017/07/06/scyllas-approach-improve-performance-cpu-bound-workloads/ |
@ddorian I have, and I love Scylla's engineers' write-ups; they are very pragmatic and insightful. (I also can't recommend studying the Seastar codebase enough; it's a great engineering achievement, and perhaps the finest C++ codebase, along with Facebook's Folly and HHVM, that I have studied.) As for the benchmarks, we found a way to do this, so we will likely release something very soon. In the meantime, a major Trinity upgrade is in the works (lots of performance and usability improvements).
A benchmarks page will be up soon, but in the meantime, here is a not-exactly-fair comparison that involved indexing (naively) a 563MB partial English Wikipedia dump (XML) into a single segment:

This is a bit unfair, because Lucene skips all stop words -- and that's a lot of stop words, really -- whereas, for this benchmark, Trinity is not set to skip anything. Furthermore, Trinity's default query parser is used to parse the tokens, which is fairly sophisticated and considers different strategies for properly parsing different types of tokens, whereas Lucene's default parser is naive in that it just figures out the beginning and end of a token, and that's pretty much all there is to it. With stop words excluded and a simple parser similar to Lucene's, I expect the difference to be closer to 100%, i.e. twice as fast -- and this is running on a single thread, whereas Lucene's indexer, I believe, uses multiple threads (I should look into that, and into ways to configure it to use a single thread, if possible).

For search, there is no practical way to compare performance between Lucene and Trinity, because they have different scoring models. Lucene simply aggregates the scores of each term or phrase match (almost always a simple TF-IDF that doesn't require access to the positions/hits of a term in the matched document), which is pretty simplistic and very fast to compute (just a sum of values) -- and it also allows for optimisations that are just not possible with Trinity's more advanced/richer model. Trinity, on the other hand, will invoke an application-provided function, passing it rich information about every term matched, including all hits for that term in the matched document, so that the application can perform sophisticated ranking. This is vastly more complicated and expensive than simply aggregating numbers and computing sums, but it's extremely important for computing better and more accurate relevance scores/similarity measures.
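The contrast between the two scoring models can be sketched like this. The structs and the callback signature are made up for illustration; neither is Lucene's nor Trinity's real interface:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical per-term match info. A Lucene-style scorer only needs the
// frequency and idf; a Trinity-style callback also sees every hit position.
struct TermMatch {
    uint32_t termId;
    float idf;                       // precomputed inverse document frequency
    std::vector<uint32_t> positions; // hit positions within the matched document
};

// Lucene-style: sum per-term tf * idf. Positions are never consulted,
// so it is just a fast sum of values.
float tfidfScore(const std::vector<TermMatch> &matches) {
    float s = 0;
    for (const auto &m : matches)
        s += std::sqrt((float)m.positions.size()) * m.idf; // tf = sqrt(freq), classic Lucene-style
    return s;
}

// Trinity-style: hand the full match info to an application-provided
// ranking function, which can use positions for e.g. proximity scoring.
float rankWithCallback(const std::vector<TermMatch> &matches,
                       const std::function<float(const std::vector<TermMatch> &)> &rank) {
    return rank(matches);
}
```

The point of the extra cost: only the callback model can tell "apple ... pie" apart from "apple pie".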
Lucene 6+ uses BM25, which is a little slower than TF-IDF (now deprecated in Lucene). Do you intend to create performance benchmarks you can use yourself to make sure new changes don't introduce slowdowns (like Lucene does)?
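For reference, the standard BM25 per-term score, whose extra length normalization in the denominator is what makes it slightly more expensive than a plain TF-IDF sum:

```cpp
#include <cassert>
#include <cmath>

// Standard BM25 per-term contribution. k1 and b are the usual free
// parameters, dl the document length, avgdl the average document length.
double bm25(double idf, double tf, double dl, double avgdl,
            double k1 = 1.2, double b = 0.75) {
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * dl / avgdl));
}
```

With b = 0 the length normalization vanishes and the term score saturates purely with tf; with b > 0, shorter-than-average documents score higher for the same tf.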
@ddorian We have such a benchmark suite, but that's internal to our org., i.e. it won't be open sourced. I will bundle a Lucene and a Trinity app so that others can see for themselves (i.e. the applications that index and search that Wikipedia dump) -- thanks to @rkrambovitis for building that Lucene Java application. BM25 yields somewhat better results but, just like TF-IDF, is not going to work for "short" documents (e.g. products, or records), where you don't really have enough terms in the "document" for those models to produce meaningful results.
Trinity now supports the recently (yesterday) released Streaming VByte encoding scheme by Daniel Lemire. Read the blog post for an introduction. When selected (see
Yeah, saw that yesterday too. But didn't it also work with SIMD?
Vespa, which was released yesterday, looks nice as a clustering solution (assuming you can replace its index with Trinity); ex-employees who have worked with it say it's 100x better than Elasticsearch.
@ddorian You can choose between PFOR, Streaming VByte and Masked VByte; see lucene_codec.h for how to select the encoding scheme. In practice, most systems nowadays (well, CPUs) support SSE 4.0+, so it should work as expected in most environments. I've been talking to Jon Bratseth about Vespa. It's really interesting, and I plan to do a deep dive into the codebase to understand the impl. specifics; I just haven't had a chance to do that yet.
People love benchmarks. Especially against lucene.