Do you have plans to support real time indexing? #10

yingfeng · 2017-08-22T07:08:39Z

It's not that difficult to support such a feature; providing two in-memory segments is enough.
When one in-memory segment is full, flush it to disk while the other in-memory segment takes over data ingestion. This requires a lock-free design to support higher concurrency, which is not that complicated using std::atomic semantics.
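
The two-segment idea above could look roughly like the following. This is only a minimal sketch, assuming a single ingest thread and a single flusher thread; `DoubleBufferedIndexer`, `ingest()` and `flush()` are made-up names for illustration and not part of Trinity, and a "segment" here is just a vector of document IDs. Query-side access is not covered.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch, not Trinity code: two in-memory segments, one active.
class DoubleBufferedIndexer {
public:
    explicit DoubleBufferedIndexer(std::size_t cap) : cap_(cap) {
        segs_[0].reserve(cap);
        segs_[1].reserve(cap);
    }

    // Called by the single ingest thread. Returns the index (0 or 1) of the
    // segment that just filled up and must be handed to the flusher, else -1.
    int ingest(std::uint32_t doc_id) {
        const int cur = active_.load(std::memory_order_relaxed);
        segs_[cur].push_back(doc_id);
        if (segs_[cur].size() < cap_) return -1;

        const int next = 1 - cur;
        // Back-pressure: wait until the other segment has been flushed.
        while (flushing_[next].load(std::memory_order_acquire))
            std::this_thread::yield();

        flushing_[cur].store(true, std::memory_order_release);
        active_.store(next, std::memory_order_release);
        return cur;
    }

    // Called by the flusher once it has been handed idx. The sink stands in
    // for "write this segment to disk".
    template <class Sink>
    void flush(int idx, Sink&& sink) {
        sink(segs_[idx]);
        segs_[idx].clear();
        // Release pairs with the acquire load in ingest(), so the ingest
        // thread only reuses the segment after it has been emptied.
        flushing_[idx].store(false, std::memory_order_release);
    }

private:
    std::size_t cap_;
    std::vector<std::uint32_t> segs_[2];
    std::atomic<int> active_{0};
    std::atomic<bool> flushing_[2] = {{false}, {false}};
};
```

Whatever hands the returned index to the flusher thread (a queue, for instance) is assumed to provide the necessary ordering between the ingest thread's writes and the flusher's reads.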

markpapadakis · 2017-08-23T06:25:55Z

Please note that a major Trinity update is in the works - it should be pushed to GH sometime next week, along with benchmarks, comparing Lucene and Trinity.

You can implement a real-time indexing scheme pretty easily, by creating an IndexSourcesCollection. You then just add to that collection one index source for each read-only serialized segment/source (e.g. a SegmentIndexSource) and finally you add another IndexSource that’s built for real-time updates -- all you need to do is make sure your resolve_term_ctx() and new_postings_decoder() account for that. That’s pretty much all there is to it, though you may need to make use of an IndexDocumentsFilter because you likely won’t want to rely on IndexSource::masked_documents() of your real-time index source, but those specifics are rather easy to figure out.

When whatever you use to back your real-time index source (which is a proxy of sorts to that in-memory backing store) fills up, you can just flush it as e.g. a lucene or google segment, re-create the index source collection to include that new segment, reset the in-memory index source, and atomically replace the index collection (just a pointer swap).
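
To make the "re-create the collection and swap it in" step concrete, here is a rough sketch. It does not use Trinity's types: `Source`, `Collection`, `CollectionHolder`, `snapshot()` and `publish_flushed()` are all hypothetical stand-ins. It also swaps in a different technique from a bare pointer swap: the pointer is replaced under a short mutex so that readers holding an older snapshot keep it alive through `std::shared_ptr`. In C++20, `std::atomic<std::shared_ptr<T>>` could take the place of the mutex.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for an index source and a collection of sources.
struct Source {
    std::string name;
};
using Collection = std::vector<std::shared_ptr<const Source>>;

class CollectionHolder {
public:
    explicit CollectionHolder(Collection initial)
        : cur_(std::make_shared<Collection>(std::move(initial))) {}

    // Queries take a snapshot; it stays valid even if a swap happens later.
    std::shared_ptr<const Collection> snapshot() const {
        std::lock_guard<std::mutex> g(mu_);
        return cur_;
    }

    // After a flush: copy the current collection, add the newly flushed
    // read-only segment, and publish the result by replacing the pointer.
    void publish_flushed(std::shared_ptr<const Source> flushed) {
        std::lock_guard<std::mutex> g(mu_);
        auto next = std::make_shared<Collection>(*cur_);
        next->push_back(std::move(flushed));
        cur_ = std::move(next);
    }

private:
    mutable std::mutex mu_;
    std::shared_ptr<const Collection> cur_;
};
```

A query that called `snapshot()` before the swap keeps searching the old set of sources, and the old collection is freed once the last such query drops its snapshot. Resetting the in-memory source is left out of the sketch.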

This is just one way to do it, and if it sounds complicated, it’s because I failed to describe it properly -- it is pretty trivial in practice really.

yingfeng · 2017-08-23T08:13:13Z

Real-time indexing requires concurrent access to SegmentIndexSource, since updates and retrieval happen at the same time. Additionally, a document should be findable immediately after it has been inserted, which means the so-called commit happens at per-document granularity. As a result, the corresponding posting list must be thread-safe. I have not seen a data structure or other mechanism that supports the above flow.
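
For illustration only, one possible shape for a posting list with per-document visibility is an append-only array whose length is published with release/acquire ordering. The sketch below assumes a single writer, a fixed preallocated capacity, and postings that are plain document IDs; `AppendOnlyPostings` is a made-up name and nothing here comes from Trinity.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical single-writer, multi-reader posting list.
class AppendOnlyPostings {
public:
    explicit AppendOnlyPostings(std::size_t capacity)
        : data_(new std::uint32_t[capacity]), cap_(capacity) {}

    // Single writer only. Returns false when the list is full.
    bool append(std::uint32_t doc_id) {
        const std::size_t n = size_.load(std::memory_order_relaxed);
        if (n == cap_) return false;
        data_[n] = doc_id;
        // Publish: a reader that observes n + 1 also observes data_[n].
        size_.store(n + 1, std::memory_order_release);
        return true;
    }

    // Any number of readers. Only indexes below a value previously
    // returned by size() may be passed to at().
    std::size_t size() const { return size_.load(std::memory_order_acquire); }
    std::uint32_t at(std::size_t i) const { return data_[i]; }

private:
    std::unique_ptr<std::uint32_t[]> data_;
    std::size_t cap_;
    std::atomic<std::size_t> size_{0};
};
```

Because the storage never moves and entries are never modified after publication, readers need no lock; a document becomes visible as soon as the writer's `append()` returns. Deletions and updates would need something extra, such as a document filter.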

markpapadakis · 2017-08-23T08:58:04Z

You shouldn't really use a SegmentIndexSource. This is for read-only segments. Instead, you should subclass IndexSource and create your own. I should probably bundle a simple such implementation as an example of how this could work. If you can wait for a while until I get this new major release into shape and push it to GH, I'll add a reference impl. for such an IndexSource.

markpapadakis · 2017-09-06T08:56:02Z

@yingfeng I am sorry, it has taken longer than I expected to find some free time for those examples -- still working on adding more features (a major release was pushed to GH some days ago). Will get to those examples soon thereafter.
