Skip to content

IndexWriter's committed_opstamp is never updated after commits, breaking rollback and causing data integrity issues #2666

@requilence

Description

@requilence

Describe the bug
The IndexWriter::committed_opstamp field is initialized during IndexWriter creation but is never updated after successful commits. This causes severe issues with rollback functionality, delete operations, and can lead to data corruption through duplicate opstamps.

Which version of tantivy are you using?
both main and v0.22.0

Current Behavior

  1. committed_opstamp is initialized from disk metadata when creating an IndexWriter
  2. After each successful commit, committed_opstamp remains at its initial value (0 when using the fresh index)
  3. The public method commit_opstamp() always returns the initial value, not the actual last committed opstamp

Expected Behavior

According to the documentation, commit_opstamp() should return "the opstamp of the last successful commit". The committed_opstamp field should be updated after each successful commit.

Reproduction

use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index, IndexWriter};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let text_field = schema_builder.add_text_field("text", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    
    let mut index_writer: IndexWriter = index.writer_with_num_threads(1, 10_000_000)?;
    
    // Initial committed_opstamp is 0
    assert_eq!(index_writer.commit_opstamp(), 0);
    
    // Add document and commit
    let opstamp1 = index_writer.add_document(doc!(text_field => "first"))?;
    let commit_result = index_writer.commit()?;
    
    // BUG: committed_opstamp should be updated to opstamp1, but it's still 0
    assert_eq!(index_writer.commit_opstamp(), 0);  // Should be opstamp1!
    
    // This breaks rollback functionality
    index_writer.add_document(doc!(text_field => "second"))?;
    index_writer.rollback()?;  // This reverts stamper to 0 instead of opstamp1!
    
    // Now we get duplicate opstamps!
    let opstamp3 = index_writer.add_document(doc!(text_field => "third"))?;
    assert_eq!(opstamp3, 1);  // Same as opstamp1 - DUPLICATE!
    
    Ok(())
}

Impact

This bug causes multiple severe issues:

1. Broken Rollback Mechanism

  • rollback() calls self.stamper.revert(self.committed_opstamp)
  • Since committed_opstamp is never updated, it always reverts to 0
  • This causes duplicate opstamps and violates monotonicity guarantees

2. Incorrect delete_all_documents()

  • Also uses self.stamper.revert(self.committed_opstamp)
  • Returns wrong opstamp value to the caller

3. Data Integrity Issues

  • After rollback, new documents can get opstamps that already exist in committed segments
  • Violates the fundamental assumption that opstamps are monotonically increasing
  • Can lead to incorrect document ordering and deletion behavior

4. Misleading Public API

  • commit_opstamp() returns incorrect values
  • Any user code relying on this method for synchronization is broken

Root Cause

The issue stems from having two separate opstamp tracking mechanisms that are not synchronized:

  1. SegmentUpdater correctly maintains the committed opstamp:

    • In schedule_commit(), it saves the opstamp to disk via save_metas(opstamp, payload)
    • It updates its internal active_index_meta with the correct opstamp
    • This is what gets persisted to disk and loaded on restart
  2. IndexWriter has its own committed_opstamp field:

    • Initialized from disk metadata when creating the IndexWriter (line 312)
    • Never updated after successful commits
    • Used by rollback(), delete_all_documents(), and the public commit_opstamp() method

The commit flow in index_writer.rs:

  1. commit()prepare_commit()?.commit()
  2. PreparedCommit::commit()segment_updater.schedule_commit(opstamp)
  3. SegmentUpdater correctly updates its metadata, but IndexWriter's committed_opstamp field remains stale

Suggested Fix

The committed_opstamp field should be updated after a successful commit. Options include:

  1. Update after commit completes: When schedule_commit returns successfully, update IndexWriter::committed_opstamp
  2. Query SegmentUpdater: Have commit_opstamp() query the current opstamp from segment_updater.load_meta().opstamp instead of using a stale field
  3. Remove the field: Since SegmentUpdater already tracks this correctly, consider removing the redundant field from IndexWriter

The key is ensuring that rollback and delete operations use the actual last committed opstamp, not a stale value from initialization.

Additional Notes

This bug has likely existed for a long time but may have gone unnoticed because:

  1. Many users don't use rollback functionality
  2. The effects might not be immediately visible
  3. The public commit_opstamp() method might not be widely used

However, for applications that rely on rollback or the opstamp values for consistency, this is a critical data integrity issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions