Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add util for finding trie root hash #6094

Merged
merged 18 commits into from
Jun 18, 2024
Merged

Conversation

zhangchiqing
Copy link
Member

This PR adds a util command to find trie root hash in the wal file. It allows us to roll back executed height to any height.

Context: the current util allows us to roll back executed height to any height, which remove the execution results in the database. However, the EN is not able to start up, because the checkpoint and wal file has not been rollback to the corresponding height.

Rolling back the checkpoint and wal has been quite difficult, we've made tool to find the block for the block, however, it only allow us to roll back executed height to the block of the checkpoint. In order to be able to roll back to any height, we need to find the trie root hash in the wal files, and this is what this find trie root hash util is made for.

@@ -61,6 +61,7 @@ func Decode(data []byte) (operation WALOperation, rootHash ledger.RootHash, upda
switch operation {
case WALUpdate:
update, err = ledger.DecodeTrieUpdate(data[1:])
rootHash = update.RootHash
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a bug, that the root hash is not returned when decoding a record in the wal file.

@zhangchiqing zhangchiqing marked this pull request as ready for review June 13, 2024 21:32
@zhangchiqing zhangchiqing requested review from fxamacker, janezpodhostnik and peterargue and removed request for ramtinms and AlexHentschel June 13, 2024 21:32
flagOutputDir string
)

var Cmd = &cobra.Command{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some high level docs about what this command does and how it should be used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, and also documented some detailed use case here

@@ -44,7 +43,7 @@ func TestUpdate(t *testing.T) {
operation, stateCommitment, up, err := realWAL.Decode(data)
require.NoError(t, err)
assert.Equal(t, realWAL.WALUpdate, operation)
assert.Equal(t, stateCommitment, ledger.RootHash(hash.DummyHash))
assert.Equal(t, stateCommitment, ledger.RootHash(rootHash))
Copy link
Member Author

@zhangchiqing zhangchiqing Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @fxamacker , could you review my change here?

I'm not sure why the test case originally expected the decoded root hash to be zero. Would like to double check in case I missed anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangchiqing I think currently wal.Decode() returns un-initialized rootHash (empty rootHash) if operation is WALUpdate.

rootHash is only populated in wal.Decode() for WALDelete operation.

The existing test expecting empty rootHash fails because you changed wal.Decode() in this PR. This can be a breaking change in other use cases as well.

Given this, instead of modifying wal.Decode() function, you can use returned update.RootHash from wal.Decode() for your use case in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see why it was decided this way.

There are two types of operation: WALUpdate and WALDelete.

WALDelete only has the root hash. WALUpdate can be decoded into update, which contains the root hash.

So the operation is WALDelete, than the root hash is returned, and the update is nil; if the operation is WALUpdate, then the decoded update will be returned. So the root hash is only for the case that the operation is WALDelete.

But I still think in the case the operation is WALUpdate, it should return the root hash. There isn't any code reading the root hash when the operation is WALUpdate. So fixing it would not cause any problem.

This fix would also make my logic simplier, because I don't need to check the returned update object, which might be nil if the operation is WALDelete. I could just check the root hash and the operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took your suggestion and reverted my changes, and also added some commends on the Decode method to explain the behavior.

Comment on lines +104 to +107
from, to, err := prometheusWAL.Segments(dir)
if err != nil {
return 0, 0, fmt.Errorf("cannot get segments: %w", err)
}

This comment was marked as resolved.

Comment on lines 99 to 102
w, err := prometheusWAL.NewSize(lg, prometheus.DefaultRegisterer, dir, wal.SegmentSize, false)
if err != nil {
return 0, 0, fmt.Errorf("cannot create WAL: %w", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to create w here? We can use dir to replace w.Dir() as parameter NewSegmentsRangeReader().

@@ -44,7 +43,7 @@ func TestUpdate(t *testing.T) {
operation, stateCommitment, up, err := realWAL.Decode(data)
require.NoError(t, err)
assert.Equal(t, realWAL.WALUpdate, operation)
assert.Equal(t, stateCommitment, ledger.RootHash(hash.DummyHash))
assert.Equal(t, stateCommitment, ledger.RootHash(rootHash))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangchiqing I think currently wal.Decode() returns un-initialized rootHash (empty rootHash) if operation is WALUpdate.

rootHash is only populated in wal.Decode() for WALDelete operation.

The existing test expecting empty rootHash fails because you changed wal.Decode() in this PR. This can be a breaking change in other use cases as well.

Given this, instead of modifying wal.Decode() function, you can use returned update.RootHash from wal.Decode() for your use case in this PR.

Copy link
Member

@fxamacker fxamacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some more comments. Maybe add some docs for functions like copyWAL() to describe high level purpose of functions.

return 0, 0, fmt.Errorf("finish reading all segment files from %d to %d, but not found", from, to)
}

func copyWAL(dir, outputDir string, segment int, expectedRoot ledger.RootHash) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help me understand:

  • Is copyWAL() intended to create a new segment file with all records up to and including first update record with specified rootHash?

  • If outputDir has existing wal files, the new segment file would be created with higher index as filename suffix. Is this expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is copyWAL() intended to create a new segment file with all records up to and including first update record with specified rootHash?

Yes

If outputDir has existing wal files, the new segment file would be created with higher index as filename suffix. Is this expected?

yeah, good point. the outputDir is supposed to be different from the dir, (I can add a check to prevent that), so the generated wal is always 00000000 in a separate output folder.

Comment on lines 210 to 214
bytes := wal.EncodeUpdate(update)
_, err = writer.Log(bytes)
if err != nil {
return fmt.Errorf("cannot write LedgerWAL record: %w", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update can be nil for delete operation record. Maybe check operation before encoding record.

Copy link
Member

@fxamacker fxamacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for updating. I focused on reviewing code related to wal files.

Comment on lines +54 to +57
// Decode decodes the given data into a WAL operation, root hash and trie update.
// It returns (WALDelete, rootHash, nil, nil) if the operation is WALDelete.
// It returns (WALUpdate, hash.DummyHash, update, nil) if the operation is WALUpdate.
// To read the root hash of the trie update, use update.RootHash.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 192 to 195
w, err := prometheusWAL.NewSize(log.Logger, nil, dir, wal.SegmentSize, false)
if err != nil {
return fmt.Errorf("cannot create WAL: %w", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems w isn't used except for w.Dir(), so we can remove w and use dir directly.

Suggested change
w, err := prometheusWAL.NewSize(log.Logger, nil, dir, wal.SegmentSize, false)
if err != nil {
return fmt.Errorf("cannot create WAL: %w", err)
}

@zhangchiqing
Copy link
Member Author

I added some logic to replace the WAL file once it found the matching wal file with the --trim-as-latest-wal=true flag

Copy link
Member

@fxamacker fxamacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for automating this aspect as suggested! 👍 I left some more comments and suggestions.

Comment on lines 89 to 93
// genereate a segment file to the temporary folder with the root hash as its last record
newSegmentFile, err := findRootHashAndCreateTrimmed(flagExecutionStateDir, segment, rootHash)
if err != nil {
log.Fatal().Err(err).Msg("cannot copy WAL")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can create temp directory in this function and pass it to findRootHashAndCreateTrimmed() so that temp directory can be removed in defer.

backupFile := prometheusWAL.SegmentName(backupDir, i)

log.Info().Msgf("backup segment file %s to %s, %v/%v", segmentFile, backupFile, i, last)
err := os.Rename(segmentFile, backupFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use copyFile() here in case backup directory is in a different device?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about this. If I use copy file, it will take some time, I think we most likely would move within the same device. If a different device is used, then it fails, they can retry again with a different path on the same device.

I could also check if they are different device, and then use copy and remove the old files, but that would be too much code. I decided to keep it simple.

return nil
}

func copyFile(src, dst string) (int64, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of copying file, do we want to move file instead?

if os.IsNotExist(err) {
return true, nil
}
return false, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return error here?

Suggested change
return false, nil
return false, err

@zhangchiqing zhangchiqing merged commit 82de39d into v0.33 Jun 18, 2024
50 of 51 checks passed
@zhangchiqing zhangchiqing deleted the leo/v0.33-find-trie-root branch June 18, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants