accounts-db: relax intrabatch account locks#4253
Conversation
3e016b3 to
f748a5f
Compare
2fb8527 to
11d1dcb
Compare
accounts-db/src/account_locks.rs
Outdated
| // HANA TODO the vec allocation here is unfortunate but hard to avoid | ||
| // we cannot do this in one closure because of borrow rules | ||
| // play around with alternate strategies, according to benches this may be up to | ||
| // 50% slower for small batches and few locks, but for large batches and many locks |
There was a problem hiding this comment.
bench using jemalloc? i'd think it would do a reasonable job of just keeping the mem in thread-local cache for re-use
95b17b3 to
d1ec289
Compare
a11ef4d to
c234cc6
Compare
| #[derive(Debug, Default)] | ||
| pub struct AccountLocks { | ||
| write_locks: AHashSet<Pubkey>, | ||
| write_locks: AHashMap<Pubkey, u64>, | ||
| readonly_locks: AHashMap<Pubkey, u64>, | ||
| } |
There was a problem hiding this comment.
because the read- and write-lock hashmaps have the same type now, all the functions that change them are basically the same. we could use an enum or hashmap reference to discriminate and delete half of the functions, but i left it like this for your review before butchering it
accounts-db/src/accounts.rs
Outdated
| relax_intrabatch_account_locks: bool, | ||
| ) -> Vec<Result<()>> { | ||
| // Validate the account locks, then get iterator if successful validation. | ||
| let tx_account_locks_results: Vec<Result<_>> = txs |
There was a problem hiding this comment.
Accounts::lock_accounts() could be reimpled as a wrapper on lock_accounts_with_results() or possibly deleted, it isnt really required anymore since all batch-building needs to have results. but we could leave it as-is, or could do some kind of refactor with TransactionAccountLocksIterator
accounts-db/src/accounts.rs
Outdated
| fn lock_accounts_inner( | ||
| fn lock_accounts_inner<'a>( | ||
| &self, | ||
| tx_account_locks_results: Vec<Result<TransactionAccountLocksIterator<impl SVMMessage>>>, | ||
| tx_account_locks_results: impl Iterator< | ||
| Item = Result<TransactionAccountLocksIterator<'a, impl SVMMessage + 'a>>, | ||
| >, | ||
| relax_intrabatch_account_locks: bool, |
There was a problem hiding this comment.
in general where possible i changed vecs to iters, we eliminate several uses of collect() and just pass around the closure chains. this makes the type signatures look kind of stupid tho and there are possibly things we could refactor to be better (like maybe combining transactions with the transaction results instead of taking for granted they always have the same length). im undecided about style tho
accounts-db/src/accounts.rs
Outdated
| if relax_intrabatch_account_locks { | ||
| let validated_batch_keys = tx_account_locks_results.map(|tx_account_locks_result| { | ||
| tx_account_locks_result | ||
| .map(|tx_account_locks| tx_account_locks.accounts_with_is_writable()) | ||
| }); | ||
|
|
||
| account_locks.try_lock_transaction_batch(validated_batch_keys) | ||
| } else { | ||
| tx_account_locks_results | ||
| .map(|tx_account_locks_result| match tx_account_locks_result { | ||
| Ok(tx_account_locks) => account_locks | ||
| .try_lock_accounts(tx_account_locks.accounts_with_is_writable()), | ||
| Err(err) => Err(err), | ||
| }) | ||
| .collect() | ||
| } |
There was a problem hiding this comment.
this is the main branch that the feature controls. we pass it in as a param so Accounts doesnt need FeatureSet, only Bank. the only other place we use the feature is to enable signature-based transaction deduplication
runtime/src/bank.rs
Outdated
| // with simd83 enabled, we must deduplicate transactions by signature | ||
| // previously, conflicting account locks would do it as a side effect | ||
| let mut batch_signatures = AHashSet::with_capacity(transactions.len()); | ||
| let transaction_results = | ||
| transaction_results | ||
| .enumerate() | ||
| .map(|(i, tx_result)| match tx_result { | ||
| Ok(()) | ||
| if relax_intrabatch_account_locks | ||
| && !batch_signatures.insert(transactions[i].signature()) => | ||
| { | ||
| Err(TransactionError::AccountInUse) | ||
| } | ||
| Ok(()) => Ok(()), | ||
| Err(e) => Err(e), | ||
| }); |
There was a problem hiding this comment.
this is the dedupe step mentioned in a comment above. we could use a double for loop instead of a hashset but this seemed much more straightforward since the inner loop would have to abort based on the outer loop index
There was a problem hiding this comment.
I'd guess it's almost certainly faster to just do a brute-force since our batches are small (replay will be size 1 for unified-scheduler); but might get complicated with the current iterator interface. Fine to leave it as is.
d378d3a to
c6d7105
Compare
|
recent sample of bench comparisons, master vs this branch with simd83 enabled in general we perform slightly worse for tiny batches and as well or better for large batches. note these benches call code in |
runtime/src/bank.rs
Outdated
| // with simd83 enabled, we must deduplicate transactions by signature | ||
| // previously, conflicting account locks would do it as a side effect |
There was a problem hiding this comment.
we must dedup because we check for already_processed in a batch, then process, then add to the status_cache.
Is that correct summary of why we need to do this now?
There was a problem hiding this comment.
im not fully confident in my understanding of the status cache but i believe the way it works is you execute the batch and the signatures of processed (non-dropped) transactions go in the status cache only after theyve all been run. there are no checks in svm (nor should there be) for duplicate transactions within a batch
if replay is going to single-batch everything i guess it would enforce this as a side effect but it seemed good to do it here, since this code already did enforce this constraint (a malicious block that put the same transaction in one entry multiple times would fail locking in replay, without anything involving status cache, because the transactions would take the same write lock on the fee-payer)
runtime/src/bank.rs
Outdated
|
|
||
| // with simd83 enabled, we must deduplicate transactions by signature | ||
| // previously, conflicting account locks would do it as a side effect | ||
| let mut batch_signatures = AHashSet::with_capacity(transactions.len()); |
There was a problem hiding this comment.
let's use message_hash here instead of signature.
status-cache uses both, but signature is only necessary for RPC operation for fast signature lookup.
iirc reason to use message hash is because of signature malleability.
There was a problem hiding this comment.
i used it because message_hash isnt provided by SVMMessage or SVMTransaction. would you like me to add it to SVMTransaction? its available on SanitizedTransaction but providing it from SanitizedMessage would require us to add it to LegacyMessage and v0::LoadedMessage
There was a problem hiding this comment.
We can probably just change the trait bound to TransactionWithMeta on this function, that trait should provide it, and I'm fairly certain the things we actually call this with impl it.
There was a problem hiding this comment.
had to squash the past because rebasing was getting ugly but this is bae70c0
There was a problem hiding this comment.
We have the function directly as part of TransactionWithMeta, just call message_hash() on the TransactionWithMeta tx.
as_sanitized_transaction (unless it IS a sanitized transaction) will create a sanitized transaction. and possibly do 100s of allocations.
That fn is only there because of legacy interfaces - we shouldn't use it anywhere it's not strictly necessary (should be only geyser rn)
There was a problem hiding this comment.
gotcha, i see it now. i only looked at TransactionWithMeta rather than StaticMeta
There was a problem hiding this comment.
There's a comment on that fn. I'll update it to be more clear that basically no one should be using that, except for the couple places we call into geyser.
There was a problem hiding this comment.
Made comments on that trait better here - #4827
|
This PR contains changes to the solana sdk, which will be moved to a new repo within the next week, when v2.2 is branched from master. Please merge or close this PR as soon as possible, or re-create the sdk changes when the new repository is ready at https://github.com/anza-xyz/solana-sdk |
d2cf375 to
bae70c0
Compare
|
@joncinque this is a core runtime pr, the only sdk change is adding the feature gate. i assume when the new repo is created the procedure is going to be like:
right? |
|
reopened a re-requested reviews now that rebatching removal (#4833) and transaction balance collection (#5588) have both landed. this required some rebasing given the age but the only change is i rerolled the feature gate key, because it has migrated repos a couple times and it would be safer to not have to worry about the complications of that |
accounts-db/src/accounts.rs
Outdated
| Ok(()) => validate_account_locks(tx.account_keys(), tx_account_lock_limit) | ||
| .map(|_| TransactionAccountLocksIterator::new(tx)), | ||
| Err(err) => Err(err), | ||
| }); |
There was a problem hiding this comment.
This iterator will call into validate_account_locks on .next().
The iterator is passed into lock_accounts_inner and used inside the account locks - we should undo this change. I think it's better to have a vector collect here on validated account locks and pass that.
There was a problem hiding this comment.
We can actually still pass the same iterator I think. We just shouldn't redo the validate_account_locks
There was a problem hiding this comment.
ok i believe d0cabcc accomplishes the goal here. we keep the same outer iterator but collect the inner transaction results and writable bools before passing off to try_lock_accounts() or try_lock_transaction_batch()
There was a problem hiding this comment.
That changeset actually looks like it made things even worse because now both the collect's and the expensive lock validation are all done while the accounts lock is acquired.
There was a problem hiding this comment.
sorry, i understand your meaning now. i misinterpreted the talk about reducing work done inside "account locks" to mean minimizing time inside AccountLocks::try_lock_accounts() / AccountLocks::try_lock_transaction_batch(), but now i see you mean minimizing time after let account_locks = &mut self.account_locks.lock().unwrap();. will refactor again tomorrow
There was a problem hiding this comment.
9e73e22 ok all work is collected before the mutex. i deleted the old lock_accounts() since nothing used it anymore except tests (since we need to pass results for message hash dedupe) and consolidated all the logic in one statement in a renamed lock_accounts_with_results(), which hopefully makes it easier to follow
edit: eliminated a level of nesting
accounts-db/src/account_locks.rs
Outdated
| .collect(); | ||
|
|
There was a problem hiding this comment.
.collect() into immediate into_iter, this seems unnecessary. Am I missing something?
There was a problem hiding this comment.
needed to prevent a double-borrow of self in the two maps (and two are needed because every can_lock_accounts() needs to be called before any lock_accounts())
There was a problem hiding this comment.
We can get by with only one collect for building the result vec. The intermediate vec for "available keys" isn't really needed
There was a problem hiding this comment.
we do need it, unless theres something im missing in what you mean. if we were to write:
validated_batch_keys
.map(|validated_keys| match validated_keys {
Ok(ref keys) => match self.can_lock_accounts(keys.clone()) {
Ok(_) => validated_keys,
Err(e) => Err(e),
},
Err(e) => Err(e),
})
.map(|available_keys| available_keys.map(|keys| self.lock_accounts(keys)))
.collect()the borrow checker fails us because the second map borrows &mut self while the first map still has &self. and this is actually good: if the code compiled, it would be wrong. every can_lock_accounts() call needs to complete before any lock_accounts() call, so if this was evaluated lazily it would essentially behave like the pre-simd83 version
if the cost of the allocation is a concern we could pass in a mut Vec<_> and .iter_mut().foreach() editing in-place
There was a problem hiding this comment.
37f5095 went ahead and did that, i think this makes the most sense
There was a problem hiding this comment.
I had something different in mind but I like this better, nice!
runtime/src/bank.rs
Outdated
| .feature_set | ||
| .is_active(&feature_set::relax_intrabatch_account_locks::id()); | ||
|
|
||
| // with simd83 enabled, we must deduplicate transactions by message hash |
There was a problem hiding this comment.
👏 this is going to make our lives sooo much easier
accounts-db/src/accounts.rs
Outdated
| Ok(()) => validate_account_locks(tx.account_keys(), tx_account_lock_limit) | ||
| .map(|_| TransactionAccountLocksIterator::new(tx)), | ||
| Err(err) => Err(err), | ||
| }); |
There was a problem hiding this comment.
That changeset actually looks like it made things even worse because now both the collect's and the expensive lock validation are all done while the accounts lock is acquired.
8f84ec7 to
defe0eb
Compare
defe0eb to
9e73e22
Compare
accounts-db/src/account_locks.rs
Outdated
| } | ||
| } | ||
| /// NOTE this is the pre-SIMD83 logic and can be removed once SIMD83 is active. | ||
| pub fn try_lock_accounts(&mut self, keys: &[(&Pubkey, bool)]) -> TransactionResult<()> { |
There was a problem hiding this comment.
Why is this a slice now? Seems like iterator was a better interface
accounts-db/src/account_locks.rs
Outdated
| .collect(); | ||
|
|
There was a problem hiding this comment.
We can get by with only one collect for building the result vec. The intermediate vec for "available keys" isn't really needed
accounts-db/src/accounts.rs
Outdated
| .map(|_| { | ||
| TransactionAccountLocksIterator::new(tx) | ||
| .accounts_with_is_writable() | ||
| .collect::<Vec<_>>() |
There was a problem hiding this comment.
It doesn't seem necessary to collect here. Why not keep the same approach as before?
| } else { | ||
| assert_eq!(results, vec![Ok(()), Err(TransactionError::AccountInUse)]); | ||
| } | ||
| } |
There was a problem hiding this comment.
Feel free to do in a separate PR. IMO we should add in an additional test that checks the "partial" w/ intrabatch conflicts.
i.e.
- outstanding lock on A
- batch containing tx1, tx2:
- tx1 locks (A, B)
- tx2 locks (B)
I fully expect this works correctly as is, but just want some additional testing on it since the SIMD83 path relies on this matching our expectations.
runtime/src/bank.rs
Outdated
| if batch_message_hashes.insert(transactions[i].message_hash()) { | ||
| Ok(()) | ||
| } else { | ||
| Err(TransactionError::AccountInUse) |
There was a problem hiding this comment.
I can understand why you used AccountInUse here but it feels confusing in the context that it's totally fine to re-use accounts between transactions intra batch. Seems a little more accurate to use already processed instead:
| Err(TransactionError::AccountInUse) | |
| Err(TransactionError::AlreadyProcessed) |
There was a problem hiding this comment.
changed this error as suggested and it turned out to have a runtime effect, as you can see from the test changes i made after: e2a0190. basically AccountInUse is retryable but AlreadyProcessed is not. i believe this is an unproblematic bug fix but please let me know if you think there could be any downstream ramifications. this is consensus-safe however because this match arm only triggers on the feature gate
depending on how many duplicate transactions make it through the scheduler, this could be a trivial or nontrivial optimization. im not familiar with the retry flow yet but it seems like because the current code in master dedupes using lock conflicts:
- a batch with duplicates only weeds them out after the n^2 lock validation and after going partway through lock-taking
- these duplicates then go back in the transaction queue to be retried, which means they will be validated again, take all locks, and then only get dropped when
check_transactions()goes to the status cache
whereas with an AlreadyProcessed error they will get weeded out before lock validation and then not be retried. it doesnt help unless theyre in the same batch, which maybe is a rare or nonexistent case, but it seems like an unexpectedly good change!
There was a problem hiding this comment.
Ah interesting, it seems like a good thing to drop them sooner rather than later. I'm not aware of any other downstream effects.
accounts-db/src/account_locks.rs
Outdated
| let result = std::mem::replace(validated_keys, Err(TransactionError::AccountInUse)); | ||
| *validated_keys = match result { | ||
| Ok(ref keys) => match self.can_lock_accounts(keys.clone()) { | ||
| Ok(_) => result, | ||
| Err(e) => Err(e), | ||
| }, | ||
| Err(e) => Err(e), | ||
| }; |
There was a problem hiding this comment.
This code is a little hard to follow with the use of mem::replace, could we simplify to something like this?
| let result = std::mem::replace(validated_keys, Err(TransactionError::AccountInUse)); | |
| *validated_keys = match result { | |
| Ok(ref keys) => match self.can_lock_accounts(keys.clone()) { | |
| Ok(_) => result, | |
| Err(e) => Err(e), | |
| }, | |
| Err(e) => Err(e), | |
| }; | |
| if let Ok(keys) = validated_keys { | |
| if let Err(e) = self.can_lock_accounts(keys.clone()) { | |
| *validated_keys = Err(e); | |
| } | |
| } |
runtime/src/bank.rs
Outdated
| // with simd83 enabled, we must deduplicate transactions by message hash | ||
| // previously, conflicting account locks would dedupe transactions as a side effect |
There was a problem hiding this comment.
We need the same check on replay, this is only checking on the leader. Maybe we can move this into the TransactionBatch so that it's impossible to create a batch with duplicated transactions?
There was a problem hiding this comment.
i think im going to move it to Bank::try_lock_accounts_with_results()... it cant go in TransactionBatch::new() as-written because the batch is constructed after locks are taken, and if you change any Ok to Err at that stage youll never fully unlock
accounts-db/src/account_locks.rs
Outdated
| .collect(); | ||
|
|
There was a problem hiding this comment.
I had something different in mind but I like this better, nice!
feature-set/src/lib.rs
Outdated
|
|
||
| pub mod relax_intrabatch_account_locks { | ||
| solana_pubkey::declare_id!("EbAhnReKK8Sf88CvAfAXbgKji8DV48rsp4q2sgHqgWef"); | ||
| solana_pubkey::declare_id!("B31HjiHtU5KCvzeGW2pa9DLDvrYazWdF6qFeKE5UQoqG"); |
There was a problem hiding this comment.
Can we use this key ENTRYnPAoT5Swwx73YDGzMp3XnNH1kxacyvLosRHza1i from the issue here: #1029
| let conflicting_transaction = sanitize_transactions(vec![system_transaction::transfer( | ||
| &Keypair::new(), | ||
| &pubkey, | ||
| 1, | ||
| genesis_config.hash(), | ||
| )]); | ||
| bank.try_lock_accounts(&conflicting_transaction); |
There was a problem hiding this comment.
I'm not sure what this is supposed to be testing. Shouldn't we be inspecting the result of try_lock_accounts? Feels like that method should have a #[must_use] annotation too
There was a problem hiding this comment.
this isnt testing try_lock_accounts(), its just setup code so the test batch sees a lock conflict on the second transaction regardless of simd83. my read of this test was "a conflict happens, we dont care why, is the cost tracker right?" in contrast to the test_bank_process_and_record_transactions_account_in_use() test below which is testing the locking itself
There was a problem hiding this comment.
Ah so before simd 83 we would always be failing on intra batch account lock conflict. But after simd 83 it wouldn't fail so you changed the test so that it would fail effectively on inter batch account lock conflict. The comments above should be updated to reflect this. And I think the conflicting transaction creation and locking should happen before the creation of the test transactions.
core/src/banking_stage/consumer.rs
Outdated
| ), | ||
| ]); | ||
| assert_eq!( | ||
| transactions[0].signature() == transactions[1].signature(), |
There was a problem hiding this comment.
This is technically fine but can we check for message hash equality instead? Whenever testing or handling duplicate transactions we should make it clear that transactions are identified by message hash not signature.
There was a problem hiding this comment.
right, this line and the comment below got left behind because originally we were using the signatures
core/src/banking_stage/consumer.rs
Outdated
| let consumer = Consumer::new(committer, recorder, QosService::new(1), None); | ||
|
|
||
| // with simd83 and no duplicate, we take a cross-batch lock on an account to create a conflict | ||
| // with a duplicate transaction and simd83 it comes from signature equality in the batch |
There was a problem hiding this comment.
| // with a duplicate transaction and simd83 it comes from signature equality in the batch | |
| // with a duplicate transaction and simd83 it comes from message hash equality in the batch |
| 1, | ||
| genesis_config.hash(), | ||
| )]); | ||
| bank.try_lock_accounts(&conflicting_transaction); |
There was a problem hiding this comment.
We should be checking the results right?
There was a problem hiding this comment.
as above this is setup code and the real tests are on the return of consumer.process_and_record_transactions(&bank, &transactions);
There was a problem hiding this comment.
Got it, comments make more sense to me now, thanks!
| // with simd3, duplicate transactions are not retryable | ||
| if relax_intrabatch_account_locks && use_duplicate_transaction { | ||
| assert_eq!(retryable_transaction_indexes, Vec::<usize>::new()); | ||
| } else { | ||
| assert_eq!(retryable_transaction_indexes, vec![1]); | ||
| } |
There was a problem hiding this comment.
I'm confused. I would expect retryable indexes to be empty for relax_intrabatch_account_locks && !use_duplicate_transaction too, no?
There was a problem hiding this comment.
this is the test that sets up a lock conflict in that case:
// with simd83 and no duplicate, we take a cross-batch lock on an account to create a conflict
// with a duplicate transaction and simd83 it comes from message hash equality in the batch
// without simd83 the conflict comes from locks in batch
if relax_intrabatch_account_locks && !use_duplicate_transaction {
let conflicting_transaction =
sanitize_transactions(vec![system_transaction::transfer(
&Keypair::new(),
&pubkey1,
1,
genesis_config.hash(),
)]);
bank.try_lock_accounts(&conflicting_transaction);
}for context the orginal version of this test had one case: no simd83 (obviously) and the same transaction twice. i expanded it into four cases to cover all codepaths
ledger/src/blockstore_processor.rs
Outdated
| bank.get_balance(&keypair2.pubkey()), | ||
| ]; | ||
|
|
||
| assert!(result.is_err()); |
There was a problem hiding this comment.
If we changed the error for message hash duplication to "already processed" we would be able to differentiate the error type here before/after the feature activation. Seems kinda useful.
jstarry
left a comment
There was a problem hiding this comment.
I'm cool with this going in as is but think that one test could be cleaned up a bit and I think we should assert that try-lock in tests is actually locking successfully. That can all be done in a follow up since the tests look correct as is.
| let conflicting_transaction = sanitize_transactions(vec![system_transaction::transfer( | ||
| &Keypair::new(), | ||
| &pubkey, | ||
| 1, | ||
| genesis_config.hash(), | ||
| )]); | ||
| bank.try_lock_accounts(&conflicting_transaction); |
There was a problem hiding this comment.
Ah so before simd 83 we would always be failing on intra batch account lock conflict. But after simd 83 it wouldn't fail so you changed the test so that it would fail effectively on inter batch account lock conflict. The comments above should be updated to reflect this. And I think the conflicting transaction creation and locking should happen before the creation of the test transactions.
| 1, | ||
| genesis_config.hash(), | ||
| )]); | ||
| bank.try_lock_accounts(&conflicting_transaction); |
There was a problem hiding this comment.
Got it, comments make more sense to me now, thanks!
change locking rules in line with SIMD-0083 such that read/write and write/write overlap on accounts is allowed intra-batch, and thus intra-entry
change locking rules in line with SIMD-0083 such that read/write and write/write overlap on accounts is allowed intra-batch, and thus intra-entry
Problem
simd83 proposes removing the constraint that transactions in the same entry cannot have read/write or write/write contention on any account. a previous pr modified svm to be able to carry over state changes between such transactions while processing a transaction batch. this pr modifies consensus to remove the account locking rules that prevent such batches from being created or replayed
Summary of Changes
add a new function to
AccountLocks,try_lock_transaction_batch, which only checks for locking conflicts with other batches, allowing any account overlap within the batch itself. modifyAccountsandBankto use it instead when the feature gaterelax_intrabatch_account_locksis activated. also modifyprepare_sanitized_batch_with_resultsto deduplicate transactions within a batch by signature to prevent replay attacks, such that two instances of the same transaction cause the first to lock out the second, in a similar manner to the non-simd83 behavior for this special casesince transaction results are used more extensively than previous, some functions with
*_wiith_resultsvariants have bene collapsed into wrappers around that variant. we also refactor several things to favor iterators over vectors, to avoid places where iters are collected and transformed back into itersimportant code changes are confined to
accounts-db/src/accounts.rs,accounts-db/src/account_locks.rs, andruntime/src/bank.rs. changes in core, ledger, and runtime transaction batch only affect tests. overall the large majority of changes are fixes or improvements to testsFeature Gate Issue: https://github.com/anza-xyz/feature-gate-tracker/issues/76