Skip to content

svm: collect transaction balances#5588

Merged
2501babe merged 5 commits intoanza-xyz:masterfrom
2501babe:20250331_bals_take5_ultimate_redux
May 6, 2025
Merged

svm: collect transaction balances#5588
2501babe merged 5 commits intoanza-xyz:masterfrom
2501babe:20250331_bals_take5_ultimate_redux

Conversation

@2501babe
Copy link
Member

@2501babe 2501babe commented Mar 31, 2025

Problem

#4253 aka "simd83 locking" intends to allow multiple transactions in the same batch to write to the same account. presently, banking stage and blockstore processor collect pre- and post-execution balances by loading all accounts before and after commit. this does not work in a simd83 world because the intermediate balances of an account modified multiple times would be wrong (although the first pre- and final post-balance would be correct)

we have attempted to solve this in several different manners, as detailed in the notes below (not necessary to understand this pr). they dont work. what if there was a better way?

Summary of Changes

there is. well, a less bad way:

  • add a new option enable_transaction_balance_recording to ExecutionRecordingConfig
  • when this option is enabled, before and after execution in svm, we gather native lamport balances for all accounts, and token balances for all valid, initialized token accounts, along with mint address and decimals. we do this in a new struct BalanceCollector which isolates the logic from load_and_execute_sanitized_transactions(). we wrap this in Option<_> throughout, so when the recording config is false no extra work is done. we parse the token accounts and mints using the new crate spl-generic-token which provides the minimum necessary information common to both token programs (and can be extended if others are ever created, even if their buffer layouts are different)
  • Option<BalanceCollector> is added to LoadAndExecuteSanitizedTransactionsOutput and passed up to the ultimate caller. this is important because one of the main challenges has been that banking stage and blockstore processor must gather what is essentially intermediate state internal to svm, passing this through Bank itself without relying on Bank to store it or care about it
  • all balance collection code is removed from banking stage and blockstore processor. it simply receives the BalanceCollector. it no longer matters whether it handles it before or after commit because no more accounts need to be loaded
  • a simple function compile_collected_balances() turns the four vecs in BalanceCollector into the two structs containing the balances, transforming u64 token amounts into UiTokenAmount as required by the transaction status sender
  • ledger/src/token_balances.rs is deleted and solana-ledger no longer depends on the token programs directly (although it does indirectly via account-decoder; in the future we could move the UiTokenAmount builder somewhere and break this link)
  • balance collection timings are removed from ConsumeWorkerTimingMetrics and LeaderExecuteAndCommitTimings and added to ExecuteTimings. we expect balance collection to be much faster in this pr because we will already have all token accounts pre-loaded from accounts-db, and mints have to be loaded at most once per batch. the version in master has to load all accounts for all transactions

BalanceCollector has no special handling for pre- vs post-balances, ie it does not use TransactionLoadResult or TransactionProcessingResult for anything. in practice doing so makes the code more complex and brittle for uncertain speed gains, probably minimal since the accounts are already loaded

note that there are bank tests that expect to see full balances even for failed or discarded transactions, so that behavior has been left unchanged

this pr unblocks #4253

NOTE this should not be merged until https://github.com/solana-program/libraries requires code review to merge

Misc Notes (feel free to ignore)

this is too jargony to probably make sense to most and more for my reference in case we have to rehash past attempts in code review of this pr:

  • bals1, the svm three card monte: add a balance recording flag to the recording config. rather messy balance collection in txp, directly with no helpers. the idea was to use a callback from ledger to parse the token accounts so svm doesnt see spl-token but otherwise do the work ourselves in svm. tack the bals onto the processing result to get picked up upstream. this was bad because it very poorly separated concerns and did too much in txp itself, and we thought we had better approaches we could try

  • bals2, the mysterious dinner guest: i dont remember what this one was

  • bals3, the gigafunction: something in ledger which attempted to construct balances post-hoc from the transaction results. this was a clever machine which attempted to allow svm to have no hand in any of this. unfortunately the approach is inherently imperfect and has tons of gimmickry to emulate balance changes as one steps through the transactions. we crashed into the rocks because to get decimals right we need to either:

    • ignore mints that didnt exist before, but this produces bad balances for pathological mint usage. it also doesnt necessarily cover 100% of usecases, eg if you want to see balances of mints you created
    • load mint before and after commit and skip balances for any mints that changed. this avoids most mint hijinks but is still wrong for extreme hijinks. it lets us show more balances but forces a weird two-stage flow where we have to load all accounts up front
    • fully track mint state changes as we advance through the batch. this is 100% correct and gets ideal coverage. but it is a nightmare to code and maintain, the most rube goldberg situation of any option

    furthermore all of these have the problem of, for blockstore, we need to put the logic in the precommit callback. and that code has some unusual and possibly dangerous-to-modify interactions with bank locking to support the unified scheduler

  • bals4, the trait object hand grenade: we package up a machine that does the work and lob it into svm. this was an inspired option but it is fiendish. we have a collector trait in svm and impl it in ledger, but the collector trait depends on account loading, and passing in the collector trait object is very difficult because the balance collector is generic over concrete impls of either a new loader trait or txpcb, both of which are created below blockstore, so they cannot be unified by rustc. it also, even if it did work, would make function signatures truly ghastly

  • bals5, the present concern: this pr. we take the general approach of bals1, with a new recording option and balance recording in svm, passing the balances up in the processing results. but we package all the logic into a struct so txp is unperturbed, use a very light token parser directly in svm instead of passing any function pointers or trait objects down from ledger, and do the final transform of balance+decimals -> uiamount in ledger because this depends on account-decoder. we originally did the token work with agave inline-spl, but because svm is being broken out of agave, and because we want new features for tokens anyway, we made a new spl library that does exactly what we want

@2501babe 2501babe force-pushed the 20250331_bals_take5_ultimate_redux branch 3 times, most recently from cd80dce to 52ba648 Compare April 1, 2025 16:18
@codecov-commenter
Copy link

codecov-commenter commented Apr 1, 2025

Codecov Report

Attention: Patch coverage is 89.69697% with 34 lines in your changes missing coverage. Please review.

Project coverage is 82.8%. Comparing base (4ebe3c0) to head (d74a5be).
Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #5588     +/-   ##
=========================================
- Coverage    82.9%    82.8%   -0.1%     
=========================================
  Files         840      841      +1     
  Lines      377977   377818    -159     
=========================================
- Hits       313382   313204    -178     
- Misses      64595    64614     +19     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@2501babe 2501babe force-pushed the 20250331_bals_take5_ultimate_redux branch from 187f4ef to 5459890 Compare April 14, 2025 16:33
@2501babe 2501babe self-assigned this Apr 15, 2025
@2501babe 2501babe force-pushed the 20250331_bals_take5_ultimate_redux branch from c93d734 to 79ba85f Compare April 15, 2025 15:42
@2501babe 2501babe requested a review from apfitzge April 15, 2025 16:58
solana-rent-debits = { workspace = true }
solana-sdk = { workspace = true }
solana-sdk-ids = { workspace = true }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was a pleasant surprise

Comment on lines 367 to 373
.map(|_| Err(TransactionError::ProgramCacheHitMaxLimit))
.collect(),
balance_collector: None,
};
}
Copy link
Member Author

@2501babe 2501babe Apr 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setting None here is absolutely the right call because this branch discards all transactions and all pending state changes. what is less clear is whether it matters that compile_collected_balances() in ledger/src/transaction_balances.rs returns (vec![], vec![], vec![], vec![])

now that im going through this code with a reviewers eyes, maybe the function in ledger should return Option<(TransactionBalancesSet, TransactionTokenBalancesSet)> instead, and its callers can create four vec![vec![]; batch.len()]

in the worst case callers would have to load the accounts to fill the native balance vecs. i really dont know whether anything on the TransactionStatusReceiver side would need balances for an entirely discarded batch though. im leaning toward the "vec of empty vecs" option but curious to hear your thoughts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this is marked outdated because i added a comment to it explaining this)

Comment on lines +396 to 400

let (mut load_us, mut execution_us): (u64, u64) = (0, 0);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its only tangentially related to this pr, but instead of adding all timings at the bottom, i switched to adding as many timings as possible immediately after they are generated, and keeping load_us and execution_us as incremental accumulators because theyre printed in a debug message

this is because with collect_balances_us we would have had four mut u64 here, and if the program cache fix (filter account owners) ever lands we would have seven. it doesnt make much sense to keep this pattern, since we are just saturating-adding to a stack variable that is then later saturating-added to the timing struct anyway

Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach looks good to me. Left a couple of minor comments. Let me know when you've added tests and I will go through to do a full review

@2501babe 2501babe force-pushed the 20250331_bals_take5_ultimate_redux branch from fb7c42b to f3323e4 Compare April 16, 2025 20:18
@2501babe 2501babe marked this pull request as ready for review April 18, 2025 18:42
@2501babe 2501babe requested a review from a team as a code owner April 18, 2025 18:42
@2501babe 2501babe requested a review from apfitzge April 18, 2025 18:42
owner: owner.to_string(),
program_id: program_id.to_string(),
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know these are simple, but I still think we should probably add some very simple tests for compile_collected_balances.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

balance_collector: Option<BalanceCollector>,
batch_len: usize,
) -> (TransactionBalancesSet, TransactionTokenBalancesSet) {
// if the batch was aborted due to blowing the program cache, balance_collector may be None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the batch is aborted, we shouldn't get to the point this is called, right? I thought if we blow program-cache we just abort the entire batch and don't commit any of it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

youre right, i was looking in committer that the transaction batch is sent no matter what happens, but this line in consumer if processed_counts.processed_transactions_count != 0 { ... } means the committer is never entered... i suppose we could remove the Option<_> from the result type and just put a BalanceCollector::default() in transaction processor in the cache case. it seems better than unwrapping the Option<_>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7f1f1a8 i kept the Option<_> since its better to signal that it doesnt exist when it doesnt exist (since it truly doesnt exist when there is no transaction status sender). instead i debug assert it is not empty inside branches that execute if there is a transaction status sender and then unwrap to default

vec![vec![]; batch_len],
)
};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert that the length of native_pre is equal to batch_len? obviously there is possibility in the else branch - but balance_collector path has fewer guarantees:

  • it initially constructs vectors with capacity of batch_len
  • pushes to them before/after each tx execution

I'm fairly certain svm has guarantees that it will execute the whole batch, but may want to add at least a debug_assert_eq! here for our tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7f1f1a8 i did this but at construction time, its a more natural place now since compile_collected_balances() no longer needs batch length and BalanceCollector cannot be modified outside svm


#[derive(Default, Debug)]
pub struct LeaderExecuteAndCommitTimings {
pub collect_balances_us: u64,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we lose this metric, but I think it's fine - no realistic leader is running with this option enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the timings just moved to ExecuteTimings

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that's not reported directly by the workers - so we've lost the metric on the threads that execute when we're leader.

Comment on lines +2671 to +2678
let mut users = users.to_vec();
let (i, from) = users.iter().enumerate().choose(&mut rng).unwrap();
let from = *from;
users.remove(i);
let to = *users.iter().choose(&mut rng).unwrap();
let amount = rng.gen_range(1, STARTING_BALANCE / 100);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test-only, but seems a weird way to do it! Choose 2 unique-indexes and clone just those not the whole user-set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AccountMeta::new(self.to, false),
AccountMeta::new(fee_payer, true),
],
data: bincode::serialize(&(3u8, self.amount)).unwrap(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a constant 3 we can pull in from anywhere reasonable here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ill define it explicitly in the file, the only place to get it would be spl-token

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@2501babe 2501babe force-pushed the 20250331_bals_take5_ultimate_redux branch from d0ef619 to 7f1f1a8 Compare April 27, 2025 08:39
@2501babe 2501babe requested a review from apfitzge April 28, 2025 09:26
@2501babe 2501babe force-pushed the 20250331_bals_take5_ultimate_redux branch from 408b408 to 005eb45 Compare April 30, 2025 07:14
apfitzge
apfitzge previously approved these changes May 1, 2025
@2501babe 2501babe force-pushed the 20250331_bals_take5_ultimate_redux branch from 005eb45 to 2ee244c Compare May 2, 2025 16:29
@2501babe
Copy link
Member Author

2501babe commented May 2, 2025

@anza-xyz/svm this is ready for review, i just had to rebase after andrews approval because lockfiles changed again

Comment on lines +2740 to +2746
struct SplTokenMint {
authority_tag: u32,
authority: Pubkey,
supply: u64,
decimals: u8,
initialized: bool,
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using the spl-token crate avoid the declaration of struct SplTokenMint, struct Transfer and struct SplTokenAccount?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd avoided spl-token dependency in svm, but maybe it's fine because this is a test?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want SVM to depend on other monorepo crates, but since spl-token is already outside the monorepo, I believe it should be fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@2501babe 2501babe merged commit cf6946d into anza-xyz:master May 6, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants