feat(resharding) - Make shard ids non-contiguous #12181
core/primitives-core/src/types.rs
Outdated
/// indices in range 0..NUM_SHARDS and casting to ShardId. Once the transition
/// is fully complete it potentially may be simplified to a regular type alias.
///
/// TODO get rid of serde
What prevents removing serde now?
Some tests use serde as a snapshotting library to make sure old layouts are not
changed accidentally.
core/primitives-core/src/types.rs
Outdated
    PartialOrd,
    Ord,
)]
pub struct ShardId(u64);
Can this be easily changed to `ShardId(u32)`?
core/primitives-core/src/types.rs
Outdated
}

/// Get the numerical value of the shard id. This should not be used as an
/// index into an array, as the shard id may be any arbitrary number.
nit: add a doc-link to `get_shard_index`?
core/primitives-core/src/types.rs
Outdated
impl ShardId {
    /// Create a new shard id. Please note that this function should not be used
    /// directly. Instead the ShardId should be obtained from the shard_layout.
It does seem slightly counter-intuitive to say `new` shouldn't be used, given that it's a trivial constructor. What are the risks?
core/primitives/src/shard_layout.rs
Outdated
@@ -444,6 +476,14 @@ impl ShardLayout {
    pub fn shard_uids(&self) -> impl Iterator<Item = ShardUId> + '_ {
        self.shard_ids().map(|shard_id| ShardUId::from_shard_id_and_layout(shard_id, self))
    }

    pub fn get_shard_index(&self, shard_id: ShardId) -> usize {
nit: we should say this returns an index from 0..num_shards, with stable ordering
btw will there be guarantees about the index of a shard `S` when shard_layout changes and `S` is in the old and new layout?
Thanks, I will add comments and implementation later. I just want to make sure that this approach works fine first.
> btw will there be guarantees about the index of a shard S when shard_layout changes and S is in the old and new layout?
Good question and it's the opposite. The shard indices will change for almost all shards during resharding. It is ok though because the shard index is maintained in the shard layout. The users should make sure to use the correct shard layout when getting the shard index - but again I think this expectation is reasonable and quite common.
In future iterations, we should try to get rid of shard_index from places like `sample_chunk_producer` and `ChunkEndorsementsBitmap` by changing their structure, but for now this approach looks reasonable.
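The id-to-index mapping discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `SketchLayout`, `get_shard_id`, and the choice to order indices by sorted shard id are all assumptions made for the example, and `ShardId` here is a local stand-in for the real newtype.

```rust
use std::collections::BTreeMap;

/// Local stand-in for the `ShardId` newtype (assumption, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ShardId(u64);

impl ShardId {
    pub const fn new(id: u64) -> Self {
        Self(id)
    }
}

pub type ShardIndex = usize;

/// Hypothetical, simplified layout. The real `ShardLayout` is more involved.
pub struct SketchLayout {
    shard_ids: Vec<ShardId>,
    id_to_index: BTreeMap<ShardId, ShardIndex>,
}

impl SketchLayout {
    /// Assumption: indices follow the sorted order of the shard ids.
    pub fn new(mut shard_ids: Vec<ShardId>) -> Self {
        shard_ids.sort();
        shard_ids.dedup();
        let id_to_index: BTreeMap<ShardId, ShardIndex> =
            shard_ids.iter().enumerate().map(|(index, &id)| (id, index)).collect();
        Self { shard_ids, id_to_index }
    }

    /// Index in 0..num_shards, or `None` if the id is not in this layout.
    pub fn get_shard_index(&self, shard_id: ShardId) -> Option<ShardIndex> {
        self.id_to_index.get(&shard_id).copied()
    }

    /// Inverse lookup, or `None` if the index is out of range.
    pub fn get_shard_id(&self, index: ShardIndex) -> Option<ShardId> {
        self.shard_ids.get(index).copied()
    }
}
```

Under these assumptions, splitting shard 1 of a layout with ids 0, 1, 2, 3 into new shards 4 and 5 moves shard 2 from index 2 to index 1, which illustrates why an index is only meaningful together with the layout it came from.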
core/primitives-core/src/types.rs
Outdated
/// Get the numerical value of the shard id. This should not be used as an
/// index into an array, as the shard id may be any arbitrary number.
pub fn get(self) -> u64 {
Why don't we keep `shard_id.into()` as the primary conversion method instead of `shard_id.get()` while we are sorting this mess? That looks more natural to me, and the calls to `.into()` are always explicit in Rust.
Similarly, we could use `ShardId::from(id)` instead of `ShardId::new(id)`.
+1 for `into`. See my similar comment about operations involving u64 or other integers as well.
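For reference, the `From`/`Into` alternative suggested in this thread could look roughly like the sketch below. The `ShardId` here is a local stand-in with fewer derives than the real type, and whether the PR ends up with these impls is not stated in the thread.

```rust
/// Local stand-in for the `ShardId` newtype (assumption, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ShardId(u64);

impl ShardId {
    /// Explicit constructor, mirroring `ShardId::new(id)` from the diff.
    pub const fn new(id: u64) -> Self {
        Self(id)
    }

    /// Explicit accessor, mirroring `shard_id.get()` from the diff.
    pub const fn get(self) -> u64 {
        self.0
    }
}

/// Enables `ShardId::from(raw)` and `let id: ShardId = raw.into()`.
impl From<u64> for ShardId {
    fn from(id: u64) -> Self {
        Self(id)
    }
}

/// Enables `u64::from(shard_id)` and `let raw: u64 = shard_id.into()`.
impl From<ShardId> for u64 {
    fn from(shard_id: ShardId) -> Self {
        shard_id.0
    }
}
```

One trade-off worth noting: `.into()` needs a known target type at the call site, while `.get()` always yields `u64`, so the two styles differ mostly in how visible the target type is to the reader.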
    shard_uids.iter().map(|_| HashSet::new()).collect();

let mut shard_account_ids: BTreeMap<ShardId, HashSet<AccountId>> =
    shard_ids.iter().map(|&shard_id| (shard_id, HashSet::new())).collect();
nit: shard_layout.shard_ids().map(...).collect();
In general a big +1 for the new type `ShardId(u64)`!
@@ -85,7 +85,7 @@ impl CongestionControl {
         // `clamped_f64_fraction` clamps to exactly 1.0.
         if congestion == 1.0 {
             // Red traffic light: reduce to minimum speed
-            if sender_shard == self.info.allowed_shard() as u64 {
+            if sender_shard.get() == self.info.allowed_shard() as u64 {
For these kinds of operations (to hide the internal u64 more), would it make sense to not have `get()` but instead override the operation involving u64 (in this particular case, for example, overriding the `==` op or adding `equals(u64)`)?
Similarly, for ops like `self.set_allowed_shard(allowed_shard.get() as u16);`, defining `into()` operations?
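A sketch of what this suggestion might look like, assuming a local stand-in for `ShardId`. The mixed-type `PartialEq` impls cover the `==` case; for the `u16` case the sketch uses `TryFrom` rather than `Into`, which is a substitution on my part: a `u64` does not always fit in a `u16`, so a fallible conversion avoids the silent truncation of `as u16`.

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Local stand-in for the `ShardId` newtype (assumption, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ShardId(u64);

impl ShardId {
    pub const fn new(id: u64) -> Self {
        Self(id)
    }
}

/// Allows `shard_id == raw` without exposing the inner value.
impl PartialEq<u64> for ShardId {
    fn eq(&self, other: &u64) -> bool {
        self.0 == *other
    }
}

/// Allows the symmetric form `raw == shard_id`.
impl PartialEq<ShardId> for u64 {
    fn eq(&self, other: &ShardId) -> bool {
        *self == other.0
    }
}

/// Fallible narrowing: returns an error instead of truncating like `as u16`.
impl TryFrom<ShardId> for u16 {
    type Error = TryFromIntError;

    fn try_from(shard_id: ShardId) -> Result<Self, Self::Error> {
        u16::try_from(shard_id.0)
    }
}
```

With these impls the comparison in the hunk above could be written as `sender_shard == self.info.allowed_shard() as u64`, at the cost of making it easier to compare a shard id against an unrelated integer by accident.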
This is a non-trivial change. I adjusted the shard assignment to fully operate only on ShardIndex instead of ShardId.
I'm not sure if it's in the scope of this PR, but I had issues trying to initialize genesis with certain EDIT: nvm I think it will be fixed with this PR
-let congestion_seed = apply_state.block_height.wrapping_add(apply_state.shard_id);
+// TODO(wacban) Using non-contiguous shard id here breaks some
+// assumptions. The shard index should be used here instead.
+let congestion_seed =
What is the worst that can happen? The allowed shard is not chosen fairly?
It could lead to some shards never being the allowed shard. I'm not sure, as I'm avoiding digging deeper into those non-trivial cases because this task is massive as it stands. For now a todo should do :)
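One way the concern above could play out, as a toy model only: `round_robin` and both helper functions are hypothetical and are not nearcore's selection logic. The sketch assumes the chosen value is a number in `0..num_shards` and compares reading that number as a shard id against reading it as a shard index.

```rust
use std::collections::BTreeSet;
use std::ops::Range;

/// Hypothetical round-robin choice; NOT nearcore's actual selection logic.
/// Assumes `num_shards > 0`.
fn round_robin(seed: u64, num_shards: usize) -> usize {
    (seed % num_shards as u64) as usize
}

/// Old assumption: the result in 0..num_shards is itself a shard id.
pub fn allowed_ids_naive(shard_ids: &[u64], heights: Range<u64>) -> BTreeSet<u64> {
    heights.map(|height| round_robin(height, shard_ids.len()) as u64).collect()
}

/// Index-based variant: the result is an index, mapped back to a shard id.
pub fn allowed_ids_via_index(shard_ids: &[u64], heights: Range<u64>) -> BTreeSet<u64> {
    heights.map(|height| shard_ids[round_robin(height, shard_ids.len())]).collect()
}
```

With shard ids 0, 2, 3, 4, 5, the naive variant can select 1, which is not a shard in the layout, and can never select 5, while the index-based variant reaches every shard in the layout.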
I think it should be possible to use ShardLayoutV2 in tests as long as you set it up so that the shard ids are contiguous, ordered, etc. Resharding would break that, however, so yeah, you'll need to wait for this PR.
736d82e
to
795746d
Compare
@shreyan-gupta and @tayfunelmas regarding the
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #12181 +/- ##
==========================================
+ Coverage 71.60% 71.63% +0.02%
==========================================
Files 824 825 +1
Lines 165513 165855 +342
Branches 165513 165855 +342
==========================================
+ Hits 118516 118808 +292
- Misses 41875 41892 +17
- Partials 5122 5155 +33
LGTM
I left a few comments, none very important.
(I didn't go very in depth on a few files; it's a very long PR.)
@@ -214,12 +214,14 @@ impl TestEnv {
        // TODO(congestion_control): pass down prev block info and read congestion info from there
        // For now, just use default.
        let prev_block_hash = self.head.last_block_hash;
        let state_root = self.state_roots[shard_id as usize];
        let epoch_id =
nit: we could just `unwrap` there
for (shard_id, chunk_header) in block.chunks().iter().enumerate() {

let epoch_id = block.header().epoch_id();
let shard_layout = self.epoch_manager.get_shard_layout(&epoch_id)?;
nit: a fn to get `shard_layout` from a block could help with readability... a little
let prev_head = self.chain_store.head()?;
let is_caught_up = block_preprocess_info.is_caught_up;
let provenance = block_preprocess_info.provenance.clone();
let block_start_processing_time = block_preprocess_info.block_start_processing_time;
// TODO(#8055): this zip relies on the ordering of the apply_results.
// TODO(wacban): do the above todo
a TODO to do a TODO? 🔥
let upper_bound = StatePartKey(sync_hash, shard_id + 1, 0);

// The upper bound shard id is not a valid ShardId. It should only be
// used as a uppoer bound for the shard id in the database.
-// used as a uppoer bound for the shard id in the database.
+// used as a upper bound for the shard id in the database.
@@ -800,7 +800,7 @@ pub fn report_recorded_column_sizes(trie: &Trie, apply_state: &ApplyState) {
     // Tracing span to measure time spent on reporting column sizes.
     let _span = tracing::debug_span!(
         target: "runtime", "report_recorded_column_sizes",
-        shard_id = apply_state.shard_id,
+        shard_id = ?apply_state.shard_id,
You did already change it everywhere, but we could have added a `Display` impl for shard_id 😄
I tried; it didn't work. The tracing library requires some tracing::Value trait to be implemented, but it's actually not possible to add it for new types.
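For context, a `Display` impl for the newtype is straightforward with the standard library, as in the sketch below (using a local stand-in for `ShardId`). In `tracing` macros the `?` sigil records a field through `Debug` and the `%` sigil through `Display`, so either way a sigil is needed for a type like this; the thread does not say whether the `%` form was considered.

```rust
use std::fmt;

/// Local stand-in for the `ShardId` newtype (assumption, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ShardId(u64);

impl ShardId {
    pub const fn new(id: u64) -> Self {
        Self(id)
    }
}

/// Prints only the inner number, unlike the derived `Debug` output.
impl fmt::Display for ShardId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}
```

The practical difference is in the log output: `Display` would render the shard id as a bare number, while the derived `Debug` includes the type name.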
/// Historically the ShardId was always in the range 0..NUM_SHARDS and was used
/// as the shard index. This is no longer the case, and the ShardIndex should be
/// used instead.
pub type ShardIndex = usize;
Is there any plan to transform this into a newtype in a future PR?
Hmm, good question. It's meant to be used as an index to arrays with chunk data, so kinda exactly like usize. I think we should keep it as a type alias for now. Please share your thoughts though.
@@ -589,16 +590,21 @@ impl EpochManagerAdapter for MockEpochManager {
         &self,
         _prev_hash: &CryptoHash,
         shard_ids: Vec<ShardId>,
-    ) -> Result<Vec<ShardId>, Error> {
-        Ok(shard_ids)
+    ) -> Result<Vec<(ShardId, ShardIndex)>, Error> {
Does it make sense to return an error if shard layout is V2?
@@ -1453,8 +1456,18 @@ impl Client {
    ) {
        let chunk_header = partial_chunk.cloned_header();
        self.chain.blocks_delay_tracker.mark_chunk_completed(&chunk_header);

        // TODO(#10569) We would like a proper error handling here instead of `expect`.
Which raises the question: is anybody able to recover from this failure?
@@ -122,7 +122,7 @@ pub trait EpochManagerAdapter: Send + Sync {
    &self,
Should we update the comment to reflect the new return type? I won't suggest changing the method name unless somebody sees any value in doing so
Also valid for other modified methods in this file
This is part 1 of adding support for non-contiguous shard ids. The principal idea is to make ShardId into a newtype so that it's not possible to use it to index arrays with chunk data. In addition, I'm adding a ShardIndex type and a mapping between shard indices and shard ids so that it's possible to convert one to another as necessary.
The TLDR of this approach is to make the types right, fix compiler errors and pray to the software gods that things work out.
I am now giving up on trying to make the migration in a single PR. Instead I am introducing some temporary structures and methods that are compatible with both approaches. My current plan for the migration is as follows:
There are a few common themes in this PR:
Adding `?` to shard id in tracing logs because the newtype ShardId doesn't work without it
must-review files:
good-to-review files: