Proposed Reduction of Max TTL in Release 1.5 #3995

MParlikar · 2023-05-26T17:08:23Z

Summary

The Casper Labs engineering team proposes reducing the Max TTL setting for mainnet from 24 hours to 18 hours. This change would be proposed as part of the 1.5.x release.

The changes in 1.5 that enable new nodes to join from the tip of the chain exacerbate the pressure that a large maximum TTL setting places on the nodes and the network at large. Nodes must retain all deploys in the system that have an unexpired TTL. Since nodes can join much faster in 1.5, the pressure on the network and the nodes increases with a longer TTL. Reducing the maximum TTL setting will relieve the strain on the nodes and network in the following areas:

Storing the deploys
Gossiping these deploys to new joining nodes
Validating blocks
Checking for replay attacks

Background

Definitions

Deploys represent potential work to be done on chain. These are also commonly known as transactions.
The deploy TTL (time to live) parameter is an offset from the deploy timestamp and represents how long the deploy is considered live (executable) by the system. Both the TTL and the deploy timestamp are set by the creator of the deploy.
The maximum allowable value for TTL on the public Casper network presently is 24 hours. This is a configurable parameter for each network. Deploys with an invalid value for this parameter are rejected by the network.

Creation of a Deploy

Various fields are set by the creator of a deploy, two of which are a timestamp and a TTL (time to live), which is an offset from the timestamp.
Deploys are created with tools, such as the Casper client, or the JavaScript SDK. When the TTL value is not set, most of these tools default the TTL value to 30 minutes.
All of the fields of a deploy are hashed together in a particular way to create a deploy hash. One or more valid signatures may be applied to the deploy, signing over that deploy hash.

Sending the Deploy to A Network

Such a deploy may then be sent to one or more nodes on the target network with one or more signatures attached. Each receiving node validates the deploy and enforces various rules:

Performs a dead-on-arrival check, rejecting a deploy whose TTL has already expired or whose TTL exceeds the chain’s configured maximum allowed TTL.
Derives the deploy hash (which should match) and rejects a deploy with an invalid deploy hash.
Checks for valid and sufficient signatures.

If the deploy passes all of these checks, it is accepted; the node then schedules that deploy be gossiped to the rest of the network.
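The acceptance checks above can be sketched roughly as follows. This is a minimal illustration, not the node's implementation: the `Deploy` class, the `derive_hash` and `acceptance_error` helpers, the field layout that is hashed, and the 18-hour constant are all assumptions made for the example, and cryptographic signature verification is left out entirely.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical chainspec value for illustration: 18 hours in milliseconds.
MAX_TTL_MS = 18 * 60 * 60 * 1000


@dataclass
class Deploy:
    timestamp_ms: int          # set by the creator of the deploy
    ttl_ms: int                # offset from the timestamp
    body: bytes                # stand-in for all other deploy fields
    deploy_hash: bytes         # hash claimed by the creator
    signatures: list = field(default_factory=list)


def derive_hash(deploy: Deploy) -> bytes:
    """Stand-in hashing scheme; the real node hashes the fields in its own way."""
    h = hashlib.blake2b(digest_size=32)
    h.update(deploy.timestamp_ms.to_bytes(8, "little"))
    h.update(deploy.ttl_ms.to_bytes(8, "little"))
    h.update(deploy.body)
    return h.digest()


def acceptance_error(deploy, now_ms, max_ttl_ms=MAX_TTL_MS, required_signatures=1):
    """Return a rejection reason, or None if the deploy passes every check."""
    if deploy.ttl_ms > max_ttl_ms:
        return "ttl exceeds maximum"
    if now_ms >= deploy.timestamp_ms + deploy.ttl_ms:
        return "expired"  # dead on arrival
    if derive_hash(deploy) != deploy.deploy_hash:
        return "invalid deploy hash"
    # Only the signature count is checked here; verifying each signature
    # against the deploy hash is omitted from this sketch.
    if len(deploy.signatures) < required_signatures:
        return "insufficient signatures"
    return None
```

A deploy for which `acceptance_error` returns `None` would be the one that is accepted and scheduled for gossiping.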

In 1.4.x and earlier versions, a component called the block proposer is active on all nodes (whether they are validating nodes or not); after a delay, to allow for gossiping, a deploy is registered with the block proposer.
In 1.5 onward the block proposer has been replaced by a simpler deploy buffer; on validating nodes (only) an accepted deploy is registered with the deploy buffer.

Inclusion in a Block

When a given validator is selected as the leader of a round and proposes a block, the validator’s node produces a set of 0 or more deploys that it has buffered for inclusion and sends that proposal onward for consideration / consensus. The protocol dictates various rules about what does and does not constitute a valid proposed block:

The block must not contain a deploy whose time to live has elapsed; this is determined by comparing the timestamp of the deploy plus the offset of the TTL of the deploy against the block time of the proposed block.
- The block proposer / deploy buffer logic enforces this rule; expired deploys are periodically purged, and this is signaled via the event stream as a DeployExpired event.
A proposed block may not include a deploy that has already been included in a previous block; enforcing this rule is referred to collectively as “deploy replay protection”.
- This rule must be observed for the maximum TTL of the network; if a deploy could have been included in any block with a block time within the range of the deploy time plus a TTL of up to 24 hours it is necessary to check every actual block produced in that time period to see if they include the deploy.
- The converse is also true, that honest nodes cannot purge any deploy with a valid TTL.
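As a rough sketch of the replay protection rule, the check below scans the blocks whose block time falls within a deploy's live window. The `Block` type and `is_replay` function are hypothetical names, and treating both window boundaries as inclusive is an assumption made for the example rather than a statement about the node's behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    timestamp_ms: int           # block time
    deploy_hashes: frozenset    # hashes of the deploys included in the block


def is_replay(deploy_hash, deploy_timestamp_ms, deploy_ttl_ms, blocks):
    """True if the deploy already appears in a block inside its live window."""
    window_start = deploy_timestamp_ms
    window_end = deploy_timestamp_ms + deploy_ttl_ms
    for block in blocks:
        in_window = window_start <= block.timestamp_ms <= window_end
        if in_window and deploy_hash in block.deploy_hashes:
            return True
    return False
```

Because a deploy's TTL can be as large as the network maximum, the set of blocks a node must be able to scan in this way grows with the maximum TTL setting.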

Clearing up a common TTL misconception

All versions of the released Casper node software to date use a FIFO (first in, first out) scheme. A leader node proposes the deploys it has received in the order it received them, whether directly or via the gossip mechanism.

The TTL setting of a deploy does not indicate a deferment of execution; setting a deploy to have a TTL of 24 hours and sending it to the network immediately does not result in the deploy being buffered for 24 hours and then included in a proposed block.

Rather, if a deploy were created with a 24 hour TTL, was held for 23+ hours then signed and submitted to the network, it would still be a viable deploy assuming all other validity checks pass. It would be gossiped and buffered, and would be eligible for inclusion in a proposed block right up until its TTL expired.

Similarly, in a multisig scenario, if a deploy were created with a 24 hour TTL and signed by the initiating entity, then sent onward off-chain to one or more other signing entities, and the final signer submitted the deploy to one or more nodes on the network, if the TTL had not yet expired it would remain viable as described previously.
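To make the point concrete, here is a toy FIFO buffer, sketched under simplifying assumptions. `DeployBuffer` and its methods are hypothetical names for this example only, and removing deploys from the buffer as soon as they are proposed is a simplification.

```python
from collections import OrderedDict


class DeployBuffer:
    """Toy FIFO buffer: arrival order decides proposal order, not the TTL."""

    def __init__(self):
        # deploy hash -> expiry time in ms, kept in arrival order
        self._pending = OrderedDict()

    def register(self, deploy_hash, timestamp_ms, ttl_ms):
        # Re-registering a known deploy keeps its original position.
        self._pending.setdefault(deploy_hash, timestamp_ms + ttl_ms)

    def purge_expired(self, now_ms):
        """Drop deploys whose TTL has elapsed and return their hashes."""
        expired = [h for h, expiry in self._pending.items() if expiry <= now_ms]
        for h in expired:
            del self._pending[h]
        return expired

    def propose(self, block_time_ms, max_deploys):
        """Return up to max_deploys live deploys, oldest arrival first."""
        self.purge_expired(block_time_ms)
        chosen = list(self._pending)[:max_deploys]
        for h in chosen:
            del self._pending[h]
        return chosen
```

In this sketch a deploy with a 24 hour TTL that arrives first is proposed first; it is not held back for 24 hours. A deploy created with a 24 hour TTL and registered 23 hours later is still proposed as long as the block time is before its expiry.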

Impact of Maximum TTL on a Network

The maximum TTL setting of a given Casper network is a load-bearing chainspec setting which has multiple implications. Recall that the node must perform validity checks on deploys before including these in a block. Deploys are only expired when the TTL has elapsed.

Therefore, as the duration of the max TTL increases, so does the potential memory pressure on the deploy buffer and the amount of overhead involved to enforce the deploy replay protection rules. In addition to the increased pressure on the nodes under normal conditions, it is also a scaling challenge as longer durations eventually present a resource exhaustion attack vector.
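A back-of-the-envelope calculation shows how the replay protection window scales with the maximum TTL. The block interval and per-block deploy limit below are placeholder values chosen only for illustration; they are not mainnet settings, and the function names are hypothetical.

```python
import math

HOUR_S = 3600
BLOCK_INTERVAL_S = 30        # placeholder, not a mainnet value
MAX_DEPLOYS_PER_BLOCK = 50   # placeholder, not a mainnet value


def blocks_in_replay_window(max_ttl_s, block_interval_s):
    """Upper bound on the number of blocks produced within one max-TTL window."""
    return math.ceil(max_ttl_s / block_interval_s)


def deploys_in_replay_window(max_ttl_s, block_interval_s, max_deploys_per_block):
    """Upper bound on the included deploys that replay checks must cover."""
    return blocks_in_replay_window(max_ttl_s, block_interval_s) * max_deploys_per_block


for hours in (24, 18, 12):
    ttl_s = hours * HOUR_S
    print(
        hours,
        blocks_in_replay_window(ttl_s, BLOCK_INTERVAL_S),
        deploys_in_replay_window(ttl_s, BLOCK_INTERVAL_S, MAX_DEPLOYS_PER_BLOCK),
    )
```

Whatever values are substituted for the placeholders, the bound scales linearly with the maximum TTL, so an 18 hour setting gives 75% of the 24 hour figure and a 12 hour setting gives 50%.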

In addition to this, new nodes cannot become validators and participate in consensus until they can perform the deploy validation for new deploys. This means that all new joining nodes must have all the deploys with unexpired TTL (maximum TTL), and a contiguous segment of complete blocks for the same time period.

Without this information, the node cannot validate blocks or propose new blocks.

Changes coming with 1.5

The 1.5 version of the protocol joins new nodes at the tip of the network rather than starting them at genesis and forcing them to grind forward for weeks to eventually catch up to the current state of the chain and only then be able to participate in the network. Now nodes join up to the tip of the chain and are able to follow along and act as participating nodes relatively quickly. Such nodes do not have all blocks on the chain, rather they start filling in a tip-centric sequence of the chain as they go. They naturally attain new blocks as they come into existence simply by participating.

In 1.4.x and prior versions, every node (validator or not) was required to have all data of all blocks from genesis onward. In 1.5 and onward, this restriction is no longer necessary. Nodes may still opt to attempt to acquire all historical data back to genesis, but a new option called SyncToTTL is offered which allows nodes to fully participate in the network (including becoming a validator) if they at least have a continuous chain of complete blocks covering the current time window dictated by now back to the network’s maximum TTL setting (actually implemented with a small additional safety margin to avoid fencepost concerns).

The 1.5 version of the node also provides a mechanism for such nodes to acquire historical data for past blocks from other nodes on the network that have it. However, all such work on both the node asking and the nodes responding is prioritized below all essential functions of the network. On a network that has sufficient capacity, filling in historical blocks happens semi-constantly in the background as cycles are available. On a busy network, it happens sporadically on an as-able basis.

For nodes that have no intent to validate, there is no particular urgency to this eventually consistent process. However, for a node that intends to participate in the validation process, that node must acquire sufficient data to cover the time window determined by the max TTL of that network.

There are two ways to accomplish this. The first is that the above-mentioned historical data acquisition process will eventually fill in this data, subject to that network’s available cycles. The other way is to simply run that node for a length of time equal to the same time window; i.e. if the max TTL of the network is 24 hours, then a node that has been joined to the network and following new blocks for 24 hours will naturally acquire a contiguous segment of complete blocks.

These two approaches are not mutually exclusive; in the best case such a node will acquire the necessary state relatively quickly via the historical synchronization process, and in the worst case will acquire the necessary state as the time window advances forward until the applicable period has been directly observed.
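The readiness condition described above can be sketched as a coverage check over the complete blocks a node holds. This is an illustration under stated assumptions: `covers_ttl_window` is a hypothetical helper, blocks are represented as plain `(height, timestamp_ms)` pairs, the highest block held is assumed to be the tip, and `margin_ms` stands in for the small safety margin mentioned earlier.

```python
def covers_ttl_window(blocks, now_ms, max_ttl_ms, margin_ms=0):
    """Check for a gap-free run of complete blocks from the tip back past the window start.

    blocks: iterable of (height, timestamp_ms) pairs for complete blocks held locally.
    """
    window_start = now_ms - max_ttl_ms - margin_ms
    prev_height = None
    # Walk from the highest block downwards.
    for height, timestamp_ms in sorted(blocks, reverse=True):
        if prev_height is not None and height != prev_height - 1:
            return False  # gap found before the window was covered
        if timestamp_ms <= window_start or height == 0:
            return True   # reached the start of the window (or genesis)
        prev_height = height
    return False          # ran out of blocks inside the window
```

Either route, historical synchronization or simply following the chain for long enough, ends with this kind of check passing; a shorter maximum TTL moves `window_start` closer to the present, so fewer blocks are needed.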

Either way, the longer the maximum TTL setting of the network the larger the burden of work and the more time it takes to advance to the desired state of being able to enforce the deploy replay protection rules.

As mentioned, the current maximum TTL setting in mainnet is 24 hours, and the software is designed and tested against that value. However, as illustrated, a shorter maximum TTL reduces system overhead and the burden on both newly joining nodes and the other nodes servicing their requests for sufficient historical data to satisfy TTL awareness. It also shortens the worst-case time frame for a node attempting to enter the validation process.

Thus, the recommendation is to shorten the maximum TTL in mainnet as part of the 1.5 release to 12 or 18 hours. This would have no effect on the large majority of users of the chain. However, some entities that have batch processing middleware and / or multisig processes may be negatively impacted; particularly in multisig scenarios where signatories are geographically distant or otherwise temporally asynchronous and need time to collect sufficient signatures.

Analysis of on-chain transactions

An analysis of mainnet reveals that over 99% of deploys are included within 2 hours of the timestamp of the deploy. A chart with the raw data is available here

Answered by MParlikar

MParlikar · 2023-05-26T22:22:47Z
Author

We recommend 18 hours for the new max TTL setting.

11 replies

sacherjj May 27, 2023
Maintainer

@GuybrushX The calculation went through all deploys to get all TTLs used. Then I took the block time and deploy time and made the duration calculation. It seemed to make sense to group by the TTL as this shows the intent of the deploy to be a long "in process" deploy.

mrkara May 28, 2023
Maintainer

Thank you for the proposal!

Looks like the pros of the proposed change outweigh the cons, especially considering the future timestamping option for offline multisig scenarios, and the possible gain of performance for the whole network on multiple fronts.

So, let's do it. 👍

Nikolay-everstake May 28, 2023

I agree with @mssteuer. Reduction of Max TTL can cause a real problem with multisig if signers are in different time zones. But if it wasn't causing a lot of multisig issues before, minus 4 hours is probably not a big problem if it really reduces the load on nodes.

GuybrushX May 28, 2023

Quoting mrkara: "Thank you for the proposal! Looks like the pros of the proposed change outweigh the cons, especially considering the future timestamping option for offline multisig scenarios, and the possible gain of performance for the whole network on multiple fronts. So, let's do it. 👍"

Since it seems that the offline multi-sig deploys are the only thing which are affected negatively but it can be worked around with the future timestamp thing I agree as well.

Go for it 👍

I would just like to see a confirmation/example of how this would work in reality for such multi-sig deploys. But that is not a blocker for me since it seems to work one way or the other, someone only has to work out and document the details because I think this isn't that straightforward if I understood everything correctly and it needs some timing and coordination from the signing parties.

GuybrushX May 28, 2023

Quoting Nikolay-everstake: "I agree with @mssteuer. Reduction of Max TTL can cause a real problem with multisig if signers are in different time zones. But if it wasn't causing a lot of multisig issues before, minus 4 hours is probably not a big problem if it really reduces the load on nodes."

Actually it's 6h :-)

GuybrushX · 2023-05-27T09:17:52Z

My understanding so far was that with a tool like casper-client you can create an offline deploy with your custom TTL instead of the default of 30 minutes.

That leads to a few questions:

Such offline deploys aren't sent to the network yet hence they won't cause any load there until they are sent to the network?
Not sure how the CRDAO is using that multi-sig feature (offline or on-chain?) but if it's offline: couldn't the TTL/timestamp be faked by changing the local clock or adding that feature to casper-client? Setting the TTL to 18h and the timestamp to a future date, e. g. 1 day in the future? Not very nice and it maybe shouldn't be possible but if that works...
Reading the summary it sounds like multi-sig deploys can be sent to the network already before all required signatures are collected and the signatures can be added on-chain as long as the TTL isn't expired? Such deploys are in a pending state and either expire when not enough signatures are added within the TTL period or executed earlier I guess?
Would it be possible/an option to reset the TTL with each new signature? Let's say 3 signatures are required and the TTL is set to 12h. After 10h the next signature is added (or updated to artificially increase the TTL if needed) and the TTL will reset to 12h again -> if that happens on-chain it would probably need some changes of the casper-node including tests
Could we discourage extended TTLs by imposing higher fees, thus encouraging quicker signing but potentially disadvantaging those requiring longer TTLs like the CRDAO?
Is a TTL indispensable? Could this function be managed through a smart contract? Could we employ eras or blocks for TTL during node upgrades, or is real-world time essential for TTL, particularly when the network is stuck and the TTL is used to expire deploys and free up resources?
If a new (1.5) validator joined and a deploy with a 24-hour TTL was submitted 2 minutes later, would the new validator have to wait until the deploy either expires or is successfully executed?
Who decides in the end which TTL will be used?

12 replies

MParlikar May 30, 2023
Author

To be clear, deploys that are in the deploy queue are a load to the network. New joining nodes have to acquire them. Validating nodes have to check for replay attacks against them. Deploys consume memory. All of this is independent of the deploy being included in a block.

GuybrushX May 30, 2023

Quoting MParlikar: "To be clear, deploys that are in the deploy queue are a load to the network. New joining nodes have to acquire them. Validating nodes have to check for replay attacks against them. Deploys consume memory. All of this is independent of the deploy being included in a block."

Can you give an example please for which or when deploys end up in the deploy queue for the block proposer / deploy buffer and how the reduced TTL will help here?

KillianH May 30, 2023

Quoting MParlikar: "To be clear, deploys that are in the deploy queue are a load to the network. New joining nodes have to acquire them. Validating nodes have to check for replay attacks against them. Deploys consume memory. All of this is independent of the deploy being included in a block."

Quoting GuybrushX: "Can you give an example please for which or when deploys end up in the deploy queue for the block proposer / deploy buffer and how the reduced TTL will help here?"

Deploys end up in the deploy queue when they are sent to the network and are valid. Reducing the TTL would (in a way) limit the maximum number of deploys able to fit in the queue. I guess the node can calculate the maximum number of deploys it can have in the buffer for the next 18h just by multiplying out the number of deploys that can be sent in blocks.

MParlikar May 31, 2023
Author

"The maximum TTL setting of a given Casper network is a load-bearing chainspec setting which has multiple implications. Recall that the node must perform validity checks on deploys before including these in a block. Deploys are only expired when the TTL has elapsed." Until the TTL is expired, the deploys remain on the node. Today (pre-1.4.x), the node stores ALL deploys since genesis, because these deploys are required in order to build up global state to the tip of the chain. After 1.5, new joiner nodes will simply acquire chunks of global state + deploys with unexpired TTL.

MParlikar May 31, 2023
Author

Quoting MParlikar: "To be clear, deploys that are in the deploy queue are a load to the network. New joining nodes have to acquire them. Validating nodes have to check for replay attacks against them. Deploys consume memory. All of this is independent of the deploy being included in a block."

Quoting GuybrushX: "Can you give an example please for which or when deploys end up in the deploy queue for the block proposer / deploy buffer and how the reduced TTL will help here?"

Quoting KillianH: "Deploys end up in the deploy queue when they are sent to the network and are valid. Reducing the TTL would (in a way) limit the maximum number of deploys able to fit in the queue. I guess the node can calculate the maximum number of deploys it can have in the buffer for the next 18h just by multiplying out the number of deploys that can be sent in blocks."

Need to separate 'pending deploys' from 'deploys' - Today the node stores all deploys. Pending deploys are the ones that are queued up. Deploys with unexpired TTL are stored on the node & checked against for block validation and replay attacks.
