[Bug]: `".app"` Folder lost and chain reinitialized when node runs out of space (Cosmovisor) #21617

0xSpuddy · 2024-09-09T18:24:05Z

Is there an existing issue for this?

I have searched the existing issues

What happened?

This is my first issue, hello! Let me know where I can provide more information (if desired) and I'll do my best.

What happened:
While running the latest Tellor testnet, a validator node that was running via systemd / cosmovisor was found to be jailed. When the operator logged in, they found that the chain's ".app" folder (~/.layer in our case) was in an odd, newly initialized state. The .app/config directory had no genesis file, and the .app/data directory had no state data. The snapshots, keyring-test, and cosmovisor directories were gone.

I have been investigating all the logs I can find from the machine, but it seems that it was running normally prior to whatever event changed the .layer folder. I'll put the setup details in the "How to reproduce?" section.

cosmovisor config variables:

# cosmovisor
export DAEMON_NAME=layerd
export DAEMON_HOME=$HOME/.layer
export DAEMON_RESTART_AFTER_UPGRADE=true
export DAEMON_ALLOW_DOWNLOAD_BINARIES=false
export DAEMON_POLL_INTERVAL=300ms
export UNSAFE_SKIP_BACKUP=true
export DAEMON_PREUPGRADE_MAX_RETRIES=0

Cosmos SDK Version

v0.50.9

How to reproduce?

The setup:

machine: Amazon EC2 t2.xlarge (4 cores, 16 GB RAM)
storage: 256 GB (the test chain is about 100 GB)
Go version: 1.23
cosmovisor was configured to run layer version v0.6.1 as a systemd service. (2 upgrades were completed gracefully by this test validator.)
start command: ./layerd start --api.enable --api.swagger --price-daemon-enabled=false --panic-on-daemon-failure-enabled=false --home /home/user/.layer --key-name $ACCOUNT_NAME
cosmovisor was built from commit 33c463ec278702765c380afa69714f1e1b141271


julienrbrt · 2024-09-09T19:41:50Z

That's a weird one, never saw this before. Cosmovisor doesn't delete anything ever in the node directory.

Did this happen more than 1 time? Directly after an upgrade? If so what are the exact reproducing steps?

Could you show your upgrade handlers logic?

0xSpuddy · 2024-09-09T19:50:01Z

> That's a weird one, never saw this before. Cosmovisor doesn't delete anything ever in the node directory.
>
> Did this happen more than 1 time? Directly after an upgrade? If so what are the exact reproducing steps?
>
> Could you show your upgrade handlers logic?

The validator was jailed for inactivity (because the folder was broken) 3 days after this upgrade:

func (app *App) RegisterUpgradeHandlers() {
    const UpgradeName = "v0.7.1-alpha1"

    app.UpgradeKeeper.SetUpgradeHandler(
        UpgradeName,
        func(ctx context.Context, _ upgradetypes.Plan, fromVM module.VersionMap) (module.VersionMap, error) {
            return app.ModuleManager().RunMigrations(ctx, app.Configurator(), fromVM)
        },
    )

    upgradeInfo, err := app.UpgradeKeeper.ReadUpgradeInfoFromDisk()
    if err != nil {
        panic(err)
    }

    if upgradeInfo.Name == UpgradeName && !app.UpgradeKeeper.IsSkipHeight(upgradeInfo.Height) {
        storeUpgrades := storetypes.StoreUpgrades{}

        // configure store loader that checks if version == upgradeHeight and applies store upgrades
        app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height, &storeUpgrades))
    }
}

It happened again today to a tester from the public. (Otherwise I would have 100% assumed user error, which it still could be.) I just reached out to them to ask if they could add their findings here too. 🙏

Will get help for the upgrade handler's logic and edit this comment with that in a bit.

julienrbrt · 2024-09-10T13:19:58Z

Okay, your upgrade logic looks sane (I thought that maybe you were deleting the node home there, lol).
Yes, any info would be useful to help us investigate.

0xSpuddy · 2024-09-12T14:23:22Z

Update: I was able to reproduce the error (super scientific method of starting a genesis sync on the same machine). Logs are attached to this comment; this is the line logged at the moment the folder changed:

Sep 12 09:01:32 ip-10-0-2-44 bash[24633]: panic: failed to write batch, write /home/admin/.layer/data/application.db/2745596.log: no space left on device

I haven't heard back from the other person this happened to, so I don't know whether it was just a crazy coincidence that we both ran out of storage at the same time and lost data. (They were running multiple chains on the same server.)

It is not ideal that the .layer (.app) directory is lost when this happens.
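Since the panic above is a plain "no space left on device" error, one possible mitigation is to check free space on the data volume before the node starts. The following is a minimal Go sketch under stated assumptions: it is not an existing cosmovisor or layerd feature, `freeBytes` and `hasHeadroom` are hypothetical helper names, the 10 GiB threshold is an arbitrary placeholder, and the `syscall.Statfs` call is only available on Unix-like systems such as the Linux machine in this report.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// freeBytes reports the bytes available to unprivileged users on the
// filesystem that contains path (Unix-like systems only).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// hasHeadroom reports whether free space meets the required minimum.
func hasHeadroom(free, required uint64) bool {
	return free >= required
}

func main() {
	// Path to check; pass the node home (e.g. /home/user/.layer) as the first argument.
	path := "."
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	// Hypothetical threshold: 10 GiB. Pick a value that suits the chain's growth rate.
	const required = 10 << 30

	free, err := freeBytes(path)
	if err != nil {
		fmt.Fprintf(os.Stderr, "statfs %s: %v\n", path, err)
		return
	}
	if hasHeadroom(free, required) {
		fmt.Printf("ok: %d bytes free on %s\n", free, path)
	} else {
		fmt.Printf("low disk space: %d bytes free on %s, want at least %d\n", free, path, uint64(required))
	}
}
```

A wrapper that exits non-zero in the low-space case could be run before the node is started, so the process refuses to start instead of panicking mid-write. This only avoids the trigger; it does not explain why the directory ends up reinitialized.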

I will keep this aws machine as is as long as the issue is open in case anyone wants more information.

layer_rug_moment.txt

julienrbrt · 2024-09-13T07:35:29Z

Interesting, I wonder how this happens.

akhilkumarpilli · 2024-11-26T11:55:42Z

Hey @0xSpuddy, we found that the issue isn’t with cosmovisor but with the layerd application. We manually ran the layerd devnet without cosmovisor and saw the same problem. However, when we tried reproducing it with simapp, everything worked fine. Could you please review your code?

Closing this issue for now. If you find no issues with your code, feel free to reopen it.

0xSpuddy added the T:Bug label Sep 9, 2024

github-project-automation bot added this to Interchain Public Works Sep 9, 2024

tac0turtle added this to Cosmos-SDK Sep 9, 2024

github-project-automation bot moved this to 📋 Backlog in Cosmos-SDK Sep 9, 2024

julienrbrt added the S:needs more info label Sep 9, 2024

0xSpuddy changed the title ~~[Bug]: Node Running with Cosmovisor Deletes and Re-Inits ".app" Folder~~ [Bug]: ".app" Folder lost and chain reinitialized when node runs out of space (Cosmovisor) Sep 12, 2024

julienrbrt removed the S:needs more info label Sep 13, 2024

julienrbrt assigned akhilkumarpilli Nov 18, 2024

julienrbrt added the C:Cosmovisor label Nov 18, 2024

akhilkumarpilli moved this from 📋 Backlog to 🤸‍♂️ In Progress in Cosmos-SDK Nov 21, 2024

akhilkumarpilli closed this as completed Nov 26, 2024

github-project-automation bot moved this to 🥳 Done in Interchain Public Works Nov 26, 2024

github-project-automation bot moved this from 🤸‍♂️ In Progress to 🥳 Done in Cosmos-SDK Nov 26, 2024
