[Bug]: ".app" Folder lost and chain reinitialized when node runs out of space (Cosmovisor) #21617

Closed
1 task done
0xSpuddy opened this issue Sep 9, 2024 · 6 comments
Assignees
Labels
C:Cosmovisor Issues and PR related to Cosmovisor T:Bug

Comments

@0xSpuddy

0xSpuddy commented Sep 9, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

This is my first issue, hello! Let me know where I can provide more information (if desired) and I'll do my best.

What happened:
While running the latest Tellor testnet, a validator node running via systemd and cosmovisor was found to be jailed. When the operator logged in, they found the chain's ".app" folder (~/.layer in our case) in an odd, newly initialized state. The .app/config directory had no genesis file, and the .app/data directory had no state data. The snapshots, keyring-test, and cosmovisor directories were gone.

I have been investigating all the logs I can find from the machine, but it seems that it was running normally prior to whatever event changed the .layer folder. I'll put the setup details under "How to reproduce?".

cosmovisor config variables:

# cosmovisor
export DAEMON_NAME=layerd
export DAEMON_HOME=$HOME/.layer
export DAEMON_RESTART_AFTER_UPGRADE=true
export DAEMON_ALLOW_DOWNLOAD_BINARIES=false
export DAEMON_POLL_INTERVAL=300ms
export UNSAFE_SKIP_BACKUP=true
export DAEMON_PREUPGRADE_MAX_RETRIES=0

Cosmos SDK Version

v0.50.9

How to reproduce?

The setup:

  • machine: Amazon EC2 t2.xl (4 cores, 16 GB RAM)
  • storage: just 256 GB (the test chain is about 100 GB)
  • go version 1.23
  • cosmovisor was configured to run layer version v0.6.1 as a systemd service. (2 upgrades were completed gracefully by this test validator.)
  • start command: ./layerd start --api.enable --api.swagger --price-daemon-enabled=false --panic-on-daemon-failure-enabled=false --home /home/user/.layer --key-name $ACCOUNT_NAME
  • cosmovisor was built from Commit #33c463ec278702765c380afa69714f1e1b141271
@julienrbrt
Member

That's a weird one, never saw this before. Cosmovisor doesn't delete anything ever in the node directory.

Did this happen more than 1 time? Directly after an upgrade? If so what are the exact reproducing steps?

Could you show your upgrade handlers logic?

@julienrbrt julienrbrt added the S:needs more info This bug can't be addressed until more information is provided by the reporter. label Sep 9, 2024
@0xSpuddy
Author

0xSpuddy commented Sep 9, 2024

That's a weird one, never saw this before. Cosmovisor doesn't delete anything ever in the node directory.

Did this happen more than 1 time? Directly after an upgrade? If so what are the exact reproducing steps?

Could you show your upgrade handlers logic?

The validator was jailed for inactivity (because the folder was broken) 3 days after this upgrade:

func (app *App) RegisterUpgradeHandlers() {
    const UpgradeName = "v0.7.1-alpha1"

    app.UpgradeKeeper.SetUpgradeHandler(
        UpgradeName,
        func(ctx context.Context, _ upgradetypes.Plan, fromVM module.VersionMap) (module.VersionMap, error) {
            return app.ModuleManager().RunMigrations(ctx, app.Configurator(), fromVM)
        },
    )

    upgradeInfo, err := app.UpgradeKeeper.ReadUpgradeInfoFromDisk()
    if err != nil {
        panic(err)
    }

    if upgradeInfo.Name == UpgradeName && !app.UpgradeKeeper.IsSkipHeight(upgradeInfo.Height) {
        storeUpgrades := storetypes.StoreUpgrades{}

        // configure store loader that checks if version == upgradeHeight and applies store upgrades
        app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height, &storeUpgrades))
    }
}

It happened again today to a tester from the public. (Otherwise I would have 100% assumed user error, which it still could be.) I just reached out to them to ask whether they could add their findings here too. 🙏

Will get help for the upgrade handler's logic and edit this comment with that in a bit.

@julienrbrt
Member

Okay, your upgrade logic looks sane (I thought that maybe you were deleting the node home there, lol).
Yes, any info would be useful to help us investigate.

@0xSpuddy
Author

0xSpuddy commented Sep 12, 2024

Update: I was able to reproduce the error (super scientific method of starting a genesis sync on the same machine). Logs are attached to this comment, but this is the line from the moment the folder changed:

Sep 12 09:01:32 ip-10-0-2-44 bash[24633]: panic: failed to write batch, write /home/admin/.layer/data/application.db/2745596.log: no space left on device

I haven't heard back from the other person this happened to, so I don't know whether it was just a crazy coincidence that we both ran out of storage at the same time and lost data. (They were running multiple chains on the same server.)

It is not ideal that the .layer (.app) directory is lost when this happens.

I will keep this AWS machine as is for as long as the issue is open, in case anyone wants more information.

layer_rug_moment.txt

@0xSpuddy 0xSpuddy changed the title [Bug]: Node Running with Cosmovisor Deletes and Re-Inits ".app" Folder [Bug]: ".app" Folder lost and chain reinitialized when node runs out of space (Cosmovisor) Sep 12, 2024
@julienrbrt
Member

Interesting, I wonder how this happens.

@julienrbrt julienrbrt removed the S:needs more info This bug can't be addressed until more information is provided by the reporter. label Sep 13, 2024
@julienrbrt julienrbrt added the C:Cosmovisor Issues and PR related to Cosmovisor label Nov 18, 2024
@akhilkumarpilli akhilkumarpilli moved this from 📋 Backlog to 🤸‍♂️ In Progress in Cosmos-SDK Nov 21, 2024
@akhilkumarpilli
Contributor

Hey @0xSpuddy, we found that the issue isn’t with cosmovisor but with the layerd application. We manually ran the layerd devnet without cosmovisor and saw the same problem. However, when we tried reproducing it with simapp, everything worked fine. Could you please review your code?

Closing this issue for now. If you find no issues with your code, feel free to reopen it.

@github-project-automation github-project-automation bot moved this from 🤸‍♂️ In Progress to 🥳 Done in Cosmos-SDK Nov 26, 2024

3 participants