
Lotus daemon crashes on start in Badger: file size: 2207107581 greater than 4294967295 #1831

Closed
jimpick opened this issue May 25, 2020 · 17 comments


@jimpick
Contributor

jimpick commented May 25, 2020

Describe the bug

Lotus daemon dies on restart.

2020-05-25T00:11:27.441-0700	INFO	main	lotus/daemon.go:124	lotus repo: /home/lotus1/.lotus
2020-05-25T00:11:27.442-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-2-b62098629d07946e9028127e70295ed996fe3ed25b0f9f88eb610a0ab4385a3c.vk is ok
2020-05-25T00:11:27.442-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-0170db1f394b35d995252228ee359194b13199d259380541dc529fb0099096b0.vk is ok
2020-05-25T00:11:27.442-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-stacked-proof-of-replication-merkletree-poseidon_hasher-8-0-0-sha256_hasher-ecd683648512ab1765faa2a5f14bab48f676e633467f0aa8aad4b55dcb0652bb.vk is ok
2020-05-25T00:11:27.442-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-5294475db5237a2e83c3e52fd6c2b03859a1831d45ed08c4f35dbf9a803165a9.vk is ok
2020-05-25T00:11:27.442-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.vk is ok
2020-05-25T00:11:27.442-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-0-559e581f022bb4e4ec6e719e563bf0e026ad6de42e56c18714a2c692b1b88d7e.vk is ok
2020-05-25T00:11:27.442-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-2-sha256_hasher-96f1b4a04c5c51e4759bbf224bbc2ef5a42c7100f16ec0637123f16a845ddfb2.vk is ok
2020-05-25T00:11:27.443-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-7d739b8cf60f1b0709eeebee7730e297683552e4b69cab6984ec0285663c5781.vk is ok
2020-05-25T00:11:27.443-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-3ea05428c9d11689f23529cde32fd30aabd50f7d2c93657c1d3650bca3e8ea9e.vk is ok
2020-05-25T00:11:27.443-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-stacked-proof-of-replication-merkletree-poseidon_hasher-8-0-0-sha256_hasher-6babf46ce344ae495d558e7770a585b2382d54f225af8ed0397b8be7c3fcd472.vk is ok
2020-05-25T00:11:27.443-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-50c7368dea9593ed0989e70974d28024efa9d156d585b7eea1be22b2e753f331.vk is ok
2020-05-25T00:11:27.443-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-0cfb4f178bbb71cf2ecfcd42accce558b27199ab4fb59cb78f2483fe21ef36d9.vk is ok
2020-05-25T00:11:27.443-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-stacked-proof-of-replication-merkletree-poseidon_hasher-8-0-0-sha256_hasher-032d3138d22506ec0082ed72b2dcba18df18477904e35bafee82b3793b06832f.vk is ok
2020-05-25T00:11:27.446-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-2-2627e4006b67f99cef990c0a47d5426cb7ab0a0ad58fc1061547bf2d28b09def.vk is ok
2020-05-25T00:11:27.446-0700	INFO	build	[email protected]/paramfetch.go:127	Parameter file /var/tmp/filecoin-proof-parameters/v26-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-0-0377ded656c6f524f1618760bffe4e0a1c51d5a70c4509eedae8a27555733edc.vk is ok
2020-05-25T00:11:28.449-0700	INFO	badger	[email protected]/logger.go:46	All 701 tables opened in 985ms

2020/05/25 00:11:28 file size: 2207107581 greater than 4294967295
github.com/dgraph-io/badger/v2/y.AssertTruef
	/home/lotus1/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/y/error.go:62
github.com/dgraph-io/badger/v2.(*logFile).open
	/home/lotus1/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/value.go:910
github.com/dgraph-io/badger/v2.(*valueLog).open
	/home/lotus1/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/value.go:1120
github.com/dgraph-io/badger/v2.Open
	/home/lotus1/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:380
github.com/ipfs/go-ds-badger2.NewDatastore
	/home/lotus1/go/pkg/mod/github.com/ipfs/[email protected]/datastore.go:137
github.com/filecoin-project/lotus/node/repo.(*fsLockedRepo).Datastore.func1
	/home/lotus1/lotus/node/repo/fsrepo.go:268
sync.(*Once).doSlow
	/usr/local/go/src/sync/once.go:66
sync.(*Once).Do
	/usr/local/go/src/sync/once.go:57
github.com/filecoin-project/lotus/node/repo.(*fsLockedRepo).Datastore
	/home/lotus1/lotus/node/repo/fsrepo.go:264
github.com/filecoin-project/lotus/node/modules.Datastore
	/home/lotus1/lotus/node/modules/storage.go:30
reflect.Value.call
	/usr/local/go/src/reflect/value.go:460
reflect.Value.Call
	/usr/local/go/src/reflect/value.go:321
github.com/filecoin-project/lotus/node.as.func2
	/home/lotus1/lotus/node/options.go:140
reflect.callReflect
	/usr/local/go/src/reflect/value.go:549
reflect.makeFuncStub
	/usr/local/go/src/reflect/asm_amd64.s:20
reflect.Value.call
	/usr/local/go/src/reflect/value.go:460
reflect.Value.Call
	/usr/local/go/src/reflect/value.go:321
go.uber.org/dig.(*node).Call
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/dig.go:665
go.uber.org/dig.paramSingle.Build
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/param.go:245
go.uber.org/dig.paramList.BuildList
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/param.go:201
go.uber.org/dig.(*node).Call
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/dig.go:656
go.uber.org/dig.paramSingle.Build
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/param.go:245
go.uber.org/dig.paramList.BuildList
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/param.go:201
go.uber.org/dig.(*node).Call
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/dig.go:656
go.uber.org/dig.paramSingle.Build
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/param.go:245
go.uber.org/dig.paramList.BuildList
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/param.go:201
go.uber.org/dig.(*Container).Invoke
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/dig.go:389
go.uber.org/fx.(*App).executeInvokes
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/app.go:547
go.uber.org/fx.New
	/home/lotus1/go/pkg/mod/go.uber.org/[email protected]/app.go:345
github.com/filecoin-project/lotus/node.New
	/home/lotus1/lotus/node/builder.go:478
main.glob..func1
	/home/lotus1/lotus/cmd/lotus/daemon.go:175
gopkg.in/urfave/cli%2ev2.(*Command).Run
	/home/lotus1/go/pkg/mod/gopkg.in/urfave/[email protected]/command.go:167

Version (run lotus --version):

lotus version 0.3.0'+git92d58ab1.dirty'

Additional context

Ubuntu Linux. I've been running this node since near the start of testnet.

@schomatis
Contributor

In general, Badger splits the value logs into 4GB files, so it is very strange (barring data corruption) to see this assert fail; maybe we should report it to them.
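
Purely as an illustration (this is not Lotus or Badger code, and the path is just the one from the report above), a quick check like the following would confirm whether any .vlog has outgrown the 4294967295-byte (math.MaxUint32) cap that the failing assert enforces:

package main

import (
	"fmt"
	"math"
	"os"
	"path/filepath"
)

func main() {
	// Path assumed from the report above; adjust to your lotus repo.
	dir := filepath.Join(os.Getenv("HOME"), ".lotus", "datastore")
	vlogs, err := filepath.Glob(filepath.Join(dir, "*.vlog"))
	if err != nil {
		panic(err)
	}
	for _, name := range vlogs {
		info, err := os.Stat(name)
		if err != nil {
			continue
		}
		// The assert in the trace caps a value log file at math.MaxUint32 bytes.
		if uint64(info.Size()) > math.MaxUint32 {
			fmt.Printf("%s: %d bytes exceeds the 4294967295-byte limit\n", name, info.Size())
		}
	}
}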

@jimpick
Contributor Author

jimpick commented May 25, 2020

Looks like it's happening right when the Badger datastore is being opened. Perhaps it's corrupted? I'll investigate a bit more tomorrow.

@jimpick
Contributor Author

jimpick commented May 26, 2020

Looking at the largest files in my datastore, I have...

lotus1@nuc:~/.lotus/datastore$ ls -l | sort -rnk5 | head
-rw------- 1 lotus1 lotus1 62336649725 May 25 00:11 000012.vlog
-rw------- 1 lotus1 lotus1    17274407 May 20 09:44 004823.sst
-rw------- 1 lotus1 lotus1    17272388 May 18 08:12 003326.sst
-rw------- 1 lotus1 lotus1    17272194 May 21 07:13 005608.sst
-rw------- 1 lotus1 lotus1    17271748 May 18 12:19 003525.sst
-rw------- 1 lotus1 lotus1    17271434 May 24 19:11 500898.sst
-rw------- 1 lotus1 lotus1    17271208 May 19 07:42 004080.sst
-rw------- 1 lotus1 lotus1    17271148 May 17 21:05 002910.sst
-rw------- 1 lotus1 lotus1    17270919 May 18 04:32 003178.sst
-rw------- 1 lotus1 lotus1    17270721 May 22 14:40 007370.sst

I'll lose all my "deals" if I don't fix my node, so I'm going to see if I can dig into this and possibly fix it.

@schomatis
Contributor

Yes, that vlog file shouldn't be that big. You could try commenting out the assert and see if you can recover the contents (Badger used to have a recovery tool).

@jimpick
Contributor Author

jimpick commented May 26, 2020

Might be related? ipfs/go-ds-badger#54 (comment)

@jimpick
Contributor Author

jimpick commented May 26, 2020

@schomatis Can't just comment out the assert ... lots of code assumes the offsets are uint32s ... I'm hoping I can somehow split things up and process them in batches.
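
As a starting point, something like this naive sketch could carve the oversized file into sub-4GB batches. It is purely illustrative and assumes the 000012.vlog from the listing above: the pieces are raw byte slices, not valid .vlog files that Badger can open, since entries will straddle the chunk boundaries.

package main

import (
	"fmt"
	"io"
	"math"
	"os"
)

func main() {
	// Carve the oversized value log into raw sub-4GB slices for offline inspection.
	src, err := os.Open("000012.vlog")
	if err != nil {
		panic(err)
	}
	defer src.Close()

	const chunkSize = int64(math.MaxUint32) // 4294967295 bytes
	for i := 0; ; i++ {
		out, err := os.Create(fmt.Sprintf("000012.vlog.part%03d", i))
		if err != nil {
			panic(err)
		}
		_, err = io.CopyN(out, src, chunkSize)
		out.Close()
		if err == io.EOF {
			break // reached the end of the source file
		}
		if err != nil {
			panic(err)
		}
	}
}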

@jimpick
Contributor Author

jimpick commented May 27, 2020

I spent a few hours looking through the Badger code to see how it handles the vlog files... I could possibly save the data, but it would be a lot of effort.

@jimpick
Contributor Author

jimpick commented May 27, 2020

I'm very confused about how this happened ... I looked at some other nodes I have set up, and they don't appear to have runaway .vlog files. I'll set up a new node and watch it to see if it happens again.

@schomatis
Contributor

@jimpick At this point your best bet is to report this to Badger so they can best advise how to proceed. It is indeed a strange case; if the data itself is not corrupted, I'm confident you'll be able to retrieve most of it.

@jimpick
Contributor Author

jimpick commented May 27, 2020

I think this is related to "too many open files"...

This is probably unrelated, but I just saw this in my logs:

2020-05-27T16:22:19.376Z        WARN    badger  [email protected]/log.go:180        While running doCompact: While running compaction for: {elog:0xc04b7524e0 thisLevel:0xc00010efc0 nextLevel:0xc00010f500 top:[0xc0a3e68820] bot:[0xc016c6bd40 0xc02f6dd1e0] thisRange:{left:[47 98 108 111 99 107 115 47 98 108 111 99 107 115 47 85 68 83 65 69 73 67 65 79 83 85 83 65 65 50 67 55 90 69 78 87 53 67 80 85 74 80 70 71 80 53 70 73 86 53 68 52 80 72 74 89 79 68 90 78 88 69 87 70 88 78 81 54 54 77 79 51 77 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 98 108 111 99 107 115 47 85 68 83 65 69 73 67 66 50 73 66 67 65 51 86 50 72 78 90 83 79 81 69 87 51 71 81 50 89 71 82 71 85 80 66 90 82 50 75 67 51 52 71 55 68 77 89 52 88 74 73 71 51 80 72 65 71 85 255 255 255 255 255 255 255 255] inf:false} nextRange:{left:[47 98 108 111 99 107 115 47 98 108 111 99 107 115 47 85 68 83 65 69 73 67 65 67 68 88 79 69 67 87 79 72 70 76 51 79 70 71 89 84 73 69 81 79 68 77 81 86 85 85 66 66 79 79 85 70 89 53 66 83 76 80 70 51 68 68 74 80 52 88 89 51 85 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 98 108 111 99 107 115 47 85 68 83 65 69 73 67 66 50 82 83 73 68 88 70 54 69 89 84 50 89 53 77 86 73 69 68 83 68 78 83 65 80 66 87 65 78 78 66 90 84 78 55 77 69 79 78 86 71 75 77 69 68 83 69 89 73 85 255 255 255 255 255 255 255 255] inf:false} thisSize:3773624 dropPrefix:[]}: While opening new table: 20462: open /home/ubuntu/.lotus/datastore/020462.sst: too many open files

@jimpick
Contributor Author

jimpick commented May 27, 2020

Now, when I restart my node, it looks like it's trying to run Badger compaction, but then it fails due to "too many open files".

@schomatis
Contributor

I'm not sure compaction is related to the log file (I think it's more about converting the SSTs into logs, but I don't have the Badger model clearly in my head right now). Anyway, try increasing the ulimit -n to see if it helps.
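
Note that ulimit -n only applies to the shell it is run in and its children, so it needs to be raised in the shell that starts lotus daemon (or via the service's limits if you run it under systemd). Purely as an illustration of what that limit is, and not Lotus code, a small Go sketch on Linux would be:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Read the current RLIMIT_NOFILE ("open files") limits for this process.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// Raise the soft limit to the hard limit (the unprivileged maximum).
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("soft limit raised to %d\n", rl.Cur)
}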

@jimpick
Contributor Author

jimpick commented May 28, 2020

I'll close this now ... I increased the ulimit, and I'll re-open if it happens again. Definitely a weird failure mode, but I'm not sure how to reproduce it.

@roiger

roiger commented May 30, 2020

I have the same issue. Someone has already dealt with it here:
#1879
It works.

@RobQuistNL
Contributor

I think I'm having similar issues. I've started the daemon a couple of times: the first time over SSH, but since I had to shut down my laptop, I quit the daemon over SSH and restarted it on the machine itself so it could keep running.

Then it started to get stuck at certain blocks. After a few restarts it started syncing from a few blocks earlier, and now I keep getting these errors, and I can't even connect to the daemon anymore to view the sync height:

While running compaction for: {elog:0xc0004ca340 thisLevel:0xc00040a120 nextLevel:0xc00040a3c0 top:[0xc03f8ae0d0 0xc02f6535f0 0xc02f653790 0xc02f653930 0xc02f653ad0 0xc02f653c70 0xc02f653e10 0xc0406bef70 0xc0406bf040 0xc021b84000] bot:[0xc034c832b0 0xc034c83930 0xc043cf44e0 0xc020faedd0 0xc0406be4e0 0xc03976b1e0 0xc03976b520 0xc0406beb60 0xc03976b860 0xc043cf4820 0xc03976bba0 0xc043cf4b60 0xc02f652dd0 0xc0406beea0 0xc02f653110] thisRange:{left:[] right:[] inf:true} nextRange:{left:[33 98 97 100 103 101 114 33 104 101 97 100 0 0 0 0 0 0 0 0] right:[47 109 101 116 97 100 97 116 97 47 104 101 97 100 255 255 255 255 255 255 255 255] inf:false} thisSize:0 dropPrefix:[]}
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables
	~/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:674
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef
	~/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:837
github.com/dgraph-io/badger/v2.(*levelsController).doCompact
	~/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:906
github.com/dgraph-io/badger/v2.(*levelsController).runWorker
	~/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:367
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357: {elog:0xc0004ca340 thisLevel:0xc00040a120 nextLevel:0xc00040a3c0 top:[0xc03f8ae0d0 0xc02f6535f0 0xc02f653790 0xc02f653930 0xc02f653ad0 0xc02f653c70 0xc02f653e10 0xc0406bef70 0xc0406bf040 0xc021b84000] bot:[0xc034c832b0 0xc034c83930 0xc043cf44e0 0xc020faedd0 0xc0406be4e0 0xc03976b1e0 0xc03976b520 0xc0406beb60 0xc03976b860 0xc043cf4820 0xc03976bba0 0xc043cf4b60 0xc02f652dd0 0xc0406beea0 0xc02f653110] thisRange:{left:[] right:[] inf:true} nextRange:{left:[33 98 97 100 103 101 114 33 104 101 97 100 0 0 0 0 0 0 0 0] right:[47 109 101 116 97 100 97 116 97 47 104 101 97 100 255 255 255 255 255 255 255 255] inf:false} thisSize:0 dropPrefix:[]}
2020-06-01T16:04:42.453+0200	WARN	badger	[email protected]/log.go:180	While running doCompact: While running compaction for: {elog:0xc0004ca340 thisLevel:0xc00040a120 nextLevel:0xc00040a3c0 top:[0xc03f8ae0d0 0xc02f6535f0 0xc02f653790 0xc02f653930 0xc02f653ad0 0xc02f653c70 0xc02f653e10 0xc0406bef70 0xc0406bf040 0xc021b84000] bot:[0xc034c832b0 0xc034c83930 0xc043cf44e0 0xc020faedd0 0xc0406be4e0 0xc03976b1e0 0xc03976b520 0xc0406beb60 0xc03976b860 0xc043cf4820 0xc03976bba0 0xc043cf4b60 0xc02f652dd0 0xc0406beea0 0xc02f653110] thisRange:{left:[] right:[] inf:true} nextRange:{left:[33 98 97 100 103 101 114 33 104 101 97 100 0 0 0 0 0 0 0 0] right:[47 109 101 116 97 100 97 116 97 47 104 101 97 100 255 255 255 255 255 255 255 255] inf:false} thisSize:0 dropPrefix:[]}: While opening new table: 125352: open ~/.lotus/datastore/125352.sst: too many open files

2020-06-01T16:04:42.453+0200	INFO	badger	[email protected]/logger.go:46	Got compaction priority: {level:0 score:2 dropPrefix:[]}
2020-06-01T16:04:42.453+0200	INFO	badger	[email protected]/logger.go:46	Running for level: 0

The datastore file numbers seem to keep increasing, but I can't connect anymore:

$ lotus sync wait
2020-06-01T16:09:06.496+0200	WARN	main	lotus/main.go:81	dial tcp 127.0.0.1:1234: connect: connection refused

@RobQuistNL
Contributor

Like @roiger said, the .vlog file was ~1GB in size. I renamed it to 000022.vlog.bak and the daemon starts again with proper logging, and I can see the sync height again :) Let's hope it gets up to speed this time. I've been waiting for about 3 days now, running it non-stop.

@RobQuistNL
Contributor

After a while I started getting the "too many open files" error again. I increased the limit from 1024 to 4096 with ulimit -n 4096; let's see what happens now.
