
bug? seemingly random job abortion #1899

Open
krancour opened this issue Mar 31, 2022 · 1 comment

@krancour (Contributor)

Twice today, I saw a job aborted for no readily apparent reason, and the associated worker failed.

The first occurrence was with v2.3.1 and the second was after upgrading the dogfood cluster to v2.4.0-rc.2.

Here's the data I gathered from the second occurrence:

$ brig event get -i 61a4332c-c26e-4fa8-a3a0-4c5bd16cc107

ID                                  	PROJECT	SOURCE           	TYPE                 	AGE	WORKER PHASE
61a4332c-c26e-4fa8-a3a0-4c5bd16cc107	brigade	brigade.sh/github	ci:pipeline_requested	1h 	FAILED      

Event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" jobs:

NAME             	STARTED	ENDED	PHASE    
test-unit-js     	1h     	1h   	SUCCEEDED
yarn-audit       	1h     	1h   	SUCCEEDED
lint-chart       	1h     	1h   	SUCCEEDED
validate-examples	1h     	1h   	SUCCEEDED
lint-js          	       	     	PENDING  
validate-schemas 	1h     	1h   	SUCCEEDED
style-check-js   	1h     	1h   	SUCCEEDED
test-unit-go     	1h     	     	ABORTED  
lint-go          	       	     	PENDING

Worker logs reveal this:

...

2022-03-30T23:54:23.354Z ERROR: Error creating job "test-unit-js": An internal server error occurred.
/var/brigade-worker/worker/dist/prestart.js:204
        throw new Error(`error executing node brigade.js:\n\n${e.output}`);
        ^

...

An internal server error should have been logged by the API server, so I looked there next:

2022/03/30 23:54:23 error scheduling event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-js" on the substrate: error submitting execution task for event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-js": error sending amqp message for queue "jobs.brigade": unexpected frame: &amqp.performAttach{Name:"mwYiuAiQm58ubLB3bQ77krRjFxXwde2eDQTkYl1RObbAOV6rvi73nw", Handle:0x0, Role:true, SenderSettleMode:(*amqp.SenderSettleMode)(0xc0009a4610), ReceiverSettleMode:(*amqp.ReceiverSettleMode)(0xc0009a4611), Source:(*amqp.source)(nil), Target:(*amqp.target)(0xc00008dd10), Unsettled:amqp.unsettled(nil), IncompleteUnsettled:false, InitialDeliveryCount:0x0, MaxMessageSize:0x0, OfferedCapabilities:amqp.multiSymbol(nil), DesiredCapabilities:amqp.multiSymbol(nil), Properties:map[amqp.symbol]interface {}(nil)}
2022/03/30 23:54:23 error scheduling event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "validate-schemas" on the substrate: error submitting execution task for event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "validate-schemas": error sending amqp message for queue "jobs.brigade": unexpected frame: &amqp.performAttach{Name:"mwYiuAiQm58ubLB3bQ77krRjFxXwde2eDQTkYl1RObbAOV6rvi73nw", Handle:0x0, Role:true, SenderSettleMode:(*amqp.SenderSettleMode)(0xc0009a4610), ReceiverSettleMode:(*amqp.ReceiverSettleMode)(0xc0009a4611), Source:(*amqp.source)(nil), Target:(*amqp.target)(0xc00008dd10), Unsettled:amqp.unsettled(nil), IncompleteUnsettled:false, InitialDeliveryCount:0x0, MaxMessageSize:0x0, OfferedCapabilities:amqp.multiSymbol(nil), DesiredCapabilities:amqp.multiSymbol(nil), Properties:map[amqp.symbol]interface {}(nil)}
2022/03/30 23:54:23 error scheduling event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-go" on the substrate: error submitting execution task for event "61a4332c-c26e-4fa8-a3a0-4c5bd16cc107" job "test-unit-go": error sending amqp message for queue "jobs.brigade": unexpected frame: &amqp.performAttach{Name:"mwYiuAiQm58ubLB3bQ77krRjFxXwde2eDQTkYl1RObbAOV6rvi73nw", Handle:0x0, Role:true, SenderSettleMode:(*amqp.SenderSettleMode)(0xc0009a4610), ReceiverSettleMode:(*amqp.ReceiverSettleMode)(0xc0009a4611), Source:(*amqp.source)(nil), Target:(*amqp.target)(0xc00008dd10), Unsettled:amqp.unsettled(nil), IncompleteUnsettled:false, InitialDeliveryCount:0x0, MaxMessageSize:0x0, OfferedCapabilities:amqp.multiSymbol(nil), DesiredCapabilities:amqp.multiSymbol(nil), Properties:map[amqp.symbol]interface {}(nil)}
2022/03/30 23:54:24 http: panic serving 10.240.0.104:39214: runtime error: invalid memory address or nil pointer dereference
goroutine 128736 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1825 +0xbf
panic({0x1704b20, 0x2788530})
	/usr/local/go/src/runtime/panic.go:844 +0x258
github.com/Azure/go-amqp.(*Sender).Close(0xc000952660?, {0x1c0ea80?, 0xc000114008?})
	/go/pkg/mod/github.com/!azure/[email protected]/client.go:458 +0x1e
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter.func1()
	/src/v2/apiserver/internal/lib/queue/amqp/writer.go:109 +0x79
github.com/brigadecore/brigade-foundations/retries.ManageRetries({0x1c0ea80, 0xc000114000}, {0x191f5b3, 0xd}, 0xa, 0x0?, 0xc0007e00c0)
	/go/pkg/mod/github.com/brigadecore/[email protected]/retries/retries.go:50 +0x79
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter(0xc00008d950, {0xc00059c6d0, 0xc})
	/src/v2/apiserver/internal/lib/queue/amqp/writer.go:100 +0x24a
github.com/brigadecore/brigade/v2/apiserver/internal/api/kubernetes.(*substrate).ScheduleJob(_, {_, _}, {{{0xc00059c5f6, 0x7}, 0xc0009a8f60}, {0xc000a43fa0, 0x1f}, {{0xc0002e8bd0, 0x2, ...}, ...}, ...}, ...)
	/src/v2/apiserver/internal/api/kubernetes/substrate.go:569 +0xec
github.com/brigadecore/brigade/v2/apiserver/internal/api.(*jobsService).Create(0xc00008dea0, {0x1c0eaf0, 0xc0003e9a70}, {0xc00047c7e0, _}, {{0xc000873828, 0x7}, 0xc0009a88d0, {{{{...}, {...}, ...}, ...}, ...}, ...})
	/src/v2/apiserver/internal/api/jobs.go:504 +0xeaf
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create.func1()
	/src/v2/apiserver/internal/api/rest/jobs_endpoints.go:76 +0xef
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.ServeRequest({{0x1c0e0e8, 0xc0006f2ee0}, 0xc0009e4000, {0x1c0ecb0, 0xc00003a7a0}, {0x16ce580, 0xc00051d040}, 0xc00003ba00, 0xc9})
	/src/v2/apiserver/internal/lib/restmachinery/endpoints.go:123 +0xb6
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create(0xc0000aba80, {0x1c0e0e8, 0xc0006f2ee0}, 0xc0009e4000)
	/src/v2/apiserver/internal/api/rest/jobs_endpoints.go:69 +0x194
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*tokenAuthFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2ee0}, 0xc00096ef00)
	/src/v2/apiserver/internal/api/rest/token_auth_filter.go:154 +0xa6b
net/http.HandlerFunc.ServeHTTP(0xc00096ed00?, {0x1c0e0e8?, 0xc0006f2ee0?}, 0xb50000c0007cd740?)
	/usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0000c5200, {0x1c0e0e8, 0xc0006f2ee0}, 0xc00096ec00)
	/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0x1cf
github.com/rs/cors.(*Cors).Handler.func1({0x1c0e0e8, 0xc0006f2ee0}, 0xc00096ec00)
	/go/pkg/mod/github.com/rs/[email protected]/cors.go:219 +0x1bd
net/http.HandlerFunc.ServeHTTP(0x40d687?, {0x1c0e0e8?, 0xc0006f2ee0?}, 0x4f1001?)
	/usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.(*requestContextFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2ee0}, 0xc00096eb00)
	/src/v2/apiserver/internal/lib/restmachinery/request_context_filter.go:38 +0x288
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1c0e0e8?, 0xc0006f2ee0?}, 0x40d325?)
	/usr/local/go/src/net/http/server.go:2084 +0x2f
net/http.serverHandler.ServeHTTP({0x1c024f8?}, {0x1c0e0e8, 0xc0006f2ee0}, 0xc00096eb00)
	/usr/local/go/src/net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc00040c1e0, {0x1c0eaf0, 0xc000263e30})
	/usr/local/go/src/net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3071 +0x4db
2022/03/30 23:54:27 http: panic serving 10.240.0.104:39216: runtime error: invalid memory address or nil pointer dereference
goroutine 128753 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1825 +0xbf
panic({0x1704b20, 0x2788530})
	/usr/local/go/src/runtime/panic.go:844 +0x258
github.com/Azure/go-amqp.(*Session).Close(0x0, {0x1c0ea80, 0xc000114008})
	/go/pkg/mod/github.com/!azure/[email protected]/client.go:256 +0x30
github.com/brigadecore/brigade/v2/internal/amqp.(*session).Close(0xc0006b4cc0?, {0x1c0ea80?, 0xc000114008?})
	/src/v2/internal/amqp/session.go:38 +0x26
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter.func1()
	/src/v2/apiserver/internal/lib/queue/amqp/writer.go:112 +0xbc
github.com/brigadecore/brigade-foundations/retries.ManageRetries({0x1c0ea80, 0xc000114000}, {0x191f5b3, 0xd}, 0xa, 0x0?, 0xc0007440c0)
	/go/pkg/mod/github.com/brigadecore/[email protected]/retries/retries.go:50 +0x79
github.com/brigadecore/brigade/v2/apiserver/internal/lib/queue/amqp.(*writerFactory).NewWriter(0xc00008d950, {0xc0009a43c0, 0xc})
	/src/v2/apiserver/internal/lib/queue/amqp/writer.go:100 +0x24a
github.com/brigadecore/brigade/v2/apiserver/internal/api/kubernetes.(*substrate).ScheduleJob(_, {_, _}, {{{0xc000109502, 0x7}, 0xc0005a6210}, {0xc000a9c920, 0x1f}, {{0xc0001e7500, 0x2, ...}, ...}, ...}, ...)
	/src/v2/apiserver/internal/api/kubernetes/substrate.go:569 +0xec
github.com/brigadecore/brigade/v2/apiserver/internal/api.(*jobsService).Create(0xc00008dea0, {0x1c0eaf0, 0xc00061c210}, {0xc00047c600, _}, {{0xc00052d598, 0x7}, 0xc000a73668, {{{{...}, {...}, ...}, ...}, ...}, ...})
	/src/v2/apiserver/internal/api/jobs.go:504 +0xeaf
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create.func1()
	/src/v2/apiserver/internal/api/rest/jobs_endpoints.go:76 +0xef
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.ServeRequest({{0x1c0e0e8, 0xc0006f2d20}, 0xc000b17c00, {0x1c0ecb0, 0xc00003a7a0}, {0x16ce580, 0xc000110a90}, 0xc0008edf40, 0xc9})
	/src/v2/apiserver/internal/lib/restmachinery/endpoints.go:123 +0xb6
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*JobsEndpoints).create(0xc0000aba80, {0x1c0e0e8, 0xc0006f2d20}, 0xc000b17c00)
	/src/v2/apiserver/internal/api/rest/jobs_endpoints.go:69 +0x194
github.com/brigadecore/brigade/v2/apiserver/internal/api/rest.(*tokenAuthFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2d20}, 0xc00096e600)
	/src/v2/apiserver/internal/api/rest/token_auth_filter.go:154 +0xa6b
net/http.HandlerFunc.ServeHTTP(0xc00096e500?, {0x1c0e0e8?, 0xc0006f2d20?}, 0x190000c0007cd3c0?)
	/usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0000c5200, {0x1c0e0e8, 0xc0006f2d20}, 0xc00096e400)
	/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0x1cf
github.com/rs/cors.(*Cors).Handler.func1({0x1c0e0e8, 0xc0006f2d20}, 0xc00096e400)
	/go/pkg/mod/github.com/rs/[email protected]/cors.go:219 +0x1bd
net/http.HandlerFunc.ServeHTTP(0x40d687?, {0x1c0e0e8?, 0xc0006f2d20?}, 0x4f1001?)
	/usr/local/go/src/net/http/server.go:2084 +0x2f
github.com/brigadecore/brigade/v2/apiserver/internal/lib/restmachinery.(*requestContextFilter).Decorate.func1({0x1c0e0e8, 0xc0006f2d20}, 0xc00096e300)
	/src/v2/apiserver/internal/lib/restmachinery/request_context_filter.go:38 +0x288
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1c0e0e8?, 0xc0006f2d20?}, 0x40d325?)
	/usr/local/go/src/net/http/server.go:2084 +0x2f
net/http.serverHandler.ServeHTTP({0x1c024f8?}, {0x1c0e0e8, 0xc0006f2d20}, 0xc00096e300)
	/usr/local/go/src/net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc00040c320, {0x1c0eaf0, 0xc000263e30})
	/usr/local/go/src/net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3071 +0x4db

This has the appearance of the job failing to be placed on the queue, for reasons I cannot yet diagnose.
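
Both panics above originate in the error-handling closure inside writerFactory.NewWriter (writer.go:109 and writer.go:112), which appears to close a sender/session that was never successfully created (note the 0x0 receiver on (*Session).Close). Here's a minimal, self-contained sketch of that pattern (hypothetical types only, not the real go-amqp or Brigade code) that reproduces the same runtime error:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sender is a stand-in for an AMQP sender link; calling Close on a nil
// *sender panics, mirroring the (*Sender).Close frame in the trace above.
// All types here are hypothetical and for illustration only.
type sender struct{ closed bool }

func (s *sender) Close(ctx context.Context) error {
	s.closed = true // nil pointer dereference when s == nil
	return nil
}

// newSender simulates link attachment failing with an unexpected frame.
func newSender() (*sender, error) {
	return nil, errors.New("unexpected frame: &amqp.performAttach{...}")
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Prints: runtime error: invalid memory address or nil pointer dereference
			fmt.Println("panic:", r)
		}
	}()

	// Suspected shape of the retried closure: on any error, tear down the
	// (possibly nil) sender before returning so the next attempt starts clean.
	attempt := func() error {
		s, err := newSender()
		if err != nil {
			s.Close(context.Background()) // s is nil here -> panic
			return err
		}
		return nil
	}
	if err := attempt(); err != nil {
		fmt.Println("attempt failed:", err)
	}
}
```

If that matches what writer.go:109 and :112 are doing, a nil check before closing (or only closing what was actually created) would avoid the secondary panic, though it wouldn't explain the original "unexpected frame" error.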

However, if I look at the logs for that job, they exist. The job ran. In fact, it executed hundreds of unit tests before it was forcefully terminated. This means the job did make it onto the queue after all... and this explains a lot...

After the worker failed (or so it thought) to create a new job, it exited non-zero. The observer correctly recorded this as a worker failure. Eventual worker cleanup then forcefully killed the running job, which actually had been created.

There's more to investigate here, but the fundamental question is probably how we ended up with an error that makes it appear as though the job never made it onto the queue even though it very clearly did.
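
One way to end up in that state (purely a sketch of the hazard, not a claim about what go-amqp actually does here) is if the error surfaces on the sending side only after the broker has already accepted the message; the caller then reports a scheduling failure for a job that is, in fact, queued:

```go
package main

import (
	"errors"
	"fmt"
)

// broker is a toy in-memory queue standing in for the AMQP broker.
// Everything here is hypothetical; it only illustrates that an error
// returned to the caller does not prove the message never landed on the queue.
type broker struct{ queue []string }

// send accepts the message, then fails while handling an unrelated
// protocol frame on the same link, so the caller still sees an error.
func (b *broker) send(msg string) error {
	b.queue = append(b.queue, msg) // the message is already enqueued...
	return errors.New("unexpected frame: &amqp.performAttach{...}")
}

func main() {
	b := &broker{}
	if err := b.send(`job "test-unit-go"`); err != nil {
		// Caller-side view: "error sending amqp message" -> job treated as unscheduled.
		fmt.Println("scheduling error:", err)
	}
	// Broker-side view: the job is queued and will be picked up and run.
	fmt.Println("queued jobs:", b.queue)
}
```

If that's the shape of the problem, any retry or error path on the send side has to treat "error" as "possibly delivered" rather than "not delivered".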

@krancour krancour added the bug Something isn't working label Mar 31, 2022
@krancour krancour self-assigned this Mar 31, 2022
@krancour (Contributor, Author)

A closer look at the logs reveals that three jobs appear not to have made it onto the queue, although they did. Two of them happened to succeed before the observer cleaned up the dead worker. The one of the three that was still running was aborted at the time of cleanup.

@krancour krancour added this to the v2.4.1 milestone Apr 15, 2022
@krancour krancour modified the milestones: v2.5.0, v2.6.0 May 11, 2022
@krancour krancour modified the milestones: v2.6.0, v2.7.0 May 25, 2022