isAgentConnected is not reliable #3

Open
Mulugruntz opened this issue Mar 6, 2018 · 4 comments

@Mulugruntz

Hi,

The function that detects whether an agent is connected relies on a wrong assumption.

On line 439 it checks for the string "successfully connected and online".
However, we currently have an offline node, and its log page still prints:

[2018-03-06 14:04:17] [windows-slaves] Connecting to build1
Checking if Java exists
java -version returned 1.8.0.
[2018-03-06 14:04:31] [windows-slaves] Installing the Jenkins slave service
[2018-03-06 14:04:31] [windows-slaves] Copying jenkins-slave.exe
[2018-03-06 14:04:31] [windows-slaves] Copying slave.jar
[2018-03-06 14:04:31] [windows-slaves] Copying jenkins-slave.xml
[2018-03-06 14:04:31] [windows-slaves] Registering the service
[2018-03-06 14:04:31] [windows-slaves] Starting the service
[2018-03-06 14:04:38] [windows-slaves] Waiting for the service to become ready
[2018-03-06 14:04:43] [windows-slaves] Connecting to port 49,678
<===[JENKINS REMOTING CAPACITY]===>Remoting version: 3.14
This is a Windows agent
Agent successfully connected and online
ERROR: Connection terminated
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
    at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
    at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
    at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
ERROR: Message not found for errorCode: 0x8001FFFF
org.jinterop.dcom.common.JIException: Message not found for errorCode: 0x8001FFFF
    at org.jinterop.dcom.core.JIComServer.init(JIComServer.java:546)
    at org.jinterop.dcom.core.JIComServer.initialise(JIComServer.java:458)
    at org.jinterop.dcom.core.JIComServer.<init>(JIComServer.java:427)
    at org.jvnet.hudson.wmi.WMI.connect(WMI.java:59)
    at hudson.os.windows.ManagedWindowsServiceLauncher.afterDisconnect(ManagedWindowsServiceLauncher.java:491)
    at hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:533)
    at hudson.remoting.Channel.terminate(Channel.java:1013)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:96)
Caused by: rpc.BindException: Unable to bind. (unknown)
    at rpc.security.ntlm.NtlmConnectionContext.accept(NtlmConnectionContext.java:137)
    at rpc.ConnectionOrientedEndpoint.connect(ConnectionOrientedEndpoint.java:252)
    at rpc.ConnectionOrientedEndpoint.bind(ConnectionOrientedEndpoint.java:217)
    at rpc.ConnectionOrientedEndpoint.rebind(ConnectionOrientedEndpoint.java:153)
    at org.jinterop.dcom.transport.JIComEndpoint.rebindEndPoint(JIComEndpoint.java:40)
    at org.jinterop.dcom.core.JIComServer.init(JIComServer.java:529)
    ... 7 more
ERROR: Message not found for errorCode: 0x8001FFFF
org.jinterop.dcom.common.JIException: Message not found for errorCode: 0x8001FFFF
    at org.jinterop.dcom.core.JIComServer.init(JIComServer.java:546)
    at org.jinterop.dcom.core.JIComServer.initialise(JIComServer.java:458)
    at org.jinterop.dcom.core.JIComServer.<init>(JIComServer.java:427)
    at org.jvnet.hudson.wmi.WMI.connect(WMI.java:59)
    at hudson.os.windows.ManagedWindowsServiceLauncher.afterDisconnect(ManagedWindowsServiceLauncher.java:491)
    at hudson.os.windows.ManagedWindowsServiceLauncher$1.onClosed(ManagedWindowsServiceLauncher.java:366)
    at hudson.remoting.Channel.terminate(Channel.java:1013)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:96)
Caused by: rpc.BindException: Unable to bind. (unknown)
    at rpc.security.ntlm.NtlmConnectionContext.accept(NtlmConnectionContext.java:137)
    at rpc.ConnectionOrientedEndpoint.connect(ConnectionOrientedEndpoint.java:252)
    at rpc.ConnectionOrientedEndpoint.bind(ConnectionOrientedEndpoint.java:217)
    at rpc.ConnectionOrientedEndpoint.rebind(ConnectionOrientedEndpoint.java:153)
    at org.jinterop.dcom.transport.JIComEndpoint.rebindEndPoint(JIComEndpoint.java:40)
    at org.jinterop.dcom.core.JIComServer.init(JIComServer.java:529)
    ... 7 more

I believe this happens when a node came online successfully but Jenkins lost the connection later on. The errors are just appended to the previous messages, so the success string is still in the log.

A clean alternative would be to access "/computer/"+buildBox+"/api/json" instead and check for the offline key (boolean).
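To make the suggestion concrete, here is a minimal Python sketch of the JSON-based check. It only builds the path and interprets the response body; fetching it (Jenkins base URL, credentials) is left out on purpose. The function names `computer_api_path` and `is_node_online` are hypothetical and not part of the library, and treating a missing `offline` key as "not online" is my own assumption.

```python
import json
from urllib.parse import quote


def computer_api_path(build_box: str) -> str:
    """Build the JSON API path for a node, URL-encoding the node name."""
    return "/computer/" + quote(build_box, safe="") + "/api/json"


def is_node_online(api_json_text: str) -> bool:
    """Return True only if the payload explicitly reports offline == false.

    Assumption: a missing or non-boolean 'offline' key counts as not online,
    so that an unexpected response never marks a node as usable.
    """
    data = json.loads(api_json_text)
    return data.get("offline") is False
```

For example, `is_node_online('{"offline": true}')` is `False` no matter what the connection log contains, which is the point of the proposal: the answer comes from the node's current state rather than from its log history.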

Best regards!

@Mulugruntz
Author

Actually, I just noticed the function isNodeOffline(). It now makes me wonder... isn't !isNodeOffline(bb) == isAgentConnected(bb)?

@lucanaldini
Contributor

Hi. I've noticed situations where the agent connects and then disconnects a few seconds later.
It used to happen either immediately after the agent connected or not at all, so I added an extra check before marking the node as connected.
Are you by any chance using a version of the library from after this commit?
Since adding that extra check, I have not seen that situation occur again.

Could you verify and let me know please?
Thanks

@Mulugruntz
Author

Hi @lucanaldini .

Yes, this is the version that we use. Independently of the bug, the piece of code is still built on wrong assumptions.

Regards.

@lucanaldini
Contributor

Hi, thanks for checking. Line 439 checks that the last 37 characters of the console output contain "successfully connected and online". In your output that wouldn't be the case (since you have that stack trace at the end). So it might be that the agent disconnected after being connected for at least 10 seconds.
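The difference between the two readings of the check can be sketched in Python as follows. This is an illustration of the behaviour described in this thread, not the library's actual code at line 439; the function names and the sample log strings are hypothetical, and only the marker text and the 37-character figure come from the comments above.

```python
MARKER = "successfully connected and online"
TAIL_LENGTH = 37  # figure quoted in the comment above


def tail_reports_connected(console_output: str, tail_length: int = TAIL_LENGTH) -> bool:
    """Look for the marker only in the last `tail_length` characters of the log."""
    return MARKER in console_output[-tail_length:]


def anywhere_reports_connected(console_output: str) -> bool:
    """Look for the marker anywhere in the log (the reading the issue assumed)."""
    return MARKER in console_output


# Hypothetical log excerpts modelled on the output quoted in the issue.
healthy_log = "This is a Windows agent\nAgent successfully connected and online\n"
dropped_log = healthy_log + (
    "ERROR: Connection terminated\n"
    "java.net.SocketException: Connection reset\n"
)
```

With these samples, the tail check accepts `healthy_log` and rejects `dropped_log`, while a whole-log search would accept both. The marker itself is 33 characters long, so the tail check tolerates only four trailing characters; and a node that drops after the check has already passed is still missed, which the JSON `offline` key proposed earlier would catch.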
