-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hercules crash during QETH HSCH processing #690
Comments
I'm not sure it's a Windows only problem: Yes, I've observed it so far under Windows only, in conjunction with CTCI-WIN. However, with the current knowledge I'd not bet that this cannot happen in other environments too. 😉 Probably, it doesn't need more than just feeding nonsense into the device until the IP stack gives up and tries to bail out using HSCH. |
You're probably right. I've updated the text of the 'W' GitHub issue label. I promise to get to this issue as soon as I can but am still desperately trying to get caught up with Hercules and serious personal issues backlog, (My wife had another serious heart attack!) |
As I said earlier: This issue has absolutely no priority. Health, your wife's and your's is most important! Wishing your wife all the best, take care! |
I'm very sorry to hear that, Fish. I wish the best to you both. Bill |
I'm not so sure that's correct. My own analysis reveals that the likely reason the But that's just my own educated guess. (see attachments further below) My analysis involved the discovery and use of two new "windbg"
Using them, I then came to the above mentioned conclusion. Refer to the below crash dump analysis and my notes to see whether you agree with my analysis or not:
Now... What to do about it? How to fix the problem? Answer: I'm not sure! I'm not even certain that my analysis is correct. I'm simply presuming (guessing) that it is. But how do you "fix" what appears to be a permanent I/O error? How do you fix (or workaround?) what appears to be a malfunctioning physical host hardware device? The only thing I can think of is to modify the That won't fix the underlying problem of course, but at least it should prevent the deadlock from occurring. Of course, when the error occurs, the HSC (Halt Subchannel) will of course have failed too, and the repercussions of that are unknown. But at least if something nasty does occur elsewhere in Hercules as a result, we'll at least have an audit trail in the form of a log message to provide us with a clue as to a possible cause, which is better than nothing.
Given my own analysis, I doubt your patch will accomplish much of anything. (Have you tried it? Does it work?) The problem (IMHO) is very likely in the Thanks!
But this time deeper analysis was required. To truly see what was going on, I needed to see what each thread involved in the deadlock was doing, which the |
Hi Fish thanks for the analysis!
The only thing my patch accomplishes is to prevent the deadlock. That means, after applying the patch Hercules will no longer crash when the problem occurs. The problem itself doesn't get fixed by the patch, and thus the device ends up being no longer usable once the problem occured. Even detaching and attaching it from/to Hercules doesn't bring it back to life. And yes, I've tried it and it works exactly to that point. My idea was, to allow further analysis when Hercules is still running -- but I'm not exactly an in depth windows debugger. Cheers |
Wrong! And that was my point! It doesn't prevent the deadlock! (and will never do so! -- by itself) The reason it doesn't prevent the deadlock (by itself) is because your patch is still calling the Now, once I eventually fix the loop in the But again, by itself, it accomplishes nothing. It does not, and will not, fix the deadlock by itself. The deadlock can only be fixed by applying BOTH proposed fixes: my fix to prevent Does that make sense now?
Agreed 100%. But you know me: I hate any bug in Hercules, no matter how minor or insignificant it may be. Especially any bug related to networking, as such bugs tend to make my CTCI-WIN product look bad. p.s. I'm currently working on trying to make changes to my CTCI-WIN product to also report an adapter's "LSO" (Large Segment Offload) setting, and issue a warning should it appear that the adapter has it enabled. This, along with my existing checksum offloading reporting/handling, should (I hope!) prevent such hard to diagnose problems from occurring, as well as significantly reduce my own product support workload. (Most of the support questions I receive these days are specifically related to issues exactly like this: incorrect adapter offload settings.) |
Hi Fish
Logic-wise I'm 100% with you. But then how do you explain the following? 08:51:25 HHC03800I 0:0401 QETH: Adapter mode set to Layer 3 08:51:25 HHC04100I TunTap64.dll version 3.7.0.6070 initiated 08:51:25 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened 08:51:26 HHC03997I 0:0401 QETH: tun0: using MAC address 02-00-5E-A8-4B-6F 08:51:26 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.50.245 08:51:26 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.255.0 08:51:26 HHC03997I 0:0401 QETH: tun0: using MTU 1500 08:51:26 HHC03997I 0:0401 QETH: tun0: using drive MAC address 02:90:31:cf:95:7a 08:51:26 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::290:31ff:fecf:957a 08:51:26 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.50.246 09:04:40 HHC01603I qproc 09:04:40 HHC17007I NumCPU = 02, NumVEC = 00, ReservedCPU = 00, MaxCPU = 02 09:04:40 HHC17008I Avgproc 02 004%; MIPS[ 3.60]; SIOS[ 10] 09:04:40 HHC17009I PROC CP00 - 009%; MIPS[ 3.55]; SIOS[ 8] - Host Kernel(00:00:01.797) User(00:00:57.406) 09:04:40 HHC17009I PROC IP01 - 000%; MIPS[ 0.05]; SIOS[ 2] - Host Kernel(00:00:01.188) User(00:01:33.984) 09:04:40 HHC17010I - Started : Stopping * Stopped 09:04:52 HHC01603I ipending 09:04:52 HHC00850I Processor CP00: CPUint=00000001 (State:00000001)&(Mask:DF00EE2B) 09:04:52 HHC00851I Processor CP00: interrupt not pending 09:04:52 HHC00852I Processor CP00: I/O interrupt not pending 09:04:52 HHC00853I Processor CP00: clock comparator not pending 09:04:52 HHC00854I Processor CP00: CPU timer not pending 09:04:52 HHC00855I Processor CP00: interval timer not pending 09:04:52 HHC00856I Processor CP00: ECPS vtimer not pending 09:04:52 HHC00857I Processor CP00: external call not pending 09:04:52 HHC00858I Processor CP00: emergency signal not pending 09:04:52 HHC00859I Processor CP00: machine check interrupt not pending 09:04:52 HHC00860I Processor CP00: service signal not pending 09:04:52 HHC00861I Processor CP00: mainlock held: no 09:04:52 HHC00862I Processor CP00: intlock held: no 09:04:52 HHC00863I Processor CP00: waiting for intlock: no 09:04:52 HHC00864I Processor CP00: lock not held 09:04:52 HHC00866I Processor CP00: state STARTED 09:04:52 HHC00867I Processor CP00: instcount 15999827536 09:04:52 HHC00868I Processor CP00: siocount 84684 09:04:52 HHC00869I Processor CP00: psw 0706000000000000 0000000000000000 09:04:52 HHC00850I Processor IP01: CPUint=80000001 (State:80000001)&(Mask:DF00E62B) 09:04:52 HHC00851I Processor IP01: interrupt pending 09:04:52 HHC00852I Processor IP01: I/O interrupt not pending 09:04:52 HHC00853I Processor IP01: clock comparator not pending 09:04:52 HHC00854I Processor IP01: CPU timer not pending 09:04:52 HHC00855I Processor IP01: interval timer not pending 09:04:52 HHC00856I Processor IP01: ECPS vtimer not pending 09:04:52 HHC00857I Processor IP01: external call not pending 09:04:52 HHC00858I Processor IP01: emergency signal not pending 09:04:52 HHC00859I Processor IP01: machine check interrupt not pending 09:04:52 HHC00860I Processor IP01: service signal not pending 09:04:52 HHC00861I Processor IP01: mainlock held: no 09:04:52 HHC00862I Processor IP01: intlock held: no 09:04:52 HHC00863I Processor IP01: waiting for intlock: no 09:04:52 HHC00864I Processor IP01: lock not held 09:04:52 HHC00866I Processor IP01: state STARTED 09:04:52 HHC00867I Processor IP01: instcount 40571800651 09:04:52 HHC00868I Processor IP01: siocount 24774 09:04:52 HHC00869I Processor IP01: psw 0706000000000000 0000000000000000 09:04:52 HHC00870I config mask 00000003 started mask 00000003 waiting mask 00000003 09:04:52 HHC00871I syncbc mask 00000000 09:04:52 HHC00872I signaling facility not busy 09:04:52 HHC00873I TOD lock not held 09:04:52 HHC00874I mainlock not held; owner ffff 09:04:52 HHC00875I intlock not held; owner ffff 09:04:52 HHC00876I ioq lock not held 09:04:52 HHC00884I Channel Report queue: (empty) 09:04:52 HHC00880I device 0:0400: status busy local 09:04:52 HHC00880I device 0:0402: status busy local 09:04:52 HHC00880I device 0:0A82: status reserved local 09:04:52 HHC00881I I/O interrupt queue: (NULL) 09:04:57 HHC03997I 0:0401 QETH: tun0: not using MAC address 02-00-5E-A8-4B-6F 09:04:57 HHC03997I 0:0401 QETH: tun0: not using IP address 192.168.50.245 09:04:57 HHC03997I 0:0401 QETH: tun0: not using subnet mask 255.255.255.0 09:04:57 HHC03997I 0:0401 QETH: tun0: not using MTU 1500 09:05:02 QETH Timeout during halt processing: SCSW2_Q NOT cleared 09:05:16 HHC01603I ipending 09:05:16 HHC00850I Processor CP00: CPUint=00000001 (State:00000001)&(Mask:DF00EE2B) 09:05:16 HHC00851I Processor CP00: interrupt not pending 09:05:16 HHC00852I Processor CP00: I/O interrupt not pending 09:05:16 HHC00853I Processor CP00: clock comparator not pending 09:05:16 HHC00854I Processor CP00: CPU timer not pending 09:05:16 HHC00855I Processor CP00: interval timer not pending 09:05:16 HHC00856I Processor CP00: ECPS vtimer not pending 09:05:16 HHC00857I Processor CP00: external call not pending 09:05:16 HHC00858I Processor CP00: emergency signal not pending 09:05:16 HHC00859I Processor CP00: machine check interrupt not pending 09:05:16 HHC00860I Processor CP00: service signal not pending 09:05:16 HHC00861I Processor CP00: mainlock held: no 09:05:16 HHC00862I Processor CP00: intlock held: no 09:05:16 HHC00863I Processor CP00: waiting for intlock: no 09:05:16 HHC00864I Processor CP00: lock not held 09:05:16 HHC00866I Processor CP00: state STARTED 09:05:16 HHC00867I Processor CP00: instcount 16070442290 09:05:16 HHC00868I Processor CP00: siocount 84769 09:05:16 HHC00869I Processor CP00: psw 0706000000000000 0000000000000000 09:05:16 HHC00850I Processor IP01: CPUint=00000001 (State:00000001)&(Mask:DF00E62B) 09:05:16 HHC00851I Processor IP01: interrupt not pending 09:05:16 HHC00852I Processor IP01: I/O interrupt not pending 09:05:16 HHC00853I Processor IP01: clock comparator not pending 09:05:16 HHC00854I Processor IP01: CPU timer not pending 09:05:16 HHC00855I Processor IP01: interval timer not pending 09:05:16 HHC00856I Processor IP01: ECPS vtimer not pending 09:05:16 HHC00857I Processor IP01: external call not pending 09:05:16 HHC00858I Processor IP01: emergency signal not pending 09:05:16 HHC00859I Processor IP01: machine check interrupt not pending 09:05:16 HHC00860I Processor IP01: service signal not pending 09:05:16 HHC00861I Processor IP01: mainlock held: no 09:05:16 HHC00862I Processor IP01: intlock held: no 09:05:16 HHC00863I Processor IP01: waiting for intlock: no 09:05:16 HHC00864I Processor IP01: lock not held 09:05:16 HHC00866I Processor IP01: state STARTED 09:05:16 HHC00867I Processor IP01: instcount 40579796477 09:05:16 HHC00868I Processor IP01: siocount 24789 09:05:16 HHC00869I Processor IP01: psw 0706000000000000 0000000000000000 09:05:16 HHC00870I config mask 00000003 started mask 00000003 waiting mask 00000003 09:05:16 HHC00871I syncbc mask 00000000 09:05:16 HHC00872I signaling facility not busy 09:05:16 HHC00873I TOD lock not held 09:05:16 HHC00874I mainlock not held; owner ffff 09:05:16 HHC00875I intlock not held; owner ffff 09:05:16 HHC00876I ioq lock not held 09:05:16 HHC00884I Channel Report queue: (empty) 09:05:16 HHC00880I device 0:0400: status busy local 09:05:16 HHC00881I I/O interrupt queue: (NULL) The highlighted message above, issued by my patch once Cheers |
Interesting! How did you create that? (i.e. the error condition?) Did you create it artificially? Or were you somehow able to recreate for real? (i.e. an actual recreation of the originally reported problem?) My only explanation is that the
Correct. Which can only mean, like I said just above, that the Because looking at the affected code, I cannot see how, if the pipe signal was successfully sent, why the original I'm stumped. Can you try the same test again, but after adding the "debug" option to your OSA (QETH) device statement please? This should activate the Thanks! And just for fun, we might consider adding some (temporary!) debugging code to the If it is succeeding like your test shows, then that means the problem -- whatever the FRICK it is -- must be somewhere in the ACTIVATE QUEUES logic. For some reason the QDSIG_HALT is never being received (which seems illogical/impossible!) or something else highly unusual is going on! Thanks. |
This is still the original problem: It occurs, reliably, when I open my slightly modified CICS web interface IVP website using HTTPS on a Windows system with LSO enabled on the network adapter. You can see how the result should look like when opening https://cicsweb:[email protected]:3005/CICS/CWBA/DFH$WB1A. This one runs on a Linux instance and it is exactly this interaction that produces the error on a Windows instance with LSO enabled. So, how can you repoduce it for yourself? You have to setup a CICS instance with its web interface activated over HTTPS. To do so, you need a certificate (self signed suffices) and make this available to the CICS web server through a TLS policy via PAGENT. Once you have the web interface up and running, you can replace the original IVP website by my modified one, and then you will see the problem by yourself. The point here is that this is a reliable way to reproduce the problem, but it most probably has nothing to do with the problem. As I said earlier, I'm convinced that there must be an easier way to reproduce it without having to set up a full blown CICS instance with web interface. One would somehow need to produce that same I/O pattern on the QETH interface, and I'd almost bet that the problem would then occur. But nonetheless, let me know whether you want to go through the hassle of defining a certificate, creating a TLS rule and setting up the CICS web interface -- then I'll try to provide you off list with the information necessary on how to do this. |
Oh! I wasn't aware of that. Sorry. My bad. I thought this was a one-time occurrence. I wasn't aware that it could be reliably reproduced at will. That's great! That makes things much easier.
And therein is the rub: while it may be true that there is probably another, perhaps easier, way to reproduce the problem, we don't know that for certain, AND discovering that different way (since we don't currently know what the other way is) would likely take many attempts to find it. (And then there's no guarantee that we would ever find it. It's just a presumption after all.) Therefore I would prefer using the same technique that we already know does work, and works reliably. Why waste our time trying to discover a new way to reproduce the problem, when we already know of a way that reliably reproduces it each and every time?
I would like that very much! Yes! While I'm sure we could work together on this bug with me providing you some research patches to dump information at certain points in the code, you then rebuilding Herc with my patches applied, running your repro and sending me the results (and then lathering, rinsing and repeating, as many times as necessary for me to get to the root of the problem and providing a fix for it (and then you testing it for me)), it would be much more convenient (and likely faster too) for both me and you for you to simply help me set things up on my own system (systems?) here so I can easily reproduce it for myself on demand whenever I need to. Given that the problem involves both my CTCI-WIN product and having LSO mistakenly enabled on the adapter too (a common user error that usually involves a support call from some of my customers), it would be much better for me to be able to reproduce the bug for myself -- at will -- too, as that would allow me to test some changes I've been wanting to (trying to!) do for a long time now. If I can get my CTCI-WIN product to detect and warn my customers about their LSO settings, all the better for me too, and not just Hercules. So let's do that: work together offline so I can obtain the needed help from you on setting up my own system so I can reproduce the problem myself. Once I can repro the problem on demand myself, I'll be able to find and fix the problem much more quickly and reliably than trying to debug this issue remotely. Agreed? Agreed! So I'll wait for you to contact me off list. Thanks, Jürgen. You're a nice guy. |
It look a while, but with Jürgen's help, I was able to setup my z/OS to recreate his problem. I am only able to recreate it on my Windows 11 system however (which is a VMware virtual machine, currently running 21H2(?) I think it's called?), but I cannot recreate it on my Windows 7 host system. So now that I am able to reproduce the crash at will (i.e. whenever I want to), I will hopefully be able to eventually track down the root cause and get the problem fixed. Seeing how this is a Low Priority issue, I am going to change the GitHub status from "In Progress" to "OnGoing", since I suspect it's going to take me a while. (Besides, there are other more important issues I could be working on, and if I try to work on too many things at once, I only end up making mistakes. So I'm putting this issue on hold for a while.) |
Fish, as promised, here the issue concerning the QETH problems we recently discussed:
The Hercules QETH device causes Hercules to crash, when the operating system issues a HSCH against the data device while it is still active/busy.
That behavior typically occurs on Windows systems, when the QETH device is using SoftDevLabs' CTCI-WIN product, and when the LSO functionality of the host system's Ethernet adapter isn't handled correctly by CTCI-WIN.
While there exists a workaround for that specific situation (simply disabling LSO on the host's adapter), the resulting Hercules crash should be seen as an independent problem. It can be expected that issuing
HSCH
(Halt Subchannel) against the QETH data device in the "wrong moment" will generally trigger the problem.Analysis shows that the crash gets triggered when
qeth_halt_data_device
gets called and doesn't return because the device doesn't honor theQDSIG_HALT
signal.That consequently leads to a deadlock (as
qeth_halt_data_device
holds bothINTLOCK
anddev->lock
, thus preventing some other CPU thread from being able to acquireINTLOCK
for its own purposes) which eventually leads to a forced crash the moment ourwatchdog_thread
system monitoring thread eventually detects the deadlock/hang.The deadlock (and thus the crash) can be avoided by the following quick and dirty patch, which is only meant to help further analysis. As it doesn't solve the underlying problem, it is not intended for production use.
Cheers
Jürgen
The text was updated successfully, but these errors were encountered: