
Hercules crash during QETH HSCH processing #690

Open
Juergen-Git opened this issue Nov 10, 2024 · 12 comments

@Juergen-Git
Collaborator

Juergen-Git commented Nov 10, 2024

Fish, as promised, here is the issue concerning the QETH problems we recently discussed:

The Hercules QETH device causes Hercules to crash when the operating system issues an HSCH against the data device while it is still active/busy.

That behavior typically occurs on Windows systems when the QETH device is using SoftDevLabs' CTCI-WIN product and the LSO functionality of the host system's Ethernet adapter isn't handled correctly by CTCI-WIN.

While there is a workaround for that specific situation (simply disabling LSO on the host's adapter), the resulting Hercules crash should be seen as an independent problem. It can be expected that issuing HSCH (Halt Subchannel) against the QETH data device at the "wrong moment" will generally trigger the problem.

Analysis shows that the crash gets triggered when qeth_halt_data_device gets called and doesn't return because the device doesn't honor the QDSIG_HALT signal.

That in turn leads to a deadlock (qeth_halt_data_device holds both INTLOCK and dev->lock, preventing other CPU threads from acquiring INTLOCK for their own purposes), which ends in a forced crash the moment our watchdog_thread system monitoring thread detects the deadlock/hang.
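
For illustration only, here is a minimal standalone sketch of that deadlock shape (this is not Hercules code; the names intlock, qlock and qdcond, the thread roles and the watchdog timing are just stand-ins): one thread waits forever on a condition that is never signaled while still holding the lock the other thread needs, until a watchdog notices that no progress is being made any more:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t intlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t qlock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qdcond  = PTHREAD_COND_INITIALIZER;
static volatile int    heartbeat = 0;

static void* halt_thread( void* arg )     /* plays the role of qeth_halt_data_device */
{
    pthread_mutex_lock( &intlock );       /* holds "INTLOCK" ...                     */
    pthread_mutex_lock( &qlock );
    pthread_cond_wait( &qdcond, &qlock ); /* ... and waits for a signal that never comes */
    return arg;
}

static void* cpu_thread( void* arg )      /* plays the role of some other CPU thread */
{
    for (;;)
    {
        pthread_mutex_lock( &intlock );   /* blocks forever once halt_thread owns intlock */
        heartbeat++;
        pthread_mutex_unlock( &intlock );
        usleep( 100 * 1000 );
    }
    return arg;
}

int main( void )
{
    pthread_t t1, t2;
    int last = -1;

    pthread_create( &t2, NULL, cpu_thread,  NULL );
    sleep( 1 );
    pthread_create( &t1, NULL, halt_thread, NULL );

    for (;;)                              /* plays the role of watchdog_thread */
    {
        sleep( 2 );
        if (heartbeat == last)
        {
            fprintf( stderr, "watchdog: no progress detected -- forcing crash\n" );
            abort();
        }
        last = heartbeat;
    }
}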

The deadlock (and thus the crash) can be avoided by the following quick and dirty patch, which is only meant to help further analysis. As it doesn't solve the underlying problem, it is not intended for production use.

From: Juergen Winkelmann <[email protected]>
Date: Sat, 9 Nov 2024 13:48:25 +0100
Subject: [PATCH] temporary solution to prevent Hercules crash during QETH HSCH processing

---
 qeth.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/qeth.c b/qeth.c
index c345f99c..be06a4fa 100644
--- a/qeth.c
+++ b/qeth.c
@@ -3812,11 +3812,19 @@ static void qeth_halt_data_device( DEVBLK* dev, OSA_GRP* grp )
 
             DBGTRC( dev, "Halting data device" );
             {
+                struct timespec waittime;
+                struct timeval  now;
+                gettimeofday( &now, NULL );
+                waittime.tv_sec  = now.tv_sec  + 5;
+                waittime.tv_nsec = now.tv_usec * 1000;
+
                 /* Ask, then wait for, the Activate Queues loop to exit */
                 PTT_QETH_TRACE( "b4 halt data", 0,0,0 );
                 VERIFY( qeth_write_pipe( grp->ppfd[1], &sig ) == 1);
-                wait_condition( &grp->qdcond, &grp->qlock );
-                dev->scsw.flag2 &= ~SCSW2_Q;
+                if (!timed_wait_condition( &grp->qdcond, &grp->qlock, &waittime ))
+                    dev->scsw.flag2 &= ~SCSW2_Q;
+                else
+                    logmsg("QETH Timeout during halt processing: SCSW2_Q NOT cleared\n");
                 PTT_QETH_TRACE( "af halt data", 0,0,0 );
             }
             DBGTRC( dev, "Data device halted" );
-- 
2.34.1

Cheers
Jürgen

@Fish-Git added the BUG ("The issue describes likely incorrect product functionality that likely needs corrected.") and W ("Possible Windows only issue, that seems to occur more easily/frequently on Windows.") labels Nov 10, 2024
@Juergen-Git
Collaborator Author

Juergen-Git commented Nov 10, 2024

I'm not sure it's a Windows-only problem: yes, I've observed it so far under Windows only, in conjunction with CTCI-WIN. However, given our current knowledge, I wouldn't bet that it cannot happen in other environments too. 😉

It probably doesn't take more than feeding nonsense into the device until the IP stack gives up and tries to bail out using HSCH.

@SDL-Hercules-390 deleted a comment from Juergen-Git Nov 11, 2024
@Fish-Git
Member

Fish-Git commented Nov 11, 2024

You're probably right. I've updated the text of the 'W' GitHub issue label. I promise to get to this issue as soon as I can, but I'm still desperately trying to get caught up with my Hercules backlog and some serious personal issues. (My wife had another serious heart attack!)

@Juergen-Git
Collaborator Author

As I said earlier: this issue has absolutely no priority. Health, your wife's and yours, is what matters most! Wishing your wife all the best, take care!

@wrljet
Member

wrljet commented Nov 11, 2024

I'm very sorry to hear that, Fish. I wish the best to you both.

Bill

@Fish-Git
Member

Analysis shows that the crash gets triggered when qeth_halt_data_device gets called and doesn't return because the device doesn't honor the QDSIG_HALT signal.

I'm not so sure that's correct. My own analysis suggests that the likely reason qeth_halt_data_device doesn't return is not that "the device doesn't honor the QDSIG_HALT signal", but rather that the qeth_write_pipe function gets stuck in an infinite error loop trying to send its "signal" asking the data device to halt (i.e. to exit its ACTIVATE QUEUES loop; qeth.c, line 5173).

But that's just my own educated guess.   (see attachments further below)

My analysis involved the discovery and use of two "windbg"(*) commands that were new to me, called ~ and !uniqstack.

Using them, I then came to the above-mentioned conclusion. Refer to the crash dump analysis below and my notes to see whether or not you agree with my analysis:

  • !analyze -v.txt       (windbg automated dump analysis with new ~ and !uniqstack commands too)
  • _Fish NOTES.txt    (my own analysis of windbg's analysis)

Now... What to do about it? How to fix the problem?

Answer: I'm not sure! I'm not even certain that my analysis is correct; I'm simply presuming (guessing) that it is. But how do you "fix" what appears to be a permanent I/O error? How do you fix (or work around) what appears to be a malfunctioning physical host hardware device?

The only thing I can think of is to modify the qeth_write_pipe function to put a reasonable time limit on its loop. That is to say, change it so that it doesn't loop forever: set a time limit of, say, 5 seconds(?) on the loop, query the current time of day at the start of the loop and again on each iteration, and if 5(?) or more seconds have passed, write a nasty error message and return.
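
Something along these lines is what I have in mind. (Rough sketch only! This is not the actual qeth_write_pipe code; the function name, the 5-second constant and the message text are all just placeholders.)

#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define HALT_SIG_TIMEOUT_SECS  5            /* give up after roughly 5 seconds */

/* Hypothetical time-limited version of the retry loop: 'fd' is the write
   end of the signalling pipe and 'sig' is the one-byte halt signal. */
static int qeth_write_pipe_timed( int fd, const unsigned char* sig )
{
    time_t start = time( NULL );            /* remember when we started looping */

    while (write( fd, sig, 1 ) != 1)
    {
        if (difftime( time( NULL ), start ) >= HALT_SIG_TIMEOUT_SECS)
        {
            fprintf( stderr, "QETH: unable to send halt signal after %d seconds: %s\n",
                     HALT_SIG_TIMEOUT_SECS, strerror( errno ));
            return 0;                       /* give up; let the caller decide what to do */
        }
        sched_yield();                      /* brief pause, then retry as before */
    }
    return 1;                               /* signal successfully written */
}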

That won't fix the underlying problem of course, but at least it should prevent the deadlock from occurring.

Of course, when the error occurs, the HSCH (Halt Subchannel) will have failed too, and the repercussions of that are unknown. But if something nasty does occur elsewhere in Hercules as a result, we'll at least have an audit trail in the form of a log message to give us a clue as to the possible cause, which is better than nothing.

The deadlock (and thus the crash) can be avoided by the following quick and dirty patch, ...

Given my own analysis, I doubt your patch will accomplish much of anything. (Have you tried it? Does it work?) The problem (IMHO) is very likely in the qeth_write_pipe function itself, not with the infinite wait_condition that follows the call to qeth_write_pipe.

 
So..... That's it. That's my problem analysis of this issue and the best I can come up with for a possible solution/workaround. What do you guys think? Feedback encouraged! I value your opinions!

Thanks!


(*) "windbg" is the Windows Debugger tool used to perform postmortem analysis of crash dumps. That's the tool I normally use to analyze Hercules crash dumps. So far, the only command I have ever needed to use was the !analyze -v command (http://windbg.info/doc/1-common-cmds.html#9_exceptions). The results of the !analyze -v command (along with Hercules's log file) are usually enough to determine why the crash occurred.

But this time deeper analysis was required. To truly see what was going on, I needed to see what each thread involved in the deadlock was doing, which the !analyze -v command doesn't provide. So I used the ~ and !uniqstack commands and examined the stack for both the Processor 00 and Processor 01 threads to see what they were doing (i.e. to see how they reached the deadlock condition). Based on that, I then came to the [preliminary!] conclusion that I did.

@Juergen-Git
Collaborator Author

Juergen-Git commented Dec 10, 2024

Hi Fish

thanks for the analysis!

Given my own analysis, I doubt your patch will accomplish much of anything. (Have you tried it? Does it work?)

The only thing my patch accomplishes is to prevent the deadlock. That means that after applying the patch, Hercules will no longer crash when the problem occurs. The problem itself doesn't get fixed by the patch, and thus the device ends up being unusable once the problem has occurred. Even detaching and re-attaching it from/to Hercules doesn't bring it back to life. And yes, I've tried it, and it works exactly to that point. My idea was to allow further analysis while Hercules is still running -- but I'm not exactly an in-depth Windows debugger. ;-) And given that the problem doesn't seem to occur widely across the Windows/QETH community, I still think it is very low priority.

Cheers
Jürgen

@Fish-Git
Member

The only thing my patch accomplishes is to prevent the deadlock.

Wrong! And that was my point! It doesn't prevent the deadlock!  (and will never do so! -- by itself)

The reason it doesn't prevent the deadlock (by itself) is that your patch still calls the qeth_write_pipe function, and it is that function that (presumably) never returns. So your timed_wait_condition will never be executed, and the deadlock is not prevented at all. By itself, it essentially accomplishes absolutely nothing.

Now, once I eventually fix the loop in the qeth_write_pipe function so that it doesn't loop forever and instead eventually returns, THEN your timed_wait_condition will indeed prevent the deadlock from occurring.

But again, by itself, it accomplishes nothing. The deadlock can only be fixed by applying BOTH proposed fixes: my fix to prevent qeth_write_pipe from looping forever, and your timed_wait_condition fix to stop waiting forever for the signal acknowledging that the HSCH request has completed. Only when both fixes are in place will the deadlock be prevented; if only one or the other is applied, the deadlock will still occur.
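
In other words, the halt path would end up looking roughly like this (sketch only, not real code; qeth_write_pipe_timed stands for my proposed bounded version of qeth_write_pipe, and the rest comes from your patch):

    if (qeth_write_pipe_timed( grp->ppfd[1], &sig ))        /* fix #1: don't loop forever */
    {
        /* fix #2: don't wait forever for the acknowledgement */
        if (!timed_wait_condition( &grp->qdcond, &grp->qlock, &waittime ))
            dev->scsw.flag2 &= ~SCSW2_Q;
        else
            logmsg("QETH Timeout during halt processing: SCSW2_Q NOT cleared\n");
    }
    else
        logmsg("QETH: halt signal could not be sent; data device not halted\n");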

Does that make sense now?

given that the problem doesn't seem to occur widely across the Windows/QETH community, I still think it is very low priority.

Agreed 100%.  But you know me:  I hate any bug in Hercules, no matter how minor or insignificant it may be.

Especially any bug related to networking, as such bugs tend to make my CTCI-WIN product look bad.  ;-)

p.s. I'm currently trying to make changes to my CTCI-WIN product so that it also reports an adapter's "LSO" (Large Segment Offload) setting and issues a warning should it appear that the adapter has it enabled. This, along with my existing checksum offloading reporting/handling, should (I hope!) prevent such hard-to-diagnose problems from occurring, as well as significantly reduce my own product support workload. (Most of the support questions I receive these days are specifically related to issues exactly like this: incorrect adapter offload settings.)

@Fish-Git added the IN PROGRESS... ("I'm working on it! (Or someone else is!)") label Dec 10, 2024
@Juergen-Git
Collaborator Author

Juergen-Git commented Dec 11, 2024

Hi Fish

The reason it doesn't prevent the deadlock (by itself) is that your patch still calls the qeth_write_pipe function, and it is that function that (presumably) never returns. So your timed_wait_condition will never be executed, and the deadlock is not prevented at all. By itself, it essentially accomplishes absolutely nothing.

Logic-wise I'm 100% with you. But then how do you explain the following?

08:51:25 HHC03800I 0:0401 QETH: Adapter mode set to Layer 3
08:51:25 HHC04100I TunTap64.dll version 3.7.0.6070 initiated
08:51:25 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened
08:51:26 HHC03997I 0:0401 QETH: tun0: using MAC address 02-00-5E-A8-4B-6F
08:51:26 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.50.245
08:51:26 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.255.0
08:51:26 HHC03997I 0:0401 QETH: tun0: using MTU 1500
08:51:26 HHC03997I 0:0401 QETH: tun0: using drive MAC address 02:90:31:cf:95:7a
08:51:26 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::290:31ff:fecf:957a
08:51:26 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.50.246
09:04:40 HHC01603I qproc
09:04:40 HHC17007I NumCPU = 02, NumVEC = 00, ReservedCPU = 00, MaxCPU = 02
09:04:40 HHC17008I Avgproc  02 004%; MIPS[   3.60]; SIOS[    10]
09:04:40 HHC17009I PROC CP00 - 009%; MIPS[   3.55]; SIOS[     8] - Host Kernel(00:00:01.797) User(00:00:57.406)
09:04:40 HHC17009I PROC IP01 - 000%; MIPS[   0.05]; SIOS[     2] - Host Kernel(00:00:01.188) User(00:01:33.984)
09:04:40 HHC17010I  - Started        : Stopping        * Stopped
09:04:52 HHC01603I ipending
09:04:52 HHC00850I Processor CP00: CPUint=00000001 (State:00000001)&(Mask:DF00EE2B)
09:04:52 HHC00851I Processor CP00: interrupt not pending
09:04:52 HHC00852I Processor CP00: I/O interrupt not pending
09:04:52 HHC00853I Processor CP00: clock comparator not pending
09:04:52 HHC00854I Processor CP00: CPU timer not pending
09:04:52 HHC00855I Processor CP00: interval timer not pending
09:04:52 HHC00856I Processor CP00: ECPS vtimer not pending
09:04:52 HHC00857I Processor CP00: external call not pending
09:04:52 HHC00858I Processor CP00: emergency signal not pending
09:04:52 HHC00859I Processor CP00: machine check interrupt not pending
09:04:52 HHC00860I Processor CP00: service signal not pending
09:04:52 HHC00861I Processor CP00: mainlock held: no
09:04:52 HHC00862I Processor CP00: intlock held: no
09:04:52 HHC00863I Processor CP00: waiting for intlock: no
09:04:52 HHC00864I Processor CP00: lock not held
09:04:52 HHC00866I Processor CP00: state STARTED
09:04:52 HHC00867I Processor CP00: instcount 15999827536
09:04:52 HHC00868I Processor CP00: siocount 84684
09:04:52 HHC00869I Processor CP00: psw 0706000000000000 0000000000000000
09:04:52 HHC00850I Processor IP01: CPUint=80000001 (State:80000001)&(Mask:DF00E62B)
09:04:52 HHC00851I Processor IP01: interrupt pending
09:04:52 HHC00852I Processor IP01: I/O interrupt not pending
09:04:52 HHC00853I Processor IP01: clock comparator not pending
09:04:52 HHC00854I Processor IP01: CPU timer not pending
09:04:52 HHC00855I Processor IP01: interval timer not pending
09:04:52 HHC00856I Processor IP01: ECPS vtimer not pending
09:04:52 HHC00857I Processor IP01: external call not pending
09:04:52 HHC00858I Processor IP01: emergency signal not pending
09:04:52 HHC00859I Processor IP01: machine check interrupt not pending
09:04:52 HHC00860I Processor IP01: service signal not pending
09:04:52 HHC00861I Processor IP01: mainlock held: no
09:04:52 HHC00862I Processor IP01: intlock held: no
09:04:52 HHC00863I Processor IP01: waiting for intlock: no
09:04:52 HHC00864I Processor IP01: lock not held
09:04:52 HHC00866I Processor IP01: state STARTED
09:04:52 HHC00867I Processor IP01: instcount 40571800651
09:04:52 HHC00868I Processor IP01: siocount 24774
09:04:52 HHC00869I Processor IP01: psw 0706000000000000 0000000000000000
09:04:52 HHC00870I config mask 00000003 started mask 00000003 waiting mask 00000003
09:04:52 HHC00871I syncbc mask 00000000 
09:04:52 HHC00872I signaling facility not busy
09:04:52 HHC00873I TOD lock not held
09:04:52 HHC00874I mainlock not held; owner ffff
09:04:52 HHC00875I intlock not held; owner ffff
09:04:52 HHC00876I ioq lock not held
09:04:52 HHC00884I Channel Report queue: (empty)
09:04:52 HHC00880I device 0:0400: status busy local
09:04:52 HHC00880I device 0:0402: status busy local
09:04:52 HHC00880I device 0:0A82: status reserved local
09:04:52 HHC00881I I/O interrupt queue: (NULL)
09:04:57 HHC03997I 0:0401 QETH: tun0: not using MAC address 02-00-5E-A8-4B-6F
09:04:57 HHC03997I 0:0401 QETH: tun0: not using IP address 192.168.50.245
09:04:57 HHC03997I 0:0401 QETH: tun0: not using subnet mask 255.255.255.0
09:04:57 HHC03997I 0:0401 QETH: tun0: not using MTU 1500

09:05:02 QETH Timeout during halt processing: SCSW2_Q NOT cleared

09:05:16 HHC01603I ipending
09:05:16 HHC00850I Processor CP00: CPUint=00000001 (State:00000001)&(Mask:DF00EE2B)
09:05:16 HHC00851I Processor CP00: interrupt not pending
09:05:16 HHC00852I Processor CP00: I/O interrupt not pending
09:05:16 HHC00853I Processor CP00: clock comparator not pending
09:05:16 HHC00854I Processor CP00: CPU timer not pending
09:05:16 HHC00855I Processor CP00: interval timer not pending
09:05:16 HHC00856I Processor CP00: ECPS vtimer not pending
09:05:16 HHC00857I Processor CP00: external call not pending
09:05:16 HHC00858I Processor CP00: emergency signal not pending
09:05:16 HHC00859I Processor CP00: machine check interrupt not pending
09:05:16 HHC00860I Processor CP00: service signal not pending
09:05:16 HHC00861I Processor CP00: mainlock held: no
09:05:16 HHC00862I Processor CP00: intlock held: no
09:05:16 HHC00863I Processor CP00: waiting for intlock: no
09:05:16 HHC00864I Processor CP00: lock not held
09:05:16 HHC00866I Processor CP00: state STARTED
09:05:16 HHC00867I Processor CP00: instcount 16070442290
09:05:16 HHC00868I Processor CP00: siocount 84769
09:05:16 HHC00869I Processor CP00: psw 0706000000000000 0000000000000000
09:05:16 HHC00850I Processor IP01: CPUint=00000001 (State:00000001)&(Mask:DF00E62B)
09:05:16 HHC00851I Processor IP01: interrupt not pending
09:05:16 HHC00852I Processor IP01: I/O interrupt not pending
09:05:16 HHC00853I Processor IP01: clock comparator not pending
09:05:16 HHC00854I Processor IP01: CPU timer not pending
09:05:16 HHC00855I Processor IP01: interval timer not pending
09:05:16 HHC00856I Processor IP01: ECPS vtimer not pending
09:05:16 HHC00857I Processor IP01: external call not pending
09:05:16 HHC00858I Processor IP01: emergency signal not pending
09:05:16 HHC00859I Processor IP01: machine check interrupt not pending
09:05:16 HHC00860I Processor IP01: service signal not pending
09:05:16 HHC00861I Processor IP01: mainlock held: no
09:05:16 HHC00862I Processor IP01: intlock held: no
09:05:16 HHC00863I Processor IP01: waiting for intlock: no
09:05:16 HHC00864I Processor IP01: lock not held
09:05:16 HHC00866I Processor IP01: state STARTED
09:05:16 HHC00867I Processor IP01: instcount 40579796477
09:05:16 HHC00868I Processor IP01: siocount 24789
09:05:16 HHC00869I Processor IP01: psw 0706000000000000 0000000000000000
09:05:16 HHC00870I config mask 00000003 started mask 00000003 waiting mask 00000003
09:05:16 HHC00871I syncbc mask 00000000 
09:05:16 HHC00872I signaling facility not busy
09:05:16 HHC00873I TOD lock not held
09:05:16 HHC00874I mainlock not held; owner ffff
09:05:16 HHC00875I intlock not held; owner ffff
09:05:16 HHC00876I ioq lock not held
09:05:16 HHC00884I Channel Report queue: (empty)
09:05:16 HHC00880I device 0:0400: status busy local
09:05:16 HHC00881I I/O interrupt queue: (NULL)

The highlighted message above, issued by my patch once timed_wait_condition runs into its timeout, should never be issued if your theory is correct that the timed_wait_condition never gets executed. So, what are we missing?

Cheers
Jürgen

@Fish-Git
Member

Fish-Git commented Dec 11, 2024

Logic-wise I'm 100% with you. But then how do you explain the following?

...
09:05:02 QETH Timeout during halt processing: SCSW2_Q NOT cleared
...

Interesting! How did you create that (i.e. the error condition)? Did you create it artificially? Or were you somehow able to recreate it for real (i.e. an actual recreation of the originally reported problem)?

My only explanation is that the qeth_write_pipe function obviously succeeded. That it didn't get stuck in any type of continuous error loop as I originally surmised.

The highlighted message above, issued by my patch once timed_wait_condition runs into its timeout, should never be issued if your theory is correct that the timed_wait_condition never gets executed

Correct. Which can only mean, as I said just above, that the qeth_write_pipe function is obviously succeeding -- that it's not getting stuck in any type of continuous error loop as I originally theorized. Which is why I'm curious about how you were able to reproduce the problem. If it can be reproduced, I'd very much like to try it for myself! How did you do it?

Because looking at the affected code, I cannot see why, if the pipe signal was successfully sent, the original wait_condition (in the qeth_halt_data_device function) would not have been woken. Yes, it's an infinite wait (which it probably should not be, so your patch changing it to a timed_wait_condition is a good suggestion), but it should eventually have been signaled! Why would the signal take so long to be received? What could possibly delay it? THAT, I guess, is the true mystery! Why, in your original problem report, did the halt signal take so long to be received??

I'm stumped.  :(

Can you try the same test again, but after adding the "debug" option to your OSA (QETH) device statement, please? This should activate the DBGTRC statements, which will hopefully provide us with some clues as to what the frick is going on.

Thanks!


And just for fun, we might consider adding some (temporary!) debugging code to the qeth_write_pipe function too, reporting each time write_pipe fails (i.e. immediately before the sched_yield() statement), so we can see whether or not it is indeed looping as I originally suspected, or whether it is succeeding like your test shows.
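
Something like the following is all I have in mind (a purely hypothetical, debug-only fragment; the message text and the attempts counter are made up), inserted into the existing retry loop immediately before the sched_yield() call:

    /* hypothetical debug-only report of each failed write attempt */
    logmsg( "QETH DEBUG: write_pipe attempt %d failed: %s\n",
            ++attempts, strerror( errno ));
    sched_yield();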

If it is succeeding like your test shows, then that means the problem -- whatever the FRICK it is -- must be somewhere in the ACTIVATE QUEUES logic. For some reason the QDSIG_HALT is never being received (which seems illogical/impossible!) or something else highly unusual is going on!

Thanks.

@Juergen-Git
Collaborator Author

Which is why I'm curious about how you were able to reproduce the problem? If it can be reproduced, I'd very much like to try it for myself! How did you do it?

This is still the original problem: it occurs, reliably, when I open my slightly modified CICS web interface IVP website using HTTPS on a Windows system with LSO enabled on the network adapter. You can see what the result should look like by opening https://cicsweb:[email protected]:3005/CICS/CWBA/DFH$WB1A. That one runs on a Linux instance, and it is exactly this interaction that produces the error on a Windows instance with LSO enabled.

So, how can you reproduce it for yourself? You have to set up a CICS instance with its web interface activated over HTTPS. To do so, you need a certificate (self-signed suffices) and must make it available to the CICS web server through a TLS policy via PAGENT. Once you have the web interface up and running, you can replace the original IVP website with my modified one, and then you will see the problem for yourself.

The point here is that this is a reliable way to reproduce the problem, but the CICS setup itself most probably has nothing to do with the problem. As I said earlier, I'm convinced that there must be an easier way to reproduce it without having to set up a full-blown CICS instance with a web interface. One would somehow need to produce that same I/O pattern on the QETH interface, and I'd almost bet that the problem would then occur.

But nonetheless, let me know whether you want to go through the hassle of defining a certificate, creating a TLS rule and setting up the CICS web interface -- then I'll try to provide you, off list, with the information necessary to do this.

@Fish-Git
Member

Fish-Git commented Dec 12, 2024

Which is why I'm curious about how you were able to reproduce the problem? If it can be reproduced, I'd very much like to try it for myself! How did you do it?

This is still the original problem: It occurs, reliably, when I ...

Oh! I wasn't aware of that. Sorry. My bad. I thought this was a one-time occurrence. I wasn't aware that it could be reliably reproduced at will. That's great! That makes things much easier.

I'm convinced that there must be an easier way to reproduce it without having to set up a full-blown CICS instance with a web interface. One would somehow need to produce that same I/O pattern on the QETH interface, and I'd almost bet that the problem would then occur.

And therein is the rub: while it may be true that there is another, perhaps easier, way to reproduce the problem, we don't know that for certain, AND discovering that different way (since we don't currently know what it is) would likely take many attempts. (And then there's no guarantee that we would ever find it; it's just a presumption after all.)

Therefore I would prefer using the same technique that we already know does work, and works reliably. Why waste our time trying to discover a new way to reproduce the problem, when we already know of a way that reliably reproduces it each and every time?

... let me know whether you want to go through the hassle of defining a certificate, creating a TLS rule and setting up the CICS web interface -- then I'll try to provide you off list with the information necessary on how to do this.

I would like that very much! Yes!

While I'm sure we could work together on this bug, with me providing you some research patches to dump information at certain points in the code, you then rebuilding Herc with my patches applied, running your repro and sending me the results (and then lathering, rinsing and repeating as many times as necessary for me to get to the root of the problem and provide a fix, which you would then test for me), it would be much more convenient (and likely faster too) for both of us if you simply helped me set things up on my own system(s) here, so I can easily reproduce it for myself on demand whenever I need to.

Given that the problem involves both my CTCI-WIN product and having LSO mistakenly enabled on the adapter too (a common user error that usually involves a support call from some of my customers), it would be much better for me to be able to reproduce the bug for myself -- at will -- too, as that would allow me to test some changes I've been wanting to (trying to!) do for a long time now. If I can get my CTCI-WIN product to detect and warn my customers about their LSO settings, all the better for me too, and not just Hercules.

So let's do that: work together offline so I can obtain the needed help from you on setting up my own system so I can reproduce the problem myself. Once I can repro the problem on demand myself, I'll be able to find and fix the problem much more quickly and reliably than trying to debug this issue remotely. Agreed? Agreed!

So I'll wait for you to contact me off list.  :)

Thanks, Jürgen. You're a nice guy.

@Fish-Git
Member

Fish-Git commented Jan 1, 2025

It took a while, but with Jürgen's help, I was able to set up my z/OS system to recreate his problem.

I am only able to recreate it on my Windows 11 system, however (which is a VMware virtual machine, currently running 21H2(?), I think it's called), but I cannot recreate it on my Windows 7 host system.

So now that I am able to reproduce the crash at will (i.e. whenever I want to), I will hopefully be able to eventually track down the root cause and get the problem fixed.

Seeing as this is a Low Priority issue, I am going to change the GitHub status from "In Progress" to "Ongoing", since I suspect it's going to take me a while.

(Besides, there are other more important issues I could be working on, and if I try to work on too many things at once, I only end up making mistakes. So I'm putting this issue on hold for a while.)

@Fish-Git added the Ongoing ("Issue is long-term. Variant of IN PROGRESS: it's being worked on but maybe not at this exact moment.") label and removed the IN PROGRESS... label Jan 1, 2025