
[vioscsi] Fix IntLock regression in commit 15e64ace. #1293

Open
wants to merge 1 commit into master

Conversation

@MartinCHarvey-Nutanix commented Feb 14, 2025

This corrects interrupt/spinlock acquisition code that was regressed previously. The original code skipped lock acquisition when invoked directly from the ISR; the regression made it acquire an interrupt spinlock instead.

This PR restores the original behaviour.

Subsequent PRs are also needed to deal with:

  • Not indicating device ready until the DPCs are initialized, so the "no-lock" path is only taken when genuinely running as a crashdump driver.
  • Rechecking IRQL / DPC processing when running as a crashdump driver (a cursory check that it writes a dump is OK).

Ref: Nutanix ENG-741981.
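
For illustration, here is a minimal sketch of the locking decision being restored (this is not the actual diff; CalledFromIsr and DrainQueue are hypothetical stand-ins for the ISR flag and for the completion processing done inside ProcessBuffer()):

static VOID
ProcessBufferSketch(PVOID DeviceExtension, ULONG MessageId, BOOLEAN CalledFromIsr)
{
    PADAPTER_EXTENSION adaptExt = (PADAPTER_EXTENSION)DeviceExtension;
    STOR_LOCK_HANDLE LockHandle = {0};

    if (CalledFromIsr || adaptExt->dump_mode)
    {
        /* Pre-regression behaviour: no spinlock when entered directly from the
         * ISR or when running as a crashdump driver. The regression acquired
         * an InterruptLock here instead. */
        DrainQueue(DeviceExtension, MessageId); /* hypothetical helper */
    }
    else
    {
        /* Normal path: take the per-queue DPC spinlock, as before. */
        StorPortAcquireSpinLock(DeviceExtension,
                                DpcLock,
                                &adaptExt->dpc[MessageId - QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0)],
                                &LockHandle);
        DrainQueue(DeviceExtension, MessageId); /* hypothetical helper */
        StorPortReleaseSpinLock(DeviceExtension, &LockHandle);
    }
}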

@MartinCHarvey-Nutanix MartinCHarvey-Nutanix changed the title Fix IntLock regression in commit 15e64ace. [vioscsi] Fix IntLock regression in commit 15e64ace. Feb 14, 2025
@JonKohler (Contributor)

@benyamin-codez - We spotted this regression from commit: 15e64ac

Copying some of the internal repro data from @sb-ntnx

THREAD ffffac0c09eea040  Cid 0004.00d4  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap                 ffffbd8074465d60
Owning Process            ffffac0c09ab3040       Image:         System
Attached Process          N/A            Image:         N/A
Wait Start TickCount      70856          Ticks: 6009 (0:00:01:33.890)
Context Switch Count      24750          IdealProcessor: 1             
UserTime                  00:00:00.000
KernelTime                00:01:34.734
Win32 Start Address nt!ExpWorkerThread (0xfffff800af5702f0)
Stack Init ffffd083af1b4c70 Current ffffd083af1b3970
Base ffffd083af1b5000 Limit ffffd083af1af000 Call 0000000000000000
Priority 13  BasePriority 12  Decay Boost 1  IoPriority 2  PagePriority 5
Child-SP          RetAddr               : Args to Child                                                           : Call Site
fffff800`4166eb28 fffff800`af7012b2     : 00000000`00000080 00000000`004f4454 00000000`00000000 00000000`00000000 : nt!KeBugCheckEx
fffff800`4166eb30 fffff800`af6fb6f9     : ffffac0c`0c7e4000 00000000`00001000 00000000`00001000 fffff800`af64869d : nt!HalpNMIHalt+0x2e
fffff800`4166eb70 fffff800`40d01250     : 00000000`00000000 fffff800`4166ec39 ffffac0c`0c7e4908 fffff800`b000fc90 : nt!HalBugCheckSystem+0x69
fffff800`4166ebb0 fffff800`af8b7c2b     : 00000000`00000000 fffff800`4166ec39 ffffac0c`0c7e4908 00000000`00000000 : PSHED!PshedBugCheckSystem+0x10
fffff800`4166ebe0 fffff800`af70108b     : fffff800`b01cd300 fffff800`b01cd300 fffff800`b000fc90 00000000`0000005c : nt!WheaReportHwError+0x316f7b
fffff800`4166eca0 fffff800`af7768cf     : fffff800`b01cd3c0 fffff800`4166ed10 00000000`00000000 fffff800`4166ed10 : nt!HalHandleNMI+0x14b
fffff800`4166ecd0 fffff800`af885cc2     : 00000000`d93c5b0a fffff800`4166eed0 00000000`00000000 ffffac0c`10fc6dd0 : nt!KiProcessNMI+0xff
fffff800`4166ed10 fffff800`af885a2e     : 00000000`00000000 00000000`d93c5b0a fffff800`4166eed0 00000000`00000000 : nt!KxNmiInterrupt+0x82
fffff800`4166ee50 fffff800`af4bd80c     : ffffffff`ffffffc0 00000000`00010008 00000000`00000180 fffff756`0608f700 : nt!KiNmiInterrupt+0x26e (TrapFrame @ fffff800`4166ee50)
ffffd083`af1b4020 fffff800`af60eb70     : ffffac0c`10fc6e40 00000000`af40159d 00000000`00000000 00000000`00000000 : nt!KxWaitForSpinLockAndAcquire+0x1c
ffffd083`af1b4050 fffff800`41f69097     : ffffac0c`1039f1a0 00000000`00000000 00000000`00000000 ffffac0c`1039f1a0 : nt!KeAcquireInterruptSpinLock+0x40
ffffd083`af1b4090 fffff800`41f68312     : ffffac0c`1039f1a0 ffffd083`af1b43d0 00000000`00000000 00000000`00000000 : storport!RaidAdapterAcquireInterruptLock+0x37
ffffd083`af1b40c0 fffff800`41f66f6a     : ffffac0c`103a41a0 00000000`00000000 00000000`000000ff 00000000`00000001 : storport!RaidBusEnumeratorGetUnit+0x282
ffffd083`af1b4180 fffff800`41f654d1     : 00000000`00000000 00000000`0000000b ffffac0c`09105540 ffffac0c`1039f1a0 : storport!RaidAdapterEnumerateBus+0xda
ffffd083`af1b43b0 fffff800`41f64f9d     : 00000000`00000000 ffffd083`af1b4529 ffffac0c`1039f1a0 00000000`00000000 : storport!RaidAdapterRescanBus+0xcd
ffffd083`af1b4490 fffff800`41f63eae     : 00000000`00001001 00000000`00000fff 00000000`00000000 00000000`00000000 : storport!RaidAdapterQueryDeviceRelationsIrp+0xa1
ffffd083`af1b4590 fffff800`41f63bf0     : 00000000`00000003 fffff800`af474c75 ffffac0c`10e53bd0 ffffac0c`118dab60 : storport!RaidAdapterPnpIrp+0x1ea
ffffd083`af1b4670 fffff800`af4f79fe     : ffffac0c`10e53bd0 ffffd083`af1b47d0 ffffac0c`1039f050 00000000`69706e04 : storport!RaDriverPnpIrp+0x50
ffffd083`af1b46b0 fffff800`afc24bfa     : ffffac0c`0c3cc360 ffffac0c`118dab60 00000000`00000000 00000000`00000000 : nt!IofCallDriver+0xbe
ffffd083`af1b46f0 fffff800`af5a1ca2     : ffffac0c`0c3cc360 00000000`00000000 ffffac0c`118dab60 ffffbd80`82a707d0 : nt!PnpAsynchronousCall+0xe6
ffffd083`af1b4730 fffff800`afb17561     : 00000000`00000000 00000000`00000000 ffffac0c`0c3cc360 ffffac0c`0c457b20 : nt!PnpSendIrp+0x9a
ffffd083`af1b47a0 fffff800`afb172fc     : ffffac0c`118dab60 ffffac0c`0c457b48 ffffac0c`0c457b20 00000000`00000001 : nt!PnpQueryDeviceRelations+0x51
ffffd083`af1b4830 fffff800`afb108f5     : ffffac0c`0c457b20 ffffd083`af1b48e1 00000000`00000002 00000000`00000001 : nt!PipEnumerateDevice+0xc4
ffffd083`af1b4860 fffff800`afc92ce2     : ffffac0c`0c457b20 ffffac0c`11cc8090 ffffd083`af1b4980 fffff800`00000000 : nt!PipProcessDevNodeTree+0x4b9
ffffd083`af1b4930 fffff800`af63c406     : 00000001`00000003 ffffac0c`0c457b20 00000000`00000000 00000000`00000000 : nt!PiRestartDevice+0xbe
ffffd083`af1b4980 fffff800`af5704a2     : ffffac0c`09eea040 ffffac0c`09a97130 fffff800`af63c150 ffffac0c`00000000 : nt!PnpDeviceActionWorker+0x2b6
ffffd083`af1b4a40 fffff800`af65585a     : ffffac0c`09eea040 ffffac0c`09eea040 fffff800`af5702f0 ffffac0c`09a97130 : nt!ExpWorkerThread+0x1b2
ffffd083`af1b4bf0 fffff800`af87ae54     : fffff800`3d57d180 ffffac0c`09eea040 fffff800`af655800 000003cc`868b4102 : nt!PspSystemThreadStartup+0x5a
ffffd083`af1b4c40 00000000`00000000     : ffffd083`af1b5000 ffffd083`af1af000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34
0: kd> !thread ffffac0c112dc080
THREAD ffffac0c112dc080  Cid 088c.06bc  Teb: 000000ef1dbd7000 Win32Thread: 0000000000000000 RUNNING on processor 1
Not impersonating
DeviceMap                 ffffbd8074465d60
Owning Process            ffffac0c100e9300       Image:         svchost.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      70838          Ticks: 6027 (0:00:01:34.171)
Context Switch Count      2              IdealProcessor: 0             
UserTime                  00:00:00.000
KernelTime                00:00:00.000
Win32 Start Address 0x00007ffef9f31a00
Stack Init ffffd083b1acac70 Current ffffd083b1aca3e0
Base ffffd083b1acb000 Limit ffffd083b1ac5000 Call 0000000000000000
Priority 9  BasePriority 8  Decay Boost 1  IoPriority 2  PagePriority 5
Child-SP          RetAddr               : Args to Child                                                           : Call Site
ffffd401`dcb33c80 fffff800`af60eb70     : ffffac0c`0f85c320 ffffd401`dcb17180 00000000`000114c7 00000000`00000000 : nt!KxWaitForSpinLockAndAcquire+0x1c
ffffd401`dcb33cb0 fffff800`41f7512a     : 00000000`00000000 00000000`00000000 ffff76ba`00000001 00000000`00000000 : nt!KeAcquireInterruptSpinLock+0x40
ffffd401`dcb33cf0 fffff800`45e34cb4     : 00000000`00000246 ffffac0c`10d58010 00000000`00000003 00000000`00000000 : storport!StorPortNotification+0x5ca
(Inline Function) --------`--------     : --------`-------- --------`-------- --------`-------- --------`-------- : vioscsi!StorPortAcquireSpinLock+0x1d (Inline Function @ fffff800`45e34cb4) [F:\ewdk\24h2\Program Files\Windows Kits\10\Include\10.0.22621.0\km\storport.h @ 11489] 
ffffd401`dcb33dc0 fffff800`45e37100     : ffffac0c`10d58010 00000000`00000000 fffff800`45e3f290 00000000`00000000 : vioscsi!ProcessBuffer+0xd8 [E:\windows-drivers-build-root\kvm-guest-drivers-windows\vioscsi\vioscsi.c @ 1459] 
(Inline Function) --------`--------     : --------`-------- --------`-------- --------`-------- --------`-------- : vioscsi!DispatchQueue+0x96 (Inline Function @ fffff800`45e37100) [E:\windows-drivers-build-root\kvm-guest-drivers-windows\vioscsi\vioscsi.c @ 1420] 
ffffd401`dcb33e30 fffff800`45e36fd0     : 00000000`00000000 00000000`00000001 00000000`00000003 00000000`00000000 : vioscsi!VioScsiMSInterruptWorker+0xdc [E:\windows-drivers-build-root\kvm-guest-drivers-windows\vioscsi\vioscsi.c @ 1039] 
ffffd401`dcb33e70 fffff800`41f7d80b     : 00000000`00000000 00000000`00000000 0000ffff`f8004418 00000000`00000000 : vioscsi!VioScsiMSInterrupt+0x30 [E:\windows-drivers-build-root\kvm-guest-drivers-windows\vioscsi\vioscsi.c @ 1114] 
ffffd401`dcb33ea0 fffff800`af6073d1     : ffffd401`de14cb40 0000023a`acf154a6 ffff76ba`0a4ab12f ffffac0c`100e9300 : storport!RaidpAdapterMSIInterruptRoutine+0x7b
ffffd401`dcb33f00 fffff800`af47047f     : 00000000`00000000 00000000`00000001 ffffd401`de14cb40 ffffd083`b1ac98e0 : nt!KiInterruptMessageDispatch+0x11
ffffd401`dcb33f30 fffff800`af87bd83     : 00000000`00000000 ffffd401`de14cb40 00000000`00000000 fffff800`af87be9e : nt!KiCallInterruptServiceRoutine+0x4bf
ffffd401`dcb33f90 fffff800`af87beec     : 00000002`93e912c0 ffffd083`b1ac98e0 00000002`93e91423 00000237`1c4d4e14 : nt!KiInterruptSubDispatch+0x73 (TrapFrame @ ffffd401`dcb33e70)
ffffd083`b1ac9860 fffff800`afb5ef75     : fffff800`afab711b 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatch+0x3c (TrapFrame @ ffffd083`b1ac9860)
ffffd083`b1ac99f8 fffff800`afab711b     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!PspCallThreadNotifyRoutines+0x5
ffffd083`b1ac9a00 fffff800`afab66be     : ffffac0c`112dd080 ffffac0c`100e9300 ffffd083`b1aca2f0 ffffd083`b1ac9b30 : nt!PspInsertThread+0x6a3
ffffd083`b1ac9ad0 fffff800`afab4f72     : ffffd083`b1ac9df0 ffffd083`b1aca870 ffffd083`b1ac9df0 ffffd083`b1aca870 : nt!PspCreateThread+0x2ea
ffffd083`b1ac9d80 fffff800`af88d355     : 00000000`00000000 00000000`00000000 ffffac0c`126923d0 00000000`00000002 : nt!NtCreateThreadEx+0x2d2
ffffd083`b1aca600 fffff800`af87b7e0     : fffff800`afbe23c1 ffffd083`b1aca920 00000000`00000000 ffffac0c`09bcedf0 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffffd083`b1aca670)
ffffd083`b1aca808 fffff800`afbe23c1     : ffffd083`b1aca920 00000000`00000000 ffffac0c`09bcedf0 fffff800`afa976e0 : nt!KiServiceLinkage
ffffd083`b1aca810 fffff800`af52b176     : ffffac0c`0f822d30 00000000`00000001 ffffac0c`0f822d30 fffff800`af47a47a : nt!RtlpCreateUserThreadEx+0x151
ffffd083`b1aca950 fffff800`af52a576     : 00000000`00000000 ffffac0c`0f822d30 ffffac0c`0f822ec8 ffffd083`b1acaa20 : nt!ExpWorkerFactoryCreateThread+0x10e
ffffd083`b1acaa00 fffff800`af52a052     : 00000000`00000002 ffffac0c`0f822d30 ffffac0c`10567a08 ffffac0c`10567a00 : nt!ExpWorkerFactoryCheckCreate+0x196
ffffd083`b1acaa60 fffff800`af88d355     : ffffac0c`112dc080 ffffac0c`0f822d30 ffffd083`b1acab60 0000020a`36120b20 : nt!NtReleaseWorkerFactoryWorker+0x232
ffffd083`b1acaae0 00007ffe`fa0426d4     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffffd083`b1acaae0)
000000ef`1df7f3f8 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ffe`fa0426d4

@MartinCHarvey

I would note this might be a race condition at startup, before passive initialization has completed. I would like to do some more instrumentation and testing of that, but time is short and I'm now on PTO. Expect further analysis in a week or so.

@benyamin-codez (Contributor) commented Feb 15, 2025

@JonKohler

Thanks for the heads-up and the data, Jon.
In Martin's absence, perhaps yourself or @sb-ntnx could provide further insight?

I'm curious under which circumstances the NMI is produced.
It looks like it's happening during driver init. Is that correct?
Is this perhaps a resume from suspend/hibernation operation?

In the parent PR (#1175), the crash dump / hibernation path also had a mention, reproduced here for your convenience.

Thanks, we also need to test vCPU > 1 and num_queues = 1 configuration because in this case we can expect ISR and DPC to be executed on different CPUs simultaneously.

We know that when MSISupported=0, num_queues=1 via:

if (adaptExt->dump_mode || !adaptExt->msix_enabled)
{
    adaptExt->num_queues = 1;
}

So using MSISupported=1 and num_queues=1 via the QEMU command line with vCPU>1 results in VioScsiInterrupt() only issuing DPCs whilst adaptExt->dpc_ok=TRUE via:

if (!adaptExt->dump_mode && adaptExt->dpc_ok)
{
    StorPortIssueDpc(DeviceExtension,
                     &adaptExt->dpc[0],
                     ULongToPtr(QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0)),
                     ULongToPtr(QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0)));
}
else
{
    ProcessQueue(DeviceExtension, QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0), TRUE);
}

The InterruptLock type spinlock (which I have not tested) would only be used in the else branch, i.e. when IsCrashDumpMode=TRUE or when adaptExt->dpc_ok=FALSE (or !adaptExt->dpc_ok).

In any case - to answer your question - this configuration also tests ok with this commit.

that !adaptExt->dpc_ok condition is a kind of tricky one. IIRC, we hit this case when resuming from Suspend and/or Hibernation

Very tricky indeed from what I can tell..! 8^d

In fact, I cannot see how this would work when all other spinlocks are the DpcLock type. In my previous testing, issuing InterruptLock type spinlocks when adaptExt->dpc_ok=TRUE usually clobbered in-flight DPCs.

I think this is worth exploring, but in a future PR as previously suggested.

Regarding Martin's commit, it appears the changes in helper.c and similar changes in vioscsi.c would not be functionally different. However, Martin has certainly identified the regression where if calling an InterruptLock type spinlock, previously the spinlock was avoided. This avoidance functionality was previously found in the obsoleted functions VioScsiVQLock() and VioScsiVQUnlock() but I would suggest this was really just a hack.

I note it also appears this commit has an SDV failure, but I cannot immediately see a reason as to why...

imho it would be good if we can retain the mnemonic semantics of STOR_SPINLOCK in ProcessBuffer().
This will help us in the future to consider use of other spinlock types e.g. ThreadedDpcLock.
I have some ideas on how best to handle this from DispatchQueue(), using DpcLevelLock when adaptExt->dpc_ok=TRUE and keeping InterruptLock when adaptExt->dpc_ok=FALSE, but I will need to do some basic testing, on the presumption that this regression occurs when resuming from suspend or hibernation.

So give me a moment to run that up and I will raise a PR for your team's consideration and further testing.
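
A minimal sketch of that idea, assuming ProcessBuffer() takes the STOR_SPINLOCK type as its third parameter (illustrative only; as noted later in this thread, DpcLevelLock would also require StorPortAcquireSpinLockEx() to actually acquire it):

/* Choose the spinlock type in DispatchQueue() based on DPC readiness,
 * keeping the STOR_SPINLOCK semantics in ProcessBuffer(). */
STOR_SPINLOCK LockType = (!adaptExt->dump_mode && adaptExt->dpc_ok) ? DpcLevelLock : InterruptLock;
ProcessBuffer(DeviceExtension, MessageId, LockType);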

@benyamin-codez (Contributor) commented Feb 16, 2025

@JonKohler @sb-ntnx

Jon and Sergey,

Firstly, I have been unable to reproduce an NMI (using Win11 24H2), even when using v266, whilst performing hibernation. It's possible we have an issue when asking Windows to suspend, but when using qm suspend <vmid> --todisk and then resuming, I don't get an NMI then either. Same when I do a Windows suspend and then qm stop <vmid> and then start again. The failure I see (on Proxmox) is following a Windows suspend, when resuming with mouse movement or by issuing a resume command, the CPU pegs at 100% and the VM hangs. Stopping and restarting the VM results in a successful resume. This was when using any build whether v266, Martin's cut or my PR below.

In any case, I've raised PR #1294 to potentially address this regression in what I guess we historically called the crashdump + resume / hibernation pathway. I should note the HW_INTERRUPT parent pathway, which you are NOT hitting, is refactored into the DispatchQueue() path via PR #1214 (which is sitting in draft, awaiting approval and merging of PR #1228). That being said, I have included it in the PR I raised anyway, in addition to the HW_MESSAGE_SIGNALLED_INTERRUPT_ROUTINE pathway that you are hitting.

As mentioned above, in my testing hibernate and suspend/resume didn't cause any NMIs, and I was only able to observe the (!IsCrashDumpMode && adaptExt->dpc_ok) = TRUE pathway being used.

Are you able to share anything further about the environment or the test scenario?
In particular, the values of any of the following leading up to the NMI would likely provide the necessary insight:

IsCrashDumpMode (adaptExt->dump_mode)
adaptExt->dpc_ok
MSISupported (adaptExt->msix_enabled)
num_queues
num_cpus

If you are using a BusType other than BusTypeSas or BusTypeScsi, that might also have bearing on this issue.

I guess it is also possible this is something related to vhost plumbing?

Could I ask you to try the commit in PR #1294 with your reproducer to see if that resolves the issue? Using DpcLevelLock rather than InterruptLock when adaptExt->dpc_ok=TRUE should avoid any clobbering of in-flight DPCs. If the issue remains unresolved, we might need to consider using the InvalidLock spinlock type and catch it in ProcessBuffer() to avoid the spinlocks (which is my preference in order to retain the use of STOR_SPINLOCK in ProcessBuffer()). In lieu of supply of the above values from the debugger, we may need to add some traces to see which IsCrashDumpMode || adaptExt->dpc_ok pathway we are in to determine where to resolve this.

@sb-ntnx (Contributor) commented Feb 17, 2025

@benyamin-codez, the problem happens during vioscsi driver initialization, on either installation / upgrade of the driver or SCSI controller start (disable and enable it from the Device Manager). The problem is that the VM hangs, and this seems to be due to a spinlock not being released. The NMI part is not relevant to the problem; it corresponds to NMI injections used to get a memory dump.

I can share the memory dump with vioscsi if it can be of any use.

@benyamin-codez (Contributor)

@sb-ntnx

The problem happens during vioscsi driver initialization, on either installation / upgrade of the driver or SCSI controller start (disable and enable it from the Device Manager). The problem is that the VM hangs, and this seems to be due to a spinlock not being released.

Does it BSOD or just hang?

I can share the memory dump with vioscsi if it can be of any use.

I sent you an email if you want to send me a link.

The values of these variables (Locals) from the debugger might be sufficient.
IsCrashDumpMode (adaptExt->dump_mode)
adaptExt->dpc_ok
MSISupported (adaptExt->msix_enabled)
num_queues
num_cpus

@benyamin-codez (Contributor)

@sb-ntnx
CC: @JonKohler

The NMI part is not relevant to the problem; it corresponds to NMI injections used to get a memory dump.

Except in the event that crashdump processing requires that no spinlocking occurs, which I guess would be the norm.

Also use of InvalidLock type spinlocks requires Win11 and both InvalidLock and DpcLevelLock type spinlocks require using StorPortAcquireSpinLockEx(). Therefore, I have updated PR #1294 to ignore all requests for InterruptLock type spinlocks in ProcessBuffer(). This should restore the previous behaviour. Please test it at your earliest convenience.

benyamin-codez added a commit to benyamin-codez/kvm-guest-drivers-windows that referenced this pull request Feb 22, 2025
… (regression)

Credit to Nutanix and in particular @MartinCHarvey-Nutanix for his work in PR virtio-win#1293.

Background: We previously ignored calls for a spinlock with isr=TRUE in
VioScsiVQLock() and VioScsiVQUnlock(). This was replaced with a call to
InterruptLock in the (!IsCrashDumpMode && adaptExt->dpc_ok) = FALSE pathway.
In testing, suspend/resume/hibernate did not use this pathway but instead
issued DPCs. The InterruptLock was presumed to be used when IsCrashDumpMode=TRUE.
Also, using PVOID LockContext = NULL, and / or then setting LockContext
to &adaptExt->dpc[vq_req_idx], appears to cause a HCK Flush Test failure.

Created new overloaded enumeration called CUSTOM_STOR_SPINLOCK which adds some
new (invalid) spinlock types such as Skip_Locking and No_Lock. Also provides InvalidLock
for builds prior to NTDDI_WIN11_GE (Windows 11, version 24H2, build 26100) via Invalid_Lock.
In a similar vein, Dpc_Lock = DpcLock, StartIo_Lock = StartIoLock, Interrupt_Lock =
InterruptLock and ThreadedDpc_Lock = ThreadedDpcLock.

This fix has two components:

1. Only DpcLock type spinlocks are processed in ProcessBuffer(), with all other
   types presently being ignored; and
2. The (PVOID)LockContext is no longer used, with calls to StorPortAcquireSpinLock()
   for DpcLock type spinlocks using &adaptExt->dpc[vq_req_idx] directly.

Note: Use of InvalidLock requires Win11 and both InvalidLock and DpcLevelLock
      require using StorPortAcquireSpinLockEx. Consider for future use.

Signed-off-by: benyamin-codez <[email protected]>
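
A rough sketch of what the overlaid enumeration described above might look like (member names are taken from the commit message; the numeric values, the NTDDI gating and any omitted members are assumptions):

typedef enum _CUSTOM_STOR_SPINLOCK
{
    /* New, deliberately invalid types used to request that locking be skipped. */
    Skip_Locking = 0x100, /* assumed value */
    No_Lock,
#if (NTDDI_VERSION >= NTDDI_WIN11_GE)
    Invalid_Lock = InvalidLock, /* native value where available (Win11 24H2+) */
#else
    Invalid_Lock = 0x1FF, /* assumed placeholder for older SDKs */
#endif
    /* Mirrors of the native STOR_SPINLOCK values. */
    Dpc_Lock = DpcLock,
    StartIo_Lock = StartIoLock,
    Interrupt_Lock = InterruptLock,
    ThreadedDpc_Lock = ThreadedDpcLock
} CUSTOM_STOR_SPINLOCK;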
@MartinCHarvey-Nutanix (Author)

@benyamin-codez OK, this looks pretty good. I still have a bit more testing to go, but this looks pretty sound so far.

if (adaptExt->dpc_ok)
{
    RhelDbgPrint(TRACE_LEVEL_WARNING, "Unexpected: DPC already initialized.\n");
}
Contributor

This would likely be hit following ScsiRestartAdapter and the call to VioScsiHwReinitialize().
It might be worth appending " Adapter may have restarted." or similar to the debug message.
Perhaps another reason adaptExt->dpc_ok should be adaptExt->dpc_ready...?


OK. Made the debug print less worrying. The restart cases are not easy for me to test manually, but I note your HCK tests do test PnP stop / restart / rebalance etc, so they should check this case.

Contributor

I think this might find wide support too:

adaptExt->dpc_ok should be adaptExt->dpc_ready...

//Sharing of interrupts and "isServiced" based on reading ISR regs to disambig
//line based interrupts. Interrupt is ours, but at the wrong time,
//do not change isInterruptServiced value.
}
Contributor

So this else path where !adaptExt->dump_mode && !adaptExt->dpc_ok is a new pathway that skips ProcessBuffer(). The previous behaviour was to run ProcessBuffer() without spinlocks. Are we assuming here that the interrupt has already achieved the purpose for which it was called and we can now safely return?


This should never happen now that we have moved virtio_device_ready to the point where we know we have DPCs initialized. To get an interrupt at this point would be early on in the initialization (no previous reinitialization), and would (IMO) be considered a hardware (...emulated hardware) error. Devices shouldn't interrupt until you've told them you can handle interrupts.

Some previous diagnostic code I had or'ed in a status bit such that the buffer processing was subsequently done when the dpc was ready, but it got tricky for line-based interrupts because you had to store the queue number somewhere.

I notice that most int processing for "non message" queues (i.e. control q and events q) is done in the handler, with later message processing deferred. The great thing about queues is that you can leave stuff queued up, provided there is a subsequent interrupt. Indicating virtio_device_ready subsequently should produce that interrupt.

Note that this is in the line-based interrupt handling, and current impl of the device (and likely all subsequent impl) is MSI based. Similar logic for the MSI case is in DispatchQueue, where we assume it's okay to leave things queued in the expectation of an interrupt when we call virtio_device_ready.


So this else path where !adaptExt->dump_mode && !adaptExt->dpc_ok is a new pathway

Second thoughts. Maybe this allowance for processing with no buffering is a previous workaround for something else? In that case, I can change the behaviour back to the way it was. I'd still like to move the virtio_device_ready call to later.

Contributor

Second thoughts. Maybe this allowance for processing with no buffering is a previous workaround...

That might be the safest option.

Only when ((virtio_read_isr_status(&adaptExt->vdev) == 1) || adaptExt->dump_mode) is FALSE do we return early. So when TRUE we process further, and that's usually for a reason...

I'd still like to move the virtio_device_ready call to later.

Provided we only use DPCs, I don't see any reason why virtio_device_ready() can't stay in VioScsiPassiveInitializeRoutine(). We do return FALSE; when !StorPortEnablePassiveInitialization() so I don't think this will be an issue, but @vrozenfe is the one who would likely know best.

Contributor

Still keeping this new pathway...?

else
{
    RhelDbgPrint(TRACE_LEVEL_WARNING, "Spurious interrupt (Messageid 0x%x), ditching\n", MessageId);
}
Contributor

Could you add your comment in VioScsiInterrupt() re this here as well?
I have PR #1214 in the pipe, which if merged, will consolidate these calls into this function.

@MartinCHarvey-Nutanix (Author) commented Feb 27, 2025

My comment in VioScsiInterrupt refers to the disambiguation of interrupt ownership by reading the ISR, not the subsequent DPC processing. The point of the comment in the line-based ISR routine is that you should accurately return to the OS whether the interrupt was generated by your device, not whether you have your house in order well enough to perform all subsequent processing.

This allows the OS to refrain from calling all later handlers on anything that shares some legacy PIC/APIC line, and then getting confused about whether the int has actually been serviced by any device.

It does not apply to MSI(X) handlers, which (when real hardware exists) are effectively in-band pci/read write operations, and always confined to a physical or logical device.
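
For context, a simplified sketch of the ownership check being described (not the actual VioScsiInterrupt() body; the early-return condition and the DPC branch mirror the snippets quoted earlier in this thread):

BOOLEAN VioScsiInterruptSketch(PVOID DeviceExtension)
{
    PADAPTER_EXTENSION adaptExt = (PADAPTER_EXTENSION)DeviceExtension;

    /* Reading the ISR status register acknowledges the interrupt and tells us
     * whether our device raised it on the (possibly shared) line. */
    if (!((virtio_read_isr_status(&adaptExt->vdev) == 1) || adaptExt->dump_mode))
    {
        /* Not ours: report "not serviced" so the OS can try the other handlers
         * sharing the legacy PIC/APIC line. */
        return FALSE;
    }

    /* Ours: report serviced, and defer the real work to a DPC when available. */
    if (!adaptExt->dump_mode && adaptExt->dpc_ok)
    {
        StorPortIssueDpc(DeviceExtension,
                         &adaptExt->dpc[0],
                         ULongToPtr(QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0)),
                         ULongToPtr(QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0)));
    }
    else
    {
        /* Crash-dump path or DPCs not ready: process inline, without spinlocks. */
        ProcessQueue(DeviceExtension, QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0), TRUE);
    }
    return TRUE;
}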

Contributor

Do you think you will still keep this code path unlike in the line based pathway?
Methinks the same logic would apply here too, perhaps more so, certainly if PR #1214 merges.

We should definitely still refactor the call to ProcessBuffer() into the else branch though...!

Contributor

Still keeping this new pathway...?

@benyamin-codez (Contributor)

Interesting approach, Martin.

Any joy moving that virtio_device_ready() routine...?
I was thinking it might be the case that it is required where it is, ...
...i.e. before StorPort calls HW_PASSIVE_INITIALIZE_ROUTINE (i.e. VioScsiPassiveInitializeRoutine())
...i.e. in HW_INITIALIZE (i.e. VioScsiHwInitialize())...
...so I'm very curious to see where you might call it from, and if that works.

Per the MS doco you linked in the other PR, provided the StorPortEnablePassiveInitialization() runs in VioScsiHwInitialize() the call to virtio_device_ready() is probably well placed in VioScsiPassiveInitializeRoutine().

Others will likely need to weigh in on whether we can skip ProcessBuffer() for any interrupts when DPC is not ok/ready.
This would include MSIs too if PR #1214 merges with validation occurring in ProcessQueue().
( @vrozenfe 👀 )

Still hoop jumping to get this to host a boot drive. Getting it to host the location of the dump file should be easier.

Can you confirm that's working now?

@MartinCHarvey-Nutanix (Author) commented Feb 27, 2025

Can you confirm that's working now?

Once I've placated my boss, juggled the 5 other tasks I have on my plate, set up some VMs, and got an SDV build to work, I'll have an answer for you. Normal operation works. I will push changes to this review when satisfied that crashdump handling has not regressed.

@benyamin-codez (Contributor)

... juggled the 5 other tasks I have on my plate...

Only 5...!?!?! 8^d 8^D

Thanks for giving the above some consideration. I appreciate your collaborative approach.

I'll mod my WIP - with something close to what I presume you will end up with - and have a play... 8^d

@MartinCHarvey-Nutanix (Author)

Changes following code feedback/review. Also retested both normal operation and crashdump writing to a secondary drive. @YanVugenfirer, if this passes HCK, please merge soon, as the regression affects a lot of our VMs.

@MartinCHarvey-Nutanix (Author)

@benyamin-codez If you're feeling keen, you could merge this across to virtio-stor, because the issue is there as well. And, depending on the merge order of PRs, please don't undo it by merging over the top, thanks.

@benyamin-codez (Contributor)

If you're feeling keen, you could merge this across to virtio-stor, because the issue is there as well.

Seems only fair. Thanks for cleaning up in aisle 5... 8^d

And, depending on the merge order of PRs, please don't undo it by merging over the top, thanks.

No problem, refactoring already....

@MartinCHarvey-Nutanix (Author)

Reverting to simplest and most conservative version of this fix.

@vrozenfe (Collaborator) commented Mar 2, 2025

Why not just revert the patch that caused the problem?

Best,
Vadim.

@benyamin-codez (Contributor)

@vrozenfe @YanVugenfirer

Why not just revert the patch that caused the problem?

Primarily because there was a performance improvement removing the spinlock managers.

@benyamin-codez (Contributor)

@vrozenfe @YanVugenfirer

Why not just revert the patch that caused the problem?

Primarily because there was a performance improvement removing the spinlock managers.

There was also a fair bit of mnemonic, semantic refactoring and some preparatory changes, e.g.:

Renaming of ProcessQueue() to ProcessBuffer(). Whilst technically more accurate, this also provides space for a new ProcessQueue() routine, via which the HW_MESSAGE_SIGNALLED_INTERRUPT_ROUTINE and HW_INTERRUPT routines can then be consolidated.

This latter work is in PR #1214, which I am presently updating.

@MartinCHarvey

I'm happy with any solution which changes the locking behaviour back to the way it was - my only reason to include additional definitions here is to make clear what the software is doing, in order to prevent the same mistake in any subsequent refactors.

@benyamin-codez (Contributor)

I also think this solution is much more elegant than the previous spinlock managers.
The performance improvement likely came from not having to repeat superfluous checks they performed.
Their remaining use, to skip spinlocks, becomes unjustified, as this is the only place in vioscsi where we skip spinlocks.

The mnemonic changes Martin mentions would have helped to avoid my mistake in the first place too. I had mistakenly thought that a combination of isr and spinlock could occur because I had misunderstood the purpose of the bool. This solution makes the purpose clear - at least in my eyes... qX^{d>--

@benyamin-codez (Contributor)

@vrozenfe @MartinCHarvey-Nutanix

Also, let's not forget, the relocation of virtio_device_ready() is likely beneficial.
It works well in my testing; I think in yours too, Martin..?

virtio_device_ready() notably has no return code.
We also set adaptExt->dpc_ok blindly as StorPortInitializeDpc() has no return code.

Vadim, perhaps you have a different view on this change..?

Per the MS doco you linked..., provided the StorPortEnablePassiveInitialization() runs in VioScsiHwInitialize() the call to virtio_device_ready() is probably well placed in VioScsiPassiveInitializeRoutine().
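
For reference, a simplified sketch of the ordering being discussed (routine bodies are heavily abridged; the DPC routine name and the ADAPTER_EXTENSION layout are assumptions):

BOOLEAN VioScsiHwInitialize(PVOID DeviceExtension)
{
    /* Register the passive-level callback; StorPort calls it at PASSIVE_LEVEL
     * after HwInitialize returns. */
    if (!StorPortEnablePassiveInitialization(DeviceExtension, VioScsiPassiveInitializeRoutine))
    {
        return FALSE;
    }
    return TRUE;
}

BOOLEAN VioScsiPassiveInitializeRoutine(PVOID DeviceExtension)
{
    PADAPTER_EXTENSION adaptExt = (PADAPTER_EXTENSION)DeviceExtension;
    ULONG idx;

    /* Initialize one DPC per request queue first. StorPortInitializeDpc()
     * returns VOID, so dpc_ok is set "blindly", as noted above. */
    for (idx = 0; idx < adaptExt->num_queues; ++idx)
    {
        StorPortInitializeDpc(DeviceExtension, &adaptExt->dpc[idx], VioScsiCompleteDpcRoutine /* assumed name */);
    }
    adaptExt->dpc_ok = TRUE;

    /* Only then tell the device it may raise interrupts; virtio_device_ready()
     * has no return code either. */
    virtio_device_ready(&adaptExt->vdev);
    return TRUE;
}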

@benyamin-codez (Contributor)

Martin, I've closed PR #1294, so hopefully all eyes are on this one.

@MartinCHarvey-Nutanix (Author)

Also, let's not forget, the relocation of virtio_device_ready() is likely beneficial.

I saw some of the auto-HCK tests (flush test and disk logo tests) not looking so good, so reverted it to something more conservative.

It would be nice to have a better statistical idea of what works/fails in the automated tests and what's made them better or worse. For the moment, on the basis of 3 or 4 tests passed or failed, can't say either way.

@benyamin-codez (Contributor)

Also, let's not forget, the relocation of virtio_device_ready() is likely beneficial.

I saw some of the auto-HCK tests (flush test and disk logo tests) not looking so good, so reverted it to something more conservative.

Oh, I didn't realise you dropped that part. The new pathways (!adaptExt->dump_mode && !adaptExt->dpc_ok) with the new action (ditching without calling ProcessBuffer()) were perhaps a bit risky. I thought the relocation of virtio_device_ready() made sense.

Subject to @vrozenfe's advice, I would put it back in, but perhaps he and Yan would prefer a new PR for it. Re viostor, the issue of removing spinlock managers aside, the relocation of virtio_device_ready() was the only remaining part of this PR to port over, yes...?

It would be nice to have a better statistical idea of what works/fails in the automated tests and what's made them better or worse. For the moment, on the basis of 3 or 4 tests passed or failed, can't say either way.

I agree.
I can say - having monitored most if not all of those checks - that I believe all of those failures are known issues, and are either false or unreliable outcomes, and can be disregarded.

Even with the virtio_device_ready() relocation, it has been rock solid in my testing.
However, I should note I also picked PR #1296 into my WIP.

Here too is my implementation with tracing in situ per PR #1214:

if (!adaptExt->dump_mode && adaptExt->dpc_ready)
{
    NT_ASSERT(MessageId >= QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0));
    if (StorPortIssueDpc(DeviceExtension,
                         &adaptExt->dpc[MessageId - QUEUE_TO_MESSAGE(VIRTIO_SCSI_REQUEST_QUEUE_0)],
                         ULongToPtr(MessageId),
                         ULongToPtr(MessageId)))
    {
#if !defined(RUN_UNCHECKED) || !defined(RUN_COLD_PATH_ONLY)
        RhelDbgPrintInlineHotPath(TRACE_DPC, " The request to queue a DPC was successful.\n");
#endif
        bStatus = TRUE;
    }
    else
    {
#if !defined(RUN_UNCHECKED)
        RhelDbgPrintInline(TRACE_DPC,
                           " The request to queue a DPC was NOT successful. It may already be queued elsewhere.\n");
#endif
        bStatus = FALSE;
    }
}
else
{
#if !defined(RUN_UNCHECKED)
    RhelDbgPrintInline(TRACE_LEVEL_VERBOSE,
                       " We are in Crash Dump Mode or DPC is unavailable. Calling ProcessBuffer() without"
                       " spinlocks...\n");
#endif
    ProcessBuffer(DeviceExtension, MessageId, PROCESS_BUFFER_NO_SPINLOCKS);
    bStatus = TRUE;
}

The remainder is implemented in PR #1228, but we will see how far that gets up...

@MartinCHarvey-Nutanix (Author)

@benyamin-codez I am waiting for feedback / approval / denial from a maintainer before proceeding on this.

@benyamin-codez (Contributor)

@MartinCHarvey

I am waiting for feedback / approval / denial from a maintainer before proceeding on this.

Yes. @vrozenfe and @YanVugenfirer still need to review and approve.

LGTM though.
Certainly resolves the regression, removes unneeded orphans and provides an elegant mnemonic alternative.
Thanks for your work on this.

I think it best to raise new PRs for relocation of virtio_device_ready() to avoid delays here.
As you have debugged that extensively, would you mind raising one for vioscsi at least?
Then I can do a clone for viostor if you like.

@benyamin-codez (Contributor) commented Mar 4, 2025

Plus maybe edit your OP

@JonKohler (Contributor) left a comment

LGTM at first blush, will defer to Vadim et al.

@benyamin-codez (Contributor)

Martin, you may want to fix that clang format issue I mentioned here.
Then when you next force-push, hopefully you get the clang-format checks.
At least then it will be ready to merge upon review...
cc: @kostyanf14

@MartinCHarvey-Nutanix (Author)

Why not just revert the patch that caused the problem?

Best, Vadim.

I'd be OK with that, but this is perhaps a little clearer about what the code is doing.

This corrects interrupt/spinlock acquisition code regressed
previously. Locking in the driver is now the same as
prior to that point, and basic functionality is restored.

Initial regression testing has been performed. There are
however, some outstanding questions as to whether the driver
performs correctly in the crashdump/hibernation path.

I will be performing additional regression testing for
those cases, and MSI(X), and issuing further patches
if necessary.

Ref: Nutanix ENG-741981.

Signed-off-by: Martin Harvey <[email protected]>
@benyamin-codez (Contributor) commented Mar 6, 2025

No clang-format... 8^(
Maybe your custom workflow...?
I think I did see them initially though...
cc: @kostyanf14

EDIT: Sorted..! 🙏

@MartinCHarvey-Nutanix (Author)

@vrozenfe @benyamin-codez It is important we fix this regression, however, given various other things I'm working on at the mo, I'm afraid I don't have time to babysit this particular PR. Can I leave it to one of you to run with this?

@benyamin-codez (Contributor)

@MartinCHarvey-Nutanix

It is important we fix this regression, however, given various other things I'm working on at the mo, I'm afraid I don't have time to babysit this particular PR. Can I leave it to one of you to run with this?

I think you've done all that you can. I'm sure Yan and Vadim will review ASAP. The fact multiple HCK-CI checks have run and passed (the failures are known issues with the HCK) is perhaps indicative that visibility of their review and (hopefully) approval and merging will occur soon.

Can you confirm the regression broke crash dumping...? If so, it's likely of high importance this is merged ASAP.

For the reviewers, perhaps it is prudent to briefly revisit:

Why not just revert the patch that caused the problem?

After further consideration, I should have said, primarily because I didn't split the commits enough to do a proper migration. The performance improvement and the other reasons mentioned above are still valid, but this is the real reason we couldn't easily just revert, or revert without consequence. I'll try to do better with that in the future.

I also deferred these issues to a subsequent, speculative, future PR (to be raised after PR #1214 had merged), which in hindsight, was ill-advised. It would have been better to flesh out the remaining issue, i.e. how can InterruptLock type spinlocks work when adaptExt->dpc_ok=TRUE. The warning sign was there when I stated:

The InterruptLock type spinlock (which I have not tested) would only be used in the else branch, i.e. when IsCrashDumpMode=TRUE OR when adaptExt->dpc_ok=FALSE.... [Emphasis mine]
...
In fact, I cannot see how this would work when all other spinlocks are the DpcLock type. In my previous testing, issuing InterruptLock type spinlocks when adaptExt->dpc_ok=TRUE usually clobbered in-flight DPCs.

In any case, IMHO I would suggest merging this PR is the superior solution to the problem.

For the benefit of future readers, I am also minded to share a couple of observations:

  1. It was perhaps not obvious in the offending PR that I had changed from using InterruptLock type spinlocks to DpcLock when adaptExt->msix_enabled = FALSE and that this worked without any issues at least in my testing. This was actually the removal of the last call to an InterruptLock type spinlock.
  2. We use adaptExt->dpc_ok to record readiness of StorPort DPC objects, which we then use in many places, but also after VioScsiHwReinitialize() in lieu of a successful StorPortEnablePassiveInitialization(), the DPC objects having already been initialised in the previous call to VioScsiHwInitialize(). AFAICT, and as assessed in my testing, resume and hibernation do not use this pathway and do not have a dependence on adaptExt->dpc_ok.
  3. The mnemonic ambiguity around the previous use of BOOLEAN isr, uncertainty as to why you would even call a lock manager function if you didn't need it, and the fact that InterruptLock type spinlocks were not necessary when adaptExt->msix_enabled = FALSE, led me to think that perhaps an InterruptLock type spinlock was what was being requested, rather than an allowance being made for processing the buffer without a spinlock if called from an ISR when IsCrashDumpMode=TRUE or when adaptExt->dpc_ok=FALSE. I counted 4 enquiries I made about this, but as I received no response, I ended up running with what I had - noting the pathway was untested.

Anyway, thank you again Martin for your work on this.

cc: @YanVugenfirer @vrozenfe

@YanVugenfirer (Collaborator)

@MartinCHarvey-Nutanix @benyamin-codez Thanks a lot for your work. I am going to review the PR in the next couple of days. It is also important that Vadim reviews it as well.
