X18 and X24 disks frequently reset with SAS3008 HBAs under heavy write load #162
Hi @putnam, sorry you are having issues with your system.
Is this correct? From the standards, disabling EPC should hold across resets and power cycles. As for firmware updates, those can sometimes help (on both the HBA side and the drive side). The Seagate support site has a firmware update finder where you can enter a serial number to check for new firmware; you don't need the other Windows-only tool (it basically scans and opens that webpage for you with the SN already loaded). I am asking around to see if any of the customer support engineers have run into this as well, but I have not heard anything yet.
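A quick way to confirm the disable actually stuck across a reset or power cycle is to re-read the EPC settings afterward; a minimal sketch, with the device handle as a placeholder:

```bash
# Re-check the EPC settings page after a reset/power cycle; all idle/standby
# timers should still read as disabled if the change persisted.
openSeaChest_PowerControl -d /dev/sg2 --showEPCSettings
```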
Thanks so much for the response. I edited my original ticket a lot, so I think you're responding to the initial version. Looking at my bash history and the state of the disks, I realized that:
I'm sure this is now outside the scope of this repo, but you guys have been so useful in the past when reporting possible firmware bugs that maybe it's useful to have shared it here anyway. I'm not an enterprise customer, just an end user, so it's hard to get a line to someone with inside engineering connections. I can repro more consistently now by just copying a lot of data to the disks. I have found very little info on these particular 20TB models, since I understand they're technically binned/refurbed X24 HAMR disks. It may well be an issue with the LSI/Broadcom firmware or even mpt3sas, but again it doesn't repro on my 60+ HGST/WD disks or on the X16s on their own. Since we're almost certainly outside the scope of openSeaChest here, feel free to close, but if it's something you guys are open to pursuing with more debug data and info, I could share it here or privately over email. Regarding firmware, there's no update available for these on the end user portal yet.
I did pass this issue along to some people internally to see if they've seen similar problems before with these drives and hardware, but I have not heard anything yet. If you dump the SATA phy event counters, are you seeing those increase at all? If these are increasing (not just the reset counter, but others), it can point towards a cabling issue. I'll see if there is anything else I can think of trying that might also help debug this.
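For anyone following along, smartmontools can dump those counters (the SATA Phy Event Counters log); a minimal sketch, with the device path as a placeholder:

```bash
# Dump the SATA phy event counters (GP Log 0x11); watch for the CRC and
# COMRESET-related counters increasing between runs under load.
smartctl -l sataphy /dev/sdb
```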
Thanks for the reply! OK, so here are the PHY counters. Anyway, the resets I see now happen specifically when ZFS is copying a large amount of data to the pool and is lighting up the vdevs made up of Seagate devices for a sustained amount of time. Eventually you see the same message about the HBA resetting, with the same fault code in mpt3sas. I did some digging in the mpt3sas driver hoping to find some bitflags or something to identify the fault code, but it looks to be internal/proprietary to Broadcom/LSI (a debug-logging sketch follows the counter listings below).

20TB X24 Disks (Newer)

16TB X18 Disks (Older, pre-existing without resets)
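As mentioned above, one way to see more of what mpt3sas is doing around the fault is its logging_level bitmask; a hedged sketch, since the exact bit values come from the MPT_DEBUG_* flags in the driver source and should be checked there:

```bash
# mpt3sas exposes a logging_level attribute per SCSI host; raising it makes
# the driver log more detail around resets (the value below is an example
# bitmask; check drivers/scsi/mpt3sas for the MPT_DEBUG_* definitions).
cat /sys/class/scsi_host/host0/logging_level
echo 0x3f8 > /sys/class/scsi_host/host0/logging_level
```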
For this page it continues counting until you reset the counters on the page. I don't remember if we put that in as an option in openSeaChest yet; I will have to review the code.

The reason I mentioned the CRC errors is due to some of my own past experience trying to troubleshoot issues other customers have seen. I have also had some long conversations with one of the Seagate engineers who works at the phy level, with the goal of figuring out a way to write a test for detecting a bad cable. It's not an easy task 😆 but we did come up with some ideas, including using these logs. I have not had time to implement it yet, but it will be an expanded version of the

One thing I learned from him was that the faster the interface is running (6Gb/s vs 3Gb/s), the sooner you notice signaling issues. The most common symptom is seeing the CRC counters increasing. This is often due to a cabling problem... not always, but in your case I suspect it is, since it's happening on multiple different drives, even drives that were not previously having an issue. It's possible that these new drives have slightly different phy behavior that managed to bring this out. Another thing that can happen (and I have experienced myself) is that similar issues appear as backplane connectors wear out from plugging and unplugging drives. Eventually all connectors will fail, but as you approach the insertion count limit you can start to see these kinds of issues too.

I don't know if any of these will solve the issue, but you can try these things:
openSeaChest_Configure also has an option to set the phy speed lower, which you can also try, but it may limit your maximum sequential read/write on more modern drives. One last thing I want to mention: if you can check for updates to the HBA firmware, that may also help. I have seen that resolve odd behavior issues as well, due to fixes made to the HBA's firmware. I have seen firmware updates resolve some odd phy issues on past Broadcom HBAs before, but I don't know if that applies to this specific case. Let me know if this helps. I'll see if I can talk to that signal engineer I mentioned to see if he has any other ideas.
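The phy speed option looks roughly like this; a sketch from memory, so verify the flag and the numeric mapping with --help on your build:

```bash
# Force the drive's phy to negotiate at 3Gb/s instead of 6Gb/s to see if the
# resets track signaling margin (0 should restore the default/max speed).
openSeaChest_Configure -d /dev/sg2 --phySpeed 2
```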
Thanks, will go over these and try them. Regarding the HBA, it's a pretty common SAS3008 HBA on the latest firmware (16.00.14.00). The backplane hasn't had a ton of insertion cycles, but reseating can't hurt. I will swap in a new-in-bag Amphenol cable set, reseat the disks, see if I can repro again, and report back.
Did swapping cables make a difference in your case? Another idea is to see if the HBA's BIOS/UEFI settings allow disabling link power management. I am not sure if that is supported by your HBA or not, but I once had an issue reported to me that rings a lot of very similar bells, and in that case disabling link power management in the BIOS/UEFI for the AHCI card stopped the resets.
No, unfortunately it has not, after cooking a while. I changed it out and left town (I'm still out of town until next week), but I'm still seeing the same behavior under only a little bit of write load. And right now it only seems to affect the newer X24 disks. Link power management (ASPM) is disabled, I think (a sketch of the lspci check follows this comment). You can see the state of it in
Note under LnkCtl it says

As I write this, I'm taking a minute on vacation to prop up the array before it goes totally offline. This has happened before, but this time one X24 disk got knocked offline so hard it hasn't come back; it will need a physical unplug/replug or a full server power cycle to bring it back up.

As far as drive power, I reconfirmed their EPC settings across the board:
Kind of at a loss as to what to do with it right now besides swapping in another vendor. There must be something going on between the firmware and the controller, but I don't know where else to look.
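For reference, the LnkCtl state mentioned above comes from lspci's verbose output; a sketch, with the HBA's PCI address as a placeholder:

```bash
# Show the PCIe link capability/control lines for the HBA; LnkCtl should
# report "ASPM Disabled" if PCIe link power management is off.
lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkCtl:'
```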
@putnam We're asking about "SATA link power management" (putting the SATA phy connection to sleep), rather than ASPM (putting the PCIe link to sleep). I think you can see if this is enabled by running `openSeaChest_SMART -d /dev/sdb --SATInfo` and looking for whether
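A quick way to filter the relevant lines; the exact capability strings vary by drive and openSeaChest version, so the grep pattern here is an assumption:

```bash
# Look for the host/device initiated power management capability lines and
# whether they are reported as enabled.
openSeaChest_SMART -d /dev/sdb --SATInfo | grep -i 'power management'
```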
Ah, sorry. Here is the output on a sample X24 disk. It doesn't have [Enabled] at the end:
Thanks for sharing that additional information. There are two parts to power management of the phy on both SATA and SAS: host initiated and device initiated (sometimes abbreviated HIPM and DIPM). Please note I DO NOT recommend using openSeaChest_PowerControl to enable device-initiated power management. If your system has not already enabled it on its own, enabling it may make the drive inaccessible. That option was added to the tool due to a customer request, but if you are not certain your hardware supports it, I recommend leaving it as-is. The chipset or HBA should enable it themselves when it is supported and compatible. I have had a few people report issues around this internally because they enabled it on a system that was unable to wake the phy back up. There are a few SATA capability bits that are not part of the humanized
Thanks for the response @vonericsen -- I have actually gotten myself into that situation before and can confirm it's not a good idea :) Here is the output of that command for an example disk that was at the top of the stack on the last set of resets.
EDIT: Most of this is still accurate, but the power transitions reported by smartd (0x81->0xFF) are expected, because they're on WD disks that have EPC enabled.

I'm still bashing on this. I've been trying to reduce things down to a reliable repro and I'm not quite there, but let me explain my test setup.

The server has a zpool made up of many vdevs from different vendors; one of the Seagate vdevs is 11x 16TB Exos disks and another is 11x 24TB Exos disks. Both of these sets live on the same backplane, which is attached to a SAS3008-based controller built into the motherboard, a Supermicro H12SSL-CT. The HBA is functionally the same as a 9300-8i and shares the same firmware image.

I am creating continuous synthetic load by copying a 100GB random file from a scratch disk into a test dataset (and then deleting it). To be sure the disks are always busy, I have two of these running in a loop.

There are a few monitoring processes on this server:
I decided to start disabling these one by one to reduce the number of things talking to the disks. First I disabled my storcli64 script, because I've had issues with it in the past, but the resets continued at roughly the same clip. So next I tried disabling smartd.

Right away, the frequency of the resets went down dramatically. Before I disabled smartd, resets would occur fairly reliably under load (though not on a reliable schedule); after disabling it, it took over 12 hours of hard writes before one occurred again. When I restart smartd, the frequency increases again. Here is my smartd config line, for reference:
Note I don't do regular short/long SMART tests with smartd; it's only tracking the health status and error logs. I actually used to run those, but the automated tests would reliably cause disk resets with Seagate disks, and I never did come up with a solution besides disabling the automated tests. In those cases it would affect individual disks, not the whole controller. I think when a SMART test is under way, some commands may hang for longer than the kernel likes, which causes the kernel to reset the disk on the HBA (a default behavior of mpt3sas).

So then I tried running smartd in the foreground in debug mode to see if anything strange was happening. Although nothing stuck out immediately, I was surprised to see the occasional note that a drive's power status transitioned when queried. Looking back in journalctl, I see these quite frequently since installing the X24 disks. Here are some examples:
Reading the smartctl source code, that line prints the old and new power states reported by the disk when smartd sends its query. I didn't know what the 0x81 state was, but I looked it up in the ATA spec (page 344, table 204) and it says that's EPC at Idle_A (the neighboring values are sketched after this comment). Now that's weird, because EPC is disabled on these disks. I can confirm it with SeaChest across all of them. Example:
So, why do these sometimes end up in Idle_A? I'm not sure. I don't know if this is directly related either, but I don't have any other logs that show the power state transitions except for smartd, which happens to show them when it does its check on all the disks (roughly every 20 minutes in my setup). Is it possible something in the X24 firmware is causing power transitions even when EPC is disabled and the timers are all set to 0? And why would querying SMART data so greatly increase the frequency of these HBA resets, and only on Seagate disks? Again, if I disable smartd, the frequency drops dramatically. My theory is that other processes also query SMART (netdata, hddtemp, maybe others), but they do so less frequently. I will keep digging; hopefully this is useful.
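For reference, the 0x81 discussed above is one of the CHECK POWER MODE return values; the list below is summarized from my reading of ACS, so treat it as a hedged reference rather than a quote of the spec table:

```bash
# CHECK POWER MODE count values (ATA ACS, summarized):
#   0x00  Standby
#   0x40  NV Cache power mode, spun down
#   0x41  NV Cache power mode, spun up
#   0x80  Idle (non-EPC)
#   0x81  Idle_A   0x82  Idle_B   0x83  Idle_C   (EPC idle conditions)
#   0xFF  Active or Idle
# hdparm issues CHECK POWER MODE without changing the drive's state, though
# it only reports the coarse active/idle vs. standby distinction:
hdparm -C /dev/sdb
```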
(I'd edit my post above, but I think most of you guys are reading via email and might not see it.) Apologies, late-night jetlag brain here: almost all of the reported power transitions were actually on WD disks. There was a single Seagate X16 showing a transition, and it did in fact have a timer set. I'm not sure how that happened, and I've disabled it again now, but the resets continue regardless. So from the above, all I can say is that disabling smartd, which queries the disks roughly every 20 minutes, greatly reduces the frequency of resets. I think if I could whack-a-mole every process that checks the SMART data, I could probably eliminate them entirely. But I don't really get why.
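One way to test that theory directly is to replace smartd with a dumb loop that polls SMART on the same cadence and see whether the resets track it; a minimal sketch, with device names as placeholders:

```bash
# Hammer SMART queries at roughly smartd's cadence; if the resets follow
# this loop, the trigger is the SMART polling itself, not smartd.
while true; do
  for d in /dev/sd{b..m}; do
    smartctl -x "$d" > /dev/null 2>&1
  done
  sleep 1200   # ~20 minutes, matching the interval described above
done
```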
Hi @putnam, this is really interesting information!
Do you know what data is being pulled each time smartd runs? Is it equivalent to the smartctl options -a or -x? One thing I have observed in the past about resets and software talking to drives is that in every operating system you must provide a timeout value: how long the software expects a command to take before it should be considered a failure. openSeaChest usually uses 15 seconds for most commands. With that in mind, I am thinking of a couple of things that could be happening here (two related checks are sketched after this comment):
openSeaChest does not have an equivalent to smartctl's -a or -x options to do a lot of things all at once, but you can add many options together to get somewhat close: This is close, but not exactly the same. However, I would be curious whether running this triggers anything similar to what you are seeing with smartd. One other difference I know about in openSeaChest is that it is coded to prefer the GPL logs over the SMART logs for DST info and SMART error log info. If I remember correctly, that is not how smartctl works (but maybe this has changed over time). Maybe if it's still querying the SMART logs rather than the GPL logs, that is another part of what is triggering this. I do not have an option to force it down the SMART log path, but I can look into it to see if it helps with debugging. If you get a chance, can you share the output of
I know of a bug in the Windows version of smartctl running DST in captive/foreground mode where the timeout value is too short, which always ends up in a reset from the system. I do not remember if this also affected Linux... I found it months to a year ago. In background/offline mode it should not be an issue, since those commands return to the host immediately after starting and the drive can continue processing other commands while the test runs. We set the timeout for short DST in captive mode to 2 minutes, as the spec requires, and to the drive's time estimate for long (I do not recommend running long DST in captive mode... that will probably never complete without a reset since it can take so many hours).
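Two concrete checks from the timeout discussion above, as a hedged sketch (device paths are placeholders; the sysfs timeout is standard Linux SCSI-layer behavior, not an openSeaChest feature):

```bash
# 1) The Linux SCSI layer aborts commands that exceed the per-device
#    timeout, which on mpt3sas can escalate to a device or host reset;
#    the default is typically 30 seconds. Raising it is one way to test
#    whether slow commands under load are what trips the reset.
cat /sys/block/sdb/device/timeout
echo 60 > /sys/block/sdb/device/timeout

# 2) The two DST modes in smartctl: background (offline) returns
#    immediately and the drive keeps servicing I/O; captive (-C) holds
#    the drive until the test finishes, which is where a too-short host
#    timeout can cause a reset.
smartctl -t short /dev/sdb
smartctl -t short -C /dev/sdb
```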
I have a bunch (11 each) of ST24000NM000C and ST16000NM001G drives that cause major issues with my SAS3008-based HBA (the onboard HBA on the Supermicro H12SSL-CT, but also just a regular 9300-8i). Specifically, the HBA hits some failure mode under heavy write loads to these new X24s, and the driver triggers a whole-HBA reset. Heavy reads seem unaffected.
The default EPC settings differ between the X18s and the X24s: the X18s seem to have Idle_A set to 1 and Idle_B set to 1200, while the X24 firmware only has Idle_A set to 1. The first time I saw this occur, I disabled EPC on the new X24s with --EPCfeature disable and thought it was resolved, but the next time I had a sustained write load it happened again.
I didn't have this issue when it was purely the X18 disks on this adapter. It was only once the X24s were added to the mix that I saw this occur. It also does not occur with HGST/WD disks.
All X18 disks are on SN02, except one RMA refurbed ST16000NM000J on SN04.
All X24 disks are on SN02.
The SAS3008 HBA is on 16.00.14.00. It is actively cooled, and its temperature is monitored and not overheating (a sysfs check for the firmware version is sketched after this list).
Disks are all attached to a Supermicro 846 SAS3 backplane/LSI expander on 66.16.11.00.
Kernel is 6.10.11-amd64, current Debian testing/trixie.
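For reference, the HBA firmware version above can be read back through mpt3sas's sysfs attributes; the attribute names here are from my memory of the driver, so verify against your kernel:

```bash
# mpt3sas exposes the controller's firmware/BIOS versions per SCSI host.
cat /sys/class/scsi_host/host0/version_fw
cat /sys/class/scsi_host/host0/version_bios
```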
Here's dmesg during a heavy write load triggering the problem:
I contacted Seagate support and, uh, they told me to install some Windows-only software to monitor for firmware updates, and didn't know how to respond to anything technical at all. So I hope maybe through you guys this info is useful.