Unstable system with SATA risers

Hello Folks,

I have the following PCIe adapter card from AliExpress in order to connect multiple hard drives:

Apparently all of my systems with this card crash from time to time, leaving corrupted hard drives.

Does anyone have experience with these cards, or with such crashes, and know how to resolve them? At this point I do not believe a single broken controller card is to blame, as it happens on all my rigs…

Which unit are you using, and are you running Windows 10? These look just like some SATA ports on a card.

That is pretty much what it is: just some SATA ports on a card. I am using the 20-port SATA version with a PCIe x4 interface.

The system I’m using is Ubuntu Server 20.04.
Under Windows I did not have these issues so far.
But then again, I’m all in on Ubuntu now and only ran Windows briefly with these adapters.

The hard drives are all recognized fine (until the system crashes randomly).

Please define the crash.

BSOD, driver error, or the system stays up but drives disconnect?
Need more details.

Ah, you’re on Linux…
Still, anyone helping will need more details.

I can’t tell much about the crash. The kernel log looks fine until [null][null][null][null][null]…

followed by a reboot. The reboot in turn fails because some drives are corrupted.

Ah man, I’m a Windows guy; hopefully someone can help you though.

I cannot say that “drives are corrupted” explains much. Is your OS drive OK? (I assume so, since you can restart, but I am not 100% sure.) Are your plots gone missing? Is the file system bad (have you run fsck)? Are those partitions gone?

As @Bones stated, there is just not enough data to say whether this is user error or a deficient card. To some extent I would lean toward user error (you didn’t partition the drives right, or are using a potentially unstable file system), as Linux is usually good with driver support; on the other hand, a low-cost card may have some inherent issues as well. It could also be a PSU issue, such as not having enough amps on the 5V rail. Just not enough data.

Maybe the fastest way would be to get a decent card and move some drives to that one.

Okay, what data can I provide?

The drives (currently 9) are connected to a Corsair AX1600i PSU.
Its power specs are the following:
Corsair AX1600i Power Specs
3.3V - 30A 180W
5V - 30A 180W
12V - 133.3A 1600W
5VSB - 3.5A 17.5W
-12V - 0.8A 9.6W
Additionally, there are 5 PCIe risers (for the GPUs) and 7 fans connected.
For testing purposes I went as low as 5 HDDs, but still had the issues.

The system drive is a fresh WD Red SA500 SSD, but the issue also appeared with the SSDs that were installed beforehand.

The drives are formatted as ext4 and are included in fstab via /dev/disk/by-uuid/.
When the issue appears, say 3 out of the 9 HDDs are corrupted.
The system drive so far has not been corrupted a single time.
The system tries to boot but gets stuck at the cloud-init process. It then says: “You are in recovery mode. One or more HDDs could not be mounted; fix the issue or continue booting by pressing Ctrl+D.”
Upon continuing the boot with Ctrl+D, I find with “df -h” that only part of the drives are mounted.
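
For reference, an fstab entry looks roughly like this (the UUID and mount point are placeholders, not my real values). If I read the fstab man page right, adding “nofail” should let the boot continue past a drive that fails to mount instead of dropping into recovery mode:

# /etc/fstab example entry for one data drive (placeholders)
/dev/disk/by-uuid/01234-5678-90 /mnt/hdd1 ext4 defaults,nofail 0 2
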
The corrupted hard drives which are not mounted can usually be fixed with the following command:

sudo fsck.ext4 /dev/disk/by-uuid/01234-5678-90

I have seen data loss once. The general error is “disk was not cleanly unmounted”. Sometimes fsck checks everything and the disk is totally fine; most of the time it has to scan through the blocks, which takes a long time.
Unfortunately I don’t have a log of that at hand right now. I think it says “bad block count” plus “disk was not cleanly unmounted” in that case.
The recovery of an 18 TB disk is very painful, as it needs 50 GB of swap or so.
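
One workaround I want to try for the swap hunger (an assumption on my side, not something I have verified yet): e2fsck can spill its internal tables to a scratch directory on disk instead of RAM, via /etc/e2fsck.conf. The directory has to exist first:

# sudo mkdir -p /var/cache/e2fsck
# then in /etc/e2fsck.conf:
[scratch_files]
directory = /var/cache/e2fsck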

As for drivers, I did not have to install any. The hard disks are recognized in the BIOS, Windows, and Ubuntu right away.

> Maybe the fastest way would be to get a decent card and move some drives to that one.

What would be a decent card? The main issue I found is that HBA SAS cards require heavy flow-through cooling, which I cannot easily provide in my rig.

If anything, I would zoom in on “disk was not cleanly unmounted.” To me that indicates that either the disk disconnected by itself (or the card made it disconnect), or the system just shut down. In both cases, that would suggest you may be out of power; at least, this is where I would start.

Unless you are plotting, a shutdown or disk disconnect should not cause problems at the fsck level (I think), as there are no writes going on. This is why your fsck comes back clean (again, I think). If, on the other hand, your plots are mangled, that would rather point to the SATA controller. (Again, only if you are not actively plotting, as only files being written could get mangled, I think.)

You have mentioned that you have 5 video cards. Assuming each draws 200W, that puts you at 1,000W right away. I am not sure whether those video cards also take 5V, maybe not. Add to that about 10W per drive, plus some extra for CPU / RAM / motherboard, and you are pushing toward the raw 1,600W limit your PSU has.
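
A rough back-of-envelope with the numbers above (all assumptions, not measurements):

5 GPUs x 200W ≈ 1,000W
9 HDDs x 10W ≈ 90W (spin-up can briefly draw two to three times that)
CPU / RAM / mb / fans ≈ 150-250W
Total ≈ roughly 1,250-1,350W sustained, with higher transient spikes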

Instead of dropping the number of drives, I would drop one or two video cards, as they are the biggest power draw, and let it run for some time.

You may go through your /var/log/messages to check what the last lines were before the system restarted. If there is nothing there indicating problems, just some noise during the startup, that would further suggest that the system shut down abruptly, again pointing at power issues.
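
If your distro does not keep that file, the systemd journal may still have it. Something like this should show the tail end of the previous boot, assuming persistent journaling is enabled (it is not always the default):

journalctl -b -1 -e    # -b -1 = the previous boot, -e = jump to the end of it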

One more thing you can consider is adding a small UPS, as that would eliminate potential small brownouts coming from the power line. Maybe your fridge kicks in, and the power drops enough for your PC to shut down.

Thank you for the suggestions.
How would Ubuntu handle system crashes in the case of an HDD/PCIe card disconnect?
In the Windows world this usually generates a crash report with a blue screen of death, but my log looks like the following:

Dec 24 13:39:09 eden t-rex[1981]: ------------------------20211224 13:39:09 -------------------------
Dec 24 13:39:09 eden t-rex[1981]: Mining at eth-eu1.nanopool.org:9999 [163.172.162.51], diff: 10.00 G
Dec 24 13:39:09 eden t-rex[1981]: GPU #0: RTX 3080 - 87.54 MH/s, T:97mC, P:209W, F: 55%, E:419kH/W, 517/517
Dec 24 13:39:09 eden t-rex[1981]: GPU #1: RTX 3070 - 49.95 MH/s, T:97mC, P: 99W, F: 54%, E:505kH/W, 322/322
Dec 24 13:39:09 eden t-rex[1981]: GPU #2: RTX 3080 - 72.89 MH/s, T:97mC, P:183W, F:100%, E:398kH/W, 484/484
Dec 24 13:39:09 eden t-rex[1981]: GPU #3: RTX 3070 - 52.08 MH/s, T:97mC, P: 99W, F: 54%, E:526kH/W, 348/348
Dec 24 13:39:09 eden t-rex[1981]: GPU #4: RTX 3080 - 83.64 MH/s, T:97mC, P:203W, F:100%, E:412kH/W, 524/524
Dec 24 13:39:15 eden t-rex[1981]: #033[0m20211224 13:39:15 #033[32m[ OK ]#033[0m 2197/2197 - 346.15 MH/s, 54ms ... GPU #2
Dec 29 17:01:24 eden kernel: [    0.000000] Linux version 5.4.0-91-generic (buildd@lcy01-amd64-017) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 (Ubuntu 5.4.0-91.102-generic 5.4.151)
Dec 29 17:01:24 eden kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-91-generic root=UUID=df98695d-64a6-4acd-b05f-e457a6f778da ro maybe-ubiquity
Dec 29 17:01:24 eden kernel: [    0.000000] KERNEL supported cpus:
Dec 29 17:01:24 eden kernel: [    0.000000]   Intel GenuineIntel
Dec 29 17:01:24 eden kernel: [    0.000000]   AMD AuthenticAMD
Dec 29 17:01:24 eden kernel: [    0.000000]   Hygon HygonGenuine
Dec 29 17:01:24 eden kernel: [    0.000000]   Centaur CentaurHauls
Dec 29 17:01:24 eden kernel: [    0.000000]   zhaoxin   Shanghai  
Dec 29 17:01:24 eden kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[...]

/var/log/ contains neither a file nor a folder called “messages”.

Edit:
Apparently booting into recovery does not generate any log entries there; only when I pressed Ctrl+D (on Dec 29) did the log continue.

The state I find the rig in when I come to the office is that all LEDs and system fans are on, and the CPU fan is running.

Sorry, I am more of a CentOS / RedHat person, so the messages file is always there. It looks like you pulled the right file, though.

At least in what is there, the log stopped at 13:39:15 without any indication of something going bad. You may want to review the lines before that (for an hour or so); maybe there will be something SATA related.
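
A quick way to fish for that (paths assumed for Ubuntu; the patterns are common kernel SATA error strings, not an exhaustive list):

grep -iE 'ata[0-9]|I/O error|hard resetting link' /var/log/syslog /var/log/syslog.1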

A crash report can be generated only when there is a code crash that can be trapped. In the case of a power failure, no OS will generate such a report. On the Windows side you get a “not a clean shutdown” message in the Event Log.

So, at least to me, the abrupt shutdown is the best lead to chase. Either your PSU is borderline, or you may have some brownouts on the power line.

If you are not already loading your CPU with some CPU-mining code, I would find a CPU stress utility and run it (once your system stabilizes). That will put some extra load on the PSU, and if you crash during such a test, that would be the best indicator.
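
For example, stress-ng from the Ubuntu repos (the parameters here are just a starting point, not a tuned burn-in):

sudo apt install stress-ng
stress-ng --cpu $(nproc) --timeout 30m --metrics-brief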

Yeah, when the system boots, it checks whether there was a dirty shutdown, and that is the reason you get that Ctrl+D prompt. That is also why it waits for you to fix the issue that triggered that state. I guess it restarts because in your BIOS you have “restart after power failure” set to ON. Although, that kind of contradicts the gap in the log output. As I mentioned in the next post, it could also indicate that the system just froze (so your fans are on, your LEDs are still on, …).

Would a short brownout affect only a single system, or would all 3 systems crash in such an event?
Typically 1 out of my 3 systems shows the issue at a time.

I am not an electrical engineer. However, a brownout is a short voltage drop, so depending on how deep or long the drop is, it may affect the PC that has the highest load at that moment, or the weakest PSU. Just guessing.

Again, I like Corsair, and I use them, as those are great PSUs. However, the worst nightmares I have had with my systems were PSU related: no traces, no nothing. Sometimes it was shutdowns, sometimes just box freezes. So, if there is nothing else you can grab onto, that is potentially the first thing to try to eliminate.

Could it be a driver problem?

Where is your UPS system? We don’t have brownouts!!!

I’ve had a card similar to yours for a few months: the 20-port SATA x1 version, with 16 disks currently connected. Very good, no problems. Windows 10. I want to buy another one.

Normally the drives should be very quiet. Do your disks make repeated spin-down/spin-up sounds? That is a power problem: the PSU is not sufficient.

Do not connect more than 6 disks to a single PSU output line.

I have tried it with multiple cards by now.

It happens on multiple rigs, but on one more than the others. I should try mounting the PCB better.
Additionally, I have had some success changing the drives to read-only.
I only have these issues on Ubuntu Server; with Windows I never had any corrupted drives. The drives themselves are silent and only a little warm to the touch. They are cooled with fans.
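
By read-only I just mean the plain ext4 mount option, roughly like this (device and mount point are placeholders):

# temporarily, on the running system:
sudo mount -o remount,ro /mnt/hdd1
# or permanently, via the options field in fstab:
/dev/disk/by-uuid/01234-5678-90 /mnt/hdd1 ext4 ro,nofail 0 2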

A brownout might be a cause, but there is not much I can do about it right now. My rigs eat 3 kW.

I have the same problem with a 12-port SATA card. It appeared yesterday, and today it partially appears again: I see the drive in lsblk, but it is not listed in blkid nor in fdisk -l… It is giving me a headache.

PSU: 750 W

I use this kind of card too: PCIe 3.0 x1 to 20 SATA ports, two cards on one motherboard. It works fine, but it cannot handle heavy disk load.

I notice the same: heavy load such as file copying leads to failure. I wonder if the card is overheating, or what else the issue might be. The PSU should have plenty of juice for 10 HDDs.
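
One way to tell a flaky SATA link (card or cable) apart from a pure power problem, assuming smartmontools is installed and /dev/sda stands in for one of the affected drives: a climbing UDMA_CRC_Error_Count points at the link, not the platters.

sudo smartctl -A /dev/sda | grep -i crc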