K35 repeated plot creation failure

It used to be that way when I needed it, as it was a Linux binary.

Then try this: C:\Windows\System32\MdSched.exe, although it looks like this one also requires a reboot.

1 Like

I used memtest to check RAM under Win10. You can just run it like any other program. It is very old and simple, but after a few minutes it showed memory errors for the broken RAM.

However, it can only allocate 2 GB at once, so you must start the program multiple times and run the copies at the same time to test the complete RAM if it is bigger than 2 GB.
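A minimal sketch of that approach as a batch file, assuming the tool is a plain user-mode executable named memtest.exe sitting in the current folder (the exe name, the count of copies, and the exact per-copy limit are assumptions based on the description above, not verified):

@echo off
rem Each copy can only test ~2 GB, so start several copies at once.
rem With 4 copies running in parallel, roughly 4 x 2 GB = 8 GB gets covered;
rem adjust the count to your installed RAM, leaving some for the OS.
for /L %%i in (1,1,4) do start "" memtest.exe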

Is it part of Windows 10?
I can’t find it.

I will not install 3rd party software (other than Chia) on my Chia boxes.
I will not risk complicating any Chia issues with 3rd party software issues. I am keeping my Chia boxes clean of non-essential downloads / installations.

So the only way I will run memtest86 is via a boot drive.

I will try it via booting from a flash drive (if I can figure out which “F” key allows me to do so, and cross my fingers that I do not have to first tell the BIOS to allow booting from a USB drive, and that all of my USB chia drives do not cause complications for booting to the memtest86 USB flash drive). Argh!

That MdSched.exe is Microsoft code and included with Windows. It is also a good memory test.

On the other hand, memtest86 is as good as the Windows one. If I am not mistaken, it may be included with some distros. It is potentially over 100 years old, and I doubt that there is much other software with as good a reputation as that one. It is the gold standard as far as memory testing goes; nothing touches it.

13 PiB of the complete netspace should be K35 plots,
and 59 PiB should be K34 plots, and I own 1.2 PiB of those K34 :smiley:

1 Like

I used it on my Atari 2600.
:wink:

That was already version 11

1 Like

It is hard to say whether, on that chart, that percentage represents the net space or rather the plot count.

I took the value over 30 days (0.0570%). That is good enough if the K35 part is over 10 PiB; I think this 13 PiB is a statistically good number and not far from the real number.
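For reference, the arithmetic behind that estimate; the ~22 EiB total netspace is not stated above and is only an assumption inferred back from the quoted figures:

0.0570% × ~22 EiB ≈ 0.00057 × ~22,500 PiB ≈ 13 PiB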

1 Like

It represents the percentage of winning plots for each K size (out of 100%), not the count :slight_smile:

1 Like

On both computers, were there messages saying “Only wrote x of y bytes at offset… Error 1. Retrying in five minutes.”? Also, are you plotting to a different drive on each computer, or using a portable drive? That is, is e:/chia/plotting the same drive used for both computers, or a different one in each?

If the memtest comes back clean on both, maybe there are errors with the disk(s) you are plotting to.

I was curious about “Error 1”, but unfortunately that number is not an actual system error code. The write is apparently handled by the chiapos library, which is separate from madmax.

This message means that fwrite is failing to write all the data it was asked to (source code). The big question is why. Unfortunately, the code does not actually make the call to retrieve the error code, which would show a more specific system error explaining why the write failed (e.g., errno 28 = No space left on device).

That write error may be a red herring since it will wait 5 minutes and retry the write, which it seems to be doing successfully.

Can you check the debug log while running chia plots check and see if it logs an error “Failed to open file”? This error might explain in more detail why the plot was “unopenable”.
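For example, from a command prompt, something along these lines (assuming chia is on the PATH, the debug log is in the default Windows location, and the -g string is just a placeholder to narrow the check to the k35 plot):

chia plots check -n 30 -g k35
findstr /c:"Failed to open file" "%USERPROFILE%\.chia\mainnet\log\debug.log"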

1 Like

Same messages each time, regardless of the computer (although I am not certain whether the “Only wrote 889 of 9093…” etc are exactly the same values – but both times it was the same wording at the same (or very close to the same) part of phase 3).

No portable drives involved.

In both cases (each case being one of two computers), I am using each computer’s NVMe drives.
So the NVMe drives on computer #1 are independent of the NVMe drives on computer #2.

I am running the K35 job, again. This time, however, I have doubled the “-b” value (from 27112 to 54224), giving the job double the RAM (I have 64 GB of RAM, and never use more than approximately 52%, even while concurrently running a madmax plotting job).

I had chosen “-b 27112” because it is a multiple of 3389 (the default “-b” value if none is specified); 27112 is 8 × 3389.
“-b 54224” is yet another multiple of the default 3389 (16 × 3389). Where chia came up with 3389 as the default, I have no clue.

When the job approaches the section in question, I will scour the debug log (assuming the error recurs).

1 Like

To clarify, the plotting process won’t output anything to the debug logs, but I think that running chia plots check might write something to the log when it finds an “unopenable” plot.

So, if this plot finishes with that write error, run chia plots check on it after, and see if the debug log gets anything extra written to it as a result of the check.

Beyond that, I don’t know why fwrite would be failing on one fairly small write (9,093 bytes) out of the many, many writes that it is doing. Especially since it is not logging the system error code that might describe the specific error.

1 Like

That is kind of similar to me drinking. The last shot basically kills me, where all the previous ones were fine :slight_smile:

The H/W on both boxes is different (different RAM, drives, CPUs), so that really indicates it is either something to do with the environment (size of the temp space, or the amount of RAM), or some weird bug in the plotter. Although, the fact that the plotter barfs on fwrite rather points again to issues on the drive. It is rather unlikely that in both cases one plot would damage the file system, so IMO we are back to not enough RAM or temp space. Still, with not enough RAM, that would most likely show up as either a crash or problems around the swap file, which further points to that fwrite getting a standard error (as you said, it is unfortunate that there is no check for the error reason).

Agreed that the error could be a secondary one, but so far we have no other indication that this is the case. The fact that the plotter backs off and says that it will retry in 5 minutes rather suggests that all is fine with the plotter, and it is just waiting for disk space or, as you stated, file access rights.

I would really try to get some stats from the drive during plotting, whether using Resource Monitor or something like Sysinternals du (the latter flushes the output, which a standard dir does not do). I would also check on the swap file; maybe that file somehow migrated to the temp space.

The fact that the very first two plots were fine suggests reviewing what happened between plot 2 and 3, and what the active processes were at that time.

@seymour.krelborn By the way, when you stated that both boxes are using internal temp space (RAID0 NVMes), are those the same size sticks (same model)? Would you have one more on hand to bump that temp space up? When you use File Explorer to check that temp space, do you have hidden files enabled? Could you check whether some system file migrated there (page / swap / hib)? A long shot, though.

1 Like

Two 2 TB Samsung 980 Pro sticks, striped via Disk Management, which changed them to “dynamic” and formatted them (they were already NTFS-formatted, and the re-format is still NTFS). I guess the re-formatting was needed to make a single 4 TB partition.

And it was not a quick format. It did it the old-fashioned way. Even for NVMe drives, it took a few minutes.

Both MSI motherboards support 3 NVMe drives. I made the mistake of putting the OS on a 500 GB Samsung EVO NVMe, when I should have put the OS on a 2 TB SATA SSD (to hold the ever-growing blockchain files and have the 3rd NVMe slot for Chia processing).

So I have no spare NVMe slots.

I never hide file extensions. But I rarely use graphical tools for file based listings, copying, etc. I use the command prompt. I made a marriage proposal to robocopy.exe

dir /a reveals no hidden files, other than the “System Volume Information” directory in the root.

Nothing uninvited has helped itself to my temp drive space.

Before and after reboots, I check and confirm that I have the full 4 TB free, and the “dir” output always shows 4 TB is free.

My question was whether you enabled it to also see hidden / system files (it is not enabled out of the box). If that is not enabled, you may not be able to see pagefile.sys, … Once enabled, it makes File Explorer potentially equivalent to ‘dir /a’. It may be that File Explorer has some edge when some user-level attributes are being used (I use both, sometimes at the same time, just to stay sane).

When you do ‘dir plotting_folder’, all the files that are being worked on at that point will most likely show 0 size, as the file system buffers are being used / not flushed. So, when you have a massive build / copy, you will see a 0-length file for quite a long time, and then poof, all the temp files are gone, and you have never seen the actual peak space usage. This is where Sysinternals du helps: it flushes those file buffers, so the next time you run dir you will see some big files. Although, you may just want to run ‘du -l 1 plotting_folder’, and it will give you better feedback than dir does.

I am really leaning toward this (low temp space) being the issue, as so far everything that was brought up is not really showing any promising leads.

There is no one-size-fits-all solution here. However, on a full node machine, I would use a 128-512 GB SSD for the OS, and another 512 GB or so for the blockchain (so all reads/writes from that db are not affected by system-level stuff). On a plotter / harvester, a 250 GB SSD should be way more than enough (I have an old 128 GB SSD on my plotter, and it has plenty of headroom).

Yes, I enabled it on day one.

The GUI says that a K35 plotting job requires 2175 GiB of temp space, which is about 2335 GB.
I have 4 TB of temp space.
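For reference, the conversion between those two figures:

2175 GiB × 1024³ bytes/GiB ≈ 2,335,000,000,000 bytes ≈ 2335 GB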

Correct. But to jump from the GUI’s claim of 2335 GB to 4000 GB is unlikely, unless, as you point out, other uninvited, concealed files are there from the OS (or perhaps elsewhere).

I will check with SysInternals’ “du” tool.
I am assuming that flushing the file’s buffer, to reveal its actual usage, will not affect the execution of the plotting job?

By the way, I do not see a “du” option for flushing file buffers:

Using Disk Usage (DU)

Usage: du [-c[t]] [-l <levels> | -n | -v] [-u] [-q] <directory>

Parameter Description
-c Print output as CSV. Use -ct for tab delimiting.
-l Specify subdirectory depth of information (default is all levels).
-n Do not recurse.
-v Show size (in KB) of intermediate directories.
-u Count each instance of a hardlinked file.
-q Quiet (no banner).

So when you wrote that it flushes file buffers, do you mean that it only reveals the info, but does not change the files from their 0 byte size?

Or after using “du” will “dir” no longer show 0 bytes, because “du” flushed the file’s buffer?

Yeah, du flushes OS file buffers (the file system caches) by default, so there is no extra option for it. Although, I have noticed that sometimes you need to run it a couple of times. Maybe when it sends that flush request, the OS does not react right away, but proceeds at whatever speed suits it best at that time. To properly flush it, the process that is working on such a file should be the one calling it, but that would kill all the advantages of OS file caching. So, no, calling it while another process is working on a file is harmless, as it just gently asks the OS to do the best it can.

I agree that there is a big jump from what it should be to what you have (2175 GiB vs. 4 TB), but we are kind of out of options, at least ones that we understand.

To me, bad RAM is just a sanity check, as that usually produces garbled output (which doesn’t really matter for most of what the plotter does), or, if some process depends on it, it results in crashes. So, I would rule it out. Also, when you call fwrite, there are again two options for bad RAM: the first is that the buffer contains garbage, and thus no one notices; the second is that the buffer itself is damaged, and the system or process will barf. So, this is again not that promising an option.

Problems with the file system: again, two different systems (although with the exact same drive models), and it didn’t happen for the first two plots, yet it is happening now. Again, looking at how fwrite operates, if it returns an error, that means the process is sound, and the OS is sound enough to tell it to move on. The obvious reasons for it to fail are:

1. not enough space,
2. the seek pointer is out of whack (points to a position past 4 TB),
3. the file handle is screwed up (but it just wrote several such buffers in the same loop),
4. the file access rights are bad (again, it is sitting in a tight loop pushing buffers down to this file).

So, to me only #1 and #2 are potentially at play. And as the process is writing to that file in a tight loop, I would say that #2 is less likely. So, this (file system / drive issues) is again just a sanity check.

The last option is that the plotter code has some issue, and some previously unseen bug was just triggered. As you pointed out, your previous 2 plots potentially represent 10% of those k35 plots, so potentially more than 90% of such plots were created without any problems. Also, most likely, when a process barfs, it barfs hard, and here we see that it works properly (well, missing the part where it should ask for the exact cause), backs off from that write, and then lets you know that it is waiting for something to happen (the “be back in 5” message to me indicates waiting for free space to continue). So, it looks to me that all might be fine on the process / OS side.

Again, unless we are missing something (maybe plotting those k35 files is somehow sensitive to the timestamp, e.g., a time bomb in the code, which I doubt is there), it is really hard to say what could be wrong on the process / plotter side.

So, that free temp space is really the last sanity check we can do before going completely nuts. Whether that is the exact problem, I cannot say, but it is the last loose end we have there.

By the way, is that Chia or MM plotter, and are you running it through UI or CLI?

1 Like

chia.exe (only for K35 size plotting – madmax for everything else).

madmax’s help output states -k 34 is the largest valid value.

100% of my plotting is via the command prompt.

Maybe just create a ‘goto TOP’ batch file, and have it call that du -l 1 on the root of your RAID array every 10 seconds or so. Maybe give that command prompt window the option to hold more lines (say 10k), and just let it run. When you see the crash, kill the batch file and see what is in there. It will most likely slow down the file IO a bit, but I hope this is not a speed-dependent issue.
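A rough sketch of such a batch file (the e: drive letter is a placeholder for the RAID array root, du is the Sysinternals tool and has to be on the PATH, and the 10-second interval is just the suggestion above):

@echo off
rem Log temp-space usage roughly every 10 seconds until this window is closed.
:TOP
echo ---- %date% %time% ----
du -q -l 1 e:\
timeout /t 10 /nobreak >nul
goto TOP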

Not sure what else you can try.

1 Like