K35 repeated plot creation failure

chia.exe (only for K35 size plotting – madmax for everything else).

madmax’s help output states -k 34 is the largest valid value.

100% of my plotting is via the command prompt.

Maybe create a batch file that loops with ‘goto TOP’ and runs du -1 on the root of your RAID array every 10 seconds or so. Also set the command prompt window to hold more lines (say, 10k), and just let it run. When you see the crash, kill that batch file and see what it captured. It will most likely slow down file I/O, but I hope this is not a speed-dependent issue.
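
If a batch file feels clumsy, the same kind of polling can be sketched in Python. This is only a rough sketch: it swaps du for Python's shutil.disk_usage plus a listing of the top-level file sizes, and the names snapshot and watch, the drive letter, and the log file name are all made up for illustration.

```python
import os
import shutil
import time
from datetime import datetime


def snapshot(path):
    """Return a text snapshot: free/used bytes of the volume plus top-level file sizes."""
    usage = shutil.disk_usage(path)
    stamp = datetime.now().isoformat(timespec="seconds")
    lines = [f"{stamp} free={usage.free} used={usage.used}"]
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_file():
                lines.append(f"  {entry.stat().st_size:>16} {entry.name}")
    return "\n".join(lines)


def watch(path, log_file, interval_s=10, max_samples=None):
    """Append a snapshot to log_file every interval_s seconds; return samples taken."""
    taken = 0
    while max_samples is None or taken < max_samples:
        with open(log_file, "a", encoding="utf-8") as log:
            log.write(snapshot(path) + "\n")
        taken += 1
        if max_samples is None or taken < max_samples:
            time.sleep(interval_s)
    return taken
```

Something like watch("D:\\", "raid_usage.log") would run until you stop it with Ctrl+C; writing to a log file sidesteps the scroll-back limit of the command prompt window.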

Not sure what else you can try.

So, let’s try to entertain this one.

In the original log output, the file that the program is having a problem with is ~800 GB in size. That file has a “plot.2.tmp” extension, so we can assume that “plot.1.tmp”, etc. are potentially already sitting on the disk. If we assume that two or three such files are being created, we are somewhere around the 2 TB mark at that moment. Maybe the problem is that the RAID array is, for some reason, limiting access to 2 TB only? Again, my take is that the OS is most likely stable, so this is just another long shot. However, your RAID settings / drives are what is common between those two boxes.

I would try two things.

First, I would try to create a big file (say, past 2 TB). You can use the copy command to concatenate files, so that would be something like:

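copy /b a + a + a b
copy /b b + b + b c

The first command uses file a (preferably a big one: some zip file if you have one on hand, maybe some ISO file) to create file b; I would also repeat that "+ a" a few more times to get to a big file faster. The second command, and so on, uses the previously created file each time to get a bigger one. The /b switch matters here: when copy concatenates files, it defaults to text mode and stops at the first Ctrl+Z byte in each source file.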

Once you get to 500 GB, just do 5x that 500 GB and see whether it goes smoothly. This will tell us whether your RAID lets you cross that ~2 TB mark.
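
The same concatenation test can be sketched in Python, which always reads and writes in binary mode here. This is only a sketch; the function name concatenate and the chunk size are my own choices, not anything from the plotter.

```python
def concatenate(source, dest, copies, chunk_size=64 * 1024 * 1024):
    """Write `copies` back-to-back copies of `source` into `dest`; return bytes written."""
    written = 0
    with open(dest, "wb") as out:
        for _ in range(copies):
            with open(source, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    # A failed write raises OSError here (e.g. disk full),
                    # so a problem near the 2 TB mark would not go unnoticed.
                    out.write(chunk)
                    written += len(chunk)
    return written
```

Comparing the returned byte count with the size that dir reports for the destination file is a quick sanity check after each run.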

The second test would be to check whether you can trigger the same error by deliberately limiting the free space. Just create a 1.5 TB file (or several files adding up to that) before you start plotting (so the plotter still has the 2.3 TB that is advertised as needed). Once the plotter is busy, use the first method to create another 1 TB or so of files, pushing your files to about 2.5 TB and basically forcing the plotter to barf. I would be really interested to see whether the error message is the same as the one in that first log.

I had written that my first two attempts were successful, which for the purpose of troubleshooting the topic in this thread, was correct.

However, I did not mention the following in this thread, because until now it was not relevant:
My truly first attempt at a K35 plot failed, but for an obvious reason that is unrelated to the error that I reported in this thread.

But now that you have brought “…some reason limiting access to 2 TB only?”, my first K35 plotting attempt speaks to your question.

For my first attempt, I did not have the RAID 0. I had two drive letters, each being 2 TB.

For my first attempt, I directed “-t” to one of the two 2 TB drives, and “-2” to the other 2 TB drive, crossing my fingers that neither of them would consume more than 2 TB of tmp files.

That did not work. The K35 plotting job failed due to one of the drive’s space being exhausted. I do not recall the message, but it was clear. And “dir” showed that it was out of space, while the other drive had only a single 0 byte file, and “dir” showed all 2 TB of space was available.

Returning to today:
With the two 2 TB drives striped, I get past that problem every time.
So the job is using more than 2 TB (the “dir” command shows less than 2 TB free at some point), and probably has all 4 TB available (as it should).

I will make 36 copies of a K32 file, which will get me to approximately 3.9 TB. That should be enough. If you think not, then I will find other files to consume the rest.
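
As a quick check of that estimate, here is the arithmetic as a small sketch. The per-plot size is an assumption on my part (a typical k32 plot taken as about 108.8 GB in decimal units, roughly 101.4 GiB), and the function name is made up.

```python
K32_PLOT_GB = 108.8  # assumed typical k32 plot size in decimal GB


def total_tb(copies, size_gb=K32_PLOT_GB):
    """Total size of `copies` files of `size_gb` each, in decimal TB."""
    return copies * size_gb / 1000


print(f"{total_tb(36):.2f} TB")  # prints "3.92 TB"
```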

I believe that my “truly first attempt” saga, above, would qualify for your suggestion.

Can we take it that when it failed, obviously from not enough space, it failed gracefully (giving you a proper error message), whereas right now it looks like a rough abort (indicating a different reason)?

Or was it a similar abort (which, as you stated, would mean that the second test is pointless)? Although, as we don’t really know what we are chasing, it could be that different parts of the plotter handle file errors differently, giving different results even though the problem is the same. But in that case, chasing it is also pointless, as we don’t know at which point, or how, to trigger it, so there is no point spinning our wheels on it.

The total space used is not really one giant file, but rather a bunch of smaller files (most likely one or two about equal in size to the final plot (the copying phase), plus a few smaller ones (supporting the assembly)). So, maybe do that “copy k32 + k32 … + k32 big_one” three times to produce three ~800 GB files. Most likely, if it fails, it fails due to the RAID, and it will most likely be somewhere around 2 - 2.5 TB. If you can get those three semi-k35 files produced, we know that at least whatever is advertised is available on your RAIDs. The reason it worked for those two previous builds but fails right now could be some really subtle file size difference (which is a real pain to chase).

As I said, I am out of bullets, and I am grasping at straws. Of course, there is also the memtest scheduled for the next reboot, so that may still point to RAM errors. Maybe, as both your motherboards are the same, and most likely the RAM is the same, this is some weird motherboard / RAM configuration issue that manifests itself on both boxes.

Of course, one more option would be to bring it to Chia’s GitHub issues page and hope that the Chia devs will bite (or that someone else who got similar crashes will chime in).

I do not remember. It was a month or two ago.
I just knew that the plot was toast, because temp space ran out.

I did no deeper dive, because that seemed unnecessary.

I have my 949 GB K35 files that I can duplicate 4 times.

Similar, but not identical.

One motherboard is maxed out with 64 GB of RAM (it has two slots).
The other is at ½ capacity, with 64 GB of RAM (it has four slots).

The memory speeds differ, too, because finding 64 GB kits was a very difficult task at the time.

In February of 2021, finding parts to build an AMD 5950X was like finding toilet paper.
I was lucky to obtain these two MSI motherboards and the memory.

So the two machines are not exactly identical.

Let’s see how my current K35 plotting job runs. I doubled the “-b” option.

If the same problem happens, then I will verify that the RAID 0 is actually providing 4 TB of storage.

If the storage is not an issue, then I will try memtest86.

If the memory is not an issue, then github will be an option. At least I will be able to report on no hardware issues. And since the motherboards are not identical, and the RAM is not identical, I doubt that either of them is faulty (although the BIOS code might contain common traits).

Since chia.exe usually consumes 3% of CPU cycles, only occasionally going higher, and K35 is a huge undertaking, the job runs for close to 3 days.

Sounds like a good plan!

On god. Such a waste.
Without any legit proof.

2 plots will always win more times than a singular plot regardless of k value.

Case closed.

Super basic trig.

Running code that forces issues to surface is not a waste.

…of what?

How did you arrive at your conclusion?

The Earth is flat.
I wrote it. Case closed.

What does that mean?

  1. Attempting trips to Pluto (k34) when we know we don’t have the fuel to make it there. Ya, kinda a waste. In many ways. And ur shooting for other solar systems (k35), which isn’t intended to actually be done for, I’m assuming, 20 more years ish…
    Do you understand how monumentally different computers will be in another 20 years?

What a waste. Of SSD. Of time. Of space. Of effort. An act of futility. One I’m forced to stare at on here.

It’s like being a trump supporter. K33 it’s like ok big dawg chill no one’s getting extra credit here.

2.o proof or citation from chia verifying = yes u would win more with one k33 over 2 k32. The 2x k32 will always win more than the one k33. If I had 3 plots 2 k32. And one k33. Pull each one out of a hat randomly… u will see k32 on ur first pull some 66% of the time.

  1. Trig= triganamity. But now we’re moving to statistics.
    Of why k33+ is a bad idea for now

Let me explain again another way… 1 k33 vs 2 k32….

Those k32 plots have a greater chance of winning than your k33.
.
It’s just math.

Unless chia has explicitly stated. Yes higher k value plots have a greater chance of winning… at random… making it less… random… which means less secure.

Gosh I feel like this needs to be explained better. I see the problem. I’m just not the one. Lol all I can say is trust me. It’s not like pools are paying u more for ur k33 plot…

There is no correlation between plot sizes and trips to Pluto.
I should know. I have been to Pluto.

Just because someone does not have enough fuel to go shopping, eat out, day at the beach, etc, does not have any relationship to plotting.

I do.

TVs will also be monumentally different.
Electric cars will also be monumentally different.
Tax code will also be monumentally different.

How is it a waste of SSD?

Same question.

Did you lose a bet?

Please provide the survey that finds that plotting K34 files is related to politics.

Who is “big dawg”?
Who should chill?

I was counting on that extra credit. Are you sure that none will be given? Should I write my congressman?

What is “2.o”?

The Earth is flat, because I say so.

I saw no math.
Please quote the math.

Trigonometry?
What does that field of mathematics have to do with plotting or Chia?
I can see algebra or calculus playing a role. But neither trigonometry nor triganamity.
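
For what it is worth, the math in question can be sketched in a few lines. This assumes that a plot's chance of winning scales with the space it represents, using the weighting (2k + 1) * 2^(k - 1), which is my understanding of how Chia's consensus code estimates plot size. Both the formula and the function names below are assumptions for illustration, not something quoted in this thread.

```python
def plot_weight(k):
    """Assumed space weighting of one plot: (2k + 1) * 2**(k - 1)."""
    return (2 * k + 1) * 2 ** (k - 1)


def k32_equivalents(k):
    """How many k32 plots one plot of size k is worth under the assumed weighting."""
    return plot_weight(k) / plot_weight(32)


for k in (32, 33, 34, 35):
    print(f"k{k}: about {k32_equivalents(k):.2f} k32 plots")
```

Under that assumption, one k33 counts for roughly 2.06 k32 plots, so counting plots without regard to their k value does not settle the question.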

Larger plots are less secure?
Do the Chia developers know that?

Did I miss warnings when I kick off K33 and K34 plotting jobs?
Did I miss warnings in this forum?
Did I miss warnings in Chia’s help wiki?

You and I are simpatico on that point.

We are, again, simpatico.

Let me know where to e-mail my VISA info and social security number.

Don’t waste your time feeding the troll!
If you look at his other forum comments, the best thing is to ignore him.

have a nice weekend :slight_smile:

U just mad.

It’s clear who’s being a troll here

Anyone who says it’s wise to waste hdd space and store literally a less # of winning plots…

Its hopeless I see that now.

Someone somehow convinced you that 1 is somehow better than 2 in a randomly drawn lottery.

My hat’s off to that troll :troll: legendary.

Mess up a whole community type troll.

I’m the one trying to liberate you guys.
I get u feel committed and all in but just stop. It’s dum.

Literally some guy was like oh ya k33 bro win wayyy more than my k32.
And his buddy was like noonbroooo k34 is where it’s at…
And you all think because it’s bigger it has a better chance of winning… it’s still just one plot, only counts for 1, even tho it’s double or triple the size. Size is irrelevant.

I don’t know how more I can simplify this…

And the Pluto reference… are we going to Pluto one day. Sure. Is that day today. No because our tech isn’t on that lvl yet… but it will be one day. Doesn’t mean we shoot monkeys at Pluto in rockets every night till we hit it in 20 years. And call that a practical test of bugs or whatever.

It’s entirely pointless, I susseed. U guys wanna believe in unicorns :unicorn: go ahed

I’m just bringing simple facts of what numbers are and do. Doesn’t matter k size. Because 2 plots is always 2 plots. And 1 plot is only 1 effin plot…

And u guys all believed him and sacrificed literally half your disk space…

U see that right.

You wrote it on the internet, and therefore, like everything else you fabricated, it is true.

…addressing each of your unsupported claims.

If you do not use complete sentences, then you leave others to guess what you are attempting to convey.

I’ll take a shot, assuming you are asserting that there are folks that are proponents of wasting hard drive space.

Who are those folks?
Please point to an article, or quote anything in this thread, or from Pluto, where someone is advocating for wasting hard drive space.

Please define “It’s”.

You are consistent in fabricating conclusions based on farts in the wind.

I can translate only so much gibberish.

More gibberish.

It would be helpful to know from what bondage you are applying your powers of liberation.
What color is your cape?

More gibberish, but I will take a shot at “It’s dum.”
When you imply that others are dumb, you should not convey your level of intellect by misspelling that word, of all words.

“Literally”?
So this time you are not blowing smoke?

“some guy”?
I think I know “some guy”.

…and that is related to the topic of this thread “K35 repeated plot creation failure” in what way?

Small world. I think I know some guy’s buddy.

I did not know that you know what we all think.

I know that you do not know.

I told you that I already went to Pluto.

Then why do you persist in writing comments that are entirely pointless?

You have neither brought simple facts nor medium facts nor hard facts nor any facts. You write that you do without ever doing so.

And yet your rants are entirely about K size.

And 2 coins are always 2 coins.
And 2 baseball cards are always 2 baseball cards.
And 2 CPUs are always 2 CPUs.

Getting excited is not an effective way to convey facts. Adding phonetic-type of cursing in this forum is improper.

Consider the following:
And 3+3 equals [curse word] 5.

Do you see how adding foul language makes things true?
If not, then consider the following:

And 3+3 equals [curse word + curse word + curse word] 5.

In other words, when your BS is not selling, then add curse words. When that does not work, then keep adding curse words until you are successful.

Which guys?
All of us?

A blind man can see it.

.

@Jacek

The same problem happened during phase 3.
But rather than happening while compressing tables 6 and 7, it happened while compressing tables 5 and 6.

The message also contained the “Only wrote…” language, but with different byte values and a different offset value. The process waited 5 minutes, and ran to completion. Yet, the plot is bad.

This time, when I ran “chia plots check -g …”, I caught something that I must have missed in the past (because it scrolls by quickly):

2022-06-19T14:55:38.356  chia.plotting.manager            : ERROR    Failed to open file C:\mounts\g-tec-11\chia_output\plot-k35-2022-06-16-03-14-2db10c138d8fa09c8e33146dddc48bc4801ef70b0cf5f440abfca64c12b8b02e.plot. Invalid plot header magic Traceback (most recent call last):
  File "chia\plotting\manager.py", line 350, in process_file
ValueError: Invalid plot header magic

Perhaps the above message will lend a clue. But I suspect that it only reveals what is wrong with the file, and not why something went wrong with the file.
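
To look at what the checker is complaining about, the start of the file can be inspected directly. This is a hedged sketch: it assumes that a valid plot begins with the ASCII bytes "Proof of Space Plot", which is my understanding of the plot format and is not confirmed anywhere in this thread. The constant and function names are made up.

```python
ASSUMED_MAGIC = b"Proof of Space Plot"  # assumption about the plot header, not verified here


def has_plot_magic(path, magic=ASSUMED_MAGIC):
    """Return True if the file at `path` starts with the assumed magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(magic)) == magic


def first_bytes(path, count=64):
    """Return the first `count` bytes of the file, for eyeballing what is really there."""
    with open(path, "rb") as f:
        return f.read(count)
```

If first_bytes shows all zeros or random-looking data, that would fit with a header that was never written or was overwritten, but it still would not say why.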

I am now testing that all 4 TB of my RAID 0 is truly writable.

Assuming all goes well, then I will either run memtest86, or just not bother with K35 files.
This is because re-booting my PC takes several minutes, due to the scores of connected USB drives, and I do not want to be off-line unless it is necessary (and K35 plots are not really necessary).

Perhaps I will re-visit this when time permits.

Thanks for your help.

You have no magic!!!

Yup, it is obvious right now what is happening. And apparently you screwed up. Your drive … :wink:

What is obvious is that the code at play is not production ready. As @chiameh pointed out, when the Chia devs call system-level APIs, they don’t bother to check the error codes, and that is high-school-level programming, not really the level of a company that got $70m in VC funding.

We can speculate ad nauseam about what has happened, but you should rather bring it to GitHub and have them fix this crap (if only by adding error checks and providing a special build for you to check it again). You should also point out to them that if they show that 5-minute message, it should be clear enough about what to do. On the other hand, as the ERROR message suggested, that was a catastrophic error, so that ‘be in 5’ message made no sense at all. I would really not waste my time on checking the RAID, as Windows components are being tested day in and day out; Chia, on the other hand, is the code that craps out every other day in many places.

So, I would close this thread until they address this issue, and eventually restart it (if you feel like continuing) when they give you a green light.

By the way, one of the best series of software books is by Donald Knuth. He offered (IIRC) $2.56 for every error found in what he wrote. There were so few bugs (maybe a handful) that people kept his checks and put them on a wall as trophies.

UPDATE
By the way, that new error is rather irrelevant. It is a post-mortem error, where we know that plot corruption has already happened. I don’t know the plot layout, so I don’t know how headers are placed in the file. It could be a completely misleading error: when the offending fwrite failed, the software most likely didn’t do a proper cleanup, leaving just junk behind (either the header was not updated when partial data was written, or it was written expecting that data would follow). So, the proper behavior (in production code) would rather be to kill such a plot, as there is no recovery for it.

The important part is that we know this is plot corruption, and we know that it doesn’t happen in the same place. Usually random crashes are related to H/W (e.g., temp of NVMe, yada-yada-yada). However, the plot is not a fixed file, but is randomized for every run, so that may as well be the source (small issues accumulating and resulting in the final crash). So, unless Chia instruments that code, we are just barking at random trees, looking for some patterns, but not really having proper clues. That is Chia’s QA job (if they had one, of course).

Maybe edit the title to say that the thread is on pause until Chia addresses the issue.
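
To illustrate what checking the result of a write looks like, here is a minimal sketch in Python rather than the plotter's own code. It is not how Chia writes plots; write_all is a made-up helper that loops until every byte is written and fails loudly on a short write instead of carrying on.

```python
import os


def write_all(fd, data):
    """Write all of `data` to file descriptor `fd`, or raise OSError."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write may write fewer bytes than requested; it returns the count.
        written = os.write(fd, view[total:])
        if written == 0:
            raise OSError(f"short write: only wrote {total} of {len(view)} bytes")
        total += written
    return total
```

The point of the design is that a partial write either gets completed or stops the job with a clear error, so a half-written file is never silently treated as a finished plot.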

I filled my 4 TB, RAID 0 partition with as much as I could stuff into it (primarily four copies of the same K35 file), plus a few more files to fill in the rest.

All 4 TBs were writable.

I also did an “fc” (Windows’ version of Linux’s “diff”) between the copies of the K35 files, and they were all identical.
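
For anyone repeating this check, the comparison can also be sketched with hashes, which avoids reading two huge files side by side for every pair. This is a sketch with made-up function names; on the Windows side, fc /b is the binary comparison mode.

```python
import hashlib


def file_digest(path, chunk_size=16 * 1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths):
    """Return True if every file in `paths` has the same digest."""
    return len({file_digest(p) for p in paths}) <= 1
```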

So between this K35 plot failure happening on two different PCs, with somewhat different hardware, and happening repeatedly during different portions of phase 3, it very much points to a Chia software issue.

.

.

If you hear about a version of madmax surfacing (or news that it is in the works) that can handle > K34 size plots, please let me know.

Maybe I will use an external drive, with 5 TB of space, kick off a K36 plot, and check back in a month to see if it finished (without corruption)?

I never heard that. But I will remember that. It is a good line.

It was a deliberate substitution, as we were just randomly picking H/W to try to explain what had just happened (rather than stubbornly picking one).

To me, the whole thing with anything bigger than k32 is a perfect example of how poor the planning at Chia is. They have crap manifesting left and right every day and with every new release, yet they waste their resources on something that may be useful in the future, and only if Chia does not go belly up. They have/had access to some of the best devs out there (MM and BB), and basically they abuse them by asking them to support those extra k values. As mentioned before, if anything is needed, then (IMO) it would be a k32 plot that is just partially filled, to close the empty disk space on all those drives, as that has more green value than those bigger plots. So, really no respect for JM coming to this forum and spewing that crap about how lab-ready the new BB will be.

I do understand your point, but I would rather not go that route myself.

I think you have a hardware problem with your 4 TB RAID 0. Do you use the same 4 TB RAID 0 (SSDs and controller/adapter) in your different systems?
Have you overclocked any parts?