Farm Infrastructure Issues - Seeking Input

We have a unique infrastructure, which I've mapped out below. We are not able to get any proofs, so something is amiss. We would greatly appreciate feedback, assistance, and support from the community about what is wrong with our system and how we might remedy the problem.

I will be monitoring the thread and will follow up on any further inquiries, requests, or suggested solutions as replies come in. Thanks in advance to you all.

1. VM_Host1 (Farmer & 20 Harvesters)
   1. OS = VMware ESXi 7.0 U3
   2. CPU = Intel(R) Core(TM) i9-10900 @ 2.80GHz
   3. RAM = 64GB - TEAMGROUP T-Force Zeus DDR4 kit (2 x 32GB), 3200MHz (PC4-25600), CL20 (TTZD464G3200HC20DC01)
   4. Local HDs
      1. datastore1 = XPG SPECTRIX S40G, 4TB M.2 SSD
      2. datastore2 = 256GB SATA SSD
      3. datastore3 = 500GB SATA
      4. datastore4 = 500GB SATA
      5. datastore5 = 500GB SATA
   5. Network (all NICs are 1Gb connections)
      1. NIC1 = 192.168.83.0/24
      2. NIC2 = 192.168.84.0/24
      3. NIC3 = 192.168.85.0/24
      4. NIC4 = 192.168.86.0/24
   6. VMs
      1. Farmer
         1. OS = Ubuntu Linux 20.04.4
         2. CPU = Intel(R) Core(TM) i9-10900 @ 2.80GHz (1 socket, 8 cores)
         3. RAM = 16GB
         4. Local HDs: VMDK1 = 256GB
         5. Network: NIC1 = 192.168.83.0/24, NIC2 = 192.168.84.0/24, NIC3 = 192.168.85.0/24, NIC4 = 192.168.86.0/24
      2. Harvesters 6-25 (20 VMs)
         1. OS = Ubuntu Linux 20.04.4 (all)
         2. CPU = Intel(R) Core(TM) i9-10900 @ 2.80GHz (1 socket, 4 cores each)
         3. RAM = 4GB each
         4. Local HDs: VMDK1 = 32GB
         5. Network: NIC1 = 192.168.83.0/24, NIC2 = 192.168.84.0/24, NIC3 = 192.168.85.0/24, NIC4 = 192.168.86.0/24

2. VM_Host2 (4 Harvesters)
   1. OS = VMware ESXi 7.0 U3
   2. CPU = 6 CPUs x Intel(R) Core(TM) i5-10400 @ 2.90GHz
   3. RAM = 32GB
   4. Local HDs
      1. datastore1 = 2TB (WDC WD2000FYYZ-0)
   5. Network
      1. NIC1 (vSwitch0) = 192.168.83.0/24 & 192.168.84.0/24
      2. NIC2 (vSwitch1) = 192.168.85.0/24 & 192.168.86.0/24
   6. VMs
      1. Harvesters 2-5 (4 VMs)
         1. OS = Ubuntu Linux 20.04.4 (all)
         2. CPU = Intel(R) Core(TM) i5-10400 @ 2.90GHz (1 socket, 4 cores each)
         3. RAM = 1.5GB each
         4. Local HDs: VMDK1 = 32GB
         5. Network: NIC1 = 192.168.83.0/24, NIC2 = 192.168.84.0/24, NIC3 = 192.168.85.0/24, NIC4 = 192.168.86.0/24

3. TrueNAS (contains 12 plotting drives, shared via NFS)
   1. OS = TrueNAS 12.07
   2. CPU = Intel(R) Xeon(R) E5-2640 v3 @ 2.60GHz
   3. RAM = 64GB
   4. Local HDs (brands vary)
      1. Pool1 = 12TB
      2. Pool2 = 12TB
      3. Pool3 = 12TB
      4. Pool4 = 12TB
      5. Pool5 = 12TB
      6. Pool6 = 12TB
      7. Pool7 = 12TB
      8. Pool8 = 12TB
      9. Pool9 = 12TB
      10. Pool10 = 10TB
      11. Pool11 = 10TB
      12. Pool12 = 10TB
   5. Network
      1. NIC1 = 192.168.83.0/24
      2. NIC2 = 192.168.84.0/24
      3. NIC3 = 192.168.85.0/24
      4. NIC4 = 192.168.86.0/24

4. Synology DS418 (contains plotting drives, shared via SMB)
   1. Local HDs (Seagate Exos X10 10TB enterprise/data center drives: 3.5 inch, 6Gb/s, 7200 RPM, 128MB cache)
      1. Pool1 = 10TB
      2. Pool2 = 10TB
      3. Pool3 = 10TB
      4. Pool4 = 10TB
   2. Network
      1. NIC1 = 192.168.83.0/24
      2. NIC2 = 192.168.84.0/24
No one wants to analyze a complex system where the problem was most likely introduced early in the process.

Why don't you:

  1. shut everything down except your farmer, and analyze the logs
  2. once you see that the farmer logs are clean, add a couple of drives to that farmer and check the logs to confirm they are recognized (by default, there should be an entry about those drives roughly every 3 minutes) - see the log-check sketch below
  3. after that, try bringing up one harvester to see whether it can communicate with the farmer
  4. if all is good at that point, you should be able to bring up one VM host at a time and make it work

Otherwise, there are just too many moving parts to look at or to suggest anything about.
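In case it helps with steps 1 and 2, here is a minimal log-check sketch. It assumes a default mainnet install with the log level set to INFO; the log path below is the default one, so adjust it if yours differs:

LOG=~/.chia/mainnet/log/debug.log

# recent errors or warnings
grep -E "ERROR|WARNING" "$LOG" | tail -n 20

# are plots being scanned? the "Total N plots" on these lines reflects what the harvester can see
grep "plots were eligible" "$LOG" | tail -n 5

# farmer <-> harvester chatter
grep -i "harvester" "$LOG" | tail -n 20

If the eligible-plot lines stop appearing, or the plot count in them drops, that usually points at whichever piece you just added.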

That said, I kind of fail to see the advantage of using VMware to split the box into, for instance, separate harvesters. If all the VM guests on such a host can work as advertised, a single system should also be able to handle all the connected resources. (Although maybe this layout gives you some advantage when using 1Gbps Ethernet links connected to separate NICs on the same box - a potentially less expensive alternative to going the 2.5 or 10Gbps route.)

7 Likes

If I counted right, you have ~180TB of data (plot) drives. Holy cow, you could farm that with one (slow) NUC. Are you planning on expanding to a PB farm, or is that it? If that is it, perhaps simplify the whole thing to one plotting PC and a NUC and have at it.
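For reference, the raw capacities listed above roughly bear that out: 9 × 12 TB + 3 × 10 TB on the TrueNAS (138 TB) plus 4 × 10 TB on the Synology (40 TB) comes to about 178 TB.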

4 Likes

I can speak to this because I have 3 harvesters running as VMs on one system. The reason is speed. For some reason, when a Chia harvester is given multiple remote locations, it runs dramatically slower. It must have to do with it opening and closing connections between resources. But if you assign one harvester to each remote set of drives, it runs great, even if you have multiple harvesters on the same system.

Here is a thread I did on this setup:

But the bottom line was: 1 harvester with multiple remote drive arrays = very poor performance; 1 harvester per remote drive array = great, even if those harvesters are on the same system.
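For anyone wanting to try that split, here is a minimal sketch of what "one harvester per remote array" looks like in practice, assuming NFS mounts. The NAS IP, export path, and mount point are placeholders, not the actual shares in this thread:

# on harvester VM "A" only: mount its one assigned array and point the harvester at it
sudo apt install nfs-common          # NFS client tools, if not already installed
sudo mkdir -p /mnt/pool1
sudo mount -t nfs 192.168.84.50:/mnt/Pool1 /mnt/pool1

# add that directory to this harvester's config and restart it
chia plots add -d /mnt/pool1
chia start harvester -r

Each harvester VM gets exactly one such mount in its config, so no single harvester ever has to scan more than one remote array.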

1 Like

That other thread about your VM route was a good read. It looks to me like you should report it on Chia's GitHub Issues page and have them look at that setup. If the problem really is in the harvester code, I would think the insight gained from your setup could further improve how the harvester works.

There is always a chance that the issue is no longer an issue. That setup/test was done in May of 2021. I am still running that setup because it works. But since there have been many updates since then, the issue may be gone or reduced. I don’t know. Not worth reworking my setup to find out.

Without a whole bunch of info about every little thing in that setup and all the system logs, I think anyone would have a hard time figuring out where the problem lies.

I think the only way to diagnose it (unless you get lucky) is to start simple and work your way back into this setup so you can see when the problem pops up.

Connect a farmer with some local hard drives directly to the switch and run with a pool, so you can quickly see whether partials are being correctly received by the pool.

If that works, enable the VLANs again and see if it still works, then add a harvester, and so on.
Troubleshooting a complex system like this for an atypical workload (Chia farming) is going to have you chasing ghosts forever unless you break it down.
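A quick way to sanity-check that stripped-down farmer, assuming a pooled setup and the default log location (the exact wording of the partial-submission log line may differ between versions, so adjust the grep if needed):

# plot count, sync state, and farming status at a glance
chia farm summary

# are partials actually being sent to the pool?
grep -i "Submitting partial" ~/.chia/mainnet/log/debug.log | tail -n 10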

1 Like

Strip it down to one piece, start it up, and see if that runs correctly, then add the next part!

2 Likes

I have a similar setup, and it's awesome. I would first check to make sure your farmer has connections on port 8444. To confirm connections, run:

chia show -s -c | wc -l
(counts how many connections you have)

and

chia show -s -c
(lists the actual connections on 8444)

I did have harvesters slow down when I put more than 1500 plots per harvester. Keeping it around 1500 keeps responses below 0.25 sec. This is a Chia software issue; I use VMs to get around it.
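If it helps to quantify that, here is a rough sketch for pulling the slowest recent lookup times out of the harvester log. It assumes the default log path and INFO log level; the "Time:" field comes from the standard eligible-plots log line:

grep "plots were eligible" ~/.chia/mainnet/log/debug.log \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "Time:") print $(i + 1) }' \
  | sort -n | tail -n 5

Anything creeping toward 5 seconds or more is a sign the harvester is struggling to reach its plots in time.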

1 Like

ESXi or VMware Workstation?

drhicom: We utilize ESXi

I've used ESXi, Proxmox, and Xen. It's the same across all platforms: once I get over 1500 plots, the response times slow down.

How are your drives connected?

Thank you for the insight WolfGT. We’ve been discussing your comments…

Sometimes diagnosis is quite cumbersome. I guess if it were easy, then everybody would be on board!

The Synology is connected via Samba, and the TrueNAS via NFS.

The TrueNAS protocol was changed to NFS recently in an effort to reduce response times. Notably, our response times dropped to below 1 second.

We began having our issue with 0 proofs around that time, but haven't conclusively linked the protocol switch to the issue.
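One thing that might help separate the network change from the plot files themselves: chia plots check reads plots locally and reports whether they still produce the expected proofs. Run it on the machine where the plots are mounted; -n is the number of challenges checked per plot:

# spot-check each visible plot against 30 challenges
chia plots check -n 30

If the plots pass locally but the farm still reports 0 proofs, the problem is more likely in the farmer/harvester/NAS path than in the plots.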

I don't have much over-network plot space (it's 95% direct attached), but I have not seen any appreciable difference on TrueNAS using NFS over Samba. In fact, I reverted back to Samba after researching performance and reading articles from reputable sites that benchmark the two options.

What I did end up doing on my TrueNAS box was adding a dedicated NIC for the Samba share and also adding a dedicated NIC on my harvester machine, then linking the two directly with one cable so that they are the only two devices on that network (point to point, end to end). Then I set the MTU to 9000 (/9014), often referred to as "jumbo frames", on both NICs of the link, with a positive increase in performance.

Typical average response times are less than 0.25s, and only go near 5s when a proof is found.
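For anyone wanting to replicate the jumbo-frame link, here is a minimal sketch for the Ubuntu harvester side. eth1 and the peer address are placeholders for the point-to-point NIC pair; the NAS end (and anything in between) must also be set to MTU 9000:

sudo ip link set dev eth1 mtu 9000
ip link show eth1 | grep mtu

# verify the path end to end: 8972 bytes of payload + 28 bytes of IP/ICMP headers = 9000
ping -c 3 -M do -s 8972 192.168.90.2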

1 Like

Meant to add, I also have flow control switched off on both NICs in the link.
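On Linux that is typically done with ethtool, for example (eth1 is again a placeholder interface name, and not every NIC driver supports changing pause parameters):

# show current flow-control (pause) settings, then turn them off
ethtool -a eth1
sudo ethtool -A eth1 rx off tx off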

It's myself and a partner. He's the guy who first wired his college with Ethernet.

We’ll be sure to touch on the points you mentioned when we talk next.

Thank you

This is great insight, thanks!