Farm Infrastructure Issues - Seeking Input

We have a unique infrastructure, which I've mapped out below. We are not able to get any proofs, so something is amiss. We would greatly appreciate feedback, assistance, and support from the community about what is wrong with our system and how we might remedy the problem.

I will be monitoring the thread and will follow up on any further inquiries, requests, or suggested solutions as replies come in. Thanks in advance to you all.

1. VM_Host1 (Farmer & 20 Harvesters)
   1. OS = VMware ESXi 7.0 U3
   2. CPU = Intel(R) Core(TM) i9-10900 @ 2.80GHz
   3. RAM = 64GB - TEAMGROUP T-Force Zeus DDR4 kit (2 x 32GB), 3200MHz (PC4-25600), CL20 (TTZD464G3200HC20DC01)
   4. Local HDs
      1. datastore1 = XPG SPECTRIX S40G, 4TB M.2 SSD
      2. datastore2 = 256GB SATA SSD
      3. datastore3 = 500GB SATA
      4. datastore4 = 500GB SATA
      5. datastore5 = 500GB SATA
   5. Network (all NICs are 1Gb connections)
      1. NIC1 = 192.168.83.0/24
      2. NIC2 = 192.168.84.0/24
      3. NIC3 = 192.168.85.0/24
      4. NIC4 = 192.168.86.0/24
   6. VMs
      1. Farmer
         1. OS = Ubuntu Linux 20.04.4
         2. CPU = Intel(R) Core(TM) i9-10900 @ 2.80GHz (1 socket, 8 cores)
         3. RAM = 16GB
         4. Local HDs: VMDK1 = 256GB
         5. Network: NIC1 = 192.168.83.0/24, NIC2 = 192.168.84.0/24, NIC3 = 192.168.85.0/24, NIC4 = 192.168.86.0/24
      2. Harvesters 6-25 (20 VMs)
         1. OS = Ubuntu Linux 20.04.4 (all)
         2. CPU = Intel(R) Core(TM) i9-10900 @ 2.80GHz (1 socket, 4 cores each)
         3. RAM = 4GB each
         4. Local HDs: VMDK1 = 32GB
         5. Network: NIC1 = 192.168.83.0/24, NIC2 = 192.168.84.0/24, NIC3 = 192.168.85.0/24, NIC4 = 192.168.86.0/24

2. VM_Host2 (4 Harvesters)
   1. OS = VMware ESXi 7.0 U3
   2. CPU = 6 CPUs x Intel(R) Core(TM) i5-10400 @ 2.90GHz
   3. RAM = 32GB
   4. Local HDs
      1. datastore1 = 2TB (WDC WD2000FYYZ-0)
   5. Network
      1. NIC1 (vSwitch0) = 192.168.83.0/24 & 192.168.84.0/24
      2. NIC2 (vSwitch1) = 192.168.85.0/24 & 192.168.86.0/24
   6. VMs
      1. Harvesters 2-5 (4 VMs)
         1. OS = Ubuntu Linux 20.04.4 (all)
         2. CPU = Intel(R) Core(TM) i5-10400 @ 2.90GHz (1 socket, 4 cores each)
         3. RAM = 1.5GB each
         4. Local HDs: VMDK1 = 32GB
         5. Network: NIC1 = 192.168.83.0/24, NIC2 = 192.168.84.0/24, NIC3 = 192.168.85.0/24, NIC4 = 192.168.86.0/24

3. TrueNAS (contains 12 plotting drives, shared via NFS)
   1. OS = TrueNAS 12.07
   2. CPU = Intel(R) Xeon(R) E5-2640 v3 @ 2.60GHz
   3. RAM = 64GB
   4. Local HDs (brands vary)
      1. Pool1 = 12TB
      2. Pool2 = 12TB
      3. Pool3 = 12TB
      4. Pool4 = 12TB
      5. Pool5 = 12TB
      6. Pool6 = 12TB
      7. Pool7 = 12TB
      8. Pool8 = 12TB
      9. Pool9 = 12TB
      10. Pool10 = 10TB
      11. Pool11 = 10TB
      12. Pool12 = 10TB
   5. Network
      1. NIC1 = 192.168.83.0/24
      2. NIC2 = 192.168.84.0/24
      3. NIC3 = 192.168.85.0/24
      4. NIC4 = 192.168.86.0/24

4. Synology DS418 (contains plotting drives, shared via SMB)
   1. Local HDs (Seagate Exos X10 10TB enterprise/data center drives: 3.5 inch, 6Gb/s, 7200 RPM, 128MB cache)
      1. Pool1 = 10TB
      2. Pool2 = 10TB
      3. Pool3 = 10TB
      4. Pool4 = 10TB
   2. Network
      1. NIC1 = 192.168.83.0/24
      2. NIC2 = 192.168.84.0/24
No one wants to analyze a complex system where the problem was most likely introduced early in the process.

Why don't you:

  1. shut everything down except your farmer, and analyze the logs
  2. once you see that the farmer logs are clean, add a couple of drives to that farmer and check the logs to confirm they are recognized (by default, there should be an entry about those drives roughly every 3 minutes) - see the log-check sketch below
  3. after that, try bringing up one harvester to see whether it can communicate with the farmer
  4. if all is good at that point, you should be able to bring up one VM host at a time and make it work

Otherwise, there are just too many moving parts to look at or to suggest anything about.
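In case it helps with steps 1 and 2, here is a minimal log-check sketch. It assumes a default mainnet install with the log level set to INFO; the log path below is the default one, so adjust it if yours differs:

LOG=~/.chia/mainnet/log/debug.log

# recent errors or warnings
grep -E "ERROR|WARNING" "$LOG" | tail -n 20

# are plots being scanned? the "Total N plots" on these lines reflects what the harvester can see
grep "plots were eligible" "$LOG" | tail -n 5

# farmer <-> harvester chatter
grep -i "harvester" "$LOG" | tail -n 20

If the eligible-plot lines stop appearing, or the plot count in them drops, that usually points at whichever piece you just added.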

That said, I kind of fail to see the advantage of using VMware to split the box into, for instance, separate harvesters. If all the VM guests on such a host can work as advertised, a single system should also be able to handle all the connected resources. (Although maybe this layout gives you some advantage when using 1Gbps Ethernet links connected to separate NICs on the same box - a potentially less expensive alternative to going the 2.5 or 10Gbps route.)

7 Likes

If I counted right, you have ~180TB of data (plot) drives. Holy cow, you could farm that with one (slow) NUC. Are you planning on expanding to a PB farm, or is that it? If that is it, perhaps simplify the whole thing to one plotting PC and a NUC and have at it.
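For reference, the raw capacities listed above roughly bear that out: 9 × 12 TB + 3 × 10 TB on the TrueNAS (138 TB) plus 4 × 10 TB on the Synology (40 TB) comes to about 178 TB.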

4 Likes

I can speak to this because I have 3 harvesters running as VMs on one system. The reason is speed. For some reason, when a Chia harvester is given multiple remote locations, it runs dramatically slower. It must have to do with it opening and closing connections between resources. But if you assign one harvester to each remote set of drives, it runs great, even if you have multiple harvesters on the same system.

Here is a thread I did on this setup:

But the bottom line was: 1 harvester with multiple remote drive arrays = very poor performance; 1 harvester per remote drive array = great, even if those harvesters are on the same system.
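For anyone wanting to try that split, here is a minimal sketch of what "one harvester per remote array" looks like in practice, assuming NFS mounts. The NAS IP, export path, and mount point are placeholders, not the actual shares in this thread:

# on harvester VM "A" only: mount its one assigned array and point the harvester at it
sudo apt install nfs-common          # NFS client tools, if not already installed
sudo mkdir -p /mnt/pool1
sudo mount -t nfs 192.168.84.50:/mnt/Pool1 /mnt/pool1

# add that directory to this harvester's config and restart it
chia plots add -d /mnt/pool1
chia start harvester -r

Each harvester VM gets exactly one such mount in its config, so no single harvester ever has to scan more than one remote array.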

1 Like

That other thread about your VM route was a good read. It looks to me like you should report it on Chia's GitHub Issues page and have them look at that setup. If the problem really is in the harvester code, I would think the insight gained from your setup could further improve how the harvester works.

There is always a chance that the issue is no longer an issue. That setup/test was done in May of 2021. I am still running that setup because it works. But since there have been many updates since then, the issue may be gone or reduced. I don’t know. Not worth reworking my setup to find out.

Without a whole bunch of info about every little thing in that setup and all the system logs, I think anyone would have a hard time figuring out where the problem lies.

I think the only way to diagnose it (unless you get lucky) is to start simple and work your way back into this setup so you can see when the problem pops up.

Connect a farmer with some local hard drives directly to the switch and run with a pool, so you can quickly see whether partials are being correctly received by the pool.

If that works, enable the VLANs again and see if it still works, then add a harvester, and so on.
Troubleshooting a complex system like this for an atypical workload (Chia farming) is going to have you chasing ghosts forever unless you break it down.
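A quick way to sanity-check that stripped-down farmer, assuming a pooled setup and the default log location (the exact wording of the partial-submission log line may differ between versions, so adjust the grep if needed):

# plot count, sync state, and farming status at a glance
chia farm summary

# are partials actually being sent to the pool?
grep -i "Submitting partial" ~/.chia/mainnet/log/debug.log | tail -n 10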

1 Like

Strip it down to one piece, start it up, and see if that runs correctly, then add the next part!

2 Likes

I have a similar setup, and it's awesome. I would first check to make sure your farmer has connections on port 8444. To confirm connections, run:

chia show -s -c | wc -l
(counts how many connections you have)

and

chia show -s -c
(lists the actual connections on 8444)

I did have harvesters slow down when I put more than 1500 plots per harvester. Keeping it around 1500 keeps responses below 0.25 sec. This is a Chia software issue; I use VMs to get around it.
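If it helps to quantify that, here is a rough sketch for pulling the slowest recent lookup times out of the harvester log. It assumes the default log path and INFO log level; the "Time:" field comes from the standard eligible-plots log line:

grep "plots were eligible" ~/.chia/mainnet/log/debug.log \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "Time:") print $(i + 1) }' \
  | sort -n | tail -n 5

Anything creeping toward 5 seconds or more is a sign the harvester is struggling to reach its plots in time.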

1 Like

ESXi or VMware Workstation?

drhicom: We utilize ESXi

I've used ESXi, Proxmox, and Xen. It's the same across all platforms: once I get over 1500 plots, the response times slow down.

How are your drives connected?

Thank you for the insight WolfGT. We’ve been discussing your comments…

Sometimes diagnosis is quite cumbersome. I guess if it were easy, then everybody would be on board!

The Synology is connected via Samba, and the TrueNAS via NFS.

The TrueNAS protocol was changed to NFS recently in an effort to reduce response times. Notably, our response times dropped to below 1 second.

We began having our issue with 0 proofs around that time, but haven't conclusively linked the protocol switch to the issue.
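One thing that might help separate the network change from the plot files themselves: chia plots check reads plots locally and reports whether they still produce the expected proofs. Run it on the machine where the plots are mounted; -n is the number of challenges checked per plot:

# spot-check each visible plot against 30 challenges
chia plots check -n 30

If the plots pass locally but the farm still reports 0 proofs, the problem is more likely in the farmer/harvester/NAS path than in the plots.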

I don't have much over-network plot space (it's 95% direct attached), but I have not seen any appreciable difference on TrueNAS using NFS over Samba. In fact, I reverted back to Samba after researching performance and reading articles from reputable sites that benchmark the two options.

What I did end up doing on my TrueNAS box was adding a dedicated NIC for the Samba share and also adding a dedicated NIC on my harvester machine, then linking the two directly with one cable so that they are the only two devices on that network (point to point, end to end). Then I set the MTU to 9000 (/9014), often referred to as "jumbo frames", on both NICs of the link, with a positive increase in performance.

Typical average response times are less than 0.25s, and only go near 5s when a proof is found.
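For anyone wanting to replicate the jumbo-frame link, here is a minimal sketch for the Ubuntu harvester side. eth1 and the peer address are placeholders for the point-to-point NIC pair; the NAS end (and anything in between) must also be set to MTU 9000:

sudo ip link set dev eth1 mtu 9000
ip link show eth1 | grep mtu

# verify the path end to end: 8972 bytes of payload + 28 bytes of IP/ICMP headers = 9000
ping -c 3 -M do -s 8972 192.168.90.2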

1 Like

Meant to add, I also have flow control switched off on both NICs in the link.
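On Linux that is typically done with ethtool, for example (eth1 is again a placeholder interface name, and not every NIC driver supports changing pause parameters):

# show current flow-control (pause) settings, then turn them off
ethtool -a eth1
sudo ethtool -A eth1 rx off tx off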

It's myself and a partner. He's the guy who first wired his college with Ethernet.

We’ll be sure to touch on the points you mentioned when we talk next.

Thank you

This is great insight, thanks!