As many will know, Hyper-Threading (HT) is a trick to squeeze more performance out of a system. It works quite well on systems that don’t push a CPU too hard and run many small processes and threads, because it exposes twice as many logical CPUs as there are actual, physical cores.
But don’t be fooled: those extra “free” logical CPUs are not CPUs at all. They basically look for idle moments in a core’s pipeline and squeeze extra instructions into those “pauses/gaps”. It is way more complicated than that, but this is easier to visualise.
HT is a bad idea in situations where applications need “a full/real CPU”, so for many database applications or other “heavy stuff”, HT is turned off in the BIOS because it is counterproductive.
My system has dual Xeon E5-2640 v4 CPUs @ 2.40GHz with 10 cores / 20 threads each. So 20 real cores.
The OS is Ubuntu Server 20.04 LTS.
So I went to see if HT is beneficial or counterproductive in plotting. Here is what I found:
Before: I had up to 18 plots in parallel, with plots taking 11 to 15 hours to complete. This varied wildly.
After: I went from 25/26 plots a day to 30/31 a day with the exact same hardware.
I kept HT enabled in the BIOS and started using CPU affinity to pin jobs to real CPUs and not let them use “fake” logical CPUs.
I also made sure that jobs don’t cross NUMA zones, as that increases latency because a CPU must then talk to memory which belongs to the other CPU.
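For anyone who has not used CPU affinity before, here is a minimal Python sketch of what “pinning” means on Linux. It uses os.sched_setaffinity from the standard library (Linux only). The helper name pin_to_cpus is made up for illustration, and this is not what the plot manager does internally; in my setup the affinity is set through its config.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict a process to the given logical CPUs (pid 0 = the calling process).

    Returns the affinity set the kernel reports afterwards.
    """
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)


# Hypothetical usage: keep this process on five logical CPUs of one socket.
# pin_to_cpus({0, 1, 2, 3, 4})
```

Child processes inherit the affinity of their parent, so a launcher pinned this way keeps whatever it starts on the same CPUs.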
When looking at the CPU numbering, my system shows 40 CPUs. To find out which Hyper-Threading units (logical CPUs in Windows terminology) are on the same physical core, do:
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
(to query a single CPU, replace * with its number X, where X starts with 0 and ends at 39 on my system)
On my system this gives the following pairs (annotations added):
0,20 (CPU 1, core 1)
1,21 (CPU 1, core 2)
2,22 (CPU 1, core 3)
3,23 (CPU 1, core 4)
4,24 (CPU 1, core 5)
5,25 (CPU 1, core 6)
6,26 (CPU 1, core 7)
7,27 (CPU 1, core 8)
8,28 (CPU 1, core 9)
9,29 (CPU 1, core 10)
10,30 (CPU 2, core 1)
11,31 (CPU 2, core 2)
12,32 (CPU 2, core 3)
13,33 (CPU 2, core 4)
14,34 (CPU 2, core 5)
15,35 (CPU 2, core 6)
16,36 (CPU 2, core 7)
17,37 (CPU 2, core 8)
18,38 (CPU 2, core 9)
19,39 (CPU 2, core 10)
How to interpret:
CPU 20 is actually the "fake" CPU on physical core 0.
CPU 21 is actually the "fake" CPU on physical core 1.
...
...
CPU 39 is actually the "fake" CPU on physical core 19.
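The same interpretation can be done by a small script. This is only a sketch: the function names parse_cpu_list and split_siblings are my own, and it assumes the usual sysfs format where a line is either a comma list like 0,20 or a range like 0-1. It treats the lowest-numbered CPU of each pair as the “real” one, which is just the convention used in this post; both logical CPUs share the same core.

```python
import glob


def parse_cpu_list(text):
    """Turn a sysfs CPU list such as '0,20' or '0-1' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def split_siblings(lines):
    """Return (first CPU of each core, remaining HT siblings), both sorted."""
    primary, siblings = set(), set()
    for line in lines:
        ids = sorted(parse_cpu_list(line))
        if not ids:
            continue  # skip empty lines
        primary.add(ids[0])
        siblings.update(ids[1:])
    return sorted(primary), sorted(siblings)


def read_sibling_lines():
    """Read every thread_siblings_list file (Linux only; one file per logical CPU)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lines = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lines.append(handle.read())
    return lines


if __name__ == "__main__":
    real, fake = split_siblings(read_sibling_lines())
    print("first CPU per core:", real)
    print("HT siblings:", fake)
```

Because sets are used, it does not matter that every pair shows up once per logical CPU when all files are read.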
As I want to use only “real” CPUs, without other stuff running on a “fake” CPU and slowing the real CPU’s work down (a core will try to load-balance between both HT units), I need to modify my Swar Plotmanager config so that I have 4 jobs (I have 4 NVMe drives, one for each job) and assign a maximum of 5 real/full CPUs to each job. 4 jobs x 5 CPUs is 20 CPUs.
I use 4 threads for stage 1, and allow 1 stage 1 plot per job and 4 plots per job. This way, the 5 CPUs available per job are used to the fullest.
CPU Affinity for plotting jobs:
NUMA Node 0:
Job 1: 0,1,2,3,4
Job 2: 5,6,7,8,9
NUMA Node 1:
Job 3: 10,11,12,13,14
Job 4: 15,16,17,18,19
As you can see, this also keeps each job within a single NUMA node. In my dual-CPU setup the nodes are:
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
(use the command “lscpu” to find this out on your system)
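To show how the job layout follows from those two NUMA lines, here is a rough Python sketch. It takes the CPU lists as lscpu prints them, throws away the “fake” HT siblings, and cuts what is left into groups of 5 per node. The names parse_cpu_list and plan_jobs are invented for this example, and the inputs are simply the values from my machine; adjust them for yours.

```python
def parse_cpu_list(text):
    """Turn a CPU list such as '0-9,20-29' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def plan_jobs(numa_nodes, ht_siblings, cpus_per_job):
    """Group the non-sibling CPUs of each NUMA node into fixed-size job sets.

    Groups never span two nodes; CPUs that do not fill a whole group are left out.
    """
    jobs = []
    for node in sorted(numa_nodes):
        real = sorted(parse_cpu_list(numa_nodes[node]) - set(ht_siblings))
        for start in range(0, len(real) - cpus_per_job + 1, cpus_per_job):
            jobs.append((node, real[start:start + cpus_per_job]))
    return jobs


if __name__ == "__main__":
    # Values from this post: two nodes, CPUs 20-39 are the HT siblings.
    numa = {0: "0-9,20-29", 1: "10-19,30-39"}
    for number, (node, cpus) in enumerate(plan_jobs(numa, range(20, 40), 5), 1):
        print(f"Job {number} (NUMA node {node}): {','.join(map(str, cpus))}")
```

With the values from my system this reproduces the four job lists shown above.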
Mind you, as I did not disable HT in the BIOS, the other 20 CPUs are still around despite them being fake. But they are fine for other things an operating system does, like disk and network I/O.
So instead of 18 jobs in parallel (using all 40 logical CPUs / threads) and not using any form of CPU or NUMA affinity, I now have 12 jobs in parallel max., but those jobs consistently finish in 8 to 9 hours instead of 11 to 15 hours. Due to the way I staggered them, I now get 30 to 31 plots a day instead of 25/26 on average.
Added bonus: the CPUs get less hot. They used to be around 55 degrees C all day long. With the new config, they are around 44 degrees. Fan speeds are lower as a consequence, and this saves a little bit on electricity.
That’s it. That is what I wanted to share with you. Hope you find it useful.
Disclaimer: it took me close to a week of tinkering to get this result. A little bit more here, a little bit less there.
At one point the machine started plotting very consistently and predictably and, keeping in mind these CPUs are quite old, I’m quite pleased with the result.