There are 64 NUMA memory pipes available. If you run 64 threads or fewer, each thread will run at 100% load. If you enable 88 threads, Windows will start "sharing" the memory pipes: the 24 threads above the first 64 share pipes with 24 of the first 64, so all 48 of those threads run at only 50% efficiency, because half the time they are running and half the time they are waiting for memory.
You would wind up with 40 threads running at full load and 48 threads running at 50% load, i.e. the equivalent of 64 threads. Since the memory channels would have to be continuously loaded and unloaded, it would actually be less efficient than just running 64 threads.
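To make the arithmetic concrete, here is a rough Python sketch of the sharing model as I described it. The function name and the default of 64 pipes are just my illustration, and it ignores the extra load/unload overhead, so treat the result as a best case rather than a measurement.

```python
def effective_threads(total_threads, pipes=64):
    """Return (full_speed, half_speed, effective) under the simple sharing model.

    Assumption: each thread beyond `pipes` shares a pipe with exactly one of
    the first `pipes` threads, and both threads of a sharing pair run at 50%.
    """
    if total_threads < 0 or total_threads > 2 * pipes:
        raise ValueError("model only covers 0 .. 2 * pipes threads")
    extra = max(0, total_threads - pipes)   # threads above the pipe count
    half_speed = 2 * extra                  # each extra thread drags one partner to 50%
    full_speed = total_threads - half_speed # everyone else still has a pipe to itself
    effective = full_speed + 0.5 * half_speed
    return full_speed, half_speed, effective


if __name__ == "__main__":
    full, half, eff = effective_threads(88)
    print(f"{full} threads at 100%, {half} threads at 50%, equivalent to {eff} threads")
```

With 88 threads this gives 40 at full load and 48 at half load, which adds up to the same work as 64 threads.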
So to overcome this limitation, you would set your program up to launch 44 threads on Node 0 (CPU0) and 44 threads on Node 1 (CPU1). Then all 88 threads would be running at 100% load.
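One possible way to do the per-node launch is with the Windows "start" command and its /NODE switch. The Python sketch below only builds the command lines by default. The program name worker.exe is a placeholder I made up, and this is not necessarily how any particular setup does it.

```python
import subprocess


def node_launch_command(node, program, *args):
    """Build a command line that asks Windows to start `program` on a NUMA node."""
    # "start /NODE <n>" sets the preferred NUMA node for the new process.
    # The empty "" is the optional window title; passing it explicitly stops
    # "start" from treating a quoted program path as the title.
    return ["cmd", "/c", "start", "", "/NODE", str(node), program, *args]


def launch_per_node(program, nodes=(0, 1), dry_run=True):
    """Build one launch command per node, and run them unless dry_run is set."""
    commands = [node_launch_command(n, program) for n in nodes]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # Windows only
    return commands


if __name__ == "__main__":
    # "worker.exe" is a hypothetical program that starts its own 44 threads.
    for cmd in launch_per_node("worker.exe"):
        print(" ".join(cmd))
```

Each copy of the program then has to be told separately to run its share of the threads (44 each in the 88-thread case), since the switch only picks the node.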
We proved the technique works as I had described in my post to Pete. We launched one BOINC client on node 0 and a second BOINC client on node 1. On our 72-thread machines, all threads were running at 100% load.