So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.
The HT thing was just an observation. It does materially affect the Affinity setting, though: if you decide to turn on HT, you will need a different set of values for Affinity.

We still need numactl to evaluate how stage 2 performs with local memory vs non-local memory. If it turns out that using only half of the total memory, but all of it local, makes stage 2 much faster, then that is the way to go.
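To make that comparison concrete, here is a rough sketch of the two numactl invocations I have in mind. It only builds and prints the command lines, it doesn't run anything. The assumptions are mine: a two-node machine (nodes 0 and 1), and `./mprime` as a stand-in for however you start P95 on Linux. The `--cpunodebind` and `--membind` options are standard numactl options.

```python
def stage2_test_command(cpu_node, mem_node, program=("./mprime",)):
    """Build a numactl argv that runs `program` on one node's cores
    while forcing its memory to come from `mem_node`.

    `program` is a placeholder; "./mprime" is an assumed binary path.
    """
    return [
        "numactl",
        f"--cpunodebind={cpu_node}",  # run only on this node's cores
        f"--membind={mem_node}",      # allocate only from this node's memory
        *program,
    ]


# Same cores in both runs; only the memory location changes.
local_run = stage2_test_command(cpu_node=0, mem_node=0)   # local memory
remote_run = stage2_test_command(cpu_node=0, mem_node=1)  # non-local memory

print(" ".join(local_run))
print(" ".join(remote_run))
```

Time stage 2 on the same work under each command; the difference is the cost of non-local memory.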

Hopefully, with the right numactl and Affinity settings, we'll be able to run two instances of P95, each using local memory, while fully utilizing the cores. If that does work, I have no doubt that it will give you the best performance. It may or may not be significantly better than running a single instance, but that's what we want to find out.
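As a sketch of the two-instance setup, something like the following would give one instance per NUMA node, each bound to its own cores and memory. Again this only prints the plan rather than launching anything. The two-node layout, the `./mprime` binary name, and the `instance0`/`instance1` directory names are all assumptions for illustration, and I'm assuming each instance is started from its own directory so the two don't share work files. The Affinity values still have to be set separately in each instance's own configuration.

```python
def two_instance_plan(nodes=(0, 1), program=("./mprime",)):
    """Return one (working_dir, argv) pair per NUMA node.

    Directory names and the program path are hypothetical placeholders.
    """
    plan = []
    for node in nodes:
        working_dir = f"instance{node}"  # assumed: one directory per instance
        argv = [
            "numactl",
            f"--cpunodebind={node}",  # keep this instance on its node's cores
            f"--membind={node}",      # and on that node's local memory
            *program,
        ]
        plan.append((working_dir, argv))
    return plan


for working_dir, argv in two_instance_plan():
    print(f"cd {working_dir} && {' '.join(argv)}")
```

Comparing the combined throughput of the two instances against the single-instance figure answers the question either way.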
