While unlikely, it is possible that 20 or 24 threads would yield a bit of improvement. Hyperthreads don't always help with matrix solving, but since this is a benchmark thread it would be nice to demonstrate that. I suggest 20 as an alternative because a run that uses every available HT can be slowed by any background process, and that effect should be reduced if we leave a few HTs 'open'. I've found situations where using N-1 cores runs faster than N cores, for what I presume are similar reasons.
HT helps a lot on LA, at least for me.

VBITS=128 on an otherwise idle machine. ETA after 1% of job:
6 threads: 14 hr 34 min
12 threads: 8 hr 26 min
18 threads: 9 hr 15 min
24 threads: 8 hr 27 min
These times look rather slow; I just installed the extra 32GB memory today, so perhaps filling all 8 slots slows memory access a bunch. Sometime I'll remove the original 16GB and see if 4 sticks is faster than 8.
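For anyone who wants the ETAs above as ratios rather than clock times, here is a quick sketch. Using the 6-thread run as the baseline is my own choice for illustration, not something the post specifies, and the names `eta_minutes` and `speedup` are made up for this example.

```python
# ETAs quoted above, converted to minutes (hours * 60 + minutes).
eta_minutes = {
    6: 14 * 60 + 34,
    12: 8 * 60 + 26,
    18: 9 * 60 + 15,
    24: 8 * 60 + 27,
}

# Assumption: the 6-thread run is taken as the reference point.
baseline = eta_minutes[6]


def speedup(threads):
    """Return how many times faster the given run is than the 6-thread run."""
    return baseline / eta_minutes[threads]


for threads in sorted(eta_minutes):
    print(f"{threads:2d} threads: {eta_minutes[threads]:4d} min, "
          f"{speedup(threads):.2f}x vs 6 threads")
```

These are ETAs taken after 1% of the job, so the ratios are only a rough guide.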

Note that we only count the LA phase in our calculations. 

12 = 8h04m50s
20 = 8h32m31s
24 = 7h59m33s
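To put those three LA times side by side, a small sketch like the following converts them to seconds and expresses each as a percentage difference from the 12-thread run. I am assuming the left-hand numbers are thread counts; `to_seconds`, `la_times`, and `percent_vs_12` are hypothetical names for this example only.

```python
import re


def to_seconds(stamp):
    """Convert a string such as '8h04m50s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)h(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Assumption: keys are thread counts; values are the LA times quoted above.
la_times = {12: "8h04m50s", 20: "8h32m31s", 24: "7h59m33s"}

reference = to_seconds(la_times[12])


def percent_vs_12(threads):
    """Positive means slower than the 12-thread run, negative means faster."""
    return 100.0 * (to_seconds(la_times[threads]) - reference) / reference


for threads in sorted(la_times):
    print(f"{threads} threads: {to_seconds(la_times[threads])} s, "
          f"{percent_vs_12(threads):+.1f}% vs 12 threads")
```

With only one run per setting, differences of a percent or so may well be within run-to-run noise.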

