Xeon Phi does pretty well with 4 workers, one designated for P1 on 12 GB of MCDRAM.

It's a tricky balancing act: more throughput vs. the chance of finding a stage 2 factor. Here is the (edited) benchmark I ran:
Prime95 64bit version 30.3, RdtscTiming=1
Timings for 6048K FFT length (64 cores, 1 worker): Throughput: 218.85 iter/sec.
Timings for 6048K FFT length (64 cores, 2 workers): Throughput: 432.09 iter/sec.
Timings for 6048K FFT length (64 cores, 4 workers): Throughput: 602.49 iter/sec.
Timings for 6048K FFT length (64 cores, 8 workers): Throughput: 654.40 iter/sec.
Timings for 6048K FFT length (64 cores, 16 workers): Throughput: 666.12 iter/sec.
Timings for 6048K FFT length (64 cores, 32 workers): Throughput: 693.20 iter/sec.
Timings for 6048K FFT length (64 cores, 64 workers): Throughput: 717.88 iter/sec.
I get ~10% more throughput by running 16 workers compared to 4 workers, but then I don't get to do P1 stage 2. Also, I am not prepared to wait ~4 months to turn around 64 candidates! Plus, the current HDD in the Phi is not very big.
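For anyone who wants to check the percentages, here is a small sketch that compares each worker count against the 4-worker run. The numbers are copied from the benchmark above; the `throughput` dict and `gain_percent` helper are names made up for this example, not anything Prime95 provides.

```python
# Throughput figures copied from the 6048K FFT benchmark above (iter/sec),
# keyed by worker count. All runs used 64 cores.
throughput = {
    1: 218.85,
    2: 432.09,
    4: 602.49,
    8: 654.40,
    16: 666.12,
    32: 693.20,
    64: 717.88,
}


def gain_percent(workers, baseline=4):
    """Percent change in total throughput versus the baseline worker count."""
    return (throughput[workers] / throughput[baseline] - 1.0) * 100.0


for workers in sorted(throughput):
    print(
        f"{workers:2d} workers: {throughput[workers]:7.2f} iter/sec, "
        f"{gain_percent(workers):+6.1f}% vs 4 workers"
    )
```

The 16-worker line comes out a little over +10%, which is where the ~10% figure comes from, and going all the way to 64 workers is worth a bit under +20% over 4 workers.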
Edit: I plan to install a 500 GB disk and run 64 workers. With mprime's nice runtime autotuning I should get 64 candidates done in ~96 days. Presumably I just have to alter some settings, and I can then continue with what I have done so far as well as take on new work.
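A rough way to sanity-check turnaround times like these is to divide the total throughput evenly across the workers and see how long one candidate takes. This is only a sketch: `ASSUMED_ITERATIONS` is a hypothetical iteration count per candidate picked for illustration (it is not taken from my worktodo), and the rates are the benchmark figures above, so the result will not match what autotuned mprime actually achieves.

```python
SECONDS_PER_DAY = 86_400

# Hypothetical number of iterations needed per candidate (an assumption
# for illustration only; the real value depends on the exponent).
ASSUMED_ITERATIONS = 110_000_000


def days_per_batch(total_iter_per_sec, workers, iterations_per_candidate):
    """Days until every worker finishes one candidate, assuming the total
    throughput is shared evenly and stays at the benchmark rate."""
    per_worker_rate = total_iter_per_sec / workers
    return iterations_per_candidate / per_worker_rate / SECONDS_PER_DAY


# 64 workers at the benchmarked 717.88 iter/sec total.
days_64 = days_per_batch(717.88, 64, ASSUMED_ITERATIONS)
# 4 workers at 602.49 iter/sec total, assuming all 4 run tests.
days_4 = days_per_batch(602.49, 4, ASSUMED_ITERATIONS)

print(f"64 workers: about {days_64:.1f} days for a batch of 64 candidates")
print(f" 4 workers: about {days_4:.1f} days for a batch of 4 candidates")
```

With that assumed iteration count, 64 workers at the plain benchmark rate land a bit under four months per batch, so any gain from autotuning shortens it from there. The trade-off is latency: more workers finish more candidates overall, but each individual result takes far longer to come back.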