mersenneforum.org

carpetpool 2020-10-03 21:16

LLR Affinity Problem
 
1 Attachment(s)
I know there's a way to run several LLR instances, each assigned to its own designated CPU, so that testing runs significantly faster than with only one instance.

I am using a [URL="https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html"]4 core, 8 thread[/URL] CPU. In the attachment I sent, one instance of LLR is running with only one thread, and time per bit is 0.576 ms. The CPU affinity is set to 0.

After terminating the program, I copy the LLR executable to another directory and run a test on a number of similar size to the first run (one thread). The CPU affinity is set to 1.

When I check on the first run, I notice the time per bit has increased to 1.172 ms, about twice that of running only one LLR application! No speedup whatsoever.

My goal is to run 4 instances of LLR, each with about the same time per bit as a single single-threaded instance (4 instances each running at close to 0.576 ms per bit, so that testing is 4x faster). Does anyone know what I am doing wrong here?

I am aware that running a single instance with 8 threads is less productive than running 4 single-threaded instances, but for some reason I never figured out how to achieve the latter.

Thanks for help!

paulunderwood 2020-10-03 21:38

Running only one instance has all the cache to itself and will run quicker than running two instances, where there will be contention for cache. On a 4c/8t box I run one instance with the -t4 option. I think this approach is cache friendlier.

VBCurtis 2020-10-03 22:06

In Windows, are cores 0 and 1 hyperthreads of the same physical core? That would explain your timing exactly doubling.

What happens when you assign the second LLR copy to core 2 rather than 1?

Have you tried not assigning affinity? I've had decent luck just letting Windows utilize the cores. Manually assigning affinity does help sometimes, but for this use case I'm not sure it matters for you.
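
A minimal sketch of the experiment suggested above, in Python: pick one logical CPU per physical core and build the matching Windows `start /affinity` command lines, which take a hexadecimal bitmask. It assumes hyperthread siblings are numbered adjacently (0 and 1 on the first core, 2 and 3 on the next), which is only a guess about this machine. The directory and executable names are hypothetical, and nothing is actually launched.

```python
def one_cpu_per_core(physical_cores, threads_per_core=2):
    """Return the first logical CPU of each physical core.

    Assumption: siblings are numbered adjacently (0,1 share a core; 2,3 the
    next; ...). Verify this on the real machine before relying on it.
    """
    return [core * threads_per_core for core in range(physical_cores)]


def affinity_mask(cpu):
    """Hex bitmask with only bit `cpu` set, the form `start /affinity` expects."""
    return format(1 << cpu, "X")


def launch_commands(exe_paths, cpus):
    """Build one cmd.exe line per LLR copy; the lines are returned, not run."""
    return [
        f'start "" /affinity {affinity_mask(cpu)} "{exe}"'
        for exe, cpu in zip(exe_paths, cpus)
    ]


if __name__ == "__main__":
    cpus = one_cpu_per_core(physical_cores=4)
    # Hypothetical layout: one copy of the executable per directory.
    exes = [rf"C:\llr{i}\llr.exe" for i in range(1, len(cpus) + 1)]
    for line in launch_commands(exes, cpus):
        print(line)
```

If the timings still double with the copies on CPUs 0 and 2, the adjacent-sibling assumption is probably wrong for this system, or something other than hyperthreading is the bottleneck.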

carpetpool 2020-10-04 22:39

Thanks for the suggestions! I ran 2 instances of LLR at the same time, assigning affinity to CPUs 0 and 2.

The time per bit increased by about 0.120 ms, which I guess makes sense given that more active cores means a slower clock speed.

I loaded up 4 instances running on CPUs 0, 2, 4, 6 and the time per bit almost doubled (a 0.380 ms increase).

I think Paul is right: running four threads on one instance seems to be faster than running 4 single-threaded instances.

I would think that with a larger number of cores, say 12 or 16, the latter might become slower?
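
For comparing these runs, combined throughput across all instances is more telling than time per bit alone. The sketch below works it out from the figures quoted in this thread, under three assumptions: 1.172 ms is the resulting time per bit with both copies on CPUs 0 and 1, the "CPUs 0, 2" run used two instances, and the 0.120 ms and 0.380 ms increases are added to the 0.576 ms baseline. The function and variable names are illustrative only.

```python
BASELINE_MS = 0.576  # one single-threaded instance, from the first post


def aggregate_bits_per_ms(instances, ms_per_bit):
    """Bits processed per millisecond, summed over all concurrent instances."""
    return instances / ms_per_bit


# (label, instance count, assumed ms per bit for each instance)
RUNS = [
    ("1 instance, CPU 0", 1, BASELINE_MS),
    ("2 instances, CPUs 0 and 1", 2, 1.172),
    ("2 instances, CPUs 0 and 2", 2, BASELINE_MS + 0.120),
    ("4 instances, CPUs 0, 2, 4, 6", 4, BASELINE_MS + 0.380),
]


def speedups(runs):
    """Throughput of each run relative to the single-instance baseline."""
    base = aggregate_bits_per_ms(1, BASELINE_MS)
    return {label: aggregate_bits_per_ms(n, ms) / base for label, n, ms in runs}


if __name__ == "__main__":
    for label, ratio in speedups(RUNS).items():
        print(f"{label}: {ratio:.2f}x single-instance throughput")
```

Under those assumptions, the four pinned instances together come out at roughly 2.4 times the single-instance throughput, and the CPUs 0 and 1 pairing at just under 1 times. The figures say nothing about a single -t4 instance, which was not timed in this thread.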

paulunderwood 2020-10-04 23:30

I don't know about 12 core chips running LLR, but generally it makes sense to run 1 instance per chip or chiplet.

VBCurtis 2020-10-05 05:32

[QUOTE=paulunderwood;558896]I don't know about 12 core chips running LLR, but generally it makes sense to run 1 instance per chip or chiplet.[/QUOTE]

My experience, mostly on Haswell-era desktops, is that LLR doesn't benefit much from splitting small FFTs onto multiple threads. 128K per thread seems to be a good cutoff, so for OP's example 192K FFT, I doubt running two 2-threaded instances would be faster than four 1-threaded.

Once FFT reaches 256K, 2-threaded runs work pretty well.

OP- I've run LLR on this size of number on prebuilt machines with slow 2-channel memory, and running 3 instances was just about as fast as 4 but generated quite a bit less heat. That is, 3 is enough to saturate the memory on some quad-core machines. It takes some experimenting with threads-per-process and number of processes to find the sweet spot!
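
The rule of thumb and the search space described above can be sketched as follows. The 128K-per-thread cutoff is the Haswell-era observation from this post; turning it into a floor division is a simplification made here, not anything LLR itself does. The second function just lists the instance and thread combinations that fit on a given core count, as candidates to time by hand.

```python
def threads_per_instance(fft_size_k, k_per_thread=128):
    """Rule-of-thumb thread count: one thread per `k_per_thread` K of FFT length.

    Assumption: floor division of the FFT size (in K) by the cutoff, with a
    minimum of one thread.
    """
    return max(1, fft_size_k // k_per_thread)


def candidate_configs(cores):
    """All (instances, threads_per_instance) pairs that use at most `cores` threads."""
    return [
        (instances, threads)
        for instances in range(1, cores + 1)
        for threads in range(1, cores + 1)
        if instances * threads <= cores
    ]


if __name__ == "__main__":
    for fft_k in (192, 256):
        print(f"{fft_k}K FFT: about {threads_per_instance(fft_k)} thread(s) per instance")
    for instances, threads in candidate_configs(4):
        print(f"{instances} instance(s) x {threads} thread(s)")
```

On a quad-core this includes the 3 x 1 layout mentioned above as well as 4 x 1 and 1 x 4. Only timing each layout on the actual machine shows which one saturates memory first.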

henryzz 2020-10-05 08:27

[QUOTE=VBCurtis;558917]OP- I've run LLR on this size of number on prebuilt machines with slow 2-channel memory, and running 3 instances was just about as fast as 4 but generated quite a bit less heat. That is, 3 is enough to saturate the memory on some quad-core machines. It takes some experimenting with threads-per-process and number of processes to find the sweet spot![/QUOTE]

Better still might be to reduce the cpu speed to match the memory throughput and still use all cores. Generally lower speeds need less power/cycle. Experimentation might be needed there.


All times are UTC.