on my i7, after kill -STOPping a large linear algebra job (-STOP/-CONT make life very easy)
siqs(281396163585532137380297959872159569353696836686080935550459706878100362721) (using taskset to ensure that I'm using separate cores) ./yafu-linux64-32k -threads 1 QS elapsed time = 102.4299 seconds. -threads 2 QS elapsed time = 59.6516 seconds = 1.71x -threads 3 QS elapsed time = 39.9434 seconds = 2.56x -threads 4 QS elapsed time = 30.5106 seconds = 3.35x -threads 8 QS elapsed time = 23.7950 seconds = 4.30x ./yafu-linux64-64k -threads 1 QS elapsed time = 109.8800 seconds. -threads 2 QS elapsed time = 55.7896 seconds = 1.97x -threads 3 QS elapsed time = 37.7891 seconds = 2.91x -threads 4 QS elapsed time = 30.3107 seconds = 3.62x -threads 8 QS elapsed time = 24.1601 seconds = 4.55x |
[quote=fivemack;190677]on my i7, siqs(281396163585532137380297959872159569353696836686080935550459706878100362721)
(using taskset to ensure that I'm using separate cores) ...[/quote] Hm, interesting. Looks like the 32k executable is really only faster for single-threaded work on the i7.
Here's another data point on i7 3.6GHz running 64-bit win7 with HT disabled, using the same siqs(281396163585532137380297959872159569353696836686080935550459706878100362721):
cpu% readings are rough estimates from taskmgr and timings are from yafu's reports. No significant background processes (well, playing music, but that's kinda insignificant nowadays :toot:).

[code]
yafu-win64-32k:
1: 85.4s
2: 47.5s, 1.80x
3: 34.6s, 2.47x
4: 29.5s, 2.89x

yafu-win64-64k:
1: ~24% cpu,    89.8s
2: ~45% cpu,    49.3s, 1.82x
3: ~65-70% cpu, 35.1s, 2.56x
4: ~80-85% cpu, 29.3s, 3.06x
[/code]
@fivemack:
Thanks! I didn't think any of your CPUs would be idle (and suspect it probably wasn't ;)). Looks like 4-thread performance does increase vs. the Q9550, and hyperthreading also helps quite a bit more than I thought it would (anything more than no gain is a bonus).

@mini-geek: I'm surprised at the difference in scalability with sieve block size... can't readily explain that one.

[edit] Happily surprised, I'll add. I'm feeling pretty good about 3.62x (or even 3.35x) out of 4. Also, to fairly judge threading efficiency, I guess we should subtract out the single-threaded post-processing time. If I do that using fivemack's numbers, I get 3.71x and 4.02x respectively for the 32k and 64k versions using 4 threads (assuming post-processing takes about the same amount of time as it does on a similar machine of mine: 4 seconds)!

[further edit] Thanks Mikael! Windows threading not as efficient as pthreads? Compiler differences? sigh...
I was deliberately listing the 'QS elapsed time' rather than 'SIQS elapsed time', since I thought the post-processing was serial and would confuse things, so you may have compensated for the same thing twice.
I wonder whether the i7 L2 is fast enough that a 128k or 256k sieve block size might make sense; I think I've done the experiment with msieve and it didn't help.
[quote=fivemack;190687]I was deliberately listing the 'QS elapsed time' rather than 'SIQS elapsed time', since I thought the post-processing was serial and would confuse things, so you may have compensated for the same thing twice.
I wonder whether the i7 L2 is fast enough that a 128k or 256k sieve block size might make sense; I think I've done the experiment with msieve and it didn't help.[/quote] Ahh... that makes sense. A perfect 4x-for-4 scaling was too good to be true. The 64k blocksize definitely hurts in single-threaded mode... I expect that larger sizes would continue to hurt, but maybe not. I'll have to think more on why it seems to scale better...
Another data point on an 8-CPU (dual quad) machine (Linux binaries)...
[code]
greedo  Intel(R) Xeon(R) CPU X5460 @ 3.16GHz

                  32k                            64k
threads  qs time  scaling  % efficiency  qs time  scaling  % efficiency
1         99.03    1.000      100.0      108.45    1.000      100.0
2         52.40    1.890       94.5       56.34    1.925       96.2
3         36.59    2.706       90.2       38.37    2.826       94.2
4         27.57    3.592       89.8       29.09    3.728       93.2
5         23.41    4.230       84.6       24.88    4.359       87.2
6         20.30    4.878       81.3       21.21    5.113       85.2
7         17.85    5.548       79.3       18.63    5.821       83.2
8         16.35    6.057       75.7       16.48    6.581       82.3
[/code]

Also, I remember now that 64k is the maximum possible block size in yafu. My bucket sieve requires that in order to use 16-bit addressing of sieve hits.
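For anyone wanting to reproduce the derived columns: the scaling figure is just the single-threaded time divided by the n-threaded time, and % efficiency is that speedup divided by the thread count. A quick Python sketch, using the 32k-blocksize times from the table above:

```python
# Compute scaling (speedup) and parallel efficiency from wall-clock times.
# Times are the 32k "qs time" values from the X5460 table above.
times = {1: 99.03, 2: 52.40, 3: 36.59, 4: 27.57,
         5: 23.41, 6: 20.30, 7: 17.85, 8: 16.35}

t1 = times[1]
for n, t in sorted(times.items()):
    scaling = t1 / t                  # speedup vs. single-threaded run
    efficiency = 100.0 * scaling / n  # percent of ideal linear scaling
    print(f"{n} threads: {scaling:.3f}x, {efficiency:.1f}% efficient")
```

For example, 2 threads gives 99.03/52.40 = 1.890x, i.e. 94.5% of ideal.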
[QUOTE=fivemack;190687]I was deliberately listing the 'QS elapsed time' rather than 'SIQS elapsed time', since I thought the post-processing was serial and would confuse things[/quote]
Oh, right, I wasn't. Fixing up the results by removing ~3.3s post-proc time yields <1.9x,2.7x,3.3x> instead for <2,3,4> threads of the 64k version. Pretty spiffy regardless. Good job Ben! |
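The correction above amounts to subtracting the serial post-processing time from every measurement before taking ratios (Amdahl's law in miniature). A small sketch, assuming the ~3.3 s post-processing estimate and the yafu-win64-64k times reported earlier:

```python
# Corrected speedup: remove the serial post-processing time from each
# measurement, then compare only the parallel (sieving) portions.
# 3.3 s is the estimated single-threaded post-processing time.
post = 3.3
times = {1: 89.8, 2: 49.3, 3: 35.1, 4: 29.3}  # yafu-win64-64k, n threads

sieve_t1 = times[1] - post
for n in (2, 3, 4):
    speedup = sieve_t1 / (times[n] - post)
    print(f"{n} threads: {speedup:.1f}x")  # 1.9x, 2.7x, 3.3x
```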
Reducing the blocksize gives even better scaling and performance. (NOTE: if you try to re-compile the 1.11 code to test this yourself, you'll get a segfault due to an obscure problem which I've now fixed, but which is still broken in the released 1.11 code.)
[CODE]
Blocksize = 16k
threads  qs time  scaling  % efficiency
1         99.36    1.000      100.0
2         50.52    1.967       98.3
3         34.17    2.908       96.9
4         26.22    3.789       94.7
5         21.05    4.720       94.4
6         17.90    5.551       92.5
7         15.47    6.423       91.8
8         14.21    6.992       87.4
[/CODE]
[QUOTE=bsquared;190676]Thanks Jeff! That would be interesting to see if it is a memory bandwidth thing or just thread overhead. Anyone with an idle i7 that wants to see?[/QUOTE]
It need not be a memory bandwidth issue. P95 also shows a similar issue when going from 3 to 4 cores. It could be due to thread synchronization, and the fact that the OS needs to preempt one of the threads every now and then, leaving the rest of the threads stalled -- is that a true statement about this code?
[quote=axn;190699]It need not be a memory bandwidth issue. P95 also shows similar issue when going from 3 to 4 cores. It could be due to thread synchronization, and the fact that OS needs to preempt one of the threads every now and then, leading to rest of the threads stalling -- is that a true statement about this code?[/quote]
I know very little about P95's code, particularly how it implements threading, but I don't think this code is comparable. The threads are *very* loosely coupled. A typical factorization needs a few hundred spawn/join/merge phases, but otherwise each thread proceeds entirely unsynchronized, unaware of the other threads -- unless the OS consistently picks on one thread to stall, such that the merge process is delayed.

[edit] On second thought, in every spawn/join/merge phase the other threads are always waiting on the slowest thread, so maybe the OS fiddles with them enough to cause significant stalling.

On third thought, each thread is identical, and all have access to identical resources, except the last one. That one presumably runs on the CPU which is also handling all the user activity and background processes, so its L1 cache is necessarily going to be more heavily utilized by other stuff. Since yafu's siqs is tightly cache-bound, this last thread will no doubt be slower and thus cause more significant stalling.