mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   YAFU (https://www.mersenneforum.org/forumdisplay.php?f=96)
-   -   Yafu (https://www.mersenneforum.org/showthread.php?t=10871)

fivemack 2009-09-22 15:22

On my i7, after kill -STOPping a large linear algebra job (kill -STOP/-CONT makes life very easy):

siqs(281396163585532137380297959872159569353696836686080935550459706878100362721)

(using taskset to ensure that I'm using separate cores)

./yafu-linux64-32k
-threads 1 QS elapsed time = 102.4299 seconds.
-threads 2 QS elapsed time = 59.6516 seconds = 1.71x
-threads 3 QS elapsed time = 39.9434 seconds = 2.56x
-threads 4 QS elapsed time = 30.5106 seconds = 3.35x
-threads 8 QS elapsed time = 23.7950 seconds = 4.30x

./yafu-linux64-64k
-threads 1 QS elapsed time = 109.8800 seconds.
-threads 2 QS elapsed time = 55.7896 seconds = 1.97x
-threads 3 QS elapsed time = 37.7891 seconds = 2.91x
-threads 4 QS elapsed time = 30.3107 seconds = 3.62x
-threads 8 QS elapsed time = 24.1601 seconds = 4.55x

Mini-Geek 2009-09-22 15:41

[quote=fivemack;190677]on my i7, siqs(281396163585532137380297959872159569353696836686080935550459706878100362721)

(using taskset to ensure that I'm using separate cores)

...[/quote]
Hm, interesting. Looks like the 32k executable is really only faster for single-threaded work on the i7.

mklasson 2009-09-22 15:49

Here's another data point on i7 3.6GHz running 64-bit win7 with HT disabled, using the same siqs(281396163585532137380297959872159569353696836686080935550459706878100362721):

CPU% readings are rough estimates from Task Manager, and timings are from yafu's reports. No significant background processes (well, playing music, but that's kinda insignificant nowadays).

yafu-win64-32k:
1: 85.4s
2: 47.5s, 1.80x
3: 34.6s, 2.47x
4: 29.5s, 2.89x

yafu-win64-64k:
1: ~24% cpu, 89.8s
2: ~45% cpu, 49.3s, 1.82x
3: ~65-70% cpu, 35.1s, 2.56x
4: ~80-85% cpu, 29.3s, 3.06x

bsquared 2009-09-22 15:52

@fivemack:
Thanks! I didn't think any of your CPUs would be idle (and I suspect this one wasn't ;)). Looks like 4-thread performance does increase vs. the Q9550, and hyperthreading also helps quite a bit more than I thought it would (anything more than no gain is a bonus).

@mini-geek
I'm surprised at the difference in scalability with sieve block size... can't readily explain that one.

[edit]
Happily surprised, I'll add. I'm feeling pretty good about 3.62x (or even 3.35x) out of 4. Also, to fairly consider threading efficiency, I guess we should be subtracting out the single threaded post-processing time. If I do that, using fivemack's numbers, I get 3.71x and 4.02x respectively for the 32k and 64k versions using 4 threads (assuming post-processing takes about the same amount of time as it does on a similar machine of mine: 4 seconds)!
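The adjustment described here is just Amdahl-style accounting: subtract the serial post-processing time from both the 1-thread and n-thread wall times before taking the ratio. A quick sketch (the 4-second post-processing figure is an estimate from a similar machine, not a measurement on fivemack's i7):

```python
def sieve_speedup(t1, tn, serial=4.0):
    """Speedup of the parallel (sieving) portion only, assuming
    `serial` seconds of single-threaded post-processing are
    included in both the 1-thread and n-thread timings."""
    return (t1 - serial) / (tn - serial)

# fivemack's 4-thread numbers from above:
print(round(sieve_speedup(102.4299, 30.5106), 2))  # 32k build -> 3.71
print(round(sieve_speedup(109.8800, 30.3107), 2))  # 64k build -> 4.02
```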

[further edit]
Thanks Mikael! Windows threading not as efficient as pthreads? Compiler differences? sigh...

fivemack 2009-09-22 16:37

I was deliberately listing the 'QS elapsed time' rather than 'SIQS elapsed time', since I thought the post-processing was serial and would confuse things, so you may have compensated for the same thing twice.

I wonder whether the i7 L2 is fast enough that a 128k or 256k sieve block size might make sense; I think I've done the experiment with msieve and it didn't help.

bsquared 2009-09-22 16:47

[quote=fivemack;190687]I was deliberately listing the 'QS elapsed time' rather than 'SIQS elapsed time', since I thought the post-processing was serial and would confuse things, so you may have compensated for the same thing twice.

I wonder whether the i7 L2 is fast enough that a 128k or 256k sieve block size might make sense; I think I've done the experiment with msieve and it didn't help.[/quote]

Ahh... that makes sense. A perfect 4x for 4 scaling was too good to be true.

The 64k blocksize definitely hurts in single-threaded mode... I expect that larger sizes would continue to hurt, but maybe not. I'll have to think more on why it seems to scale better...

bsquared 2009-09-22 16:59

Another data point on a 8 cpu (dual quad) machine (linux binaries)...

[code]
greedo: Intel(R) Xeon(R) CPU X5460 @ 3.16GHz

                    32k                           64k
threads   qs time  scaling  % eff     qs time  scaling  % eff
1          99.03    1.00    100.0      108.45   1.00    100.0
2          52.40    1.89     94.5       56.34   1.92     96.2
3          36.59    2.71     90.2       38.37   2.83     94.2
4          27.57    3.59     89.8       29.09   3.73     93.2
5          23.41    4.23     84.6       24.88   4.36     87.2
6          20.30    4.88     81.3       21.21   5.11     85.2
7          17.85    5.55     79.3       18.63   5.82     83.2
8          16.35    6.06     75.7       16.48   6.58     82.3
[/code]

Also, I remember now that 64k is the maximum possible block size in yafu. My bucket sieve requires that in order to use 16-bit addressing of sieve hits.
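The 16-bit constraint is easy to see: if bucket entries store the sieve offset in an unsigned 16-bit field, offsets can only range over 0..65535, i.e. one 64 KiB block. A toy illustration of the addressing idea (not yafu's actual data layout):

```python
from array import array

BLOCKSIZE = 1 << 16  # 64k: the largest block a 16-bit offset can address

def bucket_hits(primes_with_roots, blocksize=BLOCKSIZE):
    """Collect sieve hits for one block as 16-bit offsets.
    Each (p, r) pair marks positions r, r+p, r+2p, ... in the block."""
    assert blocksize <= 1 << 16, "offsets would overflow 16 bits"
    hits = array('H')  # 'H' = unsigned 16-bit, halving memory traffic
    for p, r in primes_with_roots:
        for off in range(r, blocksize, p):
            hits.append(off)
    return hits

# every stored offset fits in 16 bits by construction
hits = bucket_hits([(97, 5), (101, 13)])
assert max(hits) < BLOCKSIZE
```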

mklasson 2009-09-22 17:14

[QUOTE=fivemack;190687]I was deliberately listing the 'QS elapsed time' rather than 'SIQS elapsed time', since I thought the post-processing was serial and would confuse things[/quote]

Oh, right, I wasn't. Fixing up the results by removing ~3.3s post-proc time yields <1.9x,2.7x,3.3x> instead for <2,3,4> threads of the 64k version.

Pretty spiffy regardless. Good job Ben!

bsquared 2009-09-22 17:54

Reducing the blocksize gives even better scaling and performance. (NOTE: if you try to recompile to test this yourself, you'll hit a segfault from an obscure bug that is still present in the 1.11 code; I've since fixed it.)

[CODE]
Blocksize = 16k
threads   qs time   scaling   % efficiency
1          99.36     1.00       100.0
2          50.52     1.97        98.3
3          34.17     2.91        96.9
4          26.22     3.79        94.7
5          21.05     4.72        94.4
6          17.90     5.55        92.5
7          15.47     6.42        91.8
8          14.21     6.99        87.4
[/CODE]

axn 2009-09-22 17:58

[QUOTE=bsquared;190676]Thanks Jeff! That would be interesting to see if it is a memory bandwidth thing or just thread overhead. Anyone with an idle i7 that wants to see?[/QUOTE]

It need not be a memory bandwidth issue. P95 also shows a similar drop when going from 3 to 4 cores. It could be due to thread synchronization, plus the fact that the OS needs to preempt one of the threads every now and then, leaving the rest of the threads stalled -- is that a true statement about this code?

bsquared 2009-09-22 18:18

[quote=axn;190699]It need not be a memory bandwidth issue. P95 also shows similar issue when going from 3 to 4 cores. It could be due to thread synchronization, and the fact that OS needs to preempt one of the threads every now and then, leading to rest of the threads stalling -- is that a true statement about this code?[/quote]

I know very little about P95's code, particularly how it implements threading, but I don't think this code is comparable. The threads are *very* loosely coupled. A typical factorization needs a few hundred spawn/join/merge phases, but otherwise each thread proceeds entirely unsynchronized and unaware of the other threads -- unless the OS is consistently picking on one thread to stall, such that the merge process is delayed.

[edit]
On second thought, on every spawn/join/merge phase the other threads are always waiting on the slowest thread, so maybe the OS fiddles with them enough to cause significant stalling.

On third thought, each thread is identical, and all have access to identical resources, except the last one. This one presumably is running on the cpu which is also handling all the user activity and background processes, and so the L1 cache of that one is necessarily going to be more heavily utilized by other stuff. Since yafu's siqs is tightly cache bound, this last thread will no doubt be slower and thus cause more significant stalling.
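The "waiting on the slowest thread" effect above can be made quantitative: with a join at the end of every phase, phase time is the max over all threads, so one thread slowed by some factor drags the whole run down. A rough model with hypothetical numbers, just to illustrate the barrier effect:

```python
def speedup_with_slow_thread(n, slowdown):
    """Ideal n-thread speedup when every phase ends in a join and
    one thread runs `slowdown`x slower than the others.
    Work per thread is 1/n; phase time = max over all threads."""
    phase_time = (1.0 / n) * slowdown  # slowest thread dominates
    return 1.0 / phase_time

print(speedup_with_slow_thread(4, 1.0))  # no interference: 4.0x
print(speedup_with_slow_thread(4, 1.2))  # one thread 20% slower: ~3.33x
```

So a single thread losing 20% of its speed to cache pollution from background activity costs the whole 4-thread run roughly that same 20%, which is consistent with the observed ~3.3x scaling.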

