mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

MrRepunit 2012-01-18 12:07

Factoring Repunits
 
1 Attachment(s)
Hi Oliver,

I tried to follow the source code of mfaktc and found the important places that need changing to make it work with repunits. But as I don't fully understand all of the source code, I don't know whether these places are enough.
In general:
The factors themselves have the form f=2*k*p+1. What is different compared to factors of Mersenne numbers is that f%8 can also be 3 or 5 (in addition to 1 and 7). So in mfaktc.c lines 213 and 241 this must be changed. One also has to make sure that <base>-1 is not a factor of f itself if these candidate factors are not checked for primality.
Then the divisibility checks in tf_*.cu must be done with <base>^p % f (e.g. tf_71bit.cu lines 455-481).

I implemented these changes (for normal base 10 repunits as this is my priority right now) in factor5 so that you can compare, just make a diff between the original and the repunit version.

I hope this helps you avoid investing too much time in thinking about the math and the implementation.

Cheers,
Danilo

MrRepunit 2012-01-19 17:03

1 Attachment(s)
I found out that there are only 16 remainders mod 120 (not 32 as I assumed at first). This made the changes compared to factor5 even easier. Now it is a factor of 2 faster.
In case someone is interested I uploaded the new source file.

ET_ 2012-01-23 14:50

[QUOTE=MrRepunit;286707]I found out that there are only 16 remainders mod 120 (not 32 as I assumed at first). This made the changes compared to factor5 even easier. Now it is a factor of 2 faster.
In case someone is interested I uploaded the new source file.[/QUOTE]

Gee, I KNOW that source code... :smile:

Luigi

MrRepunit 2012-01-23 15:23

[QUOTE=ET_;287035]Gee, I KNOW that source code... :smile:

Luigi[/QUOTE]
I know, I really hope that you don't mind that I messed it up a bit :-).
But it already helped to find a factor of a double repunit (base 10):
657253333333333333267609 divides R(R(19))

ET_ 2012-01-25 12:26

[QUOTE=MrRepunit;287038]I know, I really hope that you don't mind that I messed it up a bit :-).
But it already helped to find a factor of a double repunit (base 10):
657253333333333333267609 divides R(R(19))[/QUOTE]

:et_:

axn 2012-01-25 12:56

[QUOTE=MrRepunit;286707]I found out that there are only 16 remainders mod 120 (not 32 as I assumed at first). This made the changes compared to factor5 even easier. Now it is a factor of 2 faster.
In case someone is interested I uploaded the new source file.[/QUOTE]

Hmmm... By my calculation, it should be 32 only (4 mod 8, 2 mod 3, 4 mod 5 = 32 mod 120). How did you arrive at the figure 16?

MrRepunit 2012-01-25 14:40

[QUOTE=axn;287193]Hmmm... By my calculation, it should be 32 only (4 mod 8, 2 mod 3, 4 mod 5 = 32 mod 120). How did you arrive at the figure 16?[/QUOTE]
What I did: I checked lots of known factors (see [URL="http://www.repunit.org/"]http://www.repunit.org[/URL]) for their remainders modulo 120, and I found just 16. I also rechecked with more factors. These factors were found with a normal sieve, without taking the special form of repunit factors into account.
I am also no expert in number theory, I am just a theoretical physicist, or in other words an experimental mathematician :smile:.

axn 2012-01-25 16:07

[QUOTE=MrRepunit;287199] or in other words an experimental mathematician :smile:.[/QUOTE]

:smile: That's all well and good. And probably you're observing a real phenomenon. But I'd feel much more comfortable if this could be rigorously proven.

Batalov 2012-01-26 22:29

Apparently mod 5 and mod 8 are coupled.
There are 4 possibilities mod 5 and 4 possibilities mod 8, but only 8 possibilities mod 40: +/-1, +/-3, +/-9, +/-13.

I used known factors from [URL="http://homepage2.nifty.com/m_kamada/math/Phin10.txt.bz2"]Phin10.txt[/URL] ... (I am y.a. experimental mathematician.)

firejuggler 2012-01-28 07:58

CUDA 4.1 released: [URL]http://www.tomshardware.com/news/nvidia-cuda-gpu-developer-llvm,14579.html[/URL]

TheJudger 2012-01-28 17:49

firejuggler: yepp, I know (I receive automatic notices about new CUDA releases). You can expect new executables of mfaktc 0.18 using CUDA 4.1 soon.

MrRepunit: sorry, I haven't looked at your stuff yet. :sad:

Oliver

Dubslow 2012-01-29 05:28

[QUOTE=Dubslow;286255]Here's the issues I've had over the last 4 months or so with drivers.
[url]http://forums.nvidia.com/index.php?showtopic=220802[/url][/QUOTE]
AHAHAHHAHAHAHAHAHAHAHAA!!!


I finally got the damn drivers to install.
[code]CUDA version info
binary compiled for CUDA 4.0
CUDA runtime version 4.10
CUDA driver version 4.10[/code]

Please take your time TheJudger, I've been without my GPU for a few weeks due to driver issues, I can wait a few more days :smile::smile:

TheJudger 2012-01-29 13:38

mfaktc 0.18 - CUDA 4.1
 
Hello!

[url]http://www.mersenneforum.org/mfaktc/mfaktc-0.18.win.cuda41.zip[/url]
[url]http://www.mersenneforum.org/mfaktc/mfaktc-0.18.linux64.cuda41.tar.gz[/url]

These executables are compiled with CUDA 4.1. The source code is exactly the same as before, so there is no need to repost it. This version will use checkpoints from mfaktc 0.18 (CUDA 4.0)! CUDA 4.1 needs driver version 285 or newer.

If you're using a CC 1.x GPU then there is no need to update. [B]If you're using a GPU with CC 2.0 this update is recommended[/B] (the GPU code just runs a little bit faster!). And those with CC 2.1 can try it, too, but I expect that the performance difference is barely noticeable.

Oliver

kladner 2012-01-29 18:42

-st2 completes without errors on a GTX 460 @ 875MHz, Win7-64 driver 290.53. I run the test on principle when there are any changes.

EDIT: Thanks for the updated version, Oliver.

oswald 2012-01-29 20:09

Most excellent! I have a GTX570 and it was doing about 500M/s. Now it is hitting 540M/s.

Thanks,
Roy

James Heinrich 2012-01-29 21:53

[QUOTE=oswald;287676]Most excellent! I have a GTX570 and it was doing about 500M/s. Now it is hitting 540M/s.[/QUOTE]Just curious: with what SievePrimes and across how many instances? I ask because my GTX570 is doing around 400M/s at SP=5000, 2 instances, 71-72.

oswald 2012-01-30 00:39

5000 SievePrimes, six instances, from 68-78. i7-920 running at 2.67 GHz.
Each instance eating 12% of CPU with 1% to 2% wait time.
NumStreams=8 and CPUStreams=3. Affinity is not set.
batch - cmd.exe /c "start "mfaktc 5" /low mfaktc-win-64.exe -v 2"
Windows 7 Ult.

If I use the computer and run prime95 with one worker window, it drops to about 480M/s to 500M/s.

GPU Load is 99%, Fan 86%, Temp 86C to 88C. Voltage 1.075V, Clock 911 MHz, Memory 2106 MHz and Shader 1822 MHz.

James Heinrich 2012-01-30 02:09

[QUOTE=oswald;287697]5000 SievePrimes, six instances, from 68-78. I7/920 running at 2.67 Ghz.[/QUOTE]How does your performance fare with 4 instances rather than 6? I suspect it wouldn't be much different, since you only have 4 cores to work with.

Also, on the topic that was raised before... any special reason you're taking these exponents abnormally high (to 2^78)?

oswald 2012-01-30 02:38

Six seems the fastest for me. Four and Five are a little slower, maybe 4% or 5%. Seven about the same, but more cpu time is wasted and Eight is slower with 100% cpu used.

I'm going to drop back to 72 when I'm done with the current 77 and a few 74s. I just wanted a couple to see if the program would process the larger bits faster or slower. I didn't see any difference.

Also I thought it would be cool. It wasn't.

flashjh 2012-01-30 03:03

[QUOTE=oswald;287709]Six seems the fastest for me. Four and Five are a little slower, maybe 4% or 5%. Seven about the same, but more cpu time is wasted and Eight is slower with 100% cpu used.

I'm going to drop back to 72 when I'm done with the current 77 and a few 74s. I just wanted a couple to see if the program would process the larger bits faster or slower. I didn't see any difference.

Also I thought it would be cool. It wasn't.[/QUOTE]

So, did you find any factors above 72?

oswald 2012-01-30 03:58

[QUOTE=flashjh;287710]So, did you find any factors above 72?[/QUOTE]

45385591,73

Just one. So it would seem that 72 is the sweet spot.

I've seen some TFs go by to 81. Anyone get any factors above 73?

chalsall 2012-01-30 04:13

[QUOTE=oswald;287713]I've seen some TFs go by to 81. Anyone get any factors above 73?[/QUOTE]

Speaking only from the [URL="http://www.gpu72.com/reports/factoring_cost/"]GPU72 dataset[/URL], no. Also, I thought this query might be of interest:

[CODE]mysql> select Exponent,WorkType,Factor,BitLevel,GHzDays from Facts where BitLevel>72 and BitLevel<73 order by BitLevel;
+----------+----------+------------------------+---------------+---------------+
| Exponent | WorkType | Factor | BitLevel | GHzDays |
+----------+----------+------------------------+---------------+---------------+
| 49038349 | 2 | 4801848565079148256831 | 72.0240783691 | 3.1368792057 |
| 48402157 | 2 | 4924923229320807180241 | 72.0605926514 | 3.0787878036 |
| 52391939 | 2 | 5129368153807159288753 | 72.1192703247 | 3.6857914925 |
| 51856367 | 2 | 5260403915947198114753 | 72.1556625366 | 3.6537399292 |
| 50645029 | 2 | 5661793507001515858399 | 72.2617492676 | 3.4019815922 |
| 50113633 | 2 | 5686252493318912556919 | 72.2679672241 | 3.3713331223 |
| 46088459 | 2 | 5699536457216611369993 | 72.2713317871 | 2.9335625172 |
| 57124637 | 2 | 5699667324436008230441 | 72.2713699341 | 4.4322924614 |
| 58727143 | 2 | 6032238348257590709041 | 72.3531799316 | 4.7944402695 |
| 49172281 | 2 | 6118314734795051557889 | 72.3736267090 | 3.3100326061 |
| 47377471 | 2 | 6214261243217069278799 | 72.3960723877 | 3.0206968784 |
| 51279133 | 1 | 6246983672197511925169 | 72.4036483765 | 34.9744110107 |
| 51891943 | 2 | 6274529751550608502601 | 72.4099960327 | 3.6537446976 |
| 52377517 | 2 | 6288072527920072455497 | 72.4131088257 | 3.6857895851 |
| 47634437 | 2 | 6562788825323273521183 | 72.4748001099 | 3.0497453213 |
| 49577639 | 2 | 6933287551485680909593 | 72.5540313721 | 3.3406801224 |
| 51789487 | 2 | 7277402504460056581103 | 72.6239166260 | 3.6537451744 |
| 51836971 | 2 | 7542321801724532859601 | 72.6754989624 | 3.6537442207 |
| 48794531 | 2 | 7632349113139487255599 | 72.6926193237 | 3.1078319550 |
| 49685179 | 2 | 7761122876680074100649 | 72.7167587280 | 3.3406836987 |
| 50415769 | 2 | 7944599695223051338921 | 72.7504653931 | 3.4019765854 |
| 52987387 | 1 | 8081154896226789542663 | 72.7750549316 | 33.8468780518 |
| 51959269 | 2 | 8475377849335575069007 | 72.8437652588 | 3.6537444592 |
| 51901393 | 2 | 8512615206095183396513 | 72.8500900269 | 3.6537411213 |
| 49090681 | 2 | 8605111745487297296833 | 72.8656845093 | 3.1368756294 |
| 45385591 | 1 | 8844916506899498728081 | 72.9053421021 | 40.8332138062 |
| 52352803 | 2 | 9024358165365373833857 | 72.9343109131 | 3.6857903004 |
| 47967737 | 1 | 9076742576732763259807 | 72.9426651001 | 29.9110641479 |
| 51739159 | 2 | 9080527408501786785767 | 72.9432678223 | 3.6216900349 |
| 48055169 | 2 | 9236283448528736895481 | 72.9678039551 | 3.0497450829 |
| 52950509 | 2 | 9247535348527498365463 | 72.9695587158 | 3.7178452015 |
+----------+----------+------------------------+---------------+---------------+
31 rows in set (0.01 sec)[/CODE]

Note that WorkType==1 is TF; ==2 is P-1.

kladner 2012-01-30 04:28

[QUOTE=oswald;287713]45385591,73

Just one. So it would seem that 72 is the sweet spot.

I've seen some TFs go by to 81. Anyone get any factors above 73?[/QUOTE]

Here's one:

[CODE]M72000209 has a factor: 10234577977625536865383
found 1 factor(s) for M72000209 from 2^73 to 2^74[/CODE]

This exponent and the level were requested from outside GPU to 72.

oswald 2012-01-30 05:34

Wow, so much really wonderful information to look at. It's hard for me to remember all the good places to look.
Thank you!

BigBrother 2012-01-30 07:45

List of all nVidia chips and their CC (Compute Capability): [URL="http://developer.nvidia.com/cuda-gpus"]http://developer.nvidia.com/cuda-gpus[/URL]

Batalov 2012-02-02 01:44

They didn't list the "560 Ti 448 core" but it is a slightly disabled 570 (with 2.0); in a couple days, I'll be able to test it.

LaurV 2012-02-02 02:49

[QUOTE=Batalov;288048]They didn't list the "560 Ti 448 core" but it is a slightly disabled 570 (with 2.0); in a couple days, I'll be able to test it.[/QUOTE]
Not really. It is more of an "experimental" or "engineering sample" of a "new 560", where they want to increase the number of cores and reduce the price. They have trouble with increasing the clock and with heat. Nvidia lists it [URL="http://www.geforce.com/Hardware/GPUs/geforce-gtx-560ti/specifications"]here[/URL]. I tested one, standard clock, and for the same money you can grab a 560@950MHz which is really REALLY much MUCH faster (about 28% faster!).

James Heinrich 2012-02-02 03:00

[QUOTE=LaurV;288052]for the same money you can grab a 560@950MHz which is really REALLY much MUCH faster (about 28% faster!).[/QUOTE]For gaming, perhaps, but the GTX 560Ti 448 is very desirable for mfaktc over any other GTX 560 because (like the GTX 570) it is compute 2.0, whereas the regular GTX 560 is v2.1 which may sound better, but is roughly 50% slower at mfaktc.

At stock clocks, the 560Ti448 has marginally higher GFLOPS than the 560Ti (1312 vs 1263), but [url=http://mersenne-aries.sili.net/mfaktc.php?sort=model&noA=1]expected daily throughput with mfaktc[/url] is 262 vs 168 GHz-days/day.

KyleAskine 2012-02-02 14:47

[QUOTE=James Heinrich;288054]For gaming, perhaps, but the GTX 560Ti 448 is very desirable for mfaktc over any other GTX 560 because (like the GTX 570) it is compute 2.0, whereas the regular GTX 560 is v2.1 which may sound better, but is roughly 50% slower at mfaktc.

At stock clocks, the 560Ti448 has marginally higher GFLOPS than the 560Ti (1312 vs 1263), but [url=http://mersenne-aries.sili.net/mfaktc.php?sort=model&noA=1]expected daily throughput with mfaktc[/url] is 262 vs 168 GHz-days/day.[/QUOTE]

<offtopic>
When you sort the GHz-d/d column on your website, it does string compare instead of numerical compare.
</offtopic>

James Heinrich 2012-02-02 16:52

[QUOTE=KyleAskine;288080]When you sort the GHz-d/d column on your website, it does string compare instead of numerical compare.[/QUOTE]Hmm, that's embarrassing... fixed now :smile:

nucleon 2012-02-03 05:16

Thanks for that.

Dubslow 2012-02-06 05:07

[QUOTE=Dubslow;287602]AHAHAHHAHAHAHAHAHAHAHAA!!!


I finally got the damn drivers to install.
[code]CUDA version info
binary compiled for CUDA 4.0
CUDA runtime version 4.10
CUDA driver version 4.10[/code]

Please take your time TheJudger, I've been without my GPU for a few weeks due to driver issues, I can wait a few more days :smile::smile:[/QUOTE]

I can't find the head-bashing emoticon. It seems it was too good to be true; after the 'successful' install, Ubuntu then refused to boot properly. I'm back to nvidia-current and 270.* until either that package upgrades or I can get this truly fixed. I don't know why it won't just do it correctly. So, I still can't run mfaktc in Linux.

Stupid nvidia drivers.

flashjh 2012-02-07 00:27

Has anyone else noticed mfaktc locked up lately? My instances look like they're running and the CPU usage stays @ 100%, but they are not progressing. I just updated to a newer beta driver to see if the old driver was causing the problem, but I was curious if anyone has seen this?

Jerry

Dubslow 2012-02-07 00:48

I've experienced this from time to time since I first installed it however many months ago (it was at 0.17 at the time, I think) and I've had it where it just stops. Nothing appears wrong, except that nothing is output. When this happened in version 0.18, pressing ^C did nothing, because the current class never finished.

However, this happens rarely enough that it caused no problems. I [i]think[/i] this only happens in Windows, but that could easily be wrong.

Chuck 2012-02-07 00:50

[QUOTE=flashjh;288493]Has anyone else noticed mfaktc locked up lately? My instances look like they're running and the CPU usage stays @ 100%, but they are not progressing. I just updated to a newer beta driver to see if the old driver was causing the problem, but I was curious if anyone has seen this?

Jerry[/QUOTE]

Nvidia 285.62 has been working fine for me since Oct 2011; I updated to the mfaktc compiled with CUDA 4.10 and my two instances of mfaktc have continued to run without incident.

Chuck

flashjh 2012-02-07 01:02

[QUOTE=Dubslow;288495]I've experienced this from time to time since I first installed it however many months ago (it was at 0.17 at the time, I think) and I've had it where it just stops. Nothing appears wrong, except that nothing is output. When this happened in version 0.18, pressing ^C did nothing, because the current class never finished.

However, this happens rarely enough that it caused no problems. I [I]think[/I] this only happens in Windows, but that could easily be wrong.[/QUOTE]


Yes, this is exactly what I'm talking about... normally I wouldn't have mentioned it, but it's been happening several times a day lately. Since I have to *work*, I don't get a chance to fix it for several hours, which is a lot of lost TFing.


[QUOTE=Chuck;288496]Nvidia 285.62 has been working fine for me since Oct 2011; I updated to the mfaktc compiled with CUDA 4.10 and my two instances of mfaktc have continued to run without incident.

Chuck[/QUOTE]

I'm on 295.51 since just a bit ago. If the lockup occurs again, I'll revert to 285.62 and see if that fixes it.

Thanks

James Heinrich 2012-02-07 02:53

[QUOTE=Chuck;288496]Nvidia 285.62 has been working fine for me[/QUOTE]Same here.

kladner 2012-02-07 05:12

I'm running 290.53 and mfaktc 0.18 (compiled with CUDA 4.10), and have not noticed any problems.

Radikalinsky 2012-02-09 00:23

[QUOTE=flashjh;288493]Has anyone else noticed mfaktc locked up lately? My instances look like they're running and the CPU usage stays @ 100%, but they are not progressing. I just updated to a newer beta driver to see if the old driver was causing the problem, but I was curious if anyone has seen this?

Jerry[/QUOTE]

I have had this effect on two systems for about half a year, with different versions of mfaktc and NVidia drivers, but only with an overclocked GPU and during some other activity (internet browsing). The GPU is probably overloaded and protects itself by resetting to a very slow clock speed, and mfaktc hangs completely, consuming 0% of the GPU. The CPU consumption of mfaktc also goes down to 0%. This happens because I set AllowSleep=1 in the ini file, which works fine and saves some power and noise.

Rad

Dubslow 2012-02-09 02:24

Okay, I think the only reason it crashed only in Windows is that in Windows I do overclock it; however, I've never noticed instability except for this problem. While testing the OC, when it crashed, it crashed hard: Windows would complain that the video driver had been restarted, and the clock speed was reset to stock (factory OC). No such thing occurred whenever mfaktc hung. I also did not notice my CPU usage drop, although I don't have AllowSleep enabled.

Batalov 2012-02-09 02:59

Did you test the stability of the OC with, say, EVGA OC Scanner? (absence of artifacts in the Furry and Tessy tests)

flashjh 2012-02-09 05:01

Thanks for all the replies. I downgraded to 285.62 and all seems to be fine now. Been running for a couple of days non-stop. :smile:

TheJudger 2012-02-12 00:27

I've just updated my box from openSUSE 11.1 (gcc 4.3.x) to openSUSE 12.1 (gcc 4.6.x).
The reason I didn't update my machine earlier was that CUDA officially supports only gcc up to 4.4. :sad:
I did the upgrade now because openSUSE 11.1 doesn't support AVX (e.g. Sandy Bridge).
So I've installed openSUSE 12.1 with gcc 4.6 and gcc 4.3 (packages from the openSUSE 11.2 repo), setting gcc 4.3 as the default compiler (using the alternatives framework).
Now I can compile and run mfaktc on openSUSE 12.1.

Now the interesting part: I've compiled all mfaktc source files with gcc 4.3 except the sieve code (sieve.c). [B]For sieve.c I've used gcc 4.6. Result: 20% faster sieving (SievePrimes=5000).[/B] This needs further testing but it looks promising. :smile:
Yes, no code changes, just an updated compiler!

Oliver

Batalov 2012-02-12 01:06

[QUOTE=TheJudger;289080][B]For sieve.c I've used gcc 4.6. Result: 20% faster sieving (SievePrimes=5000).[/B] This needs further testing but it looks promising. :smile:
Yes, no code changes, just an updated compiler!

Oliver[/QUOTE]
Nice! This also means that (CPU-specific) assembly code fragments would do even better still (but it is surely quite some work).

I also run openSUSE 12.1. Just last night, when I tried the GPU GMP-ECM code, I had to "hack" the cuda includes to let the 4.6 compiler do the job and not bail (which it does by default).

kjaget 2012-02-12 14:33

[QUOTE=TheJudger;289080]
Yes, no code changes, just an updated compiler!

Oliver[/QUOTE]

That's really interesting. On the Windows side, I spent some time porting the code to build with Intel's C compiler. Generally it is a much better optimizing compiler than MSVC, but I saw no difference. Granted, I wasn't building the .cu files with it, just the .c files, and combining them with the nvcc/msvc-compiled .cu files, but that should have picked up any improvements it could find in sieve.c.

It could be a lot of things: MSVC isn't as bad as I thought, the older GCC was particularly bad, or (since I'm not building for AVX-enabled targets) there's something specific in the sieve code which works well with AVX but not SSE, or lots of other options.

Good news regardless, though.

Any idea how 20% faster sieving translates into run time improvements?

nucleon 2012-02-12 18:43

My question is - does the current win code have these improvements? If not - can I get a hold of it? I have a machine where the GPU% is hovering around 85-90% with sieve primes at 5000. I'm CPU limited on my farm atm.

[QUOTE=kjaget;289143]Any idea how 20% faster sieving translates into run time improvements?[/QUOTE]

2 outcomes I can think of (my guess):

1) If your GPU% is running at say 80% (as your CPU is maxed), then 20% sieve code improvement would boost the GPU% by 20% (ish), so one would expect GPU% to increase to 96% (ish), thereby giving an overall improvement of 20% approx.

2) If your GPU% is close to 99%, and sieve primes on your mfaktc instances is say 'x', then the improvement would allow you to increase sieve primes value above 'x'. The actual throughput improvement is anyone's guess. But it won't be higher than 20% and likely to be noticeably less than 20%.

Add disclaimer of 'your mileage may vary'.

-- Craig

Ethan (EO) 2012-02-14 04:16

A microoptimization that yields an extra 1% throughput for the single instance case on my machine (GTX 470 fed to about 50% utilization with one instance):

Beginning at line 124 of tf_common.cu in 0.18, change
[CODE]
/* set result array to 0 */
for(i=0;i<32;i++)mystuff->h_RES[i]=0;
cudaMemcpy(mystuff->d_RES, mystuff->h_RES, 32*sizeof(int), cudaMemcpyHostToDevice);
[/CODE]

to

[CODE]
/* set result array to 0 */
cudaMemsetAsync(mystuff->d_RES, 0, 32*sizeof(int));
for(i=0;i<32;i++)mystuff->h_RES[i]=0;
[/CODE]

No improvement for multi-instance cases; -st2 passed. memset() on h_RES is slower.

-Ethan

TheJudger 2012-02-14 12:34

Ethan,

1% more throughput sounds unreasonably high; this is executed once per class. How long was your test case?
Did you mean cudaMemset() or cudaMemsetAsync()? Async would need the streamid as an extra parameter and might be unsafe.

Oliver

Ethan (EO) 2012-02-14 22:38

[QUOTE=TheJudger;289363]Ethan,

1% more throughput sounds unreasonably high; this is executed once per class. How long was your test case?
Did you mean cudaMemset() or cudaMemsetAsync()? Async would need the streamid as an extra parameter and might be unsafe.

Oliver[/QUOTE]

Hi Oliver -- maybe more like 0.8%; I see 183M/s compared to 181.5M/s with this change.

streamid defaults to 0 if omitted, and memsets in streamid 0 aren't overlapped with operations in any other streams; since this is happening before any kernel launches it should be safe.

As for the performance difference -- when I profile these with -tf 101001001 70 71, the memcpy case sees a delay of about 10ms between the memcpy and the first kernel launch, while the memset case sees a delay of about 3.5ms between the memset and the first kernel launch. That's about 0.4% improvement. The rest of the difference ... ?

TheJudger 2012-02-17 12:21

[QUOTE=Ethan (EO);289336]A microoptimization that yields an extra 1% throughput for the single instance case on my machine (GTX 470 fed to about 50% utilization with one instance):
[...]

No improvement for multi-instance cases ; -st2 passed. memset() on h_RES is slower.[/QUOTE]

I can reproduce this on my system, too. It is faster in CPU-limited situations.

[QUOTE=Ethan (EO);289405]streamid defaults to 0 if omitted, and memsets in streamid 0 aren't overlapped with operations in any other streams; since this is happening before any kernel launches it should be safe.[/QUOTE]
Yepp, you're right.

[QUOTE=Ethan (EO);289405]As for the performance difference -- when I profile these with -tf 101001001 70 71, the memcpy case sees a delay of about 10ms between the memcpy and the first kernel launch, while the memset case sees a delay of about 3.5ms between the memset and the first kernel launch. That's about 0.4% improvement. The rest of the difference ... ?[/QUOTE]

I really don't understand why it is faster by a [B]constant factor[/B]. The code fragment is executed [B]once per class[/B] and the [B]amount of work/data within the fragment is constant[/B] all the time. I would assume a constant-time improvement (e.g. the 6.5ms in your case above), but actually it is 0.6% to 0.8% faster. Time per class in my testcases decreased from e.g. 7.220s to 7.165s (-55ms) and from 28.532s to 28.311s (-211ms). Four times longer execution time yields a four times larger difference in runtime...

Oliver

TheJudger 2012-02-19 19:23

Ideas for "running multiple instances of mfaktc in a single directory"
[LIST][*]add a commandline switch to specify the name of the instance (called [I]NAME[/I] here)[*]worktodo file:[LIST][*]remove worktodo file name from mfaktc.ini[*]if NAME is not specified use worktodo.txt[*]if NAME is specified use worktodo.NAME.txt[/LIST][*]result file:[LIST][*]if NAME is not specified write results into results.txt[*]if NAME is specified write results into results.NAME.txt[/LIST][*]mfaktc.ini:[LIST][*]if NAME is not specified read settings from mfaktc.ini[*]if NAME is specified check if mfaktc.NAME.ini exists[LIST][*]if mfaktc.NAME.ini exists use mfaktc.NAME.ini[*]if mfaktc.NAME.ini doesn't exist use mfaktc.ini[/LIST][/LIST][/LIST]
Oliver

Bdot 2012-02-19 19:52

[QUOTE=TheJudger;290009]Ideas for "running multiple instances of mfaktc in a single directory"
[LIST][*]add a commandline switch to specify the name of the instance (called [I]NAME[/I] here)[*]worktodo file:[LIST][*]remove worktodo file name from mfaktc.ini[*]if NAME is not specified use worktodo.txt[*]if NAME is specifed use worktodo.NAME.txt[/LIST] [*]result file:[LIST][*]if NAME is not specified write results into results.txt[*]if NAME is specified write results into results.NAME.txt[/LIST] [*]mfaktc.ini:[LIST][*]if NAME is not specified read settings from mfaktc.ini[*]if NAME is specified check if mfaktc.NAME.ini exists[LIST][*]if mfaktc.NAME.ini exists use mfakt.NAME.ini[*]if mfaktc.NAME.ini doesn't exist use mfaktc.ini[/LIST] [/LIST] [/LIST]
Oliver[/QUOTE]
This is how it is implemented in mfakto:
[LIST][*]it has an option (-i) to let you specify a different ini-file (if not specified, it's mfakto.ini)[*]the ini-file contains worktodo (WorkFile=) and results file (ResultsFile=) names for the instance[*]file locking will make access to the files safe, even if it is the same for all instances[/LIST]

flashjh 2012-02-25 19:36

[QUOTE=James Heinrich;273210]It looks like I'll be able to help George revise the parsing logic in the near future. Don't expect anything changed today or tomorrow, but if we can collectively decide on a standardized format for the results (by the end of this week, let's say), I'll see if I can get the results parser to understand it all correctly within a few weeks.[/QUOTE]

James, any luck with this?

James Heinrich 2012-02-25 19:41

[QUOTE=flashjh;290901]James, any luck with this?[/QUOTE]Not yet. :no:
I'm experiencing infinitely more difficulty setting up a development environment than I expected. [url="http://en.wikipedia.org/wiki/WIMP_%28software_bundle%29"]WIMP[/url] != [url="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29"]LAMP[/url] (or even [url=http://en.wikipedia.org/wiki/WAMP]WAMP[/url] as I have on my home/development server).

flashjh 2012-03-02 21:44

[QUOTE=James Heinrich;290902]Not yet. :no:
I'm experiencing infinitely more difficulty setting up a development environment than I expected. [URL="http://en.wikipedia.org/wiki/WIMP_%28software_bundle%29"]WIMP[/URL] != [URL="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29"]LAMP[/URL] (or even [URL="http://en.wikipedia.org/wiki/WAMP"]WAMP[/URL] as I have on my home/development server).[/QUOTE]

James,

My factor for M[URL="http://www.mersenne.org/report_exponent/?exp_lo=58703263&exp_hi=10000&B1=Get+status"]58703263[/URL] showed up correctly today:

[CODE]Manual testing 58703263 [B]F[/B] 2012-03-02 16:45 0.0 920694316080604322623 1.7365[/CODE]

Did you make some changes?

James Heinrich 2012-03-02 22:07

[QUOTE=flashjh;291636]Did you make some changes?[/QUOTE]No. I'm working through changes, but nothing has been published to the site yet.

I assume you mean that your TF factor was correctly credited as TF (instead of P-1)? Your example is a [url=http://mersenne-aries.sili.net/M58703263]relatively small factor[/url] (<2^70) so that's probably expected. It's only when you submit factors larger than what PrimeNet considers "reasonable" for TF that it falsely assumes it must've come from P-1.

flashjh 2012-03-02 22:08

[QUOTE=James Heinrich;291638]No. I'm working through changes, but nothing has been published to the site yet.

I assume you mean that your TF factor was correctly credited as TF (instead of P-1)? It is a relatively small factor (<2^70) so that's probably expected. It's only when you submit factors larger than what PrimeNet considers "reasonable" for TF that it falsely assumes it must've come from P-1.[/QUOTE]

Ah, that explains some from the past, as well. Thanks for the update.

rcv 2012-03-05 07:44

I got my first CUDA capable card (560 Ti) a little over a month ago, and have been running mfaktc on 64-bit Linux. I have a few questions.

1. Can anyone explain why Compute Capability 2.1 is about "half" as fast as 2.0 for running mfaktc? [Yes, I know there are a billion or so fewer transistors, but what specific feature/function do the 2.0 cards have that 2.1 lacks that makes such a huge difference to mfaktc.]

2. I am disappointed in how much CPU it takes to feed my GPU. I would happily give up a fraction of my GPU performance to get back my CPU performance. [It's no trouble consuming nearly two full i7 cores to feed the GPU via two instances of mfaktc.]

mfaktc is compiled with a minimum SievePrimes=5000. I have tweaked the code to let me run at SievePrimes=1000. Is there a discussion as to why the user shouldn't be allowed to set a lower SievePrimes than 5K?

3. Has anyone considered running the sieving on the GPU? Is it just that nobody has written the code or is there a reason the idea was rejected? [If one were running the sieve and the trial factoring on the same processor, the proper tradeoff between sieving and trial factoring seems pretty clear -- If trial factoring can test, say, 250 million candidates per second, then sieving should stop at the point it can no longer remove more than 250 million candidates per second.]

Thanks in advance!

Dubslow 2012-03-05 08:44

[url]http://www.mersenneforum.org/showthread.php?p=281245#post281245[/url]

^ That's the answer for your first question.

For the other two, I'm not sure, though I'll cast my vote again for on-GPU sieving.

TheJudger 2012-03-05 10:51

Hi!

[QUOTE=rcv;291954]2. I am disappointed in how much CPU it takes to feed my GPU. I would happily give up a fraction of my GPU performance to get back my CPU performance. [It's no trouble consuming nearly two full i7 cores to feed the GPU via two instances of mfaktc.]
[/QUOTE]
Just run CudaLucas to free up your CPU. PrimeNet needs more LL and less TF!

[QUOTE=rcv;291954]mfaktc is compiled with a minimum SievePrimes=5000. I have tweaked the code to let me run at SievePrimes=1000. Is there a discussion as to why the user shouldn't be allowed to set a lower SievePrimes than 5K?[/QUOTE]
Have you compared the speed of mfaktc when you lower SievePrimes to 1000? Avg. rate is [B]not[/B] the speed, time per class is the speed! Lower SievePrimes is just a waste of energy and [B]not validated[/B]!

src/params.h:
[CODE]
[...]
/******************************************************************************
*******************************************************************************
[B]*** DO NOT EDIT DEFINES BELOW THIS LINE UNLESS YOU REALLY KNOW WHAT YOU DO! ***
*** DO NOT EDIT DEFINES BELOW THIS LINE UNLESS YOU REALLY KNOW WHAT YOU DO! ***
*** DO NOT EDIT DEFINES BELOW THIS LINE UNLESS YOU REALLY KNOW WHAT YOU DO! ***
[/B]*******************************************************************************
******************************************************************************/
[...]
#define SIEVE_PRIMES_MIN 5000 /* DO NOT CHANGE! */
#define SIEVE_PRIMES_DEFAULT 25000 /* DO NOT CHANGE! */
#define SIEVE_PRIMES_MAX 200000 /* DO NOT CHANGE! */
[...]
[/CODE]


[QUOTE=rcv;291954]3. Has anyone considered running the sieving on the GPU? Is it just that nobody has written the code or is there a reason the idea was rejected? [If one were running the sieve and the trial factoring on the same processor, the proper tradeoff between sieving and trial factoring seems pretty clear -- If trial factoring can test, say, 250 million candidates per second, then sieving should stop at the point it can no longer remove more than 250 million candidates per second.][/QUOTE]

Yes, for sure... but I don't know how to implement this [B]efficiently[/B]. Bdot (mfakto) tried and he got IIRC ~30M/s on a not-so-slow GPU. And the tradeoff is not that simple, because the number of candidates per second doesn't matter; the time per assignment matters!

Oliver

TheJudger 2012-03-05 12:29

btw. don't take it personally if my previous post sounds too rude.

I'm getting the same questions again and again so I might be a little bit annoyed. :sad:

Oliver

kjaget 2012-03-05 16:20

[QUOTE=TheJudger;291963]Have you compared the speed of mfaktc when you lower SievePrimes to 1000? Avg. rate is [B]not[/B] the speed, time per class is the speed![/QUOTE]

I've seen this misunderstanding quite a bit as well. Any thought of removing it from future versions? Maybe replacing it with something like GHz-days/day, which is easy to sum up among instances to see the total throughput?

Dubslow 2012-03-05 16:28

Difficult to calculate and estimate. Sticking with the raw data is better, but we do need to figure out a better way to print it.

James Heinrich 2012-03-05 16:44

[QUOTE=kjaget;291984]I've seen this misunderstanding quite a bit as well. Any thought of removing it from future versions? Maybe replacing it with something like GHz-days/day, which is easy to sum up among instances to see the total throughput?[/QUOTE]I highly second this. The "M/s" value is somewhat meaningless for the user, and often misunderstood. The conversion of time-per-class into GHz-days-per-day should be very simple: GHz-days for the assignment is given by:[code]0.016968 * pow(2, $bitlevel - 48) * 1680 / $exponent

// example using M50,000,000 from 2^69-2^70:
= 0.016968 * pow(2, 70 - 48) * 1680 / 50000000
= 2.3912767291392 GHz-days

// magic constant is 0.016968 for TF to 65-bit and above
// magic constant is 0.017832 for 63- and 64-bit
// magic constant is 0.01116 for 62-bit and below[/code]Then all you need is: 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment

Of course, the above code is based on a single bitlevel, but easily adapted to multi-bitlevel assignments.
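The formula translates directly into a short Python sketch. The magic constants are the ones given in the post; the `classes=960` default reflects my understanding that mfaktc actually tests 960 of the 4620 classes per bitlevel, so treat it as an assumption to check against your build:

```python
def ghz_days(exponent, bitlevel):
    """GHz-days credit for TF of one exponent from 2^(bitlevel-1) to 2^bitlevel,
    using the magic constants quoted in the post."""
    if bitlevel >= 65:
        k = 0.016968
    elif bitlevel >= 63:
        k = 0.017832
    else:
        k = 0.01116
    return k * 2 ** (bitlevel - 48) * 1680 / exponent

def ghz_days_per_day(exponent, bitlevel, time_per_class, classes=960):
    """Throughput from per-class timing: 86400 / (time_per_class * classes)
    assignments per day, times the credit per assignment."""
    return 86400 / (time_per_class * classes) * ghz_days(exponent, bitlevel)

# The worked example from the post:
print(ghz_days(50_000_000, 70))  # ~2.3913 GHz-days, matching the example
```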

rcv 2012-03-05 19:17

@Dubslow: Thank you for the pointer. That's exactly what I was looking for.

@kjaget/TheJudger: On my setup, the time per class and the candidates per second are in a lock-step inverse relationship with each other (with constant SievePrimes). Whether I set SievePrimes to 1000 or 1500 or 2000, the average rate remains a little above 125 megacandidates per second on each of the two instances I am running. When I vary SievePrimes, the number of candidates changes, as expected, and the time per class changes proportionally. If I have a misunderstanding, I'm sure you folks will correct me. [See more, below.]

[code][FONT=Courier New][SIZE=1]Starting trial factoring M52575179 from 2^71 to 2^72
class | candidates | time | ETA | avg. rate | SievePrimes | CPU wait
1669/4620 | 1.39G | 11.097s | 1h53m | 125.30M/s | 1500 | 0.39%
1672/4620 | 1.39G | 10.776s | 1h49m | 129.03M/s | 1500 | 0.41%
1680/4620 | 1.39G | 10.533s | 1h47m | 132.01M/s | 1500 | 0.41%
1681/4620 | 1.39G | 11.658s | 1h58m | 119.27M/s | 1500 | 0.37%
1689/4620 | 1.39G | 10.670s | 1h48m | 130.31M/s | 1500 | 0.41%
[/SIZE][/FONT][/code][QUOTE=TheJudger;291965]btw. don't take it personally if my previos post sounds too rude. I'm getting the same questions again and again so I might be a little bit annoyed. :sad:[/QUOTE]
OK. I won't take it personally. For all you know, I just fell off the turnip truck. At least your answers are all here in one big bold place for future questioners to find. :smile:


[QUOTE=TheJudger;291963]
Have you compared the speed of mfaktc when you lower SievePrimes to 1000? Avg. rate is [B]not[/B] the speed, time per class is the speed! Lower SievePrimes is just a waste of energy and [B]not validated[/B]![/QUOTE]
I disagree completely about this being a waste of energy! I saw the warning in the code, which I heeded. I saw the word "unless". I've looked at the code. I've run the self-test. I've found 4 factors at the 71/72 bit size with the tweaked version. Who should I see to get it validated? If you know of a problem, PLEASE let me know.

In the parlance of Mathematica, the fraction of candidates which pass the sieving is given by Apply[Times, (Prime[5+Range[sp]]-1)/Prime[5+Range[sp]]], where sp is the number of SievePrimes. At SievePrimes=1500, the above formula yields 28.5914%. At SievePrimes=5000, it yields 25.0285%.

The number of candidates reported in each class by mfaktc (1.39G, as shown above) with SievePrimes=1500 agrees with the theoretical value: Floor[0.285914665945569*2^71/4620/52575179/2+1/2]=1389675478 candidates per class.

At SievePrimes=5000, the number of candidates per class is theoretically Floor[0.250284623178239*2^71/4620/52575179/2+1/2]=1216497244.
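These Mathematica figures can be cross-checked with a self-contained Python sketch (plain floats rather than exact rationals, so expect agreement only to ~10 significant digits; the helper names are mine):

```python
def first_primes(count, skip=0, limit=70000):
    """First `count` primes after skipping the first `skip` of them.
    limit=70000 comfortably covers the 5005th prime (~53,000)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            flags[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    ps = [i for i, f in enumerate(flags) if f]
    return ps[skip : skip + count]

def surviving_fraction(sp):
    """Fraction of candidates passing a sieve of the first sp primes >= 13
    (2, 3, 5, 7, 11 are already handled by the 4620-class mechanism)."""
    frac = 1.0
    for p in first_primes(sp, skip=5):
        frac *= (p - 1) / p
    return frac

def candidates_per_class(sp, exponent, bitlevel):
    """Expected candidates per class for the 2^(bitlevel-1)..2^bitlevel range,
    using the rounding from the post: Floor[frac * width/4620/p/2 + 1/2]."""
    width = 2 ** (bitlevel - 1)  # e.g. 2^71 for the 71->72 bitlevel
    return int(surviving_fraction(sp) * width / 4620 / exponent / 2 + 0.5)

print(round(surviving_fraction(1500) * 100, 4))  # 28.5915 (%)
print(candidates_per_class(1500, 52575179, 72))  # the 1.39G shown in the log
```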

When I switched from SievePrimes=5000 to SievePrimes=1500, the number of candidates per second remained constant, but the time per class increased by about 14% (0.285915/0.250285-1). As best as I could tell, my CPU usage due to mfaktc went down by more than half. Now, the GPU is almost never starved for work. In contrast, with high fixed values of SievePrimes, my CPU becomes saturated, the GPU is often starved for work, the net mfaktc throughput goes down, and I can't use my CPU for other useful work. With moderate values of SievePrimes, the CPU burns a lot of time and the GPU is sometimes starved for work.

With a smaller number of cores and a slower GPU, the default and minimum SievePrimes may make very good sense. But with a larger number of cores and a faster GPU, the minimum SievePrimes does not make sense for me. And, I would respectfully suggest it may not make sense for other people.

So, let me re-ask my 2nd question... Aside from validating the code, is there a reason why the user shouldn't be allowed to set a lower SievePrimes than 5K?


[QUOTE=TheJudger;291963]Yes, for sure... but I don't know how to implement this [B]efficiently[/B]. Bdot (mfakto) tried and he got IIRC ~30M/s on a not-so-slow GPU. And the tradeoff is not that simple, because the number of candidates per second doesn't matter; the time per assignment matters![/QUOTE]

I maintain that if both sieving and trial factoring are done on the GPU, the tradeoff *is* as simple as matching the candidates per second removed by the siever to the candidates per second tested by the trial factorer.

I actually have some prototype sieving code. It is not optimized. At the smallest prime factor not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second. At p=2297, it is 500 megacandidates per second, and at p=4093 it is 261 megacandidates per second. But the curve is rather flat here. With my 560Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the tradeoff between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.]

@Bdot: If you are interested, would you please weigh in on how this compares with your results.

Thanks to all who responded!

Bdot 2012-03-05 20:07

[QUOTE=rcv;292021]
So, let me re-ask my 2nd question... Aside from validating the code, is there a reason why the user shouldn't be allowed to set a lower SievePrimes than 5K?
[/QUOTE]

I think the best way to free your CPU core(s) for other tasks is to set mfaktc to low priority.

Regarding the pure question "why not lower SievePrimes": because it's not tested. As long as SievePrimes remains above SIEVE_SPLIT (250), the code will probably work correctly. However, before mfakt[co] is given to the public, some more tests are performed. You may have seen the CHECKS_MODBASECASE #define. Completing the full selftest at a certain SievePrimes value is a prerequisite, but not sufficient.

OK, now that this is out, I can tell you that some time ago I finished a few tests with CHECKS_MODBASECASE at SievePrimes=256, and they did not turn up any errors, [B]but it was not the full test[/B], and it was mfakto, which runs different kernels.

And I briefly tested a special version of mfakto that skips sieving and memory transfer to the GPU completely, testing all candidates of a class. This one also successfully completed a few rounds of CHECKS_MODBASECASE tests.

So I'm quite confident that the full tests of lower SievePrimes would pass, but so far nobody has done these tests.

[QUOTE=rcv;292021]
I maintain that if both sieving and trial factoring are done on the GPU, the tradeoff *is* as simple as matching the candidates per second removed by the siever to the candidates per second tested by the trial factorer.

I actually have some prototype sieving code. It is not optimized. At the smallest prime factor not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second. At p=2297, it is 500 megacandidates per second, and at p=4093 it is 261 megacandidates per second. But the curve is rather flat here. With my 560Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the tradeoff between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.]

@Bdot: If you are interested, would you please weigh in on how this compares with your results.

Thanks to all who responded![/QUOTE]
I'd really be interested to see this work. Though I'm maintaining the OpenCL version, a hint at how GPU sieving could be done fast would certainly help.

My first approach was so terribly slow that I could only sieve the first 256 primes to get the 30M/s (sieve output for the siever alone) on a GPU that otherwise runs ~150M/s through factoring. I was so disappointed that I gave up ... Your results, however ...

Bdot 2012-03-05 20:16

[QUOTE=James Heinrich;291989]I highly second this. The "M/s" value is somewhat meaningless for the user, and often misunderstood. The conversion of time-per-class into GHz-days-per-day should be very simple: GHz-days for the assignment is given by:[code]0.016968 * pow(2, $bitlevel - 48) * 1680 / $exponent

// example using M50,000,000 from 2^69-2^70:
= 0.016968 * pow(2, 70 - 48) * 1680 / 50000000
= 2.3912767291392 GHz-days

// magic constant is 0.016968 for TF to 65-bit and above
// magic constant is 0.017832 for 63-and 64-bit
// magic constant is 0.01116 for 62-bit and below[/code]Then all you need is: 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment

Of course, above code is based on a single bitlevel, but easily adapted to multi-bitlevel assigments.[/QUOTE]

It's the first time I've seen the formula to calculate this. (OK, if I had asked, you would probably have told me earlier ...) I think I'll leave the M/s as an option in the ini-file, but switch the default to GHz-days/day (= GHz?). Added to todo-list ...

Dubslow 2012-03-05 22:23

Perhaps call it Eq. GHz, for Equivalent GHz (to a Core2).

@rcv: What a wonderful defense :smile: I like where this is going. I'd volunteer to do some 460 testing, but unfortunately I cannot for the life of me upgrade my Linux Nvidia drivers past 270.xx, so I'm stuck with mfaktc 0.17.

flashjh 2012-03-05 22:35

[QUOTE=Bdot;292025]So I'm quite confident that the full tests of lower SievePrimes would pass, but so far nobody has done these tests.[/QUOTE]

Just point me in the right direction and I'll test it. What needs to be done, specifically?

Prime95 2012-03-05 23:19

[QUOTE=rcv;292021]I actually have some prototype sieving code. It is not optimized. [/QUOTE]

I too would be interested in seeing this code.

Bdot 2012-03-05 23:26

[QUOTE=flashjh;292046]Just point me in the right direction and I'll test it. What needs to be done, specifically?[/QUOTE]

1. Build a binary that allows for setting low SievePrimes, and that has CHECKS_MODBASECASE defined.
2. Run the full selftest with that binary with fixed, low SievePrimes
3. Analyze the output for any CHECKS_MODBASECASE violations (and of course, all factors need to be found).

But before we all run and do something just for the sake of doing something: Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.

Dubslow 2012-03-06 02:57

[QUOTE=Bdot;292048]

But before we all run and do something just for the sake of doing something: Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.[/QUOTE]

[QUOTE=rcv;292021]

In the parlance of Mathematica, the fraction of candidates which pass the sieving is given by Apply[Times,Prime[5+Range[sp]]-1)/Prime[5+Range[sp]])], where sp is the number of SievePrimes. At SievePrimes=1500, the above formula yields 28.5914%. At SievePrimes=5000, the above formula yields 25.0285%.

The number of candidates reported in each class by mfaktc (1.39G, as shown above) with SievePrimes=1500 agrees with the theoretical. Floor[.285914665945569*2^71/4620/52575179/2+1/2]=1389675478 candidates per class.

At SievePrimes=5000, the number of candidates per class is theoretically Floor[0.250284623178239*2^71/4620/52575179/2+1/2]=1216497244.

[U][B]When I switched from SievePrimes=5000 to SievePrimes=1500, the number of candidates per second remained constant, but the time per class increased by about 14% (0.285915/0.250285-1).[/B] As best as I could tell, my CPU usage due to mfaktc went down by more than half.[/U] Now, the GPU is almost never starved for work. In contrast, with high fixed values of SievePrimes, my CPU becomes saturated, the GPU is often starved for work, the net mfaktc throughput goes down, and I can't use my CPU for other useful work. With moderate values of SievePrimes, the CPU burns a lot of time and the GPU is sometimes starved for work.
[/QUOTE]

Seems like a pretty clear benefit to me. (My emphasis.)

kjaget 2012-03-06 04:18

[QUOTE=Dubslow;292058]Seems like a pretty clear benefit to me. (My emphasis.)[/QUOTE]

Not sure which side you meant the benefit was on, but here's my take.

Considering how much more efficient GPUs are at generating GHz-days of work, trading 100% of 1 CPU for a 14% speedup in GPU production feels like a net win to me. If you choose to measure it that way, of course.

For a 560Ti, that's ~25 GHz-days/day extra throughput. I don't see that a CPU is going to give anywhere near that kind of throughput doing other types of work.

Dubslow 2012-03-06 06:00

No, it's about half a CPU core, there'd still be significant CPU use (unless he pushes the GPU sieve thing through as well, which I'm hoping for).

TheJudger 2012-03-06 10:59

Hi rcv,

[QUOTE=rcv;292021]I actually have some prototype sieving code. It is not optimized. At the smallest prime factor not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second. At p=2297, it is 500 megacandidates per second, and at p=4093 it is 261 megacandidates per second. But the curve is rather flat here. With my 560Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the tradeoff between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.][/QUOTE]

This sounds interesting. I would love to see your code. Are you talking about sieving only, or does this include the translation of set/unset bits into FC candidates, too?

Oliver

apsen 2012-03-06 14:18

[QUOTE=Bdot;292048]Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.[/QUOTE]

I have an old CPU and a new GPU. With the lower limit at 5000 I saturate all 4 cores of the CPU, but my GPU usage is well below 100%. I played with reducing the limit too and found that the lowest time per class for me is around 1000-1500. As the limit is lowered below 1000, CPU usage goes down further. I did run long tests at 1000 and there were no problems detected, but I did not use the CHECKS_MODBASECASE define.

And low priority does not always help. I have not specifically checked mfaktc, but prime95 would cause hiccups in some programs even on idle priority. In fact, IIRC it has a setting to pause processing when it detects specified programs running.

James Heinrich 2012-03-06 14:32

[QUOTE=apsen;292091]IIRC it has setting to pause processing when it detects specified programs running.[/QUOTE]Indeed it does: [i]PauseWhileRunning=[/i] setting in prime.txt (and also the related [i]LowMemWhileRunning=[/i]). Memory isn't an issue in mfaktc, but I would love to see a PauseWhileRunning setting so I don't have to kill off mfaktc before running a game or other GPU-intensive task, or (more importantly) remember to restart it afterwards.

kjaget 2012-03-06 18:58

[QUOTE=Dubslow;292077]No, it's about half a CPU core, there'd still be significant CPU use (unless he pushes the GPU sieve thing through as well, which I'm hoping for).[/QUOTE]

I wasn't quite sure, but I thought it cut his CPU usage in half. Since he was running 2 cores to start with, that's a 1-core reduction. Or maybe not, but at least that was my reading of it.

rcv 2012-03-06 21:28

@All: I'm not claiming that everyone would want to do less sieving. There are many whose sole purpose is to get their PrimeNet ranking as high as possible. For those people, you *should* spend all of the cores of your CPU to help feed the GPU.

But, as some of you have pointed out, PrimeNet already has more TF than it needs. If you can give it 20% less TF, but also give the project an extra core of P-1 or LL, it might be a net benefit to the project.

What I *am* saying is that the user should be *allowed to choose* a higher or a lower SievePrimes, as to each user's needs and personal preferences. (As long as the code works, of course.)

What I am also suggesting (in the way of sieving on the GPU) is there are some people, such as myself, who have plenty of useful work to keep our expensive Intel i7 cores busy, who would prefer to see the entire sieving/trial factoring compute operation dumped onto the GPU, regardless of a decrease in TF performance.

[QUOTE=TheJudger;292082]Are you talking about sieving only or does this include the translation of set/unset bits into FC candidates, too?[/QUOTE]
Yes, I have a prototype kernel that converts the bitmap into a vector of candidates in the form your siever uploads to the GPU. But no, the timings I quoted previously did not include this kernel, because it is a relatively fixed overhead. [The timings were the approximate *incremental* costs of using higher or lower SievePrimes on the GPU.] For the record, this step of my unoptimized prototype code can convert candidates for the trial factorer's use on my 560Ti in about 3-4% of the time it takes your code to do the trial factoring. In exchange, you don't have to copy the list of candidates from host to device.

-- Rocke

bcp19 2012-03-06 21:56

The talk here kind of brings up a thought I've been working on, namely: how efficient is it to use a faster CPU to do your GPU work? When I started, I had a 560Ti in a Core 2 Quad 8200. 2 cores would get this card to around 200M/s output, which is about 75-80% of the card's capability; adding a 3rd core seemed a waste. This same card is now in an i5-2500K running around 240M/s and takes up 1 full core (under 3% wait time). Doing some calculations, a single core of the 2500K can perform around 9.6M iterations on a 26M exp per day, while all 4 cores on the 8200 combined would only be able to perform around 5.9M iterations per day. While using multiple cores of the 8200 is likely to be more power hungry than 1 core of the 2500K, should I 'waste' 9.6M iterations of calculations per day when the same work could be done using 3 of the 4 8200 cores while only sacrificing around 4.5M iterations?

Bdot 2012-03-06 23:11

[QUOTE=rcv;292129]
What I *am* saying is that the user should be *allowed to choose* a higher or a lower SievePrimes, as to each user's needs and personal preferences. (As long as the code works, of course.)
[/QUOTE]
OK, added to todo list to run all tests with SievePrimes 256 and 1000. If they all succeed then we can still decide what the new low limit should be.
[QUOTE=rcv;292129]
What I am also suggesting (in the way of sieving on the GPU) is there are some people, such as myself, who have plenty of useful work to keep our expensive Intel i7 cores busy, who would prefer to see the entire sieving/trial factoring compute operation dumped onto the GPU, regardless of a decrease in TF performance.
[/QUOTE]
I read the description of your prototype. It's a very neat approach, and I already fear how much code would need to be ported to OpenCL. But I'll definitely throw in some effort if you allow me to. Add a GPL header (or another license of your choice) to it to avoid misuse.

Prime95 2012-03-07 02:38

[QUOTE=rcv;292129]Yes, I have a prototype kernel that converts from the bitmap to a vector of candidates in the form your siever uploads to the GPU. [/QUOTE]

I was wondering if we could do away with this step. Divide the bit array into reasonable-size chunks - say 1KB - and put each CUDA core in charge of TFing one chunk. Each core then processes trial factors until its chunk is complete. Each chunk ought to have roughly the same number of set bits. You waste some TFs at the end-of-chunk processing, as CUDA cores that got a chunk with fewer set bits wait for CUDA cores processing the chunks that had more set bits. The advantage is you save the memory accesses writing the vector of candidates.

This idea can be tweaked further if the cost of setting up a new chunk to process is low (use smaller chunks, and rather than waiting for CUDA cores processing chunks with lots of set bits, CUDA cores simply grab the next available chunk).
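The end-of-chunk waste can be illustrated with a toy CPU-side simulation (plain Python, not GPU code; the bitmap size, survival rate, and chunk sizes are made-up stand-ins). It also shows why the grab-the-next-chunk tweak matters: with static one-chunk-per-core assignment, smaller chunks mean relatively more imbalance:

```python
import random

def chunk_waste(bitmap, chunk_bits):
    """Fraction of candidate slots wasted when every core in a batch must
    wait for the chunk with the most surviving candidates to finish."""
    counts = [sum(bitmap[i : i + chunk_bits])
              for i in range(0, len(bitmap), chunk_bits)]
    busiest = max(counts)
    # each core idles (busiest - its own count) candidate slots
    return sum(busiest - c for c in counts) / (len(counts) * busiest)

random.seed(1)
survival = 0.25  # roughly the post-sieve survival fraction discussed earlier
bitmap = [random.random() < survival for _ in range(1 << 20)]

for bits in (1024, 8192):
    print(bits, round(chunk_waste(bitmap, bits), 3))
```

With a random bitmap, the waste fraction shrinks as chunks grow (the set-bit counts concentrate around their mean), so small chunks only pay off if idle cores can cheaply grab the next available chunk, as suggested above.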

kjaget 2012-03-07 14:35

[QUOTE=bcp19;292133]The talk here kind of brings up a thought I've been working on, mainly, how efficient is it to use a faster CPU to do your GPU work. When I started, I had a 560Ti in a Core 2 Quad 8200. 2 cores would get this card to around 200M/s output, which is about 75-80% of the card's capability. Adding a 3rd core seemed a waste. This same card is now in an i5-2500K running around 240M/s and takes up 1 full core (under 3% wait time). Doing some calculations, a single core of the 2500K can perform around 9.6M iterations on a 26M exp per day while all 4 cores on the 8200 combined would only be able to perform around 5.9M iterations per day. While using multiples cores of the 8200 is likely to be more power hungry than 1 core of the 2500K, should I 'waste' 9.6M iterations of calculations per day when the same work could be done using 3 of the 4 8200 cores while only sacrificing around 4.5M iterations?[/QUOTE]

Impossible to tell until you give us some timing info from your mfaktc runs. Which goes back to my point that we should think strongly about removing the candidates/sec report from mfaktc output since it's so easily misunderstood.

bcp19 2012-03-07 23:03

[QUOTE=kjaget;292197]Impossible to tell until you give us some timing info from your mfaktc runs. Which goes back to my point that we should think strongly about removing the candidates/sec report from mfaktc output since it's so easily misunderstood.[/QUOTE]
Well, since I have no timings from the 8200 on hand, then how about this, not quite apples to apples, but close:
Two systems with a 26M exp for a benchmark: the 2500 = ~9.6M iterations per day on a single core, while the 2400 = ~4.32M iterations per day. The 2500 outputs ~168 GHzD/Day on a 560Ti using 1 core, while the 2400 outputs ~160 GHzD/day on a 560 using 2 cores (SievePrimes = 5000 on both systems). Using GPU-Z, both cards show 99% GPU load. Mfaktc on the 2500 says CPU wait < 3%, while the 2 instances on the 2400 have roughly 20% CPU wait.

P95 on the 2400 is set to run all 4 cores; cores 1 and 3 share mfaktc instances and average 103-109ms/iter on a 45M exp, while core 2 averages 19.3ms/iter. When I tried doing the same on the 2500, core 2 was running a 26M exp at 18ms/iter and core 3 (shared with mfaktc) was something ridiculous like 3 seconds per iteration, so I gave up P95 share on it.

So, basically you have a 9.6M-iteration CPU = 168 GHzD/day on the 2500, and 80% of two 4.32M (6.912M) iteration CPUs = 160 GHzD/Day on the 2400. No M/s listed, so we've taken that out of the equation now; we only have work output. The result looks the same: in terms of work per day, it seems you are more efficient using a lesser system to run a GPU.

Edit: Looking at the 8200, it's hard to really compare. I am not sure if it is from L1/L2/L3 sharing, but using a 45M exp, the 8200 running all 4 cores shows 90ms/iter. If you have cores 1/3 running a 45M exp and 2/4 doing TF, it shows 60ms/iter. I currently have a 550Ti in the 8200. Using a single core on a 26M exp, it comes out at ~2.2M iter/day. This theoretically means it is capable of 8.8M iter/day on 4 cores, but in reality gives ~5.9M iter/day. You can, however, run LL on cores 1/3 and use 2/4 to power a GPU. The 550Ti outputs ~96 GHzD/Day. If you give the 2 cores their 'full' capability, you have 96 GHzD = 4.4M iter. The 2500 = 17.5 GHzD/1M iter, the 2400 = 23.188, and at 4.4 the 8200 = 21.36. Here is where it gets tricky though: cores 1 and 3 are unaffected with the GPU running on 2/4, but are affected with LL/DC on 2/4. There is only 1.5M iter/day difference between 4 cores on LL and 2 cores on LL, so with 96 = 1.5M iter, you get 64. Makes it hard to compare.

kjaget 2012-03-08 14:58

Again, without timings I can't say much.

[QUOTE=bcp19;292261]Well, since I have no timings from the 8200 on hand, then how about this, not quite apples to apples, but close:
Two systems with a 26M exp for a benchmark: 2500 = ~9.6M iterations per day on a single core while the 2400 = ~4.32M iterations per day. 2500 outputs ~168 GHzD/Day on a 560Ti using 1 core while the 2400 outputs ~160GHzD/day on a 560 using 2 cores (SievePrimes = 5000 on both systems). Using GPUz both cards show 99% GPU load. Mfaktc on the 2500 says CPU wait < 3% while the 2 instances on the 2400 have roughly 20% CPU wait.[/QUOTE]

Allow SievePrimes to auto-adjust for both cases and you should get better throughput, at least on the 2400 system. Consider adding at least one more core on the 2500 system, maybe more for both. My thinking:

In general, it only takes a small bit of improvement in mfaktc to overcome what you could get from p95. Using your numbers for example, a 26M exponent is worth ~22.5GHz-days. It takes about 2.7 days to run this on the 2500 CPU, so you're generating 8.3 GHz-days/day using 1 core.

If the GPU is outputting 168 GHz-days per day using 1 core, all you need is a 5% speedup from adding a second core to match that performance. So if you get more than 5% more mfaktc throughput by adding a second core, it's a net win.

To give an example from my fastest system, I need 3 cores to load up my GPU. Going to 4 cores lets each CPU sieve a bit deeper, giving me an overall 10% increase in performance (timings go from 7.3 sec/class with 3 instances to 8.8 sec a class with 4 instances). So if you're shooting for overall max GHz-days/day, that's a net win, assuming you're looking for max GHz-days credit. It's not intuitive that trading 25% of my CPU capability for a 10% speedup is the right thing to do, but GPUs are so much quicker than CPUs at generating GHz-days that you can't trust that 10% of one is equal to 25% of the other.

So the first question is: what are you trying to optimize? The second question is back to my original one: let's see timings for mfaktc running on 1, 2, 3, ... cores of each system (it makes sense to use the same exponent and bit depth for testing just to simplify stuff, and SievePrimes should stabilize in a few minutes of running, so you're not wasting that much time).
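The break-even arithmetic above can be condensed into a tiny sketch. The figures are the ones quoted in this post (22.5 GHz-days for a 26M LL, 2.7 days on one 2500 core, 168 GHz-days/day from the GPU); the helper names are mine:

```python
def ll_ghzdays_per_day(assignment_ghzdays, days_to_finish):
    """What one CPU core earns running a P95 LL test instead of feeding mfaktc."""
    return assignment_ghzdays / days_to_finish

def extra_core_is_net_win(gpu_ghzdays_per_day, mfaktc_speedup, ll_rate):
    """True if the extra mfaktc throughput beats the forgone LL credit."""
    return gpu_ghzdays_per_day * mfaktc_speedup > ll_rate

rate = ll_ghzdays_per_day(22.5, 2.7)           # ~8.3 GHz-days/day of LL
print(extra_core_is_net_win(168, 0.05, rate))  # 5% is just past break-even
```

The same comparison works for the 3-vs-4-core example: a 10% speedup on a GPU already producing well over 100 GHz-days/day dwarfs what the fourth core could earn on its own.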

bcp19 2012-03-08 17:04

[QUOTE=kjaget;292328]Again, without timings I can't say much.

Allow sieve primes to auto-adjust for both cases and you should get better throughput, at least on the 2400 system. Considering adding at least one more core on the 2500 system, maybe more for both. My thinking :

In general, it only takes a small bit of improvement in mfaktc to overcome what you could get from p95. Using your numbers for example, a 26M exponent is worth ~22.5GHz-days. It takes about 2.7 days to run this on the 2500 CPU, so you're generating 8.3 GHz-days/day using 1 core.

If the GPU is outputting 168GHz Days per day using 1 core, all you'll need is a 5% speedup by adding a second core to match that performance. If you get more than 5% more mfaktc throughput by adding a second core, it's a net win.

To give an example from my fastest system, I need 3 cores to load up my GPU. Going to 4 cores lets each CPU sieve a bit deeper, giving me an overall 10% increase in performance (timings go from 8.x sec/class with 3 instances to 10.x sec a class with 4 instances). So if you're shooting for overall max GHz-days/day, that's a net win, assuming you're looking for max GHz-days credit. It's not intuitive that trading 25% of my CPU capability for a 10% speedup is the right thing to do, but GPUs are so much quicker than CPUs at generating GHz-days that it makes sense.

So the first question is: what are you trying to optimize? The second question is back to my original one: let's see timings for mfaktc running on 1, 2, 3, ... cores of each system (it makes sense to use the same exponent and bit depth for testing just to simplify things, and sieve primes should stabilize within a few minutes of running, so you're not wasting much time).[/QUOTE]

<sigh> You are clearly not understanding. Imagine this: you have 3 computer systems, an overclocked 2500k, a normal 2400 and a normal 8200. If you have a single GPU, in which system would it run most efficiently? From what I have listed in previous messages, it appears that the 8200 would be most efficient, considering a single core of the 2500k can produce more LL work than the entire 8200 can, while the 8200 would only need 2 or 3 cores to produce the same amount of GPU output.

kjaget 2012-03-08 18:05

[QUOTE=bcp19;292339]<sigh> You are clearly not understanding. Imagine this: you have 3 computer systems, an overclocked 2500k, a normal 2400 and a normal 8200. If you have a single GPU, in which system would it run most efficiently? From what I have listed in previous messages, it appears that the 8200 would be most efficient, considering a single core of the 2500k can produce more LL work than the entire 8200 can, while the 8200 would only need 2 or 3 cores to produce the same amount of GPU output.[/QUOTE]

You have to balance that against the fact that a faster CPU will be able to sieve deeper, allowing the GPU to generate more GHz-days per GFLOP. My gut feel is that GPUs are so much more efficient at producing GHz-days that until you max out sieve primes, it always makes sense to use CPUs running mfaktc rather than anything else. So if you can do that by running the 8200, that might be the way to go. If you can't, the faster cards need to be paired up with faster CPUs.

kladner 2012-03-08 18:10

[QUOTE=kjaget;292341].....until you max out sieve primes.....[/QUOTE]

Just to make sure I understand correctly, what kind of numbers are you considering maxed out?

kjaget 2012-03-08 18:39

[QUOTE=kladner;292342]Just to make sure I understand correctly, what kind of numbers are you considering maxed out?[/QUOTE]

High, possibly up to the max of 200,000. From what I've been able to test, adding CPU cores once you've got the GPU to 100% usage gives more improvement in TF throughput than you lose by taking that CPU off of other tasks. But that's not a hard bar to clear, since GPUs are so much better at generating GHz-days of work. That was my point in showing that ~5% better GPU throughput is worth as many GHz-days as 100% of a CPU core.

I'm working from a small sample size (just my personal hardware at home), so I don't have enough different systems to say exactly where the break-even point is (and even that would just be a rough approximation). But I see lots of people locking sieve primes at 5000 to free up CPU time without measuring the TF GHz-days performance hit they take, just assuming that once the M candidates/sec figure is maxed out, there's nothing more mfaktc can do.

Of course, this exact tradeoff depends on a complex interaction of how efficient your CPU is at both LL testing and sieving and how fast the GPU is. That's why I keep going back to the fact it's hard to give specific recommendations without knowing mfaktc timings running a bunch of instances in any particular system. I'm genuinely interested to see them for various CPU & GPU combinations since I have a limited set to test with here at home.

I also keep reiterating that GHz-days/day isn't the only way to measure this, so there can be other correct answers. For example, you might be willing to give up the absolute max GHz-days/day if you value your ranking in each category above absolute total throughput (so a TF GHz-day isn't equally valuable to an LL GHz-day, or whatever). I can't argue with that approach, especially considering the GPU firepower.

Xyzzy 2012-03-08 18:54

We have four cores per box sieving for our GTX 570s. If we let mfaktc automatically choose the SievePrimes parameter, the cores go up to about 30,000 each and we net around 1250 GHz-days/day overall.

If we run Prime95 on each box as well, doing P-1 work using all four cores for one instance, and we set SievePrimes to 5000, we net around 1100 GHz-days/day overall.

In the second example, we are able to complete (roughly) three P-1 tests every two days per box.

Since each box has 8GiB or more of memory, it kinda makes sense to do the P-1 work and lose the 150 GHz-days/day. It does, however, push our temperatures up about 5°C.

We are not sure what the optimal settings are, but we know P-1 testing needs to be done, so we think it is more helpful for the project. While it is fun to crank out GHz-days/day, doing optimal work for the project is probably best. We are currently running through a pile of 70-71 bit TF work, but once that is done we will go back to just taking whatever work we get to 72 bits, which is kinda the goal of the project, or something like that.

(The fact that Craig is smoking us real bad makes it easier to make decisions like this.)

:max:

bcp19 2012-03-08 19:11

[QUOTE=kjaget;292341]You have to balance that against the fact that a faster CPU will be able to sieve deeper, allowing the GPU to generate more GHz-days per GFLOP. My gut feel is that GPUs are so much more efficient at producing GHz-days that until you max out sieve primes, it always makes sense to use CPUs running mfaktc rather than anything else. So if you can do that by running the 8200, that might be the way to go. If you can't, the faster cards need to be paired up with faster CPUs.[/QUOTE]

Actually, you are kind of thinking backwards: the slower systems will sieve to a higher SP per class, while the faster ones will sieve more classes in the same amount of time. If I set the adjust to 1 on the 2500k, it stays unchanged at 5000 SP, while the 2400 with 2 instances will battle back and forth for a while, then equalize around 14000-16000. Back when the 560 was in the 8200, the SP would equalize around 40k.

As a test, I just set the SP on the 2500 to 10k. M/s dropped by over 40 and it took over a minute longer to run the same exponent, which is approx a 5% decrease in throughput.

kjaget 2012-03-08 19:59

[QUOTE=bcp19;292348]Actually, you are kind of thinking backwards: the slower systems will sieve to a higher SP per class, while the faster ones will sieve more classes in the same amount of time. If I set the adjust to 1 on the 2500k, it stays unchanged at 5000 SP, while the 2400 with 2 instances will battle back and forth for a while, then equalize around 14000-16000. Back when the 560 was in the 8200, the SP would equalize around 40k.[/QUOTE]

If you're using 2 cores of 1 system and comparing it to 1 core on another, why are you saying that the 2 core setup is slower? 2 * 0.9 is more than 1 * 1.0 so I think you have the slower and faster setups backwards in your first sentence.

[QUOTE]As a test, I just set the SP on the 2500 to 10k. M/s dropped by over 40 and it took over a minute longer to run the same exponent, which is approx a 5% decrease in throughput.[/QUOTE]

This is a symptom of not having enough CPU power to max out the GPU. I was discussing the behavior after you add enough CPU cores to max out the GPU. Of course, if you don't have the CPU power to max out the GPU, asking that CPU to do more will make things even slower. But there are additional gains from adding more CPU cores once the GPU is at 100%, and that's what I'm discussing.

Try an experiment. Set sieve prime adjust to 1. Run 1 instance and let it stabilize and note how long a class takes. Then run 2 of the same exponent, again letting it stabilize and keep track of the time. Repeat for 3 and 4 (up to however many cores you have). Post the results here and I'll show you what I'm talking about with respect to scaling.

Your throughput will increase rapidly with each additional core until you load the GPU to 100%. Then you'll see smaller increases as the higher sieve primes make the GPU run quicker per class. The large increase is obviously worth it; the smaller one is closer to GHz-day parity with CPU power, so it takes more careful analysis to figure out whether it's worth it.

bcp19 2012-03-08 21:27

[QUOTE=kjaget;292359]If you're using 2 cores of 1 system and comparing it to 1 core on another, why are you saying that the 2 core setup is slower? 2 * 0.9 is more than 1 * 1.0 so I think you have the slower and faster setups backwards in your first sentence. [/QUOTE]
Apples and oranges. 1 core on system A can do 9.6 iter/day; 2 cores on system B can do 8.64 iter/day (4.32 each). Two cores of B deliver only 0.9 times what one core of A does, so 2 cores of B < 1 core of A.

[quote]This is a symptom of not having enough CPU power to max out the GPU. I was discussing the behavior after you add enough CPUs to max out the GPU. Of course if you don't have the CPU power to max out the GPU asking that CPU to do more will make things even slower. But there's additional gains by adding more CPU cores once the GPU is at 100%, and that's what I'm discussing.[/quote]
1 core (A) = 99% GPU load (maxed), 2 core (B) = 99% GPU load (maxed)

[Quote]Try an experiment. Set sieve prime adjust to 1. Run 1 instance and let it stabilize and note how long a class takes. Then run 2 of the same exponent, again letting it stabilize and keep track of the time. Repeat for 3 and 4 (up to however many cores you have). Post the results here and I'll show you what I'm talking about with respect to scaling.

Your throughput will increase rapidly with each additional core until you load the GPU 100%. Then you'll see smaller increases as the increased sieve primes make the GPU run quicker per class. The large increase is obviously worth it, the smaller one is the one that's closer to a GHz-day parity with CPU power so it takes more careful analysis to figure out whether it's worth it.[/Quote]
1 core of a 2500k, adjust = 1, SP = 5000, 16.5 min = ~87 exp/day, CPU wait <3%, ~171 GHzD/day
2 cores of a 2500k, adjust = 1, SP = ~25000, 31 min each = ~93 exp/day, CPU wait ~20% each, ~183 GHzD/day (should clarify this: once SP climbed above 25k, the est. time also increased, so the run was actually adjust = 0 and SP = 25k)
1 core of a 2400, adjust = 1, SP = 5000, ~25 min = ~58 exp/day, <3% wait, ~114 GHzD/day
2 cores of a 2400, adjust = 0, SP = 5000, ~36 min each, ~80 exp/day, CPU wait 20%, ~157 GHzD/day
2 cores of a 2400, adjust = 1, SP = ~12000, ~33.75 min, ~85.3 exp/day, <3% CPU wait, ~168 GHzD/day

So, a 7% gain on the 2500 for 160% CPU usage. Barely worth using a 2nd core. With such results, I did not continue.

The 2400 with 1 core is obvious: not enough CPU. 2 cores @ 5k vs. 2 cores @ 12k: the same 7% gain, but CPU usage goes from ~160% to ~200%. I did not bother to continue.

It can be argued that since one has a 560 and the other a 560 Ti, you cannot adequately compare these, but it sure seems like the 2400 is more efficient.

Edit: I reran it after thinking about the slowdown. I must have had a process running, as the new run reached 36k SP and ~25 min, giving a 32% increase. A fair improvement, but it takes a lot of resources. It's possible similar increases could be had on the 2400; I will have to test later.

kladner 2012-03-08 21:36

[QUOTE=kjaget;292345]High, possibly up to the max of 200,000. From what I've been able to test, adding CPU cores once you've got the GPU to 100% usage gives more improvement in TF throughput than you lose by taking that CPU off of other tasks. But that's not a hard bar to clear, since GPUs are so much better at generating GHz-days of work. That was my point in showing that ~5% better GPU throughput is worth as many GHz-days as 100% of a CPU core.

I'm working from a small sample size (just my personal hardware at home), so I don't have enough different systems to say exactly where the break-even point is (and even that would just be a rough approximation). But I see lots of people locking sieve primes at 5000 to free up CPU time without measuring the TF GHz-days performance hit they take, just assuming that once the M candidates/sec figure is maxed out, there's nothing more mfaktc can do.

Of course, this exact tradeoff depends on a complex interaction of how efficient your CPU is at both LL testing and sieving and how fast the GPU is. That's why I keep going back to the fact it's hard to give specific recommendations without knowing mfaktc timings running a bunch of instances in any particular system. I'm genuinely interested to see them for various CPU & GPU combinations since I have a limited set to test with here at home.

I also keep reiterating that GHz-days/day isn't the only way to measure this, so there can be other correct answers. For example, you might be willing to give up the absolute max GHz-days/day if you value your ranking in each category above absolute total throughput (so a TF GHz-day isn't equally valuable to an LL GHz-day, or whatever). I can't argue with that approach, especially considering the GPU firepower.[/QUOTE]

Thanks for the response. I have been watching developments in CUDALucas and considering rearranging what I run. Right now on a 1090T @ 3.5GHz w/16GB RAM, that is 3 P-1 cores, 1 LL/DC core, and 2 feeding mfaktc on a GTX 460. I am not really ready to change, as yet.

On the other hand, I would be willing to produce some data on different numbers of mfaktc instances if that would be useful.

EDIT: At the moment, I'm trying out locking SievePrimes at 14000 for two instances. This was aimed at making other programs run a little better on the system. Clearly, there are trade-offs. If I left mfaktc to decide, it would be running SP at ~18-19k, depending on the exponents. The GPU fluctuates around 95%, and can be driven up with 3 instances. The thing is, I'm not sure I can live with the system under those circumstances. This is my only machine, and I need it to behave moderately well for general use.

kladner 2012-03-09 05:01

Some tests Part 1
 
1 Attachment(s)
@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL] I will have to put these test runs up in more than one post. I guess I should have redirected the outputs to text files. In any case, I ran from 1 to 4 instances of mfaktc, with affinities set to individual cores of the 1090T. For this test, I used the same exponent for all instances. I let the tests run until Sieve Primes had held steady for several classes.

These are the results for 2 and 3 instances.

kladner 2012-03-09 05:03

1 Attachment(s)
@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL]

These are the results for 1 instance.

kladner 2012-03-09 05:05

1 Attachment(s)
@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL]

These are the results for 4 instances.

