mersenneforum.org  

Old 2011-06-16, 18:15   #1024
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

2×3³×109 Posts

Would it be possible to split mfaktc into two programs? One would generate the candidates and write them to disk, and the other would pass the candidates to the GPU. This would completely remove the CPU bottleneck, as long as enough computers are available (in theory spread across mersenneforum rather than just one person, depending on file size).
Old 2011-06-16, 18:33   #1025
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Quote:
Originally Posted by henryzz
Would it be possible to split mfaktc into two programs? One would generate the candidates and write them to disk, and the other would pass the candidates to the GPU.
Possible but not feasible.
Currently mfaktc needs
  • one 8 byte integer per grid (k_base in src/tf_common.cu)
  • one 4 byte integer per factor candidate (XXX_ktab[] in src/tf_common.cu)
So 100M candidates per second need 400 MB/sec written to and read from disk.

If you manage to reduce the needed bandwidth per factor candidate (FC) to one 1-byte integer, you'll need 100 MB/sec. 1 byte per FC is easy if you evaluate those FCs serially, but not so easy if you need to do it highly parallel and independently. But even if you get it down to 1 bit per FC, you'll still need 12.5 MB/sec for 100M candidates per second.
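
The arithmetic behind these figures can be checked with a short Python sketch. The function name `disk_bandwidth_mb_per_s` is made up for illustration and is not part of mfaktc; it only multiplies the candidate rate by the storage size per candidate, using decimal megabytes (1 MB = 10^6 bytes) as in the numbers above.

```python
def disk_bandwidth_mb_per_s(candidates_per_s, bits_per_candidate):
    """Return the sustained disk bandwidth in MB/s (1 MB = 10**6 bytes).

    Hypothetical helper for illustration only: it ignores the 8-byte
    k_base per grid, which is negligible next to the per-candidate data.
    """
    bytes_per_s = candidates_per_s * bits_per_candidate / 8
    return bytes_per_s / 1e6


RATE = 100_000_000  # 100M factor candidates per second, as in the post

for bits in (32, 8, 1):  # 4-byte integer, 1 byte, 1 bit per candidate
    print(f"{bits:2d} bits/FC -> {disk_bandwidth_mb_per_s(RATE, bits)} MB/s")
```

Each figure applies once for writing and once for reading, so a disk shared by both programs would have to sustain both streams at the same time.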

Oliver
Old 2011-06-16, 18:35   #1026
davieddy
 
"Lucan"
Dec 2006
England

6474₁₀ Posts

Quote:
Originally Posted by henryzz
Would it be possible to split mfaktc into two programs? One would generate the candidates and write them to disk, and the other would pass the candidates to the GPU. This would completely remove the CPU bottleneck, as long as enough computers are available (in theory spread across mersenneforum rather than just one person, depending on file size).
I think the number of candidates makes it sensible to make them
"on the fly". I worked with batches of 15,015 x 8 bits for reasons
anyone who has tried sieving (which I know includes you!) will understand.
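
The post does not spell out those reasons. One plausible reading (an assumption, not confirmed anywhere in the thread) is that 15,015 = 3 × 5 × 7 × 11 × 13, so the pattern left by sieving out multiples of those five primes repeats exactly every 15,015 positions and can be computed once and reused. A small Python sketch of that periodicity, with hypothetical names throughout:

```python
from math import prod

SMALL_PRIMES = (3, 5, 7, 11, 13)
PERIOD = prod(SMALL_PRIMES)  # 3 * 5 * 7 * 11 * 13 = 15015


def survivors(start, length):
    """Flag which of the integers start..start+length-1 survive sieving.

    A position survives if it is divisible by none of SMALL_PRIMES.
    Illustrative only; this is not how mfaktc lays out its sieve.
    """
    return [all((start + i) % r for r in SMALL_PRIMES) for i in range(length)]


first_block = survivors(0, PERIOD)
second_block = survivors(PERIOD, PERIOD)

print(PERIOD)                       # length of one batch
print(first_block == second_block)  # the pattern repeats block after block
print(sum(first_block))             # survivors per block
```

The survivor count per block is (3-1)(5-1)(7-1)(11-1)(13-1) = 5760, so under this reading roughly 38% of the positions in each batch remain for the larger sieve primes.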

David

Last fiddled with by davieddy on 2011-06-16 at 18:40
Old 2011-06-16, 18:45   #1027
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

1011011111110₂ Posts

Quote:
Originally Posted by davieddy
I think the number of candidates makes it sensible to make them
"on the fly". I worked with batches of 15,015 x 8 bits for reasons
anyone who has tried sieving (which I know includes you!) will understand.

David
Good point: the volume is just too great. Even if the storage space were available, the disk drive would struggle to keep up.
Old 2011-06-16, 18:47   #1028
nucleon
 
Mar 2003
Melbourne

1003₈ Posts

Quote:
Originally Posted by Christenson
The arguments for being "slightly broken" are as follows:
a) doesn't quite tickle the server optimally when results are reported manually.....
b) requires manual care and feeding, instead of being able to be told to go get work from the server, and having results show up on the server automagically.
c) mfaktc uses the CPU to sieve, so you need a decent CPU core to feed a good GPU card.
d) It can break if interrupted...needs to keep multiple checkpoint files when working on large jobs.
e) Once those issues are fixed, I'd argue the program is perfect....all of these have to do with care and feeding.
You're talking about the biggest improvement in speed for a given cost basis (both initial and ongoing) since the project's inception, and you're complaining.

Sheesh, tough crowd.

a) Not sure exactly what you mean here. I take it you're annoyed when your video card does the equivalent of x GHz-days of work and you get x/100 credit because it found a 'cheap' factor. Seriously, deal with it. We're all in the same boat here. And I'm assuming it's the same for CPU TF work.
b) Write a lynx script if it annoys you (it's not difficult). I'm hardly a script guru and I did one in a weekend. I'm a fan of what I suggested previously: have prime95 offer a generic extensions option, rather than writing custom code to sit on top of mfaktc. As a general rule, custom code always costs more in the long run than using generic code.
c) So, it's using the best of both worlds. Anything else is a compromise. A decent video card requires a decent PC. Match them appropriately and you shall be rewarded.
d) Do a breadth-first search rather than a depth-first search. Look at my stats above: more factors are found doing a large number of small TFs than doing large bit depth TFs. Or do a periodic rsync to another medium if it really concerns you. If you're worried about GPU efficiency, combining some of the earlier stages with "Stages=0" gives similar efficiencies to doing larger bit depths.
e) All the issues above have workarounds if they concern you.

I have nothing but praise for mfaktc; the performance it's getting is awesome. I'm getting 10x the results for a similar cost basis.

I'd buy Oliver a beer, if we were near each other. :)

I guess I'm getting defensive at the 'slightly broken' phrase. If you had said "here's a list of suggested improvements", I probably wouldn't have gotten so defensive.

-- Craig
Old 2011-06-16, 19:05   #1029
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Hi Craig,

Quote:
Originally Posted by nucleon
d) Do a breadth-first search rather than a depth-first search. Look at my stats above: more factors are found doing a large number of small TFs than doing large bit depth TFs. Or do a periodic rsync to another medium if it really concerns you. If you're worried about GPU efficiency, combining some of the earlier stages with "Stages=0" gives similar efficiencies to doing larger bit depths.
With Stages=1 mfaktc will combine "small" bitlevels automatically. Do you think I should increase the autocombining limit a little bit?

Oliver
Old 2011-06-17, 03:32   #1030
Christenson
 
Dec 2010
Monticello

5·359 Posts

Nucleon, don't take offense...what you have to understand is that I code for a living, and code is never "perfect"....and I intend to address these opportunities by submitting the patches to Mr Oliver, "The Judger"...it's just taken me longer than I would like to find the "gwthread.c" routines in P95 to re-use, and I'm more distractible than I'd like. P95's code for communications is actually pretty straightforward, and indeed uses mutexes to keep lines, messages, and results in one piece between different threads. The only change I would make in P95 is to add a call to block further low-priority (communications thread) access to mutexes when a high-priority thread is within a few seconds of reporting a result and getting more work to do.
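
As a rough illustration of the mutex idea only (this is not Prime95's gwthread.c code, and `results_lock` and `report_result` are hypothetical names), here is a minimal Python sketch in which several worker threads append result lines to one file, with a lock keeping each line in one piece:

```python
import threading

# One process-wide mutex guarding the shared results file (hypothetical name).
results_lock = threading.Lock()


def report_result(path, line):
    """Append one complete result line to the shared file.

    Holding the lock for the whole open/write/close sequence means a line
    written by one thread can never be interleaved with another thread's.
    """
    with results_lock:
        with open(path, "a") as f:
            f.write(line + "\n")
```

A worker thread would call something like `report_result("results.txt", "no factor for M... from 2^68 to 2^69")` when it finishes an assignment; both the file name and the message text here are placeholders.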

As for "splitting" mfaktc into sieving and factoring parts, in some sense, it is split now. The issue with the sieving is that it is to some degree CPU-bound, since there are now several hundred parallel CPUs on the GPU that can use the output of the sieve to run TF tests.

I would argue that it might be useful to move the sieving process onto the GPU. The underlying requirement is significant bandwidth, as described above, from the sieving process to the TF testing process. You wouldn't want the sieve output to cross the disk, just memory, which leads back to the single process, multithreaded model now in use. It is a question of whether we can effectively (low cost task switch, maybe stay off the main PCI bus) run heterogeneous threads on the GPU.
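
The single-process model, where sieve output crosses memory rather than disk, can be sketched in Python as a producer thread feeding a consumer through a bounded in-memory queue. This is a toy under stated assumptions, not mfaktc's design: all names are hypothetical, the "GPU" side is an ordinary thread, and the sieve only uses the standard facts that any factor q of 2^p - 1 (p an odd prime) has the form q = 2kp + 1 with q ≡ ±1 (mod 8).

```python
import queue
import threading

SMALL_PRIMES = (3, 5, 7, 11, 13)


def sieve_candidates(p, k_max, out_q):
    """Producer: put every k whose q = 2*k*p + 1 survives the sieve."""
    for k in range(1, k_max + 1):
        q = 2 * k * p + 1
        if q % 8 in (1, 7) and all(q % r != 0 or q == r for r in SMALL_PRIMES):
            out_q.put(k)
    out_q.put(None)  # sentinel: no more candidates


def trial_factor(p, in_q, found):
    """Consumer: test each surviving candidate; stands in for the GPU."""
    while True:
        k = in_q.get()
        if k is None:
            break
        q = 2 * k * p + 1
        if pow(2, p, q) == 1:  # q divides 2**p - 1
            found.append(q)


def find_factors(p, k_max):
    """Run the sieve and the tester concurrently, linked only by memory."""
    q = queue.Queue(maxsize=1024)  # bounded, so the sieve cannot run far ahead
    found = []
    producer = threading.Thread(target=sieve_candidates, args=(p, k_max, q))
    consumer = threading.Thread(target=trial_factor, args=(p, q, found))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return found


print(find_factors(11, 50))  # 2**11 - 1 = 2047 = 23 * 89
```

The bounded queue plays the role of the candidate buffers: if the consumer is faster than the sieve, it sits idle waiting on `get()`, which is the CPU-bound situation discussed in this thread.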

To me the major problem with using up a significant part of a good CPU is that that CPU is taken away from other GIMPS work, particularly LL tests....as I calculated above, we might remove 10% additional candidates from the LL pool, so lots and lots of LL still has to be done.
Old 2011-06-17, 04:42   #1031
davieddy
 
"Lucan"
Dec 2006
England

194A₁₆ Posts

Quote:
Originally Posted by nucleon
I guess I'm getting defensive at the 'slightly broken' phrase. If you had said "here's a list of suggested improvements", I probably wouldn't have gotten so defensive.

-- Craig
It was I who compared "slightly broken" with
being "slightly pregnant", thinking (narrow-mindedly) that a competent
exhaustive search for factors was not likely to miss 30% of them.

He (Christenson) was merely pointing out my naivety!

Anyway, the mystery of the low factor discovery rate has been solved,
with the realization that P-1 had already been done.

David
Old 2011-06-17, 11:12   #1032
Xyzzy
 
"Mike"
Aug 2002

10000000101010₂ Posts

FWIW, we like mfaktc the way it is now. The UI is simple and getting work queued up is no problem. None of our computers used for this are networked externally, so we just load things up manually in two week chunks. Not having a GUI is a plus!

We have had only one issue overall, but that was due to user error. (The forum assistant responsible has been beaten mercilessly.)

We have always liked the idea of programs doing one thing well, and chaining programs together to do what we want.
Old 2011-06-17, 11:33   #1033
Christenson
 
Dec 2010
Monticello

1795₁₀ Posts

The discussion and proposal have simply been to automate the fetching of work and reporting of results, optionally, just as mprime has an option to use or not use primenet.

I'm still feeding mfaktc manually; it's just that two weeks at a time seems like an awful lot of assignments to handle at once.
Old 2011-06-17, 13:33   #1034
nucleon
 
Mar 2003
Melbourne

5×103 Posts

Quote:
Originally Posted by TheJudger
Hi Craig,

With Stages=1 mfaktc will combine "small" bitlevels automatically. Do you think I should increase the autocombining limit a little bit?

Oliver
Yeah, I noticed the auto-combine feature. It's easy for me to say increase it, but I'm dealing with GTX580s/460s. I'm guessing people with lower-end hardware would rather see more frequent results than higher efficiency.

But by all means include something in the readme or the ini file under stages. The point of diminishing returns for combining seems to be around k-4 or k-3, where k is the last bit depth, for the exponents I looked at.

-- Craig

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.