mersenneforum.org  

Old 2012-03-05, 04:31   #100
flashjh
 
Quote:
Originally Posted by aaronhaviland
1.48 and 1.64, why?
Just wondering/making sure the version was reliable for you to compare with.
Old 2012-03-05, 05:16   #101
Dubslow

Why does Prime95 keep roundoff error below .24 if CL and gL both seem to work fine with more than that? (Why not go up to like .45 or something?)
Old 2012-03-05, 05:47   #102
Prime95

Quote:
Originally Posted by Dubslow
Why does Prime95 keep roundoff error below .24 if CL and gL both seem to work fine with more than that? (Why not go up to like .45 or something?)
If the average max roundoff error for 1000 iterations is above 0.24 then it is not uncommon for one or more of the tens of millions of iterations to come out above 0.4 -- which is getting into dangerous territory.
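For illustration, here is a minimal sketch of the kind of per-iteration check this implies: track the worst distance from an integer seen while rounding the inverse-FFT output, and treat anything past a hard limit (around 0.4) as unsafe. The kernel name, launch setup, and exact limit below are illustrative, not taken from Prime95 or gpuLucas.
Code:
// Hypothetical sketch: record the worst |x - round(x)| of this iteration
// while converting FFT output back to integers.  maxErrBits must be
// zeroed on the host before the launch.
__global__ void roundAndTrackError(const double *fftOut, long long *digits,
                                   int *maxErrBits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double r   = rint(fftOut[i]);
    float  err = (float)fabs(fftOut[i] - r);
    digits[i]  = (long long)r;

    // For non-negative floats the IEEE bit pattern orders the same way
    // as the value, so an integer atomicMax acts as a float max.
    atomicMax(maxErrBits, __float_as_int(err));
}
After the iteration the host reads maxErrBits back, reinterprets the bits as a float, and if the value is above roughly 0.4 the result is discarded and the test moved to a larger FFT length.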
Old 2012-03-05, 06:45   #103
Dubslow

Ah, okay, and that's what produces the "reproducible error" message?
Old 2012-03-05, 20:48   #104
Prime95

I have some ideas for speeding up the carry propagation step. Before proceeding, can anyone tell me what percentage of the time is spent on each of the various steps: forward FFT, point-wise squaring, inverse FFT, and carry propagation (which includes applying weights and converting to and from integer)?

I can describe the ideas to anyone who wants to run with it, or else you'll have to wait for me to install an nVidia environment and learn how to use it.
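For anyone mapping out where the time goes, here is a rough outline of how one squaring iteration splits into those steps. Kernel and variable names are made up, kernel bodies are elided, and an in-place complex-to-complex cuFFT plan is assumed; gpuLucas's actual code will differ.
Code:
// Hypothetical outline of one LL squaring step; only the call sequence
// and the grouping of work into steps matter here.
#include <cufft.h>

__global__ void applyWeightsAndLoad(const int *digits, const double *weight,
                                    cufftDoubleComplex *x, int n);      // int -> weighted double
__global__ void pointwiseSquare(cufftDoubleComplex *x, int n);          // x[i] = x[i]^2
__global__ void unweighRoundAndCarry(const cufftDoubleComplex *x,
                                     const double *invWeight,
                                     int *digits, int n);               // double -> int + carries

void llSquareStep(cufftHandle plan, int *digits, const double *weight,
                  const double *invWeight, cufftDoubleComplex *x, int n)
{
    const int threads = 256, blocks = (n + threads - 1) / threads;

    applyWeightsAndLoad<<<blocks, threads>>>(digits, weight, x, n);       // carry-side work
    cufftExecZ2Z(plan, x, x, CUFFT_FORWARD);                              // forward FFT
    pointwiseSquare<<<blocks, threads>>>(x, n);                           // point-wise squaring
    cufftExecZ2Z(plan, x, x, CUFFT_INVERSE);                              // inverse FFT
    unweighRoundAndCarry<<<blocks, threads>>>(x, invWeight, digits, n);   // carry propagation
}
Timing each of those five calls separately (CUDA events or the profiler) gives exactly the percentage breakdown asked for above.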
Old 2012-03-05, 21:37   #105
ixfd64

Hmm, is George hinting that GPU support will be his next project?
Old 2012-03-06, 03:28   #106
aaronhaviland
 
Quote:
Originally Posted by Prime95
I have some ideas for speeding up the carry propagation step. Before proceeding, can anyone tell me what percentage of the time is spent on each of the various steps: forward FFT, point-wise squaring, inverse FFT, and carry propagation (which includes applying weights and converting to and from integer)?

I can describe the ideas to anyone who wants to run with it, or else you'll have to wait for me to install an nVidia environment and learn how to use it.
Here's a breakdown of the various kernel runtimes on my GPU (sorted by most expensive):
Code:
Environment: GTX 460, Cuda3.2, driver 295.20, x86_64

testPrime 216091, signalSize 12288 (profiled approx 5000 iterations)
Kernel                      % of Total GPU Time
CUFFT (Both directions)        77.08%
llintToIrrBal<3>                8.91%
loadIntToDoubleIBDWT            5.44%
invDWTproductMinus2             4.92%
ComplexPointwiseSqr             3.65%


testPrime 26199377, signalSize 1474560, (profiled approx 9000 iterations)
CUFFT (Both directions)        70.79%
llintToIrrBal<3>                8.53%
loadIntToDoubleIBDWT            7.77%
invDWTproductMinus2             7.71%
ComplexPointwiseSqr             5.20%
(Grouped under CUFFT are multiple sub-kernel launches within the CUFFT library, depending on how it breaks down the transform)

I haven't really looked much at optimising the kernels themselves yet, but I'm definitely open to ideas.

All of the kernels are memory-bound, and how little work is done in each of the three small kernels (loadIntToDoubleIBDWT, invDWTproductMinus2, and ComplexPointwiseSqr) bothers me: the launch overhead is likely large compared to the work actually performed.
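For what it's worth, one generic way to attack both the launch count and the memory traffic is to fuse adjacent element-wise passes so the intermediate array never round-trips through global memory. A sketch of the idea, with made-up names and details rather than gpuLucas's actual kernels:
Code:
// Hypothetical fusion of two element-wise passes: apply the 1/N scaling and
// the inverse IBDWT weight (plus the -2 of the LL recurrence) and round to
// an integer in one kernel, instead of writing the intermediate doubles to
// global memory and launching a second kernel to read them back.
__global__ void fusedUnweighAndRound(const double *ifftOut, const double *invWeight,
                                     long long *out, double invN,
                                     int minusTwoIndex, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double v = ifftOut[i] * invN * invWeight[i];   // first pass: unweigh + normalize
    if (i == minusTwoIndex) v -= 2.0;              //             (s^2 - 2)

    out[i] = llrint(v);                            // second pass: round, with no extra
                                                   //              global-memory round trip
}
Every fusion like this saves a kernel launch plus one full write and one full read of the signal, which is where memory-bound kernels spend their time.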
Old 2012-03-06, 03:55   #107
odin
 
Hi, I would like to try this on my GTX 295. I'm running 64-bit Windows 7.
Can someone please post a compiled version of this program for my platform?
Please also include which version of CUDA I have to install to run it.

Thanks.
Old 2012-03-06, 04:20   #108
Prime95

Quote:
Originally Posted by aaronhaviland
All of the kernels are memory-bound, and how little work is done in each of the three small kernels (loadIntToDoubleIBDWT, invDWTproductMinus2, and ComplexPointwiseSqr) bothers me: the launch overhead is likely large compared to the work actually performed.
Thanks. I doubt anything can be done to speed up the point-wise squaring, but I think we can greatly reduce the memory costs of the carry code, which is taking 24% of the time. Cutting that in half seems a reasonable goal.
Old 2012-03-15, 03:19   #109
aaronhaviland
 
Quote:
Originally Posted by aaronhaviland
I'm a little concerned about the residue mismatch I just got on M(26171441), but since I had restarted it several times, and changed a few things, including the checkpoint format itself, it was most likely my fault. I'm re-starting the test... it's using an FFT length with a high round-off error (around 0.37) so I can test out what an acceptable round-off error should be with this method. (So far, residues are matching CUDALucas through around 250,000 iterations.)
Finally... got a match. Tagged this code in git as v0.9.3:

M( 26171441 )C, 0x449e471e42bfe489, n = 1474560, gpuLucas v0.9.3

Estimated run-time was over-enthusiastic. Run-time was about even with CUDALucas. The estimate calculation has been adjusted slightly to compensate.

As I mentioned, the kernels are *very* memory throughput bound, but I believe I may have already found a few methods to ease some of this strain... need to do more tests.
Old 2014-07-28, 22:14   #110
GhettoChild

So I'm wondering: where do I get gpuLucas to test it out? Here? https://github.com/Almajester/gpuLucas ? I'm currently running multiple instances of CUDALucas v2.05Beta on both GPUs of a GTX 295, and I can run a single instance of MFaktC v0.20 alongside them, but that's all: just three instances of these apps together at most. I'd like to try out gpuLucas and compare its performance with CUDALucas, but is there a version I can download and test without having to compile it myself? I've never used this makefile compiling stuff before and I'm really not confident I can use it properly; I'm a Windows user.
