mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2013-10-08, 22:53   #232
Robish
 
"Rob Gahan"
Aug 2013
Ireland

2²×3² Posts
Smile

Quote:
Originally Posted by kracker View Post
I'm surprised CULu allowed it?
Iteration 289390000 M( 332233123 )C, 0xffffffff80000000, n = 2097152, CUDALucas
v2.03 err = 0.0000 (0:45 real, 4.4768 ms/iter, ETA 53:16:24)
Iteration 289400000 M( 332233123 )C, 0xffffffff80000000, n = 2097152, CUDALucas
v2.03 err = 0.0000 (0:45 real, 4.4787 ms/iter, ETA 53:17:03)
Iteration 289410000 M( 332233123 )C, 0xffffffff80000000, n = 2097152, CUDALucas
v2.03 err = 0.0000 (0:44 real, 4.4792 ms/iter, ETA 53:16:38)

Does this not look OK? err = 0.0000 is not good then, eh?
So 20971520 was probably correct then?
I wouldn't mind, but I was a fifth of the way through when I changed.
Maybe I'll continue where I left off and report back in 152 days or so :-(

I'm new to this... I'll get there... eventually ;-)
Robish is offline   Reply With Quote
Old 2013-10-08, 23:27   #233
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

10011000100111₂ Posts
Default

Quote:
Originally Posted by Robish View Post
I'm new to this, .....Ill get there.....eventually ;-)
I hope you have a *great* deal of chain mail. And silk clothing.

You'll need it around here...

(That's meant to be funny, and serious, at the same time.)
chalsall is offline   Reply With Quote
Old 2013-10-08, 23:39   #234
Robish
 
"Rob Gahan"
Aug 2013
Ireland

2²·3² Posts
Talking

Quote:
Originally Posted by chalsall View Post
I hope you have a *great* deal of chain mail. And silk clothing.

You'll need it around here...

(That's meant to be funny, and serious, at the same time.)

Thanks for the tip, :-) ...bit frosty alright, hope it warms up a bit soon ;-)
Robish is offline   Reply With Quote
Old 2013-10-09, 00:29   #235
kladner
 
kladner's Avatar
 
"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts
Default

Quote:
Originally Posted by Robish View Post
Thanks for the tip, :-) ...bit frosty alright, hope it warms up a bit soon ;-)
There are very many who share that sentiment. Do hang in there!
kladner is offline   Reply With Quote
Old 2013-10-09, 06:25   #236
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

7²·197 Posts
Default

@Robish: FFT is used to multiply big numbers very fast. Imagine you have to multiply two 1000-digit numbers with pencil and paper: you would have to do one million single-digit multiplications. Like the last digit of the second number with all digits of the first, then the second-to-last digit of the second with all digits of the first, and so on. This could take most of your life. Now imagine two numbers of one million digits each: to multiply them you would need a million millions, i.e. one trillion, digit operations. None of your children, or their children, or any of your descendants would ever finish...

And for an LL test, we are talking about numbers of a hundred million digits, and you have to multiply them a hundred million times. To make things much faster, the integers are transformed into "some real/complex numbers", "multiplied fast" using some tricks, then transformed back to integers. Because we cannot store real numbers exactly in computers (think of an infinite number of decimals), only approximations of them, this approximation error propagates through the transformations, so every time we transform back to integers we have to check how big the error is.
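To make the scale concrete: the Lucas-Lehmer test itself is just one squaring loop, as in this minimal Python sketch (schoolbook big-integer arithmetic, written purely for illustration; the real clients replace the multiplication with the FFT machinery described above):

```python
def lucas_lehmer(p):
    """Return True iff the Mersenne number 2^p - 1 is prime (p an odd prime)."""
    m = (1 << p) - 1            # the Mersenne number M(p)
    s = 4                       # standard LL starting value
    for _ in range(p - 2):      # p-2 squarings; each squaring is what FFT speeds up
        s = (s * s - 2) % m
    return s == 0

# M(7) = 127 is prime; M(11) = 2047 = 23 * 89 is not
print(lucas_lehmer(7), lucas_lehmer(11))  # -> True False
```

Python's built-in integers help a little (they use Karatsuba multiplication for big operands), but at a hundred million digits only FFT-based multiplication is practical.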

Thinking like an engineer, as someone said here a few days ago, we do an "approximate multiplication", but one which is much faster (well, not really, but to draw a parallel...)

Both "too big" error and "too small" error are bad.

With a "too big" error you risk getting a different integer when you round the numbers back.

Suppose you use analog circuitry to multiply 5 and 3, much faster than reading the multiplication table inside a CPU. You could use analog circuits (op-amps): put 5 V into one input, 3 V into the other, and you instantly get the result, 15.2 (or 14.8), at the output. That is not a problem: you know the result is an integer, and round it to 15 in your head. But if you get a result like 15.8, or even 19.5 V, well, you have to use a more accurate op-amp...

On the other hand, imagine your op-amps only work for voltages over 2.7 V: you will never be able to multiply 2*2, because you put 2 V on the inputs and only get zero at the output. The rounding error to the closest integer is zero, but the result is still useless, and op-amps with a lower input threshold should be used.

It is the same with FFT: we transform "digital" "integer" stuff into "analog" "real signals", "multiply" them, then transform back.
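That transform/multiply/round cycle, including the err value the clients print, can be sketched in a few lines of Python with numpy (plain decimal digits here instead of the balanced-word representation the real clients use; illustrative only):

```python
import numpy as np

def fft_mul(a, b, base=10):
    """Multiply two little-endian digit lists via FFT; return (digits, roundoff err)."""
    n = 1
    while n < len(a) + len(b):      # room for the full product polynomial
        n *= 2
    fa = np.fft.rfft(np.array(a, dtype=float), n)
    fb = np.fft.rfft(np.array(b, dtype=float), n)
    raw = np.fft.irfft(fa * fb, n)  # the "analog" result: near-integers
    err = float(np.max(np.abs(raw - np.round(raw))))  # like the err the clients report
    coeffs = np.round(raw).astype(np.int64)
    digits, carry = [], 0
    for c in coeffs:                # propagate carries back to base-10 digits
        carry += int(c)
        digits.append(carry % base)
        carry //= base
    while carry:
        digits.append(carry % base)
        carry //= base
    while len(digits) > 1 and digits[-1] == 0:
        digits.pop()
    return digits, err

digits, err = fft_mul([4, 3, 2, 1], [8, 7, 6, 5])   # 1234 * 5678
print(int("".join(map(str, reversed(digits)))), err < 0.5)  # -> 7006652 True
```

If err creeps toward 0.5, a rounded coefficient may land on the wrong integer, which is exactly why a too-large error is fatal, and why moving to a larger FFT length (fewer bits per word) brings the error back down.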

For more mathematical detail, you can read the Math page on the GIMPS server, or the wiki (follow the links on the Math page), or better, look into D. Knuth's book mentioned there (Vol. 2; take a push cart with you when you go to the library or bookshop, that is a brick of a book, large format, a thousand pages...) for fast FFT multiplication algorithms.

What you have to keep in mind: the approximation is dimensioned in such a way that an error is expected (that is, it MUST appear, and it should not be zero, if the op-amps really work!), and it is expected to be between, say, 0.001 and 0.49; otherwise the result is wrong.

The fact that clLucas accepts that is WRONG. Keep in mind that this tool is "under development". Msft and a few others here are currently investing a lot of time in it.

Last fiddled with by LaurV on 2013-10-09 at 07:22 Reason: typos, lots of them...
LaurV is offline   Reply With Quote
Old 2013-10-09, 08:51   #237
Robish
 
"Rob Gahan"
Aug 2013
Ireland

2²×3² Posts
Smile

Quote:
Originally Posted by LaurV View Post
@Robish: FFT is used to multiply big numbers very fast. [...] The fact that clLucas accepts that is WRONG. [snip; full post quoted above]

Thanks a million, LaurV, for your detailed and clear explanation. Very helpful, and it looks like I've some research to do ;-)

I'll continue the proper tests and show the output here, but at 190 days each I think they'll be the last (100 million digit) ones I do. Well, until Maxwell, maybe... ;-)

Anyway, I appreciate you taking the time to demystify some of this "Black Art". With more help like yours, maybe a lot more newbies (like me) will come on board and speed up the search.

Thanks again

Rob.
Robish is offline   Reply With Quote
Old 2013-10-09, 10:12   #238
Robish
 
"Rob Gahan"
Aug 2013
Ireland

100100₂ Posts
Smile

Quote:
Originally Posted by Robish View Post
Thanks a million, LaurV, for your detailed and clear explanation. [snip]
CUDALucas-2.03-cuda4.2-sm_30-x86-64 -threads 512 -f 20971520 -t 332233123

Continuing work from a partial result of M332233123 fft length = 20971520 iteration = 42622168
Iteration 42630000 M( 332233123 )C, 0xbd8dc41b383353ed, n = 20971520, CUDALucas
v2.03 err = 0.0336 (6:21 real, 38.0896 ms/iter, ETA 3064:05:40)
Iteration 42640000 M( 332233123 )C, 0x9a85a50ac0f45352, n = 20971520, CUDALucas
v2.03 err = 0.0346 (8:14 real, 49.3313 ms/iter, ETA 3968:17:25)
Iteration 42650000 M( 332233123 )C, 0x86ab474ca4d1eb20, n = 20971520, CUDALucas
v2.03 err = 0.0354 (8:13 real, 49.2523 ms/iter, ETA 3961:47:47)

Looks better ;-) but only 165 days to go :-(

Did I read somewhere that dropping -t speeds it up a bit?

Thinking (Must learn patience, must learn patience, must learn patience)

Cheers

Rob.
Robish is offline   Reply With Quote
Old 2013-10-09, 10:28   #239
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

72×197 Posts
Default

Well, man, you are serious about it!

If you want to use cudaLucas, then you can do better. With a GTX Titan, even better... Do you have one? (if yes, send it to me! )

Start P95 (or switch to it if it is already running) and use the "Advanced -> Time..." option to "time" your exponent. Fill in the exponent in the box, select 100 iterations, click OK. See what FFT length is used. Write it down somewhere.

Play with the "cudalucas -cufftbench x y z" option, where x is a bit smaller than the value chosen by P95, y is about 20% higher, and z is a multiple of 1024 (see the discussions in the CUDALucas thread).

With the last command you "tuned" the right FFT size for your card. Select the FFT length which is around the one chosen by P95 (it may be a bit smaller, or up to 10-15-20 percent larger) AND which is the fastest for your card. Run a few iterations with the command you used above, but with the right FFT, the one you "cufftbench"-ed.

Try it both with -t and without. Try with "-polite 0" and without. (Edit: use -k; then you can turn "polite" on/off interactively from the keyboard. Just press "p" and "enter" and see the speed difference. Watch the GPU occupancy in the meantime; it should go down a bit when polite is active.)

My estimate (from a superficial mental calculation and from the error your test shows) is that an FFT close to 18M would be much faster and would keep the error under 0.21 (I may be wrong). 2097152*3^2 is better than 2097152*2*5 when splitting the "butterflies" of the FFT multiplication, therefore if the error given for an ~18M FFT size is not too big, that would be much faster for your card.
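The arithmetic behind that claim is easy to check: the two candidate lengths differ only in how they factor, which decides what butterfly radices the FFT has to use. A quick sanity check (Python, nothing GPU-specific):

```python
def factorize(n):
    """Trial-division factorization, returned as {prime: exponent}."""
    factors, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

print(factorize(18874368))  # the ~18M length: {2: 21, 3: 2}, i.e. 2097152 * 3^2
print(factorize(20971520))  # the 20M length:  {2: 22, 5: 1}, i.e. 2097152 * 2 * 5
```

Radix-2 and radix-3 stages are generally cheaper than radix-5 ones, which is the basis of the estimate; whether the shorter length actually wins on a given card is exactly what -cufftbench measures.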

If my estimate is right, you may finish a few weeks sooner compared with the time you have now.

Keep us informed.

Edit: if you can save the residues along the way, it would be even better, in case someone wants to do a double-check (DC) later.

Last fiddled with by LaurV on 2013-10-09 at 10:41
LaurV is offline   Reply With Quote
Old 2013-10-09, 15:00   #240
TeknoHog
 
TeknoHog's Avatar
 
Mar 2010
Jyvaskyla, Finland

2²·3² Posts
Default

Quote:
Originally Posted by Robish View Post
Iteration 289390000 M( 332233123 )C, 0xffffffff80000000, n = 2097152, CUDALucas
v2.03 err = 0.0000 (0:45 real, 4.4768 ms/iter, ETA 53:16:24)
[snip; two more iterations with the same residue, quoted above]
While others have already pointed out why this is not OK, I think the 0xffffffff80000000 is really telling. Normally the residue should be different each time, and this particular number is especially bad: a string of 1s followed by 0s in its bitwise representation. It reminds me of hardware and logic errors when programming FPGAs.
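Printing that residue as a fixed-width binary string makes the pattern obvious (a one-liner check, illustrative only):

```python
r = 0xffffffff80000000
print(f"{r:064b}")  # 33 ones followed by 31 zeros, not a plausible random residue
```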
TeknoHog is offline   Reply With Quote
Old 2013-10-09, 15:42   #241
chris2be8
 
chris2be8's Avatar
 
Sep 2009

2²·523 Posts
Default

Could this be related? User Sheng has residues of 0xffffffff80000000

Chris
chris2be8 is offline   Reply With Quote
Old 2013-10-09, 16:19   #242
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

10010110110101₂ Posts
Default

Quote:
Originally Posted by chris2be8 View Post
Could this be related? User Sheng has residues of 0xffffffff80000000

Chris
Isn't that guy a known poacher? If so, he has found a shortcut to the credit bucket... (maybe mistakenly, because he forgot to modify the FFT when he switched from 33M to 55M exponents, or maybe intentionally).

Now, because we know the last two (masked) digits of the residue, I could go and report double-checks... I get my credit, the exponents get cleared...
Screw GIMPS!

Grrr....

Last fiddled with by LaurV on 2013-10-09 at 16:21
LaurV is offline   Reply With Quote
