mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

Dubslow 2012-07-25 07:01

[QUOTE=Bdot;305895]Oops, I did not recognize you're waiting for me ...

As the only NV card where I can try CL is in my workstation at work, I could not investigate any further. At the moment there is really no time at all. Sorry for that.[/QUOTE]

Ah, sorry about the miscommunication :smile:. flash, have you been able to get anywhere on that? (My problem is that I'm just about the only person besides msft who doesn't use Windows...)

lycorn 2012-07-26 13:02

[QUOTE=LaurV;305573]There is a 2.03 Stable version and a (better, but still under work) 2.04 Beta version, both on the [URL="https://sourceforge.net/projects/cudalucas/files"]sourceforge[/URL] page. I personally use the Beta right now. There is no difference in math, just in "cosmetic" things, the Beta has some "improvements" which are partially working, partially are still worked on..:smile:[/QUOTE]

Thanks, LaurV.
I gave it a go, but unfortunately got a residue mismatch. As it was the second mismatch in two tests (the first one ran a couple of months ago), I'll keep using the card for TF only.

NormanRKN 2012-07-26 21:25

I haven't read all the posts.
Is there any possibility of an SP version instead of a DP one?
Most NVIDIA cards perform very poorly in DP compared with an ATI/AMD GPU.
That means a waste of time and ENERGY.
My NVIDIA cards always take a very, very long time, while an ATI/AMD card finishes at warp speed ;)
For me, NVIDIA vs. ATI in DP is like angel and devil.
Please don't crunch DP on an NVIDIA card :no:. It is (mostly) a waste.
When I see a midrange ATI against a high-end NVIDIA crunching in DP mode... phew... forget NVIDIA ;)
SP is where they do their work (except, maybe, the Teslas, which have somewhat more DP ;) )

For example, in MilkyWay@home an ATI card is 7x faster than an NVIDIA card.

Norman

Dubslow 2012-07-26 22:03

I don't know about SP FFTs, but somehow I don't think it's possible. (I'm not the person who does the math of CUDALucas.)

The reason an LL program hasn't been implemented on OpenCL/AMD cards is because there is no OpenCL FFT library, whereas nVidia provides a cufftlib for CUDA cards.

LaurV 2012-07-27 02:18

SP FFT is not possible, there is not enough room for the carry. Search the forum, there have been a lot of discussions. (If I find some time today I will come back with some links.)
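To put rough numbers on "not enough room for the carry", here is a minimal sketch in C. It uses a plain worst-case bound, not the criterion CUDALucas actually applies: with n words of b bits each, one convolution output is a sum of n products of two b-bit words, so it stays below 2^(2b + log2 n) and must fit into the mantissa to be recovered exactly. The function names are made up for this illustration.

```c
#include <assert.h>
#include <float.h>

/* Smallest k with 2^k >= n (integer ceil(log2 n)), avoiding libm. */
static int ceil_log2(unsigned long n)
{
    int k = 0;
    assert(n >= 1);
    while ((1UL << k) < n)
        k++;
    return k;
}

/* Worst-case bits per FFT word such that 2b + ceil(log2 n) still fits
   into a mantissa of mant_bits bits. This is a pessimistic bound:
   real programs use balanced digits and accept somewhat more. */
double worst_case_bits_per_word(int mant_bits, unsigned long n)
{
    return (mant_bits - ceil_log2(n)) / 2.0;
}

/* Average bits per word needed to store a p-bit residue in n words. */
double required_bits_per_word(unsigned long p, unsigned long n)
{
    assert(n >= 1);
    return (double) p / (double) n;
}
```

For the 1440K length (1474560 words) and an exponent near 27M, the residue needs about 18.3 bits per word. The worst-case bound gives 16 bits per word for DP (`DBL_MANT_DIG` is 53), which is why DP runs close to the limit and roundoff has to be watched, and only 1.5 bits per word for SP (`FLT_MANT_DIG` is 24). A longer FFT does not help SP either, because log2 n grows with the length.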

Talking amd against nvidia is only a gamer subject. For games, amd cards may be "better" or "faster" and they are "cheaper". But think about that "cheaper" part: generally, you get what you pay for. For "real" [strike]things[/strike]race they don't even see the tail of the nvidias... By the way, do you understand why DP is in general 4-6-8 times slower than SP? Compare filling a barrel with a bucket, and filling it with a cup. With the bucket you will move slower, need more time to fill the bucket, carry it, pour it, etc, but at the end, you will fill the barrel much faster with the bucket than with the cup. Try it.

And see [URL="http://mersenne-aries.sili.net/mfaktc.php?sort=ghdpd"]James' page[/URL] for "real" speed comparison (edit: in Firefox you have to scroll down, that page has a problem when displayed in Firefox if the window is too narrow). Those results don't use any FFT, so the FFT is not the problem. Ati/amd do some [B]inaccurate[/B] "video tricks" faster. But when it comes to (accurate) general processing, they are much slower. There is a big difference between video cards and GPUs. You can - at most - compare ati/amd cards with the last nvidia fiasco: 6xx series, as all of such cards are "video cards", fast for games, lousy for general computing (GPGPU). But you can't compare ati/amd with Fermi or Tesla. No way!

Bdot 2012-07-27 15:08

[QUOTE=Dubslow;306116]The reason an LL program hasn't been implemented on OpenCL/AMD cards is because there is no OpenCL FFT library, whereas nVidia provides a cufftlib for CUDA cards.[/QUOTE]

AMD ships an example FFT implementation along with the APP SDK. As OpenCL is a little hesitant to enforce double type processing (be it emulated on SP-only HW), this example is only SP so far.
I've ordered an HD7850, and when that is up and running I may spend a little time to see if that FFT example can easily be changed to DP. In case that succeeds, I may be able to provide a few performance figures.
In any case, an FFT "library" at an "example" stage is something different than cufft.
[QUOTE=LaurV;306130]
Talking amd against nvidia is only a gamer subject. For games, amd cards may be "better" or "faster" and they are "cheaper". But think about that "cheaper" part: generally, you get what you pay for. For "real" [strike]things[/strike]race they don't even see the tail of the nvidias...

And see [URL="http://mersenne-aries.sili.net/mfaktc.php?sort=ghdpd"]James' page[/URL] for "real" speed comparison. You can - at most - compare ati/amd cards with the last nvidia fiasco: 6xx series, as all of such cards are "video cards", fast for games, lousy for general computing (GPGPU). But you can't compare ati/amd with Fermi or Tesla. No way![/QUOTE]

Quite energetic!
Certainly, the aforementioned MilkyWay, Bitcoin mining, etc. are not gaming (in the sense of 3D video games).

You're quite right that the AMD VLIW5/4 architecture had big difficulties with computing. But when NV made a step backwards with their 6xx series, AMD made a leap forward with its GCN. With mfakto, an HD7970 may still not reach a 580, but it is on par with a 570. And I finally ordered a GCN card and expect to find a trick or two to speed up GCN even more.

Your statements are on weak ground (at least until "big kepler" blasts AMD to pieces again :smile:).

kracker 2012-07-28 01:53

[QUOTE=Bdot;306187]And I finally ordered a GCN-card and expect to find one or another trick to speed up GCN even more.[/QUOTE]

I would like to know how well the 7850 performs, since I have a 7770 (one step down). I'm just curious how much of a performance increase there is in going up to it (not that I'll get one, just curious) :smile:

LaurV 2012-08-04 10:46

ok, now the million dollar question: how do I...:whistle:(how to put this?)... "convert" a save file from 1440k fft into 1568k fft? (1600k would work too :D)

The point is that I am testing two 27M exponents, and after 18M and 24M iterations respectively I get error >0.35, which is reproducible when I restart from 1, 2, or 3 checkpoints earlier. Theoretically, the checkpoint (discrete transform) is converted by the program into the residue (an integer), from which the last digits are displayed. The program does this anyhow, as it first needs to keep the errors under control, and second needs to subtract that "2" on each iteration. The next step should be to convert the binary form of the residue into a "different size" transform. Can I use CL (some switch) or another 3rd party tool to do this faster than re-running all the ~20M iterations with a bigger FFT?
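A minimal sketch in C of what such a "conversion" would have to do at the integer level. It assumes the standard irrational-base (Crandall-Fagin) layout, where word i of an n-word representation starts at bit ceil(p*i/n), and it assumes the digits have already been normalized to non-negative values that fit their word width. Whether a CUDALucas save file stores the data in exactly this form is not confirmed here, and `repack_digits` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bit offset at which word i starts in an n-word layout: ceil(p*i/n). */
static int64_t word_start(int64_t p, int64_t n, int64_t i)
{
    return (p * i + n - 1) / n;
}

/* Repack a p-bit residue from n_in variable-width words into n_out
   variable-width words. Digits must be non-negative and fit their word.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int repack_digits(int64_t p, const int64_t *in, int64_t n_in,
                  int64_t *out, int64_t n_out)
{
    assert(p > 0 && n_in > 0 && n_out > 0);
    /* One byte per bit: wasteful, but keeps the sketch easy to follow. */
    unsigned char *bits = calloc((size_t) p, 1);
    if (!bits)
        return -1;

    /* Unpack every input word into its slice of the bit string. */
    for (int64_t i = 0; i < n_in; i++) {
        int64_t lo = word_start(p, n_in, i);
        int64_t hi = word_start(p, n_in, i + 1);
        for (int64_t b = lo; b < hi; b++)
            bits[b] = (unsigned char) ((in[i] >> (b - lo)) & 1);
    }

    /* Cut the same bit string along the new word boundaries. */
    for (int64_t i = 0; i < n_out; i++) {
        int64_t lo = word_start(p, n_out, i);
        int64_t hi = word_start(p, n_out, i + 1);
        out[i] = 0;
        for (int64_t b = lo; b < hi; b++)
            out[i] |= (int64_t) bits[b] << (b - lo);
    }

    free(bits);
    return 0;
}
```

The point of the sketch is that the word boundaries move when the length changes, so every word of the new representation generally differs from the old one. A real tool would also have to undo and redo whatever weighting and balancing the program applies to the stored values.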

Edit: P95 is "trying with a larger FFT" or "trying with a slow method" in these cases. What does CL do to avoid it? One patch for the future should be to select a larger FFT for the 27M range. 1568k is the fastest after 1440k.

Edit2: 27000929 is still running, but the errors are at the limit: I got 0.2578, 0.2582, etc. This should be about the upper limit at which CL automatically selects 1440k as the FFT size, keeping in mind that not everybody does "tuning" for each exponent size :D. You can end up like me, with half of the test done in vain. A method of "conversion" should be available...

Meanwhile I restarted the other exponents with 1568k, as for my gtx580's all lengths in between or after result in longer times. Both runs have passed their first 1M iterations, and up to now all residues (every 100k) match the first run.

Edit3: 8M iterations, everything matching.

Dubslow 2012-08-04 19:01

Are you using 2.03 or 2.04? The latter should test the average roundoff at the beginning of the test and select a higher length if the roundoff is too high.

Keep in mind that the error shown on screen is only the maximum error since the last checkpoint (2.04) or the maximum error since the last (re)start (2.03). That means the average should be lower than 0.25. Maximum errors of 0.25-0.30, maybe even a bit higher, should still be okay.
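For anyone wondering what that number measures, here is a minimal sketch in C, not the actual CUDALucas code. After the inverse FFT every element should be an integer, so the roundoff error of an element is its distance to the nearest integer, and the reported figure is the maximum over all elements. Once that distance gets near 0.5, rounding can pick the wrong integer and the residue is ruined. `max_roundoff` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Largest distance from any element of x to its nearest integer.
   Assumes |x[i]| is far below 2^63 so the cast to long long is safe. */
double max_roundoff(const double *x, size_t n)
{
    double worst = 0.0;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Round half away from zero without needing libm. */
        double nearest = (double) (long long) (x[i] + (x[i] >= 0.0 ? 0.5 : -0.5));
        double err = x[i] - nearest;
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Calling it on `{3.0, -2.25, 10.125}` gives 0.25, which would sit right at the edge of the "still okay" range described above.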

As for the error > 0.35, did it print what the actual error was? And yes, what with all the reports of Prime95 v27 issues, it did occur to me that CUDALucas doesn't handle too-large errors very nicely...

As for the "FFT conversion", it should be possible with some slight variant of the teeny-weeny thingies I posted here a few pages back. Note, however, that I make ABSOLUTELY NO GUARANTEE THAT IT WILL WORK IN ANY FASHION. It would be cool if it does though, the idea has occurred to me before :D

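[code]#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void print_time_from_seconds (int sec) // copied almost verbatim from CuLu source
{
  if (sec > 3600)
    {
      printf ("%d", sec / 3600);   // hours
      sec %= 3600;
      printf (":%02d", sec / 60);  // minutes, zero-padded
    }
  else
    printf ("%d", sec / 60);       // minutes only
  sec %= 60;
  printf (":%02d\n", sec);         // seconds
}

int main(int argc, char** argv) {
  char *name, *newname;
  int q, n, j, old, new;  // exponent, stored FFT length, iteration, old/new FFT lengths
  size_t len;
  long t = 0;             // total time; stays 0 if the 2.04-only read below is commented out
  double* x;
  FILE* f;

  if( argc < 4 || !argv[1] || !argv[2] || !argv[3] ) {
    printf("First argument should be name of checkpoint file, second should be old FFT (full form, not K form), and third should be new FFT\n");
    return -1;
  }
  name = argv[1]; old = atoi(argv[2]); new = atoi(argv[3]);
  if( old <= 0 || new < old ) { // the buffer holds `new` doubles, so `old` of them must fit
    printf("New FFT length must be at least the old length, aborting\n");
    return -1;
  }
  f = fopen(name, "rb"); // Ignore compiler warnings about "secure functions"
  if( !f ) {
    printf("Cannot open %s, aborting\n", name);
    return -1;
  }
  fread(&q, sizeof(int), 1, f);
  fread(&n, sizeof(int), 1, f);
  if( n != old) {
    printf("Supplied old length doesn't match checkpoint's old length, aborting\n");
    fclose(f);
    return -1;
  }
  fread(&j, sizeof(int), 1, f);
  x = (double*) calloc(new, sizeof(double)); // zero-filled, so the extra elements stay 0
  if( !x ) {
    printf("Out of memory, aborting\n");
    fclose(f);
    return -1;
  }
  fread(x, sizeof(double), old, f);
  fread(&t, sizeof(long), 1, f); // comment out this line for 2.03 save files
  fclose(f);
  printf("This is a checkpoint for exp = %d, n = %dK, iter = %d, and total time = %ld = ", q, n/1024, j, t);
  print_time_from_seconds((int) t);
  printf("Converting from FFT %d to FFT %d\n", old, new);

  len = strlen(name) + 1 + 4;               // room for ".new" and the terminating NUL
  newname = (char*) calloc(len, sizeof(char));
  if( !newname ) {
    printf("Out of memory, aborting\n");
    free(x);
    return -1;
  }
  snprintf(newname, len, "%s.new", name);
  f = fopen(newname, "wb");
  if( !f ) {
    printf("Cannot create %s, aborting\n", newname);
    free(newname); free(x);
    return -1;
  }
  fwrite(&q, sizeof(int), 1, f);
  fwrite(&new, sizeof(int), 1, f);          // store the NEW length, not n
  fwrite(&j, sizeof(int), 1, f);
  fwrite(x, sizeof(double), new, f);        // old data followed by zero padding
  fwrite(&t, sizeof(long), 1, f); // comment this out for 2.03 save files
  fclose(f);
  free(newname); free(x);
  printf("Written new checkpoint.\n");
  return 0;
}[/code]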
[code]bill@Gravemind:~/CUDALucas∰∂ ckpconvert t27812929 1572864 1638400
This is a checkpoint for exp = 27812929, n = 1536K, iter = 140001, and total time = 869 = 14:29
Converting from FFT 1572864 to FFT 1638400
Written new checkpoint.[/code]

Dubslow 2012-08-04 20:24

Edit: Whoops, change "fwrite(&n, sizeof(int), 1, f);" to "fwrite(&new, sizeof(int), 1, f);" :razz:

LaurV 2012-08-04 22:07

[QUOTE=Dubslow;306919]Are you using 2.03 or 2.04? The latter should test the average roundoff at the beginning of the test and select a higher length if the roundoff is too high.[/QUOTE]
I use the latest 2.04 beta. You can try for yourself, I just found a nice exponent: 27290759. CL selects 1440k for it. Everything is ok until just before 2.5M iterations (takes about an hour and a half), where it gets an error bigger than .35 and gets angry. This is the exponent with the "lowest" error-iter-count. Of course, to speed things up, I can provide intermediate save files from just 2, 3, or 5 minutes before the error. Honestly, I think it doesn't make sense; it is clear that a bigger FFT should be used for this range.

My concern is that people who don't use manual tuning of the FFT (and rely on auto selection) will lose time and run millions of iterations in vain, as long as the program can't "increase" the FFT "on the fly" in a "clever" (transparent) way for the user, as P95 does, in such cases.
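A minimal sketch in C of the decision part of such an "on the fly" increase. This is not how P95 or CUDALucas actually do it: the function names and the idea of passing in a list of candidate lengths are assumptions made for the illustration, and the hard part (carrying the current residue over to the new length) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest candidate strictly larger than current, or 0 if there is none.
   The candidate list does not need to be sorted. */
int next_fft_length(int current, const int *candidates, size_t count)
{
    int best = 0;
    assert(candidates != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (candidates[i] > current && (best == 0 || candidates[i] < best))
            best = candidates[i];
    return best;
}

/* Keep the current length while the observed maximum roundoff is within
   the limit; otherwise step up to the next larger candidate, if any. */
int choose_fft_length(int current, double max_roundoff, double limit,
                      const int *candidates, size_t count)
{
    int bigger;
    if (max_roundoff <= limit)
        return current;
    bigger = next_fft_length(current, candidates, count);
    return bigger ? bigger : current;
}
```

With the candidates 1474560 (1440k), 1605632 (1568k) and 1638400 (1600k) and a limit of 0.35, an observed error of 0.36 at 1440k would move the run to 1568k, while 0.26 would leave it where it is.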

Regarding the average errors: when I said "average" I meant the average of the values displayed on screen. I knew the values represent the "max error" for a range.

I haven't looked at your program yet; it is 5:05 AM here, and I need to get some sleep tonight (I had something to work on). Anyhow, the two exponents are almost done after the restart with the new FFT size. Next time maybe.

