[code]
Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga
Build Options are : -D KHR_DP_EXTENSION
Starting M37957727 fft length = 2048K
Running careful round off test for 1000 iterations.
If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.12330, max error = 0.16406
Iteration 200, average error = 0.14911, max error = 0.17969
Iteration 300, average error = 0.15930, max error = 0.17969
Iteration 400, average error = 0.16440, max error = 0.17969
Iteration 500, average error = 0.16746, max error = 0.17969
Iteration 600, average error = 0.16950, max error = 0.17969
Iteration 700, average error = 0.17095, max error = 0.17969
Iteration 800, average error = 0.17214, max error = 0.18750
Iteration 900, average error = 0.17385, max error = 0.18750
Iteration 1000, average error = 0.17520 < 0.25 (max error = 0.18750), continuing test.
SIGINT caught, writing checkpoint.
Estimated time spent so far: 0:10
-------------------------------------------------------------------------
Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga
Build Options are : -D KHR_DP_EXTENSION
Starting M37957727 fft length = 2048K
Running careful round off test for 1000 iterations.
If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.08218, max error = 0.10938
Iteration 200, average error = 0.09578, max error = 0.10938
Iteration 300, average error = 0.10031, max error = 0.10938
Iteration 400, average error = 0.10361, max error = 0.11719
Iteration 500, average error = 0.10633, max error = 0.11719
Iteration 600, average error = 0.10814, max error = 0.11719
Iteration 700, average error = 0.10943, max error = 0.11719
Iteration 800, average error = 0.11043, max error = 0.12500
Iteration 900, average error = 0.11205, max error = 0.12500
Iteration 1000, average error = 0.11333 < 0.25 (max error = 0.12500), continuing test.
[/code]For anyone who wants to try... x64 binaries with the improved accuracy/precision patch. :smile: (Compiled with GCC 5.2/SDK 3.0/clFFT 2.8) |
[QUOTE=kracker;418719]For anyone who wants to try... x64 binaries with the improved accuracy/precision patch. :smile:
(Compiled with GCC 5.2/SDK 3.0/clFFT 2.8)[/QUOTE] Thanks! |
[QUOTE=kracker;418719]
For anyone who wants to try... x64 binaries with the improved accuracy/precision patch. :smile: (Compiled with GCC 5.2/SDK 3.0/clFFT 2.8)[/QUOTE]
Thanks, the .ini file is very handy!

I accidentally got assigned 2 DC tests when I wanted to stress-test a C2D with Mprime on Ubuntu (I forgot Mprime doesn't have a graphical interface and should be configured beforehand or started with parameters :redface: ). The assignments were Anonymous/Anonymous, but I changed the username/pcname in the files and let Mprime communicate those with the server. The assignments are now on my account page, so I can easily unreserve them if necessary. But I decided to finish them and test clLucas on my HD7950 at the same time.
[code]Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti
Build Options are : -D KHR_DP_EXTENSION
CUFFT bench start = 524288 end = 4194304 distance = 524288
CUFFT_Z2Z size= 524288 time= 0.620040 msec
CUFFT_Z2Z size= 1048576 time= 0.930050 msec
CUFFT_Z2Z size= 1572864 time= 1.680100 msec
CUFFT_Z2Z size= 2097152 time= 1.500080 msec
CUFFT_Z2Z size= 2621440 time= 2.130120 msec
CUFFT_Z2Z size= 3145728 time= 2.400140 msec
CUFFT_Z2Z size= 3670016 time= 2.860160 msec
CUFFT_Z2Z size= 4194304 time= 2.640150 msec
[/code]So if I read those values correctly, 2048K is faster than 1536K, and 4096K is faster than 3584K. So it's sometimes faster to use the higher FFT size!?

I started one of the assignments (DC cat4, since it was anonymous, remember), M41,803,373. FFT 2048K had a too high error rate, so it automatically restarted with a higher FFT of 2240K, but that one was terribly slow (33ms/iter). I've restarted it with 2560K and now the times are much better (4.5ms). It's now at iter 9,200,000 (~22%), so I will let it finish as I'm curious whether the card can reliably LL. It's running with "CheckRoundoffAllIterations=1" since that seemed sensible to me for the first test on the card. Any tips/tricks are welcome :smile:. |
[QUOTE=VictordeHolland;418798]So if I read those values correctly, 2048K is faster than 1536K and 4096K is faster than 3584K. So it's sometimes faster the use the higher FFT size!?[/QUOTE]
Yes, this is normal. You can read the cudaLucas thread about the tuning of cudaLucas to get an idea. The real speed depends not only on the size of the FFT, but also on its "granularity". This has to be tuned for each system; it mostly depends on how your gpgpu can split those threads and perform the butterflies when multiplying. Some FFT values are "good" for your card, some are not. However, the cuFFT library is quite performant on non-power-of-two FFTs, which cannot be said about clFFT, where power-of-two FFTs are faster. To see more about it, you can read this thread from the start :razz: [edit: there is a link [URL="http://www.mersenneforum.org/showthread.php?p=366726"]here[/URL] about which exponents are better to LL/DC with clLucas]. The point is that you have to tune your card for all exponent ranges, and use the fastest FFT that gives an error between, say, 0.05 and 0.25, for each range. On cudaLucas this can be done with the -cufftbench switch, which can tune the number of threads for each FFT too, sometimes improving the speed by as much as 20% or so over the "default" settings. |
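The tuning rule described above (pick the fastest FFT length whose round-off error stays in a safe band) can be sketched as a small selection routine. This is just an illustration of the idea; the timings and error values below are made-up numbers, not real clLucas measurements.

```python
# Pick the fastest FFT length whose average round-off error stays in a
# safe band (roughly 0.05 .. 0.25, per the advice above). The benchmark
# tuples here are illustrative values, not real clLucas output.

def pick_fft(bench, lo=0.05, hi=0.25):
    """bench: list of (fft_len_k, msec_per_iter, avg_error) tuples.

    Returns the fastest entry whose error is in [lo, hi), or None if
    every candidate is out of band (meaning: retune with longer FFTs).
    """
    safe = [b for b in bench if lo <= b[2] < hi]
    if not safe:
        return None
    return min(safe, key=lambda b: b[1])  # fastest among the safe ones

bench = [
    (1536, 1.68, 0.31),   # too much round-off error at this length
    (2048, 1.50, 0.17),   # in band, and faster than 1536K
    (2560, 2.13, 0.11),
    (4096, 2.64, 0.06),
]
print(pick_fft(bench))  # -> (2048, 1.5, 0.17): a longer FFT, yet the fastest safe one
```

This also captures the counter-intuitive observation above: a longer FFT (2048K) can win over a shorter one (1536K) once both speed and error are taken into account.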
Nitpicking: shouldn't it be called "clfftbench"? Or simply "fftbench"? (The "cu" is from CUDA, and a bit confusing here, related to which library is used.)
Also, is any "thread" tuning necessary? It seems msft ported an older version of the command-line stuff; the newest cudaLucas accepts the sizes in kilos (easier to type) and can do more tricks with them, like threadbench. However, my knowledge of OpenCL is almost null, so I can't say whether the thread tuning is necessary here or not. |
[QUOTE=LaurV;418882]Nitpicking: Shouldn't be called "clfftbench"? Or simpler "fftbench"? (the "cu" is from Cuda, and a bit confusing here, related to which library is used).[/QUOTE]
FIX in the next version. [QUOTE=LaurV;418882]Also, any "thread" tuning is necessary? [/QUOTE] If CL_KERNEL_WORK_GROUP_SIZE = 512, "thread" tuning is necessary. |
[QUOTE=VictordeHolland;418798]
I started one of the assignments (DC cat4, since it was anonymous, remember), M41,803,373. FFT 2048K had a too high error rate, so it automatically restarted with a higher FFT of 2240K, but that one was terribly slow (33ms/iter). I've restarted it with 2560K and now the times are much better (4.5ms). It's now at iter 9,200,000 (~22%), so I will let it finish as I'm curious whether the card can reliably LL. It's running with "CheckRoundoffAllIterations=1" since that seemed sensible to me for the first test on the card.[/QUOTE]
[code] M( 41803373 )C, 0x3469e8e34195fb48, n = 2560K, clLucas v1.026[/code]
Hurray, it's a match :smile:! Estimated running time: 52-53h.
Good to know my Asus HD7950 is stable with the factory voltage/clocks: core 900MHz, mem 1250MHz, vcore 1094mV, vmem 1600mV. Max temperature was also normal during the run (70C) with air cooling. I didn't check the wattage, might do that with the next run. I also noticed clLucas using almost a complete CPU core; that is definitely something to consider. |
[QUOTE=msft;419149]FIX in the next version.
If CL_KERNEL_WORK_GROUP_SIZE = 512, "thread" tuning is necessary.[/QUOTE]
Setting threads above 256 crashes for me.
[QUOTE=VictordeHolland;419160][code] M( 41803373 )C, 0x3469e8e34195fb48, n = 2560K, clLucas v1.026[/code]Hurray, it's a match :smile:! Estimated running time: 52-53h Good to know my Asus HD7950 is stable with the factory voltage/clocks: core: 900MHz mem: 1250MHz vcore: 1094mV vmem: 1600mV Max temperature was also normal during the run (70C) with air cooling. I didn't check the wattage, might do that with the next run. I also noticed clLucas using almost a complete CPU core, that is definitely something to consider.[/QUOTE]
What drivers do you have? Right now, clLucas is using around a fourth of a core, but I know some of AMD's older drivers running mfakto would use a whole core even with GPU sieving on.

EDIT: A minor bug... pressing Ctrl+C properly shuts down clLucas, but "closing" clLucas gives me the error
[code]Warning: Program terminating, but clFFT resources not freed. Please consider explicitly calling clfftTeardown( ).[/code] |
First [URL="http://www.mersenne.org/report_exponent/?exp_lo=36787943&full=1"]success[/URL] with the version compiled by kracker 7-8 posts above. It took about 36 hours on a HD 7970, slightly OC'ed, using a 2048K FFT. That is noticeably faster than the old version, but still slower than the GTX 580 (~23 hours, using a shorter FFT) and much slower than the Titan (~13 hours, same shorter FFT). The shorter (non-power-of-two) FFT is slower here.
Another one is scheduled to finish in about 10 hours. |
[QUOTE=kracker;419183]
What drivers do you have? Right now, clLucas is using around a fourth of a core, but I know some of amd's older drivers running mfakto would use a whole core even with GPU sieving on. [/QUOTE]
AMD driver Catalyst 15.7 partially solved it. If Prime95/Mprime is running on all cores, the CPU load of clLucas is only 1-2%. When I stop/suspend Prime95/Mprime, the CPU load of clLucas is back up to a full core. So it is much less of an issue for me (I'm running ECM with Prime95 anyway), but not entirely fixed. Mfakto 0.14 doesn't do this, neither with the older driver I was using nor with 15.7.
[code]# Polite is the same as the -polite option. If it's 1, each iteration is
# polite. If it's (for example) 12, then every 12th iteration is polite. Thus
# the higher the number, the less polite the program is. Set to 0 to turn off
# completely. Polite!=0 will incur a slight performance drop, but the screen
# should be more responsive. Trade responsiveness for performance. (Note:
# polite=0 is known to cause CUDALucas to use some extra CPU time; Polite=64 or
# higher is a good compromise.)
Polite=64[/code]I've tried different polite settings (0, 32, 64), but that doesn't solve it either. I think it might be related to this (from mfakto.ini):
[code]# Newer AMD drivers cause high CPU load when many kernels are scheduled on the GPU. To avoid
# useless CPU cycles (busy wait), <FlushInterval> kernels will be scheduled at most. I saw CPU
# load to start with FlushInterval=8. FlushInterval=0 means disable this feature (schedule all
# kernels as fast as possible).
# The disadvantage is, that after enqueueing <FlushInterval> kernels, mfakto waits for the GPU
# queue to become empty, causing the GPU to idle for a brief moment until the next kernel is
# scheduled. Can cost ~ 1% performance.
# This setting is likely to disappear once I found a better way to circumvent this AMD bug.
#
# Possible values: 0 = off, >0 = OpenCL queue size limit
FlushInterval=8
[/code] |
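The FlushInterval workaround quoted from mfakto.ini amounts to bounding how many kernels are in flight before forcing a queue drain, trading a brief GPU idle for avoiding the driver's CPU busy-wait. Here is a rough, driver-free toy model of that idea; `enqueue`/`finish` are stand-in stubs, not the real OpenCL `clEnqueueNDRangeKernel`/`clFinish` calls.

```python
# Toy model of the FlushInterval workaround: enqueue at most N kernels,
# then wait for the queue to drain instead of letting the driver busy-wait.
# ToyQueue is a stub standing in for an OpenCL command queue.

class ToyQueue:
    def __init__(self):
        self.pending = 0       # kernels queued but not yet drained
        self.finish_calls = 0  # how often we forced a drain

    def enqueue(self):
        self.pending += 1

    def finish(self):
        # In real OpenCL this would block until the queue is empty.
        self.pending = 0
        self.finish_calls += 1

def run_kernels(queue, n_kernels, flush_interval):
    in_flight = 0
    for _ in range(n_kernels):
        queue.enqueue()
        in_flight += 1
        if flush_interval and in_flight >= flush_interval:
            queue.finish()     # drain: brief GPU idle, but no CPU busy-wait
            in_flight = 0

q = ToyQueue()
run_kernels(q, n_kernels=100, flush_interval=8)
print(q.finish_calls)  # -> 12 forced drains for 100 kernels at interval 8
```

With `flush_interval=0` the loop never drains, which models the "schedule everything as fast as possible" mode that triggers the busy-wait on the affected drivers.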
[QUOTE=kracker;419183]EDIT: Little minor bug... passing Ctrl+C properly shuts down clLucas but "closing" clLucas gives me the error [code]
Warning: Program terminating, but clFFT resources not freed. Please consider explicitly calling clfftTeardown( ).[/code][/QUOTE] FIX in the next version. |