[QUOTE=paulunderwood;534226]I don't understand it. I git cloned gpuowl and compiled, and it runs slower than before 1240 us. vs. 750 us. What am I doing wrong?[/QUOTE]
I can confirm it works for me on a Radeon VII. I pulled this version before this was even posted, so was using CARRY32 during my tuning without realizing it. With a 5632K FFT, I was getting 888us/it. I placed -use CARRY64 on the command line and the timing slowed to 910us/it. It just keeps getting better all the time! NOTE: I installed AMD's ROCm drivers with the --opencl=pal and --headless options, which installs the lightest weight drivers possible. I am using an i7 CPU and motherboard that has built-in video, so that is what I'm using for the console. There's no monitor connected to the Radeon VII at all. Like George said, these Linux drivers are light years ahead of the Windows drivers. |
[QUOTE=PhilF;534229]With a 5632K FFT, I was getting 888us/it.[/QUOTE]
I think you should be getting under 800us. Are you overclocking memory yet? Thanks for the --headless idea -- I'll try that soon. |
To expand on the new CARRY32 feature: the size of the carry increases as FFTs get larger and as the exponent approaches the limit of the current FFT size. I did test 2000 iterations of an exponent over 1 billion near the upper end of a 56M FFT. The maximum carry I saw was 80% of a fatal overflow value. Thus, I think the new code is safe for some time to come, though we really should do some more research.
Also, the new code stores carries in a different order to be more AMD-friendly. One can get the old memory layout with "-use OLD_CARRY_LAYOUT". That layout might be better on nVidia or it might be irrelevant. CARRY32 and CARRY64 both work with the new and old memory layout. To activate the old code, use "-use CARRY64,OLD_CARRY_LAYOUT". |
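For anyone scripting A/B timing runs, here is a minimal Python sketch of composing the "-use" option from the two switches described above. The helper name `gpuowl_use_args` is made up for illustration. It relies only on what the posts state: CARRY32 and the new carry layout are the defaults, and multiple "-use" values are joined with a comma.

```python
def gpuowl_use_args(carry64=False, old_layout=False):
    """Build the "-use" arguments for the carry options discussed above.

    Hypothetical helper: with both switches off it returns no arguments,
    which (per the posts) means CARRY32 with the new carry layout.
    """
    flags = []
    if carry64:
        flags.append("CARRY64")
    if old_layout:
        flags.append("OLD_CARRY_LAYOUT")
    return ["-use", ",".join(flags)] if flags else []

# The four combinations one might want to time against each other
for carry64 in (False, True):
    for old_layout in (False, True):
        print(carry64, old_layout, gpuowl_use_args(carry64, old_layout))
```

The returned list can be appended to whatever command line you already use to launch gpuowl.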
[QUOTE=paulunderwood;534226]I don't understand it. I git cloned gpuowl and compiled, and it runs slower than before 1240 us. vs. 750 us. What am I doing wrong?[/QUOTE]
I don't know, but the ROCm compiler can generate surprising results sometimes. What version of ROCm are you using, and what FFT size? One way to attempt to debug this is:
- run with CARRY64: do you recover the normal performance you had before?
- produce an ISA dump with CARRY64 (using -dump <folder>)
- produce another dump with CARRY32
- compare the .s files from the two dumps. This can be facilitated by the delta.sh script in gpuowl/tools/, which produces partially aggregated instruction counts.
Another interesting bit of information is to run with -time in the before/after cases and see which kernel has a massive slowdown. One more thing to keep an eye on is thermal throttling by the GPU. If you keep the hottest temperature (spot) under 98C (e.g. 90, 95) there should be little or no thermal throttling. |
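The comparison step in the list above can be roughed out in Python. This is not the delta.sh script, just a sketch of the same idea: count instruction mnemonics in two assembly listings and report what changed. The function names and the two tiny listings are made up for illustration (they are not real gpuowl dumps), and the parsing assumes ";" starts a comment, "." starts a directive, and a trailing ":" marks a label.

```python
from collections import Counter

def count_mnemonics(asm_text):
    """Count the first token of each instruction line in an assembly listing."""
    counts = Counter()
    for line in asm_text.splitlines():
        line = line.split(";")[0].strip()   # drop trailing comments (assumed ';' style)
        if not line or line.startswith(".") or line.endswith(":"):
            continue                        # skip blanks, directives, labels
        counts[line.split()[0]] += 1
    return counts

def diff_counts(before, after):
    """Return {mnemonic: after - before} for mnemonics whose count changed."""
    keys = set(before) | set(after)
    return {k: after[k] - before[k] for k in sorted(keys) if after[k] != before[k]}

# Made-up snippets standing in for the .s files from the two -dump runs
carry64_asm = """
kernel:
  v_add_co_u32 v0, vcc, v1, v2
  v_addc_co_u32 v3, vcc, v4, v5, vcc
  global_store_dwordx2 v[6:7], v[0:1], off
"""
carry32_asm = """
kernel:
  v_add_u32 v0, v1, v2
  global_store_dword v[6:7], v0, off
"""

delta = diff_counts(count_mnemonics(carry64_asm), count_mnemonics(carry32_asm))
for mnemonic, change in delta.items():
    print(f"{mnemonic:24s} {change:+d}")
```

To use it on real dumps, read each .s file into a string and pass the two strings in place of the sample listings.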
@preda: Feature request:
It seems that Ben Delo's big increase in PRP firepower makes it impossible for P-1'ers to stay ahead of the PRP wavefronts. This means we may get assigned an exponent that hasn't had any P-1 done. Can we change the default behavior of gpuowl to do a P-1 test on the exponent if needed? For first implementation, don't worry about optimal bounds, we can add that later. P-1 has about a 5% chance of finding a factor. For me a PRP test takes 18 hours, so investing up to 54 minutes of P-1 makes sense. Looking at recent P-1 results turned in to PrimeNet, prime95 chose bounds around B1=745000, B2=14713750 for a 96M exponent. I've no idea how long that takes on my GPU -- maybe I'll go test that now. |
[QUOTE=Prime95;534311]I've no idea how long that takes on my GPU -- maybe I'll go test that now.[/QUOTE]
I tested B1=750000, B2=20*B1 on a 5M FFT expo and it took 26 minutes. Clearly a worthwhile investment if no P-1 has been done before (PRP lines in worktodo that do not end in ",0"). Bonus. My test found a factor! So the P-1 code still works and another exponent bites the dust. |
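The break-even reasoning in the posts above can be written out as a few lines of Python. This is only a sketch of the arithmetic quoted in the thread (5% factor chance, 18-hour PRP test, 26-minute P-1 run). The function name is made up, and the model is deliberately simple: the time worth spending on P-1 is the probability of finding a factor times the PRP time that factor would save.

```python
def p1_break_even_minutes(prp_hours, factor_probability):
    """Expected PRP time saved by P-1, in minutes: P(factor) * PRP duration."""
    return prp_hours * 60 * factor_probability

budget = p1_break_even_minutes(prp_hours=18, factor_probability=0.05)

b1 = 750_000
b2 = 20 * b1                      # the B2=20*B1 setting from the test run
measured_p1_minutes = 26          # timing reported for that run

print(f"break-even P-1 budget: {budget:.0f} min")
print(f"measured P-1 cost:     {measured_p1_minutes} min (B1={b1}, B2={b2})")
print("worthwhile" if measured_p1_minutes < budget else "not worthwhile")
```

With these inputs the measured cost is roughly half the budget, which is why the run counts as a clear win when no P-1 has been done.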
[QUOTE=Prime95;534317]I tested B1=750000, B2=20*B1 on a 5M FFT expo and it took 26 minutes. Clearly a worthwhile investment if no P-1 has been done before (PRP lines in worktodo that do not end in ",0").
Bonus. My test found a factor! So the P-1 code still works and another exponent bites the dust.[/QUOTE] Cool! I was just assigned a few Cat 4 exponents in the 103M range, TF'ed to 74 bits with no P-1 at all. With a Radeon VII, should I TF it higher first, or skip that and do some P-1 first, or both? |
[QUOTE=Prime95;534311]@preda: Feature request:
It seems that Ben Delo's big increase in PRP firepower makes it impossible for P-1'ers to stay ahead of the PRP wavefronts. This means we may get assigned an exponent that hasn't had any P-1 done. Can we change the default behavior of gpuowl to do a P-1 test on the exponent if needed? For first implementation, don't worry about optimal bounds, we can add that later. P-1 has about a 5% chance of finding a factor. For me a PRP test takes 18 hours, so investing up to 54 minutes of P-1 makes sense. Looking at recent P-1 results turned in to PrimeNet, prime95 chose bounds around B1=745000, B2=14713750 for a 96M exponent. I've no idea how long that takes on my GPU -- maybe I'll go test that now.[/QUOTE] Understood; I'm looking into this, estimated 1-2 days. |
[QUOTE=PhilF;534322]Cool!
I was just assigned a few Cat 4 exponents in the 103M range, TF'ed to 74 bits with no P-1 at all. With a Radeon VII, should I TF it higher first, or skip that and do some P-1 first, or both?[/QUOTE] Skip the TF, just P-1. |
[QUOTE=Prime95;534328]Skip the TF, just P-1.[/QUOTE]
OK, thanks. BTW, regarding my memory timing, I had a chance to play with it today without success. Even overclocking to just 1050 produced errors. |
[QUOTE=PhilF;534329]OK, thanks.
BTW, regarding my memory timing, I had a chance to play with it today without success. Even overclocking to just 1050 produced errors.[/QUOTE] Did you undervolt? That could also be the reason for the errors. |