#331 | |||
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts |
#332 | |||||
|
Nov 2010
Germany
1001010101₂ Posts |
While this article shows a basic problem, it is one that the OpenCL compiler was brilliant at circumventing. Those optimizations probably cost quite some effort; the translated OpenCL code was reordered so much that I sometimes had trouble matching it to the original code. The compiler knows about the VLIW4/5 dependency issue, analyzes it, and reorders as much as the dependencies allow. But it is often hard to find independent instructions to fill the gaps. An even bigger problem on VLIW5 is the instructions that can run only on the special "t" unit, leaving the 4 other units empty. mul32 and mul_hi are widely discussed in this respect, but conversions back and forth between integer and floating-point representation are just as bad. And finally, all the operations needed to handle carry/borrow cost their share of the available GFLOPS.

I need more machines to test on.

That's the nice thing about modulo: it all repeats over and over ... No matter where in the circle of 4620 classes you start, you'll always hit each class once. By excluding FCs that are 3 or 5 mod 8, and multiples of 3, 5, 7, or 11, you keep 2/4 * 2/3 * 4/5 * 6/7 * 10/11 = 960 of the 4620 classes.
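The class count above can be checked with a small brute-force sketch. It assumes the usual form of Mersenne factor candidates, q = 2·k·p + 1, with classes taken as k mod 4620; the function name, the `start` parameter, and the example exponent are hypothetical choices for illustration, not taken from mfakto.

```python
def count_classes(p, start=0, num_classes=4620):
    """Count the classes k (mod 4620) whose candidates q = 2*k*p + 1 survive.

    Assumes p is odd and not divisible by 3, 5, 7 or 11 (true for any
    prime exponent above 11). `start` only shifts where the walk around
    the circle of classes begins.
    """
    kept = 0
    for k in range(start, start + num_classes):
        q = 2 * k * p + 1
        if q % 8 in (3, 5):          # factors of 2^p - 1 must be 1 or 7 mod 8
            continue
        if any(q % s == 0 for s in (3, 5, 7, 11)):
            continue                 # q has a small prime factor
        kept += 1
    return kept


# Example exponent chosen for illustration only.
print(count_classes(57885161))              # 960
print(count_classes(57885161, start=1234))  # 960, independent of the start
```

Because every residue of k modulo 4, 3, 5, 7 and 11 is hit exactly once per 4620 consecutive values of k, the result does not depend on the starting point.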
#333 |
|
Oct 2011
Maryland
100100010₂ Posts |
#334 |
|
"James Heinrich"
May 2004
ex-Northern Ontario
23×149 Posts |
#335 | |
|
Oct 2011
Maryland
2·5·29 Posts |
On my 5870 I get around 200 M/s with Barrett32.
On my 6970 I get around 120 M/s with Barrett32. I get around 140 M/s with MUL24.

So I think we still need major refinements with the 6xxx series. Though hopefully Barrett24 fixes everything!

Last fiddled with by KyleAskine on 2012-01-13 at 18:48 Reason: Added last line!
#336 |
|
Nov 2010
Germany
3·199 Posts |
Well, certainly not everything. Currently it is only capable of finding factors between 2⁶³ and 2⁷⁰. It should be able to find them up to 2⁷¹, but at 70.8 bits I see some misses. Once I see that in the debugger, I will be able to tell whether it will stay with the 2⁷⁰ limit or whether I can fix it to work for all of 2⁷¹ as well. And once that is done, I'd like to send it out to a few people for testing.

But 2⁷², the goal of GPU-to-72, will not be possible with this kernel. The next kernel will add another 24 bits, which will certainly slow it down considerably. Or maybe just add 12 bits? Hmm, let's see ...

I also started a kernel that uses 15-bit chunks in order to avoid the expensive mul_hi instructions, just to check whether that can increase the efficiency of AMD GPUs ...

BTW, testing mfakto on Nvidia turns out to be way more effort than it might be worth. Nvidia's OpenCL compiler is buggy and not yet complete. I had to remove all printf's even though they were in inactive #ifdefs. And once that was done, the compiler crashes.

Code:
Error in processing command line: Don't understand command line argument "-O3"!

Code:
(0) Error: call to external function printf is not supported

Code:
Select device - Get device info - Compiling kernels .
Stack dump:
0. Running pass 'Function Pass Manager' on module ''.
1. Running pass 'Combine redundant instructions' on function '@mfakto_cl_barrett79'
mfakto-nv.exe has stopped working
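To illustrate why 15-bit chunks can avoid mul_hi, here is a minimal sketch of schoolbook multiplication on 15-bit limbs. It is not mfakto's kernel: the limb count of 5, the helper names, and the use of Python to emulate 32-bit arithmetic are all assumptions made for illustration. The point is that a product of two 15-bit limbs plus the running carries always stays below 2³⁰, so only the low 32 bits of each multiply are ever needed.

```python
LIMB_BITS = 15
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer below 2**(15*n) into n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]


def from_limbs(limbs):
    """Reassemble little-endian 15-bit limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product of two equal-length limb lists (2n limbs out)."""
    n = len(a)
    out = [0] * (2 * n)
    for i in range(n):
        carry = 0
        for j in range(n):
            t = out[i + j] + a[i] * b[j] + carry
            # Every intermediate fits in an unsigned 32-bit register,
            # so a plain 32-bit low multiply would be enough.
            assert t < (1 << 32)
            out[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        out[i + n] = carry
    return out


a = 0x5A5A5A5A5A5A5A5A5A      # arbitrary 71-bit test value
b = (1 << 75) - 1             # largest value that fits in 5 limbs
product = from_limbs(mul_limbs(to_limbs(a, 5), to_limbs(b, 5)))
print(product == a * b)       # True
```

The trade-off is more limbs per number, and therefore more multiply and carry steps, in exchange for cheaper individual instructions.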
#337 |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
16065₈ Posts |
Lol, I can't help. I hardly know anything about programming, only the very basics.
#338 | |
|
Oct 2011
Maryland
122₁₆ Posts |
#339 |
|
"Jerry"
Nov 2011
Vancouver, WA
10001100011₂ Posts |
#340 | |
|
Nov 2010
Germany
1001010101₂ Posts |
The barrett24 kernel, however, normally needs 3 spare bits. I managed to "borrow" one, but not more. As the processing width is 3x24 bits, I need to limit the new kernel's bit_max to 70.

I also noticed that the new kernel's register usage seems to be very efficient, resulting in a 1-2% performance gain when using a vector size of 8 instead of 4.

I'll send you and flashjh a test version within the next few days. Try to save a few 69 -> 70 assignments for it ...

Last fiddled with by Bdot on 2012-01-16 at 09:37
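The bit_max arithmetic in this post can be written out as a tiny sketch. This is only one reading of the post, namely that the usable factor size is the processing width minus the spare bits the kernel needs; the function and parameter names are hypothetical.

```python
def max_bit_level(limbs=3, limb_bits=24, spare_bits=2):
    """Largest factor size in bits, given the headroom the kernel needs."""
    return limbs * limb_bits - spare_bits


print(max_bit_level(spare_bits=3))  # 69: the normal 3 spare bits
print(max_bit_level(spare_bits=2))  # 70: after "borrowing" one bit
```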
#341 | |
|
Oct 2011
Maryland
442₈ Posts |