mersenneforum.org > Data CEMPLLA: An alternative to GIMPS ?

2017-07-12, 23:28   #199
ewmayer
2ω=0

Sep 2002
República de California

2²·3·929 Posts

Quote:
 Originally Posted by chalsall Isn't that how Trump became Commander in Chief?
Only Trump?

But please, let's keep pawl-it-icks in the Soap Box.

2017-07-12, 23:50   #200
chalsall
If I May

"Chris Halsall"
Sep 2002

2·5·29·31 Posts

Quote:
 Originally Posted by ewmayer But please, let's keep pawl-it-icks in the Soap Box.
Meow.

2017-07-13, 00:11   #201
airsquirrels

"David"
Jul 2015
Ohio

205₁₆ Posts

I also tried running cuobjdump on the binary; no PTX or cubin to be found, although perhaps that's where the encryption instructions are being used. The CUDA calls are linked at run time, so no version information was easily extractable, and the code that uses them does set up a stream and some host-to-device and device-to-host CUDA memcpy operations before a loop of kernel launches. If I could have got it running I would have hooked a debugger to the load-module code and dumped the kernel for nvdisasm to at least get some idea of the algorithm; however, the code itself plain does not run. The installer crash was in the Visual Studio 2010 runtime, so nothing telltale about how to work around that. The time was for the joy of the hunt, rather than merit of the effort.

2017-07-13, 00:33   #202
science_man_88

"Forget I exist"
Jul 2009
Dumbassville

2⁶×131 Posts

Quote:
 Originally Posted by airsquirrels I also tried running cuobjdump on the binary; no PTX or cubin to be found, although perhaps that's where the encryption instructions are being used. The CUDA calls are linked at run time, so no version information was easily extractable, and the code that uses them does set up a stream and some host-to-device and device-to-host CUDA memcpy operations before a loop of kernel launches. If I could have got it running I would have hooked a debugger to the load-module code and dumped the kernel for nvdisasm to at least get some idea of the algorithm; however, the code itself plain does not run. The installer crash was in the Visual Studio 2010 runtime, so nothing telltale about how to work around that. The time was for the joy of the hunt, rather than merit of the effort.
Can't it be extracted into machine code?

2018-10-03, 19:57   #203
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11×347 Posts

Quote:
 Originally Posted by axn Those are integer instructions, so possibly he could be doing integer transforms. IIRC, cudaLucas uses nvidia's cuFFT library where max FFT size is "128 million elements" (https://developer.nvidia.com/cufft) I think clLucas also has same kind of limitation (dependent on clFFT library from AMD?) gpuOWL, OTOH, uses hand-rolled FFT, but currently only supports p-o-2 2M & 4M FFTs. But maybe the author can write a 256M FFT for s&g. George should write a 192M one for s&g as well (if not already done).
CUDALucas and ClLucas max out at 64M FFT length. For good reason: with run time per iteration growing roughly as n ln n ln ln n, and the number of iterations growing with the exponent, doubling the exponent means more than 4 times the total run time. Similarly, George has chosen to limit prime95 FFTs above 32M to FMA3 hardware fast enough for them, and has not bothered to code above 64M yet.
GpuOwL has progressed since v3.5 to supporting up to a 144M FFT. https://www.mersenneforum.org/showpo...&postcount=505 Perhaps when Preda comes back from his long vacation he'll add an 8K W or 4K H, which would enable FFT lengths > 144M and potentially support gigadigit exponents, with very long run times.
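
To put a number on the "more than 4 times" scaling mentioned above, here is a minimal sketch in Python. It assumes the FFT length n grows in direct proportion to the exponent p (in practice the usable bits per word shrink slightly as n grows, so this is a lower bound), and the function name `relative_ll_cost` is made up for illustration.

```python
import math

def relative_ll_cost(p: float) -> float:
    """Rough total-cost model for an LL test of exponent p.

    Assumption: p iterations, each costing about n ln n ln ln n,
    with the FFT length n taken as directly proportional to p.
    Only ratios of this value are meaningful.
    """
    n = p  # proportionality constant cancels out in the ratio
    return p * n * math.log(n) * math.log(math.log(n))

p = 100_000_000
ratio = relative_ll_cost(2 * p) / relative_ll_cost(p)
print(f"doubling p multiplies the estimated run time by about {ratio:.2f}")
```

The factor of 4 comes from p iterations times a length-n transform; the logarithmic terms push the ratio a little above 4.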

I've suspected, since I learned of CEMPLLA v1, that the reason for its 5-GPU minimum was Toom-Cook-3 run in parallel, wrapped around a library implementation of a 64M FFT: 5 GPUs, each doing one of the 5 products that Toom-Cook-3 needs in place of the 9 partial products of a schoolbook 3x3 multiply at 64M size, for 192M total, just barely big enough, I think, for gigadigit.
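
For anyone unfamiliar with why the count is 5 rather than 9, below is a minimal Python sketch of one level of textbook Toom-Cook-3 on plain integers, using the evaluation points 0, 1, -1, 2 and infinity. It is not CEMPLLA's code (nobody in this thread has seen that), and the function name `toom3_multiply` is hypothetical. In the speculated design each of the five products would be a 64M-FFT multiply on its own GPU; here they are ordinary Python multiplications.

```python
def toom3_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers with one level of Toom-Cook-3."""
    # Split each operand into three limbs of k bits: x = x0 + x1*2^k + x2*2^(2k).
    k = (max(a.bit_length(), b.bit_length(), 1) + 2) // 3
    mask = (1 << k) - 1
    a0, a1, a2 = a & mask, (a >> k) & mask, a >> (2 * k)
    b0, b1, b2 = b & mask, (b >> k) & mask, b >> (2 * k)

    # Evaluate both limb polynomials at 0, 1, -1, 2 and infinity.
    pa = (a0, a0 + a1 + a2, a0 - a1 + a2, a0 + 2 * a1 + 4 * a2, a2)
    pb = (b0, b0 + b1 + b2, b0 - b1 + b2, b0 + 2 * b1 + 4 * b2, b2)

    # The five independent products (the part that could run in parallel).
    r0, r1, rm1, r2, rinf = (x * y for x, y in zip(pa, pb))

    # Interpolate the five coefficients of the product polynomial.
    # Every division below is exact.
    c0, c4 = r0, rinf
    c2 = (r1 + rm1) // 2 - c0 - c4
    s = (r1 - rm1) // 2                       # c1 + c3
    t = (r2 - c0 - 4 * c2 - 16 * c4) // 2     # c1 + 4*c3
    c3 = (t - s) // 3
    c1 = s - c3

    # Recombine: evaluate the product polynomial at 2^k.
    return (c0 + (c1 << k) + (c2 << (2 * k))
            + (c3 << (3 * k)) + (c4 << (4 * k)))

print(toom3_multiply(123456789, 987654321) == 123456789 * 987654321)
```

The evaluation and interpolation steps are only additions, shifts and small exact divisions, which is why the five large products dominate the cost.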

CUDALucas 2.06beta on a GTX1070 at 64M runs about 85 msec/iteration, so on a 1080Ti it would be about 42 ms/iteration. Toom-Cook-3 would take longer than that, so a modified CUDALucas spreading a single gigadigit exponent across multiple GPUs that way would need more than 13 years. (I recall reading somewhere that the CEMPLLA author had seen cases where larger exponents ran considerably FASTER than smaller exponents, or some such, and I recall in my own testing seeing cases where FFT timings are anomalously fast, in some cases by orders of magnitude, _BUT THOSE ARE FOR FATAL TO ACCURACY ERROR CONDITIONS_.)
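
The run-time arithmetic can be sketched in a few lines of Python. The 42 ms/iteration figure is the one quoted above; the factor of 3 per iteration for Toom-Cook-3 is only an assumption chosen here because it lands on the "more than 13 years" figure, since the post does not say how that number was derived. The helper name `run_time_years` is made up.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def run_time_years(iterations: int, ms_per_iter: float) -> float:
    """Wall-clock years for a given iteration count and per-iteration time."""
    return iterations * ms_per_iter / 1000.0 / SECONDS_PER_YEAR

# Approximate exponent p at which 2^p - 1 reaches a billion decimal digits.
gigadigit_p = round(1e9 / math.log10(2))

# An LL test needs about p iterations (p - 2, to be exact).
single = run_time_years(gigadigit_p, 42.0)
tripled = run_time_years(gigadigit_p, 3 * 42.0)  # assumed 3x overhead, see lead-in

print(f"p is about {gigadigit_p:.3e}")
print(f"at 42 ms/iteration: about {single:.1f} years")
print(f"at 3 x 42 ms/iteration: about {tripled:.1f} years")
```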

Last fiddled with by kriesel on 2018-10-03 at 20:00

2018-10-03, 20:06   #204
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

3×1,811 Posts

Quote:
 Originally Posted by kriesel (I recall reading somewhere that the CEMPLLA author had seen cases where larger exponents ran considerably FASTER than smaller exponents, or some such, and I recall in my own testing seeing cases where fft timings are anomalously fast, in some cases orders of magnitude, _BUT THOSE ARE FOR FATAL TO ACCURACY ERROR CONDITIONS_.)
Yeah. There is no point in running really really fast if you are going in the wrong direction.

2018-10-27, 14:52   #205
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11×347 Posts

Anomalously fast iterations on gpus

Detailed description of error cases producing fast-but-wrong iterations can be seen at https://mersenneforum.org/showpost.p...&postcount=617

2019-04-29, 19:28   #206
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

111011101001₂ Posts

Quote:
 Originally Posted by kriesel CUDALucas and ClLucas max out at 64M FFT length.
Nope. Well, sort of, depending on GPU model and CUDA level. Models with little GPU RAM won't be able to primality test, threadbench, or fftbench above certain levels. A 1GB Quadro 2000 is limited to around 32768k - 38880k as I recall, in CUDALucas or CUDAPm1. The CUDALucas code (v2.06 at least, and possibly some earlier), with a sufficiently high CUDA level, can go to 256M FFT length, and to p~2^31, the latter limit probably because of using signed 32-bit integers in places. But run times are dreadfully long. p~10^9 takes about 1.5 years on a GTX1080Ti; p~2^31 ~9 years on a GTX1080. CUDAPm1 has other issues that often occur at lower p than its memory limit.
Code:
Device              GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1620
memClockRate (MHz)  5505

fft    max exp  ms/iter
1      22133   0.1083
...
4608   85111207   3.2221
...
65536 1143276383  49.4602
69120 1204418959  49.4578
73728 1282931137  51.3181
75264 1309078039  56.8343
81920 1422251777  58.5331
82944 1439645131  60.2333
84672 1468986017  64.5615
86016 1491797777  66.1291
86400 1498314007  67.0704
93312 1615502269  67.4838
96768 1674025489  69.4963
98304 1700021251  72.0720
100352 1734668777  74.1605
102400 1769301077  77.7934
104976 1812840839  78.9627
110592 1907684153  80.0951
114688 1976791967  82.2443
115200 1985426669  86.3511
116640 2009707367  91.6873
131072 2147483647  94.1572
Code:
Device              GeForce GTX 1080
Compatibility       6.1
clockRate (MHz)     1797
memClockRate (MHz)  5005

fft    max exp  ms/iter
1      22133   0.1797
...
4608   85111207   4.3534
...
65536 1143276383  66.6890
69120 1204418959  70.9543
69984 1219148351  73.3338
73728 1282931137  73.5568
75264 1309078039  81.4435
76832 1335757897  83.7932
81920 1422251777  83.9614
82944 1439645131  85.6173
84672 1468986017  91.7141
86016 1491797777  95.1226
86400 1498314007  95.1886
93312 1615502269  96.7214
96768 1674025489  99.2282
98304 1700021251 103.3630
100352 1734668777 105.7010
102400 1769301077 110.7946
104976 1812840839 112.6621
110592 1907684153 114.1252
114688 1976791967 117.3213
115200 1985426669 123.8906
116640 2009707367 131.4965
131072 2147483647 133.0254
139968 2147483647 149.7381
147456 2147483647 150.2931
163840 2147483647 171.1358
165888 2147483647 175.8527
169344 2147483647 185.7351
172032 2147483647 194.0131
172800 2147483647 200.9266
174960 2147483647 202.3867
184320 2147483647 202.9247
186624 2147483647 207.8176
193536 2147483647 215.1034
200704 2147483647 217.1268
204800 2147483647 225.7533
209952 2147483647 231.5600
221184 2147483647 232.8545
229376 2147483647 239.4878
230400 2147483647 267.3366
233280 2147483647 267.8970
236196 2147483647 272.5134
262144 2147483647 273.2270
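
As a cross-check of the quoted run times against the benchmark tables, here is a small Python sketch. It takes the exponents as 10^9 and 2^31 - 1, assumes about p iterations per test, and uses the ms/iter values from the 65536 row (GTX 1080 Ti) and the 131072 row (GTX 1080). Picking those two rows is an assumption made here; the post does not say which rows the estimates came from.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer

def run_time_years(exponent: int, ms_per_iter: float) -> float:
    """Approximate LL run time: about `exponent` iterations at ms_per_iter each."""
    return exponent * ms_per_iter / 1000.0 / SECONDS_PER_YEAR

# The "max exp" column stops increasing at exactly this value,
# consistent with a signed 32-bit limit.
print(INT32_MAX == 2147483647)

ti_years = run_time_years(10**9, 49.4602)         # GTX 1080 Ti, fft 65536
gtx_years = run_time_years(INT32_MAX, 133.0254)   # GTX 1080, fft 131072

print(f"p ~ 10^9 on GTX 1080 Ti: about {ti_years:.2f} years")
print(f"p ~ 2^31 on GTX 1080:    about {gtx_years:.2f} years")
```

Both results land close to the "about 1.5 years" and "~9 years" figures given in the post.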

Last fiddled with by kriesel on 2019-04-29 at 19:49

2019-05-02, 15:43   #208
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11·347 Posts

Quote:
 Originally Posted by DukeBG He did implement res64's from what it looks (and the display of the current exponent was there allegedly from the beginning), though never posted any for smaller exponents to compare the validity of the tests. Because the i+a (see above). He never posted timings... Because he never actually had them. From the nvidia thread you can read that he doesn't himself own any hardware that his software "expects", with code in place to abort calculations if they are "taking too long". And that code always firing for him. It's a wonder how he's developing software that he himself is unable to truly test and that's probably the most amusing thing about the whole shebang.
Enjoyed your post. And it raised a question or two for me.

Where did you see the implementation of res64's in CEMPLLA?

It's my recollection that more than one forum member that I believe to be respected and competent, and possessing the necessary hardware, tried to install and test the CEMPLLA software, and it failed to install and run.

It's also my recollection that the author made reference to displaying timing info as seconds/iteration. If that's multiple seconds, that's low performance.

If the author could not run it, and others could not run it, it may have been a substantial accomplishment in coding (undocumented performance questions aside), but I'm not sure we should call it developing.

That aside, if the tone of his announcements were more mainstream, the secrecy gone, and the performance and reliability reasonable, I think it would be welcomed.

Last fiddled with by kriesel on 2019-05-02 at 16:18

2019-05-02, 15:51   #209
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

3×1,811 Posts

Nah, the author was just greedy. Wanting to use other people's time, effort and resources to get the prize money for himself. That's all. It was a mistaken path though, because people aren't stupid, and finding large primes is hard.

All times are UTC.