mersenneforum.org > Data Strategic couple and dribble checks (PRP's, P-1's, and special Certs too)

 2022-11-13, 18:42 #12 LaurV Romulan Interpreter     "name field" Jun 2011 Thailand 3·5·683 Posts You both didn't get it; read my post again. It is not about this particular exponent, nor about using CUDALucas in the future. We know it is slower, and now I have proved it is buggy too. Not the first time I did that, either (see 2012). I don't know whether other FFT lengths are affected; there may be some. Therefore there may be exponents that were both LL-tested and double-checked with CUDALucas (the server accepted such results, with different shifts) whose residues matched and yet are wrong. It is not about "fixing" CUDALucas either, as long as we have gpuOwl and PRP with certs. But if such exponents exist, we need to find them and redo the tests. If there are too many to re-test in bulk, then we need to debug CUDALucas to see which FFT lengths are affected, which versions are affected, etc., to eventually shorten the list. I would be quite happy to be wrong, with no test affected. But putting my nose into cudaLucas internals (FFT) is not what I can do. What I can do is isolate the point where the residues start differing and make a checkpoint file close to it. Then I can pass that to somebody who knows the trade (George, Mihai, Ernst, etc.). The bug can be reproduced with the Colab script from Teal/Daniel on A100 and V100. Going to bed, 1:45 AM here. Need to work today too, in a few hours... Last fiddled with by LaurV on 2022-11-13 at 18:45
 2022-11-13, 18:51 #13 Uncwilly 6809 > 6502     Aug 2003 7×1,543 Posts Spin up a thread about this in either GPU Computing or in Software. I understand that there is an issue, but getting a sanity check via Prime95 should show what the right result is. That can point to the answer WRT the FFT issue.
 2022-11-13, 18:56 #14 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 2×11×17×19 Posts PMed James & George asking for the CUDALucas-both-times list, if feasible. (It would require the reporting program to be stored in the database per LL result reported, or a sufficient set of clues to deduce it.) All the more reason to DC LL via PRP/GEC/proof generation and upload and cert. @LaurV, if the 20000K fft deviating residues are reproducible in CUDALucas, please isolate it to the granularity gpuowl accepts on logging intervals (10,000 iterations) or finer. Another possibility is a bug in the NVIDIA CUDA dlls. The Pentium fdiv bugs went undetected for a long time, and were operand dependent. Last fiddled with by kriesel on 2022-11-13 at 19:49
 2022-11-13, 20:06 #15 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 2·11·17·19 Posts LaurV, if you are interested in trying to reproduce the problem on other exponents, you could try the 20000K fft on CUDALucas for M332196607, for which I have a Jacobi-checked and matching final residue, and a full log of interim residues at 50K-iteration spacing. I could probably round up a few others from gpuowl logs.
2022-11-13, 20:59   #16
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

17624₈ Posts

Quote:
 Originally Posted by LaurV But putting my nose into cudaLucas internals (FFT) is not what I can do.
Quote:
 Originally Posted by kriesel Another possibility is a bug in the NVIDIA CUDA dlls.
IIRC, there are no CudaLucas FFT internals, simply a call to the CUDA FFT library. That doesn't mean the bug isn't in CudaLucas, there is the weighting and carry propagation code to consider.

P.S. The database knows which LL results were produced by CudaLucas. I'd be extremely surprised if shift count doesn't protect GIMPS from a bad result getting flagged as DCed.

Last fiddled with by Prime95 on 2022-11-13 at 21:02

2022-11-13, 21:37   #17
ATH
Einyen

Dec 2003
Denmark

2·3·569 Posts

Quote:
 Originally Posted by Prime95 P.S. The database knows which LL results were produced by CudaLucas. I'd be extremely surprised if shift count doesn't protect GIMPS from a bad result getting flagged as DCed.
How many exponents had all tests, 2 or more, done with CUDALucas?

2022-11-13, 21:57   #18
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1BC2₁₆ Posts

Quote:
 Originally Posted by ATH How many exponents had all tests, 2 or more, done with CUDALucas?
TBD. As indicated earlier, I requested by PM that James or George query the database for the list. It's Sunday afternoon in North America; please be patient.

One of the issues with CUDALucas is the absence of either readback or error/success checking after some CUDA library calls. For example, the CUDA documentation for cudaMemcpy lists the possible return values:
Returns: cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
Gpuowl copies host>gpu, then gpu>host, and does a compare on the host to verify correctness of the gpu copy.
CUDALucas does not do readback, and, IIUC, does not check for success or error return values from the call either.
In CUDALucas.cu routine void write_gpu_data(int q, int n),
Code:
  // Square kernel data
  for (j = (n >> 2) - 1; j > 0; j--) s_ct[j] = 0.5 * cospi (j * d);
  cudaMemcpy (g_ct, s_ct, sizeof (double) * (n / 4), cudaMemcpyHostToDevice);
then continues on without checking for success or errors, as if such things could never happen.
It does this for most calls, for speed, yet it is still slower than gpuowl on the same hardware and inputs.

Similarly, in the LL Iteration loop,
Code:
  cufftExecZ2Z (g_plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE);
Gpu to host transfer is sometimes checked:
Code:
  if (error_flag & 3)
    {
      err = cutilSafeCall1 (cudaMemcpy (&terr, g_err, sizeof (float), cudaMemcpyDeviceToHost));
      if (terr > *maxerr) *maxerr = terr;
      //if (g_pf && g_sl) usleep (g_sv); //, nanosleep sleep(1);
    }
  else if (g_pf && (iter % g_po) == 0)
    {
      //if (g_sl) usleep (g_sv); //, nanosleep sleep(1);
    }
  if (err != cudaSuccess) terr = -1.0f;
  return (terr);
}

Last fiddled with by kriesel on 2022-11-13 at 22:00

2022-11-14, 04:10   #19
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

3·5·683 Posts

Quote:
 Originally Posted by kriesel @LaurV, if the 20000K fft deviating residues are reproducible in CUDALucas, please isolate it to the granularity gpuowl accepts on logging intervals (10,000) or finer.
Will do this, please give me a day or two.

2022-11-14, 04:18   #20
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2²·43·47 Posts

Quote:
 Originally Posted by ATH How many exponents had all tests, 2 or more, done with CUDALucas?
If I did the query correctly:

Code:
34643591
34696567
35184673
35381377
35478853
36142801
36211067
36313813
36497473
36532159
36717713
36841111
37018711
37047167
38093491
38208713
38276081
38363993
38931791
38976211
38976221
39052267
39258293
39839603
40123351
40404289
40413371
40473841
40501819
40641659
41508253
41518229
41856721
42791519
43883923
44932729
45243557
45285043
48073099
48075583
48122471
48429497
48555343
48677777
49404263
49457687
53998811
54009271
54010013
55831921
56294479
56309111
57766307
57954781
58370549
58370563
72366587
73604719
73612841
73614041
73642033
73684073
73684703
73685609
73798027
73802059
73812071
73901071
77075387
77143147
88680457
132000191
137362691
666666667

 2022-11-14, 07:47 #21 ATH Einyen     Dec 2003 Denmark 3414₁₀ Posts Hmm, I'm involved in 23 of the 74 exponents. I'm triple-checking my lowest one now with Prime95 30.8 b17: 36532159
 2022-11-14, 10:38 #22 LaurV Romulan Interpreter     "name field" Jun 2011 Thailand 3·5·683 Posts Hmm... there are by far not as many as I expected. I thought there would be more, especially in the 332M range, where I did some myself; but probably those which I both LL'ed and DC'ed myself were already killed by Madpoo with Prime95. I could "owl-LL" all of those, except the bigger ones. As George said, I would be very surprised if the random shift did not catch this bug (and very disappointed too, because there goes my advocacy for the random shift down the drain). Meanwhile, on a 2080 Ti, Windows 10:
Code:
FFT = 20000k (wrong)
| Nov 14 16:48:00 | M332329111 121482200 0x72ac700df6edc14e | 20000K 0.05566 267.1888 2.67s | 6:04:40:47 36.55% |
| Nov 14 17:10:34 | M332329111 121482201 0x1d15bc664e50aa21 | 20000K 0.25000 1.#INF 0.03s | 6:04:40:47 36.55% |
| Nov 14 17:10:34 | M332329111 121482202 0x495c10fac3cb687b | 20000K 0.12500 37.6750 0.03s | 6:04:40:48 36.55% |
| Nov 14 17:10:34 | M332329111 121482203 0x87be878e3f8a71ba | 20000K 0.06250 36.5630 0.03s | 6:04:40:48 36.55% |
| Nov 14 17:10:34 | M332329111 121482204 0xfa90c31f9f4db434 | 20000K 0.05371 36.3340 0.03s | 6:04:40:49 36.55% |
| Nov 14 17:10:34 | M332329111 121482205 0xbe66bb2afd9a4d8a | 20000K 0.05371 36.7410 0.03s | 6:04:40:49 36.55% |
| Nov 14 17:10:34 | M332329111 121482206 0xd9db0fb42ccfebae | 20000K 0.05103 31.4970 0.03s | 6:04:40:50 36.55% |

FFT = 19600k (correct; I mean, like gpuOwl, and like other FFTs I tried at this size)
| Nov 14 16:55:52 | M332329111 121482200 0x72ac700df6edc14e | 19600K 0.09570 271.3789 2.71s | 37:04:05:54 36.55% |
| Nov 14 17:13:36 | M332329111 121482201 0x1d15bc664e50aa21 | 19600K 0.25000 1.#INF 0.03s | 37:04:06:07 36.55% |
| Nov 14 17:13:36 | M332329111 121482202 0x495c10fac3cb687b | 19600K 0.12500 39.0810 0.03s | 37:04:07:09 36.55% |
| Nov 14 17:13:36 | M332329111 121482203 0x87be878e3f8a71ba | 19600K 0.06250 36.9730 0.03s | 37:04:08:06 36.55% |
| Nov 14 17:13:36 | M332329111 121482204 0xfa90c31f9f4db434 | 19600K 0.07324 36.5400 0.03s | 37:04:09:01 36.55% |
| Nov 14 17:13:36 | M332329111 121482205 0xbe66bb2afd9a4d8a | 19600K 0.07324 36.8750 0.03s | 37:04:09:57 36.55% |
| Nov 14 17:13:36 | M332329111 121482206 0xb3b61f68599fd75e | 19600K 0.06958 36.6120 0.03s | 37:04:10:53 36.55% |
The residues match through iteration 121482205 and first differ at 121482206. All tests were run until the next checkpoint matched, to make sure it is not an error; I mean, the next checkpoints on both branches (which were different, as in the former post). Then, where the residues started to differ, the range was split in 10 and run again, the full range, so that the checkpoint at the end matched. When the split reached 1, I ran every branch twice to make sure it is not a hardware error. Once we switched to smaller ranges, all tests were done with error checking on every iteration. No errors were caught. I will share the residue file(s) at iteration 121482200 with George (cudaLucas can show every residue on screen, but the smallest granularity for checkpoints is 10, even if you set it to 1). It is no big secret, they are just 40MB+. Last fiddled with by LaurV on 2022-11-14 at 10:47

