mersenneforum.org  

Old 2022-10-29, 00:59   #12
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2C6E₁₆ Posts

Quote:
Originally Posted by joejoefla
I'll stop testing on my desktop until further notice.

Don't stop testing. Switch to DC (double-check) work, and confirm sanity against already-verified results.

Watch your thermals, voltages, and so on.

Thanks again, George et al.

Seriously.
Old 2022-10-29, 12:36   #13
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴×3×163 Posts

Quote:
Originally Posted by joejoefla
Probably did. If it did, I apologize; I'll get it figured out. I'll stop testing on my desktop until further notice.

Thank you. Please qualify each of your systems by running PRP/GEC/proof DC on them. A system that detects an error via GEC backs up from the detection point and retries from the last known-good savefile, so you won't lose much throughput unless reliability is truly terrible. The eventual result, assuming the run manages to complete, will be good, because the GEC has that excellent an error detection rate. (I once had a system so unreliable that it could no longer make progress in a PRP/GEC run. I recycled it.) Systems that show a nonzero GEC error detection rate should not be used for P-1 or LL until the problem is corrected.
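The detect-and-roll-back behaviour described above can be sketched in a few lines. This is a toy illustration of a Gerbicz-style check on a base-3 PRP test of a small Mersenne number, not the code that Prime95 or gpuowl actually run. The function name `prp_gec`, the block size `L`, the `fault_at` fault injector, and the choice to verify after every block (real programs verify far less often to keep the overhead small) are all assumptions made for the example.

```python
def prp_gec(p, L=4, fault_at=None):
    """Toy base-3 PRP test of 2**p - 1 guarded by a Gerbicz-style check.

    Returns (is_probable_prime, rollbacks). Illustrative only.
    """
    n = (1 << p) - 1
    x = 3                        # x_i = 3^(2^i) mod n after i squarings
    d = 3                        # running product x_0 * x_L * x_2L * ... mod n
    i = 0
    result = None
    saved = (i, x, d, result)    # last verified state, standing in for a savefile
    rollbacks = 0
    end = -(-p // L) * L         # run to the first block boundary at or past p
    while i < end:
        for _ in range(L):
            x = x * x % n
            i += 1
            if i == fault_at:
                x ^= 1           # simulated transient bit flip (hypothetical)
                fault_at = None
            if i == p:
                result = x       # 3^(2^p) mod n, the value the PRP test inspects
        d_new = d * x % n
        # Identity that must hold if no squaring in this block went wrong:
        # d_new == 3 * d^(2^L) (mod n)
        if d_new == 3 * pow(d, 1 << L, n) % n:
            d = d_new
            saved = (i, x, d, result)
        else:
            rollbacks += 1
            i, x, d, result = saved   # back up and redo the block
    return result == 9, rollbacks
```

Calling `prp_gec(13, fault_at=8)` injects one bad squaring; the check fails at the end of that block, the state is restored, and the run still finishes with the right answer at the cost of one repeated block.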
I suggest you send George the list of all recent P-1 attempts made on your error-prone desktop (extractable from its results.txt or results.json.txt), and on any other system that GEC-guarded runs show to be unreliable, so volunteers can re-run them.
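A minimal sketch of pulling that list out of a results.json.txt file follows. It assumes the file looks like the excerpts posted in this thread: one JSON object per line, interleaved with bracketed timestamp lines. The worktype labels in `P1_WORKTYPES` are an assumption (check them against your own file), and `p1_exponents` is a hypothetical helper name.

```python
import json

# Assumed labels for P-1 records; verify against your own results file.
P1_WORKTYPES = {"P-1", "PM1"}

def p1_exponents(lines, computer=None):
    """Return exponents of P-1 results, optionally for one computer only."""
    found = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue                      # skip timestamps and blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip truncated or damaged lines
        if record.get("worktype") not in P1_WORKTYPES:
            continue
        if computer is not None and record.get("computer") != computer:
            continue
        found.append(record["exponent"])
    return found
```

Typical use would be `p1_exponents(open("results.json.txt"), computer="11600k")`, with the computer name taken from the records themselves.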
Old 2022-10-29, 15:16   #14
joejoefla
 
May 2019

2⁵ Posts

Just emailed him.

In other news, I ran a double-check and the result came through fine.
[Sat Oct 29 09:18:34 2022]
{"status":"P", "exponent":30402457, "worktype":"LL", "fft-length":1638400, "shift-count":27069027, "error-code":"00000000", "security-code":"D6081688", "program":{"name":"Prime95", "version":"30.8", "build":17, "port":4}, "timestamp":"2022-10-29 13:18:34", "user":"joejoefla", "computer":"11600k"}

Running another DC to see what that does. It's like the error went away! Keep in mind I'm running stock on everything, except that this time I downclocked my 11600K from 4.6 GHz to 4.0 GHz. Maybe leaving the clocks at the stock 4.6 GHz is what was causing the hardware issues. CPU temps were at 95 °C.

Last fiddled with by joejoefla on 2022-10-29 at 15:19
Old 2022-10-29, 16:20   #15
joejoefla
 
May 2019

2⁵ Posts

Running a DC of 62825017 on my CPU and GPU. Will report back on results.
Old 2022-10-29, 17:45   #16
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts

Use normal-size exponents, and PRP/GEC/proof as DC. Some issues depend on memory location, FFT size, or temperature, and some don't show up in a shorter run. A cooler overnight run may have no problems, while a late-afternoon run with a warmer ambient temperature may. The Jacobi check on LL DC detects only about 50% of errors; GEC on PRP catches nearly all of them.
Good luck.
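To see where the 50% figure comes from, here is a small sketch of an LL test with a Jacobi check, run on a tiny exponent. It relies on the known property that, for the standard sequence starting at 4, the Jacobi symbol of s − 2 modulo 2^p − 1 is −1 at every iteration from the first onward. A corrupted residue behaves like a random value, and roughly half of all values still give −1, so about half of all errors slip through. The function names are hypothetical, and checking at every iteration is a simplification; real programs check only occasionally.

```python
def jacobi(a, n):
    """Jacobi symbol (a | n) for odd positive n."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def ll_with_jacobi(p):
    """Lucas-Lehmer test of 2**p - 1 (odd prime p), checking the Jacobi
    symbol after every iteration. Returns (is_prime, all_checks_passed)."""
    m = (1 << p) - 1
    s = 4
    checks_ok = True
    for _ in range(p - 2):
        s = (s * s - 2) % m
        if jacobi(s - 2, m) != -1:
            checks_ok = False
    return s == 0, checks_ok

def undetected_fraction(p):
    """Fraction of all possible residues that would pass the Jacobi check."""
    m = (1 << p) - 1
    passing = sum(1 for x in range(m) if jacobi(x - 2, m) == -1)
    return passing / m
```

For p = 13, `undetected_fraction(13)` comes out just under one half, which is the sense in which the Jacobi check catches only about half of the errors.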
Attachment: prp as dc.png
Old 2022-10-29, 19:21   #17
joejoefla
 
May 2019

20₁₆ Posts

[Sat Oct 29 15:12:34 2022]
{"status":"C", "exponent":62825017, "worktype":"LL", "res64":"BD055E86CCD80522", "fft-length":3440640, "shift-count":40443958, "error-code":"00000000", "security-code":"CA5D721F", "program":{"name":"Prime95", "version":"30.8", "build":18, "port":8}, "timestamp":"2022-10-29 19:12:34", "user":"joejoefla", "computer":"VMWare1"}


I reran the problem LL on a server. The residue matches the previous test.
Old 2022-10-29, 19:26   #18
joejoefla
 
May 2019

100000₂ Posts

2022-10-29 14:58:11 config: -user joejoefla -cpu 11600k -proof 9 -maxalloc 7000 -safeMath
2022-10-29 14:58:11 device 0, unique id ''
2022-10-29 14:58:11 11600k 30402457 FFT: 1664K 256:13:256 (17.84 bpw)
2022-10-29 14:58:11 11600k Expected maximum carry32: 1FFA0000
2022-10-29 14:58:11 11600k OpenCL args "-DEXP=30402457u -DWIDTH=256u -DSMALL_HEIGHT=256u -DMIDDLE=13u -DPM1=0 -DWEIGHT_STEP_MINUS_1=0xe.c430d7f6117c8p-7 -DIWEIGHT_STEP_MINUS_1=-0xd.3d3458cdb423p-7 -cl-std=CL2.0 -cl-finite-math-only "
2022-10-29 14:58:11 11600k

2022-10-29 14:58:11 11600k OpenCL compilation in 0.05 s
2022-10-29 14:58:11 11600k 30402457 LL 0 loaded: 0000000000000004
2022-10-29 15:01:05 11600k 30402457 LL 100000 0.33%; 1737 us/it; ETA 0d 14:37; 0000000000000002
2022-10-29 15:03:58 11600k 30402457 LL 200000 0.66%; 1733 us/it; ETA 0d 14:32; 0000000000000002
2022-10-29 15:06:54 11600k 30402457 LL 300000 0.99%; 1760 us/it; ETA 0d 14:43; 0000000000000002
2022-10-29 15:09:31 11600k 30402457 LL 400000 1.32%; 1563 us/it; ETA 0d 13:01; 0000000000000002
2022-10-29 15:11:57 11600k 30402457 LL 500000 1.64%; 1466 us/it; ETA 0d 12:11; 0000000000000002
2022-10-29 15:14:27 11600k 30402457 LL 600000 1.97%; 1500 us/it; ETA 0d 12:25; 0000000000000002
2022-10-29 15:14:27 11600k 30402457 EE 500000 (jacobi == 0)
2022-10-29 15:14:27 11600k 30402457 LL 0 loaded: 0000000000000004
2022-10-29 15:17:06 11600k 30402457 LL 100000 0.33%; 1585 us/it; ETA 0d 13:20; 0000000000000002
2022-10-29 15:19:47 11600k 30402457 LL 200000 0.66%; 1616 us/it; ETA 0d 13:33; 0000000000000002
2022-10-29 15:22:28 11600k 30402457 LL 300000 0.99%; 1610 us/it; ETA 0d 13:28; 0000000000000002

Getting this when running GPUOWL6. Just simply a hardware error?
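For anyone who wants to watch for this pattern automatically, the following sketch scans a log like the one above for error lines, reloads, and per-iteration timings. The regular expressions are fitted only to the lines shown in this post, so other gpuowl versions or work types may need different patterns; `summarize_log` is a hypothetical helper name.

```python
import re

# Patterns fitted to the log excerpt above; treat them as assumptions.
ERROR = re.compile(r" EE\s+(\d+) \((.+)\)")
LOADED = re.compile(r" LL\s+(\d+) loaded:")
PROGRESS = re.compile(r" LL\s+(\d+)\s+[\d.]+%;\s+(\d+) us/it")

def summarize_log(text):
    """Collect error events, reload points, and us/it timings from a log."""
    errors, reloads, timings = [], [], []
    for line in text.splitlines():
        if m := ERROR.search(line):
            errors.append((int(m.group(1)), m.group(2)))
        elif m := LOADED.search(line):
            reloads.append(int(m.group(1)))
        elif m := PROGRESS.search(line):
            timings.append(int(m.group(2)))
    return {"errors": errors, "reloads": reloads, "timings_us": timings}
```

On the excerpt above this would report one error event at iteration 500000 followed by a second load from iteration 0, which is the restart visible in the log.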
Old 2022-10-29, 21:06   #19
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2·11²·47 Posts

Quote:
Originally Posted by joejoefla
Running DC of 62825017 on my cpu and GPU. Will report back on results.

Thank you for continuing to "Work this problem." It can take some time to "make friends" with software, particularly when it is interacting with hardware, sometimes at very high speeds, temperatures, voltages, and currents.

My question right now is: is there a "standard suite" of tests that one would run through the current code paths to prove things are sane?

Separately, what kind of logging can be done to capture the empirical data?

Please forgive me if I sometimes come across as being profoundly stupid in my thinking. It happens sometimes.

But sometimes stupid questions result in good discussion, and in the convergence of good ideas.

I hope that makes sense to many...
Old 2022-10-30, 22:45   #20
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2×11²×47 Posts

Quote:
Originally Posted by chalsall
My question right now is: is there a "standard suite" of tests that one would run through the current code paths to prove things are sane?

Actually... Ken... That was a question directed to you specifically.

Given multiple code paths running on different hardware (CPU, GPU, and so on), do you have a suggestion as to what sort of "test suite" one might use, particularly right at the FFT edges?

I don't get out much. Sincere question.
Old 2022-10-30, 23:14   #21
paulunderwood
 
Sep 2002
Database er0rr

5·937 Posts

Quote:
Originally Posted by joejoefla
Getting this when running GPUOWL6. Just simply a hardware error?

Why not run gpuowl version 7? Also, why are you doing LL and not PRP?

Somewhere your hardware is not right. The 3 main things to check are:
  • heat
  • voltages
  • timings

Whatever OS you are using, install something that monitors the first two. As for timings, don't overclock ... well at least not at first.

Last fiddled with by paulunderwood on 2022-10-30 at 23:15
Old 2022-10-31, 00:58   #22
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2·11²·47 Posts

Quote:
Originally Posted by paulunderwood
Why not run version 7? Also why are you doing LL and not PRP?

A mentor of mine once shared a very insightful story...

Student: "Why did they do it that way?"

Mentor: "Well... That's actually a two-part question..."

Mentor: "'Why?' involves a whole lotta engineering decisions that might, or might not, have made sense at the time."

Mentor: "'Did they do it that way?' Answer: Yes."

Quote:
Originally Posted by paulunderwood
Somewhere your hardware is not right. ... Whatever OS you are using, install something that monitors the first two. As for timings, don't overclock ... well at least not at first.

Completely agree.

Few appreciate the importance of empirical experimentation, even when it comes to so-called "deterministic" code.

Put it on a test-bench and run it multiple times. See what it does.

If the hardware isn't sane, the software /should/ "see" this.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
