[QUOTE=chalsall;394922]I didn't know Primenet offered such notification. Cool. :smile:
As someone who uses mprime to ensure the computers I'm responsible for are "sane" (which is why I only do DCs), it would be great if there could be a report (and, optionally, emails) which details which results are "Suspect" vs. which only didn't match a previous LL (or both). Important metrics for those who are responsible for 24/7/36[56].[/QUOTE] George added the option in the account details somewhere. As of now it only emails for a suspect result and this is before any TC happens. In this case, my result was actually fine so there was no cause for alarm but it's still really nice to know. I think a report of all suspicious results along with the resolution when the time comes would be useful information also. For now trying to find mismatched results is... clunky at best. |
[QUOTE=TheMawn;394928]George added the option in the account details somewhere. As of now it only emails for a suspect result and this is before any TC happens. In this case, my result was actually fine so there was no cause for alarm but it's still really nice to know.
I think a report of all suspicious results along with the resolution when the time comes would be useful information also. For now trying to find mismatched results is... clunky at best.[/QUOTE]
I think you can find your own bad/suspect results in your account page, can't you? Hmm... I'd have to check but I thought it showed that.
For you personally Mr. The Mawn, I saw 359 LL results and none of them were flagged as suspect or bad so you're in the clear. 253 verified (double-checked) and 106 unverified (only single-checked for now, but no errors seen during the test).
EDIT: Yeah, I looked at my account page, and a couple months back I did some test runs where the initial result was suspect and no double-check had been done yet. So right now, I have one entry in my results that's highlighted in a groovy mustard yellow color that indicates the two residues on that exponent don't match. Mine was the clean run but until a triple-check happens, it's unknown whether mine or the original is correct. Clicking on the exponent to show the full data for it shows the reason it's colored that way...because the tests didn't match.
I kind of forgot I added that... funny how that works. "Suspect" results will be highlighted yellow (in the "Result Type" column), and "Bad" results are colored red. Meanwhile, "Verified" checks get a font color of green and "Unverified" (first time checks) get a font color of blue. ( "no factor" results are just boring black on white )
EDIT #2: Oh... when I looked at "The Mawn"s results, there are actually 20 LL tests where there's an error code reported. 16 of them are verified clean though (residues matched). 4 are still awaiting a double check:
55175819
55214251
65624197
65859847
I don't know what the different error codes indicate. The 2nd one there had a code of 03000300 and the other 3 all show 01000100. For whatever that's worth. |
[QUOTE=Madpoo;395020]I
For you personally Mr. The Mawn, I saw 359 LL results and none of them were flagged as suspect or bad so you're in the clear. 253 verified (double-checked) and 106 unverified (only single-checked for now, but no errors seen during the test). EDIT #2: Oh... when I looked at "The Mawn"s results, there are actually 20 LL tests where there's an error code reported. 16 of them are verified clean though (residues matched). 4 are still awaiting a double check: 55175819 55214251 65624197 65859847 [/QUOTE] I believe those were triggered by a > 0.40 roundoff. I remember getting a massive batch of those during a short stretch. The program takes measures to check them out (using a different FFT for one) but I believe they're still flagged. Even if they're a low risk, it's nice to know where to start looking for a possible error. What didn't help that time is I freaked out when I saw errors and I immediately stopped the test... before Prime95 had a chance to check them out, so they were never resolved. Anyhow, I'm not sure how you found all that out. I'm looking at My Account --> Results and I just filtered out everything except LL and DC, but I can't see any error codes. |
[QUOTE=TheMawn;395046]...
Anyhow, I'm not sure how you found all that out. I'm looking at My Account --> Results and I just filtered out everything except LL and DC, but I can't see any error codes.[/QUOTE]
Magic! Well, that, and looking in the database directly, which logs the error code reported by the client.
I don't really know what the criteria is when a result is checked in, how the server decides to mark it as suspect or not (which should be indicated if you have any like that in your results). Then again, maybe it doesn't show that in the account page, but only on the detailed exponent report page? Not sure.
I would have thought that any result with a non-zero error code would be marked as suspect until a double-check (or triple-check in case of a mismatch) confirms the residue. Results that are the "loser" in a triple-check get marked as "bad". I don't think there's anything that would mark a result bad right away, during check-in. Unless it was an unwanted and unmatching triple-check I suppose.
However, when I look at the LL results, there are a good number (7,127) where the result code is non-zero, but it's currently flagged as "unverified - clean" instead of "unverified - suspect". So there must be something else involved in determining if it's actually a suspect result or not.
For those who like esoteric info: There are 25,194 results (consisting of 24,182 distinct exponents) with an error code of zero, but flagged as "bad" (meaning they were probably the loser in a triple-check, I assume).
For whatever reason, there are 728 exponents that have been checked multiple times, a handful as many as 67 times. Those might be some glitches from a v4 database migration though. They're from the same computer, same shift-count, etc. |
[QUOTE=Madpoo;395200]Well, that, and looking in the database directly, which logs the error code reported by the client.
I don't really know what the criteria is when a result is checked in, how the server decides to mark it as suspect or not (which should be indicated if you have any like that in your results). Then again, maybe it doesn't show that in the account page, but only on the detailed exponent report page? Not sure.[/QUOTE]
Any chance we could ask you to drill down on that? There are many here who participate not only because it's "cool", but also for business reasons (or, at least, have to have an excuse to "the powers that be (but, often, shouldn't be)"). :smile: :wink:
It would be *really* useful to be able to see which of our machines are returning results which /might/ not be sane. A simple mis-match on a DC from time-to-time is to be expected. More serious warnings would, of course, gather greater attention.
To share, many years ago one of my ("mission critical") machines started generating and returning bad results. Thankfully I was able to see this, and retire it, before it "crashed and burned" (it did so two months later). |
[QUOTE=chalsall;395208]
It would be *really* useful to be able to see which of our machines are returning results which /might/ not be sane. A simple mis-match on a DC from time-to-time is to be expected. More serious warnings would, of course, gather greater attention.[/QUOTE] I intend to look into expanding the optional email feature to send users an email when their suspect result is verified or proven bad. For now, you have to manually monitor the web page of your results. @madpoo: There is a description of the error code somewhere buried in this forum (I'd look in the Data subforum). The error code consists of 4 hex counts. One of the counts is the "ignore previous error message, roundoff error was reproducible". Thus if you get a matching count of roundoff errors and count of not-an-error, then the result is considered clean. This happens quite often at the upper limit of an FFT size. |
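For anyone who wants to play with the "matching counts" rule described above, here is a rough Python sketch. It is not the server's actual logic. It assumes the 8-hex-digit code splits into four 2-digit counts, and the field positions (`REPRODUCIBLE_IDX`, `ROUNDOFF_IDX`) are guesses chosen only because they fit the two codes quoted earlier in the thread (03000300 and 01000100). Requiring the other two counts to be zero is also an assumption.

```python
# ASSUMPTION: which of the four counts is which is not stated in the thread.
# These positions are hypothetical and only fit the codes quoted above.
REPRODUCIBLE_IDX = 0  # guessed slot for "error was reproducible" count
ROUNDOFF_IDX = 2      # guessed slot for roundoff error count


def split_error_code(code):
    """Split an 8-hex-digit error code into four counts, left to right."""
    if len(code) != 8:
        raise ValueError(f"expected 8 hex digits, got {code!r}")
    return [int(code[i:i + 2], 16) for i in range(0, 8, 2)]


def looks_clean(code):
    """True if every roundoff error has a matching 'reproducible' count.

    ASSUMPTION: the two remaining counts must also be zero.
    """
    counts = split_error_code(code)
    others = [c for i, c in enumerate(counts)
              if i not in (REPRODUCIBLE_IDX, ROUNDOFF_IDX)]
    return counts[ROUNDOFF_IDX] == counts[REPRODUCIBLE_IDX] and not any(others)
```

Under those assumptions, `looks_clean("01000100")` and `looks_clean("03000300")` both return `True`, while a code with a roundoff count but no matching "reproducible" count, such as `"00000100"`, returns `False`.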
[QUOTE=Prime95;395237]
@madpoo: There is a description of the error code somewhere buried in this forum (I'd look in the Data subforum). The error code consists of 4 hex counts. One of the counts is the "ignore previous error message, roundoff error was reproducible". Thus if you get a matching count of roundoff errors and count of not-an-error, then the result is considered clean. This happens quite often at the upper limit of an FFT size.[/QUOTE] Thanks, I'll see if I can find that info. I gave the results a quick looksie to spot any obvious patterns in the error code and why it was ultimately marked as "unverified - clean" but I'll admit I didn't look that hard. :smile: For my own part, you may recall a certain "thing" back in the late 90's when I had Prime95 running on thousands of machines. Before that all went belly up, I did have a little monitoring script in place that trawled the results file of each machine once every day or two. If it saw any errors, it pinged me and stopped Prime95 on that system. I even emailed the admin I knew in that city to let them know "machine XYZ is acting funny, it might have a memory/hardware problem". Point being, unless you're running these on disconnected systems, even monitoring thousands of clients for weird things wouldn't be too hard on the user side of the fence, with some very basic scripting. I'm guessing Curtis might have something that checks his batch of computers, but that's just a guess. |
[QUOTE=Madpoo;395242]Point being, unless you're running these on disconnected systems, even monitoring thousands of clients for weird things wouldn't be too hard on the user side of the fence, with some very basic scripting.[/QUOTE]
You are being presumptuous. Not all are trained in the "wet work".
[QUOTE=Madpoo;395242]I'm guessing Curtis might have something that checks his batch of computers, but that's just a guess.[/QUOTE]
Perhaps Curtis has enough firepower (paid for by his U) that he really doesn't care?
Edit: Amusing... To me "wet work" means calculations done by biological systems or emulated using deterministic code. It seems to most it means murder or assassination. My bad. |
[QUOTE=chalsall;395243]You are being presumptuous.
Not all are trained in the "wet work". Perhaps Curtis has enough firepower (paid for by his U) that he really doesn't care? Edit: Amusing... To me "wet work" means calculations done by biological systems or emulated using deterministic code. It seems to most it means murder or assassination. My bad.[/QUOTE]
LOL... I wondered at first when I saw "wet work". I'll take it in the cyberpunk form, as a compliment. :smile:
It was years and years ago, but if I remember, all I did was use a list of machine names that had the client running. I'd just connect to each one over the network and grab the results file (I probably copied it using the machine name as the filename).
Then once I'd gathered them all, it was just searching them for any errors or whatever that old version of Prime95 would do if it encountered an error.
I seem to recall some errors would actually generate a hardware error message of some kind. sum of inputs <> sum of outputs or whatever... if I saw something like that, I knew that machine was flaky so I stopped the service (these were running as a service), removed the service, and that was that.
Out of however many thousands of machines, I seem to recall just a handful, a dozen maybe, acting up. |
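The "gather the results files, then search them" approach described above could look something like this in Python. This is a sketch, not the original script. It assumes the files have already been copied into one directory as `<machine name>.txt` (a hypothetical layout), and the `ERROR_RE` pattern is an illustrative guess at what counts as suspicious, not a complete list of Prime95 messages.

```python
import re
from pathlib import Path

# ASSUMPTION: illustrative pattern only. Case-sensitive on purpose, so a
# line such as "Disregard last error." does not count as a hit.
ERROR_RE = re.compile(r"ERROR|SUMOUT|[Hh]ardware failure")


def suspicious_lines(text):
    """Return the lines of a results file that match ERROR_RE."""
    return [line.strip() for line in text.splitlines() if ERROR_RE.search(line)]


def find_flaky_machines(results_dir):
    """Map machine name -> suspicious lines, for files named <machine>.txt.

    ASSUMPTION: results_dir holds one copied results file per machine.
    """
    flagged = {}
    for path in sorted(Path(results_dir).glob("*.txt")):
        hits = suspicious_lines(path.read_text(errors="replace"))
        if hits:
            flagged[path.stem] = hits
    return flagged
```

Calling `find_flaky_machines("collected_results")` (directory name made up) returns a dict keyed by machine name, which is enough to drive an email or a "stop the service" step.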
[QUOTE=Madpoo;395301]I seem to recall some errors would actually generate a hardware error message of some kind. sum of inputs <> sum of outputs or whatever...[/QUOTE]Here's a subset of the errors I've found in Prime95 output over the years[code]([0-9]+) does not divide M([0-9]+)
(Iteration: ([0-9]+)/([0-9]+), )?ERROR: FFT data has been zeroed!
(Iteration: ([0-9]+)/([0-9]+), )?ERROR: ILLEGAL SUMOUT
(Iteration: ([0-9]+)/([0-9]+), )?ERROR: SUM(INPUTS) != SUM(OUTPUTS), ([0-9.e+-]+) != ([0-9.e+-]+)
All intermediate files bad. Temporarily abandoning work unit.
Cannot initialize FFT code, errcode=([0-9]+)
Cannot write to file (.*).
Disregard last error. Result is reproducible and thus not a hardware problem.
Error (creating|writing) worktodo.txt file
Error (reading|writing) intermediate file: (.*)
ERROR: Bad factor for M([0-9]+) found: ([0-9]+)
ERROR: Factor doesn't divide N!
FATAL ERROR: Final result was ([0-9A-F]{8}), expected: ([0-9A-F]{8}).
FATAL ERROR: Reading from temp file.
FATAL ERROR: Resulting sum was ([0-9.e+-]+), expected: ([0-9.e+-]+)
FATAL ERROR: Rounding was ([0-9.e+-]+), expected less than ([0-9.]+(.[0-9]+)?)
Hardware failure detected, consult stress(.txt)? file.
Iteration: ([0-9]+)/([0-9]+), (POSSIBLE )?ERROR: ROUND OFF (([0-9]+.[0-9]+)) > 0.40
Iteration: ([0-9]+)/([0-9]+), ERROR: ROUND OFF (([0-9.e+-]+)) > ([0-9.]+)
Maximum number of warnings exceeded.
Memory allocation error. Day and night memory settings changed to ([0-9]+)MB and ([0-9]+)MB.
Memory allocation error. Trying again using less memory.
Number sent to gwsetup is too large for the FFTs to handle.
Possible hardware failure, consult readme.txt file, restarting test.
Spool file is corrupt. Attempting to salvage data.
SUMOUT error occurred.[/code] |
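As a sketch of how one of the patterns above might be used to pull numbers out of a log line, here is the "ROUND OFF ... > 0.40" pattern written out in Python. The escaping is an assumption: it treats the parentheses around the roundoff value and the dots as literal characters, which the listing as posted does not show. The function name and the sample line in the usage note are made up for illustration.

```python
import re

# ASSUMPTION: literal parentheses and dots are escaped here; the pattern as
# posted above has no backslashes, so this is a reconstruction.
ROUNDOFF_RE = re.compile(
    r"Iteration: (\d+)/(\d+), (POSSIBLE )?ERROR: ROUND OFF \((\d+\.\d+)\) > 0\.40"
)


def parse_roundoff(line):
    """Return iteration, total, roundoff value and the POSSIBLE flag, or None."""
    m = ROUNDOFF_RE.search(line)
    if m is None:
        return None
    return {
        "iteration": int(m.group(1)),
        "total": int(m.group(2)),
        "possible": m.group(3) is not None,
        "roundoff": float(m.group(4)),
    }
```

For a made-up line such as `Iteration: 1234567/55175819, ERROR: ROUND OFF (0.4375) > 0.40`, this returns a dict with `roundoff` equal to `0.4375` and `possible` set to `False`; lines that do not match return `None`.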
[QUOTE=Mark Rose;393763]A bug: when a factor is found by trial factoring (and possibly other methods), the expired dates for ALL uncompleted LL/DC assignments are displayed as the date the trial factor is found, even if those assignments have expired before.
Here are some examples: [url=http://www.mersenne.org/report_exponent/?exp_lo=M69625627&full=1]M69625627[/url] [url=http://www.mersenne.org/report_exponent/?exp_lo=M69419309&full=1]M69419309[/url] [url=http://www.mersenne.org/report_exponent/?exp_lo=M68978089&full=1]M68978089[/url] [url=http://www.mersenne.org/report_exponent/?exp_lo=M68949029&full=1]M68949029[/url] [url=http://www.mersenne.org/report_exponent/?exp_lo=M68850539&full=1]M68850539[/url][/QUOTE] Fixed for factors reported in the future. |