mersenneforum.org  

Old 2017-11-03, 07:54   #12
rudi_m
 
Jul 2005

2×7×13 Posts

I see that 29.4 now comes with libgmp.so. But it is still using the globally installed one, see

Code:
$ ps aux | grep mprime 
rudi     28456  0.2  0.0 144196  5036 pts/10   SNl+ 08:38   0:00 ./mprime -m
$ lsof -p 28456 | grep libgmp
mprime  28456 rudi  mem   REG  254,0  551496  176632 /usr/lib64/libgmp.so.10.1.2
That's not a problem, but if you really want users to pick up the local copy by default, then you should set an rpath when linking, like
Quote:
gcc -Wl,-rpath,'$ORIGIN' ...
(In a Makefile you would need to write $$ORIGIN).
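
On Linux, the same check as the lsof call above can be done by reading /proc/<pid>/maps. The following is only a minimal sketch: the function names `mapped_libs` and `libs_of_pid` are hypothetical, and the parsing assumes the usual six-column layout of that file (address, perms, offset, dev, inode, pathname).

```python
def mapped_libs(maps_text, needle="libgmp"):
    """Return sorted, de-duplicated mapped paths that contain `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def libs_of_pid(pid, needle="libgmp"):
    """Linux-only helper: inspect the mappings of a running process."""
    with open(f"/proc/{pid}/maps") as fh:
        return mapped_libs(fh.read(), needle)
```

Calling `libs_of_pid(28456)` for the mprime process shown above would be expected to list the global /usr/lib64 copy unless an rpath (or LD_LIBRARY_PATH) points the loader at the bundled one.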
Old 2017-11-03, 10:19   #13
R. Gerbicz
 
"Robert Gerbicz"
Oct 2005
Hungary

2662₈ Posts

It is an interesting question whether it is worth keeping or skipping the Jacobi test.

Please check my analysis:
Suppose that there is a 3% error rate for LL/PRP (without error check) and that the test takes time t for a given exponent p.
Do only the strong error check. With it, a single (detected!) error costs t/80*1.002 in time to roll back to a good iteration, if we are at the main wavefront p~8e7 with a (very) traditional error check every 1e6 iterations (I don't know what is currently used in p95). [The extra 0.002 is due to the strong error check.]

Assume exactly a 3% probability of a single error; then the expected overhead from rollback is
Code:
0.03*t/80*1.002=0.00037575*t
But with the Jacobi check you would save only half of this, because it detects errors with 50% probability. Since the Jacobi check costs more than 5 times this, it is simply not worth doing in addition to the strong error check.
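
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the estimate as stated in this post; the function name and parameter names are made up for illustration.

```python
def expected_rollback_overhead(error_rate, p, check_interval, check_cost=0.002):
    """Expected extra time as a fraction of the full test time t."""
    # Rolling back one check interval costs check_interval / p of the run
    # (1e6 / 8e7 = 1/80), inflated by the strong error check's own cost.
    rollback_fraction = check_interval / p
    return error_rate * rollback_fraction * (1 + check_cost)


overhead = expected_rollback_overhead(0.03, 8e7, 1e6)
# The Jacobi check catches an error with probability 1/2, so at best it
# saves half of the expected rollback overhead.
jacobi_saving = 0.5 * overhead
print(f"rollback overhead: {overhead:.8f} * t")
print(f"max Jacobi saving: {jacobi_saving:.9f} * t")
```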

An advantage of the Jacobi check is that it gives a more reliable result, but alongside a strong error check it adds no value: with an error rate of ~1/Mp, the summed error probability over all p>1e6 would be less than 2^(-1e6).


About the 0.2% time overhead of the strong error check:
You simply can't do better than 2/sqrt(p) (where the whole test is 1) if you want at least one strong error check.
So you can achieve the 0.2% (total) time overhead for p>1e6.
What could happen with much larger p and an even better error rate (better than 3%):
With L=H=sqrtint(p)/10 we would see at least 100 error checks, and the overhead would be only 20/sqrt(p), which can be made arbitrarily small if p is "large". But this is still not a recommended setup, because we don't know what the error rate of future memory will be, or which algorithm will be used for integer multiplication.
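
Both figures above (2/sqrt(p) and 20/sqrt(p)) are reproduced by a simple cost model of 1/L + 1/H, with one strong check every L*H iterations. That model is an assumption made here for illustration, not something the post states; the function names are hypothetical.

```python
import math


def check_overhead(L, H):
    # Assumed model: relative time overhead of the strong check is 1/L + 1/H.
    return 1.0 / L + 1.0 / H


def checks_performed(p, L, H):
    # Assumed model: one strong check every L*H iterations of a p-iteration test.
    return p // (L * H)


p = 10**6
s = math.isqrt(p)  # sqrtint(p) = 1000

# L = H = sqrtint(p): a single check, overhead 2/sqrt(p) = 0.2%.
print(checks_performed(p, s, s), check_overhead(s, s))

# L = H = sqrtint(p)/10: 100 checks, overhead 20/sqrt(p) = 2% at p = 1e6.
print(checks_performed(p, s // 10, s // 10), check_overhead(s // 10, s // 10))
```

Under this model the second setup only becomes cheap once p is far larger than 1e6, which matches the remark that 20/sqrt(p) is small only for "large" p.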

Last fiddled with by R. Gerbicz on 2017-11-03 at 10:20 Reason: typo on exponent
Old 2017-11-03, 10:49   #14
R. Gerbicz
 
"Robert Gerbicz"
Oct 2005
Hungary

2×3⁶ Posts

Ouch, forget the first part of my post: for the PRP test there is no Jacobi check!
Old 2017-11-03, 14:35   #15
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

3326₁₀ Posts

Feature request: Can there be a defined value for "low memory" as used in "LowMemWhileRunning" and "MaxHighMemWorkers"? For example, I have 64GB RAM and let my P-1 workers use 11GB each, but not while running Photoshop. Rather than locking out stage2 entirely, could there be an option to use the "low" memory amount (e.g. 1GB instead of 11GB)? This option already exists by time of day (e.g. "Memory=5000 during 7:30-23:30 else 50000") but I would like it to kick in for "LowMemWhileRunning".
Old 2017-11-03, 16:30   #16
GP2
 
Sep 2003

13×199 Posts

Something very weird happened yesterday in one of my work directories.

I was running the prerelease 29.4b2, and the only unusual thing I can think of was that I was using InterimResidues=10000 in prime.txt

This might be a rare old problem though, rather than a new one caused by 29.4, and it might explain why exponents occasionally get abandoned after a large percentage of the LL test has already been done.


The worktodo.txt file looked like this before:

Code:
DoubleCheck=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx,82082339,75,1
AdvancedTest=32985569
(I had queued up a triple check of an old exponent, but that's not really relevant)

Now it looks like this:

Code:
DoubleCheck=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx,46843957,73,1
In other words, the worktodo.txt file, with exponent 82082339 at 68.3% completed, was simply deleted, and recreated from scratch.

The exponent still shows up assigned to me in my Assignment Details page (for each person's account it's at https://mersenne.org/workload/ ), but it's gone from worktodo.txt, and if I wasn't monitoring the progress of my exponents and this one in particular, it would no doubt expire after 40 days or so.

prime.log looks like this:

Code:
[Thu Nov  2 16:01:34 2017 - ver 29.4]
Registering assignment: LL M32985569
PrimeNet error 40: No assignment
ra: redundant LL effort, exponent: 32985569
Registering assignment: LL M32985569
PrimeNet error 40: No assignment
ra: redundant LL effort, exponent: 32985569
[Thu Nov  2 16:17:59 2017 - ver 29.4]
Registering assignment: LL M32985569
PrimeNet error 40: No assignment
ra: redundant LL effort, exponent: 32985569
Registering assignment: LL M32985569
PrimeNet error 40: No assignment
ra: redundant LL effort, exponent: 32985569
[Thu Nov  2 16:30:31 2017 - ver 29.4]
Registering assignment: LL M32985569
PrimeNet error 40: No assignment
ra: redundant LL effort, exponent: 32985569
Registering assignment: LL M32985569
PrimeNet error 40: No assignment
ra: redundant LL effort, exponent: 32985569
[Thu Nov  2 16:45:58 2017 - ver 29.4]
Registering assignment: LL M32985569
PrimeNet error 40: No assignment
ra: redundant LL effort, exponent: 32985569
[Thu Nov  2 16:49:17 2017 - ver 29.4]
Getting assignment from server
PrimeNet success code with additional info:
Server assigned Lucas Lehmer primality double-check work.
Got assignment xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Double check M46843957
Sending expected completion date for M46843957: Nov  7 2017
[Thu Nov  2 18:11:33 2017 - ver 29.4]
Sending expected completion date for M46843957: Nov  7 2017
The results.txt file looks like this:

Code:
[Thu Nov  2 16:38:42 2017]
M82082339 interim Wg8 residue 04EEEDF862DE5EF2 at iteration 56020000
M82082339 interim Wg8 residue 959127764538F882 at iteration 56020001
M82082339 interim Wg8 residue A0B5015145E9626F at iteration 56020002
[Thu Nov  2 16:59:39 2017]
M46843957 interim Wg8 residue 7F425D08EC6C04EC at iteration 10000
M46843957 interim Wg8 residue EBE26DB177C74266 at iteration 10001
M46843957 interim Wg8 residue 86524632C0753142 at iteration 10002
The timestamp of the savefile is:

Code:
-rw-r--r-- 1 root ec2-user 10260364 Nov  2 16:40 p82082339
The launch time of the current instance is November 2, 2017 at 18:28:22 (all times are UTC).

There were a lot of predictable complaints in prime.log about the unassigned triple check, which are irrelevant, I've seen those before. There may have been a lot of restarts in a short period of time, because I run spot instances on the AWS cloud, and when the spot market price is close to the limit price it is not uncommon for instances to launch and then get terminated after only a few minutes.

The point is, in monitoring expiring exponents for strategic doublechecks, it's not that uncommon to see exponents abandoned by other users after a large percentage of the LL test has been completed. That's been a mystery, and we always assumed that this was because the user quit GIMPS, or their computer died, or some other reason. But now it looks like exponents can get wiped from worktodo.txt without any user intention.

This is NOT the same as the "unreserving" bug. In that one, prime.log records "Unreserving Mxxxxxxxx" lines, the assignment also disappears from the Assignment Details page, and exponents get deleted from the bottom of the worktodo.txt file, but exponents that already have more than 0.1% progress are immune from being unreserved.

Here it looks like the existing worktodo.txt file simply got deleted somehow, and a new one got created from scratch. We know that the program rewrites the worktodo.txt file after each DiskWrite interval, because if you manually edit it, then that edit gets overwritten at the next DiskWrite and the program restores what it thinks the worktodo.txt file should contain.

So is it possible that this periodic worktodo.txt rewrite is done non-atomically? Maybe the existing file gets deleted and then the new version is written immediately after? If so, and the system gets shut down or rebooted at precisely the moment between deletion of the old worktodo.txt and creation of the new one, then when the system boots up again there is no worktodo.txt and a new one gets created from scratch. The assignments in the old file just end up expiring some weeks later, and the existing work progress is lost.

Could that be what happened here, and is it happening regularly elsewhere? Even if the odds are one in 10,000 of a reboot happening at exactly the right (wrong) moment, we are dealing with many millions of LL tests being done, after all.
Old 2017-11-03, 18:17   #17
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

16357₈ Posts

Quote:
Originally Posted by rudi_m View Post
I see that 29.4 now comes with libgmp.so. But it is still using the globally installed one.
I included the library since one linux user reported difficulty getting libgmp. I think he was using an older distro. I thought including the library would be more convenient than telling users to go build it from scratch.

I have no idea what the "correct" solution is. I would think using the global one is best in case the user has gone to the trouble of making a version optimized for his machine or a newer libgmp has been released with bug fixes and better algorithms.

I'm more than happy to do whatever the linux experts here say is best.
Old 2017-11-03, 18:28   #18
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

3²×823 Posts

Quote:
Originally Posted by GP2 View Post
So is it possible that this periodic worktodo.txt rewrite is done non-atomically? Maybe the existing file gets deleted and then the new version is written immediately after?
Yes, that is exactly how it is coded.

Save files are written using the more tedious process of creating x.write, deleting x, and renaming x.write to x. On reading, the program looks for x and, if it does not exist, looks for x.write.

It looks like I need to do the same for worktodo.txt, prime.txt, and local.txt.
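
The write-then-rename scheme described above can be sketched in Python as follows. This is an illustration only, not the program's actual code (which is not shown in this thread); `safe_write` and `safe_read` are hypothetical names, and the fsync call is an extra precaution assumed here.

```python
import os


def safe_write(path, text):
    """Write `text` so that either `path` or `path`.write always exists."""
    tmp = path + ".write"
    with open(tmp, "w") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())  # push the new data to disk before touching the old file
    if os.path.exists(path):
        os.remove(path)        # a crash right here leaves only path.write behind
    os.rename(tmp, path)


def safe_read(path):
    """Read `path`, falling back to `path`.write if a crash interrupted a rewrite."""
    for candidate in (path, path + ".write"):
        if os.path.exists(candidate):
            with open(candidate) as fh:
                return fh.read()
    return None
```

In Python, `os.replace(tmp, path)` would collapse the delete and rename into a single call (atomic on POSIX filesystems), which removes the window where only the .write file exists. The fallback read is what keeps the two-step version safe.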
Old 2017-11-03, 19:06   #19
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

3·977 Posts

Quote:
Originally Posted by Prime95 View Post
There are very few PRP-D assignments available. If you run out of assignments, you can switch to first-time PRP.
I may switch back to doing DCLL if I can figure out the stability issues. I kind of want to finish all the DCLL.
Old 2017-11-03, 20:04   #20
GP2
 
Sep 2003

13·199 Posts

Quote:
Originally Posted by Mark Rose View Post
I've switched my flaky machine to this version and PRP-D.

I've eliminated the memory channels and memory sticks as issues and power supply is next.
Why not have your flaky machine do Gerbicz PRP first time checks instead? No better way to put the vaunted error detection code to a thorough test. The flakier the better.

I am running PRP-DC on ten machines at the moment. They're being assigned gpuOwL exponents so far, so I could queue up your exponents after they're finished.
Old 2017-11-03, 20:09   #21
ATH
Einyen
 
Dec 2003
Denmark

2²×769 Posts

I had another issue with 24b2 on an Amazon instance. I was doing a PRP CF and after a break for the automatic benchmark, it suddenly could not read any of the save files and I had to restart the work.

@GP2: Maybe the "Elastic File System" is not 100% reliable on Amazon instances. Any limit to how many instances can write to the same EFS?

Quote:
Iteration 11000000 / 28035701
M28035701/known_factors interim Wg8 residue E1E2C59AACEA8CE6 at iteration 11000000
[Thu Nov 2 04:04:05 2017]
Iteration 12000000 / 28035701
M28035701/known_factors interim Wg8 residue 6358034B610F739C at iteration 12000000
[Thu Nov 2 05:54:28 2017]
Iteration 12947159 / 28035701
FFTlen=1440K, Type=3, Arch=4, Pass1=320, Pass2=4608, clm=4 (1 core, 1 worker): 7.03 ms. Throughput: 142.32 iter/sec.
.
.
.
FFTlen=1728K, Type=3, Arch=4, Pass1=768, Pass2=2304, clm=1 (1 core, 1 worker): 8.73 ms. Throughput: 114.59 iter/sec.
Error reading intermediate file: pS035701
Renaming pS035701 to pS035701.bad1
Trying backup intermediate file: pS035701.bu
Error reading intermediate file: pS035701.bu
Renaming pS035701.bu to pS035701.bad2
All intermediate files bad. Temporarily abandoning work unit.
Old 2017-11-03, 21:07   #22
GP2
 
Sep 2003

5033₈ Posts

Quote:
Originally Posted by ATH View Post
I had another issue with 24b2 on an Amazon instance. I was doing a PRP CF and after a break for the automatic benchmark, it suddenly could not read any of the save files and I had to restart the work.

@GP2: Maybe the "Elastic File System" is not 100% reliable on Amazon instances. Any limit to how many instances can write to the same EFS?
I had the same issue, but that wasn't a filesystem issue, it was a one-time backwards incompatibility done deliberately by the mprime program.

In the final 24b2 pre-release version, George introduced Gerbicz-like error testing for PRP-CF. You can see that the results started appearing as "Unverified (Reliable)" on the exponent status page. As a result, all the old save files could not be resumed. The program renamed them to bad, and restarted those exponents. It was a one-time issue, and not really an issue since we are still doing small exponents and lost only minutes of work.


As for the Elastic File System, it does have the same issues as any other network file system, with latency and replication across availability zones. But not file corruption.

One major issue with EFS is that I/O throughput is throttled. You are charged according to how much disk space your EFS filesystem uses, and your I/O rate is proportional to how much disk space you use. It's documented, but it caught me by surprise when I first encountered it.

If you keep the default DiskWriteTime of 30 minutes, and only do LL testing, which creates relatively small save files, then you should be OK. But if you do stuff that creates big save files, like P−1, ECM, Fermat with large B2, or you reduce your DiskWriteTime to smaller values, then you start having problems. Simple Linux commands take a long time to complete, or you see 100M .write files taking several minutes to finish writing.

The long term solution for using mprime on the cloud would be to have it read and write directly to S3 storage instead of to files. The throttling doesn't happen with I/O to S3, or to an instance's EBS storage ("local disk space").

Last fiddled with by GP2 on 2017-11-03 at 21:11

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.