mersenneforum.org  

2019-01-24, 07:13   #177
ATH
Einyen
Dec 2003
Denmark
2×7×223 Posts

Quote:
Originally Posted by Prime95
@ATH,GP2: I just change FixedHardwareUID to not send any HardwareGUID info to the server. This seems to work just fine in my limited testing. I think that will address both of your scenarios.

Just generate a ComputerGUID and use it in as many places as you like.

That did not seem to work. For example, I have 4 instances doing PRP tests, and I was giving them all the same ComputerID:
ComputerID=ec2-PRP

After I switched to 29.5b8, I removed FixedHardwareUID=1, started mprime, stopped it again, and added FixedHardwareUID=1 back. Then I tried to use the same ComputerGUID on all 4 instances. But as I wrote in post #170, it failed after a few hours and all instances created new computer accounts.

2019-01-24, 07:37   #178
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7×1,069 Posts

Quote:
Originally Posted by ATH
That did not seem to work. For example I have 4 instances doing PRP tests, and I was giving them all the computerID:
ComputerID=ec2-PRP

The new functionality is in the next build. Sorry if I did not make that clear.

You'll need to set ComputerGUID=same_long_hex_string in all your instances.
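
As a minimal sketch of producing one such string: the snippet below generates a random 32-character lowercase hex value, which matches the length of the ComputerGUID that ATH posts later in this thread. The helper name make_computer_guid is hypothetical, and whether the server accepts any freshly generated value is an assumption here, not something confirmed in this post.

```python
import uuid

def make_computer_guid() -> str:
    """Return a random 32-character lowercase hex string (hypothetical helper)."""
    return uuid.uuid4().hex

guid = make_computer_guid()
# Paste this same line into local.txt on every instance that should share one identity.
print(f"ComputerGUID={guid}")
```

Run it once and reuse the printed line everywhere; running it per instance would give each instance a different value, which defeats the purpose.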

2019-01-24, 17:03   #179
GP2
Sep 2003
3×863 Posts
29.5 build 8 hanging for b=3 (using PRP-2)

Just a couple of days ago, I started doing PRP tests for 3^p-1 exponents in the 500k range.

I am using PRP-2 for this, because 3^p-1 can't be done with the default PRP-3 for the same reason that 2^p-1 can't be done with PRP-2.
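
To see why the base matters, here is a small sketch that uses Python's built-in modular exponentiation as a stand-in Fermat test (it is not mprime's PRP code, and the function names are illustrative). For N = (3^p - 1)/2 we have 3^p = 1 (mod N), and p divides N - 1, so 3^(N-1) = 1 (mod N) whether or not N is prime. Base 2 has no such shortcut.

```python
def repunit3(p: int) -> int:
    """N = (3^p - 1) / 2, the base-3 generalized repunit with p digits."""
    return (3**p - 1) // 2

def fermat_passes(n: int, base: int) -> bool:
    """Plain Fermat test: does base^(n-1) equal 1 modulo n?"""
    return pow(base, n - 1, n) == 1

for p in (3, 5, 7, 11, 13):
    n = repunit3(p)
    # Base 3 passes for every p, so it carries no information about primality.
    print(p, n, "base 3:", fermat_passes(n, 3), "base 2:", fermat_passes(n, 2))
```

For p = 5, N = 121 = 11^2 is composite, yet the base-3 test still passes, while base 2 rejects it.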

I was doing this range successfully with 29.4 a few months ago, but then paused and didn't resume until just a couple of days ago.


I am using 29.5 build 8 on 64-bit Linux and a single instance of a two-core virtual machine (c5.xlarge on AWS). I have gotten hangs for PRP-2 3^p-1 twice now, on two consecutive days. Edit: the first time it ran for ten hours before hanging, the second time for less than an hour. So this seems to be very reproducible.

Meanwhile I have also been running PRP-3 10^p-1 on four instances and PRP-3 2^p+1 on dozens of instances of the same kind of two-core machines, and hangs were very rare even with older builds of 29.5.

local.txt:
Code:
WorkerThreads=1
CoresPerTest=2

The ps command shows the process is still running but the CPU time is not increasing.
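
One way to automate that check on Linux is to sample the process's CPU time from /proc/<pid>/stat twice and compare. This is a rough sketch, not part of mprime; the function names and the sampling interval are made up for illustration. Per the proc(5) man page, utime and stime are fields 14 and 15 of the stat line, counted in clock ticks.

```python
import time

def cpu_ticks_from_stat(stat_line: str) -> int:
    """Sum utime + stime (fields 14 and 15 of /proc/<pid>/stat)."""
    # The command name (field 2) is parenthesized and may contain spaces,
    # so split only after the last ')'. What follows starts at field 3.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime

def looks_hung(pid: int, interval_s: float = 60.0) -> bool:
    """True if the process used no CPU time over the interval (Linux only)."""
    def sample() -> int:
        with open(f"/proc/{pid}/stat") as f:
            return cpu_ticks_from_stat(f.read())
    before = sample()
    time.sleep(interval_s)
    return sample() == before
```

An idle but healthy process also accumulates no CPU time, so this is only meaningful for a process that is supposed to be computing continuously.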

Using gcore to create a core dump for the still-running process, and gdb to examine the stack, I get:

Code:
(gdb) bt

#0  0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000047f41f in gwthread_wait_for_exit ()
#2  0x000000000041ac41 in LaunchWorkerThreads ()
#3  0x0000000000442b4c in linuxContinue ()
#4  0x000000000040818b in main ()

Code:
(gdb) info threads

  Id   Target Id         Frame
* 1    Thread 0x7f5267491740 (LWP 2672) 0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
  2    Thread 0x7f52662d6700 (LWP 2680) 0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  3    Thread 0x7f5265ad5700 (LWP 2727) 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  4    Thread 0x7f52652d4700 (LWP 3260) 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0

Code:
(gdb) thread apply all bt

Thread 4 (Thread 0x7f52652d4700 (LWP 3260)):
#0  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f171 in gwevent_wait ()
#2  0x000000000046d357 in auxiliary_thread ()
#3  0x000000000047f01a in ThreadStarter ()
#4  0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5  0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f5265ad5700 (LWP 2727)):
#0  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f171 in gwevent_wait ()
#2  0x000000000046dc74 in multithread_op ()
#3  0x000000000046df4d in gwadd3 ()
#4  0x000000000046e7ec in gwsquare2_carefully ()
#5  0x0000000000438de5 in prp ()
#6  0x00000000004443b3 in primeContinue ()
#7  0x00000000004467bb in LauncherDispatch ()
#8  0x000000000044aa64 in Launcher ()
#9  0x000000000047f01a in ThreadStarter ()
#10 0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#11 0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f52662d6700 (LWP 2680)):
#0  0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f150 in gwevent_wait ()
#2  0x000000000044b85f in timed_events_scheduler ()
#3  0x000000000047f01a in ThreadStarter ()
#4  0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5  0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f5267491740 (LWP 2672)):
#0  0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000047f41f in gwthread_wait_for_exit ()
#2  0x000000000041ac41 in LaunchWorkerThreads ()

I'm not sure why there are four threads on a two-core machine. I do not have HyperthreadLL=1 set in the local.txt file.

The log file below shows it hung around 08:58, and then I tried to kill the process at 16:24 and again at 16:26.

Code:
[Work thread Jan 24 08:57] Setting affinity to run helper thread 1 on CPU core #2
[Work thread Jan 24 08:57] Starting 2-PRP test of (3^541439-1)/2 using AVX-512 FFT length 48K, Pass1=128, Pass2=384, clm=1, 2 threads
[Work thread Jan 24 08:58] Iteration: 400000 / 858159 [46.61%], ms/iter:  0.135, ETA: 00:01:01
[Work thread Jan 24 08:58] Iteration: 800000 / 858159 [93.22%], ms/iter:  0.132, ETA: 00:00:07
[Main thread Jan 24 16:24] Stopping all worker threads.
[Main thread Jan 24 16:26] Stopping all worker threads.

In the last two lines, kill -TERM caused the program to print out "Stopping all worker threads" but it remained hung and did not terminate. I terminated it with kill -KILL. Note that kill -QUIT has no effect at all.

Looking at the file directory shows:

Code:
-rw-r--r-- 1 root ec2-user     1821 Jan 24 05:24 results.bench.txt
-rw-r--r-- 1 root ec2-user      809 Jan 24 05:24 gwnum.txt
-rw-rw-r-- 1 root ec2-user 36922144 Jan 24 08:09 results.txt
-rw-r--r-- 1 root ec2-user   691417 Jan 24 08:57 results.json.txt
-rw-rw-r-- 1 root ec2-user  1048494 Jan 24 08:57 worktodo.txt
-rw-rw-r-- 1 root ec2-user      314 Jan 24 14:15 local.txt

So even though it hung at about 08:58, it still managed to update local.txt at 14:15.

The timestamps on the benchmark-related files are 05:24, and log files also show that the program has completed benchmarks successfully twice, at 05:24 today and 05:38 yesterday. It is not hanging in the benchmark, and the stack trace also shows that.

Killing and restarting mprime causes the PRP testing to resume successfully. The problem is not the specific exponent it was working on at the time of the hang.

PS,
I am not sure why the timestamp of results.txt is getting updated, because 29.5 is not writing anything to it; it is writing to results.json.txt instead. The last line in results.txt dates from October and version 29.4.

Last fiddled with by GP2 on 2019-01-24 at 17:41

2019-01-24, 17:52   #180
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7×1,069 Posts

Great info GP2, I'll get right on it. It looks like a problem with the brand new multithreaded add and subtract.

2019-01-24, 22:44   #181
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
1011011101100₂ Posts

@GP2 I assume you realise that 3^p-1 is always even and can't be prime.

2019-01-24, 23:01   #182
GP2
Sep 2003
A1D₁₆ Posts

Quote:
Originally Posted by henryzz
@GP2 I assume you realise that 3^p-1 is always even and can't be prime.

I'm looking for generalized repunits of the form (b^p − 1) / (b − 1), like these and these and these.

Obviously the worktodo line always specifies the known factor b−1.
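
As a quick illustration of that form (a sketch with made-up function names, nothing taken from mprime): the code below builds (b^p - 1)/(b - 1), checks that dividing out b - 1 is exact, and compares the size of (3^541439 - 1)/2 with the iteration total of 858159 shown in the log in post #179.

```python
def generalized_repunit(b: int, p: int) -> int:
    """Return (b^p - 1) / (b - 1); the division is always exact."""
    numerator = b**p - 1
    assert numerator % (b - 1) == 0  # b - 1 is the known factor of b^p - 1
    return numerator // (b - 1)

# In base b, the repunit is written as p ones: 1 + b + b^2 + ... + b^(p-1).
assert generalized_repunit(3, 5) == 1 + 3 + 9 + 27 + 81

n = generalized_repunit(3, 541439)
# Number of bits after the leading one; compare with the log's iteration total.
print(n.bit_length() - 1)
```

This prints 858159. That it equals the iteration total in the log is an observation about these numbers, not a statement about how mprime counts iterations internally.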

2019-01-25, 14:27   #183
Chuck
May 2011
Orange Park, FL
2×443 Posts

Quote:
Originally Posted by Chuck
I've got a PRP double check running now.

PRP double check of this exponent is complete. Results were parsed correctly by build 8.

2019-01-25, 22:57   #184
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7·1,069 Posts

29.5 build 9 for GP2 and ATH to test.

1) FixedHardwareUID=1 implementation changed
2) Hang in multithreaded add and subtract fixed.
3) JSON tweaks per James' request.

Again, this is likely the last 29.5 build. My plan is for the next release to be 29.6 -- a release candidate.

Linux 64-bit: ftp://mersenne.org/gimps/p95v295b9.linux64.tar.gz
Windows 64-bit: ftp://mersenne.org/gimps/p95v295b9.win64.zip

2019-01-25, 23:03   #185
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5086₁₀ Posts

Quote:
Originally Posted by Prime95
29.5 build 9 for GP2 and ATH to test.

1) FixedHardwareUID=1 implementation changed
2) Hang in multithreaded add and subtract fixed.
3) JSON tweaks per James' request.

Again, this is likely the last 29.5 build. My plan is for the next release to be 29.6 -- a release candidate.

Linux 64-bit: ftp://mersenne.org/gimps/p95v295b9.linux64.tar.gz
Windows 64-bit: ftp://mersenne.org/gimps/p95v295b9.win64.zip

Is the worker title bar still labeled only for P-1 and ECM, and not yet for LL, PRP, or PRP-CF?

2019-01-26, 09:16   #186
GP2
Sep 2003
3·863 Posts

Quote:
Originally Posted by Prime95
@ATH,GP2: I just change FixedHardwareUID to not send any HardwareGUID info to the server.

Just generate a ComputerGUID and use it in as many places as you like.

Rather than manually inventing some ComputerGUID, would it be possible to have a setting that forces the ComputerGUID to be the hardware GUID?

You can't usefully keep track of individual virtual computers in the cloud because they soon disappear, but you could still usefully distinguish different hardware types. Cloud providers will only use a limited set of different hardware types, so this won't cause mass proliferation of CPUs in Primenet.

With the appropriate configuration setting, the behavior could be:
  1. on startup, always do a fresh determination of the hardware GUID value (not relying on any HardwareGUID setting previously stored in prime.txt)
  2. always use that hardware GUID as the computer GUID (not relying on any ComputerGUID setting previously stored in local.txt)

If that behavior is incompatible with the legacy uses of FixedHardwareUID, then perhaps a new setting with a new name could be used, maybe CloudHardwareUID=1.



When I set FixedHardwareUID=1 in prime.txt, it automatically generates a HardwareGUID=... line if there wasn't one before. (It still does this in build 9).

On AWS cloud, across various c5 instances of all sizes, I see only two different HardwareGUIDs being generated, which correspond to different stepping and microcode values in /proc/cpuinfo:

Code:
stepping        : 3
microcode       : 0x1000141
HardwareGUID=f2a283891d846f17015d10daaed71dc7

stepping        : 4
microcode       : 0x2000043
HardwareGUID=daf2d7cfa4eefcf9c6f2696915f78d9f

However, if a HardwareGUID=... line already exists in prime.txt, it doesn't change even if the working directory is resumed on a different instance with the other stepping and microcode values.
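
To check which hardware variant an instance landed on, one could read the same two fields directly. The sketch below parses /proc/cpuinfo-style text and returns the (stepping, microcode) pair of the first processor entry. It does not reproduce how mprime derives HardwareGUID, which is not described in this thread, and the function name is hypothetical.

```python
def stepping_and_microcode(cpuinfo_text: str) -> tuple[str, str]:
    """Return (stepping, microcode) from the first processor entry."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            if fields:
                break  # a blank line ends the first processor entry
            continue
        key, value = line.split(":", 1)
        fields.setdefault(key.strip(), value.strip())
    return fields.get("stepping", ""), fields.get("microcode", "")

sample = "stepping        : 3\nmicrocode       : 0x1000141\n"
print(stepping_and_microcode(sample))

# On a real Linux instance (assumption: /proc/cpuinfo is readable):
# with open("/proc/cpuinfo") as f:
#     print(stepping_and_microcode(f.read()))
```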

2019-01-26, 09:58   #187
ATH
Einyen
Dec 2003
Denmark
110000110010₂ Posts

Looks like it is working now in build 9. You can just ignore HardwareGUID now.

I have 4 instances doing PRP which I call ec2-PRP. I merged those 4 with the same name on the cpu page: https://www.mersenne.org/cpus/

I took the GUID of the remaining ec2-PRP CPU (shown if you click on its name on the CPU page) and copied it to all 4 instances, so all 4 local.txt files have:
ComputerID=ec2-PRP
ComputerGUID=2a7e47990f25df12d8ff0f6fd6bb00ed

It seems to work: after ~10 hours, at least, they have not changed or created new CPU "accounts".

All 4 instances created the same HardwareGUID in prime.txt. I do not understand how that works, but you do not have to worry about HardwareGUID now.
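
The local.txt change described above could be scripted along these lines. This is only a sketch: the helper name is made up, it assumes local.txt is a flat list of key=value lines with no section headers, and it assumes mprime is stopped while the file is edited.

```python
def set_ini_values(text: str, updates: dict[str, str]) -> str:
    """Replace or append key=value lines, leaving other lines untouched."""
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Any key not already present is appended at the end.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"

shared = {
    "ComputerID": "ec2-PRP",
    "ComputerGUID": "2a7e47990f25df12d8ff0f6fd6bb00ed",
}
print(set_ini_values("WorkerThreads=1\nComputerID=old-name\n", shared), end="")
```

Applying the same `shared` dictionary to each instance's local.txt keeps the four files in agreement.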

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.