mersenneforum.org SkylakeX teasers (aka prime95 29.5)

2019-01-24, 07:13   #177
ATH
Einyen

Dec 2003
Denmark

2×7×223 Posts

Quote:
 Originally Posted by Prime95 @ATH,GP2: I just changed FixedHardwareUID to not send any HardwareGUID info to the server. This seems to work just fine in my limited testing. I think that will address both of your scenarios. Just generate a ComputerGUID and use it in as many places as you like.
That did not seem to work. For example, I have 4 instances doing PRP tests, and I gave them all the same ComputerID:
ComputerID=ec2-PRP

After I switched to 29.5b8, I removed FixedHardwareUID=1, started mprime, then stopped it again and added FixedHardwareUID=1 back. Then I tried to use the same ComputerGUID on all 4 instances. But as I wrote in post #170, it failed after a few hours and all the instances created new computer accounts.
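For illustration, keeping one shared ComputerGUID in sync across several instance directories can be scripted. The sketch below is hypothetical and not from the thread: the directory names and the GUID value are placeholders (it even creates demo directories so it is self-contained); substitute your own instance paths and GUID.

```shell
# Hypothetical helper (not from the thread): write one shared ComputerGUID
# into the local.txt of several mprime instance directories.
# Directory names and the GUID below are placeholders.
guid=0123456789abcdef0123456789abcdef

for d in demo-prp-1 demo-prp-2; do
    mkdir -p "$d"                                    # demo dirs for this sketch
    [ -f "$d/local.txt" ] || echo "ComputerID=ec2-PRP" > "$d/local.txt"
    if grep -q '^ComputerGUID=' "$d/local.txt"; then
        # replace an existing ComputerGUID line in place (GNU sed assumed)
        sed -i "s/^ComputerGUID=.*/ComputerGUID=$guid/" "$d/local.txt"
    else
        # or append one if the file has none yet
        echo "ComputerGUID=$guid" >> "$d/local.txt"
    fi
done
```

Run it once per fleet; restarting mprime afterwards should make all instances report as the same computer.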

2019-01-24, 07:37   #178
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7·1,069 Posts

Quote:
 Originally Posted by ATH That did not seem to work. For example I have 4 instances doing PRP tests, and I was giving them all the computerID: ComputerID=ec2-PRP
The new functionality is in the next build. Sorry if I did not make that clear.

You'll need to set ComputerGUID=same_long_hex_string in all your instances.

2019-01-24, 17:03   #179
GP2

Sep 2003

3·863 Posts

29.5 build 8 hanging for b=3 (using PRP-2)

Just a couple of days ago, I started doing PRP tests for 3^p-1 exponents in the 500k range. I am using PRP-2 for this, because 3^p-1 can't be done with the default PRP-3 for the same reason that 2^p-1 can't be done with PRP-2.

I was doing this range successfully with 29.4 a few months ago, but then paused and didn't resume until just a couple of days ago. I am using 29.5 build 8 on 64-bit Linux and a single instance on a two-core virtual machine (c5.xlarge on AWS).

I have gotten hangs for PRP-2 3^p-1 twice now, on two consecutive days. Edit: the first time it ran for ten hours before hanging, the second time for less than an hour. So this seems to be very reproducible. Meanwhile I have also been running PRP-3 10^p-1 on four instances and PRP-3 2^p+1 on dozens of instances of the same kind of two-core machines, and hangs were very rare even with older builds of 29.5.

local.txt:
Code:
WorkerThreads=1
CoresPerTest=2
The ps command shows the process is still running, but the CPU time is not increasing.
Using gcore to create a core dump of the still-running process, and gdb to examine the stack, I get:
Code:
(gdb) bt
#0  0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000047f41f in gwthread_wait_for_exit ()
#2  0x000000000041ac41 in LaunchWorkerThreads ()
#3  0x0000000000442b4c in linuxContinue ()
#4  0x000000000040818b in main ()
Code:
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f5267491740 (LWP 2672) 0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
  2    Thread 0x7f52662d6700 (LWP 2680) 0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3    Thread 0x7f5265ad5700 (LWP 2727) 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7f52652d4700 (LWP 3260) 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Code:
(gdb) thread apply all bt

Thread 4 (Thread 0x7f52652d4700 (LWP 3260)):
#0  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f171 in gwevent_wait ()
#2  0x000000000046d357 in auxiliary_thread ()
#3  0x000000000047f01a in ThreadStarter ()
#4  0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5  0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f5265ad5700 (LWP 2727)):
#0  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f171 in gwevent_wait ()
#2  0x000000000046dc74 in multithread_op ()
#3  0x000000000046df4d in gwadd3 ()
#4  0x000000000046e7ec in gwsquare2_carefully ()
#5  0x0000000000438de5 in prp ()
#6  0x00000000004443b3 in primeContinue ()
#7  0x00000000004467bb in LauncherDispatch ()
#8  0x000000000044aa64 in Launcher ()
#9  0x000000000047f01a in ThreadStarter ()
#10 0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#11 0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f52662d6700 (LWP 2680)):
#0  0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f150 in gwevent_wait ()
#2  0x000000000044b85f in timed_events_scheduler ()
#3  0x000000000047f01a in ThreadStarter ()
#4  0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5  0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f5267491740 (LWP 2672)):
#0  0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000047f41f in gwthread_wait_for_exit ()
#2  0x000000000041ac41 in LaunchWorkerThreads ()
I'm not sure why there are four threads on a two-core machine. I do not have HyperthreadLL=1 set in the local.txt file.

The log file below shows it hung around 08:58, and then I tried to kill the process at 16:24 and again at 16:26:
Code:
[Work thread Jan 24 08:57] Setting affinity to run helper thread 1 on CPU core #2
[Work thread Jan 24 08:57] Starting 2-PRP test of (3^541439-1)/2 using AVX-512 FFT length 48K, Pass1=128, Pass2=384, clm=1, 2 threads
[Work thread Jan 24 08:58] Iteration: 400000 / 858159 [46.61%], ms/iter: 0.135, ETA: 00:01:01
[Work thread Jan 24 08:58] Iteration: 800000 / 858159 [93.22%], ms/iter: 0.132, ETA: 00:00:07
[Main thread Jan 24 16:24] Stopping all worker threads.
[Main thread Jan 24 16:26] Stopping all worker threads.
In the last two lines, kill -TERM caused the program to print "Stopping all worker threads", but it remained hung and did not terminate. I terminated it with kill -KILL. Note that kill -QUIT has no effect at all.
Looking at the file directory shows:
Code:
-rw-r--r-- 1 root ec2-user     1821 Jan 24 05:24 results.bench.txt
-rw-r--r-- 1 root ec2-user      809 Jan 24 05:24 gwnum.txt
-rw-rw-r-- 1 root ec2-user 36922144 Jan 24 08:09 results.txt
-rw-r--r-- 1 root ec2-user   691417 Jan 24 08:57 results.json.txt
-rw-rw-r-- 1 root ec2-user  1048494 Jan 24 08:57 worktodo.txt
-rw-rw-r-- 1 root ec2-user      314 Jan 24 14:15 local.txt
So even though it hung at about 08:58, it still managed to update local.txt at 14:15.

The timestamps on the benchmark-related files are 05:24, and the log files also show that the program completed benchmarks successfully twice, at 05:24 today and 05:38 yesterday. It is not hanging in the benchmark, and the stack trace also shows that.

Killing and restarting mprime causes the PRP testing to resume successfully. The problem is not the specific exponent it was working on at the time of the hang.

PS: I am not sure why the timestamp of results.txt is getting updated, because 29.5 does not write anything to it; it writes to results.json.txt instead. The last line in results.txt dates from October and version 29.4.

Last fiddled with by GP2 on 2019-01-24 at 17:41
2019-01-24, 17:52   #180
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1110100111011₂ Posts

Great info GP2, I'll get right on it. It looks like a problem with the brand new multithreaded add and subtract.
2019-01-24, 22:44   #181
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2²·3²·163 Posts

@GP2 I assume you realise that 3^p-1 is always even and can't be prime.
2019-01-24, 23:01   #182
GP2

Sep 2003

3×863 Posts

Quote:
 Originally Posted by henryzz @GP2 I assume you realise that 3^p-1 is always even and can't be prime.
I'm looking for generalized repunits of the form (b^p − 1) / (b − 1), like these and these and these.

Obviously the worktodo line always specifies the known factor b−1.
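The base restriction mentioned in post #179 has a short explanation: N = (3^p − 1)/2 divides 3^p − 1, so 3^p ≡ 1 (mod N), and for odd prime p Fermat's little theorem gives p | N − 1; hence 3^(N−1) ≡ 1 (mod N) whether or not N is prime, so a base-3 Fermat test says nothing. A minimal Python sketch (my own illustration, not mprime code) shows base 3 passing everything while base 2 still discriminates:

```python
# Sketch (not mprime code): why a base-3 PRP test is uninformative for
# N = (3^p - 1)/2. Every such N passes the base-3 Fermat test, prime or
# composite; a different base (here 2) correctly flags the composites.

def fermat_prp(n, base):
    """Fermat probable-prime test: True if base^(n-1) == 1 (mod n)."""
    return pow(base, n - 1, n) == 1

for p in [5, 7, 11, 13]:
    n = (3**p - 1) // 2
    print(f"p={p:2d}  N={n}  base-3 PRP: {fermat_prp(n, 3)}  base-2 PRP: {fermat_prp(n, 2)}")
```

Here N is composite for p = 5 (121 = 11²) and p = 11, yet both pass the base-3 test; only the base-2 test catches them. The same argument with the roles of 2 and 3 swapped is why Mersenne numbers 2^p − 1 need PRP-3 rather than PRP-2.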

2019-01-25, 14:27   #183
Chuck

May 2011
Orange Park, FL

376₁₆ Posts

Quote:
 Originally Posted by Chuck I've got a PRP double check running now.
PRP double check of this exponent is complete. Results were parsed correctly by build 8.

2019-01-25, 22:57   #184
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1D3B₁₆ Posts

29.5 build 9 for GP2 and ATH to test.

1) FixedHardwareUID=1 implementation changed
2) Hang in multithreaded add and subtract fixed.
3) JSON tweaks per James' request.

Again, this is likely the last 29.5 build. My plan is for the next release to be 29.6 -- a release candidate.

Linux 64-bit: ftp://mersenne.org/gimps/p95v295b9.linux64.tar.gz
Windows 64-bit: ftp://mersenne.org/gimps/p95v295b9.win64.zip
2019-01-25, 23:03   #185
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·2,543 Posts

Quote:
 Originally Posted by Prime95 29.5 build 9 for GP2 and ATH to test. 1) FixedHardwareUID=1 implementation changed 2) Hang in multithreaded add and subtract fixed. 3) JSON tweaks per James' request. Again, this is likely the last 29.5 build. My plan is for the next release to be 29.6 -- a release candidate. Linux 64-bit: ftp://mersenne.org/gimps/p95v295b9.linux64.tar.gz Windows 64-bit: ftp://mersenne.org/gimps/p95v295b9.win64.zip
Still distinguishing work types by labeling the worker title bar for P-1 and ECM, but not yet for LL, PRP, or PRP-CF?

2019-01-26, 09:16   #186
GP2

Sep 2003

3×863 Posts

Quote:
 Originally Posted by Prime95 @ATH,GP2: I just changed FixedHardwareUID to not send any HardwareGUID info to the server. Just generate a ComputerGUID and use it in as many places as you like.
Rather than manually inventing some ComputerGUID, would it be possible to have a setting that forces the ComputerGUID to be the hardware GUID?

You can't usefully keep track of individual virtual computers in the cloud because they soon disappear, but you could still usefully distinguish different hardware types. Cloud providers will only use a limited set of different hardware types, so this won't cause mass proliferation of CPUs in Primenet.

With the appropriate configuration setting, the behavior could be:
1. on startup, always do a fresh determination of the hardware GUID value (not relying on any HardwareGUID setting previously stored in prime.txt)
2. always use that hardware GUID as the computer GUID (not relying on any ComputerGUID setting previously stored in local.txt)

If that behavior is incompatible with the legacy usages of FixedHardwareUID, then perhaps a new setting with a new name could be used, e.g. CloudHardwareUID=1.

When I set FixedHardwareUID=1 in prime.txt, it automatically generates a HardwareGUID=... line if there wasn't one before. (It still does this in build 9).

On AWS cloud, across various c5 instances of all sizes, I see only two different HardwareGUIDs being generated, which correspond to different stepping and microcode values in /proc/cpuinfo:

Code:
stepping        : 3
microcode       : 0x1000141
HardwareGUID=f2a283891d846f17015d10daaed71dc7

stepping        : 4
microcode       : 0x2000043
HardwareGUID=daf2d7cfa4eefcf9c6f2696915f78d9f
However, if a HardwareGUID=... line already exists in prime.txt, it doesn't change even if the working directory is resumed on a different instance with the other stepping and microcode values.
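As a rough illustration of the kind of derivation proposed above, one could hash the two /proc/cpuinfo fields that actually vary across these instance types into a 32-hex-digit ID. This is purely a sketch of mine: the function name, field choice, and use of md5 are assumptions, not anything mprime implements.

```python
# Hypothetical sketch (not mprime code): derive a stable 32-hex-digit ID
# from the /proc/cpuinfo fields observed to vary across AWS c5 hardware
# types (stepping and microcode), for hand-setting a ComputerGUID.
import hashlib
import os
import re

def cloud_guid(cpuinfo_text):
    fields = []
    for key in ("stepping", "microcode"):
        # grab the first "key : value" line, if present
        m = re.search(rf"^{key}\s*:\s*(\S+)", cpuinfo_text, re.MULTILINE)
        fields.append(f"{key}={m.group(1) if m else '?'}")
    # md5 gives the same 32-hex-digit shape as the GUIDs in prime.txt
    return hashlib.md5(",".join(fields).encode()).hexdigest()

if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print("ComputerGUID=" + cloud_guid(f.read()))
```

Identical hardware types would then map to identical IDs, limiting CPU proliferation on PrimeNet the way the post describes.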

2019-01-26, 09:58   #187
ATH
Einyen

Dec 2003
Denmark

3122₁₀ Posts

Looks like it is working now in build 9. You can just ignore HardwareGUID now.

I have 4 instances doing PRP which I call ec2-PRP. I merged those 4 under the same name on the CPU page: https://www.mersenne.org/cpus/

From the remaining ec2-PRP cpu I took the GUID (shown if you click on the name on the CPU page) and copied it to all 4 instances, so all 4 local.txt files have:
Code:
ComputerID=ec2-PRP
ComputerGUID=2a7e47990f25df12d8ff0f6fd6bb00ed
It seems to work: they did not change and create new cpu "accounts", at least not after ~10 hours. All 4 instances created the same HardwareGUID in prime.txt. I do not understand how that works, but you do not have to worry about HardwareGUID now.