#177
Einyen
Dec 2003
Denmark
2×1,579 Posts
Quote:
ComputerID=ec2-PRP, and after I switched to 29.5b8 I removed FixedHardwareUID=1, started mprime again, then stopped it and added FixedHardwareUID=1 back. Then I tried to use the same ComputerGUID on all 4 instances, but as I wrote in post #170 it failed after a few hours and all instances created new computer accounts.
#178
P90 years forever!
Aug 2002
Yeehaw, FL
5·11·137 Posts
Quote:
You'll need to set ComputerGUID=same_long_hex_string in all your instances.
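Concretely, every instance's local.txt would carry an identical pair of lines. The values below are the ones ATH later reports using in post #187; any shared 32-hex-character string works the same way:

```ini
ComputerID=ec2-PRP
ComputerGUID=2a7e47990f25df12d8ff0f6fd6bb00ed
```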
#179
Sep 2003
5×11×47 Posts
Just a couple of days ago, I started doing PRP tests for 3^p-1 exponents in the 500k range.
I am using PRP-2 for this, because 3^p-1 can't be done with the default PRP-3, for the same reason that 2^p-1 can't be done with PRP-2. I was doing this range successfully with 29.4 a few months ago, but then paused and didn't resume until just a couple of days ago. I am using 29.5 build 8 on 64-bit Linux and a single instance of a two-core virtual machine (c5.xlarge on AWS).

I have gotten hangs for PRP-2 3^p-1 twice now, on two consecutive days. Edit: the first time it ran for ten hours before hanging, the second time for less than an hour. So this seems to be very reproducible. Meanwhile I have also been running PRP-3 10^p-1 on four instances and PRP-3 2^p+1 on dozens of instances of the same kind of two-core machines, and hangs were very rare even with older builds of 29.5.

local.txt:
Code:
WorkerThreads=1
CoresPerTest=2

Using gcore to create a core dump for the still-running process, and gdb to examine the stack, I get:
Code:
(gdb) bt
#0  0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000047f41f in gwthread_wait_for_exit ()
#2  0x000000000041ac41 in LaunchWorkerThreads ()
#3  0x0000000000442b4c in linuxContinue ()
#4  0x000000000040818b in main ()
Code:
(gdb) info threads
  Id   Target Id                          Frame
* 1    Thread 0x7f5267491740 (LWP 2672)  0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
  2    Thread 0x7f52662d6700 (LWP 2680)  0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3    Thread 0x7f5265ad5700 (LWP 2727)  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7f52652d4700 (LWP 3260)  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Code:
(gdb) thread apply all bt

Thread 4 (Thread 0x7f52652d4700 (LWP 3260)):
#0  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f171 in gwevent_wait ()
#2  0x000000000046d357 in auxiliary_thread ()
#3  0x000000000047f01a in ThreadStarter ()
#4  0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5  0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f5265ad5700 (LWP 2727)):
#0  0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f171 in gwevent_wait ()
#2  0x000000000046dc74 in multithread_op ()
#3  0x000000000046df4d in gwadd3 ()
#4  0x000000000046e7ec in gwsquare2_carefully ()
#5  0x0000000000438de5 in prp ()
#6  0x00000000004443b3 in primeContinue ()
#7  0x00000000004467bb in LauncherDispatch ()
#8  0x000000000044aa64 in Launcher ()
#9  0x000000000047f01a in ThreadStarter ()
#10 0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#11 0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f52662d6700 (LWP 2680)):
#0  0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000047f150 in gwevent_wait ()
#2  0x000000000044b85f in timed_events_scheduler ()
#3  0x000000000047f01a in ThreadStarter ()
#4  0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5  0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f5267491740 (LWP 2672)):
#0  0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000047f41f in gwthread_wait_for_exit ()
#2  0x000000000041ac41 in LaunchWorkerThreads ()

The log file below shows it hung around 08:58, and then I tried to kill the process at 16:24 and again at 16:26.
Code:
[Work thread Jan 24 08:57] Setting affinity to run helper thread 1 on CPU core #2
[Work thread Jan 24 08:57] Starting 2-PRP test of (3^541439-1)/2 using AVX-512 FFT length 48K, Pass1=128, Pass2=384, clm=1, 2 threads
[Work thread Jan 24 08:58] Iteration: 400000 / 858159 [46.61%], ms/iter:  0.135, ETA: 00:01:01
[Work thread Jan 24 08:58] Iteration: 800000 / 858159 [93.22%], ms/iter:  0.132, ETA: 00:00:07
[Main thread Jan 24 16:24] Stopping all worker threads.
[Main thread Jan 24 16:26] Stopping all worker threads.

Looking at the file directory shows:
Code:
-rw-r--r--  1 root ec2-user     1821 Jan 24 05:24 results.bench.txt
-rw-r--r--  1 root ec2-user      809 Jan 24 05:24 gwnum.txt
-rw-rw-r--  1 root ec2-user 36922144 Jan 24 08:09 results.txt
-rw-r--r--  1 root ec2-user   691417 Jan 24 08:57 results.json.txt
-rw-rw-r--  1 root ec2-user  1048494 Jan 24 08:57 worktodo.txt
-rw-rw-r--  1 root ec2-user      314 Jan 24 14:15 local.txt

The timestamps on the benchmark-related files are 05:24, and the log files also show that the program has completed benchmarks successfully twice, at 05:24 today and 05:38 yesterday. So it is not hanging in the benchmark, and the stack trace shows that as well. Killing and restarting mprime causes the PRP testing to resume successfully, so the problem is not tied to the specific exponent it was working on at the time of the hang.

PS: I am not sure why the timestamp of results.txt is getting updated, because 29.5 is not writing anything to it; it writes to results.json.txt instead. The last line in results.txt dates from October and version 29.4.

Last fiddled with by GP2 on 2019-01-24 at 17:41
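The base-choice constraint mentioned above can be seen with a toy Fermat PRP test: for N = (3^p-1)/2 we have 3^p = 2N+1 ≡ 1 (mod N) and N ≡ 1 (mod p), so a base-3 test reports "probable prime" for every such N, composite or not, while base 2 still discriminates. This sketch is purely illustrative and is not mprime's implementation:

```python
# Toy Fermat PRP test -- NOT mprime's implementation, just an
# illustration of why base 3 is useless for (3^p-1)/2 candidates.
def fermat_prp(n, base):
    """Return True if n passes a Fermat probable-prime test to `base`."""
    return pow(base, n - 1, n) == 1

for p in [5, 7, 11, 13]:            # small odd prime exponents
    n = (3**p - 1) // 2
    # Base 3 passes every candidate; base 2 actually detects composites.
    print(p, fermat_prp(n, 3), fermat_prp(n, 2))
```

For p = 5, n = 121 = 11² is composite yet passes the base-3 test, while base 2 correctly rejects it; that asymmetry is why these candidates are run as PRP-2.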
#180
P90 years forever!
Aug 2002
Yeehaw, FL
5×11×137 Posts
Great info GP2, I'll get right on it. It looks like a problem with the brand new multithreaded add and subtract.
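For readers following along: GP2's stacks show a worker blocked in gwevent_wait (pthread_cond_wait) while the main thread sits in pthread_join, the classic picture of a lost wakeup on an event. A minimal sketch of the safe condition-variable event pattern, in Python for brevity (illustrative only; gwnum's actual gwevent code is C and is not shown in this thread):

```python
# Sketch of a lost-wakeup-safe event, mirroring the pthread
# condition-variable pattern seen in the stack traces above
# (gwevent_wait / pthread_cond_wait). Illustrative only.
import threading

class Event:
    def __init__(self):
        self._cond = threading.Condition()
        self._signaled = False            # predicate guards against lost wakeups

    def wait(self):
        with self._cond:
            while not self._signaled:     # always re-check in a loop
                self._cond.wait()

    def signal(self):
        with self._cond:
            self._signaled = True         # set the predicate BEFORE notifying
            self._cond.notify_all()

ev = Event()
results = []
worker = threading.Thread(target=lambda: (ev.wait(), results.append("woke")))
worker.start()
ev.signal()      # without the predicate, a notify delivered before the
worker.join()    # worker reached wait() would leave join() hanging forever
print(results)
```

If the signaler ever notifies without setting the predicate first (or the waiter does a bare wait with no loop), the waiter can block forever and the joining thread hangs with it, exactly as in the backtraces.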
#181
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2×3³×109 Posts
@GP2 I assume you realise that 3^p-1 is always even and can't be prime.
#182
Sep 2003
101000011001₂ Posts
#183
May 2011
Orange Park, FL
3×5×59 Posts
Quote:
#184
P90 years forever!
Aug 2002
Yeehaw, FL
5×11×137 Posts
29.5 build 9 for GP2 and ATH to test.

1) FixedHardwareUID=1 implementation changed
2) Hang in multithreaded add and subtract fixed.
3) JSON tweaks per James' request.

Again, this is likely the last 29.5 build. My plan is for the next release to be 29.6 -- a release candidate.

Linux 64-bit: ftp://mersenne.org/gimps/p95v295b9.linux64.tar.gz
Windows 64-bit: ftp://mersenne.org/gimps/p95v295b9.win64.zip
#185
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Quote:
#186
Sep 2003
5·11·47 Posts
Quote:
You can't usefully keep track of individual virtual computers in the cloud because they soon disappear, but you could still usefully distinguish different hardware types. Cloud providers will only use a limited set of different hardware types, so this won't cause mass proliferation of CPUs in Primenet. With the appropriate configuration setting, the behavior could be:
If that behavior is incompatible with the legacy usages of FixedHardwareUID, then perhaps a new setting with a new name could be used, maybe CloudHardwareUID=1.

When I set FixedHardwareUID=1 in prime.txt, it automatically generates a HardwareGUID=... line if there wasn't one before (it still does this in build 9). On AWS cloud, across various c5 instances of all sizes, I see only two different HardwareGUIDs being generated, which correspond to different stepping and microcode values in /proc/cpuinfo:
Code:
stepping        : 3
microcode       : 0x1000141
HardwareGUID=f2a283891d846f17015d10daaed71dc7

stepping        : 4
microcode       : 0x2000043
HardwareGUID=daf2d7cfa4eefcf9c6f2696915f78d9f
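This observation suggests the HardwareGUID is derived from a small set of CPU identity fields. A sketch of how a stable per-hardware-type ID could be computed from /proc/cpuinfo (hypothetical derivation; mprime's actual HardwareGUID algorithm is not shown in this thread):

```python
# Hypothetical derivation of a stable hardware-type ID from
# /proc/cpuinfo fields; mprime's real HardwareGUID algorithm is
# not documented here -- this only shows why identical stepping
# and microcode would yield identical IDs across cloud instances.
import hashlib

def hardware_type_id(cpuinfo_text):
    """Hash stepping+microcode into a 32-hex-char ID, one per hardware type."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, val = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), val.strip())   # first CPU's values
    raw = "{}|{}".format(fields.get("stepping", ""), fields.get("microcode", ""))
    return hashlib.md5(raw.encode()).hexdigest()

c5_step3 = "stepping\t: 3\nmicrocode\t: 0x1000141\n"
c5_step4 = "stepping\t: 4\nmicrocode\t: 0x2000043\n"
print(hardware_type_id(c5_step3))   # same input always gives the same ID
print(hardware_type_id(c5_step4))   # a different hardware type gives another
```

Under this scheme every c5 instance with the same stepping/microcode would report the same ID, matching the two-GUID pattern observed above.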
#187
Einyen
Dec 2003
Denmark
2×1,579 Posts
Looks like it is working now in build 9. You can just ignore HardwareGUID now.

I have 4 instances doing PRP which I call ec2-PRP. I merged those 4 under the same name on the CPU page: https://www.mersenne.org/cpus/ From the remaining ec2-PRP cpu I took the GUID (shown if you click on the name on the CPU page) and copied it to all 4 instances, so all 4 local.txt files have:

ComputerID=ec2-PRP
ComputerGUID=2a7e47990f25df12d8ff0f6fd6bb00ed

and it seems to work; they have not changed or created new cpu "accounts" yet, at least after ~10 hours.

All 4 instances created the same HardwareGUID in prime.txt. I do not understand how that works, but you do not have to worry about HardwareGUID now.