mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   SkylakeX teasers (aka prime95 29.5) (https://www.mersenneforum.org/showthread.php?t=23723)

ATH 2019-01-24 07:13

[QUOTE=Prime95;506742]@ATH,GP2: I just changed FixedHardwareUID to not send any HardwareGUID info to the server. This seems to work just fine in my limited testing. I think that will address both of your scenarios.

Just generate a ComputerGUID and use it in as many places as you like.[/QUOTE]

That did not seem to work. For example, I have 4 instances doing PRP tests, and I gave them all the same ComputerID:
ComputerID=ec2-PRP

After I switched to 29.5b8, I removed FixedHardwareUID=1, started mprime, then stopped it again and added FixedHardwareUID=1 back. Then I tried to use the same ComputerGUID on all 4 instances, but as I wrote in post #170, it failed after a few hours and all the instances created new computer accounts.

Prime95 2019-01-24 07:37

[QUOTE=ATH;506744]That did not seem to work. For example, I have 4 instances doing PRP tests, and I gave them all the same ComputerID:
ComputerID=ec2-PRP[/QUOTE]

The new functionality is in the next build. Sorry if I did not make that clear.

You'll need to set ComputerGUID=same_long_hex_string in all your instances.
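Any 32-digit hex string works (the ComputerGUIDs seen later in this thread have that shape). A minimal sketch of generating one, assuming nothing about prime95's own GUID algorithm ([c]uuid4[/c] is just a convenient randomness source):

```python
import uuid

# Any random 32-digit hex string will do as a shared ComputerGUID.
guid = uuid.uuid4().hex
print(f"ComputerGUID={guid}")
```

Paste the same printed line into the local.txt of every instance that should report as one computer.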

GP2 2019-01-24 17:03

29.5 build 8 hanging for b=3 (using PRP-2)
 
Just a couple of days ago, I started doing PRP tests for 3^p-1 exponents in the 500k range.

I am using PRP-2 for this, because 3^p-1 can't be done with the default PRP-3 for the same reason that 2^p-1 can't be done with PRP-2.
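The degeneracy is easy to check with tiny numbers (my own illustration, not from the thread): for N = (3^5 − 1)/2 = 121 = 11², we have 3^5 ≡ 1 (mod N) and 5 | N − 1, so a base-3 Fermat test passes automatically regardless of primality, while base 2 correctly flags it. Mersenne numbers have the mirror problem with base 2:

```python
# Composite N = (3^5 - 1)/2 = 121 = 11*11: since 3^5 ≡ 1 (mod N) and
# 5 divides N - 1, the base-3 Fermat test is blind to it; base 2 is not.
N = (3**5 - 1) // 2
assert pow(3, N - 1, N) == 1   # "passes" even though 121 = 11^2
assert pow(2, N - 1, N) != 1   # base 2 correctly detects compositeness

# Mirror image: composite 2^11 - 1 = 2047 = 23 * 89 always passes base 2,
# which is why 2^p-1 cannot be tested with PRP-2.
M = 2**11 - 1
assert pow(2, M - 1, M) == 1
assert pow(3, M - 1, M) != 1
```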

I was doing this range successfully with 29.4 a few months ago, but then paused and didn't resume until just a couple of days ago.


I am using 29.5 build 8 on 64-bit Linux and a single instance of a two-core virtual machine (c5.xlarge on AWS). I have gotten hangs for PRP-2 3^p-1 twice now, on two consecutive days. [B]Edit:[/B] the first time it ran for ten hours before hanging, the second time for less than an hour. So this seems to be very reproducible.

Meanwhile I have also been running PRP-3 10^p-1 on four instances and PRP-3 2^p+1 on dozens of instances of the same kind of two-core machines, and hangs were very rare even with older builds of 29.5.

local.txt:
[CODE]
WorkerThreads=1
CoresPerTest=2
[/CODE]

The [c]ps[/c] command shows the process is still running but the CPU time is not increasing.

Using [c]gcore[/c] to create a core dump of the still-running process and [c]gdb[/c] to examine the stack, I get:

[CODE]
(gdb) bt

#0 0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1 0x000000000047f41f in gwthread_wait_for_exit ()
#2 0x000000000041ac41 in LaunchWorkerThreads ()
#3 0x0000000000442b4c in linuxContinue ()
#4 0x000000000040818b in main ()
[/CODE]

[CODE]
(gdb) info threads

Id Target Id Frame
* 1 Thread 0x7f5267491740 (LWP 2672) 0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
2 Thread 0x7f52662d6700 (LWP 2680) 0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
3 Thread 0x7f5265ad5700 (LWP 2727) 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
4 Thread 0x7f52652d4700 (LWP 3260) 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
[/CODE]

[CODE]
(gdb) thread apply all bt

Thread 4 (Thread 0x7f52652d4700 (LWP 3260)):
#0 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000047f171 in gwevent_wait ()
#2 0x000000000046d357 in auxiliary_thread ()
#3 0x000000000047f01a in ThreadStarter ()
#4 0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5 0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f5265ad5700 (LWP 2727)):
#0 0x00007f5266d1e86d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000047f171 in gwevent_wait ()
#2 0x000000000046dc74 in multithread_op ()
#3 0x000000000046df4d in gwadd3 ()
#4 0x000000000046e7ec in gwsquare2_carefully ()
#5 0x0000000000438de5 in prp ()
#6 0x00000000004443b3 in primeContinue ()
#7 0x00000000004467bb in LauncherDispatch ()
#8 0x000000000044aa64 in Launcher ()
#9 0x000000000047f01a in ThreadStarter ()
#10 0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#11 0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f52662d6700 (LWP 2680)):
#0 0x00007f5266d1ec26 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000047f150 in gwevent_wait ()
#2 0x000000000044b85f in timed_events_scheduler ()
#3 0x000000000047f01a in ThreadStarter ()
#4 0x00007f5266d1854b in start_thread () from /lib64/libpthread.so.0
#5 0x00007f52663cc2ff in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f5267491740 (LWP 2672)):
#0 0x00007f5266d198ed in pthread_join () from /lib64/libpthread.so.0
#1 0x000000000047f41f in gwthread_wait_for_exit ()
#2 0x000000000041ac41 in LaunchWorkerThreads ()
[/CODE]

I'm not sure why there are four threads on a two-core machine. I do not have [c]HyperthreadLL=1[/c] set in the local.txt file.

The log file below shows it hung around 08:58; I then tried to kill the process at 16:24 and again at 16:26:

[CODE]
[Work thread Jan 24 08:57] Setting affinity to run helper thread 1 on CPU core #2
[Work thread Jan 24 08:57] Starting 2-PRP test of (3^541439-1)/2 using AVX-512 FFT length 48K, Pass1=128, Pass2=384, clm=1, 2 threads
[Work thread Jan 24 08:58] Iteration: 400000 / 858159 [46.61%], ms/iter: 0.135, ETA: 00:01:01
[Work thread Jan 24 08:58] Iteration: 800000 / 858159 [93.22%], ms/iter: 0.132, ETA: 00:00:07
[Main thread Jan 24 16:24] Stopping all worker threads.
[Main thread Jan 24 16:26] Stopping all worker threads.
[/CODE]

In the last two lines, [c]kill -TERM[/c] caused the program to print "Stopping all worker threads", but it remained hung and did not terminate. I finally terminated it with [c]kill -KILL[/c]. Note that [c]kill -QUIT[/c] had no effect at all.

Listing the working directory shows:

[CODE]
-rw-r--r-- 1 root ec2-user 1821 Jan 24 05:24 results.bench.txt
-rw-r--r-- 1 root ec2-user 809 Jan 24 05:24 gwnum.txt
-rw-rw-r-- 1 root ec2-user 36922144 Jan 24 08:09 results.txt
-rw-r--r-- 1 root ec2-user 691417 Jan 24 08:57 results.json.txt
-rw-rw-r-- 1 root ec2-user 1048494 Jan 24 08:57 worktodo.txt
-rw-rw-r-- 1 root ec2-user 314 Jan 24 14:15 local.txt
[/CODE]

So even though it hung at about 08:58, it still managed to update local.txt at 14:15.

The timestamps on the benchmark-related files are 05:24, and the log files also show that the program completed benchmarks successfully twice, at 05:24 today and 05:38 yesterday. It is not hanging in the benchmark, and the stack trace confirms that.

Killing and restarting mprime causes the PRP testing to resume successfully. The problem is not the specific exponent it was working on at the time of the hang.

PS,
I am not sure why the timestamp of results.txt is getting updated, because 29.5 is not writing anything to it; it writes to results.json.txt instead. The last line in results.txt dates from October and version 29.4.

Prime95 2019-01-24 17:52

Great info GP2, I'll get right on it. It looks like a problem with the brand new multithreaded add and subtract.

henryzz 2019-01-24 22:44

@GP2 I assume you realise that 3^p-1 is always even and can't be prime.

GP2 2019-01-24 23:01

[QUOTE=henryzz;506805]@GP2 I assume you realise that 3^p-1 is always even and can't be prime.[/QUOTE]

I'm looking for generalized repunits of the form (b^p − 1) / (b − 1), like [URL="https://oeis.org/A028491"]these[/URL] and [URL="https://oeis.org/A004023"]these[/URL] and [URL="https://oeis.org/A000978"]these[/URL].

Obviously the worktodo line always specifies the known factor b−1.
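For tiny exponents these forms can be enumerated directly. A sketch (my own illustration, using a standard deterministic Miller-Rabin, which is not what mprime uses) reproduces the leading terms of the three linked OEIS sequences:

```python
def is_prime(n: int) -> bool:
    """Deterministic Miller-Rabin; this witness set is proven sufficient
    for n < 3.1 * 10**23, ample for the small cases below."""
    if n < 2:
        return False
    witnesses = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for q in witnesses:
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in witnesses:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

# Generalized repunits (b^p - 1)/(b - 1) for b = 3 and 10 ...
for b in (3, 10):
    print(b, [p for p in range(2, 40)
              if is_prime(p) and is_prime((b**p - 1) // (b - 1))])

# ... and the Wagstaff form (2^p + 1)/3 from the third linked sequence.
print([p for p in range(3, 40) if is_prime(p) and is_prime((2**p + 1) // 3)])
```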

Chuck 2019-01-25 14:27

[QUOTE=Chuck;506658]I've got a [URL="https://www.mersenne.org/report_exponent/?exp_lo=79160167&full=1"]PRP double check[/URL] running now.[/QUOTE]

PRP double check of this exponent is complete. Results were parsed correctly by build 8.

Prime95 2019-01-25 22:57

29.5 build 9 for GP2 and ATH to test.

1) FixedHardwareUID=1 implementation changed
2) Hang in multithreaded add and subtract fixed.
3) JSON tweaks per James' request.

Again, this is likely the last 29.5 build. My plan is for the next release to be 29.6 -- a release candidate.

Linux 64-bit: [url]ftp://mersenne.org/gimps/p95v295b9.linux64.tar.gz[/url]
Windows 64-bit: [url]ftp://mersenne.org/gimps/p95v295b9.win64.zip[/url]

kriesel 2019-01-25 23:03

[QUOTE=Prime95;506861]29.5 build 9 for GP2 and ATH to test.

1) FixedHardwareUID=1 implementation changed
2) Hang in multithreaded add and subtract fixed.
3) JSON tweaks per James' request.

Again, this is likely the last 29.5 build. My plan is for the next release to be 29.6 -- a release candidate.

Linux 64-bit: [URL]ftp://mersenne.org/gimps/p95v295b9.linux64.tar.gz[/URL]
Windows 64-bit: [URL]ftp://mersenne.org/gimps/p95v295b9.win64.zip[/URL][/QUOTE]
The worker title bar still distinguishes P-1 and ECM by label, but not yet LL, PRP, or PRP-CF?

GP2 2019-01-26 09:16

[QUOTE=Prime95;506742]@ATH,GP2: I just changed FixedHardwareUID to not send any HardwareGUID info to the server.

Just generate a ComputerGUID and use it in as many places as you like.[/QUOTE]

Rather than manually inventing some ComputerGUID, would it be possible to have a setting that forces the ComputerGUID to be the hardware GUID?

You can't usefully keep track of individual virtual computers in the cloud because they soon disappear, but you could still usefully distinguish different hardware types. Cloud providers will only use a limited set of different hardware types, so this won't cause mass proliferation of CPUs in Primenet.

With the appropriate configuration setting, the behavior could be:[LIST=1][*]on startup, always do a fresh determination of the hardware GUID value (not relying on any [c]HardwareGUID[/c] setting previously stored in prime.txt)[*]always use that hardware GUID as the computer GUID (not relying on any [c]ComputerGUID[/c] setting previously stored in local.txt)[/LIST]
If that behavior is incompatible with the legacy usages of FixedHardwareUID, then perhaps a new setting with a new name could be used, maybe [c]CloudHardwareUID=1[/c].
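The proposed startup behavior could be sketched like this (purely illustrative: the input fields and the MD5-of-fields scheme are my assumptions, not prime95's actual fingerprint algorithm, which is not documented in this thread):

```python
import hashlib

def hardware_guid(cpu_fields: dict) -> str:
    """Sketch of the proposed CloudHardwareUID=1 behavior: on every startup,
    recompute a stable 32-hex-digit GUID from hardware identifiers and reuse
    it as the ComputerGUID, ignoring any value stored in prime.txt/local.txt.
    The hash inputs here are assumptions for illustration only."""
    blob = "|".join(f"{k}={v}" for k, v in sorted(cpu_fields.items()))
    return hashlib.md5(blob.encode()).hexdigest()

# Identical hardware fingerprints collapse to one GUID, so short-lived
# cloud instances of the same type would all report as one Primenet CPU.
a = hardware_guid({"stepping": "4", "microcode": "0x2000043"})
b = hardware_guid({"stepping": "4", "microcode": "0x2000043"})
c = hardware_guid({"stepping": "3", "microcode": "0x1000141"})
assert a == b and a != c and len(a) == 32
```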



When I set FixedHardwareUID=1 in prime.txt, it automatically generates a HardwareGUID=... line if there wasn't one before. (It still does this in build 9).

On AWS cloud, across various c5 instances of all sizes, I see only two different HardwareGUIDs being generated, which correspond to different stepping and microcode values in /proc/cpuinfo:

[CODE]
stepping : 3
microcode : 0x1000141
HardwareGUID=f2a283891d846f17015d10daaed71dc7

stepping : 4
microcode : 0x2000043
HardwareGUID=daf2d7cfa4eefcf9c6f2696915f78d9f
[/CODE]

However, if a HardwareGUID=... line already exists in prime.txt, it doesn't change even if the working directory is resumed on a different instance with the other stepping and microcode values.

ATH 2019-01-26 09:58

Looks like it is working in build 9. You can just ignore HardwareGUID now.

I have 4 instances doing PRP, all named ec2-PRP. I merged those 4 entries under the same name on the CPU page: [url]https://www.mersenne.org/cpus/[/url]

From the remaining ec2-PRP CPU entry I took the GUID (shown when you click the name on the CPU page) and copied it to all 4 instances, so all 4 local.txt files have:
ComputerID=ec2-PRP
ComputerGUID=2a7e47990f25df12d8ff0f6fd6bb00ed

and it seems to work: after ~10 hours, none of them has changed or created a new CPU "account".

All 4 instances generated the same HardwareGUID in prime.txt. I do not understand how that works, but you no longer have to worry about HardwareGUID.


All times are UTC.
