mersenneforum.org  

Old 2018-02-15, 13:32   #1
GordonWade
 
Feb 2018
4₁₆ Posts
SIGSEGV in xi3eK8 running torture test

I have been using the Prime95 torture test to validate the stability of an overclocked system, and it keeps failing with a segfault after about 48 hours, in what seems to be the same place each time.

Taken from core dump produced by latest run using p95v294b8 :

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000057e3ca in xi3eK8 ()



I have done the following in an attempt to rule out hardware issues as the cause:
1) Reset the hardware to factory defaults.
2) Upgraded to the latest version of Prime95 (p95v294b8) from p95v294b7.
3) Tested with an older version of Prime95 (p95v294b5).
4) Ran memtest86 for 100+ hours (no errors).

Platform is as follows:
AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2.5 GHz, overclocked to 3.2 GHz)
ASUSTeK M2N-E Motherboard
6GB Kingston DDR2-800 running at DDR2-667
OS - Linux (Ubuntu)
Kernel version #35-Ubuntu SMP Thu Jan 25 09:13:46 UTC 2018
Kernel release 4.13.0-32-generic


Tail end of the activity log (this seems to be the consistent failing point):

[Worker #1 Feb 14 19:30] Test 6, 1500 Lucas-Lehmer iterations of M41943041 using AMD K8 FFT length 3200K, Pass1=640, Pass2=5K, clm=1.


gdb Backtrace

Reading symbols from /home/gordon/Downloads/p95v294b8/mprime...done.
[New LWP 19550]
[New LWP 19551]
[New LWP 19546]
[New LWP 19549]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/gordon/Downloads/p95v294b8/mprime -d -t'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000057e3ca in xi3eK8 ()
[Current thread is 1 (Thread 0x7fd6a8433700 (LWP 19550))]
#0 0x000000000057e3ca in xi3eK8 ()
#1 0x00007fd6a0b1de00 in ?? ()
#2 0x00000008a0002000 in ?? ()
#3 0x00007f0000000006 in ?? ()
#4 0x0000000001586fc2 in ??04ED ()
#5 0x00007fd6a0002000 in ?? ()
#6 0x00007fd6a0001720 in ?? ()
#7 0x0000000002800001 in ?? ()
#8 0x00007fd6a0c1107c in ?? ()
#9 0x00007fd6a0c11080 in ?? ()
#10 0x00007fd6a842cfe0 in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00000000004547dd in gwsquare ()
#13 0x00000000004257e4 in selfTestInternal ()
#14 0x000000000043d126 in tortureTest ()
#15 0x000000000043d814 in LauncherDispatch ()
#16 0x000000000043f808 in Launcher ()
#17 0x000000000046b42a in ThreadStarter ()
#18 0x00007fd6a9c457fc in start_thread (arg=0x7fd6a8433700) at pthread_create.c:465
#19 0x00007fd6a8f60b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
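
Editorial sketch, not part of the original post: most frames in the backtrace above show `??` because gdb could not resolve a symbol for them, which makes the named call path hard to pick out. Assuming the frame lines follow the `#N ADDRESS in SYMBOL (` layout shown above, a small Python filter can list only the frames gdb was able to name. The helper name `resolved_frames` is hypothetical, and the excerpt is abridged from the post.

```python
import re

# Frame lines copied (abridged) from the gdb backtrace in the post.
BACKTRACE = """\
#0 0x000000000057e3ca in xi3eK8 ()
#1 0x00007fd6a0b1de00 in ?? ()
#4 0x0000000001586fc2 in ??04ED ()
#12 0x00000000004547dd in gwsquare ()
#13 0x00000000004257e4 in selfTestInternal ()
#14 0x000000000043d126 in tortureTest ()
"""

# Assumed layout: "#<n> <address> in <symbol> (".
# Symbols that start with "??" are treated as unresolved.
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+) \(", re.MULTILINE)


def resolved_frames(backtrace: str) -> list[tuple[int, str]]:
    """Return (frame number, symbol) for every frame gdb could name."""
    return [
        (int(num), symbol)
        for num, _addr, symbol in FRAME_RE.findall(backtrace)
        if not symbol.startswith("??")
    ]


for num, symbol in resolved_frames(BACKTRACE):
    print(f"#{num:<2} {symbol}")
```

On the excerpt this leaves xi3eK8, gwsquare, selfTestInternal and tortureTest, which is the readable part of the call path.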

While I know that I don't need to run the torture test for that length of time, I would like to have it run for at least 7 days to ensure confidence in the overclocking.

Any assistance would be greatly appreciated.

Regards,
Gordon...

Old 2018-02-15, 14:33   #2
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
1BE4₁₆ Posts

Try putting "CpuArchitecture=4" in local.txt.

This will run the Intel version of xi3e rather than the AMD version.

Old 2018-02-15, 20:21   #3
GordonWade
Feb 2018
2² Posts

OK, I have added this to local.txt. What is the impact of running the Intel version rather than the AMD version?

Old 2018-02-15, 22:50   #4
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7140₁₀ Posts

Differences should be minimal. I don't remember all the minor code changes for K8. One was use of the PREFETCHW instruction instead of the PREFETCHT1 instruction.

Old 2018-02-15, 22:51   #5
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
1101111100100₂ Posts

Also, if you believe the error is coming from one particular FFT length, then you can torture test just that FFT size to reproduce the problem faster.

Old 2018-02-16, 00:00   #6
GordonWade
Feb 2018
2² Posts

How do I configure the torture test for a given FFT size? Can I set the M# too?

F.Y.I.

"K8 FFT length 3200K, Pass1=640, Pass2=5K, clm=1" was used 16 times with various M###### iterations (8 times on each processor)

"K8 type-2 FFT length 3200K, Pass1=640, Pass2=5K, clm=1" was used 20 times with various M###### iterations (10 times on each processor)

M41943041 was used 12 times with various FFT types and lengths, but this is the first time the combination was seen.
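
Editorial sketch, not part of the original post: tallies like the ones above can be produced from the activity log with a short script rather than by hand. The version below assumes every test line has the same layout as the log lines quoted in this thread; the sample text and the helper name `tally` are illustrative only, and a real run would read the log file instead.

```python
import re
from collections import Counter

# Sample lines in the format quoted in this thread (not the full log).
LOG = """\
[Worker #1 Feb 14 19:30] Test 6, 1500 Lucas-Lehmer iterations of M41943041 using AMD K8 FFT length 3200K, Pass1=640, Pass2=5K, clm=1.
[Worker #1 Feb 17 17:39] Self-test 64K passed!
[Worker #1 Feb 17 17:39] Test 1, 3100 Lucas-Lehmer iterations of M20971521 using type-2 FFT length 1536K, Pass1=384, Pass2=4K, clm=1.
[Worker #2 Feb 17 17:42] Test 5, 4000 Lucas-Lehmer iterations of M19922943 using type-2 FFT length 1536K, Pass1=384, Pass2=4K, clm=1.
"""

# Captures: worker number, exponent (M...), FFT description up to the final dot.
LINE_RE = re.compile(
    r"^\[Worker #(\d+) [^\]]*\] Test \d+, \d+ Lucas-Lehmer iterations of "
    r"(M\d+) using (.+?)\.$",
    re.MULTILINE,
)


def tally(log: str):
    """Count test lines per FFT config, per worker, and per exponent/config pair."""
    by_config = Counter()
    by_worker_config = Counter()
    by_exponent_config = Counter()
    for worker, exponent, config in LINE_RE.findall(log):
        by_config[config] += 1
        by_worker_config[(int(worker), config)] += 1
        by_exponent_config[(exponent, config)] += 1
    return by_config, by_worker_config, by_exponent_config


by_config, by_worker_config, by_exponent_config = tally(LOG)
for config, count in by_config.most_common():
    print(count, config)
```

An exponent/config pair whose count is 1 is a combination seen for the first time, which is the situation described for M41943041 above.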

I had been monitoring the final ~1 hour of the run with a watched ps command looping at a 0.5-second interval, and the process went away between 19:30:02.05 and 19:30:03.00.

Not knowing when the "[Worker #1 Feb 14 19:30] Test 6" message is logged from within the code, I can't tell if failure is when Worker #2 starts processing the M41943041 or if it is Worker #1 that fails.

I had also been monitoring the entire run at 60-second intervals, and the captured ps output shows that the SIZE and RSS values grew over time:
19546 mprime Mon Feb 12 17:45:53 2018 00:05 37248 17000
19546 mprime Mon Feb 12 17:45:53 2018 2-01:44:09 186208 142136

Don't know if that is relevant.
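
Editorial sketch, not part of the original post: the two ps samples above can be turned into a rough growth figure. This assumes the last three columns are elapsed time, SIZE and RSS, that elapsed time uses the usual ps `[[dd-]hh:]mm:ss` layout, and that SIZE and RSS are in KiB. Growth by itself is not evidence of a leak, since the first sample was taken only 5 seconds into the run.

```python
# (elapsed, SIZE, RSS) from the two ps samples quoted above.
SAMPLES = [("00:05", 37248, 17000), ("2-01:44:09", 186208, 142136)]


def etime_to_seconds(etime: str) -> int:
    """Convert a ps elapsed time of the form [[dd-]hh:]mm:ss to seconds."""
    days, _, clock = etime.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:          # pad missing hours (and minutes) with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


(first_t, first_size, first_rss), (last_t, last_size, last_rss) = SAMPLES
elapsed = etime_to_seconds(last_t) - etime_to_seconds(first_t)
size_growth = last_size - first_size
rss_growth = last_rss - first_rss

print(f"elapsed: {elapsed / 3600:.1f} h")
print(f"SIZE grew by {size_growth} KiB, RSS by {rss_growth} KiB")
print(f"average RSS growth: {rss_growth / elapsed * 3600:.0f} KiB/h")
```

Feeding in all of the 60-second samples rather than just the first and last would show whether the growth is steady or happens in steps.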

Old 2018-02-16, 00:42   #7
Mysticial
Sep 2016
7×47 Posts
Quote:
Originally Posted by Prime95
Differences should be minimal. I don't remember all the minor code changes for K8. One was use of the PREFETCHW instruction instead of the PREFETCHT1 instruction.
Off topic question: (maybe move to a separate thread)

Out of curiosity, under what circumstances do you use PREFETCHW?

From my experience:
  • If the write is to be used again in the near future, then it's likely a cached region in which it will be used many times. So the cost of pulling it into cache is a one-time cost.
  • If the write won't be used again (i.e. streaming), then use non-temporal stores.

The other thing is that write-misses aren't blocking unless they're long enough to overrun the OOE window. So they never seem to show up in the profiler unless there's also a bandwidth bottleneck.

Last fiddled with by Mysticial on 2018-02-16 at 00:49

Old 2018-02-16, 03:05   #8
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
1101111100100₂ Posts

Quote:
Originally Posted by Mysticial
Out of curiosity, under what circumstances do you use PREFETCHW?
I only use it on AMD K6 and K8 processors. The instruction didn't exist until recently on Intel CPUs.

Using prefetchw on those AMD chips did result in noticeable savings. I'm not familiar enough with ancient AMD cache design to explain why. Obviously, changing a cache line's state from clean to dirty on these chips had a cost. Thus, loading the cache line from memory and pre-marking it dirty was beneficial.

On another off-topic note: On Intel chips, if I do a 64-byte write to an aligned 64-byte memory address that is not in the caches, is the chip smart enough not to load the line from memory before completely overwriting the data in the cache? Or must it load the data just to make sure the memory address is valid? I'll probably buy an i9X this summer and be able to answer questions like this for myself. Right now I'm coding up unoptimized AVX-512 FFTs. Get it working first, optimize it later.

Old 2018-02-16, 03:58   #9
Mysticial
Sep 2016
7×47 Posts
Quote:
Originally Posted by Prime95
On another off-topic note: On Intel chips, if I do a 64-byte write to an aligned 64-byte memory address that is not in the caches, is the chip smart enough not to load the line from memory before completely overwriting the data in the cache? Or must it load the data just to make sure the memory address is valid?
I've never specifically benchmarked that case myself since knowing it wouldn't have affected any of my decisions. I'd be using 64-byte stores all the time in both cases.

Nevertheless, I want to say that the answer is yes: normal 64-byte store misses do avoid the read from memory. But I'm not very confident of it, since I'm basing it on some weak indirect observations:
  1. According to InstLatx64, 64-byte NT-stores have much higher throughput (per byte) than both 32-byte and 16-byte NT-stores.
  2. In my own code, if I replace all 64-byte NT-stores with normal stores, there is a significant drop in performance without a noticeable increase in bandwidth consumption. (according to VTune)
I have not personally verified #1 myself and I'm not entirely sure how they tested it. But assuming it's true, it seems to hint that 64-byte NT-stores are "easier" possibly because there's no need for any write-combining logic.

#2 seems to indicate that normal 64-byte stores do avoid the read from memory. If they didn't, the bandwidth consumption would increase significantly. However, there is also a large performance drop. (VTune traces these precisely to the stores that were converted from NT to normal.)

My hypothesis here is:
  • Normal stores need to retain memory ordering with other stores. So they occupy the same OOE resources and will easily stall execution if they take too long. They also get flushed into cache, thereby consuming the already insufficient L3 bandwidth on these chips.
  • NT-stores get kicked into the NT-store buffer. Since there's no need to preserve any sort of memory ordering, they can hang out there as long as they want without blocking execution. They then go directly to memory, bypassing all the caches.


Quote:
I'll probably buy an i9X this summer and be able to answer questions like this for myself. Right now I'm coding up unoptimized AVX-512 FFTs. Get it working first, optimize it later.
I actually recommend doing it earlier than later. The "environment" has changed so drastically in this generation that there's a potential that it will derail any sort of pre-planning. This was the case for me with y-cruncher. 3 years of pre-written AVX512 using standard extrapolations. In the end, half of it had to be rewritten anyway.
  • AVX512 for double the compute.
  • Only 2 SIMD ports instead of 3 (for 512-bit).
  • Some chips only have 1 FMA.
  • Longer latencies on port5.
  • AVX512 throttling
  • L3 cache bandwidth cut by about a factor of 3.
  • Smaller usable LLC. (L2 + L3)
  • NT-stores take longer to process therefore hogging more OOE resources.
  • prefetchnta is actively harmful if you're not careful
I mentioned in my blog that the L3 is "uselessly slow". So you only have 1 MB of usable LLC per core. (half of the previous generation) Combined with 2x the SIMD size, the "effective size" of the cache (measured in the # of SIMD words) is 1/4 of what it used to be. This broke quite a few of my algorithms since they couldn't be squeezed to fit in such a small amount of cache.
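
Editorial sketch, not part of the original post: the "1/4" figure above follows from two halvings. The numbers below are assumptions read off the paragraph: 1 MiB of usable cache per core now, therefore 2 MiB in the previous generation ("half of the previous generation"), and vectors going from 32 bytes to 64 bytes ("2x the SIMD size"). The function name `simd_words` is hypothetical.

```python
MIB = 1024 * 1024


def simd_words(cache_bytes: int, simd_bytes: int) -> int:
    """Number of full SIMD vectors that fit in a cache of the given size."""
    return cache_bytes // simd_bytes


# Assumed figures, inferred from the paragraph above.
previous = simd_words(2 * MIB, 32)    # 2 MiB usable per core, 256-bit vectors
skylake_x = simd_words(1 * MIB, 64)   # 1 MiB usable per core, 512-bit vectors

ratio = skylake_x / previous
print(previous, skylake_x, ratio)
```

Half the bytes and twice the vector width each cost a factor of two, so the cache holds a quarter as many SIMD words.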

I also have a significant amount of code aimed at Cannonlake. But seeing how many of the assumptions I've made while writing that code are already collapsing in Skylake X, I've put the brakes on that stuff for now.

Agner Fog still doesn't have his analysis on Skylake X/Purley. I'm not sure why, but I suspect it has something to do with how ridiculously complicated the architecture has become.

Last fiddled with by Mysticial on 2018-02-16 at 04:29

Old 2018-02-16, 05:40   #10
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2²·3·5·7·17 Posts

Thanks for the insights!

Old 2018-02-18, 00:07   #11
GordonWade
Feb 2018
2² Posts
So two things happened with the change suggested:

1 - The SIGSEGV happened sooner, after 44:36:38 of execution.

[Worker #1 Feb 17 17:39] Self-test 64K passed!
[Worker #1 Feb 17 17:39] Test 1, 3100 Lucas-Lehmer iterations of M20971521 using type-2 FFT length 1536K, Pass1=384, Pass2=4K, clm=1.
[Worker #2 Feb 17 17:42] Test 5, 4000 Lucas-Lehmer iterations of M19922943 using type-2 FFT length 1536K, Pass1=384, Pass2=4K, clm=1.



2 - The SIGSEGV now points to a different routine: "0x00000000005c9ccb in xi2eK8"

root@Lubuntu-Test:~# gdb -q /home/gordon/Downloads/p95v294b8/mprime /var/cores/core.mprime.9733.Lubuntu-Test.1518910926
Reading symbols from /home/gordon/Downloads/p95v294b8/mprime...done.
[New LWP 9736]
[New LWP 9735]
[New LWP 9733]
[New LWP 9734]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/gordon/Downloads/p95v294b8/mprime -d -t'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000005c9ccb in xi2eK8 ()
[Current thread is 1 (Thread 0x7f77c4bed700 (LWP 9736))]
(gdb) bt
#0 0x00000000005c9ccb in xi2eK8 ()
#1 0x00007f77b8449a00 in ?? ()
#2 0x00000060b8001000 in ?? ()
#3 0x00007f7700000000 in ?? ()
#4 0x00000000020806ff in ??04D3 ()
#5 0x00007f77c4beb020 in ?? ()
#6 0x00007f77b80051b0 in ?? ()
#7 0x00000000012fffff in xpass2_r4dwpn_10240_levels_CORE ()
#8 0x00007f77b851d87c in ?? ()
#9 0x00007f77b851d880 in ?? ()
#10 0x00007f77c4beb020 in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00000000004547dd in gwsquare ()
#13 0x00000000004257e4 in selfTestInternal ()
#14 0x000000000043d126 in tortureTest ()
#15 0x000000000043d814 in LauncherDispatch ()
#16 0x000000000046b42a in ThreadStarter ()
#17 0x00007f77c6c007fc in start_thread (arg=0x7f77c4bed700) at pthread_create.c:465
#18 0x00007f77c5f1bb5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.