mersenneforum.org SIGSEGV in xi3eK8 running torture test

2018-02-15, 13:32   #1
GordonWade

Feb 2018
48 Posts

SIGSEGV in xi3eK8 running torture test

I have been using prime95's torture-test mode to validate the stability of an overclocking project, and it keeps failing after about 48 hours, in what seems to be the same place, with a segfault. Taken from the core dump produced by the latest run using p95v294b8:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000057e3ca in xi3eK8 ()

I have performed the following in an attempt to eliminate hardware issues as a cause:
1) Reset the hardware to factory defaults.
2) Upgraded to the latest version of prime95 (p95v294b8) from p95v294b7.
3) Tested with an older version of prime95 (p95v294b5).
4) Ran memtest86 for 100+ hours (no errors).

Platform is as follows:
AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2.5 GHz, overclocked to 3.2 GHz)
ASUSTeK M2N-E motherboard
6 GB Kingston DDR2-800 running at DDR 667
OS: Linux (Ubuntu)
Kernel release 4.13.0-32-generic, version #35-Ubuntu SMP Thu Jan 25 09:13:46 UTC 2018

Tail end of the activity log (this seems to be the consistent failing point):

[Worker #1 Feb 14 19:30] Test 6, 1500 Lucas-Lehmer iterations of M41943041 using AMD K8 FFT length 3200K, Pass1=640, Pass2=5K, clm=1.

gdb backtrace:

Reading symbols from /home/gordon/Downloads/p95v294b8/mprime...done.
[New LWP 19550]
[New LWP 19551]
[New LWP 19546]
[New LWP 19549]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/gordon/Downloads/p95v294b8/mprime -d -t'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000057e3ca in xi3eK8 ()
[Current thread is 1 (Thread 0x7fd6a8433700 (LWP 19550))]
#0  0x000000000057e3ca in xi3eK8 ()
#1  0x00007fd6a0b1de00 in ?? ()
#2  0x00000008a0002000 in ?? ()
#3  0x00007f0000000006 in ?? ()
#4  0x0000000001586fc2 in ?? ()
#5  0x00007fd6a0002000 in ?? ()
#6  0x00007fd6a0001720 in ?? ()
#7  0x0000000002800001 in ?? ()
#8  0x00007fd6a0c1107c in ?? ()
#9  0x00007fd6a0c11080 in ?? ()
#10 0x00007fd6a842cfe0 in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00000000004547dd in gwsquare ()
#13 0x00000000004257e4 in selfTestInternal ()
#14 0x000000000043d126 in tortureTest ()
#15 0x000000000043d814 in LauncherDispatch ()
#16 0x000000000043f808 in Launcher ()
#17 0x000000000046b42a in ThreadStarter ()
#18 0x00007fd6a9c457fc in start_thread (arg=0x7fd6a8433700) at pthread_create.c:465
#19 0x00007fd6a8f60b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

While I know that I don't need to run the torture test for that length of time, I would like to have it run for at least 7 days to ensure confidence in the overclock. Any assistance would be greatly appreciated.

Regards,
Gordon
2018-02-15, 14:33   #2
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL
2·3·1,193 Posts

Try putting "CpuArchitecture=4" in local.txt. This will run the Intel version of xi3e rather than the AMD version.
2018-02-15, 20:21   #3
GordonWade

Feb 2018
22 Posts

OK, I have added this to local.txt. What is the impact of running the Intel version rather than the AMD version?
2018-02-15, 22:50   #4
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL
2·3·1,193 Posts

Differences should be minimal. I don't remember all the minor code changes for K8. One was the use of the PREFETCHW instruction instead of the PREFETCHT1 instruction.
2018-02-15, 22:51   #5
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL
2×3×1,193 Posts

Also, if you believe the error is coming from one particular FFT length, then you can torture test just that FFT size to reproduce the problem faster.
2018-02-16, 00:00   #6
GordonWade

Feb 2018
22 Posts

How do I configure for a given FFT size? Can I also set the M# too?

F.Y.I.:
"K8 FFT length 3200K, Pass1=640, Pass2=5K, clm=1" was used 16 times with various M###### iterations (8 times on each processor).
"K8 tpe-2 FFT length 3200K, Pass1=640, Pass2=5K, clm=1" was used 20 times with various M###### iterations (10 times on each processor).
M41943041 was used 12 times with various FFT types and lengths, but this is the first time the combination was seen.

I had been monitoring the final ~1 hour of the process with a watched ps command looping at a 0.5-second interval, and the process went away between 19:30:02.05 and 19:30:03.00. Not knowing when the "[Worker #1 Feb 14 19:30] Test 6" message is logged from within the code, I can't tell whether the failure occurs when Worker #2 starts processing M41943041 or whether it is Worker #1 that fails.

I had also been monitoring the entire run of the process at 60-second intervals and observed from the captured ps output that the SIZE and RSS values grew over time:

19546 mprime Mon Feb 12 17:45:53 2018 00:05 37248 17000
19546 mprime Mon Feb 12 17:45:53 2018 2-01:44:09 186208 142136

Don't know if that is relevant.
2018-02-16, 00:42   #7
Mysticial

Sep 2016

7×47 Posts

Quote:
 Originally Posted by Prime95 Differences should be minimal. I don't remember all the minor code changes for K8. One was use of the PREFETCHW instruction instead of the PREFETCHT1 instruction.
Off-topic question (maybe move to a separate thread):

Out of curiosity, under what circumstances do you use PREFETCHW?

From my experience:
• If the write is to be used again in the near future, then it's likely a cached region in which it will be used many times. So the cost of pulling it into cache is a one-time cost.
• If the write won't be used again (i.e. streaming), then use non-temporal stores.

The other thing is that write-misses aren't blocking unless they take long enough to overrun the OOE window. So they never seem to show up in the profiler unless there's also a bandwidth bottleneck.

Last fiddled with by Mysticial on 2018-02-16 at 00:49

2018-02-16, 03:05   #8
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2·3·1,193 Posts

Quote:
 Originally Posted by Mysticial Out of curiosity, under what circumstances do you use PREFETCHW?
I only use it on AMD K6 and K8 processors. The instruction didn't exist until recently on Intel CPUs.

Using prefetchw on those AMD chips did result in noticeable savings. I'm not familiar enough with ancient AMD cache design to explain why. Evidently, changing a cache line's state from clean to dirty on those chips had a cost, so loading the cache line from memory and pre-marking it dirty was beneficial.

On another off-topic note: On Intel chips, if I do a 64-byte write to an aligned 64-byte memory address that is not in the caches, is the chip smart enough not to load the line from memory before completely overwriting the data in the cache? Or must it load the data just to make sure the memory address is valid? I'll probably buy an i9X this summer and be able to answer questions like this for myself. Right now I'm coding up unoptimized AVX-512 FFTs. Get it working first, optimize it later.

2018-02-16, 03:58   #9
Mysticial

Sep 2016

7×47 Posts

Quote:
 Originally Posted by Prime95 On another off-topic note: On Intel chips if I do a 64-byte write to an aligned 64-byte memory address that is not in the caches, is the chip smart enough to not load the line from memory before completely overwriting the data in the cache? or must it load the data just to make sure the memory address is valid?
I've never specifically benchmarked that case myself, since knowing the answer wouldn't have affected any of my decisions: I'd be using 64-byte stores all the time in both cases.

Nevertheless, I want to say the answer is yes, normal 64-byte store misses do avoid the read from memory. But I'm not very confident of it, since I'm basing it on some weak indirect observations:
1. According to InstLatx64, 64-byte NT-stores have much higher throughput (per byte) than both 32-byte and 16-byte NT-stores.
2. In my own code, if I replace all 64-byte NT-stores with normal stores, there is a significant drop in performance without a noticeable increase in bandwidth consumption. (according to VTune)
I have not verified #1 myself, and I'm not entirely sure how they tested it. But assuming it's true, it seems to hint that 64-byte NT-stores are "easier", possibly because there's no need for any write-combining logic.

#2 seems to indicate that normal 64-byte stores do avoid the read from memory. If they didn't, the bandwidth consumption would increase significantly. However, there is also a large performance drop. (VTune traces these precisely to the stores that were converted from NT to normal.)

My hypothesis here is:
• Normal stores need to retain memory ordering with other stores. So they occupy the same OOE resources and will easily stall execution is they take too long. They also get flushed into cache thereby consuming the already insufficient L3 bandwidth on these chips.
• NT-stores stores get kicked into the NT-store buffer. Since there's no need to preserve any sort of memory ordering, they can hang out there as long as they want without blocking execution. They then go directly to memory bypassing all the caches.

Quote:
 I'll probably buy an i9X this summer and be able to answer questions like this for myself. Right now I'm coding up unoptimized AVX-512 FFTs. Get it working first, optimize it later.
I actually recommend doing it sooner rather than later. The "environment" has changed so drastically in this generation that it can derail any sort of pre-planning. That was the case for me with y-cruncher: 3 years of pre-written AVX512 code using standard extrapolations, and in the end half of it had to be rewritten anyway.
• AVX512 for double the compute.
• Only 2 SIMD ports instead of 3 (for 512-bit).
• Some chips only have 1 FMA.
• Longer latencies on port5.
• AVX512 throttling.
• L3 cache bandwidth cut by about a factor of 3x.
• Smaller usable LLC. (L2 + L3)
• NT-stores take longer to process therefore hogging more OOE resources.
• prefetchnta is actively harmful if you're not careful.
I mentioned in my blog that the L3 is "uselessly slow", so you only have 1 MB of usable LLC per core (half of the previous generation). Combined with 2x the SIMD size, the "effective size" of the cache (measured in the number of SIMD words) is 1/4 of what it used to be. This broke quite a few of my algorithms, since they couldn't be squeezed to fit in such a small amount of cache.

I also have a significant amount of code aimed at Cannonlake. But seeing how many of the assumptions I've made while writing that code are already collapsing in Skylake X, I've put the brakes on that stuff for now.

Agner Fog still doesn't have his analysis of Skylake X/Purley out. I'm not sure why, but I suspect it has something to do with how ridiculously complicated the architecture has become.

Last fiddled with by Mysticial on 2018-02-16 at 04:29

2018-02-16, 05:40   #10
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL
11011111101102 Posts

Thanks for the insights!
