Crashes in Prime95 with Zen 2
I've been running Prime95 for some 10 years. Recently I replaced my Ryzen 7 1700, which had been working flawlessly, with a brand-new Ryzen 7 3700X, everything else staying the same. Since then, every night around 6 am, for some reason, Prime95 crashes with an access violation.
Hardware:
AMD Ryzen 7 3700X (stock settings)
ASRock X370 Killer SLI with BIOS 5.40 (latest)
4x Kingston 4 GB DDR4-2133 ECC (Memtest-approved, of course)

Software:
Prime95 v29.8 build 3

Fault addresses (relative to image base):
0x1bc4f03
0x1bc50b9
0x1bc4f03

Normally I'd suspect the hardware, but the fault addresses all fall in the same subroutine, twice on the very same address. Also, the memory has been tested thoroughly and is ECC-protected. So I hope the author is willing to take a look. |
I suspect it is the automated benchmark crashing.
Try doing a throughput benchmark on a 4096K FFT. Select "Benchmark all implementations...". If it crashes, post results.bench.txt. Then we'll get a Zen 1 user to do the same thing to see which FFT implementation is crashing. |
It didn't crash at all with a 4096K FFT, so I decided to test what I'm currently working on: 4800K FFTs, 4 cores, 4 workers. It crashed immediately, so results.bench.txt only contains topology information, I'm afraid. You still want it?
|
No need.
Add "Autobench=0" to prime.txt while I investigate. |
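For anyone following along: the setting goes on its own line in prime.txt (the options file in the Prime95 directory), e.g.

[CODE]Autobench=0[/CODE]

Prime95 generally reads prime.txt at startup, so restart the program after editing for the change to take effect.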
On second thought, do send the output.
My code review turns up nothing suspicious. For grins, please try adding "CpuSupports3DNow=0" in local.txt. I don't think that will make a difference. One other possibility is a bug in the hwloc library. Debugging may require remote access to zen 2 machine. Preferably linux. |
[CODE]AMD Ryzen 7 3700X 8-Core Processor
CPU speed: 4165.85 MHz, 8 hyperthreaded cores
CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 8x32 KB, L2 cache size: 8x512 KB, L3 cache size: 2x16 MB
L1 cache line size: 64 bytes, L2 cache line size: 64 bytes
Machine topology as determined by hwloc library:
Machine#0 (total=11269920KB, Backend=Windows, hwlocVersion=2.0.3, ProcessName=prime95.exe)
  Package (total=11269920KB, CPUVendor=AuthenticAMD, CPUFamilyNumber=23, CPUModelNumber=113, CPUModel="AMD Ryzen 7 3700X 8-Core Processor ", CPUStepping=0)
    L3 (size=16384KB, linesize=64, ways=16, Inclusive=0)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x00000003)
            PU#0 (cpuset: 0x00000001)
            PU#1 (cpuset: 0x00000002)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x0000000c)
            PU#2 (cpuset: 0x00000004)
            PU#3 (cpuset: 0x00000008)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x00000030)
            PU#4 (cpuset: 0x00000010)
            PU#5 (cpuset: 0x00000020)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x000000c0)
            PU#6 (cpuset: 0x00000040)
            PU#7 (cpuset: 0x00000080)
    L3 (size=16384KB, linesize=64, ways=16, Inclusive=0)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x00000300)
            PU#8 (cpuset: 0x00000100)
            PU#9 (cpuset: 0x00000200)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x00000c00)
            PU#10 (cpuset: 0x00000400)
            PU#11 (cpuset: 0x00000800)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x00003000)
            PU#12 (cpuset: 0x00001000)
            PU#13 (cpuset: 0x00002000)
      L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
        L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
          Core (cpuset: 0x0000c000)
            PU#14 (cpuset: 0x00004000)
            PU#15 (cpuset: 0x00008000)
Prime95 64-bit version 29.8, RdtscTiming=1[/CODE] |
You most probably already know this, but Zen 2 has double the AVX bandwidth compared to Zen 1, so it can process 256-bit AVX at full speed. My own FFT implementation benchmark (determining the order of 18782*(2^32-1)^4096+1) went from 2m19s to 1m49s, a 27.5% speed increase.
Also, why is there no exception reporting with a full register dump in Prime95? That helps enormously with fault finding. I could send you some source if needed. |
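For what it's worth, the quoted speed-up checks out; a quick sanity check on the arithmetic (times taken from the post above, converted to seconds):

```python
# Times quoted above for the order-determination benchmark.
zen1_s = 2 * 60 + 19   # 2m19s on the Ryzen 7 1700 (Zen 1)
zen2_s = 1 * 60 + 49   # 1m49s on the Ryzen 7 3700X (Zen 2)
speedup = zen1_s / zen2_s - 1
print(f"speed increase: {speedup:.1%}")  # speed increase: 27.5%
```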
"AutoBench=0" did the trick for now!
|
Any other Zen 2 users out there? Do they crash too on a 4800K all implementations FFT?
|
Bed time now, but if you don't get more reports before tomorrow I can try it too.
|
A quick run before I go into work: it crashed, reproducibly.
Settings:
CPU: 3700X
P95 29.8b5
Windows 10, 64-bit
min/max FFT: 4800K
"Benchmark HT" unselected
4 cores, 4 workers

Happens with "benchmark all implementations" both checked and unchecked! Looking at the output window, it starts a test and then crashes almost immediately. There's a second or so of running before it does so, and there's nothing in the output other than hwloc stuff. When crashing, the application closes without any further notice. No errors displayed in Windows. If I leave it on the default of 8 cores with 1, 2, or 8 workers, that runs normally. So it seems limited to the 4 cores / 4 workers setting. |
Sounds like it is a problem with hwloc?
I think you can disable hwloc with this line in prime.txt, maybe you can check if that helps? EnableSetAffinity=0 |
I'm at work now so it will be some time before I can do any follow up testing. It seems to get past hwloc ok, and the crash happens when running the fft. I guess the question now is, what is different about running 4c4w than 8c 1,2,8w?
Forgot to say, I quickly tried the same on a 6700K at 4c4w; it ran normally without problems. Edit: didn't think of it at the time, but I wonder if 8 cores, 4 workers would work... |
Does not sound like an hwloc problem.
|
To recap and cover new testing:
Prime95 29.8b5, Windows 10 64-bit (probably all on 1903), 4800K FFT throughput benchmark:

3700X (Zen 2, 8 cores):
8 cores, 1/2/4/8 workers: OK
4 cores, 4 workers: crashes

3600 (Zen 2, 6 cores):
6 cores, 1/2/6 workers: OK
4 cores, 4 workers: crashes

6700K (Skylake, 4 cores):
4 cores, 4 workers: OK

8086K (Coffee Lake, 6 cores):
6 cores, 1/6 workers: OK
4 cores, 4 workers: OK

I can't easily test older Ryzen generations, as I dropped the new CPUs into the systems that had them. |
Is this issue 100% reproducible?
|
[QUOTE=ixfd64;522529]Is this issue 100% reproducible?[/QUOTE]
Probably. It has affected two different users, both on Windows. Does it happen under Linux? Any chance either Evil Genius or mackerel could load Linux in a VM and try mprime? |
I have it running in the Linux subsystem. What parameters do you want me to use with mprime?
|
[QUOTE=ixfd64;522529]Is this issue 100% reproducible?[/QUOTE]
Yes. I'm just one of the early adopters. |
[QUOTE=Evil Genius;522537]I have it running in the Linux subsystem. What parameters do you want me to use with mprime?[/QUOTE]
./mprime -m

Then choose Benchmark and the same options you used under Windows. |
[CODE][Mon Jul 29 22:46:46 2019]
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
AMD Ryzen 7 3700X 8-Core Processor
CPU speed: 4281.77 MHz, 8 hyperthreaded cores
CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 8x32 KB, L2 cache size: 512 KB, L3 cache size: 32 MB
L1 cache line size: 64 bytes, L2 cache line size: 64 bytes
Machine topology as determined by hwloc library:
Machine#0 (total=16706384KB, Backend=Linux, OSName=Linux, OSRelease=4.4.0-18362-Microsoft, OSVersion="#1-Microsoft Mon Mar 18 12:02:00 PST 2019", HostName=zenstation, Architecture=x86_64, hwlocVersion=2.0.3, ProcessName=mprime)
  Package#0 (total=16706384KB, CPUVendor=AuthenticAMD, CPUFamilyNumber=23, CPUModelNumber=113, CPUModel="AMD Ryzen 7 3700X 8-Core Processor ", CPUStepping=0)
    Core#0 (cpuset: 0x00000003)
      PU#0 (cpuset: 0x00000001)
      PU#1 (cpuset: 0x00000002)
    Core#1 (cpuset: 0x0000000c)
      PU#2 (cpuset: 0x00000004)
      PU#3 (cpuset: 0x00000008)
    Core#2 (cpuset: 0x00000030)
      PU#4 (cpuset: 0x00000010)
      PU#5 (cpuset: 0x00000020)
    Core#3 (cpuset: 0x000000c0)
      PU#6 (cpuset: 0x00000040)
      PU#7 (cpuset: 0x00000080)
    Core#4 (cpuset: 0x00000300)
      PU#8 (cpuset: 0x00000100)
      PU#9 (cpuset: 0x00000200)
    Core#5 (cpuset: 0x00000c00)
      PU#10 (cpuset: 0x00000400)
      PU#11 (cpuset: 0x00000800)
    Core#6 (cpuset: 0x00003000)
      PU#12 (cpuset: 0x00001000)
      PU#13 (cpuset: 0x00002000)
    Core#7 (cpuset: 0x0000c000)
      PU#14 (cpuset: 0x00004000)
      PU#15 (cpuset: 0x00008000)
Prime95 64-bit version 29.8, RdtscTiming=1
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=384, Pass2=12800, clm=4 (4 cores, 4 workers): 23.54, 23.94, 23.82, 23.91 ms. Throughput: 168.08 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=384, Pass2=12800, clm=2 (4 cores, 4 workers): 24.53, 24.65, 24.51, 24.55 ms. Throughput: 162.87 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=384, Pass2=12800, clm=1 (4 cores, 4 workers): 25.53, 25.51, 25.21, 25.63 ms. Throughput: 157.05 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=640, Pass2=7680, clm=4 (4 cores, 4 workers): 23.93, 23.73, 23.92, 23.51 ms. Throughput: 168.28 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=640, Pass2=7680, clm=2 (4 cores, 4 workers): 24.26, 24.22, 24.41, 24.42 ms. Throughput: 164.43 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=640, Pass2=7680, clm=1 (4 cores, 4 workers): 25.18, 25.28, 25.16, 25.14 ms. Throughput: 158.79 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=768, Pass2=6400, clm=4 (4 cores, 4 workers): 24.01, 23.44, 23.92, 23.89 ms. Throughput: 167.98 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=768, Pass2=6400, clm=2 (4 cores, 4 workers): 24.36, 24.35, 24.40, 24.46 ms. Throughput: 163.99 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=768, Pass2=6400, clm=1 (4 cores, 4 workers): 25.19, 24.74, 24.78, 25.21 ms. Throughput: 160.14 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=1280, Pass2=3840, clm=4 (4 cores, 4 workers): 24.02, 23.44, 23.37, 23.29 ms. Throughput: 170.01 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=1280, Pass2=3840, clm=2 (4 cores, 4 workers): 24.11, 24.07, 24.09, 24.25 ms. Throughput: 165.77 iter/sec.
FFTlen=4800K all-complex, Type=3, Arch=4, Pass1=1280, Pass2=3840, clm=1 (4 cores, 4 workers): 24.56, 24.70, 24.68, 24.63 ms. Throughput: 162.31 iter/sec.[/CODE]
mprime did not crash |
Do not run the all-complex FFTs. The crashes were with that checkbox off.
|
[QUOTE=ixfd64;522529]Is this issue 100% reproducible?[/QUOTE]
On two systems I can do it on demand 100%. [QUOTE=Prime95;522536]Does it happen under Linux? Any chance either Evil Genius or mackerel could load Linux in a VM and try mprime?[/QUOTE] Not something I can do any time soon. [QUOTE=Prime95;522547]Do not run the all-complex FFTs. The crashes were with that checkbox off.[/QUOTE] If it helps, I went the other way. It still crashes if I check complex FFTs on Windows. |
[CODE]Prime95 64-bit version 29.8, RdtscTiming=1
FFTlen=4800K, Type=3, Arch=4, Pass1=320, Pass2=15360, clm=4 (4 cores, 4 workers): 24.32, 24.22, 24.05, 24.21 ms. Throughput: 165.30 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=320, Pass2=15360, clm=2 (4 cores, 4 workers): 24.94, 25.10, 25.14, 24.90 ms. Throughput: 159.88 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=320, Pass2=15360, clm=1 (4 cores, 4 workers): 25.35, 25.56, 25.31, 25.15 ms. Throughput: 157.83 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=384, Pass2=12800, clm=4 (4 cores, 4 workers): 24.61, 23.98, 24.37, 24.19 ms. Throughput: 164.73 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=384, Pass2=12800, clm=2 (4 cores, 4 workers): 24.73, 24.94, 24.87, 24.69 ms. Throughput: 161.26 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=384, Pass2=12800, clm=1 (4 cores, 4 workers): 25.46, 25.24, 25.24, 25.18 ms. Throughput: 158.23 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=640, Pass2=7680, clm=4 (4 cores, 4 workers): 24.59, 24.63, 24.31, 24.28 ms. Throughput: 163.59 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=640, Pass2=7680, clm=2 (4 cores, 4 workers): 24.67, 24.60, 24.68, 24.58 ms. Throughput: 162.39 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=640, Pass2=7680, clm=1 (4 cores, 4 workers): 25.49, 25.45, 25.26, 25.43 ms. Throughput: 157.43 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=768, Pass2=6400, clm=4 (4 cores, 4 workers): 24.80, 24.51, 24.64, 24.70 ms. Throughput: 162.19 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=768, Pass2=6400, clm=2 (4 cores, 4 workers): 25.17, 25.14, 25.03, 25.22 ms. Throughput: 159.11 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=768, Pass2=6400, clm=1 (4 cores, 4 workers): 25.72, 25.79, 25.55, 25.77 ms. Throughput: 155.61 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=1280, Pass2=3840, clm=4 (4 cores, 4 workers): 24.10, 24.14, 24.41, 24.57 ms. Throughput: 164.58 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=1280, Pass2=3840, clm=2 (4 cores, 4 workers): 24.15, 23.99, 24.02, 24.23 ms. Throughput: 165.98 iter/sec.
FFTlen=4800K, Type=3, Arch=4, Pass1=1280, Pass2=3840, clm=1 (4 cores, 4 workers): 24.88, 25.04, 25.09, 25.07 ms. Throughput: 159.89 iter/sec.[/CODE] |
So what we know:
1) The problem is Windows-only.
2) The problem occurs only with 4 cores / 4 workers (on 6-core and 8-core CPUs).
3) The problem occurs on a 4800K FFT. Are other FFT sizes a problem?

Perhaps clue #2 is the key. Maybe the bug can be reproduced on an Intel Windows machine when benchmarking fewer cores than are available. |
On my old 8-core Haswell-E with 29.8b5, the 4800K benchmark with 4 cores, 4 workers works fine, with and without all-complex FFTs.
|
[QUOTE=Prime95;522568]2) Problem is only 4 cores / 4 workers (on 6-core and 8-core CPUs).[/QUOTE]
This was not reproduced on the 8086K (Intel, 6 cores), so it isn't universal. I'm wondering if it is a CCX thing with Ryzen; it would be interesting to try this with an earlier generation. Edit: I've been thinking about changing the cooling on my Ryzen systems, so I could use that opportunity to drop in one of the older CPUs again for a test. Might not be before the weekend. Edit 2: I've asked on another forum if anyone else can do the testing; might get results faster that way. |
When I select 4096k FFT, 2 cores, 2 workers, Prime95 goes haywire.
Error setting affinity to core #xyz. There are 8 cores. Error setting affinity to core #xyz. There are 8 cores. Error setting affinity to core #xyz. There are 8 cores. ... mprime is fine with it. |
Stack corruption?
Please set "AffinityVerbosityBench=3" in prime.txt and try again. It will print out a smidge more information. |
I get a lot of 'Affinity set to cpuset 0x00000003' or similar credible hexadecimal numbers, but occasionally 'Error setting affinity to core #208. There are 8 cores.' or something similar.
|
Thinking out loud:
The bad core number is stored on the stack. There are a number of hwloc calls prior to attempting to set the affinity. The problem happens only under Windows on Zen 2. Possible causes:

1) Bug in prime95. Seems unlikely, in that prime95 executes the same code for both Intel and Zen 2.
2) Bug in Zen 2. Seems unlikely; AMD has a big QA budget.
3) Bug in hwloc. Possible. Note that the Linux hwloc output differs from the Windows hwloc output. The hwloc developers may not have tested on Zen 2 yet.
4) Bug in Windows. Hwloc gets much of its info from the OS. We've seen several cases where hwloc returns bad cache information because the OS isn't detecting caches properly.

Not sure how to proceed from here. The hwloc bug repository had no relevant posts as of 2 days ago. |
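To illustrate the kind of corruption being described (a sketch only; the function names are hypothetical and this is not Prime95's actual code), an hwloc-style cpuset bitmask decodes to PU numbers like so, and a core index can be range-checked before it ever reaches the OS affinity call:

```python
def cpuset_to_pus(mask: int) -> list[int]:
    """Decode an hwloc cpuset bitmask (e.g. 0x00000003) into PU indices."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

def check_core(core: int, num_cores: int) -> int:
    """Range-check a core index before binding, rather than letting a
    corrupted value like #208 reach the affinity call."""
    if not 0 <= core < num_cores:
        raise ValueError(f"Error setting affinity to core #{core}. "
                         f"There are {num_cores} cores.")
    return core

print(cpuset_to_pus(0x00000003))  # [0, 1]   - the two SMT threads of core 0
print(cpuset_to_pus(0x0000c000))  # [14, 15] - core 7
```

A core number of 208 on an 8-core machine would fail the check immediately, which is what makes stack corruption upstream of the affinity call the prime suspect.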
I think a good question would be: why is the CPU core the affinity is being set to >= 8?
|
Although it will prove nothing if it succeeds, try running hwloc's stand-alone program called lstopo or lstopo-no-graphics from [url]https://www.open-mpi.org/software/hwloc/v2.0/[/url]
|
Machine (11GB total) + Package L#0
  NUMANode L#0 (P#0 11GB)
  L3 L#0 (16MB)
    L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
    L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)
  L3 L#1 (16MB)
    L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#8)
      PU L#9 (P#9)
    L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#10)
      PU L#11 (P#11)
    L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#12)
      PU L#13 (P#13)
    L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#14)
      PU L#15 (P#15) |
[QUOTE=Prime95;522641]Although it will prove nothing if it succeeds, try running hwloc's stand-alone program called lstopo or lstopo-no-graphics from [url]https://www.open-mpi.org/software/hwloc/v2.0/[/url][/QUOTE]
Is there a way to compare the hwloc-reported topology on Windows with that given by hwloc (or simply in /proc/cpuinfo) on a same-CPU Linux system? |
Since I'm tinkering with the boxes now, I just reproduced the problem with Windows 7, which strictly speaking neither MS nor AMD support with Zen 2 CPUs.
The interesting thing is, the first time I ran 4 cores, 4 workers on a 3600, I got a load of "Error setting affinity to core #xyz. There are 6 cores." messages on screen before Windows reported an application error. Nothing in the log after hwloc. Subsequent runs went straight to the application error without those affinity messages, presumably due to something written to the config files after the first run. It might take some time, but I'm going to drop in a 2600 shortly to see if that is also affected. I didn't get any testers on the other forum I posted on. |
1 Attachment(s)
2600 temporarily installed. I tried 4 cores, 4 workers, and it crashed just like it did on the 3600 and 3700X. Since it was mentioned earlier in the thread, I also tried 2 cores, 2 workers; that also gave a load of affinity errors but completed without crashing. The errors didn't appear in the log, but as it didn't crash I was able to copy and save them in the attached text file.
So the new information right now is: it also happens on Windows 7, not just Windows 10, and it also happens with a Zen+ CPU, so it's not limited to Zen 2. I have a crazy idea to try out, back shortly :) Edit: and the results are in. I went into the BIOS and disabled half the cores, so it is running in a 3+0 configuration, i.e. a single CCX. Tried a bench with 2 cores, 2 workers: ran fine, no errors. Similarly with 3c3w. Is there something about splitting work across CCXs that is causing the problem? |
After it crashes without an error message, have you checked the Windows Event Viewer to see the code reported by the application crash?
|
I don't have a Zen 2 to test on, but would the recent fix in 29.8b6 apply to the issues in this thread?
|
[QUOTE=hansl;523934]I don't have a Zen 2 to test on, but would the recent fix in 29.8b6 apply to the issues in this thread?[/QUOTE]
Yes. The bug was in prime95 running on a CPU with multiple L3 caches. The benchmark code that makes sure a worker's threads are all running in the same L3 cache was flawed. |
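The constraint George describes can be sketched like this (an illustration using the 3700X topology posted earlier in the thread, not the actual benchmark code):

```python
# 3700X as seen by hwloc: two L3 cache domains (CCXs) of 4 cores each.
L3_DOMAINS = [[0, 1, 2, 3], [4, 5, 6, 7]]

def l3_of(core: int) -> int:
    """Index of the L3 cache domain a given core sits under."""
    for i, cores in enumerate(L3_DOMAINS):
        if core in cores:
            return i
    raise ValueError(f"unknown core {core}")

def worker_fits_one_l3(worker_cores: list[int]) -> bool:
    """True if all of a worker's threads run under the same L3 cache."""
    return len({l3_of(c) for c in worker_cores}) == 1

print(worker_fits_one_l3([0, 1, 2, 3]))  # True: all in the first CCX
print(worker_fits_one_l3([2, 3, 4, 5]))  # False: straddles both CCXs
```

On a single-L3 Intel part the check is trivially satisfied, which would explain why only multi-L3 CPUs (Zen/Zen+/Zen 2) tripped the flawed bookkeeping.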
[QUOTE=Prime95;523965]Yes. The bug was in prime95 running on a CPU with multiple L3 caches. The benchmark code that makes sure a worker's threads are all running in the same L3 cache was flawed.[/QUOTE]
Is this an extension to the core-affinity considerations? I.e., do the various cores statically map to a given L3 cache, or is that mapping something the OS can fiddle with at runtime? |
[QUOTE=ewmayer;524051]Is this an extension to the core-affinity considerations? I.e. do various cores statically map to a given L3 cache, or is that mapping something the OS can fiddle at runtime?[/QUOTE]
Maybe the OS is smart enough to group different threads from the same process into the same L3 cache -- or maybe not. Hwloc libraries give you enough control to ensure this happens. |