mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Unstable Ryzen 7950X (https://www.mersenneforum.org/showthread.php?t=28300)

Prime95 2022-12-17 00:29

Stop all P-1 and PRP testing.

This system has hardware problems. Concentrate on passing a torture test.

My first thought whenever hardware problems arise is to change the memory settings. You'll need to get comfortable with the BIOS. Reduce memory speed, then try a torture test.

tuckerkao 2022-12-17 01:08

I reduced the RAM frequency from 6000 to 4800 in the BIOS; 6000 seems to have been the factory overclock.

Prime95 passed the torture test with around 16 small tests. I'll run P-1 with only 1 GB/worker of emergency memory and see whether the same problem arises.


Progress: P-1 has survived at least 20 minutes with all 16 cores in use and no errors reported.


I still cannot find the Extreme Tweaker menu in my ASUS Prime X670-P BIOS to adjust the thermal limit of my CPU.

Prime95 2022-12-17 01:43

The small FFT torture test does not test RAM as much as a blend torture test.

James Heinrich 2022-12-17 04:53

[QUOTE=tuckerkao;620029]I reduced the frequency of the RAMs from 6000 to 4800[/QUOTE]If you simply set your RAM to 6000 without enabling the AMD EXPO profile (which also adjusts all sorts of other things like voltages) then it's unlikely to be stable at 6000. If your RAM does not have AMD EXPO profile, it may be tricky to get it working at that speed.

tuckerkao 2022-12-17 05:03

[QUOTE=James Heinrich;620046]The TF bitlevel is probably pretty close to what GPU72 recommends.[/QUOTE]
After the Nvidia Lovelace GPUs were released, most exponents from M108M to M120M have been trial factored to at least 2[SUP]77[/SUP] by either Chalsall's GPU72 crews or TheJudger's group.

If using that ratio, all M168M exponents should be brought up to at least 2[SUP]79[/SUP] which will take less than 2 hours total from 2[SUP]74[/SUP] to 2[SUP]79[/SUP] per exponent on my fastest GPU before running the PRP tests.
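For a rough sense of where that time goes: every factor of M(p) has the form q = 2kp + 1, so the number of candidates between 2[SUP]b[/SUP] and 2[SUP]b+1[/SUP] doubles with each bit level. Below is a minimal Python sketch, assuming the time per candidate stays roughly constant across bit levels. Real TF programs sieve candidates and may switch kernels at higher bit levels, so treat it only as an estimate. The function names are made up for illustration.

```python
def tf_level_weights(start_bit, end_bit):
    """Relative work per bit level [b, b+1), with the first level set to 1."""
    return {b: 2 ** (b - start_bit) for b in range(start_bit, end_bit)}


def candidate_count(p, b):
    """Approximate count of k with 2^b <= 2*k*p + 1 < 2^(b+1), before sieving."""
    return 2 ** b // (2 * p)


weights = tf_level_weights(74, 79)
total = sum(weights.values())
for b, w in weights.items():
    print(f"2^{b} -> 2^{b + 1}: {w / total:.1%} of the work")

# Raw (unsieved) candidate counts for one exponent mentioned in this thread.
p = 168916721
print(candidate_count(p, 74), candidate_count(p, 78))
```

Under that assumption the last level, 2[SUP]78[/SUP] to 2[SUP]79[/SUP], accounts for 16/31 (about half) of the total work from 2[SUP]74[/SUP] to 2[SUP]79[/SUP].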


[QUOTE=James Heinrich;620047]If you simply set your RAM to 6000 without enabling the AMD EXPO profile (which also adjusts all sorts of other things like voltages) then it's unlikely to be stable at 6000. If your RAM does not have AMD EXPO profile, it may be tricky to get it working at that speed.[/QUOTE]
I'm not sure whether I need to re-run the P-1 tests that ran during the unstable time frame. The impacted exponents were M[M]168916721[/M], M[M]168455141[/M], M[M]168465413[/M], M[M]168465421[/M], M[M]168465601[/M], M[M]168173101[/M]. I'm writing the list down so I don't forget them later.

paulunderwood 2022-12-17 05:07

[QUOTE=James Heinrich;620047]If you simply set your RAM to 6000 without enabling the AMD EXPO profile (which also adjusts all sorts of other things like voltages) then it's unlikely to be stable at 6000. If your RAM does not have AMD EXPO profile, it may be tricky to get it working at that speed.[/QUOTE]

EXPO is one-click RAM OC for Gigabyte boards. DOCP is ASUS's.

I recommend setting BIOS defaults and just turning on DOCP or EXPO, depending on the board's manufacturer.

James Heinrich 2022-12-17 05:18

[QUOTE=paulunderwood;620049]EXPO is one-click RAM OC for Gigabyte boards. DOCP is ASUS's[/QUOTE]EXPO (in this context) is AMD's version of Intel's XMP, brand new for AM5:
[url]https://www.amd.com/en/technologies/expo[/url]
It's not specific to any motherboard manufacturer or RAM manufacturer, but both need to support it to work as intended.

paulunderwood 2022-12-17 05:24

[QUOTE=James Heinrich;620050]EXPO (in this context) is AMD's version of Intel's XMP, brand new for AM5:
[url]https://www.amd.com/en/technologies/expo[/url]
It's not specific to any motherboard manufacturer or RAM manufacturer, but both need to support it to work as intended.[/QUOTE]

I stand corrected. I hit a page from 2016 about EOCP :redface:

S485122 2022-12-17 11:54

James posted some data in the [url=https://www.mersenneforum.org/showthread.php?t=28107]Zen4 7950X Benchmarks[/url] thread:
[QUOTE=James Heinrich;620024]After playing with my 7950X for a bit I came to two conclusions:

1) Getting workers aligned to chiplets is [b]vital[/b]. Running 16-thread PRP across 2 chiplets is actually 40% [b]slower[/b] than just running 8 threads on one chiplet.

2) You can save a fair amount of power and heat and not lose much performance. I actually dropped the thermal limit from 90°C to 70°C in the BIOS. My PRP iteration times are perhaps 1% slower, but CPU power consumption is down from 235W to 195W, temperature down from 92C to 72C, and (important to me) the fan noise is down from very-noticeable to barely-there. For me at least it's a hugely worthwhile tradeoff.[/QUOTE]
His 7950X used 235 W without going over 90 °C.
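The tradeoff in the quoted post can be put in numbers. This is a small sketch, assuming the wattage figures are average CPU power under the same PRP workload and that "perhaps 1% slower" means each iteration takes about 1% longer. The function name is only illustrative.

```python
def energy_per_iteration_ratio(power_before_w, power_after_w, slowdown_fraction):
    """Energy per iteration after the change, divided by the value before.

    Energy = power * time, and each iteration takes (1 + slowdown_fraction)
    times as long after the change.
    """
    return (power_after_w / power_before_w) * (1.0 + slowdown_fraction)


ratio = energy_per_iteration_ratio(235.0, 195.0, 0.01)
print(f"energy per iteration: {ratio:.3f} of the original")
print(f"saving: {1.0 - ratio:.1%}")
```

With those inputs the energy per iteration drops by roughly 16%, which is why the lower thermal limit costs so little in practice.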

Tuckerkao's 7950X reaching more than 86 °C at 153 W might indicate a cooling problem. He blamed the errors showing up in Prime95 on Windows 11, some updates, then Windows 10, then Prime95, ... He changed all kinds of settings without fully understanding their meaning.
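To make the temperature comparison concrete, here is a rough sketch of temperature rise per watt for both systems. The 25 °C ambient is an assumption (neither poster gave a room temperature), CPU temperature is not strictly linear in power, and the two machines have different coolers, so this is only an indicator and not a measurement.

```python
ASSUMED_AMBIENT_C = 25.0  # assumption: no room temperature was reported


def rise_per_watt(cpu_temp_c, power_w, ambient_c=ASSUMED_AMBIENT_C):
    """Temperature rise above ambient per watt of CPU power, in °C/W."""
    return (cpu_temp_c - ambient_c) / power_w


james = rise_per_watt(92.0, 235.0)   # figures from James Heinrich's post
tucker = rise_per_watt(86.0, 153.0)  # figures reported for tuckerkao's system
print(f"James:     {james:.3f} °C/W")
print(f"tuckerkao: {tucker:.3f} °C/W")
print(f"ratio:     {tucker / james:.2f}")
```

A clearly higher rise per watt on the second system is consistent with the suspicion of a cooler mounting or airflow problem, but it does not prove one.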

He should start by reverting his motherboard BIOS settings to the factory defaults without overclocking, perhaps just configuring the memory settings according to the CPU, memory and motherboard specifications (checking the respective manufacturers' documentation).

Then he should check his hardware setup: CPU cooler, ventilation of the case, ...

Then revert his Prime95 settings to the defaults, enter his user and computer names, set the memory to use for P-1/P+1 and ECM, and optionally set the work type.

After that, start torture testing with "Small FFTs" to check the cooling of his hardware, then "Large FFTs" to check the memory, or just "Blend". A torture test should be run for a long time (hours), especially if hardware problems are suspected.

Andrew Usher 2022-12-17 15:52

Based on symptoms it seems not unlikely that overheating is contributing to his problems, and he knows he's running too hot (and should know how to fix it, because it's surely his settings that caused it to run hot in the first place - other than possibly laptops, CPUs don't normally reach that temperature with default settings). Most people including me have a max CPU temp around 70 C.

Re-doing the suspect P-1 is surely a good idea. As for the P-1 bounds, perhaps he'd understand better if he knew of the 30.8 changes that cause the greatest amount of the disparity - he's running 30.8, and presumably has enough memory to run the faster stage 2 on these exponents, and so really should be taking advantage of the higher B2 it allows. If stage 2 takes less than half as much time as stage 1 (ignoring GCD time), it's definitely suboptimal. But that is less important than reliability - good P-1 at acceptable bounds beats bad P-1 at good bounds.
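The stage-time rule of thumb above can be written as a one-line check. This is only a sketch of that heuristic; the function name and the timings in the usage lines are hypothetical, and GCD time is left out as stated.

```python
def stage2_looks_underused(stage1_seconds, stage2_seconds):
    """True if stage 2 took less than half as long as stage 1 (GCD excluded)."""
    return stage2_seconds < 0.5 * stage1_seconds


# Hypothetical timings, in seconds, purely for illustration.
print(stage2_looks_underused(3600.0, 1200.0))  # stage 2 is a third of stage 1
print(stage2_looks_underused(3600.0, 3000.0))  # stage 2 is most of stage 1
```

If the check comes back true, a higher B2 is probably worth trying, once the machine is stable.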

And to repeat what others have said, the torture test is there for a reason! Every time you make a significant hardware change or encounter apparent hardware problems it should be run again.

scan80269 2022-12-18 01:02

I concur with others in that memory is the most likely point of trouble for the OP.

My 7950X is paired with an ASUS ProArt X670E-Creator WiFi motherboard. This board supports DDR5 memory with AMD EXPO profiles but not DDR5 memory with Intel XMP profiles. I test-installed a pair of Trident Z5 RGB XMP DDR5 6000 memory modules in this motherboard and the XMP profile did not even appear in BIOS setup. That is when I realized that X670/E motherboards really need DDR5 memory modules with an AMD EXPO profile if one wants memory speed & timings faster than the DDR5 4800 JEDEC standard. I use a pair of G.Skill Trident Z5 Neo AMD EXPO 16GB DDR5 6000 modules with my 7950X and have not encountered any stability issues.

While ASUS motherboards typically allow memory timings to be manually configured, this is always a risky approach with questionable odds of success.

Another thing to consider is how Microsoft Windows 11 can mess up support for CPUs with non-hybrid/legacy core architecture such as Zen 3 & Zen 4 Ryzen. The most recent OS scheduler efforts by Microsoft have focused on optimally supporting hybrid core CPUs such as Intel Alder Lake & Raptor Lake. It appears that Microsoft has struggled to keep up with the latest gen Intel & AMD CPUs, and the scheduler cannot support hybrid core CPUs without compromising non-hybrid core CPU performance and perhaps robustness as well. The latest incident of Windows 11 22H2 causing performance issues with Zen 4 Ryzen CPUs, reported back in October, was by no means the first.

In addition to running the Prime95 torture test with large FFTs (stresses memory controller and RAM), I recommend running Memtest86+ to assess the integrity of the memory subsystem.

One last recommendation: update the motherboard to the latest released BIOS from the manufacturer. AMD AGESA code support for Zen 4 is apparently still evolving to improve system performance and DRAM compatibility.


All times are UTC.