#1
Sep 2002
25₁₀ Posts
I have an interesting problem, maybe one of you can help.
When I launch a 3D game, prime95 immediately drops out of a stability test (or a regular LL test) with a fatal error, e.g. "FATAL ERROR: Resulting sum was 5.986536030368264e+016, expected: 1.569155080311306e+017". It is always the first worker; the second is fine.

The system is a C2D E8200 with 2 GB RAM, an Asrock 4Core1600Twins-P35, an ATI HD3870 and Windows XP 64. The machine is otherwise stable. The error even occurs when underclocking to e.g. 1.6 GHz.

What I have found out so far: the errors do not result from voltage drops caused by additional power drawn by the graphics card, as there are none. For testing I nevertheless swapped the card for an Nvidia 7300GT, which hardly draws any current, but the result is the same (fatal error). Switching prime95 to the 32-bit build or to version 25.8 didn't help either. Oddly enough, this does not happen when launching applications like Furmark, and it does not happen under Linux.

Does anybody have a suggestion on how to debug this?
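For readers wondering what a "Resulting sum ... expected" message can mean, here is a minimal sketch of one kind of consistency check that produces such a pair of numbers. It is not prime95's code, and the thread does not confirm which check prime95 uses; the function names `cyclic_square` and `sum_check` are made up for illustration. The sketch relies only on a mathematical fact: when a list of digits is squared by cyclic convolution, the sum of the outputs equals the square of the sum of the inputs, so a single corrupted value makes the two sums disagree.

```python
def cyclic_square(a):
    """Cyclic self-convolution of an integer list (toy stand-in for an FFT squaring)."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] += a[i] * a[j]
    return out


def sum_check(inputs, outputs):
    """True if sum(outputs) matches the value predicted from the inputs."""
    return sum(inputs) ** 2 == sum(outputs)


digits = [3, 1, 4, 1, 5, 9, 2, 6]
result = cyclic_square(digits)
ok_clean = sum_check(digits, result)

# Simulate one value being damaged, e.g. by a register restored incorrectly.
damaged = list(result)
damaged[2] += 1
ok_damaged = sum_check(digits, damaged)

print(ok_clean, ok_damaged)
```

A check like this cannot say what damaged the data, only that the result no longer matches what the inputs predict.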
#2
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²×3×641 Posts
A while ago, I posted that a similar problem could be caused by a failure to properly save and restore floating-point registers somewhere in Windows when prime95 is interrupted for a higher-priority task.
Someone (perhaps xilman) replied that such problems had by then been fixed in Windows.

What version of Windows are you using, which service packs have been installed, and does it currently have all the applicable fixes from Microsoft Update?

When you write "a 3D-Game", do you mean one particular game, or a category of games? Does it happen with some games, but not others?

Last fiddled with by cheesehead on 2009-02-15 at 20:37
#3
Sep 2002
25₁₀ Posts
Well, that's interesting to hear. It's XP x64 SP2 with all patches from auto-update.
By 3D game I mean something like Far Cry 2 or Gothic 3. It doesn't happen with really old ones like Unreal Tournament.
#4
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²×3×641 Posts
Let me clarify that my suggested possibility of improper FP register save/restore was a theoretical bug that could produce your observed symptoms, based on my general knowledge of internals of some operating systems other than Windows. I don't claim to be an expert on Intel or Windows, but what I do know includes no reason why such a bug could not exist.
OTOH it's puzzling why such a fundamental problem could exist unreported or at least unfixed. The OS is supposed to save/restore registers across interrupts, without regard to or interference by any application. So no application's failure to save/restore should be able to affect any other application when there's no connection other than that one application interrupted another. That's what I meant by "such a fundamental problem" -- it's so fundamental to OS operation and design that ...

Perhaps it's a fault in some games' software (consistent with happening with some games but not others, but not the only possibility that's consistent with that) -- but that runs into the objection that, as explained above, the operating system should insulate all other applications from such a bug, so how could game software affect prime95?

Now, if, after the OS saved prime95's FP registers and then transferred control to the game software, the game software somehow managed to clobber those saved FP values (which shouldn't be possible!), then when it gave control back to the OS, the OS would restore prime95's FP register(s) to the clobbered value(s) rather than the value(s) it had saved earlier. Then when prime95 received control back from the OS after it was interrupted, it would, unknowingly, have incorrect values in one or more FP registers. This chain of events should not be possible, but perhaps there is an extremely rare combination of circumstances that nobody's thought of that would not be handled correctly.

(Why am I going on so long about this? Because such a problem has been reported every several months, I don't recall anyone's suggesting a plausible possibility other than what's described here (but I might have forgotten), and I'm "bugged" about it!)

I've encountered enough weird and supposedly-impossible bugs in my programming career not to rule out such remote possibilities until shown definitive proof-by-test. Indeed, there was a case of a bug I found that should have been impossible, in accordance with considerations similar to what I've outlined. Resolution: it turned out that someone had made a mistake in one wire's connection to a circuit board, and only a very specific set of circumstances triggered errors due to that hardware mistake.

In order for something like that to be the cause in your case (and other cases with similar symptoms), there would have to be a firmware or hardware defect in a small percentage of systems that only made a difference under a very specific set of conditions that usually arises only when certain game software was started while prime95 was running. Unlikely? Yes. Impossible? No (remember the Pentium FDIV bug?).

How to find such a bug, if it exists? Run the same games while prime95 is running on other systems with hardware/firmware/software combinations just like yours. (I'm not saying you have to be the one to do this, just that it's a method that could be used by a combination of people with sufficiently similar systems, perhaps able to be coordinated only through a site like this forum.) Or -- swap chips with some other system and try as above.

Last fiddled with by cheesehead on 2009-02-16 at 08:34
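The chain of events described in this post can be written down as a toy model. This is purely illustrative: `ToyCPU`, `ToyOS` and `run_scenario` are hypothetical names, and real context switching is done by the kernel and the CPU, not by anything like this. The sketch only shows that the interrupting program's own use of the registers is harmless, while a write to the saved copy (the step that "shouldn't be possible") is what would hand prime95 wrong values.

```python
class ToyCPU:
    def __init__(self):
        self.fp_regs = [0.0] * 8  # stand-in for the floating-point registers


class ToyOS:
    def __init__(self):
        self.save_area = {}  # per-process copies of the FP registers

    def interrupt(self, cpu, pid):
        # Save the interrupted process's registers before switching away.
        self.save_area[pid] = list(cpu.fp_regs)

    def resume(self, cpu, pid):
        # Restore whatever is in the save area, correct or not.
        cpu.fp_regs = list(self.save_area[pid])


def run_scenario(clobber_save_area):
    """Return True if 'prime95' gets back exactly the registers it had."""
    cpu, toy_os = ToyCPU(), ToyOS()
    cpu.fp_regs = [i + 0.5 for i in range(8)]  # "prime95" mid-calculation
    expected = list(cpu.fp_regs)

    toy_os.interrupt(cpu, "prime95")
    cpu.fp_regs = [-1.0] * 8  # the "game" uses the live registers: harmless
    if clobber_save_area:
        # Hypothetical fault: something overwrites the saved copy itself.
        toy_os.save_area["prime95"][3] = 123.0
    toy_os.resume(cpu, "prime95")
    return cpu.fp_regs == expected


ok_normal = run_scenario(clobber_save_area=False)
ok_clobbered = run_scenario(clobber_save_area=True)
print(ok_normal, ok_clobbered)
```

In the model, only code that can reach the save area can cause the fault, which is why ordinary application code should not be able to do it.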
#5
Jun 2003
1558₁₆ Posts
Normal processes can NOT corrupt each other's memory/state in Win32.
So the bug, if it exists, must be in a device driver. AFAIK, these things play by a different set of rules.
#6
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts
Oh, yeah, I forgot to mention the rumors about device drivers. Would the differences in whether the problem occurred with one game rather than another be attributable to different device drivers? (I know little about game internals.)
BTW, if a device driver can do such corruption, I consider that to be an operating system flaw (for the design or specifications of driver/OS combinations, interfaces, or permissions to permit such a thing is professionally incompetent). Perhaps I should mention that I've personally studied and written IOCRs (I/O control routines) that were the equivalent of today's device drivers AFAIK.
Last fiddled with by cheesehead on 2009-02-16 at 09:48
#7
Jun 2003
2³·683 Posts
Graphics card manufacturers are notorious for including game-specific "optimizations" in their device drivers to win the benchmarketing stakes!
#8
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts
... and the combination of Windows/Intel software/hardware specifications for them and their interfaces with the OS fails to prevent them from corrupting processes, right? That's professional incompetence by system designers.
Last fiddled with by cheesehead on 2009-02-16 at 09:54
#9
Jan 2003
2·103 Posts |
When you say there are no voltage swings, is this because you enabled some sort of load line calibration in the BIOS? This eliminates vdroop, but at the cost of undamped current swings. And the 2nd core, being the weaker of the two, is close enough to its stability threshold to be affected. Just my guess; maybe totally off.
#10
Sep 2002
5² Posts
I didn't see any drop on the 12/5/3.3 V lines, and the Vcpu sensor reading remains constant. In any case, the 7300GT I tested draws only 10 W more in 3D than at idle, which shouldn't have much of an impact.
I'll check out the device drivers; maybe something comes up with sound or chipset.
Last fiddled with by kaeptn_kork on 2009-02-16 at 22:34
#11
Sep 2002
5² Posts
I got some time to fiddle with the system. After removing the sound driver, the errors disappeared!
Now the challenge is to find a suitable sound driver (ALC888 Audio Codec).