mersenneforum.org  

Old 2006-04-07, 10:08   #1
TheJudger
 
"Oliver"
Mar 2005
Germany

10001010111₂ Posts
Strange problem with torture test on 16-core machines

Hello,

I have two machines with 16 cores each (8-way Opteron with dual-core CPUs) running SuSE Linux 10.0 with kernel 2.6.16.1.
Installed memory is 64/128 GB.
Both machines ran several memory test programs for several days without problems.
When I start "mprime -t" 16 times (sprime 24.14, each process in its own directory), usually 1 (sometimes 2) of the processes dies within 3 minutes, while the other processes run fine for hours.

The error message is ALWAYS this:
Code:
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Torture Test ran 0 minutes - 1 errors, 0 warnings.
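For readers unfamiliar with this message: the torture test multiplies large numbers with floating-point FFTs, and every result should land very close to an integer. The distance to the nearest integer is the rounding error; at 0.5 the value sits exactly between two integers and cannot be trusted. The sketch below illustrates that kind of check with a plain textbook FFT multiplication. It is not mprime's code (which uses a far more elaborate, hand-optimised transform); the function names are made up for illustration, and only the 0.4 limit is taken from the message above.

```python
import cmath

ROUNDOFF_LIMIT = 0.4  # limit quoted in the error message; everything else is illustrative


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def multiply_with_roundoff(a_digits, b_digits):
    """Multiply two little-endian digit lists via FFT.

    Returns (coefficients, max_roundoff), where max_roundoff is the largest
    distance between a raw floating-point result and its nearest integer.
    """
    size = 1
    while size < len(a_digits) + len(b_digits):
        size *= 2
    fa = fft([complex(d) for d in a_digits] + [0j] * (size - len(a_digits)))
    fb = fft([complex(d) for d in b_digits] + [0j] * (size - len(b_digits)))
    raw = fft([x * y for x, y in zip(fa, fb)], invert=True)
    coeffs, max_roundoff = [], 0.0
    for value in raw:
        real = value.real / size  # the inverse transform needs scaling by 1/size
        nearest = round(real)
        max_roundoff = max(max_roundoff, abs(real - nearest))
        coeffs.append(nearest)
    return coeffs, max_roundoff


def to_digits(n, base=10):
    """Split a non-negative integer into little-endian digits."""
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits or [0]


def checked_product(a, b, base=10):
    """Multiply a and b via FFT and fail loudly if the rounding error is too large."""
    coeffs, err = multiply_with_roundoff(to_digits(a, base), to_digits(b, base))
    if err >= ROUNDOFF_LIMIT:
        raise ArithmeticError(
            f"Rounding was {err}, expected less than {ROUNDOFF_LIMIT}")
    return sum(c * base ** i for i, c in enumerate(coeffs))
```

On healthy hardware the rounding error for inputs this small is many orders of magnitude below the limit, which is why a value of 0.5 points to a corrupted intermediate result rather than ordinary floating-point noise.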
I've tried pinning each mprime process to one specific core via numactl -> the problem occurs on random CPUs. It is very reproducible... and it's the same behaviour on both machines.
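A setup like the one described (one process per core, each in its own directory, bound with numactl) could be scripted roughly as follows. This is only a sketch under assumptions: the directory layout, the `./mprime` path, and the function names are hypothetical, and the exact numactl invocation used in the original test is not stated; `--physcpubind` is one way to bind a process to a single CPU.

```python
import os
import subprocess


def build_pinned_command(core, binary="./mprime"):
    """Return the argv for one torture-test process bound to a single core."""
    return ["numactl", f"--physcpubind={core}", binary, "-t"]


def launch_all(num_cores=16, base_dir="torture", binary="./mprime"):
    """Start one pinned torture test per core, each in its own working directory.

    Returns a dict mapping core number to the running subprocess.Popen object.
    """
    # Resolve the binary first: each child runs with a different working directory.
    binary = os.path.abspath(binary)
    procs = {}
    for core in range(num_cores):
        workdir = os.path.join(base_dir, f"core{core:02d}")  # hypothetical layout
        os.makedirs(workdir, exist_ok=True)
        procs[core] = subprocess.Popen(
            build_pinned_command(core, binary), cwd=workdir)
    return procs
```

Calling `launch_all()` and later checking `proc.poll()` for each entry shows which cores lost their process; keeping the core number as the dictionary key makes it easy to see whether the failures really move between CPUs.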

No "machine check exceptions" were reported.

Any ideas?

Last fiddled with by TheJudger on 2006-04-07 at 10:09
Old 2006-04-07, 12:02   #2
Cruelty
 
May 2005

2³×7×29 Posts

If it occurs on random CPUs, then maybe the power supply is responsible for this behaviour?
Old 2006-04-07, 12:55   #3
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2²×1,873 Posts

It could still be memory. Try running the in-place small FFT torture test on all CPUs. If this works, then I think it is a memory problem.
Old 2006-04-07, 16:23   #4
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts

Right now it looks like a spooky hw problem, at least on one system (probably one system board has some problems).
Memory was swapped several times.

I've partitioned the machines into 2/4-way (4/8-core) systems, and the problem does not occur on all partitions -> conclusion: NOT a sw problem (Linux kernel etc.)
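Narrowing a fault down by partition or socket amounts to counting which cores keep producing the fatal error over repeated runs. A minimal sketch of such a tally is shown below. The function names are hypothetical, the log text is assumed to be whatever each process printed, and mapping core numbers to sockets as `core // 2` is an assumption about CPU numbering that may not hold on a given system.

```python
from collections import Counter


def failing_cores(logs):
    """Return the sorted core numbers whose captured output contains a fatal error.

    logs maps core number -> text printed by the torture test on that core.
    """
    return sorted(core for core, text in logs.items() if "FATAL ERROR" in text)


def failures_per_socket(runs, cores_per_socket=2):
    """Count fatal errors per socket over several repeated runs.

    runs is an iterable of logs dicts, one per run. The socket index is
    ASSUMED to be core // cores_per_socket; check the real topology first.
    """
    tally = Counter()
    for logs in runs:
        for core in failing_cores(logs):
            tally[core // cores_per_socket] += 1
    return tally
```

If the errors pile up on one socket regardless of which memory is installed, that points at the CPU or its board slot; an even spread across sockets would point elsewhere.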
Old 2006-04-08, 02:20   #5
moo
 
Jul 2004
Nowhere

1451₈ Posts

Does it also become a 50000 watt space heater.... just wondering, 'cause that's a ton of heat. Also, could you post pictures so I can drool over them -_-

Last fiddled with by moo on 2006-04-08 at 02:20
Old 2006-04-08, 11:20   #6
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts

Quote:
Originally Posted by moo
Does it also become a 50000 watt space heater.... just wondering, 'cause that's a ton of heat. Also, could you post pictures so I can drool over them -_-

Equipped with 8x Opteron 885 (2.6 GHz dual-core) and 128 GB of memory, it consumes "only" ~1000 watts ;)

Right now I'm pretty sure that one core of one CPU does spooky things when SSE is used...

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.