mersenneforum.org


fivemack 2008-01-09 00:40

Wanted: known-good motherboard
 
I'm looking for a motherboard with on-board video which takes a Q6600 processor and is known to work without trouble with 8GB of memory under a recent version of Ubuntu.

What I have just bought is a DG965OT, which has a BIOS issue, which Intel has been aware of for nine months and does not plan to fix, meaning that the kernel is loaded into uncached memory and runs slower than two sloths tied together crawling backwards through particularly thick treacle. This is obviously unacceptable; I've still got the packaging, the board's not fit for purpose, I'll send it back, but then I have a Q6600 and 8GB of DDR2 memory and no motherboard :glare:

Wacky 2008-01-09 03:18

I know a mother, and she is bored!
Perhaps we "geeks", should expend some effort in other areas :)

Xyzzy 2008-01-09 05:00

We have a DG965OT and it is very fast with 4GB of memory. We didn't know about any 8GB problem. Good thing we didn't plan to buy that much!

Do you have a link to the technical article on the issue?

fivemack 2008-01-09 12:04

[url]http://lkml.org/lkml/2007/6/4/316[/url] is a long thread on linux-kernel which hashes out what the problem is and then mocks Intel for not solving it.

[url]http://forums.fedoraforum.org/showthread.php?t=157232[/url] suggests that, if I downgrade to exactly the right version of the BIOS and sacrifice two pink goats and a yellow llama (but it mustn't be a spotted yellow llama) things might be made able to work.

lavalamp 2008-01-09 14:58

Heh, you sound like you used to write for Blackadder.

Anyway, the next logical step up from the DG965OT would be either [url=http://www.intel.com/products/motherboard/DG33TL/index.htm]DG33TL[/url] or [url=http://www.intel.com/products/motherboard/DQ35JO/index.htm]DQ35JO[/url]/[url=http://www.intel.com/products/motherboard/DQ35MP/index.htm]DQ35MP[/url], assuming you want to stay with Intel. However, if you want to keep a distance from Intel boards then you're spoilt for choice as there are a [url=http://www.scan.co.uk/Products/Products.ASP?CatID=28&FilterCategories=130&Thumbnails=yes]whole[/url] [url=http://www.overclockers.co.uk/productlist.php?&groupid=701&catid=5&subid=326&sortby=priceAsc]host[/url] of G965/G33/G35 boards from other vendors, most of which are also micro-ATX. I only have limited experience with Linux though, so I cannot advise how well any of them play with 'buntu.

I did skim the thread on LKML and I came away with the impression that it's a problem specific to the 965 chipset (which may affect boards from other vendors too), so perhaps one of the newer Intel boards would be without this issue.

Re-reading what I've just written it doesn't seem very helpful, I'm sure you knew most or all of what I said, nevertheless, I'll post it anyway just for the Blackadder comment.

Xyzzy 2008-01-09 17:41

Good fun:

[url]http://www.bbc.co.uk/comedy/blackadder/quotes/[/url]

lavalamp 2008-01-09 22:46

Indeed. :grin:[quote]Blackadder: I'm as poor as a church mouse, that's just had an enormous tax bill on the very day his wife ran off with another mouse, taking all the cheese.

Blackadder: We're in the stickiest situation since Sticky the Stick Insect got stuck on a sticky bun.

Blackadder: Baldrick, does it have to be this way? Our valued friendship ending with me cutting you up into strips and telling the prince that you walked over a very sharp cattle grid in an extremely heavy hat? [/quote]

fivemack 2008-01-16 16:23

I have bought a Gigabyte P35-DS3R board and a video card, and after a particularly tiresome six-hour computer assembly session, ably assisted by such of my friends as enjoyed the aggressive use of screwdrivers, and given extra drama by the discovery that 'the fan isn't spinning' is no longer a sign that the computer is misbuilt - thermostatic fans don't necessarily spin at boot-up - managed to build a working 8GB quad-core computer.

This has, I suspect, allowed me to discover new and exciting threading bugs in msieve-1.33 which would probably not have shown up on a platform with less memory or fewer processors. Also, were my spare bedroom transported through time to 1993, the left and right computers in it would be #2 and #3 on the top500 supercomputer list.

lavalamp 2008-01-16 17:40

I'd go forward in time to acquire a PC so that it heads the top of the list now. :wink:

Good luck with your server farm.

tallguy 2008-01-16 17:50

[quote=Xyzzy;122526]Good fun:

[URL]http://www.bbc.co.uk/comedy/blackadder/quotes/[/URL][/quote]
Great stuff. My fav, from the Christmas special: [quote]
Blackadder (to Baldrick): “You wouldn't know a subtle plan if it painted itself purple and danced naked on top of a harpsichord singing 'Subtle Plans Are Here Again.'”[/quote]

fivemack 2008-01-17 09:07

Sadly, they appear to have been hardware rather than threading bugs.

At least, if the same executable running on the same data on two computers differing only in motherboard and amount of memory runs to completion on computer A and has to restart before running to completion on computer B, one might assume that computer B has a hardware fault.

Now, how does one go about diagnosing and fixing hardware faults? I suspect long and involved mprime torture tests would be a good start.
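The principle behind a torture test is to repeat a deterministic calculation and check that the answer never changes. A minimal sketch of that idea follows; the hash-chain workload and the names `workload` and `consistent` are made up for illustration and are far lighter than what mprime actually runs.

```python
import hashlib

def workload(seed, rounds=2000):
    """Deterministic busy-work: a chain of SHA-256 hashes starting from seed."""
    data = hashlib.sha256(seed).digest()
    for _ in range(rounds):
        data = hashlib.sha256(data).digest()
    return data.hex()

def consistent(runs=5, seed=b"same data"):
    """Return True if every run of the workload produced the same digest."""
    results = {workload(seed) for _ in range(runs)}
    return len(results) == 1

print("consistent" if consistent() else "MISMATCH: suspect the hardware")
```

On healthy hardware this always prints "consistent"; a mismatch from identical input points at the machine rather than the program, which is the same inference as comparing computer A with computer B.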

garo 2008-01-17 20:39

Usual crap. Eliminate the variables one at a time:

1. Memory
2. Heat
3. Overclock if any stressing either of the above or the FSB.
4. Motherboard
5. CPU

Unfortunately, there is no short cut.

fivemack 2008-01-23 12:08

[QUOTE=garo;123064]Usual crap. Eliminate the variables one at a time:

1. Memory
2. Heat
3. Overclock if any stressing either of the above or the FSB.
4. Motherboard
5. CPU

Unfortunately, there is no short cut.[/QUOTE]

But how do I go about eliminating those variables? I've left memtest running overnight - no issue appeared. The machine isn't overclocked. I've written my own memory tester and run it on all four cores (admittedly unthreaded) - no issue appeared.

I'm not sure what else I can do without spending money; there's a slight temptation to buy a chunkier cooler for the CPU, but the temperatures I'm measuring on the CPU on the unreliable machine are no different from the ones I measure on a similar system (same CPU, different motherboard, less memory) that works.
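For reference, a pattern-based memory check of the kind mentioned above can be sketched roughly like this. It is not the tester actually used here; the patterns, buffer size and function names are assumptions, and a user-space test only touches whatever memory the OS hands it, with the CPU caches in the way.

```python
import multiprocessing as mp

# Assumed test patterns: all zeros, all ones, and the two alternating-bit bytes.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)

def check_buffer(size_bytes, patterns=PATTERNS):
    """Fill a buffer with each pattern, read it back, and count mismatched bytes."""
    buf = bytearray(size_bytes)
    errors = 0
    for p in patterns:
        buf[:] = bytes([p]) * size_bytes       # write pass
        errors += size_bytes - buf.count(p)    # read-back pass
    return errors

def run_workers(n_workers=4, size_bytes=1 << 20):
    """Run one independent check per worker process; return per-worker error counts."""
    with mp.Pool(n_workers) as pool:
        return pool.map(check_buffer, [size_bytes] * n_workers)
```

Calling `run_workers()` from under an `if __name__ == "__main__":` guard runs four separate processes, one per core, which mirrors running an unthreaded tester four times. Any non-zero count would be worth chasing; all zeros proves little, as the overnight memtest already showed.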

Andi47 2008-01-23 13:50

As far as I have heard, memtest does not find every memory issue.

Does a Prime95 torture test fail on this computer?

1.) Try, for example, building the computer with only one of the memory sticks and then running the torture test for 24 or 48 hours (or running an msieve postprocessing job which fits into the memory), then swap in the next memory stick and run the torture test (or threaded msieve linear algebra) again, etc., until you have found the faulty memory. (If it fails with (nearly) all memory sticks, it is most probably not a memory issue.) You say that threaded linear algebra fails after 24-48 hours, so running a memory test (either memtest or Prime95 torture test or something else) only overnight is probably not sufficient to detect a memory issue.

You can also put the memory sticks into a known reliable computer and test if it fails now.

2.) What are your CPU- and motherboard temperatures (idle and stressed)?

3.) should be no issue as you say that the box is not OC'ed.

4.) I don't know how to test this other than by eliminating variables 1, 2, 3 and 5.

5.) You can, for example, put the CPU onto another motherboard and test whether the torture test or threaded msieve fails.

garo 2008-01-24 13:11

Like Andi47 said above...

There is no OC so that is ruled out. Do tell us what the temps on both machines are. Also, try and swap CPUs between the reliable and unreliable machine so that you can tell if the CPU itself is at fault.

paulunderwood 2008-01-24 14:00

[QUOTE=garo;123064]Usual crap. Eliminate the variables one at a time:

1. Memory
2. Heat
3. Overclock if any stressing either of the above or the FSB.
4. Motherboard
5. CPU

Unfortunately, there is no short cut.[/QUOTE]

Additionally, the PSU (power supply unit) ought to be checked out. With all that memory, plus devices, the PSU might just not be up to it. I suggest you swap PSUs and re-check.

Andi47 2008-01-24 17:09

[QUOTE=paulunderwood;123790]Additionally, the PSU (power supply unit) ought to be checked out. With all that memory, plus devices, the PSU might just not be up to it. I suggest you swap PSUs and re-check.[/QUOTE]

@Fivemack: What are the Voltages (VCore, etc.) of your PC?

fivemack 2008-01-27 10:52

I've looked at all the BIOS options, and found something about memory timing which was set to 'Turbo'; I reset it to a more conservative level, and have managed to run one thread of msieve for 96 hours without hitting issues.

That seems to make sense to explain the symptoms - it is likely that 'Turbo' is an option which doesn't take account of the rather large memory loading in an 8GB system.

Andi47 2008-01-27 12:33

[QUOTE=fivemack;124046]I've looked at all the BIOS options, and found something about memory timing which was set to 'Turbo'; I reset it to a more conservative level, and have managed to run one thread of msieve for 96 hours without hitting issues.

That seems to make sense to explain the symptoms - it is likely that 'Turbo' is an option which doesn't take account of the rather large memory loading in an 8GB system.[/QUOTE]

Did you also try multithreaded msieve?

Did single-threaded msieve also fail with the Turbo option on?

garo 2008-01-28 16:31

While Turbo may be part of the problem, I would advise you to dig deeper and figure out which component the Turbo setting was stressing. That component is probably vulnerable (unless Turbo was really screwing some settings) and more likely to fail soon.

fivemack 2008-01-28 23:38

Yes, single-threaded msieve had failed in the past with Turbo on.

On the other hand, a four-threaded msieve with Turbo off has just failed.

The particularly tiresome bit in all this is that the fast machine is the NFS server to a small farm of headless machines, and so they can't get anything done while I'm fiddling with the fast machine, and having acquired the farm I feel peculiarly distressed when it's not working efficiently.

fivemack 2008-02-04 10:00

Output from 'sensors' on the unreliable machine

Core temperatures +72, +69, +70, +71 C
Voltages 1.18V, 1.74V, 3.38V, 3.02V, 1.38V, 0V, 0.08V, 3.07V, 3.10V
fan 1991RPM
case temperatures +42, +57, -2 (I presume this last one is an error)

Output from 'sensors' on the reliable machine running four threads of linear algebra:

Core temperatures +59C, +52C, +58C, +52C

Maybe I should go and buy some thermal grease and a 500W PSU; it's quite possible that the CPU fan isn't well-installed on the unreliable machine.
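To put a number on the gap, here is a small sketch that pulls the core temperatures out of the two summary lines above and compares the averages. The helper name `core_temps` is made up, and the comparison assumes both machines were under similar load, which is only stated for the reliable one.

```python
import re
from statistics import mean

# Summary lines copied from the readings above.
unreliable = "Core temperatures +72, +69, +70, +71 C"
reliable = "Core temperatures +59C, +52C, +58C, +52C"

def core_temps(line):
    """Extract every '+NN' reading from a summary line as integers."""
    return [int(t) for t in re.findall(r"\+(\d+)", line)]

hot = core_temps(unreliable)
cool = core_temps(reliable)
gap = mean(hot) - mean(cool)

print(f"unreliable mean {mean(hot):.2f} C, reliable mean {mean(cool):.2f} C, gap {gap:.2f} C")
```

That works out to an average of 70.5 C against 55.25 C, so the unreliable machine's cores run about 15 C hotter.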

paulunderwood 2008-02-04 10:12

The "turbo" timings are aggressive and unless you do have all the motherboard manufacturer's recommended parts (e.g. memory) -- the top notch stuff -- I think "turbo" will never work. :wink: I have a board that fails with "turbo".

sdbardwick 2008-02-04 12:05

The unreliable machine's temps seem high and the fan speed seems low. Try setting the Smart Fan Control Method to Disabled (in the BIOS under PC Health Status section). That should force the CPU fan to maximum and give you an idea if the cooler needs to be reseated/replaced.

garo 2008-02-08 16:47

Yeah looks like your temps are too high. Try the fan thing and if that does not work try re-seating the CPU. You can hold off on the new PSU for the moment.

fivemack 2008-02-25 22:06

I've replaced the cooler with a giant copper thing with heat-pipes and a transparent fan with blue LEDs (a slightly tiresome process since you have to remove the motherboard); the temperature went down by about six degrees, but a medium-sized four-threaded SNFS linalg job using msieve-1.33 still failed weirdly (ran to 157% complete, then submatrix-is-not-invertible):

[code]
Mon Feb 25 00:29:06 2008 matrix is 2368013 x 2368229 (653.8 MB) with weight 164779638 (69.58/col)
Mon Feb 25 00:29:06 2008 sparse part has weight 147700924 (62.37/col)
Mon Feb 25 00:29:06 2008 matrix includes 64 packed rows
Mon Feb 25 00:29:06 2008 using block size 65536 for processor cache size 4096 kB
Mon Feb 25 00:29:30 2008 commencing Lanczos iteration (4 threads)
Mon Feb 25 00:29:30 2008 memory use: 687.4 MB
Mon Feb 25 14:58:23 2008 lanczos error: submatrix is not invertible
Mon Feb 25 14:58:23 2008 lanczos halted after 58838 iterations (dim = 3720589)
Mon Feb 25 14:58:23 2008 linear algebra failed; retrying...
Mon Feb 25 14:58:23 2008 commencing Lanczos iteration (4 threads)
Mon Feb 25 14:58:23 2008 memory use: 687.4 MB
[/code]
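The 157% figure can be checked from the log itself. The sketch below assumes the percentage is simply the dimension reached divided by the number of matrix rows; that is a guess which happens to match the number, not something taken from the msieve source. The three log lines are copied from the output above.

```python
import re
from datetime import datetime

LOG = """\
Mon Feb 25 00:29:06 2008 matrix is 2368013 x 2368229 (653.8 MB) with weight 164779638 (69.58/col)
Mon Feb 25 00:29:30 2008 commencing Lanczos iteration (4 threads)
Mon Feb 25 14:58:23 2008 lanczos halted after 58838 iterations (dim = 3720589)
"""

def stamp(line):
    """Parse the 24-character timestamp at the start of a log line."""
    return datetime.strptime(line[:24], "%a %b %d %H:%M:%S %Y")

lines = LOG.splitlines()
rows = int(re.search(r"matrix is (\d+) x", LOG).group(1))
iters, dim = map(int, re.search(r"halted after (\d+) iterations \(dim = (\d+)\)", LOG).groups())

percent = 100 * dim / rows                                    # assumed meaning of "% complete"
elapsed = (stamp(lines[2]) - stamp(lines[1])).total_seconds() # first Lanczos run only

print(f"{percent:.1f}% of matrix size after {elapsed / 3600:.1f} h, {elapsed / iters:.3f} s/iteration")
```

Under that assumption the run got to about 157% of the matrix size after roughly fourteen and a half hours before giving up, i.e. it went well past the point where it should have finished.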

Andi47 2008-02-26 07:58

~65°C still seems a bit warm to me, but I don't know if that temperature is too high for a Core 2.

Does the CPU fan run at maximum speed? (If not, try to increase its speed; there must be an option in the BIOS setup somewhere.)

Does the case have fans to get cool air into the case? (a heat-piped copper heatsink sounds good, but it won't work properly if the air around it is boiling)

Have you tried a threaded run on the same mid-size number with an earlier version of msieve?

(One more weird idea: does the linalg still fail when you [i]underclock[/i] the CPU?)

paulunderwood 2008-02-26 09:11

Can we have a recap of the memory and mainboard makes and models -- then we might see what the timings should be.

Please list the voltages again, this time with labels.

With the memory, stress your computer with one stick plugged in, then two etc.

When you fit the heatsink do not use too much grease. You need good heat transfer not insulation. The idea is to remove pockets of air which heat up. Use a good quality grease. :smile:


All times are UTC.
