mersenneforum.org  

Old 2011-07-13, 17:18   #1068
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

23·149 Posts

Each core adds more overall performance; there's never a case where X cores do more total work than X+1 cores. But the performance of each core drops the more loaded the CPU is.

So the answer to your question would be: "4 cores"

Last fiddled with by James Heinrich on 2011-07-13 at 17:19
Old 2011-07-13, 18:26   #1069
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by James Heinrich
Each core adds more overall performance; there's never a case where X cores do more total work than X+1 cores. But the performance of each core drops the more loaded the CPU is.
mfaktc performance drops 50% when the 4th core is loaded. If all cores suffer the same penalty, then 4/4 × 0.5 = 0.5, which is less than 3/4. But Prime95 performance does not seem to drop off as badly when the 4th core is added... I'll need to do some tests. As it is, the 8800 GTS performs at 3/4 of the GTX 465 :-(
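
The arithmetic above can be sketched in a few lines of Python. This is only a rough model, not a measurement: it assumes every instance suffers the same slowdown once the 4th core is loaded, and `fraction_of_ideal` is a hypothetical helper that expresses throughput as a fraction of four ideal, unloaded cores.

```python
def fraction_of_ideal(active_cores: int, efficiency: float, total_cores: int = 4) -> float:
    """Throughput as a fraction of `total_cores` ideal cores.

    `efficiency` is the assumed per-core speed relative to an unloaded core
    (1.0 = no slowdown, 0.5 = the 50% drop mentioned in the post).
    """
    return active_cores * efficiency / total_cores


three_cores = fraction_of_ideal(3, 1.0)   # three cores, no penalty
four_cores = fraction_of_ideal(4, 0.5)    # four cores, each at half speed

print(f"3 cores: {three_cores:.2f}  4 cores at 50%: {four_cores:.2f}")
```

Under that assumption three unpenalized cores come out ahead; whether the penalty really applies equally to every core is exactly what the tests would need to show.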
Old 2011-07-13, 19:10   #1070
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

23·149 Posts

Someone else will likely correct me, but I believe a GTX 465 needs more than 1 instance to show its potential. I'd try 2 instances of mfaktc and 2 Prime95 workers and see what your overall throughput is like.
Old 2011-07-13, 19:47   #1071
Karl M Johnson
 
Mar 2010

3×137 Posts

Yes. Even with 64 bit mfaktc.
Old 2011-07-13, 20:12   #1072
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by James Heinrich
Someone else will likely correct me, but I believe a GTX 465 needs more than 1 instance to show its potential. I'd try 2 instances of mfaktc and 2 Prime95 workers and see what your overall throughput is like.
I don't know how to check, but it was said that CUDALucas maxes out the GPU. Maybe I'll just run CUDALucas on the 465 (+4 Prime95 workers) and run mfaktc on my two 8800s?
Old 2011-07-17, 17:55   #1073
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by Christenson
10200 = 27D8....you sure you have the right return-type declared for cudaStreamCreate?

If you are just trying to run mfaktc, I'd be inclined to ignore the "I can't build it" problem. What do you hope to do with the modification?
I figured this one out - toolkit mismatch.

I have also modified mfaktc so it no longer needs atomics and compiles for any CUDA compute capability under CUDA 2.2/3.1/3.2. I haven't tried 4.0, but I do not see why it would have a problem with that.

Anyway, here's the modified mfaktc:
Attached Files
File Type: zip mfaktc.0.17.sm_10.zip (120.9 KB, 107 views)
Old 2011-07-17, 22:57   #1074
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

apsen: why didn't you change the version string? It seems that you made a lot more changes than just removing the atomics...

Oliver
Old 2011-07-17, 23:38   #1075
TheJudger
 
"Oliver"
Mar 2005
Germany

457₁₆ Posts

apsen: your changes seem to screw something up.

On my GTX 8800 (CUDA 4.0) it sometimes fails the short selftest!
Performance is half of the expected value (no async CPU/GPU computation?).

Code:
running a simple selftest...
ERROR: selftest failed for M49635893!
  expected result: 000F300E 00B13196 00D84F67
  reported result: 001DAC4B 001DAC50 001DAC55
  reported result: 001DAC57 001DAC5D 001DAC5F
  reported result: 001DAC61 001DAC67 001DAC6B
  reported result: 001DAC70 001DAC73 001DAC7A
  reported result: 001DAC7E 001DAC84 001DAC8A
  reported result: 001DAC8C 001DAC8E 001DAC99
  reported result: 001DAC9A 001DAC9E 001DACA0
  reported result: 001DACA2 001DACA3 001DACA5
  reported result: 001DACAF 001DACB0 001DACB5
  reported result: 001DACB6 001DACBD 001DACBE
Selftest statistics
  number of tests           31
  successfull tests         30
  wrong factor reported     1

selftest FAILED!
I don't recommend running apsen's version until this is fixed!
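
For anyone who wants to double-check a reported factor independently of mfaktc: a candidate f divides the Mersenne number M_p = 2^p − 1 exactly when 2^p ≡ 1 (mod f). A minimal Python sketch follows; the function name `divides_mersenne` is made up for this example and is not part of mfaktc, and the small M11 case is used only because its factors are well known.

```python
def divides_mersenne(p: int, f: int) -> bool:
    """Return True if f is a non-trivial divisor candidate with f | 2**p - 1."""
    # pow with three arguments does modular exponentiation, so this stays
    # fast even for exponents in the tens of millions.
    return f > 1 and pow(2, p, f) == 1


# M11 = 2**11 - 1 = 2047 = 23 * 89
print(divides_mersenne(11, 23))  # True
print(divides_mersenne(11, 89))  # True
print(divides_mersenne(11, 7))   # False
```

To apply it to a selftest result, the reported words would first have to be combined into a single integer in whatever word order and word size the kernel uses, which this sketch does not attempt.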

Oliver

Last fiddled with by TheJudger on 2011-07-17 at 23:48
Old 2011-07-18, 01:49   #1076
Christenson
 
Dec 2010
Monticello

1795₁₀ Posts

Hi Oliver:

I've been putting my time into parse.c... I've gone through one rewrite and need another to get it organized around a parse_line function that returns a structure holding both the data found and the original line.

I really don't have time to check over apsen's changes right now, as work has gotten rather rough....I'm supposed to be doing something I never have done before, with few resources and little support.
Old 2011-07-18, 01:51   #1077
apsen
 
Jun 2011

10000011₂ Posts

Quote:
Originally Posted by TheJudger
apsen: your changes seem to screw something up.

On my GTX 8800 (CUDA 4.0) it sometimes fails the short selftest!
Performance is half of the expected value (no async CPU/GPU computation?).

I don't recommend running apsen's version until this is fixed!

Oliver
Sorry, it wasn't really meant for general consumption. That's why I did not post the executable. I was hoping for you to take a look at it. We could move this to a private conversation.

For me all tests (including the long one) come up fine. I did have a problem in the interim, so maybe I need to check whether I posted the right version. Also, I haven't tested with CUDA 4.0...

The idea is simple: give each thread its own chunk of memory to write its results to, so there's no need for a shared variable.

I did have to rearrange the code a little bit to make that possible, but I tried to keep it easy to diff. It could use a little straightening otherwise.
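
A rough illustration of that idea, written as a plain Python simulation rather than CUDA and not taken from the modified mfaktc source: with a shared result index, every thread that finds a factor has to bump one common counter (which on the GPU needs an atomic increment), whereas here each thread owns a fixed block of slots and the host collects the non-empty ones afterwards. All names (`SLOTS_PER_THREAD`, `kernel_thread`, `collect`, `run`) are invented for this sketch, and the tiny exponent 11 is used only to keep the example checkable by hand.

```python
SLOTS_PER_THREAD = 2  # assumed cap on how many hits one thread may record


def kernel_thread(tid, candidates, exponent, results):
    """Stand-in for one GPU thread: test candidates, write hits to its own slots."""
    base = tid * SLOTS_PER_THREAD   # start of this thread's private chunk
    used = 0
    for f in candidates:
        if used < SLOTS_PER_THREAD and pow(2, exponent, f) == 1:
            results[base + used] = f   # no other thread ever writes here
            used += 1


def collect(results):
    """Host side: gather every slot that a thread filled in (0 means empty)."""
    return [f for f in results if f != 0]


def run(exponent, work):
    """Simulate one kernel launch; `work[tid]` is the candidate list of thread tid."""
    results = [0] * (len(work) * SLOTS_PER_THREAD)
    for tid, candidates in enumerate(work):
        kernel_thread(tid, candidates, exponent, results)
    return collect(results)


# M11 = 2047 = 23 * 89, so only 23 and 89 should be reported.
print(run(11, [[3, 23], [5, 7], [89, 91]]))
```

The trade-off in this layout is more result memory and a scan on the host in exchange for not needing a shared counter; hits beyond the per-thread cap are silently dropped in this sketch, which a real implementation would have to handle.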

Last fiddled with by apsen on 2011-07-18 at 02:02
Old 2011-07-18, 01:51   #1078
Christenson
 
Dec 2010
Monticello

3403₈ Posts

Quote:
Originally Posted by James Heinrich
Someone else will likely correct me, but I believe a GTX 465 needs more than 1 instance to show its potential. I'd try 2 instances of mfaktc and 2 Prime95 workers and see what your overall throughput is like.
Someone else says you are dead on, James. Here's why: right now mfaktc is sieving on the CPU...so you will either need a very hot CPU core or two cores to reach full potential on a hot GPU card....
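
As a back-of-the-envelope model of that point (the rates below are made-up placeholders, not benchmarks, and the function names are hypothetical): if each CPU core can sieve candidates at some rate and the GPU can test them at a higher rate, overall throughput is capped by whichever side is slower, so you need enough mfaktc instances for the combined sieve rate to reach the GPU's rate.

```python
import math


def overall_rate(instances: int, sieve_rate_per_core: float, gpu_rate: float) -> float:
    """Throughput is limited by the slower of CPU sieving and GPU testing."""
    return min(instances * sieve_rate_per_core, gpu_rate)


def instances_needed(sieve_rate_per_core: float, gpu_rate: float) -> int:
    """Smallest number of sieving instances whose combined rate reaches the GPU's."""
    return math.ceil(gpu_rate / sieve_rate_per_core)


# Placeholder rates in millions of candidates per second.
SIEVE_RATE = 60.0   # assumed: one CPU core
GPU_RATE = 100.0    # assumed: one fast GPU

print(overall_rate(1, SIEVE_RATE, GPU_RATE))   # CPU-bound with one instance
print(overall_rate(2, SIEVE_RATE, GPU_RATE))   # GPU-bound with two
print(instances_needed(SIEVE_RATE, GPU_RATE))
```

The model ignores the per-core slowdown discussed earlier in the thread, so real numbers will be somewhat worse than it suggests.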

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.