mersenneforum.org  

2011-04-22, 16:46   #771
Karl M Johnson
Mar 2010
3·137 Posts

From what I've heard, 4.0 is total bullshift, except inline ptx assembly support.
2011-04-22, 17:05   #772
Xyzzy
"Mike"
Aug 2002
202A₁₆ Posts

Quote:
Originally Posted by TheJudger View Post
Did you set SievePrimesAdjust to 0 for your tests?
No, but we will now!



Quote:
Originally Posted by Christenson View Post
xyzzy is probably no more immune to typos than the rest of us, and it's easy to confuse a 1 followed by 5 0s with a 1 followed by 6 zeros unless a thousands separator is used.
We tried 1,000,000 which is why we used commas in that post.

2011-04-22, 20:28   #773
Christenson
Dec 2010
Monticello
5×359 Posts

Quote:
Originally Posted by Xyzzy View Post
No, but we will now!



We tried 1,000,000 which is why we used commas in that post.

Hey, judger, there's a principle of good programming that's been missed here: If the program rejects the input for some reason, it should at least say so.
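
As an illustration of that principle, here is a minimal Python sketch of a settings parser that says why it rejects a value. It is not mfaktc's code (mfaktc is written in C/CUDA); the function name `parse_sieve_primes` and the bounds 5000 and 200000 are assumptions made up for the example.

```python
def parse_sieve_primes(text, lower=5000, upper=200000):
    """Parse a SievePrimes-style setting.

    Raises ValueError with an explanation instead of silently ignoring
    bad input. The default bounds are hypothetical, not mfaktc's limits.
    """
    cleaned = text.strip()
    # Reject anything that is not a plain run of ASCII digits,
    # e.g. "1,000,000" with thousands separators.
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError(
            f"SievePrimes={text!r} is not a plain integer (no separators allowed)"
        )
    value = int(cleaned)
    if not lower <= value <= upper:
        raise ValueError(
            f"SievePrimes={value} is outside the allowed range {lower}..{upper}"
        )
    return value


if __name__ == "__main__":
    for candidate in ("25000", "1,000,000", "1000000"):
        try:
            print(candidate, "->", parse_sieve_primes(candidate))
        except ValueError as err:
            print("rejected:", err)
```

With these assumed bounds, "1,000,000" is rejected for its separators and "1000000" for being out of range, and in both cases the user is told which.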

But I need to be careful, or I'm gonna get myself signed up to audit for this problem and maybe rewrite the readme.txt a bit.
2011-04-22, 21:22   #774
TheJudger
"Oliver"
Mar 2005
Germany
2127₈ Posts

Quote:
Originally Posted by Christenson View Post
Hey, judger, there's a principle of good programming that's been missed here: If the program rejects the input for some reason, it should at least say so.
I'm pretty sure that it does already.

Oliver
2011-04-22, 23:04   #775
nucleon
Mar 2003
Melbourne
5·103 Posts

Quote:
Originally Posted by Brain View Post
Running 2 instances only, I can confirm a slight drop of 5 to 10%.
Yep - I was going to suggest that as you were running 4 cores on 460GTX. I have 2x 4.5GHz cores for a 580GTX, so it looks like I need to throw more cpu cores at it.

-- Craig
2011-04-22, 23:17   #776
Xyzzy
"Mike"
Aug 2002
202A₁₆ Posts

We raised "SievePrimes" but the GPU load dropped dramatically. The CPU usage did not change (the core tied to the instance was already at 100%), which makes sense if each instance is tied to a core. We expected the system memory to get used more but it remained stable. 2GB (!) in a box would be very usable with plenty of headroom.

If we ran two instances can we tie two cores to each instance?

At this point we would rather have (if they sold them) a silly fast dual core CPU than a moderately fast (3.3GHz) quad core. (Again, we are not going to overclock.) Turbo mode (3.7GHz) never kicks in for us.

We doubt the GTX 580 would be dramatically ($150) better.

For fun tonight we are going to try playing with "SievePrimes" to see if we can alter each instance/core to drop the core to less than all out.

We still have not decided whether to run 2 or 3 instances. We did, however, turn in 740 GHz-days of work today, which we think represents more than one day's but less than two days' work.
2011-04-23, 00:06   #777
TheJudger
"Oliver"
Mar 2005
Germany
2127₈ Posts

Quote:
Originally Posted by Xyzzy View Post
We expected the system memory to get used more but it remained stable. 2GB (!) in a box would be very usable with plenty of headroom.
Memory usage does not depend on the value of SievePrimes.

Quote:
Originally Posted by Xyzzy View Post
If we ran two instances can we tie two cores to each instance?
No, one core per instance (no multithreading in the CPU part).


Quote:
Originally Posted by Xyzzy View Post
For fun tonight we are going to try playing with "SievePrimes" to see if we can alter each instance/core to drop the core to less than all out.
Usually you won't see a drop in CPU usage because, if the CPU is not busy with sieving, it will busy-wait for the GPU. (The unreleased mfaktc 0.17 can sleep() in this case, which will reduce the CPU usage in GPU-limited setups.)
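
The difference between busy-waiting and sleeping can be sketched in a few lines of Python. This is only an illustration of the two waiting strategies, not mfaktc's implementation: the "GPU" is a timer thread, and the names `wait_busy`, `wait_sleeping` and `simulate` as well as the 50 ms and 1 ms figures are assumptions for the example.

```python
import threading
import time


def wait_busy(done):
    """Spin until the event is set; the waiting core stays fully loaded."""
    polls = 0
    while not done.is_set():
        polls += 1
    return polls


def wait_sleeping(done, interval=0.001):
    """Poll as well, but give up the CPU between checks with sleep()."""
    polls = 0
    while not done.is_set():
        polls += 1
        time.sleep(interval)
    return polls


def simulate(waiter, gpu_time=0.05):
    """Run one waiting strategy against a timer that stands in for the GPU."""
    done = threading.Event()
    timer = threading.Timer(gpu_time, done.set)  # "GPU" finishes after gpu_time
    timer.start()
    polls = waiter(done)
    timer.join()
    return polls


busy_polls = simulate(wait_busy)
sleepy_polls = simulate(wait_sleeping)
print("busy-wait polls:", busy_polls)
print("sleeping polls: ", sleepy_polls)
```

The busy-waiting loop checks the flag far more often in the same wall-clock time, which is why the core shows as fully loaded even when it has no sieving to do.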

Oliver
2011-04-23, 08:31   #778
Ralf Recker
Oct 2010
191 Posts
CUDA toolkit 4.0rc2 compilation works

on my 64 bit Linux box.

Code:
mfaktc v0.16p1

Compiletime options
  THREADS_PER_GRID_MAX      1048576
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  VERBOSE_TIMING            disabled
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  NumStreams                3
  CPUStreams                3
  WorkFile                  worktodo.txt
  Checkpoints               enabled
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 0

CUDA device info
  name                      GeForce GTX 470
  compute capability        2.0
  maximum threads per block 1024
  number of multiprocessors 14 (448 shader cores)
  clock rate                1215MHz

CUDA version info
  binary compiled for CUDA  4.0
  CUDA driver version       4.0
  CUDA runtime version      4.0

Automatic parameters
  threads per grid          917504

running a simple selftest...
Selftest statistics
  number of tests           31
  successfull tests         31

selftest PASSED!
The long self test:

Code:
Selftest statistics
  number of tests           4914
  successfull tests         4914

selftest PASSED!
I didn't measure the performance yet, but the performance of the (modified) 32 bit tpsieve CUDA app didn't change noticeably (compared to the CUDA toolkit 3.2 version).

Last fiddled with by Ralf Recker on 2011-04-23 at 08:36 Reason: Added mfaktc output
2011-04-24, 01:20   #779
Xyzzy
"Mike"
Aug 2002
2×23×179 Posts

For fun, we measured the throughput of our current system. It turns out to be 1 GHz-day every 2 minutes and 40 seconds.
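
That figure converts to a daily rate with simple arithmetic. The short Python sketch below does the conversion; the function name `ghz_days_per_day` is made up for the example, and applying today's rate to the 740 GHz-days reported in the earlier post is an assumption, since the rate may have differed while that work was done.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ghz_days_per_day(seconds_per_ghz_day):
    """Convert 'seconds needed for one GHz-day of work' into GHz-days per day."""
    return SECONDS_PER_DAY / seconds_per_ghz_day


# 2 minutes 40 seconds per GHz-day, as measured above
rate = ghz_days_per_day(2 * 60 + 40)
print(rate)  # 540.0 GHz-days per day

# Assumption: the same rate applied to the 740 GHz-days turned in earlier
days_for_740 = 740 / rate
print(round(days_for_740, 2))  # 1.37 days
```

At that rate 740 GHz-days is about 1.37 days of work, which matches the earlier estimate of more than one day but less than two.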

Question: If we wanted to run all of the instances in one directory, is there a way to specify individual "worktodo.txt" and "results.txt" files per instance?
2011-04-24, 03:15   #780
Xyzzy
"Mike"
Aug 2002
202A₁₆ Posts

We are still stuck on Windows, so we were mucking about and remembered the "start" command and the "affinity" option. (In Linux we never give processor or core affinity much thought.)

Anyways, using (or not using) core affinity, we see two different profiles.

1 - Two cores pegged and at ~65°C. Two cores idle and at ~56°C.
2 - All four cores share the load and all at ~60°C.

We have attached two images from "Task Manager".

Questions:

1 - Is running individual cores better than averaging the cores out? Which two cores should one choose? Is running individual cores hotter an issue?
2 - Without having an instance tied to a core, is a lot of efficiency lost to context switching?
Attached images: 1.png, 2.png
2011-04-24, 04:23   #781
nucleon
Mar 2003
Melbourne
5×103 Posts

Quote:
Originally Posted by Xyzzy View Post

Questions:

1 - Is running individual cores better than averaging the cores out? Which two cores should one choose? Is running individual cores hotter an issue?
2 - Without having an instance tied to a core, is a lot of efficiency lost to context switching?
In the way mfaktc is constructed, my 'gut feeling' is that it's best to tie mfaktc to a single core for greater performance. Doesn't matter which core. Depending on the cpu the L3 cache is probably shared across cores and the L2 cache is core specific. So there would be inter-core overhead when the process is switched.

To preempt another question: how does affinity work in Windows? I had trouble finding suitable help on the topic. By trial and error I found the affinity value to be a hexadecimal bitmask, i.e. if you number your CPU's cores 0, 1, 2, 3, the hex affinity mask is 1 to run on core 0, 2 for core 1, 4 for core 2 and 8 for core 3. The mask can be set to 3 for affinity on both core 0 and core 1, but mfaktc can only take advantage of one core at a time.

If you have an HT CPU with 4 real cores and 4 virtual cores, the masks become 1, 4, 10 and 40 (hex) to run the processes on different real cores. mfaktc needs a real core.
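
The masks above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `affinity_mask` is made up for the example, and the Hyper-Threading case assumes that logical CPUs 2k and 2k+1 share physical core k, which is what the values 1, 4, 10, 40 imply but may not hold on every system.

```python
def affinity_mask(cores):
    """Return the hex affinity mask for the given logical core indices.

    Bit n of the mask selects logical core n.
    """
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "X")


# One instance per core on a plain quad core
single_core_masks = [affinity_mask([c]) for c in range(4)]
print(single_core_masks)  # ['1', '2', '4', '8']

# Core 0 and core 1 together
pair_mask = affinity_mask([0, 1])
print(pair_mask)  # 3

# Assumption: with HT, logical CPUs 2k and 2k+1 belong to physical core k,
# so picking every second logical CPU gives one mask per real core.
ht_masks = [affinity_mask([2 * k]) for k in range(4)]
print(ht_masks)  # ['1', '4', '10', '40']
```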

Also I run mfaktc with low priority so the system and any other processes aren't affected. I also install bash from cygwin to give me a unix-y shell. The bash script for me becomes:

Code:
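#!/bin/bash

# Read this directory's hex affinity mask (e.g. 1, 2, 4 or 8) from limit.affinity.
AFFINITY=$(cat limit.affinity)

# Start mfaktc at low priority, pinned to the core(s) selected by the mask.
cmd.exe /C start /low /affinity "$AFFINITY" ./mfaktc-win-64.exe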
I have a file 'limit.affinity' in each directory with a different affinity mask.

-- Craig

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.