mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

Karl M Johnson 2011-04-22 16:46

From what I've heard, 4.0 is total bullshift, except for inline PTX assembly support.

Xyzzy 2011-04-22 17:05

[QUOTE=TheJudger;259284]Did you set SievePrimesAdjust to 0 for your tests?[/QUOTE]No, but we will now!

:smile:

[QUOTE=Christenson;259289]xyzzy is probably no more immune to typos than the rest of us, and it's easy to confuse a 1 followed by 5 0s with a 1 followed by 6 zeros unless a thousands separator is used.[/QUOTE]We tried 1,000,000 which is why we used commas in that post.

:max:

Christenson 2011-04-22 20:28

[QUOTE=Xyzzy;259317]No, but we will now!

:smile:

We tried 1,000,000 which is why we used commas in that post.

:max:[/QUOTE]

Hey, judger, there's a principle of good programming that's been missed here: If the program rejects the input for some reason, it should at least say so.

But I need to be careful, or I'm gonna get myself signed up to audit for this problem and maybe rewrite the readme.txt a bit.

TheJudger 2011-04-22 21:22

[QUOTE=Christenson;259352]Hey, judger, there's a principle of good programming that's been missed here: If the program rejects the input for some reason, it should at least say so.[/QUOTE]

I'm pretty sure that it does already. :smile:

Oliver

nucleon 2011-04-22 23:04

[QUOTE=Brain;259311]Running 2 instances only, I can confirm a slight drop of 5 to 10%.[/QUOTE]

Yep - I was going to suggest that, as you were running 4 cores on a GTX 460. I have 2x 4.5GHz cores for a GTX 580, so it looks like I need to throw more CPU cores at it.

-- Craig

Xyzzy 2011-04-22 23:17

We raised "SievePrimes" but the GPU load dropped dramatically. The CPU usage did not change (the core tied to the instance was already at 100%), but if each instance is tied to a core then that makes sense. We expected system memory usage to grow, but it remained stable. 2GB (!) in a box would be very usable, with plenty of headroom.

If we ran two instances can we tie two cores to each instance?

At this point we would rather have (if they sold them) a silly fast dual core CPU than a moderately fast (3.3GHz) quad core. (Again, we are not going to overclock.) Turbo mode (3.7GHz) never kicks in for us.

We doubt the GTX 580 would be dramatically ($150) better.

For fun tonight we are going to play with "SievePrimes" to see if we can tune each instance so its core drops below 100%.

We still have not decided whether to run 2 or 3 instances. We did, however, turn in 740 GHz-days of work today, which we think represents more than one day's but less than two days' worth of work.

TheJudger 2011-04-23 00:06

[QUOTE=Xyzzy;259362]We expected the system memory to get used more but it remained stable. 2GB (!) in a box would be very usable with plenty of headroom.[/QUOTE]
Memory usage does not depend on the value of SievePrimes.

[QUOTE=Xyzzy;259362]If we ran two instances can we tie two cores to each instance?[/QUOTE]
No, one core per instance (no multithreading in the CPU part).


[QUOTE=Xyzzy;259362]For fun tonight we are going to try playing with "SievePrimes" to see if we can alter each instance/core to drop the core to less than all out.[/QUOTE]
Usually you won't see a drop in CPU usage because if the CPU is not busy with sieving it will busywait for the GPU. (unreleased mfaktc 0.17 can do sleep() in this case which will reduce the CPU usage in GPU-limited setups)
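The difference between the two modes can be sketched with a toy polling loop (an illustration only, not mfaktc's actual code, which is C/CUDA): a busy-wait spins on the readiness check at 100% CPU, while inserting a short sleep lets the core idle between polls.

```shell
# Toy illustration of busy-wait vs. sleep-polling (not mfaktc's code).
poll_gpu() { [ -f gpu_done.flag ]; }   # stand-in for "is the GPU result ready?"

touch gpu_done.flag                    # pretend the GPU just finished
polls=0
until poll_gpu; do
  polls=$((polls + 1))
  sleep 0.01   # the sleep() variant; a busy-wait would omit this line and spin
done
echo "done after $polls extra polls"
rm -f gpu_done.flag
```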

Oliver

Ralf Recker 2011-04-23 08:31

CUDA toolkit 4.0rc2 compilation works on my 64 bit Linux box.

[CODE]mfaktc v0.16p1

Compiletime options
THREADS_PER_GRID_MAX 1048576
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
VERBOSE_TIMING disabled
MORE_CLASSES enabled

Runtime options
SievePrimes 25000
SievePrimesAdjust 1
NumStreams 3
CPUStreams 3
WorkFile worktodo.txt
Checkpoints enabled
Stages enabled
StopAfterFactor bitlevel
PrintMode 0

CUDA device info
name GeForce GTX 470
compute capability 2.0
maximum threads per block 1024
number of multiprocessors 14 (448 shader cores)
clock rate 1215MHz

CUDA version info
binary compiled for CUDA 4.0
CUDA driver version 4.0
CUDA runtime version 4.0

Automatic parameters
threads per grid 917504

running a simple selftest...
Selftest statistics
number of tests 31
successfull tests 31

selftest PASSED!
[/CODE]The long self test:

[CODE]Selftest statistics
number of tests 4914
successfull tests 4914

selftest PASSED!
[/CODE]I haven't measured the performance yet, but the performance of the (modified) 32 bit tpsieve CUDA app didn't change noticeably compared to the CUDA toolkit 3.2 version.

Xyzzy 2011-04-24 01:20

For fun, we measured the throughput of our current system. It turns out to be 1 GHz-day every 2 minutes and 40 seconds.
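That rate can be sanity-checked against the daily total (a rough check, assuming a constant rate): one GHz-day per 160 seconds works out to 540 GHz-days per wall-clock day, so 740 GHz-days is roughly 1.4 days of accumulated work.

```shell
# Rough throughput check (assumes a constant rate).
SECONDS_PER_GHZDAY=160                    # 2 minutes 40 seconds
PER_DAY=$((86400 / SECONDS_PER_GHZDAY))   # GHz-days per wall-clock day
echo "$PER_DAY GHz-days/day"              # 540
# 740 GHz-days at that rate is 740/540, i.e. about 1.37 days of work.
```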

Question: If we wanted to run all of the instances in one directory, is there a way to specify individual "worktodo.txt" and "results.txt" files per instance?

Xyzzy 2011-04-24 03:15

We are still stuck on Windows, so we were mucking about and remembered the "start" command and the "affinity" option. (In Linux we never give processor or core affinity much thought.)

Anyways, using (or not using) core affinity, we see two different profiles.

1 - Two cores pegged and at ~65°C. Two cores idle and at ~56°C.
2 - All four cores share the load and all at ~60°C.

We have attached two images from "Task Manager".

Questions:

1 - Is running individual cores better than averaging the cores out? Which two cores should one choose? Is running individual cores hotter an issue?
2 - Without having an instance tied to a core, is a lot of efficiency lost to context switching?

nucleon 2011-04-24 04:23

[QUOTE=Xyzzy;259438]

Questions:

1 - Is running individual cores better than averaging the cores out? Which two cores should one choose? Is running individual cores hotter an issue?
2 - Without having an instance tied to a core, is a lot of efficiency lost to context switching?[/QUOTE]

Given the way mfaktc is constructed, my 'gut feeling' is that it's best to tie mfaktc to a single core for greater performance. It doesn't matter which core. Depending on the CPU, the L3 cache is probably shared across cores while the L2 cache is core-specific, so there would be inter-core overhead when the process is switched.

To preempt another question: how does affinity work in Windows? I had trouble finding suitable help on the topic. By trial and error I found the affinity value to be a hex bitmask: if you number your CPU's cores 0,1,2,3, the affinity mask is 1 to run on core 0, 2 for core 1, 4 for core 2, 8 for core 3, and so on. The mask can also be set to 3 to allow both core 0 and core 1, but mfaktc can only take advantage of one core at a time.

If you have a HT CPU with 4 physical cores and 4 virtual cores, the masks become 1, 4, 10, 40 (hex) to place the processes on different physical cores; mfaktc needs a real core.
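Those mask values follow a simple pattern: bit n selects logical CPU n, so a non-HT mask is 1 shifted left by the core number, and with HT (assuming logical CPUs alternate between physical cores, which matches the masks above) the shift doubles. A quick sketch:

```shell
# Affinity masks as powers of two, printed in hex. With HT, logical CPUs
# 0,2,4,6 sit on physical cores 0..3, so the shift amount doubles.
for core in 0 1 2 3; do
  printf 'core %d: non-HT mask %x, HT mask %x\n' \
    "$core" $((1 << core)) $((1 << (2 * core)))
done
```

This reproduces the non-HT masks 1, 2, 4, 8 and the HT masks 1, 4, 10, 40.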

Also, I run mfaktc at low priority so the system and any other processes aren't affected. I also install bash from Cygwin to give me a unix-y shell. My bash script becomes:

[code]
#!/bin/bash

# Read this directory's affinity mask (a hex value, e.g. "1" for core 0)
AFFINITY=$(cat limit.affinity)

cmd.exe /C start /low /affinity "$AFFINITY" ./mfaktc-win-64.exe
[/code]

I have a file 'limit.affinity' in each directory with a different affinity mask.
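Creating those files can be scripted as well; a hypothetical setup for four non-HT instance directories (the directory names are illustrative, not Craig's actual layout):

```shell
# Hypothetical: one directory per instance, each with a limit.affinity
# file holding that instance's hex affinity mask (core i -> mask 1<<i).
for i in 0 1 2 3; do
  mkdir -p "instance$i"
  printf '%x\n' $((1 << i)) > "instance$i/limit.affinity"
done
cat instance2/limit.affinity   # -> 4
```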

-- Craig

