From what I've heard, 4.0 is total bullshift, except for inline PTX assembly support.
[QUOTE=TheJudger;259284]Did you set SievePrimesAdjust to 0 for your tests?[/QUOTE]No, but we will now!
:smile: [QUOTE=Christenson;259289]xyzzy is probably no more immune to typos than the rest of us, and it's easy to confuse a 1 followed by 5 zeros with a 1 followed by 6 zeros unless a thousands separator is used.[/QUOTE]We tried 1,000,000, which is why we used commas in that post. :max:
[QUOTE=Xyzzy;259317]No, but we will now!
:smile: We tried 1,000,000 which is why we used commas in that post. :max:[/QUOTE] Hey, judger, there's a principle of good programming that's been missed here: if the program rejects the input for some reason, it should at least say so. But I need to be careful, or I'm going to get myself signed up to audit this problem and maybe rewrite the readme.txt a bit.
[QUOTE=Christenson;259352]Hey, judger, there's a principle of good programming that's been missed here: If the program rejects the input for some reason, it should at least say so.[/QUOTE]
I'm pretty sure that it does already. :smile:

Oliver
[QUOTE=Brain;259311]Running 2 instances only, I can confirm a slight drop of 5 to 10%.[/QUOTE]
Yep - I was going to suggest that, as you were running 4 cores on a GTX 460. I have 2x 4.5GHz cores for a GTX 580, so it looks like I need to throw more cpu cores at it.

-- Craig
We raised "SievePrimes", but the GPU load dropped dramatically. The CPU usage did not change (the core tied to the instance was already at 100%), but if each instance is tied to a core then that makes sense. We expected the system memory to get used more, but it remained stable. 2GB (!) in a box would be very usable with plenty of headroom.
If we run two instances, can we tie two cores to each instance? At this point we would rather have (if they sold them) a silly fast dual core CPU than a moderately fast (3.3GHz) quad core. (Again, we are not going to overclock.) Turbo mode (3.7GHz) never kicks in for us. We doubt the GTX 580 would be dramatically ($150) better.

For fun tonight we are going to play with "SievePrimes" to see if we can alter each instance/core to drop the core to less than all out. We still have not decided whether to run 2 or 3 instances. We did, however, turn in 740 GHz-days of work today, which we think represents more than a day but less than two days of work.
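For readers wondering what "SievePrimes" actually controls: mfaktc presieves factor candidates on the CPU with that many small primes before handing the survivors to the GPU, so a larger value means more CPU work per batch. mfaktc's real sieve is far more elaborate; the sketch below is only the classic textbook sieve of Eratosthenes, shown to illustrate where those small primes come from.

```c
#include <string.h>

#define SIEVE_MAX 100000  /* arbitrary cap for this illustration */

/* Plain sieve of Eratosthenes: count the primes <= limit.  This is
 * NOT mfaktc's sieve, just the standard textbook algorithm that any
 * "sieve primes" table ultimately rests on. */
static int count_primes_upto(int limit)
{
    static unsigned char composite[SIEVE_MAX + 1];
    memset(composite, 0, sizeof composite);
    if (limit > SIEVE_MAX)
        limit = SIEVE_MAX;

    int count = 0;
    for (int p = 2; p <= limit; p++) {
        if (composite[p])
            continue;                     /* already struck out */
        count++;                          /* p survived: it is prime */
        for (int m = 2 * p; m <= limit; m += p)
            composite[m] = 1;             /* strike out multiples of p */
    }
    return count;
}
```

For scale: there are 25 primes up to 100 and 9592 up to 100,000, so SievePrimes 25000 reaches well past the small primes most factor candidates are divisible by.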
[QUOTE=Xyzzy;259362]We expected the system memory to get used more but it remained stable. 2GB (!) in a box would be very usable with plenty of headroom.[/QUOTE]
Memory usage does not depend on the value of SievePrimes.

[QUOTE=Xyzzy;259362]If we ran two instances can we tie two cores to each instance?[/QUOTE]
No, one core per instance (no multithreading in the CPU part).

[QUOTE=Xyzzy;259362]For fun tonight we are going to try playing with "SievePrimes" to see if we can alter each instance/core to drop the core to less than all out.[/QUOTE]
Usually you won't see a drop in CPU usage, because if the CPU is not busy with sieving it will busy-wait for the GPU. (The unreleased mfaktc 0.17 can sleep() in this case, which will reduce the CPU usage in GPU-limited setups.)

Oliver
CUDA toolkit 4.0rc2 compilation works on my 64 bit Linux box.
[CODE]mfaktc v0.16p1

Compiletime options
  THREADS_PER_GRID_MAX      1048576
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  VERBOSE_TIMING            disabled
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  NumStreams                3
  CPUStreams                3
  WorkFile                  worktodo.txt
  Checkpoints               enabled
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 0

CUDA device info
  name                      GeForce GTX 470
  compute capability        2.0
  maximum threads per block 1024
  number of multiprocessors 14 (448 shader cores)
  clock rate                1215MHz

CUDA version info
  binary compiled for CUDA  4.0
  CUDA driver version       4.0
  CUDA runtime version      4.0

Automatic parameters
  threads per grid          917504

running a simple selftest...
Selftest statistics
  number of tests           31
  successfull tests         31

selftest PASSED![/CODE]The long self test: [CODE]Selftest statistics
  number of tests           4914
  successfull tests         4914

selftest PASSED![/CODE]I didn't measure the performance yet, but the performance of the (modified) 32 bit tpsieve CUDA app didn't change noticeably (compared to the CUDA toolkit 3.2 version).
For fun, we measured the throughput of our current system. It turns out to be 1 GHz day every 2 minutes and 40 seconds.
Question: If we wanted to run all of the instances in one directory, is there a way to specify individual "worktodo.txt" and "results.txt" files per instance?
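The quoted throughput is easy to sanity-check. At 1 GHz-day every 160 seconds, a day's worth of wall time yields 86400 / 160 = 540 GHz-days, so the 740 GHz-days turned in earlier indeed represents about 1.4 days of work, consistent with "more than a day but less than two". A one-line helper (my own arithmetic sketch, not part of mfaktc):

```c
/* GHz-days produced per wall-clock day, given the observed rate of
 * one GHz-day every 'seconds_per_ghzday' seconds. */
static double ghzdays_per_day(double seconds_per_ghzday)
{
    return 86400.0 / seconds_per_ghzday;  /* 86400 seconds in a day */
}
```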
We are still stuck on Windows, so we were mucking about and remembered the "start" command and the "affinity" option. (In Linux we never give processor or core affinity much thought.)
Anyways, using (or not using) core affinity, we see two different profiles:

1 - Two cores pegged and at ~65°C. Two cores idle and at ~56°C.
2 - All four cores share the load, all at ~60°C.

We have attached two images from "Task Manager".

Questions:

1 - Is running individual cores better than averaging the cores out? Which two cores should one choose? Is running individual cores hotter an issue?
2 - Without having an instance tied to a core, is a lot of efficiency lost to context switching?
[QUOTE=Xyzzy;259438]
Questions: 1 - Is running individual cores better than averaging the cores out? Which two cores should one choose? Is running individual cores hotter an issue? 2 - Without having an instance tied to a core, is a lot of efficiency lost to context switching?[/QUOTE] The way mfaktc is constructed, my gut feeling is that it's best to tie mfaktc to a single core for greater performance. It doesn't matter which core. Depending on the CPU, the L3 cache is probably shared across cores while the L2 cache is core-specific, so there would be inter-core overhead when the process is switched between cores.

To preempt another question: how does affinity work in Windows? I had trouble finding suitable help on the topic. By trial and error I found the affinity value to be a hexadecimal bitmask. That is, if you take your CPU as cores 0, 1, 2, 3, the affinity mask is 1 to run on core 0, 2 to run on core 1, then 4, 8, etc. The mask can also be set to 3 for affinity on both core 0 and core 1, but mfaktc can only take advantage of one core at a time. If you have a HT CPU with 4 real cores and 4 virtual cores, the masks become 1, 4, 10, 40 (hex) to run the processes on different real cores; mfaktc needs a real core.

Also, I run mfaktc at low priority so the system and any other processes aren't affected, and I install bash from cygwin to give me a unix-y shell. The bash script for me becomes: [code]#!/bin/bash
AFFINITY=`cat limit.affinity`
cmd.exe /C start /low /affinity $AFFINITY ./mfaktc-win-64.exe
[/code] I have a file 'limit.affinity' in each directory with a different affinity mask.

-- Craig
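Craig's bitmask rule can be written down explicitly. The sketch below (my own illustration; it assumes, as his 1/4/10/40 example does, that logical CPUs 2n and 2n+1 share physical core n, which depends on how Windows/BIOS enumerates them) computes the mask for a given logical CPU and for a given real core on a hyper-threaded part:

```c
/* Affinity bitmask: bit n set means the process may run on logical
 * CPU n, so a single-CPU mask is just 1 shifted left by the CPU index. */
static unsigned mask_for_cpu(int logical_cpu)
{
    return 1u << logical_cpu;
}

/* On a HT CPU where logical CPUs 2n and 2n+1 share physical core n
 * (an assumption -- enumeration order can differ), picking only the
 * even logical CPUs keeps one mfaktc instance per real core. */
static unsigned mask_for_real_core_ht(int core)
{
    return 1u << (2 * core);
}
```

This reproduces Craig's values: cores 0-3 on a non-HT part give masks 1, 2, 4, 8, and real cores 0-3 on a 4-core/8-thread part give 0x1, 0x4, 0x10, 0x40.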