 nuggetprime, run 4 apps simultaneously and you'll be testing 4 candidates at the same time on one GPU, what's the problem? I asked about something else. I have 2 GPUs. When I use GeneferCUDA, 2 jobs run, each on its own GPU. But if I change the PRPNet port to one that calls llr, both llrcuda apps run on the 1st GPU. Could anybody help me? I can show inis, logs or screenshots if needed.
2011-03-19, 14:19   #233
msft

Jul 2009
Tokyo

1001100010₂ Posts

Quote:
 Originally Posted by x3mEn Hm... GeneferCUDA really supports GPU affinity, but llrcuda.0.60 doesn't... any idea?
Code:
CPU_AFFINITY = (unsigned int) IniGetInt (INI_FILE, "Affinity", 99);
You need to use the llr.ini file.

2011-03-19, 14:21   #234
msft

Jul 2009
Tokyo

2·5·61 Posts

Quote:
 Originally Posted by nuggetprime This is a question to msft: Is it possible to implement testing multiple candidates at the same time on one GPU? I think this would greatly improve throughput. Just like on a quad-core CPU you get about 3x more throughput if you test 4 candidates on 4 cores than 1 candidate on 4 cores.
Sorry, I cannot test that right now.

2011-03-19, 14:41   #235
x3mEn

Feb 2011

110₂ Posts

Quote:
 Originally Posted by msft Code: CPU_AFFINITY = (unsigned int) IniGetInt (INI_FILE, "Affinity", 99); You need to use the llr.ini file.
msft, you are right, llr.ini helped
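
A minimal sketch of the lookup that line performs, assuming a plain `key=value` llr.ini. `ini_get_int` is a hypothetical stand-in for `IniGetInt`, not the real llr routine, and the idea that llrcuda uses the resulting value to pick the GPU is inferred from this exchange rather than checked against the source. The default of 99 is taken from the quoted call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for IniGetInt: returns the integer value of
   `key` in a "key=value" file, or `def` if the file or key is missing. */
static int ini_get_int(const char *path, const char *key, int def)
{
    FILE *f;
    char line[256];
    size_t klen;
    int value = def;

    assert(path != NULL && key != NULL);
    klen = strlen(key);

    f = fopen(path, "r");
    if (f == NULL)
        return def;            /* no ini file: fall back to the default */

    while (fgets(line, sizeof line, f) != NULL) {
        /* Match "key=" at the very start of the line. */
        if (strncmp(line, key, klen) == 0 && line[klen] == '=') {
            value = atoi(line + klen + 1);
            break;
        }
    }
    fclose(f);
    return value;
}
```

Under those assumptions, each llrcuda instance would run from its own directory with its own llr.ini, one containing `Affinity=0` and the other `Affinity=1`, and `ini_get_int("llr.ini", "Affinity", 99)` would return 0 or 1 accordingly.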

2011-03-19, 14:48   #236
nuggetprime

Mar 2007
Austria

100101110₂ Posts

Quote:
 Originally Posted by x3mEn nuggetprime, run 4 apps simultaneously and you'll be testing 4 candidates at the same time on one GPU, what's the problem? I asked about something else. I have 2 GPUs. When I use GeneferCUDA, 2 jobs run, each on its own GPU. But if I change the PRPNet port to one that calls llr, both llrcuda apps run on the 1st GPU. Could anybody help me? I can show inis, logs or screenshots if needed.
Have you got a GPU where you can test how much slower it is with 4 instances than with 1?
From what I read in the previous posts, speed at the moment is about that of 1-1.5 cores of a cheap quad-core (Athlon II X4 640), at about twice the price and power consumption.
msft, do you think that in the next 1-2 years the code will gain so much speed that a, say, 100 dollar GPU outperforms a 100 dollar CPU for throughput?
Is it useful to invest in a fast GPU (GTX 560 Ti) today or should I wait for something better to show up?

2011-03-19, 15:07   #237
msft

Jul 2009
Tokyo

2×5×61 Posts

Quote:
 Originally Posted by nuggetprime Have you got a GPU where you can test how much slower it is with 4 instances than with 1? From what I read in the previous posts, speed at the moment is about that of 1-1.5 cores of a cheap quad-core (Athlon II X4 640), at about twice the price and power consumption. msft, do you think that in the next 1-2 years the code will gain so much speed that a, say, 100 dollar GPU outperforms a 100 dollar CPU for throughput? Is it useful to invest in a fast GPU (GTX 560 Ti) today or should I wait for something better to show up?
I understand.
If someone gives me FFT source code, I can get a 10% speedup.
Anyway, CUDALucas's speed depends on memory bandwidth.

2011-03-19, 16:24   #238
x3mEn

Feb 2011

2·3 Posts

Quote:
 Originally Posted by nuggetprime Have you got a GPU where you can test how much slower it is with 4 instances than with 1?
The catch is that even if all 4 threads have equal priority (for example 3 [Middle]), the active thread takes the lion's share of GPU resources.
Between 2 jobs, the second is 3 times slower than the active one. So I don't know how to test correctly what you are asking. msft can probably advise something...
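
One way to make the comparison fair despite the uneven sharing might be to ignore per-instance speed and compare total tests completed per second. The sketch below is only an illustration of that arithmetic: the function names are hypothetical and the timings in the usage note are made-up numbers, not measurements from llrcuda.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of per-instance rates (tests per second), given the wall-clock
   seconds each concurrently running instance needs for one test. */
static double aggregate_throughput(const double *secs_per_test, size_t n)
{
    double total = 0.0;
    size_t i;

    assert(secs_per_test != NULL);
    for (i = 0; i < n; i++) {
        assert(secs_per_test[i] > 0.0);
        total += 1.0 / secs_per_test[i];
    }
    return total;
}

/* Ratio of combined throughput to a single instance running alone.
   A value above 1 means running the instances together does more
   work per second than running one instance by itself. */
static double speedup_vs_single(const double *secs_per_test, size_t n,
                                double single_secs)
{
    assert(single_secs > 0.0);
    return aggregate_throughput(secs_per_test, n) * single_secs;
}
```

For example, with hypothetical timings of 100 s for the active job and 300 s for the second one (the 3x gap described above), and 90 s for a job running alone, `speedup_vs_single` gives (1/100 + 1/300) × 90 = 1.2, which would mean a 20% gain in total throughput.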

2011-03-19, 22:29   #239
msft

Jul 2009
Tokyo

2·5·61 Posts

alice in wonderland
In the computer world, the measure is not linear: a 3x quicker machine needs 9x the cost and power. A time machine is very expensive.

2011-03-19, 22:47   #240
em99010pepe

Sep 2004

5416₈ Posts

Quote:
 Originally Posted by nuggetprime Is it useful to invest in a fast GPU (GTX 560 Ti) today or should I wait for something better to show up?
For the moment use it to sieve instead. LLR on CPUs.

2011-03-20, 08:41   #241
Ralf Recker

Oct 2010

277₈ Posts

Has anyone already tried to reduce the number of threads and increase the workload per thread (Better(?) latency hiding/ILP as described by Volkov et al. in various papers and presentations) for example in the transpose functions?

2011-03-20, 09:33   #242
msft

Jul 2009
Tokyo

2×5×61 Posts

Quote:
 Originally Posted by Ralf Recker Has anyone already tried to reduce the number of threads and increase the workload per thread (Better(?) latency hiding/ILP as described by Volkov et al. in various papers and presentations) for example in the transpose functions?
Yes. I am trying to tune it for my GTX 460,
with a target FFT length over 2048k.

