mersenneforum.org > Prime Search Projects > Twin Prime Search
Old 2010-09-07, 21:20   #12
amphoria ("Dave")

Quote:
Originally Posted by Ken_g6
Interesting! Try fiddling with the -m option (probably going up from 8 in increments of 1), and see if you can make a single instance do any better.
I started with m=8 but didn't increment by 1 each time, and you will see why.

The peak throughput was with m=127, where I got 282M p/sec with 0.81 CPU used. With m=128 the throughput fell off a cliff to 32M p/sec and 0.11 CPU used. m=127 is equal to CUDA cores per multiprocessor (32 for a 465) * 4 - 1. Go figure!
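The relation amphoria points out can be re-derived in a couple of lines (Python, just repeating the arithmetic from the post):

```python
# Sanity check of the observed sweet spot: the GTX465 has 32 CUDA
# cores per multiprocessor, and the peak was at m = 32 * 4 - 1.
cores_per_mp = 32
peak_m = cores_per_mp * 4 - 1   # best throughput observed here
cliff_m = peak_m + 1            # throughput collapsed at this value
print(peak_m, cliff_m)          # 127 128
```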
Old 2010-09-07, 21:30   #13
Ken_g6

I'll tell you why that happened: I interpret that parameter differently when it exceeds BLOCKSIZE, because I never thought more than BLOCKSIZE blocks would be needed. Guess what BLOCKSIZE is set to!

So to keep going, start at 16384 (128*128, which is interpreted as total threads, not blocks), and increment by at least 128 at a time! Edit: By the way, that's total threads per multiprocessor.

Edit2: As -m gets bigger, you should make sure your test range is big enough to account for all those tests. My little 20M test probably isn't big enough.

Edit3: Finally, you should be looking at the number printed at the end, not any intermediate progress reports, for the best assessment of speed.
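A minimal Python sketch of the behaviour Ken describes (the real code is CUDA C; `BLOCKSIZE = 128` and the function name here are assumptions for illustration, not the actual source):

```python
BLOCKSIZE = 128  # threads per block; also the cutoff Ken describes

def interpret_m(m):
    """Hypothetical sketch of how -m is read (names invented here).

    At or below BLOCKSIZE, m counts blocks per multiprocessor;
    above it, m counts total threads per multiprocessor.
    """
    if m <= BLOCKSIZE:
        blocks = m
        threads = m * BLOCKSIZE
    else:
        threads = m
        blocks = m // BLOCKSIZE
    return blocks, threads

# m=127  -> 127 blocks of 128 threads = 16256 threads
# m=16384 -> 128 blocks of 128 threads = 16384 threads
# These are nearly identical configurations, which matches the later
# report that m=127 and m=16384 perform the same.
```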

Last fiddled with by Ken_g6 on 2010-09-07 at 21:37
Old 2010-09-08, 17:41   #14
amphoria ("Dave")

I have confirmed that m=16384 gives the same performance as m=127. However, any value of m greater than this gives an "out of range argument" error.
Old 2010-09-08, 18:02   #15
Ken_g6

OK, I've removed the apparently arbitrary range restriction. But I'm not entirely sure it's arbitrary; don't be too surprised if it crashes otherwise.
Old 2010-09-08, 20:16   #16
amphoria ("Dave")

It doesn't appear to crash. Peak throughput on a GTX465 appears to be 329M p/sec using 0.95 CPU (Core i7@3.6GHz). This was achieved with 24,576 threads per multiprocessor or 6 threads per CUDA core. After completing this I ran a 200G range to confirm stability.
Old 2010-09-08, 20:37   #17
Ken_g6

Fascinating. I wonder why it takes that many?
Quote:
Originally Posted by amphoria
or 6 threads per CUDA core.
No, that's 512 threads per CUDA core! 24,576 * 7 multiprocessors, divided by 336 CUDA cores.
Old 2010-09-08, 22:03   #18
amphoria ("Dave")

A GTX465 actually has 11 multiprocessors each containing 32 CUDA cores for 352 CUDA cores in total. I believe you are thinking of the GTX460. That would suggest 768 threads per CUDA core!
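Both calculations check out for their respective cards; the disagreement is only about which card's core count to use:

```python
# Threads-per-core arithmetic for the two cards being conflated.
threads_per_mp = 24576

# Ken's figures (a GTX460): 7 multiprocessors, 336 CUDA cores total
per_core_460 = threads_per_mp * 7 // 336

# amphoria's correction (the GTX465): 11 multiprocessors, 352 cores
per_core_465 = threads_per_mp * 11 // 352

print(per_core_460, per_core_465)  # 512 768
```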
Old 2010-09-11, 22:02   #19
Oddball

Quote:
Originally Posted by amphoria
It doesn't appear to crash. Peak throughput on a GTX465 appears to be 329M p/sec using 0.95 CPU
Could you run benchmarks for the current p=~900T area as well as the p=6000T and p=8000T area?

The CPU I'm using gives me 138M p/sec at 900T with 5.99 CPU. However, this drops to 134M p/sec at 6000T, since only 5.4 CPU is being used. Sieve speed goes down even further at 8000T. At that point, I'm only getting 129M p/sec with only 5.21 CPU used.

I'm curious to see how pronounced this effect is with GPUs.
Old 2010-09-11, 22:22   #20
Ken_g6

When CPU usage drops on the CPU client, the first thing I'd suggest is running a second process and splitting the threads between them. That might work on high-end Fermis as well.

For this particular problem, you could instead try bumping up your blocksize in tpconfig.txt to the size of your L2 cache, and maybe increase your chunksize as well.
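As an illustration only (the key names blocksize and chunksize come from the post, but the syntax and values below are guesses, not tested settings):

```
# tpconfig.txt -- illustrative values only
blocksize=8388608   # e.g. match an 8 MB L2 cache; use your CPU's L2 size
chunksize=2         # try raising this from the default as well
```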
Old 2010-09-11, 23:01   #21
Oddball

Quote:
Originally Posted by Ken_g6
For this particular problem, you could instead try bumping up your blocksize in tpconfig.txt to the size of your L2 cache, and maybe increase your chunksize as well.
Increasing the chunksize slows it down even more. The key was doubling blocksize and halving chunksize. By doing so, sieve speed at 6000T went up from 134M p/sec to 150M p/sec, with 5.97 CPU used (up from 5.40 earlier).

edit: One minor drawback of these settings is a small hit in speed at lower p values. At p=940T, sieve speed goes down from 138.2M p/sec to 135.6M p/sec.

Last fiddled with by Oddball on 2010-09-11 at 23:03
Old 2010-09-12, 10:37   #22
amphoria ("Dave")

Quote:
Originally Posted by Oddball
Could you run benchmarks for the current p=~900T area as well as the p=6000T and p=8000T area?
Benchmarks are as follows using a single CPU core and m=24576:

Code:
  p           M p/sec         CPU
-----         -------         ---
900T            311           0.95
6000T           251           0.96
8000T           240           0.96