mersenneforum.org  

Old 2010-01-13, 13:47   #67
msft
 
Jul 2009
Tokyo

2·5·61 Posts
Default

Hi,

On my GTX260:

a) compiles: OK
b) runs correctly: correct
c) is faster or slower: slower
Old 2010-01-13, 14:04   #68
BigBrother
 
Feb 2005
The Netherlands

2·109 Posts
Default

Quote:
Originally Posted by axn View Post
Presumably, you've figured out what to do already, if not check this out: http://msdn.microsoft.com/en-us/library/hd0hzyf8.aspx
Yes, I replaced the two sieve_clear_bit() calls with the _bittestandreset intrinsic, but it didn't improve performance. Actually, the program was slightly slower.
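
For readers who want to see what such a replacement looks like, here is a minimal sketch. It assumes the sieve is an array of 32-bit words with bit i stored in word i/32; the body of sieve_clear_bit() below is a hypothetical stand-in, not the actual mfaktc source. On MSVC it uses the _bittestandreset intrinsic from <intrin.h>; elsewhere it falls back to a plain mask.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>   /* _bittestandreset */
#endif

/*
 * Hypothetical stand-in for sieve_clear_bit(): clears bit number `bit`
 * in a sieve stored as an array of 32-bit words.
 */
static void sieve_clear_bit(uint32_t *sieve, uint32_t bit)
{
#if defined(_MSC_VER)
    /* MSVC only: the intrinsic addresses the bit relative to the base
       pointer, so on x86 it reaches the same bit as the mask version.
       Assumes bit < 2^31. */
    _bittestandreset((long *)sieve, (long)bit);
#else
    /* Portable version: select the word, then mask out one bit. */
    sieve[bit >> 5] &= ~(UINT32_C(1) << (bit & 31));
#endif
}
```

Both paths do the same work per call, so it is not surprising if a compiler already generates comparable code for the mask version.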

Last fiddled with by BigBrother on 2010-01-13 at 14:12
Old 2010-01-13, 15:25   #69
axn
 
Jun 2003

23·683 Posts
Default

Bummer
Old 2010-01-13, 15:33   #70
BigBrother
 
Feb 2005
The Netherlands

DA₁₆ Posts
Default The infamous ptx hack on Windows

Using a method described on http://code.cheesydesign.com/ I managed to uncover the commands issued by nvcc without using the defective --dryrun command line option.

The program now processes 4.2M candidates/second on my 9600M GS, which is an increase of ~33%
Old 2010-01-13, 17:59   #71
ldesnogu
 
Jan 2008
France

597₁₀ Posts
Default

Quote:
Originally Posted by TheJudger View Post
I'm not 100% sure if I got the point of your post.
The smallest primes (3, 5, 7, 11) are removed from the sieve totally.
I quickly read your code, and it seems that it's not using a dense representation for the sieve array. You don't need to have bits for even numbers for instance.

If I misread your code, sorry.
Old 2010-01-14, 11:15   #72
TheJudger
 
"Oliver"
Mar 2005
Germany

5·223 Posts
Default

ldesnogu: there are no even numbers.

The sieve represents the k-values of the factor candidates. The factor candidates are 2*k*p+1, so they are always odd.
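
A minimal sketch of that idea, for illustration only: the sieve is indexed by k, and each entry says whether the candidate 2*k*p+1 survives trial division by a few small primes. The names candidate() and sieve_k_range() are hypothetical, and the loop is a naive trial division, not the actual mfaktc siever.

```c
#include <stddef.h>
#include <stdint.h>

/* Factor candidate of M_p for a given k: f = 2*k*p + 1, odd for every k. */
static uint64_t candidate(uint64_t k, uint64_t p)
{
    return 2 * k * p + 1;
}

/*
 * Hypothetical helper: flags[i] = 1 if the candidate for k = k0 + i has no
 * divisor among small_primes[], else 0. Assumes every candidate is larger
 * than the largest small prime and that 2*k*p + 1 fits in 64 bits.
 */
static void sieve_k_range(uint8_t *flags, uint64_t k0, size_t n, uint64_t p,
                          const uint32_t *small_primes, size_t nprimes)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t f = candidate(k0 + i, p);
        flags[i] = 1;
        for (size_t j = 0; j < nprimes; j++) {
            if (f % small_primes[j] == 0) {
                flags[i] = 0;   /* candidate is composite, drop this k */
                break;
            }
        }
    }
}
```

For p = 11 and k = 1..6 with the small primes 3, 5, 7, the surviving k-values are 1, 3 and 4 (candidates 23, 67, 89); M11 = 2047 = 23 × 89, so both real factors are among them.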
Old 2010-01-14, 11:26   #73
TheJudger
 
"Oliver"
Mar 2005
Germany

5×223 Posts
Default

Hi!

axn: the asm code works but is slower (as msft already reported) :(

I don't remember the specific settings for this run, but here are the numbers.
My C code as inline function: ~33M/s
My C code as macro: ~32M/s
Your asm code as inline function: ~28M/s
Your asm code as macro: ~30M/s
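
To make the "inline function" and "macro" variants concrete, here is a minimal sketch of one bit operation written both ways. The names SIEVE_TEST_BIT_MACRO and sieve_test_bit_inline are hypothetical, and the 32-bit word layout is an assumption; this is not the code that was benchmarked.

```c
#include <stdint.h>

/* Macro form: pasted textually at each call site; note that `bit` is
   evaluated twice, so arguments with side effects must be avoided. */
#define SIEVE_TEST_BIT_MACRO(sieve, bit) \
    (((sieve)[(bit) >> 5] >> ((bit) & 31)) & 1u)

/* Inline function form: type-checked, arguments evaluated once; the
   compiler decides whether to inline it. */
static inline uint32_t sieve_test_bit_inline(const uint32_t *sieve, uint32_t bit)
{
    return (sieve[bit >> 5] >> (bit & 31)) & 1u;
}
```

Which form is faster depends on the compiler and its settings, so it has to be measured for each build.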
-----

Anyway I've improved the siever a little bit. :)

-----
New Benchmarks on my system (openSUSE 11.1 x86-64, CUDA 2.3, GTX 275, 4GHz C2D)

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 15000

M66362159 from 2^ 1 to 2^64: 142.4s
M66362159 from 2^64 to 2^65: 142.2s
M66362159 from 2^65 to 2^66: 282.4s
M66362159 from 2^66 to 2^67: 559.4s
---
Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 15000
MORE_CLASSES

M66362159 from 2^65 to 2^66: 284.1s
M66362159 from 2^66 to 2^67: 545.5s
M66362159 from 2^67 to 2^68: 1068.6s
---
Two Processes at the same time
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 150000 (10 fold increase compared to Single Process)

M66362159 from 2^64 to 2^65: 206.3s
M66362159 from 2^65 to 2^66: 401.0s
M66362159 from 2^66 to 2^67: 794.0s
-----
Find attached the new version. :)
version 0.02 (2010-01-13)
- fixed some printf's
- allocate and free arrays only ONCE (was per class before)
- added check of return values of most *alloc()
- siever: improved the loop which creates the candidate list
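
As an illustration of the two allocation items above, here is a minimal sketch: one buffer is allocated before the loop over classes, its return value is checked, and it is freed once at the end. checked_malloc() and process_all_classes() are hypothetical names, and the loop body is placeholder work, not the real siever.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: malloc() with a checked return value. */
static void *checked_malloc(size_t bytes, const char *what)
{
    void *p = malloc(bytes);
    if (p == NULL) {
        fprintf(stderr, "ERROR: malloc of %zu bytes for %s failed\n", bytes, what);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Allocate the candidate table once, reuse it for every class, free it once.
   Returns the number of classes processed. */
static unsigned process_all_classes(unsigned num_classes, size_t ktab_entries)
{
    unsigned long long *ktab = checked_malloc(ktab_entries * sizeof *ktab, "ktab");
    unsigned done = 0;

    for (unsigned cls = 0; cls < num_classes; cls++) {
        /* Placeholder work: refill the same buffer for this class. */
        for (size_t i = 0; i < ktab_entries; i++)
            ktab[i] = cls + i * num_classes;
        done++;
    }

    free(ktab);
    return done;
}
```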


Oliver
Attached Files
File Type: gz mfaktc-0.02.tar.gz (24.8 KB, 296 views)
Old 2010-01-19, 15:42   #74
moebius
 
Jul 2009
Germany

2×353 Posts
Default

Quote:
Originally Posted by TheJudger View Post
Hi henryzz,

I forgot to mention: I have run it only on Linux right now (openSUSE 11.1 x86-64).
If you still want a binary I can create one on my system. Let me know which CPU you have, I'll make some settings then.
You need to install the CUDA software as well.
I'm interested in an exe for 32-bit Windows XP SP3, CUDA 2.3, Athlon 64 (single core), GeForce 8600GT (256 MB). It would be nice if that were possible, because I haven't installed the MSVC compiler on this machine.
Old 2010-01-19, 17:19   #75
BigBrother
 
Feb 2005
The Netherlands

2×109 Posts
Default Windows executables

Here are the executables I compiled. They work on my 32-bit Vista system, so I guess they also work on 32-bit XP. Both versions are included: the original and the one with the ptx hack.
Attached Files
File Type: zip mfaktc.zip (115.3 KB, 309 views)
Old 2010-01-19, 18:05   #76
moebius
 
Jul 2009
Germany

1302₈ Posts
Smile

Very nice, I'll try it out later, because msieve_gpu is currently running on the NVIDIA card.

Thanks.
Old 2010-01-20, 08:56   #77
TheJudger
 
"Oliver"
Mar 2005
Germany

1115₁₀ Posts
Default

Thank you, BigBrother.
Actually, I have no CUDA-capable compiler environment installed under Windows. Is MSVC the first choice for CUDA on Windows? Is it available for free for non-commercial use?
-----
I've improved the sieve code again. I've unrolled the loop which creates the ktab and now use a precalculated table to check 8 bits at once. This minimized the effect of "MORE_CLASSES" (it becomes beneficial at 2^69 to 2^70 for M66362159).
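
One way such a table lookup can work is sketched below. It assumes one sieve bit per k-value, packed 8 to a byte with the lowest bit first; byte_entry, init_table() and build_ktab() are hypothetical names, and this is an illustration of the technique, not the unreleased code.

```c
#include <stddef.h>
#include <stdint.h>

/* For each possible byte value: how many bits are set, and where. */
typedef struct {
    uint8_t count;
    uint8_t pos[8];
} byte_entry;

static byte_entry table[256];

/* Precalculate the table once at startup. */
static void init_table(void)
{
    for (int b = 0; b < 256; b++) {
        uint8_t n = 0;
        for (int bit = 0; bit < 8; bit++)
            if (b & (1 << bit))
                table[b].pos[n++] = (uint8_t)bit;
        table[b].count = n;
    }
}

/* Turn the surviving sieve bits into a list of k-values, handling one
   byte (8 sieve bits) per table lookup. Returns the number of entries
   written; ktab must have room for 8 * nbytes entries. */
static size_t build_ktab(const uint8_t *sieve, size_t nbytes,
                         uint64_t k_base, uint64_t *ktab)
{
    size_t n = 0;
    for (size_t i = 0; i < nbytes; i++) {
        const byte_entry *e = &table[sieve[i]];
        for (uint8_t j = 0; j < e->count; j++)
            ktab[n++] = k_base + 8 * (uint64_t)i + e->pos[j];
    }
    return n;
}
```

With sieve bytes 0x05 and 0x80 and k_base = 100, the resulting list is 100, 102, 115. The inner loop runs only over set bits, so bytes with few survivors cost less than testing all 8 bits one by one.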

The GPU-code is untouched.

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 45000 (with this value my CPU can keep the GPU busy with one core)

M66362159 from 2^ 1 to 2^64: 113.4s
M66362159 from 2^64 to 2^65: 113.4s
M66362159 from 2^65 to 2^66: 223.6s
M66362159 from 2^66 to 2^67: 442.3s
M66362159 from 2^67 to 2^68: 879.7s

Sorry, no new code release yet. You have to wait, I want to try some things before. ;)

Oliver

Last fiddled with by TheJudger on 2010-01-20 at 08:58
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.