mersenneforum.org  

Old 2010-01-24, 18:21   #111
TheJudger
 
"Oliver"
Mar 2005
Germany

Quote:
Originally Posted by kjaget View Post
Thanks - I get it now. You're balancing how much you can sieve out on the CPU side while still keeping the GPU busy. So you have to look at both the rate and the time per class.
Right, that's it!
Actually, you only need to look at the time per class (or even better, the time for the whole run).
The current code (0.03) already contains an indicator that shows whether you can do more sieving (it displays the time spent waiting for the GPU). It only becomes visible when you activate "VERBOSE_TIMING" in params.h; enabling it prints a lot more timing information. If you enable it, I recommend redirecting the output to a file (Linux: ./mfaktc <exponent> <bit_min> <bit_max> &> mfaktc.out) and looking at mfaktc.out after the run. Another way: run mfaktc on a remote host via ssh (yeah, again Linux ;)); this is what I usually do.
In 0.04 I will move this indicator to a separate parameter and enable it by default.
Once I've converted SIEVE_PRIMES from a compile-time option to a runtime option, I could use this indicator to increase or decrease it automatically during a run.
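
To make the "time waited for the GPU" indicator concrete, here is a minimal host-side sketch in C of how such a number could be collected: take a timestamp before and after the blocking call and keep a running average. This is not mfaktc's actual code; `timespec_diff_usec`, `wait_stats`, `wait_stats_add` and `wait_stats_avg_usec` are hypothetical names invented for illustration.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper, not part of mfaktc: whole microseconds from a to b. */
int64_t timespec_diff_usec(struct timespec a, struct timespec b)
{
    int64_t ns = (int64_t)(b.tv_sec - a.tv_sec) * 1000000000
               + (int64_t)(b.tv_nsec - a.tv_nsec);
    return ns / 1000;
}

/* Running total of the time the CPU side spent waiting. */
typedef struct {
    int64_t total_usec;  /* sum of all measured waits */
    int64_t samples;     /* number of waits measured  */
} wait_stats;

/* Record one wait, given timestamps taken before and after it. */
void wait_stats_add(wait_stats *s, struct timespec before,
                    struct timespec after)
{
    s->total_usec += timespec_diff_usec(before, after);
    s->samples++;
}

/* Average wait in microseconds (0 if nothing was recorded yet). */
int64_t wait_stats_avg_usec(const wait_stats *s)
{
    return s->samples ? s->total_usec / s->samples : 0;
}
```

In real code the two timestamps would come from something like clock_gettime(CLOCK_MONOTONIC, ...) called immediately before and after the call that blocks on the GPU.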

Oliver
Old 2010-01-24, 20:15   #112
henryzz
Just call me Henry
 
"David"
Sep 2007
Liverpool (GMT/BST)

Quote:
Originally Posted by TheJudger View Post
Once I've converted SIEVE_PRIMES from a compile-time option to a runtime option, I could use this indicator to increase or decrease it automatically during a run.
If you do this, would it be possible to let users make it a bit CPU-bound, for those who want to be able to use their PC while it is running? Currently the screen updates too slowly.
Old 2010-01-25, 00:00   #113
ET_
Banned
 
"Luigi"
Aug 2002
Team Italia

Quote:
Originally Posted by TheJudger View Post
Hi Luigi,

you're talking about Linux, right?

Go here: http://www.nvidia.com/object/cuda_get.html
You'll need the CUDA driver (display driver) and the CUDA toolkit. The CUDA SDK is optional (some code examples, verify your installation, ...)

I recommend a 64bit Linux which is supported by CUDA, I'm using openSUSE 11.1 (11.2 isn't supported by CUDA right now and has some problems with it (gcc 4.4.x))

- install the driver
- install the toolkit
- setup some environment variables (as noted by the toolkit)

Documentation is available as well on the NVIDIA website.

The executables just need some shared libraries (e.g. cudart); on a properly configured system it doesn't matter which directory they are in.

Automatic parameter selection is at least on my wishlist but not a major objective at the moment. Actually the most interesting parameter is SIEVE_PRIMES. It depends on the exponent as well.
M66xxxxxx => my GPU can do ~54M/sec
M3312xxxxxx => my GPU can do ~41M/sec
The siever does NOT depend on the exponent size! So as the exponents get bigger it removes the pressure from the siever and more sieving is possible (increase SIEVE_PRIMES).


Oliver
Thank you, Oliver. Yes, I'm on Ubuntu 9.10, 64-bit.
I will try the download next Monday!

SIEVE_PRIMES is an argument I struggled with while developing factor5, I know it's a PITA...

I will let you know about the installation.

Luigi

Last fiddled with by ET_ on 2010-01-25 at 00:01
Old 2010-01-25, 06:00   #114
only_human
 
"Gang aft agley"
Sep 2002

Quote:
Originally Posted by moebius View Post
E:\mfactc>mfaktc-hack 65255629 68 69 0
mfaktc v0.02 Copyright (C) 2009, 2010 Oliver Weihe
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
SIEVE_PRIMES 50000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled
sieve_init(): sieving factor candidates with small primes up to 611957
tf(65255629, 68, 69);
k_min = 2261474677980

no factor for M65255629 from 2^68 to 2^69
tf(): total time spent: 12785031msec

too bad that I can't upload the results to primenet
Quote:
Originally Posted by kjaget View Post
I'm running with the .03 code ported to windows (thanks for the hints provided in the thread). It ran through the self-test and also found a few of the factors I've found recently using P95. Given that amount of testing, and that it's a few percent faster than the previous binary posted, I figured I'd put it up here.

Attachment 4658

It's doing ~27M/sec on an 8800GT fed by 1 thread of a 3.3GHz C2D 4400. I'm running P95 on the other core, so that's probably slowing me a bit. I did try various options in params.h and nothing sped it up (lots of options did slow it down, though). So I guess that 1 core is enough to keep my card saturated. Time to speed up the GPU code again :)

I've added code to have this run at idle priority so it shouldn't kill the responsiveness of the machine it's loading up. Otherwise it should be the same as the source code posted.
I decided to compare my machine on the same task as moebius, using the win32 executable provided by kjaget.

I am deliberately using a low-end passively cooled board by Sparkle:
Quote:
S15-8406 Sparkle GeForce 8400 GS Video Card - 256MB DDR2
Code:
Processor:		Intel(R) Celeron(R) CPU          420  @ 1.60GHz (1601 MHz)
Operating System:	Windows 7 Ultimate, 32-bit
DirectX version:	11.0 
GPU processor:		GeForce 8400 GS
Driver version:		196.21
CUDA Cores:		8 
Memory interface:	64-bit 
Total available graphics memory:	1023 MB
Dedicated video memory:	256 MB
System video memory:	0 MB
Shared system memory:	767 MB
Video BIOS version:	62.98.29.00.00
IRQ:			16
Bus:			PCI Express x16
My results so far are:
Code:
Compiletime Options
  THREADS_PER_GRID    1048576
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  SIEVE_PRIMES        50000
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled
sieve_init(): sieving factor candidates with small primes up to 611957
tf(65255629, 68, 69);
 k_min = 2261474677980
 k_max = 4522949356282
class    0: tested 991952896 candidates in 617380ms (1606713/sec)
class   11: tested 991952896 candidates in 603623ms (1643331/sec)
class   12: tested 991952896 candidates in 605835ms (1637331/sec)
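
The k_min, k_max and rate figures in this output can be checked by hand. Any factor q of M_p has the form q = 2kp + 1, so the range 2^68 to 2^69 corresponds to k from about 2^68/(2p) to 2^69/(2p). The sketch below, in plain C, reproduces the numbers; the function names are made up for this example. The rounding of k_min down to a multiple of 420 is an inference from the printed value (2261474677980 = 420 * 5384463519), not something confirmed from the mfaktc source.

```c
#include <stdint.h>

/* floor(2^bits / p) using only 64-bit arithmetic: build the quotient one
   binary digit at a time (long division of a 1 followed by `bits` zero
   bits). Requires 1 < p < 2^63 and a quotient that fits in 64 bits. */
uint64_t floor_pow2_div(unsigned bits, uint64_t p)
{
    uint64_t q = 0, r = 1;  /* invariant: 2^i == q*p + r with 0 <= r < p */
    for (unsigned i = 0; i < bits; i++) {
        q <<= 1;
        r <<= 1;
        if (r >= p) {
            r -= p;
            q |= 1;
        }
    }
    return q;
}

/* k at which q = 2kp + 1 reaches 2^bit:
   floor(2^bit / (2p)) == floor(2^(bit-1) / p). Requires bit >= 1. */
uint64_t k_bound(unsigned bit, uint64_t p)
{
    return floor_pow2_div(bit - 1, p);
}

/* Candidates per second from a candidate count and a time in ms. */
uint64_t rate_per_sec(uint64_t candidates, uint64_t ms)
{
    return candidates * 1000 / ms;
}
```

With p = 65255629, k_bound(69, p) gives 4522949356282, matching k_max above, and k_bound(68, p) gives 2261474678141, which rounds down to the printed k_min of 2261474677980 when truncated to a multiple of 420. rate_per_sec(991952896, 617380) gives 1606713, the figure shown for class 0.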
I followed steps 1-3 of CUDA_SDK_Release_Notes.txt and a single note item from step 4:
Quote:
1. Uninstall any previous versions of the NVIDIA CUDA Toolkit and NVIDIA GPU Computing SDK.
You can uninstall the NVIDIA CUDA Toolkit through the Start menu:
Start menu->All Programs->NVIDIA Corporation->CUDA Toolkit->Uninstall CUDA

You can uninstall the NVIDIA GPU Computing SDK through the Start menu:
Start menu->All Programs->NVIDIA Corporation
->NVIDIA GPU Computing SDK->Uninstall NVIDIA GPU Computing SDK

2. Install version 3.0 Beta 1 of the NVIDIA CUDA Toolkit by running
cudatoolkit_3.0-beta1_Win_[32|64].exe corresponding to your operating
system.

3. Install version 3.0 Beta 1 of the NVIDIA GPU Computing SDK by running
gpucomputingsdk_3.0-beta1_Win_[32|64].exe corresponding to your operating
system.

4 Notes --
- Most samples link to a utility library called "cutil" whose source code
is in "NVIDIA GPU Computing SDK\C\common". The release and emurelease versions of
these samples link to cutil[32|64].lib and dynamically load
cutil[32|64].dll. The debug and emudebug versions of these samples link
to cutil[32D|64D].lib and dynamically load cutil[32D|64D].dll.
To build the 32-bit and/or 64-bit, release and/or debug configurations
of the cutil library, use the solution files located in
"NVIDIA GPU Computing SDK\C\common". The output of the compilation goes to
"NVIDIA GPU Computing SDK\C\common\lib":
- cutil[32|64].lib and cutil[32D|64D].lib are the release and debug
import libraries,
- cutil[32|64].dll and cutil[32D|64D].dll are the release and debug
dynamic-link libraries, which get also copied to
"NVIDIA GPU Computing SDK\C\bin\win[32|64]\[release|emurelease]" and
"NVIDIA GPU Computing SDK\C\bin\win[32|64]\[debug|emudebug]"
respectively;
Initially, the mfaktc-hack executable couldn't find cudart.dll. I copied C:\CUDA\bin\cudart32_30_8.dll to C:\CUDA\bin\cudart.dll.

Last fiddled with by only_human on 2010-01-25 at 06:17 Reason: added last paragraphs about CUDA install and DLL
Old 2010-01-27, 08:38   #115
TheJudger
 
"Oliver"
Mar 2005
Germany

Hi,

don't waste GPU cycles with mfaktc 0.01 to 0.03; it is known to return wrong results in some cases.
The problem is that in some cases the long division overestimates the part of the dividend.
The chance of an error is something like 1 in 10000 (wrong residue for a factor candidate).

Quick fix (not very well checked):
in mfaktc.cu, in mfakt(), replace
Code:
ff=__int_as_float(0x3f7ffffd) / ff;
with
Code:
ff=__int_as_float(0x3f7ffffc) / ff;
Anyway this is fixed in the next version (which is a little bit faster, too :)).
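
To see how small this change is: __int_as_float() reinterprets a 32-bit pattern as an IEEE 754 single-precision float, and both constants are values just below 1.0. The host-side C sketch below decodes them with memcpy, standing in for the CUDA intrinsic; `bits_to_float` and `ulps_below_one` are names made up for this example. The quick fix lowers the numerator by one more unit in the last place, which presumably makes the computed reciprocal slightly smaller so that the division step no longer overestimates.

```c
#include <stdint.h>
#include <string.h>

/* Host-side stand-in for CUDA's __int_as_float(): reinterpret the bits. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* How many float steps (each 2^-24) the pattern lies below 1.0f.
   Valid for patterns in [0x3f000000, 0x3f800000], i.e. values in [0.5, 1]. */
uint32_t ulps_below_one(uint32_t bits)
{
    return 0x3f800000u - bits;
}
```

Here 0x3f7ffffd is three steps below 1.0 (1 - 3 * 2^-24, about 0.99999982) and 0x3f7ffffc is four steps below (1 - 2^-22, about 0.99999976).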

Oliver
Old 2010-01-29, 08:52   #116
TheJudger
 
"Oliver"
Mar 2005
Germany

Hi David,

Quote:
Originally Posted by henryzz View Post
If you do this, would it be possible to let users make it a bit CPU-bound, for those who want to be able to use their PC while it is running? Currently the screen updates too slowly.
Could you try it again with a lower THREADS_PER_GRID value?
E.g. replace (1<<20) with (1<<18) to reduce the CUDA kernel size by a factor of 4. Of course this might carry a small performance penalty.

If this doesn't help (enough), you could try adding e.g. usleep(10000); right after the first occurrence of cudaStreamSynchronize() in mfaktc.cu.
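
A rough way to reason about both suggestions: while a CUDA kernel is running the display may have to wait, so shorter kernels and short pauses between them give the desktop more chances to redraw. The C sketch below estimates the time per kernel launch and the share of GPU time given up by a fixed sleep. It assumes each GPU thread tests one factor candidate per launch, which is a reading of THREADS_PER_GRID that is not confirmed in this thread; the function names are illustrative only.

```c
#include <stdint.h>

/* Estimated duration of one kernel launch in microseconds, assuming
   (hypothetically) one candidate per thread per launch. */
uint64_t kernel_usec(uint64_t threads_per_grid, uint64_t candidates_per_sec)
{
    return threads_per_grid * 1000000 / candidates_per_sec;
}

/* Percentage of wall time the GPU sits idle if the host sleeps
   sleep_usec after every kernel of length kernel_len_usec. */
unsigned idle_percent(uint64_t kernel_len_usec, uint64_t sleep_usec)
{
    return (unsigned)(100 * sleep_usec / (kernel_len_usec + sleep_usec));
}
```

Under that assumption, at the ~27M/sec reported earlier for an 8800GT, a launch with (1<<20) threads lasts about 38.8 ms and one with (1<<18) about 9.7 ms. Adding usleep(10000) would then idle the GPU roughly 20% of the time in the first case and 50% in the second, so the sleep is the more expensive of the two remedies.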

Oliver
Old 2010-01-29, 09:07   #117
TheJudger
 
"Oliver"
Mar 2005
Germany

Hi,

here is mfaktc 0.04. :)
Just another speed enhancement (GPU code is ~10% faster). :)
Some cleanups in the code as well.

Hint for optimizing SIEVE_PRIMES:
If you have enabled SHOW_WAIT in params.h, it will show you the average time the CPU code has waited for the CUDA kernel.
As long as SIEVE_PRIMES is <= 100000, just try to keep the average wait time at 50-500 usecs. If you have a "fast" CPU and a "slow" GPU you might increase SIEVE_PRIMES above 100000, but you have to watch the total runtime, too.
This becomes a bit more complicated when you run two or more processes at once on one GPU. In this case take a look at the total runtimes. ;)
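
This rule of thumb could eventually drive automatic tuning. Below is a minimal sketch in C, assuming a runtime-adjustable SIEVE_PRIMES (which 0.04 does not have), a 10% step size and a lower limit that are all invented for illustration. Only the 50-500 usec target window and the 100000 ceiling come from the hint above.

```c
#include <stdint.h>

#define WAIT_LOW_USEC   50u   /* target window from the hint above */
#define WAIT_HIGH_USEC 500u

/* Hypothetical limits, chosen only for this example. */
#define SP_MIN   5000u
#define SP_MAX 100000u

/* Return the next SIEVE_PRIMES value given the average time the CPU
   code waited for the CUDA kernel (in microseconds). */
uint32_t adjust_sieve_primes(uint32_t sieve_primes, uint32_t avg_wait_usec)
{
    uint32_t step = sieve_primes / 10;  /* hypothetical 10% step */

    if (avg_wait_usec > WAIT_HIGH_USEC) {
        /* CPU idles waiting for the GPU: spend that time sieving deeper */
        sieve_primes += step;
    } else if (avg_wait_usec < WAIT_LOW_USEC) {
        /* CPU can barely keep up: sieve less so the GPU is not starved */
        sieve_primes -= step;
    }

    if (sieve_primes < SP_MIN) sieve_primes = SP_MIN;
    if (sieve_primes > SP_MAX) sieve_primes = SP_MAX;
    return sieve_primes;
}
```

For example, starting from 40000, an average wait of 1000 usec would raise the value to 44000, a wait of 10 usec would lower it to 36000, and anything inside the window leaves it unchanged. As noted above, total runtime remains the final judge, especially with several processes sharing one GPU.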

-----
Benchmark on my system:

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 40000

M66362159 from 2^ 1 to 2^64: 112.5s
M66362159 from 2^64 to 2^65: 103.7s
M66362159 from 2^65 to 2^66: 203.3s
M66362159 from 2^66 to 2^67: 401.2s
M66362159 from 2^67 to 2^68: 799.3s

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 75000
M3321932839 from 2^50 to 2^71: 310.2s

-----
There is a performance penalty if the bit_min is small. A higher bit_min enables more precalculations. This is the reason why M66362159 from 2^1 to 2^64 takes longer than from 2^64 to 2^65.

If bit_min is small enough it will report 1 as a factor.
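
Apart from the small-bit_min penalty, the benchmark shows the run time roughly doubling with each additional bit level. That is expected: the candidates are q = 2kp + 1, and the number of k values between 2^b and 2^(b+1) doubles with every step of b. Below is a small C sketch for extrapolating from one measured level. The function name and the use of tenths of a second are choices made for this example, and it ignores any change in sieving efficiency or GPU code path at higher bit levels.

```c
#include <stdint.h>

/* Extrapolate the time for bit level [bit, bit+1) from a measured time
   for [ref_bit, ref_bit+1), assuming the time doubles per bit level.
   Times are in tenths of a second.
   Requires ref_bit <= bit < ref_bit + 40 to stay clear of overflow. */
uint64_t estimate_tenths(uint64_t ref_tenths, unsigned ref_bit, unsigned bit)
{
    return ref_tenths << (bit - ref_bit);
}
```

Starting from the measured 103.7 s for 2^64 to 2^65, this predicts 207.4 s, 414.8 s and 829.6 s for the next three levels, against measured values of 203.3 s, 401.2 s and 799.3 s.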
Attached Files
File Type: gz mfaktc-0.04.tar.gz (26.8 KB, 245 views)
Old 2010-01-29, 23:42   #118
henryzz
Just call me Henry
 
"David"
Sep 2007
Liverpool (GMT/BST)

Quote:
Originally Posted by TheJudger View Post
Hi David,



Could you try it again with a lower THREADS_PER_GRID value?
E.g. replace (1<<20) with (1<<18) to reduce the CUDA kernel size by a factor of 4. Of course this might carry a small performance penalty.

If this doesn't help (enough), you could try adding e.g. usleep(10000); right after the first occurrence of cudaStreamSynchronize() in mfaktc.cu.

Oliver
I am pretty certain that decreasing THREADS_PER_GRID led to faster response times. The latest version is also more responsive than the first one.
Old 2010-02-01, 00:29   #119
ixfd64
Bemusing Prompter
 
"Danny"
Dec 2002
California

Those figures are pretty impressive. Once all of the kinks have been worked out, I think the next logical step would be to port the code to OpenCL, so it can be used with those ultra-fast Radeon HD xxxx chips.

I can't wait until George gets back!
Old 2010-02-01, 12:21   #120
TheJudger
 
"Oliver"
Mar 2005
Germany

Hi ixfd64

Quote:
Originally Posted by ixfd64 View Post
Those figures are pretty impressive. Once all of the kinks have been worked out, I think the next logical step would be to port the code to OpenCL, so it can be used with those ultra-fast Radeon HD xxxx chips.

I can't wait until George gets back!
I'm not sure about OpenCL... from what I've heard, it is slower than CUDA or Stream in general. And I don't own an ATI GPU.

Oliver
Old 2010-02-05, 09:05   #121
TheJudger
 
"Oliver"
Mar 2005
Germany

Luigi:
Do you have a list of all OBD factors which can be easily parsed?
I would like to add all known factors < 2^71 of OBD to my selftest routine.

Thank you,
Oliver

P.S. less than 5 minutes for M3321XXXXXX from 2^50 to 2^71 with the latest (not released) version :)
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
