#111
"Oliver"
Mar 2005
Germany
10001011011₂ Posts
Quote:
Actually you just have to look at the time per class (or even better: the time for the whole run).

The current code (0.03) already contains an indicator which shows you whether you can do more sieving (by displaying the time waited for the GPU). But this becomes visible only when you activate "VERBOSE_TIMING" in params.h. If you enable it you'll see a lot more timing information. When you enable it I recommend redirecting the output to a file (Linux: ./mfaktc <exponent> <bit_min> <bit_max> &> mfaktc.out) and looking at mfaktc.out after the run. Another way: run mfaktc on a remote host via ssh (yeah, again Linux ;)), this is what I usually do.

In 0.04 I will move this indicator to a discrete parameter and enable it by default. Once I've converted SIEVE_PRIMES from a compile-time option to a runtime option I could use this indicator to increase/decrease it automatically during a run.

Oliver

#112
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
13775₈ Posts
Quote:
Currently the screen updates too slowly.

#113
Banned
"Luigi"
Aug 2002
Team Italia
5·7·139 Posts
Quote:
I will try the download next Monday!

SIEVE_PRIMES is an argument I struggled with while developing factor5, I know it's a PITA...

I will let you know about the installation.

Luigi

Last fiddled with by ET_ on 2010-01-25 at 00:01

#114
"Gang aft agley"
Sep 2002
7252₈ Posts
Quote:
Quote:
I am deliberately using a low-end passively cooled board by Sparkle: Quote:
Code:
Processor: Intel(R) Celeron(R) CPU 420 @ 1.60GHz (1601 MHz)
Operating System: Windows 7 Ultimate, 32-bit
DirectX version: 11.0
GPU processor: GeForce 8400 GS
Driver version: 196.21
CUDA Cores: 8
Memory interface: 64-bit
Total available graphics memory: 1023 MB
Dedicated video memory: 256 MB
System video memory: 0 MB
Shared system memory: 767 MB
Video BIOS version: 62.98.29.00.00
IRQ: 16
Bus: PCI Express x16
Code:
Compiletime Options
  THREADS_PER_GRID 1048576
  THREADS_PER_BLOCK 256
  SIEVE_SIZE_LIMIT 32kiB
  SIEVE_SIZE 230945bits
  SIEVE_PRIMES 50000
  USE_PINNED_MEMORY enabled
  USE_ASYNC_COPY enabled
  VERBOSE_TIMING disabled
  SELFTEST disabled
  MORE_CLASSES disabled
sieve_init(): sieving factor candidates with small primes up to 611957
tf(65255629, 68, 69);
 k_min = 2261474677980
 k_max = 4522949356282
class 0: tested 991952896 candidates in 617380ms (1606713/sec)
class 11: tested 991952896 candidates in 603623ms (1643331/sec)
class 12: tested 991952896 candidates in 605835ms (1637331/sec)
Quote:
Last fiddled with by only_human on 2010-01-25 at 06:17 Reason: added last paragraphs about CUDA install and DLL
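
A side note on the k_min and k_max lines in the log above: a factor candidate of M_p has the form q = 2kp + 1, so the k bound for a bit level can be estimated as floor(2^bit / (2p)). The sketch below is plain C and is not taken from mfaktc; the helper names `pow2_div`, `k_bound` and `round_down` are made up for illustration. Rounding k_min down to a multiple of 420 is an assumption: the logged k_min happens to be a multiple of 420, which would fit 420 classes with MORE_CLASSES disabled, but nothing in this thread states that.

```c
#include <stdint.h>

/* floor(2^n / p) by shift-and-subtract, so 2^n never has to fit in 64 bits.
   Invariant after i steps: 2^i == q * p + r with 0 <= r < p.
   Requires 1 < p < 2^63 and a quotient below 2^64. */
uint64_t pow2_div(unsigned n, uint64_t p)
{
    uint64_t q = 0, r = 1;
    for (unsigned i = 0; i < n; i++) {
        q <<= 1;
        r <<= 1;
        if (r >= p) {
            r -= p;
            q |= 1;
        }
    }
    return q;
}

/* k bound for candidates q = 2*k*p + 1 near 2^bit:
   floor(2^bit / (2*p)) == floor(2^(bit-1) / p). Requires bit >= 1. */
uint64_t k_bound(uint64_t p, unsigned bit)
{
    return pow2_div(bit - 1, p);
}

/* Assumption: the start of the k range is aligned down to a multiple of
   the number of classes (420 is a guess based on the logged k_min). */
uint64_t round_down(uint64_t k, uint64_t classes)
{
    return k - k % classes;
}
```

With p = 65255629, `k_bound(p, 69)` gives 4522949356282 and `round_down(k_bound(p, 68), 420)` gives 2261474677980, which matches the k_max and k_min printed in the log.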

#115
"Oliver"
Mar 2005
Germany
5×223 Posts
Hi,
don't waste GPU cycles on mfaktc 0.01 to 0.03; those versions are known to return wrong results in some cases. The problem is that the long division in some cases overestimates part of the dividend. The chance of an error is something like 1 in 10000 (wrong residue for a factor candidate).

Quick fix (not very well checked): in mfaktc.cu, in mfakt(), replace
Code:
ff=__int_as_float(0x3f7ffffd) / ff;
with
Code:
ff=__int_as_float(0x3f7ffffc) / ff;

Oliver
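
For anyone wondering what the two constants in that quick fix are: CUDA's `__int_as_float()` reinterprets the bits of an integer as a single-precision float, and both bit patterns decode to values just below 1.0. The host-side sketch below shows this in plain C; `bits_to_float` and `ulps_below_one` are hypothetical helper names, not part of mfaktc. Why one step lower is enough to avoid the overestimate is not explained in the post, so treat any reasoning about that as a guess.

```c
#include <stdint.h>
#include <string.h>

/* Host-side stand-in for CUDA's __int_as_float(): reinterpret the 32 bits
   of an integer as an IEEE-754 single-precision float (no value conversion). */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Number of representable floats between this bit pattern and 1.0f.
   Only meaningful for patterns in [0.5, 1.0], where one step is 2^-24. */
uint32_t ulps_below_one(uint32_t bits)
{
    return 0x3f800000u - bits; /* 0x3f800000 is the bit pattern of 1.0f */
}
```

`ulps_below_one(0x3f7ffffd)` is 3 and `ulps_below_one(0x3f7ffffc)` is 4, so the old numerator is 1 - 3·2^-24 and the new one is 1 - 2^-22: the fix lowers the numerator of the reciprocal by exactly one float step.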

#116
"Oliver"
Mar 2005
Germany
5·223 Posts
Hi David,
Quote:
e.g. replace (1<<20) with (1<<18) to reduce the CUDA-kernel size by a factor of 4. Of course this might have a small performance penalty.

If this doesn't help (enough) you could try adding e.g. a usleep(10000); right after the first occurrence of cudaStreamSynchronize() in mfaktc.cu.

Oliver
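
To get a feel for why a smaller grid makes the screen more responsive, here is a rough back-of-the-envelope sketch in plain C. It assumes that (1<<20) is the THREADS_PER_GRID value and that each thread tests exactly one factor candidate per kernel launch; both are assumptions, and `launch_ms` and `blocks_per_grid` are hypothetical helper names that do not exist in mfaktc.

```c
/* Rough wall time of one kernel launch in milliseconds, assuming one
   factor candidate per thread and a known candidates-per-second rate. */
double launch_ms(unsigned long threads_per_grid, double candidates_per_sec)
{
    return 1000.0 * (double)threads_per_grid / candidates_per_sec;
}

/* Number of thread blocks in one launch. */
unsigned long blocks_per_grid(unsigned long threads_per_grid,
                              unsigned long threads_per_block)
{
    return threads_per_grid / threads_per_block;
}
```

Taking the roughly 1.6 million candidates/sec reported for the GeForce 8400 GS in post #114 purely as an illustration, a 2^20 grid would keep the GPU busy for about 0.65 s per launch, while 2^18 brings that down to about 0.16 s. With 256 threads per block that is 4096 versus 1024 blocks per launch. Other GPUs will give different figures.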

#117
"Oliver"
Mar 2005
Germany
1115₁₀ Posts
Hi,
here is mfaktc 0.04. :)

Just another speed enhancement (GPU code is ~10% faster). :) Some cleanups in the code as well.

Hint for optimizing SIEVE_PRIMES: If you have enabled SHOW_WAIT in params.h it will show you the average time the CPU code has waited for the CUDA kernel. As long as SIEVE_PRIMES is <= 100000 just try to keep the average wait time at 50-500 usecs. If you have a "fast" CPU and a "slow" GPU you might increase SIEVE_PRIMES above 100000 but you have to watch the total runtime, too. This becomes a bit more complicated when you run two or more processes at once on one GPU. In this case take a look at the total runtimes. ;)

-----
Benchmark on my system:

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 40000
M66362159 from 2^ 1 to 2^64: 112.5s
M66362159 from 2^64 to 2^65: 103.7s
M66362159 from 2^65 to 2^66: 203.3s
M66362159 from 2^66 to 2^67: 401.2s
M66362159 from 2^67 to 2^68: 799.3s

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 75000
M3321932839 from 2^50 to 2^71: 310.2s
-----

There is a performance penalty if bit_min is small. A higher bit_min enables more precalculations. This is the reason why M66362159 from 2^1 to 2^64 takes longer than from 2^64 to 2^65. If bit_min is small enough it will report 1 as a factor.
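
The SIEVE_PRIMES hint above can be written down as a simple adjustment rule. This is only a sketch of one possible reading, not mfaktc code: it assumes that a long average wait means the CPU has spare time for more sieving (so SIEVE_PRIMES can go up) and a very short wait means the CPU is the bottleneck (so it should go down). The function name, the 12.5% step and the lower limit of 1000 are all hypothetical choices.

```c
/* Target window and soft upper limit taken from the hint in the post. */
#define WAIT_LOW_USEC          50.0
#define WAIT_HIGH_USEC         500.0
#define SIEVE_PRIMES_SOFT_MAX  100000u
/* Hypothetical lower limit, chosen only for this sketch. */
#define SIEVE_PRIMES_MIN       1000u

/* Suggest the next SIEVE_PRIMES value from the average time (in usec)
   the CPU code waited for the CUDA kernel. */
unsigned suggest_sieve_primes(unsigned current, double avg_wait_usec)
{
    unsigned step = current / 8; /* hypothetical 12.5% step */

    if (avg_wait_usec > WAIT_HIGH_USEC) {
        /* CPU is idle a lot: sieve more, but stop at the soft limit.
           Above that limit the total runtime has to be watched by hand. */
        if (current >= SIEVE_PRIMES_SOFT_MAX)
            return current;
        unsigned next = current + step;
        return next > SIEVE_PRIMES_SOFT_MAX ? SIEVE_PRIMES_SOFT_MAX : next;
    }
    if (avg_wait_usec < WAIT_LOW_USEC) {
        /* CPU barely waits: it is probably the bottleneck, sieve less. */
        if (current <= SIEVE_PRIMES_MIN)
            return current;
        unsigned next = current - step;
        return next < SIEVE_PRIMES_MIN ? SIEVE_PRIMES_MIN : next;
    }
    return current; /* already inside the 50-500 usec window */
}
```

Starting from SIEVE_PRIMES 40000, an average wait of 1000 usec suggests 45000, a wait of 10 usec suggests 35000, and anything between 50 and 500 usec leaves the value alone. When several processes share one GPU, the total runtime is the better guide, as the post says.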

#118
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3×23×89 Posts
Quote:

#119
Bemusing Prompter
"Danny"
Dec 2002
California
23×313 Posts
Those figures are pretty impressive. Once all of the kinks have been worked out, I think the next logical step would be to port the code to OpenCL, so it can be used with those ultra-fast Radeon HD xxxx chips.
I can't wait until George gets back!

#120
"Oliver"
Mar 2005
Germany
2133₈ Posts
Hi ixfd64
Quote:
Oliver

#121
"Oliver"
Mar 2005
Germany
5×223 Posts
Luigi:
Do you have a list of all OBD factors which can be easily parsed? I would like to add all known factors < 2^71 of OBD to my selftest routine.

Thank you,
Oliver

P.S. less than 5 minutes for M3321XXXXXX from 2^50 to 2^71 with the latest (not released) version :)