mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

James Heinrich 2013-10-02 19:24

[QUOTE=Mark Rose;354973]I noticed the mersenne.ca stats for it are extremely slow to update though :/[/QUOTE]The [url=http://www.mersenne.ca/tf1G.php?available_assignments=1]stats for >1G TF[/url], such as they are, update nightly.

[QUOTE=garo;354975]But isn't that a rather inefficient use of GPUs? I suspect nothing beats old Athlons at TF under 64 bit.[/QUOTE]It is (much) less efficient than TF in normal ranges, but not [i]too[/i] horrible. My GTX 670, for example, gets about 150GHz-days/day throughput in this range, compared with approx 238GHz-days/day doing TF in normal ranges. By comparison, an [url=http://www.mersenne.ca/throughput.php?cpu1=AMD%20Athlon%28tm%29%2064%20X2%20Dual%20Core%20Processor%206000%2B|1024|0&mhz1=3000]Athlon X2 6000+[/url] can get about 11GHz-days/day out of both cores up to 2[sup]63[/sup] (9/day to 2[sup]64[/sup]), assuming Prime95 efficiency (although the Prime95 application doesn't support exponents beyond PrimeNet range). For what it's worth, I have used CPUs to TF the entire range up to 2[sup]51[/sup], but it's getting to the point where it's no longer practical to take everything up another bitlevel with CPUs.
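
To put the throughput figures quoted above side by side, here is a small arithmetic sketch. It uses only the numbers given in the post; the variable names are illustrative. The comparison is rough, since the GPU and CPU figures are not necessarily for the same bit levels.

```python
# Throughput figures quoted in the post, in GHz-days per day.
gtx670_above_1g = 150.0   # GTX 670, TF on exponents above 1G
gtx670_normal = 238.0     # GTX 670, TF in normal ranges
athlon_x2_to_63 = 11.0    # Athlon X2 6000+, both cores, up to 2^63

# How much GPU throughput is retained in the >1G range.
relative_efficiency = gtx670_above_1g / gtx670_normal
# How the GPU in the >1G range compares with the dual-core CPU.
gpu_vs_cpu = gtx670_above_1g / athlon_x2_to_63

print(f"GPU in >1G range vs normal range: {relative_efficiency:.0%}")
print(f"GPU in >1G range vs Athlon X2:    {gpu_vs_cpu:.1f}x")
```

On these numbers the GPU keeps roughly 63% of its normal-range throughput and is still more than 13 times faster than the Athlon.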

I don't really expect anyone to join me, or even approve of my pet project, but it's what I've chosen to spend my GPU time on for the next few years. :smile:

Mark Rose 2013-10-03 14:35

[QUOTE=James Heinrich;354982]The [url=http://www.mersenne.ca/tf1G.php?available_assignments=1]stats for >1G TF[/url], such as they are, update nightly.[/quote]

Interesting. I submitted over 1000 TF results in the >1G range yet only [URL="http://www.mersenne.ca/stats.php?showuserstats=shifted"]10 show up[/URL]. :(

James Heinrich 2013-10-03 15:17

Sorry, yes, that section of the user-stats is known-broken. I have added a warning message to the page to make it clear. I will at some point get around to tracking down where the fault lies.

To be clear: the errors in the user stats pages extend across all ranges, not just the 1G+ range.
In the 1G+ range any user-specific factoring effort for factors smaller than 0.1GHz-day effort (roughly 2[sup]67[/sup]) is not recorded (the factor is recorded of course, just not who found it).

Mark Rose 2013-10-03 18:43

[QUOTE=James Heinrich;355091]Sorry, yes, that section of the user-stats is known-broken. I have added a warning message to the page to make it clear. I will at some point get around to tracking down where the fault lies.

To be clear: the errors in the user stats pages extend across all ranges, not just the 1G+ range.
In the 1G+ range any user-specific factoring effort for factors smaller than 0.1GHz-day effort (roughly 2[sup]67[/sup]) is not recorded (the factor is recorded of course, just not who found it).[/QUOTE]

Ahh, okay. Thanks for the information. I don't remember the exact level I was factoring to. I think mostly in the 2[sup]66[/sup] to 2[sup]68[/sup] range.

flashjh 2013-11-13 03:12

CUDALucas 2.05 beta and "CUDALucas Road Map"
 
Wrong forum, meant to go [URL="http://www.mersenneforum.org/showthread.php?p=359150#post359150"]here[/URL]

ewmayer 2013-11-19 21:49

[QUOTE=garo;354975]But isn't that a rather inefficient use of GPUs? I suspect nothing beats old Athlons at TF under 64 bit.[/QUOTE]

Actually, Intel has significantly improved integer-MUL support in their 2 main post-Core 2 chip families - roughly halved the latency, doubled the per-cycle pipelined throughput. [Those 2 are independent, btw.] GMP users may have noticed these speedups, although I have seen no one mention it around here. [Perhaps someone did in the factoring forums]. Here are [url=http://gmplib.org/list-archives/gmp-devel/2013-August/003353.html]comments from early August[/url] by GMP's Torbjorn Granlund:
[quote]I got a new Intel Haswell system for the GMP test system array. This
CPU line is interesting to GMP because of its improvements in the area
of integer arithmetic.

The undisputed GMP champion has for years been the now defunct AMD CPUs
K8 and K10. The most critical multiplication loops run at between 2.375
and 2.5 cycles per accumulated 64 x 64 -> 128 bit product.

No Intel system has come close, and newer AMD systems (Bulldozer,
Piledriver) run the loops at between 4.5 and 5.2 cycles per limb.
(New GMP code reaches 4.25 cycles.)

Haswell adds a new multiply instruction which avoids 2 of 3 fixed-
register operands. The old MUL did (rdx,rax) <- rax * regormem, while
the new MULX does (reg1,reg2) <- rdx * regormem. I suppose they kept
rdx fixed as a concession to the general x86 ugliness. :-)

Furthermore, MUL overwrites the carry flag with a useless value, while
MULX leaves flags alone.

The new instruction is much more suitable for GMP's needs.

I have written some preliminary loops using MULX, and optimised them for
Haswell. The results are encouraging; this CPU has the potential to
outperform all other x86 CPUs. The key multiply loops run at between
1.6 and 2.3 cycles/limb, resulting in about 20% higher performance than
on the old K10.

Thus far, only mul_1 (1.6 cycles/limb), and addmul_1/submul_1 (2.3
cycles/limb) are in the public repo.

I have a 1.75 c/l mul_2 and 2.0 c/l addmul_2 in the assembly works. I
strongly suspect it is possible to do addmul at considerably less than
2.0 c/l.

(A caveat about the new system: Perhaps I was unlucky, or perhaps the
platform is not yet robust, but the first system I got had a dead CPU,
and the second is not 100% stable under GNU/Linux; I get rare spurious
non-reproducible segfaults. Neither FreeBSD, Debian, or Ubuntu would
work at all; they crashed in strange ways during install. Finally
Gentoo installed, but has the segfault problem.)[/quote]
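
For readers unfamiliar with the limb operations named in the quote, here is a minimal Python sketch of what an addmul_1-style loop computes: one 64 x 64 -> 128 bit product per limb, accumulated with carry propagation. It models only the arithmetic, not the speed. The function names mirror the routines mentioned above, but this is an illustration and not GMP's code, and Python's arbitrary-precision integers stand in for the register pair that MUL/MULX would write.

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1

def mul_64x64_128(a, b):
    """Return the (high, low) 64-bit halves of the 128-bit product a * b."""
    product = a * b
    return product >> LIMB_BITS, product & LIMB_MASK

def addmul_1(rp, up, v):
    """Add up * v into rp in place (limbs are least significant first).

    Returns the carry-out limb, i.e. the part of the result that does not
    fit in len(up) limbs.
    """
    carry = 0
    for i, u in enumerate(up):
        high, low = mul_64x64_128(u, v)
        total = rp[i] + low + carry
        rp[i] = total & LIMB_MASK
        carry = high + (total >> LIMB_BITS)
    return carry

def limbs_to_int(limbs):
    """Convert a least-significant-first limb list to a Python int."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

# Check against plain big-integer arithmetic.
r = [5, LIMB_MASK]
u = [LIMB_MASK, 7]
v = LIMB_MASK
expected = limbs_to_int(r) + limbs_to_int(u) * v
carry_out = addmul_1(r, u, v)
print(limbs_to_int(r) + (carry_out << (LIMB_BITS * len(u))) == expected)
```

Each loop iteration corresponds to one "accumulated 64 x 64 -> 128 bit product" in the cycle counts quoted above.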

Having been fully occupied with AVX/float code most of this year, I first noticed the impressive IMUL throughput boost a couple of weeks ago, while porting my TF code [which has macros for both IMUL and SSE/AVX-float-based TF beyond 64 bits] to my Haswell. The float-double TF code [up to 78 bits] got a nice boost from AVX, but the pure-int code [which has x86 asm routines for 64 and 96-bit factor candidates] was even better. A little digging through Agner Fog's pre-Haswell instruction timings PDF confirmed the MUL enhancements already on Sandy Bridge - Haswell further adds the MULX instruction, which I will be playing with going forward, as well as using FMA to boost the float-TF routines.
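
As background for the TF routines discussed here, a minimal sketch of the trial-factoring test itself, assuming the standard facts that any factor q of 2^p - 1 (p an odd prime) has the form 2kp + 1 and satisfies q ≡ ±1 (mod 8). The function name and loop are hypothetical and are not taken from any of the programs mentioned in the thread; real TF code sieves the candidates and uses hand-tuned modular multiplication instead of Python's pow.

```python
def trial_factor(p, k_min, k_max):
    """Return every q = 2*k*p + 1 with k_min <= k <= k_max that divides 2**p - 1.

    p is assumed to be an odd prime.
    """
    factors = []
    for k in range(k_min, k_max + 1):
        q = 2 * k * p + 1
        # Factors of a Mersenne number are always 1 or 7 modulo 8.
        if q % 8 not in (1, 7):
            continue
        # q divides 2**p - 1 exactly when 2**p is congruent to 1 modulo q.
        if pow(2, p, q) == 1:
            factors.append(q)
    return factors

print(trial_factor(11, 1, 4))   # M11 = 2047 = 23 * 89
```

The modular exponentiation inside the loop is where the integer-multiply throughput described above matters: for a 64-bit or 96-bit q, each squaring step needs several 64 x 64 -> 128 bit products.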

James Heinrich 2013-11-20 03:45

I just installed NVIDIA drivers v331.82 and now mfaktc doesn't work anymore. Or, more specifically, the 64-bit LessClasses version doesn't work anymore. I tried the 32-bit regular version and it still seems to work OK.

Crash gives me this error dump:[code]Problem signature:
Problem Event Name: APPCRASH
Application Name: mfaktc-win-64.exe
Application Version: 0.0.0.0
Application Timestamp: 50e9bf08
Fault Module Name: nvcuda.dll
Fault Module Version: 8.17.13.3182
Fault Module Timestamp: 5280db7b
Exception Code: c0000005
Exception Offset: 000000000009b506
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1033
Additional Information 1: 0800
Additional Information 2: 08002199d42341871ec210c846947482
Additional Information 3: 915a
Additional Information 4: 915a5873c4a2aec8d9ca7379729b85a7[/code]

[i]edit: rolling back to v331.65 didn't fix my problem :sad:[/i]

TheMawn 2013-11-20 05:20

I'm still on 331.65. I ignored the update for the time being; it's supposed to improve performance in Assassin's Creed: Black Flag and some other game (guess which of the two I am more looking forward to playing... :razz:).

If you manage to get 331.65 to work for you again, I could update as well to check that this isn't just you. If the issue doesn't get solved by the weekend, I'll update my OS SSD image, update the GPU drivers, and restore the entire f*****g image if I get the same problem.

You're under Windows 7? Or Linux? Do you have some kind of system restore feature? Windows 7 should have automatically made one before an update of that magnitude. Try restoring from that if it isn't going to hurt anything else of yours.

James Heinrich 2013-11-20 13:35

I re-updated to 331.82 [i]and rebooted[/i] this time, and now mfaktc is happy again. (I couldn't reboot last night because I was still processing a 45-hour job).
I just found it odd that the LessClasses version wasn't happy but the regular mfaktc worked fine.

TheMawn 2013-11-21 02:23

The one time I didn't ask "Have you rebooted"...

xtreme2k 2013-12-24 09:24

Any advice to tweak my system for higher output?
[QUOTE]mfaktc v0.20 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
MORE_CLASSES enabled

Runtime options
SievePrimes 25000
SievePrimesAdjust 1
SievePrimesMin 5000
SievePrimesMax 100000
NumStreams 3
CPUStreams 3
GridSize 3
GPU Sieving enabled
GPUSievePrimes 82486
GPUSieveSize 64Mi bits
GPUSieveProcessSize 16Ki bits
WorkFile worktodo.txt
Checkpoints enabled
CheckpointDelay 30s
Stages enabled
StopAfterFactor bitlevel
PrintMode full
V5UserID (none)
ComputerID (none)
AllowSleep no
TimeStampInResults no

CUDA version info
binary compiled for CUDA 4.20
CUDA runtime version 4.20
CUDA driver version 6.0

CUDA device info
name GeForce GTX 670
compute capability 3.0
maximum threads per block 1024
number of multiprocessors 7 (1344 shader cores)
clock rate 980MHz

Automatic parameters
threads per grid 917504

running a simple selftest...
Selftest statistics
number of tests 92
successfull tests 92

selftest PASSED!

got assignment: exp=75844001 bit_min=71 bit_max=72 (6.31 GHz-days)
Starting trial factoring M75844001 from 2^71 to 2^72 (6.31 GHz-days)
k_min = 15566051433240
k_max = 31132102873038
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Dec 24 20:21 | 0 0.1% | 2.380 38m02s | 238.45 82485 n.a.%
Dec 24 20:21 | 3 0.2% | 2.363 37m44s | 240.17 82485 n.a.%
Dec 24 20:21 | 4 0.3% | 2.339 37m18s | 242.63 82485 n.a.%
Dec 24 20:21 | 15 0.4% | 2.338 37m15s | 242.74 82485 n.a.%
Dec 24 20:21 | 16 0.5% | 2.341 37m16s | 242.43 82485 n.a.%
[/QUOTE]
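
The k_min/k_max and ETA values in the log above can be roughly reproduced from the assignment line. This is a sketch, assuming factor candidates of the form q = 2kp + 1, so that a bit level 2^b corresponds to k ≈ 2^b / (2p). The logged values differ slightly from this estimate, presumably because mfaktc aligns the range to its class boundaries; that is an assumption, not something confirmed in the thread.

```python
p = 75844001                 # exponent from the log
bit_min, bit_max = 71, 72    # bit levels from the log

# q = 2*k*p + 1 lies between 2^bit_min and 2^bit_max for k in this range.
k_lo = 2**bit_min // (2 * p)
k_hi = 2**bit_max // (2 * p)
print(k_lo, k_hi)

# ETA from the assignment size and the reported throughput.
ghz_days = 6.31              # assignment size from the log
rate = 240.0                 # GHz-d/day, roughly what the log reports
eta_minutes = ghz_days / rate * 24 * 60
print(f"{eta_minutes:.1f} minutes")
```

The ETA estimate lands a little under 38 minutes, in line with the 37m to 38m shown in the log.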


All times are UTC.