[QUOTE=Mark Rose;354973]I noticed the mersenne.ca stats for it are extremely slow to update though :/[/QUOTE]The [url=http://www.mersenne.ca/tf1G.php?available_assignments=1]stats for >1G TF[/url], such as they are, update nightly.
[QUOTE=garo;354975]But isn't that a rather inefficient use of GPUs? I suspect nothing beats old Athlons at TF under 64 bit.[/QUOTE]It is (much) less efficient than TF in normal ranges, but not [i]too[/i] horrible. My GTX 670, for example, gets about 150GHz-days/day throughput in this range, compared with approximately 238GHz-days/day doing TF in normal ranges. By comparison, an [url=http://www.mersenne.ca/throughput.php?cpu1=AMD%20Athlon%28tm%29%2064%20X2%20Dual%20Core%20Processor%206000%2B|1024|0&mhz1=3000]Athlon X2 6000+[/url] can get about 11GHz-days/day out of both cores up to 2[sup]63[/sup] (9/day to 2[sup]64[/sup]), assuming Prime95 efficiency (although the Prime95 application doesn't support exponents beyond PrimeNet range). For what it's worth, I have used CPUs to TF the entire range up to 2[sup]51[/sup], but it's getting to the point where it's no longer practical to take everything up another bitlevel with CPUs. I don't really expect anyone to join me, or even approve of my pet project, but it's what I've chosen to expend my GPU time on for the next few years. :smile:
[QUOTE=James Heinrich;354982]The [url=http://www.mersenne.ca/tf1G.php?available_assignments=1]stats for >1G TF[/url], such as they are, update nightly.[/quote]
Interesting. I submitted over 1000 TF results in the >1G range yet only [URL="http://www.mersenne.ca/stats.php?showuserstats=shifted"]10 show up[/URL]. :(
Sorry, yes, that section of the user-stats is known-broken. I have added a warning message to the page to make it clear. I will at some point get around to tracking down where the fault lies.
To be clear: the errors in the user stats pages extend across all ranges, not just the 1G+ range. In the 1G+ range any user-specific factoring effort for factors smaller than 0.1GHz-day effort (roughly 2[sup]67[/sup]) is not recorded (the factor is recorded of course, just not who found it).
[QUOTE=James Heinrich;355091]Sorry, yes, that section of the user-stats is known-broken. I have added a warning message to the page to make it clear. I will at some point get around to tracking down where the fault lies.
To be clear: the errors in the user stats pages extend across all ranges, not just the 1G+ range. In the 1G+ range any user-specific factoring effort for factors smaller than 0.1GHz-day effort (roughly 2[sup]67[/sup]) is not recorded (the factor is recorded of course, just not who found it).[/QUOTE] Ahh, okay. Thanks for the information. I don't remember the exact level I was factoring to. I think mostly in the 2[sup]66[/sup] to 2[sup]68[/sup] range.
CUDALucas 2.05 beta and "CUDALucas Road Map"
Wrong forum, meant to go [URL="http://www.mersenneforum.org/showthread.php?p=359150#post359150"]here[/URL]
[QUOTE=garo;354975]But isn't that a rather inefficient use of GPUs? I suspect nothing beats old Athlons at TF under 64 bit.[/QUOTE]
Actually, Intel has significantly improved integer-MUL support in their 2 main post-Core 2 chip families - roughly halved the latency, doubled the per-cycle pipelined throughput. [Those 2 are independent, btw.] GMP users may have noticed these speedups, although I have seen no one mention it around here. [Perhaps someone did in the factoring forums]. Here are [url=http://gmplib.org/list-archives/gmp-devel/2013-August/003353.html]comments from early August[/url] by GMP's Torbjorn Granlund:

[quote]I got a new Intel Haswell system for the GMP test system array. This CPU line is interesting to GMP because of its improvements in the area of integer arithmetic.

The undisputed GMP champion has for years been the now defunct AMD CPUs K8 and K10. The most critical multiplication loops run at between 2.375 and 2.5 cycles per accumulated 64 x 64 -> 128 bit product. No Intel system has come close, and newer AMD systems (Bulldozer, Piledriver) run the loops at between 4.5 and 5.2 cycles per limb. (New GMP code reaches 4.25 cycles.)

Haswell adds a new multiply instruction which avoids 2 of 3 fixed-register operands. The old MUL did (rdx,rax) <- rax * regormem, while the new MULX does (reg1,reg2) <- rdx * regormem. I suppose they kept rdx fixed as a concession to the general x86 ugliness. :-) Furthermore, MUL overwrites the carry flag with a useless value, while MULX leaves flags alone. The new instruction is much more suitable for GMP's needs.

I have written some preliminary loops using MULX, and optimised them for Haswell. The results are encouraging; this CPU has the potential to outperform all other x86 CPUs. The key multiply loops run at between 1.6 and 2.3 cycles/limb, resulting in about 20% higher performance than on the old K10. Thus far, only mul_1 (1.6 cycles/limb), and addmul_1/submul_1 (2.3 cycles/limb) are in the public repo. I have a 1.75 c/l mul_2 and 2.0 c/l addmul_2 in the assembly works.

I strongly suspect it is possible to do addmul at considerably less than 2.0 c/l.

(A caveat about the new system: Perhaps I was unlucky, or perhaps the platform is not yet robust, but the first system I got had a dead CPU, and the second is not 100% stable under GNU/Linux; I get rare spurious non-reproducible segfaults. None of FreeBSD, Debian, or Ubuntu would work at all; they crashed in strange ways during install. Finally Gentoo installed, but has the segfault problem.)[/quote]

Having been fully occupied with AVX/float code most of this year, I first noticed the impressive IMUL throughput boost a couple of weeks ago, while porting my TF code [which has macros for both IMUL and SSE/AVX-float-based TF beyond 64 bits] to my Haswell. The float-double TF code [up to 78 bits] got a nice boost from AVX, but the pure-int code [which has x86 asm routines for 64- and 96-bit factor candidates] was even better. A little digging through Agner Fog's pre-Haswell instruction timings PDF confirmed the MUL enhancements already on Sandy Bridge - Haswell further adds the MULX instruction, which I will be playing with going forward, as well as using FMA to boost the float-TF routines.
I just installed NVIDIA drivers v331.82 and now mfaktc doesn't work anymore. Or, more specifically, the 64-bit LessClasses version doesn't work anymore. I tried the 32-bit regular version and it still seems to work OK.
Crash gives me this error dump:[code]Problem signature:
  Problem Event Name:        APPCRASH
  Application Name:          mfaktc-win-64.exe
  Application Version:       0.0.0.0
  Application Timestamp:     50e9bf08
  Fault Module Name:         nvcuda.dll
  Fault Module Version:      8.17.13.3182
  Fault Module Timestamp:    5280db7b
  Exception Code:            c0000005
  Exception Offset:          000000000009b506
  OS Version:                6.1.7601.2.1.0.256.48
  Locale ID:                 1033
  Additional Information 1:  0800
  Additional Information 2:  08002199d42341871ec210c846947482
  Additional Information 3:  915a
  Additional Information 4:  915a5873c4a2aec8d9ca7379729b85a7[/code]
[i]edit: rolling back to v331.65 didn't fix my problem :sad:[/i]
I'm still on 331.65. I ignored the update for the time being; it's supposed to improve performance in Assassin's Creed: Black Flag and some other game (guess which of the two I am more looking forward to playing... :razz:).
If you manage to get 331.65 to work for you again, I could update as well to check that this isn't just you. If the issue doesn't get solved by the weekend, I'll update my OS SSD image, update the GPU drivers, and restore the entire f*****g image if I get the same problem. You're under Windows 7? Or Linux? Do you have some kind of system restore feature? Windows 7 should have automatically made one before an update of that magnitude. Try restoring from that if it isn't going to hurt anything else of yours.
I re-updated to 331.82 [i]and rebooted[/i] this time, and now mfaktc is happy again. (I couldn't reboot last night because I was still processing a 45-hour job).
I just found it odd that the LessClasses version wasn't happy but the regular mfaktc worked fine.
The one time I didn't ask "Have you rebooted"...
Any advice to tweak my system for higher output?
[QUOTE]mfaktc v0.20 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              64Mi bits
  GPUSieveProcessSize       16Ki bits
  WorkFile                  worktodo.txt
  Checkpoints               enabled
  CheckpointDelay           30s
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 full
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  4.20
  CUDA runtime version      4.20
  CUDA driver version       6.0

CUDA device info
  name                      GeForce GTX 670
  compute capability        3.0
  maximum threads per block 1024
  number of multiprocessors 7 (1344 shader cores)
  clock rate                980MHz

Automatic parameters
  threads per grid          917504

running a simple selftest...
Selftest statistics
  number of tests           92
  successfull tests         92
selftest PASSED!

got assignment: exp=75844001 bit_min=71 bit_max=72 (6.31 GHz-days)
Starting trial factoring M75844001 from 2^71 to 2^72 (6.31 GHz-days)
 k_min = 15566051433240
 k_max = 31132102873038
Using GPU kernel "barrett76_mul32_gs"
Date   Time  | class   Pct |   time     ETA | GHz-d/day Sieve   Wait
Dec 24 20:21 |     0  0.1% |  2.380  38m02s |    238.45 82485  n.a.%
Dec 24 20:21 |     3  0.2% |  2.363  37m44s |    240.17 82485  n.a.%
Dec 24 20:21 |     4  0.3% |  2.339  37m18s |    242.63 82485  n.a.%
Dec 24 20:21 |    15  0.4% |  2.338  37m15s |    242.74 82485  n.a.%
Dec 24 20:21 |    16  0.5% |  2.341  37m16s |    242.43 82485  n.a.%[/QUOTE]