
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

kriesel 2018-09-27 12:42

[QUOTE=nofaith628;496845]As you may have read in my previous call for help, my switch from GTX 1080TI to a Titan V has resulted in errors.

Previous attempts to resolve this error:
[CODE]ERROR: cudaGetLastError() returned 8: invalid device function[/CODE]have all failed, despite reinstalling, clean-installing, and uninstalling the display driver and CUDA, as well as tweaking settings in mfaktc.ini.

Without any prior knowledge of computer programming or compilation, I have managed to compile a [B]non-optimized[/B] version of mfaktc 10.0, and it currently works with the Titan V. The GHz-days output is not good, but it works.

As for the rest, I have discussed this with Oliver (TheJudger): he may release an optimized mfaktc for the Turing architecture in the near future, but there are no plans to optimize the code specifically for the Volta architecture, as it has a very high entry price.[/QUOTE]
Thanks for sharing the build.

I'm guessing that you meant something like mfaktc version 0.21, compiled for 64-bit Windows and for CUDA 10. (There has been no 32-bit CUDA, only 64-bit, since CUDA version 8.0, as I recall; the highest version of mfaktc I've seen previously is v0.21.) [URL]https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html[/URL] says of CUDA 10: "32-bit tools are no longer supported..." The CUDA 10 download page confirms that x86_64 is available and win32 is not. Including the CUDA 10 runtime DLL in the zip file would be a plus.

lycorn 2018-09-27 21:31

Hi all,

It's been a long time since I ran mfakt on any of my machines.

I am now intending to start running it on a GTX 1060, but I must confess I'm a bit out of date as to the recommended CUDA version / mfakt version. I don't have the means to do any compilation myself, so I would kindly ask any willing member of this forum to point me to the right binaries. I am using Windows 10.

Many thanks.

kriesel 2018-09-28 00:49

[QUOTE=lycorn;496932]Hi all,

It's been a long time since I ran mfakt on any of my machines.

I am now intending to start running it on a GTX 1060, but I must confess I'm a bit out of date as to the recommended CUDA version / mfakt version. I don't have the means to do any compilation myself, so I would kindly ask any willing member of this forum to point me to the right binaries. I am using Windows 10.

Many thanks.[/QUOTE]

The first line below is generated by a batch file.
[CODE]mfaktc-win-64.LessClasses-CUDA8.exe (re)launch at Mon 12/04/2017 10:46:19.80 count 0
mfaktc v0.21 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
SIEVE_SPLIT 250
MORE_CLASSES disabled

Runtime options
SievePrimes 25000
SievePrimesAdjust 1
SievePrimesMin 5000
SievePrimesMax 100000
NumStreams 3
CPUStreams 3
GridSize 3
GPUSievePrimes 82486
GPUSieveSize 64Mi bits
GPUSieveProcessSize 16Ki bits
Checkpoints enabled
CheckpointDelay 300s
WARNING: Cannot read WorkFileAddDelay from mfaktc.ini, set to 600s by default
WorkFileAddDelay 600s
Stages enabled
StopAfterFactor bitlevel
PrintMode full
V5UserID Kriesel
ComputerID condor-gtx1060
ProgressHeader "Date Time | class Pct | time ETA | GHz-d/day Sieve Wait"
ProgressFormat "%d %T | %C %p%% | %t %e | %g %s %W%%"
AllowSleep no
TimeStampInResults yes

CUDA version info
binary compiled for CUDA 8.0
CUDA runtime version 8.0
CUDA driver version 8.0

CUDA device info
name GeForce GTX 1060 3GB
compute capability 6.1
max threads per block 1024
max shared memory per MP 98304 byte
number of multiprocessors 9
clock rate (CUDA cores) 1771MHz
memory clock rate: 4004MHz
memory bus width: 192 bit

Automatic parameters
threads per grid 589824
random selftest offset 23085
GPUSievePrimes (adjusted) 82486
GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
number of tests 107
successfull tests 107

selftest PASSED![/CODE]

James Heinrich 2018-09-28 00:53

The [i]LessClasses[/i] version should only be used for extremely-fast-running assignments (where each assignment only takes a few seconds).

mfaktc can be downloaded from [url=https://mersenneforum.org/mfaktc/mfaktc-0.21/]here[/url] or [url=https://download.mersenne.ca/mfaktc/mfaktc-0.21]here[/url].

James Heinrich 2018-09-28 01:04

[QUOTE=Honza;496893]I guess old cudart32_80.dll and cudart64_80.dll needs to be updated with cudart32_100.dll and cudart64_100.dll[/QUOTE]CUDA DLLs can be found [url=https://download.mersenne.ca/CUDA-DLLs]here[/url], if needed. Or you can download the toolkit from [url]https://developer.nvidia.com/cuda-toolkit[/url], install just the libraries you need, and grab the DLLs from where you installed it (by default, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin).

kriesel 2018-09-28 01:54

[QUOTE=James Heinrich;496947]The [I]LessClasses[/I] version should only be used for extremely-fast-running assignments (where each assignment only takes a few seconds).
[/QUOTE]
Why?
I've been running it on high exponents and ordinary assignments on multiple GPUs for months.

Following is on a gtx1070, one of two instances running on it.

[CODE]Sep 27 14:04 | 405 96.9% | 271.26 13m34s | 293.35 82485 n.a.%
Sep 27 14:08 | 408 97.9% | 271.00 9m02s | 293.64 82485 n.a.%
Sep 27 14:13 | 413 99.0% | 271.10 4m31s | 293.53 82485 n.a.%
Sep 27 14:17 | 416 100.0% | 271.27 0m00s | 293.35 82485 n.a.%
no factor for M173090623 from 2^76 to 2^77 [mfaktc 0.21 barrett87_mul32_gs]
tf(): total time spent: 7h 13m 50.524s

Starting trial factoring M173090623 from 2^77 to 2^78 (176.83 GHz-days)
k_min = 436521993025020
k_max = 873043986050178
Using GPU kernel "barrett87_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Sep 27 14:26 | 0 1.0% | 542.32 14h18m | 293.46 82485 n.a.%
Sep 27 14:35 | 5 2.1% | 542.49 14h09m | 293.37 82485 n.a.%
Sep 27 14:44 | 8 3.1% | 542.39 14h00m | 293.42 82485 n.a.%
Sep 27 14:53 | 12 4.2% | 542.38 13h51m | 293.43 82485 n.a.%[/CODE]

James Heinrich 2018-09-28 02:30

[QUOTE=kriesel;496953]Why? I've been running it on high exponents and ordinary assignments on multiple gpus for months.[/QUOTE]Someone else can explain the mechanics better than I, but the more classes the more candidates are filtered out prior to testing. The extra overhead to do this is not worth it for [i]very[/i] fast-running assignments, but it is beneficial for any "normal" TF assignment.

edit: from the mfakto readme:[quote]MoreClasses is a switch for defining if 420 (2*2*3*5*7) or 4620 (2*2*3*5*7*11) classes of
factor candidates should be used. Normally, 4620 gives better results but for very small classes
420 reduces the class initialization overhead enough to provide an overall benefit.[/quote]To clarify: mfakto allows this to be set in the ini file, whereas mfaktc is hardcoded to 4620 classes (unless you explicitly use the LessClasses version which is hardcoded to 420 classes).

You can easily run a quick test: using the same ini settings try running both the normal and LessClasses version of mfaktc and compare the throughput of each.

kriesel 2018-09-28 14:03

[QUOTE=James Heinrich;496956]Someone else can explain the mechanics better than I, but the more classes the more candidates are filtered out prior to testing. The extra overhead to do this is not worth it for [I]very[/I] fast-running assignments, but it is beneficial for any "normal" TF assignment.

edit: from the mfakto readme:To clarify: mfakto allows this to be set in the ini file, whereas mfaktc is hardcoded to 4620 classes (unless you explicitly use the LessClasses version which is hardcoded to 420 classes).

You can easily run a quick test: using the same ini settings try running both the normal and LessClasses version of mfaktc and compare the throughput of each.[/QUOTE]

Thanks. On the 3GB GTX 1060, I found the regular CUDA 8 build gives about 2% higher throughput initially, and about 1% later, than the less-classes CUDA 8 build, at the cost of a restart of the current exponent/bit level (the existing checkpoint file is ignored) and much more rapid log file growth. There's no warning about the restart of the bit level. In this case it cost 4.5 hours of throughput; there are cases where it could cost weeks. (GPU-Z indicates power, thermal, and vrel are limiting performance.)

less-classes (420):
[CODE]Starting trial factoring M172926979 from 2^76 to 2^77 (88.50 GHz-days)
k_min = 218467540932060
k_max = 436935081864318
Using GPU kernel "barrett87_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Sep 28 03:56 | 0 1.0% | 181.20 4h46m | 439.57 82485 n.a.%
Sep 28 03:59 | 5 2.1% | 181.24 4h43m | 439.49 82485 n.a.%
...
Sep 28 08:19 | 380 91.7% | 181.24 24m10s | 439.48 82485 n.a.%
Sep 28 08:22 | 384 92.7% | 181.26 21m09s | 439.44 82485 n.a.%
Sep 28 08:25 | 389 93.8% | 181.21 18m07s | 439.55 82485 n.a.%
Sep 28 08:28 | 392 94.8% | 181.52 15m08s | 438.81 82485 n.a.%
received signal "SIGINT"
mfaktc will exit once the current class is finished.
press ^C again to exit immediately
Sep 28 08:31 | 396 95.8% | 181.00 12m04s | 440.06 82485 n.a.%[/CODE]4620 classes:
[CODE]got assignment: exp=172926979 bit_min=76 bit_max=78 (265.50 GHz-days)
Starting trial factoring M172926979 from 2^76 to 2^77 (88.50 GHz-days)
k_min = 218467540931640
k_max = 436935081864318
Using GPU kernel "barrett87_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Sep 28 08:34 | 0 0.1% | 17.811 4h44m | 447.20 82485 n.a.%
Sep 28 08:34 | 5 0.2% | 17.797 4h44m | 447.55 82485 n.a.%
Sep 28 08:34 | 9 0.3% | 17.769 4h43m | 448.26 82485 n.a.%
Sep 28 08:34 | 20 0.4% | 17.840 4h44m | 446.47 82485 n.a.%
...
Sep 28 08:53 | 321 7.0% | 17.971 4h27m | 443.22 82485 n.a.%
Sep 28 08:54 | 324 7.1% | 17.947 4h26m | 443.81 82485 n.a.%
Sep 28 08:54 | 329 7.2% | 17.928 4h26m | 444.28 82485 n.a.%
Sep 28 08:54 | 336 7.3% | 17.901 4h25m | 444.95 82485 n.a.%
Sep 28 08:54 | 341 7.4% | 17.916 4h25m | 444.58 82485 n.a.% [/CODE]

James Heinrich 2018-09-28 14:08

[QUOTE=kriesel;496996]There's no warning about the restart of bit level.[/QUOTE]Sorry, I guess I should have more explicitly warned you that the checkpoint files would not be cross-compatible between the 420-class and 4620-class implementations.

lycorn 2018-09-28 22:11

Thank you all for your answers.

Up and running. It's nice to be "back in business"...

VictordeHolland 2018-09-29 11:10

[QUOTE=James Heinrich;496948]CUDA DLLs can be found [url=https://download.mersenne.ca/CUDA-DLLs]here[/url], if needed.[/QUOTE]
Could you arrange the files into sub-directories by CUDA SDK version? Last time I needed a CUDA DLL for factoring on my GTX 1080 Ti, I pretty much downloaded them all because I was unsure which ones I needed :blush: . Sorry for wasting your bandwidth!

Later I found out that the CUDA compute capability of the card is not the same as the CUDA SDK version. Still, I find it a bit confusing that you need to compile for different architectures. An mfaktc compiled with CUDA SDK 10 for a GTX 980 won't work on a GTX 1080, right? Because the architecture/compute capability of the GTX 1080 is higher (and somehow not backwards compatible)? Or am I just being ignorant?
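Regarding the architecture question: a CUDA binary contains GPU machine code (SASS) only for the compute capabilities it was built for, so a build targeting only the GTX 980 (compute capability 5.2) fails on a GTX 1080 (6.1) with exactly the "invalid device function" error quoted at the top of this page, unless PTX was also embedded for the driver to JIT-compile at load time. Builds are made to cover multiple generations by listing several -gencode targets; a hypothetical invocation follows (a sketch only; the file name is made up, so check mfaktc's own Makefile for the real flags):

```shell
# Emit SASS for Maxwell (sm_52) and Pascal (sm_61), plus PTX for
# compute_61 so newer, unlisted GPUs can JIT-compile it at load time.
nvcc -O3 -c mfaktc.cu \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_61,code=compute_61
```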


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.