mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Operazione Doppi Mersennes (https://www.mersenneforum.org/forumdisplay.php?f=99)
-   -   Trial division with CUDA (mmff) -- used, but runs like new! (https://www.mersenneforum.org/showthread.php?t=17162)

Fan Ming 2020-01-26 12:30

The same problem occurs for [B]MM107[/B], but MM89 runs normally:
[CODE]/content/drive/My Drive/mmff-test
mmff v0.28 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
MORE_CLASSES enabled

Runtime options
GPU Sieving enabled
WARNING: Cannot read GPUSievePrimes from mmff.ini, using default value (82486)
GPUSievePrimes depends on worktodo entry
GPUSieveSize 128M bits
WARNING: Cannot read GPUSieveProcessSize from mmff.ini, using default value (8)
GPUSieveProcessSize 8K bits
WorkFile worktodo.txt
Checkpoints enabled
CheckpointDelay 30s
StopAfterFactor class
PrintMode full
V5UserID (none)
ComputerID (none)
GPUProgressHeader " class | candidates | time | ETA | raw rate | SievePrimes | CPU wait"
GPUProgressFormat "%C/4620 | %n | %ts | %e | %rM/s | %s | %W%%"
TimeStampInResults no

CUDA version info
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 10.10

CUDA device info
name Tesla P100-PCIE-16GB
compute capability 6.0
maximum threads per block 1024
number of mutliprocessors 56 (unknown number of shader cores)
clock rate 1328MHz

got assignment: MM107, k range 41400000000000 to 41500000000000 (154-bit factors)
Starting trial factoring of MM107 in k range: 41400G to 41500G (154-bit factors)
k_min = 41400000000000
k_max = 41500000000000
Using GPU kernel "mfaktc_barrett160_M107gs"
Verifying (2^(2^107)) % 13435069371854815219033511685499715361952762321 = 974520303404695347505301237807931102140431668099
ERROR: Verifying on CPU failed. Remainder didn't match. Possible problems exist.[/CODE]

MM89 works properly:
[CODE]/content/drive/My Drive/mmff-test
mmff v0.28 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
MORE_CLASSES enabled

Runtime options
GPU Sieving enabled
WARNING: Cannot read GPUSievePrimes from mmff.ini, using default value (82486)
GPUSievePrimes depends on worktodo entry
GPUSieveSize 128M bits
WARNING: Cannot read GPUSieveProcessSize from mmff.ini, using default value (8)
GPUSieveProcessSize 8K bits
WorkFile worktodo.txt
Checkpoints enabled
CheckpointDelay 30s
StopAfterFactor class
PrintMode full
V5UserID (none)
ComputerID (none)
GPUProgressHeader " class | candidates | time | ETA | raw rate | SievePrimes | CPU wait"
GPUProgressFormat "%C/4620 | %n | %ts | %e | %rM/s | %s | %W%%"
TimeStampInResults no

CUDA version info
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 10.10

CUDA device info
name Tesla P100-PCIE-16GB
compute capability 6.0
maximum threads per block 1024
number of mutliprocessors 56 (unknown number of shader cores)
clock rate 1328MHz

got assignment: MM89, k range 41400000000000 to 41500000000000 (136-bit factors)
Starting trial factoring of MM89 in k range: 41400G to 41500G (136-bit factors)
k_min = 41400000000000
k_max = 41500000000000
Using GPU kernel "mfaktc_barrett140_M89gs"
Verifying (2^(2^89)) % 51250722476366711691515168579592911982721 = 37671549122511752130292866601915335328068
class | candidates | time | ETA | raw rate | SievePrimes | CPU wait
0/4620 | 21.65M | 0.029s | n.a. | 746.60M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250720280168236496304157387929107838071 = 35746096159163930640949829473693574340078
5/4620 | 21.65M | 0.029s | n.a. | 746.60M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250719954174058311049257317093331713479 = 22759295645343611258946139802672470959760
9/4620 | 21.65M | 0.029s | n.a. | 746.60M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250720852115103746739125095447985265401 = 41842644712508723081556126612950349320116
20/4620 | 21.65M | 0.028s | n.a. | 773.27M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250721744324486800537682201019693669463 = 13062456361537928045073778273891658192745
21/4620 | 21.65M | 0.028s | n.a. | 773.27M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250722110368501136753204925391936624199 = 11766302253559315831356912138896967481965
29/4620 | 21.65M | 0.028s | n.a. | 773.27M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250720709149122429788413288172826239287 = 14816860850408810792926573186880149802296
33/4620 | 21.65M | 0.028s | n.a. | 773.27M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250721052309815139813681631034757950353 = 41152310359413274585223328516751757168125
36/4620 | 21.65M | 0.028s | n.a. | 773.27M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250721263933188975570868864490245452809 = 44183317763900802218115380969512121058940
44/4620 | 21.65M | 0.027s | n.a. | 801.91M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250721378323800365697147786268920062497 = 18692344536121868666837048177982998180467
48/4620 | 21.65M | 0.027s | n.a. | 801.91M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250721395487839010388945297745277400527 = 3166578919721146857552725561773689514712
53/4620 | 21.65M | 0.026s | n.a. | 832.75M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250722287699697944266073163866784053033 = 37430319078903975242289720426417282202568
56/4620 | 21.65M | 0.027s | n.a. | 801.91M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250721481285749313140796010178879854681 = 13236591153213340344689881456839734478969
60/4620 | 21.65M | 0.026s | n.a. | 832.75M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250721086661413289943698879210555986631 = 29373416315097083858424261053021555658515
65/4620 | 21.65M | 0.026s | n.a. | 832.75M/s | 649781 | n.a.%
Verifying (2^(2^89)) % 51250720960840901517095503879288267435217 = 28658988341202110234172669662839524833844
68/4620 | 21.65M | 0.026s | n.a. | 832.75M/s | 649781 | n.a.%
...[/CODE]
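A quick sanity check on the two logs above, as a minimal Python sketch. A remainder modulo f must be smaller than f, yet the MM107 line reports a 48-digit remainder for a 47-digit modulus, while the MM89 remainder is in range. The sketch assumes that the line "Verifying (2^(2^p)) % f = r" means r = 2^(2^p) mod f; the names `CASES`, `remainder_in_range` and `cpu_remainder` are illustrative and not part of mmff.

```python
# Values copied from the first "Verifying" line of each log above.
# Assumption: "Verifying (2^(2^p)) % f = r" reports r = 2^(2^p) mod f.
CASES = {
    "MM107": (107,
              13435069371854815219033511685499715361952762321,
              974520303404695347505301237807931102140431668099),
    "MM89": (89,
             51250722476366711691515168579592911982721,
             37671549122511752130292866601915335328068),
}


def remainder_in_range(f, r):
    """A remainder modulo f must satisfy 0 <= r < f."""
    return 0 <= r < f


def cpu_remainder(p, f):
    """Recompute 2^(2^p) mod f with Python's arbitrary-precision pow()."""
    return pow(2, 1 << p, f)


for name, (p, f, logged) in CASES.items():
    recomputed = cpu_remainder(p, f)
    print(name,
          "| logged remainder in range:", remainder_in_range(f, logged),
          "| matches CPU recomputation:", recomputed == logged)
```

An out-of-range remainder is consistent with the "Remainder didn't match" error in the MM107 log. For reference, f divides MMp exactly when 2^(2^p) mod f equals 2.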

kriesel 2020-02-25 20:06

GPUSieveSize limit
 
Various builds of mmff v0.28 have been posted. Do any of them support GPUSieveSize values above 128, up to 2047, like the recent increase in mfaktc? On a GTX1650 there seems to be an advantage all the way up to 128, with a bit of underutilization still left at that setting, and the same is likely true on other fast GPUs.

win 7 x64 gtx1650 mmff tune
mm127, 120000T to 120500T

GPUSievePrimes 810549 GPUSieveSize  16 GpuSieveProcessSize 32 367.75   66W 95% utilization
GPUSievePrimes 810549 GPUSieveSize  32 GpuSieveProcessSize 32 380.41
GPUSievePrimes 810549 GPUSieveSize  64 GpuSieveProcessSize 32 387.10       99%
GPUSievePrimes 810549 GPUSieveSize 128 GpuSieveProcessSize 32 389.59 * 66W 99%
GPUSievePrimes 810549 GPUSieveSize 256 GpuSieveProcessSize 32 (GPUSieveSize capped at 128)

Fan Ming 2020-02-26 02:02

[QUOTE=kriesel;538323]Various builds of mmff v0.28 have been posted. Do any of them support GPUSieveSize values above 128, up to 2047, like the recent increase in mfaktc? On a GTX1650 there seems to be an advantage all the way up to 128, with a bit of underutilization still left at that setting, and the same is likely true on other fast GPUs.
[/QUOTE]

I once tried raising the upper limit to 2047, but the speed gain did not seem significant. I tested it on a Colab T4.

kriesel 2020-02-26 13:55

1 Attachment(s)
[QUOTE=Fan Ming;538340]I once tried raising the upper limit to 2047, but the speed gain did not seem significant. I tested it on a Colab T4.[/QUOTE]
Thanks for your response. Please post any T4 throughput data versus GPUSieveSize that you have collected.
After graphing the GTX1650 data I've collected, a higher limit appears to offer about 0.6% additional throughput on that GPU model, or 2 to 2.5 days per year, depending on whether the revised limit is 2047 or 4095. Based on mfaktc experience, the effect is likely larger for faster GPUs, and there are GPUs considerably faster than the GTX1650, such as the RTX2080 and similar, or the Tesla T4.

Fan Ming 2020-02-26 14:22

[QUOTE=kriesel;538364]Thanks for your response. Please post any T4 throughput data versus GPUSieveSize that you have collected.
After graphing the GTX1650 data I've collected, a higher limit appears to offer about 0.6% additional throughput on that GPU model, or 2 to 2.5 days per year, depending on whether the revised limit is 2047 or 4095. Based on mfaktc experience, the effect is likely larger for faster GPUs, and there are GPUs considerably faster than the GTX1650, such as the RTX2080 and similar, or the Tesla T4.[/QUOTE]

Sorry, I didn't keep the detailed data. I tested MM89, and as far as I recall the raw rate was about 1340 with GPUSieveSize at 128 and still ~1340 with GPUSieveSize at 2047. Since the change was not significant, I wasn't impressed and didn't keep the data.

kriesel 2020-02-26 19:23

2047 GPUSieveSize limit Windows build requested
 
Please make and post a Windows 7 x64 through Windows 10 x64 CUDA 10.x compatible build allowing GPUSieveSize up to 2047. Switching to unsigned int for 4095 would be more work.

Fan Ming 2020-02-27 09:39

1 Attachment(s)
I compiled the fixed mmff 0.28 (from this post: [url]https://www.mersenneforum.org/showpost.php?p=535756&postcount=360[/url]) as a CUDA 10.1 build for 64-bit Windows using Microsoft Visual Studio 2012. All test cases should now pass (though the Exp failure problem described in this post: [url]https://www.mersenneforum.org/showpost.php?p=535994&postcount=362[/url] remains unsolved for specific cards). The 2047 version will be posted later.

Fan Ming 2020-02-27 09:44

1 Attachment(s)
I compiled the fixed mmff 0.28 for 64-bit Windows with the GPUSieveSize maximum raised to 2047. Some code in gpusieve.cu appears to negate GPUSieveSize and does arithmetic on signed 32-bit integers, so I did not extend the limit further to 4095. Only the 2047 version is attached here.
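The signed 32-bit constraint can be illustrated with a little arithmetic. This is a minimal Python sketch, not the code in gpusieve.cu: it assumes the "M bits" shown in the logs means 2^20 bits, and the helper names `sieve_bits` and `negatable_in_int32` are hypothetical. Under that assumption, 2047 is the largest setting whose bit count, and its negation, still fit in a signed 32-bit integer, while 4095 would fit only in an unsigned one.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1
BITS_PER_M = 1024 * 1024  # assumption: "M bits" means 2^20 bits


def sieve_bits(gpu_sieve_size_m):
    """Sieve size in bits for a GPUSieveSize setting given in M bits."""
    return gpu_sieve_size_m * BITS_PER_M


def negatable_in_int32(value):
    """True if both value and -value are representable as signed 32-bit."""
    return INT32_MIN <= value <= INT32_MAX and INT32_MIN <= -value <= INT32_MAX


for size in (128, 2047, 2048, 4095):
    bits = sieve_bits(size)
    print(f"GPUSieveSize {size:4d}: {bits:10d} bits | "
          f"signed 32-bit ok: {negatable_in_int32(bits)} | "
          f"unsigned 32-bit ok: {bits <= UINT32_MAX}")
```

Going to 4095 would therefore need unsigned storage plus a rewrite of any code that relies on the negated value.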

kriesel 2020-02-28 22:25

Going to 2047
 
Thanks for the builds, Fan Ming!

As before, Win7x64, GTX1650, etc
128-2047 variation tune feb 28:
[CODE]GPUSievePrimes 810549 GPUSieveSize 128 GpuSieveProcessSize 32 384.14 62W/75 99%
GPUSievePrimes 810549 GPUSieveSize 256 GpuSieveProcessSize 32 386.14 66w 100%
GPUSievePrimes 810549 GPUSieveSize 512 GpuSieveProcessSize 32 386.24 65w 100%
GPUSievePrimes 810549 GPUSieveSize 1024 GpuSieveProcessSize 32 386.65 63w 100%
GPUSievePrimes 810549 GPUSieveSize 2047 GpuSieveProcessSize 32 386.66 *
386.66/384.14= 1.00656 gain from 2047 over 128 GPUSieveSize[/CODE]
I would expect somewhat more gain than that ratio, on faster gpus.
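The gain figure can be reproduced, and converted to extra days of work per year, with a short Python sketch. The throughput numbers are copied from the tune above (their unit is not stated there); the names `TUNE` and `relative_gain` are illustrative, and the days-per-year conversion simply assumes the GPU runs all year at the measured rates.

```python
# (GPUSieveSize, throughput) pairs from the Feb 28 tune above.
TUNE = [(128, 384.14), (256, 386.14), (512, 386.24), (1024, 386.65), (2047, 386.66)]


def relative_gain(baseline, value):
    """Throughput ratio of a setting relative to the baseline setting."""
    return value / baseline


baseline_rate = TUNE[0][1]  # GPUSieveSize 128
for size, rate in TUNE:
    gain = relative_gain(baseline_rate, rate)
    extra_days = (gain - 1.0) * 365  # assumption: continuous running all year
    print(f"GPUSieveSize {size:4d}: gain {gain:.5f}, about {extra_days:.2f} extra days/year")
```

For 2047 versus 128 this gives a ratio of about 1.00656, which works out to roughly 2.4 days per year.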

kriesel 2020-03-24 00:26

Build request
 
For mmff v0.28, I see here,
CUDA ? OS? source only? [URL]https://mersenneforum.org/showpost.php?p=376423&postcount=317[/URL]
CUDA 6 Win x86 and x64 [URL]https://mersenneforum.org/mmff/[/URL]
CUDA 8.0 linux [URL]https://mersenneforum.org/showpost.php?p=497116&postcount=329[/URL]
CUDA 8.0 linux [URL]https://mersenneforum.org/showpost.php?p=497151&postcount=331[/URL]
CUDA 8.0 linux x64 [URL]https://mersenneforum.org/showpost.php?p=497231&postcount=333[/URL]
CUDA 10. win 64 [URL]https://mersenneforum.org/showpost.php?p=505723&postcount=335[/URL]
CUDA 10.1 linux [URL]https://mersenneforum.org/showpost.php?p=535756&postcount=360[/URL]
CUDA 10.1 Win [URL]https://mersenneforum.org/showpost.php?p=538430&postcount=370[/URL]
CUDA 10.1 GpuSieveSize 2047 max Win [URL]https://mersenneforum.org/showpost.php?p=538431&postcount=371[/URL]

Could we also get a CUDA 8.0 Win 64 build with GpuSieveSize 2047 max, posted here? That would suit GTX10xx.

Dylan14 2020-09-02 16:32

2 Attachment(s)
Attached are two builds of mmff v0.28.1 (Gary's source), compiled on Ubuntu 20.04 with CUDA 10.1 and sm_61 (good for Pascal cards, i.e. GTX10xx). The first build uses the default maximum sieve size; the other has the maximum sieve size raised to 2047. Both run the worktodo_check file with no issues. However, MM107 still doesn't work:


[CODE]dylan@dylan-G11CD:~/Desktop/mmff-0.28.1$ ./mmff.exe -v 3
mmff v0.28.1 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
MORE_CLASSES enabled

Runtime options
GPU Sieving enabled
WARNING: Cannot read GPUSievePrimes from mmff.ini, using default value (82486)
GPUSievePrimes depends on worktodo entry
GPUSieveSize 128M bits
WARNING: Cannot read GPUSieveProcessSize from mmff.ini, using default value (8)
GPUSieveProcessSize 8K bits
WorkFile worktodo.txt
Checkpoints enabled
CheckpointDelay 300s
StopAfterFactor disabled
PrintMode full
V5UserID (none)
ComputerID (none)
GPUProgressHeader " class | candidates | time | ETA | raw rate | SievePrimes | CPU wait"
WARNING, no ProgressFormat specified in mmff.ini, using default
ProgressFormat "%C/4620 | %n | %ts | %e | %rM/s | %s"
TimeStampInResults no

CUDA version info
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 10.20

CUDA device info
name GeForce GTX 1060 6GB
compute capability 6.1
maximum threads per block 1024
number of mutliprocessors 10 (unknown number of shader cores)
clock rate 1708MHz

got assignment: MM107, k range 41400000000000 to 41500000000000 (154-bit factors)
Starting trial factoring of MM107 in k range: 41400G to 41500G (154-bit factors)
k_min = 41400000000000
k_max = 41500000000000
Using GPU kernel "mfaktc_barrett160_M107gs"
Verifying (2^(2^107)) % 13435069353863506604210333952641545581205240561 = 549163915026848401193023077053146353871994535742
ERROR: Exponentiation failure[/CODE]
The failure persists even with a leading-edge range, which uses a different kernel from the one Fan Ming used in [URL="https://mersenneforum.org/showpost.php?p=535997&postcount=364"]post 364[/URL]:
[CODE]dylan@dylan-G11CD:~/Desktop/mmff-0.28.1$ ./mmff.exe -v 3
mmff v0.28.1 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
MORE_CLASSES enabled

Runtime options
GPU Sieving enabled
WARNING: Cannot read GPUSievePrimes from mmff.ini, using default value (82486)
GPUSievePrimes depends on worktodo entry
GPUSieveSize 128M bits
WARNING: Cannot read GPUSieveProcessSize from mmff.ini, using default value (8)
GPUSieveProcessSize 8K bits
WorkFile worktodo.txt
Checkpoints enabled
CheckpointDelay 300s
StopAfterFactor disabled
PrintMode full
V5UserID (none)
ComputerID (none)
GPUProgressHeader " class | candidates | time | ETA | raw rate | SievePrimes | CPU wait"
WARNING, no ProgressFormat specified in mmff.ini, using default
ProgressFormat "%C/4620 | %n | %ts | %e | %rM/s | %s"
TimeStampInResults no

CUDA version info
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 10.20

CUDA device info
name GeForce GTX 1060 6GB
compute capability 6.1
maximum threads per block 1024
number of mutliprocessors 10 (unknown number of shader cores)
clock rate 1708MHz

got assignment: MM107, k range 10000000000000000 to 12000000000000000 (162-bit factors)
Starting trial factoring of MM107 in k range: 10P to 12P (162-bit factors)
k_min = 10000000000000000
k_max = 12000000000000000
Using GPU kernel "mfaktc_barrett172_M107gs"
Verifying (2^(2^107)) % 3245185537408870535270390810652173364064364295271 = 249933689397060655837985681873552465902105993031524
ERROR: Exponentiation failure[/CODE]
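The candidate values in the two "Verifying" lines can be cross-checked against the assignment ranges. Any factor of MMp has the form f = 2*k*Mp + 1 with Mp = 2^p - 1, so k can be recovered from f and compared with k_min and k_max, and the bit length of f compared with the "154-bit" and "162-bit" labels. This is a minimal Python sketch with illustrative names (`mersenne`, `recover_k`, `LOGGED`), not mmff code.

```python
def mersenne(p):
    """Return M_p = 2^p - 1."""
    return (1 << p) - 1


def recover_k(f, p):
    """Candidates have the form f = 2*k*M_p + 1; return (k, divides_exactly)."""
    k, rem = divmod(f - 1, 2 * mersenne(p))
    return k, rem == 0


# (p, candidate f, k_min, k_max, bit size) copied from the two logs above.
LOGGED = [
    (107, 13435069353863506604210333952641545581205240561,
     41400000000000, 41500000000000, 154),
    (107, 3245185537408870535270390810652173364064364295271,
     10000000000000000, 12000000000000000, 162),
]

for p, f, k_min, k_max, bits in LOGGED:
    k, exact = recover_k(f, p)
    print(f"k = {k} | exact form: {exact} | "
          f"k in range: {k_min <= k <= k_max} | "
          f"bit length {f.bit_length()} (log says {bits})")
```

If both candidates check out, the moduli themselves are plausible and the problem lies in the exponentiation result, which matches the "Exponentiation failure" message.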


All times are UTC.