mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GMP-ECM (https://www.mersenneforum.org/forumdisplay.php?f=55)
-   -   ECM for CUDA GPUs in latest GMP-ECM ? (https://www.mersenneforum.org/showthread.php?t=16480)

jwaltos 2016-07-23 02:32

I don't know if you want to try this but it should compile with CUDA 6.5. When it does, run your diagnostic and see what you can (and can't) change to make it work in the later CUDA versions.

WraithX 2016-07-23 02:59

[QUOTE=wombatman;438539]Running into a problem I posted previously on here, but my solution then isn't working now. I'm trying to get GPU-ECM running on Ubuntu 14.04 with a GTX 570. I configured for CC 2.0, and everything compiles without issue.

I can pass all the standard ECM tests, so I don't think there's anything wrong there, but when I try to run test.gpuecm, I fail immediately with a "too many resources requested for launch" error. Using verbose, the correct GPU is identified along with the correct compute capability (2.0).

The block size is 32x32x1, and the grid size is 1x1x1. CUDA version is 7.5.

Any help/advice is appreciated.[/QUOTE]
I believe there is a chance this is due to the program using too many CUDA registers. I don't see a build.log in my own gmp-ecm directories, but the information we need is printed to the screen while "make" is running. If you can, copy the portion of the "make" output where it prints which CUDA capabilities it is compiling for. ptxas is given the "-v" parameter, so it should report things like the number of registers used, how much smem is used, etc. With all of this information we can see if it is requesting too many registers or too much memory.
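
If it helps, one way to capture that output is to log the whole build and pull out the ptxas lines afterwards (a sketch, assuming a POSIX shell):
[CODE]make 2>&1 | tee build.log
grep -B1 -A3 "ptxas info" build.log[/CODE]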

wombatman 2016-07-23 15:31

Here's the PTXAS info for sm_20:
[CODE]ptxas info : Compiling entry function '_Z15Cuda_Ell_DblAddPA32_VjS1_S1_S1_j' for 'sm_20'
ptxas info : Function properties for _Z15Cuda_Ell_DblAddPA32_VjS1_S1_S1_j
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 35 registers, 24576 bytes smem, 68 bytes cmem[0]
ptxas info : Compiling entry function '_Z16Cuda_Init_Devicev' for 'sm_20'
ptxas info : Function properties for _Z16Cuda_Init_Devicev
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
ptxas info : 384 bytes gmem, 4 bytes cmem[3]
[/CODE]

WraithX 2016-07-23 16:46

[QUOTE=wombatman;438586]Here's the PTXAS info for sm_20:
[CODE]ptxas info : Compiling entry function '_Z15Cuda_Ell_DblAddPA32_VjS1_S1_S1_j' for 'sm_20'
ptxas info : Function properties for _Z15Cuda_Ell_DblAddPA32_VjS1_S1_S1_j
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 35 registers, 24576 bytes smem, 68 bytes cmem[0]
ptxas info : Compiling entry function '_Z16Cuda_Init_Devicev' for 'sm_20'
ptxas info : Function properties for _Z16Cuda_Init_Devicev
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
ptxas info : 384 bytes gmem, 4 bytes cmem[3]
[/CODE][/QUOTE]

Hmmm, I think it is running out of registers. I believe it is launching blocks of 32x32 threads, and each thread requests 35 registers, so a block requests 32x32x35 = 35840 registers, but CC 2.x can only support 32768 per multiprocessor. Would you mind posting about this on the ecm-discuss list? Cyril may be able to help figure out a way to fix this problem.

One potential workaround is to reduce ECM_GPU_CURVES_BY_BLOCK to 16 in ecm-gpu.h. However, this will halve the total number of curves you can run at a time. Also, in your post to ecm-discuss, let them know whether reducing that number fixed the issue.
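
For reference, the change amounts to something like this (a sketch; the exact form of the #define in ecm-gpu.h can differ between versions, so check the file first):
[CODE]grep -n "ECM_GPU_CURVES_BY_BLOCK" ecm-gpu.h
# change the value from 32 to 16 on the line it reports, then rebuild:
make clean && make && make check[/CODE]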

wombatman 2016-07-23 19:49

[QUOTE=WraithX;438593]Hmmm, I think it is running out of registers. I believe it is launching blocks of 32x32 threads, and each thread requests 35 registers, so a block requests 32x32x35 = 35840 registers, but CC 2.x can only support 32768 per multiprocessor. Would you mind posting about this on the ecm-discuss list? Cyril may be able to help figure out a way to fix this problem.

One potential workaround is to reduce ECM_GPU_CURVES_BY_BLOCK to 16 in ecm-gpu.h. However, this will halve the total number of curves you can run at a time. Also, in your post to ecm-discuss, let them know whether reducing that number fixed the issue.[/QUOTE]

Can you point me to the list? Sorry!

Changing CURVES_BY_BLOCK does get it to pass several of the tests in test.gpuecm. It fails at (2^718+1)/5, returning code 0 instead of 6, but maybe that's due to only running 16 curves instead of 32?

WraithX 2016-07-24 06:10

[QUOTE=wombatman;438613]Can you point me to the list? Sorry!

Changing CURVES_BY_BLOCK does get it to pass several of the tests in test.gpuecm. It fails at (2^718+1)/5, returning code 0 instead of 6, but maybe that's due to only running 16 curves instead of 32?[/QUOTE]

Sure thing, you can join it over here:
[url]http://lists.gforge.inria.fr/mailman/listinfo/ecm-discuss[/url]

That page details what email address to send a message to, and gives you a link to the archives of old discussions.

I think you're right about the test failing since it is only running 16 curves instead of 32. We can modify the test later if need be. Let's see what Cyril and Paul have to say about your issue first.

WraithX 2016-07-30 12:37

[QUOTE=wombatman;438613]Can you point me to the list? Sorry!

Changing CURVES_BY_BLOCK does get it to pass several of the tests in test.gpuecm. It fails at (2^718+1)/5, returning code 0 instead of 6, but maybe that's due to only running 16 curves instead of 32?[/QUOTE]

Do you receive the emails from the ecm-discuss list? If not, then you should know that both Paul and Cyril have replied to you asking for more details, or asking if you could try different things.

You can see the emails at the archive page here:
[url]https://lists.gforge.inria.fr/pipermail/ecm-discuss/2016-July/thread.html[/url]

wombatman 2016-07-30 13:35

Hey, I did see and reply to those. It seems it didn't actually get sent out to the listserv for some reason. I'll go in and double-check what happened.

Edit: It helps if I actually send my response to the listserv instead of just Cyril. Sorry about that.

Dubslow 2016-07-30 17:07

[QUOTE=wombatman;439060]
Edit: It helps if I actually send my response to the listserv instead of just Cyril. Sorry about that.[/QUOTE]

Every time I try and drive by a mailing list, I do the exact same thing. It's happened at least 3 times and I never seem to learn from it (possibly because I only do it ~once a year).

cgy606 2016-08-04 06:52

[QUOTE=wombatman;437465]Honestly, I have no idea about 7 to 10 compatibility, but I don't think it should. I compile with VS2012 since I got a free copy of the full edition somewhere along the line.[/QUOTE]

Can you give me an idea of how to install gpu-ecm on a Windows machine? I downloaded all the files, but the README, README.gpu, and INSTALL-ecm files show up in a messed-up format in Notepad. I have downloaded Visual Studio 2015, so I believe I have the tools to compile gmp-ecm and build it, but I'm not sure how to proceed.

VBCurtis 2016-08-04 15:01

Try WordPad; Notepad sometimes pukes on files with Unix line endings.
(sorry, I have no compilation advice)

henryzz 2016-08-04 15:39

[QUOTE=VBCurtis;439327]Try WordPad; Notepad sometimes pukes on files with Unix line endings.
(sorry, I have no compilation advice)[/QUOTE]

Better still for this sort of thing: Notepad++

xilman 2016-08-04 17:37

[QUOTE=henryzz;439330]Better still for this sort of thing Notepad++[/QUOTE]Even better, IMAO, is [URL="http://ergoemacs.org/emacs/which_emacs.html"]Emacs[/URL].

Let the religious wars commence. :devil:

cgy606 2016-08-05 06:04

[QUOTE=VBCurtis;439327]Try WordPad; Notepad sometimes pukes on files with Unix line endings.
(sorry, I have no compilation advice)[/QUOTE]

Wordpad works fine, thanks.

Unfortunately, I have absolutely no idea how to build anything in Visual Studio. I am trying to build gpu-ecm, so I am following the instructions. You first have to build gmp-ecm, and in order to build that on Windows, you need yasm. I got that and put it into VS2012, but I don't know how to change the YASMPATH environment variable, so I don't know if that part is working. Next, I need to get MPIR built. I am reading its readme, and it says to open mpir.sln in the build_vc2012 folder and run the script and build, but I am getting all kinds of errors (again, I don't know if I am even building it correctly). I have no idea how to build using VS because I have never done it. I know some of you have built gpu-ecm on Windows machines. Any tips would be appreciated...

wombatman 2016-08-05 14:42

1 Attachment(s)
I truthfully don't remember all the steps I went through to get everything compiled successfully with VS2012 (I have no experience with VS2015). I'm attaching the gpu_ecm exe I have built. Try it out and see what dlls you need for it.

cgy606 2016-08-05 17:55

I extracted the ecm_gpu.exe file and running with:

echo "144!+5" | ecm_gpu.exe -gpu -savea gpu.save -c 256 50000 0 >> gpu.out

And I got an error of "missing cudart64_70.dll". So I went to the GeForce driver page, downloaded the latest NVIDIA package, and installed it. It installed; however, when I went to rerun ecm_gpu.exe, I got the same error. What NVIDIA drivers did you install?

wombatman 2016-08-05 21:20

1 Attachment(s)
You would need to install CUDA 7.0. Alternatively, you should be able to use the dll attached. :smile:

I'm also attaching mpir.dll, which I think you'll need too.

cgy606 2016-08-05 22:32

[QUOTE=wombatman;439419]You would need to install CUDA 7.0. Alternatively, you should be able to use the dll attached. :smile:

I'm also attaching mpir.dll, which I think you'll need too.[/QUOTE]

Thanks, I think it is working now. Is there a way to multi-thread stage 2 when running the resume command?

wombatman 2016-08-05 22:49

You'll want to find ecm.py, written by WraithX on these forums. He recently added the ability to resume a save file in a multithreaded fashion using that script.
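
If the script isn't handy, the same effect can be had manually (a minimal sketch, assuming a 512-line save file, 4 cores, and the B1=50000 run from above; ecm.py automates this kind of split):
[CODE]split -l 128 gpu.save part_
for f in part_*; do ecm -resume "$f" 50000 > "$f.out" & done
wait
grep -i "Factor found" part_*.out[/CODE]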

cgy606 2016-08-10 16:59

I am using a GTX 980 on my laptop with 2048 CUDA cores and running 1024 stage 1 instances per process. I have been running curves using ecm_gpu (stage 1 on my GPU and multi-threading stage 2 on my CPU using the ecm.py script). Suddenly, the GPU speed was cut in half. I tried playing around with the number of curves running on my GPU using the -gpucurves flag. It was still slow. Then I shut my laptop off and restarted... it returned to normal. Has anybody ever experienced this? My concern is that it might go back to "half speed" later on in the middle of some big job...

xilman 2016-08-10 17:42

[QUOTE=cgy606;439738]I am using a GTX 980 on my laptop with 2048 CUDA cores and running 1024 stage 1 instances per process. I have been running curves using ecm_gpu (stage 1 on my GPU and multi-threading stage 2 on my CPU using the ecm.py script). Suddenly, the GPU speed was cut in half. I tried playing around with the number of curves running on my GPU using the -gpucurves flag. It was still slow. Then I shut my laptop off and restarted... it returned to normal. Has anybody ever experienced this? My concern is that it might go back to "half speed" later on in the middle of some big job...[/QUOTE]Not experienced it myself, but laptops are notorious for slowing things down if they feel that the temperature is too high or the estimated battery lifetime is too low. I had endless fun persuading my MacBook that I knew better than it did when it came to deciding whether or not to use the GPU.

Gordon 2016-08-10 17:57

[QUOTE=cgy606;439738]I am using a GTX 980 on my laptop with 2048 CUDA cores and running 1024 stage 1 instances per process. I have been running curves using ecm_gpu (stage 1 on my GPU and multi-threading stage 2 on my CPU using the ecm.py script). Suddenly, the GPU speed was cut in half. I tried playing around with the number of curves running on my GPU using the -gpucurves flag. It was still slow. Then I shut my laptop off and restarted... it returned to normal. Has anybody ever experienced this? My concern is that it might go back to "half speed" later on in the middle of some big job...[/QUOTE]

This happens when you start a second copy of gmp-ecm with only a single GPU available... perhaps the same applies to ecm_gpu?

wombatman 2016-08-10 18:37

Yeah, my assumption would be the GPU getting too hot. GPU-ECM will definitely raise the temperature.

LaurV 2016-08-11 05:56

Get GPU-Z and run it in parallel with GPU-ECM, and watch it closely. If it was a temperature issue, it will happen again after about the same amount of time; when it does, look at what GPU-Z says at that moment (it reports why the card's speed is restricted: temperature issues, power issues, etc.). Check whether the clock changes (it may drop, to reduce power, or it may stay put and reduce power by inserting idle cycles; each method has advantages and disadvantages, but for a clear temperature issue the clock will be cut, for sure).
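
On systems with the NVIDIA driver installed, much the same information is available from the command line as well (a sketch; the exact section names vary a little between driver versions):
[CODE]nvidia-smi -q -d PERFORMANCE
# look for the "Clocks Throttle Reasons" section (thermal slowdown, power cap, etc.)[/CODE]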

cgy606 2016-08-19 22:29

[QUOTE=wombatman;439399]I truthfully don't remember all the steps I went through to get everything compiled successfully with VS2012 (I have no experience with VS2015). I'm attaching the gpu_ecm exe I have built. Try it out and see what dlls you need for it.[/QUOTE]

Hi Wombatman,

I was wondering if you have thought about the steps you took to compile gmp-ecm for gpu-ecm. I have the standard 1018-bit version and was wondering if a lower-bit version would run more quickly and still be effective (as in finding factors in stage 1, or in stage 2 on the CPU). This user from a much earlier post seems to have gotten it to work:

[QUOTE=debrouxl;289494]I've switched to CC 2.0 compilation as well, and the default number of curves has been raised from 32 to 64 - same change as xilman above.

I haven't yet seen a mention of non-power of 2 NB_DIGITS in this thread... therefore, I tried it, even if I have no idea whether it should work :smile:
Well, at least, it does not seem to fail horribly:
* the resulting executable doesn't crash;
* the size of the executable is between the size of the 512-bit version and the size of the 1024-bit version;
* on both a C211 and a C148, the 768-bit version is faster than the 1024-bit-arithmetic version:

[code]$ echo 7666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666663 | ./gpu_ecm_24 -vv -save 76663_210_ecm24_3e6 3000000
#Compiled for a NVIDIA GPU with compute capability 2.0.
#Will use device 0 : GeForce GT 540M, compute capability 2.1, 2 MPs.
#s has 4328086 bits
Precomputation of s took 0.252s
Input number is 7666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666663 (212 digits)
Using B1=3000000, firstinvd=435701810, with 64 curves
...
gpu_ecm took : 1444.690s (0.000+1444.686+0.004)
Throughput : 0.044

$ echo 7666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666663 | ./gpu_ecm_32 -vv -save 76663_210_ecm32_3e6 3000000
...
gpu_ecm took : 1814.801s (0.000+1814.797+0.004)
Throughput : 0.035[/code]

[code]for i in 16 24 32; do echo 3068628376360794912078530386432442844396649484227245118385713667577336042284107359110543525586164007547649873239035755922916136752709773803297694127 | "./gpu_ecm_$i" -vv -save "80009_213_ecm${i}_3e6" 3000000; done
...
gpu_ecm took : 865.578s (0.000+865.574+0.004)
Throughput : 0.074
...
gpu_ecm took : 1707.302s (0.000+1707.298+0.004)
Throughput : 0.037
...
gpu_ecm took : 2044.451s (0.000+2044.447+0.004)
Throughput : 0.031
[/code]


Comparison against CPU GMP-ECM running on 1 hyperthread of a SandyBridge i7, whose other 7 hyperthreads are used to the max as well:
[code]$ echo 7666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666663 | ecm -c 1 3e6
GMP-ECM 6.5-dev [configured with GMP 5.0.90, --enable-asm-redc, --enable-assert] [ECM]
Input number is 7666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666663 (211 digits)
Using B1=3000000, B2=5706890290, polynomial Dickson(6), sigma=1718921992
Step 1 took 34590ms
Step 2 took 11536ms[/code]

[code]$ echo 3068628376360794912078530386432442844396649484227245118385713667577336042284107359110543525586164007547649873239035755922916136752709773803297694127 | ecm -c 1 3e6
GMP-ECM 6.5-dev [configured with GMP 5.0.90, --enable-asm-redc, --enable-assert] [ECM]
Input number is 3068628376360794912078530386432442844396649484227245118385713667577336042284107359110543525586164007547649873239035755922916136752709773803297694127 (148 digits)
Using B1=3000000, B2=5706890290, polynomial Dickson(6), sigma=3766168691

Step 1 took 21521ms
Step 2 took 8016ms[/code]

For composites of those sizes, the GT 540M can beat one hyperthread of i7-2670QM if the CPU is busy, but not if the CPU is idle.[/QUOTE]

I was thinking it would be nice to have multiple versions of gpu-ecm at varying bit levels in order to speed up stage 1 on the GPU relative to stage 2 on the CPU. I have observed a constant 3.2 hr on my GTX 980 for all inputs C307 and below for 1024 curves at B1 = 43M. My quad-core 4 GHz processor takes 4.1 hr to run 1024 stage 2 curves (using 8 hyper-threads). However, for a C155, CPU stage 2 takes only 2.4 hr, so I am looking to make up the difference using a 634-bit version (changing NB_DIGITS from 32 to 20 in the standard 1018-bit build) for this particular number. Any help would be appreciated...
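
For anyone trying this themselves, the rebuild amounts to something like the following (a sketch; the header that defines NB_DIGITS has moved around between GMP-ECM versions, so locate it first):
[CODE]grep -rn "define NB_DIGITS" .
# change the value in the header it reports (e.g. 32 -> 20), then:
make clean && make && make check[/CODE]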

wombatman 2016-08-19 23:06

1 Attachment(s)
Here's the program compiled with NB_DIGITS set to 20. I would try finding a known factor to ensure it works properly. It passed the first few tests in test.gpuecm, which generally indicates it is working, but it's good to be sure.

cgy606 2016-08-20 04:52

Some interesting behavior, to say the least. I ran a test on a C147 that I know has a p10, p17, p20, and p20 at B1 = 250K. I ran 512 curves on the GPU using NB_DIGITS = 20. It found all the factors after stage 1:

[CODE]GMP-ECM 7.0.1-dev [configured with MPIR 2.7.0, --enable-gpu, --enable-openmp] [ECM]
Input number is (99!+5)/9176362385 (147 digits)
Using B1=250000, B2=0, sigma=3:3407157017-3:3407157528 (512 curves)
Block: 20x32x1 Grid: 16x1x1

Computing 512 Step 1 took 2328ms of CPU time / 19709ms of GPU time
********** Factor found in step 1: 5275321151
Found probable prime factor of 10 digits: 5275321151
Composite cofactor ((99!+5)/9176362385)/5275321151 has 137 digits
********** Factor found in step 1: 42645646522247063
Found probable prime factor of 17 digits: 42645646522247063
Composite cofactor (((99!+5)/9176362385)/5275321151)/42645646522247063 has 120 digits
********** Factor found in step 1: 61133702826671342149
Found probable prime factor of 20 digits: 61133702826671342149
Composite cofactor ((((99!+5)/9176362385)/5275321151)/42645646522247063)/61133702826671342149 has 100 digits
********** Factor found in step 1: 31905776268663843113
Found probable prime factor of 20 digits: 31905776268663843113
Probable prime cofactor (((((99!+5)/9176362385)/5275321151)/42645646522247063)/61133702826671342149)/31905776268663843113 has 81 digits


[/CODE]However, I then tried to run on the GPU a composite for which I was reasonably certain stage 1 wouldn't find factors. I chose a C189 that has two p24s in it, using B1 = 250K. I ran a total of 3072 curves at that B1 and no factors were found in stage 1. I then ran stage 2 on the save files using the ecm.py script written by WraithX, and I didn't find a single factor in stage 2 (finding no factor after a 7*t30 search, when I know there are two p24s, is highly improbable)

The scientist in me decided that I should try another 512 curves at B1 = 250K (a t30 search) using the default NB_DIGITS = 32. Here is the output I got:[CODE]

ON GPU

GMP-ECM 7.0.1-dev [configured with MPIR 2.7.0, --enable-gpu, --enable-openmp] [ECM]
Input number is (126!+5)/79768672096773991353065 (189 digits)
Using B1=250000, B2=0, sigma=3:290623459-3:290623970 (512 curves)
Block: 32x32x1 Grid: 16x1x1
300000 iterations to go
200000 iterations to go
100000 iterations to go
90000 iterations to go
80000 iterations to go
70000 iterations to go
60000 iterations to go
50000 iterations to go
40000 iterations to go
30000 iterations to go
20000 iterations to go
10000 iterations to go
GPU: factor 324295084094116127662247 found in Step 1 with curve 368 (-sigma 3:290623827)
Computing 512 Step 1 took 2344ms of CPU time / 33876ms of GPU time
********** Factor found in step 1: 324295084094116127662247
Found probable prime factor of 24 digits: 324295084094116127662247
Composite cofactor ((126!+5)/79768672096773991353065)/324295084094116127662247 has 165 digits

STAGE 2 RESUME ON CPU

-> ___________________________________________________________________
-> | Running ecm.py, a Python driver for distributing GMP-ECM work |
-> | on a single machine. It is copyright, 2011-2016, David Cleaver |
-> | and is a conversion of factmsieve.py that is Copyright, 2010, |
-> | Brian Gladman. Version 0.40 (Python 2.6 or later) 6th Aug 2016 |
-> |_________________________________________________________________|

-> Resuming work from resume file: 126fac5_250e3_0.save
-> Spreading the work across 8 thread(s)
->=============================================================================
-> Working on the number(s) in the resume file: 126fac5_250e3_0.save
-> Using up to 8 instances of GMP-ECM...
-> Found 512 unique resume lines to work on.
-> Will start working on the 512 resume lines.
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t00.txt 250000 > resume_job_126fac5_250e3_0-save_out_t00.txt (64 resume lines)
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t01.txt 250000 > resume_job_126fac5_250e3_0-save_out_t01.txt (64 resume lines)
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t02.txt 250000 > resume_job_126fac5_250e3_0-save_out_t02.txt (64 resume lines)
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t03.txt 250000 > resume_job_126fac5_250e3_0-save_out_t03.txt (64 resume lines)
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t04.txt 250000 > resume_job_126fac5_250e3_0-save_out_t04.txt (64 resume lines)
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t05.txt 250000 > resume_job_126fac5_250e3_0-save_out_t05.txt (64 resume lines)
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t06.txt 250000 > resume_job_126fac5_250e3_0-save_out_t06.txt (64 resume lines)
-> ecm -resume resume_job_126fac5_250e3_0-save_inp_t07.txt 250000 > resume_job_126fac5_250e3_0-save_out_t07.txt (64 resume lines)
GMP-ECM 7.0.1-dev [configured with MPIR 2.7.0, --enable-gpu, --enable-openmp] [ECM]
Using B1=250000-250000, B2=128992510, polynomial Dickson(3), 8 threads
____________________________________________________________________________
Curves Complete | Average seconds/curve | Runtime | ETA
-----------------|---------------------------|---------------|--------------
17 of 512 | Stg1 0.000s | Stg2 0.523s | 0d 00:00:02 | 0d 00:02:09

Resume line 17 out of 512:

Using B1=250000-250000, B2=128992510, polynomial Dickson(3), sigma=3:290623649
Step 1 took 0ms
Step 2 took 594ms
********** Factor found in step 2: 452655830807187689684039
Found probable prime factor of 24 digits: 452655830807187689684039
Probable prime cofactor (((126!+5)/79768672096773991353065)/324295084094116127662247)/452655830807187689684039 has 142 digits[/CODE]Not only does it find one of the factors in stage 1 (after 512 stage 1 runs at 250K), but it finds the other p24 after only 17 curves in stage 2. Now, I know ecm is a probabilistic algorithm, but these results are not due to sheer 'luck'...

I think something is grossly wrong in the ECM code when NB_DIGITS is changed from the default (I know Cyril has pointed to this), but I haven't a clue what it is...

henryzz 2016-08-20 07:09

From memory only 16 and 32 worked when I looked at it a long while ago. Not a clue why.

VBCurtis 2016-08-20 14:31

I agree with Henry: only power-of-2 values worked, and only 32 worked without issue. 16 and 64 "mostly worked", but were never found to be 100% reliable for anyone.
IMO, it's not worth the missed factors to gain the time savings from using 16, unless you're running CPU-only ECM in parallel.

wombatman 2016-08-20 15:12

Yeah, I played around with NB_DIGITS before, but only bumping it up, as I recall. It will compile and run and can find factors in stage 1, but it always had issues with stage 2.

RichD 2016-12-07 16:30

I am now trying to build GMP-ECM w/ CUDA on my Linux box. A few hundred curves a day is not cutting it for me anymore. :smile:

I tried configure with --enable-gpu and it doesn't find cuda.h. I then added --with-cuda=/usr/local/cuda-8.0.

Not sure why I needed to do that because other packages (mfaktc, mmff, msieve) all build/make fine as is.

BTW, I’ve added the following to .bashrc per nvidia documentation.
[CODE]export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH[/CODE]

The output is the following, so I am lost.
[CODE]configure: Using cuda.h from /usr/local/cuda-8.0/include
checking cuda.h usability... no
checking cuda.h presence... yes
configure: WARNING: cuda.h: present but cannot be compiled
configure: WARNING: cuda.h: check for missing prerequisite headers?
configure: WARNING: cuda.h: see the Autoconf documentation
configure: WARNING: cuda.h: section "Present But Cannot Be Compiled"
configure: WARNING: cuda.h: proceeding with the compiler's result
configure: WARNING: ## ------------------------------------------------ ##
configure: WARNING: ## Report this to ecm-discuss@lists.gforge.inria.fr ##
configure: WARNING: ## ------------------------------------------------ ##
checking for cuda.h... no
configure: error: required header file missing[/CODE]

I searched lists.gforge.inria.fr but couldn’t find anything meaningful.
I don't recall seeing a pre-built Linux x86_64 for CUDA 8.0 fly by.
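
When configure reports a header as "present but cannot be compiled", the actual compiler error is recorded in config.log, which is usually the fastest way to see what went wrong (a sketch):
[CODE]grep -n "cuda.h" config.log | head
# then open config.log near those lines to see the failing test and the real error[/CODE]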

jwaltos 2016-12-08 05:47

[QUOTE=RichD;448673]

I searched lists.gforge.inria.fr but couldn’t find anything meaningful.
I don't recall seeing a pre-built Linux x86_64 for CUDA 8.0 fly by.[/QUOTE]

This is something I used that worked before. Perhaps it may help...or not.

[CODE][jw@localhost ~]$ export PATH=/usr/local/cuda-7.0/bin:$PATH
[jw@localhost ~]$ export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH
[jw@localhost trgpu]$ su
Password:
[root@localhost trgpu]# ldconfig /usr/local/cuda-7.0/lib64
[root@localhost trgpu]# make check

## GPU program works only when in root mode.[/CODE]

chris2be8 2016-12-08 16:37

I had a lot of fun getting it to work. See [url]http://mersenneforum.org/showthread.php?p=364978&highlight=nouveau+update-initramfs#post364978[/url] and nearby posts for details of what I had to do.

Chris

RichD 2016-12-09 05:42

I don’t have a problem with CUDA or my GPU setup. I am running two GPU cards in a box with nvidia toolkit, samples, SDK and drivers all working properly. I have been running mfaktc for months using CUDA 8.0. At times I will stop one of the mfaktc jobs and run mmff for a while. I even run msieve poly selection using the GPU with no problems building or making.

I can't get past the configure step in GMP-ECM, let alone even trying to make it. I do have a successful make without the GPU.

I don't want to go back to CUDA 7.0 or 7.5 because they both had the carry-bit bug in them. I'll just wait until the next release of GMP-ECM's configure for CUDA 8.0 to see if that will address my problem.

0PolarBearsHere 2016-12-11 10:52

[QUOTE=wombatman;439399]I truthfully don't remember all the steps I went through to get everything compiled successfully with VS2012 (I have no experience with VS2015). I'm attaching the gpu_ecm exe I have built. Try it out and see what dlls you need for it.[/QUOTE]

Are you able to compile a cuda8 version?

wombatman 2016-12-12 03:40

Not a clue, but I can give it a shot. What compute capability do you want/need?

0PolarBearsHere 2017-02-19 05:59

I've got a gtx1080 that I want to test with ecm to see how it compares with trial factoring.

VBCurtis 2017-02-19 06:06

I can't think of any candidate numbers that fit within the GPU-ECM 1018-bit limit but would otherwise be trial-factored rather than sieved. What numbers are you interested in testing?

xilman 2017-02-19 07:25

[QUOTE=0PolarBearsHere;453218]I've got a gtx1080 that I want to test with ecm to see how it compares with trial factoring.[/QUOTE]I can provide a bunch of GCW numbers if you are interested.

0PolarBearsHere 2017-02-23 23:54

[QUOTE=xilman;453222]I can provide a bunch of GCW numbers if you are interested.[/QUOTE]

Definitely interested. I'll need some instructions about how to test them though.

pepi37 2017-02-25 14:59

[QUOTE=VBCurtis;453219]I can't think of any candidate numbers that fit within the GPU-ECM 1018-bit limit but would otherwise be trial-factored rather than sieved. What numbers are you interested in testing?[/QUOTE]

So the GPU has a limit when you use ECM; what about -pm1? Is the limit the same in that case?

For the last two weeks I have been playing with GMP-ECM (ver 7.0.4) for Windows. When I did P-1 in Prime95, I learned what I needed to learn in one day and found many factors. But the learning curve of GMP-ECM is a different thing entirely. It is so erratic.

For example

[QUOTE]C:\Users\Alpha-I7\Desktop\New folder\ecm>ecm -mpzmod -one -v -nobase2 50 < test.txt
GMP-ECM 7.0.4-dev [configured with GMP 6.1.1] [ECM]
Running on Alpha-I7-PC
Input number is 4*53^500288+1 (862636 digits)
Using REDC
Using B1=50, B2=1356, polynomial x^1, sigma=0:10519400687847901454
dF=12, k=2, d=90, d2=7, i0=-6
Expected number of curves to find a factor of n digits:
35 40 45 50 55 60 65 70 75 80
Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
Step 1 took 58407ms
********** Factor found in step 1: 123685[/QUOTE]

Since I use [U][B]-one[/B][/U] on the GMP-ECM command line, it should skip to the next candidate when a factor is found, but no: it stops (I terminated it after 15 minutes).
And setting B1 and B2 is so much different from setting the same values in Prime95.

I would like to "master" GMP-ECM to get at least the same speed and the same number of factors found as in Prime95, but it looks like that will be a hard task.

VBCurtis 2017-02-25 16:44

The GPU's advantage is running many curves in parallel; pm1 does a single curve, so a GPU would be a total waste for pm1.

ET_ 2017-02-25 16:47

[QUOTE=VBCurtis;453733]The GPU's advantage is running many curves in parallel; pm1 does a single curve, so a GPU would be a total waste for pm1.[/QUOTE]

...unless you use CUDA-Pm1 that parallelizes the FFT calculations...
But that's a totally different subject.

pepi37 2017-02-25 16:56

[QUOTE=VBCurtis;453733]The GPU's advantage is running many curves in parallel; pm1 does a single curve, so a GPU would be a total waste for pm1.[/QUOTE]

Many curves in parallel, up to the limit.

pepi37 2017-02-25 16:57

[QUOTE=ET_;453734]...unless you use CUDA-Pm1 that parallelizes the FFT calculations...
But that's a totally different subject.[/QUOTE]

Some more info? A link, please!

That program is only for base 2; I am working outside that :)

xilman 2017-02-25 18:11

[QUOTE=pepi37;453727]So GPU has limit, when you use ECM, and what is about -pm1?
In that case is same limit ?[/QUOTE]The limit is on the arithmetic, not the algorithms built on top of it.

pepi37 2017-02-25 18:19

[QUOTE=xilman;453745]The limit is on the arithmetic, not the algorithms built on top of it.[/QUOTE]
So for now the GPU option in GMP-ECM is not a choice for me :(
Continuing to explore GMP-ECM.....

bsquared 2017-09-27 16:50

Maybe I missed it, but I haven't seen any benchmark numbers for GPU-ECM since several years ago in this thread. Could I ask someone to do that for current-generation GPUs? e.g., how long does it take to run N parallel curves for each of the 512-bit and 1018-bit versions (which I understand to be the only reliable ones)?

wombatman 2017-09-27 18:33

[QUOTE=bsquared;468655]Maybe I missed it, but I haven't seen any benchmark numbers for GPU-ECM since several years ago in this thread. Could I ask someone to do that for current-generation GPUs? e.g., how long does it take to run N parallel curves for each of the 512-bit and 1018-bit versions (which I understand to be the only reliable ones)?[/QUOTE]

Got a particular number (or size of number) in mind?

bsquared 2017-09-27 18:59

[QUOTE=wombatman;468662]Got a particular number (or size of number) in mind?[/QUOTE]

The way I understand things, it doesn't matter; any number will run in the same amount of time (although different for the different versions 512 vs. 1018-bit). Someone let me know if that's not the case.

I'm currently working on the 197 digit cofactor of 149^70+70^149:
[CODE]18990123508557902868419834986612849212629329047848408031918356871905091180915018818857084783656883790655928770567349115604303665109625610225945717466120031386943078873289069222693826896799892056101[/CODE]

which you can use for the larger ECM version. Pick anything you like for the smaller one.

wombatman 2017-09-27 19:02

[QUOTE=bsquared;468663]The way I understand things, it doesn't matter; any number will run in the same amount of time (although different for the different versions 512 vs. 1018-bit). Someone let me know if that's not the case.

I'm currently working on the 197 digit cofactor of 149^70+70^149:
[CODE]18990123508557902868419834986612849212629329047848408031918356871905091180915018818857084783656883790655928770567349115604303665109625610225945717466120031386943078873289069222693826896799892056101[/CODE]

which you can use for the larger ECM version. Pick anything you like for the smaller one.[/QUOTE]

:tu: I'll run the lower levels tonight as I get a chance to. That will provide an approximate trend, as I've noticed that the time involved for higher B1s is roughly linear with B1.

xilman 2017-09-27 19:26

[QUOTE=bsquared;468655]Maybe I missed it, but I haven't seen any benchmark numbers for GPU-ECM since several years ago in this thread. Could I ask someone to do that for current-generation GPUs? e.g., how long does it take to run N parallel curves for each of the 512-bit and 1018-bit versions (which I understand to be the only reliable ones)?[/QUOTE]I'll see what I can do. For better or worse I seem to have ended up with the job of maintaining GPU-ECM. Any assistance with that task will be much appreciated. In particular, evidence of inadequacies of the software will be useful. Even more useful will be contributions to its enhancement.

BTW, and AFAIK, the 512-bit version is really 506-bit limited. One of the things on my to-do list is to allow for more versions. Ideally the end-user shouldn't have to predetermine the size of the arithmetic. Another WIBNI is to implement stage 2 on the GPU. That might be the easier of the two. I would also like to extend the ECMNET client to use a GPU where available.

wombatman 2017-09-27 20:04

[QUOTE=xilman;468666]I'll see what I can do. For better or worse I seem to have ended up with the job of maintaining GPU-ECM. Any assistance with that task will be much appreciated. In particular, evidence of inadequacies of the software will be useful. Even more useful will be contributions to its enhancement.

BTW, and AFAIK, the 512-bit version is really 506-bit limited. One of the things on my to-do list is to allow for more versions. Ideally the end-user shouldn't have to predetermine the size of the arithmetic. Another WIBNI is to implement stage 2 on the GPU. That might be the easier of the two. I would also like to extend the ECMNET client to use a GPU where available.[/QUOTE]

I can do little to nothing on the programming side of things, but I'm always happy to try and help out with testing. Please feel free to PM me when or if you want help there.

VBCurtis 2017-09-28 03:03

[QUOTE=wombatman;468664]:tu: I'll run the lower levels tonight as I get a chance to. That will provide an approximate trend, as I've noticed that the time involved for higher B1s is roughly linear with B1.[/QUOTE]

Timing is within 1% of linear from 60M to 1200M (the smallest and largest bounds I've used for GPU-ECM). So, we can benchmark by testing a single B1 bound across cards.

To make the bench short-ish, how about the t50 standard B1 = 43M? I'll report 750ti numbers tomorrow.
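
In command form, the benchmark could reuse the invocation style shown elsewhere in this thread (a sketch; N holds the C197 posted above, and the trailing 0 keeps it stage 1 only):
[CODE]echo "$N" | ecm -gpu -save bench.save 43e6 0[/CODE]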

wombatman 2017-09-28 04:54

I checked B1=3M and 11M, and they were also within 1% of each other (0.6%, to be precise), so we could probably use one of those to decrease the time. If reproducibility is a concern, B1=3M could be run, say, 3 times and averaged.

wombatman 2017-09-28 12:53

Times for the C197 bsquared put up on a GTX980Ti:

B1=3M: 776.75s
B1=11M: 2830.76s
B1=43M: 11235.902s

As noted by Curtis, these all scale linearly with B1 within ~1%.

bsquared 2017-09-28 13:16

[QUOTE=wombatman;468709]Times for the C197 bsquared put up on a GTX980Ti:

B1=3M: 776.75s
B1=11M: 2830.76s
B1=43M: 11235.902s

As noted by Curtis, these all scale linearly with B1 within ~1%.[/QUOTE]

Thank you, much appreciated. How many curves does that card run in parallel? I'm trying to gauge the throughput of these things...

wombatman 2017-09-28 13:28

[QUOTE=bsquared;468710]Thank you, much appreciated. How many curves does that card run in parallel? I'm trying to gauge the throughput of these things...[/QUOTE]

1408 at once. Also, please note that those timings are just for Stage 1.

bsquared 2017-09-28 13:37

[QUOTE=wombatman;468711]1408 at once. Also, please note that those timings are just for Stage 1.[/QUOTE]

Yep, understood. Thanks - this is very helpful!

wombatman 2017-09-28 14:38

[QUOTE=bsquared;468712]Yep, understood. Thanks - this is very helpful![/QUOTE]

Does this mean YAFU may be incorporating GPU-based ECM? :whistle:

chris2be8 2017-09-28 16:55

Output from the last job I ran with B1=43M: [code]
Wed 27 Sep 2017 03:16:08 BST ecm to 50 digits stage 1 step 1 of 2 ended
GMP-ECM 7.0-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 3129333620258940486330629139630925933631807344645219843266754670086579137577512492047980091055198974792305750155580720314468906941756462912147587701714558353912867079793467240454311904331638921 (193 digits)
Using B1=43000000, B2=1, sigma=3:2949756002-3:2949756417 (416 curves)
Computing 416 Step 1 took 251097ms of CPU time / 5018458ms of GPU time
Wed 27 Sep 2017 04:42:07 BST ecm to 50 digits stage 1 step 2 of 2 ended
[/code]
The GPU does 416 curves at a time.

The script waits for the stage 2 tasks to finish before starting the next stage 1 run. But the last save file says the residues were saved at Wed Sep 27 04:39:49 2017 so the run was from Wed 27 Sep 2017 03:16:08 to Wed Sep 27 04:39:49.

The GPU is a GeForce GTX 970 according to nvidia-smi: [code]
Thu Sep 28 16:56:39 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 Off | 0000:01:00.0 Off | N/A |
| 0% 54C P0 34W / 201W | 15MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[/code]
Chris

bsquared 2017-09-28 18:48

[QUOTE=wombatman;468715]Does this mean YAFU may be incorporating GPU-based ECM? :whistle:[/QUOTE]

No... but possibly a competing technology :whistle:
[SIZE="1"]No promises...[/SIZE]

wombatman 2017-09-28 20:41

[QUOTE=bsquared;468736]No... but possibly a competing technology :whistle:
[SIZE="1"]No promises...[/SIZE][/QUOTE]

[URL="https://imgur.com/uu9pmDP"]https://imgur.com/uu9pmDP[/URL]

Edit: And when or if you want some testing done, feel free to PM :smile:

lorgix 2017-09-28 21:31

[QUOTE=bsquared;468736]No... but possibly a competing technology :whistle:
[SIZE="1"]No promises...[/SIZE][/QUOTE]

I'm gonna hold you to that.

xilman 2017-09-29 11:26

Hmm. I tried to run an 11M test on my 460 but this appeared
[code]pcl@anubis $ ecm -timestamp -save timer -gpu 11000000 0
GMP-ECM 7.0.1-dev [configured with GMP 6.1.0, --enable-asm-redc, --enable-gpu] [ECM]
3129333620258940486330629139630925933631807344645219843266754670086579137577512492047980091055198974792305750155580720314468906941756462912147587701714558353912867079793467240454311904331638921
Input number is 3129333620258940486330629139630925933631807344645219843266754670086579137577512492047980091055198974792305750155580720314468906941756462912147587701714558353912867079793467240454311904331638921 (193 digits)
[Fri Sep 29 12:09:41 2017]
Using B1=11000000, B2=0, sigma=3:2396433671-3:2396433894 (224 curves)
GPU: Block: 32x32x1 Grid: 7x1x1 (224 parallel curves)
cudakernel.cu(256) : Error cuda : too many resources requested for launch.
pcl@anubis $ [/code]
It used to work but something has clearly changed in the last few months. A reboot and full investigation will follow, but for now I can report that a single curve on the host processor, an AMD Phenom(tm) II X6 1090T clocked at 3.2GHz, took 29760ms.

Historical data from my lab book indicates that 224 curves at B1=110M used to take 529.7s of CPU and 24645.4s of GPU time, which suggests that the B1=11M benchmark should take around 2500 seconds.

The system with a 970 is currently running a GNFS matrix and that should finish before benchmarking takes place.

wombatman 2017-09-29 14:09

[QUOTE=xilman;468804]Hmm. I tried to run an 11M test on my 460 but this appeared
[code]pcl@anubis $ ecm -timestamp -save timer -gpu 11000000 0
GMP-ECM 7.0.1-dev [configured with GMP 6.1.0, --enable-asm-redc, --enable-gpu] [ECM]
3129333620258940486330629139630925933631807344645219843266754670086579137577512492047980091055198974792305750155580720314468906941756462912147587701714558353912867079793467240454311904331638921
Input number is 3129333620258940486330629139630925933631807344645219843266754670086579137577512492047980091055198974792305750155580720314468906941756462912147587701714558353912867079793467240454311904331638921 (193 digits)
[Fri Sep 29 12:09:41 2017]
Using B1=11000000, B2=0, sigma=3:2396433671-3:2396433894 (224 curves)
GPU: Block: 32x32x1 Grid: 7x1x1 (224 parallel curves)
cudakernel.cu(256) : Error cuda : too many resources requested for launch.
pcl@anubis $ [/code]
It used to work but something has clearly changed in the last few months. A reboot and full investigation will follow, but for now I can report that a single curve on the host processor, an AMD Phenom(tm) II X6 1090T clocked at 3.2GHz, took 29760ms.

Historical data from my lab book indicates that 224 curves at B1=110M used to take 529.7s of CPU and 24645.4s of GPU time, which suggests that the B1=11M benchmark should take around 2500 seconds.

The system with a 970 is currently running a GNFS matrix and that should finish before benchmarking takes place.[/QUOTE]

I had a similar issue on my 560. I think I had to edit the header file that defined how many curves were run by default. If you use "-gpucurves 112", does it still fail?

xilman 2017-09-29 14:19

[QUOTE=wombatman;468812]I had a similar issue on my 560. I think I had to edit the header file that defined how many curves were run by default. If you use "-gpucurves 112", does it still fail?[/QUOTE]I haven't yet investigated. The point is that until recently the default 224 curves ran just fine. "Recently", a number of other changes have been made. This is a Gentoo system, and almost anything graphical could be using a different amount of resources since it was rebuilt.

xilman 2017-10-01 11:10

[QUOTE=xilman;468804]The system with a 970 is currently running a GNFS matrix and that should finish before benchmarking takes place.[/QUOTE]The 970 worked fine:
[code]pcl@horus ~ $ ecm -timestamp -save timer -gpu 11000000 0
GMP-ECM 7.0.2-dev [configured with GMP 6.1.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
3094431532040564572408601477248844996578265879041533103613101760833878317954758837772212892619722197393217347727674267786666231717141855759707447284275839786370629363307
Input number is 3094431532040564572408601477248844996578265879041533103613101760833878317954758837772212892619722197393217347727674267786666231717141855759707447284275839786370629363307 (169 digits)
[Sun Oct 1 10:59:47 2017]
Using B1=11000000, B2=0, sigma=3:346811657-3:346812488 (832 curves)
GPU: Block: 32x32x1 Grid: 26x1x1 (832 parallel curves)
Computing 832 Step 1 took 98070ms of CPU time / 3220596ms of GPU time
[/code]
Yes, it's a different number, but the run time is independent of the input because the code uses constant-time, constant-size arithmetic.

The numbers indicate 118ms of CPU and 3871ms of GPU time per curve. A single curve on the host CPU (i7-5820 @ 3.3GHz) took 22906ms, so a naive calculation gives that the GPU is about 5.7 times the speed of a single CPU core.


The 460 system still dies in the same way despite a reboot and with "-gpucurves 112" so more investigation is required.

WraithX 2017-10-02 02:25

[QUOTE=xilman;468947]
[CODE]cudakernel.cu(256) : Error cuda : too many resources requested for launch.[/CODE]
The 460 system still dies in the same way despite a reboot and with "-gpucurves 112" so more investigation is required.[/QUOTE]

This error was brought up on the ecm-discuss mailing list back in July 2016. I'm not sure what changed in the code to make it use more resources, but some older cards fail with this error message. A workaround was put into the README.gpu file:
[CODE]
4. Known issues

On some configurations (GTX 570 with compute capability 2.0 for example)
one gets the Cuda error "too many resources requested for launch". This
can be solved by decreasing ECM_GPU_CURVES_BY_BLOCK from 32 to 16 in ecm-gpu.h.
[/CODE]
You can see the threads from July 2016 here:
[url]https://lists.gforge.inria.fr/pipermail/ecm-discuss/2016-July/thread.html[/url]
(relevant thread is "[Ecm-discuss] CC 2.0 issue with GTX 570" )

And the last one came in August 2016, here:
[url]https://lists.gforge.inria.fr/pipermail/ecm-discuss/2016-August/thread.html[/url]

xilman 2017-10-02 06:20

[QUOTE=WraithX;468988]This error was brought up on the ecm-discuss mailing list back in July 2016. I'm not sure what changed in the code to make it use more resources, but some older cards fail with this error message. A workaround was put into the README.gpu file:
[CODE]
4. Known issues

On some configurations (GTX 570 with compute capability 2.0 for example)
one gets the Cuda error "too many resources requested for launch". This
can be solved by decreasing ECM_GPU_CURVES_BY_BLOCK from 32 to 16 in ecm-gpu.h.
[/CODE]
You can see the threads from July 2016 here:
[url]https://lists.gforge.inria.fr/pipermail/ecm-discuss/2016-July/thread.html[/url]
(relevant thread is "[Ecm-discuss] CC 2.0 issue with GTX 570" )

And the last one came in August 2016, here:
[url]https://lists.gforge.inria.fr/pipermail/ecm-discuss/2016-August/thread.html[/url][/QUOTE]
Many thanks! Given that information I can go back and see what happened to cause the problem and, perhaps, fix it in a more computationally efficient manner.

lorgix 2017-10-29 14:21

B1=11e6 takes 3076 seconds on my Tesla M2050. 448 curves. (c290)
Does anybody have a recent binary for CC 2.0 btw?

storm5510 2019-10-16 12:01

[QUOTE=wombatman;439399]I truthfully don't remember all the steps I went through to get everything compiled successfully with VS2012 (I have no experience with VS2015). I'm attaching the gpu_ecm exe I have built. Try it out and see what dlls you need for it.[/QUOTE]

I have a GTX 1080 with the latest driver set, running on Windows 10 Pro x64 v1903. This would not work with it. It gave me the message below:

[CODE]The application was not able to start correctly. (0xc000007b). Click OK to close this application.[/CODE]
I am running the latest CPU variant with no problems, so this is not a big issue, to me anyway.

EdH 2019-10-22 03:03

In trying to spin up a GPU version of GMP-ECM in the Colab Environment:
[code]
. . .
configure: Using cuda.h from /usr/include/linux/include
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking that CUDA Toolkit version is at least 3.0... [B]no[/B]
configure: error: [B]a newer version[/B] of the CUDA Toolkit is needed
[/code][code]
apt search cuda-toolkit

Sorting... Done
Full Text Search... Done
[B]cuda-toolkit-10-0[/B]/unknown,now 10.0.130-1 amd64 [[B]installed[/B],automatic]
CUDA Toolkit 10.0 meta-package

cuda-toolkit-10-1/unknown 10.1.243-1 amd64
CUDA Toolkit 10.1 meta-package

nvidia-cuda-toolkit/bionic 9.1.85-3ubuntu1 amd64
NVIDIA CUDA development toolkit
[/code]BTW, I installed the 10.1 and 9.1... toolkits also, with no success.

Suggestions, anyone?

Dylan14 2019-10-24 21:46

[QUOTE=EdH;528555]In trying to spin up a GPU version of GMP-ECM in the Colab Environment:
[code]
. . .
configure: Using cuda.h from /usr/include/linux/include
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking that CUDA Toolkit version is at least 3.0... [B]no[/B]
configure: error: [B]a newer version[/B] of the CUDA Toolkit is needed
[/code][code]
apt search cuda-toolkit

Sorting... Done
Full Text Search... Done
[B]cuda-toolkit-10-0[/B]/unknown,now 10.0.130-1 amd64 [[B]installed[/B],automatic]
CUDA Toolkit 10.0 meta-package

cuda-toolkit-10-1/unknown 10.1.243-1 amd64
CUDA Toolkit 10.1 meta-package

nvidia-cuda-toolkit/bionic 9.1.85-3ubuntu1 amd64
NVIDIA CUDA development toolkit
[/code]BTW, I installed the 10.1 and 9.1... toolkits also, with no success.

Suggestions, anyone?[/QUOTE]

Hmm... could you try running cat on that cuda.h to see what version it is?
If it is larger than 3 (which it most likely is), then you may have to fiddle around with the configure script so it looks for a higher version.
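
For example, cuda.h defines a CUDA_VERSION macro (10000 for CUDA 10.0), so something like this would show which header configure is actually seeing (a sketch; adjust the path to whatever configure reported):
[CODE]grep "define CUDA_VERSION" /usr/local/cuda-10.0/include/cuda.h[/CODE]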

EdH 2019-10-25 00:46

[QUOTE=Dylan14;528850]Hmm... could you try running cat on that cuda.h to see what version it is?
If it is larger than 3 (which it most likely is), then you may have to fiddle around with the configure script so it looks for a higher version.[/QUOTE]
I'll do a cat in a bit (after I recover from an extended power outage - one of the problems with having a bunch of ancient machines, some of which don't wake back up right). Meanwhile, I had tried playing around with some of the code in the configure file in an attempt to swap failure with success, but I wasn't able to get anywhere - I'm a bit too far over my head.

Thanks!

EdH 2019-10-25 03:09

@Dylan14:

Good Call! The one I was pointing to was something from last century. I found one in a very current folder that got me a couple steps further. Now I have to work on libraries. But I had to stop for now.

Thanks!

EdH 2019-10-26 01:48

This #%$@ thing is fighting me all the way!

Last straw!:
[code]checking for compatibility between gcc and nvcc... no
configure: error: gcc version is not compatible with nvcc[/code]
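
(For reference, the usual way around this error is to point the build at a gcc release that nvcc accepts, either via CC/CXX or nvcc's -ccbin; a sketch, with the package names being guesses that depend on the toolkit version installed:)
[CODE]sudo apt-get install gcc-7 g++-7
./configure --enable-gpu --with-cuda=/usr/local/cuda-10.0 CC=gcc-7 CXX=g++-7[/CODE]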

chalsall 2019-10-26 02:04

[QUOTE=EdH;528965]This #%$@ thing is fighting me all the way![/QUOTE]

As is its job.

You're smarter than it is.

Prove it. :wink:

EdH 2019-10-26 02:18

[QUOTE=chalsall;528966]As is its job.

You're smarter than it is.

Prove it. :wink:[/QUOTE]
But, I think it is more patient.

However, a bit after I posted, I thought, "I did get further than last night."

chalsall 2019-10-26 02:39

[QUOTE=EdH;528967]However, a bit after I posted, I thought, "I did get further than last night."[/QUOTE]

And that is exactly how to approach this kind of "problem space".

Run experiments. Observe results.

Repeat as necessary...

Give it a little bit of time, and you'll begin "lucid dreaming" about stuff like this.

It's kinda cool waking up with "well, obviously, that's how you'd do that" before even the first coffee of the morning...

EdH 2019-10-26 03:17

Success! Thanks for the words of encouragement.

I'll see if it will repeat tomorrow and figure out where to post here if success [B]is[/B] repeatable...

EdH 2019-10-26 15:27

[QUOTE=EdH;528977]. . .
I'll see if it will repeat tomorrow and figure out where to post here if success [B]is[/B] repeatable...[/QUOTE]
I added it as a new thread to my list of How I... threads in the blog area:

[URL="https://www.mersenneforum.org/showthread.php?t=24887"]How I Compile the GPU branch of GMP-ECM in a Colaboratory Session[/URL]

EdH 2019-11-05 18:51

How Valuable Is Stage 1 In the Overall ECM Process
 
This question is based primarily on my playing with Colab and my GPU-GMP-ECM setup mentioned earlier.

If I run the GPU version, it will saturate the GPU with as many curves as cores available and run stage 1. But, in a recent test case, that was over 4k cores. If I were to fully process all those it would mean about 30 minutes for the stage 1 part on the GPU (for my chosen number/B1), followed by an awfully long CPU processing time for stage 2 to finish all those runs with only two CPU cores handling them.

Is there much value in running just stage 1 ECM, realizing that unless I keep the residues for later followup, all those stage 1 runs will have to be repeated at some point?

fivemack 2019-11-05 20:01

[QUOTE=EdH;529744]This question is based primarily on my playing with Colab and my GPU-GMP-ECM setup mentioned earlier.

If I run the GPU version, it will saturate the GPU with as many curves as cores available and run stage 1. But, in a recent test case, that was over 4k cores. If I were to fully process all those it would mean about 30 minutes for the stage 1 part on the GPU (for my chosen number/B1), followed by an awfully long CPU processing time for stage 2 to finish all those runs with only two CPU cores handling them.

Is there much value in running just stage 1 ECM, realizing that unless I keep the residues for later followup, all those stage 1 runs will have to be repeated at some point?[/QUOTE]

Don't just run stage 1! Keep the residues, they are pretty tiny - less than 500 bytes per curve. Use something like

[code]
ecm -v -v -gpudevice 0 -gpu -save e.$u.s1 85e7 1
[/code]

then take the .s1 files home, split the lines up however you like and let your other cores at them

[code]
ecm -maxmem 3000 -resume $u 85e7 < ../N > $u.s2
[/code]

Or post the .s1 files somewhere, link to them from this forum and let people at them.

I found it was 43.6 hours to do 1792 stage-1 curves on a GTX1080Ti and about 1070 seconds on one core of Xeon Silver 4114 to run stage 2 on one curve (so you needed about 12 such cores to keep up with the GPU)

EdH 2019-11-05 20:40

Thanks! I must have gotten something mixed up in my earlier testing. (That's happening a lot, lately.) I thought I ran some 11e7 stage 1 tests on a Colab session in about 40 minutes for 832 curves, but I'm coming up on 2 hours this time. I must have been using a smaller candidate before.

I will modify my routine to save residues, then and see what other mischief I can come up with.

Thanks, again!

fivemack 2019-11-05 21:31

[QUOTE=EdH;529750]Thanks! I must have gotten something mixed up in my earlier testing. (That's happening a lot, lately.) I thought I ran some 11e7 stage 1 tests on a Colab session in about 40 minutes for 832 curves, but I'm coming up on 2 hours this time. I must have been using a smaller candidate before.

I will modify my routine to save residues, then and see what other mischief I can come up with.

Thanks, again![/QUOTE]

The size of the candidate shouldn't have any effect on the stage 1 GPU time - I believe the ECMGPU code uses fixed-size integers. 40 minutes sounds maybe more plausible for 11e6.

EdH 2019-11-05 22:02

[QUOTE=fivemack;529759]The size of the candidate shouldn't have any effect on the stage 1 GPU time - I believe the ECMGPU code uses fixed-size integers. 40 minutes sounds maybe more plausible for 11e6.[/QUOTE]It's quite possible it was 11e6, now that you mention it. I think at the time I was checking different values, 11e3, 11e4, 11e5, etc. I might have gotten mixed up, but I was rather certain that I had made note that the 11e7 stage 1 on the GPU Colab instance took longer at 40 minutes than the ecmpi runs locally at 35 minutes. Hmm, maybe I stopped it at 40 minutes, because it took longer. My memory is really great, but for too short a spell. Always a jab to remind me to keep better notes. . .

EdH 2019-11-13 22:00

In playing with my Colab instance of GMP-ECM with GPU enabled (K80 GPU), I have made some observations:

The K80 shows 832 cores. If I run threads up to half of the cores, vs. over half the cores, the time for the run just about doubles. It is pretty close to constant within those halves. For example, running stage 1 for 5+3,1185L, it takes just about 22 seconds to complete <417 curves at 11e4. >416 curves takes about 41 seconds. Of note, I am using multiples of 32 cores.

Running a single curve without the GPU takes about .4 second. Obviously, running more curves on a single CPU core would sequentially add time - 416 curves by CPU would take about 166 seconds.

This appears to indicate that while the GPU sounds impressive with all its cores, a quad core CPU would keep up with the K80 GPU in stage 1 runs.

Am I missing something here?

R.D. Silverman 2019-11-13 22:54

[QUOTE=EdH;530517]In playing with my Colab instance of GMP-ECM with GPU enabled (K80 GPU), I have made some observations:

The K80 shows 832 cores. If I run threads up to half of the cores, vs. over half the cores, the time for the run just about doubles. It is pretty close to constant within those halves. For example, running stage 1 for 5+3,1185L, it takes just about 22 seconds to complete <417 curves at 11e4. >416 curves takes about 41 seconds. Of note, I am using multiples of 32 cores.

Running a single curve without the GPU takes about .4 second. Obviously, running more curves on a single CPU core would sequentially add time - 416 curves by CPU would take about 166 seconds.

This appears to indicate that while the GPU sounds impressive with all its cores, a quad core CPU would keep up with the K80 GPU in stage 1 runs.

Am I missing something here?[/QUOTE]

It looks good to me.

henryzz 2019-11-14 13:01

[QUOTE=EdH;530517]In playing with my Colab instance of GMP-ECM with GPU enabled (K80 GPU), I have made some observations:

The K80 shows 832 cores. If I run threads up to half of the cores, vs. over half the cores, the time for the run just about doubles. It is pretty close to constant within those halves. For example, running stage 1 for 5+3,1185L, it takes just about 22 seconds to complete <417 curves at 11e4. >416 curves takes about 41 seconds. Of note, I am using multiples of 32 cores.

Running a single curve without the GPU takes about .4 second. Obviously, running more curves on a single CPU core would sequentially add time - 416 curves by CPU would take about 166 seconds.

This appears to indicate that while the GPU sounds impressive with all its cores, a quad core CPU would keep up with the K80 GPU in stage 1 runs.

Am I missing something here?[/QUOTE]
The GPU doesn't care what size of number you give it, while the CPU does. A quad-core CPU probably wouldn't keep up so well on a 1000-bit number (I believe your test case had 720 bits).
I seem to recall that it was possible to compile versions of gpu-ecm that ran faster with a limit of ~256 or ~512 bits. I am not sure whether ~768 worked. I think it needed to be a power of 2.

EdH 2019-11-14 14:18

[QUOTE=henryzz;530556]The GPU doesn't care what size of number you give it, while the CPU does. A quad-core CPU probably wouldn't keep up so well on a 1000-bit number (I believe your test case had 720 bits).
I seem to recall that it was possible to compile versions of gpu-ecm that ran faster with a limit of ~256 or ~512 bits. I am not sure whether ~768 worked. I think it needed to be a power of 2.[/QUOTE]
Hmm! I was thinking of just giving up on the GPU. Maybe I need to play more after all. . .

henryzz 2019-11-14 14:27

[QUOTE=EdH;530563]Hmm! I was thinking of just giving up on the GPU. Maybe I need to play more after all. . .[/QUOTE]

Be careful if you go down the route of trying gpu-ecm with different limits. I think there was some concern that it appeared to work but wasn't finding all the expected factors.

EdH 2019-11-14 14:53

[QUOTE=henryzz;530567]Be careful if you go down the route of trying gpu-ecm with different limits. I think there was some concern that it appeared to work but wasn't finding all the expected factors.[/QUOTE]
Is this discussion available somewhere? I don't remember it in this thread. I think I went through this thread - maybe I skipped some. . . Anyway, I probably won't pursue this too much further, at least not too soon.

Thanks!

VBCurtis 2019-11-14 16:35

Henry is right: GPU-ECM is the same speed with any input within its bounds (1019 bits or smaller). So, it's relatively a waste of time for small numbers, but quite nice for 900+ bit inputs. When I had a working CUDA setup, I used it for lots of 900-1000 bit inputs. Even in your test case, it doubles the production of a quad-core by doing stage 1 on the GPU and stage 2 on the CPU (with a couple of cores left over, since stage 2 runs faster than stage 1).
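
In command form, that division of labour looks something like this (a sketch based on the invocations shown earlier in the thread; N is the input number):
[CODE]echo "$N" | ecm -gpu -save s1.save 11e7 0   # stage 1, all curves at once on the GPU
ecm -resume s1.save 11e7 > s2.out           # stage 2 on the CPU (split the save file across cores)[/CODE]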

fivemack 2019-11-14 21:37

[QUOTE=EdH;530517]In playing with my Colab instance of GMP-ECM with GPU enabled (K80 GPU), I have made some observations:

The K80 shows 832 cores. [/quote]

That rather surprises me, since a K80 card is specified as having 4992 cores, so you're getting 1/6 of it (and, as you say the timings double when you go to >416 curves, 1/12 of it).

(though my 1080Ti is specified as 3584 cores and runs 1792 curves at a time, so there's a small constant factor there in any case)

EdH 2019-11-14 23:09

Thanks for all the info, everyone.

[QUOTE=fivemack;530619]That rather surprises me, since a K80 card is specified as having 4992 cores, so you're getting 1/6 of it (and, as you say the timings double when you go to >416 curves, 1/12 of it).

(though my 1080Ti is specified as 3584 cores and runs 1792 curves at a time, so there's a small constant factor there in any case)[/QUOTE]
I've run nvidia-smi to see what I've been assigned, but it hasn't shown me a core count that I can recognize, anyway. When I run ECM with it enabled, it tells me 832 cores as the default, which is supposed to be saturation. Can a GPU serve others at the same time I'm using the 832? I kind of thought not.

I think one of the times I got a P100, it showed 4000+ with my ECM run. Since I only get two Xeon cores to back up all the GPU stage 1s, unless I pull the residues off, it's kind of less than ideal to run much in a GPU-ECM Colab session.

storm5510 2019-11-23 15:22

[QUOTE=Karl M Johnson;288948]Windows binary wanted :smile:[/QUOTE]

A "working" 64-bit Windows variant would be nice... :bangheadonwall:

