mersenneforum.org > Factoring Projects > Msieve
2015-08-14, 13:11   #12
wombatman
Quote:
Originally Posted by VBCurtis View Post
What meaning of "sieve" do you mean here? LaurV's reply concerned NFS sieving, as in finding relations, while you seem to be referring to part of the poly-select step.
There we go! You put into words what I couldn't! Thanks.
2015-08-19, 23:27   #13
Anyone
Quote:
Originally Posted by Antonio View Post
To get msieve working on my system I needed the following support files:

Code:
cudart64_55.dll
cufft64_55.dll
pthreadVC2.dll
sort_engine_sm20.dll
stage1_core_sm20.ptx
They are all in the same directory as msieve.

@Antonio

Could you tell me where I can get pthreadVC2.dll? What is this file for?

Thanks
2015-08-20, 00:34   #14
Anyone
Quote:
Originally Posted by VBCurtis View Post
What meaning of "sieve" do you mean here? LaurV's reply concerned NFS sieving, as in finding relations, while you seem to be referring to part of the poly-select step.

Anyone-
GPU poly select on degree 4 polys is a waste of effort. A C97 should be factored in an hour or so on a multi-core system, so poly select might take ~5 minutes. Even up to C110 or so, YAFU's simplicity outweighs any time savings you might get from doing GPU poly select on your own. Once you get up over C130, the GPU's power on the first step of the 3-step poly select pays meaningful benefits; I wouldn't bother with CPU select at all over C135.
Poly selection has 3 steps: Generating raw polynomial candidates (GPU does this 100x faster, or more), followed by size optimization (the -nps flag in msieve, CPU only), followed by root optimization (-npr, also CPU only).
The improvement in final poly output is mostly created by doing the nps and npr steps on a small subset of the polys found in step 1 (this is controlled by stage1norm). The general plan is to set stage 1 norm so that the output from the GPU (the lines on the screen of coefficients of the polys) is in the range of 2-5/second. A desktop CPU core can do the size optimization at roughly that rate, so running msieve with -np1 (GPU step) -nps (size opt step) flags both set results in one core fully loaded to do size optimization while the gpu generates a TON of polys but -stage1_norm rejects 98% of them before the -nps step is run.
Alternately, you can leave stage1norm alone, and just have the GPU spit out thousands of polys to a file with -np1. You then direct msieve later to -nps that file, while the GPU (presumably) does something else or sits idle. Running in this manner causes the CPU step to take ~20x as long as the GPU step!
the -stage2_norm flag is used to control how many of the -nps output polys are root-optimized (a slow process). Two ways to deal with this: sort the output of the .ms file that -nps outputs by the last column (a measure of poly quality), then truncate the file to only run -npr on the best 100 (or 200, whatever suits your fancy). *Or*, set -stage2_norm to be about a factor of 20 (or 25) lower than the default msieve setting (listed in the log), which causes -nps to output 100-200 polys per GPU-day of search.

If you run -np all at once, msieve isn't very good at these filters, and will take a LONG time to -npr the massive output from the GPU/nps steps. This is why it's not worth bothering to use GPU select for composites below C120 or so- it's not worth the human time to run 2-3 stages and filter the outputs.

Hope this helps! Ask questions if you wish, but do read the msieve docs too.
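The sort-and-truncate filtering described above (keep only the best -nps output lines before running -npr) can be sketched in a few lines of Python. The assumptions here, beyond "sort by the last column", are that each line of the .ms file is whitespace-separated with the quality score in the last column and that a smaller score means a better poly; verify both against your own -nps output before trusting this:

```python
# Keep only the best n size-optimized polynomials for root optimization.
# Assumptions (check your own msieve output): the .ms file is
# whitespace-separated, the last column is the poly score, and a
# smaller score is better.

def best_polys(lines, n=100):
    scored = [(float(line.split()[-1]), line)
              for line in lines if line.split()]
    scored.sort(key=lambda t: t[0])   # smallest (best) score first
    return [line for _, line in scored[:n]]

# Example: filter a hypothetical .ms file down to its 100 best lines.
# with open("msieve.ms") as f:
#     keep = best_polys(f.readlines(), n=100)
```

Truncating the file this way is the manual alternative to tightening -stage2_norm; both simply limit how many candidates reach the slow root-optimization step.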


Thank you all for the valuable information. As I told you, I am new to this, so sorry for the coming questions :)

1- What is the mapping for C97, C110, etc.? Is C97 a number with 97 digits? Does it follow the conversion ratio of ~3.3 (C97 ≈ 320 bits)?
2- What is a "degree 4 poly"?

3- VBCurtis, you said: "running msieve with -np1 (GPU step) -nps (size opt step) flags both set results in one core fully loaded to do size optimization while the gpu generates a TON of polys but -stage1_norm rejects 98% of them before the -nps step is run."
Do you mean that the "poly select - GPU" and "size optimization - CPU" steps can run in parallel, or must they still run in series? I am confused!

4- If I understand correctly:
- we run "msieve -np1 -nps"
- both steps run in parallel
- we set -stage1_norm so that the GPU keeps trying more polys ("Value_2") until the CPU finishes the size optimization of the GPU's previous output ("Value_1"), etc.; this time in your example is 2-5 seconds.
How do I set -stage1_norm?
If I have a multi-core processor, am I still able to do the size optimization in parallel?

I understood -stage2_norm; you explained it so nicely!!

Thanks a lot, guys, for your great help and for your time :)

BR

BR
2015-08-20, 05:51   #15
VBCurtis

Quote:
Originally Posted by Anyone View Post
(questions 1-4, quoted in full above)
1. Yes; and yes, roughly. We measure inputs in decimal digits rather than bits in most cases.
2. polys = polynomials. Degree = highest power, as in basic algebra. Wiki the word if you don't remember the concept.
3. If you invoke msieve with -np1 -nps, the program will run a thread on the first step (on GPU), and a thread of your CPU on the size-optimization step. msieve is not otherwise multi-threaded; a multi-core chip is not used. If you really cared, you could invoke stage 1 GPU on its own, then run a bunch of -nps with various save files to use the other cores, but you'd have to figure that out for yourself (the rest of us don't bother).
4. -stage1_norm=1e26 would set the cutoff for saving the poly to 1e26. If you invoke msieve -np1 -nps, the log will show you msieve's default stage 1 norm and stage 2 norm; you then shrink these via guess-and-see-what-happens to alter the output. I did say 2-5/second, with "/" meaning "per" rather than 2-5 seconds for each hit. I usually decrease stage1norm by one magnitude (if default is 1e26, I might set it to 1.5e25, or 1e25, or 7e24, whatever results in my GPU turning out a couple hits per second).
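As a side note on answer 1: the digits-to-bits rule of thumb is log2(10) ≈ 3.32 bits per decimal digit, so a C97 is about 322 bits. A quick sketch (the function name is just illustrative):

```python
import math

# Bits per decimal digit: log2(10) ~= 3.3219
def digits_to_bits(d):
    return round(d * math.log2(10))

# A C97 (97-digit composite) is roughly 322 bits;
# a C110 is roughly 365 bits.
```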

For stage2_norm, I check the log for default stage 2 norm, then divide by something near 25. The bigger the job, the smaller I make this norm (for big jobs, say GNFS170, I might run poly select for a week, looking for 100 or so hits per day out of the -nps step).
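The norm-tuning arithmetic in the two paragraphs above (drop the stage 1 norm by roughly one order of magnitude, divide the stage 2 norm by ~25) is simple enough to sketch. The default values below are placeholder assumptions; msieve prints the real defaults in its log:

```python
# Starting points for norm tuning, following the recipe above.
# The inputs are whatever defaults your msieve log reports; the
# returned values are starting guesses, to be refined by watching
# the hit rate on screen.

def suggest_norms(stage1_default, stage2_default):
    return {
        "stage1_norm": stage1_default / 10,   # ~one magnitude lower
        "stage2_norm": stage2_default / 25,   # a factor of ~20-25 lower
    }

# e.g. suggest_norms(1e26, 1e24)
```

From there, nudge stage1_norm up or down until the GPU produces a couple of hits per second, as described above.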

Before more questions, read: the msieve readme, msieve -h for all the available flags, and Jeff Gilchrist's intro-to-NFS-factoring webpage; and peruse the "polynomial request thread", where much discussion has happened over the last two years about how to use msieve-GPU most effectively. Many tuning parameters are open questions; the guide I give here is the result of my personal experience rather than definite best practices used by everyone.
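Pulling the recipe together, a staged run might look like the sketch below. The flags (-np1, -nps, -npr, -t) and the two norm settings are the ones discussed in this thread, but the norm values, the placeholder input, and the exact way the norms are passed on the command line are assumptions; invocation details vary between msieve builds, so check msieve -h and the readme first.

```shell
# Sketch of the three-stage poly-select workflow (assumed syntax;
# verify every flag against "msieve -h" for your build).

# Stage 1 (GPU) plus size optimization (one CPU core) in a single run,
# with the stage 1 norm lowered until only ~2-5 hits/second survive:
msieve -v -np1 -nps -t 2 -stage1_norm 1e25 <number>

# Filter the resulting .ms file down to the best ~100-200 polys,
# then root-optimize only those survivors:
msieve -v -npr -stage2_norm 4e22 <number>
```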

-Curtis
2015-08-20, 13:22   #16
jasonp
To add to the very good explanation above: GPU stage 1 is multithreaded, and you can actually ask for more than one thread to drive the GPU. For 100-digit problems and below, a GPU will crunch through stage 1 so fast that, even ignoring the size optimization, the GPU is still mostly idle. You can throw 2-3 threads at stage 1 (-t 2 or -t 3 does this) and generate hits about 30% faster just from better GPU utilization, but you still need to boil all those hits down, and stage 2 is still a bottleneck. That doesn't change the advice that a GPU is wasted on small problems; even for 150-digit polynomial selection I've found more than one thread is needed to keep the GPU fully occupied.
2015-08-20, 17:02   #17
Anyone
Quote:
Originally Posted by VBCurtis View Post
(full reply quoted above)


VBCurtis and jasonp, thanks a lot for the explanation.

VBCurtis, I will read the resources you pointed to. Thank you!

jasonp, I could get my GPU >50% busy by using "-t 100"; however, the tool crashes if I use "-t 150"!!!

One last question: I have two GPUs in my system, but I could not get both of them working; it is always either GPU0 or GPU1. Any suggestions?

Thanks again
BR
2015-11-12, 19:27   #18
chris2be8
A (probably) CUDA-level-related issue: it looks as if CUDA 7.5 no longer supports compute capability 1.1. I'm trying to get my new GTX 970 to work, but when I try to make msieve with CUDA=1 I get:
Code:
"/usr/local/cuda-7.5/bin/nvcc" -arch sm_11 -ptx -o stage1_core_sm11.ptx gnfs/poly/stage1/stage1_core_gpu/stage1_core.cu
nvcc fatal   : Value 'sm_11' is not defined for option 'gpu-architecture'
Makefile:307: recipe for target 'stage1_core_sm11.ptx' failed
make: *** [stage1_core_sm11.ptx] Error 1
I'm trying to get round this by editing the Makefiles to remove the references to sm_11 (and sm_13), since the new card won't need them, but I'm having problems (I'm about to start from scratch with a new download). Is there a definitive list of edits?

Chris
2015-11-13, 03:16   #19
jasonp
The Makefile in the b40c directory also needs to be modified to remove the older targets.
2015-11-13, 16:31   #20
chris2be8
I've got it to compile cleanly after editing the Makefiles as follows (msieve.orig is as downloaded, msieve-svn988.2 is changed):
Code:
diff msieve.orig/trunk/Makefile msieve-svn988.2/trunk/Makefile
172,173d171
<       stage1_core_sm11.ptx \
<       stage1_core_sm13.ptx \
305,310d302
<
< stage1_core_sm11.ptx: $(NFS_GPU_HDR)
<       $(NVCC) -arch sm_11 -ptx -o $@ $<
<
< stage1_core_sm13.ptx: $(NFS_GPU_HDR)
<       $(NVCC) -arch sm_13 -ptx -o $@ $<
and:
Code:
diff msieve.orig/trunk/b40c/Makefile msieve-svn988.2/trunk/b40c/Makefile
20,21d19
< GEN_SM13 = -gencode=arch=compute_13,code=\"sm_13,compute_13\"
< GEN_SM10 = -gencode=arch=compute_10,code=\"sm_10,compute_10\"
35c33
< all: $(LIBNAME)_sm10.$(EXT) $(LIBNAME)_sm13.$(EXT) $(LIBNAME)_sm20.$(EXT)
---
> all: $(LIBNAME)_sm20.$(EXT)
40,45d37
<
< $(LIBNAME)_sm10.$(EXT) : $(DEPS)
<       $(NVCC) $(GEN_SM10) -o $@ sort_engine.cu $(NVCCFLAGS) $(INC) -O3
<
< $(LIBNAME)_sm13.$(EXT) : $(DEPS)
<       $(NVCC) $(GEN_SM13) -o $@ sort_engine.cu $(NVCCFLAGS) $(INC) -O3
After that, "make all CUDA=1 NO_ZLIB=1 2>&1 | tee -a make.all.out" compiled everything. I've not had time to test it yet, though.

Chris
2015-11-14, 17:39   #21
chris2be8
Tested, it works. I've generated several smallish (100-104 digit) polys with it and they sieve OK.

Chris
2015-11-16, 16:48   #22
henryzz
Would this fix it on Windows for later versions of CUDA, or does b40c build with later versions of CUDA on Linux? I have been unable to compile a version on Windows using the latest source (my VS2012 install is permanently messed up). I still need a recent version that works on a Core 2.