mersenneforum.org  

Old 2018-12-29, 13:50   #3026
R. Gerbicz
 
Thanks Judger, that was a detailed description. OK, so even on a CPU that thread work distribution would not work the way I had in mind.

Just a few more questions:
1. In your latest binary, mfaktc-0.21, I assume the main sieve is gpusieve.cu's
__global__ static void __launch_bounds__(256,6) SegSieve (uint8 *big_bit_array_dev, uint8 *pinfo_dev, uint32 maxp)
method; the 256 refers to the fact that you are using 256 threads in one pool, right? What is the 6?
2. Where do you store pinfo_dev, i.e. the sieving primes' info? Is it in your default 64 Mibit bitmap or elsewhere?
3. The ratio of sieving time to total time is still not clear (at least to me); if it is really that tiny,
then why not sieve deeper, using primes up to 2M or 10M? That would lower the number of surviving k candidates by a factor
of log(B2)/log(B1), where B2 is the new depth and B1 the old (of course, the sieve itself would be slower).
4. In the mentioned SegSieve, say we have 20480 = 80*256 GPU threads; do you launch all of these threads at once in that routine,
sieving an interval of length 80*65536 = 5242880? Or do you start only 256 threads at a time, whenever 256 threads become available?
5. What happens if the number of GPU threads is not divisible by 256?


Possible ideas:

on line 539:
locsieve32[threadIdx.x * block_size / threadsPerBlock / 32 + j] |= mask;
where we know:
const uint32 block_size_in_bytes = 8192; // Size of shared memory array in bytes
const uint32 block_size = block_size_in_bytes * 8; // Number of bits generated by each block
const uint32 threadsPerBlock = 256; // Threads per block

so we could also write:
locsieve32[8 * threadIdx.x + j] |= mask;
and the compiler can easily replace this with a shift. I'm not sure whether the compiler reduces it to a single shift on its own.
But if you are sure about this, then it is absolutely right not to change it. Note that there are many such cases in this method.

And you could replace all those const declarations with #define (using a single integer constant; don't leave multiplications in there), say:
#define block_size_in_bytes 8192 // Size of shared memory array in bytes
#define block_size 65536 // = block_size_in_bytes * 8; // Number of bits generated by each block
#define threadsPerBlock 256

etc. Would that be faster on the GPU?
Old 2018-12-29, 16:36   #3027
ATH

You can set the number of sieve primes in mfaktc.ini:

# GPUSievePrimes defines how far we sieve the factor candidates on the GPU.
# The first <GPUSievePrimes> primes are sieved.
#
# Minimum: GPUSievePrimes=0
# Maximum: GPUSievePrimes=1075000
#
# Default: GPUSievePrimes=82486

Usually you test the speed with different GPUSievePrimes values for your GPU the first time, but the fastest setting has been between 75K and 150K every time I have tested it.

I believe the optimum can differ slightly for different exponent sizes and bit depths.

Last fiddled with by ATH on 2018-12-29 at 16:38
Old 2018-12-29, 18:36   #3028
TheJudger
 

Hi Robert,

Quote:
Originally Posted by R. Gerbicz View Post
1. in your latest binary mfaktc-0.21 I'm assuming that the main sieve is in gpusieve.cu's
__global__ static void __launch_bounds__(256,6) SegSieve (uint8 *big_bit_array_dev, uint8 *pinfo_dev, uint32 maxp)
method, 256 refers to the fact that you are using 256 threads in one pool, right? What is the 6 ?
Correct, 256 threads per block and a minimum of 6 blocks per "multiprocessor" (a group of GPU cores).

Quote:
Originally Posted by R. Gerbicz View Post
2. Where do you store pinfo_dev, i.e. the sieving primes' info? Is it in your default 64 Mibit bitmap or elsewhere?
Not sure what you want to know here. The big 64 Mibit bitmap contains just the bits of the sieve.

Quote:
Originally Posted by R. Gerbicz View Post
3. The ratio of sieving time to total time is still not clear (at least to me); if it is really that tiny,
then why not sieve deeper, using primes up to 2M or 10M? That would lower the number of surviving k candidates by a factor
of log(B2)/log(B1), where B2 is the new depth and B1 the old (of course, the sieve itself would be slower).
I can't tell you the timings because I don't know them either. As ATH noted, one simply adjusts GPUSievePrimes to the "optimal" value (minimum runtime for a given piece of work).

Quote:
Originally Posted by R. Gerbicz View Post
4. in the mentioned SegSieve, say we have 20480=80*256 GPU threads, then do you launch these threads at once in that routine,
making a sieve on 80*65536=5242880 length interval ? Or do you start at once only 256 threads, when you have 256 available threads?
All threads (blocks) are launched at the same time. With the default configuration there are 1024 blocks of 256 threads each (the 64 Mibit bitmap partitioned into 1024 segments of 64 Kibit). As far as the code is concerned it is fine to run them all at once or block after block; the GPU (driver) does the magic of scheduling the blocks onto the hardware. Currently the (big) bitmap is limited to 128 Mibit and thus 524288 threads - enough until Nvidia manages to build GPUs with more than, let's say, 64k cores.

Quote:
Originally Posted by R. Gerbicz View Post
5. What happens if the number of GPU threads is not divisible by 256 ?
It is!




Quote:
Originally Posted by R. Gerbicz View Post
Possible ideas:

on line 539:
locsieve32[threadIdx.x * block_size / threadsPerBlock / 32 + j] |= mask;
where we know:
const uint32 block_size_in_bytes = 8192; // Size of shared memory array in bytes
const uint32 block_size = block_size_in_bytes * 8; // Number of bits generated by each block
const uint32 threadsPerBlock = 256; // Threads per block

so we could also write:
locsieve32[8 * threadIdx.x + j] |= mask;
and the compiler can easily replace this with a shift. I'm not sure whether the compiler reduces it to a single shift on its own.
But if you are sure about this, then it is absolutely right not to change it. Note that there are many such cases in this method.

And you could replace all those const declarations with #define (using a single integer constant; don't leave multiplications in there), say:
#define block_size_in_bytes 8192 // Size of shared memory array in bytes
#define block_size 65536 // = block_size_in_bytes * 8; // Number of bits generated by each block
#define threadsPerBlock 256

etc., would it be faster on GPU?
Pretty sure the compiler knows that those variables are constant and precomputes the values. Replacing the multiplication with a shift might be a bad idea: at least on some GPUs a shift has the same throughput as a multiply, and we still need the add. On the other hand, there is a multiply-add with the same throughput as a multiply (the add is free).

https://docs.nvidia.com/cuda/cuda-c-...c-instructions

Oliver

Last fiddled with by TheJudger on 2018-12-29 at 18:37
Old 2018-12-29, 19:09   #3029
kriesel
 

Quote:
Originally Posted by TheJudger View Post
I can't tell you the timings because I don't know, too. As ATH noted one simply adjusts GPUSievePrimes to the "optimal" value (minimum runtime for given work).
Could it be obtained by profiling?
Old 2018-12-30, 08:14   #3030
Neutron3529
 

Quote:
Originally Posted by TheJudger View Post
[edit by LaurV]

As the thread got longer and longer and people kept coming and asking "Where is this or that version of mfaktc?", I am editing this post to give the link to the folder which Xyzzy kindly created:

http://www.mersenneforum.org/mfaktc/

Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla.
Select the suitable one for you, download it, clear some exponents!
Thanks!

[end of edit by LaurV]


Hi,

maybe a bit offtopic...

As a proof of concept I can do TF on my GTX 275 :)
A straightforward implementation of the multiprecision arithmetic gives me ~15,000,000 tested factor candidates per second for a 26-bit exponent and factors up to 71 bits. (The numbers are from memory; I hope they are correct.)
I've checked the correctness of some 1000 cases against a basic implementation using libgmp on the CPU; no errors found. :)

Currently there is no presieving of factor candidates... but I have a lot of free cycles on the CPU. ;)

The quality of the code right now is... "proof of concept" (meaning really ugly, a lot of hardcoded stuff, ...). :D

Most time is spent on the calculation of the remainder (using a simple basecase division).
jasonp: if you read this: the Montgomery multiplication suggested on the nvidia forum doesn't look suitable for CUDA to me. :(

TheJudger
Recently I found some improvements while using mfaktc; one is here:

https://www.mersenneforum.org/showpo...postcount=3014
I think adding a double-check variable may provide a more convincing result.
That post has had no reply; I don't know whether it was overlooked, and I apologize if the suggestion is really useless.



Furthermore, I found that mfaktc is really slow when dealing with a really big worktodo.txt (e.g., ~2 MB):
it takes ~1 second to rewrite worktodo.txt.
I think a better way to deal with this is: once worktodo.txt is provided, mfaktc could remember its size and work on the last record;
after TF, mfaktc could delete the last record.
Deleting the last record is much quicker than deleting the first one.
Old 2018-12-30, 12:40   #3031
TheJudger
 

Hi,

Quote:
Originally Posted by Neutron3529 View Post
Recently I found some improvements while using mfaktc; one is here:

https://www.mersenneforum.org/showpo...postcount=3014
I think adding a double-check variable may provide a more convincing result.
That post has had no reply; I don't know whether it was overlooked, and I apologize if the suggestion is really useless.
There is currently no double-checking for TF work, so there is no double-check time for you to reduce.
You want a checksum or something similar over all factoring attempts in each class, right? That won't work, for multiple reasons:
  • the list of factor candidates depends on the sieve parameters (e.g. more sieving, fewer candidates)
  • even with the same settings the list of candidates depends on the hardware, because we ignore memory conflicts: sometimes a composite factor isn't cleared by the sieve while on the next run it is.


Quote:
Originally Posted by Neutron3529 View Post
Furthermore, I found that mfaktc is really slow when dealing with a really big worktodo.txt (e.g., ~2 MB):
it takes ~1 second to rewrite worktodo.txt.
I think a better way to deal with this is: once worktodo.txt is provided, mfaktc could remember its size and work on the last record;
after TF, mfaktc could delete the last record.
Deleting the last record is much quicker than deleting the first one.
Is this a common use case? Keep in mind that I focus on the current PrimeNet wavefront; a 2 MB worktodo isn't typical usage. Maybe the easiest approach is to split your worktodo into reasonable sizes and put a small script around mfaktc: put a small worktodo.txt into the directory, start mfaktc, let it run until it has finished that worktodo.txt, then repeat with the next one.
Old 2018-12-30, 16:20   #3032
James Heinrich
 

Quote:
Originally Posted by TheJudger View Post
Is this a common use case? A 2 MB worktodo isn't typical usage.
No, that's not a common use case. Even with my work on the large-exponent, low-bit-level ranges above 1000M, where exponent (not class) runtimes are ~1 s, there's no reason to have a mammoth worktodo. For convenience I fetch/submit 1000 exponents at a time (a ~25 kB worktodo), but even if it were an offline system I would seriously consider writing a script that slices off 100-1000 assignments at a time from a separate bulk assignment file when worktodo.txt runs empty (and at the same time archives results.txt, since that also gets large quickly).

I'm curious what Neutron3529 is doing to get a 2 MB worktodo... is he my mystery TF'er who reserves a million exponents at a time and then takes two weeks to submit the results?
Old 2018-12-30, 20:06   #3033
kriesel
 

Quote:
Originally Posted by James Heinrich View Post
No, that's not a common use case. Even with my work on the large-exponent, low-bit-level ranges above 1000M, where exponent (not class) runtimes are ~1 s, there's no reason to have a mammoth worktodo. ... my mystery TF'er who reserves a million exponents at a time and then takes two weeks to submit the results?
At what point does it stop being an acceptable request and start looking like a DoS?
I don't see the point of megabyte worktodo downloads when there's so much work to do below 10^9; a well-chosen 1 kB worktodo represents several ThzDays.
Old 2018-12-30, 20:18   #3034
Mark Rose
 

Quote:
Originally Posted by kriesel View Post
I don't see the point of MB worktodo downloads
I can see why someone would do that if they didn't have a script to automate fetching/distributing the work.
Old 2018-12-30, 21:13   #3035
kriesel
 

Quote:
Originally Posted by Mark Rose View Post
I can see why someone would do that if they didn't have a script to automate fetching/distributing the work.
Plenty of work to do well under 10^9.
It would be good to TF exponents to the full gputo72 bit depth, successively, with preference for the lowest exponents first, to clear the path for P-1 and primality testing.
In other words, instead of a single bit level per exponent, do the full goal bit depth before going on to another exponent on the same GPU.

https://www.mersenne.ca/exponent/200000033 half a ThzDay.
https://www.mersenne.ca/exponent/400000079 a ThzDay.
https://www.mersenne.ca/exponent/800000087 nearly 10 ThzDays.
https://www.mersenne.ca/exponent/990000337 ~15 ThzDays: two weeks on a GTX 1080, for ~56 bytes of worktodo content.
1 kB of such work could occupy even a fast GPU for a month or more.
I have TF of one exponent on an older GPU that's estimated to complete in April.

Last fiddled with by kriesel on 2018-12-30 at 21:21
Old 2018-12-30, 21:45   #3036
GP2
 

Quote:
Originally Posted by Neutron3529 View Post
It will take ~1 second to rewrite the worktodo.txt.
Even if you use a RAM drive?