#3026
"Robert Gerbicz"
Oct 2005
Hungary
2²·7·53 Posts
Thanks Judger, that was a detailed description. OK, so even on a CPU that thread work distribution would not work my way.

Just some more questions:

1. In your latest binary, mfaktc-0.21, I'm assuming the main sieve is gpusieve.cu's __global__ static void __launch_bounds__(256,6) SegSieve(uint8 *big_bit_array_dev, uint8 *pinfo_dev, uint32 maxp) kernel. The 256 refers to the fact that you are using 256 threads in one block, right? What is the 6?

2. Where do you store pinfo_dev, i.e. the sieving primes' info? Is it in your default 64-Mbit bitmap or elsewhere?

3. The ratio of sieving time to total time is still not clear (at least to me). If it is really that tiny, then why not use a deeper sieve, with primes up to 2M or 10M? With that you could lower the number of surviving k values by a factor of log(B2)/log(B1), where B2 is the new depth and B1 is the old (of course, the sieve itself would be slower).

4. In the mentioned SegSieve, say we have 20480 = 80*256 GPU threads. Do you launch all of these threads at once in that routine, sieving an interval of length 80*65536 = 5242880? Or do you start only 256 threads at a time, whenever 256 threads are available?

5. What happens if the number of GPU threads is not divisible by 256?

A possible idea: on line 539 we have

locsieve32[threadIdx.x * block_size / threadsPerBlock / 32 + j] |= mask;

where we know:

const uint32 block_size_in_bytes = 8192; // Size of shared memory array in bytes
const uint32 block_size = block_size_in_bytes * 8; // Number of bits generated by each block
const uint32 threadsPerBlock = 256; // Threads per block

so we could also write:

locsieve32[8 * threadIdx.x + j] |= mask;

and the compiler can easily replace the multiplication by a shift. I'm not sure the compiler reduces the original expression to a single shift; if you are sure about this, then it is absolutely right not to change it. Note that there are many such cases in this method. You could also replace all those const declarations with #define (each defined as a single integer, with no multiplications/operations left in them), say:

#define block_size_in_bytes 8192 // Size of shared memory array in bytes
#define block_size 65536 // = block_size_in_bytes * 8; number of bits generated by each block
#define threadsPerBlock 256

etc. Would that be faster on the GPU?
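The index simplification in question can be checked with plain integer arithmetic. A quick sketch in Python mirroring the C expression with the constants quoted above (C unsigned division on these positive values behaves like Python's floor division):

```python
# Constants as quoted from gpusieve.cu in the post above.
block_size_in_bytes = 8192            # size of shared-memory array in bytes
block_size = block_size_in_bytes * 8  # bits generated by each block (65536)
threadsPerBlock = 256                 # threads per block

for tid in range(threadsPerBlock):
    # Original index expression, evaluated left to right like the C code.
    long_form = tid * block_size // threadsPerBlock // 32
    # Proposed simplification: a single multiply by 8, i.e. tid << 3.
    assert long_form == 8 * tid
    assert 8 * tid == tid << 3
```

Whether nvcc actually folds the long expression down to a single shift is exactly the question being asked; the arithmetic identity itself is unconditional.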
#3027
Einyen
Dec 2003
Denmark
3⁵·13 Posts
You can set sieve primes in mfaktc.ini:

# GPUSievePrimes defines how far we sieve the factor candidates on the GPU.
# The first <GPUSievePrimes> primes are sieved.
#
# Minimum: GPUSievePrimes=0
# Maximum: GPUSievePrimes=1075000
#
# Default: GPUSievePrimes=82486

Usually you test the speed with different sieve-prime counts for your GPU the first time, but the fastest has always been between 75K and 150K every time I have tested it. I believe the optimum can be slightly different for different exponent sizes and bit depths.

Last fiddled with by ATH on 2018-12-29 at 16:38
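As a rough illustration of the deeper-sieve trade-off raised earlier in the thread (a sketch only; mfaktc's candidates have the special form 2kp+1, but the ratio argument carries over): by Mertens' third theorem the fraction of candidates surviving a sieve to bound B behaves like e^-γ/ln B, so deepening the sieve from B1 to B2 cuts the survivor count by roughly the factor ln(B1)/ln(B2):

```python
import math

def survivor_density(bound):
    """prod(1 - 1/p) over primes p <= bound: the fraction of integers
    that survive sieving by all primes up to the bound."""
    is_prime = bytearray([1]) * (bound + 1)
    is_prime[0:2] = b"\x00\x00"
    for p in range(2, int(bound ** 0.5) + 1):
        if is_prime[p]:
            # Mark all multiples of p starting at p*p as composite.
            is_prime[p * p :: p] = bytearray(len(range(p * p, bound + 1, p)))
    density = 1.0
    for p in range(2, bound + 1):
        if is_prime[p]:
            density *= 1.0 - 1.0 / p
    return density

B1, B2 = 10**4, 10**5                    # old and new sieve depths (small, for speed)
measured = survivor_density(B2) / survivor_density(B1)
predicted = math.log(B1) / math.log(B2)  # Mertens: density ~ e^-gamma / ln B
```

For example, deepening from a ~1M bound to 10M would remove only about 14% of the remaining candidates (ln 10⁶ / ln 10⁷ ≈ 0.86), which is why ever-deeper sieving quickly stops paying for itself.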
#3028
"Oliver"
Mar 2005
Germany
457₁₆ Posts
Hi Robert,

https://docs.nvidia.com/cuda/cuda-c-...c-instructions

Oliver

Last fiddled with by TheJudger on 2018-12-29 at 18:37
#3030
Dec 2018
China
41 Posts
Quote:
https://www.mersenneforum.org/showpo...postcount=3014

I think adding a double-check variable may provide a more convincing result. That post got no reply; I don't know whether it was overlooked, and I apologize if the suggestion is really useless.

Furthermore, I found that mfaktc is really slow when dealing with a very big worktodo.txt (e.g., ~2 MB): it takes ~1 second to rewrite worktodo.txt. I think a better way to handle this is: once worktodo.txt is provided, mfaktc could remember its size and work on the last record; after TF, it could delete the last record. Deleting the last record is much quicker than deleting the first one.
#3031
"Oliver"
Mar 2005
Germany
11×101 Posts
Hi,

You want a checksum or whatever for all factor attempts in each class, right? Won't work, for multiple reasons.
#3032
"James Heinrich"
May 2004
ex-Northern Ontario
6515₈ Posts
No, that's not a common use case. Even with my work at large-exponent-low-bits above 1000M, where exponent (not class) runtimes are ~1 s, there's no reason to have a mammoth worktodo. For convenience I fetch/submit 1000 exponents at a time (~25 kB worktodo), but even if it was an offline system I would seriously consider writing some script that slices off 100-1000 assignments at a time from a separate bulk assignment file when worktodo.txt runs empty (and at the same time archives off results.txt, since that also gets large quickly).

I'm curious what Neutron3529 is doing to get a 2 MB worktodo... is he my mystery TF'er who reserves a million exponents at a time and then takes 2 weeks to submit the results?
#3033
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
31×173 Posts
I don't see the point of MB worktodo downloads when there's so much work to do below 10^9, with a well-chosen 1k-size worktodo representing several ThzDays.
#3035
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
31×173 Posts
It would be good to TF exponents to full gputo72 bit depth successively, with preference for the lowest exponent first, to clear the path for P-1 and primality testing. In other words, instead of a single bit depth per exponent, do the full goal bit depth before going on to another exponent on the same GPU.

https://www.mersenne.ca/exponent/200000033 half a ThzDay.
https://www.mersenne.ca/exponent/400000079 a ThzDay.
https://www.mersenne.ca/exponent/800000087 nearly 10 ThzDays.
https://www.mersenne.ca/exponent/990000337 ~15 ThzDays. Two weeks on a GTX 1080, for ~56 bytes of worktodo content.

1 KB of such work could occupy even a fast GPU for a month or more. I have TF of one exponent on an older GPU that's estimated to complete in April.

Last fiddled with by kriesel on 2018-12-30 at 21:21
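The sizing above can be seen with a back-of-the-envelope model (an assumption for illustration, not mersenne.ca's exact GHz-day accounting): bit level b covers candidate factors between 2^(b-1) and 2^b, so each successive TF level costs roughly twice the previous one, and the final level alone is about half of the whole job:

```python
def level_costs(from_bits, to_bits):
    """Relative cost of each TF bit level, normalized so the first
    level costs 1; each level roughly doubles the candidate count."""
    return {b: 2.0 ** (b - from_bits) for b in range(from_bits, to_bits + 1)}

costs = level_costs(74, 84)        # e.g. taking an exponent from 73 to 84 bits
total = sum(costs.values())        # 2^0 + 2^1 + ... + 2^10 = 2047
share_of_last = costs[84] / total  # final level's share: 1024/2047, ~50%
```

This is why a single high-exponent assignment taken all the way to its goal depth represents so many ThzDays relative to its ~56 bytes of worktodo text.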