Thanks Judger, that was a detailed description. OK, so even on the CPU that thread work distribution would not work the way I had in mind.
Just some more questions:

1. In your latest binary mfaktc-0.21, I'm assuming that the main sieve is gpusieve.cu's

__global__ static void __launch_bounds__(256,6) SegSieve (uint8 *big_bit_array_dev, uint8 *pinfo_dev, uint32 maxp)

method. The 256 refers to the fact that you are using 256 threads in one pool, right? What is the 6?

2. Where do you store pinfo_dev, i.e. the sieving primes' info? Is it in your default 64 Mbit bitmap or elsewhere?

3. The ratio of sieving time to total time is still not clear (at least to me). Say it really is that tiny; then why not use a deeper sieve, with primes up to 2M or 10M? With that you could lower the number of surviving k values by a factor of log(B2)/log(B1), where B2 is the new depth and B1 is the old one (of course, the sieve would be slower).

4. In the mentioned SegSieve, say we have 20480 = 80*256 GPU threads; do you then launch all these threads at once in that routine, sieving an interval of length 80*65536 = 5242880? Or do you start only 256 threads at a time, whenever 256 threads are available?

5. What happens if the number of GPU threads is not divisible by 256?

Possible ideas: on line 539 we have

locsieve32[threadIdx.x * block_size / threadsPerBlock / 32 + j] |= mask;

where we know:

const uint32 block_size_in_bytes = 8192; // Size of shared memory array in bytes
const uint32 block_size = block_size_in_bytes * 8; // Number of bits generated by each block
const uint32 threadsPerBlock = 256; // Threads per block

so we could also write:

locsieve32[8 * threadIdx.x + j] |= mask;

and the compiler can easily replace the multiplication by a shift. I'm not sure the compiler arrives at the single shift by itself, but if you are sure about this, then it is absolutely right not to change it. Note that there are many such cases in this method. You could also replace those many const declarations with #define (using a single integer; don't leave multiplications/operations in them), say:

#define block_size_in_bytes 8192 // Size of shared memory array in bytes
#define block_size 65536 // = block_size_in_bytes * 8; number of bits generated by each block
#define threadsPerBlock 256

etc. Would that be faster on the GPU?
You can set sieve primes in mfaktc.ini
# GPUSievePrimes defines how far we sieve the factor candidates on the GPU.
# The first <GPUSievePrimes> primes are sieved.
#
# Minimum: GPUSievePrimes=0
# Maximum: GPUSievePrimes=1075000
#
# Default: GPUSievePrimes=82486

Usually you test the speed with different sieve primes for your GPU the first time, but the fastest setting has always been between 75K and 150K whenever I have tested it. The optimum can be slightly different, I believe, for different exponent sizes and bit depths.
Hi Robert,
[QUOTE=R. Gerbicz;504271]1. in your latest binary mfaktc-0.21 I'm assuming that the main sieve is in gpusieve.cu's __global__ static void __launch_bounds__(256,6) SegSieve (uint8 *big_bit_array_dev, uint8 *pinfo_dev, uint32 maxp) method, 256 refers to the fact that you are using 256 threads in one pool, right? What is the 6 ?[/QUOTE]

Correct: 256 threads per block, and a minimum of 6 blocks per [I]"multiprocessor"[/I] (a group of GPU cores).

[QUOTE=R. Gerbicz;504271]2. where do you store the pinfo_dev, so the sieving prime's info? Is it in your default 64mbit bitmap or elsewhere?[/QUOTE]

Not sure what you want to know here. The big 64 Mibit bitmap contains just the bits of the sieve.

[QUOTE=R. Gerbicz;504271]3. still not clear (at least for me) the ratio of sieving time/total time; say if it is really that tiny, then why not use deeper sieve, using primes up to 2m or 10m.[/QUOTE]

I can't tell you the timings because I don't know them either. As ATH noted, one simply adjusts GPUSievePrimes to the "optimal" value (minimum runtime for a given amount of work).

[QUOTE=R. Gerbicz;504271]4. in the mentioned SegSieve, say we have 20480=80*256 GPU threads, then do you launch these threads at once in that routine, making a sieve on 80*65536=5242880 length interval ? Or do you start at once only 256 threads, when you have 256 available threads?[/QUOTE]

All threads (blocks) are launched at the same time. With the default configuration there are 1024 blocks of 256 threads each (64 Mibit partitioned into 1024 segments of 64 Kibit). As far as the code is concerned it is fine to run them all at once or block after block; the GPU (driver) does the magic of scheduling the blocks onto the hardware. Currently the (big) bitmap is limited to 128 Mibit and thus 524288 threads - enough until Nvidia manages to build GPUs with more than, let's say, 64k cores.

[QUOTE=R. Gerbicz;504271]5. What happens if the number of GPU threads is not divisible by 256 ?[/QUOTE]

It is!

[QUOTE=R. Gerbicz;504271]Possible ideas: on line 539 [...] we could also write: locsieve32[8 * threadIdx.x + j] |= mask; and the compiler can easily replace it by a shift. [...] And you could replace those lots of const with #define [...] would it be faster on GPU?[/QUOTE]

Pretty sure the compiler knows that those variables are constant and precomputes the value. Replacing the multiplication with a shift might even be a bad idea: at least on some GPUs a shift has the same throughput as a multiply, and we would still need the add. On the other hand, there is a multiply-add instruction with the same throughput as a multiply (the add is for free): [URL="https://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions"]https://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions[/URL]

Oliver
[QUOTE=TheJudger;504293]
I can't tell you the timings because I don't know them either. As ATH noted, one simply adjusts GPUSievePrimes to the "optimal" value (minimum runtime for given work).[/QUOTE] Could the timings be obtained by profiling?
[QUOTE=TheJudger;196991][B][edit by LaurV]
As the thread got longer and longer and people kept coming and asking "Where is this or that version of mfaktc?", I am editing this post to give the link to the folder which Xyzzy kindly created: [URL]http://www.mersenneforum.org/mfaktc/[/URL] Here you can find many different versions of mfaktc, for different OSes, different CUDA drivers, etc. Select the suitable one for you, download it, clear some exponents! :smile: Thanks! [end of edit by LaurV][/B]

Hi, maybe a bit offtopic... as a proof-of-concept I can do TF on my GTX 275 :) A straightforward implementation of the multiprecision arithmetic gives me ~15,000,000 tested factors per second for a 26-bit exponent and factors up to 71 bits. (The numbers are from memory; I hope they are correct.) I've checked the correctness of some 1000 cases against a basic implementation using libgmp on the CPU, no errors found. :) Currently there is no presieving of factor candidates... but I have a lot of free cycles on the CPU. ;) The quality of the code right now is... "proof-of-concept" (meaning really ugly, a lot of hardcoded stuff, ...) :D Most time is spent on the calculation of the remainder (using a simple basecase division). jasonp: if you read this: the Montgomery multiplication suggested on the nvidia forum doesn't look suitable for CUDA to me. :( TheJudger[/QUOTE]

Recently I found some improvements while using mfaktc; one is here: [url]https://www.mersenneforum.org/showpost.php?p=504109&postcount=3014[/url] I think adding a double-check variable may provide a more convincing result. That post has had no reply; I don't know whether it was ignored, and I would apologize if the suggestion is really useless.

Furthermore, I found that mfaktc is really slow when dealing with a really big worktodo.txt (e.g. ~2 MB): it takes ~1 second to rewrite worktodo.txt. I think a better way to deal with this is: once the worktodo.txt is provided, mfaktc could remember its size and work on the last record; after TF, mfaktc could delete the last record. Deleting the last record is much quicker than deleting the first one.
Hi,
[QUOTE=Neutron3529;504342]Recently I found some improvments while using mfaktc [...] I think add a double-check variable may provides a more convicent result.[/QUOTE]

There is currently no doublechecking for TF work, so there is no doublecheck time to reduce. :yucky: You want a checksum (or similar) over all factoring attempts in each class, right? That won't work, for multiple reasons:
[LIST][*]the list of factor candidates depends on the sieve parameters (e.g. more sieving, fewer candidates)[*]even with the same settings, the list of candidates depends on the hardware: because we ignore memory conflicts, sometimes a composite candidate isn't cleared by the sieve on one run but is cleared on the next.[/LIST]

[QUOTE=Neutron3529;504342]Further more, I found it is really slow when mfaktc is dealing with a really big worktodo.txt(e.g., ~2M) [...][/QUOTE]

Is this a common use case? Keep in mind that I focus on the current Primenet wavefront; a 2 MB worktodo isn't general usage. Maybe the easiest solution is to split your worktodo into reasonable sizes and put a small script around mfaktc: put a small worktodo.txt into the directory, start mfaktc, let it run until it has finished worktodo.txt, then repeat with the next worktodo.txt.
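The "small script around mfaktc" could look something like this POSIX shell sketch (the mfaktc.exe name, chunk size, and file names are assumptions; it relies on mfaktc exiting once worktodo.txt is empty):

```shell
#!/bin/sh
# run_chunks FILE N: split FILE into N-line worktodo chunks and feed
# them to mfaktc one at a time, renaming each finished chunk.
run_chunks() {
    split -l "$2" "$1" chunk_              # produces chunk_aa, chunk_ab, ...
    for f in chunk_*; do
        cp "$f" worktodo.txt
        # run mfaktc until this worktodo.txt is finished (skipped if absent)
        [ -x ./mfaktc.exe ] && ./mfaktc.exe
        mv "$f" "done_$f"                  # mark the chunk as processed
    done
}
```

Invoked as `run_chunks big_worktodo.txt 1000`, it leaves a `done_chunk_*` audit trail; archiving results.txt per chunk, as suggested later in the thread, would be one extra mv inside the loop.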
[QUOTE=TheJudger;504352]Is this a common usecase? 2M worktodo isn't general usage.[/QUOTE]No, that's not a common use case. Even with my work at the large-exponent-low-bits range above 1000M, where exponent (not class) runtimes are ~1 s, there's no reason to have a mammoth worktodo. For convenience I fetch/submit 1000 exponents at a time (~25 kB worktodo), but even if it were an offline system I would seriously consider writing a script that slices off 100-1000 assignments at a time from a separate bulk assignment file when worktodo.txt runs empty (and at the same time archives off results.txt, since that also gets large quickly).
I'm curious what [i]Neutron3529[/i] is doing to get a 2 MB worktodo... is he my mystery TF'er who reserves [url=https://www.mersenne.ca/tf1G.php#assign_count_by_age]a million exponents at a time[/url] and then takes two weeks to submit the results?
[QUOTE=James Heinrich;504378]No, that's not a common usecase. ... my mystery TF'er who reserves [URL="https://www.mersenne.ca/tf1G.php#assign_count_by_age"]a million exponents at a time[/URL] and then takes 2 weeks to submit the results?[/QUOTE]At what point does it stop being an acceptable request and start looking like a DoS?
I don't see the point of megabyte worktodo downloads when there's so much work to do below 10[SUP]9[/SUP], and a well-chosen 1 kB worktodo represents several ThzDays.
[QUOTE=kriesel;504414]I don't see the point of MB worktodo downloads[/QUOTE]
I can see why someone would do that if they didn't have a script to automate fetching/distributing the work. |
[QUOTE=Mark Rose;504416]I can see why someone would do that if they didn't have a script to automate fetching/distributing the work.[/QUOTE]
Plenty of work to do well under 10[SUP]9[/SUP]. It would be good to TF exponents to full gputo72 bit depth, successively, with preference for the lowest exponent first, to clear the path for P-1 and primality testing. In other words, instead of a single bit depth per exponent, do the full goal bit depth before going on to another exponent on the same GPU.
[URL]https://www.mersenne.ca/exponent/200000033[/URL] half a ThzDay.
[URL]https://www.mersenne.ca/exponent/400000079[/URL] a ThzDay.
[URL]https://www.mersenne.ca/exponent/800000087[/URL] nearly 10 ThzDays.
[URL]https://www.mersenne.ca/exponent/990000337[/URL] ~15 ThzDays: two weeks on a GTX 1080, for ~56 bytes of worktodo content.
1 kB of such work could occupy even a fast GPU for a month or more. I have TF of one exponent running on an older GPU that's estimated to complete in April.
[QUOTE=Neutron3529;504342]
It will take ~1 second to rewrite the worktodo.txt. [/QUOTE] Even if you use a RAM drive?