mersenneforum.org  


Old 2016-10-18, 13:43   #12
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2·53·71 Posts

Try

http://processors.wiki.ti.com/index...._of_Real_Input
https://www.dsprelated.com/showarticle/97.php
http://www.engineeringproductivityto...T0001/PT10.HTM

or even this odd collection:

http://www.jjj.de/fft/fftpage.html
Old 2016-10-18, 14:44   #13
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

Thanks! Reading and gaining understanding is great. For the self-taught, the challenge is discovering what you don't know you don't know, and then finding where to learn it.
Old 2016-10-20, 04:43   #14
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

Good work is being done! I now have a full understanding of the whole process from a math standpoint, and I am benchmarking various implementation trade-offs for doing LL on the GPU versus those used on the CPU. The biggest unintuitive trade-off, not exploited in current GPU LL code, is just how slow (in latency, not bandwidth) memory access is on a GPU compared to a CPU. It is often 2-40x faster to compute twiddle factors, weight signal values, etc. on the fly than to read them from precomputed tables. Additionally, separating the squaring, normalizing, and carry kernels means writing large amounts of data out to memory just to load it back up, perform three multiplications and some adds, write it all out again, and then load it all up again. On an architecture without large caches this eats quite a bit of performance, especially when all the weight signals and precomputed factors are also streaming in from memory to pollute the cache.
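The table-versus-recompute trade-off can be illustrated in scalar form. This is purely a sketch (Python standing in for kernel code): the "on the fly" path corresponds to a sincos in the kernel, the "table" path to a load from global or constant memory.

```python
# Illustrative only: precomputed twiddle table vs. recomputing the
# factor at the point of use.  On a GPU, recomputation trades a
# high-latency memory load for a few ALU instructions.
import cmath
import math

N = 4096  # transform length, arbitrary for this sketch

# The memory-bound approach: build the table once, read it per use.
twiddle_table = [cmath.exp(-2j * math.pi * k / N) for k in range(N)]

def twiddle_on_the_fly(k, n=N):
    """Recompute the twiddle factor e^(-2*pi*i*k/n) when needed."""
    return cmath.exp(-2j * math.pi * k / n)

# Both paths produce the same factors, up to rounding error:
err = max(abs(twiddle_table[k] - twiddle_on_the_fly(k)) for k in range(N))
```

Whether recomputation wins depends on how well the hardware can hide the load latency with other work, which is exactly the trade-off discussed later in the thread.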

The one area where I am struggling to create parallel code is the variable-word carry/adder. I'm currently shuttling the data off to the CPU and doing that part sequentially, so I can work on other areas while still maintaining correct residues.

Is parallel prefix the research route I should be taking? I can see it applying to most of the carries; however, I'm not sure how to adapt it to the cyclical nature of the carries in the IBDWT. gpuLucas seems to just assume a maximum carry-chain length and limits the number of passes based on that, and it seems there may be circumstances, with lots of 1 bits, where that could fall down. Do mLucas or prime95 effectively parallelize the carry work with multiple threads per worker?
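A minimal sketch (not the actual code under discussion) of a sequential variable-base carry pass with the cyclical wraparound the IBDWT requires: the carry out of the top word re-enters at word 0, since 2^p = 1 (mod 2^p - 1).

```python
# Sequential variable-base carry normalization with cyclic wraparound,
# using unbalanced digits.  Word i holds bits[i] bits; p = sum(bits).
def sequential_carry(words, bits):
    """Sweep until every word i fits in bits[i] bits."""
    words = list(words)
    n = len(words)
    carry = 0
    while True:
        for i in range(n):
            total = words[i] + carry
            words[i] = total % (1 << bits[i])   # keep the low bits
            carry = total >> bits[i]            # push the rest onward
        if carry == 0:                          # nonzero carry wraps to word 0
            return words

# Example with word sizes {10,9,9,9}, the representation used later
# in this thread:
out = sequential_carry([787356, 938236, 393292, 837234], [10, 9, 9, 9])
in_range = all(w < (1 << b) for w, b in zip(out, [10, 9, 9, 9]))
```

This is the easy serial version; the hard part raised above is doing the same normalization in parallel.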
Old 2016-10-20, 14:13   #15
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2·53·71 Posts

Quote:
Originally Posted by airsquirrels View Post
gpuLucas seems to just assume a maximum carry-chain length and limits the number of passes based on that, and it seems there may be circumstances, with lots of 1 bits, where that could fall down. Do mLucas or prime95 effectively parallelize the carry work with multiple threads per worker?
Perfect carry code is not required. Say you get a 53-bit result and are storing 20 bits per word. You'll have a 33-bit carry into the next word, a 13-bit carry into the word after that, and -1, 0, or +1 into the third word. Do not worry about whether the +/-1 could cause another carry.
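Those bit widths can be sanity-checked directly (a quick illustration, not code from any of the programs mentioned):

```python
# Worst case: a 53-bit squaring result stored into 20-bit words.
r = (1 << 53) - 1          # largest 53-bit result
B = 1 << 20                # 20 bits kept per word

low    = r % B             # stays in this word
carry1 = r // B            # goes into the next word
carry2 = carry1 // B       # the part reaching the word after that

bits1 = carry1.bit_length()   # the 33-bit carry
bits2 = carry2.bit_length()   # the 13-bit carry

# Adding a 20-bit carry chunk to a 20-bit word overflows into the
# following word by at most 1 -- the +/-1 mentioned above:
overflow = ((B - 1) + (B - 1)) // B
```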

As far as I know, mlucas, prime95, and cudaLucas all do parallel carry work with multiple threads.
Old 2016-10-20, 14:51   #16
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

Quote:
Originally Posted by Prime95 View Post
Perfect carry code is not required. Say you get a 53-bit result and are storing 20 bits per word. You'll have a 33-bit carry into the next word, a 13-bit carry into the word after that, and -1, 0, or +1 into the third word. Do not worry about whether the +/-1 could cause another carry.

As far as I know, mlucas, prime95, and cudaLucas all do parallel carry work with multiple threads.
Does the potential extra carry just get absorbed into the math, and do we have enough precision to restore the value even if a word ends up one bit longer than it should be?

Take a small example with word sizes {10,9,9,9}.

After squaring, IFFT, inverse weighting, and rounding, say we are left with:

787356, 938236, 393292, and 837234.

word mod base -> carry out : carry into next word -> carry into second word
924 -> 768 : 256 -> 1
252 -> 1832 : 296 -> 3
76 -> 768 : 256 -> 1
114 -> 1635 : 99 -> 3

Adding the first-round carries in parallel gives us

924 + 99 + 1 = 1024 -> 1 bit carry
252 + 256 + 3 = 511 -> 0
76 + 296 + 1 = 373 -> 0
114 + 256 + 3 = 373 -> 0

Are you saying we can ignore the 1 bit carry and leave 1024 as the first word for the next round?

If we can't, the next round looks like

0 -> 0
512 -> 1
373 -> 0
373 -> 0

Which again needs another round to get our words back in range:
0 -> 0
0 -> 0
374 -> 0
373 -> 0

The above represents "perfect carry code". Which of the above steps is unnecessary? Note that I am using unbalanced representations for simplicity, but I can contrive similar values for balanced representations.
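The worked example above can be reproduced mechanically. This sketch follows the post's own convention of splitting every carry into 9-bit chunks, with the carry out of the last word wrapping cyclically back to word 0:

```python
# The {10,9,9,9} worked example, step by step.
vals  = [787356, 938236, 393292, 837234]
bases = [1024, 512, 512, 512]          # word sizes {10,9,9,9}
n = len(vals)

low    = [v % b for v, b in zip(vals, bases)]    # 924, 252, 76, 114
carry  = [v // b for v, b in zip(vals, bases)]   # 768, 1832, 768, 1635
carry1 = [c % 512 for c in carry]                # chunk into next word
carry2 = [c // 512 for c in carry]               # chunk into second word

# Add the first-round carries in parallel; indices wrap cyclically,
# so word 0 receives carries from words 3 and 2:
round1 = [low[i] + carry1[(i - 1) % n] + carry2[(i - 2) % n]
          for i in range(n)]
```

Running it reproduces the table: `round1` comes out as 1024, 511, 373, 373, including the 1024 that triggers the extra rounds discussed.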

Last fiddled with by airsquirrels on 2016-10-20 at 15:27
airsquirrels is offline   Reply With Quote
Old 2016-10-20, 15:12   #17
GP2
 
 
Sep 2003

5·11·47 Posts

Fascinating stuff, though I'm only vaguely following along.

GPUs will surely see a lot more technological advancement than Intel CPUs in the coming years, since they're key to things like deep learning, which is advancing rapidly. It seems almost inevitable that the GPU-based programs will supersede mprime at some point, though probably not for some years yet.

Just a request: this thread should probably be renamed to something less vague, for the sake of posterity, since the information being revealed here has lasting value.
Old 2016-10-20, 15:36   #18
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2·53·71 Posts

Quote:
Originally Posted by airsquirrels View Post
The above represents "perfect carry code". Which of the above steps is unnecessary?
All of them :)

You can leave the 6 digit numbers as is and do another FFT multiply. Of course, now you'll have 12-digit numbers to deal with. The thing to understand about carry propagation is that it does not change the math -- each representation is equally valid. The reason we do carry propagation is to keep the numbers a manageable size in order to reduce round off error. As far as round off error is concerned it makes little difference if each FFT word is between 0 and 511 vs. 0 and 512.
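The representation-invariance point can be checked numerically on the {10,9,9,9} example: one local carry pass leaves the represented residue unchanged, because a carry out of word i is worth exactly one unit of word i+1, and the wraparound uses 2^37 = 1 (mod 2^37 - 1). A sketch, assuming word bit-offsets 0, 10, 19, 28:

```python
# Carry propagation changes the representation, not the value: both
# forms denote the same residue mod 2^37 - 1.
bits = [10, 9, 9, 9]
M = (1 << sum(bits)) - 1     # 2^37 - 1

def represented(words):
    """Integer value of a multi-word representation, reduced mod M."""
    acc, off = 0, 0
    for w, b in zip(words, bits):
        acc += w << off      # word sits at its cumulative bit offset
        off += b
    return acc % M

before = [787356, 938236, 393292, 837234]

# One local carry pass: keep the low bits of each word, push the rest
# into the next word; the carry out of the top wraps to word 0.
after = [w % (1 << b) for w, b in zip(before, bits)]
for i, (w, b) in enumerate(zip(before, bits)):
    after[(i + 1) % len(bits)] += w >> b

same = represented(before) == represented(after)
```

Both representations evaluate to the same residue, which is exactly why an imperfectly propagated word costs nothing but a slightly larger round-off budget.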
Old 2016-10-20, 17:12   #19
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

Quote:
Originally Posted by Prime95 View Post
All of them :)

You can leave the 6 digit numbers as is and do another FFT multiply. Of course, now you'll have 12-digit numbers to deal with. The thing to understand about carry propagation is that it does not change the math -- each representation is equally valid. The reason we do carry propagation is to keep the numbers a manageable size in order to reduce round off error. As far as round off error is concerned it makes little difference if each FFT word is between 0 and 511 vs. 0 and 512.
That means I can make my carry code significantly faster than I was expecting.

Back to the workshop with me...
Old 2016-10-22, 19:51   #20
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

Quote:
Originally Posted by airsquirrels View Post
That means I can make my carry code significantly faster than I was expecting.

Back to the workshop with me...
For a quick little update: it turns out that with enough wavefronts in flight, all of that memory latency for accessing precomputed buffers nearly disappears, so my in-line computation gains did not scale up. Storing all our twiddle dees and twiddle dums as precomputed constant arrays compiled in lets the compiler optimize everything for strides, access patterns, etc.

On the positive side, my implementation of LL in OpenCL currently validates all the small Mersenne primes, as well as M74207281. Sadly, its current manifestation clocks in at ~4 ms/iteration at 4096K on a Fury X, just slightly above clLucas. Unsurprisingly, msft already did a lot of good, successful optimizing.

I'm investigating more math tricks now. I'm wondering whether it is critical for each word in the FFT to stay in order. I've also thought about a hybrid approach: using a larger single-precision FFT on the input side, and possibly doing some fancy math to switch to double precision only around the squaring and reconstruction.
Old 2018-12-29, 18:15   #21
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5363₁₀ Posts

Quote:
Originally Posted by airsquirrels View Post
For a quick little update: it turns out that with enough wavefronts in flight, all of that memory latency for accessing precomputed buffers nearly disappears, so my in-line computation gains did not scale up. Storing all our twiddle dees and twiddle dums as precomputed constant arrays compiled in lets the compiler optimize everything for strides, access patterns, etc.

On the positive side, my implementation of LL in OpenCL currently validates all the small Mersenne primes, as well as M74207281. Sadly, its current manifestation clocks in at ~4 ms/iteration at 4096K on a Fury X, just slightly above clLucas. Unsurprisingly, msft already did a lot of good, successful optimizing.

I'm investigating more math tricks now. I'm wondering whether it is critical for each word in the FFT to stay in order. I've also thought about a hybrid approach: using a larger single-precision FFT on the input side, and possibly doing some fancy math to switch to double precision only around the squaring and reconstruction.
Very interesting thread. Is there anything further to report?

