#14
"David"
Jul 2015
Ohio
11×47 Posts
Good work is being done! I now have a full understanding of the whole process from a math standpoint and am in the process of benchmarking various implementation trade-offs for doing LL on the GPU vs. those used on the CPU. The biggest unintuitive trade-off not exploited in current GPU LL code is just how slow (latency, not bandwidth) memory access is on a GPU vs. a CPU. It is often 2-40x faster to compute twiddle factors, weight signal values, etc. on the fly than to read them from precomputed tables.

Additionally, separating the squaring, normalizing, and carry kernels means writing large amounts of data out to memory just to load it back up, perform three multiplications and some adds, write it all out again, and then load it all up again. On an architecture without large caches this eats up quite a bit of performance, especially when all the weight signals and precomputed factors are also coming in from memory to pollute the cache.
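To illustrate the on-the-fly twiddle idea, here is a minimal sketch (not code from any of the programs discussed; the function names are hypothetical) contrasting the two strategies, where the table version pays one memory load per use and the recompute version pays a sin/cos pair:

```python
import cmath

def build_twiddle_table(n):
    # Table approach: precompute every factor once, then pay a
    # (high-latency, on a GPU) memory load each time one is used.
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]

def twiddle_on_the_fly(k, n):
    # Recompute approach: a couple of transcendental ops per factor,
    # which on a GPU is often cheaper than an uncached load.
    return cmath.exp(-2j * cmath.pi * k / n)
```

Both produce the same factor, so the choice is purely a latency-vs-compute trade-off.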
The one area where I am struggling to create parallel code is the variable-word carry/adder. I'm currently shuttling it off to the CPU and doing it sequentially so I can work on other areas while still maintaining correct residues. Is the parallel-prefix research route the one I should be taking? I can see it applying to most of the carries; however, I'm not sure how to adapt it to the cyclical nature of the carries with the IBDWT. gpuLucas seems to just assume a maximum carry-chain length and limit the number of passes based on that; it seems there may be circumstances with lots of 1 bits where that could fall down. Do Mlucas or prime95 effectively parallelize the carry work with multiple threads per worker?
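For reference, here is what "parallel prefix" means in the simplest fixed-base setting: the textbook carry-lookahead formulation for adding two multi-word numbers, where single-bit carries are resolved by a scan over (generate, propagate) pairs. This is only a sketch of the general idea, not IBDWT-ready; the variable word sizes and the cyclic wrap raised above are exactly what it does not handle.

```python
from itertools import accumulate

def combine(left, right):
    # Associative combine of (generate, propagate) carry-status pairs.
    # Associativity is what lets a GPU evaluate this as a parallel scan.
    g1, p1 = left
    g2, p2 = right
    return (g2 or (p2 and g1), p1 and p2)

def prefix_carries(a, b, base_bits):
    # Word-wise sum of a + b (little-endian, fixed-size words) with all
    # carries resolved via a prefix scan instead of a sequential ripple.
    mask = (1 << base_bits) - 1
    status = [(x + y > mask, x + y == mask) for x, y in zip(a, b)]
    scanned = list(accumulate(status, combine))  # sequential stand-in for the scan
    carries_in = [0] + [int(g) for g, _ in scanned[:-1]]
    digits = [(x + y + c) & mask for x, y, c in zip(a, b, carries_in)]
    return digits, int(scanned[-1][0])  # (normalized words, carry out)
```

For example, 255 + 1 in base-16 digits is `prefix_carries([15, 15, 0], [1, 0, 0], 4)`, where the carry propagates through the all-ones middle word in one scan.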
|
|
|
|
|
#15
P90 years forever!
Aug 2002
Yeehaw, FL
2·53·71 Posts
As far as I know, Mlucas, prime95, and cudaLucas all do parallel carry work with multiple threads.
|
|
|
|
|
|
#16
"David"
Jul 2015
Ohio
205₁₆ Posts
Take the small example with word sizes {10,9,9,9}. After squaring, IFFT, inverse weighting, and rounding, say we are left with: 787356, 938236, 393292, and 837234.

Base -> Carry : Carry1 -> Carry2
924 -> 768 : 256 -> 1
252 -> 1832 : 296 -> 3
76 -> 768 : 256 -> 1
114 -> 1635 : 99 -> 3

Adding the first round of carry results in parallel gives us:

924 + 99 + 1 = 1024 -> 1-bit carry
252 + 256 + 3 = 511 -> 0
76 + 296 + 1 = 373 -> 0
114 + 256 + 3 = 373 -> 0

Are you saying we can ignore the 1-bit carry and leave 1024 as the first word for the next round? If we can't, the next round looks like:

0 -> 0
512 -> 1
373 -> 0
373 -> 0

which again needs another round to get our words back to size:

0 -> 0
0 -> 0
374 -> 0
373 -> 0

The above represents "perfect carry code" - which of the above steps is unnecessary? Note that I am using unbalanced representations for simplicity, but I can contrive similar values for balanced representations.

Last fiddled with by airsquirrels on 2016-10-20 at 15:27
|
|
|
|
|
|
#17
Sep 2003
5·11·47 Posts
Fascinating stuff, though I'm only vaguely following along.

GPUs will surely see much more technological advancement than Intel CPUs in the coming years, since they're key to things like deep learning, which is advancing rapidly. It seems almost inevitable that the GPU-based programs will supersede mprime at some point, though probably not for some years yet. Just a request: this thread should probably be renamed to something less vague, for the sake of posterity, since the information being revealed here has lasting value.
|
|
|
|
|
#18
P90 years forever!
Aug 2002
Yeehaw, FL
7526₁₀ Posts
You can leave the 6-digit numbers as-is and do another FFT multiply. Of course, now you'll have 12-digit numbers to deal with.

The thing to understand about carry propagation is that it does not change the math -- each representation is equally valid. The reason we do carry propagation is to keep the numbers a manageable size in order to reduce round-off error. As far as round-off error is concerned, it makes little difference whether each FFT word is between 0 and 511 or between 0 and 512.
|
|
|
|
|
|
#19
"David"
Jul 2015
Ohio
11×47 Posts
Back to the workshop with me...
|
|
|
|
|
|
#20
"David"
Jul 2015
Ohio
11×47 Posts
On the positive side, my implementation of LL in OpenCL currently validates all the small Mersenne primes, as well as 74207281. Sadly, its current manifestation clocks in at ~4 ms/iteration at 4096K on a Fury X, just slightly above clLucas. Unsurprisingly, msft already did a lot of good, successful optimizing. I'm investigating more math tricks now. I'm wondering whether it is critical for each word in the FFT to be in order. I've also thought about whether you could do some hybrid approach, using a larger single-precision FFT on the input side and possibly doing some fancy math to switch to double precision only around the squaring and reconstruction.
|
|
|
|
|
|