#1
"Mihai Preda"
Apr 2015
3×457 Posts
Here I'm explaining how I see LL error-detection on the GPU -- maybe somebody has helpful ideas/tips in that direction.
Doing LL on the GPU is somewhat unreliable, because some GPUs occasionally manifest transient hardware errors. This problem is compounded by the long iteration chain needed for large LL exponents. And, for a first-time LL, there is no way to cheaply validate the final result -- basically the only way to validate an LL is to do it once again as a double-check.

That's why I would very much like to have an "error detection" mechanism for the LL iteration on the GPU. This would validate (with some probability) each iteration step of the LL. When an error is detected, the step can be re-run until correct; thus "error detection" is easily turned into "error correction" for transient errors.

Below, E is the exponent and N is the FFT size. We can see the LL iteration as application of this LLstep() function:

    function LLstep(BigInt state) { return (state^2 - 2) mod (2^E - 1); }

where "state" is a big integer, formed by N integer "words" or digits (together representing E bits).

I see ideal error detection like this: have a function checksum() that takes the LL state (N words) and computes a small value (less than 64 bits) that represents a form of "checksum" or "abstraction" or "characteristic" or "reduction" of the LL state, with the properties discussed below, e.g.:

    checksum(BigInt state) -> int checkState

Let's call the small characteristic that is computed from the LL state "checkState". We must have, in addition to the LLstep() shown above, a similar iteration function checkStep() on the checkState:

    int checkStep(int checkState);

This checkStep() performs a function similar to LLstep(), but on the abstracted value checkState, i.e.:

    checkStep(checksum(state)) == checksum(LLstep(state))

Now, the core of error detection lies in checking, after every iteration, that:

    checksum(LLstep(state)) == checkStep(checksum(state))

Normally LLstep() is executed on the GPU, while checkStep() is executed on the CPU. In other words:
- keep the LL state on the GPU, keep checkState on the CPU.

On each iteration:
- update the LL state on the GPU (using LLstep()) and update checkState on the CPU (using checkStep());
- compute checksum(LL state) either on the GPU or the CPU, and compare it with the CPU's checkState for equality.

The difficulty lies in finding a pair checksum() & checkStep() that is compatible with the LLstep() above.
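A minimal runnable sketch of the checksum()/checkStep() contract, for one special case where such a pair does exist: when a small integer M divides 2^E - 1, reduction mod M commutes with reduction mod 2^E - 1. The concrete values E = 11 and M = 23 are toy choices of mine (2^11 - 1 = 2047 = 23 × 89), not from the thread:

```python
# Toy sketch of the checksum()/checkStep() contract.
# Assumption (mine): E is chosen so that 2^E - 1 is composite, and
# M is one of its factors, so (x mod (2^E - 1)) mod M == x mod M.

E = 11
MP = 2**E - 1          # 2047 = 23 * 89
M = 23                 # M divides 2^E - 1

def LLstep(state):
    return (state * state - 2) % MP

def checksum(state):
    return state % M

def checkStep(check_state):
    # Same recurrence as LLstep, but on the small residue mod M.
    return (check_state * check_state - 2) % M

state = 4
check = checksum(state)
for _ in range(100):
    state = LLstep(state)      # "GPU" side
    check = checkStep(check)   # "CPU" side
    # Core error-detection invariant from the post:
    assert checksum(state) == check
```

For a real first-time LL exponent 2^E - 1 is (hopefully) prime, so no such integer M exists -- which is exactly the difficulty the post ends on.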
#2
"Forget I exist"
Jul 2009
Dumbassville
20C0₁₆ Posts
Quote:
Last fiddled with by science_man_88 on 2017-05-03 at 00:29
#3
"Mihai Preda"
Apr 2015
3·457 Posts
1. The SUMINP != SUMOUT check is error-detection, but it only covers the FFT convolution. It does not cover:
- the multiply from words (ints) to doubles;
- the inverse multiply from doubles back to words;
- double rounding;
- carry propagation.
Thus I'd estimate it covers about 50% of the "error surface".

2. The "max rounding error" also covers only the FFT convolution (with the same limitations as SUMINP != SUMOUT), and IMO it is weaker. "Rounding error" becomes even weaker for error detection when the normal rounding error rate increases (e.g. with big words nearing 19 bits), because then random flips in the FFT would fail to trigger a jump in rounding error. "Rounding error" has the advantage of being easy and cheap on the GPU.

3. An error-detection mechanism that covers carry propagation must compute the checksum irrespective of how the carries are distributed. In other words, it must compute the same result before carry propagation as after carry propagation. An example: word-0 has 19 bits. A carry from word-0 to word-1 decreases word-0 by 2^19 and increases word-1 by 1. The checksum must compute the same value in both cases (carry moved or not).

The first thing that comes to mind is a modulo checksum: choose an integer M for the modulo, then

    sum = 0
    for i = N-1 downto 0:
        sum = ((sum << wordLength[i]) + word[i]) modulo M

This has the nice property of abstracting the carries. OTOH the function checkStep() which would "simulate" the LL iteration on this checksum does not exist. This is because LLstep() does the modulo (2^E - 1), and our checksum lost the information essential for that by doing the modulo M.

4. My idea for how to fix the problem above (3) was this: instead of choosing an integer modulo M, choose a modulo value that is compatible with 2^E - 1. Compatible means "divides". Of course that'd be hard to do with an integer, but it's easy if we extend to an irrational (floating-point) modulo. For example, choose M such that M^N == 2^E (which is close to 2^E - 1):

    bitsPerWord = E / N
    M = 2^bitsPerWord

which produces M^N = 2^E. This has the very nice property:

    checkStep(checkState) == (checkState^2 - 2) modulo M

The problem comes from the limited precision in computing the real modulo M, coupled with the amplification of error that comes from the shifting involved in checksum() (required by the carry-abstraction). Thus IMO this also doesn't work.

5. Other ideas?
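The carry-abstraction property of the modulo checksum from point 3 can be checked with a quick sketch. The word sizes and the modulus below are my own toy choices, not values from the thread:

```python
# Sketch of the carry-abstraction property of the modulo checksum:
# moving a carry between adjacent words does not change the checksum,
# because the checksum is just the big integer's value mod M.

M = 1_000_003                  # arbitrary checksum modulus (my choice)
word_len = [19, 19, 18, 19]    # bits per word, least-significant first

def checksum(word):
    s = 0
    for i in range(len(word) - 1, -1, -1):   # i = N-1 downto 0
        s = ((s << word_len[i]) + word[i]) % M
    return s

# A state where word 0 has overflowed its 19 bits...
before = [(1 << 19) + 123, 7, 5, 1]
# ...and the same state after the carry is propagated:
# word 0 loses 2^19, word 1 gains 1.
after = [123, 8, 5, 1]

assert checksum(before) == checksum(after)   # carries are abstracted
assert checksum([124, 8, 5, 1]) != checksum(after)  # a flip is seen
```

The invariance holds because word i carries weight 2^(sum of the word lengths below it), so shifting a carry up one word leaves the represented value -- and hence the residue mod M -- unchanged.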
#4
"Mihai Preda"
Apr 2015
3·457 Posts
Quote:
A weak check, because the space of "error LL states" is huge (about 2^E) and the test only catches a few of them.
#5
P90 years forever!
Aug 2002
Yeehaw, FL
2×5³×71 Posts
Here's a random idea off the top of my head (read: may be worthless).
Let's assume most GPU errors are caused by memory problems, not by incorrect floating-point computations. How about computing a checksum of all the FFT data as it is written to GPU memory, then computing a checksum as it is read back in from GPU memory? Compare the two and raise an error if they ever differ. The pros: it is very simple to implement, and it works for both the FFT process and the carry propagation process. The cons: it will not catch arithmetic errors.
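The write/read checksum idea above can be sketched like this. The simulated "GPU memory", the fold constants, and the injected bit error are all my own illustration, not gpuOwl or Prime95 code:

```python
# Sketch: fold a checksum over FFT data as it is written to (simulated)
# device memory, fold another as it is read back, and compare.

import struct

def fold_checksum(doubles):
    # Cheap order-sensitive 64-bit fold over the raw bit patterns.
    acc = 0
    for x in doubles:
        bits = struct.unpack("<Q", struct.pack("<d", x))[0]
        acc = (acc * 1099511628211 + bits) & 0xFFFFFFFFFFFFFFFF
    return acc

memory = {}   # stand-in for GPU global memory

def write_buffer(name, data):
    memory[name] = list(data)
    return fold_checksum(data)         # checksum on the way out

def read_buffer(name):
    data = memory[name]
    return data, fold_checksum(data)   # checksum on the way back in

data = [float(i) * 0.5 for i in range(1024)]
wsum = write_buffer("fft", data)
rdata, rsum = read_buffer("fft")
assert wsum == rsum                    # clean round trip

# Simulate a flipped low-order mantissa bit while the data sat in memory:
memory["fft"][100] += 2.0 ** -40
_, rsum2 = read_buffer("fft")
assert wsum != rsum2                   # the checksum catches it
```

Since the multiplier is odd, each fold step is invertible mod 2^64, so a single corrupted element is guaranteed to change the final checksum; a real implementation would of course fold on the GPU rather than re-reading the whole buffer.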
#6
"Kieren"
Jul 2011
In My Own Galaxy!
2×3×1,693 Posts
Keeping watch on the memory would be a very good idea, at least based on my experiences with CUDALucas on a few different cards.
#7
Romulan Interpreter
Jun 2011
Thailand
7·1,373 Posts
Quote:
But every manufacturer used to have its own yields -- not 100% of the chips coming out of the foundry were good. Some of the bad parts were beyond hope, so they were scrapped, moved to the grinding machines, transformed to powder, then here you are, in the gold-recovering line, etc., but some others were still usable, assuming you could live with one (or a few) bad bit(s). And guess what, nobody would care, or even see, if one pixel on your screen can only display 8 million colors, or 4, or 2, instead of the intended 16 million colors. The human eye anyhow can only distinguish a few thousand, and only talented artists and painters can see something like ten thousand colors or so. Professional monitor manufacturers say they use 8 bits of every fundamental color (red, green, blue) for a total of 16,777,216 colors (24 bits), but actually only the 6 most significant lines are connected to the LCD, making only 262,144 colors physically on screen (18 bits). Therefore, memories with only a few bits affected were sold very cheaply to graphics card manufacturers, and used by them, of course. You may have 32-bit color in your OS, but your cheap monitor may not be able to show you more than a 5-6-5, with some TMED to enhance the number of colors. Nowadays, some still use them. If your card is not functioning properly for calculus, but has no flaw for videos and games, there is a higher probability that the culprit is not the ALUs inside (arithmetical units, whatever those may be), but the memory. Some bits are just so stubborn and will not flip when you say so, but only when they want to...
Last fiddled with by LaurV on 2017-05-03 at 05:34 |
#8
"Mihai Preda"
Apr 2015
3·457 Posts
In the LL test, the buffers do not move around: they are allocated once at the beginning of the test and then keep the same position in global memory. So the repeated iterations hit the same bytes again and again, and a stuck bit would be visible "from the start" if the LL buffers hit it.

I would not be so concerned about a consistent error that manifests itself every time. What's more difficult is the error that hits, let's say, about once in 24h; if you repeat, it's not there anymore.

About memory vs. computation: there is global memory, then the caches (which likely use a different technology from global memory), and then there are the registers (VGPRs), which are also a form of "cache memory". Now, if a bit-flip hits a VGPR, it would be hard to argue whether that's memory or compute. OTOH I have no idea about the probability distribution of errors among these different elements. Let's hope mostly only global memory is affected :)
#9
Romulan Interpreter
Jun 2011
Thailand
10010110001011₂ Posts
You are totally right, and that is why we do DC. BTW, how about implementing a shift? This is more important (and easier) than the new FFTs. Right now, we cannot LL and DC the same expos with gpuOwl, due to the fact that the test is "too" deterministic (a software error in the FFT will repeat itself on the subsequent tests). So, the first step would be to implement a random shift, like P95 and CUDALucas have. Here I just realized that my LL+DC is futile, as I am doing one with clLucas and one with gpuOwl, and neither of them has shifts. Anyhow, Madpoo will do the TC, so I am not concerned.

And as long as we are at it, can you implement a -l (or -L, i.e. list) command line switch, to list/enumerate the available devices? It is not much use having a device selection switch if we are so ignorant that we do not know which gpu-able thingies we have in our computer, is it? Like GPU-Z does, listing all the available devices. This would not be difficult to implement. Maybe I have a 5000-GHz-days toy inside the box and I have no idea it is there...
#10
"Mihai Preda"
Apr 2015
3·457 Posts
Quote:
Validating between clLucas and gpuOwL (for checking software correctness, not for DC) is still quite strong, because the two have completely different FFT implementations, with different trigonometric tables, etc., which is visible in the different rounding error results. I would expect the two to have different rounding too, and different carry propagation, but I'd need to check clLucas' source to be sure of that. You can always double-check exponents that have been LLd on a CPU; that way you get software validation plus a useful DC, but you don't get an "early abort" if gpuOwL is going awry.

The device list is already there, see -h or --help.
#11
Romulan Interpreter
Jun 2011
Thailand
7·1,373 Posts
Well, we disagree about the shift. First, it is important, to protect against software errors. Second, it is (almost) free. The only affected parts are the start, where you begin with "4 shifted by some random x bits", and the "-2" step, where you instead subtract "-2 shifted" each time, and then "shift some more" the "-2 shifted" value. All the FFT and all the rest of the code stays the same. Here is an example
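The shifted iteration described above can be sketched in a few lines. The exponent E = 13 and the starting shift are my own toy choices; note that shifting mod 2^E - 1 is a rotation, and squaring doubles the rotation amount mod E:

```python
# Sketch of the shifted LL iteration: keep the state rotated left by
# `shift` bits mod 2^E - 1, subtract a correspondingly shifted 2 each
# step, and double the shift every iteration.

E = 13                 # 2^13 - 1 = 8191 is prime, so LL should hit 0
MP = 2**E - 1

def ll_plain(iters):
    s = 4
    for _ in range(iters):
        s = (s * s - 2) % MP
    return s

def ll_shifted(iters, shift):
    s = (4 << shift) % MP          # start with "4 shifted by x bits"
    for _ in range(iters):
        shift = (2 * shift) % E    # squaring doubles the rotation
        s = (s * s - (2 << shift)) % MP   # subtract the shifted -2
    return s, shift

plain = ll_plain(E - 2)
shifted, shift = ll_shifted(E - 2, 5)
# Un-rotate: shifting by the complementary amount recovers the plain state.
recovered = (shifted << (E - shift)) % MP if shift else shifted
assert recovered == plain == 0     # 2^13 - 1 is prime
```

Since the rotation amount evolves deterministically but the bit patterns in memory differ per run, a transient error is very unlikely to reproduce identically between a test and its DC -- which is the point being argued.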
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| Sieving both sides vs one side at a time | paul0 | Factoring | 5 | 2015-11-18 13:58 |
| Side Topic: 'It's Not a Toom-ah' | R.D. Silverman | Math | 68 | 2015-11-15 04:21 |
| mfaktc and CUDALucas side-by-side | TObject | GPU Computing | 2 | 2012-07-21 01:56 |
| False OnBattery detection | Traveller | Software | 13 | 2009-03-09 23:16 |
| Gap detection | Joe O | Prime Sierpinski Project | 6 | 2006-11-05 18:52 |