[QUOTE=chalsall;294849]Agreed. Probably about 1.125%.
And so everyone knows, the empirical data on the [URL="http://www.gpu72.com/reports/factor_percentage/"]Factor Found Percentage[/URL] report is undercounting a bit in the "to 71" and "to 72" columns for reasons I won't go into now, but this will hopefully correct itself shortly. (Hint, hint to the person responsible... :wink:) But this doesn't change the fact that by both your and James' analysis, we're going to 70 bits too early in the DC range.[/QUOTE] I noticed this person's tactics when I was ALMOST in 2nd place in days saved a while ago. Then I looked the next day and I was around 15000 GHz-days saved behind :yucky:
[QUOTE=chalsall;294849]Agreed. Probably about 1.125%.
And so everyone knows, the empirical data on the [URL="http://www.gpu72.com/reports/factor_percentage/"]Factor Found Percentage[/URL] report is undercounting a bit in the "to 71" and "to 72" columns for reasons I won't go into now, but this will hopefully correct itself shortly. (Hint, hint to the person responsible... :wink:) But this doesn't change the fact that by both your and James' analysis, we're going to 70 bits too early in the DC range.[/QUOTE] If you use my machines as a baseline, and your 1.125%, the worst case is 37M and the best case is 32M. I'd say to change it to 32 for now, since I am the main producer and I will switch the dogs over to the LL range. Also, using the 1.125%, the changeover to ^72 becomes 42-47M on the dogs and ^73 becomes 56-57M. ^71 is the weird area... for DC it is 40-43M, but for LL it is 35-37M (which we don't have).
I just realized that all my graphs have been done on nVidia cards. While I only have one data set for AMD, you may find this surprising:
I was running a 5770 (which cannot currently do DC/LL) in my 2500, with P95 sharing the core. A full core takes ~8.7ms/iter; the shared core took ~21.25ms/iter. So 41% of the core was being used by P95, or 59% lost to mfakto. A full core can do 37% of a 26M DC/day, so 59% of that 37% gives a 'loss' of 21.83% of a DC/day. The 5770 was outputting ~64 GHz-days/day, which means it would produce 293 GHz-days in the time the lost portion of the core could do 1 DC. This equates to 127.5 TF for that lost DC. This changes the breakeven point for ^70 to 29M, ^71 to 37M, and ^72 to 47M. For 2LL, this means ^72 would be at 37M and ^73 at 47M, and this is on my [B]2nd 'worst'[/B] system. I need a new PS for my Core2Duo before I can run tests on it with the 5770, but if we guesstimate and say it takes both cores of the Duo to only get the same 64 GHz-days, you end up with 418.9 GHz-days/lost DC, or 182 TF at 26M. That makes 26M the ^70 mark, 33M the ^71 mark and 41M the ^72 mark for DC, with 41M the ^73 mark, 51M the ^74 mark and 59M the ^75 mark for 2LL.
[QUOTE=Prime95;294466]Does anyone know if add.cc runs on 168 cores or does it get restricted to 32 or even worse 8 cores?[/QUOTE]
I don't know yet. I'm still waiting to get my hands on a GTX 680. [QUOTE=Prime95;294466]In general, how does one know which PTX instructions map to actual hardware instructions? If it's emulated, how does one see which instructions are used to emulate the PTX instruction?[/QUOTE] This is one of NVIDIA's secrets. :sad: For GK104 they totally crippled int32 performance in a lot of ways. :furious: I guess that I need to code new kernels for CC 3.0, but this may take some time. For mfaktc the reduced number of registers (per core) and the reduced L1/shared memory (per core) are OK, but the crippled int32 performance is really bad. Oliver
And to make the LL vs TF calculations that much harder... (re the comment made about RAM selection for video cards.)
Can we gather any DC LL stats on GPUs yet? Are they matching more/less often? -- Craig
Hi,
since Kepler "light" aka GK104 sucks at integer code: any idea whether it is feasible to use a couple of 32-bit floats for "small long integers" or not? Primarily I need addition, subtraction and multiplication of integers with ~80/160 bits of data. Oliver
[QUOTE=TheJudger;296462]Hi,
since Kepler "light" aka GK104 sucks at integer code: any idea whether it is feasible to use a couple of 32-bit floats for "small long integers" or not? Primarily I need addition, subtraction and multiplication of integers with ~80/160 bits of data. Oliver[/QUOTE] Using my recent 5x15-bit barrett kernel could probably be a good start, as you would have only 23 mantissa bits to use. Addition and subtraction should be no problem, but multiplication would need some work, as I rely on 32 bits for the result of a 15x15-bit multiplication plus a carry of up to 17 bits in one mad24. Certainly a solvable problem. But I have no idea if that would be any faster than pure integer...
@bcp19: Would you like to post the version of mfaktc that you used to TF M2137, M2267 and M2273 from 60 to 61 bits? I am curious how you split the classes, as it is not enough to modify the limit in the source of 0.18 and recompile. Anyhow, doing those TFs makes no sense: given how much ECM was done in that area, there should be no factors below 180 bits. One could do "fake reports" there and get a lot of TF credit (thousands of GHz-days), like reporting all exponents TF-ed to 70 or 80 bits, and still have no negative influence on the project. From the amount of ECM done, there is no factor below 40 to 60 digits (depending on the exponent) in that range of expos below 5000, which means no factor below 120-180 bits. So theoretically, if I wanted to surpass Xyzzy and Nucleon at TF, I could report all of them to 65 bits without negatively influencing the project as a whole. But this is childish. So please, could you post the "test" version of mfaktc that you used to do those tests? No harm intended, just curious how you solved the problem.
[QUOTE=LaurV;296501]@bcp19: Would you like to post the version of mfaktc that you used to TF M2137, M2267 and M2273 from 60 to 61 bits? I am curious how you split the classes, as it is not enough to modify the limit in the source of 0.18 and recompile. Anyhow, doing those TFs makes no sense: given how much ECM was done in that area, there should be no factors below 180 bits. [...] So please, could you post the "test" version of mfaktc that you used to do those tests? No harm intended, just curious how you solved the problem.[/QUOTE]
It's a special version that I asked TheJudger to make for me back in December or January, so I am unable to tell you what was changed, as he did the work, compiled it and sent me the exe. I tested it on around 5000 known factored exponents, giving him feedback to make a few tweaks, before I started using it full time. It's only able to use about 1/3 of a 450, so the run time is around 4.5 days on those 2k exponents, and it has been running since early Feb doing some 20-40k exps, but it's slow enough that I only tend to check it once or twice a month.
That is wonderful! May I be part of the "testing team"? I would like to play with lower exponents too. Please send me some Win64 binaries and tell me what expos to attack so we don't step on each other's toes.
Please see [URL="http://www.mersenneforum.org/showpost.php?p=296593&postcount=10"]this post here[/URL] too, related to this problem.
[QUOTE=LaurV;296594]That is wonderful! May I be part of the "testing team"? I would like to play with lower exponents too. Please send me some Win64 binaries and tell me what expos to attack so we don't step on each other's toes.
Please see [URL="http://www.mersenneforum.org/showpost.php?p=296593&postcount=10"]this post here[/URL] too, related to this problem.[/QUOTE] PM me your e-mail addy and I'll send it out. After the 2-3k run, I'll finish the 20-40k's and then probably work on something between 100k and 1M.
[QUOTE=bcp19;296646]PM me your e-mail addy and I'll send it out. After the 2-3k run, I'll finish the 20-40k's and then probably work on something between 100k and 1M.[/QUOTE]
Why don't we make that version available here? I know I would also like to play with it! :smile: [URL="http://www.mersenneforum.org/mfaktc/"]http://www.mersenneforum.org/mfaktc/[/URL]
[QUOTE=diamonddave;296647]Why don't we make that version available here? I know I would also like to play with it! :smile:
[URL]http://www.mersenneforum.org/mfaktc/[/URL][/QUOTE] I don't know how to put it on there. Also, there are no DOC files to go with it.
I think the author is the one who can decide if he wants to make it public or not. Of course I would like to have it, to play with it, but maybe he has a good reason why he didn't make it public. It could still be under test, or under development. I tried in the past to modify it myself, when I found that it is not enough to change the lower limit constraint; in fact, because there are many more candidates to test when the expo is small, the modification is not easy to do.
[QUOTE=LaurV;296652]I think that the author is the one who can decide if he wants it to make it public or not. Of course I would like to have it, to play with it, but maybe he has a good reason why didn't make it public. It could be still [B]under test[/B], or under development,
[...][/QUOTE] Yepp, it isn't very well tested. Without modifications in the sieve code I have to limit SievePrimes to very low values when small exponents are tested. Otherwise the code will just hang (endless loop in the offset calculation). So with such small SievePrimes it seems to work well, but again: this is not tested very well. There are two issues when the biggest prime in the sieve is greater than or equal to the exponent to test:[LIST=1][*]remove the exponent itself from the list of primes used for sieving. (Candidates are 2kp+1, so they are never divisible by p. This causes the endless loop.)[*]the offset calculation needs to take care that the primes used for sieving can be a factor of M[SUB][B]p[/B][/SUB]. This occurs as soon as the primes used for sieving are greater than 2[B]p[/B].[/LIST]So the simple solution is to limit SievePrimes such that the biggest prime in the sieve list is less than the exponent. :blush: Another potential issue is the GPU code itself. The precalculation (mfaktc-0.18/src/tf_common.cu starting at line 89) might be too greedy for small exponents. [I]Big[/I] bit_min with [I]small[/I] exponents needs to be reviewed. And another potential issue: the floating point accuracy of the approximation in the long divisions needs to be checked for small exponents (small factor candidates). Because there is little use in doing more TF on such small exponents this isn't high priority on my todo list, sorry! Oliver
Taking 84M to 67 bits, just finished taking 334M up a bitlevel.
Edit: Taking 84M to 67 and finishing the other half of 334M to 66.
My first time... and not so lucky
Tinkering with my GeForce 8400GS
Does that sound kinky? Anyway, I'm getting suspicious errors: [CODE]running a simple selftest... ERROR: cudaGetLastError() returned 6: the launch timed out and was terminated[/CODE] This is after about 45 seconds, at which time the screen goes blank and then Windows 7 64-bit pops up a window in the taskbar that says something like: Display Driver stopped responding and was recovered. Only one time out of about 7 tries did it pass the selftest, and then it said it could not read the worktodo.txt file. 2 questions: 1. I assume it should be in the same directory as mfaktc? 2. From the readme there is this sample line: Factor=bla,66362159,64,68 Is the bla, required? Per James' suggestion I changed NumStreams to 2, but my GUI is still quite laggy.
[QUOTE=petrw1;297209]Is the bla, required?[/QUOTE]Bla is not required. mfaktc is pretty good at ignoring invalid worktodo now, but the assignment key should either not be there at all, "N/A", or a 32-char hex string[code]Factor=66362159,64,68
Factor=N/A,66362159,64,68 Factor=6F466E3E1BCBC848ACA66E45ACBAC5FD,66362159,64,68[/code] I'm sure you're aware, but the 8400 GS is a feeble card; depending on the CPU it's paired with, it could potentially achieve [i]less[/i] combined TF throughput than just the CPU alone... :cry: The #1 way to make compute 1.x cards usable with mfaktc is to change GridSize=[b]0[/b] (anything larger than that will lag noticeably). The values for streams are far less important.
[QUOTE=James Heinrich;297211]I'm sure you're aware, but the 8400 GS is a feeble card, depending on the CPU it's paired with it could potentially achieve [i]less[/i] combined TF throughput than just the CPU alone... :cry:
[/QUOTE] Thanks...I suspected as much. I just wanted to see if I could get it to work. It is paired with a i5-750 OC'd 3.20 |
[QUOTE=petrw1;297213]It is paired with a i5-750 OC'd 3.20[/QUOTE]Your [url=http://mersenne-aries.sili.net/throughput.php?cpu1=Intel%28R%29+Core%28TM%29+i5+CPU+750+%40+2.67GHz%7C256%7C8192&mhz1=3200]i5-750[/url] can do about 3.75GHz-days/day of TF. Your [url=http://mersenne-aries.sili.net/mfaktc.php?sort=ghdpd&noA=1]8400 GS[/url] can do about 3.1 or 2.4GHz-days/day (depending on the revision). Which means that if you spend any more than 82% or 64% of a single CPU core feeding mfaktc, you're losing throughput.
On my slower system I have an 8800GT. I've locked mfaktc on SievePrimes=5000, which isn't optimal for GPU throughput but leaves 90% of the CPU core free to work on P-1. |
petrw1: the 8400GS might be too slow for real jobs but if you just want to see how/if mfaktc works it will still do the job. As James recommended: lower the GridSize, otherwise there is a high chance that you'll trigger the watchdog for blocked drivers on Windows, the 8400GS is really slow.
Oliver
Por favor...
[QUOTE=James Heinrich;297218]*SNIP*I've locked mfaktc on SievePrimes=5000, which isn't optimal for GPU throughput but leaves 90% of the CPU core free to work on P-1.[/QUOTE]
Mein Herr, could you (or another) explain what "SievePrimes" does? I assume (and cringe at the word) that it sieves the k values for, say, values that are divisible by 7 (6 mod 7 for exponents ending in 9), but I'm not sure. Also, please try to explain as simply as possible, I'm learning as I go :) Thanks! Johann
[QUOTE=c10ck3r;298009]Mein Herr, could you (or another) explain what "SievePrimes" does? I assume (and cringe at the word) that it sieves the k values for, say, values that are divisible by 7 (6 mod 7 for exponents ending in 9), but I'm not sure. Also, please try to explain as simply as possible, I'm learning as I go :)
Thanks! Johann[/QUOTE] It's the count of the number of primes to sieve the candidates n=2kp+1 with. The more primes you use in sieving, the more non-factor n you eliminate, but of course the law of diminishing returns applies. The important fact is that sieving candidates is done on the CPU, while actually trying candidates happens on the GPU, so the SievePrimes count is effectively how much work the CPU has to do before a candidate is sent to the GPU. If the CPU can't keep up with the GPU, lower SievePrimes so it's doing less work; if the GPU can't keep up, increase SievePrimes so the CPU does more work.
Quick short reply
[QUOTE=c10ck3r;298009]Mein Herr, could you (or another) explain what "SievePrimes" does? I assume (and cringe at the word) that it sieves the k values for, say, values that are divisible by 7 (6 mod 7 for exponents ending in 9), but I'm not sure. Also, please try to explain as simply as possible, I'm learning as I go :)
Thanks! Johann[/QUOTE] See [URL="http://www.mersenneforum.org/showpost.php?p=200887&postcount=37"]this explanation[/URL] on page 2 of this thread.
What is the “CPU Wait”? The bigger the % the worse the CPU is keeping up? Or is it the other way around?
Thanks
[QUOTE=TObject;298012]What is the “CPU Wait”? The bigger the % the worse the CPU is keeping up? Or is it the other way around?
Thanks[/QUOTE] The latter. A high CPU wait means that it is waiting for the GPU. That is, the CPU is running ahead of the GPU.
Wunderbar
[QUOTE=Dubslow;298010]It's the count of the number of primes to sieve n=2kp+1 with. The more primes you use in sieving, the more not-prime-n you eliminate, but of course the law of diminishing returns applies; the important fact is that this sieving-candidates is done on the CPU, while the actual trying-candidates happens on the GPU, so that the SP count is effectively how much work the CPU has to do before a candidate is sent to the GPU. If the CPU can't keep up with the GPU, lower sieve primes so it's doing less work; if the GPU can't keep up, increase SP so the CPU does more work.[/QUOTE]
So, how do I change SP?
[QUOTE=c10ck3r;298021]So, how do I change SP?[/QUOTE]Ideally, set [i]SievePrimesAdjust=1[/i] in mfaktc.ini and let it reach optimum.
[QUOTE=James Heinrich;298024]Ideally, set [i]SievePrimesAdjust=1[/i] in mfaktc.ini and let it reach optimum.[/QUOTE]
Often users find that it isn't very good; you can set [i]SievePrimes=5000[/i] or whatever number in mfaktc.ini, and SPAdjust to taste. (If adjust is on, it will start with whatever value you gave but will change on the fly.)
mfaktc 0.18 compiled with CUDA 4.[B][COLOR="Red"]2[/COLOR][/B] and compute capability [B][COLOR="Red"]3.0[/COLOR][/B] support. Sources are unchanged, so it's just a new executable. :smile:
[url]http://www.mersenneforum.org/mfaktc/mfaktc-0.18.win.cuda42.zip[/url] This version is for GTX 680 owners (the GTX 680 can't run the CUDA 4.0 or 4.1 executables). All other users can upgrade, but there is no need to do so. As always recommended: run the full selftest (mfaktc...exe -st2) before you start productive jobs. About the GTX 680: I still haven't had my hands on a GTX 680; the tests were done by a forum user here. Once I have access to a Kepler card (and some time) I guess I can tweak the code a little bit, but don't expect that a GTX 680 will ever perform as well as a GTX 580. :sad: Oliver
[URL="http://www.abload.de/image.php?img=neuebitmap2s2f8r.jpg"][img]http://www.abload.de/img/neuebitmap2s2f8r.jpg[/img][/URL]
Just playing around with some 65xxxxxx exponents, 70 - 71 :smile:
So your GTX 680 is ~20% overclocked and is worth ~400M/s for some reasonable assignments. So a stock GTX 680 is at ~330M/s, just 10% faster than my stock GTX 470.
For mfaktc: 470 < [B]680[/B] < 480 < 570 < 580. Less than we all hoped for, but not really bad. Now I'm interested in the power consumption while running mfaktc. Perhaps a 680 does a good job at mfaktc performance per watt? Oliver
70% TDP means perhaps 140W, which is quite a bit better than I expected.
[QUOTE=TheJudger;298272]For mfaktc: 470 < [B]680[/B] < 480 < 570 < 580[/QUOTE]According to [url=http://mersenne-aries.sili.net/mfaktc.php?sort=ghdpd&noA=1]my chart[/url] based on one benchmark from a while ago, I have the 680 and 470 very close together, with the 680 slightly behind (206 vs 218 GHz-days/day). Should I increase the expected performance of the compute-3.0 cards?
[i]edit:[/i] I've just added more 600 series GPUs to my list. What an ugly mess of compute 2.1 / 3.0 chips making up the lineup. And three variants of the GT 640! Performance-per-watt is all over the place, even performance itself: the GT 630 is rated 672 GFLOPS vs 415 GFLOPS for the 40nm version of the GT 640. But thanks to the discrepancy between 2.1 and 3.0 performance, the GT 640 still outperforms at mfaktc.
Let's wait for some more (non-OC'd) results, and for me to get my hands on a Kepler.
Oliver
[QUOTE=Prime95;294453]I'd say they are about 20 times slower than they should be!! 32-bit muls are much faster than shift lefts! Repeated adds are much faster than small shift lefts. Algorithms may have to change to avoid shift rights.[/QUOTE]
Well, the barrett79 kernel (the fastest and most used kernel) doesn't contain many shifts at all. [QUOTE=TheJudger;298282]Let's wait for some more (non-OC'd) results, and for me to get my hands on a Kepler.[/QUOTE] For a limited amount of time I can put my hands on a GTX 680 (factory overclocked, driver reports 1124MHz). [B]Raw[/B] GPU speed for TF M66362159 from 69 to 70 bits: mfaktc 0.19-pre1: 380.74M/s (my stock GTX 470 does ~335M/s) mfaktc 0.19-pre2: 380.92M/s -pre2 is the first attempt to optimize for Kepler... in the barrett79 kernel I've replaced all shift-lefts with multiplies... not really worth the extra code! :sad: Another attempt was to replace all shift-rights with multiplies (high 32-bit word), too... not a good idea, the result was ~370M/s. :sad: Actual code for a shift-left of a multiword integer [I]nn[/I] by 23 bits: [CODE]// shiftleft nn 23 bits
[...]
#if __CUDA_ARCH__ >= 300
  nn.d4 = __umad32(nn.d4, 8388608, __umul32hi(nn.d3, 8388608));
  nn.d3 = __umad32(nn.d3, 8388608, __umul32hi(nn.d2, 8388608));
  nn.d2 = __umad32(nn.d2, 8388608, __umul32hi(nn.d1, 8388608));
  nn.d1 = __umul32(nn.d1, 8388608);
#else
  nn.d4 = (nn.d4 << 23) + (nn.d3 >> 9);
  nn.d3 = (nn.d3 << 23) + (nn.d2 >> 9);
  nn.d2 = (nn.d2 << 23) + (nn.d1 >> 9);
  nn.d1 = nn.d1 << 23;
#endif[/CODE] The old code has 3[SUP]*1[/SUP] instructions per word: shiftleft + shiftright + add. The new code has only 2[SUP]*1[/SUP] instructions per word: multiply (high word) + multiply-add. [SUP]*1[/SUP] We don't really know how many hardware instructions these become; PTX is only an intermediate representation. Oliver
[QUOTE=TheJudger;300278]
[B]Raw[/B] GPU speed for TF M66362159 from 69 to 70 bits: mfaktc 0.19-pre1: 380.74M/s (my stock GTX 470 does ~335M/s) mfaktc 0.19-pre2: 380.92M/s[/QUOTE] Disappointing. I had hoped that Kepler would be better at mfaktc, as it doesn't do much slow double-precision FP. Thanks for the bit-shift optimization timings. Very interesting. Do you have any ideas as to where the bottlenecks are? Please keep us informed about other optimizations you try. The info may be useful for other CUDA projects.
Yes, for many GPU computing applications Kepler is (another) step backwards. :sad:
I've no clue what the bottlenecks are. The barrett79 kernel uses shifts only for the initial setup (precomputing the (scaled) inverse of the factor candidate); the main loop is without shifts! So I'm really surprised how much worse the replacement of shiftright with multiply (high word) performs. The lower number of registers per core on Kepler is not an issue for mfaktc; half of them would be enough! Shared memory, on-chip cache and off-chip memory are barely used by mfaktc. The performance of mfaktc primarily depends on 32-bit integer throughput. ----- Back to the multiword shiftleft example: [CODE] nn.d4 = __umad32(nn.d4, 8388608, (nn.d3 >> 9)); [/CODE] This is a very, very small improvement over shiftleft + shiftright + add! [CODE] nn.d4 = __umadhi32(nn.d3, 8388608, (nn.d4 << 23)); [/CODE] And this is worse than shiftleft + shiftright + add! ----- A small improvement for 0.19-pre2: 381.94M/s! Oliver
Do the above limitations apply to both Keplers (mini and the recently announced big Kepler)?
Mini Kepler being the GTX 680, and big Kepler being the announced K20 to be released at the end of the year? Or is it too early to comment on big Kepler? -- Craig
Hi Craig,
I've no clue... and I don't trust numbers written on paper. So give me a Kepler and I'll tell you. But for now I need to understand how Kepler light works. Mini Kepler and big Kepler... for me it is Kepler light and Kepler (just as Fermi (CC 2.0) and Fermi light (CC 2.1)). Oliver
k?
Just curious: I've run the following line:
[CODE]Factor=N/A,2373583,61,62[/CODE]and found a factor: 4675173077110571839 (prime), [B]k = 984834546993[/B]. My code gives me a bit length of 62.02, so a bit higher than demanded... No problem, but is this expected/wanted? [CODE]mfaktc v0.18 (64bit built)
got assignment: exp=2373583 bit_min=61 bit_max=62
Starting trial factoring M2373583 from 2^61 to 2^62
k_min = 485730431340
k_max = 971460871270
k_fac = 984834546993
Using GPU kernel "75bit_mul32"[/CODE]I haven't looked in the mfaktc source... Might k_max be a soft border because of the classes system?
[QUOTE=Brain;300852]
Might k_max be a soft border because of the classes system?[/QUOTE] Yes. [quote=README.txt]##################################################################
# 5.1 Stuff that looks like an issue but actually isn't an issue #
##################################################################
- mfaktc runs slower on small ranges. Usually it doesn't make much sense to run mfaktc with an upper limit smaller than 2^64. It is designed for trial factoring above 2^64 up to 2^95 (factor sizes). ==> mfaktc needs "long runs"!
- mfaktc can find factors outside the given range. E.g. './mfaktc.exe -tf 66362159 40 41' has a high chance to report 124246422648815633 as a factor. Actually this is a factor of M66362159, but its size is between 2^56 and 2^57! Of course './mfaktc.exe -tf 66362159 56 57' will find this factor, too. The reason for this behaviour is that mfaktc works on huge factor blocks. This is controlled by GridSize in mfaktc.ini. The default value is 3, which means that mfaktc runs up to 1048576 factor candidates at once (per class). So the last block of each class is filled up with factor candidates above the upper limit. While this is a huge overhead for small ranges, it's safe to ignore it on bigger ranges. If a class contains 100 blocks the overhead is on average 0.5%. When a class needs 1000 blocks the overhead is 0.05%...[/quote]
Will somebody miss the [B]debug[/B] option "VERBOSE_TIMING"? (If you don't know this option you won't miss it.)
It is a debugging/timing option which I've used in ancient versions of mfaktc. If nobody tells me a good reason why I shouldn't remove it, I'll remove it in mfaktc 0.19! I'm not even sure if it works as expected in the current code... Oliver
mfaktc for base 10 repunits
1 Attachment(s)
Hi,
I took the source of mfaktc 0.18 and changed some parts to handle base 10 repunits. I added a new code path (64-bit kernel) and removed some other parts. Here is an overview of the changes: [LIST][*]rewrote lots of code to handle repunits[*]removed the barrett kernel as it is not needed for repunits[*]removed the 72-bit kernel -> no support for older GPUs[*]added a 64-bit kernel[*]added a selftest for repunits[*]by default MORE_CLASSES is switched off (faster for smaller exponents)[*]moved the optional multiply into the modulus calculation (faster because the multiplication is done over fewer registers)[*]reduced the lower limit for exponents to 1000 (tested, but not guaranteed to work)[/LIST]Some notes about running mfaktc-repunit: [LIST=1][*]The format of the worktodo files is the same (the exponents are now for base 10).[*]On a GeForce GTX 460, one instance of mfaktc (using the 64-bit kernel) uses only a bit more than 50% of the GPU resources, so running 2 instances saturates the GPU, both then running nearly as fast as alone.[/LIST]@Oliver: I am not sure if this could be merged with the original mfaktc; you have to decide. This version gave our search for the next repunit (probable) prime a massive boost. Still the bottleneck is the PRP test. Danilo
Win64 executable for mfaktc-repunit
1 Attachment(s)
Due to the size restriction of the forum I could not put the executable in the source, so here it is:
Hi Danilo,
I'll take a look at your modifications later. [QUOTE=MrRepunit;301522]Due to the size restriction of the forum I could not put the executable in the source, so here it is:[/QUOTE] I guess I'll upload it to [url]www.mersenneforum.org/mfaktc/[/url] later, including Win32 and Win64 builds and the CUDA DLLs. Oliver
[QUOTE=TheJudger;301332]Will somebody miss the [B]debug[/B] option "VERBOSE_TIMING"? (If you don't know this option you won't miss it.)
It is a debugging/timing option which I've used in ancient versions of mfaktc. If nobody tells me a good reason why I shouldn't remove it I'll remove it in mfaktc 0.19! I'm not even sure if it works as expected in the current code... Oliver[/QUOTE] I've used it only at the very beginning of my OpenCL port. Go ahead and kick it.
[QUOTE=James Heinrich;298279]According to [URL="http://mersenne-aries.sili.net/mfaktc.php?sort=ghdpd&noA=1"]my chart[/URL] based on one benchmark from a while ago, I have the 680 and 470 very close together, with the 680 slightly behind (206 vs 218 GHz-days/day). Should I increase the expected performance of the compute-3.0 cards?
[I]edit:[/I] I've just added more 600 series GPUs to my list. What an ugly mess of compute 2.1 / 3.0 chips making up the lineup. And three variants of the GT 640! Performance-per-watt is all over the place, even performance itself: the GT 630 is rated 672 GFLOPS vs 415 GFLOPS for the 40nm version of the GT 640. But thanks to the discrepancy between 2.1 and 3.0 performance, the GT 640 still outperforms at mfaktc.[/QUOTE] I suspect that the 640 currently in the table is a rebadged 4xx something (I've read that Nvidia and vendors are doing this). But now the real (GK107) 640s have started showing up (Amazon, Newegg, not TigerDirect yet). Does anyone have one? I wonder how they measure up. I wonder what they were thinking when putting 2GB of memory in the new entry-level cards; probably a redefinition of "entry" level, or preparation for even more [strike]bloated[/strike] fine textures etc etc etc.
[QUOTE=Batalov;301886]I wonder what were they thinking while putting 2Gb of memory in the new entry level cards; probably, the redefinition of "entry" level or in preparation for even more [strike]bloated[/strike] fine textures etc etc etc.[/QUOTE]It's not such a bad thing -- once we have viable GPU-based P-1 software available maybe we can get 16GB videocards to run it on. :smile:
Oh, you now have GK107s in the list. They look awful (even GDDR5)!
Ok, one less choice to think about (that's for the kids' computer upgrade). :-)
[QUOTE=Batalov;301959]Oh, you now have GK107s in the list. They look awful (even GDDR5)![/QUOTE]Based on very limited benchmark data, but should still be ballpark-accurate.
[QUOTE=James Heinrich;294327]
[...] mfaktc: compute 1.3 = [color=orangered]54%[/color] compute 2.0 = [color=limegreen]150%[/color] compute 2.1 = [color=blue]100%[/color] compute 3.0 = [color=red]33%[/color][/QUOTE] My current feeling about CC 2.1 (e.g. GTX 460, GTX 560): the "performance issue" is not ILP (instruction-level parallelism), it is just the 32-bit integer multiply performance of these chips. Take a recent version of the "CUDA C Programming Guide"; in version 4.2 of this document there is an interesting table on page 74 (Arithmetic Instructions)... The additional cores per multiprocessor on the CC 2.1 chips (48 vs. 32 cores per multiprocessor) don't increase the throughput. mfaktc needs [B]a lot[/B] of 32-bit integer multiplies... Oliver
What is the real lower limit on exponent size?
no factor for M300109 from 2^60 to 2^61 [mfaktc 0.18-bcp-test2 75bit_mul32] by "David Campeau" on 2012-04-22 - David
[QUOTE=dbaugh;303494]What is the real lower limit on exponent size?
no factor for M300109 from 2^60 to 2^61 [mfaktc 0.18-bcp-test2 75bit_mul32] by "David Campeau" on 2012-04-22 - David[/QUOTE] That is a custom, test version, distributed in "closed circles" to the people who can be trusted in reporting "no factor" results for such small exponents. For the respective version, the lowest limit is 2000. Such low exponents had a lot of ECM done, they theoretically have no factors under 40, 45, 50 decimal digits, that is much over 140-150-165 bits (except some unprobable ECM "miss"). So, again theoretically, one can report fake "no factors" results up to 80, 90 bits etc, without doing any tests, and without risking (too much) to be wrong, and getting thousands of GHzDays of TF credit for such reports. That would be childish, but people are tempted by that huge amount of credit. If you think you qualify, and want to waste your time in TF-ing extremely-low-exponent range (again, it will be a waste of time, according with the amount of ECM done, and it will not help GIMPS, all exponents are doublechecked, but hey, :smile: finding factors is cool!), then you can ask Oliver for a copy. It is slower then the normal version, not only because of the higher amount of possible candidates (the amount of possible candidates at the same bit level is larger and larger as the exponent gets smaller and smaller), but also because special precautions needed in the software (see this thread, 4 or 5 pages before this current page). The revers of the coin is that - in spite of the amount of P-1 and ECM done - you still may find missed factors, and be hero. I did not find any factor up to now, and I only reported about 10% of the "no factors" results I got, and only after discussions with Oliver on PM, when he convinced me that "it is normal not to find factors" (I was very little confident in the beginning, after few hundred tests did not produce any factor, and I tried to push Oliver to look into it). 
After that I reported a few results, and moved on to other work types, more useful for the project, tired of coming out empty-handed every time. At least, taking "normal" (50M) exponents from gpu-2-72 is easier, you don't have to crawl the PrimeNet pages by yourself to look for unassigned stuff... Bcp and a few others are still doing this "TF-in-the-extremely-low-range" activity. You may directly PM them for news, status, coordination, etc. |
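The point about candidate counts can be sketched numerically. A rough illustration (not mfaktc code; the `candidate_count` helper and the two exponents are just examples): any factor q of M(p) has the form q = 2kp + 1 with q ≡ ±1 (mod 8), so at a fixed bit level a small exponent has proportionally more k values to test.

```python
# Rough illustration (not mfaktc code): why TF at a fixed bit level is
# more expensive for small exponents. Factors of M(p) = 2^p - 1 have the
# form q = 2*k*p + 1 with q = +/-1 (mod 8), so the number of candidates
# between 2^b and 2^(b+1) scales like 2^b / (2p) before class sieving.

def candidate_count(p, b):
    """Count k with 2^b < 2*k*p + 1 < 2^(b+1) and (2*k*p + 1) % 8 in {1, 7}."""
    k_min = (1 << b) // (2 * p) + 1
    k_max = ((1 << (b + 1)) - 2) // (2 * p)
    # q % 8 is periodic in k with period 4, so count one period and scale
    # (the handful of candidates at the range edges are ignored here).
    hits = sum((2 * k * p + 1) % 8 in (1, 7) for k in range(k_min, k_min + 4))
    return (k_max - k_min) * hits // 4

small = candidate_count(300109, 60)    # the M300109 report quoted above
big = candidate_count(50804297, 60)    # a wavefront-sized exponent
print(f"{small:.3g} vs {big:.3g} candidates, ratio ~{small / big:.0f}")
```

The ratio comes out close to 50804297/300109, i.e. the small exponent has roughly 170 times as many candidates per bit level.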
Now, that's how to answer a question!! Many thanks. -DB
|
For ze programmers...
How hard would it be to create a program to do TF in the same manner as mfaktc for, say, 3-, 5-, etc? Just a curiosity, don't shoot the cat! |
[QUOTE=c10ck3r;303761]For ze programmers...
How hard would it be to create a program to do TF in the same manner as mfaktc for, say, 3-, 5-, etc? Just a curiosity, don't shoot the cat![/QUOTE] Depends on whether they have special forms for factors, like Mersenne numbers, and if so, how similar those forms are. You'll have to ask someone who knows some math to answer that question. |
[QUOTE=c10ck3r;303761]For ze programmers...
How hard would it be to create a program to do TF in the same manner as mfaktc for, say, 3-, 5-, etc? Just a curiosity, don't shoot the cat![/QUOTE] Not hard. Already exists. It is called mfaktc. MrRepunit posted the modifications just recently for 10-. Run diff, observe the differences, check the mod classes for your favorite base. (b-1) is always a factor for b>2. The harder part is not the code but keeping the dataset and making it easy for prospective contributors to contribute. For 2- and 10-, there are interested people who do that, but what will you do with a small factor of, say, (3^10000019-1)/2 or even be sure that it has not been known for a decade, or that it will be found again tomorrow by someone? And lastly, who would be interested to know that factor? For (3^10000079-1)/2, there are some very easy factors; for (3^10000103-1)/2, there's 39094722671497... |
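For reference, the factor form does carry over: for prime p, every prime factor q of b^p - 1 either divides b - 1 or satisfies q ≡ 1 (mod 2p), so the same k-based search works. A minimal sketch (the `tf_repunit` helper is made up for illustration, not taken from mfaktc or mfakto):

```python
# Minimal sketch (illustrative only, not mfaktc/mfakto code): trial
# factoring b^p - 1 for prime p. Every prime factor q either divides
# b - 1 or has the Mersenne-like form q = 2*k*p + 1, so the k-sieving
# approach transfers; only the base-specific class filters change.

def tf_repunit(b, p, k_limit):
    """Return divisors of b^p - 1 found among q = 2*k*p + 1, k <= k_limit."""
    found = []
    for k in range(1, k_limit + 1):
        q = 2 * k * p + 1
        if pow(b, p, q) == 1:      # q divides b^p - 1
            found.append(q)
    return found

print(tf_repunit(2, 11, 50))   # M11 = 2047 = 23 * 89
print(tf_repunit(3, 5, 50))    # 3^5 - 1 = 242 = 2 * 11 * 11
```

The real work, as noted above, is the per-base class sieving and the bookkeeping, not this loop.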
Build problems - help
I installed 64-bit Ubuntu 10.04, Cuda version 5, and mfaktc-0.18.tar.gz. I did a make. Running the built mfaktc fails all the self-tests. Where do I go from here?
BTW, I built CUDALucas 2.0.1 and it works just fine on my GTX 460. |
Isn't CUDA 5 still in preview? Perhaps there's something CUDA 5 doesn't like about mfaktc. (I assume CUDALucas was built with CUDA 5? I know that it automatically targets arch_1.3, which would force CUDA < 5, so that may be why it avoided issues.)
|
A more general question from a Linux and CUDA neophyte.
What is the recommended development environment for GPU programming? Is there a Linux IDE that is integrated with CUDA debugging and profiling tools? |
[QUOTE=Prime95;304557]Is there a Linux IDE that is integrated with CUDA debugging and profiling tools?[/QUOTE]
[URL="http://ydl.net/eclipse_cuda_plugin/"]http://ydl.net/eclipse_cuda_plugin/[/URL] For those who use IDEs, Eclipse is your (Open Source) friend.... :smile: |
[QUOTE=Prime95;304554]I installed 64-bit Ubuntu 10.04, Cuda version 5, and mfaktc-0.18.tar.gz. I did a make. Running the built mfaktc fails all the self-tests. Where do I go from here?
BTW, I built CUDALucas 2.0.1 and it works just fine on my GTX 460.[/QUOTE] So it builds without many warnings (usually there are some warnings in sieve.c about signed/unsigned variable mismatch)? I guess you're using CUDA 5.0.7 (preview release), right? Fails all the selftests -> missing factors (no factor found) or wrong factors? I guess you're using default src/params.c and mfaktc.ini, right? You can try to enable some debugging code in src/params.h:[LIST][*]If you enable [I]RAW_GPU_BENCH[/I] the effect is basically that you disable the sieve code (of course this slows down the application)[*]you can try to enable [I]CHECKS_MODBASECASE[/I] and [I]USE_DEVICE_PRINTF[/I], this will enable debug code in the long integer division code; if something is really screwed the [I]USE_DEVICE_PRINTF[/I] will cause overflows because of too many printfs in GPU context.[/LIST] Another option to test: add [I]-malign-double[/I] to the CFLAGS in the Makefile (default in mfaktc-0.19...). There are some known issues with CUDA/gcc about alignment of 64bit variables. Can you provide me some lines of [I]./mfaktc.exe -v 2 -st[/I]? (just start it and stop after a few seconds by pressing <Ctrl>+C) and send me the screen output. I'll try CUDA 5.0.7 later on my system. Oliver P.S. at least it is good to know that the selftest works... |
[QUOTE=TheJudger;304626]So it builds without much warnings (usually there are some warning in sieve.c about signed/unsigned variable mismatch)?
I guess you're using CUDA 5.0.7 (preview release), right? Fails all the selftests -> missing factors (no factor found) or wrong factors? I guess you're using default src/params.c and mfaktc.ini, right? Can you provide me some lines of [I]./mfaktc.exe -v 2 -st[/I]? (just start it and stop after a few seconds by pressing <Ctrl>+C) and send me the screen output.[/QUOTE] Correct - a simple install and make without modifying anything. I'll try your debug suggestions later. Here is the output you wanted:

[CODE]mfaktc v0.18 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  WorkFile                  worktodo.txt
  Checkpoints               enabled
  CheckpointDelay           30s
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 full
  AllowSleep                no

CUDA version info
  binary compiled for CUDA  5.0
  CUDA runtime version      5.0
  CUDA driver version       5.0

CUDA device info
  name                      GeForce GTX 460
  compute capability        2.1
  maximum threads per block 1024
  number of multiprocessors 7 (336 shader cores)
  clock rate                1502MHz

Automatic parameters
  threads per grid          917504

########## testcase 1/1557 ##########
Starting trial factoring M50804297 from 2^67 to 2^68
k_min = 1599999998520
k_max = 1900000000000
Using GPU kernel "71bit_mul24"
    class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
3387/4620 |     14.68M |  0.213s |   n.a. |  68.92M/s |       25000 |   13.45%
no factor for M50804297 from 2^67 to 2^68 [mfaktc 0.18 71bit_mul24]
ERROR: selftest failed for M50804297
  no factor found
tf(): total time spent:  0.219s
Starting trial factoring M50804297 from 2^67 to 2^68
k_min = 1599999998520
k_max = 1900000000000
Using GPU kernel "75bit_mul32"
    class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
3387/4620 |     14.68M |  0.145s |   n.a. | 101.24M/s |       28125 |    0.38%
no factor for M50804297 from 2^67 to 2^68 [mfaktc 0.18 75bit_mul32]
ERROR: selftest failed for M50804297
  no factor found
tf(): total time spent:  0.151s
Starting trial factoring M50804297 from 2^67 to 2^68
k_min = 1599999998520
k_max = 1900000000000
Using GPU kernel "95bit_mul32"
    class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
3387/4620 |     14.68M |  0.164s |   n.a. |  89.51M/s |       24609 |    0.34%
no factor for M50804297 from 2^67 to 2^68 [mfaktc 0.18 95bit_mul32]
ERROR: selftest failed for M50804297
  no factor found
tf(): total time spent:  0.170s
Starting trial factoring M50804297 from 2^67 to 2^68
k_min = 1599999998520
k_max = 1900000000000
Using GPU kernel "barrett79_mul32"
    class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
3387/4620 |     14.68M |  0.123s |   n.a. | 119.35M/s |       21532 |    0.49%
no factor for M50804297 from 2^67 to 2^68 [mfaktc 0.18 barrett79_mul32]
ERROR: selftest failed for M50804297
  no factor found
tf(): total time spent:  0.128s[/CODE] |
[QUOTE=TheJudger;304626][*]If you enable [I]RAW_GPU_BENCH[/I] the effect is basically that you disable the sieve code (ofcouse this slows down the application)[*]you can try to enable [I]CHECKS_MODBASECASE[/I] and [I]USE_DEVICE_PRINTF[/I], this will enable debug code in the long integer division code
Another option to test: add [I]-malign-double[/I] to the CFLAGS in the Makefile (default in mfaktc-0.19...).[/QUOTE] None of these helped or diagnosed any problems. |
OK, next step: I'll try CUDA 5.0.7 on my system. I'm not sure if this will happen today.
Oliver |
[QUOTE=TheJudger;304646]OK, next step: I'll try CUDA 5.0.7 on my system. I'm not sure if this will happen today.[/QUOTE]
Thanks. No rush, I'm primarily playing with gpusieve. BTW, my bigger problem is that NVidia's developer web site is down thanks to hackers. |
I'm able to reproduce this issue with mfaktc 0.18 + CUDA toolkit 5.0.7-preview on openSUSE 12.1.
CUDA 5.0 driver + CUDA toolkit 4.2 -> OK
CUDA 5.0 driver + CUDA toolkit 5.0.7 -> fail

Oliver |
first impressions:
[LIST][*]floating-point approximation for divisions seems to be OK[*]data transfer seems to be OK[/LIST]I see an issue with carries (using the carry flag) in e.g. tf_barrett96.cu: mod_simple_96(). WTF? So for now I can only say to everyone: [COLOR="Red"][B]Don't use CUDA Toolkit 5.0.7 for mfaktc![/B][/COLOR] Oliver |
Some data from tf_barrett96.cu: mod_simple_96():
[CODE] qi = 0 q = 00000007 3C3F1F[COLOR="Red"]20[/COLOR] C454D397 nn = 00000000 00000000 00000000 res = 00000007 3C3F1F[COLOR="Red"]1F[/COLOR] C454D397 [/CODE] res = q - nn; So for now it looks like CUDA 5.0.7 fails when somebody uses sub with carry when the subtrahend is 0. So for now it looks like a bug in CUDA 5.0.7. Oliver |
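For anyone curious what the miscompiled operation looks like, here is a plain-Python model of that three-word subtraction (the `sub96` helper is illustrative only; the actual kernel uses inline PTX along the lines of sub.cc.u32 / subc.cc.u32 / subc.u32):

```python
# Plain-Python model (illustrative, not the kernel source) of the 96-bit
# subtraction that CUDA 5.0.7 miscompiled. mfaktc chains three 32-bit
# words through a borrow, roughly what sub.cc.u32 / subc.cc.u32 /
# subc.u32 do in PTX. With nn = 0 the borrow must stay 0 in every word;
# the buggy toolkit decremented the middle word anyway.

MASK = 0xFFFFFFFF

def sub96(q, nn):
    """Word-wise (q - nn) mod 2^96; q, nn are [low, mid, high] word lists."""
    res, borrow = [], 0
    for a, b in zip(q, nn):
        d = a - b - borrow
        borrow = 1 if d < 0 else 0
        res.append(d & MASK)
    return res

# The failing data from mod_simple_96() above, low word first:
q  = [0xC454D397, 0x3C3F1F20, 0x00000007]
nn = [0x00000000, 0x00000000, 0x00000000]
assert sub96(q, nn) == q   # CUDA 5.0.7 produced mid word 0x3C3F1F1F instead
print("ok")
```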
[QUOTE=TheJudger;304737]Some data from tf_barrett96.cu: mod_simple_96():
[CODE] qi = 0 q = 00000007 3C3F1F[COLOR="Red"]20[/COLOR] C454D397 nn = 00000000 00000000 00000000 res = 00000007 3C3F1F[COLOR="Red"]1F[/COLOR] C454D397 [/CODE] res = q - nn; So for now it looks like CUDA 5.0.7 fails when somebody uses sub with carry when the subtrahend is 0. So for now it looks like a bug in CUDA 5.0.7. Oliver[/QUOTE] Have you tried adding the volatile keyword to your asm statements? |
Nvidia confirmed the bug so I would say: not my fault/problem! :smile:
Oliver |
1 Attachment(s)
We submitted all of our completed work today in one file and were awarded some "extra credit". (See attached image.)
[CODE]P-1 found a factor in stage #2, B1=565000, B2=12147500, E=6. M56350163 has a factor: 24948611431313562132407 P-1 found a factor in stage #2, B1=540000, B2=11475000, E=6. M54203297 has a factor: 43709161575143787520913[/CODE] |
[QUOTE=Xyzzy;305027]We submitted all of our completed work today in one file and were awarded some "extra credit"[/QUOTE]I'm not even sure how to come up with those numbers... that's approx 150%-200% of the credit you should get for those factors, even if credited as TF. :unsure:
|
[QUOTE=James Heinrich;305028]I'm not even sure how to come up with those numbers... that's approx 150%-200% what credit you should get for those factors even as credited for TF. :unsure:[/QUOTE]
James, did you change the manual web forms along the lines we were discussing? If so, did the B1/B2 bounds get recorded correctly? Maybe, the underlying PHP guessed the wrong FFT size or we passed in a bogus FFT size? |
[QUOTE=Prime95;305032]James, did you change the manual web forms along the lines we were discussing? If so, did the B1/B2 bounds get recorded correctly? Maybe, the underlying PHP guessed the wrong FFT size or we passed in a bogus FFT size?[/QUOTE]No, I hadn't got to that yet (I was going to... the day PrimeNet was down for a few hours), the manual form is as yet unchanged.
|
[QUOTE=James Heinrich;305033]No, I hadn't got to that yet...[/QUOTE]
Weird. All the more reason to make those changes! |
20+% improvement
1 Attachment(s)
Oliver,
I propose creating a barrett77_mul32. This is the same as barrett79_mul32 but with the mod_simple_96 moved out of the loop. As long as f does not exceed 77 bits, a will not exceed 80 bits (above 80 bits and square_96_160 will fail). I tested this out and it passes the self tests up through 77 bits. Raw speed went from 205M/sec to 250M/sec. Crude source is attached. |
[QUOTE=Prime95;306572]Oliver,
....... I tested this out and it passes the self tests up through 77 bits. Raw speed went from 205M/sec to 250M/sec..... [/QUOTE] :w00t: Wow. |
More "extra credit":
[CODE]Processing result: M56505451 has a factor: 86553876518403762963169 CPU credit is 323.9309 GHz-days. Processing result: M56488651 has a factor: 35566445275259107720993 CPU credit is 129.5622 GHz-days. Processing result: M56491177 has a factor: 23502006329787341695151 CPU credit is 89.0731 GHz-days.[/CODE] |
[QUOTE=Xyzzy;306576]More "extra credit":
[CODE]Processing result: M56505451 has a factor: 86553876518403762963169 CPU credit is 323.9309 GHz-days. Processing result: M56488651 has a factor: 35566445275259107720993 CPU credit is 129.5622 GHz-days. Processing result: M56491177 has a factor: 23502006329787341695151 CPU credit is 89.0731 GHz-days.[/CODE][/QUOTE] :omg: |
[QUOTE=Xyzzy;306576]More "extra credit":[/QUOTE]
Yeah, you looked like you needed some credit, so that's why :razz: ------------------------------------------------------------ @prime95, related to barrett77: "Enter George Woltman, an excellent programmer and organizer..." (from the Encyclopedia Galactica:smile:, the [URL="http://primes.utm.edu/mersenne/"]History of Mersenne Primes[/URL] section) (we need a smiley which takes off his hat!) (edit, ok, this will substitute::bow:) When can we get mfaktc binaries for win64? (ideally for both the "classic" version, and the one for tf small expos; here a 20% improvement will look great, in fact we would be happier with a "barrett67" and a 50% improvement :razz:) |
wow, that is a perfomance boost :w00t:
|
[QUOTE=Prime95;306572]Oliver,
I propose creating a barrett77_mul32. This is the same as barrett79_mul32 but with the mod_simple_96 moved out of the loop. As long as f does not exceed 77 bits, a will not exceed 80 bits (above 80 bits and square_96_160 will fail). I tested this out and it passes the self tests up through 77 bits. Raw speed went from 205M/sec to 250M/sec. Crude source is attached.[/QUOTE] Very nice! In mfakto, this new 77-bit kernel is even 5% faster than the 70-bit, and 10% faster than the 73-bit kernels I have, making it the fastest again for VLIW5. The newer architectures benefit less from this kernel. Too bad this trick only works for the 79-bit kernel with its fixed 2[SUP]81[/SUP]/f inverse. The other barretts with the 2[SUP]bit_max+1[/SUP]/f inverse cannot deal with the larger square in my kernels (the inverse does not seem to have enough significant digits). |
[QUOTE=Bdot;306663]Very nice! In mfakto, this new 77-bit kernel is even 5% faster than the 70-bit, and 10% faster than the 73-bit kernels [/QUOTE]
Glad it helped and is passing your tests too. You can also create a 78-bit kernel that only adjusts the result when there is a multiplication by 2. |
Hi George,
[QUOTE=Prime95;306572]Oliver, I propose creating a barrett77_mul32. This is the same as barrett79_mul32 but with the mod_simple_96 moved out of the loop. As long as f does not exceed 77 bits, a will not exceed 80 bits (above 80 bits and square_96_160 will fail). I tested this out and it passes the self tests up through 77 bits. Raw speed went from 205M/sec to 250M/sec. Crude source is attached.[/QUOTE] cool, I'll test this (again!). Some time ago I tried something similar but failed somehow. Did you run the tests with CHECKS_MODBASECASE (src/params.h) enabled? @others: please be careful, there are some other changes and testing needed before this is safe for daily usage; with this modification alone it [B]will[/B] choose this kernel for TF up to 2[SUP]79[/SUP] and it [B]will[/B] fail there. I guess I'll reschedule my release plan for 0.19 and add this. Oliver |
[QUOTE=TheJudger;306707]I guess I'll reschedule my release plan for 0.19 and add this.[/QUOTE]
We fully agree with this! Eagerly waiting! |
[QUOTE=Bdot;306663]
Too bad this trick only works for the 79-bit kernel with it's fixed 2[SUP]81[/SUP]/f inverse. The other barretts with the 2[SUP]bit_max+1[/SUP]/f inverse cannot deal with the larger square in my kernels (the inverse does not seem to have enough significant digits).[/QUOTE] I haven't tried it, but this should also work for the barrett 96-bit kernel for factors up to 90 bits. That is, a 90-bit factor will generate a 90-bit remainder + 3 bits because we're pretty sloppy calculating the remainder. When we square the 93-bit result we get a 186-bit value. We then apply 1/f to get a 96-bit quotient - which just fits in our 3 registers. |
[QUOTE=Prime95;306723]I haven't tried it, but this should also work for the barrett 96-bit kernel for factors up to 90 bits. That is, a 90-bit factor will generate a 90-bit remainder + 3 bits because we're pretty sloppy calculating the remainder. When we square the 93-bit result we get a 186-bit value. We then apply 1/f to get a 96-bit quotient - which just fits in our 3 registers.[/QUOTE]
I tried (with my 75-bit kernel[SUP]*[/SUP], and a 68-bit factor), and the 1/f step left a remainder that increased with each loop until I had an overflow. I'll check the details later. [SUP]*[/SUP] 5 words with 15 bits each, to avoid the expensive 32-bit multiplications and use mul24 instead. |
[QUOTE=TheJudger;304737]Some data from tf_barrett96.cu: mod_simple_96():
[CODE] qi = 0 q = 00000007 3C3F1F[COLOR="Red"]20[/COLOR] C454D397 nn = 00000000 00000000 00000000 res = 00000007 3C3F1F[COLOR="Red"]1F[/COLOR] C454D397 [/CODE] res = q - nn; So for now it looks like CUDA 5.0.7 fails when somebody uses sub with carry when the subtrahend is 0. So for now it looks like a bug in CUDA 5.0.7. Oliver[/QUOTE] [QUOTE=TheJudger;305015]Nvidia confirmed the bug so I would say: not my fault/problem! :smile: Oliver[/QUOTE] btw.: Nvidia told me they have fixed the bug with a driver update. Unfortunately this driver is not yet available for me. |
[QUOTE=Prime95;306723]I haven't tried it, but this should also work for the barrett 96-bit kernel for factors up to 90 bits. [/QUOTE]
I forgot about all the nasty bit-shifting that kernel performs. It may not be possible to retrieve a 96-bit quotient -- needs further research. |
[QUOTE=Prime95;306572]Oliver,
I propose creating a barrett77_mul32. This is the same as barrett79_mul32 but with the mod_simple_96 moved out of the loop. As long as f does not exceed 77 bits, a will not exceed 80 bits (above 80 bits and square_96_160 will fail). I tested this out and it passes the self tests up through 77 bits. Raw speed went from 205M/sec to 250M/sec. Crude source is attached.[/QUOTE] This kernel does [B]not[/B] work up to 77 bits. When the factor candidates are above ~2[SUP]76.8[/SUP] there is a relatively high chance of an integer overflow (interim results >= 2[SUP]80[/SUP]). This seems to occur when the exponent has consecutive 1s in its binary representation (which causes the "optional multiply by 2"). I'm not sure whether this is the only cause or not. The kernel works absolutely safely for FCs up to 2[SUP]76[/SUP] so there is a very high chance that mfaktc 0.19 will feature a new kernel: barrett76_mul32. I need to check on my CC 1.3 GPU, too. I guess this will be the fastest kernel for those old GPUs, too. :smile: George: I want to test the current code on my GTX 275 this evening, after that I'll send you the new code (which features some debugging code, too). Oliver |
[QUOTE=TheJudger;306808]This kernel does [B]not[/B] work up to 77 bits. When the factor candidates are above ~2[SUP]76.8[/SUP] there is relative high chance for an interger overflow.[/QUOTE]
OK, I finally did the error analysis (rather than relying on the comment in the code that implied the Barrett operation resulted in a value off by at most a factor of 3). We are multiplying a 3 word value (floor (2^160/f)) by a 5 word value and ignoring the 5 bottom words of the result. By my reckoning the big multiply is ignoring 6 partial results in the 4th word which could generate 5 carries. Also, accounting for the error introduced by the floor function introduces another possible carry. Thus, the quotient can be off by up to 6. The doubling gives us off by 12, which means we need 4 pad bits -- just as Oliver observed. |
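The pad-bit argument can be checked numerically. A sketch in exact Python arithmetic (so the quotient estimate here is off by at most 2, not 6; mfaktc's truncated 3x5-word multiply loses more, hence the extra pad bit). The `tf_check` helper and the test modulus are illustrative only:

```python
# Exact-arithmetic sketch of the sloppy Barrett powering discussed above
# (illustrative; mfaktc's truncated 3x5-word multiply drops partial
# products and so has a larger quotient error than this exact version).
# mu = floor(2^160 / f); each reduction leaves a value below ~3*f rather
# than a true remainder, and the optional doubling pushes it below ~6*f,
# so a 76-bit f keeps every intermediate under the 2^80 input limit of
# square_96_160.

def tf_check(p, f):
    """Compute 2^p mod f with sloppy Barrett; also return max intermediate/f."""
    mu = (1 << 160) // f
    a, worst = 1, 0
    for bit in bin(p)[2:]:        # left-to-right binary powering of 2
        s = a * a
        assert s < (1 << 160)     # must fit the 160-bit square
        qi = (s * mu) >> 160      # quotient estimate (slightly too small)
        a = s - qi * f            # not fully reduced: 0 <= a < ~3*f
        if bit == '1':
            a *= 2                # the "optional multiply by 2"
        worst = max(worst, a // f)
    return a % f, worst

p, f = 50804297, (1 << 76) + 15   # f is just an odd 77-bit modulus here
residue, worst = tf_check(p, f)
assert residue == pow(2, p, f)    # the sloppy chain still ends correct
print("worst intermediate was", worst, "times f")
```

A candidate f divides M(p) exactly when the returned residue is 1, which is how the kernels use this chain.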
Due to a recent hike in our electricity rate, too much current draw, inadequate branch circuits and an inadequate central air cooling system, we are going to drop off of trial factoring with our GPUs.
This decision was made two months ago but we apparently were in denial (Not the river in Egypt!) that the bills we were receiving were anomalies. We plan to remove them today and put the boxes on P-1, using the resources at gpu72.com, of course. This will drop our electrical usage by about half. So, we have four GPUs to sell, cheap. [URL]http://www.newegg.com/Product/Product.aspx?Item=N82E16814121432[/URL] $200 each, plus shipping and insurance. They do not have a warranty but they have run 24×7 for a long time with no issues. Be aware that each GPU takes up three slots! If you are interested in one or more of these, please PM us. |
These are good ones! :tempted:
[QUOTE]Why do you keep calling me JéSUS?! Do I look Puerto-Rican to you? ...My name is Zeus![/QUOTE] |
I would have jumped on one of these in an instant had I not just gotten a Gigabyte 570 off eBay for about the same price.
|
Noooooooooo....
Who's going to keep me company at the top? :) BTW I might be moving in Sept. There may be some downtime for me coming up. -- Craig [QUOTE=Xyzzy;307021]Due to a recent hike in our electricity rate, too much current draw, inadequate branch circuits and an inadequate central air cooling system, we are going to drop off of trial factoring with our GPUs. This decision was made two months ago but we apparently were in denial (Not the river in Egypt!) that the bills we were receiving were anomalies. We plan to remove them today and put the boxes on P-1, using the resources at gpu72.com. of course. This will drop our electrical usage by about half. So, we have four GPUs to sell, cheap. [URL]http://www.newegg.com/Product/Product.aspx?Item=N82E16814121432[/URL] $200 each, plus shipping and insurance. They do not have a warranty but they have run 24×7 for a long time with no issues. Be aware that each GPU takes up three slots! If you are interested in one or more of these, please PM us.[/QUOTE] |