First noticeable runtime glitch among my 12 crunching phones; this phone was one of the last to be set up and has been crunching 24/7 for 2 weeks:
[code]FATAL: iter = 17649705; nonzero exit carry in radix176_ditN_cy_dif1 - input wordsize may be too small.[/code]
At which point said run was interrupted and the program started on the next exponent in the worktodo.ini file without incident. Once I noticed the glitch, I killed the run and restored the interrupted-run exponent to the first line of the worktodo file; the restart was without incident, with no repetition of the above error:
[code]Restarting M50382649 at iteration = 17640000. Res64: C70880DBB325AAD6, residue shift count = 34099054
M50382649: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 34099054
this gives an average 17.472232125022195 bits per digit
Using complex FFT radices 176 16 16 32
[Apr 25 18:19:24] M50382649 Iter# = 17650000 [35.03% complete] clocks = 00:10:59.738 [ 65.9739 msec/iter] Res64: 618DD61892F108AD. AvgMaxErr = 0.045421568. MaxErr = 0.062500000. Residue shift count = 47302787.[/code] |
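The "bits per digit" figure in the restart log is just the exponent divided by the FFT length in 8-byte floats. A minimal sketch of that arithmetic, using the two exponent/FFT-length pairs quoted in this thread (the function name `bits_per_digit` is made up for illustration and is not part of Mlucas):

```python
def bits_per_digit(p: int, fft_len_k: int) -> float:
    """Average number of bits per FFT word for M(p).

    fft_len_k is the FFT length in units of K = 1024 doubles,
    e.g. 2816 for the "2816K = 2883584 8-byte floats" in the log.
    """
    n = fft_len_k * 1024
    return p / n


if __name__ == "__main__":
    # Default FFT length for this exponent: about 17.47 bits per digit.
    print(bits_per_digit(50382649, 2816))
    # A forced, much-too-large FFT length: only about 3.81 bits per digit.
    print(bits_per_digit(20000059, 5120))
```

The second call corresponds to the case where the same "nonzero exit carry" message appears without any data corruption, because the words are far smaller than the carry code expects.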
[QUOTE=ewmayer;514694]First noticeable runtime glitch among my 12 crunching phones; this phone was one of the last to be set up and has been crunching 24/7 for 2 weeks:
[code]FATAL: iter = 17649705; nonzero exit carry in radix176_ditN_cy_dif1 - input wordsize may be too small.[/code]
At which point said run was interrupted and the program started on the next exponent in the worktodo.ini file without incident. Once I noticed the glitch, I killed the run and restored the interrupted-run exponent to the first line of the worktodo file; the restart was without incident, with no repetition of the above error:
[code]Restarting M50382649 at iteration = 17640000. Res64: C70880DBB325AAD6, residue shift count = 34099054
M50382649: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 34099054
this gives an average 17.472232125022195 bits per digit
Using complex FFT radices 176 16 16 32
[Apr 25 18:19:24] M50382649 Iter# = 17650000 [35.03% complete] clocks = 00:10:59.738 [ 65.9739 msec/iter] Res64: 618DD61892F108AD. AvgMaxErr = 0.045421568. MaxErr = 0.062500000. Residue shift count = 47302787.[/code][/QUOTE]
So, one glitch in a 12 pack in 2+ weeks of running? Plus failover to next assignment so not much throughput lost. Not bad! |
[QUOTE=kriesel;514744]Plus failover to next assignment so not much throughput lost. Not bad![/QUOTE]I would have thought that auto-retry for a few times before giving up and moving on would be better. Such things are the job for the computer, not the human operator. Currently it requires manually monitoring the progress and being prepared to manually stop, manually edit and manually restart things. So much "manually" in there. Computers were supposed to be our slaves, not the other way around.
|
[QUOTE=retina;514747]I would have thought that auto-retry for a few times before giving up and moving on would be better. Such things are the job for the computer, not the human operator. Currently it requires manually monitoring the progress and being prepared to manually stop, manually edit and manually restart things. So much "manually" in there. Computers were supposed to be our slaves, not the other way around.[/QUOTE]
I'll probably add automated retry-from-last-savefile for this, but this particular kind of data-corruption error is relatively rare in my experience, so handling it has been a low priority. It's also a little tricky because the same error signature can occur without data corruption, if the user forces an FFT length significantly larger than the default for a given exponent, e.g.
[code]M20000059: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 9771056
this gives an average 3.814708518981933 bits per digit
Using complex FFT radices 160 32 32 16
mers_mod_square: Init threadpool of 2 threads
Using 2 threads in carry step
FATAL: iter = 275; nonzero exit carry in radix160_ditN_cy_dif1 - input wordsize may be too small.
Return with code ERR_CARRY[/code]
...but the main use case is default production run mode, so the retry logic can simply differentiate on that basis.

First completed DC due tomorrow, from the first-configured phone, which has been running for ~5 weeks. Fingers crossed! |
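The retry idea in the post above can be sketched roughly as follows. This is only an illustration of the control flow, not the Mlucas implementation: `ERR_CARRY` is the name that appears in the log, but the string values, the `run_from_last_savefile` callback, the `fft_length_forced` flag and the retry limit are all assumptions made up for this example.

```python
from typing import Callable

ERR_OK = "OK"                # hypothetical success status
ERR_CARRY = "ERR_CARRY"      # name taken from the log; the value is a stand-in


def run_with_retry(run_from_last_savefile: Callable[[], str],
                   fft_length_forced: bool,
                   max_retries: int = 3) -> str:
    """Re-run from the most recent savefile after a carry error.

    If the user forced a non-default FFT length, the carry error is
    probably not data corruption, so no retry is attempted.
    """
    status = run_from_last_savefile()
    retries = 0
    while (status == ERR_CARRY
           and not fft_length_forced
           and retries < max_retries):
        retries += 1
        status = run_from_last_savefile()
    return status


if __name__ == "__main__":
    attempts = []

    def flaky_run() -> str:
        # Simulated run: fails with a carry error once, then succeeds.
        attempts.append(1)
        return ERR_CARRY if len(attempts) == 1 else ERR_OK

    print(run_with_retry(flaky_run, fft_length_forced=False), len(attempts))
```

Only after the retries are exhausted would the caller move on to the next worktodo.ini entry, which is the "auto-retry a few times before giving up" behaviour asked for earlier in the thread.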
[QUOTE=ewmayer;514748]First completed DC due tomorrow, from the first-configured phone, which has been running for ~5 weeks. Fingers crossed![/QUOTE]
[url=https://www.mersenne.org/report_exponent/?exp_lo=50315123&full=1]That's a match[/url]. This was run on the phone pictured below. For the first half of the run, before I mass-set-up the batch of 10 S7 Active phones, I had it sitting in one of my 5-port-charger-with-USB-fan modules. But as I ended up with 12 phones and just 10 ports split between 2 such modules, I took this phone and one other, selected on the basis of their appearing to have the best-functioning internal cooling systems, and hung them by microUSB cable from a simple little 2-port wall-plug USB charger next to a west-facing living room window, which admits cool breezes most of the day, except from ~2-7pm when the sun hits that building wall. At night this phone runs ~60 ms/iter @2816K; during that late-afternoon warm spell it will throttle back to as slow as ~70 ms/iter on warm days. |
Nice. That briefing app looks like one of those apps that take the leftmost tab on the home tabs with the other tabs being used for app icons. You can probably disable it by holding down on an empty app icon space as if you wanted to rearrange icons, swipe to the briefing tab then click the disable toggle top right.
|
Update on my 12-phone cluster: As of yesterday all but one laggard phone - more on that shortly - have successfully completed their first DC. In addition to the first-completed DC of [url=https://mersenneforum.org/showthread.php?t=23998&page=16]post #170[/url] we have DCs of exponents
[url=https://www.mersenne.org/report_exponent/?exp_lo=50291471&full=1]50291471[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50329519&full=1]50329519[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50345153&full=1]50345153[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50356067&full=1]50356067[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50356997&full=1]50356997[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50380357&full=1]50380357[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50382649&full=1]50382649[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50382677&full=1]50382677[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50382757&full=1]50382757[/url], [url=https://www.mersenne.org/report_exponent/?exp_lo=50382881&full=1]50382881[/url].

There were 2 runs which experienced "nonzero exit carry" data-corruption issues of the kind noted in [url=https://mersenneforum.org/showthread.php?t=23998&page=16]post #166[/url], causing them (due to lack of better handling of this particular kind of error in v18) to halt the run and proceed to the next exponent in the worktodo.ini file. I manually halted both runs, restored the first-run exponent to the top of worktodo.ini and restarted; both runs subsequently completed without incident and gave results matching the first-time test. That tells me that the data corruption in question - at least based on this limited sample - is in the big residue array, as opposed to the auxiliary precomputed FFT/DWT-related data tables, which are O(sqrt(n)) in size and thus provide a much smaller "attack surface" for data corruption issues. It also tells me that the way to handle such errors is to simply halt processing and restart from the most-recent savefile.
I've implemented this in my v19 under-development code and switched all 12 phones to the latter, in hopes that the second round of DCs will experience at least one more such incident and thus test the new retry-on-error logic.

One other performance-related tweak I've made in v19 is of the "tunable accuracy based on exponent" variety. This tweak is in the DWT-weights computation code in the carry routines. As noted above, as with the FFT twiddles, my DWT code combines data from a pair of O(sqrt(n)) precomputed tables to generate the needed n DWT weights and their inverses on the fly. (Both forward and inverse weights use the same pair of tables; the inverse-weights computation makes use of the identity wt[j]*wt[n-j] = 2, thus to compute 1/wt[j] we use the 2-tables-based forward-weight computation of wt[n-j] and multiply the result by 0.5.)

Several years ago I added code in the carry step which, instead of computing the weights needed for each convolution output from scratch this way, uses a cheaper chained-weights computation to generate the next few weights from an initial one by applying a fixed (exponent-dependent) multiplier and a conditional factor of 0.5 which can be cheaply computed in branchless fashion. My choice for the length of said up-multiply chain (i.e. how frequently we do a high-accuracy 2-tables re-init of the current DWT weight) was based on the maximum exponents at each FFT length, thus the resulting chains tend to be of short length, typically no more than 8, and - this is dependent on the odd component of the FFT length - as short as 3. But those high-accuracy re-init steps are expensive, and there is no good reason to apply a fixed chain length that targets the extreme end of the exponent range at each FFT length to the smaller exponents covered by said FFT length.
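A small numerical sketch of the two ideas above, the inverse-weight identity and the chained-weights update. It assumes the standard Crandall-Fagin weight convention wt[j] = 2^(ceil(j*p/n) - j*p/n), which is consistent with the identity wt[j]*wt[n-j] = 2 quoted in the post. The function names are invented for illustration, `direct_weight` merely stands in for the 2-tables computation, and the toy sizes in the usage lines are nothing like production values.

```python
from typing import List


def direct_weight(j: int, p: int, n: int) -> float:
    """wt[j] = 2^(ceil(j*p/n) - j*p/n), computed from scratch."""
    jp = j * p
    ceil_jp_over_n = -((-jp) // n)        # exact integer ceiling of j*p/n
    return 2.0 ** ((ceil_jp_over_n * n - jp) / n)


def inverse_weight(j: int, p: int, n: int) -> float:
    """1/wt[j] for 0 < j < n, via the identity wt[j]*wt[n-j] = 2."""
    return 0.5 * direct_weight(n - j, p, n)


def chained_weights(p: int, n: int, chain_len: int) -> List[float]:
    """All n weights, with a from-scratch re-init every chain_len terms."""
    mult = direct_weight(1, p, n)         # fixed, exponent-dependent multiplier
    weights = []
    w = 1.0
    for j in range(n):
        if j % chain_len == 0:
            w = direct_weight(j, p, n)    # high-accuracy re-init
        else:
            w *= mult
            # Conditional factor of 0.5 keeps the weight in [1, 2).
            # (Written as a plain conditional here, not branchless.)
            w *= 0.5 if w >= 2.0 else 1.0
        weights.append(w)
    return weights


if __name__ == "__main__":
    p, n = 89, 16                         # toy sizes, gcd(p, n) = 1
    exact = [direct_weight(j, p, n) for j in range(n)]
    chained = chained_weights(p, n, chain_len=4)
    print(max(abs(a - b) for a, b in zip(exact, chained)))
    print(exact[3] * inverse_weight(3, p, n))
```

Each chained step adds a little rounding error to the running weight, which is why the chain length trades speed against accuracy and why shorter chains are the safe choice near the top of the exponent range.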
So in v19 I have switched to a 3-tier chain-length scheme, consisting of long, medium and short chain lengths; typical values for these might be 16,8,4 or 15,10,5 depending on FFT length. Based on my numerical trials the lengths get applied like so, based on pmax, the maximum allowed exponent at the given FFT length - parse the below in C case()-style fall-through fashion:
o p < 0.98*pmax uses long chain
o p < 0.99*pmax uses medium chain
o p >= 0.99*pmax uses short chain (previously all p's at the FFT length used the short chain).
I get a speedup of up to 5% from using the long chain, and since each FFT length covers exponents ~10% larger than those of the next-smaller FFT length, 0.98*pmax covers roughly the lower 90% of exponents at each FFT length.

Lastly, as noted in post #23 of [url=https://mersenneforum.org/showthread.php?t=23030&page=3]the next-gen Odroid thread[/url], at the FFT lengths corresponding to the current first-time-test wavefront, the ARMv8 build on 4-core (or more) CPUs really likes the larger leading radices available at each FFT length. E.g. at 5120K leading radix 320 runs as much as a full 10% faster than leading radix 160. Based on my earlier coding-up-and-testing of leading radices 288 and 320, none of the x86-based CPU families I've run on has shown this marked a preference for leading radices > 256: 288 did give as much as a 5% speedup over 144 on some CPUs, but 320 on x86 was a bust, so I left things lying there until now. Since we are within a year or so of transition from 5120K to 5632K at the first-time-test wavefront, in order to get maximal performance on ARM I need leading radix 352.
I've added that in my v19 dev-code, and here is the before-and-after comparison versus v18, using the 5632K entries from the respective mlucas.cfg files. The numbers are from running 4-threaded on the 4 Cortex-A73 cores of my Odroid N2:
v18: [i]5632 msec/iter = 130.55 radices = 176 32 32 16[/i]
v19: [i]5632 msec/iter = 114.77 radices = 352 16 16 32[/i]

Now, interestingly, the ARM likes larger leading radices so much that even for my DCs running at 2816K, having radix 352 available gives as much as a 5% speedup. Said speedup is also very thermals-dependent: On my S7 broke-o-phones having good-functioning internal cooling the resulting speedup is modest, 2-3%, e.g. from 61 ms/iter to 59 ms/iter on the fastest of my phones. But on the phones with poorer internal cooling (based on runtimes), the radix-352 speedup is greater. Using the v18 build and leading radix 176, my one 'laggard' phone mentioned at top of this post was running at an average of 85 ms/iter @2816K, for a throughput fully 30% less than my fastest phone (a slimline S7), and - to make for an apples-to-apples comparison - fully 20-25% less than its silicone-housing S7 Active brethren. That 85 ms/iter was with the laggard phone placed in one of the edge slots of its 5-port USB charging station, where it catches the maximum outer-part-of-the-fan-blades airflow of the USB fan cooling said charging station. When I switched that particular phone to the beta v19 code, its per-iteration time @2816K dropped from 85 ms to 74 ms. It used the v18 code for most of its first-DC run, however, and thus will need a full week more to complete its initial DC. |
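The three-tier chain-length rule from the post above ("p < 0.98*pmax uses long chain", and so on) can be written as a small selection function. This is a sketch only: the function name, the default tier tuple and the `pmax` value in the usage lines are assumptions for illustration, with 16,8,4 being one of the typical triples mentioned in the post.

```python
from typing import Tuple


def chain_length(p: int, pmax: int,
                 tiers: Tuple[int, int, int] = (16, 8, 4)) -> int:
    """Pick the DWT-weights chain length for exponent p.

    pmax is the maximum allowed exponent at the current FFT length;
    tiers holds the (long, medium, short) chain lengths.
    """
    long_len, medium_len, short_len = tiers
    if p < 0.98 * pmax:
        return long_len       # well below the limit: cheapest, longest chain
    if p < 0.99 * pmax:
        return medium_len
    return short_len          # close to pmax: most accurate, shortest chain


if __name__ == "__main__":
    pmax = 1_000_000          # hypothetical limit, for illustration only
    for p in (900_000, 985_000, 995_000):
        print(p, chain_length(p, pmax))
```

The early returns play the role of the C case()-style fall-through described in the post: each test is only reached if the previous, stricter one failed.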
The last of my 12 broke-o-phones successfully completed its first DC last week, and earlier today the first (and fastest, as it happens) of the Dirty Dozen finished its 2nd DC and started a first-time test. Timings went from 60 ms/iter @2816K to 98 ms/iter @4608K, a near-perfectly-linear scaling. (The theoretical n * log(n) scaling extrapolates to 101 ms/iter @4608K). Thus it'll need around 100 days to complete said LL test ... but with 12 such tests going on simultaneously things don't look bad at all.
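The scaling figures in the post above can be reproduced with a couple of lines of arithmetic. A minimal sketch, taking the 60 ms/iter @2816K timing as the baseline (the function names are made up for this example):

```python
import math


def scale_linear(t0_ms: float, n0_k: int, n1_k: int) -> float:
    """Per-iteration time at FFT length n1_k, assuming time is proportional to n."""
    return t0_ms * n1_k / n0_k


def scale_nlogn(t0_ms: float, n0_k: int, n1_k: int) -> float:
    """Per-iteration time at FFT length n1_k, assuming time is proportional to n*log(n)."""
    n0 = n0_k * 1024
    n1 = n1_k * 1024
    return t0_ms * (n1 * math.log(n1)) / (n0 * math.log(n0))


if __name__ == "__main__":
    print(round(scale_linear(60.0, 2816, 4608)))   # 98
    print(round(scale_nlogn(60.0, 2816, 4608)))    # 101
```

The observed 98 ms/iter thus sits right on the linear extrapolation and slightly below the n * log(n) one.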
|
Update - 23 of the 24 DCs (2 per phone) in my initial-assignments batch completed successfully; one mismatched the first-time-test result, and a TC showed the cellphone result to be the bad one.

More direly, after several months of running, 5 of the original 12 phones have had to be shut off due to gas-swollen battery packs, and several more are showing signs of that. The 10 S7 Active phones alas are [url=https://www.ifixit.com/Guide/Samsung+Galaxy+S7+Active+Battery+Replacement/108065]such a tight integrated package that it's nigh-impossible to get at the battery[/url], but I tried disconnecting it in the one regular-S7 phone that had swelling, only to find the phone won't power up without the battery connected. So the picture on the cellphone compute cluster front looks a lot less promising than a few months ago - one would need to limit oneself to models with relatively-accessible batteries. Note that the battery-gas-swelling problem is apparently very common even for regular phone-style usage; it just shows up even more frequently under the stress of 24/7 running. Annoyingly, a simple tiny designed-in pinhole-vent mechanism would cure this problem, but I'm very leery of DIYing such by poking a hole in one of the swollen battery packs to let the accumulated gas escape ... although with a shut-off phone rendered unusable by said problem, maybe letting the battery first run down to 0 and trying it would be worthwhile. At worst the battery will be rendered unusable, as in "no worse than currently, just with an extra pinhole."

Matt's take: [quote]Hi, That is unfortunate, so matey was right about the batteries, I thought it was outdated advice. Maybe there's a way to fake the battery being connected or run power from the battery connector instead of USB. 3rd party batteries for S7 exist and are only £4 so there can't be anything too fancy in them.[/quote]
That's the risk associated with this kind of proof-of-principle work! That's really the only blocking issue; the phones whose batteries have not suffered the swelling are crunching great, and I rarely have to intervene to restart a phone and such. |
[QUOTE=ewmayer;522874]Update - 23 of the 24 DCs (2 per phone) in my initial-assignments batch completed successfully, one mismatched the first-time-test result and a TC showed the cellphone result to be the bad one.
More direly, after several months of running, 5 of the original 12 phones have had to be shut off due to gas-swollen battery packs, and several more are showing signs of that. The 10 S7 Active phones alas are [URL="https://www.ifixit.com/Guide/Samsung+Galaxy+S7+Active+Battery+Replacement/108065"]such a tight integrated package that it's nigh-impossible to get at the battery[/URL], but I tried disconnecting it in the one regular-S7 phone that had swelling, only to find the phone won't power up without the battery connected. So the picture on the cellphone compute cluster front looks a lot less promising than a few months ago - one would need to limit oneself to models with relatively-accessible batteries. Note that the battery-gas-swelling problem is apparently very common even for regular phone-style usage; it just shows up even more frequently under the stress of 24/7 running. Annoyingly, a simple tiny designed-in pinhole-vent mechanism would cure this problem, but I'm very leery of DIYing such by poking a hole in one of the swollen battery packs to let the accumulated gas escape ... although with a shut-off phone rendered unusable by said problem, maybe letting the battery first run down to 0 and trying it would be worthwhile. At worst the battery will be rendered unusable, as in "no worse than currently, just with an extra pinhole." [/QUOTE]
Have you seen the mess a leaking lithium battery can make? I have an old system that I long ago installed a big lithium "Forever" RTC battery in. When it finally leaked, admittedly decades later, it etched the solder and pads off parts of a card or two below it. It looked like my cat had gotten very sick in there. A gentle wipe of the spongy matter crusting the card dislodged a chip that had been cut loose by the etching. But who needs a functioning hard disk interface?
You could try a venting experiment with a sharp pin or a very-high-numbered drill bit, on a fully discharged battery, in the phone that turned in the bad DC. Might be a good idea to put a plastic tray under it to protect your furniture or floor. A quick try here powering up an old Motorola QA30 without a battery shows that it wakes up, detects no battery, complains that it is unable to charge (the air where the battery would be), and then refuses any interactive input. But some code is running to get that far. Maybe rooting your phone would get you around the no-battery barrier? |
[QUOTE=ewmayer;522874]I tried disconnecting it in the one regular-S7 phone that had swelling, only to find the phone won't power up without the battery connected.[/QUOTE]I've had the contact plate separate from a cell phone battery. I wonder if you could connect the battery's severed contact plate to a power supply with some narrow gauge enamel wire, insert an inert spacer, and fool it.
|