mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

Uncwilly 2020-02-19 18:01

[QUOTE=chalsall;537928]Because of this, I have invested the time in getting the GPU_TF Notebook to also run CPU jobs in parallel. This is close to being ready for production -- will be running initially with some beta testers (beyond just me).[/QUOTE]
me me me :jcrombie: pick me

James Heinrich 2020-02-19 19:58

[QUOTE=chalsall;537928]I have invested the time in getting the GPU_TF Notebook to also run CPU jobs in parallel.[/QUOTE]Thanks for doing this, I was hoping you would. :tu:

linament 2020-02-20 05:46

GPU Work Type
 
I just noticed today that a Notebook Access Key that I had set to LL TF (Depth First) says Unknown in the GPU Work Type column on the Notebook Access Key page ([url]https://www.gpu72.com/account/instances/[/url]). I didn't use this key today, so I don't know what type of work actually gets assigned to it.

chalsall 2020-02-20 14:29

[QUOTE=linament;537971]I didn't use this key today, so I don't know what type of work actually gets assigned to it.[/QUOTE]

Ah... Thanks. As usual, an SPE... Fixed.

This was just an error on what was displayed to the humans on that one page. The fetch would have gotten the work-type actually set.

bayanne 2020-02-20 15:24

Looking forward to this :)

chalsall 2020-02-20 17:21

[QUOTE=bayanne;538002]Looking forward to this :)[/QUOTE]

Yeah... It will be good to do a bit of CPU work on the side (and/or, when GPUs aren't available).

Please keep in mind that these CPUs are not that quick. ~30 running hours for a Cat 2 P-1 (done "well", though).

kriesel 2020-02-20 22:19

[QUOTE=chalsall;537928]I have invested the time in getting the GPU_TF Notebook to also run CPU jobs in parallel. This is close to being ready for production -- will be running initially with some beta testers (beyond just me).

Although, to be honest, there's not a whole lotta power there. Only a single (hyperthreaded) core of a Xeon @ 2.30GHz. 12G of RAM, though, so good for (slowish) P-1'ing. And a pity to just let it sit there idle.[/QUOTE]Excellent. A core is a terrible thing to waste. Especially if it's among dozens or hundreds.

These first-PRP tests were all done by mprime on those single Colab Xeon cores, sometimes hyperthreaded and sometimes not, with permanent storage on Google Drive to bridge the brief runs.

[CODE]86939693 Kriesel colab 57990AE5CDD378__ 30388581 1 3 2019-12-20 00:52
87092557 Kriesel colab B26A478128F5A9__ 27398510 1 3 2019-12-29 20:11
90417403 Kriesel colab 506CACA072C1F7__ 61584056 1 3 2020-02-01 21:58
90808273 Kriesel colab 711968184E7EE8__ 63588385 1 3 2020-02-17 07:28[/CODE]

chalsall 2020-02-21 19:47

[QUOTE=kriesel;538042]Excellent. A core is a terrible thing to waste. Especially if it's among dozens or hundreds.[/QUOTE]

Yeah... I don't think we'll ever get into the hundreds, but I am designing this to be ready to scale... :wink:

Amusingly, one of my beta test runs actually just found a factor (Stage 1)!

Man, "cloud" is fun!!! :tu:

[CODE]
20200221_185700 DEEP: BS: [Work thread Feb 21 18:57] M92729447 has a factor: 185587507878894135482237279303 (P-1, B1=720000)
20200221_185700 DEEP: BS: [Comm thread Feb 21 18:57] Sending result to server: UID: [REDACTED]/Colab_iROOT, M92729447 has a factor: 185587507878894135482237279303 (P-1, B1=720000)
20200221_185700 DEEP: BS: [Comm thread Feb 21 18:57]
20200221_185701 DEEP: BS: [Comm thread Feb 21 18:57] PrimeNet success code with additional info:
20200221_185701 DEEP: BS: [Comm thread Feb 21 18:57] CPU credit is 3.7816 GHz-days.
20200221_185701 DEEP: BS: [Comm thread Feb 21 18:57] Successfully quit GIMPS.
[/CODE]

chalsall 2020-02-21 20:54

Just in time TF'ing...
 
So, thanks to those who are choosing "LL Depth" factoring for your Colab Instances.

Somewhat interestingly, Cat 4 dropped into the high end of 105M last night. So this means any TF'ing work to 77 there gets picked up by an LL'er within a few minutes.

Just so everyone knows, Cat 4 needs about 100 assignments a day -- they always bite off more than they can chew... Most will be recycled, but some complete. And, also, they're in some ways P-1'ers for the Cat 2 and 3s. :smile:

kladner 2020-02-25 22:16

Almost bought at least a GTX 1650, maybe a 1660, today. I restrained myself, at least for today.

nomead 2020-02-26 02:59

[QUOTE=kladner;538331]Almost bought at least a GTX 1650, maybe a 1660, today. I restrained myself, at least for today.[/QUOTE]

Please don't even consider the plain 1650; the price difference to the 1650 Super is so small that it doesn't make sense (over 40% more CUDA cores on the Super version). Unless you're limited by having no PCIe power connector, and thus a 75 W maximum power budget...

1660 vs. 1660 Super is different: there the only change is the memory, which matters in games but not in TF.

James Heinrich 2020-02-26 03:50

[QUOTE=nomead;538342]Please don't even consider the plain 1650, there is such a small price difference to the 1650 Super that it doesn't make sense.[/QUOTE][url]https://www.mersenne.ca/mfaktc.php?filter=1650|1660[/url]
The performance difference is clear, but it comes down to what price you can get each for as to what makes value sense.

kladner 2020-02-26 05:57

Thanks very much for that tip. I was having a hard time finding CUDA core numbers on the packages I was looking at. Thanks also for explaining the distinction between 1660 types. At this point, I might be looking for a good sale price, as long as it's not from Amazon. I try to work around them as much as possible.


Edit: And thanks, James. I will peruse the ratings before I jump into anything.

axn 2020-02-26 06:00

[QUOTE=James Heinrich;538343][url]https://www.mersenne.ca/mfaktc.php?filter=1650|1660[/url]
The performance difference is clear, but it comes down to what price you can get each for as to what makes value sense.[/QUOTE]

Some of the numbers there are not accurate, I think. For example, the GFLOPS for 1650 Super is way off. See [url]https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_16_series[/url]

James Heinrich 2020-02-26 12:59

[QUOTE=axn;538348]Some of the numbers there are not accurate, I think. For example, the GFLOPS for 1650 Super is way off.[/QUOTE]Thanks for catching that, the 1650 was 10% too high and the 1650 Super was 30% too low. As always, please call out any errors you may spot in my data.

chalsall 2020-02-26 16:29

[QUOTE=chalsall;538011]Yeah... It will be good to do a bit of CPU work on the side (and/or, when GPUs aren't available).[/QUOTE]

Just a quick update on this...

Alpha testing is going well. Beta testing will start tomorrow.

Please PM me if you're interested in being a beta tester. It will involve a tiny change to the Notebook code.

P.S. I both love, and hate, projects like this... MASSIVE feature creep... :chalsall:

Uncwilly 2020-02-26 20:54

[QUOTE=chalsall;538371]Please PM me if you're interested in being a beta tester. It will involve a tiny change to the Notebook code.[/QUOTE]I would love to beta, but I am away from my evil lair for a few days.

chalsall 2020-02-28 18:51

GPU72_TF Notebook. Not just for GPUs anymore!!!
 
OK... So, after finding a stupid greedy regex bug which could result in checkpoint files being lost for (ironically) large candidates, this is now ready for gamma testing. The last step before moving this into full production.

For anyone who would like to try this out, edit your GPU72_TF Notebook code to have this line:

[CODE]!wget -qO bootstrap.tgz https://www.gpu72.com/colab/bootstrap_cpu.tgz[/CODE]

Basically, just add the "_cpu" string to the line.
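If you keep a copy of that cell's code in a plain file, the one-string change can be sketched with sed (the filename here is hypothetical; in the Notebook itself you make the same edit by hand in the code view):

```shell
# Hypothetical sketch: apply the "_cpu" change to a saved copy of the cell.
# "notebook_cell.txt" is a stand-in filename for illustration only.
printf '%s\n' '!wget -qO bootstrap.tgz https://www.gpu72.com/colab/bootstrap.tgz' > notebook_cell.txt
sed -i 's#colab/bootstrap\.tgz#colab/bootstrap_cpu.tgz#' notebook_cell.txt
cat notebook_cell.txt
```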

Then, relaunch your Notebook, and it will do CPU work in parallel. Currently, it's Cat 2 P-1'ing, but I plan to add other options soon. E.g. Cat 3 and DC P-1'ing.

Note that this only works for those whose Primenet Username the system knows. I'll add a form over the weekend to let people do this themselves -- currently, it's manual DB work; PM me if you're eager to try this now.

A CPU is created on Primenet with the same name as the Instance the job is running under, and the results will be submitted to Primenet under it. The checkpoint files are generated and thrown back to the server every ten minutes, so on average only five minutes of work is lost when an instance is killed.
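A minimal sketch of what such a ten-minute upload job could look like as a crontab entry (the script path and log file are made-up names for illustration; the real uploader is part of the CPU Payload and isn't shown here):

```shell
# Hypothetical crontab entry: push checkpoint files back every ten minutes.
# /opt/payload/upload_checkpoints.sh is an illustrative name only.
cat > /tmp/payload.cron <<'EOF'
*/10 * * * * /opt/payload/upload_checkpoints.sh >> /var/log/payload_upload.log 2>&1
EOF
# On a real instance this would be installed with: crontab /tmp/payload.cron
cat /tmp/payload.cron
```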

It is safe to stop a currently running GPU72_TF Session, make this change, and then restart.

As always, SPE and/or "hmmm..." things welcomed...

petrw1 2020-02-28 19:44

I'm interested.
I'd like to do DC-P1 work.
Will it allow me to provide my own worktodo lines?

Cool
Wayne

[QUOTE=chalsall;538535]OK... So, after finding a stupid greedy regex bug which could result in checkpoint files being lost for (ironically) large candidates, this is now ready for gamma testing. ... As always, SPE and/or "hmmm..." things welcomed...[/QUOTE]

Chuck 2020-02-28 19:50

Colab CPUs
 
I didn't realize I needed to relaunch the notebooks so I have just done that. In my "view assignments" list on GPU72 I see the P-1 assignments, but where do I go to see their progress?

chalsall 2020-02-28 19:54

[QUOTE=petrw1;538540]I'm interested. I'd like to do DC-P1 work.[/QUOTE]

Yeah... That's been envisioned.

Because of the race into the 10xMs by the Cat 3 and 4s, there are a lot of candidates which get LL'ed without a P-1. I'm currently targeting this myself (there are 317 such candidates needing a DC P-1).

I also figured you (and perhaps a few others) would be interested in strategically redoing poorly P-1'ed candidates... :wink:

[QUOTE=petrw1;538540]Will it allow me to provide my own worktodo lines?[/QUOTE]

Nope... These must be issued by GPU72 to be managed by all the ephemeral instance runs. But I'm open to suggestions as to ranges of interest.

chalsall 2020-02-28 19:56

[QUOTE=Chuck;538541]...but where do I go to see their progress?[/QUOTE]

They /should/ be updated on your Assignments page. The checkpoints are uploaded every ten minutes, and the percentage completed value every hour.

OH, SHOOT!!!

There might be a problem with the checkpoint uploading!!! I developed this using my Reverse tunnels, and it appears the cron sub-system this relies on isn't being brought up correctly by the CPU Payload. Please stand by. (Note, this doesn't affect the GPU work in any way; that is still sane.)

Chuck 2020-02-28 19:57

[QUOTE=chalsall;538543]They /should/ be updated on your Assignments page. The checkpoints are uploaded every ten minutes, and the percentage completed value every hour.[/QUOTE]

OK I probably just need to wait.

chalsall 2020-02-28 20:20

[QUOTE=Chuck;538544]OK I probably just need to wait.[/QUOTE]

OK... Fixed the CRON based uploader issue.

Chuck... Could you please stop and restart your Notebooks? You'll be issued new P-1 assignments, but they should have the checkpoints being uploaded properly, and the Updated field change every ten minutes.

Sorry about that. An example of development rigging having an impact on the deployed code. A bit like quantum observer disruption of a system, only different... :wink:

Chuck 2020-02-28 22:37

[QUOTE=chalsall;538547]OK... Fixed the CRON based uploader issue.

Chuck... Could you please stop and restart your Notebooks? You'll be issued new P-1 assignments, but they should have the checkpoints being uploaded properly, and the Updated field change every ten minutes.

Sorry about that. An example of development rigging having an impact on the deployed code. A bit like quantum observer disruption of a system, only different... :wink:[/QUOTE]

Done.

chalsall 2020-02-28 22:51

[QUOTE=Chuck;538556]Done.[/QUOTE]

Excellent! Thanks. And I see your checkpoints! :tu:

petrw1 2020-03-01 21:24

Thumbs Up
 
Appears I have Colab running 2 instances of TF AND P1

petrw1 2020-03-02 02:41

[QUOTE=petrw1;538658]Appears I have Colab running 2 instances of TF AND P1[/QUOTE]

Still seeing the sessions die after an hour or two.

If a session running both GPU-TF and CPU-P1 dies … and I restart it, is it safe to assume that both the TF and P1 will be picked up again after the restart?

Any idea if the session would live longer if I ran ONLY P1?
That seems to be more in need anyway.

bayanne 2020-03-02 06:10

What is the quick way to add that snippet of code into the notebook please?

I did find another way, and it is working on a P1, but I think there must be a simpler way :)

James Heinrich 2020-03-02 11:53

[QUOTE=bayanne;538687]What is the quick way to add that snippet of code into the notebook please?[/QUOTE]I don't know how you're [i]supposed[/i] to do it, but from the [B]⋮[/B] menu hovering in the upper right, choose Form->Show Code and then you get a sliver of code box on the left in which to find (~line 20) and edit the !wget line.

James Heinrich 2020-03-02 11:53

[QUOTE=chalsall;538547]OK... Fixed the CRON based uploader issue.[/QUOTE]For those of us who were caught with this problem, there seem to be P-1 assignments languishing; if you have checkpoints available for them, please make sure they get processed, or if (as I presume the problem to be) there is no checkpoint file, I guess you can throw them back in the pool?

bayanne 2020-03-02 13:25

So when the GPU part finishes, how does one keep the CPU part running?

chalsall 2020-03-02 15:58

[QUOTE=James Heinrich;538697]For those of us who were caught with this problem, there seem to be P-1 assignments languishing, if you have checkpoints available for them please make sure they get processed, or if (as I presume the problem to be) there is no checkpoint file I guess you can throw them back in the pool?[/QUOTE]

OK... Just so everyone knows, everyone is running this code now. Version 0.42.

However, only those whose Primenet Username the system knows are assigned P-1'ing work. I'll have the form ready to enter this information into the system later today.

Like with Colab TF assignments, once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion is updated every hour.

The GPU Bootstrap code now has IPC with the CPU Payload, so at the start of the run and every ~30 minutes a line like "100970xxx P-1 77 19.47% Stage: 1" is displayed. This can be made every ten minutes if people would prefer more frequent reporting.

I'm "eating my own dog food" with this, and watching the logs closely. But if anyone sees anything strange please let me know. This is the "edge-case" phase.

"Why did they do that?" That's a two-part question. The answer to the first question is "Why not?" The answer to the second question is "Yes." :wink:

chalsall 2020-03-02 16:06

[QUOTE=bayanne;538720]So when the GPU part finishes, how does one keep the CPU part running?[/QUOTE]

When the GPU Section finishes it means the Instance has been killed. Read: both the GPU and CPU jobs have been terminated.

However, if you're not able to get a GPU backend this can still run CPU only. Just answer "Connect" when asked if you want a CPU only backend, and then run as usual.

Currently, there's a massive amount of debugging output when running in this mode. I'll clean that up later today -- no effect on the background work going on.

My modus operandi with this has been to ask for a GPU Instance, and proceed as usual if I get one. If not, I ask for the CPU Instance and run the Section. Then, every few hours I ask for a GPU from those contexts which are currently running CPU only. If I get one I run the GPU72_TF Section, which then launches the GPU and CPU parallel workers.

Interestingly, I've found that the CPU Instance which is "replaced" by the GPU Instance (as far as the Web GUI is concerned) continues running for up to several hours... :smile:

chalsall 2020-03-02 16:18

[QUOTE=petrw1;538683]Still seeing the sessions die after an hour or two. If a session running both GPU-TF and CPU-P1 dies … and I restart it, is it safe to assume that both the TF and P1 will be picked up again after the restart?[/QUOTE]

A trick I've found works well for instance longevity...

Once you've been given an instance (CPU or GPU), click on the "RAM / Disk" drop-down menu in the upper right-hand side of the interface, and choose "Connect to hosted runtime".

If you're running a GPU, this will immediately reconnect and then show "Busy". If you're running only a CPU it will then give you the same "Do you want a CPU only?", but then after "Connecting" you're reattached to the same (running) CPU instance.

This seems to basically be a way to tell the system that you know this is going to run for a long time. Don't expect further human interaction.

Someone else figured this out; I can't remember who nor where it was posted. Probably on the long Colab thread.

[QUOTE=petrw1;538683]Any idea if the session would live longer if I ran ONLY P1?
That seems to be more in need anyway.[/QUOTE]

I think the CPU only instances tend to last for 12 hours or so, but the GPUs are coveted and vary considerably in their runtimes and compute kit.

Also, we still need as much GPU TF'ing as we can get. Need to stay ahead of the Cat 3 and 4's! :smile:

Chuck 2020-03-02 23:06

[QUOTE=chalsall;538735]

Like with Colab TF assignments, once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion is updated every hour.

The GPU Bootstrap code now has IPC with the CPU Payload, so at the start of the run and every ~30 minutes a line like "100970xxx P-1 77 19.47% Stage: 1" is displayed. This can be made every ten minutes if people would prefer more frequent reporting.
[/QUOTE]

I'd vote to see the updated P-1 information every ten minutes.

petrw1 2020-03-03 00:21

[QUOTE=chalsall;538735]once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion is updated every hour.
[/QUOTE]

I could be senile (okay more than typical senility for my age) but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2.
When I restarted all the workers this morning they were all stage 1 and no more than 60%.
I have NO P-1 completions.

chalsall 2020-03-03 01:01

[QUOTE=petrw1;538773]I could be senile (okay more than typical senility for my age) but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2. When I restarted all the workers this morning they were all stage 1 and no more than 60%. I have NO P-1 completions.[/QUOTE]

There /might/ be something strange going on, for a /few/ people. I haven't figured out what's going on yet -- extremely tricky debugging what I can't see nor access. And, of course, all of my various tests are all running perfectly fine -- exact same code paths as everyone else.

I've added some code to send back to the server the working directory when the checkpointing code doesn't seem sane.

bayanne 2020-03-03 06:26

[QUOTE=petrw1;538773]I could be senile (okay more than typical senility for my age) but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2.
When I restarted all the workers this morning they were all stage 1 and no more than 60%.
I have NO P-1 completions.[/QUOTE]

That has happened to me too

chalsall 2020-03-03 14:57

[QUOTE=bayanne;538789]That has happened to me too[/QUOTE]

OK... I think I've figured out what's going on... The script does not handle stopping and restarting well. I /thought/ that any forked processes get killed when the Notebook is interrupted, but this isn't true.

Working on a fix now...

chalsall 2020-03-03 17:34

[QUOTE=chalsall;538809]Working on a fix now...[/QUOTE]

OK... The Bootstrap module will now SIGINT the CPUWrapper module, which in turn SIGINTs the Payload module, which in turn SIGINTs the mprime process...

The upside of this is that after mprime exits, the Payload module gives the Checkpointer a chance to upload the just-written checkpoint file.
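The cascading shutdown described above can be sketched as a wrapper that traps SIGINT, forwards it to its child worker (ultimately mprime), waits for the child to finish its last checkpoint write, and only then exits. All file and function names here are illustrative, not the actual Payload code:

```shell
# Hypothetical wrapper demonstrating the SIGINT-forwarding pattern.
cat > /tmp/cpuwrapper.sh <<'EOF'
#!/bin/sh
# Run the worker command (e.g. mprime) in the background.
"$@" &
child=$!

on_int() {
  kill -INT "$child" 2>/dev/null   # pass the SIGINT down the chain
  wait "$child"                    # let the worker write its final checkpoint
  # (the real Payload would hand off to the Checkpointer/uploader here)
  exit 0
}
trap on_int INT

wait "$child"
EOF
chmod +x /tmp/cpuwrapper.sh
```

Each layer (Bootstrap → CPUWrapper → Payload → mprime) would use the same pattern, so a single SIGINT at the top propagates all the way down.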

Please /don't/ stop and then restart a running Colab session to get this new code. But all future runs will pick this up.

bayanne 2020-03-04 12:05

Opting to work on CPU tasks as well has meant that I am now getting GPU instances less frequently.
I am wondering whether this is being somewhat counter productive ...

chalsall 2020-03-04 13:06

[QUOTE=bayanne;538855]Opting to work on CPU tasks as well has meant that I am now getting GPU instances less frequently. I am wondering whether this is being somewhat counter productive ...[/QUOTE]

While it is impossible to guess at what Google's algorithms are weighting, I don't /think/ so. More likely what we're observing is the ebb-and-flow of demand vs. availability.

But to test your theory, change your CPU Worktype to "Disabled" and the CPU won't be used (the CPU payload provided is just a sleep(forever) call in such cases).

James Heinrich 2020-03-04 14:56

Does this make sense? :unsure:
[code]Beginning GPU Trial Factoring Environment Bootstrapping...
Please see https://www.gpu72.com/ for additional details.

20200304_145207: GPU72 TF V0.42 Bootstrap starting (now with CPU support!)...
20200304_145207: Working as "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...

20200304_145207: Installing needed packages
20200304_145213: Fetching initial work...
20200304_145213: Running GPU type Tesla K80

20200304_145214: running a simple selftest...
20200304_145218: Selftest statistics
20200304_145218: number of tests 107
20200304_145219: successfull tests 107
20200304_145219: selftest PASSED!
20200304_145219: Bootstrap finished. Exiting.[/code]It has a GPU, but it's not doing any TF... not sure if it's doing P-1 but I don't see any comment about that either.

My other instance, started at the same time, is also borked:[code]Beginning GPU Trial Factoring Environment Bootstrapping...
Please see https://www.gpu72.com/ for additional details.

20200304_145330: GPU72 TF V0.42 Bootstrap starting (now with CPU support!)...
20200304_145330: Working as "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...

20200304_145330: Installing needed packages
20200304_145335: Fetching initial work...
20200304_145336: Running GPU type Tesla T4

20200304_145336: running a simple selftest...
20200304_145340: Selftest statistics
20200304_145340: number of tests 107
20200304_145340: successfull tests 107
20200304_145340: selftest PASSED!
20200304_145340: Bootstrap finished. Exiting.[/code]

chalsall 2020-03-04 15:07

[QUOTE=James Heinrich;538863]Does this make sense?[/QUOTE]

No!!! Grrr...

Please try rerunning the Sections. According to the DB you /were/ issued work.

Edit: Actually, one of your three instances was issued TF work; the other two were only issued P-1 work. This shouldn't happen.

Working theory: a DNS lookup failure could explain this. I'll add a check to the Comms script module to retry if it doesn't successfully get the first batch of work. But simply rerunning your failed sections should fix the issue right now.

James Heinrich 2020-03-04 15:38

[QUOTE=chalsall;538865]But simply rerunning your failed sections should fix the issue right now.[/QUOTE]I restarted both. The one seems normal:[code]20200304_153523: Exponent TF Level % Done ETA GHzD/D Itr Time | Class #, Seq # | #FCs | SieveRate | SieveP | Uptime
20200304_153538: 100180889 75 to 76 0.1% 1h39m 1102.38 6.236s | 0/4620, 1/960 | 40.81G | 6544.7M/s | 82485 | 0:02
20200304_153538: 100969277 P-1 77 0.00% Stage: 1
20200304_153643: 100180889 75 to 76 1.5% 1h37m 1110.57 6.190s | 52/4620, 14/960 | 40.81G | 6593.3M/s | 82485 | 0:04[/code]The other seems to have a lot more P-1 lines than I expect right on init:[code]20200304_153537: Installing needed packages
20200304_153554: Fetching initial work...
20200304_153556: Running GPU type Tesla T4

20200304_153556: running a simple selftest...
20200304_153605: Selftest statistics
20200304_153605: number of tests 107
20200304_153605: successfull tests 107
20200304_153605: selftest PASSED!
20200304_153605: Starting trial factoring M106899509 from 2^75 to 2^76 (71.58 GHz-days)

20200304_153605: Exponent TF Level % Done ETA GHzD/D Itr Time | Class #, Seq # | #FCs | SieveRate | SieveP | Uptime
20200304_153620: 106899509 75 to 76 0.1% 55m42s 1848.60 3.485s | 0/4620, 1/960 | 38.25G | 10974.9M/s | 82485 | 0:45
20200304_153620: 100968493 P-1 77 0.00% Stage: 1
20200304_153620: 100968493 P-1 77 2.64% Stage: 1
20200304_153620: 100968493 P-1 77 5.29% Stage: 1
20200304_153620: 100968493 P-1 77 7.94% Stage: 1
20200304_153620: 100968493 P-1 77 10.59% Stage: 1
20200304_153620: 100969361 P-1 77 0.00% Stage: 1
20200304_153722: 106899509 75 to 76 2.4% 55m52s 1801.06 3.577s | 111/4620, 23/960 | 38.25G | 10692.6M/s | 82485 | 0:47[/code]

chalsall 2020-03-04 16:12

[QUOTE=James Heinrich;538869]I restarted both. The one seems normal: ... The other seems to have a lot more P-1 lines than I expect right on init:[/QUOTE]

Thanks for the data...

What is happening here is you now have two P-1 jobs running in parallel. An interesting edge case. There's nothing we can do about this, but it shows me some deltas I need to make to the payloads to handle these kinds of rare (but not impossible) edge cases.

Somewhat amusingly, yesterday Chuck had a similar situation. Even though he did a Factory Reset, two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:

PhilF 2020-03-04 16:46

[QUOTE=chalsall;538871]two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:[/QUOTE]

Maybe that's a new tactic they are employing in order to try to drive us away, lol :spinner:

Chuck 2020-03-04 17:36

[QUOTE=chalsall;538871]Thanks for the data...

What is happening here is you now have two P-1 jobs running in parallel. An interesting edge case. There's nothing we can do about this, but it shows me some deltas I need to make to the payloads to handle these kinds of rare (but not impossible) edge cases.

Somewhat amusingly, yesterday Chuck had a similar situation. Even though he did a Factory Reset, two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:[/QUOTE]

Did I do something wrong that enabled this debug uploading?

chalsall 2020-03-04 17:43

[QUOTE=Chuck;538880]Did I do something wrong that enabled this debug uploading?[/QUOTE]

No... I did.

All you did was stop and restart your Sections. Perfectly reasonable.

But then I had made an assumption that was incorrect, and my code started misbehaving. Then I had the code send back debugging information in such situations, not realizing just how large the data would be nor how often it would be sent...

A classic SPE, working in an environment where it's a "you'd better get this correct, because you have no control over it once it's running" situation.

And then, of course, not getting it correct... DWIM!!! :wink:

Chuck 2020-03-04 17:52

[QUOTE=chalsall;538881]No... I did.

All you did was stop and restart your Sections. Perfectly reasonable.

But then I had made an assumption that was incorrect, and my code started misbehaving. Then I had the code send back debugging information in such situations, not realizing just how large the data would be nor how often it would be sent...

A classic SPE, working in an environment where it's a "you'd better get this correct, because you have no control over it once it's running" situation.

And then, of course, not getting it correct... DWIM!!! :wink:[/QUOTE]

Sometimes when I restart a session, the little spinning indicator in the upper left corner scrolls up off the screen out of sight when I scroll the window to the bottom of the screen. When this happens, I found that if I scroll back to the top of the window and re-click "Default" on the logging level, it corrects the problem and the indicator does not scroll off the top.

I thought I might have accidentally selected "Verbose".

Chuck 2020-03-04 21:04

Colab restarts
 
My sessions expired after 24 hours as usual. There were three sessions and as I restarted each, it went through the bootstrap process and exited immediately. I restarted the three sessions again and they then ran normally.

Evidently I picked up an extra P-1 assignment as each session is displaying two different P-1 progress lines starting at 0.00%.

chalsall 2020-03-04 21:18

[QUOTE=Chuck;538893]My sessions expired after 24 hours as usual. There were three sessions and as I restarted each, it went through the bootstrap process and exited immediately. I restarted the three sessions again and they then ran normally.[/QUOTE]

OK... Thanks for the data. Having audited the code, a DNS lookup failure is the only possibility. This is supported by the fact your instances only asked for P-1 work, but not TF work.

I've got a fall-back contingency worked out in my head, which I'll implement shortly.

[QUOTE=Chuck;538893]Evidently I picked up an extra P-1 assignment as each session is displaying two different P-1 progress lines starting at 0.00%.[/QUOTE]

Yes... But the good news is the code is sane (or, at least, not insane).

You'll pick up those initial P-1 assignments in due course. I don't yet have the system expiring abandoned assignments without work done on them. Later.

Chuck 2020-03-04 22:53

I started a CPU only just for fun
 
Just to see what it would look like, I started an additional Colab session not requesting a GPU. Now I can see the scrolling P-1 messages.

I see the code stops and restarts the process every hour when it sends the progress message to the server. Interesting to see. I wonder if the session will stop after 24 hours like my other Colab sessions do.

chalsall 2020-03-04 23:17

[QUOTE=Chuck;538898]I see the code stops and restarts the process every hour when it send the progress message to the server. Interesting to see.[/QUOTE]

Yeah... I'm going to have to figure out how to have it *not* do that.

What is happening is the mprime process is getting "new settings" from the Primenet server (through the GPU72 proxy). I need to intercept those messages and ensure the running client doesn't see any changes. Rather wasteful having it stop and restart, particularly during Stage 2...

And, indeed... Please let us know how long your CPU-only session lasts.

Prime95 2020-03-05 00:16

Can you turn off auto benchmarking? (Add AutoBench=0 to prime.txt)
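For the record, that change is a one-liner against mprime's prime.txt; a hedged, idempotent sketch (run in the directory where prime.txt lives):

```shell
# Disable mprime's automatic benchmarking by adding AutoBench=0 to
# prime.txt, but only if the setting isn't already present.
touch prime.txt
grep -q '^AutoBench=' prime.txt || printf 'AutoBench=0\n' >> prime.txt
grep '^AutoBench=' prime.txt
```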

chalsall 2020-03-05 00:25

[QUOTE=Prime95;538906]Can you turn off auto benchmarking? (Add AutoBench=0 to prime.txt)[/QUOTE]

Thanks! Done.

chalsall 2020-03-05 14:59

Kinda cool...
 
1 Attachment(s)
Just to share a screenshot that I think is kinda cool...

It's from my main workstation, where I run four Colab sessions in parallel. For each, I'll often set up reverse SSH and HTTP tunnels in order to observe what's happening in the background.

This is just short of twelve hours of a CPU-only run (just about to finish a 100M P-1).

James Heinrich 2020-03-05 16:12

[QUOTE=James Heinrich;538863]Does this make sense? :unsure:[/quote][QUOTE=chalsall;538865]No!!! Grrr...
Please try rerunning the Sections. According to the DB you /were/ issued work.
Working theory: a DNS lookup failure could explain this. I'll add a check to the Comms script module to retry if it doesn't successfully get the first batch of work. But simply rerunning your failed sections should fix the issue right now.[/QUOTE]In case you care, it just happened again, across 3 instances I started at about the same time.

I waited a couple of minutes and restarted each, and they all worked this time, each with 2x initial P-1 lines (one halfway through stage 1, one at 0%), so presumably I got assigned work on the first attempt but my instance didn't receive it(?)

chalsall 2020-03-05 16:30

[QUOTE=James Heinrich;538950]In case you care, it just happened again, across 3 instances I started at about the same time.[/QUOTE]

I care very much!!! Thanks for the data! Important.

This means my attempted fix (looping for ten attempts over a minute) didn't work.

Hmmm...

[QUOTE=James Heinrich;538950]I waited a couple of minutes and restarted each, and they all worked this time, each with 2x initial P-1 lines (one halfway through stage 1, one at 0%), so presumably I got assigned work on the first attempt but my instance didn't receive it(?)[/QUOTE]

What this means is your first launch attempts (for all three sessions) successfully got the Bootstrap and the CPU Payload (with the P-1 work). But they /didn't/ ask for TF work (different communications channel).

OK. I'll meditate on this, and come up with an angle of attack... I'm thinking of throwing a fall-back DNS entry into /etc/hosts, and see if that fixes this issue.
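
That /etc/hosts fall-back could look something like this minimal sketch (the helper name is an assumption, and the IP is a documentation-range placeholder; the real fall-back address isn't stated here):

```python
def pin_host(hostname, ip, hosts_path="/etc/hosts"):
    """Append a static hosts entry so name resolution survives a DNS outage."""
    try:
        with open(hosts_path) as f:
            for line in f:
                line = line.strip()
                # Skip comments; fields in hosts(5) are whitespace-separated.
                if not line.startswith("#") and hostname in line.split():
                    return False  # already pinned
    except FileNotFoundError:
        pass
    with open(hosts_path, "a") as f:
        f.write(f"{ip}\t{hostname}\n")
    return True

# Placeholder address; the actual fall-back IP for the GPU72 server
# would need to be substituted here.
# pin_host("www.gpu72.com", "203.0.113.10")
```

One caveat with this approach: pinning an IP only helps if the server's address is stable, so it's a belt-and-braces measure alongside the retry loop, not a replacement for it.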

The good news is even though you saw the initial status line from the first P-1 run, it should have been cleanly killed. As in, you should only be running a single P-1 job on each instance.

Man, tricky stuff. Loving it!!! :smile:

Chuck 2020-03-05 17:11

Two different progress messages
 
1 Attachment(s)
I am seeing these two different messages on one of the Colab instances — almost like it is processing the same exponent twice at the same time?

chalsall 2020-03-05 17:25

[QUOTE=Chuck;538956]I am seeing these two different messages on one of the Colab instances — almost like it is processing the same exponent twice at the same time?[/QUOTE]

Hmmm... I think you might be correct... Sorry about that...

I'm /pretty/ sure this won't happen with the code served out since last night (~0200 UTC). I introduced several redundant shutdown request vectors which should ensure there are no unwanted parallel runs going on.

It's /probably/ safe to try stopping and restarting your GPU72_TF Section. Or, you could just wait for them to expire when expected. The worst that's happening now is you might only be using ~50% of a CPU, rather than 100%.

Uncwilly 2020-03-05 18:35

I have been seeing the same thing as James has been reporting, with the same fix. I have been ill the past few days and haven't felt like posting.

chalsall 2020-03-05 19:23

[QUOTE=Uncwilly;538961]I have been seeing the same thing as James has been reporting, with the same fix. I have been ill the past few days and haven't felt like posting.[/QUOTE]

OK, thanks for the data. I can see from the logs that James is not alone.

So, for anyone who launches the GPU72_TF payload and it stops after the mfaktc self-test, please curse in my general direction and then click Run again.

This has actually helped harden the CPU payload's resiliency. Each running instance should be devoting 100% of the CPU towards a single P-1 work unit at a time, no matter how many times the Section is restarted.

Thanks for all the patience with this, and all the compute! :smile:

P.S. And, of course, the exit is a profoundly stupid error by yours truly. The Bootstrap should loop if it runs out of work! Why ever exit? Death before exit!!!

Chuck 2020-03-05 20:01

[QUOTE=chalsall;538900]Yeah... I'm going to have to figure out how to have it *not* do that.

What is happening is the mprime process is getting "new settings" from the Primenet server (through the GPU72 proxy). I need to intercept those messages and ensure the running client doesn't see any changes. Rather wasteful having it stop and restart, particularly during Stage 2...

And, indeed... Please let us know how long your CPU-only session lasts.[/QUOTE]

My CPU-only session ended after 24 hours and 1 minute (paid tier).

Chuck 2020-03-05 21:55

[QUOTE=chalsall;538963]OK, thanks for the data. I can see from the logs that James is not alone.

So, for anyone who launches the GPU72_TF payload and it stops after the mfaktc self-test, please curse in my general direction and then click Run again.

This has actually helped harden the CPU payload's resiliency. Each running instance should be devoting 100% of the CPU towards a single P-1 work unit at a time, no matter how many times the Section is restarted.

P.S. And, of course, the exit is a profoundly stupid error by yours truly. The Bootstrap should loop if it runs out of work! Why ever exit? Death before exit!!![/QUOTE]

I just had three sessions exit after 24 hours each. All three needed the second restart to get going again.

bayanne 2020-03-06 13:24

Found a factor with 4th P1 exponent :)

chalsall 2020-03-06 15:53

[QUOTE=bayanne;539008]Found a factor with 4th P1 exponent :)[/QUOTE]

Excellent!

Just to share... It was kinda cool watching George use my code to run his (and Oliver's) code in a donated Colab instance, and for him to find a factor during his very first P-1 run!!!

I don't believe in luck; it's just statistics. I do, /kinda/, believe in karma, though... :smile:

James Heinrich 2020-03-07 02:19

Is this (zero exponent and stage) expected output at the end of stage 1?
It seems to progress to stage2, I've just not paid attention before to the display at the transition.[quote]20200307_015640: [color=green]100968617 P-1 77 99.73% Stage: 1[/color]
20200307_015647: 102442429 73 to 74 52.1% 7m32s 1709.74 0.983s | 2412/4620, 500/960 | 9.98G | 10150.4M/s | 82485 | 1:58
20200307_015736: [color=green][b]0 P-1 77 Stage: 0[/b][/color]
...
20200307_021530: 102226979 73 to 74 70.8% 4m36s 1706.40 0.987s | 3264/4620, 680/960 | 10.00G | 10130.6M/s | 82485 | 2:16
20200307_021601: [color=green]100968617 P-1 77 3.13% Stage: 2[/color][/quote]

petrw1 2020-03-07 02:45

Too many active sessions
 
At least once a day I only get 1 session.
I start the tunnel... 1 session.
Can't start actual run.

Uncwilly 2020-03-07 02:56

[QUOTE=James Heinrich;539058]Is this (zero exponent and stage) expected output at the end of stage 1?
It seems to progress to stage2, I've just not paid attention before to the display at the transition.[/QUOTE]
That happens during the GCD run. I have noticed them too. Chris forgot to anticipate that.

chalsall 2020-03-07 14:33

[QUOTE=petrw1;539059]At least once a day I only get 1 session. I start the tunnel... 1 session. Can't start actual run.[/QUOTE]

Of my seven accounts (spread across three machines in two countries) I'll often only get one or two GPU instances. I've yet to not be given a CPU session.

This morning at 1200 UTC I've not been given a single GPU. And not many people are running the GPU72_TF Notebook at the moment.

Just for clarity... When you say you "start the tunnel", are you running InstanceROOT reverse tunnels first? This should have no impact on anything, but I want to understand your context for my debugging model.

chalsall 2020-03-07 14:37

[QUOTE=Uncwilly;539061]That happens during the GCD run. I have noticed them too. Chris forgot to anticipate that.[/QUOTE]

I didn't /forget/ exactly. Just busy... I need to finish off the regex filtering/transforms. Humans come after the kit... :wink:

I'm reworking the output for both the GPU and CPU contexts, to make them more compatible, cleaner and more informative. Also, to turn off all the deep debugging and multiple timestamps!

chalsall 2020-03-08 00:25

Some eye candy...
 
Hey all.

So, in my spare time I've been experimenting with the D3 Javascript graphing package. I thought some of you might be interested in one of the results...

If you click on the new links in the Range column in the [URL="https://www.gpu72.com/reports/available/"]available report[/URL], you'll be taken to a chart of the TF level of that 1M range. There are links from this page that let you drill down to see all unfactored candidates, DC candidates, and those DC'ed but not factored.

For example, [URL="https://www.gpu72.com/charts/tf/dc/98/"]this chart of the 98M range[/URL] shows that almost all FC runs were done after TF'ing to 77.

Data (and databases) are fun!!! :smile:

P.S. These charts are rather crude, and don't (yet) scale to the window size. They're designed for 1920-pixel-wide displays.

Chuck 2020-03-08 01:08

[QUOTE=chalsall;539132]Hey all.

So, in my spare time I've been experimenting with the D3 Javascript graphing package. I thought some of you might be interested in one of the results...

If you click on the new links in the Range column in the [URL="https://www.gpu72.com/reports/available/"]available report[/URL], you'll be taken to a chart of the TF level of that 1M range. There are links from this page that let you drill down to see all unfactored candidates, DC candidates, and those DC'ed but not factored.

For example, [URL="https://www.gpu72.com/charts/tf/dc/98/"]this chart of the 98M range[/URL] shows that almost all FC runs were done after TF'ing to 77.
[/QUOTE]

I don't understand these charts at all. What are the 0000 - 7400 running across the bottom and what are the colors?

chalsall 2020-03-08 01:12

[QUOTE=Chuck;539138]What are the 0000 - 7400 running across the bottom and what are the colors?[/QUOTE]

0.01M sub-ranges of the 1M range being viewed. I'll try to make that rendering clearer later.

Edit: Colors... Scroll your browser to the right. The key is in the upper-right-hand corner. Red == 72; Purple is 77.

Chuck 2020-03-08 01:54

[QUOTE=chalsall;539140]0.01M sub-ranges of the 1M range being viewed. I'll try to make that rendering clearer later.

Edit: Colors... Scroll your browser to the right. The key is in the upper-right-hand corner. Red == 72; Purple is 77.[/QUOTE]

That makes more sense as it goes from 0000 to 9900.

Chuck 2020-03-08 12:54

My Colab results are not being submitted automatically
 
Since Saturday night my Colab results have not been automatically submitted. I just did a manual submission of seven results.

chalsall 2020-03-08 16:20

[QUOTE=Chuck;539158]Since Saturday night my Colab results have not been automatically submitted. I just did a manual submission of seven results.[/QUOTE]

OK... The Primenet API is designed such that the Client has to exchange settings parameters every so often, or else the Primenet server will stop accepting results.

I now know how long that time period is...

For those who started auto-submitting about a week ago, your "virtual" machine has checked in with Primenet, and all results are being auto-submitted again.

James Heinrich 2020-03-09 16:02

[QUOTE=chalsall;538894]a DNS lookup failure is the only possibility.
I've got a fall-back contingency worked out in my head, which I'll implement shortly.
But the good news is the code is sane (or, at least, not insane).[/QUOTE]I just fired up my 3 instances, and two started working normally, but the third (actually the first one I started, by a couple of seconds) did not start nicely. Instead of doing the selftest and exiting, it does the selftest and loops. And loops and loops. After some time I gave up and manually stopped/restarted it, but it kept looping through the selftest (display edited for brevity, but it repeated the octet of output lines each time):[code]20200309_155258 ( 0:04): Installing needed packages
20200309_155303 ( 0:04): Fetching initial work...
20200309_155303 ( 0:04): Running GPU type Tesla P100-PCIE-16GB

20200309_155303 ( 0:04): running a simple selftest...
20200309_155308 ( 0:04): Selftest statistics
20200309_155308 ( 0:04): number of tests 107
20200309_155308 ( 0:04): successfull tests 107
20200309_155308 ( 0:04): selftest PASSED!
20200309_155308 ( 0:04): Fetching initial work...
20200309_155308 ( 0:04): Running GPU type Tesla P100-PCIE-16GB

20200309_155308 ( 0:04): running a simple selftest...
20200309_155314 ( 0:04): running a simple selftest...
20200309_155321 ( 0:04): running a simple selftest...
20200309_155329 ( 0:04): running a simple selftest...
20200309_155337 ( 0:04): running a simple selftest...
20200309_155345 ( 0:04): running a simple selftest...
20200309_155354 ( 0:05): running a simple selftest...
20200309_155402 ( 0:05): running a simple selftest...
20200309_155411 ( 0:05): running a simple selftest...
20200309_155419 ( 0:05): running a simple selftest...
20200309_155428 ( 0:05): running a simple selftest...
20200309_155437 ( 0:05): running a simple selftest...
20200309_155446 ( 0:05): running a simple selftest...
20200309_155454 ( 0:06): running a simple selftest...
20200309_155504 ( 0:06): running a simple selftest...
20200309_155513 ( 0:06): running a simple selftest...
20200309_155522 ( 0:06): running a simple selftest...
20200309_155531 ( 0:06): running a simple selftest...
20200309_155540 ( 0:06): running a simple selftest...
20200309_155550 ( 0:06): running a simple selftest...
20200309_155559 ( 0:07): running a simple selftest...
20200309_155608 ( 0:07): running a simple selftest...
20200309_155617 ( 0:07): running a simple selftest...
20200309_155626 ( 0:07): running a simple selftest...
20200309_155635 ( 0:07): running a simple selftest...
20200309_155644 ( 0:07): running a simple selftest...
20200309_155653 ( 0:08): running a simple selftest...
20200309_155702 ( 0:08): running a simple selftest...
20200309_155711 ( 0:08): running a simple selftest...
20200309_155721 ( 0:08): running a simple selftest...
20200309_155730 ( 0:08): running a simple selftest...
20200309_155739 ( 0:08): running a simple selftest...
20200309_155748 ( 0:08): running a simple selftest...
20200309_155758 ( 0:09): running a simple selftest...

Exiting...
Can't locate LWP/UserAgent.pm in @INC (you may need to install the LWP::UserAgent module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at ./comms.pl line 32.
BEGIN failed--compilation aborted at ./comms.pl line 32.
Done.[/code]Scroll down to the error at the end of the above dump; it may be relevant. It appeared when I aborted the session (clicked the "interrupt execution" button).

chalsall 2020-03-09 16:19

[QUOTE=James Heinrich;539218]Scroll down to the error at the end of the above dump, it may be relevant. It appeared when I aborted the session (clicked the "interrupt execution" button).[/QUOTE]

Thank you!!! Critical information.

I added the loop to the Bootstrap code, such that it would keep retrying the fetch. This was based on the theory that the problem was DNS lookup failure.

However, this indicates that instead, the problem is the "apt install" call isn't working for the needed Perl modules at the very beginning of the script.

So... For anyone who sees this type of behavior, please stop and rerun the Section. Cursing in my general direction is always an option as well...

Edit: OK, I've just "pushed" a new version of the Bootstrap payload (V0.421) which expands the loop such that the apt install is retried as well. Thanks, James; I never would have figured that out if you hadn't posted that bit of the output. My instances, of course, never exhibited this behaviour even though I'm running the exact same code.
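
A retry wrapper of the kind V0.421 describes might be sketched like this (the function name, attempt count, and package list are assumptions; the actual Bootstrap code isn't shown in the thread):

```python
import subprocess
import time

def run_with_retries(cmd, attempts=10, delay=6):
    """Run a shell command, retrying on failure (e.g. a transient
    apt mirror or network hiccup inside a fresh Colab VM)."""
    for _ in range(attempts):
        if subprocess.run(cmd, shell=True).returncode == 0:
            return True
        time.sleep(delay)
    return False

# The Perl module behind the "Can't locate LWP/UserAgent.pm" error ships
# in Debian/Ubuntu's libwww-perl package; the exact list V0.421 installs
# is an assumption.
# run_with_retries("apt-get install -y libwww-perl")
```

Wrapping both the package install and the initial work fetch in the same loop covers either step failing on a cold instance.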

James Heinrich 2020-03-09 17:36

[QUOTE=chalsall;539219]I've just "pushed" a new version of the Bootstrap payload (V0.421) which expands the loop such that the apt install is retried as well.[/QUOTE]Better :smile:

[code]20200309_173329 ( 0:05): GPU72 TF V0.421 Bootstrap starting (now with CPU support!)...
20200309_173329 ( 0:05): Working as "<redacted>"...

20200309_173329 ( 0:05): Installing needed packages
20200309_173335 ( 0:05): Fetching initial work...
20200309_173336 ( 0:05): Running GPU type Tesla K80

20200309_173336 ( 0:05): running a simple selftest...
20200309_173341 ( 0:05): Selftest statistics
20200309_173341 ( 0:05): number of tests 107
20200309_173341 ( 0:05): successfull tests 107
20200309_173341 ( 0:05): selftest PASSED!
20200309_173341 ( 0:05): Installing needed packages
20200309_173345 ( 0:05): Fetching initial work...
20200309_173346 ( 0:05): Running GPU type Tesla K80

20200309_173346 ( 0:05): running a simple selftest...
20200309_173353 ( 0:05): Selftest statistics
20200309_173353 ( 0:05): number of tests 107
20200309_173353 ( 0:05): successfull tests 107
20200309_173353 ( 0:05): selftest PASSED!
20200309_173353 ( 0:05): Installing needed packages
20200309_173413 ( 0:05): Fetching initial work...
20200309_173414 ( 0:05): Running GPU type Tesla K80

20200309_173414 ( 0:05): running a simple selftest...
20200309_173431 ( 0:06): Selftest statistics
20200309_173431 ( 0:06): number of tests 107
20200309_173431 ( 0:06): successfull tests 107
20200309_173431 ( 0:06): selftest PASSED!
20200309_173431 ( 0:06): Starting trial factoring M99845491 from 2^74 to 2^75 (38.32 GHz-days)

20200309_173431 ( 0:06): Exponent TF Level % Done ETA GHzD/D Itr Time | Class #, Seq # | #FCs | SieveRate | SieveP
20200309_173445 ( 0:06): 99845491 74 to 75 0.1% 2h22m 387.68 8.896s | 0/4620, 1/960 | 20.47G | 2301.6M/s | 82485[/code]Failed the first two times, but then caught and ran.

Isn't it fun to write code for things you can't actually test? :whee:

Uncwilly 2020-03-09 18:11

[QUOTE=James Heinrich;539224]Better :smile:

Failed the first two times, but then caught and ran.

Isn't it fun to write code for things you can't actually test? :whee:[/QUOTE]
I saw behaviour like this yesterday (I think while the code was in flux).

Uncwilly 2020-03-09 23:07

I noticed that there was a factor found by a P-1 instance that is on my GPU72 graph, but the one for the TF factor found within 24 hours of it is not on the graph. Both of these are in the last 48 hours.

linament 2020-03-10 01:18

Result not needed
 
Thought I would pass this on: one of my GPU72 Colab assignments received a "result not needed" message from PrimeNet today when I reported it. [URL="https://www.mersenne.org/M107578847"]M107578847[/URL] (272-273).

Chuck 2020-03-10 11:27

Same exponent assigned twice for P-1
 
I have two Colab notebooks which have been assigned the same exponent for P-1 factoring (102953047). One is currently at 59% of stage 1 and the other at 41% of stage 1.

EugenioBruno 2020-03-10 12:29

I'm not sure if this is the right place to ask.

Up until now, I've been getting TF work for my GPU from the primenet website. I've now tried to get work from GPU to 72, and it looks like very, very different work.

I was getting TF to 73 for exponents around 110M, while GPU to 72 gives me TF to 77 for exponents around 96M.

I'm not sure I understand the rationale behind one kind of work vs the other, and which is "better" (as in, saves more time for primality checking? I think that's the simplest good metric to measure in this case, but I'm not sure I understand everything about primenet yet).

I will note that after 1000 GHz-days of manual TF work from primenet, I haven't found a factor yet. (Now that I've upgraded my GPU to a 1650, things should move along faster, I think.)

Chuck 2020-03-10 12:32

[QUOTE=EugenioBruno;539278]I'm not sure if this is the right place to ask.

I will note that after 1000 GHz-days of manual TF work from primenet, I haven't found a factor yet. (Now that I've upgraded my GPU to a 1650, things should move along faster, I think.)[/QUOTE]

Sometimes I find several factors in a day, other times I go a month without finding any.

EugenioBruno 2020-03-10 14:41

haha, just a few checks after reading your message, after 200 or so total TFs, I finally found a factor! :)

[url]https://www.mersenne.org/report_exponent/?exp_lo=109373191&full=1[/url]

chalsall 2020-03-10 16:15

[QUOTE=Uncwilly;539252]I noticed that there was a factor found by a P-1 instance that is on my GPU72 graph, but the one for the TF factor found within 24 hours of it is not on the graph. Both of theses are in the last 48 hours.[/QUOTE]

Hey... OK, quickly catching up...

This is a small bug introduced by the Colab auto-submitter... The Factor Found code path doesn't insert the factor into a table. The system does know a factor is found, but some of the reports reference the aforementioned table.

This is trivial to fix; no data is lost -- I'll be able to have the system back-fill the missed entries. This week.

chalsall 2020-03-10 16:18

[QUOTE=linament;539262]Thought I would pass this on, one of my GPU72 Colab assignments received a result not needed message from PrimeNet today when I reported it. [URL="https://www.mersenne.org/M107578847"]M107578847[/URL] (272-273).[/QUOTE]

OK. Thanks for the report.

I'm afraid this is an example of a "friend" TF'ing "off the books". He doesn't respect the assignments as officially issued by Primenet (some of which GPU72 simply "lends" out to participants to work), and so he occasionally steps on toes.

I'm afraid there's nothing I can do about this, beyond pleading that he not do it...

chalsall 2020-03-10 16:20

[QUOTE=Chuck;539277]I have two Colab notebooks which have been assigned the same exponent for P-1 factoring (102953047). One is currently at 59% of stage 1 and the other at 41% of stage 1.[/QUOTE]

Hmmm... OK, thanks for the report. I'll take a look at the logs, and see what I can infer.

In the short term, I've lengthened the recycling period for Colab P-1 assignments. It was twenty (20) minutes; it's now seventy (70).
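
A sketch of what such a recycling check might look like (the names here are assumptions; the real GPU72 server-side code isn't public):

```python
from datetime import datetime, timedelta, timezone

RECYCLE_AFTER = timedelta(minutes=70)  # raised from 20 minutes

def is_recyclable(last_checkin, now=None):
    """True once an assignment's instance has been silent for longer than
    the recycling period, making the exponent eligible for reissue."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_checkin > RECYCLE_AFTER
```

The trade-off is visible in Chuck's report: too short a period reissues work whose original instance is still alive (duplicate P-1 runs), while too long a period leaves abandoned exponents idle.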

chalsall 2020-03-10 16:24

[QUOTE=EugenioBruno;539278]I was getting TF to 73 for exponents around 110M, while GPU to 72 gives me TF to 77 for exponents around 96M.

I'm not sure I understand the rationale behind one kind of work vs the other, and which is "better" (as in, saves more time for primality checking? I think that's the simplest good metric to measure in this case, but I'm not sure I understand everything about primenet yet).[/QUOTE]

Thanks for joining our little group! :tu:

Basically, we have determined that it is "optimal" to TF to 77 bits before running the First Check.

But... That's actually quite a bit of work, and not everyone wants to go that high. And that's perfectly fine. Your kit, your time, your choice.

Thus, GPU72 will let people choose the depth (and, optionally, range) they'd like to "pledge" to work on. All work is valuable, and will be targeted as best benefits the GIMPS goal of finding the next Mersenne Prime! :smile:

Uncwilly 2020-03-10 16:25

[QUOTE=chalsall;539292]The system does know a factor is found, but some of the reports reference the aforementioned table.

This is trivial to fix; no data is lost -- I'll be able to have the system back-fill the missed entries. This week.[/QUOTE]Yeah, I noticed that the table at the top of the Individual Overall Stats page knows about the factors, just not the graph. :tu:

Since my report, another one showed up. I was running about 12% below the predicted number of factors. Now only about 8%. :brian-e:

As always, you are making it easy for us to [STRIKE]abuse[/STRIKE] use resources out there to help. :bow:

linament 2020-03-10 16:39

Another thing I noticed: when my last Colab session ended, I am pretty sure that I had an incomplete TF assignment. When I was able to restart a Colab GPU session today, that incomplete TF assignment had disappeared.

chalsall 2020-03-10 17:07

[QUOTE=linament;539298]Another thing I noticed, when my last Colab session ended, I am pretty sure that I had an incomplete TF assignment. When I was able to restart a Colab GPU session today, that incomplete TF assignment has disappearred.[/QUOTE]

Do you happen to know what the assignment was? And approximate time (UTC please)?

I could drill down on the logs if I knew at least the former; otherwise, there's way too much traffic to look through.

I'm pretty sure the assignment/reassignment code paths are sane. But I'm happy to be proven wrong (so I can fix it).

chalsall 2020-03-10 18:12

[QUOTE=Uncwilly;539296]As always, you are making it easy for us to [STRIKE]abuse[/STRIKE] use resources out there to help. :bow:[/QUOTE]

Thanks!

And, I /live/ for problem spaces like this! Thanks to everyone who beats up my code (and helps find factors)! :smile:


All times are UTC. The time now is 01:02.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.