Raiders of the Lost Primes - Testing....

gd_barnes 2010-02-23 09:17

Holy cow. It looks like port 9950 had already dried out again. That's 30K+ pairs. I'm not going to load it again because that was a sufficient stress test. If you guys want, try port 9975. It has a bunch of pairs at n=~540K from the 11th drive and will NOT dry out. :-)

Max, I'm going to be curious as to what pairs the server couldn't process overall. Clearly it cannot seem to handle the small k=3 pairs that are prime right at the beginning of the file. As before, my client shows them as processed but rejected, yet the server shows no rejected pairs; they just sit in the joblist. I suspect we encountered a load-related issue with LLRnet there. The tests are all n=1K-10K primes.

I'm pretty sure the above is an already existing LLRnet issue. What I will do is set up another personal server and run the same initial tests through it to recreate the problem. I'll then run an "old" client on the same file to make sure that we have not introduced a new problem here. If not, we can let it go. Few people would ever use LLRnet for testing n<10K.

I'll go ahead and start up port 6000 on all of my machines again for the night. There's no reason to let them sit idle overnight.

Excluding the issue with the server not "taking" the small k=3 tests right at the beginning, with this testing we've identified 3 issues in the Linux client. Max has now fixed 2 of them, and it appears he'll look into the 3rd issue, related to stopping a client when the server is down, on Tuesday.

I know that on the Windows side, the primes.txt file was being written correctly. What we'll also still need to do there is see whether the iteration-count issue and the issue with stopping a client when the server is dried out/down still exist in Windows.


kar_bon 2010-02-23 09:30

Good work!

The cLLR option OutputIterations works fine in my DOS version: I stopped the script, edited that line to 100000 iterations, and restarted the script; with the next load of workunits, cLLR displayed the new iteration count properly.

Regarding G4000: I will try to submit the other results to the server when I'm home again.

While testing port 9950 I found no primes, thanks to Gary's 31 cores :grin:
But my prime logging works, too.

Regarding your last post:
- iterations: OK
- server down/dried out: see post #54

mdettweiler 2010-02-23 17:08

[quote=gd_barnes;206445]I thought you fixed #1 on Jeepford also.[/quote]
No, I didn't, sorry.

kar_bon 2010-02-23 18:08

issue from post #54 solved:

In function DialogWithServer(), change some lines this way:
t, k, n = GetPair()
if t and k and n then
    print(format(" Fetching WU #1/%d: %s %s",WUCacheSize,k,n))
    changed = 1

so the "Fetching..." output will only be displayed when getting a new pair was successful; otherwise the function exits.

mdettweiler 2010-02-23 18:40

[quote=kar_bon;206480]issue from post #54 solved:

In function DialogWithServer(), change some lines this way:
t, k, n = GetPair()
if t and k and n then
    print(format(" Fetching WU #1/%d: %s %s",WUCacheSize,k,n))
    changed = 1
changed = 1

so the "Fetching..." output will only be displayed when getting a new pair was successful; otherwise the function exits.[/quote]
Okay, I've applied that fix in the llrnet.lua file for my Perl script as well (just finished uploading).

Meanwhile, I also clarified the documentation a bit: Gary mentioned that it was a tad wordy and thus he missed some key setup instructions, so I added a bit saying "start here if you just want to find out how to get started". :smile:

gd_barnes 2010-02-23 20:07

Hey guys. Excellent work! What a great team effort!

I feel I messed up on the stress test yesterday a little bit. I was rushing somewhat because I had somewhere I needed to be from 6-11 PM. Today I'm free.

Here is why I think I messed up a bit:

I had a bunch of pairs in there for n=1000 to 10K. That's too intense. Looking in the knpairs.txt and joblist.txt files, I see that 552 of them got handed out, but for some reason the server wouldn't accept them even though I'm seeing them in the results for my clients. Many of them are shown as "rejected" on the client side, but there is no rejected.txt file on the server side.

This is obviously something LLRnet cannot handle. But the point is, we are testing OUR changes. We're not trying to find existing problems in LLRnet; we're only trying to verify that we did not negatively impact its ability to handle huge loads. Observe this load that I effectively put on it:

A pair at n=500K should take 10,000 times as long to test as a pair at n=5K; that is, (500K/5K)^2. So by putting 31 cores on pairs processing around n=5K, I created the effect of a stress test with 310,000 cores at n=500K, or 3,100 cores at n=50K!
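Since LLR test time grows roughly as n^2, the conversion above can be sketched in a few lines (Python used purely for illustration; the function name is made up):

```python
# LLR test time scales roughly with n^2, so a core testing candidates at a
# small exponent n_test loads the server like (n_ref / n_test)^2 cores
# testing at a reference exponent n_ref.
def equivalent_cores(cores, n_test, n_ref):
    return cores * (n_ref / n_test) ** 2

# 31 cores testing around n=5K, expressed as cores at n=500K and n=50K:
print(equivalent_cores(31, 5_000, 500_000))  # 310000.0
print(equivalent_cores(31, 5_000, 50_000))   # 3100.0
```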

The true objective of our stress test here is to make sure that OUR code did not cause any stress related issues. I feel confident that it hasn't but that cannot be assumed.

Based on this, here is what I'd like to do:

Run another stress test today with a better thought out set of test data. Here is what I'll do:

Port 9950:
Some k=3 to 50 primes for n=10K-50K.
Some k=3 to 100 sieved pairs for n=10K-50K. (I'll run a quick sieve on those k's to get the pairs.)
(So that I don't dry the server so fast, I'll make sure we get at least 100,000 pairs in there this time around.)

Port 9975:
Some primes for k=2000-2200 for n=50K-100K.
Some "regular" pairs for k=2000-2020 for n=50K-125K.
Like before, pairs from the 11th drive for n=~539K.

Note that this is different from yesterday in that I'm putting the k=2000-2020 "medium" sized primes and tests in port 9975 ahead of the n=~540K tests from the 11th drive. Karsten and others: although I'm loading a lot more pairs in port 9950 this time around, having the "medium" sized tests on a separate port from my super stress test will make sure that you can do some testing, find some primes, etc. before I suck up all the pairs. :-)

I realize that with tests in that range for port 9950, we'll still have the equivalent of several thousand cores running at n=500K, but it won't be several hundred thousand. It would be good if we could verify that LLRnet can handle 1000-2000 cores for a rally, so I feel we're in the ballpark of a useful test at this level. If there is still a problem with some pairs not coming through, I'll move up to n=25K-75K (or perhaps n=50K-100K) tests. If there is still a problem at that level, I'll be a bit concerned that we've affected something. (Seems unlikely.)

Bottom line recommendation for testing:
Port 9950: I'll run most of that, but others are welcome to join in with a few cores if you want.
Port 9975: mostly intended for everyone except me.

In effect, port 9950 is the "legitimate" super stress test, and port 9975 is the more "normal" test to verify that the fixes from the last day are working correctly.

Timeline: About like yesterday, except that I'll be around all evening. I'll go eat now and start working on the servers within the hour. This time around, allowing time for problems, let's aim for about 3:30-4:00 PM CST (10:30-11 PM in Germany) for having the servers ready to go. Although I have to copy the updated clients to all my cores, I know what I'm doing this time, so it should go much faster.

...hope you guys aren't too bleary-eyed from the late nights. :-) Thanks for all of your hard work. :smile:


mdettweiler 2010-02-23 20:36

Sounds like a good plan. BTW, I'll be out of the house from about 5:30-10:30 EST; that works out to 4:30-9:30 CST. Not that it makes much of a difference since I don't really have enough cores to contribute meaningfully to any stress test, but I figured I'd let you guys know. :smile:

Way cool on being able to figure out the equivalent # of cores at 500K where we start running into problems with LLRnet. Once we've narrowed it down to a more exact figure with these stress tests, that should be immensely helpful for future rallies. Come to think of it, I should be able to apply a similar method to determine more exactly the load of PRPnet 2.4.6 at n=500K; that way we can find out whether a PRPnet rally would even be feasible prior to the perfecting of 3.x.

kar_bon 2010-02-23 20:50

[QUOTE=gd_barnes;206492]Hey guys. Excellent work! What a great team effort!


...hope you guys aren't too bleary-eyed from the late nights. :-) Thanks for all of your hard work. :smile:[/QUOTE]

Yep, great work, and another issue solved (not that bad, only a cosmetic operation).

No, not really hard work for me! Just 2 weeks ago I began testing this idea, and after the first script ran fine, it was fun to do; nobody had thought of this option before!

And it's working great! (OK, small n-values could be tested by individuals in a few days.)
And what about other possibilities? Think about it: taking LLRnet only as the server/client framework, other programs should run as well just by changing the script.

Now [b]we[/b] are able to use LLRnet's abilities, and we're independent of others!
That's a great opportunity we must use! We should also inform Jean (and he can inform Vincent) about this new use of LLRnet!

gd_barnes 2010-02-23 22:22

Port 9975 is loaded up and ready to go with the previously mentioned pairs. There's a bunch of primes at the beginning.

I'm still working on port 9950.

gd_barnes 2010-02-23 23:25

Port 9950 is loaded up and ready to go.

The delay was because I'm having tremendous difficulty getting my main machine, Jeepford, to connect to the server. (Its internet access is fine; I'm typing from it right now.) It keeps saying "could not connect to server after 5 tries". I've checked everything in the llr-clientconfig.txt file and it looks good. I've stopped and restarted the server; no luck. I've stopped and restarted the clients; no luck. I'm going to try loading the client on some other machines now. FYI, I'm using the "Perl" command.

I thought it would be easier and faster today. I was mistaken as usual. It's always something. Ergh.

kar_bon 2010-02-23 23:33

[QUOTE=gd_barnes;206510]I thought it would be easier and faster today. I was mistaken as usual. It's always something. Ergh.[/QUOTE]

Check whether you deleted all the old files like tosend.txt, workfile.txt, and workfile.res, and check the entries in llr-clientconfig.txt again!
