mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Data (https://www.mersenneforum.org/forumdisplay.php?f=21)
-   -   Newer milestone thread (https://www.mersenneforum.org/showthread.php?t=13871)

Madpoo 2015-12-23 02:06

[QUOTE=blip;419945]Well, that system is a bit short on memory, and what it has could be faster too. So, as soon as I can afford it, I will add more and faster RAM. It sports an i7-4930K and runs mprime with 5 exponents plus [URL="http://www.mersenne.org/report_exponent/?exp_lo=450000017&full=1"]this one[/URL]. So, it will be busy for some time.

But anyways, it moves forward all right. I just wanted to give a heads up here and manage expectations.[/QUOTE]

Aha! There's your problem.

In my own testing I discovered that if you run multiple workers on the same system, the performance will degrade a LOT if any of the exponents are over a certain size (the exact size varies, depending mostly on the speed of the RAM).

With DDR3 memory I could run multiple workers okay if they were all under 37M or 38M, give or take (multiple workers on one CPU, I'm not counting what happens across multiple CPUs).

If you were to stop all but one of the workers, you would almost certainly notice that the remaining worker speeds up a lot. Give it a try and you'll see what I mean.

In the case of very large (> 100 million digit) exponents, my recommendation is to run one worker with all of the physical cores assigned to it. Trying to run one large FFT worker alongside other smaller FFT workers is just going to thrash the memory within an inch of its life. :smile:

Great for stress testing, lousy for LL throughput.

blip 2015-12-23 10:06

Ok, I put all 6 cores now on 59425643. It should be finished in about 70 h.

I have to figure out how to optimise workers/threads for optimal LL throughput.

Are there some details on how workers with "big" exponents impact other workers?

Madpoo 2015-12-23 18:09

[QUOTE=blip;419965]Ok, I put all 6 cores now on 59425643. It should be finished in about 70 h.

I have to figure out how to optimise workers/threads for optimal LL throughput.

Are there some details on how workers with "big" exponents impact other workers?[/QUOTE]

The details are anecdotal on my part... I've shared in other threads (where exactly on here, I couldn't say) my experimentation.

All I know is that my testing on systems with DDR3 (dual-CPU Xeon systems in all cases) showed that I could run two workers, each using all of the physical cores on its CPU, and:
[LIST][*]if both exponents were below 58M, things were fine
[*]if one exponent was above 58M, the other one needed to be below 38M
[*]otherwise there must be some kind of extra memory thrashing that happens[/LIST]
Regarding performance on a single CPU, I did a little bit of testing on systems with dual 4-core (8 HT) CPUs. Even though it's dual CPU, I think I could extrapolate this to a single CPU:
[LIST][*]4 cores of the CPU in a single worker: no problem with any exponent size (but then with dual CPUs, the findings above would apply)
[*]2 workers with 2 cores each: exponents below 38-40M worked best. Larger than that and the performance of the 2 workers starts to degrade
[*]4 workers with 1 core each: it could still, sort of, manage exponents below 38M but, for me anyway, it wasn't any more efficient than running 2 workers with 2 cores each. Usually when you add extra cores to a worker it doesn't scale linearly; going from 1 to 2 cores does NOT cut the total time in half. But in this case I found that running 2 workers with two cores each was actually about twice as fast as 4 workers with one core, so I went with that just to churn out the result of a single test faster.[/LIST]
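To make the pairing rule concrete, here is a quick sketch of it as code. The helper name and the 58M/38M thresholds are just my own rough numbers from those DDR3 boxes, nothing official from Prime95:

```python
# Rough heuristic from my DDR3 dual-CPU testing; the thresholds are
# approximate observations, not anything documented for Prime95/mprime.
def can_run_together(exp_a: int, exp_b: int) -> bool:
    """True if two workers (one per CPU) on these exponents should
    coexist without thrashing memory, per the observations above."""
    big, small = max(exp_a, exp_b), min(exp_a, exp_b)
    if big < 58_000_000:           # both below 58M: fine
        return True
    return small < 38_000_000      # one above 58M: other must be below 38M

# e.g. a 60M+ test paired with a double-check-sized exponent is OK:
print(can_run_together(60_343_331, 36_500_000))  # True
print(can_run_together(60_343_331, 45_000_000))  # False
```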
The big fun was when I got a system with DDR4 (a dual Xeon v3 system). The DDR4 must play very well with the FFT code, because I could now run 2 workers using all of the cores (14 cores per CPU on this lovely box), and it didn't matter how large the exponents were.

Generally I use that system to test the exponents 60M and higher since I can run two at once without any balancing act. On other systems I have to put 60M+ exponents in one worker and fill the other worker with exponents < 38M.

Not only that but on the DDR4 system it scales much better when adding additional cores to a worker. There's not as much "penalty" for doing so. In fact, on the older systems I could add one core from the other CPU socket and get a small boost in speed, but adding 2+ cores would start to degrade performance. On this system though, I could use all 14 on one CPU and 7-8 cores from the other CPU and still see the performance increase. Adding 8+ cores from the other CPU doesn't start to decrease performance (doesn't increase it either...it tends to be about the same even up to having all 28 cores on one worker).

From that I tend to conclude that memory bandwidth is the real limit on running multiple workers or multi-core per worker. But even with good, reliable, fast DDR3 I'm not sure if it can beat what I see with solid DDR4 memory. And of course some of that may be the architectural differences between the Xeon v2 and Xeon v3 CPUs as well (faster QPI between sockets, etc).

Summary is, you may have to experiment to see what works best in your situation, but those were my findings if it's helpful to you.

Madpoo 2015-12-23 18:15

[QUOTE=blip;419965]Ok, I put all 6 cores now on 59425643. It should be finished in about 70 h.[/QUOTE]

Oh, and by the way, yeah, the fact that 6 cores on one worker can do it in 70 hours... from that we could say that, *ideally* if it were just 1 core chugging away at it, it might finish in 70 * 6 hours, or 17.5 days.

So maybe the 20-21 days it was going to take wasn't so far off the mark. On the other hand, adding extra cores to a worker doesn't scale linearly, so a single cored worker would be a bit more efficient and might look more like 70 * 5.5 or so and finish in around 16 days.
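That back-of-the-envelope arithmetic can be written out; the 5.5 "effective cores" figure is just my ballpark for the scaling loss, not a measured number:

```python
# Convert a multi-core worker's runtime into a single-core estimate.
# effective_cores is below the real core count because adding cores to
# one worker doesn't scale linearly (ballpark ~5.5 effective for 6 real).
def single_core_days(hours_on_n_cores: float, effective_cores: float) -> float:
    return hours_on_n_cores * effective_cores / 24.0

print(single_core_days(70, 6.0))   # ideal linear scaling: 17.5 days
print(single_core_days(70, 5.5))   # more realistic: ~16 days
```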

I applaud you for tackling a big 100M digit exponent. They're fun when they finally finish... some weird sense of accomplishment. In your shoes I might consider something like having 3 cores work on that one, and have the other 3 cores do double-checking work, preferably exponents below 38M since that's what worked best for me. In my case I had dual CPUs so I'm not sure what it would be to split the work like that on a single CPU.

blip 2015-12-23 20:56

[QUOTE=Madpoo;420003]
I applaud you for tackling a big 100M digit exponent. They're fun when they finally finish... some weird sense of accomplishment. In your shoes I might consider something like having 3 cores work on that one, and have the other 3 cores do double-checking work, preferably exponents below 38M since that's what worked best for me. In my case I had dual CPUs so I'm not sure what it would be to split the work like that on a single CPU.[/QUOTE]
Thanks. I will finish all other exponents on that machine first and then figure out how to go forward. When I first got that big exponent, mprime reported an ETA > 2000d. Let's see how we can improve that.

Madpoo 2015-12-24 06:35

[QUOTE=blip;420022]Thanks. I will finish all other exponents on that machine first and then figure out how to go forward. When I first got that big exponent, mprime reported an ETA > 2000d. Let's see how we can improve that.[/QUOTE]

Well, with 6 cores all working on that one exponent, you could hope for a six-fold improvement, maybe 340 days. In reality it might be more like a year, give or take a month? :smile: Could be faster because your estimate of 2000 days could have been slowed down since it was running 5 other smaller ones at the same time. Hard to say.

Not for the faint of heart.

manfred4 2015-12-24 11:29

One other way to improve that would be factoring it to 82 bits first, which has quite a decent chance of ruling that candidate out once and for all.
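By the usual rule of thumb (a heuristic, not an exact result: the chance of a factor between 2^b and 2^(b+1) is roughly 1/b), the odds for those extra bits can be estimated:

```python
# Heuristic: probability that a Mersenne number has a factor between
# 2^b and 2^(b+1) is roughly 1/b, so sum over the bit levels covered.
def chance_of_factor(from_bits: int, to_bits: int) -> float:
    return sum(1.0 / b for b in range(from_bits, to_bits))

print(f"{chance_of_factor(78, 82):.1%}")  # 78 -> 82 bits: about 5%
```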

blip 2015-12-24 18:43

[QUOTE=manfred4;420073]One other way to improve that would be factoring it to 82 bits first, which has quite a decent chance of ruling that candidate out once and for all.[/QUOTE]
well, it could be prime...

But yes, I know. I pushed it to 78 bits, and then decided to give it a try with LL, just to see if and how it works with an exponent of that size (you have to start somewhere...). P-1 took a while, and now I have a process on that machine that will probably run until the EOL of that specific system :-)

I need more power!

cuBerBruce 2015-12-26 15:01

[QUOTE]Countdown to first time checking all exponents below 60M: 1[/QUOTE]

[size=5][b]1[/b][/size] to go. (And is that last one possibly done already?)

henryzz 2015-12-26 15:22

[QUOTE=Madpoo;419937]One problem presented by these ideas is that the interim files aren't small. Sure, individually the file for something in the 72M range is "only" 9.2 MB, but cumulatively, for all of the assignments out there... well, it's a lot of data.

By way of estimate, take the exponent size and divide by 8 to get the approx file size in bytes.

For all active LL and DC assignments, that would add up to:
815,845,182,349 bytes (816 GB / 760 GiB).

Granted, many active assignments have zero progress. If I only include assignments with a percent done > 0:
259,718,199,046 bytes (260 GB / 242 GiB).

Sure, okay, disk storage [I]could[/I] handle that, but what about bandwidth concerns? How often would each assignment be expected to check in their latest interim file? Probably not time based, but every XX %.

Let's say it was every 25%, so realistically it would upload an interim file 3 times (at 25, 50 and 75%). So you'd be uploading 3x whatever. It would be spread out over however many days... I haven't the foggiest idea what the average time is for a worker to reach 25% complete. :smile: Depends on exponent size and CPU speed, so any average will be wildly misleading. LOL

Anyway, you get the idea... storage and bandwidth are the problems with any kind of interim file repository. Back in dialup days it was even more so than today, but even now with nice fast internet connections, on the server side at least, there's a cost to bandwidth. I forget how much monthly data the server's current home allows (it's higher than what we use, I'll leave it at that), but suffice to say, it's less than it would need to be.

On other server colocations, you're not paying for total data per month, but rather it's billed at a 95th percentile of bandwidth, so it's a little more possible on a budget but only if the uploads (and occasional downloads?) are spread out evenly to avoid big spikes at peak times of day or something. I have a feeling the Prime95 client, were it to automate uploading interims at whatever %, it would average fairly smoothly over the course of a day between all clients.[/QUOTE]
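The size estimate in the quote is easy to sanity-check: an LL interim residue for exponent p is a p-bit number, hence roughly p/8 bytes on disk. A quick sketch (the helper name is mine, and it ignores any file header overhead):

```python
# An LL interim residue for exponent p is a p-bit number, so the save
# file is roughly p/8 bytes (plus a small header, ignored here).
def interim_file_bytes(p: int) -> int:
    return p // 8

mb = interim_file_bytes(73_600_000) / 1e6
print(f"{mb:.1f} MB")  # an exponent in the ~72M range: ~9.2 MB
```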

How fast is the needed bandwidth increasing in comparison with how fast the cost per MB is decreasing?
I would guess that this sort of setup gets cheaper over time. When might it become viable?

cuBerBruce 2015-12-26 23:22

[QUOTE]All exponents below 60,343,331 have been tested at least once.[/QUOTE]

The 60M milestone has been reached!
:party: :toot:


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.