mersenneforum.org  

Old 2011-01-07, 17:10   #12
lorgix
Sep 2010
Scandinavia
3·5·41 Posts

Quote:
Originally Posted by TheJudger View Post
Take a look on that page: http://mersenne.org/various/math.php

When talking about mfaktc, the algorithm on that page is running on the GPU while the preselection of candidates (that part which mentions "sieve of Eratosthenes") runs on the CPU.
Thanks, I think I've got it now.

In short: in order to take advantage of the superior factoring capabilities of a GPU, I would have to invest CPU cycles anyway. It's still more efficient, given that I want to factor the number at hand, but it's certainly a different deal.
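
To make that CPU/GPU split concrete for myself, here's a minimal sketch of the idea (Python, purely illustrative; it is not mfaktc code, and the sieve limit and k range are arbitrary):

Code:
# Candidate factors of 2^p - 1 (p prime) have the form q = 2*k*p + 1 with
# q = 1 or 7 (mod 8).  The cheap preselection -- the mod-8 test plus a sieve
# of Eratosthenes over small primes -- is the part mfaktc runs on the CPU;
# the modular powering 2^p mod q is the part it runs on the GPU.

def small_primes(limit):
    """Sieve of Eratosthenes: all primes below `limit`."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return [i for i, flag in enumerate(sieve) if flag]

def trial_factor(p, kmax, sieve_limit=1000):
    """Look for a factor q = 2*k*p + 1 of 2^p - 1 with k up to kmax."""
    primes = small_primes(sieve_limit)
    for k in range(1, kmax + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):                            # cheap mod-8 test
            continue
        if any(q % s == 0 for s in primes if s * s <= q):  # sieve step
            continue
        if pow(2, p, q) == 1:                              # expensive powering
            return q
    return None

# The math page's own worked example, if I remember it right: 47 divides 2^23 - 1.
print(trial_factor(23, 100))   # prints 47
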
Old 2011-01-07, 18:04   #13
lorgix
Sep 2010
Scandinavia
3×5×41 Posts

Quote:
Originally Posted by Mr. P-1 View Post
Your understanding isn't correct. prime95 does P-1 stage 2 in blocks. Each block has a size which is a multiple of 30 (2*3*5), 210 (2*3*5*7), or 2310 (2*3*5*7*11). The "relative primes" are relative to the chosen blocksize. 480 is indeed all of the relatively prime congruence classes modulo 2310.
I'm sorry, my understanding of how P-1 is done isn't perfect.
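
Just to check the numbers in that quote for myself (a throwaway Python snippet, nothing more):

Code:
# The residue classes used in stage 2 are the ones relatively prime to the
# chosen blocksize; for 2310 = 2*3*5*7*11 there should be exactly 480 of them.
from math import gcd

for blocksize in (30, 210, 2310):
    count = sum(1 for r in range(blocksize) if gcd(r, blocksize) == 1)
    print(blocksize, count)   # prints: 30 8, 210 48, 2310 480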

Quote:
It's not difficult to extrapolate from known memory usage to the amount necessary to do all 480 relative primes in one pass. My default memory setting on my Core 2 Duo is 1370MB, which allows one core to do 60 relative primes per pass on an exponent using a 2560K FFT, while the other is doing stage 1. That suggests that about 10GB would be sufficient to do all 480 relative primes on exponents of this size. What I don't know is whether prime95 would choose a larger blocksize with that kind of memory available.
It seems efficiency goes up with available memory. And thus my question is: how do I maximize my efficiency with the memory I do have? Or: what can I do with maximum efficiency?
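
A rough back-of-the-envelope check of the ~10GB extrapolation in that quote (my own assumption: stage 2 needs roughly one FFT-sized buffer of 8-byte values per relative prime in a pass, plus some fixed overhead; the real prime95 accounting is more involved):

Code:
# 2560K FFT, 8 bytes per element -> about 20MB per relative-prime buffer.
fft_len = 2560 * 1024
buffer_mb = fft_len * 8 / 2**20

for relprimes in (60, 480):
    print(relprimes, "relative primes ->", round(relprimes * buffer_mb), "MB")
# 60  -> 1200 MB  (close to the 1370MB setting; the rest is overhead)
# 480 -> 9600 MB  (the ~10GB figure)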

Also, not that it makes a big difference either: when memory is sufficient, Suyama's extension can be used.

Quote:
Bear in mind that the per pass overhead is small compared to the overall running time of the algorithm. If you double the number of relative primes per pass, from 20 to 40 say, you'll save X amount of time. Double it again, to 80, and the additional saving is X/2. The returns really do diminish quite rapidly.
So that's how fast the potential time saved diminishes, ok. Do you also know the relative amount of the time saved?

Are you saying that the time a pass takes is roughly proportional to the number of relative primes processed during it?

Quote:
For this reason, when specing out a machine for GIMPS it's generally not cost effective to load it with vast amounts of memory. You'd do better to spend the money on a faster processor, faster memory, etc.
Right now I'm thinking about what to do with my present hardware.

And it doesn't make sense to spend extra cycles and time just by not allocating memory that is already available.

But I get your point, more generally. I have 8GB of RAM. Let's say I insist on Intel; the price difference between a dual core and a quad core would correspond to ~1.5GB of RAM...

Speaking of hardware: do you know which parameters affect P-1 and/or ECM performance most? Timings, for example. (Any other comments on the hardware portion of my original post?)

Quote:
I'm not familiar with how prime95 ECM uses memory, but I would imagine that the same principles apply.
It's more complicated than the case of P-1.

Fix B1, and memory use will not increase strictly with B2...

I have even less understanding of ECM than of P-1.
Old 2011-01-07, 23:40   #14
Mr. P-1
Jun 2003
7·167 Posts

Quote:
Originally Posted by lorgix View Post
It seems efficiency goes up with available memory. And thus my question is: how do I maximize my efficiency with the memory I do have? Or: what can I do with maximum efficiency?
As you know, more is better, but only if it reduces the number of passes, or if it results in a better plan (basically the choice of blocksize and the Suyama exponent.)

Quote:
Also, not that it makes a big difference either: when memory is sufficient, Suyama's extension can be used.
Suyama's extension doesn't require a huge additional amount of memory. The reason it's only used when memory is very plentiful is that the net benefit is so small that the extra memory required is usually better employed reducing the number of passes.

Quote:
So that's how fast the potential time saved diminishes, ok.
Yes, but it's not smooth. Increasing the number of relative primes per pass from 120 to 159 will make no difference at all. Increasing from 159 to 160 will make a difference.
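
The arithmetic behind that, as a throwaway snippet (taking the 480 relative-prime classes as the total):

Code:
import math

# Number of stage 2 passes needed for a given "relative primes per pass",
# with 480 relative primes in total.
for per_pass in (120, 140, 159, 160):
    print(per_pass, "->", math.ceil(480 / per_pass), "passes")
# 120, 140 and 159 all need 4 passes; 160 drops it to 3.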

Quote:
Do you also know the relative amount of the time saved?
No, I haven't attempted to benchmark it at different numbers of passes. Too many confounding factors. My standard configuration is, as I remarked above, 1370MB (out of 2GB) for 60 relative primes per pass over eight passes. I've tried it at 69 for seven passes, and 80 for six, but it doesn't make a great deal of difference.

Quote:
Are you saying that the time a pass takes is roughly proportional to the number of relative primes processed during it?
Yes, other things held equal.

Quote:
And it doesn't make sense to spend extra cycles and time just by not allocating memory that is already available.
No, of course not.

Quote:
Speaking of hardware: do you know which parameters affect P-1 and/or ECM performance most? Timings, for example. (Any other comments on the hardware portion of my original post?)
I'm no hardware expert, but I would conjecture the memory speed to make more difference than the memory timings, particularly during stage 2. The FFT code (used by LL, and both stages of P-1 and ECM) makes good use of memory prefetches, which I would expect to mitigate the effect of changes in memory latency. Stage 2 has a voracious demand for memory bandwidth. (All that allocated memory is being used, after all.) I've noticed that on my own machine, if both cores are running P-1 stage two, both take a considerable performance hit, which is why I have maxhighmemworkers set to 1 in local.txt.

On my machine I usually set the core that is doing stage 2 to a high priority, well above standard, so as to shift all other processes onto the other core. Still I find I accumulate stage 1 work over time, and occasionally switch the other core to some other task.

Quote:
It's more complicated than the case of P-1.

Fix B1, and memory use will not increase strictly with B2...
Nor does it with P-1. Optimal B2 increases with increasing memory, not because it takes more memory, but because it's more cost effective.

Quote:
I have even less understanding of ECM than of P-1.
I understand P-1 reasonably well, though there are undoubtedly subtleties which elude me. I don't pretend to understand how ECM works.
Old 2011-01-08, 07:48   #15
lorgix
Sep 2010
Scandinavia
267₁₆ Posts

Quote:
Originally Posted by Mr. P-1 View Post
As you know, more is better, but only if it reduces the number of passes, or if it results in a better plan (basically the choice of blocksize and the Suyama exponent.)

...

Suyama's extension doesn't require a huge additional amount of memory. The reason it's only used when memory is very plentiful is that the net benefit is so small that the extra memory required is usually better employed reducing the number of passes.
Yes, that was my understanding.

Someone put it well, along the lines of:
(relative primes) mod (relative primes per pass) should be either 0, or only slightly less than (relative primes per pass)

I'm a little confused about blocksizes... IIRC the run I finished last night used 432 relative primes and E=12. That took ~5GB, btw.
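
Spelling that rule of thumb out for myself (quick Python; I'm assuming the full 480 relative primes as the total):

Code:
# How full the final stage 2 pass is for a given "relative primes per pass".
# You want it full (remainder 0) or nearly full, otherwise the last pass
# carries its overhead for only a handful of relative primes.
def last_pass_fill(total, per_pass):
    rem = total % per_pass
    return per_pass if rem == 0 else rem

for per_pass in (60, 69, 130):
    print(per_pass, "->", last_pass_fill(480, per_pass), "of", per_pass)
# 60 -> 60 of 60 (full), 69 -> 66 of 69 (nearly full), 130 -> 90 of 130 (wasteful)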


Quote:
Yes, but it's not smooth. Increasing the number of relative primes per pass from 120 to 159 will make no difference at all. Increasing from 159 to 160 will make a difference.
Isn't this slightly in conflict with saying that the time a pass takes is roughly proportional to the number of relative primes processed during it?

Quote:
I'm no hardware expert, but I would conjecture the memory speed to make more difference than the memory timings, particularly during stage 2. The FFT code (used by LL, and both stages of P-1 and ECM) makes good use of memory prefetches, which I would expect to mitigate the effect of changes in memory latency. Stage 2 has a voracious demand for memory bandwidth. (All that allocated memory is being used, after all.) I've noticed that on my own machine, if both cores are running P-1 stage two, both take a considerable performance hit, which is why I have maxhighmemworkers set to 1 in local.txt.
Sounds like memory parameters are relevant then.

I'd like to get input from someone with hardware expertise.

Right now I can read at 10.4GB/s and write at 7.8GB/s.

Quote:
Nor does it with P-1. Optimal B2 increases with increasing memory, not because it takes more memory, but because it's more cost effective.
It doesn't?

Anyway, in ECM the memory use as a function of B2 is a discontinuous function.
Old 2011-01-08, 08:30   #16
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts

Quote:
Originally Posted by Mr. P-1 View Post
Bear in mind that the per pass overhead is small compared to the overall running time of the algorithm. If you double the number of relative primes per pass, from 20 to 40 say, you'll save X amount of time. Double it again, to 80, and the additional saving is X/2. The returns really do diminish quite rapidly.
Quote:
Originally Posted by lorgix View Post
So that's how fast the potential time saved diminishes, ok.
Quote:
Originally Posted by Mr. P-1 View Post
Yes, but it's not smooth. Increasing the number of relative primes per pass from 120 to 159 will make no difference at all. Increasing from 159 to 160 will make a difference.
Quote:
Originally Posted by lorgix View Post
Isn't this slightly in conflict with saying that the time a pass takes is roughly proportional to the number of relative primes processed during it?
No conflict.

Mr. P-1's "Yes, but it's not smooth. Increasing ..." response was about "per pass overhead", not total time per pass.

Your "roughly proportional to the amount of relative primes processed during it" applies to total time per pass.
Old 2011-01-08, 09:02   #17
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts

Quote:
Originally Posted by Mr. P-1 View Post
I'm not familiar with how prime95 ECM uses memory, but I would imagine that the same principles apply.
Quote:
Originally Posted by lorgix View Post
It's more complicated than the case of P-1.

Fix B1, and memory use will not increase strictly with B2...
Quote:
Originally Posted by Mr. P-1 View Post
Nor does it with P-1. Optimal B2 increases with increasing memory, not because it takes more memory, but because it's more cost effective.
Quote:
Originally Posted by lorgix View Post
It doesn't?
There are three ways in which prime95 P-1 may be invoked:

1) Implicitly, as part of a "Test=" LL assignment for an exponent for which P-1 has not yet been performed,

2) Explicitly with "Pfactor=" (this is the worktodo keyword for a P-1 assignment from PrimeNet),

3) Explicitly with "Pminus1="

In case 3), the Pminus1= line explicitly specifies B1 and B2. Prime95 does no bounds-choosing, but uses what's specified. This is not the case you're talking about here.

In cases 1) and 2), prime95 does choose the optimal B1 and B2 bounds. One of the inputs to the bounds-choosing algorithm is the user-specified Available Memory. The algorithm will calculate how many workareas can be accommodated in the Available Memory, and whether or not the Suyama extension will yield a worthwhile improvement. Then, it tries varying values of B1 and B2 to find the optimal ratio of

(probability of finding a factor at that B1/B2)*

to

(estimated elapsed time for stage 1 plus stage 2 at that B1/B2).

In general, increasing the amount of Available Memory allows the algorithm to make its estimates with larger numbers of workareas and, if memory is high enough, with the Suyama extension. Those, in turn, affect both the probability of finding a factor and the estimated elapsed run time for stages 1 and 2.

So, in the bounds-choosing algorithm, it's not that increasing B2 => using more memory.

It's that allowing more memory => being able to use more workareas and Suyama extension, which become more effective at raising the probability-to-time ratio for higher B2 values.

- - -

* - Actually, the ratio is

(estimated time saved by not running LL(s) if a factor is found)

to

(estimated elapsed time for stage 1 plus stage 2 at that B1/B2).

However, (estimated time saved by not running LL(s) if a factor is found) =

(probability of finding a factor at that B1/B2) * (estimated time for remaining LL(s) if no factor is found).

Since the algorithm is always concerned with only one exponent at a time, the (estimated time for remaining LL(s) if no factor is found) is constant for all choices of B1/B2, and can be omitted from a description comparing one set of B1/B2 to another set of B1/B2 for optimal P-1 on that exponent.
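
To show only the shape of that bounds-choosing loop, here's a schematic sketch. The probability and timing functions below are made-up placeholders, not prime95's real formulas; only the structure (score each candidate B1/B2 by expected LL time saved versus P-1 time, with a workarea count derived from the Available Memory, and keep the best) reflects the description above:

Code:
import math

BUFFER_MB = 20          # assumed size of one stage 2 workarea (illustrative)
TOTAL_RELPRIMES = 480

def factor_probability(B1, B2):          # placeholder model, NOT prime95's
    return min(0.1, 0.01 * math.log10(B1) + 0.005 * math.log10(B2 / B1))

def p1_time(B1, B2, passes):             # placeholder model, NOT prime95's
    return 2.0 * B1 + 0.05 * (B2 - B1) * (1 + 0.1 * (passes - 1))

def choose_bounds(ll_time, available_mb):
    per_pass = max(1, available_mb // BUFFER_MB)       # workareas that fit
    passes = math.ceil(TOTAL_RELPRIMES / per_pass)
    best = None
    for B1 in (500_000, 1_000_000, 2_000_000):         # coarse illustrative grid
        for mult in (20, 30, 40, 60):
            B2 = B1 * mult
            saved = factor_probability(B1, B2) * ll_time   # expected LL time saved
            spent = p1_time(B1, B2, passes)
            ratio = saved / spent                          # the footnoted ratio
            if best is None or ratio > best[0]:
                best = (ratio, B1, B2)
    return best

print(choose_bounds(ll_time=3_000_000, available_mb=1370))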

Last fiddled with by cheesehead on 2011-01-08 at 09:20
Old 2011-01-08, 09:14   #18
lorgix
Sep 2010
Scandinavia
267₁₆ Posts

A bigger memory area changes the optimal bounds. That makes sense.
Old 2011-01-08, 10:38   #19
lavalamp
Oct 2007
Manchester, UK
53·11 Posts

Quote:
Originally Posted by lorgix View Post
Hello,

I'm trying to figure out clock speeds, timings and voltages for my RAM, MB & CPU.


Hardware:

ASUS P7H55
i3 540
3.07 -> 3.20GHz
4*2GB Corsair XMS3 1600MHz CL9


CPU & MB:

I'm running the CPU at 160*20 instead of the spec 133*23. Vcore is set to Auto, which amounts to 1.13~1.14V in practice. This seems to require ~53W at boot, but then it runs at ~50.1W.

I recently installed an additional fan; since then the core temp hasn't passed 58°C. Before that it would reach at least 61°C.

Now: should I tinker with core voltage, PCH voltage, IMC voltage or PLL voltage? Does the above make sense? Any comments, questions or tips?
I'd bump the multiplier back up to 23 (in single increments of course, stress testing after each). Since you're already at 160 BCLK, you can easily get 480 MHz more out of your CPU without affecting any other components.

Additionally, if you disable all of the EIST (SpeedStep) and the various sleep states so that the CPU can't downclock itself when there's minimal activity, but leave Turbo mode enabled, then on some motherboards you effectively gain access to permanent Turbo mode, where the CPU can't downclock once it's on. If this applies to you then you could get to 3,840 MHz without changing the BCLK at all. However, you would likely want to improve the cooling first. Unless you're relying on Turbo for your overclock, though, I recommend disabling it once you get to higher frequencies (greater than 3.5GHz, say).
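
The clock arithmetic behind those numbers, for anyone following along (the 24 multiplier is my assumption of a single Turbo bin on this chip):

Code:
bclk = 160
for mult in (20, 21, 22, 23, 24):   # 24 = 23 plus one Turbo bin (assumed)
    print(mult, "x", bclk, "=", mult * bclk, "MHz")
# e.g. 20 -> 3200, 23 -> 3680 (the extra 480 MHz), 24 -> 3840 (permanent-Turbo figure)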

Before changing any more settings, I would recommend running the Intel Burn Test. It generally finds faults quicker than the Prime95 torture test:
http://downloads.guru3d.com/IntelBur...load-2047.html

Run at least 5 rounds with as much memory as you have available, and close all other programs to put as much stress on the CPU as possible.

Intel say that the safe upper limit for the i3 VCore is 1.4 V, but a lot of people stay below 1.35 or 1.3 V for 24/7 use on air cooling. You should be able to get a lot of mileage out of your CPU before 1.3 V though, providing your cooling can keep up.
Old 2011-01-08, 11:49   #20
lorgix
Sep 2010
Scandinavia
267₁₆ Posts

Thank you very much for input on the hardware!

A little update;

Vcore ranges 1.128~1.144V. Power draw is 50.21W most of the time (very stable).

I don't think any automatic downclocking is enabled, apart from that overheat protection whose name I can't remember. It doesn't seem to ever have kicked in, though.

I think 58°C was a one-time thing; it didn't get there again in 30+ hours of almost completely uninterrupted full load. So I reset the log: it reached 56°C three hours ago and has otherwise stayed below 56°C for a little over three hours now. I'm guessing it would reach 57°C with a little more disk and GPU activity.

160*20 has been thoroughly tested. I think by now I can say the same about the 8-9-9-23 timings. (I've been running several P-1 sessions allocating 4~6GB for many hours.)

The system failed after 1~2 days at 160*21, but I'm almost positive that was caused by too aggressive RAM tweaking.

All cooling is stock, but I put in two additional fans and set high RPM on all of them: one moving air out (2000~2100RPM, 120mm) below the PSU, the other pulling air in (3000~3100RPM, 90mm) through a "tube" directed at the CPU (the CPU fan runs mostly at 2000~2100RPM, though it drops when idle, which I don't like). I also removed an older disk that was producing heat.

The GPU doesn't go much above 40°C the way I use it. My only internal disk barely reaches body temperature, and the motherboard has been almost indistinguishable from room temperature since I added the extra fans.

Everything I've described so far seems stable.


Moving on:

The program you linked says XP/Vista. I downloaded it; should it be OK on 64-bit Win7?

The BIOS allows BCLK to be set to 80-500. The CPU multiplier is 9-23. The RAM multiplier is 10, 8 or lower...

Is BCLK somehow sensitive in itself? Or does it just set the pace for CPU, RAM, DMI etc.? I haven't been keeping up with hardware for a few CPU generations...

Should maxing out the CPU be considered a problem with one, two or three variables (multiplier, voltage and BCLK)? In other words: is it enough to maximize BCLK*multiplier? Or...? You get the point.

Should I leave CPU voltage on [Auto] while setting mult. to 21?

Any comments/tips regarding RAM?


Thank you very much!
Old 2011-01-08, 13:53   #21
Mr. P-1
Jun 2003
7·167 Posts

Quote:
Originally Posted by cheesehead View Post
In case 3), the Pminus1= line explicitly specifies B1 and B2. Prime95 does no bounds-choosing, but uses what's specified. This is not the case you're talking about here.
Correct, and excuse me for not making that clear.

Quote:
In cases 1) and 2), prime95 does choose the optimal B1 and B2 bounds. One of the inputs to the bounds-choosing algorithm is the user-specified Available Memory. The algorithm will calculate how many workareas can be accommodated in the Available Memory, and whether or not the Suyama extension will yield a worthwhile improvement. Then, it tries varying values of B1 and B2 to find the optimal ratio of

(probability of finding a factor at that B1/B2)*

to

(estimated elapsed time for stage 1 plus stage 2 at that B1/B2).
Quote:
* - Actually, the ratio is

(estimated time saved by not running LL(s) if a factor is found)

to

(estimated elapsed time for stage 1 plus stage 2 at that B1/B2).

However, (estimated time saved by not running LL(s) if a factor is found) =

(probability of finding a factor at that B1/B2) * (estimated time for remaining LL(s) if no factor is found).
While this is not technically incorrect (the ratio exists, it has a particular value at the optimal B1,B2 which is different from its value at other B1,B2, and so it could be considered the optimal ratio), it is quite misleading.

The quantity being optimised is the difference, not the ratio, between the cost of the computation and the expected benefit. The bounds are optimal when this quantity is minimised, which happens when the partial derivatives of this quantity with respect to B1 and B2 are both zero.

If \Delta (cost - benefit) = 0
then \Delta cost = \Delta benefit
and so \Delta cost / \Delta benefit = 1

So there is an optimal ratio, but it is between the delta values, or equivalently between the partial derivatives (with respect to both B1 and B2) of cost and benefit, not between cost and benefit per se.
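
In symbols (my notation, merely restating the above): write P(B1,B2) for the probability of finding a factor, T_LL for the LL time saved if one is found, and C(B1,B2) for the stage 1 plus stage 2 cost. The chosen bounds maximise

\[ E(B_1,B_2) = P(B_1,B_2)\,T_{\mathrm{LL}} - C(B_1,B_2), \]

and at the optimum

\[ \frac{\partial E}{\partial B_i} = 0 \quad\Longleftrightarrow\quad \frac{\partial C/\partial B_i}{\partial (P\,T_{\mathrm{LL}})/\partial B_i} = 1 \qquad (i = 1,2), \]

which is the optimal ratio between the derivatives described above.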

Last fiddled with by Mr. P-1 on 2011-01-08 at 14:23
Old 2011-01-08, 14:19   #22
Mr. P-1
Jun 2003
7·167 Posts

Quote:
Originally Posted by cheesehead View Post
No conflict.
Correct.

Quote:
Mr. P-1's "Yes, but it's not smooth. Increasing ..." response was about "per pass overhead", not total time per pass.
Yes, specifically about the total per pass overhead for the entire stage 2 run.

Quote:
Your "roughly proportional to the amount of relative primes processed during it" applies to total time per pass.
Yes. It's not exactly proportional (because the per pass overhead is a constant) but it's roughly proportional (because the per pass overhead is small). Moreover, the more relative primes processed in each pass, the smaller the departure from exact proportionality.
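
As a toy cost model of what we've been saying (the constants are made up; only the shape matters): each pass pays a fixed overhead plus a per-relative-prime cost, so only the overhead term depends on how the 480 relative primes are split into passes.

Code:
import math

OVERHEAD, UNIT = 40.0, 1.0                 # arbitrary units
def stage2_time(per_pass, total=480):
    passes = math.ceil(total / per_pass)
    return passes * OVERHEAD + total * UNIT

for per_pass in (20, 40, 80, 480):
    print(per_pass, "->", stage2_time(per_pass))
# 20 -> 1440, 40 -> 960 (saves 480), 80 -> 720 (saves a further 240, the "X/2"),
# 480 -> 520 (a single pass)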

Last fiddled with by Mr. P-1 on 2011-01-08 at 14:19