#1
Jun 2005
373 Posts
Last night before falling asleep I had the following idea (I don't know if it has been discussed or implemented, or if there is a better method).

I come from Seventeenorbust, and we are running many P-1 tests in a row. Stage 2 needs a lot of memory; stage 1 does not. My first idea was to run two copies of prime95 on a hyperthreading machine, assigning little memory to one copy and a lot to the other. The first copy runs stage 1, the second runs stage 2 (of another test, where stage 1 has already been done). This could be organised manually, or by a small program that takes care of the savefiles and the worktodo.ini files. But then I thought the easiest way would be to implement it directly in prime95 itself; I assume the optimal B1/B2 would have to be chosen differently, too.

I know that prime95 can take advantage of two or more processors, but perhaps not in this way. On the other hand, perhaps the fact that there are two processors but only one memory would slow things down. I don't know; it's just an idea. I'm looking forward to your replies, though.

Yours sincerely, H.
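The "small program that takes care of the savefiles and the worktodo.ini's" could be sketched as follows. This is a hypothetical illustration, not anything prime95 actually does: the `Pminus1=k,b,n,c,B1,B2` line format is taken from the examples later in this thread, and the convention that B2 = B1 means "stage 1 only" is an assumption of the sketch.

```python
# Hypothetical sketch: split a shared worktodo list between two prime95
# instances on one HT machine. Lines with B2 <= B1 (stage 1 only, little
# memory needed) go to one instance; lines with B2 > B1 (work continuing
# into stage 2, lots of memory) go to the other.

def split_worktodo(lines):
    """Partition Pminus1= worktodo lines into (stage1_jobs, stage2_jobs)."""
    stage1, stage2 = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith("Pminus1="):
            continue  # skip anything that is not a P-1 assignment
        fields = line[len("Pminus1="):].split(",")
        b1, b2 = int(fields[4]), int(fields[5])  # B1 and B2 bounds
        (stage1 if b2 <= b1 else stage2).append(line)
    return stage1, stage2

if __name__ == "__main__":
    work = [
        "Pminus1=10223,2,9360665,1,50000,50000,0",   # stage 1 only
        "Pminus1=10223,2,9360665,1,50000,500000,0",  # continues into stage 2
    ]
    low_mem_jobs, high_mem_jobs = split_worktodo(work)
    print(len(low_mem_jobs), len(high_mem_jobs))
```

A real coordinator would also have to move the `l<exponent>` savefiles between the two instances' directories, which this sketch does not attempt.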
#2
May 2005
2·809 Posts
If you really mean Intel's hyperthreading technology (one physical + one logical CPU), then I doubt there would be any significant improvement from running two intensive tasks simultaneously. That should change, however, when a multicore CPU is used (e.g. two physical CPUs).
#3
Mar 2005
Denmark
7 Posts
I really can't figure out whether there would be any gain.

As I run these P-1 tests right now, about half of the time the computer has a lot of its memory taken by mprime, and the other half of the time almost nothing is taken. If I turned on HT and did as suggested, then as far as I can tell I would just get the same throughput but have lots of memory taken all the time. In both stage 1 and stage 2 it is the memory that is "the bottleneck"; if memory access speed or something else had been the bottleneck, then I could see where the gain would come from. I may very well be wrong, and there might actually be ways to speed this process up. For example, I have two P4 boxes, one with little RAM (256 MB) and one with enough (1024 MB); if I could set them up so that the one with little RAM only does stage 1, and then let the one with enough RAM do stage 2, there would probably be a gain somehow. But doing all of this manually just takes more time than I want to spend on it.
#4
Jul 2004
Nowhere
809 Posts
Quote:
I know that prime95 can take advantage of two or more processors, but perhaps not in this way. Perhaps, on the other hand, would the fact that there are two processors, but only one memory, slow down things.

This is incorrect. Prime95 does not take advantage of two processors; you actually have to run two copies of prime95, set to different processors, manually. Prime95 is a single-processor program.
#5
"Nancy"
Aug 2002
Alexandria
2,467 Posts
I don't think it would help. Prime95 utilises the FPU at nearly 100% regardless of whether it is doing stage 1 or stage 2, so the two processes would merely compete for the FPU units. In stage 2, due to the far larger memory footprint, memory access latencies may be more common, so the stage 1 thread may be able to do a bit of useful work while stage 2 is waiting for data. I'd expect that to help less than the two processes throwing each other's data out of cache will hurt, though.

You might try running software in parallel that stresses completely different execution units, for example GMP-ECM, which does not use the FPU at all. Doing stage 1 on small numbers should use very little cache space as well, but possibly still too much if Prime95 assumes it gets the L1 cache all to itself.

Alex

Last fiddled with by akruppa on 2005-06-05 at 17:09
#6
Mar 2004
3×167 Posts
It seems to me that people are conjecturing it won't work without giving a reason, when running two copies of prime95 on (some) hyperthreaded machines does increase throughput.

Try it and see what happens. Set prime95 to (say) 1000 iterations per screen output, collect a bunch of screen outputs, and take an average. Do that for both stage 1 and stage 2. Then try running two copies, one doing stage 1 and the other doing stage 2, get an average of those, and tell us what they are.
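The averaging step above is simple bookkeeping; a sketch of it, assuming you transcribe the per-output timings (in milliseconds) from prime95's screen output by hand. The timing values below are placeholders, not measurements.

```python
# Average milliseconds per iteration over several prime95 screen outputs,
# each covering iters_per_output iterations (1000 as suggested above).

def avg_ms_per_iter(output_timings_ms, iters_per_output=1000):
    total_ms = sum(output_timings_ms)
    total_iters = len(output_timings_ms) * iters_per_output
    return total_ms / total_iters

# Placeholder numbers: one copy running alone, then the same copy while a
# second instance runs on the other logical CPU.
solo = avg_ms_per_iter([8210.0, 8190.0, 8240.0])
shared = avg_ms_per_iter([11900.0, 12100.0])

# Two copies are a net win if each runs less than twice as slow as solo:
print(shared < 2 * solo)
```

The comparison `shared < 2 * solo` is the throughput criterion: with two copies, each iteration may take longer, but total work per second goes up as long as the slowdown per copy stays under a factor of two.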
#7
Sep 2002
Oeiras, Portugal
11×131 Posts

Quote:
#8
Mar 2003
New Zealand
13×89 Posts
The best thing, as Alex suggested, is to do the timings yourself on your own machine. It can be surprising which programs work well together and which don't.

The fact that two programs use the same part of the CPU doesn't necessarily mean that they won't hyperthread well together. For example, running two instances of mprime, both doing stage 1 ECM on the same number, gives 15-20% better throughput on my P4.
#9
Nov 2003
164448 Posts
Quote:

This is false. I run two copies of my Number Field Sieve code on my home PC (a hyperthreaded P4 at 3.4 GHz) and it gives a significant speed-up. Here are some typical numbers. For the lattice sieve, on 2,749+ (a 12K x 6K sieve region), it takes 17 seconds to process one special q with just one instance of the code running. With two instances running, it takes 25 seconds each, but since two are running, that effectively means one special q every 12.5 seconds. This is a big improvement over 17 seconds. The biggest obstacle when running two copies is, of course, cache contention.
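The arithmetic behind these numbers checks out; written out, the two-instance case is a 36% throughput gain:

```python
# Checking the throughput arithmetic from the post above.
solo = 17.0             # seconds per special q, one instance running
each_of_two = 25.0      # seconds per special q, two instances running

effective = each_of_two / 2   # two special q complete every 25 s -> 12.5 s each
speedup = solo / effective    # 17 / 12.5

print(effective, round(speedup, 2))  # 12.5 1.36
```

So even though each instance runs almost 50% slower under contention, aggregate throughput still rises, because the per-instance slowdown (25/17 ≈ 1.47×) stays under the 2× break-even point.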
#10
Jun 2005
373 Posts
OK, I am beginning to believe it wouldn't work. But if somebody would like to make the test anyway, I would be glad. Here are some instructions, to make it easy, with a real example from Seventeenorbust. The tests will take some hours; you will not find a factor.

First, without HT. Assign prime95 the memory m (this is a variable). Put this line in your worktodo.ini:

Pminus1=10223,2,9360665,1,50000,50000,0

This is stage 1 of the first test. At the end, prime95 does not delete the savefile when exiting. It is called l9360665, without extension. Save a copy somewhere else, but keep it in the same directory too; we need it for stage 2. Note the time used in stage 1, then run stage 2:

Pminus1=10223,2,9360665,1,50000,500000,0

Note the time again.

Now, with HT. With two copies of prime95, assign much memory (m_1) to the first and less (m_2) to the second. Make sure that m_1 + m_2 = m. The first copy runs the line

Pminus1=10223,2,9360665,1,50000,500000,0

(don't forget to put the unused savefile back in the directory); the second copy runs stage 1 of another test:

Pminus1=10223,2,9360617,1,50000,50000,0

Time the slower of the two copies. If that time is less than the time for a whole test, we have a gain. If not, not. I would have done this myself, of course, but I have no HT machine. I apologize for the long instructions; you are all way more experienced than I am, but better to make sure.

See you, H.
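The pass/fail criterion in those instructions can be stated compactly. A sketch, with placeholder times rather than real measurements: with HT, one full test's worth of work (stage 1 of one exponent plus stage 2 of another) completes every "round", and the round lasts as long as the slower copy; without HT, you pay stage 1 plus stage 2 back to back.

```python
# H.'s criterion from the instructions above, as a function. All arguments
# are wall-clock seconds; the example values below are made up.

def ht_is_a_gain(stage1_solo_s, stage2_solo_s, stage1_ht_s, stage2_ht_s):
    whole_test = stage1_solo_s + stage2_solo_s   # sequential, no HT
    ht_round = max(stage1_ht_s, stage2_ht_s)     # slower copy sets the pace
    return ht_round < whole_test

# Placeholder example: stages slow down under HT but still beat sequential.
print(ht_is_a_gain(3600, 7200, 4500, 9000))
```

Note the slight asymmetry in the protocol: under HT, each round hands a stage-1 savefile to the next round's stage-2 run, so in steady state the pipeline stays full and the per-round comparison above is the right one.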
#11
Aug 2002
Termonfeckin, IE
3×919 Posts
George's SSE2 FFT code (also used in P-1) is probably more finely tuned than your NFS code, and therefore will likely not show a comparable speed increase.