mersenneforum.org  

Old 2005-06-05, 03:38   #1
hhh
 
Jun 2005

373 Posts
Increasing P-1 throughput with hyperthreading?

Last night before falling asleep I had the following idea (I don't know whether it has been discussed or implemented, or whether there is a better method).

I come from Seventeen or Bust, where we run many P-1 tests in a row.
Stage 2 needs a lot of memory; stage 1 does not.
My first idea was to run two copies of Prime95 on a hyperthreading machine, with little memory assigned to one copy and a lot to the other. The first copy runs stage 1; the second runs stage 2 of another test whose stage 1 has already been done.
This could be organised manually, or by a small program that takes care of the save files and the worktodo.ini files.
But then I thought that the easiest way would be to implement it directly in Prime95 itself; I assume that the optimal B1/B2 would have to be chosen differently, too.

I know that Prime95 can take advantage of two or more processors, but perhaps not in this way.
On the other hand, the fact that there are two processors but only one memory might slow things down.

Don't know. Just an idea.

I'm looking forward to your replies, though.
Yours sincerely, H.
Old 2005-06-05, 08:17   #2
Cruelty
 
May 2005

2^3×7×29 Posts
Hyperthreading?

If you really mean Intel's Hyper-Threading technology (one physical + one logical CPU), then I doubt there would be any significant improvement from running two intensive tasks simultaneously. That should change, however, when a multicore CPU is used (e.g. two physical CPUs).
Old 2005-06-05, 13:46   #3
Frodo42
 
Mar 2005
Denmark

7 Posts

I really can't figure out if there would be any gain.

As I run these P-1 tests right now, mprime takes lots of the computer's memory about half of the time and almost nothing the other half. If I turned on HT and did as suggested then, as far as I can tell, I would just get the same throughput but have lots of memory taken all the time.

In both stage 1 and stage 2 it is the memory that is "the bottleneck" ... if memory access or something else had been the bottleneck then I could see where the gain would be.

I may very well be wrong ... there might actually be ways to speed up this process. For example, I have P4 boxes, one with little RAM (256 MB) and one with enough (1024 MB) ... if I could set them up so that the one with little RAM only does stage 1 and the one with enough RAM does stage 2, then there would probably be a gain somehow, but doing all of this manually just takes more time than I want to spend on it.
Old 2005-06-05, 16:56   #4
moo
 
Jul 2004
Nowhere

809 Posts

Quote:
Originally Posted by hhh
I know that Prime95 can take advantage of two or more processors, but perhaps not in this way.
On the other hand, the fact that there are two processors but only one memory might slow things down.

This is incorrect. Prime95 does not take advantage of two processors; you actually have to run two copies of Prime95, each manually set to a different processor. Prime95 is a single-processor program.
Old 2005-06-05, 17:08   #5
akruppa
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts

I don't think it would help. Prime95 utilises the FPU nearly 100% regardless of whether it is doing stage 1 or stage 2, so the two processes will merely compete for the FPU. In stage 2, due to the far larger memory footprint, stalls on memory access may be more common, so the stage 1 thread may be able to do a bit of useful work while stage 2 is waiting for data. I'd expect that to help less than the two processes throwing each other's data out of cache will hurt, though.

You might try running software in parallel that stresses completely different execution units, for example GMP-ECM which does not use the fpu at all. Doing stage 1 on small numbers should use very little cache space as well, but possibly still too much if Prime95 assumes it gets to use the L1 cache all by itself.

Alex

Last fiddled with by akruppa on 2005-06-05 at 17:09
Old 2005-06-05, 23:02   #6
JuanTutors
 
Mar 2004

509 Posts

It seems to me like people are conjecturing that it won't work without giving a reason why running two copies of prime95 on (SOME) hyperthreaded machines increases throughput.

Try it and see what happens. Set it for (say) 1000 iterations per screen output, get a bunch of screen outputs and take an average. Do that for both stage 1 and stage 2. Then try running two copies, one doing stage 1 and the other doing stage 2, and get an average of those, and tell us what they are.
Old 2005-06-06, 12:39   #7
lycorn
 
Sep 2002
Oeiras, Portugal

5AB_16 Posts

Quote:
Originally Posted by dominicanpapi82
It seems to me like people are conjecturing that it won't work without giving a reason why running two copies of prime95 on (SOME) hyperthreaded machines increases throughput.
akruppa's answer is more than conjecture. His explanation about the two processes competing for FPU resources is correct and is the key reason why it doesn't work. Now, if one uses an HT machine to run two Prime95 copies, one running TF code and the other doing LL or P-1 work, that might benefit the overall project throughput, since in that case the two processes would be using different parts of the CPU and the advantages of HT would more likely be noticeable. AFAIR, some figures were presented in this forum a while ago.
Old 2005-06-07, 04:11   #8
geoff
 
Mar 2003
New Zealand

13·89 Posts

The best thing, as Alex suggested, is to do the timings yourself on your own machine. It can be surprising which programs work well together and which don't.

The fact that two programs use the same part of the CPU doesn't necessarily mean that they won't hyperthread well together. For example, running two instances of mprime, both doing stage 1 ECM on the same number, gives 15-20% better throughput on my P4.
Old 2005-06-07, 12:16   #9
R.D. Silverman
 
Nov 2003

2^2·5·373 Posts

Quote:
Originally Posted by Cruelty
If you really mean Intel's Hyper-Threading technology (one physical + one logical CPU), then I doubt there would be any significant improvement from running two intensive tasks simultaneously. That should change, however, when a multicore CPU is used (e.g. two physical CPUs).

This is false. I run two copies of my Number Field Sieve on my home PC (hyperthreaded P IV at 3.4 GHz) and it gives a significant speed-up.

Here are some typical numbers.

For the lattice sieve, on 2,749+ (a 12K x 6K sieve region), it takes 17 seconds to process one special q with just 1 instance of the code running. With 2 instances running, it takes 25 seconds each, but since there are two running, it effectively means one special q every 12.5 seconds. This is a big improvement over 17 seconds.
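
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the timings are the ones quoted in the post, and the function name `effective_time` is made up for illustration.

```python
def effective_time(seconds_per_unit: float, instances: int) -> float:
    """Wall-clock seconds per completed work unit when `instances`
    copies run side by side, each needing `seconds_per_unit`."""
    return seconds_per_unit / instances


single = effective_time(17.0, 1)  # one instance: 17 s per special q
dual = effective_time(25.0, 2)    # two instances at 25 s each

# Throughput ratio of the two-instance setup over the single instance.
speedup = single / dual
print(f"{single:.1f} s vs {dual:.1f} s per special q, throughput x{speedup:.2f}")
```

With these figures the two-instance setup comes out at roughly 1.36 times the single-instance throughput, even though each individual instance runs slower.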

The biggest obstacle when running two copies is, of course, cache contention.
Old 2005-06-07, 12:47   #10
hhh
 
Jun 2005

101110101_2 Posts

OK, I am beginning to believe it wouldn't work. But if somebody would like to run the test anyway, I would be glad. Here are some instructions, to make it easy, with a real example from Seventeen or Bust. The tests will take some hours, and you will not find a factor.
First, without HT.
Assign Prime95 the memory m (this is a variable). Put the following line in your worktodo.ini:

Pminus1=10223,2,9360665,1,50000,50000,0

This is stage 1 of the first test. Prime95 does not delete the save file when it finishes. The file is called l9360665, with no extension. Save a copy somewhere else, but keep it in the same directory too, since we need it for stage 2. Note the time used in stage 1, then run stage 2:

Pminus1=10223,2,9360665,1,50000,500000,0

Note the time again.
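
For anyone scripting this, here is a small sketch that splits such a worktodo line into its numbers. It assumes the fields read k,b,n,c,B1,B2 for the number k*b^n+c, which is my reading of the lines above and not something taken from the Prime95 documentation; any trailing fields are kept uninterpreted. The names `Pminus1Entry`, `parse_pminus1` and `runs_stage2` are made up for illustration.

```python
from typing import NamedTuple, Tuple


class Pminus1Entry(NamedTuple):
    k: int
    b: int
    n: int
    c: int
    b1: int
    b2: int
    extra: Tuple[int, ...]  # trailing fields, deliberately left uninterpreted


def parse_pminus1(line: str) -> Pminus1Entry:
    """Parse a 'Pminus1=k,b,n,c,B1,B2,...' line (field order is an assumption)."""
    key, _, value = line.strip().partition("=")
    if key != "Pminus1":
        raise ValueError(f"not a Pminus1 entry: {line!r}")
    fields = [int(x) for x in value.split(",")]
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields: {line!r}")
    return Pminus1Entry(*fields[:6], tuple(fields[6:]))


def runs_stage2(entry: Pminus1Entry) -> bool:
    """As used in this post, B2 == B1 means stage 1 only."""
    return entry.b2 > entry.b1


stage1_only = parse_pminus1("Pminus1=10223,2,9360665,1,50000,50000,0")
with_stage2 = parse_pminus1("Pminus1=10223,2,9360665,1,50000,500000,0")
print(runs_stage2(stage1_only), runs_stage2(with_stage2))
```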

Now, with HT.

With two copies of Prime95, assign a lot of memory (m_1) to the first and less (m_2) to the second. Make sure that m_1+m_2=m.
The first runs the line

Pminus1=10223,2,9360665,1,50000,500000,0

(don't forget to put the unused save file back in the directory).
The second one runs stage 1 of another test:

Pminus1=10223,2,9360617,1,50000,50000,0

Time how long the slower of the two copies takes. If that is less than the time for a whole test, we have a gain.
If not, not.
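
The comparison above can be written down as a tiny sketch. The function name `ht_gain` and the timings in the example are invented for illustration; they are not measurements from anyone's machine.

```python
def ht_gain(t_stage1: float, t_stage2: float, t_ht_slower: float) -> float:
    """Ratio of the time for one whole test run alone (stage 1 + stage 2)
    to the time the slower of the two HT copies needs.
    A value above 1.0 means running the two copies together is a gain."""
    whole_test = t_stage1 + t_stage2
    return whole_test / t_ht_slower


# Hypothetical timings in minutes, made up for this example.
ratio = ht_gain(t_stage1=60.0, t_stage2=90.0, t_ht_slower=130.0)
print("gain" if ratio > 1.0 else "no gain", f"(ratio {ratio:.2f})")
```

The idea is that in the HT setup one stage 1 and one stage 2 finish in each period, which is one whole test's worth of work, so that period is compared against the stage 1 + stage 2 time measured without HT.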

I would have done this of course, but I have no HT machine.
I apologize for the long instructions, you guys are way more experienced than I am, but, well, you see, better to make sure.
See you, H.
Old 2005-06-07, 12:49   #11
garo
 
Aug 2002
Termonfeckin, IE

ACB_16 Posts

George's SSE2 FFT code (also used in P-1) is probably more fine-tuned than your NFS code and therefore will likely not give a comparable speed increase.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.