mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2003-10-16, 15:43   #1
TauCeti
 
TauCeti's Avatar
 
Mar 2003
Braunschweig, Germany

342₈ Posts
Default Doing TF and LL together on HT-System / Impact on TF-limit tradeoff calculation?

Hi!

I know that P4 hyperthreading was discussed earlier here in the forum, but today I played again with my HT P4 and got some strange results I cannot explain.

Setup:

Pentium 4 HT, Windows XP, Prime95 23.4

Test:

I ran two instances of Prime95 simultaneously. Instance 1 was LL-testing M176xxxxx and instance 2 was doing TF of M228xxxxx. I expected to see a slight throughput increase (compared to two identical machines, one doing LL and the other TF). But I expected that increase to be larger for TF _below_ 2^64, because above that the P4 should also use SSE2 for TF and compete with the SSE2 LL code for resources.

But my results show that using HT with TF from 2^65 up to 2^67 yields an even better performance gain.

Running two instances (LL and TF) without HT produced exactly double the single-thread times, so I omit that data here.

Here is the data for the HT setup:

a) TF up to 2^64 (checked for 2^63 and 2^64)

LL-single thread: 0.032s/iteration
TF-single thread: 10.6s/1000 iterations

LL-HT: 0.057s/iteration
TF-HT: 17.4s/1000 iterations

b) TF from 2^64 up to 2^67

LL-single thread: 0.032s/iteration
TF-single thread: 13 s/1000 iterations

LL-HT: 0.059s/iteration
TF-HT: 18.0s/1000 iterations

So, if you have two machines available (the same holds with only one machine) and you plan to do both LL and TF, then in case a) enabling HT and running two instances of Prime95 on both machines gains about 22% in TF and 12% in LL total throughput, compared to running one LL instance on one machine and one TF instance on the other.

So far so good. But what really puzzles me is case b). If you use HT with TF from 2^64 up, you gain an enormous 44% overall throughput increase for TF and 8% for LL in the scenario described.
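Those percentages can be sanity-checked from the timings: with HT enabled, each of the two machines runs both workloads, so compare twice the HT throughput against one dedicated machine. A quick hypothetical Python check (only the timings are taken from the tables above; the helper is illustrative):

```python
def gain(single_time, ht_time):
    # Two HT machines each running a workload vs. one dedicated machine:
    # throughput ratio is (2 / ht_time) / (1 / single_time).
    return 2 * single_time / ht_time - 1

# a) TF up to 2^64
print(f"LL: {gain(0.032, 0.057):+.0%}, TF: {gain(10.6, 17.4):+.0%}")  # +12%, +22%
# b) TF from 2^64 up to 2^67
print(f"LL: {gain(0.032, 0.059):+.0%}, TF: {gain(13.0, 18.0):+.0%}")  # +8%, +44%
```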

Prime95 uses TF (and P-1/ECM factoring) to eliminate unnecessary LL tests. AFAIK the factoring costs used to determine the TF tradeoff limit were calculated by timing factoring on non-HT machines. I suggest reevaluating that formula for users willing to do TF _and_ LL on HT-enabled P4 systems. Given the growing number of such configurations, the impact on overall GIMPS throughput (in terms of cleared exponents per timeframe) should be worth the effort. It would just be a second "Exponents up to.../Trial factored to" table in the client, used when the user chooses to do TF and LL on an HT system.
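For reference, the tradeoff in question is roughly: keep factoring to the next bit level while the expected cost is below the expected saving. A minimal sketch using the standard rules of thumb (a factor between 2^(b-1) and 2^b with probability about 1/b, and a found factor saving about two LL tests); the function and cost numbers are illustrative, not the actual formula from the Prime95 source:

```python
def worth_factoring(bit, tf_cost, ll_cost):
    # Probability of a factor between 2^(bit-1) and 2^bit is ~1/bit;
    # finding one saves ~2 LL tests (first test plus double-check).
    expected_saving = (2.0 / bit) * ll_cost
    return tf_cost < expected_saving

# Hypothetical costs in hours: halving the effective TF cost (as the
# HT numbers above suggest is possible) flips the marginal decision.
print(worth_factoring(67, tf_cost=3.0, ll_cost=100.0))  # False
print(worth_factoring(67, tf_cost=1.5, ll_cost=100.0))  # True
```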

One could argue that P4 systems should not be used at all for TF because the non-SSE2 Athlon gang is still doing work there, but taking into account the coming transition to Athlon64/Opteron (no HT) and Prescott, I would not subscribe to that reasoning.

Tau
Old 2003-10-16, 18:55   #2
GP2
 
GP2's Avatar
 
Sep 2003

2590₁₀ Posts
Default

That's very interesting, and I wonder if anyone else with a HT P4 CPU can confirm this.

There's only one drawback I can think of: running two threads puts more stress on a machine, which might put a borderline machine over the edge.

For instance, with a single thread you might be able to overclock a P4 2.4 GHz C to 3.0 GHz, whereas if you were running two threads that overclock might not be stable. So you'd have to overclock less, which would negate the advantage.

Still, this is something interesting to investigate, especially for those who don't overclock.
Old 2003-10-16, 23:27   #3
GP2
 
GP2's Avatar
 
Sep 2003

2·5·7·37 Posts
Default

So LL testing uses floating-point SSE2 and trial-factoring uses integer SSE2 and apparently you really can get greater overall throughput by running both simultaneously on one of the new hyperthreading P4 "C" CPUs.
Old 2003-10-17, 09:44   #4
TauCeti
 
TauCeti's Avatar
 
Mar 2003
Braunschweig, Germany

E2₁₆ Posts
Default

Interesting. So running TF together with LL on HT systems does indeed improve overall throughput.

Regarding overclocking, I do not see insurmountable problems running LL and TF together. Some data for my system's power usage:

Idle: 72W
TF: 112W
LL: 135W
LL+Fact HT: 135W

It may even be that with HT the heat is distributed better across the die (I don't know the layout of the SSE2 floating-point and integer units on the die), so _better_ overclocking might be possible running both LL and TF. But that is only speculation...

Tau
Old 2003-10-17, 10:09   #5
lycorn
 
lycorn's Avatar
 
Sep 2002
Oeiras, Portugal

1,459 Posts
Default

Your posts have in fact "awakened" some thoughts about TF that have been in the back of my mind for a while. Regardless of whether a machine has HT or not, the default TF bit limits (see the Math page on the GIMPS site) should have been adjusted for P4s from the moment SSE2 instructions came into play for TF (I believe that was with version 22.7). Since TF efficiency increased by a significant factor (3 to 4 times faster for bit levels above 64), the factoring cost used in the calculation dropped, probably making it sensible to raise the upper bit limit for P4s.
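Back of the envelope, that 3-4x SSE2 speedup translates directly into extra bit levels: each additional bit level roughly doubles the number of candidate factors to try, so a k-times speedup pays for about log2(k) more levels. A hypothetical illustration (the doubling-per-level assumption is a rule of thumb, not taken from the client):

```python
import math

for speedup in (3.0, 4.0):
    # If TF cost doubles per bit level, a k-times speedup buys log2(k) levels.
    extra_levels = math.log2(speedup)
    print(f"{speedup:.0f}x faster TF pays for ~{extra_levels:.1f} extra bit levels")
```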
Now TauCeti's observations on HT have added yet more valid points on this subject.
Maybe it is time for George to look into it... George, any ideas or plans?
Old 2003-10-18, 23:13   #6
nucleon
 
nucleon's Avatar
 
Mar 2003
Melbourne

5·103 Posts
Default

Tauceti,

I'm not seeing the same efficiency as you did. I was running a 20M LL and some 10M P-1 factoring.

Here are two completion times:


Combo of 20M LL and 10M P-1 factoring:
[Thu Oct 16 03:07:39 2003]
UID: S90106/CCD150C90, M10410563 completed P-1, B1=60000, B2=990000, WZ1: 0D719733
[Thu Oct 16 06:25:24 2003]

10M P-1 factoring on its own:
[Fri Oct 17 00:08:45 2003]
UID: S90106/CCD150C90, M10411699 completed P-1, B1=60000, B2=990000, WZ1: 0D60972D
[Fri Oct 17 01:17:52 2003]

My 20M LL iteration times on their own were 0.046s, but with the P-1 factoring running they hovered between 0.095 and 0.110s.
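From those numbers the net effect can be estimated: while sharing, each workload runs at a fraction of its dedicated speed, and the fractions should sum to more than 1.0 for HT to be a win. A rough hypothetical check using the log timestamps and iteration times above:

```python
# LL: 0.046 s/iter dedicated vs ~0.1025 s/iter (midpoint of 0.095-0.110) shared
ll_fraction = 0.046 / 0.1025
# P-1: 1h09m07s dedicated vs 3h17m45s shared (wall clock from the log)
p1_fraction = (1 * 3600 + 9 * 60 + 7) / (3 * 3600 + 17 * 60 + 45)
combined = ll_fraction + p1_fraction
print(f"{combined:.2f} machine-equivalents")  # ~0.80, i.e. a net loss
```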

The primary reason I stopped the 20M LL was the stability concerns (with two Prime95 instances) people had posted about tests on their own machines. I hadn't done any stability checking on this PC with two threads of Prime95.

A possible reason I'm not seeing the same efficiencies could be that my CPU throttled for thermal reasons. With only the 20M LL my CPU runs at 60-62 °C.

I'm spewing that I didn't get temperature readings while both threads were running. I'll leave that for another test once the current work is finished.

System:
P4 2.8/800HT running at stock speed
2x DDR333 in dual channel
Winxp SP1
Prime 23.7

-- Craig
Old 2003-10-18, 23:24   #7
garo
 
garo's Avatar
 
Aug 2002
Termonfeckin, IE

2764₁₀ Posts
Default

P-1 factoring is a completely different cup of tea. It involves calculations similar in nature to the LL test, so P-1 and LL will impede each other's progress.
Old 2003-10-18, 23:50   #8
GP2
 
GP2's Avatar
 
Sep 2003

2·5·7·37 Posts
Default

It's a pity that trial-factoring isn't much of a bottleneck for the project right now.

Perhaps eventually an integer version of LL testing might run in parallel with the current floating-point version? But that's pure blue-sky dreaming.

By the way, is hyper-threading really the wave of the future, or is it just a temporary fad and we'll be seeing multiple cores instead in the future?

In the meantime, it might be very interesting for a large group like Ars Technica (Team_Prime_Rib), with its hands in many different distributed computing projects, to try running two different clients in parallel (SETI@home and GIMPS, Distributed Folding and GIMPS, etc.) on the P4 "C" processors and see which combinations get the best total throughput.

And then find a fellow ARSian with a P4 "C" box working on another project, and have both boxes run both clients in parallel, with both projects benefitting overall.
Old 2003-10-19, 00:37   #9
nucleon
 
nucleon's Avatar
 
Mar 2003
Melbourne

5·103 Posts
Default

IMO, HT is just the start of the next evolutionary thread - and yes, I agree multiple cores are the next step.

Like pipelining, it's a technique to get more power out of a given die. We started with a few pipeline stages, and now we have the current pipeline behemoths like the P4.

HT/multiple cores only started to make sense for the consumer market in the last few years. In the past, people ran only one app. Now we have a mail client retrieving mail in the background, instant messaging clients, etc., all needing their CPU slice to do their work without interrupting the primary task.

According to Endian.net, we'll see mammoth CPUs with up to 8 cores per die in the near future. That's some insane power. Intel is also working on CPUs that can run different OSes at once within the same CPU.

What I find with HT is that the PC feels more responsive. I think I/O is handled better with an HT CPU. I noticed that with the LL and P-1 tests running at the same time, the PC "felt" slow; it didn't respond as quickly to my keyboard and mouse clicks.

-- Craig
Old 2003-10-19, 01:54   #10
Complex33
 
Complex33's Avatar
 
Aug 2002
Texas

9B₁₆ Posts
Default

I really like my HT box because I can set Prime95's priority to 10 and still use the machine, yet get lower iteration times when the computer is not in use, compared to running at a lower priority.
Old 2003-10-19, 02:12   #11
PageFault
 
PageFault's Avatar
 
Aug 2002
Dawn of the Dead

11101011₂ Posts
Default

I'm one of those and have run at least ten different projects over the years, mainly in gauntlets. I have an HT CPU but chose Win2k for the install. With my next 2.4C I'll try that disease of an OS to play with HT and various clients.

Crunching SETI yields a huge gain - at 3 GHz, two clients produce about 20 WU per day. As a result, the AMD trollboyz are very disappointed. Those who have crunched SETI know how impressive that is. F@H is reputed to show similar gains. It seems that anything with inefficient code gains from running multiple instances.

Quote:
Originally posted by GP2

In the meantime, it might be very interesting for a large group like Ars Technica (Team_Prime_Rib), with its hands in many different distributed computing projects, to try running two different clients in parallel (SETI@home and GIMPS, Distributed Folding and GIMPS, etc.) on the P4 "C" processors and see which combinations get the best total throughput.

And then find a fellow ARSian with a P4 "C" box working on another project, and have both boxes run both clients in parallel, with both projects benefitting overall.
