#1 |
Mar 2003
Braunschweig, Germany
2·113 Posts |
Hi!
I know that P4 hyperthreading was discussed earlier here in the forum, but today I played again with my HT P4 and got some strange results I cannot explain.

Setup: Pentium 4 HT, Win XP, Prime95 23.4

Test: I ran two instances of Prime95 simultaneously. Instance 1 was LL-testing M176xxxxx and instance 2 was doing TF of M228XXXXX.

I expected to see a slight throughput increase (compared to two identical machines with one doing LL and the other TF). But I expected that increase to be larger for TF _below_ 2^64, because above that the P4 should also use SSE2 for TF and compete with the SSE2 LL code for resources. Instead, my results show that using HT with TF from 2^65 up to 2^67 yields an even better performance gain.

Running two instances (LL and TF) without HT produced exactly double the single-thread times, so I omit that data here. Here is the data for the HT setup:

a) TF up to 2^64 (checked for 2^63 and 2^64)
LL single thread: 0.032 s/iteration
TF single thread: 10.6 s/1000 iterations
LL with HT: 0.057 s/iteration
TF with HT: 17.4 s/1000 iterations

b) TF from 2^64 up to 2^67
LL single thread: 0.032 s/iteration
TF single thread: 13 s/1000 iterations
LL with HT: 0.059 s/iteration
TF with HT: 18.0 s/1000 iterations

So, if you have two machines available (the same of course holds with only one machine) and you plan to do both LL and TF, then in case a) enabling HT and running two instances of Prime95 on both machines gains about 22% TF and 12% LL in total throughput, compared to running one LL instance on one machine and one TF instance on the other.

So far so good. But what really puzzles me is case b). If you use HT with TF from 2^64 up, you gain an enormous 44% overall throughput increase for TF and 8% for LL in the scenario described.

Prime95 uses TF (and ECM-TF) to eliminate unnecessary LL tests. AFAIK the factoring costs used to determine the TF tradeoff limit have so far been calculated by timing the factoring on non-HT machines. I suggest re-evaluating that formula for users willing to do TF _and_ LL on HT-enabled P4 systems. Given the increasing number of such configurations, the impact on overall GIMPS throughput (in terms of cleared numbers per timeframe) should be worth the effort. It would just take a second "Exponents up to.../Trial factored to" table in the client, used if the user chooses to do TF and LL on his HT system.

One could argue that P4 systems should not be used for TF at all because the non-SSE2 Athlon gang is still doing work there, but if we take into account the coming transition to Athlon64/Opteron (no HT) and Prescott, I would not subscribe to that reasoning.

Tau
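For anyone who wants to check the percentages in the post above, here is a minimal Python sketch of the arithmetic. It assumes the comparison Tau describes: two machines each running LL and TF together under HT, versus two machines with one dedicated to LL and the other to TF. The function name `ht_gain` is made up for this illustration.

```python
def ht_gain(single_time, ht_time):
    """Relative throughput gain for one work type (LL or TF).

    Baseline: one dedicated machine does this work at 1/single_time.
    HT setup: both machines do it, each at 1/ht_time, so 2/ht_time in total.
    """
    return 2 * single_time / ht_time - 1


# Timings copied from the post: (single-thread time, time under HT).
cases = {
    "a": {"LL": (0.032, 0.057), "TF": (10.6, 17.4)},
    "b": {"LL": (0.032, 0.059), "TF": (13.0, 18.0)},
}

for name, jobs in cases.items():
    for job, (single, ht) in jobs.items():
        print(f"case {name}) {job}: {ht_gain(single, ht):+.0%}")
```

This reproduces the quoted figures: about 12% LL and 22% TF for case a), and about 8% LL and 44% TF for case b).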
#2 |
Sep 2003
A1D₁₆ Posts |
That's very interesting, and I wonder if anyone else with a HT P4 CPU can confirm this.
There's only one drawback I can think of: running two threads puts more stress on a machine, which might put a borderline machine over the edge. For instance, with a single thread you might be able to overclock a P4 2.4 GHz C to 3.0 GHz, whereas if you were running two threads that overclock might not be stable. So you'd have to overclock less, which would negate the advantage. Still, this is something interesting to investigate, especially for those who don't overclock. |
#3 |
Sep 2003
3·863 Posts |
So LL testing uses floating-point SSE2 and trial-factoring uses integer SSE2 and apparently you really can get greater overall throughput by running both simultaneously on one of the new hyperthreading P4 "C" CPUs.
#4 |
Mar 2003
Braunschweig, Germany
2×113 Posts |
Interesting. So using TF together with LL on HT-Systems will indeed improve overall throughput.
Regarding overclocking, I do not see insurmountable problems running LL and TF together. Some data for my system concerning power usage:

Idle: 72 W
TF: 112 W
LL: 135 W
LL+Fact (HT): 135 W

It may even be that with HT the heat is better distributed on the die (I don't know the layout of the SSE2 float and integer regions on the die) and _better_ overclocking is possible running both LL and TF. But that is only speculation...

Tau
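As a rough back-of-the-envelope check, the power readings above can be turned into a comparison for the two-machine scenario from the first post. This is only a sketch: it assumes both machines draw the same power as Tau's system, and the variable names are invented for the example.

```python
# Power readings from the post, in watts.
POWER_W = {"idle": 72, "TF": 112, "LL": 135, "LL+TF (HT)": 135}

# Two dedicated machines: one runs LL only, the other TF only.
dedicated_w = POWER_W["LL"] + POWER_W["TF"]

# Two HT machines: each runs LL and TF together.
ht_pair_w = 2 * POWER_W["LL+TF (HT)"]

extra = ht_pair_w / dedicated_w - 1
print(f"dedicated pair: {dedicated_w} W, HT pair: {ht_pair_w} W ({extra:+.1%})")
```

Under these assumptions the HT pair draws roughly 9% more power, which can be set against the throughput gains reported earlier in the thread.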
#5 |
"GIMFS"
Sep 2002
Oeiras, Portugal
2·5·157 Posts |
Your posts have in fact "awakened" some thoughts about TF that have been in the back of my mind for a while. Regardless of whether a machine is HT or not, the default TF bit limits (see the Math page on the GIMPS site) should have been adjusted for P4s from the moment SSE2 instructions came into play for TF (I believe it was with version 22.7). As TF efficiency has increased by a significant factor (3 to 4 times faster for bit levels above 64), the factoring cost used in the calculation has dropped, so it probably makes sense to raise the upper bit limit for P4s.

Now, TauCeti's observations on HT have added yet more valid points on this subject. Maybe it is time for George to look into it... George, any ideas, or plans?
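To illustrate why a 3 to 4 times faster TF would move the bit limit, here is a hedged Python sketch of a break-even rule. It is not Prime95's actual formula. It assumes that each bit level costs twice as much as the previous one, that the chance of a factor between 2^b and 2^(b+1) is roughly 1/b, and that a found factor saves two LL tests (first test plus double-check). The cost numbers are invented units chosen only for the illustration, and the function name is hypothetical.

```python
def deepest_worthwhile_bit_level(ll_cost, tf_cost_at_64, speedup=1.0,
                                 tests_saved=2.0, start=64, stop=80):
    """Return the deepest bit level whose TF cost is below the expected LL saving.

    ll_cost       -- cost of one LL test (arbitrary units, assumed)
    tf_cost_at_64 -- cost of searching 2^64..2^65 before any speedup (assumed)
    speedup       -- how many times faster TF has become
    """
    deepest = None
    for bits in range(start, stop + 1):
        # Work doubles with every extra bit level; the speedup divides it.
        tf_cost = tf_cost_at_64 * 2 ** (bits - 64) / speedup
        # Heuristic chance of a factor at this level is about 1/bits.
        expected_saving = tests_saved * ll_cost / bits
        if tf_cost >= expected_saving:
            break
        deepest = bits
    return deepest


# Invented units: one LL test costs 200, the 2^64 level costs 1.
for s in (1, 3, 4):
    print(f"speedup {s}x -> factor up to level",
          deepest_worthwhile_bit_level(ll_cost=200, tf_cost_at_64=1, speedup=s))
```

With these made-up costs the break-even level moves from 66 to 68 for either a 3x or a 4x speedup. Since the work doubles per level, a speedup of s shifts the break-even by roughly log2(s) levels, so one to two extra bit levels in this case.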
#6 |
Mar 2003
Melbourne
5·103 Posts |
Tauceti,
I'm not seeing the same efficiency as you were. I was running a 20M LL and some 10M P-1 factoring. Here are two completion times:

Combo of 20M LL and 10M P-1 factoring:
[Thu Oct 16 03:07:39 2003] UID: S90106/CCD150C90, M10410563 completed P-1, B1=60000, B2=990000, WZ1: 0D719733
[Thu Oct 16 06:25:24 2003]

10M P-1 factoring on its own:
[Fri Oct 17 00:08:45 2003] UID: S90106/CCD150C90, M10411699 completed P-1, B1=60000, B2=990000, WZ1: 0D60972D
[Fri Oct 17 01:17:52 2003]

My 20M LL iteration times on its own were 0.046, but with the P-1 factoring they hovered between 0.095 and 0.110.

The primary reason I stopped the 20M LL was the concerns about stability (of two Prime95 instances) that people had posted after tests on their own machines. I hadn't done any stability checking on this PC with two threads of Prime95.

A possible reason why I'm not seeing the same efficiencies is that my CPU throttled due to thermal considerations. With only the 20M LL my CPU runs at 60-62 degC. I'm spewing I didn't get the temp readings while both threads were running. I'll leave that for another test once the current work is finished.

System:
P4 2.8/800HT running at stock speed
2x DDR333 in dual channel
WinXP SP1
Prime 23.7

-- Craig
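The timestamps and iteration times in this post can be turned into a combined-throughput figure. The sketch below assumes that the gap between each pair of quoted timestamps is the duration of one P-1 run, which is how the post presents them. Running both jobs together only pays off if the two relative speeds add up to more than 1.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"  # matches the bracketed timestamps in the log


def elapsed_seconds(start, end):
    """Seconds between two timestamps written in the log's format."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


p1_combo = elapsed_seconds("Thu Oct 16 03:07:39 2003", "Thu Oct 16 06:25:24 2003")
p1_alone = elapsed_seconds("Fri Oct 17 00:08:45 2003", "Fri Oct 17 01:17:52 2003")
p1_rate = p1_alone / p1_combo  # P-1 speed under HT relative to running alone

LL_ALONE = 0.046  # LL iteration time on its own, from the post
combined = {}
for ll_combo in (0.095, 0.110):
    ll_rate = LL_ALONE / ll_combo  # LL speed under HT relative to running alone
    combined[ll_combo] = p1_rate + ll_rate
    print(f"LL at {ll_combo}: P-1 {p1_rate:.2f} + LL {ll_rate:.2f} "
          f"= {combined[ll_combo]:.2f} of one machine's work")
```

The P-1 run took about 2.9 times as long next to the LL test, and the sum of the two relative speeds comes out at roughly 0.77 to 0.83. On these figures the LL plus P-1 combination did less total work than running the two jobs one after the other.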
#7 |
Aug 2002
Termonfeckin, IE
2⁴×173 Posts |
P-1 factoring is a completely different cup of tea. It involves calculations that are similar in nature to the LL test, so P-1 and LL will impede each other's progress.
#8 |
Sep 2003
3·863 Posts |
It's a pity that trial-factoring isn't much of a bottleneck for the project right now.
Perhaps eventually an integer version of LL testing might run in parallel with the current floating-point version? But that's pure blue-sky dreaming.

By the way, is hyper-threading really the wave of the future, or is it just a temporary fad, with multiple cores taking over instead?

In the meantime, it might be very interesting for a large group like Ars Technica (Team_Prime_Rib), with its hands in many different distributed computing projects, to try running two different clients in parallel (SETI@Home and GIMPS, Distributed Folding and GIMPS, etc.) on the P4 "C" processors and see which ones get the best total throughput. And then find a fellow ARSian with a P4 "C" box working on another project, and have both boxes run both clients in parallel, with both projects benefitting overall.
#9 |
Mar 2003
Melbourne
515₁₀ Posts |
IMO, HT is just the start of the current evolutionary thread - and yes, I agree multiple cores are the next step.

Like pipelining, it's a technique to get more power out of a given die. We started with a few pipeline stages, and now we have the current pipeline behemoths of the P4 etc.

HT/multiple cores have only made sense for the consumer market in the last few years. In the past people only ran one app. Now we have a mail client retrieving mail in the background, instant messaging clients etc., all needing their CPU slice to do their work without interrupting the primary task.

According to Endian.net, we'll see mammoth CPUs with up to 8 cores per die in the near future. That's some insane power. Intel is also working on CPUs that can run different OSes at once within the same CPU.

What I find with HT is that the PC feels more responsive. I think that I/O is handled better with an HT CPU. I noticed that with the LL and P-1 tests running at the same time, the PC "felt" slow. It didn't respond as quickly to my keyboard/mouse clicks.

-- Craig
#10 |
Aug 2002
Texas
5·31 Posts |
I really like my HT box because I can set P95's priority to 10 and still use the machine, yet get lower iteration times than with a lower priority when the computer is not in use.
#11 | |
Aug 2002
Dawn of the Dead
5·47 Posts |
I'm one of those and have run at least ten different projects over the years, mainly in gauntlets. I have a HT cpu but I chose w2k for the install. Next 2.4C I'll try that disease of an OS to play with HT and various clients.
Crunching seti results in a huge gain - at 3 GHz, two clients produce about 20 WU per day. As a result, the AMD trollboyz are very disappointed. Those who have crunched seti know how impressive that is. f@h is reputed to have similar gains. It appears that anything with inefficient code gains from multiple deployment. Quote: