2007-03-07, 02:52 #1 Prime95 P90 years forever! Aug 2002 Yeehaw, FL 3·11·239 Posts
Hyperthread v25.2 benchmarks
I need a volunteer or two to do some advanced hyperthreading benchmarks under Windows. You'll need version 25.2 at ftp://mersenne.org/gimps/p95v252.zip. DO NOT OVERWRITE YOUR CURRENT PRIME95 - put this test software in its own folder.
The question I'm trying to answer is this: Should the default setting for prime95 on a hyperthreaded CPU be to run one LL test using two threads? Or do you get more throughput by running 2 LL tests?
We can try to answer this question by using Advanced/Benchmark, but I think we'll need to actually time throughput by running several minutes of one LL test using 2 threads and several minutes of two independent LL tests.
Last fiddled with by Prime95 on 2007-03-07 at 02:53
2007-03-07, 07:14 #2 E_tron Sep 2002 Austin, TX 1061₈ Posts
So, this calls for volunteers possessing:
Intel Pentium 4 with Hyper-Threading technology
Intel Xeon (Pentium 4 based) with Hyper-Threading technology
Any others?
2007-03-07, 12:38 #3 Andi47 Oct 2004 Austria 2·17·73 Posts
I will download the file as soon as my internet connection at home is working again (hopefully this evening). I could do some benchmarks on a hyperthreaded P4. Are there particular numbers or FFT sizes to run for a few minutes?
Do you also need some benchmarks on a dual-processor system? I could do some on my Core 2 Duo laptop in one or two weeks. (Currently it is running a huge P-1 stage 2 with GMP-ECM, so I don't want to interrupt it.)
Edit: Is 25.2 mature enough to run a few doublechecks on small Mersennes (for testing the software)?
Last fiddled with by Andi47 on 2007-03-07 at 12:41
2007-03-07, 14:43 #4 Prime95 P90 years forever! Aug 2002 Yeehaw, FL 1111011001111₂ Posts
Yes, a hyper-threaded P4 is required. A variety of L2 cache sizes would be nice as that could affect the answer.
I do not need any dual-processor benchmarks. I own a dual-core P4. As expected you get more throughput by running 2 independent LL tests.
After running a simple benchmark, I think running throughput timings on the 1M and 2M FFT sizes should be sufficient.
25.2 ought to work on a double-check. It has not been thoroughly QA'ed. Communication with the server has been disabled so you'd have to email any results.
2007-03-07, 19:53 #5 Ethan Hansen Oct 2005 2³×5 Posts
I tested 1M and 2M FFT sizes on two processors. Both CPUs are 3GHz, making the comparisons easier. The first system uses a desktop P4 processor with an 8K L1 cache and 512K L2. The second is a Xeon with 16K L1 and 2M L2. The systems have roughly equivalent memory: 2G DDR-400 for the P4, 3G DDR2-400 for the Xeon.
Notation conventions: F1/T1 = 1M FFT, 1 thread; F2/T2 = 2M FFT, 2 threads (two simultaneous exponents tested); etc. Timings are averages in seconds for 10K LL iterations (reported every 500 iterations, average value used), with the test repeated twice.
Timings:
Code:
Processor   F1/T1   F1/T2   F2/T1   F2/T2
----------------------------------------------
P4          0.031   0.063   0.068   0.150
Xeon        0.036   0.076   0.066   0.132
My guess is that the larger L2 cache line size aids the P4 system with smaller exponents. Running two exponents in parallel had minimal effect on the timings at 1M FFT sizes for either processor. The Xeon system showed no degradation in throughput for the 2M FFT size as well.
2007-03-07, 21:43 #6 Prime95 P90 years forever! Aug 2002 Yeehaw, FL 3×11×239 Posts
Ethan, if I interpret your numbers correctly, you get absolutely no benefit from hyperthreading by running two independent LL tests.
Can you now try v25.2 (get the one I just made today)? Go into the Test/Worker Threads dialog box and select 1 worker thread using 2 threads per LL test. Thanks.
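For anyone who wants to check that reading of Ethan's table in post #5, here is a minimal Python sketch. It assumes each T2 figure is the time per iteration of each of the two simultaneous tests, so the combined rate is 2 / time; that interpretation and the helper name `throughput_ratio` are illustrative assumptions, not anything taken from Prime95.

```python
# Per-iteration timings in seconds, copied from the table in post #5.
# Each pair is (one test alone, each of two simultaneous tests).
# Assumption: with two tests, combined throughput is 2 / time.
timings = {
    "P4":   {"F1": (0.031, 0.063), "F2": (0.068, 0.150)},
    "Xeon": {"F1": (0.036, 0.076), "F2": (0.066, 0.132)},
}

def throughput_ratio(one_test: float, two_tests: float) -> float:
    """Combined iterations/sec of two tests divided by that of one test."""
    return (2.0 / two_tests) / (1.0 / one_test)

ratios = {
    (cpu, fft): throughput_ratio(*pair)
    for cpu, sizes in timings.items()
    for fft, pair in sizes.items()
}

for (cpu, fft), ratio in ratios.items():
    # A ratio above 1.0 would mean two independent tests give more throughput.
    print(f"{cpu} {fft}: {ratio:.3f}")
```

Under that assumption no ratio comes out above 1.0, which is consistent with seeing no benefit from running two independent LL tests on these two machines.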
2007-03-07, 21:50 #7 E_tron Sep 2002 Austin, TX 231₁₆ Posts
There's one more chip capable of Hyper-Threading: the dual-core Pentium Extreme Edition, codenamed Smithfield. It's basically 2 Prescott Pentium 4s with HT enabled.
2007-03-07, 23:07 #8 Ethan Hansen Oct 2005 40₁₀ Posts
George,
Sorry, I'm not following you. The Test/Worker Threads dialog has an entry for Number of Worker Threads, but I do not see an option to set the number of threads per LL test. I'm running the P95 timestamped March 7th, 16:30.
E_tron: There were two EE versions that supported hyperthreading. The first used the Smithfield core (EE 840), while the second used Presler (EE 955/965). The difference between these and the standard processors, aside from a hefty price premium, was the unlocked multiplier and hyperthreading being enabled. HT only worked on a very few motherboards; the 840 required the 955X chipset, while the 955/965 only functioned with the i975X. Not all motherboards sporting these chipsets allowed HT to be used. To the 37 or so people who actually purchased these mini space heaters: my condolences.
2007-03-08, 01:15 #9 ATH Einyen Dec 2003 Denmark 2⁴×3²×23 Posts
Here is my CPU info from Prime95 and CPU-Z: cpu.jpg
I tested an LL test at 36M:
2 LL tests at once, 1 on each thread: 2LL.jpg
1 LL test on both threads: 1LL.jpg
It's almost the same. In 1 hour you would get 3600/0.1137 + 3600/0.112 = 63,805 total iterations running 2 LLs, and 3600/0.0569 = 63,269 iterations on the single LL, but that's only 0.84% more and may just be minor fluctuations.
Let me know if you want more or longer tests; I can do them tomorrow, Thursday.
Edit: Oops, I just saw I never changed available memory above 8 MB, but that shouldn't affect it, right? It did use more than 8 MB anyway.
Btw, for the 2 LL test it didn't save the worktodo properly:
AdvancedTest=36000109
AdvancedTest=36000199
[Thread #2]
So when I restarted it, worker thread 2 did not have any work.
Last fiddled with by ATH on 2007-03-08 at 01:20
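The hourly-throughput arithmetic in ATH's post can be reproduced with a short Python sketch. The per-iteration times are the ones quoted in the post; the function and variable names are made up for illustration, and the percentage is taken relative to the two-test total, which is an assumption about how the 0.84% figure was computed.

```python
SECONDS_PER_HOUR = 3600

def iterations_per_hour(seconds_per_iteration: float) -> float:
    """Iterations completed in one hour at a fixed per-iteration time."""
    return SECONDS_PER_HOUR / seconds_per_iteration

# Two independent LL tests, one per logical CPU (times from the post).
two_ll = iterations_per_hour(0.1137) + iterations_per_hour(0.112)

# One LL test using both logical CPUs.
one_ll = iterations_per_hour(0.0569)

# Advantage of the two-test setup, as a percentage of the two-test total
# (assumed basis for the percentage quoted in the post).
gain_pct = (two_ll - one_ll) / two_ll * 100

print(f"2 LL tests: {two_ll:,.0f} iterations/hour")
print(f"1 LL test:  {one_ll:,.0f} iterations/hour")
print(f"difference: {gain_pct:.2f}%")
```

A gap this small is within what run-to-run noise could plausibly produce, so longer timing runs would be needed before reading anything into it.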
Quote:
 Originally Posted by Ethan Hansen Sorry, I'm not following you. The Test/Worker Threads dialog has an entry for Number of Worker Threads, but I do not see an option to set the number of threads per LL test.
It is "Number of CPUs to use". Perhaps I could word that better.

Quote:
 Originally Posted by ATH It's almost the same. In 1 hour you would get 3600/0.1137 + 3600/0.112 = 63,805 total iterations running 2 LLs, and 3600/0.0569 = 63,269 iterations on the single LL, but that's only 0.84% more and may just be minor fluctuations. Let me know if you want more or longer tests; I can do them tomorrow, Thursday.
The memory setting is irrelevant. Can you also do 1 LL test with 1 cpu per test? Thanks.

So far, from the data I have:

A P4 with 512K L2 cache:
1M FFTs and smaller are 2-3% faster running 1 LL test with 2 threads
FFTs larger than 1M are slower.

Running 2 threads increases the number of instructions scheduled on the ALU/FPU but also increases the pressure on the L1 and L2 caches.

