mersenneforum.org  

2008-12-31, 18:46   #1
__HRB__
Dec 2008
Boycotting the Soapbox
2^4×3^2×5 Posts
Faster Lucas-Lehmer test using integer arithmetic?

I *think* there is a possibility of speeding up LL-tests by a factor of 2-4 using pure integer arithmetic and a hybrid of FFTs and number-theoretic transforms.

To see whether this is possible, I need to know how much time the processor spends in the innermost loops, so I've attached a small program that measures the clocks taken for arithmetic on two different (and incompatible) types of data structures.

Using 64-bit general purpose registers is about 40% faster on an Athlon64, but I have no idea how other processors (Core, Core 2, i2c, Phenom) perform. I suspect that the SSE2 version will be faster on these, which would have the bonus of not needing a 64-bit operating system.

Anybody who can compile and run programs can help me by doing the following:

1. Download the attached "speedtest.cpp.bz2" and bunzip
2. compile with "g++ -O3 -o speedtest speedtest.cpp"
3. run several times (10x or so), record the fastest results, and post them in this thread like this:

(Processor: Athlon64)
Clocks/Element using 64-bit GPRs: 1.5332
Clocks/Element using SSE2: 2.5415

Thank you!

P.S. even if it is possible to tweak the assembly routines to get out another 10%, part of the plan is to tap that 10% later on by squeezing in instructions to prefetch other data into the caches.
Attached Files
File Type: bz2 speedtest.cpp.bz2 (1.3 KB, 182 views)

2008-12-31, 22:55   #2
nuggetprime
Mar 2007
Austria
2×151 Posts

On my Q6600 i get(turned -O3 on during compilation):
Clocks/Element using 64-bit GPRs: 3.14648
Clocks/Element using SSE2: 2.87842

Is that worse than the athlon?

2008-12-31, 23:09   #3
starrynte
Oct 2008
California
2^2×59 Posts

oh, you have to have linux...
(i tried compiling in dev-c++, but it gave me errors)

2009-01-01, 00:11   #4
__HRB__
Dec 2008
Boycotting the Soapbox
2^4×3^2×5 Posts
Interpreting 1st results

Quote:
Originally Posted by nuggetprime
On my Q6600 i get(turned -O3 on during compilation):
Clocks/Element using 64-bit GPRs: 3.14648
Clocks/Element using SSE2: 2.87842

Is that worse than the athlon?

Thanks for testing. Are those the lowest clock-counts of several runs? If so, I'm surprised, because Core 2 totally wiped the floor with Athlons on other algorithms I've implemented.

Just in case I've been cryptic about how to interpret the results: The lower the "Clocks/Element", the better.

I figured the GPR version would be slower on Intel processors, because the add with carry (adc) instruction has a latency of 2 cycles with a throughput of 1 'adc' per clock. On Athlons 'adc' has a latency of only 1 cycle with a throughput of 3 per clock (theoretically at least; in practice every 'adc' depends on the state of the carry flag and itself modifies the carry flag, so only one 'adc' can actually be done per clock). No surprise that Athlons are at least twice as fast here.

The SSE2 performance is a disappointment. Core 2 has 128-bit SSE units, so I had hoped that they would be twice as fast as Athlons, which process SSE instructions in two 64-bit halves.

Probably the latency is again higher here for Core 2. Currently the SSE-loop processes four 64-bit integers in parallel (in 2 registers). Let's see what happens when arithmetic is done on eight values (in 4 registers) in parallel. I'll upload a version that does that today or tomorrow.

The bottleneck for SSE on Athlon64s is that they can only do 1.5 SSE instructions per clock. The inner SSE2 loop has 12 instructions to process 4 elements, so the optimum would be 2 clocks/element.
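
Spelled out with the figures from the paragraph above, the bound works out as follows:

```latex
\frac{12\ \text{instructions}}{1.5\ \text{instructions/clock}} = 8\ \text{clocks per loop iteration},
\qquad
\frac{8\ \text{clocks}}{4\ \text{elements}} = 2\ \text{clocks/element}
```

The 2.5415 clocks/element measured for SSE2 on the Athlon64 in the first post is therefore about 27% above this bound.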

My prediction is that Phenoms will do the SSE2 version in 1.27 clocks (i.e. faster than the GPR version), because AMD simply doubled everything SSE related.

Last fiddled with by __HRB__ on 2009-01-01 at 00:32 Reason: Tried to improve clarity

2009-01-01, 00:22   #5
__HRB__
Dec 2008
Boycotting the Soapbox
2^4×3^2×5 Posts
Compiling with dev-c++ fails

Quote:
Originally Posted by starrynte
oh, you have to have linux...
(i tried compiling in dev-c++, but it gave me errors)

I didn't use any Linux-only features, but I only tested it with g++-4.2. If dev-c++ uses g++-3.something, that might be the source of the issues.

What are the errors?

2009-01-01, 01:17   #6
Cruelty
May 2005
11001010010_2 Posts

Output from Q9450 @ 3.2GHz
Code:
Clocks/Element using 64-bit GPRs: 2.07031
Clocks/Element using 64-bit GPRs: 2.07812
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03906
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using SSE2: 1.98242
Clocks/Element using SSE2: 1.86523
Clocks/Element using SSE2: 1.84961
Clocks/Element using SSE2: 1.84961
Clocks/Element using SSE2: 1.84961
Clocks/Element using SSE2: 1.85156
Clocks/Element using SSE2: 1.86133
Clocks/Element using SSE2: 1.84961

Last fiddled with by Cruelty on 2009-01-01 at 01:20

2009-01-01, 01:34   #7
eugene2x
Dec 2008
LA, CA
101_2 Posts

Code:
/home/euser/Desktop/speedtest.cpp:47: error: ‘__m128i’ does not name a type
/home/euser/Desktop/speedtest.cpp: In member function ‘void SSE2::operator+=(SSE2&)’:
/home/euser/Desktop/speedtest.cpp:51: error: ‘__m128i’ was not declared in this scope
/home/euser/Desktop/speedtest.cpp:51: error: expected `;' before ‘discard’
/home/euser/Desktop/speedtest.cpp:76: error: ‘discard’ was not declared in this scope
/home/euser/Desktop/speedtest.cpp:77: error: ‘X’ was not declared in this scope
/home/euser/Desktop/speedtest.cpp:77: error: ‘struct SSE2’ has no member named ‘X’
/home/euser/Desktop/speedtest.cpp:78: error: lvalue required in asm statement
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 0
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 1
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 2
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 3
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 4
Tried a compile under Ubuntu 8.04, and this was the result. Odd.

2009-01-01, 01:47   #8
__HRB__
Dec 2008
Boycotting the Soapbox
2^4×3^2×5 Posts
Updated test of inner loops

While I was modifying the code to do 4-way and 8-way SSE2 I found an embarrassing copy&paste error which might have been responsible for the weak performance on Core 2. Performance is unchanged on Athlon64.

Please try the attached code for the new 'speedtest', which doesn't have the mistake in the 4-way SSE code and includes modified code for 8-way parallelism.

Sorry about the goof.

1. download & bunzip2
2. g++ -O3 -o speedtest speedtest-v2.cpp
3. run several times and post the lowest Clocks/Element for GPR, SSE2 (4-way) and SSE2 (8-way)

P.S. What do I have to do to get rid of the original attachment in the first post?
P.P.S. 1.85 Clocks on the Q9450 looks promising...
Attached Files
File Type: bz2 speedtest-v2.cpp.bz2 (1.5 KB, 162 views)

Last fiddled with by __HRB__ on 2009-01-01 at 01:59

2009-01-01, 02:05   #9
Cruelty
May 2005
11001010010_2 Posts

Voila:
Code:
Speedtest 0.2
Clocks/Element using 64-bit GPRs (1-Way): 2.0625
Clocks/Element using 64-bit GPRs (1-Way): 2.07031
Clocks/Element using 64-bit GPRs (1-Way): 2.03125
Clocks/Element using 64-bit GPRs (1-Way): 2.03906
Clocks/Element using 64-bit GPRs (1-Way): 2.03125
Clocks/Element using 64-bit GPRs (1-Way): 2.02344
Clocks/Element using 64-bit GPRs (1-Way): 2.03125
Clocks/Element using 64-bit GPRs (1-Way): 2.03125

Clocks/Element using SSE2 (4-Way): 1.74805
Clocks/Element using SSE2 (4-Way): 1.55859
Clocks/Element using SSE2 (4-Way): 1.5625
Clocks/Element using SSE2 (4-Way): 1.54883
Clocks/Element using SSE2 (4-Way): 1.55078
Clocks/Element using SSE2 (4-Way): 1.55664
Clocks/Element using SSE2 (4-Way): 1.55078
Clocks/Element using SSE2 (4-Way): 1.55664

Clocks/Element using SSE2 (8-Way): 1.69238
Clocks/Element using SSE2 (8-Way): 1.52539
Clocks/Element using SSE2 (8-Way): 1.52246
Clocks/Element using SSE2 (8-Way): 1.54492
Clocks/Element using SSE2 (8-Way): 1.52051
Clocks/Element using SSE2 (8-Way): 1.51953
Clocks/Element using SSE2 (8-Way): 1.54102
Clocks/Element using SSE2 (8-Way): 1.53027

2009-01-01, 02:08   #10
eugene2x
Dec 2008
LA, CA
101_2 Posts

Still not working on Core 2 Duo... Is it because you designed it for quad cores?

2009-01-01, 02:12   #11
starrynte
Oct 2008
California
11101100_2 Posts

Quote:
Originally Posted by __HRB__
What are the errors?

Line 47:
'__m128i' does not name a type
In member function 'void SSE2_4::operator+=(SSE2_4&)':

Line 51:
'__m128i' undeclared (first use this function)
(Each undeclared identifier is reported only once for each function it appears in)

Line 51:
expected ';' before "discard"

Line 76:
'discard' undeclared (first use this function)

Line 77:
'X' undeclared (first use this function)

Line 77:
'struct SSE2_4' has no member named 'X'

Line 77:
At global scope:

Line 85:
'__m128i' does not name a type
In member function 'void SSE2_8::operator+=(SSE2_8&)':

Line 89:
'__m128i' undeclared (first use this function)

Line 89:
expected ';' before "discard"

Line 127:
'discard' undeclared (first use this function)

Line 129:
'X' undeclared (first use this function)

Line 129:
'struct SSE2_8' has no member named 'X'

(here is the compiler output:
Code:
Compiler: Default compiler
Executing  g++.exe...
g++.exe "D:\Document\My Documents\speedtest-v2.cpp" -o "D:\Document\My Documents\speedtest-v2.exe"   -O3 -o  -I"C:\Dev-Cpp\lib\gcc\mingw32\3.4.2\include"  -I"C:\Dev-Cpp\include\c++\3.4.2\backward"  -I"C:\Dev-Cpp\include\c++\3.4.2\mingw32"  -I"C:\Dev-Cpp\include\c++\3.4.2"  -I"C:\Dev-Cpp\include"   -L"C:\Dev-Cpp\lib" 
D:\Document\My Documents\speedtest-v2.cpp:47: error: `__m128i' does not name a type
D:\Document\My Documents\speedtest-v2.cpp: In member function `void SSE2_4::operator+=(SSE2_4&)':
D:\Document\My Documents\speedtest-v2.cpp:51: error: `__m128i' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:51: error: (Each undeclared identifier is reported only once for each function it appears in.)
D:\Document\My Documents\speedtest-v2.cpp:51: error: expected `;' before "discard"
D:\Document\My Documents\speedtest-v2.cpp:76: error: `discard' undeclared (first use this function)

D:\Document\My Documents\speedtest-v2.cpp:77: error: `X' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:77: error: 'struct SSE2_4' has no member named 'X'
D:\Document\My Documents\speedtest-v2.cpp: At global scope:
D:\Document\My Documents\speedtest-v2.cpp:85: error: `__m128i' does not name a type
D:\Document\My Documents\speedtest-v2.cpp: In member function `void SSE2_8::operator+=(SSE2_8&)':
D:\Document\My Documents\speedtest-v2.cpp:89: error: `__m128i' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:89: error: expected `;' before "discard"
D:\Document\My Documents\speedtest-v2.cpp:127: error: `discard' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:129: error: `X' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:129: error: 'struct SSE2_8' has no member named 'X'

Execution terminated
apparently it is g++ 3.4.2, i'll see if an upgrade will help)

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.