mersenneforum.org  

Old 2021-06-25, 15:15   #1
Jean Penné
 
May 2004
FRANCE

7×83 Posts
LLR version 4.0.0 released.

Hi All,

Today I uploaded version 4.0.0 of the LLR program.
You can find it on my personal site:

http://jpenne.free.fr/

The 32-bit Windows and Linux compressed binaries are available as usual.
The 64-bit Linux binaries are released here, and also the 64-bit Mac OS binaries.
The 32-bit Mac OS binary is not released here because I do not have the 32-bit hwloc library which is needed, and I could not build it on my Mac mini...
I also uploaded the complete source in a compressed file; it may be used to build the 64-bit Windows binaries.

What is new in this version:

It is linked with the latest version, 30.6, of George Woltman's gwnum library.
There are no really new features since 3.8.24, but there are some improvements related to reliability and speed.
I now avoid the giants functions invg() and gcdg(), which are slow and seem not to be very reliable.
To do that, I am using the gtompz() and mpztog() conversion functions.
I also replaced the gwnum squaring and multiplication functions gwsquare() and gwmul() everywhere with their new forms:
gwsquare2(), gwmul3(), gwmul3_carefully(), etc...
As usual, I need help to build the 64-bit Windows binaries.
I also uploaded the GNU gmp6.1.0 compressed source I used on 32-bit VC6.0.
I hope it can be used to build this library on 64-bit Windows and link it with LLR...
Please inform me if you encounter any problem while using this new version.
Best Regards,
Jean
Old 2021-07-10, 05:55   #2
Happy5214
 
"Alexander"
Nov 2008
The Alamo City

5²·31 Posts

Thanks for the long-needed major version bump!

Can someone actually post a 64-bit Windows build this time? I've been running the 32-bit cllr 3.8.24 on my old laptop since it's not well-suited to run Visual Studio.
Old 2021-07-10, 14:26   #3
mathwiz
 
Mar 2019

5·41 Posts

Thanks for the new version!

Quote:
Originally Posted by Jean Penné View Post
What is new in this version:

It is linked with the latest version, 30.6, of George Woltman's gwnum library.
There are no really new features since 3.8.24, but there are some improvements related to reliability and speed.
Can you elaborate a bit? What sorts of speed improvements should we see, on which architecture(s), and is it for numbers of a particular form, etc?

In some limited testing, I have not observed a speedup.
Old 2021-10-15, 23:29   #4
diep
 
Sep 2006
The Netherlands

11×71 Posts

Can someone verify something for me by running LLR on a Xeon v4 system, or on others? (edit: any system with AVX)

I have built a new dual Xeon E5-2699 v4 system, though the CPUs are ES (engineering sample) versions.

It has 44 cores in total and I'm very happy with it.

It runs on all 44 cores without hyperthreading and clocks itself to 2.0 GHz under this full load (watercooled; all cores are around 50C under full load, the highest at 51C).

My old system is an 8-core Xeon L5420 box at 2.5 GHz. No turbo boost, obviously.

In theory each core of the E5 v4 Xeon delivers 32 fp64 flops a clock.
In theory each core of the Core 2 based Xeon L5420 delivers 4 fp64 flops a clock.

So that is a factor 8 difference per clock.

Yet at small bit sizes (1 Mbit) I measure only a factor of 3.13 here, and at larger bit sizes I measure that going up to a factor of 3.6 per clock in favour of the Xeon E5.

That should be closer to a factor of 8, however.

Now I do not know what causes this. It is as if AVX doesn't work, or maybe it is something else.
Can someone test this on a single core of a box with sllr64 and see which timings they get?

Much appreciated.

The timings do not need to be very accurate. A few seconds more or less is no big deal.
The question is whether I am missing a factor 2 in performance somewhere.

Here is the file 'try.txt'
314000000000000000:M:0:2:258
9473 1024968
9473 1025002
9473 1025220
9473 1025338
9473 1025602
9473 1025724
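For anyone who wants to script around this input, here is a minimal Python sketch that splits a file like try.txt into its header line and its (k, n) pairs. The function names are made up for illustration. The header fields are not decoded here; the form k*2^n-1 is taken from the result lines posted later in this thread, so treat it as an assumption about this particular file.

```python
from typing import List, Tuple

TRY_TXT = """314000000000000000:M:0:2:258
9473 1024968
9473 1025002
9473 1025220
9473 1025338
9473 1025602
9473 1025724
"""

def parse_candidates(text: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the header line and the list of (k, n) pairs that follow it."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    pairs = []
    for row in rows:
        k_str, n_str = row.split()
        pairs.append((int(k_str), int(n_str)))
    return header, pairs

def as_expression(k: int, n: int) -> str:
    # Assumed form: k*2^n-1, as shown in the result lines in this thread.
    return f"{k}*2^{n}-1"

header, pairs = parse_candidates(TRY_TXT)
for k, n in pairs:
    print(as_expression(k, n))
```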

llr.ini :
WorkDone=0
Work=0
PgenInputFile=try.txt
PgenOutputFile=resbench4
PgenLine=1
HeaderLine=0
Pid=10817
OldCpuSpeed=2500
NewCpuSpeedCount=0
NewCpuSpeed=0
PRPGerbiczCompareIntervalAdj=1

On the Xeon E5 the timing is 511.x or 512.x seconds for each of the above exponents, tested on a single core (other 43 cores busy). At 2.0 GHz, no hyperthreading.

On the old Xeon L5420s, where this is running now, the first timing is 1280 seconds.
Yet whether the difference is 3.13 (compensated for clock speed) or 3.6 (larger bit sizes), I am missing a factor 2 there.
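The clock-compensated factor can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it uses the 1280 s and 512 s timings and the 2.5 GHz and 2.0 GHz clocks from this post, plus the 32 and 4 flops-per-clock figures as stated above. The function name is made up for illustration.

```python
def per_clock_speedup(t_old: float, ghz_old: float,
                      t_new: float, ghz_new: float) -> float:
    """Ratio of clock cycles spent: (old time * old clock) / (new time * new clock)."""
    return (t_old * ghz_old) / (t_new * ghz_new)

measured = per_clock_speedup(t_old=1280.0, ghz_old=2.5, t_new=512.0, ghz_new=2.0)
theoretical = 32 / 4  # flops-per-clock figures as given in the post

print(f"measured per-clock factor:    {measured:.3f}")
print(f"theoretical per-clock factor: {theoretical:.1f}")
print(f"shortfall:                    {theoretical / measured:.2f}x")
```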

Many thanks,
Vincent

Last fiddled with by diep on 2021-10-15 at 23:30
Old 2021-10-16, 03:10   #5
axn
 
Jun 2003

5,189 Posts

Quote:
Originally Posted by diep View Post
In theory each core of the E5 v4 Xeon delivers 32 fp64 flops a clock.
In theory each core of the Core 2 based Xeon L5420 delivers 4 fp64 flops a clock.
Hmmm... I thought AVX delivered 8/cycle and FMA3 delivered 16/cycle. Are you sure about the 32 number?
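As a rough cross-check, peak fp64 flops per cycle can be modelled as (execution units) x (doubles per vector) x (2 if the unit does fused multiply-add, else 1). The unit counts in this sketch are assumptions based on commonly published microarchitecture descriptions, not something stated in this thread: one add pipe plus one multiply pipe for SSE2 and AVX cores, and two FMA units for FMA3 cores. The function name is made up for illustration.

```python
def peak_fp64_flops_per_cycle(units: int, vector_bits: int, fma: bool) -> int:
    """Peak double-precision flops per core per cycle under a simple model."""
    lanes = vector_bits // 64          # fp64 values per vector register
    flops_per_lane = 2 if fma else 1   # an FMA counts as multiply + add
    return units * lanes * flops_per_lane

# Assumed configurations (not taken from the thread):
sse2 = peak_fp64_flops_per_cycle(units=2, vector_bits=128, fma=False)
avx = peak_fp64_flops_per_cycle(units=2, vector_bits=256, fma=False)
fma3 = peak_fp64_flops_per_cycle(units=2, vector_bits=256, fma=True)

print(sse2, avx, fma3)   # 4 8 16
print(fma3 / sse2)       # 4.0
```

Under these assumptions the per-clock ratio between an FMA3 core and an SSE2 core comes out at 4 rather than 8.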
Old 2021-10-16, 06:24   #6
diep
 
Sep 2006
The Netherlands

11·71 Posts

Quote:
Originally Posted by axn View Post
Hmmm... I thought AVX delivered 8/cycle and FMA3 delivered 16/cycle. Are you sure about the 32 number?
From a manufacturer definition perspective, yes, 100% sure. Manufacturers define the perhaps theoretical possibility of executing 16 fp64 instructions a clock as 32 fp64 flops a clock.
(edit: even if for some CPUs hardware architects have shown that the instruction stream cannot decode enough instructions a clock to reach the manufacturer's figure, the manufacturers in question still kept defining things by the above model. So the definition does not mean it is theoretically possible for 16 fp64 instructions to get executed; in reality, as we know, that is 4 AVX instructions a clock, with the 16 counted from the perspective of individual doubles.)

Haswell and Broadwell cores are therefore given as 32 flops a clock by definition.

For Core 2 era Xeon cores with off-chip memory, I can find 4 fp64 flops a clock given.
That does not necessarily mean it is the case for the L5420s I have been running on here for the past 10 years.

Now that would mean it can execute up to 2 instructions a clock per core (edit: with an SSE2 instruction already counted as 2 here, since it operates on 2 doubles, so we count instructions from a per-double perspective). For multiplication it is complicated to ever achieve this, though, as the throughput latency of fp64 multiplication SSE2/SSSE3 instructions is most definitely more than 1 clock. The initial Nehalem i7 core also needs more than 1 clock of throughput latency for multiplication, whereas later cores can do that in 1 clock.

At least this is the situation as I understand it.

What CPUs achieve in practice is, of course, a totally different reality.

The question here is what timing you get and whether it is significantly faster than what I get here. These engineering sample CPUs might, for example, not identify themselves correctly to the software that runs on them, just to mention something.

Last fiddled with by diep on 2021-10-16 at 06:38
Old 2021-10-16, 06:29   #7
diep
 
Sep 2006
The Netherlands

1100001101₂ Posts

The timings I have on a single core of the L5420s are:

9473*2^1024968-1 is not prime. RES64: 20E2F460750CC947. Time : 1280.557 sec.
9473*2^1025002-1 is not prime. RES64: B76C5E01AC969C27. Time : 1233.439 sec.
9473*2^1025220-1 is not prime. RES64: 4207BF5915F3B797. Time : 1191.275 sec.
9473*2^1025338-1 is not prime. RES64: 75EB97511F4070E6. Time : 1241.910 sec.
9473*2^1025602-1 is not prime. RES64: C8255E7C578A5C36. Time : 1183.743 sec.
9473*2^1025724-1 is not prime. RES64: 2AD235954071BD7F. Time : 1209.558 sec.

That's at 2.5 GHz. The wildly varying timings are something I have seen for 10 years on those Xeons; this box is connected to the internet. Boxes not connected to the internet show less variation.

Now the Xeon E5 v4 ES CPUs here, which run under full load at 2.0 GHz:

511-512 seconds very consistently (not a single timing different).

Are my Xeon E5s achieving the same speed as others get with Broadwell/Haswell type CPUs, or do I lose a factor 2 somewhere?
Old 2021-10-16, 12:15   #8
rob147147
 
Apr 2013
Durham, UK

26 Posts

For comparison I ran your tests on a slightly older E5-2630 v3 using LLR 4.0.0:

9473*2^1024968-1 is not prime. RES64: 20E2F460750CC947. Time : 415.186 sec.
9473*2^1025002-1 is not prime. RES64: B76C5E01AC969C27. Time : 381.239 sec.
9473*2^1025220-1 is not prime. RES64: 4207BF5915F3B797. Time : 382.592 sec.
9473*2^1025338-1 is not prime. RES64: 75EB97511F4070E6. Time : 382.597 sec.
9473*2^1025602-1 is not prime. RES64: C8255E7C578A5C36. Time : 383.175 sec.
9473*2^1025724-1 is not prime. RES64: 2AD235954071BD7F. Time : 383.757 sec.

The machine was busy with something else during the first job.
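Since both machines tested the same candidates, the RES64 values can be compared mechanically. Below is a minimal Python sketch that parses result lines in the format shown in this thread and checks that the residues agree. The regular expression only covers the "is not prime" line format quoted here, and the function name is made up for illustration. Only the first two result lines from each machine are included to keep it short.

```python
import re

# Matches lines such as:
# 9473*2^1024968-1 is not prime. RES64: 20E2F460750CC947. Time : 1280.557 sec.
LINE_RE = re.compile(
    r"^(\S+) is not prime\. RES64: ([0-9A-F]{16})\. Time : ([0-9.]+) sec\.$"
)

def parse_results(text: str) -> dict:
    """Map each tested expression to its (RES64, seconds) pair."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            results[m.group(1)] = (m.group(2), float(m.group(3)))
    return results

L5420 = """\
9473*2^1024968-1 is not prime. RES64: 20E2F460750CC947. Time : 1280.557 sec.
9473*2^1025002-1 is not prime. RES64: B76C5E01AC969C27. Time : 1233.439 sec.
"""
E5_2630_V3 = """\
9473*2^1024968-1 is not prime. RES64: 20E2F460750CC947. Time : 415.186 sec.
9473*2^1025002-1 is not prime. RES64: B76C5E01AC969C27. Time : 381.239 sec.
"""

old, new = parse_results(L5420), parse_results(E5_2630_V3)
mismatches = [e for e in old if e in new and old[e][0] != new[e][0]]
print("residue mismatches:", mismatches)
for expr in old:
    print(expr, f"wall-clock ratio {old[expr][1] / new[expr][1]:.2f}")
```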
Old 2021-10-16, 13:23   #9
axn
 
Jun 2003

12105₈ Posts

Quote:
Originally Posted by rob147147 View Post
For comparison I ran your tests on a slightly older E5-2630 v3 using LLR 4.0.0:
2630 is a Haswell with a base clock of 2.4 GHz. Haswell & Broadwell should have the same theoretical max FLOPs/cycle. A straight extrapolation of clock speed says 512*2.0/2.4 ≈ 427 sec on your processor, which is close enough (especially if the achieved clock is about 10% higher).

EDIT:- Probably part of the explanation is that you did not have all 8 cores fully occupied. Can you run 8 tests in parallel and report the timing?

Last fiddled with by axn on 2021-10-16 at 13:26
Old 2021-10-16, 18:09   #10
diep
 
Sep 2006
The Netherlands

30D₁₆ Posts

Quote:
Originally Posted by rob147147 View Post
For comparison I ran your tests on a slightly older E5-2630 v3 using LLR 4.0.0:

9473*2^1024968-1 is not prime. RES64: 20E2F460750CC947. Time : 415.186 sec.
9473*2^1025002-1 is not prime. RES64: B76C5E01AC969C27. Time : 381.239 sec.
9473*2^1025220-1 is not prime. RES64: 4207BF5915F3B797. Time : 382.592 sec.
9473*2^1025338-1 is not prime. RES64: 75EB97511F4070E6. Time : 382.597 sec.
9473*2^1025602-1 is not prime. RES64: C8255E7C578A5C36. Time : 383.175 sec.
9473*2^1025724-1 is not prime. RES64: 2AD235954071BD7F. Time : 383.757 sec.

The machine was busy with something else during the first job.
Much appreciated!

I assume it turbo boosted to 3.2 GHz running this 1-core load.

If I extrapolate the time: 383 s * 3.2 GHz / 2 GHz = 612.8 seconds, and I had 612 seconds as a timing.
So that's exactly the same!
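Both extrapolations in this thread use the same simple model: the cycle count is fixed, so time scales inversely with clock. Here is a small Python sketch of that model. The function name is made up for illustration, and the 3.2 GHz turbo clock is the assumption from this post, not a measured value.

```python
def scale_time(seconds: float, ghz_measured: float, ghz_target: float) -> float:
    """Predict the run time at another clock, assuming a constant cycle count."""
    return seconds * ghz_measured / ghz_target

# 512 s at 2.0 GHz, projected onto a 2.4 GHz base clock
base_clock_estimate = scale_time(512, 2.0, 2.4)

# 383 s at an assumed 3.2 GHz turbo clock, projected onto 2.0 GHz
turbo_estimate = scale_time(383, 3.2, 2.0)

print(round(base_clock_estimate, 1))  # 426.7
print(round(turbo_estimate, 1))       # 612.8
```

Posts #4 and #7 report 511 to 512 seconds on the E5-2699 v4, so the two assumed clocks bracket that figure.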
Old 2021-10-16, 18:18   #11
diep
 
Sep 2006
The Netherlands

1415₈ Posts

Quote:
Originally Posted by axn View Post
2630 is a Haswell with a base clock of 2.4 GHz. Haswell & Broadwell should have the same theoretical max FLOPs/cycle. A straight extrapolation of clock speed says 512*2.0/2.4 ≈ 427 sec on your processor, which is close enough (especially if the achieved clock is about 10% higher).

EDIT:- Probably part of the explanation is that you did not have all 8 cores fully occupied. Can you run 8 tests in parallel and report the timing?
No need - it will have turbo boosted to 3.2 GHz, which then gives, to the exact second, the same timing I had.

383 * 3.2 / 2 = 612 seconds

Time to try LLR2.

All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.