mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > GMP-ECM

View Poll Results: Would you use a 'fat binary' of GMP-ECM?
No, 5% more performance is important to me 4 21.05%
No, I'm happy with the version 'my guy' builds 0 0%
Yes, I want one fewer thing to worry about 15 78.95%
Voters: 19. You may not vote on this poll

Reply
 
Thread Tools
Old 2012-02-11, 13:15   #1
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

1101110000102 Posts
Default Would you use a 'fat binary' of GMP-ECM?

I've never liked all the mental effort that everyone here expends to get the perfect GMP-ECM binary that runs at optimal speed. It seems there are a dozen different sources of compiled binaries that are tweaked for this platform or that. In contrast, GMP switched to a 'fat binary' approach many years ago, where for x86 the build process compiles all the assembly code for all architectures and a runtime dispatcher chooses the bundles of code that are called. This is slightly slower than using a binary with just one set of assembly code compiled in, but you don't need to figure out any perfect combinations for you or your userbase.

The reason I mention this is that I'm currently performing an overhaul of some of the plumbing in GMP-ECM, and it's becoming clear that parts of the library are leaving a fair amount of performance on the table because they can only make one choice of how to do things at runtime.

For example, stage 2 performs number-theoretic FFTs that have many internal bells and whistles. If you use SSE2 then these FFTs run 30% faster on 32-bit platforms, so the build process detects SSE2 automatically and turns it on. However, on 32-bit platforms there is a choice of how you do modular multiplication, and the compiled-in choice is the second fastest (it's 10% slower), because the fastest choice would severely limit how big a stage 2 you can do. Why not just build in both, and choose the fastest one at runtime? A development branch of the library that I'm working on also has another set of choices that remove any limits on how big a stage 2 your 32-bit machine can handle, at a cost of (currently) 1.5x in performance. If this was the default then nobody would be happy, but if you don't use it then your 32-bit machine will only ever be able to handle smaller problems.

The stage 1 code is much more CPU-specific and would clearly benefit from a CPU-dispatcher; it can just compile all the various redc versions plus a generic one, then put them in a table at library start time and look in the table when it knows which machine it's running on.

The problem is that building all of this is a great deal of work for the developers, which is only worthwhile if the user community can resist the temptation to 'roll their own' anyway. This has definitely worked with GMP, in part because building GMP on your own is sometimes incredibly painful, especially on windows, but also because there isn't much point in building your own anymore and your own products can ship a generic library that will work well on anything your users have.

So if we added this to the next version of GMP-ECM and the version that shipped with Ubuntu (or whatever) suddenly gets 95% of the performance of figuring out your own perfect set of compile options, would that be good enough for you?
jasonp is offline   Reply With Quote
Old 2012-02-11, 14:34   #2
axn
 
axn's Avatar
 
Jun 2003

10001111100012 Posts
Default

So I get 10% faster stage 1, and 0% faster stage 2, and a potential 5% performance loss overall ( due to not-quite-perfect compile options ), I should still come out ahead, right (assuming 3:1 stage1/stage2 split)? What am I missing?
axn is offline   Reply With Quote
Old 2012-02-11, 15:25   #3
xilman
Bamboozled!
 
xilman's Avatar
 
May 2003
Down not across

22·41·61 Posts
Default

Quote:
Originally Posted by axn View Post
So I get 10% faster stage 1, and 0% faster stage 2, and a potential 5% performance loss overall ( due to not-quite-perfect compile options ), I should still come out ahead, right (assuming 3:1 stage1/stage2 split)? What am I missing?
An encouragement to build your own binary which is optimum for your own machine and/or your own interests?

I've always found it extremely easy to build ECM but, there again, I try not to use Windoze.
xilman is offline   Reply With Quote
Old 2012-02-11, 17:16   #4
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

54248 Posts
Default

This is another option for "windoze" users: http://www.mersenneforum.org/showpos...&postcount=289

It's only tested by 2 people that I know of, but it seems to work.
ATH is offline   Reply With Quote
Old 2012-02-11, 20:36   #5
yoyo
 
yoyo's Avatar
 
Oct 2006
Berlin, Germany

5×113 Posts
Default

I really would be happy to get a fat binary. I send GMP-ECM to all my Boinc users which have all kind of systems.
yoyo
yoyo is offline   Reply With Quote
Old 2012-02-11, 20:53   #6
debrouxl
 
debrouxl's Avatar
 
Sep 2009

97710 Posts
Default

Well, if the performance penalty of using this (currently hypothetical) fat GMP-ECM binary is really 5%, compared to the performance of a CPU family-specific binary, large scale ECM users (bdodson et al.) are unlikely to use a fat binary, aren't they ?

Should the task of a fat GMP-ECM binary be undertaken, I guess that all of us could happily contribute CPU power to further tuning (if necessary, of course), on various CPU families
debrouxl is offline   Reply With Quote
Old 2012-02-12, 17:20   #7
chris2be8
 
chris2be8's Avatar
 
Sep 2009

111000100012 Posts
Default

Am I correct in assuming that the fat binary would decide which CPU is was running on at startup, so the overhead would be very small (milliseconds at most)? It would presumably be a lot larger on disk but only the code for the current CPU would be loaded so the memory overhead would be small.

If that's true I'm all for it. And if it can chose the best code path for the CPU, target size, B1, B2, etc if could easily run faster than the thin binary.

The only case I think of for the thin binary is if disk space is limited, eg diskless systems with just a little flash memory to boot off.

Chris K
chris2be8 is offline   Reply With Quote
Old 2012-02-12, 21:19   #8
xilman
Bamboozled!
 
xilman's Avatar
 
May 2003
Down not across

22·41·61 Posts
Default

I just voted. Took me so long because I am "my guy" and I didn't know which of the two to vote for.
xilman is offline   Reply With Quote
Old 2012-02-12, 22:25   #9
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

352210 Posts
Default

Large-scale ECM users always need special considerations, since 5% of the time of thousands of machines can safely be called valuable.

Usually a fat library has a hook into the application startup process that identifies the CPU and instruction set only once, then fills up a global table with pointers to the functions chosen for that run. For the rest of runtime the only difference then becomes that a function call becomes a table lookup instead of a jump to an address known at compile time.

All the headaches with this are in the build process, not at runtime. Msieve has a fat binary architecture when running QS, and it does that by taking the same source file and compiling it over and over again with different arguments, plus different names for the top-level functions in each architecture. Doing the same for GMP-ECM means lots of contortions in automake, plus trying to make the Visual Studio builds do the same thing, which may not be possible.
jasonp is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Binary Multitasking a1call Lounge 8 2016-12-03 21:20
CUDALucas: which binary to use? Karl M Johnson GPU Computing 15 2015-10-13 04:44
How to build a binary of SVN183? Andi47 Msieve 12 2010-02-01 19:30
2-d binary representation only_human Miscellaneous Math 9 2009-02-23 00:11
Convert Binary to BCD code tinhnho Miscellaneous Math 8 2005-09-18 15:14

All times are UTC. The time now is 19:40.

Fri Jun 5 19:40:22 UTC 2020 up 72 days, 17:13, 1 user, load averages: 1.42, 1.29, 1.30

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.