mersenneforum.org  

Old 2007-05-31, 02:04   #12
rtharper
 
Apr 2007

2² Posts

Hi! I'm the aforementioned student! My name is Tom Harper and I'll be working on some of the software for GIMPS (is that an acceptable acronym?). You can read about my progress at http://summerofsolaris.aftereternity.co.uk/

I anticipate a lot of communication with you all, as the knowledge required for this sort of thing seems rather esoteric. In the meantime, if you have any ideas for Glucas or Mlucas optimisation, let me know at rtharper@aftereternity.co.uk!
Old 2007-05-31, 15:14   #13
crash893
 
Sep 2002

23·37 Posts

Can't really track it if you don't update your blog.
Old 2007-05-31, 18:31   #14
rtharper
 
Apr 2007

4₈ Posts

Patience! It's been less than a week since GSoC officially started, and I only got a little bit of a head start (finals, graduation, etc.). A post about the first week is up there. You can expect more frequent (i.e. daily or more often) posts from now on (Rob admonished me that you all would be suitably interested to the extent that I owe it to you to document every excruciating detail!).
Old 2007-05-31, 21:13   #15
rgiltrap
 
Apr 2006
Down Under

89 Posts

Quote:
Originally Posted by rtharper
Rob admonished me that you all would be suitably interested to the extent that I owe it to you to document every excruciating detail!
hehe, thanks crash for reinforcing my point

I suggest that Mac OS X users stay posted: since Tom's primary machine is a Core 2 Duo Mac, he is likely to put out some new Glucas / Mlucas builds for it in the coming weeks (even though this is officially an OpenSolaris-mentored project).
Old 2007-05-31, 23:37   #16
ewmayer
2ω=0
 
Sep 2002
República de California

3×13²×23 Posts

For my part, the operating mantra is "inline assembly code can be fun!"

(cue Rod Serling voiceover)

"Consider if you will, some simple trial-factoring 64-bit modmul code running on x86/ia32. In high-level C code, letting the compiler do the 64-bit integer emulation:

Starting Trial-factoring Pass 0...
Trial-factoring Pass 0: time = 00:01:25.983
Starting Trial-factoring Pass 1...
M18018467 has a factor: 195863445150291847. Program: E3.0x
Trial-factoring Pass 1: time = 00:01:24.912
Starting Trial-factoring Pass 2...

With a whiff of inline ASM, no serious effort at optimization and no use of SSE2:

Starting Trial-factoring Pass 0...
Trial-factoring Pass 0: time = 00:00:37.554
Starting Trial-factoring Pass 1...
M18018467 has a factor: 195863445150291847. Program: E3.0x
Trial-factoring Pass 1: time = 00:00:36.963
Starting Trial-factoring Pass 2...

More than twice as fast as high-level code using an optimizing compiler, ladies and gentlemen. An effect this profound would cause a person to question their sanity, unless they were writing inline assembler in ... the Twilight Zone."
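
For anyone wondering what the trial-factoring passes timed above actually test, here is a minimal Python sketch. It is not the Mlucas code (which is C plus inline assembly), and the function names are made up for illustration. It relies only on two standard facts: q divides M(p) = 2^p - 1 exactly when 2^p ≡ 1 (mod q), and for an odd prime p any factor of M(p) has the form 2kp+1 and is ±1 mod 8. Python's big integers hide the part that matters for speed: in C, the squaring step below is the 64-bit modmul that the compiler has to emulate with 32-bit operations on ia32.

```python
def is_candidate(p, q):
    """Cheap pre-filter: factors of 2**p - 1 (p an odd prime) are of the
    form 2*k*p + 1 and are congruent to 1 or 7 modulo 8."""
    return q % (2 * p) == 1 and q % 8 in (1, 7)


def divides_mersenne(p, q):
    """Return True if q divides 2**p - 1, i.e. if 2**p == 1 (mod q).

    Left-to-right binary powering: one modular squaring per bit of p,
    plus a modular doubling whenever that bit is set.
    """
    result = 1
    for bit in bin(p)[2:]:
        result = (result * result) % q  # the modmul a C version spends its time in
        if bit == "1":
            result = (2 * result) % q
    return result == 1


if __name__ == "__main__":
    # Exponent and factor taken from the program output quoted above.
    p, q = 18018467, 195863445150291847
    print(is_candidate(p, q), divides_mersenne(p, q))
```

A real trial-factoring program sieves the candidates q = 2kp+1 first and only runs the powering test on the survivors; the sketch leaves the sieve out.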
Old 2007-07-24, 02:45   #17
rgiltrap
 
Apr 2006
Down Under

89 Posts

Just a quick update since GSoC has passed the halfway point.

A HEAP of work has been done on the Mlucas 3.x code over the last 8 weeks.

Tom Harper has parallelized the FFT routines, while Ernst Mayer has done the same to the carry routines. We are now seeing quite high levels of parallelism when using 2-8 concurrent threads.

I don't want anyone to get too excited at this stage as over the next couple of weeks there needs to be some rigorous testing and much further fine tuning performed.

At the moment we are limited to 8 threads but should be able to reach 16 very soon, at which time a direct comparison can be made between Glucas and Mlucas performance at 16 threads (though I'm putting my money on Mlucas).

We have been testing the performance on the following boxes:
  • Itanium2 (16 cores)
  • Sparc64 VI (16 cores)
  • Athlon64 X2 (2 cores)
  • Opteron (16 cores)
This has shown that each CPU and system architecture is very different in terms of single-thread performance and scalability; at this time we have no idea which is going to be the fastest. For the fastest real-time verification, the SMP machines (Itanium2 and Sparc64 VI) appear to scale much better than the NUMA machine (Opteron), as would be expected.

I'm not going to release any specific timings at this stage, but I will say that we have in some circumstances seen scaling that is better than this, which bodes well for a fast verification of the yet-to-be-found M45.

Cheers, Rob.
Old 2007-07-24, 16:40   #18
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

3×17×23 Posts

If you want any help testing that MLucas code in Linux, I can try it out on the large Itanium2 beast I have been using for M44/43 verification (128 CPUs). Using 16 cores was the best bang for the buck with GLucas.
Old 2007-07-24, 18:39   #19
ewmayer
2ω=0
 
Sep 2002
República de California

26615₈ Posts

Quote:
Originally Posted by Jeff Gilchrist
If you want any help testing that MLucas code in Linux, I can try it out on the large Itanium2 beast I have been using for M44/43 verification (128 CPUs). Using 16 cores was the best bang for the buck with GLucas.
Thanks, Jeff - but you see [do your best Monty Python and the Holy Grail French Knnniggit accent here], I already got one. Yahs, it's a-varry nice...

I've been doing most of my timing tests on a 16-core Itanium 2 system hosted on the HP testdrive program. 16-way ||ism is all I plan to code for in the near future, since the particular || structure of my FFT implementation lends itself best to the 2-16 core range. The Sun folks [Rob and Tom Duell] have some nice multicore Sparc 6 and Opteron/Solaris systems, so we continually monitor and compare the benchmarks on 3 different systems.

My brief take or "executive summary" of where things are:

- Nearly all of the basic || FFT code - in particular the modified-to-be-thread-friendly data access scheme - was already in place; I did most of that work during a hiatus from work in 2005. Tom Harper's key contribution was tracking down the source of a subtle OpenMP loop-handling issue which was causing the || code to go haywire in unpredictable, nonrepeatable ways, and which I didn't have the debug tools or MT experience to solve on my own. Once we had that solved, progress was very rapid.

- Compared to [say] Glucas, the Mlucas MT approach has several distinct advantages. For starters, no performance hit in going from unthreaded to threaded. For instance, here are numbers from Glucas timing tests on a multicore Itanium system [In fact, I believe almost identical to the one I'm using], posted to the "Perpetual benchmark Thread" by Tony Reix:
Quote:
Originally Posted by T.Rex
With no thread, Glucas takes 0.1628 sec/iter .
With 1 thread, Glucas takes 0.2091 sec/iter . Scalability is: 1 .(0.78)
With 2 threads, Glucas takes 0.1086 sec/iter . Scalability is: 1.93 .(1.5)
With 4 threads, Glucas takes 0.0651 sec/iter . Scalability is: 3.21 .(2.5)
With 6 threads, Glucas takes 0.0501 sec/iter . Scalability is: 4.17 .(3.25)
With 8 threads, Glucas takes 0.0415 sec/iter . Scalability is: 5.04 .(3.92)
See that big hit in going from unthreaded to 1-thread? We don't have that, thanks to a carefully designed low-overhead MT approach. Mlucas' coarse-grained, big-data-chunk ||ism also allows the threads to work as independently as possible, giving better scalability. Here are the current numbers for the same 2048K FFT length as above, run on a 1.5GHz Itanium similar or identical to the one in Tony's runs. Compare Tony's rightmost speedup factors, the ones in (), to the ones here, because our baseline is the unthreaded code:
Code:
#thread sec/iter  speedup
------- -----     ------
-       .134       1.00
1       .134       1.00
2       .064       2.09
4       .033       4.06
8       .021       6.38
You can see that the MT performance is actually slightly superlinear for 2-4 threads - that is again because the || FFT was designed so each thread gets a big chunk of data it can crunch independently of the others - i.e. a dataset of size [~16MB in this case] larger than the L2 cache of a single CPU gets broken into chunks which fit neatly into the individual L2 caches attached to the processor handling each thread.
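
The speedup columns in both tables can be rechecked with a few lines of Python. This is only a sketch of the arithmetic, using the sec/iter figures exactly as posted; the function names are invented for illustration, and the chunk-size helper assumes 8-byte doubles split evenly across threads, which is consistent with the ~16MB quoted for a 2048K FFT but is not taken from the Mlucas source.

```python
# sec/iter figures copied from the two tables above (2048K FFT length).
GLUCAS_UNTHREADED = 0.1628
GLUCAS = {1: 0.2091, 2: 0.1086, 4: 0.0651, 6: 0.0501, 8: 0.0415}
MLUCAS_UNTHREADED = 0.134
MLUCAS = {1: 0.134, 2: 0.064, 4: 0.033, 8: 0.021}


def speedups(baseline, timings):
    """Map thread count -> (speedup, efficiency), both relative to `baseline`.

    Efficiency is speedup divided by thread count; a value above 1.0
    means superlinear scaling.
    """
    return {n: (baseline / t, baseline / (t * n)) for n, t in timings.items()}


def chunk_bytes(fft_length, n_threads, bytes_per_element=8):
    """Per-thread share of the data, assuming an even split of 8-byte doubles."""
    return fft_length * bytes_per_element // n_threads


if __name__ == "__main__":
    for name, base, table in (("Glucas", GLUCAS_UNTHREADED, GLUCAS),
                              ("Mlucas", MLUCAS_UNTHREADED, MLUCAS)):
        for n, (s, e) in speedups(base, table).items():
            mib = chunk_bytes(2048 * 1024, n) / 2**20
            print(f"{name} {n} threads: speedup {s:.2f}, "
                  f"efficiency {e:.2f}, chunk {mib:.1f} MiB")
```

Measured against the unthreaded baseline, the Mlucas efficiency comes out just above 1.0 at 2 and 4 threads (the slightly superlinear range) and about 0.80 at 8 threads, while the per-thread chunk shrinks from 16 MiB to 8, 4 and 2 MiB.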

[We're currently investigating the sudden performance drop in going above 4 threads.]

Cheers,
-E

[BTW, I never forgot your e-mail of last September, asking about a || Mlucas -- but I didn't want to reply with either an excuse or a vague "I'm working on it", instead I thought it better to use as a carrot to actually get something working -- though Rob G. has been a more-than-adequate niggler in that regard, as well. ;) I was actually going to e-mail you later this week to let you know how things were shaping up, but you just saved me the work.]

Last fiddled with by ewmayer on 2007-07-25 at 17:03
Old 2007-07-25, 02:38   #20
rgiltrap
 
Apr 2006
Down Under

89₁₀ Posts

Quote:
Originally Posted by ewmayer
though Rob G. has been a more-than-adequate niggler in that regard, as well. ;)
Yeah.. the investment I made with the Silicon Valley mafia has really paid off. As long as the code keeps coming out there won't be any more 'visits'
Old 2007-07-25, 16:53   #21
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

10010010101₂ Posts

Nice. I'm glad you have a bunch of machines for testing, just thought I would offer in case you wanted someone else to try and break things.
Old 2007-07-25, 17:01   #22
ewmayer
2ω=0
 
Sep 2002
República de California

26615₈ Posts

Quote:
Originally Posted by Jeff Gilchrist
Nice. I'm glad you have a bunch of machines for testing, just thought I would offer in case you wanted someone else to try and break things.
If and when we get the ~linear ||ism ratcheted up to 16 threads, you will be welcome to run 8 copies of that [each doing a different exponent] on your Beast. In winter, that would probably heat a small office building. ;)
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.