
 2007-05-31, 02:04 #12 rtharper     Apr 2007 2² Posts Hi! I'm the aforementioned student! My name is Tom Harper and I'll be working on some of the software for GIMPS (is that an acceptable acronym?). You can read about my progress at http://summerofsolaris.aftereternity.co.uk/ I anticipate a lot of communication with you all, as the knowledge required for this sort of thing seems rather esoteric. In the meantime, if you have any ideas for Glucas or Mlucas optimisation, let me know at rtharper@aftereternity.co.uk!
 2007-05-31, 15:14 #13 crash893     Sep 2002 2³·37 Posts Can't really track it if you don't update your blog.
 2007-05-31, 18:31 #14 rtharper     Apr 2007 4₈ Posts Patience! It's been less than a week since GSoC officially started, and I only got a little bit of a head start (finals, graduation, etc.). A post about the first week is up there. You can expect more frequent (i.e. daily or more often) posts from now on (Rob admonished me that you all would be suitably interested to the extent that I owe it to you to document every excruciating detail!).
2007-05-31, 21:13   #15
rgiltrap

Apr 2006
Down Under

89 Posts

Quote:
 Originally Posted by rtharper Rob admonished me that you all would be suitably interested to the extent that I owe it to you to document every excruciating detail!
hehe, thanks crash for reinforcing my point

I suggest that Mac OS X users stay posted, since Tom's primary machine is a Core 2 duo Mac he is likely to get out some new Glucas / Mlucas builds for these in the coming weeks (even though this is officially an OpenSolaris mentored project).

 2007-05-31, 23:37 #16 ewmayer ∂²ω=0     Sep 2002 República de California 3×13²×23 Posts

For my part, the operating mantra is "inline assembly code can be fun!" (cue Rod Serling voiceover) "Consider if you will, some simple trial-factoring 64-bit modmul code running on x86/ia32. In high-level C code, letting the compiler do the 64-bit integer emulation:
Code:
Starting Trial-factoring Pass 0...
Trial-factoring Pass 0: time = 00:01:25.983
Starting Trial-factoring Pass 1...
M18018467 has a factor: 195863445150291847. Program: E3.0x
Trial-factoring Pass 1: time = 00:01:24.912
Starting Trial-factoring Pass 2...
With a whiff of inline ASM, no serious effort at optimization and no use of SSE2:
Code:
Starting Trial-factoring Pass 0...
Trial-factoring Pass 0: time = 00:00:37.554
Starting Trial-factoring Pass 1...
M18018467 has a factor: 195863445150291847. Program: E3.0x
Trial-factoring Pass 1: time = 00:00:36.963
Starting Trial-factoring Pass 2...
More than twice as fast as high-level code using an optimizing compiler, ladies and gentlemen. An effect this profound would cause a person to question their sanity, unless they were writing inline assembler in ... the Twilight Zone."
 2007-07-24, 02:45 #17 rgiltrap     Apr 2006 Down Under 89 Posts Just a quick update since GSoC has passed the halfway point. A HEAP of work has been done on the Mlucas 3.x code over the last 8 weeks. Tom Harper has parallelized the FFT routines, while Ernst Mayer has done the same to the carry routines. We are now seeing quite high levels of parallelism when using 2-8 concurrent threads. I don't want anyone to get too excited at this stage, as over the next couple of weeks some rigorous testing and much further fine tuning needs to be performed. At the moment we are limited to 8 threads but should be able to reach 16 very soon, at which time a direct comparison can be made between Glucas & Mlucas performance at 16 threads (though I'm putting my money on Mlucas).

We have been testing the performance on the following boxes:
Itanium2 (16 cores)
Sparc64 VI (16 cores)
Athlon64 X2 (2 cores)
Opteron (16 cores)

This has shown that each CPU and system architecture is very different in terms of single-thread performance and scalability; at this time we have no idea which is going to be the fastest. For performing the fastest real-time verification, the SMP machines (Itanium2 & Sparc64 VI) appear to scale much better than the NUMA machines (Opteron), as would be expected. I'm not going to release any specific timings at this stage, but I will say that in some circumstances we have seen scaling better than this, which bodes well for a fast verification of the yet-to-be-found M45. Cheers, Rob.
 2007-07-24, 16:40 #18 Jeff Gilchrist     Jun 2003 Ottawa, Canada 3×17×23 Posts If you want any help testing that MLucas code in Linux, I can try it out on the large Itanium2 beast I have been using for M44/43 verification (128 CPUs). Using 16 cores was the best bang for the buck with GLucas.
2007-07-24, 18:39   #19
ewmayer
∂²ω=0

Sep 2002
República de California

26615₈ Posts

Quote:
 Originally Posted by Jeff Gilchrist If you want any help testing that MLucas code in Linux, I can try it out on the large Itanium2 beast I have been using for M44/43 verification (128 CPUs). Using 16 cores was the best bang for the buck with GLucas.
Thanks, Jeff - but you see [do your best Monty Python and the Holy Grail French Knnniggit accent here], I already got one. Yahs, it's a-varry nice...

I've been doing most of my timing tests on a 16-core Itanium2 system hosted on the HP testdrive program. 16-way ||ism is all I plan to code for in the near future, since the particular || structure of my FFT implementation lends itself best to the 2-16 core range. The Sun folks [Rob and Tom Duell] have some nice multicore Sparc64 VI and Opteron/Solaris systems, so we continually monitor and compare the benchmarks on 3 different systems.

My brief take or "executive summary" of where things are:

- Nearly all of the basic || FFT code - in particular the modified-to-be-thread-friendly data access scheme - was already in place; I did most of that work during a hiatus from work in 2005. Tom Harper's key contribution was tracking down the source of a subtle OpenMP loop-handling issue which was causing the || code to go haywire in unpredictable, nonrepeatable ways - something I didn't have the debug tools or MT experience to solve on my own. Once that was solved, progress has been very rapid.

- Compared to [say] Glucas, the Mlucas MT approach has several distinct advantages. For starters, there is no performance hit in going from unthreaded to threaded. For instance, here are numbers from Glucas timing tests on a multicore Itanium system [in fact, I believe nearly identical to the one I'm using], posted to the "Perpetual Benchmark Thread" by Tony Reix:
Quote:
 Originally Posted by T.Rex
With no thread, Glucas takes 0.1628 sec/iter.
With 1 thread, Glucas takes 0.2091 sec/iter. Scalability is: 1 (0.78)
With 2 threads, Glucas takes 0.1086 sec/iter. Scalability is: 1.93 (1.5)
With 4 threads, Glucas takes 0.0651 sec/iter. Scalability is: 3.21 (2.5)
With 6 threads, Glucas takes 0.0501 sec/iter. Scalability is: 4.17 (3.25)
With 8 threads, Glucas takes 0.0415 sec/iter. Scalability is: 5.04 (3.92)
See that big hit in going from unthreaded to 1-thread? We don't have that, thanks to a carefully designed low-overhead MT approach. Mlucas' coarse-grained, big-data-chunk ||ism also allows the threads to work as independently as possible, giving better scalability. Here are the current numbers for the same 2048K FFT length as above, run on a 1.5GHz Itanium similar or identical to the one used for Tony's runs. Note that you need to compare Tony's rightmost speedup factors in () to the ones here, because our baseline was the unthreaded code:
Code:
#threads  sec/iter  speedup
--------  --------  -------
-         .134      1.00
1         .134      1.00
2         .064      2.09
4         .033      4.06
8         .021      6.38
You can see that the MT performance is actually slightly superlinear for 2-4 threads. That is again because the || FFT was designed so each thread gets a big chunk of data it can crunch independently of the others: a dataset [~16MB in this case] larger than the L2 cache of a single CPU gets broken into chunks which fit neatly into the individual L2 caches attached to the processors handling each thread.

[We're currently investigating the sudden performance drop in going above 4 threads.]

Cheers,
-E

[BTW, I never forgot your e-mail of last September asking about a || Mlucas - but I didn't want to reply with either an excuse or a vague "I'm working on it"; instead I thought it better to use it as a carrot to actually get something working - though Rob G. has been a more-than-adequate niggler in that regard, as well. ;) I was actually going to e-mail you later this week to let you know how things were shaping up, but you just saved me the work.]

Last fiddled with by ewmayer on 2007-07-25 at 17:03

2007-07-25, 02:38   #20
rgiltrap

Apr 2006
Down Under

89₁₀ Posts

Quote:
 Originally Posted by ewmayer though Rob G. has been a more-than-adequate niggler in that regard, as well. ;)
Yeah... the investment I made with the Silicon Valley mafia has really paid off. As long as the code keeps coming out, there won't be any more 'visits'.

 2007-07-25, 16:53 #21 Jeff Gilchrist     Jun 2003 Ottawa, Canada 10010010101₂ Posts Nice. I'm glad you have a bunch of machines for testing; I just thought I would offer in case you wanted someone else to try to break things.
2007-07-25, 17:01   #22
ewmayer
∂²ω=0

Sep 2002
República de California

26615₈ Posts

Quote:
 Originally Posted by Jeff Gilchrist Nice. I'm glad you have a bunch of machines for testing, just thought I would offer in case you wanted someone else to try and break things.
If and when we get the ~linear ||ism ratcheted up to 16 threads, you will be welcome to run 8 copies of that [each doing a different exponent] on your Beast. In winter, that would probably heat a small office building. ;)

