Q: Mlucas on Linux on Alpha hardware - Problems ???
Dear list
I'm trying to run Mlucas on an old Digital Alpha Server 800 5/500 running Red Hat Linux 7.2 (for Alpha). I'm using the ev4 pre-compiled binary from Ernst Mayer's source code timings page: http://hogranch.com/mayer/gimps_timings.html

According to the homepage, the following steps are needed:

1) Download the programs you need
1a) The Mlucas configuration file
2) Run self-tests

Everything up to and including the self-test works fine, even though the screen output is a bit messy. The prime calculations are working, and I get exact matches for the control strings.

3) Get exponents from PrimeNet
1) CONNECT TO THE PRIMENET SERVER
1a) CREATE AN ACCOUNT
I already have an account for another computer...
2) SELECT MANUAL TESTS.
3) CHECK OUT EXPONENTS.
I obtained the following tasks and placed them in the worktodo.ini file like this:

--- worktodo.ini ---
DoubleCheck=9819503,64
DoubleCheck=9819533,64
--- end of worktodo.ini ---

4) FACTORING
This is where things start to go wrong... Whether I do the suggested test "Mlucas-2.7b-ev4 2202517 0 80000000000000" or just start work on the worktodo.ini file, I get the same message. I guess the existence of the worktodo.ini file takes priority over command-line arguments?!?

Anyway, when I run it, it looks like this:

[mhv@DmuAxel selftest]$ ../Mlucas-2.7b-ev4
looking for worktodo.ini file...
worktodo.ini file found...checking next exponent in range...
forrtl: info: Fortran error message number is 59.
forrtl: warning: Could not open message catalog: for_msg.cat.
forrtl: info: Check environment variable NLSPATH and protection of /usr/lib/for_msg.cat.
forrtl: severe (59): Message not found
[mhv@DmuAxel selftest]$

Obviously something is missing from my computer, namely for_msg.cat. I guess it's Fortran-related, but I have no further clue.

Please, how do I get on with employing this wonderful piece of hardware in the search??? :-)

Martin@Hvidberg.net |
EV5 -
:oops:
I just noticed that I have been using the EV4 pre-compiled binary, where I of course should have used the EV5 one, since the DIGITAL Alpha Server 800 5/500 is based on an Alpha EV5 CPU. This does not remove or change the problem, so please feel free to reply anyhow...

Martin@Hvidberg.net |
Re: Q: Mlucas on Linux on Alpha hardware - Problems ???
Hi, Martin:
[quote="MartinHvidberg"]3) CHECK OUT EXPONENTS.
I obtained the following tasks and placed them in the worktodo.ini file like this:

--- worktodo.ini ---
DoubleCheck=9819503,64
DoubleCheck=9819533,64
--- end of worktodo.ini ---[/quote]

A few lines further down on the README page, it says:

[quote]For each exponent you receive from PrimeNet, check the depth to which the corresponding Mersenne number has already been trial-factored against the appropriate entry in column 2 of the above table. If the depth listed in the work assignment line is at least as great as that in the table, no further trial-factoring need be done, i.e. only a Lucas-Lehmer test is needed. Paste the exponent(s) (and [b]ONLY[/b] the exponent(s) - unlike Prime95, Mlucas recognizes only one assignment type) into your worktodo.ini file and proceed to STEP 5. If one or more of your assigned numbers do require some factoring, continue to STEP 4. Or, if you (understandably) don't want the extra administrative burden that factoring currently requires, simply release any assignments that require factoring, request some new ones, and proceed directly to Lucas-Lehmer testing. [/quote]

That means that your worktodo.ini file should contain just

9819503
9819533

[quote="MartinHvidberg"]4) FACTORING
This is where things start to go wrong... Whether I do the suggested test "Mlucas-2.7b-ev4 2202517 0 80000000000000" or just start work on the worktodo.ini file, I get the same message. I guess the existence of the worktodo.ini file takes priority over command-line arguments?!?[/quote]

Mlucas (at least in any of its currently released incarnations) has no trial-factoring capability. The readme clearly says to use Peter Montgomery's Mfactor code (or Prime95 on your PC, for that matter) to do any needed trial factoring. Mfactor (in both source and binary form) is available in the same ftp archive.
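Going back to the worktodo.ini point above: the conversion from Prime95-style assignment lines to bare exponents can be sketched in a few lines of Python. This is only an illustration, not part of Mlucas. The function name `exponents_only` is made up here, the pattern only covers the `Type=exponent,depth` form shown in this thread, and writing one exponent per line is an assumption based on the README wording quoted above.

```python
import re

def exponents_only(lines):
    """Return the bare exponents from Prime95-style or plain worktodo lines.

    Hypothetical helper: handles only 'DoubleCheck=9819503,64'-style lines
    (as seen in this thread) and lines that already hold a bare exponent.
    """
    exponents = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Optional 'Type=' prefix, then the exponent, then optional ',depth' fields.
        match = re.fullmatch(r"(?:\w+=)?(\d+)(?:,\d+)*", line)
        if match is None:
            raise ValueError(f"unrecognized worktodo line: {line!r}")
        exponents.append(int(match.group(1)))
    return exponents

prime95_style = ["DoubleCheck=9819503,64", "DoubleCheck=9819533,64"]
# Assumption: one bare exponent per line in the Mlucas worktodo.ini.
print("\n".join(str(p) for p in exponents_only(prime95_style)))
```

The trial-factoring depth (the `64` here) is dropped on purpose, so it still has to be checked by hand against the README table before the exponent goes into the file.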
I'm working on a factoring engine to add to my C version of Mlucas, but because efficient factoring needs much more platform-specific code than LL testing, this won't be available before the Fall.

[quote="MartinHvidberg"]Anyway, when I run it, it looks like this:

[mhv@DmuAxel selftest]$ ../Mlucas-2.7b-ev4
looking for worktodo.ini file...
worktodo.ini file found...checking next exponent in range...
forrtl: info: Fortran error message number is 59.
forrtl: warning: Could not open message catalog: for_msg.cat.
forrtl: info: Check environment variable NLSPATH and protection of /usr/lib/for_msg.cat.
forrtl: severe (59): Message not found
[mhv@DmuAxel selftest]$

Obviously something is missing from my computer, namely for_msg.cat. I guess it's Fortran-related, but I have no further clue.[/quote]

The standalone (static) binaries will give this type of message when an error is encountered during execution and the appropriate message catalog isn't found on the host system. If you fix the problems with your worktodo.ini file, there should be no errors, and the problem should go away. Let me know if you have any further problems.

I plan to release an (LL-test only) version of Mlucas.c by the end of the month, so once you get that you can expect to see your runtimes drop dramatically vs. version 2.7b. Or, if you don't mind doing a quick build on your Alpha, do anon-ftp to hogranch.com, then

cd pub/mayer/src/C
mget *.h *.c

and answer 'y' to all the individual file prompts of mget.

Here are the Alpha build instructions from the comments at the top of the Mlucas.c file:

[quote]COMPILING THE PROGRAM:

The following list gives the best compile options for various Unix platforms on which the code has been run successfully. If your particular hardware/software is not represented, please let me know what kinds of results you get with your compiler (especially which optimization sequence worked best).
First gunzip (or gzip -d) the .gz file, then unpack the tar archive using tar xvf *tar, then compile and link as follows:

*(1A) Compaq C compiler (cc) for TruUnix V5.0+:

21064/ev4/ev45:
cc -o Mlucas -O4 -assume accuracy_sensitive -unroll 1 -arch ev4 -tune ev4 -Olimit 100000 *.c -lm

21164/ev5/ev56/generic:
cc -o Mlucas -O4 -assume accuracy_sensitive -unroll 1 -arch generic -tune generic -Olimit 100000 *.c -lm

21264/ev6/ev67/ev68:
cc -o Mlucas -O4 -assume accuracy_sensitive -unroll 1 -arch ev6 -tune ev6 -Olimit 100000 *.c -lm

NOTES:
- Do NOT use the <-fast> flag for TruUnix V5 and above - the code will not run properly.
- I've found (on 21064, 21164 and 21264) that using -O5 will increase compile time, and may actually hurt performance of this code. If you want to try it on your system, note that when using -O5, you *must* also use the <-assume accuracy_sensitive> option.
- Using 'generic' as the argument to the -tune and -arch flags has the advantage that an executable compiled with it will run on any generation of the Alpha architecture and usually performs well across a broad variety of systems, whereas 'ev6' *only* runs on the ev6. But 'ev6' should also give slightly better performance on the ev6. So if for some reason you need just a single binary use 'generic,' otherwise it's better to use the appropriate architecture-specific tunings.

*(1B) Compaq C compiler (ccc) for Linux: same as for Unix, but use ccc instead of cc, and note that the Linux compiler ignores the -Olimit flag. [/quote]

Note that when self-testing the C version using the sample exponents and Res64s of the readme page, [b]run for 99 iterations[/b], rather than 100. That's because the Fortran code's iteration counter starts at 1, whereas the C code's starts at 0.
For example, to run the 768K-FFT-length self-test using the C code, here are your entry lines:

Mlucas
15060013
768 <== FFT length (in K)
1 <== '1' here means 'do a short timing test, rather than launching a full LL test'
99 <== If comparing result to a Fortran-version Res64, use 99 iterations, rather than 100
0 <== Index of the FFT radix set to try - also try with 1,2,3... and use the one that gives the best timing for your full-length LL tests at this FFT length, by adding it to your mlucas.cfg file as described by the README instructions.
1 <== 1 here to turn on per-iteration roundoff error checking, 0 to turn it off

For the above run, I get

[code:1]
Mlucas 2.7c

ftp://hogranch.com/pub/mayer/README.html#2

INFO: Using prefetch.
looking for worktodo.ini file...
no worktodo.ini file found...switching to interactive mode.
Enter exponent >15060013
Enter FFT length in K (set K = 0 for default FFT length) >768
Enter 0 to run a full LL test, any other integer for a self-test >1
Enter number of iterations for timing test >99
Enter index of radix set to be used for the FFT: (See file fft_radix.txt for a list of available choices; enter -1 to get the default) >0
Enter 1 to enable per-iteration error checking, 0 for no error checking >1
p is prime...proceeding with Lucas-Lehmer test...
M15060013: using FFT length 768K = 786432 8-byte floats.
this gives an average 19.149796803792317 bits per digit
INFO: Using real*16 for FFT sincos and DWT weights tables inits.
Using complex FFT radices 6 16 16 16 16
99 iterations of M15060013 with FFT length 786432
Res64: B7BECF87319A0EB8. AvgMaxErr = 0.210626132. Program: E2.7c
Clocks = 00:00:08.233
[/code:1]

and note that the Res64 matches that of the README self-test table.

Good luck,
-Ernst |
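As a side note on the self-test log above, the "bits per digit" figure and the FFT length can be reproduced with simple arithmetic. The sketch below assumes that the product of the complex radices gives the complex FFT length, with two 8-byte floats per complex element. That reading is consistent with the numbers in the log but is not taken from the Mlucas source. The function names are invented for illustration.

```python
from math import prod

def fft_length_reals(complex_radices):
    """Real-vector length implied by a set of complex FFT radices.

    Assumption: product of the radices = complex length, and each complex
    element holds two real (8-byte) floats.
    """
    return 2 * prod(complex_radices)

def bits_per_digit(exponent, n_reals):
    """Average number of bits of M(exponent) stored in each FFT word."""
    return exponent / n_reals

n = fft_length_reals([6, 16, 16, 16, 16])
print(n, "reals =", n // 1024, "K")
print("average bits per digit:", bits_per_digit(15060013, n))
```

With the radix set 6 16 16 16 16 this gives 786432 reals (768K), and 15060013 / 786432 comes out at roughly 19.15 bits per digit, in line with the log.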
Trying out the 2.7c (or is it 2.8x ?)
Hi Ernst
Thanks for your thorough and helpful reply. It seems that it's now working, after I edited the worktodo.ini according to your instructions.

I would like to go for Mlucas 2.7c, since you promise better performance. I have downloaded all the .c and .h files and am planning a compile. I'm kind of new to this platform, but I would like to use the Compaq/HP compiler ccc, as they claim that it's better than cc and far better than gcc.

I was just looking at your suggested compiler options and comparing them with the ccc man page. It seems to make sense with something like:

ccc -o Mlucas-2.7c-ev56 -O4 -inline speed -assume accuracy_sensitive -unroll 1 -arch ev56 -tune ev56 -Olimit 100000 *.c -lm

Only I can't find the "-assume accuracy_sensitive" option in the ccc man page. Even "man -K accuracy_sensitive" comes up blank.

When compiling with the above command, I get a lot (14) of messages all saying:

"In this statement, type long double has the same representation as type double on this platform. (longdoublenyi)"

I don't know if this means trouble, or can just be ignored?

When I run the example you suggest in your comments, I get the following:

---8<---
[mhv@DmuAxel ver27c]$ ./Mlucas-2.7c-ev56
Mlucas 2.8x
http://hogranch.com//mayer/README.html#2
INFO: Using prefetch.
looking for worktodo.ini file...
no worktodo.ini file found...switching to interactive mode.
Enter exponent >15060013
Enter FFT length in K (set K = 0 for default FFT length) >768
Enter 0 to run a full LL test, any other integer for a self-test >1
Enter number of iterations for timing test >99
Enter index of radix set to be used for the FFT: (See file fft_radix.txt for a list of available choices; enter -1 to get the default) >0
Enter 1 to enable per-iteration error checking, 0 for no error checking >1
p is prime...proceeding with Lucas-Lehmer test...
M15060013: using FFT length 768K = 786432 8-byte floats.
this gives an average 19.149796803792317 bits per digit
INFO: Using real* 8 for FFT sincos and DWT weights tables inits.
Using complex FFT radices 6 16 16 16 16
99 iterations of M15060013 with FFT length 786432
Res64: B7BECF87319A0EB8. AvgMaxErr = 0.210626132. Program: E2.8x
Clocks = 00:01:06.288
Done ...
---8<---

It seems to be the same Res64 as you get, but the performance sucks... 1 min 6 sec, compared to your 8 sec. What platform were you using?

It should be noted that another instance of Mlucas was running on the same machine at the same time.

[mhv@DmuAxel ver27c]$ ps -A | grep 'Mlucas'
21689 pts/5 00:52:34 Mlucas-2.7b-gen
9850 pts/4 00:00:36 Mlucas-2.7c-ev5

It also says 2.8x? Is it 2.7c or 2.8x, and should I care?

I won't try compiling with the -fast option, since the ccc man page says that floating-point operations may give different results! But other ideas are of course very welcome...

Best Regards & Thanks again
Martin@Hvidberg.net |
Hi Ernst
I have tried some different compiler options; see below. As you can see, I can get as low as <42 sec by stripping irrelevant options and inserting "-inline speed". I can get nowhere near your 8 sec, but I assume you are using a fast machine? I'll probably have a closer look at the -inline options...

:-) Martin

--- Different compiler options ---

ccc-ernst: The one you suggested, slightly changed to fit ccc and ev56.
ccc -o Mlucas-2.7c-ev56 -O4 -inline speed -assume accuracy_sensitive -unroll 1 -arch ev56 -tune ev56 -Olimit 100000 *.c -lm
> Clocks = 00:00:42.271

ccc-plain: Stripped -inline, -assume and -Olimit options
ccc -o Mlucas-2.7c-ev56plain -O4 -unroll 1 -arch ev56 -tune ev56 *.c -lm
> Clocks = 00:00:45.975

ccc-emmh: As ccc-ernst, but strip -assume option
ccc -o Mlucas-2.7c-ev56emmh -O4 -inline speed -unroll 1 -arch ev56 -tune ev56 -Olimit 100000 *.c -lm
> Clocks = 00:00:42.016

ccc-emmh2: As ccc-emmh, but also stripping -Olimit option
ccc -o Mlucas-2.7c-ev56emmh2 -O4 -inline speed -unroll 1 -arch ev56 -tune ev56 *.c -lm
> Clocks = 00:00:41.898 |
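For anyone wanting to compare builds like the four above without doing the subtraction by hand, here is a small Python sketch that parses the "Clocks = hh:mm:ss.sss" lines and reports each build relative to the plain one. The helper name `clocks_to_seconds` is made up, and the timing strings are copied from the post above. It assumes every timing line follows the format Mlucas printed in this thread.

```python
import re

def clocks_to_seconds(text):
    """Convert a 'Clocks = hh:mm:ss.sss' line into seconds (hypothetical helper)."""
    match = re.search(r"Clocks\s*=\s*(\d+):(\d+):(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"no Clocks field found in: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# Timings as reported in the post above, keyed by build label.
builds = {
    "ccc-ernst": "Clocks = 00:00:42.271",
    "ccc-plain": "Clocks = 00:00:45.975",
    "ccc-emmh":  "Clocks = 00:00:42.016",
    "ccc-emmh2": "Clocks = 00:00:41.898",
}

times = {name: clocks_to_seconds(line) for name, line in builds.items()}
baseline = times["ccc-plain"]

# Fastest build first, with the gain relative to the plain build.
for name, secs in sorted(times.items(), key=lambda item: item[1]):
    gain = 100.0 * (baseline - secs) / baseline
    print(f"{name:10s} {secs:7.3f} s  {gain:5.1f}% faster than ccc-plain")
```

On these numbers the three "-inline speed" builds land within about half a second of each other and roughly 8 to 9% ahead of ccc-plain, so differences among them are probably within run-to-run noise.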
Hi, Martin:
Glad you got it to compile. Yes, the long double warnings are ignorable.

I did my timing run on a 1 GHz ev68, which should be 2 to 2.5x faster per-cycle than an ev56. And yes, having another instance of Mlucas running on your machine might well throw off your timings, especially on a small-cache, relatively low-bandwidth machine like the ev56. I suggest you do your timing tests on an otherwise idle system.

I hadn't played with the -inline flag much recently, since I didn't recall it making any appreciable difference when I tried it on the ev6, but based on your results it may be worth trying it out. In hindsight, I *would* expect the degree of inlining to make more of a difference on the ev56, due to its tiny (8 kB L1 data and 8 kB L1 instruction, 96 kB mixed D/I L2) caches. If you see a similar speedup from -inline speed on an idle system, I'll modify my compile tips appropriately. I'm also going to play with this on the ev6.

Also note that Mlucas.c savefiles are not compatible with Fortran-version ones, so the sooner you can deploy the C binary on your system, the fewer wasted cycles you'll have (since the C version will likely be sufficiently faster that it'll make sense to simply rerun your current exponent using the newer code.) |
I see no performance difference using -inline speed on either ev56 or ev6 under TruUnix. According to the compiler manpages, -inline speed is the default, so this makes sense. Now I'm curious as to why Martin [b]would[/b] see a timing difference when building with this flag, since it's supposed to be invoked by default anyway.
Ah, I just tried the same option (using ccc) on an Alpha/Linux ev6 system, and there I do see a small (~4%) speedup. Alas, I don't have access to an ev56 running Linux, but as soon as Martin redoes his timings without any other jobs running on his system, we'll know what kind of speedup to expect on that platform. |
[quote]... but as soon as Martin redoes his timings without any other jobs running on his system, we'll know what kind of speedup to expect on that platform.[/quote]
The 1 min 6 sec was skewed by another process, but the forty-something-second timings were all on an idle system. I most likely had a web browser and an xterm open, but they were doing nothing. The system is rather limited on RAM and uses 90%+ just having Linux running, so maybe that explains the bad performance. I can't reach the system from here, but if it serves a purpose I'll try rerunning the earlier quoted test on an "as idle as possible" system on Monday.

How bad is this performance anyway? I have only Ernst's 8 sec as a reference, but what about other systems? Is the system so slow that I should consider unplugging it for good? Or maybe reinstall Linux in a no-GUI version?

:-) Martin |
[quote="MartinHvidberg"]How bad is this performance anyway? I have only Ernst's 8 sec as a reference, but what about other systems?
Is the system so slow that I should consider unplugging it for good? Or maybe reinstall Linux in a no-GUI version?[/quote]

If my run needed 8 sec on a 1 GHz ev6, then ~40 sec sounds about right for a 400MHz ev56: around 2x slower than ev6 on a per-cycle basis, hence ~5x slower in absolute terms. Mlucas.c on ev56 gives comparable performance to Prime95 running on a P3 on a per-cycle basis, but there simply aren't a whole lot of 1GHz ev56 systems out there, and these days 400-500 MHz is quite slow. But if this system would otherwise be idling (i.e. it's up for a reason, but not heavily loaded), might as well put those spare cycles to work, I say. |
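The back-of-envelope estimate in the post above (clock ratio times per-cycle slowdown) can be written out as a tiny Python sketch. The function name and parameters are invented for illustration. The 2x per-cycle factor and the 400 MHz figure are taken from the post, while the 500 MHz case is an assumption based on reading the "5/500" in the machine's model name as its clock speed.

```python
def estimated_seconds(ref_seconds, ref_mhz, target_mhz, per_cycle_slowdown):
    """Scale a reference timing to a slower CPU.

    ref_seconds:        timing measured on the reference machine
    ref_mhz/target_mhz: clock speeds of the reference and target machines
    per_cycle_slowdown: how many times slower the target is per clock cycle
    """
    return ref_seconds * (ref_mhz / target_mhz) * per_cycle_slowdown

# 8 s on a 1 GHz ev6, target 2x slower per cycle at 400 MHz (figures from the post).
print(estimated_seconds(8.0, 1000, 400, 2.0))  # 40.0

# Same estimate if the target actually runs at 500 MHz (assumption, see above).
print(estimated_seconds(8.0, 1000, 500, 2.0))  # 32.0
```

Either way the measured ~42 sec is in the same ballpark as the estimate, which is the point being made in the post.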
Neglected to mention in my previous post about building and self-testing: If your system is unloaded (or has a constant load running on it), you can use the automated self-test feature of Mlucas.c to run through all available radix combos for each FFT length in a decently wide range and determine which is best at each length - just parse the timings for the various radix combos at each FFT length and modify your mlucas.cfg accordingly.
Mlucas -s {s|m|l|x} (pick one of the latter letters) runs through lengths 128-512K, 576-2048K, 2304-4096K and 4608-8192K, respectively. 'Mlucas -s a' runs through all of these.

Haven't had time to pull it all together in an updated readme file, unfortunately, hence this dribs-and-drabs trickle of information, for which I apologize. If only I could get paid to do this full-time... |
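Once the self-test timings described above have been collected, picking the best radix set per FFT length is a simple minimum search. The Python sketch below shows one way to do it. The function name, the tuple layout and the sample timings are all made up for illustration, and it deliberately stops short of writing mlucas.cfg, since the file's format is described in the README rather than in this thread.

```python
def best_radix_sets(timings):
    """Pick the fastest radix-set index for each FFT length.

    timings: iterable of (fft_length_k, radix_index, seconds) tuples,
             e.g. collected by hand from a series of self-test runs.
    Returns a dict mapping fft_length_k -> (radix_index, seconds).
    """
    best = {}
    for fft_k, radix_index, seconds in timings:
        if fft_k not in best or seconds < best[fft_k][1]:
            best[fft_k] = (radix_index, seconds)
    return best

# Hypothetical sample data, NOT real measurements.
sample = [
    (768, 0, 42.3), (768, 1, 40.9), (768, 2, 44.0),
    (1024, 0, 57.1), (1024, 1, 58.6),
]

for fft_k, (radix_index, seconds) in sorted(best_radix_sets(sample).items()):
    print(f"{fft_k}K: radix set {radix_index} ({seconds:.1f} s)")
```

The winning index for each length is then what would go into mlucas.cfg, following the README instructions.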