#1 |
Nov 2002
Anchorage, AK
3·7·17 Posts |
I'd like to run the latest version of Mlucas on a Sun Fire 4800 and an Enterprise E420R. I've downloaded the precompiled 2.7b binary, but while searching the forum threads I noticed there was (is?) a 2.7c that may be somewhat faster.
Presently I'm running 2.7b (prefetch) on the Sun Fire without an mlucas.cfg, because I'm not familiar enough with tweaking the configuration file to make it optimal for that processor. I'm also running 2.7b (prefetch) on the E420R using the bundled config file. I suppose I'm asking for a little help with the configuration file to get Mlucas running optimally on the Sun Fire 4800, and whether there is a faster (better?) version to run on these machines.
#2 |
Aug 2002
Termonfeckin, IE
2⁴×173 Posts |
You should definitely try out the 2.8beta. You can download the source from ftp://hogranch.com/pub/mayer/src/C
Get all the files in the directory. You'll need to compile it yourself and play with the configuration settings to find the optimal settings. Mlucas.c has some hints on how to go about compiling it. When I compiled it for a Sun Ultra10 a while back I used the following command:
cc -o Mlucas -Bstatic -fast -xO5 -xsafe=mem -xprefetch *.c -lm &
I found the Sun cc compiler was much better than gcc. You may need to modify a few options for your two machines. There are some new features in 2.8 that help with automated benchmarking to come up with the optimum mlucas.cfg configuration. I'm sure Ernst Mayer will post with more suggestions soon. Play with it and skim through Mlucas.c in the meanwhile.
#3 |
Nov 2002
Anchorage, AK
3·7·17 Posts |
Thanks for the link. I downloaded the files and am now trying to compile the code on the Sun Fire 4800.
I used your suggestion but added -xtarget=ultra3. I just checked it and here's what I got:
Code:
Mlucas.c:
br.c:
mers_mod_square.c:
qfloat.c:
radix10_ditN_cy_dif1.c:
radix11_ditN_cy_dif1.c:
radix12_ditN_cy_dif1.c:
radix13_ditN_cy_dif1.c:
radix14_ditN_cy_dif1.c:
radix15_ditN_cy_dif1.c:
radix16_dif_dit_pass.c:
radix16_ditN_cy_dif1.c:
radix16_wrapper_square.c:
radix18_ditN_cy_dif1.c:
radix32_dif_dit_pass.c:
radix32_ditN_cy_dif1.c:
radix32_wrapper_square.c:
radix5_ditN_cy_dif1.c:
radix6_ditN_cy_dif1.c:
radix7_ditN_cy_dif1.c:
radix8_dif_dit_pass.c:
cg: assertion failed in file ../src/ms_pipe/sp_opt.cc at line 3373
cg: Internal error: bad memory flow arc
cg: 1 errors
cc: cg failed for radix8_dif_dit_pass.c
[1] Exit 2 cc -o Mlucas -Bstatic -fast -xO5 -xsafe=mem -xprefetch -xtarget=ultra3 *.c -lm
Last fiddled with by delta_t on 2004-01-02 at 17:44
#4 |
∂²ω=0
Sep 2002
República de California
5·2,351 Posts |
You've got 2 choices here: first is to build the latest version (anon-ftp to hogranch.com, cd pub/mayer/src/C, mget *) yourself (assuming you have access to the SunPro C compiler), and use the automated self-test feature (type Mlucas -h to see the options here) to help you find the best set of FFT radices for the runlengths of interest, which will go into the mlucas.cfg file, whose format and purpose is described here.
Your second choice is to try a gzipped version of the sparc binary I use (built for me by Bill Rea - our Sparcs at work only have gcc) here: ftp://hogranch.com/pub/mayer/bin/SPA...as2.8_sparc.gz
That version of the code has pretty much the same performance as the latest code, but lacks the automated self-test feature. To use it to build your mlucas.cfg file, go to the above .../src/C ftp archive and get only the Mlucas.c file. Scroll to the bottom portion of the source file, where you'll see a table of exponents and 64-bit hex residues, with entries that look like
Code:
/* Array of distinct test cases for self-tests. Add one extra slot to vector for user-specified self-test exponents: */
struct testCase testVec[numTest+1] =
{
/* FFT #radices p 100-iter Res64 #bits per digit FFT radices AvgMaxErr */
/* Small: x86 alfa */
{ 128, 3, 2550001,"CB6030D5790E2460"},/* testVec[ 0] 19.455 16,16,16,16 .1034 .1334 */
{ 144, 2, 2920013,"7CC1B41482BCB7C0"},/* testVec[ 1] 19.803 9,16,16,32 .1508 .2113 */
{ 160, 6, 3265007,"B912804D7FE4A9E5"},/* testVec[ 2] 19.928 10,16,16,32 .2020 .2656 */
{ 176, 3, 3550007,"5059094E256FB886"},/* testVec[ 3] 19.698 11,16,16,32 .1686 .2403 */
{ 192, 6, 3900067,"4744CB8E5287DA60"},/* testVec[ 4] 19.837 12,16,16,32 .1885 .2523 */
{ 224, 6, 4540007,"1DA37E1FAC27BC68"},/* testVec[ 5] 19.793 14,16,16,32 .2097 .2929 */
{ 256, 7, 5190001,"15216788A374E144"},/* testVec[ 6] 19.798 16,16,16,32 .2563 .3086 */
{ 288, 2, 5780087,"ADB1333A531F6EED"},/* testVec[ 7] 19.599 9,16,32,32 .1774 .2384 */
{ 320, 3, 6400013,"6B2DF2F4FD779CBC"},/* testVec[ 8] 19.531 10,16,32,32 .1846 .2392 */
{ 352, 2, 7010011,"4FC7B9144100998F"},/* testVec[ 9] 19.448 11,16,32,32 .1756 .2585 */
{ 384, 3, 7600013,"2AFA7C90899B583E"},/* testVec[10] 19.328 12,16,32,32 .1383 .1872 */
{ 416, 2, 8330009,"74AB1D925A0E7DB7"},/* testVec[11] 19.555 13,16,32,32 .2488 .3152 */
{ 448, 3, 8950001,"7D9DD642E10F2525"},/* testVec[12] 19.509 14,16,32,32 .2041 .2906 */
{ 480, 2, 9490001,"01A4E738255C522B"},/* testVec[13] 19.307 15,16,32,32 .1642 .2186 */
{ 512, 3, 10110007,"24AAC84A6CD400BE"},/* testVec[14] 19.283 16,16,32,32 .1884 .2260 */
/* Medium: */
{ 576, 2, 11350013,"7087EA4B45F416A6"},/* testVec[15] 19.243 9,32,32,32 .1657 .2181 */
{ 640, 2, 12590009,"93E43FC168EAF6BF"},/* testVec[16] 19.211 10,32,32,32 .1885 .2382 */
{ 704, 1, 13799939,"7A8B6F72D5F3A862"},/* testVec[17] 19.143 11,32,32,32 .1747 .2542 */
{ 768, 2, 15099979,"D731A6D76D99F3F5"},/* testVec[18] 19.201 12,32,32,32 .1692 .2304 */
{ 832, 1, 16299979,"39AB362A15AF832C"},/* testVec[19] 19.132 13,32,32,32 .2154 .2632 */
{ 896, 2, 17599997,"EDF99B1D21DE8835"},/* testVec[20] 19.182 14,32,32,32 .2041 .2773 */
{ 960, 1, 18899999,"AF0F81144A3372A4"},/* testVec[21] 19.226 15,32,32,32 .2186 .2915 */
{1024, 6, 20099983,"119B2956917D0CC1"},/* testVec[22] 19.169 16,32,32,32 .2457 .2934 */
{1152, 5, 22500011,"3D81D5C9CC3D1C65"},/* testVec[23] 19.073 9,16,16,16,16 .1845 .2582 */
{1280, 2, 25000009,"B4A3AF6909228279"},/* testVec[24] 19.073 10,16,16,16,16 .2534 .3083 */
...
Code:
time Mlucas
22500011 <=== exponent for LL test
1152 <=== FFT length (in K) for LL test
1 <=== 0 for a full LL test, 1 for a shorter timing test
100 <=== if previous line was a 1, how many iterations for the timing test
0 <=== This is the radix set index
1 <=== 0 for error checking off, 1 for EC on.
Start with radix set 0 and increase by one each run until you start getting "radix set XYZ not available - using defaults" warnings. All radix sets should give Res64 = 3D81D5C9CC3D1C65, as per the Mlucas.c table entry. Of the radix sets you tried, pick the one that yielded the smallest runtime and add the corresponding entry to your mlucas.cfg file, e.g. if RS 3 gave the best time @1152K, your mlucas.cfg file would look like
Code:
#
# mlucas.cfg optimized for UltraSparc blah blah...
#
200000
#
1152 3
The format of the .cfg file is important - you must begin with precisely 3 #-prefixed lines, where you may enter comments to the right of the # as desired. The fourth line tells the program how many initial iterations to do with per-iteration error checking (EC) turned on - in the above example, if it gets through the first 200000 iterations on a given exponent with no roundoff errors greater than roughly 0.4, it turns off EC for the rest of the run. You can see if EC slows the code down appreciably by rerunning the self-tests, but entering a 0 instead of a 1 on the last line of input.
If EC-on is no more than 1 or 2 % slower than EC-off, I recommend putting a large signed 32-bit integer (say 1000000000) on line 4 of the .cfg file, to force EC to be always on. Once you've set up your mlucas.cfg file, create a worktodo.ini file in the same dir as your executable and your .cfg file, enter an exponent in it, and invoke the program sans any flags, e.g. with "nice Mlucas &".
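For anyone who prefers to script the bookkeeping, here is a minimal Python sketch of the selection step described above: drop any run whose Res64 does not match the Mlucas.c table, keep the fastest remaining radix set per FFT length, and write the result in the .cfg layout from the example. The function names and the timing numbers are hypothetical illustrations, not part of Mlucas and not measured data; only the 1152K exponent's residue is taken from the table.

```python
def best_radix_set(runs, expected_res64):
    """Return the radix set index of the fastest run with a correct residue.

    runs: list of (radix_set_index, seconds, res64) tuples from manual timing runs.
    """
    valid = [(secs, idx) for idx, secs, res in runs
             if res.upper() == expected_res64.upper()]
    if not valid:
        raise ValueError("no run reproduced the expected Res64")
    return min(valid)[1]


def format_cfg(best, ec_iters=200000, comment="mlucas.cfg built from manual timings"):
    """Lay out the file as in the example: 3 '#' lines, EC iteration count, '#', entries."""
    lines = ["#", "# " + comment, "#", str(ec_iters), "#"]
    lines += ["%d %d" % (fft, idx) for fft, idx in sorted(best.items())]
    return "\n".join(lines) + "\n"


# Hypothetical timings in seconds for the 1152K self-test exponent (NOT real data).
RES_1152 = "3D81D5C9CC3D1C65"  # from the Mlucas.c table entry for p = 22500011
runs_1152 = [
    (0, 35.1, RES_1152),
    (1, 34.2, RES_1152),
    (2, 30.0, "DEADBEEF00000000"),  # fast but wrong residue, so it is ignored
    (3, 33.0, RES_1152),
]

best = {1152: best_radix_set(runs_1152, RES_1152)}
cfg_text = format_cfg(best)
print(cfg_text)
```

The residue check comes first on purpose: a radix set that is fast but gives the wrong Res64 should never end up in the file.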
#5 |
Nov 2002
Anchorage, AK
101100101₂ Posts |
I've downloaded the source and after playing around with the compiler switches, I think I have compiled a binary for the SunFire 4800 running Solaris 9. Here is what I used:
cc -o Mlucas -dalign -fsimple=2 -fns -fsingle -xbuiltin=%all -xlibmil -Bstatic -xO5 -xsafe=mem -xprefetch -xarch=v8plusb *.c -lm
I've also used the same switches (except using -xarch=v8plusa) on an Enterprise E420R running Solaris 5.8. I've run the self tests to set up the configuration files and I'm now running a double check as the first test on one of the CPUs. If anyone wants any timings or the binary or anything, leave a post. Thanks for the help.
#6 |
∂²ω=0
Sep 2002
República de California
5×2,351 Posts |
Sounds like you're up and running OK. Note that another nice aspect of the built-in self-test sets in the current (and future) code is that it makes it a lot easier to do run-time profiling (RTP) of the binary. Just build a version with all your usual compiler flags plus -xprofile=collect, then run one or more of the self-test sets, then incorporate the RTP data that were collected by doing a final build with -xprofile=use replacing -xprofile=collect. I believe Bill Rea got a nice (10-20%) speedup at most FFT lengths this way. Note that the optimal FFT radix sets may change once profiling has been done.
Happy Hunting, -Ernst
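The profiling recipe above boils down to three steps, sketched here in Python as a helper that just assembles the command lines. The flag spellings -xprofile=collect and -xprofile=use:Mlucas and the base flags are copied from the compile lines posted later in this thread; the helper name is made up for illustration, and the self-test step is left as a placeholder because its exact options are not given here (Mlucas -h lists them).

```python
# Base flags as posted later in the thread (without the -xarch/-xcache machine settings).
BASE_FLAGS = ("-dalign -fsimple=2 -fns -fsingle -xbuiltin=%all -xlibmil "
              "-Bstatic -xO5 -xsafe=mem -xprefetch")


def profiled_build_steps(base_flags=BASE_FLAGS, exe="Mlucas"):
    """Return the three steps of a collect/use profiled build as strings."""
    collect = "cc %s -xprofile=collect *.c -lm -o %s" % (base_flags, exe)
    use = "cc %s -xprofile=use:%s *.c -lm -o %s" % (base_flags, exe, exe)
    return [
        collect,                                                   # 1. instrumented build
        "<run one or more self-test sets with the instrumented binary>",  # 2. placeholder
        use,                                                       # 3. final build using the profile
    ]


steps = profiled_build_steps()
for number, step in enumerate(steps, start=1):
    print("%d. %s" % (number, step))
```

Both builds use identical flags apart from the -xprofile setting, so the profile data matches the code that is finally built. Rerun the radix set timings afterwards, since the best sets may change.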
#7 |
Nov 2002
Anchorage, AK
3·7·17 Posts |
Hello Ernst,
I got your PM and responded before reading this post. I'll give the profiling a try and run the self tests with and without profiling. However, please see the PM regarding a question I had on the self tests.
#8 |
Nov 2002
Anchorage, AK
3·7·17 Posts |
Okay, I've recompiled the code several times and have finally come up with the two versions I used for testing and timings. One is the regular compile, while the other is the runtime-profiled (RTP) version. I will post the two mlucas.cfg files, which include the FFT size, the fastest radix set index, and its associated time. The RTP version is typically faster, except that above the 4096K FFT size the profiled version is a little slower at most lengths. As you said, the radix set indices are different. Last fiddled with by delta_t on 2004-01-07 at 09:35
#9 |
Nov 2002
Anchorage, AK
3×7×17 Posts |
Here are the results before profiling.
---------
# mlucas.cfg - before profiling
# compile flags: cc -xarch=native -xcache=64/32/4:8192/512/2 -dalign -fsimple=2 -fns -fsingle -xbuiltin=%all -xlibmil -Bstatic -xO5 -xsafe=mem -xprefetch -xprofile=collect *.c -lm -o Mlucas
# system 8-way Sun Fire 4800 Solaris 9
1000000000
# Following lines: {FFT length(K) | Radix Set Index} # Best time
128 3 # 2.97
144 2 # 3.54
160 5 # 3.859
176 2 # 4.58
192 5 # 4.4
224 5 # 5.629
256 7 # 5.91
288 1 # 7.469
320 2 # 8.25
352 2 # 9.449
384 3 # 9.5
416 2 # 11.15
448 3 # 12.73
480 0 # 14.189
512 0 # 14.259
576 0 # 16.64
640 1 # 17.48
704 1 # 20.399
768 1 # 20.6
832 0 # 25.8
896 1 # 26
960 1 # 26.989
1024 6 # 26.48
1152 3 # 33.009
1280 1 # 39.21
1408 1 # 47.96
1536 1 # 46.909
1664 1 # 57.02
1792 2 # 1:00.369
1920 2 # 1:04.400
2048 2 # 1:04.109
2304 1 # 1:17.400
2560 2 # 1:26.709
2816 1 # 1:48.239
3072 1 # 1:49.519
3328 2 # 2:00.099
3584 1 # 2:11.750
3840 2 # 2:19.389
4096 2 # 2:21.729
4608 2 # 2:45.530
5120 2 # 3:04.009
5632 1 # 4:08.900
6144 2 # 3:52.069
6656 2 # 4:13.830
7168 1 # 4:33.470
7680 1 # 4:51.589
8192 4 # 5:04.740
#10 |
Nov 2002
Anchorage, AK
357₁₀ Posts |
Here's the runtime-profiled version.
----------
# mlucas.cfg - after profiling
# compile flags: cc -xarch=native -xcache=64/32/4:8192/512/2 -dalign -fsimple=2 -fns -fsingle -xbuiltin=%all -xlibmil -Bstatic -xO5 -xsafe=mem -xprefetch -xprofile=use:Mlucas *.c -lm -o Mlucas
# system 8-way Sun Fire 4800 Solaris 9
1000000000
# Following lines: {FFT length(K) | Radix Set Index} # Best time
128 1 # 2.68
144 2 # 3.24
160 3 # 3.6
176 2 # 4.12
192 3 # 4.36
224 6 # 5.389
256 5 # 5.809
288 1 # 6.969
320 3 # 8.009
352 2 # 8.75
384 3 # 9.22
416 2 # 10.96
448 3 # 11.529
480 1 # 11.99
512 2 # 12.439
576 0 # 15.529
640 2 # 16.57
704 1 # 20.129
768 2 # 20.449
832 0 # 24.41
896 2 # 24.289
960 1 # 26.85
1024 4 # 25.25
1152 1 # 31.929
1280 1 # 35.24
1408 1 # 42
1536 2 # 46.109
1664 1 # 54.409
1792 1 # 56.149
1920 1 # 1:00.479
2048 3 # 1:02.95
2304 3 # 1:12.409
2560 1 # 1:23.84
2816 2 # 1:38.909
3072 2 # 1:44.469
3328 2 # 1:59.510
3584 1 # 2:10.210
3840 2 # 2:11.699
4096 5 # 2:22.169
4608 3 # 2:45.460
5120 2 # 3:19.870
5632 2 # 3:51.300
6144 1 # 4:01.699
6656 1 # 4:28.350
7168 2 # 4:59.839
7680 1 # 5:22.290
8192 4 # 5:09.470
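To compare the two listings without eyeballing them, a small Python sketch along these lines could parse the "FFT length, radix set, # best time" rows and report the change per FFT length. It assumes the time comments are either plain seconds ("2.97") or minutes:seconds ("1:00.369"), which is how they appear in the two posts; the helper names are made up, and only four rows from each post are included as sample input.

```python
def to_seconds(stamp):
    """Convert '2.97' or '1:00.369' (minutes:seconds) to seconds."""
    minutes, _, seconds = stamp.rpartition(":")
    return (int(minutes) if minutes else 0) * 60 + float(seconds)


def parse_row(row):
    """Parse a row such as '128 3 # 2.97' into (fft_k, radix_set, seconds)."""
    data, _, comment = row.partition("#")
    fft, radix = data.split()
    return int(fft), int(radix), to_seconds(comment.strip())


def percent_gain(before_rows, after_rows):
    """Percent time saved by the profiled build per FFT length (negative = slower)."""
    before = {fft: secs for fft, _, secs in map(parse_row, before_rows)}
    after = {fft: secs for fft, _, secs in map(parse_row, after_rows)}
    return {fft: 100.0 * (before[fft] - after[fft]) / before[fft]
            for fft in sorted(before) if fft in after}


# Sample rows copied from the two posts above.
before = ["128 3 # 2.97", "1024 6 # 26.48", "4096 2 # 2:21.729", "8192 4 # 5:04.740"]
after = ["128 1 # 2.68", "1024 4 # 25.25", "4096 5 # 2:22.169", "8192 4 # 5:09.470"]

gains = percent_gain(before, after)
for fft, pct in gains.items():
    print("%5dK  %+.1f%%" % (fft, pct))
```

A positive number means the profiled binary was faster at that length. Feeding in the full listings instead of the four sample rows gives the complete picture.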
#11 |
Nov 2002
Anchorage, AK
3×7×17 Posts |
I am going to repost these numbers again once I do one more compile. I don't think the flags I used for these two compiles give the fastest times. I'll try again and repost.
|