[QUOTE=Batalov;194773]In SVN revision 377, I've added some oomph to the 64-bit asm-optimized sievers in the experimental branch (up to 20% speedup for 15e, 10% for 14e and 16e, a bit less for others, while all regression tests pass).[/QUOTE]
I compared experimental siever 14e from 353 to 377.

353: [code]
$ ~/ggnfs-353/bin/gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
total yield: 2800, q=20002007 ([color=red]0.11330 sec/rel[/color])
[/code]
377: [code]
$ ~/ggnfs-377/bin/gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
total yield: 2800, q=20002007 ([color=red]0.13144 sec/rel[/color])
[/code]
:sad:

Core 2 Duo (65nm) @ 3.4 GHz. Linked with MPIR 1.2.1. The polynomial is the one from this thread: [url]http://www.mersenneforum.org/showthread.php?t=12583[/url]
Try [FONT=Courier New]L1_BITS 15[/FONT] in [FONT=Courier New]piii/siever-config.h[/FONT]. Probably the cache in the C2D is smaller than the ones I am used to, which is why your test is very valuable. [If the cache is smaller than what we try to fit into it, then we get a setback rather than an acceleration.]
If still slow, try L1_BITS 14 (which is the old default). Should be no change from old versions (except for various patches), but if there is, then the builds are different. Did you use Jeff's, or both builds are yours? Thx!

P.S. The largest changes should be for 15e and 16e; the other binaries already have fairly unrolled loops. Try old and new 15e and 16e. I tried on M941. Will try on this poly as well.
[QUOTE=Batalov;194779]Did you use Jeff's, or both builds are yours?[/QUOTE]
Jeff doesn't do builds for *nix
[QUOTE=Batalov;194779]Try [FONT=Courier New]L1_BITS 15[/FONT] in [FONT=Courier New]piii/siever-config.h[/FONT]. Probably the cache in C2D is smaller than I am used to, and that's why your test is very valuable. [If the cache is smaller than we try to fit in it, then we get a setback, rather than acceleration.][/QUOTE]
Do you mean athlon64/siever-config.h? In that file, I changed L1_BITS from 16 to 15, and now this happens:
[code]
$ ./gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
SCHED_PATHOLOGY q0=20000003 k=11 excess=70
SCHED_PATHOLOGY q0=20000023 k=1 excess=134
SCHED_PATHOLOGY q0=20000023 k=10 excess=0
SCHED_PATHOLOGY q0=20000023 k=1 excess=244
SCHED_PATHOLOGY q0=20000059 k=1 excess=184
SCHED_PATHOLOGY q0=20000059 k=1 excess=388
SCHED_PATHOLOGY q0=20000059 k=2 excess=394
SCHED_PATHOLOGY q0=20000059 k=1 excess=120
SCHED_PATHOLOGY q0=20000059 k=1 excess=166
SCHED_PATHOLOGY q0=20000063 k=2 excess=14
SCHED_PATHOLOGY q0=20000081 k=1 excess=478
SCHED_PATHOLOGY q0=20000093 k=1 excess=450
SCHED_PATHOLOGY q0=20000159 k=2 excess=92
SCHED_PATHOLOGY q0=20000159 k=2 excess=454
SCHED_PATHOLOGY q0=20000171 k=1 excess=92
SCHED_PATHOLOGY q0=20000171 k=1 excess=68
SCHED_PATHOLOGY q0=20000213 k=12 excess=328
SCHED_PATHOLOGY q0=20000221 k=1 excess=116
SCHED_PATHOLOGY q0=20000243 k=1 excess=80
SCHED_PATHOLOGY q0=20000243 k=5 excess=332
SCHED_PATHOLOGY q0=20000269 k=2 excess=128
SCHED_PATHOLOGY q0=20000287 k=1 excess=412
SCHED_PATHOLOGY q0=20000297 k=3 excess=142
SCHED_PATHOLOGY q0=20000329 k=8 excess=4
SCHED_PATHOLOGY q0=20000353 k=1 excess=120
SCHED_PATHOLOGY q0=20000353 k=1 excess=70
SCHED_PATHOLOGY q0=20000353 k=1 excess=106
SCHED_PATHOLOGY q0=20000389 k=1 excess=268
SCHED_PATHOLOGY q0=20000429 k=14 excess=152
SCHED_PATHOLOGY q0=20000443 k=3 excess=36
SCHED_PATHOLOGY q0=20000471 k=7 excess=32
SCHED_PATHOLOGY q0=20000471 k=1 excess=354
SCHED_PATHOLOGY q0=20000531 k=2 excess=118
SCHED_PATHOLOGY q0=20000531 k=3 excess=166
SCHED_PATHOLOGY q0=20000531 k=1 excess=200
SCHED_PATHOLOGY q0=20000567 k=1 excess=158
SCHED_PATHOLOGY q0=20000569 k=1 excess=110
SCHED_PATHOLOGY q0=20000573 k=2 excess=502
SCHED_PATHOLOGY q0=20000573 k=14 excess=124
SCHED_PATHOLOGY q0=20000599 k=3 excess=72
SCHED_PATHOLOGY q0=20000623 k=3 excess=98
SCHED_PATHOLOGY q0=20000689 k=1 excess=242
SCHED_PATHOLOGY q0=20000693 k=8 excess=62
SCHED_PATHOLOGY q0=20000713 k=7 excess=268
SCHED_PATHOLOGY q0=20000723 k=1 excess=186
SCHED_PATHOLOGY q0=20000723 k=2 excess=108
SCHED_PATHOLOGY q0=20000753 k=1 excess=404
SCHED_PATHOLOGY q0=20000753 k=2 excess=324
SCHED_PATHOLOGY q0=20000779 k=2 excess=96
SCHED_PATHOLOGY q0=20000791 k=15 excess=130
SCHED_PATHOLOGY q0=20000801 k=2 excess=576
SCHED_PATHOLOGY q0=20000821 k=5 excess=10
SCHED_PATHOLOGY q0=20000821 k=10 excess=110
SCHED_PATHOLOGY q0=20000837 k=6 excess=22
SCHED_PATHOLOGY q0=20000839 k=1 excess=224
SCHED_PATHOLOGY q0=20000839 k=1 excess=296
SCHED_PATHOLOGY q0=20000839 k=3 excess=120
SCHED_PATHOLOGY q0=20000843 k=1 excess=270
SCHED_PATHOLOGY q0=20000843 k=1 excess=88
SCHED_PATHOLOGY q0=20000861 k=1 excess=8
SCHED_PATHOLOGY q0=20000861 k=1 excess=278
SCHED_PATHOLOGY q0=20000867 k=1 excess=500
SCHED_PATHOLOGY q0=20000867 k=13 excess=82
SCHED_PATHOLOGY q0=20000867 k=1 excess=440
SCHED_PATHOLOGY q0=20000873 k=1 excess=392
SCHED_PATHOLOGY q0=20000909 k=1 excess=552
SCHED_PATHOLOGY q0=20000917 k=2 excess=354
SCHED_PATHOLOGY q0=20000951 k=1 excess=216
SCHED_PATHOLOGY q0=20000969 k=1 excess=326
SCHED_PATHOLOGY q0=20000971 k=15 excess=86
SCHED_PATHOLOGY q0=20000971 k=3 excess=120
SCHED_PATHOLOGY q0=20000971 k=1 excess=278
SCHED_PATHOLOGY q0=20000971 k=2 excess=266
SCHED_PATHOLOGY q0=20000971 k=1 excess=94
SCHED_PATHOLOGY q0=20001001 k=14 excess=222
SCHED_PATHOLOGY q0=20001001 k=1 excess=116
SCHED_PATHOLOGY q0=20001019 k=9 excess=36
SCHED_PATHOLOGY q0=20001067 k=2 excess=396
SCHED_PATHOLOGY q0=20001073 k=1 excess=626
SCHED_PATHOLOGY q0=20001073 k=10 excess=66
SCHED_PATHOLOGY q0=20001083 k=3 excess=156
SCHED_PATHOLOGY q0=20001083 k=1 excess=534
SCHED_PATHOLOGY q0=20001083 k=2 excess=168
SCHED_PATHOLOGY q0=20001151 k=6 excess=322
SCHED_PATHOLOGY q0=20001161 k=1 excess=38
SCHED_PATHOLOGY q0=20001181 k=12 excess=88
SCHED_PATHOLOGY q0=20001181 k=1 excess=126
SCHED_PATHOLOGY q0=20001203 k=1 excess=192
SCHED_PATHOLOGY q0=20001227 k=1 excess=170
SCHED_PATHOLOGY q0=20001227 k=2 excess=38
SCHED_PATHOLOGY q0=20001239 k=8 excess=58
SCHED_PATHOLOGY q0=20001239 k=4 excess=136
SCHED_PATHOLOGY q0=20001259 k=4 excess=530
SCHED_PATHOLOGY q0=20001259 k=1 excess=314
SCHED_PATHOLOGY q0=20001259 k=1 excess=102
SCHED_PATHOLOGY q0=20001263 k=1 excess=208
SCHED_PATHOLOGY q0=20001269 k=1 excess=46
SCHED_PATHOLOGY q0=20001341 k=1 excess=84
SCHED_PATHOLOGY q0=20001341 k=1 excess=190
SCHED_PATHOLOGY q0=20001439 k=1 excess=124
SCHED_PATHOLOGY q0=20001439 k=1 excess=308
SCHED_PATHOLOGY q0=20001491 k=4 excess=62
SCHED_PATHOLOGY q0=20001551 k=3 excess=100
SCHED_PATHOLOGY q0=20001551 k=2 excess=590
SCHED_PATHOLOGY q0=20001551 k=1 excess=272
SCHED_PATHOLOGY q0=20001557 k=3 excess=162
SCHED_PATHOLOGY q0=20001613 k=4 excess=168
SCHED_PATHOLOGY q0=20001613 k=1 excess=256
SCHED_PATHOLOGY q0=20001659 k=4 excess=72
SCHED_PATHOLOGY q0=20001679 k=3 excess=56
SCHED_PATHOLOGY q0=20001679 k=1 excess=214
SCHED_PATHOLOGY q0=20001763 k=12 excess=116
SCHED_PATHOLOGY q0=20001769 k=2 excess=102
SCHED_PATHOLOGY q0=20001799 k=1 excess=236
SCHED_PATHOLOGY q0=20001811 k=3 excess=48
SCHED_PATHOLOGY q0=20001833 k=7 excess=126
SCHED_PATHOLOGY q0=20001833 k=2 excess=62
SCHED_PATHOLOGY q0=20001833 k=3 excess=496
SCHED_PATHOLOGY q0=20001847 k=1 excess=292
SCHED_PATHOLOGY q0=20001853 k=1 excess=372
SCHED_PATHOLOGY q0=20001899 k=1 excess=190
SCHED_PATHOLOGY q0=20001959 k=2 excess=66
SCHED_PATHOLOGY q0=20001959 k=2 excess=416
SCHED_PATHOLOGY q0=20001977 k=1 excess=136
SCHED_PATHOLOGY q0=20001977 k=1 excess=404
total yield: 0, q=20002007 (inf sec/rel)
[/code]
By the way, the shared cache size on this C2D is 4MB.

[QUOTE=Batalov;194779]Did you use Jeff's, or both builds are yours?[/QUOTE]
They were both mine.
In this thread I only wanted to discuss Windows builds, because I have no access to them - this is Jeff's and Brian's domain.
The asm64-bit builds are tricky -- if you change [B]L1_BITS[/B], don't forget to change [B]l1_bits[/B] in [FONT=Courier New]ls-defs.asm[/FONT] and of course clean up all .o and .a, and build all as listed in INSTALL file. Otherwise, you will get a broken build, surely.
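The sync-and-rebuild procedure can be sketched in shell. Note this is a runnable illustration, not the real tree: stand-in files are created on the spot, and their exact contents are an assumption; in a real GGNFS checkout the files to edit are athlon64/siever-config.h and athlon64/ls-defs.asm.

```shell
# Sketch: keep the C-side L1_BITS and the asm-side l1_bits in sync.
# Stand-in files are created here so the snippet runs anywhere; in the
# real GGNFS tree, edit athlon64/siever-config.h and athlon64/ls-defs.asm
# (their real contents may differ from these one-line stand-ins).
mkdir -p /tmp/lasieve-demo
cd /tmp/lasieve-demo
printf '#define L1_BITS 16\n' > siever-config.h
printf 'l1_bits equ 16\n'     > ls-defs.asm

# Change both definitions from 16 to 15.
sed -i 's/L1_BITS 16/L1_BITS 15/'         siever-config.h
sed -i 's/l1_bits equ 16/l1_bits equ 15/' ls-defs.asm

grep -h '15' siever-config.h ls-defs.asm
# prints:
#   #define L1_BITS 15
#   l1_bits equ 15

# In the real tree you would then remove every stale object and archive
# and rebuild as listed in the INSTALL file, e.g.:
#   find . \( -name '*.o' -o -name '*.a' \) -delete && make
```

The point of the final find step is exactly Batalov's warning: any .o or .a left over from the old constants links into a broken build.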
[QUOTE=Batalov;194785]In this thread I only wanted to discuss Windows builds, because I have no access to them - this is Jeff's and Brian's domain.[/QUOTE]
Where shall we discuss this then? [QUOTE=Batalov;194785]The asm64-bit builds are tricky -- if you change [B]L1_BITS[/B], don't forget to change [B]l1_bits[/B] in [FONT=Courier New]ls-defs.asm[/FONT] and of course clean up all .o and .a, and build all as listed in INSTALL file. Otherwise, you will get a broken build, surely.[/QUOTE] Yep, I was starting from a clean directory each time. I will change [b]l1_bits[/b] as you suggested next. Right now I'm running siever 15e without any changes, will report the numbers for it in a bit.
[QUOTE=jrk;194786]I'm running siever 15e without any changes, will report the numbers for it in a bit.[/QUOTE]
353: [code]
$ ~/ggnfs-353/bin/gnfs-lasieve4I15e -a 4788.2448.poly -f 20000000 -c 1000
Warning: lowering FB_bound to 19999999.
total yield: 3479, q=20001001 (0.14711 sec/rel)
[/code]
377: [code]
$ ~/ggnfs-377/bin/gnfs-lasieve4I15e -a 4788.2448.poly -f 20000000 -c 1000
Warning: lowering FB_bound to 19999999.
total yield: 3479, q=20001001 (0.18397 sec/rel)
[/code]
Apparently, for your CPU, L1_BITS 15 is better!
This is important for Greg and the NFS@Home binaries. On a Phenom 940, timings for this poly over several regions (20M, 45M, 200M) are better by a few percent with both the new 14e and 15e over the old ones. Timings for M941 are better by 10%+ (M941 was tested with 15e and 16e, on both sides). The output files are 100% consistent (to truly compare them, it is best to [FONT=Courier New]sed 's,:.*,,'[/FONT], i.e. cut off all factors and leave only the a,b pairs).
_________
P.S. With a bit of a rewrite, a 'fat' binary could be built which would carry all optimized variants inside and include a benchmark that would in turn prepare a config file, or even train itself for a specific project. The current practice is to try everything on one's own CPU and save the best binary. Same for ECM, right? I still keep two ecm binaries around (-enable/-disable-redc). Should be one in an ideal world.
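The a,b comparison above can be sketched like this. The two relation files are invented examples (real siever output has the same leading a,b followed by colon-separated factor lists, but these particular lines are made up):

```shell
# Compare two relation files on the a,b pairs only: strip everything
# from the first ':' onward, then diff. The sample files are invented;
# real siever output starts each line with "a,b:" in the same way.
printf '1234567,89:2,3,5:7,b\n-765432,11:d:1f,2b\n' > /tmp/old.rel
printf '1234567,89:2,3,5:7,b\n-765432,11:d:1f,2b\n' > /tmp/new.rel

sed 's,:.*,,' /tmp/old.rel > /tmp/old.ab
sed 's,:.*,,' /tmp/new.rel > /tmp/new.ab

cat /tmp/old.ab
# prints:
#   1234567,89
#   -765432,11

diff /tmp/old.ab /tmp/new.ab && echo 'a,b pairs identical'
# prints: a,b pairs identical
```

Stripping the factors first matters because two correct builds may report the factors of a relation in a different order, while the a,b pairs they find must match exactly.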
[QUOTE=jrk;194786]I will change [b]l1_bits[/b] as you suggested next.[/QUOTE]
Rev 377, with L1_BITS changed to 15, testing both 14e and 15e again: [code]
$ ./gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
total yield: 2800, q=20002007 (0.11304 sec/rel)

$ ./gnfs-lasieve4I15e -a 4788.2448.poly -f 20000000 -c 1000
Warning: lowering FB_bound to 19999999.
total yield: 3479, q=20001001 (0.14816 sec/rel)
[/code] Now virtually the same as 353 on this c157.
Again, that was with the athlon64 asm code.
Ok, I think I got it now. ("I learned something today", like they say in South Park.)
In terms of [B]L1[/B] data cache size, all Core 2s (Duos, Quads) and even Nehalem have 32 KB per core (= 2[sup]15[/sup] bytes). Phenoms and Opterons have 64 KB per core (= 2[sup]16[/sup]). So for Intel chips keep L1_BITS at [B]15[/B], but for AMD chips [B]16[/B] gives a bit of an edge. The L2 cache is slower (a dozen cycles of penalty), and that showed in your tests; its size doesn't matter here. Thanks, Jayson!

P.S. The i7 has a relatively fast L2 cache; it will be interesting to test.
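As a quick check of the arithmetic above (the getconf query is a standard glibc/Linux facility; whether it reports a meaningful value depends on the system):

```shell
# 2^15 bytes = 32 KB (Core 2 / Nehalem L1 data cache per core),
# 2^16 bytes = 64 KB (Phenom / Opteron L1 data cache per core).
echo "$((1 << 15))"   # prints 32768
echo "$((1 << 16))"   # prints 65536

# On Linux, the running machine's L1 data cache size can be queried,
# which is one way to pick L1_BITS without trial and error:
getconf LEVEL1_DCACHE_SIZE
```

A 32768-byte answer suggests L1_BITS 15, a 65536-byte answer suggests 16, matching the Intel/AMD split observed in the tests above.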