mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Running GGNFS (https://www.mersenneforum.org/showthread.php?t=9645)

jrk 2009-11-04 06:33

[QUOTE=Batalov;194773]In SVN revision 377, I've added some oomph to the 64-bit asm-optimized sievers in the experimental branch. (up to 20% speedup for 15e, 10% for 14e, 16e, a bit less for others, while all retrogression tests hold.) [/QUOTE]

I compared experimental siever 14e from 353 to 377.

353:
[code]$ ~/ggnfs-353/bin/gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
total yield: 2800, q=20002007 ([color=red]0.11330 sec/rel[/color])
[/code]

377:
[code]$ ~/ggnfs-377/bin/gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
total yield: 2800, q=20002007 ([color=red]0.13144 sec/rel[/color])
[/code]

:sad:

Core 2 Duo (65nm) @ 3.4 GHz. Linked with MPIR 1.2.1. Polynomial is the one from this thread: [url]http://www.mersenneforum.org/showthread.php?t=12583[/url]
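For reference, the relative change between the two 14e timings above can be worked out with a few lines of Python. This is only a sketch; the function name `slowdown_percent` is made up for illustration and is not part of GGNFS.

```python
def slowdown_percent(old_sec_per_rel: float, new_sec_per_rel: float) -> float:
    """Percent change in sec/rel from old to new (positive means slower)."""
    return (new_sec_per_rel / old_sec_per_rel - 1.0) * 100.0


# 14e timings quoted in the post above: rev 353 vs rev 377.
change = slowdown_percent(0.11330, 0.13144)
print(f"14e, rev 377 vs rev 353: {change:+.1f}%")
```

On these numbers the rev 377 build comes out roughly 16% slower per relation on this machine.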

Batalov 2009-11-04 06:42

Try [FONT=Courier New]L1_BITS 15[/FONT] in [FONT=Courier New]piii/siever-config.h[/FONT]. Probably the cache in C2D is smaller than I am used to, and that's why your test is very valuable. [If the cache is smaller than we try to fit in it, then we get a setback, rather than acceleration.]

If still slow, try L1_BITS 14 (which is the old default). Should be no change from old versions (except for various patches), but if there is, then the builds are different. Did you use Jeff's, or both builds are yours?

Thx!

P.S. The largest changes should be for 15e and 16e, the other binaries already have fairly unrolled loops. Try old and new 15e and 16e. I tried on M941. Will try on this poly as well.

axn 2009-11-04 06:52

[QUOTE=Batalov;194779]Did you use Jeff's, or both builds are yours?[/QUOTE]

Jeff doesn't do builds for *nix.

jrk 2009-11-04 07:13

[QUOTE=Batalov;194779]Try [FONT=Courier New]L1_BITS 15[/FONT] in [FONT=Courier New]piii/siever-config.h[/FONT]. Probably the cache in C2D is smaller than I am used to, and that's why your test is very valuable. [If the cache is smaller than we try to fit in it, then we get a setback, rather than acceleration.][/QUOTE]

Do you mean athlon64/siever-config.h ?

In that file, I changed L1_BITS from 16 to 15, and now this happens:

[code]$ ./gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
SCHED_PATHOLOGY q0=20000003 k=11 excess=70
SCHED_PATHOLOGY q0=20000023 k=1 excess=134
SCHED_PATHOLOGY q0=20000023 k=10 excess=0
SCHED_PATHOLOGY q0=20000023 k=1 excess=244
SCHED_PATHOLOGY q0=20000059 k=1 excess=184
SCHED_PATHOLOGY q0=20000059 k=1 excess=388
SCHED_PATHOLOGY q0=20000059 k=2 excess=394
SCHED_PATHOLOGY q0=20000059 k=1 excess=120
SCHED_PATHOLOGY q0=20000059 k=1 excess=166
SCHED_PATHOLOGY q0=20000063 k=2 excess=14
SCHED_PATHOLOGY q0=20000081 k=1 excess=478
SCHED_PATHOLOGY q0=20000093 k=1 excess=450
SCHED_PATHOLOGY q0=20000159 k=2 excess=92
SCHED_PATHOLOGY q0=20000159 k=2 excess=454
SCHED_PATHOLOGY q0=20000171 k=1 excess=92
SCHED_PATHOLOGY q0=20000171 k=1 excess=68
SCHED_PATHOLOGY q0=20000213 k=12 excess=328
SCHED_PATHOLOGY q0=20000221 k=1 excess=116
SCHED_PATHOLOGY q0=20000243 k=1 excess=80
SCHED_PATHOLOGY q0=20000243 k=5 excess=332
SCHED_PATHOLOGY q0=20000269 k=2 excess=128
SCHED_PATHOLOGY q0=20000287 k=1 excess=412
SCHED_PATHOLOGY q0=20000297 k=3 excess=142
SCHED_PATHOLOGY q0=20000329 k=8 excess=4
SCHED_PATHOLOGY q0=20000353 k=1 excess=120
SCHED_PATHOLOGY q0=20000353 k=1 excess=70
SCHED_PATHOLOGY q0=20000353 k=1 excess=106
SCHED_PATHOLOGY q0=20000389 k=1 excess=268
SCHED_PATHOLOGY q0=20000429 k=14 excess=152
SCHED_PATHOLOGY q0=20000443 k=3 excess=36
SCHED_PATHOLOGY q0=20000471 k=7 excess=32
SCHED_PATHOLOGY q0=20000471 k=1 excess=354
SCHED_PATHOLOGY q0=20000531 k=2 excess=118
SCHED_PATHOLOGY q0=20000531 k=3 excess=166
SCHED_PATHOLOGY q0=20000531 k=1 excess=200
SCHED_PATHOLOGY q0=20000567 k=1 excess=158
SCHED_PATHOLOGY q0=20000569 k=1 excess=110
SCHED_PATHOLOGY q0=20000573 k=2 excess=502
SCHED_PATHOLOGY q0=20000573 k=14 excess=124
SCHED_PATHOLOGY q0=20000599 k=3 excess=72
SCHED_PATHOLOGY q0=20000623 k=3 excess=98
SCHED_PATHOLOGY q0=20000689 k=1 excess=242
SCHED_PATHOLOGY q0=20000693 k=8 excess=62
SCHED_PATHOLOGY q0=20000713 k=7 excess=268
SCHED_PATHOLOGY q0=20000723 k=1 excess=186
SCHED_PATHOLOGY q0=20000723 k=2 excess=108
SCHED_PATHOLOGY q0=20000753 k=1 excess=404
SCHED_PATHOLOGY q0=20000753 k=2 excess=324
SCHED_PATHOLOGY q0=20000779 k=2 excess=96
SCHED_PATHOLOGY q0=20000791 k=15 excess=130
SCHED_PATHOLOGY q0=20000801 k=2 excess=576
SCHED_PATHOLOGY q0=20000821 k=5 excess=10
SCHED_PATHOLOGY q0=20000821 k=10 excess=110
SCHED_PATHOLOGY q0=20000837 k=6 excess=22
SCHED_PATHOLOGY q0=20000839 k=1 excess=224
SCHED_PATHOLOGY q0=20000839 k=1 excess=296
SCHED_PATHOLOGY q0=20000839 k=3 excess=120
SCHED_PATHOLOGY q0=20000843 k=1 excess=270
SCHED_PATHOLOGY q0=20000843 k=1 excess=88
SCHED_PATHOLOGY q0=20000861 k=1 excess=8
SCHED_PATHOLOGY q0=20000861 k=1 excess=278
SCHED_PATHOLOGY q0=20000867 k=1 excess=500
SCHED_PATHOLOGY q0=20000867 k=13 excess=82
SCHED_PATHOLOGY q0=20000867 k=1 excess=440
SCHED_PATHOLOGY q0=20000873 k=1 excess=392
SCHED_PATHOLOGY q0=20000909 k=1 excess=552
SCHED_PATHOLOGY q0=20000917 k=2 excess=354
SCHED_PATHOLOGY q0=20000951 k=1 excess=216
SCHED_PATHOLOGY q0=20000969 k=1 excess=326
SCHED_PATHOLOGY q0=20000971 k=15 excess=86
SCHED_PATHOLOGY q0=20000971 k=3 excess=120
SCHED_PATHOLOGY q0=20000971 k=1 excess=278
SCHED_PATHOLOGY q0=20000971 k=2 excess=266
SCHED_PATHOLOGY q0=20000971 k=1 excess=94
SCHED_PATHOLOGY q0=20001001 k=14 excess=222
SCHED_PATHOLOGY q0=20001001 k=1 excess=116
SCHED_PATHOLOGY q0=20001019 k=9 excess=36
SCHED_PATHOLOGY q0=20001067 k=2 excess=396
SCHED_PATHOLOGY q0=20001073 k=1 excess=626
SCHED_PATHOLOGY q0=20001073 k=10 excess=66
SCHED_PATHOLOGY q0=20001083 k=3 excess=156
SCHED_PATHOLOGY q0=20001083 k=1 excess=534
SCHED_PATHOLOGY q0=20001083 k=2 excess=168
SCHED_PATHOLOGY q0=20001151 k=6 excess=322
SCHED_PATHOLOGY q0=20001161 k=1 excess=38
SCHED_PATHOLOGY q0=20001181 k=12 excess=88
SCHED_PATHOLOGY q0=20001181 k=1 excess=126
SCHED_PATHOLOGY q0=20001203 k=1 excess=192
SCHED_PATHOLOGY q0=20001227 k=1 excess=170
SCHED_PATHOLOGY q0=20001227 k=2 excess=38
SCHED_PATHOLOGY q0=20001239 k=8 excess=58
SCHED_PATHOLOGY q0=20001239 k=4 excess=136
SCHED_PATHOLOGY q0=20001259 k=4 excess=530
SCHED_PATHOLOGY q0=20001259 k=1 excess=314
SCHED_PATHOLOGY q0=20001259 k=1 excess=102
SCHED_PATHOLOGY q0=20001263 k=1 excess=208
SCHED_PATHOLOGY q0=20001269 k=1 excess=46
SCHED_PATHOLOGY q0=20001341 k=1 excess=84
SCHED_PATHOLOGY q0=20001341 k=1 excess=190
SCHED_PATHOLOGY q0=20001439 k=1 excess=124
SCHED_PATHOLOGY q0=20001439 k=1 excess=308
SCHED_PATHOLOGY q0=20001491 k=4 excess=62
SCHED_PATHOLOGY q0=20001551 k=3 excess=100
SCHED_PATHOLOGY q0=20001551 k=2 excess=590
SCHED_PATHOLOGY q0=20001551 k=1 excess=272
SCHED_PATHOLOGY q0=20001557 k=3 excess=162
SCHED_PATHOLOGY q0=20001613 k=4 excess=168
SCHED_PATHOLOGY q0=20001613 k=1 excess=256
SCHED_PATHOLOGY q0=20001659 k=4 excess=72
SCHED_PATHOLOGY q0=20001679 k=3 excess=56
SCHED_PATHOLOGY q0=20001679 k=1 excess=214
SCHED_PATHOLOGY q0=20001763 k=12 excess=116
SCHED_PATHOLOGY q0=20001769 k=2 excess=102
SCHED_PATHOLOGY q0=20001799 k=1 excess=236
SCHED_PATHOLOGY q0=20001811 k=3 excess=48
SCHED_PATHOLOGY q0=20001833 k=7 excess=126
SCHED_PATHOLOGY q0=20001833 k=2 excess=62
SCHED_PATHOLOGY q0=20001833 k=3 excess=496
SCHED_PATHOLOGY q0=20001847 k=1 excess=292
SCHED_PATHOLOGY q0=20001853 k=1 excess=372
SCHED_PATHOLOGY q0=20001899 k=1 excess=190
SCHED_PATHOLOGY q0=20001959 k=2 excess=66
SCHED_PATHOLOGY q0=20001959 k=2 excess=416
SCHED_PATHOLOGY q0=20001977 k=1 excess=136
SCHED_PATHOLOGY q0=20001977 k=1 excess=404
total yield: 0, q=20002007 (inf sec/rel)
[/code]

By the way, the shared cache size on this C2D is 4MB.

[QUOTE=Batalov;194779]Did you use Jeff's, or both builds are yours?[/QUOTE]
They were both mine.

Batalov 2009-11-04 07:21

In this thread I only wanted to discuss Windows builds, because I have no access to them - this is Jeff's and Brian's domain.

The asm64-bit builds are tricky -- if you change [B]L1_BITS[/B], don't forget to change [B]l1_bits[/B] in [FONT=Courier New]ls-defs.asm[/FONT] and of course clean up all .o and .a, and build all as listed in INSTALL file. Otherwise, you will get a broken build, surely.
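A small pre-build check along these lines could catch a mismatch between the two settings. This is a hedged sketch, not part of GGNFS: it assumes the header defines the value as `#define L1_BITS <n>` and the asm file uses an m4-style `define(l1_bits,<n>)`. Both patterns are guesses and should be adjusted to whatever the real files contain.

```python
import re
from typing import Tuple

# ASSUMED file syntax (not verified against the GGNFS sources):
#   siever-config.h :  #define L1_BITS 15
#   ls-defs.asm     :  define(l1_bits,15)
C_PATTERN = re.compile(r"^\s*#\s*define\s+L1_BITS\s+(\d+)", re.MULTILINE)
ASM_PATTERN = re.compile(r"define\(\s*l1_bits\s*,\s*(\d+)\s*\)")


def l1_bits_values(header_text: str, asm_text: str) -> Tuple[int, int]:
    """Extract the (C, asm) settings from the text of the two files."""
    c_match = C_PATTERN.search(header_text)
    asm_match = ASM_PATTERN.search(asm_text)
    if c_match is None or asm_match is None:
        raise ValueError("could not find an L1_BITS / l1_bits definition")
    return int(c_match.group(1)), int(asm_match.group(1))


def l1_bits_consistent(header_text: str, asm_text: str) -> bool:
    """True when both files agree on the value."""
    c_value, asm_value = l1_bits_values(header_text, asm_text)
    return c_value == asm_value


# Hypothetical file contents, for illustration only.
print(l1_bits_consistent("#define L1_BITS 15\n", "define(l1_bits,16)dnl\n"))
```

The example prints `False` because the header was changed to 15 while the asm definition still says 16, which is the situation described above.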

jrk 2009-11-04 07:27

[QUOTE=Batalov;194785]In this thread I only wanted to discuss Windows builds, because I have no access to them - this is Jeff's and Brian's domain.[/QUOTE]
Where shall we discuss this then?

[QUOTE=Batalov;194785]The asm64-bit builds are tricky -- if you change [B]L1_BITS[/B], don't forget to change [B]l1_bits[/B] in [FONT=Courier New]ls-defs.asm[/FONT] and of course clean up all .o and .a, and build all as listed in INSTALL file. Otherwise, you will get a broken build, surely.[/QUOTE]
Yep, I was starting from a clean directory each time.

I will change [b]l1_bits[/b] as you suggested next. Right now I'm running siever 15e without any changes, will report the numbers for it in a bit.

jrk 2009-11-04 07:39

[QUOTE=jrk;194786]I'm running siever 15e without any changes, will report the numbers for it in a bit.[/QUOTE]

353:
[code]$ ~/ggnfs-353/bin/gnfs-lasieve4I15e -a 4788.2448.poly -f 20000000 -c 1000
Warning: lowering FB_bound to 19999999.
total yield: 3479, q=20001001 (0.14711 sec/rel)
[/code]

377:
[code]$ ~/ggnfs-377/bin/gnfs-lasieve4I15e -a 4788.2448.poly -f 20000000 -c 1000
Warning: lowering FB_bound to 19999999.
total yield: 3479, q=20001001 (0.18397 sec/rel)[/code]

Batalov 2009-11-04 07:54

Apparently, for your CPU, L1_BITS 15 is better!
This is important for Greg and NFS@HOME binaries.

On Phenom 940, timings for this poly on several regions (20M, 45M, 200M) are better by a few percent with both new 14e and 15e over old ones.
Timings for M941 are better by 10%+ (M941 was tested with 15e, 16e and on both sides). The output files are 100% consistent (to truly compare them, it is best to [FONT=Courier New]sed 's,:.*,,'[/FONT] i.e. cut off all factors, leave only a,b).
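The same comparison can be sketched in Python: keep only the text before the first colon on each line (the a,b pair), as the sed expression does, and compare the results. The function names and the sample relation lines are made up for illustration.

```python
def ab_pairs(lines):
    """Return the set of 'a,b' prefixes, dropping everything from the first ':'."""
    return {line.split(":", 1)[0].strip() for line in lines if line.strip()}


def same_relations(old_lines, new_lines):
    """True when both outputs contain the same a,b pairs."""
    return ab_pairs(old_lines) == ab_pairs(new_lines)


# Made-up relation lines: same a,b pairs, factors listed differently.
old_output = ["-12345,67:1a2b,3c:4d,5e", "890,11:7,b:d,11"]
new_output = ["890,11:b,7:11,d", "-12345,67:3c,1a2b:5e,4d"]
print(same_relations(old_output, new_output))
```

Comparing sets ignores line order and duplicate lines. That is a choice made in this sketch; running sed and then `diff` on unsorted files would be sensitive to order.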
_________

P.S. With a bit of a rewrite, a 'thick' binary could be built that has all the optimized variants inside and includes a benchmark that would in turn prepare a config file, or even tune itself for a specific project. The current recipe is to try everything on one's own CPU and keep the best binary.
Same for ECM, right? I still keep two ecm binaries around (-enable/-disable-redc). Should be one in an ideal world.

jrk 2009-11-04 08:28

[QUOTE=jrk;194786]I will change [b]l1_bits[/b] as you suggested next.[/QUOTE]

Rev 377 with L1_BITS changed to 15, testing both 14e and 15e again:

[code]
$ ./gnfs-lasieve4I14e -a 4788.2448.poly -f 20000000 -c 2000
Warning: lowering FB_bound to 19999999.
total yield: 2800, q=20002007 (0.11304 sec/rel)
$ ./gnfs-lasieve4I15e -a 4788.2448.poly -f 20000000 -c 1000
Warning: lowering FB_bound to 19999999.
total yield: 3479, q=20001001 (0.14816 sec/rel)
[/code]
Now virtually the same as 353 on this c157.
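When comparing many builds, the summary line can be parsed rather than read by eye. A minimal sketch, assuming the line format seen in the outputs in this thread (`total yield: N, q=Q (T sec/rel)`); the regex and the function name `parse_summary` are illustrative only.

```python
import re

# Matches lines such as: total yield: 2800, q=20002007 (0.11304 sec/rel)
SUMMARY = re.compile(r"total yield:\s*(\d+),\s*q=(\d+)\s*\(\s*(\S+)\s+sec/rel\)")


def parse_summary(line):
    """Return (yield, last_q, sec_per_rel), or None if the line does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    # float() also accepts "inf", which the siever prints when the yield is 0.
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


print(parse_summary("total yield: 2800, q=20002007 (0.11304 sec/rel)"))
print(parse_summary("total yield: 0, q=20002007 (inf sec/rel)"))
```

A zero yield with `inf sec/rel`, as in the broken build earlier in the thread, is then easy to flag automatically.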

jrk 2009-11-04 08:31

Again, that was with the athlon64 asm code.

Batalov 2009-11-04 09:35

Ok, I think I got it now. ("I learned something today", like they say in South Park.)

In terms of [B]L1[/B] data cache size, all Core 2s (Duos, Quads) and even Nehalem have 32 KB per core (=2[sup]15[/sup] bytes). Phenoms and Opterons have 64 KB per core (=2[sup]16[/sup] bytes).
So, for Intel chips, keep L1_BITS at [B]15[/B], but for AMD chips, [B]16[/B] gives a bit of an edge. L2 cache is slower (a penalty of a dozen or so cycles), and that showed in your tests; its size doesn't matter.
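The rule of thumb above amounts to taking the base-2 logarithm of the per-core L1 data cache size in bytes. A small sketch of that arithmetic; the function name is hypothetical and this is not code from GGNFS.

```python
def l1_bits_for_cache(l1d_bytes: int) -> int:
    """Largest n such that 2**n bytes fit in an L1 data cache of l1d_bytes."""
    if l1d_bytes <= 0:
        raise ValueError("cache size must be positive")
    return l1d_bytes.bit_length() - 1


print(l1_bits_for_cache(32 * 1024))  # 32 KB L1d (Core 2, Nehalem)
print(l1_bits_for_cache(64 * 1024))  # 64 KB L1d (Phenom, Opteron)
```

This gives 15 for a 32 KB cache and 16 for a 64 KB cache, matching the values suggested for Intel and AMD chips respectively.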

Thanks, Jayson!

P.S. i7 has a relatively fast L2 cache; it would be interesting to test.


All times are UTC.