[QUOTE=ltd;105775]Did a new test with the extra parameters for mulmod.
Results are closer together now, but the tendency remains that mulmod runs faster if a second process (eulernet) runs in parallel. (SSE2 results) [/QUOTE] This is possible. The only sure way to get accurate timings on an HT system is to run two copies of the same program; otherwise one will get a bigger share of the CPU. The operating system can't take this into account when it reports CPU time: it always assumes they both get 50%. |
sr5sieve 1.5.0
This version attempts to find the best mulmod method to use by running some benchmarks before sieving starts. It is x86 only at this stage; I will get the x86_64 version going soon.
Run it with the -v switch to see which method it is using. The methods are named x86/N or sse2/N, where N is the number of mulmods that are interleaved. Run with the -vv option to get details of the relative speed of each method as found by the benchmarks. If you want to experiment, you can use the -B and -G switches to select a particular method. For example, `-B sse2/4' uses 4 interleaved sse2 mulmods for the baby steps routine.
The code corresponds to previous versions roughly as follows: sse2/2 as in 1.4.34; sse2/4 as in 1.4.37; sse2/8 as in 1.4.40; x86/8 as in 1.4.42.
There is some overhead created by the need to select different methods at runtime; I am not sure how this will affect different machines. The main cost is that some functions are now called through a variable pointer instead of being coded inline. Any benchmarks comparing version 1.5.0 to whichever 1.4.x version was fastest for your machine would be useful, especially if 1.5.0 is slower. |
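To illustrate the runtime selection described above, here is a minimal C sketch. It is not the actual sr5sieve code: the names `mulmod_vec_1`, `mulmod_vec_4`, `choose_method` and the method labels `generic/1` and `generic/4` are hypothetical, and a portable `(a * b) % p` stands in for the hand-written x86/SSE2 assembly. It only shows the idea of timing several interleaving depths once at startup and then calling the winner through a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature: multiply each a[i] by b modulo p, in place. */
typedef void (*mulmod_vec_fn)(uint64_t *a, size_t n, uint64_t b, uint64_t p);

/* Portable stand-in for an assembly mulmod. Assumes a, b < 2^32 so the
   product fits in 64 bits. */
static inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (a * b) % p;
}

/* One at a time: each multiplication finishes before the next one starts. */
static void mulmod_vec_1(uint64_t *a, size_t n, uint64_t b, uint64_t p)
{
    assert(p != 0 && p <= UINT32_MAX);
    for (size_t i = 0; i < n; i++)
        a[i] = mulmod(a[i], b, p);
}

/* Four interleaved: four independent multiplications per iteration, which
   gives the CPU independent work to overlap. */
static void mulmod_vec_4(uint64_t *a, size_t n, uint64_t b, uint64_t p)
{
    assert(p != 0 && p <= UINT32_MAX);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t r0 = mulmod(a[i],     b, p);
        uint64_t r1 = mulmod(a[i + 1], b, p);
        uint64_t r2 = mulmod(a[i + 2], b, p);
        uint64_t r3 = mulmod(a[i + 3], b, p);
        a[i] = r0; a[i + 1] = r1; a[i + 2] = r2; a[i + 3] = r3;
    }
    for (; i < n; i++)              /* leftover elements */
        a[i] = mulmod(a[i], b, p);
}

struct method { const char *name; mulmod_vec_fn fn; };

static const struct method methods[] = {
    { "generic/1", mulmod_vec_1 },
    { "generic/4", mulmod_vec_4 },
};
#define NUM_METHODS (sizeof methods / sizeof methods[0])

/* Best (smallest) clock() tick count over a few trials of one method. */
static clock_t best_time(mulmod_vec_fn fn)
{
    enum { N = 1024, TRIALS = 5, REPS = 200 };
    static uint64_t buf[N];
    clock_t best = 0;
    for (int t = 0; t < TRIALS; t++) {
        for (size_t i = 0; i < N; i++)
            buf[i] = i + 1;
        clock_t start = clock();
        for (int r = 0; r < REPS; r++)
            fn(buf, N, 3, 1000003);
        clock_t elapsed = clock() - start;
        if (t == 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Benchmark every method once and return the fastest one. */
static const struct method *choose_method(void)
{
    const struct method *best = &methods[0];
    clock_t best_ticks = best_time(best->fn);
    for (size_t i = 1; i < NUM_METHODS; i++) {
        clock_t ticks = best_time(methods[i].fn);
        if (ticks < best_ticks) {
            best_ticks = ticks;
            best = &methods[i];
        }
    }
    return best;
}
```

A caller would run `const struct method *m = choose_method();` once before sieving and then use `m->fn(a, n, b, p)` in the inner loop. The indirect call is the overhead mentioned above; the benefit is that the interleaving depth can differ per machine without rebuilding.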
[QUOTE=geoff;105855]This is possible. The only sure way to get accurate timings on an HT system is to run two copies of the same program; otherwise one will get a bigger share of the CPU. The operating system can't take this into account when it reports CPU time: it always assumes they both get 50%.[/QUOTE]
Sorry, I was confused. Getting a time faster than when idle is not normal, even on a HT machine. |
[B]C2D E6600 @ 3GHz[/B] - it seems that [i]sr2sieve-amd[/i] is faster in case of giant-step method.
BTW: both executables do not detect L2 cache size, assuming default value of 256kB.
[code]sr2sieve-intel -l 32 -L 2048 -vv
sr2sieve 1.5.0 -- A sieve for multiple sequences k*b^n+/-1.
Using SSE2 code path, L1 data cache 32Kb (supplied), L2 cache 2048Kb (supplied).
Read 493896 terms for 9 sequences from ABCD format file `sr2data.txt'.
Split 9 base 2 sequences into 649 base 2^180 subsequences.
Loaded Legendre symbol lookup tables for 9 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.23.
Best time for baby step method sse2/2: 37494.
Best time for baby step method sse2/4: 25164.
Best time for baby step method sse2/8: 22032.
Best time for baby step method sse2/16: 21798.
Best time for baby step method x86/1: 43992.
Best time for baby step method x86/2: 33750.
Best time for baby step method x86/4: 28314.
Best time for baby step method x86/8: 26793.
Best time for giant step method sse2/2: 27333.
Best time for giant step method sse2/4: 20862.
[B]Best time for giant step method sse2/8: 19116.
Best time for giant step method sse2/16: 19107.[/B]
Best time for giant step method x86/1: 27171.
Best time for giant step method x86/2: 25272.
Best time for giant step method x86/4: 24228.
Best time for giant step method x86/8: 24300.
Best time for ladder method sse2/2: 5076.
Best time for ladder method sse2/4: 2880.
Best time for ladder method sse2/8: 2367.
Best time for ladder method sse2/16: 2655.
Best time for ladder method x86/1: 8028.
Best time for ladder method x86/2: 4194.
Best time for ladder method x86/4: 3249.
Best time for ladder method x86/8: 3276.
Best time for ladder method add/1: 10539.
Using baby step method sse2/16, giant step method sse2/16, ladder method sse2/8.
Resuming from checkpoint pmin=5141372131943 in `checkpoint.txt'.
Using 1024Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 1705.31 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 5141372131943 <= p <= 5200000000000
p=5141376065113, 1603412 p/sec, 1420 factors, 88.28% done, 193 sec/factor[/code]
[code]sr2sieve-amd -l 32 -L 2048 -vv
sr2sieve 1.5.0 -- A sieve for multiple sequences k*b^n+/-1.
Using SSE2 code path, L1 data cache 32Kb (supplied), L2 cache 2048Kb (supplied).
Read 493896 terms for 9 sequences from ABCD format file `sr2data.txt'.
Split 9 base 2 sequences into 649 base 2^180 subsequences.
Loaded Legendre symbol lookup tables for 9 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.23.
Best time for baby step method sse2/2: 38205.
Best time for baby step method sse2/4: 25704.
Best time for baby step method sse2/8: 22968.
Best time for baby step method sse2/16: 22527.
Best time for baby step method x86/1: 45027.
Best time for baby step method x86/2: 33993.
Best time for baby step method x86/4: 29457.
Best time for baby step method x86/8: 27027.
Best time for giant step method sse2/2: 27756.
Best time for giant step method sse2/4: 20709.
[B]Best time for giant step method sse2/8: 18189.
Best time for giant step method sse2/16: 19080.[/B]
Best time for giant step method x86/1: 26883.
Best time for giant step method x86/2: 25164.
Best time for giant step method x86/4: 24129.
Best time for giant step method x86/8: 24300.
Best time for ladder method sse2/2: 5085.
Best time for ladder method sse2/4: 2889.
Best time for ladder method sse2/8: 2367.
Best time for ladder method sse2/16: 2655.
Best time for ladder method x86/1: 7884.
Best time for ladder method x86/2: 4167.
Best time for ladder method x86/4: 3267.
Best time for ladder method x86/8: 3303.
Best time for ladder method add/1: 10494.
Using baby step method sse2/16, giant step method sse2/8, ladder method sse2/8.
Resuming from checkpoint pmin=5141013121169 in `checkpoint.txt'.
Using 1024Kb for the Sieve of Eratosthenes bitmap.
sr2sieve started: 1000000 <= n <= 1999997, 5141013121169 <= p <= 5200000000000
p=5141297681683, 1587250 p/sec, 1419 factors, 88.26% done, 195 sec/factor[/code] |
[B]CeleronM @ 1.5GHz[/B] - there are also some minor differences between the intel and amd binaries, where the amd executable would be faster
[code]sr2sieve-intel -vv
sr2sieve 1.5.0 -- A sieve for multiple sequences k*b^n+/-1.
Using SSE2 code path, L1 data cache 32Kb (detected), L2 cache 1024Kb (detected).
Read 493896 terms for 9 sequences from ABCD format file `sr2data.txt'.
Split 9 base 2 sequences into 649 base 2^180 subsequences.
Loaded Legendre symbol lookup tables for 9 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.23.
Best time for baby step method sse2/2: 46722.
Best time for baby step method sse2/4: 45557.
Best time for baby step method sse2/8: 45008.
Best time for baby step method sse2/16: 44633.
Best time for baby step method x86/1: 49756.
Best time for baby step method x86/2: 39744.
Best time for baby step method x86/4: 37841.
Best time for baby step method x86/8: 36608.
Best time for giant step method sse2/2: 39338.
Best time for giant step method sse2/4: 38509.
Best time for giant step method sse2/8: 37383.
Best time for giant step method sse2/16: 38928.
Best time for giant step method x86/1: 32374.
Best time for giant step method x86/2: 35539.
[b]Best time for giant step method x86/4: 29850.[/b]
Best time for giant step method x86/8: 29863.
Best time for ladder method sse2/2: 6440.
Best time for ladder method sse2/4: 6115.
Best time for ladder method sse2/8: 6151.
Best time for ladder method sse2/16: 6416.
Best time for ladder method x86/1: 8563.
Best time for ladder method x86/2: 4855.
[b]Best time for ladder method x86/4: 4336.[/b]
Best time for ladder method x86/8: 4473.
Best time for ladder method add/1: 11120.
Using baby step method x86/8, giant step method x86/4, ladder method x86/4.
Resuming from checkpoint pmin=5141810132443 in `checkpoint.txt'.
Using 512Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 1705.31 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 5141810132443 <= p <= 5200000000000
p=5141843557339, 558384 p/sec, 1421 factors, 88.37% done, 556 sec/factor[/code]
[code]sr2sieve-amd -vv
sr2sieve 1.5.0 -- A sieve for multiple sequences k*b^n+/-1.
Using SSE2 code path, L1 data cache 32Kb (detected), L2 cache 1024Kb (detected).
Read 493896 terms for 9 sequences from ABCD format file `sr2data.txt'.
Split 9 base 2 sequences into 649 base 2^180 subsequences.
Loaded Legendre symbol lookup tables for 9 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.23.
Best time for baby step method sse2/2: 47496.
Best time for baby step method sse2/4: 45853.
Best time for baby step method sse2/8: 46046.
Best time for baby step method sse2/16: 46919.
Best time for baby step method x86/1: 52386.
Best time for baby step method x86/2: 40618.
Best time for baby step method x86/4: 38537.
Best time for baby step method x86/8: 37051.
Best time for giant step method sse2/2: 39347.
Best time for giant step method sse2/4: 38099.
Best time for giant step method sse2/8: 36258.
Best time for giant step method sse2/16: 39071.
Best time for giant step method x86/1: 32979.
Best time for giant step method x86/2: 30831.
[b]Best time for giant step method x86/4: 29744.[/b]
Best time for giant step method x86/8: 30012.
Best time for ladder method sse2/2: 6481.
Best time for ladder method sse2/4: 6102.
Best time for ladder method sse2/8: 6108.
Best time for ladder method sse2/16: 6469.
Best time for ladder method x86/1: 8564.
Best time for ladder method x86/2: 4859.
[b]Best time for ladder method x86/4: 4323.[/b]
Best time for ladder method x86/8: 4471.
Best time for ladder method add/1: 11139.
Using baby step method x86/8, giant step method x86/4, ladder method x86/4.
Resuming from checkpoint pmin=5141861039863 in `checkpoint.txt'.
Using 512Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 1705.31 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 5141861039863 <= p <= 5200000000000
p=5141878341563, 556718 p/sec, 1422 factors, 88.38% done, 558 sec/factor[/code] |
[B]A64 3400+ @ 2.4GHz[/B] - the intel binary is slower, without exception, on the A64
[code]sr2sieve-amd -vv
sr2sieve 1.5.0 -- A sieve for multiple sequences k*b^n+/-1.
Using SSE2 code path, L1 data cache 64Kb (detected), L2 cache 512Kb (detected).
Read 493896 terms for 9 sequences from ABCD format file `sr2data.txt'.
Split 9 base 2 sequences into 649 base 2^180 subsequences.
Using 32 Kb for the baby-steps giant-steps hashtable, maximum density 0.11.
Best time for baby step method sse2/2: 42030.
Best time for baby step method sse2/4: 33084.
Best time for baby step method sse2/8: 32145.
Best time for baby step method sse2/16: 33230.
Best time for baby step method x86/1: 46833.
Best time for baby step method x86/2: 43472.
Best time for baby step method x86/4: 35132.
Best time for baby step method x86/8: 28691.
Best time for giant step method sse2/2: 34313.
Best time for giant step method sse2/4: 31106.
Best time for giant step method sse2/8: 30402.
Best time for giant step method sse2/16: 31219.
Best time for giant step method x86/1: 37890.
Best time for giant step method x86/2: 29052.
Best time for giant step method x86/4: 26605.
Best time for giant step method x86/8: 26959.
Best time for ladder method sse2/2: 5855.
Best time for ladder method sse2/4: 4119.
Best time for ladder method sse2/8: 4082.
Best time for ladder method sse2/16: 4444.
Best time for ladder method x86/1: 8348.
Best time for ladder method x86/2: 5103.
Best time for ladder method x86/4: 3681.
Best time for ladder method x86/8: 3350.
Best time for ladder method add/1: 11125.
Using baby step method x86/8, giant step method x86/4, ladder method x86/8.
Resuming from checkpoint pmin=5140056163097 in `checkpoint.txt'.
Using 256Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 1705.31 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 5140056163097 <= p <= 5200000000000
p=5141150746079, 1041479 p/sec, 1419 factors, 88.23% done, 298 sec/factor[/code] |
[QUOTE=Cruelty;105868][B]C2D E6600 @ 3GHz[/B] - it seems that [i]sr2sieve-amd[/i] is faster in case of giant-step method.
BTW: both executables do not detect L2 cache size, assuming default value of 256kB.[/QUOTE] Thanks, I'll look into the cache detection for this machine. It is an unfortunate feature of the Intel cpuid scheme that when new models come out it is necessary to update the source code to detect the cache size properly. |
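As a rough illustration of why a new CPU model can defeat this kind of detection, here is a small C sketch of a descriptor-table lookup. It is an assumption about the general shape of such code, not the sr2sieve source: the function `l2_size_kb` and the table `known_l2` are hypothetical, and the descriptor bytes shown are a partial list recalled from Intel's CPUID leaf 2 documentation that should be checked against the current manual. Any byte missing from the table falls through to the 256 KB default mentioned in the thread.

```c
#include <stddef.h>

#define DEFAULT_L2_KB 256   /* fallback when no known descriptor is found */

struct l2_descriptor {
    unsigned char code;     /* descriptor byte reported by the CPU */
    unsigned size_kb;       /* L2 cache size it stands for */
};

/* Partial, illustrative table; verify the values against Intel's
   documentation before relying on them. */
static const struct l2_descriptor known_l2[] = {
    { 0x7A,  256 },
    { 0x7B,  512 },
    { 0x7C, 1024 },
    { 0x7D, 2048 },
};

/* Return the L2 size for the first recognised descriptor byte, or the
   default if none of the bytes is in the table. */
static unsigned l2_size_kb(const unsigned char *desc, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < sizeof known_l2 / sizeof known_l2[0]; j++)
            if (desc[i] == known_l2[j].code)
                return known_l2[j].size_kb;
    return DEFAULT_L2_KB;   /* unknown model: table needs a new entry */
}
```

With this layout, supporting a new model means adding one row to the table, which is why a source update is needed each time; until then the -L switch can supply the size by hand.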
sr5sieve 1.5.1
This version brings the x86-64 build up to date with the changes in 1.5.0. There is new mulmod code which hasn't been tested yet.
|
What about sr1sieve? :rolleyes:
|
sr5sieve 1.5.2
This version fixes a segfault that can occur if the -B switch is used without the -G switch.
It also updates the Intel cache detection code. Cruelty: can you check whether this version works on your C2D? |
sr1sieve 1.1.0
This version uses the benchmarking code to test which mulmod routines to use before sieving. Use -v or -vv to see the details as with sr5sieve 1.5.0.
Because the number of subsequences is usually much smaller with sr1sieve, the effect of this code is much more variable, especially with very lightweight sequences. It may take some experimenting with the -G switch to get the best results. |