2020-04-25, 21:33   #10
rogue

I have looked at the various mulmod routines available to me in x86 ASM to determine their relative speed. The i386 and sse ASM functions used by srXsieve will not compile with -m64, and since nobody (and I mean nobody) has asked for a 32-bit build in many years, I won't consider them going forward.

In this table the first two functions are part of mtsieve. The vec_mulmod_xx functions are from srXsieve. It is a little challenging to take the numbers "as is" because the functions from mtsieve accept 4 (sse) or 32 (avx) divisors per call, since they sieve multiple primes concurrently, whereas the vec_mulmod_xx functions accept only 1.

           Function        Total   Per 1e6 mulmod
sse_mulmod_4a_4b_4p    718750 ms    1796 ms
     vec_avx_mulmod   3750000 ms    1171 ms
  vec2_mulmod64_mmx  29296875 ms     610 ms
  vec4_mulmod64_mmx  12187500 ms     253 ms
  vec6_mulmod64_mmx   8015625 ms     166 ms
  vec8_mulmod64_mmx   6015625 ms     125 ms
  vec2_mulmod64_fpu  29781250 ms     620 ms
  vec4_mulmod64_fpu  13468750 ms     280 ms
  vec6_mulmod64_fpu   8875000 ms     184 ms
  vec8_mulmod64_fpu   6078125 ms     126 ms

The vec_mulmod_fpu functions support p up to 2^62, whereas the vec_mulmod_mmx functions support p up to 2^52.
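
For anyone who has not seen the technique, here is a rough C sketch of how a floating-point mulmod is commonly built: estimate the quotient with a precomputed reciprocal of p, then recover the remainder from the low 64 bits of the products. This is not the srXsieve code, just an illustration. The names mulmod_ref, mulmod_fp and p_inv are made up for the example. It assumes GCC or Clang (for unsigned __int128) and, for p near 2^62, an x86 target where long double is the 80-bit x87 format with a 64-bit mantissa. With only a 53-bit mantissa the same approach would need a much smaller p, which is presumably related to the two limits quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: exact (a*b) mod p via a 128-bit product (GCC/Clang extension). */
static uint64_t mulmod_ref(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/*
 * Floating-point mulmod sketch (hypothetical helper, not from srXsieve).
 * p_inv must be 1.0L / p, computed once per prime by the caller.
 * Requires a < p and b < p; p < 2^62 assumes an 80-bit long double.
 */
static uint64_t mulmod_fp(uint64_t a, uint64_t b, uint64_t p, long double p_inv)
{
    assert(a < p && b < p);

    /* Quotient estimate; it can differ from floor(a*b/p) by at most 1. */
    uint64_t q = (uint64_t)((long double)a * (long double)b * p_inv);

    /* a*b - q*p fits in a signed 64-bit value, so the wrapped low
       64 bits of both products are enough to recover it. */
    int64_t r = (int64_t)(a * b - q * p);

    /* One correction step in either direction brings r into [0, p). */
    if (r < 0)
        r += (int64_t)p;
    else if (r >= (int64_t)p)
        r -= (int64_t)p;

    return (uint64_t)r;
}
```

A caller would compute p_inv = 1.0L / p once per prime and reuse it for every multiplication modulo that prime, which is where the speed comes from compared with a hardware divide per mulmod.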

It might be possible to convert some of the existing sieves in the framework to use the functions from srXsieve, but I haven't looked into that yet. I will likely only include the vec8_mulmod64_fpu function in mtsieve, for three reasons. The others are just not as fast. Supporting multiple variants adds unnecessary complexity to the framework. And the vec_mulmod_mmx functions require a global variable, which mtsieve will not support as it is multi-threaded. That last point could probably be changed, but I don't know if it would impact performance.

I have attached the code (containing a Windows exe). You can compile on your own with: gcc *mulmod*.S *fini.S avx*S test.c -m64 -O2 -o mulmod_test. I am curious as to what other CPUs output, to see if these numbers are consistent. The compiled program takes one parameter, a number, which is multiplied by 1e6 to give the number of calls made to the function in question. The numbers above were run with an input parameter of 100 while no other CPU-intensive programs were running.
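
To make the measurement and the normalisation concrete, here is a small C sketch of the same idea: time N x 1e6 calls, then divide by the number of mulmods each call performs. It is not the attached test.c. The function names are made up, and a plain 128-bit mulmod stands in for the ASM routines (assumes GCC or Clang for unsigned __int128). For the first two rows of the table, dividing the total by (100 x divisors per call) reproduces the per-1e6 column: 718750 / (100 x 4) = 1796.875 and 3750000 / (100 x 32) = 1171.875.

```c
#include <stdint.h>
#include <time.h>

/* Stand-in for the routine under test (hypothetical; the real routines
   live in the .S files of the attachment). */
static uint64_t mulmod_u128(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* Make `millions` * 1e6 calls and return the CPU time used, in seconds.
   The final value is written to *sink so the loop cannot be optimised away.
   Requires p > 3. */
static double time_calls_sec(uint64_t millions, uint64_t p, uint64_t *sink)
{
    uint64_t x = 3;
    clock_t start = clock();

    for (uint64_t i = 0; i < millions * 1000000ULL; i++)
        x = mulmod_u128(x, 3, p);

    clock_t end = clock();
    *sink = x;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Normalise a total time to "per 1e6 mulmod", given how many mulmods a
   single call performs (4 for the sse routine, 32 for the avx routine). */
static double per_million_mulmods(double total, uint64_t millions,
                                  uint64_t mulmods_per_call)
{
    return total / ((double)millions * (double)mulmods_per_call);
}
```

Calling time_calls_sec(100, p, &sink) mirrors a run with the input parameter 100. Note that clock() measures CPU time for the process, so results are only comparable when nothing else is competing for the core.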
Attached Files
File Type: 7z mulmod_test.7z (26.5 KB, 50 views)

Last fiddled with by rogue on 2020-04-25 at 21:35