20200814, 11:41  #45  
"Alexander"
Nov 2008
The Alamo City
634_{8} Posts 
Quote:
Last fiddled with by Happy5214 on 20200814 at 11:43 

20200819, 15:26  #46 
"Alexander"
Nov 2008
The Alamo City
2^{2}·103 Posts 
I finally got my hands on that ODROID C4. It arrived Sunday (a day early). I tried to boot it Monday, only to find that the spare wireless keyboard I had allocated to it no longer worked. Today, I managed to find a spare Ethernet cable and plug it into our router, and I was able to log in through ssh. After several firsttimer coding errors, I finally have a working version of fpu_mulmod. I've attached a tarball with the new header, a remake of the push functions (which just copies the appropriate value to register d8), the two aforementioned version of fpu_mulmod (one using the 1/p value in d8 and the other using a simple fdiv by p every time), and a simple (grossly inefficient) Fermat test with a Mersenne prime using fpu_mulmod (should result in 1), along with a simple Makefile to build them.
The normal mulmod (with 1/p calculated once and stored in d8) runs in 1.52s on my ODROID, while the divp version (with the fdiv by p within mulmod on each iteration) takes 2.3s. Interestingly, the x87 version takes 1.43s on my Core 2 Quad, just showing how much progress processor technology has made. 
20200819, 18:26  #47 
"Alexander"
Nov 2008
The Alamo City
2^{2}·103 Posts 

20200819, 20:06  #48 
"Mark"
Apr 2003
Between here and the
2×5×601 Posts 
Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have precomputed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.

20200820, 10:24  #49  
"Alexander"
Nov 2008
The Alamo City
2^{2}×103 Posts 
Quote:
I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those? 

20200820, 12:00  #50  
"Mark"
Apr 2003
Between here and the
2·5·601 Posts 
Quote:


20200822, 02:12  #51  
Jun 2003
3^{2}·5^{2}·7 Posts 
Calculating best Q for srsieve2.exe
Quote:
A faster automated way of finding the bestQ would be 1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left For N1, N2, N3, N4, ... left in the file Set BASE_MULTIPLE=gcd (N2N1, N3N2, N4N3, ...) So all N can be represented as N1=k1*BASE_MULTIPLE+c N2=k2*BASE_MULTIPLE+c 2) Then LIMIT_BASE should 720*7*11=55440 NDIVISORS=LIMIT_BASE 3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1; OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE) 4) Use the current code of srsieve2 to find bestq 5) Q=bestq*BASE_MULTIPLE 

20200822, 02:55  #52  
"Mark"
Apr 2003
Between here and the
2·5·601 Posts 
Quote:


20200822, 03:48  #53 
Jun 2003
3^{2}×5^{2}×7 Posts 
I modified sr1sieve to read any Q from command line and got significant benefit for low weight series. I am trying to combine the benefit with your faster FPU/AVX code.
The problem modifying srsieve2 code is that it supports multiple k version and calculating the Q for that is very complicated. 
20200825, 15:55  #54  
"Alexander"
Nov 2008
The Alamo City
110011100_{2} Posts 
Quote:
Quote:
Speaking of, the fpu_mulmod ARM ports have been finished and are attached. I managed to use the minimum amount of memory accesses (just transferring the array values to/from registers). Here are the timings for my test programs on ARM CortexA55 (which do a poor man's Fermat test on the largest known Mersenne prime exponent): Code:
time ./mulmod 3^82589932 mod 82589933 = 1 real 0m1.043s user 0m1.040s sys 0m0.000s time ./mulmod_iter 3^82589932 mod 82589933 = 1 real 0m0.870s user 0m0.864s sys 0m0.004s time ./mulmod_iter_4a 2^82589932 mod 82589933 = 1 2^82589932 mod 82589933 = 2 2^82589932 mod 82589933 = 4 2^82589932 mod 82589933 = 8 real 0m1.737s user 0m1.736s sys 0m0.000s time ./mulmod_4a_4b_4p 2^82589932 mod 82589933 = 1 3^82589932 mod 82589933 = 1 4^82589932 mod 82589933 = 1 5^82589932 mod 82589933 = 1 real 0m2.300s user 0m2.296s sys 0m0.000s Code:
time ./mulmod 3^82589932 mod 82589933 = 1 real 0m1.260s user 0m1.232s sys 0m0.000s time ./mulmod_iter 3^82589932 mod 82589933 = 1 real 0m1.110s user 0m1.074s sys 0m0.005s time ./mulmod_iter_4a 2^82589932 mod 82589933 = 1 2^82589932 mod 82589933 = 2 2^82589932 mod 82589933 = 4 2^82589932 mod 82589933 = 8 real 0m1.284s user 0m1.252s sys 0m0.004s time ./mulmod_4a_4b_4p 2^82589932 mod 82589933 = 1 3^82589932 mod 82589933 = 1 4^82589932 mod 82589933 = 1 5^82589932 mod 82589933 = 1 real 0m1.543s user 0m1.487s sys 0m0.005s 

20200921, 17:26  #55 
Jun 2003
3^{2}·5^{2}·7 Posts 
I am looking at BestQ code
Code:
uint32_t GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s) { uint32_t baby, giant, work; ChooseSteps(&baby, &giant, Q, s); work = baby*BABY_WORK + s*(giant1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK; return work; } Can you help me understand what is the Q*EXP_WORK  I do not see anything corresponding to this in the discrete log function? For a large Q if only one subsequence is left  it should be significantly faster than using a lower Q but the "Q*EXP_WORK" prevents use of large Q. Thanks. 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
mtsieve  rogue  Software  457  20201118 15:26 
srsieve/sr2sieve enhancements  rogue  Software  281  20200929 16:36 
LLRnet enhancements  kar_bon  No Prime Left Behind  10  20080328 11:21 
TODO list and suggestions/comments/enhancements  Greenbank  Octoproth Search  2  20061203 17:28 
Suggestions for future enhancements  Reboot It  Software  16  20031017 01:31 