![]() |
![]() |
#45 | |
"Alexander"
Nov 2008
The Alamo City
5×101 Posts |
![]() Quote:
Last fiddled with by Happy5214 on 2020-08-14 at 11:43 |
|
![]() |
![]() |
![]() |
#46 |
"Alexander"
Nov 2008
The Alamo City
5×101 Posts |
![]()
I finally got my hands on that ODROID C4. It arrived Sunday (a day early). I tried to boot it Monday, only to find that the spare wireless keyboard I had allocated to it no longer worked. Today, I managed to find a spare Ethernet cable and plug it into our router, and I was able to log in through ssh. After several first-timer coding errors, I finally have a working version of fpu_mulmod. I've attached a tarball with the new header, a remake of the push functions (which just copies the appropriate value to register d8), the two aforementioned version of fpu_mulmod (one using the 1/p value in d8 and the other using a simple fdiv by p every time), and a simple (grossly inefficient) Fermat test with a Mersenne prime using fpu_mulmod (should result in 1), along with a simple Makefile to build them.
The normal mulmod (with 1/p calculated once and stored in d8) runs in 1.52s on my ODROID, while the divp version (with the fdiv by p within mulmod on each iteration) takes 2.3s. Interestingly, the x87 version takes 1.43s on my Core 2 Quad, just showing how much progress processor technology has made. |
![]() |
![]() |
![]() |
#47 |
"Alexander"
Nov 2008
The Alamo City
5×101 Posts |
![]() |
![]() |
![]() |
![]() |
#48 |
"Mark"
Apr 2003
Between here and the
11000011010002 Posts |
![]()
Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have pre-computed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.
|
![]() |
![]() |
![]() |
#49 | |
"Alexander"
Nov 2008
The Alamo City
5·101 Posts |
![]() Quote:
I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those? |
|
![]() |
![]() |
![]() |
#50 | |
"Mark"
Apr 2003
Between here and the
186816 Posts |
![]() Quote:
|
|
![]() |
![]() |
![]() |
#51 | |
Jun 2003
1,579 Posts |
![]() Quote:
A faster automated way of finding the bestQ would be 1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left For N1, N2, N3, N4, ... left in the file Set BASE_MULTIPLE=gcd (N2-N1, N3-N2, N4-N3, ...) So all N can be represented as N1=k1*BASE_MULTIPLE+c N2=k2*BASE_MULTIPLE+c 2) Then LIMIT_BASE should 720*7*11=55440 NDIVISORS=LIMIT_BASE 3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1; OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE) 4) Use the current code of srsieve2 to find bestq 5) Q=bestq*BASE_MULTIPLE |
|
![]() |
![]() |
![]() |
#52 | |
"Mark"
Apr 2003
Between here and the
186816 Posts |
![]() Quote:
|
|
![]() |
![]() |
![]() |
#53 |
Jun 2003
1,579 Posts |
![]()
I modified sr1sieve to read any Q from command line and got significant benefit for low weight series. I am trying to combine the benefit with your faster FPU/AVX code.
The problem modifying srsieve2 code is that it supports multiple k version and calculating the Q for that is very complicated. |
![]() |
![]() |
![]() |
#54 | ||
"Alexander"
Nov 2008
The Alamo City
1111110012 Posts |
![]() Quote:
Quote:
Speaking of, the fpu_mulmod ARM ports have been finished and are attached. I managed to use the minimum amount of memory accesses (just transferring the array values to/from registers). Here are the timings for my test programs on ARM Cortex-A55 (which do a poor man's Fermat test on the largest known Mersenne prime exponent): Code:
time ./mulmod 3^82589932 mod 82589933 = 1 real 0m1.043s user 0m1.040s sys 0m0.000s time ./mulmod_iter 3^82589932 mod 82589933 = 1 real 0m0.870s user 0m0.864s sys 0m0.004s time ./mulmod_iter_4a 2^82589932 mod 82589933 = 1 2^82589932 mod 82589933 = 2 2^82589932 mod 82589933 = 4 2^82589932 mod 82589933 = 8 real 0m1.737s user 0m1.736s sys 0m0.000s time ./mulmod_4a_4b_4p 2^82589932 mod 82589933 = 1 3^82589932 mod 82589933 = 1 4^82589932 mod 82589933 = 1 5^82589932 mod 82589933 = 1 real 0m2.300s user 0m2.296s sys 0m0.000s Code:
time ./mulmod 3^82589932 mod 82589933 = 1 real 0m1.260s user 0m1.232s sys 0m0.000s time ./mulmod_iter 3^82589932 mod 82589933 = 1 real 0m1.110s user 0m1.074s sys 0m0.005s time ./mulmod_iter_4a 2^82589932 mod 82589933 = 1 2^82589932 mod 82589933 = 2 2^82589932 mod 82589933 = 4 2^82589932 mod 82589933 = 8 real 0m1.284s user 0m1.252s sys 0m0.004s time ./mulmod_4a_4b_4p 2^82589932 mod 82589933 = 1 3^82589932 mod 82589933 = 1 4^82589932 mod 82589933 = 1 5^82589932 mod 82589933 = 1 real 0m1.543s user 0m1.487s sys 0m0.005s |
||
![]() |
![]() |
![]() |
#55 |
Jun 2003
1,579 Posts |
![]()
I am looking at BestQ code
Code:
uint32_t GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s) { uint32_t baby, giant, work; ChooseSteps(&baby, &giant, Q, s); work = baby*BABY_WORK + s*(giant-1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK; return work; } Can you help me understand what is the Q*EXP_WORK -- I do not see anything corresponding to this in the discrete log function? For a large Q if only one subsequence is left - it should be significantly faster than using a lower Q but the "Q*EXP_WORK" prevents use of large Q. Thanks. |
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
mtsieve | rogue | Software | 543 | 2021-02-27 18:43 |
srsieve/sr2sieve enhancements | rogue | Software | 287 | 2021-01-16 08:02 |
LLRnet enhancements | kar_bon | No Prime Left Behind | 10 | 2008-03-28 11:21 |
TODO list and suggestions/comments/enhancements | Greenbank | Octoproth Search | 2 | 2006-12-03 17:28 |
Suggestions for future enhancements | Reboot It | Software | 16 | 2003-10-17 01:31 |