20200426, 00:50  #12  
Jun 2003
11^{2}×13 Posts 
Quote:
Step 1: For BSGS we want to compare G[i] to B[j] for giant and baby step respectively. Assuming there are G_n giant steps and B_n baby steps. Then we calculate G[i] (mod p) and B[j] (mod p) We want to store the baby step value (could be the other way around also) Instead of comparing G[i] and B[j] we want to limit our search space by using probability. This can reduce the amount of memory space needed and search operations but can produce false positive. There will be a inverse relationship between how much memory we use and how many false positives we get. Let G[i]=B[j] (mod p) then [G[i]=B[j] (mod p)](mod n) G[i]%n=GN[i] B[j]%n=BN[j] GN[i]=BN[j] We will store BN[j] for comparison with GN[i] ** if n is a power of 2 (2^x) then this modular arithmetic can be done at no extra cost. So we do not need to store the full 64 bits of G[i] and B[j]. Effectively we have reduced our search space. We can store these in a hash table or use Step 2 below in certain cases to speed up even further. For very small B_n range (<510) we can just use an array instead of hash table which should be faster. This would happen if there were too few candidates left in the sieve. Step 2: For small n we can create an bit index (array of bits) to store all B[j] values. Does not work for large n values and would need to use a regular hash table. At default all values in the array are set to zero. Then for each BN[j] value you convert the corresponding location in the array to 1. For each GN[i] you read the corresponding value from the array. Complexity to add a value is O(1) and to search a value is O (1) If you do get a positive result (=1) it suggests that G[i] might produce a factor and then you need to check G[i] against B[1..j]. This will differentiate between false positive and true factors. This might not need to be done if there were no values corresponding to a particular G[i] in the candidate left range. To check G[i] against B[1..j] instead of repeating B_n steps this can be done in 2*sqrt(B_n) steps by repeating the BSGS algorithm again on this subrange. Code:
Bit array [n]; for (int a=0; a<n; a++) {array[a]=0;} int b=BN[j]; array[b]=1; int g=GN[i]; if(array[g]==1) {// possible factor found and need to double check} 2) Choosing the size of n We want to reduce the number of false positive and reduce the amount of memory used Larger the n the fewer the number of false positive and more the amount of memory used. Percent chances of a positive result will be (G_n*B_n*100)/n For a 1% false positive rate n will have to be around G_n*B_n*100 The false positive rate should be adjusted to the speed up from using less memory/array structure vs extra work needed to test the false positive. n values will quickly become very large and this is best suited for low weight candidates or small ranges being searched with a few k. 

20200426, 02:53  #13 
"Alexander"
Nov 2008
The Alamo City
2^{3}·47 Posts 
My desktop's Core 2 doesn't have AVX, so this is from my laptop (Intel Core i34012Y, Kubuntu 18.04):
Code:
sse_mulmod_4a_4b_4p 2018927 ms 5047 ms (per 1e6 mulmod) vec_avx_mulmod 21365975 ms 6676 ms (per 1e6 mulmod) vec2_mulmod64_mmx 82026202 ms 1708 ms (per 1e6 mulmod) vec4_mulmod64_mmx 34685917 ms 722 ms (per 1e6 mulmod) vec6_mulmod64_mmx 24030532 ms 500 ms (per 1e6 mulmod) vec8_mulmod64_mmx 18026937 ms 375 ms (per 1e6 mulmod) vec2_mulmod64_fpu 80305117 ms 1673 ms (per 1e6 mulmod) vec4_mulmod64_fpu 38246894 ms 796 ms (per 1e6 mulmod) vec6_mulmod64_fpu 27558843 ms 574 ms (per 1e6 mulmod) vec8_mulmod64_fpu 18187244 ms 378 ms (per 1e6 mulmod) Code:
#define __USE_GNU #include <sys/resource.h> Last fiddled with by Happy5214 on 20200426 at 02:55 
20200426, 12:18  #14 
"Mark"
Apr 2003
Between here and the
61·97 Posts 
I am not concerned about memory usage on a CPU as much as I would be on a GPU, but if the lookup could be faster, that would be helpful. Please write some code so that we can see how it performs.

20200426, 14:16  #15 
"Mark"
Apr 2003
Between here and the
61×97 Posts 
I added the mtsieve function fpu_mulmod_4a_4b_4p to this. I didn't include it previously because it I didn't use it correctly which led to incorrect output. Here is the updated output:
Code:
fpu_mulmod_4a_4b_4p 890625 ms 2226 ms (per 1e6 mulmod) sse_mulmod_4a_4b_4p 734375 ms 1835 ms (per 1e6 mulmod) vec_avx_mulmod 3828125 ms 1196 ms (per 1e6 mulmod) vec2_mulmod64_mmx 30625000 ms 638 ms (per 1e6 mulmod) vec4_mulmod64_mmx 12359375 ms 257 ms (per 1e6 mulmod) vec6_mulmod64_mmx 8187500 ms 170 ms (per 1e6 mulmod) vec8_mulmod64_mmx 6078125 ms 126 ms (per 1e6 mulmod) vec2_mulmod64_fpu 29906250 ms 623 ms (per 1e6 mulmod) vec4_mulmod64_fpu 13546875 ms 282 ms (per 1e6 mulmod) vec6_mulmod64_fpu 8968750 ms 186 ms (per 1e6 mulmod) vec8_mulmod64_fpu 6031250 ms 125 ms (per 1e6 mulmod) Code:
a[0] = a[0] * b[0] mod p[0] a[1] = a[1] * b[1] mod p[1] a[2] = a[2] * b[2] mod p[2] a[3] = a[3] * b[3] mod p[3] Code:
for (int i=0; i<count; i++) a[0] = a[0] * b[0] mod p a[1] = a[1] * b[1] mod p a[2] = a[2] * b[2] mod p a[3] = a[3] * b[3] mod p Code:
for (int i=0; i<count; i++) a[0] = a[0] * b[0] mod p[0] a[1] = a[1] * b[1] mod p[1] a[2] = a[2] * b[2] mod p[2] a[3] = a[3] * b[3] mod p[3] The sources to both of these are included in the attached and the usage of both is in the test.c file. For those of you are are not afraid of x86 assembly, I'd love you to take a crack at such a function. Last fiddled with by rogue on 20200426 at 14:17 
20200426, 15:46  #16 
Random Account
Aug 2009
U.S.A.
5×307 Posts 
Using your update. Same parameter as before. 100.

20200426, 16:27  #17  
Jun 2003
11^{2}·13 Posts 
Quote:
As I mentioned in the earlier post that it only is useful for small n range or low weight and might not work for a larger range due to memory requirements. It might not be of any practical use. It does use more memory but look up is significantly faster. See: https://docs.microsoft.com/enus/cpp...s?view=vs2019 for more details on bitset. Code:
#include <bitset> // built in STD library for bitmaps. //BSGS algorithm // For a range r=b*g for k*2^n+1 // then baby step = k*2^(1) to k*2^b  b steps (mod p) // giant step will be 2^(n) to (2^b) step 2^b g steps (mod p) const int g = 32; // number of giant steps const int b = 32; // number of baby steps const int c = 1024; // for 0.1% false positive rate. rate=1/c*100% int n = g*b*c; //works better if n=2^m std::bitset<n> bitmap; // by default in the constructor everything is set to zero/false void main(void) { uint64_t giant[g]; uint64_t baby[b]; // replace arrays with mulmod code in mtsieve to generate giant and baby values (mod p) for (int i = 0; i < b; i++) { uint32_t temp = baby[i]; temp = temp%n; // can use bit operations if n is a power of 2 bitmap.set(temp); //sets bit to 1 } // we are done with setting up the table // Now we will compare with the giant values for (int j= 0; j < g; j++) { uint32_t temp = giant[j]; temp = temp%n; // can use bit operations if n is a power of 2 if (bitmap.test(temp)) { // we might have a possible factor // Possible factor will be k*2^(b*j)*2^x+1 =0 (mod p) // Where x is between 1...b // Need to search this range for factors // Repeat BSGS for this smaller range. } } bitmap.reset(); // sets everything to false } Last fiddled with by Citrix on 20200426 at 16:49 

20200426, 19:26  #18  
Jun 2003
625_{16} Posts 
Quote:
For the low weight sequences Assuming BSGS requires 64 steps each for a range of 4096 Would the following simpler algorithm be faster on CPU or GPU than BSGS for a 4096 range? Multiple primes could be tested in parallel. (Assuming we use BestQ and Legendre symbols in both algorithms). Code:
void low_wt_sieve (int k, int base, int nmin, int nmax, int sign, int step, int p) { // checks if p divides k*base^n+sign (over range of n with base^step size increments) // calculate number of iterations int number=(nmaxnmin)/step+1; int start=(k*base^nmin)%p; int step_exp=base^step%p; for (int x=0; x<number; x++) { start=start*step_exp%p; if (start+sign==p  startsign==0) { //output factor } } } 

20200426, 21:29  #19  
"Mark"
Apr 2003
Between here and the
61·97 Posts 
Quote:
As for your other post using bitset, I'll look into it further to see if it can be used. 

20200427, 01:42  #20 
"Mark"
Apr 2003
Between here and the
61×97 Posts 
Upon reviewing the code, vec4_mulmod64_fpu works like this:
vec4_mulmod64_fpu works like this: Code:
for (int i=0; i<count; i++) y[0] = x[0] * b mod p y[1] = x[1] * b mod p y[2] = x[2] * b mod p y[3] = x[3] * b mod p Does anyone know if there is an additional latency to using fmul with a memory address instead of st(x)? It doesn't appear that there is one from Agner Fog's tables, but I don't know if anyone has practical experience who can also answer the question. 
20200427, 15:37  #21 
"Mark"
Apr 2003
Between here and the
61×97 Posts 
As I think about it more, primorial and factorial won't benefit that much because they have a specialized routine that reduces the number of asm function calls.

20200427, 18:21  #22 
"Mark"
Apr 2003
Between here and the
61×97 Posts 
I realized that I wasn't computing the runtimes correctly for the vec_xx functions. Here are updated numbers, which make a lot more sense now. I'm sorry that I didn't discover my mistake sooner. I assume that nobody has wasted time on this besides me.
Code:
fpu_mulmod_4a_4b_4p 859375 ms 2148 ms (per 1e6 mulmod) sse_mulmod_4a_4b_4p 703125 ms 1757 ms (per 1e6 mulmod) vec_avx_mulmod 3750000 ms 1171 ms (per 1e6 mulmod) vec2_mulmod64_mmx 57687500 ms 1201 ms (per 1e6 mulmod) vec4_mulmod64_mmx 47265625 ms 984 ms (per 1e6 mulmod) vec6_mulmod64_mmx 44859375 ms 934 ms (per 1e6 mulmod) vec8_mulmod64_mmx 44953125 ms 936 ms (per 1e6 mulmod) vec2_mulmod64_fpu 58921875 ms 1227 ms (per 1e6 mulmod) vec4_mulmod64_fpu 50531250 ms 1052 ms (per 1e6 mulmod) vec6_mulmod64_fpu 50812500 ms 1058 ms (per 1e6 mulmod) vec8_mulmod64_fpu 42796875 ms 891 ms (per 1e6 mulmod) Last fiddled with by rogue on 20200427 at 18:31 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
mtsieve  rogue  Software  447  20200919 05:50 
srsieve/sr2sieve enhancements  rogue  Software  280  20200917 16:10 
LLRnet enhancements  kar_bon  No Prime Left Behind  10  20080328 11:21 
TODO list and suggestions/comments/enhancements  Greenbank  Octoproth Search  2  20061203 17:28 
Suggestions for future enhancements  Reboot It  Software  16  20031017 01:31 