mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2020-09-21, 19:23   #56
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22·3·523 Posts
Default

Quote:
Originally Posted by Citrix View Post
I am looking at BestQ code

Code:
uint32_t    GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s)
{
   uint32_t baby, giant, work;

   ChooseSteps(&baby, &giant, Q, s);

   work = baby*BABY_WORK + s*(giant-1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK;

   return work;
}
I understand the BABY_WORK, GIANT_WORK and SUBSEQ_WORK steps.
Can you help me understand what is the Q*EXP_WORK -- I do not see anything corresponding to this in the discrete log function?

For a large Q if only one subsequence is left - it should be significantly faster than using a lower Q but the "Q*EXP_WORK" prevents use of large Q.
This was borrowed from srsieve. The original code was written by Geoff Reynolds.

Note that the code does not have any sr1sieve or sr2sieve enhancements. I have a ways to go to pull in the sr2sieve bsgs logic.
rogue is offline   Reply With Quote
Old 2020-09-22, 14:27   #57
rob147147
 
Apr 2013
Durham, UK

32·7 Posts
Default

I think EXP_WORK relates to the additional mulmods involved in pre-calculating b^d (mod p) for all the appropriate remainder values 'd' where 0 <= d < Q.

E.g. We are rewriting the terms of a sequence k * b^n -1 as k * b^d * b^(Qm) -1 where n=Qm+d.

The idea behind Q is to identify which values 'd' are unnecessary, thus ending up with less than Q subsequences over a range 1/Q times the size of the original range, hopefully giving a performance improvement. There could be multiple terms with the same remainder 'd' so we can precalculate b^d (mod p) for those instances. My understanding from looking at the source code of sr1sieve (bsgs.c) is that it is this mulmod that we are accounting for in EXP_WORK.
rob147147 is offline   Reply With Quote
Old 2020-09-24, 03:45   #58
Citrix
 
Citrix's Avatar
 
Jun 2003

62B16 Posts
Default

Quote:
Originally Posted by rob147147 View Post
I think EXP_WORK relates to the additional mulmods involved in pre-calculating b^d (mod p) for all the appropriate remainder values 'd' where 0 <= d < Q.

E.g. We are rewriting the terms of a sequence k * b^n -1 as k * b^d * b^(Qm) -1 where n=Qm+d.

The idea behind Q is to identify which values 'd' are unnecessary, thus ending up with less than Q subsequences over a range 1/Q times the size of the original range, hopefully giving a performance improvement. There could be multiple terms with the same remainder 'd' so we can precalculate b^d (mod p) for those instances. My understanding from looking at the source code of sr1sieve (bsgs.c) is that it is this mulmod that we are accounting for in EXP_WORK.
Robert, I think you are correct. I came to the same conclusion when reviewing the code. Though for sr1sieve there would not be any need to calculate b^d (mod p) for particular d values as there is no candidates left for certain d. That is the whole point of choosing a Q value.

For multiple k (sr2sieve) the program should intelligently decide which b^d should be computed.

In Srsieve2 the following function would need to be corrected to make it faster for single k.

Code:
void   GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64){
...}
Citrix is offline   Reply With Quote
Old 2020-09-24, 05:31   #59
Citrix
 
Citrix's Avatar
 
Jun 2003

30538 Posts
Default

I tried to replace mulmod by powmod to avoid unnecessary multiplication. I get an error during run time. Any idea why the calculations are not being done correctly?

Code:
void   GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64)
{
   uint64_t inv_b[4];
   uint32_t qIdx, seqIdx, ssIdx;
   uint64_t umod[4], inv[4];

   fpu_push_1divp(primeList[3]);
   fpu_push_1divp(primeList[2]);
   fpu_push_1divp(primeList[1]);
   fpu_push_1divp(primeList[0]);
   
   // Precompute 1/b^d (mod p) for 0 <= d <= Q.

   bd64[0][0] = bd64[0][1] = bd64[0][2] = bd64[0][3] = 1;
   
   bm64[0] = inv_b[0] = invmod64(baseToUse[0], primeList[0]);
   bm64[1] = inv_b[1] = invmod64(baseToUse[1], primeList[1]);
   bm64[2] = inv_b[2] = invmod64(baseToUse[2], primeList[2]);
   bm64[3] = inv_b[3] = invmod64(baseToUse[3], primeList[3]);

  /* for (qIdx=1; qIdx<ii_BestQ; qIdx++)
   {
      bd64[qIdx][0] = bm64[0];
      bd64[qIdx][1] = bm64[1];
      bd64[qIdx][2] = bm64[2];
      bd64[qIdx][3] = bm64[3];
         
      fpu_mulmod_4a_4b_4p(bm64, inv_b, primeList);
   }
   */
    
   for (seqIdx=0; seqIdx<ii_SequenceCount; seqIdx++)
   {
      ck64[seqIdx][0] = smod64(-ip_Sequences[seqIdx].c, primeList[0]);
      ck64[seqIdx][1] = smod64(-ip_Sequences[seqIdx].c, primeList[1]);
      ck64[seqIdx][2] = smod64(-ip_Sequences[seqIdx].c, primeList[2]);
      ck64[seqIdx][3] = smod64(-ip_Sequences[seqIdx].c, primeList[3]);
      
      umod[0] = umod64(ip_Sequences[seqIdx].k, primeList[0]);
      umod[1] = umod64(ip_Sequences[seqIdx].k, primeList[1]);
      umod[2] = umod64(ip_Sequences[seqIdx].k, primeList[2]);
      umod[3] = umod64(ip_Sequences[seqIdx].k, primeList[3]);
      
      inv[0] = invmod64(umod[0], primeList[0]);
      inv[1] = invmod64(umod[1], primeList[1]);
      inv[2] = invmod64(umod[2], primeList[2]);
      inv[3] = invmod64(umod[3], primeList[3]);
      
      fpu_mulmod_4a_4b_4p(ck64[seqIdx], inv, primeList);
   }

   // Compute -c/(k*b^d) (mod p) for each subsequence.
   for (ssIdx=0; ssIdx<ii_SubsequenceCount; ssIdx++)
   {
      bdck64[ssIdx][0] = ck64[ip_Subsequences[ssIdx].seqIdx][0];
      bdck64[ssIdx][1] = ck64[ip_Subsequences[ssIdx].seqIdx][1];
      bdck64[ssIdx][2] = ck64[ip_Subsequences[ssIdx].seqIdx][2];
      bdck64[ssIdx][3] = ck64[ip_Subsequences[ssIdx].seqIdx][3];


	  fpu_powmod_4b_1n_4p(bm64, ip_Subsequences[ssIdx].d, primeList);
      
      //fpu_mulmod_4a_4b_4p(bdck64[ssIdx], bd64[ip_Subsequences[ssIdx].d], primeList);
	  fpu_mulmod_4a_4b_4p(bdck64[ssIdx], bm64, primeList);

	  bm64[0] = inv_b[0];
	  bm64[1] = inv_b[1];
	  bm64[2] = inv_b[2];
	  bm64[3] = inv_b[3];
   }
   fpu_powmod_4b_1n_4p(bm64, ii_BestQ, primeList);

   
   fpu_pop();
   fpu_pop();
   fpu_pop();
   fpu_pop();
}
Citrix is offline   Reply With Quote
Old 2020-09-24, 12:44   #60
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22·3·523 Posts
Default

fpu_powmod_4b_1n_4p() assumes that there is nothing on the FPU stack, but fpu_mulmod_4a_4b_4p() requires the first four entries on the stack to be preset (per fpu_push_1divp). You would have to create a version of fpu_powmod_4b_1n_4p() that supports precomputed values on the stack.
rogue is offline   Reply With Quote
Old 2020-09-26, 12:15   #61
Happy5214
 
Happy5214's Avatar
 
"Alexander"
Nov 2008
The Alamo City

22×3×47 Posts
Default

Quote:
Originally Posted by rogue View Post
fpu_powmod_4b_1n_4p() assumes that there is nothing on the FPU stack, but fpu_mulmod_4a_4b_4p() requires the first four entries on the stack to be preset (per fpu_push_1divp). You would have to create a version of fpu_powmod_4b_1n_4p() that supports precomputed values on the stack.
On that subject, I'm about to implement the ARM version of fpu_powmod_4b_1n_4p() (it'll be the only one I'll have to spill registers for ). Is there a reason why fpu_mulmod_4a_4b_4p() preloads the 1/p values and fpu_powmod_4b_1n_4p() doesn't? Should I keep the current semantics with the ARM version?

I discovered when I implemented fpu_powmod() that the x87 version does in fact use right-to-left exponentiation, as I noted in a previous post, so you were right on that concern earlier.

Last fiddled with by Happy5214 on 2020-09-26 at 12:16
Happy5214 is online now   Reply With Quote
Old 2020-09-26, 13:06   #62
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22×3×523 Posts
Default

Quote:
Originally Posted by Happy5214 View Post
On that subject, I'm about to implement the ARM version of fpu_powmod_4b_1n_4p() (it'll be the only one I'll have to spill registers for ). Is there a reason why fpu_mulmod_4a_4b_4p() preloads the 1/p values and fpu_powmod_4b_1n_4p() doesn't? Should I keep the current semantics with the ARM version?

I discovered when I implemented fpu_powmod() that the x87 version does in fact use right-to-left exponentiation, as I noted in a previous post, so you were right on that concern earlier.
In most cases fpu_powmod_4b_1n_4p() is called once to set up a loop and fpu_mulmod_4a_4b_4p() is called within a loop. There is a time speed bump to setting up the stack prior to the call to fpu_powmod_4b_1n_4p(), but it wouldn't benefit all sieves and most that would benefit would probably see a negligible benefit. The long term goal would then be to preset the stack (or registers for ARM) for all fpu functions.
rogue is offline   Reply With Quote
Old 2020-12-27, 18:26   #63
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22·3·523 Posts
Default

I have posted mtsieve 2.1.0 over at sourceforge. There are many changes with this release, but users of gcwsievecl, mfsieve, and mfsievecl will get the most benefit. The changes are:

Code:
2.1.0 - December 27, 2020
   framework:
      On Windows, switched to using gcc from msys2 instead of gcc from mingw64.  -O3
      gives a nice performance bump to some of the non-ASM code.
      
      Exit after showing help when using -h swtich instead of giving fatal error.
      
      Add logic to support validation of factors passed with -I, but not all sieves are
      coded to do this validation.
      
      On GPU builds, default -W to 0 and -G to 1.  A "special" CPU worker will be created
      as necessary to test ranges of p that are to small for the GPU code.
   
      For GPU executables, add -H to show some GPU details when each kernel is created.
   
      Improve factor rate calculation when using GPU workers.
      
      Created FUTURE.txt for items I would like to get working.
   
   cksieve:  version 1.3
      Added logic to validate factors passed with -I.

   gcwsieve, gcwsievecl:  version 1.4
      Added logic to validate factors passed with -I.
      Add -f option to gcwsieve so that it can generate ABC output that is supported by LLR.
      Added -M option and fixed -S for gcwsievecl.
      Improved speed of GPU code of gcwsievecl by about 25%.
      The GPU code is more than 20x faster than the CPU for the same amount of work.
   
   mfsieve, mfsievecl:  version 1.7
      Added logic to validate factors passed with -I.
      Replaced all ASM routines with non-ASM routines based upon Montgomery mulitplication.  This
      provides a speed bump of at least 30% over the previous release.
      Fixed the GPU code and replaced magic number logic with Montgomery multiplcation as it is
      faster.  The new GPU code is more then 25% faster than the old GPU code.
      The GPU code is more than 20x faster than the CPU for the same amount of work.
   
    sgsieve:  version 1.2
      Added logic to validate factors passed with -I.
      
    twinsieve:  version 1.3
      Added logic to validate factors passed with -I.
      
    xyyxsieve, xyyxsievecl:  version 1.8
      Added logic to validate factors passed with -I.
I also add a FUTURE.txt file with what I want to do in the future. I know there are things that I have promised that are not in this list. Please let me know what I have missed:

Code:
2.0.7 - December 22, 2020
   framework:
      Replace AMD SDK on Windows with open source SDK for OpenCL.
      
      Add more output for -H for GPU executables.
      
   afsieve, afsievecl:
      Replace logic with Montgomery multiplcation.
      Add factor validation when using -I.
 
   dmdsieve:
      Add factor validation when using -I.
      
   fbncsieve:
      Add factor validation when using -I.
      
   fkbnsieve:
      Add factor validation when using -I.
      
   gfndsieve:
      Add factor validation when using -I.
      
   kbbsieve:
      Add factor validation when using -I.
      
   psieve:
      Add factor validation when using -I.
      
   pixsieve:
   
   srsieve2:
      Implement sr1sieve and sr2sieve logic.
      Add factor validation when using -I.
I will probably complete most of the -I updates in the coming days and a replacement OpenCL SDK on Windows is close behind.
rogue is offline   Reply With Quote
Old 2020-12-27, 19:21   #64
pepi37
 
pepi37's Avatar
 
Dec 2011
After milion nines:)

2×709 Posts
Default

If you can rise upper limit of srsieve2.
pepi37 is online now   Reply With Quote
Old 2020-12-27, 20:06   #65
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22×3×523 Posts
Default

Quote:
Originally Posted by pepi37 View Post
If you can rise upper limit of srsieve2.
So you need it to go above 2^52?
rogue is offline   Reply With Quote
Old 2020-12-27, 21:01   #66
pepi37
 
pepi37's Avatar
 
Dec 2011
After milion nines:)

2×709 Posts
Default

Quote:
Originally Posted by rogue View Post
So you need it to go above 2^52?
1e17 is good for me
pepi37 is online now   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
srsieve/sr2sieve enhancements rogue Software 300 2021-03-18 20:31
mtsieve rogue Software 543 2021-02-27 18:43
LLRnet enhancements kar_bon No Prime Left Behind 10 2008-03-28 11:21
TODO list and suggestions/comments/enhancements Greenbank Octoproth Search 2 2006-12-03 17:28
Suggestions for future enhancements Reboot It Software 16 2003-10-17 01:31

All times are UTC. The time now is 20:03.

Sat Apr 10 20:03:16 UTC 2021 up 2 days, 14:44, 1 user, load averages: 2.05, 1.60, 1.49

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.