mersenneforum.org mtsieve enhancements

2020-09-21, 19:23   #56
rogue

"Mark"
Apr 2003
Between here and the

1100000101001₂ Posts

Quote:
 Originally Posted by Citrix
 I am looking at the BestQ code:
Code:
uint32_t  GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s)
{
   uint32_t baby, giant, work;

   ChooseSteps(&baby, &giant, Q, s);

   work = baby*BABY_WORK + s*(giant-1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK;

   return work;
}
 I understand the BABY_WORK, GIANT_WORK and SUBSEQ_WORK steps. Can you help me understand what the Q*EXP_WORK term is for -- I do not see anything corresponding to it in the discrete log function. For a large Q, if only one subsequence is left, it should be significantly faster than using a lower Q, but the Q*EXP_WORK term prevents the use of large Q.
This was borrowed from srsieve. The original code was written by Geoff Reynolds.

Note that the code does not have any sr1sieve or sr2sieve enhancements. I have a ways to go to pull in the sr2sieve bsgs logic.
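For readers following along, the baby-step giant-step structure that BABY_WORK and GIANT_WORK are counting can be sketched in plain C++. This is an illustrative reimplementation, not srsieve's or mtsieve's code; the names bsgs, mulmod and powmod are mine, and it assumes p is prime so the giant stride b^(-babySteps) can be taken via Fermat's little theorem:

```cpp
#include <cstdint>
#include <unordered_map>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
   return (uint64_t)((__uint128_t)a * b % p);
}

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p) {
   uint64_t r = 1;
   b %= p;
   while (e) {
      if (e & 1) r = mulmod(r, b, p);
      b = mulmod(b, b, p);
      e >>= 1;
   }
   return r;
}

// Solve b^n == target (mod p) for n in [0, babySteps*giantSteps), p prime.
// Returns n, or -1 if there is no solution in range.
int64_t bsgs(uint64_t b, uint64_t target, uint64_t p,
             uint32_t babySteps, uint32_t giantSteps) {
   std::unordered_map<uint64_t, uint32_t> table;

   // Baby steps: store b^j for 0 <= j < babySteps (one mulmod each).
   uint64_t bj = 1;
   for (uint32_t j = 0; j < babySteps; j++) {
      table.emplace(bj, j);          // emplace keeps the smallest j on repeats
      bj = mulmod(bj, b, p);
   }

   // Giant stride: b^(-babySteps) mod p; bj == b^babySteps at this point.
   uint64_t stride = powmod(bj, p - 2, p);

   // Giant steps: at most giantSteps-1 mulmods before a hit or giving up.
   uint64_t t = target % p;
   for (uint32_t i = 0; i < giantSteps; i++) {
      auto it = table.find(t);
      if (it != table.end())
         return (int64_t)i * babySteps + it->second;
      t = mulmod(t, stride, p);
   }
   return -1;
}
```

The per-prime cost is roughly babySteps mulmods for the table plus giantSteps mulmods per subsequence, which is the shape of the baby*BABY_WORK + s*(giant-1)*GIANT_WORK terms in RateQ above.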

2020-09-22, 14:27   #57
rob147147

Apr 2013
Durham, UK

3²·7 Posts

I think EXP_WORK relates to the additional mulmods involved in pre-calculating b^d (mod p) for all the appropriate remainder values d, where 0 <= d < Q. E.g. we are rewriting the terms of a sequence k * b^n - 1 as k * b^d * b^(Qm) - 1, where n = Qm + d. The idea behind Q is to identify which values d are unnecessary, thus ending up with fewer than Q subsequences over a range 1/Q times the size of the original range, hopefully giving a performance improvement. There could be multiple terms with the same remainder d, so we can precalculate b^d (mod p) once for those instances. My understanding from looking at the source code of sr1sieve (bsgs.c) is that it is this mulmod that we are accounting for in EXP_WORK.
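That precomputation can be sketched as follows. This is an illustrative snippet, not mtsieve code (precomputeResidues and mulmod are hypothetical names); it walks b^0 ... b^(Q-1) with one mulmod per step, which is the Q*EXP_WORK cost in RateQ, keeping only the remainders d that actually occur:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
   return (uint64_t)((__uint128_t)a * b % p);
}

// For terms k*b^n - 1 with n = Q*m + d, compute b^d (mod p) only for the
// remainders d that occur among the exponents ns.
std::map<uint32_t, uint64_t> precomputeResidues(const std::vector<uint32_t> &ns,
                                                uint32_t Q, uint64_t b, uint64_t p) {
   std::set<uint32_t> used;
   for (uint32_t n : ns)
      used.insert(n % Q);

   std::map<uint32_t, uint64_t> bd;   // d -> b^d (mod p)
   uint64_t pw = 1;                   // b^0
   for (uint32_t d = 0; d < Q; d++) {
      if (used.count(d))
         bd[d] = pw;                  // keep only remainders in use
      pw = mulmod(pw, b, p);          // one mulmod per step: the Q*EXP_WORK cost
   }
   return bd;
}
```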
2020-09-24, 03:45   #58
Citrix

Jun 2003

11000101011₂ Posts

Quote:
 Originally Posted by rob147147 I think EXP_WORK relates to the additional mulmods involved in pre-calculating b^d (mod p) for all the appropriate remainder values 'd' where 0 <= d < Q. E.g. We are rewriting the terms of a sequence k * b^n -1 as k * b^d * b^(Qm) -1 where n=Qm+d. The idea behind Q is to identify which values 'd' are unnecessary, thus ending up with less than Q subsequences over a range 1/Q times the size of the original range, hopefully giving a performance improvement. There could be multiple terms with the same remainder 'd' so we can precalculate b^d (mod p) for those instances. My understanding from looking at the source code of sr1sieve (bsgs.c) is that it is this mulmod that we are accounting for in EXP_WORK.
Robert, I think you are correct. I came to the same conclusion when reviewing the code. Though for sr1sieve there would be no need to calculate b^d (mod p) for particular d values, as there are no candidates left for certain d. That is the whole point of choosing a Q value.

For multiple k (sr2sieve) the program should intelligently decide which b^d should be computed.

In srsieve2 the following function would need to be corrected to make it faster for a single k.

Code:
void  GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64)
{
   ...
}

2020-09-24, 05:31   #59
Citrix

Jun 2003

1,579 Posts

I tried to replace mulmod by powmod to avoid unnecessary multiplications. I get an error at run time. Any idea why the calculations are not being done correctly?

Code:
void  GenericWorker::BuildTables(uint64_t *baseToUse, uint64_t *primeList, double *invp, uint64_t *bm64)
{
   uint64_t inv_b[4];
   uint32_t qIdx, seqIdx, ssIdx;
   uint64_t umod[4], inv[4];

   fpu_push_1divp(primeList[3]);
   fpu_push_1divp(primeList[2]);
   fpu_push_1divp(primeList[1]);
   fpu_push_1divp(primeList[0]);

   // Precompute 1/b^d (mod p) for 0 <= d <= Q.
   bd64[0][0] = bd64[0][1] = bd64[0][2] = bd64[0][3] = 1;

   bm64[0] = inv_b[0] = invmod64(baseToUse[0], primeList[0]);
   bm64[1] = inv_b[1] = invmod64(baseToUse[1], primeList[1]);
   bm64[2] = inv_b[2] = invmod64(baseToUse[2], primeList[2]);
   bm64[3] = inv_b[3] = invmod64(baseToUse[3], primeList[3]);

   /* for (qIdx=1; qIdx
2020-09-24, 12:44   #60
rogue

"Mark"
Apr 2003
Between here and the

1829₁₆ Posts

fpu_powmod_4b_1n_4p() assumes that there is nothing on the FPU stack, but fpu_mulmod_4a_4b_4p() requires the first four entries on the stack to be preset (per fpu_push_1divp). You would have to create a version of fpu_powmod_4b_1n_4p() that supports precomputed values on the stack.
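As an aside, the idea behind fpu_push_1divp (precompute 1/p once so each subsequent mulmod needs no division) can be sketched in portable C++. This is only an illustration of the technique, not mtsieve's x87 code; mulmod_recip is a hypothetical name, and it assumes a*b is small enough to be held exactly in a long double so the quotient estimate is off by at most one:

```cpp
#include <cstdint>

// Reciprocal-based mulmod: q estimates floor(a*b/p) using a precomputed 1/p,
// then the remainder is recovered with a +/- p fixup for the rounding error.
static uint64_t mulmod_recip(uint64_t a, uint64_t b, uint64_t p, long double inv_p) {
   uint64_t q = (uint64_t)((long double)a * (long double)b * inv_p);
   // Remainder computed mod 2^64; q may be off by one in either direction.
   int64_t r = (int64_t)(a * b - q * p);
   if (r < 0)
      r += (int64_t)p;
   else if ((uint64_t)r >= p)
      r -= (int64_t)p;
   return (uint64_t)r;
}
```

The point of presetting the stack (or registers) is that inv_p is loaded once per prime and then reused by every multiplication in the loop.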
2020-09-26, 12:15   #61
Happy5214

"Alexander"
Nov 2008
The Alamo City

3²·7² Posts

Quote:
 Originally Posted by rogue fpu_powmod_4b_1n_4p() assumes that there is nothing on the FPU stack, but fpu_mulmod_4a_4b_4p() requires the first four entries on the stack to be preset (per fpu_push_1divp). You would have to create a version of fpu_powmod_4b_1n_4p() that supports precomputed values on the stack.
On that subject, I'm about to implement the ARM version of fpu_powmod_4b_1n_4p() (it'll be the only one I'll have to spill registers for). Is there a reason why fpu_mulmod_4a_4b_4p() preloads the 1/p values and fpu_powmod_4b_1n_4p() doesn't? Should I keep the current semantics with the ARM version?

I discovered when I implemented fpu_powmod() that the x87 version does in fact use right-to-left exponentiation, as I noted in a previous post, so you were right on that concern earlier.

Last fiddled with by Happy5214 on 2020-09-26 at 12:16
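For reference, right-to-left (LSB-first) binary exponentiation, the scheme the x87 fpu_powmod() turned out to use, looks like this in portable C++ (an illustrative sketch; powmodRtl and mulmod are my names, not mtsieve's):

```cpp
#include <cstdint>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
   return (uint64_t)((__uint128_t)a * b % p);
}

// Right-to-left square-and-multiply: scan the exponent from its low bit,
// squaring the base each step and folding it in whenever the bit is set.
uint64_t powmodRtl(uint64_t b, uint64_t e, uint64_t p) {
   uint64_t result = 1;
   b %= p;
   while (e) {
      if (e & 1)                     // low bit set: multiply current square in
         result = mulmod(result, b, p);
      b = mulmod(b, b, p);           // square for the next bit
      e >>= 1;
   }
   return result;
}
```

Unlike the left-to-right variant, this needs no advance knowledge of the exponent's bit length, which is one reason it is a common choice for a generic powmod.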

2020-09-26, 13:06   #62
rogue

"Mark"
Apr 2003
Between here and the

6185₁₀ Posts

Quote:
 Originally Posted by Happy5214 On that subject, I'm about to implement the ARM version of fpu_powmod_4b_1n_4p() (it'll be the only one I'll have to spill registers for ). Is there a reason why fpu_mulmod_4a_4b_4p() preloads the 1/p values and fpu_powmod_4b_1n_4p() doesn't? Should I keep the current semantics with the ARM version? I discovered when I implemented fpu_powmod() that the x87 version does in fact use right-to-left exponentiation, as I noted in a previous post, so you were right on that concern earlier.
In most cases fpu_powmod_4b_1n_4p() is called once to set up a loop and fpu_mulmod_4a_4b_4p() is called within the loop. There would be a small speed bump from setting up the stack prior to the call to fpu_powmod_4b_1n_4p(), but it wouldn't benefit all sieves, and most that would benefit would probably see a negligible gain. The long-term goal would then be to preset the stack (or registers for ARM) for all fpu functions.

2020-12-27, 18:26   #63
rogue

"Mark"
Apr 2003
Between here and the

5×1,237 Posts

I have posted mtsieve 2.1.0 over at sourceforge. There are many changes with this release, but users of gcwsievecl, mfsieve, and mfsievecl will get the most benefit. The changes are:

Code:
2.1.0 - December 27, 2020

framework:
   On Windows, switched to using gcc from msys2 instead of gcc from mingw64.
   -O3 gives a nice performance bump to some of the non-ASM code.
   Exit after showing help when using -h switch instead of giving a fatal error.
   Add logic to support validation of factors passed with -I, but not all sieves
   are coded to do this validation.
   On GPU builds, default -W to 0 and -G to 1. A "special" CPU worker will be
   created as necessary to test ranges of p that are too small for the GPU code.
   For GPU executables, add -H to show some GPU details when each kernel is created.
   Improve factor rate calculation when using GPU workers.
   Created FUTURE.txt for items I would like to get working.

cksieve: version 1.3
   Added logic to validate factors passed with -I.

gcwsieve, gcwsievecl: version 1.4
   Added logic to validate factors passed with -I.
   Add -f option to gcwsieve so that it can generate ABC output that is supported by LLR.
   Added -M option and fixed -S for gcwsievecl.
   Improved speed of GPU code of gcwsievecl by about 25%. The GPU code is more
   than 20x faster than the CPU for the same amount of work.

mfsieve, mfsievecl: version 1.7
   Added logic to validate factors passed with -I.
   Replaced all ASM routines with non-ASM routines based upon Montgomery
   multiplication. This provides a speed bump of at least 30% over the previous release.
   Fixed the GPU code and replaced magic number logic with Montgomery
   multiplication as it is faster. The new GPU code is more than 25% faster than
   the old GPU code. The GPU code is more than 20x faster than the CPU for the
   same amount of work.

sgsieve: version 1.2
   Added logic to validate factors passed with -I.

twinsieve: version 1.3
   Added logic to validate factors passed with -I.

xyyxsieve, xyyxsievecl: version 1.8
   Added logic to validate factors passed with -I.

I also added a FUTURE.txt file with what I want to do in the future. I know there are things that I have promised that are not in this list. Please let me know what I have missed:

Code:
2.0.7 - December 22, 2020

framework:
   Replace AMD SDK on Windows with open source SDK for OpenCL.
   Add more output for -H for GPU executables.

afsieve, afsievecl:
   Replace logic with Montgomery multiplication.
   Add factor validation when using -I.

dmdsieve:
   Add factor validation when using -I.

fbncsieve:
   Add factor validation when using -I.

fkbnsieve:
   Add factor validation when using -I.

gfndsieve:
   Add factor validation when using -I.

kbbsieve:
   Add factor validation when using -I.

psieve:
   Add factor validation when using -I.

pixsieve:

srsieve2:
   Implement sr1sieve and sr2sieve logic.
   Add factor validation when using -I.

I will probably complete most of the -I updates in the coming days, and a replacement OpenCL SDK on Windows is close behind.
2020-12-27, 19:21   #64
pepi37

Dec 2011
After milion nines:)

2×5×139 Posts

Could you raise the upper limit of srsieve2?
2020-12-27, 20:06   #65
rogue

"Mark"
Apr 2003
Between here and the

5·1,237 Posts

Quote:
 Originally Posted by pepi37 If you can rise upper limit of srsieve2.
So you need it to go above 2^52?

2020-12-27, 21:01   #66
pepi37

Dec 2011
After milion nines:)

2×5×139 Posts

Quote:
 Originally Posted by rogue So you need it to go above 2^52?
1e17 is good for me
