mersenneforum.org  

2019-09-12, 17:32   #122
hansl

Quote:
Originally Posted by hansl View Post
If we look at your matrix of classes mod 420, you can see that each row is actually the same as all the others, just rotated by some amount (it's all basically a single wheel).
...
So right now I hypothesize that there exists some simple (loop-free) modular arithmetic to determine the amount of rotation/offset between any two rows, given their residues, but I haven't worked it out yet. That would make it much simpler to store only one row (the first one, for example) and offset into it by the appropriate amount.
I would still very much like to know if this is possible, but I haven't been able to work out such a solution. Not even sure where to begin.

Anyone?
2019-09-12, 23:24   #123
hansl

I actually got a bit of help on IRC today and found that the offsets can be calculated using the CRT (Chinese remainder theorem).

The method is not all that simple, nor loop-free, but at least it seems to correctly compute the offset without *necessarily* creating each vector. Perhaps it can be simplified further.

I wrote a little function to help visualize the process
Code:
kw(p,M)={ 
    my(c=1); 
    for(k=1,M,
        \\ candidate factor q = 2*k*p + 1 for class k mod M
        my(q=2*k*p+1,b=q%8);
        if(M==4,
            \\ M==4: q must be +-1 (mod 8) to divide a Mersenne number
            if(b!=3 && b!=5,print1("1"),print1("0")),
            if(isprime(M),
                \\ M an odd prime: q must be coprime to M
                if(gcd(M,q)==1,print1("1"),print1("0")),
                \\ M composite: print the gap since the last surviving class
                if(b!=3 && b!=5 && gcd(M,q)==1, print1(c,","); c=1,c++)
            )
        )
    );
};
If passed M==4, it prints a simple series of 1s and 0s for which classes (k mod 4) give (2*k*p+1) congruent to ±1 mod 8; if M is an odd prime, it does the same for which classes give (2*k*p+1) coprime to M.

Otherwise (if M is composite) it displays the gaps between surviving classes of k mod M,
like the original script.
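As a cross-check (my own sketch, not from the thread), the composite-M branch of kw can be mirrored in plain Python; the name gap_vector is hypothetical:

```python
from math import gcd

def gap_vector(p, M):
    """Gaps between surviving classes k mod M, where q = 2*k*p + 1
    must be +-1 (mod 8) and coprime to M.
    Mirrors the composite-M branch of the kw() PARI script."""
    gaps, c = [], 1
    for k in range(1, M + 1):
        q = 2 * k * p + 1
        if q % 8 not in (3, 5) and gcd(q, M) == 1:
            gaps.append(c)
            c = 1
        else:
            c += 1
    return gaps

print(gap_vector(1, 60))
# [3, 5, 3, 4, 5, 3, 1, 11, 1, 3, 5, 4, 3, 5, 3, 1]
```

Note that the gaps always sum to M, since they partition one full turn of the wheel.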

So, for example, generating the classes of k for p ≡ 1 (mod 60):
Code:
? kw(1,60)
3,5,3,4,5,3,1,11,1,3,5,4,3,5,3,1,
Then say we want to know at what offset into that array p ≡ 7 (mod 60) would start.
Code:
? kw(7,60)
5,3,1,3,5,3,4,5,3,1,11,1,3,5,4,3,
We can see visually that the pattern for this one corresponds to starting 13 elements into the first vector (counting from 0). Another way to say it: we would start k at the sum of those preceding values in the first vector: (3+5+3+4+5+3+1+11+1+3+5+4+3) = 51
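The "visual" rotation matching can be done mechanically. A small Python sketch (names are mine, not from the thread) that finds the rotation of one gap vector inside another and converts it to a k-offset via the partial sum:

```python
# Gap vectors printed by kw(1,60) and kw(7,60) above.
v1 = [3, 5, 3, 4, 5, 3, 1, 11, 1, 3, 5, 4, 3, 5, 3, 1]
v7 = [5, 3, 1, 3, 5, 3, 4, 5, 3, 1, 11, 1, 3, 5, 4, 3]

def rotation_offset(base, rotated):
    """Smallest r such that rotated == base[r:] + base[:r], or None."""
    for r in range(len(base)):
        if base[r:] + base[:r] == rotated:
            return r
    return None

r = rotation_offset(v1, v7)
print(r, sum(v1[:r]))  # rotation index, and the k-offset it corresponds to
# 13 51
```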

But to instead get that result without generating this vector all over again, we have to look at how much each individual factor is shifted between the two p values (we count 4 as its own single factor here):
Code:
? kw(1,4)
0011
? kw(7,4)
1001
The second result is shifted by 3 with respect to the first.
So we make a note: Mod(3,4)

Now we look at the other prime factors of 60:
Code:
? kw(1,3)
011
? kw(7,3)
011
This one isn't shifted between the two, so we note: Mod(0,3)

Code:
? kw(1,5)
10111
? kw(7,5)
01111
This one starts 1 element into the first vector: Mod(1,5)

Then we put it all together with CRT:
Code:
? chinese(chinese(Mod(3,4),Mod(0,3)),Mod(1,5))
%158 = Mod(51, 60)
There's the 51 I mentioned before, which we could convert back to a vector index/offset by taking partial sums of the original vector.
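For completeness, the same combination can be done without PARI's chinese(). A minimal CRT for pairwise-coprime moduli (my own sketch; it uses Python's pow(m, -1, mod) modular inverse, available since Python 3.8):

```python
def crt(congruences):
    """Combine x ≡ r (mod m) congruences with pairwise-coprime moduli,
    returning (x, product of moduli)."""
    x, m = 0, 1
    for r, mod in congruences:
        # Adjust x by a multiple of m so that the new x ≡ r (mod mod).
        t = ((r - x) * pow(m, -1, mod)) % mod
        x += m * t
        m *= mod
    return x % m, m

print(crt([(3, 4), (0, 3), (1, 5)]))
# (51, 60)
```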

I'm not really sure whether all this saves any time in terms of total computations. The intention would be to pre-compute these offsets once for each (p mod 60), which could also be done by temporarily generating each full vector and pattern-matching where it begins in the original. The main idea is just to avoid storing all rotations of the same vector over and over, taking up RAM.

Last fiddled with by hansl on 2019-09-12 at 23:37
2019-09-14, 05:51   #124
LaurV

That's correct, but pre-computing and storing the vectors (like in my initial code with the loop) is much faster; you do it only once, and a storage space of a few kilobytes won't be a problem nowadays.
2019-09-14, 06:35   #125
hansl

Well, yeah, it's only a few KB when you are doing mod 60, or 420, but as you add more factors to the modulus, it needs eulerphi(m)^2 space to store every rotated vector. Growing by the square of each successive prime (minus 1, for phi) gets out of hand quickly.

Since I now calculate just the first vector, and the offsets into that vector, I only need to store at most 2*eulerphi(m) elements.
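To put concrete numbers on the eulerphi(m)^2 vs 2*eulerphi(m) comparison (my own illustration): since the moduli here are 4 times distinct odd primes, eulerphi is just the product of phi of each factor, with phi(4) = 2 and phi(p) = p - 1:

```python
def phi_from_factors(factors):
    """Euler phi of the product of pairwise-coprime factors
    (here 4 and distinct odd primes)."""
    phi = 1
    for f in factors:
        phi *= 2 if f == 4 else f - 1
    return phi

for factors in ([4, 3, 5, 7], [4, 3, 5, 7, 11], [4, 3, 5, 7, 13]):
    m = 1
    for f in factors:
        m *= f
    phi = phi_from_factors(factors)
    # modulus, phi, all-rotations storage, first-row-plus-offsets storage
    print(m, phi, phi * phi, 2 * phi)
```

For mod 4620 this gives phi = 960, i.e. 921,600 entries for all rotations versus 1,920 for one row plus offsets.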

So, in the C++ code I'm working on now, instead of 4*3*5*7=420 from your original script, or 4*3*5*7*11=4620 as mfaktc uses, I can technically include factors up to 23. Although that number of classes (product=446,185,740, phi=72,990,720) gets absurdly impractical due to the overhead of just iterating through millions of classes during the actual TF. Still, even at that level it only takes ~10s or so, and a GB or two of memory, to store all the relations.

From the testing/benchmarking I've done so far, I think 4*3*5*7*13=60060 is the sweet spot. Maybe one more "level", depending on how many bits are being factored to; I need to test more once the code is more complete. I've implemented calculating these classes in C++ and can loop through them to do factoring the same way as the PARI script, but at the moment it's separate from the sieve-based part of the code that I developed first. I'm just working on integrating these two aspects now.

I think these classes should speed up the sieve a bit, since the smallest factors require the most bit-flipping operations across the sieve. It sort of acts like a pre-sieve, so hopefully I'll see some more gains.
2019-09-16, 07:00   #126
LaurV

Technically, you need more classes when the bitlevel is high and the exponent is low.

As the exponent increases, OR as the bitlevel decreases, you have, in each case and in both combined, fewer factor candidates per chunk (or per bitlevel, depending on how you work).

In your case, testing HUGE exponents to LOW bitlevels, you do not need more classes but fewer, otherwise the overhead of sieving each class is too big. The number of candidates in each class, per chunk (or bitlevel), has to be large enough to be worth the sieving. In fact, I would be quite surprised if, when TFing 4G to 10G ranges to 70 bits or so, the 420 classes were not the fastest, by far.

Sieving each remaining class also takes a lot of time: you need to pre-calculate a starting point and then sieve all the candidates out, for each prime. Where you actually need to work is finding the "sweet spot" of how far you sieve (how many primes), because after a while you will still have non-prime candidates in the list, but it takes less time to do the exponentiation for them than to continue sieving. And if your chunks are small, then you do too much exponentiation compared with sieving.

But ignore all that, and even assume you need to go to 4620 classes: the "matrix" you talk about will be 960*960=921600 entries, each storing a number between 0 and 4619 (i.e. 2 bytes in the worst case). This is a piece of cake for the GB RAM cards we use today, and it can be stored quite efficiently, in much less space (think about it!). Of course, at the extreme, you can reduce it to just 1920 entries (3840 bytes) by storing the classes in a vector of 960 shorts and the shifts in another 960. But is it worth it?

So I don't see the point of the effort.

If you want, for example, to improve things, one huge improvement which is really easy to do would be to add another 2 bits to the exponent for mfaktc. Think about how the exponentiation works: you start with 1, then for every bit of the exponent from left to right, you square, and if the bit is 1, you shift left by one position.

For example, if I want to compute 2^13 using L-to-R exponentiation: 13 in binary is 1101, so we start with 1, then we (square, shift left), (square, shift left), (square, no shift left (the third bit is zero)), (square, shift left). As a shift left is equivalent to a multiplication by 2, we will have: 1, (1, 2), (4, 8), (64, 64), (4096, 8192), which is the result we intended (parentheses added for clarity).

Now imagine that instead of 13, which is a 4-bit number, we have a 6-bit one. The first two bits may be 00, 01, 10, or 11. If we had to raise 2 to a 6-bit power, we would start with 1, and for the first two bits we would square and either multiply by 2 (shift left) or not, depending on those bits, two times. In any of the four cases above, after the first two steps of the exponentiation we end up with either 1, 2, 4, or 8: that is, 2 to the power given by the two bits.

Then we continue with the last 4 bits of the exponent, which is exactly what we did when we exponentiated with 4 bits, except that now we only start with 1 when the first 2 bits were 00; in the other 3 cases we start with 2, 4, or 8.

Got the idea? Extending mfaktc from a 32-bit exponent to a 34-bit exponent is kind of a piece of cake, nothing scary. You read the 34-bit exponent from the file or command line into a string or a 64-bit variable, split it into the first 2 bits and the last 32 bits, and give an error if it is too big. Then, selecting the classes to work with depends only on the residue of the exponent mod 4620, and it can easily be computed by the CPU or the GPU at the beginning of testing; the size of the exponent does not affect the sieving in any way and can be ignored afterwards. You only need to work with 32 bits, as mfaktc already does, but when you call the exponentiation function on the card (the SIMD part), you just need an additional parameter to specify the starting value. Right now the starting value is always 1 (**), but you will need to set it depending on the first two bits of the exponent (which, afterwards, you just ignore). Of course, you can specify a larger value too (16, 32, 64, 128) if you want a 3-bit extension, etc., but two bits would be enough to cover the 10G exponents (in fact, 34 bits can go to over 17G, but that is not the point).

----------
(**) In fact, because the first bit of any number (ignoring non-significant zeros) is always 1, exponentiation usually starts "after" the first bit, i.e. it ignores the first bit completely and starts with the value 2, which is what would have been obtained by starting with 1 and doing the first "square and shift" for that bit.
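The two-bit extension described above can be sketched as follows (my own Python illustration of the technique, not mfaktc code): a left-to-right binary exponentiation of 2 that accepts an arbitrary starting value, so the top bits of the exponent fold into the start:

```python
def ltr_pow2(exp_bits, start, mod):
    """Left-to-right exponentiation of 2: for each bit, square, and if
    the bit is 1 multiply by 2 (a shift left), all mod `mod`.
    `start` encodes 2^(bits already consumed)."""
    x = start
    for bit in exp_bits:
        x = (x * x) % mod
        if bit == 1:
            x = (x * 2) % mod
    return x

def pow2_34bit(e, mod):
    """Split a 34-bit exponent into its top 2 bits and low 32 bits,
    then start the 32-bit loop at 2^(top two bits)."""
    top, low = e >> 32, e & 0xFFFFFFFF
    bits = [(low >> i) & 1 for i in reversed(range(32))]
    return ltr_pow2(bits, pow(2, top, mod), mod)

# Check against Python's built-in modular pow, with q = M61 as modulus.
e, q = (3 << 32) | 123456789, 2**61 - 1
print(pow2_34bit(e, q) == pow(2, e, q))  # True
```

With start=1 and the bits of 13, the loop reproduces LaurV's worked sequence and ends at 8192.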

Last fiddled with by LaurV on 2019-09-16 at 07:07
2019-09-16, 13:44   #127
hansl

Wow, thanks for the detailed feedback.

Yeah, I got my sieve to "work" with the classes code, and found that due to the unanticipated overhead of finding the first bit to clear, it's fastest at something like 4*3 classes, and I think it might still be slightly slower than my previous sieving with *no* classes.

The previous mention of 60060 being fastest was, of course, with no other sieving.

So now I'm putting the sieving with no classes support back in, so I can test things side by side again...

I don't know, I might try modifying mfaktc, but honestly I'm still a bit intimidated by CUDA stuff.

The one other thing that bugged me about mfaktc is that it seems very inefficient in how the worktodo is handled, rewriting the whole file every exponent. This isn't a big deal for its intended use case of deep-factoring a relatively small number of exponents, but in my case I have to split the work into thousands or tens of thousands of tiny worktodo files to avoid all that IO overhead. I would probably also want to modify it so it can just take a [min, max] range of exponents from the command line and skip the worktodo entirely.
2019-09-23, 06:22   #128
LaurV

We agree on the last part (mfaktX not being efficient at handling the worktodo files). I don't know if you are aware, but you can pass the exponent on the command line, so if you do very short assignments (where each task takes seconds, or less), you may be faster calling mfaktX from a batch or script in a loop over all the exponents, avoiding the worktodo file completely. It may be worth a try. The [min, max] idea is not very good, because you would need to keep track (a whole database!) of the exponents already factored in the range; otherwise you will do a lot of duplicated work when you deepen the bitlevels for the exponents that survived the former bitlevel.
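A per-exponent loop like LaurV suggests might look like this (my own sketch; the "-tf exponent bit_min bit_max" argument form is an assumption about mfaktc's command line, so check your build's README, and the exponents below are placeholders):

```python
import subprocess

def tf_commands(exponents, bit_min, bit_max, binary="./mfaktc.exe"):
    """Build one trial-factoring invocation per exponent, skipping the
    worktodo file entirely. The '-tf <exp> <min> <max>' syntax is an
    assumption; adjust to match the actual mfaktc command line."""
    return [[binary, "-tf", str(p), str(bit_min), str(bit_max)]
            for p in exponents]

# Dry run with placeholder exponents: print commands instead of running.
for cmd in tf_commands([4000000007, 4000000063], 1, 65):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run
```

This avoids rewriting the worktodo file once per exponent, which is the IO overhead complained about above.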

Last fiddled with by LaurV on 2019-09-23 at 06:26