mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Compare interim files with different start shifts? (https://www.mersenneforum.org/showthread.php?t=16720)

zanmato 2012-04-13 11:32

Compare interim files with different start shifts?
 
Is there software which can compare the interim residue files of two first runs of an LL test (with different starting shifts) generated by prime95? I've tried looking at the prime95 source (not the best at reading c), and can just about handle the format of the interim file. Even if what I assume is encryption can be overcome, could the residues be compared? I don't know how a shifted start can still get the same answer (but I do understand the motivation, different FFTs), but can only guess (hope) that if the final iteration result is comparable, that all previous iterations are comparable. Does a shifted start use a commonly known property that I can read up on somewhere, if not how does it work?

This is what I think happens at present (please correct me if I'm wrong):[LIST][*]If you start two separate tests from the beginning, with the same bit shift (same 64 bit residue), the residue files are different because they are encrypted by a different checksum[*]If you start one test from the beginning, then start another with an interim file of the first, the same checksum is used so interim residues between tests can be compared directly[/LIST]
The motivation is for testing larger mersenne numbers, most tests I've seen seem to have failed due to errors. One way to overcome this with non-ecc ram is to run identical tests (on similarly performing hardware) at the same time. Compare the interim residues, and if they're different restart both tests from known good interim files (matching interims). Even if both tests have errors in the same interval, the overwhelming probability is that they will be different errors, it will be caught and both tests restarted from known good positions.

Comparing interim files of identical tests is fine to eliminate errors creeping in, but does half the throughput. If comparing interim files of different starting shifts is possible, then the two tests are separate first time tests, with the same throughput as current tests (slightly higher as bad tests due to errors are caught instead of executing to completion). One person could use two computers, or two people could use a computer each and communicate the interim files once in a while, if that is necessary for both tests to be considered first time tests.

Batalov 2012-04-13 16:39

Most of your guesses look wrong.
No different FFT sizes (not due to shifts, in the context of your Q).
Residues can be compared, yes. Quite easily. The whole files can be compared with a 20-line program (and even without using GMP).
The whole checkpoint file (except for a short header) is a long binary number. It is not encrypted - only one small part of the header is (to avoid forgeries).

If you read the source (which is readily available), you will find all these answers pretty soon.

zanmato 2012-04-13 21:45

[quote=Batalov]No different FFT sizes (not due to shifts, in the context of your Q).[/quote]I said shift because of this taken from commonb.c, line 5161, in the lucas lehmer function int prime (...):[code]/* Start off with the 1st Lucas number - four */
/* Note we do something a little strange here. We actually set the */
/* first number to 4 but shifted by a random amount. This lets two */
/* different machines check the same Mersenne number and operate */
/* on different FFT data - thus greatly reducing the chance that */
/* a CPU or program error corrupts the results. */[/code]

[quote=Batalov]Residues can be compared, yes. Quite easily. The whole files can be compared with a 20-line program (and even without using GMP)
The whole checkpoint file (except for a short header) is a long binary number. It is not encrypted - only one small part of the header is (to avoid forgeries)..[/quote]Just to make sure we are on the same page, your 20-line program assumes the 'long binary numbers' are identical in both files, so you simply do a byte for byte comparison of the bignum? I started (from scratch) two separate LL tests of the same number, and compared the 1000th iteration. The 64 bit residue printed the same, but the 'long binary number' was different (residue files [url=http://www.sendspace.com/file/lvr6hh]here[/url], 14.1MiB):

p4X81261.001.A:[code][Work thread Apr 13 21:57] Worker starting
[Work thread Apr 13 21:57] Setting affinity to run worker on any logical CPU.
[Work thread Apr 13 21:57] Starting primality test of M58481261 using Core2 type-3 FFT length 3M, Pass1=1K, Pass2=3K
[Work thread Apr 13 21:57] Iteration: 1000 / 58481261 [0.00%]. Per iteration time: 0.032 sec.
[Work thread Apr 13 21:57] M58481261 interim We8 residue D831AFA5F7914DE5 at iteration 1000
[Work thread Apr 13 21:57] M58481261 interim We8 residue 1F187364A850369D at iteration 1001
[Work thread Apr 13 21:58] M58481261 interim We8 residue BD61DC027FCEF1CA at iteration 1002
^C[Main thread Apr 13 21:58] Stopping all worker threads.[/code]

p4X81261.001.B:[code][Work thread Apr 13 21:59] Worker starting
[Work thread Apr 13 21:59] Setting affinity to run worker on any logical CPU.
[Work thread Apr 13 21:59] Starting primality test of M58481261 using Core2 type-3 FFT length 3M, Pass1=1K, Pass2=3K
[Work thread Apr 13 21:59] Iteration: 1000 / 58481261 [0.00%]. Per iteration time: 0.032 sec.
[Work thread Apr 13 21:59] M58481261 interim We8 residue D831AFA5F7914DE5 at iteration 1000
[Work thread Apr 13 21:59] M58481261 interim We8 residue 1F187364A850369D at iteration 1001
[Work thread Apr 13 21:59] M58481261 interim We8 residue BD61DC027FCEF1CA at iteration 1002
^C[Main thread Apr 13 21:59] Stopping all worker threads.[/code]

Dubslow 2012-04-13 21:51

Consider that one iteration takes a few milliseconds, and that your reaction time to seeing the interim residues + time taken by MPrime to react to ^C >> a few milliseconds. I think it's considerably probable that your two save files are not at the same iteration; that's gotta be stored in the file headers that Batalov mentioned.

One thing to be sure is that the bignum is stored unshifted -- I believe that was one of your original questions, however it was not specifically answered by Batalov (it may have been answered indirectly). I think it's likely that it is stored in shifted form, that would be somewhat simpler. I'd say instead of looking at prime(), look at whatever function writes the save files. (Edit: That's line 4550 on commonb.c. Edit2: The actual write-functions are not in the same file. Edit3: A quick grep shows that write_gwnum() is defined in commonc.c, line 4207.)
Edit4: I can't make heads or tails of it, not the least of which because I don't really have a clue about all the custom datatypes/structures.
[code]int write_gwnum (
int fd,
gwhandle *gwdata,
gwnum g,
unsigned long *sum)
{
giant tmp;
unsigned long i, len, bytes;

tmp = popg (&gwdata->gdata, ((int) gwdata->bit_length >> 5) + 10);
if (tmp == NULL) return (FALSE);
gwtogiant (gwdata, g, tmp);
len = tmp->sign;
if (len == 0) return (FALSE);
if (!write_long (fd, len, sum)) return (FALSE);
bytes = len * sizeof (uint32_t);
if (_write (fd, tmp->n, bytes) != bytes) return (FALSE);
*sum = (uint32_t) (*sum + len);
for (i = 0; i < len; i++) *sum = (uint32_t) (*sum + tmp->n[i]);
pushg (&gwdata->gdata, 1);
return (TRUE);
}[/code]

Batalov 2012-04-13 22:17

I only have time to draw a silly cartoon, so that's exactly what i will do.
File 1 will be
[COLOR=blue][header...bla-bla..|shiftValue1|...bla-bla..][the rest is long number, you can read it in an array of bytes][/COLOR]
File 2 will be
[COLOR=blue][header...bla-bla..|shiftValue2|...bla-bla..][the rest is long number, you can read it and compare on-the-fly to that other array of bytes with offset2=((offset1-shiftValue1+shiftValue2)/8)%P with some trivial bitwise shift for (shiftValue1-shiftValue2)%8)][/COLOR]


[COLOR=black]P.S. ...well and some more careful wrap-around from the beginning to the end of the number in one particular place. The gist is that the other number is cyclically rotated by ([COLOR=#0000ff]shiftValue1-shiftValue2[/COLOR]), that's all. This is because the Mp is a long bunch of 1s in binary and the overflow of any multiplication by 2^k simply lands in the low k bits and so on - see [URL="http://www.mersenne.org/various/math.php"]The math[/URL] section of GIMPS site for a proper, proof-read explanation. That's why all shifted tests are equivalent (but they go through totally different value-sets in FFT). The unshifted RES64 value (that goes on screen and in logs) is taken from the properly rotated 64-bit sub-string, but the whole huge residue is just likewise rotated.[/COLOR]

P.P.S. In the savefiles, [COLOR=#0000ff]shiftValue [/COLOR][COLOR=black]is not the initial shift value. Conveniently it is the[I] running[/I] shift-value (which gets +=initial shift value mod p in every iteration). This makes this whole excercise very easy.[/COLOR]

zanmato 2012-04-13 23:27

[quote=Dubslow]it's considerably probable that your two save files are not at the same iteration
...
look at whatever function writes the save files[/quote]I used InterimFiles=1000 in prime.txt, which tells the program to spit an LLSaveFile out every 1000 iterations. As I mentioned in the OP, I understand the layout of the file (have confirmed the iteration counter is at 1000 in both residue files). The bignum seems to be stored as is (got as far as you have in your edit), but I didn't fully analyse it. To be honest I got as far as gwtogiant, then it started to get heavy. I posted this topic to make sure I wasn't wasting my time, and was somewhere near the right track.

[quote=Batalov]...[/quote]
Thank you. You've confirmed the residue is something that can be compared. I'll try to figure out the details tomorrow, your pps's are what i had hoped would be all that was necessary. That link is also very useful. It doesn't load properly in opera so I've never seen the contents. Until now I assumed it was a dead link.

PS: Something that may trip me up is endianness, seems like it might be applicable looking at _write and the fact the longs in the header are LE. Haven't confirmed it will be an issue, but now that I've said it I can use it as an excuse if I'm too dumb to figure this all out :grin:

Batalov 2012-04-14 02:31

I wrote a lazy implementation for you. You can surely make it better.
[CODE]#include <stdio.h>
#include <stdlib.h>

main(int argc,char **argv)
{
unsigned int z,c,sh,h;
unsigned int P,C,SH;
int i,j,l;
char *progname;
unsigned char bf[1024];
FILE *fp;
char *na,*p,*s;
unsigned char *data;
unsigned char *data1 = NULL;
progname = argv[0];
if(argc<3) {
fprintf(stderr," Use: %s p_file p_file2 [p-file3...]\n", progname, p);
exit(-1);
}

#define HEADER_LEN 68

fp = NULL;
for(i=1;i<argc;i++) {
if(fp)fclose(fp);
p = na = argv[i];
if(!(fp=fopen(p,"rb"))) {
fprintf(stderr,"%s: can't open file %s \n", progname, p);
continue;
}
if(fread( bf, sizeof(char) , HEADER_LEN, fp )!=HEADER_LEN) {
fprintf(stderr,"%s: can't read header from file %s \n", progname, p);
continue;
}
for(s=p; *s ;s++) if(*s=='/' || *s=='\\') p=s+1;

/* insert vailidity tests here: check magic value, check some other */

z = bf[20]+256*(bf[21]+256*(bf[22]+256*bf[23])); /* the p value */
c = bf[56]+256*(bf[57]+256*(bf[58]+256*bf[59])); /* current iter */
h = bf[60]+256*(bf[61]+256*(bf[62]+256*bf[63])); /* current shift */
{ /* let's restore the initial shift */
sh = h;
if(sh<0 || sh>=z) sh= -1;
else if(c>z/2) {
for (j=c; j<z; j++) {
sh += sh;
if (sh>=z) sh -= z ;
}
} else {
for (j=c; j>1; j--) {
if(sh&1) sh += z;
sh /= 2;
}
}
} /* when I wrote this 4 years ago, I was to lazy to do it differently */

/* let's humor the reader with some general stats and (possibly) errors */
printf(" %s\t%8d / %8d iterations done; initial shift=%d\n", na, c, z, sh);
if(bf[52]) printf(" %s\t%4d type-0 ERROR%s\n", na, bf[52], bf[52]==1?"":"s");
if(bf[53]) printf(" %s\t%4d type-1 ERROR%s\n", na, bf[53], bf[53]==1?"":"s");
if(bf[54]) printf(" %s\t%4d type-2 ERROR%s\n", na, bf[54], bf[54]==1?"":"s");
if(bf[55]) printf(" %s\t%4d type-3 ERROR%s\n", na, bf[55], bf[55]==1?"":"s");

l=(z+7)>>3;
if(i<=2) data = (unsigned char *)malloc(l+4);
if(fread( data, sizeof(char) , l, fp )!= l) {
fprintf(stderr,"%s: can't read header from file %s \n", progname, p);
continue;
}
for(j=0;j<3;j++) { /* let's prepare extra wraparound bits; they are the same as the beginning */
int t = data[j]<<(z%8);
data[l+j-1] |= (char)t;
data[l+j ] = (char)(t>>8);
}
if(!data1) { /* p_file1 - to be read */
printf(" %s\t now have read reference data for M%d\n", na, z, na);
P = z;
C = c;
SH = h;
data1 = data;
} else { /* p_file2, 3, 4 - to be compared to p_file1 */
if(P != z) {
fprintf(stderr," %s\t this file is for a different Mp (M%d != M%d) \n", na, P, z);
continue;
}
if(C != c) {
fprintf(stderr," %s\t this file is for the same M%d but different iteration (%d != %d) \n", na, P, C, c);
continue;
}
{ /* comparison, in byte chunks, -- this can be optimized in many ways; I am too lazy */
int ndiff = 0;
int d = (P + SH - h)%P;
for(j=0; j<P; j+=8) { /* j is a starting bit (we will actually compare P/8+8 bits) */
unsigned char a;
int jj = (j+d)%P;
if(jj%8==0) {
a=data1[jj/8];
} else {
a=(data1[jj/8]>>((jj%8))) | (data1[jj/8+1]<<(8-jj%8));
}
if(a!=data[j/8]) {
if(ndiff++ < 10) printf(" %s\t mismatch in byte #%d (%02X != %02X) \n", na, j/8, a, data[j]);
}
}
if(ndiff==0) printf(" %s\t MATCHES! ******** \n", na);
}
}
}
exit(0);
}
[/CODE]

zanmato 2012-04-14 08:31

That is great, thank you. There's no need to make it better (for however you define better). It works, and takes very little time to execute :grin:

Batalov 2012-04-14 08:52

You are welcome.

There could be a few situations, where the code would need a bit of fixing. One is: if we give it many save files from different numbers - because I alloc only two arrays, it will crash. Another: in very early iterations, the save file may be much shorter (because of the fact that high bits are 0's, they are not wirtten in the file). Come to think of it, the program may work incorrectly when the highest byte (or two, three, with decreasing probability) will be zero. The program will incorrectly report an error because it won't be able to read "l" bytes.* The real count of bytes is almost always "l" (where l is as written, based on p in bits) but the correct length is in bytes 64-67, I think.

(looked up) ...Ah yes, so it is (divided by 4). This should be read and the rest of the data[] should be zeroed. This will take care of the rare situations.

There are probably more bugs and as you can easily see, some unused parameters in printf (because I cut-and-pasted and didn't debug more than on some real data that I easily produced with P95: started LL on some number 3-4 times).

_______
[COLOR=dimgray]*The easiest way to see this in practice is to take LL test for Mp where p is =32m+1 and start LL test and to save at every 1000 iters. You will see that the savefiles sizes will vary by 4 bytes (!!). Randomly. This should happen at about 50-50%, because the only bit written in the last 4-byte int is either 0 or 1. The program will fail comparing these. I'll post a cleanup tomorrow, if I'll have time.[/COLOR]

aketilander 2012-04-14 21:15

[QUOTE=zanmato;296312]Compare the interim residues, and if they're different restart both tests from known good interim files (matching interims). Even if both tests have errors in the same interval, the overwhelming probability is that they will be different errors, it will be caught and both tests restarted from known good positions.[/QUOTE]

Well, I think that it may be better to restart only one of the tests from the latest interim file and let the other computer continue or do other work while waiting. If you still have a mismatch at next checkpoint then you should restart also the other test from last good checkpoint. The loss will be less on average, when you have a mismatch, if you do it in this way.

If the number of iterations between interim files are not too large (5M - 20M iterations) the probability is very small that you will have an error on both computers in the same interval of iterations if your computers have passed the torture test.

I have a question: Since you have the interim residues why do you want to compare the whole files? If the residues are identical the probability that the files are identical is extremely high if I have understood this rightly (which may not be true).

aketilander 2012-04-14 21:30

[QUOTE=Dubslow;296350]Consider that one iteration takes a few milliseconds, and that your reaction time to seeing the interim residues + time taken by MPrime to react to ^C >> a few milliseconds. I think it's considerably probable that your two save files are not at the same iteration; that's gotta be stored in the file headers that Batalov mentioned.[/QUOTE]

You use

InterimFiles=1000

(or any other number then 1000) in prime.txt

to ensure that you get interim files and interim residues (in results.txt) from the same iteration.


All times are UTC. The time now is 01:06.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.