mersenneforum.org  

Old 2010-11-01, 01:31   #1
bdodson
 
Jun 2005
lehigh.edu

2¹⁰ Posts
gpu poly search error

I've been getting np2 hanging, sometimes a day or more with no output at
all --- this for the c187 --- once in 4M, and a second time in 5M. After some
time fiddling to see which stage1 report(s) was causing the problem, turns
out that one of the -np1's reported an error on the stdout file from the range
"searching leading coefficients from 4000001 to 4400000"
Code:
error generating or reading NFS polynomials
perhaps with a file server err, as the file was corrupt (grep reports "binary
file" without locating anything; empty missing lines at the start of the file
...). I hadn't actually looked at the file, until I found the line below in
msieve.dat.m; and then found a second one for the other range
"from 5000001 to 5400000":
Code:
4007700 24015536 295219382270877590927 801983503937356382677653357465991274

5084748 25083344 280739665478577317867 765040171045381367403062334481129902
in a file that's supposed to have just three fields, (a5, p, m)'s.
The 4M report file hung on more than one line, although many/most
of the other lines were OK. The 5M file didn't report any error. Maybe
-np2 should check to see that the msieve.dat.m line is properly formatted?

Losing a few lines (of 1000s, 10000s) isn't a problem; it's the hanging,
and not knowing that something's gone wrong to know to go on to the
rest of the valid reports that's the trouble. Unless these stage1 reports
indicate a problem in the code? -Bruce
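
The format check suggested above could look roughly like the sketch below. It is not msieve code: `is_valid_stage1_line` is a hypothetical helper, and it assumes all three fields are nonnegative decimal integers separated by single spaces, as in the valid lines shown in this post.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper, not part of msieve: returns true only if `line`
   holds exactly three unsigned decimal fields separated by single
   spaces, optionally followed by one newline. */
bool is_valid_stage1_line(const char *line) {
	int fields = 0;
	const char *s = line;

	while (*s != '\0' && *s != '\n') {
		/* every field must start with a digit */
		if (!isdigit((unsigned char)*s))
			return false;
		while (isdigit((unsigned char)*s))
			s++;
		fields++;
		/* a single space must be followed by another field */
		if (*s == ' ') {
			s++;
			if (!isdigit((unsigned char)*s))
				return false;
		}
	}
	if (*s == '\n')
		s++;
	/* nothing may follow the newline, and we need exactly 3 fields */
	return fields == 3 && *s == '\0';
}
```

A reader such as -np2 could call this on each line of msieve.dat.m and skip (with a warning) any line that fails, rather than passing it on to stage 2.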

Last fiddled with by bdodson on 2010-11-01 at 01:33 Reason: typo
Old 2010-11-01, 01:56   #2
jrk
 
May 2008

3×5×73 Posts

Quote:
Originally Posted by bdodson View Post
turns
out that one of the -np1's reported an error on the stdout file from the range
"searching leading coefficients from 4000001 to 4400000"
Code:
error generating or reading NFS polynomials
This message appears whenever msieve exits the NFS code without a complete polynomial, so it will always appear when you run only -np1; it is harmless.

Quote:
Originally Posted by bdodson View Post
perhaps with a file server err, as the file was corrupt (grep reports "binary
file" without locating anything; empty missing lines at the start of the file
...). I hadn't actually looked at the file, until I found the line below in
msieve.dat.m; and then found a second one for the other range
"from 5000001 to 5400000":
Code:
4007700 24015536 295219382270877590927 801983503937356382677653357465991274

5084748 25083344 280739665478577317867 765040171045381367403062334481129902
in a file that's supposed to have just three fields, (a5, p, m)'s.
That looks suspiciously like some kind of file corruption. FYI, here's the code in msieve that writes the .dat.m file:

Code:
/*------------------------------------------------------------------*/
static void stage1_callback_log(mpz_t high_coeff, mpz_t p, mpz_t m, 
				double coeff_bound, void *extra) {
	
	FILE *mfile = (FILE *)extra;
	gmp_fprintf(mfile, "%Zd %Zd %Zd\n",
			high_coeff, p, m);
	fflush(mfile);
}
It just prints three gmp integers, so I wonder how you got four? The second number in your lines looks like it doesn't belong.
Old 2010-11-01, 02:04   #3
jasonp
Tribal Bullet
 
Oct 2004

3,529 Posts

Do you have multiple poly search processes writing to the same file? That could cause the problems you're seeing. Giving each process a different argument to '-s' (if you are not already doing so, or running the msieve binary from different directories) will send output from different GPUs to different output files. Otherwise I'd suspect a filesystem problem that's making file writes collide.
Old 2010-11-01, 15:11   #4
bdodson
 
Jun 2005
lehigh.edu

2¹⁰ Posts

Quote:
Originally Posted by jasonp View Post
Do you have multiple poly search processes writing to the same file? That could cause the problems you're seeing; specifying a different argument to '-s' (if you are not doing so now, or running an msieve binary from different directories) will cause output from different GPUs to go to different output files; otherwise I'd suspect a filesystem problem that's making file writes collide.
No, in this case the cards were writing into different directories; and the
-np2's also in different directories than the -np1's. I suppose I could check
for disk errors by "sort -gk4 msieve.dat.m". Turns out that I missed one of
the 5M's
Code:
5000040 282950932555811249513 767572566639277931962886857122963054
5000040 282988014873105079573 767572566770261762635560319792058762
5000040 283489882496584278539 767572566780564045348713900359963743
   ...
5141820 303759607119684153587 763292097447179536382655903394573001
5141820 303874311272843118707 763292097967737416866712224883219685
5094360 25093256 290781228927362427487 764742170064234883654754500369055132
5084748 25083344 280739665478577317867 765040171045381367403062334481129902
Ah; maybe that accounts for all of the inputs that hang, here's 4M
Code:
4000260 264511855886585219909 802595085782688477156868964611859014
  ...
4128540 277554032589933354403 797544346807989836260266516658633381
4128540 277625813552862465761 797544346932719528504349552877501128
4007700 24015536 295219382270877590927 801983503937356382677653357465991274
4011384 24005864 265672669931552184923 802370401984067760730681286443632763
4010040 24005540 266427315108384809443 802383381517071143811386449187655371
-Bruce
Old 2010-11-01, 15:44   #5
jrk
 
May 2008

3×5×73 Posts

With those corrupted lines, here's where it's getting stuck:

gnfs/poly/stage2/stage2.c in pol_expand():
Code:
	mpz_tdiv_q_2exp(c->gmp_help1, gmp_d, (mp_limb_t)1);
	for (i = 0; i < degree; i++) {
		while (mpz_cmpabs(c->gmp_a[i], c->gmp_help1) > 0) {
			if (mpz_sgn(c->gmp_a[i]) < 0) {
				mpz_add(c->gmp_a[i], c->gmp_a[i], gmp_d);
				mpz_sub(c->gmp_a[i+1], c->gmp_a[i+1], gmp_p);
			}
			else {
				mpz_sub(c->gmp_a[i], c->gmp_a[i], gmp_d);
				mpz_add(c->gmp_a[i+1], c->gmp_a[i+1], gmp_p);
			}
		}
	}
At i==4, the while loop keeps going forever.
Old 2010-11-01, 17:43   #6
jasonp
Tribal Bullet
 
Oct 2004

3,529 Posts

Argh, that while() loop should do two or three iterations at most...
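
One way to keep a malformed input from hanging stage 2 would be to cap the number of iterations. The sketch below is a simplified stand-in for the loop in pol_expand(), using `long long` instead of `mpz_t`; the function name and the cap `MAX_REDUCE_ITER` are assumptions for illustration, and the thread does not establish why the real loop fails to terminate on the corrupted lines.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_REDUCE_ITER 16	/* arbitrary illustrative cap, not from msieve */

/* Simplified stand-in for the reduction loop in pol_expand(): brings
   each a[i] into [-d/2, d/2], carrying p into a[i+1] on every step.
   Returns false if a coefficient has not settled after
   MAX_REDUCE_ITER steps, so the caller can discard the input. */
bool reduce_coeffs_bounded(long long *a, int degree,
				long long d, long long p) {
	long long half = d / 2;

	for (int i = 0; i < degree; i++) {
		int iter = 0;
		while (llabs(a[i]) > half) {
			if (++iter > MAX_REDUCE_ITER)
				return false;	/* give up instead of spinning */
			if (a[i] < 0) {
				a[i] += d;
				a[i + 1] -= p;
			}
			else {
				a[i] -= d;
				a[i + 1] += p;
			}
		}
	}
	return true;
}
```

With a cap like this, a bad stage 1 line would cost a warning and a skipped candidate instead of a stalled -np2 run.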
Old 2010-11-01, 22:35   #7
Random Poster
 
Dec 2008

179 Posts

Quote:
Originally Posted by bdodson View Post
Code:
5000040 282950932555811249513 767572566639277931962886857122963054
5000040 282988014873105079573 767572566770261762635560319792058762
5000040 283489882496584278539 767572566780564045348713900359963743
   ...
5141820 303759607119684153587 763292097447179536382655903394573001
5141820 303874311272843118707 763292097967737416866712224883219685
5094360 25093256 290781228927362427487 764742170064234883654754500369055132
5084748 25083344 280739665478577317867 765040171045381367403062334481129902
Code:
4000260 264511855886585219909 802595085782688477156868964611859014
  ...
4128540 277554032589933354403 797544346807989836260266516658633381
4128540 277625813552862465761 797544346932719528504349552877501128
4007700 24015536 295219382270877590927 801983503937356382677653357465991274
4011384 24005864 265672669931552184923 802370401984067760730681286443632763
4010040 24005540 266427315108384809443 802383381517071143811386449187655371
Removing 9 characters from the beginning of those offending lines leaves what appear to be valid lines, so it looks like gmp_fprintf sometimes writes just 9 characters instead of the whole string. Maybe you could gmp_sprintf to a buffer, check the contents of the buffer (print a warning and discard the buffer if the check fails), and then fwrite the buffer to the file; this should work around the bug if it's in gmp's formatting code (which I think is more likely than a bug in the operating system's file writing code).
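
A minimal sketch of that workaround, under some assumptions: `write_stage1_line` is a hypothetical replacement for the body of stage1_callback_log(), and it takes the three values as decimal strings (in msieve they could be produced with mpz_get_str() or gmp_snprintf()) so that the example needs only the C standard library.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper, not msieve code: formats "a5 p m\n" into a
   buffer, checks the result, and only then writes it to the file.
   Returns false (and writes nothing) if the check fails. */
bool write_stage1_line(FILE *mfile, const char *a5,
			const char *p, const char *m) {
	char buf[1024];
	int len = snprintf(buf, sizeof(buf), "%s %s %s\n", a5, p, m);
	int spaces = 0;

	/* reject formatting errors and truncated output */
	if (len < 0 || (size_t)len >= sizeof(buf))
		return false;

	/* everything before the final newline must be digits, with
	   exactly two spaces, each sitting between two digits */
	for (int i = 0; i < len - 1; i++) {
		bool digit = (buf[i] >= '0' && buf[i] <= '9');
		if (buf[i] == ' ') {
			bool prev_ok = (i > 0 && buf[i-1] >= '0' && buf[i-1] <= '9');
			bool next_ok = (buf[i+1] >= '0' && buf[i+1] <= '9');
			if (!prev_ok || !next_ok)
				return false;
			spaces++;
		}
		else if (!digit) {
			return false;
		}
	}
	if (spaces != 2) {
		fprintf(stderr, "warning: discarding malformed line\n");
		return false;
	}

	/* hand the whole line to stdio in one call, then flush */
	if (fwrite(buf, 1, (size_t)len, mfile) != (size_t)len)
		return false;
	return fflush(mfile) == 0;
}
```

This only guards against a bad string coming out of the formatting step; a single fwrite() call is not guaranteed to be atomic if several processes share the file.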
Old 2010-11-09, 15:47   #8
bdodson
 
Jun 2005
lehigh.edu

10000000000₂ Posts

Quote:
Originally Posted by Random Poster View Post
Removing 9 characters from the beginning of those offending lines leaves what appear to be valid lines, so it looks like gmp_fprintf sometimes writes just 9 characters instead of the whole string. Maybe you could gmp_sprintf to a buffer, check the contents of the buffer (print a warning and discard the buffer if the check fails), and then fwrite the buffer to the file; this should work around the bug if it's in gmp's formatting code (which I think is more likely than a bug in the operating system's file writing code).
Ooops; here's a new winner
Code:
150672 148717065853967295793 1546349397151620 148003488673044184871 1544410841803411039998407965507924884
with sort -gk4 showing
Code:
162060 151898518516855821613 1523978967286095124294544200586784087
162060 157845105824749120277 1523978967289425984216744549275830446
150672 148717065853967295793 1546349397151620 148003488673044184871 1544410841803411039998407965507924884
151164 13151032 128683849570608945163 1545611518056124233627904175463785373
This was with an alternate to the main code, the "special_q" version. -Bruce

(I'm not sure which hung. Both occur after the last stage1 hit that
ran with a stage2 report; the one with 4 fields (of 3!) just shortly
after the new one with 5 fields (of 3 ...).)
Old 2010-11-09, 17:49   #9
Batalov
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

10001110100110₂ Posts

It is probably not always 9 chars.
A couple of strings collide in a random place, like
XXXXXXXX XXXXXXXXXXXXyyyyyy yyyyyyyyyyyyyyyy yyyyyyyyyyyyyyyyyyyyyyyyy

For this last one, the proper split seems to be at the |, with the second line intact:
150672 148717065853967295793 1546349397|
151620 148003488673044184871 1544410841803411039998407965507924884

The first (truncated) line should have its tail somewhere as a line with just one field, and could probably be rescued too.

Instead of sort -gk4, try
awk 'NF!=3'
Old 2010-11-09, 19:16   #10
EdH
 
"Ed Hall"
Dec 2009
Adirondack Mtns

3346₁₀ Posts

Is it possible that the fflush(mfile) is happening prior to the full completion of writing a line? Perhaps inserting a brief delay would show. . .
Old 2010-11-09, 19:46   #11
Batalov
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

2·3³·13² Posts

Yeah, that's what Random Poster said a while ago. But he also said (I think) a deeper thing: that this is not necessarily this application's fault, but instead either gmp's or the system libc's fault. I tend to agree with that.

A similar (but not exactly the same) thing happened to Prime95, which printed some invalid factors with repeated digit patterns (which could hint at a bad memory alloc, but the margins of this message are too narrow to elaborate), and that defect was also OS-specific. I am tempted to look at Prime95's source and see if he simply wrote around the library bug in disgust.

Is libgmp linked statically in this particular binary that emits errors?

Last fiddled with by Batalov on 2010-11-09 at 19:49 Reason: narrow, naroow, tpyos... blegh

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.