![]() |
GGNFS bug in gnfs-lasieve4e.c?
After a report of a crash on Windows, I think I have found a bug in gnfs-lasieve4e.c. Sadly, removing it does not solve the Windows problem, but it still looks like a bug. In the code around line 2530:
[CODE] { unsigned long small_factors[10]; nf = rho_factor(small_factors, large_factors[s1]); for (i = 0; i < nf; i++) { mpz_set_ui(large_primes[s1][i], small_factors[i]); if (mpz_sizeinbase(large_primes[s1][i],2) > max_primebits[s1]) { n_mpqsvain[s1]++; break; } } if ((i >= nf) && (nf >= 1)) nlp[s1] = nf; else { nlp[s1]=0; } } [/CODE] the function rho_factor() returns -1 when there are no factors and the following loop should not be entered. But, because the variable 'i' in the following loop is unsigned, the 'i < nf' condition treats the signed -1 value of nf as an unsigned maximum value and the loop is hence entered when it should not be. This is what caused the bug reported on Windows but, sadly, correcting it fails to fix the Windows problem. The win32 code appears to work but the x64 version runs for a long time and then crashes. For the moment I have reverted the SVN code to its original state (i.e. one that includes the above bug) but I will update to a corrected version if others who know more about the sieve code than I do agree that this is indeed a bug. Brian |
Without fixing this bug both the win32 and x64 Windows versions of the siever fail. Once this bug is fixed the win32 versions appear to work but the x64 versions still fail. Since I am fairly confident that this is a bug, I have corrected it in the GGNFS SVN. I have also updated the Visual Studio 2010 build files to ensure that the build picks up the correct version of gmp.h and not an outdated one.
I should, however, emphasise that there is still an issue with the x64 versions that I have not been able to fix. Brian |
[QUOTE=Brian Gladman;246766]I should, however, emphasise that there is still an issue with the x64 versions that I have not been able to fix.[/QUOTE]
This looks like a compiler bug. The function rho_factor is defined as returning an int (a signed 32-bit type) and the variable nf is defined as a long (a signed 64-bit type), so this line [CODE]nf = rho_factor(small_factors, large_factors[s1]);[/CODE]should do a sign-extend conversion from 32 to 64 bits, but apparently it does a zero-extend conversion instead. Changing nf to an int should fix the problem. |
[QUOTE=Random Poster;247107]This looks like a compiler bug. The function rho_factor is defined as returning an int (a signed 32-bit type) and the variable nf is defined as a long (a signed 64-bit type), so this line
[CODE]nf = rho_factor(small_factors, large_factors[s1]);[/CODE]should do a sign-extend conversion from 32 to 64 bits, but apparently it does a zero-extend conversion instead. Changing nf to an int should fix the problem.[/QUOTE] Hi, There is no such conversion on Windows since longs and unsigned longs are 32-bits. The problem is in the subsequent loop where the loop control value is unsigned so the nf value gets treated as an unsigned maximum value. Brian |
[QUOTE=Brian Gladman;247119]There is no such conversion on Windows since longs and unsigned longs are 32-bits.[/QUOTE]
Even on 64-bit Windows? Are you sure? I'm asking because that's the only reason I can think of that would make the code work on 32-bit Win but fail on 64-bit Win. [QUOTE=Brian Gladman;247119]The problem is in the subsequent loop where the loop control value is unsigned so the nf value gets treated as an unsigned maximum value.[/QUOTE] But you already fixed that problem, didn't you? And that fix can't make any difference between 32-bit and 64-bit machines. So how can it still be failing on a 64-bit machine? |
Yes, I'm sure that ints and longs are the same length on Windows.
Yes, I fixed it but what I think has happened is that a bug elsewhere caused this code to be entered and this triggered the bug that I found (in fact this code is not entered at all in for Andi's example when run in win32 mode). Brian |
[QUOTE=Brian Gladman;247260]Yes, I'm sure that ints and longs are the same length on Windows.
Yes, I fixed it but what I think has happened is that a bug elsewhere caused this code to be entered and this triggered the bug that I found (in fact this code is not entered at all in for Andi's example when run in win32 mode).[/QUOTE] Note that on Windows that both long and int are the same size (32 bits) on both Win32 and Win64 systems. You need to use long long on Win64 to hold a 64 bit value. I suspect the problem is due to an assumption that a long is 64 bits for all 64-bit OSes. This is why many applications that can be compiled across both Windows and *nix typedef int32, uint32, int64, and uint64. It eliminates any possible confusion regarding the intended size of a variable. |
One problem with GMP is that the function prototypes use 'unsigned long' for the type of a 1-word argument to multiple precision functions. So if you have a 64-bit number whose value may be larger than 2^32, you [i]cannot[/i] just use it as an argument to functions like mpz_add_ui(). It would work everywhere except 64-bit windows if you did so.
|
Thanks, Jason and Rogue, for your comments. I spend a lot of time tracking down '32/64 bit long' issues but I have not found one here yet. But you are quite likely to be right about this.
On another issue, does anyone know if the experimental version of the ggnfs sieve code is worth investing in? It would take a bit of effort to convert the assembler code for use on Windows x64 so I would not want to embark on this if it isn't worthwhile. Brian |
[QUOTE=Brian Gladman;247365]On another issue, does anyone know if the experimental version of the ggnfs sieve code is worth investing in? It would take a bit of effort to convert the assembler code for use on Windows x64 so I would not want to embark on this if it isn't worthwhile.[/QUOTE]
If I remember correctly, the 64bit experimental version in Linux runs twice as fast as the non-experimental version so it would be a very good investment. Someone else should confirm that is still the case. Jeff. |
[QUOTE=Jeff Gilchrist;247383]If I remember correctly, the 64bit experimental version in Linux runs twice as fast as the non-experimental version so it would be a very good investment. Someone else should confirm that is still the case.
Jeff.[/QUOTE]That was my experience, once I managed to get it built. To be precise, my experience was with a package sent to me directly rather than downloaded from SF. Paul |
These packages are the same; SF source lasieve4_64 branch started from the privately forwarded tarball once everyone agreed that it was opensource... In SF, there are actually a few patches for 16e that could be relevant to this hang/crash (there was a hang/crash in the early lasieve4_64 linux binary as well and a couple internal loops were plugged even though the underlying reason was not found - it was assumed that it was obscured by asm code; now, if this bug is essentially the same, you too have two choices: find the true bug, or fix the hanging places: they look like this: [FONT=Arial Narrow]while(x<y) {do something; x+=d}[/FONT] where d actually happens to be zero).
|
Here is [URL="http://ggnfs.svn.sourceforge.net/viewvc/ggnfs/trunk/src/experimental/lasieve4_64/gnfs-lasieve4e.c?r1=385&r2=355"]the patch[/URL] collection you may want to adapt to the Windows part of the source, i.e. the asm-less ([SIZE=1]the infinite loops were in the portion around Line 2278, and a crash was around Line 3715, -- the latter is a division by zero. Why does that variable stored in x[0] rarely turn into 0? Nobody knows. Maybe a memory overrun, maybe 0x10000 becomes zero when it is coerced into u16_t. The original source operated with much smaller sieve sizes and this never happened to the original authors[/SIZE].)
I would put it, but how would I know that it does good or bad? -- I don't use Windows and I don't have the compiler/debugger. |
Another test-case (reproducible crash) with SVN-400 (32 bit) from Jeff's homepage (15e siever), not sure if that's the same bug:
>gnfs-lasieve4I15e -r alq4788.2617.poly -o alq4788.2617_25.26a_01.out -f 25013159 -c 186841 Warning: lowering FB_bound to 25013158. Special q 25013159 does not divide ***crash*** The poly file is: [code]n: 2229348552361822129355877431354933637944777944345113399577912185681707793442268713514615675586421258420770429669349552195250785917635169750211125857968489791489287723679679 # norm 8.661335e-17 alpha -8.684199 e 2.548e-13 rroots 3 skew: 58927582.05 c0: -75835208698116436165893795763297338053392995 c1: 394274499523867352569474046681343689 c2: 271986736245998513067550648137 c3: -5670997805595833828841 c4: -102371276964158 c5: 353400 Y0: -1445381140453176527811641315884132 Y1: 5618925887111873447 rlim: 40000000 alim: 40000000 lpbr: 30 lpba: 31 mfbr: 60 mfba: 90 rlambda: 2.6 alambda: 3.6[/code] Edit: and yet another crash: >gnfs-lasieve4I15e -r alq4788.2617.poly -o alq4788.2617_25.26c.out -f 25414339 -c 185661 Warning: lowering FB_bound to 25414338. Special q 25414339 does not divide ***crash*** Edit2: ooops, I just see that I accidentally sieved on the -r side. I should not modify copy/pasted command lines, better type them from scratch... :blush: :doh!: |
yet another 15e crash - this time sieving on the "right" side (32 bit windows binary of SVN400 running under windows 7):
>gnfs-lasieve4I15e -a alq4788.2617.poly -o alq4788.2617_25.26e.out -f 25800000 -c 200000 -R Warning: an incomplete line in the original file; if just a few, it's ok, they will be skipped Resuming with -f 25872619 -c 127381 Warning: lowering FB_bound to 25872618. Special q 25872619 does not divide ***crash*** |
[QUOTE=Andi47;249318]yet another 15e crash - this time sieving on the "right" side (32 bit windows binary of SVN400 running under windows 7):
>gnfs-lasieve4I15e -a alq4788.2617.poly -o alq4788.2617_25.26e.out -f 25800000 -c 200000 -R Warning: an incomplete line in the original file; if just a few, it's ok, they will be skipped Resuming with -f 25872619 -c 127381 Warning: lowering FB_bound to 25872618. Special q 25872619 does not divide ***crash***[/QUOTE] I have just committed a Visual Studio 2010 x64 build of the experimental lattice siever code in the ggnfs SVN. Given the amount of assembler code I had to convert and then add Windows calling conventions, I am truly amazed that it does not crash. Sadly however, it doesn't work and gives 'pathology' messsages and terminaates: SCHED_PATHOLOGY q0=43505239 k=1 excess=18130 SCHED_PATHOLOGY q0=43505251 k=1 excess=17702 SCHED_PATHOLOGY q0=43505291 k=1 excess=18054 SCHED_PATHOLOGY q0=43505323 k=1 excess=17936 SCHED_PATHOLOGY q0=43505327 k=1 excess=18084 total yield: 0, q=43505381 (1.#INF0 sec/rel) I can debug it but I am going to need help from someone who knows much more about the inner workings of the code than I do. Brian |
| All times are UTC. The time now is 15:41. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.