
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Running GGNFS (https://www.mersenneforum.org/showthread.php?t=9645)

henryzz 2010-03-23 20:28

[quote=axn;209310]Duplicate relations are not completely identical!!! The format is such that the special-q used to find the relation is stored in the end of the line. Same relation can be found with different special q's, and hence the lines may be textually different.

[code]sort -u -k 1 -t ":"[/code] might work[/quote]
That still doesn't explain why lots of non-duplicates seemed to be deleted as well, though. I am curious; I would have thought those commands were safe (and useless) to run.
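For what it's worth, one thing to check is the key specification itself. In GNU sort, [FONT=Fixedsys]-k 1[/FONT] means "from field 1 to the end of the line", i.e. effectively the whole line, while [FONT=Fixedsys]-k 1,1[/FONT] restricts the key to the first field only. A minimal sketch (GNU coreutils assumed; the relation lines are shortened fakes):

```shell
# Two "duplicate" relations: same a,b pair before the first ':',
# different tails (as when found via different special-q).
printf '%s\n' '-100,29069:aa,bb:cc,dd' '-100,29069:aa,bb:ee,ff' > rels.txt

# -k 1 keys on field 1 THROUGH the end of the line -> both lines survive:
LC_ALL=C sort -t: -k 1 -u rels.txt | wc -l        # 2 lines

# -k 1,1 keys on the a,b field only -> one line per relation:
LC_ALL=C sort -t: -k 1,1 -u rels.txt | wc -l      # 1 line
```

Note the LC_ALL=C: it forces byte-wise comparison. Under some locales, sort's collation may treat textually different lines as equal, which is one way seemingly non-duplicate lines can vanish under -u.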

axn 2010-03-23 20:34

Might there be some aliases defined for sort / uniq with some additional options?

siew 2010-03-23 20:53

For example, the sorted file contained the following lines:

[CODE]-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468659,25586:405D,EC4D,25F23B,5BB4ED,3,3,3,5,7,B,47,C1,7B5:166bb35,D0123,138671,1E6EF3,1EF17F,2D6169,42617B,13,35,2,2,2,2,2,2,54B860F
-10000468659,25586:405D,EC4D,25F23B,5BB4ED,3,3,3,5,7,B,47,C1,7B5:54b860f,D0123,138671,1E6EF3,1EF17F,2D6169,42617B,13,35,2,2,2,2,2,2,166BB35
[/CODE]

I made a small custom tool to remove duplicate lines. After parsing the input file with my util, I got:
[CODE]found 5260417 hash collisions in 28469558 relations
added 120920 free relations
[/CODE]
....
I will try to compute the exact number of duplicated strings to find out where the truth is.
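A minimal Python sketch of deduplication keyed on the (a,b) pair (an illustration only, not the actual custom tool; duplicate relations share the part before the first ":" even when the trailing factor lists differ textually):

```python
def dedup_relations(lines):
    """Keep the first relation seen for each (a,b) pair.

    Two relations count as duplicates when their a,b values match,
    even if the rest of the line differs textually (the same
    relation can be found via different special-q).
    """
    seen = set()
    unique, dups = [], 0
    for line in lines:
        key = line.split(":", 1)[0]   # the "a,b" prefix
        if key in seen:
            dups += 1
        else:
            seen.add(key)
            unique.append(line)
    return unique, dups

# Shortened fake relations for illustration:
rels = [
    "-10000468659,25586:405D,EC4D:166bb35,13,35",
    "-10000468659,25586:405D,EC4D:54b860f,13,35",  # same a,b -> duplicate
    "-10000468077,29069:121fa6b7,3868103:6a11131,5,7",
]
unique, dups = dedup_relations(rels)
print(len(unique), dups)   # 2 1
```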

axn 2010-03-23 21:10

[QUOTE=siew;209315][CODE]found 5260417 hash collisions in 28469558 relations
added 120920 free relations
[/CODE]
[/QUOTE]

Don't just go by the "collisions" figure; msieve will also list the actual duplicate counts (which may be higher or lower).

siew 2010-03-23 21:51

I ran msieve a second and third time on the same input file:

[CODE]found 5260417 hash collisions in 28469558 relations
added 120920 free relations[/CODE]

[CODE]found 5268462 hash collisions in 28590477 relations
added 1 free relations[/CODE]

[CODE]found 5268460 hash collisions in 28590479 relations[/CODE]

Why the difference? Same input file, same version of msieve (1.44).

Batalov 2010-03-23 23:29

[quote=siew;209320]I ran msieve a second and third time on the same input file:

[code]found 5260417 hash collisions in 28469558 relations
added 120920 free relations[/code]

[code]found 5268462 hash collisions in 28590477 relations
added 1 free relations[/code]

[code]found 5268460 hash collisions in 28590479 relations[/code]

Why the difference? Same input file, same version of msieve (1.44).[/quote]
Possible reasons:

1. It is not the same file if you keep running msieve in the same directory: msieve appends to the .dat file. (That's why a later run reports "added 1 free relations" -- just 1, when it possibly could have been zero; it means msieve sees from the .dat file that the free relations are already there.)

2. It is also possible (if your numbers keep jumping) that your system is unstable: disk, memory, something else.

3. Why do you keep stopping it? Let it run already. Don't obsess about hash collisions - these are just a pre-screening instrument for a two-pass elimination routine. You don't need to remove (all of the) redundant entries -- msieve will do that better than your scripts.

Batalov 2010-03-23 23:45

[quote=henryzz;209312]That still doesnt explain why lots of non-duplicated seemed to be deleted as well though. I am curious. I would have thought those commands safe(and useless) to do.[/quote]
There are many subtly buggy implementations of [FONT=Fixedsys]sort[/FONT], and their bugs come out on huge inputs (a 4+ GB input is not what the writers had in mind). Note that GNU [FONT=Fixedsys]sort[/FONT] creates temporary files whose location is OS-dependent; if it creates them in a /tmp partition that can overfill (or your disk is somewhat full in any case), then the results will be unpredictably compromised. The semantics of -u combined with additional options also need testing on your particular OS. Try sanity checks on small inputs.

{memory lane} Because the first things I routinely had to do with the very first versions of the mammalian genomes (which arrived a decade ago from Celera and from UCSC) involved shredding, suffix distribution analyses, and redundancy counting, I had the dubious pleasure of observing all possible (first IRIX, then GNU) [FONT=Fixedsys]sort[/FONT] bugs and trying to patch them or work around them. Then, of course, the more sophisticated tools were written.{\memory lane} This is relevant to what JasonP coded for the msieve filtering step -- why do you think he didn't use "[FONT=Fixedsys]sort -u[/FONT]" or something? Because he coded robustly around the possible memory and disk overflow problems, and did so quite elegantly. It is a two-pass algorithm. Never mind the hash collisions; they will be non-zero [I]even[/I] on a non-redundant input.

Just trust msieve to do the job.

[COLOR=green]EDIT: Here's an example from a project where redundant entries were already removed, observe what happens then:[/COLOR]
[COLOR=green][code][/COLOR][COLOR=green]Fri Jan 22 19:33:09 2010 found 3993194 hash collisions in 95370252 relations[/COLOR]
[COLOR=green]Fri Jan 22 19:33:42 2010 added 2438358 free relations[/COLOR]
[COLOR=green]Fri Jan 22 19:33:42 2010 commencing duplicate removal, pass 2[/COLOR]
[COLOR=green]Fri Jan 22 19:35:40 2010 found 0 duplicates and 97808610 unique relations[/COLOR]
[COLOR=green]Fri Jan 22 19:35:40 2010 memory use: 261.2 MB[/COLOR]
[COLOR=green]Fri Jan 22 19:35:40 2010 reading ideals above 720000[/COLOR]
[COLOR=green]Fri Jan 22 19:35:41 2010 commencing singleton removal, initial pass[/COLOR]
[COLOR=green]Fri Jan 22 20:21:47 2010 memory use: 2756.0 MB[/COLOR]
[COLOR=green][/code][/COLOR]

jasonp 2010-03-24 04:48

From the source file that implements the duplicate removal:
[code]
/* produce <savefile_name>.d, a binary file containing the
line numbers of relations in the savefile that should *not*
graduate to the singleton removal (or are just plain invalid).

This code has to touch all of the relations, and when the
dataset is large avoiding excessive memory use is tricky.
Cavallar's paper suggests dropping relations into a hashtable
and ignoring relations that map to the same hash bin. This keeps
memory use down but causes good relations to be thrown away.
Another option is to put the (a,b) values of relations into the
hashtable, so that hash collisions can be resolved rigorously.
Unfortunately this means we have to budget 12 bytes for each
unique relation, and there could be tens (hundreds!) of millions
of them.

The implementation here is a compromise: we do duplicate removal
in two passes. The first pass maps relations into a hashtable
of bits, and we save (on disk) the list of hash bins where two
or more relations collide. The second pass refills the hashtable
of bits with just these entries, then reads through the complete
dataset again and saves the (a,b) values of any relation that
maps to one of the filled-in hash bins. The memory use in the
first pass is constant, and the memory use of the second pass
is 12 bytes per duplicate relation. Assuming unique relations
greatly outnumber duplicates, this solution finds all the duplicates
with no false positives, and the memory use is low enough so
that singleton filtering is a larger memory bottleneck

One useful optimization for really big problems would turn the
first-pass hashtable into a Bloom filter using several hash
functions. This would make it much more effective at avoiding
false positives as the hashtable gets more congested */
[/code]

Note that while the comments indicate that different relations will never be mistakenly treated as identical and pruned, in practice the code only looks at the bottom 36 bits of 'a' values and the bottom 29 bits of 'b' values when comparing relations. For very big problems this could sometimes mean a few relations are deleted erroneously. This shortcut is to keep the second-stage hashtable from becoming too large.
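The two-pass scheme described in that comment can be sketched in a few lines of Python (a toy illustration with a tiny in-memory bit table; the real C code keeps intermediate results on disk and uses its own hash over the relation's a,b values):

```python
from hashlib import blake2b

NBITS = 1 << 16                      # toy table size; msieve's is far larger

def h(ab):
    """Hash the 'a,b' key into a bin (stand-in for msieve's cheap hash)."""
    return int.from_bytes(
        blake2b(ab.encode(), digest_size=8).digest(), "big") % NBITS

def two_pass_dedup(lines):
    # Pass 1: one bit per hash bin; record bins hit two or more times.
    # Memory here is constant (NBITS/8 bytes) regardless of input size.
    bits = bytearray(NBITS // 8)
    collided = set()
    for line in lines:
        b = h(line.split(":", 1)[0])
        if bits[b >> 3] & (1 << (b & 7)):
            collided.add(b)          # a "hash collision" -- only *maybe* a dup
        else:
            bits[b >> 3] |= 1 << (b & 7)
    # Pass 2: only relations landing in a collided bin have their (a,b)
    # compared exactly, so memory scales with the duplicates, not the input.
    seen_ab = set()
    unique = []
    for line in lines:
        ab = line.split(":", 1)[0]
        if h(ab) in collided:
            if ab in seen_ab:
                continue             # a true duplicate: drop it
            seen_ab.add(ab)
        unique.append(line)
    return unique
```

Relations whose bins never collide skip the exact comparison entirely; false positives from pass 1 are resolved rigorously in pass 2, which is why the collision count alone overstates (or understates, once free relations are added) the true duplicate count.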

ET_ 2010-03-24 19:44

Poly file question
 
I'd like to contribute to the Kamada project calculating some polynomials with msieve-GPU.

I succeeded in creating the .fb file (well, I had to stop the program after 12 hours even though the calculated time was 6.59 hours, but it's a known feature).

I also transformed the .fb file into a viable .poly file with factmsieve.py.

Now, how can I calculate the Murphy_E value needed by the project? :smile:

Luigi

axn 2010-03-24 19:48

[QUOTE=ET_;209410]Now, how can I calculate the Murphy_E value needed by the project?[/QUOTE]
Find the poly in msieve.dat.p. The e value will be listed above the poly in a comment line (beginning with a #)

ET_ 2010-03-24 20:00

[QUOTE=axn;209411]Find the poly in msieve.dat.p. The e value will be listed above the poly in a comment line (beginning with a #)[/QUOTE]

Thank you axn, it worked! :smile:

Luigi
