[quote=axn;209310]Duplicate relations are not completely identical!!! The format is such that the special-q used to find the relation is stored in the end of the line. Same relation can be found with different special q's, and hence the lines may be textually different.
[code]sort -u -k 1 -t ":"[/code] might work[/quote] That still doesn't explain why lots of non-duplicate relations seemed to be deleted as well, though. I am curious; I would have thought those commands safe (and useless) to run.
Might there be aliases defined for sort/uniq that add some extra options?
For example, the sorted file contained the following lines:
[CODE]-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468077,29069:121fa6b7,3868103,4CF3,2D6B9,3,17,29,61,107:6a11131,4578865,114DF,230DF,F22ADB,5,7,11,13,25,125,26B,821,2,2,719A0D1
-10000468659,25586:405D,EC4D,25F23B,5BB4ED,3,3,3,5,7,B,47,C1,7B5:166bb35,D0123,138671,1E6EF3,1EF17F,2D6169,42617B,13,35,2,2,2,2,2,2,54B860F
-10000468659,25586:405D,EC4D,25F23B,5BB4ED,3,3,3,5,7,B,47,C1,7B5:54b860f,D0123,138671,1E6EF3,1EF17F,2D6169,42617B,13,35,2,2,2,2,2,2,166BB35[/CODE]
I made a small custom tool to remove duplicate lines. After parsing the input file with my utility, I got:
[CODE]found 5260417 hash collisions in 28469558 relations
added 120920 free relations[/CODE]
I will try to count the exact number of duplicated lines to find out where the truth is.
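A custom dedup tool like the one mentioned above can be sketched in a few lines of Python (a hedged sketch, assuming the relation's identity is the "a,b" pair before the first colon, per axn's earlier note; the sample relations are made up):

```python
def dedup_relations(lines):
    """Keep the first occurrence of each relation, keyed on the
    'a,b' prefix before the first ':'. The factor lists after the
    colon may differ textually for the same relation (different
    special-q, reordered factors), so comparing whole lines would
    miss such duplicates."""
    seen = set()
    unique = []
    for line in lines:
        key = line.split(":", 1)[0]   # e.g. "-10000468077,29069"
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique

# tiny demo with made-up relations
sample = [
    "-5,7:1a:2b,3c",
    "-5,7:1a:3c,2b",   # same a,b found via a different special-q
    "-9,11:4d:5e",
]
print(dedup_relations(sample))
```

This keeps the first copy of each (a,b) pair and drops the rest, regardless of how the remainder of the line is formatted.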
[QUOTE=siew;209315][CODE]found 5260417 hash collisions in 28469558 relations
added 120920 free relations [/CODE] [/QUOTE] Don't just go by the "collisions" figure; msieve will also list the actual duplicate counts (which may be higher or lower).
Run msieve a second and third time on the same input file:
[CODE]found 5260417 hash collisions in 28469558 relations
added 120920 free relations[/CODE]
[CODE]found 5268462 hash collisions in 28590477 relations
added 1 free relations[/CODE]
[CODE]found 5268460 hash collisions in 28590479 relations[/CODE]
Why the difference? Same input file, same version of msieve (1.44).
[quote=siew;209320]Run msieve second and third times on the same input file
[code]found 5260417 hash collisions in 28469558 relations added 120920 free relations[/code] [code]found 5268462 hash collisions in 28590477 relations added 1 free relations[/code] [code]found 5268460 hash collisions in 28590479 relations[/code] why the difference ? same input file, same verision of msieve (1.44)[/quote]
Possible reasons:
1. It is not the same file, if you keep running msieve in the same directory: msieve appends to the .dat file. (That is why it later reports "added 1 free relations" -- just 1, when it could have been zero; it sees from the .dat file that the free relations are already there.)
2. It is also possible (if your numbers keep jumping) that your system is unstable: disk, memory, something else.
3. Why do you keep stopping it? Let it run already.
Don't obsess about hash collisions; they are just a pre-screening instrument for a two-pass elimination routine. You don't need to remove (all of the) redundant entries -- msieve will do that better than your scripts.
[quote=henryzz;209312]That still doesnt explain why lots of non-duplicated seemed to be deleted as well though. I am curious. I would have thought those commands safe(and useless) to do.[/quote]
There are many subtly buggy implementations of [FONT=Fixedsys]sort[/FONT], and their bugs come out on huge inputs (a 4+ GB input is not what the writers had in mind). Note that GNU [FONT=Fixedsys]sort[/FONT] creates temporary files; their location is OS-dependent, and if it creates them in a /tmp partition that can overfill (or your disk is fairly full in any case), the results will be unpredictably compromised. The semantics of -u combined with additional options also need testing on your particular OS. Try sanity checks on small inputs.

{memory lane} Because the first things I had to routinely do on the very first versions of the mammalian genomes (which arrived a decade ago from Celera and from UCSC) had to do with shredding, suffix distribution analyses, and redundancy counting, I had the dubious pleasure of observing all possible (first IRIX, then GNU) [FONT=Fixedsys]sort[/FONT] bugs and trying to patch them or work around them. Then, of course, more sophisticated tools were written. {\memory lane}

This is relevant to what JasonP coded for msieve's filtering step: why do you think he didn't use "[FONT=Fixedsys]sort -u[/FONT]" or something similar? Because he coded robustly around the possible memory and disk overflow problems, and did so quite elegantly. It is a two-pass algorithm. Never mind the hash collisions; they will be non-zero [I]even[/I] on a non-redundant input. Just trust msieve to do the job.
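The keyed -u hazard is easy to check on a tiny input. As a rough illustration (pure Python standing in for [FONT=Fixedsys]sort -u -k 1 -t ":"[/FONT], on the assumption that GNU sort with an explicit key compares only the key when deciding uniqueness, not the whole line):

```python
def sort_unique_by_first_field(lines, sep=":"):
    """Mimic `sort -u` keyed on the first field: sort by the key,
    then keep only one line per distinct key value."""
    survivors = {}
    for line in sorted(lines, key=lambda s: s.split(sep, 1)[0]):
        # first line seen for each key wins; later ones are dropped
        survivors.setdefault(line.split(sep, 1)[0], line)
    return list(survivors.values())

lines = ["b:unique", "a:first", "a:differs-after-the-key"]
print(sort_unique_by_first_field(lines))
# only one "a:..." line survives, even though the two differ
```

For relation files the first field is the (a,b) pair, so keying on it is what you'd want -- but if the key specification or an alias is off, whole swaths of distinct lines can vanish silently, which is why small-input sanity checks matter.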
[COLOR=green]EDIT: Here's an example from a project where redundant entries were already removed; observe what happens then:[/COLOR]
[code]Fri Jan 22 19:33:09 2010 found 3993194 hash collisions in 95370252 relations
Fri Jan 22 19:33:42 2010 added 2438358 free relations
Fri Jan 22 19:33:42 2010 commencing duplicate removal, pass 2
Fri Jan 22 19:35:40 2010 found 0 duplicates and 97808610 unique relations
Fri Jan 22 19:35:40 2010 memory use: 261.2 MB
Fri Jan 22 19:35:40 2010 reading ideals above 720000
Fri Jan 22 19:35:41 2010 commencing singleton removal, initial pass
Fri Jan 22 20:21:47 2010 memory use: 2756.0 MB[/code]
From the source file that implements the duplicate removal:
[code]/* produce <savefile_name>.d, a binary file containing the line
   numbers of relations in the savefile that should *not* graduate
   to the singleton removal (or are just plain invalid).

   This code has to touch all of the relations, and when the dataset
   is large avoiding excessive memory use is tricky. Cavallar's paper
   suggests dropping relations into a hashtable and ignoring relations
   that map to the same hash bin. This keeps memory use down but causes
   good relations to be thrown away. Another option is to put the (a,b)
   values of relations into the hashtable, so that hash collisions can
   be resolved rigorously. Unfortunately this means we have to budget
   12 bytes for each unique relation, and there could be tens
   (hundreds!) of millions of them.

   The implementation here is a compromise: we do duplicate removal
   in two passes. The first pass maps relations into a hashtable of
   bits, and we save (on disk) the list of hash bins where two or
   more relations collide. The second pass refills the hashtable of
   bits with just these entries, then reads through the complete
   dataset again and saves the (a,b) values of any relation that
   maps to one of the filled-in hash bins.

   The memory use in the first pass is constant, and the memory use
   of the second pass is 12 bytes per duplicate relation. Assuming
   unique relations greatly outnumber duplicates, this solution finds
   all the duplicates with no false positives, and the memory use is
   low enough so that singleton filtering is a larger memory
   bottleneck.

   One useful optimization for really big problems would turn the
   first-pass hashtable into a Bloom filter using several hash
   functions. This would make it much more effective at avoiding
   false positives as the hashtable gets more congested */[/code]
Note that while the comments indicate that different relations will never be mistakenly treated as identical and pruned, in practice the code only looks at the bottom 36 bits of 'a' values and the bottom 29 bits of 'b' values when comparing relations. For very big problems this could occasionally mean a few relations are deleted erroneously; this shortcut keeps the second-stage hashtable from becoming too large.
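For illustration, the two-pass scheme described in that comment can be sketched in Python (a simplified model, not msieve's code: the bin count and the mixing hash are made up, and real msieve streams relations from disk; only the 36/29-bit truncation mirrors the text above):

```python
def two_pass_duplicates(relations, nbins=1 << 16):
    """relations: iterable of (a, b) pairs. Returns the set of
    (a, b) values occurring more than once, via the two-pass scheme:
    pass 1 keeps one bit per hash bin and records bins hit twice or
    more; pass 2 stores (a, b) only for relations landing in those
    bins, so memory stays proportional to the (presumably small)
    number of colliding relations."""
    M_A, M_B = (1 << 36) - 1, (1 << 29) - 1   # low 36/29 bits, as in the text

    def h(a, b):
        # made-up mixing hash over the truncated (a, b) values
        return (((a & M_A) * 2654435761) ^ (b & M_B)) % nbins

    relations = list(relations)

    # pass 1: constant memory -- one bit per hash bin
    seen_bit = bytearray(nbins)
    collided = set()
    for a, b in relations:
        i = h(a, b)
        if seen_bit[i]:
            collided.add(i)
        seen_bit[i] = 1

    # pass 2: resolve collisions rigorously, but only inside the
    # flagged bins; a repeated (a, b) there is a true duplicate
    kept, dups = set(), set()
    for a, b in relations:
        if h(a, b) in collided:
            if (a, b) in kept:
                dups.add((a, b))
            kept.add((a, b))
    return dups
```

One caveat: this sketch compares full (a, b) values in pass 2, so it has no false positives; per the note above, msieve compares truncated values to bound the table size, which is where the rare erroneous deletion can come from.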
Poly file question
I'd like to contribute to the Kamada project calculating some polynomials with msieve-GPU.
I succeeded in creating the .fb file (well, I had to stop the program after 12 hours even though the estimated time was 6.59 hours, but that's a known feature). I also transformed the .fb file into a viable .poly file with factmsieve.py. Now, how can I calculate the Murphy_E value needed by the project? :smile: Luigi
[QUOTE=ET_;209410]Now, how can I calculate the Murphy_E value needed by the project?[/QUOTE]
Find the poly in msieve.dat.p. The e value will be listed above the poly in a comment line (beginning with a #).
[QUOTE=axn;209411]Find the poly in msieve.dat.p. The e value will be listed above the poly in a comment line (beginning with a #)[/QUOTE]
Thank you axn, it worked! :smile: Luigi
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.