2010-06-06, 20:07   #1
Batalov

remdups: a relation-file filter

Greg's program, wonderful in its simplicity, has now been added to the SourceForge GGNFS/contrib branch. It is a very useful tool for minimizing relation sets (.dat files) before transfer. It can also be used right before msieve filtering, if your disk space is tight.

It is a completely standalone filter, not really part of msieve or ggnfs. The filter is not aware of the polynomial and only checks for the three-field, colon-separated syntax (obvious caveat: do not filter relations from different projects together! the program will not complain). It correctly keys uniqueness on the (a,b) pair alone: the data in the 2nd and 3rd fields may be permuted or textually different (and/or contain tiny factors) without making two relations distinct. The program does not remove relations with b > 2^32 (for future use).
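
For example, these two lines (made-up values) are the same relation, because the (a,b) pair "-1234567,89" matches, even though the factor fields differ textually:
Code:
-1234567,89:1a2b,3c4d,23ffd1a:5e6f,7a8b,98adddeadbeaf
-1234567,89:3c4d,1a2b,23ffd1a:7a8b,5e6f,98adddeadbeaf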

I use the Unix-piped version (remdups4): you can pipe (b)zcat'ed relations into it, and you can discard the output (redirect to /dev/null) if you only want the stats. You can also pipe straight into bzip2 (for FTPing afterwards).
It will not work in the Windows cmd shell, though. The original remdups is provided as well, for Windows (aside from the use of pipes, the two are kept synchronized). Cygwin should be fine.
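
For example (file names are made up; remdups4 reads relations on stdin and takes the hash-table size as its single argument, per its usage message):
Code:
# deduplicate gzipped relations and recompress for transfer
zcat rels*.dat.gz | ./remdups4 2000 | bzip2 > rels.uniq.bz2
# stats only: discard the filtered output
bzcat rels.dat.bz2 | ./remdups4 2000 > /dev/null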

2010-06-06, 20:22   #2
henryzz

Quote:
Originally Posted by Batalov
Greg's program, wonderful in its simplicity, has now been added to the SourceForge GGNFS/contrib branch. [...]
Brilliant! This should be helpful when I accidentally duplicate lots of relations in the same file.

2010-06-06, 20:42   #3
Phil MjX

Thanks!

Just starting a C161 with 2 to 4 PCs, and I was wondering whether someone had taken the time to write such a tool!

Regards,
Philippe

2010-06-06, 22:48   #4
frmky

A two-pass hashtable filter would use a lot less memory, but I wrote this in 10 minutes as a quick filter and it has been "good enough" that I haven't bothered rewriting it. The current version includes a few improvements thanks to Serge.

2010-06-07, 06:51   #5
apocalypse

Here's a first-draft patch that uses a hashtable. On a file with about 5M relations, it seems to use about 150 MB instead of about 1 GB. The hash function is almost certainly suboptimal, and I didn't get the bad-relation check right. I thought I'd share it anyway in case someone else finds it useful or would like to give me feedback.

Code:
dev/trunk/contrib/remdups$ svn diff remdups4.c
Index: remdups4.c
===================================================================
--- remdups4.c  (revision 394)
+++ remdups4.c  (working copy)
@@ -31,9 +31,80 @@
 typedef unsigned long uint32;
 typedef long int32;
 typedef unsigned long long uint64;
+typedef long long int64;

+typedef struct _chain_t {
+  uint64 a;
+  uint64 b;
+  struct _chain_t* next;
+} chain_t;
+
+#define HASH_SIZE 1000000
+
 #define MDVAL 327673

+void init_hashtable(chain_t*** hashtable) {
+  if (hashtable == NULL) {
+    fprintf(stderr, "Unable to initialize hash\n");
+    exit(-1);
+  }
+
+  *hashtable = malloc(HASH_SIZE * sizeof(chain_t*));
+  if (!*hashtable) {  /* check the allocation, not the pointer argument */
+    fprintf(stderr, "Out of memory");
+    exit(-1);
+  }
+  memset(*hashtable, 0, HASH_SIZE * sizeof(chain_t*));
+}
+
+chain_t* new_hash_item(const uint64 a, const uint64 b) {
+  chain_t* new_item = malloc(sizeof(chain_t));
+  if (!new_item) {
+    fprintf(stderr, "Out of memory");
+    exit(-1);
+  }
+  new_item->a = a;
+  new_item->b = b;
+  new_item->next = NULL;
+  return new_item;
+}
+
+// return 1 if added, 0 if already exists
+int add_to_hash(chain_t** hashtable, const uint64 a, const uint64 b) {
+  uint32 hash = (a ^ b) % HASH_SIZE;
+  int exists = 0;
+  if (!hashtable[hash]) {
+    hashtable[hash] = new_hash_item(a, b);
+  } else {
+    chain_t* item = hashtable[hash];
+    while (item) {
+      if (a == item->a && b == item->b) {
+       exists = 1;
+       break;
+      } else if (item->next == NULL) {
+       break;
+      }
+      item = item->next;
+    }
+    if (!exists) {
+      item->next = new_hash_item(a, b);
+    }
+  }
+  return !exists;
+}
+
+int keep_line(chain_t** hashtable, const char* line) {
+  const char* comma = strchr(line, ',');
+  if (!comma) {  /* check comma before searching past it */
+    fprintf(stderr, "malformed line: %s", line);
+    return -1;
+  }
+  const char* colon = strchr(comma, ':');
+  if (!colon) {
+    fprintf(stderr, "malformed line: %s", line);
+    return -1;
+  }
+  int64 a = strtoll(line, NULL, 10);
+  int64 b = strtoll(comma + 1, NULL, 10);
+  return add_to_hash(hashtable, (uint64) a, (uint64) b);
+}
+
 int strccnt(const char *s, int c)
 {
        const unsigned char *us = (const unsigned char *) s;
@@ -52,8 +123,11 @@
        int numbad=0, numdups=0, numuniq=0,numskip=0;
        int DIM=1000;

+       chain_t** hashtable;
+       init_hashtable(&hashtable);
+
         if (argc == 2) {
          DIM=atoi(argv[1]);
         } else {
          fprintf(stderr,"\nusage: cat relations.file(s) | %s table_size > out_file \n"
                  "\t table_size is a number (1000-3000 recommended)\n\n", argv[0]);
@@ -101,6 +175,24 @@
                        printf("%s", buf);
                        continue;
                }
+
+               int ret = keep_line(hashtable, buf);
+               if (ret == -1) {  // bad line
+                 fprintf(badfile, "%s", buf);
+                 numbad++;
+                 goto skip_;
+               } else if (ret == 0) {  // duplicate
+                 numdups++;
+                 goto skip_;
+               } else if (ret == 1) {  // good
+                 numuniq++;
+                 if(numuniq % 500000 == 0)
+                   fprintf(stderr, "\r %.1fM relns \r", numuniq / 1000000.0);
+                 for(tmp=buf; *tmp; tmp++)
+                   *tmp = tolower(*tmp);  // will compress better
+                 printf("%s", buf);
+                 goto skip_;
+               }

                for(tmp = buf; *tmp && isspace(*tmp); tmp++);

2010-06-07, 08:55   #6
Batalov

Try both variants on a (pregenerated)* file with 300M relations; remdups has actually been run on literally hundreds of real files of this size (and up to 600M) - that is its intended use.
Check the memory use and the running time. remdups will need DIM ~= 1500-2000 for this exercise. With less it will still filter, just with 'variable perfection', so it can fit in less memory; it can be (and has been) run on halves of huge inputs, and then on the combined outputs.
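
That split workflow might look like this (file names are made up):
Code:
# filter each half separately first...
bzcat half1.dat.bz2 | ./remdups4 1500 | bzip2 > half1.uniq.bz2
bzcat half2.dat.bz2 | ./remdups4 1500 | bzip2 > half2.uniq.bz2
# ...then filter the combined outputs
bzcat half1.uniq.bz2 half2.uniq.bz2 | ./remdups4 2000 > combined.uniq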

_________
*Here's an ad-lib Perl script to make a fake .dat file (it is slow; run it once, so that both programs get the same fake input):
Code:
#!/usr/bin/perl -w
 
$skew = shift || 300000; # this is a GNFS-like skew
# $skew = 1.41; # some SNFS-like skew
$norm = shift || 1e8;
$redundancy = shift || 0.3;
$N = shift || 3e8;
 
$na = int($norm*sqrt($skew));
$nb = int($norm/sqrt($skew));
@RELS = () x 100000; # initial bank of redundant relations
 
for($i=0; $i < 100000; $i++) {
  $a=int(rand($na*2));
  $b=int(rand($nb-1))+1; # no free relations in this bank
  $RELS[$i] = sprintf("%d,%d:%x,%x,23ffd1a:%x,%x,98adddeadbeaf",$a-$na,$b,$a-$b,rand($b),$b/3,$a/77);
}
 
printf "N 76565765176671279927470020884882929292929010838293809810011\n";
 
for($i=0; $i< $N; $i++) {
  $a=int(rand($na*2));
  $b=int(rand($nb));
  if(!$b) {
    printf "%d,0:\n",$a; # some free relations for a good measure
    next;
  }
  $c=sprintf("%d,%d:%x,%x,23ffd1a:%x,%x,98adddeadbeaf",$a-$na,$b,$a-$b,$a+$b,$b/3,rand(91377)); # or any other garbage
  $RELS[rand(100000)] = $c;
  $c = $RELS[rand(100000)] if(rand(1)<$redundancy);
  print "$c\n";
}
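
To generate the fake file once (script name hypothetical; the optional arguments are skew, norm, redundancy and relation count, as above):
Code:
perl fakedat.pl 300000 1e8 0.3 3e8 > fake.dat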

2010-06-07, 12:32   #7
Wacky

Quote:
Originally Posted by frmky
A two-pass hashtable filter would use a lot less memory
Why "Two-pass"? One pass should be sufficient.
As you input each item, check the hash table to see if it has been seen before.
If not, output it and enter it into the hash table as "seen before".

As for the hash table, storing the actual key (the "a" and "b") is extremely wasteful: rather than spending 128+ bits on each entry, one bit is sufficient to record "seen". Choose a better hash function with many more (one-bit) buckets, and don't worry about the (rare) false-positive hits that cause you to discard a few relations.
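
A minimal sketch of that idea (the table size and the hash mixer are illustrative assumptions, not tested choices):
Code:
/* One-bit "seen" filter, as described above (sketch only).
   2^32 one-bit buckets = 512 MB of memory; a false positive merely
   discards a relation, which is acceptable here. */
#include <stdint.h>
#include <stdlib.h>

#define NBITS (1ULL << 32)

static unsigned char *seen;

void init_seen(void) {
    seen = calloc(NBITS / 8, 1);
    if (!seen) exit(-1);
}

/* returns 1 the first time (a,b) hashes here, 0 on a (possible) repeat */
int mark_seen(uint64_t a, uint64_t b) {
    uint64_t h = (a * 0x9e3779b97f4a7c15ULL ^ b) & (NBITS - 1); /* assumed mixer */
    unsigned char m = 1 << (h & 7);
    if (seen[h >> 3] & m) return 0;
    seen[h >> 3] |= m;
    return 1;
}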

2010-06-07, 12:59   #8
jasonp

As is typical with engineering problems: given a choice of methods, the best course is to use all of them at once :)

As the number of relations approaches the number of buckets, the number of false positives increases quadratically (hashing n relations into m one-bit buckets produces about n^2/2m collisions in expectation). It's easy to have lots of spare buckets when the complete dataset is 100 million relations, but we're closing in on 10x more than that for the largest jobs people here are handling. The professionals have to worry about problems 500x larger.

A two-pass method lets you use a bitwise hashtable to cut the dataset down to (a little more than) only the duplicates; an ordinary hashtable can then remove only the true positives without having to be much larger than the number of duplicates. Even that method runs out of steam for huge numbers of relations; you can stave off the inevitable by turning the first hashtable into a Bloom filter, but perhaps a better idea is to use a one-pass method combined with advanced associative data structures like Judy trees. (For the record, I've used Judy trees in the initial stages of NFS filtering, but they have problems of their own, and they don't save memory unless you do a lot of work to adapt them to NFS data.)
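
In outline, the two-pass idea might look like this (a sketch under assumed sizes and hash mixing, not msieve's actual code):
Code:
/* Two-pass duplicate removal, in outline (sketch only). */
#include <stdint.h>
#include <stdlib.h>

#define NBITS (1ULL << 32)

static unsigned char *seen, *coll;   /* one bit per bucket */

static uint64_t hash_ab(uint64_t a, uint64_t b) {
    return (a * 0x9e3779b97f4a7c15ULL ^ b) & (NBITS - 1); /* assumed mixer */
}

void pass_init(void) {
    seen = calloc(NBITS / 8, 1);
    coll = calloc(NBITS / 8, 1);
    if (!seen || !coll) exit(-1);
}

/* pass 1: mark every relation; a bucket hit twice gets its "coll" bit set */
void pass1_mark(uint64_t a, uint64_t b) {
    uint64_t h = hash_ab(a, b);
    unsigned char m = 1 << (h & 7);
    if (seen[h >> 3] & m) coll[h >> 3] |= m;
    else                  seen[h >> 3] |= m;
}

/* pass 2: a relation in an uncollided bucket is certainly unique and can
   be emitted directly; the rest go into an exact (a,b) hashtable that
   only needs to be about as large as the duplicate set */
int pass2_needs_exact_check(uint64_t a, uint64_t b) {
    uint64_t h = hash_ab(a, b);
    return (coll[h >> 3] >> (h & 7)) & 1;
}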

2010-06-07, 16:00   #9
Mini-Geek

In filtering the relations from 22M-23M (1733741 raw relations) of this sieve:
It takes a lot of memory, runs quickly, and only eliminates about 0.72% of the candidates as duplicates. Do these numbers sound right?
Despite the program suggesting a table size of 1000-3000, the optimum for this range seems, for me, to be more like 200. I have 2 GB of memory total, with just over 1 GB available to the program without any swapping. Here are some test results with the table size, memory usage, and approximate time needed:
Code:
table size 100, memory usage 260 MB, 15 seconds
table size 200, memory usage 515 MB, 10 seconds
table size 300, memory usage 770 MB, 15 seconds
They all output:
Found 1721231 unique, 12510 duplicate, and 1 bad relations.
Largest dimension used: 18 of 300
Average dimension used: 5.3 of 300


2010-06-07, 17:16   #10
jasonp

It sounds fine. You won't be able to delete most of the duplicates until the complete dataset is available all at once, which is why special measures have to be taken in msieve.

2010-06-07, 18:15   #11
Batalov

The suggested DIM should be changed to 5 per million raw relations (the user usually knows that number; it is wasteful to wc -l the file, and besides, the input may be a pipe). For the 300M-relation exercise above, that rule gives DIM = 1500. In your case, Tim, the optimal value is 19 (see "Largest dimension used").