mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   remdups: a relation-file filter (https://www.mersenneforum.org/showthread.php?t=13498)

Batalov 2010-06-06 20:07

remdups: a relation-file filter
 
Greg's program, wonderful in its simplicity, has now been added to the [URL="http://ggnfs.svn.sourceforge.net/viewvc/ggnfs/trunk/contrib/remdups/"]SourceForge GGNFS/contrib[/URL] branch. This is a very useful tool for minimizing relation sets (.dat files) before transferring them. It can also be used right before msieve filtering, if your disk space is tight.

It is a completely standalone filter, not really part of msieve or ggnfs. The filter is not aware of the polynomial and only checks for the three-field, colon-separated syntax (obvious caveat: do not filter relations from different projects together! the program will not complain). It correctly cares only about unique (a,b) pairs; the data in the 2nd and 3rd fields may be permuted and textually non-unique (and/or contain tiny factors), which does not make them distinct [I]relations[/I]. The program does not remove relations with b>2[SUP]32[/SUP] (reserved for future use).
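For reference, the (a,b) extraction that uniqueness checking rests on can be sketched as below. This is an illustrative helper, not remdups code; it assumes only the leading "a,b:" portion of the three-field line format described above:

[CODE]#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Extract the (a,b) pair from a relation line "a,b:primes:primes".
   Returns 0 on success, -1 on a malformed line. Only the first two
   fields matter for uniqueness; the factor lists are ignored. */
static int parse_ab(const char *line, long long *a, long long *b)
{
    char *end;
    *a = strtoll(line, &end, 10);
    if (end == line || *end != ',')
        return -1;              /* no leading integer, or missing comma */
    *b = strtoll(end + 1, &end, 10);
    if (*end != ':')
        return -1;              /* second field must end at a colon */
    return 0;
}

int main(void)
{
    long long a, b;
    assert(parse_ab("-1234,56:3e,ab5:11,1f", &a, &b) == 0);
    assert(a == -1234 && b == 56);
    assert(parse_ab("garbage line", &a, &b) == -1);
    return 0;
}[/CODE]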

[SIZE=1]I use the unix-piped version (remdups4): you can pipe (b)zcat'ed relations into it, and you can discard the output (redirect to /dev/null) if you only want the stats. You can also pipe straight into bzip2 (for FTPing afterwards).[/SIZE]
[SIZE=1]It will not work in the Windows cmd shell, however; the original remdups is provided for Windows as well (aside from the use of pipes, the two are kept in sync). Cygwin should be fine.[/SIZE]

henryzz 2010-06-06 20:22

[quote=Batalov;217632]Greg's wonderful in its simplicity program is now added to [URL="http://ggnfs.svn.sourceforge.net/viewvc/ggnfs/trunk/contrib/remdups/"]SourceForge GGNFS/contrib[/URL] branch. [...][/quote]
Brilliant! This should be helpful when I accidentally duplicate lots of relations in the same file.

Phil MjX 2010-06-06 20:42

Thanks!

Just starting a c161 with 2 to 4 PCs, and I was wondering whether someone had taken the time to write such a tool!

Regards,
Philippe

frmky 2010-06-06 22:48

A two-pass hashtable filter would use a lot less memory, but I wrote this in 10 minutes as a quick filter and it's been "good enough" that I haven't bothered rewriting it. The current version includes a few improvements thanks to Serge.

apocalypse 2010-06-07 06:51

Here's a first-draft patch that uses a hashtable. On a file with about 5M relations, it seems to use roughly 150 MB instead of roughly 1 GB. The hash function is almost certainly suboptimal, and I didn't get the bad-relation check right, but I thought I'd share it anyway in case someone else finds it useful or would like to give me feedback.

[CODE]dev/trunk/contrib/remdups$ svn diff remdups4.c
Index: remdups4.c
===================================================================
--- remdups4.c (revision 394)
+++ remdups4.c (working copy)
@@ -31,9 +31,80 @@
typedef unsigned long uint32;
typedef long int32;
typedef unsigned long long uint64;
+typedef long long int64;

+typedef struct _chain_t {
+ uint64 a;
+ uint64 b;
+ struct _chain_t* next;
+} chain_t;
+
+#define HASH_SIZE 1000000
+
#define MDVAL 327673

+void init_hashtable(chain_t*** hashtable) {
+ if (hashtable == NULL) {
+ fprintf(stderr, "Unable to initialize hash\n");
+ exit(-1);
+ }
+
+ *hashtable = malloc(HASH_SIZE * sizeof(chain_t*));
+ if (!*hashtable) { /* check the allocation, not the argument */
+ fprintf(stderr, "Out of memory");
+ exit(-1);
+ }
+ memset(*hashtable, 0, HASH_SIZE * sizeof(chain_t*));
+}
+
+chain_t* new_hash_item(const uint64 a, const uint64 b) {
+ chain_t* new_item = malloc(sizeof(chain_t));
+ if (!new_item) {
+ fprintf(stderr, "Out of memory");
+ exit(-1);
+ }
+ new_item->a = a;
+ new_item->b = b;
+ new_item->next = NULL;
+ return new_item;
+}
+
+// return 1 if added, 0 if already exists
+int add_to_hash(chain_t** hashtable, const uint64 a, const uint64 b) {
+ uint32 hash = (a ^ b) % HASH_SIZE;
+ int exists = 0;
+ if (!hashtable[hash]) {
+ hashtable[hash] = new_hash_item(a, b);
+ } else {
+ chain_t* item = hashtable[hash];
+ while (item) {
+ if (a == item->a && b == item->b) {
+ exists = 1;
+ break;
+ } else if (item->next == NULL) {
+ break;
+ }
+ item = item->next;
+ }
+ if (!exists) {
+ item->next = new_hash_item(a, b);
+ }
+ }
+ return !exists;
+}
+
+int keep_line(chain_t** hashtable, const char* line) {
+ const char* comma = strchr(line, ',');
+ const char* colon = comma ? strchr(comma, ':') : NULL; /* don't deref NULL */
+ if (!comma || !colon) {
+ fprintf(stderr, "malformed line: %s", line);
+ return -1;
+ }
+ int64 a = strtoll(line, NULL, 10);
+ int64 b = strtoll(comma + 1, NULL, 10);
+ return add_to_hash(hashtable, (uint64) a, (uint64) b);
+}
+
int strccnt(const char *s, int c)
{
const unsigned char *us = (const unsigned char *) s;
@@ -52,8 +123,11 @@
int numbad=0, numdups=0, numuniq=0,numskip=0;
int DIM=1000;

+ chain_t** hashtable;
+ init_hashtable(&hashtable);
+
if (argc == 2) {
DIM=atoi(argv[1]);
} else {
fprintf(stderr,"\nusage: cat relations.file(s) | %s table_size > out_file \n"
"\t table_size is a number (1000-3000 recommended)\n\n", argv[0]);
@@ -101,6 +175,24 @@
printf("%s", buf);
continue;
}
+
+ int ret = keep_line(hashtable, buf);
+ if (ret == -1) { // bad line
+ fprintf(badfile, "%s", buf);
+ numbad++;
+ goto skip_;
+ } else if (ret == 0) { // duplicate
+ numdups++;
+ goto skip_;
+ } else if (ret == 1) { // good
+ numuniq++;
+ if(numuniq % 500000 == 0)
+ fprintf(stderr, "\r %.1fM relns \r", numuniq / 1000000.0);
+ for(tmp=buf; *tmp; tmp++)
+ *tmp = tolower(*tmp); // will compress better
+ printf("%s", buf);
+ goto skip_;
+ }

for(tmp = buf; *tmp && isspace(*tmp); tmp++);
[/CODE]

Batalov 2010-06-07 08:55

Try both variants on a (pregenerated)* file with 300M relations; remdups has actually been run on literally hundreds of real files of this size (and up to 600M): this is its intended use.
Check the memory use and the running time. remdups will need DIM ~= 1500-2000 for this exercise. With less, it will still filter, but with 'variable perfection', so it can fit in less memory; it can be (and has been) run on halves of huge inputs, and then on the combined outputs.

_________
*Here's an ad lib perl script to make a fake .dat file (it is slow; run it once, so that both programs get the same fake input):
[CODE]#!/usr/bin/perl -w

$skew = shift || 300000; # this is a GNFS-like skew
# $skew = 1.41; # some SNFS-like skew
$norm = shift || 1e8;
$redundancy = shift || 0.3;
$N = shift || 3e8;

$na = int($norm*sqrt($skew));
$nb = int($norm/sqrt($skew));
@RELS = () x 100000; # initial bank of redundant relations

for ($i = 0; $i < 100000; $i++) {
    $a = int(rand($na*2));
    $b = int(rand($nb-1)) + 1; # no free relations in this bank
    $RELS[$i] = sprintf("%d,%d:%x,%x,23ffd1a:%x,%x,98adddeadbeaf", $a-$na, $b, $a-$b, rand($b), $b/3, $a/77);
}

printf "N 76565765176671279927470020884882929292929010838293809810011\n";

for ($i = 0; $i < $N; $i++) {
    $a = int(rand($na*2));
    $b = int(rand($nb));
    if (!$b) {
        printf "%d,0:\n", $a; # some free relations for good measure
        next;
    }
    $c = sprintf("%d,%d:%x,%x,23ffd1a:%x,%x,98adddeadbeaf", $a-$na, $b, $a-$b, $a+$b, $b/3, rand(91377)); # or any other garbage
    $RELS[rand(100000)] = $c;
    $c = $RELS[rand(100000)] if (rand(1) < $redundancy);
    print "$c\n";
}
[/CODE]

Wacky 2010-06-07 12:32

[QUOTE=frmky;217660]A two-pass hashtable filter would use a lot less memory[/QUOTE]
Why "Two-pass"? One pass should be sufficient.
As you input each item, check the hash table to see if it has been seen before.
If not, output it and enter it into the hash table as "seen before".

As for the hash table, storing the actual key (the "a" and "b") is extremely wasteful. Rather than using 128+ bits for each entry, one bit is sufficient to record "seen".
Choose a better hash function with many more (one bit) buckets and don't worry about the (rare) "false positive" hits that cause you to discard a few relations.
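A minimal sketch of that one-bit scheme follows. The bucket count and mixing constants are illustrative choices, not anything from remdups; note that a false positive silently discards a unique relation, which NFS oversieving can tolerate:

[CODE]#include <assert.h>
#include <stdint.h>

/* One bit per bucket: e.g. 2^30 buckets would cost only 128 MB.
   This sketch uses 2^27 buckets (16 MB). */
#define NBITS (1u << 27)

static uint8_t seen[NBITS / 8];

static uint32_t hash_ab(int64_t a, uint64_t b)
{
    /* illustrative multiplicative mixing, not the remdups hash */
    uint64_t h = (uint64_t)a * 0x9e3779b97f4a7c15ULL ^ b * 0xc2b2ae3d27d4eb4fULL;
    h ^= h >> 29;
    return (uint32_t)(h & (NBITS - 1));
}

/* Returns 1 the first time an (a,b) pair (or a colliding one) is seen,
   0 for a duplicate (or a rare false positive). */
static int mark_seen(int64_t a, uint64_t b)
{
    uint32_t h = hash_ab(a, b);
    uint8_t mask = (uint8_t)(1u << (h & 7));
    if (seen[h >> 3] & mask)
        return 0;
    seen[h >> 3] |= mask;
    return 1;
}

int main(void)
{
    assert(mark_seen(-123, 45) == 1);   /* first sighting: keep */
    assert(mark_seen(-123, 45) == 0);   /* exact duplicate: drop */
    mark_seen(-124, 45);                /* almost surely a fresh bucket */
    return 0;
}[/CODE]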

jasonp 2010-06-07 12:59

As is typical with engineering problems, given a choice of methods to solve a problem the best course is to use all of them at once :)

As the number of relations converges to the number of buckets, the number of false positives increases quadratically. It's easy to have lots of spare buckets when the complete dataset is 100 million relations, but we're closing in on 10x more than that for the largest jobs people here are handling. The professionals have to worry about problems 500x larger.

A two-pass method lets you use a bitwise hashtable to limit the dataset to (a little more than) only the duplicates; then an ordinary hashtable can remove only the true positives without having to be much larger than the number of duplicates. Even that method would run out of steam when dealing with huge numbers of relations; you can stave off the inevitable by turning the first hashtable into a Bloom filter, but perhaps a better idea is to use a one-pass method combined with advanced associative data structures like Judy trees. (For the record, I've used Judy trees in the initial stages of NFS filtering, but they have problems of their own, and they don't save memory unless you do a lot of work to adapt them to NFS data.)
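The two-pass idea can be sketched in miniature like so. Toy sizes throughout, with a flat array standing in for the second, exact hashtable; none of this is actual msieve or remdups code:

[CODE]#include <assert.h>
#include <stdint.h>

/* Pass 1: a 1-bit-per-bucket table marks buckets hit more than once.
   Pass 2: only relations landing in a "hit twice" bucket enter an
   exact table, so the exact table scales with the duplicate count. */

#define NB 4096                       /* illustrative bucket count */
static uint8_t once[NB / 8], twice[NB / 8];

static uint32_t hb(int64_t a, uint64_t b)
{ return (uint32_t)(((uint64_t)a * 2654435761u ^ b) & (NB - 1)); }

static void pass1(int64_t a, uint64_t b)
{
    uint32_t h = hb(a, b);
    if (once[h >> 3] & (1 << (h & 7)))
        twice[h >> 3] |= 1 << (h & 7);
    else
        once[h >> 3] |= 1 << (h & 7);
}

/* Tiny stand-in for a real exact hashtable over (a,b). */
static int64_t exact_a[64]; static uint64_t exact_b[64]; static int nexact;

/* Pass 2: return 1 to keep the relation, 0 to drop it as a duplicate. */
static int pass2(int64_t a, uint64_t b)
{
    uint32_t h = hb(a, b);
    int i;
    if (!(twice[h >> 3] & (1 << (h & 7))))
        return 1;                     /* bucket hit once: surely unique */
    for (i = 0; i < nexact; i++)
        if (exact_a[i] == a && exact_b[i] == b)
            return 0;                 /* true duplicate */
    if (nexact < 64) {                /* sketch: no growth, just a bound */
        exact_a[nexact] = a; exact_b[nexact] = b; nexact++;
    }
    return 1;
}

int main(void)
{
    int64_t A[] = { 1, 2, 1, 3 };
    uint64_t B[] = { 9, 9, 9, 9 };
    int i, kept = 0;
    for (i = 0; i < 4; i++) pass1(A[i], B[i]);
    for (i = 0; i < 4; i++) kept += pass2(A[i], B[i]);
    assert(kept == 3);                /* (1,9) appears twice; one copy dropped */
    return 0;
}[/CODE]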

Mini-Geek 2010-06-07 16:00

In filtering the relations from 22M-23M (1733741 raw relations) of [URL="http://www.mersenneforum.org/showthread.php?t=13328"]this sieve[/URL]:
It takes a lot of memory, runs quickly, and only eliminates about 0.72% of the candidates as duplicates. Do these numbers sound right?
Despite the program suggesting a table size of 1000-3000, the optimal size for this range seems to be more like 200 for me. I have 2 GB of memory total, with just over 1 GB available to the program without any swapping. Here are some test results with the table size, memory usage, and approximate time needed:
[CODE]table size 100: memory usage 260 MB, ~15 seconds
table size 200: memory usage 515 MB, ~10 seconds
table size 300: memory usage 770 MB, ~15 seconds

They all output:
Found 1721231 unique, 12510 duplicate, and 1 bad relations.
Largest dimension used: 18 of 300
Average dimension used: 5.3 of 300[/CODE]

jasonp 2010-06-07 17:16

It sounds fine. You won't be able to delete most of the duplicates until the complete dataset is available all at once, which is why special measures have to be taken in msieve.

Batalov 2010-06-07 18:15

The suggested DIM should be changed to 5 per million raw relations (the user usually knows that number; it is wasteful to wc -l the file, and besides, it may be a pipe). In your case, Tim, the optimal value is 19 (see "Largest dimension used").

