mersenneforum.org Counting ideals during singleton removal
 User Name Remember Me? Password
 Register FAQ Search Today's Posts Mark Forums Read

 2007-12-28, 19:56 #1 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 22×32×179 Posts Counting ideals during singleton removal I thought the singleton-removal process involved repeated passes which found ideals that only occurred in one relation, and then removed those relations and the ideals in which they occurred from the set of relations. This would mean that, after a singleton-removal pass, the number of relations and the number of distinct prime ideals would both decrease. But, running a test case (the present state of play for 6^383+1), I get Code: found 22130393 hash collisions in 98352468 relations commencing duplicate removal, pass 2 found 8693969 duplicates and 89658499 unique relations memory use: 504.8 MB ignoring smallest 5036341 rational and 5036568 algebraic ideals filtering ideals above 86691876 need 17123945 more relations than ideals commencing singleton removal, pass 1 relations with 0 large ideals: 968390 relations with 1 large ideals: 6903778 relations with 2 large ideals: 20314406 relations with 3 large ideals: 30536388 relations with 4 large ideals: 23471476 relations with 5 large ideals: 7464061 relations with 6 large ideals: 0 relations with 7+ large ideals: 0 89658499 relations and about 76965979 large ideals commencing singleton removal, pass 2 found 26602951 singletons current dataset: 63055548 relations and about 46701442 large ideals commencing singleton removal, pass 3 relations with 0 large ideals: 968390 relations with 1 large ideals: 6160702 relations with 2 large ideals: 16120073 relations with 3 large ideals: 21418424 relations with 4 large ideals: 14416423 relations with 5 large ideals: 3971536 relations with 6 large ideals: 0 relations with 7+ large ideals: 0 63055548 relations and about 65568001 large ideals commencing singleton removal, pass 4 found 27251901 singletons current dataset: 35803647 relations and about 32163994 large ideals commencing singleton removal, pass 5 found 8896096 singletons current dataset: 26907551 relations and about 22272846 large ideals commencing singleton removal, pass 6 found 3574443 singletons current dataset: 23333108 relations and about 18498547 large ideals commencing singleton removal, pass 7 found 1609082 singletons current dataset: 21724026 relations and about 16844081 large ideals commencing singleton removal, pass 8 found 776955 singletons current dataset: 20947071 relations and about 16056136 large ideals commencing singleton removal, pass 9 found 388921 singletons current dataset: 20558150 relations and about 15664346 large ideals commencing singleton removal, final pass memory use: 674.1 MB commencing in-memory singleton removal begin with 20558150 relations and 19331039 unique ideals reduce to 7745572 relations and 5116329 ideals in 25 passes I can see how removing 26602951 singleton-containing relations in pass two can reduce the number of ideals by 30264537 - the singleton-containing relations sometimes had more than one single-occurence ideal. But I can't see how 18866559 ideals grew back in pass three, or 3666693 in the final pass. All the other passes make perfect sense to me. Last fiddled with by fivemack on 2007-12-28 at 20:01
2007-12-29, 04:27   #2
jasonp
Tribal Bullet

Oct 2004

DD716 Posts

Quote:
 Originally Posted by fivemack I can see how removing 26602951 singleton-containing relations in pass two can reduce the number of ideals by 30264537 - the singleton-containing relations sometimes had more than one single-occurence ideal. But I can't see how 18866559 ideals grew back in pass three, or 3666693 in the final pass. All the other passes make perfect sense to me.
The first singleton removal passes are approximate; the ideals are not mapped to unique numbers because the required hashtable would take up excessive amounts of memory. Instead, a multiplicative hash function is applied to each ideal and used to map all the ideals into bins of a fixed-size array. If all the ideals mapping to a single bin are singletons, then the underlying relations are deleted like before. Singleton removal passes proceed from this hashtable until relations stop getting pruned.

For extra-large size problems this fixed-size hashtable can become congested. This is bad, because too many ideals collide in the hashtable, which means not enough singletons will be discovered. The filtering code detects when this is likely, and then changes the hash function and refills the table. After one singleton removal pass there are many fewer ideals to add to the table, and the new hash function shuffles the mapping of ideals to bins, so that more singletons can be deleted.

The idea is to remove as many singletons as possible before switching to a unique mapping between ideals and bins; this occurs in the pass labeled 'final pass'; when complete the list of relations and large ideals is resident only in memory, and singleton removal proceeds as expected.

All of this code is in gnfs/filter/singleton.c

2007-12-29, 12:26   #3
xilman
Bamboozled!

"πΊππ·π·π­"
May 2003
Down not across

101011000100102 Posts

Quote:
 Originally Posted by jasonp The first singleton removal passes are approximate...
I've found an even more approximate solution to be worth doing and very easy to code. Remove all the singleton primes (not ideals) first. Assuming that one polynomial is linear, this removes all the singleton ideals on that side. For the other polynomial, removing all the singleton primes usually reduces the data size substantiantially, thereby making the more sophisticated algorithm run faster in less space.

Paul

 2007-12-29, 13:11 #4 Wacky     Jun 2003 The Texas Hill Country 100010000012 Posts And, to carry Paul's suggestion even farther, I have been doing that (with line sieving relations only) even before eliminating the few duplicates that always manage to creep into a data set that is collected from distributed sources. With lattice sieving, you will to first eliminate the duplicates because the duplication rate is so high.
2007-12-29, 20:53   #5
jasonp
Tribal Bullet

Oct 2004

3·1,181 Posts

Quote:
 Originally Posted by xilman I've found an even more approximate solution to be worth doing and very easy to code. Remove all the singleton primes (not ideals) first. Assuming that one polynomial is linear, this removes all the singleton ideals on that side. For the other polynomial, removing all the singleton primes usually reduces the data size substantiantially, thereby making the more sophisticated algorithm run faster in less space.
After the last time this was discussed, I did some experiments implementing it within msieve; both methods perform about the same. I was a little surprised that there was no measurable time difference between using ideals and using the underlying primes; perhaps accessing the hashtable takes most of the time anyway.

I did find that the difference between the true number of ideals and the number of ideals as estimated by the hashtable was much larger when using primes only (which isn't surprising). That would mean the size of the collisionless hashtable would be unnecessarily large, unless several passes were added that used an approximate hashtable and the ideals.

Another possibility is to use the Judy library, which implements a compressing radix tree associative data structure. Judy can map primes to unique counts in very little memory, and ideals to unique counts using much more memory. Unfortunately, the latter is where the memory bottleneck is anyway, so I figured it wasn't worth the added external library dependency.

Examining the log files that everyone is posting from big jobs, it seems that at the switchover point between the approximate and exact methods, the number of ideals goes up by 10-20%, and this still represents hundreds of megabytes of wasted memory during the exact singleton removal. It would be nice to get rid of that somehow, I think right now this pass takes up about as much memory as the merge phase.

2007-12-30, 11:17   #6
xilman
Bamboozled!

"πΊππ·π·π­"
May 2003
Down not across

2×37×149 Posts

Quote:
 Originally Posted by jasonp After the last time this was discussed, I did some experiments implementing it within msieve; both methods perform about the same. I was a little surprised that there was no measurable time difference between using ideals and using the underlying primes; perhaps accessing the hashtable takes most of the time anyway. I did find that the difference between the true number of ideals and the number of ideals as estimated by the hashtable was much larger when using primes only (which isn't surprising). That would mean the size of the collisionless hashtable would be unnecessarily large, unless several passes were added that used an approximate hashtable and the ideals. Another possibility is to use the Judy library, which implements a compressing radix tree associative data structure. Judy can map primes to unique counts in very little memory, and ideals to unique counts using much more memory. Unfortunately, the latter is where the memory bottleneck is anyway, so I figured it wasn't worth the added external library dependency. Examining the log files that everyone is posting from big jobs, it seems that at the switchover point between the approximate and exact methods, the number of ideals goes up by 10-20%, and this still represents hundreds of megabytes of wasted memory during the exact singleton removal. It would be nice to get rid of that somehow, I think right now this pass takes up about as much memory as the merge phase.
Perhaps a significant difference is that I used a couple of bit maps, rather than a hash table.

Here is my qndsr.c program. It's small, inelegant and only does one pass per run of the program. Cleaning it up and making it do a better job is left as an exercise.

Code:
/* Quick & dirty singleton remover.  PC Leyland 2002-09-04 */

/*  Read relations from stdin and write a set of files which have
first-level singletons removed.  Subsequent runs of the progaram
are needed if singletons generated in the preceeding pass are to
be removed.  The output files are limited to (slightly greater
than) 1e9 bytes to avoid file size limits on common operating
systems.

Uses temporary files not exceeding 1Gb each and total size ~< size
of input.  There must be room for these, together with the output
file(s).

This is *not* an elegantly written program.  Anyone who wants to
make it nicer and/or more robust is welcome to work on it.  The
input file parser, in particular, is rather brittle.  strtol()
from stdlib.h could probably be used for cleanliness but may be a
bit slower because my code doesn't do error checking or work in
bases other than 10.

*/

#include <stdio.h>

static unsigned char ratprimes[1<<27], algprimes[1<<27]; /* Enough for primes to 1G */

main ()
{
int linesize, tfileno=0, tfileoutchars=0, ofileno=0, ofileoutchars=0;
int nratp, nalgp, p;
int i, tf, relcount = 0;
char linebuf[10240], tfilename[24], ofilename[24], *s;
FILE *tfile, *ofile;

int getline (char *buf, FILE *f);

sprintf (tfilename, "__TEMP__%03d", tfileno);
tfile = fopen (tfilename, "w");
if (tfile == (FILE *) NULL)
{
fprintf (stderr, "Can't open temporary file %s for writing\n");
exit (1);
}
while (linesize = getline (linebuf, stdin))
{
if (linebuf[0] != '0') continue; /* A relation? */
nratp = linebuf[2] - '0';
if (nratp < 0 || nratp > 9)
{
fprintf (stderr, "Strange # primes in 1st poly, %d\t %s", nratp, linebuf);
continue;
}
nalgp = linebuf[3] - '0';
if (nalgp < 0 || nalgp > 9)
{
fprintf (stderr, "Strange # primes in 2nd poly, %d\t %s", nalgp, linebuf);
continue;
}

relcount++;
s = linebuf + 5;
while (*s++ != ' ');	/* Empty, skip a */
while (*s++ != ' ');	/* Empty, skip b */

for (i = 1; i <= nratp; i++)
{
p = 0;
while (isdigit (*s))
{
p = p*10 + (*s++ - '0');
}
s++;			/* Skip space */

p >>= 1;
byte = p >> 2;
crumb = (p & 3) * 2;
maskhi = 2 << crumb;
masklo = 1 << crumb;

if (ratprimes[byte] & masklo)		/* Seen at least once */
ratprimes[byte] |= maskhi;		/* Set as seen at least twice */
else
ratprimes[byte] |= masklo;		/* Set as seen at least once */
}

for (i = 1; i <= nalgp; i++)
{
p = 0;
while (isdigit (*s))
{
p = p*10 + (*s++ - '0');
}
s++;			/* Skip space */

p >>= 1;
byte = p >> 2;
crumb = (p & 3) * 2;
maskhi = 2 << crumb;
masklo = 1 << crumb;
if (algprimes[byte] & masklo)		/* Seen at least once */
algprimes[byte] |= maskhi;		/* Set as seen at least twice */
else
algprimes[byte] |= masklo;		/* Set as seen at least once */
}

tfileoutchars += fprintf (tfile, "%s", linebuf);

if (tfileoutchars > 1000000000)
{
fclose (tfile);
sprintf (tfilename, "__TEMP__%03d", ++tfileno);
tfile = fopen (tfilename, "w");
if (tfile == (FILE *) NULL)
{
fprintf (stderr, "Can't open temporary file %s for writing\n");
exit (1);
}
tfileoutchars = 0;
}
}

printf ("Read %d (not necessarily distinct) relations\n", relcount);

/* Ok, now have a bit map which tells which primes on each side have
been seen at least twice, and we have one or more files
containing clean relations (no comments, etc).  Read them in and
write them out only if all primes in a relation are seen at least
twice. */

relcount = 0;

sprintf (ofilename, "OUT__%03d", ofileno);
ofile = fopen (ofilename, "w");
if (ofile == (FILE *) NULL)
{
fprintf (stderr, "Can't open output file %s\n");
exit (1);
}

for (tf = 0; tf <= tfileno; tf++)
{
fclose (tfile);
sprintf (tfilename, "__TEMP__%03d", tf);
tfile = fopen (tfilename, "r");
if (tfile == (FILE *) NULL)
{
fprintf (stderr, "Can't open temporary file %s for reading \n");
exit (1);
}

while (linesize = getline (linebuf, tfile))
{
nratp = linebuf[2] - '0';
nalgp = linebuf[3] - '0';

s = linebuf + 5;
while (*s++ != ' ');	/* Empty, skip a */
while (*s++ != ' ');	/* Empty, skip b */

for (i = 1; i <= nratp; i++)
{
p = 0;
while (isdigit (*s))
{
p = p*10 + (*s++ - '0');
}
s++;			/* Skip space */

p >>= 1;
byte = p >> 2;
crumb = (p & 3) * 2;
maskhi = 2 << crumb;

if (!(ratprimes[byte] & maskhi))	/* Not seen at least twice */
goto nextrel;
}

for (i = 1; i <= nalgp; i++)
{
p = 0;
while (isdigit (*s))
{
p = p*10 + (*s++ - '0');
}
s++;			/* Skip space */

p >>= 1;
byte = p >> 2;
crumb = (p & 3) * 2;
maskhi = 2 << crumb;

if (!(algprimes[byte] & maskhi))	/* Not seen at least twice */
goto nextrel;
}

/* Seen them all at least twice, so write it out. */

ofileoutchars += fprintf (ofile, "%s", linebuf);
relcount++;

if (ofileoutchars > 1000000000)
{
fclose (ofile);
sprintf (ofilename, "OUT__%03d", ++ofileno);
ofile = fopen (ofilename, "w");
if (ofile == (FILE *) NULL)
{
fprintf (stderr, "Can't open output file %s for writing\n");
exit (1);
}
ofileoutchars = 0;
}
nextrel: ; /* Empty */
}
}
fclose (tfile);
fclose (ofile);

printf ("Wrote %d (not necessarily distinct) relations\n", relcount);
}

int getline (char *s, FILE *infile)
{
int c, i = 0;

while ((c = getc (infile)) != EOF && c != '\n')
if (c != '\r') s[i++] = c;			/* Strip out CR */
if (c == '\n') s[i++] = c;
s[i] = '\0';
return (i);
}
Paul

P.S. I forgot to say this is for CWI-format relations. Converting to other formats is also left as an exercise.

Last fiddled with by xilman on 2007-12-30 at 11:22 Reason: Add P.S.

 Similar Threads Thread Thread Starter Forum Replies Last Post Nick Number Theory Discussion Group 5 2017-06-04 12:36 paul0 Factoring 2 2015-01-19 11:52 TObject PrimeNet 0 2012-09-23 17:59 fivemack Abstract Algebra & Algebraic Number Theory 10 2012-01-22 11:01 blackguard PrimeNet 4 2005-02-15 16:08

All times are UTC. The time now is 08:29.

Fri Dec 3 08:29:11 UTC 2021 up 133 days, 2:58, 0 users, load averages: 1.14, 1.84, 1.99