mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Reply
 
Thread Tools
Old 2007-12-28, 19:56   #1
fivemack
(loop (#_fork))
 
fivemack's Avatar
 
Feb 2006
Cambridge, England

22×32×179 Posts
Default Counting ideals during singleton removal

I thought the singleton-removal process involved repeated passes which found ideals that only occurred in one relation, and then removed those relations and the ideals in which they occurred from the set of relations.

This would mean that, after a singleton-removal pass, the number of relations and the number of distinct prime ideals would both decrease.

But, running a test case (the present state of play for 6^383+1), I get

Code:
found 22130393 hash collisions in 98352468 relations
commencing duplicate removal, pass 2
found 8693969 duplicates and 89658499 unique relations
memory use: 504.8 MB
ignoring smallest 5036341 rational and 5036568 algebraic ideals
filtering ideals above 86691876
need 17123945 more relations than ideals
commencing singleton removal, pass 1
relations with 0 large ideals: 968390
relations with 1 large ideals: 6903778
relations with 2 large ideals: 20314406
relations with 3 large ideals: 30536388
relations with 4 large ideals: 23471476
relations with 5 large ideals: 7464061
relations with 6 large ideals: 0
relations with 7+ large ideals: 0
89658499 relations and about 76965979 large ideals
commencing singleton removal, pass 2
found 26602951 singletons
current dataset: 63055548 relations and about 46701442 large ideals
commencing singleton removal, pass 3
relations with 0 large ideals: 968390
relations with 1 large ideals: 6160702
relations with 2 large ideals: 16120073
relations with 3 large ideals: 21418424
relations with 4 large ideals: 14416423
relations with 5 large ideals: 3971536
relations with 6 large ideals: 0
relations with 7+ large ideals: 0
63055548 relations and about 65568001 large ideals
commencing singleton removal, pass 4
found 27251901 singletons
current dataset: 35803647 relations and about 32163994 large ideals
commencing singleton removal, pass 5
found 8896096 singletons
current dataset: 26907551 relations and about 22272846 large ideals
commencing singleton removal, pass 6
found 3574443 singletons
current dataset: 23333108 relations and about 18498547 large ideals
commencing singleton removal, pass 7
found 1609082 singletons
current dataset: 21724026 relations and about 16844081 large ideals
commencing singleton removal, pass 8
found 776955 singletons
current dataset: 20947071 relations and about 16056136 large ideals
commencing singleton removal, pass 9
found 388921 singletons
current dataset: 20558150 relations and about 15664346 large ideals
commencing singleton removal, final pass
memory use: 674.1 MB
commencing in-memory singleton removal
begin with 20558150 relations and 19331039 unique ideals
reduce to 7745572 relations and 5116329 ideals in 25 passes
I can see how removing 26602951 singleton-containing relations in pass two can reduce the number of ideals by 30264537 - the singleton-containing relations sometimes had more than one single-occurence ideal. But I can't see how 18866559 ideals grew back in pass three, or 3666693 in the final pass. All the other passes make perfect sense to me.

Last fiddled with by fivemack on 2007-12-28 at 20:01
fivemack is offline   Reply With Quote
Old 2007-12-29, 04:27   #2
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

DD716 Posts
Default

Quote:
Originally Posted by fivemack View Post
I can see how removing 26602951 singleton-containing relations in pass two can reduce the number of ideals by 30264537 - the singleton-containing relations sometimes had more than one single-occurence ideal. But I can't see how 18866559 ideals grew back in pass three, or 3666693 in the final pass. All the other passes make perfect sense to me.
The first singleton removal passes are approximate; the ideals are not mapped to unique numbers because the required hashtable would take up excessive amounts of memory. Instead, a multiplicative hash function is applied to each ideal and used to map all the ideals into bins of a fixed-size array. If all the ideals mapping to a single bin are singletons, then the underlying relations are deleted like before. Singleton removal passes proceed from this hashtable until relations stop getting pruned.

For extra-large size problems this fixed-size hashtable can become congested. This is bad, because too many ideals collide in the hashtable, which means not enough singletons will be discovered. The filtering code detects when this is likely, and then changes the hash function and refills the table. After one singleton removal pass there are many fewer ideals to add to the table, and the new hash function shuffles the mapping of ideals to bins, so that more singletons can be deleted.

The idea is to remove as many singletons as possible before switching to a unique mapping between ideals and bins; this occurs in the pass labeled 'final pass'; when complete the list of relations and large ideals is resident only in memory, and singleton removal proceeds as expected.

All of this code is in gnfs/filter/singleton.c
jasonp is offline   Reply With Quote
Old 2007-12-29, 12:26   #3
xilman
Bamboozled!
 
xilman's Avatar
 
"π’‰Ίπ’ŒŒπ’‡·π’†·π’€­"
May 2003
Down not across

101011000100102 Posts
Default

Quote:
Originally Posted by jasonp View Post
The first singleton removal passes are approximate...
I've found an even more approximate solution to be worth doing and very easy to code. Remove all the singleton primes (not ideals) first. Assuming that one polynomial is linear, this removes all the singleton ideals on that side. For the other polynomial, removing all the singleton primes usually reduces the data size substantiantially, thereby making the more sophisticated algorithm run faster in less space.


Paul
xilman is offline   Reply With Quote
Old 2007-12-29, 13:11   #4
Wacky
 
Wacky's Avatar
 
Jun 2003
The Texas Hill Country

100010000012 Posts
Default

And, to carry Paul's suggestion even farther, I have been doing that (with line sieving relations only) even before eliminating the few duplicates that always manage to creep into a data set that is collected from distributed sources.

With lattice sieving, you will to first eliminate the duplicates because the duplication rate is so high.
Wacky is offline   Reply With Quote
Old 2007-12-29, 20:53   #5
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3·1,181 Posts
Default

Quote:
Originally Posted by xilman View Post
I've found an even more approximate solution to be worth doing and very easy to code. Remove all the singleton primes (not ideals) first. Assuming that one polynomial is linear, this removes all the singleton ideals on that side. For the other polynomial, removing all the singleton primes usually reduces the data size substantiantially, thereby making the more sophisticated algorithm run faster in less space.
After the last time this was discussed, I did some experiments implementing it within msieve; both methods perform about the same. I was a little surprised that there was no measurable time difference between using ideals and using the underlying primes; perhaps accessing the hashtable takes most of the time anyway.

I did find that the difference between the true number of ideals and the number of ideals as estimated by the hashtable was much larger when using primes only (which isn't surprising). That would mean the size of the collisionless hashtable would be unnecessarily large, unless several passes were added that used an approximate hashtable and the ideals.

Another possibility is to use the Judy library, which implements a compressing radix tree associative data structure. Judy can map primes to unique counts in very little memory, and ideals to unique counts using much more memory. Unfortunately, the latter is where the memory bottleneck is anyway, so I figured it wasn't worth the added external library dependency.

Examining the log files that everyone is posting from big jobs, it seems that at the switchover point between the approximate and exact methods, the number of ideals goes up by 10-20%, and this still represents hundreds of megabytes of wasted memory during the exact singleton removal. It would be nice to get rid of that somehow, I think right now this pass takes up about as much memory as the merge phase.
jasonp is offline   Reply With Quote
Old 2007-12-30, 11:17   #6
xilman
Bamboozled!
 
xilman's Avatar
 
"π’‰Ίπ’ŒŒπ’‡·π’†·π’€­"
May 2003
Down not across

2×37×149 Posts
Default

Quote:
Originally Posted by jasonp View Post
After the last time this was discussed, I did some experiments implementing it within msieve; both methods perform about the same. I was a little surprised that there was no measurable time difference between using ideals and using the underlying primes; perhaps accessing the hashtable takes most of the time anyway.

I did find that the difference between the true number of ideals and the number of ideals as estimated by the hashtable was much larger when using primes only (which isn't surprising). That would mean the size of the collisionless hashtable would be unnecessarily large, unless several passes were added that used an approximate hashtable and the ideals.

Another possibility is to use the Judy library, which implements a compressing radix tree associative data structure. Judy can map primes to unique counts in very little memory, and ideals to unique counts using much more memory. Unfortunately, the latter is where the memory bottleneck is anyway, so I figured it wasn't worth the added external library dependency.

Examining the log files that everyone is posting from big jobs, it seems that at the switchover point between the approximate and exact methods, the number of ideals goes up by 10-20%, and this still represents hundreds of megabytes of wasted memory during the exact singleton removal. It would be nice to get rid of that somehow, I think right now this pass takes up about as much memory as the merge phase.
Perhaps a significant difference is that I used a couple of bit maps, rather than a hash table.

Here is my qndsr.c program. It's small, inelegant and only does one pass per run of the program. Cleaning it up and making it do a better job is left as an exercise.

Code:
/* Quick & dirty singleton remover.  PC Leyland 2002-09-04 */

/*  Read relations from stdin and write a set of files which have
    first-level singletons removed.  Subsequent runs of the progaram
    are needed if singletons generated in the preceeding pass are to
    be removed.  The output files are limited to (slightly greater
    than) 1e9 bytes to avoid file size limits on common operating
    systems.

    Uses temporary files not exceeding 1Gb each and total size ~< size
    of input.  There must be room for these, together with the output
    file(s).

    This is *not* an elegantly written program.  Anyone who wants to
    make it nicer and/or more robust is welcome to work on it.  The
    input file parser, in particular, is rather brittle.  strtol()
    from stdlib.h could probably be used for cleanliness but may be a
    bit slower because my code doesn't do error checking or work in
    bases other than 10.

*/

#include <stdio.h>

static unsigned char ratprimes[1<<27], algprimes[1<<27]; /* Enough for primes to 1G */

main ()
{
  int linesize, tfileno=0, tfileoutchars=0, ofileno=0, ofileoutchars=0;
  int nratp, nalgp, p;
  int byte, crumb, maskhi, masklo;
  int i, tf, relcount = 0;
  char linebuf[10240], tfilename[24], ofilename[24], *s;
  FILE *tfile, *ofile;

  int getline (char *buf, FILE *f);

  sprintf (tfilename, "__TEMP__%03d", tfileno);
  tfile = fopen (tfilename, "w");
  if (tfile == (FILE *) NULL)
    {
      fprintf (stderr, "Can't open temporary file %s for writing\n");
      exit (1);
    }
  while (linesize = getline (linebuf, stdin))
    {
      if (linebuf[0] != '0') continue; /* A relation? */
      nratp = linebuf[2] - '0';
      if (nratp < 0 || nratp > 9)
	{
	  fprintf (stderr, "Strange # primes in 1st poly, %d\t %s", nratp, linebuf);
	  continue;
	}
      nalgp = linebuf[3] - '0';
      if (nalgp < 0 || nalgp > 9)
	{
	  fprintf (stderr, "Strange # primes in 2nd poly, %d\t %s", nalgp, linebuf);
	  continue;
	}

      relcount++;
      s = linebuf + 5;
      while (*s++ != ' ');	/* Empty, skip a */
      while (*s++ != ' ');	/* Empty, skip b */

      for (i = 1; i <= nratp; i++)
	{
	  p = 0;
	  while (isdigit (*s))
	    {
	      p = p*10 + (*s++ - '0');
	    }
	  s++;			/* Skip space */

	  p >>= 1;
	  byte = p >> 2;
	  crumb = (p & 3) * 2;
	  maskhi = 2 << crumb;
	  masklo = 1 << crumb;

	  if (ratprimes[byte] & masklo)		/* Seen at least once */
	    ratprimes[byte] |= maskhi;		/* Set as seen at least twice */
	  else
	    ratprimes[byte] |= masklo;		/* Set as seen at least once */
	}

      for (i = 1; i <= nalgp; i++)
	{
	  p = 0;
	  while (isdigit (*s))
	    {
	      p = p*10 + (*s++ - '0');
	    }
	  s++;			/* Skip space */

	  p >>= 1;
	  byte = p >> 2;
	  crumb = (p & 3) * 2;
	  maskhi = 2 << crumb;
	  masklo = 1 << crumb;
	  if (algprimes[byte] & masklo)		/* Seen at least once */
	    algprimes[byte] |= maskhi;		/* Set as seen at least twice */
	  else
	    algprimes[byte] |= masklo;		/* Set as seen at least once */
	}
      
      tfileoutchars += fprintf (tfile, "%s", linebuf);

      if (tfileoutchars > 1000000000)
	{
	  fclose (tfile);
	  sprintf (tfilename, "__TEMP__%03d", ++tfileno);
	  tfile = fopen (tfilename, "w");
	  if (tfile == (FILE *) NULL)
	    {
	      fprintf (stderr, "Can't open temporary file %s for writing\n");
	      exit (1);
	    }
	  tfileoutchars = 0;
	}
    }

  printf ("Read %d (not necessarily distinct) relations\n", relcount);

  /* Ok, now have a bit map which tells which primes on each side have
     been seen at least twice, and we have one or more files
     containing clean relations (no comments, etc).  Read them in and
     write them out only if all primes in a relation are seen at least
     twice. */
  
  relcount = 0;

  sprintf (ofilename, "OUT__%03d", ofileno);
  ofile = fopen (ofilename, "w");
  if (ofile == (FILE *) NULL)
    {
      fprintf (stderr, "Can't open output file %s\n");
      exit (1);
    }

  for (tf = 0; tf <= tfileno; tf++)
    {
      fclose (tfile);
      sprintf (tfilename, "__TEMP__%03d", tf);
      tfile = fopen (tfilename, "r");
      if (tfile == (FILE *) NULL)
	{
	  fprintf (stderr, "Can't open temporary file %s for reading \n");
	  exit (1);
	}

      while (linesize = getline (linebuf, tfile))
	{
	  nratp = linebuf[2] - '0';
	  nalgp = linebuf[3] - '0';

	  s = linebuf + 5;
	  while (*s++ != ' ');	/* Empty, skip a */
	  while (*s++ != ' ');	/* Empty, skip b */

	  for (i = 1; i <= nratp; i++)
	    {
	      p = 0;
	      while (isdigit (*s))
		{
		  p = p*10 + (*s++ - '0');
		}
	      s++;			/* Skip space */

	      p >>= 1;
	      byte = p >> 2;
	      crumb = (p & 3) * 2;
	      maskhi = 2 << crumb;

	      if (!(ratprimes[byte] & maskhi))	/* Not seen at least twice */
		goto nextrel;
	    }

	  for (i = 1; i <= nalgp; i++)
	    {
	      p = 0;
	      while (isdigit (*s))
		{
		  p = p*10 + (*s++ - '0');
		}
	      s++;			/* Skip space */
	  
	      p >>= 1;
	      byte = p >> 2;
	      crumb = (p & 3) * 2;
	      maskhi = 2 << crumb;

	      if (!(algprimes[byte] & maskhi))	/* Not seen at least twice */
		goto nextrel;
	    }
	  
	  /* Seen them all at least twice, so write it out. */
	  
	  ofileoutchars += fprintf (ofile, "%s", linebuf);
	  relcount++;

	  if (ofileoutchars > 1000000000)
	    {
	      fclose (ofile);
	      sprintf (ofilename, "OUT__%03d", ++ofileno);
	      ofile = fopen (ofilename, "w");
	      if (ofile == (FILE *) NULL)
		{
		  fprintf (stderr, "Can't open output file %s for writing\n");
		  exit (1);
		}
	      ofileoutchars = 0;
	    }
	nextrel: ; /* Empty */
	}
    }	  
  fclose (tfile);
  fclose (ofile);

  printf ("Wrote %d (not necessarily distinct) relations\n", relcount);
}

int getline (char *s, FILE *infile)
{
   int c, i = 0;

   while ((c = getc (infile)) != EOF && c != '\n')
     if (c != '\r') s[i++] = c;			/* Strip out CR */
   if (c == '\n') s[i++] = c;
   s[i] = '\0';
   return (i);
}
Paul

P.S. I forgot to say this is for CWI-format relations. Converting to other formats is also left as an exercise.

Last fiddled with by xilman on 2007-12-30 at 11:22 Reason: Add P.S.
xilman is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Basic Number Theory 21: ideals and homomorphisms Nick Number Theory Discussion Group 5 2017-06-04 12:36
NFS: Sieving the norm over ideals vs. integers paul0 Factoring 2 2015-01-19 11:52
Automatic removal of assignments stuck in stage two TObject PrimeNet 0 2012-09-23 17:59
Theorems about ideals fivemack Abstract Algebra & Algebraic Number Theory 10 2012-01-22 11:01
clean removal of a machine from GIMPS blackguard PrimeNet 4 2005-02-15 16:08

All times are UTC. The time now is 08:29.


Fri Dec 3 08:29:11 UTC 2021 up 133 days, 2:58, 0 users, load averages: 1.14, 1.84, 1.99

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.