20110711, 07:05  #1 
May 2011
23 Posts 
Filtering
Hi,
I have a question to ask. Based on the filtering step, is the 1st maximum bound 2^32 since the bucket is [0,2^32] and uint32 is used for filtmin_a and filtmin_r? If so, isn't the bound limit a little low if larger numbers are used? Or did i misunderstood the code. Thanks! 
20110711, 10:27  #2 
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts 
The filtmin parameter serves to divide ideals into 'small' and 'large'; these roughly correspond to the small and largeprime bounds in sieving, though it's not an exact match. Even for RSA768 the sieving was done with 'small'=2^30. So this particular limit is not a serious restriction at present.

20110711, 10:50  #3 
Tribal Bullet
Oct 2004
5×709 Posts 
This is one of several limitations in the filtering that prevents it being realistically used for extremely large problems. A more pressing limitation is that relation numbers max out at 2^32 so that a factorization must have less than 4 billion relations. We're already 25% of the way to that limit for the largest Msieve jobs, but getting further will require an update to the sieving tools that people use.

20110711, 12:23  #4 
May 2011
10111_{2} Posts 
I see.. Other than the filtering bound and number of relations limitations, are there other limitations that prevents the filtering step from being realistically used for extremely large problems?

20110711, 17:17  #5 
"Bob Silverman"
Nov 2003
North of Boston
2^{3}×3×311 Posts 

20110711, 19:57  #7 
Tribal Bullet
Oct 2004
5×709 Posts 
Besides the limit on the number of relations, there are only a few things stopping, say, filtering on RSA768size numbers, i.e. many billions of relations:
 clique removal needs to be possible from disk files  duplicate and singleton removal needs to be MPIenabled, otherwise the size of internal hashtables will exceed the memory of the machine  inmemory clique removal has an internal 16GB limit on the size of an intermediate structure With all of these done and a largememory machine for the merge phase, we'd be good to go. There's a parallel problem that the sieving tools would need to be comfortable with large primes > 2^33 
20110712, 14:34  #8  
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
11·1,039 Posts 
Quote:
Back in the days of NFSNET we (I and Don Leclair, IIRC) wrote duplicate and singleton removal software for a single threaded single process environment. It wasn't perhaps as fast as it would have been had more memory been available but it worked well. Perhaps I should dig up the remdup.c software and then take this discussion offline. Paul Last fiddled with by xilman on 20110712 at 14:34 Reason: Fix /tag] 

20110712, 17:51  #9 
Tribal Bullet
Oct 2004
5·709 Posts 
MPI isn't strictly needed, but when there's a terabyte of relations then you have to account somehow for intermediate lists of things (primes, ideals, relation numbers or coordinates, etc) exceeding the memory of the filtering machine. That becomes much more manageable given a cluster of average size machines all pulling information out of the same dataset and boiling it down by themselves, with a global pass at the end to reconcile all the results.
For singleton removal, you at least have to have a hashtable that maps primes or ideals to a unique number, so that relations can be assigned an ideal list. Right now the size of that hashtable is the limiting factor in the memory use of the filtering. (Yes I know there are loads of spaceefficient hashtable implementations out there) 
20110713, 06:00  #10 
May 2011
23 Posts 
I see.. okie! So upgrading the filtering steps should not be that complex but rather because of the current sieving limitations, there is not much incentives to cater for large numbers (>768) factorization yet right?
Sorry one more question, i am stuck with understanding how the first bound is chosen. What is the role of bin_max, bin_min and hits_per_prime in the duplicate.c source code? Anyway, thanks guys for all the help you guys provided! Really appreciate it! =) 
20110713, 11:21  #11  
Tribal Bullet
Oct 2004
5·709 Posts 
Quote:
We have a histogram of the number of times a prime occurs, which was computed essentially for free when the first duplicate removal pass ran. Every prime < 2^32 encountered in every relation (including the duplicates :) increments the count in one 128ksize bucket. Then the average number of times a prime appears in bucket i is approximately (count of hits in bucket i) / (number of primes in bucket i) and the desired filtering bound is the middle of the first bucket (working from 2^32 backwards) where this number exceeds 40.0. We don't need to count the exact number of primes in bucket i, instead we compute (number of primes less than (start of bucket i+1))  (number of primes less than (start of bucket i)) using the approximation that the number of primes less than x is about x / log(x). These are what bin_min and bin_max are used for, and hits_per_prime is the computed average. If you haven't seen it already, read this for more detail on Msieve's NFS filtering. Most of it is still accurate today. Last fiddled with by jasonp on 20110713 at 11:29 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
NFS filtering in the era of big data  jasonp  Msieve  36  20180507 19:55 
NFS filtering error...  Stargate38  YAFU  4  20160420 16:53 
The big filtering bug strikes again (I think)  Dubslow  Msieve  20  20160205 14:00 
Filtering Error  richs  Msieve  8  20150118 17:40 
Filtering  R.D. Silverman  Cunningham Tables  14  20100805 08:30 