mersenneforum.org  

2013-03-11, 17:47   #1
jasonp
Tribal Bullet

NFS filtering in the era of big data

I haven't been very responsive to Msieve problems lately, but that's because I'm finally getting the time to explore something that's been years overdue.

Those of you who handle very large NFS factorizations know that the filtering code in Msieve has been essentially unchanged since about 2008, and is rapidly nearing the limit of what it can handle. The machine Greg used for filtering M1061 had 64GB of memory, and he needed about half of it to produce a matrix, plus a lot of oversieving, plus preprocessing to remove duplicates. There are well-known upper limits on the size of job that the current filtering code can handle, and (for example) Paul Zimmermann's dataset for RSA704 exceeded them. The record-size jobs of the near future are going to have 100x more data than that to deal with.

Rather than fix the problems with the current code one by one, I've started an experiment in branches/msieve-db of the repository that uses Berkeley DB to deal with relations, instead of the myriad flat-file formats that we currently use. Ideally we will eventually be able to specify a maximum amount of memory that filtering will be allowed to take, and any requirements for memory above that will be converted to manipulating disk files. Thus a given machine will be able to handle problems much larger than the available memory.
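To make the memory-ceiling idea concrete, here is a minimal sketch (not the actual branch code; the function name is made up) of handing Berkeley DB a fixed cache, with everything larger spilling to its disk files:
Code:
#include <db.h>

/* minimal sketch, not the branch code: cap filtering memory by giving the
   BDB environment a fixed cache; data beyond the cache spills to disk */
static DB_ENV *open_filter_env(const char *dir, u_int32_t cache_mb)
{
    DB_ENV *env;

    if (db_env_create(&env, 0) != 0)
        return NULL;

    /* must be called before the environment is opened */
    env->set_cachesize(env, cache_mb / 1024,
                       (cache_mb % 1024) * 1024 * 1024, 1);

    if (env->open(env, dir, DB_CREATE | DB_INIT_MPOOL, 0) != 0) {
        env->close(env, 0);
        return NULL;
    }
    return env;
}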

So far I've only had time to overhaul the duplicate removal, and the organization of the data for the next steps is not clear to me, but there's a lot of promise here. Whatever the end result looks like, it will be very different from the current code. For example, using databases opens up the possibility of storing relations in sorted order, so intermediate files can be compressed. The duplicate removal reads in the relations and performs a 2-pass merge sort to turn duplicate relations into neighbors, with a fixed amount of disk IO. There's no hashtable in this scheme, so for arbitrary size problems the memory use of duplicate removal can be basically whatever memory cache you tell Berkeley DB to use. So the results are free of duplicates, converted to binary, and stored in compressed form in sorted order, all in 2 passes. Plus BDB allows random access to the entire dataset, which opens up new possibilities for clique removal and merging.
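To illustrate the "duplicates become neighbors" point, here is a sketch (not the branch code; the (a,b) key layout is just for illustration): once relations are stored sorted by key, a single sequential pass can drop repeats by comparing each key against the previous one.
Code:
#include <db.h>
#include <stdint.h>
#include <string.h>

/* sketch only: walk a sorted relation database and skip adjacent
   duplicates by comparing each (a,b) key with the previous one */
static void scan_unique(DB *db)
{
    DBC *cursor;
    DBT key, data;
    int64_t prev_a = 0;
    uint32_t prev_b = 0;
    int have_prev = 0;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));

    if (db->cursor(db, NULL, &cursor, 0) != 0)
        return;

    while (cursor->get(cursor, &key, &data, DB_NEXT) == 0) {
        int64_t a;
        uint32_t b;

        memcpy(&a, key.data, sizeof(a));
        memcpy(&b, (char *)key.data + sizeof(a), sizeof(b));

        if (have_prev && a == prev_a && b == prev_b)
            continue;            /* duplicate relation; skip */

        /* process the unique relation (key, data) here */
        prev_a = a;
        prev_b = b;
        have_prev = 1;
    }
    cursor->close(cursor);
}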

Progress will be slow, but using a complex data structure like a database will free us to think about how to manipulate data instead of how to work around needing sequential passes through ad hoc intermediate files.

2013-03-11, 18:14   #2
xilman
Bamboozled!

Quote:
Originally Posted by jasonp
[quote of post #1 snipped]
An excellent idea!

With a suitably abstract interface it should also be possible to swap databases according to what is most suitable in a particular environment for a particular size of project, whether the relations number in the hundreds of millions or the hundreds of billions.
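Something along these lines, say (a purely hypothetical sketch, with made-up names):
Code:
#include <stddef.h>

/* hypothetical abstraction: the filtering code talks to a small table of
   operations, and a backend (Berkeley DB, flat files, some other
   key-value store) fills it in */
typedef struct relation_store {
    void *ctx;                  /* backend-specific state */

    int (*put)(void *ctx, const void *key, size_t key_len,
               const void *val, size_t val_len);
    int (*next)(void *ctx, void *key, size_t *key_len,
                void *val, size_t *val_len);   /* sorted traversal */
    int (*close)(void *ctx);
} relation_store_t;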

Paul

2013-03-11, 21:07   #3
Batalov

...and then, on to the sqrt for the 21st century, too!

Some sqrts in the largest projects now take a day apiece, and that's not to mention that the "no irreducible prime found, switching to small primes" part may need some more work.

2013-03-11, 22:22   #4
jasonp
Tribal Bullet

For small problems I don't think this work is a viable substitute for the existing code; the overhead will be a lot higher. Not to mention that BDB has an aggressively open-source license, akin to the GPL, so the branch will probably have to stay permanently separate from the trunk.

I haven't thought at all about making the square roots better. If it takes a cluster to fit the LA then it's silly to need the same cluster for the square roots.

My current test is a dataset with about 6M relations, and it was a bit of a shock how much work BDB needed to get high performance. Managing the batching of reads and writes manually ran over 3x faster than using cursors, and for this example the duplicate removal still takes longer than the entire filtering phase using msieve trunk.
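For reference, BDB's own bulk-transfer interface returns a whole buffer of key/data pairs per call; it looks roughly like the following sketch (not the branch code, and the buffer size is arbitrary):
Code:
#include <db.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* sketch of a bulk read: each cursor call fills a large user buffer with
   many key/data pairs, which are walked with the DB_MULTIPLE macros */
static int bulk_scan(DB *db)
{
    DBC *cursor;
    DBT key, data;
    void *p;
    int ret;

    if ((ret = db->cursor(db, NULL, &cursor, 0)) != 0)
        return ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    data.ulen = 4 * 1024 * 1024;       /* 4MB batch buffer (arbitrary) */
    data.data = malloc(data.ulen);
    data.flags = DB_DBT_USERMEM;
    if (data.data == NULL) {
        cursor->close(cursor);
        return ENOMEM;
    }

    while ((ret = cursor->get(cursor, &key, &data,
                    DB_MULTIPLE_KEY | DB_NEXT)) == 0) {
        void *k, *d;
        size_t klen, dlen;

        for (DB_MULTIPLE_INIT(p, &data); ; ) {
            DB_MULTIPLE_KEY_NEXT(p, &data, k, klen, d, dlen);
            if (p == NULL)
                break;
            /* process one relation (k, klen, d, dlen) here */
        }
    }

    free(data.data);
    cursor->close(cursor);
    return (ret == DB_NOTFOUND) ? 0 : ret;
}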

2013-03-11, 22:49   #5
frmky

That's somewhat concerning. How long would reading a billion relations take?

Regarding square roots, is there any limitation we are approaching with the current method? If not, I'm not at all concerned about a day/sqrt/core. Sure, it could be done much more quickly with a much more complex algorithm, but a day after waiting weeks for LA doesn't bother me.

Speaking of which, clusters are now moving to Tesla GPUs to get most of their computing power. Cases in point are the new Titan cluster at Oak Ridge, parts of Keeneland and Lonestar, and the upcoming Stampede (though most of Stampede uses Xeon Phi coprocessors). Would you have the interest and cycles available to take a stab at LA on GPUs? The existing MPI code can still be used to break the computation across multiple GPUs in multiple nodes. CUDA 5 on the K20 now supports direct GPU-to-GPU transfers over MPI without first copying to host memory, so hopefully we now have the transfer bandwidth to make it worthwhile. If you are interested in working on this together, I will write a proposal to obtain the needed K20s so we have full-time exclusive access for development, and I can request time on Stampede Kepler nodes in my next XSEDE proposal for larger-scale testing.
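(By direct transfers I mean something like this sketch, which assumes a CUDA-aware MPI build such as MVAPICH2 or OpenMPI with CUDA support; the buffer was allocated with cudaMalloc and is handed straight to the MPI call:)
Code:
#include <mpi.h>
#include <stdint.h>

/* hypothetical sketch: with a CUDA-aware MPI library, a device pointer
   can be passed directly to MPI with no staging copy through host memory */
static void exchange_block(uint64_t *d_block, int count,
                           int peer, MPI_Comm comm)
{
    MPI_Sendrecv_replace(d_block, count, MPI_UINT64_T,
                         peer, 0, peer, 0, comm, MPI_STATUS_IGNORE);
}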

2013-03-12, 01:23   #6
jasonp
Tribal Bullet

It's not an apples-to-apples comparison. Duplicate removal in the trunk writes tiny files, uses a hashtable that must fit in memory and grows linearly in the number of duplicates, and leaves the original data alone so that it has to get parsed again for the singleton removal. Msieve-db rearranges all the data for faster access later, and incidentally performs duplicate removal. I'm purposely making BDB use less memory than it is comfortable with, to see how it behaves under low-memory circumstances. I suspect the performance would be dramatically better with automatic tuning of BDB's memory cache, so that small problems never touch disk; I'm working on adding that now.
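The tuning I have in mind is along these lines (a sketch only; the policy and the numbers are made up): size the cache from the relation file, so small datasets stay entirely in memory.
Code:
#include <db.h>
#include <stdint.h>
#include <sys/stat.h>

/* sketch only (POSIX stat; the policy is made up): pick the BDB cache
   size from the size of the relation file, within fixed bounds; must run
   before the environment is opened */
static void tune_cache(DB_ENV *env, const char *dat_file)
{
    struct stat sb;
    uint64_t cache = 64 * 1024 * 1024;            /* fallback: 64MB */

    if (stat(dat_file, &sb) == 0)
        cache = (uint64_t)sb.st_size / 4;         /* rough guess */

    if (cache < (uint64_t)16 * 1024 * 1024)
        cache = (uint64_t)16 * 1024 * 1024;
    if (cache > (uint64_t)2048 * 1024 * 1024)
        cache = (uint64_t)2048 * 1024 * 1024;

    env->set_cachesize(env, (u_int32_t)(cache >> 30),
                       (u_int32_t)(cache & 0x3fffffff), 1);
}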

We can discuss GPU LA over email.

2013-03-12, 12:13   #7
henryzz
Just call me Henry

In this sort of model, is there any reason why you would have to restart from scratch every time you do filtering?

2013-03-12, 14:02   #8
jasonp
Tribal Bullet

No; this is the model GGNFS used to use, so that relations would accumulate over several passes. In the current code you would be able to merge the relations from the second sort pass efficiently into an existing database, at the cost of somewhat more IO than building the DB from scratch. Plus the resulting DB would be larger than if it were all built at once.
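Conceptually (a sketch, not the GGNFS or msieve-db code), merging one relation from a later batch into an existing database is just:
Code:
#include <db.h>

/* sketch: relations whose key is already present from an earlier pass are
   rejected by DB_NOOVERWRITE, so cross-pass duplicates disappear as a
   side effect of the merge */
static int merge_relation(DB *db, DBT *key, DBT *data)
{
    int ret = db->put(db, NULL, key, data, DB_NOOVERWRITE);

    if (ret == DB_KEYEXIST)
        return 0;        /* already have this relation; not an error */
    return ret;
}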

2013-03-12, 19:54   #9
debrouxl

Amusingly, under Debian testing amd64 (providing libdb 5.1), msieve-db SVN r857 -nc1 bails out with
Code:
DB_ENV->set_flags: direct I/O either not configured or not supported
DB open failed: No such file or directory
Besides Windows (referenced in the commit logs by "read cursors have miserable performance in windows"), what environments are you testing on at the moment?

2013-03-12, 23:34   #10
jasonp
Tribal Bullet

I'm testing on WinXP and MacOS X right now. The first line of output is not fatal; the second means the code tried to create a BDB environment in the directory

<dat_file_name>.filter

which must exist. I hadn't gotten around to adding the code to create the dir if it doesn't exist, sorry. Also, the counting of duplicates is completely wrong, will fix tonight.
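Something like this at startup is what's missing (a POSIX sketch; the Windows build would use _mkdir, and the function name is made up):
Code:
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* sketch of the missing step: create <dat_file_name>.filter before the
   BDB environment is opened */
static int make_filter_dir(const char *dat_file)
{
    char dir[1024];

    snprintf(dir, sizeof(dir), "%s.filter", dat_file);
    if (mkdir(dir, 0777) != 0 && errno != EEXIST)
        return -1;
    return 0;
}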

2013-03-13, 07:11   #11
debrouxl

With SVN r859, the "DB open failed: No such file or directory" error has been replaced by
Code:
__bamc_compress_iput: Unknown flag: 0
DB bulk write failed: Invalid argument
which occurs about 30s into the filtering.

Isn't 0777 a slightly overbroad mode for the directory? 0755 should be enough in most cases.