mersenneforum.org > Factoring Projects > Msieve

Old 2013-03-18, 14:14   #23
jasonp (Tribal Bullet)

Glad to see it finally worked...

What to use for organizing a ton of data on disk is not a silly question at all. We don't need persistent data guarantees, transactions, logging, or anything; just batch lookup, batch storage, batch sorting and merging, compression and probably an explicitly managed page cache. This is a tiny fraction of what's in BDB, though getting BDB's page cache to do what I want has been most of the current battle :)
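
Roughly the kind of interface I have in mind; this is purely a sketch, none of these names exist in msieve:
Code:
/* purely illustrative sketch, not msieve code: roughly the batch-oriented
   store described above -- batch put/get, sorted scan/merge, optional
   compression hooks, and an explicitly sized page cache */

#include <stddef.h>
#include <stdint.h>

typedef struct batch_store batch_store_t;      /* opaque handle */

typedef struct {
    size_t page_size;          /* bytes per on-disk page */
    size_t cache_pages;        /* pages we are allowed to keep in RAM */
    int (*compress)(const void *in, size_t in_len,
                    void *out, size_t *out_len);    /* optional */
    int (*decompress)(const void *in, size_t in_len,
                      void *out, size_t *out_len);  /* optional */
} batch_store_opts_t;

batch_store_t *bs_open(const char *path, const batch_store_opts_t *opts);
void           bs_close(batch_store_t *s);

/* store or look up many fixed-size records in one call; keys need not be
   sorted on input, the store sorts runs internally and merges them later */
int bs_put_batch(batch_store_t *s, const uint64_t *keys,
                 const void *vals, size_t val_size, size_t n);
int bs_get_batch(batch_store_t *s, const uint64_t *keys,
                 void *vals, size_t val_size, size_t n);

/* stream every record back in key order (a merge of the sorted runs) */
typedef int (*bs_visit_fn)(uint64_t key, const void *val, void *arg);
int bs_scan_sorted(batch_store_t *s, bs_visit_fn visit, void *arg);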

I get similar timing results with a similar size dataset. How long did filtering take with the trunk code?
Old 2013-03-18, 16:53   #24
debrouxl

Quote:
I get similar timing results with a similar size dataset. How long did filtering take with the trunk code?
In order to give precise timings, I'd have to run filtering of the dataset on the Core i7-2670QM with msieve trunk again.
However, AFAICT, msieve-db is currently a bit slower than trunk on that computer for a 29-bit LP task: there, the entire filtering for such tasks (including clique removal, cycle optimization, etc.) usually takes less than 50 minutes.

On Wikipedia, LevelDB is said to be faster than BDB for some workloads. Among the requirements you listed above, it definitely does sorting and compression, and it carries fewer of the features you don't need or want than BDB does; but I have no hands-on experience with it, and I don't know whether it would fit the other requirements.
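
For what it's worth, the batched-write path through LevelDB's C binding (leveldb/c.h) would look roughly like this; an untested sketch, with the key/value layout invented purely for illustration:
Code:
/* untested sketch of LevelDB's C binding (leveldb/c.h); the key layout
   below is invented purely for illustration */
#include <leveldb/c.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char *err = NULL;

    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_options_set_compression(opts, leveldb_snappy_compression);

    leveldb_t *db = leveldb_open(opts, "relations.ldb", &err);
    if (err != NULL) {
        fprintf(stderr, "open failed: %s\n", err);
        return 1;
    }

    /* batched writes; keys come back in byte-lexicographic order on
       iteration, so a big-endian key encoding gives sorting for free */
    leveldb_writebatch_t *batch = leveldb_writebatch_create();
    const char *val = "packed relation data would go here";
    uint32_t i;
    for (i = 0; i < 1000; i++) {
        unsigned char key[8];
        uint64_t ab = (uint64_t)i * 2654435761u;   /* stand-in for an (a,b) pair */
        int j;
        for (j = 0; j < 8; j++)
            key[j] = (unsigned char)(ab >> (56 - 8 * j));
        leveldb_writebatch_put(batch, (const char *)key, sizeof(key),
                               val, strlen(val));
    }

    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_write(db, wopts, batch, &err);
    if (err != NULL)
        fprintf(stderr, "batch write failed: %s\n", err);

    leveldb_writebatch_destroy(batch);
    leveldb_writeoptions_destroy(wopts);
    leveldb_close(db);
    leveldb_options_destroy(opts);
    return err == NULL ? 0 : 1;
}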

Old 2013-03-24, 19:22   #25
debrouxl
Much faster indeed...

In the commit message of r871, I noticed
Quote:
re-engineer duplicate removal to be a lot faster
Sounds good, so I immediately tested:
Code:
commencing relation filtering
Make sure to do 'echo 0 > /proc/sys/vm/swappiness'
setting target matrix density to 85.0
estimated available RAM is 966.0 MB
commencing duplicate removal, pass 1
read 10M relations
read 20M relations
read 30M relations
read 40M relations
read 50M relations
found 55905073 relations
commencing duplicate removal, pass 2
wrote 0 duplicates and 55905073 unique relations
reading ideals above 1000000
RelProcTime: 234
elapsed time 00:03:55
That's indeed much faster than my previous test of msieve-db, and in fact much faster than msieve trunk as well. In my experience, msieve has never before sifted through ~56M raw relations in three minutes or so.
This is consistent with the disk activity LED staying solid and iotop reporting an 80-90 MB/s read rate.

The experiment's conditions were the same as before: otherwise idle Core i7-2670QM @ 2.2 GHz, 4 GB RAM.
Old 2013-03-24, 23:49   #26
jasonp (Tribal Bullet)

Well, it's faster because the relations aren't actually getting factored now. I'm experimenting with pushing that work off to the singleton removal, when we know where the duplicates are and don't have to parse them. On my test machine, factoring all the relations takes 8 minutes, sorting the <ab_pair, relation_num> pairs in memory takes 2 more minutes, and writing out to disk takes 1 more minute.
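
The sort-and-scan part is conceptually just this (a toy stand-alone sketch, not the actual msieve code):
Code:
/* toy stand-alone sketch, not the msieve code: find duplicate relations
   by sorting (ab_pair, relation_number) pairs and scanning for repeats */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t ab_pair;    /* packed (a,b), identifies the relation */
    uint64_t rel_num;    /* where the relation sits in the data file */
} ab_entry_t;

static int compare_ab(const void *x, const void *y) {
    const ab_entry_t *u = x, *v = y;
    if (u->ab_pair < v->ab_pair) return -1;
    if (u->ab_pair > v->ab_pair) return 1;
    return 0;
}

/* sorts the list, marks every occurrence of an ab_pair after the first
   as a duplicate, and returns the number of duplicates found */
static size_t mark_duplicates(ab_entry_t *list, size_t n,
                              unsigned char *is_dup) {
    size_t i, dups = 0;
    qsort(list, n, sizeof(ab_entry_t), compare_ab);
    for (i = 1; i < n; i++) {
        if (list[i].ab_pair == list[i - 1].ab_pair) {
            is_dup[list[i].rel_num] = 1;
            dups++;
        }
    }
    return dups;
}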
Old 2013-03-25, 07:23   #27
debrouxl

Would factoring batches of relations lend itself to parallelization across cores?
Old 2013-03-25, 13:51   #28
jasonp (Tribal Bullet)

It would probably be better to modify the relation parsing to allow a relation to proceed even if its complete factorization cannot be determined. The theory is that 90+% of relations will be thrown away at some point in the filtering anyway, so it's a waste of time to check absolutely everything up front. The downside is that all the filtering stages would have to be prepared to recover if big batches of relations are suddenly found to be bad (i.e. left over from a previous factorization with a different polynomial, which has happened many times). We don't have to do that now, which means (for example) the filtering can decide up front how hard it will work to get a matrix, without needing to leave anything in reserve in case of surprises later.
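
Conceptually the parser change amounts to something like this; a rough sketch with invented structures, not the real msieve parser, assuming the usual GGNFS-style "a,b:primes:primes" line format:
Code:
/* rough sketch, not the real msieve parser: read a GGNFS-style line
   "a,b:rational primes:algebraic primes" (primes in hex) but skip the
   expensive check that the primes really divide the norms -- the
   relation is only tagged so a later pass can verify or discard it */
#include <stdint.h>
#include <stdlib.h>

enum { REL_OK = 0, REL_BAD = 1 };

typedef struct {
    int64_t  a;
    uint64_t b;
    uint32_t num_factors;
    uint64_t factors[64];
    int      verified;      /* 0 until a later pass checks the factors */
} relation_t;

static int parse_relation_lenient(const char *line, relation_t *rel) {
    char *p;

    rel->a = strtoll(line, &p, 10);
    if (*p != ',')
        return REL_BAD;
    rel->b = strtoull(p + 1, &p, 10);
    if (*p != ':')
        return REL_BAD;

    rel->num_factors = 0;
    while (*p == ':' || *p == ',') {
        uint64_t f = strtoull(p + 1, &p, 16);
        if (f != 0 && rel->num_factors < 64)
            rel->factors[rel->num_factors++] = f;
    }

    rel->verified = 0;      /* the divisibility check is deliberately deferred */
    return REL_OK;
}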
Old 2013-03-31, 08:52   #29
debrouxl

A quibble about a hunk of SVN r872: once in a while, hard-to-track bugs turn up in the code compilers generate when bit fields are involved. You might save yourself some headaches by using a full 64-bit value for rel_index.
Of course, a dataset with more than 2^40 relations makes no practical sense now or in the foreseeable future...
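
To illustrate what I mean (this is not the actual r872 declaration, just an example):
Code:
/* illustration only, not the actual r872 declaration */
#include <stdint.h>

struct ideal_entry_packed {
    uint64_t rel_index : 40;    /* enough for 2^40 relations, but the
                                   compiler has to generate bit-field code */
    uint64_t extra     : 24;    /* whatever else gets packed alongside */
};

struct ideal_entry_plain {
    uint64_t rel_index;         /* a few bytes larger per entry, but no
                                   bit-field code generation to worry about */
    uint32_t extra;
};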
Old 2013-03-31, 12:55   #30
jasonp (Tribal Bullet)

I know. It's all subject to change right now.

More generally, the latest SVN will now perform the first part of the singleton removal, where ideals get assigned unique numbers and free relations get created. It takes a very long time, comparatively speaking, so I'm thinking about how to rearrange things. The end result is a .lp file that's compatible with the rest of the filtering, but we probably don't want that as the end state, since the rest of the filtering has irritating limits as well. At least we can now start the filtering on an arbitrary multi-terabyte set of relations without needing multiple terabytes of RAM for a giant hashtable.
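
In memory, that first step is conceptually just the following (a toy sketch; the real thing has to do this out of core, of course):
Code:
/* toy in-memory sketch, not the msieve code: give every distinct ideal
   a sequential number by sorting and de-duplicating the list of ideals
   that appear in the surviving relations */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t prime;
    uint64_t root;     /* root of the polynomial mod prime */
} ideal_t;

static int compare_ideal(const void *x, const void *y) {
    const ideal_t *u = x, *v = y;
    if (u->prime != v->prime) return (u->prime < v->prime) ? -1 : 1;
    if (u->root  != v->root ) return (u->root  < v->root ) ? -1 : 1;
    return 0;
}

/* sorts the list in place, squeezes out duplicates and returns the count
   of distinct ideals; ideal i then simply gets number i */
static size_t number_ideals(ideal_t *list, size_t n) {
    size_t i, out;
    if (n == 0)
        return 0;
    qsort(list, n, sizeof(ideal_t), compare_ideal);
    for (i = 1, out = 1; i < n; i++) {
        if (compare_ideal(&list[i], &list[out - 1]) != 0)
            list[out++] = list[i];
    }
    return out;
}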

I note that Amazon EC2 will rent out an 8-core machine with 117GB of memory and 42 TB of disk, but it's their most expensive instance, at $4.50 an hour. Even though the filtering itself wouldn't take very long, getting terabytes of relations into Amazon in the first place can be seriously expensive. Data transfer into EC2 is free, but your VM has to be booted and you're being charged while the transfer takes place. Those terabytes of disk go away when you shut down the machine, so you'd need to push your data into Amazon S3, where it will cost $.10 per GB per month (at that rate, 10 TB of relations is roughly $1000 a month just sitting there). Amazon now has a service where you can mail them hard disks to import for you, at a cost of around $130 per commodity HD. Perhaps you could mail them one or two drives with (a,b) coordinates and reconstruct the relations there. EC2 only lets you have two of these machines at a time without special permission, so the filtering for RSA768 (10TB of relations) would probably cost several thousand dollars.

OTOH, it's extraordinarily cool that we live in an era where this option is even possible.
Old 2014-09-22, 12:13   #31
skan

There are also other interesting databases, such as MonetDB, RocksDB, CouchBase, Redis, Aerospike, Hyperdex and SciDB.
Some accept SQL-like statements; some use something closer to a mathematical programming language.
Have you compared the speed of any of them?

Old 2014-09-22, 12:33   #32
R.D. Silverman

Quote:
Originally Posted by skan
There are also other interesting databases, such as MonetDB, RocksDB, CouchBase, Redis, Aerospike, Hyperdex and SciDB.
Some accept SQL-like statements; some use something closer to a mathematical programming language.
Have you compared the speed of any of them?
Explain please. Why do you believe that any of these would help?
Old 2014-09-22, 13:20   #33
skan

Quote:
Originally Posted by R.D. Silverman
Explain please. Why do you believe that any of these would help?
I'm not an expert, just wondering.
All of them are newer and claim to be very fast (and faster than BerkeleyDB).
Some may offer easier (or different) ways to interact with them, more like SQL than complicated custom code.
For example, SciDB presents the data as a large multidimensional matrix.
Some scale better for distributed work, maybe Aerospike or Hyperdex.

Though I guess most mersenneforum users don't have a cluster of computers with terabytes of memory; they will more likely work with SATA or SSD disks, or entirely in memory on a single computer (maybe Redis, CouchBase or MonetDB is the solution then).

I would like to know if somebody has opinions on which of them to rule out...
I'm also interested in finding a good database for scientific computing.