mersenneforum.org NFS filtering in the era of big data

 2013-03-18, 14:14 #23 jasonp Tribal Bullet     Oct 2004 67168 Posts Glad to see it finally worked... What to use for organizing a ton of data on disk is not a silly question at all. We don't need persistent data guarantees, transactions, logging, or anything; just batch lookup, batch storage, batch sorting and merging, compression and probably an explicitly managed page cache. This is a tiny fraction of what's in BDB, though getting BDB's page cache to do what I want has been most of the current battle :) I get similar timing results with a similar size dataset. How long did filtering take with the trunk code?
2013-03-18, 16:53   #24
debrouxl

Sep 2009

977 Posts

Quote:
 I get similar timing results with a similar size dataset. How long did filtering take with the trunk code?
In order to give precise timings, I'd have to run filtering of the dataset with msieve trunk on the Core i7-2670QM.
However, AFAICT, at the moment, msieve-db is a bit slower than trunk on that computer for a 29-bit LP task: there, the entire filtering of such a task (including clique removal, cycle optimization, etc.) usually takes less than 50 minutes.

On Wikipedia, LevelDB is said to be faster than BDB for some workloads. Among the requirements you listed above, it definitely does sorting and compression, and there are fewer features you don't need / want than in BDB, but I don't have hands-on experience with it, and I don't know whether it would fit the other requirements.

Last fiddled with by debrouxl on 2013-03-18 at 17:38

2013-03-24, 19:22   #25
debrouxl

Sep 2009

977 Posts
Much faster indeed...

In the commit message of r871, I noticed
Quote:
 re-engineer duplicate removal to be a lot faster
Sounds good, so I immediately tested:
Code:
commencing relation filtering
Make sure to do 'echo 0 > /proc/sys/vm/swappiness'
setting target matrix density to 85.0
estimated available RAM is 966.0 MB
commencing duplicate removal, pass 1
found 55905073 relations
commencing duplicate removal, pass 2
wrote 0 duplicates and 55905073 unique relations
RelProcTime: 234
elapsed time 00:03:55
That's indeed much faster than my previous test of msieve-db - and in fact, much faster than msieve trunk as well. In my experience, msieve has never sifted through ~56M raw relations in about three minutes before.
This is consistent with the solid white color of the disk activity LED, and iotop indicating 80-90 MB/s read speed.

The experiment's conditions were the same as before: otherwise idle Core i7-2670QM @ 2.2 GHz, 4 GB RAM.

 2013-03-24, 23:49 #26 jasonp Tribal Bullet     Oct 2004 1101110011102 Posts Well, it's faster because the relations aren't actually getting factored now. I'm experimenting with pushing that off to the singleton removal, when we know where the duplicates are and don't have to parse them. On my test machine, factoring all the relations takes 8 minutes, sorting the pairs in memory takes 2 more minutes and writing out to disk takes 1 more minute.
 2013-03-25, 07:23 #27 debrouxl     Sep 2009 977 Posts Would factoring batches of relations lend itself to parallelization across cores?
 2013-03-25, 13:51 #28 jasonp Tribal Bullet     Oct 2004 DCE16 Posts It would probably be better to modify the relation parsing to allow a relation to proceed even if its complete factorization cannot be determined. The theory is that 90+% of relations will be thrown away at some point in the filtering, so it's a waste of time to check absolutely everything up front. The downside is that all the filtering stages would have to be prepared to recover if big batches of relations are suddenly found to be bad (i.e. left over from a previous factorization with a different polynomial, which has happened many times). We don't have to do that now, so that (for example) the filtering can decide up front how hard it will work to get a matrix, without needing to leave anything in reserve in case of surprises later.
 2013-03-31, 08:52 #29 debrouxl     Sep 2009 977 Posts A quibble about a hunk of SVN r872: once in a while, hard-to-track bugs are found in the code generated by compilers when bit fields are involved. You might save yourself some headaches by using a full 64-bit value for rel_index. Of course, a dataset with more than 2^40 relations does not make any sort of practical sense now or in the foreseeable future...
 2013-03-31, 12:55 #30 jasonp Tribal Bullet     Oct 2004 2·3·19·31 Posts I know. It's all subject to change right now. More generally, the latest SVN now performs the first part of the singleton removal, where ideals get assigned unique numbers and free relations get created. It takes a very long time, comparatively speaking, so I'm thinking about how to rearrange things. The end result is a .lp file that's compatible with the rest of the filtering, but it's likely we don't want that as the end state, since the rest of the filtering has irritating limits as well. At least we can now start the filtering on an arbitrary multi-terabyte set of relations, without multiple terabytes of RAM for a giant hashtable. I note that Amazon EC2 will rent out an 8-core machine with 117 GB of memory and 42 TB of disk, but it's their most expensive at $4.50 an hour. Even though filtering isn't going to take very long, it's interesting that getting terabytes of relations into Amazon in the first place can be seriously expensive. Data transfer into EC2 is free, but your VM has to be booted and you're being charged while the transfer takes place. Those terabytes of disk go away when you shut down the machine, so you'd need to push your data into Amazon S3, where it will cost $0.10 per GB per month. Amazon now has a service where you can mail them hard disks to import for you, for a cost of around $130 per commodity HD. Perhaps you can mail them one or two drives with (a,b) coordinates and construct the relations manually. EC2 only lets you have two of these machines at a time without special permission, so the filtering for RSA768 (10 TB of relations) would probably cost several thousand dollars. OTOH, it's extraordinarily cool that we live in an era where this option is even possible.
 2014-09-22, 12:13 #31 skan     Apr 2012 2·47 Posts There are also other interesting databases, such as MonetDB, RocksDB, CouchBase, Redis, Aerospike, Hyperdex and SciDB. Some accept SQL-like statements; some use something like a mathematical programming language. Have you compared the speed of any of them? Last fiddled with by skan on 2014-09-22 at 12:14
2014-09-22, 12:33   #32
R.D. Silverman

Nov 2003

22×5×373 Posts

Quote:
 Originally Posted by skan There are also other interesting databases such as: MonetDB, RocksDB, CouchBase, Redis, Aerospike, Hyperdex and SciDB. Some accept kind of SQL statements, some use something like a mathematical programming language. Have you compared the speed of any of them?
Explain please. Why do you believe that any of these would help?

2014-09-22, 13:20   #33
skan

Apr 2012

2·47 Posts

Quote:
 Originally Posted by R.D. Silverman Explain please. Why do you believe that any of these would help?
I'm not an expert, just wondering.
All of them are newer and claim to be very fast (and faster than BerkeleyDB).
Some may offer easier (or different) ways to interact with them, more like SQL than complicated coding.
For example, SciDB is structured like a large multidimensional matrix.
Some scale better for distributed work, maybe Aerospike or Hyperdex.

Though I guess most mersenneforum users don't have a cluster of computers with terabytes of memory; they will more likely work with SATA or SSD disks, or just in memory, on a single computer (maybe Redis, CouchBase or MonetDB is the solution then).

I would like to know if somebody has opinions on them, even if only to rule some out...
I'm also interested in finding a good database for scientific computing.
