#23
Tribal Bullet
Oct 2004
6716₈ Posts
Glad to see it finally worked...
What to use for organizing a ton of data on disk is not a silly question at all. We don't need persistent data guarantees, transactions, logging, or anything like that; just batch lookup, batch storage, batch sorting and merging, compression, and probably an explicitly managed page cache. This is a tiny fraction of what's in BDB, though getting BDB's page cache to do what I want has been most of the current battle :)

I get similar timing results with a similar size dataset. How long did filtering take with the trunk code?
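For concreteness, a rough sketch of the kind of minimal interface this amounts to; none of these names exist in msieve or BDB, they are invented here just to show how small the required feature set is compared to a full database:

Code:
#include <stddef.h>
#include <stdint.h>

/* hypothetical minimal on-disk store: batch lookup, batch storage,
   batch sorting/merging, compression, and an explicitly managed
   page cache -- nothing else */

typedef struct {
    uint8_t *key;        /* packed key bytes */
    size_t key_len;
    uint8_t *value;      /* packed (possibly compressed) payload */
    size_t value_len;
} kv_item_t;

typedef struct store store_t;    /* opaque handle */

/* batch storage: append one sorted run of items to disk */
int store_write_batch(store_t *s, const kv_item_t *items, size_t num_items);

/* batch lookup: resolve many keys in a single pass over the sorted data */
int store_lookup_batch(store_t *s, kv_item_t *queries, size_t num_queries);

/* batch sorting and merging: combine several sorted runs into one */
int store_merge_runs(store_t *s, const char **run_files, size_t num_runs,
                     const char *out_file);

/* explicitly managed page cache: the caller decides how much RAM to pin */
int store_set_cache_size(store_t *s, size_t cache_bytes);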
#24
Sep 2009
977 Posts
However, AFAICT, at the moment, msieve-db is a bit slower than trunk on that computer for a 29-bit large-prime task: on this computer, the entire filtering of such a task (including clique removal, cycle optimization, etc.) usually takes less than 50 minutes.

On Wikipedia, LevelDB is said to be faster than BDB for some workloads. Of the requirements you listed above, it definitely covers sorting and compression, and it carries fewer features you don't need or want than BDB does, but I don't have hands-on experience with it, and I don't know whether it would fit the other requirements.

Last fiddled with by debrouxl on 2013-03-18 at 17:38
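For what it's worth, a rough (untested) sketch of how those two points map onto LevelDB's C API: keys are kept sorted on disk automatically, writes can be batched, and Snappy compression is a per-database option. The key/value layout below is made up purely for illustration:

Code:
#include <leveldb/c.h>
#include <stdio.h>

int main(void) {
    char *err = NULL;

    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_options_set_compression(opts, leveldb_snappy_compression);

    leveldb_t *db = leveldb_open(opts, "reldb", &err);
    if (err != NULL) {
        fprintf(stderr, "open failed: %s\n", err);
        return 1;
    }

    /* batched writes; LevelDB keeps the keys sorted on disk */
    leveldb_writebatch_t *batch = leveldb_writebatch_create();
    leveldb_writebatch_put(batch, "key1", 4, "relation data 1", 15);
    leveldb_writebatch_put(batch, "key2", 4, "relation data 2", 15);

    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_write(db, wopts, batch, &err);

    /* a full sorted scan, which is what merge-style passes need */
    leveldb_readoptions_t *ropts = leveldb_readoptions_create();
    leveldb_iterator_t *it = leveldb_create_iterator(db, ropts);
    for (leveldb_iter_seek_to_first(it); leveldb_iter_valid(it);
         leveldb_iter_next(it)) {
        size_t klen, vlen;
        const char *k = leveldb_iter_key(it, &klen);
        const char *v = leveldb_iter_value(it, &vlen);
        printf("%.*s -> %.*s\n", (int)klen, k, (int)vlen, v);
    }

    leveldb_iter_destroy(it);
    leveldb_readoptions_destroy(ropts);
    leveldb_writebatch_destroy(batch);
    leveldb_writeoptions_destroy(wopts);
    leveldb_close(db);
    leveldb_options_destroy(opts);
    return 0;
}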
#25
Sep 2009
977 Posts
In the commit message of r871, I noticed
Quote:
Make sure to do 'echo 0 > /proc/sys/vm/swappiness'

Code:
commencing relation filtering
setting target matrix density to 85.0
estimated available RAM is 966.0 MB
commencing duplicate removal, pass 1
read 10M relations
read 20M relations
read 30M relations
read 40M relations
read 50M relations
found 55905073 relations
commencing duplicate removal, pass 2
wrote 0 duplicates and 55905073 unique relations
reading ideals above 1000000
RelProcTime: 234
elapsed time 00:03:55

This is consistent with the solid white color of the disk activity LED, and with iotop indicating 80-90 MB/s read speed. The experiment's conditions were the same as before: otherwise idle Core i7-2670QM @ 2.2 GHz, 4 GB RAM.
#26
Tribal Bullet
Oct 2004
110111001110₂ Posts
Well, it's faster because the relations aren't actually getting factored now. I'm experimenting with pushing that off to the singleton removal, when we know where the duplicates are and don't have to parse them. On my test machine, factoring all the relations takes 8 minutes, sorting the <ab_pair,relation_num> pairs in memory takes 2 more minutes and writing out to disk takes 1 more minute.
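For illustration, a bare-bones version of that idea: pack each (a,b) pair into one comparable word, sort the <ab_pair,relation_num> pairs, and anything whose key matches its predecessor is a duplicate. The packing and types here are assumptions, not msieve's actual layout:

Code:
#include <stdint.h>
#include <stdlib.h>

/* illustrative only; the real code sorts on disk, not in one array */
typedef struct {
    uint64_t ab_key;     /* (a,b) packed into a single comparable word */
    uint32_t rel_num;    /* index of the relation in the input file */
} ab_entry_t;

static int compare_ab(const void *x, const void *y) {
    const ab_entry_t *u = x, *v = y;
    if (u->ab_key < v->ab_key) return -1;
    if (u->ab_key > v->ab_key) return 1;
    return 0;
}

/* after sorting, identical (a,b) keys are adjacent; keep the first copy
   and report the relation numbers of the rest as duplicates */
static size_t find_duplicates(ab_entry_t *e, size_t n, uint32_t *dup_out) {
    size_t num_dups = 0;
    qsort(e, n, sizeof(ab_entry_t), compare_ab);
    for (size_t i = 1; i < n; i++) {
        if (e[i].ab_key == e[i - 1].ab_key)
            dup_out[num_dups++] = e[i].rel_num;
    }
    return num_dups;
}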
#27
Sep 2009
977 Posts
Would factoring batches of relations lend itself to parallelization across cores?
#28
Tribal Bullet
Oct 2004
DCE₁₆ Posts
It would probably be better to modify the relation parsing to allow a relation to proceed even if its complete factorization cannot be determined. The theory is that 90+% of relations will be thrown away at some point in the filtering, so it's a waste of time to check absolutely everything up front. The downside is that all the filtering stages would have to be prepared to recover if big batches of relations are suddenly found to be bad (i.e. left over from a previous factorization with a different polynomial, which has happened many times). We don't have to do that now, which means (for example) that the filtering can decide up front how hard it will work to get a matrix, without needing to keep anything in reserve in case of surprises later.
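As a hypothetical sketch of what letting a relation proceed without its complete factorization could look like (the names and status codes are invented here, not msieve's):

Code:
#include <stdint.h>
#include <stdio.h>

typedef enum {
    REL_OK_FULL,        /* (a,b) read and all factors verified */
    REL_OK_DEFERRED,    /* (a,b) read, factorization left to later stages */
    REL_BAD             /* malformed, or from a different polynomial */
} rel_status_t;

typedef struct {
    int64_t a;
    uint32_t b;
    uint32_t num_factors;    /* 0 while the factorization is deferred */
    uint64_t factors[32];
} relation_t;

/* parse only the (a,b) pair up front; later filtering stages must be
   prepared to verify the rest and throw the relation away if it turns
   out to be bad */
static rel_status_t parse_relation_lazy(const char *line, relation_t *r) {
    long long a;
    unsigned int b;

    if (sscanf(line, "%lld,%u", &a, &b) != 2)
        return REL_BAD;
    r->a = a;
    r->b = b;
    r->num_factors = 0;
    return REL_OK_DEFERRED;
}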
#29
Sep 2009
977 Posts
A quibble about a hunk of SVN r872: once in a while, hard-to-track bugs turn up in the code compilers generate when bit fields are involved. You might save yourself some headaches by using a full 64-bit value for rel_index.

Of course, a dataset with more than 2^40 relations does not make any sort of practical sense right now or in the foreseeable future...
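To make the quibble concrete, here is the difference in miniature; the field names and widths are illustrative, not the actual structures in r872:

Code:
#include <stdint.h>

/* packed with a bit field: saves space, but relies on the compiler
   getting bit-field code generation right */
typedef struct {
    uint64_t rel_index  : 40;
    uint64_t num_ideals : 24;
} packed_entry_t;

/* full-width alternative: a few bytes larger per entry, but no
   bit-field surprises */
typedef struct {
    uint64_t rel_index;
    uint32_t num_ideals;
} plain_entry_t;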
#30
Tribal Bullet
Oct 2004
2·3·19·31 Posts
I know. It's all subject to change right now.
More generally, the latest SVN will now perform the first part of the singleton removal, where ideals get assigned unique numbers and free relations get created. It takes a very long time, comparatively speaking, so I'm thinking about how to rearrange things. The end result is a .lp file that's compatible with the rest of the filtering, but it's likely we don't want that as the end state, since the rest of the filtering has irritating limits as well. At least we can now start the filtering on an arbitrary multi-terabyte set of relations without needing multiple terabytes of RAM for a giant hashtable.

I note that Amazon EC2 will rent out an 8-core machine with 117 GB of memory and 42 TB of disk, but it's their most expensive at $4.50 an hour. Even though filtering isn't going to take very long, it's interesting that getting terabytes of relations into Amazon in the first place can be seriously expensive. Data transfer into EC2 is free, but your VM has to be booted and you're being charged while the transfer takes place. Those terabytes of disk go away when you shut down the machine, so you'd need to push your data into Amazon S3, where it will cost $0.10 per GB per month. Amazon now has a service where you can mail them hard disks to import for you, at a cost of around $130 per commodity HD. Perhaps you could mail them one or two drives with (a,b) coordinates and construct the relations manually.

EC2 only lets you have two of these machines at a time without special permission, so the filtering for RSA768 (10 TB of relations) would probably cost several thousand dollars. OTOH, it's extraordinarily cool that we live in an era where this option is even possible.
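For what it's worth, a toy sketch of hashtable-free ideal numbering: collect the ideals that occur in relations, sort them, and let each distinct ideal's position in the sorted, deduplicated list be its unique number. In the real code this would be an external (on-disk) sort over far more data than fits in RAM; the types below are invented for illustration:

Code:
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t prime;    /* prime part of the ideal */
    uint64_t root;     /* root, plus a rational/algebraic side tag */
} ideal_t;

static int compare_ideals(const void *x, const void *y) {
    const ideal_t *u = x, *v = y;
    if (u->prime != v->prime) return (u->prime < v->prime) ? -1 : 1;
    if (u->root  != v->root)  return (u->root  < v->root)  ? -1 : 1;
    return 0;
}

/* returns the number of distinct ideals; afterwards ideals[0..count-1]
   holds one copy of each, and an ideal's array position is its number */
static size_t number_ideals(ideal_t *ideals, size_t n) {
    size_t count = 0;
    qsort(ideals, n, sizeof(ideal_t), compare_ideals);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || compare_ideals(&ideals[i], &ideals[count - 1]) != 0)
            ideals[count++] = ideals[i];
    }
    return count;
}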
#31
Apr 2012
2·47 Posts
There are also other interesting databases, such as MonetDB, RocksDB, CouchBase, Redis, Aerospike, Hyperdex and SciDB.

Some accept SQL-like statements, some use something closer to a mathematical programming language. Have you compared the speed of any of them?

Last fiddled with by skan on 2014-09-22 at 12:14
#32
Nov 2003
2²×5×373 Posts
Explain please. Why do you believe that any of these would help?
#33
Apr 2012
2·47 Posts
All of them are newer and claim to be very fast (and faster than BerkeleyDB). Some may offer easier (or different) ways to interact with the data, more like SQL instead of complicated coding. For example, SciDB is like a large multidimensional matrix. Some scale better for distributed work, maybe Aerospike or Hyperdex.

Though I guess most mersenneforum users don't have a cluster of computers with terabytes of memory; they will more likely work with SATA or SSD disks, or just in memory on a single computer (maybe Redis, CouchBase or MonetDB is the solution then). I would like to know if somebody has opinions on which of them to discard... I'm also interested in finding a good database for scientific computing.