On relations file format
[QUOTE=fivemack;308396]Yes, provided you can read base sixteen (so in some senses no); if you look at the last few lines of the file, it will read something like
[code]
22617993,32185531:4f7c7bad,25f9,3ea88d,288611,a5a09,f8b7,8425,6b252c3:7a526e39,57d8af69,67563a1,1b0f047,3ce17,55d9
70093203,74087957:7e890e81,ff64977d,9ef,455,f0a43,a58b,6997,5e75,6b252c3:574e3e89,679,1de69b7,1163ae9,366137,1e707,5071
119968003,87443153:ee91c059,886dd67,c83,4cf4afd,3addd07,668cdb,6b252c3:48ebe3c1,9fb4baa9,13f1d0ef,1ec1,ef9,2669bf,109dd1
[/code]
Notice that each of these has the same number 6b252c3 before the colon; that's the last special-q value, and hex 6b252c3 is 112349891 in decimal. So the 112300000-112400000 chunk is just under half done. Some versions of the siever write incrementally to the output file 1234.t, in which case you look at the last line of that, but I think Serge's current compilation doesn't. A 1e6 chunk is a lot of work - probably about two weeks on your fast machines, possibly three.[/QUOTE] Wouldn't the file be quite a bit smaller if the rels were recorded in (say) base 36? Or would that make compression harder (and thus the file larger)?
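For anyone who doesn't want to do fivemack's base-16 arithmetic by hand, the progress check above can be reproduced in a few lines of Python (the chunk bounds are the ones from the quoted post):

```python
# Last special-q seen in the relations file, as quoted above (hex).
last_q = int("6b252c3", 16)
print(last_q)  # 112349891

# The chunk being sieved runs from 112300000 to 112400000.
chunk_lo, chunk_hi = 112_300_000, 112_400_000
done = (last_q - chunk_lo) / (chunk_hi - chunk_lo)
print(f"chunk is {done:.1%} done")  # just under half
```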
Bill, it doesn't matter much for a problem of this size. Every relation takes, say, 0.7 s to generate; writing it and downloading a chunk take negligible time. GMP can read and write numbers in any base. (For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)
Tom, have you thought about using the 16e siever and a much smaller sieving area? At q~268M, the 16e siever is only 10% slower and produces [STRIKE]3x[/STRIKE]* >2x more relations (so the sieving range would be much shorter). It does take 1.5G of memory compared to 1.1G for the 15e.
____________
*Waited for more output while typing; 3x was a local anomaly.

EDIT2: With my 1.5G 16e test processes, I effectively starved the 15e's (the box I am torturing has only 4G of memory), and they shed the fat and revealed that each of them needs only 0.66G resident (still 1.1G virtual). So 15e might be a better bet for this project, because you can easily recruit 4G 4-CPU boxes; there are more of them in existence than 8G+ boxes (the latter must be quite new, born after the dramatic drop in memory prices per GB).
[QUOTE=Batalov;308443]Bill, it doesn't matter much for a problem of this size. Every relation takes, say, 0.7 s to generate; writing it and downloading a chunk take negligible time. GMP can read and write numbers in any base. (For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)[/QUOTE]
Well, no - I meant for large collections of relations, e.g. preparing to do LA by downloading rels from NFS@Home/RSALS (especially w.r.t. the earlier server-space problems). If the compression ratio stayed the same, that could have meant significant space savings for RSALS.
You could recode a reasonable chunk (or a full dataset that you may have somewhere), gzip both, and compare and share the results. There will probably be a small benefit, no doubt. But it has to be substantial to justify changing the de facto standard file format [makes air quotes and winks with both eyes].
There were also proposals to bzip, 7zip, etc. (they pack better, but are much slower).

Here's another idea (context: the factors less than 1000 are already [I]not[/I] reported in this file and are reconstituted on the fly by msieve): drop factors less than 10000. A radical version of the same idea: keep only the a and b values in the file. A less radical variant: keep a few of the largest factors and only the changed (w.r.t. the previous line) q0.

These were all tried - for storage and/or for transmission. A modified version of msieve can read any of these file dialects (it is best to recode the "a,b:" file into a standard file once, before real use, because msieve re-reads the file many times).
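A minimal sketch of the "reconstitute on the fly" step: recover the stripped small factors by trial division below the cutoff. The function name and the 10000 bound are mine, just for illustration; msieve's actual code is organized differently (and far faster).

```python
def reconstitute_small(n, bound=10000):
    """Strip from n all prime factors below `bound`; return the list
    of small factors plus whatever cofactor is left over."""
    small = []
    p = 2
    while p < bound:
        if p * p > n:
            break
        while n % p == 0:
            small.append(p)
            n //= p
        p += 1 if p == 2 else 2  # only 2 and odd candidates
    if 1 < n < bound:            # leftover is itself a small prime
        small.append(n)
        n = 1
    return small, n
```

For example, `reconstitute_small(12128484)` returns `([2, 2, 3, 101], 10007)`: the factors under 10000 are restored, and the prime 10007 would stay in the file.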
[QUOTE=Batalov;308443](For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)[/QUOTE]Don't know if the relation files from a later SVN are already all one case, but to convert a relation file: [CODE]tr '[A-Z]' '[a-z]' < file > file.out[/CODE]
Only the very old legacy sievers were using both.
That was one of the very first patches: to unify case. In retrospect, I think uppercase should have been preferred: dc balks when I use it to convert by click'n'pasting (I have a script, though). [CODE]
echo "16i 6B252C3 p" | dc                         ==> 112349891
echo "16i 6b252c3 p" | dc                         ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc    ==> fine, but awkward
[/CODE] dc is available on any computer ;-) Never mind, though. Not a good reason to change.
[QUOTE=Batalov;308461]Only the very old legacy sievers were using both.
That was one of the very first patches - to unify case. In retrospect, I think uppercase should have been preferred: dc balks when I use it to convert by click'n'pasting (I have a script, though). [CODE]
echo "16i 6B252C3 p" | dc                         ==> 112349891
echo "16i 6b252c3 p" | dc                         ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc    ==> fine, but awkward
[/CODE] dc is available on any computer ;-) Never mind, though. Not a good reason to change.[/QUOTE] I've always been of the opinion that upper case looks more like a number, i.e. the numerals 0-9 are all about the same height as upper-case letters, but much taller than lower-case letters. Words we read are mostly lower case, while (base-10) numbers are thus "upper case"; so all hex should use upper-case A-F as well :rolleyes:
A totally new (binary) file format would probably be way better.
It should be able to store values in a variable number of bytes (a byte stream, or even a bit stream, with interjected tags). Primes could be converted into their ordinal, or at the very least stored as (p div 30) << 3 + (mod-30 class); 2, like all the small primes, would of course not be stored at all. Of course, most likely this will not beat a gzip pass. "Muda da, muda da, said Ecclesiastes, [URL="http://en.wikipedia.org/wiki/Muda_(Japanese_term)"]muda da[/URL]. Kore wa subete no kyoei kokorodesu." (Roughly: "Futility, futility, all is futility. This is all vanity of the heart.")
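A sketch of the mod-30 packing mentioned above: the eight residues mod 30 that are coprime to 30 fit in three bits, so a prime p > 5 can be stored as (p div 30) << 3 plus its class index. The names here are mine; this is just the arithmetic, not any siever's actual format.

```python
# Residues mod 30 that a prime > 5 can have; exactly 8 of them,
# so the class index fits in 3 bits.
CLASSES = (1, 7, 11, 13, 17, 19, 23, 29)
INDEX = {r: i for i, r in enumerate(CLASSES)}

def pack30(p):
    """Encode a prime p > 5 as (p // 30) << 3 | class-index."""
    return (p // 30) << 3 | INDEX[p % 30]

def unpack30(code):
    return (code >> 3) * 30 + CLASSES[code & 7]
```

For instance, the special-q from the quoted post, 112349891, packs into 25 bits instead of the 27 bits of the raw value - a modest but uniform saving on every stored prime.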
Latter-day msieve versions read the original data file 2-3 times during filtering, once for the linear algebra, and once for each dependency in the square root. I have plans to convert the relation text into a Berkeley DB database. Factoring relations on the fly would be very nice, but we really need a subsystem tuned for finding small factors of large numbers (CADO has one, and Alex's dissertation describes a more powerful one).
iirc
[QUOTE="gmp"]
[U]Function:[/U] size_t [B]mpz_out_str[/B] [I](FILE *stream, int base, mpz_t op)[/I]

Output op on stdio stream stream, as a string of digits in base base. The base may vary from 2 to 36.[/QUOTE] It is either undocumented or documented elsewhere that [I]base = [/I]-16 will produce hex in uppercase.

EDIT: indeed, mpz/out_str.c:
[CODE]
  if (base >= 0)
    {
      num_to_text = "0123456789abcdefghijklmnopqrstuvwxyz";
      if (base == 0)
        base = 10;
      else if (base > 36)
        {
          num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
          if (base > 62)
            return 0;
        }
    }
  else
    {
      base = -base;
      num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    }
[/CODE]
I once dealt with bona fide CWI relations. (afair, that format strips more small factors, and is laid out a bit differently.)
First, I built msieve to work with them directly. After a few hours I couldn't bear to look at the crawling speed of the file import (it was dominated by trial factoring), killed it, wrote a recoder (msieve-based!), then remdups'd the resulting file, and everything was smooth after that.
[QUOTE=Batalov;308455]A radical version of the same idea: keep only a and b values in the file.[/QUOTE]
Even more radical: keep the lattice basis vectors for each special-q root, and you can write each relation in lattice coordinates to save even more space.
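A sketch of that idea, assuming the usual reduced basis {u, v} of the special-q lattice: each sieve report is (a, b) = i*u + j*v with small i and j, so storing (i, j) instead of (a, b) keeps the coordinates short. The basis and function names below are mine for illustration; recovering (i, j) is exact integer arithmetic by Cramer's rule whenever (a, b) really lies in the lattice.

```python
def to_lattice(a, b, u, v):
    """Solve i*u + j*v = (a, b) by Cramer's rule."""
    det = u[0] * v[1] - u[1] * v[0]
    i = (a * v[1] - b * v[0]) // det
    j = (b * u[0] - a * u[1]) // det
    return i, j

def from_lattice(i, j, u, v):
    return i * u[0] + j * v[0], i * u[1] + j * v[1]
```

With a made-up basis like u = (1234, 5), v = (-678, 917), a relation at (a, b) = (80914, -47499) collapses to the much shorter (i, j) = (37, -52).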
Here's a somewhat facetious reason to use base 36: now we can go hunting for funny words spelled out by the factors. :smile:
Hmm, less promising than I thought.
[code]
bill@Gravemind:~/yafu/rsals$ wc -l nfs.dat
93193867 nfs.dat
bill@Gravemind:~/yafu/rsals$ wc -l rels36
91974259 rels36
bill@Gravemind:~/yafu/rsals$ wc -c nfs.dat
11683340634 nfs.dat
bill@Gravemind:~/yafu/rsals$ wc -c rels36
10045856817 rels36
bill@Gravemind:~/yafu/rsals$ head -n 2 nfs.dat
353566,832085:5200a0f,af8df,20bc63,2912c0b,3114f59,1087,751:34dd2953,18cf85c1,1aa0f1,1159,3,709,1360f1f
1337012,1463419:199c69b1,1ce65701,5bc9,96fd,3ab829,d,29,17f,5ab,13:a47a68b,2ee792d,dea7,3d444f,3,5,11b,11,11,13,1360f1f
bill@Gravemind:~/yafu/rsals$ head -n 2 rels36
353566,832085:1F6Z2N,FEU7,19ZDV,PN3T7,UN3H5,39J,1G1:EO1JKJ,6VTR5T,11EK1,3FD,3,1E1,C3J1B
1337012,1463419:73TK1D,80O6IP,I4P,TTP,2AHBD,D,15,AN,14B,J:2UOL1N,TA5V1,17ZB,2E24V,3,5,7V,H,H,J,C3J1B
bill@Gravemind:~/yafu/rsals$ remdups4 nfs.dat
Counting rels in nfs.dat. This might take a few minutes for large rel files.
Found 93193103 unique, 5 duplicate (0.0% of total), and 107 bad relations.
Largest dimension used: 362 of 930
Average dimension used: 284.4 of 930
bill@Gravemind:~/yafu/rsals$ bzip2 nfs.dat
bill@Gravemind:~/yafu/rsals$ wc -c nfs.dat.bz2
5130787706 nfs.dat.bz2
bill@Gravemind:~/yafu/rsals$ bzip2 rels36
bill@Gravemind:~/yafu/rsals$ wc -c rels36.bz2
5035796294 rels36.bz2
[/code]
(Don't be fooled: the conversion took ~2 hrs, and each of the bzips took around an hour.)

Uncompressed 36/hex ratio: 85.98%
Compressed 36/hex ratio: 98.15%
Base-36 compressed/uncompressed ratio: 50.13%
Hex compressed/uncompressed ratio: 43.92%

So it seems I was right about one thing: adding the extra characters made the compression less effective.
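The recoding itself is straightforward; here is a sketch matching the field layout visible in the head output above (a,b stay decimal, and both colon-separated factor lists go from hex to base 36). This is my own illustration, not the recoder Bill actually ran.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base36(n):
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out))

def recode_line(line):
    """a,b stay decimal; the two factor lists become base 36."""
    ab, rat, alg = line.strip().split(":")
    rec = lambda fs: ",".join(to_base36(int(f, 16)) for f in fs.split(","))
    return f"{ab}:{rec(rat)}:{rec(alg)}"
```

Feeding it the first nfs.dat line above reproduces the first rels36 line exactly.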
Sounds about right. This is because if gzip does its job right, the gzipped size approaches the "raw information content" size.
See, the hex file has more air and the base-36 file less (in both cases there are unused bits, as well as abundant commas), but the compressor squeezes the air out and the bare bit stream remains. bzip2 and 7zip will compress better still (at the expense of much longer runtimes), and the resulting compressed sizes may be closer still. What you did is a small pre-compression. Information (essentially random data) is incompressible. The only way to make the compressed files really smaller is to give up some information and pay later (pay with time) to restore what was sacrificed.
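The point can be demonstrated on synthetic data: random 64-bit numbers carry a fixed amount of information, so writing them in hex versus base 36 changes the raw file size noticeably, but after compression the two land close together. (Comma-separated fields mimic the relation format; the exact ratios depend on the compressor.)

```python
import random
import zlib

random.seed(1)
nums = [random.getrandbits(64) for _ in range(20000)]

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
def to_base36(n):
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out)) or "0"

hex_text = ",".join(format(n, "x") for n in nums).encode()
b36_text = ",".join(to_base36(n) for n in nums).encode()

raw_ratio = len(b36_text) / len(hex_text)
comp_ratio = len(zlib.compress(b36_text, 9)) / len(zlib.compress(hex_text, 9))
print(f"raw 36/hex:        {raw_ratio:.2f}")   # well below 1
print(f"compressed 36/hex: {comp_ratio:.2f}")  # much closer to 1
```

The gap between the two ratios is the "air" the recoding squeezed out in advance; the compressor would have removed most of it anyway.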
[QUOTE=Batalov;308443]Tom, have you thought about using the 16e siever and a much smaller sieving area? At q~268M, the 16e siever is only 10% slower and produces [STRIKE]3x[/STRIKE]* >2x more relations (so the sieving range would be much shorter). It does take 1.5G of memory compared to 1.1G for the 15e.[/QUOTE]
Yes, I ran 0.1% of the job with both 15e and 16e; in my tests 16e was significantly slower (10% on this project is more than a CPU-year!) to get to 400 million relations, and since it also uses significantly more memory than 15e, I thought 15e the better choice.