On relations file format
[QUOTE=fivemack;308396]Yes, provided you can read base sixteen (so in some senses no); if you look at the last few lines of the file, it will read something like
[code]
22617993,32185531:4f7c7bad,25f9,3ea88d,288611,a5a09,f8b7,8425,6b252c3:7a526e39,57d8af69,67563a1,1b0f047,3ce17,55d9
70093203,74087957:7e890e81,ff64977d,9ef,455,f0a43,a58b,6997,5e75,6b252c3:574e3e89,679,1de69b7,1163ae9,366137,1e707,5071
119968003,87443153:ee91c059,886dd67,c83,4cf4afd,3addd07,668cdb,6b252c3:48ebe3c1,9fb4baa9,13f1d0ef,1ec1,ef9,2669bf,109dd1
[/code]
Notice that each of these has the same number 6b252c3 before the colon; that's the last special-Q value, and 6b252c3 hex turns out to be 112349891 in decimal. So the 112300000-112400000 chunk is just under half done. Some versions of the siever write incrementally to the output file 1234.t, in which case you look at the last line of that, but I think Serge's current compilation doesn't. A 1e6 chunk is a lot of work - probably about two weeks on your fast machines, possibly three.[/QUOTE]
Wouldn't the file be quite a bit smaller if the rels were recorded in, say, base 36? Or would that make compression harder (and thus the compressed file larger)?
Bill, it doesn't matter much for a problem of this size. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)
Tom, have you thought about using the 16e siever and a much smaller sieving area? At q~268M, the 16e siever is only 10% slower and produces [STRIKE]3x[/STRIKE]* >2x more relations (so the sieving range could be much shorter). It does take 1.5G of memory compared to 1.1G for the 15e.
____________
*Waited for more output while typing; 3x was a local anomaly.
EDIT2: With my 1.5G 16e test processes, I effectively starved the 15e's (the box I am torturing has only 4G of memory), and they shed the fat and revealed that each of them needs only 0.66G resident (still 1.1G virtual). So 15e might be a better bet for this project, because you can easily recruit 4G 4-CPU boxes; there are more of them in existence than 8G+ boxes (the latter must be quite new, born after the dramatic drop in memory prices per GB).
[QUOTE=Batalov;308443]Bill, it doesn't matter much for a problem of this size. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)[/QUOTE]
Well no, but I meant for large collections of relations, e.g. preparing to do LA by downloading rels from NFS@Home/RSALS (especially w.r.t. the former's server space problems). If the compression ratio stayed the same, that could mean significant space savings for RSALS.
You could recode a reasonable chunk (or a full dataset that you may have somewhere), gzip both, compare, and share the results. There will probably be a small benefit, no doubt. But it would have to be substantial to justify changing the de facto standard file format [makes air quotes and winks with both eyes].
There were also proposals to use bzip2, 7zip, etc. (they pack better but are much slower). Here's another idea (context: factors less than 1000 are already [I]not[/I] reported in this file and are reconstituted on the fly by msieve): drop factors less than 10000. A radical version of the same idea: keep only the a and b values in the file. A less radical variant: keep a few of the largest factors and only the changed (w.r.t. the previous line) q0. These were all tried - for storage and/or for transmission. A modified version of msieve can read any of these file dialects (it is best to recode the "a,b:" file into a standard file once, before real use, because msieve re-reads the file many times).
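The recode-and-compare experiment suggested above can be sketched like this. The values are synthetic stand-ins for real relation factors, so the exact ratios are illustrative only:

```python
import gzip
import random

def to_base36(n):
    # Minimal base-36 encoder, lowercase only (better for gzip, per above).
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        out = digits[n % 36] + out
        n //= 36
    return out or "0"

random.seed(1)
vals = [random.randrange(1, 1 << 40) for _ in range(10000)]
hex_blob = ",".join(format(v, "x") for v in vals).encode()
b36_blob = ",".join(to_base36(v) for v in vals).encode()

# Raw text shrinks by roughly log(16)/log(36) ~ 0.77; gzip narrows the gap,
# since either encoding carries the same information.
print(len(b36_blob) / len(hex_blob))
print(len(gzip.compress(b36_blob)) / len(gzip.compress(hex_blob)))
```

Running this on an actual relations chunk, as suggested, would settle whether the post-gzip saving is worth a format change.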
[QUOTE=Batalov;308443](For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)[/QUOTE]Don't know if the relations in a later SVN are already all the same case, but to convert the relation files: [CODE]tr '[A-Z]' '[a-z]' < file > file.out[/CODE]
Only the very old legacy sievers used both.
That was one of the very first patches: to unify case. In retrospect, I think uppercase should have been preferred: dc balks when I use it to convert by click'n'pasting (I have a script, though)
[CODE]echo "16i 6B252C3 p" | dc                          ==> 112349891
echo "16i 6b252c3 p" | dc                          ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc     ==> fine, but awkward
[/CODE]
dc is available on any computer ;-) Never mind, though. Not a good reason to change.
[QUOTE=Batalov;308461]Only the very old legacy sievers used both.
That was one of the very first patches: to unify case. In retrospect, I think uppercase should have been preferred: dc balks when I use it to convert by click'n'pasting (I have a script, though)
[CODE]echo "16i 6B252C3 p" | dc                          ==> 112349891
echo "16i 6b252c3 p" | dc                          ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc     ==> fine, but awkward
[/CODE]
dc is available on any computer ;-) Never mind, though. Not a good reason to change.[/QUOTE] I've always been of the opinion that upper case looks more like a number, i.e. the numerals 0-9 are all about the same height as upper-case letters, but much taller than lower-case letters. Words we read are mostly lower case, while (base 10) numbers are effectively "upper case"; thus, all hex should use upper-case A-F as well :rolleyes:
A totally new (binary) file format would probably be way better.
It should store values in a variable number of bytes (a byte stream, or even a bit stream, with interjected tags). Primes could be converted into their ordinal (or, at the very least, stored as (p div 30) << 3 + mod-30 class; 2, of course, would not be stored, nor would all the small primes). Of course, most likely this will not beat a gzip pass. "Futility, futility, said Ecclesiastes, [URL="http://en.wikipedia.org/wiki/Muda_(Japanese_term)"]futility[/URL]. It is all vanity of heart."
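A sketch of the mod-30 packing floated above: any prime p > 5 is coprime to 30, so its residue mod 30 falls in one of eight classes and fits in 3 bits next to p div 30. This is an illustration of the idea, not an existing siever format:

```python
# Residues mod 30 that a prime > 5 can have: 8 classes, hence 3 bits.
CLASSES = (1, 7, 11, 13, 17, 19, 23, 29)
INDEX = {r: i for i, r in enumerate(CLASSES)}

def pack(p):
    # (p div 30) shifted left 3 bits, OR'd with the 3-bit class index.
    return (p // 30) << 3 | INDEX[p % 30]

def unpack(x):
    return (x >> 3) * 30 + CLASSES[x & 7]

q = 112349891                     # the special-q from earlier in the thread
assert unpack(pack(q)) == q       # lossless round trip
print(pack(q).bit_length(), q.bit_length())   # packed value needs fewer bits
```

Each packed prime saves about log2(30) - 3 ≈ 1.9 bits; storing the ordinal of the prime instead, as also suggested, would save considerably more at the cost of a lookup.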
Latter-day msieve versions read the original data file 2-3 times during the filtering, once for the linear algebra, and once for each dependency in the square root. I have plans to convert the relation text into a Berkeley DB database. Factoring relations on the fly would be very nice, but we really need a subsystem that is tuned for finding small factors in large numbers (CADO has one, and Alex's dissertation describes a more powerful one).
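The "reconstitute small factors on the fly" idea mentioned in this thread can be illustrated with plain trial division below the reporting bound. This is a toy sketch, not msieve's actual code, which (like the CADO subsystem mentioned above) is far more heavily tuned:

```python
def small_factors(n, bound=1000):
    # Strip all prime factors below `bound` from n by trial division;
    # returns the list of small factors found and the remaining cofactor.
    factors = []
    d = 2
    while d < bound and d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    return factors, n

print(small_factors(2 * 3 * 7 * 1009))   # ([2, 3, 7], 1009)
```

The post's point is that at import time this step dominates when done naively, which is why a dedicated, tuned small-factor subsystem matters.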
iirc
[QUOTE="gmp"]
[U]Function:[/U] size_t [B]mpz_out_str[/B] [I](FILE *stream, int base, mpz_t op)[/I]
Output op on stdio stream stream, as a string of digits in base base. The base may vary from 2 to 36.[/QUOTE]
It is either undocumented or documented elsewhere that [I]base = [/I]-16 will produce hex in uppercase.
EDIT: indeed, mpz/out_str.c:
[CODE]
  if (base >= 0)
    {
      num_to_text = "0123456789abcdefghijklmnopqrstuvwxyz";
      if (base == 0)
        base = 10;
      else if (base > 36)
        {
          num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
          if (base > 62)
            return 0;
        }
    }
  else
    {
      base = -base;
      num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    }
[/CODE]
I once dealt with bona fide CWI relations. (afair, that format strips more small factors, and it is laid out a bit differently)
First, I built an msieve that worked straight with them. After a few hours, I couldn't bear to look at the (crawling) speed of the file import (it was dominated by trial factoring), killed it, wrote a recoder (msieve-based!), then remdups'd the resulting file, and everything was smooth after that.