mersenneforum.org On relations file format
 Register FAQ Search Today's Posts Mark Forums Read

2012-08-18, 16:25   #1
Dubslow

"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts
On relations file format

Quote:
 Originally Posted by fivemack Yes provided you can read base-sixteen (so in some senses no); if you look at the last few lines of the file, it will read something like Code: 22617993,32185531:4f7c7bad,25f9,3ea88d,288611,a5a09,f8b7,8425,6b252c3:7a526e39,57d8af69,67563a1,1b0f047,3ce17,55d9 70093203,74087957:7e890e81,ff64977d,9ef,455,f0a43,a58b,6997,5e75,6b252c3:574e3e89,679,1de69b7,1163ae9,366137,1e707,5071 119968003,87443153:ee91c059,886dd67,c83,4cf4afd,3addd07,668cdb,6b252c3:48ebe3c1,9fb4baa9,13f1d0ef,1ec1,ef9,2669bf,109dd1 Notice that each of these has the same number 6b252c3 before the colon; that's the last special-Q value, and turns out to be 112349891 in hex. So the 112300000-112400000 chunk is just under half-done. Some versions of the siever write incrementally to the output file 1234.t, in which case you look at the last line of that, but I think serge's current compilation doesn't. A 1e6 chunk is a lot of work - probably about two weeks on your fast machines, possibly three weeks.
Wouldn't the file be quite a bit smaller if the rels were recorded in (say) base 36?

Or would that make compression harder (and thus larger)?

 2012-08-18, 18:30 #2 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 100101100010002 Posts Bill, it doesn't matter much for this size of a problem. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best that only lowercase letters were used for output [or only upper] but not both.) Tom, have you thought about using 16e siever and a much lesser sieving area? At q~268M, the 16e siever is only 10% slower and produces 3x* >2x more relations (so the sieving area would be way shorter). It does take 1.5G of memory compared to 1.1G for the 15e. ____________ *waited for more output, while typing. 3x was a local anomaly. EDIT2: With my 1.5G 16e test processes, I effectively starved the 15e's (the box I am torturing has only 4G of memory), and they shed the fat and revealed that each of them needs only 0.66G res (still 1.1G virtual). So, 15e might be a better bet for this project, because you can easily recruit 4G 4-cpu boxes; there are more of them in existence than the 8G+ (the latter must be quite new, born after the dramatic drop in memory prices per Gb). Last fiddled with by Batalov on 2012-08-18 at 18:49 Reason: my kbord is loosng lettrs
2012-08-18, 18:56   #3
Dubslow

"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

Quote:
 Originally Posted by Batalov Bill, it doesn't matter much for this size of a problem. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best that only lowercase letters were used for output [or only upper] but not both.)
Well no, but I meant for large collections of relations, e.g. preparing to do LA by downloading rels from NFS@Home/RSALS (especially w.r.t. the former server space problems). If the compression ratio is the same, that could have been some significant space savings for RSALS.

 2012-08-18, 19:15 #4 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 960810 Posts You could recode a reasonable chunk (or a full dataset that you may have somewhere), gzip, compare and share the results. There probably will be a small benefit, no doubt. It has to be substantial - to change the de facto standard file format [makes air quotes and winks with both eyes]. There were also proposals to bzip, 7zip, etc. (they pack better, but are much slower) Here's another idea: (context: the factors less than 1000 are already not reported in this file and are reconstitued on the fly by msieve) drop factors less than 10000. A radical version of the same idea: keep only a and b values in the file. Less radical variant: keep a few of the largest factors and only the changed (wrt previous line) q0. These were all tried - for storage and/or for transmission. A modified version of msieve can read any of these file dialects (it is best to recode the "a,b:" file into a standard file once, before real use - because miseve re-reads the file many times).
2012-08-18, 19:23   #5
smh

"Sander"
Oct 2002
52.345322,5.52471

29·41 Posts

Quote:
 Originally Posted by Batalov (For compression it is best that only lowercase letters were used for output [or only upper] but not both.)
Don't know if the relations in a later SVN are already the same case, but to convert the relation files
Code:
cat file | tr '[A-Z]' '[a-z]' > file.out

 2012-08-18, 19:37 #6 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 23×1,201 Posts Only the very old legacy sievers were using both. That was one of the very first patches: to unify case. In retrospect, I think uppercase should have been preferred: dc balkes when I use it to convert by click'n'pasting (I have a script though) Code: echo "16i 6B252C3 p" |dc ==> 112349891 echo "16i 6b252c3 p" |dc ==> error echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc ==> fine, but awkward dc is available on any computer ;-) Nevermind though. Not a good reason to change. Last fiddled with by Batalov on 2012-08-18 at 21:21 Reason: forgot to uc the ^V
2012-08-18, 19:47   #7
Dubslow

"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

Quote:
 Originally Posted by Batalov Only the very old legacy sievers were using both. That was one of the very first patches - to unify case. In retrospect, I think uppercase should have been preferred: dc balkes when I use it to convert by click'n'pasting (I have a script though) Code: echo "16i 6b252c3 p" |dc ==> 112349891 echo "16i 6b252c3 p" |dc ==> error echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc ==> fine, but awkward dc is available on any computer ;-) Nevermind though. Not a good reason to change.
I've always been of the opinion that upper case looks more like a number, i.e. the numerals 0-9 are all about the same height as upper case letters, but they're much taller than lower case letters. Words we read are mostly lower case, while (base 10) numbers are thus "upper case"; thus, all hex should use upper case A-F as well

 2012-08-18, 19:57 #8 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 258816 Posts A totally new (binary) file format would be probably way better. It should be able to store values in variable number of bytes (a byte of even a bit stream with interjected tags). Primes could be converted into their ordinal (or at the very least they could be stored in mod 30 <<3 + mod30class; 2 of course not stored as well as all small primes). Of course, most likely this will not beat a gzip pass. "Muda da, muda da, said Ecclesiastes, muda da. Kore wa subete no kyoei kokorodesu."
 2012-08-18, 21:54 #9 jasonp Tribal Bullet     Oct 2004 DD716 Posts Latter-day msieve versions read the original data file 2-3 times during the filtering, once for the linear algebra and once for each dependency in the square root. I have plans to convert the relation text into a Berkeley DB database. Factoring relations on the fly would be very nice, but we really need a subsystem that is tuned for finding small factors in large numbers (CADO has one and Alex's dissertation describes a more powerful one).
2012-08-18, 22:33   #10
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

23×1,201 Posts
iirc

Quote:
 Originally Posted by gmp Function: size_t mpz_out_str (FILE *stream, int base, mpz_t op) Output op on stdio stream stream, as a string of digits in base base. The base may vary from 2 to 36.
It is either undocumented or documented elsewhere, that base = -16 will produce hex-in-uppercase.

EDIT: indeed, mpz/out_str.c:
Code:
  if (base >= 0)
{
num_to_text = "0123456789abcdefghijklmnopqrstuvwxyz";
if (base == 0)
base = 10;
else if (base > 36)
{
num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
if (base > 62)
return 0;
}
}
else
{
base = -base;
num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
}

Last fiddled with by Batalov on 2012-08-18 at 22:39

 2012-08-18, 23:20 #11 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 258816 Posts I once dealt with bona fide CWI relations. (afair, it strips more small factors, and the format is a bit different) First, I built msieve that worked straight with them. After a few hours, I couldn't bear to look at the (crawl) speed of file import (it was dominated by trial factoring), killed it, wrote a recoder (miseve-based!) and then remdups'd the resulting file and everything was smooth after that.

 Similar Threads Thread Thread Starter Forum Replies Last Post xilman Msieve 2 2015-11-27 09:54 R.D. Silverman Msieve 25 2013-04-17 07:40 Dubslow Msieve 5 2012-09-13 11:26 schickel Aliquot Sequences 5 2009-04-02 12:44 fivemack Factoring 7 2007-08-04 17:32

All times are UTC. The time now is 00:42.

Mon Nov 29 00:42:53 UTC 2021 up 128 days, 19:11, 0 users, load averages: 0.63, 0.97, 1.11