mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Reply
 
Thread Tools
Old 2012-08-18, 16:25   #1
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts
Default On relations file format

Quote:
Originally Posted by fivemack View Post
Yes provided you can read base-sixteen (so in some senses no); if you look at the last few lines of the file, it will read something like

Code:
22617993,32185531:4f7c7bad,25f9,3ea88d,288611,a5a09,f8b7,8425,6b252c3:7a526e39,57d8af69,67563a1,1b0f047,3ce17,55d9
70093203,74087957:7e890e81,ff64977d,9ef,455,f0a43,a58b,6997,5e75,6b252c3:574e3e89,679,1de69b7,1163ae9,366137,1e707,5071
119968003,87443153:ee91c059,886dd67,c83,4cf4afd,3addd07,668cdb,6b252c3:48ebe3c1,9fb4baa9,13f1d0ef,1ec1,ef9,2669bf,109dd1
Notice that each of these has the same number 6b252c3 before the colon; that's the last special-Q value, and turns out to be 112349891 in hex. So the 112300000-112400000 chunk is just under half-done.

Some versions of the siever write incrementally to the output file 1234.t, in which case you look at the last line of that, but I think serge's current compilation doesn't.

A 1e6 chunk is a lot of work - probably about two weeks on your fast machines, possibly three weeks.
Wouldn't the file be quite a bit smaller if the rels were recorded in (say) base 36?

Or would that make compression harder (and thus larger)?
Dubslow is offline   Reply With Quote
Old 2012-08-18, 18:30   #2
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

100101100010002 Posts
Default

Bill, it doesn't matter much for this size of a problem. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best that only lowercase letters were used for output [or only upper] but not both.)

Tom, have you thought about using 16e siever and a much lesser sieving area? At q~268M, the 16e siever is only 10% slower and produces 3x* >2x more relations (so the sieving area would be way shorter). It does take 1.5G of memory compared to 1.1G for the 15e.

____________
*waited for more output, while typing. 3x was a local anomaly.

EDIT2: With my 1.5G 16e test processes, I effectively starved the 15e's (the box I am torturing has only 4G of memory), and they shed the fat and revealed that each of them needs only 0.66G res (still 1.1G virtual). So, 15e might be a better bet for this project, because you can easily recruit 4G 4-cpu boxes; there are more of them in existence than the 8G+ (the latter must be quite new, born after the dramatic drop in memory prices per Gb).

Last fiddled with by Batalov on 2012-08-18 at 18:49 Reason: my kbord is loosng lettrs
Batalov is offline   Reply With Quote
Old 2012-08-18, 18:56   #3
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts
Default

Quote:
Originally Posted by Batalov View Post
Bill, it doesn't matter much for this size of a problem. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best that only lowercase letters were used for output [or only upper] but not both.)
Well no, but I meant for large collections of relations, e.g. preparing to do LA by downloading rels from NFS@Home/RSALS (especially w.r.t. the former server space problems). If the compression ratio is the same, that could have been some significant space savings for RSALS.
Dubslow is offline   Reply With Quote
Old 2012-08-18, 19:15   #4
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

960810 Posts
Default

You could recode a reasonable chunk (or a full dataset that you may have somewhere), gzip, compare and share the results. There probably will be a small benefit, no doubt. It has to be substantial - to change the de facto standard file format [makes air quotes and winks with both eyes].

There were also proposals to bzip, 7zip, etc. (they pack better, but are much slower)

Here's another idea: (context: the factors less than 1000 are already not reported in this file and are reconstitued on the fly by msieve) drop factors less than 10000. A radical version of the same idea: keep only a and b values in the file. Less radical variant: keep a few of the largest factors and only the changed (wrt previous line) q0. These were all tried - for storage and/or for transmission. A modified version of msieve can read any of these file dialects (it is best to recode the "a,b:" file into a standard file once, before real use - because miseve re-reads the file many times).
Batalov is offline   Reply With Quote
Old 2012-08-18, 19:23   #5
smh
 
smh's Avatar
 
"Sander"
Oct 2002
52.345322,5.52471

29·41 Posts
Default

Quote:
Originally Posted by Batalov View Post
(For compression it is best that only lowercase letters were used for output [or only upper] but not both.)
Don't know if the relations in a later SVN are already the same case, but to convert the relation files
Code:
cat file | tr '[A-Z]' '[a-z]' > file.out
smh is offline   Reply With Quote
Old 2012-08-18, 19:37   #6
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

23×1,201 Posts
Default

Only the very old legacy sievers were using both.
That was one of the very first patches: to unify case.

In retrospect, I think uppercase should have been preferred: dc balkes when I use it to convert by click'n'pasting (I have a script though)
Code:
echo "16i 6B252C3 p" |dc ==> 112349891
echo "16i 6b252c3 p" |dc ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc ==> fine, but awkward
dc is available on any computer ;-)

Nevermind though. Not a good reason to change.

Last fiddled with by Batalov on 2012-08-18 at 21:21 Reason: forgot to uc the ^V
Batalov is offline   Reply With Quote
Old 2012-08-18, 19:47   #7
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts
Default

Quote:
Originally Posted by Batalov View Post
Only the very old legacy sievers were using both.
That was one of the very first patches - to unify case.

In retrospect, I think uppercase should have been preferred: dc balkes when I use it to convert by click'n'pasting (I have a script though)
Code:
echo "16i 6b252c3 p" |dc ==> 112349891
echo "16i 6b252c3 p" |dc ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc ==> fine, but awkward
dc is available on any computer ;-)

Nevermind though. Not a good reason to change.
I've always been of the opinion that upper case looks more like a number, i.e. the numerals 0-9 are all about the same height as upper case letters, but they're much taller than lower case letters. Words we read are mostly lower case, while (base 10) numbers are thus "upper case"; thus, all hex should use upper case A-F as well
Dubslow is offline   Reply With Quote
Old 2012-08-18, 19:57   #8
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

258816 Posts
Default

A totally new (binary) file format would be probably way better.

It should be able to store values in variable number of bytes (a byte of even a bit stream with interjected tags). Primes could be converted into their ordinal (or at the very least they could be stored in mod 30 <<3 + mod30class; 2 of course not stored as well as all small primes). Of course, most likely this will not beat a gzip pass.

"Muda da, muda da, said Ecclesiastes, muda da. Kore wa subete no kyoei kokorodesu."
Batalov is offline   Reply With Quote
Old 2012-08-18, 21:54   #9
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

DD716 Posts
Default

Latter-day msieve versions read the original data file 2-3 times during the filtering, once for the linear algebra and once for each dependency in the square root. I have plans to convert the relation text into a Berkeley DB database. Factoring relations on the fly would be very nice, but we really need a subsystem that is tuned for finding small factors in large numbers (CADO has one and Alex's dissertation describes a more powerful one).
jasonp is offline   Reply With Quote
Old 2012-08-18, 22:33   #10
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

23×1,201 Posts
Default iirc

Quote:
Originally Posted by gmp
Function: size_t mpz_out_str (FILE *stream, int base, mpz_t op)
Output op on stdio stream stream, as a string of digits in base base. The base may vary from 2 to 36.
It is either undocumented or documented elsewhere, that base = -16 will produce hex-in-uppercase.

EDIT: indeed, mpz/out_str.c:
Code:
  if (base >= 0)
    {
      num_to_text = "0123456789abcdefghijklmnopqrstuvwxyz";
      if (base == 0)
        base = 10;
      else if (base > 36)
        {
          num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
          if (base > 62)
            return 0;
        }
    }
  else
    {
      base = -base;
      num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    }

Last fiddled with by Batalov on 2012-08-18 at 22:39
Batalov is offline   Reply With Quote
Old 2012-08-18, 23:20   #11
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

258816 Posts
Default

I once dealt with bona fide CWI relations. (afair, it strips more small factors, and the format is a bit different)

First, I built msieve that worked straight with them. After a few hours, I couldn't bear to look at the (crawl) speed of file import (it was dominated by trial factoring), killed it, wrote a recoder (miseve-based!) and then remdups'd the resulting file and everything was smooth after that.
Batalov is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Converting CWI format to ggnfs format xilman Msieve 2 2015-11-27 09:54
CWI format --> GGNFS format R.D. Silverman Msieve 25 2013-04-17 07:40
Matrix file format questions Dubslow Msieve 5 2012-09-13 11:26
Format for .alq files schickel Aliquot Sequences 5 2009-04-02 12:44
More relations mean many more relations wanted fivemack Factoring 7 2007-08-04 17:32

All times are UTC. The time now is 00:42.


Mon Nov 29 00:42:53 UTC 2021 up 128 days, 19:11, 0 users, load averages: 0.63, 0.97, 1.11

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.