mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   On relations file format (https://www.mersenneforum.org/showthread.php?t=17082)

Dubslow 2012-08-18 16:25

On relations file format
 
[QUOTE=fivemack;308396]Yes provided you can read base-sixteen (so in some senses no); if you look at the last few lines of the file, it will read something like

[code]
22617993,32185531:4f7c7bad,25f9,3ea88d,288611,a5a09,f8b7,8425,6b252c3:7a526e39,57d8af69,67563a1,1b0f047,3ce17,55d9
70093203,74087957:7e890e81,ff64977d,9ef,455,f0a43,a58b,6997,5e75,6b252c3:574e3e89,679,1de69b7,1163ae9,366137,1e707,5071
119968003,87443153:ee91c059,886dd67,c83,4cf4afd,3addd07,668cdb,6b252c3:48ebe3c1,9fb4baa9,13f1d0ef,1ec1,ef9,2669bf,109dd1
[/code]

Notice that each of these has the same number 6b252c3 before the colon; that's the last special-Q value, which is 112349891 in decimal. So the 112300000-112400000 chunk is just under half-done.

Some versions of the siever write incrementally to the output file 1234.t, in which case you look at the last line of that, but I think serge's current compilation doesn't.

A 1e6 chunk is a lot of work - probably about two weeks on your fast machines, possibly three weeks.[/QUOTE]

Wouldn't the file be quite a bit smaller if the rels were recorded in (say) base 36?

Or would that make compression harder (and thus larger)?

Batalov 2012-08-18 18:30

Bill, it doesn't matter much for this size of a problem. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)

Tom, have you thought about using the 16e siever and a much smaller sieving area? At q~268M, the 16e siever is only 10% slower and produces [STRIKE]3x[/STRIKE]* >2x more relations (so the sieving range would be much shorter). It does take 1.5G of memory compared to 1.1G for the 15e.

____________
*waited for more output, while typing. 3x was a local anomaly.

EDIT2: With my 1.5G 16e test processes, I effectively starved the 15e's (the box I am torturing has only 4G of memory), and they shed the fat and revealed that each of them needs only 0.66G resident (still 1.1G virtual). So, 15e might be a better bet for this project, because you can easily recruit 4G 4-cpu boxes; there are more of them in existence than the 8G+ ones (the latter must be quite new, born after the dramatic drop in memory prices per GB).

Dubslow 2012-08-18 18:56

[QUOTE=Batalov;308443]Bill, it doesn't matter much for this size of a problem. Every relation takes, say, 0.7s to generate; writing it and downloading a chunk takes negligible time. GMP can read and write numbers in any base. (For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)[/QUOTE]

Well no, but I meant for large collections of relations, e.g. preparing to do LA by downloading rels from NFS@Home/RSALS (especially w.r.t. the former server space problems). If the compression ratio is the same, that could have been some significant space savings for RSALS.

Batalov 2012-08-18 19:15

You could recode a reasonable chunk (or a full dataset that you may have somewhere), gzip it, compare and share the results. There will probably be a small benefit, no doubt. But it would have to be substantial to change the de facto standard file format [makes air quotes and winks with both eyes].

There were also proposals to bzip, 7zip, etc. (they pack better, but are much slower).

Here's another idea: (context: the factors less than 1000 are already [I]not[/I] reported in this file and are reconstituted on the fly by msieve) drop factors less than 10000. A radical version of the same idea: keep only the a and b values in the file. A less radical variant: keep a few of the largest factors and only the changed (w.r.t. the previous line) q0. These were all tried - for storage and/or for transmission. A modified version of msieve can read any of these file dialects (it is best to recode the "a,b:" file into a standard file once, before real use - because msieve re-reads the file many times).

smh 2012-08-18 19:23

[QUOTE=Batalov;308443](For compression it is best if only lowercase letters are used for output [or only uppercase], but not both.)[/QUOTE]Don't know if the relations in a later SVN are already the same case, but to convert the relation files: [CODE]tr 'A-Z' 'a-z' < file > file.out[/CODE]

Batalov 2012-08-18 19:37

Only the very old legacy sievers were using both.
That was one of the very first patches: to unify case.

In retrospect, I think uppercase should have been preferred: dc balks when I use it to convert by click'n'pasting (I have a script, though)
[CODE]echo "16i 6B252C3 p" |dc ==> 112349891
echo "16i 6b252c3 p" |dc ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc ==> fine, but awkward
[/CODE]
dc is available on any computer ;-)

Nevermind though. Not a good reason to change.

Dubslow 2012-08-18 19:47

[QUOTE=Batalov;308461]Only the very old legacy sievers were using both.
That was one of the very first patches - to unify case.

In retrospect, I think uppercase should have been preferred: dc balks when I use it to convert by click'n'pasting (I have a script, though)
[CODE]echo "16i 6B252C3 p" |dc ==> 112349891
echo "16i 6b252c3 p" |dc ==> error
echo "16i 6b252c3 p" | tr '[a-f]' '[A-F]' | dc ==> fine, but awkward
[/CODE]
dc is available on any computer ;-)

Nevermind though. Not a good reason to change.[/QUOTE]

I've always been of the opinion that upper case looks more like a number: the numerals 0-9 are all about the same height as upper case letters, but much taller than lower case letters. The words we read are mostly lower case, while (base 10) numbers are effectively "upper case"; so all hex should use upper case A-F as well :rolleyes:

Batalov 2012-08-18 19:57

A totally new (binary) file format would probably be way better.

It should be able to store values in a variable number of bytes (a byte stream, or even a bit stream with interjected tags). Primes could be converted into their ordinal (or at the very least stored as (p/30)<<3 + the mod-30 class index; 2 is of course not stored, nor are all the small primes). Of course, most likely this will not beat a gzip pass.

"Muda da, muda da, said Ecclesiastes, [URL="http://en.wikipedia.org/wiki/Muda_(Japanese_term)"]muda da[/URL]. Kore wa subete no kyoei kokorodesu." ("Vanity, vanity, said Ecclesiastes, vanity. This is all vanity of spirit.")

jasonp 2012-08-18 21:54

Latter-day msieve versions read the original data file 2-3 times during the filtering, once for the linear algebra and once for each dependency in the square root. I have plans to convert the relation text into a Berkeley DB database. Factoring relations on the fly would be very nice, but we really need a subsystem that is tuned for finding small factors in large numbers (CADO has one and Alex's dissertation describes a more powerful one).

Batalov 2012-08-18 22:33

iirc
 
[QUOTE="gmp"]
[U]Function:[/U] size_t [B]mpz_out_str[/B] [I](FILE *stream, int base, mpz_t op)[/I]
Output op on stdio stream stream, as a string of digits in base base. The base may vary from 2 to 36. [/QUOTE]
It is either undocumented or documented elsewhere that [I]base = [/I]-16 will produce hex in uppercase.

EDIT: indeed, mpz/out_str.c:
[CODE]  if (base >= 0)
    {
      num_to_text = "0123456789abcdefghijklmnopqrstuvwxyz";
      if (base == 0)
        base = 10;
      else if (base > 36)
        {
          num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
          if (base > 62)
            return 0;
        }
    }
  else
    {
      base = -base;
      num_to_text = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    }
[/CODE]

Batalov 2012-08-18 23:20

I once dealt with bona fide CWI relations. (afair, it strips more small factors, and the format is a bit different)

First, I built msieve so that it worked straight with them. After a few hours, I couldn't bear to look at the (crawling) speed of the file import (it was dominated by trial factoring), killed it, wrote a recoder (msieve-based!), then remdups'd the resulting file, and everything was smooth after that.

jrk 2012-08-19 00:05

[QUOTE=Batalov;308455]A radical version of the same idea: keep only a and b values in the file.[/QUOTE]
Even more radical: Keep the lattice basis vectors for each special-q root and you can write each relation in lattice coordinates to save even more space.

Dubslow 2012-08-19 01:02

Here's a somewhat facetious reason to use base 36: Now we can go hunting for funny words spelled out by the factors. :smile:

Dubslow 2012-08-19 07:19

Hmm, less promising than I thought.
[code]bill@Gravemind:~/yafu/rsals∰∂ wc -l nfs.dat
93193867 nfs.dat
bill@Gravemind:~/yafu/rsals∰∂ wc -l rels36
91974259 rels36
bill@Gravemind:~/yafu/rsals∰∂ wc -c nfs.dat
11683340634 nfs.dat
bill@Gravemind:~/yafu/rsals∰∂ wc -c rels36
10045856817 rels36
bill@Gravemind:~/yafu/rsals∰∂ head -n 2 nfs.dat
353566,832085:5200a0f,af8df,20bc63,2912c0b,3114f59,1087,751:34dd2953,18cf85c1,1aa0f1,1159,3,709,1360f1f
1337012,1463419:199c69b1,1ce65701,5bc9,96fd,3ab829,d,29,17f,5ab,13:a47a68b,2ee792d,dea7,3d444f,3,5,11b,11,11,13,1360f1f
bill@Gravemind:~/yafu/rsals∰∂ head -n 2 rels36
353566,832085:1F6Z2N,FEU7,19ZDV,PN3T7,UN3H5,39J,1G1:EO1JKJ,6VTR5T,11EK1,3FD,3,1E1,C3J1B
1337012,1463419:73TK1D,80O6IP,I4P,TTP,2AHBD,D,15,AN,14B,J:2UOL1N,TA5V1,17ZB,2E24V,3,5,7V,H,H,J,C3J1B
bill@Gravemind:~/yafu/rsals∰∂ remdups4 nfs.dat
Counting rels in nfs.dat. This might take a few minutes for large rel files.
Found 93193103 unique, 5 duplicate (0.0% of total), and 107 bad relations.
Largest dimension used: 362 of 930
Average dimension used: 284.4 of 930
bill@Gravemind:~/yafu/rsals∰∂ bzip2 nfs.dat
bill@Gravemind:~/yafu/rsals∰∂ wc -c nfs.dat.bz2
5130787706 nfs.dat.bz2
bill@Gravemind:~/yafu/rsals∰∂ bzip2 rels36
bill@Gravemind:~/yafu/rsals∰∂ wc -c rels36.bz2
5035796294 rels36.bz2[/code]
(Don't be fooled, the conversion took ~2 hrs, and each of the zips took around an hour.)

Uncompressed 36/hex ratio: 85.98%
Compressed 36/hex ratio: 98.15%
36 un/compressed ratio: 50.13%
hex un/compressed ratio: 43.92%

So it seems I was right about one thing: adding the extra characters made the compression less effective.

Batalov 2012-08-19 07:31

Sounds about right. This is because if gzip does its job right, then the gzipped size is the "raw information content" size.
See, the hex file has more air, 36-base has less air (in both cases, there are unused bits as well as abundant commas), but gzip squeezes air out and the bit stream remains.

Now, bzip2 and 7zip will compress better still (at the expense of much longer time) and the resulting compressed sizes may be closer still. What you did is a small pre-compression. Information (or essentially random data) is incompressible. The only way to make compressed files really smaller is to give up some information and pay back later (pay back with time) to restore the sacrificed information.

fivemack 2012-08-20 10:05

[QUOTE=Batalov;308443]Tom, have you thought about using 16e siever and a much lesser sieving area? At q~268M, the 16e siever is only 10% slower and produces [STRIKE]3x[/STRIKE]* >2x more relations (so the sieving area would be way shorter). It does take 1.5G of memory compared to 1.1G for the 15e. [/QUOTE]

Yes, I ran 0.1% of the job with both 15e and 16e; in my tests 16e was significantly slower to get to 400 million relations (10% on this project is more than a CPU-year!), and since it also uses significantly more memory than 15e, I thought 15e would be a better choice.

