mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Data

Reply
 
Thread Tools
Old 2005-09-27, 10:39   #1
HiddenWarrior
 
HiddenWarrior's Avatar
 
Jun 2003
Russia, Novosibirsk

2×107 Posts
Default Best way to store huge data

For example the list of expos with factors, or list of expos with bit levels. What ideas?
HiddenWarrior is offline   Reply With Quote
Old 2005-09-27, 13:30   #2
JHagerson
 
JHagerson's Avatar
 
May 2005
Naperville, IL, USA

3048 Posts
Default

HiddenWarrior,

What do you mean by "huge"? Are you talking about lots of individual data items or are you talking about each data item requiring many bits of storage for its representation?

What do you mean by "best"? Do you have some particular purpose for the data? Are you archiving it or do you need to have it available for quick access by some process?

I don't mean to give you the third degree here, but without some additional information it would be very difficult to provide a meaningful answer to your question.

John
JHagerson is offline   Reply With Quote
Old 2005-09-27, 17:53   #3
cheesehead
 
cheesehead's Avatar
 
"Richard B. Woods"
Aug 2002
Wisconsin USA

22·3·641 Posts
Default

HiddenWarrior,

... and please give rough numerical estimates of size of what you're looking at. For instance, 10^6 items of 10^3 bytes each, or 10^10 items of 10^6 bytes each, or what? Recommendations may vary accordingly. :-)
cheesehead is offline   Reply With Quote
Old 2005-10-04, 09:53   #4
HiddenWarrior
 
HiddenWarrior's Avatar
 
Jun 2003
Russia, Novosibirsk

2·107 Posts
Default

I mean the best way (minimal in size) to store strings like exponent,factor.
For example, there's one way:

To store 29,233 (29=expo, 233=divider) we do the following:
1. Make first byte to show where the symbol ',' is = 3
2. Removing ',' from string and we get 29233
3. Converting it to 256 base (dividing it 256 till the number is <256). Result is: 114, 49.
4. Now we got string: 3,114,49 - we convert it to ASCII symbols and get 3 bytes-string.
Finally, from original 6 bytes we got only 3 bytes (from which we can restore original)! 50% compression for that expo,factor.

Is that method the best? What mechanism is in Human readable format?
HiddenWarrior is offline   Reply With Quote
Old 2005-10-04, 10:23   #5
HiddenWarrior
 
HiddenWarrior's Avatar
 
Jun 2003
Russia, Novosibirsk

2·107 Posts
Default

Ok, in advance, we can extend above algo:
Let we have string, that hold info about expo (e),status (s),value (v):
e...,s,v...
e... is maximum 99 bytes length (Billion Digits project works only with 10 bytes expos in length, so we have a huge ubound limit!)
s is 1 byte. It's values are: 0 = untested expo (in this case v... holds current factorization level), 1 = mersenne (v... holds the mersenne position in oficial tables 1..42 for now :), 2 = has divider (v... holds the divider).
So, we approve the algo:
1: 1 byte is the position of first symbol ',' in the string = len(e...)+1, than we add to that byte s*100 and we get: 1 byte = len(e...)+1+s*(100)
2: We remove ',' from the string and get: e...sv...
3: as the s is already in 1 byte, we remove it: e...v...
4: Converting e...v... to 256-base by dividing: e...v...(base 10)->e..v...(base-256)
5: Converting all received bytes to ASCII chars - it is the result!

Example (first Mersenne + first with divider + first unfactored):

2,1,1
11,2,23
1061,0,60

Counting 1 bytes:
1+1+1*100=102
2+1+2*100=203
4+1+0*100=4

Removing ',':
211
11223
1061060

Removing s:
21
1123
106160

Base-256:
21
4,99
1,158,176

Converting to ASCII and splitting it by any symbol:
102,21^203,4,99^4,1,158,176 (sorry, I can't show here the ASCII - symbols :)

Finally, we got 11 bytes to store original info (it was 21 bytes without CR-LFs)!

At last, we can compress that resulting string!
HiddenWarrior is offline   Reply With Quote
Old 2005-10-04, 13:37   #6
JHagerson
 
JHagerson's Avatar
 
May 2005
Naperville, IL, USA

22×72 Posts
Default Necessary to restore?

HiddenWarrior, You have done a great job of compressing the data if that's all you want to do (for whatever reason). However, if you want to reverse the process and get the original data back from your compressed version, I'm afraid you have discarded too much information. In your discussion, you compute a representation for the position of the commas in your original data, but I don't think you store that number with the output. Also, you could look more closely at the definition of s. You don't need an entire byte to hold all of its values; it can be represented in its entirety by two bits.

As to your assertion that you can compress the resulting strings (last line of your post), you can try to compress the data further, but I would not expect much additional compression because the input data would appear essentially random.

2,1,1 probably would not compress very much because the overhead to store the positions of the commas would be as large (or larger) than the original (plaintext) data.

I would represent the data this way. One lead byte where the two most significant bits represent the status code. 00 would represent unfactored (sieving). 01 would be unfactored (deeper factoring). 10 would represent a Mersenne prime. 11 would be factored.

The rest of the first byte would depend on the status code. If the number is not factored (sieving), the remaining six bits would present the factor level less some minimum value. For example, if you chose not to represent any factorization less than 32, represent 32 as 00 0000, up to 95 as 11 1111. If the number was factored to 96 bits or more, you would represent the number of bytes requred by the factorization in the remaing six bits of the lead byte, and then represent the factor level as a string of bytes. If the number were a Mersenne prime, then its position in the prime table would go into the next six bits. If the number were factored, then you would indicate the number of bytes required by the factor in the last six bits.

If necessary, the factor level or the factor would appear next in binary. Finally, I would represent the exponent in binary. If most of your exponents needed only four bytes to represent, I would take advantage of that information and use leading zeros to create four byte values. If you have a mixture, you might want to define some bits in the first byte to tell you how many bytes you need to represent the number, followed by the number.

I hope that this is helpful. Is this only an academic exercise or does it have some application? I trust I have not just done your homework for you.
JHagerson is offline   Reply With Quote
Old 2005-10-04, 14:21   #7
JHagerson
 
JHagerson's Avatar
 
May 2005
Naperville, IL, USA

C416 Posts
Default

I thought about this some more (sometimes, I do my best thinking in the shower ) and I came up with another representation. I think this one will be bigger for small data but smaller for big data. You would need to try encoding a subset of your data and decide what works better for you.

I am numbering the bits by the power of 2 that each one represents: 76543210.

Byte 1, bits 6 and 7: 00 = unfactored, 01 = Mersenne, 10 = factored
Byte 1, bits 4 and 5: If unfactored or Mersenne, the number of bytes for the factor level or table entry less 1, otherwise set to zero and ignored
Byte 1, bits 0 through 3: the number of bytes for the exponent

If factored, Byte 2 would indicate the number of bytes for the factor, followed by the factor. Otherwise, Byte 2 would start the representation of the factor followed by the representation of the exponent.

Here is an example. I just finished factoring M35665369 to 62 bits. I would store this as follows:

Byte 1: 00 00 0100 (unfactored, 1 byte for factor level, 4 bytes for exponent)
Byte 2: 00111110 (62 in binary)
Bytes 3 - 6: 00000010 00100000 00110101 11011001 (35665369 in binary)

This would handle a Mersenne exponent up to 16 bytes long and a factor up to 255 bytes long.

The bottom line is that the best data structure for compression really depends on the data. I think what I have presented here would do quite well for the representation of numbers while they are being factored. However, it does not do very well with presenting the exponents that have been factored.

Rather than trying to shoehorn both types of data into one representation, maybe you would do better by segregating the factored numbers and representing them separately. Then, you could better apportion the number of bits in Byte 1 to represent the maximum number of bytes for the factors you want to store versus the number of bytes of the exponent. Remember, you can "stretch the bits" by defining a minimum factor or exponent size and storing only the offset (for example, if the minimum size is 2, you can store 2 [as zero] through 17 [as fifteen] in four bits).

Again, I hope that this helps.
JHagerson is offline   Reply With Quote
Old 2005-10-05, 02:48   #8
HiddenWarrior
 
HiddenWarrior's Avatar
 
Jun 2003
Russia, Novosibirsk

2×107 Posts
Default

Ok, that for the discussion! I'll try today to implement your notices and will see if they helped.

Now I have advanced implementation of my method. I'll optimize it a bit and will upload to MP site for everyone to try it.
HiddenWarrior is offline   Reply With Quote
Old 2005-10-05, 03:11   #9
HiddenWarrior
 
HiddenWarrior's Avatar
 
Jun 2003
Russia, Novosibirsk

3268 Posts
Default

--- Now, the finall RC technical task: ---

We have database that store data for GIMPS project:
It contains: unfactored expos, mersennes, dividers, LLT and DLLT - tested expos.
The limits:
eponents are 2 - 9.999.999.999 (i think now there is no need to store huger expos :)
Statuses for expos:
0 - unfactored (having some factoring bit level)
1 - Mersenne (enumerating number if Mersenne's table)
2 - has divider
3 - DLLT (having some factoring bit level)
4 - LLT (having some factoring bit level)
We need to compress it
HiddenWarrior is offline   Reply With Quote
Old 2005-10-05, 03:56   #10
HiddenWarrior
 
HiddenWarrior's Avatar
 
Jun 2003
Russia, Novosibirsk

2·107 Posts
Default

I found if combining bit-compression and byte compression, we can provide extra compression. The algo is known as KBBC (Knoppix Bit/Byte Compression) - is development of VSC.Group Studio 2005. Sorry, but no opened info yet. Here are implemetation for mersennes:

1. First byte is divided into High (4 bit) and Low (4 bit)
Low bit contains:
0000 - Mersenne (1-16) - this case High contains number 1-16 (enumerating position)
0001 - Mersenne (17-32) - this case High contains number 17-32 (enumerating position)
0010 - Mersenne (33-48) - this case High contains number 33-48 (enumerating position). It is current + 6 overtake!

This first 3 Low records contains all info needed to store data for Mersennes 1-48. As we know all of 1-42 of them, no need to store any more data in compression algo.

Last fiddled with by HiddenWarrior on 2005-10-05 at 03:57
HiddenWarrior is offline   Reply With Quote
Old 2005-10-05, 05:46   #11
JHagerson
 
JHagerson's Avatar
 
May 2005
Naperville, IL, USA

3048 Posts
Default

HiddenWarrior, I don't want to ignore your last two posts. However, I have been thinking more about the scheme I proposed earlier and just want to flesh it out a bit more. It might not be so necessary to segregate the factored from the unfactored numbers. I believe I read recently that one of the "relatively small" Mersenne numbers has a smallest factor of more than 500 bits. I was somewhat daunted by the prospect of needing to represent this number. But, I think I may have another idea that allows that to be done.

I would store the Mersenne exponent first as a field five bytes long (2^40 - 1 = 1,099,511,627,775 which is larger than the 9,999,999,999 maximum you stated -- however four bytes is too small and I think you would sacrifice too much performance trying to use only four and one-half bytes for this purpose).

The next byte would store the number status in its top three bits (you defined a total of five status values to store).

If the number is a Mersenne prime, you can use the remaining five bits of the first byte to represent its place in the Mersenne Pantheon (up to 63). If you think we will discover more than 63 Mersenne primes, then don't store that number here. Instead, use the scheme for storing the factoring depth discussed immediately below.

If the number remains in progress, the remaining five bits of the first byte store the number one smaller than the number of bytes required to represent the length of the number. This provides huge flexibility regarding the size of eventually found factors at the expense of some additional overhead.

A zero in the bit field would mean that the factoring level (or Mersenne position) is somewhere between zero and 255 and that number would be stored as a one-byte quantity.

A one in the bit field would mean that the factor or factoring level requires between 2 and 255 bytes to represent. The first byte would contain the length and the next "n" bytes would contain the factor (most likely).

A two in the bit field would mean that the factor requires between 256 and 65535 bytes to represent. The first two bytes would contain the length and the next "n" bytes would contain the factor.

And so on. If the length of a factor required 32 bytes in its representation, we have a new class of problems.

I hope that this is clear and that it might be helpful to someone.

I think it would help if you would please explain what it is you are trying to accomplish.
JHagerson is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
An aliquot sequence with huge, huge, huge tracts of sand !!! garambois Aliquot Sequences 50 2012-01-19 18:25
That is a HUGE factor Dubslow Information & Answers 9 2011-09-09 06:01
New York: Corpse Wheeled to Check-Cashing Store ewmayer Lounge 2 2008-01-15 16:18
store prices Fusion_power Puzzles 7 2003-08-31 01:37
I hope the new Apple store in Chicago opening today... Paulie Software 4 2003-07-16 14:19

All times are UTC. The time now is 19:10.


Tue Oct 26 19:10:25 UTC 2021 up 95 days, 13:39, 0 users, load averages: 2.76, 2.82, 2.67

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.