mersenneforum.org  

Old 2018-06-09, 14:55   #1
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,233 Posts
Draft proposed standard for Mersenne hunting software neutral exchange format

The idea for a portable neutral exchange format and discussion of it began on http://www.mersenneforum.org/showthr...493#post489493, with considerable contribution by Preda, including the original idea and sketch of a possible format, and some contributions by ewmayer and kriesel. (Possibly others; I don't recall.) The idea of speeding up confirmation of a new prime discovery by running in parallel from a collection of interim save files has been kicking around for a while. A neutral exchange capability would aid that.

In the quote insert below is a draft I've put together quickly from that discussion and some further thought on a possible standard.

Creating a draft standard raises for me the question of who gets to declare it a standard or conversely discourage the development of neutral file exchange capability. I think that would be the Mersenne Research Inc. board. Or, as things often go informally and by volunteer action here, authors or maintainers of the various Mersenne hunting software programs adopt it or something similar. Another possible way to approach exchange is to create separate standalone export and import utilities. I would be very interested in Prime95's thoughts on the pros and cons of a draft standard and such capabilities, as well as other authors of relevant code. Should standalone translators be discouraged, as possible threats to data integrity or provenance of the data?

Quote:
Mersenne neutral exchange file format early draft description

1 The following is a draft proposal for a program-independent file exchange format standard. Goals are to
a) promote progress in development and development debugging, of Mersenne prime hunting software,
b) aid effective use, and use debugging, of user runs,
c) provide a more human-readable file format,
d) enable expedited confirmation runs for newly discovered primes, by allowing parallel runs from start and a collection of interim save files to the next and to completion, respectively,
e) provide well documented and somewhat self-documented file format(s) with a lot of commonality among disparate programs.

2 Name of the format is Mersenne Neutral Exchange (MNE, pronounced "money")
The format consists of two sections, header, and bulk data.

3 The header consists of ASCII text only, no binary data. The header is a line oriented multirecord structure. Header information is in English if possible. Certain keywords are in English as is common practice, regardless of whether some other language is identified in a NOTE record.

3.1 All header information is in ASCII text format, with commas, spaces, and surrounding "" if names contain spaces, for readability by people. Numerical values are expressed in ASCII decimal or ASCII hexadecimal, most significant digit or nibble first. ASCII hexadecimal is labeled with a leading 0x. Full implementation is application-specific and is preferred. Interim incomplete implementations are preferable to nonimplementation.
Items not yet implemented are to be identified as such in ASCII text containing a description of the parameter and containing the text " not implemented". Examples follow below.

3.2 Record separators are the default for the OS on which the file is produced.

3.3 First line of the header is a record giving the name of the format, and version of the format.
Example: "Mersenne Neutral Exchange Format V1.0"

3.4 Second line is the program name, program version, source data file name, date and time of creation of the neutral file, and optionally the full path to the directory location of the file converted. Outermost quotes do not appear in the file.
Examples: "gpuOwL, v2.2, 149447533.130000000.owl, "2018-06-09 15:37:23 UTC""
"CUDALucas, v2.06beta, c149447533., "2018-06-09 15:37:23 UTC""
"CUDALucas, v2.06beta, s83626469.83500000.29e93f858b93b02e.cls, c:\Users\Ken\documents\cl-gtx1070"
"CUDAPm1, V0.20, c299500171s1., "date and time not implemented", "c:\Users\Ken\My Documents\gtx1070-cudapm1""
"Mlucas, V17.0, "...
etc.

3.5 Third line is the type of test and subtype if applicable, in ASCII text.
Examples: "Type LL"
"Type PRP"
"Type PRP type 1"
"Type PRP-CF type 5"
"Type P-1 stage 1"
"Type P-1 stage 2"
"Type ECM stage 1"
"Type ECM stage 2"
etc

3.6 Fourth line is the labeled exponent; "Exponent " followed by ASCII decimal value.
Example: "Exponent 149447533"

3.7 Fifth line is the iteration count. (Completed count, except perhaps in P-1 it may be count in progress. Are there other such cases?) The initial seed value is iteration zero. (In LL, the starting 4 is iteration zero. In PRP, the starting 3 is iteration zero.) Format is "Iteration " followed by iteration completed expressed in ASCII decimal. (In P-1, stage 2, Iteration is replaced by Transform)
Example: "Iteration 130000000"

3.8 The next lines hold algorithm-specific parameters as applicable. Any error counts are cumulative for the data contents. Retry from a prior known-good should reset the error counts to the values relevant for that data state.

3.8.1 For Gerbicz-checked PRP, a record is required giving the block size. Format is "Gerbicz Blocksize " followed by the ASCII decimal value.
Example: "Gerbicz Blocksize 500"

3.8.2 For Gerbicz-checked PRP, a record is required giving the error count to date on the exported exponent's run. Format is "Gerbicz Errorcount " followed by the ASCII decimal value.
Example: "Gerbicz Errorcount 0"

3.8.3 For prime95 LL, multiple error counts of the various types to date may be given in a single record.
Examples: "prime95 roundoff 0, sumout 0, Jacobi 0" (a perfect run)
"prime95 roundoff 123 sumout 123 Jacobi 12"
"prime95 roundoff >9999 sumout >9999 Jacobi >9999" (a very bad run)

3.8.4,... (Add other cases here, such as P-1 B1, B2, e, NRP etc; ECM. P-1 is extensive.)

3.9 Optional note lines contain free form text, such as for comments. Note lines may be multiple per file. They must begin with "NOTE ". They may be printed by an importing program, or ignored, or saved for export later, or both printed during import and saved. It may be useful to tag exports for debugging purposes with structured program-generated notes, such as "NOTE OFFSET 123456" or fft length or various other parameters.
It may be useful at times to note the 64-bit residue here, as "NOTE Res64 0x0123456789ABCDEF" or similar.
Another use of Note lines is keeping track of the processing history of work in progress. Examples:
"NOTE Begun on CUDALucas V2.06beta and unable to finish due to bug nn. Exported at iteration 12345678".
"NOTE Imported into Mlucas to complete run"
"NOTE Exported from Mlucas V17.1 at iteration 23456789"
"NOTE Reimport into CUDALucas as test at 23456789"
If there is significant note content in an uncommon language other than English, noting the language is a welcome courtesy for those who may not recognize the language, much less understand any of it without the help of an online translator.
Example: "NOTE Language Macedonian"
Another potential use is program-generated notes such as system name, gpu name or unique identifier if applicable, for traceability of results.

3.10 Third to last line of header is the format of the bulk data. Format is explicitly given as "Data Format ASCII hex" or "Data Format Binary bytes".

3.11 Second to last line of header is a labeled ASCII hexadecimal CRC for the bulk data.
Examples: "CRC32 0x01234567"
"CRC32 not implemented"

3.12 Last line of header is "End of Header".
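
As an illustration of sections 3.3 through 3.12, here is a minimal Python sketch that assembles the header records for one export. The function name build_header and its parameters are hypothetical, the algorithm-specific records of 3.8 are left out, and the use of zlib's CRC-32 over the raw bulk bytes is an assumption, since the draft does not say which CRC-32 variant or which bytes are meant.

```python
import zlib
from datetime import datetime, timezone

def build_header(program, version, source_file, test_type, exponent,
                 iteration, bulk, notes=(), data_format="Binary bytes"):
    """Return the header records of a draft-style MNE file as a list of strings.

    Hypothetical helper: record order follows draft sections 3.3 to 3.12.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    # Assumption: CRC-32 as computed by zlib, taken over the bulk data bytes.
    crc = zlib.crc32(bulk) & 0xFFFFFFFF
    lines = [
        "Mersenne Neutral Exchange Format V1.0",            # 3.3 format name and version
        f'{program}, {version}, {source_file}, "{stamp}"',  # 3.4 program and source file
        f"Type {test_type}",                                # 3.5 computation type
        f"Exponent {exponent}",                             # 3.6 exponent
        f"Iteration {iteration}",                           # 3.7 iteration count
    ]
    lines += [f"NOTE {text}" for text in notes]             # 3.9 optional notes
    lines.append(f"Data Format {data_format}")              # 3.10 bulk data format
    lines.append(f"CRC32 0x{crc:08X}")                      # 3.11 CRC of bulk data
    lines.append("End of Header")                           # 3.12 terminator record
    return lines

header = build_header("gpuOwL", "v2.2", "149447533.130000000.owl", "PRP",
                      149447533, 130000000, b"123456789",
                      notes=["Res64 0x0123456789ABCDEF"])
print("\n".join(header))
```

The records are returned as a list rather than one string so that the caller can join them with whatever record separator section 3.2 ends up requiring.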

4. Bulk data may consist of ASCII hexadecimal only, or binary only. Binary form is an ordered byte stream, most significant byte first. ASCII hexadecimal form is an ordered nibble stream, most significant nibble first. Its content's meaning is defined in the header. (LL residue after n iterations, PRP after n iterations, etc.)
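
A small Python sketch of the two bulk data forms as worded in section 4, most significant byte or nibble first. The function names are hypothetical, and padding the residue to a fixed length of ceil(exponent / 8) bytes is an assumption; the draft does not state a length rule.

```python
def residue_to_binary(residue: int, exponent: int) -> bytes:
    """Binary form: ordered byte stream, most significant byte first."""
    nbytes = (exponent + 7) // 8          # assumed fixed length, zero padded
    return residue.to_bytes(nbytes, byteorder="big")

def residue_to_ascii_hex(residue: int, exponent: int) -> str:
    """ASCII hexadecimal form: two nibbles per byte, most significant first."""
    return residue_to_binary(residue, exponent).hex().upper()

def ascii_hex_to_residue(text: str) -> int:
    """Inverse of residue_to_ascii_hex."""
    return int(text, 16)

# Toy example with exponent 13 (residues fit in 2 bytes).
print(residue_to_ascii_hex(0x0ABC, 13))
```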

5. Following the bulk data is the end of the file.

6. Programs may support either import or export or both. Both is recommended, if only for debugging purposes, with command line options such as -i (sourcefilename) and -e (destination filename) or -import and -export.
For example, gpuOwL -i 149447533.130000000.owl -e 149447533.130000000.owl.txt
6.1 Programs may implement export of program-specific or run-specific parameters, such as offset, as described above in 3.9.

7. Program behavior on import. Input handling must include robust error checking.

7.1 Types not supported by the importing program must be identified by an explanatory message, which should be followed by program termination.
Example: "PRP type import is not supported by CUDALucas V2.06beta. Terminating."

7.2 The importing program should check for the presence and validity of all required elements. If any required element is missing, or determined invalid, an importing program should terminate the import, or give the user the option to proceed or terminate.

7.3 Examples of invalid inputs are negative values for iteration count, CRC, exponent or Gerbicz block length; nonnumeric input for iteration count or exponent or other values, non hexadecimal characters for CRC; unrecognized source program names; iteration counts exceeding the exponent.
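
The checks listed in 7.3 could be sketched in Python roughly as follows. The function name and the message texts are hypothetical, and requiring exactly 8 hexadecimal digits after 0x for the CRC is an assumption based on the "CRC32 0x01234567" example.

```python
import re

DECIMAL = re.compile(r"[0-9]+")           # unsigned ASCII decimal only
HEX32 = re.compile(r"0x[0-9A-Fa-f]{8}")   # assumed: 0x plus 8 hex digits

def check_header_values(exponent_text, iteration_text, crc_text):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    exponent = iteration = None
    if DECIMAL.fullmatch(exponent_text):
        exponent = int(exponent_text)
    else:
        # Covers negative values and nonnumeric input alike.
        problems.append("exponent is not an unsigned decimal number")
    if DECIMAL.fullmatch(iteration_text):
        iteration = int(iteration_text)
    else:
        problems.append("iteration count is not an unsigned decimal number")
    if crc_text != "not implemented" and not HEX32.fullmatch(crc_text):
        problems.append("CRC32 is not 0x followed by 8 hexadecimal digits")
    if exponent is not None and iteration is not None and iteration > exponent:
        problems.append("iteration count exceeds the exponent")
    return problems

print(check_header_values("149447533", "130000000", "0x01234567"))
print(check_header_values("-5", "12", "0x01234567"))
```

Returning a list of problems, rather than stopping at the first one, leaves it to the importing program whether to terminate or to offer the user the choice described in 7.2.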

7.4 Programs must support import of at least one bulk data form, and should support both binary byte and ASCII hexadecimal.

8. Program behavior on export

8.1 Output must include the required elements described above, resulting in a file valid for import by this program or another compliant program.

8.2 Preservation and output of any generated or imported NOTE content is recommended, along with additional NOTE information applicable for the export.

8.3 Support for export of ASCII hexadecimal is preferred for readability.

9. Internal representation
Programs may implement internal representations similar to the neutral format. Implementing identical internal representation is discouraged. Efficiency, readability or data integrity could suffer if implementation is identical. For example, binary form of work in progress will have efficiency advantages, while ASCII hexadecimal will have readability advantages in the exchange format file, and little-endian versus big-endian efficiency advantages may also be present depending on the system hardware.

(end)
Old 2018-06-09, 17:52   #2
Uncwilly
6809 > 6502
 
"""""""""""""""""""
Aug 2003
101×103 Posts

25A616 Posts

Not writing any of the code that tests Mersenne numbers, you can ignore my comments if you wish.

For P-1, bounds should be recorded in a common format and location. I don't know if any software other than Prime95 does the BS extension and if that would make a difference in a save file.

Why not include TF in the format? It takes up almost no space, but if we are seeking to unify PRP, P-1, and LL, why not add TF?

In the NOTE section: if software imports a file from a different version or program, it should write the original line #2 (your section 3.4), and iteration line into the NOTE section. If a different program is used to continue, it should preserve the previous NOTE and add the appropriate lines. Maybe with a tag that indicates that it is the history of the test. (I will use HISTORY in my example.)

Example of original file:
Code:
Prime95, v29.4b
Type LL
Exponent 332191111
Iteration 100000
Example after CUDALucas resumed from that file:
Code:
CUDALucas, v2.06beta, c149447533., "2018-06-09 15:37:23 UTC"
Type LL
Exponent 332191111
Iteration 1000400
......
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000
Example after MLucas resumed from that file:
Code:
Mlucas, V17.0
Type LL
Exponent 332191111
Iteration 30003300
.....
NOTE HISTORY CUDALucas, v2.06beta, c149447533., "2018-06-09 15:37:23 UTC"
NOTE HISTORY Iteration 1000400
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000
<edit>What about the file name? Shouldn't that be specified? How will the software know which file to look at?

Maybe Mexponent_number YY-MM-DD hh:mm:ss-ms [tz].MNE so for the above example it would be:
M332191111 18-06-09 02:59:59-2001 +5.MNE

That way the save files would have unique names and the software could find the latest, even if the original file dates and times are lost in transfer and copying.

Last fiddled with by Uncwilly on 2018-06-09 at 18:05
Old 2018-06-09, 18:59   #3
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

10100011100012 Posts

Quote:
Originally Posted by Uncwilly View Post
Not writing any of the code that tests Mersenne numbers, you can ignore my comments if you wish.

For P-1, bounds should be recorded in a common format and location. I don't know if any software other than Prime95 does the BS extension and if that would make a difference in a save file.

Why not include TF in the format? It takes up almost no space, but if we are seeking to unify PRP, P-1, and LL, why not add TF?

In the NOTE section: if software imports a file from a different version or program, it should write the original line #2 (your section 3.4), and iteration line into the NOTE section. If a different program is used to continue, it should preserve the previous NOTE and add the appropriate lines. Maybe with a tag that indicates that it is the history of the test. (I will use HISTORY in my example.)

Example of original file:
Code:
Prime95, v29.4b
Type LL
Exponent 332191111
Iteration 100000
Example after CUDALucas resumed from that file:
Code:
CUDALucas, v2.06beta, c332191111., "2018-06-09 15:37:23 UTC"
Type LL
Exponent 332191111
Iteration 1000400
......
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000
Example after MLucas resumed from that file:
Code:
Mlucas, V17.0
Type LL
Exponent 332191111
Iteration 30003300
.....
NOTE HISTORY CUDALucas, v2.06beta, c332191111., "2018-06-09 15:37:23 UTC"
NOTE HISTORY Iteration 1000400
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000
Thanks for your input.

My intent was not to unify the various computation types, but to include in the draft standard the ones where there's sufficient utility and interest. And generate discussion of what would be most useful. It's one draft describing several different neutral exchange file formats, for the very different computation types, which share some common elements.

The Exponent is constant, as is the computation type, during the life of a particular exchange file. Stage, iteration, etc may change as computation progresses, but exponent and computation type do not. Each exchange file is for communicating one exponent's state of progress, of one computation type. CUDALucas includes the exponent number in the checkpoint file name.

CUDAPm1 v0.20 does P-1 stage 1 and stage 2, and the B-S extension. There are up to 9 or 25 32-bit unsigned integer words beyond the long binary data, in its storage format, including several reserved words. It's not simple.

TF may be included at some point, if the standard goes forward. TF was not included for four reasons. One, it took a while to write without it, and quickly became long.
Two, I wanted to get some feedback before I went further, especially if the idea doesn't fly.
Three, the run time represented by a TF checkpoint is generally smaller than LL or PRP.
Four, the data size is much smaller for TF; bytes vs. megabytes. One could muddle through with a hex editor on a TF file, or even Notepad. Here's what one contained, more or less, for M651m: "651302251 80 81 4620 0.20: 392 0 056C1BA7" (A few chars have been changed here) I think those fields are exponent, starting bit level, next bit level, class count, mfaktc version, current class, ??, ??. Approx 42 bytes, versus the tens of megabytes for P-1 or LL or PRP of 100-megadigit exponents.
Mfakto has a very similar format. Something like "580700307 80 81 4620 mfakto 0.15pre6-Win: 525 0 57C49BFA". Mfakto was derived from mfaktc, hence the close similarity.
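
To show how little there is to a TF checkpoint, here is a rough Python sketch that splits the two sample lines above into fields. The function name and the dictionary keys are hypothetical, and the field meanings are only the guesses given above (the last two fields are kept as unknown); this is not taken from the mfaktc or mfakto source.

```python
def parse_tf_checkpoint(line: str) -> dict:
    """Split an mfaktc/mfakto-style checkpoint line into guessed fields."""
    left, right = line.split(": ", 1)
    # Before the colon: four numbers, then the program/version text.
    exponent, bit_start, bit_next, class_count, *version = left.split()
    # After the colon: current class and two fields of unknown meaning.
    current_class, unknown_a, unknown_b = right.split()
    return {
        "exponent": int(exponent),
        "bit_start": int(bit_start),
        "bit_next": int(bit_next),
        "class_count": int(class_count),
        "version": " ".join(version),
        "current_class": int(current_class),
        "unknown_a": unknown_a,
        "unknown_b": unknown_b,
    }

print(parse_tf_checkpoint("651302251 80 81 4620 0.20: 392 0 056C1BA7"))
print(parse_tf_checkpoint("580700307 80 81 4620 mfakto 0.15pre6-Win: 525 0 57C49BFA"))
```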

I came at this from the point of view of having read engineering data format standards (IGES, STEP, etc.) and, as a user and CAD system manager, having seen the inevitable differences in implementations from one software package to the next. I occasionally wrote "IGES standard" to "IGES standard" translators, since some software subsetted the standard so much that conversion from software A to IGES to software B was poor.

I like your idea of a program generated NOTE HISTORY subcase. I'd probably make it a multivalue record, perhaps something like
"NOTE HISTORY Prime95, v29.4B, Start iteration 100000, End iteration 150000".
Then other runs, on Prime95 or elsewhere, could read chronologically line by line.
Or maybe it merits its own case: HISTORY.
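
A sketch in Python of writing and reading such a multivalue history record, assuming exactly the layout of the example line above. The names HISTORY, history_record and parse_history are hypothetical.

```python
import re

# Assumed layout: "NOTE HISTORY <program>, <version>, Start iteration N, End iteration M"
HISTORY = re.compile(
    r"NOTE HISTORY (?P<program>[^,]+), (?P<version>[^,]+), "
    r"Start iteration (?P<start>[0-9]+), End iteration (?P<end>[0-9]+)"
)

def history_record(program, version, start, end):
    """Format one history record for the header."""
    return (f"NOTE HISTORY {program}, {version}, "
            f"Start iteration {start}, End iteration {end}")

def parse_history(line):
    """Return (program, version, start, end), or None for any other record."""
    m = HISTORY.fullmatch(line)
    if m is None:
        return None
    return (m["program"], m["version"], int(m["start"]), int(m["end"]))

line = history_record("Prime95", "v29.4B", 100000, 150000)
print(line)
print(parse_history(line))
```

Since ordinary NOTE lines simply fail to match, an importer could collect the history records in file order and pass every other NOTE through untouched.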

Last fiddled with by kriesel on 2018-06-09 at 19:11
Old 2018-06-09, 19:25   #4
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,233 Posts

Quote:
Originally Posted by Uncwilly View Post
What about the file name? Shouldn't that be specified? How will the software know which file to look at?
That's what the program extension to support the command line option -import <filename> is for.
Expanding a bit, I think by design, exchange file import and export should not be an automatic function. I feel it should require user intervention, and be infrequent.

The exchange formats are not intended as substitutes for the programs' native file format(s).
Exchange files are fundamentally different from a worktodo.add file, which is supposed to be found and processed automatically. They're also fundamentally different than routine savefiles, which are supposed to be generated automatically every n iterations.

Exchange files are supposed to not be processed automatically or generated automatically. The use cases are the exceptions, not the rule, not production running.

Time/date stamp of file creation (export) is among the data included in the header; line 2 in the exchange file, 3.4 in the draft standard.

Last fiddled with by kriesel on 2018-06-09 at 19:30
Old 2018-06-09, 22:51   #5
ewmayer
2ω=0
 
Sep 2002
República de California

7·1,663 Posts

Hi, Ken:

Thanks for taking the initiative on this. Couple of questions:

1. Is there any good reason to save P-1 stage 2 residues? It was always my understanding that only saving the stage 1 residue was important, since a stage 2 run for any desired stage 2 prime [plo,phi] bounds needs only the stage 1 residue, i.e. if another user did some stage 2 work starting from the same stage 1 residue and covering (say) a lower numeric stage 2 interval, any resulting stage 2 residue is irrelevant to anyone else's stage 2 work.

2. "Binary form is an ordered byte stream, most significant byte first." -- If I'm writing/reading a bytewise residue to/from a file, I have the residue in a byte array and just do fwrite/fread of that, which results in a low-to-high byte ordering. Since for these kinds of residue data every byte is as important as every other (i.e. there is no relative-significance hierarchy in terms of data integrity), why make the required coding more awkward than it needs to be?
Old 2018-06-10, 02:27   #6
preda
 
"Mihai Preda"
Apr 2015

55916 Posts

Quote:
Originally Posted by kriesel View Post
3.5 Third line is the type of test and subtype if applicable, in ASCII text.
Examples: "Type LL"
"Type PRP"
I see two distinct ways to identify the field: one is the line number ("third line"), and the other is the name of the field ("Type"). I think we should settle for one or the other. (my preference would be for field-ids (being explicit), not line position). (If the line position is used, then no field-id should be included).

I'll give it more thought.

Last fiddled with by preda on 2018-06-10 at 02:27
Old 2018-06-10, 02:31   #7
preda
 
"Mihai Preda"
Apr 2015

372 Posts

Quote:
Originally Posted by ewmayer View Post
2. "Binary form is an ordered byte stream, most significant byte first." -- If I'm writing/reading a bytewise residue to/from a file, I have the residue in a byte array and just do fwrite/fread of that, which results in a low-to-high byte ordering. Since for these kinds of residue data every byte is as important as every other (i.e. there is no relative-significance hierarchy in terms of data integrity), why make the required coding more awkward than it needs to be?

Yep, I think we all store "byte-0" (least significant) first, so probably we'll do the same in the save-file as well. I don't think there's any problem here (just need to update that bit of spec).
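
For anyone following along, the difference between the two orderings is easy to see in a few lines of Python. The 64-bit value and the helper name read_residue are made up for illustration; only int.to_bytes and int.from_bytes from the standard library are used.

```python
residue = 0x0123456789ABCDEF

low_first = residue.to_bytes(8, byteorder="little")   # "byte-0" first
high_first = residue.to_bytes(8, byteorder="big")     # most significant byte first

print(low_first.hex())    # bytes in the order a low-to-high array dump would give
print(high_first.hex())   # bytes in the order the draft text currently asks for

# One ordering is simply the reverse of the other.
assert low_first == high_first[::-1]

def read_residue(raw: bytes, byteorder: str) -> int:
    """Rebuild the integer, given which ordering the file uses."""
    return int.from_bytes(raw, byteorder)
```

Either ordering round-trips cleanly as long as the reader knows which one the file uses, so the point is mainly that the spec has to name exactly one.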
Old 2018-06-10, 13:39   #8
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

10100011100012 Posts

Quote:
Originally Posted by ewmayer View Post
Hi, Ken:

Thanks for taking the initiative on this. Couple of questions:

1. Is there any good reason to save P-1 stage 2 residues? It was always my understanding that only saving the stage 1 residue was important, since a stage 2 run for any desired stage 2 prime [plo,phi] bounds needs only the stage 1 residue, i.e. if another user did some stage 2 work starting from the same stage 1 residue and covering (say) a lower numeric stage 2 interval, any resulting stage 2 residue is irrelevant to anyone else's stage 2 work.

2. "Binary form is an ordered byte stream, most significant byte first." -- If I'm writing/reading a bytewise residue to/from a file, I have the residue in a byte array and just do fwrite/fread of that, which results in a low-to-high byte ordering. Since for these kinds of residue data every byte is as important as every other (i.e. there is no relative-significance hierarchy in terms of data integrity), why make the required coding more awkward than it needs to be?
1. If there's no intent to extend B2 later (which I've read is complicated, is not clearly even practical, and is not implemented in CUDAPm1), and there's no need to debug the file contents because the run completed, then CUDAPm1 does not keep stage 2 files. It also does not keep stage 1 files if stage 2 completes. That's how it manages its own checkpoint files, which can be considered temporary files. That's entirely separate from savefiles, which are saved for potential restart from any point in case of error, in a specified subdirectory, and accumulate there until removed by the user, if periodically making save files is enabled. And exchange files are separate from both save files and checkpoint files; exchange files would be created either by an extension to the program or by a separate standalone program, and only when the user takes action to cause it, such as by using the -export option of the extended program or running the separate standalone program. (64-bit residues along the way appear in console output, would get logged if there were a log, and are a currently untapped means of detecting and halting on certain errors.)

For production running, there may be (usually) no reason to keep stage one or stage two checkpoint files or savefiles or exchange files after the P-1 run is done. For debugging, or for working around program limitations, there may be. Possibly if a remarkably large factor was found, that passed pseudoprime testing itself, keeping the files for confirmation and proof might be useful.

Running CUDAPm1 as it is, I am finding that for large exponents some runs produce nearly complete stage 1 runs and then the gcd fails. Others get through the stage 1 gcd and then fail early in stage 2, in a variety of ways. These failed run files are being kept, and I hope to finish them later and also examine what's different about them. I wonder what gmp-ecm's input form for that would look like or if there's a better candidate for carrying to completion.

One of the ways CUDAPm1 stage 2 fails is the 64-bit residue in its progress is always a match to the last one from stage 1. This I think occurs when there's not quite enough memory for stage 2. Such a checkpoint file is broken in some way; the repeating residue persists if it's moved to a gpu with much more memory. An earlier checkpoint must be used instead.

2. Point taken. It will be revised. (Murphy and a coin toss?) LSB first is how CUDAPm1 checkpoint files are organized also, if I'm reading the hex editor correctly. (Source indicates 32-bit unsigned ints, and the hex editor shows little-endian storage, which is not surprising.) What's in the checkpoint files is one iteration after the iteration count and printed 64-bit residue, which complicates things.

I knocked out a prototype CUDAPm1 standalone translator since posting the draft. It's terribly memory inefficient, and probably runtime inefficient too, but it has the right number of bytes in the output, looks right to me, and writes LSB first. A rewrite is coming up. The exercise has provided a useful "look under the hood" of the CUDAPm1 checkpoint storage format.

Your commentary is exactly the sort of feedback I was looking for. Thanks Ernst.
Old 2018-06-10, 14:25   #9
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,233 Posts

Quote:
Originally Posted by preda View Post
I see two distinct ways to identify the field: one is the line number ("third line"), and the other is the name of the field ("Type"). I think we should settle for one or the other. (my preference would be for field-ids (being explicit), not line position). (If the line position is used, then no field-id should be included).

I'll give it more thought.
A little redundancy is a good thing.
1) If there is no redundancy in a format, errors could go undetected.
2) Fixed position means there's no need to search for type. We know where to find it.
3) Unlikely, but if there isn't well-formed type information and nothing else in record three, the file was probably written or transmitted incorrectly.

Having the type record early in the file is useful I think. It tells how to interpret the rest.
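
The redundancy argument can be sketched in Python like this: the importer goes straight to record three, then uses the "Type " label as a sanity check. The function name read_type is hypothetical.

```python
def read_type(records):
    """Return the type text from record three, checking the label as well.

    The fixed position says where to look; the label confirms that what
    was found there is really a Type record.
    """
    if len(records) < 3:
        raise ValueError("header has fewer than three records")
    third = records[2]
    if not third.startswith("Type "):
        raise ValueError(f"record 3 is not a Type record: {third!r}")
    return third[len("Type "):]

records = [
    "Mersenne Neutral Exchange Format V1.0",
    'gpuOwL, v2.2, 149447533.130000000.owl, "2018-06-09 15:37:23 UTC"',
    "Type PRP type 1",
    "Exponent 149447533",
]
print(read_type(records))
```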

Flexibility is also a good thing. The type-specific header items are not all specified as to position. They may have no benefit of a particular type of ordering, as far as programs go.

In the header, a convention on ordering, and clear labeling, I think makes it more user friendly and readable. Having a familiar stable ordering makes it easier to see if something is missing or mangled. For a CUDAPm1 export translator, I used an ordering the same in most respects as the sequential storage of the data in the checkpoint file. That would be convenient for the program authors.

There are some features not present in this spec draft that are common in other standards for somewhat similar purposes. For example, IGES includes counts of how many records there should be in a particular section, stored as a separate record. https://en.wikipedia.org/wiki/IGES#File_format shows the ugly old fixed format and Hollerith character handling of this old format, developed 30 years ago for systems that were not new then. Looks like punch card data from 40+ years ago.
STEP followed. https://en.wikipedia.org/wiki/ISO_10303

Something I'm considering is whether there should be an explicit byte count for the bulk data, included in the header. Something like "Bulk Data Size 12345678". For a given application and computation type, it's perhaps derivable from the exponent & perhaps requiring additional parameters. It might be more convenient if it was there for comparison.
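
For the simplest case, a single residue modulo 2^p - 1, the expected size can be sketched in Python as below. This assumes the residue is stored in ceil(p / 8) bytes, that the ASCII hexadecimal form uses two characters per byte with no line breaks, and that nothing else follows the residue; formats with extra words, such as the CUDAPm1 P-1 case, would need more parameters. The function names are hypothetical.

```python
def expected_bulk_bytes(exponent: int, data_format: str = "Binary bytes") -> int:
    """Expected bulk data size in bytes for one residue mod 2^exponent - 1."""
    nbytes = (exponent + 7) // 8            # ceil(exponent / 8)
    if data_format == "Binary bytes":
        return nbytes
    if data_format == "ASCII hex":
        return 2 * nbytes                   # two hex characters per byte
    raise ValueError(f"unknown data format: {data_format!r}")

def size_record(exponent: int, data_format: str = "Binary bytes") -> str:
    """Format the proposed header record."""
    return f"Bulk Data Size {expected_bulk_bytes(exponent, data_format)}"

print(size_record(149447533))
print(size_record(149447533, "ASCII hex"))
```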

Also, is it worth putting a compact header CRC after the "End of Header"? As things stand now, there's no header CRC, line count, byte count, etc.
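
One possible shape for that, sketched in Python: compute a CRC over every header byte up to and including the "End of Header" record, and append it as one more record. The "Header CRC32" label, the use of zlib's CRC-32, and the "\n" line ending in the sample are all assumptions made for illustration.

```python
import zlib

def append_header_crc(header_text: str) -> str:
    """Append a CRC record covering all of header_text (hypothetical label)."""
    crc = zlib.crc32(header_text.encode("ascii")) & 0xFFFFFFFF
    return header_text + f"Header CRC32 0x{crc:08X}\n"

def header_crc_ok(text_with_crc: str) -> bool:
    """Recompute the CRC over everything before the last record and compare."""
    if text_with_crc.endswith("\n"):
        text_with_crc = text_with_crc[:-1]
    body, _, last = text_with_crc.rpartition("\n")
    crc = zlib.crc32((body + "\n").encode("ascii")) & 0xFFFFFFFF
    return last == f"Header CRC32 0x{crc:08X}"

header = "Type LL\nExponent 332191111\nEnd of Header\n"
signed = append_header_crc(header)
print(signed)
print(header_crc_ok(signed))
```

A CRC over the raw header bytes would stop matching if the line endings were converted in transit, so this idea interacts with the choice of record separator.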

Data transmission has become much more reliable over the past 30 years, so maybe we can be a bit more relaxed about redundancy. But data sizes have gone up a lot. 77M bits vs 110k in 30 years. http://primes.utm.edu/mersenne/index.html#known

Last fiddled with by kriesel on 2018-06-10 at 14:31
Old 2018-06-10, 15:28   #10
M344587487
 
"Composite as Heck"
Oct 2017

5×7×23 Posts

For simplicity consider locking down the newlines in the header to \n only (excluding \r\n), and end the header with 0x00. Easily extracting the header as a C-string and separating header from data is a useful feature IMO.
Old 2018-06-10, 16:24   #11
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,233 Posts

Quote:
Originally Posted by M344587487 View Post
For simplicity consider locking down the newlines in the header to \n only (excluding \r\n), and end the header with 0x00. Easily extracting the header as a C-string and separating header from data is a useful feature IMO.
Thanks for your input. As often happens, it's led to me thinking about some other aspects too.

I'll think about the newlines.
I'm guessing what you suggest for header line separators would be popular with linux users, and unpopular with Windows users, for whom the absence of 0d 0a means there's no line wrap occurring, until each file is opened in a suitable editor and saved again (which is likely to hose the binary data following). The opposite case, \r\n regardless, would mean linux systems would show double-spaced header lines. Linux versions of import software could simply ignore null length header lines in files received from Windows.
The current, OS-dependent termination, allows a common use case, exchange between disparate programs on the same system, and user review of the files, without such nuisances. I'm in the mostly-Windows camp. Part of the motivation for defining the draft as I did is to make it user-friendly. (That's why there's an ASCII hexadecimal choice for the bulk data.) I'd rather not get drawn into the \n vs \r\n and OS supremacy & popularity battles. https://stackoverflow.com/questions/...-and-r#1761086 (I had an instance of mfakto on Windows 10 exhibiting the \n only behavior, and it's a real nuisance, ongoingly, to look at its logs; things are not in columns as they should be, and it's much less readable, as a result. Mfakto does not do that on Windows 7.)
Perhaps the exchange software on all OSes ought to be prepared to deal with both header record termination cases as input, and translate to its host OS convention automatically for local user convenience, as some modern text editors do.
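
Accepting both terminations on input is not much code. A Python sketch, with hypothetical function names: split the header on either \n or \r\n, drop empty records, and write records back out with the host convention via os.linesep.

```python
import os

def header_records(header_bytes: bytes) -> list:
    """Split header bytes into records, accepting \\n or \\r\\n terminators."""
    text = header_bytes.decode("ascii")
    records = text.replace("\r\n", "\n").split("\n")
    # Drop null-length records, e.g. the one after the final terminator.
    return [r for r in records if r != ""]

def to_host_text(records) -> str:
    """Join records using the line ending of the OS this runs on."""
    return os.linesep.join(records) + os.linesep

unix = b"Type LL\nExponent 332191111\nEnd of Header\n"
windows = b"Type LL\r\nExponent 332191111\r\nEnd of Header\r\n"
print(header_records(unix) == header_records(windows))
```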

Since the header being readable ASCII ensures 0x00 won't be present there, I see no immediate issue with having a 0x00 byte between the End of Header record and its record termination byte(s) and the bulk data. I suppose I should nail the header character set down some; no control characters (bell, tab, del, esc, etc.) in the header section except record terminations.
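
With a 0x00 byte after the header, separating the two sections is a single split at the first nul. A Python sketch with a hypothetical function name; it relies on the header never containing 0x00, as described above, while the bulk data may contain any number of them.

```python
def split_exchange_file(raw: bytes):
    """Return (header_text, bulk_bytes), splitting at the first 0x00 byte."""
    header, sep, bulk = raw.partition(b"\x00")
    if not sep:
        raise ValueError("no 0x00 header terminator found")
    # Decoding as ASCII also rejects any non-ASCII byte in the header.
    return header.decode("ascii"), bulk

# Tiny made-up file: two header records, the nul, then binary data
# that itself starts with a zero byte.
raw = b"Type LL\nEnd of Header\n\x00\x00\x01\x02"
header_text, bulk = split_exchange_file(raw)
print(repr(header_text), bulk)
```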

I don't think there's interest in having the binary data aligned to a quad-byte boundary. If there were, it could be accomplished with a varying number of nul bytes. A single one is simpler.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.