Chest compression game
I'm interested in chess and compression, so the compression of chess games is a natural thing to nerd out on. The PGN format and the algebraic notation it contains is human readable but spaceinefficient, I'm noodling around trying to minimise the footprint of the moves from a set of games. As this is just for fun only standard complete games are considered, and metadata is ignored as that goes into normal compression which is not the point of the exercise.
Here's the representations so far: [LIST][*]SG4  Standard game representation format used in SI4 for comparison, the database format of scid[*]PGN  Standard PGN format[*]ANA  Newline delimited algebraic notation, PGN without the metadata for a fairer comparison[*]MLA  Move list array  Games are represented as a list of moves where each move is two bytes (source and destination, the upper two bits of destination denote promotion target if promoting)[*]ILA  Index list array  At each position all valid moves are generated, and the index of the move is stored, allowing a single byte to represent the move[*]IPA  Packed index list array  A list of indexes where each index only takes up as many bits as required instead of always a byte. For example if there are only 13 valid moves in a position only 4 bits are required to store the index.[/LIST] Here's comparisons using a corpus of 4096 random games (random as in a standard game where every valid move is chosen at random, ending decisively or in a draw by the 50 move rule), where PGN metadata is minimal boilerplate to pass scid validation: [code]Format Filesize %ofANA XZsize %ofANA.xz SG4 1722293 17.9% 1289764 44.5% PGN 9996400 103.9% 2907876 100.3% ANA 9623664 100.0% 2897808 100.0% MLA 3321282 34.5% 2139160 73.8% ILA 1664735 17.3% 1015896 35.1% IPA 915582 9.5% 910660 31.4%[/code]There's a potential optimisation to be made in the ordering of valid moves in a position, if more likely valid moves are listed earlier then chessunaware compressors can perform better as it skews index frequency (assuming real games not random). The proper way to do this would be to use stockfish, probably more trouble than it's worth for noodling but potentially a strong step albeit computationally expensive. None of the above representations consider redundancy across games (SG4 unknown), openings being the low hanging fruit (again, for real nonrandom games). There's a few obvious possibilities but the one that seems most suitable is to to store a serialised tree structure efficiently encoding the openings, not the entirety of all games but enough to encode the easy redundancy without incurring too much overhead. Above is pretty much everything considered so far, there may be avenues or existing solutions I'm oblivious to which would be interesting to learn about. How can it be refined further or differently? 
Perhaps adding Huffman encoding on top of IPA can show some improvement.
And one step further is full algebraic encoding as is used for compression in space probes where every bit sent is expensive. 
I'm not familiar with huffman coding with input symbols of variable width, it has potential particularly when smart ordering is implemented. On the other hand when smart ordering is implemented the ILA representation also benefits, a real compressor can probably benefit more than whatever I can cook up even given the headstart of a packed representation. There is something that might tip the scales in huffmanencoded IPA's favour, and that's the fact that the IPA representation is not optimal. On average a packed index is something like half a bit inefficient, in the example above 4 bits to represent 13 states is wasting 3 states. I was considering using the remaining states to encode multiple moves in a single go, the most viable followups to the most viable current moves, but I don't think the current state of my cranklevelspaghetticode is up to the task. Huffman or similar might work and would be simpler.
A rudimentary ordering might work well enough that stockfish and all the drawbacks associated with using it isn't worth the effort. The ordering could be something like: [LIST][*]Castling, when available it's normally done pretty quickly[*]Taking a piece with no recourse[*]Taking a higher value piece with recourse[*]Taking an equal value piece with recourse[*]Moving a piece to a safe square[*]Moving a piece to a square covered by the opponent[*]Taking a lower value piece with recourse, sacrifice plays are relatively rare[/LIST] That should give a compressor some nicer value distributions to chew on. I'm open to suggestions for refinement, this is just easily implementable as cover masks have already been implemented for validating moves. 
Switched to Caissabase for testing, a database of over 4 million real games with strong players. To do that PGN ingest is implemented, although it's far from robust it can import all games from Caissabase excluding FEN, variants, the odd games with  denoting an unknown move and the 5 games with illegal castling moves. It's apparent that the engine needs some optimising to handle this many games quick enough for the current formats let alone the next formats to try, regardless here's rough preliminary data:
[code] Format Size(MB) %ofANA XzSize %ofANA.xz SG4 508 26.7% 277 60.5% PGN 2946 155.0% 610 133.2% ANA 1901 100.0% 458 100.0% MLA 685 36.0% 329 71.8% ILA 307 16.1% 150 32.8% IPA 234 12.3% 193 42.1% [/code]xz compressed ILA beats xz compressed IPA showing that IPA is far from optimal. The decent compression of IPA by xz also highlights that the IPA representation needs work. The file format of ILA has been reworked into a Cstring instead of the former representation which used a short as a move count. TODO: [LIST][*]Optimise masks[*]Smart ordering as described above, doable once masks optimised[*]Tree format to deduplicate openings. This is largely done just waiting on some other components[*]FEN? Setting board state to/from FEN is trivial, the part that needs some thought is how to encode it into an archive format[*]Chess960? If FEN is supported it's probably not much effort to support this mode, just some reworking of the castling rules which is the only thing different from a standard game AFAIK[*]Metadata preprocessor. There's no point reinventing standard compression, but a simple preprocessor that greatly reduces the input to a standard compressor could make sense. Encode tags, group related strings, etc[/LIST] 
[QUOTE=M344587487;570138][*]IPA  Packed index list array  A list of indexes where each index only takes up as many bits as required instead of always a byte. For example if there are only 13 valid moves in a position only 4 bits are required to store the index.[/LIST][/QUOTE]
You can reach almost the optimal log(13)/log(2) bits difficulty when you have 13 possibilities for a move. The idea is that for each p number of possibility maintain a seperate array to extract the moves. Say for 13 possibilities: [CODE] unsigned int tab13[100000]; [/CODE] Since 13^121<(2^32)^14 in only 14 ints you can encode 121 chess moves where in each move you have only 13 different valid chess moves. So you need (32*14)/121=3.7025 bits and not 4 bits. Ofcourse going higher you can get a closer distance to the optimal log(13)/log(2)=3.7004 bits, and optionally you can use only 1D array to compress all these info into a single array instead of using large number of tab2,tab3,tab4,..,tab13,tab14,.. arrays. 
Even easier way if you want to encode only a single/few game:
use a generalized number system, say in the ith move you have only B[i] valid moves you have chosen the r[i]th move from these. And the game had L move then define x=r[0]+r[1]*B[0]+r[2]*B[0]*B[1]+... from this number you can easily extract the r[i] numbers: r[0]=x mod B[0] (and doesn't depend on future moves r[1],r[2],..) r[1]=(xr[0])/B[0] mod B[1] etc. with this you would need to use ceil(log2(B[0])+log2(B[1])+..+log2(B[L1])) bits of memory, hence if you are using this encoding for each game then you'd lose at most 1 bit per game, in average 0.5 bit. 
Thank you, the generalised number solution is nice I'll implement it with GMP. If smart ordering is assumed a standard compressor compressing the ILA form should beat it, but it by far beats the IPA representation. If handling each game individually I think it can be archived as a length byte followed by the number stored in <length> bytes; 99% of games in Caissabase can fit in this form, the remaining few hundred can be encoded with the length as a varint, 0xFF + u16 as the length. The other option is to lean in to the maximal compression goal and encode all games into a single giant number to avoid the ~12 bits of overhead per game (~6MB for Caissabase). That's a lot of mult/div on the multihundred megabyte number that would be Caissabase, although they're all divisions by small numbers so maybe it's viable.
To be uniquely decodable indexed representations need to be altered to encode the end of the game as an option within the index itself. This should be more efficient than more overhead in the form of a turn count per game in the header. 
Yes, forget that above we need to store also the number of bits of x. Here 15 or 16 bits should be enough maybe for all games.

Here's another test with 200,000 random games. ~23.2 million moves (200k of those "end of game") with ~115 moves per game. The random game generator works (slightly) more realistically, king vs king waltz's are now not the norm. The General Number System representation is implemented per game as GNA. ISA is a refined ILA, a Cstring representation of a list of indexes with \0 denoting end of game instead of a turn count. MLA is gone (as a format, it's still heavily used in the engine), replaced with what I'm confusingly now calling ILA. ILA is now a 2 byte per move index + possiblemovecount format that allows for quick conversion to the indexed formats, aka everything except PGN and ANA.
[code] Form Size %ofANA bitspermove  XZSize %ofANA.xz ANA 127480406 100.0% 44.20  41050624 100.0% ISA 23274407 18.3% 8.07  15659384 38.1% IPA 15748818 12.4% 5.46  15543648 37.9% GNA 13948569 10.9% 4.84  13949324 34.0% [/code]GNA is incompressible which is a good sign. For now the overhead is implemented as varint and the numbers byte aligned because it was easy for ~12 bits of overhead per game. As order isn't used a better solution would be to order the games by GNS bit count, put all of the counts together to be easily compressed by diff+RLE (or external compressor) followed by the numbers unaligned. As bit count is stored we could omit the leading set bit per game, 1 bit per game doesn't sound like much but it's 25KB in this test and ~0.5MB for Caissabase. End of game has been encoded but not what the result was, that can be rolled into the GNS for ease at the cost of two bits although in the header it could potentially be huffman encoded to squeeze out a little more juice. The maximum length of x assuming the 50 move rule is adhered to can probably fit in 16 bits (~6000 moves from both players at over 5 bits per move, which is a very generous bpm). Probably ~99.998% of Caissabase has a length of x that can fit within 11 bits, the ~70 remaining games might need 12 bits but I haven't tested it yet. 
Sort version 0 aka no sort, just the natural order from the valid move generator which iterates over the players pieces. Not random but not good.
[code] Form Size %ofANA XZSize %ofANA.xz ANA 1868403142 100.0% 442580176 100.0% ISA 342966732 18.4% 188721016 42.6% IPA 242932132 13.0% 199165684 45.0% GNA 214021318 11.5% 211957748 47.9% [/code]Sort Version 1, which sorts in this order: Kingside castle, Queenside castle, value of a move considering the opponents immediate response. The value is calculated by adding the value of a taken piece, and subtracting the value of the players piece if it can potentially be taken. Not particularly smart but not expensive to calculate, shifts obvious good and bad material moves up and down respectively. [code] Form Size %ofANA XZSize %ofANA.xz ANA 1868403142 100.0% 442580176 100.0% ISA 342966732 18.4% 174789504 39.5% IPA 242932132 13.0% 189283260 42.8% GNA 214005548 11.5% 211950820 47.9% [/code][LIST][*]GNA is slightly compressible relative to the random game test thanks I think to some of the games in Caissabase being repeated[*]GNA with sort v1 is slightly smaller than v0 thanks to generally having smaller indexes leading to sometimes using 1 less bit per game[*]When given a large amount of real games IPA beats GNA as IPA is surprisingly compressible. Each game is byte aligned which may explain some of it as opening redundancy is available[*]Sort v1 is better than anticipated, even the IPA.xz version saved ~10MB over v0 but the ISA.xz was the real winner at ~14MB saved[/LIST] Potential refinement for sort v2 or beyond: [LIST][*]Checkmate seems obvious but expensive to calculate as the opponents next moves need to be generated to test for it. Also by definition not the most lucrative thing to check for[*]Checks on uncovered squares, particularly when the opponent can't respond by blocking with an immediate threat[*]Detecting forks from squares, particularly when the worst material trade is favourable or if a check is involved[*]Even trades when ahead on material is a common tactic, conversely a player is likely to be tradeaverse when down on material[*]Instead of just considering the immediate response, evaluate a trade chain to its logical conclusion. This can be done with individual piece cover masks to count how covered a square is instead of just using a single mask showing what the opponent has covered[/LIST] Hard to evaluate how some of these techniques stack up against each other or how they can sanely be mixed into one algorithm. 
Move sorting early in a game may be improved by pairing the compressor with an opening book.

That's a neat idea, scid has an ECO file defining ~10k opening lines over half of which are over 10 ply long. Not a quick thing to implement but a quick thing to execute and potentially strong.
I've also taken a peek at stockfish. There doesn't seem to be an easy way to generate an ordered list of moves for a position but it's simple enough to give it a position and it returns the best calculated move using the UCI interface interface. With a depth of one some of us might even still be alive by the time all ~1/3 billion positions in caissabase are evaluated in this way. It would certainly accelerate the need for concurrency via OpenMP, something else on the TODO if I'm not bored of this by then. Stockfishbestmove and an opening book can both be integrated with the existing sortv1 ordering so they seem like the natural way to progress. If stockfish is quick enough the plan is to merge stockfish with v1 forming v2, merge ECO with v1 forming v3, and merging all three into v4 to then allow for some juicy comparisons. 
An ordered list of moves for a position can be extracted from stockfish, by setting MultiPV to the number of valid moves of the position before telling SF to evaluate. If stockfish is used at all then using the full ordering is the way to go, otherwise we're throwing away a stronger evaluation for no reason. At depth 1 (where depth is a multiplier) it does require ~1TB of UCI communication to my code and ~35GB to stockfish for Caissabase, if this ~1TB is a problem stockfish can be modified to reduce this by an order of magnitude quite easily. At depth 1 it would take ~50 hours of a single Zen2 core to process Caissabase, so PGN splitting is implemented as a quickanddirty way of being able to parallelise, allowing an 8 core Zen2 to process Caissabase in ~8 hours.
The pgn can be converted to ila with stockfish sort but there's an as yet undiagnosed problem which means that converting from ila fails. Every sort method shares the same code path for nonsort operations, stockfish ordering appears to be deterministic, and the other sort algorithms can be encoded/decoded at will, so the bug hunting spear needs sharpening. There's an extra way to squeeze more juice out of a generic compressor compressing ISA form, an ordering of the games increasing commonalities between adjacent games. Possibly as simple as ordering by the average index per game, next on the TODO list. 
Had a bit of a noodle again with this project. After cursing at past me for creating such an awful spider's nest of code the core engine has been reimplemented to fully use u64 bitboards (before the pieces were stored as u8 positions with bitboards only used partly). The new implementation is at least 10x quicker at generating moves and hasn't been fully optimised yet so that's a win. It makes heavy use of code like this:
[code]while(piece_type){ index=__builtin_ctzll(piece_type);//bitscan //test piece piece_type^=(1ull<<index);//pop }[/code]to pop the positions of all pieces of a certain type to do tests with (the tests use precomputed masks where possible). The implementation only uses basic bit manipulation as I don't know the more advanced concepts, but there's also the possibility of parallelising part of this implementation to AVX2 which may not be as easy with a more complex representation. AVX2 is a tempting rabbit hole even if it's not the best way forwards. If compression is ever touched on again the first port of call will probably be storing the entire set as a deduplicated nary tree using dwarf's representation which seems neat ( [URL]https://eli.thegreenplace.net/2011/09/29/aninterestingtreeserializationalgorithmfromdwarf[/URL] ). Then it's probably back to trying to cheaply rank moves in a position to make them more compressible. The nest of code still exists, it's just been added to so that future me has a reason to drink. 
All times are UTC. The time now is 01:25. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2022, Jelsoft Enterprises Ltd.