mersenneforum.org

xilman 2020-03-18 18:41

Unicode
 
There is something not quite right about the treatment of many Unicode characters by the forum software. All code points up to U+007F are handled correctly. Given that this is the range of traditional US-ASCII, that is to be expected.

The pound sterling, £, comes through unscathed. This is U+00A3, and I deduce that everything up to U+00FF is fine.

Likewise, the Euro, €, is handled correctly. This one is U+20AC.

[url]https://mersenneforum.org/showpost.php?p=539762&postcount=34[/url] contains a few characters with code points just above U+12000. One of them is the Sumerian character GAL, which has code point U+120F2. If I type a GAL ( 𒃲 ) into this composition window it now displays perfectly. It also displays perfectly in a Preview window.

In the post referred to above, it is replaced by six U+FFFD characters. These display as a white question mark within a black diamond on my display.
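
As a point of reference, the three test characters mentioned so far need different numbers of bytes in UTF-8, which may matter if some layer in the pipeline only copes with the shorter sequences. The Python sketch below prints the code point, the UTF-8 bytes and the number of UTF-16 code units for each. The helper name `describe` is purely illustrative, and nothing here reflects how the forum software actually stores or serves posts.

```python
# Test characters mentioned in the post, keyed by their Unicode names.
SAMPLES = {
    "POUND SIGN": "\u00a3",
    "EURO SIGN": "\u20ac",
    "CUNEIFORM SIGN GAL": "\U000120f2",
}


def describe(ch: str) -> dict:
    """Summarise how a single character is encoded (illustrative helper)."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_hex": " ".join(f"{b:02x}" for b in utf8),
        "utf8_len": len(utf8),
        # Two code units means a surrogate pair, i.e. a character outside the BMP.
        "utf16_units": len(utf16) // 2,
    }


for name, ch in SAMPLES.items():
    print(name, describe(ch))
```

The pound needs two UTF-8 bytes, the Euro three, and the GAL four. The GAL is also the only one of the three that lies above U+FFFF and so needs a surrogate pair in UTF-16.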

Subsequent posts will attempt a binary search on the range U+2000 through U+12000 in an attempt to discover what the forum software handles correctly and where it is broken. My guess is that the breaking point lies at a binary boundary such as U+4000, U+8000 or U+10000.
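
A search of this kind could be organised along the following lines. This is only a sketch: it assumes the breakage is monotonic (everything below some threshold survives, everything at or above it fails), and `simulated_forum` is a made-up stand-in for actually posting a character and reading it back. The threshold inside it is chosen arbitrarily for the demonstration and is not a finding.

```python
from typing import Callable


def first_broken_code_point(lo: int, hi: int,
                            survives: Callable[[int], bool]) -> int:
    """Return the smallest code point in (lo, hi] that does not survive.

    Assumes survives(lo) is True, survives(hi) is False, and that there is
    a single threshold between them.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if survives(mid):
            lo = mid   # breakage must lie above mid
        else:
            hi = mid   # mid is already broken
    return hi


def simulated_forum(cp: int) -> bool:
    """Hypothetical round trip: pretend everything below U+10000 survives."""
    return cp < 0x10000


boundary = first_broken_code_point(0x2000, 0x12000, simulated_forum)
print(f"first broken code point: U+{boundary:04X}")
```

In a real test the midpoints would need adjusting by hand, since the range includes the surrogate block (U+D800 to U+DFFF) and unassigned code points that cannot sensibly be posted.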

Note that the breakage may lie in any of the software between accepting a post and serving it up to a reader. Given experience with Preview, my guess is that the underlying database is not UTF-8 clean.

xilman 2020-03-18 18:44

[QUOTE=xilman;540069]There is something not quite right about the treatment of many Unicode characters by the forum software. All code points up to U+007F are handled correctly. Given that this is the range of traditional US-ASCII, that is to be expected.

The pound sterling, £, comes through unscathed. This is U+00A3, and I deduce that everything up to U+00FF is fine.

Likewise, the Euro, €, is handled correctly. This one is U+20AC.

[url]https://mersenneforum.org/showpost.php?p=539762&postcount=34[/url] contains a few characters with code points just above U+12000. One of them is the Sumerian character GAL, which has code point U+120F2. If I type a GAL ( ) into this composition window it now displays perfectly. It also displays perfectly in a Preview window.

In the post referred to above, it is replaced by six U+FFFD characters. These display as a white question mark within a black diamond on my display.

Subsequent posts will attempt a binary search on the range U+2000 through U+12000 in an attempt to discover what the forum software handles correctly and where it is broken. My guess is that the breaking point lies at a binary boundary such as U+4000, U+8000 or U+10000.

Note that the breakage may lie in any of the software between accepting a post and serving it up to a reader. Given experience with Preview, my guess is that the underlying database is not UTF-8 clean.[/QUOTE]
Fascinating!

The GAL now displays perfectly in the quoted post (and this one), yet the characters in the referenced post are still incorrect.

Very tempted to go back to that post and edit it.

Uncwilly 2020-03-18 18:52

[QUOTE=xilman;540069][url]https://mersenneforum.org/showpost.php?p=539762&postcount=34[/url] contains a few characters with code points just above U+12000. One of them is the Sumerian character GAL, which has code point U+120F2. If I type a GAL ( 𒃲 ) into this composition window it now displays perfectly. It also displays perfectly in a Preview window.[/QUOTE].

xilman 2020-03-18 20:24

[QUOTE=Uncwilly;540072].[/QUOTE]Thanks. I confirm it still displays correctly in your reply.

However, the GAL now shows as six U+FFFD characters in post #2 yet remains correct everywhere else (for the time being).

Curiouser and curiouser.

Could you report whether you see the same as me in the amantes linguam Latinam post please? Or, for that matter, whether the GAL turns into six U+FFFD characters in any of the above.

Uncwilly 2020-03-18 20:56

[QUOTE=xilman;540082]Could you report whether you see the same as me in the amantes linguam Latinam post please? Or, for that matter, whether the GAL turns into six U+FFFD characters in any of the above.[/QUOTE]I see rhomboids with ? inside each, in that thread and in your quote. Is it your entry method? Alt-#### vs cut-and-paste?

xilman 2020-03-18 21:04

[QUOTE=Uncwilly;540086]I see rhomboids with ? inside each, in that thread and in your quote. Is it your entry method? Alt-#### vs cut-and-paste?[/QUOTE]I use cut and paste when entering the text for the first time. When I quote something I leave well alone. The Alt- method doesn't work on this Linux / Xfce system, nor do I know what its equivalent might be.

The characters you see are the common representation of U+FFFD.

Ah, just found Ctrl-Shift-u-xxxx. Giving it a try ...

Yup, works a treat. Let's see whether the GAL above survives a few round trips.

xilman 2020-03-18 21:18

Nope. It didn't even survive its first exposure. Not that I expected it to be treated any differently.

This is really irritating. The specific case of Sumerian cuneiform is relatively unimportant but it is a symptom of an underlying cause which may be more serious. Something in the pipeline is quite clearly not fully UTF-8 compatible.

(Incidentally, we saw a number of such issues in FlyBase when it migrated circa 2008. It was a real bugger tracking down and fixing them.)

LaurV 2020-03-22 15:27

rhomboids with question marks here too.
Is that (maybe) a local setting I can make? (like fonts to download, or change forum default font?)

xilman 2020-03-22 16:56

1 Attachment(s)
[QUOTE=LaurV;540490]rhomboids with question marks here too.
Is that (maybe) a local setting I can make? (like fonts to download, or change forum default font?)[/QUOTE]Not as far as I know.

To test this possibility I saved the downloaded bytes and examined them with od(1). The glyphs you describe are indeed the U+FFFD characters and not the original cuneiform. Your browser then displays them as such.
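
A check of this kind can also be scripted rather than read off an od(1) dump. The sketch below counts occurrences of the UTF-8 encoding of U+FFFD and of the GAL sign in a byte string. The function name `count_markers` and the sample bytes are invented for illustration; in practice the input would be the bytes saved from the browser or a download tool.

```python
# UTF-8 byte sequences to look for in the downloaded page.
REPLACEMENT = "\ufffd".encode("utf-8")   # ef bf bd
GAL = "\U000120f2".encode("utf-8")       # f0 92 83 b2


def count_markers(raw: bytes) -> dict:
    """Count replacement characters and intact GAL signs in raw page bytes."""
    return {
        "replacement_chars": raw.count(REPLACEMENT),
        "gal_signs": raw.count(GAL),
    }


# Made-up samples standing in for a corrupted and an intact post.
corrupted = ("If I type a GAL ( " + "\ufffd" * 6 + " ) into").encode("utf-8")
intact = "If I type a GAL ( \U000120f2 ) into".encode("utf-8")

print("corrupted:", count_markers(corrupted))
print("intact:   ", count_markers(intact))
```

If the served bytes already contain ef bf bd, the damage happened before the page reached the browser, so no local font or display setting can recover the original character.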

Precisely two posts in this thread still contain the correct characters: my original and Uncwilly's first response. All the rest have been corrupted in some manner.

No idea what's going on; I suspect that only someone with superuser access to the forum software and database will be able to take corrective action.

The only workaround I can suggest is hinted at by my signature:

[ATTACH]21919[/ATTACH]

Xyzzy 2020-03-23 13:02

𒀀 𒀁 𒀂 𒀃 𒀄 𒀅 𒀆 𒀇 𒀈 𒀉 𒀊 𒀋 𒀌 𒀍 𒀎 𒀏 𒀐 𒀑 𒀒 𒀓 𒀔 𒀕 𒀖 𒀗 𒀘 𒀙 𒀚 𒀛 𒀜 𒀝 𒀞 𒀟 𒀠 𒀡 𒀢 𒀣 𒀤 𒀥 𒀦 𒀧 𒀨 𒀩 𒀪 𒀫 𒀬 𒀭 𒀮 𒀯 𒀰 𒀱 𒀲 𒀳 𒀴 𒀵 𒀶 𒀷 𒀸 𒀹 𒀺 𒀻 𒀼 𒀽 𒀾 𒀿 𒁀 𒁁 𒁂 𒁃 𒁄 𒁅 𒁆 𒁇 𒁈 𒁉 𒁊 𒁋 𒁌 𒁍 𒁎 𒁏 𒁐 𒁑 𒁒 𒁓 𒁔 𒁕 𒁖 𒁗 𒁘 𒁙 𒁚 𒁛 𒁜 𒁝 𒁞 𒁟 𒁠 𒁡 𒁢 𒁣 𒁤 𒁥 𒁦 𒁧 𒁨 𒁩 𒁪 𒁫 𒁬 𒁭 𒁮 𒁯 𒁰 𒁱 𒁲 𒁳 𒁴 𒁵 𒁶 𒁷 𒁸 𒁹 𒁺 𒁻 𒁼 𒁽 𒁾 𒁿 𒂀 𒂁 𒂂 𒂃 𒂄 𒂅 𒂆 𒂇 𒂈 𒂉 𒂊 𒂋 𒂌 𒂍 𒂎 𒂏 𒂐 𒂑 𒂒 𒂓 𒂔 𒂕 𒂖 𒂗 𒂘 𒂙 𒂚 𒂛 𒂜 𒂝 𒂞 𒂟 𒂠 𒂡 𒂢 𒂣 𒂤 𒂥 𒂦 𒂧 𒂨 𒂩 𒂪 𒂫 𒂬 𒂭 𒂮 𒂯 𒂰 𒂱 𒂲 𒂳 𒂴 𒂵 𒂶 𒂷 𒂸 𒂹 𒂺 𒂻 𒂼 𒂽 𒂾 𒂿 𒃀 𒃁 𒃂 𒃃 𒃄 𒃅 𒃆 𒃇 𒃈 𒃉 𒃊 𒃋 𒃌 𒃍 𒃎 𒃏 𒃐 𒃑 𒃒 𒃓 𒃔 𒃕 𒃖 𒃗 𒃘 𒃙 𒃚 𒃛 𒃜 𒃝 𒃞 𒃟 𒃠 𒃡 𒃢 𒃣 𒃤 𒃥 𒃦 𒃧 𒃨 𒃩 𒃪 𒃫 𒃬 𒃭 𒃮 𒃯 𒃰 𒃱 𒃲 𒃳 𒃴 𒃵 𒃶 𒃷 𒃸 𒃹 𒃺 𒃻 𒃼 𒃽 𒃾 𒃿 𒄀 𒄁 𒄂 𒄃 𒄄 𒄅 𒄆 𒄇 𒄈 𒄉 𒄊 𒄋 𒄌 𒄍 𒄎 𒄏 𒄐 𒄑 𒄒 𒄓 𒄔 𒄕 𒄖 𒄗 𒄘 𒄙 𒄚 𒄛 𒄜 𒄝 𒄞 𒄟 𒄠 𒄡 𒄢 𒄣 𒄤 𒄥 𒄦 𒄧 𒄨 𒄩 𒄪 𒄫 𒄬

Xyzzy 2020-03-23 13:09

Any previously entered text has probably been mangled by the forum software.

New text should be treated properly.

:mike:

xilman 2020-03-23 15:36

[QUOTE=Xyzzy;540595]Any previously entered text has probably been mangled by the forum software.

New text should be treated properly.

:mike:[/QUOTE]Excellent! :bow:

Would you tell us what was going wrong please?

Xyzzy 2020-03-23 17:56

#2

[URL]https://forum.vbulletin.com/forum/vbulletin-legacy-versions-products/legacy-vbulletin-versions/vbulletin-3-6-questions-problems-and-troubleshooting/265581-language-charset-problems-things-to-check?t=259250[/URL]

:mike:

xilman 2020-03-23 18:17

[QUOTE=Xyzzy;540626]#2

[URL]https://forum.vbulletin.com/forum/vbulletin-legacy-versions-products/legacy-vbulletin-versions/vbulletin-3-6-questions-problems-and-troubleshooting/265581-language-charset-problems-things-to-check?t=259250[/URL]

:mike:[/QUOTE]Curious! I would never have thought of that.

LaurV 2020-03-24 05:24

Now, if you have already solved that, did you also have a look at the spacing issue?

Maybe unrelated, but I spend part of my time on Duolingo, and [URL="https://forum.duolingo.com/comment/28284738/GUIDE-Formatting-in-Duolingo"]there[/URL] they have a strange "markup" language for formatting posts. For example, to split the text into paragraphs you can't just type an <enter>. That will be totally ignored; I guess they purge all the "strange" characters for safety or so. If you type two consecutive <enter>s, the result is a split with a galaxy of space between the paragraphs (somewhat like what mersenneforum does, and I always have to edit the post to delete that space). To get a "normal" split, you must type two consecutive spaces at the end of the paragraph (so, <space><space><enter>); see the "spacing" section at the posted link.

Now, what is the connection? Well, the "hand" learned the trick and sometimes I type two spaces at the end of a paragraph when posting on other sites, here included. I have seen that occasionally, when doing that on mersenneforum, the resulting split has only two empty lines, or is even "normal" (with a single line). When I just put <enter>s, the result always had additional empty lines which needed to be edited.

Therefore, I made some experiments here, and I found that the number of empty lines added when the post is "submitted" depends on the ending characters of the previous paragraph in quite a deterministic way. One <enter> is not always split into two <enter>s.

Maybe this gives you some idea what to tickle? Maybe some settings? Unix vs Windoze line endings? CR/LF split into two CR or two LF? This "behavior" is quite new, it used not to be like this in the past, and it is quite annoying...
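
To make the CR/LF suspicion concrete, here is a small Python sketch of one way a single line break could turn into two. It is purely hypothetical: `naive_breaks` and `normalized_breaks` are invented names, and nothing confirms that the forum software does anything like this.

```python
def naive_breaks(text: str) -> str:
    """Treat CR and LF as two independent line breaks (the suspected bug)."""
    return text.replace("\r", "\n")


def normalized_breaks(text: str) -> str:
    """Collapse each CR/LF pair first, then convert any stray CR."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# One <enter> as a Windows-style browser might submit it.
sample = "first paragraph\r\nsecond paragraph"

print(repr(naive_breaks(sample)))       # the single break has become two
print(repr(normalized_breaks(sample)))  # the single break stays single
```

With Unix-style input (a bare LF) both functions give the same result, so a bug of this shape would only show up for some submissions.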

retina 2020-03-24 06:17

[QUOTE=LaurV;540730]Now, if you have already solved that, did you also have a look at the spacing issue?

Maybe unrelated, but I spend part of my time on Duolingo, and [URL="https://forum.duolingo.com/comment/28284738/GUIDE-Formatting-in-Duolingo"]there[/URL] they have a strange "markup" language for formatting posts. For example, to split the text into paragraphs you can't just type an <enter>. That will be totally ignored; I guess they purge all the "strange" characters for safety or so. If you type two consecutive <enter>s, the result is a split with a galaxy of space between the paragraphs (somewhat like what mersenneforum does, and I always have to edit the post to delete that space). To get a "normal" split, you must type two consecutive spaces at the end of the paragraph (so, <space><space><enter>); see the "spacing" section at the posted link.

Now, what is the connection? Well, the "hand" learned the trick and sometimes I type two spaces at the end of a paragraph when posting on other sites, here included. I have seen that occasionally, when doing that on mersenneforum, the resulting split has only two empty lines, or is even "normal" (with a single line). When I just put <enter>s, the result always had additional empty lines which needed to be edited.

Therefore, I made some experiments here, and I found that the number of empty lines added when the post is "submitted" depends on the ending characters of the previous paragraph in quite a deterministic way. One <enter> is not always split into two <enter>s.

Maybe this gives you some idea what to tickle? Maybe some settings? Unix vs Windoze line endings? CR/LF split into two CR or two LF? This "behavior" is quite new, it used not to be like this in the past, and it is quite annoying...[/QUOTE]Are you using the fancy JS editor?

I have no trouble with the normal non-JS text box. Enter is what one expects. Spaces are what one expects.

It all just works.

And I even put two spaces at the end of each paragraph in this post. And no problems.

Let the browser do its job and disable all that fancy JS that is messing it up.

LaurV 2020-03-24 08:52

JS has nothing to do with editing or posting (i.e. writing the text in the box and clicking "submit reply"; there is no JS involved). As proof, your "two spaces" were parsed out when you submitted the post. There are no "two spaces" at the end of your lines, unless you used the "code" tags, where the spacing is kept.


All times are UTC.

Powered by vBulletin® Version 3.8.11