mersenneforum.org  

Old 2020-03-18, 18:41   #1
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10,753 Posts
Unicode

There is something not quite right about the treatment of many Unicode characters by the forum software. All code points up to U+007F are handled correctly. Given that this is the range of traditional US-ASCII, that is what should be expected.

The pound sterling, £, comes through unscathed. This is U+00A3 and I deduce that all code points up to U+00FF are fine.

Likewise, the Euro, €, is handled correctly. This one is U+20AC.

https://mersenneforum.org/showpost.p...2&postcount=34 contains a few characters with code points just above 012000. One of them is the Sumerian character GAL, which has code point 0120F2. If I type a GAL ( 𒃲 ) into this composition window it now displays perfectly. It also displays perfectly in a Preview window.
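
For reference, the code points and UTF-8 lengths of the three characters mentioned so far can be checked with a few lines of Python. This is only a sketch of the arithmetic, not anything the forum software is known to run; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (U+XXXX label, UTF-8 byte count, Unicode name) for one character."""
    label = f"U+{ord(ch):04X}"
    return label, len(ch.encode("utf-8")), unicodedata.name(ch)

# Pound sign, euro sign and the cuneiform GAL discussed in the post.
for ch in "£€\U000120F2":
    print(describe(ch))
```

The point of interest is the byte count: the first two fit in two and three bytes, while GAL lies outside the Basic Multilingual Plane and needs four.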

In the post referred to above, it is replaced by six U+FFFD characters. These display as a white question mark within a black diamond on my display.
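
One way a single character could turn into exactly six replacement characters is sketched below. This is a guess, not a diagnosis of the forum software: it assumes the character passes through a UTF-16 stage, that each half of the surrogate pair is then written out as its own three-byte sequence (CESU-8 style), and that a strict UTF-8 decoder later rejects all six bytes.

```python
GAL = "\U000120F2"  # CUNEIFORM SIGN GAL

# Correct UTF-8: a single four-byte sequence.
utf8 = GAL.encode("utf-8")

# UTF-16 represents the same code point as a surrogate pair.
offset = ord(GAL) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

# Hypothetical faulty step: encode each surrogate separately.
# "surrogatepass" makes Python emit these normally forbidden bytes.
cesu8 = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

# A strict decoder substitutes U+FFFD for every byte it cannot accept.
decoded = cesu8.decode("utf-8", errors="replace")

print(utf8.hex(" "), f"{high:04X} {low:04X}", cesu8.hex(" "), decoded.count("\ufffd"))
```

Other failure modes could produce the same symptom, so this only shows that the count of six is consistent with a surrogate-pair mishap.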

Subsequent posts will run a binary search over the range U+2000 through U+12000 to discover what the forum software handles correctly and where it is broken. My guess is that the breaking point lies at a binary boundary such as U+4000, U+8000 or U+10000.
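
A minimal sketch of such a search, assuming the breakage is monotonic (every code point above the first failure also fails). The names `first_broken` and `fake_survives` are hypothetical; a real predicate would post `chr(cp)` to the forum and inspect what comes back, and would have to step around the surrogate range U+D800 to U+DFFF, which cannot be posted as characters.

```python
from typing import Callable

def first_broken(lo: int, hi: int, survives: Callable[[int], bool]) -> int:
    """Return the smallest failing code point in (lo, hi].

    Assumes survives(lo) is True, survives(hi) is False, and that every
    code point above the first failure also fails.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if survives(mid):
            lo = mid   # mid is fine, so the break lies above it
        else:
            hi = mid   # mid is broken, so the break is at or below it
    return hi

# Stand-in predicate for demonstration only: pretend everything outside
# the Basic Multilingual Plane is mangled.
def fake_survives(cp: int) -> bool:
    return cp < 0x10000

print(f"U+{first_broken(0x2000, 0x12000, fake_survives):04X}")
```

With these bounds the loop needs 16 probes, so the search costs about that many test posts.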

Note that the breakage may lie in any of the software between accepting a post and serving it up to a reader. Given experience with Preview, my guess is that the underlying database is not UTF-8 clean.

Old 2020-03-18, 18:44   #2
xilman

Quote:
Originally Posted by xilman View Post
There is something not quite right about the treatment of many Unicode characters by the forum software. All code points up to U+007F are handled correctly. Given that this is the range of traditional US-ASCII, that is what should be expected.

The pound sterling, £, comes through unscathed. This is U+00A3 and I deduce that all code points up to U+00FF are fine.

Likewise, the Euro, €, is handled correctly. This one is U+20AC.

https://mersenneforum.org/showpost.p...2&postcount=34 contains a few characters with code points just above 012000. One of them is the Sumerian character GAL, which has code point 0120F2. If I type a GAL ( ) into this composition window it now displays perfectly. It also displays perfectly in a Preview window.

In the post referred to above, it is replaced by six U+FFFD characters. These display as a white question mark within a black diamond on my display.

Subsequent posts will run a binary search over the range U+2000 through U+12000 to discover what the forum software handles correctly and where it is broken. My guess is that the breaking point lies at a binary boundary such as U+4000, U+8000 or U+10000.

Note that the breakage may lie in any of the software between accepting a post and serving it up to a reader. Given experience with Preview, my guess is that the underlying database is not UTF-8 clean.
Fascinating!

The GAL now displays perfectly in the quoted post (and this one), yet the characters in the referenced post are still incorrect.

Very tempted to go back to that post and edit it.

Last fiddled with by xilman on 2020-03-18 at 18:45 Reason: Added "(and this one)"

Old 2020-03-18, 18:52   #3
Uncwilly
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts

9,787 Posts

Quote:
Originally Posted by xilman View Post
https://mersenneforum.org/showpost.p...2&postcount=34 contains a few characters with code points just above 012000. One of them is the Sumerian character GAL, which has code point 0120F2. If I type a GAL ( 𒃲 ) into this composition window it now displays perfectly. It also displays perfectly in a Preview window.
.

Old 2020-03-18, 20:24   #4
xilman

Quote:
Originally Posted by Uncwilly View Post
.
Thanks. I confirm it still displays correctly in your reply.

However, the GAL now shows as six U+FFFD characters in post #2 yet remains correct everywhere else (for the time being).

Curiouser and curiouser.

Could you report whether you see the same as me in the amantes linguam Latinam post please? Or, for that matter, whether the GAL turns into six U+FFFD characters in any of the above.

Last fiddled with by xilman on 2020-03-18 at 20:29

Old 2020-03-18, 20:56   #5
Uncwilly

Quote:
Originally Posted by xilman View Post
Could you report whether you see the same as me in the amantes linguam Latinam post please? Or, for that matter, whether the GAL turns into six U+FFFD characters in any of the above.
I see rhomboids with ? inside each, in that thread and in your quote. Is it your entry method? Alt-#### vs cut-and-paste?

Old 2020-03-18, 21:04   #6
xilman

Quote:
Originally Posted by Uncwilly View Post
I see rhomboids with ? inside each, in that thread and in your quote. Is it your entry method? Alt-#### vs cut-and-paste?
I use cut and paste when entering the text for the first time. When I quote something I leave well alone. The Alt- method doesn't work on this Linux / Xfce system; neither do I know what its equivalent might be.

The characters you see are the common representation of U+FFFD.

Ah, just found Ctrl-Shift-u-xxxx. Giving it a try ...

Yup, works a treat. Let's see whether the GAL above survives a few round trips.

Last fiddled with by xilman on 2020-03-18 at 21:09 Reason: Found Ctrl-Shift-u-0-1-2-0-f-2

Old 2020-03-18, 21:18   #7
xilman

Nope. It didn't even survive its first exposure. Not that I expected it to be treated any differently.

This is really irritating. The specific case of Sumerian cuneiform is relatively unimportant but it is a symptom of an underlying cause which may be more serious. Something in the pipeline is quite clearly not fully UTF-8 compatible.

(Incidentally, we saw a number of such issues in FlyBase when it migrated circa 2008. It was a real bugger tracking down and fixing them.)

Old 2020-03-22, 15:27   #8
LaurV
Romulan Interpreter
Jun 2011
Thailand

2²×3³×89 Posts

rhomboids with question marks here too.
Is that (maybe) a local setting I can make? (like fonts to download, or change forum default font?)

Old 2020-03-22, 16:56   #9
xilman

Quote:
Originally Posted by LaurV View Post
rhomboids with question marks here too.
Is that (maybe) a local setting I can make? (like fonts to download, or change forum default font?)
Not as far as I know.

To test this possibility I saved the downloaded bytes and examined them with od(1). The glyphs you describe are indeed the U+FFFD characters and not the original cuneiform. Your browser then displays them as such.
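
The same check can be scripted. A rough sketch, assuming the saved page is UTF-8 (in which case U+FFFD appears as the bytes EF BF BD); the function name `replacement_offsets` and the sample text are invented for illustration and are not taken from the actual download.

```python
REPLACEMENT = "\ufffd".encode("utf-8")  # bytes EF BF BD

def replacement_offsets(data: bytes) -> list:
    """Return the byte offset of every UTF-8 encoded U+FFFD in data."""
    offsets = []
    pos = data.find(REPLACEMENT)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(REPLACEMENT, pos + len(REPLACEMENT))
    return offsets

# Made-up page content: two replacement characters, then an intact GAL.
sample = "GAL ( \ufffd\ufffd ) and GAL ( \U000120F2 )".encode("utf-8")
print(replacement_offsets(sample))
```

If the offsets list is non-empty, the replacement characters are in the bytes the server sent, so no font or browser setting on the reader's side can bring the original glyphs back.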

Precisely two posts in this thread still contain the correct characters: my original and Uncwilly's first response. All the rest have been corrupted in some manner.

No idea what's going on; I suspect that only someone with superuser access to the forum software and database will be able to take corrective action.

The only workaround I can suggest is hinted at by my signature:

[Attached image: signature.png]

Last fiddled with by xilman on 2020-03-22 at 17:01

Old 2020-03-23, 13:02   #10
Xyzzy
"Mike"
Aug 2002

2⁵×257 Posts

𒀀 𒀁 𒀂 𒀃 𒀄 𒀅 𒀆 𒀇 𒀈 𒀉 𒀊 𒀋 𒀌 𒀍 𒀎 𒀏 𒀐 𒀑 𒀒 𒀓 𒀔 𒀕 𒀖 𒀗 𒀘 𒀙 𒀚 𒀛 𒀜 𒀝 𒀞 𒀟 𒀠 𒀡 𒀢 𒀣 𒀤 𒀥 𒀦 𒀧 𒀨 𒀩 𒀪 𒀫 𒀬 𒀭 𒀮 𒀯 𒀰 𒀱 𒀲 𒀳 𒀴 𒀵 𒀶 𒀷 𒀸 𒀹 𒀺 𒀻 𒀼 𒀽 𒀾 𒀿 𒁀 𒁁 𒁂 𒁃 𒁄 𒁅 𒁆 𒁇 𒁈 𒁉 𒁊 𒁋 𒁌 𒁍 𒁎 𒁏 𒁐 𒁑 𒁒 𒁓 𒁔 𒁕 𒁖 𒁗 𒁘 𒁙 𒁚 𒁛 𒁜 𒁝 𒁞 𒁟 𒁠 𒁡 𒁢 𒁣 𒁤 𒁥 𒁦 𒁧 𒁨 𒁩 𒁪 𒁫 𒁬 𒁭 𒁮 𒁯 𒁰 𒁱 𒁲 𒁳 𒁴 𒁵 𒁶 𒁷 𒁸 𒁹 𒁺 𒁻 𒁼 𒁽 𒁾 𒁿 𒂀 𒂁 𒂂 𒂃 𒂄 𒂅 𒂆 𒂇 𒂈 𒂉 𒂊 𒂋 𒂌 𒂍 𒂎 𒂏 𒂐 𒂑 𒂒 𒂓 𒂔 𒂕 𒂖 𒂗 𒂘 𒂙 𒂚 𒂛 𒂜 𒂝 𒂞 𒂟 𒂠 𒂡 𒂢 𒂣 𒂤 𒂥 𒂦 𒂧 𒂨 𒂩 𒂪 𒂫 𒂬 𒂭 𒂮 𒂯 𒂰 𒂱 𒂲 𒂳 𒂴 𒂵 𒂶 𒂷 𒂸 𒂹 𒂺 𒂻 𒂼 𒂽 𒂾 𒂿 𒃀 𒃁 𒃂 𒃃 𒃄 𒃅 𒃆 𒃇 𒃈 𒃉 𒃊 𒃋 𒃌 𒃍 𒃎 𒃏 𒃐 𒃑 𒃒 𒃓 𒃔 𒃕 𒃖 𒃗 𒃘 𒃙 𒃚 𒃛 𒃜 𒃝 𒃞 𒃟 𒃠 𒃡 𒃢 𒃣 𒃤 𒃥 𒃦 𒃧 𒃨 𒃩 𒃪 𒃫 𒃬 𒃭 𒃮 𒃯 𒃰 𒃱 𒃲 𒃳 𒃴 𒃵 𒃶 𒃷 𒃸 𒃹 𒃺 𒃻 𒃼 𒃽 𒃾 𒃿 𒄀 𒄁 𒄂 𒄃 𒄄 𒄅 𒄆 𒄇 𒄈 𒄉 𒄊 𒄋 𒄌 𒄍 𒄎 𒄏 𒄐 𒄑 𒄒 𒄓 𒄔 𒄕 𒄖 𒄗 𒄘 𒄙 𒄚 𒄛 𒄜 𒄝 𒄞 𒄟 𒄠 𒄡 𒄢 𒄣 𒄤 𒄥 𒄦 𒄧 𒄨 𒄩 𒄪 𒄫 𒄬

Old 2020-03-23, 13:09   #11
Xyzzy

Any previously entered text has probably been mangled by the forum software.

New text should be treated properly.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.