![]() |
|
|
#1101 |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
11000001101002 Posts |
With reference to the following assignment:
http://www.mersenne.org/report_expon...6917443&full=1 It shows as assigned to user: "João" and the webpage encoding is defined as UTF-8, but clearly the characters are wrong. More likely the user name is João which would encode the ã as the two UTF-8 bytes ã. So presumably the DB has recorded the user name in UTF-8 form (so far so good), but upon retrieval it gets pulled out as bytes and re-encoded to UTF-8? Or maybe it has recorded each UTF-8 byte as a separate UTF-8 character? Anyhow, whatever it is, is ain't right IMO. Last fiddled with by retina on 2016-08-21 at 05:23 |
|
|
|
|
|
#1102 | |
|
Sep 2003
1010000110012 Posts |
Quote:
cp1250: £o¿yñski → Łożyński cp1251: þðèé → Юрий HTML: Vidaković → Vidaković UTF-8: Løkken → Løkken ISO 8859-2: ©eliga → Šeliga ISO 8859-9: Buðday → Buğday KSC: ±è½Ã¸ó → 김시몬 GIMPS started 20 years ago, and many of these names date back to an era when Unicode was not common and a lot of 8-bit character sets were in use. Most of the time it's a straightforward mapping, but in a few cases the eight-bit character sets map to 8-bit control codes (\x80 to \x9F), and then the character is replaced by a question mark and you have to use language awareness: cp1250: Tomá? ?ikorský → Tomáš Šikorský Google can help nail these cases down with confidence. Sometimes these raw 8-bit control codes are actually in the database as such: UTF-8: KeÃ<9f>ler → Keßler cp850: J<81>rgen → Jürgen (e.g., here the fourth byte is a literal hex 9F, represented above as <9f>) UTF-8: <e6><b1><aa><e6><98><be><e6><9e><97> → 汪显林 SJIS: <89><c1><93><a1> <96><46><8f><ba> → 加藤 芳昭 We could fix most of these systematically, although a few cases seem hopelessly undecipherable. The question is: can the database handle full-blown Unicode names in Cyrillic, Korean, Japanese, Chinese, eastern European, etc. Right now, though, the database already contains some unprintable 8-bit control characters in the \x80 to \x9F range and that isn't a good thing. Last fiddled with by GP2 on 2016-08-21 at 08:22 |
|
|
|
|
|
|
#1103 | |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
22·1,549 Posts |
Quote:
|
|
|
|
|
|
|
#1104 |
|
Sep 2003
5·11·47 Posts |
Same thing. UTF-8 is just a very common way of encoding Unicode.
I think PrimeNet uses SQL Server. Maybe Madpoo could give us a bit of insight into the data type of the relevant database field? I could supply him with a set of corrected names, assuming the database is able to store them. |
|
|
|
|
|
#1105 |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
140648 Posts |
Okay, but I meant that the server could just be told that it is all bytes and it doesn't need to be clever with handling any code conversions. The only thing affected in that mode is that sorting and lexicographic ordering might not appear logical.
|
|
|
|
|
|
#1106 |
|
Sep 2003
5·11·47 Posts |
Here's an old thread from 2014 which says that the SQL Server database field is varchar rather than nvarchar, so it can't store Unicode. Unfortunate.
It would still be possible to fix a subset of the names, including "João" to "João". Or all the accented names could be converted to use HTML &-style entity numbers, like webpages do. Edit: hmm, that wouldn't work, see M50685343, where the name displays literally as "Vladan Vidaković" rather than "Vladan Vidaković" Last fiddled with by GP2 on 2016-08-21 at 16:35 |
|
|
|
|
|
#1107 | |
|
Serpentine Vermin Jar
Jul 2014
331110 Posts |
Quote:
A fix for that would be to alter the table and convert to nvarchar... I imagine for the most part it would be an easy enough transition since SQL can typically do implicit casts between them and won't complain, but there would be issues in boolean operations where varchar <> nvarchar so we'd have to dig into it more to make sure we're good. |
|
|
|
|
|
|
#1108 | |
|
"Forget I exist"
Jul 2009
Dumbassville
26×131 Posts |
Quote:
Last fiddled with by science_man_88 on 2016-08-21 at 18:04 |
|
|
|
|
|
|
#1109 | |
|
Sep 2003
5×11×47 Posts |
Quote:
Last fiddled with by GP2 on 2016-08-21 at 20:02 |
|
|
|
|
|
|
#1110 | |
|
Serpentine Vermin Jar
Jul 2014
7·11·43 Posts |
Quote:
![]() I thought about it some today and my best guess on how to improve it would be to add an additional nvarchar column and slowly port any DB calls/sprocs/etc to use that instead, deprecating and eventually removing the old column. Kinda messes things up but if this were my day job and I wanted to ensure reliable operations, that's how I'd do it... gradually and surely with some bleedover time. But even that means making sure any updates/inserts aren't using implicit column placements... I'm pretty sure anything like that is explicitly handled (as far as I've seen, that's the case) but Murphy's law says there'd be that one exception where an insert assumes columns are in a certain order and doesn't spell 'em out.
|
|
|
|
|
|
|
#1111 | |
|
If I May
"Chris Halsall"
Sep 2002
Barbados
37·263 Posts |
Quote:
![]() Your planned migration sounds sane. Add the new column at the end of the table and the worst case will be the new nvarchar column isn't updated by any legacy code doing implicit references. |
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Official "Faits erronés dans de belles-lettres" thread | ewmayer | Lounge | 39 | 2015-05-19 01:08 |
| Official "all-Greek-to-me Fiction Literature and Cinema" Thread | ewmayer | Science & Technology | 41 | 2014-04-16 11:54 |
| Official "Lasciate ogne speranza" whinge-thread | cheesehead | Soap Box | 56 | 2013-06-29 01:42 |
| Official "Ernst is a deceiving bully and George is a meanie" thread | cheesehead | Soap Box | 61 | 2013-06-11 04:30 |
| Official "String copy Statement Considered Harmful" thread | Dubslow | Programming | 19 | 2012-05-31 17:49 |