UTF-8 encoding conversion failure?
With reference to the following assignment:
[url]http://www.mersenne.org/report_exponent/?exp_lo=36917443&full=1[/url] It shows as assigned to user: "JoÃ£o", and the webpage encoding is declared as UTF-8, but the characters are clearly wrong. More likely the user name is João, which would encode the ã as the two UTF-8 bytes Ã£ (0xC3 0xA3). So presumably the DB has recorded the user name in UTF-8 form (so far so good), but upon retrieval it gets pulled out as raw bytes and re-encoded to UTF-8? Or maybe it has recorded each UTF-8 byte as a separate UTF-8 character? Anyhow, whatever it is, it ain't right IMO.
[QUOTE=retina;440349]It shows as assigned to user: "JoÃ£o" and the webpage encoding is defined as UTF-8, but clearly the characters are wrong.[/QUOTE]
There are quite a few of these distorted names, and most of them can be identified unambiguously:

cp1250: £o¿yñski → Łożyński
cp1251: þðèé → Юрий
HTML: Vidakovi&#263; → Vidaković
UTF-8: LÃ¸kken → Løkken
ISO 8859-2: ©eliga → Šeliga
ISO 8859-9: Buðday → Buğday
KSC: ±è½Ã¸ó → 김시몬

GIMPS started 20 years ago, and many of these names date back to an era when Unicode was not common and a lot of 8-bit character sets were in use. Most of the time it's a straightforward mapping, but in a few cases the 8-bit character sets map to the control codes \x80 to \x9F; the character is then replaced by a question mark and you have to use language awareness:

cp1250: Tomá? ?ikorský → Tomáš Šikorský

Google can help nail these cases down with confidence. Sometimes these raw 8-bit control codes are actually in the database as such:

UTF-8: KeÃ<9f>ler → Keßler (e.g., here the fourth byte is a literal hex 9F, represented as <9f>)
cp850: J<81>rgen → Jürgen
UTF-8: <e6><b1><aa><e6><98><be><e6><9e><97> → 汪显林
SJIS: <89><c1><93><a1> <96><46><8f><ba> → 加藤 芳昭

We could fix most of these systematically, although a few cases seem hopelessly undecipherable. The question is: can the database handle full-blown Unicode names in Cyrillic, Korean, Japanese, Chinese, eastern European scripts, etc.? Right now, though, the database already contains some unprintable 8-bit control characters in the \x80 to \x9F range, and that isn't a good thing.
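These mappings can usually be reversed mechanically: re-encode the garbled string with the encoding it was wrongly displayed in (typically Latin-1) and decode with the character set it was originally written in. A minimal Python sketch of the round-trip, using the cp1250 example above:

```python
# Repair mojibake: bytes written in cp1250 but mis-displayed as Latin-1.
garbled = "£o¿yñski"

# Step 1: recover the original raw bytes by undoing the wrong decode.
raw = garbled.encode("latin-1")   # b'\xa3o\xbfy\xf1ski'

# Step 2: decode those bytes with the character set they were written in.
fixed = raw.decode("cp1250")
print(fixed)                      # Łożyński

# The forward direction reproduces the distortion seen on the website:
assert "Łożyński".encode("cp1250").decode("latin-1") == garbled
```

The hard part is not the conversion but guessing the right source encoding per name, which is where the language awareness mentioned above comes in.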
[QUOTE=GP2;440353]The question is: can the database handle full-blown Unicode names in Cyrillic, Korean, Japanese, Chinese, eastern European, etc.[/QUOTE]I think UTF-8 is a better option than Unicode.[QUOTE=GP2;440353]Right now, though, the database already contains some unprintable 8-bit control characters in the \x80 to \x9F range and that isn't a good thing.[/QUOTE]Those are in the UTF-8 encoding space, so should not be a problem.
|
[QUOTE=retina;440356]I think UTF-8 is a better option than Unicode.[/QUOTE]
Same thing. UTF-8 is just a very common way of encoding Unicode. I think PrimeNet uses SQL Server. Maybe Madpoo could give us a bit of insight into the data type of the relevant database field? I could supply him with a set of corrected names, assuming the database is able to store them.
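The distinction matters in practice: Unicode assigns code points, while UTF-8 is one byte-level serialization of them. A quick Python illustration using the name from the original report:

```python
name = "João"

# Unicode: four code points, one of which is outside ASCII.
assert len(name) == 4
assert ord("ã") == 0xE3                # U+00E3 LATIN SMALL LETTER A WITH TILDE

# UTF-8: the same four code points serialize to five bytes.
data = name.encode("utf-8")
assert data == b"Jo\xc3\xa3o"
assert len(data) == 5

# Mis-decoding those bytes as Latin-1 yields the garbled form seen on the site.
assert data.decode("latin-1") == "JoÃ£o"
```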
[QUOTE=GP2;440358]Same thing. UTF-8 is just a very common way of encoding Unicode.[/QUOTE]Okay, but I meant that the server could just be told that it is all bytes and it doesn't need to be clever with handling any code conversions. The only thing affected in that mode is that sorting and lexicographic ordering might not appear logical.
|
Here's [URL="http://www.mersenneforum.org/showthread.php?t=19671"]an old thread[/URL] from 2014 which says that the SQL Server database field is varchar rather than nvarchar, so it can't store Unicode. Unfortunate.
It would still be possible to fix a subset of the names, including "JoÃ£o" to "João". Or all the accented names could be converted to use HTML &-style entity numbers, like webpages do. Edit: hmm, that wouldn't work; see [URL="http://www.mersenne.org/report_exponent/?exp_lo=50685343&exp_hi=&full=1"]M50685343[/URL], where the name displays literally as "Vladan Vidakovi&#263;" rather than "Vladan Vidaković"
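The entity idea fails because the site escapes the stored string on output instead of emitting it raw. For reference, the conversion itself is trivial; a small Python sketch:

```python
import html

name = "Vladan Vidaković"

# Replace non-ASCII characters with numeric character references.
entity_form = name.encode("ascii", "xmlcharrefreplace").decode("ascii")
assert entity_form == "Vladan Vidakovi&#263;"   # ć is U+0107 = 263 decimal

# A browser renders the entity only if the page emits it unescaped; if the
# server escapes '&' on output, the literal text "&#263;" is what you see.
assert html.unescape(entity_form) == name
```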
[QUOTE=GP2;440379]Here's [URL="http://www.mersenneforum.org/showthread.php?t=19671"]an old thread[/URL] from 2014 which says that the SQL Server database field is varchar rather than nvarchar, so it can't store Unicode. Unfortunate.
It would still be possible to fix a subset of the names, including "JoÃ£o" to "João". Or all the accented names could be converted to use HTML &-style entity numbers, like webpages do. Edit: hmm, that wouldn't work, see [URL="http://www.mersenne.org/report_exponent/?exp_lo=50685343&exp_hi=&full=1"]M50685343[/URL], where the name displays literally as "Vladan Vidakovi&#263;" rather than "Vladan Vidaković"[/QUOTE] There are indeed some oddities when characters fall outside the extended ASCII range. The SQL table for usernames is unfortunately varchar instead of nvarchar. A fix for that would be to alter the table and convert to nvarchar... I imagine for the most part it would be an easy enough transition, since SQL Server can typically do implicit casts between them and won't complain, but there would be issues in boolean operations where varchar <> nvarchar, so we'd have to dig into it more to make sure we're good.
[QUOTE=GP2;440379]Edit: hmm, that wouldn't work, see [URL="http://www.mersenne.org/report_exponent/?exp_lo=50685343&exp_hi=&full=1"]M50685343[/URL], where the name displays literally as "Vladan Vidakovi&#263;" rather than "Vladan Vidaković"[/QUOTE]
My first thought is that this is a deliberate defense against cross-site scripting through forms. Characters like the less-than and greater-than signs, if echoed back unescaped, would be read by the browser as part of the page. So if I wrote <script></script> in a name field and it were displayed without being escaped, any code inside could execute, which is how the famous self-retweeting tweet worked. If you previewed this post before submitting, you would see the HTML equivalents of the less-than and greater-than signs; if those were parsed as markup instead, code could in theory be made to run.
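In other words, escaping on output is a feature here, not a bug. A minimal Python sketch of the standard defense:

```python
import html

# A hostile "name" submitted through a form.
name = '<script>alert("xss")</script>'

# Escaping on output turns markup into inert text the browser displays as-is.
safe = html.escape(name)
assert safe == '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;'
```

The downside, as noted above, is that the same blanket escaping also neutralizes legitimate character entities like &#263;, so they display literally.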
[QUOTE=Madpoo;440386]I imagine for the most part it would be an easy enough transition since SQL can typically do implicit casts between them and won't complain, but there would be issues in boolean operations where varchar <> nvarchar so we'd have to dig into it more to make sure we're good.[/QUOTE]
How normalized is the database? Does the Public Name (from [url]http://www.mersenne.org/update/[/url] ) really get stored in more than one place, or get compared against, or used for purposes other than display? I would have thought that the V5UserID from prime.txt or the SrvrUID from local.txt would be the key fields in the tables.
[QUOTE=GP2;440395]How normalized is the database? Does the Public Name (from [url]http://www.mersenne.org/update/[/url] ) really get stored in more than one place, or get compared against, or used for purposes other than display? I would have thought that the V5UserID from prime.txt or the SrvrUID from local.txt would be the key fields in the tables.[/QUOTE]
The user's public name is only referenced in one table, which makes sense of course. :smile:

I thought about it some today, and my best guess on how to improve it would be to add an additional nvarchar column and slowly port any DB calls/sprocs/etc. to use that instead, deprecating and eventually removing the old column. Kinda messes things up, but if this were my day job and I wanted to ensure reliable operations, that's how I'd do it... gradually and surely, with some bleedover time.

But even that means making sure any updates/inserts aren't using implicit column placements... I'm pretty sure anything like that is explicitly handled (as far as I've seen, that's the case), but Murphy's law says there'd be that one exception where an insert assumes columns are in a certain order and doesn't spell 'em out. :smile:
[QUOTE=Madpoo;440512]But even that means making sure any updates/inserts aren't using implicit column placements... I'm pretty sure anything like that is explicitly handled (as far as I've seen, that's the case) but Murphy's law says there'd be that one exception where an insert assumes columns are in a certain order and doesn't spell 'em out. :smile:[/QUOTE]
Implicit column placements... Shudder.... :wink: Your planned migration sounds sane. Add the new column at the end of the table, and the worst case will be that the new nvarchar column isn't updated by any legacy code doing implicit references.
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.