 2014-09-06, 16:18 #1 Rodrigo     Jun 2010 Pennsylvania

I just noticed that Top Producers listings containing accented characters are not displaying properly, although they did until very recently. Has something changed in how accented characters are displayed? This is happening on every PC I've checked, so I know it's not a settings change at the user end.

To give a couple of examples: #485 in the TF Top Producers, André Jordi, is showing up as "Andr� Jordi", with a small box where the "é" should be. And #713, Jean-François Nies, is rendered as "Jean-Fran�ois Nies." You get the idea.

What happened to change this, and can it get fixed?

R�drig� (just kidding there)
 2014-09-06, 16:52 #2 Mark Rose

It looks like the pages are all setting their character set to UTF-8, but I bet nobody converted the existing names in the database, which are probably ISO-8859-1 encoded.
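
The suspected mismatch is easy to reproduce. The sketch below uses Python (not something mentioned in the thread) and assumes the stored name is held as ISO-8859-1 bytes that the browser is then told to read as UTF-8.

```python
# Assumption: the database holds the name as ISO-8859-1 (Latin-1) bytes.
stored = "André Jordi".encode("iso-8859-1")   # é becomes the single byte 0xE9

# A page declared as UTF-8 makes the browser decode those bytes as UTF-8.
# 0xE9 followed by a space is not a valid UTF-8 sequence, so the decoder
# substitutes U+FFFD, the replacement character shown as a box or "?".
shown = stored.decode("utf-8", errors="replace")

print(stored)   # the raw bytes as stored
print(shown)    # what a UTF-8 reader makes of them
```

Decoding the same bytes as ISO-8859-1 gives back the original name, which is why switching the browser's encoding changes what appears.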
 2014-09-07, 06:01 #3 kladner

I am seeing the '�' character under my normal Unicode encoding setting in Firefox. Changing the encoding to Western shows what that position contains (see attachment). Other settings, such as Central European (Windows) or ISO, display yet other characters. So, as I understand your answer, Mark, this is just a bit of cleanup left over from the migration.
 2014-09-07, 16:26 #4 Xyzzy     "Mike" Aug 2002 7,703 Posts

Quote:
 I am seeing the '�' character under my normal Unicode encoding setting in Firefox. Changing the encoding to Western shows what that position contains (see attachment).
A sample from Chrome set to UTF-8.

 2014-09-07, 19:11 #5 Rodrigo

FWIW, when I manually typed in the special characters to show what they were supposed to look like on the producers list, I entered the é as ALT-130, while the ç was ALT-135 (in Windows).
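
For reference, those ALT codes are positions in the old DOS (OEM) code page, not in ISO-8859-1 or Unicode. A quick check in Python, assuming code page 437 (the US default; this is my assumption, not something stated above):

```python
# Assumption: ALT-nnn without a leading zero maps through OEM code page 437.
decoded = {code: bytes([code]).decode("cp437") for code in (130, 135)}

for code, ch in decoded.items():
    latin1 = ch.encode("iso-8859-1").hex()   # one byte in ISO-8859-1
    utf8 = ch.encode("utf-8").hex()          # two bytes in UTF-8
    print(code, ch, "ISO-8859-1:", latin1, "UTF-8:", utf8)
```

So the same letter has three different byte values depending on the encoding, which is the root of the whole problem.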

Rodrigo
 2014-09-08, 02:05 #6 Xyzzy     "Mike" Aug 2002 7,703 Posts

It is interesting to see how the special characters appear when "viewed" by od:
Code:
$ echo "To give a couple of examples: #485 in the TF Top Producers, André Jordi, is showing up as "Andr� Jordi", with a small box where the "é" should be. And #713, Jean-François Nies, is rendered as "Jean-Fran�ois Nies." You get the idea." | od -c
0000000   T   o       g   i   v   e       a       c   o   u   p   l   e
0000020       o   f       e   x   a   m   p   l   e   s   :       #   4
0000040   8   5       i   n       t   h   e       T   F       T   o   p
0000060       P   r   o   d   u   c   e   r   s   ,       A   n   d   r
0000100 303 251       J   o   r   d   i   ,       i   s       s   h   o
0000120   w   i   n   g       u   p       a   s       A   n   d   r 357
0000140 277 275       J   o   r   d   i   ,       w   i   t   h       a
0000160       s   m   a   l   l       b   o   x       w   h   e   r   e
0000200       t   h   e     303 251       s   h   o   u   l   d       b
0000220   e   .       A   n   d       #   7   1   3   ,       J   e   a
0000240   n   -   F   r   a   n 303 247   o   i   s       N   i   e   s
0000260   ,       i   s       r   e   n   d   e   r   e   d       a   s
0000300       J   e   a   n   -   F   r   a   n 357 277 275   o   i   s
0000320       N   i   e   s   .       Y   o   u       g   e   t       t
0000340   h   e       i   d   e   a   .  \n
0000351
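
To read the octal values in that dump: od prints each non-ASCII byte as three octal digits. A short Python sketch (my own illustration, using only the byte values visible above) decodes them:

```python
# Byte values copied from the od output, written as octal literals.
e_acute = bytes([0o303, 0o251])        # the bytes where "é" was typed
c_cedilla = bytes([0o303, 0o247])      # the bytes where "ç" was typed
box = bytes([0o357, 0o277, 0o275])     # the bytes where the box appears

for label, raw in (("e_acute", e_acute), ("c_cedilla", c_cedilla), ("box", box)):
    ch = raw.decode("utf-8")
    print(label, raw.hex(), f"U+{ord(ch):04X}")
```

The typed letters are ordinary two-byte UTF-8 sequences, while the box is a literal U+FFFD. In other words, the pasted text already carries the replacement character, so the original byte cannot be recovered from it.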
 2014-09-08, 22:48 #7 Rodrigo

I guess the questions now are: is it possible to fix this, and what would the fix involve? Maybe somebody has a way to convert the characters in some automated fashion. Or maybe somebody could volunteer to guess what the special characters are supposed to be and pass them along to the right person. I'd have to verify, but I'm confident I could get most of them right (about 13 of the 21 on the Top TF list without research, though not nearly all).

Curiously, not all special characters got messed up this way. Check out #1951 on that list:
Code:
Ś�ṇȩł
Heh - a member name made up entirely of letters with diacritical marks. Maybe figuring out why some characters made it while others didn't will yield useful clues.
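
One possible clue, offered as a guess and not a diagnosis: none of the four intact letters exists in ISO-8859-1 at all, so they cannot have been stored as single Latin-1 bytes and presumably reached the page by some other route. The Python check below only shows which letters fit in ISO-8859-1; how the site actually stores them is an assumption on my part.

```python
# The four intact letters from entry #1951, written as escapes to be unambiguous:
# U+015A (S acute), U+1E47 (n dot below), U+0229 (e cedilla), U+0142 (l stroke).
intact = "\u015a\u1e47\u0229\u0142"

def fits_latin1(ch: str) -> bool:
    """Return True if the character can be encoded as ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

results = {ch: fits_latin1(ch) for ch in intact}
for ch, ok in results.items():
    print(f"U+{ord(ch):04X}", ch, "fits" if ok else "does not fit")

# For comparison, the letters that broke elsewhere in the list do fit.
print(fits_latin1("é"), fits_latin1("ç"))
```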

Rodrigo
 2014-09-10, 05:46 #8 Rodrigo

I see now that the special characters have been fixed -- fabulous!

Rodrigo
 2014-09-22, 16:46 #9 Serpentine Vermin Jar     Jul 2014 63158 Posts

Quote:
 Originally Posted by Rodrigo I see now that the special characters have been fixed -- fabulous! Rodrigo
The hourly reports are a dump from SQL, and the resulting file was not UTF-8. The fix was a post-dump conversion from ANSI to UTF-8, and that did the trick. We also made sure the web server declares the encoding as UTF-8. There are also a few ad-hoc reports that pull data from SQL and may or may not contain upper-ASCII characters; those are slowly being fixed so that UTF-8 is used for any output where that could happen.
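
A post-dump conversion of this kind can be sketched in a few lines. This is not the actual script used on the server; it is a Python illustration that assumes "ANSI" means Windows-1252, and the file names are made up.

```python
import tempfile
from pathlib import Path

def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Re-encode a text dump from a single-byte code page to UTF-8."""
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))

# Demo with a throwaway file standing in for the hourly dump.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "report_ansi.txt"    # hypothetical file name
    dst = Path(tmp) / "report_utf8.txt"    # hypothetical file name
    src.write_bytes("Jean-François Nies\n".encode("cp1252"))
    convert_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted)
```

Working on raw bytes keeps line endings untouched, so the only thing that changes is how the accented letters are encoded.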

The database itself is using varchar for some columns where ideally it should be nvarchar, but after looking at it in total, changing those column definitions and all the other things that touch it would be a larger task, so we're hoping to "fix" it for now on the PHP side of things.

As it is, with accented characters being stored and output in ANSI, I wondered what would happen if a Japanese user tried to set their public username to something in Hiragana ... the system probably wouldn't allow it or, if it did, the mix of encodings would produce some bizarre results. So yeah, it'd be good to use proper fields in the long term. For now I'm just happy that the hourly reports are showing accented western characters properly.
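
The Hiragana concern can be illustrated outside the database. The Python sketch below assumes the single-byte columns behave like Windows-1252; what the real system would do with such a name is not something I have tested.

```python
name = "ひらがな"  # "hiragana" written in Hiragana

# A strict conversion to a Western single-byte code page simply fails.
try:
    name.encode("cp1252")
    storable = True
except UnicodeEncodeError:
    storable = False

# A lossy conversion substitutes a placeholder for every character.
lossy = name.encode("cp1252", errors="replace")

print(storable, lossy)
```

Either the name is rejected or it turns into placeholders, which is the argument for Unicode column types in the long run.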

