How does the browser send data in a form for non-English characters?

Discussion:

Chris Y

2006-04-20 00:16:00 UTC

I am using IE and IIS.

I am having problems with non-English characters, so I ran a test with a simple FORM with ENCTYPE='application/x-www-form-urlencoded' and a single <INPUT NAME='name'> box, and then wrote out the data that was received. My results are:

a. If I input just Chinese characters: 毛泽东, it was received and written out as: Ã«Ôó¶«. I captured the HTTP POST stream and the data passed was: name=%C3%AB%D4%F3%B6%AB.

b. If I put another character, the copyright symbol (Unicode U+00A9), before my Chinese characters: ©毛泽东, then the data is received correctly, i.e. the same as I entered it. The data transmitted was: name=%A9%26%2327611%3B%26%2327901%3B%26%2319996%3B. I can see that this is the same as: ©&#27611;&#27901;&#19996;.
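
To see what the bytes in case (a) actually are, one can percent-decode the captured body and try a few candidate charsets. The sketch below uses Python purely for illustration (the thread itself concerns IE and IIS); the helper name `decode_value` is made up for this example, and the choice of candidate charsets is an assumption.

```python
from urllib.parse import unquote_to_bytes

# Form body captured in case (a), copied from the post above.
captured = "name=%C3%AB%D4%F3%B6%AB"


def decode_value(body: str, charset: str) -> str:
    """Percent-decode the value of a single key=value pair, then decode the bytes."""
    _, _, value = body.partition("=")
    raw = unquote_to_bytes(value)  # b'\xc3\xab\xd4\xf3\xb6\xab'
    return raw.decode(charset)


# Reading the six bytes as Latin-1 yields one character per byte,
# which matches the garbled text that was written out.
as_latin1 = decode_value(captured, "latin-1")

# Reading the same bytes as GB2312 (Simplified Chinese) yields three ideographs.
as_gb2312 = decode_value(captured, "gb2312")

print(as_latin1)
print(as_gb2312)
```

The same six bytes are not valid UTF-8, so decoding them with "utf-8" raises UnicodeDecodeError.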

I am at a loss as to what is going on. Why would the browser encode the data differently in the two cases? How can I force it to stick to one method (the second one)?

I am not totally familiar with Unicode. In Character Map, why do some characters have two codes? For example, 毛 is shown as U+6BDB (0xC3AB). What is C3AB? It is the one causing the problem in my first case above.

Thanks in advance.

js

Michael (michka) Kaplan [MS]

2006-04-20 03:25:49 UTC

U+6bdb is a CJK ideograph.

U+C3AB is a Hangul syllable, but 0xC3 0xAB is also the two-byte encoding of
U+6BDB in GB2312 (code page 936), which is presumably what Character Map
shows in parentheses. In UTF-8 the same character is 0xE6 0xAF 0x9B.

Have you properly set the Response encoding? The bytes are right but it is
how they are being interpreted that is causing you problems.
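
The difference between a code point and its byte encodings can be checked directly. This is a small Python sketch for illustration only; it assumes the two candidate encodings worth comparing are UTF-8 and GB2312.

```python
import unicodedata

ch = "\u6bdb"  # the ideograph from the question

# The same character has different byte sequences in different encodings.
utf8_bytes = ch.encode("utf-8")
gb2312_bytes = ch.encode("gb2312")

print(unicodedata.name(ch))
print("UTF-8: ", utf8_bytes.hex())
print("GB2312:", gb2312_bytes.hex())

# Read as a Unicode code point instead of as bytes, C3AB is an unrelated character.
hangul_name = unicodedata.name("\uc3ab")
print(hangul_name)
```

So a hexadecimal value only identifies a character once it is known whether it is a code point or a byte sequence in some particular charset.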
--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

Chris Y

2006-04-20 06:12:56 UTC

Thanks, I did a few more experiments and found the cause.

I didn't put a charset statement in my web page. It appears that if no
charset is explicitly indicated, IE will try to find the most appropriate
charset to use. If it is just European characters, it will try and use the
Western European charset. If it's only Asian characters, it will use UTF-8.
If there is a combination of both, it will use method b, which I described
earlier.
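
The captured form bodies from the original post can be reproduced by encoding the same text under different charsets. The Python sketch below is not a description of IE's actual algorithm; `form_encode` is a hypothetical helper, and falling back to numeric character references for unrepresentable characters is an assumption that happens to match the capture in case (b). The capture in case (a) matches the GB2312 output here rather than the UTF-8 one.

```python
from urllib.parse import quote


def form_encode(text: str, page_charset: str) -> str:
    """Encode a form value in the given charset, then percent-encode the bytes.

    Characters the charset cannot represent become &#NNNNN; references,
    which is an assumed fallback, chosen because it matches case (b).
    """
    raw = text.encode(page_charset, errors="xmlcharrefreplace")
    return quote(raw, safe="")


print(form_encode("\u6bdb\u6cfd\u4e1c", "gb2312"))        # Chinese charset
print(form_encode("\u6bdb\u6cfd\u4e1c", "utf-8"))         # UTF-8
print(form_encode("\u00a9\u6bdb\u6cfd\u4e1c", "latin-1")) # Western European charset
```

Declaring the charset explicitly removes the guesswork, because the server then knows which of these forms to expect.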

If I force charset=UTF-8, then it always comes out correct. However, I
send my web data to an external application, and I am having a tough time
converting a string that is UTF-8 encoded. Is the way to go to first
convert the string (assuming it contains only one-byte characters) to bytes
and then call UTF8Encoding.GetString()?
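
The conversion asked about here (string of one-byte characters to bytes, then bytes to text as UTF-8) can be sketched as follows. This is Python for illustration rather than the .NET code the question refers to, and `repair_utf8` is a hypothetical name. It assumes each character of the incoming string holds exactly one byte value, as with an ISO-8859-1 decode.

```python
def repair_utf8(misread: str) -> str:
    """Recover text from a string whose characters each hold one UTF-8 byte."""
    # Latin-1 maps U+0000..U+00FF to the same byte values, so this step
    # restores the original bytes. It raises UnicodeEncodeError if any
    # character is above U+00FF, for example after a Windows-1252 decode.
    raw = misread.encode("latin-1")
    return raw.decode("utf-8")


# The nine UTF-8 bytes of the three ideographs, each read as one character.
misread = "\u00e6\u00af\u009b\u00e6\u00b3\u00bd\u00e4\u00b8\u009c"
repaired = repair_utf8(misread)
print(repaired)
```

The round trip is only lossless if nothing altered the one-byte characters in between, so decoding the bytes as UTF-8 at the point where they are first received is usually the more robust option.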

I don't know what charset setting gives the result in method b. The data
is sort of HTML encoded. I actually prefer this, as I can retrieve the data
in one go with HttpUtility.HtmlDecode().
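
Decoding the method b payload takes two steps: percent-decoding, then resolving the numeric character references. The Python sketch below is only an illustration of those steps, with `html.unescape` standing in for the HtmlDecode call mentioned above; reading the percent-decoded bytes as Latin-1 is an assumption based on the single %A9 byte for the copyright sign.

```python
import html
from urllib.parse import unquote

# Form value captured in case (b), copied from the original post.
captured = "%A9%26%2327611%3B%26%2327901%3B%26%2319996%3B"

# Step 1: percent-decode, reading the bytes as Latin-1 (assumed).
markup = unquote(captured, encoding="latin-1")

# Step 2: resolve the &#NNNNN; references into characters.
text = html.unescape(markup)

print(markup)
print(text)
```

One caveat with relying on this form: a user who literally types something like &#27611; into the box produces the same bytes, so the server cannot tell typed references apart from substituted ones.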
