I think this only happens if the user manually switches to Latin-1 encoding. In ...

agwa · on Oct 18, 2012

> I think this only happens if the user manually switches to Latin-1 encoding.

That's correct. If you send your HTML document with a charset of UTF-8 (in the Content-Type header), then IE will submit forms using UTF-8 even if the user doesn't input any UTF-8 characters. The exception is when the user changes the encoding, but I have yet to hear a compelling reason why an ordinary user would do that under ordinary circumstances.

> The snowman hack serves to prevent the corruption from spreading.

It's clever, but the framework could also just reject POST and GET requests that contain invalid UTF-8 byte sequences. (I'm flabbergasted that Ruby doesn't do this[1].) Otherwise a malicious user could try to inject non-UTF-8 bytes into your database by sending crafted requests that nevertheless contain "utf8=✓". And speaking from experience, you do not want to have to deal with encoding problems in your database.

[1] http://stackoverflow.com/questions/3222013/what-is-the-snowm...
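
A minimal sketch of the rejection idea above, in Python rather than Ruby and not taken from any framework: percent-decode each parameter to raw bytes and refuse the request if any of them fails strict UTF-8 decoding. The function names `is_valid_utf8` and `has_invalid_utf8` are made up for illustration, and the hand-rolled query-string splitting is a simplifying assumption.

```python
from urllib.parse import unquote_to_bytes


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def has_invalid_utf8(query_string: str) -> bool:
    """Return True if any parameter name or value is not valid UTF-8.

    Hypothetical helper: a real application would hook this check into
    its framework's request parsing instead of splitting by hand.
    """
    for pair in query_string.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        for part in (name, value):
            # '+' means space in form encoding; then undo %XX escapes.
            raw = unquote_to_bytes(part.replace("+", " "))
            if not is_valid_utf8(raw):
                return True
    return False
```

Here `has_invalid_utf8("utf8=%E2%9C%93&name=%E9")` flags the request even though the marker field is present, because `%E9` on its own is a Latin-1 byte rather than a valid UTF-8 sequence.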

Millennium · on Oct 18, 2012

Whether or not you use this hack, you can't naively trust the client to always send valid UTF-8; you're right about this. But because of this bug in IE, rejecting posts with invalid UTF-8 as malicious will net you some false-positive cases, where the user isn't malicious but the browser is being stupid. This hack takes care of the stupidity, leading to a better user experience for people who would otherwise have tripped the false-positive.

Tobias42 · on Oct 18, 2012

What if a sequence of byte values is valid in the charset that IE uses to encode the form data as well as in UTF-8, but is interpreted as different characters in UTF-8? With your method you would not detect an error and use the wrong characters. (Except if IE sends a content-type header with the actual encoding used, and this header is evaluated on the server side to convert the form data into a string. But in that case you don't have to check for invalid UTF-8 characters, but for characters that are invalid in the charset specified in the content-type header.)
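
A small illustration of the ambiguity described above, assuming Latin-1 as the other charset: the two bytes 0xC3 0xA9 are the two characters "Ã©" in Latin-1 but the single character "é" in UTF-8, so a validity check alone cannot tell which one the user typed. The helper name `decode_both` is hypothetical.

```python
def decode_both(raw: bytes) -> tuple:
    """Decode the same bytes as Latin-1 and as UTF-8.

    Illustrative only: Latin-1 maps every byte to a character, so that
    decode never fails, while the UTF-8 decode raises on invalid input.
    """
    return raw.decode("latin-1"), raw.decode("utf-8")


# Bytes a browser would send for the text "Ã©" when encoding as Latin-1.
raw = "\u00c3\u00a9".encode("latin-1")  # b'\xc3\xa9'

as_latin1, as_utf8 = decode_both(raw)
# as_latin1 is the original two characters; as_utf8 is the single
# character U+00E9. Both decodes succeed without any error.
```

Both interpretations succeed here, which is the case the comment warns about: only an out-of-band signal (a declared charset or a marker field) can distinguish them.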

carllerche · on Oct 18, 2012

Ordinary users would do it because many corporate sites used to direct them to do this as a fix for their browsers displaying invalid characters.

agwa · on Oct 18, 2012

Would they do that if their browser isn't displaying invalid characters? That's what I meant by "ordinary circumstances."

wycats · on Oct 18, 2012

Corrupted characters in a MySQL database are unfortunately quite common. The user-side fix will thus propagate the corruption without this hack.