
I think this only happens if the user manually switches to Latin-1 encoding. In that case IE will try to use the same encoding when submitting form data. The user might do this if you already have encoding problems and present a mixed Latin-1/UTF-8 page. The snowman hack serves to prevent the corruption from spreading.


> I think this only happens if the user manually switches to Latin-1 encoding.

That's correct. If you send your HTML document with a charset of UTF-8 (in the Content-Type header), then IE will submit forms using UTF-8 even if the user doesn't enter any non-ASCII characters. The exception is when the user changes the encoding, but I have yet to hear a compelling reason why an ordinary user would do that under ordinary circumstances.

> The snowman hack serves to prevent the corruption from spreading.

It's clever, but the framework could also just reject POST and GET requests that contain invalid UTF-8 byte sequences. (I'm flabbergasted that Ruby doesn't do this[1].) Otherwise a malicious user could try to inject bytes that aren't valid UTF-8 into your database by sending crafted requests that nevertheless contain the "utf8=✓". And speaking from experience, you do not want to have to deal with encoding problems in your database.

[1] http://stackoverflow.com/questions/3222013/what-is-the-snowm...
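
To make "reject requests that contain invalid UTF-8" concrete, here is a minimal sketch in Python (not Ruby, and not what any particular framework does). It assumes you already have the raw, percent-decoded bytes of a parameter; the function name is hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        # Strict decoding raises on any malformed or truncated sequence.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A UTF-8 encoded value passes; the same text encoded as Latin-1 does not,
# because the lone byte 0xE9 starts a multi-byte sequence that never completes.
utf8_value = "café".encode("utf-8")
latin1_value = "café".encode("latin-1")
```

A server could run a check like this on every parameter and answer with a 400 when it fails, whether or not the "utf8=✓" marker is present.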


Whether or not you use this hack, you can't naively trust the client to always send valid UTF-8; you're right about this. But because of this bug in IE, rejecting posts with invalid UTF-8 as malicious will net you some false-positive cases, where the user isn't malicious but the browser is being stupid. This hack takes care of the stupidity, leading to a better user experience for people who would otherwise have tripped the false-positive.
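
A rough sketch of how a server could use the marker parameter, in Python and under stated assumptions: the parameter is called "utf8" as in the thread, the query string is still percent-encoded, and a browser submitting in Latin-1 cannot represent the check mark and so sends something other than its UTF-8 bytes (commonly a numeric character reference; treat that detail as an assumption). The function name is hypothetical.

```python
from urllib.parse import unquote_to_bytes

# The check mark as it appears on the wire when the form was encoded as UTF-8.
UTF8_CHECK_MARK = "✓".encode("utf-8")  # bytes E2 9C 93


def marker_confirms_utf8(query: str, marker_name: str = "utf8") -> bool:
    """Return True only if the marker parameter arrived as the UTF-8 check mark."""
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        if name == marker_name:
            # Compare raw bytes so no decoding guess is involved.
            return unquote_to_bytes(value.replace("+", " ")) == UTF8_CHECK_MARK
    return False
```

If the check fails, the server knows the submission was probably not UTF-8 and can handle it separately instead of treating the user as malicious.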


What if a sequence of byte values is valid in the charset that IE uses to encode the form data as well as in UTF-8, but is interpreted as different characters in UTF-8? With your method you would not detect an error and use the wrong characters. (Except if IE sends a content-type header with the actual encoding used, and this header is evaluated on the server side to convert the form data into a string. But in that case you don't have to check for invalid UTF-8 characters, but for characters that are invalid in the charset specified in the content-type header.)
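
A small Python illustration of that ambiguity: the two bytes below are valid in both encodings, so a validity check alone cannot tell which one the browser meant. The helper name is hypothetical.

```python
def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes as UTF-8 and as Latin-1 for comparison."""
    return raw.decode("utf-8"), raw.decode("latin-1")


# 0xC3 0xA9 is "é" in UTF-8, but the two characters "Ã©" in Latin-1.
ambiguous = b"\xc3\xa9"
as_utf8, as_latin1 = decode_both(ambiguous)
```

Both decodings succeed without error, which is why the server needs some out-of-band signal about the encoding actually used.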


Ordinary users would do it because many corporate sites used to direct them to do this as a fix for their browsers displaying invalid characters.


Would they do that if their browser isn't displaying invalid characters? That's what I meant by "ordinary circumstances."


Corrupted characters in a MySQL database are unfortunately quite common. When users apply that workaround and switch the encoding, their submissions will propagate the corruption unless this hack is in place.



