Ruby 1.9.2, Encoding UTF-8 and Rails: invalid byte sequence in US-ASCII
written by Paul on January 20th, 2011 @ 12:05 AM
While working on a migration from Ruby 1.8.7 to 1.9.2, I ran into some issues with encoding. Fortunately, we are using PostgreSQL, and its database drivers are pretty good at UTF-8 support and encoding tagging, but there were still some snags in a few areas of my code.
My company has pretty custom URL structures. Because we are concerned about multiple URLs pointing to the same content, which could look to Google like we are doing something bad, we have some code that checks that the URL that was requested is the same URL we would have generated, and redirects if it isn't.
In this code, we generate a URL and compare it to the value of request.request_uri to see if we should redirect or not. One issue that came up is that Nginx and Passenger encode the Unicode characters and Rack turns the result into binary, which Ruby calls ASCII-8BIT, but that really just means that no encoding is assigned.
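To see why the comparison can fail even when the bytes match, here is a minimal sketch. The path and variable names are made up for illustration, and it only assumes that the requested string arrives tagged as binary while the generated one is UTF-8:

```ruby
# Hypothetical stand-in for request.request_uri: correct bytes, binary tag.
from_rack = "/ch\xC3\xA2teau".dup.force_encoding("ASCII-8BIT")

# Hypothetical stand-in for the URL the app generates: same bytes, UTF-8 tag.
generated = "/ch\u00E2teau"

same_bytes    = from_rack.bytes == generated.bytes
equal_strings = from_rack == generated

puts "same bytes:    #{same_bytes}"
puts "equal strings: #{equal_strings}"
```

Ruby compares the encoding tags as well as the bytes once non-ASCII characters are involved, so the two strings count as different and a check like ours would redirect when it should not.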
In the browser a url might look like this:
But when my code generated the URL, it looked like this:
The above could easily be fixed with this:
Then I had an issue where I had a URL (request.request_uri) like:
It was ASCII-8BIT, which is really a way of saying that it's binary, or in other words that no encoding is set. The solution was pretty easy: I just assigned it the encoding that I knew it should have:
"/h-336461-amboise_hotel-ch\xC3\xA2teau_de_noizay".force_encoding('UTF-8') # => "/h-336461-amboise_hotel-château_de_noizay"
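force_encoding only changes the tag, not the bytes, so it will happily label garbage as UTF-8. A slightly more defensive sketch could check the result first. The helper name tag_as_utf8 is hypothetical, not something from my app or from Rails:

```ruby
# Hypothetical helper: retag a binary string as UTF-8 without touching its
# bytes, and fall back to the original string if the bytes are not valid UTF-8.
def tag_as_utf8(raw)
  tagged = raw.dup.force_encoding("UTF-8")
  tagged.valid_encoding? ? tagged : raw
end

# The path from above, tagged as binary the way it arrived.
binary_path = "/h-336461-amboise_hotel-ch\xC3\xA2teau_de_noizay".dup.force_encoding("ASCII-8BIT")
path = tag_as_utf8(binary_path)

puts path.encoding
puts path
```

If the bytes are not valid UTF-8, the helper returns the string untouched, so the caller can decide what to do with it instead of blowing up later in a view.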
Then I had an issue where templates/views were breaking because some data in a Haml view was being treated as ASCII. The text was supposed to look like “Details for Château de Noizay,” but Haml raised the exception “ActionView::TemplateError (invalid byte sequence in US-ASCII).”
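The same kind of error can be reproduced outside of Rails. This is only a sketch of the general failure, not what Haml does internally: it assumes a string whose bytes are UTF-8 but whose tag says US-ASCII, and then runs a regexp over it:

```ruby
# UTF-8 bytes for "Château", deliberately mislabeled as US-ASCII.
mislabeled = "Ch\xC3\xA2teau".dup.force_encoding("US-ASCII")

error_message =
  begin
    # Matching a regexp forces Ruby to interpret the bytes in the tagged encoding.
    mislabeled =~ /teau/
    nil
  rescue ArgumentError => e
    e.message
  end

puts "valid encoding? #{mislabeled.valid_encoding?}"
puts "error: #{error_message}"
```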
After digging around a bit, I was able to configure Apache (on my Mac) by adding the following to /etc/profile.
Then, after restarting Apache on my Mac, I refreshed, and the text that was supposed to look like “Details for Château de Noizay” ended up looking like “Details for Ch 도teau de Noizay”.
I was about to write my own hybrid Asian/Latin language, but instead I added the following to my environment.rb and everything seemed to come together like I had hoped it would.
Encoding.default_internal = 'UTF-8'
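For a rough idea of what that setting does, here is a small sketch. The ISO-8859-1 input is an assumption for illustration, not data from my app. Calling String#encode with no arguments transcodes to Encoding.default_internal when one is set:

```ruby
Encoding.default_internal = 'UTF-8'

# "Château" as a single-byte ISO-8859-1 string (assumed input for illustration).
latin1 = "Ch\xE2teau".dup.force_encoding("ISO-8859-1")

# With no arguments, encode converts to Encoding.default_internal.
utf8 = latin1.encode

puts utf8.encoding
puts utf8
```

Unlike force_encoding, encode actually rewrites the bytes, so the single byte for "â" becomes the two-byte UTF-8 sequence.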
Now that my app was able to run without encoding errors, I said “yeah!”
Hope this scribble scrabble helps some poorly encoded soul.