Ruby 1.9.2, Encoding UTF-8 and Rails: invalid byte sequence in US-ASCII

written by Paul on January 20th, 2011 @ 12:05 AM

While working on a migration from Ruby 1.8.7 to 1.9.2, I ran into some issues with encoding. Fortunately, we are using PostgreSQL, and its database drivers are pretty good about UTF-8 support and tagging strings with an encoding, but there were still some snags in a few areas of my code.

My company has pretty custom URL structures. Because we don't want multiple URLs going to the same content, which could look to Google like we are doing something bad, we have some code that ensures that the URL that was requested is the same URL that we would have generated; if it isn't, we redirect.

In this code, we generate a URL and compare it to the value of request.request_uri to see if we should redirect or not. One issue that came up is that Nginx and Passenger encode the Unicode characters and Rack turns the result into a binary string, tagged ASCII-8BIT, which really just means that no encoding is assigned.

In the browser a url might look like this:

/h-336461-amboise_hotel-château_de_noizay

But when my code generated the URL it looked like this:

/h-336461-amboise_hotel-ch%C3%A2teau_de_noizay

The above could easily be fixed with this:

URI.unescape(canonical_url)
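
For anyone trying this on a newer Ruby, here is a rough sketch of the same decoding step. It assumes URI.decode_www_form_component as a stand-in, because URI.unescape was removed in Ruby 3.0. The stand-in also turns "+" into a space, which does not matter for this path but might for others.

```ruby
require 'uri'

# The URL as the app generated it: percent-encoded, pure ASCII.
canonical_url = "/h-336461-amboise_hotel-ch%C3%A2teau_de_noizay"

# Stand-in for URI.unescape(canonical_url) on Rubies where it no longer exists.
# The second argument tags the decoded bytes as UTF-8.
decoded = URI.decode_www_form_component(canonical_url, Encoding::UTF_8)

puts decoded
puts decoded.encoding
```

The decoded string is what gets compared against the requested URL.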

Then I had an issue where I had a URL (request.request_uri) like:

/h-336461-amboise_hotel-ch\xC3\xA2teau_de_noizay

It was ASCII-8BIT, which is really a way of saying that it's binary, or in other words that no encoding is set. The solution was pretty easy: I just assigned it the encoding that I knew it should be:

"/h-336461-amboise_hotel-ch\xC3\xA2teau_de_noizay".force_encoding('UTF-8')
  # => "/h-336461-amboise_hotel-château_de_noizay"
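
To see why the comparison failed before the fix, here is a minimal sketch. The variable names are made up for illustration, and the binary string only simulates what Rack handed over; it is not the actual redirect code.

```ruby
# Simulate the request URI from Rack: the right bytes, but tagged as binary.
request_uri = "/h-336461-amboise_hotel-ch\xC3\xA2teau_de_noizay".dup.force_encoding(Encoding::BINARY)

# What the app generated: the same characters, properly tagged as UTF-8.
canonical_url = "/h-336461-amboise_hotel-ch\u00E2teau_de_noizay"

# Same bytes, different encoding tags, non-ASCII content: Ruby treats them as unequal.
before = (request_uri == canonical_url)

# force_encoding only changes the tag; the bytes are left untouched.
tagged = request_uri.dup.force_encoding(Encoding::UTF_8)
after = (tagged == canonical_url)

puts "before: #{before}, after: #{after}, valid: #{tagged.valid_encoding?}"
```

Checking valid_encoding? after force_encoding is a cheap sanity check that the bytes really are what you claimed they are.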

Then I had an issue where templates/views were breaking because Haml thought some of the data in a view was ASCII. The text was supposed to look like this: “Details for Château de Noizay,” but Haml raised the exception “ActionView::TemplateError (invalid byte sequence in US-ASCII).”
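
The same kind of error can be reproduced outside of Rails. This is only a sketch of the failure mode, assuming a string whose UTF-8 bytes have been mislabeled as US-ASCII; it is not what Haml does internally.

```ruby
# Valid UTF-8 bytes for "Château", but mislabeled as US-ASCII.
text = "Details for Ch\xC3\xA2teau de Noizay".dup.force_encoding(Encoding::US_ASCII)

# Bytes above 0x7F cannot be US-ASCII, so the string is considered broken.
valid = text.valid_encoding?

# Matching a regexp against a broken string raises an ArgumentError.
message = begin
  text =~ /Noizay/
  nil
rescue ArgumentError => e
  e.message
end

puts "valid: #{valid}"
puts "error: #{message}"
```

Calling valid_encoding? and encoding on the offending string is usually the quickest way to find out where the wrong tag came from.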

After digging around a bit I was able to configure Apache (on my Mac) by adding the following to /etc/profile:

export LANG=en_US.UTF-8

Then, after restarting Apache on my Mac and refreshing the page, the text that was supposed to look like “Details for Château de Noizay” ended up looking like “Details for Ch 도teau de Noizay”.

I was about to write my own hybrid Asian/Latin language, but instead I added the following to my environment.rb and everything seemed to come together like I had hoped it would.

Encoding.default_internal = 'UTF-8'
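
As a rough illustration of what this setting changes, here is a sketch that reads a Latin-1 file. The file name and its contents are made up for the example; the point is only that once default_internal is set, Ruby transcodes data it reads into UTF-8.

```ruby
require 'tmpdir'

# The same setting as above.
Encoding.default_internal = 'UTF-8'

text = nil
Dir.mktmpdir do |dir|
  # Hypothetical Latin-1 file: "Château" with â stored as the single byte 0xE2.
  path = File.join(dir, 'latin1.txt')
  File.binwrite(path, "Ch\xE2teau")

  # Only the external (on-disk) encoding is named here; because default_internal
  # is set, the data is transcoded to UTF-8 as it is read.
  text = File.read(path, encoding: 'ISO-8859-1')
end

puts text.encoding
puts text == "Ch\u00E2teau"
```

Without default_internal, the same read would hand back a string still tagged ISO-8859-1, which is where mixed-encoding errors tend to start.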

Now that my app was able to run without encoding errors, I said “yeah!”

Hope this scribble scrabble helps some poorly encoded soul.

Thanks to a few articles I was not only reminded of some of the basics of encoding but learned to embrace the new changes within Ruby 1.9.2. We’ll see how tomorrow goes. ;)

5 responses to “Ruby 1.9.2, Encoding UTF-8 and Rails: invalid byte sequence in US-ASCII”

  1. Lucas Hills says :

    Many Thanks Paul! Great little article. I’ve been running into the same probs recently.. fun times..

  2. patlacambacal says :

    Hello!

    I'm having a problem saving to my database… I'm using UTF-8 in both the db and my app, but I'm getting this error:

    Encoding::UndefinedConversionError: U+2122 from UTF-8 to US-ASCII:

    when I'm using these characters “,” and ’

    hope you can help.
    Thanks in advance.

    • peppyheppy says :

      Hello. Depending on what version of Ruby, platform, or database driver you use, you might get different results. Are you using MySQL with the mysql2 driver? I had a problem a while ago where a specific Ruby release (I think it was r128) had bugs with UTF-8 encoding.

      I would start by making sure all of the tools (driver, database, ruby, etc) you are using are the most current. Then if the problem continues I would find out where in the code (driver or app) that exception is being thrown and inspect the strings and their encodings more carefully. Good luck.

      • patlacambacal says :

        Im using

        ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-linux]
        Rails 3.2.13
        my database is oracle11g

        I have this in my config/boot.rb

        require 'rubygems'

        # Set up gems listed in the Gemfile.
        ENV['BUNDLE_GEMFILE'] ||= File.expand_path('../../Gemfile', __FILE__)

        require 'bundler/setup' if File.exists?(ENV['BUNDLE_GEMFILE'])

        # Run this SQL statement in your client to determine NLS_LANG:
        # SELECT USERENV ('language') FROM DUAL
        ENV['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'

        and this in my config/application.rb

        config.encoding = "utf-8"

        In my local it works but not in production…

        Encoding::UndefinedConversionError: U+201C from UTF-8 to US-ASCII: INSERT INTO "table_name" ("some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field", "some_field") VALUES (:a1, :a2, :a3, :a4, :a5, :a6, :a7, :a8, :a9, :a10, :a11, :a12, :a13, :a14, :a15)

        Thanks for the reply peppyheppy!

  3. patlacambacal says :

    I fixed this problem by removing this line
    require "ruby-plsql" from my environment.rb
