Archive | Tips / Tutorials

Ruby 1.9.2, Encoding UTF-8 and Rails: invalid byte sequence in US-ASCII

written by Paul on January 20th, 2011 @ 12:05 AM

While working on a migration from Ruby 1.8.7 to 1.9.2, I ran into some issues with encoding. Fortunately, we are using PostgreSQL and its database drivers have pretty good UTF-8 support and encoding tagging, but there were still some snags in a few areas of my code.

My company has pretty custom URL structures. Because multiple URLs pointing to the same content can look to Google like we are doing something bad, we have some code that ensures the URL that was requested is the same URL that we would have generated, and if it isn't, we redirect.

In this code, we generate a URL and compare it to the value of request.request_uri to see if we should redirect or not. One issue that came up is that Nginx and Passenger escape the unicode characters, and Rack tags the result as binary, which is ASCII-8BIT, but that really just means that no encoding is assigned.

In the browser a url might look like this:

/h-336461-amboise_hotel-château_de_noizay

But when my code generated the url it looked like this:

/h-336461-amboise_hotel-ch%C3%A2teau_de_noizay

The above could easily be fixed with this:

URI.unescape(canonical_url)
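
Putting it together, the check looks roughly like this (a minimal sketch; generate_canonical_url is a hypothetical stand-in for our real URL builder):

def ensure_canonical_url
  canonical = URI.unescape(generate_canonical_url(params))
  unless request.request_uri == canonical
    redirect_to canonical, :status => :moved_permanently
  end
end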

Then I had issues where I had a URL (request.request_uri) like:

/h-336461-amboise_hotel-ch\xC3\xA2teau_de_noizay

It was ASCII-8BIT, which is really a way of saying that it's binary, or in other words that no encoding is set. The solution was pretty easy: I just assigned it the encoding that I knew it should be:

"/h-336461-amboise_hotel-ch\xC3\xA2teau_de_noizay".force_encoding('UTF-8')
  # => "/h-336461-amboise_hotel-château_de_noizay"

Then I had an issue where templates/views were breaking because some data in a haml view was being treated as US-ASCII. The text was supposed to look like “Details for Château de Noizay,” but haml raised an exception: “ActionView::TemplateError (invalid byte sequence in US-ASCII).”
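
The root cause is that Ruby 1.9 treats source files as US-ASCII unless told otherwise. For a plain .rb file the fix would be the 1.9 magic comment on the first line, though for templates it was the settings below that actually mattered:

# encoding: utf-8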

After digging around a bit I was able to configure Apache (on my Mac) by adding the following to /etc/profile:

export LANG=en_US.UTF-8

Then after restarting Apache on my Mac, I refreshed, and the text that was supposed to look like “Details for Château de Noizay” ended up looking like “Details for Ch 도teau de Noizay”.

I was about to invent my own hybrid Asian/Latin language, but instead I added the following to my environment.rb and everything seemed to come together like I had hoped it would.

Encoding.default_internal = 'UTF-8'
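
Setting Encoding.default_internal tells Ruby to transcode everything it reads through IO into that encoding, so templates and file data arrive as UTF-8 instead of whatever the external default happens to be. A quick sketch of the effect (latin1.txt is a hypothetical ISO-8859-1 file):

Encoding.default_internal = 'UTF-8'
File.open('latin1.txt', 'r:ISO-8859-1') { |f| f.read.encoding }
# => #<Encoding:UTF-8>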

Now that my app was able to run without encoding errors, I said “yeah!”

Hope this scribble scrabble helps some poorly encoded soul.

Thanks to a few articles I was not only reminded of some of the basics of encoding but learned to embrace the new changes within Ruby 1.9.2. We’ll see how tomorrow goes. 😉

 

Changing File Encoding Using Ruby 1.9.2

written by Paul on January 3rd, 2011 @ 06:22 AM

Currently, I am in the process of upgrading an application from Ruby 1.8.7 to Ruby 1.9.2. One of the big differences between 1.8 and 1.9 is the multi-byte character support.

The Problem

We have thousands of static html files that were generated under Ruby 1.8, and when Ruby 1.9 reads them it fails. As usual, before digging in to solve the problem, I did a quick search to see what other people have been doing. My search yielded a bunch of multi-line scripts and techniques… most of which were from the Ruby 1.8 days.

The Solution

In short, I wrote a simple four-line script in irb and it completed my task quickly. One thing that I am really happy about is how Ruby 1.9.2 strings have a method called encode that provides great utility when performing these kinds of tasks.

So here is the code:


`find . -name '*.html'`.split("\n").each do |filename|
  puts filename
  content = File.read(filename)            # read using the default external encoding
  File.open(filename, "w") do |handle|
    handle.write(content.encode('UTF-8'))  # transcode and write back
  end
end; nil

If you are interested in the other options the encode method takes, go check out the String#encode documentation.
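
For example, you can tell encode what to do with bytes that are invalid in the source or have no equivalent in the target encoding. A quick sketch:

latin1 = "caf\xE9".force_encoding('ISO-8859-1')
latin1.encode('UTF-8')
# => "café"
"café".encode('US-ASCII', :undef => :replace, :replace => '?')
# => "caf?"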

 

Crawlable AJAX for SPEED with Rails

Recently at work we have been focusing our efforts on increasing the overall performance of our site. Many of our pages have a lot of content on them; some might say too much. Thanks to Newrelic, we identified a couple of partials that were slow (they consumed ~60% of the time), but we could not just remove them from the page, and the long-term fix was going to be put into place over the coming weeks. Long story short, we thought it would be better to speed up the initial page load and then fetch the more expensive partials asynchronously using separate AJAX calls. That way the page time would be faster and the slow parts would be split up between requests.

The Problem: Google’s Crawler doesn’t like AJAX (normally)

Googlebot still does not load javascript or perform AJAX calls. Because of this we don't get credit from Google for the content that we load via AJAX, which is a bummer and a show stopper for us. Duplicate content is bad for SEO, and Google will come to our pages and see that they are similar: even though the user sees the relevant content as the page loads, Google will “think” that our pages are mostly the same (header and footer, etc.).

The Solution: Google allows for Crawlable AJAX

On their site, Google suggests a particular approach for making AJAX-heavy pages crawlable. I won't go into the details of how Google supports this because it's all stated in their paper, but I did want to focus on how I implemented the solution.

Before I continue I want to say that I was hesitant to do this because at first glance I didn't think it would be easy or effective. I was wrong, and I apologize to my friend Chris Sloan for doubting him when he proposed the idea. (He made me include this in the post and threatened my life if I didn't.)

Google basically wants to be able to see the ajaxified page as one whole static page, so the crawler passes an argument (_escaped_fragment_) to the page, and in turn we are supposed to render the whole page without needing AJAX calls to fill portions of it with content.

I wanted to funnel the AJAX calls for different partials through a single action within our site so I didn't have to build custom routes and custom actions for each partial, which would be extremely messy to maintain.

The Code

So here is a simple example of the approach we took:

We created a single action, plus javascript that looked for specific classes on the page and made requests to the server passing a couple of key parameters: /ajaxified?container=&url=<%= request.request_uri %>

    module AJAXified # include this in a controller (or the app controller)

      # HOWTO
      # To add a new AJAX section, do the following:
      # 1) Write a test in crawlable_spec for the new container and url
      # 2) Add the new method/container to the ALLOWED_CALLS array
      # 3) Add the new method below so it sets required instance variables

      ALLOWED_CALLS = [:bunch_o_things]

      def is_crawler?
        params.key?(:_escaped_fragment_)
      end

      # Actual Instance Setting Methods Are BELOW This Line

      # Note: each method needs to return the partial/template to render

      def bunch_o_things(options=nil)
        # ||= means the find is skipped when the including action already set @thing
        @thing ||= Thing.find(options[:params][:id])

        @things_for_view = @thing.expensive_call
        'thing/things_to_view'
      end

      # Actual Instance Setting Methods Are ABOVE This Line

      public # below is the actual main ajax action

      def ajaxified
        raise "method \"#{params[:container]}\" is not allowed for AJAX crawlable" unless ALLOWED_CALLS.include? params[:container].to_sym

        raw_route = ActionController::Routing::Routes.recognize_path(params[:url], :method => :get)
        request.params[:crawlable_controller] = raw_route[:controller]
        request.params[:crawlable_action]     = raw_route[:action]

        render :template => self.send(
          params[:container].to_sym, :params => request.params
        ), :layout => false
      end

    end
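
For completeness, the shared action needs a route. A Rails 2-style entry in config/routes.rb like this would do (the controller name is an assumption; point it at whichever controller includes the module):

map.connect 'ajaxified', :controller => 'things', :action => 'ajaxified'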

I needed the method :is_crawler? to remain callable from views as controller.is_crawler? without being exposed as a routable action:

  hide_action :is_crawler?

In the controller action where the code would normally have been executed, we need to add a check for the crawler so we don't execute code that is not needed:

def show
  @thing = Thing.find(params[:id])

  if is_crawler?
    # sets @things_for_view
    bunch_o_things
  end
  ...
end

In the view:

<article id="things" data-crawlable="<%= controller.is_crawler? ? 'false' : 'true' %>">
  <% if controller.is_crawler? or request.xhr? %>
    <% @things_for_view.each do |thing| %>
      ... potentially expensive stuff ...
    <% end %>
  <% end %>
</article>

Because I had to water the code down a bit to show how it works in general, this code is not tested, nor has it been executed as is. I actually had to add more stuff around the project I did for work in order for it to work as we needed it to.

The general idea here is to centralize the partial render, reduce duplication within the controller, and ensure that the code that slowed the page down to begin with only runs for the crawler or the AJAX request, not on the initial page load.

In the end, we were able to reduce the initial request time for our users by 60%, and Google is able to crawl our site as it always had.

 

Saving time with Threads the Ruby way

I have been working on some projects that require me to make multiple serial webservice calls using SOAP and HTTP clients. As you might guess, without concurrency it's such a waste waiting on the network IO, and the times accumulate: the more service calls, the slower it gets (0.5s + 1s + 2s + 1s + 1s = 5.5 seconds). Originally I wasn't worried, because I knew I would come back and tweak the performance using threads, and today was the day to get it going. Before I got too crazy coding, I wanted to run some basic benchmarks just to see if it would really end up making things faster. Here is what I did:

require 'benchmark'

Benchmark.bm { |rep|
  rep.report("non-threading") {
    1.upto(100) { |count|
      amount_rest = rand(4)
      # puts "##{count}: sleeping for #{amount_rest}"
      sleep(amount_rest)
      # puts "##{count}: woke up from a #{amount_rest} second sleep"
    }
  }

  rep.report("threading") {
    threads = []
    1.upto(100) { |c|
      threads << Thread.new(c) { |count|
        amount_rest = rand(4)
        # puts "##{count}: sleeping for #{amount_rest}"
        sleep(amount_rest)
        # puts "##{count}: woke up from a #{amount_rest} second sleep"
      }
    }
    threads.each(&:join) # wait for every thread to finish
  }
}

benchmark          user     system      total        real
non-threading  0.100000   0.290000   0.390000 (142.005792)
threading      0.010000   0.020000   0.030000 (  3.182716)

As you can see, threading in Ruby works really well as long as each thread is not doing anything CPU intensive. Even though Ruby 1.8.7 does not support native threads, its green threads handle IO waits just fine, as the numbers above show. When all was said and done, my total time dropped to roughly the duration of the slowest single call, and the approach will keep paying off if and when we have to do more requests concurrently.
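
Applied back to the original problem, the pattern is the same: one thread per service call, then collect the results. A minimal sketch (the example.com URLs are placeholders):

require 'net/http'
require 'uri'

urls = %w[http://example.com/rates http://example.com/availability]
threads = urls.map do |url|
  Thread.new(url) do |u|
    # each network call waits in its own thread
    Net::HTTP.get_response(URI.parse(u)).body
  end
end
responses = threads.map(&:value) # value joins the thread and returns its result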

I do, however, look forward to using Ruby 1.9, but this will do the trick for now.

Setting the ACLs to public-read on Millions of S3 Objects

I learned a valuable lesson today. When you use Amazon's Import/Export service, be sure that your manifest file includes the proper ACL metadata. I left it at the defaults, and my more than 600GB of files (yes, all 23 million of them) were not readable through CloudFront for use on my site because they were not public. I tried to use Amazon's web-based console to change the ACLs, but it was quite discouraging when it only updated about 100 a minute. I tried “Bucket Explorer” and although it was a bit faster, my 30-day trial would have expired before it finished. I knew I had to script something that could do it quicker, so I did a bit of research and figured that if I used EC2 it could be 100-1000x faster, because S3 treats requests from EC2 as internal.

So here are the steps that I took to hack a solution together. I hope that if you are in the same boat you might find this helpful.

Start an EC2 instance and ssh into it:

ssh -p 22 -i ~/Sites/mysite/myec2key.pem root@ec2-174-129-75-24.compute-1.amazonaws.com

Install Python's easy_install utility (on Ubuntu it goes like this):

sudo apt-get install python-setuptools

A helpful utility named s3funnel doesn't allow you to update an object's ACL, so we will only use it to build our object list. The reason I used s3funnel is that it is very fast at listing objects; in my tests it was over 2000 objects per second.

Install s3funnel: easy_install s3funnel

On the instance, I created a directory to store work in so I could keep things simple (you don't have to do this if you don't want to):

mkdir -p ~/amazon-files/input
mkdir ~/amazon-files/output
cd ~/amazon-files

Then I ran the s3funnel dump (you will have to replace AWS_KEY and AWS_SECRET with your own S3 credentials):

nohup s3funnel bucket_name --aws_key=AWS_KEY --aws_secret_key=AWS_SECRET list > s3files.txt &

Then once the object list was complete I split it up a bit into smaller files:

cd input
split -l 5000 --suffix-length=6 ../s3files.txt s3

For my 23 million files this created about 4600 files.

Then I wrote a bash script that moved the files into separate directories, one for each process I wanted to run at the same time. (I started with 10 and, as you'll see below, ended up using 50, which is what the modulus in the script reflects.)

for file in $(ls input/s3*)
do
  csum=`sum $file | cut -f1 -d' '`
  process=`expr $csum % 50`

  echo "Moving $file into input/$process"
  if [[ ! -d input/$process ]]
  then
    mkdir input/$process
    mkdir output/$process
  fi
  mv $file input/$process
done

Then I wrote a simple python script named amazon.py, which I placed in the ~/amazon-files directory. It uses boto (a python library for S3, the same one that s3funnel uses under the hood) and looks like this:

#! /usr/bin/env python

import sys
import boto 
import re

print 'processing file: ' + sys.argv[1]

f = open(sys.argv[1], 'r')  # the driver loop below passes the full input/<dir>/<file> path
c = boto.connect_s3("AWS_KEY", "AWS_SECRET")
b = c.get_bucket("bucket_name")
for line in f:
    b.set_acl('public-read', line.strip())

f.close()

Now that I have all of my objects evenly distributed into separate directories, I can loop through each directory and kick off one bash process per directory, moving completed files into the corresponding output directory. This way, if something goes wrong, I can see progress and just restart the scripts, and they will continue (pretty much) where they left off, give or take one file's worth of objects per process.

for directory in $(ls input);
do
  nohup bash -l -c "for file in \$(ls input/${directory}); do python amazon.py input/${directory}/\${file} && mv input/${directory}/\${file} output/${directory}; done" &
done

I first started with 10 processes and then realized that 50 processes would be better, so I continued with 50. Running 50 processes against my 23 million objects took about 12 hours to finish (around 532 objects updated a second). All in all, I was able to update all of the ACLs in what I now consider the fastest method possible.

This is obviously a hack and could use some cleanup and consolidation. Part of me wanted to just modify s3funnel to update all of the ACLs, but I am not that strong with Python and really just wanted to get my ACLs updated.

How about next time we use Import/Export we take a little longer to read about ACLs.

LibXML-Ruby and XPath with namespaces

So, have you ever wasted a half hour coding while also driving yourself absolutely insane? Was it when you were playing with libxml-ruby and xpath?

Minutes ago I was coding up an XML-RPC webservice when I realized that I was unable to get the nodes that I was looking for with xpath.

As usual I searched Google looking for other people having the same issue and nothing helpful came up. I knew I had to write this post when I saw this.

So my response xml looked something like this:

response = <<-REMOTE_XML
<?xml ...?>
<rootNode xmlns="http://happythanksgiving.com/htgn">
  <list>
    <item>hey</item>
    <item>there</item>
  </list>
</rootNode>
REMOTE_XML

My ruby was something like this:

document = XML::Parser.string(response).parse
namespace = 'htgn:http://happythanksgiving.com/htgn'
turkeys = document.find('/htgn:rootNode//item', namespace)

But turkeys.size was always 0.

I then found out that I needed to add the namespace prefix to each element in the xpath find…. duhh!

document = XML::Parser.string(response).parse
namespace = 'htgn:http://happythanksgiving.com/htgn'
turkeys = document.find('/htgn:rootNode//htgn:item', namespace)

Note the xpath “/htgn:rootNode//item” changed to “/htgn:rootNode//htgn:item” (added the namespace prefix).

Hope this helps some poor hacker or me next July when I forget and start searching google. 😉

Sharing shell with ytalk on Ubuntu

A good friend of mine years ago used to use a command-line app called ytalk to show me around the bash shell (thanks Sione!). After a short while I stopped needing his help and so I stopped using ytalk. At work we really wanted to shell-share with remote team members who were unable to use the iChat screenshare because of OS and bandwidth limitations.

I remembered that ytalk was such a good tool for seeing what someone else is doing in the shell and for showing off your bash skills. I thought it would be easy to set up on Ubuntu but, as it turns out, although it's still an available package, it is dead on install.

So… here is what I ended up doing, and I hope that if you do the same you will be ytalk'in in no time.

On ubuntu install ytalk:


sudo apt-get install ytalk

Change the default/broken inetd.conf configuration from:


talk            dgram   udp    wait    nobody.tty    /usr/sbin/in.talkd      in.talkd
ntalk           dgram   udp    wait    nobody.tty    /usr/sbin/in.ntalkd     in.ntalkd

to:


talk            dgram   udp4    wait    root    /usr/sbin/in.talkd      in.talkd
ntalk           dgram   udp4    wait    root    /usr/sbin/in.ntalkd     in.ntalkd

Note the “4” after the udp and the “nobody.tty” change to “root”

In the /etc/services file, make sure the following lines are in there:


paul@box:~$ sudo grep talk /etc/services
talk            517/udp
ntalk           518/udp

I didn't have to change anything, but it's a good idea to confirm things.

Using YTalk

Initiating the chat:

You can do this in a couple of ways. The first and most obvious is to coordinate with another person/user, ensure that each of you is logged in only once to the same box, and then type:


paul@box:~$ ytalk fred

Or, if you're logged on more than once, you can specify the tty in the request after finding out which one it is:


paul@box:~$ who
fred      pts/0        2009-11-06 10:50 (208.X.X.X)
fred      pts/2        2009-11-06 10:48 (208.X.X.X)
paul      pts/3        2009-11-06 14:02 (208.X.X.X)


ytalk fred#2

More on that can be found here: http://manpages.ubuntu.com/manpages/intrepid/man1/ytalk.1.html

Thanks to euphemus for the breakthroughs!

Hope you find ytalk as useful and coolific as I do.

Enjoy!

FAIL: sudo gem install mysql (Fixed)

The other day I had an issue with ruby and so I went to Google to find a fix…. I laughed when the second result was my own blog. 🙂

I figured it wouldn't hurt to save myself some time the next time I run into the OS X nightmare with the mysql gem, so here is what happened and what I did to fix it.

After running “sudo gem install mysql” I got the following errors:


/usr/local/bin/ruby extconf.rb
checking for mysql_query() in -lmysqlclient... no
checking for main() in -lm... yes
checking for mysql_query() in -lmysqlclient... no
checking for main() in -lz... yes
checking for mysql_query() in -lmysqlclient... no
checking for main() in -lsocket... no
checking for mysql_query() in -lmysqlclient... no
checking for main() in -lnsl... no
checking for mysql_query() in -lmysqlclient... no

As usual, I looked into the mkmf.log found in the gem directory and saw a bunch of these:


"gcc -o conftest -I. -I/usr/local/lib/ruby/1.8/i686-darwin9.6.2 -I. -I/usr/local/include   -D_XOPEN_SOURCE=1  -fno-common -pipe -fno-common conftest.c  -L. -L/usr/local/lib -L/usr
/local/lib -L.      -lruby-static -lmysqlclient  -lpthread -ldl -lobjc  " 
ld: library not found for -lmysqlclient
collect2: ld returned 1 exit status
checked program was:
/* begin */
1: /*top*/
2: int main() { return 0; }
3: int t() { mysql_query(); return 0; }
/* end */

So here is what I did to fix it:


sudo ln -s /usr/local/mysql/include /usr/local/include/mysql
sudo ln -s /usr/local/mysql/lib /usr/local/lib/mysql



[heppy /usr/local/lib/ruby/gems/1.8/gems/mysql-2.7 64]$ sudo gem install mysql
Building native extensions.  This could take a while...
Successfully installed mysql-2.7
1 gem installed
Installing ri documentation for mysql-2.7...

Yeah!

Mongrel to Passenger with CPanel

I host this blog on Slicehost and used to have a couple of slices: one for Rails, and one for client sites, php, email, etc. Just a few hours ago I moved my blog from my Rails slice to what I call my CPanel slice using Passenger, and the process was smooth sailing. Along the way I decided to leverage what I learned about Cpanel and Passenger, and I created a gem called cpanel-passenger which can be found on github.

The gem just installs a command called cpanel-passenger that takes a bunch of parameters to modify the Apache config in a way that will not make Cpanel upset.

There is a lot of work to do to make this script do all that one would want, but at least it makes setting up a rails app on Passenger a simpler task with Cpanel. Feel free to fork the gem and add to it. It's just a matter of time before the Cpanel folks bundle Passenger as a supported module, but until then, try this out on your VPS that is running Cpanel.

Enjoy!

A default route gone 404 when it should

UPDATE: This worked for Rails < 2.0, but now you should follow something like this

Rails routes are a critical piece of a rails application. One issue with the routes is that there isn't a default route for the home page of an application. Typically, one creates a controller and adds a route pointing at a default controller and action. Here is what one of mine looks like:

map.root :controller => 'main', :action => 'home'

There is one problem with this. The url http://domain.com/blah%20blah will go to the main controller and throw a “no action/no id given” exception, which results in a 500 error. This is not what you want, for SEO or otherwise.

The solution is quite simple: add a method_missing to the main controller that logs the miss and renders a real 404 page with the proper HTTP status.

  def method_missing(method, *args)
    logger.warn "action #{method} does not exist, 404"
    render :file => File.join(RAILS_ROOT, 'public', '404.html'), :status => 404
  end

There may be better ways to do this, but this is one way around the false 500 errors, especially if you're likely to get old inbound links to your site.