Setting the ACLs to public-read on Millions of S3 Objects

I learned a valuable lesson today. When you use Amazon's Import/Export service, be sure that your manifest file includes the proper ACL metadata. I left it at the defaults, and my more than 600 GB of files (yes, all 23 million of them) were not readable through CloudFront for use on my site because they were not public. I tried to use Amazon's web-based console to change the ACLs, but it was quite discouraging when it only updated about 100 per minute. I tried "Bucket Explorer", and although it was a bit faster, my 30-day trial would have expired before it finished. I knew I had to script something quicker, so I did a bit of research and figured that if I ran it from EC2 it could be 100-1000x faster, since the calls from EC2 to S3 stay inside Amazon's network.
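For what it's worth, the fix on the Import/Export side looks like a one-line addition to the manifest. I haven't gone back and re-run an import to confirm the exact syntax, but as I read the docs the manifest has an acl option (it defaults to private), so something along these lines would have saved all of the work below:

bucket: bucket_name
acl: public-read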

So here are the steps that I took to hack a solution together and I hope that if you are in my same boat you might find this helpful.

Start an EC2 instance and ssh into it:

ssh -p 22 -i ~/Sites/mysite/myec2key.pem root@ec2-174-129-75-24.compute-1.amazonaws.com

Install Python's easy_install utility (on Ubuntu it is like this):

sudo apt-get install python-setuptools

A helpful utility named s3funnel doesn't let you update an object's ACL, so we will only use it to build our object list. The reason I used s3funnel is that it is very fast at listing objects; in my tests it was over 2,000 objects per second.

Install s3funnel: easy_install s3funnel

On the instance I created a directory to store my work in, just to keep things simple (you don't have to do this if you don't want to):

mkdir -p ~/amazon-files/input
mkdir ~/amazon-files/output
cd ~/amazon-files

Then I ran the s3funnel dump (you will have to replace AWS_KEY and AWS_SECRET with your own S3 key and secret):

nohup s3funnel bucket_name --aws_key=AWS_KEY --aws_secret_key=AWS_SECRET list > s3files.txt &
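Each line of s3files.txt is just an object key, which is exactly the format the scripts below expect. The keys here are made up, but the file looks something like this:

images/000/000/001.jpg
images/000/000/002.jpg
images/000/000/003.jpg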

Then, once the object list was complete, I split it up into smaller files:

cd input
split -l 5000 --suffix-length=6 ../s3files.txt s3

For my 23 million files this created about 4,600 chunk files of 5,000 keys each, named s3aaaaaa, s3aaaaab, and so on.

Then I wrote a bash script that distributed the chunk files into numbered directories, one per worker process. I originally planned on 10 parallel processes (that is how many threads I wanted to run at the same time), but I ended up running 50 (see below), which is why the script hashes each file into one of 50 buckets.

for file in $(ls input/s3*)
do
  # checksum the chunk file and use it to pick one of 50 buckets
  csum=`sum $file | cut -f1 -d' '`
  process=`expr $csum % 50`

  echo "Moving $file into input/$process"
  if [[ ! -d input/$process ]]
  then
    mkdir input/$process
    mkdir output/$process
  fi
  mv $file input/$process
done

Then I wrote a simple Python script named amazon.py, which I placed in the ~/amazon-files directory. It uses boto (a Python library for S3, the same one that s3funnel uses under the hood) and looks like this:

#! /usr/bin/env python

import sys
import boto

# argv[1] is the path to one chunk file, e.g. input/7/s3aaaaac
print 'processing file: ' + sys.argv[1]

f = open(sys.argv[1], 'r')
c = boto.connect_s3("AWS_KEY", "AWS_SECRET")
b = c.get_bucket("bucket_name")
for line in f:
    # each line is an object key; mark it world-readable
    b.set_acl('public-read', line.strip())

f.close()

Now that all of my objects are evenly distributed across the numbered directories, I can loop over each directory and kick off one bash process per directory. Each process moves a chunk file into the matching completed (output) directory as soon as it is done with it. That way, if something goes wrong, I can see progress (a quick check is shown after the loop below) and can just restart the scripts, and they will continue pretty much where they left off, give or take ~4,000 objects.

for directory in $(ls input);
do
  nohup bash -l -c "for file in \$(ls input/${directory}); do python amazon.py input/${directory}/\${file} && mv input/${directory}/\${file} output/${directory}; done" &
done
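While the 50 loops are running, a quick way to keep an eye on progress is to count the chunk files that have landed in output/ (each one represents 5,000 objects):

find output -type f | wc -l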

I first started with 10 processes and then realized that 50 would be better, so I continued with 50. Running 50 processes against my 23 million objects took about 12 hours to finish (roughly 532 objects updated per second). All in all I was able to update the ACLs on every object in what I now consider the fastest method possible.
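The math works out as a quick sanity check (integer division, so it's approximate):

echo $((23000000 / 532))        # ~43,000 seconds
echo $((23000000 / 532 / 3600)) # ~12 hours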

This is obviously a hack and could use some cleanup and consolidation. Part of me wanted to just modify s3funnel to update all of the ACLs itself, but I am not that strong with Python and really just wanted to get my ACLs updated.
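For the curious, here is a rough sketch of what a consolidated version might look like: a single Python script that reads a key list and spreads the set_acl calls across worker threads, using the same boto calls as amazon.py. This is not the script I ran, and names like NUM_THREADS are just placeholders, but it is roughly where I would start if I cleaned this up:

#! /usr/bin/env python

import sys
import threading
import Queue

import boto

NUM_THREADS = 50  # placeholder; tune to taste

def worker(q):
    # each thread gets its own S3 connection and bucket handle
    c = boto.connect_s3("AWS_KEY", "AWS_SECRET")
    b = c.get_bucket("bucket_name")
    while True:
        try:
            key_name = q.get_nowait()
        except Queue.Empty:
            return
        b.set_acl('public-read', key_name)

q = Queue.Queue()
for line in open(sys.argv[1]):  # one object key per line, like s3files.txt
    q.put(line.strip())

threads = [threading.Thread(target=worker, args=(q,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()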

How about next time we use Import/Export, we take a little longer to read about ACLs in the manifest.
