Saturday, June 13, 2009

Git, S3 and RewriteMap

I've written a tool to upload files to S3 using content addressable semantics, where the S3 key is a hash of the data. This is a very natural fit for exporting a Git repository, so obviously that is well supported ;-)

My problem

I host my website on a puny cable modem from a Linux box that's been in my parents' living room since I was in high school. The bandwidth sucks, but I like the convenience of having shell access. If I were setting it up now I would probably just get a Linode, but this already works, so until the hardware dies I'm not going to fix what isn't broken.

The website is just static files stored in Git. I didn't want to encumber the HTML with hashed URIs that I would need to keep updated whenever I changed the images. I also wanted the HTML to be viewable locally as I make changes.

The only files I wanted offloaded to S3 were the largish images. The key is that the offloading to S3 had to be non-intrusive, or it wouldn't be worth the effort.

Git to S3

The first step was getting the content onto S3 from Git. For this I've written two Perl modules. Net::Amazon::S3::CAS does the heavy lifting of synchronizing a set of blobs to S3, while Net::Amazon::S3::CAS::Git provides the Git integration. Net::Amazon::S3::CAS also supports uploading blobs from other sources, such as the file system; it isn't Git specific.

Net::Amazon::S3::CAS::Git provides a git-to-s3 command which makes exporting from Git to S3 easy. It uses git-ls-tree internally. Here's an example setup:

git to-s3 --aws-secret-access-key foo \
          --aws-access-key-id bar \
          --bucket \
          --prefix \
          --prune --vhost \
          $( git ls-tree -r --name-only HEAD | grep -iE '(png|jpg)$' )

In this example git ls-tree generates the list of paths (all the files that appear to be images), which git-to-s3 then passes back to ls-tree internally. In the future I'll probably add MIME type and file size based filtering.

Net::Amazon::S3::CAS then uploads each image under an S3 key based on its blob hash. For example, this image:

% git ls-tree HEAD bg.jpg
100644 blob ae5684b40b111e70c2dd4d69f498ddcbf4ff78dd bg.jpg

is uploaded to S3 as

Since the URIs contain a hash of the contents, the Expires and Cache-Control headers are set pretty aggressively (10 years by default, with public cache permissions). Similarly, blobs are skipped during uploading if they already exist in the bucket.
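
The blob hash that the key is based on can be reproduced without Git, which is handy for checking what a given file will be uploaded as. This is only a minimal sketch: it assumes a POSIX shell with sha1sum available (some systems ship shasum instead), and git_blob_id is a hypothetical helper name, not part of Net::Amazon::S3::CAS.

```shell
# Hypothetical helper (not part of git-to-s3): print the Git blob ID of a file.
# Git computes SHA-1 over the header "blob <size>\0" followed by the contents.
git_blob_id() {
    # wc may pad its output with spaces on some systems, so strip them.
    size=$(wc -c < "$1" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

# Usage: git_blob_id bg.jpg
```

For a file with no line-ending conversion or clean filters applied, this should match what git hash-object prints for the same file.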


The git-to-s3 script prints a map to STDOUT which can be used in Apache using the RewriteMap directive. Save the output in a file, and then add this to your Apache config:

RewriteEngine On
RewriteMap s3 txt:/path/to/rewritemap.txt
RewriteCond ${s3:$1} !=""
RewriteRule ^/(.*)$ ${s3:$1} [R]

The RewriteCond ensures the RewriteRule only applies to URIs that have entries in the map.
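
For reference, a txt: map is a plain text file with one whitespace-separated key and value per line. The sketch below shows how such a map could be derived from git ls-tree -r output. It is not how git-to-s3 itself produces the map: the helper name ls_tree_to_map and the base URL are placeholders, and so is the assumption that the S3 key is the bare blob hash appended to that base.

```shell
# Hypothetical helper: read `git ls-tree -r HEAD` lines on stdin and print
# "path URL" pairs suitable for a txt: RewriteMap.
# ls-tree lines look like: <mode> SP <type> SP <object> TAB <path>
# $1 is a placeholder base URL, not the real bucket address.
ls_tree_to_map() {
    awk -F'\t' -v base="$1" '{
        split($1, meta, " ")          # meta[2] = type, meta[3] = object hash
        if (meta[2] == "blob")
            print $2, (base meta[3])  # key is the path without a leading slash
    }'
}

# Usage:
#   git ls-tree -r HEAD | ls_tree_to_map 'http://bucket.example/prefix/'
```

The keys have no leading slash because the RewriteRule pattern captures everything after it. Paths containing whitespace would not work as keys in this format.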

With this in place now properly redirects to S3.

Hopefully S3 will support creating keys that give back 302 status codes at some point in the future, making this step unnecessary.

Git Hooks

Using post-receive hooks, as described by Abhijit Menon-Sen, you can automate this process. I won't go into the details he covers, since all I did was add the above to-s3 call to my post-receive hook.
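
A rough sketch of the general shape such a hook could take, under stated assumptions: pushed_branch is a hypothetical helper, the branch name master and the map path are placeholders, and the to-s3 options are left elided as in the example above.

```shell
# Git feeds post-receive one "<old-rev> <new-rev> <ref-name>" line per
# updated ref on stdin.
# Hypothetical helper: succeed if the given branch is among the pushed refs.
pushed_branch() {
    while read -r _old _new ref; do
        if [ "$ref" = "refs/heads/$1" ]; then
            return 0
        fi
    done
    return 1
}

# In the hook itself (options elided, paths are placeholders):
#   if pushed_branch master; then
#       git to-s3 ... > /path/to/rewritemap.txt
#   fi
```

Checking the ref first avoids re-exporting when only an unrelated branch or tag is pushed.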

This setup gives me the simplicity of editing static content, and all I need to do is run git push to update my live site, with automated uploading of the heavy files to S3.


Mike said...

You should use Nginx instead of Apache if you are looking to be more efficient with the little power you have.

nothingmuch said...

If by "power" you mean "bandwidth" then I really fail to see how switching webservers would do anything. Apache can easily saturate a 40KBps cable modem on commodity hardware.

I know Nginx and Lighttpd are both easier to configure and more efficient, but Apache is already configured and set up, and has been for years. I might as well move everything to a real server if I'm going to change software.