The app that I submitted for the NYC BigApps contest has been getting some attention since it was mentioned in Wired, so I wanted to write about the technology I used to create it. The core of the application is built with Ruby on Rails and talks to a MySQL database through ActiveRecord. More interestingly, though, I added some other technologies to make pages load fast and to scale search beyond what MySQL can achieve on its own: Sphinx for search and Varnish for page caching.
First, my goal in creating Clean.ly was to add a layer of information on top of the NYC restaurant data to make them more interesting and useful. Geocoding the restaurant addresses was the first step in this process, and it was easily accomplished using the Google API. Next, I grabbed some other information about the restaurants, including type of food, from other freely available sources on the web.
For these data mining tasks, I needed to serialize a bunch of operations to make sure I didn’t go over API quotas. I also wanted to create a system that could accept data from different inputs without much trouble later. To meet these needs, I used a Kestrel queue to serialize HTTP retrievals from various providers.
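To make the idea concrete, here is a minimal sketch of the pattern: jobs go into a queue, and a single worker drains them one at a time with a minimum gap between requests. This is not Clean.ly's actual code. Ruby's in-process `Queue` stands in for Kestrel (which is a separate network daemon), and `ThrottledWorker`, `min_interval`, and the `fetch` lambda are hypothetical names invented for this illustration.

```ruby
require "thread"

# Drains a queue one job at a time, leaving at least `min_interval` seconds
# between the start of consecutive jobs so an API quota is not exceeded.
# The clock and sleeper are injectable so the throttling can be tested
# without really waiting.
class ThrottledWorker
  def initialize(queue, min_interval:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @queue = queue
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
  end

  # Yields each job to the block in FIFO order and returns the block results.
  def drain
    results = []
    last_started = nil
    until @queue.empty?
      job = @queue.pop
      if last_started
        wait = @min_interval - (@clock.call - last_started)
        @sleeper.call(wait) if wait > 0
      end
      last_started = @clock.call
      results << yield(job)
    end
    results
  end
end

# Stand-in for the Kestrel queue; the job payloads are placeholders.
queue = Queue.new
["address A", "address B", "address C"].each { |address| queue << address }

# Hypothetical fetcher: a real one would make the HTTP call to a provider.
fetch = ->(address) { "fetched: #{address}" }

worker = ThrottledWorker.new(queue, min_interval: 0.1)
results = worker.drain { |address| fetch.call(address) }
puts results
```

Because producers only push jobs onto the queue, a new data source can be added later without touching the worker, which is the flexibility described above.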
Once I had the necessary data in place, Sphinx was an obvious choice for putting these data in front of the user on demand.
For those of you who have used the full-text search engine in Sphinx, it’s easy to envision how restaurant search by name or “tag” (such as type of food) would work. A lesser-known fact about Sphinx is that the newest releases support simple geodistance calculations. The only real trick in implementing this part of Clean.ly was that the latitude and longitude obtained from Google had to be converted from degrees to radians before piping them to the Sphinx indexer. After that, I had near-instant results with distance calculations from an input address, making it possible for users to search for restaurants close to where they want to eat.
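The conversion itself is a one-liner. The sketch below shows it alongside a haversine distance calculation, which gives a feel for what a geodistance function computes from radian inputs. It is an illustration only: Sphinx does its own calculation internally and may not use this exact formula, and the helper names, the Earth radius constant, and the two sample points are assumptions made up for the example.

```ruby
# Mean Earth radius in meters (a common approximation, assumed here).
EARTH_RADIUS_M = 6_371_000.0

# Convert an angle in degrees (as returned by a geocoder) to radians.
def to_radians(degrees)
  degrees * Math::PI / 180.0
end

# Great-circle distance in meters between two points whose coordinates
# are already in radians, using the haversine formula.
def haversine_m(lat1, lon1, lat2, lon2)
  dlat = lat2 - lat1
  dlon = lon2 - lon1
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a))
end

# Two made-up points 0.01 degrees of latitude apart (not real restaurants).
lat_a = to_radians(40.75)
lon_a = to_radians(-73.99)
lat_b = to_radians(40.76)
lon_b = to_radians(-73.99)

distance = haversine_m(lat_a, lon_a, lat_b, lon_b)
puts "distance: #{distance.round} m"
```

Doing the conversion once, before indexing, means the stored attributes are already in the form the search engine expects and no per-query conversion is needed for the indexed side.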
Once I had geodist calculations in Sphinx working, I wanted to make sure that pages loaded quickly. Since Clean.ly doesn’t really have user-modifiable content, I immediately thought of page caching. In other work that I’ve done scaling high-performance web sites, I had implemented Varnish, a reverse-proxy page-caching daemon, with great success. It’s an extremely elegant and easy-to-use program, and I’d highly recommend checking it out if you’re interested in scaling high-volume web sites.
Implementing Varnish for Clean.ly turned out to be a fairly straightforward task. Clean.ly only has one server, a small instance on EC2 that I also use to host a variety of other small Rails projects. Varnish listens on port 80 on this instance and proxies requests to another port where Apache is listening. When a requested page is already stored in the page cache, Varnish short-circuits and returns it directly, without passing the request on to Apache.
In another post, I’d like to go into more detail on using Varnish with Rails applications. For now, though, I hope that you enjoy using Clean.ly. Be sure to check it out, as well as the other great submissions to the NYC BigApps contest.