2012-10-22

How Quora Supercharges Its Site By Optimizing For The 99th Percentile

Want to make your site faster without cutting features? Try asynchronous programming.



Site speed has a quantifiable effect on user experience: it impacts everything from your site's search optimization to the number of page views per visit. But many of the ways you'd increase speed involve compromising the product—something we didn't want to do at Quora. So when we set out to qualitatively improve our LiveNode framework, which handles the live updating of pages that have already been loaded, we first had to rethink what we meant by "fast."

Average page load time is a convenient measurement, and that's where a lot of people apply their fixes. But we also use the 99th percentile of page load time to represent what we care about improving: the bad experiences. It's not the fast page loads you remember—it's that one really slow page load that hurts. Our previous approaches to improving site speed involved cutting features, which isn't ideal. Improving something as fundamental as the framework—the software libraries and conventions developers use as the foundation on which the site is built—would speed up the site without those tradeoffs.
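To make the metric concrete, here is a minimal sketch (not Quora's measurement code; the sample numbers are made up) of how an average can look healthy while the 99th percentile exposes the slow tail:

import math

# Hypothetical per-request page load times in milliseconds.
load_times_ms = [120, 95, 430, 80, 2750, 110, 140, 105, 90, 3100]

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = int(math.ceil(pct / 100.0 * len(ordered)))
    return ordered[max(rank, 1) - 1]

print('average: {0}ms'.format(sum(load_times_ms) // len(load_times_ms)))  # 702ms
print('p99:     {0}ms'.format(percentile(load_times_ms, 99)))             # 3100ms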

Albert Sheu, Quora Infrastructure Engineer. Photo by: Katrina Li

We tried a lot of things to speed up the site, evaluating the results and iterating quickly: cutting features, deferring the generation of hidden elements, and tweaking our Python code, all to no avail. Then we hit on the solution: parallelize the web framework.

Perhaps the oldest way to increase performance, known as vertical scaling, is simply to purchase faster hardware. Vertical scaling relies on the hypothesis that a CPU that's twice as fast will render pages roughly twice as fast. Facebook took a second approach to decreasing page rendering times: do things more efficiently. The Facebook development team created a PHP to C++ compiler, known as HipHop for PHP, that runs the same PHP code two to six times faster.

We chose a third way: parallel infrastructure—baking speed directly into the web framework. At Quora engineering, we pay attention to what works for others when developing our own systems. Distributed systems aren't always better than vertically scaled systems, but with the development of technologies like Memcached, Hadoop, and EC2, scaling horizontally through parallel code execution is a viable technique for production systems and one that we have leveraged.

We wanted to make our web framework fast without any manual intervention from engineers. That meant engineering the web framework to do a lot of the optimizations behind the scenes. For example, the framework will automatically detect which portions of the page render the slowest and increase the parallelism for those sections, while at the same time decreasing parallelism for fast portions of the page in order to reduce network overhead. The results speak for themselves.
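A rough sketch of that idea follows; the component names, timings, and threshold here are hypothetical, invented for illustration rather than taken from LiveNode:

# Illustrative sketch only: decide per-component parallelism from measured
# render times. Names, numbers, and the threshold are made up.

SLOW_THRESHOLD_MS = 50  # hypothetical cutoff for "slow" components

# Hypothetical per-component render times (ms) gathered from recent requests.
recent_render_times = {
    'header':  [4, 5, 4],
    'sidebar': [12, 15, 11],
    'answers': [220, 310, 270],
}

def plan_parallelism(stats):
    """Render slow components in parallel workers, fast ones inline."""
    plan = {}
    for name, samples in stats.items():
        avg_ms = sum(samples) / float(len(samples))
        # Slow components get their own worker; fast ones stay inline so we
        # don't pay network/IPC overhead for tiny jobs.
        plan[name] = 'parallel' if avg_ms > SLOW_THRESHOLD_MS else 'inline'
    return plan

print(plan_parallelism(recent_render_times))
# e.g. {'answers': 'parallel', 'header': 'inline', 'sidebar': 'inline'}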

The end result is that no one has to think about parallelism when building new features on top of the framework—it just works. More automation means fewer engineers that have to actively attend to site speed, and more who can work on product or infrastructural improvements.


Demonstrating Parallel Architecture: Multiprocessing Twitter Search

In this exercise we're going to create a Twitter search client using Python and the Twitter JSON API, and then speed it up by using the Python multiprocessing module. Python comes with built-in JSON and HTTP support using the json and urllib modules, so we can throw together a quick—if sequential—Twitter search script like this:

# twsearch.py
# Runs a Twitter search on every passed-in parameter.

import json
import urllib
import sys

def render(tweet):
    return u'{0}: {1}'.format(tweet['from_user'], tweet['text'])

def search(query):
    query_url = 'http://search.twitter.com/search.json?q={0}'
    page_fp = urllib.urlopen(query_url.format(query))
    result_dict = json.loads(page_fp.read())
    return [render(tweet) for tweet in result_dict['results']]

if __name__ == '__main__':
    results = [search(token) for token in sys.argv[1:]]
    for result in results:
        for tweet in result:
            print tweet.encode('utf8')

Pretty simple. You can copy-paste this script and run it by calling something like:

$ python twsearch.py quick brown fox

The script will output all the Twitter search results for "quick," "brown," and "fox" sequentially, waiting for the results of quick before moving on to brown, and then eventually searching for fox. The exact run time of this script depends on the search queries used, the speed of your Internet connection, and whether or not Twitter has cached the results of the queries, but for my MacBook Air running on Quora's office wireless, I get query times anywhere between 500ms and 1500ms.

$ time python twsearch.py quick brown fox jumps > /dev/null

real    0m0.900s
user    0m0.084s
sys     0m0.049s

Again, this is more of a toy example, but the main function in this script loosely maps to the stages performed by LiveNode to render our web pages. In the LiveNode web framework, we generate a page by first recording a list of the components that make up the page: a footer, a sidebar, a header, a question title, a list of answers. Next we translate the components into HTML, and finally we output the HTML back to your browser.
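The toy script can be bent into that same shape. Here is a sketch of those three stages as explicit functions, reusing the search() function from the script above; it illustrates the structure of the pipeline, not LiveNode itself:

def build_page(queries):
    # 1. Record the components that make up the "page".
    components = [('results for ' + q, search(q)) for q in queries]

    # 2. Translate each component into HTML.
    html_parts = []
    for title, tweets in components:
        items = u''.join(u'<li>{0}</li>'.format(t) for t in tweets)
        html_parts.append(u'<h2>{0}</h2><ul>{1}</ul>'.format(title, items))

    # 3. Output the assembled HTML (a real framework would send it to the browser).
    return u'<body>{0}</body>'.format(u''.join(html_parts))

With the sequential search(), stage 1 blocks on Twitter once per component; with the parallelized search() we build below, stage 1 merely kicks off the work and the waiting is deferred until stage 2 iterates over the results.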

Now a one-second load time for a page is pretty bad, and the reason it takes this long is that every search result is being generated sequentially. In programming lingo, we say that the script blocks on every phrase we want to search for, waiting for Twitter's servers to respond before going on to the next search phrase. Python has several built-in ways to run code in parallel via non-blocking methods, one of which we'll try in the following example.

import multiprocessing

pool = multiprocessing.Pool(processes=4)

def parallel_iter(f):
    def new_f(*args, **kwargs):
        ret = pool.apply_async(f, args, kwargs)
        class ResultIter(object):
            def __iter__(self):
                return iter(ret.get(timeout=1))
        return ResultIter()
    return new_f

This code is significantly more complicated than before, so I'll step through it a section at a time. The function we're creating here is parallel_iter, which wraps a function that normally returns an iterable object (a list, in this case) and hands that workload to a child process for execution.

pool = multiprocessing.Pool(processes=4)

The first thing we do when we start this script is immediately fork the process, creating identical copies of the running code: child processes to handle our Twitter searches. The pool abstraction provided by Python lets you execute a function f asynchronously—non-blockingly—using the apply_async method, which works similarly to the built-in Python apply function.
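As a standalone illustration of apply_async, separate from the search script (a toy example):

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    ret = pool.apply_async(square, (7,))  # returns an AsyncResult immediately
    # ... the parent process is free to do other work here ...
    print(ret.get(timeout=1))             # blocks until the child finishes: 49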

def parallel_iter(f):
    def new_f(*args, **kwargs):
        ret = pool.apply_async(f, args, kwargs)

The difference between the parallel and normal versions of the code is that the actual result of the function f is not returned right away. Instead, we let one of the child processes execute the actual function call, and we extract the return value from the child process using a get() method call later. When we call ret.get() later, Python will check if the child process has finished processing the function f, and return the correct return value if it has.

The ResultIter class is also interesting. One of the goals of the parallel project was to make it completely transparent to other layers whether or not we were executing a component asynchronously. We didn't want get() calls littered throughout our view code. When we run search() normally, we expect a list of tweets to be returned. When we run the parallelized version of search(), we return a ResultIter object that looks as much like a list as possible. This can be done by overloading the __iter__ method in our class definition.

        class ResultIter(object):
            def __iter__(self):
                return iter(ret.get(timeout=1))
        return ResultIter()
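In practice that transparency means calling code iterates over the result the same way in both cases; only where the blocking happens differs. A hypothetical snippet, assuming the full script below:

# The parallelized search() blocks lazily, on first iteration of its result.
for tweet in search('quick'):
    print tweet.encode('utf8')

# The plain search_sync() has already blocked inside the call itself.
for tweet in search_sync('brown'):
    print tweet.encode('utf8')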

Now that we have this parallel_iter function, we can turn any function that returns a list into a parallel function that returns a list. We'll also make one small change to our original script: rename our search function to search_sync, and use parallel_iter to create the parallelized search function.

def search_sync(query):
    query_url = 'http://search.twitter.com/search.json?q={0}'

search = parallel_iter(search_sync)


(Normally we would apply parallel_iter as a decorator, but that causes problems when the multiprocessing module attempts to serialize the search function so a child process can execute it.)
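To make that parenthetical concrete: multiprocessing ships a function to a child process by pickling it, and pickle records only the function's module and name. Roughly:

# What the decorator form would look like, and why it breaks here:
#
#   @parallel_iter
#   def search(query):
#       ...
#
# After decoration, the module-level name `search` refers to the wrapper
# (new_f), not to the undecorated function we hand to apply_async, so
# pickling fails with an error along the lines of
# "it's not the same object as __main__.search".
# Wrapping explicitly (search = parallel_iter(search_sync)) keeps the
# original function importable under its own name, so it pickles cleanly.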

When we put it all together, our final, parallelized Twitter search script looks like this. The changed portions are the new multiprocessing import, the renamed search_sync function, and the parallel_iter wrapper.

# twsearch.py
# Runs a Twitter search on every passed-in parameter.

import json
import multiprocessing
import urllib
import sys

def render(tweet):
    return u'{0}: {1}'.format(tweet['from_user'], tweet['text'])

def search_sync(query):
    query_url = 'http://search.twitter.com/search.json?q={0}'
    page = urllib.urlopen(query_url.format(query))
    result_dict = json.loads(page.read())
    return [render(tweet) for tweet in result_dict['results']]

pool = multiprocessing.Pool(processes=4)

def parallel_iter(f):
    def new_f(*args, **kwargs):
        ret = pool.apply_async(f, args, kwargs)
        class ResultIter(object):
            def __iter__(self):
                return iter(ret.get(timeout=1))
        return ResultIter()
    return new_f

search = parallel_iter(search_sync)

if __name__ == '__main__':
    results = [search(token) for token in sys.argv[1:]]
    for result in results:
        for tweet in result:
            print tweet.encode('utf8')

When we run this version of the script, we definitely see a good speedup in search times. Running with the same parameters as before, the script finishes anywhere between 250ms and 500ms, a 2-3x speedup over the sequential version.

$ time python twsearch.py quick brown fox jumps > /dev/null

real    0m0.252s
user    0m0.129s
sys     0m0.077s

Project Notes

  • In Quora's LiveNode codebase, this method is very similar to the way we actually handle component rendering. However, since this is a small example, it leaves out a number of extra things we do to make LiveNode a true production system.
  • First off, we used multiple processes instead of multiple threads due to some limitations in the CPython interpreter. Because of Python's infamous global interpreter lock, referred to as the GIL, only one thread per process can be interpreting Python code at a time. This means that threading will properly parallelize I/O-intensive tasks in Python, but CPU-intensive tasks will still be handled one core at a time (a thread-based variant of the toy script is sketched after these notes).
  • We used the multiprocessing module and its pool abstraction for quick prototyping, but eventually migrated our deferred function execution to ZeroMQ and MessagePack. This allowed us to distribute the render load to multiple processes across multiple hosts, not just local child processes.
  • A lot of thought has to be put into what happens when a particular function call fails. In LiveNode, we detect and re-send dropped messages, and bubble errors back to the web server if exceptions are raised in an asynchronous call.
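For comparison, here is a minimal sketch of the thread-based alternative mentioned in the GIL note above. It reuses search_sync from the final script, and it parallelizes fine despite the GIL because these searches spend their time waiting on the network rather than executing Python bytecode:

import threading

def search_all_threaded(queries):
    """Run one search_sync call per thread and collect the results in order."""
    results = [None] * len(queries)

    def worker(i, q):
        results[i] = search_sync(q)

    threads = [threading.Thread(target=worker, args=(i, q))
               for i, q in enumerate(queries)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results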

Albert Sheu works on Quora infrastructure. Follow him on Twitter.




