PayPal is one of the biggest financial institutions in the world, with 130 million accounts and more than 13,000 employees, but in recent years the company’s success has become an obstacle. While a new crop of payments startups was nipping at PayPal’s heels, the company was moving more slowly than ever. Development teams were still following a waterfall model rather than more modern agile practices; launching even a small service required more than 100 tickets, and provisioning 50 servers took three weeks. Fixing the problem would require a major engineering change--and a lot of charm to convince developers to change their workflow.
“In the process of scaling, the company got a little too conservative, and by way of doing that started driving away some of the technologists and innovators,” PayPal’s CTO James Barrese says. “We are really going back to the roots of PayPal and reinventing the organization, the talent, the types of initiatives that we undertake, and the products that we build.”
To speed up production cycles, PayPal converted a substantial slice of its data centers--a fifth so far--into a private OpenStack cloud. OpenStack is an open-source cloud operating system that controls the compute, storage, and networking resources within a data center. It allows companies to create their own public cloud, as an alternative to a service like Amazon Web Services, or to implement a private cloud within the corporate firewall for use by internal business teams. PayPal did the latter.
“We started in January and by the end of the summer we were already there,” says Barrese. “I would say that's a very accelerated timeline because we were on a mission to make it happen.” Production cycles are now seven times faster than they were a year ago.
According to Barrese, the biggest challenges in the project were not technical--although those were considerable--but organizational. It’s hard to cajole a technical team numbering in the thousands to work in a different way. Here’s his advice on how to do it.
PayPal mandates 99.9999% uptime and must maintain stringent security, yet it also needed to massively improve the speed and flexibility of deployment. This makes selecting the right technologies difficult. “I would encourage people to just try it and not get into analysis paralysis,” says Barrese. “A large organization can spend six months or a year trying to make ‘the right decision’ and at the end of the day there's no perfect answer. You are going to learn a lot more by doing it than you will by analyzing it.”
This may mean working with immature technology. OpenStack and other open-source cloud technologies are still relatively young. PayPal runs multiple data centers in a mission-critical environment, something OpenStack doesn't handle. “We are building where there are holes,” says Barrese. “We have to.”
While new tools and technologies often bubble up from development and operations teams themselves, Barrese says that for a project of this scale the entire organization has to be on board. “You have to have top-down support. We are putting the muscle behind making it work because it's not easy to go implement. If somebody is just trying to do it skunkworks in a back room, you can do a prototype but you are not going to do something at scale unless you have the backing of the organization.”
Maintaining constant uptime using OpenStack is a new and relatively untested process, but as provisioning becomes more automated, missing availability targets may not be the operations team’s biggest concern. “There is a lot of unspoken fear where people are worried that it will eliminate their job,” says Barrese. “What's critical is to get those people to see that this is the best thing possible for your career, because you are going to be one of those people who knows how to transform to a cloud infrastructure, and second those people are going to be able to work on much higher-value activities like proactive monitoring, advanced detection, being able to predict failures versus the reactive stuff of trying to diagnose a failure.”
PayPal, like any large organization, had existing business targets and deadlines that still needed to be met while the technical overhaul was taking place. This meant that staging and sequencing were crucial. “The first part is you've got to get your band of pirates, the small team that's going to help you get initial installations done across engineering, quality, operations,” says Barrese. “There are three steps. There's an initial proof of concept. Let's just say a low-risk business use case. If it has problems or it's delayed it's not the end of the world. The second one is an extremely high-volume use case that proves scale. The third I like is a spanning set of use cases which exercises a lot of functionality.” Once you have proven that the new infrastructure stands up both at scale and in comprehensiveness, he says, you can start rolling it out to more and more teams.
Operations wasn’t the only technical team on the front line. PayPal’s engineers needed to rewrite their applications to run in a cloud infrastructure. Engineering had to start building, testing, and deploying its applications in a new way. “We are in the middle of wrapping up a really large transformation to agile,” says Barrese. “We were doing things in old waterfall style. We threw all of that out the window. We have completely restructured our teams. We've co-located teams. We have incorporated DevOps.”
The relationship between engineering and operations teams had to change. The two groups usually have different goals and are rewarded for different things. “Typically what happens is that the ops team is measured by a certain set of production goals and the engineering team is rewarded for a different set of goals,” explains Barrese. “I cross-wired and gave shared goals. The ops teams have delivery goals as well as availability goals and engineering teams have both delivery and availability goals.”
PayPal now has one of the largest hybrid cloud deployments in the world and the company plans to eventually transfer all of its systems over to this kind of topology. Barrese’s only regret is that he didn’t start the project three months earlier. “You need to be prepared to get a little dirty in making this stuff work,” he says. “It's still being built, but it's also an exciting time to make it real. The tectonic plates are shifting. Every company should be going after this.”
[Image: Flickr user Peter Sheik]