2014-05-12

Co.Labs

When In Doubt, Build It Yourself (And Open Source The Code)

Faced with a very specific problem, SoundCloud's engineers opted to homebake a solution and share it with the world.



It's easy to take the basic functionality of a social network for granted, but even something as simple as an "activity feed" (like so many apps have) has complex moving parts. And when a site scales, those parts begin to break down. For engineers at SoundCloud, rebuilding their feed technology was so labor intensive that they decided to spare the rest of humanity from the task. And thus, they open sourced it.

Here's their solution: It's called Roshi, a new open source distributed storage system for "time series events" in feeds (in layman's terms, news feeds). Roshi was developed by Peter Bourgon and other engineers at SoundCloud in order to scale up without slowing down the performance of the very social streams that keep users engaged.

What's A Feed For?

Let's say you follow Snoop Dogg on SoundCloud. If he reposts (SoundCloud's equivalent of retweeting) a track by an up-and-coming hip hop artist, you'd naturally expect to see the song in your stream. The traditional way of doing this, known as "fan out on write," essentially treats each user's stream (equivalent to the News Feed on Facebook) as an inbox, pushing a copy of every update into each follower's inbox individually. It works, but it's inefficient: the storage costs add up, and changes to the social graph can become a pain to implement.
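To make the inbox model concrete, here is a minimal sketch of fan out on write. It is not SoundCloud's code: the in-memory dictionaries and the names `followers`, `inboxes`, `follow`, `publish`, and `read_stream` are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the social graph and per-user inboxes.
followers = defaultdict(set)   # author -> users who follow that author
inboxes = defaultdict(list)    # user -> [(timestamp, author, event)], newest first


def follow(user, author):
    """Record that `user` follows `author`."""
    followers[author].add(user)


def publish(author, timestamp, event):
    """Fan out on write: copy the event into every follower's inbox."""
    for user in followers[author]:
        inboxes[user].append((timestamp, author, event))
        inboxes[user].sort(reverse=True)  # keep each inbox newest first


def read_stream(user, limit=10):
    """Reads are cheap because the stream is already materialized."""
    return inboxes.get(user, [])[:limit]


follow("alice", "snoopdogg")
follow("bob", "snoopdogg")
publish("snoopdogg", 1, "repost: track-123")
```

One repost here produces one stored copy per follower, which is where the storage cost comes from. An unfollow would also have to find and scrub the old copies out of the inbox, which is why graph changes are awkward in this model.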

"At some point, those caveats and restrictions started affecting our ability to iterate on the stream," explains Bourgon. "To keep up with product ideas, we needed to address the infrastructure. And rather than tackling each problem in isolation, we thought about changing the model."

Rethinking Social Time Series Events With Roshi

SoundCloud's new approach relies on a methodology called "fan-in-on-read." When you view the stream of SoundCloud users you follow, the system will grab the most recent events (favorites, reposts, and the like) of those people and then dynamically merge that information on the fly. It speeds up writes and minimizes storage, but presents new challenges.
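A rough sketch of the fan-in-on-read idea follows. It is an illustration under stated assumptions, not Roshi's implementation: the stores `following` and `events_by_author` and the functions `publish` and `read_stream` are hypothetical, and a real system would issue the per-author reads against a distributed store rather than a dictionary.

```python
import heapq
from itertools import islice

# Hypothetical in-memory stores; each author's timeline is kept newest first.
following = {"alice": {"snoopdogg", "newartist"}}
events_by_author = {
    "snoopdogg": [
        (5, "snoopdogg", "repost: track-123"),
        (2, "snoopdogg", "favorite: track-7"),
    ],
    "newartist": [(4, "newartist", "upload: track-123")],
}


def publish(author, timestamp, event):
    """A single write per event, no matter how many followers exist."""
    timeline = events_by_author.setdefault(author, [])
    timeline.append((timestamp, author, event))
    timeline.sort(reverse=True)


def read_stream(user, limit=10):
    """Fan in on read: fetch each followed timeline, merge by time, cut."""
    timelines = [events_by_author.get(a, []) for a in following.get(user, ())]
    merged = heapq.merge(*timelines, reverse=True)  # inputs are newest first
    return list(islice(merged, limit))
```

In this model a follow or unfollow is only a change to the `following` set, so the very next `read_stream` call reflects it. The cost moves to the read path, which now does one lookup per followed account plus a merge.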

"Reads are difficult," Bourgon explains. "If you follow thousands of users, making thousands of simultaneous reads, time-sorting, merging, and cutting within a typical request-response deadline isn't trivial. As far as we know, nobody operating at our scale builds timelines via fan-in-on-read."

For Bourgon and the other developers working on Roshi, the solution came in the form of a specific CRDT (a conflict-free replicated data type, in this case of the convergent kind). These data types "manage to sidestep a lot of the common problems and pitfalls associated with distributed systems," Bourgon explains.
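The article does not name the specific CRDT. Roshi's README describes it as a last-writer-wins (LWW) element set, so the sketch below shows that general technique in miniature. The class and method names are hypothetical, and breaking timestamp ties in favor of the add is an arbitrary choice made for this example rather than a statement about Roshi's behavior.

```python
class LWWElementSet:
    """Minimal last-writer-wins element set (a state-based CRDT)."""

    def __init__(self):
        self.adds = {}     # element -> latest add timestamp seen
        self.removes = {}  # element -> latest remove timestamp seen

    def add(self, element, timestamp):
        self.adds[element] = max(timestamp, self.adds.get(element, timestamp))

    def remove(self, element, timestamp):
        self.removes[element] = max(
            timestamp, self.removes.get(element, timestamp)
        )

    def contains(self, element):
        if element not in self.adds:
            return False
        # Assumption for this sketch: an add wins a timestamp tie.
        return self.adds[element] >= self.removes.get(element, float("-inf"))

    def elements(self):
        return {e for e in self.adds if self.contains(e)}

    def merge(self, other):
        """Combine two replicas by keeping the newest timestamp per element."""
        merged = LWWElementSet()
        for replica in (self, other):
            for element, ts in replica.adds.items():
                merged.add(element, ts)
            for element, ts in replica.removes.items():
                merged.remove(element, ts)
        return merged


# Two replicas receive different writes, in no coordinated order.
a, b = LWWElementSet(), LWWElementSet()
a.add("track-123", 1)
b.remove("track-123", 2)
b.add("track-9", 3)
```

Because merging only keeps the maximum timestamp per element, the result is the same whichever replica merges into which and however often the merge repeats. That property is what lets replicas converge without coordinating on write order.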

And with that, Roshi was born.

"One thing that's much easier to do on the new system is to handle social graph updates," says Bourgon. "If you follow or unfollow somebody, in the old system, that could take quite a while to propagate and to become visible on your stream. In the new system, that's more or less immediately apparent."

In the short term, this streamlines certain backend processes, but it also primes SoundCloud's infrastructure for future product updates. "You could imagine that from there, any new feature that involves dynamically adding or removing content to your stream is now a lot easier to do," says Bourgon.

That will come in handy as the platform continues to grow. Since launching in 2007, the Berlin-based service has become known as a sort of "YouTube for audio" and today sees 12 hours of audio uploaded every minute.

When Building Is Better Than Buying

Of course, the SoundCloud team could have crammed any number of off-the-shelf products into this hole and made them work. The old system, for example, was based on Apache Cassandra, a distributed database system that probably could have been bent to meet SoundCloud's needs in this case as well.

"The problem that we're solving with this thing is so simple to articulate and so simple to get your head around that it felt like the cognitive burden of learning a massive system like Cassandra and operating such a big black box felt like it wasn't worth our time," says Bourgon. "Not when we could relatively easily—Roshi is around 2,000 lines of code—craft something that solves specifically the problem we're trying to solve."

In this instance, crafting a homegrown, hyper-specific solution turned out to be more cost-effective and operationally useful than shoving some generic product into place. Of course, that's not always the case, but Bourgon hopes that projects like Roshi will help bolster the case for building, as opposed to buying, whenever appropriate.

"For me, the biggest thing is the shift in philosophy," Bourgon says. "I think a lot of startups are eager to buy software off the shelf that seems like it might fit and then sort of hammer it into place and solve whatever problem they might be solving. And it's not an invalid strategy. But I think we lose sight of the fact that that's just one option."





1 Comment

  • Wow, this is probably the exact same thing I was telling my boss some weeks ago. We're creating complex simulation systems based on C++. Since we had issues with how the performance of certain components affected the rest of the system, we were desperately searching for a profiling tool that could help us with that.

    There are such libraries available (even though C++ has quite bad tool support), but in the end there wasn't one lib that suited our use case. Some only worked with certain other third-party products (like the Unity Profiler), some had license models that made them impossible or costly to use (like Telemetry), and a lot of tools just were not sophisticated enough.

    In the end I took things into my own hands and created the needed project myself: https://github.com/monsdar/CxxProf

    CxxProf is neither the fastest nor the easiest profiling library available, but it does what we needed: it profiles distributed systems and points out where the cause of all evil is.