Software developers at NPR are tackling a huge problem: the organization's massive library of archival content is gathering dust, and the system that organizes it all needs an upgrade. NPR is weighing whether to build the software or buy an off-the-shelf solution, and is looking at spending upwards of six figures to address the taxonomy and cataloging of the growing body of content it creates.
Because NPR creates lots of content in different formats, this is a pretty complex problem. Naturally, the solution won't be simple either.
"We’ve got our radio folks who are producing stories and programs and providing their own set of metadata," says Jonathan Epstein director of software development at NPR. "The library, in charge of their cataloging and archival of 40 plus years of our content, is doing their own effort. Then you’ve got bloggers who are tagging [web posts]. This is all sort of happening independently of each other and we really see a problem of wanting to connect the dots here."
This problem isn't unique to NPR. The recently leaked New York Times innovation report details how even an organization as well-staffed and forward-thinking as the Times can struggle with structured data. At the Times, the lack of metadata surrounding much of its archival content has led to major headaches.
"In the digital world, tagging is a type of structured data: the information that allows things to be searched and sorted and made useful for analysis and innovation," says Epstein. "Some of the most successful Internet companies, including Netflix, Facebook and Pandora, have so much structured data, by tagging dozens or even hundreds of different elements of every movie, song and article, that they have turned the science of surfacing the right piece of content at the right time into the core of thriving businesses."
Epstein says NPR is currently figuring out the best path to take, whether by building something internally or buying a ready-made solution. Ideally the library would own and manage this yet-to-be-discovered tool, which would connect all the metadata between different areas of the business.
"We don’t always know exactly what we want when we start," says Epstein. "We know some details of what we want to build, but we really have to get our hands dirty with things. This is where research spikes come in and are key to this."
During the organization's most recent "serendipity days" (personal time set aside every quarter for projects of interest), two software engineers presented a project that touched on several of the metadata problems NPR is looking to address.
As part of this agile process, the organization is now setting up a research spike and seriously looking into how it can use in-house talent to tackle some or all of its current library problem.
Without mentioning specifics, Epstein hinted that skipping the technical due diligence has bitten NPR in the past, so the organization has been figuring out ways to do that work before it gets too far down one rabbit hole.
The "process"—which isn’t really defined—is definitely an iterative one. "Failure is part of that process," Epstein adds.
For example, with the library project, it’s unclear whether a homegrown solution is even possible because of how extensible it would need to be. But it’s still an option worth exploring at this early phase.
Epstein isn’t expecting to build the whole thing in the two-week research spike window, but he is looking to get a better visual idea of what this still unknown piece looks like.
"It’s [about] starting to define the problem better," Epstein explains. "So even if we can’t do that, we’ve at least developed some sort of artifact where we can start to better understand when we’re looking at these individual vendors that we’re still talking to."
For businesses, though, it’s all about the money. Whether you’re talking about the straight-up sticker price or the absorbed employee time, it comes down to dollars and cents in one way or another.
"For us, buy doesn’t always mean buy," Epstein says. "It could mean open source, particularly because we are so reliant on it adhering to the overall NPR mission. It’s very much the open organization and open source is sort of the technological representation of that same mission."
Besides the metadata project, NPR's most recent buy-versus-build decision took the open source route. Epstein says that contributing back to the open source community is a critical part of working toward the common good, and that it also helps boost the organization's tech reputation.
Most recently, while performance testing and looking for bottlenecks, the team discovered some issues with MySQL connection management. The load wasn’t being distributed evenly across a number of clusters, and in some cases one server was overloaded while another sat idle.
After a few false starts at correcting the problem, the team took a step back and realized that it couldn’t be unique to them, so they went looking and found a PHP plug-in that addressed the issue. After some testing, the fix went live a few days ago.
It's unclear whether having a dedicated guideline to reference for when to build or buy would be helpful for Epstein and his team, but chances are it wouldn't. It seems for NPR, it's all about the circumstances of the project and being able to move quickly.
[Image: Flickr user Dave Herholz]