2013-05-13

Scoops and Software: How The New York Times Tells Stories With Data

Embedded in the heart of the New York Times, Aron Pilhofer runs an experimental news team made up of veteran journalists and top-notch computer scientists. Their job is to tell stories using software, data, and old-fashioned journalistic skills. Here’s how they do it.



When veteran print journalist Aron Pilhofer started the Interactive News team at the New York Times in 2007, the Grey Lady was sputtering. Advertising revenue plummeted so much during the recession that by 2009 the New York Times Company’s stock had lost more than half of its value. In the midst of this turmoil, Pilhofer was running an experimental news team of hacks (journalists) and hackers (professional programmers). Since then, the Interactive News team has built tools to cover huge stories like the London Olympics, Hurricane Sandy, and the 2012 presidential election.

What does your team at the New York Times look like?

The Interactive News team is a combination of digital storytelling and straight-up web development. There are 18 people on the technical side, and that will grow by 2 or 3 in the next year. It's actually more an editorial product team than anything else. We build things like the Olympics website or the entire election night website. We deal with the data coming in from the AP (the Associated Press news agency) or, in the case of the Olympics, the IOC (International Olympic Committee). That can be a buttload of data. At the height of London we were parsing 300 messages a second, so you have to build pretty performant software.
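The Times hasn't published that ingest pipeline, so the following is only a minimal sketch of the shape of the problem: keep per-message work tiny, and never let one bad message take down the feed. The newline-delimited JSON format and the FeedMessage fields are invented stand-ins, not the actual AP or IOC wire formats.

```typescript
// Hypothetical sketch of a high-volume feed consumer. The real AP/IOC
// wire formats aren't public; newline-delimited JSON and the
// FeedMessage fields below are stand-ins for illustration.
import { createInterface } from "node:readline";
import type { Readable } from "node:stream";

interface FeedMessage {
  event: string;     // an event identifier (invented field)
  timestamp: string;
  payload: unknown;
}

// Parse one message; at 300 messages a second, one malformed line
// must never crash the whole feed, so drop and log it instead.
function parseMessage(line: string): FeedMessage | null {
  try {
    return JSON.parse(line) as FeedMessage;
  } catch {
    console.error("dropping malformed message:", line.slice(0, 80));
    return null;
  }
}

async function consume(feed: Readable): Promise<void> {
  for await (const line of createInterface({ input: feed })) {
    const msg = parseMessage(line);
    if (msg) {
      // hand off to downstream work: database write, cache update, etc.
    }
  }
}
```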

What technology stack does your team use?

We are a Rails shop. When we started this team about six years ago, we went off on our own, and we wanted to build on a framework like Rails or Django to get a much quicker path to production. That was a really good decision. That's allowed us to do projects that we otherwise couldn't have, and to scale them. We are doing a lot more with Node.js because it's very fast and has a very small footprint. A lot of the applications we are building, we are building on the client side. Our dashboard, for example, is a Backbone application, but all the logic is in the browser now, whereas it used to be on the server. If you want that immediacy--so that when a reporter posts something new, it's reflected online almost immediately--you need that speed, but you don't need all the bells and whistles of something like Ruby on Rails to generate what amounts to some JSON. We are using Node.js a lot more for that reason.
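As a rough illustration of that Node.js use case--not the Times's actual code, and with an invented route and data shape--the entire server side of such a dashboard feed can be this small:

```typescript
// A minimal sketch of the Node.js role described above: a tiny server
// whose whole job is handing the browser fresh JSON, no framework
// required. The route and Post shape are invented for illustration.
import { createServer } from "node:http";

interface Post { id: number; author: string; body: string; at: string }
const posts: Post[] = []; // in-memory store, enough for the sketch

createServer((req, res) => {
  if (req.method === "GET" && req.url === "/updates.json") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(posts));
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(3000);
```

A Backbone app (or any client) can then poll that endpoint and do all the rendering in the browser--the server never touches a template.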

What’s the latest project you’re working on?

This month we started down a road of using “Big Data” techniques with news. We have our own campaign finance database. We bring in millions of records. We want to find ways to algorithmically discover clusters of donors and patterns of donations using machine learning. We are experimenting with this right now on mayoral election data. The other day, Chase, who runs that team, had something like 32 cores crunching and munching up on the Amazon cloud. We are starting with Washington and politics generally because there is so much data that we think we can create tools that will help us discover things that wouldn't be visible to the naked eye. In particular, we look at unstructured data like text--what different members said, press releases, what's on their websites, what's in their Twitter feeds. We think Chase has a pretty smart approach to linking donors together so you can aggregate donations, which is really quite hard to do and is normally a very manual process.
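The article doesn't describe Chase's actual linking algorithm, so the sketch below shows only the conventional first step of any entity-resolution pipeline: normalizing names into a "blocking key" so that likely-identical donors collide before a smarter model compares them. The record fields and key scheme are assumptions for illustration.

```typescript
// Naive sketch of the first step in donor linking. The Times's actual
// approach isn't described in the article; this only shows blocking --
// normalizing records into a key so likely-identical donors collide
// before any smarter comparison runs. Fields are invented.
interface Donation { donor: string; zip: string; amount: number }

function blockingKey(d: Donation): string {
  const name = d.donor
    .toLowerCase()
    .replace(/[^a-z ]/g, " ")     // strip punctuation: "Smith, John Q."
    .split(/\s+/)
    .filter((t) => t.length > 1)  // drop stray initials
    .sort()                       // word order no longer matters
    .join(" ");
  return `${name}|${d.zip}`;      // "JOHN SMITH" and "Smith, John Q."
}                                 // in the same ZIP now collide

function aggregateByDonor(donations: Donation[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const d of donations) {
    const key = blockingKey(d);
    totals.set(key, (totals.get(key) ?? 0) + d.amount);
  }
  return totals;
}
```

Real record linkage adds fuzzy string comparison, address normalization, and human review of ambiguous matches--which is why it is "normally a very manual process."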

What about cases where you don’t have access to all the relevant data?

That's a problem across the board. In campaign finance data, donations are only reported once a donor reaches a particular threshold, so you are actually missing the vast majority of donors. This came up big-time in the 2008 presidential election. Obama was raising so much from small donors--getting $20 or $50 from people all around the country--and giving away T-shirts or whatever. We have no idea who the vast majority of donors to that campaign were. How do you deal with that? You just need to be transparent with readers about how you arrived at what you arrived at. The New York Times has a relatively long history of doing this kind of work. No database is perfect. You need to understand the limitations of the data. It's a little bit closer to how one does science than to how one does the humanities, but in the end you are still getting to a story.



Is your process different for different types of stories?

Absolutely. When we launched our dashboard for (Hurricane) Sandy, we didn't expect it to go for eight straight days. We thought maybe a day or two. We found very quickly that the piece of software we had written for a short event--a Super Bowl, an Oscars night--fell apart when it became a more persistent feature of the site. Pretty soon you have a very fat data file. We had to do some creative hacking to stop the thing from completely falling apart. Contrast that with, say, the London Olympics, which was a massive 18-month project. It's a data processing problem, and it's a publishing problem, since there are hundreds of pages that you are publishing. It was a white-label product as well. We had an API. We had 12 clients, including people like the Guardian and CBC. A project like that is a huge scaling problem. You have to build a very robust system. "Hack" is a four-letter word in a case like that. You need to have it fully planned out, you have to have the right people on it, you need to know that this is going to work.
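He doesn't say what the creative hacking was, but a classic fix for an ever-fattening live data file is to cap what the live page loads and move older updates into archive pages. A hypothetical sketch, with an invented Update shape and window size:

```typescript
// Hypothetical sketch of one fix for the "fat data file" problem:
// cap the payload the live page fetches; archive everything older.
// The Update shape and window size are invented for illustration.
interface Update { id: number; at: string; html: string }

const LIVE_WINDOW = 100; // most recent updates every reader downloads

// Assumes `all` is ordered oldest-to-newest.
function splitPayload(all: Update[]): { live: Update[]; archive: Update[] } {
  return {
    live: all.slice(-LIVE_WINDOW),       // small file, fetched often
    archive: all.slice(0, -LIVE_WINDOW), // big file, fetched rarely
  };
}
```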

What did you learn from the Olympics?

The Olympics is a very difficult project because you don't have good, or even close to good, test data. You don't have a full Olympics to play out. We ended up creating a piece of software to generate this stuff and tested mid-stream to see what it looks like when you have a bracket in hockey--one team has advanced but the other team hasn't, and seeding can change depending on who wins and who plays whom. All that logic needs to be worked out, though, and if you don't do that ahead of time you are really screwed. That's what we did for London. But there was a moment when we were watching one of the curling matches and it was tied, and I said, “What happens if they’re tied at the end?” So we looked it up, and they just keep going. We hadn't really accounted for that. So we were furiously hacking this in--“save, deploy, boom!”--and we got it just in time. That's the kind of thing you can't have happen. We had dramatically underestimated the size and scope of a project like that. We had underestimated the scaling issues--not to serve pages, but to actually render them. We pulled it off, but not, I would say, elegantly.
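The curling rule itself is trivial to state as code once you know it--which is the point: the edge case costs one line, but only if you discover it before the match is live. A sketch of the tie-handling rule (Olympic matches run ten regulation ends, then extra ends until the tie breaks; the function shape here is invented):

```typescript
// The curling edge case as a testable rule: a match can't end in a
// tie; extra ends are played until someone leads. Ten regulation
// ends per Olympic match; the function shape is invented.
const REGULATION_ENDS = 10;

function matchOver(endsPlayed: number, home: number, away: number): boolean {
  if (endsPlayed < REGULATION_ENDS) return false; // still in regulation
  return home !== away; // tied after any end? they just keep going
}
```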

What tips do you have for others interested in telling stories with data?

A lot of people think that we go into databases and try to find the stories, and it's actually the exact opposite. Data is a source like any other source. It's fallible, it's incomplete. Just like with a human being, it's sometimes hard to know where the incompleteness and the lies are. True storytelling is a combination of narrative--you have an idea of what the story is--and then using the data as a way to support what you have already reported. A lot of the applications we build, the projects that we work on, take that same approach. It needs to tell a story first and foremost. If you can't look at an interactive and know what the lede is and what the headline is--if you can't figure that out within 5 seconds--you've failed. I see people who get too excited about the data and not excited enough about the story. I see cases where there is just too much data. Narrow it down to the really key elements that people want and need, and have some sense of information hierarchy--what I would think of as the lede and the nutgraf (a summary of what the story is about). If you don't think about those things, you end up with a completely indecipherable mishmash.

How can you tell whether you have a clear story or a “mishmash”?

It's design, in a way. Design is a very undervalued skill. I started this team with the idea that the content is what matters--it doesn't matter if it's pretty; as long as the content is amazing, people will come. I couldn't have been more wrong. I've retooled the team over the years to bring on more front-end people who get design. One of my deputies is a designer. That ability to identify what you are really trying to say is really hard. It's particularly hard when you are also the builder of the thing. The hardest thing in the world is to put yourself, as the builder of the thing, in the mindset of the person who is coming to it fresh. Find someone else to look at it, even if it's your mom. But it's something you can learn. I've been a data guy forever. I am the kind of freak who looks at a spreadsheet and the hair on my arms stands up, I get so excited. It still took me a very, very long time to develop a sense of how to tell a story in these non-traditional ways, and I'm still learning a lot. This is a lifelong journey.


Want to learn more about how software and journalism work together? Read our tracker on the future of news.


[Image by Flickr user Alec Perkins]