Big data can be a nightmare. Sure, it's powerful stuff, but as anyone who's worked with large sets of raw data knows, they can be an epic bitch to wrangle. Cleaning things up into a consistent, usable format winds up burning an extraordinary amount of time—precious hours that could be going into something far more productive. But that may be about to change.
The issue is known among data scientists and statisticians as data transformation. And by Jeff Heer's estimate, it can eat up anywhere between 50% and 80% of a data wrangler's time. The University of Washington professor and longtime data visualization specialist is now the cofounder of a startup called Trifacta, which offers a web-based platform for easily transforming data sets.
"We've been trying to address this by changing the coding exercise into a sort of visual exploration," says Heer about Trifacta, which runs in the browser and uses behind-the-scenes algorithms to smartly reshape data without needing to code.
"As I looked back on what I was doing, I realized I was spending the majority of my time on data preparation," says Heer. "Everything from finding the data sets that are relevant to different questions, seeing if they're even responsive to that question, seeing what the quality is like, and transforming the format of the data."
Heer knew he wasn't alone. Before building Trifacta, he and cofounders Joe Hellerstein and Sean Kandel interviewed data gurus at dozens of companies and found that, indeed, data transformation is a huge pain in the ass for just about everyone.
"This is the elephant in the room of data science," Heer says. "I feel like we can make this process much more efficient and accessible."
Currently, most data transformation is done by hand inside software like Excel. As you might imagine, it's tedious work.
Just ask Enzo Yaksic, who has been collecting data about serial killers for 13 years. As the founder of the Serial Homicide Expertise and Information Sharing Collaborative (SHEISC), Yaksic has taken data sets from a variety of sources and built a centralized database of serial killers that has been used for academic and journalistic purposes.
With data coming from various law enforcement agencies and independent researchers, the results were anything but consistent. Each spreadsheet contained its own set of columns, and even the columns that matched often had data points spelled out in different ways from spreadsheet to spreadsheet.
"The setup of each file was different," says Yaksic. "Some files were messy while others were pristine. Misspellings caused several duplicate records that had to be removed."
Merging the data, scrubbing it clean, and zapping duplicate records was all a manual process for Yaksic, who painstakingly pored over each cell in Excel, all in the name of good data.
Not only is this process every bit as tedious as it sounds, but it has a bigger problem: It doesn't scale. For smaller academic or marketing-oriented data projects, manual data transformation is possible, even if it's not fun. But once the data reaches a certain size, this type of cell-by-cell tweaking proves impossible.
To modify large data sets, data wranglers typically write some kind of custom script. This approach is more efficient than doing things manually, but it's not without issues of its own.
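For a sense of what such a script involves, here is a minimal Python sketch that normalizes inconsistent column names and removes near-duplicate records caused by misspellings. It is purely illustrative: the sample rows, the function names, and the 0.85 similarity threshold are all invented for this example, and it is not the method Yaksic or Trifacta actually uses.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase keys and values and collapse stray whitespace."""
    return {
        key.strip().lower(): " ".join(str(value).split()).lower()
        for key, value in record.items()
    }


def is_near_duplicate(a, b, threshold=0.85):
    """Treat two strings as the same entry if they are similar enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def dedupe(records, key="name", threshold=0.85):
    """Keep the first record of each group of near-identical names."""
    kept = []
    for record in map(normalize, records):
        if not any(
            is_near_duplicate(record[key], seen[key], threshold) for seen in kept
        ):
            kept.append(record)
    return kept


# Invented sample rows: same person spelled two ways, columns cased differently.
rows = [
    {"Name": "John  Smith", "State": "WA"},
    {"name": "Jon Smith", "state": "wa"},
    {"NAME": "Maria Lopez", "State": "OR"},
]
cleaned = dedupe(rows)
```

Even this toy version compares every record against every record already kept, which hints at why hand-rolled scripts get slow as the data grows.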
"Writing this code is extremely time-consuming," says Heer. "Somebody writes the code, they run it on the data, and it takes a long time to run in many cases. And then you're looking at raw outputs to try and assess if the code did the right thing. If not, they have to go back, fix their code, debug it, and then run it again."
It's also technically prohibitive for many. "People who may have interesting analysis questions and not necessarily deep programming skills are sort of left out of the process altogether," explains Heer.
In short, neither manual editing nor semi-automated scripting has nailed the data transformation problem, in Heer's view. And he's not alone in thinking so.
The problem isn't limited to a single profession or industry. In the era of big data that is now well underway, companies of all sizes are generating ever-growing mountains of data. Rarely does it get spit out of their systems in a format that's easy to analyze, visualize, and learn from.
Take health care, for example. As the U.S. transitions to electronic health records, data about patients and their treatment holds huge potential just waiting to be unlocked. But the information generated by one health care provider's office may look wildly different from what comes out of another. Then factor in insurance companies and you've got a huge mess. Meanwhile, patients are starting to generate streams of their own personal data, a trend that will likely accelerate once Apple jumps in on the wearable gadget game and puts its new Health app on every iOS 8 home screen.
Tying all of this health-related data together in any meaningful way will only get more challenging as the breadth of the data grows. Heer knows what kinds of data headaches the health care industry experiences. One of Trifacta's first clients is a subdivision of Lockheed Martin that processes all Medicare and Medicaid claims on behalf of the U.S. government.
Trifacta lets users ingest a sample of their data, which is then viewable both as a spreadsheet and via built-in visualizations. Once the data is loaded in, users can make selections of the parts that need some tweaking. This is where things get interesting.
"Based on that selection, we have algorithms that actually search over a space of possible transformations," says Heer. "One way to think about this is as sort of an intelligent autocomplete." For instance, you might select a given chunk of cells and Trifacta will suggest replacing missing values or getting rid of unnecessary columns. To accomplish this, Heer and his team designed a proprietary programming language.
"Behind the scenes, our software is searching over possible statements in that language," says Heer. "But the flip side that's also pretty exciting is that we can take that programming language and then compile it to run in a variety of environments. So you might be working with a small subset of data directly in the browser but the results of that is not just the transformation of that sample, but it's actually a script that we can then take and run at scale across your cluster."
In other words, those scripts that you were hand-coding for one-off data transformations before? Trifacta writes the code for you and then spits it out in whatever format you need.
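Trifacta's language and search algorithms are proprietary, so the following Python sketch is only a toy illustration of the general idea: look at what the user selected, rank a couple of candidate transformations, then replay the chosen one as a script on a larger data set. Every name here (`suggest`, `apply_step`, `run_script`, the two operations, and the ranking rule) is a made-up assumption, not Trifacta's actual design.

```python
def is_missing(value):
    return value in (None, "")


def suggest(table, column):
    """Return candidate steps for the selected column, most plausible first."""
    values = [row.get(column) for row in table]
    share_missing = sum(map(is_missing, values)) / len(values) if values else 0.0
    candidates = []
    if 0 < share_missing < 1:
        # Mostly complete column: filling the gaps is the likelier intent.
        candidates.append((1 - share_missing, ("fill_missing", column, "unknown")))
    if share_missing > 0:
        # Mostly empty column: dropping it becomes the likelier intent.
        candidates.append((share_missing, ("drop_column", column)))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [step for _, step in candidates]


def apply_step(table, step):
    """Apply one step of the toy language to a list of row dictionaries."""
    op, column, *args = step
    if op == "fill_missing":
        return [
            {**row, column: args[0] if is_missing(row.get(column)) else row[column]}
            for row in table
        ]
    if op == "drop_column":
        return [{k: v for k, v in row.items() if k != column} for row in table]
    raise ValueError(f"unknown operation: {op}")


def run_script(table, script):
    """Replay a list of steps, e.g. on the full data rather than the sample."""
    for step in script:
        table = apply_step(table, step)
    return table


# Invented sample: the user selects the "state" column, which has one gap.
sample = [
    {"city": "Seattle", "state": "WA"},
    {"city": "Tacoma", "state": ""},
    {"city": "Portland", "state": "OR"},
    {"city": "Spokane", "state": "WA"},
]
full_data = sample + [{"city": "Eugene", "state": None}]

suggestions = suggest(sample, "state")
script = [suggestions[0]]  # the user accepts the top suggestion
result = run_script(full_data, script)
```

The point of the sketch is the separation: the suggestion is made on a small sample, while the resulting script is just data that could be replayed anywhere, which is the property Heer says lets the real product run at cluster scale.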
"Not only are we trying to make data transformation much more visual and interactive, but we're doing so in a way that can actually scale to large data volumes," says Heer.
The result? A tenfold increase in efficiency, the company boasts. Having just raised $25 million in a Series C round of funding, Trifacta has its sights set on streamlining the process even further.
[Image: Flickr user Philip Kromer]