Who wouldn't want to be a data scientist, the latest glamor job of the Nerd World? GigaOm told us how to find one or be one. Wired said you don't need a PhD in Mathematics to be one. You can even be an amateur data scientist, and Smart Data Collective has picked the sample projects to start with. Unable to resist any longer, I signed up to spend a month of immersion in data, hoping to emerge a newly minted data scientist.
I went to Turin, Italy, to participate in the first Big Dive, which bills itself as “a training program to boost the technical skills needed to dive into the big data universe.” It would cover development, visualization, and data science.
Our deep dive into big data took place at the Casa del Pingone, a 15th-century house decorated with medieval frescos and equipped with the all-important espresso machine.
By the last day of the course, Fariba, my gelato-loving Iranian physicist project partner, and I had nothing for our final presentation the next morning. Our analysis of a week's worth of Twitter data had led us only up sundry blind alleys. The sight of our classmates purposefully completing their presentations felt intolerable. I had fled to the park where I was eating pizza and seriously considering not going back. The pizza wasn't helping; I was a failed data scientist.
My qualifications as a wannabe data scientist were limited to a background in Machine Learning from the previous millennium, attendance at one Strata Conference, and the publication of a few articles on big data applications. I wasn't sure I was ready for the challenge, but since even data scientists themselves couldn't seem to agree on what a data scientist did, how hard could it be?
Big Dive's grab bag of technologies included TurboGears2, Hadoop MapReduce, MongoDB, as well as Python-based tools for machine learning and network science and libraries such as D3 for visualization. It was a challenging program for a seriously lapsed developer like myself. My Big Dive classmates included software developers, data analysts, and an eclectic mix of scientists—a surprising number of theoretical physicists who'd turned their attention to social networks.
"The biggest blocker (for scientists) is code," says Jake Klamka, founder of the Insight Data Science Fellows Program, an intensive six-week postdoctoral training fellowship in data science. “Going from coding in a scientific context to development at the level that technology companies do it, learning computer science fundamentals and software engineering best practices. You can't walk into an interview at Facebook or Square and say 'I just do MATLAB.'"
For software developers there are different challenges. “Engineers think about building things,” Klamka explains. “Data science is about asking the right questions. That's what scientists are phenomenally good at.” On the other hand, neither the scientists nor the software developers knew anything about visualization.
We were instructed to first acquire and parse the data, then filter out information which was not of interest, mine to find patterns, and finally to capture significant patterns in a visual representation. We would be using visualization as a means to explore the data, to sketch using code, and to present the final result.
“Data is invisible,” said our lavishly mustachioed visualization instructor, Giorgio. “To create a map between visual elements and abstract information, we need a visual language which has the same complexity and capabilities as a spoken language.”
The palette of visual elements included color, positioning, orientation, size, shape, style, and texture. We were all, developers and physicists alike, to become data cartographers.
Visualization didn't have to be complicated. Our first exercise used FatFonts, a numerical typeface where the amount of pixels or ink used is proportional to the number represented. The biggest mistake, we were told, was to create a visualization which was mere eye candy and did not express an insight clearly. What to represent is always more important than how to represent it.
“Visualization is extremely hard,” says Klamka. “In our program I almost discourage people from doing visualization projects. The visualization has to communicate instantly a very precise piece of information. It has to be dead simple.” However, before we could visualize anything, we needed some data.
Our first real-world data set was supplied by the regional government of Piemonte in Turin, which was in the midst of a scandal. Suspiciously high meeting expense claims by the government's councilors (Consiglieri Regionali) were causing an uproar in the region. In response to the allegations, the government published the expense claim data from the previous year. Councilors could claim an attendance fee for each meeting, whether it was an official meeting in the council chamber or with external parties offsite, and for mileage when traveling to meetings. This was not Big Data. The entire data set consisted of a few spreadsheets.
The project team consisted of several developers, a computer scientist, a data analyst, and me. Where to start? Hilary Mason is the chief scientist at Bitly. “The hardest part of data science is understanding a problem,” she says. “People often assume that all of the answers are in the data, and that domain knowledge is a secondary concern. In fact, it's often the other way around. The data provides the context to make a decision.”
Marcello and I researched the background and added extra information—how the regional government worked, the rules for expense claims, the salaries, professions, parties, and genders of the councilors. Roberto, the data analyst, and the only one of us with any data experience, sliced-and-diced the data using Excel pivot tables—claims per person per party, age, and education distributions of councilors, attendance records. The developers tinkered with JSON dictionaries and D3 plug-ins. As a group, we came up with many good questions but no interesting answers.
Days passed and we still weren't getting anywhere. Then Roberto made a discovery. He found that, based on the expense claims, there were two peaks in meeting attendance during the year: the month of May, just before the summer break, and August. Italians go on holiday in August and there are no official meetings in the council chamber in that month. Members of the ruling Popolo della Libertà political party (headed by disgraced Prime Minister Silvio Berlusconi) claimed more per person in August than any other party.
Luca converted the data into a JSON dictionary and after much wrangling about the choice of visualization, Kevin coded up a simple streamgraph showing expense claims totals for each party across the year. The regional government eventually introduced cost-cutting measures eliminating claims for meetings organized by external parties.
Cathy O'Neil is a data scientist at Johnson Research Labs. “Not every problem can be solved with data,” she insists. “Some problems are really political or systemic and won't give way just because we have really excellent data with which to track them. That's analogous to thinking that if we only knew how many calories we ate, we'd all suddenly be the perfect weight.”
Next up, a tour of machine learning and network science. Machine learning algorithms can “learn” from data, though what is learned can vary widely. A regression algorithm might be used to predict the price of the next house sold in your neighborhood given its attributes. A neural network can be trained to classify a tumor as cancerous or benign based on previous results. Clustering algorithms are used by Amazon to make personalized recommendations by assigning you to a bucket of “similar” buyers.
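To make the regression example concrete, here is a minimal sketch of a one-variable least-squares fit. It is not what we ran at Big Dive; the function names and the house data (floor area in square meters, price in thousands of euros) are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single attribute: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new attribute value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical past sales: made-up numbers, not real data.
AREAS = [50, 80, 100, 120]     # square meters
PRICES = [150, 240, 300, 360]  # thousands of euros

model = fit_line(AREAS, PRICES)
next_price = predict(model, 90)  # estimate for a 90 square meter house
```

A real model would use many attributes and far messier data, but the principle is the same: fit parameters to past sales, then apply them to the next house.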
Network science studies complex networks as if they were natural phenomena. In the Big Data context, these are mainly networks created by human behavior: global airline routes, the structure of your Facebook network, how information flows through Twitter. We learned the difference between a clique and a small world and wrote simple Python scripts to find the person in our Facebook network with the highest “betweenness centrality” (that's you, Lital Marom) and the world's most connected airport.
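For the curious, betweenness centrality counts how many shortest paths between other pairs of people pass through a given person. The sketch below is a plain-Python version of Brandes' algorithm for an unweighted, undirected network. It is not the script we wrote in class, and the toy friendship network and its names are invented for illustration.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    graph maps each node to the set of its neighbors and must be
    undirected: if b is in graph[a], then a is in graph[b].
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:          # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing credit to predecessors.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical network: two tight groups of friends linked only through "dee".
FRIENDS = {
    "ana": {"ben", "cal"},
    "ben": {"ana", "cal"},
    "cal": {"ana", "ben", "dee"},
    "dee": {"cal", "eve"},
    "eve": {"dee", "fay", "gus"},
    "fay": {"eve", "gus"},
    "gus": {"eve", "fay"},
}

scores = betweenness(FRIENDS)
broker = max(scores, key=scores.get)
```

In this toy network the person bridging the two groups comes out on top, even though others have more friends. That is the point of the measure: it rewards being on the route between people, not popularity.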
For our final project, we were given a 10% sampling of seven days of global Twitter data from the height of the Arab Spring, the week that Hosni Mubarak resigned as president of Egypt. The data for the entire week was over 120GB. Our job was to find a needle of insight in this haystack of tweets. Our excavation tool was Amazon Elastic MapReduce, which for some reason I kept calling Madreduce.
MapReduce does something very simple but very powerful. The main bottleneck in operations on large data sets is reading in the data. Even if you can ramp up the input speed to 50 MB/second, it takes more than five hours to read in a terabyte. The solution is to split the data up into disjoint subsets, store each subset on its own disk, use local CPUs to perform an operation simultaneously on each subset, and then put the results together to produce a final answer. With MapReduce you could spin up as many CPUs as you liked to scan through the Twitter data and find, say, the most retweeted tweet on a particular day. It still took several hours to run a simple operation on a single day's worth of data.
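The split, map, and reduce steps can be sketched in a few lines of ordinary Python. This is only an illustration of the idea, not Hadoop or Elastic MapReduce code: the subsets are processed one after another rather than on separate machines, and the records and function names are made up.

```python
from collections import Counter
from functools import reduce

# Hypothetical records: (day, id of the tweet that was retweeted).
RECORDS = [
    ("2011-02-11", "t1"), ("2011-02-11", "t2"), ("2011-02-11", "t1"),
    ("2011-02-11", "t3"), ("2011-02-11", "t1"), ("2011-02-11", "t2"),
    ("2011-02-12", "t4"),
]


def split(records, n_parts):
    """Partition the records into n_parts disjoint subsets."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_count(subset, day):
    """Map step: count retweets per tweet id within a single subset."""
    return Counter(tweet_id for d, tweet_id in subset if d == day)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def most_retweeted(records, day, n_parts=3):
    """Return (tweet_id, count) for the most retweeted tweet on a day, or None."""
    # On a real cluster each map_count call would run on a different machine.
    partials = [map_count(subset, day) for subset in split(records, n_parts)]
    total = reduce(reduce_counts, partials, Counter())
    return total.most_common(1)[0] if total else None


winner = most_retweeted(RECORDS, "2011-02-11")
```

Because each subset is counted independently, adding machines shortens the map step, while the reduce step only has to merge small summaries rather than reread the raw tweets.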
Luckily for me, my project partner, Fariba, the Iranian astrophysicist, knew not only Python but some Arabic, too. We decided to restrict our investigations to tweets from the Middle East. Tweets don't include a location, but they do usually have a time zone, which we used for rough filtering. MapReduce labored for hours looking for significant increases or decreases in the volume of tweets from various countries over the week.
On one particular day, the results seemed to show that Cairo had disappeared entirely from Twitter. Had the Egyptian government shut down Twitter? The answer turned out to be less exciting. I had excluded part of the data set from the MapReduce operation by misusing an index.
More questions, more overnight calculations, and none of our hypotheses panned out. In a last-ditch effort, Fariba wrote a script to analyze the text (not the hashtags) of all the tweets from the Middle East and find the words most commonly used on each day of the week. There was only one problem. The script kept failing to run to completion on MapReduce. So there we were on the last afternoon: Fariba still tinkering with the script, me eating pizza in the park.
On the following morning, the project presentations were about to start. There was no sign of Fariba. I had prepared a presentation explaining why I was a failed data scientist. Fariba finally arrived, turned to me and said “Did you see my presentation on Dropbox?” The latest script had miraculously run overnight. From the results she had created a wordcloud of the words most used by Twitter users in the Middle East on each day of the week after Mubarak's resignation. I scanned through it, elated. It told a story.
On Friday, Feb 11, 2011 the Egyptian president resigned and an army council took over the country. For the next two days the talk on Twitter was first celebratory and then started to include words like “protestors,” “police,” and “missing” as the military announced that it intended to retain power for six months or more while elections were scheduled.
On day four the dominant phrase on Twitter was “TheGuyBehindOmarSuleiman.” This mysterious individual, who appeared behind the Egyptian vice president during his speeches throughout the unrest, suddenly obsessed the entire region. No one knew his name or his role (he was later identified as Egyptian army Lieutenant Colonel Hussein Sharif), but his appearance spawned a Facebook page, a YouTube song, numerous photoshopped images, and a deluge of jokes on Twitter.
Egyptians had found a way to criticize the military leadership and have a good laugh at the same time. We had something.
If my experiences are anything to go by, data scientists are a bit like the sweater-wearing, Danish detective Sarah Lund from The Killing, who is always obsessively pursuing a line of inquiry which rarely turns out to be the correct one. In The Killing she eventually gets it right. As aspiring data scientists, we often didn't.
“The biggest challenge is making something useful,” says Klamka. “To go from looking for insights in data where the end goal is to publish a paper, to the goal of providing value to a business. Not 'Is this interesting?' but 'Is this useful?'”
O'Neil warns against underestimating the depth of mathematical knowledge required. “People think they can do real data scientist work without understanding the math underlying the algorithms,” she comments, “but most of the time there are weird results before you make things work. You can only really understand those weird results if you understand the mathematical engines.”
Being a data scientist may require deep technical skill, but there's another dimension that's just as important. “Data science is not just about number-crunching,” Klamka concludes. “It's all about people. The data comes from what people are doing; great data scientists have an ability to understand people and the ideal result is something which is going to help people.”
[Image by Rosa Menkman on Flickr]