Deciding where to build a new coffee shop or fast food outlet is an expensive and risky business. Traditionally, planners use data on demographics, revenue, nearby businesses, and aggregated human flow, much of which is expensive to gather. But the expense is worth it, because when it comes to foot traffic, even a few feet can make a huge difference.
“Open a new coffee shop in one street corner and it may thrive with hundreds of customers. Open it a few hundred meters down the road and it may close in a matter of months,” explains University of Cambridge researcher Anastasios Noulas and colleagues in a new paper that puts a social spin on choosing the best retail location.
In addition to the usual geographical data, Noulas’s team wanted to see if adding freely available Foursquare check-in data combined with Machine Learning algorithms to the mix could help planners choose better locations. The team focused on three chains ubiquitous in New York, Foursquare’s home turf: Starbucks, McDonald's, and Dunkin’ Donuts.
The researchers started by looking at features that might affect foot traffic, like what other businesses operated near a given location, including the number of competitors, and how diverse those businesses were. They also took into account nearby landmarks that help attract customers. People coming out of a train station, for example, often head to a Starbucks, so features like these were included in location descriptions.
Then the team turned to Foursquare, which they used to understand how people flow between locations. By analyzing 620,932 check-ins shared on Twitter over a period of 5 months (about 25% of all Foursquare check-ins during that period), they were able to determine if an entire area is popular, instead of just one retail location, and analyze how users move from one retail location to another within an area and from outside it. For each feature they identified, the team computed a score that they used to rank each candidate location.
These features and values were used to describe each location and train several different supervised Machine Learning algorithms: Support Vector Regression, M5 decision trees, and Linear Regression with regularization. Each algorithm was trained 1,000 times using a random sample of two-thirds of the locations and their known levels of popularity. The trained algorithms then tried to predict how popular the remaining third of locations would be. The result was a ranking of the locations with the optimal location at the top of the ranking. The predicted ranking was then compared with the true popularity of those locations measured via Foursquare check-ins.
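The evaluation protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the two features and the `top_k_overlap` metric are hypothetical stand-ins, and a small gradient-descent ridge regression stands in for the study's actual learners.

```python
import random

def fit_ridge(X, y, lam=0.01, lr=0.1, steps=300):
    """Linear regression with L2 regularization, fitted by plain gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, x):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def top_k_overlap(pred, true, k):
    """Fraction of the truly top-k items that also appear in the predicted top-k."""
    def order(scores):
        return sorted(range(len(scores)), key=lambda i: -scores[i])
    return len(set(order(pred)[:k]) & set(order(true)[:k])) / k

def repeated_holdout(X, y, rounds=1000, k=3, seed=0):
    """Train on a random two-thirds, rank the remaining third, average over rounds."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    cut = (2 * len(X)) // 3
    scores = []
    for _ in range(rounds):
        rng.shuffle(idx)
        train, test = idx[:cut], idx[cut:]
        model = fit_ridge([X[i] for i in train], [y[i] for i in train])
        pred = [predict(model, X[i]) for i in test]
        scores.append(top_k_overlap(pred, [y[i] for i in test], k))
    return sum(scores) / len(scores)

# Synthetic stand-in data: 30 locations, two made-up features, noisy popularity.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(30)]
y = [3 * a - 2 * b + data_rng.gauss(0, 0.1) for a, b in X]
print(repeated_holdout(X, y, rounds=20))
```

The demo call uses 20 rounds rather than 1,000 only to keep the pure-Python run short; the default matches the number of repetitions reported above.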
The team found that the check-in patterns for each of the three chains were unique. Starbucks locations had five times as many check-ins as McDonald's and Dunkin’ Donuts, a difference not entirely accounted for by the fact that Starbucks has twice as many stores as McDonald’s and Dunkin’ Donuts in the area around Manhattan. A Starbucks was also much more likely to be located by a train station than the other two chains.
The most predictive individual features of any given location also varied between chains. Competitiveness was the most predictive feature for Starbucks, indicating that the stores do better when they face less competition nearby. Incoming Flow, or customers coming from outside the retail area, was the top feature for McDonald's, whose customers will travel farther for a burger. Dunkin’ Donuts, on the other hand, sees most business from customers stopping in to refresh themselves during a shopping spree: the number of customers who also checked in at other nearby businesses was the most important feature for the chain.
In spite of these differences, the study found that a combination of traditional geographical and Foursquare-based mobility features produced the best predictions for all three chains. Using the methodology, the researchers chose locations for Starbucks with 67% accuracy overall, and 76% when predicting the top 10% and the top 15% most popular locations.
But don’t think that you can attract more stores to your neighborhood by checking in just yet. Although their results seem to show a correlation, one possible flaw in the research is that Foursquare check-ins are used as a proxy for the popularity of a location. The researchers don’t provide evidence in the paper to show that check-in numbers translate into overall popularity.
August 14, 2013
Superstar statistician Nate Silver recently ruffled some feathers in the data science world by proclaiming that “Data scientist is just a sexed up word for statistician.” Now IT industry analyst Robin Bloor has claimed that there is no such thing as data science, because a science must apply the scientific method to a particular domain:
Science studies a particular domain, whether it be chemical, biological, physical or whatever. This gives us the sciences of chemistry, biology, physics, etc. Those who study such domains will gather data in one way or another, often by formulating experiments and taking readings. In other words, they gather data. If there were a particular activity devoted to studying data, then there might be some virtue in the term “data science.” And indeed there is such an activity, and it already has a name: it is a branch of mathematics called statistics.
So is data science just statistics by another name? Data scientists seem to view statistics more as a tool they use to a greater or lesser degree in their work rather than the domain of their science, as Bloor suggests. The relationship is kind of like the one between the content of the theoretical courses you’ll find in a computer science degree and what a working coder actually does day to day.
Data scientist Hilary Mason (formerly of Bitly, now Accel Partners) made this comment about Silver’s claim: “I'm a computer scientist by training who explores data and builds algorithms, systems, and products around data. I use statistics in my practice, but would never claim to be an expert statistician.”
O’Reilly’s Analyzing the Analyzers report seems to confirm the idea that statistics is just one tool of data science rather than the focus of the field. The study showed that data science already involves a range of roles from data businessperson to data researcher, with statistics featuring much more prominently in some roles than others.
Commenters on Bloor’s post also pointed to the extensive use of machine learning, and not just statistics, in the data science world. The overlaps and differences between machine learning and statistics are themselves a contentious issue, as both fields are interested in learning from data. They just have different objectives and go about it in different ways. Data scientist and Machine Learning for Hackers author Drew Conway explains the difference this way:
Statisticians approach their work by first observing some phenomenon in the world, and then attempting to specify a model — often formally — to describe that phenomenon. Machine learners often begin their work by possessing a large number of observations of some real world process, i.e. data, and then impose structure on that data to make inferences about the data generating process.
The online debate implies that statisticians are interested in the causality and the interpretability of the formal models of the world they create. The more engineering-oriented machine learning community uses statistical methods in some of its algorithms, but is more interested in solving a practical problem in an accurate way even if the model built by the machine learning algorithm is not easily understandable. Data scientist John Mount described the distinction as follows:
The goal of machine learning is to create a predictive model that is indistinguishable from a correct model. This is an operational attitude that tends to offend statisticians who want a model that not only appears to be accurate but is in fact correct (i.e. also has some explanatory value).
But data scientists don’t see statistics or machine learning as encompassing the entirety of their discipline. Mount goes on to say:
Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models and deploying models into production. Machine learning and statistics may be the stars, but data science the whole show.
Finally, the comments on Bloor’s post dove into the prickly point of whether the word "science" should be used at all in this context given that it implies repeatability and peer review, neither of which may apply to data science done in commercial companies. Here, it’s useful to again point to the differences between theoretical computer science and the everyday work of the average hacker, which is more engineering than lab work. Drew Conway captures this distinction nicely:
As data science matures as a discipline I think it will be closer to a trade discipline than a scientific one. Much in the same way there are practicing physicians, and research physicians. Practicing doctors have to constantly review medical research, and maintain their understanding of new technologies in order to best serve their patients. Conversely, research physicians run experiments and build knowledge that other doctors can implement. Data scientists will implement the work of statisticians, machine learners, mathematicians, computer scientists etc., but very likely spend little to no time building new models or methods.
Much like computer science, the data science landscape may eventually diverge into two distinct but cooperative branches, one focused on research and one on practice. When that happens, we may need another name, like data engineering, to describe the practical side of the field. For now, the debate over defining data science says more about the nascent, evolving nature of the field than about the actual differences between branches.
July 22, 2013
London was engulfed by five days of riots in August 2011, the worst civil unrest the U.K. had seen in 20 years. The looting, arson, and violence during the riots resulted in five deaths, many injuries, and a property damage bill of up to £250 million ($380 million). A new video from mathematician Hannah Fry explains the patterns in rioters’ behavior and how police can use them to quell future unrest.
“We found three very simple patterns. These patterns are incredibly important since we can use them to predict how a riot will spread, help the police to design better policing strategies and ultimately stop them from spreading.”
The model of the riots created by Fry and her team, which is further described in a Nature paper, used crime data provided by London’s Metropolitan Police covering the period August 6-11, 2011 for offenses associated with the riots, a dataset of 3,914 records. This was combined with geographical and retail data and a set of mathematical equations capturing the patterns the researchers used to try to model the behavior of rioters.
Some newspapers at the time dubbed the riots “shopping with violence.” It seems they weren’t far from wrong.
The first pattern is comparing rioters to everyday shoppers. Over 30 percent of rioters travelled less than a kilometer from where they lived to where they offended but they were prepared to go a bit further if there was a riot site which was really big or they had very little chance of getting caught or there was a lot of lootable goods. This picture is exactly what you see when you look at a similar picture of retail spending flows. Most people shop locally to where they live but they are prepared to travel a bit further to a really attractive retail site. We know an awful lot about how people shop since this information is invaluable to retailers, being able to predict where people will spend their money.
Riots broke out in many parts of the city, but while some areas were heavily hit, others remained completely unscathed. Fry’s team hypothesized that this was partly due to the shopping behavior described above, but also the interaction of police and rioters and how the idea of rioting spread throughout the city. The researchers guessed that rioters would be attracted to sites which not only offered good looting opportunities but also fewer police or more rioters, making them less likely to get caught.
If police were heavily outnumbered at a particular site, the Metropolitan Police has stated that “decisions were made not to arrest due to the prioritization of competing demands…specifically, the need to protect emergency services, prevent the spread of further disorder and hold ground until the arrival of more police resources.” So once a riot site spiraled out of control, even if police were on the ground, rioters were unlikely to be caught when they were present in large numbers. The Nature paper concluded that the speed and numbers in which police arrived at a particular riot site were crucial in quelling violence. The team’s simulations also showed that around 10,000 police would have been needed to suppress disorder. Only 5,000 were deployed during the first days of the riots.
Combining the shopping and “predator-prey” analogy of police and rioters, the team’s model predicted pretty accurately which areas would be hardest hit by the riots. Map a) shows actual riot behavior while map b) shows the result of a simulation using the model. 26 of the 33 London boroughs in the simulation showed rioter percentages in the same or adjacent bands as the crime data.
Fry describes the final pattern in the team’s model—how the idea to riot spread through the city:
Imagine you have one young guy who walks past a Foot Locker getting raided and he runs in and gets himself some new trainers. He then texts a couple of his friends to come down to join him. They then text a couple more of their friends, who text more of their friends. Suddenly, one spur of the moment decision by one person has grown into a huge outburst of criminal behavior. Before we talked about places which are more susceptible to rioting. Now we are talking about people who are more susceptible to the idea of rioting. The clearest link here is deprivation. The people who were involved in the riots came from some of the most deprived areas of the city, the places with the worst schools, the highest unemployment rates and the lowest incomes. The boroughs who were the worst hit by the riots were also the boroughs which had the biggest cuts in the recent government funding, and in particular disproportionate cuts in youth services. The data points to the fact that a spark was lit in a vulnerable community, and this spark ignited to engulf the entire city and eventually the country.
July 16, 2013
On Sunday, U.K. newspaper The Sunday Times revealed that J.K. Rowling was the true author of crime thriller The Cuckoo's Calling, which she published several months ago under the name Robert Galbraith. The paper was first alerted by an anonymous Twitter tip-off, and Time reports that it called in Pittsburgh-based professor of computer science Patrick Juola to help determine whether the text had indeed been written by Rowling. Juola specializes in forensic linguistics, also known as “stylometry,” which can help attribute a text to an author.
Juola has been researching the subject—now called forensic linguistics, with a focus on authorship attribution—for about a decade. He uses a computer program to analyze and compare word usage in different texts, with the goal of determining whether they were written by the same person. The science is more frequently applied in legal cases, such as with wills of questionable origin, but it works with literature too.
Juola is one of the developers of the catchily titled Java Graphical Authorship Attribution Program (JGAAP), which he used to extract the 100 most commonly used words in Rowling’s text, not including character names.
What the author won’t think to change are the short words, the articles and prepositions. “Prepositions and articles and similar little function words are actually very individual,” Juola says. “It’s actually very, very hard to change them because they’re so subconscious.”
Author attribution is not a precise science. In a 2006 paper Juola and co-author John Sofko described statistical and computational methods for authorship attribution as “neither reliable, well-regarded, widely-used, or well-understood.” JGAAP was the authors’ response to the “unorganized mess” of author attribution and is intended for use by non-specialists.
JGAAP implements several steps: canonicalization, identification of events, culling, and then analysis using a Machine Learning algorithm. Canonicalization converts data that has more than one possible representation into a standard form. In the case of text, this will mean doing things like converting all capital letters into lower case and removing punctuation. An event in a text may be the occurrence of a word, character, or part of speech. Culling reduces the number of events to, say, the 100 most common words, and uses this as a representation of the source text.
Finally, a Machine Learning classification algorithm such as a Support Vector Machine or K-Nearest Neighbors uses this representation to compare the unknown text with texts by known authors in a training set and predicts which author was most likely to have written it. Juola reveals in the Time interview that Rowling’s The Casual Vacancy, Ruth Rendell’s The St. Zita Society, P.D. James’s The Private Patient, and Val McDermid’s The Wire in the Blood were the other texts used in training, a pretty small sample. If an author not on this list had written The Cuckoo's Calling, JGAAP could not have identified that author, only the known author closest in style. Juola determined that Rowling was the most likely of these four authors to have written the book, and Rowling later admitted that this was the case.
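The four steps can be illustrated with a toy Python pipeline. This is a sketch of the general idea, not JGAAP's code: the function names are hypothetical, the vocabulary is culled from all known texts together (an assumption), and a 1-nearest-neighbor comparison of word frequencies stands in for the classifier.

```python
import re
from collections import Counter

def canonicalize(text):
    """Lower-case the text and replace punctuation and digits with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def events(text):
    """Treat every word occurrence as an event."""
    return canonicalize(text).split()

def cull(all_events, n=100):
    """Keep only the n most common events as the working vocabulary."""
    return [word for word, _ in Counter(all_events).most_common(n)]

def profile(text, vocab):
    """Relative frequency of each vocabulary word in the text."""
    words = events(text)
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in vocab]

def attribute(unknown, known, n=100):
    """Return the known author whose word profile is closest to the unknown text."""
    vocab = cull([w for text in known.values() for w in events(text)], n)
    target = profile(unknown, vocab)

    def distance(author):
        p = profile(known[author], vocab)
        return sum(abs(a - b) for a, b in zip(p, target))

    return min(known, key=distance)

# Tiny made-up samples; real use would need book-length texts.
known = {
    "A": "the cat and the dog and the bird sat on the mat",
    "B": "a cat or a dog or a bird sat by a mat",
}
print(attribute("the fox and the hen and the owl sat on the log", known))
```

As in the article, the classifier can only ever return one of the known authors, so the result is "closest in style" rather than proof of authorship.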
July 11, 2013
Last year, Google developer Ilya Grigorik and GitHub marketeer Brian Doll gave a talk at O’Reilly Strata on what makes developers happy and angry, programming language associations, and GitHub activity by country and language. All were results from the first GitHub Data Challenge and the activities of researchers using GitHub’s public timeline data. GitHub has now announced the results of the second challenge.
The data was made available via an API and as a Google BigQuery dataset. BigQuery is a Web service that lets you do interactive analysis of massive datasets in an SQL-like query language. Grigorik’s favorite winning entry is the Open Source Report Card, which uses clustering and a simple expert system to generate a natural language description of your hacker personality and a weekly work tempo, and to identify other GitHub users who are similar to you. You can see an extract from Grigorik's report card below and generate your own.
The Open Source Report Card was developed by astrophysicist Dan Foreman-Mackey. He calculated statistics summarizing the weekly activity of a GitHub user and then clustered them to find groups like the “Tuesday tinkerer” and “Fulltime hacker.”
I extracted the set of weekly schedule vectors for 10,000 "moderately active" GitHub users and ran K-means (with K=12) on this sample set. K-means is an algorithm for the unsupervised clustering of samples in a vector space.
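For readers unfamiliar with K-means, here is a minimal pure-Python version run on a handful of invented weekly schedule vectors. The data, the choice of K=2, and the function names are illustrative assumptions, not the Report Card's code.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Alternate between assigning points to the nearest center and re-averaging."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Each vector: share of a user's activity on Mon..Sun (made-up numbers).
schedules = [
    [0.20, 0.20, 0.20, 0.20, 0.20, 0.00, 0.00],
    [0.25, 0.20, 0.20, 0.20, 0.15, 0.00, 0.00],
    [0.15, 0.20, 0.20, 0.20, 0.25, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.50],
    [0.05, 0.00, 0.00, 0.00, 0.00, 0.45, 0.50],
    [0.00, 0.00, 0.00, 0.00, 0.05, 0.50, 0.45],
]
centers, labels = kmeans(schedules, k=2)
print(labels)
```

K-means depends on its random starting centers and can settle on a poor grouping, which is one reason production code usually restarts it several times and keeps the best result.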
Foreman-Mackey then summarized the behavior of each user in a 61-dimensional vector, which includes features like the number of contributions, active repositories and languages used, and he ran an approximate nearest neighbor algorithm to identify other users who are similar to you based on your behavior.
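The similarity lookup can be sketched as follows. Note the substitution: this is an exact brute-force search, not the approximate nearest neighbor algorithm mentioned above, and the user names and two-dimensional vectors are invented for illustration.

```python
import math

def nearest_users(target, users, n=3):
    """Return the n users whose behavior vectors are closest to the target's."""
    tv = users[target]
    others = [
        (math.dist(tv, vec), name) for name, vec in users.items() if name != target
    ]
    return [name for _, name in sorted(others)[:n]]

# Hypothetical users with 2-dimensional behavior vectors (the real ones had 61).
users = {
    "alice": [0.0, 0.0],
    "bob": [1.0, 0.0],
    "carol": [5.0, 5.0],
    "dave": [0.5, 0.0],
}
print(nearest_users("alice", users, n=2))
```

Brute force compares the target against every other user, which is fine for a small sample; approximate methods trade a little accuracy for speed when the user base is large.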
The final step was generating a natural language description of a particular hacker. “I made up a bunch of rules (implemented as a spaghetti of nested if statements) that concatenate strings until something like English prose falls out the other end,” he says.
The data challenge is just one aspect of the work being done with GitHub’s timeline data. Brian Doll explains:
A dozen or so academic research papers have been written in the last year that use the GitHub timeline data as their primary data source. Some of the research papers tried to better understand what makes software projects popular. They analyzed activity, time frames, and language across several projects to see if they could determine factors that make projects more likely to be widely adopted.
GitHub is also looking at packaging the data in alternative ways to a stream of activity ordered by time.
What many researchers want instead, is a package of specific projects, with all of its public history, along with the actual software repository data itself, to be bundled up together. I'm planning on releasing large data bundles like this to the public later this summer.
July 3, 2013
For a profession whose entire “raison d’être” is quantitative analysis, the role of the data scientist has been surprisingly hard to pin down. Now a new e-book from O’Reilly, Analyzing the Analyzers, has surveyed 250 Data Scientists on how they see themselves and their work. The authors then applied the tools of their trade, in this case a Non-negative Matrix Factorization algorithm, to cluster the data, revealing four archetypes of the data scientist. The survey also found that most Data Scientists, no matter which group they fell into, “rarely work with terabyte or larger data.”
We think that terms like “data scientist,” “analytics,” and “big data” are the result of what one might call a “buzzword meat grinder.” We’ll use the survey results to identify a new, more precise vocabulary for talking about their work, based on how data scientists describe themselves and their skills.
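To give a feel for how Non-negative Matrix Factorization can cluster survey answers, here is a small pure-Python sketch using the standard multiplicative update rules. The respondent-by-skill matrix is invented, and nothing here reproduces the report's actual analysis.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared reconstruction error between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Rows: respondents. Columns: self-rated skills (made-up values).
V = [
    [5.0, 4.0, 0.0, 0.0],
    [5.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
    [0.0, 0.0, 4.0, 5.0],
]
W, H = nmf(V, k=2)
# One simple way to read clusters off W: each respondent joins their strongest component.
groups = [max(range(2), key=lambda c: row[c]) for row in W]
print(groups)
```

Each row of H can be read as a skill profile and each row of W as how strongly a respondent matches each profile, which is what makes the factors interpretable as archetypes.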
So who are these new data scientists? A Data Businessperson is focused on how data insights can affect a company’s bottom line or “translates P-values into profits”. This group seems very similar to the old-school Data Analyst, whose skills have sometimes been unjustly discounted in the pursuit of the more fashionable Data Scientist. In fact, only about a quarter of this group described themselves as Data Scientists. Nearly a quarter are female, a much higher proportion than the other types, and they are most likely to have an MBA, have managed other employees or started a business.
The Data Creatives have the broadest range of skills in the survey. They can code and have contributed to open source projects, three quarters have academic experience and creatives are even more likely than Data Businesspeople to have done contract work (80%) or have started a business (40%). Creatives closely identify with the self-description “artist”. Psychologists, economists and political scientists popped up surprisingly often among Data Researchers and Data Creatives.
Data Developers build data infrastructure or code up Machine Learning algorithms. This group is the most likely to code on a daily basis and have Hadoop, SVM or Scala on their CV. About half have Computer Science or Computer Engineering degrees.
Data Researchers seem closest to “scientists” in the sense that their work is more open ended and most are lapsed academics. Nearly 75% of Data Researchers had published in peer-reviewed journals, and over half had a PhD. Statistics is their top skill but they were least likely to have started a business, and only half had managed an employee.
Although we hate to disappoint the majority of the tech press, who seem to conflate the terms “Big Data” and “Data Science,” most of the Data Scientists surveyed don’t actually work with Big Data at all. The figure below shows how often respondents worked with data of kilobyte, megabyte, gigabyte, terabyte, and petabyte scale. Data Developers were most likely to work with petabytes of data, but even among developers this was rare.
June 25, 2013
How far will you go to have a baby? That's the question facing the one in six couples suffering from infertility in the United States, fewer than 3% of whom undergo IVF. A single round of IVF can cost up to $15,000, and the success rate for women over 40 is often less than 12% per round, making the process both financially and physically taxing. According to the CEO of Univfy, Mylene Yao, the couples with the highest likelihood of success are often not the ones who receive treatment. “A lot of women are not aware of what IVF can do for them and are getting to IVF rather late,” says Yao. “On the other hand, a lot of women may be doing treatments which have lower chances of success.”
Yao is an Ob/Gyn and researcher in reproductive medicine who teamed up with a colleague at Stanford, professor of statistics Wing H. Wong, to create a model that could predict the probability that a live birth will result from a single round of IVF. That model is now used in an online personalized predictive test that uses clinical data readily available to patients.
The main factor currently used to predict IVF success is the age of the woman undergoing treatment. “Every country has a national registry that lists the IVF success rate by age group,” Yao explains. “What we have shown over and over in our research papers is that method vastly underestimates the possibility of success. It's a population average. It's not specific to any individual woman and is useful only at a high-level country policy level.” Many European countries, whose health services fund IVF for certain couples, use such age charts to determine the maximum age of the women who can receive treatment. Yao argues that, instead, European governments should fund couples with the highest likelihood of success.
“People are mixing up two ideas. Everyone knows that aging will compromise your ability to conceive, but the ovaries of each woman age at a different pace. Unless they take into consideration factors like BMI, partner's sperm count, blood tests, reproductive history, that age is not useful. In our prediction model, for patients who have never had IVF, age accounts for 60% of the prediction; 40% of the prediction comes from other sources of information. A 33-year-old woman can be erroneously led to believe that she has great chances, whereas her IVF prospects may be very limited and if she waits further, it could compromise her chance to have a family. A 40-year-old might not even see a doctor because she thinks there is no chance.” In a 2013 research paper Univfy showed that 86% of cases analyzed did not have the results predicted by age alone. Nearly 60% had a higher probability of live birth based on an analysis of the patients’ complete reproductive profiles.
Univfy's predictive model was built from data on 13,076 first IVF treatment cycles from three different clinics and used input parameters such as the woman's body mass index, smoking status, previous miscarriages or live births, clinical diagnoses including polycystic ovarian syndrome or disease, and her male partner's age and sperm count. “If a patient says 'I have one baby,' that's worth as much as what the blood tests show,” says Yao.
Prediction of the probability of a live birth based on these parameters is a regression problem, where a continuous value is predicted from the values of the input parameters. A machine-learning algorithm called stochastic gradient boosting was used to build a boosted tree model predicting the probability of a live birth. A boosting algorithm builds a series of models, in this case up to 70,000 decision trees, which are essentially a series of if-then statements based on the values of the input parameters. Each new tree is created from a random sample of the full data set and uses the prediction errors from the last tree to improve its own predictions. The resulting model determines the relative importance of each input parameter. It turned out that while the age of the patient was still the most significant factor at more than 60% weighting, other single parameters like sperm count (9.6%) and body mass index (9.5%) were also significant.
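The boosting idea can be shown with a deliberately simplified Python sketch. The assumptions are significant: each "tree" is a one-split stump, the loss is squared error on a 0/1 outcome, the patient data is invented, and `split_share` is only a crude proxy for feature importance. None of this is Univfy's model.

```python
import random

def fit_stump(rows, residuals):
    """Find the single if-then split that best fits the residuals."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [e for r, e in zip(rows, residuals) if r[f] <= t]
            right = [e for r, e in zip(rows, residuals) if r[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((e - lm) ** 2 for e in left) + sum((e - rm) ** 2 for e in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:] if best else None

def stump_value(stump, x):
    f, t, lm, rm = stump
    return lm if x[f] <= t else rm

def boost(X, y, n_trees=100, rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting: each stump fits the current errors on a random sample."""
    rng = random.Random(seed)
    base = sum(y) / len(y)
    preds = [base] * len(X)
    m = min(len(X), max(2, int(subsample * len(X))))
    trees = []
    for _ in range(n_trees):
        idx = rng.sample(range(len(X)), m)
        stump = fit_stump([X[i] for i in idx], [y[i] - preds[i] for i in idx])
        if stump is None:
            continue
        trees.append(stump)
        preds = [p + rate * stump_value(stump, x) for p, x in zip(preds, X)]
    return base, rate, trees

def predict(model, x):
    base, rate, trees = model
    return base + rate * sum(stump_value(s, x) for s in trees)

def split_share(model, n_features):
    """Share of splits made on each feature (a rough importance proxy)."""
    counts = [0] * n_features
    for f, _, _, _ in model[2]:
        counts[f] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Invented data: [age, BMI]; outcome is 1.0 for "live birth", 0.0 otherwise.
X = [[25 + i, 20 + (i * 7) % 10] for i in range(20)]
y = [1.0 if age < 35 else 0.0 for age, _ in X]
model = boost(X, y)
print(predict(model, [28, 22]), predict(model, [42, 22]))
print(split_share(model, 2))
```

A production model for probabilities would normally use a loss suited to binary outcomes and deeper trees; squared error is used here only to keep the residual step easy to follow.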
Another Univfy model used by IVF clinics predicts the probability of multiple births. Some 30% of women receiving IVF in 2008 in the U.S. gave birth to twins since clinics often use multiple embryos to increase the chances of success. “Multiple births are associated with complications for the newborn and the mother,” says Yao. “So for health reasons, clinics and governments want to have as few multiple births as possible. It's a difficult decision whether to put in one or two embryos.” Univfy's results showed that even when only two embryos were transferred, patients' risks of twins ranged from 12% to 55%. “The clinic can make a protocol that when the probability of multiple births is above a certain rate, then they will have only one embryo, and also identify patients who should really have two embryos. Currently there's a lot of guesswork.”
When the G8 meet in Northern Ireland next week, transparency will be on the agenda. But how do these governments themselves rate?
Open data campaigners the Open Knowledge Foundation just published a preview of an Open Data Census which assessed how open the critical datasets in the G8 countries really are. Open data doesn’t just mean making datasets available to the public but also distributing them in a format which is easy to process, available in bulk, and regularly updated. When the regional government of Piemonte, Italy was hit by an expenses scandal last year, the government published the expense claim data from the previous year in a set of spreadsheets embedded in a PDF, a typical example of less than accessible “open data.”
More than 30 volunteer contributors from around the world (the foundation says they include lawyers, researchers, policy experts, journalists, data wranglers, and civic hackers) assessed the openness of data sets in 10 core categories: Election Results, Company Register, National Map, Government Budgets, Government Spending, Legislation, National Statistical Office Data, Postcode/ZIP database, Public Transport Timetables, and Environmental Data on major sources of pollutants.
The U.S. topped the list for openness according to the overall score summed across all 10 categories of data, indicating that the executive order “making open and machine readable the new default for government information” announced by President Barack Obama in May this year has had some effect. The U.K. was next, followed by France, Japan, Canada, Germany, and Italy. The Russian Federation limped in last, failing to publish any of the information considered by the census as open data.
Postcode data, which is required for almost all location-based applications and services, is not easily accessible in any G8 country except Germany. “In the U.K., there's quite a big fight over postcodes since Royal Mail sells the postcodes and makes millions of pounds a year,” said Open Knowledge Foundation founder Rufus Pollock. Data on government spending was also patchy in France, Germany, and Italy. Many G8 countries scored low on company registry data, a notable point when the G8’s transparency discussions will address tax evasion and offshore companies. “Government processes aren't always built for the digital age,” said Pollock. “I heard an incredible story about 501(c)3 registration information in the U.S. where they get all this machine-readable data in and the first thing they do is convert it to PDF which then humans transcribe.”
The data was assessed in line with open data standards such as OpenDefinition.org and the 8 Principles of Open Government Data. Each category of data was given marks out of six depending on how many of the following criteria were met: openly licensed, publicly available, in digital format, machine readable (in a structured file format), up to date, and available in bulk. The assessment wasn’t entirely quantitative. “We strive for reasonably 'yes or no' questions but there are subtleties,” says Pollock. “With transport and timetables there's rail, bus, and other public transport. What happens if your bus and tram timetables are available but not train? Or is a certain format acceptable as machine readable?”
The preview does not show how many data sets were assessed in each category, but more information will be included in the full results, covering 50 countries, which will be released later this year. For further information on the methodology of the census see the Open Knowledge Foundation’s blog post.
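The scoring scheme is simple enough to write down directly. In this sketch the criterion keys, the example country, and its answers are all made up; only the "one mark per criterion met, summed across categories" rule comes from the description above.

```python
CRITERIA = (
    "openly_licensed",
    "publicly_available",
    "digital_format",
    "machine_readable",
    "up_to_date",
    "available_in_bulk",
)

def category_score(answers):
    """One mark for each of the six criteria a dataset category meets."""
    return sum(1 for c in CRITERIA if answers.get(c, False))

def country_score(categories):
    """Overall score: marks summed across all assessed categories."""
    return sum(category_score(answers) for answers in categories.values())

# Hypothetical assessment of two categories for an imaginary country.
example = {
    "Election Results": {c: True for c in CRITERIA},
    "Postcode/ZIP database": {"publicly_available": True, "digital_format": True},
}
print(country_score(example))
```

With 10 categories and six marks each, the maximum possible score under this rule is 60.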
Olly Downs runs the Data Science team at Globys, a company which takes large-scale dynamic data from mobile operators and uses it to contextualize their marketing in order to improve customer experience, increase revenue, and maximize retention. Downs is no data novice: He was the Principal Scientist at traffic prediction startup INRIX, which is planning an IPO this year, and Globys is his seventh early-stage company. Co.Labs talked to him about how to maximize the ROI of a data science team.
How does Globys use its Data Science team?
The Telco space has always been Big Data. Any single one of our mobile operator customers produces data at a rate greater than Facebook. Globys is unique in terms of the scale with which we have embraced the data science role and its impact on the structure of the company and the core proposition of the business. Often data scientists in an organization are pseudo-consultants answering one-off questions. Our data science team is devoted to developing the science behind the technology of our core product.
You trained in hard sciences (Physics at Cambridge and Applied Mathematics at Princeton). Is Data Science really a science?
How we work at Globys is that we develop technologies via a series of experiments. Those experiments are designed to be extremely robust, as they would be in the hard sciences world, but based on data which is only available commercially, and they are designed to answer a business question rather than a fundamental research question. The methodology we use to determine if a technology is useful has the same core elements of a scientific process.
What has come along with the data science field is this cloudburst of new technologies. The science has become mixed in with mastering the technology which allows you to do the science. It's not that surprising. The web was invented by Tim Berners Lee at CERN to exchange scientific data in particle physics. Out in the applied world, the work tends to be a mixture of answering questions and finding the right questions to ask. It's very easy for a data science team to slip into being a pseudo-engineering team on an engineering schedule. It's very important to have a proportion of your time allocated to exploratory work so you have the ability to go down some rabbit holes.
How can a company integrate a data scientist into their business?
With Big Data, the awareness in the enterprise is high and the motivation to do Big Data initiatives is high, but the cultural ability to absorb the business value, and the strategic shift that might bring, is hard to achieve. My experience is if the data scientist is not viewed as a peer-level stakeholder who can have an impact on the leadership and the culture of the business, then most likely they won't be successful.
I remember working on a project on a “Save” program where anyone who called to cancel their service got two months free. The director who initiated that program had gotten a significant promotion based on its success. It wasn't a data-driven initiative. The anecdotes were much more positive than the data. What I found, after some data analysis, was that the program was retaining customers for an average of 1.2 months after the customer had been given the two months free. Every saved call the customer was taking, which included the cost of the agent talking to the customer, was actually losing them $13 per call. We came with a model-based solution which allowed the business to test who they should allow to cancel and who they should talk to. That changed the ROI on the program to plus-$22 per call taken. That stakeholder then made it from general manager to VP, and ultimately was very happy, but it took a while to make the shift to seeing that data was ultimately going to improve the outcome.
Can't you fool yourself with data as well as with anecdotes?
As a data scientist, it's hard to come to a piece of analysis without a hypothesis, a Bayesian prior (probability), a model in your mind of how you think things probably are. That makes it difficult to interrogate the data in the purest way, and even if you do, you are manipulating a subset of attributes or individuals who represent the problem that you have. Being aware of the limitations of the analysis is important. A real problem with communicating the work you have done is that while scientists are very good at explaining the caveats, the people listening are not interested in caveats but in conclusions. I remember doing an all-day experiment when I was at Cambridge to measure the Gravitational constant to four decimal places of precision. I measured to four levels of precision but the result was incorrect in the constant by more than a thousand times. You can fool yourself into thinking you have measured something with a very high level of accuracy and yet the thing you were measuring turned out to be the wrong thing.
How do you measure the ROI (Return on Investment) of a data science team?
The measure of success is getting initiatives to completion, addressing a finding about the business in a measurable way. At Globys our business is around getting paid for the extra retention or revenue that we achieve for our customers. Recently, we have been leveraging the idea of every customer as a sequence of events (every purchase, every call, every SMS message, every data session, every top-up purchase), which allows us to take Machine Learning approaches (Dynamic State-Space Modeling) which otherwise do not apply to this problem domain of retaining customers. This approach outperforms the current state of the art in churn prediction modeling by about 200%. When you run an experiment to retain customers, the proportion of customers you are messaging to is biased in favor of those with a problem. We already have an optimized price elasticity and offer targeting capability so you improve your campaign by a similar factor. The normal improvement you would achieve in churn retention is in the 5% range. We are achieving improvements in the 15-20% range.
What's the biggest gap you see in data science teams?
Since our customers have been Big Data businesses for a long time, they will typically have analysts, and many of those teams are unsuccessful because the communications skill set is missing. They may have the development capability, the statistical and modeling capability, but have been very weak at communicating with the other elements of the business. What we are seeing now is some roles being hired which bridge between the data science capability and the business functions like marketing and finance. It's a product manager for the analytics team.
This story tracks the cult of Big Data: The hype and the reality. It’s everything you ever wanted to know about data science but were afraid to ask. Read on to learn why we’re covering this story, or skip ahead to read previous updates.
Take lots of data and analyze it: That’s what data scientists do and it’s yielding all sorts of conclusions that weren’t previously attainable. We can discover how our cities are run, disasters are tackled, workers are hired, crimes are committed, or even how Cupid's arrows find their mark. Conclusions derived from data are affecting our lives and are likely to shape much of the future.
Meteorologist Steven Bennett used to predict the weather for hedge funds. Now his startup EarthRisk forecasts extreme cold and warmth events up to four weeks ahead, much further in the future than traditional forecasts, for energy companies and utilities. The company has compiled 30 years of temperature data for nine regions and discovered patterns which predict extreme heat or cold. If the temperature falls in the hottest or coldest 25% of the historical temperature distribution, EarthRisk defines this as an extreme event and the company's energy customers can trade power or set their pricing based on its predictions. The company's next challenge is researching how to extend hurricane forecasts from the current 48 hours to up to 10 days.
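As a rough illustration of that definition, the sketch below flags a temperature as extreme when it falls in the bottom or top quarter of a historical record. It is not EarthRisk's code: the function names and the toy history are hypothetical, and it assumes "hottest or coldest 25%" means the upper and lower quartiles of the distribution.

```python
from statistics import quantiles


def extreme_thresholds(history):
    """Return (cold_threshold, hot_threshold): the 25th and 75th percentiles."""
    q1, _median, q3 = quantiles(history, n=4)
    return q1, q3


def classify(temperature, history):
    """Label a temperature against the historical distribution."""
    cold, hot = extreme_thresholds(history)
    if temperature <= cold:
        return "extreme cold"
    if temperature >= hot:
        return "extreme heat"
    return "normal"


# Toy stand-in for 30 years of readings: the values 1 through 99.
history = list(range(1, 100))
thresholds = extreme_thresholds(history)
labels = [classify(t, history) for t in (10, 50, 90)]
```

In practice the thresholds would presumably be computed per region and per time of year, since a temperature that is extreme in January is unremarkable in July.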
How is your approach to weather forecasting different from traditional techniques?
Meteorology has traditionally been pursued along two lines. One line has a modeling focus and has been pursued by large government or quasi-government agencies. It puts the Earth into a computer-based simulation and that simulation predicts the weather. That pursuit has been ongoing since the 1950s. It requires supercomputers, it requires a lot of resources (the National Oceanic and Atmospheric Administration in the U.S. has spent billions of dollars on its simulation) and a tremendous amount of data to input to the model. The second line of forecasting is the observational approach. Farmers were predicting the weather long before there were professional meteorologists and the way they did it was through observation. They would observe that if the wind blew from a particular direction, it's fair weather for several days. We take the observational approach, the database which was in the farmer's head, but we quantify all the observations strictly in a statistical computer model rather than a dynamic model of the type the government uses. We quantify, we catalog, and we build statistical models around these observations. We have created a catalog of thousands of weather patterns which have been observed since the 1940s and how those patterns tend to link to extreme weather events one to four weeks after the pattern is observed.
Which approach is more accurate?
The model-based approach will result in a more accurate forecast but because of chaos in the system it breaks down 1-2 weeks into the forecast. For a computer simulation to be perfect we would need to observe every air parcel on the Earth to use as input to the model. In fact, there are huge swathes of the planet, e.g., over the Pacific Ocean, where we don't have any weather observations at all except from satellites. So in the long range our forecasts are more accurate, but not in the short-range.
What data analysis techniques do you use?
We are using Machine Learning to link weather patterns together, to say when these kinds of weather patterns occur historically they lead to these sorts of events. Our operational system uses a Genetic Algorithm for combining the patterns in a simple way and determining which patterns are the most important. We use Naive Bayes to make the forecast. We forecast, for example, that there is a 60% chance that there will be an extreme cold event in the northwestern United States three weeks from today. If the temperature is a quarter of a degree less than that cold event threshold, then it's not a hit. We are in the process of researching a neural network, which we believe will give us a richer set of outputs. With the neural network we believe that instead of giving the percentage chance of crossing some threshold, we will be able to develop a full distribution of temperature output, e.g., that it will be 1 degree colder than normal.
How do you run these simulations?
We update the data every day. We have a MATLAB-based modeling infrastructure. When we do our heavy processing, we will use hundreds of cores in the Amazon cloud. We do those big runs a couple of dozen times a year.
How do you measure forecast accuracy?
Since we forecast for extreme events, we use a few different metrics. If we forecast an extreme event and it occurs, then that's a hit. If we forecast an extreme event and it does not occur, that's a false alarm. Those can be misleading. If I have made one forecast and that forecast was correct, then I have a hit rate of 100% and a false alarm rate of 0%. But if there were 100 events and I only forecasted one of them and missed the other 99, that's not useful. The detection rate is the share of events that occur which we forecast. We try to get a high hit rate and detection rate but in a long-range forecast detection rate is very, very difficult. Our detection rate tends to be around 30% in a three-week forecast. Our hit rate stays roughly the same at one week, two weeks, and three weeks. In traditional weather forecasting the accuracy gets dramatically lower the further out you get.
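Bennett's three metrics can be computed from two aligned lists of yes/no values, one for what was forecast and one for what happened. This is a minimal sketch using the terms as he defines them in the interview (formal verification literature names these quantities differently); the function name and data layout are our own.

```python
def forecast_metrics(forecast, observed):
    """Compute hit, false alarm, and detection rates for event forecasts.

    forecast and observed are equal-length sequences of truthy/falsy values,
    one entry per forecast period.
    """
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must cover the same periods")
    pairs = [(bool(f), bool(o)) for f, o in zip(forecast, observed)]
    hits = sum(1 for f, o in pairs if f and o)
    false_alarms = sum(1 for f, o in pairs if f and not o)
    misses = sum(1 for f, o in pairs if o and not f)
    issued = hits + false_alarms   # forecasts that said "event"
    events = hits + misses         # events that actually happened
    return {
        "hit_rate": hits / issued if issued else None,
        "false_alarm_rate": false_alarms / issued if issued else None,
        "detection_rate": hits / events if events else None,
    }


# The misleading case from the interview: 100 events, only one of them forecast.
forecast = [True] + [False] * 99
observed = [True] * 100
metrics = forecast_metrics(forecast, observed)
```

Here the hit rate is 100% and the false alarm rate 0%, yet the detection rate is only 1%, which is why no single number is reported on its own.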
Why do you want to forecast hurricanes further ahead?
The primary application for longer lead forecasts for hurricane landfall would be in the business community rather than for public safety. For public safety you need to make sure that you give people enough time to evacuate but also have the most accurate forecast. That lead time is typically two to three days right now. If people evacuate and the storm does not do damage in that area, or never hits that area, people won't listen the next time the forecast is issued. Businesses understand probability so you can present a risk assessment to a corporation which has a large footprint in a particular geography. They may have to change their operations significantly in advance of a hurricane so if it's even 30% or 40% probability then they need extra lead time.
What data can you look at to provide an advance forecast?
We are investigating whether building a catalog of synoptic (large scale) weather patterns like the North Atlantic Oscillation will work for predicting hurricanes, especially hurricane tracks, that is, where a hurricane will move. We have quantified hundreds of weather patterns which are of the same amplitude, hundreds of miles across. For heat and cold risks we develop an index of extreme temperature. For hurricanes the primary input is an index of historic hurricane activity rather than temperature. Then you would use Machine Learning to link the weather patterns to the hurricane activity. All of this is a hypothesis right now. It's not tested yet.
What’s the next problem you want to tackle?
We worked with a consortium of energy companies to develop this product. It was specifically developed for their use. Right now the problems we are trying to solve are weather related but that's not where we see ourselves in two or five years. The weather data we have is only an input to a much bigger business problem and that problem will vary from industry to industry. What we are really interested in is helping our customers solve their business problems. In New York City there's a juice bar called Jamba Juice. Jamba Juice knows that if the temperature gets higher than 95 degrees in an afternoon in the summer they need extra staff since more people will buy smoothies. They have quantified the staff increase required (but they schedule their staff one week in advance and they only get a forecast one day in advance). They use a software package with weather as an input. We believe that many businesses are right on the cusp of implementing that kind of intelligence. That's where we expect our business to grow.
A roomful of confused-looking journalists is trying to visualize a Twitter network. Their teacher is School of Data “data wrangler” Michael Bauer, whose organization teaches journalists and non-profits basic data skills. At the recent International Journalism Festival, Bauer showed journalists how to analyze Twitter networks using OpenRefine, Gephi, and the Twitter API.
Bauer's route into teaching hacks how to hack data was a circuitous one. He studied medicine and did postdoctoral research on the cardiovascular system, where he discovered his flair for data. Disillusioned with health care, Bauer dropped out to become an activist and hacker and eventually found his way to the School of Data. I asked him about the potential and pitfalls of data analysis for everyone.
Why do you teach data analysis skills to “amateurs”?
We often talk about how the digitization of society allows us to increase participation, but actually it creates new kinds of elites who are able to participate. It opens up the existing elites so you don't have to be an expensive lobbyist or be born in the right family to be involved, but you have to be part of this digital elite which has access to these tools and knows how to use them effectively. It's the same thing with data. If you want to use data effectively to communicate stories or issues, you need to understand the tools. How can we help amateurs to use these tools? Because these are powerful tools.
If you teach basic data skills, is there a danger that people will use them naively?
There is a sort of professional elitism which raises the fear that people might misuse the information. You see this very often if you talk to national bureaus of statistics, for example, who say “We don't give out our data since it might be misused.” When the Open Data movement started in the U.K. there was a clause in the agreement to use government data which said that you were not allowed to do anything with it which might criticize the government. When we train people to work with data, we also have to train them how to draw the right conclusions, how to integrate the results. To turn data into information you have to put it into context. So we break it down to the simplest level. What does it mean when you talk about the mean? What does it mean if you talk about average income? Or does it even make sense to talk about the average in this context?
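Bauer's question about whether the average even makes sense is easy to demonstrate. The sketch below uses invented incomes (every figure is hypothetical) to show how one very high earner drags the mean far away from what a typical person in the group earns, while the median barely notices.

```python
from statistics import mean, median

# Hypothetical annual incomes for ten people; one is a millionaire.
incomes = [
    22_000, 25_000, 27_000, 30_000, 31_000,
    33_000, 35_000, 38_000, 40_000, 1_000_000,
]

average_income = mean(incomes)    # pulled upward by the single outlier
typical_income = median(incomes)  # the middle of the sorted list

# How many people actually earn at least the "average" income?
at_or_above_average = sum(1 for income in incomes if income >= average_income)
```

With these numbers the mean is 128,100 and the median 32,000, and only one person out of ten earns the "average" or more.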
Are there common pitfalls you teach people to avoid?
We frequently talk about correlation-causation. We have this problem in scientific disciplines as well. In Freakonomics, economist Steven D. Levitt talks about how crime rates go down when more police are employed, but what people didn't look at was that this all happened in times of economic growth. We see this in medical science too. There was this idea that because women have estrogen they are protected from heart attacks so you should give estrogen to women after menopause. This was all based on retrospective correlation studies. In the 1990s someone finally did a placebo controlled randomized trial and they discovered that hormone replacement therapy doesn't help at all. In fact it harms the people receiving it by increasing the risk of heart attacks.
How do you avoid this pitfall?
If you don't know and understand the assumptions that your experiment is making, you may end up with something completely wrong. If you leave certain factors out of your model and look at one specific thing, that's the only specific thing you can say something about. There was this wonderful example that came out about how wives of rich men have more orgasms. A University in China got hold of the data for their statistics class and they found that they didn't use the education of the women as a parameter. It turns out that women who are more educated have more orgasms. It had nothing to do with the men.
What are the limitations of using a single form of data?
That's one of the dangers of looking at Twitter data. This is the danger of saying that Twitter is democratizing because everyone has a voice, but not everyone has a voice. Only a small percentage of the population use this service and a way smaller proportion are talking a lot. A lot of them are just reading or retweeting. So we only see a tiny snapshot of what is going on. You don't get a representative population. You get a skew in your population. There was an interesting study on Twitter and politics in Austria which showed that a lot of people on there are professionals and they are there to engage. So it's not a political forum. It's a medium for politicians and people who are around politics to talk about what they are doing.
Any final advice?
Integrate multiple data sources, check your facts, and understand your assumptions.
Charts can help us understand the aggregate but they can also be deeply misleading. Here's how to stop lying with charts, without even knowing it. While it’s counterintuitive, charts can actually obscure our understanding of data, a trick Steve Jobs has exploited on stage at least once. Of course, you don’t have to be a cunning CEO to misuse charts; in fact, if you have ever used one at all, you probably did so incorrectly, according to visualization architect and interactive news developer Gregor Aisch. Aisch gave a series of workshops at the International Journalism Festival in Italy, which I attended last weekend, including one on basic data visualization guidelines.
“I would distinguish between misuse by accident and on purpose,” Aisch says. “Misuse on purpose is rare. In the famous 2008 Apple keynote from Steve Jobs, he showed the market share of different smartphone vendors in a 3D pie chart. The Apple slice of the smartphone market, which was one of the smallest, was in front so it appeared bigger.”
Aisch explained in his presentation that 3-D pie charts should be avoided at all costs since the perspective distorts the data. What is displayed in front is perceived as more important than what is shown in the background. That 19.5% of market share belonging to Apple takes up 31.5% of the entire area of the pie chart and the angles are also distorted. The data looks completely different when presented in a different order as shown below.
In fact, the humble pie chart turns out to be an unexpected minefield:
“Use pie charts with care, and only to show part of whole relationships. Two is the ideal number of slices, but never show more than five. Don’t use pie charts if you want to compare values. Use bar charts instead.”
For example, Aisch advises that you don’t use pie charts to compare sales from different years but do use them to show sales per product line in the current year. You should also ensure that you don't leave out data on part of the whole.
“Use line charts to show time series data. That’s simply the best way to show how a variable changes over time. Avoid stacked area charts, they are easily mis-interpreted.”
The “I hate stacked area charts” post cited in Aisch’s talk explains why:
“Orange started out dominating the market, but Blue expanded rapidly and took over. To the unwary, it looks like Green lost a bit of market share. Not nearly as much as Orange, of course, but the green swath certainly gets thinner as we move to the right end of the chart.”
In fact the underlying data shows that Green’s market share has been increasing, not decreasing. The chart plots the market share vertically, but human beings perceive the thickness of a stream at right angles to its general direction.
Google Maps also uses the Mercator projection, a method of projecting the sphere of the Earth onto a flat surface, which distorts the size of areas closer to the polar regions so, for example, Greenland looks as large as Africa.
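The size of that distortion can be estimated with one line of trigonometry. On a spherical Mercator projection, lengths at latitude φ are stretched by 1/cos(φ) in both directions, so small areas are inflated by 1/cos²(φ) relative to the equator. The sketch below is our own illustration of that formula; the function name is hypothetical, and 72°N is just a representative latitude for central Greenland.

```python
import math


def mercator_area_inflation(latitude_deg):
    """Factor by which Mercator inflates small areas at a given latitude,
    relative to the same area drawn at the equator (spherical Earth assumed)."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


equator = mercator_area_inflation(0)            # no inflation
oslo_like = mercator_area_inflation(60)         # about 4x
central_greenland = mercator_area_inflation(72) # roughly 10x
```

A tenfold inflation is why Greenland, at roughly 2.2 million square kilometers, can look comparable on the map to Africa, which is about 30 million square kilometers and straddles the equator, where the distortion is smallest.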
The solution to these problems, according to Aisch, is to build visualization best practices directly into the tool as he does in his own open source visualization tool Datawrapper. “In Datawrapper we set meaningful defaults but also allow you to switch between different rule systems. There's an example for labeling a line chart. There is some advice that Edward Tufte gave in one of his books and different advice from Donna Wong so you can switch between them. We also look at the data so if you visualize a data set which has many rows, then the line chart will display in a different way than if there were just 3 rows.”
The rush to "simplify" big data is the source of a lot of reductive thinking about its utility. Data science practitioners have recently been lamenting how the data gold rush is leading to naive practitioners deriving misleading or even downright dangerous conclusions from data.
The Register recently mentioned two trends that may reduce the role of the professional data scientist before the hype has even reached its peak. The first is the embedding of Big Data tech in applications. The other is increased training for existing employees who can benefit from data tools.
"Organizations already have people who know their own data better than mystical data scientists. Learning Hadoop is easier than learning the company’s business."
This trend has already taken hold in data visualization, where tools like infogr.am are making it easy for anyone to make a decent-looking infographic from a small data set. But this is exactly the type of thing that has some data scientists worried. Cathy O'Neil (aka MathBabe) has the following to say in a recent post:
"It’s tempting to bypass professional data scientists altogether and try to replace them with software. I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well."
K-nearest neighbors is a method for classifying objects, let's say visitors to your website, by measuring how similar they are to other visitors based on their attributes. A new visitor is assigned a class, e.g., "high spenders," based on the class of its k nearest neighbors, the previous visitors most similar to him. But while the algorithm is simple, selecting the correct settings and knowing that you need to scale feature values (or verifying that you don't have many redundant features) may be less obvious.
You would not necessarily think about this problem if you were just pressing a big button on a dashboard called “k-NN me!”
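To see why scaling matters, consider the sketch below: a bare-bones nearest-neighbor classifier run twice on the same invented visitors, once on raw values and once after rescaling each feature to the range 0 to 1. Everything here is hypothetical (the features, the labels, the numbers, and the choice of k = 1 and min-max scaling); it is an illustration of the pitfall, not a recommended pipeline.

```python
import math
from collections import Counter


def knn_predict(training, query, k=1):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


def min_max_scaler(points):
    """Build a function that rescales each feature to the 0..1 range."""
    columns = list(zip(*points))
    lows = [min(column) for column in columns]
    spans = [(max(column) - min(column)) or 1.0 for column in columns]

    def scale(point):
        return tuple((v - lo) / span for v, lo, span in zip(point, lows, spans))

    return scale


# Hypothetical visitors: (pages viewed, annual income in dollars) -> class.
training = [
    ((2, 50_000), "low spender"),
    ((3, 58_000), "low spender"),
    ((40, 52_000), "high spender"),
    ((38, 60_000), "high spender"),
]
query = (39, 57_500)  # browses like a high spender

# Raw features: income is in the tens of thousands, so it swamps page views.
raw_label = knn_predict(training, query)

# Scaled features: both attributes now contribute comparably to the distance.
scale = min_max_scaler([features for features, _label in training])
scaled_training = [(scale(features), label) for features, label in training]
scaled_label = knn_predict(scaled_training, scale(query))
```

On raw values the new visitor is matched to a low spender purely because their incomes differ by $500; after scaling, the 39 pages viewed count for something and the answer flips to high spender.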
Here are four problems that typically arise from a lack of scientific rigor in data projects. Anthony Chong, head of optimization at Adaptly, warns us to look out for "science" with no scientific integrity.
Through phony measurement and poor understandings of statistics, we risk creating an industry defined by dubious conclusions and myriad false alarms.... What distinguishes science from conjecture is the scientific method that accompanies it.
Given the extent to which conclusions derived from data will shape our future lives, this is an important issue. Chong gives us four problems that typically arise from a lack of scientific rigor in data projects, but are rarely acknowledged.
- Results not transferable
- Experiments not repeatable
- Not inferring causation: Chong insists that the only way to infer causation is randomized testing. It can't be done from observational data or by using machine learning tools, which predict correlations with no causal structure.
- Poor and statistically insignificant recommendations.
Even when properly rigorous, analysis often leads to nothing at all. From Jim Manzi's 2012 book, Uncontrolled: The Surprising Payoff of Trial-and-Error for Business:
"Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10 percent of these leading to business changes.”
Understanding data isn't about your academic abilities; it's about experience. Beau Cronin has some words of encouragement for engineers who specialize in storage and machine learning. Despite all the backend-as-a-service companies sprouting up, it seems there will always be a place for someone who truly understands the underlying architecture. Via his post at O'Reilly Radar:
I find the database analogy useful here: Developers with only a foggy notion of database implementation routinely benefit from the expertise of the programmers who do understand these systems—i.e., the “professionals.” How? Well, decades of experience—and lots of trial and error—have yielded good abstractions in this area.... For ML (machine learning) to have a similarly broad impact, I think the tools need to follow a similar path.
Want to climb the mountain? Start learning about data science here. If you know next to nothing about Big Data tools, HP's Dr. Satwant Kaur's 10 Big data technologies is a good place to start. It contains short descriptions of Big Data infrastructure basics from databases to machine learning tools.
This slide show explains one of the most common technologies in the Big Data world, MapReduce, using fruit, while Emcien CEO Radhika Subramanian tells you why not every problem is suitable for its most popular implementation, Hadoop.
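For readers who skip the slides, the MapReduce pattern fits in a few lines. The sketch below counts fruit in a single Python process; it illustrates the three stages only and uses none of Hadoop's actual API. The function names and the baskets are our own, and in a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(fruit, 1) for fruit in chunk]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def count_fruit(chunks):
    """Run map on every chunk independently, then shuffle and reduce."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Three baskets standing in for three pieces of a large data set.
baskets = [["apple", "banana", "apple"], ["banana", "cherry"], ["apple"]]
counts = count_fruit(baskets)
```

The design depends on being able to process each basket without looking at the others. Subramanian's objection is about exactly that splitting step: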
"Rather than break the data into pieces and store-n-query, organizations need the ability to detect patterns and gain insights from their data. Hadoop destroys the naturally occurring patterns and connections because its functionality is based on breaking up data. The problem is that most organizations don’t know that their data can be represented as a graph nor the possibilities that come with leveraging connections within the data."
Efraim Moscovich's Big Data for conventional programmers goes into much more detail on many of the top 10, including code snippets and pros and cons. He also gives a nice summary of the Big Data problem from a developer's point of view.
We have lots of resources (thousands of cheap PCs), but they are very hard to utilize.
We have clusters with more than 10k cores, but it is hard to program 10k concurrent threads.
We have thousands of storage devices, but some may break daily.
We have petabytes of storage, but deployment and management is a big headache.
We have petabytes of data, but analyzing it is difficult.
We have a lot of programming skills, but using them for Big Data processing is not simple.
Finally, GigaOm's programmer's guide to Big Data tools covers an entirely different set of tools, weighted towards application analytics and abstraction APIs for data infrastructure like Hadoop.
[Image: Flickr user Shinji WATANABE]