Kaggle has been running data competitions, which are open to all, for about three years now. In that time, it has become a place for a global community of over 100,000 aspiring and established data scientists to showcase their skills and for companies to hire them. Kaggle’s Chief Scientist, Jeremy Howard, was himself a top-ranking competitor before he joined the company. He sat down with FastCo.Labs to discuss a phenomenon he has noticed: Many of Kaggle’s best competitors are self-taught.
First, some background. A Kaggle competition works like this: Companies pose a challenge--for example, the Heritage Health Prize aimed to improve predictions of which patients were most likely to require a visit to hospital in the next year--and competitors vie to build the best predictive model. Prizes range up to $3 million, and winners also earn a place on Kaggle’s leaderboard.
What do all the people on the leaderboard have in common? It’s not an Ivy League education or a PhD in Statistics. According to Howard, it’s creativity--and Coursera.
What’s the typical background of a competition winner?
The people who win competitions are generally not Stanford-educated or Ivy League American mathematicians. The world's best data scientists, based on their actual performance, haven't gone to famous schools. Most of the winners in the last 12 months have learnt machine learning from Andrew Ng's course on Coursera. The most successful backgrounds are electrical engineering and physics. Also quite successful is astronomy. These are areas in which for decades people have had to pragmatically analyze data and do useful things with it.
About half the people who compete are from the U.S., but there's currently only one American in the top 10. We have only had one woman who has been in the top three in a competition. There are far fewer women than men in the industry in general, and among the people who seriously compete, women are a smaller subset again.
What is driving the rise of the self-taught data scientist?
Modern machine learning algorithms are very sophisticated and can derive about as much insight from a data set as a human who studied it intensely could, and more, since they can do it for larger data sets. A data set of 100,000 to a million records, which is actually a lot of data, fits into the RAM on a laptop. When I started in analytics 20 years ago, we were doing neural networks for banks involving 100,000 to a million records, and we were buying large servers from IBM and specialist neural network cards from AI companies. We were spending 40 to 50 million dollars to do those data sets. You can now do that in 20 seconds for free on a laptop using R, which is open source.
In every single one of our competitions where there is a benchmark, all existing scientific and industrial benchmarks have been beaten. Every single one of our competitions has created the best algorithm of its kind ever. The vast majority of winners are using laptops and R or Python. When that is not the case they have paid 5 to 10 dollars to Amazon to hire a cluster for the last day of the competition to fine-tune their algorithms.
Big data gets the most column inches because it's what people can use to sell software licenses and hard disk drives and CPUs and so forth but most problems we are trying to solve with modeling are not that complex. You don't need to look at a billion movie recommendations to figure out that people who like Lord of the Rings tend to like Star Wars.
What machine learning algorithms are Kaggle winners using?
There are two classes of algorithm which are dominant now. One is ensembles of decision trees. The best-known type is the random forest, and the other important one is gradient boosting. These algorithms are notable for a number of reasons. One is their incredible simplicity. It would take a good developer 3-6 hours to create their own implementation from scratch. For a random forest, you take a small random subset of the data you are interested in, you create a decision tree on that data, you repeat that process for multiple small random subsets, and you combine the trees' predictions. Boosted trees are similar to random forests. The main difference is that instead of creating lots of trees from different random subsets, with boosted trees you take the error from the previous tree and use that to improve the next. Ensembles of decision trees have won more Kaggle competitions than any other algorithm by far.
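The two recipes described above can be sketched in a few dozen lines of plain Python. This is only an illustration under stated assumptions, not the code any Kaggle winner used: the trees are one-split "stumps" on a single input, the data (`y = x**2`) is made up, and the names `fit_stump`, `fit_bagged`, `fit_boosted` and the `learning_rate` parameter are hypothetical. A real random forest also samples a random subset of features at each split, which has no effect with only one input.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a "stump") to 1-D data."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x identical: just predict the mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_trees=50, sample_size=20, seed=0):
    """Random-forest style: one tree per small random subset of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(sample_size)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_bagged(trees, x):
    # Combine the trees by averaging their predictions.
    return sum(predict_stump(t, x) for t in trees) / len(trees)


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.1):
    """Boosting style: each tree is fitted to the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    trees = []
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        residuals = [r - learning_rate * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return (base, learning_rate, trees)


def predict_boosted(model, x):
    base, learning_rate, trees = model
    return base + learning_rate * sum(predict_stump(t, x) for t in trees)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: a smooth curve that a single stump cannot fit well.
xs = [i / 10 for i in range(100)]
ys = [x ** 2 for x in xs]
mean_y = sum(ys) / len(ys)

forest = fit_bagged(xs, ys)
boosted = fit_boosted(xs, ys)

baseline_mse = mse(lambda x: mean_y, xs, ys)
bagged_mse = mse(lambda x: predict_bagged(forest, x), xs, ys)
boosted_mse = mse(lambda x: predict_boosted(boosted, x), xs, ys)
print("predict the mean:", baseline_mse)
print("bagged stumps:   ", bagged_mse)
print("boosted stumps:  ", boosted_mse)
```

The only structural difference between the two fitting functions is the one the interview points out: `fit_bagged` draws a fresh random subset for every tree, while `fit_boosted` feeds each tree the residual error of the trees before it.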
The second broad class of algorithms, which has started to be extremely successful in the last year, is deep learning networks, which are basically neural networks with many layers. If you use an Android phone, the speech recognition is being done with deep learning networks.
Because these two sets of algorithms are so rich in terms of the models they can build, they don't need nearly as much data. The inventor of deep networks, Geoffrey Hinton, has a class on Coursera about them, and he points out that companies which are using them, for example for speech recognition, are using 10 to 100 times less data than they used to. Perhaps much more important, they are both highly resistant to overfitting. Overfitting is where you create a model which predicts the noise in the data rather than the real signal, and as a result it won't be useful when you apply it to a new data set. This means that you can throw literally hundreds or thousands of variables at these algorithms and they won't get confused. That means that we can analyze things like genetic codes, which will have tens or hundreds of thousands of genes. We have seen breakthroughs in areas of science which were previously not amenable to these approaches.
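To make the definition of overfitting concrete, here is a small sketch in plain Python. Everything in it is an assumption chosen for illustration: the "real signal" is the line `y = 2x`, the noise is Gaussian, and the overfitted model is a deliberately naive `memorizer` (a one-nearest-neighbour lookup), which is not one of the algorithms discussed in the interview. It reproduces the training data perfectly, noise included, and then does worse on fresh data than a plain straight-line fit.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
NOISE = 1.0             # standard deviation of the made-up noise
n = 300


def signal(x):
    """The 'real signal' in this toy example (an assumption)."""
    return 2.0 * x


# Training and test points are interleaved so no test x equals a training x.
train_x = [i / n * 10 for i in range(n)]
test_x = [(i + 0.5) / n * 10 for i in range(n)]
train_y = [signal(x) + rng.gauss(0, NOISE) for x in train_x]
test_y = [signal(x) + rng.gauss(0, NOISE) for x in test_x]


def memorizer(x):
    """Overfitted model: return the y of the closest training point, noise and all."""
    nearest = min(range(n), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Simple model: ordinary least-squares straight line.
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def line(x):
    return intercept + slope * x


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


memorizer_train_mse = mse(memorizer, train_x, train_y)
memorizer_test_mse = mse(memorizer, test_x, test_y)
line_train_mse = mse(line, train_x, train_y)
line_test_mse = mse(line, test_x, test_y)

print("memorizer  train:", memorizer_train_mse, " test:", memorizer_test_mse)
print("line       train:", line_train_mse, " test:", line_test_mse)
```

The memorizer looks perfect on the data it has seen and loses to the simpler model on data it has not, which is the gap between training and new-data performance that the paragraph above describes.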
What are the most important skills required to win a Kaggle competition?
Now, instead of having people look for obscure mathematical techniques, we know a couple of classes of techniques which work very broadly, very effectively, and with a minimum of fuss, especially ensembles of decision trees. You don't need to set lots of parameters, and they are not difficult to train. We are now at a place where all the work is in the creativity of the data scientist, things like: What problem should I be solving? What things should I be trying to predict? What kinds of things could be predictive? Where can I find that data?
To be successful in a Kaggle competition, what you have to be really good at is feature engineering (selecting and combining the most relevant features). One of our early competitions involved the amount of time it took cars to cross a freeway; there were sensors along the highway, so every two minutes there was a reading. The winners of that competition used a random forest, and they used visualizations to understand how traffic jams percolated across a road backwards and forwards and combined. They realized that they needed to provide the algorithm with features like the time of the previous segment divided by the time of the segment after that, the segment two minutes ago versus four minutes ago and stuff like that. That's all human ingenuity. Winning a Kaggle competition is entirely about creativity. The guy in first place has found six different ways of thinking about the problem.
There are some fields in which deep learning algorithms can do some of that (feature engineering) automatically, in particular speech and image recognition, so maybe in 5-10 years predictive modeling will be more and more automated. All the work will be in how to leverage these models rather than in how to build them.
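The kind of hand-built features described above might look something like the following sketch. The data layout and every name here are hypothetical: `times[s][k]` is assumed to hold the travel time in seconds on segment `s` at the `k`-th two-minute reading, "previous segment" is assumed to mean the neighbouring segment with the lower index, and the numbers are invented. The actual competition data and the winners' code are not shown in the interview.

```python
def build_features(times, segment, k):
    """Turn raw sensor readings into ratio features for one segment at reading k.

    times[s][k] = travel time (seconds) on segment s at the k-th two-minute reading.
    Requires a neighbouring segment on each side and at least two earlier readings.
    """
    previous_segment = times[segment - 1][k]
    next_segment = times[segment + 1][k]
    two_min_ago = times[segment][k - 1]
    four_min_ago = times[segment][k - 2]
    return {
        # Previous segment's time divided by the time of the segment after this one.
        "upstream_over_downstream": previous_segment / next_segment,
        # Is this segment slowing down or speeding up? (2 minutes ago vs 4 minutes ago)
        "two_vs_four_min_ago": two_min_ago / four_min_ago,
        # The raw reading is kept alongside the engineered ratios.
        "current": times[segment][k],
    }


# Invented readings for three consecutive segments, four readings each:
# a jam builds on segments 1 and 2 while segment 0 is still flowing freely.
times = [
    [60, 60, 60, 60],
    [60, 60, 90, 150],
    [60, 60, 80, 120],
]

features = build_features(times, segment=1, k=3)
print(features)
```

Rows of features like these, one per segment and reading, are what would then be handed to a model such as a random forest; the ratios spell out relationships between neighbouring segments and recent readings so the model does not have to work them out from the raw values.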
What advice would you give companies who want to hire a data scientist?
If you want to hire a juggler for your circus, you would have him juggle for you and see how many things he can juggle. If you are going to hire someone to create predictive models, look at how well their models predict. The idea that I would hire a data scientist based on what school they went to, whether I like what they wear to work or what country they grew up in... I just don't think any of those things are terribly important. We have had a number of companies like Facebook and Yelp hiring out of Kaggle competitions and building their core data science teams on that basis. All of the feedback from those companies has been so good that they have had multiple follow-up competitions.
One of the things we have learnt is that a good data scientist can solve a problem in any domain area with any data set: the sale price of bulldozers, the speed and direction of wind, who the authors of academic papers are. Most organizations look for domain expertise and think that their competitive advantage comes from their proprietary expertise. This is just not true any more in anything which is amenable to data analysis. If you hire one guy like that (a Kaggle winner), he will be able to solve your five most difficult problems, regardless of where they are. These guys should be superstars. They should be earning more than the CEO.
[Image: Flickr user Voltageek]