2013-08-21

Co.Labs

What Hackers Should Know About Machine Learning

A mire of algebra, stats, and dry academic research, this arcane discipline allows computers to make decisions in place of humans. But where’s a hacker to start?



Drew Conway is the co-author of Machine Learning for Hackers and must be one of the few data scientists out there who started his career working on counter-terrorism at the Department of Defense. FastCo.Labs talked to him about algebra, GitHub, and the ugly side of Machine Learning.

Why should developers learn Machine Learning?

I don't necessarily think that every developer should learn Machine Learning. Machine Learning as a discipline is interested in the application of statistical methods to decision making. If your job as an engineer is to build large systems that have nothing to do with that, then I wouldn't say that you should learn it. That said, the process of learning it can improve your overall statistical literacy, and I would say that's a general benefit in life.

Why did you write the book?

We were familiar with the reference texts around Machine Learning, and all of those texts require a pretty substantial foundation in linear algebra, calculus, and statistics. We wanted to create a reference book geared more toward practitioners who are used to thinking algorithmically, one that didn't require a lot of math or statistical training.

What are the biggest gaps in the average hacker’s knowledge when learning Machine Learning?

A college intro-level probability class, so that you can learn how different probability distributions are reflected in the real world. Why do we care so much about the normal distribution? What is it about the normal distribution that's so fundamental to the things we observe in nature, versus a binomial distribution? What kinds of processes and phenomena does that represent? Then, in terms of actually doing the work, linear algebra and matrix algebra. You get the probabilistic stuff so you can understand the framework for thinking about how things work, and then the linear algebra and matrix algebra is often how it gets done in software.
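As an illustration of that point (not something from the book), here is a minimal sketch in Python, assuming NumPy and synthetic data: draws from a normal and a binomial distribution, and an ordinary least squares fit worked directly through the kind of matrix algebra that library routines use under the hood.

```python
# Illustrative sketch (not from the book): how the probability and linear
# algebra pieces show up in practice, using NumPy and synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Two of the distributions mentioned above: a normal distribution
# (continuous measurements, e.g. heights) and a binomial distribution
# (counts of successes in repeated trials, e.g. coin flips).
heights = rng.normal(loc=170, scale=10, size=1000)   # normal draws
flips = rng.binomial(n=10, p=0.5, size=1000)         # binomial draws
print(heights.mean(), heights.std(), flips.mean())

# The matrix-algebra side: ordinary least squares via the normal equations,
# beta = (X^T X)^{-1} X^T y, which is what many fitting routines reduce to.
X = np.column_stack([np.ones(1000), heights])        # design matrix
y = 2.0 * heights + rng.normal(scale=5, size=1000)   # synthetic target
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                                          # intercept and slope
```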

Someone who is a professional scientific researcher probably understands all the stuff about cleaning up data--that's their bread and butter--whereas a professional software engineer understands how to build from the ground up but hears less often, "Here's some data. I need you to tell me what's going on." The "here's some data" part is the really ugly part: cleaning it up, creating a matrix out of it, and so on.

There's a curiosity required to do this stuff: looking at a dataset and thinking about what is an appropriate or interesting question to interrogate with the data--that exploratory data analysis step. I have a new dataset, so I'm just going to sit at the command line, look at the density distributions, do some scatterplots, and see what the structure of the data is. I think that requires some practice but also some intuition about the data-generating process. Of course, if you don't have any training and have never done any of this before, it may seem a bit opaque at first. For most of the developers I know who have no background in that, it can be a bit intimidating in the beginning.
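A rough sketch of that exploratory first pass, assuming pandas and matplotlib (the density plot also relies on SciPy) and a hypothetical CSV with columns such as age and income; it is only meant to illustrate the kind of looking-around he describes.

```python
# A minimal sketch of the exploratory step: summary statistics, a density
# plot, and a scatterplot. The file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey.csv")        # hypothetical dataset

# First look: summary statistics and missing values before any modeling.
print(df.describe())
print(df.isna().sum())

# Density distribution of a single variable (uses SciPy under the hood).
df["income"].plot(kind="density")
plt.title("Income density")
plt.show()

# Scatterplot to eyeball the structure between two variables.
df.plot(kind="scatter", x="age", y="income")
plt.show()
```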

What are the differences between doing a Machine Learning project and a development project?

Data analysis as an exploratory endeavor should be the first part of anything. You should never go into a project and say, "The thing that I want to do is classification, so I'm always going to run my favorite classification algorithm." For the first half of the book we talk about "Here's a dataset, here's how to clean it up." The chapters that John Myles White wrote on means, medians, modes, and distributions cover the things you should always do in the beginning. We want to hammer home that it's not just input-output. It's input, look around, see what's going on, find structure in the data, then make the choice of methods. And then maybe iterate over a couple of them. It's very cyclic. It's not linear.
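One way to picture that cycle--a hedged sketch rather than the book's own code--is to summarize first and then try a couple of candidate methods and compare them, assuming scikit-learn and a hypothetical dataset.

```python
# Sketch of the cyclic workflow: look at the data first, then iterate over
# candidate methods instead of committing to one up front.
# The CSV, columns, and label are hypothetical; scikit-learn is assumed.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("customers.csv")          # hypothetical dataset
print(df.describe())                       # means, medians, spread
print(df["churned"].value_counts())        # class balance

X = df[["age", "income"]]                  # hypothetical features
y = df["churned"]                          # hypothetical label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try a couple of methods and compare before settling on one.
for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("tree", DecisionTreeClassifier(max_depth=3))]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```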

A data scientist has a very different relationship with the code than a developer does. I look at the code as a tool to go from the question I'm interested in answering to having some insight, and that's sort of the end of it. That code is more or less disposable. Developers are thinking about writing code to build into a larger system; they are thinking about how to write something that can be reused. People who do large-scale Machine Learning--people at Google and Facebook--think in a similar way to a software engineer, in the sense that there are lots of interesting Machine Learning tools and methods that don't scale well to the web-scale datasets those companies are dealing with. So at the beginning their process is more like: what is the limited set of tools I have that can actually scale up and be useful for this question?

There are different levels. There's exploratory research data science, which many people coming into jobs from academia do, where they build tools that are more like minimum viable pieces of technology. In some places there are people who do that but then have to figure out a way to optimize it at large scale, and then there are the people who work on production systems, writing code that is going to be used all the time as part of the product itself.

Do we need a GitHub for data analysis?

The real limitation of GitHub is that it's not meant to be like S3, where you can store a ton of data. The data limitation is a significant one. In reality, I think it's fine for the data to be separate from the actual analytical code. The thing that I think was missing for a while was an appropriate way of conveying results. Most people who do data analysis eventually get to the point where they have a graph or something they want to show you, and now, with GitHub Pages, people do that all the time. If you look at Mike Bostock's work on D3 (a JavaScript library for visualization), his stuff is all on GitHub; he uses GitHub Pages and does a great job with it. GitHub really gets you 80% of the way there. The data portion is the real limitation, but that's okay, because everyone is going to want to use a different type of database, a different data structure, for their project.

What would you add in a new edition of the book?

There are lots of new methods we would certainly add. One of the things we don't do at all in the book is ensemble approaches to Machine Learning--combining multiple methods. We also don't talk at all about model fitting and evaluating the quality of models. Those are certainly things we would do in a second edition. Part of the reason we didn't do them in the first one is that they are more intermediate-level topics and we were going for a novice audience.
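For a sense of what those second-edition topics look like in code, here is a small sketch, assuming scikit-learn and one of its bundled toy datasets: an ensemble that combines two methods, evaluated with cross-validation.

```python
# Sketch of the two topics mentioned above: an ensemble combining multiple
# methods, and cross-validation to evaluate model quality.
# scikit-learn is assumed; the data is a bundled toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Combine two different methods into one ensemble via majority vote.
ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression(max_iter=5000)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Evaluate quality with 5-fold cross-validation rather than a single fit.
scores = cross_val_score(ensemble, X, y, cv=5)
print(scores.mean(), scores.std())
```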

My thinking has evolved on presenting results. The way I think about presenting results now is always in the browser, as an interactive thing. There's a tremendous amount of value in giving the audience the ability to ask second-order questions about what they are observing rather than just first-order ones. Imagine the thing you are looking at is a simple scatterplot and you see one outlier. A first-order question would be: who is that outlier? If you have an interactive plot where you can hover over the dot, it tells you who that is; the second-order question is: why is that an outlier?
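A small sketch of that hover-to-identify idea: the interview points at D3, but the same interaction can be mocked up from Python with Plotly Express, which is assumed here along with made-up data.

```python
# A minimal sketch of the browser-based, hover-to-identify idea using Plotly
# Express (not D3, which the interview mentions). The data is made up.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "name":   ["Ann", "Ben", "Cal", "Dee", "Eve"],
    "hours":  [10, 12, 11, 13, 40],     # "Eve" is the obvious outlier
    "output": [20, 25, 22, 27, 30],
})

# Hovering over a point answers the first-order question ("who is that?");
# digging into why it is an outlier is the second-order question.
fig = px.scatter(df, x="hours", y="output", hover_name="name")
fig.show()  # opens the interactive plot in the browser
```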

You can get pretty far with Machine Learning for Hackers, but our hope is that those who want to move from hacker to real Machine Learning engineer will go out there, build on the fundamentals, and read Bishop and Hastie.

[Image: Flickr user Dustin McClure]







2 Comments

  • Anthony Reardon

    Right on.

    Seems like Machine Learning is becoming increasingly popular because of the practical applications for processing data out there. Probably good for code hackers that want to skill up for this emerging field, but I personally find the treatment lacking.

    I've been interested in AI since reading Douglas Hofstadter's Fluid Concepts and Creative Analogies many years ago. Found that to be a fundamentally appropriate introduction to machine learning in the broader sense, but it wasn't oriented to big data specifically. I even contacted him at one point to see if he would work with me on an intelligent browser, but he pointed out to me the scope of research needed to be very narrow, controllable, and so on. I felt this was similar to the objections mentioned about focus on applications that are scalable for big data.

    So I don't like the focus on code, application tools, and data even though I realize that's where people are trying to take the field. It doesn't sound like machine learning to me. It sounds more like specialized query and algorithm testing. If you run some program on a data set to help make sense of it, then you are really just crunching numbers. How different is that from pulling out your TI-81 calculator and plugging it in to a given data set? I guess I understand the practicality, but can't help but think the use of "learning" is a bit overhyped and convenient.

    In theory, I think the best place to start with machine learning is the human mind. This has been a point of reference for those trying to build AI systems--that to build something intelligent, you must first have an appreciation for what intelligence is. The article above seems to touch on that somewhat in the process of articulating questions, the goal of achieving a usable insight, etc. You can see there is more intelligence going into the process than just input to output. So my thing is probably that is where to start too. Makes it kind of interesting to see where they will take the next book as they introduce ensemble, model fitting, and evaluation. You can probably appreciate the ordered approach to introducing a complex subject, and I suppose this is a methodology that goes hand in hand with the scientific approach and research being conducted.

    Still, if you are oriented on the human mind, I think you benefit greatly. I liked how he talks about presenting results in an interactive browser format. Maybe you confine your definition of machine learning to functions on data, adapting the intelligence of creative function development at the starting and end points. Even with pure AI, you're not going to get anywhere anytime soon without consideration of the end-user at some point. With the pressure to make use of big data, I can see how this is a useful compromise.

    This aspect of second order operation on an outlier stood out to me though. From the cognitive science standpoint, this fits perfectly with symmetry vs. asymmetry. Given a field, that something stands out, it is human nature to scrutinize that more closely. The value of that particular data, exception to the rule, or characteristic variability can hold the most leverage in understanding the system as a whole. It can be more efficient to study the breaks in the pattern than the trends that reinforce it. So you would certainly be looking for that from a human intelligence framework on research. Piaget's work on centration, signifiers, and indices comes to mind.

    Best, Anthony