One or two games in MLB is often the difference between advancing to the post-season or staying home, and an entire season can be determined by a couple of good or bad pitches. There is a huge competitive advantage to knowing the opponent’s next step. That’s one reason sports analytics is a booming field. And it explains why data scientists, both fan and professional, are figuring out how to do more accurate modeling than ever before.
One notable example is Ray Hensberger, a baseball-loving technologist in the Strategic Innovation Group at Booz Allen Hamilton.
At a workshop during the GigaOm Structure conference, Hensberger shared his next-level data crunching and the academic paper his team prepared for the MIT Sloan Sports Analytics Conference. His team modeled MLB data to show with 74.5% accuracy what a pitcher is going to throw--and when.
Hensberger's calculations are more accurate than anything else published to date. But as Hensberger knows, getting the numbers right isn't easy. The problem: How to build machine-learning models that understand baseball decision-making? And how to make them solid enough to actually work with new data in real-time game situations?
“We started with 900 pitchers,” says Hensberger. “By excluding players having thrown less than 1,000 pitches total over the three seasons considered, we drew an experimental sample of about 400. We looked at things like the number of people on base, a right-handed batter versus a left-handed batter.”
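The sampling step Hensberger describes amounts to counting pitches per pitcher and keeping those at or above a threshold. A minimal sketch follows; the record layout (`pitcher_id` key) and the toy log are assumptions made for illustration, not the team's actual schema or data.

```python
from collections import Counter

MIN_PITCHES = 1000  # threshold quoted in the article

def eligible_pitchers(pitch_log, min_pitches=MIN_PITCHES):
    """Return the set of pitcher ids with at least `min_pitches` rows in the log."""
    counts = Counter(row["pitcher_id"] for row in pitch_log)
    return {pid for pid, n in counts.items() if n >= min_pitches}

# Toy log with hypothetical ids; a tiny threshold keeps the example readable.
toy_log = (
    [{"pitcher_id": "A"}] * 5
    + [{"pitcher_id": "B"}] * 2
    + [{"pitcher_id": "C"}] * 3
)
print(sorted(eligible_pitchers(toy_log, min_pitches=3)))
```

With the toy threshold of 3, pitcher B is dropped for having too few pitches.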
They also looked at the current at-bat (pitch type and zone history, ball-strike count); the game situation (inning, number of outs, and number and location of men on base); and pitcher/batter handedness; as well as other features from observations on pitchers that vary across ball games, such as curveball release point, fastball velocity, general pitch selection, and slider movement.
The final result? A set of pitcher-customized models and a report about what those pitchers would throw in a real game situation.
“We took the data, looked at the most common pitches they threw, then built a model that said ‘In this situation, this pitcher will throw this type of pitch’--be that a slider, curveball, split-finger. We took the top four favorite pitches of that pitcher, and we built models for each one of those pitches for each one of those pitchers,” Hensberger said.
These are methods he and his team outline in their book, The Field Guide To Data Science. “Most of [the data],” he says, “was PITCHf/x data from MLB. There’s a ton of data out there.”
“Each pitcher-specific model was trained and tested by five-fold cross-validation testing,” Hensberger says. Cross-validation is an important part of training and testing machine learning models. Its purpose, in English: to ensure that the models aren’t biased by the data they’re trained on.
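Picking each pitcher's four favorite pitches, as Hensberger describes, is a frequency count per pitcher. The sketch below assumes a simple list of records with hypothetical `pitcher_id` and `pitch_type` keys and uses made-up toy rows; real PITCHf/x data has its own field names.

```python
from collections import Counter, defaultdict

def top_pitch_types(pitch_log, k=4):
    """Map each pitcher id to their k most frequently thrown pitch types."""
    by_pitcher = defaultdict(Counter)
    for row in pitch_log:
        by_pitcher[row["pitcher_id"]][row["pitch_type"]] += 1
    return {
        pid: [ptype for ptype, _ in counts.most_common(k)]
        for pid, counts in by_pitcher.items()
    }

# Toy data: one hypothetical pitcher who throws five pitch types.
toy_pitches = []
for ptype, n in [("FF", 5), ("SL", 4), ("CU", 3), ("CH", 2), ("FS", 1)]:
    toy_pitches += [{"pitcher_id": "A", "pitch_type": ptype}] * n

print(top_pitch_types(toy_pitches))
```

The rarest pitch type (`FS` in the toy data) falls outside the top four, so no model would be built for it.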
“The cross-validation piece, the goal of it, you’re defining a data set you can test the model with,” says Hensberger. “You’ve got to have a way of testing the model out when you’re training it, and to provide insight on how the model will generalize to an unknown data set. In this case, that would be real-time pitches.”
“You don’t want to just base your model purely 100% on what was done historically. If we just put out this model without doing that cross-validation piece, people would probably say your model is overfit for the data that you have.”
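To make the idea concrete, here is a rough sketch of five-fold cross-validation in plain Python. The "model" is only a majority-class baseline standing in for whatever classifier the team used, and the labels are toy values; the point is the mechanics of holding out one fold while training on the other four.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(labels, k=5, seed=0):
    """Score a majority-class baseline with k-fold cross-validation.

    For each fold, the stand-in model is the most common label in the
    other k-1 folds; accuracy is measured only on the held-out fold.
    """
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        majority = Counter(train).most_common(1)[0][0]
        hits = sum(labels[i] == majority for i in held_out)
        scores.append(hits / len(held_out))
    return scores

# Toy labels (hypothetical): 12 fastballs and 8 sliders.
toy_labels = ["FF"] * 12 + ["SL"] * 8
fold_scores = cross_validate(toy_labels)
print(sum(fold_scores) / len(fold_scores))
```

Because every pitch is scored exactly once, and only by a model that never saw it during training, the averaged score is a fairer estimate of performance on unseen pitches than accuracy on the training data.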
Once the models were solid, Hensberger and his team used a machine-learning strategy known as “one-versus-rest” to run experiments to predict the type of the next pitch for each pitcher. It is based on an algorithm that allowed them to establish an “index of predictability” for a given pitcher. Then they looked at the data in three different ways:
- Predictability by pitch count, looking at pitcher predictability: When the batter is ahead (more balls than strikes), when the batter is behind (more strikes than balls), and when the pitch count is even.
- Predictability by “platooning,” which looks at how well a right-handed batter will fare against a left-handed pitcher, and vice versa.
- Out-of-sample test, a test to verify the predictions by running trained models with new data to make sure they work. “We performed out-of-sample predictions by running trained classifier models using previously unseen examples from the 2013 World Series between the Boston Red Sox and the St. Louis Cardinals.”
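The “one-versus-rest” strategy mentioned above trains one binary scorer per pitch type ("this pitch versus everything else") and predicts whichever type scores highest. The article does not say which underlying classifier the team used, so the sketch below swaps in a simple smoothed naive-Bayes-style log-odds scorer as a stand-in. Feature names and toy rows are hypothetical.

```python
import math
from collections import Counter

def train_binary(samples, labels, positive):
    """Fit one 'positive vs. rest' scorer from categorical feature dicts."""
    pos, neg = Counter(), Counter()  # (feature, value) counts per side
    n_pos = n_neg = 0
    for feats, label in zip(samples, labels):
        if label == positive:
            pos.update(feats.items())
            n_pos += 1
        else:
            neg.update(feats.items())
            n_neg += 1
    return {"pos": pos, "neg": neg, "n_pos": n_pos, "n_neg": n_neg}

def score(model, feats):
    """Smoothed log-odds that `feats` belongs to the model's positive class."""
    total = math.log((model["n_pos"] + 1) / (model["n_neg"] + 1))
    for item in feats.items():
        p = (model["pos"][item] + 1) / (model["n_pos"] + 2)
        q = (model["neg"][item] + 1) / (model["n_neg"] + 2)
        total += math.log(p / q)
    return total

def train_one_vs_rest(samples, labels):
    """One binary scorer per pitch type seen in the training labels."""
    return {c: train_binary(samples, labels, c) for c in set(labels)}

def predict(models, feats):
    """Pick the pitch type whose binary scorer is most confident."""
    return max(models, key=lambda c: score(models[c], feats))

# Toy training rows (hypothetical features and labels).
rows = (
    [({"count": "batter_ahead", "batter_hand": "R"}, "FF")] * 4
    + [({"count": "batter_ahead", "batter_hand": "L"}, "FF")] * 2
    + [({"count": "batter_behind", "batter_hand": "R"}, "SL")] * 3
    + [({"count": "batter_behind", "batter_hand": "L"}, "CU")] * 3
)
X = [feats for feats, _ in rows]
y = [label for _, label in rows]
models = train_one_vs_rest(X, y)
print(predict(models, {"count": "batter_behind", "batter_hand": "L"}))
```

An out-of-sample check in this setup would mean calling `predict` on situations that were never part of `rows` and comparing the result with what was actually thrown.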
“Overall our rate was about 74.5% predictability across all pitchers, which actually beats the previous published result at the MIT Sloan Sports Analytics Conference. That was 70%,” says Hensberger. The report published by his team was also able to predict exact pitch type better than before. “The other study only said if a fastball or not a fastball that’s going to come out of a pitcher’s hand,” says Hensberger. “The models we built were for the top four pitches, so [they show] what the actual pitches were going to be.”
Hensberger’s team also made some other interesting discoveries.
“Some pitchers, just given the situation, were more predictable than others,” he says. “There is no correlation between predictability and ERA. With less predictable pitchers, you would expect them to be more effective. But that’s not true. We also found that eight of the 15 most predictable pitchers came from two teams: the Cardinals and the Reds.”
This may be a result of the catchers calling the game, influencing the pitchers and their decisions. But it also may be attributed to pitching coaches telling pitchers what to do in certain situations. “Either way,” Hensberger says, “it’s interesting to consider.”
His findings around platoon advantage are worth thinking about as well. Statistically in baseball, platoon advantage means that the batter usually has the advantage: They have better stats when they face the opposite-handed pitcher.
“What we found [in that situation] is the predictability of pitchers was around 76%. If you look at the disadvantage, the overall predictability was about 73%,” Hensberger says. “So, pitchers are a little more predictable, we found, when the batter’s at the advantage. That could play into why the stats kind of favor them.”
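Breakdowns like these (by platoon advantage here, by ball-strike count earlier) come down to grouping predictions and computing accuracy within each group. A minimal sketch, assuming hypothetical record fields and made-up toy rows rather than the team's results; switch hitters are ignored for simplicity.

```python
from collections import defaultdict

def count_state(balls, strikes):
    """Bucket a ball-strike count from the batter's point of view."""
    if balls > strikes:
        return "batter_ahead"
    if strikes > balls:
        return "batter_behind"
    return "even"

def has_platoon_advantage(pitcher_hand, batter_hand):
    """True when batter and pitcher have opposite handedness."""
    return pitcher_hand != batter_hand

def accuracy_by_group(records, key):
    """Fraction of correct predictions within each group returned by `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        hits[group] += int(rec["predicted"] == rec["actual"])
    return {g: hits[g] / totals[g] for g in totals}

# Toy records (made up for illustration only).
toy_records = [
    {"pitcher_hand": "R", "batter_hand": "L", "predicted": "FF", "actual": "FF"},
    {"pitcher_hand": "R", "batter_hand": "L", "predicted": "SL", "actual": "SL"},
    {"pitcher_hand": "R", "batter_hand": "R", "predicted": "FF", "actual": "SL"},
    {"pitcher_hand": "R", "batter_hand": "R", "predicted": "FF", "actual": "FF"},
]
by_platoon = accuracy_by_group(
    toy_records,
    key=lambda r: has_platoon_advantage(r["pitcher_hand"], r["batter_hand"]),
)
print(by_platoon)
```

The same `accuracy_by_group` helper works for the pitch-count breakdown by passing a key built on `count_state`.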
This work was done over the corpus of data, but Hensberger says that you can run the models in real time during a game, using the time interval between pitches to compute new stats and make predictions according to the current game situation.
According to Jessica Gelman, cofounder and co-chair of the MIT Sloan Sports Analytics Conference, that type of real-time, granular data crunching is where sports analytics is headed. The field is changing fast. And Gelman proves it. Below is her overview of how dramatically it has evolved from where it was just a couple of years ago.
“If you’ve read Moneyball or watched the movie, at that point in time it was no different than what bankers do in looking for an undervalued asset. Now, finding those undervalued assets is much harder. There’s new stats that are being created all the time. It’s so much more advanced,” Gelman says.
Though it may surprise data geeks, Gelman says that formalized sports analytics still isn’t mainstream--not every sport or team uses data. The NHL is still lagging in analytics, with the notable exception of the Boston Bruins. The NFL is slow to adopt as well, though more teams like the Buffalo Bills are investing in the space.
However, most other leagues are with the program. And that is accelerating. In a big way. In Major League Soccer, formal analytics are now happening. Data analysis is now standard in English Premier League football, augmented by football fan sites around the world. And almost every baseball and basketball team has an analytics team.
“Some sports have been quicker to accept it than others,” says Gelman. “But it’s widely accepted at this point in time that there’s significant value to having analytics to support decision making.”
So how are analytics used in sports? Gelman says there's work happening on both the team side and on the business side.
"On the team side, some leagues do a lot with, for example, managing salaries and using analytics for that. Other leagues use it for evaluating the players on the field and making decisions about who’s going to play or who to trade. Some do both," says Gelman.
On the business side, data science increasingly influences a number of front office decisions. “That’s ticketing, pricing, and inventory management. It’s also customer marketing, enhancing engagement and loyalty, fandom, and the game-day experience,” Gelman explains. A lot of data science work looks at how people react to what happens in the stadium and how you keep them coming back--versus watching at home on TV. “And then,” Gelman says, “the most recent realm of analytics is wearable technology,” which means more data will soon be available to players and coaches.
Hensberger sees this as a good thing. Ultimately, he says, the biggest winners will be the fans.
“Data science is about modeling and predicting. When this gets in the hands of everyone across the leagues, the viewing experience will get better for everybody,” he says. “You want to see competition. You don’t want to see a blowout, you want to see close games. Excitement and heart-pounding experience. That’s what brings us back to the sport.”
[Image: Flickr user A DeVigal]