The successful application of Data Science in the world of sports
By: Tom Luijten
In the past 15 years, data science has become an essential part of day-to-day operations in many sports. The rise of data science in sports started in baseball and quickly reached other sports such as cycling, football, and basketball. This article discusses the application and importance of data science in sports.
“It is an unfair game”
This phrase applies to almost every sport, since most of them are dominated by money. A well-known criticism is that the performance of clubs depends heavily on their budget. Take, for example, Real Madrid, the world’s wealthiest football club. It is no coincidence that this club won the Champions League, the most prestigious club competition in European football, in three of the last four editions.
However, there are exceptions. In the last edition of the Europa League, the second most prestigious club competition in European football, Ajax reached the final despite a much lower budget than most of its opponents in previous rounds. It was no surprise that Manchester United won the final, since the budget they spent on their team (€315 million) is about 17 times as large as the budget Ajax spent on theirs (€19 million) [1]. Another example of a club that exceeded expectations is Leicester City, which won the Premier League in the 2015-2016 season. Despite having only the 12th-highest turnover of all Premier League clubs the year before, Leicester City achieved what experts thought to be impossible: winning the title.
These examples show that football, and sports in general, are not entirely dominated by money. A key factor in Ajax’s success is their excellent youth academy. Building a youth academy that produces talented players requires both time and money. Since most clubs focus on the short term, investing in a youth academy is not a priority for many football clubs.
Leicester City has invested in another resource for over 10 years: data science. In the year they won the title, the team sustained a relatively small number of injuries, primarily because of the use of data science. In particular, they used data science to keep track of the players’ fitness and conditioning. As a result, Leicester City’s manager had his best players available for the majority of the season, and the club won its first title. So how did those two fields, data science and sports, come together?
For many years, people have been tracking statistics of sports matches. However, only recently have computers become powerful enough to store this data at scale so that data scientists can analyze it. As a result, many sports still rely on straightforward statistics that were easy to collect at a time when only small data sets could be analyzed. In many sports, for example, simple statistics were used for a long time to gauge the qualities of a player.
In the book ‘Moneyball: The Art of Winning an Unfair Game’, author Michael Lewis tells the real-life story of the baseball club Oakland Athletics in 2002 [2]. As shown in Graph 1, that year’s team payroll of the Oakland Athletics, approximately 40 million dollars, was just a fraction of the payroll of the top teams in the league. Because of this gap, the Oakland Athletics were expected to play only a modest role in the competition. In his attempt to turn the tide, the general manager hired a young Harvard economics graduate to scout players based on data analytics. In contrast to traditional scouting, in which players are valued based on conventional beliefs, they decided to scout players based on empirical analysis. This method of evaluating players based on data analytics is called sabermetrics. Using sabermetrics, the Oakland Athletics reached the playoffs in 2002 and again the year after.
Following the Oakland Athletics’ success, many clubs in the MLB changed their scouting philosophy and hired full-time sabermetric analysts. The most successful example is the 2015 season of the New York Mets, also called the ‘Moneyball Mets’. A few years earlier, the New York Mets had started using sabermetrics to scout players. In 2015, this approach took them to the World Series, for which they had to defeat the Los Angeles Dodgers along the way. This was a great achievement because of the huge difference in payroll: $273 million for the Los Angeles Dodgers compared to $101 million for the New York Mets.
The success of data science in baseball was noticed by leaders in other sports. Daryl Morey, an American sports executive, introduced a more analytical philosophy to basketball, which is nowadays called ‘Moreyball’. In cycling, Team Sky hired a data scientist to apply predictive analytics, which aided in the 2015 and 2016 Tour de France victories. And in the Netherlands we should ask ourselves: what would have happened at the football World Cup in Brazil if we had applied data analytics like the Germans did, or even more intensively?
Application of Data Science in sports
In this article, the application of data science in sports is subdivided into three areas: measuring various kinds of statistics during training or a match and translating them into relevant insights, determining the value of players, and supporting the tactical decisions made by managers.
Before data science was applied to baseball, simple statistics such as runs batted in (the number of times a batter makes a play that allows a run to be scored) were used to gauge the qualities of players. Sophisticated data analysis provided more insight into which qualities of players are actually relevant. One example of an important statistic that emerged from these analyses is the on-base percentage, which is hard to judge with the naked eye. Since scouts undervalued such statistics, the Oakland Athletics could scout better players than their competitors in the 2002 season.
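To make the contrast between the two statistics concrete, here is a minimal sketch in Python. The article does not spell out the formula; the code assumes the standard definition of on-base percentage (hits, walks, and hit-by-pitches divided by at-bats, walks, hit-by-pitches, and sacrifice flies). Both players and all of their season numbers are hypothetical and serve only as an illustration.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Standard OBP definition (an assumption; the article gives no formula)."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    if opportunities == 0:
        return 0.0
    return times_on_base / opportunities


# Hypothetical season lines for two invented players.
players = {
    "Slugger": {"hits": 150, "walks": 30, "hit_by_pitch": 2,
                "at_bats": 560, "sacrifice_flies": 8, "rbi": 100},
    "Patient": {"hits": 140, "walks": 95, "hit_by_pitch": 5,
                "at_bats": 500, "sacrifice_flies": 5, "rbi": 65},
}


def obp_by_player(stats):
    """Return a dict mapping player name to on-base percentage."""
    return {
        name: on_base_percentage(
            s["hits"], s["walks"], s["hit_by_pitch"],
            s["at_bats"], s["sacrifice_flies"],
        )
        for name, s in stats.items()
    }


for name, obp in obp_by_player(players).items():
    print(f"{name}: RBI={players[name]['rbi']}, OBP={obp:.3f}")
```

In this invented example, the player with fewer runs batted in reaches base far more often, which is the kind of quality a scout relying only on the simpler statistic could overlook.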
The rise of computers and modern technology enables sports analysts to gather larger amounts of data using all kinds of tracking systems. In the Tour de France, for example, each rider has a GPS transponder on his bike which collects all kinds of data. Using this historical data, each rider can be analyzed and a winning strategy can be predicted. Moreover, data science can be used to improve the technology in cycling. For example, data science was used to design a more streamlined aerodynamic suit for the Dutch cyclist Tom Dumoulin. This suit contributed to the first Dutch men’s Grand Tour victory since 1980.
Data science has never been more relevant in football than it is today. Besides keeping track of players’ fitness and conditioning, another application is scouting players, just like in Moneyball. The advantage over traditional scouting is that big data makes it possible to judge a player’s entire performance history instead of relying on short-term impressions. Similar to cyclists, football players are tracked intensively during a game or training. During each game in the Premier League, 1.4 million data points are collected, which boils down to 10 data points per second per player [4]. To analyze this data, clubs use several tools provided by sports data and technology companies.
Those companies provide information on players’ distance travelled, goal attempts, saves, and much more. Figure 1 illustrates the cameras that track the movements of players, together with a heat map that can be derived from the resulting data. In such a heat map, the positioning of an individual player on the field is represented graphically by colors. From this, analysts are able to draw better conclusions about the performance of individual players.
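As a rough illustration of how a heat map can be derived from tracking data, the sketch below bins a player's positions into a coarse grid and counts the samples per cell. It is not the method of any particular tracking provider. The pitch dimensions (105 by 68 metres), the grid size, and the randomly generated positions are all assumptions made for the example.

```python
import random

PITCH_LENGTH_M = 105.0  # assumed pitch length
PITCH_WIDTH_M = 68.0    # assumed pitch width


def build_heat_map(positions, cols=6, rows=4):
    """Count how many (x, y) samples fall into each cell of a rows x cols grid."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in positions:
        # Clamp so that samples on or beyond the boundary land in an edge cell.
        col = min(max(int(x / PITCH_LENGTH_M * cols), 0), cols - 1)
        row = min(max(int(y / PITCH_WIDTH_M * rows), 0), rows - 1)
        grid[row][col] += 1
    return grid


def render(grid):
    """Draw the grid as text; denser cells get a darker character."""
    shades = " .:*#"
    peak = max(max(row) for row in grid) or 1
    return "\n".join(
        "".join(shades[count * (len(shades) - 1) // peak] for count in row)
        for row in grid
    )


# Synthetic data: 60 seconds at 10 samples per second for a hypothetical
# player who mostly stays in one area of the pitch.
rng = random.Random(0)
samples = [(rng.gauss(75, 12), rng.gauss(20, 8)) for _ in range(600)]

heat_map = build_heat_map(samples)
print(render(heat_map))
```

A real system would use a much finer grid and color shading instead of characters, but the idea is the same: the more samples in a cell, the more time the player spent there.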
Since Daryl Morey introduced data analytics in basketball ten years ago, a lot of research has been done in different areas of the sport. An interesting study, called ‘POINTWISE: Predicting Points and Valuing Decisions in Real Time with NBA Optical Tracking Data’, was published in 2014 [6]. This study proposes a framework that quantifies each moment of a possession. With this tool, individual players can be assessed, strategies can be adapted, and opponents can be analyzed.
To gather the required data, cameras were installed in 13 NBA arenas in the 2012-2013 season, which stored 93 gigabytes of information in the database. Instead of only quantifying events that occur at the end of a possession, such as points, turnovers, and assists, this framework quantifies all moments of a possession. The proposed measure is called expected possession value (EPV). It quantifies how many points the offense is expected to score at the end of the possession given the current situation, and is defined by the following model:
$$\mathrm{EPV}(t) = \mathbb{E}[\text{points} \mid d_t] = \sum_{a \in A} \mathbb{E}\big[\text{points} \mid a \text{ in } (t, t+\varepsilon],\ d_t\big]\; \mathbb{P}\big[a \text{ in } (t, t+\varepsilon] \mid d_t\big]$$
The equation above gives the model for calculating the EPV at time t. The variable $d_t$ represents the spatial configuration of the players and the ball at time t during the possession. Action a is a potential action from the set of all potential actions, denoted by A. Lastly, (t, t+ε] is the small time window from time t until time t+ε. In the model, the probability assigned to each decision the ball carrier can make is based on historical data. Besides this probability, the model takes into account the expected number of points given that the ball carrier takes that action. The EPV at time t is calculated by multiplying, for each potential action, its probability by its expected points, and summing over all potential actions. Figure 2 shows a diagram for the example in which Kawhi Leonard of the San Antonio Spurs is the ball carrier.
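The weighted sum in the EPV model can be sketched in a few lines of Python. This is only an illustration of the final summation step, not the estimation procedure of the cited study: the action names, probabilities, and expected point values below are hypothetical, whereas in the actual framework they are estimated from tracking data.

```python
def expected_possession_value(options):
    """Sum probability * expected points over all potential actions.

    `options` maps each action a in A to a pair
    (probability of a given the current configuration,
     expected points given a and the current configuration).
    """
    total_probability = sum(p for p, _ in options.values())
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    return sum(p * points for p, points in options.values())


# Hypothetical decision options for a ball carrier at one moment t.
options_at_t = {
    "shoot":          (0.20, 0.95),
    "pass to corner": (0.35, 1.10),
    "drive":          (0.30, 1.00),
    "hold the ball":  (0.15, 0.90),
}

epv = expected_possession_value(options_at_t)
print(f"EPV at time t: {epv:.2f} points")
```

Recomputing this value as the players move shows how each decision raises or lowers the expected outcome of the possession, which is what allows individual decisions to be valued.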
Data analytics is now widely used to develop strategy. An important conclusion from these analyses was that shooting three-pointers is a better strategy than shooting two-pointers. As a result, NBA teams now rely on three-point shooting more often, and with success. In 2015, for example, the last five teams remaining in the Playoffs were the five best three-point shooting teams during the regular season [7].
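The article does not give the calculation behind this conclusion, but the usual reasoning compares the expected points per shot attempt. The sketch below shows that comparison; the shooting percentages are hypothetical and were chosen only to illustrate the arithmetic.

```python
def expected_points_per_shot(make_probability, shot_value):
    """Expected points from one attempt: chance of scoring times points scored."""
    return make_probability * shot_value


def break_even_two_point_pct(three_point_pct):
    """Two-point percentage needed to match a given three-point percentage."""
    return three_point_pct * 3 / 2


# Hypothetical shooting percentages, not taken from the article.
three_point_pct = 0.36
long_two_point_pct = 0.40

three_value = expected_points_per_shot(three_point_pct, 3)
two_value = expected_points_per_shot(long_two_point_pct, 2)
break_even = break_even_two_point_pct(three_point_pct)

print(f"Three-pointer: {three_value:.2f} expected points per attempt")
print(f"Long two-pointer: {two_value:.2f} expected points per attempt")
print(f"Two-point percentage needed to break even: {break_even:.0%}")
```

Under these assumed percentages, the three-pointer is worth more per attempt even though it goes in less often, because a two-point shot would have to be made 1.5 times as often to yield the same expected points.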
Data Science: The art of winning an unfair game
Since the publication of Moneyball, the use of data science in sports has developed rapidly. The most important improvements have been made by companies that gather and analyze the necessary data. With that data, individual players can be assessed on a firmer evidence base than before, and teams can develop better strategies. This has resulted in several successes in all kinds of sports over the past 15 years. However, as data scientists, we have only just begun to optimize the world of sports. Many challenges still lie ahead, and data science will play a very important role in solving them. Ultimately, Quantics believes that data science will prove itself to be the art of winning an unfair game.
2. Lewis, M. (2003). Moneyball: The art of winning an unfair game. New York: W.W. Norton.
3. Data obtained from: http://www.stevetheump.com/Payrolls.html
6. Cervone, D., D’Amour, A., Bornn, L., & Goldsberry, K. (2014, February). POINTWISE: Predicting points and valuing decisions in real time with NBA optical tracking data. In 8th Annual MIT Sloan Sports Analytics Conference (Vol. 28).
NB. The icons used were made by Freepik from www.flaticon.com