MLB Contract Evaluations
A self-created data analysis project | Fall 2023
Main Focuses: Data Compilation, Python
Introduction
This project was the final project for the CareerFoundry Data Analytics certification course. The requirements were to create a self-driven data analysis project in which students chose their own topic, defined business requirements, sourced their own data, and analyzed that data accordingly. Methods used during this project include spatial analysis, exploratory data analysis (EDA), and an introduction to machine learning.
In professional sports, athletes are signed to multi-million dollar contracts, some even surpassing nine figures. This project aims to evaluate the performance of Major League Baseball players compared to their contract salary to assess player value.
The goal of this project is to determine whether MLB athletes who have signed contracts worth over fifty million dollars produce the results expected of them. In short, does an all-star paycheck correlate with an all-star player? This will be done by comparing player contracts to each player's corresponding WAR (wins above replacement) value.
We will aim to answer the following question:
Do players who make more money produce better results?
Definitions
Wins Above Replacement (WAR)
This is an excerpt from Major League Baseball’s website that explains the function of WAR in baseball:
Definition
WAR measures a player's value in all facets of the game by deciphering how many more wins he's worth than a replacement-level player at his same position (e.g., a Minor League replacement or a readily available fill-in free agent).
For example, if a shortstop and a first baseman offer the same overall production (on offense, defense and the basepaths), the shortstop will have a better WAR because his position sees a lower level of production from replacement-level players.
The formula
For position players: (The number of runs above average a player is worth in his batting, baserunning and fielding + adjustment for position + adjustment for league + the number of runs provided by a replacement-level player) / runs per win
For pitchers: Different WAR computations use either RA9 or FIP. Those numbers are adjusted for league and ballpark. Then, using league averages, it is determined how many wins a pitcher was worth based on those numbers and his innings pitched total.
Note: fWAR refers to Fangraphs' calculation of WAR. bWAR or rWAR refer to Baseball-Reference's calculation. And WARP refers to Baseball Prospectus' statistic "Wins Above Replacement Player." The calculations differ slightly -- for instance, fWAR uses FIP in determining pitcher WAR, while bWAR uses RA9. But all three stats answer the same question: How valuable is a player in comparison to replacement level?
Why it's useful
WAR quantifies each player's value in terms of a specific numbers of wins. And because WAR factors in a positional adjustment, it is well suited for comparing players who man different defensive positions.
MLB Advanced Stats, Wins Above Replacement
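As a rough illustration of the position-player formula quoted above, the sketch below plugs in made-up run values. Every input is hypothetical, the function name is ours, and the divisor of 10 runs per win is only a commonly cited rule of thumb, not a value taken from this project's data.

```python
def position_player_war(runs_above_average, positional_adj, league_adj,
                        replacement_runs, runs_per_win=10.0):
    """Sum the run components from the quoted formula and convert runs to wins."""
    total_runs = runs_above_average + positional_adj + league_adj + replacement_runs
    return total_runs / runs_per_win


# Hypothetical season: +25 runs from batting, baserunning and fielding combined,
# +5 positional adjustment, +2 league adjustment, +20 replacement-level runs.
example_war = position_player_war(25.0, 5.0, 2.0, 20.0)
print(round(example_war, 1))  # (25 + 5 + 2 + 20) / 10 = 5.2
```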
High-dollar contract
For the purposes of this project, a high-dollar contract is one that has a value of over $50,000,000 USD.
Data and Tools
Data
Player Performance Data
up to date as of 08/15/2023
compiled from Baseball Reference, see Sources and References for more information.
Player Contract Data
up to date as of 07/31/2023
Data Limitations
The player contracts in question for this project are only the 320 highest salaried contracts in baseball history. Inflation has not been accounted for.
There are many statistics that can be used to evaluate player value. For simplicity, this project only considers WAR; however, this is by no means the only way to evaluate player value.
It should be noted for position players, WAR only evaluates offensive performance. There is some difference in the calculation of WAR for pitchers and position players so this could potentially create different replacement assessments.
So as to not violate terms of services of Spotrac.com or Baseball-Reference, I will not be making the data from this project available. This project is for learning purposes only.
Tools
Tools used for this project include Excel, Tableau, Python and relevant libraries (pandas, NumPy, matplotlib, seaborn, plotly, folium, and sklearn).
Techniques Applied
Creating business requirements
Sourcing and compiling data
Data wrangling and consistency checks
Exploratory visual analysis
Spatial analysis
Machine Learning
Creating visualizations and presenting results
Steps Taken
Defining the Project and Sourcing Data
To source the data, the highest salaried player contracts in baseball first had to be identified; this information was found on Spotrac.com. After the list of players was compiled, player performance statistics were needed. For this project, we looked at:
Player name
The team a player was signed to
The team they played for
Contract length (years)
Total salary (USD)
Actual salary (what a given player made in a given year)
Games played
Position played
WAR (wins above replacement)
Player value statistics were obtained from Baseball-Reference for the players in question. See Sources and References for more information.
At this point there were two datasets: one with player contract information and one with player statistics.
Data Cleaning and Merging
The data cleaning for this project was done in Python.
The initial datasets had over twelve columns each. However, as only certain information was needed, a majority of these columns were dropped.
Standard data wrangling and consistency checks were performed on the datasets, including:
Renaming columns and changing data types
Ensuring consistent formatting, addressing mixed data types, missing values, and duplicate values
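The checks listed above might look something like the following in pandas. This is a minimal sketch on invented rows: the raw column names, the sample values, and the decision to drop rows with a missing salary are all assumptions, since the original data and the exact handling of missing values are not published.

```python
import pandas as pd

# Hypothetical raw contract rows (not the project's data).
contracts_raw = pd.DataFrame({
    "Player": ["A. Example", "A. Example", "B. Sample", "C. Demo"],
    "Year": ["2021", "2021", "2022", "2022"],
    "Total Salary": ["$60,000,000", "$60,000,000", "$75,500,000", None],
})

contracts = (
    contracts_raw
    .rename(columns={"Player": "player_name", "Year": "year",
                     "Total Salary": "total_salary"})
    .drop_duplicates()                  # remove exact duplicate rows
    .dropna(subset=["total_salary"])    # assumption: drop rows missing a salary
    .copy()
)

# Change data types: year to integer, salary text such as "$60,000,000" to a float.
contracts["year"] = contracts["year"].astype(int)
contracts["total_salary"] = (
    contracts["total_salary"].str.replace(r"[$,]", "", regex=True).astype(float)
)

print(contracts.dtypes)
print(contracts)
```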
The next step was to determine which variables to merge the datasets on; the key was the combination of player_name and year. The two datasets were then merged in Python.
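A merge on that two-column key could be sketched as below. The rows are invented and the column names other than player_name and year are assumptions. The outer join with an indicator column is just one way to see which rows failed to match before keeping the matched ones; the project may have merged differently.

```python
import pandas as pd

# Hypothetical contract and performance tables, one row per player per year.
contracts = pd.DataFrame({
    "player_name": ["A. Example", "A. Example", "B. Sample"],
    "year": [2021, 2022, 2022],
    "actual_salary": [20_000_000, 20_000_000, 25_000_000],
})
performance = pd.DataFrame({
    "player_name": ["A. Example", "A. Example", "C. Demo"],
    "year": [2021, 2022, 2022],
    "war": [4.1, 2.3, 1.0],
})

# Outer merge with an indicator column to inspect unmatched rows.
# validate raises an error if the key is duplicated on either side.
merged_all = contracts.merge(
    performance,
    on=["player_name", "year"],
    how="outer",
    indicator=True,
    validate="one_to_one",
)
print(merged_all["_merge"].value_counts())

# Keep only player-seasons present in both datasets.
merged = merged_all[merged_all["_merge"] == "both"].drop(columns="_merge")
print(merged)
```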
Exploring Relationships: Correlations and Plots
During this step, the relationships between the variables in the dataset were initially examined through exploratory data analysis.
Other methods used included creating correlation heatmaps, scatterplots, pair plots and categorical plots.
Surprisingly, none of the variables were particularly strongly correlated; they were loosely related at best.
Here is an example of a visualization created during this process.
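The correlation step described above can be sketched with pandas alone. The data here is randomly generated and the column names are assumptions, so the printed values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged dataset (independent random columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "contract_length": rng.integers(1, 13, size=n),
    "total_salary": rng.uniform(50e6, 400e6, size=n),
    "games_played": rng.integers(0, 163, size=n),
    "war": rng.normal(2.0, 2.0, size=n),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr.round(2))

# Largest absolute correlation, ignoring each variable's correlation with itself.
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
strongest = off_diagonal.abs().max().max()
print("Strongest off-diagonal correlation:", round(strongest, 2))
```

Passing `corr` to seaborn's `heatmap` function would render the matrix as a correlation heatmap of the kind mentioned above.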
Spatial Analysis
Next, a point map was created using Tableau to better visualize the data. The result of this spatial analysis can be seen in the Analysis and Findings section below.
Linear Regression
Note: You can see the visual results of this regression, recreated in Tableau, in the Analysis and Findings section below.
A simple regression analysis was performed on the dataset to test the following hypothesis:
As total salary increases, the WAR value of a player increases as well.
Here, total salary represents the independent variable, X, and WAR represents the dependent variable, y. Next, the variables were reshaped into NumPy arrays and the data was split into training and test sets. Then, a linear regression was run on the data and a scatterplot was created to show the results of the linear regression on the test set. Finally, the model performance statistics (mean squared error and R2 score) were checked. Here is an example of the code used to do this:
from sklearn.metrics import mean_squared_error, r2_score

# Create objects that contain the model summary statistics.
mse = mean_squared_error(y_test, y_predicted)  # This is the mean squared error (not its root).
r2 = r2_score(y_test, y_predicted)  # This is the R2 score.
# Print the model summary statistics.
print('Slope:', regression.coef_)
print('Mean squared error:', mse)
print('R2 score:', r2)
After performing this regression, the hypothesis was confirmed.
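For context, the steps that lead up to those summary statistics (reshaping, splitting, fitting, and predicting) might look like the sketch below. The data is synthetic, salaries are expressed in millions of USD for readability, and the 30% test split and random seeds are assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a weak positive trend between salary and WAR plus noise.
rng = np.random.default_rng(42)
n = 300
total_salary_musd = rng.uniform(50, 400, size=n)
war = 1.0 + 0.005 * total_salary_musd + rng.normal(0, 2.0, size=n)

# Reshape into column arrays: X is the independent variable, y the dependent one.
X = total_salary_musd.reshape(-1, 1)
y = war.reshape(-1, 1)

# Split into training and test sets (assumed 70/30 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training set, then predict on the held-out test set.
regression = LinearRegression()
regression.fit(X_train, y_train)
y_predicted = regression.predict(X_test)

mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)
print('Slope:', regression.coef_)
print('Mean squared error:', mse)
print('R2 score:', r2)
```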
Cluster Analysis
A k-means clustering algorithm was run on the dataset. As there is great variability in some of the variables, the first step was to standardize the data. This was done using the scikit-learn StandardScaler.
Next, the elbow technique was used to determine the number of clusters to use for the k-means algorithm. It was determined that five clusters would give the best results. Then, the k-means algorithm was performed.
A new column was created within the dataframe with the resulting clusters so descriptive statistics could be calculated for the clusters to analyze them.
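The clustering workflow described above (standardize, inspect the elbow, fit five clusters, attach labels, summarize) could be sketched as follows. The data is random, the column names are assumptions, and the range of k values tried and the seeds are illustrative choices, not the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns of the merged dataset.
rng = np.random.default_rng(1)
n = 250
df = pd.DataFrame({
    "contract_length": rng.integers(1, 13, size=n).astype(float),
    "total_salary": rng.uniform(50e6, 400e6, size=n),
    "games_played": rng.integers(0, 163, size=n).astype(float),
    "war": rng.normal(2.0, 2.0, size=n),
})

# Standardize so that large-valued columns (salary) do not dominate distances.
scaled = StandardScaler().fit_transform(df)

# Elbow technique: record the within-cluster sum of squares for each k.
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(scaled)
    inertias.append(model.inertia_)
print([round(value, 1) for value in inertias])

# Fit the final model with five clusters, as chosen in the project.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(scaled)

# Descriptive statistics per cluster.
cluster_stats = df.groupby("cluster").agg(["mean", "median"])
print(cluster_stats)
```

Plotting `inertias` against k and looking for the point where the curve flattens is the visual check usually meant by the elbow technique.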
Analysis and Findings
These visualizations were made using Tableau.
Spatial Analysis: Player Contracts by MLB Teams
This map shows the locations of all MLB teams represented by colored circles. The size of the circle represents the WAR value of teams based on the contracts. The color of the circle represents the amount of money that teams have spent in contracts.
From a visual inspection of this map, we can see that teams that spend more money appear to have higher WAR values.
The New York Yankees have spent 16 billion dollars on high-dollar contracts alone and have the largest WAR value of 336. Conversely, the Oakland A's have spent 396 million dollars for a WAR value of 9.2. However, there are anomalies, such as the St. Louis Cardinals, who have the second highest WAR value but have spent significantly less money than other teams...
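Team-level totals like the ones behind this map can be produced by grouping player-season rows by team. The sketch below assumes that the map values are simple sums of salary and WAR per team, which the write-up does not state explicitly; the teams, column names, and numbers are invented.

```python
import pandas as pd

# Hypothetical player-season rows (not the project's data).
player_seasons = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team C", "Team C"],
    "actual_salary": [30_000_000, 25_000_000, 20_000_000, 28_000_000, 22_000_000],
    "war": [5.0, 3.5, 1.2, 4.0, -0.5],
})

# One row per team: total spend (circle color) and total WAR (circle size).
team_totals = player_seasons.groupby("team", as_index=False).agg(
    total_spend=("actual_salary", "sum"),
    total_war=("war", "sum"),
)
print(team_totals.sort_values("total_war", ascending=False))
```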
Linear Regression: Total Player Salary vs WAR
Hypothesis:
As total salary increases, so does WAR.
Regression Interpretation:
This is a significant result (p-value less than 0.05) and the hypothesis can be confirmed.
However, there is a weak correlation (r = 0.17) between these two variables.
This indicates that there are other factors that contribute to player contracts.
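To put the reported correlation in perspective: for a simple linear regression with an intercept, the in-sample R2 equals the square of r, so r = 0.17 corresponds to roughly 3% of the variation in WAR being explained by total salary. The sketch below does that arithmetic and then checks the identity on synthetic data; the generated numbers are illustrative only. An R2 score computed on a separate test set will generally differ somewhat from this in-sample value.

```python
import numpy as np

# Arithmetic on the reported correlation.
r_reported = 0.17
print(round(r_reported ** 2, 4))  # 0.0289, i.e. about 3% of variance explained

# Check on synthetic data that r squared matches the in-sample R2.
rng = np.random.default_rng(7)
x = rng.uniform(50, 400, size=500)
y = 0.004 * x + rng.normal(0, 2.0, size=500)

r = np.corrcoef(x, y)[0, 1]

# Ordinary least squares line with an intercept.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

print(round(r ** 2, 6), round(r_squared, 6))
```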
Linear Regression: Actual Player Salary vs Contract Length
Hypothesis:
As contract length increases, so does actual salary.
Regression Interpretation:
This is a significant result (p-value less than 0.05) and the hypothesis can be confirmed.
There is also a weak correlation (r = 0.20) between these two variables.
This is interesting because it would suggest that, for some reason, a player didn't receive their total salary. Possible reasons for this could be a player being traded, being cut, or retiring.
Answering Initial Questions
Do players who make more money produce better results?
Yes. The analysis found that players who have larger contracts do in fact produce larger WAR values. However, it should be noted that there is only a weak relationship between these two variables. It should also be remembered that this project only analyzed player performance in the context of WAR and that there are many ways to assess value.
Retrospective
Further Consideration
If this project were repeated, we would reconsider the initial questions. Upon reflection, the scope of the project is limited and could be expanded upon and explored further. It would be interesting to explore player trades, injuries, free agency options, and/or retirement. Also, evaluating players on WAR alone is limiting in and of itself. For position players, a more well-rounded evaluation could be done by considering additional statistics such as Ultimate Zone Rating (UZR) and wOPA+. For pitchers, ERA+ or FIP could also be considered. See the Sabermetrics Library from FanGraphs to learn more about these particular statistics.
During this project, we looked solely at the performance of players with high dollar contracts. This possibly led to an analysis that was in a vacuum, so to speak, and didn't consider factors such as the performance of the players in this analysis compared to their teammates or other players in the league. This comparison could aid in evaluating the value of these players within their organization. Because this was not done, the analysis performed here may not be representative of the given players’ actual value.
Future Steps
One thing this project did not consider is what success means to different MLB organizations and how that definition affects player contracts. Major League Baseball is a huge business, and individual teams act as companies within it. Different teams have different priorities.
For example, success could mean different things to different teams, such as:
World Series Titles
Ticket Sales
Division Champions
Opinions of different General Managers and Owners
Two factors that we feel are very important to this question, but that were unfortunately outside the scope of this project, are the name recognition of teams and players and the veteran or rookie status of players, along with how each affects contracts.
These topics would all be worth exploring to get a better sense of success for different MLB organizations.
Sources and References
For all player data sourced from Baseball-Reference, follow this link.
All 30 MLB Teams: Location, Stadium and Website Information. MLB.Com. MLB Advanced Media, LP. www.mlb.com/team.
“MLB Active Player Contracts.” Spotrac.Com, Spotrac, www.spotrac.com/mlb/contracts/sort-value/all-time/limit-330/. Accessed 14 Aug. 2023.
Wins above replacement (WAR): Glossary. MLB.com. MLB Advanced Media, LP. https://www.mlb.com/glossary/advanced-stats/wins-above-replacement
Further Reading
Here are some interesting links for anyone interested in learning more about Wins Above Replacement:
Baseball-Reference WAR Explained
Pitcher WAR and Defensive Support
Making the Case for WAR as Baseball's Most Perfect Statistic
Thanks
Special thanks to my tutor, Ayya Elzarka, and my mentor, John Kocur, for all their feedback.
My brother, Patrick, for creating my love of baseball.