In our first two posts, we introduced the UNC case competition and discussed our clustering and play-type analyses of NCAA teams. In this third and final post on the topic, we present a simpler analysis, a regression of players’ NCAA statistics in predicting NBA win shares (WS). Asking ourselves the question “can we predict NBA performance solely looking at a player’s NCAA statistics” lends itself to such an approach. While this analysis does not answer directly the case question, which asked specifically about systems generating superstars, it was nonetheless an interesting analysis to perform. Our approach was as follows:
- For all players who played in the NCAA and were drafted into the NBA in the drafts from 2004 – 2012, download their advanced statistics for their most recent NCAA season, as well as their offensive and defensive win shares (oWS, dWS) over their first 4 years in the NBA, all from basketball-reference. These regressions will be used to predict NBA oWS and dWS as a function of a player’s advanced NCAA statistics.
- Since different statistics may be more useful for predicting success at different positions, we then split the downloaded data into 10 separate datasets, grouping players first by position, and then within position splitting up each player’s offensive and defensive statistics.
- For each of the 10 datasets, we ran a 5-fold cross-validated lasso regression, fitting defensive statistics to actual dWS, and offensive statistics to actual oWS. This created the regression equations that could be used for prediction.
- With these fitted regressions, we predicted oWS and dWS for current NCAA players based on their NCAA stats, and created confidence intervals for these predictions.
The last 2 bullets make the analysis sound more complex than it actually is. It’s not. Lasso regressions are similar to simple linear regression analyses with the added advantage that they will remove the NCAA statistics that have little use predicting dWS and oWS. That is, if we fit a regression using 10 statistics to predict oWS, the resulting regression equation will probably have fewer than the 10 statistics, whereas a simple linear regression will always keep all 10. Further, 5-fold cross-validation is simply a technique that helps improve the predictive ability of regressions.
To predict oWS, we used these advanced offensive statistics:
- Effective field goal % (eFG%)
- Offensive rebound % (ORB%)
- Assist % (AST%)
- Turnover % (TOV%)
- Usage % (USG%)
- Points per shot (PPS)
- Offensive rating (ORtg)
- Floor impact counter (FIC)
- Player efficiency rating (PER)
And to predict dWS, we used these advanced defensive statistics:
- Defensive rebound % (DRB%)
- Steal % (STL%)
- Block % (BLK%)
- Defensive rating (DRtg)
- Floor impact counter (FIC)
To get a sense for the results, 2 of the 10 regression outputs are provided below. To use the output to estimate the number of oWS for an NCAA small forward, we simply use the formula -52.84 + 17.76*(eFG%) + 0.45*(ORtg) – 0.15*(PER), plugging in the player’s actual statistics where appropriate.
Across all 10 regression outputs, we noticed a few trends. For predicting oWS, at any position, ORtg was the most prevalent predictor, and the same holds for DRtg when predicting dWS. Despite their limitations, I have been a fan of ORtg and DRtg for some time, and it was reassuring to see the lasso regressions consider these variables as the most predictive. Next, most of the 10 regressions kept between 2-4 predictors. For predictions of oWS, this means not using 6-8 of the statistics at all. The high correlation between variables (a high eFG% typically is associated with a high ORtg), which is not good when running lasso regressions, likely explains part of why so many statistics were not kept. Also, none of the regressions were too accurate, with r-squared values mostly between 0.2 and 0.35.
With the regression outputs on hand, and the NBA draft this evening, we next predicted overall WS for each of the players ranked in the top 30 of the draft. We present this table below, using the most recent mock draft from hoopshype.com and excluding estimates for international players in the mock draft. Note that while standard errors for each coefficient are shown in the regression output, the overall regression standard errors, which are a measure of reliability of the estimates as a whole (rather than an accuracy of each coefficient), are not shown. These regression standard errors allow us to create confidence intervals around our projections, effectively saying “with X% certainty, we believe this player’s WS will be between these two numbers).
As is fairly clear, these confidence intervals are very wide, and it is our opinion that the output from the regression analysis would not be able to assist a GM on draft night in identifying who to draft. The expected WS range widely and seemingly random of expected draft position, and the confidence intervals range from bust to superstar for most players.
Reflecting on this analysis, it seems we did not make enough adjustments or have enough data to perform a more accurate regression analysis. We lacked potentially useful statistics such as a player’s height, weight, conference / strength of schedule, and minutes played in his final NCAA basketball season, only used each player’s final NCAA basketball season statistics rather than their entire NCAA career statistics, and did not account for injuries after a player was drafted, which could make an otherwise accurate prediction appear grossly inaccurate. Further, while splitting the downloaded data into separate datasets for positions, offense, and defense, we effectively reduced an already small sample size for a regression analysis (~450 players drafted in the timeframe analyzed) into 5 even smaller sample sizes (~90 players drafted at each position in the timeframe analyzed), which probably hurt the accuracy of a regression analysis more than it helped.
It is worth noting that, despite this missing data and the lack of adjustments, we believe an improved regression analysis of a similar format would still result in shortcomings. Despite the occasional high draft pick that becomes a bust, NBA scouts do a very good job, probably better than the other 3 major sports, of identifying the best young talent and making sure they get drafted in the correct draft spot. This analysis then helped us to realize what NBA scouts and front office personnel have probably known for quite some time, which is that we cannot and should not assess a player solely based on their NCAA statistics.
As an extra, we toss in one last graph showing the performance of international players relative to their draft position. We will leave to you to interpret the graph, and will just add that blue markers represent players picked in the top 10, red markers are players picked from 11-60, and the 30th overall pick would have expected win shares of 4.5 given that draft position. With this, are international players typically a good pick? What percentage of international top 10 picks exceeded expectations based on their draft slot? What range of picks does it appear that teams have been able to find success drafting international players?
Thanks for reading, we hope you enjoyed.