Do Certain NCAA Basketball Systems Generate NBA Stars More Often? (3 OF 3)

Nicholas Canova

In our first two posts, we introduced the UNC case competition and discussed our clustering and play-type analyses of NCAA teams. In this third and final post on the topic, we present a simpler analysis, a regression of players’ NCAA statistics in predicting NBA win shares (WS). Asking ourselves the question “can we predict NBA performance solely looking at a player’s NCAA statistics” lends itself to such an approach. While this analysis does not answer directly the case question, which asked specifically about systems generating superstars, it was nonetheless an interesting analysis to perform. Our approach was as follows:

  • For all players who played in the NCAA and were drafted into the NBA in the drafts from 2004 – 2012, download their advanced statistics for their most recent NCAA season, as well as their offensive and defensive win shares (oWS, dWS) over their first 4 years in the NBA, all from basketball-reference. These regressions will be used to predict NBA oWS and dWS as a function of a player’s advanced NCAA statistics.
  • Since different statistics may be more useful for predicting success at different positions, we then split the downloaded data into 10 separate datasets, grouping players first by position, and then within position splitting up each player’s offensive and defensive statistics.
  • For each of the 10 datasets, we ran a 5-fold cross-validated lasso regression, fitting defensive statistics to actual dWS, and offensive statistics to actual oWS. This created the regression equations that could be used for prediction.
  • With these fitted regressions, we predicted oWS and dWS for current NCAA players based on their NCAA stats, and created confidence intervals for these predictions.

The last 2 bullets make the analysis sound more complex than it actually is. It’s not. Lasso regressions are similar to simple linear regression analyses with the added advantage that they will remove the NCAA statistics that have little use predicting dWS and oWS. That is, if we fit a regression using 10 statistics to predict oWS, the resulting regression equation will probably have fewer than the 10 statistics, whereas a simple linear regression will always keep all 10. Further, 5-fold cross-validation is simply a technique that helps improve the predictive ability of regressions.

To predict oWS, we used these advanced offensive statistics:

  • Effective field goal % (eFG%)
  • Offensive rebound % (ORB%)
  • Assist % (AST%)
  • Turnover % (TOV%)
  • Usage % (USG%)
  • Points per shot (PPS)
  • Offensive rating (ORtg)
  • Floor impact counter (FIC)
  • Player efficiency rating (PER)

And to predict dWS, we used these advanced defensive statistics:

  • Defensive rebound % (DRB%)
  • Steal % (STL%)
  • Block % (BLK%)
  • Defensive rating (DRtg)
  • Floor impact counter (FIC)

To get a sense for the results, 2 of the 10 regression outputs are provided below. To use the output to estimate the number of oWS for an NCAA small forward, we simply use the formula -52.84 + 17.76*(eFG%) + 0.45*(ORtg) – 0.15*(PER), plugging in the player’s actual statistics where appropriate.

oWS dWS regression

Across all 10 regression outputs, we noticed a few trends. For predicting oWS, at any position, ORtg was the most prevalent predictor, and the same holds for DRtg when predicting dWS. Despite their limitations, I have been a fan of ORtg and DRtg for some time, and it was reassuring to see the lasso regressions consider these variables as the most predictive. Next, most of the 10 regressions kept between 2-4 predictors. For predictions of oWS, this means not using 6-8 of the statistics at all. The high correlation between variables (a high eFG% typically is associated with a high ORtg), which is not good when running lasso regressions, likely explains part of why so many statistics were not kept. Also, none of the regressions were too accurate, with r-squared values mostly between 0.2 and 0.35.

With the regression outputs on hand, and the NBA draft this evening, we next predicted overall WS for each of the players ranked in the top 30 of the draft. We present this table below, using the most recent mock draft from and excluding estimates for international players in the mock draft. Note that while standard errors for each coefficient are shown in the regression output, the overall regression standard errors, which are a measure of reliability of the estimates as a whole (rather than an accuracy of each coefficient), are not shown. These regression standard errors allow us to create confidence intervals around our projections, effectively saying “with X% certainty, we believe this player’s WS will be between these two numbers).


As is fairly clear, these confidence intervals are very wide, and it is our opinion that the output from the regression analysis would not be able to assist a GM on draft night in identifying who to draft. The expected WS range widely and seemingly random of expected draft position, and the confidence intervals range from bust to superstar for most players.

Reflecting on this analysis, it seems we did not make enough adjustments or have enough data to perform a more accurate regression analysis. We lacked potentially useful statistics such as a player’s height, weight, conference / strength of schedule, and minutes played in his final NCAA basketball season, only used each player’s final NCAA basketball season statistics rather than their entire NCAA career statistics, and did not account for injuries after a player was drafted, which could make an otherwise accurate prediction appear grossly inaccurate. Further, while splitting the downloaded data into separate datasets for positions, offense, and defense, we effectively reduced an already small sample size for a regression analysis (~450 players drafted in the timeframe analyzed) into 5 even smaller sample sizes (~90 players drafted at each position in the timeframe analyzed), which probably hurt the accuracy of a regression analysis more than it helped.

It is worth noting that, despite this missing data and the lack of adjustments, we believe an improved regression analysis of a similar format would still result in shortcomings. Despite the occasional high draft pick that becomes a bust, NBA scouts do a very good job, probably better than the other 3 major sports, of identifying the best young talent and making sure they get drafted in the correct draft spot. This analysis then helped us to realize what NBA scouts and front office personnel have probably known for quite some time, which is that we cannot and should not assess a player solely based on their NCAA statistics.


As an extra, we toss in one last graph showing the performance of international players relative to their draft position. We will leave to you to interpret the graph, and will just add that blue markers represent players picked in the top 10, red markers are players picked from 11-60, and the 30th overall pick would have expected win shares of 4.5 given that draft position. With this, are international players typically a good pick? What percentage of international top 10 picks exceeded expectations based on their draft slot? What range of picks does it appear that teams have been able to find success drafting international players?

Intl Players

Thanks for reading, we hope you enjoyed.


Do Certain NCAA Basketball Systems Generate NBA Stars More Often? (2 OF 3)

Nicholas Canova

In our first post, we introduced this year’s UNC Basketball Analytics Summit case competition and began by classifying NBA players as superstars and busts based on their first 4 years performance in the NBA, as well as assessing net win shares (net WS) for each drafted player. In this second post, we begin by discussing our clustering of NCAA teams by play-types, and move to analyzing play-types further for trends across each position. We believe these to be our most interesting analyses, and this post will likely be a few paragraphs longer than our first and third posts. We will do our best to keep the longer post interesting.

Likely the most important question we had to ask and answer throughout the contest was “How should we quantitatively group NCAA teams into systems?” Since the case question specifically asked about certain types of systems, however left to us how to define on our own what exactly a system is, we thought long on this and came up with three strong possibilities:

  • Could we cluster teams by the general offensive strategy they use? For example, does Duke primarily run a triangle offense, motion offense, Princeton offense, pick and roll offense, etc.? What about UNC, Kentucky and Gonzaga? What about every small-conference D-I school?
  • Could we cluster teams by looking at teams’ coaches? NCAA coaching turnover is much lower than NBA coaching turnover, and if certain NCAA coaches are more likely to run the same system each year, this may be useful for clustering.
  • Could we cluster teams by the play-types a team runs most frequently? Is there play-type data, and if we could obtain it, could we see which teams run certain plays more or less frequently than other teams?

We considered the first option as too subjective of an analysis. Given that we needed to classify both current as well as historical NCAA teams, we considered this to be an unreasonable and likely inaccurate approach. We also considered the second option as highly subjective, as well as too incomplete. Grouping similar coaches by coaching style leaves much to an eye test and little to a more quantitative analysis of the offenses strategy. This left the third option, a clustering of teams by the frequency with which they ran each type of play. Using play-by-play data from Synergy Sports from 2006 – 2015, we were able to pull the percentage of plays of each of the 11 offensive play-types (see below for the different play-types) for each NCAA team for each season. We then wrote a k-nearest neighbors clustering algorithm that treated each team-season’s breakdown of play-types ran as an 11-dimensional vector and separated teams into 8 clusters based on the euclidian difference of these play-type vectors. All this means is that teams that ran similar plays at a similar frequency are grouped into the same cluster, which is much simpler than my previous sentence.

All play types

The set of 11 tables above summarizes the results from our initial clustering. Each table represents one of the 11 play-types, and each of the 8 bars within each table represents the percentage of that play ran by teams in that cluster. For example, looking below at the 11th table for the spot up play-type, we see that teams in the 5th cluster ran close to 35% of their plays as spot-up plays, whereas teams in the 6th cluster ran less than 20% of their plays as spot-up plays. Spot Up

With this clustering of teams, we could then ask ourselves what types of plays are being run more or less frequently by systems that are generating star and bust players. The table below summarizes our initial findings, and shows that clusters 4, 6, and 7 generated the best ratios of stars to busts and also had the highest net WS per player, whereas clusters 5 and 8 performed poorly. The descriptions column attempts to give a play-type description of what differentiates each cluster the most. Looking at the 7th cluster, whose teams ran a higher percentage of isolation plays and was otherwise fairly balanced, we see that this cluster included 59 teams that sent at least 1 player to the NBA, 9 players of which became stars and 6 of which became busts based on our earlier criteria, and whose drafted players on average outperformed their draft position expected WS by 1.912 per player across the players drafted from those 59 teams.Cluster Performance

In terms of net WS per player, 2 of the 3 strongest performing clusters feature offenses that emphasize isolation plays, whereas both of the 2 weakest performing clusters de-emphasize isolation plays. Further, the strongest cluster de-emphasizes spot up shooting whereas the weakest cluster emphasizes spot up shooting. We leave to you to compare further this table and the play-type graphs to reveal other patterns of over- and under-performance of certain clusters of teams by play-types.

Extending this sort of analysis, we next took a look at the offensive tendencies of those systems that superstars and busts came from, at each position on the court. That is to say, we expect that teams with very good players at specific positions would lean their offensive strategies more towards play-types featuring these players. Wouldn’t NCAA teams with elite centers run more post-up plays? Do teams with elite point guards push the ball more in transition? The graphs below answer these questions, with interpretation of the graphs as follows – there are 5 graphs, 1 for each position. Each graph features the 11 play-types shown earlier, and for each play-type both a red bar that displays whether the NCAA teams of players that became NBA stars at that position ran a higher or lower percentage of each play-type than the offenses of players that were drafted but did not become NBA stars at that position, and a blue bar that displays whether the NCAA teams of players that became NBA busts at that position ran a higher or lower percentage of each play-type than the offenses of players that were drafted but did not become NBA busts at that position… these graphs are a bit difficult to explain and can be difficult to draw insights from, so maybe read that last sentence again, and let’s look at the graphs to understand more.

Star PF Star SG
Star SGStar CStar PG

Looking at the bottom graph, on point guards, we see that NCAA teams whose point guard was drafted and became an NBA star ran transition plays roughly 18% more frequently than did NCAA teams whose point guard was drafted but did not become an NBA star. Alternatively, NCAA teams whose point guard was drafted and became an NBA bust ran transition plays 33% less frequently than did NCAA teams whose point guard was drafted but did not become an NBA bust. This makes sense intuitively, as teams with star point guards should be more willing to push the ball in transition, trusting their talented point guard to make good decisions with the ball. The first graph, on power forwards, makes intuitive sense too, where we see the teams with star power forwards run fewer spot up shooting plays (not typically a play featuring the power forward in college) and more post up plays. Again, we leave to you to dig more nuggets of insight from the graphs and make connections with what plays we would expect a team to favor given stars at certain positions.

With this, we wrap up the second post, which I hope was as interesting for you to read as it was for me to type out. Our third post will follow shortly, with our last analyses and concluding thoughts on the competition.

Do Certain NCAA Basketball Systems Generate NBA Stars More Often? (1 of 3)

Nicholas Canova

In April 2016, the University of North Carolina hosted its annual Sports Analytics Summit, featuring a series of excellent guest speakers, from Dean Oliver to Ken Pomeroy, as well as a case competition that challenged teams to analyze the effects of NCAA basketball systems on generating star NBA players. More specifically, the case challenged participants to answer the question “Are there certain types of systems (offensive and/or defensive) that work to best identify future NBA superstars?” Our team of four entered the competition, focusing on the impact of offensive systems specifically, and we present here our core analyses answering the question and thoughts throughout the process.

Given the open-endedness of the challenge, we asked ourselves several initial questions including (1) what constitutes an NBA superstar and bust player, (2) how could we categorize NCAA basketball teams into different systems, and (3) what analyses / metrics could we look at that may indicate an NCAA player is more likely to become an NBA superstar or bust than is already expected for that player. We will address the majority of our work in detail over 3 short posts, highlighting some of the key assumptions in this first post. Looking at each of these 3 questions in detail should give a fairly thorough review of our overall analysis.

First, what constitutes an NBA superstar? We considered several metrics for classifying superstars, including a player’s number of all-star appearances, his box score stats both for impressiveness and consistency, performance in the NBA playoffs, etc., however we ultimately selected a player’s total win shares (WS) over the first 4 years of his career as the sole metric to classify a star player, which brings up a key factor of our analysis. Since an underlying focus of the analysis is helping teams identify NBA superstars (the case competition was hosted and judged by the Charlotte Hornets), we looked only at player performance over the first 4 years of their career after being drafted, which is the time period during which they are contractually tied to a team before reaching free agency. Mentions of total WS throughout the post should be read as a player’s total WS over his first 4 years after being drafted. Since a player’s likelihood of becoming a superstar is of course closely tied to his first 4 years of performance, we did not see this focus as limiting. As for the cutoff, we selected 20 WS over a player’s first 4 years. WS assesses a player’s overall NBA value in terms of the share of their teams’ wins each player is accountable for, and serves well in determining superstar players.

Stars         Busts

Second, what constitutes an NBA bust? We considered this question more challenging to quantify than the question on superstars, believing we could not look at WS alone on an absolute basis. Think about it this way – is a 60th overall pick with 0 WS a greater or lesser bust than a 1st overall pick with 5 WS? (5 WS over 4 years is very low for a top 10 pick – Greg Oden, highly considered one of the NBA’s premier bust players, even had 6.8 WS, whereas a star player such as Kevin Durant had 38.3 over this period). As expected, we consider that 1st overall pick to be the bigger bust than the 60th pick, due to the higher expectations put on top draft picks. More specifically, we considered any player drafted in the top 20 overall, with fewer than 8 total WS, whose WS were more than 6 fewer than what would have been expected given their draft position as a bust player. Both cutoffs for NBA superstar and bust seem arbitrary, but were selected them such that 5% – 10% of all players drafted were classified as stars and busts, respectively. The tables above highlight several of the star and bust players taken in the drafts between 2006 – 2012, and the players included in each table seems reasonable and passes a reasonableness test. Since this analysis requires 4 years of NBA WS data, we did not look at players drafted more recently than 2012, and lacked certain data earlier than 2006.

The last item we’d like to highlight in this post is clarifying what is meant by “WS were more than 6 fewer than what would have been expected given their draft position”. We will refer to total WS in excess of expected WS as net WS, and it is calculated based on the difference between actual WS and the expected number of WS given a player’s draft position. The graph below shows historically the average number of win shares in a player’s first 4 seasons at each draft position, with a line of best fit. We can use the graph’s line of best fit to estimate how many WS we expect a player to have then, given their draft position. For a player to over-perform their draft position, he would need to earn more WS than what the best fit line estimates. Going back to our earlier example, 1st overall pick Greg Oden would be expected to earn (-5.789 * ln(1) + 24.2) = 24.2 WS win shares, however only earned 6.8 WS, for a net WS of -17.4 As for Kevin Durant, his actual WS of 38.3 vs. expected WS given draft position of 20.2 resulted in a net WS of 18.1.

WS vs Draft Pick

With this basic foundation laid down, in the next post we will begin to look at our main clustering analysis of NCAA systems based on play-types, and extend this clustering analysis to the college systems of those players we’ve classified as stars and busts using the criterion above.