DATA TIME #1: PREDICTING NBA DRAFT NUMBER BY NBA PERFORMANCE

Matthew Clibanoff
7 min read · Jun 23, 2021

Hello. Welcome to Data Time, a blog series in which I will use computers to explain the world to you, a weary traveler/job recruiter. My brain is admittedly small (and approximately 83% pudding), but thanks to the magic of Pandas and scikit-learn even I can guide you through the complicated world of data science.

For my first post, I’ve decided to dig into a big pile of NBA statistics with the goal of figuring out the relationship between basketball statistics and where/if a player was drafted.

I wanted to understand how players perform in relation to their draft position.

Cleanin’ and Wranglin’

The data set, which I found here, contains the stats of every player who set foot on an NBA court from the 1996–1997 season through the 2019–2020 season.

Here’s what the columns looked like when I originally loaded the dataset:

Original dataframe head

As you can see, there are a lot of irrelevant columns. I started off by cutting out the high-cardinality columns like “college”, “country”, “draft_year”, and “season”. I also cut out “player_name”, as my goal was to determine the relationship between on-the-court statistics and where a player was drafted. The player himself was irrelevant. I cut out “team_abbreviation” for the same reason. Finally, I cut out “draft_round”. Since the NBA draft only has two rounds, I wanted to create my own categories based on “draft_number”.

I also removed the column “Unnamed: 0”, which was a strange duplication of my index. It didn’t seem to affect the data, but it was an eyesore.

Next, I cut every draft pick above 60 from the data and mapped every undrafted player to the number 61. Every draft since 1989 has been only two rounds. This seemed like an easy way to keep the data clean and cut out outliers like Šarūnas Marčiulionis, who was drafted 127th overall in 1987, but was already on the USSR national team.

Following this, it was time to map the data in the column “draft_number” to categories. They are as follows:

1–10 = Star Potential (better be good)

11–30 = Starter Potential (should be decent)

31–61 = Flyer (complete dart throw)

Last but not least, I cut out any player who played fewer than 15 games, as players with extremely small sample sizes can have pretty outlandish stats.

I jammed all this fun into a wrangle function and I was ready to start building my models.
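A minimal sketch of what that wrangle function could look like, based on the steps above. The column names come from the dataset as described, except the games-played column, which I’m assuming here is called “gp” — adjust it to match the actual data:

```python
import pandas as pd

def wrangle(df):
    """Clean the NBA stats dataframe per the steps described above.

    Assumes the games-played column is named "gp" (an assumption;
    rename to match the real dataset).
    """
    df = df.copy()

    # Drop high-cardinality and identity columns, plus the stray index copy.
    df = df.drop(
        columns=["Unnamed: 0", "college", "country", "draft_year", "season",
                 "player_name", "team_abbreviation", "draft_round"],
        errors="ignore",
    )

    # Map undrafted players to 61 and cut picks above 60 (pre-1989 drafts).
    df["draft_number"] = pd.to_numeric(df["draft_number"], errors="coerce")
    df["draft_number"] = df["draft_number"].fillna(61)
    df = df[df["draft_number"] <= 61]

    # Bin pick numbers into the three categories.
    df["draft_number"] = pd.cut(
        df["draft_number"],
        bins=[0, 10, 30, 61],
        labels=["Star Potential", "Starter Potential", "Flyer"],
    )

    # Cut players with fewer than 15 games played.
    df = df[df["gp"] >= 15]
    return df
```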

Matt Clibanoff, Modeling Agent

To explicitly state it for the record: my target is “draft_number”. Every other column was part of my feature matrix.

After completing my training/validation split, my baseline accuracy was 0.363.
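For a classification problem, the baseline accuracy is just the frequency of the most common class in the training labels — the score you’d get by always guessing the majority category. A sketch (the function name and split parameters are my own choices, not necessarily the ones used in the post):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_baseline(df, target="draft_number"):
    """Split into train/validation sets and compute the majority-class baseline."""
    X = df.drop(columns=target)
    y = df[target]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Baseline accuracy: share of the most common class in the training labels.
    baseline = y_train.value_counts(normalize=True).max()
    return X_train, X_val, y_train, y_val, baseline
```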

I then tried to create a model based on a random forest classifier. The results were less than ideal.

My training accuracy was 1.0 and my validation accuracy was 0.601.

Not only was my model overfitting, but the training and validation scores were so far apart I might as well have been using completely unrelated data for each. On the bright side, I beat my baseline.

At this point it was important to remember the first rule of ̶s̶e̶l̶f̶ ̶d̶e̶f̶e̶n̶s̶e̶ data science: Don’t Panic

NOT PANICKING

Next, I tried a logistic regression. This worked much better.

I got a training accuracy score of 0.516 and a validation accuracy score of 0.528.

At this point I was feeling good and figured it was time to make some graphs.

GRAPH TOWN

Before charting, I thought it might be interesting to split my graphs into two sections. The first section would consist of the stats I intuitively thought were the most important (in relation to my target). In the second section, I used the model’s feature importances to determine which three actually were, and graphed those as well. Here were my results:
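One way to get per-category “actually important” features from a fitted multinomial logistic regression is to rank the magnitude of each class’s coefficients (this only ranks fairly if the features were scaled first). A sketch — the function and argument names are mine:

```python
import numpy as np
import pandas as pd

def top_features_per_class(model, feature_names, k=3):
    """For a fitted multinomial LogisticRegression, return the k features
    with the largest-magnitude coefficients for each class."""
    coefs = pd.DataFrame(model.coef_, index=model.classes_, columns=feature_names)
    return {cls: coefs.loc[cls].abs().nlargest(k) for cls in model.classes_}
```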

Graphs of intuitively important features:

Graphs of actually important features (Star_Potential):

Star_Potential Importance

Graphs of actually important features (Starter_Potential):

Starter_Potential Importance

Graphs of actually important features (Flyer):

Flyer Importance

L’interprétation

As it turns out, predicting the most valuable features for “potential_star” players is pretty intuitive. I got two out of three right and assists would have been my fourth pick behind games played.

The three most important features in the “potential_starter” category were games played, height, and usage percentage. This last one makes a lot of sense, as usage percentage tracks how involved a player is in team plays. Coaches are hoping this percentage is high for their low-end starters/off-the-bench players.

As for flyers, the only positively correlated features are true shooting percentage, offensive rebound percentage, and assist percentage. The fact that these stats are all markers of efficiency makes a lot of sense considering these are the type of players who might only log five-to-ten minutes per game.

LAND OF CONFUSION*

Okay, so how often do players live up (or down) to scouts’ and draft experts’ expectations? Are you feeling confused?**

Confusion Matrix

This nifty little confusion matrix tells me that my model correctly predicted a player’s category 66.4% of the time when they were drafted in the top 10, 55.7% of the time when they were drafted 11th–30th, and 59.7% of the time when they were drafted 31st or later.
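Those diagonal percentages come from normalizing each row of the confusion matrix by its row total, so each row reads as “of players truly in this category, what fraction landed in each predicted category.” A minimal sketch:

```python
from sklearn.metrics import confusion_matrix

def row_normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix with each row normalized to proportions,
    so the diagonal reads as per-category accuracy."""
    return confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
```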

Further, it shows that only 7.6% of players who were drafted outside of the first round of the draft performed to my model’s standard of a top 10 draft pick. It would seem that it’s pretty rare for a low draft pick to amount to much in the NBA.

In fact, according to my model, only 24.2% of late draft picks even end up reaching the requisite stats for starter potential. Manu Ginobili, drafted 57th overall in 1999, is perhaps the king of all outliers in this regard. He has four NBA championships and was a two-time All-Star. He also won the NBA’s Sixth Man of the Year Award following the 2007–2008 season.

On the flip side, high draft picks are duds a bit more often. About 17.4% of players drafted in the top 10 underperform to the point that they only reach my model’s “Flyer” status. Shouts out Kwame Brown.

Players labeled “starter_potential” have the greatest variance in their playing performance but not by much. Only 15.8% of players drafted between 11th and 30th ever play up to the model’s standard for top ten draft picks.

*The Phil Collins version not the bad Disturbed cover

**Ugh…even I hated that one.

WRAP-UP

So what have we learned? Only 11.7% of players drafted later than 10th ever end up playing to the high standard of “star_potential” set by my model. 68.2% of late draft picks are doomed to a life of limited minutes, and 48.7% of players drafted in the middle will surprise fans and coaches, for better or worse.

NBA scouts are pretty good at picking obvious talent. That said, anyone who watches March Madness could probably pick the top 10 guys off the board. And while a bit of variance in the middle is to be expected, it’s a bit shocking how often players with “starter_potential” perform outside of their respective category in either direction. As it turns out, scouting talent in the NBA is a bit of a crapshoot.

GITHUB LINK: https://github.com/mattclibanoff/Build-Week-Unit-2
