001 : Data Storytelling

Impact of referees on NBA outcomes

There’s always a lot of talk amongst basketball fans about the referees being biased against their team. Houston fans don’t like Scott Foster very much, and in all fairness most fans don’t like Scott Foster very much. Tony Brothers has a bad reputation amongst NBA players. In a recent player survey, Tony Brothers, Scott Foster and Marc Davis ranked as the top three ‘worst referees’.

Looking at the post-lockout NBA (2013-2019), I’ve scraped data from http://basketball-reference.com that connects box score data for regular season and playoff games to the referee assignments for each game.

Getting the data

All the data that I’m using came from scraping http://basketball-reference.com, except for team abbreviations, which I got from Wikipedia and edited to align with Basketball-Reference’s abbreviations (you can see these changes in the code). Unfortunately Basketball-Reference doesn’t have an API, but their website is pretty easy to navigate using the htmltab and read_file functions.
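Here’s a setup chunk with the packages behind the functions used throughout this post (these are the libraries the functions below come from):

library(tidyverse)     # dplyr, stringr, tidyr, ggplot2 and readr (read_file)
library(htmltab)       # htmltab() for reading html tables
library(lubridate)     # mdy(), year(), month(), day()
library(knitr)         # kable() for printing tables
library(RColorBrewer)  # brewer.pal() for the heatmap palette
library(scales)        # rescale() for the heatmap fill values

With that out of the way, here are the steps for getting the data: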

1. Download team abbreviations

Like I mentioned above, team abbreviations were taken from Wikipedia and some edits were made to align with Basketball-Reference. The following htmltab function reads the html table from Wikipedia and turns the data into a tibble. I then make some edits to Charlotte, Phoenix and Brooklyn to align with Basketball-Reference’s naming conventions. I also added in two new abbreviations for teams that don’t exist anymore - the Charlotte Bobcats and the New Orleans Hornets.

team_abbs <- as.tibble(htmltab("https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association/National_Basketball_Association_team_abbreviations", which=1))

team_abbs[team_abbs$`Abbreviation/Acronym`=='CHA',]$`Abbreviation/Acronym` = "CHO"
team_abbs[team_abbs$`Abbreviation/Acronym`=='PHX',]$`Abbreviation/Acronym` = "PHO"
team_abbs[team_abbs$`Abbreviation/Acronym`=='BKN',]$`Abbreviation/Acronym` = "BRK"
team_abbs = rbind(team_abbs, c('NOH', 'New Orleans Hornets'))
team_abbs = rbind(team_abbs, c('CHA', 'Charlotte Bobcats'))

2. Get the schedules

For this analysis I’m looking at officiating patterns from the post-lockout era (2013 to 2019). I need to access boxscore data for each game to get the official assignments for those games. Each game is located in an html file with the address ‘YEAR+MONTH+DAY+0+HOMEABBREVIATION.html’. For example, Philadelphia’s October 16, 2018 game in Boston would be coded as 201810160BOS.html. The full URL would be: https://www.basketball-reference.com/boxscores/201810160BOS.html.
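To make that pattern concrete, here’s a minimal standalone sketch that builds a boxscore URL from a game date and a home-team abbreviation (the boxscore_url helper is purely illustrative; the actual pipeline builds this string inside the schedule tibble later on):

# Sketch: build a boxscore URL from a game date and home-team abbreviation
boxscore_url <- function(game_date, home_abb) {
  paste0("https://www.basketball-reference.com/boxscores/",
         format(as.Date(game_date), "%Y%m%d"), "0", home_abb, ".html")
}

boxscore_url("2018-10-16", "BOS")
# "https://www.basketball-reference.com/boxscores/201810160BOS.html"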

To access the data for each of these games, I need the schedules for each season from 2013 to 2019. Basketball-Reference has a table for each month of each season. For October, the url doesn’t have a month identifier; for November through June, it includes a ‘-month’ suffix. So for the 2019 season, October’s schedule would be here: https://www.basketball-reference.com/leagues/NBA_2019_games.html and November’s would be here: https://www.basketball-reference.com/leagues/NBA_2019_games-november.html.

I set up a for loop that goes through the 2013 to 2019 seasons and downloads the schedules for each month into a single tibble. Here’s what that code looks like:

schedule <- tibble()

months <- c('','-november', '-december', '-january','-february','-march','-april','-may','-june')
years <- 2013:2019

for (year in years){
  print(year)
  for (month in months){
    print(month)
    dest <- paste('https://www.basketball-reference.com/leagues/NBA_', year, '_games', month, '.html', sep='')
    
    temp <- as.tibble(htmltab(dest, which=1)) %>% 
      mutate(Month = substring(Date, 6, 8), 
             Day = ifelse(str_detect(substring(Date, 10, 11), ','), substring(Date, 10, 10), substring(Date, 10, 11)), 
             Year = substring(Date, nchar(Date)-3, nchar(Date)), 
             dateClean = mdy(paste(Month, Day, Year)))
    
    schedule <- schedule %>% bind_rows(temp)
  }
}

names(schedule)[3] <- 'Away'
names(schedule)[4] <- 'AwayPTS'
names(schedule)[5] <- 'Home'
names(schedule)[6] <- 'HomePTS'

schedule <- schedule %>% left_join(team_abbs, by=c("Home"="Franchise"))
names(schedule)[13] = "HomeABB"
schedule <- schedule %>% left_join(team_abbs, by=c("Away"="Franchise"))
names(schedule)[14] = "AwayABB"

The last few lines just rename some columns that have ugly names from the htmltab function. The schedule tibble only has the teams’ full names, and I need the abbreviations to build the html codes for accessing the boxscores. So I left-join the team abbreviations onto both the home and away teams’ full names.

I then need to build my html boxscore code, which I’m calling htmlCode. It stitches together the date and the home team abbreviation in the correct format to access the correct webpage (as discussed above).

schedule <- schedule %>% 
  mutate(htmlCode = paste(year(dateClean), 
                          ifelse(nchar(month(dateClean))==1, paste(0, month(dateClean), sep=''), month(dateClean)), 
                          ifelse(nchar(day(dateClean))==1, paste(0, day(dateClean), sep=''), day(dateClean)), 
                          0, HomeABB, sep=""))

3. Download the box score data - warning: here comes a really big for loop.

Now comes the fun part. Located on each boxscore webpage are two data tables that I want - the home and away boxscores - and I can access these using the htmltab function. I also want the Pace statistic, to adjust fouls per game to fouls per 100 possessions, and I want the names of the officials. Pace and official assignments are not located in html tables, so I can’t use htmltab. What I need to do is look through the html source file and find a pattern of characters that can be used to pull the data.

Official Names

Here’s the part of the for loop that will take the official assignments for the boxscore with the correct htmlCode.

    htmlfile <- read_file(paste('https://www.basketball-reference.com/boxscores/', htmlCode, '.html', sep=''))
      
      startString <- str_locate(htmlfile, 'Officials:')
      endString <- str_locate(htmlfile, 'Attendance:')
      
      tempString <-substring(htmlfile, startString, endString)
      
      firstOfficialStart <- str_locate(tempString,'r.html')
      tempString <- substring(tempString, firstOfficialStart[1,2]+3)
      firstOfficialEnd <- str_locate(tempString,'<')
      firstOfficial <- substring(tempString, 1, firstOfficialEnd[1,2]-1)[1]
      secondOfficialStart <- str_locate(tempString,'r.html')
      tempString <- substring(tempString, secondOfficialStart[1,2]+3)[1]
      secondOfficialEnd <- str_locate(tempString,'<')
      secondOfficial <- substring(tempString, 1, secondOfficialEnd[1,2]-1)[1]
      thirdOfficialStart <- str_locate(tempString,'r.html')
      tempString <- substring(tempString, thirdOfficialStart[1,2]+3)[1]
      thirdOfficialEnd <- str_locate(tempString,'<')
      thirdOfficial <- substring(tempString, 1, thirdOfficialEnd[1,2]-1)[1]

Looking at the html source code, I found that the string ‘Officials:’ was where the officials’ names started and the string ‘Attendance:’ was what came after the last official’s name. It took a bit of trial and error to get the correct start and end points for each official, but essentially I’m making a temporary string that takes the characters between ‘Officials:’ and ‘Attendance:’. For the first official, I look at the source code to find where the name is located and trim the string to get the first official’s name. I then do the same thing for the second and third officials. Basically, it’s a bit of data wrangling using the stringr package from the tidyverse.
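As an aside, the same extraction could probably be done more compactly with a single regular expression. Here’s a rough sketch (not part of the pipeline), assuming each name appears as r.html">Name< between ‘Officials:’ and ‘Attendance:’ as described above:

# Sketch: pull all three officials in one pass with a regex
ref_span  <- substring(htmlfile,
                       str_locate(htmlfile, 'Officials:')[1, 1],
                       str_locate(htmlfile, 'Attendance:')[1, 1])
officials <- str_match_all(ref_span, 'r\\.html">([^<]+)<')[[1]][, 2]
# officials should be a character vector of the three assigned referees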

Pace data

Pace data was a bit harder to locate and required a bit more trial and error, but I used the same process to find it as I did for the officials’ names. I needed to include an str_remove call since pace statistics greater than or equal to 100.0 are five characters long and anything less is four characters long. I always take five characters into my pace variable and use str_remove to strip away a ‘<’ if it was part of the five characters I read in. A ‘<’ occurs right after the last digit of pace, so a four-character pace statistic would look like ‘94.6<’ compared to a five-character statistic like ‘101.3’. I then wrap it in as.numeric() to get a number.

Here’s the code:

    pace_loc <- str_locate_all(htmlfile, fixed('data-stat=\"pace'))[[1]][3,2]
    pace <- as.numeric(str_remove(substring(htmlfile, pace_loc+4, pace_loc+8), "\\<"))

Putting it all together

Here’s the entire for loop. It starts by pulling the officials’ names, finds the pace statistic and then downloads the boxscores. I append the officials’ names to each home and visitor table along with pace, and then append both to a tibble called fullSet. If 10 players played for each team in a game, the game would occupy 60 observations in the fullSet tibble (10 players x 2 teams x 3 officials).

fullSet <- tibble()  # holds every player-game-official row appended below

for (htmlCode in schedule$htmlCode){
  #Get the ref data
    tryCatch({
      htmlfile <- read_file(paste('https://www.basketball-reference.com/boxscores/', htmlCode, '.html', sep=''))
      
      startString <- str_locate(htmlfile, 'Officials:')
      endString <- str_locate(htmlfile, 'Attendance:')
      
      tempString <-substring(htmlfile, startString, endString)
      
      firstOfficialStart <- str_locate(tempString,'r.html')
      tempString <- substring(tempString, firstOfficialStart[1,2]+3)
      firstOfficialEnd <- str_locate(tempString,'<')
      firstOfficial <- substring(tempString, 1, firstOfficialEnd[1,2]-1)[1]
      secondOfficialStart <- str_locate(tempString,'r.html')
      tempString <- substring(tempString, secondOfficialStart[1,2]+3)[1]
      secondOfficialEnd <- str_locate(tempString,'<')
      secondOfficial <- substring(tempString, 1, secondOfficialEnd[1,2]-1)[1]
      thirdOfficialStart <- str_locate(tempString,'r.html')
      tempString <- substring(tempString, thirdOfficialStart[1,2]+3)[1]
      thirdOfficialEnd <- str_locate(tempString,'<')
      thirdOfficial <- substring(tempString, 1, thirdOfficialEnd[1,2]-1)[1]
      
      pace_loc <- str_locate_all(htmlfile, fixed('data-stat=\"pace'))[[1]][3,2]
      pace <- as.numeric(str_remove(substring(htmlfile,pace_loc+4, pace_loc+8),"\\<"))
      
      #Now get the player data
      
      boxscore_link <- paste('https://www.basketball-reference.com/boxscores/',htmlCode,'.html',sep='')
      
      home <- htmltab(boxscore_link, which=3)
      game <- schedule[schedule$htmlCode == htmlCode,]
      home$HomeTeam = game$HomeABB
      home$Date = game$dateClean
      home$Result = ifelse(game$HomePTS > game$AwayPTS, 1,0)
      
      home <- home %>%  filter(!str_detect(Starters, 'Reserves'))  %>%
        filter(!str_detect(`Basic Box Score Stats >> MP`, 'Did Not Play')) %>% 
        select(Name = Starters, Team = HomeTeam, Date = Date, FTA = `Basic Box Score Stats >> FTA`, 
               PF = `Basic Box Score Stats >> PF`) %>% mutate(Pace = pace, firstOfficial = firstOfficial, secondOfficial = secondOfficial, 
               thirdOfficial = thirdOfficial) %>%   gather(key='OfficialNumber', value='Official',firstOfficial:thirdOfficial)
    
    
      away <- htmltab(boxscore_link, which=1)
      away$AwayTeam = game$AwayABB
      away$Date = game$dateClean
      away$Result = ifelse(game$HomePTS > game$AwayPTS, 0,1)
      
      away <- away %>%  filter(!str_detect(Starters, 'Reserves'))  %>%
        filter(!str_detect(`Basic Box Score Stats >> MP`, 'Did Not Play')) %>% 
        select(Name = Starters, Team = AwayTeam, Date = Date, FTA = `Basic Box Score Stats >> FTA`, 
               PF = `Basic Box Score Stats >> PF`) %>% 
        mutate(Pace = pace, firstOfficial = firstOfficial, secondOfficial = secondOfficial, thirdOfficial = thirdOfficial) %>%   
        gather(key='OfficialNumber', value='Official',firstOfficial:thirdOfficial)
      
      home <- home %>% mutate(OppFouls = sum(as.numeric(away$PF))/3)
      away <- away %>% mutate(OppFouls = sum(as.numeric(home$PF))/3)
      
      fullSet <- fullSet %>% bind_rows(home, away)
      closeAllConnections()
    }, error=function(e){cat("ERROR :", htmlCode, " ", conditionMessage(e), "\n")})
}

Everything is put inside a tryCatch function to output errors without exiting the loop. For some reason, the occasional game won’t download properly (I get an XML error). I think it may have to do with page load time. Since I’m looking at seven years of games, I’m okay with losing a few observations. In the end I get about 450,000 observations; dividing by 60 means I have about 7,500 games of information. That represents about 80% of all games played (assuming all playoff series go to seven games). For the purposes of this analysis I think that’s sufficient.
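If the failures really are transient load issues, one possible mitigation would be a small retry wrapper around the page read. Here’s a sketch of such a helper (read_file_retry is hypothetical and isn’t used in the loop above):

# Sketch: retry a flaky page read a few times before giving up
read_file_retry <- function(url, tries = 3, wait = 2) {
  for (i in seq_len(tries)) {
    result <- tryCatch(read_file(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(wait)  # pause before retrying
  }
  stop("Failed to read ", url, " after ", tries, " attempts")
}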

4. Cleaning the dataset

Now that I have fullSet, I realized that I also wanted the winner of each game. Rather than re-downloading the data (which takes forever!), I created a tibble called winnerName that duplicates the schedule for the home team and away team so I can left-join it against fullSet on the Team name column. I then used mutate to create a variable called ‘Win’ that tests whether the winner of the game (included in winnerName) is equal to the team’s name from the boxscore. If it is, the team gets a 1; if not, a 0. When averaged over the season, this gives the team’s win percentage.

The next interesting variable is the PacePF variable. When I first started looking at the data I didn’t include pace, and it skewed the results such that slower-paced teams had fewer fouls called than faster-paced teams. This makes sense, since there are fewer opportunities to call fouls when there are fewer possessions in a game. PacePF takes the fouls called in a game and divides them by the number of possessions in the game, then multiplies by 100 to get the number of fouls per 100 possessions. This adjusts for pace.
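As a quick worked example with made-up numbers:

# 20 team fouls in a game played at a Pace of 95 possessions
pf   <- 20
pace <- 95
100 * pf / pace   # ~21.1 fouls per 100 possessions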

I also included a variable called Opponent that I use to left-join opponent win percentage onto the data. Additional mutates include creating some factors to be used later and a nested ifelse to get a Season variable so I can filter by season later on.

schedule_result <- schedule %>% mutate(Winner = ifelse(HomePTS > AwayPTS, HomeABB, AwayABB)) %>% 
    select(dateClean, HomeABB, AwayABB, Winner)
  
  winnerName <- bind_rows(schedule_result %>% mutate(Team = HomeABB), schedule_result %>% mutate(Team = AwayABB))
  
  fullSet1 <- fullSet %>% left_join(winnerName, by=c("Date" = "dateClean", "Team" = "Team")) %>%
    mutate(Win = ifelse(Winner == Team, 1, 0)) %>% mutate(Name = as.factor(Name), Team=as.factor(Team), FTA=as.numeric(FTA), PF = as.numeric(PF), 
                                 PacePF = 100*PF/Pace,
                                 OfficialNumber=as.factor(OfficialNumber),
                                 Official=as.factor(Official)) %>% na.omit() %>%
    mutate(Season=ifelse(Date <'2013-07-01',2013,
                  ifelse(Date < '2014-07-01', 2014,
                         ifelse(Date < '2015-07-01',2015,
                                ifelse(Date <'2016-07-01', 2016,
                                       ifelse(Date < '2017-07-01', 2017,
                                              ifelse(Date < '2018-07-01', 2018,
                                                     ifelse(Date < '2019-07-01', 2019, NA))))))),
           Opponent = ifelse(HomeABB==Team, AwayABB, HomeABB)) %>%
    select(-HomeABB, -AwayABB) 
  
  games_played <- fullSet1 %>% count(Team, Date, Season, sort = TRUE) %>% count(Team, Season)
  games_reffed <- fullSet1 %>% count(Team, Date, Official, Season) %>% count(Team, Season, Official)
  game_data <- fullSet1  %>% group_by(Team, Date, Season, Official, Opponent) %>% summarise(Fouls = sum(PacePF), OppFouls = sum(100*OppFouls/Pace)/n(), FoulDiff = Fouls-OppFouls, Win = mean(Win))
  
  foulsPerGame <- fullSet1 %>% group_by(Team, Official, Season) %>% summarise(Fouls=sum(PacePF), WinPct = mean(Win)) %>% left_join(games_reffed) %>% mutate(FPG = Fouls/nn)
## Joining, by = c("Team", "Official", "Season")
OfficialAvg <- foulsPerGame %>%group_by(Season, Official) %>% summarise(SeasonAvg=mean(Fouls/nn))
  TeamAvg <- foulsPerGame %>%group_by(Season, Team) %>% summarise(TeamSeasonAvg=mean(Fouls/nn), SeasonWinPct = mean(WinPct))
  
  fullSet1 <- fullSet1 %>% left_join(TeamAvg, by=c("Opponent"="Team", "Season" = "Season")) %>% 
    mutate(OpponentPCT = SeasonWinPct) %>% select(-SeasonWinPct, -TeamSeasonAvg)
## Warning: Column `Opponent`/`Team` joining character vector and factor,
  ## coercing into character vector
foulsPerGame <- fullSet1 %>% group_by(Team, Official, Season) %>% summarise(Fouls=sum(PacePF), WinPct = mean(Win), OpponentPct = mean(OpponentPCT)) %>% left_join(games_reffed) %>% mutate(FPG = Fouls/nn)
## Joining, by = c("Team", "Official", "Season")
#Need to change these to PacePF rather than PFs...
  foulsByTeam <- fullSet1 %>% group_by(Date, Team) %>% summarise(Fouls=sum(PacePF)/3, FTA = sum(FTA)/3)
  foulsByTeamRef <- fullSet1 %>% group_by(Date, Team, Official) %>% summarise(Fouls=sum(PacePF), FTA = sum(FTA))

Now we can get into some analysis.

Officials are pretty consistent year-over-year

Here’s a distribution of the fouls called per game in the NBA from 2013 to 2019. I’ve included a typical density function (the bell-curve-looking chart) and a cumulative distribution function. Cumulative distribution functions (CDFs) can be interpreted as follows: when a line sits closer to the left, a greater proportion of games have fewer fouls called per 100 possessions than for a line that sits further to the right. For example, in the CDF below, which shows fouls per 100 possessions over the 2013 to 2019 period, in 2018 (the purple line) 50% of games had 20 or fewer fouls per 100 possessions called.

As you can see, the distributions are pretty similar year-over-year. In the season that just ended, fouls per 100 possessions were right in line with other years (the pink line) and higher than in 2018, when fouls per 100 possessions were the lowest of the seven seasons (the purple line).

ggplot(game_data, aes(x=Fouls, color=as.factor(Season))) + 
    geom_density() + theme(legend.title = element_blank()) + xlab("Fouls per 100 possessions") +ylab("") + ggtitle('Distribution of Fouls per 100 Possessions')

ggplot(game_data, aes(x=Fouls, color=as.factor(Season))) + 
    stat_ecdf(geom='smooth') + theme(legend.title = element_blank()) + xlab("Fouls per 100 possessions") +ylab("") +
    ggtitle('Cumulative distribution of Fouls per 100 Possessions')

Some teams get called for more fouls…

Looking across all years, there are certain teams that have more fouls called per 100 possessions than others. From 2013 to 2019, there was a differential of about 4 fouls called per 100 possessions between the most penalized team, the NBA Champion Toronto Raptors (23.2 fouls per 100 possessions) and the least penalized team, the San Antonio Spurs (19.0 fouls per 100 possessions). Other notable teams near the top are the Memphis Grizzlies (23.0), Phoenix Suns (22.8) and Oklahoma City Thunder (22.7). And some big names at the bottom in addition to the Spurs are the Charlotte Hornets (19.4), Minnesota Timberwolves (20.3) and the Atlanta Hawks (20.4).

fpt <- foulsByTeam  %>% mutate(Team=as.character(Team))  %>% mutate(Team = ifelse(Team=="NOH","NOP", ifelse(Team=="CHA", "CHO", Team))) %>% group_by(Team) %>% summarise(FoulsPer100Pos = mean(Fouls)) %>% arrange(desc(FoulsPer100Pos))
  
  ## Above I combined the Charlotte Hornets and Charlotte Bobcats into one team abbreviation and the New Orleans Hornets and New Orleans Pelicans into another. 
  
  kable(fpt)
Team FoulsPer100Pos
TOR 23.19220
MEM 22.98415
PHO 22.78055
OKC 22.65043
LAC 22.53629
NYK 22.49705
BOS 22.14371
PHI 22.10638
UTA 22.03290
HOU 21.95324
WAS 21.94419
GSW 21.74085
DEN 21.70252
MIL 21.68955
SAC 21.63830
MIA 21.63177
IND 21.59965
DAL 21.27908
BRK 21.18571
POR 21.12683
LAL 21.01824
NOP 21.01345
ORL 20.94263
CLE 20.83862
DET 20.70564
CHI 20.49397
ATL 20.39566
MIN 20.27035
CHO 19.35301
SAS 19.08410

This is an interesting result. I didn’t expect to see such a wide spread after adjusting for pace. I’m going to look more closely at the most penalized team and the least penalized team.

The NBA Champion Toronto Raptors don’t have the same statistical distribution of fouls called

Warning: I’m a Raptors fan, but I’m going to hopefully let the statistics do the talking…

Using the Kolmogorov-Smirnov test (KS test), you can test whether two samples are drawn from the same distribution. I’m using the KS test here to ask: “are the fouls called on the Raptors statistically different from the fouls called on the rest of the league?”

So what does the KS-test say?

kstest = tibble()
  
  notTOR <- game_data %>% mutate(TorontoFlag = ifelse(Team=="TOR", "TOR", "OTHER"))
  
  for (season in unique(notTOR$Season)){
    a <- ks.test(filter(notTOR, Season==season & TorontoFlag=="OTHER")$Fouls,filter(notTOR, Season==season & TorontoFlag=="TOR")$Fouls)
    result <- tibble(Season=season, PValue = round(a$p.value,20),Result = ifelse(a$p.value<0.05,'Statistically Different', 'Statistically Similar') )
    kstest <- kstest %>% bind_rows(result)
  }
  
  kable(kstest)
Season PValue Result
2013 0.0000000 Statistically Different
2014 0.0000000 Statistically Different
2015 0.0023224 Statistically Different
2016 0.0996374 Statistically Similar
2017 0.0000000 Statistically Different
2018 0.0000000 Statistically Different
2019 0.0003129 Statistically Different

Every season was statistically different from the rest of the NBA except for 2016. This means that fouls per 100 possessions called on the Raptors would be expected to be drawn from a different population than fouls called on other teams. Graphically, this can be shown using cumulative distribution functions.

ggplot(notTOR, aes(x=Fouls, color = as.factor(TorontoFlag))) + stat_ecdf(geom='smooth') + 
   theme(legend.title = element_blank()) + xlab("Fouls per 100 possessions") +ylab("") +
    ggtitle('Cumulative distribution of Fouls per 100 Possessions: Toronto Raptors (2013-2019)')

And for completeness, here are the CDFs for each year from 2013 to 2019.

ggplot(notTOR, aes(x=Fouls, color = as.factor(TorontoFlag))) + stat_ecdf(geom='smooth') + 
    facet_wrap(vars(Season)) + theme(legend.title = element_blank()) + xlab("Fouls per 100 possessions") +ylab("") +
    ggtitle('Cumulative distribution of Fouls per 100 Possessions: Toronto Raptors')

What’s with the Spurs: Enter the upside-down

Since pace is adjusted for in the data, I think there must be a Popovich factor, as their foul counts are low. These are the CDFs for the Spurs:

notSAS <- game_data %>% mutate(SanAntonioFlag = ifelse(Team=="SAS", "SAS", "OTHER"))
  
  ggplot(notSAS, aes(x=Fouls, color = as.factor(SanAntonioFlag))) + stat_ecdf(geom='smooth') + 
   theme(legend.title = element_blank()) + xlab("Fouls per 100 possessions") +ylab("") +
    ggtitle('Cumulative distribution of Fouls per 100 Possessions: San Antonio Spurs (2013-2019)')

From 2013 to 2019, in 75% of games the Spurs were called for about 21 or fewer fouls, compared to 50% of games for the rest of the league.

And for completeness, here are the CDFs for each year from 2013 to 2019.

ggplot(notSAS, aes(x=Fouls, color = as.factor(SanAntonioFlag))) + stat_ecdf(geom='smooth') + 
    facet_wrap(vars(Season)) + theme(legend.title = element_blank()) + xlab("Fouls per 100 possessions") +ylab("") +
    ggtitle('Cumulative distribution of Fouls per 100 Possessions: San Antonio Spurs')

And here are the KS-tests. As you’d expect, the distributions aren’t similar to the league as a whole. Every season sees the Spurs with a statistically different distribution.

kstest = tibble()
  for (season in unique(notSAS$Season)){
    a <- ks.test(filter(notSAS, Season==season & SanAntonioFlag=="OTHER")$Fouls,filter(notSAS, Season==season & SanAntonioFlag=="SAS")$Fouls)
    result <- tibble(Season=season, PValue = round(a$p.value,20),Result = ifelse(a$p.value<0.05,'Statistically Different', 'Statistically Similar') )
    kstest <- kstest %>% bind_rows(result)
  }
  
  kable(kstest)
Season PValue Result
2013 0.0000050 Statistically Different
2014 0.0000000 Statistically Different
2015 0.0067365 Statistically Different
2016 0.0000000 Statistically Different
2017 0.0000055 Statistically Different
2018 0.0000000 Statistically Different
2019 0.0000000 Statistically Different

So do certain officials call a tighter game for certain teams?

When taking into account the officiating patterns for specific refs, I’ve made three adjustments: official adjustments, team adjustments and win percentage adjustments. I’ve established above that certain teams (cough: Raptors) seem to have a disadvantage when it comes to overall fouls called independent of the referees.

To see if there are certain officiating patterns, I’ve taken the data set and looked at fouls called per 100 possessions by referee and by season.

For example, here are the top five teams that Marc Davis called the most fouls on in the 2019 season.

diffPlotData <- foulsPerGame %>% arrange(desc(FPG)) %>% left_join(OfficialAvg) %>% left_join(TeamAvg) %>%
    mutate(TeamDifferential = FPG-TeamSeasonAvg, PctImpact = 1 + (WinPct - (SeasonWinPct/(OpponentPct+SeasonWinPct)))) %>% 
    mutate(RefImpact = as.numeric(100*(FPG/SeasonAvg)*(2-FPG/TeamSeasonAvg)*(1/PctImpact))) %>%
    select(-TeamDifferential) %>% 
    arrange(desc(RefImpact))
## Joining, by = c("Official", "Season")
## Joining, by = c("Team", "Season")
k<-diffPlotData %>% filter(Official=="Marc Davis" & Season==2019) %>% 
    select(Season, Team, Official, GamesOfficiated=nn, FoulsPer100Possessions = FPG) %>% 
    arrange(desc(FoulsPer100Possessions)) %>% head(5)
  
  kable(k)
Season Team Official GamesOfficiated FoulsPer100Possessions
2019 PHO Marc Davis 3 25.76730
2019 MEM Marc Davis 1 24.52026
2019 OKC Marc Davis 4 23.94846
2019 GSW Marc Davis 8 23.87804
2019 LAC Marc Davis 6 23.81115

The issue with looking purely at fouls per 100 possessions and saying ‘wow, Marc Davis really hates the Suns’ is that we’re not comparing Marc Davis’s treatment of the Suns to: 1. how Marc Davis calls games involving the rest of the league in that season, and, 2. how all the officials call games involving the Suns in that season.

Foul adjustments

I’ve made two adjustments to the data to try to capture these issues related to fouls called in games.

The first is an ‘official’ adjustment. It compares the fouls called per 100 possessions on a team in games worked by that official against the average fouls per 100 possessions that official called across all of his games in that season.

The second is the ‘team’ adjustment. As shown above, some teams just get more fouls called on them, whether due to style of play, pace or ‘league conspiracy’ (I jest). This adjustment compares the fouls called per 100 possessions on the team in games worked by the official in question against the team’s season-average fouls per 100 possessions, to get an idea of how that official calls the team relative to the team’s usual reputation.

So for example, in Phoenix Suns games officiated by Marc Davis in 2019, the Suns averaged 25.8 fouls per 100 possessions (this is the FPG variable, below). Across the entire season, the Suns averaged 23.7 fouls per 100 possessions (this is the TeamSeasonAvg variable, below), while Marc Davis called 21.2 fouls per 100 possessions across all of his games in 2019 (this is the SeasonAvg variable, below).

The Official Adjustment would be 25.7673/21.15367 = 1.218101 - meaning Marc Davis called about 22% more fouls on the Suns than he did on average.

The Team Adjustment would be 25.7673/23.727 = 1.085991 - meaning in games officiated by Marc Davis, the Suns had about 9% more fouls called than they did on average.

Combining the two as Official Adjustment x (2 - Team Adjustment) gives a foul impact of 1.218101 x (2 - 1.085991) = 1.113355. So introducing Marc Davis to Phoenix Suns games in 2019 led to an increase in fouls per 100 possessions of about 11%.

kable(diffPlotData %>% filter(Season=="2019", Team=="PHO", Official=="Marc Davis") %>% select(Team, Official, Season, FPG, SeasonAvg, TeamSeasonAvg) %>% mutate(OfficialAdjustment = FPG/SeasonAvg, TeamAdjustment = FPG/TeamSeasonAvg) %>% mutate (FoulImpact = OfficialAdjustment*(2-TeamAdjustment)))
Team Official Season FPG SeasonAvg TeamSeasonAvg OfficialAdjustment TeamAdjustment FoulImpact
PHO Marc Davis 2019 25.7673 21.15367 23.727 1.218101 1.085991 1.113355

Refs can affect the game in ways that aren’t seen through fouls

So when I started this analysis I was going to focus solely on fouls and figure out an official’s impact on a game through fouls called. However, I realized while pulling some data together that I wouldn’t be telling the whole story. There can be other missed calls or no-calls or strange decisions that aren’t related specifically to personal fouls. I decided to add another variable that looks at win percentages, since teams may have terrible win percentages with specific refs.

A prime example was seen in the 2019 playoffs when the Houston Rockets had Scott Foster assigned to their game, and basketball analysts were quick to point out the Rockets’ terrible record in playoff games officiated by Scott Foster.

I’ve tried to measure the impact of officials on win percentage by looking at:

A. The team’s win percentage in the season for all games in the dataset;
B. The opponent’s win percentage in the season for all games in the dataset; and
C. The team’s win percentage in games officiated by the referee in question in that season.

Using these three inputs, I’ve proposed a Win Percentage Impact metric that is calculated as:

1 + (Team’s Win Pct With Specific Ref - (Team’s Overall Season Win Pct / (Opponent’s Win Pct in the Season + Team’s Overall Season Win Pct)))

Or

WinPctImpact = 1 + (C - (A/(A+B))) - using the letters above.

I’m trying to capture what would have been expected for the outcome (that’s the A/(A+B)) and subtracting it from the actual result in those games with that official. I add one to the number to centre it around one rather than around zero.

This may not be perfect, but it gives a scale centered around 1 where an impact greater than 1 is in favour of the team and an impact less than 1 would be considered a negative bias.
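Here’s a minimal helper expressing that definition with the A/B/C labels above (just a sketch of the arithmetic; win_pct_impact is an illustrative name, not a function used elsewhere in this analysis):

# WinPctImpact = 1 + (C - A/(A + B))
win_pct_impact <- function(season_win_pct, opponent_win_pct, win_pct_with_ref) {
  1 + (win_pct_with_ref - season_win_pct / (season_win_pct + opponent_win_pct))
}

# Phoenix with Marc Davis in 2019, using the values from the table below
win_pct_impact(0.2081247, 0.5172489, 0)   # ~0.713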

So for Phoenix and Marc Davis again, the WinPctImpact would be:

WinPctImpact = 1 + (0 - (0.2081247/(0.2081247 + 0.5172489))) = 0.7130793.

kable(diffPlotData %>% filter(Season=="2019", Team=="PHO", Official=="Marc Davis") %>% select(Team, Official, Season, RefWinPct = WinPct, SeasonWinPct, OpponentPct) %>% mutate(WinPctImpact = (1+(RefWinPct - SeasonWinPct/(SeasonWinPct+OpponentPct))) ))
Team Official Season RefWinPct SeasonWinPct OpponentPct WinPctImpact
PHO Marc Davis 2019 0 0.2081247 0.5172489 0.7130793

Putting it all together

To get an overall RefImpact score, I multiply FoulImpact by the inverse of WinPctImpact. I take the inverse so they’re both moving in the same direction (i.e., greater than one is a ‘disadvantage’ and less than one is an ‘advantage’). I then multiply this by 100, since everyone always likes things being out of 100. So a perfectly unbiased ref for a specific team would have a RefImpact score of 100.
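As a quick sanity check of the definition, here are the values from the top row of the 2019 table further below (Mike Callahan officiating Boston) plugged into that formula:

# RefImpact = 100 * FoulImpact / WinPctImpact
100 * 1.0051836 / 0.5053707   # ~198.9, matching the table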

diffPlotData <- foulsPerGame %>% arrange(desc(FPG)) %>% left_join(OfficialAvg) %>% left_join(TeamAvg) %>%
    mutate(RefWinPct = WinPct) %>% mutate(OfficialAdjustment = FPG/SeasonAvg, TeamAdjustment = FPG/TeamSeasonAvg) %>% 
    mutate (FoulImpact = OfficialAdjustment*(2-TeamAdjustment)) %>% 
    mutate(WinPctImpact = (1+(RefWinPct - SeasonWinPct/(SeasonWinPct+OpponentPct)))) %>% 
    mutate(RefImpact = 100*FoulImpact*(1/WinPctImpact)) %>%
    arrange(desc(RefImpact))
## Joining, by = c("Official", "Season")
## Joining, by = c("Team", "Season")

The table below shows the largest RefImpact for each Team:Official combination in the 2019 season (minimum of five games officiated). The toughest Official:Team pairing is Mike Callahan with the Boston Celtics - his RefImpact score is nearly 200. He calls roughly an average number of fouls on Boston, but their win percentage is terrible in his games.

  kable(diffPlotData %>% filter(Season==2019, nn>=5) %>% select(Season, Team, Official, FoulImpact, WinPctImpact, RefImpact) %>% group_by(Team) %>% top_n(1))
## Selecting by RefImpact
Season Team Official FoulImpact WinPctImpact RefImpact
2019 BOS Mike Callahan 1.0051836 0.5053707 198.9003
2019 SAC Kevin Scott 1.0194418 0.5129371 198.7459
2019 BRK Brandon Adair 1.0369219 0.5278196 196.4538
2019 MEM Tom Washington 1.0832577 0.5586404 193.9097
2019 ORL Scott Wall 0.9141238 0.5105586 179.0439
2019 PHI Derek Richardson 1.0712792 0.6024405 177.8232
2019 ATL John Goble 1.0885081 0.6150747 176.9717
2019 WAS Scott Twardoski 0.9846930 0.5718867 172.1832
2019 CHI Kevin Cutler 1.0248325 0.6084254 168.4401
2019 DAL Jonathan Sterling 0.9408535 0.5629238 167.1369
2019 TOR Brian Forte 1.0469611 0.6365599 164.4717
2019 HOU Pat Fraher 1.0698855 0.6564406 162.9828
2019 PHO Tyler Ford 1.1026866 0.6945068 158.7726
2019 DET Derrick Collins 1.0995975 0.7266369 151.3270
2019 POR Mark Lindsay 1.0067114 0.6689314 150.4955
2019 GSW John Goble 1.0138228 0.6808992 148.8947
2019 OKC John Goble 1.0221091 0.6898892 148.1555
2019 CLE Rodney Mott 0.9753442 0.6617915 147.3794
2019 DEN Zach Zarba 0.9738101 0.6608538 147.3564
2019 MIA Scott Foster 0.9830912 0.6677663 147.2208
2019 LAL Ken Mauer 0.9376757 0.6427890 145.8761
2019 NYK Sean Wright 1.0003332 0.6991680 143.0748
2019 UTA Ed Malloy 1.0288903 0.7200854 142.8845
2019 MIL Brent Barnaky 0.9296845 0.6628068 140.2648
2019 IND Marc Davis 0.9402634 0.6981466 134.6799
2019 CHO Derrick Collins 0.9337469 0.6935120 134.6403
2019 SAS Eric Dalen 0.9088094 0.6791793 133.8099
2019 MIN Gediminas Petraitis 0.9548545 0.7301576 130.7738
2019 NOP Rodney Mott 0.9575534 0.7506633 127.5610
2019 LAC John Goble 1.0776021 0.8536205 126.2390

A name you won’t see on the list is Scott Foster, which raises the question:

Does Scott Foster hate us?

I mean, if you’re a Raptors fan, no, not really. Officiating Toronto’s 2019 games, Foster had a score of about 96. This means the Raptors actually do slightly better than expected in games Foster officiates. He calls about an even number of fouls, and the Raptors’ win percentage is slightly higher than expected when Scott shows up. Even I’m surprised by this. People may think he’s not the best official (I tend to agree), but he seems to be equally subpar towards both teams rather than showing a bias.

Now if you’re a fan of the Grizzlies, Hawks, Rockets (surprise, surprise), 76ers or Cavs, yes, Scott hates you.

  kable(diffPlotData %>% filter(Season==2019, nn>=3, Official=="Scott Foster") %>% select(Season, Team, Official, FoulImpact, WinPctImpact, RefImpact))
Season Team Official FoulImpact WinPctImpact RefImpact
2019 MEM Scott Foster 1.0575210 0.5470768 193.30395
2019 ATL Scott Foster 1.0711483 0.5991798 178.76908
2019 MIA Scott Foster 0.9830912 0.6677663 147.22083
2019 CLE Scott Foster 0.9519794 0.6868480 138.60118
2019 PHI Scott Foster 0.9821365 0.7889502 124.48651
2019 HOU Scott Foster 1.0341822 0.8319093 124.31430
2019 OKC Scott Foster 1.0004699 0.8227727 121.59736
2019 DAL Scott Foster 0.8750520 0.7720084 113.34746
2019 LAC Scott Foster 1.0687552 0.9977861 107.11265
2019 TOR Scott Foster 1.0080289 1.0462297 96.34872
2019 BOS Scott Foster 0.9521920 1.0109618 94.18674
2019 DEN Scott Foster 0.9796784 1.0545283 92.90205
2019 WAS Scott Foster 0.9628572 1.0456791 92.07960
2019 BRK Scott Foster 0.9863848 1.0728018 91.94474
2019 IND Scott Foster 0.9243631 1.0171456 90.87815
2019 NYK Scott Foster 0.9779518 1.1254495 86.89433
2019 POR Scott Foster 0.9707246 1.1715720 82.85659
2019 ORL Scott Foster 0.8762911 1.0987563 79.75300
2019 MIL Scott Foster 0.8809798 1.2015946 73.31756
2019 SAS Scott Foster 0.8499485 1.1804840 72.00000
2019 GSW Scott Foster 1.0083915 1.4063355 71.70348
2019 LAL Scott Foster 0.9329831 1.3216338 70.59316
2019 UTA Scott Foster 0.9820196 1.4216737 69.07489
2019 SAC Scott Foster 0.9361580 1.4385647 65.07584

And if you’re the Warriors, Kings, Pacers or Lakers, you should be pretty happy when Scott Foster walks into your building.

How long has Foster hated the Rockets?

I’m writing this in jest; I don’t actually think Scott Foster actively hates the Rockets. Maybe just Chris Paul (we’ll see when he joins OKC). But here’s the breakdown of Scott Foster’s RefImpact for the Rockets over time.

  kable(diffPlotData %>% filter(Team=="HOU", Official=="Scott Foster") %>% select(Season, Team, Official, FoulImpact, WinPctImpact, RefImpact) %>% arrange(Season))
Season Team Official FoulImpact WinPctImpact RefImpact
2013 HOU Scott Foster 0.8880490 1.4312080 62.04891
2014 HOU Scott Foster 0.9603178 1.0489909 91.54682
2015 HOU Scott Foster 0.9802553 1.0450976 93.79557
2016 HOU Scott Foster 1.0383135 1.3287667 78.14115
2017 HOU Scott Foster 0.9027420 0.9846645 91.68016
2018 HOU Scott Foster 0.7311231 1.2145899 60.19506
2019 HOU Scott Foster 1.0341822 0.8319093 124.31430

So, Scott Foster’s fascination with the Rockets is really just a 2019 thing. He was actually quite pro-Houston in all years prior to 2019.

Whither Marc and Tony?

The other two much-maligned refs are Tony Brothers and Marc Davis. So here’s a table for 2019 showing the RefImpacts for Tony and Marc. There are some NAs since I’ve limited the table to Team:Official combinations with at least three games officiated.

  kable(diffPlotData %>% filter(Season==2019, nn>=3, Official=="Marc Davis" | Official == "Tony Brothers") %>% select(Season, Team, Official, RefImpact) %>% spread(key=Official, value=RefImpact))
Season Team Marc Davis Tony Brothers
2019 ATL 89.51189 NA
2019 BOS 88.91339 75.26967
2019 BRK 79.75380 NA
2019 CHI 95.39141 NA
2019 CHO 76.20787 118.11010
2019 CLE 138.62071 NA
2019 DAL 76.11391 118.66721
2019 DEN 69.98748 65.20177
2019 DET 90.73033 119.10518
2019 GSW 118.53018 66.63027
2019 HOU 122.90409 NA
2019 IND 134.67995 85.85253
2019 LAC 94.43896 134.49564
2019 LAL 72.01400 NA
2019 MEM NA 115.91210
2019 MIA 113.10720 NA
2019 MIL 107.39844 90.72186
2019 MIN 113.22994 166.79996
2019 NOP NA 104.39771
2019 NYK 104.71669 90.35938
2019 OKC 104.92639 111.94815
2019 ORL 120.97842 81.62341
2019 PHI 138.52458 95.14026
2019 PHO 156.13342 NA
2019 POR 111.59470 100.90415
2019 SAC 89.83028 105.94420
2019 SAS 82.56237 115.01777
2019 TOR 70.65162 80.83718
2019 UTA 112.71276 99.50794
2019 WAS NA 84.77816

If you’re playing Team X, which ref do you want to be there?

Let’s look at the NBA Champion Toronto Raptors. To answer this question, I’ve decided to make a heatmap. The code below provides a breakdown of how I’m making the heat map.

The adjustment I’m making takes Toronto’s RefImpact scores for each referee and divides them by the RefImpact scores for each referee and each opposing team. For example, in 2019 Marc Davis had a RefImpact score for Toronto of 70.65162 (in the table above). The code takes 70.65162 and divides it by every other team’s RefImpact score for Marc Davis. Say we wanted to see a Toronto vs Golden State matchup with Marc Davis officiating - we would take Toronto’s score of 70.65162 and divide it by Golden State’s RefImpact score of 118.53018. This works out to about 0.596, and I multiply that by 100 to get a relative score of about 60. Any score less than 100 would favour Toronto in a matchup and anything more than 100 favours the other team. So a score of 60 favours the Raptors.

Below is a heatmap of all RefImpact scores for each team and referee relative to the Toronto Raptors. Anything white to light gray would favour the Raptors, anything in that light-medium gray would be about even. Moving from medium-gray to blue starts to slightly favour the opponent and blue to red to black becomes increasingly in favour of the opponent. Purple tiles are NAs - these are referee-opponent combinations that didn’t occur in the 2019 dataset.

TOR <- diffPlotData %>% filter(Season==2019, Team=="TOR") %>% select(Official, RefImpact)
## Adding missing grouping variables: `Team`
TORTable <- diffPlotData %>% filter(Season==2019, Team!="TOR", Official %in% TOR$Official) %>% select(Season, Team, Official, RefImpact) %>% spread(key=Official, value=RefImpact)
  
  adjustment <- TOR %>% spread(key=Official, value=RefImpact)
  adjustment <- adjustment[,-1]
  
  graphData <- tibble()
  
  for (i in 1:nrow(TORTable)){
    graphData <- bind_rows(graphData,c(TORTable[i,1:2],100*adjustment/TORTable[i,3:ncol(TORTable)]))
  }
  
  graphData <- graphData %>% gather(key="Official", value="RefImpact", -Season, -Team)
  
  breaks <- c(50,150)
  
  blups <- brewer.pal(name="BuPu", n=9)
  
  sc <- rescale(graphData$RefImpact, to = c(0,1))
  
  ggplot(graphData, aes(Team, Official)) + geom_tile(aes(fill = RefImpact)) +
    scale_fill_gradientn(colours= c('white', 'gray70', 'steelblue', 'red', 'black'), values =c(0,.1957, 0.33747, 0.47, 0.74), na.value = 'purple') #breaks in the scale bar

Say the Raptors were set to play the Warriors: the best three-official crew for the Raptors would be something like Ben Taylor, Marat Kogut and Tre Maddox. If you’re the Warriors, the best three-official crew would be Ron Garretson, Rodney Mott and Mike Callahan.

Conclusion

Overall, you see some interesting variation in NBA outcomes based on the referees assigned to the game. This is an interesting topic with some applicability to daily fantasy or game-outcome predictions. This project gives a wide overview of scraping data, exploratory data analysis, feature engineering and some modelling to understand how refereeing impacts the game.