mark felepchuk

Here's the player clustering app I created in shiny where you can select any player from the last 40 years and use the slider to select their age in the league. The app then uses k-means clustering to give the most similar other players at the same age.

Keep scrolling below the app to read on about the EDA and clustering work I did.

Part 1: Downloading Data and Goals Analysis

I wanted a little project in R to scrape some web data, clean it up a bit and do some analysis. I decided to look into NHL stats as data doesn’t seem to be as easily available compared to NBA stats. I trawled around the internet to find a website that would be easy to use to grab the data. I found http://hockey-reference.com to suit my needs. On their site they have the Play Index Tool that you can query tables of information. I pulled regular season statistics for players from 1979-80 to 2017-18 and was given the first 200 observations with a ‘Next’ button at the bottom (but no reference to the total number of observations).

So I looked at the URL and found that the end included a feature called ‘offset’ that shows the index number for the first player shown in the table. Here’s the URL for reference: https://www.hockey-reference.com/play-index/psl_finder.cgi?c2stat=&c4stat=&c2comp=gt&is_playoffs=N&order_by_asc=&birthyear_max=&birthyear_min=&c1comp=gt&year_min=1980&request=1&franch_id=&is_hof=&birth_country=&match=single&year_max=2018&c3comp=gt&season_end=-1&is_active=&c3stat=&lg_id=NHL&order_by=goals&season_start=1&c1val=&threshhold=5&c3val=&c2val=&am_team_id=&handed=&rookie=N&pos=S&describe_only=&c1stat=&draft=&c4val=&age_min=0&c4comp=gt&age_max=99&offset=200

At the end you see ‘offset=200’ this would return the 201st to 400th observations in the dataset. It took a bit of manual plug-and-play into the html to figure out the end of the dataset. The last offset call I needed was ‘offset=29200’. So now that I knew that I had enough info to start pulling data into R.

I first loaded in the tidyverse (as always) and the package htmltab which is a great package for reading in html tables, like the one from Hockey-Reference.com.

library(tidyverse)
  library(htmltab)

Next, I set up an empty tibble and initiated a variable to walk through the offsets – starting at 0 and increasing by 200 until we reach the magic number 29,200, this variable is defined as locs.

temp_players <- tibble()
  locs <- seq(0,29200,200)

I looped throught html tables and merged them into a dataframe using htmltab and which is set to the table that I wanted to download from the html address. Usually you need to set the parameter which, but there is only one table on this page and which defaults to download the first table.

for (loc in locs){
    dest <- paste("https://www.hockey-reference.com/play-index/psl_finder.cgi?c2stat=&c4stat=&c2comp=gt&is_playoffs=N&order_by_asc=&birthyear_max=&birthyear_min=&c1comp=gt&year_min=1980&request=1&franch_id=&is_hof=&birth_country=&match=single&year_max=2018&c3comp=gt&season_end=-1&is_active=&c3stat=&lg_id=NHL&order_by=goals&season_start=1&c1val=&threshhold=5&c3val=&c2val=&am_team_id=&handed=&rookie=N&pos=S&describe_only=&c1stat=&draft=&c4val=&age_min=0&c4comp=gt&age_max=99&offset=",
                  loc, sep="")
  
    temp <- htmltab(dest)
  
    temp_players <- temp_players %>% bind_rows(temp)
  }

Once I had the player statistics downloaded, I did some data wrangling/feature engineering to get the data in the right form. I found that going back to the 80s there were some differences in team abbreviations with the abbreviation data set I found (more on that later).

players <- temp_players %>% 
   mutate(Tm = ifelse(Tm=="WIN","WPG", ifelse(Tm=="CBH", "CHI",
     ifelse(Tm=="MDA","ANA",Tm))),
    Player = str_remove(Player, "\\*"), Pos =as.factor(Pos), 
    Tm = as.factor(Tm), Age = as.numeric(Age), Season=as.factor(Season),
    G=as.numeric(G), GP=as.numeric(GP), A = as.numeric(`Scoring >> A`),
    PTS = as.numeric(`Scoring >> PTS`), PM = as.numeric(`+/-`), 
    TOI = as.numeric(TOI), S = as.numeric(S), 
    SHG = as.numeric(`Goals >> SH`), PIM = as.numeric(PIM), 
    GWG = as.numeric(`Goals >> GW`)) %>% 
   select (Player, Tm, Pos, Season, Age, GP, G, A, PTS, PM, PIM) %>% filter(Tm != "TOT")

I then decided it may be interesting to do some tabulations/visualizations based on the decade the players played in. The decades were added below.

players <- players %>% 
    mutate(in80s = ifelse(Season %in% levels(players$Season)[1:10],1,0), 
           in90s = ifelse(Season %in% levels(players$Season)[11:20],1,0),
           in00s = ifelse(Season %in% levels(players$Season)[21:30],1,0),
           in10s = ifelse(Season %in% levels(players$Season)[31:38],1,0),
           decade = as.factor(ifelse(Season %in% levels(players$Season)[1:10],'80s',
            ifelse(Season %in% levels(players$Season)[11:20],'90s',
            ifelse(Season %in% levels(players$Season)[21:30],'00s', '10s')))))
  yearly_goals <- players %>% group_by(Season) %>% 
   summarise(total_goals = sum(G), mean_goals = mean(G), sd_goals = sd(G))

Now I want to do some plots, so I’ll load in ggplot2.

library(ggplot2)

I wanted to look at a few things in the dataset: goals and points over time, the distribution of goals and points, the distribution of games played, penalty minutes over time and the age of players in the league.

Goals

Here’s a breakdown of goals per season. As you can see, last year the highest scoring year in the NHL. This is followed closely by 2005-06 and 1992-93. The two low scoring years are lockout seasons.

ggplot(yearly_goals, aes(x=Season, y=total_goals)) + 
   geom_bar(stat='identity', aes(fill=total_goals)) + 
   theme_classic() + 
   theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
   theme(legend.position="none") + ggtitle("Goals per Season")

unnamed-chunk-7-1

How does the distribution of goals vary across these high scoring seasons? Are all high scoring seasons the same? Let’s take a look:

ggplot(players %>% 
   filter(Season == '2017-18' | Season=='2005-06' | Season == '1992-93'), 
   aes(y=G, x=Season)) +geom_boxplot(aes(colour=Season)) + 
   theme_classic() + 
   theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
   theme(legend.position="none") + 
   ggtitle("Goals Distribution in the Three Highest Scoring Years")

unnamed-chunk-8-1

It looks like 1992-93 had the highest outlier goal scorers that boosted the overall goal totals, where 2017-18 was much more evenly distrbuted. It’s interesting as median goals look very similar in all three years. Here’s an cumulative distribution function that shows the how goals are distributed across the population.

ggplot(players %>% 
  filter(Season == '2017-18' | Season=='2005-06' | Season == '1992-93'), 
   aes(G)) +geom_density(geom='smooth',aes(colour=Season)) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme(legend.position="right") + 
  ggtitle("Density Function of Goals in Three Highest Scoring Years")

unnamed-chunk-9-1

ggplot(players %>% 
  filter(Season == '2017-18' | Season=='2005-06' | Season == '1992-93'),
   aes(G)) +stat_ecdf(geom='smooth',aes(colour=Season)) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme(legend.position="right") + 
  ggtitle("CDF of Goals in Three Highest Scoring Years")

unnamed-chunk-9-2

So there are quite a few more 15 to 30 goal scorers in 2005-06 and 2017-18 compared to 1992-93, where total goals were bolstered by a few big goal scorers. Here are the top five goal scorers in each year.

players %>% filter(Season == '1992-93') %>% 
  select(Season, Player, G) %>% arrange(desc(G)) %>% top_n(5)

## # A tibble: 5 x 3
  ##   Season  Player                G
  ##                   
  ## 1 1992-93 Alexander Mogilny    76
  ## 2 1992-93 Teemu Selanne        76
  ## 3 1992-93 Mario Lemieux        69
  ## 4 1992-93 Luc Robitaille       63
  ## 5 1992-93 Pavel Bure           60

players %>% filter(Season == '2005-06') %>% 
  select(Season, Player, G) %>% arrange(desc(G)) %>% top_n(5)

## # A tibble: 5 x 3
  ##   Season  Player                G
  ##                   
  ## 1 2005-06 Jonathan Cheechoo    56
  ## 2 2005-06 Jaromir Jagr         54
  ## 3 2005-06 Ilya Kovalchuk       52
  ## 4 2005-06 Alex Ovechkin        52
  ## 5 2005-06 Dany Heatley         50

players %>% filter(Season == '2017-18') %>% 
  select(Season, Player, G) %>% arrange(desc(G)) %>% top_n(5)

## # A tibble: 5 x 3
  ##   Season  Player               G
  ##                  
  ## 1 2017-18 Alex Ovechkin       49
  ## 2 2017-18 Patrik Laine        44
  ## 3 2017-18 William Karlsson    43
  ## 4 2017-18 Evgeni Malkin       42
  ## 5 2017-18 Eric Staal          42

Assists vs Goals

As expected, years with more goals yield more assists. But there is some variation. The first chart shows the overall relationship in total assists and total goals since 1979-80. The second chart compares assists per goal in each season.

ggplot(players %>% group_by(Season) %>% 
  summarise(A=sum(A), G=sum(G)), aes(x=A, y=G)) + geom_point() + 
  theme_classic() + theme(legend.position="right") + 
  ggtitle("Assists vs. Goals")

unnamed-chunk-11-1

ggplot(players %>%group_by(Season) %>% 
  summarise(AperG=sum(A)/sum(G)), aes(x=Season, y=AperG)) + 
  geom_point() + theme_classic() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme(legend.position="none") + ggtitle("Assists per Goal by Season")

unnamed-chunk-11-2

It looks as though assists per goal have been going up, but there is quite a bit of variation between years.

Are players getting younger?

There seems to be a youth movement in the NHL. Are players getting younger? Let’s find out.

The chart below shows the distribution of age by decade.

ggplot(players, aes(Age)) + stat_ecdf(aes(colour=decade), 
  geom='smooth') + theme_classic() + theme(legend.position="right") + 
  ggtitle("Age Distribution by Season")

unnamed-chunk-12-1

ggplot(players %>% group_by(decade) %>% summarise(AvgAge = mean(Age))
  , aes(y=AvgAge, x=decade)) + geom_bar(stat='identity') + 
  theme_classic() + scale_x_discrete(limits=c('80s','90s','00s','10s'))
  + theme(legend.position="none") + ggtitle("Average Age by Season")

unnamed-chunk-12-2

It doesn’t look like the overall league is getting younger. The 1980s were the youngest. This may mean that players’ longevity in the league is increasing, this could be due to a different style of game and/or better physical therapy, among many other factors, I’m sure.

Penalty Minutes

Are players getting penalized less? There appears to be a lot less violence nowadays than there was historically. Can this be seen through penalty minutes?

ggplot(players %>% group_by(Season) %>% summarise(PIM = sum(PIM)), 
  aes(x=Season, y=PIM)) + geom_bar(stat='identity', aes(fill=PIM)) + 
  theme_classic() + theme(axis.text.x = element_text(angle = 90, 
  hjust = 1)) + theme(legend.position="none") + 
  ggtitle("Penalty Minutes per Season")

unnamed-chunk-13-1

ggplot(players, aes(PIM)) + stat_ecdf(aes(colour=decade), 
  geom='smooth') + theme_classic() + theme(legend.position="right") + 
  ggtitle("PIM Distribution by Season")

unnamed-chunk-13-2

Looks like there are a lot fewer penalty minutes and most players in the current decade commit far fewer penalties. The 1980s and 1990s had very similar distributions, which is interesting. As the game evolved in the 2000s to now, there has been a significant drop in penalty minutes.

Joining Team Data

Now that I have a player dataset in good working order, I wanted to get the corresponding regular season data for teams for the last 38 years. Like with the player data, I used http://hockey-reference.com to access yearly standings and team statistics.

For team standings, I’m going to use the htmltab package in R and roll through the URLs on http://hockey-reference.com. Here’s the URL for reference: https://www.hockey-reference.com/leagues/NHL_2018.html.

This URL corresponds to the 2017-18 season. So to get each season I needed to create a loop to go through each URL from NHL_1980.html to NHL_2018.html, skipping the lockout year in 2005.

library(tidyverse)
  library(htmltab)
  
  years <- seq(1980,2018,1)
  
  years <- years[-which(years==2005)]
  
  standings = tibble()
  i=1
  
  for (year in years) { 
  
    link <- paste("https://www.hockey-reference.com/leagues/NHL_",
                   year,".html",sep="")
  
    east = htmltab(link, which =1)
    west = htmltab(link, which =2)
  
    league_temp = bind_rows(east,west)
  
    league_temp <- league_temp[which(!is.na(as.numeric(league_temp$GP))),]
    league_temp <- league_temp %>% 
       mutate(Season = as.factor(levels(players$Season)[i]), 
        Team = V1) %>%
       mutate(inPlayoffs = str_detect(Team, "\\*"), 
       Team = str_remove(Team, "\\*")) %>% select(-V1)
  
    standings = standings %>% bind_rows(league_temp)
  
    i=i+1
  }
  
  standings <- standings %>% mutate (Team = as.factor(Team)) %>% 
   select(-OL, -`RPt%`, -ROW, -`PTS%`, -SRS, -SOS)

You’ll see in the code chunk above that I needed to use the levels from the Season data from the players data frame to correctly assign the season to the ‘i’th Season in the loop. I also created an inPlayoffs boolean column that is set to TRUE when there is an asterisk (*) beside the team’s name, which is how Hockey-Reference.com denotes a playoff bound team.

Now, the players data set only had team abbreviations and the team statistics only have full team names. So I found a list of full team names and abbreviations that I loaded in and linked to the standings data frame with a left join. The list I found was written as “Abbreviation – Full Team Name”. I used the separate and mutate_all functions to separately the full team name and abbreviation at the hyphen and used mutate_all and str_trim to remove excess whitespace.

abbs <- read.csv('teamabbs.csv', header=F)
  
  abb_key <- abbs %>% separate(V1, sep="\\–", into=c("Tm", "Team")) %>% 
   mutate_all(funs(str_trim))
  
  standings <- standings %>% left_join(abb_key) %>% 
   mutate(PCT = `PTS%`)

Now that I have the abbreviation keys (abb_key) cleaned up, I can use them in a left join with the standings data to create a standings data frame that could be merged with the players data frame on the team abbreviation and on the Season.

Before merging the standings, I want to select only certain columns from the data frame and convert numbers to ‘numeric’ and characters to factors.

standings_to_merge <- standings %>% 
   select(Tm, inPlayoffs, Season, PCT, PTS, GF) %>% 
   mutate_at(.funs=funs(as.numeric), .vars=vars(PCT:GF)) %>% 
   mutate_at(.funs=funs(as.factor), .vars=vars(Tm:Season))

Now I can left join the players data frame with the standings_to_merge data frame. I merge on Tm, the team abbreviation, and the Season. I do some renaming of columns and create a few new variables:

Goal Share: the share of goals that a player was responsible for on their team
Point Share: the share of points that a player was responsible for on their team

combined <- players %>% left_join(standings_to_merge, 
   by=c('Tm', 'Season')) %>% rename(TeamPTS = PTS.y, PTS = PTS.x) %>% 
   mutate(Goal_Share = 100*G/GF, Point_Cont = 100*PTS/GF) %>% 
   filter(Tm != "TOT")

Using the combined data frame, I’ll create two data sets that will be used for some analysis, data and data_w_names. I also create per 100 games variables to adjust for lockout seasons and seasons with 84 games, as there were in the early 1990s.

data <- combined %>% filter(GP==25) %>% select(GP, Pos, 
    Age:PIM, Goal_Share, Point_Cont, inPlayoffs, PCT, Tm) %>% 
   select(-PTS) %>% mutate(Gp100g = 100* G/GP, Ap100g = 100*A/GP, 
   PMp100g = 100*PM/GP, PIMp100g = 100*PIM/GP) %>% 
   select(-G,-A,-PM,-PIM)
  
  data_w_names <- combined %>% filter(GP==25) %>% select(Player, GP, 
   Pos, Age:PIM, Goal_Share, Point_Cont, inPlayoffs, PCT,Tm) %>% 
   select(-PTS) %>% mutate(Gp100g = 100* G/GP, Ap100g = 100*A/GP, 
   PMp100g = 100*PM/GP, PIMp100g = 100*PIM/GP) %>% 
   select(-G,-A,-PM,-PIM)

Clustering

I was curious about clustering and discovered the mclust package in R. Mclust is a model-based clustering algorithm that uses the Bayesian Information Criterion (BIC) to select the correct number of clusters.

I didn’t want the clustering to be dependent on making the playoffs or by position. I was curious if R could bucket players based purely on statistics and if there were natural clusters of defensemen and forwards.

cluster_data <- data %>% select(-Pos, -inPlayoffs,-Tm)
  scaled_data <- scale(cluster_data)
  
  library(mclust)
  
  d_clust <- Mclust(as.matrix(scaled_data), G=1:25, 
   modelNames = mclust.options("emModelNames"))
  clust <- d_clust$classification

Mclust determined that there are 25(!!) different kinds of hockey players in the NHL. I joined the groupings back on to the data frames to see which players fell into what cluster.

data <- data %>% mutate(clust = clust$classification)
  
  data_w_names <- data_w_names %>% mutate(clust=clust$classification)

The first thing I looked at with the clusters was to summarize the average number of goals, assists, penalty minutes and average plus/minus for each cluster. I also included a new column called ‘playoffs’ that gives the probability of players in that cluster making the playoffs.

clusts <- data_w_names %>% group_by(clust) %>% 
   summarise(n=n(),age=mean(Age),goals = mean(Gp100g), 
   assists = mean(Ap100g), PM = mean(PMp100g), PIM = mean(PIMp100g), 
   playoffs = sum(as.logical(inPlayoffs))/n())
  
  print.data.frame(clusts)

##    clust    n      age     goals  assists         PM       PIM  playoffs
  ## 1      1  883 26.80294  4.248453 17.98241 -0.6189196 116.20253 0.6817667
  ## 2      2  980 25.46531 29.687414 30.54330  1.9894193 157.36590 0.7806122
  ## 3      3  815 27.36564  3.109565 23.05783  0.7442024 114.92284 0.6073620
  ## 4      4  423 27.01891 18.227418 30.63753 -1.2851738  62.80609 0.5153664
  ## 5      5  804 25.26990  9.857683 18.45907 -4.2452292 116.61916 0.7213930
  ## 6      6  890 24.99663 14.294625 21.63204 -3.6118897  63.21156 0.5966292
  ## 7      7  893 27.30907 15.060819 26.88672 -0.7392935 104.06142 0.6159015
  ## 8      8  853 27.54865  8.145523 25.12051 -0.2238181  63.81644 0.6295428
  ## 9      9 1026 26.66959 24.342307 32.82442 -1.1877275  67.37584 0.6559454
  ## 10    10  625 26.11040 15.736803 35.20244  2.2649887  83.11508 0.6896000
  ## 11    11  708 26.29802 24.655795 57.22555  8.0925426  81.00971 0.8149718
  ## 12    12 1046 27.84034  5.711213 24.55611  1.3394717  80.09759 0.5745698
  ## 13    13  876 26.33790  9.202790 19.42255 -0.8396226 107.81259 0.6472603
  ## 14    14  884 26.80656 24.201678 28.27459  0.8754662  48.31338 0.5871041
  ## 15    15  751 26.27830 20.419216 30.47688  2.0849098  77.79579 0.7496671
  ## 16    16  768 27.06120 26.566974 41.54588  1.4875195  69.44245 0.6093750
  ## 17    17  655 25.81069 21.804079 20.40046 -3.4090249  51.82682 0.6564885
  ## 18    18  728 26.51786 27.050890 25.97334 -0.4061682  67.35099 0.6263736
  ## 19    19  467 27.19914 17.243157 58.56567  8.5520661 168.21426 0.7644540
  ## 20    20  921 27.66341 15.095984 34.97082 -2.1350962  52.19773 0.3908795
  ## 21    21  811 26.53391 23.478720 26.21238 -8.4361888  89.88663 0.2441430
  ## 22    22  966 26.61698  3.727737 11.31705 -7.0102647 148.20956 0.6004141
  ## 23    23  482 25.15353 51.961866 48.34392 13.3663578 114.62708 0.7717842
  ## 24    24  956 27.55439 17.163595 27.31180 -3.6552898  61.62064 0.4476987

Two of the more interesting groups to be are clusters 22 and 23. Cluster 22 appears to be ‘goons’ and 23 appear to be the superstars.

## Cluster 22
  data_w_names %>%filter(clust==22) %>% select(Player, Pos)

## # A tibble: 966 x 2
  ##    Player           Pos
  ##  1 John Kordic      RW
  ##  2 Marty McSorley   RW
  ##  3 Jack Carlson     LW
  ##  4 Tim Hunter       RW
  ##  5 Kevin McClelland C
  ##  6 Bob Probert      LW
  ##  7 Torrie Robertson LW
  ##  8 Perry Anderson   LW
  ##  9 Tom Chorske      LW
  ## 10 Tony Granato     RW
  ## # ... with 956 more rows

##Cluster 23
  data_w_names %>% filter(clust==23) %>% select(Player, Pos)

## # A tibble: 482 x 2
  ##    Player            Pos
  ##  1 Wayne Gretzky     C
  ##  2 Wayne Gretzky     C
  ##  3 Brett Hull        RW
  ##  4 Mario Lemieux     C
  ##  5 Alexander Mogilny RW
  ##  6 Teemu Selanne     RW
  ##  7 Teemu Selanne     RW
  ##  8 Wayne Gretzky     C
  ##  9 Brett Hull        RW
  ## 10 Wayne Gretzky     C
  ## # ... with 472 more rows

I see Bob Probert, Marty McSorley and Jack Carlson in cluster 22 and Gretzky, Lemiux, Selanne and Hull in cluster 23.

But cluster 11 has the highest probability of making the playoffs, even though inPlayoffs wasn’t included in the clustering algorithm. Who are these guys?

## Cluster 11
  data_w_names %>% filter(clust==11) %>% select(Player, Pos)

## # A tibble: 708 x 2
  ##    Player            Pos
  ##  1 Wayne Gretzky     C
  ##  2 Pat LaFontaine    C
  ##  3 Kent Nilsson      C
  ##  4 Mario Lemieux     C
  ##  5 Marcel Dionne     C
  ##  6 Peter Stastny     C
  ##  7 Mark Messier      C
  ##  8 Adam Oates        C
  ##  9 Daniel Alfredsson RW
  ## 10 Mats Naslund      LW
  ## # ... with 698 more rows

There’s an interesting package called ggradar that creates radar plots. Here’s a radar plot for clusters 11, 22 and 23.

library(ggradar)
  
  clusts %>% mutate_each(funs(rescale),-clust) %>% 
   filter(clust==11 | clust==22 | clust==23) -> radar_clust
  
  ggradar(radar_clust)

plot of chunk unnamed-chunk-11

Logistic Regression

The last bit of work I want to showcase in this post is a logistic regression to see if we can predict with any accuracy the probability that a player will make the playoffs given their statistics and their importance on the team from a point-getting perspective. I’m using the caret package to run a 10-fold cross-validation repeated 10 times.

library(caret)
  
  class_data <- data %>% select(-PCT)
  
  set.seed(321)
  training <- sample(1:nrow(data), round(nrow(data)*.7))
  
  class_data_train <- class_data[training,]
  class_data_test <- class_data[-training,]
  
  fitControl <- trainControl(## 10-fold CV
    method = "repeatedcv",
    number = 10,
    repeats = 10,
    savePredictions = TRUE
  )
  
  lreg <- train(inPlayoffs~.,data=class_data_train, 
   method="glm",family=binomial(), trControl=fitControl)
  
  confusionMatrix(lreg)

## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix
  ##
  ## (entries are percentual average cell counts across resamples)
  ##
  ##           Reference
  ## Prediction FALSE TRUE
  ##      FALSE  24.3  8.9
  ##      TRUE   14.2 52.6
  ##
  ##  Accuracy (average) : 0.7688

So with nearly 77% accuracy, the model can predict if a player will make the playoffs given their statistical profile and the team that they’re on.

And using the withheld data set, the regression predicts with almost 76% accuracy.

preds <- predict(lreg, newdata=class_data_test)
  
  print(sum(1-abs(as.logical(preds) - 
   as.logical(class_data_test$inPlayoffs)))/nrow(class_data_test))

## [1] 0.7558563

Pretty cool - we can predict with pretty good accuracy!

Shiny App

I'm not going to go into detail on deploying my shiny app from above. Essentially I used the player data to create a k-means cluster and then attached the cluster grouping data to the player data. The shiny table uses a big data.frame to query the player and age that's selected and then it lists the other members of the cluter. Some cool ones to check are:

Wayne Gretzky from ages 20 to 26 (he was really in a class of his own).

Auston Matthews in his age 19 season (where he's paired with the second pick in the same draft)

Mario Lemieux has a lot of seasons where he's by himself (and one with Gretzky)

Look up your favourite NHL tough guy and see who he compares with over his career

002 : Unsupervised Learning

TLDR: