Rでロジスティック回帰（２値分類）をやってみた

Ｒでロジスティック回帰（2値分類）の練習をやってみました。

概要

NBAのデータを使用する
各年の成績データを使用して、その年にオールスターに選ばれたかを予測する（2値分類）

データ

以下で作成したデータを使用します。
allstars
hadadada00.hatenablog.com

seasons_stats
hadadada00.hatenablog.com

準備

最初に、ライブラリの読み込みと、モデルの説明変数（targets）を設定しておきます。

library(caret)
library(formattable)
library(nbastats)
library(tidyverse)

### settings
#target columns
targets <- c(
  "MP",  # minutes played
  "PTS", # points
  "FT",  # free throws
  "AST", # assists
  "STL", # steals
  "BLK", # blocks
  "TOV"  # tunrovers
)

モデル構築用データ作成

次にデータを読み込んでモデル構築用のデータを作っていきます。今回はモデル構築に2015-2016年シーズンのデータを使用し、テストに2016-2017年シーズンのデータを使用します。
NBAの登録選手約480名に対し、オールスターに選出されるのは25名ほどのため、サンプルを各シーズンのサンプルを分割するのではなく、前年の年のデータでモデルを構築し、翌年のデータで検証する方針としました。

まずは、2015-2016年シーズンのデータを成績データ(seasons_stats)、オールスターデータ(allstars)からそれぞれ抽出します。

### data tranformation
# load data
stats <- seasons_stats
stars <- allstars

# train data (2016)
stats <- stats %>% 
  filter(Year == "2016")
stars <- stars %>% 
  filter(year == "2016")

次に、seasons_statsデータで１選手１レコードにします。これは、以下の選手のようにシーズン中に移籍した場合、各チームでの成績と、トータルの成績両方がデータとして存在しているためです。このような場合はトータルのデータを使用することにしました。

f:id:hadadada00:20190701003904p:plain

# in case of duplicate rows in stats (traded during season)
# use total stats(TOT)
stats <- stats %>% 
  group_by(Player) %>% 
  mutate(n = row_number(),
         max_n = max(n)) %>% 
  filter(max_n != 1 & Tm == "TOT" | max_n == 1) %>% 
  ungroup()

次にseasons_statsとallstarsをマージします。マージキーは選手の名前を使用し、全て小文字にしてから使用します。

# merge stats with all stars
stats <- stats %>% 
  mutate(name_s = tolower(Player))

stars <- stars %>% 
  mutate(mflg = 1,
         name_a = tolower(name))

stats_stars <- left_join(stats, stars, by = c("name_s" = "name_a")) %>% 
  mutate(is_allstar = coalesce(mflg, 0))

# check merge
nrow(stars)

stats_stars %>% 
  group_by(is_allstar) %>% 
  summarise(n = n())

> # check merge
> nrow(stars)
[1] 26

> stats_stars %>% 
+   group_by(is_allstar) %>% 
+   summarise(n = n())
# A tibble: 2 x 2
  is_allstar     n
       <dbl> <int>
1          0   450
2          1    26

マージの結果、26選手のオールスター選出情報が全て成績データと紐づきました。

モデル構築

次にいよいよモデルを構築します。概要で説明したとおり、2015-2016年シーズンのデータを全てモデル構築に使用します。

### modeling
x <- stats_stars %>% 
  select(targets) %>% 
  sapply(as.integer)

y <- stats_stars$is_allstar

# standarise
x <- scale(x)

# make model
model <- lm(y ~ MP + PTS + FT + AST + STL + BLK + TOV, data = x)

モデルが出来たので中身を確認します。

summary(model)

> summary(model)
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.60042 -0.05272  0.00042  0.03274  0.76162 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.054622   0.007321   7.461 4.21e-13 ***
xMP         -0.245296   0.023193 -10.576  < 2e-16 ***
xPTS         0.250727   0.030534   8.211 2.15e-15 ***
xFT          0.057598   0.019642   2.932 0.003529 ** 
xAST         0.090719   0.018559   4.888 1.40e-06 ***
xSTL         0.061315   0.015763   3.890 0.000115 ***
xBLK         0.055393   0.009783   5.662 2.61e-08 ***
xTOV        -0.104087   0.026695  -3.899 0.000111 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1597 on 468 degrees of freedom
Multiple R-squared:  0.5143,	Adjusted R-squared:  0.507 
F-statistic: 70.79 on 7 and 468 DF,  p-value: < 2.2e-16

各変数の係数を見ると、全て1%有意(**)を満たしております。係数の符号を見てみると以下のようになっています。

変数	符号	解釈
xMP	-	プレイ時間が短い方がオールスターに選ばれやすい
xPTS	+	ポイント数が多い方がオールスターに選ばれやすい
xFT	+	フリースロー数が多い方がオールスターに選ばれやすい
xAST	+	アシスト数が多い方がオールスターに選ばれやすい
xSTL	+	スティール数が多い方がオールスターに選ばれやすい
xBLK	+	ブロック数が多い方がオールスターに選ばれやすい
xTOV	-	ターンオーバー数が少ない方がオールスターに選ばれやすい

符号についても納得性が高いです。プレイ時間については一瞬逆のように思えますが、「他の条件（ポイント数、フリースロー数、アシスト数etc.）をコントロールした上では、出場時間が短い方が評価が高い」、つまり、「短い出場時間で好成績を出せた選手の評価が高い」と解釈すれば納得できると思います。

次に、２値分類のaccuracyを確認します。

### predict & check accuracy
p_hat <- predict(model, data.frame(x))
y_hat <- ifelse(p_hat > 0.5, 1, 0)

confusionMatrix(as.factor(y_hat), as.factor(y))

> confusionMatrix(as.factor(y_hat), as.factor(y))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 448  10
         1   2  16
                                          
               Accuracy : 0.9748          
                 95% CI : (0.9564, 0.9869)
    No Information Rate : 0.9454          
    P-Value [Acc > NIR] : 0.001436        
                                          
                  Kappa : 0.7145          
                                          
 Mcnemar's Test P-Value : 0.043308        
                                          
            Sensitivity : 0.9956          
            Specificity : 0.6154          
         Pos Pred Value : 0.9782          
         Neg Pred Value : 0.8889          
             Prevalence : 0.9454          
         Detection Rate : 0.9412          
   Detection Prevalence : 0.9622          
      Balanced Accuracy : 0.8055          
                                          
       'Positive' Class : 0

Balanced Accuracyは0.8055です。
次に、2016-2017年シーズンのデータでテストを行います。これまでと同じ手順を2016-2017年シーズンのデータに適用して、テスト用データを作成し、作成したモデル（model）で予測を行い、Balanced Accuracyを確認します。

### test in 2017 data
data(seasons_stats)
data(allstars)
# use 2017 data
stats_test <- seasons_stats %>% 
  filter(Year == "2017")

stars_test <- allstars %>% 
  filter(year == "2017")

# in case of duplicate rows in stats (traded during season)
# use total stats(TOT)
stats_test <- stats_test %>% 
  group_by(Player) %>% 
  mutate(n = row_number(),
         max_n = max(n)) %>% 
  filter(max_n != 1 & Tm == "TOT" | max_n == 1) %>% 
  ungroup()

# merge stats with all stars
stats_test <- stats_test %>% 
  mutate(name_s = tolower(Player))

stars_test <- stars_test %>% 
  mutate(mflg = 1,
         name_a = tolower(name))

stats_stars_test <- left_join(stats_test, stars_test, by = c("name_s" = "name_a")) %>% 
  mutate(is_allstar = coalesce(mflg, 0))

# check merge
stats_stars_test %>% 
  group_by(is_allstar) %>% 
  summarise(n = n())

### predict
x_test <- stats_stars_test %>% 
  select(targets) %>% 
  sapply(as.integer)

y_test <- stats_stars_test$is_allstar

# standarise
x_test <- data.frame(scale(x_test))

p_test_hat <- predict(model, newdata = x_test)
y_test_hat <- ifelse(p_test_hat > 0.5, 1, 0)

confusionMatrix(as.factor(y_test_hat), as.factor(y_test))

> confusionMatrix(as.factor(y_test_hat), as.factor(y_test))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 459  10
         1   2  15
                                          
               Accuracy : 0.9753          
                 95% CI : (0.9573, 0.9872)
    No Information Rate : 0.9486          
    P-Value [Acc > NIR] : 0.002596        
                                          
                  Kappa : 0.7019          
                                          
 Mcnemar's Test P-Value : 0.043308        
                                          
            Sensitivity : 0.9957          
            Specificity : 0.6000          
         Pos Pred Value : 0.9787          
         Neg Pred Value : 0.8824          
             Prevalence : 0.9486          
         Detection Rate : 0.9444          
   Detection Prevalence : 0.9650          
      Balanced Accuracy : 0.7978          
                                          
       'Positive' Class : 0