Rでスクレイピングをしてみた part2

Ｒでロジスティック回帰の練習をしようと思ったのですが、自分がもっているデータの中に２値分類の問題がつくれそうなデータがなかったため、新たにデータを取得することにしました。
今回はwikipediaからNBAのオールスター出場選手のデータを取得します。

List of NBA All-Stars - Wikipedia

やり方は以前スクレイピングをやったときと同じです。今回はネットから情報を取得した後のデータ加工が少し大変でした。

hadadada00.hatenablog.com

データの取得

まずはrvestパッケージを使用してウェブからhtmlファイルを取得します。

library(rvest)
library(tidyverse)

### get data from web
# set url 
url <- "https://en.wikipedia.org/wiki/List_of_NBA_All-Stars"

# get html file
h <- read_html(url)

次にhtmlファイルの中からデータの入っているテーブルを探します。

# parse html
table <- h %>% 
  html_nodes("table")

# check the table tag data
table

> # check the table tag data
> table
{xml_nodeset (14)}
 [1] <table class="wikitable"><tbody>\n<tr>\n<td>\n<b>#</b>\n</ ...
 [2] <table class="wikitable sortable" style="width:100%"><tbod ...
 [3] <table class="nowraplinks hlist collapsible autocollapse n ...
 [4] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
 [5] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
 [6] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
 [7] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
 [8] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
 [9] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
[10] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
[11] <table style="width:100%;padding:0"><tbody><tr>\n<td style ...
[12] <table class="nowraplinks hlist collapsible autocollapse n ...
[13] <table class="nowraplinks navbox-subgroup" style="border-s ...
[14] <table class="nowraplinks navbox-subgroup" style="border-s ...

テーブルがたくさんあります。
wikipediaのソースを表示して確認すると(table class="wikitable sortable" style="width:100%")のタグがついたテーブルが今回欲しいテーブルのようです。 f:id:hadadada00:20190627234502p:plain

そのため2番目のテーブルデータを取得します。
データを取得したらView()で中身を確認します。

# pick up all-star players table
allstars <- table[[2]] %>% 
  html_table()

# check the table
View(allstars)

f:id:hadadada00:20190627234805p:plain

データはうまくデータフレームに入っています。

データの整形

後はデータをキレイにしていきます。
今回は以下のように、名前とAll Starに選出された年が入った2列のデータを目指します。
f:id:hadadada00:20190627235029p:plain

まずは、列名を変更します。

### transform data
# change names
names(allstars) <- c("Player", "counts", "Selections", "Notes", "Reference")

View(allstars)

f:id:hadadada00:20190628000130p:plain

次に実際に使用するPlayer列とSelections列のデータをキレイにします。
Player列は不要な記号が末尾についているため、正規表現を使用して、名前の部分だけを取り除きます。
併せて、Selections列にenダッシュがはいっているため、これをハイフンに変換します。

ハイフンに似てる文字の文字コード - Qiita

enダッシュの変換方法については以下を参考にしました。

character - How can I add an en dash to a plot in R? - Stack Overflow

library(Cairo)

# clean data
allstars <- allstars %>%
  mutate(names = str_extract(Player, "[A-Za-z'. -]+"),
         years = gsub("\U2013", "-", Selections))

View(allstars)

f:id:hadadada00:20190628000806p:plain

これでPlayer列とSelection列をきれいにしたnames列とyears列が作成できました。
確認してみると一部Player列→names列の作成がうまくできていないデータがあります。

f:id:hadadada00:20190628001110p:plain — 244行のデータは上手くPlayer列からnames列に変換できていない。

理由は選手の名前がEnglish alphabet以外のアルファベットで表記されているためです。
選手の指名の表記に応じて複数の変換ロジックを組むのが大変だったため、今回は表記がおかしくなる5名の選手名を直接修正することにしました。

# fix irregular(non-English alphabet) name players
allstars[244, "names"] <- "Zydrunas Ilgauskas"
allstars[324, "names"] <- "Goran Dragic"
allstars[355, "names"] <- "Nikola Jokic"
allstars[387, "names"] <- "Kristaps Porzingis"
allstars[414, "names"] <- "Nikola Vucevic"

以降はnames列とyears列を使用して、目的のデータを作っていきます。
かなりムリヤリになってしまいました。

まずnamesとyearsをそれぞれベクトルとして抽出します。
これらを加工して得たベクトルを初期化したallstarsというデータフレームに格納していきます。

### create data table
# separate names column and years column as single dataframe
names <- allstars["names"]
years <- allstars["years"]

# initialise allstars
allstars <- data.frame(name = NULL, year = NULL)

ここで、yearsの各行から年を抽出してベクトルを作成します。ポイントは連続して選出された場合2000-2005のようにハイフンでつながっていることです。そのため、ベクトル作成用の関数を作成します。

### utility
selection_years <- function(str) {
  # split string 
  str_array <- str_split(str, ";", simplify = TRUE)  # split by ";"
  str_array <- gsub(" ", "", str_array)  # remove space
  str_array <- gsub(",", "", str_array)  # remove camma
  # number of elements of selection years
  ncol <- ncol(str_array)
  selection_years <- c()
  for(i in 1:ncol) {
    # single year(e.g 1997) or multiple years(e.g. 1997-2000)
    is_single <- !(str_detect(str_array[1, i], pattern = "-"))
    # swith the way of making vector by is_single
    if (is_single) {
      selection_years <- append(selection_years, as.integer(str_array[1, i]))
    } else {
      start <- as.integer(str_split(str_array[1, i], "-", simplify = TRUE)[1, 1])
      end <- as.integer(str_split(str_array[1, i], "-", simplify = TRUE)[1, 2])
      selection_years <- append(selection_years, start:end)
    }
  }
  selection_years
}

ダラダラ長くなってしまいましたが、yearsに格納された要素を行列に変換して(1×N列)、各列が単一の年（2000）の場合と、複数年の場合(2000-2005)で処理を分岐させて1つのベクトルを作成しています。

この関数を使用して、yearsから選出年が1つずつ格納されたベクトルを作成して、選出された年数と同じ長さの名前ベクトルをnamesから作成してつなげます。
これを全ての選手について繰り返すことで目的のデータを得ます。

# pick up selection years from years 
# , and combine it with names 
n <- nrow(names)
for(i in 1:n) {
  n <- names[i, 1]
  ys <- selection_years(years[i, 1])
  l <- length(ys)
  ns <- rep(n, l)
  allstars <- rbind(allstars, data.frame(name = ns, year = ys))
}