Rでggplot2を使ってロジスティック回帰分析の結果のオッズをフォレストプロット風のプロット図を作成する（その2）

「ロジスティック回帰分析　フォレストプロット」のキーワードで検索すると現在、前に書いた記事が結構上位にヒットする。

indenkun.hatenablog.com

別に手法はそれでいいのだが、ロジスティック回帰分析のオッズ比と95％信頼区間の計算を{broom}でやったほうが簡単かなと思ったので、そのデータをつかってRで{ggplot2}を使ってロジスティック回帰分析の結果のオッズをフォレストプロット風のプロット図を作成する（Rewrite、といっても信頼区間の計算で{broom}を使おうというただそれだけの違い）。

要旨：Rでロジスティック回帰分析をしてオッズ比を出して、そのオッズ比をggplot2でフォレストプロット風のプロット図（95％信頼区間エラーバー付き）を作る。

先の記事と同じでTitanicデータの生存データを使ってやってみる。

Titanicデータの中身はClass（等級）、Sex（性別）、Age（大人か子供か）、Survive（生死）になっている。

データそのものが、集計済みクロス表になっているのでロジスティック回帰分析しやすいように観測データごとのデータに前処理する。

df <- data.frame(Titanic)
df <- data.frame(Class = rep(df$Class, df$Freq),
                 Sex = rep(df$Sex, df$Freq),
                 Age = rep(df$Age, df$Freq),
                 Survived = rep(df$Survived, df$Freq))

これで、

head(df)

##   Class  Sex   Age Survived
## 1   3rd Male Child       No
## 2   3rd Male Child       No
## 3   3rd Male Child       No
## 4   3rd Male Child       No
## 5   3rd Male Child       No
## 6   3rd Male Child       No

となる。

これをSurvivedを従属変数として、それ以外の変数すべてを独立変数とするロジスティック回帰分析を行う(一応中身も見ておく)。

df.logit <- glm(Survived ~ ., family = binomial(link = "logit"), data = df)
summary(df.logit)

## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0812  -0.7149  -0.6656   0.6858   2.1278  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6853     0.2730   2.510   0.0121 *  
## Class2nd     -1.0181     0.1960  -5.194 2.05e-07 ***
## Class3rd     -1.7778     0.1716 -10.362  < 2e-16 ***
## ClassCrew    -0.8577     0.1573  -5.451 5.00e-08 ***
## SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
## AgeAdult     -1.0615     0.2440  -4.350 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2769.5  on 2200  degrees of freedom
## Residual deviance: 2210.1  on 2195  degrees of freedom
## AIC: 2222.1
## 
## Number of Fisher Scoring iterations: 4

ここでオッズ比と95%信頼区間を求めるときに{broom}パッケージのtidy()関数をつかって計算してもらう（前記事の手法に問題があるわけではないが、なんとなくナウい気がするだけ）。

# {broom}パッケージをインストールしていなければ、
# install.package("broom")
# でインストールする。
library(broom)
df.conf <- tidy(df.logit, conf.int = TRUE, exponentiate = TRUE)
df.conf

## # A tibble: 6 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.98      0.273      2.51 1.21e- 2    1.16      3.39 
## 2 Class2nd       0.361     0.196     -5.19 2.05e- 7    0.245     0.529
## 3 Class3rd       0.169     0.172    -10.4  3.69e-25    0.120     0.236
## 4 ClassCrew      0.424     0.157     -5.45 5.00e- 8    0.312     0.577
## 5 SexFemale     11.2       0.140     17.2  1.43e-66    8.57     14.9  
## 6 AgeAdult       0.346     0.244     -4.35 1.36e- 5    0.214     0.558

これで必要なオッズ比と95％信頼区間が計算できたのであとはこれで作図していく。

library(ggplot2)
df.conf |> 
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 1, linetype = "dashed") +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) + 
  labs(title = "Odds Ratios for\nSurvive of Titanic passenger", 
       x = "OR and 95% confidence interval", y = "") +
  # y軸が下から順になるのでその順番を並び替える
  scale_y_discrete(limits = rev(df.conf$term)) +
  theme_classic()

f:id:indenkun:20220418131757p:plain

これでできた。

Class1stやSexMale、AgeChildをオッズ比1でグラフ中に加えようとすると、それ用のデータを作る必要がある(と思う)。

# termに入れたい名前をいれて、estimateが1のみのデータフレームをつくる
OR1.df <- data.frame(term = c("Class1st", "SexMale", "AgeChild"),
                     estimate = 1)
library(dplyr)

## 
##  次のパッケージを付け加えます: 'dplyr'

##  以下のオブジェクトは 'package:stats' からマスクされています: 
## 
##      filter, lag

##  以下のオブジェクトは 'package:base' からマスクされています: 
## 
##      intersect, setdiff, setequal, union

# もとのデータとくっつけて、データの順番を並び替える
df.conf <- bind_rows(df.conf, OR1.df)
df.conf <- df.conf[c(1, 7, 2:4, 8, 5, 9, 6), ]

このデータで同じように作図すると作れる。

library(ggplot2)
df.conf |> 
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 1, linetype = "dashed") +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) + 
  labs(title = "Odds Ratios for\nSurvive of Titanic passenger", 
       x = "OR and 95% confidence interval", y = "") +
  scale_y_discrete(limits = rev(df.conf$term)) +
  theme_classic()