We all have used stepwise regression at some point. Stepwise regression is known to be sensitive to initial inputs. One way to mitigate this sensitivity is to repeatedly run stepwise regression on bootstrap samples.
R has a nice package called bootStepAIC which (from its description) “Implements a Bootstrap procedure to investigate the variability of model selection under the stepAIC() stepwise algorithm of package MASS.”
Its output contains a lot of information, and keeping track of it can get challenging, especially when there are many covariates. In this post we’ll try to come up with a simple visualization that summarizes the output of the function boot.stepAIC().
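To make the idea of rerunning stepwise selection on bootstrap samples concrete, here is a minimal hand-rolled sketch using only base R. It is not how bootStepAIC is implemented; the mtcars data, the formula, and the names B, picked and pick_pct are all assumptions chosen for illustration.

```r
# Illustration only: refit a stepwise model on resampled data and
# count how often each covariate survives the selection.
set.seed(1)
B <- 20                                   # number of bootstrap samples (assumed)
full_formula <- mpg ~ wt + hp + qsec + drat
vars <- attr(terms(full_formula), "term.labels")

# One row per bootstrap sample, one column per candidate covariate
picked <- matrix(FALSE, nrow = B, ncol = length(vars),
                 dimnames = list(NULL, vars))

for (b in seq_len(B)) {
  # Resample rows with replacement
  boot_data <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  fit_b <- lm(full_formula, data = boot_data)
  # Backward stepwise selection by AIC (base R's step())
  step_b <- step(fit_b, trace = 0)
  picked[b, ] <- vars %in% attr(terms(step_b), "term.labels")
}

# Percentage of bootstrap samples in which each covariate was kept
pick_pct <- 100 * colMeans(picked)
print(round(pick_pct))
```

A covariate that is kept in nearly every resample is a stable choice; one that comes and goes is exactly the variability that the rest of this post tries to visualize.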
Running boot.stepAIC()
Using boot.stepAIC() is fairly simple. Just input an already fitted lm/glm model and the associated dataset.
We’ll use the BostonHousing dataset from the mlbench package; see the mlbench documentation for more details.
library(bootStepAIC)
library(plotly)
library(mlbench)

# Load Boston housing dataset
data("BostonHousing")

# Fit Linear regression model
fit <- lm(crim ~ ., data = BostonHousing)

# Run bootstrapped stepwise regression
fit.boot <- boot.stepAIC(fit, data = BostonHousing, B = 100)

# That's it !
Collecting required information
The output from boot.stepAIC() contains the following. Note that each quantity is reported as a percentage (based on the total number of bootstrap samples):
- Number of times a covariate was featured in the final model from stepAIC()
- Number of times a covariate’s coefficient sign was positive / negative
- Number of times a covariate was statistically significant (default at alpha = 5%)
We’ll collect all of this information first and create data frames so as to make charting easier later on.
Note that in this particular example there is a variable named chas, which is a factor with levels 0 and 1. R names its coefficient chas1 by default, so we rename it back to chas to match the covariate names.
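The renaming is easy to see on a toy model. The sketch below uses a made-up data frame (d, fit_demo and the column names are assumptions for illustration) and shows how base R's assign attribute maps each coefficient back to the term it came from, which avoids relying on a fixed row position.

```r
# Toy data with one numeric covariate and one two-level factor
d <- data.frame(y = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8),
                x = c(1, 2, 3, 4, 5, 6),
                g = factor(c(0, 1, 0, 1, 0, 1)))
fit_demo <- lm(y ~ x + g, data = d)

# Term names as written in the formula
term_names <- attr(terms(fit_demo), "term.labels")

# Coefficient names: the factor level is appended to the factor's name
coef_names <- names(coef(fit_demo))

# "assign" gives, for each coefficient, the index of its term (0 = intercept)
assign_idx <- attr(model.matrix(fit_demo), "assign")
coef_terms <- c("(Intercept)", term_names)[assign_idx + 1]

print(coef_names)   # "(Intercept)" "x" "g1"
print(coef_terms)   # "(Intercept)" "x" "g"
```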
# Extract data
# Number of bootstrap samples, read from the summary table of the fit.boot list
nBoot <- summary(fit.boot)[8, 1]
origModel <- paste(names(coef(fit.boot$OrigModel)), collapse = " + ")
stepModel <- paste(names(coef(fit.boot$OrigStepAIC)), collapse = " + ")

# Names of covariates
covariates <- rownames(fit.boot$Covariates)
nCovariates <- length(covariates)

# Matrix of number of times each covariate was picked
coef.pick <- fit.boot$Covariates

# Matrix for the consistency of sign on each covariate
coef.sign <- fit.boot$Sign

# Change name for "chas" since it is a factor
# (look the row up by name instead of a hard-coded position)
rownames(coef.sign)[rownames(coef.sign) == "chas1"] <- "chas"
coef.sign <- coef.sign[match(rownames(coef.pick), rownames(coef.sign)), ]

# Matrix for statistical significance
coef.stat <- fit.boot$Significance

# Change name for "chas" since it is a factor
rownames(coef.stat)[rownames(coef.stat) == "chas1"] <- "chas"
coef.stat <- coef.stat[match(rownames(coef.pick), rownames(coef.stat)), ]

# Make into long form for charting later
coef.stat.long <- data.frame()
for (i in seq_along(coef.stat)) {
  n <- round(coef.stat[i], 0)
  vec <- seq(0, n, by = 2)
  mat <- data.frame(rep(names(coef.stat)[i], length(vec)), vec, paste("% Sig", n))
  names(mat) <- c("variable", "sig", "text")

  # We'll draw these as lines. An all-NA row helps separate line segments
  coef.stat.long <- rbind(coef.stat.long, mat, c(NA, NA, NA))
}

# Convert to dataframes
coef.pick <- as.data.frame(coef.pick)
coef.stat <- as.data.frame(coef.stat)
coef.sign <- as.data.frame(coef.sign)

names(coef.pick) <- "pick"
names(coef.sign) <- c("pos", "neg")
names(coef.stat) <- "stat"
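The long-form step is the least obvious part of the code above, so here is a small stand-alone sketch of the same NA-separator idea. The percentages in sig_pct are made up and the names sig_pct and long are assumptions; only the technique mirrors the post.

```r
# Made-up significance percentages for two covariates
sig_pct <- c(zn = 47, nox = 12)

# For each covariate, build a column of points from 0 up to its percentage,
# then append an all-NA row so a line chart breaks between covariates
long <- do.call(rbind, lapply(names(sig_pct), function(v) {
  n <- round(sig_pct[[v]])
  heights <- seq(0, n, by = 2)
  rbind(
    data.frame(variable = v, sig = heights, text = paste("% Sig", n)),
    data.frame(variable = NA, sig = NA, text = NA)
  )
}))

print(head(long))
print(sum(is.na(long$sig)))   # one separator row per covariate
```

Because the points are spaced two units apart, each vertical segment tops out at the largest even number not exceeding the percentage (46 for zn here), while the hover text still reports the rounded value.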
Plot
Now that we have all the information we need, we just need to plot. The plot is arranged as follows:
- One layer for the number of times a variable was picked up by stepAIC() (barplot)
- One layer for the positive and negative coefficients (scatter plot using triangles)
- One layer for the number of times a variable was significant (vertical line chart)
- Annotation for some other information
# Assumed definitions: the hover text and the marker scaling used below
# are not defined elsewhere in this post, so these values are illustrative
pick.text <- paste0(rownames(coef.pick), ": picked in ", round(coef.pick$pick, 1), "% of samples")
sign.text.up <- paste0(rownames(coef.sign), ": ", round(coef.sign$pos, 1), "% positive")
sign.text.down <- paste0(rownames(coef.sign), ": ", round(coef.sign$neg, 1), "% negative")
scale <- 3  # divisor that turns a percentage into a marker size

# Base plot for number of times a variable was picked by stepAIC
plot_ly(x = rownames(coef.pick), y = coef.pick$pick,
        type = "bar", opacity = 0.75, name = "Times picked (%)",
        hoverinfo = "text", text = pick.text,
        marker = list(color = "#00994d", line = list(width = 2))) %>%

  # Layer for number of times a variable was statistically significant at 5%
  add_trace(x = coef.stat.long$variable, y = coef.stat.long$sig,
            type = "scatter", mode = "lines+markers", name = "Stat. Sig (%)",
            line = list(color = "#ffdb4d", width = 15),
            hoverinfo = "text", text = coef.stat.long$text,
            inherit = FALSE) %>%

  # Layer for number of times a variable's coefficient was positive
  add_trace(x = rownames(coef.pick), y = rep(-5, nCovariates),
            type = "scatter", mode = "markers", name = "Coef Sign(% pos)",
            marker = list(symbol = "triangle-up", size = coef.sign$pos / scale,
                          color = "#4da6ff", line = list(color = "black", width = 2)),
            hoverinfo = "text", text = sign.text.up,
            inherit = FALSE) %>%

  # Layer for number of times a variable's coefficient was negative
  add_trace(x = rownames(coef.pick), y = rep(-10, nCovariates),
            type = "scatter", mode = "markers", name = "Coef Sign(% neg)",
            marker = list(symbol = "triangle-down", size = coef.sign$neg / scale,
                          color = "#ff704d", line = list(color = "black", width = 2)),
            hoverinfo = "text", text = sign.text.down,
            inherit = FALSE) %>%

  # Layout, annotations, axis options etc
  layout(xaxis = list(title = "<b>Covariates</b>"),
         yaxis = list(title = "<b>Percentage(%)</b>",
                      tickmode = "array",
                      tickvals = round(seq(0, 100, length.out = 10), 0),
                      domain = c(0.2, 1)),
         plot_bgcolor = "#e1efc3",
         paper_bgcolor = "#e1efc3",
         annotations = list(
           list(x = 0.1, y = 1,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top",
                ax = 0, ay = 0,
                text = "Visualizing <em>boot.stepAIC()</em>",
                font = list(family = "serif", size = 30)),

           list(x = 0.3, y = 0.1,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top",
                ax = 0, ay = 0,
                text = paste("<em>Original Model:</em>", origModel),
                font = list(family = "PT Sans Narrow", size = 15)),

           list(x = 0.21, y = 0.05,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top", align = "left",
                ax = 0, ay = 0,
                text = paste("<em>Stepwise Model:</em>", stepModel),
                font = list(family = "PT Sans Narrow", size = 15)),

           list(x = 0.8, y = 0.90,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top", align = "left",
                ax = 0, ay = 0,
                text = paste0("<em>No. of Covariates:</em>", nCovariates, "<br>",
                              "<em>No. of bootstrap samples:</em>", nBoot, "<br>"),
                font = list(family = "PT Sans Narrow", size = 15))
         ))