Ensure you have the following packages installed:
# Only need to do this once
install.packages("dplyr")
install.packages("readr")
install.packages("ggplot2")
Load the packages. Ask for help if any of these fail with an error.
# Do this in each new session
library(dplyr)
library(readr)
library(ggplot2)
%>%
, and ggplot2 recommended?
dropdown: FAQs, resources, etc.Go to bioconnector.org. Hit the data link. Download nhanes.csv
and nhanes_dd.csv
. Save them both somewhere on your computer you can easily find. E.g., create a new folder on your desktop called rstats
and save it there.
Open RStudio. File – New file – R script. Save this file as nhanes_analysis.R
in the same rstats
folder above.
Quit RStudio, then double-click the R script to open R running in the rstats
folder. You can also do this through the menu with Session – Set working directory – To Source File Location.
Load the required libraries and data:
library(dplyr)
library(readr)
nh <- read_csv("nhanes.csv")
(If you don’t have these packages installed, use install.packages("dplyr")
and install.packages("dplyr")
).
T-tests are specific cases of ANOVA, which is a specific case of linear regression.
Linear regression: “Do levels of a categorical variable affect the response?”
ANOVA/t-tests: “Does the mean response differ between levels of a categorical variable?”
Simple linear regression, single predictor variable X:
\[Y = \beta_{0} + \beta_{1}X\]
Multiple regression:
\[Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \epsilon\]
What if we wanted to model the discrete outcome, e.g., whether someone is insured, against several other variables, similar to how we did with multiple linear regression? We can’t use linear regression because the outcome isn’t continuous – it’s binary, either Yes or No. For this we’ll use logistic regression to model the log odds of binary response. That is, instead of modeling the outcome variable, \(Y\), directly against the inputs, we’ll model the log odds of the outcome variable.
If \(p\) is the probability that the individual is insured, then \(\frac{p}{1-p}\) is the odds that person is insured. The generalized linear model is expressed as:
\[log(\frac{p}{1-p}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\]
Where \(\beta_0\) is the intercept, \(\beta_1\) is the increase in the log odds of the outcome for every unit increase in \(x_1\). Rearranging the formula a little bit will show that \(e^{\beta_1}\) is the odds ratio for \(x_1\).
Logistic regression is a type of generalized linear model (GLM). We fit GLM models in R using the glm()
function. It works like the lm()
function except we specify which GLM to fit using the family
argument. Logistic regression requires family=binomial
.
mod <- glm(y ~ x, data=yourdata, family='binomial')
summary(mod)
Power (a.k.a. “sensitivity”): probability that your test correctly rejects the null hypothesis when the alternative hypothesis is true. That is, if there really is an effect (difference in means, association between categorical variables, etc.), how likely are you to be able to detect that effect at a given statistical significance level, given certain assumptions. Know two, estimate the third:
(Other assumptions include desired \(\alpha\) and \(\sigma\)).
# Power for a t-test
power.t.test(power, n, delta, sig.level=0.05, sd=1)
# Power for a two-sample test of proportions
power.prop.test(power, n, p1, p2, sig.level=0.05)
We put a lot of emphasis on tidy data, where each column is a variable and each row is an observation. But even when we start with tidy data, we don’t end up with tidy models. The output from tests like t.test
or lm
are not data.frames, and it’s hard to get the information out of the model object that we want. The broom package bridges this gap.
tidy
: constructs a data frame that summarizes the model’s statistical findings like coefficients and p-values.augment
: add columns to the original data that was modeled, like predictions and residuals.glance
: construct a concise one-row summary of the model with information like \(R^2\) that are computed once for the entire model.