---
title: "R Examples - An Introduction to Bayesian Statistics"
author: "Christina Knudson and An-Ting Jhuang"
output:
  pdf_document:
    toc: yes
    toc_depth: '3'
  html_document:
    toc: yes
    toc_depth: 3
---
```{r setup, include = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This document demonstrates the hands-on R examples for the short course *An Introduction to Bayesian Statistics*.
# Load required packages
```{r, message = FALSE, warning = FALSE}
#Load packages
library(devtools)
library(ggplot2)
library(HDInterval)
library(MCMCpack)
library(mcmc)
library(mcmcse)
library(Rcpp)
library(RcppArmadillo)
library(stableGR)
```
# Coin flip example
We have a coin and want to know its probability of heads. We're willing to toss the coin a large number of times to learn more about it. Let $\theta$ be the probability of heads, $n$ the number of tosses, and $y$ the number of heads in $n$ tosses.
*Practical*: How likely is heads? \
*Statistical*: What's the estimate of $\theta$?
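Before coding, recall why this setup is convenient: with a Beta prior on $\theta$ and a binomial likelihood, conjugacy gives a Beta posterior.
$$\theta \sim \text{Beta}(\alpha, \beta), \qquad y \mid \theta \sim \text{Binomial}(n, \theta) \quad \Longrightarrow \quad \theta \mid y \sim \text{Beta}(\alpha + y, \ \beta + n - y).$$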
First, let's set up the problem.
```{r}
n <- n_post <- 10^5   # number of tosses; also number of posterior draws
y <- 4.8*10^4         # number of heads observed
alpha <- 2            # Beta prior shape parameters
beta <- 2
```
Now, let's create the posterior and likelihood.
```{r}
set.seed(13878)
th_post <- rbeta(n_post, alpha + y, beta + n - y)  # draws from the posterior
th_prior <- rbeta(n_post, alpha, beta)             # draws from the prior
y_like <- rbinom(n_post, n, y/n)                   # sampling distribution at the MLE
```
We can now plot the prior density for $\theta$.
```{r}
plot(density(th_prior), xlab = expression(theta[prior]), main = "")
```
Below is the likelihood.
```{r}
plot(density(y_like), xlab = "number of heads", main = "")
```
Finally, we can plot the posterior density along with the maximum likelihood estimate.
```{r}
# Density plot of the posterior, with the ML and Bayes estimates marked
plot(density(th_post), xlab = expression(theta[post]), main = "")
abline(v = .48, col = "darkgreen", lty = 2, lwd = 2)               # ML estimate y/n
abline(v = (alpha + y)/(alpha + beta + n), col = "blue", lwd = 2)  # posterior mean
legend(.473, 250, c("ML", "Bayes"), col = c("darkgreen", "blue"), pch = rep(19, 2), bty = "n")
legend(.4805, 250, "ML and Bayes estimates \n almost the same", bty = "n")
```
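Plots aside, numerical summaries of the posterior draws are often handy; for instance, the posterior mean and a quantile-based 95 percent interval (we return to credible intervals at the end of this document):
```{r}
mean(th_post)
quantile(th_post, c(.025, .975))
```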
# Linear regression
## Demo
We'll run a linear regression using frequentist and Bayesian methods, with the *bikeshare* data as a sample dataset. The data contain the hourly and daily counts of rental bikes in the Capital Bikeshare system between 2011 and 2012, along with the corresponding weather and seasonal information. The objective is to investigate whether perceived temperature is related to the number of registered riders.
Let's start with some exploratory data analysis.
```{r}
#Import the data
bikes <- read.csv("http://www.cknudson.com/data/bikes.csv")
#EDA
ggplot(bikes, aes(y = riders_registered, x = temp_feel)) +
geom_point() +
xlab("Perceived Temp (Fahrenheit)") +
ylab("No. of registered riders")
summary(bikes)
```
Next, let's fit two linear regression models: one frequentist and one Bayesian.
```{r}
# Frequentist fit
freq.fit <- lm(riders_registered ~ temp_feel, data = bikes)
# Bayesian fit: b0 and B0 are the mean and precision of the normal prior
# on the coefficients; c0 and d0 control the inverse gamma prior on the variance
bayes.fit <- MCMCregress(riders_registered ~ temp_feel,
                         b0 = freq.fit$coefficients, B0 = .01,
                         c0 = .1, d0 = .1,
                         beta.start = freq.fit$coefficients,
                         data = bikes, burnin = 10^3, mcmc = 10^5)
# Output
summary(freq.fit)
summary(bayes.fit)
```
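Since MCMCregress returns an mcmc object, the default plot method (from coda) gives quick trace and density plots for each parameter, a handy first look at the chains:
```{r}
plot(bayes.fit)
```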
## You try
Everyone: Tweak the code in the demo above to fit a linear regression with the number of registered riders as the response and the humidity as the predictor. Use the frequentist linear model coefficients for the initial values.
If you finish the above problem early, try one or both of these problems:
* If you are familiar with multiple linear regression, try creating a linear model with the number of registered riders as the response and a few variables (such as the windspeed, the perceived temperature, and the humidity) as the predictors. (The syntax for this is the same as that of the lm function.) Use the frequentist linear model coefficients for the initial values.
* If you want to play with initial conditions, try changing the starting values from the frequentist linear model coefficients to something else. Notice how the starting point can affect the resulting model. Vary your Monte Carlo sample size (your chain length) to see how it interacts with starting values to impact the resulting model.
# Logistic regression
## Demo
Next, we consider a logistic regression example, starting with the logit data set from the mcmc package.
```{r}
data(logit)
out <- glm(y~x1,data=logit,family=binomial,x=TRUE)
```
Now we create a function that calculates the log unnormalized posterior density. The prior on the coefficients is normal with mean 0 and variance 4, which contributes the $-\sum_i \beta_i^2/8$ term.
```{r}
lupost_factory <- function(x, y) function(beta) {
  eta <- as.numeric(x %*% beta)
  # Numerically stable log P(y = 1) and log P(y = 0) under the logit model
  logp <- ifelse(eta < 0, eta - log1p(exp(eta)), -log1p(exp(-eta)))
  logq <- ifelse(eta < 0, -log1p(exp(eta)), -eta - log1p(exp(-eta)))
  logl <- sum(logp[y == 1]) + sum(logq[y == 0])  # log likelihood
  logl - sum(beta^2)/8                           # plus log normal prior
}
lupost <- lupost_factory(out$x, out$y)
```
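As a quick sanity check, we can evaluate the log unnormalized posterior at the glm estimates. The value itself isn't meaningful on its own, but an error or NaN here would signal a problem in the function:
```{r}
lupost(as.numeric(coefficients(out)))
```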
Now, we run a Metropolis random walk using the log unnormalized posterior function.
```{r, cache = TRUE}
set.seed(317)
beta.init <- as.numeric(coefficients(out))
out <- metrop(lupost,beta.init,1e3)
names(out)
out$accept
```
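An acceptance rate near 20 percent is a common target for a random walk Metropolis sampler. If ours is far off, we can rerun with a different proposal scale; metrop accepts its own output as a starting point. (The scale of 0.4 below is only an illustration, not a tuned value.)
```{r}
out <- metrop(out, scale = 0.4)
out$accept
```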
After running our sampler, we can look at some plots. The MCMC samples are contained in `out$batch`.
```{r}
plot(ts(out$batch), main = "", xlab = "Iteration")
acf(out$batch)
```
Now let's move on to a more interesting data set: the Titanic survival data. This time, rather than writing our own log unnormalized posterior, we'll use the MCMClogit function from MCMCpack.
```{r}
data("titanic.complete")
names(titanic.complete)
```
For the response, use whether the passenger survived. For the predictor, use the fare the passenger paid.
```{r, cache = TRUE}
set.seed(317)
# Use the glm coefficients as starting values for the sampler
titanic.glm <- glm(Survived ~ Fare, data = titanic.complete, family = binomial)
beta.init <- as.numeric(coefficients(titanic.glm))
titanic.mod <- MCMClogit(Survived ~ Fare, data = titanic.complete, beta.start = beta.init)
summary(titanic.mod)
```
## You try
Continue to use the MCMClogit function and the Titanic survival data, but change the predictor.
* Use the age of the passenger to predict the log odds of survival.
* If you are comfortable with multiple regression, incorporate multiple predictors. (The syntax is the same as that of the lm function.)
# MCMC diagnostics
## Demo
After running a sampler, you should check MCMC diagnostics. One of the most popular diagnostics is the Gelman-Rubin diagnostic; we use the stable version from the stableGR package.
Simply pass in the MCMC samples.
```{r}
stable.GR(titanic.mod)
```
If you want to isolate just the multivariate Gelman-Rubin statistic (mpsrf):
```{r}
stable.GR(titanic.mod)$mpsrf
```
We say the Markov chain has converged once the Gelman-Rubin statistic (psrf) is below a threshold. We can calculate the threshold (i.e. the target psrf) as long as we know the number of parameters (p) and the number of chains (m). We can check the former with
```{r}
ncol(titanic.mod)
```
```{r}
target.psrf(p=2, m=1)
```
Since our psrf is larger than the target, we need to draw more samples. If we don't want to do the comparison ourselves, we can use the following command.
```{r}
n.eff(titanic.mod)
```
It tells us whether we have enough samples (the converged element). It also reports the effective sample size (n.eff): the number of uncorrelated samples that would produce the same variance as our correlated (MCMC) samples.
In this case, we need roughly 7 or 8 times our current Monte Carlo sample size. We can update this in a few ways. An easy (but less efficient) way is to rerun the sampler with a larger Monte Carlo sample size. The default (which we used before) was 10000, so let's multiply that by 8 to be safe.
```{r, cache = TRUE}
newmod <- MCMClogit(Survived ~ Fare, data = titanic.complete, mcmc=80000)
n.eff(newmod)
```
Our new, bigger sample is now sufficiently large (it has converged). There are other (smarter) ways to increase the sample size without having to redo it, but we don't have time to get into that today. (Essentially, you want to take the last value of your Markov chain and use that as your new initial value. Run the model from there, then append your new samples to your old samples.)
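That append-and-continue idea can be sketched as follows (illustrative only; note that rbind returns a plain matrix rather than an mcmc object):
```{r, cache = TRUE}
# Continue sampling from the chain's last draw, then append the new draws
last.draw <- newmod[nrow(newmod), ]
more.draws <- MCMClogit(Survived ~ Fare, data = titanic.complete,
                        beta.start = last.draw, burnin = 0, mcmc = 40000)
all.draws <- rbind(newmod, more.draws)
```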
Other potentially useful model information includes the Monte Carlo standard errors. Standard error (as you know it in the frequentist setting) represents the variability from sample to sample, assuming identical sample sizes and sampling schemes. Monte Carlo standard error represents the variability from MCMC sample to MCMC sample, assuming the chains have equal lengths and are constructed with the same sampler. We want small Monte Carlo standard errors.
The following command treats each variable (or column of our MCMC sample) separately.
```{r}
mcse.mat(titanic.mod)
```
This next command estimates the covariance matrix.
```{r}
mcse.multi(titanic.mod)
```
A visualization helps make sense of this. The code below plots the credible region (jointly constructed for the two components). We see an ellipse: a perfect circle would indicate independence between the components, so the elliptical shape tells us the components are dependent.
```{r}
mcerror <- mcse.multi(titanic.mod, blather = TRUE)
plot(confRegion(mcerror, level = .95), xlab = "intercept", ylab = "slope", type = "l")
```
## You try
Choose one of the models you have created. Implement the methods shown above.
# Credible Sets
## Demo
Once we are certain we have a sufficiently large sample, we can construct credible sets (or credible intervals), the Bayesian analog of the frequentist confidence interval. An easy way to calculate a credible interval is with quantiles. The code below calculates a 95 percent credible interval for our second parameter.
```{r}
quantile(titanic.mod[,2], c(.025, .975))
```
This is useful for hypothesis testing! Suppose you want to know whether the fare has any relationship to the log odds of a passenger's survival. If there's no relationship, then the coefficient on fare is zero. This credible interval does not contain 0. Therefore, we can say that there is indeed a linear relationship between the fare a passenger paid and the passenger's log odds of survival. In fact, because the coefficient is positive, we can say that passengers who paid more had higher chances of survival.
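Because we have posterior draws, we can also answer such questions directly: the proportion of draws above zero estimates the posterior probability that the coefficient is positive.
```{r}
mean(titanic.mod[, 2] > 0)
```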
Monte Carlo standard errors are always a good accompaniment to any estimate. Let's find the Monte Carlo standard error for the lower endpoint of our interval, the .025 quantile.
```{r}
mcse.q.mat(titanic.mod, method = "bm", q=.025)
```
The standard error is quite small compared to the coefficient for fare. Even after taking the standard error into account, we can be fairly certain that the .025 quantile is not zero.
The credible intervals based on the highest posterior density can be calculated as follows.
```{r}
hdi(titanic.mod)
```
This credible interval tells a very similar story: it does not contain zero, so we can safely say that the fare a passenger paid has a linear relationship with their log odds of survival.
## You try
Choose one of the models you have created. Implement the methods shown above.