An Educational Stroll With Stan - Part 2
By Ken Koon Wong in R, Stan, cmdstanr, Bayesian, beginner
September 28, 2023
I learned a great deal throughout this journey. In this second part, I gained knowledge about implementing logistic regression in Stan. I also learned the significance of data type declarations for obtaining accurate estimates, how to use the posterior to predict new data, and what the generated quantities block in Stan is for. Moreover, having a friend who is well-versed in Bayesian statistics proves invaluable when delving into the Bayesian realm! Very fun indeed!
We looked at linear regression previously; now let’s take a look at logistic regression.
Objectives
- Load Library & Simulate Simple Data
- What Does A Simple Logistic Regression Look Like In Stan?
- Visualize It!
- How To Predict Future Data?
- Acknowledgement/Fun Fact
- Lessons Learnt
Load Library & Simulate Simple Data
library(tidyverse)
library(cmdstanr)
library(bayesplot)
library(kableExtra)
set.seed(1)
n <- 1000
w <- rnorm(n)
x <- rbinom(n,1,plogis(-1+2*w))
y <- rbinom(n,1,plogis(-1+2*x + 3*w))
collider <- -0.5*x + -0.6*y + rnorm(n)
df <- list(N=n,x=x,y=y,w=w, collider=collider) #cmdstanr uses list
df2 <- tibble(x,y,collider,w) #this is for simple logistic regression check
Same DAG as before, but this time both x and y are binary, generated via the binomial distribution.
Look At GLM summary
model <- glm(y ~ x + w, df2, family = "binomial")
summary(model)
##
## Call:
## glm(formula = y ~ x + w, family = "binomial", data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93224 -0.42660 -0.06937 0.29296 2.79793
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0352 0.1240 -8.348 <2e-16 ***
## x 2.1876 0.2503 8.739 <2e-16 ***
## w 2.9067 0.2230 13.035 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1375.9 on 999 degrees of freedom
## Residual deviance: 569.2 on 997 degrees of freedom
## AIC: 575.2
##
## Number of Fisher Scoring iterations: 6
Nice! The x and w coefficients and the intercept are close to our simulated model! The x coefficient is 2.1875657 (true: 2), the w coefficient is 2.9066911 (true: 3), and the intercept is -1.0352 (true: -1).
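A quick way to line the fitted coefficients up against the simulation is a small table. This is just a sketch; the true column simply repeats the values we plugged into plogis() when simulating y.
# compare fitted GLM coefficients with the simulation values
tibble(term = names(coef(model)),
       estimate = unname(coef(model)),
       true = c(-1, 2, 3))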
What About Collider?
model2 <- lm(collider ~ x + y, df2)
summary(model2)
##
## Call:
## lm(formula = collider ~ x + y, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5744 -0.6454 -0.0252 0.7058 2.8273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007275 0.044300 0.164 0.87
## x -0.461719 0.090672 -5.092 4.23e-07 ***
## y -0.610752 0.086105 -7.093 2.48e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.032 on 997 degrees of freedom
## Multiple R-squared: 0.1753, Adjusted R-squared: 0.1736
## F-statistic: 105.9 on 2 and 997 DF, p-value: < 2.2e-16
Not too shabby. The x coefficient is -0.4617194 (true: -0.5) and the y coefficient is -0.6107524 (true: -0.6). Perfect! But how do we do this in Stan?
Let’s break down what exactly we want to estimate.
\begin{gather} y\sim\text{bernoulli}(p) \\ \text{logit}(p) = a_y+b_{yx}.x+b_{yw}.w \\ \\ collider\sim\text{normal}(\mu_{collider},\sigma_{collider}) \\ \mu_{collider}=a_{collider}+b_{collider\_x}.x+b_{collider\_y}.y \end{gather}
We’re basically interested in \(a_y\), \(b_{yx}\), \(b_{yw}\), \(a_{collider}\), \(b_{collider\_x}\), and \(b_{collider\_y}\), to see whether they reflect the parameters set in our simulation.
Logistic Regression via Stan
data {
int N;
array[N] int x;
array[N] int y;
array[N] real w; // note that this is not int but real
array[N] real collider; // same here
}
parameters {
// parameters for y (bernoulli)
real alpha_y;
real beta_yx;
real beta_yw;
// parameters for collider (normal)
real alpha_collider;
real beta_collider_x;
real beta_collider_y;
real<lower=0> sigma_collider; // standard deviation must be positive
}
model {
// prior
// no explicit priors: Stan defaults to flat (improper uniform) priors over each parameter's support
// likelihood
y ~ bernoulli_logit(alpha_y + beta_yx * to_vector(x) + beta_yw * to_vector(w));
collider ~ normal(alpha_collider + beta_collider_x * to_vector(x) + beta_collider_y * to_vector(y), sigma_collider);
}
Save the above as log_sim.stan. Note that we didn’t have to apply inv_logit ourselves; bernoulli_logit applies the inverse-logit link to the linear predictor for us. Did you also notice that the data declarations use array[N] (data type) (variable) instead of the older (data type) variable[N]? This is the current way of declaring arrays in Stan.
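As a quick sanity check on what bernoulli_logit is doing under the hood, the inverse-logit (logistic) link is the same plogis() we used to simulate the data. A minimal R sketch, with an arbitrary x = 1 and w = 0.5:
# inverse logit: plogis(eta) == 1 / (1 + exp(-eta))
eta <- -1 + 2*1 + 3*0.5   # example linear predictor
plogis(eta)               # probability that y = 1
1 / (1 + exp(-eta))       # same thing, written out by hand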
Run The Model in R and Analyze
mod <- cmdstan_model("log_sim.stan")
fit <- mod$sample(data = df,
chains = 4,
iter_sampling = 2000,
iter_warmup = 1000,
seed = 123,
parallel_chains = 4
)
fit$summary()
Not too shabby either! The Stan model accurately estimated the alpha_y, beta_yx, beta_yw, alpha_collider, beta_collider_x, and beta_collider_y parameters. Rhat is 1, below the usual 1.05 cutoff, and ess_bulk & ess_tail are > 100. Model diagnostics look good!
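If you only want the parameters of interest, fit$summary() accepts a variables argument, and cmdstanr also has a built-in sampler diagnostic summary. A sketch (the diagnostic_summary() method assumes a reasonably recent cmdstanr):
# summary restricted to the parameters we care about
fit$summary(variables = c("alpha_y", "beta_yx", "beta_yw",
                          "alpha_collider", "beta_collider_x",
                          "beta_collider_y", "sigma_collider"))
# divergences, max treedepth hits, and E-BFMI per chain
fit$diagnostic_summary()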
Visualize The Beautiful Convergence
mcmc_trace(fit$draws())
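With seven parameters plus lp__, the faceted trace plot can get crowded; passing only the variables you want keeps it readable (a small variation on the call above):
# trace plots for the y-model parameters only
mcmc_trace(fit$draws(variables = c("alpha_y", "beta_yx", "beta_yw")))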
How To Predict Future Data?
You know how in R or Python, you can save the model and then type something like predict and the probabilities will miraculously appear? Well, to my knowledge, you can’t in Stan or cmdstanr. I heard you can with rstanarm, rethinking, brms, etc. But let’s roll with it and see how we can do this.
Instructions:
- Extract your coefficients of interest from the posterior
- Write another Stan model with a generated quantities block
- Feed new data into that Stan model and extract the expected values
Recall this was our formula:
\begin{gather} y\sim\text{bernoulli}(p) \\ \text{logit}(p) = a_y+b_{yx}.x+b_{yw}.w \end{gather}
We want to extract alpha_y, beta_yx, and beta_yw.
Let’s Estimate y
# extract the posterior draws for each parameter and take the mean
alpha_y <- fit$draws(variables = "alpha_y") |> mean()
beta_yx <- fit$draws(variables = "beta_yx") |> mean()
beta_yw <- fit$draws(variables = "beta_yw") |> mean()
#new data
set.seed(2)
n <- 100
x <- rbinom(n,1,0.5) # randomly generate binary x with probability 0.5
w <- rnorm(n) # randomly generate continuous data for w
df_new <- list(N=n,x=x,w=w,alpha_y=alpha_y,beta_yx=beta_yx,beta_yw=beta_yw)
New Stan model
data {
int<lower=0> N;
array[N] int x;
array[N] real w;
real alpha_y;
real beta_yx;
real beta_yw;
}
// parameters {
// array[N] real y_pred;
// }
generated quantities {
array[N] real<lower=0,upper=1> y_pred;
for (i in 1:N) {
y_pred[i] = inv_logit(alpha_y + beta_yx * x[i] + beta_yw * w[i]);
}
}
Save the above as log_sim_pred.stan. Note that this time, instead of declaring a model block, we put the equation in the generated quantities block, which computes y_pred according to our formula.
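Because this prediction only uses the posterior means, the same numbers can be reproduced directly in R with plogis(), which makes a handy sanity check on the generated quantities block. A sketch using the objects created above:
# same calculation as the generated quantities block, done in R
y_pred_r <- plogis(alpha_y + beta_yx * x + beta_yw * w)
head(y_pred_r)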
Load Prediction Stan Model
mod <- cmdstan_model("log_sim_pred.stan")
fit2 <- mod$sample(data = df_new,
                   iter_sampling = 1,
                   chains = 1,
                   fixed_param = TRUE)
Notice that we provided df_new as the data and changed iter_sampling to 1; otherwise we’d just get a bunch of identical rows. Give it a try yourself! The same goes for chains: additional chains of the same values provide no additional value. Lastly, we have to specify fixed_param = TRUE because there are no parameters left to sample.
Merge Predicted y and True y
# create df with y_pred
df_pred <- as.data.frame(fit2$draws(variables = "y_pred")) |>
pivot_longer(cols = everything(), names_to = "y_pred1", values_to = "y_pred") |>
select(y_pred)
# create df with y_actual (the true probability from the simulation)
df_actual <- tibble(x=x,w=w,y_actual=plogis(-1+2*x + 3*w)) |>
select(y_actual)
# merge the 2 dfs and check the diff
df_combined <- df_pred |>
add_column(df_actual) |>
mutate(diff = y_actual - y_pred)
# load the first 5 rows
df_combined |>
head(5) |>
kable()
y_pred | y_actual | diff |
---|---|---|
0.029370 | 0.0288923 | -0.0004777 |
0.999270 | 0.9992532 | -0.0000168 |
0.382207 | 0.3347584 | -0.0474486 |
0.936812 | 0.9441253 | 0.0073133 |
0.129852 | 0.1050137 | -0.0248383 |
Not too shabby! Differences are quite small for the first 5. Let’s histogram it and see.
df_combined |>
ggplot(aes(x=diff)) +
geom_histogram() +
theme_minimal()
Visualize Predicted and Actual y
df_combined |>
ggplot(aes(x=y_actual,y=y_pred)) +
geom_point() +
geom_smooth(formula = "y ~ x") +
theme_minimal()
Almost a straight line through. Awesomeness! I’m not really sure why there’s an up-and-down pattern between 0.25 and 0.75. If you know, please let me know!
Acknowledgement/Fun Fact:
I truly want to thank Alec Wong for helping me with a problem I encountered. Initially my estimates were off, especially the collider parameters. I spent a few days and was not able to find the cause. When you run into a problem, stop and work on something else, or ask a friend! I did both! Alec wrote a JAGS model and was able to extract accurate estimates. He sent me the script, we fed it into ChatGPT to spit out a Stan model, and then realized that my data declaration had a mistake! Well, 2 mistakes! I had declared both w and collider as int instead of real. Here are the estimates when both w and collider were declared as int.
Notice how the collider parameters are off!?!?! The 95% credible intervals don’t even contain the true values.
Again, THANK YOU ALEC !!!
Things To Improve On:
- Will explore further using informed prior in the future
- Will explore simulated data of sens/spec of a diagnostic test and then apply prior to obtain posterior
- Will explore how Stan behaves with NULL data variable.
Lessons learnt:
- Learnt how to do logistic regression using cmdstanr
- Declaration of data type is important in Stan to get accurate estimates
- Stan has changed its array declaration syntax: real y[N] is now array[N] real y
- Learnt from Alec that rnorm(n, mu, sigma) == mu + rnorm(n, 0, sigma) (see the quick check after this list)
- Stan Manual is a good reference to go to
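A quick way to convince yourself of that rnorm() identity: the two calls describe the same distribution, and with the same seed the draws even match exactly (a tiny check, using arbitrary mu = 10 and sigma = 2):
# rnorm(n, mu, sigma) is the same as shifting a zero-mean draw by mu
set.seed(3)
a <- rnorm(5, mean = 10, sd = 2)
set.seed(3)
b <- 10 + rnorm(5, mean = 0, sd = 2)
all.equal(a, b)  # TRUE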
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on twitter, GitHub or Mastodon
- if you would like to collaborate, please feel free to contact me