An Educational Stroll With Stan - Part 2
By Ken Koon Wong, in R, Stan, cmdstanr, bayesian, beginner
September 28, 2023
I learned a great deal throughout this journey. In this second part, I learned how to implement logistic regression in Stan, why data type declarations matter for obtaining accurate estimates, how to use the posterior to predict new data, and what the generated quantities block in Stan is for. Moreover, having a friend who is well-versed in Bayesian statistics proves invaluable when delving into the Bayesian realm. Very fun indeed!

We looked at linear regression previously; now let's take a look at logistic regression.
Objectives
- Load Library & Simulate Simple Data
- What Does A Simple Logistic Regression Look Like In Stan?
- Visualize It!
- How To Predict Future Data ?
- Acknowledgement/Fun Fact
- Lessons Learnt
Load Library & Simulate Simple Data
library(tidyverse)
library(cmdstanr)
library(bayesplot)
library(kableExtra)
set.seed(1)
n <- 1000
w <- rnorm(n)
x <- rbinom(n,1,plogis(-1+2*w))
y <- rbinom(n,1,plogis(-1+2*x + 3*w))
collider <- -0.5*x + -0.6*y + rnorm(n)
df <- list(N=n,x=x,y=y,w=w, collider=collider) #cmdstanr uses list
df2 <- tibble(x,y,collider,w) #this is for simple logistic regression check
Same DAG as before, but this time both x and y are binary, drawn from a binomial distribution.
Look At GLM summary
model <- glm(y ~ x + w, df2, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = y ~ x + w, family = "binomial", data = df2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93224  -0.42660  -0.06937   0.29296   2.79793  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0352     0.1240  -8.348   <2e-16 ***
## x             2.1876     0.2503   8.739   <2e-16 ***
## w             2.9067     0.2230  13.035   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1375.9  on 999  degrees of freedom
## Residual deviance:  569.2  on 997  degrees of freedom
## AIC: 575.2
## 
## Number of Fisher Scoring iterations: 6
Nice! The x and w coefficients and the intercept are close to our simulated model! The x coefficient is 2.1875657 (true: 2), and the w coefficient is 2.9066911 (true: 3).
What About Collider?
model2 <- lm(collider ~ x + y, df2)
summary(model2)
## 
## Call:
## lm(formula = collider ~ x + y, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5744 -0.6454 -0.0252  0.7058  2.8273 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.007275   0.044300   0.164     0.87    
## x           -0.461719   0.090672  -5.092 4.23e-07 ***
## y           -0.610752   0.086105  -7.093 2.48e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.032 on 997 degrees of freedom
## Multiple R-squared:  0.1753,	Adjusted R-squared:  0.1736 
## F-statistic: 105.9 on 2 and 997 DF,  p-value: < 2.2e-16
Not too shabby. The x coefficient is -0.4617194 (true: -0.5), and the y coefficient is -0.6107524 (true: -0.6). Perfect! But how do we do this in Stan?
Let’s break down what exactly we want to estimate.
\begin{gather} y\sim\text{bernoulli}(p) \\ \text{logit}(p) = a_y+b_{yx}\cdot x+b_{yw}\cdot w \\ \\ collider\sim\text{normal}(\mu_{collider},\sigma_{collider}) \\ \mu_{collider}=a_{collider}+b_{collider\_x}\cdot x+b_{collider\_y}\cdot y  \end{gather}
We're basically interested in \(a_y\), \(b_{yx}\), \(b_{yw}\), \(a_{collider}\), \(b_{collider\_x}\), and \(b_{collider\_y}\), to see whether they reflect the parameters set in our simulation.
Logistic Regression via Stan
data {
  int N;
  array[N] int x;
  array[N] int y;
  array[N] real w; // note that this is real, not int
  array[N] real collider; // same here
}
parameters {
  // parameters for y (bernoulli)
  real alpha_y;
  real beta_yx;
  real beta_yw;
  
  // parameters for collider (normal)
  real alpha_collider;
  real beta_collider_x;
  real beta_collider_y;
  real<lower=0> sigma_collider; // a standard deviation should be constrained positive
}
model {
  // prior
  // default for real will be uniform distribution
  
  // likelihood
     y ~ bernoulli_logit(alpha_y + beta_yx * to_vector(x) + beta_yw * to_vector(w));
     collider ~ normal(alpha_collider + beta_collider_x * to_vector(x) + beta_collider_y * to_vector(y), sigma_collider);
  
}
Save the above as log_sim.stan. Note that we didn't have to apply inv_logit ourselves; bernoulli_logit applies the inverse-logit transform to that linear predictor for us.
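To see what bernoulli_logit is doing for us, here's a quick R sketch (plogis is R's built-in inverse logit; the linear predictor value is just an illustrative example):

```r
# bernoulli_logit(eta) is equivalent to bernoulli(inv_logit(eta)),
# where inv_logit(eta) = 1 / (1 + exp(-eta)). In R, that's plogis().
eta <- -1 + 2 * 1 + 3 * 0.5          # example linear predictor
p_manual <- 1 / (1 + exp(-eta))      # inverse logit by hand
p_plogis <- plogis(eta)              # same thing via plogis
all.equal(p_manual, p_plogis)        # TRUE
```

So writing y ~ bernoulli_logit(eta) saves us from typing y ~ bernoulli(inv_logit(eta)) and is also more numerically stable.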
Did you also notice that the data declaration is written as array[N] (data type) (variable) instead of (data type) (variable)[N]? This is the newer array syntax in Stan.
Run The Model in R and Analyze
mod <- cmdstan_model("log_sim.stan") 
fit <- mod$sample(data = df,
                  chains = 4,
                  iter_sampling = 2000,
                  iter_warmup = 1000,
                  seed = 123,
                  parallel_chains = 4
                  )
fit$summary()

Not too shabby either! The Stan model accurately estimated the alpha_y, beta_yx, beta_yw, alpha_collider, beta_collider_x, and beta_collider_y parameters. Rhat is 1 (below the usual 1.05 threshold), and ess_bulk & ess_tail are well above 100. The model diagnostics look good!
Visualize The Beautiful Convergence
mcmc_trace(fit$draws())
How To Predict Future Data ?
You know how in R or Python you can save a model and then call something like predict, and the probabilities miraculously appear? Well, to my knowledge, you can't do that in Stan or cmdstanr. I hear you can with rstanarm, rethinking, brms, etc. But let's roll with it and see how we can do this ourselves.
Instructions:
- Extract your coefficients of interest from the posterior
- Write another Stan model with a generated quantities block
- Feed new data into that Stan model and extract the expected values
Recall this was our formula:
\begin{gather} y\sim\text{bernoulli}(p) \\ \text{logit}(p) = a_y+b_{yx}\cdot x+b_{yw}\cdot w  \end{gather}
We want to extract alpha_y, beta_yx, and beta_yw.
Let’s Estimate y
# Extract parameters and take their posterior means
alpha_y <- fit$draws(variables = "alpha_y") |> mean()
beta_yx <- fit$draws(variables = "beta_yx") |> mean()
beta_yw <- fit$draws(variables = "beta_yw") |> mean()
#new data
set.seed(2)
n <- 100
x <- rbinom(n,1,0.5) # randomly assign 0 or 1 to x
w <- rnorm(n) # randomly generate continuous data for w
df_new <- list(N=n,x=x,w=w,alpha_y=alpha_y,beta_yx=beta_yx,beta_yw=beta_yw)
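Since y_pred is a deterministic function of the new data and the (now fixed) posterior means, you can sanity-check what the upcoming Stan model will compute in plain R. The coefficient values below are illustrative stand-ins, roughly the posterior means from the fit above:

```r
# Hypothetical posterior means (close to the fit above, for illustration only)
alpha_y <- -1.03; beta_yx <- 2.19; beta_yw <- 2.91

# Same new data as above
set.seed(2)
n <- 100
x <- rbinom(n, 1, 0.5)
w <- rnorm(n)

# The same computation the generated quantities block will perform:
y_pred_r <- plogis(alpha_y + beta_yx * x + beta_yw * w)
head(y_pred_r)
```

If the Stan output below doesn't match a check like this, something is off in the data declaration or the formula.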
New Stan model
data {
  int<lower=0> N;
  array[N] int x;
  array[N] real w;
  real alpha_y;
  real beta_yx;
  real beta_yw;
}
// parameters {
//   array[N] real y_pred;
// }
generated quantities {
  array[N] real<lower=0,upper=1> y_pred;
  
  for (i in 1:N) {
    y_pred[i] = inv_logit(alpha_y + beta_yx * x[i] + beta_yw * w[i]);
  }
}
Save the above to log_sim_pred.stan.
Note that this time, instead of declaring a model block, we compute y_pred in the generated quantities block, which evaluates our formula for each new observation.
Load Prediction Stan Model
mod <- cmdstan_model("log_sim_pred.stan")
fit2 <- mod$sample(data = df_new,
                  iter_sampling = 1,
                  chains =1, 
                  fixed_param = TRUE)
Notice that we provided df_new as data and changed iter_sampling to 1; otherwise we'd just get the same numbers repeated. Give it a try yourself! The same goes for chains: additional chains of identical values provide no additional information. Lastly, we have to set fixed_param = TRUE, since the model has no parameters to sample.
Merge Predicted y and True y
# create df with y_pred
df_pred <- as.data.frame(fit2$draws(variables = "y_pred")) |>
  pivot_longer(cols = everything(), names_to = "y_pred1", values_to = "y_pred") |>
  select(y_pred)
# create df w y_actual
df_actual <- tibble(x=x,w=w,y_actual=plogis(-1+2*x + 3*w)) |>
  select(y_actual)
# merge the 2 dfs and check the diff
df_combined <- df_pred |>
  add_column(df_actual) |>
  mutate(diff = y_actual - y_pred)
# load the first 5 rows
df_combined |>
  head(5) |>
  kable()
| y_pred | y_actual | diff | 
|---|---|---|
| 0.029370 | 0.0288923 | -0.0004777 | 
| 0.999270 | 0.9992532 | -0.0000168 | 
| 0.382207 | 0.3347584 | -0.0474486 | 
| 0.936812 | 0.9441253 | 0.0073133 | 
| 0.129852 | 0.1050137 | -0.0248383 | 
Not too shabby! The differences are quite small for the first 5 rows. Let's plot a histogram and see.
df_combined |>
  ggplot(aes(x=diff)) +
  geom_histogram() +
  theme_minimal()
Visualize Predicted and Actual y
df_combined |>
  ggplot(aes(x=y_actual,y=y_pred)) +
  geom_point() +
  geom_smooth(formula = "y ~ x") + 
  theme_minimal()
Almost a straight line through. Awesomeness! I'm not entirely sure why there's an up-and-down pattern between 0.25 and 0.75. One plausible explanation: y_pred uses the estimated coefficients while y_actual uses the true ones, and since the logistic curve is steepest near p = 0.5, small coefficient differences show up most in that middle range. If you have a better explanation, please let me know!
Acknowledgement/Fun Fact:
I truly want to thank Alec Wong for helping me with a problem I encountered. Initially my estimates were off, especially the collider parameters, and after a few days I still couldn't find the cause. When you run into a problem, stop and work on something else, or ask a friend! I did both. Alec wrote a JAGS model and was able to recover accurate estimates. He sent me the script, we fed it into ChatGPT to produce a Stan model, and then we realized that my data declaration had a mistake. Well, two mistakes! I had declared both w and collider as int instead of real. Here are the estimates when both w and collider were declared as int.

Notice how the collider parameters are off!?!? The 95% credible intervals don't even contain the true values.
Again, THANK YOU ALEC !!!
Things To Improve On:
- Will explore further using informed prior in the future
- Will explore simulated data of sens/spec of a diagnostic test and then apply prior to obtain posterior
- Will explore how Stan behaves with NULL data variable.
Lessons learnt:
- Learnt how to do logistic regression using cmdstanr
- Declaration of data type is important in Stan to get accurate estimates
- Stan has changed its array syntax from real y[N] to array[N] real y
- Learnt from Alec that rnorm(n, mu, sigma) is equivalent to mu + rnorm(n, 0, sigma)
- Stan Manual is a good reference to go to
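The rnorm lesson above is easy to verify yourself: with the same seed, both calls consume the random number stream identically, because R generates rnorm(n, mu, sigma) as mu + sigma * z under the hood:

```r
set.seed(42)
a <- rnorm(5, mean = 3, sd = 2)      # shifted normal draws
set.seed(42)
b <- 3 + rnorm(5, mean = 0, sd = 2)  # shift applied after the draw
all.equal(a, b)                      # TRUE
```

This is also why you'll often see Bayesian simulations written either way interchangeably.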
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on twitter, GitHub or Mastodon
- if you would like to collaborate, please feel free to contact me