An Educational Stroll With Stan - Part 2
By Ken Koon Wong in R, Stan, cmdstanr, Bayesian, beginner
September 28, 2023
I learned a great deal throughout this journey. In this second part, I gained knowledge about implementing logistic regression in Stan. I also learned the significance of data type declarations for obtaining accurate estimates, how to use the posterior to predict new data, and what the generated quantities block in Stan is for. Moreover, having a friend who is well-versed in Bayesian statistics proves invaluable when delving into the Bayesian realm! Very fun indeed!
We looked at linear regression previously; now let’s take a look at logistic regression.
Objectives
- Load Library & Simulate Simple Data
- What Does A Simple Logistic Regression Look Like In Stan?
- Visualize It!
- How To Predict Future Data?
- Acknowledgement/Fun Fact
- Lessons Learnt
Load Library & Simulate Simple Data
library(tidyverse)
library(cmdstanr)
library(bayesplot)
library(kableExtra)
set.seed(1)
n <- 1000
w <- rnorm(n)
x <- rbinom(n,1,plogis(-1+2*w))
y <- rbinom(n,1,plogis(-1+2*x + 3*w))
collider <- -0.5*x + -0.6*y + rnorm(n)
df <- list(N=n,x=x,y=y,w=w, collider=collider) #cmdstanr uses list
df2 <- tibble(x,y,collider,w) #this is for simple logistic regression check
Same DAG as before, but this time both x and y are binary, generated via the binomial distribution.
Look At GLM summary
model <- glm(y ~ x + w, df2, family = "binomial")
summary(model)
##
## Call:
## glm(formula = y ~ x + w, family = "binomial", data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93224 -0.42660 -0.06937 0.29296 2.79793
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0352 0.1240 -8.348 <2e-16 ***
## x 2.1876 0.2503 8.739 <2e-16 ***
## w 2.9067 0.2230 13.035 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1375.9 on 999 degrees of freedom
## Residual deviance: 569.2 on 997 degrees of freedom
## AIC: 575.2
##
## Number of Fisher Scoring iterations: 6
Nice! The x and w coefficients and the intercept are close to our simulated model! The x coefficient is 2.1875657 (true: 2), the w coefficient is 2.9066911 (true: 3), and the intercept is -1.0352 (true: -1).
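A quick way to line the fitted coefficients up against the simulation is a small table. This is just a sketch; the true column simply repeats the values we plugged into plogis() when simulating y.
# compare fitted GLM coefficients with the simulation values
tibble(term = names(coef(model)),
       estimate = unname(coef(model)),
       true = c(-1, 2, 3))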
What About Collider?
model2 <- lm(collider ~ x + y, df2)
summary(model2)
##
## Call:
## lm(formula = collider ~ x + y, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5744 -0.6454 -0.0252 0.7058 2.8273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007275 0.044300 0.164 0.87
## x -0.461719 0.090672 -5.092 4.23e-07 ***
## y -0.610752 0.086105 -7.093 2.48e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.032 on 997 degrees of freedom
## Multiple R-squared: 0.1753, Adjusted R-squared: 0.1736
## F-statistic: 105.9 on 2 and 997 DF, p-value: < 2.2e-16
Not too shabby. The x coefficient is -0.4617194 (true: -0.5) and the y coefficient is -0.6107524 (true: -0.6). Perfect! But how do we do this in Stan?
Let’s break down what exactly we want to estimate.
\begin{gather} y\sim\text{bernoulli}(p) \\ \text{logit}(p) = a_y+b_{yx}.x+b_{yw}.w \\ \\ collider\sim\text{normal}(\mu_{collider},\sigma_{collider}) \\ \mu_{collider}=a_{collider}+b_{collider\_x}.x+b_{collider\_y}.y \end{gather}
We’re basically interested in \(a_y\), \(b_{yx}\), \(b_{yw}\), \(a_{collider}\), \(b_{collider\_x}\), and \(b_{collider\_y}\), to see whether they reflect the parameters set in our simulation.
Logistic Regression via Stan
data {
int N;
array[N] int x;
array[N] int y;
array[N] real w; // note that this is not int but real
array[N] real collider; // same here
}
parameters {
// parameters for y (bernoulli)
real alpha_y;
real beta_yx;
real beta_yw;
// parameters for collider (normal)
real alpha_collider;
real beta_collider_x;
real beta_collider_y;
real<lower=0> sigma_collider; // standard deviation must be positive
}
model {
// prior
// no explicit priors: Stan defaults to flat (improper uniform) priors over each parameter's support
// likelihood
y ~ bernoulli_logit(alpha_y + beta_yx * to_vector(x) + beta_yw * to_vector(w));
collider ~ normal(alpha_collider + beta_collider_x * to_vector(x) + beta_collider_y * to_vector(y), sigma_collider);
}
Save the above as log_sim.stan. Note that we didn’t have to apply inv_logit ourselves; bernoulli_logit applies the inverse-logit link to the linear predictor for us. Did you also notice that the data declarations use array[N] (data type) (variable) instead of the older (data type) variable[N]? This is the current way of declaring arrays in Stan.
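As a quick sanity check on what bernoulli_logit is doing under the hood, the inverse-logit (logistic) link is the same plogis() we used to simulate the data. A minimal R sketch, with an arbitrary x = 1 and w = 0.5:
# inverse logit: plogis(eta) == 1 / (1 + exp(-eta))
eta <- -1 + 2*1 + 3*0.5   # example linear predictor
plogis(eta)               # probability that y = 1
1 / (1 + exp(-eta))       # same thing, written out by hand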
Run The Model in R and Analyze
mod <- cmdstan_model("log_sim.stan")
fit <- mod$sample(data = df,
chains = 4,
iter_sampling = 2000,
iter_warmup = 1000,
seed = 123,
parallel_chains = 4
)
fit$summary()
Not too shabby either! The Stan model accurately estimated the alpha_y, beta_yx, beta_yw, alpha_collider, beta_collider_x, and beta_collider_y parameters. Rhat is 1, below the usual 1.05 cutoff, and ess_bulk & ess_tail are > 100. Model diagnostics look good!
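If you only want the parameters of interest, fit$summary() accepts a variables argument, and cmdstanr also has a built-in sampler diagnostic summary. A sketch (the diagnostic_summary() method assumes a reasonably recent cmdstanr):
# summary restricted to the parameters we care about
fit$summary(variables = c("alpha_y", "beta_yx", "beta_yw",
                          "alpha_collider", "beta_collider_x",
                          "beta_collider_y", "sigma_collider"))
# divergences, max treedepth hits, and E-BFMI per chain
fit$diagnostic_summary()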
Visualize The Beautiful Convergence
mcmc_trace(fit$draws())
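With seven parameters plus lp__, the faceted trace plot can get crowded; passing only the variables you want keeps it readable (a small variation on the call above):
# trace plots for the y-model parameters only
mcmc_trace(fit$draws(variables = c("alpha_y", "beta_yx", "beta_yw")))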
How To Predict Future Data?
You know how in R or Python, you can save the model and then type something like predict and the probabilities will miraculously appear? Well, to my knowledge, you can’t in Stan or cmdstanr. I heard you can with rstanarm, rethinking, brms, etc. But let’s roll with it and see how we can do this.
Instructions:
- Extract your coefficients of interest from the posterior
- Write another Stan model with a generated quantities block
- Feed new data into that Stan model and extract the expected values
Recall this was our formula:
\begin{gather} y\sim\text{bernoulli}(p) \\ \text{logit}(p) = a_y+b_{yx}.x+b_{yw}.w \end{gather}
We want to extract alpha_y, beta_yx, and beta_yw.
Let’s Estimate y
# extract the posterior draws for each parameter and take the mean
alpha_y <- fit$draws(variables = "alpha_y") |> mean()
beta_yx <- fit$draws(variables = "beta_yx") |> mean()
beta_yw <- fit$draws(variables = "beta_yw") |> mean()
#new data
set.seed(2)
n <- 100
x <- rbinom(n,1,0.5) # randomly generate binary x with probability 0.5
w <- rnorm(n) # randomly generate continuous data for w
df_new <- list(N=n,x=x,w=w,alpha_y=alpha_y,beta_yx=beta_yx,beta_yw=beta_yw)
New Stan model
data {
int<lower=0> N;
array[N] int x;
array[N] real w;
real alpha_y;
real beta_yx;
real beta_yw;
}
// parameters {
// array[N] real y_pred;
// }
generated quantities {
array[N] real<lower=0,upper=1> y_pred;
for (i in 1:N) {
y_pred[i] = inv_logit(alpha_y + beta_yx * x[i] + beta_yw * w[i]);
}
}
Save the above as log_sim_pred.stan. Note that this time, instead of declaring a model block, we put the equation in the generated quantities block, which computes y_pred according to our formula.
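Because this prediction only uses the posterior means, the same numbers can be reproduced directly in R with plogis(), which makes a handy sanity check on the generated quantities block. A sketch using the objects created above:
# same calculation as the generated quantities block, done in R
y_pred_r <- plogis(alpha_y + beta_yx * x + beta_yw * w)
head(y_pred_r)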
Load Prediction Stan Model
mod <- cmdstan_model("log_sim_pred.stan")
fit2 <- mod$sample(data = df_new,
                   iter_sampling = 1,
                   chains = 1,
                   fixed_param = TRUE)
Notice that we provided df_new as the data and changed iter_sampling to 1; otherwise we’d just get a bunch of identical rows. Give it a try yourself! The same goes for chains: additional chains of the same values provide no additional value. Lastly, we have to specify fixed_param = TRUE because there are no parameters left to sample.
Merge Predicted y and True y
# create df with y_pred
df_pred <- as.data.frame(fit2$draws(variables = "y_pred")) |>
pivot_longer(cols = everything(), names_to = "y_pred1", values_to = "y_pred") |>
select(y_pred)
# create df with y_actual (the true probability from the simulation)
df_actual <- tibble(x=x,w=w,y_actual=plogis(-1+2*x + 3*w)) |>
select(y_actual)
# merge the 2 dfs and check the diff
df_combined <- df_pred |>
add_column(df_actual) |>
mutate(diff = y_actual - y_pred)
# load the first 5 rows
df_combined |>
head(5) |>
kable()
y_pred | y_actual | diff |
---|---|---|
0.029370 | 0.0288923 | -0.0004777 |
0.999270 | 0.9992532 | -0.0000168 |
0.382207 | 0.3347584 | -0.0474486 |
0.936812 | 0.9441253 | 0.0073133 |
0.129852 | 0.1050137 | -0.0248383 |
Not too shabby! Differences are quite small for the first 5. Let’s histogram it and see.
df_combined |>
ggplot(aes(x=diff)) +
geom_histogram() +
theme_minimal()
Visualize Predicted and Actual y
df_combined |>
ggplot(aes(x=y_actual,y=y_pred)) +
geom_point() +
geom_smooth(formula = "y ~ x") +
theme_minimal()
Almost a straight line through. Awesomeness! I’m not really sure why there’s an up-and-down pattern between 0.25 and 0.75. If you know, please let me know!
Acknowledgement/Fun Fact:
I truly want to thank Alec Wong for helping me with a problem I encountered. Initially my estimates were off, especially the collider parameters. I spent a few days and was not able to find the cause. When you run into a problem, stop and work on something else, or ask a friend! I did both! Alec wrote a JAGS model and was able to extract accurate estimates. He sent me the script, we fed it into ChatGPT to spit out a Stan model, and then realized that my data declaration had a mistake! Well, 2 mistakes! I had declared both w and collider as int instead of real. Here are the estimates when both w and collider were declared as int.
Notice how the collider parameters are off!?!?! The 95% credible intervals don’t even contain the true values.
Again, THANK YOU ALEC !!!
Things To Improve On:
- Will explore further using informed prior in the future
- Will explore simulated data of sens/spec of a diagnostic test and then apply prior to obtain posterior
- Will explore how Stan behaves with NULL data variable.
Lessons learnt:
- Learnt how to do logistic regression using cmdstanr
- Declaration of data type is important in Stan to get accurate estimates
- Stan has changed its array declaration syntax: real y[N] is now array[N] real y
- Learnt from Alec that rnorm(n, mu, sigma) == mu + rnorm(n, 0, sigma) (see the quick check after this list)
- Stan Manual is a good reference to go to
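A quick way to convince yourself of that rnorm() identity: the two calls describe the same distribution, and with the same seed the draws even match exactly (a tiny check, using arbitrary mu = 10 and sigma = 2):
# rnorm(n, mu, sigma) is the same as shifting a zero-mean draw by mu
set.seed(3)
a <- rnorm(5, mean = 10, sd = 2)
set.seed(3)
b <- 10 + rnorm(5, mean = 0, sd = 2)
all.equal(a, b)  # TRUE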
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on twitter, GitHub or Mastodon
- if you would like to collaborate, please feel free to contact me