Stable Diffusion 3 in R? Why not? Thanks to {reticulate} 🙏❤️🙌

By Ken Koon Wong in r R python positron stable diffusion generative AI reticulate art txt2img

September 1, 2024

‘Fascinating’ describes my journey with Stable Diffusion 3. It’s deepened my appreciation for original art and masterpieces. Understanding how to generate quality art is just the beginning—it drives me to explore the underlying structure. Join me in exploring SD3 in R!

Objectives

Disclaimer

This is for educational purposes only. Please read through the intended uses, safety, risks, and mitigations here. All images in this article were generated by Stable Diffusion 3 (SD3). Your hardware may be different from mine; the following code was written for Mac Metal.

Motivation

image

Since our fascination with LLMs, our next adventure is generative AI art, which comes in handy when generating images for blogs. Since our experience with prompt engineering and RAG has been quite informative, why not give genAI art a try as well! Our next adventure is text2img generative AI. Let’s take a look at a simple approach, from installation to getting started in R!

Python <- Positron -> R

image

Alright, I have to admit, I did not start this off in R first. I used Python in Positron to figure out how it works before transitioning to R. To be honest, Positron has been one of my favorite IDEs! I wasn’t much of a pythoner, mainly because of the lack of single line / chunk execution (without being in a notebook, of course); it’s hard for me to understand what each chunk of code means or returns without executing them one by one, just like in Rstudio! But with Positron, it’s like Rstudio for Python! It’s been a great journey and I am finding myself liking Python as much as R! It’s a great feeling! You can technically use Rstudio for Python (which I have), but I found that the autocompletion in Rstudio did not return as complete a list of Python modules as VScode or Positron. 🤷‍♂️ Maybe it’s just me. Getting to know both Python and R, what’s there not to ❤️ in these two elegant languages!? I say we should use both! 🤣 Oh what do I know…

What is Stable Diffusion?

Stable Diffusion is a deep learning, text-to-image model that uses diffusion techniques to generate detailed images based on text descriptions. It’s also capable of other tasks like inpainting and outpainting. Released in 2022, it’s a product of Stability AI and is considered part of the current AI boom. Unlike previous proprietary models, Stable Diffusion is open-source and can run on most consumer GPUs, making it more accessible.

Installation

image

I assume you already know how to use R and have reticulate installed with some python knowledge.

1. Create a python environment

library(reticulate)

virtualenv_create(envname = "sd") # you can change sd to whatever you want

Virtual environments in Python provide isolated spaces for projects, each with its own set of dependencies and potentially different Python versions. This isolation prevents conflicts between projects, ensures reproducibility, and makes dependency management easier.

You will have to restart your IDE in order to use the environment.
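If you want to double-check that the environment was actually created before restarting, reticulate has helpers for that. A minimal sanity check (entirely optional) looks like this:

# confirm the "sd" virtual environment was created
virtualenv_exists("sd") # should return TRUE

# list all virtual environments reticulate knows about
virtualenv_list()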

2. Use the virtual environment & Download the necessary python modules

library(reticulate)

use_virtualenv("sd")

# installation
py_install(
  c("diffusers", "transformers", "torch", "torchvision", "torchaudio",
    "accelerate", "sentencepiece", "protobuf"),
  envname = "sd",
  pip = TRUE
)
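Once the install finishes, a quick (optional) way to confirm that reticulate is bound to the "sd" environment and that the key modules are importable is:

# confirm which python reticulate is bound to (should point to the "sd" environment)
py_config()

# confirm the key modules can be found
py_module_available("diffusers") # should return TRUE
py_module_available("torch") # should return TRUE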

Reproducible Code & Explanation

1. Load modules

diffusers <- import("diffusers") # hugging face diffusers module
torch <- import("torch") # pytorch
StableDiffusion3Pipeline <- diffusers$StableDiffusion3Pipeline
pil <- import("PIL") # pillow, for saving images and PNG metadata

2. Load Model

pipe <- StableDiffusion3Pipeline$from_pretrained("stabilityai/stable-diffusion-3-medium", torch_dtype=torch$float16)
pipe$to("mps") # assign to Metal; change to 'cuda' if you have an nvidia gpu, or 'cpu' if you have neither

# prepare to generate seed
generator <- torch$Generator()

Now, if this is your first time running this, it might take some time to download the model. It may ask for a Huggingface API key; if you have not created an account or obtained a key, please click here to request an access token. You will need to check “read/write” on repo access in order for the token to work. If you have tried and failed, please let me know, and I’ll see if I can assist you in getting the right one.
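If the download complains about authentication, one option is to log in from the same R session through the huggingface_hub module, which comes along with diffusers. This is just a sketch; the token string below is a placeholder for your own access token:

# log in to hugging face from R via reticulate
hf_hub <- import("huggingface_hub")
hf_hub$login(token = "hf_xxx") # placeholder, replace with your own token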

Once your model is downloaded, it will get loaded and we’re ready to go! If you want to save the model for future local use without re-downloading, save it to your desired directory.

2.5 Optional

pipe$save_pretrained("stable_diffusion_v3_model/")
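The next time around, you can then point from_pretrained at that local directory instead of the Hugging Face repo. Assuming the same folder name as above, it would look like this:

# load the locally saved model instead of re-downloading
pipe <- StableDiffusion3Pipeline$from_pretrained("stable_diffusion_v3_model/", torch_dtype=torch$float16)
pipe$to("mps")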

3. Prompt & Settings

metadata <- list(
  prompt = 'paint mona lisa by da vinci in Picasso\'s cubism style which is represented by fragmented forms, multiple perspectives in a single image, geometric shapes',
  num_inference_steps = 60L,
  height = 512L,
  width = 512L,
  seed = 1000L,
  guidance_scale = 8
)

output <- pipe(
  prompt = metadata$prompt,
  prompt_3 = metadata$prompt,
  num_inference_steps = metadata$num_inference_steps,
  height = metadata$height,
  width = metadata$width,
  generator = generator$manual_seed(metadata$seed),
  guidance_scale = metadata$guidance_scale
)

output$images[[1]]$show()

If you’re just trying it out, you can do without the metadata list and pass all those parameters directly to pipe. But if you’re planning to generate multiple images, it’s best to save them in a list for easy metadata insertion when you’re saving to PNG.

Depending on the speed of your hardware, generating the image may take some time; with the above prompt and settings, mine took about 2-3 seconds per iteration (per num_inference_steps). You can play around with that setting to speed things up before fine-tuning the image to get better quality. Same with the width and height: the smaller they are, the faster the image generates.
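One approach I find handy is to sketch the image first with a low step count and a smaller canvas, then rerun with the full settings once the composition looks right. A minimal sketch of that idea, reusing the metadata list from above with a few values overridden:

# quick low-cost preview before committing to 60 steps at 512x512
preview <- pipe(
  prompt = metadata$prompt,
  num_inference_steps = 20L, # fewer steps = faster, rougher
  height = 256L,
  width = 256L,
  generator = generator$manual_seed(metadata$seed),
  guidance_scale = metadata$guidance_scale
)
preview$images[[1]]$show()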

With the above prompt, you should see exactly this.

If you want the model to be more consistent in following the prompt, you can increase the guidance_scale parameter. The higher the number, the more closely the model will try to follow the prompt, but this also takes a hit in quality.

prompt_3 is needed if your prompt is longer than 77 tokens, as the regular CLIP text encoder would cut off anything after that. See this.

Lastly, the show() call will open up the PNG file created in a separate window. You can view it in Rstudio, but you will have to use a package such as {magick}.
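For example, one way to preview the result inside the Rstudio viewer, assuming {magick} is installed, is to save the image to a temporary file and read it back:

# save to a temporary file and display it in the Rstudio viewer with {magick}
tmp <- tempfile(fileext = ".png")
output$images[[1]]$save(tmp, format = "PNG")
magick::image_read(tmp)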

Results

Paint mona lisa by da vinci in Picasso's cubism style which is represented by fragmented forms, multiple perspectives in a single image, geometric shapes

image

Interesting, it does have some cubism style but not distinctive enough to be recognized as Picasso’s in my opinion. Might have to work a bit more on the prompt, play with the seed, guidance scale in order to get the result I really want.


Traditional Chinese ink wash painting depicting a ballet dancer in motion. Elegant, flowing brushstrokes capture the graceful movements of the dancer. Minimalist style with areas of negative space. Emphasis on dynamic lines and gestural forms. Monochromatic palette with varying shades of black ink on aged, textured paper. Subtle splatter and dry brush techniques add depth. Composition inspired by Song Dynasty landscapes.

image

I think LLM-generated prompts appear to contain much more detail that SD3 can use.


generate new york central park oil paiting with an ukiyo-e style style, sunset, winter season

image

Even though this is not an LLM-generated prompt, I am quite content with this. It depicts the essence of Central Park, with the city buildings behind, in a slight ukiyo-e style. I wonder how I can make it more of that style with the prompt. Perhaps I have to use LoRA?


generate a painting of an oak tree, summer, highlighting the intricate details of the oak leaf shape and its vein, dandelion seeds floating in the air as the wind blows across the oak tree, with realistic portraiture, sfumato technique, classical themes

image

I am quite content with this one as well, though I had to increase num_inference_steps significantly. The color though is not very summer-like 🤣 it looks more like fall, and I don’t think dandelion seeds are seen at that time of the year. lol! Oh well.


Photorealistic close-up of a white ceramic mug on golden sand, with ‘Too blessed to be stressed’ written in elegant script. Tranquil beach scene in background with gentle waves rolling onto shore. Soft, warm lighting reminiscent of golden hour. Sharp focus on mug with slight depth of field blurring the ocean. Visible texture of sand grains around the mug. Reflections of sky and water on the mug’s glossy surface. Hyper-detailed rendering with vibrant yet natural colors

image

Too blessed, be wh!?!?! I beg your pardon!?! 🤣 The picture looks great though! Might have to try a different seed and/or increase guidance? What do you think?


Georgia O’ Keeffe Flower Painting of dandelion flower, floral motif

image

This looks quite pretty. Doesn’t look very O’Keeffe though… it’s pretty still!

PNG and Metadata

Instead of saving the prompt and config separately, we can save them all inside the PNG file.

# Insert metadata to save as png
info <- pil$PngImagePlugin$PngInfo()
info$add_text(key = "prompt", value = "whatever prompt we want")

# Save PNG
output$images[[1]]$save("something.png", format="PNG", pnginfo = info)

And when you load them through pillow, you can read the metadata like so

test_png <- pil$Image$open("ballet2.png")
# Metadata of the ballet dancer in caligraphic style
test_png$info
## $prompt
## [1] "Traditional Chinese caligraphic painting depicting a ballet dancer in motion. Elegant, flowing brushstrokes capture the graceful movements of the dancer. Minimalist style with areas of negative space. Emphasis on dynamic lines and gestural forms. Monochromatic palette with varying shades of black ink on aged, textured paper. Subtle splatter and dry brush techniques add depth. Composition inspired by Song Dynasty landscapes."
## 
## $num_inference_steps
## [1] 200
## 
## $height
## [1] 512
## 
## $width
## [1] 512
## 
## $seed
## [1] 1001
## 
## $guidance_scale
## [1] 8

The numbers will be stored as strings; you can use the bonus function below to easily convert them back into integers or doubles.

Bonus!!!

image

Let’s write a function that adds the metadata above automatically.

# function to build a PngInfo object from a named metadata list
pnginfo <- function(x) {
  result <- pil$PngImagePlugin$PngInfo()
  for (i in seq_along(x)) {
    # store each element as a text chunk, keyed by its name
    result$add_text(key = names(x)[i], value = as.character(x[[i]]))
  }
  return(result)
}

# save in a list
info <- pnginfo(metadata)

# then add in when saving to png
output$images[[1]]$save("code.png", format="PNG", pnginfo = info)

Well, I figured that converting the character values in the PNG metadata back to integers and so on is quite cumbersome, so I’ve written a function to do the conversion and make it simple to reproduce a piece.

library(stringr) # for str_detect()

convert_back <- function(x) {
  # keep whole numbers as integers, everything else as doubles
  int_float <- function(num) {
    if (num %% 1 == 0) {
      return(as.integer(num))
    } else {
      return(as.numeric(num))
    }
  }

  # convert any element that looks numeric back from a string
  for (i in seq_along(x)) {
    if (str_detect(x[[i]], "^[0-9]")) { x[[i]] <- int_float(as.numeric(x[[i]])) }
    else { next }
  }

  return(x)
}

metadata <- test_png$info

metadata <- convert_back(metadata)

Now you can easily convert the metadata back to integers or doubles. Why distinguish between integer and double? Some of the diffusers parameters expect an integer as opposed to a double/float.

And if any of the images in this article interest you, you can download the PNG and use the method described above to get the prompt and parameters to reproduce the exact piece. It’s deterministic!
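Putting the pieces together, a rough sketch of that round trip, assuming pipe and generator are still loaded and using the code.png saved earlier as the example file, might look like this:

# read the PNG, recover the prompt/settings, and regenerate the same image
png_file <- pil$Image$open("code.png")
params <- convert_back(png_file$info)

output <- pipe(
  prompt = params$prompt,
  prompt_3 = params$prompt,
  num_inference_steps = params$num_inference_steps,
  height = params$height,
  width = params$width,
  generator = generator$manual_seed(params$seed),
  guidance_scale = params$guidance_scale
)
output$images[[1]]$show()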

Why SD3 in R?

image

Investigate function and its output

Well, R works really well when we’re investigating what’s in the package, and also the generated objects. I had no idea what the output of the pipe was, but at least with Rstudio we can easily click on the list and see what’s up! You can probably do that in Positron as well. Not VScode, at least not to my knowledge. Maybe I’m wrong.

Easily generate images

We can easily generate images for our blog, projects, etc. Here we’ve written functions to automatically insert metadata into our PNG files and also to convert it back! How cool! LOL, using a python package.

More control over parameters

Why SD3 though? Why not just use Gemini, OpenAI, or Midjourney? Well, yes, you can use those, but by having control of the seed and the other parameters, I think we can slowly understand how SD3 generates certain aspects of style. I think that’s quite important.

Code chunk execution

This, I think, is the winner for me. Both Positron and Rstudio allow me to learn Python and the modules of interest without being in a notebook! It flows better for me as a non-power user.

Opportunities for improvement

image

Wow, there is so much to learn! We didn’t even go through the basics and fundamentals of SD3: the tokenizer, encoder, VAE, denoiser, scheduler, etc.! Maybe next time! But the math and algorithms behind it are just purely fascinating! Latent space, and the most mind-blowing of all, it’s deterministic!

Other things I want to learn and apply are:

  • upscale: to turn a low-res image into a high-res one, using img2img. I think this will be quite straightforward
  • controlnet: to control the output of the image, such as the color, style, etc. I think this will be quite challenging
  • IP-adapter: to adapt the image to a certain style, such as surrealism, cubism, etc. I think this will be quite challenging as well. Think of it as extracting features of an image and using them to guide image generation alongside a prompt.
  • apply LoRA for a certain style, such as ukiyo-e woodblock print or traditional Chinese calligraphic style.
  • If we truly want to focus on art, we should use a WebUI such as ComfyUI instead of writing code for these. Though there are definitely benefits in automating certain tasks with code.

I highly recommend this book Using Stable Diffusion with Python by Andrew Zhu if you want to get deeper into how diffusers module and SD3 works.

Lessons learnt

image

  • negative prompt does not do much in SD3
  • longer prompts (>77 tokens) can be inserted in prompt_3
  • it is challenging for the model to maintain fidelity when generating exact words; instead of using the word “spell”, I thought “say” might be a better approach
  • generate a few inference steps (e.g. 20) and see if the initial result looks good before increasing the steps to further refine the quality
  • the diffusers documentation is not bad! Quite informative
  • generating prompts via an LLM creates better quality images
  • set a seed and tweak it to learn how the model behaves
  • learnt how to insert metadata in PNG files; if you’re interested in knowing my prompt and params for all the PNGs here, please feel free to use pillow to extract them!

