Hugging Face 🤗, with a warm embrace, meet R️ ❤️

By Ken Koon Wong in r R hugging face bert finbert transformers reticulate

September 1, 2023

I’m delighted that R users can access the incredible Hugging Face pre-trained models. In this demonstration, we walk through a straightforward example of using them for sentiment analysis on GPT-generated synthetic evaluation comments. Let’s go!

Interesting Problem 😎

What if you’re faced with a list of survey comments that you need to sift through? Apart from reading them one by one, is there a method that could potentially introduce a new perspective and expedite this process? Are there any models available for performing sentiment analysis?

Brief Intro to Transformers Python Module & Hugging Face

Transformers


In comes Transformers, which provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as NLP, computer vision, audio, and multimodal. Transformers supports framework interoperability between PyTorch, TensorFlow, and JAX. Pretty cool, right!? But wait, this is a Python 🐍 API! No fear: as we’ve demonstrated before, R can use Python modules with ease. Let’s code!


About Hugging Face 🤗

Hugging Face is a technology company specializing in natural language processing (NLP) and machine learning, best known for its Transformers library, an open-source collection of pre-trained models and tools that simplify the use of advanced NLP techniques. Established in 2016, the company has become a significant contributor to the field of AI, democratizing access to state-of-the-art models like BERT, GPT-2, and many others. Their platform allows developers, researchers, and businesses to easily implement complex NLP tasks such as sentiment analysis, text summarization, and machine translation. With a robust community of users contributing to its ecosystem, Hugging Face has become a go-to resource for those looking to harness the power of machine learning for language-based tasks.

Installing Transformers and Loading Module

library(reticulate)
library(tidyverse)
library(DT)

# install transformers (plus torch, which is needed later for return_tensors='pt')
# py_install(c("transformers", "torch"), pip = TRUE) # remember to uncomment and run this first

# load transformers module 
transformer <- import("transformers")
autotoken <- transformer$AutoTokenizer
autoModelClass <- transformer$AutoModelForSequenceClassification

The code above for loading transformers is equivalent to the following in Python:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

Load Reuters Dataset

# load data
df <- read_csv("reuters_headlines2.csv") |>
  head(10)

# extract the headlines section
df_list <- df |>
  pull(Headlines)

Load Pre-trained Model & Predict

When you go to the Models section of Hugging Face, click on Text Classification and then sort by Most Likes. The above is a snapshot of that. Going by the wisdom of the crowd, I think the most-liked pre-trained models might be good ones to try out! Let’s give them a try!

Load model

tokenizer <- autotoken$from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model <- autoModelClass$from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Let’s look at what the model predicts

model$config$id2label
## $`0`
## [1] "NEGATIVE"
## 
## $`1`
## [1] "POSITIVE"

Ahh, ok. For distilbert-base-uncased-finetuned-sst-2-english, the output is either NEGATIVE (label 0) or POSITIVE (label 1).

Let’s feed our data into the tokenizer and see what comes out.

inputs <- tokenizer(df_list, padding=TRUE, truncation=TRUE, return_tensors='pt') # pt stands for pytorch

inputs$data
## $input_ids
## tensor([[  101,  3956,  2000,  2224,  3424,  1011,  7404,  6627,  2000,  4675,
##          21887, 23350,  1005,  8841,  4099,  1005,   102,     0,     0],
##         [  101,  1057,  1012,  1055,  1012,  4259,  2457,  2000,  3319,  9531,
##           1997, 14316,  3860,  4277,   102,     0,     0,     0,     0],
##         [  101, 24547, 28637,  3619,  2000,  2485,  2055,  3263,  5324,  1999,
##           2142,  2163,   102,     0,     0,     0,     0,     0,     0],
##         [  101,  2762,  1005,  1055,  2482,  3422, 16168,  4520, 20075, 29227,
##           2000,  6366,  6206,  7937,  4007,  1024,  3189,   102,     0],
##         [  101,  2859,  2758,  1057,  1012,  1055,  1012,  2323,  2425,  7608,
##           2000,  2689, 11744,  1999,  6629,  5216,   102,     0,     0],
##         [  101, 21396,  6290,  4152,  2117,  6105,  4895, 23467,  2075,  7045,
##           7708,  2011,  1057,  1012,  1055,  1012, 17147,   102,     0],
##         [  101, 10321, 20202,  1999,  2148,  3792,  2000,  3789,  2006,  2586,
##           3989,  1024,  1059,  2015,  3501,   102,     0,     0,     0],
##         [  101,  3119,  1011,  7591, 15768,  2006, 14607,  2004, 12503, 21094,
##            102,     0,     0,     0,     0,     0,     0,     0,     0],
##         [  101,  4717,  2884,  4487,  3736,  9397, 25785,  2015,  2007,  2117,
##           1011,  4284,  3463,  1010, 17472,  3105,  7659,   102,     0],
##         [  101,  2317,  2160,  1005,  1055, 23524,  2758,  1005,  2093,  9326,
##           2017,  1005,  2128,  2041,  1005,  2005,  1062,  2618,   102]])
## 
## $attention_mask
## tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Interesting! input_ids are the numerical representations of the tokens in your input sequence(s). The first value, 101, is the special token [CLS]; in models like BERT, the hidden state at this position is commonly used as the summary of the whole sequence for classification. The attention_mask tensor indicates which positions in the input sequence should be attended to and which should not. A 1 means the position is used in the attention mechanism, while a 0 marks padding to be ignored.
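To make the padding idea concrete, here is a minimal sketch in plain Python (no transformers install needed) of how a batch of token IDs could be right-padded and paired with an attention mask. The helper name pad_batch is made up for illustration and is not part of the transformers API; the real tokenizer does this for you when padding=TRUE. The token IDs are copied from rows 3 and 8 of the output above, and a pad ID of 0 is assumed because that is what the output shows.

```python
# Plain-Python sketch of padding + attention mask.
# pad_batch is a hypothetical helper, not a transformers function.

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-tokenized headlines taken from the output above.
batch = pad_batch([
    [101, 24547, 28637, 3619, 2000, 2485, 2055, 3263, 5324, 1999, 2142, 2163, 102],
    [101, 3119, 1011, 7591, 15768, 2006, 14607, 2004, 12503, 21094, 102],
])

for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids)
    print(mask)
```

The shorter headline gets two trailing zeros in input_ids and two matching zeros in attention_mask, which is the same pattern you can see in the tensors above.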

Now let’s dive into the tokenization of the data

df_list[1:5]

inputs$data[[1]][0:4] # notice that python begins with 0

Above is a snapshot of my console that showed the actual words and the tokens. It looks like token 2000 is “to”. Note that each sequence begins with 101 and ends with 102 (before any padding).

In transformer models like BERT, certain special tokens are often used to help the model understand the task it should perform. These special tokens are represented by special IDs. The 101 and 102 tokens are such special tokens, and they have particular meanings:

101 represents the [CLS] (classification) token. This is usually the first token in a sequence and is used for classification tasks. For tasks like sequence classification, the hidden state corresponding to this token is used as the aggregate sequence representation for classification.

102 represents the [SEP] (separator) token. This token is used to separate different segments in a sequence. For instance, if you’re inputting two sentences into BERT for a task like question-answering or natural language inference, the [SEP] token helps the model distinguish between the two sentences.
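The two special tokens described above can be sketched in a few lines of plain Python. This is only an illustration: add_special_tokens is a made-up helper name, and the real tokenizer inserts these IDs for you. The headline IDs are copied from row 8 of the tokenizer output earlier in the post; the sentence-pair example reuses fragments of those IDs purely as placeholders.

```python
CLS_ID = 101  # [CLS], placed at the start of every sequence
SEP_ID = 102  # [SEP], closes each segment


def add_special_tokens(first, second=None):
    """Wrap one segment (or a pair of segments) in BERT-style special tokens."""
    ids = [CLS_ID] + list(first) + [SEP_ID]
    if second is not None:
        # Sentence pairs look like: [CLS] A [SEP] B [SEP]
        ids += list(second) + [SEP_ID]
    return ids


# Token IDs of one headline, without special tokens or padding.
headline = [3119, 1011, 7591, 15768, 2006, 14607, 2004, 12503, 21094]

single = add_special_tokens(headline)
pair = add_special_tokens(headline[:3], headline[3:5])

print(single)
print(pair)
```

The single-segment result matches row 8 of the tokenizer output once you ignore the trailing padding zeros.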

As practice, which tokens make up U.S.? (Answer, from the second row above: 1057, 1012, 1055, 1012, that is u . s .)

Let’s Check The Prediction

# reticulate has no equivalent of Python's ** unpacking, so pass each argument explicitly
outputs <- model(inputs$input_ids, attention_mask=inputs$attention_mask)

outputs
## $logits
## tensor([[ 2.1572, -1.8241],
##         [-1.6254,  1.5929],
##         [ 1.3497, -1.1460],
##         [ 3.3878, -2.8804],
##         [ 3.8068, -3.1309],
##         [ 2.1719, -1.8269],
##         [ 1.6600, -1.5161],
##         [ 2.0822, -1.8792],
##         [ 4.2344, -3.4873],
##         [ 1.8456, -1.4874]], grad_fn=<AddmmBackward0>)

Ahh, these are logits. Also, note that we cannot write model(**inputs) as in Python; we have to pass in the individual parameters.
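For anyone wondering what model(**inputs) actually does on the Python side, here is a small sketch. In Python, ** unpacks a dict-like object into keyword arguments, so the two calls below are equivalent. fake_model is a hypothetical stand-in for a real model's forward call, and the tiny inputs dict is invented for illustration.

```python
def fake_model(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model call; just reports what it received."""
    return {
        "n_sequences": len(input_ids),
        "has_mask": attention_mask is not None,
    }


# A dict shaped like the tokenizer output (values made up for illustration).
inputs = {
    "input_ids": [[101, 2000, 102]],
    "attention_mask": [[1, 1, 1]],
}

# Python style: unpack the dict into keyword arguments.
unpacked = fake_model(**inputs)

# What the R code does instead: pass each argument explicitly.
explicit = fake_model(inputs["input_ids"], attention_mask=inputs["attention_mask"])

print(unpacked == explicit)
```

So the explicit form used in the R code loses nothing; it is just a little more typing.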

Load torch and change to probability

torch <- import("torch")
predictions <- torch$nn$functional$softmax(outputs$logits, dim=1L)

predictions
## tensor([[9.8168e-01, 1.8320e-02],
##         [3.8484e-02, 9.6152e-01],
##         [9.2384e-01, 7.6159e-02],
##         [9.9811e-01, 1.8920e-03],
##         [9.9903e-01, 9.6959e-04],
##         [9.8199e-01, 1.8008e-02],
##         [9.5992e-01, 4.0076e-02],
##         [9.8132e-01, 1.8679e-02],
##         [9.9956e-01, 4.4292e-04],
##         [9.6555e-01, 3.4454e-02]], grad_fn=<SoftmaxBackward0>)
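As a sanity check on the softmax step, here is a minimal plain-Python sketch that recomputes the first two rows by hand from the logits printed earlier. The softmax function below is my own few-line version (with the usual subtract-the-max trick for numerical stability), not the torch implementation.

```python
import math


def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(row)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# First two rows of the logits shown earlier (NEGATIVE, POSITIVE).
logits = [
    [2.1572, -1.8241],
    [-1.6254, 1.5929],
]

probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

The rounded values should agree with the first two rows of the tensor above, up to the rounding of the printed logits.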

Yes! They’re probabilities now. But how do we turn these tensors into a tibble?

# turn tensor to list
pred_table <- predictions$tolist()

# map list into dataframe
table <- map_dfr(pred_table, ~ tibble(positive = .[2], negative = .[1]))

datatable(table)

Awesome! Looks like at least the code worked. Let’s combine the headlines and the scores to check.

df |>
  head(10) |>
  select(Headlines) |>
  add_column(table) |>
  datatable()

Wow, most of the news is quite negative. 🤣 I’m not sure distilbert-base-uncased-finetuned-sst-2-english is the best pre-trained model for these data.

Let’s check out ProsusAI/finbert

tokenizer <- autotoken$from_pretrained("ProsusAI/finbert")
model <- autoModelClass$from_pretrained("ProsusAI/finbert")
inputs <- tokenizer(df_list, padding=TRUE, truncation=TRUE, return_tensors='pt')
outputs <- model(inputs$input_ids, attention_mask=inputs$attention_mask)
predictions <- torch$nn$functional$softmax(outputs$logits, dim=1L)
pred_table <- predictions$tolist()
table <- map_dfr(pred_table, ~ tibble(positive = .[1], negative = .[2], neutral = .[3]))

df |>
  select(Headlines) |>
  add_column(table) |>
  datatable()

I like the additional option of neutral. This might actually be very helpful for our actual problem of evaluation comments.
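With three classes, a natural next step is to pick the most likely label per row. Here is a small plain-Python sketch of that. The label order (positive, negative, neutral) is assumed from the tibble() call above, top_label is a made-up helper name, and the probability rows are invented for illustration rather than taken from real model output.

```python
# Label order assumed to match the tibble() call above.
LABELS = ("positive", "negative", "neutral")


def top_label(probs, labels=LABELS):
    """Return the most likely label and its probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Made-up probability rows, one per headline.
example_rows = [
    [0.05, 0.90, 0.05],
    [0.10, 0.08, 0.82],
]

for row in example_rows:
    print(top_label(row))
```

If you adapt this, it is worth checking model$config$id2label first so the label order matches the model you loaded.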

Predict GPT-4-generated comments 🤖

First, Generate Data

Second, Use finBERT for Sentiment Analysis

eval_df <- read_csv("eval_comment.csv") |> 
  pull(comment)

inputs <- tokenizer(eval_df, padding=TRUE, truncation=TRUE, return_tensors='pt')
outputs <- model(inputs$input_ids, attention_mask=inputs$attention_mask)
predictions <- torch$nn$functional$softmax(outputs$logits, dim=1L)
pred_table <- predictions$tolist()
table <- map_dfr(pred_table, ~ tibble(positive = .[1], negative = .[2], neutral = .[3]))

df_final <- tibble(comment = eval_df) |>
  add_column(table) |>
  select(-negative) |>
  mutate(positive = positive + neutral) |>
  select(-neutral)

datatable(df_final)

Wow, not bad! If we flag every comment whose combined positive-plus-neutral score falls below a threshold of about 0.9 as potentially negative, we might do pretty well!
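Here is a minimal sketch of that screening idea in plain Python. It mirrors the R code above by adding the positive and neutral probabilities together, then flags anything under the threshold for a human to read. The function name flag_for_review, the comments, and the scores are all invented for illustration.

```python
def flag_for_review(rows, threshold=0.9):
    """Return comments whose positive + neutral score is below the threshold."""
    flagged = []
    for comment, positive, neutral in rows:
        score = positive + neutral
        if score < threshold:
            flagged.append((comment, round(score, 4)))
    return flagged


# Made-up (comment, positive, neutral) rows.
rows = [
    ("Clear explanations and great pacing.", 0.93, 0.05),
    ("Sessions often started late.", 0.02, 0.10),
    ("The course met on Tuesdays.", 0.04, 0.93),
]

flagged = flag_for_review(rows)
print(flagged)
```

Only the second comment is flagged here; the neutral one passes because its neutral probability is counted toward the combined score.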

Third, datatable with backgroundColor conditions for Aesthetics 📊

datatable(df_final, options = list(columnDefs = list(list(visible = FALSE, targets = 2)))) |>
  formatStyle(
    columns = "comment",
    valueColumns = "positive",
    backgroundColor = styleInterval(
      cuts = c(0.5, 0.95),
      values = c('#FF000033', '#FFA50033', '#0000FF33')
    )
  )

Notice that I had to set a threshold of 0.95 to ensure all negative comments are captured. That means only comments with a score above 0.95 get a blue background. Anything above 0.5 and up to 0.95 is orange, and anything at 0.5 or below is red.
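To see how the cut points map scores to colors, here is a small plain-Python sketch of the same bucketing. It assumes the intervals are closed on the right (a score exactly equal to a cut falls in the lower bucket), which is my understanding of how styleInterval behaves. background_color and alpha_fraction are made-up helper names. The second helper shows what the trailing 33 in each hex code means: an alpha of 0x33 / 255, or 20% opacity.

```python
from bisect import bisect_left

CUTS = [0.5, 0.95]
# red, orange, blue; the trailing "33" is the alpha channel
COLORS = ["#FF000033", "#FFA50033", "#0000FF33"]


def background_color(score, cuts=CUTS, colors=COLORS):
    """Pick a color by interval; a score equal to a cut goes in the lower bucket."""
    return colors[bisect_left(cuts, score)]


def alpha_fraction(hex_rgba):
    """Convert the last two hex digits of #RRGGBBAA into a 0-1 opacity."""
    return int(hex_rgba[-2:], 16) / 255


for score in (0.30, 0.50, 0.70, 0.95, 0.99):
    print(score, background_color(score))

print(alpha_fraction("#FF000033"))
```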


We’re done!!! Now we know how to access Hugging Face pre-trained models through transformers! This opens up another realm of awesomeness!

Acknowledgement

  • This Colab link really helped me modify some of the code to make it work in R
  • Thanks to my brother Ken S’ng, who inspired me to explore Hugging Face with his previous Python script
  • Thanks to ChatGPT for generating synthetic evaluation data!
  • Of course, last but not least, the wonderful open-source community of Hugging Face! 🤗

Lessons learnt

  • Markdown hover text can be achieved through [](## "")
  • Changing the alpha of a hex color code can be achieved through a ChatGPT prompt.
  • There are tons of great pre-trained models in Hugging Face, can’t wait to explore further!

