Intro to Machine Learning using Transformers

A few months ago, I began to learn about machine learning and AI. Initially, I was expecting a steep learning curve with a lot of complex math. But after some exploring, I discovered a rich ecosystem of frameworks, libraries, and services which makes it relatively easy for a developer like me to use machine learning.

What is Machine Learning?

Machine learning is good at tasks that are easy for a person to do but difficult to implement in code, such as determining the species of animal in a photo or deciding whether a product review is negative or positive. So instead of writing code to do the task, in machine learning I write code to train the machine to do the task. For the animal example, training involves providing the machine with a set of animal photos along with the name of the animal in each photo so it can learn the distinguishing features of each species and identify them the next time it is given a new photo. Similarly, for the review example, the machine is provided with a set of product reviews and whether each review is negative or positive so it can learn what a negative or positive review sounds like and label new reviews accordingly. In machine learning, the thing being trained to do a certain task is called the model and the data used to train it is called the dataset.

In this post, I will demonstrate how to train and use a model to classify reviews as positive or negative by training it on a Yelp reviews dataset using the Hugging Face Transformers framework. Along the way, I will discuss the relevant machine learning concepts.

Development Environment

The code for training models is written in Python. Executing this code requires a lot of computation that would take too long to complete on CPUs; instead, the code needs to run on GPUs. Notebooks provide a browser-based code editing environment where I can write and execute my code on GPUs. Most notebook services offer free plans with access to limited compute and storage resources, and paid plans if more resources are needed. For this demo, I use Google Colab notebooks.

First I need to install library dependencies in my notebook. transformers[sentencepiece] installs the Transformers library with extra dependencies, including the sentencepiece tokenizer; it provides the functions for training and using models. datasets provides functions for downloading and processing the data used for training and validating models. evaluate provides functions for calculating metrics when training models. accelerate is also required for training.


!pip install transformers[sentencepiece] datasets evaluate accelerate

Hugging Face provides a Git-like repository for saving and sharing models called the Model Hub. To save the trained model from my notebook to the Model Hub, the notebook needs to authenticate with the Model Hub. It can do this with a user access token with write permissions, which I create in the user settings section of the Hugging Face website and provide to the notebook by calling notebook_login.


from huggingface_hub import notebook_login

notebook_login()

Using Pre-Trained Models

If a machine learning app were a person, the model would be its brain. From a more technical perspective, you can think of the model as a function that performs a specific task, such as determining the sentiment, either positive or negative, of a given text or filling in the missing word in a sentence. Unlike in programming, where a function contains code which implements the desired functionality, in machine learning the function contains millions of numeric parameters called weights which are configured to do the desired task through training. During training, the model's weights are initialized to random values and used to make predictions on the training data. Then the weights are adjusted based on how far off the predictions are from their expected values. This cycle of making predictions and adjusting weights is repeated until the accuracy of the model reaches an acceptable value. Going back to our analogy of the model as a person's brain, training a baby to classify a product review written in English as positive or negative would be more difficult than training an adult that already understands the English language. The equivalent of the adult brain in machine learning is the pre-trained model. This is a model that has already been trained to do a similar task in the target language. Since this pre-trained model already understands the target language, it will be easier to train, or fine-tune, it to do a new task.
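
To make the predict-and-adjust cycle concrete, here is a toy sketch of training a single weight with plain Python. It is purely illustrative; real models have millions of weights and use frameworks to compute the adjustments.

import random

# Toy data where the expected output is always 2 times the input (y = 2x).
data = [(1, 2), (2, 4), (3, 6)]

# The single weight starts out with a random value.
weight = random.random()

# Repeat the predict-and-adjust cycle many times.
for step in range(100):
    for x, expected in data:
        prediction = weight * x
        error = prediction - expected   # how far off the prediction is
        weight -= 0.01 * error * x      # adjust the weight to reduce the error

print(round(weight, 2))  # approaches 2.0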

There are a lot of pre-trained models to choose from. Since we are dealing with text, we need to pick a language model. There are three categories of language models to choose from, depending on your target task. Encoder models are good at understanding input text and are commonly used for tasks like sentence classification. Decoder models are good at generating output text and are commonly used for tasks like sentence completion. Encoder-decoder models are good at doing both and are commonly used for tasks such as summarizing a sentence.
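
As a quick illustration of the last two categories, the pipeline function shown below can also run a decoder model to generate text and an encoder-decoder model to summarize it. The gpt2 and t5-small checkpoints here are just examples I picked; any model in the same category would work.

from transformers import pipeline

# Decoder model: continues the input text.
generator = pipeline(task='text-generation', model='gpt2')
generator('The restaurant was', max_new_tokens=10)

# Encoder-decoder model: summarizes the input text.
summarizer = pipeline(task='summarization', model='t5-small')
summarizer('The wonton soup was hot and flavorful, the service was quick, and the prices were very reasonable for the portion sizes.')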

Since I want my model to classify reviews written in English, I need to choose an encoder model that has been pre-trained on English text, which I can then fine-tune for this task. One common option is the BERT model, which was pre-trained in a self-supervised manner to fill in missing, or masked, words in English sentences. Self-supervised training means the training data was not manually labeled; the labels are derived from the data itself. For example, given a complete sentence, a random word is removed and the model is trained to predict the missing word. I chose the distilbert-base-uncased model because it performs similarly to BERT but is smaller and faster.


model_name = 'distilbert/distilbert-base-uncased'

The pipeline function is used to run the task the model was trained to do. Here I am using the distilbert-base-uncased model to fill in the masked word, specified using [MASK], in the sentence.


from transformers import pipeline

unmasker = pipeline(task='fill-mask', model=model_name)
unmasker("My favorite sport is [MASK].")

[{'score': 0.08831855654716492,
  'token': 5742,
  'token_str': 'swimming',
  'sequence': 'my favorite sport is swimming.'},
 {'score': 0.07627089321613312,
  'token': 3455,
  'token_str': 'basketball',
  'sequence': 'my favorite sport is basketball.'},
 {'score': 0.06569057703018188,
  'token': 2374,
  'token_str': 'football',
  'sequence': 'my favorite sport is football.'},
 {'score': 0.06273501366376877,
  'token': 21383,
  'token_str': 'archery',
  'sequence': 'my favorite sport is archery.'},
 {'score': 0.06088290363550186,
  'token': 4715,
  'token_str': 'soccer',
  'sequence': 'my favorite sport is soccer.'}]

Preparing Datasets

Models cannot process raw text input directly. First the text needs to be converted to a format the model understands using a tokenizer. Generally, during tokenization the text is split into a sequence of tokens and each token is mapped to its numeric id from a vocabulary created during pre-training of the model. Often special tokens and metadata are added to provide additional information to the model, for example to denote the start or end of sentences. Since tokenization differs across models, it is important that the tokenizer used when running or fine-tuning a model matches the tokenizer used during pre-training of the model. When using the model with the pipeline function, the function automatically determines the correct tokenizer to use and tokenizes the input text before passing it to the model. But when fine-tuning the model, the text in the training data needs to be tokenized manually.
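
To see what a tokenizer produces, here is a short example using the same checkpoint as above; the exact token ids depend on the vocabulary of this particular tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a single sentence and look at the result.
encoded = tokenizer('The wonton soup is delicious!')
print(encoded['input_ids'])                                   # numeric ids from the vocabulary
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))  # the tokens, including the special [CLS] and [SEP] markers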

Datasets available for training models can be found on Hugging Face Hub. The load_dataset function is used to load the yelp_review_full dataset.


from datasets import load_dataset

original_datasets = load_dataset('yelp_review_full')
original_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

During training, the model should learn the features which make a review negative or positive so that it can later label reviews it has not seen before. But if a model is trained for too long or with too small a dataset, the model may become too specific to, or overfit, the data it was trained on. When this occurs, the model will only be able to accurately label the reviews it was trained on, not reviews it has not seen before. If the dataset used to train the model were also used to validate its accuracy, it would be difficult to determine whether a high accuracy was due to a well trained model or an overfitted one. But when a different dataset is used for validation, a well trained model's accuracy would still be high while an overfitted model's accuracy would be low, because it would fail to label the reviews it has not seen before. Therefore, to detect overfitting, it is important that the training dataset always be different from the validation dataset. The yelp_review_full dataset is already split into a train dataset and a test dataset which can be used for validating the model.
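
If a dataset did not already come with a separate test split, the train_test_split method of the Dataset class could be used to hold one out. A minimal sketch, not needed here since yelp_review_full is pre-split:

# Hold out 10% of a single split for validation (illustrative only).
held_out_datasets = original_datasets['train'].train_test_split(test_size=0.1, seed=42)
held_out_datasets['train']  # 90% of the reviews, for training
held_out_datasets['test']   # 10% of the reviews, for validation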


original_train_dataset = original_datasets['train']

original_train_dataset[1:3]

{'label': [1, 3],
 'text': ["Unfortunately, the frustration of being Dr. Goldberg's patient...",
  "Been going to Dr. Goldberg for over 10 years..."]}

original_train_dataset.features

{'label': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None),
 'text': Value(dtype='string', id=None)}

Each item in the original dataset has a text field containing the text of a review and a label field containing the number of stars, from one to five, given to the subject of the review. But I want my model to label reviews as positive or negative, not by number of stars. So for training purposes, I will assume that a review is positive if it has three or more stars and negative otherwise. Since the model learns to predict whatever labels appear in its training data, I need to convert the labels from number of stars to positive or negative before using this dataset for training. I also need to tokenize the review text using the model's tokenizer. Both can be done using the map method of the dataset.

The map method's parameters are the function containing the mapping logic and the batched boolean keyword argument specifying whether mapping should be done in batches. When batched is true, the function is passed a dictionary where the keys are the fields of the dataset and the values are batch-sized lists of the fields' values. The function also returns a dictionary. If a key in this dictionary matches a field in the dataset, the key's values update the field's values. Otherwise, a new field is added to the dataset for the key and its values.

I define a function named label_text which uses a Python list comprehension to update labels greater than or equal to two, which is equivalent to three or more stars because the label is zero-based, to one for positive and everything else to zero for negative. The label key is used to reference the updated labels in the returned dictionary so the map method I pass this function to knows to update the label field instead of adding a new field.


def label_text(dict):
  return {'label': [1 if label >= 2 else 0 for label in dict['label']]}

labeled_datasets = original_datasets.map(label_text, batched=True)
labeled_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

labeled_datasets['train'][1:3]

{'label': [0, 1],
 'text': ["Unfortunately, the frustration of being Dr. Goldberg's patient...",
  "Been going to Dr. Goldberg for over 10 years..."]}

To tokenize the text field in the datasets, I need to use the same tokenizer that was used during the pre-training of the model. I can get it by passing the model name to the from_pretrained method of the AutoTokenizer class. I define a function named tokenize_text which uses the tokenizer to tokenize the text. The truncation boolean keyword argument of the tokenizer is set to true so that it truncates any text that is longer than the maximum length supported by the model. The tokenizer returns a dictionary containing the input_ids and attention_mask keys, which are added to the datasets as new fields. The input_ids key references a list of lists containing the token ids of each text after it has been tokenized. During training, data will be passed to the model in batches. Since the model requires the lists of input_ids in each batch to be of the same length, they are padded to the length of the longest one in the batch using a padding token id, which the model ignores. To let the model know which indexes in input_ids to pay attention to, the attention_mask key references a list of lists with the same dimensions as input_ids, where each index contains a one if the corresponding index in input_ids contains a real token and a zero if it is padding.


from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_text(dict):
  return tokenizer(dict['text'], truncation=True)

tokenized_datasets = labeled_datasets.map(tokenize_text, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})

tokenized_datasets['train'][1:3]

{'label': [0, 1],
 'text': ["Unfortunately, the frustration...", "Been going to..."],
 'input_ids': [[101, 6854, 1010, 1996, ...], [101, 2042, 2183, 2000, ...]],
 'attention_mask': [[1, 1, 1, 1, ...], [1, 1, 1, 1, ...]]}

Training the Model

The tokenized datasets do not contain any padding yet. This is because the amount of padding for each item is dependent on the batch it belongs to and the batches are assembled later during training. The class responsible for assembling the batches and padding is named DataCollatorWithPadding. It is initialized with the tokenizer so it knows what token to use for padding.


from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
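
As a quick sanity check, assuming the tokenized_datasets prepared earlier, I can hand the data collator a couple of reviews and confirm that it pads them to the same length:

# Take two tokenized reviews and keep only the fields the collator needs.
samples = tokenized_datasets['train'].select(range(2))
samples = [{k: sample[k] for k in ('input_ids', 'attention_mask', 'label')} for sample in samples]

# The collator pads both reviews to the length of the longer one.
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})  # input_ids and attention_mask now share the same length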

A model contains layers of weights. The layers near the beginning are trained to identify general features of the input, while the layers near the end, also known as the head, are trained to use those features to do a specific task on the input. For example, in a language model, the layers near the beginning might identify the parts of speech in a sentence, while the head uses this information to classify the sentence. To train a pre-trained model to do a different task, I need to replace its current head with a new head that supports the new task. The weights of the new head are initialized to random values and will be optimized for the new task during training. Since I want to train the distilbert-base-uncased model to label reviews, a task also known as sequence classification, I need to get an instance of this model with a sequence classification head. I can do this by calling the from_pretrained method of the AutoModelForSequenceClassification class. The method parameters are the name of the pre-trained model, the number of labels the model will use, and the mappings between the numeric labels and their human readable names.


id2label = {0: 'negative', 1: 'positive'}
label2id = {'negative': 0, 'positive': 1}

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, id2label=id2label, label2id=label2id)

During training, the dataloader creates batches from the training dataset using the data collator. The model makes its predictions on a batch. A measure of how far off the predictions are from their expected values, also known as the loss, is calculated. To determine how to change the weights to lower the loss, the gradient of the loss is calculated. Then the weights are updated based on the gradient. This process is repeated until the model has seen all of the batches in the dataset, also known as an epoch. At the end of the epoch, performance metrics for the trained model are computed from the predictions it makes on the validation dataset. The Trainer is a high-level class which implements these steps when provided with the model, the training and validation datasets, the tokenizer, the data collator, and a function defining how to compute the metrics.
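
To make these steps more concrete, here is a rough sketch of a single training epoch written directly in PyTorch. It is for illustration only; the Trainer, configured below, performs the equivalent of this loop plus evaluation, checkpointing, and other conveniences. The column cleanup at the top is an assumption about what the model expects as input, which the Trainer normally handles automatically.

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

# Keep only the model-ready columns and rename 'label' to 'labels', the name the model expects.
train_data = tokenized_datasets['train'].remove_columns(['text']).rename_column('label', 'labels')
train_data.set_format('torch')

# The dataloader uses the data collator to pad each batch as it is assembled.
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=8, collate_fn=data_collator)
optimizer = AdamW(model.parameters(), lr=5e-5)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.train()

for batch in train_dataloader:                          # one pass over every batch is one epoch
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)                            # forward pass: predictions and loss
    loss = outputs.loss                                 # how far off the predictions are
    loss.backward()                                     # gradient of the loss with respect to the weights
    optimizer.step()                                    # adjust the weights
    optimizer.zero_grad()                               # reset the gradients for the next batch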

To determine how good my trained model is at labeling reviews, I configure the Trainer to report its accuracy at the end of each epoch, which is the percentage of reviews from the validation dataset that the model labeled correctly. To specify how to compute accuracy, I define a function named compute_metrics which I will provide to the Trainer. This function is passed a tuple containing the predictions and labels for the validation reviews and is expected to return a dictionary containing the accuracy key and its computed value. The first element of the tuple contains the predictions as a list of logits, where each logit is a list of raw scores, one per label; the higher the score, the more likely the label. The second element of the tuple contains the actual labels. The Evaluate library provides modules for computing various metrics. To use it to compute accuracy, I load the module for accuracy by name and call its compute method with the predicted and actual labels as arguments. Since the compute method expects the predictions argument to be a list of labels, not logits, I have to map each logit to the index containing the highest score. The compute method returns a dictionary containing the accuracy key and its computed value, which I return from the compute_metrics function.


import evaluate
import numpy as np

def compute_metrics(valid_set_preds):
    metric = evaluate.load('accuracy')
    logits, labels = valid_set_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

The Trainer is configured using TrainingArguments. Here I configure it to compute metrics for the model and upload the model to Model Hub under my_model_name at the end of each epoch. By default, the model is trained for three epochs.


my_model_name = 'distilbert-base-uncased-finetuned-yelp'

from transformers import TrainingArguments

training_args = TrainingArguments(
    my_model_name, eval_strategy='epoch', save_strategy='epoch', push_to_hub=True
)

The Trainer is created using the arguments prepared earlier.


from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

To train the model, call the train method of the Trainer. Ideally, loss should decrease and accuracy should increase after every epoch.


trainer.train()

After training is complete, the final version of the model needs to be manually pushed to the Model Hub using the push_to_hub method of the Trainer.


trainer.push_to_hub()

Using the Model

The model's page on Model Hub contains an inference widget for trying out the model on some input.

To use the model in code, create a pipeline using the model's fully qualified name and call it with the text to classify.


from transformers import pipeline

classifier = pipeline(task='text-classification', model='vinhanguyen/distilbert-base-uncased-finetuned-yelp')
classifier('The wonton soup is delicious!')

[{'label': 'positive', 'score': 0.9996329545974731}]

classifier('The fried rice is nasty.')

[{'label': 'negative', 'score': 0.971501886844635}]
