Fine-Tuning Language Models: A How-To Guide

Introduction

Large Language Models (LLMs) are revolutionizing the way applications leverage artificial intelligence. These models are trained on large datasets and can generate human-like text and respond to complex queries. Their ability to produce fluent, context-aware responses makes them suitable for a wide array of applications across industries like education, finance, and e-commerce.

While using pre-trained LLMs saves the considerable time and compute that training from scratch would require, their true potential lies in fine-tuning them for a specific domain. Fine-tuning involves further training an LLM on datasets that align with the desired use cases, which significantly improves the quality of the outputs the model generates.

This article explores the basics of fine-tuning LLMs, the process of preparing for fine-tuning, and a step-by-step guide to fine-tuning your own LLM. Finally, we will present an example of fine-tuning a model.

Basics of Fine-Tuning Large Language Models

Fine-tuning is the process of taking a pre-trained LLM and further training it on a more specialized dataset that is relevant to a particular task or domain. This allows the model to adapt its parameters and knowledge to the specific requirements of that task.

Fine-Tuning vs Training from Scratch

Training an entire model from scratch requires a large amount of data and computational resources, since its parameters are randomly initialized before training. Fine-tuning, on the other hand, takes a pre-trained model and trains it on a smaller, task-specific dataset. This lets the model leverage the general language understanding learned from the large dataset while adapting to the specifics of the new task.

Choosing the Right Dataset and Model

The success of fine-tuning largely depends on choosing the correct dataset and model. The dataset should reflect the specific characteristics of the target task so that it effectively teaches the model the new information. For example, if the task is to enhance customer interaction in an AI-powered chatbot for a financial service, a good dataset should include customer inquiries, transaction dialogues, and support interactions specific to financial topics.

Secondly, a suitable base model should be selected for fine-tuning. Consider factors like the model’s architecture, size, input/output specifications, which of its pre-trained layers can be reused or frozen, and its performance on related tasks. Selecting a model that aligns with the target task can streamline the fine-tuning process.

The Fine-Tuning Process: A Step-by-Step Guide

To begin fine-tuning an LLM, ensure that your environment is properly set up with the necessary tools. The key components of the process include:

Hardware: Large models may require one or more powerful GPUs. Ensuring that you have enough memory and processing power is crucial.

Software: Install a machine learning framework such as TensorFlow or PyTorch to work with models.

Data Preparation Tools: Tools are required to clean, transform, and augment the dataset that will be used for fine-tuning.

Development Environment: Choose a development environment that supports machine learning development such as Jupyter Notebook or an IDE.

Once you have met the prerequisites, you can begin the process of fine-tuning a large language model. Here is a step-by-step guide:

1. Data Preparation: Make sure your domain or task-specific dataset is ready to be fed to the model. At this stage, you need to think about tasks like cleaning the data, handling missing values, and formatting the text to align with the model’s input requirements. You can also consider data augmentation techniques to expand the dataset.

2. Load the Model: Load the pre-trained model along with its pre-trained weights and architecture. This model serves as the base starting point that will be fine-tuned for a specific task.

3. Adjusting Parameters: This step involves setting the hyper-parameters for fine-tuning, such as the learning rate, number of training epochs, and batch size, much like when training a new model. You can also freeze some layers of the base model and fine-tune only a subset (see the sketch after this list).

4. Training on New Data: Feed the prepared dataset into the model. During this stage, the model’s weights are updated to learn patterns and representations of the new domain.

5. Monitor: Keep track of the loss and accuracy during the training process. Based on these metrics, you might need to adjust the parameters set in step three to enhance model performance.

6. Evaluation: Evaluate the performance of the fine-tuned model on a held-out test set. Compare these results with the base model’s performance to understand how fine-tuning has improved the model’s performance.
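
As an illustration of the layer freezing mentioned in step three, the sketch below freezes all but the last two transformer blocks of GPT-2 while leaving the classification head trainable. This is a minimal sketch; the attribute path model.transformer.h is specific to GPT-2, and other architectures name their layers differently.

from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4)

# Freeze every transformer block except the last two; parameters with
# requires_grad=False receive no gradient updates during fine-tuning.
for block in model.transformer.h[:-2]:
    for param in block.parameters():
        param.requires_grad = False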

Example of Fine-Tuning a Large Language Model

Following the steps above, we will now walk through a concrete example of fine-tuning a large language model. In this example, we will fine-tune GPT-2 to perform sentiment analysis.

Prerequisites

We will be using Google Colab with a T4 GPU to perform our fine-tuning. Although this is not a particularly powerful environment, its resources are enough for our task. For the model and dataset, we will take advantage of Hugging Face’s Python packages. If you are following along, make sure to install datasets, accelerate, transformers, evaluate, numpy, and pandas in your environment.
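
In a Colab notebook, these packages can be installed in a single cell. The command below is a minimal example using the packages’ PyPI names:

!pip install datasets accelerate transformers evaluate numpy pandas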

The datasets library allows us to easily import ready-to-use datasets from the Hugging Face platform; a full list is available on the Hugging Face Hub. In addition, the transformers library allows us to import pre-trained models and tokenizers.

Step-by-Step Guide

1. Data Preparation: Let’s import the dataset and examine its format.


from datasets import load_dataset
import pandas as pd

# Load the poem_sentiment dataset from the Hugging Face Hub.
dataset = load_dataset("poem_sentiment")

# Convert the training split to a DataFrame and inspect the first rows.
df = pd.DataFrame(dataset['train'])
df.head()

The lines above load the “poem_sentiment” dataset from Hugging Face. We then convert the training split into a pandas DataFrame and display the first five entries for inspection. Here is the output:

   id  verse_text                                             label
    0  with pale blue berries. in these peaceful shadows...      1
    1  it flows so long as falls the rain,                       2
    2  and that is why, the lonesome day,                        0
    3  when i peruse the conquered fame of heroes, and...        3
    4  of inward strife for truth and liberty.                   3

As we can see, the dataset contains three columns: id, verse_text, and label. Our variables of interest are verse_text and label. A more detailed description is available on the dataset’s Hugging Face page.

We now know that every row pairs a verse_text with a label that defines its sentiment. According to the description on the dataset page, these sentiments are:

0 = negative

1 = positive

2 = no impact

3 = mixed (both negative and positive)

An important thing to note here is that we have four labels in total, which will be used in the next stage of the process.
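
Before tokenizing, a quick optional sanity check confirms the label count directly from the data frame we created above:

# Count the distinct sentiment labels in the training split.
print(df["label"].nunique())        # expected: 4
print(df["label"].value_counts())   # distribution across the four labels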

Next, we need to tokenize our data into a format that GPT-2 accepts.


from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token by default; reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(data):
    return tokenizer(data["verse_text"], padding="max_length", truncation=True)

# Tokenize every split of the dataset in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]

The transformers library includes a pre-trained tokenizer for GPT-2. Using it, we map the verse_text of every row in the dataset to its tokenized version. Note that GPT-2 does not define a padding token by default, which is why we reuse the end-of-sequence token for padding.

In the last two lines, we select the train and test splits, which are already defined in the original dataset.
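
To verify that tokenization worked, you can inspect a single example. The tokenizer adds input_ids and attention_mask fields alongside the original columns:

# Peek at the first tokenized training example (truncated for readability).
print(train_dataset[0]["input_ids"][:10])
print(train_dataset[0]["attention_mask"][:10])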

2. Load the Model: Again, we will be using transformers to help us load the GPT-2 model. This step is fairly straightforward.


from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4)

# GPT-2 defines no padding token; tell the model which id we use for padding
# so the classification head can find the last real token of each input.
model.config.pad_token_id = tokenizer.pad_token_id

Remember to adjust num_labels according to the number of labels in your dataset; in the previous stage, we found that ours has four distinct labels. We also point the model’s pad_token_id at the tokenizer’s padding token, since GPT-2 does not define one by default and the sequence classification head uses it to locate the last non-padding token of each input.

3. Adjusting Parameters: At this stage, we should set up the training process and the evaluation metrics.


from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="logs",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4
)

In the code above, we have specified the training arguments. Here is an overview of our parameters:

  • output_dir: The directory where the outputs (like model checkpoints and training logs) will be saved.

  • per_device_train_batch_size: The number of examples (batch size) to process on each device (like a GPU or CPU) during training. We have used a value of one since we are working with limited hardware resources.

  • per_device_eval_batch_size: Similar to per_device_train_batch_size, but this setting applies to evaluation phases.

  • gradient_accumulation_steps: Small batch sizes can produce noisy, less stable gradient estimates, which can hurt training. Gradient accumulation addresses this by summing the gradients computed over several mini-batches and applying the weight update only afterwards. In our case, gradients are accumulated over four mini-batches before each update, giving an effective batch size of four (see the illustrative loop below).
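
For intuition, the loop below shows what gradient accumulation does under the hood. The Trainer handles this automatically, so this sketch is illustrative only; train_dataloader stands in for any PyTorch DataLoader over the training data, and model is the one loaded in step two.

import torch

accumulation_steps = 4
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):  # hypothetical DataLoader
    # Scale each mini-batch loss so the accumulated update matches one
    # large-batch update.
    loss = model(**batch).loss / accumulation_steps
    loss.backward()  # gradients accumulate across mini-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # apply the combined update
        optimizer.zero_grad()  # reset for the next accumulation window

Next, we define our evaluation metric: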


import evaluate
import numpy as np

# Accuracy will be our evaluation metric.
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Select the highest-scoring class for each example.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Here, we define our accuracy metric for the model. The compute_metrics function takes the model’s raw logits, uses argmax to select the label with the highest score for each example, and compares these predicted labels with the true labels to compute accuracy.

Finally, we define our trainer:


from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

The Trainer takes parameters for the model to use, the training arguments (defined previously), the training dataset, the evaluation dataset, and a function to compute accuracy metrics.

4. Train the Model: Let’s begin the training process by calling the train() method on our Trainer object.


trainer.train()

This process may take some time depending on the size of the model, the size of the dataset, and the hardware being used. For us, it took roughly fifteen minutes.

5. Evaluation: Once the model has been fine-tuned, we can evaluate it using the compute_metrics function and the test split we prepared earlier.


trainer.evaluate()

The output should display how well your model is performing on the test dataset through metrics like loss and accuracy.
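
As a final check, we can run the fine-tuned model on a new verse. This is a minimal inference sketch; the input text is a made-up example, and it assumes the model and tokenizer from the previous steps are still in memory.

import torch

text = "the sun rises bright over quiet fields"  # hypothetical input verse
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Map the predicted class index back to the dataset's sentiment names.
label_names = {0: "negative", 1: "positive", 2: "no impact", 3: "mixed"}
print(label_names[int(logits.argmax(dim=-1))])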

Fine-Tuning: Best Practices

Choose the Right Dataset

Selecting an appropriate dataset is crucial for successful fine-tuning. It is essential to ensure that the dataset is large and diverse enough to prevent overfitting (when the model memorizes the training data instead of underlying patterns).

At the same time, the dataset should be representative of the task and contain enough examples to avoid underfitting, which occurs when the model fails to capture the underlying patterns, for example because it has too little data to learn from.

Lastly, it is important to strictly separate the training and validation/test datasets to avoid any overlapping or duplicated data, which can result in overly optimistic performance estimates.
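
As a small illustration using the libraries from the example above, exact duplicates can be removed before creating a held-out split. This sketch assumes a pandas DataFrame with a "verse_text" column like ours:

import pandas as pd
from datasets import load_dataset

df = pd.DataFrame(load_dataset("poem_sentiment")["train"])

# Drop exact duplicate texts, then hold out 20% of the remaining rows.
df = df.drop_duplicates(subset="verse_text")
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)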

Choose the Right Model

The choice of pre-trained base models can significantly impact the fine-tuning performance. It is essential to select a model that is well-suited for the task by considering factors like model size, architecture, and the type of data used for initial training. For example, larger models with more parameters may perform better but require substantially heavier computational resources. Therefore, it is important to evaluate the trade-offs between model size, performance, and computational requirements based on the task at hand.

Adjust Hyper-Parameters

Tuning hyper-parameters such as the learning rate, batch size, and number of epochs is also important for optimizing the model’s performance. Techniques like learning-rate scheduling, gradient accumulation, gradient clipping, and early stopping can improve convergence and prevent overfitting. Adjust these parameters with your hardware constraints in mind.
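
As a hedged sketch, several of these techniques map directly onto TrainingArguments fields; the values below are illustrative rather than tuned:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="logs",
    learning_rate=5e-5,             # initial learning rate
    num_train_epochs=3,
    lr_scheduler_type="linear",     # learning-rate scheduling
    warmup_steps=100,               # gradual ramp-up at the start of training
    max_grad_norm=1.0,              # gradient clipping
    gradient_accumulation_steps=4,  # gradient accumulation
)

Early stopping is available separately through transformers’ EarlyStoppingCallback, which requires evaluation to run during training.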

Validation

Regular evaluation of the model’s performance is essential for monitoring progress and catching problems like overfitting early. Choosing an evaluation metric appropriate to the task (for example, accuracy, F1-score, BLEU score, or ROUGE score) is crucial for accurately assessing the model’s performance in a specific domain.
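
For example, adding a macro-averaged F1 score alongside accuracy is a small change to the compute_metrics function defined earlier; this sketch assumes the same evaluate library:

import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Merge both metric dictionaries into one report.
    return {
        **accuracy.compute(predictions=predictions, references=labels),
        **f1.compute(predictions=predictions, references=labels, average="macro"),
    }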

Conclusion

In conclusion, fine-tuning pre-trained Large Language Models on targeted datasets unlocks their true potential for specific use cases. The process must be executed with careful data preparation, model selection, and hyper-parameter tuning. As natural language AI rapidly evolves, mastering fine-tuning will be essential for organizations aiming to capitalize on the power of LLMs, with fine-tuned models delivering enhanced capabilities and improved performance across diverse domains and industries.

References

https://www.turing.com/resources/finetuning-large-language-models#

https://www.scribbledata.io/blog/fine-tuning-large-language-models/

https://www.datacamp.com/tutorial/fine-tuning-large-language-models

https://huggingface.co
