A beginner's guide to fine-tuning language models without a GPU

Welcome to my series A layman's guide to language models. In the previous blog, we discussed what a language model is, the types of language models, and some techniques which, by current standards, are at the frontier of fine-tuning language models.

In this blog, we will take a code-first approach to fine-tuning an open source model called Llama 2. Once you get familiar with one language model, it is easy to jump to other language models.

Llama 2 is a family of pre-trained and fine-tuned large language models (LLMs) released by Meta AI in 2023, free of charge for research and commercial use.

There are state-of-the-art models being released every day, and I would encourage you to look around the leaderboard maintained by UC Berkeley.

At the time of writing this blog, GPT-4 models are at the top of that leaderboard. However, for the purpose of this tutorial I will be using the Llama 2 model.

Llama 2 is available on Hugging Face, a platform for finding machine learning models, datasets, and research papers. Here, we are using a sharded version of Llama 2 instead of the original base model.

Sharding a language model simply means dividing the large model into smaller pieces called shards, where each shard is a self-contained file of weights.

The idea is that when you have a single GPU or limited RAM, the shards can be loaded one by one, keeping the peak memory requirements low.
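
To make this concrete, here is one way to peek at what a sharded checkpoint looks like on the Hub; the repo is the one we use later in this post, and the file names in the comment are illustrative:

from huggingface_hub import list_repo_files

# A sharded checkpoint is just a set of weight files plus an index file
# that maps each tensor to the shard it lives in
for f in list_repo_files("abhishek/llama-2-7b-hf-small-shards"):
    print(f)  # e.g. pytorch_model-00001-of-000NN.bin, pytorch_model.bin.index.json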

Getting started

While there are several steps involved in fine-tuning a language model, the general structure remains the same:

  1. We gather the text data that we want to use to fine-tune our model. Often, this data is a collection of highly curated articles, conversations, or other character sequences specific to the use case. Researchers scrape publicly available sources, use high quality conversations, or sometimes even use a language model to generate the training data.

  2. After the data is gathered, we pick the language model that we want to fine-tune. A lot of open source models are available to download locally if you want to tailor your model to a use case where security is a concern and confidentiality has to be maintained. You can also get models from the Hugging Face Hub via an API; more on that later.

  3. Once we have the dataset ready, we use tokenizers to convert our dataset into chunks that are interpretable by language models (see the short example after this list).

  4. We then apply fine-tuning techniques, using the training data to update the weights and biases of our model until the fine-tuned model is ready. There are several platforms and libraries, such as Hugging Face and LangChain, that make this process a whole lot easier.
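
To illustrate step 3, here is a minimal sketch of what a tokenizer does, assuming the same sharded Llama 2 repo we load later in this post:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("abhishek/llama-2-7b-hf-small-shards")
ids = tokenizer("Generate a story on a haunted house")["input_ids"]
print(ids)                                   # the token IDs the model actually sees
print(tokenizer.convert_ids_to_tokens(ids))  # the corresponding sub-word chunks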

Take a minute to familiarize yourself with the Hugging Face interface, as we will be using it frequently throughout this blog, and sign up for a free Hugging Face account as well.

After you are done signing up, go to your profile and generate an access token using the Access Tokens tab.

These tokens are used for authentication whenever we use Hugging Face via an API!
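
If you are working in a plain Python script rather than a notebook, one way to authenticate with that token is the login helper (a sketch; the token value is a placeholder):

from huggingface_hub import login

# Paste the access token you generated in the Access Tokens tab
login(token="hf_your_token_here")

For notebooks, we will use notebook_login later in this post.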


Generating a corpus for fine-tuning

In order to fine-tune the model, we need some really high quality data. You can find several datasets on the Hugging Face Hub.

If you want to use custom datasets, it is surprisingly easy to create one. I will share how I created my own custom dataset consisting of scary stories gathered from Reddit.

I used the PRAW library in Python, which makes scraping Reddit data easy.

Creation of a custom dataset

PRAW Installation

PRAW can be installed using:

python3 -m venv ./venv
source ./venv/bin/activate
pip3 install praw

Then, in Python, import the packages we will need:

import praw
import os

After installing PRAW, you need your Reddit credentials, namely your client ID and your client secret key, which can be found on Reddit's app preferences page (reddit.com/prefs/apps).

Create an app there and you should have your own client ID and secret key soon. Use these to connect to the Reddit instance.
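
The credentials can be collected into a dictionary like the one below; the values are placeholders for your own keys:

# Substitute the values you obtained from reddit.com/prefs/apps
reddit_credentials = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "user_agent": "story-scraper by u/YOUR_USERNAME",
}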

def initialization():
    global reddit
    reddit = praw.Reddit(**reddit_credentials)
    return reddit

# Call the initialization function to create the Reddit instance
initialization()
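
To verify the connection, you can print the instance's read-only flag (a quick sketch; for purely read-only scraping, True can still be fine as long as no exception is raised):

# False indicates a fully authenticated session; an exception at this point
# usually means the client ID or secret is wrong
print(reddit.read_only)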

If the Reddit instance is not initialized properly, you might get the value True here when you expected an authenticated session. In that case, it is recommended to check your authentication credentials carefully. Now that the instance has been initialized, let us begin scraping the data.

For my data, I want the top 350 posts from the r/nosleep subreddit.

I start off by running a simple for loop, iterating through each of these top posts and appending them to a CSV in the format {Human: Generate a story on {post title}. Assistant: {post}}.

import csv

subreddit = reddit.subreddit("nosleep")
# Open an output file (the name is arbitrary; we reuse it below)
with open("reddit_stories.csv", "w", newline="", encoding="utf-8") as f:
    csv_writer = csv.writer(f)
    for post in subreddit.top(limit=350):
        # Create the text in the desired format
        text_info = f"##human: can you please generate a story on {post.title} \n\n ##assistant: {post.selftext}\n\n"
        # Write the title, formatted text, and raw story to the CSV file
        csv_writer.writerow([post.title, text_info, post.selftext])

Doing this should get you a CSV file that contains our training data. The reason I take the top 350 posts is that top posts are the ones that are popular among Reddit users and can be assumed to be of higher quality and correct syntax. This gives me my training data.
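
As a quick sanity check (my own addition; the file name matches the one used above), you can count the scraped rows before moving on:

import csv

with open("reddit_stories.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(len(rows), "rows scraped")  # should be close to 350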


Loading the libraries

We will be using the transformers library from Hugging Face to do the actual fine-tuning.

Installing the prerequisites

  •   # installing these libraries may take a while
      !pip install torch==2.0.1
      !pip install git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9
      !pip install peft
      !pip install bitsandbytes==0.39.1
      # datasets, trl and accelerate are needed for the imports and device_map below
      !pip install datasets trl accelerate

After installing the libraries, let us now import the necessary packages. This step is crucial, as it allows us to use functions written by others so we don't have to reinvent the wheel.

  • Please don't think too much about this step; after a while it will come naturally.

      import torch
      from datasets import load_dataset
      from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, GenerationConfig
      from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
      from trl import SFTTrainer
      import warnings
      warnings.filterwarnings("ignore")
    

    Log in to your Hugging Face account using notebook_login, where you will be asked to enter the access token you generated earlier:

      from huggingface_hub import notebook_login
      notebook_login()
    

After successfully logging in, you will see a message confirming that your token has been saved.


Loading the model

Let us load the model using the AutoModelForCausalLM class, which is part of the auto classes of the transformers library.

Causal language models are used in cases where we need to predict the next token in a sequence. This is useful for creating chatbots, summaries, articles, and so on.

In my case, I want to use the fine-tuned model for text generation, hence using this class makes the most sense. There are other classes available for seq2seq and masked language models.

You can refer to the transformers documentation on auto classes for more info.

model_name = "abhishek/llama-2-7b-hf-small-shards"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)
new_model = "shreshtha2002/llama2_reddit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

We pass in the model name, the path to the Hugging Face repo where our model lives, which in our case is the sharded version of Llama 2, and we set trust_remote_code to True, which allows the repo to run custom code.

We also name our new model, and we set up the tokenizer using the tokenizer class, which converts raw text data into the kind of numerical data that the model understands.

We are using the auto classes for both the tokenizer and the model because, given the path to the model, they automatically infer the architecture and load the pretrained weights.

For most use cases, we only really need the auto classes, but if you are working with a model that is stored locally, there are other options available as well.
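
Before moving on, here is a quick smoke test (my own sketch; the prompt and generation settings are illustrative) to confirm that the causal LM really does predict next tokens:

# Encode a prompt, generate a short continuation, and decode it back to text
inputs = tokenizer("Once upon a midnight", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))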

Now that we have loaded our model, let us move on to the next step


Loading the dataset

After creating the dataset, I uploaded it to the Hugging Face Hub in a private repo, and that is the dataset I will be using to fine-tune the model today. I will load it using the load_dataset function.

  • The Hugging Face documentation explains how to upload your dataset to the Hub. Since this is fairly straightforward (a one-liner is sketched below), I will be moving on to the next step, i.e. loading our dataset.
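
For reference, one way to do the upload (a sketch; the file and repo names match the ones used in this post):

from datasets import load_dataset

# Load the scraped CSV and push it to a private repo on the Hub
ds = load_dataset("csv", data_files="reddit_stories.csv")
ds.push_to_hub("shreshtha2002/training_reddit_data", private=True)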

We import the datasets library and then define a variable which contains the path to our dataset (this can be found on the dataset card of the repo where you uploaded it). I use the load_dataset function to get the dataset, taking only the first 250 rows of the train split.

from datasets import load_dataset

dataset_name = "shreshtha2002/training_reddit_data"
dataset = load_dataset(dataset_name, split="train[0:250]")
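
It is worth printing the loaded object to confirm the number of rows and the column names; the column containing the formatted prompt is the one we will later pass to the trainer as dataset_text_field:

print(dataset)               # shows the number of rows and the column names
print(dataset.column_names)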

Since this dataset contains stories by their original authors, out of respect for their privacy I will not be attaching screenshots of the actual fine-tuning.

Please note that this is a toy project, and I don't wish to use it for commercial or any other purposes. When training a model, it is important to be mindful of where your data comes from and to use the data ethically!


Setting up configurations

After loading the dataset, I store all of my configuration in YAML (inlined below as a string and parsed with yaml.safe_load). This is what the structure looks like.

It has two sections. One holds my configuration for LoRA and the other holds the training arguments. Let us understand what each of these means.

Training arguments:

  • learning rate: the learning rate is a hyperparameter which essentially tells the model how fast or slow to learn. It is ideal to start with a small learning rate, hence we are taking 1e-3 here.
  • weight decay: weight decay is a regularization technique to prevent overfitting. It does so by penalizing larger weights. The idea is that larger weights create a more biased model, so by penalizing larger weights and encouraging smaller ones, we create a model that is more balanced.
  • evaluation strategy: there are two evaluation strategies, namely steps and epoch. With steps, the training data is divided into batches and evaluation runs every fixed number of gradient updates, which is a beneficial strategy for larger datasets. With epoch, evaluation is done at the end of every epoch, i.e. after each full pass over the training data.
  • num_train_epochs: the number of training epochs to perform. The default is 3.
LoRA parameters:

  • r: stands for the rank of the low-rank update matrices; a smaller number means fewer trainable parameters.
  • alpha: LoRA alpha is basically a scaling factor which determines how strongly the LoRA update affects your base model. Research suggests taking small numbers such as 8, 16, 32 or 64.
  • dropout: dropout is a popular technique in deep learning to prevent overfitting, where we randomly zero out some units during the training phase.
  • bias: controls which bias terms, if any, are trained alongside the LoRA weights; we are keeping the default setting here, "none".
        config = """---
        lora_config:
          r: 64
          lora_alpha: 16
          lora_dropout: 0.05

          bias: none
        TrainingArguments:
            output_dir="/trained/llama2output",
            learning_rate=1e-3,
            num_train_epochs=2,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
        )
        """
        lora_config = LoraConfig(**config["lora_config"])
        model = prepare_model_for_kbit_training(model)
        model = get_peft_model(model, peft_config)
        model.print_trainable_parameters()

Now that we have defined our training arguments and loaded the model and the dataset, it is time to move on to the actual training.

For training, we will be making use of the SFTTrainer class from Hugging Face's TRL library.

Supervised fine-tuning is a crucial step in RLHF training, and SFTTrainer provides us with customization and is actually a better choice for smaller datasets.

For bigger datasets, you might want to look into the Trainer class. The SFTTrainer has functionality to help us with parameter-efficient fine-tuning.


Training the model

I will be calling the SFTTrainer class and passing in all the arguments to begin my training. This might take anywhere from several hours to days depending on your compute and the type of data that you are using.

Let us look at all the arguments we are going to pass to the SFTTrainer class and what each of them means:

Parameters:

  • model: we supply the base model that we loaded previously to this argument
  • train_dataset: the dataset on which we want to fine-tune our model
  • peft_config: the LoRA arguments which we configured in our YAML
  • dataset_text_field: the name of the field in the dataset which contains the text
  • max_seq_length: the maximum sequence length of the input; please ensure that this parameter is correct
  • tokenizer: the tokenizer that we configured
  • args: the training arguments which we configured in our YAML
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=TrainingArguments(**config["TrainingArguments"]),
    packing=False,
)
trainer.train()  # the actual fine-tuning; this is the long-running step
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

After the training has been completed, you should see a folder in the output directory you specified. That is your fine-tuned model, ready to use. You can use it to generate all sorts of stories!
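
Here is a hedged sketch of how you might load the saved adapter back for generation; the names reuse variables from earlier in this post, and the prompt and generation settings are illustrative:

from peft import PeftModel

# Reload the base model, then attach the fine-tuned LoRA weights on top
base = AutoModelForCausalLM.from_pretrained(
    "abhishek/llama-2-7b-hf-small-shards", device_map="auto"
)
finetuned = PeftModel.from_pretrained(base, new_model)

prompt = "##human: can you please generate a story on the old lighthouse \n\n ##assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
outputs = finetuned.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))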

I hope you found this post useful. If you have any questions, please don't hesitate to reach out via email or in the comments!