Welcome to my series A layman's guide to language models. In the previous blog, we discussed what a language model is, the types of language models, and some techniques that, by current standards, are the frontiers of fine-tuning language models.
In this blog, we will take a code-first approach to fine-tuning an open-source model called Llama 2. Once you get familiar with one language model, it is easy to jump to other language models.
Llama 2 is a family of pre-trained and fine-tuned large language models (LLMs) released by Meta AI in 2023, free of charge for research and commercial use.
There are several state-of-the-art models being released every day, and I would encourage you to look around the leaderboard by UC Berkeley, which at the time of writing this blog looks something like this:
As you can see, GPT-4 models are at the top. However, for the purpose of this tutorial I will be using the Llama 2 model.
Llama 2 is available on Hugging Face, a platform for finding machine learning models, datasets and research papers. Here, we are using a sharded version of Llama 2 instead of the original base model.
Sharding a language model simply means dividing the large model into smaller pieces called shards, where each shard is a self-contained file that can be loaded on its own.
The idea is that when you have a single GPU or limited RAM, you load the shards one by one, offloading the memory requirements.
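As a quick illustration, here is a minimal sketch of what that looks like with the transformers library (it uses the sharded checkpoint we load later in this blog; offload_folder is an optional extra, not something we rely on below):

from transformers import AutoModelForCausalLM

# device_map="auto" lets the accelerate backend place weights shard by shard
# across GPU, CPU and (optionally) disk, so the full model never has to fit
# in memory at once.
model = AutoModelForCausalLM.from_pretrained(
    "abhishek/llama-2-7b-hf-small-shards",
    device_map="auto",
    offload_folder="offload",  # spill layers that don't fit to disk
)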
Getting started
While there are several steps involved in fine-tuning a language model, the overall structure remains the same:
We gather the text data that we want to use to fine-tune our model. Often, this data is a collection of highly curated articles, conversations, or other character sequences specific to the use case. Researchers scrape data from publicly available sources, use high-quality conversations, or sometimes even use a language model to generate the training data.
After the data is gathered, we pick the language model that we want to fine-tune. A lot of open-source models are available to download locally, which helps if you want to tailor your model to a use case where security is a concern and confidentiality has to be maintained. You can also get models from the Hugging Face Hub via an API; more on that later.
Once we have the dataset ready, we use tokenizers to convert our dataset into chunks (tokens) that language models can interpret.
We then apply fine-tuning techniques, using the training data to update the weights and biases of our model until the fine-tuned model is ready. Several platforms and libraries, such as Hugging Face and LangChain, make this process a whole lot easier.
Take a minute to familiarize yourself with the Hugging Face interface, as we will be using it frequently throughout this blog, and sign up for a free Hugging Face account as well.
After you are done signing up, go to your profile and generate an access token using the Access Tokens tab.
These tokens are used for authentication whenever we use Hugging Face via an API!
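For scripts outside a notebook, one way to use the token is via the huggingface_hub login helper. HF_TOKEN here is an environment variable name I chose myself, not something Hugging Face sets for you:

import os
from huggingface_hub import login

# Reads the access token from an environment variable and authenticates
# this machine against the Hugging Face Hub.
login(token=os.environ["HF_TOKEN"])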
Generating a corpus for fine-tuning
In order to fine-tune the model, we need some really high-quality data. You can find several datasets on the Hugging Face Hub.
If you want to use a custom dataset, it is awfully easy to create one. I will be sharing how I created my own custom dataset consisting of scary stories gathered from Reddit.
I used the PRAW library in Python, which makes scraping Reddit data easy.
Creation of a custom dataset
PRAW Installation
PRAW can be installed using:
python3 -m venv ./venv
source ./venv/bin/activate
pip3 install praw

Then, in your Python script or notebook, import it:

import praw
After installing PRAW, use your Reddit credentials, namely your client ID and your client secret key, which can be found on Reddit's app preferences page.
Click the option to create an app and you should have your own client ID and secret key soon. Use these to connect to a Reddit instance.
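The initialization function below expects a reddit_credentials dictionary. Here is a sketch of what mine looks like; the key names match praw.Reddit's keyword arguments, and the values are placeholders you replace with your own:

# Placeholder credentials; fill these in from your Reddit app page.
reddit_credentials = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "user_agent": "scary-stories-scraper by u/your_username",
}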
def initialization():
    global reddit
    # reddit_credentials is the dict defined above
    reddit = praw.Reddit(**reddit_credentials)
    return reddit

# Call the initialization function to initialize the Reddit instance
initialization()
If the Reddit instance is not initialized properly, you might get the value True here instead. In that case, it is recommended to double-check your authentication credentials. Now that the instance has been initialized, let us begin scraping the data.
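A quick sanity check I like to run (my addition, not a required step): fetching a post forces a real API call, so bad credentials fail loudly here rather than later in the scraping loop:

# Fetch one hot post from r/nosleep; raises an exception if auth is broken.
print(next(reddit.subreddit("nosleep").hot(limit=1)).title)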
For my data, I want the top 350 posts from the r/nosleep subreddit.
I start off by running a simple for loop, iterating through each of these top posts and appending them to a CSV in the format {Human: Generate a story on {post title}. Assistant: {post}}.
import csv

subreddit = reddit.subreddit("nosleep")  # handle for r/nosleep
with open("reddit_stories.csv", "w", newline="", encoding="utf-8") as f:
    csv_writer = csv.writer(f)
    for post in subreddit.top(limit=350):
        # Create the text in the desired format
        text_info = f"##human: can you please generate a story on {post.title} \n\n ##assistant: {post.selftext}\n\n"
        # Write the data to the CSV file
        csv_writer.writerow([post.title, text_info, post.selftext])
Doing this should get you a CSV file that contains our training data. The reason I take the top 350 posts is that top posts are the ones popular among Reddit users, so they can be assumed to be of higher quality and correct syntax. This gives me my training data.
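To double-check the format before moving on, you can peek at the first row of the file (reddit_stories.csv is the file name chosen in the loop above):

import csv

# Print the start of the formatted ##human/##assistant text of the first post.
with open("reddit_stories.csv", newline="", encoding="utf-8") as f:
    first_row = next(csv.reader(f))
    print(first_row[1][:200])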
Loading the libraries
We will be using the transformers library from Hugging Face to do the actual fine-tuning.
Installing the prerequisites
# installing these libraries may take a while
!pip install torch==2.0.1
!pip install "transformers @ git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9"
!pip install peft
!pip install trl  # provides the SFTTrainer we use later
!pip install bitsandbytes==0.39.1
After installing the libraries, let us now import the necessary packages. This step is crucial, as it allows us to use the functions written by others so we don't have to reinvent the wheel.
Please don't think too much about this step; after a while it will come naturally.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, GenerationConfig
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
import warnings

warnings.filterwarnings("ignore")
Log in to your Hugging Face account, where you will be asked to enter your credentials:
from huggingface_hub import notebook_login

notebook_login()
After successfully logging in, you will see a message confirming the login.
Loading the model
Let us load the model using the AutoModelForCausalLM class, which is part of the auto classes of the transformers library.
Causal LMs are used in cases where we need to predict the next token in a sequence. This is useful for creating chatbots, summaries, articles, etc.
In my case, I want to use the fine-tuned model for text generation, so using this class makes the most sense. There are other classes available for seq2seq and masked LMs.
You can refer here for more info.
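For contrast, this is how the other task-specific auto classes are loaded; t5-small and bert-base-uncased are just common example checkpoints, not models we use in this blog:

from transformers import AutoModelForMaskedLM, AutoModelForSeq2SeqLM

# seq2seq models: encoder-decoder architectures for translation, summarization, etc.
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# masked LMs: fill-in-the-blank objectives, used for understanding tasks.
masked_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")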
model_name = "abhishek/llama-2-7b-hf-small-shards"  # HF repo with the sharded weights

# from_pretrained takes the repo id as its first positional argument
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)
new_model = "shreshtha2002/llama2_reddit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
For the model name, we provide the path to the HF repo where our model lives, which in our case is the sharded version of Llama 2, and we set trust_remote_code to True, which allows custom code from the repo to run.
We also name our new model and set up the tokenizer using the tokenizer class, which converts raw text into the kind of numerical data that the model understands.
We are using the auto classes for both the tokenizer and the model because, given the path to the model, they automatically infer the architecture and load the pretrained weights.
For most use cases, we only really need the auto classes, but if you are working with a model that is stored locally, there are other options available as well.
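To make the tokenizer's job concrete, here is a quick round trip on an arbitrary sentence:

# Raw text in, integer token IDs out...
sample = tokenizer("It was a dark and stormy night.")
print(sample["input_ids"])

# ...and decode brings the IDs back to text.
print(tokenizer.decode(sample["input_ids"]))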
Now that we have loaded our model, let us move on to the next step
Loading the dataset
After creating the dataset, I uploaded it to the Hugging Face Hub in a private repo, and that is the dataset I will be using to fine-tune the model today. I will load it using the load_dataset function.
This article by Hugging Face tells you how to upload your dataset to the Hub. Since this is fairly straightforward, I will move on to the next step, i.e. loading our dataset.
We import the datasets library and then define a variable that contains the path to our dataset (this can be found on the dataset card where you uploaded it). I use the load_dataset function to get the dataset.
from datasets import load_dataset
dataset_name = "shreshtha2002/training_reddit_data"
dataset = load_dataset(dataset_name, split="train[0:250]")
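It is worth printing what we just loaded; the column names come from the CSV, so confirm which one holds the formatted text before pointing the trainer at it:

# Shows the dataset's columns and row count, then the first example.
print(dataset)
print(dataset[0])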
Since this dataset contains stories by their original authors, out of respect for their privacy I will not be attaching screenshots of the actual fine-tuning.
Please note that this is a toy project and I don't wish to use it for commercial or any other purposes. When training a model, it is important to be mindful of where your data comes from and to use it ethically!
Setting up configurations
After loading the dataset, I store all of my configurations in a config.yaml file. This is what the structure of the file looks like.
I have two sections in there: one has my configuration for LoRA and the other has the training arguments. Let us understand what each of these means.
Training arguments | Concept |
learning_rate | The learning rate is a hyperparameter that essentially tells the model how fast or slow to learn. It is ideal to start with a small learning rate; we are taking 1e-3 here. |
weight_decay | Weight decay is a regularization technique to prevent overfitting. It does so by penalizing larger weights. The idea is that larger weights create a more biased model, so by penalizing them and encouraging smaller weights we get a more balanced model. |
evaluation_strategy | There are two evaluation strategies, namely "steps" and "epoch". With "steps", evaluation runs every fixed number of gradient-update steps, which is a beneficial strategy for larger datasets where a full epoch takes a long time. With "epoch", evaluation is done at the end of every epoch, i.e. after every full pass over the training data. |
num_train_epochs | The number of training epochs to perform. The default is 3. |
LoRA Parameter | Concept |
r | The rank of the low-rank update matrices; a smaller r means fewer trainable parameters. |
alpha | lora_alpha is basically a scaling factor that determines how strongly the LoRA update affects your base model. Research suggests taking small numbers such as 8, 16, 32 or 64. |
dropout | Dropout is a popular technique in deep learning to prevent overfitting, where we randomly zero out some units during the training phase. |
bias | Bias is nothing but a constant added along with the input. We are keeping the default setting of "none" here. |
config = """---
lora_config:
r: 64
lora_alpha: 16
lora_dropout: 0.05
bias: none
TrainingArguments:
output_dir="/trained/llama2output",
learning_rate=1e-3,
num_train_epochs=2,
weight_decay=0.01,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
"""
lora_config = LoraConfig(**config["lora_config"])
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Now that we have defined our training arguments and loaded the model and the dataset, it is time to move on to the actual training.
For training, we will be making use of the SFTTrainer class from Hugging Face's trl library.
Supervised fine-tuning is a crucial step in RLHF training, and SFTTrainer provides us with customization and is actually a better choice for smaller datasets.
For bigger datasets, you might want to look into the Trainer class. The SFTTrainer also has functionality to help us with parameter-efficient fine-tuning.
Training the model
I will be calling the SFTTrainer class and passing in all the arguments to begin my training. This might take anywhere from several hours to days depending on your compute and the type of data you are using.
Let us look at all the arguments we are going to pass to the SFTTrainer class and what each of them means.
Parameter | Concept |
model | the base model that we loaded previously |
train_dataset | the dataset on which we want to fine-tune our model |
peft_config | the LoRA arguments we configured in our YAML file |
dataset_text_field | the name of the field in the dataset that contains the text |
max_seq_length | the maximum sequence length of the input; please ensure this parameter matches your data |
tokenizer | the tokenizer that we configured |
args | the training arguments we configured in our YAML file |
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=TrainingArguments(**config["TrainingArguments"]),
    packing=False,
)
# Kick off the fine-tuning run, then save the adapter and tokenizer.
trainer.train()
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)
After the training has been completed, you should be able to see a folder in the output directory you configured. That is your fine-tuned model, ready to use. You can use it to generate all sorts of text!
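As a closing sketch, here is one way to generate with the fine-tuned adapter: reload the base model, attach the LoRA weights saved under new_model, and prompt it in the same ##human/##assistant format used for training (the story title is a made-up example):

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the base model and attach the saved LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "abhishek/llama-2-7b-hf-small-shards", device_map="auto"
)
tuned = PeftModel.from_pretrained(base, new_model)

# Prompt in the same format the training data used.
prompt = "##human: can you please generate a story on The Lighthouse \n\n ##assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
output = tuned.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(output[0], skip_special_tokens=True))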
I hope you found this post useful. If you have any questions, please don't hesitate to reach out via email or in the comments!