Ollama Made Simple in 12 Hours: Hour 10 - Advanced Fine-Tuning Techniques

Lecture Notes:

1. Concepts

What is Fine-Tuning?

Fine-tuning refers to the process of taking a pre-trained model and adjusting its weights based on a smaller, task-specific dataset. This allows the model to adapt and perform better on specialized tasks (e.g., summarizing PDFs, extracting data from websites) without requiring the massive computational resources needed for training a model from scratch.

Advanced Fine-Tuning Techniques

Fine-tuning is an iterative process that can be enhanced with advanced strategies to optimize the model's performance. These strategies are designed to improve the model's efficiency and its ability to generalize on new, unseen data.

1. Learning Rate Schedulers

A learning rate scheduler adjusts the learning rate during training to prevent overshooting the optimal solution and to accelerate convergence.
Types:
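- Constant Learning Rate: Keeps the learning rate fixed for the whole run; this is the baseline the schedules below are compared against.
- Step Decay: Multiplies the learning rate by a fixed factor (for example 0.5) after a set number of epochs.
- Exponential Decay: Multiplies the learning rate by a constant factor slightly below 1 after every step or epoch, so it shrinks smoothly.
- Cosine Annealing: Lowers the learning rate along a half-cosine curve, keeping it relatively high early on to explore a wide range of potential solutions and low near the end to narrow down.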
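
The decay schedules above can be written as small formulas. The following is a minimal sketch in plain Python, not a library API; the function names and default values (drop, epochs_per_drop, gamma, min_lr) are assumptions chosen for illustration.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(base_lr, epoch, gamma=0.95):
    """Multiply the rate by `gamma` after every epoch."""
    return base_lr * gamma ** epoch

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Compare the three schedules at a few points of a 30-epoch run
for epoch in (0, 10, 20, 30):
    print(
        epoch,
        step_decay(1e-3, epoch),
        exponential_decay(1e-3, epoch),
        cosine_annealing(1e-3, epoch, total_epochs=30),
    )
```

In practice, PyTorch ships ready-made versions of these schedules in torch.optim.lr_scheduler, so the functions here are mainly useful for seeing the shape of each curve.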

2. Early Stopping

Stops training once the model's performance on a validation set has not improved for a set number of consecutive evaluations (the patience). This helps prevent overfitting and avoids spending time on training steps that no longer help.
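
The stopping rule can be sketched without any training framework. The EarlyStopper class, its patience and min_delta parameters, and the validation losses below are all made up for illustration.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` checks."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Made-up validation losses, one per epoch
val_losses = [0.90, 0.72, 0.65, 0.66, 0.67, 0.64]
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break

print(f"Stopped after epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

Note that the run stops before the last, slightly better value is ever seen. A larger patience trades extra training time for a lower chance of stopping too early.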

3. Data Augmentation

Expands the size and variety of your training dataset by applying transformations to the input data (e.g., rotating images, paraphrasing text). This allows the model to generalize better to new data.
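
For text, two very simple transformations are random word deletion and random word swapping. The sketch below is an assumption-laden illustration, not a method prescribed by these notes: the function names, the probability p, and the seed parameter are hypothetical, and both transformations can damage meaning, so augmented examples should be reviewed.

```python
import random

def random_deletion(text, p=0.1, seed=None):
    """Drop each word with probability p, always keeping at least one word."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else rng.choice(words)

def random_swap(text, n_swaps=1, seed=None):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

sentence = "the research paper discusses novel methods in machine learning"
print(random_deletion(sentence, p=0.2, seed=0))
print(random_swap(sentence, n_swaps=1, seed=0))
```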

4. Gradient Accumulation

A technique to simulate a larger batch size when limited by GPU memory. Gradients are accumulated over multiple smaller mini-batches before a single parameter update, so the effective batch size is the mini-batch size multiplied by the number of accumulation steps.
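
A small numeric check makes the idea concrete. The sketch below uses a hypothetical one-parameter model y = w * x with a mean squared error loss (both chosen only for illustration) and shows that accumulating scaled gradients from four mini-batches of two examples gives the same gradient as one batch of eight.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size  # 4 mini-batches of 2 examples

# Gradient computed on all 8 examples at once
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient built up from mini-batches, each scaled by 1 / accumulation_steps
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    end = start + micro_batch_size
    accumulated_grad += mse_grad(w, xs[start:end], ys[start:end]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The division by accumulation_steps matters: without it the accumulated gradient is a sum rather than an average, which acts like a learning rate that is accumulation_steps times larger.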

5. Model Regularization

Helps prevent the model from overfitting by adding a penalty to the loss function based on the complexity of the model.
Types:
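- L1/L2 Regularization: Adds a penalty on the size of the weights to the loss. L1 penalizes absolute values and tends to push some weights to exactly zero; L2 penalizes squared values and keeps all weights small.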
- Dropout: Randomly drops units (neurons) in the neural network during training to prevent overfitting.
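
Both ideas fit in a few lines of plain Python. This is a sketch for intuition only: the function names and the strength parameter lam are assumptions, and the dropout shown is the common "inverted" variant, which rescales the surviving units so their expected value is unchanged.

```python
import random

def l1_penalty(weights, lam):
    """lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)

def dropout(activations, p=0.1, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [1.0, -2.0, 3.0]
data_loss = 0.8  # made-up loss value from the training data
print("L1-regularized loss:", data_loss + l1_penalty(weights, lam=0.1))
print("L2-regularized loss:", data_loss + l2_penalty(weights, lam=0.1))
print("After dropout:", dropout([0.5, 1.0, 1.5, 2.0], p=0.5, rng=random.Random(0)))
```

Dropout is only active during training; at evaluation time the activations pass through unchanged, which is what the training=False branch models.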

6. Knowledge Distillation

Involves training a smaller model (student) to mimic the outputs of a larger, more powerful model (teacher). The student can often retain much of the teacher's performance while using fewer parameters and resources.
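
One common way to make the student mimic the teacher is to compare their output distributions after softening both with a temperature. The sketch below is a plain-Python illustration under that assumption; the function names, the example logits, and the temperature value are hypothetical. In practice this term is usually combined with the ordinary loss on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from teacher to student on softened outputs, scaled by temperature squared."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return kl * temperature ** 2

teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 4.0, 1.0]
print("Close student:", distillation_loss(close_student, teacher_logits))
print("Far student:  ", distillation_loss(far_student, teacher_logits))
```

A student whose outputs are close to the teacher's gets a small loss, and one that ranks the classes differently gets a larger one.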

2. Key Aspects of Advanced Fine-Tuning

Optimizing Hyperparameters
- Fine-tuning involves selecting the right hyperparameters, including learning rate, batch size, optimizer type, and number of epochs. Using techniques like grid search and random search can help find optimal settings.
Transfer Learning
- Fine-tuning a pre-trained model on a specific task takes advantage of the knowledge the model has already learned from a vast corpus of general data, reducing the amount of training required for task-specific adaptation.
Model Evaluation During Fine-Tuning
- It's crucial to evaluate the model at various stages of fine-tuning to ensure that improvements are being made and that the model is not overfitting.
Computational Resources
- Advanced fine-tuning techniques often require more computational resources. Optimizing the training process (e.g., through gradient accumulation or data parallelism) can help manage these resources effectively.
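
The grid search and random search mentioned under Optimizing Hyperparameters can be sketched in plain Python. The search space values and the validation_loss function are hypothetical stand-ins; in a real run, validation_loss would fine-tune the model with the given settings and return its loss on the validation set.

```python
import itertools
import random

search_space = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "batch_size": [4, 8],
    "num_epochs": [2, 3],
}

def grid_search(space):
    """Every combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*(space[k] for k in keys))]

def random_search(space, n_trials, seed=0):
    """A fixed number of randomly sampled combinations."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]

def validation_loss(config):
    """Hypothetical stand-in for a real fine-tuning run plus evaluation."""
    return abs(config["learning_rate"] - 5e-5) * 1e4 + 1.0 / config["batch_size"]

grid_configs = grid_search(search_space)
best = min(grid_configs, key=validation_loss)
print(f"Grid search tried {len(grid_configs)} configurations, best: {best}")

sampled = random_search(search_space, n_trials=5)
print("Best of 5 random trials:", min(sampled, key=validation_loss))
```

Grid search cost grows multiplicatively with every hyperparameter added, which is why random search with a fixed trial budget is often preferred when each trial is an expensive fine-tuning run.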

3. Implementation of Advanced Fine-Tuning Techniques

Prerequisites:

Pre-trained model (e.g., Llama).
A dataset for the specific task (e.g., PDF summarization, web scraping).
Python packages: transformers, torch, datasets, scikit-learn (imported as sklearn), nltk (used in the data augmentation example).

Learning Rate Scheduler

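A learning rate scheduler can be used to adjust the learning rate dynamically during training. The example below uses a linear schedule that lowers the learning rate from its initial value to zero over the course of training.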

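import torch
from torch.optim import AdamW  # replaces the deprecated transformers.AdamW
from torch.utils.data import DataLoader
from transformers import LlamaForCausalLM, get_linear_schedule_with_warmup

# Initialize model ("llama-7b" is a placeholder; substitute the checkpoint you have access to)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LlamaForCausalLM.from_pretrained("llama-7b").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Define scheduler
epochs = 3
# training_data: your tokenized dataset, yielding "input_ids" and "labels" tensors
train_dataloader = DataLoader(training_data, batch_size=8, shuffle=True)
num_training_steps = len(train_dataloader) * epochs
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# Training loop with learning rate scheduler
model.train()
for epoch in range(epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(inputs, labels=labels)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()  # Adjust learning rate after each optimizer step

    # Loss of the last batch in the epoch, not an epoch average
    print(f"Epoch {epoch + 1} completed with loss: {loss.item()}")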
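
To see what a linear schedule with warmup does to the learning rate, the multiplier it applies can be written out by hand. The function below is a sketch of that shape as I understand it, not the library's code; the name linear_warmup_decay and the step counts in the example are made up.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_decay(step, num_warmup_steps=100, num_training_steps=1000)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With num_warmup_steps=0, as in the training loop above, the warmup branch is never taken and the learning rate simply falls in a straight line from its initial value to zero.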

Early Stopping

Early stopping ensures that the training process halts once the model's performance on the validation set stops improving.

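from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch (named eval_strategy in newer transformers releases)
    save_strategy="epoch",        # Save the model checkpoint at the end of each epoch
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,           # Upper bound; early stopping may end training sooner
    weight_decay=0.01,
    load_best_model_at_end=True,   # Load the best model after training; expected by EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # Validation loss needs no custom compute_metrics function
    greater_is_better=False,       # Lower validation loss is better
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,
    # Stop after 2 consecutive evaluations without improvement
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

trainer.train()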

Data Augmentation for Text

In NLP tasks like summarization or question answering, data augmentation can involve techniques such as paraphrasing or back-translation to create new examples from existing ones. The example below uses a simpler technique, replacing words with WordNet synonyms.

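import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # One-time download of the WordNet corpus

def synonym_augmentation(text):
    words = text.split()
    augmented_words = []

    for word in words:
        # Collect lemma names across all synsets, skipping the word itself;
        # multi-word lemmas use underscores, so turn those back into spaces
        candidates = [
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()
            if lemma.name().lower() != word.lower()
        ]
        # Use the first real synonym, or keep the word if WordNet offers none.
        # Word sense is ignored here, so review the output before training on it.
        augmented_words.append(candidates[0] if candidates else word)

    return " ".join(augmented_words)

augmented_text = synonym_augmentation("The research paper discusses novel methods in machine learning.")
print(augmented_text)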

Gradient Accumulation

To simulate larger batch sizes without requiring large memory, you can accumulate gradients over several mini-batches before performing a gradient update.

from torch.utils.data import DataLoader

gradient_accumulation_steps = 4  # Accumulate gradients over 4 mini-batches

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    inputs = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)

    outputs = model(inputs, labels=labels)
    # Scale the loss so the accumulated gradient is an average, not a sum
    loss = outputs.loss / gradient_accumulation_steps
    loss.backward()

    # Perform optimization step every `gradient_accumulation_steps` steps,
    # and on the final batch so leftover gradients are not discarded
    if (step + 1) % gradient_accumulation_steps == 0 or (step + 1) == len(train_dataloader):
        optimizer.step()
        optimizer.zero_grad()

Model Regularization (Dropout)

Incorporating dropout in your model can help regularize the neural network and avoid overfitting.

from torch.optim import AdamW
from transformers import LlamaForCausalLM, LlamaConfig

# Define model configuration with dropout
config = LlamaConfig.from_pretrained("llama-7b")
# LlamaConfig exposes dropout on the attention probabilities only; the BERT-style
# attention_probs_dropout_prob and hidden_dropout_prob fields have no effect on Llama
config.attention_dropout = 0.1

# Load the pre-trained weights with the custom configuration
# (LlamaForCausalLM(config) alone would create a randomly initialized model)
model = LlamaForCausalLM.from_pretrained("llama-7b", config=config).to(device)

# Training the model
optimizer = AdamW(model.parameters(), lr=5e-5)
for epoch in range(epochs):
    model.train()  # Enables dropout; call model.eval() before validation
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

4. Real-Life Example: Fine-Tuning a Summarization Model

In this example, we will fine-tune a pre-trained Llama model for summarizing research papers. We will combine learning rate scheduling, early stopping, data augmentation, and dropout to keep training stable and limit overfitting.

Objective: Fine-tune a pre-trained Llama model on a summarization dataset.
Dataset: A collection of research papers and their corresponding summaries.
Techniques Applied:
- Learning Rate Scheduler: Gradual adjustment of the learning rate.
- Early Stopping: Halt training when the validation loss plateaus.
- Data Augmentation: Increase dataset diversity using paraphrasing.
- Model Regularization: Use dropout to prevent overfitting.

5. Summary

Advanced fine-tuning techniques can noticeably improve your model's performance, particularly on specialized tasks like summarizing PDFs or extracting data.
Key techniques like learning rate scheduling, early stopping, data augmentation, and gradient accumulation allow for more efficient training and better model generalization.
Model Regularization (e.g., dropout) and knowledge distillation can further help in making the model robust and efficient.

6. Homework/Practice

Fine-tune a pre-trained model for a custom task (e.g., summarization, Q&A, etc.).
Implement a learning rate scheduler and evaluate its impact on training.
Apply data augmentation and observe how it affects model generalization on unseen data.
Experiment with gradient accumulation for large batch sizes on a resource-limited machine.

This concludes the lecture on Advanced Fine-Tuning Techniques.

Saturday, 18 January 2025