Saturday, 18 January 2025

Hour 9 - Metrics & Evaluation for Fine-Tuned Models

Lecture Notes: 


1. Concepts

What are Model Metrics?

  • Metrics are quantitative measures used to evaluate the performance of a model. They help assess how well a model is performing, both during training and after fine-tuning.
  • Common metrics include accuracy, precision, recall, and F1-score, each capturing a different aspect of model performance.

Why are Metrics Important?

  • Metrics guide model improvements, provide insight into whether fine-tuning has been successful, and identify areas where the model can be further enhanced.
  • The evaluation process helps determine if the model can generalize well to new, unseen data or if it’s overfitting to the training data.

Key Types of Metrics for NLP Models:

  1. Accuracy: The percentage of correct predictions over the total predictions.
  2. Precision: The proportion of positive predictions that are actually correct.
  3. Recall: The proportion of actual positives that were correctly predicted.
  4. F1-Score: The harmonic mean of precision and recall, providing a balance between the two (a small worked example follows this list).
  5. BLEU (Bilingual Evaluation Understudy): Measures n-gram precision of generated text against reference text; used primarily for machine translation, and sometimes for other generation tasks such as summarization.
  6. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Used for evaluating the quality of summaries by comparing the overlap of n-grams between the model output and a reference summary.
  7. Loss Function: Measures how far the model’s predictions are from the actual output. During fine-tuning, the goal is to minimize the loss.
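To make the precision, recall, and F1 definitions above concrete, here is a small worked example with made-up counts (not from any real model):

# Toy confusion counts for a binary classifier (illustrative values only)
tp, fp, fn = 8, 2, 4          # true positives, false positives, false negatives

precision = tp / (tp + fp)     # 8 / 10 = 0.80
recall = tp / (tp + fn)        # 8 / 12 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.73

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")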

2. Key Aspects of Metrics & Evaluation

  1. Choosing the Right Metric:

    • The right metric depends on the task. For tasks like summarization, ROUGE and BLEU are often used. For classification tasks, accuracy, precision, and recall are more relevant.
  2. Overfitting vs. Generalization:

    • Overfitting happens when a model performs well on training data but poorly on new data. Evaluating the model on both training and validation data helps detect overfitting.
    • Generalization refers to how well the model performs on unseen data.
  3. Evaluation Datasets:

    • Use validation and test datasets to evaluate the model.
    • Validation Set: Used during training to tune hyperparameters and prevent overfitting.
    • Test Set: Used only after training to evaluate the final performance of the model.
  4. Model Evaluation Pipeline:

    • Step 1: Prepare the evaluation dataset.
    • Step 2: Generate predictions using the fine-tuned model.
    • Step 3: Compare the model’s predictions to the true outputs using metrics (a minimal sketch of this pipeline follows).
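To make these key aspects concrete, here is a minimal sketch of a validation/test split plus the three-step pipeline. The labeled examples, the evaluate helper, and metric_fn are placeholders for illustration, not part of any specific library:

from sklearn.model_selection import train_test_split

# Placeholder labeled examples: each pairs an input document with a reference output
labeled_data = [
    {"input": "document text 1", "output": "reference summary 1"},
    {"input": "document text 2", "output": "reference summary 2"},
    {"input": "document text 3", "output": "reference summary 3"},
    {"input": "document text 4", "output": "reference summary 4"},
]

# Hold out a test set that is only used after training and hyperparameter tuning are done
val_data, test_data = train_test_split(labeled_data, test_size=0.5, random_state=42)

def evaluate(model_fn, dataset, metric_fn):
    # Step 1: the evaluation dataset is already prepared (inputs + reference outputs)
    # Step 2: generate predictions with the fine-tuned model
    predictions = [ex_pred for ex_pred in (model_fn(ex["input"]) for ex in dataset)]
    references = [ex["output"] for ex in dataset]
    # Step 3: compare predictions to references using the chosen metric
    return metric_fn(references, predictions)

# e.g. evaluate(generate_summary, test_data, accuracy_score) once those are defined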

3. Implementation of Evaluation and Metrics

Prerequisites:

  • Fine-tuned model (e.g., a PDF summarization model).
  • Evaluation dataset (e.g., PDFs with summaries or web-scraped content).

Example: Evaluating a Fine-Tuned Model

Step 1: Set Up Metrics (Accuracy, Precision, Recall, F1, BLEU, ROUGE)

You’ll use scikit-learn for the traditional metrics (Accuracy, Precision, Recall, F1), rouge-score for ROUGE, and nltk for BLEU.

pip install scikit-learn rouge-score nltk
Step 2: Generate Predictions

Assume you have a fine-tuned model that generates summaries for research papers. Here’s how to evaluate it:

from transformers import LlamaForCausalLM, LlamaTokenizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from rouge_score import rouge_scorer

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("./fine_tuned_model")
tokenizer = LlamaTokenizer.from_pretrained("./fine_tuned_model")

# Define evaluation data (text of research papers and their corresponding summaries)
eval_data = [
    {"input": "Research paper content 1", "output": "Summary of paper 1"},
    {"input": "Research paper content 2", "output": "Summary of paper 2"},
    # Add more samples for evaluation
]

# Generate predictions using the fine-tuned model
def generate_summary(input_text):
    # Padding is unnecessary for a single sequence (and Llama tokenizers have no pad token by default)
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
    summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"],
                                 max_new_tokens=100, num_beams=2, early_stopping=True)
    # A causal LM echoes the prompt, so decode only the newly generated tokens
    new_tokens = summary_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

predictions = [generate_summary(d['input']) for d in eval_data]
actuals = [d['output'] for d in eval_data]
Step 3: Calculate Evaluation Metrics

Now, let’s calculate some key metrics.

  1. Accuracy:
    • Exact-match accuracy checks whether the generated summary is identical to the reference summary. For free-form generation this is almost always near zero, so treat it as a strict sanity check rather than a primary metric.
# Simple exact match accuracy
accuracy = accuracy_score(actuals, predictions)
print(f"Accuracy: {accuracy:.4f}")
  2. Precision, Recall, F1-Score:
    • These metrics apply when the outputs are discrete labels (binary or multi-class), not free-form text; the call below assumes label-style outputs (a toy classification example follows the snippet).
precision = precision_score(actuals, predictions, average="macro")
recall = recall_score(actuals, predictions, average="macro")
f1 = f1_score(actuals, predictions, average="macro")

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
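For clarity, the sketch below shows the label-style case these metrics are designed for, using toy binary labels (made-up values, just to illustrate the expected input format):

from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels: these metrics expect discrete classes, not free-form text
true_labels = [1, 0, 1, 1, 0, 1]
pred_labels = [1, 0, 0, 1, 0, 1]

print(f"Precision: {precision_score(true_labels, pred_labels):.4f}")
print(f"Recall: {recall_score(true_labels, pred_labels):.4f}")
print(f"F1-Score: {f1_score(true_labels, pred_labels):.4f}")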
  3. ROUGE Score:
    • ROUGE scores measure the n-gram overlap between the model’s generated summary and the reference summary (averaged scores are shown after the per-example snippet).
# Using the rouge_score library
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_scores = [scorer.score(actual, pred) for actual, pred in zip(actuals, predictions)]

# Print ROUGE scores
for i, score in enumerate(rouge_scores):
    print(f"Example {i+1}: ROUGE-1: {score['rouge1'].fmeasure:.4f}, ROUGE-2: {score['rouge2'].fmeasure:.4f}, ROUGE-L: {score['rougeL'].fmeasure:.4f}")
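To report a single number per ROUGE variant, you can also average the per-example F-measures from rouge_scores above (a simple aggregation, shown as a sketch):

# Average F-measure across all evaluation examples
avg_rouge1 = sum(s["rouge1"].fmeasure for s in rouge_scores) / len(rouge_scores)
avg_rouge2 = sum(s["rouge2"].fmeasure for s in rouge_scores) / len(rouge_scores)
avg_rougeL = sum(s["rougeL"].fmeasure for s in rouge_scores) / len(rouge_scores)
print(f"Avg ROUGE-1: {avg_rouge1:.4f}, Avg ROUGE-2: {avg_rouge2:.4f}, Avg ROUGE-L: {avg_rougeL:.4f}")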
  4. BLEU Score:
    • BLEU measures n-gram precision against the reference and is commonly used for machine translation and other text generation tasks (a corpus-level variant is shown after the snippet).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Compute BLEU score (smoothing avoids zero scores when higher-order n-grams have no matches)
smoothie = SmoothingFunction().method1
bleu_scores = [sentence_bleu([actual.split()], pred.split(), smoothing_function=smoothie)
               for actual, pred in zip(actuals, predictions)]
print(f"BLEU Score: {sum(bleu_scores) / len(bleu_scores):.4f}")
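If you prefer one corpus-level number instead of averaging sentence scores, nltk also provides corpus_bleu. The sketch below reuses actuals and predictions from above; whitespace tokenization is a simplification:

from nltk.translate.bleu_score import corpus_bleu

# corpus_bleu takes a list of reference lists and a list of hypotheses
references = [[actual.split()] for actual in actuals]
hypotheses = [pred.split() for pred in predictions]
print(f"Corpus BLEU: {corpus_bleu(references, hypotheses):.4f}")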
Step 4: Visualize the Results (Optional)

Visualizing the performance of your model can give you a clearer understanding of its strengths and weaknesses.

import matplotlib.pyplot as plt

# Example: Plot ROUGE Scores for different examples
rouge_1_scores = [score['rouge1'].fmeasure for score in rouge_scores]
rouge_2_scores = [score['rouge2'].fmeasure for score in rouge_scores]
rouge_L_scores = [score['rougeL'].fmeasure for score in rouge_scores]

plt.plot(rouge_1_scores, label='ROUGE-1')
plt.plot(rouge_2_scores, label='ROUGE-2')
plt.plot(rouge_L_scores, label='ROUGE-L')
plt.legend()
plt.title("ROUGE Scores for Each Example")
plt.xlabel("Example Index")
plt.ylabel("ROUGE Score")
plt.show()

4. Real-Life Example: Evaluating PDF Summarization

Consider a scenario where you have a fine-tuned model that summarizes research papers (PDFs).

  1. Objective: Evaluate how well the model generates summaries by comparing them to human-provided summaries.
  2. Metrics: Use accuracy, ROUGE, and BLEU to evaluate the performance. ROUGE suits summarization because it measures how much of the reference summary’s content is recovered (n-gram recall), while BLEU measures n-gram precision against the reference, giving a rough signal of how closely the phrasing matches.
Step 1: Scrape and Label PDF Data

Use PyPDF2 to extract text from PDFs and manually label a few examples with reference summaries.

import PyPDF2

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # extract_text() can return None for pages with no extractable text
            text += (page.extract_text() or "") + "\n"
        return text

pdf_text = extract_text_from_pdf("sample_paper.pdf")
print(pdf_text[:500])  # Print first 500 characters of extracted text
Step 2: Fine-Tune and Evaluate Model

Fine-tune the model with PDF data and evaluate the performance using the metrics described above.
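Here is a rough sketch of how the extracted PDF text could feed the evaluation pipeline from Section 3; the reference summary string is a placeholder you would write or collect yourself:

from rouge_score import rouge_scorer

# Build a single evaluation example from the extracted PDF text and a human-written reference
eval_data = [
    {"input": pdf_text, "output": "Human-written reference summary of sample_paper.pdf"},  # placeholder reference
]

# Reuse generate_summary() from Section 3 to produce model summaries
predictions = [generate_summary(d["input"]) for d in eval_data]
actuals = [d["output"] for d in eval_data]

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_scores = [scorer.score(actual, pred) for actual, pred in zip(actuals, predictions)]
print(f"ROUGE-1 F: {rouge_scores[0]['rouge1'].fmeasure:.4f}")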


5. Code Summary

from transformers import LlamaForCausalLM, LlamaTokenizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import matplotlib.pyplot as plt

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("./fine_tuned_model")
tokenizer = LlamaTokenizer.from_pretrained("./fine_tuned_model")

# Example: Evaluation Data
eval_data = [{"input": "Research paper content 1", "output": "Summary of paper 1"}]

# Generate predictions (generate_summary as defined in Section 3)
predictions = [generate_summary(d['input']) for d in eval_data]
actuals = [d['output'] for d in eval_data]

# Evaluate with Accuracy, Precision, Recall, F1-Score
accuracy = accuracy_score(actuals, predictions)
precision = precision_score(actuals, predictions, average="macro")
recall = recall_score(actuals, predictions, average="macro")
f1 = f1_score(actuals, predictions, average="macro")

# ROUGE Scores
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_scores = [scorer.score(actual, pred) for actual, pred in zip(actuals, predictions)]

# BLEU Score
bleu_scores = [sentence_bleu([actual.split()], pred.split()) for actual, pred in zip(actuals, predictions)]

# Visualization of ROUGE Scores
plt.plot([score['rouge1'].fmeasure for score in rouge_scores], label='ROUGE-1')
plt.legend()
plt.show()


6. Summary

  • Concepts Covered: Metrics for evaluation, including accuracy, precision, recall, F1-score, ROUGE, BLEU, and loss functions.
  • Key Aspects: Evaluation checks that a model generalizes to new data rather than overfitting; different metrics suit different tasks (summarization vs. classification).
  • Real-Life Example: Evaluating a PDF summarization model using ROUGE, BLEU, and traditional metrics.
  • Implementation: Code for calculating the various metrics in Python with sklearn, rouge-score, and nltk.

7. Homework/Practice

  1. Evaluate your fine-tuned model using the above metrics on a new test set of PDFs or web-scraped data.
  2. Experiment with different evaluation strategies, such as using multiple BLEU references (see the sketch below) or adjusting the length of the generated summaries.
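As a starting point for item 2, sentence_bleu accepts several reference tokenizations for the same prediction; the strings below are hypothetical, just to show the format:

from nltk.translate.bleu_score import sentence_bleu

prediction = "the model summarizes the paper"
references = [
    "the model summarizes the research paper".split(),
    "this model produces a summary of the paper".split(),  # hypothetical alternative reference
]
score = sentence_bleu(references, prediction.split())
print(f"Multi-reference BLEU: {score:.4f}")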
