Lecture Notes:
1. Concepts
What are Model Metrics?
- Metrics are quantitative measures used to evaluate the performance of a model. They help assess how well a model is performing, both during training and after fine-tuning.
- They capture different aspects of performance, such as accuracy, precision, recall, and F1-score.
Why are Metrics Important?
- Metrics guide model improvements, provide insight into whether fine-tuning has been successful, and identify areas where the model can be further enhanced.
- The evaluation process helps determine if the model can generalize well to new, unseen data or if it’s overfitting to the training data.
Key Types of Metrics for NLP Models:
- Accuracy: The percentage of correct predictions over the total predictions.
- Precision: The proportion of positive predictions that are actually correct.
- Recall: The proportion of actual positives that were correctly predicted.
- F1-Score: The harmonic mean of precision and recall, providing a balance between the two.
- BLEU (Bilingual Evaluation Understudy): Measures n-gram precision of generated text against reference text (with a brevity penalty); used primarily for machine translation, and sometimes for other generation tasks such as summarization.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Used for evaluating the quality of summaries by comparing the overlap of n-grams between the model output and a reference summary.
- Loss Function: Measures how far the model’s predictions are from the actual output. During fine-tuning, the goal is to minimize the loss.
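To make the first four metrics concrete, here is a minimal sketch (with made-up binary labels) that computes precision, recall, and F1 by hand and cross-checks the result with scikit-learn:
from sklearn.metrics import precision_score, recall_score, f1_score
# Toy ground-truth and predicted labels (1 = positive, 0 = negative); purely illustrative
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Count true positives, false positives, and false negatives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
precision = tp / (tp + fp)  # correct positive predictions / all positive predictions
recall = tp / (tp + fn)     # correct positive predictions / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
# The same values via scikit-learn
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))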
2. Key Aspects of Metrics & Evaluation
- Choosing the Right Metric:
- The right metric depends on the task. For tasks like summarization, ROUGE and BLEU are often used. For classification tasks, accuracy, precision, and recall are more relevant.
- Overfitting vs. Generalization:
- Overfitting happens when a model performs well on training data but poorly on new data. Evaluating the model on both training and validation data helps detect overfitting.
- Generalization refers to how well the model performs on unseen data.
- Evaluation Datasets:
- Use validation and test datasets to evaluate the model.
- Validation Set: Used during training to tune hyperparameters and prevent overfitting.
- Test Set: Used only after training to evaluate the final performance of the model.
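Putting the validation/test distinction above into code, here is a minimal sketch using scikit-learn's train_test_split (the 80/10/10 split and the placeholder examples are assumptions, not a requirement):
from sklearn.model_selection import train_test_split
# Hypothetical list of (input, summary) examples
examples = [{"input": f"paper {i}", "output": f"summary {i}"} for i in range(100)]
# Hold out 20% of the data, then split the holdout evenly into validation and test sets
train_data, holdout = train_test_split(examples, test_size=0.2, random_state=42)
val_data, test_data = train_test_split(holdout, test_size=0.5, random_state=42)
print(len(train_data), len(val_data), len(test_data))  # 80 10 10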
- Model Evaluation Pipeline:
- Step 1: Prepare the evaluation dataset.
- Step 2: Generate predictions using the fine-tuned model.
- Step 3: Compare the model’s predictions to the true outputs using metrics.
3. Implementation of Evaluation and Metrics
Prerequisites:
- Fine-tuned model (e.g., a PDF summarization model).
- Evaluation dataset (e.g., PDFs with summaries or web-scraped content).
Example: Evaluating a Fine-Tuned Model
Step 1: Set Up Metrics (Accuracy, Precision, Recall, F1, BLEU, ROUGE)
You’ll use scikit-learn for the traditional metrics (accuracy, precision, recall, F1), rouge-score for ROUGE, and nltk for BLEU.
pip install scikit-learn rouge-score nltk
Step 2: Generate Predictions
Assume you have a fine-tuned model that generates summaries for research papers. Here’s how to evaluate it:
from transformers import LlamaForCausalLM, LlamaTokenizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from rouge_score import rouge_scorer
# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("./fine_tuned_model")
tokenizer = LlamaTokenizer.from_pretrained("./fine_tuned_model")
# Define evaluation data (text of research papers and their corresponding summaries)
eval_data = [
    {"input": "Research paper content 1", "output": "Summary of paper 1"},
    {"input": "Research paper content 2", "output": "Summary of paper 2"},
    # Add more samples for evaluation
]
# Generate predictions using the fine-tuned model
def generate_summary(input_text):
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
    summary_ids = model.generate(inputs["input_ids"], max_new_tokens=100, num_beams=2, early_stopping=True)
    # For a causal LM the generated ids begin with the prompt, so slice it off before decoding
    summary = tokenizer.decode(summary_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return summary
predictions = [generate_summary(d['input']) for d in eval_data]
actuals = [d['output'] for d in eval_data]
Step 3: Calculate Evaluation Metrics
Now, let’s calculate some key metrics.
- Accuracy:
- Check whether the generated summary exactly matches the reference summary; for free-form text this is a very strict criterion, so expect values near zero.
# Simple exact match accuracy
accuracy = accuracy_score(actuals, predictions)
print(f"Accuracy: {accuracy:.4f}")
- Precision, Recall, F1-Score:
- These apply when the outputs are class labels (binary or multi-class) rather than free text; in that case compute precision, recall, and F1 with a suitable averaging scheme.
precision = precision_score(actuals, predictions, average="macro")
recall = recall_score(actuals, predictions, average="macro")
f1 = f1_score(actuals, predictions, average="macro")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
- ROUGE Score:
- ROUGE scores compare the overlap between the model’s generated summary and the reference summary.
# Using the rouge_score library
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_scores = [scorer.score(actual, pred) for actual, pred in zip(actuals, predictions)]
# Print ROUGE scores
for i, score in enumerate(rouge_scores):
    print(f"Example {i+1}: ROUGE-1: {score['rouge1'].fmeasure:.4f}, ROUGE-2: {score['rouge2'].fmeasure:.4f}, ROUGE-L: {score['rougeL'].fmeasure:.4f}")
- BLEU Score:
- BLEU is commonly used for evaluating machine translation or text generation tasks.
from nltk.translate.bleu_score import sentence_bleu
# Compute BLEU score
bleu_scores = [sentence_bleu([actual.split()], pred.split()) for actual, pred in zip(actuals, predictions)]
print(f"BLEU Score: {sum(bleu_scores) / len(bleu_scores):.4f}")
Step 4: Visualize the Results (Optional)
Visualizing the performance of your model can give you a clearer understanding of its strengths and weaknesses.
import matplotlib.pyplot as plt
# Example: Plot ROUGE Scores for different examples
rouge_1_scores = [score['rouge1'].fmeasure for score in rouge_scores]
rouge_2_scores = [score['rouge2'].fmeasure for score in rouge_scores]
rouge_L_scores = [score['rougeL'].fmeasure for score in rouge_scores]
plt.plot(rouge_1_scores, label='ROUGE-1')
plt.plot(rouge_2_scores, label='ROUGE-2')
plt.plot(rouge_L_scores, label='ROUGE-L')
plt.legend()
plt.title("ROUGE Scores for Each Example")
plt.xlabel("Example Index")
plt.ylabel("ROUGE Score")
plt.show()
4. Real-Life Example: Evaluating PDF Summarization
Consider a scenario where you have a fine-tuned model that summarizes research papers (PDFs).
- Objective: Evaluate how well the model generates summaries by comparing them to human-provided summaries.
- Metrics: Use accuracy, ROUGE, and BLEU to evaluate the performance. ROUGE is well suited to summarization because it captures recall of important content words, while BLEU's n-gram precision gives a rough signal of fluency and phrasing.
Step 1: Scrape and Label PDF Data
Use PyPDF2 to extract text from PDFs and manually label a few examples with reference summaries.
import PyPDF2
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # extract_text() can return None for pages with no extractable text
            text += page.extract_text() or ""
    return text
pdf_text = extract_text_from_pdf("sample_paper.pdf")
print(pdf_text[:500]) # Print first 500 characters of extracted text
Step 2: Fine-Tune and Evaluate Model
Fine-tune the model with PDF data and evaluate the performance using the metrics described above.
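As a rough end-to-end sketch (the reference summary below is a placeholder you would write yourself), the extracted PDF text can be fed straight into the evaluation code from Section 3:
# Pair the extracted PDF text with a human-written reference summary (placeholder)
reference_summary = "Human-written summary of sample_paper.pdf"
eval_data = [{"input": pdf_text, "output": reference_summary}]
# Reuse generate_summary and the ROUGE scorer defined in Section 3
prediction = generate_summary(eval_data[0]["input"])
scores = scorer.score(eval_data[0]["output"], prediction)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.4f}, ROUGE-L: {scores['rougeL'].fmeasure:.4f}")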
5. Code Summary
from transformers import LlamaForCausalLM, LlamaTokenizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import matplotlib.pyplot as plt
# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("./fine_tuned_model")
tokenizer = LlamaTokenizer.from_pretrained("./fine_tuned_model")
# Example: Evaluation Data
eval_data = [{"input": "Research paper content 1", "output": "Summary of paper 1"}]
# Generate predictions (generate_summary is the helper defined in Step 2)
predictions = [generate_summary(d['input']) for d in eval_data]
actuals = [d['output'] for d in eval_data]
# Evaluate with Accuracy, Precision, Recall, F1-Score (precision/recall/F1 are meaningful only when outputs are class labels)
accuracy = accuracy_score(actuals, predictions)
precision = precision_score(actuals, predictions, average="macro")
recall = recall_score(actuals, predictions, average="macro")
f1 = f1_score(actuals, predictions, average="macro")
# ROUGE Scores
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_scores = [scorer.score(actual, pred) for actual, pred in zip(actuals, predictions)]
# BLEU Score
bleu_scores = [sentence_bleu([actual.split()], pred.split()) for actual, pred in zip(actuals, predictions)]
# Visualization of ROUGE Scores
plt.plot([score['rouge1'].fmeasure for score in rouge_scores], label='ROUGE-1')
plt.legend()
plt.show()
---
### **6. Summary**
- **Concepts Covered**: Metrics for evaluation, including accuracy, precision, recall, F1-score, ROUGE, BLEU, and loss functions.
- **Key Aspects**: Evaluation ensures that models generalize well to new data and do not overfit. Different metrics are suited for different types of tasks (summarization, classification).
- **Real-Life Example**: Evaluating a PDF summarization model using ROUGE, BLEU, and traditional metrics.
- **Implementation**: Code for calculating various metrics using Python and common libraries like `sklearn`, `rouge-score`, and `nltk`.
---
### **7. Homework/Practice**
1. Evaluate your fine-tuned model using the above metrics on a new test set of PDFs or web-scraped data.
2. Experiment with different evaluation strategies such as using multiple BLEU references or adjusting the length of summaries.
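As a starting point for exercise 2, sentence_bleu accepts several reference token lists for one candidate; here is a small sketch with made-up sentences:
from nltk.translate.bleu_score import sentence_bleu
# Two alternative human references for the same (hypothetical) generated summary
references = [
    "the model summarizes research papers accurately".split(),
    "the model produces accurate summaries of research papers".split(),
]
candidate = "the model summarizes research papers".split()
# n-gram counts in the candidate are clipped against the maximum count found in any reference
score = sentence_bleu(references, candidate)
print(f"Multi-reference BLEU: {score:.4f}")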