Lecture Notes:
1. Concepts
What is Fine-Tuning?
- Fine-tuning is the process of adjusting a pre-trained model to improve its performance for a specific task or dataset.
- Fine-tuning allows a model to better understand and generate responses based on domain-specific data, improving its accuracy and usefulness in real-world applications.
Why Fine-Tune PDF and Web Scraping Models?
- Models that are trained on general data may not understand the nuances or specific needs of tasks like summarizing academic papers or extracting specific data from web pages.
- Fine-tuning allows the model to specialize in these tasks by exposing it to relevant, labeled data.
Key Idea
- Fine-tuning involves updating the weights of a model after it has been pre-trained. This is achieved by training it on new data that aligns with the target task.
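To make the weight update concrete, here is a minimal sketch in plain Python (no deep-learning library). It starts from a stand-in "pre-trained" weight and nudges it toward new task data with gradient descent. The toy model `y = w * x`, the data points, and the function name `fine_tune_weight` are illustrative assumptions, not part of any real fine-tuning API.

```python
def fine_tune_weight(w, data, learning_rate, steps):
    """Run plain gradient descent on mean squared error for the toy model y = w * x."""
    for _ in range(steps):
        # Gradient of mean((w * x - y) ** 2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w


pretrained_w = 1.0                     # hypothetical "pre-trained" weight
task_data = [(1.0, 3.0), (2.0, 6.0)]   # new task data follows y = 3x

tuned_w = fine_tune_weight(pretrained_w, task_data, learning_rate=0.1, steps=50)
diverged_w = fine_tune_weight(pretrained_w, task_data, learning_rate=0.5, steps=20)
print(round(tuned_w, 3), diverged_w)
```

With the smaller learning rate the weight settles near 3, the value the new data implies. With the larger one each update overshoots and the weight moves further away, which is why the learning rate is usually kept small during fine-tuning.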
2. Key Aspects of Fine-Tuning
- Base Model Selection:
- Choose a model that already has useful general knowledge. Models like Llama are good starting points for fine-tuning.
- Dataset Preparation:
- Labeled Data: For fine-tuning, you need a labeled dataset. For example, if you want to fine-tune a model for summarizing research papers, you need a dataset of papers paired with their summaries.
- For PDFs: Label the data with clear instructions for the model to understand key points, summaries, or other types of content.
- For Web Scraping: You can label data for specific types of information such as titles, articles, or key facts extracted from scraped web pages.
- Training Process:
- The training process involves using small batches of data to modify the model’s weights.
- Learning Rate: A key parameter for fine-tuning that controls how much the weights change during training.
- Evaluation:
- After fine-tuning, evaluate the model to check if it performs well on new, unseen data.
- Transfer Learning:
- Fine-tuning is a form of transfer learning, where you apply knowledge from one domain (general model) to another (specific task).
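For the evaluation step, one simple way to score generated summaries against reference summaries is word overlap. The sketch below computes a unigram F1 score, a simplified ROUGE-1-style measure. The function name `unigram_f1` and the example sentences are assumptions made for illustration; a real evaluation would normally use an established metric library and a held-out set the model never saw during training.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Word-overlap F1 between two strings (a simplified, ROUGE-1-style score)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (model output, reference summary) pairs from unseen data
held_out = [
    ("the model summarizes papers", "the model summarizes research papers"),
    ("unrelated text", "the model summarizes research papers"),
]
scores = [unigram_f1(pred, ref) for pred, ref in held_out]
print(scores)
```

Averaging this score over the held-out pairs before and after fine-tuning gives a rough signal of whether the model improved on the target task.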
3. Implementation
Prerequisites:
- Python Libraries: torch, transformers, ollama
- Data Preparation: A dataset with labeled examples of the target task (summaries, extracted content).
Example: Fine-Tuning for PDF Summarization
Step 1: Create a Dataset for Fine-Tuning
First, prepare a small dataset of PDF summaries (input-output pairs).
# Example dataset for fine-tuning (PDF summaries)
data = [
{"input": "Text of research paper 1", "output": "Summary of paper 1"},
{"input": "Text of research paper 2", "output": "Summary of paper 2"},
# Add more labeled examples
]
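Because the model should later be checked on unseen data, it helps to set aside part of the labeled pairs before training. The following is a minimal sketch of a shuffled train/evaluation split using only the standard library. The helper name `split_dataset`, the 20% evaluation fraction, and the ten placeholder examples are assumptions for illustration.

```python
import random


def split_dataset(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and return (train, eval) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder examples in the same input/output format as the dataset above
examples = [
    {"input": f"Text of research paper {i}", "output": f"Summary of paper {i}"}
    for i in range(1, 11)
]
train_examples, eval_examples = split_dataset(examples)
print(len(train_examples), len(eval_examples))
```

The held-out list is only useful if it stays out of training, so it should be created once and kept fixed across experiments.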
Step 2: Define the Model and Tokenizer
For fine-tuning, you'll need to choose a pre-trained model. Let's assume we are working with Llama.
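from transformers import LlamaForCausalLM, LlamaTokenizer
# Load the pre-trained model and tokenizer.
# "llama" is a placeholder: replace it with the Hugging Face Hub ID or local path of the Llama checkpoint you have access to.
model = LlamaForCausalLM.from_pretrained("llama")
tokenizer = LlamaTokenizer.from_pretrained("llama")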
Step 3: Tokenize the Dataset
Convert the text data into tokens that can be fed into the model.
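import torch

# Llama tokenizers typically ship without a padding token, so reuse the end-of-sequence token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# A causal LM trains on one sequence per example: the paper text followed by its summary.
# max_length=512 is an example cap; raise it if your papers and model allow longer inputs.
texts = [d["input"] + "\n" + d["output"] + tokenizer.eos_token for d in data]
encodings = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Create dataset for PyTorch
class PDFSummaryDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in self.encodings.items()}
        # Labels must have the same shape as input_ids; padded positions get -100 so the loss ignores them.
        labels = item["input_ids"].clone()
        labels[item["attention_mask"] == 0] = -100
        item["labels"] = labels
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

# Create DataLoader for batching (Trainer builds its own batches; this loader is for a manual training loop)
train_dataset = PDFSummaryDataset(encodings)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)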
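One common way to build labels for a causal language model is to join the prompt and the target summary into a single sequence and mask the positions the model should not be graded on. The sketch below shows that idea with made-up token IDs and no tokenizer, so the shapes are easy to inspect. The function name `build_example`, the IDs, and `pad_id=0` are illustrative assumptions; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, summary_ids, max_length, pad_id=0):
    """Concatenate prompt and summary IDs; mask prompt and padding positions in the labels."""
    input_ids = (prompt_ids + summary_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + summary_ids)[:max_length]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }


# Hypothetical token IDs: three prompt tokens followed by two summary tokens
example = build_example(prompt_ids=[11, 12, 13], summary_ids=[21, 22], max_length=8)
for key, values in example.items():
    print(key, values)
```

All three lists have the same length, and only the summary tokens carry real labels, so the loss measures how well the model writes the summary rather than how well it reproduces the paper text.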
Step 4: Fine-Tune the Model
Now, you can start the fine-tuning process using the dataset.
from transformers import Trainer, TrainingArguments
# Define training arguments
training_args = TrainingArguments(
    output_dir="./model_output",     # output directory
    evaluation_strategy="steps",     # evaluate periodically during training (newer transformers releases name this eval_strategy)
    learning_rate=5e-5,              # learning rate
    per_device_train_batch_size=2,   # batch size
    num_train_epochs=3,              # number of epochs
    weight_decay=0.01                # weight decay to reduce overfitting
)
# Define the Trainer
trainer = Trainer(
    model=model,                     # the pre-trained model
    args=training_args,              # training arguments
    train_dataset=train_dataset,     # training dataset
    eval_dataset=train_dataset       # reused here only for the demo; use a held-out set in practice
)
# Fine-tune the model
trainer.train()
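It can help to estimate how many weight updates a run will perform before starting it. The following is a rough sketch of that arithmetic for a single device with no gradient accumulation; the function name `total_training_steps` and the dataset sizes are assumptions, and the count reported by Trainer itself can differ in other setups.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Approximate update count: batches per epoch (last batch may be partial) times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Settings from the training arguments above, with hypothetical dataset sizes
print(total_training_steps(num_examples=1000, batch_size=2, num_epochs=3))
print(total_training_steps(num_examples=5, batch_size=2, num_epochs=3))
```

With 1,000 examples, batch size 2, and 3 epochs this comes to 1,500 updates; with only 5 examples it is 9, which is a reminder that a handful of examples gives the model very few chances to adjust.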
Step 5: Save the Fine-Tuned Model
After training, save your fine-tuned model.
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
4. Real-Life Example
Scenario: Fine-Tuning for Extracting Key Information from Web Scraped Articles
- Objective: Fine-tune a model to extract specific information (e.g., author name, publication date, and article summary) from web pages scraped using BeautifulSoup.
Step 1: Scrape Data from the Web
Use the requests and BeautifulSoup libraries to scrape articles from a webpage.
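from bs4 import BeautifulSoup
import requests

def scrape_web_article(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    soup = BeautifulSoup(response.text, "html.parser")
    title_tag = soup.find("h1")
    author_tag = soup.find("span", class_="author")  # Example class; adjust to the target site
    # find() returns None when an element is missing, so guard before calling get_text()
    title = title_tag.get_text(strip=True) if title_tag else None
    author = author_tag.get_text(strip=True) if author_tag else None
    return {"title": title, "author": author}

# Example: scrape an article (placeholder URL)
article = scrape_web_article("https://example.com/article")
print(article)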
Step 2: Label the Data
Label the scraped content with the correct output (summary, author name, etc.).
web_data = [
{"input": "Text from scraped article 1", "output": "Summary and key points"},
# Add more data
]
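When the target is several fields rather than free text, one option is to store the expected output as a JSON string so that every label follows the same structure. The sketch below shows this under stated assumptions: the helper name `make_labeled_example`, the field names, and the sample values (including the author "Jane Doe") are all hypothetical.

```python
import json


def make_labeled_example(article_text, fields):
    """Pair raw article text with a JSON string of the fields the model should extract."""
    target = json.dumps(fields, sort_keys=True, ensure_ascii=False)
    return {"input": article_text, "output": target}


# Hypothetical hand-labeled values for one scraped article
example = make_labeled_example(
    "Text from scraped article 1",
    {"author": "Jane Doe", "date": "2024-01-15", "summary": "Summary and key points"},
)
print(example["output"])
```

A fixed key order and a single format make the labels consistent across examples, and the model's output can later be checked by parsing it with `json.loads`.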
Step 3: Fine-Tune the Model
Follow the same fine-tuning steps as in the PDF case, using the web-scraped content.
5. Code Summary
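from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer, TrainingArguments
import torch

# Load and prepare the model and tokenizer ("llama" is a placeholder for your checkpoint ID or path)
model = LlamaForCausalLM.from_pretrained("llama")
tokenizer = LlamaTokenizer.from_pretrained("llama")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers typically have no padding token

# Prepare dataset (input-output pairs)
data = [
    {"input": "Text of research paper 1", "output": "Summary of paper 1"},
    # Add more labeled examples
]
texts = [d["input"] + "\n" + d["output"] + tokenizer.eos_token for d in data]
encodings = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

class PDFSummaryDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in self.encodings.items()}
        labels = item["input_ids"].clone()
        labels[item["attention_mask"] == 0] = -100  # ignore padding in the loss
        item["labels"] = labels
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = PDFSummaryDataset(encodings)

# Fine-tune the model
training_args = TrainingArguments(
    output_dir="./model_output", num_train_epochs=3, per_device_train_batch_size=2, learning_rate=5e-5
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")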
6. Summary
- Concepts Covered: Fine-tuning, transfer learning, dataset preparation, training, and evaluation.
- Key Aspects: Fine-tuning requires a labeled dataset, careful model selection, and tuning of hyperparameters.
- Real-Life Example: Fine-tuning a model for summarizing research papers (PDFs) and extracting key details from web-scraped content.
- Implementation: Steps involved creating datasets, tokenizing them, fine-tuning the model, and evaluating it.
7. Homework/Practice
- Fine-tune the model you created in the previous lesson to summarize a new set of PDFs.
- Use web-scraped content and fine-tune the model for extracting key details (e.g., title, author, summary) from articles.
- Experiment with different learning rates and batch sizes to see how they affect model performance.
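For the last exercise, a small grid of settings keeps the experiments organized. The sketch below only builds the list of configurations and does not train anything. The specific values and the output directory naming are assumptions; the dictionary keys mirror the `TrainingArguments` parameter names used earlier, so each dictionary could be passed as keyword arguments.

```python
from itertools import product

learning_rates = [1e-5, 5e-5, 1e-4]   # example values to compare
batch_sizes = [2, 4]

# One configuration per (learning rate, batch size) combination
configs = [
    {
        "learning_rate": lr,
        "per_device_train_batch_size": bs,
        "output_dir": f"./model_output/lr{lr}_bs{bs}",
    }
    for lr, bs in product(learning_rates, batch_sizes)
]

for config in configs:
    print(config)
```

Giving each run its own output directory keeps checkpoints from overwriting each other, which makes it easier to compare evaluation scores across settings afterwards.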
These lecture notes provide a step-by-step introduction to fine-tuning models for custom tasks like PDF summarization and web scraping, with practical examples in Python.